When two founders put their names on a paper.
One
On the same day Musk released Grok 3, trained on 200,000 GPUs, two papers appeared in the tech community that took a route "contrary" to Musk's approach.
Among the authors of these two papers, there are familiar names:
Liang Wenfeng, Yang Zhilin.
On February 18, DeepSeek and Dark Side of the Moon released their latest papers simultaneously, and the topics "collided" head-on: both take aim at the Transformer architecture's core attention mechanism, trying to make it more efficient at processing longer contexts. Interestingly, the star technical founders of both companies appear by name as authors in their respective papers and technical reports.
The paper released by DeepSeek is titled: "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention."
According to the paper, the new architecture NSA (Native Sparse Attention) achieves the same or higher accuracy in benchmark tests compared to the full attention mechanism; it can increase speed by up to 11.6 times when processing 64k token sequences, is more efficient in training, and requires less computational power; it performs excellently in tasks involving ultra-long contexts (such as book summarization, code generation, and reasoning tasks).
Compared with the algorithmic innovations that have excited people before, DeepSeek this time reached straight for the core of the attention mechanism itself.
The Transformer is the foundation of today's large model boom, but its core algorithm, the attention mechanism, has an inherent problem. Using reading as an analogy: the traditional "full attention" mechanism reads every word in the text in order to understand and generate, comparing each word with every other word. The cost therefore grows quadratically as the text gets longer, causing the system to lag or even break down.
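As a rough illustration of why that cost grows quadratically, here is a minimal full-attention sketch in plain NumPy; the function and variable names are illustrative and not taken from either paper.

```python
import numpy as np

def full_attention(Q, K, V):
    """Dense (full) attention: every token is compared with every other token.

    Q, K, V have shape (n, d). The score matrix below has shape (n, n),
    so compute and memory grow quadratically with sequence length n,
    which is exactly the bottleneck NSA and MoBA try to remove.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d) output

# Doubling n quadruples the size of `scores`: a 64k-token context already
# means a 64k x 64k score matrix per attention head.
```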
The academic community has been offering all kinds of remedies, and NSA, through engineering optimization and experiments in real environments, assembles three components into an architecture solution that works at training time:
1) Semantic Compression: instead of attending to every single word, it groups words into "blocks," reducing the sequence length to 1/k while retaining the global semantics, and adds positional encoding to limit information loss, lowering the computational complexity from O(n²) to O(n²/k).
2) Dynamic Selection: the model uses a scoring mechanism to pick out the most important words in the text and performs fine-grained computation only on them. This importance-sampling strategy retains 98% of the fine-grained information while cutting the computational load by 75%.
3) Sliding Window: while the first two components summarize and highlight, the sliding window attends to the most recent context to keep the output coherent, and through hardware-level memory reuse it can cut memory access frequency by 40%. A simplified sketch combining the three branches follows below.
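To make the division of labor among the three branches concrete, here is a deliberately simplified sketch of this kind of three-branch sparse attention in plain NumPy. It is a conceptual illustration, not DeepSeek's NSA implementation: the mean pooling, block scoring, and simple averaging stand in for the learned, hardware-aligned components described in the paper, and every name is invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    """Attention of a single query vector q over keys K and values V."""
    w = softmax(q @ K.T / np.sqrt(q.shape[-1]))
    return w @ V

def sparse_attention_step(q, K, V, block_size=64, top_k=2, window=128):
    """One decoding step of a three-branch sparse attention (conceptual).

    Branch 1: compression     - attend over mean-pooled block summaries.
    Branch 2: selection       - attend at full resolution over the top-k
                                highest-scoring blocks only.
    Branch 3: sliding window  - attend over the most recent `window` tokens.
    The branch outputs are averaged here; a real model would learn gating
    weights to mix them.
    """
    n = K.shape[0]
    n_blocks = n // block_size

    # Branch 1: block-level summaries (mean pooling as a stand-in for a
    # learned compressor).
    K_blocks = K[: n_blocks * block_size].reshape(n_blocks, block_size, -1).mean(axis=1)
    V_blocks = V[: n_blocks * block_size].reshape(n_blocks, block_size, -1).mean(axis=1)
    out_compress = attend(q, K_blocks, V_blocks)

    # Branch 2: score each block by the query's affinity to its summary,
    # then run fine-grained attention inside the top-k blocks only.
    block_scores = K_blocks @ q
    top = np.argsort(block_scores)[-top_k:]
    idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size) for b in top])
    out_select = attend(q, K[idx], V[idx])

    # Branch 3: local sliding window over the most recent tokens.
    out_window = attend(q, K[-window:], V[-window:])

    return (out_compress + out_select + out_window) / 3.0

# Toy usage: 1,024 cached tokens, 64-dim heads.
rng = np.random.default_rng(0)
K = rng.standard_normal((1024, 64))
V = rng.standard_normal((1024, 64))
q = rng.standard_normal(64)
print(sparse_attention_step(q, K, V).shape)  # (64,)
```

Even in this toy form the shape of the saving is visible: the query touches a handful of block summaries, a few selected blocks, and a short local window instead of all 1,024 cached tokens.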
None of these ideas is DeepSeek's own invention, but the work can be pictured as ASML-style: the technological elements already existed, scattered across the field, yet no one had engineered them together into a scalable solution, a new algorithmic architecture. Now someone has built that "lithography machine" through sheer engineering capability, and others can use it to train models in real industrial environments.
The paper Dark Side of the Moon released the same day proposes an architecture that is highly consistent in its core ideas: MoBA ("MoBA: Mixture of Block Attention for Long-Context LLMs").
As the name makes clear, it also turns "words" into blocks. After this chunking, MoBA uses a gating network as an "intelligent selector" that picks, for each query, the top-k blocks most relevant to it, and attention is computed only over those selected blocks. In its actual implementation, MoBA also combines optimizations from FlashAttention (which makes attention computation more efficient) and from MoE (Mixture of Experts).
Compared with NSA, MoBA puts more emphasis on flexibility. Rather than departing entirely from the currently mainstream full attention mechanism, it is designed as a switchable method that lets a model toggle between full attention and sparse attention, giving existing full-attention models more room to adapt; a simplified sketch of this gating-plus-switch idea follows below.
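For comparison, here is a similarly simplified sketch of per-query block gating with a switch back to full attention, in the spirit of what the MoBA paper describes. It is an illustration built on assumptions rather than the released MoBA code: the gate here is just a dot product against mean-pooled block keys, the real implementation adds causality handling and FlashAttention-style kernels, and all names are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moba_like_attention(Q, K, V, block_size=64, top_k=2, full_attention=False):
    """Block attention with a per-query gate (conceptual sketch).

    Each query scores every block summary (mean-pooled keys), keeps only
    its top-k blocks, and attends over the tokens in those blocks.
    Setting full_attention=True falls back to ordinary dense attention,
    mirroring the switchable design described above.
    """
    n, d = K.shape
    if full_attention:
        return softmax(Q @ K.T / np.sqrt(d)) @ V

    n_blocks = n // block_size
    K_sum = K[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)

    outputs = np.zeros_like(Q)
    gate_scores = Q @ K_sum.T                      # (num_queries, n_blocks)
    for i, q in enumerate(Q):
        top = np.argsort(gate_scores[i])[-top_k:]  # this query's routed blocks
        idx = np.concatenate(
            [np.arange(b * block_size, (b + 1) * block_size) for b in top]
        )
        w = softmax(q @ K[idx].T / np.sqrt(d))
        outputs[i] = w @ V[idx]
    return outputs

# Toy usage: 8 queries routed over 1,024 cached tokens.
rng = np.random.default_rng(1)
Q = rng.standard_normal((8, 64))
K = rng.standard_normal((1024, 64))
V = rng.standard_normal((1024, 64))
print(moba_like_attention(Q, K, V).shape)  # (8, 64)
```

The top-k routing over blocks is what gives the MoE analogy its force: each query is dispatched to a small set of "expert" blocks rather than to the whole sequence, and flipping full_attention=True recovers ordinary dense attention.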
According to the paper, MoBA's computational complexity shows significant advantages as context length increases. In tests with 1M tokens, MoBA was 6.5 times faster than full attention; at 10M tokens, it was 16 times faster. Moreover, it has already been used in Kimi's products to handle the ultra-long context processing needs of everyday users.
One big reason Yang Zhilin drew attention when he founded Dark Side of the Moon was the influence and citation counts of his papers, yet before the K1.5 paper, his most recent research paper dated to January 2024. Liang Wenfeng has appeared as an author on DeepSeek's most important model technical reports, but the author lists of those reports read almost like DeepSeek's employee roster, with nearly everyone included; the NSA paper, by contrast, has only a handful of authors. That contrast hints at how much these two works matter to the two founders, and how much they matter for understanding the two companies' technical directions.
Another detail underscores this: netizens noticed that the arXiv submission record for the NSA paper shows it was submitted on February 16, and the submitter was Liang Wenfeng himself.
Two
This is not the first time Dark Side of the Moon and DeepSeek have "collided." When R1 was released, Kimi made the rare move of publishing the K1.5 technical report; until then the company had not made a priority of showing its technical thinking to the outside world. Both papers targeted RL-driven reasoning models. In fact, a close reading of the two technical reports shows that in the K1.5 paper Dark Side of the Moon shared how to train a reasoning model in greater detail, even exceeding the R1 paper in information density and specifics. The wave DeepSeek set off afterward, however, drowned out much of the discussion of the paper itself.
One corroborating detail: OpenAI recently published a rare paper explaining the reasoning capabilities of its o-series models, and it mentions both DeepSeek R1 and Kimi K1.5. "DeepSeek-R1 and Kimi K1.5 show through independent research that using the Chain of Thought (CoT) method can significantly enhance model performance in mathematical problem-solving and programming challenges." In other words, these are the two reasoning models OpenAI chose to compare against.
"I feel that the most magical aspect of this large model architecture is that it seems to point out the path forward by itself, leading different people to arrive at similar directions from different angles."
Professor Zhang Mingxing from Tsinghua University, who participated in the core research of MoBA, shared on Zhihu.
He also provided an interesting comparison.
"DeepSeek R1 and Kimi K1.5 both point to ORM-based RL, but R1 starts from zero, being more 'pure' or 'less structured,' and was released earlier, synchronously open-sourcing the model.
Kimi MoBA and DeepSeek NSA again both point to learnable sparse attention that can be backpropagated; this time MoBA is more 'less structured,' released earlier, and synchronously open-sourced the code."
The continuous "collisions" between these two companies help people better understand the technological development of reinforcement learning and the evolutionary direction of more efficient and longer text attention mechanisms.
"Looking at R1 and K1.5 together can better learn how to train a Reasoning Model, while looking at MoBA and NSA together can help us understand from different perspectives our belief that sparsity should exist in Attention and can be learned through end-to-end training," Zhang Mingxing wrote.
Three
After MoBA's release, Xu Xinran of Dark Side of the Moon also said on social media that this was a project a year and a half in the making, and that developers can now use it out of the box.
Choosing this moment to open source, however, means being discussed in DeepSeek's "shadow." Interestingly, as companies rush to integrate DeepSeek and open-source their own models, the outside world seems to think of Dark Side of the Moon first, repeatedly asking whether Kimi will integrate DeepSeek or open-source its model, which leaves Dark Side of the Moon and Doubao looking like the last two "holdouts."
DeepSeek's impact on Dark Side of the Moon now looks more sustained than its impact on other players, posing challenges that run from technical route to user competition. On one hand, it proves that even in product competition, foundation-model capability still matters most; on the other hand, an increasingly clear chain reaction is that the combination of Tencent's WeChat search and Yuanbao is riding the momentum of DeepSeek R1 to make up for the marketing push it previously missed, with Kimi and Doubao as the ultimate targets.
Dark Side of the Moon's response strategy has therefore become worth watching, and open-sourcing is a necessary part of it. Its choice appears to be to genuinely match DeepSeek's brand of open source. Most of the open-source projects that have appeared in DeepSeek's wake look like reactive moves, still following the open-source playbook of the earlier Llama era, whereas DeepSeek's open source is different from before: it is no longer a defensive, disruptive strategy aimed at closed-source competitors, but a competitive strategy that brings clear benefits of its own.
The recent word inside Dark Side of the Moon about "setting SOTA (state-of-the-art) results as the goal" looks like the strategy closest to this new open-source model: build the strongest models and architectures, and through them gain the application-side influence the company has long wanted.
According to the two companies' papers, MoBA is already used in Dark Side of the Moon's models and products, and NSA likewise, which even gives the outside world clearer expectations for DeepSeek's upcoming models. The next thing to watch, then, is whether Dark Side of the Moon and DeepSeek will collide again with next-generation models trained on MoBA and NSA, and whether they will do so in open-source fashion. That may also be exactly the moment Dark Side of the Moon is waiting for.