Complete Review: How did Manus come into being?

"Agent may be an issue of 'alignment' rather than a problem with the foundational model's capabilities"

Author: Wan Chen

The entrepreneurial story that gave me the most food for thought last year came from Dify founder Zhang Luyu.

The first time I met him was at the 2023 "Xixi Forum" event, where among a crowd of dazzling names, Zhang Luyu's name was not particularly eye-catching. When I saw him again in 2024, Dify had already become a different story—an entrepreneur without a glamorous background managed to create one of the world's most successful AI open-source products amidst skepticism about the business model.

In the year since, stories such as Dify unexpectedly gaining popularity in the "conservative and hard-to-crack" Japanese market have further deepened my understanding of "entrepreneurship." There are many surprises, and luck is essential, but ultimately one must be able to carve out a path amid continuous change and setbacks.

Now, a similar story is unfolding with another highly regarded entrepreneur—Manus.im's Xiao Hong and his team.

Four months ago, Xiao Hong shared something that puzzled him: "The team excels at going from 0 to 1 and is good at seizing opportunities, but once we try to go from 1 to N, things don't go as well."

Most of his past entrepreneurial projects achieved relatively stable and considerable revenue, and his previous company was successfully acquired. In 2023, his new company "Butterfly Effect" launched a browser plugin, Monica.im, which avoided head-on competition amid the AI narrative of the "hundred-model battle" and became one of the fastest-growing AI applications, with one of the best user experiences. He seemed to be an entrepreneur for whom everything went smoothly, and he had achieved all this at just 32 years old.

In reality, however, he did not feel much satisfaction. In Xiao Hong's view, being a so-called "serial entrepreneur" and chasing the thrill of repeatedly going from 0 to 1 is like living in a besieged city: seizing opportunities from 0 to 1 is exhilarating, but it comes with the worry of having to start all over again.

In 2024, industry insiders believed that AI assistants with memory functions, such as Monica.im, would face pressure from strong competitors like Doubao and would not be as easy to operate as in 2023. Monica.im had a great journey from 0 to 1, but it might not be able to push through from 1 to N.

His puzzlement stemmed from the sense that "the team really needs to tackle more challenging and higher-ceiling tasks" and to explore ways of bridging the gap from 1 to N.

Earlier, many observers following Monica.im assumed that this "more difficult, higher-ceiling task" referred to the long-rumored but unreleased AI browser.

Looking back now, it was indeed a misjudgment.

The more challenging exploration turned out to be this: abandoning an AI browser that had reached the release stage, looking for the next "ChatGPT moment" among AI products, settling on the goal of a general-purpose agent, and launching the newly released Manus.im.

What level of innovation Manus represents and what it can achieve in the future have already become hot topics of discussion. What is worth noting, however, is the direction found amid "setbacks" and the process of finding it. Manus.im may not take this team from 1 to N, and it may not replicate the momentum of Monica.im, but as the company's name, "Butterfly Effect," suggests, many small actions and decisions can inadvertently have a profound impact on the future. As the phrase "connect the dots" has it, the path to tomorrow is hidden in today's experiences.

01 The Unique Product Experience of Manus Comes from the Lessons of Building an "AI Browser"

Since the latter half of last year, it has been an open secret in the industry that the "Butterfly Effect" team was working on an AI browser. The product that officially debuted, however, is Manus, and it has drawn overwhelming attention.

If you have personally experienced Manus or watched a demonstration video, you will notice a significant difference compared to chatbots or some agent-like applications: Manus can execute tasks asynchronously and in parallel.

When you open applications like Doubao or Kimi, or similar Computer Use applications, and ask a question, you have to wait for the response to finish. If you try to talk to the application while it is responding or performing a task, the previous response or task is interrupted, so the conversation can only proceed as an A-B-A-B relay.

In Manus.im, however, although it looks like a chatbot product, you can give it, say, 20 questions and have it work on all of them at once. Meanwhile, you can do anything else on your computer (watch videos, write documents, play games, and so on) without interrupting its work. When the tasks are completed, or if problems come up during execution, Manus can notify you. If you notice its thinking drifting off course during a task, you can add prompts in the dialogue box at any time, and it will continue thinking and executing with the new context.

The experience is asynchronous and can run in parallel, much like having a real team of interns helping you with your work.
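The asynchronous, parallel experience described above can be illustrated with a minimal sketch. It is not Manus's implementation: `run_task` and `submit_all` are hypothetical names, and the short sleep stands in for work done remotely. The sketch only shows the pattern of submitting many tasks at once and being notified as each one finishes.

```python
import asyncio


async def run_task(task_id: int, prompt: str) -> str:
    """Hypothetical stand-in for one long-running agent task."""
    await asyncio.sleep(0.01)  # simulates work happening elsewhere
    return f"task {task_id} done: {prompt}"


async def submit_all(prompts: list[str]) -> list[str]:
    """Start every task immediately, then collect results as they finish."""
    tasks = [asyncio.create_task(run_task(i, p)) for i, p in enumerate(prompts)]
    results = []
    for finished in asyncio.as_completed(tasks):
        # Each result arrives when its own task completes,
        # independent of the order in which tasks were submitted.
        results.append(await finished)
    return results


if __name__ == "__main__":
    for line in asyncio.run(submit_all([f"question {n}" for n in range(20)])):
        print(line)
```

Because no task waits on another, the caller is free to do other things between submission and notification, which is the contrast with the A-B-A-B relay of a conventional chatbot.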

In fact, Manus's asynchronous product architecture stems from lessons the team learned on its previous, unreleased product, the AI browser. Those lessons are also why the team, after investing significant effort, decided to stop developing the browser in October last year.

The Browser Company announced on October 25, 2024, that it would cease new feature development for the Arc browser and shift resources to a new browser, Dia, aimed at creating a simpler and more user-friendly AI browser. | Source: Arc Official Website

"In the AI browser, the AI keeps interrupting the user." Because the browser is designed for a single user, you cannot use it yourself while the AI is using it; once the AI starts working, you can only watch, and it is hard to join in. Watching the AI take over your mouse and computer, you not only hesitate to take control back but also fear that accidentally touching the keyboard or mouse could crash its entire process and force a restart.

This led the team to make two judgments:

  1. Directly using the computer for Computer Use is not feasible in the short term.

  2. AI should use a browser, but not within your browser; it should have its own browser, preferably cloud-based, and ultimately provide results back to you.

In an interview with Tencent Technology's Zhang Xiaojun, Xiao Hong mentioned that when the team summarized the product forms from Jasper to ChatGPT to Monica to Cursor to Devin, they found that the "human programmer" Devin fits this asynchronous experience architecture very well.

Unlike Windsurf, it does not ask you to confirm whether a library should be installed on your computer, nor does it run a command-line operation and then ask you to type yes or no because the command might genuinely mess up your machine or cause conflicts. Requiring you to type "yes" before moving to the next step also lets the tool shift the blame onto you.

Therefore, in the Manus team's view, "the chatbot should have a computer in the cloud, executing the code it writes and querying things through the browser on that computer. Since it's a virtual server, it doesn't matter if it breaks; you can just get another one. It can even release that server after the current task is completed."
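The "disposable computer" idea in this quote can be sketched locally. This is only an analogy under stated assumptions: a temporary directory plus a child process stands in for a cloud virtual machine, it provides disposability rather than real isolation, and `run_in_disposable_workspace` is a hypothetical name, not part of any Manus API.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_disposable_workspace(code: str, timeout: float = 10.0) -> str:
    """Run model-written code in a throwaway directory, then delete it.

    Assumption: a temp directory is a local stand-in for a cloud VM.
    It is NOT a security sandbox.
    """
    with tempfile.TemporaryDirectory() as workspace:
        script = Path(workspace) / "task.py"
        script.write_text(code, encoding="utf-8")
        completed = subprocess.run(
            [sys.executable, str(script)],
            cwd=workspace,          # files the code creates land in the workspace
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    # The workspace is gone at this point, like releasing a server
    # once the current task is finished.
    return completed.stdout


if __name__ == "__main__":
    print(run_in_disposable_workspace("print(2 + 3)"))
```

If the code breaks something inside the workspace, nothing of the user's is affected; the next task simply gets a fresh one.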

Notably, whereas Devin chose a vertical field and an audience of hardcore engineers, the Manus team opted for a general-purpose, consumer-grade AI assistant with both Web and App versions. It can call tools and complete various tasks in work and life based on instructions, and in the future it is meant to deliver task results at a price consumers can afford.

02 Less Structure, More Intelligence

With a clear idea and goal, the next step is to realize this concept. How did Manus achieve this?

According to its product partner Zhang Tao, this requires equipping the large model with a computer, as well as granting it system permissions (access to code repositories, professional data query websites, and other private APIs), and providing some training.

This way, the AI can use this computer to open a browser, take actions that invoke tools, observe the effect of its actions on the real world from the tools' feedback, think about the next step, act again, and observe… This is how the AI completes tasks through exploration and research. Over time, Manus will increasingly understand your requirements under your "training," and in the future, even if you do not clearly define your needs, it can infer your intent from the knowledge accumulated across individual tasks.
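The act, observe, think cycle described above can be written as a short loop. This is a generic sketch, not Manus's code: `agent_loop`, the `add_one` tool, and the scripted `count_up_policy` (which stands in for a model deciding the next step) are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = Tuple[str, str]        # (tool name, argument)
Record = Tuple[str, str, str]   # (tool name, argument, observation)


def agent_loop(
    goal: str,
    tools: Dict[str, Callable[[str], str]],
    choose_action: Callable[[str, List[Record]], Action],
    max_steps: int = 10,
) -> Tuple[Optional[str], List[Record]]:
    """Decide on an action, run the tool, observe its feedback, repeat."""
    history: List[Record] = []
    observation = goal
    for _ in range(max_steps):
        tool_name, argument = choose_action(observation, history)  # think
        if tool_name == "finish":
            return argument, history
        observation = tools[tool_name](argument)                   # act and observe
        history.append((tool_name, argument, observation))
    return None, history  # step budget exhausted without an answer


# Toy stand-ins: one "tool" and a scripted policy in place of a model.
def count_up_policy(observation: str, history: List[Record]) -> Action:
    if int(observation) >= 3:
        return ("finish", observation)
    return ("add_one", observation)


answer, steps = agent_loop("0", {"add_one": lambda s: str(int(s) + 1)}, count_up_policy)
print(answer, len(steps))
```

The loop itself imposes very little structure: which tool to call, and when to stop, is left entirely to the policy.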

Huawei genius Li Bojie, founder of Logenic AI, believes that Manus has a unique strength that sets it apart from other products: solving problems in a geek programmer way. | Image Source: WeChat Screenshot

The philosophy of Manus's product gradually became clear during the team's product practice: Less Structure, More Intelligence.

This also led to a series of "A-Ha, Wait!" moments for the Manus team. For example, this was a scene that occurred in the team this January:

They asked Manus to try a question from the GAIA test set: "In a YouTube video in the style of National Geographic, various penguins move in and out of the frame. How many different types of penguins appear at the same time in the frame with the most penguins?"

Then, something magical happened.

Manus first opened the video link, and its first action was to "Press K," then it took screenshots one by one to record which frame had which type of penguin, ultimately concluding that the frame with the most penguins had 3 types. Manus then needed to check back, and its next action was to "Press 3"… Finally, after checking, it provided the answer: 3.

As the people who built Manus, the team might be expected to understand its capability boundaries well, but in reality "surprises always happen." The surprise was not only that Manus got the question right, but also that even a person who had used computers and YouTube for years might not know what the "K" and "3" keys do.

Somewhat dazed by the scene in front of them, the team followed Manus through the process. On YouTube, the "K" key is the play/pause shortcut, which let Manus pause and take screenshots to record which frame showed which type of penguin. "3" is also a shortcut: the number keys 0 through 9 jump to 0% through 90% of the video, so 3 jumps to the 30% mark, which let Manus return to that point in the video and tell humans how many types of penguins are in that frame.
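The arithmetic behind the number-key shortcut is simple enough to write down. The function name and the 200-second duration below are illustrative assumptions; the only fact used is that digit keys 0 through 9 map to 0% through 90% of the video's length.

```python
def digit_key_to_seconds(digit: int, duration_seconds: float) -> float:
    """Return the playback position a YouTube digit key jumps to.

    Digit d seeks to d * 10% of the total duration.
    """
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    return duration_seconds * digit / 10


# For a hypothetical 200-second video, pressing "3" lands at the 60-second mark.
print(digit_key_to_seconds(3, 200.0))
```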

"This process is different from traditional chatbots. First, it can see the YouTube visuals instead of just reading subtitles. Second, we even discovered it using YouTube shortcuts, which was very surprising; it answered the question," Xiao Hong mentioned in a previous interview with Tencent Technology.

Suddenly, it became clear that Manus is not only better at programming than humans, but even in the Web and App environments that people use daily, Manus's knowledge far exceeds expectations. As an omniscient AI, it can understand all methods and means within any tool and then choose the optimal approach.

This made the team feel once again the essence of "Less Structure, More Intelligence"—minimizing human restrictions on AI, allowing AI to function through its own evolution rather than teaching it how to do things.

At the very bottom of the Manus official website, the most important discovery behind Manus is quietly presented: "Less Structure, More Intelligence." | Screenshot Source: Manus

On the day Manus launched, Peak, co-founder and chief scientist of "Butterfly Effect," explained and expanded on this most important first principle behind the product:

When your data is of high quality, your model is intelligent enough, your architecture is flexible, and your engineering is solid, then concepts like Computer Use, Deep Research, and Coding Agent transform from product features into naturally emergent capabilities.

Returning to first principles also gives us a new perspective on product forms:

  • An AI browser is not about adding AI to a browser but creating a browser for AI to use;
  • AI search is not about recalling from an index and summarizing but allowing AI to access information with user permissions;
  • Operating a GUI is not about seizing control of the user's device but giving AI its own virtual machine;
  • Writing code is not the ultimate goal but a universal medium for solving various problems;
  • The challenge of generating websites is not about building frameworks but ensuring the content is meaningful;
  • Attention is not all you need; liberating user attention can redefine DAU.

Through repeated discoveries and practices of "Less Structure, More Intelligence," Manus has achieved results beyond expectations, including a pass@1 score in the GAIA benchmark that exceeds OpenAI's Deep Research performance under cons@64. Meanwhile, in internal testing, Manus was also able to directly cover 76% of the scenarios of dedicated agent products in Y Combinator W25.
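The article does not define the two metrics it compares. Under the usual reading, pass@1 scores a single attempt per task, while cons@64 samples 64 attempts per task and scores the majority (consensus) answer. The sketch below follows that reading; the function names and toy data are illustrative and are not GAIA's official scoring code.

```python
from collections import Counter
from typing import List, Sequence


def pass_at_1(first_answers: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of tasks answered correctly with one attempt each."""
    correct = sum(answer == truth for answer, truth in zip(first_answers, gold))
    return correct / len(gold)


def cons_at_k(samples_per_task: Sequence[List[str]], gold: Sequence[str]) -> float:
    """Fraction of tasks where the most common of k sampled answers is correct."""
    correct = 0
    for samples, truth in zip(samples_per_task, gold):
        majority_answer, _ = Counter(samples).most_common(1)[0]
        correct += majority_answer == truth
    return correct / len(gold)


# Toy data with k = 3 samples per task instead of 64.
gold = ["3", "7"]
print(pass_at_1(["3", "8"], gold))
print(cons_at_k([["3", "3", "4"], ["7", "8", "8"]], gold))
```

Consensus voting gives a system many tries per question, so a single-attempt score that beats it is the stronger claim.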

03 "Agent may be an issue of 'alignment' rather than a problem with the foundational model's capabilities"

Now, the value of these insights is sparking discussions on a larger scale:

Clement Delangue, founder and CEO of Hugging Face, suggested on the X platform that Peak's discovery is worth pondering: the capabilities of agents are not limited by the foundational model but are an alignment issue, similar to the difference between GPT-3 and InstructGPT (ChatGPT). Some open-source foundational models have been simply trained to "answer all questions in one round, regardless of complexity," but this is a requirement in chatbot scenarios. With some post-training on the agent's path, significant differences can be achieved. | Screenshot Source: X

Manus did not introduce MCP (Model Context Protocol) but allowed AI to write code to call APIs, enabling it to handle various long-tail tasks. | Screenshot Source: X

In the discussions about Manus over the past few days, one of the most frequently asked questions has been: "Is a general-purpose AI Agent feasible, and what are its boundaries?"

In Peak's view, because human interaction with the world is quite standard—having eyes, hands, and ears—if the action space is well-defined, an agent should be able to be embedded into a process that would originally be performed by a human.

Since humans can use various tools to complete deeply vertical operations, if an agent possesses sufficiently good knowledge, has undergone appropriate training, and has a good interface for interacting with the world, it should be able to work like a human, even allowing this agent to use a specific SaaS product. For example, a case presented on the Manus.im website about finding a house essentially involves letting AI work with a SaaS product specialized in real estate.

He believes that what should be clearly defined is the boundary of the tools used by the agent, not the group of people it serves. Manus is not simulating a person doing specific tasks, nor is it a role-based intelligent agent divided by roles like R&D or product manager; rather, it simulates a capable person, mimicking how an intern works.

Manus's multi-agent system refers to the separation of planning and execution.
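The separation of planning and execution can be sketched as two independent components. Everything here is hypothetical: in a real system each function would call a model, whereas this toy `plan` returns a fixed list of steps and `execute` merely transforms the step text, so only the division of responsibilities is shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    description: str
    result: Optional[str] = None


def plan(goal: str) -> List[Step]:
    """Planner: breaks a goal into ordered steps without doing any of them."""
    return [
        Step(f"research: {goal}"),
        Step(f"draft: {goal}"),
        Step(f"review: {goal}"),
    ]


def execute(step: Step) -> Step:
    """Executor: carries out one step and records its result."""
    step.result = step.description.upper()  # placeholder for real work
    return step


def run(goal: str) -> List[Step]:
    # The planner never executes and the executor never plans,
    # so each side can be backed by a different model.
    return [execute(step) for step in plan(goal)]


for finished_step in run("find a house"):
    print(finished_step.result)
```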

For the executor, Manus has adopted Claude, which currently leads in programming, long-term planning, and step-by-step problem-solving, and is also using a series of Qwen models for post-training.

Yesterday, Manus also reached a strategic cooperation with Alibaba Tongyi Qianwen, aiming to realize all functions of Manus on domestic models and computing platforms. | Image Source: Manus

In the planner section, Manus has done a lot of work.

Currently, the off-the-shelf APIs and models on the market are essentially aligned for chatbot scenarios. During training, no matter how complex the user's question is, the optimization goal is to answer it clearly in a single response, which is the opposite of what an agent needs for planning.

Therefore, if existing models on the market are used directly in agent scenarios without "alignment," the model will always rush to produce a muddled result within a single dialogue round, much like a bullet-point summary.

"The alignment methods should be different; our team believes that different data is needed for specialized alignment," Xiao Hong said.

In October last year, Peak documented the progress and failures of Steiner, an open-source model he built as a hobby project in an attempt to replicate OpenAI's o1. That work was in fact preliminary research on step-by-step planning for Manus's planner.

Overall, Manus simulates a person who gets things done, which is the team's definition of Manus as a general-purpose AI assistant. As for the exploration of its boundaries, the team is likely still in the process of discovery and needs more user case studies.

In a Tencent Technology interview released before Manus's launch, Xiao Hong had already shared his preliminary thoughts on Manus's generality: "A very core issue, or an important responsibility of a product manager, is to manage user expectations well. Suppose users assume it can do everything in the world and ask, for example, 'How do I make a million dollars?' That is not something an agent should execute. But if we can provide more specific examples to make everyone's expectations more reasonable, usage will go more smoothly."

04 "The shell has its own use," the team that understands the shell best

In the early hours of February 27, when Manus product partner Zhang Tao and chief scientist Peak saw Manus.im's benchmark ranking, both were moved to tears. Manus's performance on the GAIA Benchmark surpassed OpenAI's Deep Research, and it achieved this unexpected result at about one-tenth of the cost (2 dollars per task) of OpenAI's result.

Image Source: Manus.im

In a competitive landscape where the industry has reached a consensus on agents, this small team has become one of the first to produce a general agent product, showcasing uniqueness in product engineering and front-end interaction experience.

The positive feedback from accomplishing tasks outweighs everything else. For entrepreneurial teams, there is no better motivation than this. But before this, how did Manus come to be? Why was this team the one to create it?

"Today's model capabilities can accomplish complex, multi-step tasks. It's just that there hasn't been such a product, so people can't feel it." This observation, which Xiao Hong made in an earlier interview with Tencent Technology, helps answer these questions.

At the same time, "there are not many teams that have the opportunity to try making agent products because it requires a lot of composite abilities. They need to have experience with chatbots, some AI programming, and browser-related work because they need to call the browser, and they should have a good understanding of the boundaries of LLMs—what level they have developed to today and what level they will reach next. There are not many companies that possess all these abilities simultaneously, and those that do may be focused on a very specific business. We just happened to have some colleagues who had the time to work on these things together."

"Just happened."

  • At just the right time, we discovered that the model capabilities had reached a level suitable for making agents, without necessarily waiting for an end-to-end large model like Operator to emerge;

  • We also just happened to find that the problem lay in alignment;

  • We also just happened to have experience with all the functionalities that extend from chatbots and AI browsers;

  • At the same time, because we have been working on so-called "shell" applications for large models, we have a keen perception of LLMs.

At this moment, the "Butterfly Effect" team had all the elements needed to create a general agent in place, and the result is a general agent with a relatively high level of completion by industry standards.

When asked about the decisive moment to create Manus, Peak provided more details, stating, "There is no 'clean' pivot in entrepreneurship; everything is coherent and has no clear boundaries."

"When developing a product, we also frequently pay attention to external circumstances." At that time, several factors were in play. One was that while working on the browser, we had developed edge models and later discovered that the scenarios a browser has to cover are extremely broad and varied. Along the way, we noticed that foundational models were strengthening rapidly, to the point where the gap between them and agents might be just an alignment issue, even though the outside world might feel that large language models were gradually converging and hitting a wall.

At the same time, changes were happening externally. At the beginning of last year, Cursor started gaining popularity, followed by Windsurf and Devin. These developments correspond to the same trend: agents became popular in the programming field, with pathways to popularity progressing step by step. Cursor serves as a copilot for programmers, enhancing programming efficiency, while Windsurf gradually introduced some automated processes, allowing for stronger automation capabilities on local machines, and Devin reached new heights of automation.

The movements of VCs were also consistent; for example, in the past two years, YC invested in two types of companies: one is cloud-based browsers, such as Browserbase; the second is lightweight AI sandbox virtual machines, such as E2B.

This indicates that "models are maturing rapidly, and the infrastructure around them is also maturing rapidly. Coupled with the observation that external products are gradually gaining more acceptance, we felt this was a direction worth going all in on. This is a very gradual and smooth process, and what we accumulated while working on browsers, such as the Chromium infrastructure, can be seamlessly migrated over, which is why we dared to develop a browser in the cloud."

In summary, a keen sense of both demand and models, developed during the so-called "shelling" process, together with accumulated experience, created Manus. Many of Monica's scenarios required model post-training, while the AI browser work reinforced the most important lesson, "less structure, more intelligence," and revealed that model capabilities were already sufficient for building agents, with the remaining problem lying in alignment. Three months of rapid evolution for Manus followed.

Previously, the "Butterfly Effect" team was questioned about the value of "shelling." Without developing large models independently, they integrated existing large models to create Monica, combining functionalities like chatting, searching, reading, writing, and translating, and integrated many task execution scenarios through API calls, reaching tens of millions of users by the end of last year.

Now, as Doubao, Quark, and Yuanbao vigorously promote their respective Monica-like products, it is time to rethink what "shell" really means when a small team utilizes existing technology to create the first general consumer-grade agent.

What exactly is "shelling" and what is a "shell"?

In Xiao Hong's view, all breakthroughs come from models; essentially, models drive and lead the way. The shell is meant to present the innovative points of model technology in a way that users can perceive, encapsulating the model's innovative capabilities into the most perceivable form for users.

By this definition, the DeepSeek App (including its display of the chain of thought) is the shell of DeepSeek-R1, Cursor is the shell of Claude 3.5 Sonnet, Perplexity is the shell of GPT-4, and ChatGPT is the shell of InstructGPT.

As model capabilities rapidly evolve, "that shell" also needs to evolve. After each generation of model capability, it is not necessarily the original model maker that presents the user-perceptible value; a third-party vendor may do so, just as Cursor presents the user-perceptible value of Claude 3.5 Sonnet.

On March 5, coinciding with the two-year anniversary of Monica.im, the question arises: why did this small team create a product experience that surpasses various Deep Research and OpenAI Operator products? The answer lies in their understanding and practice of the shell.

How do you create the best shell for a new model that can serve as an agent?

As one of the builders of Manus, Zhang Tao believes, "From the backend perspective, we see that there is a lot of unfinished work to be done in every area, and each of those areas is a key to success, making the product different."

The team believes that the most important advantage is the pace of innovation. Whether in applications or models, we have reached a relatively saturated state; the only core capability that truly matters is speed. Concepts like "data flywheel" and "network effects" have yet to be validated, and their impact remains uncertain.

"In a completely new field, everything is uncertain and unknown; the most important thing is the speed of innovation. It's about exploring and experimenting in various directions to quickly find the right path." The Manus team is flexible in its management philosophy, organizational structure, and processes. When new opportunities arise, it can pull together people from across the company despite limited resources, make decisions quickly, and adjust based on feedback from mistakes.

From left to right are Peak, chief scientist of the "Butterfly Effect," CEO Xiao Hong, and product partner Zhang Tao | Image Source: Internet

As for his expectations for Manus, Xiao Hong believes, "Even if there is only a window of opportunity, it is worth a try." Over the past year, his thinking has changed significantly; for instance, he now believes, "When you realize you are ahead, be more aggressive, super aggressive. Looking back today, I feel that Monica in 2023 was not aggressive enough." "If you know you are innovating and leading, you should be aggressive."

It is uncertain whether Manus can give Xiao Hong and his team the leap from 1 to N, but this team, which understands the "shell" best, believes in creating with unity of heart and hand, and also believes in the butterfly effect that creation brings. The name Manus comes from the MIT motto "Mens et manus," which emphasizes the unity of mind and hand. It is not just about theory; it is about acting and making an impact on the real world, which is true knowledge.

In the future, as more of the underlying work behind Manus is open-sourced, a broader range of butterfly effects will be further unleashed.
