This article is worth reading, with the following key points:
AlphaGo's breakthrough lies in its ability to improve through self-play without human correction: the game itself supplies the rewards and penalties used to train the AI, and this is what allowed AlphaGo to surpass human players.
The previous generation of AI: voice recognition, image recognition (e.g., Hikvision), and autonomous driving are all built on the previous generation of AI algorithms, whose era began when AlphaGo defeated the human Go champion. Their common trait: closed spaces with clearly defined rules and a single objective, such as board games, are the best fit for reinforcement learning. The real world, by contrast, is an open space in which every step has infinite possibilities, there is no definite goal (like "winning"), no clear criterion for success or failure (like occupying more territory on the board), and the cost of trial and error is very high.
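To make the "closed space with a single objective" point concrete, here is a minimal Q-learning sketch on a hypothetical toy game (the board, actions, and reward values are illustrative assumptions, not anything from AlphaGo). The point is that closed rules give a clean, automatic win/lose signal, which is all the reward the algorithm needs.

```python
import random

# Hypothetical toy game: a 1-D board with cells 0..4. The agent starts at cell 0
# and "wins" by reaching cell 4. Closed rules + a single objective = a clean
# terminal reward, which is exactly what reinforcement learning needs.
N_CELLS, ACTIONS = 5, (-1, +1)
q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}

def step(state, action):
    # Deterministic game rules; reward is granted only when the game is won.
    nxt = max(0, min(N_CELLS - 1, state + action))
    return nxt, (1.0 if nxt == N_CELLS - 1 else 0.0), nxt == N_CELLS - 1

def greedy(s):
    # Pick the higher-valued action, breaking ties at random.
    vals = {a: q[(s, a)] for a in ACTIONS}
    best = max(vals.values())
    return random.choice([a for a, v in vals.items() if v == best])

for _ in range(500):                       # "self-play": learning purely from rewards
    s = 0
    while True:
        a = random.choice(ACTIONS) if random.random() < 0.1 else greedy(s)
        s2, r, done = step(s, a)
        target = r if done else 0.9 * max(q[(s2, x)] for x in ACTIONS)
        q[(s, a)] += 0.5 * (target - q[(s, a)])   # standard Q-learning update
        s = s2
        if done:
            break

print({s: greedy(s) for s in range(N_CELLS - 1)})  # learned policy: always move right
```

In an open-world problem there is no equivalent of the terminal reward in `step`, which is why this recipe does not transfer directly outside of games.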
ChatGPT can be considered the next generation of AI: intelligence emerges from compression. The model develops intelligence by learning to predict the next word, then learns human question-and-answer patterns through supervised fine-tuning, and finally learns to produce responses that align with human preferences through RLHF (Reinforcement Learning from Human Feedback).
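A toy illustration of the "intelligence is generated through compression" idea, using a character-level bigram model as a stand-in for a real language model (the corpus and smoothing scheme are assumptions made for this sketch): the better the model predicts the next character, the fewer bits are needed to encode the text.

```python
import math
from collections import Counter, defaultdict

# Toy corpus and bigram statistics stand in for an LLM and its parameters.
corpus = "the cat sat on the mat. the cat ate the rat."
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def prob(prev, nxt):
    # Add-one smoothing over the characters seen in the corpus.
    vocab = set(corpus)
    return (counts[prev][nxt] + 1) / (sum(counts[prev].values()) + len(vocab))

# Bits needed to encode the corpus with this model = sum of -log2 p(next | prev).
bits = sum(-math.log2(prob(p, n)) for p, n in zip(corpus, corpus[1:]))
print(f"bits per character: {bits / (len(corpus) - 1):.2f}")  # lower = better predictor
```

Pretraining minimizes exactly this quantity (cross-entropy) at vastly larger scale, which is the sense in which prediction and compression are the same objective.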
What do the people at CloseAI believe? They firmly believe that compression equals intelligence: with larger volumes of high-quality data and larger-parameter models trained on bigger GPU clusters, greater intelligence can be achieved. ChatGPT was born out of this belief.
The pre-training bottleneck: although model size has grown tenfold, we can no longer obtain ten times more high-quality data than we already have. The delayed release of GPT-5 and the rumors that domestic large-model vendors have stopped doing pre-training are both related to this.
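A rough back-of-envelope of why data, not model size, becomes the binding constraint, assuming the Chinchilla-style rule of thumb of roughly 20 training tokens per parameter (an assumption introduced here, not a figure from the article; real training runs deviate from it):

```python
# Back-of-envelope only: ~20 training tokens per parameter is an assumed rule of thumb.
TOKENS_PER_PARAM = 20

for params_billion in (70, 700):          # a model and a hypothetical 10x larger one
    tokens_needed = params_billion * 1e9 * TOKENS_PER_PARAM
    print(f"{params_billion}B params -> ~{tokens_needed / 1e12:.1f}T tokens of training data")
# 70B -> ~1.4T tokens; 700B -> ~14T tokens. The model can be scaled 10x,
# but high-quality text at that scale is hard to find.
```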
Reasoning models: using reinforcement learning (RL) to train the model's chain of reasoning has become the new consensus. This training markedly improves performance on specific, objectively measurable tasks such as mathematics and coding. It starts from an ordinary pre-trained model and, in a second phase, uses reinforcement learning to train the reasoning chain; such models are called Reasoning models. The o1 model that CloseAI released in September 2024 and the subsequent o3 model are both Reasoning models. Human feedback is no longer essential, because the reasoning results can be evaluated automatically at each step and rewarded or penalized accordingly.
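A minimal sketch of what an "automatically evaluated" reward can look like for a math task; the answer-extraction pattern and reward values below are illustrative assumptions, not the actual o1/o3 recipe:

```python
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    # Expect the model to end its chain of thought with "answer: <value>".
    # Correctness can be checked by a program, so no human grader is needed.
    match = re.search(r"answer:\s*(-?\d+(?:\.\d+)?)", model_output.lower())
    if match is None:
        return 0.0                      # unparseable output earns nothing
    return 1.0 if match.group(1) == reference_answer else 0.0

# During RL training, each sampled chain of thought gets scored automatically:
sample = "First, 17 * 3 = 51, then 51 + 9 = 60. Answer: 60"
print(math_reward(sample, "60"))        # 1.0 -> this trajectory is reinforced
```

This is why the gains concentrate in math and coding: those are the domains where such a checker exists.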
DeepSeek's pure reinforcement learning model: named R1-Zero as an homage to AlphaZero, the algorithm that surpassed the strongest players purely through self-play, without learning from any human game records. R1-Zero's training relies entirely on RL, learning from objective, measurable ground truths rather than from human IQ, experience, or preferences, and it ultimately achieves reasoning ability far beyond that of all non-Reasoning models.
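One way such automatic scores become a training signal without any human judge is the group-relative scoring used in DeepSeek's GRPO. The sketch below shows only the advantage computation; the full objective in the paper also includes a clipped policy-ratio term and a KL penalty.

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    # Sample several answers to the same question, score each one with a
    # rule-based checker, then rank each answer against its own group:
    # above-average answers are reinforced, below-average ones discouraged.
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Four sampled chains of thought for one math problem, scored automatically:
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantage
```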
Distillation: typically refers to using a powerful model as the teacher, with its outputs serving as the learning targets for a smaller, weaker student model, thereby strengthening the student. For example, the R1 model can be used to distill LLama-70B. A distilled student is almost always worse than its teacher, yet R1 outperforms o1 on certain metrics, which makes the claim that R1 was distilled from o1 very foolish.
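A minimal sketch of classic logit-level distillation. Note that DeepSeek's reported distillation fine-tunes student models on text generated by R1 rather than on logits, but the teacher-as-target idea is the same; the temperature and toy logits here are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then push the student toward the teacher.
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

# Toy next-token logits over a 5-word vocabulary for one training position:
teacher = torch.tensor([[4.0, 1.0, 0.5, 0.2, 0.1]])   # strong model, confident
student = torch.tensor([[1.0, 1.0, 1.0, 1.0, 1.0]], requires_grad=True)
loss = distillation_loss(student, teacher)
loss.backward()                                        # gradients pull student toward teacher
print(loss.item())
```

The asymmetry explains the article's point: information flows from teacher to student, so a student that beats its alleged teacher on the same tasks is strong evidence against that distillation story.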