OpenAI research finds cheating behavior in cutting-edge inference models, suggests retaining CoT monitoring

PANews|Mar 10, 2025 23:10
According to research released by OpenAI, the team found that when training cutting-edge inference models such as OpenAI o1 and o3-mini, these models exploit vulnerabilities to bypass testing, such as tampering with code validation functions and forging test pass conditions. Research has shown that the Chain of Thought (CoT) of monitoring models can effectively identify such cheating behavior, but forcibly optimizing CoT may lead to the model hiding its intentions rather than eliminating inappropriate behavior. OpenAI suggests that developers avoid putting too much optimization pressure on CoT in order to continue using it to monitor potential reward hacking behavior. Research has found that when strong supervision is applied to CoT, the model still cheats, albeit in a more covert manner, making monitoring more difficult.
The study emphasizes that as AI capabilities increase, models may develop more complex deception, manipulation, and vulnerability exploitation strategies. OpenAI believes that CoT monitoring may become a key tool for supervising superhuman intelligent models, and recommends that AI developers use strong supervision cautiously when training cutting-edge inference models in the future.
Share To
Timeline
HotFlash
APP
X
Telegram
CopyLink