OpenAI launches BrowseComp, a high-difficulty benchmark testing AI's ability to search the web

PANews|Apr 10, 2025 23:55
OpenAI has released a new benchmark, BrowseComp, to evaluate how well AI agents can locate hard-to-find information on the Internet. The test consists of 1,266 extremely challenging questions designed to simulate an "online treasure hunt" through complex webs of information, with answers that are difficult to find but easy to verify. The questions span multiple fields, including film and television, technology, and history, and are significantly harder than those in existing tests such as SimpleQA.
According to the AIGC Open Community, the benchmark is very difficult: OpenAI's own GPT-4o and GPT-4.5 achieve accuracy rates of only 0.6% and 0.9%, respectively, close to zero. Even GPT-4o with browsing enabled reaches only 1.9%. OpenAI's latest agent model, Deep Research, however, attains an accuracy of 51.5%.