Article Source: AI Pioneer Official Account
Image Source: Generated by Wujie AI
Since the advent of ChatGPT, an "arms race" of large models has broken out globally. Reports indicate that 64 large models were released in China from January to July of this year alone; as of July 2023, China had accumulated 130 large models in total.
The term "hundred-model war" is not enough to describe the current intense "battlefield." So, which large model is superior? This cannot be separated from the evaluation of large models.
However, there is currently no universally recognized, effective evaluation method, which has led to a "battle of leaderboards" in large model evaluation both in China and abroad. Incomplete statistics suggest there are no fewer than 50 evaluation tools (systems) on the market, the results of similar leaderboards can vary widely, and public suspicion of "score manipulation" never ends.
It is generally believed in the industry that there are two visible standards for evaluating a large model: the number of parameters and the evaluation set.
The parameter count refers to the number of learnable parameters in a model, i.e., its weights and biases, and it determines the model's complexity. Having more parameters and more layers is what distinguishes large models from small ones. In 2022, a batch of large models appeared in the United States: from Stability AI's release of the text-to-image generative model Stable Diffusion to OpenAI's launch of ChatGPT, model parameter scales entered the era of hundreds of billions and even trillions.
On the surface, models with trillions of parameters generally perform better than those with billions. But this is not absolute: simply piling up parameters does not necessarily improve capability. So how should models at the same parameter scale be distinguished? That requires the second evaluation dimension of large models: evaluation sets.
The evaluation set is a unified benchmark dataset constructed for the effective evaluation of basic models and their fine-tuning algorithms in different scenarios and tasks, with both open and closed forms.
These evaluation sets are like exams for different fields. By testing the scores of large models in these "exams," people can more intuitively compare the performance of large models.
In the era of small models, most model organizations used results on academic evaluation sets as the basis for judging model quality. Now, large model manufacturers are also beginning to participate actively in academic benchmark frameworks, treating them as authoritative endorsements and marketing material.
There are already many evaluation sets for large models on the market, such as the widely used Massive Multitask Language Understanding (MMLU) for international large model evaluation, C-Eval for Chinese model evaluation, and SuperCLUE, among others.
-1- Evaluation Tools
MMLU
Its full name is Massive Multitask Language Understanding, and it evaluates the language understanding ability of large models; it is one of the best-known semantic understanding benchmarks for large models today. It was launched by researchers at UC Berkeley in September 2020. The test covers 57 tasks, including elementary mathematics, US history, computer science, and law. The tasks span a wide range of knowledge, the language is English, and the benchmark is used to evaluate a large model's basic knowledge coverage and understanding ability.
Paper link:
https://arxiv.org/abs/2009.03300
Official website:
https://paperswithcode.com/dataset/mmlu
Large model leaderboard:
https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
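MMLU questions are four-way multiple-choice items, so scoring ultimately comes down to comparing the option letter a model picks against the gold answer. Below is a minimal, hypothetical sketch of that kind of accuracy computation; the `ask_model` function and the question-dictionary format are placeholders for illustration, not the official MMLU harness.

```python
# Minimal sketch of scoring four-way multiple-choice questions (MMLU-style).
# `ask_model` is a hypothetical stand-in for whatever model API you use;
# the question dictionaries are illustrative, not the official data format.

def ask_model(prompt: str) -> str:
    """Placeholder: send the prompt to a model and return its raw text reply."""
    raise NotImplementedError

def format_prompt(item: dict) -> str:
    options = "\n".join(f"{letter}. {text}"
                        for letter, text in zip("ABCD", item["choices"]))
    return (f"{item['question']}\n{options}\n"
            "Answer with a single letter (A, B, C, or D).")

def extract_choice(reply: str) -> str:
    """Take the first A/B/C/D that appears in the model's reply."""
    for ch in reply.strip().upper():
        if ch in "ABCD":
            return ch
    return ""  # an unparsable answer counts as wrong

def accuracy(items: list[dict]) -> float:
    correct = sum(extract_choice(ask_model(format_prompt(it))) == it["answer"]
                  for it in items)
    return correct / len(items)
```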
C-Eval
C-Eval is a comprehensive Chinese basic model evaluation suite. It was jointly launched by researchers from Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh in May 2023. It contains 13,948 multiple-choice questions, covering 52 different disciplines and four difficulty levels, used to evaluate the Chinese comprehension ability of large models.
Paper link:
https://arxiv.org/abs/2305.08322
Project link:
https://github.com/SJTU-LIT/ceval
Official website:
https://cevalbenchmark.com/
SuperCLUE
A comprehensive evaluation benchmark for Chinese general-purpose large models, assessing models along three dimensions: basic abilities, professional abilities, and Chinese-specific abilities.
Basic abilities comprise 10 capabilities, including semantic understanding, dialogue, logical reasoning, role-playing, coding, and generation and creation.
Professional abilities cover middle school, college, and professional exams, spanning more than 50 capabilities from mathematics, physics, and geography to the social sciences.
Chinese-specific abilities target tasks with Chinese characteristics and include more than 10 capabilities such as idioms, poetry, literature, and character forms.
Project link:
https://github.com/CLUEbenchmark/SuperCLUE
Official website:
https://www.cluebenchmarks.com/
SuperCLUE Langya List
An anonymous battle evaluation benchmark for Chinese general-purpose large models, similar to Chatbot Arena: different large model products undergo anonymous, randomized competitive evaluations through crowdsourcing, and the results are computed with the Elo rating system.
Project link:
https://github.com/CLUEbenchmark/SuperCLUElyb
Chatbot Arena
Chatbot Arena is a benchmark platform for large language models (LLMs). Its organizer, LMSYS Org, is a research organization founded by UC Berkeley, UC San Diego, and Carnegie Mellon University.
It pits LLMs against each other in anonymous, randomized battles through crowdsourcing. Users enter the battle platform via the demo address, type in a question of interest, and submit it; two anonymous models then compete side by side and generate their answers. The user judges the answers and chooses one of four options: Model A is better, Model B is better, Tie, or Both are bad. Multi-turn dialogue is supported. Finally, the models' capabilities are comprehensively ranked using the Elo rating system. (You can also specify the models yourself to see how they perform, but those results do not count towards the final ranking.)
Project link:
https://github.com/lm-sys/FastChat
Official website:
https://chat.lmsys.org/
FlagEval
FlagEval (Tiancheng) is a large model evaluation platform jointly created by the Beijing Academy of Artificial Intelligence (BAAI, the Zhiyuan Research Institute) and teams from multiple universities. It uses a three-dimensional "capability-task-indicator" evaluation framework and aims to provide comprehensive, fine-grained evaluation results. The platform currently covers more than 30 capabilities, 5 tasks, and 4 categories of indicators, for over 600 evaluation dimensions in total, including 22 subjective and objective evaluation datasets and 84,433 questions along the task dimension.
FlagEval (Tiancheng) has launched the first phase of its large language model evaluation system, the open-source multilingual text-image large model evaluation tool mCLIP-Eval, and the open-source text-to-image generation evaluation tool ImageEval. The FlagEval platform will continue to explore the intersection of language model evaluation with disciplines such as psychology, education, and ethics in order to evaluate language models more scientifically and comprehensively. FlagEval targets both large model developers and users, aiming to help each development team understand the weaknesses of its own models and to promote technological innovation.
Project link:
https://github.com/FlagOpen/FlagEval
Official website:
https://flageval.baai.ac.cn/
OpenCompass
In August 2023, the Shanghai Artificial Intelligence Laboratory (Shanghai AI Laboratory) officially launched OpenCompass, an open evaluation system for large models. Through a complete, open-source, and reproducible evaluation framework, it supports one-stop evaluation of large language models, multimodal models, and other model types, and it regularly publishes leaderboard results.
Official website:
https://opencompass.org.cn/
Project link:
https://github.com/open-compass/opencompass
JioNLP
Examines how well LLMs assist human users, i.e., whether they can reach the level of an "intelligent assistant." The multiple-choice questions come from various professional exams in mainland China and focus on the model's coverage of objective knowledge, accounting for 32% of the set; the subjective questions are compiled from everyday usage and mainly examine how well the model handles the functions users commonly rely on.
Project link:
https://github.com/dongrixinyu/JioNLP/wiki/LLM评测数据集
Tsinghua Security Large Model Evaluation
A collection of safety evaluation sets from Tsinghua University, covering hate speech, prejudiced and discriminatory speech, criminal and illegal activities, privacy, ethics, and more, with a fine-grained division into over 40 second-level safety subcategories.
Address: http://115.182.62.166:18000
LLMEval-3
Launched by the Fudan University NLP Laboratory, it focuses on evaluating professional knowledge, covering the 13 disciplinary categories and over 50 sub-disciplines defined by the Ministry of Education, with approximately 200,000 standard generative QA questions in total. To prevent leaderboard manipulation, LLMEval-3 uses a novel "question bank exam" evaluation mode.
Address: http://llmeval.com/
GAOKAO-Bench
GAOKAO-Bench is an evaluation framework that uses Chinese college entrance examination (Gaokao) questions as its dataset to evaluate large models' language comprehension and logical reasoning abilities.
Project link:
https://github.com/OpenLMLab/GAOKAO-Bench
PandaLM
PandaLM directly trains an automated judge model that scores a pair of candidate models' outputs on a three-way 0/1/2 scale.
Project link:
https://github.com/WeOpenML/PandaLM
BIG-bench
A benchmark released by Google, consisting of 204 tasks covering topics in linguistics, child development, mathematics, common sense reasoning, biology, physics, social bias, software development, and more.
Project link:
https://github.com/google/BIG-bench
MMCU
Proposed by the Oracle AI Research Institute, it measures the multitask accuracy of Chinese large models, with test content covering four major domains: medical, legal, psychology, and education. The number of questions exceeds 10,000, including 2819 medical domain questions, 3695 legal domain questions, 2001 psychology domain questions, and 3331 education domain questions.
Project link:
https://github.com/Felixgithub2017/MMCU
AGIEval
A benchmark for evaluating the basic capabilities of large models, launched by Microsoft in April 2023, mainly evaluating the general cognitive and problem-solving abilities of large models, covering 20 official, public, and high-standard admission and qualification exams for ordinary human candidates worldwide, including Chinese and English data.
Paper link:
https://arxiv.org/abs/2304.06364
GSM8K
OpenAI's benchmark for evaluating the mathematical reasoning ability of large models, comprising 8,500 high-quality grade school math word problems. The dataset is larger, more linguistically diverse, and more challenging than previous math word problem datasets. Released in October 2021, it remains a very difficult benchmark.
Paper link:
https://arxiv.org/abs/2110.14168
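GSM8K reference solutions in the released dataset end with a line of the form "#### <number>", and evaluation is typically exact match on that final number. The sketch below shows the extraction-and-compare step under the assumption that the model is also prompted to end its answer with the same "####" marker; it is an illustration, not OpenAI's official grader.

```python
import re

# Sketch of GSM8K-style scoring: extract the final number after "####" from
# the reference solution and from the model's answer, then compare exactly.
# Assumes the model was prompted to end its answer with "#### <number>".

ANSWER_RE = re.compile(r"####\s*(-?[\d,\.]+)")

def extract_number(text: str):
    match = ANSWER_RE.search(text)
    if not match:
        return None
    return match.group(1).replace(",", "").rstrip(".")

def is_correct(model_output: str, reference_solution: str) -> bool:
    predicted = extract_number(model_output)
    gold = extract_number(reference_solution)
    return predicted is not None and predicted == gold

# Example with a toy reference solution:
ref = ("Natalia sold 48/2 = 24 clips in May.\n"
       "Altogether she sold 48 + 24 = 72 clips.\n#### 72")
print(is_correct("Reasoning...\n#### 72", ref))  # True
```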
HELM
HELM (Holistic Evaluation of Language Models), proposed by Stanford University, consists of three main modules: scenarios, adaptation, and metrics. Each evaluation run specifies a scenario, an adaptation method (the prompts presented to the model), and one or more metrics. It mainly covers English and defines 7 metrics, including accuracy, calibration/uncertainty, robustness, fairness, bias, toxicity, and inference efficiency; tasks include QA, information retrieval, summarization, text classification, and more.
Paper link:
https://arxiv.org/pdf/2211.09110.pdf
Project link:
https://github.com/stanford-crfm/helm
Chinese-LLaMA-Alpaca
Its scores are relative values, produced primarily with GPT-4 as the judge and partly with ChatGPT (GPT-3.5).
Project link:
https://github.com/ymcui/Chinese-LLaMA-Alpaca/tree/main
MT-bench
Evaluates the multi-turn dialogue and instruction-following abilities of large models. The dataset includes 80 high-quality multi-turn dialogue questions (8 categories × 10 questions), each answered by 6 well-known models (GPT-4, GPT-3.5, Claude-v1, Vicuna-13B, Alpaca-13B, and LLaMA-13B), yielding about 3.3K manually ranked answer pairs.
Paper:
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Project link:
https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge
Data download link:
https://huggingface.co/datasets/lmsys/mt_bench_human_judgments
-2- Evaluation Modes
Looking across the evaluation tools above, the common large model evaluation modes can be roughly grouped into four categories:
1. Question scoring. Various evaluation datasets are collected and divided by capability dimension; prompts are designed so that large models perform the tasks in these datasets, and their outputs are compared with standard answers to compute scores. Typical examples include OpenCompass and Hugging Face's Open LLM Leaderboard.
2. GPT-4 as judge. Evaluation datasets are collected (including some non-public datasets, as well as open-ended sets without standard answers), and GPT-4 judges the outputs generated by the large models. There are two scoring approaches: direct scoring, or scoring along designed dimensions such as factuality, accuracy, and safety compliance for a finer-grained evaluation.
3. Arena mode. Similar to an arena in competitive games: two large model "players" compete, and users (or sometimes GPT-4) judge which model is better. The winner gains points and the loser loses points; after enough rounds of competition, a ranking emerges that is relatively fair and objectively reflects the models' strengths and weaknesses. A typical example is the Chatbot Arena leaderboard released by UC Berkeley (a minimal sketch of the underlying Elo update follows this list).
4. Evaluation of specific abilities, such as mathematics, coding, or reasoning. Evaluating these abilities not only indicates whether a large model truly possesses human-like reasoning, but also directly helps with choosing a large model for specific domain scenarios (e.g., as a code assistant).
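The Elo update behind arena-style rankings is simple: each model carries a rating, the expected win probability comes from the rating gap, and the winner's rating rises by an amount proportional to how surprising the win was. Below is a minimal sketch using the standard Elo formula; the K factor of 32 and the 1000-point starting rating are illustrative choices, not the settings of Chatbot Arena or the SuperCLUE Langya List.

```python
# Minimal Elo update for arena-style model battles.
# K = 32 and the 1000-point starting rating are illustrative assumptions.

from collections import defaultdict

ratings = defaultdict(lambda: 1000.0)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_battle(model_a: str, model_b: str, outcome: float, k: float = 32.0) -> None:
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - e_a))

# Example: model_x beats model_y twice, then they tie once.
record_battle("model_x", "model_y", 1.0)
record_battle("model_x", "model_y", 1.0)
record_battle("model_x", "model_y", 0.5)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

After many crowdsourced battles, sorting models by rating yields the leaderboard; ties simply move both ratings toward each other.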
-3- Vast Differences in Evaluation Results
The evaluation tools are diverse, and the evaluation results of different tools also vary greatly.
On August 15, an institution released an AI large model experience report that horizontally evaluated the user experience of 8 mainstream Chinese AI large models on 500 questions. The final ranking placed iFLYTEK's Xinghuo (Spark) first, Baidu's Wenxin Yiyan second, and Alibaba's Tongyi Qianwen last.
In September, the latest rankings of the popular open-source evaluation list C-Eval placed Yuntian Lifei's large model "Yuntian Book" in first place, while GPT-4 was only ranked tenth.
In the same month, SuperCLUE released the September rankings of large models. GPT-4 ranked first overall, while SenseChat 3.0 from SenseTime Technology took the top spot in the Chinese rankings.
On October 19, Stanford University released the 2023 Foundation Model Transparency Index, which rated 10 mainstream foundation models on transparency. Llama 2 ranked first, and GPT-4 ranked third.
Why do the evaluation results of different evaluation tools differ so much? The main reasons are as follows:
1. Each popular academic evaluation set has its own focus. For example, Meta commonly uses GSM8K and MMLU, two exam sets at different levels: the former is grade-school mathematics, while the latter is more advanced multidisciplinary QA. Just as students in a class take exams in different subjects, large models naturally rank differently on different leaderboards.
2. The proportion of subjective questions in large model evaluations is increasing. In current domestic and international leaderboards, combining subjective and objective questions is generally accepted by the industry. The challenge with subjective questions, however, is keeping everyone's grading standards consistent. Moreover, scoring by human teams inevitably caps the number of questions that can be graded, and for large model evaluation, more questions generally mean more reliable conclusions.
3. Competition between domain-specific models and general-purpose large models distorts vertical-domain rankings. In real deployment scenarios, enterprise clients in manufacturing, healthcare, finance, and other industries need to fine-tune large models on their own data. This means that the results a general-purpose model obtains by directly answering vertical-domain questions do not represent the true performance of large model products in those domains.
4. "Leaderboard manipulation" caused by open-source test sets. The reason why some new large models can surpass GPT-4 on open-source test set leaderboards is due to suspected "cheating." For example, C-Eval currently only releases questions but not answers. Large model manufacturers participating in the test either have data annotators complete the questions or use GPT-4 to complete the questions, then train the large model with the answers to achieve a perfect score in the corresponding subject test.
Can closed evaluation sets avoid "leaderboard manipulation"? Not necessarily. If a closed evaluation set never refreshes its questions, participating vendors can pull historical records from their backends, "learn" the questions that have already been tested, and redo them, which amounts to a "fake closed" set.
In response to these issues, the industry is also exploring corresponding solutions.
For example, to address the difficulty of keeping subjective grading standards consistent and the question-count ceiling of human-only scoring, the industry has begun to adopt a "human + GPT-4 scoring" model. In China, SuperCLUE treats GPT-4 as a "grading teacher" and lets it assist the human team in scoring.
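As an illustration of what GPT-4-assisted scoring can look like in practice, the sketch below asks GPT-4 to grade one model answer and return a 1-5 score. The prompt wording and the 1-5 scale are assumptions for illustration, not SuperCLUE's actual rubric; the call uses the OpenAI Python SDK's chat completions interface.

```python
# Hedged sketch of "GPT-4 as grading teacher": ask GPT-4 to score one answer
# on a 1-5 scale. The rubric, scale, and prompt are illustrative assumptions,
# not any leaderboard's actual setup. Requires the `openai` package and an
# OPENAI_API_KEY environment variable.

from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are a strict grader.
Question: {question}
Candidate answer: {answer}
Score the answer from 1 (useless) to 5 (excellent) for factual accuracy and helpfulness.
Reply with the score only."""

def judge(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(question=question, answer=answer)}],
        temperature=0,
    )
    reply = response.choices[0].message.content.strip()
    digits = "".join(ch for ch in reply if ch.isdigit())
    return int(digits[0]) if digits else 0  # 0 means the reply was unparsable

# Example (requires API access):
# print(judge("What is the capital of France?", "Paris is the capital of France."))
```

In a human + GPT-4 setup, such machine scores would typically be spot-checked or averaged with human grades rather than used alone.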
Regarding the issue of "leaderboard manipulation," industry insiders believe that "evaluation sets should be closed to avoid cheating, but a good large model evaluation should be an open evaluation process, making it easy for everyone to supervise the evaluation."
Some also believe that making the large model evaluation process public is a good vision, but considering the fairness and impartiality of the evaluation, there should still be a large number of closed evaluation sets. Only with "closed-book exams" can the model's abilities be truly evaluated.
In addition, there are anti-cheating evaluations, such as LLMEval-3 from the Fudan University NLP Laboratory, which uses the novel "question bank exam" mode. In LLMEval-3, each participating system must complete 1,000 questions randomly sampled from the full question bank, and the questions served to models from the same organization are guaranteed not to repeat across evaluations. The evaluation is conducted online, and the questions sent in each round depend on the response to the previous question, so as to prevent malicious crawling of the bank.
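A rough sketch of this "question bank exam" idea, i.e., drawing a fresh random paper per organization and never re-serving a question that organization has already seen, might look like the following. It is an illustrative approximation, not LLMEval-3's actual implementation, and the 1,000-question paper size simply mirrors the figure quoted above.

```python
import random

# Illustrative approximation of a "question bank exam": each organization draws
# a random paper from the bank, and questions it has already received are never
# re-served in later rounds. This is NOT LLMEval-3's actual implementation.

class QuestionBank:
    def __init__(self, questions: list, paper_size: int = 1000):
        self.questions = questions
        self.paper_size = paper_size
        self.served = {}  # organization name -> set of questions already seen

    def draw_paper(self, org: str) -> list:
        seen = self.served.setdefault(org, set())
        remaining = [q for q in self.questions if q not in seen]
        if len(remaining) < self.paper_size:
            raise ValueError(f"Bank exhausted for {org}: {len(remaining)} unseen questions left")
        paper = random.sample(remaining, self.paper_size)
        seen.update(paper)
        return paper

# Example with a toy bank of 5,000 questions and 1,000-question papers:
bank = QuestionBank([f"Q{i}" for i in range(5000)])
first = bank.draw_paper("org_a")
second = bank.draw_paper("org_a")
print(len(set(first) & set(second)))  # 0: no repeats for the same organization
```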
Due to the wide range of fields and applications involved in large models, different domains and applications of large models need to focus on different indicators and evaluation methods. Therefore, different institutions and organizations may propose different evaluation standards and methods based on specific application domains and requirements. "Although there is no unified standard, the significance of evaluation lies in providing a method to evaluate and compare the performance and effectiveness of different large models, helping users choose large models that suit their needs."
How to make a truly comprehensive evaluation of large models is still a puzzle for the forefront of academia and industry. Nevertheless, authoritative institutions should strengthen research, reach a consensus as soon as possible, and promote technological progress and industry development.