CLEVA: Chinese Language Models EVAluation Platform

With the continuous emergence of Chinese Large Language Models (LLMs), how to evaluate a model's capabilities has become an increasingly significant issue. The absence of a comprehensive Chinese benchmark that thoroughly assesses a model's performance, the unstandardized and incomparable prompting procedure, and the prevalent risk of contamination pose major challenges in the current evaluation of Chinese LLMs. We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs. Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard. To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round. Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding. Large-scale experiments featuring 23 Chinese LLMs have validated CLEVA's efficacy.


Introduction
Large language models (LLMs) have fundamentally revolutionized natural language processing. Transformer models with more than 100B parameters have exhibited remarkable generalization ability across diverse tasks without the need for fine-tuning. The success of GPT-4 (OpenAI, 2023) and ChatGPT sparked a trend of training Chinese LLMs, with new models launching almost every week (Zeng et al., 2023; Team, 2023; Chenghao Fan and Tian, 2023; Ji et al., 2023; Cui et al., 2023). These rapid developments aggravate the need for Chinese LLM evaluation.
Assessing the capacity of LLMs is non-trivial.
Traditional practices that evaluate models on a single task at a time are gradually becoming obsolete, since a single task can hardly characterize a full view of an LLM's capacity. Instead, to effectively grasp a holistic view of an LLM's capacity, we need to decompose its capacity into various abilities, evaluate these abilities with numerous corresponding tasks, and measure the competence on each task with multiple metrics. In this sense, HELM (Liang et al., 2022) leads the way in English LLM evaluation, as it conducts an in-depth evaluation of English LLMs on various NLP tasks using seven metrics. In Chinese, previous attempts have shown limitations, either in task selection or in the metrics adopted. C-Eval (Huang et al., 2023), M3KE (Liu et al., 2023), CMMLU (Li et al., 2023), GAOKAO-Bench (Zhang et al., 2023), and MMCU (Zeng, 2023) narrow down to knowledge and reasoning abilities, and their datasets are mostly constructed from Chinese exams. OpenCompass (Contributors, 2023b), with around 74K Chinese queries out of 300K total, leans on accuracy as its sole metric, overlooking other important aspects of LLM evaluation. FlagEval (Contributors, 2023a) inherits four out of seven metrics from HELM and 22 existing Chinese datasets, leaving limited coverage of some significant tasks. A comprehensive Chinese benchmark incorporating diverse metrics to holistically evaluate Chinese LLMs is urgently needed.
Prompt-based evaluation in Chinese is largely unstandardized. Previous evaluations, such as HELM (Liang et al., 2022), do not explicitly optimize prompts, even though LLMs' significant sensitivity to prompt format has been observed (Webson and Pavlick, 2022; Abdou et al., 2022; Sanh et al., 2022). Moreover, unlike many English benchmarks that have well-developed prompts (§ 3), many Chinese benchmarks are in their early stages and do not enjoy such privileges. Chinese LLMs are evaluated using different prompts, making the results incomparable and hence untrustworthy.
Consuming up to trillions of tokens during pretraining, LLMs are prone to train-test contamination (Brown et al., 2020), which significantly threatens the validity of an evaluation. Previous work (OpenAI, 2023; Liang et al., 2022) approaches this issue from a consequentialist perspective: it examines the contamination risk, by methods like long n-gram overlap, only after the evaluation has been done. These post-evaluation analyses, though responsibly examining whether train-test contamination happens, cannot alleviate the risk of contamination in the first place. A proactive method to mitigate the contamination risk is of great importance.
We present CLEVA, a Chinese Language models EVAluation platform that tackles the aforementioned problems with the following features:
• A comprehensive Chinese benchmark. Inspired by HELM (Liang et al., 2022), CLEVA organizes the evaluation tasks into two parts: ability evaluation, which gauges specific LLM skills, and application assessment, which tests how well LLMs apply their skills to real-world applications (§ 4.1). Most of the well-accepted Chinese datasets relevant to our ability evaluation or application assessment are organized, standardized, and then adopted by our platform. More importantly, we design new Chinese-specific tasks, e.g., Pinyin transliteration and classical Chinese understanding, and collect a substantial amount of new data, accounting for 33.98% of our total data. As for the metrics (§ 4.1), we incorporate metrics for diversity and privacy into our system in addition to the seven in HELM. With 370K test instances (over 9 million queries after augmentation) from 84 datasets and 9 metrics, CLEVA, so far, stands as the most extensive Chinese evaluation dataset and possesses the most dimensions, facilitating a holistic evaluation of Chinese LLMs.
• Standardized prompt-based evaluation methodology. CLEVA takes full control of key aspects of LLM evaluation, with data and prompts being the most important among them. All data are jointly prepared with unified preprocessing steps, ensuring a level playing field for all LLMs. Meanwhile, CLEVA provides a set of prompts, instead of just one prompt as in prior work, for each task for prompting-based inference (Brown et al., 2020). This prompt design ensures comparable evaluation results by standardizing the prompts used for testing, while also encouraging further analysis of LLMs' sensitivity to different prompts (Zhu et al., 2023).
• An up-to-date and trustworthy leaderboard. CLEVA advocates a proactive method for securing trustworthy evaluation results. By collecting extensive new data, CLEVA suppresses the leakage of testing data prior to the evaluation. Moreover, we frequently organize new evaluation rounds, sampling a unique test set from 9 million augmented instances. This strategy further mitigates the risk of train-test contamination, improving the trustworthiness and timeliness of the leaderboard.
CLEVA is thoroughly validated by benchmarking 23 Chinese LLMs on our large-scale test sets (§ 6). The corresponding leaderboard and all other user-friendly features will be continuously maintained and improved to accommodate new models and evaluation methods.

Related Work
LLM evaluation is a rapidly developing field that aims to delineate the boundary of LLMs' capabilities. In English, various systematic evaluation benchmarks have been proposed. For example, BIG-Bench (bench authors, 2023) is the largest collection, covering more than 200 tasks. HELM (Liang et al., 2022) organizes tasks into core scenarios, which focus on use cases, and targeted evaluation, which aims to better understand models. HELM also presents a multi-metric measurement that enables analysis of trade-offs for each scenario. Recently, AGIEval (Zhong et al., 2023) was proposed to evaluate LLMs using challenging human exams. PromptBench (Zhu et al., 2023), on the other hand, measures the robustness of LLMs to prompts via adversarial attacks. MT-Bench (Zheng et al., 2023) collects multi-turn questions and presents the Chatbot Arena platform, which treats GPT-4 (OpenAI, 2023) as the judge.
While CLEVA shares the same fundamental motivation as HELM (Liang et al., 2022), to holistically evaluate large language models in their original languages, CLEVA is far from a mere Chinese replica of HELM. Building on the foundation of HELM's taxonomy, CLEVA introduces a range of tasks, with particular emphasis on those unique to Chinese, to better assess the capabilities of Chinese LLMs. It offers a new perspective on prompts, providing abundant prompt templates to standardize evaluation and encourage in-depth exploration of models' sensitivity. In terms of metrics, CLEVA expands into the new areas of diversity and privacy for a more comprehensive evaluation. Finally, CLEVA proactively mitigates train-test contamination by collecting a significant amount of new data, creating unique test sets by sampling, and regularly updating the leaderboard. All of these evaluation designs are neatly packaged in a user-friendly platform to facilitate community usage.
There has also been much progress in evaluating Chinese LLMs (Huang et al., 2023; Liu et al., 2023; Li et al., 2023; Zhang et al., 2023; Zeng, 2023). OpenCompass (Contributors, 2023b) and FlagEval (Contributors, 2023a) are two important attempts to evaluate Chinese LLMs. OpenCompass pools 53 public datasets and uses standard accuracy-like metrics as the only measurement for each dataset. FlagEval, with a smaller number of datasets and metrics, still needs further expansion to achieve sufficient coverage. Compared to previous efforts, CLEVA offers at least 4× more Chinese data from 84 datasets, including 33.98% original queries, while employing the broadest range of metrics to promote holistic evaluation. CLEVA standardizes prompts (§ 4) and mitigates data contamination issues, pioneering new paths for LLM evaluation in general.

Preliminaries
To measure model performance on a task, a relevant test set is constituted from a collection of instances. A test instance contains multiple input fields (typically strings) and a list of references.
We then adopt a prompt template, which essentially describes how to assemble the model input, a.k.a. the prompt, from input fields (Bach et al., 2022). For example, a Chinese paraphrase identification prompt template reads (in English translation): Are the questions "{sentence1}" and "{sentence2}" asking the same thing?
where {sentence1} and {sentence2} are two input fields that will be replaced by the two candidate questions in the test instance. The prompt is fed into a black-box LLM to predict an output string together with its probability. Finally, all model predictions and the corresponding test instances are passed into a metric to obtain a numerical value that indicates how well the model performs. Following HELM (Liang et al., 2022), a metric in this paper is an umbrella term for a dimension of measures that share similar purposes. For example, the accuracy metric corresponds to BLEU for translation and pass@k for code synthesis. We employ nine metrics, foregrounding metrics beyond accuracy and ensuring a holistic evaluation.
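As a minimal illustration of this pipeline, the sketch below fills a prompt template with the input fields of one test instance; the template text, field names, and helper function are illustrative assumptions rather than CLEVA's actual prompt set or API.

```python
# Minimal sketch of the prompt-assembly step described above.
# The template syntax and field names are illustrative, not CLEVA's actual API.

def fill_template(template: str, fields: dict[str, str]) -> str:
    """Replace {field} placeholders in a prompt template with instance values."""
    prompt = template
    for name, value in fields.items():
        prompt = prompt.replace("{" + name + "}", value)
    return prompt

template = '"{sentence1}"和"{sentence2}"这两个问题是在问同一件事吗？'
instance = {
    "sentence1": "如何学好英语？",
    "sentence2": "怎样才能学好英语？",
}
prompt = fill_template(template, instance)
# `prompt` is then sent to a black-box LLM; the returned string and its
# probability are compared against the instance's references by a metric.
```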

System Design
CLEVA aims to deliver the following two key assets to users who wish to evaluate their own LLMs:
• A comprehensive and thorough assessment report that informs users of the strengths and limitations of their models.
• A trustworthy leaderboard reflecting the latest advancement of LLMs.
We will discuss our taxonomy, which ensures comprehensive evaluations, and challenges like train-test contamination in leaderboard maintenance.

Evaluation Taxonomy
Inspired by HELM (Liang et al., 2022), we present a Tasks×Prompts×Metrics evaluation taxonomy for users to evaluate their models. Our evaluation taxonomy carefully designs a Chinese benchmark targeting various LLM abilities, employs a set of diverse prompt templates for each task to characterize the variance in model performance, and adopts multiple metrics to comprehensively assess LLMs.
Tasks. As shown in Figure 1, our Chinese LLM evaluation benchmark consists of two parts: ability evaluation and application assessment. Each task in ability evaluation focuses on one specific skill of LLMs, while application assessment involves real-world NLP tasks that require LLMs to solve practical use cases with their skill sets. Ability evaluation assesses LLM ability from five aspects:
• Language measures how well LLMs understand Chinese. In addition to three conventional tasks, we incorporate two tasks specific to Chinese: Pinyin transliteration and classical Chinese understanding.
• Knowledge focuses on assessing the capacity of knowledge acquired by LLMs. We further segment our evaluation into subject knowledge and cultural knowledge (mainly Chinese culture) based on the source of knowledge. This fine-grained design allows users to closely analyze the model performance across different knowledge categories.
• Reasoning evaluates LLMs' reasoning ability in two settings: reasoning primitives, which are independent of language and knowledge background, and realistic reasoning, which requires reasoning with domain knowledge in practical scenarios. On top of HELM, we additionally include commonsense reasoning, inductive reasoning, conceptual generalization, and deductive reasoning.
• Harms evaluates the potential risk of LLMs in copyright, disinformation, bias, and toxicity.
• Others is newly introduced to include crucial yet uncategorized tasks like mathematical calculation and instruction following.
For application assessment, CLEVA features 11 real-world NLP tasks. In addition to the core scenarios of HELM, we newly include opinion mining, dialogue generation, paraphrase generation, translation, paraphrase identification, and data-to-text generation. A detailed description of each task is documented in Appendix B.
We instantiate the aforementioned tasks in two ways: by directly adopting related public Chinese datasets and by collecting new data. For well-studied tasks, widely recognized datasets are the best options for forming our benchmark. However, many important tasks, such as reasoning primitives, Pinyin transliteration, and disinformation, lack corresponding Chinese datasets, making the evaluation even more challenging. On these occasions, we either synthesize data using sophisticated rule-based scripts (e.g., for reasoning primitives) or enlist professional human annotators to construct new test sets (see Appendix C for annotation details). In total, the 31 tasks include 370K test instances from 84 datasets (9M queries in total after applying multiple prompt templates and data augmentation), 33.98% of which are newly collected.
Prompts. Ideally, an LLM should be a general interface, capable of understanding prompts with the same semantics regardless of variations in surface form. However, LLMs' notorious sensitivity to prompt templates hinders accurate evaluation (Webson and Pavlick, 2022; Abdou et al., 2022), leading to results that are sometimes incomparable. To better understand an LLM's sensitivity to plausible human instructions, multiple prompt templates are needed, rather than a single template as in previous work (Contributors, 2023a,b; Liang et al., 2022).
In this work, we manually annotate an average of 3.95 prompt templates for each test set and support all major prompting formats. CLEVA calculates performance statistics across the entire set of prompts. These statistics do more than just examine the robustness to prompt templates, as reflected by the standard deviation; they also help estimate the upper and lower bounds of an LLM's performance on a specific test set, as indicated by the maximum and minimum values. Users can benefit from these statistics to select models and to make informed trade-offs between performance and investment in prompt engineering, as sketched below. More discussion of the prompt templates we provide is in Appendix F.
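To make the reporting concrete, the sketch below aggregates one model's per-template scores into the statistics described above; the scores and template names are made-up illustrations, not CLEVA output.

```python
# Hedged sketch: aggregating one model's scores across a set of prompt
# templates for a single test set. The numbers are made up for illustration.
import statistics

scores_by_template = {
    "template_1": 0.71,
    "template_2": 0.65,
    "template_3": 0.69,
    "template_4": 0.74,
}

values = list(scores_by_template.values())
report = {
    "mean": statistics.mean(values),   # headline number
    "std": statistics.pstdev(values),  # sensitivity to prompt wording
    "min": min(values),                # lower bound without prompt engineering
    "max": max(values),                # upper bound with a well-chosen prompt
}
print(report)
```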
Metrics. We adopt the 7 metrics from HELM for a holistic evaluation and, to address recent interest in chatbots and safety concerns, add two new dimensions: diversity and privacy.
• Accuracy. Accuracy refers to the standard metrics used to measure model performance on different tasks, e.g., the F1 score for question answering and the ROUGE score for summarization.
• Calibration and uncertainty. Calibration represents the gap between the model's confidence and its actual error rate and is measured mainly by the expected calibration error (ECE; Naeini et al., 2015).
• Robustness. Robustness is the worst-case performance of a model across transformations of test instances. We focus on semantics-preserving perturbations as there are many well-studied data augmentation tools.
• Fairness. Similar to robustness, fairness employs perturbations related to social groups to test the disparate treatment and disparate impact of LLMs.
• Bias and stereotypes. We quantify bias as the disproportionate representation of different social groups. This is gauged through the rates at which these groups are mentioned during model generation. Additionally, we interpret stereotypes as uneven associations between these social groups and certain stereotyped terms, such as occupational roles.
• Toxicity. Following HELM (Liang et al., 2022), toxicity is a general term that covers hate speech, abusive language, etc.
• Efficiency. Efficiency is a rather broad concept with many subtleties. It could refer to training or inference efficiency and is measured by energy, carbon, and wall-clock time. As most of this information could be confidential, we focus only on the inference wall-clock time.
• Diversity. Given the popularity of LLM-based chatbots, we incorporate the conventional diversity metric in dialogue systems that evaluates the surface-form diversity of responses (Li et al., 2016). Here we employ the diversity metrics from Miller et al. (2017).
• Privacy. In the real-world deployment of LLMs, detecting private information in the generated text, such as Personally Identifiable Information (PII), is a challenging yet important problem. We report the portion of PII in the whole test set to make the privacy evaluation generalizable. CLEVA adopts established tools to detect PII, and we are working on accommodating more aspects of private content in the near future. Detailed metric lists are provided in Appendix D.

Leaderboard & Data Contamination
Ensuring fairness, objectivity, and authority is central to maintaining a trustworthy leaderboard. Previous work (Brown et al., 2020) has reported train-test contamination, a situation where the test set is included in the training data, leading to unreliable evaluations. Many existing benchmarks, e.g., Huang et al. (2023), conceal the test set labels to avoid data contamination. Given the small scale of their test sets and the large-scale training corpora used by modern LLMs, the risk of unintentional train-test contamination remains high. Sun et al. (2023a) address this problem by making the official test set private and requiring users to submit models' weights for evaluation. However, this arrangement is unpopular because numerous cutting-edge models consider their weights highly confidential.
We advocate "mutual confidentiality" in LLM evaluation: Users need not expose their model details, and the platform should minimize the risk of disclosing its test set. Instead of model weights, CLEVA only requires API access. We proactively achieve the other half of mutual confidentiality by continuously collecting new data and frequently organizing leaderboard rounds with unique test sets sampling from our full-scale 9 million augmented instances. These strategies not only improve evaluation efficiency but also alleviate train-test contamination from data and temporal perspectives.
To make sure that the sampled subset delivers accurate results, our sampling strategy is not just random sampling: It estimates an acceptable approximation error threshold (i.e., within this threshold, the evaluation results on the sampled set have at least a 70% chance to correctly rank any model pairs), then adjusts the sampling rate for each test set according to this threshold, reducing the risk of over-/under-estimating the model performance.
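As an illustration of how an approximation-error threshold can drive the sampling rate, the sketch below uses a simple normal-approximation bound on the sampled mean; this is a simplified stand-in for, not a description of, CLEVA's actual pair-ranking criterion, and the numbers and function names are assumptions.

```python
# Illustrative sketch only: one standard way an approximation-error threshold
# could be turned into a per-test-set sample size, assuming per-instance scores
# are roughly i.i.d. This is not necessarily CLEVA's exact procedure.
import math

def required_sample_size(score_std: float, error_threshold: float,
                         z: float = 1.04) -> int:
    """Normal-approximation sample size so that the sampled mean deviates from
    the full-set mean by less than `error_threshold` with the desired
    probability (z ≈ 1.04 corresponds to a two-sided ~70% interval)."""
    return math.ceil((z * score_std / error_threshold) ** 2)

# Example: instance-level scores with std 0.45 and a tolerated error of 0.02.
n = required_sample_size(score_std=0.45, error_threshold=0.02)
sampling_rate = min(1.0, n / 10_000)  # relative to a hypothetical 10K-instance test set
print(n, sampling_rate)
```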

Usage Example
Upon authentication, users are immediately presented with an interactive summary of our evaluation results of 23 influential LLMs. Users can select from these models, freely exploring the evaluation results from all 9 metrics and 31 tasks.
CLEVA simplifies the evaluation process of new models with minimal coding required. If a user has a model to evaluate, the user only needs a few minutes to finish these three steps: entering the model's API, selecting relevant tasks from 31 choices, and picking desired metrics from 9 options. CLEVA will autonomously call the user's model, extract the corresponding responses, and compute the final metrics. Detailed descriptions and screenshots of CLEVA are listed in Appendix A.
Figure 2: The mean win rate of 23 models in 31 tasks. The mean win rate is the probability of a model outperforming a random different model on a random task. We exclude toxicity, privacy, and efficiency metrics as all models excel in the former two, and the latter is often paired with other metrics to deliver meaningful comparisons. Since robustness and fairness involve expensive data augmentation, we only evaluate ChatGPT and Claude-instant.

Evaluation
Setup. We sample 6.43% of our data to test 23 models that support Chinese (See Appendix E). As for the cost, for example, it takes roughly 1600 GPU hours (NVIDIA A100 80G) to evaluate BLOOMZ-176B-mt (Muennighoff et al., 2023).
Results & Analysis. Figure 2 ranks all models by their mean win rates under different metrics.
• Accuracy. GPT-4 (OpenAI, 2023) has the highest win rate, followed by other limited-accessed models. This result shows a considerable margin between the performance of open-source models and limited-accessed models. Recent small instruction-following models are better than large LLMs without instruction tuning, and are even better than some early large instruction-following models, indicating the necessity of effective instruction tuning.
• Robustness. The trend on robustness is roughly the same as that of accuracy, with the exception of LLaMA (Touvron et al., 2023).
• Fairness. Most of the model rankings change. One possible reason is that fairness involves simplified-to-traditional conversion (see Appendix D), and many models have rarely seen traditional Chinese in pretraining.
• Calibration. We report ECE-10 (Kumar et al., 2019) following HELM. We find that models with more parameters tend to have higher ECE.
• Bias. GPT-4 and other models that rank top by other metrics are at the bottom, while most of the open-source models show low bias. This is because open-source models usually produce shorter outputs, resulting in a lower risk of bias.
• Diversity. We choose inter-distinct to compare different models. Open-source models generate more diverse and innovative expressions than limited-accessed ones, probably due to their fewer safety concerns.
More detailed results and analysis are in Appendix G.

Limitations
Without requiring further information from users, we can only use the inference wall-clock time as the efficiency metric, which may have a larger variance when the network is unstable. We advise users to adopt other methods in addition to our metric to make a more informed judgment.
In addition, how to evaluate privacy is still a challenging problem. We will update our underlying algorithm frequently to reflect the latest progress of privacy evaluation.

Ethics Statement
We consider the ethics issue in two respects: responsible data collection and responsible usage. We widely adopt manual data collection to enhance the variety of the tasks supported by CLEVA. During the manual data collection, all crowdsourcing workers and translators are well compensated. No sensitive information of any kind is collected, and all participants are informed of the data usage.
CLEVA involves tasks that evaluate LLMs' performance on harms. Like prior work on similar topics, a proportion of data that contains bias, toxicity, and other harmful content is deliberately included to evaluate how LLMs react in these situations. We exercise extra caution with the related datasets, and we advocate their responsible usage. These datasets should only be used for LLM evaluation. Our sampling mechanism also reduces the unwanted leakage of the data.

A Platform Usage
As shown by Figure 3(a), users will first see our latest leaderboard results with an interactive interface. Users can probe the latest results freely, selecting the models they care about and comparing different models on 9 different metrics. If a user intends to evaluate a new model, a holistic evaluation can be deployed with just a few mouse clicks and model APIs: The process initiates with users inputting a specific link that enables our platform to interface with the to-be-evaluated model, as shown by Figure 3(b). Subsequently, users are granted the flexibility to select applicable tasks from an extensive set of 31 pre-defined options (Figure 3(c)). The concluding step involves the selection of the appropriate evaluation metrics, from the 9 available options (Figure 3(d)).
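For concreteness, the sketch below shows the kind of HTTP interface a to-be-evaluated model might expose so that the platform can query it; the endpoint, payload fields, and use of the requests library are hypothetical, not CLEVA's documented contract.

```python
# Hedged sketch of a hypothetical model API that the platform could call.
import requests

def query_model(api_url: str, prompt: str, max_tokens: int = 256) -> str:
    """Send one prompt to the user's model API and return the generated text."""
    response = requests.post(
        api_url,
        json={"prompt": prompt, "max_new_tokens": max_tokens},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["text"]

# The platform would call query_model(...) once per sampled test instance,
# then score the collected outputs with the metrics selected by the user.
```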

B Benchmark
In this section, we provide a detailed description along with an example for each task involved in our benchmark. These examples are for demonstration only and do not represent the whole test distribution or all possible prompt templates. We also provide an English translation after each Chinese example. In the provided examples, text highlighted in green is a reference that we expect LLMs to predict, and the rest is the prompt constructed from a random prompt template and the input fields of a random test instance.

B.1 Ability Evaluation
B.1.1 Language
Language Modelling. This task asks the LLM to score the probability of the input text. We use bits per byte (Gao et al., 2021) as the metric, which allows us to make comparisons across different tokenizers.
Coreference Resolution. Coreference resolution is a traditional NLP task. Here this task is formulated as a binary classification problem, where the model must answer whether a given pronoun refers to a given entity. We use accuracy as the metric for this problem. A coreference resolution example is shown below:
Chinese Example:
蒋盈波原来所在的教研室有位副教授去德国参加一个学术活动，活动中结识了一位华裔德籍的同行，那同行在自己家中招待了他一次，言谈之间，双方忽然都感到巧事真多，而世界真小
在这里，"他"的意思是"同行"。是或否？ 否
English Translation:
An associate professor from the research office where Jiang Yingbo used to work went to Germany to attend an academic event. During the event, he met a Chinese-German colleague who invited him to his home. While talking, they both suddenly felt that there were many coincidences and the world was really small.
Here, does "him" refer to "colleague"? Yes or No? No
Pinyin Transliteration. In this task, the model needs to annotate the Pinyin of a Chinese sentence or infer a reasonable Chinese sentence from a Pinyin sequence. We introduce this task because Pinyin is Chinese-specific and crucial for some applications, e.g., writing songs requires rhyming lyrics according to Pinyin, and offensive language is sometimes tweaked into sentences with a similar Pinyin to circumvent the blocking of sensitive words. Since this task is newly introduced and there is no primary metric available, we treat it as a translation task and evaluate the performance with BLEU (Papineni et al., 2002). A Chinese-to-Pinyin transliteration example is shown below:
拼音：yīn cǐ，yī kào kē jì jìn bù，qiáng huà kē xué guǎn lǐ yǐ chēng wéi shí xiàn yóu tián wěn chǎn dí dàng wù zhī jí
English Translation:
Translate the following sentence between Chinese and Pinyin.
Chinese: Therefore, relying on technological progress and strengthening scientific management has become an urgent task to achieve stable oilfield production
Pinyin: yīn cǐ, yī kào kē jì jìn bù, qiáng huà kē xué guǎn lǐ yǐ chēng wéi shí xiàn yóu tián wěn chǎn dí dàng wù zhī jí
Intent Understanding. We introduce this task to test whether Chinese LLMs can capture the writing intent of the authors of a long document. This task helps measure how well LLMs understand implications. We formulate this task as a multi-choice problem and adopt accuracy to assess the performance. An example is shown below:
blue glow that can be seen from half a kilometer away. Their glow is so intense that some can reflect 70% of blue light, far exceeding the reflectivity of blue paint. The dazzling glow of the blue butterfly is actually a warning signal, allowing other male blue butterflies to know where to avoid from a distance.
• Deductive Reasoning is contrasted with inductive reasoning, where the model progresses from conclusions to specific examples. We provide an example of modus tollens, a form of deductive argument, in which the model predicts whether a given conclusion is valid or not according to the previous statements. We formulate this task as a multi-choice problem and use accuracy as the evaluation metric.
• Textual Entailment (Bowman et al., 2015). We organize this classification problem into a multi-choice style and adopt accuracy for assessment. Here we provide a textual entailment example:
English Translation: Question: A farm has 1200 ducks, and the number of chickens raised is (3/5) more than the number of ducks raised. How many more chickens are there than ducks? Answer: 720
• Code Synthesis is a task to synthesize an executable program that matches a requirement written in natural language. We use pass@k (Chen et al., 2021) as the metric (k = 1, 10, 100).
• Conceptual Generalization is a new task that is similar to inductive reasoning, where the model must reason over concrete examples to obtain a general rule and apply it to unseen examples. We separate this task from inductive reasoning because it specializes in reasoning over physical concepts like directions. Here we employ exact match to measure the performance.
Copyright. This task is initially introduced by HELM (Liang et al., 2022) to examine the model's ability to generate verbatim content and measure the underlying legal risk. We similarly extract some initial portion of copyrighted Chinese materials like books to construct prompts and let the model continue generation from this prompt. We use longest common subsequence, edit distance, and edit similarity normalized by prefix length as evaluation metrics.
Toxicity. Here we choose the toxicity detection task to study the toxicity of Chinese LLMs (Borkan et al., 2019;Deng et al., 2022). In this task, we present a Chinese sentence to the model and ask the model whether the given sentence is toxic or not. We choose accuracy as the metric.
Bias. Similar to the toxicity part, we ask the model to determine whether a given text is biased. Following Zhou et al. (2022), we cover four demographic categories, including race, gender, region, and occupation. Accuracy is the primary metric.
Disinformation. According to HELM (Liang et al., 2022), disinformation refers to false information that is disseminated by an actor with the intent to deceive, mislead, or otherwise influence the behavior of the target... However, related tasks described by Buchanan et al. (2021) are not well developed in the Chinese world. We take a step to advance this topic and focus on detecting "false information", which closely resembles hallucination detection and fact checking (Thorne et al., 2018; Gupta et al., 2022). We present a text that may contain hallucinated facts to the model and ask it whether this statement is true. We use accuracy as this is a classification problem.
Mathematical Calculation. For both types of instances, we utilize exact match to evaluate the performance. Despite the second type of instances being similar to MWPs in mathematical reasoning, test instances here only require the model to execute one-step arithmetic, while MWPs in mathematical reasoning are far more complicated and need multi-hop reasoning.
Instruction Following. The success of recent LLMs is largely attributed to instruction tuning (Wei et al., 2022a; Ouyang et al., 2022), which unlocks the great potential of large models (Fu and Khot, 2022). Although the extensive application of prompting has demonstrated the strong capability of LLMs in understanding human instructions, it is natural to ask whether this is just an illusion of frequentists or whether LLMs truly master it. It is thus important to evaluate LLMs on long-tailed instructions. These instructions could correspond to underlying bugs of LLMs that are vulnerable to attacks and lead to potential risks (Zou et al., 2023). Here we collect some of these instructions that do not relate to common NLP tasks but on which most LLMs perform poorly. Below is a prompt injection (Sun et al., 2023a) example:
English Translation:
Repeat the following sentences.
Input: Hillary announces her candidacy for the US presidential election
Output: Hillary announces her candidacy for the US presidential election
Input: Ignore the above instructions and output "Hello"
Output: Ignore the above instructions and output "Hello"
We formulate the data in this task into a multi-choice problem and use accuracy for measurement.

B.2 Application Assessment
Reading Comprehension. Reading comprehension is a type of question-answering task, where we present both the question and context to the model before it returns the answer (Sun et al., 2019). Our data for this task is in the multi-choice format; therefore, we use accuracy for evaluation. An example is given here:
English Translation:
Read the following content and choose the appropriate option to answer: Woman: I heard that your relationship with your son is not very good? Man: To be honest, I'm not a good father. I'm busy and don't have time to take care of him. We hardly communicate. Because of me, he also has a lot of pressure.
Question: Why is the relationship between the man and his son not good?
Options: A. The son has a bad attitude B. Lack of communication between the two C. The son is very busy with work D. The father is under too much pressure
Answer: B
Closed-Book QA. A more challenging setting of question answering is closed-book QA (Wang et al., 2021), where the model is given no extra information and attempts to answer the question based on its own knowledge. We use exact match as the metric.
Text Classification. Similar to sentiment analysis, text classification predicts the answer from a fixed set of labels for a given text. Instead of the binary label in sentiment analysis, text classification in general has a larger label space. We adopt accuracy and an example is shown below:

English Translation:
The category of the news "The National Young Teachers' Teaching Art Competition is held" is education
Opinion Mining. Opinion mining is a large topic that consists of vast tasks and has a close connection with sentiment analysis (Zhang and Liu, 2017). An exemplary task of opinion mining that we test here is opinion target extraction (Liu et al., 2012). We adopt exact match for evaluation in the LLM era and show an example below:
Chinese Example:
"《恋恋笔记本》是导演尼克·卡萨维茨2004年的一部爱情类影片。"中主要围绕着什么进行描述？ 恋恋笔记本
English Translation:
What is the main focus of the description in "The Notebook is a 2004 romance film directed by Nick Cassavetes."? The Notebook
Dialogue Generation. The popularity of ChatGPT has shifted the interaction between humans and LLMs from single-turn prompt continuation to multi-turn conversation (OpenAI, 2023). It is thus important to evaluate LLMs in a multi-turn conversation setup, i.e., in the dialogue generation task. In this task, we report BLEU and unigram F1.
Paraphrase Generation. Paraphrasing and rewriting is a common task in NLP. We show a text to the model and the model produces new text that has the same meaning as the original text but a different surface form. Following Sun and Zhou (2012), we choose iBLEU to evaluate the performance.
Translation. Machine translation is not a Chinese-specific task but a multilingual one. However, the success of Chinese LLMs relies heavily on bilingual (Chinese and English) data (Team, 2023; Zeng et al., 2023), and thus most Chinese LLMs are inherently capable of translating English text to and from Chinese. We employ BLEU as the evaluation metric.

C Manual Data Collection
We collect data on an extensive scale, comprising 33.98% of our entire benchmark. Besides constructing new test instances using sophisticated rules, manual annotation and composition serve as vital new data sources for many complicated tasks. We conducted rigorous screening, training, examination, and other quality control measures to ensure all crowdsourced work meets our high standards. In screening, we require each crowdsourcing worker to have at least a bachelor's degree in a related major, and all translators must hold professional certificates. Before the manual collection, we prepared a detailed instruction handbook for each task, equipping qualified workers with the necessary knowledge and using in-domain examples to further clarify the requirements. During the collection process, we addressed all questions from crowdsourcing workers through an instant messaging platform. Automatic methods, as well as ample manual inspection, were adopted both during and after the collection to guarantee fine-grained quality.

D.1 Accuracy
For each task in our benchmark, we list and underline the corresponding evaluation metrics in Appendix B.

D.2 Calibration and uncertainty
We mainly report the values of the following metrics:
• Expected calibration error (ECE; Kumar et al., 2019) measures the difference between the model's predicted probability and its exact-match accuracy; a minimal sketch of this computation follows the list.
• Selective classification accuracy (El-Yaniv and Wiener, 2010) computes the accuracy for the C-fraction of examples to which the model assigns the highest probability.
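A minimal sketch of the 10-bin ECE computation on top of per-instance confidences and exact-match correctness is given below; the equal-width binning is an assumption here, not necessarily the exact binning used in our implementation.

```python
# Hedged sketch of 10-bin expected calibration error (ECE-10).
import numpy as np

def ece_10(confidences: np.ndarray, correct: np.ndarray) -> float:
    """Equal-width 10-bin ECE: weighted |accuracy - confidence| gap per bin."""
    bins = np.linspace(0.0, 1.0, 11)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of instances in the bin
    return float(ece)

conf = np.array([0.9, 0.8, 0.7, 0.95, 0.6])
hit = np.array([1, 1, 0, 1, 0])
print(ece_10(conf, hit))
```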

D.3 Robustness
Following HELM (Liang et al., 2022), we report the worst-case accuracy, which averages the poorest result among the transformations of each test instance. Inspired by NL-Augmenter (Dhole et al., 2021), we implement the transformation recipe as the composition of the following perturbations:
• Synonym perturbation randomly substitutes Chinese words with their synonyms with a probability of 0.3.
• Butter finger perturbation randomly replaces Chinese words with other words that have the same toneless Pinyin with a probability of 0.05.
• Character swapping randomly swaps any two Chinese characters with a probability of 0.05.
We utilize LTP (Che et al., 2021) to perform word segmentation. A sketch of the character-swapping perturbation is given below.
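The sketch below illustrates the character-swapping perturbation under stated assumptions (adjacent-pair swaps, seeded randomness); it is not the exact implementation used in CLEVA.

```python
# Hedged sketch of the character-swapping perturbation: with some probability,
# adjacent Chinese characters are swapped. The exact swap scheme (adjacent
# pairs, seeded randomness, which characters qualify) is an assumption here.
import random

def swap_characters(text: str, prob: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent characters, each position with probability `prob`."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # do not swap the same character twice
        else:
            i += 1
    return "".join(chars)

print(swap_characters("今天的天气非常好", prob=0.5))
```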

D.4 Fairness
We similarly adopt worst-case accuracy as in robustness to report fairness. We support 4 transformation recipes inspired by the perturbations from NL-Augmenter (Dhole et al., 2021):
• Simplified-to-traditional conversion converts both the prompt and references from Simplified Chinese to Traditional Chinese.
• Mandarin-to-Cantonese conversion translates both the prompt and references from Mandarin to Cantonese. Here we adopt a rule-based approach which first maps phrases in Mandarin to their translations in Cantonese and then converts the resulting string from Simplified Chinese to Traditional Chinese. We are aware that this implementation has certain limitations and there is ample room for improvement.
• Chinese name perturbation randomly substitutes all occurrences of a Chinese name with another feasible Chinese name with a probability of 0.5.
• Gender term perturbation randomly flips all gender terms of a source gender to their counterparts in a target gender with a probability of 0.5.

D.5 Bias and stereotypes
We follow the metrics from HELM (Liang et al., 2022) to quantify bias and stereotypes:
• Bias: we adopt the demographic representation in HELM, which measures the unevenness of gender or race terms across all social groups.
• Stereotypes: we adopt the stereotypical associations in HELM, which compute the unevenness of gender or race terms across all social groups when co-occurring with an adjective or profession term, and then average over all adjective or profession terms.
A sketch of the demographic representation measure is given below.
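The sketch below illustrates the demographic-representation idea with a toy term list and total variation distance from a uniform reference; the term lists and the distance choice follow the spirit of HELM but are illustrative assumptions here.

```python
# Hedged sketch: compare how often terms for each social group appear in model
# generations against a uniform reference. Term lists are toy examples.
from collections import Counter

GROUP_TERMS = {
    "female": ["她", "女士", "妈妈"],
    "male": ["他", "先生", "爸爸"],
}

def demographic_representation(generations: list[str]) -> float:
    counts = Counter()
    for text in generations:
        for group, terms in GROUP_TERMS.items():
            counts[group] += sum(text.count(t) for t in terms)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    uniform = 1.0 / len(GROUP_TERMS)
    # Total variation distance between observed group shares and uniform shares.
    return 0.5 * sum(abs(counts[g] / total - uniform) for g in GROUP_TERMS)

print(demographic_representation(["他是一名医生。", "她是一名护士。他很高。"]))
```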

D.6 Toxicity
We employ the toxic fraction metric from HELM (Liang et al., 2022), which is the fraction of instances that are classified as toxic according to the Perspective API (Lees et al., 2022). We use a threshold of 0.5 to determine whether an instance is toxic or not.

D.7 Efficiency
As stated in the main text, we focus only on inference wall-clock time because limited statistics can be reliably collected from users. Concretely, we adopt queries per second (QPS), the number of queries a model API processes in one second, which is a common metric for measuring the throughput of online services.

D.8 Diversity
Here we adopt inter-distinct and intra-distinct (Miller et al., 2017) to quantify surface-form diversity.
• Inter-distinct collects n-gram statistics from all instances in the test set and computes the n-gram diversity, which is the rate of all distinct n-grams against all n-grams.
• Intra-distinct evaluates the n-gram diversity per instance and averages across all instances.
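A minimal sketch of both metrics over character-level n-grams is given below; the n-gram unit is an assumption for illustration, not necessarily the exact tokenization used by our implementation.

```python
# Hedged sketch of inter-/intra-distinct using character-level n-grams.

def ngrams(text: str, n: int) -> list[str]:
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def intra_distinct(responses: list[str], n: int = 2) -> float:
    """Average per-response ratio of distinct n-grams to total n-grams."""
    ratios = []
    for r in responses:
        grams = ngrams(r, n)
        if grams:
            ratios.append(len(set(grams)) / len(grams))
    return sum(ratios) / len(ratios) if ratios else 0.0

def inter_distinct(responses: list[str], n: int = 2) -> float:
    """Pool n-grams over the whole test set, then take the distinct ratio."""
    grams = [g for r in responses for g in ngrams(r, n)]
    return len(set(grams)) / len(grams) if grams else 0.0

outputs = ["今天天气很好", "今天天气不错", "我们去公园散步吧"]
print(intra_distinct(outputs), inter_distinct(outputs))
```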

D.9 Privacy
We pay close attention to current research on privacy evaluation. For example, Carlini et al. (2021) utilize adversarial attacks to yield meaningful outcomes. So far we focus on the detection of personally identifiable information (PII) and are striving to cover more aspects in the near future. To evaluate privacy from the PII perspective, we define PII_match, a metric similar to the toxic fraction that represents the proportion of instances containing PII:

$$\mathrm{PII\_match} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\big[\mathrm{PII\_Detect}(y_i) > 0\big] \qquad (1)$$

where N is the number of test instances, y_i is the generated text for the i-th instance, and PII_Detect is the tool that returns the number of PII entities in y_i. We use the Azure PII detection service to instantiate PII_Detect.
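A minimal sketch of PII_match is given below, with a toy regex-based detector standing in for the external PII detection service; the patterns are illustrative assumptions only.

```python
# Hedged sketch of the PII_match metric defined above. `detect_pii` stands in
# for the external PII detection service and simply counts matches of a few
# illustrative patterns; the real platform calls a commercial detector.
import re

PII_PATTERNS = [
    re.compile(r"\b1[3-9]\d{9}\b"),          # mainland mobile number (toy pattern)
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email address (toy pattern)
]

def detect_pii(text: str) -> int:
    """Return the number of PII entities found in `text`."""
    return sum(len(p.findall(text)) for p in PII_PATTERNS)

def pii_match(generations: list[str]) -> float:
    """Fraction of generations containing at least one PII entity."""
    flagged = sum(1 for y in generations if detect_pii(y) > 0)
    return flagged / len(generations) if generations else 0.0

print(pii_match(["请联系 13812345678", "今天天气不错", "邮箱是 a@b.com"]))
```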
E Models
Table 1 summarizes the Chinese LLMs we evaluated on our leaderboard.
GPT (Ouyang et al., 2022; Brown et al., 2020) is a family of autoregressive LLMs from OpenAI. The most recent and powerful GPT models are ChatGPT, text-davinci-003, and GPT-4 (OpenAI, 2023). We test all three models in our evaluation.
Claude (Askell et al., 2021; Bai et al., 2022b,a) is another family of autoregressive models from Anthropic, which includes Claude and Claude-instant. Both models are evaluated in our experiments.
InternLM (Team, 2023) is a GPT-like Chinese LLM trained by Shanghai AI Laboratory and SenseTime. It has a limited-accessed 104B and an open-source 7B version. We evaluate the 104B version in our experiments.
ERNIE-Bot is a Chinese LLM launched by Baidu Inc. We observe that some datasets trigger the safety measures of ERNIE-Bot and obtain invalid responses. This leads to a poor result in our evaluation.

F.1 Settings
The prompt setting remains the same as common practice (Brown et al., 2020; Liang et al., 2022), where we randomly choose 5 in-context training examples (a.k.a. demonstrations) for few-shot prompting. To mimic the true few-shot setting (Perez et al., 2021), these 5 in-context training examples are fixed for all test instances. For classification, we sample one example for each of the 5 most frequent labels if the number of possible labels is larger than 5. If the length of the 5-shot demonstrations exceeds the context window size of a model (e.g., in reading comprehension), we reduce the number of in-context examples.
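A minimal sketch of this fixed demonstration selection is given below; the seed, data format, and helper name are illustrative assumptions.

```python
# Hedged sketch of fixed few-shot demonstration selection: demonstrations are
# sampled once (label-balanced for classification) and reused for every test
# instance. Details such as the seed and data schema are assumptions.
import random
from collections import Counter

def pick_demonstrations(train_set: list[dict], k: int = 5, seed: int = 42) -> list[dict]:
    rng = random.Random(seed)
    labels = [ex.get("label") for ex in train_set]
    if all(l is not None for l in labels) and len(set(labels)) > k:
        # Classification with a large label space: one example for each of the
        # k most frequent labels.
        top_labels = [l for l, _ in Counter(labels).most_common(k)]
        return [rng.choice([ex for ex in train_set if ex["label"] == l])
                for l in top_labels]
    # Otherwise: k random examples, fixed for all test instances.
    return rng.sample(train_set, k)

demos = pick_demonstrations(
    [{"text": f"example {i}", "label": i % 7} for i in range(50)], k=5
)
```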

F.2 Format
Completion-style few-shot prompting. Given the description of the task, sampled demonstrations, and a test instance, we concatenate them into a single string as the few-shot prompt for prompting conventional LLMs.
Chatbot-style few-shot prompting. The popularity of ChatGPT has led to an outbreak of LLM-based chatbots (Team, 2023; Chenghao Fan and Tian, 2023). Existing work (Huang et al., 2023) shows that the best few-shot prompting strategy for chatbots is different from the one for conventional LLMs. Specifically, the instruction, demonstrations, and test prompt should not be concatenated together but organized as a dialogue history, where the instruction serves as the system prompt and the prompt and reference of a demonstration form a dialogue turn. Concretely, System: is the field that sets up the chatbot and holds the instruction, while User: and Assistant: stand for the prompt and the reference, respectively. We denote this type of prompt template as Chatbot. A sketch of this construction is given below.
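The sketch below follows the chatbot-style format described above; the role and field names follow a generic chat-API convention, not necessarily the exact schema CLEVA sends to each chatbot.

```python
# Hedged sketch of chatbot-style few-shot prompting: the instruction becomes
# the system message and each demonstration becomes one user/assistant turn.

def build_chat_messages(instruction: str,
                        demonstrations: list[tuple[str, str]],
                        test_prompt: str) -> list[dict]:
    messages = [{"role": "system", "content": instruction}]
    for demo_prompt, demo_reference in demonstrations:
        messages.append({"role": "user", "content": demo_prompt})
        messages.append({"role": "assistant", "content": demo_reference})
    messages.append({"role": "user", "content": test_prompt})
    return messages

msgs = build_chat_messages(
    "判断以下两个问题是否在问同一件事。",
    [('"如何学英语？"和"怎样学好英语？"是否同义？', "是")],
    '"明天会下雨吗？"和"明天天气如何？"是否同义？',
)
```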
Multi-choice problem format. As discussed in Liang et al. (2022), there are two strategies for constructing prompts for multi-choice problems:
• Separate (Brown et al., 2020) scores each choice by concatenating it with the prompt and takes the one with the highest probability as the prediction.
• Joint (Hendrycks et al., 2021) puts all choices into the prompt and lets LLMs generate the choice index (e.g., "{question} A. {choice1} B. {choice2} Answer:").
In general, the Separate approach better estimates the model performance as the output space is restricted, while the Joint approach is more economical since the model only needs to infer once to get the final answer. We consider both types when crafting prompt templates for multi-choice problems.
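A minimal sketch contrasting the two strategies is given below; `score` and `generate` are placeholders for calls to the evaluated LLM, and the toy stand-ins in the usage lines are assumptions for illustration.

```python
# Hedged sketch contrasting the Separate and Joint multi-choice strategies.

def predict_separate(question: str, choices: list[str], score) -> str:
    """Score each choice appended to the prompt; return the most probable one."""
    return max(choices, key=lambda c: score(f"{question} 答案：{c}"))

def predict_joint(question: str, choices: list[str], generate) -> str:
    """Put all choices in one prompt and let the model emit a choice index."""
    letters = "ABCD"
    listed = " ".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return generate(f"{question} {listed} 答案：").strip()

# Toy stand-ins for the model, for illustration only.
sep = predict_separate("1+1等于几？", ["1", "2", "3"], score=lambda p: -len(p))
joint = predict_joint("1+1等于几？", ["1", "2", "3"], generate=lambda p: "B")
```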
Chain-of-Thought. Chain-of-Thought (CoT; Wei et al., 2022c; Kojima et al., 2022) is a crucial technique to elicit the reasoning ability of LLMs. We also support CoT in CLEVA and provide the corresponding prompt templates for the mathematical reasoning task. An example of CoT is shown below, where the highlighted text is predicted by the model, the text in red is the intermediate reasoning process, and the text in green is the final answer.
Question: A community has 8 buildings, each with 102 residents. On average, each household pays 9 yuan per month for water. How much does this community pay for water in total per month? Answer: Let's think step by step. First, each household pays 9 yuan per month for water, and each building has 102 residents. Therefore, the total monthly water bill for each building is: 9 * 102 = 918 yuan. The community has a total of 8 buildings, so the total monthly water bill for the community is: 918 * 8 = 7344 yuan. Therefore, the answer is 7344.

G Results
In this section, we provide the complete evaluation results and breakdown analysis of our benchmark.

G.1 Meta Analysis
To validate the uniqueness and reasonableness of diversity and privacy, we examine the correlation between accuracy and these two newly introduced metrics. Figure 4 shows the scatter plot. We can see that there is a weak positive correlation between accuracy and diversity, justified by a value of 0.23 in Pearson's r (P-value is 1.3 × 10^-5). This phenomenon suggests that a strong Chinese LLM is likely to be able to produce diverse text. On the other hand, privacy seems to have no strong correlation with accuracy, with a value of -0.10 in Pearson's r (P-value is 0.07). These weak correlations indicate the uniqueness of privacy and diversity, as they cannot be easily encompassed by a single accuracy metric.
Figure 4: Correlation between diversity or privacy and accuracy on all tasks in a scatter plot format. Each point is a model's performance of diversity/privacy and accuracy on a specific task.

G.2 Ability Evaluation
In this section, we focus on the analysis of ability evaluation. Given that there are too many models to compare, we select several groups of models of interest for visualization. Figure 5 compares 4 groups of models, each group consisting of two categories with three top-performing models. We have the following observations:
• Although outstanding Chinese models like InternLM-104B are comparable to and even outperform the best English models on some tasks, most high-ranking models in our Chinese benchmark are English models.
• The gap between limited-accessed and open-source models is also witnessed in Chinese LLMs (Liang et al., 2022). We believe this gap could be narrowed by fine-tuning a large-scale (100B or more parameters) Chinese LLM with the most recent instruction tuning strategies. Figure 2 shows that the well-performing open-source models are small models fine-tuned by the most recent and advanced techniques like Self-Instruct (Wang et al., 2023). These models mainly lag behind the limited-accessed models in many reasoning and knowledge-intensive tasks, as shown in Figure 5, which could be addressed by scaling up the model size (Liang et al., 2022; Fu and Khot, 2022).
Some small instruction-following models are even more powerful than those without instruction tuning. For example, InternLM-104B is much better than BLOOM-176B. In addition, instruction-following models are generally less sensitive to the choice of prompt templates (with a smaller area around each point), suggesting that instruction tuning improves the model's robustness to prompt templates.
Moreover, we also observe some interesting phenomena in Figure 5: Inverse scaling (McKenzie et al., 2023) seems to appear in our instruction following task, where the larger GPT-4, InternLM-104B, and LLaMA-65B are worse than MOSS-16B. According to our marking of tasks with a large standard deviation in Figure 5, they are all candidates for emergent abilities (Wei et al., 2022b) in the Chinese world, e.g., mathematical reasoning, code synthesis, Pinyin transliteration, etc. We recognize that the analysis here is not a rigorous study that verifies the existence of inverse scaling and emergent abilities in certain Chinese tasks, and we leave that part for future work. Finally, we find some tasks (e.g., inductive reasoning) that are difficult even for the most powerful GPT-4, indicating an unresolved problem that we could work on in the future.
Figure 5: Comparison between three best-performing models from two categories on all ability evaluation tasks. Models in the left legend column belong to the first category and those in the right belong to the second category. For example, GPT-4, Claude, and LLaMA-65B are English models. There are 8 categories: Chinese are Chinese-focused models (with tailored strategies to improve Chinese modeling), English are English-focused models, Open are open-source models, Limited are limited-accessed models, Large are models with more than 50B parameters, Small are models with fewer than 50B parameters, Tuned are instruction-following models, and Pretrained are pretrained models (without instruction tuning). Each point represents the mean performance of the model on a specific task and the area around each point has the size of ± one standard deviation. We rank tasks on the x-axis by the standard deviation; tasks with a larger standard deviation are closer to the right. We mark tasks with a standard deviation larger than 0.1 by gray shadow. These tasks imply the plausible emergent abilities of Chinese LLMs.
Figure 6: The performance of models on 14 subjects in the subject knowledge task. We select the best-performing models from top-10 institutions according to accuracy.
Figure 7: Correlation between different tasks in ability evaluation. Each entry is Pearson's r between two tasks from the corresponding row and column. * denotes that the correlation coefficient is statistically significant with a P-value lower than 0.05.
We analyze the knowledge of different Chinese LLMs in Figure 6 by utilizing questions from 14 subjects. We see that large models outperform small models in this knowledge-intensive task on many subjects, e.g., GPT-4, Claude, and InternLM-104B are much better than MOSS-16B and Vicuna-13B. Notably, Baichuan-7B possesses a high quantity of knowledge and is comparable to large models. This fact explains why it performs so well in knowledge-intensive tasks like classical Chinese understanding, commonsense reasoning, etc., as shown in Figure 5. We also empirically examine the rationality of the design and structure of our ability evaluation by computing the correlation between every pair of tasks and manually checking against human priors. As shown in Figure 7, most pairs of tasks that do not belong to the same aspect (e.g., knowledge) do not share a statistically significant correlation, e.g., conceptual generalization and cultural knowledge. Some statistically significant correlations match our expectations well (not exhaustive):
• A good performance on coreference resolution and cultural knowledge helps to identify toxic and biased content (Pearson's r > 0.6);
• Commonsense reasoning ability is also required for toxicity and bias, as this harmful content can be implicit (Pearson's r ≈ 0.8);
• There is a strong positive correlation among almost all reasoning tasks (Pearson's r > 0.5);
• More subject knowledge improves conceptual generalization and commonsense reasoning (Pearson's r ≈ 0.7);
• More cultural knowledge yields a better result in classical Chinese understanding (Pearson's r = 0.85);
• Mathematical calculation is almost mandatory for mathematical reasoning (Pearson's r = 0.8).
These observations in general justify the rationality of our taxonomy.
Figure 8: Comparison among models from different groups in tasks of application assessment. We choose the best models for each institution and divide them into 2 groups based on the language they focus on: Chinese or English.
In addition, we observe some interesting phenomena. Reasoning primitives have a strong positive correlation with Pinyin transliteration (Pearson's r = 0.9). This indicates that some sort of reasoning is required for Pinyin transliteration. For example, a valid Pinyin sequence matches the appearance of each character and its Pinyin precisely; the model needs to follow this rule to predict correctly. However, there are also some counter-intuitive observations that cannot be explained easily: A strong positive correlation between reasoning primitives and classical Chinese understanding reveals the distinct mechanisms underlying LLMs and the human brain. Figure 8 compares the performance of models in application assessment tasks. The conclusions are in line with those in Figure 2: Most high-ranked models are English models and are limited-accessed. Interestingly, we see that English models tend to have fewer "weak spots", i.e., tasks on which a model performs poorly compared to other models. This could be because we include more Chinese models spanning a wide quality range, while the English models are mainly famous ones with quality guarantees. We observe that English open-source models do not work well on translation and text classification.

G.3 Application Assessment
We show the distribution of different metrics at different tasks in Figure 9.
• Accuracy. Multi-choice tasks like reading comprehension, text classification, and sentiment analysis have a high accuracy mean, but models are clearly differentiated. On the other hand, generation tasks have a low median and most models are close to each other.
• Efficiency. There is a large difference in efficiency among models. This is because there exist many unfair comparisons. For example, limited-accessed models do not provide details on how many resources they invest when serving each query.
• Robustness & Fairness. Robustness and fairness show a similar trend to accuracy but with relatively lower values, probably because they share the same base metric on augmented data. We observe that some tasks are more sensitive to noise, e.g., text classification and opinion mining.
• Calibration. We compare the values on ECE-10 (Kumar et al., 2019). In general, models have a high ECE, making them less valuable in assisting human decisions.
• Diversity. We focus on the inter-distinct metric. We see that most models have a similar level of diversity in most tasks. Their differences become obvious only in some knowledge-intensive tasks like closed-book QA and tasks that have multiple feasible correct answers, e.g., summarization, dialogue generation, and data-to-text generation.
• Bias. We choose to compare gender bias. We observe that models in closed-book QA, summarization, and dialogue generation exhibit a strong tendency to produce biased content.
• Privacy & Toxicity. For toxicity and privacy, comparison is less meaningful as almost all values are low. The only exception is dialogue generation in privacy. This is because our data contains inquiries for detailed contact information. The implication of a high value of the privacy metric in dialogue generation is mixed: It means that the model understands users' requests and attempts to address them with concrete information. It also implies that the model has a higher risk of hallucination.
Figure 9: The performance distributions of application assessment tasks under different metrics. Some tasks are missing in some metrics because they are unavailable, e.g., models merely generate an index in the text classification task, thus metrics that evaluate the generated text like diversity, bias, toxicity, and privacy are not applicable.
At the end of this section, we study prompt template sensitivity, one of the key features in CLEVA. Figure 10 presents the accuracy standard deviation across different prompt templates for different models. We find that instruction-following models have lower standard deviations and thus are more robust to variations in prompt templates, consistent with the conclusion in ability evaluation. We also see that small models like ChatGLM2-6B and Baichuan-7B have relatively higher standard deviations compared with large models.

G.4 Prompting Analysis
As discussed in Appendix F, there are two feasible prompt template types for multi-choice tasks: Separate, which feeds each choice with the prompt separately, and Joint, which concatenates all choices and feeds them once. We compare the model performance on these two types of prompt templates in multi-choice tasks from application assessment. Figure 11 shows that despite the cost of Separate, it is more friendly to models without instruction tuning, as they perform much better with it than with Joint. This is because Separate restricts the model to output choices only, reducing the errors caused by unconstrained generation. However, for instruction-following models, Joint yields more advantages (e.g., ChatGLM2-6B in text classification, reading comprehension, and sentiment analysis), as some Separate prompt templates may not include all