CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark

Artificial Intelligence (AI), along with recent progress in biomedical language understanding, is gradually changing medical practice. With the development of biomedical language understanding benchmarks, AI applications are widely used in the medical field. However, most benchmarks are limited to English, which makes it challenging to replicate many of the successes achieved in English for other languages. To facilitate research in this direction, we collect real-world biomedical data and present the first Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark: a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification, together with an associated online platform for model evaluation, comparison, and analysis. To establish evaluation on these tasks, we report empirical results with 11 current pre-trained Chinese models; the experimental results show that state-of-the-art neural models perform far worse than the human ceiling. Our benchmark is available online (https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414&lang=en-us).


Introduction
Artificial intelligence is gradually changing the landscape of healthcare and biomedical research (Yu et al., 2018). With the fast advancement of biomedical datasets, biomedical natural language processing (BioNLP) has facilitated a broad range of applications. A key driving force behind such improvements and rapid iterations of models is the use of general evaluation datasets and benchmarks (Gijsbers et al., 2019). Pioneering benchmarks, such as BLURB (Gu et al., 2020), PubMedQA (Jin et al., 2019), and others, have provided the opportunity to conduct research on biomedical language understanding and to develop real-world applications. Unfortunately, most of these benchmarks are developed in English, which makes the development of the associated machine intelligence Anglo-centric. Meanwhile, other languages, such as Chinese, have unique linguistic characteristics and categories that need to be considered. Even though Chinese speakers account for a quarter of the world population, there is no existing Chinese biomedical language understanding evaluation benchmark.
To address this issue and facilitate natural language processing studies in Chinese, we take the first step by introducing a comprehensive Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark with eight biomedical language understanding tasks. These tasks include named entity recognition, information extraction, clinical diagnosis normalization, short text classification, question answering (in a transfer learning setting), intent classification, and semantic similarity. We evaluate several pre-trained Chinese language models on CBLUE and report their performance. The current models still perform far worse than single-human performance, leaving room for future improvements. We also conduct a comprehensive analysis using case studies to indicate the challenges and linguistic differences in Chinese biomedical language understanding. We intend to develop a universal GLUE-like open platform for the Chinese BioNLP community, and this work helps accelerate research in that direction. Overall, the main contributions of this study are as follows:
• We propose the first Chinese biomedical language understanding benchmark, an open-ended, community-driven project with diverse tasks. The proposed benchmark serves as a platform for the Chinese BioNLP community and encourages new dataset contributions.
• We report a systematic evaluation of 11 Chinese pre-trained language models to understand the challenges posed by these tasks. We release the source code of the baselines as a toolkit for future research purposes.

Related Work
Several benchmarks have been developed to evaluate general language understanding over the past few years. GLUE (Wang et al., 2019b) is one of the first frameworks developed as a formal challenge affording straightforward comparison between task-agnostic transfer learning techniques. SuperGLUE (Wang et al., 2019a), styled after GLUE, introduces a new set of more difficult language understanding datasets. Other similarly motivated benchmarks include DecaNLP (McCann et al., 2018), which recasts a set of target tasks into a general question-answering format and prohibits task-specific parameters, and SentEval (Conneau and Kiela, 2018), which explicitly evaluates fixed-size sentence embeddings. Non-English benchmarks include RussianSuperGLUE (Shavrina et al., 2020) and CLUE (Xu et al., 2020), which is a community-driven benchmark with nine Chinese natural language understanding tasks. These general-domain benchmarks provide a north star goal for researchers and are part of the reason we can confidently say we have made great strides in our field. For BioNLP, many datasets and benchmarks have been proposed (Wang et al., 2020; Li et al., 2016; Wu et al., 2019), which promote biomedical language understanding (Beltagy et al., 2019; Lewis et al., 2020; Lee et al., 2020). Tsatsaronis et al. (2015) propose biomedical language understanding datasets as well as a competition on large-scale biomedical semantic indexing and question answering. Jin et al. (2019) propose PubMedQA, a novel biomedical question answering dataset collected from PubMed abstracts. Pappas et al. (2018) propose BioRead, a publicly available cloze-style biomedical machine reading comprehension (MRC) dataset. Gu et al. (2020) create a leaderboard featuring the Biomedical Language Understanding & Reasoning Benchmark (BLURB). Unlike a general domain corpus, the annotation of a biomedical corpus needs expert intervention and is labor-intensive and time-consuming.
Moreover, most of the benchmarks are based on English; ignoring other languages means that potentially valuable information, which could help generalization, may be lost.
In this study, we focus on Chinese to fill the gap and aim to develop the first Chinese biomedical language understanding benchmark. Note that Chinese biomedical text is linguistically different from English and has its own domain characteristics, necessitating a BioNLP evaluation benchmark designed explicitly for Chinese.

Design Principle
CBLUE consists of 8 biomedical language understanding tasks. The task descriptions and statistics of CBLUE are shown in Table 1. Unlike CLUE (Xu et al., 2020), as shown in Table 2, CBLUE has more diverse data sources (for which annotation is expensive) and richer task settings, and is thus more challenging for NLP models. We introduce the design principles of CBLUE as follows:
1) Diverse tasks: CBLUE contains a wide range of token-level, sequence-level, and sequence-pair tasks.
2) Variety of differently distributed data: CBLUE collects data from various sources, including clinical trials, EHRs, medical forums, textbooks, and search engine logs with a real-world distribution.
3) Quality control in long-term maintenance: We asked domain experts (doctors from Class A tertiary hospitals) to annotate datasets and carefully review data to ensure data quality.
Table 2: Differences between CBLUE, CLUE, and BLURB. There are three major differences: a) CBLUE has a much more diverse task setting with different data sources in the biomedical domain, including clinical trials, EHRs, medical forums, textbooks, and search engine logs; b) CBLUE has a long-tailed distribution, which is challenging; c) CBLUE contains a specific transfer learning scenario supported by the CHIP-STS dataset, in which the testing set has a different distribution from the training set.

Tasks
CMeEE For this task, the dataset was first released in CHIP2020 2 . Given a pre-defined schema, the task is to identify and extract entities from the given sentence and classify them into nine categories: disease, clinical manifestations, drugs, medical equipment, medical procedures, body, medical examinations, microorganisms, and department.
CMeIE For this task, the dataset was also released in CHIP2020 (Guan et al., 2020). The goal of the task is to identify both entities and relations in a sentence following the schema constraints. There are 53 relations defined in the dataset, including 10 synonymous sub-relationships and 43 other sub-relationships.
CHIP-CDN For this task, the dataset is used to standardize the terms from the final diagnoses of Chinese electronic medical records. Given the original phrase, the task is to normalize it to standard terminology based on the International Classification of Diseases (ICD-10), Beijing Clinical Edition v601.
2 http://cips-chip.org.cn/
CHIP-CTC For this task, the dataset is used to classify clinical trial eligibility criteria, which are fundamental guidelines defined to identify whether a subject is eligible for a clinical trial (Zong et al., 2021). All text data are collected from the website of the Chinese Clinical Trial Registry (ChiCTR) 3 , and a total of 44 categories are defined. The task is a form of text classification; although it is not a new task, studies and corpora for Chinese clinical trial criteria are still limited, and we hope to promote future research for social benefit.
CHIP-STS For this task, the dataset is for sentence similarity in the non-i.i.d. (non-independent and identically distributed) setting. Specifically, the task aims to evaluate the generalization ability between disease types on Chinese disease question-and-answer data. Given question pairs related to 5 different diseases (the disease types in the training and testing sets are different), the task is to determine whether the semantics of the two sentences are similar.
KUAKE-QIC For this task, the dataset is for intent classification. Given search engine queries, the task is to classify each of them into one of 11 medical intent categories defined in KUAKE-QIC. Those include diagnosis, etiology analysis, treatment plan, medical advice, test result analysis and others.
KUAKE-QTR For this task, the dataset is used to estimate the relevance between a query and the title of a document. Given a query (e.g., "Symptoms of vitamin B deficiency"), the task aims to find the relevant title (e.g., "The main manifestations of vitamin B deficiency").
KUAKE-QQR For this task, the dataset is used to evaluate the relevance of the content expressed in two queries. Similar to KUAKE-QTR, the task aims to estimate query-query relevance, which is an essential and challenging task in real-world search engines.

Data Collection
Since machine learning models are mostly data-driven, data plays a critical role, and it often takes the form of a static dataset (Gebru et al., 2018). We collect data for different tasks from diverse sources, including clinical trials, EHRs, medical books, and search logs from real-world search engines. As biomedical data may contain private information such as the patient's name, age, and gender, all collected datasets are anonymized and reviewed by the IRB committee of each data provider to preserve privacy. We describe the data collection details below.

Collection from Clinical Trials
Clinical trial eligibility criteria text is collected from ChiCTR, a non-profit organization that provides information about clinical trial registration for public research use. In each trial registry file, eligibility criteria text is organized as a paragraph in the inclusion criteria and exclusion criteria. Some meaningless texts are excluded, and the remaining texts are annotated to generate the CHIP-CTC dataset.

Collection from EHRs
We obtain the final diagnoses of the medical records from several Class A tertiary hospitals and sample a few diagnosis items from different medical departments to construct the CHIP-CDN dataset for research purposes. The diagnosis items are randomly sampled from the items that are not covered by the common medical synonym dictionary.
No privacy information is involved in the final diagnoses.

Collection from Medical Forum and Textbooks
Due to the COVID-19 pandemic, online consultation has become more and more popular via the Internet. To promote data diversity, we select the online questions by patients to build the CHIP-STS dataset. Note that most of the questions are chief complaints. To ensure the authority and practicability of the corpus, we also select medical textbooks of Pediatrics (Wang et al., 2018), Clinical Pediatrics (Shen and Gui, 2013) and Clinical Practice 4 . We collect data from these sources to construct the CMeIE and CMeEE datasets.

Collection from Search Engine Logs
We also collect search logs from real-world search engines like the Alibaba KUAKE Search Engine 5 . First, we filter the search queries in the raw search logs by the medical tag to obtain candidate medical texts. Then, we sample the documents for each query with non-zero relevance scores (i.e., to determine if the document is relevant to the query). Specifically, we divide all the documents into three categories, namely high, middle, and tail documents, and then uniformly sample the data to guarantee diversity. We leverage the data from search logs to construct the KUAKE-QIC, KUAKE-QTR, and KUAKE-QQR datasets.
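The bucketed sampling described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used to build the KUAKE datasets: the bucket of each document is assumed to be precomputed, and the function name `sample_per_bucket`, the document identifiers, and the per-bucket quota are all hypothetical.

```python
import random
from collections import defaultdict


def sample_per_bucket(docs, per_bucket, seed=0):
    """Draw up to `per_bucket` documents uniformly at random from each bucket.

    `docs` is a list of (doc_id, bucket) pairs. How a document is assigned to
    the high, middle, or tail bucket is assumed to be decided elsewhere.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    buckets = defaultdict(list)
    for doc_id, bucket in docs:
        buckets[bucket].append(doc_id)
    # Taking the same quota from every bucket keeps small buckets from being
    # swamped by the much larger ones.
    return {
        bucket: rng.sample(ids, min(per_bucket, len(ids)))
        for bucket, ids in buckets.items()
    }


# Hypothetical toy log: 50 high, 20 middle, and 5 tail documents.
docs = (
    [(f"doc{i}", "high") for i in range(50)]
    + [(f"doc{i}", "middle") for i in range(50, 70)]
    + [(f"doc{i}", "tail") for i in range(70, 75)]
)
picked = sample_per_bucket(docs, per_bucket=5)
```

Sampling a fixed quota per bucket, instead of sampling from the pooled log, is one simple way to keep tail documents represented.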

Annotation
Each sample is annotated by three to five domain experts, and the annotation with the majority of votes is taken to estimate human performance. During the annotation phase, we add control questions to prevent dishonest behavior by the domain experts. Consequently, we reject any annotations made by domain experts who fail in the training phase and do not adopt the results of those who achieve low performance on the control tasks. We maintain strict and high criteria for approval and review at least 10 random samples from each worker to decide whether to approve or reject all their HITs. We also calculate the average inter-rater agreement between annotators using Fleiss' Kappa scores (Fleiss, 1971), finding that five out of six annotations show almost perfect agreement (κ = 0.9).
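For reference, Fleiss' Kappa can be computed from a table of rating counts as in the sketch below. It assumes that every item receives the same number of ratings, which is a simplification of the three-to-five-annotator setting described above; the function name and the toy table are hypothetical and are not taken from the CBLUE annotation data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j (same number of raters for every item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        # All ratings fall into one category, so kappa is undefined;
        # this sketch returns 1.0 in that case.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy table: 4 items, 3 raters, 2 categories.
toy = [[3, 0], [2, 1], [0, 3], [3, 0]]
kappa = fleiss_kappa(toy)
```

Values above 0.8 are conventionally read as almost perfect agreement, which is the sense in which the reported κ = 0.9 is interpreted.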

Characteristics
Utility-preserving Anonymization Biomedical data usually contain sensitive information, so releasing them may breach the privacy of individuals. Thus, we conduct utility-preserving anonymization following Lee et al. (2017) to anonymize the data before releasing the benchmark.

Real-world Distribution
To promote the generalization of models, all the data in our CBLUE benchmark follow the real-world distribution without up/downsampling. As shown in Figure 1(a), our dataset follows a long-tailed distribution consistent with Zipf's law. However, the long-tailed distribution has no significant effect on performance. Further, some datasets, such as CMeIE, have a label hierarchy with both coarse-grained and fine-grained relation labels, as shown in Figure 1(b).
Diverse Tasks Setting Our CBLUE benchmark includes eight diverse tasks, including named entity recognition, relation extraction, and single-sentence/sentence-pair classification. Besides the independent and identically distributed (i.i.d.) scenarios, our CBLUE benchmark also contains a specific transfer learning scenario supported by the CHIP-STS dataset, in which the testing set has a different distribution from the training set.
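The transfer scenario can be illustrated with a split in which no disease type is shared between training and testing data. The sketch below is only an illustration under assumed names: the record fields (`disease`, `q1`, `q2`, `label`), the placeholder disease identifiers, and the function `split_by_disease` are hypothetical and do not describe the released CHIP-STS file format.

```python
def split_by_disease(pairs, held_out):
    """Split question pairs so that held-out diseases occur only in the test set."""
    held_out = set(held_out)
    train = [p for p in pairs if p["disease"] not in held_out]
    test = [p for p in pairs if p["disease"] in held_out]
    return train, test


# Hypothetical records; real question text is replaced by placeholders.
pairs = [
    {"disease": "disease_A", "q1": "question 1", "q2": "question 2", "label": 1},
    {"disease": "disease_B", "q1": "question 3", "q2": "question 4", "label": 0},
    {"disease": "disease_C", "q1": "question 5", "q2": "question 6", "label": 1},
    {"disease": "disease_A", "q1": "question 7", "q2": "question 8", "label": 0},
]
train, test = split_by_disease(pairs, held_out=["disease_C"])
```

Because the split is made on the disease type rather than on individual pairs, a model is scored on diseases it never saw during fine-tuning.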

Leaderboard
We provide a leaderboard for users to submit their own results on CBLUE. The evaluation system will give final scores for each task when users submit their prediction results. The platform offers 60 free GPU hours from Aliyun 6 to help researchers develop and train their models.

Distribution and Maintenance
Our CBLUE benchmark was released online on April 1, 2021. To date, more than 300 researchers have applied for the dataset, and over 80 teams have submitted their model predictions to our platform, including medical institutions (Peking Union Medical College Hospital, etc.), universities (Tsinghua University, Zhejiang University, etc.), and companies (Baidu, JD, etc.). We will continue to maintain the benchmark by attending to new requests and adding new tasks.

Reproducibility
To make it easier to use the CBLUE benchmark, we also offer a toolkit implemented in PyTorch (Paszke et al., 2019) for reproducibility. Our toolkit supports mainstream pre-trained models and a wide range of target tasks.

Experiments
Baselines We conduct experiments with baselines based on different Chinese pre-trained language models. We add an additional output layer (e.g., MLP) for each CBLUE task and fine-tune the pre-trained models.
Models We evaluate CBLUE on the following publicly available Chinese pre-trained models:
• BERT-base (Devlin et al., 2018). We use the base model with 12 layers, a hidden size of 768, 12 heads, and 110 million parameters.
• RoBERTa-large (Liu et al., 2019). Compared with BERT, RoBERTa removes the next sentence prediction objective and dynamically changes the masking pattern applied to the training data.
• RoBERTa-wwm-ext-base/large. RoBERTa-wwm-ext is an efficient pre-trained model which integrates the advantages of RoBERTa and BERT-wwm.
• Mac-BERT-base/large (Cui et al., 2020). Mac-BERT is an improved BERT with a novel MLM-as-correction pre-training task.
• PCL-MedBERT 7 . A pre-trained medical language model proposed by the Peng Cheng Laboratory.
We implement all baselines with PyTorch (Paszke et al., 2019). All the training details can be found in the appendix.

Benchmark Results
We report the results of our baseline models on the CBLUE benchmark in Table 3. We notice that larger pre-trained models obtain better performance. Since Chinese biomedical text is rich in terminology, carefully designed masking strategies may be helpful for representation learning. However, we observe that models which use whole word masking do not always yield better performance than others in some tasks, such as CTC, QIC, QTR, and QQR, indicating that tasks in our benchmark are challenging and more sophisticated techniques should be developed. Further, we find that ALBERT-tiny achieves comparable performance to base models in the CDN, STS, QTR, and QQR tasks, illustrating that smaller models may also perform well in specific tasks. We think this is caused by the different distributions of the pre-training corpus and Chinese medical text; thus, large PTLMs may not obtain satisfactory performance. Finally, we notice that PCL-MedBERT, which tends to be state-of-the-art in Chinese biomedical text processing tasks, does not perform as well as we expected. This further demonstrates the difficulty of our benchmark, and contemporary models may find it difficult to quickly achieve outstanding performance.
Table 4: Human performance of two-stage evaluation scores with the best-performed model. "avg" refers to the mean score from the three annotators. "majority" indicates the performance taken from the majority vote of amateur humans. Bold text denotes the best result among human and model prediction.

Human Performance
For all of the tasks in CBLUE, we ask human amateur annotators with no medical experience to label instances from the testing set and compute the annotators' majority vote against the gold label annotated by specialists. Similar to SuperGLUE (Wang et al., 2019a), we first need to train the annotators before they work on the testing data. Annotators are asked to annotate some data from the development set; then, their annotations are validated against the gold standard. Annotators need to correct their annotation mistakes repeatedly so that they can master the specific tasks. Finally, they annotate instances from the testing data, and these annotations are used to compute the final human scores. The results are shown in Table 4 and the last row of Table 3. In all tasks, humans have better performance.
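The majority-vote scoring of human annotators can be sketched as below. This is a simplified illustration: it uses plain accuracy as a stand-in for the task-specific metrics, and the helper names, the tie-breaking rule, and the toy labels are assumptions rather than details given in the paper.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label. On a tie, the label that was seen
    first among the tied ones wins (an assumption of this sketch)."""
    return Counter(votes).most_common(1)[0][0]


def majority_accuracy(annotations, gold):
    """Accuracy of the per-instance majority vote against the gold labels."""
    if len(annotations) != len(gold):
        raise ValueError("annotations and gold labels must align")
    predictions = [majority_label(votes) for votes in annotations]
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical toy data: three annotators label three test instances.
annotations = [["A", "A", "B"], ["B", "B", "B"], ["A", "C", "C"]]
gold = ["A", "B", "A"]
score = majority_accuracy(annotations, gold)
```

The "avg" row of Table 4 would instead score each annotator separately and average the results, which this sketch does not cover.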

Case studies
We choose two datasets, CMeEE and KUAKE-QIC, a sequence labeling task and a classification task, respectively, to conduct case studies. As shown in Figure 2, we report the proportions of the various types of error cases 8 . For CMeEE, we notice that entity overlap 9 , ambiguity 10 , need for domain knowledge 11 , and annotation error 12 are the major reasons for prediction failure. Furthermore, there exist many instances with entity overlap, which may lead to confusion in the named entity recognition task. In the analysis of KUAKE-QIC, almost half of the bad cases are due to multiple triggers 13 and colloquialism 14 . Colloquialism is natural in search queries, which means that some descriptions in Chinese medical text are too simplified, colloquial, or inaccurate. We show some cases on CMeEE in Table 5. In the second row, we notice that given the instance "皮疹可因宿主产生特异性的抗毒素抗体而减少 (Rash can be reduced by the host producing specific anti-toxin antibodies.)", RoBERTa and PCL-MedBERT obtain different predictions. The reason is that the instance contains medical terms such as "抗毒素抗体 (anti-toxin antibodies)". RoBERTa cannot identify those tokens correctly, but PCL-MedBERT, pre-trained on a medical corpus, identifies them successfully. Moreover, PCL-MedBERT can extract the entities "缺失,易位,倒位 (deletions, translocations, inversions)" from long sentences, which is challenging for other models.
8 See definitions of errors in the appendix.
9 There exist multiple overlapping entities in the instance.
10 The instance has a similar context but a different meaning, which misleads the prediction.
11 There exist biomedical terminologies in the instance which require domain knowledge to understand.
12 The annotated label is wrong.
13 There exist multiple indicative words which mislead the prediction.
14 The instance is quite different from written language (e.g., with many abbreviations).
We further show some cases on KUAKE-QIC in Table 6. In the first case, we notice that both BERT and BERT-ext fail to obtain the intent label of the query "请问淋巴细胞比率偏高、中性细胞比率偏低有事吗? (Does it matter if the ratio of lymphocytes is high and the ratio of neutrophils is low?)", while MedBERT obtains the correct prediction. The reason is that "淋巴细胞比率 (ratio of lymphocytes)" and "中性细胞比率 (ratio of neutrophils)" are biomedical terms, and a general pre-trained language model has to leverage domain knowledge to understand those phrases.
As shown in Table 5 and Table 6, compared with other languages, the Chinese language is very colloquial even in medical texts. Furthermore, polysemy is prevalent in the Chinese language. The meaning of a word changes according to its tone, which usually causes confusion and difficulties for machine reading. In summary, we conclude that tasks in CBLUE are not easy to solve since the Chinese language has unique characteristics, and more robust models should be developed.

Conclusion
In this paper, we present a Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark. We evaluate 11 current language representation models on CBLUE and analyze their results. The results illustrate the limited ability of state-of-the-art models to handle some of the more challenging tasks. In contrast to English benchmarks such as GLUE/SuperGLUE and BLURB, on which model performance already matches human performance, we observe that this is far from the case for Chinese biomedical language understanding.

Acknowledgments
We want to express gratitude to the anonymous reviewers for their hard work and kind comments. This work is funded by Special Project of New Generation Artificial Intelligence of the Ministry of Science and Technology of China

Ethical Considerations
We collected all the data with authorization from the organization that owned the data and signed the agreement. We release the benchmark following the CC BY-NC 4.0 license. All collected datasets are anonymized and reviewed by the IRB committee of each data provider to preserve privacy. Since we collect data following real-world distribution, there may exist popularity bias that cannot be ignored.

A Broader Impact
The COVID-19 (coronavirus disease 2019) pandemic has had a significant impact on society, both because of the severe health effects of COVID-19 and the public health measures implemented to slow its spread. A lack of information fundamentally caused many of the difficulties experienced during the outbreak; attempts to address these needs caused an information overload for both researchers and the public. Biomedical natural language processing, the branch of artificial intelligence that interprets human language, can be applied to address many of the information needs made urgent by the COVID-19 pandemic. Unfortunately, most language benchmarks are in English, and no biomedical benchmark currently exists in Chinese. Our benchmark CBLUE, as the first Chinese biomedical language understanding benchmark, can serve as an open testbed for model evaluations to promote the advancement of this technology.

B Negative Impact
Although we ask domain experts and doctors to annotate the whole corpus, there still exist some instances with wrongly annotated labels. If a model were chosen based on its numbers on the benchmark, this could cause real-world harm. Moreover, our benchmark lowers the bar of entry to work with biomedical data. While generally a good thing, it may dilute the pool of data-driven work in the biomedical field even more than it already is, making it hard for experts to spot the relevant work.

C Limitations
Although our CBLUE offers diverse settings, there are still some tasks not covered by the benchmark, such as medical dialogue generation (Liu et al., 2020; Lin et al., 2020; Zeng et al., 2020) or medical diagnosis (Wei et al., 2018). We encourage researchers in both academia and industry to contribute new datasets. Besides, our benchmark is static; thus, models may still achieve outstanding performance on tasks but fail on simple challenge examples and falter in real-world scenarios. We leave as future work the construction of a platform that includes dataset creation, model development, and assessment, leading to more robust and informative benchmarks.

D CBLUE Background
Standard datasets and shared tasks have played essential roles in promoting the development of AI technology. Taking the Chinese BioNLP community as an example, the CHIP (China Health Information Processing) conference releases biomedical-related shared tasks every year, which has extensively advanced Chinese biomedical NLP technology. However, some datasets are no longer available after the end of shared tasks, which has raised issues in the data acquisition and future research of the datasets.
In recent years, we can obtain state-of-the-art performance for many downstream tasks with the help of pre-trained language models. A significant trend is the emergence of multi-task leaderboards, such as GLUE (General Language Understanding Evaluation) and CLUE (Chinese Language Understanding Evaluation). These leaderboards provide a fair benchmark that attracts the attention of many researchers and further promotes the development of language model technology. For example, in the medical field, Microsoft released BLURB (Biomedical Language Understanding & Reasoning Benchmark) at the end of 2020. Recently, the Tianchi platform launched the CBLUE (Chinese Biomedical Language Understanding Evaluation) public benchmark under the guidance of the CHIP Society. We believe that the release of CBLUE will further attract researchers' attention to the medical AI field and promote the development of the community. CBLUE 1.0 15 comprises the previous shared tasks of the CHIP conference and the dataset from the Alibaba KUAKE Search Engine, including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
15 We release the benchmark following the CC BY-NC 4.0 license.

E Detailed Task Introduction
E.1 Chinese Medical Named Entity Recognition Dataset (CMeEE)
Task Background As an essential subtask of information extraction, entity recognition has achieved promising results in recent years. Biomedical texts such as textbooks, encyclopedias, clinical trials, medical literature, electronic health records, and medical examination reports contain rich medical knowledge. Named entity recognition is the process of extracting medical terminologies, such as diseases and symptoms, from the above-mentioned unstructured or semi-structured texts, and it can help significantly improve the efficiency of scientific research. The CMeEE dataset is proposed for this purpose, and the original dataset was released at the CHIP2020 conference.
Task Description Given the pre-defined schema and an input sentence, the task is to identify medical entities and classify them into 9 categories: disease (dis), clinical symptoms (sym), drugs (dru), medical equipment (equ), medical procedures (pro), body (bod), medical examination items (ite), microorganisms (mic), and department (dep). For the detailed annotation instructions, please refer to the CBLUE official website; examples are shown in Table 7.
Annotation Process The annotation guide was written by two medical experts from Class A tertiary hospitals and optimized during the trial annotation process. A total of 32 annotators participated in the annotation process, including 2 medical experts who are also the owners of the annotation guideline, 4 experts from the biomedical informatics field, 6 medical doctors (M.D.), and 22 master's students majoring in computer science. The annotation lasted for about three months (from October 2018 to December 2018), with an additional month for curation. The total expense was about 50,000 RMB. The annotation process was divided into two stages.
• Stage 1: This stage was called the trial annotation phase. The medical experts trained the annotators to make sure they had a comprehensive understanding of the task. The annotators conducted two rounds of trial annotation, with the purpose of getting familiar with the annotation task as well as discovering unclear points in the guideline. Annotation problems were discussed, and the medical experts iteratively improved the annotation guidelines according to the feedback.
• Stage 2: In the first phase, each record was assigned to two annotators to label independently, and the medical experts and biomedical informatics experts gave timely help. The annotation results were compared automatically by the annotation tools (developed for the CMeEE and CMeIE tasks), and any disagreement was recorded and handed over to the next phase. In the second phase, the medical experts and the annotators discussed the disagreement records as well as other annotation problems, and the annotators made corrections. After the two stages, the IAA score (Kappa score) is 0.8537, which satisfies the research goal.
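The automatic comparison of two independent annotations can be sketched as follows. The sketch makes several assumptions that the text does not confirm: labels are compared item by item rather than span by span, agreement is measured with Cohen's kappa (the paper only says "Kappa score"), and the function name and toy labels are hypothetical.

```python
from collections import Counter


def compare_annotations(labels_a, labels_b):
    """Return the indices where two annotators disagree and Cohen's kappa."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    n = len(labels_a)
    disagreements = [
        i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b
    ]
    p_observed = 1 - len(disagreements) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined
        # and this sketch returns 1.0.
        return disagreements, 1.0
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return disagreements, kappa


# Hypothetical toy labels using the CMeEE category abbreviations.
annotator_1 = ["dis", "sym", "dru", "dis"]
annotator_2 = ["dis", "sym", "dis", "dis"]
to_discuss, kappa = compare_annotations(annotator_1, annotator_2)
```

The list of disagreeing indices corresponds to the records that would be handed over to the discussion phase.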
PII and IRB The corpus is collected from authorized medical textbooks or Clinical Practice, and no personally identifiable information (PII) or offensive content is involved in the text. The dataset does not raise ethical issues, which has been checked by the IRB committee of the provider.
The original dataset format is a self-defined plain text format. To simplify the data pre-processing step, the CBLUE team has converted the data format to the unified JSON format with the permission of the data provider.
Evaluation Metrics This task uses strict Micro-F1 metrics.
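A strict Micro-F1 score can be sketched as below. Here "strict" is taken to mean that a predicted entity counts only if its sentence, boundaries, and type all match a gold entity exactly, and "micro" means that counts are pooled over all sentences. The tuple layout and the function name are assumptions of this sketch, not the official CBLUE evaluation script.

```python
def strict_micro_f1(gold, pred):
    """Micro-averaged F1 over entities given as
    (sentence_id, start, end, entity_type) tuples with exact matching."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy example with the CMeEE category abbreviations.
gold = [(0, 0, 2, "dis"), (0, 5, 8, "sym"), (1, 3, 6, "dru")]
pred = [
    (0, 0, 2, "dis"),  # exact match
    (0, 5, 7, "sym"),  # boundary is off by one, so it does not count
    (1, 3, 6, "dru"),  # exact match
]
f1 = strict_micro_f1(gold, pred)
```

Under this definition, a prediction with the right type but a slightly wrong boundary is penalized twice, once as a false positive and once as a missed gold entity.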
Dataset Statistic This task has 15,000 training samples, 5,000 validation samples, and 3,000 test samples. The corpus contains 938 files and 47,194 sentences. The average number of words contained per file is 2,355. The dataset contains 504 common pediatric diseases, 7,085 body parts, 12,907 clinical symptoms, and 4,354 medical procedures in total.
Dataset Provider The dataset is provided by: •
Task Background Entity and relation extraction is an essential information extraction task for natural language processing and knowledge graph (KG) construction; it detects pairs of entities and their relations in unstructured text. The technique applies to the medical field. For example, entity and relation extraction can turn unstructured and semi-structured medical texts into medical knowledge graphs, which can serve many downstream tasks.
Task Description Given a sentence and a schema that defines each relation (predicate) together with the types of its subject and object, such as ("subject_type": "疾病", "predicate": "药物治疗", "object_type": "药物"), the model is required to analyze the sentence automatically and extract all triples $Triples = [(S_1, P_1, O_1), (S_2, P_2, O_2), \ldots]$ it contains. The 53 schemas include 10 kinds of genus relations and 43 other sub-relations; the details are in the 53_schema.json file. For the detailed annotation instructions, please refer to the CBLUE official website. Examples are shown in Table 8.

Annotation Process
The annotation guideline was developed by two medical experts from Class A tertiary hospitals and optimized during the trial annotation process. A total of 20 annotators participated in the annotation process, including 2 medical experts who are also the owners of the annotation guideline, 2 experts from the biomedical informatics field, 4 medical doctors (M.D.), and 14 master's students majoring in computer science. The annotation lasted about four months (from October 2018 to December 2018), covering both annotation and curation. The total expense was about 40,000 RMB. Similar to the CMeEE dataset, the annotation process for CMeIE consisted of a trial annotation stage and a formal annotation stage that followed the same procedure. In addition, a Chinese segmentation validation step was added for this dataset. The data provider developed a segmentation tool for medical texts that produces word segments together with POS tags; some specific POS types (such as 'disease' and 'drug') made it possible to flag potentially missing named entities automatically, which helped the annotators check for missing labels. The final IAA for this dataset is 0.83, which satisfied the research purpose.
PII and IRB The corpus is collected from authorized medical textbooks or Clinical Practice, and the text contains no personally identifiable information (PII) or offensive content. The dataset does not raise ethical concerns, which has been checked by the IRB committee of the provider.
Evaluation Metrics A predicted SPO triple counts as correct only if it exactly matches a gold triple. The strict Micro-F1 is used for evaluation.
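The strict matching rule can be sketched in a few lines. This is a minimal illustration and not the official CBLUE scorer: the function name `strict_micro_f1` and the representation of each sentence's output as a set of `(subject, predicate, object)` tuples are assumptions made for the example. A predicted triple is counted only if all three elements equal those of a gold triple, and the counts are pooled over all sentences before precision and recall are computed.

```python
from typing import List, Set, Tuple

# Hypothetical representation: one (subject, predicate, object) tuple per triple.
Triple = Tuple[str, str, str]


def strict_micro_f1(gold: List[Set[Triple]], pred: List[Set[Triple]]) -> float:
    """Micro-F1 with exact triple matching, pooled over all sentences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same sentences")
    # A predicted triple is correct only if it is identical to a gold triple.
    correct = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if correct == 0:
        return 0.0
    precision = correct / n_pred
    recall = correct / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy data with made-up triples: the second prediction in sentence 1
# has the wrong object, so it does not count under strict matching.
gold = [{("A", "r1", "B"), ("A", "r2", "C")}, {("D", "r1", "E")}]
pred = [{("A", "r1", "B"), ("A", "r2", "X")}, {("D", "r1", "E")}]
score = strict_micro_f1(gold, pred)
```

Pooling the counts first (micro-averaging) means that sentences with many triples weigh more than sentences with few, in contrast to averaging a per-sentence score.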
Dataset Statistic The dataset contains 14,339 training, 3,585 validation, and 4,482 test instances. It is drawn from a pediatric corpus and a common disease corpus: the pediatric corpus originates from 518 pediatric diseases, and the common disease corpus is derived from 109 common diseases. The dataset contains nearly 75,000 triples, 28,000 disease sentences, and 53 schemas.
Dataset Provider The dataset is provided by: •
Task Background Clinical term normalization is a crucial task for both research and industry use. Clinically, there may be up to hundreds of different synonyms for the same diagnosis, symptom, or procedure; for example, "heart attack" and "MI" both stand for the standard term "myocardial infarction". The goal of this task is to find the standard phrase (i.e., the ICD code) for a given clinical term. Standard codes ease the burden on researchers performing statistical analysis of clinical trials, and they are also helpful for insurance companies in DRGs- or DIP-related applications. This task is proposed for this purpose, and the original shared task was released at the CHIP2020 conference.

Task Description
The task aims to standardize the terms from the final diagnoses of Chinese electronic medical records. No privacy information is involved in the final diagnosis. Given an original term, the model is required to predict the corresponding standard phrase from the standard vocabulary of "International Classification of Diseases (ICD-10) for Beijing Clinical Edition v601". For the detailed annotation instructions, please refer to the CBLUE official website. Examples are shown in Table 9.
Annotation Process The Chinese Diagnostic Normalization Data Set (CHIP-CDN) was annotated by the medical team of Yidu Cloud, all of whose members have medical backgrounds and clinician qualification certificates. The work took about 2 months; since it was done by internal staff, the cost is an estimate, around 100,000 RMB in total. The annotation consisted of one round of labeling, one round of full audit, and one round of random quality inspection. Labeling and review were completed by ordinary labeling personnel with clinical qualifications, and the random quality inspections were completed by high-level terminology experts.

PII and IRB
The corpus is collected from EMRs (electronic medical records), and only the final diagnosis part is chosen for research purposes. The dataset does not raise ethical concerns.
As shown in the example table, the final diagnosis has no PII included.
The original dataset format is a self-defined xlsx format. To unify the data pre-processing step, the CBLUE team has converted the data format to the JSON format with the permission of the data provider.

Evaluation Metrics
The F1 score is calculated over (original diagnosis term, standard phrase) pairs. If the test set has $m$ golden pairs and the predicted result has $n$ pairs, of which $k$ are predicted correctly, then:
$$P = \frac{k}{n}, \quad R = \frac{k}{m}, \quad F_1 = \frac{2 P R}{P + R}. \tag{1}$$
Dataset Statistic 8,000 training instances and 10,000 testing instances are provided. We split the original training set into 6,000 and 2,000 instances for the training and validation sets, respectively.
Dataset Provider The dataset is provided by Yidu Cloud Technology Inc.
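Equation (1) can be checked with a few lines of code. The sketch below is illustrative only: `pair_f1` is a hypothetical helper, and representing the gold and predicted results as sets of (original term, standard phrase) pairs is an assumption made for the example, not the official evaluation script.

```python
from typing import Set, Tuple

# Hypothetical representation: (original diagnosis term, standard phrase).
Pair = Tuple[str, str]


def pair_f1(gold: Set[Pair], pred: Set[Pair]) -> Tuple[float, float, float]:
    """Return (P, R, F1) as defined in Equation (1)."""
    m = len(gold)         # number of golden pairs
    n = len(pred)         # number of predicted pairs
    k = len(gold & pred)  # number of correctly predicted pairs
    p = k / n if n else 0.0
    r = k / m if m else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Toy data with placeholder strings: term "t1" maps to two standard phrases,
# so it contributes two golden pairs.
gold = {("t1", "s1"), ("t1", "s2"), ("t2", "s3"), ("t3", "s4")}  # m = 4
pred = {("t1", "s1"), ("t2", "s3"), ("t3", "s9")}                # n = 3, k = 2
p, r, f1 = pair_f1(gold, pred)
```

Because the unit of evaluation is the pair rather than the term, a term that maps to several standard phrases is only fully credited when every one of its phrases is predicted.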

E.4 Clinical Trial Criterion Dataset (CHIP-CTC)
Task Background Clinical trials are scientific studies conducted with human volunteers to determine the efficacy, safety, and side effects of a drug or a treatment method. They play a crucial role in promoting the development of medicine and improving human health. Depending on the purpose of the experiment, the subjects may be patients or healthy volunteers. The goal of this task is to predict whether a subject meets the criteria of a clinical trial. Recruitment of subjects for clinical trials is generally done through manual comparison of medical records and clinical trial screening criteria, which is time-consuming, laborious, and inefficient. In recent years, methods based on natural language processing have been successful in many biomedical applications. This task is proposed to automatically classify clinical trial eligibility criteria in Chinese, and the original task was released at the CHIP2019 conference. All the data come from real clinical trials collected from the website of the Chinese Clinical Trial Registry (ChiCTR), which is a non-profit organization providing registration for public research use.
Task Description A total of 44 semantic categories are pre-defined for this task, and the goal is to assign a given text to the correct category. For the detailed annotation instructions, please refer to the CBLUE official website. Examples of labeled data are shown in Table 10.
Annotation Process The CHIP-CTC corpus was annotated by three annotators. The first annotator is Zuofeng Li, a principal scientist in Philips Research China, with more than a decade of research   Jinxuan Yang (Ph.D. candidate) in the biomedical informatics field from Tongji University. The annotation started in July 2019 and took about 1 month. Further, the corpus was used in the CHIP 2019 shared task. The annotation was related to the annotator's research project, and no payment was required.
One experienced biomedical researcher (Z.L) and two raters from the biomedical domain (Z.Z and J.Y, Ph.D. candidates in biomedical informatics) labeled the CHIP-CTC corpus with the 44 categories. First, they studied the definitions of these categories, investigated a large number of expression patterns of criteria sentences, and chose example criteria for each category. Next, the two raters independently annotated the same 1,000 sentences, then checked the annotations and discussed contradictions with Z.L until consensus was achieved. This step was repeated for 20 iterations, yielding 20,000 annotated criteria sentences, which were later used to calculate the inter-annotator agreement score (0.9920 by Cohen's kappa score). Finally, the remaining 18,341 sentences were assigned to the two raters for annotation.
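For readers unfamiliar with the agreement measure, the following is a minimal sketch of Cohen's kappa for two raters who label the same items. It implements the standard textbook definition (observed agreement corrected for chance agreement); the function name `cohen_kappa` and the toy labels are illustrative assumptions and are not taken from the CHIP-CTC annotation tooling.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items in the same order."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Toy example with placeholder category names "A" and "B".
rater_1 = ["A", "A", "B", "B", "A", "B"]
rater_2 = ["A", "A", "B", "A", "A", "B"]
kappa = cohen_kappa(rater_1, rater_2)
```

In the toy example the raters agree on 5 of 6 items, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate.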
PII and IRB The corpus is collected from the website of the Chinese Clinical Trial Registry (ChiCTR), which is a non-profit organization providing registration for public research use. Each registered clinical trial on this website has already been approved by the ethics committee of the organization. In addition, the annotation and corpus have been reviewed and approved by the Internal Committee on Biomedical Experiments (ICBE) at Philips. Use of the corpus for academic research is encouraged.
For each registered clinical trial report, no PII is included.
The original dataset format is a self-defined csv format. To unify the data pre-processing step, the CBLUE team has converted the data format to the JSON format with the permission of the data provider.
Evaluation Metrics The evaluation of this task uses Macro-F1. Suppose we have $n$ categories, $C_1, \ldots, C_i, \ldots, C_n$. The precision $P_i$ is the number of records correctly predicted as class $C_i$ divided by the number of records predicted as class $C_i$. The recall $R_i$ is the number of records correctly predicted as class $C_i$ divided by the number of records whose true class is $C_i$.
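The per-class definitions above can be turned into a score as sketched below. Note the assumptions: the text defines only $P_i$ and $R_i$, so the final step used here (per-class $F_{1,i} = 2 P_i R_i / (P_i + R_i)$, averaged without weights over the classes) is the common definition of Macro-F1 and is assumed rather than quoted from the source. The function name `macro_f1` and the choice to take the class set from the labels that actually occur are likewise illustrative.

```python
from typing import Hashable, Sequence


def macro_f1(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 (assumed definition of Macro-F1)."""
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and aligned")
    # Assumption: classes are those observed in gold or pred,
    # not a fixed list of all pre-defined categories.
    classes = set(gold) | set(pred)
    f1_scores = []
    for c in classes:
        correct = sum(g == c and p == c for g, p in zip(gold, pred))
        n_pred = sum(p == c for p in pred)  # records predicted as class c
        n_gold = sum(g == c for g in gold)  # records whose true class is c
        p_i = correct / n_pred if n_pred else 0.0
        r_i = correct / n_gold if n_gold else 0.0
        f1_scores.append(2 * p_i * r_i / (p_i + r_i) if (p_i + r_i) else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with three placeholder categories.
gold = ["a", "a", "b", "b", "c"]
pred = ["a", "b", "b", "b", "c"]
score = macro_f1(gold, pred)
```

Because every class contributes equally to the mean, rare categories influence Macro-F1 as much as frequent ones, which matters when the label distribution is skewed.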
Dataset Statistic This task has 22,962 training, 7,682 validation, and 10,000 test instances.
Dataset Provider The dataset is provided by the School of Life Sciences and Technology, Tongji University, and Philips Research China.

E.5 Semantic Textual Similarity Dataset (CHIP-STS)
Task Background The CHIP-STS task aims to learn semantic similarity across disease types from Chinese online medical questions. Specifically, given question pairs from 5 different diseases, the model is required to determine whether the semantics of the two sentences are similar or not. The original shared task was released at the CHIP2019 conference.
Task Description The category represents the name of the disease type, which is one of diabetes, hypertension, hepatitis, AIDS, and breast cancer. The label indicates whether the semantics of the two questions are the same: pairs with the same meaning are marked as 1, and pairs with different meanings are marked as 0. Examples of labeling are shown in Table 11.
Annotation Process The CHIP-STS corpus was annotated by five undergraduate annotators from medical colleges under the guidance of one surgeon and one physician. The task is relatively simple since it is a two-class classification task; annotation together with verification took two weeks. A total of 30,000 sentence pairs were annotated, and the annotation expense was 25,000 RMB. There are five types of diseases, so each annotator was assigned two types of disease to label, guaranteeing that each type was annotated by two raters. During the trial annotation process, each annotator was given 100 records to label in order to test whether they understood the task thoroughly. After that, the annotators started the formal labeling, and the medical experts gave help when necessary, for example by explaining disease mechanisms to the raters. Finally, each record was labeled by two different annotators, and the pairs with disagreements were selected for discussion and case study; the annotators then rechecked their previous labels according to the experts' feedback. The IAA score was 0.93.

PII and IRB
The corpus is collected from online questions on a medical forum and does not raise ethical concerns, which has been checked by the IRB committee of the provider.
During the annotation step, sentences containing PHI were discarded manually by the annotators. The CBLUE team has also validated the dataset record by record to guarantee that no PII is included.
The original dataset format is a self-defined csv format. To unify the data pre-processing step, the CBLUE team has converted the data format to the JSON format with the permission of the data provider.
Evaluation Metrics The evaluation of this task uses Macro-F1.
Dataset Statistic This task has 16,000 training, 4,000 validation, and 10,000 test instances.
Dataset Provider The dataset is provided by Ping An Technology.

E.6 KUAKE-Query Intent Classification Dataset (KUAKE-QIC)
Task Background In medical search scenarios, understanding query intent can significantly improve the relevance of search results. In particular, medical knowledge is highly specialized, and classifying query intents can also help integrate medical knowledge to improve search results. This task is proposed for this purpose. Examples are shown in Table 12.
Annotation Process The KUAKE-QIC corpus was annotated by six annotators who graduated from medical colleges; they were full-time employees of Alibaba's KUAKE department. They had to pass a test on the specified annotation task before annotation started. The task took about 2 weeks, and the annotation fee was 6,600 RMB for 22,000 labeled records, that is, 0.3 RMB per record. The annotation process was divided into three steps. The first step was the trial annotation step, for which 2,000 records were selected. The annotators were divided into 2 groups of 3 persons each. The data provider applied a strict quality-control criterion: the IAA among the three persons within the same group had to exceed 0.9.
The second step was the formal annotation phase, in which the 6 annotators were divided into three groups of 2 persons each. A total of 20,000 records were annotated; the IAA for this step was 0.9230.
The last step was quality control. A sampling strategy was adopted, and 300 records were sampled for validation; the medical experts raised some common annotation problems, which were then fixed in batch mode. In addition, the medical experts made the final decisions on the remaining disagreed cases.
PII and IRB The corpus is collected from user queries to the KUAKE search engine and does not raise ethical concerns, which has been checked by the IRB committee of the provider.
During the annotation step, sentences containing PHI or offensive content (such as sexual queries) were discarded manually by the annotators.
The dataset also passed Alibaba's data disclosure process.
The CBLUE team has also validated the dataset record by record to guarantee that no PII is included.
Evaluation Metrics Accuracy is used for the evaluation of this task.
Dataset Statistic This task has 6,931 training, 1,955 validation, and 1,994 test instances.
Dataset Provider The dataset is provided by Alibaba QUAKE Search Engine.

E.7 KUAKE-Query Title Relevance Dataset (KUAKE-QTR)
Task Background KUAKE Query Title Relevance is a dataset for query-document (title) relevance estimation. For example, given the query "Symptoms of vitamin B deficiency", the relevant title should be "The main manifestations of vitamin B deficiency".
Task Description The relevance between Query and Title is divided into 4 levels (0-3), where 0 is the worst match and 3 is the best. For the detailed annotation instructions, please refer to the CBLUE official website. Examples are shown in Table 13.
Annotation Process The KUAKE-QTR corpus was annotated by nine annotators in total: seven were crowdsourced through a third party and were undergraduates from medical colleges, and two were Alibaba full-time employees with medical backgrounds. The crowdsourced annotators were required to be trained and to pass an annotation test before they could start the task. The annotation lasted 2 weeks and cost 28,000 RMB in total. Similar to the KUAKE-QIC task, the KUAKE-QTR annotation process was divided into three steps, with minor changes. The first step was the training and examination stage: the seven annotators were trained by the two FTE (full-time employee) experts to understand the task, and each was then given 200 records to label, for which the FTE experts had annotated ground-truth answers. A precision above 85% was required to pass the test.
The second step was the formal annotation step. Each annotator was given 3,000 records to label, 100 of which had golden labels. The annotation tools automatically evaluated annotation quality by comparing the annotators' labels with the golden ones, and help was given to the annotators when necessary. Only annotations whose precision exceeded the 0.85 threshold were handed to the next round.
The last step was quality control. A sampling strategy was adopted, and 100 records were sampled for validation by the FTE medical experts; bad cases were returned to the crowdsourced annotators to be fixed.
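The automatic check against golden labels described for the formal annotation step can be sketched as follows. This is an illustration, not the provider's annotation tool: the helper names `golden_precision` and `passes_quality_gate`, the use of record-ID dictionaries, and reading "precision" as the fraction of golden records whose submitted label matches are all assumptions. Only the 0.85 threshold comes from the text, and the sketch treats it as a strict lower bound.

```python
from typing import Dict


def golden_precision(labels: Dict[str, int], golden: Dict[str, int]) -> float:
    """Fraction of golden records whose submitted label equals the golden label."""
    if not golden:
        raise ValueError("at least one golden record is required")
    # A golden record the annotator did not label counts as a miss.
    hits = sum(labels.get(rid) == gold for rid, gold in golden.items())
    return hits / len(golden)


def passes_quality_gate(labels: Dict[str, int], golden: Dict[str, int],
                        threshold: float = 0.85) -> bool:
    """True if the batch may be handed to the next round (assumed strict '>')."""
    return golden_precision(labels, golden) > threshold


# Toy batch: 100 golden records with relevance levels 0-3, 10 of them mislabeled.
golden = {f"r{i}": i % 4 for i in range(100)}
labels = dict(golden)
for rid in list(golden)[:10]:
    labels[rid] = (golden[rid] + 1) % 4
precision = golden_precision(labels, golden)
accepted = passes_quality_gate(labels, golden)
```

Hiding a small set of golden records inside each batch lets quality be estimated without an expert re-labeling the whole batch.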

PII and IRB
The corpus is collected from user queries to the KUAKE search engine and does not raise ethical concerns, which has been checked by the IRB committee of the provider.
During the annotation step, sentences containing PHI or offensive content (such as sexual queries) were discarded manually by the annotators. The dataset also passed Alibaba's data disclosure process.
The CBLUE team has also validated the dataset record by record to guarantee that no PII is included. One record with a NULL label was discarded with the permission of the provider.
Evaluation Metrics Same as the KUAKE-QIC task, accuracy is used for the evaluation of this task.
Dataset Statistic This task has 24,174 training, 2,913 validation, and 5,465 test instances.
Dataset Provider This dataset is provided by Alibaba QUAKE Search Engine.

E.8 KUAKE-Query Query Relevance Dataset (KUAKE-QQR)
Task Background KUAKE Query-Query Relevance is a dataset that evaluates the relevance between two given queries to resolve the long-tail challenges for search engines. Similar to KUAKE-QTR, query-query relevance is an essential and challenging task in real-world search engines.
Task Description The relevance between the two queries is divided into 3 levels (0-2), where 0 is the worst and 2 is the best. For the detailed annotation instructions, please refer to the CBLUE official website. Examples are shown in Table 14.
Annotation Process The same as KUAKE-QTR except for the expense, which is 22,000 RMB in total.
PII and IRB The same as KUAKE-QTR.
Evaluation Metrics As with the KUAKE-QIC and KUAKE-QTR tasks, accuracy is used as the evaluation metric.
Dataset Statistic This task has 15,000 training, 1,600 validation, and 1,596 test instances.
Dataset Provider This dataset is provided by Alibaba QUAKE Search Engine.

G Error Analysis for Other Tasks
We define the error types as follows and illustrate some error cases for the other tasks in Tables 27 to 32.
Ambiguity indicates that the instance has a similar context but a different meaning, which misleads the prediction.
Need domain knowledge indicates that there exist biomedical terminologies in the instance which require domain knowledge to understand.
Need syntactic knowledge indicates that the instance has a complex syntactic structure, and the model fails to understand the correct meaning.
Entity overlap indicates there exist multiple overlapping entities in the instance.
Long sequence indicates that the input instance is very long.
Annotation error indicates that the annotated label is wrong.
Wrong entity boundary indicates that the instance has the wrong entity boundary.
Rare words indicates that there exist low-frequency words in the instance.
Multiple triggers indicates that there exist multiple indicative words which mislead the prediction.
Colloquialism (very common in the search queries) indicates that the instance is quite different
Table 20: Hyper-parameters for the CHIP-CDN task. We model the CHIP-CDN task with two stages: recall stage and ranking stage. num_negative_sample sets the number of negative samples sampled for the training ranking model during the ranking stage. recall_k sets the number of candidates recalled in the recall stage.
Table 28: Error cases in CHIP-CDN. We evaluate roberta-wwm-ext and PCL-MedBERT on 3 sampled sentences, with their gold labels and model predictions. There may be multiple predicted values, separated by a "##". RO = roberta-wwm-ext, MB = PCL-MedBERT.
Table 30: Error cases in CHIP-STS. We evaluate performance of baselines with 3 sampled instances. The similarity between queries is divided into 2 levels (0-1), which means 'unrelated' and 'related'. BE = BERT-base, BE+ = BERT-wwm-ext-base, MB = PCL-MedBERT.