MLEC-QA: A Chinese Multi-Choice Biomedical Question Answering Dataset

Question Answering (QA) has been successfully applied in human-computer interaction scenarios such as chatbots and search engines. However, in the biomedical domain, QA systems are still immature because expert-annotated datasets are limited in both category and scale. In this paper, we present MLEC-QA, the largest-scale Chinese multi-choice biomedical QA dataset, collected from the National Medical Licensing Examination in China. The dataset is composed of five subsets with 136,236 biomedical multi-choice questions with extra materials (images or tables) annotated by human experts, and is the first to cover the following biomedical sub-fields: Clinic, Stomatology, Public Health, Traditional Chinese Medicine, and Traditional Chinese Medicine Combined with Western Medicine. We implement eight representative control methods and open-domain QA methods as baselines. Experimental results demonstrate that even the current best model only achieves accuracies between 40% and 55% on the five subsets, performing especially poorly on questions that require sophisticated reasoning ability. We hope the release of the MLEC-QA dataset can serve as a valuable resource for research and evaluation in open-domain QA, and also advance biomedical QA systems.


Introduction
As a branch of the QA task, Biomedical Question Answering (BQA) enables innovative applications to effectively perceive, access, and understand complex biomedical knowledge, which makes BQA an important QA application in the biomedical domain (Jin et al., 2021). Such a task has recently attracted considerable attention from the NLP community (Zweigenbaum, 2003; He et al., 2020b; Jin et al., 2020), but is still confronted with the following three key challenges (the dataset is available at https://github.com/Judenpech/MLEC-QA): (1) Most work attempts to build BQA systems with deep learning and neural network techniques (Ben Abacha et al., 2017, 2019b; Pampari et al., 2018) and is thus data-hungry. However, annotating large-scale biomedical question-answer pairs with high quality is prohibitively expensive. As a result, current expert-annotated BQA datasets are small in size.
(2) Multi-choice QA is a typical format of BQA datasets. Most previous work focuses on datasets of this format whose contents are in the fields of clinical medicine (Zhang et al., 2018b; Jin et al., 2020) and consumer health (Zhang et al., 2017, 2018a; He et al., 2019; Tian et al., 2019). However, there are many other specialized sub-fields in biomedicine that have not been studied before (e.g., Stomatology).
(3) Ideal BQA systems should not only focus on raw text data, but also fully utilize various types of biomedical resources, such as images and tables. Unfortunately, most BQA datasets consist of either texts (Tsatsaronis et al., 2015; Pampari et al., 2018; Jin et al., 2019) or images (Lau et al., 2018; Ben Abacha et al., 2019a; He et al., 2020a); as a result, BQA datasets that fuse different biomedical resources are relatively limited.
To push forward the variety of BQA datasets, we present MLEC-QA, the largest-scale Chinese multi-choice BQA dataset. Questions in MLEC-QA are collected from the National Medical Licensing Examination in China (NMLEC), which is carefully designed by human experts to evaluate the professional knowledge and skills of those who want to become medical practitioners in China. The NMLEC comprises 24 categories of exams, but only five of them have written exams in Chinese. Every year, only around 18-22% of applicants pass one of these exams, showing the complexity and difficulty of passing them even for skilled humans.
There are three main properties of MLEC-QA: (1) MLEC-QA is the largest-scale Chinese multi-choice BQA dataset, containing 136,236 questions with extra materials (images or tables); Table 1 shows an example. (2) MLEC-QA is the first to cover the following biomedical sub-fields: Clinic, Stomatology, Public Health, Traditional Chinese Medicine, and Traditional Chinese Medicine Combined with Western Medicine (denoted as Chinese Western Medicine). Only one of them (Clinic) has been studied in previous research. (3) MLEC-QA provides extra labels of five question types (A1, A2, A3/A4, and B1) for each question, and an in-depth analysis of the most frequent reasoning types of the questions in MLEC-QA, such as lexical matching, multi-sentence reading, and concept summary. Detailed analysis can be found in Section 3.2. Examples of sub-fields and question types are summarized in Table 2; due to page limits, each of the five question types is illustrated with an example from a different sub-field. As an attempt to solve MLEC-QA and provide strong baselines, we implement eight representative control methods and open-domain QA methods in a two-stage retriever-reader framework: (1) A retriever finds documents that might contain an answer in a large collection of documents. We adopt Chinese Wikipedia dumps (https://dumps.wikimedia.org/) as our information sources, and use ElasticSearch, a distributed search and analytics engine, as the document store and document retriever. (2) A reader finds the answer in the documents retrieved by the retriever. We fine-tune five pre-trained language models for machine reading comprehension as the reader. Experimental results show that even the current best model only achieves accuracies of 53%, 44%, 40%, 55%, and 50% on the five subsets: Clinic, Stomatology, Public Health, Traditional Chinese Medicine, and Chinese Western Medicine, respectively.
The models perform especially poorly on questions that require understanding comprehensive biomedical concepts and handling complex reasoning. In summary, the major contributions of this paper are threefold:
• We present MLEC-QA, the largest-scale Chinese multi-choice BQA dataset with extra materials; it is the first to cover five biomedical sub-fields, only one of which has been studied in previous research.
• We conduct an in-depth analysis of MLEC-QA, revealing that both comprehensive biomedical knowledge and sophisticated reasoning ability are required to answer its questions.
• We implement eight representative methods as baselines, show the performance of existing methods on MLEC-QA, and provide an outlook on future research directions.

Related Work
Open-Domain BQA The Text REtrieval Conference (TREC) (Voorhees and Tice, 2000) triggered open-domain BQA research. At the time, most traditional BQA systems employed complex pipelines with question processing, document/passage retrieval, and answer processing modules. Examples of such systems include EPoCare (Niu et al., 2003), MedQA (Yu et al., 2007; Terol et al., 2007; Wang et al., 2007) and AskHERMES (Cao et al., 2011). With the introduction of various BQA datasets focused on specific biomedical topics, such as BioASQ (Tsatsaronis et al., 2015), emrQA (Pampari et al., 2018) and PubMedQA (Jin et al., 2019), modern open-domain BQA systems, pioneered by Chen et al. (2017), largely simplified the traditional BQA pipeline to a two-stage retriever-reader framework by combining information retrieval and machine reading comprehension models (Ben Abacha et al., 2017, 2019b). Moreover, the extensive use of medical images (e.g., CT) and tables (e.g., laboratory examinations) has improved results in real-world clinical scenarios, making BQA a task lying at the intersection of Computer Vision (CV) and NLP. However, most BQA models focus on either texts or images (Lau et al., 2018; Ben Abacha et al., 2019a; He et al., 2020a); as a result, BQA datasets that fuse different biomedical resources are relatively limited.

Open-Domain Multi-Choice BQA Datasets
With rapidly increasing numbers of consumers asking health-related questions on online medical consultation websites, cMedQA (Zhang et al., 2017, 2018a), webMedQA (He et al., 2019) and ChiMed (Tian et al., 2019) exploit patient-doctor QA data to build consumer health QA datasets. However, such datasets suffer from quality problems: the answers are written by online doctors and the data itself contains intrinsic noise. By contrast, medical licensing examinations, which are designed by human medical experts, often take the form of multi-choice questions, and contain a significant number of questions that require comprehensive biomedical knowledge and multiple reasoning abilities. Such exams are an ideal data source to push the development of BQA systems. Several datasets that exploit such naturally existing BQA data have been released, as summarized in Table 3. Collected from the Spanish public healthcare specialization examination, HEAD-QA (Vilares and Gómez-Rodríguez, 2019) contains multi-choice questions from six biomedical categories, including Medicine, Pharmacology, Psychology, Nursing, Biology and Chemistry. NLPEC (Li et al., 2020) collects 21.7k multi-choice questions with human-annotated answers from the National Licensed Pharmacist Examination in China, but only a small sample of the data is publicly available.
Last but not least, clinical medicine, as one of the 24 categories in NMLEC, has been previously studied by MedQA (Zhang et al., 2018b) and MEDQA (Jin et al., 2020). However, the former did not release any data or code, and the latter only focused on clinical medicine with 34k questions in their cross-lingual studies; questions with images or tables were not included, and none of the remaining categories in MLEC-QA were studied.

Questions in MLEC-QA cover five sub-fields: Clinic (Cli), Stomatology (Sto), Public Health (PH), Traditional Chinese Medicine (TCM), and Chinese Western Medicine (CWM). After removing duplicated or incomplete questions (e.g., with some options missing), there are 136,236 questions in MLEC-QA, and each question contains five candidate options with one correct/best option and four incorrect or partially correct options. We describe the JSON data structure of MLEC-QA in detail in Appendix B. MLEC-QA contains 1,286 questions with extra materials that provide additional information needed to answer correctly. As shown in Figure 1, the extra materials are all in graphical formats of various types, such as ECGs, tables of a patient's condition record, formulas, CTs, line graphs, explanatory drawings, etc. We include these questions in MLEC-QA to facilitate future BQA explorations at the crossover of CV and NLP, although we do not exploit them in this work due to the various specifics involved in the extra materials. As shown in Table 2 and Table 4, the questions in MLEC-QA are divided into five types:
• A1: single statement question;
• B1: similar to A1, with a group of options shared by multiple questions;
• A2: questions accompanied by a clinical scenario;
• A3: similar to A2, with information shared among multiple independent questions;
• A4: similar to A3, with information shared among multiple questions, where new information can be gradually added.
We further classify these questions into Knowledge Questions (KQ) and Case Questions (CQ), where KQ (A1+B1) focus on the definition and comprehension of biomedical knowledge, while CQ (A2+A3/A4) require analysis and practical application in real-world medical scenarios. Both types of questions require multiple reasoning abilities to answer. For the Train/Dev/Test split, random splitting may cause data imbalance because the numbers of the five question types vary considerably (e.g., A1 questions far outnumber the others). To ensure that the subsets share the same distribution of question types, we split the data based on question type, with 80% for training, 10% for development, and 10% for test. The overall statistics of the MLEC-QA dataset are summarized in Table 5. We can see that the question length and vocabulary size in Clinic are larger than in the other subsets, suggesting that clinical medicine may involve more medical subjects than other specialties.
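The type-stratified split described above can be sketched in a few lines of Python (an illustrative sketch only; the field name "qtype" and the fixed seed are our own assumptions, not part of the released JSON format):

```python
import random
from collections import defaultdict

def stratified_split(questions, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Split questions 80/10/10 within each question type (A1, A2,
    A3/A4, B1) so Train/Dev/Test share the same type distribution."""
    by_type = defaultdict(list)
    for q in questions:
        by_type[q["qtype"]].append(q)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for qs in by_type.values():
        rng.shuffle(qs)
        n_train = int(len(qs) * ratios[0])
        n_dev = int(len(qs) * ratios[1])
        train += qs[:n_train]
        dev += qs[n_train:n_train + n_dev]
        test += qs[n_train + n_dev:]
    return train, dev, test
```

Splitting within each type, rather than over the full pool, is what keeps a rare type such as B1 represented proportionally in all three subsets.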

Reasoning Types of the Questions
The annual examination papers are designed by a team of healthcare experts who try to follow a similar distribution of reasoning types. To better understand our dataset, we manually inspected 10 sets of examination papers (2 sets for each sub-field), and summarized the most frequent reasoning types of the questions from MLEC-QA and previous works (Lai et al., 2017; Zhong et al., 2020). The examples are shown in Table 6. Notably, the "Evidence" is organized by us to show how models need to handle these reasoning issues to achieve promising performance on MLEC-QA. The reasoning types are defined as follows: Lexical Matching This type of question is common and the simplest. The retrieved documents are highly matched with the question, and the correct answer exactly matches a span in a document. As shown in the example, the model only needs to check which option matches that span.
Multi-Sentence Reading Unlike lexical matching, where questions and correct answers can be found within a single sentence, multi-sentence reading requires models to read multiple sentences to gather enough information to generate answers.

Concept Summary
The correct options for this type of question do not appear directly in the documents. The model must understand and summarize the concepts relevant to the question after reading the documents. As shown in the example, the model needs to understand and summarize the relevant mechanism of "Thermoregulation", and infer that when thermoregulation is impaired, the body temperature can no longer be maintained at a relatively constant level, that is, it will rise with increasing ambient temperature.
Numerical Calculation This type of question involves logical reasoning and arithmetic operations. As shown in the example, the model first needs to estimate the infant's approximate age in months from its length, and then apply the length formula for infants 7-12 months old in reverse to obtain the age in months: (68 - 65) / 1.5 + 6 = 8.
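The arithmetic in this example can be made explicit (a sketch of the infant-length formula quoted in the evidence; the function names are ours):

```python
def infant_length_cm(age_months):
    """Reference length for infants 7-12 months old:
    65 cm at 6 months, growing about 1.5 cm per month."""
    return 65 + (age_months - 6) * 1.5

def age_in_months(length_cm):
    """Invert the formula to recover the age from a measured length."""
    return (length_cm - 65) / 1.5 + 6

# For the 68 cm infant in the example: (68 - 65) / 1.5 + 6 = 8 months.
```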

Multi-Hop Reasoning
This type of question requires several steps of logical reasoning over multiple documents to answer. As shown in the example, the patient's hemoglobin (HB) value is low, indicating that the patient has anemia, so the supply of iron should be increased in their diet. The model needs to compare the iron content of each option: the iron content of C, D and E is low and that of A and B is high, but B is not easily absorbed, so the best answer is A.
Reasoning Type Example ( * represents the correct answer)

Lexical Matching
The main hallmark of peritonitis is: A. Significant abdominal distension B. Abdominal mobility dullness C. Bowel sounds were reduced or absent D. Severe abdominal cramping * E. Peritoneal irritation signs Evidence: The hallmark signs of peritonitis are peritoneal irritation signs, i.e., tenderness, muscle tension, and rebound tenderness.

Multi-Sentence Reading
Which of the following statements about the appendix is wrong: A. The appendiceal artery is the terminal artery B. Appendiceal tissues contain abundant lymphoid follicles C. Periumbilical pain at appendicitis onset is visceral pain * D. Resection of the appendix in adults will impair the body's immune function E. There are argyrophilic cells in the deep part of the appendiceal mucosa, which are associated with carcinoid tumorigenesis Evidence: (1) The appendiceal artery is a branch of the ileocolic artery and is a terminal artery without collaterals; (2) The appendix is a lymphoid organ[...]Therefore, resection of the adult appendix does not compromise the body's immune function; (3) The nerves of the appendix are supplied by sympathetic fibers[...]belonging to visceral pain; (4) Argyrophilic cells are found in the appendiceal mucosa and are the histological basis for the development of appendiceal carcinoids.

Concept Summary
The main hallmark of thermoregulatory disorders in hyperthermic environments is: A. Developed syncope B. Developed shock C. Dry heat of skin * D. Increased body temperature E. Decreased body temperature Evidence: The purpose of thermoregulation is to maintain body temperature in the normal range. In hyperthermic environments, the thermoregulatory center is dysfunctional and cannot maintain the body's balance of heat production and heat dissipation, so the body temperature is increased by the influence of ambient temperature.

Numerical Calculation
A normal infant, weighing 7.5kg and measuring 68cm in length. Anterior fontanelle 1.0cm, head circumference 44cm. Four teeth erupted. Can sit alone and can pick up pellets with thumb and forefinger. The most likely age of the infant is: * A. 8 months B. 24 months C. 18 months D. 12 months E. 5 months Evidence: A normal infant measures 65cm at 6 months and 75cm at 1 year of age. The length of a 7- to 12-month-old infant is calculated as: length = 65 + (months of age - 6) × 1.5.

Multi-Hop Reasoning
A 6-month-old female infant, mainly formula-fed; physical examination revealed a low hemoglobin (HB) value. The dietary supplement that should mainly be added is: * A. Liver paste B. Egg yolk paste C. Tomato paste D. Rice paste E. Apple puree Evidence: (1) A low HB value indicates a tendency toward anemia; iron deficiency anemia is the most important and common type of anemia in China.
(2) Iron supply should be increased in diet.
(3) Liver paste is rich in iron. (4) The iron content of egg yolk paste is lower than that of liver paste, and it is not easily absorbed. (5) The iron content of tomato paste, rice paste and apple puree is lower than that of liver paste.

Document Retriever
Both examination counseling books and Wikipedia have been used as sources of supporting materials in previous research (Zhong et al., 2020; Jin et al., 2020; Vilares and Gómez-Rodríguez, 2019). However, because examination counseling books are designed to help examinees pass the examination, knowledge in them is highly simplified and summarized, and even easily confused knowledge points are explicitly contrasted. Using examination counseling books as information sources may therefore encourage the retriever-reader to exploit shallow text matching, so complex reasoning is seldom involved.
Therefore, to help better understand the improvement coming from future models, we adopt Chinese Wikipedia dumps as our information sources, which contain a wealth of information (over 1 million articles) about real-world facts. Building upon the whole Chinese Wikipedia data, we use ElasticSearch, a distributed search and analytics engine, as the document store and document retriever, which supports very fast full-text searches. The similarity scoring function used in ElasticSearch is the BM25 algorithm (Robertson and Zaragoza, 2009), which measures the relevance of documents to a given search query. As defined in Appendix C, the larger the BM25 score, the stronger the relevance between document and query.
Specifically, for each question Q_i and each candidate option O_ij where j ∈ {A, B, C, D, E}, we form the search query Q_iO_ij = Q_i + O_ij and submit it to ElasticSearch; this is repeated for all five options. The document with the highest BM25 score returned by each query is selected as supporting material for the machine reading comprehension stage.
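The retrieval step can be illustrated with a self-contained re-implementation of BM25 scoring over a toy corpus (an illustrative sketch, not ElasticSearch's internals; we use the common defaults k1 = 1.2 and b = 0.75, and whitespace tokenization stands in for the Chinese word segmentation a real deployment would need):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, df, n_docs, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 (Robertson and Zaragoza, 2009). df maps each term to
    its document frequency; avgdl is the average document length."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        tf_norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * tf_norm
    return score

def retrieve_top1(question, option, docs):
    """Form the query Q_i + O_ij and return the highest-scoring document."""
    corpus = [d.split() for d in docs]
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    df = Counter(t for d in corpus for t in set(d))
    query = (question + " " + option).split()
    return max(docs, key=lambda d: bm25_score(query, d.split(), df, n_docs, avgdl))
```

Running `retrieve_top1` once per option yields the five supporting documents passed to the reader.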

Control Methods
In general, each option of a multi-choice question should have the same probability of being correct, but in practice the positions in which correct options appear are not completely random, and the more options there are, the lower the degree of randomization (Poundstone, 2014). Given the complex nature of multi-choice tasks, we employ three control methods to ensure a fair comparison among open-domain QA models.
Random A′ = Random(O). For each question, an option is randomly chosen as the answer from the five candidate options. We run this experiment five times and average the results as the Random baseline.

Fixed A′ = O_j, j ∈ {A, B, C, D, E}. For each question, the j-th option is always chosen as the answer, to obtain the accuracy distribution over the five candidate options.

Mixed A′ = Mixed(O). Incorporating previous experience with the NMLEC and multi-choice task work (Vilares and Gómez-Rodríguez, 2019), the Mixed method simulates how humans solve uncertain questions, and consists of the following three strategies: (1) the correct rate of choosing "All of the options above are correct/incorrect" is much higher than for the other options. (2) Supposing the lengths of the options are roughly equal, if only one option is obviously longer, with more detailed and specific descriptions, or obviously shorter than the others, then choose this option. (3) The correct option tends to appear in the middle of the candidate options. The three strategies are applied in turn; if any strategy matches, the option that matches is chosen as the answer.
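The Random and position-fixed baselines can be sketched as follows (a minimal illustration; the "answer" field is our assumed key for the gold option letter):

```python
import random

def random_baseline(questions, runs=5, seed=0):
    """Average accuracy of uniform random guessing over several runs."""
    options = ["A", "B", "C", "D", "E"]
    accs = []
    for r in range(runs):
        rng = random.Random(seed + r)
        hits = sum(rng.choice(options) == q["answer"] for q in questions)
        accs.append(hits / len(questions))
    return sum(accs) / runs

def fixed_baseline(questions, j):
    """Always choose the j-th option (j in A..E) to measure the
    accuracy distribution over option positions."""
    return sum(q["answer"] == j for q in questions) / len(questions)
```

Comparing `fixed_baseline` across the five positions exposes any bias in where the correct option is placed.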

Fine-Tuning Pre-Trained Language Models
We apply a unified framework, UER-py (Zhao et al., 2019), to fine-tune pre-trained language models on the machine reading comprehension task as our reader. We consider the following five pre-trained language models: Chinese BERT-Base (denoted as BERT-Base) and Multilingual Uncased BERT-Base (denoted as BERT-Base-Multilingual) (Devlin et al., 2019), Chinese BERT-Base with whole word masking pre-trained over larger corpora (denoted as BERT-wwm-ext) (Cui et al., 2019), and the robustly optimized BERTs: Chinese RoBERTa-wwm-ext and Chinese RoBERTa-wwm-ext-large (Cui et al., 2019). For each question Q_i, we concatenate the retrieved document, the question, and each option O_ij into one input sequence, joined by [SEP], which is the sentence separator in pre-trained language models. We pass each of the five options in turn, and the model outputs the hidden state representation S_ij ∈ R^{1×H} of the input sequence, then performs classification and outputs an unnormalized log probability P_ij ∈ R of each option O_ij being correct via P_ij = S_ij W^T, where W ∈ R^{1×H} is the weight matrix. Finally, we pass the unnormalized log probabilities of the options through a softmax layer and take the option with the highest probability as the predicted answer A′_i.
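The final answer-selection step can be sketched as follows (only the softmax over the five per-option logits P_ij is shown; computing S_ij and W is left to the pre-trained model):

```python
import math

def predict_answer(option_logits):
    """Softmax-normalize the unnormalized log probabilities P_ij of
    the five options and return the most probable option as A'_i."""
    m = max(option_logits.values())  # subtract max for numerical stability
    exps = {j: math.exp(p - m) for j, p in option_logits.items()}
    z = sum(exps.values())
    probs = {j: e / z for j, e in exps.items()}
    return max(probs, key=probs.get), probs
```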

Experimental Settings
We conduct detailed experiments and analyses to investigate the performance of control methods and open-domain QA methods on MLEC-QA. As shown in Figure 2, we implement a two-stage retriever-reader framework: (1) a retriever first retrieves question-relevant documents from Chinese Wikipedia using ElasticSearch, (2) and then a reader employs machine reading comprehension models to generate answers from the documents retrieved by the retriever. For the reader, all machine reading comprehension models are trained for 12 epochs with an initial learning rate of 2e-6, a maximum sequence length of 512, and a batch size of 5. The parameters are selected based on the best performance on the development set, and we keep the default values for the other hyper-parameters (Devlin et al., 2019). We use accuracy as the evaluation metric, and report baseline results against the human pass mark (60%) instead of human performance, because human performance varies widely, from almost full marks to failing the exam.
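The evaluation metric is plain accuracy, which can be stated precisely as (a trivial sketch; the id-to-letter dictionaries are our assumed representation of model outputs and gold labels):

```python
def accuracy(predictions, gold):
    """Fraction of questions whose predicted option letter equals the
    gold option letter; both arguments map question ids to letters."""
    assert predictions.keys() == gold.keys()
    return sum(predictions[q] == gold[q] for q in gold) / len(gold)
```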

Retrieval Performance
The main drawback of the Chinese Wikipedia database in biomedicine is that it is not comprehensive and thorough; that is, it may not provide complete coverage of all subjects. To evaluate whether the retrieved documents cover enough evidence to answer questions, we sampled 5% (681) of the questions from the development sets of the five categories using stratified random sampling, and had each question manually annotated by five medical experts with 3 labels: (1) Exact Match (EM): the retrieved documents exactly match the question.
(2) Partial Match (PM): the retrieved documents partially match the question, can be confused with the correct options or are incomplete.
(3) Mismatch (MM): the retrieved documents do not match the question at all. Table 7 lists the performance of the retrieval strategy as well as the annotation results for KQ and CQ questions on the five subsets. From the table, we make the following observations. First, most retrieved documents are PM with respect to the questions, while the matching rates of EM and MM reach maximums of 20.83% (CWM) and 50% (PH), respectively. Second, the matching rate of CQ is higher than that of KQ on most subsets, as CQ are usually related to simpler concepts and use more words to describe the question, which makes retrieval easier. By contrast, KQ usually involve more complex concepts that may not be covered by the Chinese Wikipedia database; therefore, the mismatch rate of KQ is significantly higher than that of CQ. Third, among the subsets, retrieval performance is best on Cli, as clinical medicine is more "general" and thus easier to retrieve than other specialties, whereas it is worst on PH, as Public Health often involves easily confused concepts, which leads to poor retrieval performance.

Baseline Results
Table 8 and Figure 3 show the performance of the baselines, as well as the performance on KQ and CQ questions. Among the control methods, the correct option has a slight tendency to appear in the middle (C and D) of the candidate options, but the margins are small. The Mixed method performs slightly better than random guessing, which indicates that the flexible use of guessing skills can give humans an extra edge by excluding certain wrong options, but it is impossible to pass the exam through opportunistic guessing alone. RoBERTa-wwm-ext-large and BERT-wwm-ext perform better than the other models on the five subsets. However, even the best-performing model only achieves accuracies between 40% and 55% on the five subsets, so there is still a gap to passing the exams. Comparing the performance between KQ and CQ questions, most models achieve better performance on CQ, which is positively correlated with CQ's better retrieval performance. Among the subsets, TCM is the easiest (54.95%) to answer across the board, while PH is the hardest (40.04%), which does not fully correspond to their retrieval performance as shown in Table 7. A possible reason is that the diagnosis and treatment of diseases in traditional Chinese medicine are characterized by "Homotherapy for Heteropathy", that is, treating different diseases with the same method, which may produce patterns or mechanisms that the models can exploit.

Comparative Analysis
Given that we use the Chinese Wikipedia database as our information source and apply a two-stage retriever-reader framework, the reasons for the poor baseline performance could come from both the information sources and the retriever-reader framework.

Information Sources
Both books and Wikipedia have been used as information sources in previous research. One of our subsets, Clinic, has been studied by MEDQA (Jin et al., 2020) as a subset (MCMLE) for cross-lingual research. MEDQA uses 33 medical textbooks as its information sources, and its evaluation shows that the collected text materials can provide enough information to answer all the questions in MCMLE. We compare the performance of the best model (RoBERTa-wwm-ext-large) on both datasets, as shown in Table 9. Notably, questions in MCMLE have four candidate options because one of the wrong options was deleted; therefore, the random accuracy on MCMLE is higher than ours.
From the results we can see that even with materials covering 100% of the questions, the best model only achieves 16.88% higher accuracy on the test set than ours, which indicates that using Wikipedia as the information source is not substantially worse than using medical books, and that the main cause of the baseline performance may be machine reading comprehension models that lack sophisticated reasoning ability.

Retriever-Reader
We also perform an experiment in which we sampled 5% (92) of the questions from the development set of Public Health, and had each question manually annotated by a medical expert to determine whether it can be exactly or partially matched by the top K retrieved documents, as shown in Table 10. Notably, the actual number of retrieved documents is 5×K, as we form the query Q_iO_ij = Q_i + O_ij for each of the five options.
From the results, we can see that retrieving more documents mainly introduces noise, as the best-matching documents have already been fetched among the top 1 documents. This indicates that the poor performance of machine reading comprehension models comes from insufficient reasoning ability rather than from the number of retrieved documents.

Conclusion
We present the largest-scale Chinese multi-choice BQA dataset, MLEC-QA, which contains five biomedical sub-fields with extra materials (images or tables) annotated by human experts: Clinic, Stomatology, Public Health, Traditional Chinese Medicine, and Chinese Western Medicine. The questions come from the examinations (NMLEC) that assess the qualifications of medical practitioners in the Chinese healthcare system, and require specialized domain knowledge and multiple reasoning abilities to be answered. We implement eight representative control methods and open-domain QA methods in a two-stage retriever-reader framework as baselines. The experimental results demonstrate that even the current best approaches cannot achieve good performance on MLEC-QA. We hope MLEC-QA can benefit researchers in improving open-domain QA models, and also advance BQA systems.

F.1 CURATION RATIONALE
In order to benefit researchers in improving open-domain QA models, and also to advance Biomedical Question Answering (BQA) systems, we present MLEC-QA, the largest-scale Chinese multi-choice BQA dataset to date.

F.2 LANGUAGE VARIETY
The data is represented in Simplified Chinese (zh-Hans-CN), and was collected from the 2006 to 2020 NMLEC, as well as practice exercises from the Internet.

F.3 SPEAKER DEMOGRAPHIC
Since the data is designed by a team of anonymous human healthcare experts, we are not able to reach them directly, and thus could not ask for demographic information. It is expected that most of them come from China, are professionals working in the area of biomedicine, and speak Chinese as a native language. No direct information is available about age and gender distribution.

F.4 ANNOTATOR DEMOGRAPHIC
The experiments involve annotations from 5 medical experts who hold at least a master's degree and have passed the NMLEC. They ranged in age from 28 to 45 years, included 3 men and 2 women, and all come from China and speak Chinese as a native language.

F.5 SPEECH SITUATION
All questions in MLEC-QA are collected from the National Medical Licensing Examination in China (NMLEC), which are carefully designed by human experts to evaluate professional knowledge and skills for those who want to be medical practitioners in China.

F.6 TEXT CHARACTERISTICS
The topics included in MLEC-QA span 5 biomedical sub-fields: Clinic, Stomatology, Public Health, Traditional Chinese Medicine, and Traditional Chinese Medicine Combined with Western Medicine.