ExplainCPE: A Free-text Explanation Benchmark of Chinese Pharmacist Examination

As ChatGPT and GPT-4 spearhead the development of Large Language Models (LLMs), more researchers are investigating their performance across various tasks. However, more research is needed on the interpretability of LLMs, that is, their ability to generate reasons for a given answer. Existing explanation datasets consist mostly of English-language general knowledge questions, which leads to insufficient thematic and linguistic diversity. To address the language bias and the lack of medical resources among rationale-generation QA datasets, we present ExplainCPE (over 7K instances), a challenging medical benchmark in Simplified Chinese. We analyze the errors of ChatGPT and GPT-4, pointing out the limitations of current LLMs in text understanding and computational reasoning. In our experiments, we also find that different LLMs have different preferences for in-context learning. ExplainCPE presents a significant challenge, but its potential for further investigation is promising, and it can be used to evaluate a model's ability to generate explanations. AI safety and trustworthiness need more attention, and this work takes a first step towards exploring the medical interpretability of LLMs. The dataset is available at https://github.com/HITsz-TMG/ExplainCPE.


Introduction
Advancements in the field of Large Language Models (LLMs) (Radford et al., 2019; Brown et al., 2020; Ouyang et al., 2022), exemplified by models such as GPT-4 (OpenAI, 2023), have opened up new possibilities and challenges across a myriad of natural language processing (NLP) tasks (Wei et al., 2022a). These models have shown remarkable success in understanding and generating human-like text, promoting research that spans a wide array of applications (Bubeck et al., 2023).
A critical aspect that remains underexplored is the interpretability of these models, specifically the ability to provide accurate and faithful rationales for their decisions (Wei et al., 2022b; Turpin et al., 2023). The degree to which these models can explain their reasoning is of crucial significance, especially in high-stakes domains such as healthcare, where the clarity of generated responses directly impacts decision-making and outcomes (Rudin, 2019).
Current benchmarks for LLMs mostly focus on exam performance, such as MMLU (Hendrycks et al., 2021) and AGIEval (Zhong et al., 2023). These datasets do not allow for a detailed assessment of LLMs' justifications of their decisions, because high-quality, professional explanations are unavailable. Moreover, accurately measuring the explainability of these LLMs is difficult due to the lack of comprehensive and standardized datasets drawn from unbiased and trustworthy sources (Li et al., 2023). Existing benchmarks are predominantly drawn from online forums and consumer feedback and consist only of English-language general knowledge questions (Wiegreffe and Marasovic, 2021), which results in insufficient thematic and linguistic diversity. Overall, the lack of appropriate evaluation datasets has prevented a full understanding of LLMs' strengths and weaknesses in the field of interpretability.
To address this gap, we introduce ExplainCPE, a challenging medical benchmark dataset in Chinese, encompassing over 7K instances. This dataset, specifically tailored to evaluate the capacity for model explainability, diversifies the linguistic scope of interpretability research and allows for a rigorous assessment of model performance in a specialized, high-stakes domain. An example from our dataset is presented in Table 1.

Table 1: A translated example from our ExplainCPE dataset with the responses of GPT-4 and ChatGPT (✓: correct answer option). Blue text marks the answer given in each response. Red text marks the error in the GPT-4 response: the reason for choosing the option is breast cancer rather than old age.

Our in-depth analysis of LLM performance on ExplainCPE brings to light several critical observations. First, we find substantial limitations in these LLMs' understanding of medical text and in their ability to execute computational reasoning effectively. For example, only GPT-4 passed the Chinese Pharmacist Examination, with 75.7% accuracy, while other models such as ChatGPT failed. Through a case analysis of GPT-4 and ChatGPT, we found that the explanations generated by LLMs still exhibit flaws such as contradictions, insufficient analysis, and confused logic; improving interpretability should therefore be a priority for future LLM development. Furthermore, we report heterogeneous preferences for in-context learning among different LLMs, suggesting varying strategies for explanation generation. For example, models with weak chatting ability, such as BELLE (Ji et al., 2023b,a), are more sensitive to the number of few-shot examples than models with strong chatting ability, such as ChatGPT. To the best of our knowledge, we are the first to propose a free-text explanation benchmark based on a Chinese medical examination and to further explore the interpretability of LLMs in the medical field. We provide a baseline for future research on explanation generation, and this dataset can also be used to improve the interpretability of large language models. As the broader issues of AI safety and trustworthiness gain traction, our work represents a pioneering step towards enhancing the medical interpretability of LLMs, underscoring the urgent need to develop AI that is not only intelligent but also transparent, robust, unbiased, and reliable.
Our main contributions can be summarized as follows: • We introduce ExplainCPE, a challenging benchmark for generating free-text explanations in Chinese medical QA, which provides a baseline for future research on explanations generated by LLMs and can be used to study how to improve a model's ability to generate explanations.
• We analyze the basic attributes of the dataset, such as the average length of questions, options, and explanations. Additionally, we examine the high-level categories of questions, which can assist researchers in understanding the distribution of categories in ExplainCPE and the interpretability performance of the models.
• We conduct experiments on the ExplainCPE dataset to demonstrate its effectiveness and feasibility. Our findings reveal that different LLMs exhibit varying preferences for in-context learning. We analyze error cases and identify limitations of current LLMs, which can serve as directions for future development.

Related Work

Medical Question Answering
In the medical domain, addressing questions can be particularly challenging due to their specialized and complex nature. Consequently, community efforts have been directed towards advancing biomedical question-answering systems, such as BioASQ (Tsatsaronis et al., 2012, 2015). Another system, SeaReader (Zhang et al., 2018), was proposed to answer clinical medical questions by leveraging documents extracted from medical publications. In a study by Yue et al. (2020), the authors performed a comprehensive analysis of the emrQA (Pampari et al., 2018) dataset to evaluate the capacity of QA systems to utilize clinical domain knowledge and generalize to novel questions. Furthermore, Jin et al. (2019) introduced PubMedQA, a dataset whose questions are derived from article titles and can be answered using the corresponding abstracts. Aggarwal et al. (2021) develop a first-of-its-kind dataset named ECQA, comprising human-annotated positive and negative properties, as well as free-flow explanations, for 11,000 question-answer pairs derived from the CQA dataset. Ye and Durrett (2022) assess the performance of four LLMs across three textual reasoning datasets utilizing prompts containing explanations in multiple styles. Their findings indicate that explanations judged high-quality by humans are more likely to coincide with accurate predictions.

LLMs Benchmarks
New NLP benchmarks are urgently needed to keep pace with the rapid development of LLMs. MMLU (Hendrycks et al., 2021) is a collection of English-language materials that encompasses knowledge from 57 different disciplines, including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem-solving ability. Another significant contribution to this field is C-EVAL (Huang et al., 2023), which represents the first comprehensive effort to evaluate foundation models' knowledge and reasoning capabilities within a Chinese context. C-EVAL consists of multiple-choice questions designed to assess performance across four difficulty levels: middle school, high school, college, and professional. These questions cover 52 diverse disciplines, spanning the humanities to science and engineering, thereby providing a holistic evaluation of a model's capabilities. Zhang et al. (2023) introduce the GAOKAO-Benchmark (GAOKAO-Bench), an intuitive benchmark that employs questions from the Chinese Gaokao examination as test samples for evaluating LLMs. Most benchmarks focus on evaluating how well LLMs answer questions, but few focus on the ability of LLMs to explain the answers they give.

Dataset Collection
The National Licensed Pharmacist Examination in China, collaboratively administered by the Ministry of Personnel and the State Food and Drug Administration, serves as the basis for our question set.
In order to evaluate the performance and generalizability of our models, we compiled a test set using examples from the previous two years (2020-2021) of the official examination. Each official question's explanation is sourced from the official examination solutions. Additionally, we collected over 7,000 instances from various sources, including the internet and exercise books. Each instance in the ExplainCPE dataset is a multiple-choice question with five options.
In addition to the official questions, we collaborated with three doctoral students from Peking Union Medical College (all of whom have undergone standardized residency training). They manually reviewed 320 samples from the collected data to evaluate the completeness and accuracy of the labels and explanations. The review found accuracy rates of 99.4% for labels and 99.0% for explanations, with 318 and 317 of the 320 samples deemed correct, respectively.
Following the removal of duplicate and incomplete questions (e.g., those lacking answers or options), we randomly divided the remaining instances into training and development sets at a predetermined ratio. To further enhance the quality of our dataset, we inspected pairs of instances with a normalized edit distance of less than 0.1 and manually removed questions that used different words to convey the same meaning.
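The paper does not release its deduplication script, but the described filter can be sketched directly from the text: compute a normalized Levenshtein distance and flag pairs below the 0.1 threshold for manual review. The function names below are our own, and a real pipeline would likely block candidate pairs by length rather than compare all pairs.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


def flag_near_duplicates(questions: list, threshold: float = 0.1) -> list:
    """Return index pairs whose normalized distance is below threshold.

    Flagged pairs would then be inspected manually, as described above.
    """
    pairs = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            if normalized_distance(questions[i], questions[j]) < threshold:
                pairs.append((i, j))
    return pairs
```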

Data Statistic
The training, development, and test sets comprise 6,867, 500, and 189 questions, respectively, with average lengths of 28.31, 28.44, and 37.79 words. A summary of the dataset statistics can be found in Table 2. Figure 1 illustrates the distribution of question and explanation lengths across the training, development, and test sets.

Data Analysis
In order to investigate the properties of the ExplainCPE dataset, we primarily focus on the diversity of questions in this subsection. Our aim is to determine which categories of problems LLMs excel at handling. To achieve this, we performed a multi-level classification of the dataset, comprising three levels.
At the first level, questions are classified into positive and negative categories. Positive questions, also called direct questions, prompt the respondent to select the correct option, while negative questions require identifying the incorrect option among those provided.
At the second level, questions are categorized into 7 groups: logical reasoning, drug knowledge, scenario analysis, mathematical calculation, disease knowledge, general knowledge, and others.
Finally, at the third level, questions are classified into 14 categories based on their content: anti-inflammatory, infection, tumor, anesthesia, cardiovascular, weight loss, orthopedics, nervous system, respiratory system, digestive system, urinary system, endocrine, immune system, and others.
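For concreteness, the three-level scheme can be represented as a simple annotation record; the field and constant names below are our own shorthand, not the released dataset's schema.

```python
from dataclasses import dataclass

# Level 1: select the correct option vs. identify the incorrect option.
LEVEL1_POLARITY = ("positive", "negative")

# Level 2: seven question-type groups.
LEVEL2_TYPES = (
    "logical reasoning", "drug knowledge", "scenario analysis",
    "mathematical calculation", "disease knowledge", "general knowledge",
    "others",
)

# Level 3: fourteen content areas.
LEVEL3_CONTENT = (
    "anti-inflammatory", "infection", "tumor", "anesthesia", "cardiovascular",
    "weight loss", "orthopedics", "nervous system", "respiratory system",
    "digestive system", "urinary system", "endocrine", "immune system",
    "others",
)


@dataclass
class QuestionAnnotation:
    polarity: str      # one of LEVEL1_POLARITY
    qtype: str         # one of LEVEL2_TYPES
    content_area: str  # one of LEVEL3_CONTENT
```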
We randomly selected 1,200 instances from the training and development sets and manually assigned a three-level classification to each question. The proportional distribution of each category within the dataset is presented in Figure 2; a more detailed breakdown is given in Appendix B.

Prompting
Prompting has a significant impact on the output of generative language models, so we standardized the structure of our prompts. To better analyze the performance and interpretability of the language models, we designed prompts that request an answer option along with an explanation on the test set. An example of the template and a fully instantiated prompt can be found in Appendix A. Two types of prompt templates were utilized: with and without instructions. This design was intended to explore the influence of instructions on different models. In the zero-shot setting, the few_shot_example slot was left blank. It should also be noted that, in the zero-shot setting, prompts without instructions are the same as prompts with instructions.
To investigate the impact of in-context examples on model performance, we designed prompts with different numbers of few-shot examples: zero-shot, one-shot, four-shot, and eight-shot. For one-shot prompts, we randomly selected a single instance from the training set. For four-shot and eight-shot prompts, we manually selected instances with varying question types to keep model predictions balanced. Note that the few-shot examples were the same for all models in each respective prompt type.
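A minimal sketch of how the two template variants and the few_shot_example slot could be assembled is shown below. The instruction wording and field names are illustrative only; the released templates are the ones reproduced in Appendix A (Tables 6 and 7).

```python
# Hypothetical prompt builder mirroring the with/without-instruction
# variants and the few_shot_example slot described above.
INSTRUCTION = (
    "The following is a multiple-choice question from the Chinese "
    "Pharmacist Examination. Give the correct option and explain why."
)


def format_instance(q: dict, with_answer: bool) -> str:
    """Render one question; demonstrations carry their gold answer/explanation."""
    options = "\n".join(f"{key}. {text}" for key, text in sorted(q["options"].items()))
    block = f"Question: {q['question']}\n{options}\n"
    if with_answer:
        block += f"Answer: {q['answer']}\nExplanation: {q['explanation']}\n"
    return block


def build_prompt(test_q: dict, shots: list, with_instruction: bool) -> str:
    """Assemble an n-shot prompt; `shots` is empty in the zero-shot setting."""
    parts = [INSTRUCTION] if with_instruction else []
    parts += [format_instance(s, with_answer=True) for s in shots]
    parts.append(format_instance(test_q, with_answer=False) + "Answer:")
    return "\n\n".join(parts)
```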

Model Comparison
To compare the performance of different models, we evaluated several LLMs on our test dataset. LLMs can be classified as chat or non-chat models, depending on their ability to engage in human-like conversation. Chat models, which are pre-trained on vast amounts of data and fine-tuned through reinforcement learning from human feedback (RLHF), include GPT-4 (OpenAI, 2023), ChatGPT (OpenAI, 2022), ChatGLM-6B (Du et al., 2022; Zeng et al., 2023), and ChatYuan (ClueAI, 2023). Non-chat models, on the other hand, are typically pre-trained on unsupervised plain text and fine-tuned on code or instructional data, but lack sufficient RLHF to enable human-like conversation; examples include GPT-3 (Brown et al., 2020; Ouyang et al., 2022) and BELLE (Ji et al., 2023b,a). Consequently, non-chat models are more inclined to predict the next word or complete a given task than to engage in conversation. In this section, we provide a brief introduction to the LLMs used in our experiments.
• ChatGPT (OpenAI, 2022) is a large language model with hundreds of billions of parameters, specifically designed for human-like conversation across a wide range of topics. ChatGPT's text understanding ability is derived from language model pre-training, its reasoning ability from code pre-training, its logical reasoning ability from supervised instruction training, and its dialogue ability from RLHF.
• GPT-4 (OpenAI, 2023) represents the latest milestone in OpenAI's deep learning scaling efforts; it is a large multimodal model that exhibits human-level performance on various professional and academic benchmarks. GPT-4 outperforms ChatGPT on most tasks.
• GPT-3 (Ouyang et al., 2022) is a series of models. In this paper, we refer to text-davinci-003 simply as GPT-3. Text-davinci-003 performs language tasks with better quality, longer output, and more consistent instruction-following than earlier GPT-3 variants.

• ChatGLM-6B (Du et al., 2022; Zeng et al., 2023) is an open-source bilingual (Chinese and English) dialogue language model. Utilizing technology similar to ChatGPT, it is optimized for Chinese question answering and dialogue. After pre-training on about 1T tokens of Chinese and English text, supplemented by supervised fine-tuning, feedback bootstrapping, reinforcement learning from human feedback, and other techniques, ChatGLM-6B, with 6.2 billion parameters, can generate answers that closely align with human preferences.

Results
One of the main objectives of our dataset is to evaluate the interpretability of models by assessing the quality of the generated text. Therefore, we not only measured the accuracy of the models but also required a useful and efficient evaluation metric.
Evaluating interpretability is a long-standing problem due to the diversity of interpretation forms and content.
As suggested by Wiegreffe et al. (2022), the quality of explanations generated by the joint method needs further verification. We therefore chose two methods to evaluate the interpretability of the models: automatic metrics and human evaluation. For the automatic metrics, we used Rouge to measure the quality of the explanations and accuracy to measure answer performance. Due to input length limits, we could not conduct eight-shot experiments for some models, such as GPT-3. Moreover, some models did not respond with both answers and explanations even when requested, which is why they lack a Rouge score.
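Rouge-L, the variant reported later, is driven by the longest common subsequence (LCS) between the generated and reference explanations. The following is a self-contained sketch over character sequences; character-level matching is a common choice for Chinese text, though the paper does not specify its tokenizer, so treat this as an assumption.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y, O(|x|*|y|)."""
    dp = [0] * (len(y) + 1)
    for xi in x:
        prev_diag = 0  # holds dp[i-1][j-1] as the row is updated in place
        for j, yj in enumerate(y, 1):
            prev_diag, dp[j] = dp[j], max(dp[j], dp[j - 1], prev_diag + (xi == yj))
    return dp[-1]


def rouge_l_f1(candidate: str, reference: str, beta: float = 1.2) -> float:
    """Rouge-L F-score over character sequences."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


def accuracy(predicted: list, gold: list) -> float:
    """Fraction of questions whose predicted option matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```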

Automatic Metrics
The performance of each model in each setting can be found in Appendix C. On the interpretability-related automatic metrics, GPT-4 achieved the best explanation-generation results with a Rouge-L score of 0.247, followed by ChatGPT, ChatGLM-6B, and BELLE-7B-2M. ChatGLM-6B yielded unexpectedly strong results, with high accuracy and Rouge scores despite its relatively small parameter size.
We plotted line charts of model performance as a function of the number of few-shot examples (Figure 3): the left chart covers chat models and the right chart non-chat models. From the figure, we identify three key findings.
First, it is evident from Figure 3 that, regardless of model size or whether instructions are given, going from zero-shot to one-shot usually yields a significant performance improvement, larger than that of any subsequent increase in the number of few-shot examples.
Second, comparing chat and non-chat models: GPT-3 has a large number of parameters but weak dialogue ability, while GPT-4 and ChatGPT have strong dialogue ability. Regardless of whether instructions are provided, the performance of GPT-3 increases with the number of few-shot examples, whereas GPT-4 and ChatGPT tend to reach their maximum performance in the one-shot setting. This suggests that for a model with many parameters and strong dialogue ability, the one-shot setting is a good choice; conversely, for models with weak dialogue ability, performance is roughly proportional to the number of few-shot examples.
Third, comparing the two charts: the models on the left have strong dialogue ability, and for them, given the same number of few-shot examples, providing instructions is better than not providing them. The models on the right have weak dialogue ability, and for them the opposite holds: not providing instructions is better.

Human Evaluation
From the perspective of interpretability, the Rouge metric has clear limitations for evaluating model explanations, so we also used human evaluation to assess the qualitative properties of the generated explanations. Following Monsen and Rennes (2022), Wiegreffe et al. (2022), and Kunz et al. (2022), we asked annotators to rate each explanation e from 1 to 5 according to the following questions (a small rating-aggregation sketch follows the list):
• Is e a well-formed sentence?
• Does e support the label?
• Is the content of e factually correct?
• Does e provide a valid reasoning path for the label?
• Does e add new information, rather than recombining information from the input?
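A minimal sketch of how such per-explanation ratings could be aggregated into the per-model scores reported in Table 4; the criterion keys are our own shorthand for the five questions above, not names from the paper.

```python
from statistics import mean

# Shorthand labels for the five rating questions above (hypothetical keys).
CRITERIA = ("well_formed", "supports_label", "factual", "valid_reasoning", "new_info")


def aggregate_ratings(ratings_by_model: dict) -> dict:
    """ratings_by_model maps a model name to a list of per-explanation
    ratings, each a dict {criterion: score in 1..5}. Returns the mean
    score per model and criterion."""
    return {
        model: {c: mean(r[c] for r in ratings) for c in CRITERIA}
        for model, ratings in ratings_by_model.items()
    }
```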
Due to the low accuracy of some models and the poor quality of their generated explanations, we manually evaluated only a subset of the models; the results are presented in Table 4. As expected, GPT-4 remains the best performer. Notably, GPT-4 and ChatGPT perform identically on well-formedness and support, indicating that both can comprehend the question's requirements, provide the label, and generate a complete, coherent explanation that supports it. However, GPT-4 outperforms ChatGPT in the correctness of the explanation, the validity of the reasoning process, and novelty. ChatGLM lags behind ChatGPT and GPT-4 on all five criteria. We also asked GPT-4 to evaluate the responses; the relative scores across metrics are consistent with the human evaluation results.

Error Analyses
In this subsection, we analyze the performance of the models both overall and through specific examples. Table 5 displays the performance of the models on different types of questions. Notably, GPT-4 and ChatGPT perform better on negative questions, while the other models perform better on positive questions. Moreover, GPT-4 improves on logical reasoning questions, whereas the other models do not. GPT-4 also improves on scenario analysis questions while the other models decline; conversely, GPT-4 declines on general knowledge questions while the other models improve. GPT-4 correctly solves two mathematical calculation questions, whereas ChatGPT fails on all of them. These findings suggest that GPT-4 has stronger logical reasoning, scenario analysis, and mathematical calculation abilities than the other models. The superior performance of GPT-4 and ChatGPT on negative questions indicates their better understanding of the question text.
We analyze specific error cases of ChatGPT and GPT-4 to identify the limitations of current LLMs; Appendix D details the reasons for the explanation errors. Although the results of LLMs are impressive, they are not yet perfect. In Example 1, GPT-4 provides the correct answer but a wrong explanation, an error that is difficult to detect; such failures deserve close attention before these models see widespread use. In Example 2, although the model has some calculation capability, the reliability of its calculations is still not guaranteed. In Example 3, neither GPT-4 nor ChatGPT fully comprehends the detailed requirements of the question, leading to errors. LLMs therefore still have room for improvement in text comprehension and explanation generation.

Conclusion
In this work, we propose ExplainCPE, a challenging medical dataset for evaluating natural language explanations. Our study on the ExplainCPE dataset demonstrates the potential of LLMs in medical question answering with explanations. Our analysis of model performance on different types of questions reveals the strengths and limitations of different LLMs in terms of in-context learning. The error cases point to the need for further improvement of LLMs in explanation generation and text comprehension. Future work can use our dataset to improve and evaluate model interpretability.

Limitations
Due to the lack of interpretability benchmarks in the medical domain, we present ExplainCPE in this paper. While there are many explainability methods, we contribute only to explanation generation. Moreover, most current interpretability methods target classification tasks; for LLMs that generate free-form responses, new interpretability methods are needed. We explore the abilities of LLMs in medical diagnosis and interpretability. While model performance can be assessed well by accuracy, automatic assessment of interpretability is still lacking.
However, our analysis of the ExplainCPE dataset is only a preliminary exploration, and there is still much room for further research and development. For example, future work can focus on improving the quality and diversity of the explanations in the dataset, expanding the coverage of medical knowledge, and exploring new evaluation metrics for interpretability. In addition, more advanced LLMs can be developed to further improve the performance of medical question answering with explanations by utilizing the data in the training set. We believe that the ExplainCPE dataset can serve as a valuable resource for the research community to advance the fields of medical question answering and LLMs.

Ethical Statement
This paper proposes a dataset of explanations for medical question answers. The data are all drawn from exercises related to the Chinese Pharmacist Examination, and the cases in the exercises are fictitious; they contain no personal privacy, discriminatory, or offensive content. In terms of impact, this dataset can be used to improve explanation ability in medical diagnosis, reduce misdiagnosis, and contribute to intelligent medical care.

A Prompting Template
There are two types of prompt templates: with instruction and without instruction. The templates are shown in Table 6, and an instantiated example in Table 7.

B Distribution of Categories
In Figure 4, we show the proportional distribution of each category in the dataset in more detail.

C Performance Comparison
We also evaluated two popular medical LLMs, ChatGLM-Med and Huatuo-Llama-Med-Chinese, and found that they struggled with our multiple-choice questions, perhaps due to their training data or model size. Wang et al. (2023b) constructed a Chinese medical instruction dataset using a medical knowledge graph and the GPT-3.5 API, and on this basis instruction-tuned ChatGLM-6B to improve its question answering in the medical field, yielding Med-ChatGLM (Wang et al., 2023b). Based on the same data, we also trained a medical version of the LLaMA model called Huatuo (Wang et al., 2023a).
In Table 8, we show the results of each model under all settings, including the number of few-shot examples and the presence or absence of instructions.

Example explanation (apparent volume of distribution): Vd = total drug dose / plasma drug concentration. In this question, the total dose of the drug is 100 mg and the drug concentration is 5 μg/ml (note that the units must match). Converted to μg, the total dose is 100,000 μg, which substituted into the formula gives Vd = 100,000 μg / (5 μg/ml) = 20,000 ml = 20 L. Because drugs distribute not only in the blood but may also be distributed in cells, tissues, and organs, the apparent volume of distribution of common drugs is relatively large, generally ranging from tens to hundreds of liters (i.e., 0.03 L/kg to 0.8 L/kg). The answer to this question is option D; 50 L is within the apparent volume of distribution of common drugs.
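The arithmetic in the excerpt above can be checked mechanically; a quick sketch, with the values taken directly from the excerpt:

```python
# Sanity check of the worked example: Vd = total dose / plasma concentration,
# after converting milligrams to micrograms so the units match.
dose_mg = 100
concentration_ug_per_ml = 5

dose_ug = dose_mg * 1000                    # 100 mg = 100,000 ug
vd_ml = dose_ug / concentration_ug_per_ml   # 100,000 / 5 = 20,000 ml
vd_l = vd_ml / 1000                         # = 20 L

print(vd_l)  # 20.0
```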

Figure 1: Distribution of question and explanation lengths in ExplainCPE.

Figure 2: The distributions of proportions for each category at two levels in ExplainCPE. At the first level, positive questions account for the majority. At the second level, logic questions and knowledge questions account for the majority.
Figure 3: Performance comparison under different few-shot numbers. Left: models with chatting ability, such as GPT-4, ChatGPT, and ChatGLM-6B. Right: models without sufficient chatting ability, such as GPT-3 and BELLE. Each model has two settings: with and without instruction.

Figure 4: Proportional distribution of each category in the ExplainCPE dataset.

Table 2: ExplainCPE dataset statistics, where Q, A, and E represent the Question, Answer, and Explanation, respectively.

Table 4: Human evaluation results (top) and GPT-4 evaluation results (bottom) of different models from five perspectives.
Table 3 presents the best performance of each model on the test set, regardless of whether the prompt is consistent across models. Unsurprisingly, GPT-4 is the best-performing model, achieving 75.7% accuracy under its most suitable setting (one-shot without instruction). GPT-4 has thus demonstrated the ability to pass the National Licensed Pharmacist Examination in China, outperforming more than 80% of the people who take the examination.

Table 5: Performance of models on different types of samples in ExplainCPE.

Table 6: Prompt templates. The solid line separates templates with and without instruction, and the dotted line separates the Chinese and English versions.

Table 7: An instantiated example of the prompt template. The dotted line separates the Chinese and English versions.
Answer: B. If dryness of the skin and mucous membranes occurs during treatment, the dose should be reduced or the drug discontinued. Analysis: Isotretinoin capsules may cause adverse reactions during the treatment of acne. Contraceptive measures should be taken during treatment and within 3 months after drug withdrawal, because isotretinoin has a strong teratogenic effect (option A is correct). Dry skin or dry mucous membranes may occur during treatment, but dose reduction or drug withdrawal is not required; usually the symptoms can be relieved with mild skin care products, lip balm, artificial tears, etc. (option B is wrong). Adverse reactions are common, but most resolve after stopping the drug (option C is correct). The first 2 months of treatment may lead to abnormal liver enzymes, so liver function should be checked if necessary (option D is correct). Some patients experience transient exacerbation of the rash during weeks 2 to 4 of medication (option E is correct). Therefore, the incorrect statement is option B.

Error Analysis: GPT-4 is right; ChatGPT is wrong. The explanation given by ChatGPT is correct and conforms to the gold explanation, but the label it gives is wrong. This question examines the drug treatment of Parkinson's disease. 1. Early-onset patients without mental decline: (1) pramipexole, selegiline, or compound levodopa + a COMT inhibitor (entacapone) should be the first choice if the treatment guidelines of the United States and Europe are followed; (2) if the patient cannot afford high-priced drugs for economic reasons, amantadine is the first choice; (3) if, due to the demands of special work, the patient needs a marked improvement in motor symptoms, or suffers from cognitive decline, compound levodopa or compound levodopa + a COMT inhibitor (entacapone) is the first choice; (4) obvious tremor with poor efficacy of other drugs → trihexyphenidyl. 2. Late onset or accompanied by mental decline: (1) compound levodopa is generally the first choice; (2) anticholinergic drugs should be avoided where possible, especially for elderly male patients. Therefore, the answer to this question is E.

ChatGPT Response: Answer: B. Selegiline. Analysis: The main manifestation of Parkinson's disease is movement disorder; because Parkinson's disease is caused by the loss of dopaminergic neurons, dopamine receptor agonists are one class of drugs for treating it. Pramipexole is also an effective treatment for Parkinson's disease in younger patients, but it is less suitable for this patient because of its greater side effects in older patients. Amantadine works similarly to dopamine in treating Parkinson's disease, but it often causes side effects in elderly patients, so it is not suitable for this patient. Entacapone is a combination drug for the treatment of Parkinson's disease consisting of levodopa and entapazide, but it is prone to side effects such as lowered blood pressure and is not suitable for elderly patients with mild cognitive impairment. Therefore, B (selegiline) is the better choice; it is a selective dopamine D2 and D3 receptor agonist that will not aggravate dyskinesia symptoms such as muscle stiffness and tremor, and can improve cognitive impairment.