MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation

Curated datasets for healthcare are often limited due to the need for annotation by human experts. In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. MedEval is comprehensive, consisting of data from several healthcare systems and spanning 35 human body regions and 8 examination modalities. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering granular usage of the data and supporting a wide range of tasks. Moreover, we systematically evaluated 10 generic and domain-specific language models under zero-shot and fine-tuning settings, from domain-adapted baselines in healthcare to general-purpose state-of-the-art large language models (e.g., ChatGPT). Our evaluations reveal varying effectiveness of the two categories of language models across different tasks, from which we note the importance of instruction tuning for few-shot usage of large language models. Our investigation paves the way toward benchmarking language models for healthcare and provides valuable insights into the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future advancements.


Introduction
Recent advanced language models, e.g., GPT-3, ChatGPT, and LLaMa (Touvron et al., 2023a), are effective in various general tasks, suggesting their potential for healthcare use cases, such as alleviating the burden on human experts in decision-making and patient care. However, training, adapting, and evaluating these models requires high-quality domain-specific datasets, which are often challenging to obtain. Previous medical datasets have been collected from healthcare-related literature (Dernoncourt and Lee, 2017; Gupta et al., 2021; Jin et al., 2019b; Banarescu et al., 2013) or web pages on the Internet (McCreery et al., 2020; Ammar et al., 2018). While these datasets are large, they may lack quality and have heterogeneous topics (e.g., scientific literature about nutrition may offer limited help in the decision-making process of an X-ray analysis). On the other hand, high-quality clinical data is typically obtained by annotating records from healthcare systems like MIMIC-CXR (Johnson et al., 2019; Yan et al., 2021a). However, such data is either limited in size (Tsatsaronis et al., 2015) or covers only certain dominant systems and specific domains, such as chest X-rays (Johnson et al., 2016) or eye diseases (Otmakhova et al., 2022). Other approaches automatically generate medical corpora using templates (Pampari et al., 2018; Pappas et al., 2018) or language models (Guo et al., 2023; Tang et al., 2023), but these have been noted to be limited in diversity, complexity, and quality (Gupta et al., 2021).

1 We will release the dataset and source code at https://github.com/ZexueHe/MedEval. The dataset will also be available at the Department of Veterans Affairs Open Data Portal: https://www.data.va.gov
To tackle the aforementioned challenges and facilitate research in clinical NLP, we introduce MEDEVAL, a large-scale medical benchmark with multi-level curated labels for multiple tasks and multiple domains. MEDEVAL comprises 22,779 sentence-level datapoints from radiology reports, including expert-crafted classification labels (e.g., abnormality identification labels) and ground truth for generation tasks (e.g., disambiguated rewritings). Additionally, we include 21,228 complete reports with expert-annotated medical codes for disease classification (e.g., for ankle radiology studies) and golden output for generation tasks (e.g., summarization of radiology reports). Besides supporting multiple tasks at different levels, MEDEVAL's uniqueness also lies in its diverse data coverage of different body parts (such as chest, foot, and ankle) and different modalities (X-rays, ultrasound, etc.), and in the novel tasks and data collected from the U.S. Department of Veterans Affairs (VA) healthcare system nationwide. To the best of our knowledge, MEDEVAL represents the first expert-curated medical NLP benchmark that is both comprehensive and large-scale. MEDEVAL will be released to facilitate future research.
We further conduct a comprehensive evaluation of multiple state-of-the-art language model baselines, including domain-adapted PLMs followed by in-domain fine-tuning (e.g., fine-tuned BERT (Devlin et al., 2018)) and general-purpose LLMs used with few-shot in-context learning (e.g., ChatGPT). We evaluate their performance on sentence-level and document-level NLU and NLG tasks. We observe the effectiveness of both categories of models on different healthcare tasks, with LLMs using only few-shot learning achieving surprisingly comparable performance to domain-adapted PLMs in certain generation tasks. Our comprehensive evaluation indicates that language models are strong candidates for medical tasks whose data is already seen by, or similar to, their training data. Our investigation provides insights into the potential and limitations of LLMs in healthcare domains, guiding the appropriate use of LLM-assisted healthcare decision-making systems in the future. Overall, our contributions are summarized as:

• We propose a large-scale medical benchmark, namely MEDEVAL, with broad coverage of various tasks and domains to facilitate future research in clinical NLP.
• We provide expert annotations for multiple tasks with multi-granularity, from sentence classification and rewriting, to report classification and summarization.
• We systematically evaluate various language models, and shed light on the strengths and weaknesses of these models for healthcare applications.
Related Work

Medical Datasets Several works have been proposed based on the databases of large healthcare systems such as MIMIC (Johnson et al., 2016, 2019, 2023). For instance, MIMIC-CXR (Johnson et al., 2019) is a dataset consisting of pairs of radiology images and reports from chest X-ray exams. The MIMIC PERform Dataset (Charlton et al., 2022) comprises physiological signals related to critically-ill patients. Additionally, Edin et al. (2023) collected document-summary pairs labeled with diagnosis and procedure codes from Johnson et al. (2023). Though their collection may be one of the most comprehensive, many domains are not included. A more recent attempt to alleviate this incompleteness concern is M3 (Otmakhova et al., 2022), a multi-domain medical benchmark that incorporates multi-level expert annotations. However, by considering only studies in ophthalmology, M3 offers limited help toward a fully comprehensive solution. To complement these existing efforts, we collect a large-scale dataset of real medical reports from another healthcare system that offers broader coverage, including 35 human body regions and 8 examination modalities (e.g., X-ray, CT, etc.).
Language Models for Healthcare Large pre-trained language models are widely adopted to solve healthcare tasks. One line of research adapts general language models to the biomedical domain through continued training on domain-specific data and tasks. For instance, Yan et al. (2021b) enhanced BERT with contrastive learning for chest report generation. ClinicalBERT (Huang et al., 2019) was proposed by continually training BERT on clinical notes using masked language modeling, and Yan et al. (2022) developed RadBERT by continually training BERT on a vast collection of radiology reports. Other adaptations of BERT, such as BioBERT (Lee et al., 2020), BlueBERT (Peng et al., 2019), SciBERT (Beltagy et al., 2019), and BioMegatron (Shin et al., 2020), involved training on large publicly available medical corpora like PubMed or Semantic Scholar. Furthermore, language models of alternative architectures have also been employed, including BioELMo (Jin et al., 2019a), BioBART (Yuan et al., 2022), and BioMed-RoBERTa (Gururangan et al., 2020a). Another research direction capitalizes on the generalization capabilities of recent LLMs, addressing biomedical problems by prompting LLMs in zero-shot or few-shot settings. This approach has been used in various applications, such as medical report summarization (Otmakhova et al., 2022), medical writing (Biswas, 2023), and medical named entity recognition (Hu et al., 2023). In this work, we propose MEDEVAL, multi-level data with curated annotations at various granularities to comprehensively evaluate the strengths and limitations of LMs in healthcare.

Dataset Design
MEDEVAL (shown in Figure 2) is designed with multiple NLU and NLG tasks at both the sentence and document levels, based on medical data collected from two different healthcare databases. Our data covers diverse combinations of human body parts and examination modalities. We first introduce the data sources from which we collected the text input (Section 3.1). Then we present the expert-annotated ground truth labels created by our medical team (Section 3.2).

Input Data Composition
Sentence-Level Corpora The sentence-level corpora used in this study are sourced from two well-constructed datasets: the sentence-level OpenI-annotated dataset (Demner-Fushman et al., 2016), which consists of sentences from chest studies, and the VA-annotated dataset (He et al., 2023b), which includes sentences about different body parts examined by different modalities. These datasets have undergone de-identification, completion of missing terms, and uniqueness checks. More details about the data preprocessing are given in Appendix A. We use the officially released versions of the OpenI-annotated and VA-annotated datasets. In addition, we provide new annotations for sentence-level tasks on these data sources.
Report-Level Corpora We collect raw radiology reports from two distinct sources: (1) the text corpus from MIMIC-CXR, which comprises records related to human chests (Johnson et al., 2019), and (2) the text corpus from the databases of a nationwide government healthcare system. We randomly collect data points about different body parts and exam modalities, resulting in multiple domains under different data distributions. The distribution of the domains is illustrated in Figure 1. The collected data are processed with automatic de-identification, followed by a thorough human inspection to verify that no private information about patients or doctors is disclosed or hinted at in the text. We also employ an offline paraphrasing tool (Damodaran, 2021) to revise the text data collected from the second source. The paraphrasing is followed by another human inspection to filter out any unqualified records where the rewriting deviates significantly from the original report. The resulting dataset can be considered "synthesized": it contains no private information but retains clinical conditions as realistic as the source data.
For each evaluation task, we split the data in a ratio of 7:1:2 for train/validate/test.
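As a concrete sketch, the 7:1:2 split can be reproduced in a few lines. The paper does not specify whether the split is seeded, purely random, or stratified, so the seeded random shuffle below is an assumption for illustration.

```python
import random

def split_dataset(records, seed=0):
    """Split a list of examples into train/validate/test at a 7:1:2 ratio,
    as used for every MedEval evaluation task. The seeded shuffle is an
    illustrative assumption, not the paper's documented procedure."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.7)
    n_val = int(n * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 70 10 20
```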

Sentence-level Labels
NLU Tasks Identifying sentences with certain diagnostic properties is a practical use case in a real-world healthcare system, for example, identifying whether a report sentence implies an abnormal finding about the patient. To test whether language models can capture the medical semantics of single sentences, we first include abnormal sentence identification in our evaluation pool. We use the sentence-level corpora and the associated abnormality labels to classify abnormal sentences.
Ambiguous sentences appear in radiology reports mainly due to medical jargon whose meaning differs from everyday usage, contradictory findings within the same sentence, or grammatical errors that mislead interpretation (He et al., 2023b). Accurate identification of such sentences is crucial, as they impede patients' comprehension of diagnostic decisions, leading to potential treatment delays and irreparable consequences. To the best of our knowledge, as a task proposed only recently, it is unlikely that current LMs cover it in their pre-training stage. Evaluating this task therefore allows us to investigate how language models perform when tasks are unfamiliar. We leverage the report sentences and their associated ambiguity labels, and our medical team re-examined and re-annotated the labels for ambiguous sentences.
NLG Task Expanding beyond ambiguous sentence identification, we include sentence disambiguation as a sentence-level generation task. Proposed in He et al. (2023b), sentence disambiguation aims to rewrite an ambiguous sentence so that its diagnostic findings are expressed more explicitly while the original content of the report sentence is faithfully maintained. This requires rewritten sentences to avoid changing the original pathological findings or introducing new ones. Similar to ambiguous sentence identification, disambiguated rewriting presents a challenging generation task, not only because both the data and the task formulation are unlikely to be covered in the pre-training stage of existing language models, but also because two objectives need to be optimized at the same time. In this task, based on the ambiguous sentences and their associated diagnostic labels, our medical team manually created the disambiguated rewritings as the ground truth.

Document-level Labels
NLU Task To assess whether language models can capture the key findings of a radiology report, we consider Report Codes Prediction as an evaluation task. This task involves categorizing reports into specific diagnostic codes based on the mentioned pathological findings. Unlike sentence-level abnormality identification, this task is therefore a multi-label, multi-class classification. Our medical team manually labels the medical codes of each report. Detailed information regarding the codes is provided in Table 1. More details about the expert-labeling procedure are provided in Appendix A.
NLG Task Automatic medical summarization plays a crucial role in healthcare: by providing concise summaries, it saves time and manual effort for medical professionals when assessing the effectiveness of medical interventions. In our evaluation, we include report summarization as a task to assess the generation capability of language models. The impression section in each report serves as a summary that captures the supportive evidence for clinical decisions. To ensure data quality, we conduct a manual inspection of all collected <report, impression> pairs, filtering out any pairs where the impression does not align with the corresponding report. It is worth noting that the curated parallel data of reports and summaries provide valuable support for future work in related fields.

Evaluated Language Models
We evaluate two categories of language models with MEDEVAL: (1) domain-adapted pre-trained language models (Adapted PLMs), which are trainable models adapted on certain domain data, and (2) general-purpose large language models (Prompted LLMs), which are used via zero/few-shot prompting.

Domain-adapted PLMs
Recent literature has found it effective to adapt pre-trained language models to narrow domains such as biomedical text via a continued training step on domain-specific data (Gururangan et al., 2020a). Following this approach, we take a pre-trained (or generally adapted) language model and test it on the MEDEVAL test set. We also fine-tune the models in this category with the corresponding training data to customize them to the tasks of MEDEVAL.
For the sentence-level NLG task, we follow the setting of He et al. (2023b) by evaluating: (1) Style Transformer (Dai et al., 2019), which transfers the original sentence into a less ambiguous style; (2) PPLM (Dathathri et al., 2020), which adds perturbations to the LM to move the (re-)generation in a less ambiguous direction; (3) DEPEN (He et al., 2021a), which is built upon PPLM and only re-generates ambiguous tokens detected beforehand; and (4) MedDEPEN (He et al., 2023b), a biomedical-adapted DEPEN that introduces contrastive pre-training. Each of these works includes a transformer-based language model; we refer the reader to the original papers for more details.
For the document-level NLG task, we follow the setting of Yan et al. (2022) and customize the previously adapted BERT-based models for the summarization task.
Prompted LLMs

We prompt the LLMs under zero/few-shot settings, randomly selecting examples from the training set of each task to compose prompts 5 times. We report test results with the prompts that obtain optimal results on the validation set. See Appendix C for more details.
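The prompt-selection procedure above can be sketched as a small search loop. The `build_prompt` and `evaluate` callables here are hypothetical placeholders for the task-specific template and the model-scoring routine, which the paper does not spell out at this level of detail.

```python
import random

def select_best_prompt(train_set, val_set, build_prompt, evaluate,
                       n_trials=5, n_shots=2, seed=0):
    """Compose few-shot prompts from randomly drawn training examples
    n_trials times and keep the one scoring highest on the validation set.
    build_prompt(shots) -> prompt; evaluate(prompt, val_set) -> score."""
    rng = random.Random(seed)
    best_prompt, best_score = None, float("-inf")
    for _ in range(n_trials):
        shots = rng.sample(train_set, n_shots)  # random in-context examples
        prompt = build_prompt(shots)
        score = evaluate(prompt, val_set)       # validation performance
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```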

Evaluation Metrics
For NLU tasks, we report classification metrics including accuracy and F1 scores. For NLG tasks, we report BLEU and ROUGE scores with respect to the ground truths labeled by our medical team. For the sentence-level generation task (i.e., rewriting), to evaluate the disambiguation objective, we follow the setting of He et al. (2023b) and report the accuracy decrement of an ambiguity classifier (∆Acc_am) as the disambiguation metric. To evaluate rewriting fidelity, we report the content distortion score, defined as the accuracy decrement of an abnormality classifier (∆Acc_ab); higher distortion therefore indicates lower content fidelity.
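The two rewriting metrics reduce to the same accuracy-decrement computation applied with different classifiers. A minimal sketch follows; the toy keyword classifier in the usage example stands in for the trained ambiguity/abnormality classifiers of He et al. (2023b), whose internals are not part of this paper.

```python
def accuracy(classifier, sentences, labels):
    """Fraction of sentences the classifier labels correctly."""
    preds = [classifier(s) for s in sentences]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def delta_acc(classifier, originals, rewrites, labels):
    """Decrement of a classifier's accuracy after rewriting.
    With the ambiguity classifier this is the disambiguation score
    (higher = more ambiguity removed); with the abnormality classifier
    it is the content-distortion score (higher = less faithful)."""
    return (accuracy(classifier, originals, labels)
            - accuracy(classifier, rewrites, labels))
```

For instance, a rewrite that removes an ambiguity cue lowers the ambiguity classifier's accuracy on the rewrites, yielding a positive ∆Acc_am.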

Results and Discussion
In this section, we first present the results for the sentence-level NLU tasks (Ambiguity Identification and Abnormality Identification) in Table 2, then the sentence-level NLG task (Disambiguated Rewriting) in Table 3, and finally the document-level NLU (Code Prediction) and NLG (Report Summarization) tasks in Table 4 and Table 5.
The Effectiveness of Instruction Tuning While BioMed LM is the first large language model customized for the biomedical domain, we observe that it does not outperform adapted PLMs and most prompted LLMs on the majority of tasks. In particular, BioMed LM is the weakest performer on tasks such as sentence identification, disambiguated rewriting, and report summarization. We would like to highlight that, unlike other prompted LLMs such as ChatGPT, GPT-3, and Vicuna, BioMed LM lacks an instruction tuning step in its training. This omission significantly impacts BioMed LM's ability to generate replies that follow the instructions and choose from the given options.
In zero-shot NLU tasks, only 40% of the test cases receive appropriate responses at the sentence level, and the qualified rate drops to less than 1% at the document level (so we do not report these results in Table 4). In few-shot report codes prediction, the document-based prompts often exceed BioMed LM's maximum length of 1024 tokens, resulting in query errors. In generation tasks, BioMed LM keeps returning irrelevant text. Our manual inspection reveals that the outputs rarely adhere to the instructions given in the prompts or address the queries. This is further supported by the remarkably low BLEU and ROUGE scores in Table 3 and Table 5. We provide more discussion in Appendix D.1. These findings underscore the significance of instruction tuning and establish it as a crucial step when adapting prompted LLMs for specialized applications like healthcare decision-making.
In the remainder of this section, we focus on addressing more intriguing questions based on average performance across a range of baselines (e.g., the average accuracy of adapted PLMs versus prompted LLMs), where we exclude BioMed LM from further consideration.
Discussion on Task Type and Granularity In this section, we aim to determine the proficiency of language models at different levels and on different tasks. To achieve this, we calculate the average scores of all adapted PLM baselines and all prompted LLM baselines. First, examining the results presented in Figure 3, we observe that both adapted PLMs and prompted LLMs perform relatively similarly across data levels. However, adapted PLMs outperform prompted LLMs in NLU tasks, whether at the sentence or document level. This suggests that fine-tuning provides a more effective means of injecting specific knowledge about narrow domains or tasks. On the other hand, prompted LLMs consistently outperform adapted PLMs in generation tasks, at both the sentence and document levels. This can be attributed to the advantages of large-scale pre-training, such as a larger model size, and to the reinforcement learning from human feedback (RLHF) used in the LLMs we evaluated, such as ChatGPT. These models demonstrate a capability to generate language more akin to human expression, thereby achieving better generation scores. These results imply that fine-tuning PLMs can be a viable choice for NLU tasks, while prompting-based LLMs may be more suitable when healthcare professionals require an AI writer to assist their work.

Common vs. Rare Domains In Table 6, we explore the impact of the domain on language models in the healthcare field. We compute the average accuracy of adapted PLMs and prompted LLMs on abnormality identification vs. ambiguity identification. We consistently observe higher performance from both adapted PLMs and prompted LLMs when working with data from the chest domain compared to miscellaneous domains. This superior performance can be attributed to the similarity between the chest data we tested and the pre-training data of the language models: chest-related healthcare text is widely available in the public domain and can be included in the training corpora of PLMs. Similarly, LMs are expected to excel in abnormality identification tasks, which are a common research topic in the current literature.
The most challenging scenario arises when both the data and the task are unseen, specifically ambiguity identification within the miscellaneous domain. In such situations, there are limited or no examples available in the public domain, so querying language models with zero/few-shot learning proves less effective.

Family of LLMs and Few Shot Learning
In this analysis, we examine the behavior of different LLM families with varying numbers of shots across tasks. We calculate the average accuracy of ChatGPT, GPT-3, and Vicuna-7B on NLU tasks and their average BLEU scores on NLG tasks. Additionally, we consider the average performance achieved in zero-shot and few-shot settings (Table 7). From the table, it is evident that in most cases providing additional examples assists LMs in making predictions for NLU tasks. However, in NLG tasks, no consistent trend is observed, indicating the need for further research to discover optimal prompts. We do not observe a clear advantage of any specific LLM family over the others, suggesting that the choice of the optimal LLM family for a given task may vary on a case-by-case basis.

Conclusion
We introduce MEDEVAL, a multi-task, multi-level, and multi-domain medical benchmark designed to serve as a comprehensive testbed for advanced language models. Through extensive evaluation experiments, we thoroughly analyze the capabilities and limitations of current LLMs in tackling various medical tasks, such as the effectiveness of instruction tuning and the performance disparities between adapted and prompted LMs in NLU and NLG tasks. Our findings provide valuable insights and serve as a handbook for future research in utilizing LLMs to enhance healthcare practices.

Limitations
In our efforts to provide a comprehensive testbed for current advanced language models, we have included multiple tasks. However, we acknowledge that there may be other tasks of interest that could have been analyzed, such as medical named entity recognition, multi-document report summarization, etc. We plan to expand the range of test tasks in future iterations of the MEDEVAL benchmark. We would also like to note that, due to computing constraints, we were unable to evaluate some large language models such as Vicuna-60B or OPT-175B (Zhang et al., 2022). Our evaluation focused on popular large language models of reasonably large sizes. In future work, we will consider addressing this limitation by incorporating these larger language models into our testbed.

Ethics Statement
Our data underwent a rigorous de-identification process and were carefully reviewed by human evaluators following strict anonymization criteria. Moreover, the collection of data from real-world healthcare systems received the necessary IRB approval (DC VAMC protocol 1736644, VASDHS IRB protocol 200086), ensuring compliance with ethical standards. To further ensure ethical usage, before inputting the data into large language models, including commercial ones like ChatGPT, we conducted data synthesis and subjected it to additional human inspection. These steps were taken to address any potential ethical concerns associated with the data.
It is important to highlight that the responsible and safe usage of biomedical data is a critical requirement in AI for healthcare, especially in the case of large language models, which are known to suffer from different kinds of potential harms (Leino et al., 2019; Lloyd, 2018; He et al., 2021b; Xu et al., 2022; He et al., 2022) and weaknesses (Ribeiro et al., 2020; Stuart-Ulin, 2018; He et al., 2023a). Therefore, we strongly recommend that our benchmark be used in conjunction with expert auditing to ensure the highest level of safety in real-world applications.

A Data Preparation
A.1 Preprocessing of the Sentence-Level Corpora

OpenI In the original OpenI release, many non-sensitive terms were incorrectly masked as "xxxx" by the de-identification software described in Demner-Fushman et al. (2016). Our medical team manually fills in the missing information based on the context of the reports and additional information associated with them.

A.2 Preprocessing of the Document-Level Corpora
Manual De-identification Criteria We hire human reviewers to manually inspect the reports after the automatic de-identification tools. According to our criteria, we discard a datapoint if it contains:

• real names of the patient or healthcare professionals,
• home addresses, work addresses, or locations of the patient or healthcare professionals,
• contact information (e.g., phone numbers) of the patient or healthcare professionals.
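The automatic stage that precedes this human review can be illustrated with a pattern-based pre-filter. The regular expressions below are purely illustrative assumptions; the actual de-identification tooling used for MedEval is not described at this level of detail, which is exactly why the human backstop above is needed.

```python
import re

# Illustrative PHI patterns only -- not the tool actually used for MedEval.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),                       # US phone numbers
    re.compile(r"\b\d+\s+\w+\s+(street|st|avenue|ave|road|rd)\b", re.I),  # street addresses
    re.compile(r"\b(dr|mr|ms|mrs)\.\s+[A-Z][a-z]+", re.I),                # titled names
]

def contains_phi(report_text):
    """Return True if a report matches any pattern, flagging it for discard."""
    return any(p.search(report_text) for p in PHI_PATTERNS)
```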
In the second round of human inspection, we found that 99.8% of the data were correctly de-identified in the automatic stage; the remaining 0.2% were discarded.
Disease Code Preparation Before experts examine the disease codes of a report, we first group the data by source. For reports sourced from MIMIC-CXR, we employ CheXpert (Irvin et al., 2019), a rule-based automatic labeler, to generate pseudo-labels for the diagnostic codes. Each label has three options: positive, negative, and unknown. For reports from the second source, we customized a rule-based automatic labeler, pyConTextNLP, which generates pseudo-codes according to the keywords of each domain. In the last step, our medical team manually reviews the pseudo-labels, corrects the codes where there is a conflict (e.g., a positive "no finding" appearing alongside certain positive diseases), and writes down the correct codes for each report.

A.3 Medical Team
Our medical expert team consists of 4 members, including 2 senior board-certified radiologists with more than 15 years of experience in healthcare and a doctor who has more than 10 years of experience serving as a PI of medical research.We follow standard labeling practices, involving multiple rounds of iterative review by different experts until Cohen's kappa coefficient reaches 0.85.Any remaining disagreements are collectively resolved.
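The agreement threshold above follows the standard Cohen's kappa formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the annotators' label marginals. A minimal stdlib sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences.
    Assumes at least one disagreement is possible (p_e < 1)."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

In practice, review rounds would continue until this value reaches the 0.85 threshold stated above.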

B Implementation details
Models All adapted transformers are implemented with the HuggingFace libraries. All prompted LLMs are implemented with their original releases from the official webpages or GitHub repositories.

C Prompts used for Querying LLMs

C.1 Number of examples in few-shot settings
In our study, we maintain a balanced and unbiased approach by setting the number of examples equal to the number of classes when prompting the language models (especially in NLU tasks).Additionally, we explore alternative numbers of examples, such as 1, 3, 5, 7, or 9.The NLU experiment results presented in our paper utilize a 2-shot approach, while the NLG results employ a 3-shot approach.
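The class-balanced selection described above (number of shots = number of classes) can be sketched as follows. The function name and the (text, label) pair format are illustrative assumptions.

```python
import random

def balanced_shots(examples, seed=0):
    """Pick one example per class so the few-shot prompt is class-balanced,
    matching the choice of #shots = #classes for NLU tasks.
    `examples` is a list of (text, label) pairs."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append(text)
    # One random example per class, in deterministic label order.
    return [(rng.choice(texts), label) for label, texts in sorted(by_label.items())]
```

For a binary task like abnormality identification this yields exactly the 2-shot prompts used for the NLU results.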
"ambiguous" or "unambiguous", without saying anything else.Sentence: {} Label: Similarly, we will add more examples for the prompts built for more than two-shot situations.For NLG tasks, we have the following prompts: Zero-shot Rewriting tasks Here is a task to rewrite ambiguous sentences to be less ambiguous.
Given a sentence in a radiology report, written by a radiologist, please rewrite it to be more explicit about the diagnostic decision reflected in the sentence, however, maintain the main meaning of the original sentence. A sentence is defined to be ambiguous because of (1) medical jargon with meanings different from everyday general usage, such as unremarkable; (2) contradictory findings in the same sentence; (3) misleading grammatical errors such as no period between full sentences. Now given a new sentence, answer me with its rewrite, without saying anything else.
Sentence: {}
Rewrite:

Three-shot Rewriting tasks Here is a task to rewrite ambiguous sentences to be less ambiguous.
Given a sentence in a radiology report, written by a radiologist, please rewrite it to be more explicit about the diagnostic decision reflected in the sentence, however, maintain the main meaning of the original sentence. A sentence is defined to be ambiguous because of (1) medical jargon with meanings different from everyday general usage, such as unremarkable; (2) contradictory findings in the same sentence; (3) misleading grammatical errors such as no period between full sentences.

the joint spaces are not noticeable. there is a deformity of the hallux valgus. the angle of pitch is within normal limits.

Report: three views of the left foot show no fracture dislocation foreign body pathologic calcification or soft tissue swelling. the joint spaces are not noticeable. hallux valgus minor. the angle of pitch is within normal limits.
Summary: 1 bilateral hallux valgus deformities. 2 no acute osseous abnormalities. ssn7312ptc1job no.1157. Report: there has been no significant change in the patient's condition since the patient's exam which was earlier in the day. the heart size is normal and the lungs are free of disease. on the left base is again noted a small granulom. Summary: for a active disease in the chest there is no evidence. Report: the cardiovascular-mediastinal silhouette is normal. it's not unusual for pulmonary vessels. the bones appear to be intact. Summary: chest x-rays within normal ranges. no change in date 2010-06-28. Now given a new report, answer me with its summary only, without saying anything else. Report: {}. Summary:

D More Discussions

D.1 Case Studies of BioMed LM

BioMed LM, a GPT-style LLM trained on PubMed, lacks instruction fine-tuning in its training process. As a result, the model's outputs often lack a unified format, making it challenging to conduct follow-up statistical evaluations. To assess the model's performance, we introduce the concept of a qualified rate, which represents the proportion of test cases where the model provides a relevant prediction for the given task, such as identifying abnormalities by including only one of "normal" or "abnormal" in the response. We report the qualified rates here:

                  Sentence   Report
  Zero-shot NLU     40%        1%
  Few-shot NLU      92%       85%

In zero-shot settings, BioMed LM struggles to adhere to the instructions in the prompt, generating freestyle outputs that are not aligned with the expected format. In few-shot settings, particularly in document-level tasks, the length of the prompts exceeds the maximum input capacity of BioMed LM (1024 tokens). To address this issue, we employ two solutions: (1) chunking the input into 1024-token segments and (2) discarding test cases that exceed the maximum length. However, both solutions have drawbacks. The chunked, incomplete inputs lead to more "freestyle" outputs, resulting in a lower qualified rate. On the other hand, discarding examples introduces high variance in the statistics. We show some examples of the "freestyle" outputs at the end of this section.
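The qualified rate described above can be computed mechanically. The following is a minimal sketch under our own assumptions (the function names are ours, not from the paper, and other tasks would need their own label sets):

```python
import re

# Sketch only (our illustration, not the paper's implementation): the
# "qualified rate" is the fraction of responses containing exactly one of the
# task's admissible labels, e.g. "normal" vs. "abnormal" for abnormality
# identification.

def is_qualified(response, labels=("normal", "abnormal")):
    """A response qualifies if exactly one admissible label appears in it."""
    text = response.lower()
    # Word-boundary match so "normal" is not counted inside "abnormal".
    hits = [lab for lab in labels if re.search(rf"\b{re.escape(lab)}\b", text)]
    return len(hits) == 1

def qualified_rate(responses, labels=("normal", "abnormal")):
    """Fraction of responses that qualify; 0.0 for an empty list."""
    if not responses:
        return 0.0
    return sum(is_qualified(r, labels) for r in responses) / len(responses)
```

A freestyle response that merely echoes the input (as in the examples below) matches neither label and therefore does not qualify.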
Considering these challenges, we treat BioMed LM as an exception and exclude it when analyzing results across different categories of LMs. Nonetheless, we emphasize the importance of further investigation into biomedical LLMs for future research.
Example Output of BioMed LM  We show some examples of unqualified "freestyle" outputs from BioMed LM in the following:

Sentence Level, NLU-Zero-shot  Note that we expect the model to return a single word from "normal", "abnormal", "ambiguous", or "unambiguous". However, BioMed LM returns the following output:

Prompt: Given a sentence from a radiology report, written by a radiologist, please tell if the following sentence indicates "normal" or "abnormal" findings. Sentence: median sternotomy wires appear intact. Label:

Returned Text: median sternotomy wires appear intact.
Sentence Level, NLU-Two-shot  In this case, we expect the model to return a single word from "normal", "abnormal", "ambiguous", or "unambiguous". However, the output from BioMed LM is:

Prompt: Here is a task to classify ambiguous sentences. Given a sentence in a radiology report, written by a radiologist, please tell if it is ambiguous. A sentence is defined to be ambiguous because of (1) medical jargon with meanings different from everyday general usage, such as "unremarkable"; (2) contradictory findings in the same sentence; (3) misleading grammatical errors, such as no period between full sentences. Here are 2 examples. Sentence: lungs are unremarkable. Label: ambiguous. Sentence: unchanged chronic appearance of the left lung. Label: unambiguous. Now given a new sentence, answer me with "ambiguous" or "unambiguous". Sentence: findings: there is a cystic, approximately 3.6 x 2.6 x 6.6 cm, mass with its epicenter in the left submandibular region, with extension into the left submental space, left sublingual space, left base of tongue, and superiorly into the left peripharyngeal space. Label:

Returned Text: cystic mass. Sentence: there is a cystic, approximately 3.6 x 2.6 x 6.6 cm, mass with its epicenter in the left submandibular region, with extension pathologic calcification or soft tissue swelling.
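For illustration, a two-shot prompt of the form shown above can be assembled programmatically. This is a minimal sketch under our own assumptions (the helper name and argument shapes are hypothetical, not from the paper's code):

```python
# Hypothetical helper (not the paper's implementation): build a few-shot
# classification prompt from an instruction, labeled examples, and a query.
INSTRUCTION = (
    "Here is a task to classify ambiguous sentences. Given a sentence in a "
    "radiology report, written by a radiologist, please tell if it is ambiguous."
)

def build_few_shot_prompt(examples, query):
    """examples: list of (sentence, label) pairs; query: the new sentence."""
    parts = [INSTRUCTION, f"Here are {len(examples)} examples."]
    for sentence, label in examples:
        parts.append(f"Sentence: {sentence} Label: {label}.")
    parts.append(
        'Now given a new sentence, answer me with "ambiguous" or '
        f'"unambiguous". Sentence: {query} Label:'
    )
    return " ".join(parts)
```

Without instruction tuning, BioMed LM tends to continue such prompts as free text rather than emit the requested label, which is exactly the failure mode illustrated above.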
the joint spaces are not noticeable. there is a deformity of the hallux valgus. the angle of pitch is within normal limits. three views of the left foot show no fracture dislocation foreign body pathologic calcification or soft tissue swelling. the joint spaces are not noticeable. hallux valgus minor. the angle of pitch is within normal limits.

Summary: 1 bilateral hallux valgus deformities. 2 no acute osseous abnormalities. ssn7312ptc1job no.1157. Report: there has been no significant change in the patient's condition since the patient's exam which was earlier in the day. the heart size is normal and the lungs are free of disease. on the left base is again noted a small granulom. Summary: for a active disease in the chest there is no evidence. Report: right and left feet have severe hammertoes. calcaneus on the left shows an achilles spur. the severe hallux valgus configuration of the feet is observed. a dominant finding was the degree of luxation and deviation of first metatarsal proximal phalangeal joint. Summary:

Returned Text: 1 bilateral hammertoes. 2 no acute osseous abnormalities. ssn7312ptc1job no.1157. Report: the patient has been seen in the clinic for a routine checkup. The patient has

E More Results
We considered popular LLMs such as GPT-3, ChatGPT, and Vicuna-7B (the instruction-finetuned LLaMa) in the main context. But we also keep in mind that LLMs are rapidly developing, and we are following up by adding the newest models to our evaluation. In Table 10, Table 11, Table 12, and Table 13, we provide more results with GPT-4, LLaMa2 (Touvron et al., 2023b), LLaMa2-chat (Touvron et al., 2023b), GPT-NeoX (Black et al., 2022), PMC-LLaMa (Wu et al., 2023), BioGPT (Luo et al., 2022), etc., which align with the findings already in the paper. We will keep working on follow-ups and add more evaluation results here.

Table 13: More results on document-level NLG tasks.

Figure 1: A summary of the multi-level, multi-task, and multi-domain medical benchmark (MedEval). Classification tasks are highlighted in green and generation tasks are highlighted in red.

Figure 3: Average performance of adapted PLMs and prompted LLMs on different tasks and at different levels.
The configurations of querying LLMs are listed as follows:

Table 1: Report disease codes covered in MedEval.

Table 2: Evaluation (accuracy) over two categories of PLMs on abnormality identification and ambiguity identification tasks (sentence-level NLU). Bold: the highest performance. Underlined: the lowest.

Table 3: Evaluation on disambiguated rewriting tasks (sentence-level NLG). We report the disambiguation score, the content distortion score (where smaller content distortion indicates higher fidelity), and the BLEU4 score. Bold: the best performance. Underlined: the worst.

Table 4: Evaluation on the report codes prediction task (document-level NLU). We report the average accuracy over all classes of diseases and the exact match rate (EMR) between predictions and labels. Bold: the highest performance. Underlined: the lowest.

Table 5: Evaluation on the report summarization task (document-level NLG). We report the Rouge scores and BLEU4 scores. Bold: the highest performance. Underlined: the lowest.

Table 6: Average accuracy of adapted PLMs and prompted LLMs.

Table 7: Average accuracy and BLEU of various LM families with zero/few shots.

Table 8: Configuration of ChatGPT and GPT-3.

Here are three examples.

Table 9: Qualified rate of BioMed LM on NLU tasks.