Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers

ChatGPT is a large language model developed by OpenAI. Despite its impressive performance across various tasks, no prior work has investigated its capability in the biomedical domain. To this end, this paper aims to evaluate the performance of ChatGPT on various benchmark biomedical tasks, such as relation extraction, document classification, question answering, and summarization. To the best of our knowledge, this is the first work that conducts an extensive evaluation of ChatGPT in the biomedical domain. Interestingly, we find that on biomedical datasets that have smaller training sets, zero-shot ChatGPT even outperforms state-of-the-art fine-tuned generative transformer models, such as BioGPT and BioBART. This suggests that ChatGPT's pre-training on large text corpora makes it effective even in the specialized biomedical domain. Our findings demonstrate that ChatGPT has the potential to be a valuable tool for various tasks in the biomedical domain that lack large annotated data.


Introduction
The rapid growth of language models (Rogers et al., 2021; Zhou et al., 2023) in the field of Natural Language Processing (NLP) in recent years has led to significant advancements in various domains, including the biomedical domain (Kalyan et al., 2022). Although specialized models (e.g., BioBERT (Lee et al., 2020), BioBART (Yuan et al., 2022a), BioGPT (Luo et al., 2022), etc.) have shown promising results in the biomedical domain, they require fine-tuning on domain-specific datasets. This fine-tuning process can be time-consuming due to the requirement of task-specific large annotated datasets. In contrast, zero-shot learning enables models to perform tasks without the need for fine-tuning on task-specific datasets. ChatGPT (https://openai.com/blog/chatgpt), a large language model, has demonstrated impressive zero-shot performance across various tasks (Laskar et al., 2023). However, its performance in the biomedical domain remains to be thoroughly investigated. In this regard, this paper presents a comprehensive evaluation of ChatGPT on four key biomedical tasks: relation extraction, question answering, document classification, and summarization.
In this paper, our primary objective is to explore the extent to which ChatGPT can perform these tasks without fine-tuning and to assess its performance by comparing it with state-of-the-art fine-tuned generative models, BioGPT and BioBART. To the best of our knowledge, this is the first work that evaluates ChatGPT on benchmark biomedical datasets. By exploring its zero-shot learning capabilities, our evaluation can have a profound impact on biomedical applications that lack domain-specific annotated datasets. To ensure the reproducibility of our evaluation and to help facilitate future research, we will release all the ChatGPT-generated responses along with our evaluation code here: https://github.com/tahmedge/chatgpt-eval-biomed.

Related Work
The effective utilization of transformer-based (Vaswani et al., 2017) NLP models like BERT (Devlin et al., 2019) has also led to significant progress in the biomedical domain (Lee et al., 2020; Alsentzer et al., 2019; Beltagy et al., 2019; Gu et al., 2020; Peng et al., 2019) in recent years. BERT leverages the encoder of the transformer architecture, while GPT leverages the decoder. In addition to these models, sequence-to-sequence models like BART (Lewis et al., 2019) that leverage both the encoder and the decoder of the transformer have also emerged as a powerful approach in various text generation tasks.
Pre-training such models on large biomedical datasets has helped them achieve state-of-the-art performance in a variety of BioNLP tasks (Gu et al., 2021). However, one major limitation of using such fine-tuned models is that they require task-specific large annotated datasets, which are significantly less available in the BioNLP domain than in the general NLP domain. In this regard, having a strong zero-shot model could potentially alleviate the need for large annotated datasets, as it could enable the model to perform well on tasks that it was not trained on.
Recently, large autoregressive language models like GPT-3 (Brown et al., 2020) have demonstrated impressive few-shot learning capability. More recently, a new variant of GPT-3, called the InstructGPT model (Ouyang et al., 2022), has been proposed that leverages reinforcement learning from human feedback (RLHF). The resulting InstructGPT models (in other words, GPT-3.5) are much better at following instructions than the original GPT-3 model, resulting in impressive zero-shot performance across various tasks. ChatGPT, a very recent addition to the GPT-3.5 series, has been trained using dialog-based instructional data alongside its regular training phase.
Though ChatGPT has demonstrated strong zero-shot performance across various NLP tasks (Laskar et al., 2023; Qin et al., 2023; Bang et al., 2023; Yang et al., 2023), it is yet to be investigated in the biomedical domain. To this end, this paper aims to evaluate ChatGPT in the biomedical domain.

Our Methodology
For a given test sample X, we prepare a task instruction T and concatenate the text in the test sample with the task instruction to construct the prompt P. The prompt P is then given as input to ChatGPT (gpt-3.5-turbo) to generate the response R. In this paper, we evaluate ChatGPT on 4 biomedical tasks across 11 benchmark datasets. Below, we describe these tasks, the datasets we use for evaluation, and the prompt P that we construct for each task depending on the respective dataset.
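The prompt construction above can be sketched as follows. This is a minimal illustration, not the exact code used in the paper; the instruction text is one of the prompts reported later in the paper, and the (commented-out) API call shows how P would be sent to gpt-3.5-turbo via the legacy OpenAI Chat Completions interface.

```python
# Minimal sketch: concatenate the task instruction T with the text of the
# test sample X to construct the prompt P, which is then sent to ChatGPT.

def build_prompt(task_instruction: str, sample_text: str) -> str:
    """Concatenate the task instruction T with the test sample X to form P."""
    return f"{task_instruction}\n\n{sample_text}"

instruction = "Identify the chemical-disease interactions in the passage given below:"
sample = "Famotidine-associated delirium. A series of six cases."
prompt = build_prompt(instruction, sample)

# The prompt P would then be given to ChatGPT (not executed here):
# response = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": prompt}],
# )
print(prompt.startswith(instruction) and prompt.endswith(sample))  # True
```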
(i) Relation Extraction: Given a text sequence S, the biomedical relation extraction task aims to extract relations between entities mentioned in the text by identifying all possible relation triplets. In this paper, we evaluate drug-target interaction in the KD-DTI dataset (Hou et al., 2022), chemical-disease interaction in the BC5CDR dataset, and drug-drug interaction in the DDI dataset.
(ii) Document Classification: For document classification, we evaluate on the HoC dataset, where the goal is to classify a sentence based on the 10 hallmarks of cancer taxonomy.
(iii) Question Answering: For the question-answering task, we evaluate the performance of ChatGPT on the PubMedQA dataset (Jin et al., 2019). Here, the objective is to determine whether the answer to a given question can be inferred from the reference context. We give the question, the reference context, and the answer as input to ChatGPT to determine whether the answer to the given question can be inferred from the given reference context, with ChatGPT being prompted to reply either yes, no, or maybe (see Table 1 for details).
(iv) Abstractive Summarization: Given a text sequence S, the goal is to generate a concise abstractive summary of S.
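For relation extraction, performance is reported as precision, recall, and F1 over the extracted triplets. The sketch below illustrates how such triplet-level scores can be computed under an exact-match assumption; the authors' manual evaluation protocol (Appendix A.1) may apply looser matching, and the example triplets are illustrative.

```python
# Hedged sketch of triplet-level scoring for relation extraction:
# precision, recall, and F1 over the sets of predicted vs. gold triplets.

def triplet_prf(predicted: set, gold: set) -> tuple:
    """Precision, recall, and F1 over exact-match relation triplets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# A verbose model output yields one spurious triplet: recall stays perfect
# but precision drops -- the pattern reported for ChatGPT on BC5CDR/KD-DTI.
gold = {("famotidine", "induces", "delirium")}
predicted = {("famotidine", "induces", "delirium"),
             ("famotidine", "induces", "stress ulcers")}
p, r, f = triplet_prf(predicted, gold)
print(p, r, round(f, 4))  # 0.5 1.0 0.6667
```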

Experiments
Since ChatGPT is a generative model, we consider two state-of-the-art generative transformers as our baselines. Below, we first present these baselines, followed by presenting the results.

Fine-tuned Baselines
BioGPT: The backbone of BioGPT (Luo et al., 2022) is GPT-2 (Radford et al., 2019), a decoder-only transformer. The BioGPT model was pre-trained over PubMed titles and abstracts via the standard language modeling objective. We compare zero-shot ChatGPT with BioGPT models fine-tuned on relation extraction, document classification, and question-answering tasks.
BioBART: BioBART is a sequence-to-sequence model that was pre-trained over PubMed abstracts (Yuan et al., 2022a). The pre-training process involves reconstructing corrupted input sequences. We compare the zero-shot ChatGPT with BioBART fine-tuned on abstractive summarization datasets.

Results & Discussion
We first compare the performance of ChatGPT with BioGPT on relation extraction, document classification, and the question-answering task (see Table 3). Then we compare its performance with BioBART on summarization datasets (see Table 4). More evaluation details are given in Appendix A.

Relation Extraction Evaluation:
We observe that in the BC5CDR and KD-DTI datasets for relation extraction, ChatGPT obtains higher recall scores but much lower precision scores compared to the fine-tuned BioGPT model. This is because ChatGPT tends to generate long and descriptive responses, leading to many inaccurate relation extractions. Though in terms of F1 it outperforms fine-tuned BioGPT in the BC5CDR dataset, it fails to do so in the KD-DTI dataset. More importantly, it outperforms BioGPT in the DDI dataset in all metrics: Precision, Recall, and F1.
While analyzing the results on different datasets, we observe that in both the BC5CDR and DDI datasets, where ChatGPT outperforms BioGPT, the training set is small, with only 500 and 664 instances, respectively. On the other hand, in the KD-DTI dataset, where ChatGPT fails to outperform BioGPT, the training set contains 12,000 instances. This gives a strong indication that even in the biomedical domain, zero-shot ChatGPT can outperform fine-tuned biomedical models on datasets with smaller training sets.
We also observe that more descriptive prompts may help ChatGPT obtain better Precision scores. Contrary to the KD-DTI dataset, we describe the definition of each interaction type in the DDI dataset (see Table 1), where ChatGPT performs the best. To further investigate the effect of prompts on relation extraction, we evaluate the performance in BC5CDR with a new prompt: "Identify the chemical-disease interactions in the passage given below: [PASSAGE]". We observe that the Precision, Recall, and F1 scores decrease by 16.07%, 10.3%, and 14.29%, respectively, with this prompt variation.

Document Classification Evaluation:
We observe that in the HoC dataset, zero-shot ChatGPT achieves an F1 score of 59.14, compared to its fine-tuned BioGPT counterpart, which achieves an F1 score of 85.12. We also investigate the effect of prompt tuning by evaluating with two new prompts that are less descriptive (see Appendix A.2 for more details): (i) prompting without explicitly mentioning the names of the 10 HoC classes drops the F1 score to 38.20; (ii) prompting with the name of each HoC class but without the definition of each class drops the F1 score to 46.93.
Question Answering Evaluation: We observe that in the PubMedQA dataset, zero-shot ChatGPT achieves much lower accuracy than BioGPT (51.60 for ChatGPT compared to 78.20 for BioGPT). However, the BioGPT model was fine-tuned on about 270K QA pairs across various versions of the PubMedQA dataset for this task, whereas ChatGPT achieves more than 50% accuracy without any few-shot examples in the prompt.
Summarization Evaluation: We observe that in terms of all ROUGE scores (Lin, 2004), ChatGPT performs much worse than BioBART on datasets that have dedicated training sets, such as iCliniq, HealthCareMagic, and MeQSum. Meanwhile, it performs on par with BioBART on the MEDIQA-QS dataset. More importantly, it outperforms BioBART on both the MEDIQA-ANS and MEDIQA-MAS datasets. Note that the MEDIQA-ANS, MEDIQA-MAS, and MEDIQA-QS datasets do not have any dedicated training data, and ChatGPT achieves comparable or even better performance on these datasets compared to the BioBART model fine-tuned on other related datasets (Yuan et al., 2022a). This further confirms that zero-shot ChatGPT is more useful than domain-specific fine-tuned models on biomedical datasets that lack large training data.

Conclusions and Future Work
In this paper, we evaluate ChatGPT on 4 benchmark biomedical tasks and observe that on datasets with large training data, ChatGPT performs quite poorly in comparison to the fine-tuned models (BioGPT and BioBART), whereas it outperforms the fine-tuned models on datasets where the training data size is small. These findings suggest that ChatGPT can be useful in low-resource biomedical tasks. We also observe that ChatGPT is sensitive to prompts, as variations in prompts led to noticeable differences in results.
Though in this paper we mostly evaluate ChatGPT on tasks that require it to generate responses by only analyzing the input text, in the future, we will investigate the performance of ChatGPT on more challenging tasks, such as named entity recognition and entity linking (Yadav and Bethard, 2018; Yan et al., 2021; Yuan et al., 2022b; Laskar et al., 2022a,b,c), as well as problems in information retrieval (Huang et al., 2005; Huang and Hu, 2009; Yin et al., 2010; Laskar et al., 2020, 2022d). We will also explore the ethical implications (e.g., bias or privacy concerns) of using ChatGPT in the biomedical domain.

Limitations
Since the training datasets of ChatGPT are unknown, some data used for evaluation may or may not have existed during the training phase of ChatGPT. Also, a new version, the GPT-4 model, has been released that may ensure higher accuracy. Nonetheless, GPT-4 is very costly to use, around 60x more expensive than ChatGPT. Meanwhile, even with the paid ChatGPT Plus subscription, it is available only for limited use (allowing evaluation of only 25 samples every 3 hours). Another limitation of this research is that the results reported in this paper for ChatGPT may not be reproducible, as ChatGPT may generate different responses for the same input prompt. Although the experimental results may change over time, this work will still give a concrete direction for future research using ChatGPT-like large language models in the biomedical domain.

Ethics Statement
The paper evaluates ChatGPT on 4 benchmark biomedical tasks that require ChatGPT to generate a response based on the information provided in the input text. Thus, no data or prompt was provided as input that could lead ChatGPT to generate responses that pose any ethical or privacy concerns. This evaluation is done only on academic datasets that already have gold labels available, and so it does not create concerns such as humans relying on ChatGPT responses for sensitive issues like disease diagnosis. Since this paper only evaluates the performance of ChatGPT and investigates its effectiveness and limitations, conducting this evaluation does not introduce any unwanted biases. Only publicly available academic datasets that did not require any licensing are used. Thus, no personally identifiable information has been used.

A.1 Evaluating ChatGPT on Different Tasks
Since ChatGPT-generated responses can be lengthy, may contain unnecessary information, and may not follow a specific format, especially in tasks that may have multiple answers (e.g., relation extraction), it can be quite difficult to automatically evaluate its performance by comparing against the gold labels with just an evaluation script. Thus, for some datasets and tasks, we manually evaluate the ChatGPT-generated responses and compare them with the gold labels. Below, we describe our evaluation approach for different tasks:
• Relation Extraction: The authors manually evaluated the ChatGPT-generated responses for this task by comparing them with the gold labels. To ensure the reproducibility of our evaluation, we will release the ChatGPT-generated responses.
• Document Classification: We created an evaluation script that identifies whether the gold label (one of the 10 HoC classes) is present in the ChatGPT-generated response. For fair evaluation, we lowercase each character in both the gold label and the ChatGPT-generated response. Our evaluation script will be made publicly available to ensure the reproducibility of our findings.
• Question Answering: Similar to document classification, we evaluated using an evaluation script that compares the gold label and the ChatGPT-generated response (here, we also convert each character to lowercase). The evaluation script will also be made public.
• Abstractive Summarization: We used HuggingFace's Evaluate library (Wolf et al., 2020) to calculate the ROUGE scores and the BERTScore for the abstractive summarization task evaluation.
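The substring-based check used for document classification and question answering can be sketched as follows. The function name is ours and the examples are illustrative; the released evaluation script may differ in detail.

```python
# Minimal sketch of the evaluation script described above: a prediction is
# counted as correct if the (lowercased) gold label appears anywhere in
# the (lowercased) ChatGPT-generated response.

def label_in_response(gold_label: str, response: str) -> bool:
    """Case-insensitive substring match of the gold label in the response."""
    return gold_label.lower() in response.lower()

# HoC example: the gold class name is embedded in a longer generated answer.
response = "This sentence belongs to: Genomic instability and mutation."
print(label_in_response("genomic instability and mutation", response))  # True
print(label_in_response("inducing angiogenesis", response))             # False
```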
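The paper relies on HuggingFace's Evaluate library for ROUGE; purely as an illustration of what the metric measures, the sketch below computes ROUGE-1 (unigram overlap between a generated summary and a reference) in plain Python. It is not the implementation used for the reported scores.

```python
# Illustrative ROUGE-1: recall, precision, and F1 via clipped unigram counts.
from collections import Counter

def rouge1(candidate: str, reference: str) -> tuple:
    """ROUGE-1 recall, precision, and F1 between two whitespace-tokenized texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

r, p, f = rouge1("the cat sat", "the cat sat on the mat")
print(r, p, round(f, 4))  # 0.5 1.0 0.6667
```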

A.2 Effects of Prompt Variation
We investigate the effects of prompt tuning in the HoC dataset by evaluating the performance of ChatGPT with the following prompt variations:
• Prompting with explicit definitions of the 10 HoC classes achieves an F1 score of 59.14 (see Row 1 in Table 5).
• Prompting without explicitly mentioning the names of the 10 HoC classes drops the F1 score to 38.20 (see Row 2 in Table 5).
• Prompting with the name of each HoC class but without the definition of each class drops the F1 score to 46.93 (see Row 3 in Table 5).
Our findings demonstrate that more descriptive prompts yield better results.

A.3 Sample ChatGPT Generated Responses
Some sample prompts with the ChatGPT-generated responses for the Relation Extraction, Document Classification, and Question Answering tasks are given in Table 6, and for the Abstractive Summarization task in Table 7.

Prompt 1 (Table 5, Row 1):
The 10 hallmark cancer taxonomy with their definitions are given below: (i) Sustaining proliferative signaling: Cancer cells can initiate and maintain continuous cell division by producing their own growth factors or by altering the sensitivity of receptors to growth factors.
(ii) Evading growth suppressors: Cancer cells can bypass the normal cellular mechanisms that limit cell division and growth, such as the inactivation of tumor suppressor genes.
(iii) Resisting cell death: Cancer cells develop resistance to apoptosis, the programmed cell death process, which allows them to survive and continue dividing.
(iv) Enabling replicative immortality: Cancer cells can extend their ability to divide indefinitely by maintaining the length of telomeres, the protective end caps on chromosomes.
(v) Inducing angiogenesis: Cancer cells stimulate the growth of new blood vessels from existing vessels, providing the necessary nutrients and oxygen to support their rapid growth.
(vi) Activating invasion and metastasis: Cancer cells can invade surrounding tissues and migrate to distant sites in the body, forming secondary tumors called metastases.
(vii) Cellular energetics: Cancer cells rewire their metabolism to support rapid cell division and growth, often relying more on glycolysis even in the presence of oxygen (a phenomenon known as the Warburg effect).
(viii) Avoiding immune destruction: Cancer cells can avoid detection and elimination by the immune system through various mechanisms, such as downregulating cell surface markers or producing immunosuppressive signals.
(ix) Tumor promoting inflammation: Chronic inflammation can promote the development and progression of cancer by supplying growth factors, survival signals, and other molecules that facilitate cancer cell proliferation and survival. (x) Genome instability and mutation: Cancer cells exhibit increased genomic instability, leading to a higher mutation rate, which in turn drives the evolution of more aggressive and drug-resistant cancer cells.
Classify the following sentence in one of the above 10 hallmark cancer taxonomy. If it cannot be classified, answer as "empty": [SENTENCE] (F1: 59.14)

Prompt 2 (Table 5, Row 2):
Is it possible to classify the following sentence in one of the 10 categories in the Hallmarks of Cancer taxonomy? If possible, write down the class. [SENTENCE] (F1: 38.20)

Prompt 3 (Table 5, Row 3):
Classify the sentence given below in one of the 10 categories (i. activating invasion and metastasis, ii. tumor promoting inflammation, iii. inducing angiogenesis, iv. evading growth suppressors, v. resisting cell death, vi. cellular energetics, vii. genomic instability and mutation, viii. sustaining proliferative signaling, ix. avoiding immune destruction, x. enabling replicative immortality) in the Hallmarks of Cancer taxonomy? If it cannot be classified, answer as "empty". [SENTENCE] (F1: 46.93)

Identify each pair of drugs and the drug-induced side-effects (e.g., diseases) in the following passage: Famotidine-associated delirium. A series of six cases. Famotidine is a histamine H2-receptor antagonist used in inpatient settings for prevention of stress ulcers and is showing increasing popularity because of its low cost. Although all of the currently available H2-receptor antagonists have shown the propensity to cause delirium, only two previously reported cases have been associated with famotidine. The authors report on six cases of famotidine-associated delirium in hospitalized patients who cleared completely upon removal of famotidine. The pharmacokinetics of famotidine are reviewed, with no change in its metabolism in the elderly population seen. The implications of using famotidine in elderly persons are discussed.
Drug: Famotidine; Drug-induced disease: Delirium; Side effect: Increased risk of delirium in elderly patients
Gold label: chemical: famotidine; disease: delirium
The key difference is in the 150-loop cavity adjacent to the conserved active site in neuraminidase. Based on these findings and by modifying oseltamivir, six analog inhibitors were proposed as candidates for developing inhibitors against H5N1 virus, particularly against the oseltamivir-resistant H5N1 virus strain.
Drug-target interactions: Oseltamivir and neuraminidase; Analog inhibitors and neuraminidase

Identify the pairs of drug-drug interactions in the passage given below based on one of the following interaction types:
(i) mechanism: this type is used to identify drug-drug interactions that are described by their pharmacokinetic mechanism.
(ii) effect: this type is used to identify drug-drug interactions describing an effect.
(iii) advice: this type is used when a recommendation or advice regarding a drug-drug interaction is given.
(iv) int: this type is used when a drug-drug interaction appears in the text without providing any additional information.
Interactions between Leukine and other drugs have not been fully evaluated. Drugs which may potentiate the myeloproliferative effects of Leukine, such as lithium and corticosteroids, should be used with caution.
Corticosteroids and Leukine - advice

Cancer cells exhibit increased genomic instability, leading to a higher mutation rate, which in turn drives the evolution of more aggressive and drug-resistant cancer cells.
Classify the following sentence in one of the above 10 hallmark cancer taxonomy. If cannot be classified, answer as "empty": Heterogeneity in DNA damage within the cell population was observed as a function of radiation dose.
genomic instability and mutation.
Gold label: genomic instability and mutation.
PubMedQA Question Answering

For the question, the reference context, and the answer given below, is it possible to infer the answer for that question from the reference context? Only reply as either Yes or No or Maybe.
Question: Do some U.S. states have higher/lower injury mortality rates than others?
Reference context: This article examines the hypothesis that the six U.S. states with the highest rates of road traffic deaths (group 1 states) also had above-average rates of other forms of injury such as falling, poisoning, drowning, fire, suffocation, homicide, and suicide, and also for the retail trade and construction industries. The converse, second hypothesis, for the six states with the lowest rates of road traffic deaths (group 2 states) is also examined. Data for these 12 states for the period 1983 to 1995 included nine categories of unintentional and four categories of intentional injury. Seventy-four percent of the group 1 states conformed to the first hypothesis, and 85% of the group 2 states conformed to the second hypothesis.
Answer: Group 1 states are likely to exhibit above-average rates for most other categories of injury death, whereas group 2 states are even more likely to exhibit below-average rates for most other categories of injury death.

Yes.
Yes.

The article discusses the causes and considerations of bad breath, which is usually related to poor dental hygiene.
Other disorders and diseases can also produce distinct breath odors, such as fruity breath being a sign of ketoacidosis in diabetes. Proper dental hygiene, avoiding smoking, and following healthcare provider's instructions are recommended for treating bad breath. If bad breath persists or is accompanied by other symptoms, medical attention may be necessary.
Bad breath could be caused by poor oral hygiene, dental and gastroesophageal diseases, infections, smoking, alcoholism, and certain foods.

Answer Summarization
For the following question, some relevant answers are given below. Please write down a short concise answer by summarizing the given answers.
Question: Can I use a Power Plate machine if I have a pacemaker?
Answer 1: Most machines and devices will not interfere with your pacemaker. But some with strong magnetic fields may. Always ask your provider about any specific device that you need to avoid. DO NOT put a magnet near your pacemaker. Most appliances in your home are safe to be around. This includes your refrigerator, washer, dryer, toaster, blender, computers and fax machines, hair dryer, stove, CD player, remote controls, and microwaves. You should keep several devices at least 12 inches (30 centimeters) away from the site where the pacemaker is placed under your skin. These include: Battery powered cordless tools (such as screwdrivers and drills), plug-in power tools (such as drills and table saws), electric lawnmowers and leaf blowers, slot machines, stereo speakers. Tell all providers that you have a pacemaker before any tests are done. Some medical equipment may interfere with your pacemaker. Stay away from large motors, generators, and equipment. DO NOT lean over the open hood of a car that is running. Also stay away from: Radio transmitters and high-voltage power lines; products that use magnetic therapy, such as some mattresses, pillows, and massagers; large electrical- or gasoline-powered appliances. If you have a cell phone: DO NOT put it in a pocket on the same side of your body as your pacemaker. When using your cell phone, hold it to your ear on the opposite side of your body. Be careful around metal detectors and security wands. Handheld security wands may interfere with your pacemaker. Show your wallet card and ask to be hand searched. Most security gates at airports and stores are OK. But DO NOT stand near these devices for long periods. Your pacemaker may set off alarms. After any operation, have your provider check your pacemaker.
Answer 2: You will probably be able to go home after 1 day or even the same day in some cases. You should be able to return to your normal activity level quickly. Ask your provider how much you can use the arm on the side of your body where the pacemaker was placed. You may be advised not to: Lift anything heavier than 10 to 15 pounds (4.5 to 6.75 kilograms); push, pull, and twist your arm for 2 to 3 weeks; raise your arm above your shoulder for several weeks. When you leave the hospital, you will be given a card to keep in your wallet. This card lists the details of your pacemaker and has contact information for emergencies. You should always carry this wallet card with you. You should try to remember the name of the pacemaker manufacturer if you can in case you lose your card.
It is important to ask your provider about specific devices that may interfere with a pacemaker, but most appliances in the home are safe. It is recommended to stay away from large motors, generators, and equipment, products that use magnetic therapy, and to be cautious around metal detectors and security wands. Additionally, after getting a pacemaker, some activity restrictions may apply, such as not lifting heavy objects and limiting arm movements. It is important to carry a wallet card with pacemaker details and emergency contact information.
Most machines are safe to use with a pacemaker. Some devices should be kept at least 12 inches away.
These include battery powered cordless tools (such as screwdrivers and drills), plug-in power tools (such as drills and table saws), electric lawnmowers and leaf blowers, slot machines, and stereo speakers. Some medical equipment may interfere with your pacemaker. Stay away from large motors, generators, and equipment; radio transmitters and high-voltage power lines; products that use magnetic therapy, such as some mattresses, pillows, and massagers; and large electrical- or gasoline-powered appliances. Do not put your cell phone in a pocket or to the ear on the same side as your pacemaker.