Two Directions for Clinical Data Generation with Large Language Models: Data-to-Label and Label-to-Data

Large language models (LLMs) can generate natural language texts for various domains and tasks, but their potential for clinical text mining, a domain with scarce, sensitive, and imbalanced medical data, is underexplored. We investigate whether LLMs can augment clinical data for detecting Alzheimer's Disease (AD)-related signs and symptoms from electronic health records (EHRs), a challenging task that requires high expertise. We create a novel pragmatic taxonomy for AD sign and symptom progression based on expert knowledge, which guides LLMs to generate synthetic data following two different directions: "data-to-label", which labels sentences from a public EHR collection with AD-related signs and symptoms; and "label-to-data", which generates sentences with AD-related signs and symptoms based on the label definition. We train a system to detect AD-related signs and symptoms from EHRs, using three datasets: (1) a gold dataset annotated by human experts on longitudinal EHRs of AD patients; (2) a silver dataset created by the data-to-label method; and (3) a bronze dataset created by the label-to-data method. We find that using the silver and bronze datasets improves the system performance, outperforming the system using only the gold dataset. This shows that LLMs can generate synthetic clinical data for a complex task by incorporating expert knowledge, and our label-to-data method can produce datasets that are free of sensitive information, while maintaining acceptable quality.


Introduction
Clinical text holds a large amount of valuable information that is not recorded by the structured data fields in electronic health records (Wang et al., 2018b). Clinical text mining, which aims to extract and analyze information from medical records, such as diagnoses, symptoms, treatments, and outcomes, has various applications, including clinical decision support, disease surveillance, patient education, and biomedical research (Murdoch and Detsky, 2013). However, clinical text mining faces two major obstacles: the scarcity and the sensitivity of medical data. Medical data is often limited in quantity and diversity because data collection and annotation are costly and difficult, requiring expert knowledge and consent from patients and providers. Medical data is also highly sensitive and confidential: ethical and legal concerns about data privacy and security impose strict regulations and restrictions on data collection and usage (Berman, 2002). These obstacles hinder the development and evaluation of clinical text mining methods, especially those based on data-hungry deep learning models.
Recently, LLMs have demonstrated impressive performance on many natural language processing (NLP) benchmarks (Wang et al., 2018a, 2019; Rajpurkar et al., 2016), as well as in medical domain applications (Singhal et al., 2023, 2022; Nori et al., 2023). However, they also face common problems such as hallucination and homogenisation (Azamfirei et al., 2023). Hallucination means that LLMs produce factual errors or inconsistencies in their outputs that do not match the input or the real world. This can damage the reliability and credibility of LLMs, especially for applications in the clinical domain that need high accuracy and consistency. In addition, data generated by LLMs tends to be highly homogeneous and fails to capture the diversity and realism of real data, which are essential for many downstream tasks. For example, for clinical text analysis, the generated data needs to cover different types of clinical texts, such as patient histories, diagnoses, and treatment plans. This presents a major challenge for LLMs. The gap between LLM-generated data and real data raises doubts about the practical value of LLMs.
In this paper, we investigate whether the outputs of LLMs can be a valuable data source for clinical text mining despite these drawbacks. We focus on detecting Alzheimer's Disease (AD) signs and symptoms from electronic health record (EHR) notes. AD is a progressive neurodegenerative disorder that affects millions of people worldwide (Scheltens et al., 2021; Schachter and Davis, 2022). It can cause cognitive impairment, behavioral changes, and functional decline. Detecting AD-related signs and symptoms from EHRs is a crucial task for early diagnosis, treatment, and care planning (Leifer, 2003). In addition to the scarcity, sensitivity, and imbalance of clinical data, this task is highly challenging because of the expertise required to interpret the complex and diverse manifestations of AD (Dubois et al., 2021).
We propose a novel pragmatic taxonomy for AD sign and symptom progression based on expert knowledge, which consists of nine categories that capture the cognitive, behavioral, and functional aspects of AD (Bature et al., 2017; Lanctôt et al., 2017). We created three datasets following the taxonomy: (1) a gold dataset annotated by human experts on longitudinal EHRs of AD patients; (2) a silver dataset created by the data-to-label method, which labels sentences from a public EHR collection with AD-related signs and symptoms; and (3) a bronze dataset created by the label-to-data method, which generates sentences with AD-related signs and symptoms based on the label definitions. The "data-to-label" method employs LLMs as annotators and has been widely adopted in many tasks. The "label-to-data" method, on the other hand, relies on the LLM's generation ability to produce labelled data based on instructions.
We performed binary classification (whether a sentence is related to AD signs and symptoms or not) and multi-class classification (assigning one of the nine pre-defined categories of AD signs and symptoms to an input sentence) experiments, using different data combinations to fine-tune pre-trained language models (PLMs), and compared their performance on the human-annotated gold test set. We observed that system performance can be significantly improved by the silver and bronze datasets. In particular, combining the gold dataset with the bronze dataset, which is generated by the label-to-data method, outperforms the model trained only on the gold or gold+silver dataset for some categories. The minority classes, which have far fewer gold samples, benefit most. We noticed a slight degradation in performance for a small proportion of categories. Overall, however, the gains demonstrate that LLMs can be applied to medical data annotation, and that even their hallucinations can be leveraged to create datasets that are free of sensitive information while preserving acceptable quality.
The contributions of this paper are as follows:
• We create a novel pragmatic taxonomy for AD sign and symptom progression based on expert knowledge, which has been shown to support reliable annotation using information described in EHR notes.
• We investigate whether LLMs can augment clinical data for detecting AD-related signs and symptoms from EHRs, using two different methods: data-to-label and label-to-data.
• We train a system to detect AD-related signs and symptoms from EHRs, using three datasets: gold, silver, and bronze. We also evaluate the quality of the synthetic data generated by LLMs using both automatic and human metrics. We show that using the synthetic data improves system performance, outperforming the system trained only on the gold dataset.

Large Language Models
Large language models (LLMs) have enabled remarkable advances in many NLP domains because of their excellent results and ability to comprehend natural language. Popular LLMs, including GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), GPT-4 (OpenAI, 2023), LaMDA (Thoppilan et al., 2022), BLOOM (Scao et al., 2022), and LLaMA (Touvron et al., 2023), vary in model size, data size, training objective, and generation strategy, but they all share the ability to generate natural language texts across various domains and tasks. They have achieved impressive results on many NLP benchmarks by leveraging large-scale and diverse text data from the web.
However, researchers have noticed the drawbacks of LLMs since their debut. Some limitations, such as hallucination and homogenisation, have been widely acknowledged and have drawn considerable attention from the research community (Tamkin et al., 2021). Hallucination is a well-known and widely studied problem in natural language generation (NLG), often defined as "generated content that is nonsensical or unfaithful to the provided source content" (Ji et al., 2023). Hallucination has been observed and analyzed in various NLG tasks, such as machine translation (MT) (Guerreiro et al., 2023), text summarization (Cao et al., 2021), and dialogue generation (Das et al., 2023). It can be caused by various factors, such as data noise, model bias, lack of commonsense knowledge, or insufficient supervision, and it can be detected and mitigated by methods such as data cleaning, model regularization, knowledge injection, or output verification (Ji et al., 2023).
In this paper, we explore a different angle on hallucination and examine whether the hallucinations of LLMs can be a valuable data source for clinical text processing rather than a difficulty. We propose that the hallucinations of LLMs can be leveraged to create synthetic or augmented datasets that do not expose sensitive information but still maintain the linguistic and semantic features of clinical texts, such as vocabulary, syntax, and domain knowledge. Our experiments on classification tasks confirm the validity of this proposal.
Homogenisation is also a potential drawback of using LLMs at a large scale. While it ensures stable text quality, it also reduces the diversity of the text. This issue has been noticed and discussed (Marian, 2023), but research in this direction is still lacking. In this work, we observed homogenisation in the "label-to-data" method and conducted experiments that help reveal how it affects system performance on the studied task.

Clinical Text Mining and synthetic data generation
Clinical text mining faces two main obstacles: the limited availability and the confidentiality of health data. Various attempts have been made to overcome the scarcity of health data and the associated privacy issues. Public datasets, such as MIMIC (Johnson et al., 2016), i2b2 (Uzuner et al., 2011), and BioASQ (Tsatsaronis et al., 2015), are openly available for research purposes. Synthetic datasets, such as Synthea (Walonoski et al., 2018) and MedGAN (Choi et al., 2017), are constructed using statistical models, generative models, or rules. They can be used to augment or complement real medical data without violating the privacy or confidentiality of patients or providers. Data augmentation and transfer learning techniques address data scarcity or imbalance by generating or utilizing additional or related data, such as synthetic data, noisy data, or cross-domain data, to enrich the data representation or diversity (Che et al., 2017; Gligic et al., 2020; Gupta et al., 2018; Xiao et al., 2018; Amin-Nejad et al., 2020; Li et al., 2021). However, synthetic datasets may not capture the naturalness and realism of human-written medical texts, and may introduce errors or biases that can affect the performance and validity of clinical text mining methods.
LLMs have also been explored for clinical text processing. Research has demonstrated that LLMs hold health information (Singhal et al., 2022). Studies have shown that LLMs can generate unstructured data from structured inputs and benefit downstream tasks (Tang et al., 2023). Some work has also leveraged LLMs for clinical data augmentation (Chintagunta et al., 2021; Guo et al., 2023). In this paper, we propose a novel approach that leverages LLMs, especially their hallucination ability via the label-to-data method, as a data source for clinical text processing, which can mitigate the scarcity and sensitivity of medical data.

Alzheimer's disease signs and symptoms detection
Clinical text mining methods have been increasingly applied to detect AD or identify AD signs and symptoms from spontaneous speech or electronic health records, which could potentially serve as a natural and non-invasive way of assessing cognitive and linguistic functions. Karlekar et al. (2018) applied neural models to classify and analyze the linguistic characteristics of AD patients using the DementiaBank dataset. Wang et al. (2021) developed a deep learning model for earlier detection of cognitive decline from clinical notes in EHRs. Liu and Yuan (2021) used a novel NLP method based on term frequency-inverse document frequency (TF-IDF) to detect AD from the dialogue contents of the Predictive Challenge of Alzheimer's Disease. Agbavor and Liang (2022) used large language models to predict dementia from spontaneous speech, using the DementiaBank and Pitt Corpus datasets. These studies demonstrate the potential of clinical text mining methods for assisting the diagnosis of AD and analyzing lexical performance in clinical settings.

Methodology
In this section, we introduce our task of AD signs and symptoms detection and describe how we leveraged LLMs' capabilities for medical data annotation and generation. We also present the three datasets (gold, silver, and bronze) that we created and used for training and evaluating classifiers for AD signs and symptoms detection.

Task overview
Alzheimer's disease (AD) is a neurodegenerative disorder that affects memory, thinking, reasoning, judgment, communication, and behavior. It is the fifth-leading cause of death among Americans age 65 and older (Mucke, 2009). This task aims to identify nine categories of AD signs and symptoms from unstructured clinical notes. EHRs can help with early diagnosis and intervention, appropriate care and support, disease monitoring and treatment evaluation, and quality of life improvement for people with AD and their caregivers. This is a very challenging task, as AD-related signs and symptoms can vary in form and severity, and capturing them from a large amount of text requires considerable knowledge and experience. We asked experts to create the annotation guideline by defining each category of AD-related signs and symptoms and providing examples and instructions for the annotators (see Appendix A for details).

Datasets
As stated above, we created and utilized three different datasets for our experiments: gold, silver and bronze.

Gold data (Human annotation)
Experts annotated 5,112 longitudinal EHR notes of 76 patients with AD from the U.S. Department of Veterans Affairs Veterans Health Administration (VHA). The use of the data was approved by the Institutional Review Board at the VHA Bedford Healthcare System, which also approved the waiver of documentation of informed consent. Under physician supervision, two medical professionals annotated the notes for AD signs and symptoms following the annotation guidelines. They selected the sentences to be annotated, labelled each with its categories, and resolved disagreements by discussion. The inter-annotator agreement, measured by Cohen's kappa, was κ = 0.868, indicating a high level of reliability. This yields a gold standard dataset of 16,194 sentences with a mean (SD) sentence length of 17.60 (12.69) tokens.
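For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below is a minimal illustration only: the function name `cohens_kappa` and the toy labels are assumptions made for this example and are not drawn from the gold annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (hypothetical, for illustration only).
annotator_1 = ["NPS", "NPS", "Coping strategy", "NPS", "Coping strategy", "NPS"]
annotator_2 = ["NPS", "NPS", "Coping strategy", "Coping strategy", "Coping strategy", "NPS"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on five of six items while chance agreement is 0.5, so kappa is about 0.667.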

Silver data (Data-to-Label)
The silver dataset consists of 16,069 sentences with a mean (SD) sentence length of 19.60 (15.44) tokens, extracted from the MIMIC-III database (Johnson et al., 2016), a large collection of de-identified clinical notes from intensive care units. We randomly sampled sentences from the discharge summaries and used the LLM to annotate them with AD-related symptoms. The LLM receives the annotation guidelines and the clinical text as input and produces the sentence to be annotated and its categories as output. The outputs are then checked by the LLM, which is asked to explain why the sentence belongs to the assigned category. In this step, the inputs to the LLM are the guidelines and the annotated sentences, and the outputs are Boolean values and explanations. This chain-of-thought style checking has been shown to improve LLM performance (Wei et al., 2022). Although many LLMs could be used here, we adopt LLaMA 65B (Touvron et al., 2023) because of its performance, availability, cost, and privacy considerations.
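The annotate-then-verify loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used in this work: `data_to_label`, the prompt wording, the `True: reason` reply format, and the stand-in `toy_llm` are all hypothetical, and the sketch labels one sentence at a time rather than passing a whole clinical text.

```python
def data_to_label(sentences, guideline, llm):
    """Label sentences with an LLM, then keep only labels the LLM can justify.

    `llm` is any callable mapping a prompt string to a reply string
    (hypothetical interface; a real deployment would wrap a model call).
    """
    kept = []
    for sentence in sentences:
        # Step 1: ask for a category, or NONE if no sign or symptom is present.
        label = llm(f"{guideline}\n\nSentence: {sentence}\nCategory (or NONE):").strip()
        if label == "NONE":
            continue
        # Step 2: ask for a Boolean verdict plus an explanation of the label.
        reply = llm(
            f"{guideline}\n\nSentence: {sentence}\nAssigned category: {label}\n"
            "Is this correct? Answer True or False, then give a reason after a colon."
        )
        verdict, _, reason = reply.partition(":")
        if verdict.strip().lower() == "true":
            kept.append((sentence, label, reason.strip()))
    return kept


def toy_llm(prompt):
    """Rule-based stand-in for an LLM so that the sketch runs offline."""
    if "Is this correct?" in prompt:
        if "forgets" in prompt:
            return "True: a family member reports memory loss."
        return "False: a relative's occupation does not indicate concern."
    return "NONE" if "blood pressure" in prompt else "Notice/concern by others"


toy_sentences = [
    "Daughter reports he forgets recent conversations.",
    "Her blood pressure was normal.",
    "Her husband is a pediatric neurologist.",
]
silver = data_to_label(toy_sentences, "Toy guideline text.", toy_llm)
print(silver)
```

With the toy model, the second sentence is skipped in step 1 and the third is rejected in step 2, so only the first sentence is kept.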

Bronze data (Label-to-Data)
We used GPT-4, a state-of-the-art LLM that has been shown to generate coherent and diverse texts across various domains and tasks (OpenAI, 2023). GPT-4 is a transformer-based model with billions of parameters, trained on a large corpus of web texts. We accessed GPT-4 through the Azure OpenAI service.
In the generation task, GPT-4 takes only the annotation guidelines as input, produces a piece of note text, and outputs the sentences to be annotated along with their categories. This is a bronze-level dataset that does not contain any sensitive personal information. It consists of 16,047 sentences with an average sentence length of 16.58 words. Figure 1 shows a snippet of the generated text and annotations. As shown in the example, the model first generates a clinical text and then extracts sentences of interest for annotation. The generated text is rich in AD signs and symptoms and contains no Protected Health Information (PHI) or Personally Identifiable Information (PII). The text does not completely convey the complexity of AD diagnosis or the follow-up required to arrive at it. For example, it contains "I immediately drove over and took him to the ER. After a series of tests, including an MRI and a neuropsychological evaluation, he was diagnosed with Alzheimer's disease." In fact, diagnosing AD is challenging, and a patient is unlikely to be diagnosed with AD in the ER. Nevertheless, the text maintains the linguistic and semantic features of clinical texts, i.e., vocabulary, syntax, and domain knowledge, and represents high-quality annotation.
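A minimal sketch of the label-to-data direction is given below, assuming a reply format that is not specified in this paper: the model is asked to write a note, then list `sentence || category` pairs after an `ANNOTATIONS:` marker. The function `label_to_data`, the prompt wording, the reply format, and `toy_generator` are all hypothetical and serve only to show the generate-then-extract flow.

```python
def label_to_data(guideline, llm, n_notes):
    """Generate synthetic notes from the guideline alone and collect labelled sentences.

    `llm` is any callable mapping a prompt string to a reply string
    (hypothetical interface standing in for a hosted model).
    """
    prompt = (
        f"{guideline}\n\nWrite a short synthetic clinical note. Then write "
        "'ANNOTATIONS:' and list each relevant sentence as 'sentence || category'."
    )
    examples = []
    for _ in range(n_notes):
        note, _, annotations = llm(prompt).partition("ANNOTATIONS:")
        for line in annotations.strip().splitlines():
            sentence, sep, category = line.partition("||")
            # Keep only well-formed pairs whose sentence really occurs in the note.
            if sep and sentence.strip() in note:
                examples.append((sentence.strip(), category.strip()))
    return examples


def toy_generator(prompt):
    """Fixed reply standing in for a generative model (illustration only)."""
    return (
        "NOTE: He needs help with dressing. He uses a calendar to track appointments.\n"
        "ANNOTATIONS:\n"
        "He needs help with dressing. || Requires assistance\n"
        "He uses a calendar to track appointments. || Coping strategy\n"
        "He wanders at night. || NPS"
    )


bronze = label_to_data("Toy guideline text.", toy_generator, n_notes=1)
print(bronze)
```

The third annotation is dropped because its sentence does not appear in the generated note, a simple consistency check that one might add on top of the generation step.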
We processed the datasets by tokenizing, lowercasing, and removing duplicate sentences. We split them into train, validation, and test sets (80/10/10 ratio). Table 1 shows the dataset statistics. The gold and silver data have similar average lengths and standard deviations. The bronze data has a smaller standard deviation and a more balanced categorical distribution than the gold and silver data, and in both respects differs from real patient notes. The smaller SD in sentence length indicates less diversity in the bronze data. We experiment with these datasets to see how the LLM's output can help clinical text mining.
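The preprocessing and split can be sketched as below. Whitespace tokenization, the fixed random seed, and the helper names `preprocess` and `split_80_10_10` are assumptions made for this illustration; the tokenizer actually used is not specified here.

```python
import random


def preprocess(sentences):
    """Lowercase, tokenize on whitespace (an assumption), and drop duplicates."""
    seen, cleaned = set(), []
    for sentence in sentences:
        tokens = sentence.lower().split()
        key = " ".join(tokens)
        if key and key not in seen:
            seen.add(key)
            cleaned.append(tokens)
    return cleaned


def split_80_10_10(items, seed=0):
    """Shuffle reproducibly, then cut into train/validation/test at 80/10/10."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


# Toy corpus: 20 distinct sentences plus one case-variant duplicate.
raw = [f"Toy sentence number {i}" for i in range(20)] + ["toy SENTENCE number 3"]
cleaned = preprocess(raw)
train, val, test = split_80_10_10(cleaned)
print(len(cleaned), len(train), len(val), len(test))
```

Deduplicating after lowercasing means that case variants of the same sentence count as duplicates, which is one possible reading of the processing order described above.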

Classifiers
We use an ensemble method that integrates multiple models and relies on voting to produce the final output. This reduces the variance and bias of individual models and enhances the accuracy and generalization of the prediction, which is crucial for clinical text mining (Mohammed and Kora, 2023).
For the base models, we use pre-trained language models (PLMs), neural networks that have been trained on large amounts of text data and can capture general linguistic patterns. Three PLMs are used in this work: BERT (bert-base-uncased) (Devlin et al., 2018), RoBERTa (roberta-base) (Liu et al., 2019), and ClinicalBERT (Huang et al., 2019). These models have been widely used for clinical text processing and have achieved good performance (Vakili et al., 2022; Alsentzer et al., 2019).
We fine-tune the PLMs on different combinations of the gold, silver, and bronze datasets, as described below. We use cross-entropy loss and the Adam optimizer with a learning rate of 1e-4 and a batch size of 32, and train for 10 epochs. We select the best checkpoint based on validation accuracy for testing. The final output is determined by majority vote.
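The voting step can be illustrated with the short sketch below. The tie-breaking rule (fall back to the earliest model in list order) is an assumption, since the rule used for three-way disagreements is not stated; the function name `majority_vote` and the toy predictions are likewise hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists (all the same length) into one label list."""
    voted = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # Tie-break (assumption): take the first model's vote among the top labels.
        voted.append(next(v for v in votes if counts[v] == top))
    return voted


# Toy predictions from three fine-tuned models on three sentences.
bert = ["NPS", "Coping strategy", "NPS"]
roberta = ["NPS", "Requires assistance", "Coping strategy"]
clinicalbert = ["Coping strategy", "Requires assistance", "Requires assistance"]
ensemble = majority_vote([bert, roberta, clinicalbert])
print(ensemble)
```

The third sentence receives three different votes, so the result there depends entirely on the assumed tie-break.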
We use a subset of the gold dataset as the test set. We compare performance using accuracy, precision, recall, and F1-score.
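For the binary task, these four metrics can be computed as in the sketch below. The helper name `binary_metrics` and the toy labels are assumptions for illustration; for the multi-class task the same counts would be taken per category, and the averaging scheme is not specified here.

```python
def binary_metrics(gold, pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 1 = related to AD signs and symptoms, 0 = unrelated.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = binary_metrics(gold, pred)
print(scores)
```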

Data Combinations
We fine-tuned the PLMs under four data settings, always evaluating on the test set of the gold data:
• Gold only: we fine-tuned the PLMs only on the training set of the gold data.
• Bronze + Gold: we fine-tuned the PLMs on the bronze data and then further fine-tuned them on the training set of the gold data.
• Silver + Gold: we fine-tuned the PLMs on the silver data and then further fine-tuned them on the training set of the gold data.
• Bronze + Silver + Gold: we fine-tuned the PLMs on the combination of the bronze and silver data and then further fine-tuned them on the training set of the gold data.
For each data setting, we trained the models as a binary classifier and as a multi-class classifier. For the binary classification task, we randomly sampled sentences with no AD signs and symptoms from the longitudinal notes of the 76 patients from which our gold data comes, as negative data. The negative/positive data ratio is 5:1. The task is to decide whether a sentence is relevant to AD signs and symptoms or not. For the multi-class classification task, the classifiers need to identify the one category, among our nine categories of AD signs and symptoms, that the sentence belongs to.
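The staged fine-tuning order in the four settings can be written down as a small schedule, sketched below. The names `STAGES`, `run_setting`, and `toy_fine_tune` are hypothetical, and the stand-in training function only records how many examples each stage saw; a real `fine_tune` would update model weights.

```python
# Each setting is a list of stages; each stage lists the datasets pooled for it.
STAGES = {
    "gold": [["gold"]],
    "bronze+gold": [["bronze"], ["gold"]],
    "silver+gold": [["silver"], ["gold"]],
    "bronze+silver+gold": [["bronze", "silver"], ["gold"]],
}


def run_setting(setting, datasets, fine_tune, model):
    """Fine-tune stage by stage: synthetic data first, gold training data last."""
    for stage in STAGES[setting]:
        train = [example for name in stage for example in datasets[name]]
        model = fine_tune(model, train)
    return model


def toy_fine_tune(model, train):
    """Stand-in trainer (illustration only): append the stage's example count."""
    return model + [len(train)]


toy_data = {
    "gold": ["g1", "g2", "g3"],
    "silver": ["s1", "s2"],
    "bronze": ["b1", "b2", "b3", "b4"],
}
history = run_setting("bronze+silver+gold", toy_data, toy_fine_tune, [])
print(history)
```

Keeping gold as the final stage in every setting means the last updates always come from human-annotated data.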

Results
Table 2 and Table 3 show the performance of the system on the test set of the gold data, using different combinations of data for fine-tuning PLMs.
The results demonstrate that the system benefits from the silver and bronze data, and most increases are significant. For binary classification, the highest performance is obtained by fine-tuning the system on both the bronze and silver data (overall accuracy = 0.94, 4.44%↑). For multi-class classification, the bronze data alone provides the biggest overall performance improvement (7.35%↑). In contrast, the silver data yields little improvement (1.47%↑) and even reduces the gain when combined with the bronze data (5.88%↑ < 7.35%↑). For subcategories, the performance gain is large in minority classes, including coping strategy (31.82%↑ by adding bronze+silver) and notice/concern by others (21.05%↑ by adding bronze+silver).
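The percentage changes can be read as relative changes against the gold-only baseline; this reading is an assumption. The sketch below shows the arithmetic with a hypothetical gold-only accuracy of 0.90, a value that is not stated in the text and is chosen here only because it is consistent with the reported pair (0.94, 4.44%↑).

```python
def relative_change_percent(baseline, new):
    """Relative change of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Assumption: gold-only accuracy of 0.90 (back-computed, not stated in the text).
assumed_gold_only = 0.90
change = round(relative_change_percent(assumed_gold_only, 0.94), 2)
print(f"{change}% relative increase")
```

Under this reading, a change is expressed relative to the baseline score rather than in absolute percentage points.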

Analysis
The performance of machine learning models is largely determined by data quantity and quality. To better understand the results, we conducted a series of analyses.
As Table 1 shows, the gold data has an imbalanced distribution of categories. This poses a challenge for classification tasks. The bronze+silver data, with a more balanced categorical distribution, helps to mitigate this problem. We notice an increase in performance for Coping strategy (31.82%↑) using the bronze+silver data. Performance gains from adding more training examples are also observed for other minority classes, including NPS, Requires assistance, and Cognitive assessment.
The amount of data is not the only factor that influences performance. For the physiological changes category, adding silver data four times the size of the gold data makes no difference, while adding a smaller amount of bronze data results in a significant improvement of 21.88% in F-score. This suggests that the bronze data is of higher quality than the silver data for some categories. We randomly selected 100 samples from each of the silver and bronze datasets and asked our human experts to check the quality of the annotation. The annotation accuracy is around 85% on the bronze data and around 55% on the silver data, indicating the complexity and challenge of real-world data. Some examples of the LLM's labelling errors are shown in Table 4.
Experts identify at least two types of labelling errors by the LLM in the silver data:
1. Over-inference (examples 1 and 2): the LLM tends to make inferences from the information presented that go beyond what is supported by the evidence or reasoning. In example 1, the LLM infers that because the patient is weak, assistance must be required. As in example 2, we found that some sentences mentioning a son or daughter who is a nurse were also labelled as Concerns by others. The LLM infers that a spouse's or child's medical expertise, or their presence at the hospital, may indicate concern, but this could be wrong.
2. Missed negation (example 3): the LLM labels the sentence as Physiological changes although the finding is negated ("without lymphadenopathy").
The misclassified data affects the two tasks differently. For binary classification, the system only needs to distinguish sentences with AD-related signs and symptoms from other texts, so the misclassification is less critical. For the multi-class classification task, however, the system needs to assign the correct category of AD-related signs and symptoms, and can be confused by the LLM's outputs. This partially offsets the advantage of increasing the amount of data, especially when using the silver dataset, which has a much lower accuracy than the bronze dataset. We observe that the silver dataset even harms performance on "Requires Assistance" in the multi-class classification task.
On the other hand, when using the bronze dataset, which has relatively higher quality, we see overall performance improvements for both binary and multi-class classification tasks. We noticed that in the multi-class classification task, the bronze data causes performance degradation on some categories. The bronze data differs from the gold and silver data, which are real patient notes. This may cause a distribution mismatch with the test dataset and lower performance for some categories. Table 1 shows that the bronze data has less variation in sentence length (SD of 4.69 vs. 12.69 for gold and 15.44 for silver). This suggests that we need to steer LLMs to produce data that matches the data encountered in practice.
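The length statistics used in this comparison can be computed as in the sketch below. Whitespace tokenization and the use of the sample standard deviation are assumptions, as neither is specified here; the helper name `length_stats` and the toy sentences are hypothetical and only illustrate how homogeneous text yields a small SD.

```python
import statistics


def length_stats(sentences):
    """Mean and sample standard deviation of sentence length in tokens."""
    lengths = [len(sentence.split()) for sentence in sentences]
    return statistics.mean(lengths), statistics.stdev(lengths)


# Toy data: uniform-looking sentences versus sentences of very different lengths.
homogeneous = ["he forgets names often", "she needs help with meals", "he repeats questions daily"]
varied = ["pt confused", "daughter reports that he has been forgetting recent conversations", "needs help dressing today"]
mean_h, sd_h = length_stats(homogeneous)
mean_v, sd_v = length_stats(varied)
print(f"homogeneous: mean={mean_h:.2f}, SD={sd_h:.2f}")
print(f"varied:      mean={mean_v:.2f}, SD={sd_v:.2f}")
```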
To sum up, different data combinations affect the results (Tables 2 and 3) by varying the amount, quality, and distribution of the training data. Nevertheless, performance generally improves with the addition of the bronze and/or silver data, though further analysis is needed for each category.

Conclusion and Future Work
In this paper, we examined the possibility of using LLMs for medical data generation and assessed the effect of LLMs' outputs on clinical text mining. We developed three datasets: a gold dataset annotated by human experts from a medical dataset, which is the most widely used method for clinical data generation; a silver dataset annotated by the LLM from MIMIC (data-to-label); and a bronze dataset generated by the LLM from its hallucinations (label-to-data). We conducted experiments to train classifiers to detect and categorize Alzheimer's disease (AD)-related symptoms from medical records. We found that using gold data plus bronze and/or silver data achieved better performance than using gold data only, especially for minority categories, and that the LLM annotations and hallucinations were helpful for augmenting the training data, despite some noise and errors.
Our findings suggest that LLMs can be a valuable tool for medical data annotation when used carefully, especially when the data is scarce, sensitive, or costly to obtain and annotate. By using LLM hallucinations, we can create synthetic data that does not contain real patient information and that captures some aspects of clinical language and domain knowledge. However, our approach also faces ethical and practical challenges, such as ensuring the quality, diversity, validity, and reliability of the LLM annotations and hallucinations, protecting the privacy and security of the data and the model, and avoiding the potential harms and biases of the LLM outputs.
For future work, we will investigate other methods and techniques for enhancing and regulating the LLM annotations and hallucinations, such as prompts, feedback, or adversarial learning. We will also tackle the ethical and practical issues of using LLMs for medical data annotation by adhering to best practices and guidelines for responsible and trustworthy AI. We also intend to apply our approach to other clinical text processing tasks, such as relation extraction, entity linking, and clinical note generation.

Limitations
Despite the promising results, our approach has several limitations that need to be acknowledged and addressed in future work. First, our experiments are based on the specific LLMs we used and a single clinical task (AD-related signs and symptoms detection). It is unclear how well our approach generalizes to other LLMs and other clinical tasks. Different LLMs may have different hallucination patterns and biases, and different clinical tasks may have different annotation criteria and challenges. Therefore, more comprehensive and systematic evaluations are needed to validate the robustness and applicability of our approach.
Second, our approach relies on the quality and quantity of the LLMs' annotations and hallucinations, which are not guaranteed to be consistent or accurate. The LLMs may produce irrelevant, incorrect, or incomplete annotations or hallucinations, which introduce noise or confusion to the classifier. Moreover, the LLMs may not cover the full spectrum of AD-related signs and symptoms, or may generate rare or novel symptoms that are not in the gold dataset. Therefore, the LLMs' annotations and hallucinations may not fully reflect the true distribution and diversity of the clinical data. To mitigate these issues, we suggest using quality control mechanisms, such as filtering, sampling, or post-editing, to improve the LLMs' outputs. Fine-tuning on high-quality gold data can partially address these problems. We also suggest using data augmentation techniques, such as paraphrasing, synonym substitution, or adversarial perturbation, to enhance the LLMs' outputs.
Third, our approach may raise some ethical and practical concerns regarding the use of LLMs for medical data annotation, especially their hallucinations. Although not observed in this work, there is still a slight possibility that the LLMs may produce sensitive or personal information that breaches the privacy or consent of patients or clinicians. The LLMs may also generate misleading or harmful information that may affect the diagnosis or treatment of patients or the decision making of clinicians. Therefore, the LLM outputs should be used with caution and responsibility, and should be verified and validated by human experts before being used for any clinical purposes. We also suggest using anonymization or encryption techniques to protect the confidentiality and security of the LLM outputs.

Table 4: Examples of the LLM's incorrect annotations from the silver data.
No. | Example | LLM annotation | Human comments
1 | [Pt] was profoundly weak, but was no longer tachycardic and had a normal blood pressure. | Requires assistance | Over-inference
2 | Her husband is a pediatric neurologist at [Hospital]. | Notice/concern by others | Over-inference
3 | Neck is supple without lymphadenopathy. | Physiological changes | Miss negation

A Annotation Guideline
The annotation guideline comprises the main part of the prompts used in this work.

B Prompts
We used three prompts in this work.
1. Prompt 1 asks the LLM to annotate the provided text following the above guidelines.

Figure 1: Illustration of the text and annotation generated by the GPT-4 based on the annotation guideline.(Class names are shown here for better readability.)

Table 2: Performance (P/R/F-1/Accuracy (change compared to gold only)) of the ensemble system on the gold test set using different data combinations for training (Binary Classification).

Table 3: Performance (F-1/Accuracy (change compared to gold only)) of the ensemble system on the gold test set using different data combinations for training (Multi-class Classification).
The annotation guideline was created by experts and revised based on LLM outputs. The version used in this work is as follows: