NLI4CT: Multi-Evidence Natural Language Inference for Clinical Trial Reports

How can we interpret and retrieve medical evidence to support clinical decisions? Clinical trial reports (CTRs) amassed over the years contain indispensable information for the development of personalized medicine. However, it is practically infeasible to manually inspect over 400,000 clinical trial reports in order to find the best evidence for experimental treatments. Natural Language Inference (NLI) offers a potential solution to this problem by allowing the scalable computation of textual entailment. However, existing NLI models perform poorly on biomedical corpora, and previously published datasets fail to capture the full complexity of inference over CTRs. In this work, we present a novel resource to advance research on NLI for reasoning on CTRs. The resource includes two main tasks: firstly, to determine the inference relation between a natural language statement and a CTR; secondly, to retrieve supporting facts to justify the predicted relation. We provide NLI4CT, a corpus of 2,400 statements and CTRs annotated for these tasks. Baselines on this corpus expose the limitations of existing NLI approaches, with 6 state-of-the-art NLI models achieving a maximum F1 score of 0.627. To the best of our knowledge, we are the first to design a task that covers the interpretation of full CTRs. To encourage further work on this challenging dataset, we make the corpus, competition leaderboard, and website available on CodaLab, and code to replicate the baseline experiments on GitHub.


Introduction
Clinical trials are research studies performed to test the efficacy and safety of novel treatments, and they are indispensable for the progression of experimental medicine (Avis et al., 2006). CTRs are documents that detail the methodology and results of a particular trial. Clinical practitioners use the information in these CTRs to design personalized and targeted interventions, matching therapeutic agents to each patient's tumor biomarkers. However, there are over 400,000 CTRs available, being continually published at an ever-increasing rate (Bastian et al., 2010). Consequently, it has become unfeasible to manually carry out comprehensive evaluations of all the relevant literature when designing treatment protocols (DeYoung et al., 2020). Natural Language Inference (NLI) offers a potential solution for the large-scale interpretation and retrieval of medical evidence, connecting the latest evidence to support personalized care (Sutton et al., 2020).
In this paper, we propose two different tasks on breast cancer CTRs. Firstly, to determine the inference relation between a natural language statement and a CTR, shown in Figure 1. Secondly, to retrieve supporting facts from the CTR(s) to justify the predicted relation, demonstrated in Figure 1. With respect to healthcare, these inference tasks are a move towards automatic claim verification and evidence extraction over biomedical corpora.
These proposed tasks pose several challenges for existing NLI systems. The inference task requires substantial amounts of quantitative reasoning and numerical operations, as seen in Figure 1, on which many current state-of-the-art NLI models perform poorly (Ravichander et al., 2019; Galashov et al., 2019). Additionally, much of the inference for these tasks revolves around domain-specific terminology, also shown in Figure 1. Many state-of-the-art (SOTA) NLI models fail to effectively surmount the word distribution shift from general-domain corpora to biomedical corpora through transfer learning (Lee et al., 2019). This phenomenon is well-represented in the proposed dataset. This task differentiates itself from other biomedical NLI tasks by considering the full scope of CTRs: it is not limited to results and interventions, but also includes eligibility and adverse events (AEs) (DeYoung et al., 2020). The task also exhibits significant variance in the types of inference required to solve each instance, with minimal, if any, repetition in inference chains.
The contributions of this paper are as follows:
1. The definition of a novel benchmark (NLI4CT), comprising two main tasks that require reasoning over CTRs and incorporate several of the fundamental challenges currently faced by modern NLI systems over clinical trial text.
2. An extensive empirical evaluation of state-of-the-art NLI models demonstrating current limitations and challenges involved in the proposed tasks. Specifically, we test 6 SOTA NLI models on our dataset and report a maximum F1 score of 0.644. We attribute these results to generalization problems induced by the shift in distribution required for inference.
To the best of our knowledge, we are the first to design a task that covers the full scope of CTRs, combining the complexity of biomedical and numerical NLI. Solving these tasks would be a significant advancement toward the efficient retrieval, synthesis, and inference of published literature to support evidence-based experimental medicine.

Related Work
There is a multitude of expert-annotated resources for clinical NLP; examples are shown in Table 1. The TREC 2021 Clinical Track (Soboroff, 2021) is a large-scale information retrieval task on CTR data, with an emphasis on eligibility. Evidence Inference 2.0 (DeYoung et al., 2020) presents a Question-Answering (QA) task and a span selection task on CTR data, specifically over the CTR results. The MEDNLI dataset (Romanov and Shivade, 2018) contains an entailment task which utilizes the medical history notes of patients as premises. These datasets are predominantly designed to test for biomedical language understanding and reasoning, whereas NLI4CT additionally requires complex numerical inference and common-sense reasoning to solve, and is designed to avoid repetitive inference patterns. Currently, neural architectures achieve the best results on biomedical NLI datasets (Gu et al., 2021; DeYoung et al., 2020).
Neural models continue to perform poorly on quantitative reasoning and numerical operations within NLI (Ravichander et al., 2019; Galashov et al., 2019). Prior works such as Lee et al. (2020), Shin et al. (2020), and Gu et al. (2021) experiment with biomedical pre-training strategies. Kiritchenko et al. (2010) present ExaCT, an automatic information extraction system for clinical trial variables such as drug dosage, sample size, and primary outcomes from CTRs. While there are many proficient evidence retrieval models, there are no existing systems that are able to effectively carry out biomedical and numerical NLI simultaneously.

Clinical Trial Reports
Clinical trials are research studies on human volunteers used to evaluate an intervention (DeYoung et al., 2020). An intervention may be medical, surgical, or a behavioral protocol. These interventions are often compared with a standard treatment, placebo, or control protocol. The investigators use various outcome measurements from the volunteers to ascertain the effectiveness of the intervention, as well as the likelihood and severity of AEs.
For NLI4CT we retrieved 1,000 publicly available breast cancer CTRs with published results from ClinicalTrials.gov. This data is provided by the U.S. National Library of Medicine and covered by the HIPAA Privacy Rule, protecting personally identifiable information. We separate the CTRs into 4 sections:
• Eligibility criteria - A set of conditions that patients must meet to take part in the clinical trial.
• Intervention - Information concerning the type, dosage, frequency, and duration of the treatments being studied.
• Results - Number of participants in the trial, outcome measures, units, and the results.
• Adverse Events - Signs and symptoms observed in patients during the clinical trial.
See examples of the types of information contained in these sections in Table 2.

Task Definition
We define two different tasks on the NLI4CT dataset: Task 1, a textual entailment task, and Task 2, an evidence selection task. Each instance in NLI4CT contains a CTR premise and a statement.
The CTR premise consists of one of the 4 sections of a CTR, with a length ranging from 5 to 500 tokens, and the statements are sentences ranging from 10 to 35 tokens (see example in Figure 1). On average, 7.74 pieces of relevant evidence need to be selected out of a total of 21.67 facts per CTR premise. There are two types of instances in NLI4CT: single and comparison. In single instances, the CTR premise is the primary CTR, and the statement makes a claim about the information in this CTR. In comparison instances, the premise contains a primary and a secondary trial, and the statement makes a claim comparing and contrasting the two trials. To summarize:
Task 1 Requires determining the entailment relation between the CTR premise and the statement; the output is therefore either an entailment or contradiction label, as shown in Figure 1.
Task 2 Requires outputting a set of supporting facts, extracted from the CTR premise, necessary to justify the label predicted in Task 1.
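For illustration, the two tasks can be sketched as a simple input/output contract. The field names below ("type", "section", "statement", "premise") and the example content are our own illustrative assumptions, not the released schema:

```python
# Illustrative sketch of an NLI4CT instance and the expected outputs for the
# two tasks. Field names and contents are invented for this example.
instance = {
    "type": "single",                 # or "comparison"
    "section": "eligibility",         # one of the 4 CTR sections
    "statement": "All patients in the primary trial are female.",
    "premise": [                      # the CTR section, as a list of facts
        "Inclusion criteria:",
        "Female patients aged 18 years or older.",
        "Histologically confirmed breast cancer.",
    ],
}

def solve(instance):
    """Task 1: predict a label; Task 2: return indices of supporting facts."""
    label = "Entailment"              # or "Contradiction"
    evidence = [1]                    # indices into instance["premise"]
    return label, evidence

label, evidence = solve(instance)
assert label in {"Entailment", "Contradiction"}
assert all(0 <= i < len(instance["premise"]) for i in evidence)
```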

Statement annotation
A group of domain experts, including clinical trial organizers from a major cancer research center, took part in the annotation task. Annotators were provided with a CTR section prompt and two CTRs, a primary and a secondary, as seen in step (B) of Figure 2. The annotation task was first to generate an entailment statement, a short statement that makes an objectively true claim about the contents of the prompted section of the trial(s), step (C) of Figure 2. Annotators could choose to write a statement about the contents of the primary trial, or to compare the primary and secondary trials. The objective was to generate non-trivial statements, i.e. statements where solving the inference requires reading and understanding multiple rows of the CTR section. Non-trivial statements typically included summarisation, comparisons, negation, relations, inclusion, superlatives, aggregations, or rephrasing. However, non-trivial statements are not limited to these types, and any statement involving understanding and reasoning was accepted.
Each CTR is divided into 4 sections (Table 2), and each of these sections represents a set of facts. For each annotated statement we select a subset of these facts that support the label assigned to the statement; see Figure 1 and (D) in Figure 2. This collection of facts is designated as evidence. There is always at least one piece of evidence relevant to the inference for any given statement. In cases where negation is used, e.g. "clinical trial A does not have a placebo arm", annotators are asked to provide the full CTR section as evidence, as we believe this type of retrieval is more reflective of human inference patterns.
Finally, we employ a negative rewriting strategy (Chen et al., 2019), using the previously generated entailment statement to write an objectively wrong contradiction statement, as shown in step (E) of Figure 2. Annotators are encouraged to modify the words, phrases, or sentence structure while retaining the style and length of the original statement. This is done to minimize the occurrence of stylistic or linguistic patterns pertaining to either entailment or contradiction statements. We then collect evidence that contradicts the claims made in the contradiction statement. An example annotation form is available in the appendix.

Resulting Dataset
The NLI4CT dataset consists of 2,400 annotated statements with accompanying labels, CTRs, and evidence, split into 1,700 training, 500 test, and 200 development instances. The two labels and 4 CTR section prompts are equally distributed across the dataset and its splits. 60% of the instances are of the single type, with the remaining 40% being the comparison type.
Reasoning Challenges

Biomedical Reasoning

Acronyms In CTRs, acronyms are used for concepts such as test names, administration times and orders, treatment types, and disease names. Acronyms are significantly more prevalent in clinical texts than in general-domain texts and consistently disrupt NLP performance in this field (Grossman Liu et al., 2021; Shickel et al., 2017; Jiang et al., 2011; Moon et al., 2015; Jimeno-Yepes et al., 2011; Pesaranghader et al., 2019; Jin et al., 2019; Wu et al., 2015). Consider the following statement: "Patients with a positive FISH test will be administered trastuzumab IV Q.D.". Depending on the terminology in the premise, verifying all the claims in this statement requires a system to know the association between FISH/Fluorescence In Situ Hybridization, IV/intravenous infusion, and QD/once daily. Over 100,000 healthcare acronyms and abbreviations have been identified, with 170,000 corresponding senses (Grossman Liu et al., 2021).
Future systems must overcome this challenge to perform biomedical NLI.
Synonyms and aliases These are often employed for drugs and gene names. Statements in NLI4CT regularly refer to treatments and diagnoses using different aliases than are present in the CTR premise, e.g. where a CTR may use Trastuzumab in its intervention, the statement may refer to Herceptin or any of the 4 other brand names of Trastuzumab. Similarly, where the eligibility criteria may require an ERBB2+ tumor diagnosis from its patients, the statement may refer to HER2 positive or any of the 20 other aliases for the ERBB2 gene.
Taxonomic Relations Concepts such as diseases, treatments, and diagnostic tests can be classified and structured in a taxonomic hierarchy. As shown in Figure 1, to achieve the necessary inference, systems must understand that chemotherapy is a hypernym, or a super-set, of Cyclophosphamide treatments, similar to Cancer being a hypernym of Carcinoma.
Domain Knowledge NLI4CT statements regularly test for domain expert knowledge. This is done by describing characteristics or conditions which pertain to certain biomedical grading systems or diagnoses. For example, if a patient is confined to bed or a chair for more than 50% of waking hours, and the inclusion criteria require a WHO score of < 2, the model must be able to infer that the patient will not be eligible. Similarly, if a patient has a positive FISH test, and the inclusion criteria require the patient to have a Triple Negative Breast Cancer diagnosis, they would also be ineligible.

Common-Sense Reasoning
NLI4CT requires common-sense reasoning, particularly co-reference resolution and general world knowledge, which remain challenging for existing NLI models (Emami et al., 2018; Trichelair et al., 2018). Co-reference resolution is often necessary to associate dosages, frequencies, and routes of administration with specific drugs in the intervention. General world knowledge is most often required for claims about the eligibility section, e.g. a child is a person < 18 years old, and women over the age of 60 are not of childbearing potential.

Numerical Reasoning
The type of numerical reasoning required to solve instances most closely resembles Math Word Problems (MWPs) (Huang et al., 2016; Patel et al., 2021). Prior works have shown that even elementary MWPs remain unsolved for current NLP models (Patel et al., 2021). NLI4CT requires models to compare dosages, frequencies, and fractions/percentages, often necessitating unit conversions. Consider the example instance in Figure 3.
Here the model will need to aggregate the number of occurrences of every different cardiac event recorded in the adverse events of cohort 1. Then it must construct and evaluate the inequality. This combination of biomedical knowledge and numerical reasoning is expected to be a significant challenge for existing models.
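This kind of inference chain can be sketched in a few lines of Python. The event names, counts, and the claimed threshold below are invented for illustration, not taken from the actual Figure 3 instance:

```python
# Hedged sketch of a numerical inference chain: aggregate the occurrences of
# each cardiac-related adverse event in cohort 1, then evaluate the
# inequality stated in the claim. All values here are illustrative.
cohort_1_adverse_events = {
    "atrial fibrillation": 2,
    "cardiac arrest": 1,
    "myocardial infarction": 1,
    "nausea": 7,                      # not a cardiac event
}
cardiac_events = {"atrial fibrillation", "cardiac arrest", "myocardial infarction"}

# Step 1: aggregate occurrences of every cardiac event.
total_cardiac = sum(
    n for event, n in cohort_1_adverse_events.items() if event in cardiac_events
)

# Step 2: construct and evaluate the inequality from a hypothetical claim,
# e.g. "fewer than 5 cardiac events occurred in cohort 1".
claim_holds = total_cardiac < 5
print(total_cardiac, claim_holds)     # 4 True
```

Note that step 1 already requires biomedical knowledge (which adverse events count as cardiac), which is exactly the combination the paragraph above describes.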

Biomedical Pretraining
Transformer-based models leverage pre-trained contextualized word embeddings to model inference, and optimizing pre-training strategies can drastically improve downstream performance (Gururangan et al., 2020). Previous works have repeatedly shown that pre-training general-domain language models on biomedical corpora significantly improves performance on biomedical tasks (Shin et al., 2020; Beltagy et al., 2019; Lee et al., 2020; Li et al., 2016). Therefore, we also include BioBERT (Lee et al., 2020) and BioMegatron (Shin et al., 2020), versions of BERT-base and Megatron-LM that are pre-trained on biomedical corpora, in our baselines to assess the effect of biomedical pre-training on our dataset.

Experimental Details
For the transformer-based model baselines we tokenize the statements and the section(s) of the CTR(s) as indicated by the instance type and section prompt. If the instance type is comparison, we concatenate the primary and secondary CTR section data together. Then we pass the tokenized statement and section information to the model. It should be noted that for a significant portion of the instances, particularly of the comparison type, the input exceeds the maximum input length of 512 tokens. We employ a binary sequence classification/regression head for all baseline models, which predicts either Entailment or Contradiction. The models were trained on the training set for 10 epochs, and each model checkpoint was archived. For the Okapi BM25 similarity, we test threshold values from 0.1 to 0.9, in intervals of 0.1, on the development set. We then record the results of these models on the test set. Code to reproduce the experiments is available at: anonymous-link
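The input construction described above can be sketched as follows. A plain whitespace split stands in for the real model tokenizers, and the `[SEP]` separator is an assumption for illustration:

```python
# Sketch of baseline input construction: concatenate sections for comparison
# instances, then truncate to the model's maximum input length. A whitespace
# split stands in for the actual subword tokenizers.
MAX_LEN = 512

def build_input(statement, primary_section, secondary_section=None):
    premise = primary_section
    if secondary_section is not None:
        # For comparison instances, concatenate primary and secondary sections.
        premise = premise + " " + secondary_section
    tokens = (statement + " [SEP] " + premise).split()
    # Inputs longer than the model maximum are truncated, which is why long
    # (especially comparison) instances can lose relevant information.
    truncated = len(tokens) > MAX_LEN
    return tokens[:MAX_LEN], truncated
```

This makes the truncation issue concrete: a comparison instance with two long sections silently loses everything past token 512.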

Evaluation Metrics
Task 1 is a binary classification task: the statement is labeled as either entailment or contradiction. We evaluate performance on Task 1 by computing the Precision, Recall, and macro F1-score of the predicted labels against the annotated gold labels (Entailment/Contradiction). For Task 2 the output is a subset of the facts within the CTR premise. We frame the evidence selection problem as a ranking problem (Jansen et al., 2021): models assign a score to each fact in a CTR premise, which can be used to rank the facts. In this framework, solving the task requires the model to rank all the facts in the gold evidence higher than the irrelevant facts. We evaluate performance on this task using Mean Average Precision (mAP@K) (Liu, 2011).

Results

Table 8 summarises the performance of the baseline models on Task 1. We report the results from the model checkpoints with the highest F1 score on the test set, omitting the results of models which exclusively predict a single class. None of the models achieve an F1-score significantly above the random baseline (0.667). BM25, T5, BERT-base, and GPT2 also fail to achieve over 0.5 accuracy on the test set. The best-performing models on the test set were BioBERT, BioMegatron, and RoBERTa-base, each achieving over 0.6 accuracy and F1 score. BERT-base reported a significantly lower F1 than the other baselines. BioBERT and BioMegatron both achieve over 0.95 accuracy and F1 score on the training set after 10 epochs. As shown in Table 8, this does not translate to improved performance on the test set, indicating drastic over-fitting. BioBERT significantly outperforms its general-domain counterpart on the training set, achieving a maximum F1 score of 0.96, in contrast to 0.65 by BERT-base. Biomedical pre-training clearly impacts the models' ability to encode relevant information, but the gains are limited to the training set. On average, model complexity does not have a consistent impact on performance. Neither the over-fitting biomedical models nor the models which failed to learn on the training set exhibited any positive response to changes in the learning rate. Together, these results indicate that the NLI4CT entailment task represents a significant challenge for existing NLI models.
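The two evaluation metrics can be sketched in plain Python: macro F1 over the two labels for Task 1, and average precision over a ranked fact list (with K equal to the number of facts) for Task 2. This is a minimal reference implementation, not the official scorer:

```python
# Minimal sketches of the evaluation metrics: macro F1 (Task 1) and average
# precision over a ranked fact list (Task 2).
def macro_f1(gold, pred, labels=("Entailment", "Contradiction")):
    scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)  # average F1 over the label set

def average_precision(ranked_facts, gold_evidence):
    """AP over fact ids ranked by model score (K = all facts)."""
    hits, total = 0, 0.0
    for k, fact in enumerate(ranked_facts, start=1):
        if fact in gold_evidence:
            hits += 1
            total += hits / k         # precision at each relevant rank
    return total / len(gold_evidence)
```

mAP is then the mean of `average_precision` over all instances in the test set.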

Evidence-only Baseline
We generated an additional set of baselines, using only the gold evidence as the premise. This reduces the likelihood of the models focusing on irrelevant or adversarial information and removes the need to identify evidence spans within the CTR section. This did not result in any significant differences in performance, with BioBERT achieving the highest F1 score of 0.667. This implies that even after training for 10 epochs none of the models draw inferences from the relevant evidence. The results for this baseline are available in the appendix.

Statistical Artefacts
We generated a baseline with BERT-base, using only the statements, removing any access to evidence or context from the CTRs. BERT-base achieved a maximum F1 score of 0.626 and an accuracy of 0.606 on the test set. This is a significantly higher F1 score than on the Task 1 baseline. Additionally, we observed above-random accuracy (0.5) with several model checkpoints. This indicates that the evaluated models might exclusively rely on the presence of superficial statistical artefacts without learning the underlying rules of the tasks.
Lexical and syntactic patterns such as token distributions, statement lengths, and discriminative conditions that are disproportionately associated with a particular class of a dataset can superficially inflate model performance (Herlihy and Rudinger, 2021). These statistical artefacts are typically introduced during annotation (Gururangan et al., 2018) and allow statement-only classifiers to achieve significantly above-random results (Poliak et al., 2018; Tsuchiya, 2018).
We support the statement-only baseline with a qualitative analysis of the dataset. We found no significant difference between the sentence length distributions of the two classes. Additionally, we find a 100% overlap in the 15 most common tokens between the two classes in the training set, and a 93% overlap in the test set. However, several individual tokens were not evenly distributed across classes in the test set, in particular "not", "more", "group", and "cohort". These uneven distributions were not present, if not completely reversed, in the training set, and therefore do not entirely explain the above-random accuracy observed.
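The token-overlap check can be sketched as follows. The overlap definition here (Jaccard over the two top-15 sets) is one plausible choice, not necessarily the exact measure used above, and the statements are toy examples:

```python
# Sketch of the artefact check: overlap between the most common statement
# tokens of the two classes, computed as Jaccard similarity of the top-k sets.
from collections import Counter

def top_token_overlap(class_a_statements, class_b_statements, k=15):
    def top_k(statements):
        counts = Counter(tok for s in statements for tok in s.lower().split())
        return {tok for tok, _ in counts.most_common(k)}
    a, b = top_k(class_a_statements), top_k(class_b_statements)
    return len(a & b) / len(a | b)    # 1.0 = identical top-k vocabularies
```

A high overlap suggests the classes cannot be separated by common-token frequency alone, which is the point of the analysis above.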

Categorical Analysis
Numerical Reasoning We categorize the instances in our dataset into those that require numerical or quantitative reasoning and those that require pure NLI. We separate out all instances with statements that contain numbers or any of the following tokens: higher, lower, fewer, more, less, number, same, into the Numerical Reasoning category. This results in a 62% Numerical, 38% NLI split of the test set. We evaluate the best-performing checkpoints on these categories and report the results in Figure 4. BioBERT reported the highest F1 score for both the Numerical and NLI categories, at 0.552 and 0.660 respectively. The performance on the NLI instances is significantly higher than the performance on the Numerical instances for all models other than BM25 and GPT2, which only marginally exhibit the same trend. Biomedical pre-training does not appear to provide a significant advantage within either category. Overall these results follow the trends observed in prior works: numerical reasoning remains a challenging task for NLI models (Ravichander et al., 2019; Galashov et al., 2019; Peng et al., 2021; Patel et al., 2021).
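The categorization heuristic above can be sketched directly; the example statements are invented:

```python
# Sketch of the Numerical-vs-NLI split: statements containing digits or any of
# the listed comparison tokens go to the Numerical category, the rest to NLI.
NUMERIC_TOKENS = {"higher", "lower", "fewer", "more", "less", "number", "same"}

def categorize(statement):
    tokens = set(statement.lower().split())
    if any(ch.isdigit() for ch in statement) or NUMERIC_TOKENS & tokens:
        return "Numerical"
    return "NLI"

print(categorize("Cohort one had more cardiac events than cohort two"))  # Numerical
print(categorize("The primary trial includes a placebo arm"))            # NLI
```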
Single vs Comparison We repeat this experiment, this time categorizing by instance type. Comparison type instances have 2 CTR sections as a premise, effectively doubling the amount of irrelevant and potentially adversarial information. Therefore, we would expect comparison instances to be more challenging than the single type. We report the results of this experiment in Figure 5. Contrary to expectations, for the majority of the models there was no significant difference in performance across the categories. BioMegatron reports +0.1 F1 on single type instances, while BM25 and T5 report +0.09 and +0.07 F1 respectively on the comparison type. As T5 failed to learn Task 1, even on the training set, and BM25 is not able to capture clues that go beyond lexical overlap, we hypothesize that this is in part due to the models relying on superficial clues to predict the output label. Therefore, in many cases, the evidence is simply not being leveraged properly.
CTR Sections Finally, we categorize the instances in our dataset by CTR section and report the baselines on these categories in Table 4. All models show significant fluctuations in performance across the different sections. Interestingly, Results and Adverse Events are on opposite ends of the rankings, despite arguably containing the most similar types of information. Additionally, Eligibility is ranked 2nd despite having the highest average number of tokens.

Task 2
For Task 2, the evidence selection task, we generate a baseline using SOTA general-purpose transformer-based ranking models (Reimers, 2020), namely DistilRoBERTa (Sanh et al., 2019), MPNet (Song et al., 2020), and MiniLM (Wang et al., 2020) variants. We additionally test BM25. We embed the facts in the CTR sections and the statements using these pre-trained models, and then compute the cosine similarity. We evaluate using mAP@K (Liu, 2011), with K = the total number of sentences. We report the results in Table 5. There is no significant difference between the mAP scores of the models, with BM25 producing the highest mAP score. All models achieve well above the random baseline. This indicates that the limiting factor for SOTA models on NLI4CT is not evidence selection, but rather biomedical and numerical inference.
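The embed-score-rank pipeline can be sketched with bag-of-words count vectors standing in for the pre-trained sentence encoders; the real baselines use dense embeddings, and the statement and facts below are invented:

```python
# Bag-of-words stand-in for the embedding-based evidence ranking: embed each
# fact and the statement, score by cosine similarity, then rank the facts.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_facts(statement, facts):
    """Return fact indices ordered from most to least similar."""
    s_vec = Counter(statement.lower().split())
    scored = [(cosine(s_vec, Counter(f.lower().split())), i)
              for i, f in enumerate(facts)]
    return [i for _, i in sorted(scored, reverse=True)]

ranking = rank_facts(
    "female patients aged 18",
    ["female patients aged 18 or older", "history of cardiac disease"],
)
print(ranking)  # [0, 1]
```

Feeding such a ranking into the average-precision metric (with K = the number of facts) yields the mAP@K evaluation used for this task.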

Conclusions & Future work
We present 2 tasks: textual entailment of claims over clinical trial reports (CTRs), and extracting evidence from CTRs to support or contradict these claims. We provide a corpus of 2,400 expert-annotated instances to test and train models on these tasks. If models could be developed to solve these tasks, it would progress the field towards the efficient retrieval and synthesis of published literature to support evidence-based medicine. Additionally, it would address several important NLU challenges related to numerical reasoning, biomedical NLI, and inference over long texts.
Our baselines outline the weaknesses of existing NLI models, particularly with regard to numerical inference. We tested 6 SOTA NLI models on Task 1 and reported a maximum F1 score of 0.644. Additionally, we show that the limiting factor for current models is not evidence selection, but rather leveraging that evidence and using numerical and biomedical inference for predictions. Our corpus, competition leaderboard, website, and code to replicate the baseline experiments are available at: anonymous-link.

Limitations
The most significant limitation of this work is that, despite the motivating medical scenario, these models are not fit for medical application; this would require a clinical trial of the technology and regulatory assessment.
Additionally, our evaluation does not include an interventional study, meaning we do not perform perturbations/interventions on the test instances to verify that the models learn the underlying causal structure of the tasks.We plan to do this in future work.
Finally, our dataset contains fewer instances than other published biomedical NLI datasets. This is a consequence of the time- and resource-intensive nature of expert-level annotation. The size of the training set may be problematic for large models.

Figure 1 :
Figure 1: We propose two tasks for reasoning on clinical trial data expressed in natural language. Firstly, to predict the entailment of a Statement and a CTR premise, and secondly, to extract evidence to support the label.

Figure 2 :
Figure 2: Schematic of the NLI4CT annotation process.

Figure 3 :
Figure 3: Schematic of a numerical inference chain to solve an NLI4CT instance.

Figure 4 :
Figure 4: Graph of the baseline model F1 scores on the NLI vs Numerical instances in the test set.

Figure 5 :
Figure 5: Graph of the baseline model F1 scores on the Single vs Comparison instances in the test set.

Table 1 :
Table of Clinical NLP datasets.

Table 2 :
Sample excerpts from each section in a CTR. Data provided by ClinicalTrials.gov.

Table 3 :
Results from the NLI4CT Task 1 baseline on the test set.

Table 4 :
F1 score of the baseline models on the test set, categorized by CTR section.

Table 5 :
Results from the NLI4CT Task 2 baseline on the test set.