Less Likely Brainstorming: Using Language Models to Generate Alternative Hypotheses

A human decision-maker benefits the most from an AI assistant that corrects for their biases. For problems such as generating interpretations of a radiology report given findings, a system that predicts only highly likely outcomes may be less useful, since such outcomes are already obvious to the user. To alleviate biases in human decision-making, it is worth considering a broad differential diagnosis that goes beyond the most likely options. We introduce a new task, "less likely brainstorming," that asks a model to generate outputs that humans think are relevant but less likely to happen. We explore the task in two settings: a brain MRI interpretation generation setting and an everyday commonsense reasoning setting. We find that a baseline approach of training with less likely hypotheses as targets generates outputs that humans evaluate as either likely or irrelevant nearly half of the time; standard MLE training is not effective. To tackle this problem, we propose a controlled text generation method that uses a novel contrastive learning strategy to encourage models to differentiate between generating likely and less likely outputs according to humans. We compare our method with several state-of-the-art controlled text generation models via automatic and human evaluations and show that our models' capability of generating less likely outputs is improved.


Introduction
Cognitive errors occur when an abnormality is identified, but its importance is incorrectly understood, resulting in an incorrect final diagnosis (Onder et al., 2021; Bruno et al., 2015). For example, radiologists may look for confirmatory evidence to support a diagnostic hypothesis and ignore or discount evidence that refutes the hypothesis (confirmation bias; Busby et al. (2018); Onder et al. (2021)).¹

Figure 1: Examples from MRIINTERPRET and E-CARE datasets. The task is to generate interpretations or hypotheses that humans would consider to be "less likely" to happen but still relevant to the context. "+" and "∼" represent likely and less likely outputs, respectively.

One way to reduce the likelihood of such cognitive errors is to provide cognitive "help" by having a devil's advocate (Seah et al., 2021; Waite et al., 2017). For this purpose, we propose a new text generation task called "less likely brainstorming" to produce less likely but relevant consultations to bring fresh eyes to examine a case, a powerful way to correct diagnostic errors.

¹ Code is available at https://github.com/Liyan06/Brainstorm.
Here, we consider less likely hypotheses in two scenarios. First, they can be hypotheses that humans think are likely but not among the most likely to happen. These hypotheses are critical to providing a second opinion on a prior clinical study but are often difficult to generate with traditional decoding techniques. Second, they can be hypotheses that are indeed impossible according to humans, but are close to being true if certain counterfactual assumptions about the input hold. These hypotheses are also helpful, as they are often ignored by clinicians: there is a tendency for clinicians to look for a confirmatory diagnostic hypothesis but ignore a refutable one. Note that a less likely hypothesis reflects the likelihood of a potential diagnosis from the human perspective, not the probability of the model's output.
We propose BRAINSTORM, a novel contrastive learning strategy to generate "less likely" hypotheses. We treat this problem as a text generation task, as text generation models are the most flexible for providing predictions and explanations for complex tasks; they can generalize to new examples and produce complex, structured diagnoses in many formats. Generation of the "less likely" hypotheses is conditioned on an indicator variable set to trigger the model to prefer outputs that are less likely according to humans. For this purpose, we propose two additional loss objectives to effectively learn the relationship between the input context, the indicator, and the outputs. Without our training strategy, using naive controlled generation training, we find that conditioning on the indicator often leads to generating "highly likely" or irrelevant outputs.
We explore this task in two settings: everyday commonsense reasoning and brain magnetic resonance imaging (MRI) interpretation generation (more details in Section 5). In the everyday commonsense reasoning setting, we adapt ART (Bhagavatula et al., 2020) and E-CARE (Du et al., 2022), which both contain "less plausible" or "implausible" hypotheses that fit our definition of less likely. An illustrative example asking for less likely hypotheses can be found in Figure 1. We show that our approach can generate more "less likely" hypotheses than baselines, including models directly fine-tuned on this set, past controllable generation approaches (Lu et al., 2022), and models with alternate decoding (Li et al., 2022; Liu et al., 2021). In the brain MRI interpretation setting, we experiment with predicting diagnoses from brain MRI reports (see Figure 1). Assessment by a neurologist reveals that our model successfully shifts the distribution of generated diagnoses further toward the tail while still generating relevant diagnoses.

Related Work
Uncertainty in Radiology Interpretation  Uncertainty plays a significant role in the process of clinical decision making (Croskerry, 2013). When facing uncertainty, physicians may resort to various erroneous strategies, such as denying the presence of uncertainty, resulting in various interpretation biases. These biases can lead to unexpected consequences (Kim and Lee, 2018; Eddy, 1984), including missed diagnoses, misdiagnoses, unnecessary diagnostic examinations, and even life-threatening situations (Farnan et al., 2008). Recent work (Seah et al., 2021; Waite et al., 2017) has provided deep-learning-based methods and suggestions for reducing errors from interpretation bias in medical imaging. To the best of our knowledge, we are the first to explore reducing bias in interpreting radiology reports via our less likely text generation framework.

Controllable Text Generation and Decoding Methods  Controllable text generation is the task of generating text that adheres to certain attributes, such as language detoxification (Zhang and Song, 2022; Liu et al., 2021; Dathathri et al., 2020), formality modification (Mireshghallah et al., 2022; Yang and Klein, 2021), and open-ended story generation (Mori et al., 2022; Lin and Riedl, 2021; Fan et al., 2018). The task encompasses both training-time and decoding-time methods. Training-time approaches include CTRL (Keskar et al., 2019), which learns to utilize control codes to govern attributes in order to generate the desired text, and QUARK (Lu et al., 2022), which leverages a strong attribute classifier as a reward function to unlearn unwanted attributes. These methods typically rely on training data that contains both the desired and undesired attributes to be effective in the supervised setting. Our method falls into this category.
On the other hand, decoding-time methods utilize off-the-shelf pre-trained LMs (PLMs) and aim to re-rank the probability of generated text based on specific constraints. PPLM (Dathathri et al., 2020) and FUDGE (Yang and Klein, 2021) are typical methods in this category that train an attribute classifier to guide PLMs toward generating desired text. DEXPERTS (Liu et al., 2021) and Contrastive Decoding (Li et al., 2022) are more recent methods that re-weight generation probabilities by contrasting the output distributions of different LMs. We select these two as strong baselines for comparison against our proposed model.
Contrastive Learning in NLP  Contrastive learning (CL) has been applied to a wide range of representation learning tasks in NLP, such as learning task-agnostic sentence representations (Gao et al., 2021) and improving natural language understanding (Jaiswal et al., 2021; Qu et al., 2021). It has recently been applied to text generation tasks as well (An et al., 2022; Cao and Wang, 2021; Lee et al., 2021), where additional hard positive or negative examples are created through techniques such as back-translation or perturbation.

Problem Setting
The problem we tackle in this work can be viewed as a controllable text generation task. Let x be a premise or the findings of a brain MRI report; we want a model to generate a likely or less likely hypothesis or interpretation y given an indicator i, by drawing from the distribution P(y | x, i). The indicator i can take two values: + to indicate generating likely outputs and ∼ to indicate generating less likely outputs.
For example, given the premise x = "Tom goes to the gym every day." in Figure 1 from the E-CARE dataset (more details in Section 5), we want a model to generate a hypothesis y∼ that is less likely to happen (i = ∼) after x, such as "He gets a promotion from his manager who saw him in the gym." Although this hypothesis fits into the same scenario as the premise, as it directly connects to Tom's daily gym attendance, it is less likely to happen since the causal relationship between going to the gym and receiving a promotion is not common. The understanding of what is "less likely" can be based on the concept of bounded rationality (Simon, 1955), where likely hypotheses are those that are likely given known premises, but less likely hypotheses may stem from additional unknown premises.
It is important to note that when we refer to an output as "likely" or "less likely," we mean that it is likely or less likely based on human understanding of x. All models we experiment with in this work generate outputs that have high probability according to the model, regardless of whether they are likely or less likely to happen according to humans.
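As a concrete illustration of conditioning on the indicator, the pair (x, i) can be serialized into a single encoder input string. This is a hypothetical sketch: the "+"/"∼" markers follow Figure 1, but the exact control-token format used in our implementation is an assumption here.

```python
# Hypothetical sketch: serialize a (premise, indicator) pair into one encoder
# input string. The bracketed indicator token format is an assumption, not
# the actual preprocessing used in the paper.
LIKELY, LESS_LIKELY = "+", "~"

def build_input(x: str, i: str) -> str:
    """Prefix the premise x with the indicator i for a seq2seq encoder."""
    assert i in (LIKELY, LESS_LIKELY), "indicator must be + or ~"
    return f"[{i}] {x}"

print(build_input("Tom goes to the gym every day.", LESS_LIKELY))
# → [~] Tom goes to the gym every day.
```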

Methodology
In this section, we present our method as well as the baseline models we compare against. Requirements for these models can be found in Table 1. We use BART (Lewis et al., 2020) as the backbone LM for all experimental settings.

BRAINSTORM
Our encoder-decoder system takes the concatenation of a pair (x, i) as input and returns one or multiple generated output sequences y. At decoding time step t, our model iteratively decodes the next token conditioned on the left-hand context y<t:

y_t ∼ P_LM(y_t | x, i, y<t),    (1)

where P_LM(y_t | x, i, y<t) is the next-token distribution given the context. The task inputs are described in Section 5.
Besides the standard maximum likelihood training with the human reference, we incorporate two additional loss objectives to guide models to associate the context, indicators, and target sequences. The training approach is illustrated in Figure 2.
Margin Loss  First, given the indicator i, we want the model to assign a higher estimated probability to the human reference y than it does under the opposite indicator ¬i. Therefore, we apply a margin-based loss:

L_margin = max(0, m − log P_LM(y | x, i) + log P_LM(y | x, ¬i)),    (2)

where m is the margin value. This loss objective tells models that if the indicator is flipped, then the target sequence should have lower probability. The margin loss does not require both likely and less likely outputs y+ and y∼.
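The margin objective can be sketched numerically as a hinge over sequence log-probabilities. This is a minimal illustration, assuming the hinge form described above; the log-probabilities here are placeholder numbers, not model outputs.

```python
def margin_loss(logp_match: float, logp_mismatch: float, m: float = 1.0) -> float:
    """Hinge-style margin loss: the reference y should be more probable under
    its own indicator i than under the flipped indicator ¬i, by at least m.
    Inputs are sequence log-probabilities."""
    return max(0.0, m - (logp_match - logp_mismatch))

# When the matched indicator already wins by more than the margin, loss is 0.
print(margin_loss(-5.0, -9.0))  # → 0.0
# Otherwise the model is penalized in proportion to the shortfall.
print(margin_loss(-5.0, -5.5))  # → 0.5
```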

Similarity Loss
We propose two versions of a contrastive similarity loss based on the availability of examples that can be used in CL. When both positive and negative examples are available in the same batch, we define the similarity loss as

L_sim = −log [ exp(cos(z_{x,i}, z_y)) / Σ_{ŷ ∈ batch} exp(cos(z_{x,i}, z_ŷ)) ],    (3)

where z_{x,i}, z_y, and z_ŷ represent the hidden representations of the input (x, i), the human reference y, and an output ŷ in the same batch. L_sim encourages the model to maximize the agreement between z_{x,i} and its corresponding output z_y. This loss objective encourages a model to learn the relation between certain indicators and the target sequence by contrasting the target sequence with all negative outputs in the batch.
This objective term resembles that of CoNT (An et al., 2022), which takes self-generated outputs as negative samples; here, we condition the input on special indicators. Note that at training time, the indicator i can be either + or ∼. When the indicator i = +, the hard negative is the human reference y∼, and vice versa. We set the weight of the term in Equation (3) associated with the hard negative to 10 throughout the experiments to increase its importance relative to in-batch negatives.

Figure 2: Illustration of the training approach, combining L_margin and L_sim. The L_sim objective is highlighted in red where it requires both likely and less likely data.
When positive and negative examples are not available at the same time (denoted by a lack of a "pair" check in Table 1), we propose an alternative similarity loss objective L′_sim that minimizes the similarity of the encoder representations z_{x,i} and z_{x,¬i}, without comparing to outputs in the batch:

L′_sim = cos(z_{x,i}, z_{x,¬i}).

We use cosine similarity for both versions.
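Both similarity-loss variants can be sketched over toy vectors. This is a rough, speculative sketch: the InfoNCE-style contrastive form and the hard-negative weight of 10 follow the description above, but the exact normalization and temperature used in our implementation may differ.

```python
import math

def cos(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sim_loss(z_xi, z_pos, z_negs, hard_neg_weight=10.0, hard_idx=None):
    """In-batch contrastive similarity loss (InfoNCE-style sketch): pull the
    representation of (x, i) toward its reference y and away from the other
    outputs in the batch. The hard negative (the reference under the flipped
    indicator) receives a larger weight."""
    pos = math.exp(cos(z_xi, z_pos))
    denom = pos
    for j, z in enumerate(z_negs):
        w = hard_neg_weight if j == hard_idx else 1.0
        denom += w * math.exp(cos(z_xi, z))
    return -math.log(pos / denom)

def sim_loss_unpaired(z_xi, z_x_noti):
    """Unpaired variant: directly minimize the similarity between the
    encodings of (x, i) and (x, ¬i)."""
    return cos(z_xi, z_x_noti)
```

Minimizing `sim_loss_unpaired` pushes the two indicator-conditioned encodings of the same input apart, which is what allows training when only one of y+ or y∼ exists for a given example.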
Final Loss  The overall training objective of BRAINSTORM is the combination of the standard maximum likelihood estimation (MLE) loss L_MLE, the margin loss, and the similarity loss:

L = L_MLE + w_m · L_margin + w_s · L_sim,

where w_s and w_m are hyperparameters. BRAINSTORM′ denotes the variant that uses L′_sim in place of L_sim.

Training-Time Baselines

QUARK (Lu et al., 2022) is a state-of-the-art controllable text generation method that outperforms methods such as unlikelihood training (Welleck et al., 2020). QUARK trains an LM to generate text with fewer undesirable properties by maximizing rewards assigned by a reward function. In this study, we use the DeBERTa model (He et al., 2020) as the reward function to help generate more y∼ (more details in Section 6).

Decoding-Time Baselines
Modified DEXPERTS  DEXPERTS (Liu et al., 2021) combines a base LM M with two language models, an "expert" (M_exp) and an "anti-expert" (M_anti), that model text with desired and undesired properties, respectively. The next token distribution is determined by

P(y_t | ·) ∝ softmax(z′_t + α(z^exp_t − z^anti_t)),

where z_t is the logits for the next token y_t and z′_t is the truncated logits from M under any truncation sampling method, such as top-k sampling. For simplicity, we omit the preceding context in the notation. The hyperparameter α controls how far the final token distribution deviates from the model M.
In our setting, we modify this definition so that the expert logits z^∼_t come from a base LM that generates y∼ only, while the anti-expert models y+; the truncation preserves the fluency of the generated text (Liu et al., 2021). The base LM producing z^∼_t can be MLE-LL or BRAINSTORM.
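One decoding step of this logit reweighting can be sketched as follows. This is an illustrative sketch assuming top-k truncation; the function and variable names are ours, not the paper's, and real implementations operate on full vocabulary tensors rather than Python lists.

```python
# Illustrative DEXPERTS-style logit combination for a single decoding step
# over a toy vocabulary: z'_t + alpha * (z_expert - z_anti), restricted to
# the base model's top-k tokens (an assumption standing in for truncation
# sampling).
def dexperts_next_logits(z_base, z_expert, z_anti, alpha, k=3):
    """Combine truncated base logits with an expert/anti-expert contrast."""
    topk = sorted(range(len(z_base)), key=lambda i: z_base[i], reverse=True)[:k]
    out = [float("-inf")] * len(z_base)  # tokens outside top-k are masked out
    for i in topk:
        out[i] = z_base[i] + alpha * (z_expert[i] - z_anti[i])
    return out

# A token the expert prefers (index 1) is boosted; one the anti-expert
# prefers (index 0) is suppressed.
z = dexperts_next_logits([2.0, 1.0, 0.5, 0.1], [0.0, 3.0, 0.0, 0.0],
                         [1.0, 0.0, 0.0, 0.0], alpha=0.5, k=2)
print(z)  # → [1.5, 2.5, -inf, -inf]
```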
Modified Contrastive Decoding  Contrastive Decoding (CD) combines a larger expert model M_exp and a smaller "amateur" model (M_ama) and searches for text in a constrained search space (Li et al., 2022). The resulting outputs are intended to amplify the strengths of M_exp and remove undesired properties that appear in M_ama. A scaling factor τ_CD controls the penalty of the amateur model in CD.
In our setting, the two models are the same size. M_ama learns to generate y+; M_exp can be MLE-LL or BRAINSTORM. Intuitively, the ability to generate y∼ is preserved, while the tendency to generate y+ is factored out.
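A rough sketch of CD's per-token scoring, simplified to a single step over a toy vocabulary: tokens implausible under the expert are pruned, and the remaining tokens are scored by the expert/amateur contrast. The exact role of τ_CD in the paper's modified CD may differ; here it simply scales the amateur penalty, which is our assumption.

```python
import math

def cd_scores(p_exp, p_ama, tau=1.0, alpha=0.1):
    """Contrastive-decoding-style token scores for one step (sketch):
    log p_exp - tau * log p_ama, restricted to tokens whose expert probability
    is within a factor alpha of the expert's best token (the plausibility
    constraint of Li et al., 2022)."""
    cutoff = alpha * max(p_exp)
    return [
        math.log(pe) - tau * math.log(pa) if pe >= cutoff else float("-inf")
        for pe, pa in zip(p_exp, p_ama)
    ]

# Tokens the expert finds implausible are masked; among the rest, tokens the
# amateur (likely-output model) also favors are penalized.
scores = cd_scores([0.7, 0.2, 0.1], [0.6, 0.2, 0.2], tau=1.0, alpha=0.5)
```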
Hyperparameters  We experiment with a wide range of values for α in DEXPERTS and τ_CD in CD and show how the fraction of less likely outputs changes across these values in Figure 3. We keep the recommended values for the remaining hyperparameters. Unless specified otherwise, we generate outputs using diverse beam search (Vijayakumar et al., 2016).

Experimental Settings
We investigate our methods in both the brain MRI setting and the everyday commonsense reasoning setting (Table 5).

Everyday Commonsense Reasoning
Two datasets from the commonsense reasoning domain were adapted. See examples in Figure 4 in the Appendix.
ART (Abductive Reasoning in narrative Text; Bhagavatula et al. (2020)) is a large-scale benchmark dataset that tests models' language-based abductive reasoning skills over narrative contexts. Each instance in the dataset consists of two observations O1 and O2 (O1 happened before O2), as well as a likely and a less likely hypothesis event (happening between O1 and O2) collected from crowd workers. Each "likely" hypothesis is causally related to the two observations, and each "less likely" hypothesis is created by editing a "likely" hypothesis. The original task is to generate a likely hypothesis given the observation pair (O1, O2).
E-CARE (Explainable CAusal REasoning; Du et al. (2022)) tests models' causal reasoning skills. Each instance in the dataset consists of a premise, a "likely" and a "less likely" hypothesis, and a conceptual explanation of the causality. The likely hypothesis can form a valid causal fact with the premise. Two tasks are introduced: (1) causal reasoning: choosing the "likely" hypothesis given a premise, and (2) explanation generation: generating an explanation for the causal fact.
Adapted Setting  In our adapted setting, we want a model F to generate y∼ given either an observation pair (ART) or a premise (E-CARE) x. Formally, let E be a binary evaluator E(x, y) ∈ {1, 0} that classifies an output y as either y+ or y∼ based on x. We want a model F that generates ŷ = F(x, i = ∼) such that E(x, ŷ) = 0.
Evaluation  For ART, we use the default training, validation, and test sets to evaluate our models. For E-CARE, we randomly construct training and validation sets from the original training set and use the default validation set as the test set, since the original test set is not available. All hyperparameters are determined on the validation set.
For each instance x in the test set, we ask a model F to generate ŷ = F(x, i = ∼), then measure the fraction of less likely hypotheses according to an evaluator E.
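This metric can be sketched directly from the definition of E. The toy evaluator below is a hypothetical stand-in for the fine-tuned DeBERTa classifier described in Section 6.

```python
def less_likely_fraction(examples, evaluator):
    """Fraction of generated hypotheses that the evaluator E labels as less
    likely. Following the adapted setting, E(x, y) returns 1 for a likely
    hypothesis and 0 for a less likely one."""
    labels = [evaluator(x, y) for x, y in examples]
    return sum(1 for e in labels if e == 0) / len(labels)

# Toy stand-in evaluator (the real E is a fine-tuned classifier).
toy_eval = lambda x, y: 1 if y.endswith("(likely)") else 0
pairs = [("p1", "h1 (likely)"), ("p2", "h2"), ("p3", "h3")]
print(less_likely_fraction(pairs, toy_eval))  # 2 of 3 judged less likely
```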
To reduce ambiguity and encourage more consistent human evaluations, we formally define all relevancy categories through rounds of pilot studies. More detailed definitions and annotation instructions can be found in Appendices B and C. We measure both the (1) relevancy and (2) fluency of generated hypotheses in the human evaluation.

MRIINTERPRET
We present a new dataset, MRIINTERPRET, based on the findings and impression sections of a set of de-identified radiology reports we collected from brain MRIs. Each instance consists of findings x, an indicator i, and a likely or less likely interpretation y of the findings x, depending on i.

Dataset Construction
We first find phrases such as "likely represents," "consistent with," and "may be unrelated to" that express uncertainty in each sentence of the reports. We view these phrases as indicators of the presence of interpretations and denote them by s+ or s∼. A likely or less likely indicator (Appendix F) suggests a likely or less likely interpretation of a finding. For each likely indicator s+, we treat the sub-sentence preceding s+, concatenated with the prior 6 sentences, as the findings x, and the completion of the sentence following s+ as the likely interpretation y+ of the findings x. We include prior sentences to provide more context for reaching interpretations. For less likely indicators s∼, we treat the sub-sentence either following or preceding s∼ as the less likely interpretation of the findings, depending on how s∼ is stated. An example can be found in Figure 4.
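The rule-based extraction can be sketched as a phrase search over each report sentence. This is a toy sketch: the indicator lists below are small illustrative samples drawn from the phrases mentioned above, not the full sets used to build MRIINTERPRET, and it handles only indicators followed by their interpretation.

```python
import re

# Illustrative samples of indicator phrases; the full sets used for the
# dataset (and their unification) are in Appendix F of the paper.
LIKELY_INDICATORS = ["likely represents", "consistent with"]
LESS_LIKELY_INDICATORS = ["less likely to be", "may be unrelated to"]

def extract(sentence: str):
    """Split a report sentence at the first matching indicator phrase into
    (findings fragment, interpretation, label), where label is "+" or "~".
    Returns None if no indicator is found."""
    for label, phrases in (("+", LIKELY_INDICATORS), ("~", LESS_LIKELY_INDICATORS)):
        for p in phrases:
            m = re.search(re.escape(p), sentence, flags=re.IGNORECASE)
            if m:
                findings = sentence[:m.start()].strip()
                interpretation = sentence[m.end():].strip(" .")
                return findings, interpretation, label
    return None

print(extract("Signal abnormality likely represents edema."))
# → ('Signal abnormality', 'edema', '+')
```

In the real pipeline, the preceding 6 sentences would also be concatenated into the findings side to give more context, as described above.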

Indicator Unification
We collected a variety of indicators and unified them into a minimal set of likely and less likely indicators. More details of indicator unification can be found in Appendix F.
Evaluation  To ensure that the human evaluation for MRIINTERPRET is as reliable as possible, we carefully curate a thorough annotation instruction guideline with precise definitions for all relevancy labels in Section 7 and Appendix E.
Evaluation on Commonsense Reasoning

Automatic Evaluation
Our first evaluation relies on automatically assessing whether system outputs are likely or less likely according to humans. We fine-tune DeBERTa models (He et al., 2020) for automatic evaluation on the two everyday commonsense datasets. They take the pair (x, y) as input and predict whether y is a likely or less likely hypothesis. In our settings, the fine-tuned DeBERTa model achieves 85% accuracy on the test set of ART and 80% on the original validation set of E-CARE.

Table 2 compares a number of methods on our commonsense reasoning datasets. We answer several questions based on these results. We perform a paired bootstrap test for each result by comparing to MLE-LL and highlight results that are better at the 0.05 level of significance.
Can we just train on (x, y∼)?  Interestingly, the baseline model MLE-LL, trained only on (x, y∼) pairs, generates "likely" hypotheses approximately half of the time. This is possibly an effect of the pre-training regimen; furthermore, generating likely hypotheses may be easier, and past work has shown that seq2seq models can amplify behaviors, like copying, that are easy to learn (Goyal et al., 2022).
Are the proposed two loss objectives effective?  We see that compared to MLE-LL, our proposed BRAINSTORM method achieves substantially higher fractions of less likely hypotheses with no cost to quality in terms of perplexity. At the bottom of Table 2, we show that ablating either of the proposed loss objectives worsens performance (and note that ablating both yields MLE). BRAINSTORM′ is not as effective since it does not compare with outputs in the batch, but we can see its merits on MRIINTERPRET (Section 7).
Can decoding-time methods alleviate the problem of generating likely outputs?  We explore whether DEXPERTS and CD can further raise the fraction of less likely generations when combined with either MLE-LL or BRAINSTORM. These methods have hyperparameters that trade off how much of the "undesired" behavior each can remove from the system. We compute several fraction-perplexity trade-off curves in Figure 3. Notably, although the fraction of less likely outputs can improve, both of these methods significantly increase the perplexity of generations, which corresponds to notably worse fluency of the text. Although these points apparently have high less likely fractions, we caution that the distribution of the text may deviate from the text that DeBERTa was fine-tuned on, meaning that our classifiers may not work well in these ranges. The green lines reflect thresholds where we observe serious degradation in output quality starting to occur. Below this perplexity threshold, the automatic evaluation suggests that both methods demonstrate some capability in alleviating the models' tendency to generate "likely" hypotheses without too great a cost to perplexity. Note that DEXPERTS is more effective than CD on ART and vice versa on E-CARE. Table 2 reports the settings where models achieve the minimum perplexities; at these points, perplexity is substantially increased but the fraction of less likely hypotheses is not substantially changed for the majority of results.

Table 3: Human evaluations on ART and E-CARE. We see that our method is able to produce more "less likely" (L-Likely) outputs on both datasets. We calculated the mean of the ratings from multiple annotators for each sample.
Can QUARK yield improvement?  In Table 2, the automatic evaluation results show that QUARK exceeds BRAINSTORM by generating 6% more "less likely" hypotheses on ART and 10% more on E-CARE. It also has lower perplexity on ART. To further compare the two models, we conducted a human evaluation of their outputs, and the results show that QUARK generates lower-quality "less likely" hypotheses (Section 6.2).

Human Evaluation
To further validate the results, we conduct a finer-grained human evaluation on a sample of 100 examples from the test sets of both datasets along two axes: relevancy and fluency. We refined our relevancy evaluation by dividing the "relevancy" category into four subcategories, resulting in a total of five categories for evaluation: (1) Likely; (2) Less likely; (3) Contradictory - the output is impossible if we assume the input is true; (4) Repetition - the output describes the same meaning as the input; and (5) Irrelevant - the output has little connection with the input. More thorough category definitions with examples, annotation instructions, and quality checks for AMT annotators can be found in Appendix C. We compare the performance of three models: MLE-LL, BRAINSTORM, and QUARK (Table 3). As QUARK demonstrates better performance in automatic evaluation, we include its generated text in our human evaluation.
Our results show a high level of agreement between the automatic evaluation (Table 2) and the human evaluation (Table 3) regarding the fraction of "likely" hypotheses on both datasets. On ART, QUARK and BRAINSTORM decrease the fraction of "likely" hypotheses by 60% and 50%, respectively, compared to MLE-LL. However, on E-CARE, the human evaluation indicates that all three models generate an equivalent number of "likely" hypotheses. By further breaking down the "relevancy" category used in the automatic evaluation, we gain a clearer understanding of the distribution of categories among the models' outputs.
Low-Quality Hypotheses  It is not desirable for models to generate outputs that are repetitions of the input (Repetition) or have little connection to the input (Irrelevant). On the ART dataset, all models generate a small proportion of irrelevant outputs, with QUARK and BRAINSTORM reducing the fraction of "Repetition" hypotheses by half compared to MLE-LL. However, we get more low-quality outputs on E-CARE. While BRAINSTORM is able to reduce the fraction of Repetition hypotheses by a large margin, it is not as effective as QUARK. One possible reason is that QUARK is trained to generate outputs that the DeBERTa classifier (the reward model) predicts as less likely; Repetition cases are rarely classified as less likely due to their similarity with the input, but Irrelevant outputs are more likely to be classified this way.
Less Likely versus Contradictory  While less likely hypotheses are desirable, contradictory hypotheses are less so. A typical way of generating a contradictory hypothesis is by simply adding negation: "Lisa went laptop shopping yesterday" → "Lisa didn't go laptop shopping yesterday." However, such examples have little value, as the negation brings no new information to the input and is not a useful counterfactual for a user to see.
We evaluate the models' outputs on the ART dataset, where a significant number of contradictory hypotheses are generated, and find that 43 of 100 hypotheses generated by QUARK include the words "didn't" or "not," while only 10 hypotheses generated by BRAINSTORM and MLE-LL do so. We posit that this is likely due to the DeBERTa classifier assigning high rewards to hypotheses that include negation words, and QUARK effectively learning this shortcut.

Human Evaluation on MRIINTERPRET
To evaluate the models' performance in the radiological interpretation generation setting, we select 30 findings from our validation set that ask for less likely interpretations. For each finding, we take the human reference and generate the top 5 less likely interpretations from 2 baselines (MLE-LL and MLE) and BRAINSTORM′, resulting in 30 × (5 × 3 + 1) = 480 interpretations. We randomize the order of these interpretations before evaluation.
Due to the structure of the indicators in this dataset, methods that require examples to have both y+ and y∼ for the same data (see "pair" in Table 1) cannot be used. Since QUARK relies on a trained classifier, we choose not to use QUARK either: a classifier trained on MRIINTERPRET is not reliable, since the training set consists only of naturally occurring data, which is highly imbalanced (see Table 5 in the Appendix). This leads the classifier to perform poorly on the "less likely" class, which is the minority class but also the class of greatest interest in this study. We find that augmenting the training data with counterfactual cases is not easy. For example, "the lack of evidence of restricted diffusion makes it less likely to be" is a naturally occurring prompt from a less likely example; attempting to change it to a sentence such as "the lack of evidence of restricted diffusion could represent" yields a statement that is out of distribution from the training data, and models do not behave reliably in these counterfactual cases.
For each generated interpretation, we evaluate (1) its relevancy to the findings and (2) whether it contains any hallucinations about the findings (Appendix E.2). For relevancy, we asked a neurologist to classify each interpretation as: (1) Relevant and likely; (2) Relevant and less likely; or (3) Irrelevant. Further, for those classified as "Relevant and less likely," we evaluate how well the interpretation fits into the context of the findings by grading them on three levels (high, medium, and low), ranging from high matches that represent the most obvious less likely interpretations to low matches that represent relevant but exceedingly rare diagnoses. We provide detailed definitions for these levels in Appendix E; results are shown in Table 4. Most human references (which the neurologist was blinded to) are annotated as either a high or medium match under the relevant but less likely category, suggesting the reliability of the neurologist's annotation. We find that training on all data (MLE) instead of exclusively on less likely data (MLE-LL) effectively helps generate more relevant but less likely interpretations and reduces the number of irrelevant ones. One possible reason is that MRIINTERPRET is a highly imbalanced dataset (Table 5).
By comparing the outcomes of the human reference and BRAINSTORM, we find that BRAINSTORM tends to shift the distribution of generated interpretations toward lower-matched interpretations, which effectively extends the beam of potential diagnoses that meet the criterion of "relevant but less likely" based on refuting findings. Anecdotally, interpretations in this medium category reflect the sort of alternative hypotheses and "outside-the-box" suggestions that represent the original goal of our approach.

Conclusion
In this work, we propose a new text generation task, "less likely brainstorming," for reducing cognitive errors in interpreting the findings of MRI reports. We found that simply training on less likely data does not help with generating less likely interpretations, and hence we propose a novel CL method to tackle the problem. In two settings, we show that our proposed training technique can effectively generate more "less likely" hypotheses, producing interpretations that radiologists may not think of, outperforming past training-time and decoding-time modifications to generation models.

Limitations
Our brain MRI interpretations were evaluated by a single neurologist. Such annotations require deep expertise and are not easily carried out with high quality by trainees, which limited the amount of data we were able to collect. To ensure that the annotation would be as reliable as possible, we carefully considered the dimensions for evaluating the generated interpretations and proposed a thorough annotation instruction guideline. We believe that future work can conduct more extensive studies using our annotation guidelines as a starting point. Further, the radiology reports we experiment with are from a single academic medical center, which makes generalizability unclear. Future work is needed to evaluate the performance of our models on data from different medical centers. Finally, future work is needed to evaluate relevant and likely outputs from MRI interpretations to address different forms of interpretation bias and to expand the beam of potential likely diagnoses based on the findings.
Beyond the brain MRI interpretation experiments, our generation experiments are limited to a set of pre-trained models optimized for carrying out generation tasks in English. It is possible that multilingual models generating in languages other than English will show different properties. We are limited by the availability of resources for automatic evaluation in these settings, but a more extensive multilingual evaluation with human users could be conducted in the future.

Ethical Risks
We are proposing better ways of incorporating systems into the radiological diagnostic process. This is aimed at helping improve human decision-making and mitigating the limitations of traditional fully-automatic approaches. However, we believe it is imperative to rigorously test and evaluate these methods before they can be put into practical clinical settings. We are not claiming that these methods are ready for real-world adoption at this stage.

A Dataset statistics
Dataset statistics can be found in Table 5.

B Definition of Relevancy Categories on Everyday Commonsense
To encourage more consistent human evaluations, we formally define all relevancy categories as follows. These definitions were refined over rounds of pilot studies to reduce ambiguity for human annotators. Example outputs and explanations for each relevancy category can be found in the annotation interface (Figures 5 and 7).

B.1 E-CARE
Relevant A hypothesis is relevant if it fits with the same scenario as the premise. It should not introduce new people, places, or things that are not at least plausibly in the same source scenario.
Likely For the hypothesis to be likely, it must also be causally related to the premise: either the premise causes the hypothesis or the hypothesis causes the premise (you will see both versions of the task below). There should not be clearly more likely hypotheses than it.

Relevant and Less likely
The hypothesis is still the same scenario as the premise (relevant). However, it is less likely to be causally related to the premise. There could be other hypotheses that are superior to the given hypothesis.
Irrelevant The generated hypothesis does not describe the same scenario as the premise or is not causally related to the premise.

Contradictory
The hypothesis contradicts the premise: it says something that is impossible if we assume the premise to be true (e.g., the premise states that something happened and the hypothesis states that that thing did not happen).
Repetition The hypothesis is very similar to the premise: it either contains a text span that is a repetition of the premise, or it expresses nearly the same meaning as the premise.

B.2 ART
Relevant A hypothesis is relevant if it fits with the same scenario as the observation pair. It should not introduce new people, places, or things that are not at least plausibly in the same source scenario.
Likely For the hypothesis to be likely, it must also be strongly related to O1 and O2 in a causal fashion: to the extent possible, the first observation O1 should cause the hypothesis, and the hypothesis should cause the second observation O2. There should not be clearly more likely hypotheses than it.
Relevant and Less likely The hypothesis is still the same scenario as the observation pair (relevant). However, it is less likely to be causally related to the observation pair: maybe it could happen following O1, but not necessarily. There could be other hypotheses that are superior to the given hypothesis.

C Annotation on Everyday Commonsense
The human evaluation by crowdworkers has been judged to be IRB exempt. We hired crowd annotators from the US through Amazon Mechanical Turk. These annotators have lifetime approval rates over 99% and more than 1000 approved HITs. We first conducted a quality check on ART and E-CARE.
For each dataset, we randomly selected 100 examples from the test set, and each example was evaluated by 7 annotators, resulting in 100 × 7 = 700 annotations per dataset. We finally selected 7 qualified crowdworkers for each dataset. The pay is equivalent to $12/hr and is higher than the US minimum wage.
Category definitions and annotation instructions with examples are shown in Figures 5, 6, 7, and 8.
Selecting Qualified Workers After we collected all annotations from the pilot study, we filtered out workers using the following steps: 1. We first filter out workers who annotated fewer than 4 HITs. With such a limited number of annotated HITs, it is hard to evaluate the consistency of their annotations.
2. For any HIT, if two output sequences are exactly the same but the annotator assigned them different categories, we remove the worker. For example, in E-CARE, suppose the premise is "Tom goes to the gym every day." and the hypothesis "He gets a promotion from his manager who saw him in the gym." appears twice; if one copy is classified as "Relevant and Likely" and the other as "Relevant but Less Likely", we filter out this annotator.
3. We use the "Repetition" category to further filter out annotators. We believe "Repetition" is the least subjective category in our annotation instructions, and using this category to filter annotations projects minimal bias onto the selected annotators. This consists of two steps: (1) A model may generate an output that is exactly the input. For example, a model takes as input "Tom goes to the gym every day." and generates "Tom goes to the gym every day." as well. This happens occasionally across all models. For those cases, we filter out annotators that assigned categories other than "Repetition". (2) Besides exact matches, there are cases where a model's output is a paraphrase of the input. For these, to minimize our bias, we only use model outputs that differ from the input by at most two words to filter out annotators. For example, in ART, if one observation is "Lisa went laptop shopping yesterday" and the model's output is "She went laptop shopping yesterday", then we filter out annotators that did not assign "Repetition" to it.
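The near-duplicate screen in step (2) above can be sketched as follows. The function names and the simple whitespace tokenization are illustrative, not our exact implementation:

```python
def word_diff(a: str, b: str) -> int:
    """Count positionwise word differences between two sentences,
    plus any length mismatch (case- and final-period-insensitive)."""
    wa, wb = a.lower().rstrip(".").split(), b.lower().rstrip(".").split()
    diffs = sum(1 for x, y in zip(wa, wb) if x != y)
    return diffs + abs(len(wa) - len(wb))

def is_near_repetition(source: str, output: str, max_diff: int = 2) -> bool:
    """Treat an output as a repetition of the input if it matches exactly
    or differs by at most `max_diff` words; annotators who labeled such
    outputs with anything other than "Repetition" are screened out."""
    return word_diff(source, output) <= max_diff
```

Under this sketch, "She went laptop shopping yesterday" counts as a repetition of "Lisa went laptop shopping yesterday" (one word differs), while an unrelated hypothesis does not.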
After we collected all the annotations from the qualified workers, we used the above steps to further filter out workers who did not meet our standard. Finally, we obtained valid annotations from three annotators for each dataset. We use Fleiss' kappa to calculate the agreement between annotators. The annotators achieved moderate agreement (κ = 0.447) on ART and fair agreement (κ = 0.354) on E-CARE for relevancy evaluation. This is within our expectations, since evaluating whether a hypothesis is likely or less likely is subjective.
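The agreement statistic above can be computed directly from per-item category counts; a minimal self-contained sketch of Fleiss' kappa:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from per-item category counts:
    counts[i][k] = number of annotators who assigned category k to item i.
    Every item must be rated by the same number of annotators."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Mean observed per-item agreement.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from marginal category proportions.
    p_e = sum(
        (sum(row[k] for row in counts) / (n_items * n_raters)) ** 2
        for k in range(n_cats)
    )
    return (p_bar - p_e) / (1 - p_e)
```

For example, two items rated by three annotators with perfect agreement give κ = 1, while a 2-vs-1 split on each item gives a negative κ.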

D Fluency Evaluation on Everyday Commonsense Reasoning
Fluency evaluation can be found in Table 6. Most generations from the models are fluent and grammatically correct.

E Annotation on Brain MRI Interpretation
The use of the brain MRI data is covered by an IRB. A neurologist reviewed each finding sample and evaluated the interpretation on multiple metrics.

E.1 Relevancy
The overall objective of the interpretation generation was to produce less likely diagnoses, or interpretations, based on the absence of specific findings. The findings followed a common pattern of "Absence of [finding x] makes it unlikely to be [interpretation y]."

Relevant and Likely Output was judged as "relevant and likely" if the interpretation erroneously suggested a diagnosis that would be likely, not unlikely, despite the absence of [finding x]. For instance, consider "Absence of restricted diffusion within the previously described fluid collections along the right convexity makes it unlikely to be". An interpretation of "the presence of a small subdural hematoma" is actually a likely diagnosis given the lack of restricted diffusion in the fluid collection, since subdural hematomas do not normally demonstrate restricted diffusion.
Relevant but Less Likely Output was judged as "relevant but less likely" if the interpretation correctly provides a less likely diagnosis due to the absence of [finding x]. For example, consider "absence of restricted diffusion makes it unlikely to be". An interpretation of "acute ischemia" is indeed unlikely, since diffusion restriction is often associated with acute ischemia.
If the interpretation was judged as "relevant but less likely", the degree to which the interpretation fits the findings was graded on three levels: (1) high, (2) medium, and (3) low.
• Less likely interpretations were high matches if they were within the top 5 diagnoses to fit the statement.These were the most obvious interpretations.
• Less likely interpretations were medium matches if they were further down the list of potential interpretations. They were still relevant to the findings and made sense as being less likely given the absence of the finding of interest, but were less obvious and fell outside of the top 5 diagnoses.
• Less likely interpretations were low matches if the interpretation was relevant to the findings but was an exceedingly rare diagnosis, making it of low value to mention as an interpretation.
Irrelevant Output was judged as "irrelevant" if it was not related to the finding of interest or the structure that the finding of interest is referring to.

E.2 Presence of Hallucination
Lastly, regardless of the relevance rating, the presence or absence of hallucination was noted. It was possible to have a relevant but less likely interpretation with a high degree of fit with the finding, to which a hallucination that does not appear in the original findings was added. We therefore evaluate whether each interpretation contains hallucinations.
The results are shown in Table 7. The models listed contain a large proportion of hallucinated content, especially MLE and BRAINSTORM. We examined what these hallucinations look like. We found that in most cases, models hallucinate about the findings (generating findings that are not actually written in the report) and concatenate those hallucinated findings after their interpretations. For example, a generated interpretation might be "an acute infarction although this is limited by the presence of contrast enhancement", "intracranial abscess although this is limited by the presence of significant soft tissue swelling", or "blood products in the ventricular system as seen on prior CT." However, unlike other text generation tasks such as text summarization, where hallucinations are hard to identify, hallucinations in MRIINTERPRET follow a pattern of an interpretation followed by the non-existent findings. Although future work could investigate how to directly generate interpretations without hallucination, a rule-based heuristic can remove the majority of hallucinations in the current version of our system.
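Because the hallucinated findings are concatenated after the interpretation, such a rule-based heuristic can be sketched as follows. The connective patterns are illustrative, drawn only from the example hallucinations above, and are not an exhaustive list:

```python
import re

# Hypothetical connective phrases that introduce hallucinated findings
# appended after the interpretation itself (illustrative, not exhaustive).
TRAILING_FINDING_PATTERNS = [
    r"\s+although this is limited by .*$",
    r"\s+as seen on .*$",
]

def strip_hallucinated_findings(interpretation: str) -> str:
    """Remove trailing clauses that follow the observed
    'interpretation + non-existent finding' pattern."""
    for pat in TRAILING_FINDING_PATTERNS:
        interpretation = re.sub(pat, "", interpretation, flags=re.IGNORECASE)
    return interpretation.strip()
```

Applied to the examples above, this would reduce "an acute infarction although this is limited by the presence of contrast enhancement" to "an acute infarction".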

Figure 2 :
Figure 2: An overview of BRAINSTORM using an example from E-CARE, which consists of three objectives. z_{x,i} is the encoder representation of the input x conditioned on an indicator i. z_{y+}, z_{y∼}, and z_{ŷ} are the decoder representations of positive, hard negative, and other negative target sequences within the same batch, respectively. The L_sim objective is highlighted in red; it requires both likely and less likely data.
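As a rough illustration of the family of objectives L_sim belongs to (this is a generic margin-based contrastive loss over an anchor, a positive, and negatives, not the paper's exact formulation; the function names and margin value are illustrative):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def margin_contrastive_loss(z_anchor, z_pos, z_negs, margin=1.0):
    """Generic margin-based contrastive objective: make the anchor
    (e.g., the input representation z_{x,i}) at least `margin` more
    similar to the positive target (z_{y+}) than to each negative
    (z_{y~} and other in-batch targets). Illustrative only."""
    pos_sim = cosine(z_anchor, z_pos)
    losses = [max(0.0, margin - pos_sim + cosine(z_anchor, zn)) for zn in z_negs]
    return sum(losses) / len(losses)
```

When the anchor already matches the positive and is orthogonal to the negative, the loss is zero; when the similarities are reversed, the loss is positive.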

Figure 3 :
Figure 3: Fraction-perplexity trade-off of decoding-time methods CD and DEXPERTS on the ART test set and the original E-CARE validation set (our test set). We show the trade-off across various values of τ_CD in CD and α in DEXPERTS. Both CD and DEXPERTS can improve the fraction of less likely hypotheses, but at a very high cost to perplexity.

Figure 4 :
Figure 4: Examples from MRIINTERPRET, ART, and E-CARE. The example shown in the table for E-CARE asks for a likely/less likely effect of the premise. "+"/"∼" indicates whether humans would consider the output likely/less likely according to the context under the Examples column. We explain why humans would consider these outputs likely/less likely in the Explanation column (this is not in the training data).

Figure 10 :
Figure 10: Unifying "less likely" indicators in MRIINTERPRET and how we map flipped indicators.
BRAINSTORM′ replaces L_sim with L′_sim. It is a conditional model p(y | x, i) that learns to generate both y+ and y∼ depending on i. MLE-LL learns to generate less likely outputs y∼ by training only on (x, y∼). Both models are trained with standard MLE.

Table 1 :
Requirements for various methods. +/∼/pair means a method requires y+/y∼/both for x. QUARK can take any type of data as input but requires a trained classifier. We use BRAINSTORM′ as an alternative to BRAINSTORM if y+ and y∼ are not both available for x. DEXPERTS and CD require that both y+ and y∼ be available for x (which is not the case for MRIINTERPRET, Section 7).

Here, z_t^+ is from the model that learns to generate ŷ+ by training only on (x, y+) pairs. z_t^neu is from the model that learns to generate both y+ and y∼ conditioned on the indicator. Unlike MLE, this model does not condition on indicators to generate hypotheses. Instead, it leverages text with both desired (generating y∼) and undesired properties (generating y+). It is shown to effectively maintain

Table 2 :
Performance of generating less likely hypotheses on the ART test set and E-CARE validation set. For DEXPERTS and CD, we list the fractions where the models reach minimum PPL. The ablation study of our proposed method is shown at the bottom.
Irrelevant The hypothesis does not describe the same scenario as the observation pair: it either involves different people, places, or things, or the events it describes have very little connection to O1 and O2.

Contradictory The hypothesis contradicts either observation O1 or observation O2: it says something that is impossible if we assume O1 and O2 to be true (e.g., O2 states that something happened and the hypothesis states that that thing did not happen).

Repetition The hypothesis is very similar to either O1 or O2: it either contains a text span that is a repetition of O1 or O2, or it expresses nearly the same meaning as O1 or O2.

Table 5 :
A summary of dataset statistics. All datasets are in English. For ART and E-CARE, we show the stats of our adapted versions. Since E-CARE has a hidden test set, we randomly split the original training set into a training and a validation set, and we use the original validation set as our test set. Note that each example in E-CARE asks for either the cause or the effect of the premise.
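The re-split of the E-CARE training set described above can be sketched as follows; the validation fraction and seed here are illustrative, and the actual split sizes are those reported in Table 5:

```python
import random

def split_train_val(examples, val_frac=0.1, seed=0):
    """Randomly carve a validation set out of the original E-CARE
    training set (the fraction and seed are illustrative)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[n_val:], shuffled[:n_val]
```

Fixing the seed makes the split reproducible across runs.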

Table 6 :
Human evaluation of fluency on everyday commonsense reasoning datasets. Annotators reached substantial agreement on both datasets.

Table 7 :
Human evaluation on hallucinations. The result shows the percentage of hallucinations found in 150 generated interpretations from each model.

The finding of interest was modified to be standardized across all findings if it used varying terminologies in a similar pattern (see Appendix F for more details). Because the interpretations are oriented in this negated valence, the objective of the output is to produce "relevant but less likely" interpretations. The annotator rated the interpretation on 3 metrics: (1) relevant and likely, (2) relevant but less likely, and (3) irrelevant.