That’s the Wrong Lung! Evaluating and Improving the Interpretability of Unsupervised Multimodal Encoders for Medical Data

Pretraining multimodal models on Electronic Health Records (EHRs) provides a means of learning representations that can transfer to downstream tasks with minimal supervision. Recent multimodal models induce soft local alignments between image regions and sentences. This is of particular interest in the medical domain, where alignments might highlight regions in an image relevant to specific phenomena described in free-text. While past work has suggested that attention “heatmaps” can be interpreted in this manner, there has been little evaluation of such alignments. We compare alignments from a state-of-the-art multimodal (image and text) model for EHR with human annotations that link image regions to sentences. Our main finding is that the text often has a weak or unintuitive influence on attention; alignments do not consistently reflect basic anatomical information. Moreover, synthetic modifications — such as substituting “left” for “right” — do not substantially influence highlights. Simple techniques such as allowing the model to opt out of attending to the image and few-shot finetuning show promise in terms of their ability to improve alignments with very little or no supervision. We make our code and checkpoints open-source.


Introduction
There has been a flurry of recent work on model architectures and self-supervised training objectives for multimodal representation learning, both generally (Li et al., 2019; Tan and Bansal, 2019; Huang et al., 2020; Su et al., 2020; Chen et al., 2020) and for medical data specifically (Wang et al., 2018; Chauhan et al., 2020; Li et al., 2020). These methods yield representations that permit efficient learning on various multimodal downstream tasks (e.g., classification, captioning).
Given the inherently multimodal nature of much medical data (e.g., in radiology, images and text are naturally paired), there has been particular interest in designing multimodal models for Electronic Health Records (EHRs) data. However, one of the factors that currently stands in the way of broader adoption is interpretability. Neural models that map image-text pairs to shared representations are opaque. Consequently, doctors have no way of knowing whether such models rely on meaningful clinical signals or data artifacts (Zech et al., 2018).
Recent work has proposed models that soft-align text snippets to image regions. This may afford a type of interpretability by allowing practitioners to inspect what the model has "learned", or allow more efficient identification of relevant regions. Past work has presented illustrative multimodal "saliency" maps in which such models highlight plausible regions. But such highlights also risk providing a false sense that the model "understands" more than it actually does, and irrelevant highlights would be antithetical to the goal of efficiency in clinical decision support.
Multimodal models may fail in a few obvious ways: they may focus on the wrong part of an image, fail to localize by producing a high-entropy attention distribution, or localize too much and miss a larger region of interest. However, even when image attention appears reasonable, it may not in actuality reflect both modalities. Figure 1 shows an example. Here the model ostensibly succeeds at identifying the image region relevant to the given text (left). One may be tempted to conclude the model has "understood" the text and indicated the corresponding region. But this may be misleading: we can see that the same model yields a similar attention pattern when provided text with radically different semantics (e.g., when swapping "right" with "left"), or when providing sentences referencing an abnormality in another region.
Our contributions are as follows. (i) We appraise the interpretability of soft alignments induced between images and texts by existing neural multimodal models for radiology, both retrospectively and via manual radiologist assessments. To the best of our knowledge, this is the first such evaluation. (ii) We propose methods that improve the ability of multimodal models for EHR to intuitively align image regions with texts.

Figure 1: An image/report pair with an example sentence ("There is blunting of the right posterior costophrenic angle...") and the induced attention heatmaps.

Preliminaries
We aim to evaluate the localization abilities of multimodal models for EHR. For this we focus on the recently proposed GLoRIA model (Huang et al., 2021), which is representative of state-of-the-art, transformer-based multimodal architectures and accompanying pre-training methods. For completeness we also analyze (a modified version of) UNITER (Chen et al., 2020). We next review details of these models, and then discuss the datasets we use to evaluate the alignments they induce.

GLoRIA
GLoRIA uses Clinical BERT (Alsentzer et al., 2019) as a text encoder and ResNet (He et al., 2016) as an image encoder. Unlike prior work, GLoRIA does not assume an image can be partitioned into different objects, which is important because pre-trained object detectors are not readily available for X-ray images. GLoRIA passes a CNN over the image to yield local region representations. This is useful because a finding within an X-ray described in a report will usually appear in only a small region of the corresponding image (Huang et al., 2021). GLoRIA exploits this intuition via a local contrastive loss term in the objective.
We assume a dataset of instances comprising an image x^v and a sentence from the corresponding report x^t, and the model consumes this to produce a set of local embeddings and a global embedding per modality: v_l ∈ R^{M×D}, v_g ∈ R^D, t_l ∈ R^{N×D}, and t_g ∈ R^D. To construct the local contrastive loss, an attention mechanism (Bahdanau et al., 2014) is applied to local image embeddings, queried by the local text embeddings. This induces a soft alignment between the local vectors of each mode:

a_{ij} = exp(t_i^⊤ v_j / τ) / Σ_{k=1}^{M} exp(t_i^⊤ v_k / τ)

where t_i is the i-th text embedding, v_j the j-th image embedding, and τ is a temperature hyperparameter.
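For intuition, this cross-modal attention can be sketched in a few lines (a minimal NumPy illustration of the temperature-scaled softmax described above, not GLoRIA's actual implementation; shapes follow the notation in the text):

```python
import numpy as np

def local_alignment(t_l, v_l, tau=0.1):
    """Soft-align N local text embeddings t_l (N x D) to M local image
    embeddings v_l (M x D) via temperature-scaled softmax attention."""
    logits = t_l @ v_l.T / tau                    # (N, M) similarity scores
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)             # each row sums to 1 over regions
    return a                                      # a[i, j]: weight of region j for word i

# toy example: 3 text tokens, 4 image regions, D = 8
rng = np.random.default_rng(0)
a = local_alignment(rng.normal(size=(3, 8)), rng.normal(size=(4, 8)))
```

A lower temperature τ sharpens each row of the alignment toward a single image region.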

UNITER
Despite the challenges inherent to adopting "general-domain" multimodal models for this domain (discussed in Appendix A.1), we modify UNITER to serve as an additional model for analysis. We provide details regarding how we have implemented UNITER in Appendix A.2, but note here that this requires ground-truth bounding boxes as inputs, which means that (a) results with respect to most metrics (which measure overlap with target bounding boxes) for UNITER will be artificially high, and, (b) we could not use this method in practice, because it requires a set of reference bounding boxes as input (including at inference time). We include this for completeness.

Data and Metrics
Data Our retrospective evaluation of localization abilities is made possible by the MIMIC-CXR (Johnson et al., 2019a,b) and Chest ImaGenome (Wu et al., 2021) datasets. MIMIC-CXR comprises chest X-rays and corresponding radiology reports. ImaGenome includes 1000 manually annotated image/report pairs,1 with bounding boxes for anatomical locations, links between referring sentences and image bounding boxes, and a set of conditions and positive/negative context annotations2 associated with each sentence/bounding box pair.
Metrics We quantify the degree to which attention highlights the region to which a text snippet refers by comparing the average attention over an input sentence, ā_j = (1/N) Σ_{i=1}^{N} a_{ij}, with the reference annotated bounding boxes associated with the sentence.

Table 1: Localization metrics for the pretrained GLoRIA model on the ImaGenome gold split: AUROC 69.07, Avg. Precision 51.68, IOU@5/10/30% 3.79/6.69/20.10.

We use several metrics to measure the alignment between soft attention weights and bounding boxes. We create scores s ∈ R^P for each of the P pixels based on the attention weight assigned to the image region the pixel belongs to. Specifically, for GLoRIA we use upsampling with bilinear interpolation to distribute attention over pixels. For UNITER, we score pixels by taking a max over attention scores for the bounding boxes that contain the pixel (scores for pixels not in any bounding box are 0). We use bounding boxes to create a segmentation label ℓ ∈ R^P, where ℓ_p = 1 if pixel p is in any of the bounding boxes and ℓ_p = 0 otherwise. Given pixel-level scores s and pixel-level segmentation labels ℓ, we can compute the AUROC, Average Precision, and Intersection Over Union (IOU) at varying pixel percentile thresholds for the ranking ordered by s (see Section A.4).
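For concreteness, a minimal NumPy sketch of these pixel-level metrics under the definitions above (illustrative only: the pairwise AUROC is quadratic in the number of pixels, so a production implementation would use a library routine):

```python
import numpy as np

def localization_metrics(s, labels, k=0.10):
    """s: pixel scores; labels: binary mask (1 inside any reference bounding
    box). Returns (AUROC, Average Precision, IOU@k), where the top-k fraction
    of pixels by score is taken as the predicted region."""
    s, labels = np.ravel(s), np.ravel(labels)
    pos, neg = s[labels == 1], s[labels == 0]
    # AUROC as the probability a positive pixel outranks a negative one
    auroc = (pos[:, None] > neg[None, :]).mean() \
        + 0.5 * (pos[:, None] == neg[None, :]).mean()
    order = np.argsort(-s)                       # pixels ranked by score
    hits = labels[order]
    precision_at = np.cumsum(hits) / np.arange(1, s.size + 1)
    avg_precision = (precision_at * hits).sum() / hits.sum()
    n_top = max(1, int(k * s.size))              # top-k% pixels predicted positive
    pred = np.zeros_like(labels)
    pred[order[:n_top]] = 1
    iou = np.logical_and(pred, labels).sum() / np.logical_or(pred, labels).sum()
    return auroc, avg_precision, iou
```

With perfectly ranked scores all three metrics reach 1.0; IOU at small k is penalized whenever the reference boxes are much larger than the top-scoring region, which matches the behavior discussed below.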
We also adopt a simple, interpretable metric to capture the accuracy of similarity scores assigned to pairs of images and texts. Specifically, we use a simpler version of the text retrieval task of Huang et al. (2021): we report the percentage of the time that the similarity between an image and a sentence from the corresponding report is greater than the similarity between the image and a random sentence taken from a different report in the dataset. This allows us to interpret 50% as the mean value of a totally random similarity measure.

Are Alignment Weights Accurate?
We first use the metrics defined above to evaluate the pretrained, publicly available weights for GLoRIA (Huang et al., 2021). Table 1 reports the metrics used to evaluate localization on the gold split of the ImaGenome dataset. AUROC scores are well over 50%, suggesting reasonable localization performance. IOU scores are small, which is expected as target bounding boxes tend to be much larger than the actual regions of interest and serve more to detect errors when highlighted regions are far from where they should be; this is further supported by the relatively high average precision scores. However, while seemingly promising, our results below suggest that the attention patterns here may be less multimodal than one might expect.
We next focus on evaluating the degree to which these patterns actually reflect the associated text.

Figure 2: Examples of the bounding box and text perturbations. Under Swap Left Right, the original sentence "Small right pleural effusion is stable." becomes "Small left pleural effusion is stable."; under Random Sentence, it is replaced with an unrelated sentence such as "The lungs are hyperinflated but clear of consolidation."
To this end we perturb instances in ways that ought to shift the attention pattern (Section 3.1), e.g., by replacing "right" with "left" in the text. We then identify data subsets in Section 3.2 comprising "complex" instances, where we expect the image and text to be closely correlated at a local level. Figure 2 shows examples of the perturbations, which include: swapping "left" with "right" (Swap Left Right); shuffling the target bounding boxes for sentences within the same report at random (Shuffle in Report); replacing sentences in a report with other sentences randomly drawn from the rest of the dataset (Random Sentences); replacing target bounding boxes with other bounding boxes randomly sampled from the dataset (Random BBoxes);3 and swapping the correct conditions in a synthetically created prompt with random conditions (Synth w/ Swapped Conditions). We include additional details about synthetic sentences and perturbations in Appendices A.3 and A.5. Under these perturbations, we would expect a well-behaved model to shift its attention distribution over the image accordingly, resulting in a decrease in localization scores (overlap with the original reference bounding boxes). The Random BBoxes perturbation in particular targets the degree to which the attention relies specifically on the image modality, because here the "target" bounding boxes have been replaced with bounding boxes associated with random other images. By contrast, all other perturbations should measure the degree to which the model is sensitive to changes in the text (even Shuffle in Report, which is equivalent to shuffling the sentences in a report).

Perturbations
If attention maps reflect alignments with input texts, then under these perturbations one should expect large negative differences in performance (∆metric) relative to observed performance using the unperturbed data. For all but Random BBoxes, if the performance does not much change (∆metric ≈ 0), this suggests the attention maps are somewhat invariant to the text modality.
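For illustration, the text and bounding-box perturbations can be implemented with simple string and list operations (a sketch under the descriptions above, not the exact preprocessing code; the swap must handle both directions in one pass so that "left" → "right" edits are not themselves re-swapped):

```python
import random
import re

def swap_left_right(sentence):
    """Swap 'left' <-> 'right' in a single regex pass (whole words only,
    matched case-insensitively)."""
    mapping = {"left": "right", "right": "left"}
    return re.sub(r"\b(left|right)\b",
                  lambda m: mapping[m.group(0).lower()],
                  sentence, flags=re.IGNORECASE)

def random_sentences(sentences, pool, rng=random):
    """Replace each sentence with one drawn from the rest of the dataset."""
    return [rng.choice(pool) for _ in sentences]

def shuffle_in_report(bboxes, rng=random):
    """Shuffle target bounding boxes among sentences of the same report."""
    bboxes = list(bboxes)
    rng.shuffle(bboxes)
    return bboxes
```

Under a well-behaved model, applying `swap_left_right` before encoding should move the attention mass to the opposite lung; the ∆metric analysis below tests exactly this.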

Subsets
We perform granular evaluations using specific data subsets, including: (1) Abnormal, instances with an abnormality; (2) One Lung, instances referencing only one side of the chest X-ray (left or right); and (3) Most Diverse Report BBoxes (MDRB), instances with substantial diversity in the labels for sentences in the same report. Details are in Appendix A.6. Intuitively, some of the perturbations in Section 3.1 should mainly affect certain subsets: Swap Left Right should most impact the One Lung subset, Shuffle in Report should mainly affect MDRB, and Random Sentences, Random BBoxes, and Synth w/ Swapped Conditions should primarily affect Abnormal examples.

Manual Evaluation
We enlist a domain expert (radiologist) to conduct annotations to complement our retrospective quantitative evaluations. We elicit judgements on a five-point Likert scale regarding the recall, precision, and "intuitiveness" of image highlights induced for text snippets.4 More details are in the Appendix, including annotation instructions (Section A.7) and a screenshot of the interface (Figure 11).

Results
We first evaluate performance on the subsets described in Section 3.2. This establishes a baseline with respect to which we can take differences observed under perturbations. We report results in Table 2. We observe that the model performs significantly worse on both the One Lung and MDRB subsets (which we view as "harder") in terms of AUROC and Average Precision, supporting this disaggregated evaluation.
Figure 3: Differences in localization performance (∆AUROC) under each perturbation; the only significant effect is from evaluating on random labels.

Manual evaluation results of 3.1, 1.8, and 1.7 for recall, precision, and intuitiveness respectively indicate that GLoRIA produces unintuitive heatmaps that have poor precision and middling recall. Because GLoRIA was trained on the CheXpert dataset and we perform these evaluations on ImaGenome, the change in dataset may be one cause of poor performance; in Section 4 we report how retraining on the ImaGenome dataset affects these scores.
To measure the sensitivity of model attention to changes in the text, we report differences in localization performance in Figure 3. Specifically, this is the difference in model performance (∆AUROC) achieved using (a) the original (unperturbed) sentences and (b) sentences perturbed as described in Section 3.1. We show results for each perturbation on the subsets they should most affect (Section 3.2), leaving the full results for the appendix (Figure 14).
The only real decrease in performance observed is under the Random BBoxes perturbation, which entails swapping out the target bounding box for an instance with one associated with some other instance (image). Performance decreasing here (and not for text perturbations) is consistent with the hypothesis that the attention map primarily reflects the image modality, but not the text. This is further supported by the observation that the model pays little mind to clear positional cue words such as "left" and "right" when constructing the attention map; witness the negligible drop in performance under the Swap Left Right perturbation. Finally, swapping in other sentences (even from different reports) yields almost no performance difference.

Can We Improve Alignments?
The above results indicate that image attention is unintuitive and less sensitive to the text modality than might be expected. Next we propose simple methods to try to improve image/text alignments.

Models
All models build on the GLoRIA architecture except the baseline UNITER, for which we perform no modifications except to re-train from scratch on the MIMIC-CXR/Chest ImaGenome dataset.5 In the results, GLoRIA refers to weights fit using the CheXpert dataset, released by Huang et al. (2021). We do not have access to the reports associated with this dataset, so we do not use it for training or evaluation, but we do make comparisons to the original (released) GLoRIA model trained on it.
We also retrain our own GLoRIA model on the MIMIC-CXR/ImaGenome dataset; we call this GLoRIA Retrained. While the two datasets are similar in size and content, CheXpert has many more positive cases of conditions than MIMIC-CXR/ImaGenome (8.86% of CheXpert images are labeled as having "No Findings"; in the ImaGenome dataset, reports associated with 21.80% of train images do not contain a sentence labeled "abnormal"). Given this difference in the number of positive cases, we train a Retrained+Abnormal model variant on the subset of MIMIC-CXR/ImaGenome sentence/image pairs featuring an "abnormal" sentence.
We also train models in which we adopt masking strategies intended to improve localization, hypothesizing that this might prevent over-reliance on textual artifacts that allow the model to ignore localizing text. Our Retrained+Word Masking model randomly replaces words in the input with [MASK] tokens during training with 30% probability.6 For our Retrained+Clinical Masking model, we randomly swap clinical entity spans found using a SciSpaCy entity linker (Neumann et al., 2019) for [MASK] tokens with 50% probability.
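The word-masking variant amounts to standard random token replacement; a minimal sketch (the 30% rate mirrors the text, and [MASK] is the text encoder's mask token):

```python
import random

def mask_words(tokens, p=0.30, rng=random):
    """Randomly replace input tokens with [MASK] with probability p,
    as in the Retrained+Word Masking variant."""
    return [tok if rng.random() >= p else "[MASK]" for tok in tokens]
```

The clinical-masking variant is the same operation applied only to entity spans identified by the SciSpaCy linker, rather than to arbitrary tokens.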
Many sentences in a report will not refer to any particular region in the image. We therefore propose the Retrained+"No Attn" Token model, which concatenates a special "No Attn" token parameter vector to the set of local image embeddings just before attention is induced. This allows the model to attend to this special vector, rather than any of the local image embeddings, effectively indicating that there is no good match.

5 We re-train from scratch because: (1) unlike in the original model, we are not feeding in features from Fast-RCNN, but instead using flattened pixels from a bounding box, and (2) we would like a fair comparison to the GLoRIA variants, which are also re-trained from scratch.

6 We choose the high value of 30% because, without allowing hyperparameter tuning of this probability, we would like to see a significant impact when comparing to the baseline.

We also consider a setting in which we assume a small amount of supervision (annotations linking image regions to texts). We finetune a model to produce high attention on the annotated regions of interest, i.e., we supervise attention. We employ an alignment loss L_alignment(s, ℓ) = -Σ_p s_p ℓ_p using the pixel-wise scores s derived from the attention7 and the segmentation labels ℓ (Section 2.3). We train on a batch of 30 examples for up to 500 steps with early stopping on an additional 30-example validation set using a patience of 25 steps. This might be viewed as "few-shot alignment", where we use a small number of annotated examples to try to make the model more interpretable by improving image and text alignments.
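Both modifications are small. A schematic NumPy sketch of the "No Attn" token and the few-shot alignment loss (function names are ours; the real model uses learned parameters and backpropagation rather than fixed vectors):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_no_attn(t_l, v_l, no_attn_vec, tau=0.1):
    """Concatenate a learned 'No Attn' embedding to the M local image
    embeddings before the softmax; column M of the result is the mass
    assigned to 'no good match in the image'."""
    v_ext = np.vstack([v_l, no_attn_vec])          # (M + 1, D)
    return softmax(t_l @ v_ext.T / tau, axis=1)    # (N, M + 1)

def alignment_loss(s, seg):
    """Few-shot alignment loss: reward attention mass on annotated pixels,
    i.e. minimize -sum_p s_p * seg_p."""
    return -(s * seg).sum()
```

When a sentence matches no region, the model can route its attention mass to the extra column instead of smearing it over the image, which is the behavior the manual evaluation below rewards.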
Finally, as a point of reference we train Re-trained+Rand Sents in the same style as the Retrained model except that all sentences are replaced with random sentences. This deprives the model of any meaningful training signal, which otherwise comes entirely through the pairing of images and texts. This variant provides a baseline to help contextualize results. For all models, we use early stopping with a patience of 10 epochs. 8

Results and Discussion
Table 3 might seem to imply that UNITER performs best. However, we emphasize that this is not comparable to other models because, as discussed in Section 2.2, UNITER's attention is defined over ground truth anatomical bounding boxes (rather than the entire image), of which the sentence bounding boxes are a subset; this dramatically inflates AUROC and average precision scores. (We have included UNITER despite this for completeness.) Finetuning on a small set of ground truth bounding boxes (+30-shot Finetuned) substantially improves performance. Among the remaining models, +Clinical Masking counter-intuitively performs slightly worse than Retrained; perhaps clinical masking hides too much key information. The +"No Attn" Token model also performs comparatively well, suggesting that allowing the model to not attend to any particular part of the image does increase performance.

The annotation results (Figure 4) for recall, precision, and intuitiveness are perhaps more revealing and do not necessarily align with our automatic metrics.9 This is likely a product of the limitations of the ImaGenome bounding boxes. The +"No Attn" Token model scored highest in terms of intuitiveness and precision, which is promising given that, unlike the +Abnormal and +30-shot Finetuned models, it does not require any additional training information (i.e., indications of training sample abnormalities, or ground truth bounding boxes). A simple modification to the architecture that allows it to pass on aligning a given text to the image yields a stark increase in performance with respect to the baseline Retrained model. The Retrained model performs about the same as GLoRIA in terms of precision and intuitiveness, although it incurs a significant drop in recall.

9 We do not include UNITER here because its attention over the bounding boxes is very unintuitive and differs from the other models' attention (see Appendix Figure 8).

Figure 5: For each perturbation, we plot the change in localization performance (as measured by AUROC) for each of the models we retrain from scratch, on the respective subsets. UNITER is affected most by the Random BBoxes perturbation because it uses the original ground truth as input.
The +30-shot Finetuned model uses the bounding boxes as ground truth, but these are somewhat noisy; better annotations of the regions of interest might improve intuitiveness further. When performing annotations, the radiologist also noticed that a large percentage of sentences in reports do not refer to anything focal, which indicates the necessity of looking at the subsets from Section 3.2 (all of which should have more focal sentences), especially when it comes to the perturbations. This may also help explain the superior performance of the +"No Attn" Token model, which explicitly handles these cases.
We next perform the perturbations introduced above (and assessed on GLoRIA) to the proposed variants to assess sensitivity to input texts (full results in Figure 14 of the appendix). We observe that +30-shot Finetuned, +"No Attn" Token, and +Abnormal, in that order, are most affected when swapping left and right. These three models are also the most affected by shuffling bounding boxes within a report or swapping in a random sentence from the rest of the dataset, although for these perturbations, the +Abnormal model is more sensitive than the +"No Attn" Token.
The Random BBoxes perturbation serves mostly as a reference measure of how variable model scores can be when swapping in entirely wrong bounding boxes. But it also seems to suggest that models more affected by it have attention that is more focused on precision. This indicates that besides UNITER, the +30-shot Finetuned, +Word Masking, +Abnormal, and +"No Attn" Token models, in that order, are the most precise; this is in line with the average precision scores in Table 3 and the entropy scores in the appendix (Table 9).

Table 4: Average accuracies with respect to discriminating between the sentence actually associated with an image and a sentence randomly sampled from the dataset. (See Appendix Table 8 for results on subsets.) Global and local refer to using only global or local embeddings for computing similarity.

Taken together, these perturbation results suggest that +"No Attn" Token, +Abnormal, and +30-shot Finetuned are the models most intuitively sensitive to text. However, they remain less intuitive than they would ideally be.10 Table 4 reports the accuracy of each model with respect to identifying the correct sentence from two candidates for a given image. These results indicate that performing comparatively well at identifying the correct sentence does not necessarily correlate with intuitiveness or textual sensitivity, i.e., being able to discriminate between sentences given an image does not imply an ability to accurately localize within an image, given a sentence. In particular, +Word Masking performs best here, though we saw above that it is relatively unintuitive and its localization is somewhat invariant to perturbations. Further, the three best models in terms of textual sensitivity have relatively poor performance (with the possible exception of the +Abnormal variant).
To quantify the relationships between scores, we report correlations between them across instances for +"No Attn" Token (the best model in terms of manually judged intuitiveness) in Figure 6. Of the automatic metrics, IOU@10% has the strongest correlation with annotated intuitiveness. Avg. Precision and Precision at 10% have almost no correlation with intuitiveness and relatively weak correlation with annotated precision. We also show correlation with local and global similarity between two positive pairs.11 Though the local similarity of positive pairs is somewhat correlated with each of the annotation ratings, the global similarity is only (weakly) correlated with annotated precision.

10 We discuss results for experiments in which we swap conditions in synthetic sentences in the Appendix (B.3); these are inconclusive.

11 Because we only look at positive pairs, higher similarity scores are better.
The "No Attn" score, which is what we use to refer to the attention score for the added "No Attn" token, has some interesting Pearson correlations. Unfortunately, its relationship with annotations is complicated by our user interface: often the "No Attn" score (which we display in the corner of the image) will either be unnoticeable or will saturate the heatmap, resulting in the radiologist assigning low scores (1s) for an instance. Therefore, we note that some negative correlations with annotations (Figure 22) may mostly reflect how the X-ray heatmap is displayed to the user. However, a -.30 correlation with IOU@10% and a -.47 correlation with whether an image contains an abnormality are significant. This demonstrates the potential for this score to identify situations where the model should abstain from displaying a heatmap altogether, because either there is nothing abnormal to highlight or the model is not confident in its heatmap.
We note some interesting qualitative behavior discovered during the annotation that may also support the use of this "No Attn" architecture and score. When many of the models are incorrect, they tend to highlight image edges or corners. We hypothesize this occurs because the model attempts to find a static part of the image-one that is similar across most instances-on which to attend. This behavior is misleading and not quantifiable. The "No Attn" Token offers an alternative to this behavior, providing a means for the model to pass on inducing a heatmap altogether when appropriate.
We conclude with a qualitative impression of localization performance. Figure 7 shows model attention distributions for a (cherry-picked) instance and the accompanying Swap Left Right perturbation. This example was selected specifically to illustrate how models can fail to behave intuitively. In this example, the correct region of interest for the original prompt lies mostly centered on the small box, and the large box (corresponding to the left lung) is somewhat misleading as it covers more than the strict region of interest. This example demonstrates that though the anatomical locations discussed in the prompt are correctly highlighted by the bounding boxes, the region of interest is not always directly on those anatomical locations.
With the original prompt, GLoRIA yields a high-entropy map, GLoRIA Retrained and the +Masking, +"No Attn" Token, and +Abnormal are centered roughly correctly (some more intuitive than others), and finally, +30-shot Finetuned almost fully highlights the large box (even though this is not strictly the correct region of interest) and almost entirely ignores the small box (the real region of interest). The perturbation of swapping out "left" with "right" changes all of the models' heatmaps to varying degrees and with varying intuitiveness. In this example, the most intuitive heatmaps after the perturbation are given by the +"No Attn" Token and +Abnormal models, whereas other models still show significant emphasis on the original region and/or show emphasis on unintuitive and entirely irrelevant regions.
Summary of key findings Existing multimodal pretraining schemes beget models that accurately select the text that matches a given image (Table 4), and yield attention distributions that at least somewhat depend on the text. But these models are not found intuitive (Figure 4), and perturbing texts does not consistently yield the changes in attention patterns that one would expect (Figure 5). Simple changes to pre-training may improve this behavior. Specifically, adding the ability of the model to not attend to any particular part of the image may result in models that produce attention patterns which are more intuitive (Figure 4) and more reflective of input texts (Figure 5), although this may slightly harm performance on the pretraining task itself (Table 4).

Related Work
Work on multi-modal representation learning for medical data has proposed soft-aligning modalities, but has focused quantitative evaluation on the performance that the learned representations afford on downstream tasks (Ji et al., 2021; Liao et al., 2021; Huang et al., 2021). Model interpretability is often suggested using only qualitative examples; our work aims to close this gap.
A line of work in NLP evaluates the interpretability of neural attention mechanisms (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019; Serrano and Smith, 2019). Elsewhere, work at the intersection of computer vision and radiology has critically evaluated the use of saliency maps over images (Arun et al., 2021; Rajpurkar et al., 2018).
Recent work has sought to improve the ability of these models to identify fine-grained alignments via supervised attention (Kervadec et al., 2020; Sood et al., 2021), but has focused on downstream task performance. This differs from our focus on evaluating and improving localization itself, especially within the medical domain. We also do not assume access to large amounts of supervision, which is commonly lacking in this domain.

Conclusions
We evaluated existing state-of-the-art unsupervised multimodal representation learning models for EHRs in terms of inducing fine-grained alignments between image regions and text. We found that the resultant heatmaps are often unintuitive and invariant to perturbations of the text that ought to change them substantially. We evaluated a number of methods aimed at improving this behavior, finding that (1) allowing the model to refrain from attending to the image and (2) finetuning the model on a small set of labels for interpretable heatmaps both substantially improve performance. We hope that this effort motivates more work addressing the interpretability of multimodal encoders for healthcare.

Limitations
This is a first attempt to investigate the interpretability of pre-trained multimodal models for medical imaging data, and as such our work has important limitations. ImaGenome only annotates anatomical locations for each sentence and bounding boxes for each anatomical location; these may not correspond directly to regions of interest. In addition, these extracted anatomical locations span many levels of the anatomical hierarchy, so if a specific part of the lung is mentioned, the whole lung's bounding box may still be included. The annotations we collected for evaluation also have important limitations to consider, namely that we used only one radiologist annotator and the annotations were limited to 50 instances.
Finally, we did not try a more fine-grained UNITER model with more and smaller bounding boxes that form a grid (to avoid an object detector), primarily because this would incur a significantly higher computational cost due to the number of image vectors, but future work might explore this option.

A.1 Challenges in Adapting General-Domain Multimodal Models

Given the success of general-domain multimodal models, it is reasonable to ask whether we can simply apply such models and pre-training schemes to multimodal medical data. However, a few key difficulties complicate straightforward adaptation. Necessity of Object Detectors. Many open domain models assume access to general object detectors during pre-processing. Such detectors are not readily available in the medical domain, and training object detection models requires large-scale, high-quality annotations for many different phenomena and/or anatomical regions. Further, one would need to collect such data for each domain in radiology (e.g., brain versus chest imaging).
In many multimodal models object detectors are used to produce bounding boxes, and are also tasked with inducing low-dimensional fixed-length vectors for significant regions, effectively taking care of region representation learning so that it need not be learned end-to-end. Open domain models often expect tens of bounding boxes in an image, but even a coarse segmentation of images (e.g., into a 19x19 grid as in GLoRIA) yields many more bounding boxes than this, exacerbating the mismatch between pre-trained general object detectors and the medical domain when the former are initialized from open-domain checkpoints.
Mismatch in Alignment Assumptions. UNITER uses optimal transport to align image and text vectors, but this assumes that each object (or salient part within an image) can be reasonably aligned to a segment of the corresponding text. This makes sense for general-domain data like COCO (Lin et al., 2014) because we usually expect most detected objects to be mentioned in the caption. By contrast, in the medical domain we would expect that most parts of the image are unrelated to any portion of the corresponding text, and the task of the model is to identify salient regions of the text and match these with particular image regions. This is especially true when not using an object detector to identify the interesting regions as a preprocessing step. The result of the optimal transport objective is that, averaging over tokens in the input text, each bounding box is equally important. In Section 2.2, we circumvent this problem by not using the optimal transport distribution itself (though that would be the natural choice), but instead using the attention mechanisms within UNITER.
Despite these obstacles to re-purposing open-domain multimodal models for this space, for completeness we describe in Section A.2 how we modify UNITER to serve as a baseline for our analysis.

A.2 UNITER Details
We use all reference anatomical bounding boxes available in the ImaGenome dataset. We reshape each bounding box to 45 × 45 pixels (enforcing fixed length), and then flatten and zero-pad the resultant vectors to be of length 2048 (the dimension UNITER expects). We train UNITER with a batch size of 4096 and 5 gradient accumulation steps on ImaGenome for 200k training steps.
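The bounding-box preprocessing described above can be sketched as follows. This is a minimal illustration, not our released code: the crop/resize helper (`bbox_to_uniter_feature`) and the nearest-neighbour resize via index sampling are assumptions made to keep the sketch dependency-free; any standard image-resize routine would do.

```python
import numpy as np

def bbox_to_uniter_feature(image: np.ndarray, bbox, size=45, dim=2048):
    """Crop a bounding box, resize it to size x size pixels, flatten,
    and zero-pad to the feature dimension UNITER expects."""
    x0, y0, x1, y1 = bbox
    crop = image[y0:y1, x0:x1]
    # Nearest-neighbour resize by sampling evenly spaced rows/columns.
    rows = np.linspace(0, crop.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, size).astype(int)
    resized = crop[np.ix_(rows, cols)]
    flat = resized.flatten().astype(np.float32)  # 45 * 45 = 2025 values
    padded = np.zeros(dim, dtype=np.float32)
    padded[: flat.size] = flat                   # zero-pad to length 2048
    return padded
```

Since 45 × 45 = 2025 < 2048, every region vector ends with 23 zero entries of padding.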
For saliency we compute the mean attention over all 144 heads (12 layers × 12 heads) to produce pixel-wise scores (Gan et al., 2020). We take the mean of the attention when querying the text over the image, and when querying the image over the text; we normalize the resultant scores and treat these as analogous to a ij , i.e., the saliency score relating the two modalities. We note that the absolute overlap scores between UNITER attention (just defined) and bounding boxes will be relatively high given that the UNITER attention is defined over the ground truth bounding boxes for all anatomical locations, and the bounding boxes used to evaluate the attention for a particular sentence are a subset of these same input bounding boxes. This also means we cannot use this approach in practice in the unsupervised setting in which we operate.
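The saliency computation just described can be sketched as below. The sequence layout `[text tokens | image region tokens]` and the helper name `cross_modal_saliency` are simplifying assumptions for illustration; the actual attention tensors come from the UNITER forward pass.

```python
import numpy as np

def cross_modal_saliency(attn, n_text, n_boxes):
    """Mean attention over all layers and heads (e.g., 12 x 12 = 144),
    averaging the text->image and image->text directions, then
    normalizing to produce saliency scores a_ij.

    attn: array of shape (layers, heads, seq, seq), where the sequence
    is assumed to be [text tokens | image region tokens].
    """
    mean_attn = attn.mean(axis=(0, 1))                        # (seq, seq)
    txt2img = mean_attn[:n_text, n_text:n_text + n_boxes]     # text queries image
    img2txt = mean_attn[n_text:n_text + n_boxes, :n_text]     # image queries text
    scores = (txt2img + img2txt.T) / 2                        # (n_text, n_boxes)
    return scores / scores.sum()                              # normalize
```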

A.3 Synthetic Sentences
In Section B, we include results involving synthetic sentences, which we describe here. To facilitate controlled experiments involving swapping out conditions -Section 3.1, Synthetic+Swapped Conditions -we also adopt a strategy for creating synthetic sentences using the labels from ImaGenome (Wu et al., 2021), and test our models on these sentences as well. Specifically, we construct these sentences using a set of rules that translate the condition and positive/negative context annotations and the anatomical names for the corresponding bounding boxes into natural language. If there are multiple conditions in the sentence, we concatenate synthetic sentences for each of them. The "loclist" is created by turning the list of anatomical locations associated with the condition/context into a natural language list (e.g., "x," "x and y," or "x, y, and z"). We combine left- and right-side locations into one item ("left lung" and "right lung" is mapped to "lungs").
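The "loclist" rule can be sketched as below; this is a minimal re-implementation of the behavior described above (merging left/right pairs and joining with "and"/commas), not our released code, and the naive pluralization (appending "s") is an assumption.

```python
def loclist(locations):
    """Render anatomical locations as a natural-language list, merging
    left/right pairs into a plural (e.g., "left lung" + "right lung"
    becomes "lungs")."""
    merged, seen = [], set()
    for loc in locations:
        if loc.startswith(("left ", "right ")):
            base = loc.split(" ", 1)[1]
            other = ("right " if loc.startswith("left ") else "left ") + base
            if other in locations:
                plural = base + "s"           # naive pluralization
                if plural not in seen:
                    merged.append(plural)
                    seen.add(plural)
                continue
        merged.append(loc)
    # Join as "x", "x and y", or "x, y, and z".
    if len(merged) == 1:
        return merged[0]
    if len(merged) == 2:
        return merged[0] + " and " + merged[1]
    return ", ".join(merged[:-1]) + ", and " + merged[-1]
```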

A.3.2 Synthetic Examples
In Table 5, we present examples of synthetic sentences formed via the rules in Section A.3.

A.4 Metrics
Figure 9 demonstrates what a thresholded (bilinearly upsampled) attention map looks like and, for this specific threshold, which pixels are true positives (shown in green), false positives (shown in white), and false negatives (any other pixels inside either of the bounding boxes). For metrics like AUROC and Average Precision, these statistics must be computed while sweeping over all possible thresholds.
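The per-threshold statistics can be sketched as follows; the helper name `threshold_stats` is hypothetical, and metrics like AUROC and Average Precision would aggregate these values over a sweep of thresholds.

```python
import numpy as np

def threshold_stats(attn, gt_mask, threshold):
    """Precision and IOU for a thresholded attention map against
    ground-truth bounding-box pixels (both given as 2D arrays)."""
    pred = attn >= threshold                      # binarize the heatmap
    tp = np.logical_and(pred, gt_mask).sum()      # inside a gt box, predicted
    fp = np.logical_and(pred, ~gt_mask).sum()     # outside gt boxes, predicted
    fn = np.logical_and(~pred, gt_mask).sum()     # inside a gt box, missed
    precision = tp / max(tp + fp, 1)
    iou = tp / max(tp + fp + fn, 1)
    return precision, iou
```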

A.5 Perturbations Details
Swap Left Right We replace every occurrence of the word "right" in the text with "left" and vice versa (ignoring capitalization). This is intended to probe the degree to which the attention mechanism relies on these two basic location cues. Of course, many sentences do not contain these words because conditions (or lack thereof) occur on both sides of the chest X-ray. Therefore, it is particularly important to look at the metrics on the "One Lung" subset (Section 3.2) for this perturbation.
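This perturbation can be sketched with a single regular-expression pass that swaps both words simultaneously (a two-step find-and-replace would incorrectly map everything to one side); the function name is hypothetical.

```python
import re

def swap_left_right(text):
    """Replace every "right" with "left" and vice versa, ignoring
    capitalization but preserving a leading capital letter."""
    def repl(match):
        word = match.group(0)
        swapped = "right" if word.lower() == "left" else "left"
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b(left|right)\b", repl, text, flags=re.IGNORECASE)
```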

Shuffle in Report
We shuffle the sets of bounding boxes for different sentences in the same report at random. One would expect performance to decrease significantly, because the resultant bounding boxes associated with a given sentence are (probably) wrong. However, sentences within the same report might be talking about similar regions. Therefore, for this perturbation it is important to look at the instances where the overlap between (a) the region of interest for the sentence and (b) the regions associated with other sentences in the report is low. We look at results for such cases explicitly using the Most Diverse Report BBoxes (MDRB) subset (Section 3.2).

Random Sentences
We replace sentences in an instance with other sentences, randomly drawn from the rest of the dataset. Here too we expect performance to decrease significantly because the sampled text will refer to an entirely different image.

Random BBoxes
We replace the set of bounding boxes for a sentence with a different set of bounding boxes randomly selected from the rest of the dataset. This differs from the Random Sentences perturbation in that the bounding boxes here are not only unrelated to the sentence, but also unrelated to the image. Therefore, we expect that this will have the poorest performance of all the settings, especially under the hypothesis that the attention is mostly a function of the image.

Original: Small right pleural effusion is stable.

Swap Left Right: Small left pleural effusion is stable.

Random Sentence: The lungs are hyperinflated but clear of consolidation.

Synthetic: The right costophrenic angle is abnormal. There is lung opacity in the right costophrenic angle. There is pleural effusion in the right lung and right costophrenic angle.

Synthetic w/ Swapped Conditions: There is pulmonary edema/hazy opacity in the right costophrenic angle. There is costophrenic angle blunting in the right lung and right costophrenic angle.

Shuffle in Report: *Equivalent to shuffling report sentences.

Random BBoxes: (the bounding boxes are replaced; the text is unchanged)

Figure 10: Examples of each perturbation (including Synth w/ Swapped Conditions) for a given instance.
Synthetic w/ Swapped Conditions To swap conditions, we follow the same rules for generating the synthetic sentence, but with a different condition randomly sampled from a set of (other) possible conditions. Possible conditions are defined as any condition (excluding the current one) that occurs in the same anatomical locations anywhere else in the gold dataset. 13 This perturbation measures the impact of conditions on model attention.

A.6 Subset Details
Abnormal Image/sentence pairs where there is an "abnormal" label associated with the sentence. This occurs if any conditions are mentioned in a positive context, i.e., where the radiologist believes the patient has said condition. This targets "interesting" examples where the attention should ideally highlight the region relevant to the condition described.
One Lung Image/sentence pairs where the bounding boxes corresponding to the sentence contain a bounding box of either the left or right lung, but not both. This subset allows us to evaluate how the model performs when the attention should only be on one side of the image.
13 If there are no other conditions, we leave the condition as is and the synthetic sentence is not perturbed.
Most Diverse Report BBoxes Instances where the overlap in the sets of bounding boxes for sentences within the same report is minimal. Specifically, we calculate the mean intersection over union (IOU; Section 2.3) of the segmentation labels for pairs of sentences in the same report. We then take the 10% of instances within reports with the smallest mean IOU. This subset is intended to include examples within reports where multiple distinct regions of interest are discussed in different sentences.
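The MDRB selection can be sketched as below, assuming each sentence's segmentation label is a boolean pixel mask; the helper names are hypothetical.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_iou(masks):
    """Mean IOU over all pairs of per-sentence segmentation masks
    (boolean arrays) within one report."""
    ious = []
    for m1, m2 in combinations(masks, 2):
        inter = np.logical_and(m1, m2).sum()
        union = np.logical_or(m1, m2).sum()
        ious.append(inter / union if union else 0.0)
    return float(np.mean(ious)) if ious else 1.0

def most_diverse_subset(report_scores, fraction=0.1):
    """Indices of the `fraction` of reports with the smallest mean IOU."""
    order = np.argsort(report_scores)
    k = max(1, int(len(report_scores) * fraction))
    return order[:k]
```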
These first two subsets are important because in many examples there is nothing abnormal, and the reports contain sentences such as "No effusion is present." For these types of sentences, the bounding boxes commonly cover both lungs because the evidence for the sentence is that nothing abnormal is in either lung. In these situations, the model might find it easier to achieve high scores for two reasons: 1) the lungs take up most of the image, so attention is likely to fall within the bounding boxes, and 2) the lungs are a reasonable guess for the "important" regions of any image, independent of the text. The last subset is important because it comprises examples whose target bounding boxes and associated texts cover mostly distinct image regions.

A.7 Annotations
In Figure 11, we present our user interface for collecting annotations, created using Streamlit. In Section A.7.1 (below), we show the annotation instructions.

A.7.1 Instructions
Our aim here is to collect judgements (*annotations*) concerning the interpretability and possible usefulness of alignments between text snippets and image regions induced by neural network models. More specifically, we will ask you to evaluate "heatmaps" output by different unsupervised (or minimally supervised) models which attempt to align natural language (sentences) and image regions (within accompanying chest X-rays). We ask three specific questions to assess these heatmaps; each question is 5-way multiple choice, and each of the answers is described below. In each round of annotation collection, we aim to collect annotations for multiple models with respect to a shared set of text snippets. That is, for each image, we ask for multiple assessments (across models) of the quality of alignments for a particular sentence. You will not be told which model generated which "heatmaps", and model aliases are randomly selected for every instance.

Prompts
You can choose the natural language sentences fed to the model (which we refer to as "prompts") by either selecting a sentence from the list of sentences in the associated radiology report, or by writing your own "custom" prompt. We ask you to complete one round of annotations for report sentences, followed by one round in which you evaluate the alignments generated by the model for custom prompts (i.e., text you enter). For the report sentences round, we ask you to select one sentence that you think is interesting from the list of report sentences (prior to looking at any heatmaps). More specifically, you should, when possible, select a sentence with a focal abnormality that has strong clinical relevance. If one is not present, you can select a sentence that has a more diffuse abnormality or a negative statement that is still relatively focal. You will then annotate or judge the alignments induced by all models for this particular sentence. For instances where you do not think any sentence is appropriate, or where you can think of a better prompt, we ask you to write a prompt to annotate using the "custom prompt" option in addition to annotating the best sentence from the report.

Annotations
Below we list the questions and what each of the possible answers would mean.

1. The heatmap includes what percentage of the region of interest from the prompt?
• 0-20 - The heatmap is focused on entirely the wrong part of the image, does not highlight any part of the image strongly, or has very minimal intensity on the region of interest.
• 20-40
• 40-60 - The heatmap comes close to covering the region of interest, or does cover the region of interest but with not too much intensity.
• 60-80
• 80-100 - The prompt refers to a region that is within a high-intensity part of the heatmap.
2. What percentage of the heatmap represents an area of interest?
• 0-20 - This heatmap is all over the place or highlights a large portion of the image.
• 20-40
• 40-60 - The focus includes the relevant region(s) but also other irrelevant regions (either adjacent or elsewhere in the image).
• 60-80
• 80-100 - The heatmap is very targeted to only the parts of the image most relevant to the prompt.
• 3 - The heatmap does show a region of interest, but has some stray parts or does not catch all relevant regions.
• 4 - The heatmap is reasonably intuitive and contains mostly (though not exclusively) the regions I would expect.
• 5 - The heatmap is almost exactly what you might draw to represent the region of interest.

Ground truth bounding boxes
You have the ability to see ground truth bounding boxes from the dataset associated with the particular sentence you have selected from the report; these were manually drawn to match the corresponding sentence. We suggest that you use these bounding boxes when annotating the heatmaps associated with the report sentences. No such bounding boxes are available for the custom prompts that you will author. Figure 12 depicts what happens when the model attends very highly to the "No Attn" token.

B Full/Additional Results
Here we include full/expanded results for the tables in the main paper, as well as some additional results for which we do not yet have a clear takeaway.

B.1 Custom Prompts
We also let annotators choose to write (and annotate) a fitting prompt if one was not present in the report. Figure 13 shows the annotations for these "custom" prompts for the GLoRIA and +"No Attn" Token models and indicates that even in this small, potentially out-of-domain setting, the scores are consistent with the in-domain annotations.
B.2 Localization performance for all models on all subsets
Table 6 reports additional results to those in Table 3, describing localization performance on each subset individually. Though not shown in the main paper, we can see here that synthetic sentences perform comparably to real sentences, validating our method for constructing synthetic sentences. In fact, for the +30-shot Finetuned model, there is a significant jump in performance when using synthetic sentences.

B.3 Deltas of all models on all subsets
In Figure 14 we report results analogous to those in Figures 3 and 5, but on all subsets, all models, and all perturbations at once.
The results from swapping conditions in synthetic sentences, which were not shown in the main paper, vary across data subsets (Figure 14). The most telling subset for this perturbation is probably the Abnormal set. The results here are difficult to interpret because the +Rand Sents model seems to be considerably affected, which is counter-intuitive given that we would expect this model to be invariant to the text by construction (note that the other perturbation results are consistent with this). Given this, we do not draw any particular conclusions from the swapped conditions experiment at present.

B.4 ∆ Average Precision
In Figure 15, we present the analogue of Figure 14 for changes in Average Precision as opposed to AUROC. Average Precision seems to tell a similar story to AUROC in terms of which models have greater changes for each perturbation. The only major difference is that for Average Precision, all models show a positive change for the Random BBoxes perturbation in the MDRB subset. This is likely because bounding boxes in this subset tend to be small, so a random bounding box drawn from the whole dataset will likely be larger than the original. Having a larger bounding box as a label would, in general, likely improve precision, making this particular perturbation harder to interpret on this subset.

B.5 Random Attention KL Divergences
To measure the extent to which a model eschews the text and relies mostly on the image to induce an attention pattern, we introduce Random Attention KL Divergence. This is the symmetric Kullback-Leibler (KL) divergence for an instance between (a) the attention distribution induced given the original text, and (b) the attention over the same image but paired with random text. In Table 7, we show the mean Random Attention KL Divergence for each subset.
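A minimal sketch of this computation, assuming each attention map is flattened and normalized into a distribution; the function name is hypothetical.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two attention
    distributions, e.g., attention under the original vs. random text."""
    p = np.asarray(p, dtype=float).ravel() + eps   # smooth to avoid log(0)
    q = np.asarray(q, dtype=float).ravel() + eps
    p, q = p / p.sum(), q / q.sum()                # renormalize
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```

A divergence near zero indicates the attention pattern barely changes when the text is replaced, i.e., the model is relying mostly on the image.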

B.6 Candidate Selection Accuracy for other subsets
In Table 8, we extend Table 4 to the remaining subsets.

B.7 Entropy
In Table 9, we present results for the entropy of each model's attention mechanism on the entire dataset as well as on each subset.
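The attention entropy can be sketched as below (the function name is hypothetical); lower entropy corresponds to a more peaked, and hence more localized, attention pattern.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy of an attention distribution (any shape);
    the map is flattened and renormalized first."""
    p = np.asarray(attn, dtype=float).ravel()
    p = p / p.sum()
    return float(-np.sum(p * np.log(p + eps)))
```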

B.8 Performance across Specific Abnormalities
In Figure 16, we present Intuitiveness for all models on examples with specific abnormalities.

B.9 Correlations
In Figures 18, 19, 20, 21, 22, 23, and 24, we present the pairwise Pearson correlations, over instances, between several quantities derived from each model's outputs on the full gold split. Most of the localization metrics are somewhat correlated, although not as strongly as one might expect. IOU is generally more correlated with AUROC than with Average Precision.
Of particular note is the correlation between Attention Entropy and the global and local similarities: Attention Entropy is usually slightly positively correlated with Global Similarity and slightly negatively correlated with Local Similarity. Though it is still unclear why, this may relate to a model's ability to localize, as the effect is more pronounced in models with better localization.
Finally, it is interesting that the +Abnormal model has a somewhat negative correlation between Attention Entropy and all of the localization metrics, potentially indicating a connection between abnormal examples and Attention Entropy; more work should be done to probe this further.

B.10 Precision and IOU at different Thresholds
Finally, we present Precision (Table 10) and IOU (Table 11) at different thresholds to get a better sense of the differences in attention between the models. (Some IOU scores for GLoRIA are repeated here to allow easier comparison.) The +Masking model clearly performs best when taking only the top 5 or 10 percent of pixels, but GLoRIA starts producing similar or better scores at less strict thresholds. The +Masking precision scores above 70%, which far exceed any other model's scores at any threshold, suggest that this model is quite effective at localization, but the dropoff on the subsets indicates the need for future work in this area.