Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions

As multimodal learning finds applications in a wide variety of high-stakes societal tasks, investigating the robustness of multimodal models becomes important. Existing work has focused on understanding the robustness of vision-and-language models to imperceptible variations on benchmark tasks. In this work, we investigate the robustness of multimodal classifiers to cross-modal dilutions – a plausible variation. We develop a model that, given a multimodal (image + text) input, generates additional dilution text that (a) maintains relevance and topical coherence with the image and existing text, and (b) when added to the original text, leads to misclassification of the multimodal input. Via experiments on Crisis Humanitarianism and Sentiment Detection tasks, we find that the performance of task-specific fusion-based multimodal classifiers drops by 23.3% and 22.5%, respectively, in the presence of dilutions generated by our model. Metric-based comparisons with several baselines and human evaluations indicate that our dilutions show higher relevance and topical coherence, while simultaneously being more effective at demonstrating the brittleness of the multimodal classifiers. Our work aims to highlight and encourage further research on the robustness of deep multimodal models to realistic variations, especially in human-facing societal applications.


Introduction
Rich multimodal content understanding is crucial for several AI for Social Good applications like humanitarian information detection during crises, hate speech analyses, and fake news mitigation (Ofli et al., 2020; Kiela et al., 2020; Facebook, 2020; Khattar et al., 2019; Verma et al., 2022). In many such scenarios, the information in individual modalities, either image or text, is designed to be complementary to information in the other modality. As such, joint modeling of both modalities is of fundamental importance, and consequently, technologies that enable multimodal understanding are advancing rapidly and are being deployed at scale (Nayak, 2021; Grauman et al., 2021).
Figure 1: Overview of our study. We investigate the robustness of fusion-based deep multimodal classifiers to cross-modal dilutions. We generate dilutions that maintain semantic relevance with the original text and image while causing incorrect classifications. We also demonstrate the realistic nature of cross-modal dilutions using human evaluation. The figure shows an actual example from our experiments. [Figure content: the original text "seen from above. entire california communities reduced to ash." is extended by the cross-modal (image → text) dilution "the devastation in california: why have entire communities either been destroyed, or reduced to a few bare earth bare rock formation?" and passed to the fusion-based deep multimodal classifier; the class labels shown are "infrastructure & utility damage" and "not humanitarian".]
It is desirable that deep learning models are robust to dilution-based variations in input. Dilution is defined as the addition of related content that dilutes the effect of the original information. Naik et al. (2018) and Ribeiro et al. (2020) argue that natural language processing (NLP) models should not alter their predictions after adding dilutions, for instance, appending statements like "and true is true" (multiple times) for the Natural Language Inference task and adding randomly created URLs for the Sentiment Analysis task.
We study the robustness of multimodal classifiers to dilutions. Compared to the simple dilutions created for NLP tasks, we aim to explore realistic dilutions for multimodal data. Since what constitutes a plausible dilution for multimodal data has not been established, we propose a new category of dilutions specific to multimodal content, named cross-modal dilutions. Cross-modal dilution involves adding relevant information from the image modality to the text modality for a multimodal input; see Figure 1. Our notion of dilution, unlike the examples above, is contextual; that is, the change introduced varies for different information items. Additionally, evaluating robustness to dilutions in a multimodal setting is non-trivial because the possible additions are constrained by the semantics of both the image and the original text.
Previous research on the robustness of deep multimodal learning focuses on perturbations for Visual Question Answering (VQA) (Srivastava et al., 2020; Zhang et al., 2019; Gupta, 2017; Wu et al., 2017) and involves making minor alterations to the textual questions (Mudrakarta et al., 2018), or asking more challenging questions than those present in the training dataset (Sheng et al., 2021; Li et al., 2021b). In contrast, we focus on multimodal classification and study dilution-based variations. To this end, we propose a method that leverages a large language model to generate additional text that is (i) related to the information in the image, (ii) semantically aligned with the existing user-provided textual description, and (iii) adversarial in nature (i.e., when added to the existing description, it leads to incorrect predictions by multimodal models). The first two constraints ensure that the additional text is realistic, while the third constraint enables us to assess the robustness of multimodal classifiers under these settings. Our contributions are summarized as follows:
• We propose and investigate the robustness of multimodal classifiers to cross-modal dilutions. We develop an approach that leverages keywords from image and text to perform controlled generation of semantically relevant text that can be appended to the original text to cause misclassification.
• Via extensive evaluation covering aspects like adversarial effectiveness, content relevance, diversity, and coherence, we establish that the dilutions generated by our proposed model are better than several rule-based and model-based baselines. We release our code to aid future research.
• We conduct human evaluations to (a) assess the quality of generated dilutions over the most competitive baseline and (b) establish the realistic nature of diluted multimodal examples. We find that our cross-modal dilutions are perceived by humans as better than the baseline dilutions and more realistic.

Related Work
Robustness of Multimodal Models: Existing research studies the robustness of multimodal models by making imperceptible adversarial changes to the individual input modalities using unimodal perturbations (Li et al., 2020a; Chen et al., 2020). However, while adversarial perturbations to images are often deemed imperceptible to humans, adversarial perturbations in text often compromise the semantic meaning and its category to a notable extent (Wang et al., 2021). In the context of multimodal learning, the problem of introducing textual perturbations that lead to semantically poor changes has been tackled by developing careful automated approaches, for instance, by synthesizing counterfactual samples using language models (Chen et al., 2020), or by conducting human-in-the-loop curation of adversarial examples (Sheng et al., 2021; Li et al., 2021b). However, these studies only focus on VQA (Antol et al., 2015). Additionally, as Gilmer et al. (2018) argue, the imperceptibility criterion does not constrain the plausible action space in human-facing applications. For instance, it has been shown that the human-provided description of an image can vary notably with the personality, age, and location of the writer in terms of its length, emotion, and vocabulary, all the while preserving the cross-modal semantic interaction (Shuster et al., 2019; Chunseong Park et al., 2017; Denton et al., 2015). Consequently, in this work, we focus on the robustness of multimodal classifiers to plausible variations, specifically cross-modal dilutions.
Adversarial Perturbations: Our investigation concerns adding related text in a multimodal example to the existing textual information. Several methods have been proposed to introduce imperceptible and adversarial perturbations in text (Li et al., 2021a, 2020b; Garg and Ramakrishnan, 2020), focusing on word-level or phrase-level automated insertions, replacements, and merging. Moving beyond the imperceptibility constraint, to estimate robustness to perceptible but plausible changes in text, recent research has investigated the robustness of NLP models to rule-based distractions that are added to the original text (Naik et al., 2018; Ribeiro et al., 2020). As the constraints that govern textual dilutions in a multimodal setting are different, we propose a model to generate cross-modal (image → text) dilutions that maintain semantic and topical coherence with the existing image and text, while also demonstrating adversarial properties with respect to the multimodal classifiers. This provides us with a realistic estimate of the robustness of multimodal classifiers.
Figure 2: Overview of our proposed method. We propose XMD, the Cross-Modal Dilution Generator. Our approach extracts keywords from the image and text of a multimodal example and generates dilution text that causes incorrect classifications by the multimodal classifier when appended to the original text. The generation model is trained in a multi-stage multi-task setup, where the adversarial loss component (stage 2) encourages the generation of dilution words that cause incorrect categorization. The blue dashed lines depict the training pathway.

Cross-Modal Dilutions
Related work on language-only models (Naik et al., 2018; Ribeiro et al., 2020) inspires us to study the robustness of deep multimodal classifiers to dilution. In the context of multimodal learning, dilutions can be introduced by adding information from the associated image to the original text. Since multimodal fusion models are expected to consider the information in images and text jointly, they should, in principle, be robust to the expression of additional information regarding the image in the form of text. This is, however, challenging to study because a plausible dilution should have semantic similarity with both the image and the original text. While rule-based dilutions like "and true is true" (investigated by Naik et al. (2018)) are plausible for specific language-only tasks like Natural Language Inference (Bowman et al., 2015), they do not cover the action space of plausible cross-modal dilutions for multimodal content. Therefore, we develop an approach to generate dilutions that are semantically aligned with the original text and image.
Our proposed approach generates dilutions in three steps; see Figure 2.
(i) Extract keywords from image and text based on their prominence in their respective modalities.
(ii) Train a language model to fill words around the extracted keywords from the original text to generate dilutions (Zhang et al., 2020). The generation model is trained using a multi-stage multi-task approach. The first stage fine-tunes the model to generate in-domain text using textual keywords in a self-supervised manner. The second stage involves training the model on an objective that combines generation loss with adversarial loss.
(iii) The trained model is then used to generate text based on the keywords combined from both text and image modalities. The generated text is then appended to the original text as dilution.

Method for Generating Dilutions
Multimodal classifier: We design a fusion-based multimodal classification model ($M_{mm}$) following widely adopted architectures in both academic research and industrial applications (Agarwal et al., 2020; Dataminr, 2020). $M_{mm}$ takes the concatenation of modality-specific representations as input and makes a joint classification. To model individual modalities, we first train an image-only classifier $M_{image}$ and a text-only classifier $M_{text}$ for the same classification task. We then concatenate the outputs of the penultimate layers of the modality-specific models and feed them into a fully-connected network that is trained to fuse the modality-specific representations and perform joint classification based on the multimodal input.
Keyword extraction: Our dilution generation approach is centered around keywords in the original image and text, as this ensures semantic relatedness of the dilution text with both the associated modalities. We use Yet Another Keyword Extractor (YAKE) (Campos et al., 2018) to extract the most important keywords from the textual description for each example. For extracting keywords from the image, we consider the top 150 objects in the Visual Genome dataset (Krishna et al., 2017) and identify these objects in our dataset using a pre-trained image-to-scene-graph generator (Tang et al., 2020). We further filter the list of all identified objects by only considering objects with a bounding box that occupies at least 10% of the total image area to ensure prominence in the image. These objects are considered the keywords of the image. We denote the keywords from text and image as $K_{text}$ and $K_{image}$, respectively.
Constrained text generation: Once we have the keywords from text and image for each of the examples, the goal is to generate dilution around these keywords. For this, we extend the constrained text generation approach proposed by Zhang et al. (2020).
We fine-tune a BERT language model to progressively predict [MASK] tokens around the initial set of keywords until only a special token (i.e., the no-insertion token [NOI]) is predicted at all places to indicate no further insertions. We consider the original descriptions of the training examples in our target dataset and fine-tune the pre-trained model to reconstruct the original examples using keywords in the text, i.e., $K_{text}$. We adopt the same generation objective as Zhang et al. (2020) and denote it as $\mathcal{L}_{gen}$. The fine-tuned model can generate domain-specific text using the supplied keywords during inference.
Adversarial training: While the above fine-tuning enables constrained generation of target-domain text based on the supplied keywords, we need to ensure that the generated dilutions also cause incorrect classifications by the trained multimodal classifier $M_{mm}$. Explicitly designing the generation process to exhibit adversarial nature provides an estimate of the possible drop in performance in the presence of cross-modal dilutions. To this end, we take the POINTER model (Zhang et al., 2020) after domain-specific fine-tuning and fine-tune it further using a combined loss function. The combined loss function takes into account not only the original generation loss but also a weighted component of the adversarial loss $\mathcal{L}_{adv}$. More formally,
$$\mathcal{L} = \mathcal{L}_{gen} + \lambda \, \mathcal{L}_{adv},$$
where $\lambda$ controls the contribution of the adversarial loss towards the generation process. The incorporation of $\mathcal{L}_{adv}$ encourages the generation model to fill the [MASK] tokens with words that would cause incorrect classifications by the multimodal classifier $M_{mm}$. More formally, $\mathcal{L}_{adv}$ is computed for each training example as:
$$\mathcal{L}_{adv} = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right],$$
where $y = 1$ when the class predicted by $M_{mm}$ is different from the ground-truth class and $y = 0$ when the predicted and ground-truth labels are the same. The probability of incorrect classification, i.e., $\hat{y}$, is obtained by adding the class probabilities of incorrect classes (Le et al., 2020; He et al., 2021).
The training of the generation model is done in a multi-stage manner: in the first stage, the model is fine-tuned to generate related in-domain text from keywords using $\mathcal{L}_{gen}$, and in the second stage, it is trained in a multi-task fashion using a weighted combination of $\mathcal{L}_{gen}$ and $\mathcal{L}_{adv}$. This ensures that the model maintains the quality and coherence of the generated text while learning adversarial behavior.
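To make the second training stage concrete, the following is a minimal sketch of how the adversarial term and the weighted combination might be computed for a single example. It assumes that the adversarial loss takes a binary cross-entropy form over the label y and the probability ŷ defined above; the function names, the clipping constant `eps`, and the example probabilities are illustrative and not taken from the released code.

```python
import math


def prob_incorrect(class_probs, true_label):
    """y_hat: sum of the predicted probabilities of all classes except the true one."""
    return sum(p for label, p in enumerate(class_probs) if label != true_label)


def adversarial_loss(class_probs, true_label, eps=1e-12):
    """Assumed binary cross-entropy between y and y_hat for one example."""
    predicted = max(range(len(class_probs)), key=class_probs.__getitem__)
    y = 1.0 if predicted != true_label else 0.0  # 1 when the classifier is wrong
    y_hat = prob_incorrect(class_probs, true_label)
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clip to keep log() finite
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))


def combined_loss(gen_loss, adv_loss, lam):
    """Stage-2 objective: generation loss plus lambda-weighted adversarial loss."""
    return gen_loss + lam * adv_loss


# Hypothetical classifier output over three classes, ground truth = class 0.
probs = [0.7, 0.2, 0.1]
total = combined_loss(gen_loss=2.0, adv_loss=adversarial_loss(probs, 0), lam=0.1)
```

In an actual training loop the same quantities would be computed on differentiable tensors so that gradients can flow back into the generator; the scalar version here only illustrates the arithmetic.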
Inference-time dilution generation: We use the constrained text generation model described above to generate text based on combined keywords from both text and image (i.e., $K_{text} \oplus K_{image}$, where $\oplus$ denotes the concatenation of keywords). These generated textual dilutions are added to the original text to obtain examples with cross-modal dilutions.
Our evaluation aims to assess the impact of these cross-modal dilutions on the performance of the trained multimodal classifier $M_{mm}$ along with various attributes of the generated text.
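The keyword handling described in this section can be illustrated with a small sketch. It shows the 10% bounding-box filter for image objects, the concatenation of text and image keywords, and the final append step. The detection format (label plus pixel box), the function names, and the sample values are assumptions made for illustration; the scene graph generator and the text generator themselves are not reproduced here.

```python
def prominent_objects(detections, image_width, image_height, min_area_fraction=0.10):
    """Keep object labels whose bounding box covers at least 10% of the image.

    `detections` is assumed to be a list of (label, (x_min, y_min, x_max, y_max)).
    """
    image_area = image_width * image_height
    keywords = []
    for label, (x_min, y_min, x_max, y_max) in detections:
        box_area = max(0, x_max - x_min) * max(0, y_max - y_min)
        if box_area / image_area >= min_area_fraction and label not in keywords:
            keywords.append(label)
    return keywords


def combine_keywords(k_text, k_image):
    """Concatenate text keywords and image keywords, in that order."""
    return list(k_text) + list(k_image)


def dilute(original_text, generated_text):
    """Append the generated dilution to the original text."""
    return f"{original_text.rstrip()} {generated_text.strip()}"


# Hypothetical detections for a 200 x 200 image.
k_image = prominent_objects(
    [("building", (0, 0, 100, 100)), ("car", (0, 0, 10, 10))], 200, 200
)
keywords = combine_keywords(["california", "ash"], k_image)
```

Here only "building" survives the filter, since the "car" box covers well under 10% of the image.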

Multimodal Datasets
We conduct experiments on two user-generated datasets that have real-world societal applications.
Crisis Humanitarianism Dataset: During crises, affected parties often use social media to communicate with humanitarian organizations that process the available information to provide timely and effective interventions. [...]
We crawled the images from Reddit URLs provided by the authors and split the dataset in an 80:10:10 ratio to obtain the train (n = 2568), validation (n = 321), and test (n = 318) sets.

Experiments
We first discuss the training of our proposed cross-modal dilutions (XMD) generator model. Then, we discuss multiple baselines that dilute the original text using various rule- and model-based approaches. Finally, we evaluate XMD and compare its performance with the baselines. We refer the reader to Appendix A.1 and A.2 for details and evaluation of the modality-specific classifiers. We feed the concatenation of fine-tuned text and image representations to the multimodal classifier, which is essentially a series of fully-connected layers with ReLU activation (Agarap, 2018). The architecture of the multimodal classifier comprises an input layer (1024 neurons), 3 hidden layers (512, 128, 32 neurons), and an output layer (neurons = number of classes in the dataset). We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate initialized at $10^{-4}$ and adopt early stopping based on the validation set loss to avoid overfitting.
Cross-modal dilutions generator (XMD): For keyword extraction with YAKE, we set the maximum n-gram size to 1, the de-duplication threshold to 0.9 with the 'seqm' function, and the window size to 1. The rest of the hyper-parameters were set to their default values used in previous studies (Zhang et al., 2020; Tang et al., 2020). We fine-tune the POINTER model pre-trained on Wikipedia text (Zhang et al., 2020) [...]
[...] (Pan et al., 2020)) to generate the captions for the images in the test set. We append the generated captions to the original text for dilution.
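The fusion classifier described at the start of this section (1024 inputs, hidden layers of 512, 128, and 32 units with ReLU, one output per class) can be sketched as a plain forward pass. This is a standard-library illustration only: the split of the 1024 inputs into 768 text and 256 image features, the weight initialization, the softmax output, and the class count of 5 are all assumptions, and a real implementation would use a deep learning framework with trained weights.

```python
import math
import random


def init_mlp(layer_sizes, seed=0):
    """Create randomly initialized (weights, biases) pairs for consecutive layers."""
    rng = random.Random(seed)
    layers = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        scale = math.sqrt(2.0 / fan_in)  # illustrative scaling, not from the paper
        weights = [[rng.gauss(0.0, scale) for _ in range(fan_in)] for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def fuse(text_features, image_features):
    """Concatenate modality-specific representations into one input vector."""
    return list(text_features) + list(image_features)


def forward(layers, features):
    """Fully-connected layers with ReLU on hidden layers and softmax on the output."""
    activations = features
    for index, (weights, biases) in enumerate(layers):
        activations = [
            sum(w * a for w, a in zip(row, activations)) + b
            for row, b in zip(weights, biases)
        ]
        if index < len(layers) - 1:
            activations = [max(0.0, value) for value in activations]
    peak = max(activations)
    exps = [math.exp(value - peak) for value in activations]
    total = sum(exps)
    return [e / total for e in exps]


NUM_CLASSES = 5  # placeholder; set to the number of classes in the dataset
layers = init_mlp([1024, 512, 128, 32, NUM_CLASSES])
class_probs = forward(layers, fuse([0.1] * 768, [0.2] * 256))
```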

Evaluation metrics
Our evaluation is focused on assessing two aspects of the dilutions: (a) are the dilutions effective in deteriorating the classification performance of the multimodal classifier? and (b) are the added dilutions relevant to the original text + image, and do they maintain topical coherence with the existing text?
To this end, we compute standard classification metrics for the former evaluation and compute embedding-based similarity measures for the latter.
$Sim_{text}$ denotes the similarity between the original text and the generated dilution and is computed using the cosine similarity between the embeddings from the fine-tuned BERT classifier. Similarly, $Sim_{img}$ denotes the similarity between the generated dilution and the image and is computed using cosine similarity between CLIP embeddings (Radford et al., 2021). For topical coherence, we compute the KL Divergence (KL Div) between the topic distributions of the original text and the generated text. For details regarding the training of the topic model, please see Appendix A.6. Additionally, we quantify the correspondence between the image and the final text (i.e., original + dilution) using a learned metric, $Sim_{corr}$, obtained from a model trained on the correspondence between original text and images; see Appendix A.5 for further details. Furthermore, we compute Self-BLEU (Zhu et al., 2018) scores for the sentences in the generated dilution to quantify diversity, wherever applicable. For all model-based baselines and XMD, we report the average values over 5 runs with different random seeds.
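The two embedding-level measures above reduce to short formulas once embeddings and topic distributions are available. The sketch below shows cosine similarity (as used for the text and image similarity scores) and KL divergence between two topic distributions. The direction of the divergence (original relative to generated), the smoothing constant, and the toy vectors are assumptions; the embedding models and the topic model are not reproduced.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for degenerate vectors
    return dot / (norm_u * norm_v)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete topic distributions of equal length."""
    return sum(
        p_i * math.log((p_i + eps) / (q_i + eps))
        for p_i, q_i in zip(p, q)
        if p_i > 0.0
    )


# Toy topic distributions for an original text and a generated dilution.
original_topics = [0.5, 0.5]
dilution_topics = [0.9, 0.1]
divergence = kl_divergence(original_topics, dilution_topics)
```

A divergence of zero means the dilution has the same topic mix as the original text; larger values indicate a larger topical shift.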

Main Results
Our results (Tables 1 & 2) show the following:
• Rule-based dilutions do not demonstrate adversarial effectiveness, with the exception of using the most similar image's description as dilution, which, however, shows poor relevance and topical coherence.
• Model-based baselines show adversarial effectiveness but lack relevance and coherence.
• XMD demonstrates the best adversarial effectiveness while generating more relevant and topically coherent dilutions than all the baselines.
• Our results generalize over both the datasets under consideration; see Table 1 for Crisis Humanitarianism and Table 2 for Sentiment Detection.
We elaborate on these results next.
Effect of rule-based baseline dilutions: We start by noting that inserting a random URL, keywords from the image, keywords from the text, or both sets of keywords together does not considerably decrease the classification performance of multimodal classifiers. However, inserting the most similar image's description into the original text substantially lowers the classification performance, from an F1 score of 0.734 to 0.642 (12.3% drop) for the Crisis Humanitarianism dataset and from 0.793 to 0.665 (16.4% drop) for the Sentiment Detection dataset. This indicates that adding text corresponding to a similar image in the dataset is a reasonably effective dilution strategy. However, since the most similar image in the dataset could correspond to a different class, using its description as dilution frequently leads to less relevance and low topical coherence, as indicated by poor values of $Sim_{text}$, $Sim_{img}$, and KL Div.
Effect of model-based baseline dilutions: Model-based baseline dilution strategies are generally more effective than rule-based dilution strategies in lowering the classification performance of the multimodal models. The drop in F1 scores ranges from 9.6% (0.734 → 0.684) using GPT to 15.1% (0.734 → 0.628) using GPT-FT for the Crisis Humanitarianism dataset. Similar trends are observed on the Sentiment Detection dataset. Since GPT-FT is fine-tuned on in-domain text, the inserted text demonstrates a higher relevance with the original text when compared to GPT alone. Similarly, consistently across the two datasets, the correspondence similarity and the topical similarity scores for GPT-FT based dilutions are better than those of GPT. While the caption generation-based dilution strategies are also effective, they show lower relevance with existing text and a higher topical difference due to domain mismatch. The generated captions are generic and do not cater to the domains of crises and sentiment. Given the performance of all the model-based baselines across all the metrics, we consider GPT-FT to be the most competitive baseline. Overall, these results demonstrate that model-based baseline dilutions, whether text-only (GPT and GPT-FT) or cross-modal (using SCST and XLAN caption generation models), severely affect the performance of multimodal classifiers but fall short in terms of relevance and coherence.

Effect of proposed cross-modal dilutions: The cross-modal dilutions added using our approach lead to a drop in F1 scores from 0.734 to 0.564 (23.3%) and from 0.793 to 0.614 (22.5%) for the Crisis Humanitarianism and Sentiment Detection datasets, respectively. This is by far the most effective dilution strategy, and it also demonstrates high relevance with the original text and image, high correspondence similarity, and low topical difference. It is worth mentioning that our proposed method (XMD) also generates text with the highest diversity across generated sentences compared to all the baselines. This is demonstrated by the lowest Self-BLEU scores in Tables 1 and 2. However, since the values for all the methods are consistently small, all the dilutions can be considered sufficiently diverse.
To summarize, we observe that deep multimodal classifiers are not overly sensitive to minor content dilutions like the insertion of random URLs or keywords from the original content. However, adding dilutions based on text-alone (GPT, GPT-FT) or cross-modal (Captions, XMD) causes a notable drop in the classification performance of multimodal models. To this end, our proposed XMD generates the most effective dilutions in terms of the observed drop in classification performance while maintaining relevance with the original image and text and topical coherence.

Analysis of Cross-Modal Dilutions
Next, we further analyze the dilutions generated by our proposed method (XMD). We focus on the Crisis Humanitarianism dataset for our analyses. In addition to the analyses presented here, we investigate the effect of the length of dilutions (i.e., number of inserted words) on classification performance and observe no notable difference in observed trends with similar dilution lengths; see Appendix A.10. In Appendix A.11, we analyze the sensitivity of quantified metrics with respect to variations in λ. Finally, we conduct a human evaluation to assess how realistic the diluted multimodal examples are when compared against real multimodal examples.
Subjective Assessment of Dilutions: Figure 3 shows examples of the dilutions generated by XMD from the Crisis Humanitarianism dataset along with dilutions obtained from the baselines.
To further assess the quality of generated dilutions, we conducted a survey on Amazon Mechanical Turk (AMT). We instructed the annotators to compare two multimodal posts: one containing dilutions from XMD and the other containing dilutions from GPT-FT for the same multimodal example. The posts were randomly ordered to mitigate position bias. Annotators were asked to respond on a 5-point Likert scale (1: strongly disagree, 5: strongly agree) to the following question: Based on the quality of the text and its relevance with the image, is the post on the right more likely to be an actual social media post than the post on the left? We obtained 5 annotations for each of the 200 examples that were randomly sampled from the test set of the Crisis Humanitarianism dataset. Overall, the results showed that annotators consider the dilutions generated by XMD to be more realistic than those generated by GPT-FT.

Conclusion and Future Work
In sum, our work is the first investigation of the robustness of multimodal classifiers to cross-modal dilutions. We establish the plausibility of such dilutions via human evaluations and develop a model to emulate adversarial scenarios reliably. We find that multimodal classifiers that fuse the state-of-the-art modality-specific representations are not robust to cross-modal dilutions generated by XMD. Deep classifiers are increasingly being used for crucial applications that involve the joint understanding of user-generated multimodal data. Our broader goal in this work is to analyze and advocate for the robustness of multimodal models with societal applications, while focusing on the most representative fusion-based multimodal classification technique. In the future, we intend to leverage the knowledge of vulnerabilities identified in the current work to develop more robust multimodal models. We encourage interested researchers to investigate other cross-modal variations pertinent to multimodal data and assess the robustness of multimodal learning approaches to these variations.

Limitations and Broader Perspective
It is important to be clear about the limitations of this work. Our approach hinges on extracting informative keywords from both the image and the text to ensure the relevancy of the generated dilutions. In scenarios where the extracted keywords from images are generic (like celebrity faces for multimodal fake news detection) or the contextual relationship between image and text modalities is not straightforward (like multimodal hate speech), the proposed method does not generate semantically meaningful dilutions. We discuss the limitations in greater detail in Appendix A.12.
This work emphasizes the possibility that the lack of robustness of multimodal classification models can cause societal harm, such as delaying humanitarian interventions during crisis events. As such, the trained adversarial dilution generation models could be put to malicious use. We strongly condemn the misuse of this research. We release the code to aid reproducibility and promote future research on this topic. We believe that this research will encourage the community to investigate the robustness of multimodal classifiers and minimize real-world harm, leading to long-term benefits.
Bias of pre-trained models: It is known that the pre-trained models used in our study demonstrate many biases (Bender, 2019; Hendricks et al., 2018; Garimella et al., 2021). This is often reflected in the kind of keywords that are identified in images and the resulting generated text (e.g., stereotypical gender associations). We acknowledge that the current state of deep learning research is limiting, and the consequential shortcomings are reflected in our work to some extent.
Annotations, IRB approval, and datasets: The annotators for this study were recruited via AMT. We specifically recruited 'Master' annotators located in the United States and paid them at an hourly rate of 10 USD for their annotations. The human evaluation experiments were approved by the Institutional Review Board at Georgia Tech. The datasets used in this study are publicly available and were curated by previous research.

A.1 Text-only Classifier Training
Before training, we pre-process the text in multimodal examples to remove URLs, emoticons, platform-specific tokens (like 'RT' for indicating retweets on Twitter), and symbols like @ and #. We also expand negative contractions like "can't" and "won't" to "can not" and "will not". To train the text classifier ($M_{text}$), we fine-tune a pre-trained language model, DistilBERT (Sanh et al., 2019; Devlin et al., 2018), on the two datasets discussed in Section 4 by using the respective training sets. To train the text classification models for each dataset, we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate initialized at $10^{-4}$; hyper-parameters are set by observing the classification performance achieved on the respective validation set. We use early stopping (Caruana et al., 2000) to stop training when the loss value on the validation set has not improved for 5 consecutive epochs. The performance of the trained classifier on the test sets of the Crisis Humanitarianism and Sentiment Detection datasets is presented in Table 4.
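The text clean-up steps listed above can be sketched with a few regular expressions. This is an approximation under stated assumptions: the URL and emoticon patterns are rough heuristics, only the two contractions named in the text are expanded, and the symbols @ and # are dropped while the word that follows them is kept. None of the patterns or names below are taken from the released code.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
RETWEET_PATTERN = re.compile(r"\bRT\b")
# Rough pattern for common ASCII emoticons such as :) ;-( =D (an assumption).
EMOTICON_PATTERN = re.compile(r"[:;=][\-']?[\)\(DPp]")
CONTRACTIONS = {"can't": "can not", "won't": "will not"}


def preprocess(text):
    """Remove URLs, 'RT', emoticons, and @/# symbols; expand negative contractions."""
    text = URL_PATTERN.sub(" ", text)  # URLs first, so "://" is not read as an emoticon
    text = RETWEET_PATTERN.sub(" ", text)
    text = EMOTICON_PATTERN.sub(" ", text)
    for contraction, expansion in CONTRACTIONS.items():
        text = re.sub(re.escape(contraction), expansion, text, flags=re.IGNORECASE)
    text = text.replace("@", " ").replace("#", " ")
    return " ".join(text.split())  # collapse repeated whitespace


cleaned = preprocess("RT @user: Can't believe this :( #flood https://t.co/abc")
```

For the sample post, the result is "user: can not believe this flood".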

A.2 Image-only classifier
We apply a standard image pre-processing pipeline so that images with different dimensions can fit the pre-trained VGG-16 model's input requirement. First, we resize the image so that its shorter dimension is 224. We then crop the square region in the center and normalize the square image with the mean and standard deviation of the ImageNet images (Deng et al., 2009b).
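The geometry of this pipeline can be checked with a little arithmetic. The sketch below computes the resized dimensions, the center-crop box, and the per-channel normalization for one pixel. The mean and standard deviation values are the commonly used ImageNet statistics and are an assumption here, since the text does not list them; in practice an image library would perform these steps on whole images.

```python
# Commonly used ImageNet channel statistics (assumed, not stated in the text).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resized_dimensions(width, height, target_short_side=224):
    """Scale the image so that its shorter side equals the target size."""
    scale = target_short_side / min(width, height)
    new_width = max(target_short_side, round(width * scale))
    new_height = max(target_short_side, round(height * scale))
    return new_width, new_height


def center_crop_box(width, height, crop_size=224):
    """Return (left, top, right, bottom) of the central square region."""
    left = (width - crop_size) // 2
    top = (height - crop_size) // 2
    return left, top, left + crop_size, top + crop_size


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to normalized channel values."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# A 640 x 480 image becomes 299 x 224 after resizing, then a 224 x 224 crop.
resized = resized_dimensions(640, 480)
crop = center_crop_box(*resized)
```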
To train the image-only classifier ($M_{image}$), we fine-tune a task-specific image classifier for each dataset. We first freeze the weights of VGG-16 (Simonyan and Zisserman, 2015), pre-trained on ImageNet (Deng et al., 2009b), and then replace the last layer of the original model with three fully connected layers with dimensions 4096, 256, and num-of-classes. Finally, we retrain these three layers to adapt to the image distribution in each dataset. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of $10^{-4}$ for each dataset.
To avoid overfitting, we use early stopping to stop training when the loss value on the validation set has not improved for 10 consecutive epochs. Table 4 shows the performance of the image-only classifier.
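Both modality-specific classifiers rely on patience-based early stopping (5 epochs for text, 10 for images). The following is a minimal sketch of that rule, assuming "improvement" means a strictly lower validation loss than the best seen so far; the function name and the loss values are illustrative.

```python
def early_stopping_epoch(validation_losses, patience):
    """Return the epoch index at which training stops, or None if it never triggers.

    Training stops once the validation loss has failed to improve on the best
    value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value occurs at epoch 2.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.8]
stop_text = early_stopping_epoch(losses, patience=5)
stop_image = early_stopping_epoch(losses, patience=10)
```

With a patience of 5 the run stops at epoch 7; with a patience of 10 the criterion is not met within these eight epochs.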

A.3 Keyword Extraction from YAKE
For extracting keywords from the original text, we use YAKE (Campos et al., 2018). We set the following hyper-parameters: maximum n-gram size = 1; de-duplication threshold = 0.9; de-duplication algorithm = 'seqm'; window size = 1; maximum number of keywords extracted from text = 5.
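As a configuration sketch, the settings above map onto the keyword arguments of the `yake` Python package roughly as follows. The language code is an assumption (it is not stated in the text), and the helper assumes a recent `yake` release in which `extract_keywords` returns (keyword, score) pairs; the helper name itself is hypothetical.

```python
# Hyper-parameters from the text, expressed as yake.KeywordExtractor arguments.
YAKE_PARAMS = {
    "lan": "en",          # assumed language; not specified in the text
    "n": 1,               # maximum n-gram size
    "dedupLim": 0.9,      # de-duplication threshold
    "dedupFunc": "seqm",  # de-duplication algorithm
    "windowsSize": 1,     # window size
    "top": 5,             # maximum number of keywords per text
}


def extract_text_keywords(text, params=YAKE_PARAMS):
    """Return up to `top` keywords for a text (requires the third-party yake package)."""
    import yake  # imported lazily so the configuration can be inspected without it

    extractor = yake.KeywordExtractor(**params)
    # Assumes (keyword, score) tuples, where a lower score means a more relevant keyword.
    return [keyword for keyword, _score in extractor.extract_keywords(text)]
```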
We create this baseline to emulate the scenario where the user could have posted the multimodal example after diluting the original text by adding a highly similar image's description. We find the image in the test set that is most similar to the image of a given multimodal example and append its caption to the text in the given multimodal example. As mentioned in the main text, we use the cosine similarity between the VGG-16 embeddings obtained after task-specific fine-tuning to compute the similarity. Overall, the most similar images were found to be highly similar, with an average highest similarity score of 0.767 and a standard deviation of 0.067. Nonetheless, as discussed in Section 6, this naïve dilution strategy frequently leads to irrelevant and topically incoherent dilutions.
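The nearest-neighbour lookup behind this baseline can be sketched as follows, assuming the image embeddings are already available as plain vectors. The function names and the toy embeddings are illustrative; the fine-tuned VGG-16 feature extractor is not reproduced.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_image(query_index, embeddings):
    """Index and score of the most similar other image in the set."""
    best_index, best_score = None, float("-inf")
    for index, embedding in enumerate(embeddings):
        if index == query_index:
            continue  # never match an image with itself
        score = cosine_similarity(embeddings[query_index], embedding)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


def dilute_with_neighbour(query_index, texts, embeddings):
    """Append the description of the most similar image to the query's text."""
    neighbour, _score = most_similar_image(query_index, embeddings)
    return f"{texts[query_index]} {texts[neighbour]}"


# Toy test set with three examples.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
texts = ["bridge collapsed", "road washed away", "volunteers handing out food"]
diluted = dilute_with_neighbour(0, texts, embeddings)
```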

A.5 Evaluation: Correspondence similarity
We explain the rationale behind adopting the correspondence similarity score (i.e., $\text{Sim}_{corr}$) as one of our evaluation metrics. For context, the cross-modal correspondence prediction task is a binary classification task that aims to classify two input modalities as corresponding or not. For instance, if an image and text that are parts of the same multimodal example are provided as input, the correct prediction is Label 1, indicating true correspondence. Conversely, if the input text and image are from different multimodal examples, the correct prediction is Label 0, indicating false correspondence. The correspondence prediction task has been widely adopted as a pre-training step for multimodal deep learning models (Arandjelovic and Zisserman, 2017; Verma et al., 2019; Feng et al., 2014). In this work, we train correspondence prediction models using the fine-tuned image and text representations of the dataset-specific undiluted training set, and then report $\text{Sim}_{corr}$, the average probability score for Label 1 (i.e., true correspondence) on the diluted dataset-specific test set examples. Effectively, the score indicates how successfully a model trained to predict correspondence between image and text on the original, unmodified training examples establishes a correspondence between the diluted text and the images in the test set examples.
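As a rough illustration of how such a score could be computed from a two-class model's outputs, the sketch below averages the softmax probability of Label 1. The function names are hypothetical, and we assume the model emits a pair of logits ordered as [Label 0, Label 1].

```python
import math


def prob_label_1(logits):
    """Softmax probability of Label 1 for a pair of logits [z0, z1]."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return exps[1] / sum(exps)


def sim_corr(all_logits):
    """Average Label-1 probability over a list of per-example logit pairs."""
    return sum(prob_label_1(pair) for pair in all_logits) / len(all_logits)
```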
To train the cross-modal correspondence prediction model, we create negative examples by randomly sampling 3 mismatched descriptions from the training set for each image, in addition to its correct description. We then take the fine-tuned representations of the input image and text and pass them through a series of fully connected layers of sizes 1024 (input), 512, 256, 128, 64, 32, and 2 (output). As shown in Tables 1 and 2, the correspondence prediction model provides a nearly perfect $\text{Sim}_{corr}$ score (i.e., 0.999) on the undiluted test sets. However, the scores for the baselines and the proposed model differ based on the dilution strategy adopted.
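The construction of training pairs can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, the three mismatched descriptions are drawn without replacement, and the fixed seed is only there to make the sketch reproducible.

```python
import random


def build_correspondence_pairs(examples, num_negatives=3, seed=0):
    """Build (image_id, text, label) triples from a list of (image_id, text) pairs.

    Each image gets one positive pair (label 1) with its own text and
    `num_negatives` negative pairs (label 0) with texts of other examples.
    Requires len(examples) > num_negatives.
    """
    rng = random.Random(seed)
    pairs = []
    for i, (image_id, text) in enumerate(examples):
        pairs.append((image_id, text, 1))
        other_indices = [j for j in range(len(examples)) if j != i]
        for j in rng.sample(other_indices, num_negatives):
            pairs.append((image_id, examples[j][1], 0))
    return pairs
```

With this scheme the training set contains three negatives for every positive, so the classes are imbalanced at a 3:1 ratio.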

A.6 Topical Coherence
To measure the topical coherence between the generated dilution and the original text, we compute the KL divergence between the topic distributions of the two text segments, i.e., $D_{KL}(P_{dilution} \,\|\, Q_{original})$. We train an LDA topic model (Blei et al., 2003) using the text in a task-specific training set. The presented KL divergence scores are averaged over all the examples in the test set. We set the number of topics to be 20 (based on topic coherence score) for the results presented in this paper. Additionally, we do not witness a change in the observed trends with variations in the chosen number of topics (n ∈ {5, 10, 15, 20}) for LDA topic modeling. For implementing the Self-BLEU metric for quantifying diversity, we use NLTK's BLEU score function (Loper and Bird, 2002) and adopt the approach proposed in Zhu et al. (2018).
Table 7: Numbers of words inserted by the dilution methods and classification performance after controlling for the number of inserted words (all methods have ~20 words after modifications).
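For two discrete topic distributions, the divergence above is $\sum_k P(k) \log \frac{P(k)}{Q(k)}$. A minimal sketch follows; the function name is hypothetical, and the small additive smoothing term `eps` is our assumption to avoid division by zero when a topic has zero probability under one distribution.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions over the same set of topics."""
    # Smooth and renormalize so that every topic has non-zero probability.
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    p_sum, q_sum = sum(p_s), sum(q_s)
    return sum(
        (a / p_sum) * math.log((a / p_sum) / (b / q_sum))
        for a, b in zip(p_s, q_s)
    )
```

Note that the measure is asymmetric: swapping the dilution and original distributions generally gives a different value.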

A.7 Results on Sentiment Detection
The main text presents an abridged version of the results on the Sentiment Detection dataset. The complete results are presented in Table 5.

A.8 Human evaluation details
For both our annotation tasks, we recruited annotators using Amazon Mechanical Turk. We restricted participation to 'Master' annotators who had an approval rate of at least 90% and were located in the United States. The rewards were set by assuming an hourly rate of 10 USD for all the annotators. In addition, the annotators were informed that the aggregate statistics of their annotations would be used and shared as part of academic research.
The annotators were primed to identify real social media posts by showing them 5 original multimodal examples. Previous research has demonstrated the role of providing examples in obtaining high-quality annotations (Khashabi et al., 2021). For both our human evaluations, we also inserted some "attention-check" examples during the annotation tasks to ensure the annotators read the text carefully before responding. This was done by explicitly asking the annotators to mark a randomly chosen score on the Likert scale regardless of the actual content. We discarded the annotations from annotators who did not respond correctly to all the attention-check examples.

A.9 Ablations for Sentiment Detection
The ablation results on the Sentiment Detection dataset are presented in Table 6. The results follow the same trends as discussed in Section 8 for the Crisis Humanitarianism dataset.
Figure 4: Effect of varying λ. As λ is increased, the adversarial effectiveness of the generated dilutions increases (lower $F_1$) but at the expense of relevance with original text & image (lower $\text{Sim}_{text}$ and $\text{Sim}_{img}$).

A.10 Length of Dilutions
To examine whether the drop in performance is contingent on the number of words inserted for dilution, we first report the number of words inserted by each of these methods (see Table 7). Then, we control for the number of words inserted by employing either repetition or truncation so that each method inserts a comparable number of words (~20) for dilution. As shown in Table 7, even with a comparable number of inserted words, the trends observed in Section 6 persist. This reinforces that it is not merely the dilutions' length that precipitates the drop in classification performance but the sensitivity to the inserted content.
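One way to implement this length control is sketched below. The function name is hypothetical, and the exact repetition scheme is an assumption: here, short dilutions are repeated cyclically from their first word, and long ones are truncated.

```python
def control_length(dilution, target_words=20):
    """Repeat or truncate `dilution` so that it has exactly `target_words` words."""
    words = dilution.split()
    if not words:
        return ""  # nothing to repeat
    # Repeat the word sequence enough times to cover the target, then cut.
    repeated = words * (target_words // len(words) + 1)
    return " ".join(repeated[:target_words])
```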

A.11 Effect of variations in lambda
Our main results and subsequent analyses are based on λ = 0.01, where λ controls the contribution of the adversarial loss to the overall objective (see Equation 1). Figure 4 shows the variation in classification performance on the Crisis Humanitarianism dataset with respect to variations in λ. We find that as λ increases, the classification performance deteriorates further. However, increasing λ hurts the relevance of the generated dilution to the original text and image, as well as the topical coherence: the relevance and coherence scores drop quickly as the relative contribution of $L_{gen}$ is reduced.

A.12 Limitations
As indicated in Section 8, in some scenarios, the keywords extracted from the images could be generic and not meaningful for the specific task at hand. For instance, for multimodal fake news detection, the keywords extracted from pictures of celebrity faces are typically: man, woman, eye, smile, dress, etc. However, these keywords are unrelated to the larger (true/false) discourses centered around the celebrity. Similarly, for multimodal hate speech detection, the extracted keywords are often literal (such as hat, clown, monkey), while the original text aims to establish provocative parallels like calling a person a clown or associating certain groups with animals. Our current work is best applied to settings where the contextual relationship between the visual and textual modalities is straightforward, and the extracted keywords provide a good representation of the cumulative expression. As part of our future work, we intend to develop cross-modal dilution strategies that can work with a wider variety of user-generated multimodal data.