Twitter-COMMs: Detecting Climate, COVID, and Military Multimodal Misinformation

Detecting out-of-context media, such as “miscaptioned” images on Twitter, is a relevant problem, especially in domains of high public significance. In this work we aim to develop defenses against such misinformation for the topics of Climate Change, COVID-19, and Military Vehicles. We first present a large-scale multimodal dataset with over 884k tweets relevant to these topics. Next, we propose a detection method, based on the state-of-the-art CLIP model, that leverages automatically generated hard image-text mismatches. While this approach works well on our automatically constructed out-of-context tweets, we aim to validate its usefulness on data representative of the real world. Thus, we test it on a set of human-generated fakes, created by mimicking in-the-wild misinformation. We achieve an 11% detection improvement in a high precision regime over a strong baseline. Finally, we share insights about our best model design and analyze the challenges of this emerging threat.


Introduction
Out-of-context images are a popular form of misinformation where an image is miscaptioned to support a false claim (Fazio, 2020). Such repurposing can often be detected via text-image inconsistencies in named entities, e.g. mislabelling UK lords as US senators (O'Rourke, 2021), or in semantic context, e.g. implying that a pictured politician is intoxicated (O'Rourke, 2020). In this paper, we discuss our approach to the Image-Text Inconsistency Detection challenge of the DARPA Semantic Forensics (SemaFor) Program (https://www.darpa.mil/program/semantic-forensics). SemaFor focuses on the development of defenses against misinformation and falsified media. Specifically, the challenge tasks participants with analyzing Twitter posts that are (1) geo-diverse, i.e., published by users from a broad range of countries, and (2) topical, i.e., related to a narrow set of topics spanning COVID-19, Climate Change and Military Vehicles. Participants then classify these image-text posts as pristine (consistent) or falsified (inconsistent). We describe our data collection and modeling approach, in which we build a large-scale dataset of multimodal tweets, denoted Twitter-COMMs, and apply the recent CLIP model (Radford et al., 2021) to this task. Our proposed method achieves top performance on the program leaderboard. We discuss the results and multiple ablations for our best method. We will publicly release our data and models 2 .

Twitter-COMMs Dataset
In this section we describe our data collection strategies behind Twitter-COMMs, which consists of multimodal tweets covering the topics of COVID-19, Climate Change, and Military Vehicles.

Data Collection
We collected data using Twitter API v2 3 in three stages for COVID-19 and Climate Change, and two stages for Military Vehicles, refining the filters at each stage to acquire more relevant tweets as we learned from the data in the prior stage. We employed the following global filters for all topics, at each collection stage: (1) country!=USA, (2) language=English, (3) tweet must have at least one image, and (4) must not be a retweet. In addition, we used different filtering strategies for each of the three program topics, which we detail next.

COVID-19 and Climate Change
Our data collection consisted of three stages. The first stage employed simple topic, keyword, and hashtag filters, the second stage used more specific keyword and topic combinations, while the third focused on collecting topical data from Twitter accounts of various news organizations.

[Figure 1: … COVID-19 (bottom). The samples are selected randomly from the data, and the Twitter handles and URLs have been removed from the texts. A fair number of the images in the corpus are screenshots and graphics with text rather than photographs (as in the middle example of the top row).]
In the first stage we collected roughly 100,000 tweets each for the COVID-19 and Climate Change topics. We used the "COVID-19" topic of the Twitter API's Entity Annotations feature (https://developer.twitter.com/en/docs/labs/annotations), which allows users to find tweets related to different topics. For Climate Change we filtered with an OR clause on the keywords "climate change" and "global warming" and the hashtags #globalwarming and #climatechange. Inspection of the stage 1 results revealed many off-topic tweets. For example, a Twitter user might post a tweet about working from home during the pandemic and tag the tweet with a COVID-related hashtag. While this type of content is somewhat related to COVID-19, we wanted to focus on data where misinformation/disinformation might be more relevant, such as more topical/newsworthy tweets (e.g. bad actors may spread propaganda related to the COVID-19 pandemic by making false or misleading claims). To that end, in stage 2 we filtered by combining each topic phrase with one of 19 topical search terms 5 (e.g. "agriculture", "crops", "death", "vaccination") and turned these combinations into an OR clause, e.g., "...('COVID' AND ('agriculture' OR 'crops' OR 'death'...))"; a query-construction sketch is given below. The resulting data appeared much more relevant than the initial collection effort. Finally, following the same reasoning, in the third data collection stage we focused on tweets authored by news organizations, as opposed to random users. For that, 7k news organization Twitter handles were sourced from WikiData 6 .
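For concreteness, the snippet below sketches how such stage-2 query strings could be assembled; the topic phrases and topical terms are illustrative placeholders (not our full lists), and the actual Twitter API request plus the global filters (language, retweet, image, country) are omitted.

```python
# Illustrative sketch of stage-2 query construction; term lists are placeholders.
TOPIC_PHRASES = ["COVID", "climate change", "global warming"]
TOPICAL_TERMS = ["agriculture", "crops", "death", "vaccination"]

def build_query(topic_phrase: str, terms: list[str]) -> str:
    """Combine a topic phrase with topical terms, e.g.
    ('COVID' AND ('agriculture' OR 'crops' OR ...))."""
    or_clause = " OR ".join(f"'{term}'" for term in terms)
    return f"('{topic_phrase}' AND ({or_clause}))"

queries = [build_query(phrase, TOPICAL_TERMS) for phrase in TOPIC_PHRASES]
print(queries[0])
# ('COVID' AND ('agriculture' OR 'crops' OR 'death' OR 'vaccination'))
```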

Military Vehicles
Collecting data about the Military Vehicles topic proved more challenging than the other two topics. We initially tried simple keyword filters such as "military", "aircraft", "tank", etc., but found that those resulted in a lot of irrelevant content, such as tweets related to video games, or tweets where "tank" took a different meaning (e.g., "fish tank" or "tank tops"). This initial approach did not return many relevant results. The WikiData news organization approach used for the other two topics also did not provide enough usable data. As a result, we crafted two different, highly customized stages for Military Vehicles. We gathered a list of both civilian and military vehicles and aircraft from eight different publicly available datasets 7 . The datasets were annotated either for image classification or for object detection tasks. We queried the Twitter Search API using the vehicle and aircraft names from this set. We then trained an EfficientNet (Tan and Le, 2019) image classifier that categorized images as civilian ground vehicle, civilian aircraft, military ground vehicle, military aircraft, or other. (The "other" category training set consisted of several thousand manually annotated images from the initial data collection effort that did not contain any military or civilian vehicles or aircraft.) We trained the classifier to 97% accuracy and used it to filter out any tweets predicted to be in the "other" category. For the second collection stage we combined the military vehicle and aircraft names with custom keywords (see Appendix A.4 and Table 18).

In total, we collected 884,331 tweets, each having at least one image, see Table 1. Example samples for each topic are given in Figure 1.

Falsified Samples
In order to train an out-of-context image detector, we require falsified samples in addition to the pristine ones. However, we found that it is rather challenging to collect such "miscaptioned" images at scale. Thus, in addition to the pristine samples described above, we automatically generate falsified samples where there is some inconsistency between image and text. We create random negatives (denoted as "Random") by retrieving an image for a given caption at random. We also create hard negatives (denoted as "Hard") following the method of Luo et al. (2021). Specifically, we use the matching method from their "Semantics / CLIP Text-Text" split, where, given a query caption, we retrieve the image of the sample with the greatest textual similarity (see the sketch below). We mainly generate mismatches within each topic (COVID-19, Climate Change, Military Vehicles), except for a small set of random mismatches across topics (denoted as "Cross Topic"). Our dataset is balanced with respect to labels, where half of the samples are pristine and half are falsified, i.e., each falsified sample has an associated pristine sample. Table 2 presents summary statistics for the falsified training samples. We detail our development set and other data used for evaluation in the next section.
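A minimal sketch of this text-to-text hard negative mining is shown below. It assumes the open-source openai/CLIP package (the backbone choice here is illustrative), an in-memory list of caption/image pairs, and omits batching and any filtering of near-duplicate captions.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Placeholder pristine samples for one topic.
captions = ["Flooding hits coastal towns after record rainfall",
            "Wildfires continue to spread across the region",
            "New vaccine doses arrive at local clinics"]
image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]

with torch.no_grad():
    tokens = clip.tokenize(captions, truncate=True).to(device)
    text_emb = model.encode_text(tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Cosine similarity between every pair of captions; mask the diagonal so a
# caption is never paired with its own image.
sim = text_emb @ text_emb.T
sim.fill_diagonal_(float("-inf"))
nearest = sim.argmax(dim=-1)

# Hard negative i pairs caption i with the image of its most similar caption.
hard_negatives = [(captions[i], image_paths[j]) for i, j in enumerate(nearest.tolist())]
```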

Experiments
In this section we discuss the data used for evaluation, present our approach and provide an ablation study for our various design choices, and finally, report the results on the Image-Text Inconsistency Detection challenge of the DARPA Semantic Forensics (SemaFor) Program.

Evaluation Sets
In the following we report results on several evaluation sets. (a) We validate our approach on samples synthetically generated using the same procedure as our training set (denoted Dev). See a synthetic hard negative example in Figure 2. (b) We also evaluate on a set of hand-curated samples derived from news images and captions that were part of the SemaFor Evaluation 1 Dataset (denoted Eval 1). At a high level, the Eval 1 data was obtained by collecting real-world pristine news image-caption pairs and introducing various inconsistencies into them, e.g., manipulating images or captions. An example of a possible manipulation is given in Figure 2. (c) Finally, we evaluate on a hidden set of hand-curated samples derived from Twitter, as part of the SemaFor Evaluation 2 Dataset (denoted Eval 2). Here, the image-text pairs were originally collected from Twitter, and the text was then manipulated to introduce an inconsistency, see the example in Figure 2. We emphasize that while the Eval 1/2 data is not per se "real" misinformation, it is nevertheless "in-the-wild" with respect to our synthetic training data and much more representative of real-world misinformation. We note that collecting "real" out-of-context misinformation at scale is highly challenging. All three evaluation sets contain a mixture of samples relevant to the topics of COVID-19, Climate Change and Military Vehicles. Table 3 provides the number of samples in each set.

Approach and Design Choices
For our approach we fine-tune CLIP (Radford et al., 2021), a large pretrained multimodal model that maps images and text into a joint embedding space via contrastive learning. Given our empirical results, we make the following design choices (a training-setup sketch follows the list):
• We use the RN50x16 backbone. We find that this backbone consistently yields a 2-3% improvement compared to other released backbones, such as ViT-B/32.
• We tune the upper layers and keep CLIP's lower layers frozen 8 . We find that this scheme is more memory efficient and yields more stable convergence than tuning all the layers.
• We use a learning rate of 5e-08 for CLIP and 5e-05 for the classifier. From our hyperparameter sweeps we find this setting to be the most appropriate, as CLIP is pretrained while the classifier is randomly initialized.
• We multiply the CLIP image and text embeddings before passing the result as input to the classifier. This differs from Luo et al. (2021), who used a simple feature concatenation. Our proposed fusion technique is more effective, as demonstrated by the ablation study below.
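The sketch below shows how these choices could be wired together in PyTorch with the open-source openai/CLIP package; which sub-modules count as "upper layers" and the classifier shape are assumptions for illustration, not our exact configuration.

```python
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("RN50x16", device=device)

# Freeze everything, then unfreeze only the upper layers (illustrative choice).
for p in clip_model.parameters():
    p.requires_grad = False
for p in clip_model.visual.attnpool.parameters():
    p.requires_grad = True

# Binary classifier (pristine vs. falsified) on top of the fused embedding.
embed_dim = clip_model.text_projection.shape[1]
classifier = nn.Linear(embed_dim, 1).to(device)

# Separate learning rates: tiny for the pretrained CLIP layers,
# larger for the randomly initialized classifier.
optimizer = torch.optim.AdamW([
    {"params": [p for p in clip_model.parameters() if p.requires_grad], "lr": 5e-8},
    {"params": classifier.parameters(), "lr": 5e-5},
])

def score(images, token_ids):
    img = clip_model.encode_image(images)
    txt = clip_model.encode_text(token_ids)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    fused = (img * txt).float()  # element-wise "Multiply" fusion
    return classifier(fused)
```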

Experimental Results
In the following, we go over the results from our approach and several ablations: different fine-tuning schemes, multimodal fusion methods, percentage of hard negative samples, expert vs. joint training, and training set size. For most ablations we optimize on a 500k subset of the training data (unless otherwise noted) for faster development. We report the following metrics. Most tables report binary classification accuracy at a threshold of 0.5. This is complemented with ROC curves, which provide a more complete view of performance across multiple thresholds. Since some of our evaluation sets (Eval 1, 2) have an unequal number of pristine and falsified samples, for these we report the balanced binary classification accuracy, i.e., the average of the true positive and true negative rates.
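As a concrete reference for the metric, the snippet below computes balanced accuracy from per-class rates on made-up predictions; it is purely illustrative and not tied to our evaluation code.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Average of the true positive rate (recall on falsified, label 1)
    and the true negative rate (recall on pristine, label 0)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    return 0.5 * (tpr + tnr)

# Toy imbalanced example: 3 falsified vs. 7 pristine samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.5 * (2/3 + 6/7) ≈ 0.762
```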

Multimodal Fusion
First, we compare different multimodal fusion techniques, see Table 4. We try three fusion methods: concatenating the CLIP image and text embeddings (Concat), concatenating the embeddings and their dot product (Concat + Dot), and multiplying the embeddings element-wise (Multiply). Inspired by how CLIP was trained to maximize the dot product of normalized image-text pairs, Concat + Dot and Multiply 9 incentivize the classifier to stay faithful to the pre-initialized joint embedding space. These architecture choices yield on average a 7% performance improvement over simple concatenation.
For future experiments we choose to use the Multiply method to minimize trainable parameters and maintain a simple approach.
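For reference, the three fusion heads can be summarized as in the PyTorch sketch below; it operates on already-computed, L2-normalized CLIP embeddings, and the single linear classifier is a simplification of whatever head one might actually use.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fuse CLIP image/text embeddings and predict pristine vs. falsified."""

    def __init__(self, dim: int, mode: str = "multiply"):
        super().__init__()
        self.mode = mode
        in_dim = {"concat": 2 * dim, "concat_dot": 2 * dim + 1, "multiply": dim}[mode]
        self.classifier = nn.Linear(in_dim, 1)

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        if self.mode == "concat":
            fused = torch.cat([img, txt], dim=-1)
        elif self.mode == "concat_dot":
            dot = (img * txt).sum(dim=-1, keepdim=True)
            fused = torch.cat([img, txt, dot], dim=-1)
        else:  # "multiply": element-wise product stays in the joint embedding space
            fused = img * txt
        return self.classifier(fused)
```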

Expert vs. Joint Training
Here we study whether training a joint model on all three topics at once is inferior to training three topic-specific experts, see Figure 3. We evaluate models fine-tuned on 1M samples with 75% hard negatives. We find that the joint model performs on par with or better than the expert models, thus we use a joint model in all the other experiments.

Fine-Tuning Scheme
Since we only know the high-level topics but not the precise composition of samples in our hidden set Eval 2 (e.g. synthetic vs. natural images, text styles, author types), we investigate methods for out-of-domain robustness. Specifically, we try the scheme of Anonymous (2022), where the authors first optimize the classifier while keeping the pretrained feature extractor frozen (linear probing), and then optimize the entire network (fine-tuning); a sketch of this schedule is given below. The intuition behind this method is that a good initialization from linear probing minimizes the chance of feature distortion, i.e. the pretrained model overfitting to in-domain data. We report the results in Table 5. In fact, we find that direct fine-tuning (FT) achieves slightly better performance on both in-domain Twitter data and out-of-domain news data. Thus, for future experiments we use direct fine-tuning.
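A minimal sketch of the linear-probe-then-fine-tune schedule we compared against is shown below; the model is assumed to expose `backbone` and `classifier` sub-modules, the learning rates and epoch counts are placeholders, and the inner training loop is elided.

```python
import torch

def train(model, loader, optimizer, epochs):
    """Placeholder for a standard supervised training loop (omitted)."""
    ...

def lp_then_ft(model, loader, lp_epochs=3, ft_epochs=3):
    # Phase 1: linear probing - freeze the pretrained backbone, train only the head.
    for p in model.backbone.parameters():
        p.requires_grad = False
    head_opt = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
    train(model, loader, head_opt, lp_epochs)

    # Phase 2: fine-tuning - unfreeze the backbone and continue from the probed head.
    for p in model.backbone.parameters():
        p.requires_grad = True
    full_opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    train(model, loader, full_opt, ft_epochs)
```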

Percentage of Hard Negatives
Next, we analyze the importance of using hard negatives in our training data. Specifically, we measure the impact of different percentages of hard negative samples, where the rest are random negatives. Table 6 presents the results. We see that more hard negatives in training naturally improves the performance on hard negatives in our development set, but there is also a trade-off in performance on random negatives. Given that we care about samples that more closely mimic challenging real-world misinformation but also want to avoid degrading performance on easy samples, we opt for a ratio of 75% hard and 25% random negatives for future experiments.

Training Set Size
We also investigate the influence of training set size on performance. We report the binary classification accuracy as we use 500k, 1M, and 2M samples as seen in Table 7. We observe that increasing training data size generally leads to improved performance, with most of the gains coming from higher accuracy on hard negatives.

Results on Eval 2 Set
Our final submitted model was directly fine-tuned on the entire training set of 2M samples, with 75% hard negatives. We report results in Table 8 and Figure 5. We improve by 11% in probability of detection at a 0.1 false alarm rate, meaning that our method is able to detect more falsified samples with minimal false alarms. At the equal error rate we improve by 5% in both probability of detection and accuracy, meaning that our method is more accurate when trying to maximize both pristine and falsified per-class performance. Our model achieved the best performance on the program leaderboard among ten competing approaches.
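For reference, the probability of detection at a fixed false alarm rate can be read directly off the ROC curve; the snippet below shows one way to compute it with scikit-learn, using made-up scores rather than our actual model outputs.

```python
import numpy as np
from sklearn.metrics import roc_curve

def pd_at_far(y_true, scores, far=0.1):
    """Largest true positive rate achievable with false positive rate <= far."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return tpr[fpr <= far].max()

# Toy example: 1 = falsified, 0 = pristine; scores are model confidences.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.1, 0.05])
print(pd_at_far(y_true, scores))  # 0.5: two of the four fakes caught at <=10% false alarms
```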

Conclusion
In this work we face a specific real-world challenge: flagging image-text pairs as misinformation with no corresponding training data. To approach this challenge, we first collect Twitter-COMMs, a large-scale topical dataset of multimodal tweets, and construct random and hard negatives on top of it. We then design our approach based on the recent CLIP model, making several important design choices, such as multiplying the image and text embeddings for multimodal fusion and increasing the percentage of hard negatives in our training data. We show that with this approach we can substantially improve over a powerful baseline, an off-the-shelf CLIP model, and achieve the top result on a challenge with in-the-wild (unseen) text-image inconsistencies.

Acknowledgements
We would like to thank PAR Tech, Syracuse University, and the University of Colorado, Denver for creating the evaluation data. We thank the SRI team, including John Cadigan and Martin Graciarena, for providing the WikiData-sourced news organization Twitter handles. We would also like to thank Dong Huk (Seth) Park, Sanjay Subramanian, and Reuben Tan for helpful discussions on finetuning CLIP. This work was supported in part by DoD including DARPA's LwLL, and/or SemaFor programs, and Berkeley Artificial Intelligence Research (BAIR) industrial alliance programs.

References
Anonymous. 2022. Fine-tuning distorts pretrained features and underperforms out-of-distribution. Submitted to The Tenth International Conference on Learning Representations. Under review.

A Appendix
In the following we provide additional details about the Twitter-COMMs dataset.
A.1 Dataset Summary

Table 9 shows a summary of the dataset. The "Geo-tagged" column refers to the geolocation data provided by tweet authors. This property is empty in most cases, but when present, can be in the form of a Twitter "place", which contains a display name, a geo polygon (in some cases as broad as an entire country), as well as other fields, such as country name. It is also possible for the geo data to be in the form of latitude and longitude, but that is rarer. The "Countries" column is extracted from the geolocation data, and because of the small number of geo-tagged tweets we can only report countries for a small fraction of tweets in the dataset (e.g., Table 11). To match the SemaFor program's goal of using non-US data, we filtered for country != "US" in almost every data query, with the only exception being some of the Military Vehicles queries. Due to the difficulty of finding enough data for that topic, we removed the country filter in order to relax the query constraints. As seen in Table 11 there are 707 posts from the US. All but two of these US-based tweets are from the Military Vehicles topic (Table 13). It is likely that Twitter is only able to apply the country filter to tweets that are geo-tagged, and allows any non-geo-tagged data to pass this filter. We are not sure of the country breakdown for the majority of the posts in the dataset, but we can say that the tweets that do have country information are mostly non-US. One oddity to note is that although we included an English-only search filter ("lang:en") in all API calls, the API still returned a small number of non-English tweets (Table 12). We are not sure why this is, but manual inspection of some of these examples shows that a good portion of them are in fact in English. Table 10 shows the number of tweets that Twitter flagged as containing possibly sensitive material. Figure 6 shows high-level "word cloud" summaries of the words in the tweets for each topic.

A.2 Media Stats
The total number of images/tweets is shown in Table 14. Twitter allows users to add 1-4 images to a tweet, and/or a single video. As seen in Table 15, 90% of the tweets have a single image. In cases where a tweet contained more than one image, we only used the first image (according to the order of the images returned by the Twitter API).

[Figure 6: Word Cloud Summaries for Each Topic]

A.4 Military Vehicles Collection
We merged the eight datasets from Table 17 into a single dataset and then used the vehicle and aircraft names from this dataset to programmatically search for tweets related to the Military Vehicles topic. This was done in two stages. The first stage used the vehicle/aircraft names alone, but returned a lot of off-topic data. To address this, we used the merged military/civilian dataset to train a five-way image classifier that predicts the categories civilian vehicle, civilian aircraft, military vehicle, military aircraft, and other. The "other" category consisted of a large number of the off-topic images from our initial effort (e.g., non-vehicle, non-aircraft images). We ran the classifier on the stage 1 images and removed any images predicted as "other"; a filtering sketch is given below. In stage 2 we searched using the vehicle and aircraft names in combination with the keywords from Table 18, and removed the "country!=US" filter that we had used in all our other searches. After some further cleaning and de-duplication we ended up with about 102k Military Vehicles tweets (see Table 9).
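For illustration, the "other"-category filtering step could look roughly like the sketch below. It assumes a torchvision EfficientNet-B0 (the specific EfficientNet variant we used is not implied), a hypothetical checkpoint path, and a placeholder class ordering.

```python
import torch
from PIL import Image
from torchvision import models, transforms

CLASSES = ["civilian_vehicle", "civilian_aircraft",
           "military_vehicle", "military_aircraft", "other"]

# Five-way classifier; weights would come from fine-tuning on the merged
# military/civilian vehicle datasets ("vehicle_classifier.pt" is a placeholder).
model = models.efficientnet_b0(num_classes=len(CLASSES))
model.load_state_dict(torch.load("vehicle_classifier.pt", map_location="cpu"))
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def keep_tweet(image_path: str) -> bool:
    """Keep a tweet only if its image is not predicted as 'other'."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        pred = model(img).argmax(dim=-1).item()
    return CLASSES[pred] != "other"
```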