Late Fusion with Triplet Margin Objective for Multimodal Ideology Prediction and Analysis

Prior work on ideology prediction has largely focused on single modalities, i.e., text or images. In this work, we introduce the task of multimodal ideology prediction, where a model predicts binary or five-point scale ideological leanings, given a text-image pair with political content. We first collect five new large-scale datasets with English documents and images along with their ideological leanings, covering news articles from a wide range of mainstream media in the US and social media posts from Reddit and Twitter. We conduct in-depth analyses of news articles and reveal differences in image content and usage across the political spectrum. Furthermore, we perform extensive experiments and ablation studies, demonstrating the effectiveness of targeted pretraining objectives on different model components. Our best-performing model, a late-fusion architecture pretrained with a triplet objective over multimodal content, outperforms the state-of-the-art text-only model by almost 4% and a strong multimodal baseline with no pretraining by over 3%.


Introduction
In an increasingly divided world rife with misinformation and hyperpartisanship, it is important to understand the perspectives and biases of the creators of the media that we consume. Media bias can manifest in many ways and has been analyzed from a variety of angles: the news may favor one side of a political issue (Card et al., 2015; Mendelsohn et al., 2021), select certain events to report on (McCarthy et al., 1996; Oliver and Maney, 2000; Fan et al., 2019), or even distort or misrepresent facts (Gentzkow and Shapiro, 2006; Entman, 2007).
Identifying a news article's underlying political slant is the task of ideology prediction, which has focused largely on political texts like news articles and has been tackled with a variety of models, including Bayesian approaches with attention (Kulkarni et al., 2018), graph neural networks (Li and Goldwasser, 2019), and LSTMs and BERT (Baly et al., 2020). However, past work focuses solely on news articles' text; news articles contain other forms of non-verbal information in which the underlying ideology may be realized. Consider Figure 1a and Figure 1b, two images from articles depicting the same news story, but from news sources with opposing ideologies (New York Times and Fox News, respectively). The underlying ideology of the news source may influence the choice of image: in Figure 1a, Michael Flynn is depicted with a happy expression and surrounded by other figures, while in Figure 1b, Flynn bears a stern expression and is the sole figure. Images are an integral part of news articles: over 56% of articles in AllSides include at least one image. Images are often used to frame certain issues or influence the reader's opinion. For example, liberal websites devote more visual coverage to Donald Trump and also portray Trump with more negative emotions compared with conservative websites (Boxell, 2021). In addition, images of groups of immigrants, in contrast to individual immigrants, tend to decrease survey respondents' support for immigration (Madrigal and Soroka, 2021). These findings naturally lead us to conduct a study of political images. In Section 3.3, we present a thorough analysis of images, finding, inter alia, that (1) liberal sources tend to include more figures in an image, (2) conservative sources make greater use of collage images, and (3) faces are more likely to show negative or neutral emotion rather than positive.
Although modern American politics has centered around two polar opposites (Klein, 2020), 38% of US adults identify as politically independent and do not agree wholly with left or right ideologies (https://www.pewresearch.org/fact-tank/2019/05/15/facts-about-us-political-independents/). Ideology exists on a spectrum (Bobbio, 1996), and we wish to predict more fine-grained ideology than merely left or right. Thus, we define multimodal ideology prediction in this work as predicting one of five ideological slants (left, lean left, center, lean right, right) given both an article's text and cover image. To support this new task, we present several new large-coverage datasets of news articles and images across the ideological spectrum from various sources including AllSides, Reddit, Twitter, and 11 independent news sources.
We experiment with several early and late fusion architectures and evaluate several continued pretraining objectives to improve the image and text encoders separately as well as jointly. Our technical contributions include a novel triplet margin loss over multimodal inputs and the first systematic study of multimodal models for ideology prediction, which reveals several findings: (1) images are indeed helpful for ideology prediction, improving over a text-only model especially on right-leaning inputs; (2) late-fusion architectures perform better than early-fusion architectures; (3) ideology-driven pretraining on both the text and image encoders is beneficial; (4) finetuning with a joint triplet margin loss encourages the model to learn informative representations of ideology. Code and datasets can be found at github.com/launchnlp/mulmodide.

Related Work
Media Bias/Ideology on Texts The study of media bias and ideology has a long history going as far back as White (1950). Computationally, researchers have studied various approaches in classical machine learning as well as neural methods (e.g., Evans et al., 2007; Yu et al., 2008; Sapiro-Gheiler, 2019; Iyyer et al., 2014). However, these works focus solely on text. There exist several resources of news articles across the political spectrum, compiled for the purpose of educating users on media bias (Park et al., 2009, 2011; Hamborg et al., 2017). Multimodal studies such as ours need annotated data for training and testing. Thus, we collect several datasets containing both political text and images from various sources.

Media Bias/Ideology on Texts and Images
Only very recently has there been much study of media bias with respect to both text and images. Existing work on characterizing political images has been limited to narrow domains such as Russian trolls on Twitter (Zannettou et al., 2020), political memes (Beskow et al., 2020), and COVID content on TikTok (Southwick et al., 2021). In addition, data containing both text and images annotated for political ideology are not readily available. Thus, we collect, annotate, and analyze a variety of new datasets, focusing on political figures in news images. For the task of multimodal ideology prediction, the most similar work to ours is Thomas and Kovashka (2019), who investigate adding text to help an image encoder learn an enhanced representation of images. Afterwards, they ignore the text and focus on ideology prediction from images alone. They consider only left or right ideologies, in contrast to our more fine-grained 5-way set.

Data
In this section, we describe several datasets collected in this work for pretraining and finetuning the proposed models.

Pretraining Datasets
We build two pretraining datasets based on BIGNEWSBLN (Liu et al., 2022a), a corpus of over 1.9M English news articles collected from 11 news sources balanced across the political spectrum.

BN-IMGCAP
We first collect a new dataset, BN-IMGCAP, of 1.2M images that occur anywhere in a news article (in contrast to just the cover image), along with their captions, from seven news sources represented in BIGNEWSBLN, chosen to roughly cover equal proportions of left-, center-, and right-leaning ideologies. Details of this collection process are described in Appendix A. We use these image-caption pairs in our experiments for pretraining the image encoder with the InfoNCE loss and bidirectional captioning loss (Section 4.3). Liu et al. (2022a) also introduced a subset of BIGNEWSBLN called BIGNEWSALIGN containing articles associated with a story cluster, i.e., news articles from different news sources but written about the same story, for pretraining with an ideological triplet loss. From this subset, we identify articles containing images, and we crawl the images from each article's corresponding webpage. We call this dataset of article text and images BNA-IMG and use it for pretraining the cross-modal attention with our proposed triplet margin loss (Section 4.3). Table 1 summarizes these datasets.

Evaluation Datasets
AllSides We extract a dataset of news articles and images from AllSides, which associates stories (e.g., of a particular event) with multiple articles about that story written by various news sources across a 5-point ideology scale (left, lean-left, center, lean-right, and right). We crawl the AllSides website to obtain (story, article, source) tuples from 2012/06/01 to 2021/08/31, focusing on articles from the 25 news sources with the most articles in AllSides and spanning the complete range of ideology (see Appendix B for the complete list). For each news article, we extract the article text and cover image from the corresponding news source's website, totaling 5,662 stories containing 12,471 articles.
Reddit We also collect a dataset of 357k Reddit posts with images from the past 10 years from five subreddits representing both left-leaning (r/Liberal, r/democrats, r/progressive) and right-leaning (r/Conservative and r/Republican) political stances, chosen for being among the largest and most active partisan subreddits. For each post, we keep the post title and the image itself, excluding posts that were removed (~2,300 posts). In order to avoid data leakage, we filter out all posts linking to images from the 11 news sources represented in BIGNEWS, resulting in a set of 313k posts. In addition, because the number of posts from right-leaning subreddits overwhelms the number of left-leaning posts, we subsample the right-leaning posts, resulting in a balanced dataset of 65k posts with images, half from each political leaning. The Reddit dataset is summarized in Figure 2. In contrast to news articles, Reddit imposes a 300-character limit on the post title. Thus, this dataset and the Twitter data described below provide a good opportunity to examine how our models perform on short texts, compared to the longer-form news articles.
Twitter We additionally collect a dataset of 2.1M political tweets from Twitter from the past 10 years using the Twitter Decahose stream, selecting tweets by political figures included in a list of 9,981 US politicians and their Twitter handles (Panda et al., 2020). In contrast to AllSides, Twitter does not explicitly annotate discrete ideologies. Thus, we label tweets with their author's ideology, identified based on their DW-NOMINATE dimension (Boche et al., 2018), a measure of a politician's voting history: a positive number indicates conservative leaning (e.g., Donald Trump, 0.403), while a negative number indicates liberal leaning (e.g., Barack Obama, -0.343).
We partition politicians into left, center, and right ideologies, containing those whose ideology score is less than -0.2, between -0.2 and 0.2, and above 0.2, respectively. The distribution of these scores is shown in Figure 3. Finally, we discard tweets without images, leaving 57,093 tweets from 1,422 politicians as our final evaluation dataset. More details are summarized in Table 2.

Table 2: Total number of politician users and Twitter posts in our dataset, with associated statistics per user (last three columns).
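The three-way partition above can be sketched as a simple thresholding function. This is a minimal illustration: the function name is our own, and we place scores of exactly -0.2 and 0.2 in the center bucket, following the "between -0.2 and 0.2" wording.

```python
def ideology_bucket(dw_nominate: float) -> str:
    """Map a politician's DW-NOMINATE score to a coarse ideology label.

    Thresholds follow the partition described in the text:
    score < -0.2 -> left, -0.2 <= score <= 0.2 -> center, score > 0.2 -> right.
    """
    if dw_nominate < -0.2:
        return "left"
    if dw_nominate > 0.2:
        return "right"
    return "center"

# Examples from the text: Donald Trump (0.403) and Barack Obama (-0.343)
print(ideology_bucket(0.403))   # right
print(ideology_bucket(-0.343))  # left
```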

Characterization and Analysis of Datasets
To motivate different model and pretraining variants described in the next section, we analyze the content of images and text in our newly-collected AllSides dataset using both automatic and manual means.

Automatic Annotation of Images
The majority of images contain political figures; we wish to identify these figures and some salient aspects that may be relevant to predicting the ideology of the article. We employ DeepFace (Taigman et al., 2014), a state-of-the-art facial recognition framework. Given an input image, DeepFace identifies faces and matches them to a set of reference images; we construct a set of 10 reference images for each of 722 political figures using a combination of entity linking and manual heuristics, detailed in Appendix C. We also employ DeepFace to detect gender (male/female), race (Asian, Black, Indian, Latino, Middle Eastern, or White), and emotion (neutral, angry, fear, sad, disgust, happy, surprise) in AllSides images.

Pitfalls of Facial Recognition While using DeepFace, we encountered a few pitfalls. First, DeepFace is often unable to recognize faces that are small, blurred, or in side-profile. This corroborates existing work showing that reduced quality of faces is detrimental to the detection of faces and emotions (Jain and Learned-Miller, 2010; Yang et al., 2016, 2021). Second, we noticed frequent mistakes with a few high-profile figures. For example, DeepFace often classifies Barack Obama and Eric Holder as Hispanic or Middle Eastern, and Donald Trump as Asian, showing that DeepFace can be faulty even for famous people with many training images.

Manual Annotation of Images
No facial recognition tool is perfect, and aspects of images that could be relevant for ideology, such as main figures or the presence of certain objects, are not captured by DeepFace. Therefore, we manually annotate 400 random images from AllSides. For each image, we identify the number of people in the image (1-5, or "group" if there are 6 or more people). We identify the main figure(s) in the image. For each main figure, we identify their name (if a known political figure), gender, race, and emotion (Positive, Negative, or Neutral). If the figure is of mixed race (e.g., Barack Obama) or if the figure is unknown (i.e., not easily identifiable after examining the article's text and searching Google), we label their most salient race. We also identify any salient aspects of the image that help convey the image's message. This may include the presence of certain objects (e.g., guns, flags), activities (e.g., protest), or text in the image. We also annotate special image classes: whether the image is an invalid/missing image, a news source banner, a cartoon drawing, a collage, or a composite image. The difference between the latter two is explained in Figure 4.
Analysis We present annotator agreement between DeepFace and humans in Appendix D. In this section, we focus on drawing insights from the analysis of the images. We first examine the number of figures in the image (Table 3). We find that images from liberal sources on average contain more figures than images from conservative sources. Specifically, a higher percentage of left-leaning and center images contain 5 or more faces. Within the 5+ faces category, a large fraction are unknown figures (i.e., not well-known politicians), though these images may also contain notable politicians (e.g., Trump at a podium surrounded by supporters). The distribution of the number of figures in these images may reflect liberals' focus on equality as a group in comparison to conservatives' focus on self-reliance as part of their political identity, as revealed by prior work (Hanson et al., 2019).
We also examine the distribution of face occurrence by topic of the article. We find that images about topics such as civil rights, labor, and holidays have the largest number of figures on average, while topics such as national defense, the FBI, and criminal justice have relatively fewer people. This is simply a natural reflection of the nature of the topic. Within a topic, the distribution largely follows the "liberal images have more figures" pattern. For example, in the gun control topic, lean-left images contain on average 4.9 people, while lean-right images contain on average 2.0 people. Roughly 12-19% of images contain no face. We find that the majority of these images contain inanimate subjects mentioned in the news articles, which are about, e.g., an oil tanker that caught on fire or rubble from an earthquake, rather than about a specific political figure. Around 13% of these no-face pictures contain well-known government buildings, including the White House, the Capitol building, and the Supreme Court building. One explanation is that these images represent the three branches of government in the US and are thus a form of metonymy; e.g., the White House can refer to not only the president but also the country as a whole. However, future work is needed to understand why reporters would select, e.g., an image of the White House instead of an image of the president.
We also investigate types of images in Table 4. Most are ordinary images, but we find that over 12% of images from Right sources are collages, which are often arranged in the form of a narrative. Figure 4a shows one such example; other collages depict, e.g., policymakers who disagree on an issue.
The four most frequent figures in images are Donald Trump, Barack Obama, Hillary Clinton, and Joe Biden (see Table D.1 for details). We find a trend that articles from a particular ideology tend to have more images of the opposing side's figures (e.g., right-leaning media contain more images of Obama). This is likely a result of attack politics (Haynes and Rhine, 1998; Theilmann and Wilhite, 1998), where politicians attack their opponents instead of bolstering their own positions, especially when campaigning. This type of negative campaigning has been shown to be employed more by Republicans (Lau and Pomper, 2001).
Lastly, we analyze the emotion of the figures in AllSides (Figure 5). Across all ideologies, the majority of faces show negative or neutral emotion. Consumers actually prefer negative news over positive news (Trussler and Soroka, 2014), and negative images in news are more memorable (Newhagen and Reeves, 1992; Newhagen, 1998). Specifically for facial expressions, liberal and conservative news sources differ in their portrayals of Donald Trump (Boxell, 2021). Angry facial emotion primes also tend to increase trust in negative news messages (Ravaja and Kätsyri, 2014). This may explain why the more extreme Left and Right news sources, which are more likely to contain less credible news (Allen et al., 2020), have a higher rate of negative-emotion faces than Lean-Left and Lean-Right sources.

Models
Armed with new diverse datasets of articles and images, we now propose several models, input encoding strategies, and pretraining regimens to tackle the challenges of multimodal ideology prediction.

Text-Only and Image-Only Models
We first experiment with text-only models, including RoBERTa (Liu et al., 2019) and POLITICS (Liu et al., 2022a), a RoBERTa model further pretrained with a political ideology objective and thus specialized for ideology prediction and stance detection.
For image-only models, we use Swin Transformer (Hu et al., 2019; Liu et al., 2021, 2022b; Xie et al., 2022), a general-purpose hierarchical Transformer-based model that computes representations of images using shifted windows and has obtained strong or state-of-the-art performance on several image processing tasks. Because of our focus on faces, we experiment with several face-aware image preprocessing methods before encoding the images. These methods are described in Appendix F but were ultimately not successful. Thus, we use the images unchanged.

Multimodal Models
Early Fusion Also known as single-stream, an early fusion model takes the joint sequence of text and image as input and merges both modalities to obtain a single representation. We experiment with VisualBERT (Li et al., 2019) and ViLT (Kim et al., 2021), two Transformer-based models that have demonstrated strong performance on a series of vision-and-language downstream tasks such as VQAv2 (Goyal et al., 2017), NLVR2 (Suhr et al., 2019), and Flickr30K (Plummer et al., 2015). VisualBERT concatenates words and image segments identified by an object detector, with an additional embedding indicating the input modality. Instead of using an object detector, we feed in faces detected by DeepFace, as we consider political figures more relevant to the ideology prediction task. ViLT has a similar architecture, but uses separate positional embeddings for the text and image inputs and does not use an object detector. We use the pretrained weights released publicly by the authors.
Late Fusion Also known as dual-stream, a late fusion model uses two encoders that separately encode each modality; the two representations are then joined into a single representation. This is in contrast to early fusion, where a single encoder processes the image and text jointly. We use RoBERTa to encode text and Swin Transformer to encode images.
We evaluate several representation joining mechanisms: concatenation, Hadamard product, gated fusion (Wu et al., 2021), and cross-modal attention (LXMERT; Tan and Bansal, 2019). Gated fusion combines the two representations by learning a gate vector λ so that the combined representation is h = h_text + λ ⊙ h_img. For cross-modal attention, Hendricks et al. (2021) comprehensively analyzed different types of attention mechanisms and found that the co-attention scheme (given queries from one modality, e.g., image, keys and values are taken only from the other modality, e.g., language) performs best. Therefore, we use the co-attention scheme for our cross-modal attention module; our implementation largely resembles the cross-modal attention module in LXMERT, with the number of cross-modality layers N_X increased from 5 to 6.
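As an illustration of the gated-fusion combiner, the sketch below implements h = h_text + λ ⊙ h_img with the gate produced by a sigmoid of a simple per-dimension linear function of the two embeddings. This parameterization of the gate is our own stand-in for illustration; the exact gating network of Wu et al. (2021), and how its parameters are learned end-to-end, are not reproduced here.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_text, h_img, gate_weights, gate_bias):
    """Sketch of gated fusion: h = h_text + lambda * h_img (elementwise).

    The gate lambda is a sigmoid of a per-dimension linear function of the
    two embeddings. In the real model the gate parameters are learned;
    here they are passed in directly, purely for illustration.
    """
    assert len(h_text) == len(h_img) == len(gate_weights) == len(gate_bias)
    gate = [sigmoid(w * (t + i) + b)
            for w, b, t, i in zip(gate_weights, gate_bias, h_text, h_img)]
    return [t + g * i for t, g, i in zip(h_text, gate, h_img)]
```

With a strongly negative bias the gate closes and the fused vector reduces to the text embedding; with a strongly positive bias it opens and the image embedding is added in full.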

Continued Pretraining to Inject Knowledge of Ideology
Recent work has shown that continuing to train a pretrained model on domain-specific data or on an auxiliary task can improve the model's performance on the target task (Beltagy et al., 2019;Gururangan et al., 2020;Lee et al., 2020).In this vein, we aim to improve our ideology prediction model by performing continued pretraining with relevant objectives and auxiliary data.
For pretraining the image encoder, we experiment with an InfoNCE loss (Sohn, 2016; Van den Oord et al., 2018; Radford et al., 2021), a contrastive loss computed within each batch, where the image and text encoders are trained to maximize the cosine similarity of the image and text embeddings of the n correct pairs in the batch, while minimizing the cosine similarity of the embeddings of the n^2 − n incorrect pairings. We use this loss with images and their captions, with the hypothesis that supervision from captions will allow the image encoder to develop a more robust representation and potentially learn features of the image that are present in the caption.
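The batch-level computation just described can be sketched as follows. This is a minimal, dependency-free illustration of the symmetric InfoNCE loss; the temperature value is an arbitrary choice, and in practice the same quantity is computed with batched matrix operations on GPU rather than Python loops.

```python
import math


def cosine(u, v):
    """Cosine similarity between two (nonzero) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def info_nce(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of n (image, caption) pairs.

    The n matched pairs on the diagonal of the similarity matrix are
    positives; the n^2 - n off-diagonal pairings are negatives. Averages
    the image-to-text and text-to-image cross-entropy terms.
    """
    n = len(img_embs)
    sims = [[cosine(img_embs[i], txt_embs[j]) / temperature for j in range(n)]
            for i in range(n)]
    loss = 0.0
    for i in range(n):
        # image -> text direction: softmax cross-entropy over row i
        row = sims[i]
        loss += -row[i] + math.log(sum(math.exp(s) for s in row))
        # text -> image direction: softmax cross-entropy over column i
        col = [sims[j][i] for j in range(n)]
        loss += -col[i] + math.log(sum(math.exp(s) for s in col))
    return loss / (2 * n)
```

As a sanity check, a batch whose images and captions are correctly paired yields a much lower loss than the same batch with the captions shuffled.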
We also experiment with a bidirectional captioning loss (VirTex; Desai and Johnson, 2021), in which the image embedding is passed to an image captioning Transformer head, which generates a corresponding caption token by token in both the left-to-right and right-to-left directions.
Finally, we propose a novel triplet margin loss on triplets of news (anchor, positive, negative), where the positive example shares the same ideology with the anchor, while the negative example has a different ideology than the anchor. Mathematically,

$$\mathcal{L}_{trip} = \sum_{t \in T} \left[ d\!\left(t^{(a)}, t^{(p)}\right) - d\!\left(t^{(a)}, t^{(n)}\right) + \alpha \right]_+$$

where T is the set of news triplets in the training set; d(·, ·) is a distance between representations; t^(a), t^(p), and t^(n) are the joint representations of text and image (concatenated and passed through a linear transformation) of the anchor, positive, and negative news in triplet t; α is a bias term; and [·]_+ is the ramp function max(·, 0). This is inspired by the triplet loss used in FaceNet (Schroff et al., 2015) and is similar to the triplet ideology loss proposed by Liu et al. (2022a), who pretrain a text-only ideology prediction model with this loss. We apply this loss to pretrain the image encoder, text encoder, and embedding combination components of our model.
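The per-triplet term can be sketched as below. This is a minimal illustration: we use Euclidean distance in the style of FaceNet, since the text does not pin down the distance metric, and the margin value is arbitrary.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Per-triplet margin loss over joint text+image representations.

    Pulls the anchor toward the same-ideology (positive) example and
    pushes it away from the different-ideology (negative) example until
    their distances differ by at least `margin`. Euclidean distance is
    an illustrative choice (as in FaceNet); the paper's exact metric is
    not specified.
    """
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    return max(dist(anchor, positive) - dist(anchor, negative) + margin, 0.0)
```

When the negative is already farther from the anchor than the positive by more than the margin, the loss is zero and the triplet contributes no gradient.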

Experiments
We first perform preliminary experiments comparing single-modality models to determine whether the inclusion of images helps ideology prediction.Then, we evaluate multimodal experimental setups, exhaustively selecting an image encoder, embedding combination methods, and pretraining methods; performing the continued pretraining; then finetuning the entire model on the task of ideology prediction.
Implementation Details We implement the new models in PyTorch, importing existing models from their authors' respective GitHub pages.All models were trained for a maximum of 20 epochs with early-stopping patience of 4. For detailed hyperparameters for pretraining and finetuning and other specific implementation details, please refer to our code on the project page.

Results and Analysis
Text-Only and Image-Only Models We first present unimodal experiments on AllSides in Table 5. We find that the vision-only model performs significantly worse than the text-only baseline, indicating that an image alone is inadequate for predicting ideology. Surprisingly, Swin-Small slightly outperforms Swin-Base (which is larger in size), though the difference is not substantial. Thus, we decided to use Swin-Small (Swin-S) as the image encoder backbone for our multimodal models for its performance and size. These baseline results motivate the premise of multimodal ideology prediction, in which we use images as additional signal to augment the text.

Multimodal Models without Pretraining Next, we present multimodal model results without pretraining in Table 6. We find that early fusion models cannot outperform the text-only baselines, in contrast to late fusion models, indicating that the combination of text and image is beneficial for ideology prediction, but the choice of architecture is important. Among the late fusion models, the embedding joining methods show no significant difference; however, the late fusion architecture with cross-modal attention performs best, and thus we take this model as our starting point for the rest of the experiments in this paper.

Multimodal Models with Pretraining
We then exhaustively experiment with combinations of pretraining objectives for each component of the model (language encoder, image encoder, and cross-modal attention) for ablation and analysis.
Results on the AllSides evaluation set are shown in Table 7. First, we find that replacing RoBERTa with the pretrained POLITICS model already yields a 1% improvement in the overall model. By pretraining on similar-domain text, the model is able to generate better text representations. Liu et al. (2022a) find that the POLITICS objective allows the model to perform much better on left-leaning articles, which have higher perplexity (i.e., the language is more diverse). In a multimodal setup, we find that this text encoder helps the multimodal model improve on right-leaning input data more than left-leaning, indicating that the inclusion of images helps the classification of right-leaning ideology.
For pretraining the image encoder, our experiments show that the VirTex-style bidirectional captioning loss performs better than the InfoNCE contrastive learning objective. This pretraining method allows the model to better capture the similarities between the image and its associated text. Text may also provide a more semantically dense signal than contrastive approaches (Desai and Johnson, 2021), thus leading to better performance.
For pretraining the cross-modal attention, we find that our proposed Triplet Margin Loss objective, which optimizes all three components (image encoder, text encoder, cross-modal attention) of the entire model, improves over no pretraining.
Overall, ablation experiments show that the components of the best combination (RoBERTa pretrained with the POLITICS loss, and Swin pretrained with the VirTex loss as the image encoder) contribute around 1 and 2 percentage points, respectively (Table 7). Further combining them with the triplet margin loss, our best-performing model improves by more than 3 percentage points over the baseline late-fusion model.
Twitter & Reddit Finally, we evaluate on the Reddit and Twitter datasets to get a more comprehensive perspective on the model's ideology prediction ability across domains. Results are presented in Table 8. Overall, performance is substantially lower than on AllSides because of domain mismatch: Reddit posts and tweets are not usually written in the long, formal language of news articles. However, the improvements over a text-only baseline are more substantial than on AllSides, where the long text already contains enough information for predicting ideology. We also find that for Twitter, which was split into left, center, and right ideologies, the models perform poorly on Center tweets, probably due to high dataset imbalance (Table 2), though the addition of images greatly improves over text-only models.

Conclusion
This paper introduces the task of fine-grained multimodal ideology prediction, where a model predicts one of five ideological leanings given a pair of text and an image. We collect five new large-scale datasets of political images and present an in-depth characterization of these images, examining aspects such as facial features across ideologies. We experiment with a combination of state-of-the-art multimodal architectures and pretraining schemes, including a newly proposed triplet margin loss objective. Along with the release of our datasets, our experimental findings will inform the selection of models and training objectives in future work and spur future research in politics, ideology prediction, and other multimodal tasks.

Facial Recognition
We use DeepFace to perform facial recognition as well as attribute recognition for gender, race, and emotion. However, in this work we only use DeepFace to perform analysis (rather than prediction) on our new datasets, and we compare DeepFace's analyses with human annotations. The options for gender, race, and emotion are defined by DeepFace. Though some may question these options, it is not within the scope of this paper to argue for or against them. As mentioned in Section 3.3, DeepFace often classifies Barack Obama and Eric Holder as Hispanic or Middle Eastern, and Donald Trump as Asian. This is likely due to models learning that dark skin or narrow eyes, respectively, are important features predictive of race. As researchers, we must be aware of these biases in the models and be careful not to reinforce racial stereotypes through models' predictions. We do not explicitly use race, gender, or ethnicity as features in our prediction model. Moreover, we call on all researchers to use automatic facial recognition tools like DeepFace carefully and to take all possible biases into consideration.

Ideology
In our work, we have made several assumptions about ideology, specifically that the ideology of a news source, Reddit subreddit, or Tweet author is consistent. Obviously, this may not always be true: a left-leaning post may appear in r/Conservative, or a politician on Twitter may be a moderate whose tweets reflect liberal and conservative stances on different issues. However, these are relatively rare cases, and we warn all potential users about them.

Model
Intended Use The use case we have described for our multimodal ideology prediction model is to educate users about the ideological bias of media of various genres containing both text and images.
Failure Mode A failure mode is a case where our model fails to predict the correct ideology of a piece of media containing both text and image. While we showed that these models achieve high accuracy, they are not perfect. End users of our model must not take model predictions as fact. We encourage end users to consult experts in machine learning as well as political science when using our models.

Limitations
Facial Recognition As our analysis has shown, DeepFace is not fully reliable as an automatic annotation tool. To confidently use DeepFace as an analysis tool, manual annotation (which we have done in this paper) is necessary, but it is time-consuming and labor-intensive.
Compute Resources - GPUs Due to the scale of the data (summarized in Table 1) and the size of the models (summarized in Table 9), pretraining is extremely computationally expensive and requires large GPU resources. Our experiments are performed using 2 NVIDIA RTX A6000 and 2 Quadro RTX 8000 GPUs. Batch sizes are chosen to meet hardware constraints, and we pretrain the models for 2,500 steps. Pretraining on BN-IMGCAP and BNA-IMG takes approximately 3 and 4 days, respectively.

One author of this paper annotated images from 200 news stories following the above guidelines. We compute Cohen's kappa to measure agreement between our annotator and the DeepFace predictions. A summary of statistics is shown in Figure D.2. We examine each aspect of the images in turn.
For the number of people in an image, κ = 0.69 indicates relatively high agreement. Note that we bin 6+ people into the "group" label. In most cases, disagreement stemmed from DeepFace recognizing more faces than the annotator; these faces were often small, blurred, or partially obscured.
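The agreement statistic above can be reproduced with a short Cohen's kappa computation. The sketch below is a plain-Python implementation; the paired head-count labels (with 6+ people collapsed into a "group" bin) are purely illustrative and are not our annotation data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of per-label marginals for each annotator.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative head-count bins (6+ people mapped to "group").
human = ["1", "2", "group", "1", "3", "group", "2", "1"]
deepface = ["1", "2", "group", "2", "3", "group", "2", "1"]
print(round(cohens_kappa(human, deepface), 2))
```

In practice, `sklearn.metrics.cohen_kappa_score` computes the same quantity directly from two label sequences.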
For the main figures in an image, DeepFace does not explicitly have such a notion, and we did not tell the annotator what a "main figure" should be. Examining the annotations, we find that main figures tend to be large, often in the center of the image, and in focus (i.e., not blurred). We also report the number of main figures per ideology. DeepFace does, however, rank the extracted faces by their saliency. We examine the most salient face predicted by the model and compare it with our annotator's labels. First, we find that identifying the figure's name yields low annotator agreement (κ = 0.32). We observe that DeepFace is often unable to recognize faces that are relatively small in the image or shown in side profile, even when the image is of a well-known figure. This indicates that although facial recognition tools are adept at detecting the presence of a face, they may be sensitive to the size of the face when identifying it. We also find that DeepFace incorrectly predicts Matt Bevin (former governor of Kentucky) and Jacob Blake (a black man who was a victim of a police shooting) for many images. Given their disproportionately high frequency among model predictions, we believe that these two figures represent the "stereotypical" white man and black man, respectively, to the model.
For gender, DeepFace shows high agreement (κ = 0.76) with our annotator. Gender can be easily retrieved from a database once the person is correctly identified. Nevertheless, it is interesting to note that DeepFace's mistakes all involved misclassifying women as men; these instances included Hillary Clinton, Samantha Power, Gina Haspel, and Patty Murray.
Similar to gender, race can also be queried if the person is known. However, race is also problematic in several regards, one of which is the case of mixed-race individuals.

Figure 1 :
Figure 1: Two images from separate sources depicting the news story "Federal Judge Pauses Justice Department Effort to Dismiss Michael Flynn Case". In (a), from the New York Times, Flynn is shown with several other figures and has a positive expression. In contrast, in (b), from Fox News, Flynn is the sole figure, with a negative expression.

Figure 2 :
Figure 2: Proportion of posts with images from each political subreddit.

Figure 3 :
Figure 3: Histogram of the first DW-NOMINATE dimension in VoteView. Negative indicates left-leaning, while positive indicates right-leaning. Gray bars indicate the split points at -0.2 and 0.2 that separate the left, center, and right ideologies.
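The three-way split described in the caption amounts to a simple mapping from a DW-NOMINATE score to a coarse ideology label. The sketch below uses the ±0.2 split points stated above; the function name and the assignment of the exact boundary values to "center" are our own illustrative choices.

```python
def nominate_to_ideology(score: float) -> str:
    """Map the first DW-NOMINATE dimension to a coarse ideology label.

    Scores below -0.2 are treated as left, scores above 0.2 as right,
    and everything in between (including the boundaries, by assumption)
    as center.
    """
    if score < -0.2:
        return "left"
    if score > 0.2:
        return "right"
    return "center"

print([nominate_to_ideology(s) for s in (-0.45, 0.0, 0.31)])
```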

Figure 4 :
Figure 4: Collages are composed of separate images arranged adjacently, while a composite image is composed of partial images edited together. Collages are often used to tell a sequential story, while composites show a connection between different people.

Figure 5 :
Figure 5: Facial emotions stratified by ideology from human annotations of AllSides. The majority of emotions are negative or neutral, rather than positive. Notice that Left and Right (i.e., the media labeled as more extreme ideologies) have a much higher proportion of negative faces than Lean Left and Lean Right.

Figure 6 :
Figure 6: High-level structure of the late-fusion model architecture. The representations of the image and the text are computed separately, then combined before being passed to a classification layer.

Table 1 :
Number of image-text pairs in our newly-collected pretraining datasets, separated by news source. BNA-IMG contains article text, while BN-IMGCAP contains captions.

Table 3 :
Percentage of images containing faces in AllSides, analyzed by DeepFace. On average, left-leaning images use slightly more figures than right-leaning images.

Table 4 :
Types of images in a random sample of 400 images from AllSides. While most images are ordinary, a small percentage fall under special cases. Notably, a large fraction of Right images (12%) are collages, indicating a common strategy by right-leaning media.

Table 5 :
Experiments with single-modality models (no pretraining) on 5-way prediction on AllSides. All results are averaged over 5 runs. RoBERTa is already a strong baseline, showing that an article's text is sufficient in many cases for predicting ideology. However, the image-only Swin models perform quite poorly; in many cases it is hard to infer ideology solely from images.

Table 6 :
Multimodal results on AllSides without pretraining. Early fusion models perform worse than text-only baselines (found in Table 5).

Table 7 :
Pretraining ablation experiments on AllSides. The base model is RoBERTa + Swin-S. We report mean and standard deviation over five runs. The base model performs poorly on Left. Adding pretraining substantially improves performance overall, especially on articles reported by Right-leaning media.

Table 8 :
Results on the Reddit and Twitter datasets, showing mean and standard deviation over five runs. Due to domain mismatch, performance on Reddit and Twitter is worse than on AllSides. However, the addition of pretraining improves overall performance. For a detailed breakdown by ideology, see Tables G.1 and G.3.

Table 9 :
Number of parameters in each model.