Multimodal Fusion with Co-Attention Networks for Fake News Detection

Fake news with textual and visual contents has a better story-telling ability than text-only contents, and can be spread quickly with social media. People can be easily deceived by such fake news, and traditional expert identification is labor-intensive. Therefore, automatic detection of multimodal fake news has become a new hot-spot issue. A shortcoming of existing approaches is their inability to fuse multi-modality features effectively. They simply concatenate unimodal features without considering inter-modality relations. Inspired by the way people read news with image and text, we propose a novel Multimodal Co-Attention Networks (MCAN) to better fuse textual and visual features for fake news detection. Extensive experiments conducted on two real-world datasets demonstrate that MCAN can learn inter-dependencies among multimodal features and outperforms state-of-the-art methods.


Introduction
The rapid growth of social media has created fertile soil for the emergence and fast spread of fake news (Zhao et al., 2015), resulting in serious consequences. For example, during the 2016 U.S. presidential election, the most popular fake news was more widely spread on Facebook than the most popular authentic news, which confused people and broke the authenticity balance of the news ecosystem. To mitigate the negative effects caused by fake news, it is crucial to detect fake news on social media automatically.
Tweets with images have recently become popular on social media; they carry richer information and attract more viewers than text-only tweets. Fake news also makes full use of this advantage to draw in and mislead readers. Figure 1 shows three examples of fake news from Twitter. In the left example, both the text and the image indicate that it is likely to be fake. The text of the middle one provides little evidence that it is fake news, but the image is obviously forged. In the right example, the image seems normal, while the textual contents indicate that it is probably fake. A hypothesis drawn from these examples is that combining the text and the attached image is more conducive to detecting fake news.
Recent works show a growing interest in using multimodal (text + image) information to detect fake news. att-RNN utilizes local attention mechanisms to fuse features of image, text, and social context. Some studies learn joint representations of text and image based on auxiliary adversarial networks (Wang et al., 2018) and variational autoencoders (Khattar et al., 2019). Nevertheless, they are not fine-grained enough in feature extraction and feature fusion. First, some studies require extra information that is labor-intensive to collect, such as social context and event category (Wang et al., 2018), which increases the cost of detection. Second, apart from the text of tweets, the methods mentioned above all focus on characteristics of images at the semantic level (e.g., emotional provocations), which are reflected in the spatial domain. However, these methods ignore the information of fake images at the physical level, e.g., re-compression artifacts, which is reflected in the frequency domain (Qi et al., 2019). Third, some models (Wang et al., 2018; Khattar et al., 2019) obtain fused representations by simply concatenating multi-modality features. Although att-RNN leverages a local attention mechanism, its attention values are obtained only from joint textual-social representations, which cannot reflect the similarity between textual-social representations and visual representations.
Intuitively, when people judge news credibility with text and image, they often observe the image first and then read the text (Wang et al., 2020). This process may be repeated several times. In this process, people understand the image according to the textual information, and understand the text according to the associated image information. The information of one modality is thus conditionally fused with that of the other modality once or multiple times, which means there are inter-modality attention relations between image and text. However, existing state-of-the-art methods are weak at fusing multimodal features because they neglect inter-modality interactions.
To address the aforementioned challenges, we propose the Multimodal Co-Attention Networks (MCAN) for fake news detection by considering multimodal features. In our proposed model, we first extract spatial-domain features and frequency-domain features from the image, as well as textual features from the text. Then we develop a novel fusion approach with multiple co-attention layers to learn inter-modality relations, which fuses the visual features first and then the textual features. The fused representation obtained from the last co-attention layer is used for fake news detection.
The contributions of this paper can be summarized as follows: (1) We propose a novel end-to-end approach to detect fake news on social media using only the text and the attached image, without any extra information or auxiliary tasks. (2) The proposed MCAN model stacks multiple co-attention layers to fuse the multimodal features, which can learn inter-dependencies among them. (3) Our MCAN model is a general framework for fake news detection, and the components of MCAN are flexible. The sub-networks used to extract multimodal features can be replaced by different models. Moreover, the modular fusion process of MCAN allows our model to handle more modalities conveniently. (4) We evaluate MCAN on two large-scale real-world datasets. The results demonstrate that our model outperforms the state-of-the-art models.
The rest of the paper is organized as follows: In Section 2, we summarize previous related work on fake news detection. In Section 3, we detail our proposed model. The datasets, baselines, and experimental results are presented in Section 4. We conclude the study in Section 5.

Related Work
Following previous work (Ruchansky et al., 2017), we define fake news as news that is intentionally fabricated and can be verified as false. Existing methods for fake news detection can be divided into unimodal approaches and multimodal approaches.

Unimodal Fake News Detection.
Textual features are extracted from text content, including statistical features, such as the number of paragraphs in the text (Volkova et al., 2017), the percentage of negative words (Potthast et al., 2017; Bond et al., 2017), and the number of punctuation marks and emojis (Castillo et al., 2011), and semantic features, such as writing styles (Chen et al., 2015) and language styles (Feng et al., 2012). However, these features are hand-designed, which introduces bias and makes them difficult to design. To address this problem, many studies use deep learning technologies, such as RNN (Ma et al., 2016), CNN (Yu et al., 2017), and GAN (Ma et al., 2019), to identify fake news. Their results show that deep learning methods perform better.
Visual features are important for news verification (Jin et al., 2016), such as the clarity score (Jin et al., 2016) and the number of images (Wu et al., 2015; Jin et al., 2016). However, these features are manually crafted and capture only simple patterns, so they hardly apply to real images. Qi et al. (2019) design a CNN-based model to capture image patterns, but their model only works when a large number of samples is available, so its applicable scope is very limited.
Social context features arise from the social connections between users and tweets, such as user profiles and the number of posts. Liu et al. (2018) use user profiles on the news propagation path to determine the truth of the news. Some other works model propagation patterns as tree structures based on kernel methods (Wu et al., 2015; Ma et al., 2017). However, social context features are handcrafted, incomplete, and unstructured.
The above work reflects the limitations of unimodal features in detecting fake news. In this paper, we consider multiple modalities simultaneously when detecting fake news.

Multimodal Fake News Detection.
Recent works explore the fusion of multimodal features. att-RNN uses a local attention mechanism to fuse textual, visual, and social context features. Wang et al. (2018) learn event-invariant features with an auxiliary adversarial network. Khattar et al. (2019) utilize autoencoders coupled with a detector to learn the shared representation of the text and the attached image. However, they ignore the characteristics of fake images at the physical level (e.g., re-compression artifacts), and the fused features they learn lack correlations across multiple modalities.
To overcome the limitations of existing works, we propose MCAN to learn inter-dependencies among modalities. We extract spatial-domain and frequency-domain features of the image, and textual features. Then we fuse them through a deep co-attention model inspired by a realistic scenario.

Model Overview
Our model aims to learn multimodal fusion representations by considering dependencies across the modalities. As shown in Figure 2, the proposed model has three major procedures: feature extraction, feature fusion, and fake news detection.
Given news with text and image, we first utilize three different sub-models to extract features from the spatial domain, the frequency domain, and the text. Then the multi-modality features are fused through a deep co-attention model, which consists of multiple co-attention layers. At last, the output of the co-attention model is used for judging the truth of the input news.

Feature Extraction
Spatial-Domain Feature. To learn the semantic-level characteristics of the given image, we employ the VGG-19 network (Simonyan and Zisserman, 2014) to extract visual features from the spatial domain. After the second-to-last layer of VGG-19, we add a fully connected layer (denoted as "s-fc" in Figure 2) with ReLU activation function to generate a $(d \times 1)$-dimensional feature representation of the input image in the spatial domain, which is denoted as $R_S \in \mathbb{R}^{d \times 1}$.
Frequency-Domain Feature. Fake-news images are often re-compressed or tampered images that show periodicity in the frequency domain (Qi et al., 2019), which can be easily captured by CNNs. Thus we design a CNN-based sub-network to extract features from the frequency domain, as in Figure 3. The image is transformed from the spatial domain to the frequency domain through the discrete cosine transform (DCT) as in Qi et al. (2019). After that, we obtain 64 vectors $H_0, H_1, \ldots, H_{63}$, which are then sampled to the fixed size 250. For parallel computation, we pack the 64 250-dimensional vectors into a matrix $H_F \in \mathbb{R}^{64 \times 250}$, which is later fed to the CNN-based network. The CNN-based sub-network consists of a major network along with multi-branch networks. The earlier part of the major network has three convolutional blocks and a max-pooling layer. The multi-branch networks have the same architecture as in Inception V3 (Szegedy et al., 2016). The last part of the major network is a max-pooling layer followed by a convolutional block. Each convolutional block is composed of a two-dimensional convolutional layer with batch normalization and ReLU activation function. After adding a fully connected layer with ReLU activation function (denoted as "f-fc" in Figure 2), we obtain the feature representation of the image in the frequency domain, $R_F \in \mathbb{R}^{d \times 1}$.
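The construction of the frequency-domain input can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that the 64 vectors come from 8 × 8 block-wise DCT (one vector per coefficient position), and that "sampled to the fixed size 250" can be approximated by linear interpolation. The function names `dct_matrix` and `frequency_matrix` are hypothetical.

```python
import numpy as np

def dct_matrix(n: int = 8) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)  # frequency index
    x = np.arange(n).reshape(1, -1)  # sample index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)
    return mat

def frequency_matrix(gray: np.ndarray, block: int = 8, length: int = 250) -> np.ndarray:
    """Return a (block*block, length) matrix of block-DCT coefficients.

    Assumption: one vector per coefficient position, collected over all
    blocks, then resampled to `length` values by linear interpolation.
    """
    h, w = gray.shape
    h, w = h - h % block, w - w % block  # crop to a multiple of the block size
    gray = gray[:h, :w].astype(np.float64)
    # Split the image into non-overlapping blocks: (num_blocks, block, block).
    blocks = (gray.reshape(h // block, block, w // block, block)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, block, block))
    c = dct_matrix(block)
    coeffs = c @ blocks @ c.T  # 2D DCT applied to every block
    vectors = coeffs.reshape(len(blocks), block * block).T  # (64, num_blocks)
    src = np.linspace(0.0, 1.0, vectors.shape[1])
    dst = np.linspace(0.0, 1.0, length)
    return np.stack([np.interp(dst, src, v) for v in vectors])

# Usage with a random stand-in for a grayscale image.
H_F = frequency_matrix(np.random.default_rng(0).random((64, 96)))
print(H_F.shape)
```

The resulting 64 × 250 matrix plays the role of $H_F$; the CNN that consumes it is not sketched here.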
Textual Feature. The text content of the tweet is a sequential list of words denoted as $T = [T_1, T_2, \ldots, T_n]$, where $n$ is the number of words in a tweet, and each word $T_i \in T$ is tokenized with a pre-built vocabulary (Devlin et al., 2018). Recently, the BERT model (Devlin et al., 2018), which is pre-trained on a large language corpus, has been proven to perform very well in multiple natural language processing tasks. Thus we utilize BERT to obtain the aggregate sequence representation as the textual feature. The textual feature is then resized to a $(d \times 1)$-dimensional representation (denoted as $R_T$) by a fully connected layer with ReLU activation function.

Feature Fusion
Intuitively, people often look at the image first and then read the text when reading news with image and text. This process may be repeated several times, continuously fusing image and text information. Therefore, we develop a novel fusion approach to simulate this process. We achieve feature fusion by stacking multiple co-attention (CA) layers in cascade, each of which consists of two co-attention blocks connected in parallel. Before presenting the fusion process, we first introduce its basic unit, the co-attention block, which extends the multi-head self-attention (MSA) block.
The MSA block consists of a multi-head self-attention function and a fully connected feed-forward network, both wrapped in a residual connection followed by layer normalization. The input of MSA is first used to compute $(d \times 1)$-dimensional queries, keys, and values, packed into matrices $Q$, $K$, and $V$, respectively. The dot-product similarity between $Q$ and $K$ determines the attention distribution over $V$. The multi-head attention function with $m$ heads runs $m$ self-attention functions in parallel. For the $i$-th head, $Q$, $K$, and $V$ are transformed by the projection matrices of that head, each of size $d \times d_h$, where $d_h = d/m$ is the dimensionality of the output feature of each head. The outputs of the $m$ heads are then combined to form the output of the multi-head self-attention function.
The fully connected feed-forward network consists of two linear transformations with a ReLU activation function in between, where the dimensionality of the input and output is $d \times 1$, and the inner-layer dimensionality is $d_{ff}$.
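The description above matches the multi-head attention of the Transformer. A sketch in that standard notation is given below; the symbol names $W_i^Q$, $W_i^K$, $W_i^V$, $W^O$, $W_1$, $W_2$, $b_1$, $b_2$, the $\sqrt{d_h}$ scaling, and the concatenation followed by an output projection are assumptions taken from the standard formulation rather than from this paper.

```latex
\begin{aligned}
Q_i &= Q W_i^{Q}, \quad K_i = K W_i^{K}, \quad V_i = V W_i^{V},
  \qquad W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d \times d_h}, \\
\mathrm{head}_i &= \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_h}}\right) V_i, \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{concat}(\mathrm{head}_1, \ldots, \mathrm{head}_m)\, W^{O},
  \qquad W^{O} \in \mathbb{R}^{d \times d}, \\
\mathrm{FFN}(x) &= \max(0,\, x W_1 + b_1)\, W_2 + b_2,
  \qquad W_1 \in \mathbb{R}^{d \times d_{ff}},\; W_2 \in \mathbb{R}^{d_{ff} \times d}.
\end{aligned}
```

Since $d_h = d/m$, concatenating the $m$ heads restores a $d$-dimensional output, so the residual connection can add it directly to the block input.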
The co-attention block (denoted as "Co-Attn") is extended from the MSA block, as shown in Figure 4(b). For a Co-Attn block, the queries are from one modality while the keys and values are from another modality. Specifically, the query matrix is used as the residual term after the multi-head attention sub-layer. The rest of the architecture is the same as in MSA. The Co-Attn block produces an attention-pooled feature for one modality conditioned on another modality. If $Q$ comes from the text and $K$ and $V$ come from the attached image, the attention value calculated from $Q$ and $K$ measures the similarity between the text and the image, and is then used to weight the image. Similarly, after reading the text, humans pay more attention to the areas in the image that are related to the text. We believe that co-attention can simulate this process and learn inter-dependencies between different features.
Co-attention layer. We obtain a CA layer by connecting two Co-Attn blocks in parallel, as shown in Figure 2. Given different features for the two Co-Attn blocks, the CA layer computes queries, keys, and values for each Co-Attn block as in an MSA block. Then the keys and values of one Co-Attn block are passed as input to the other Co-Attn block. The outputs of the two Co-Attn blocks are concatenated and then fed into a fully connected layer to get the fused representation. The CA layer models dense interactions between input modalities by exchanging their information.
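A Co-Attn block and a CA layer can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the weights are random rather than trained, biases and learnable layer-normalization parameters are omitted, the $\sqrt{d_h}$ scaling and output projection follow the standard Transformer, and each modality is passed as an `(n, d)` array (with `n = 1` for a single $d$-dimensional feature vector). The names `init_block`, `co_attn`, and `ca_layer` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, d_ff = 256, 4, 512  # values from the experimental settings
d_h = d // m

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_block():
    """Random (untrained) parameters for one Co-Attn block."""
    shapes = {"Wq": (m, d, d_h), "Wk": (m, d, d_h), "Wv": (m, d, d_h),
              "Wo": (d, d), "W1": (d, d_ff), "W2": (d_ff, d)}
    return {k: rng.normal(0.0, 1.0 / np.sqrt(d), s) for k, s in shapes.items()}

def co_attn(p, x_q, x_kv):
    """Queries from x_q (n_q, d); keys and values from x_kv (n_kv, d)."""
    q = np.einsum("nd,hde->hne", x_q, p["Wq"])   # (m, n_q, d_h)
    k = np.einsum("nd,hde->hne", x_kv, p["Wk"])  # (m, n_kv, d_h)
    v = np.einsum("nd,hde->hne", x_kv, p["Wv"])  # (m, n_kv, d_h)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_h))
    heads = weights @ v                          # (m, n_q, d_h)
    attended = heads.transpose(1, 0, 2).reshape(x_q.shape[0], d) @ p["Wo"]
    x = layer_norm(x_q + attended)               # residual uses the query side
    ffn = np.maximum(0.0, x @ p["W1"]) @ p["W2"]
    return layer_norm(x + ffn)

def ca_layer(layer, a, b):
    """Two parallel Co-Attn blocks; assumes a and b have the same n."""
    a_from_b = co_attn(layer["a"], a, b)  # a conditioned on b
    b_from_a = co_attn(layer["b"], b, a)  # b conditioned on a
    return np.concatenate([a_from_b, b_from_a], axis=-1) @ layer["Wc"]

# Usage: fuse two single-vector modality features.
layer = {"a": init_block(), "b": init_block(),
         "Wc": rng.normal(0.0, 1.0 / np.sqrt(2 * d), (2 * d, d))}
r_s = rng.normal(size=(1, d))
r_f = rng.normal(size=(1, d))
fused = ca_layer(layer, r_s, r_f)
print(fused.shape)
```

Note that with a single key per modality the softmax weight is trivially 1; the `(n, d)` interface is kept so that the same sketch also covers token-level or region-level features.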
Multiple co-attention stacking. To fuse multimodal features deeply, we stack 4 CA layers. The fusion process is progressive: the output of each CA layer is one of the inputs of the next layer (see Figure 2). We first fuse the spatial-domain representation $R_S$ and the frequency-domain representation $R_F$ in the first CA layer and obtain $R^{(1)}_C$. Then $R_F$ is reinforced by fusing it with $R^{(1)}_C$ in the second CA layer, which outputs $R^{(2)}_C$. In the third and fourth layers, the inputs are the output of the previous layer and the text representation $R_T$, and the outputs are $R^{(3)}_C$ and $R^{(4)}_C$, respectively. The output vector of each CA layer is $d$-dimensional. The calculation processes are formulated as follows. Due to the page limit, we only show the calculation for the first CA layer and skip the repeated details of the other layers.

$$R^{(1)}_C = \mathrm{concat}\left(R_{C_{S \leftarrow F}},\, R_{C_{F \leftarrow S}}\right) W^{(1)}_C$$
where $R_{C_{S \leftarrow F}} \in \mathbb{R}^{d}$ is the attention-pooled feature for the spatial domain conditioned on the frequency domain, $R_{C_{F \leftarrow S}} \in \mathbb{R}^{d}$ is the attention-pooled feature for the frequency domain conditioned on the spatial domain, and $W^{(1)}_C \in \mathbb{R}^{2d \times d}$ is the projection matrix of the first CA layer. $R^{(1)}_C$ is transformed into a $(d \times 1)$-dimensional representation before being input to the next CA layer. Specifically, the first and the third CA layers share parameters, and the second and the fourth CA layers share parameters.
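The stacking schedule and the parameter sharing can be sketched as follows. This is a simplified illustration, not the authors' implementation: `ca_layer` is a stand-in that keeps only the value projection, the query-side residual, and the concatenate-and-project step (with one vector per modality, the softmax over a single key equals 1, so the attention output reduces to the projected value). The order of the two inputs in each layer and all names (`make_layer`, `ca_layer`, `fuse`) are assumptions.

```python
import numpy as np

d = 256
rng = np.random.default_rng(0)

def make_layer():
    """Random (untrained) parameters for one simplified CA layer."""
    s = 1.0 / np.sqrt(d)
    return {"Wv_xy": rng.normal(0.0, s, (d, d)),
            "Wv_yx": rng.normal(0.0, s, (d, d)),
            "Wc": rng.normal(0.0, 1.0 / np.sqrt(2 * d), (2 * d, d))}

def ca_layer(p, x, y):
    """Simplified CA layer on two d-dimensional vectors."""
    x_from_y = x + y @ p["Wv_xy"]  # x conditioned on y, plus query residual
    y_from_x = y + x @ p["Wv_yx"]  # y conditioned on x, plus query residual
    return np.concatenate([x_from_y, y_from_x]) @ p["Wc"]  # 2d -> d

def fuse(r_s, r_f, r_t, layer_a, layer_b):
    """Four CA layers; layers 1 and 3 share layer_a, layers 2 and 4 share layer_b."""
    r1 = ca_layer(layer_a, r_s, r_f)  # layer 1: spatial with frequency
    r2 = ca_layer(layer_b, r1, r_f)   # layer 2: fused visual with frequency again
    r3 = ca_layer(layer_a, r2, r_t)   # layer 3: visual with text
    r4 = ca_layer(layer_b, r3, r_t)   # layer 4: fused result with text again
    return r4

# Usage with random stand-ins for R_S, R_F, and R_T.
layers = [make_layer(), make_layer()]  # only two parameter sets for four layers
r_s, r_f, r_t = (rng.normal(size=d) for _ in range(3))
r4 = fuse(r_s, r_f, r_t, *layers)
print(r4.shape)
```

Sharing parameters between alternating layers halves the number of fusion parameters while keeping the four-step fusion order.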

Model Learning
We have obtained the multimodal feature representation $R^{(4)}_C$, which fuses the features of the text, the spatial domain, and the frequency domain. Let $f = R^{(4)}_C$ be the representation used for prediction. $f$ is passed through a fully connected layer with parameters $W_f$ and then through a softmax layer whose linear layer has parameters $W_s$, and the output of the proposed MCAN is the probability of a tweet being fake. The loss function minimizes the cross-entropy between the predicted probability and the ground truth $y$, with 1 representing fake news and 0 representing real news; $\Theta$ denotes all learnable parameters in the proposed model.
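The prediction head and the loss can be sketched as follows. This is a minimal sketch under assumptions, not the authors' code: the ReLU after the fully connected layer, the bias terms `b_f` and `b_s`, the two-class output ordering `[real, fake]`, and the function names are all hypothetical. The hidden size 35 is the "p-fc" size reported in the experimental settings, assuming "p-fc" is this fully connected layer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_fake_probability(f, W_f, b_f, W_s, b_s):
    """Fully connected layer followed by a softmax layer over [real, fake]."""
    hidden = np.maximum(0.0, f @ W_f + b_f)  # ReLU is an assumption
    probs = softmax(hidden @ W_s + b_s)
    return probs[1]                          # probability of the "fake" class

def cross_entropy(p_fake, y, eps=1e-12):
    """Binary cross-entropy with y = 1 for fake news and y = 0 for real news."""
    p = np.clip(p_fake, eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1.0 - p))

# Usage with a random fused representation f and untrained weights.
rng = np.random.default_rng(0)
d, hidden_size = 256, 35
f = rng.normal(size=d)
W_f = rng.normal(0.0, 1.0 / np.sqrt(d), (d, hidden_size))
W_s = rng.normal(0.0, 1.0 / np.sqrt(hidden_size), (hidden_size, 2))
p_fake = predict_fake_probability(f, W_f, np.zeros(hidden_size), W_s, np.zeros(2))
loss = cross_entropy(p_fake, y=1)
print(0.0 < p_fake < 1.0, loss > 0.0)
```

In training, this loss would be averaged over a batch and minimized with respect to all learnable parameters.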

Datasets
To evaluate the effectiveness of the proposed MCAN, we conduct experiments on two public real-world datasets, which are collected from Twitter and Weibo, respectively. The Twitter dataset was released for the Verifying Multimedia Use task at MediaEval (Boididou et al., 2016). The Weibo dataset was collected in prior work. In the Weibo dataset, the real news is verified by an authoritative news agency in China, Xinhua News Agency. The fake news is verified by the official rumor debunking system of Weibo. The tweets in each dataset contain texts, attached images or videos, and social context information. In this work, we focus on text and image information, so we remove the tweets with videos and the tweets without texts or images. In the Twitter dataset, the remaining tweets share 512 images. When preprocessing the Weibo dataset, we follow steps similar to those used in prior work. We keep the same data split scheme as the benchmark on these two datasets. The detailed statistics of the two datasets are listed in Table 1.

Experimental Settings
The maximum text length is 25 on Twitter and 160 on Weibo. The hidden sizes of "s-fc", "f-fc", and "t-fc" are 256. We set $d = 256$, $m = 4$, and $d_{ff} = 512$. The hidden size of "p-fc" is 35. The parameters of VGG-19 and BERT are frozen when training on the Twitter dataset to avoid overfitting, but are fine-tuned on the Weibo dataset. The BERT model used on the Twitter dataset is the multilingual cased BERT-base model, and the one used on the Weibo dataset is the Chinese BERT-base model. Our proposed model is trained for 100 epochs with early stopping. We use Adam (Kingma and Ba, 2014) and AdaBelief (Zhuang et al., 2020) as optimizers on the Twitter and Weibo datasets, respectively. The optimal hyperparameters of our model are determined by grid search, with accuracy as the selection criterion. The hyperparameters of the baselines are the same as those in their respective studies.

Baselines
To validate the effectiveness of MCAN, we choose two categories of baseline models: unimodal models and multimodal models.

Performance Comparison
Table 2 shows the results of the baselines and our proposed model on the two datasets. We can observe that the proposed MCAN outperforms all the baselines over all metrics across the two datasets.
There are many similar trends on the two datasets. MCAN-A performs better than most unimodal models, which indicates that adding features usually improves model performance, although not always. For example, Text on the Weibo dataset is better than MCAN-A. After adding the process of multimodal fusion, our proposed MCAN beats MCAN-A and the other multimodal models, which shows that our proposed feature fusion method is indeed better than simple concatenation.
There are also some differences between the two datasets. The performance of Text (BERT) and Spatial (VGG-19) on the Weibo dataset is much better than that on the Twitter dataset. The reason is related to the datasets themselves. On the Weibo dataset, the average length of a tweet is about 10 times that of a tweet on the Twitter dataset. On the Weibo dataset, the accuracies of fine-tuned BERT and VGG-19 both exceed 85%. In this case, our proposed MCAN further improves the accuracy to close to 90% with the help of the cascaded stacking of CA layers. Comparing with the situation on the Twitter dataset, we find that our model also performs well when the unimodal features are weak. In our MCAN model, the representation ability of each feature can be greatly improved by effectively fusing it with the other features.

Ablation Analysis
Quantitative Analysis. To evaluate the effectiveness of each component of the proposed MCAN, we remove each one from the entire model for comparison. "ALL" denotes the entire MCAN model with all components, including the spatial-domain representation (S), the textual representation (T), the frequency-domain representation (F), and the co-attention layers (A). After removing each one of them, we obtain the sub-models "-S", "-T", "-F", and "-A", respectively. "-F-A" denotes the reduced MCAN without both the frequency-domain representation and the co-attention layers. The results are exhibited in Figure 5. We can see that every component plays a significant role in improving the performance of MCAN. MCAN beats MCAN-F, which reveals that the frequency-domain information is indeed helpful for detecting fake news. On the Twitter dataset, the contribution of textual representations to the entire model is less than that of visual representations, while the situation on the Weibo dataset is the opposite. This is again due to the data imbalance and the shorter average length of a tweet on the Twitter dataset, which decrease the performance of the textual representation. Besides, on the Weibo dataset, removing one or two components does not reduce the performance of MCAN as significantly as on the Twitter dataset. This benefits from the balanced data distribution and the stability of fine-tuned BERT and VGG-19, as mentioned in Section 4.4.
Qualitative Analysis. To illustrate the effectiveness of the co-attention layers in MCAN, we qualitatively visualize the joint representation of three modalities learned by MCAN-A and the fused representation $R^{(4)}_C$ learned by MCAN on the Weibo testing set with t-SNE (Maaten and Hinton, 2008), as shown in Figure 6. The label of each tweet is real or fake.
From Figure 6, we can observe that the separability of the feature representation learned by MCAN is much better than that of its reduced model MCAN-A. MCAN-A can learn discriminable features, but many samples are still easily misclassified, as shown in Figure 6(a). In contrast, the features learned by MCAN are more discriminable, with a clearer gap between the two types of samples, as exhibited in Figure 6(b). This is attributed to the cascaded stacking of co-attention layers in MCAN, which fuses the characteristics of multiple modalities deeply and helps distinguish fake news from real news.
From the above phenomena, we can conclude that the proposed MCAN learns better and more distinctive feature representations with the co-attention layers, thus achieving better performance.

Case Studies
To further illustrate the importance of multimodal features for fake news detection, we compare the results reported by MCAN and unimodal models (Text and Spatial) and exhibit some fake news correctly captured by MCAN but missed by unimodal models.
Figure 7 example texts: "Before being washed away by a flood, an Indian man calmly gave a last gesture to a photographer." and "A group of dolphins brought a dog that fell into a canal to a safe area."
Figure 7 shows two high-confidence tweets successfully detected by MCAN but missed by the text-only MCAN. The textual contents of the two examples provide little evidence that they are fake news. However, the two attached images appear to be forged.
Figure 8 example texts: "The water mantis lives in sewers. Its head has two to three times the poison of pufferfish and has no antidote." and "Several urban management officers are frantically plundering streetside property worth more than 100 million yuan."
In Figure 8, the two examples are detected by MCAN but missed by Spatial. The attached images in the two examples look normal. However, the words in the tweets seem exaggerated and unbelievable. These examples are challenging for the spatial-domain-only MCAN, but with multimodal features, our MCAN model identifies them correctly.
These comparative cases show that when a single-modal model, whether text-based or image-based, cannot correctly distinguish fake news, the proposed MCAN using multimodal features can still identify it with high confidence.

Conclusions
In this work, we propose a novel Multimodal Co-Attention Networks (MCAN) to tackle the challenge of fusing multimodal (textual and visual) features for fake news detection. We utilize three different sub-networks to extract features from the text, the spatial domain, and the frequency domain, respectively. Then the three features are deeply fused by stacking co-attention layers, which is inspired by human behavior: when people read news with an image, the image and the text are read once or multiple times and continuously fused in the brain. Experiments on two public benchmark datasets for fake news detection validate the effectiveness of MCAN, and the results show that MCAN outperforms the current state-of-the-art methods. In the future, we plan to extend the co-attention based fusion approach in MCAN to other fake news research, such as fake news diffusion.