Multimodal Sentiment Detection Based on Multi-channel Graph Neural Networks

With the popularity of smartphones, we have witnessed the rapid proliferation of multimodal posts on various social media platforms. We observe that the multimodal sentiment expression has specific global characteristics, such as the interdependencies of objects or scenes within the image. However, most previous studies only considered the representation of a single image-text post and failed to capture the global co-occurrence characteristics of the dataset. In this paper, we propose Multi-channel Graph Neural Networks with Sentiment-awareness (MGNNS) for image-text sentiment detection. Specifically, we first encode different modalities to capture hidden representations. Then, we introduce multi-channel graph neural networks to learn multimodal representations based on the global characteristics of the dataset. Finally, we implement multimodal in-depth fusion with the multi-head attention mechanism to predict the sentiment of image-text pairs. Extensive experiments conducted on three publicly available datasets demonstrate the effectiveness of our approach for multimodal sentiment detection.


Introduction
The tasks of extracting and analyzing sentiments embedded in data have attracted substantial attention from both academic and industrial communities (Zhang et al., 2018; Yue et al., 2018). With the increased use of smartphones and the bloom of social media such as Twitter, Tumblr and Weibo, users can post multimodal tweets (e.g., text, image, and video) about diverse events and topics to convey their feelings and emotions. Therefore, multimodal sentiment analysis has become a popular research topic in recent years (Kaur and Kautish, 2019; Soleymani et al., 2017). As shown in Fig. 1, sentiment is no longer expressed by a pure modality in the multimodal scenario but rather by the combined expressions of multiple modalities (e.g., text, image, etc.). In contrast to unimodal data, multimodal data consist of more information and make the user's expression more vivid and interesting.

Figure 1: Two posts express the user's positive sentiment from multimodal data that has global characteristics, including the "have a fun/nice day" phrase, the ocean scene, and the beach scene.
We focus on multimodal sentiment detection for image-text pairs in social media posts. The mismatch between image and text, together with flaws in social media data such as informality, typos, and a lack of punctuation, poses a fundamental challenge for the effective representation of multimodal data in the sentiment detection task. To tackle this challenge, Xu (2017) and Xu and Mao (2017) constructed different networks for multimodal sentiment analysis, namely a Hierarchical Semantic Attentional Network (HSAN) and a Multimodal Deep Semantic Network (MDSN). Xu et al. (2018) and Yang et al. (2020) proposed a Co-Memory network (Co-Mem) and a Multi-view Attentional Network (MVAN), respectively, introducing memory networks to realize the interaction between modalities.
The above methods treat each image-text post in the dataset as a single instance, and feature dependencies across instances are neglected or modeled only implicitly. In fact, social media posts have specific global co-occurring characteristics, i.e., co-occurring words, objects, or scenes, which tend to share similar sentiment orientations and emotions. For example, the co-occurrences of the words "have a fun/nice day" and of the bright scenes "ocean/beach" in the two images in Fig. 1 imply a strong relationship between these features and positive sentiment. How to make more effective use of feature co-occurrences across instances and capture the global characteristics of the data remains a great challenge.
We propose a Multi-channel Graph Neural Networks model with Sentiment-awareness (MGNNS) for multimodal sentiment analysis that consists of three stages.
(i) Feature extraction. For the text modality, we encode the text and obtain a text memory bank; for the image modality, we first extract objects and scenes and then capture the image's semantic features from a multi-view perspective.
(ii) Feature representation. We employ a Graph Neural Network (GNN) for the text modality based on globally shared matrices, i.e., one text graph based on word co-occurrence is built over the whole dataset. Specifically, we first connect word nodes within an appropriately small window in the text; we then update each node's representation from the node itself as well as its neighboring nodes. For the image modality, it is believed that different views of an image, such as the beach (Scene view) and person (Object view) in Fig. 1(a), can reflect a user's emotions (Xu and Mao, 2017). The existing literature usually models the relationship between the scenes and objects within a single image, failing to capture the rich co-occurrence information available from the perspective of the whole dataset. In contrast, we explicitly build two graphs for scenes and objects according to their co-occurrences in the dataset and propose Graph Convolutional Network (GCN) models over the two graphs to represent the images. In general, to tackle the isolated-feature problem, we build multiple graphs for the different modalities, with each GNN acting as a channel, and propose a Multi-channel Graph Neural Networks (Multi-GNN) module to capture the in-depth global characteristics of the data. This multi-channel method can provide complementary representations from different sources (George and Marcel, 2021; George et al., 2019; Islam et al., 2019).
(iii) Feature fusion. Previous studies usually directly concatenate multimodal representations without considering multimodal interactions (Wang et al., 2020a; Xu, 2017; Xu and Mao, 2017). In this stage, we realize the pairwise interaction of the text and image modalities from different channels through a Multimodal Multi-head Attention Interaction (MMAI) module and obtain the fused representation.
Our main contributions are summarized as follows: • We propose a novel MGNNS framework that models the global characteristics of the dataset to handle the multimodal sentiment detection task. To the best of our knowledge, we are the first to apply GNN to the image-text multimodal sentiment detection task.
• We construct the MMAI module from different channels to realize in-depth multimodal interaction.
• We conduct extensive experiments on three publicly available datasets, and the results show that our model outperforms the state-of-the-art methods.
2 Related Work

Multimodal Sentiment Analysis
For convenience, multimodal polarity analysis and emotion analysis are unified here under multimodal sentiment analysis. Traditional machine learning methods have been adopted to address the multimodal sentiment analysis task (Pérez-Rosas et al., 2013; You et al., 2016). Recently, deep learning models have also achieved promising results on this task. For video data, Wang et al. (2020b) proposed a novel method, TransModality, to fuse multimodal features with end-to-end translation models; another line of work leveraged semi-supervised variational autoencoders to mine more information from unlabeled data; and Hazarika et al. (2020) constructed a novel framework, MISA, which projects each modality into two distinct subspaces: a modality-invariant and a modality-specific subspace. There is a massive amount of image-text data on social platforms, and thus image-text multimodal sentiment analysis has attracted the attention of many researchers. Xu et al. constructed different networks for multimodal sentiment analysis: HSAN (2017), MDSN (2017), and Co-Mem (2018). Yang et al. (2020) built an image-text emotion dataset, named TumEmo, and further proposed MVAN for multimodal emotion analysis.

Graph Neural Network
The Graph Neural Network (GNN) has achieved promising results in text classification, multi-label recognition, and multimodal tasks. For text classification, the GNN and its variants have developed rapidly, and their performance surpasses that of traditional methods; examples include Text GCN (Yao et al., 2019), TensorGCN (Liu et al., 2020), and TextLevelGNN (Huang et al., 2019). The GCN has also been introduced into the multi-label image recognition task to model label dependencies (Chen et al., 2019).
Recently, Graph Convolutional Networks have been applied to different multimodal tasks, such as visual dialog (Guo et al., 2020; Khademi, 2020), multimodal fake news detection (Wang et al., 2020a), and Visual Question Answering (VQA) (Hudson and Manning, 2019; Khademi, 2020). Jiang et al. (2020) applied a novel Knowledge-Bridge Graph Network (KBGN) to model the relations among cross-modal information in visual dialogue at a fine granularity. Wang et al. (2020a) proposed a novel Knowledge-driven Multimodal Graph Convolutional Network (KMGCN) to model semantic representations for fake news detection; however, the KMGCN extracts visual words as its visual information and does not make full use of the global information of the image. Khademi (2020) introduced a new neural network architecture, the Multimodal Neural Graph Memory Network (MN-GMN), for VQA; this model constructs a visual graph based on bounding boxes, whose overlapping regions may provide redundant information.
For the image-text dataset, we found that certain words often appear in a text post simultaneously, and different objects or scenes within an image have specific co-occurrences that indicate certain sentiments. We explicitly model these global characteristics of the dataset through the use of a multichannel GNN.
3 Proposed Model

Fig. 2 illustrates the overall architecture of our proposed MGNNS model for multimodal sentiment detection, which consists of three modules: the encoding module, the Multi-GNN module, and the multimodal interaction module. We first encode the text and image inputs into hidden representations. Then, we introduce GNNs over different channels to learn multiple modal representations. In this paper, the channels are the Text-GNN (TG) module, the Image-GCN-Scene (IGS) module, and the Image-GCN-Object (IGO) module. Finally, we realize in-depth interactions between the different modalities through multimodal multi-head attention.

Problem Formalization
The goal of our model is to identify which sentiment is expressed by an image-text post. Given a set of $N$ multimodal posts $P = \{(T_1, V_1), \ldots, (T_N, V_N)\}$ from social media, where $T_i$ is the text modality and $V_i$ is the corresponding visual information, we need to learn a model $f: P \rightarrow L$ to classify each post $(T_i, V_i)$ into the predefined categories $L_i$. For polarity classification, $L_i \in \{Positive, Neutral, Negative\}$; for emotion classification, $L_i \in \{Angry, Bored, Calm, Fear, Happy, Love, Sad\}$.

Encoding
For the text modality, we first encode the words with GloVe (Pennington et al., 2014) to obtain embedding vectors and then obtain the text memory bank $M_t$ with a BiGRU (Cho et al., 2014):

$M_t = \mathrm{BiGRU}(\mathrm{GloVe}(T))$, (1)

where $T$ is a text sequence, $L_t$ is the maximum length of a padded text sequence, and $d_t$ is the dimension of the hidden units in the BiGRU.
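As a concrete illustration, the text encoding stage can be sketched in PyTorch as below. The module name, vocabulary size, and hidden dimension are illustrative assumptions; only the GloVe-initialized embedding followed by a BiGRU memory bank comes from the description above.

```python
# Sketch of the text encoder: GloVe-initialized embeddings feed a BiGRU whose
# per-token hidden states form the text memory bank M_t. Names and sizes
# (TextEncoder, vocab_size, hidden_dim) are illustrative, not from the paper.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256):
        super().__init__()
        # In practice this weight would be initialized from pretrained GloVe.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, L_t), padded to the maximum text length
        emb = self.embedding(token_ids)      # (batch, L_t, emb_dim)
        memory_bank, _ = self.bigru(emb)     # (batch, L_t, 2 * hidden_dim)
        return memory_bank

encoder = TextEncoder(vocab_size=10000)
m_t = encoder(torch.randint(0, 10000, (4, 50)))
print(m_t.shape)  # torch.Size([4, 50, 512])
```

Because the GRU is bidirectional, each token's memory slot concatenates the forward and backward hidden states.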
For the image modality, we extract image features from both the object and scene views to capture sufficient information. We believe that there are interdependencies between the different objects or scenes in an image. To explicitly model these co-occurrences, we first extract objects $O = \{o_1, \ldots, o_{l_o}\}$ with YOLOv3 (Redmon and Farhadi, 2018) and scenes $S = \{s_1, \ldots, s_{l_s}\}$ with VGG-Place (Zhou et al., 2017). Finally, we obtain the object and scene memory banks with a pretrained ResNet (He et al., 2016). Thus, if an input image $V$ has a 448×448 resolution and is split into 14×14 = 196 visual blocks of the same size, then each block is represented by a 2,048-dimensional vector.
Note that we delete the stopwords during data preprocessing, so that in the example "We have a fun day on the beach!" in Fig. 2, the words "a" and "the" have no connections.

Multi-channel Graph Neural Networks
In this subsection, we present our proposed Multi-GNN module. As Fig. 2 shows, this module consists of the TG channel (middle), the IGO channel (right), and the IGS channel (left).
Text GNN: As shown in the middle of Fig. 2, motivated by (Huang et al., 2019), we learn the text representation through a text-level GNN. For a text with $l_t$ words, $T = \{w_1, \ldots, w_k, \ldots, w_{l_t}\}$, the $k$-th word $w_k$ is initialized by a GloVe embedding $r^t_k \in \mathbb{R}^d$, $d = 300$. We build the graph of the text based on the vocabulary of the training dataset, defined as follows:

$N_t = \{w_k \mid k \in [1, l_t]\}$,
$E_t = \{e^t_{k,j} \mid k \in [1, l_t];\ j \in [k - ws, k + ws]\}$,

We build edges between $w_k$ and $w_j$ when the number of co-occurrences of the two words is not less than 2.
where $N_t$ and $E_t$ are the sets of nodes and edges of the text graph, respectively. The word representations in $N_t$ and the edge weights in $E_t$ are taken from globally shared matrices built on the vocabulary and the edge set of the dataset, respectively. That is, the representations of the same nodes and the weights of the same edges are shared globally. $e^t_{k,j}$ is initialized by point-wise mutual information (PMI) (Wang et al., 2020a) and is learned during training. $ws$ is the sliding-window-size hyperparameter, which indicates how many adjacent nodes are connected to each word in the text graph. Then, we update each node representation from its original representation and its neighboring nodes by the message passing mechanism (MPM) (Gilmer et al., 2017), defined as follows:

$A^t_k = \max\left(e^t_{k,j} r^t_j\right),\ j \in [k - ws, k + ws]$,
$r^t_k = \alpha r^t_k + (1 - \alpha) A^t_k$,

where $A^t_k \in \mathbb{R}^d$ is the information aggregated from the neighboring nodes from node $k - ws$ to $k + ws$, and $\max$ is the reduction function. $\alpha$ is a trainable variable that indicates how much of the node's original information should be kept, and $r^t_k \in \mathbb{R}^d$ is the updated representation of node $k$.
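A minimal sketch of how the globally shared edge set might be built is shown below; the function name and toy corpus are hypothetical, while the sliding window $ws$ and the "at least 2 co-occurrences" filter follow the description above.

```python
# Toy sketch of building the globally shared text-graph edges: within each
# post, words within a +/- ws window are linked, and a pair is kept as an
# edge only if it co-occurs at least twice across the dataset.
from collections import Counter

def build_edge_vocab(corpus, ws=2, min_count=2):
    pair_counts = Counter()
    for tokens in corpus:
        for k, word in enumerate(tokens):
            # connect word k to the next ws words (undirected, counted once)
            for j in range(k + 1, min(len(tokens), k + ws + 1)):
                pair_counts[tuple(sorted((word, tokens[j])))] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= min_count}

corpus = [["have", "fun", "day", "beach"],
          ["have", "nice", "day", "ocean"],
          ["have", "fun", "day", "ocean"]]
edges = build_edge_vocab(corpus, ws=2)
print(edges[("fun", "have")])  # "have"/"fun" co-occur twice -> kept as an edge
```

In the actual model, each surviving pair would additionally carry a trainable weight initialized from its PMI score.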
Finally, we aggregate the updated node representations to obtain the new representation of the text $T$.

Image GCN: In this module, we explicitly model the interdependence within the $l_x$ scenes or objects by IGX, as shown on the left and right sides of Fig. 2, respectively. The graph of the image is defined as follows:

$G_x = (N_x, E_x)$,

where $N_x \in \mathbb{R}^{C_x}$ is the set of nodes of IGX; $x$ or $X \in \{Object, Scene\}$; $C_x = 80$ when $x = Object$, and $C_x = 365$ when $x = Scene$.
To build the edges of IGX, we first build a globally shared co-occurrence matrix based on the dataset:

$E_x \in \mathbb{R}^{C_x \times C_x}$,

where $E_x$ is the co-occurrence matrix, and the edge weight $e^x_{p,q}$ indicates the number of co-occurrences of $x_p$ and $x_q$ in the dataset.
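The global co-occurrence matrix $E_x$ can be sketched as follows; the small $C_x$ and the per-image label lists are toy assumptions.

```python
# Toy sketch of the globally shared co-occurrence matrix E_x: for every image,
# each pair of detected labels (objects or scenes) increments a count. In the
# paper C_x is 80 for objects (COCO) or 365 for scenes (Places); 5 is a toy size.
import numpy as np

C_x = 5
dataset_labels = [[0, 1], [0, 1, 3], [2, 4], [1, 3]]  # detected labels per image

E = np.zeros((C_x, C_x), dtype=np.int64)
for labels in dataset_labels:
    for p in labels:
        for q in labels:
            if p != q:
                E[p, q] += 1
print(E[0, 1], E[1, 3])  # labels 0,1 co-occur in two images; so do 1,3
```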
Then, we calculate the conditional probability for node $p$ as follows:

$P^x_{p,q} = e^x_{p,q} / N^x_p$,

where $N^x_p$ denotes the number of occurrences of $x_p$ in the dataset. Note that $P^x_{p,q} \neq P^x_{q,p}$. As mentioned by (Chen et al., 2019), this simple correlation may suffer from several drawbacks, so we further build a binary co-occurrence matrix:

$B^x_{p,q} = \begin{cases} 1, & P^x_{p,q} \geq \beta \\ 0, & \text{otherwise} \end{cases}$,

where $\beta$ is the hyperparameter used to filter noisy edges. Obviously, the role of the central node differs from that of the neighboring nodes, so we further calculate the weight of each edge:

$R^x_{p,q} = \begin{cases} \gamma\, B^x_{p,q} / \sum_{q' \neq p} B^x_{p,q'}, & p \neq q \\ 1 - \gamma, & p = q \end{cases}$,

where $R_x \in \mathbb{R}^{C_x \times C_x}$ is the weighted co-occurrence matrix, and the hyperparameter $\gamma$ indicates the importance of the neighboring nodes. Finally, we input the nodes $N_x$ and edges $R_x$ of the image into the graph convolutional network. As in (Kipf and Welling, 2016), every layer can be calculated as follows:

$H^{l+1} = h(R_x H^l W^l)$,

where $H^l$ denotes the node representations at layer $l$, $W^l$ is a layer-specific trainable weight matrix, and $h(\cdot)$ is a non-linear activation function.

Figure 3: The MMAI module illustrates the process of multimodal interaction from four channels, $X \in \{Object, Scene\}$. We take the interaction process between the text and image-scene channels as an example for convenience. The dotted arrows are the outputs of the other two channels after their interactions.
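The chain from co-occurrence counts to the re-weighted adjacency and one GCN propagation step can be sketched as below; the re-weighting mirrors the scheme of Chen et al. (2019), and the exact form together with the toy sizes is an assumption.

```python
# Sketch of turning co-occurrence counts into the weighted adjacency R_x and
# applying one GCN layer. The re-weighting follows Chen et al. (2019)-style
# smoothing; exact form and toy dimensions are assumptions.
import numpy as np

def build_adjacency(E, occurrence, beta=0.3, gamma=0.2):
    # Conditional probability P[p, q] ~ P(x_q | x_p).
    P = E / np.maximum(occurrence[:, None], 1.0)
    B = (P >= beta).astype(float)                 # beta filters noisy edges
    np.fill_diagonal(B, 0.0)
    neighbor_sum = np.maximum(B.sum(axis=1, keepdims=True), 1.0)
    R = gamma * B / neighbor_sum                  # neighbors share weight gamma
    np.fill_diagonal(R, 1.0 - gamma)              # the central node keeps 1-gamma
    return R

def gcn_layer(H, R, W):
    # One propagation step H' = h(R H W) with a ReLU-style nonlinearity h.
    return np.maximum(R @ H @ W, 0.0)

E = np.array([[0, 2, 0], [2, 0, 1], [0, 1, 0]], dtype=float)
occ = np.array([2.0, 3.0, 1.0])       # occurrences of each label in the dataset
R = build_adjacency(E, occ)
H1 = gcn_layer(np.random.randn(3, 4), R, np.random.randn(4, 4))
print(R[0, 0], H1.shape)  # 0.8 (3, 4)
```

Stacking several such layers lets information flow over multi-hop co-occurrence paths.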
By stacking multiple GCN layers, we can explicitly learn and model the complex interdependencies among the nodes and thereby obtain the image representation $I_x$ with object or scene dependencies. However, this representation cannot capture the relationship between nodes and sentiments. Therefore, we learn a sentiment-aware image representation through multi-head attention (Vaswani et al., 2017),
where $S \in \mathbb{R}^{l_s \times d}$ is a sentiment embedding matrix built on the label set ($l_s = 3$ for polarity classification and $l_s = 7$ for emotion classification); $K = V = I_x W_I$, with $W_I \in \mathbb{R}^{C_x \times d_{model}}$ and $K, V \in \mathbb{R}^{d_{model}}$.
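A sketch of the sentiment-awareness step with PyTorch's built-in multi-head attention is given below; the dimensions, head count, and projection layout are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the sentiment-awareness attention: rows of a learned sentiment
# embedding act as queries over the projected GCN node features (K = V).
# All sizes and module names here are illustrative assumptions.
import torch
import torch.nn as nn

d_model, n_labels, C_x = 128, 3, 80   # 3 labels for polarity, 7 for emotion
sent_emb = nn.Parameter(torch.randn(n_labels, d_model))  # sentiment matrix S
project = nn.Linear(2048, d_model)                       # plays the role of W_I
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

nodes = torch.randn(1, C_x, 2048)   # GCN output I_x for one image
kv = project(nodes)                 # K = V = I_x W_I
q = sent_emb.unsqueeze(0)           # (1, n_labels, d_model)
out, weights = attn(q, kv, kv)      # sentiment-aware image representation
print(out.shape)  # torch.Size([1, 3, 128])
```

Each output row summarizes the image nodes from the viewpoint of one sentiment label.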

Multimodal Interaction
Motivated by the Transformer (Vaswani et al., 2017), we design a Multimodal Multi-head Attention Interaction (MMAI) module that can effectively learn the interaction between the text modality and the image modality over multiple channels, as shown in Fig. 3.
We employ the MMAI to obtain the Text-guided Image-X representations and the Image-X-guided Text representations, $X \in \{Object, Scene\}$. Each attention layer applies multi-head attention followed by layer normalization $LN(\cdot)$ and a feed-forward network $FFN(\cdot)$. For the Text-guided Image-X attention, when $N = 1$, $H^{TgX}_1 = T$, as in Eq. 7; for the Image-X-guided Text attention, when $N = 1$, $H^{XgT}_1 = I_x$. The fused multimodal representation is

$R_m = H^{TgO} \oplus H^{OgT} \oplus H^{TgS} \oplus H^{SgT}$,

where $\oplus$ is a concatenation operation.
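One MMAI channel can be sketched as a standard Transformer-style cross-attention block, as below. Which modality serves as the query in each "guided" direction, and all names and sizes, are illustrative assumptions; the full model concatenates the outputs of all four channels.

```python
# Sketch of one MMAI channel: multi-head cross attention + LayerNorm + FFN,
# in the Transformer style. Names, sizes, and the query/context assignment
# are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, d_model=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, query, context):
        # The query attends over the other modality's representation.
        h, _ = self.attn(query, context, context)
        h = self.ln1(query + h)           # residual + LayerNorm
        return self.ln2(h + self.ffn(h))  # FFN sub-layer

text = torch.randn(2, 50, 128)      # text representation T
image_x = torch.randn(2, 80, 128)   # Image-X representation I_x
block = CrossAttentionBlock()
x_guided_t = block(text, image_x)   # Image-X-guided Text
t_guided_x = block(image_x, text)   # Text-guided Image-X
r_m = torch.cat([t_guided_x.mean(1), x_guided_t.mean(1)], dim=-1)
print(r_m.shape)  # torch.Size([2, 256])
```

Stacking the block N times, as in the paper, simply feeds each output back in as the next query.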

Sentiment Detection
Finally, we feed the above fused representation $R_m$ into a top fully connected layer and employ the softmax function for sentiment detection:

$\hat{y} = \mathrm{softmax}(w_s R_m + b_s)$,

where $w_s$ and $b_s$ are the parameters of the fully connected layer.

Experiments
We conduct experiments on three multimodal sentiment datasets from social media platforms, MVSA-Single, MVSA-Multiple (Niu et al., 2016), and TumEmo (Yang et al., 2020), and compare our MGNNS model with a number of unimodal and multimodal approaches.

Datasets
MVSA-Single and MVSA-Multiple are two image-text sentiment datasets of different scales crawled from Twitter (https://twitter.com). TumEmo is a weakly supervised multimodal emotion dataset containing a large amount of image-text data crawled from Tumblr. The statistics of these datasets are given in Appendix A; for a fair comparison, we adopt the same data preprocessing method as Yang et al. (2020). The corresponding details are shown in Appendix B.

Experimental Setup
We adopt the cross-entropy loss function and the Adam optimizer. In the process of extracting objects and scenes, we keep the objects with probability greater than 0.5 and the top-5 scenes, respectively. The learning rate is 4e-5 for MVSA-* and 5e-5 for TumEmo; the other parameters, including the window size ws, are listed in Table 2, where * ∈ {Single, Multiple}. We use Accuracy (Acc) and F1-score (F1) as evaluation metrics. All models are implemented with PyTorch.

Baselines
We compare our model with multimodal sentiment models that use the same modalities, as well as with unimodal baseline models.
Unimodal Baselines: For the text modality, CNN (Kim, 2014) and Bi-LSTM (Zhou et al., 2016) are well-known models for text classification tasks; BiACNN (Lai et al., 2015) combines CNN and BiLSTM with an attention mechanism for text sentiment analysis; and TGNN (Huang et al., 2019) is a text-level graph neural network for text classification. Multimodal Baselines: MDSN (Xu and Mao, 2017) is a deep semantic network with attention for multimodal sentiment analysis. Co-Mem (Xu et al., 2018) is a co-memory network for iteratively modeling the interactions between multiple modalities. MVAN (Yang et al., 2020) is a multi-view attentional network that utilizes a memory network for multimodal emotion analysis; it achieves state-of-the-art performance on image-text multimodal sentiment classification tasks.

Experimental Results and Analysis
The experimental results of the baseline methods and our model are shown in Table 3, where MGNNS denotes our model based on multi-channel graph neural networks.
We can make the following observations. First, our model (MGNNS) is competitive with the other strong baseline models on all three datasets. Note that the data distribution of MVSA-* is extremely unbalanced; thus, we reproduce the MVAN model with the Acc and Weighted-F1 metrics instead of the Micro-F1 metric used in the original paper, which is more realistic. Second, the multimodal sentiment analysis models perform better than most of the unimodal sentiment analysis models on all three datasets. Moreover, sentiment indicators are difficult to capture in images owing to their low information density, so sentiment analysis on the image modality alone achieves the worst results. Finally, the TGNN unimodal model outperforms the HSAN multimodal model, indicating that the GNN has excellent performance in sentiment analysis.

Ablation Experiments
We conduct ablation experiments on the MGNNS model to demonstrate the effectiveness of its different modules. When the MMAI module is replaced with co-attention (+CoAtt), model performance is slightly worse than that of the full MGNNS. This further illustrates the importance of multimodal interactions and the superiority of the MMAI module. When either the object view (w/o Object) or the scene view (w/o Scene) is removed, the performance of the model declines, which indicates that both views of the image are effective for multimodal sentiment analysis.

Transferability Experiment
In the Multi-GNN module, we build multiple graphs for different modalities based on the dataset, so the graphs built for different datasets differ. However, can a graph captured from one dataset (e.g., MVSA-Single) have positive effects on another dataset (e.g., TumEmo)?
In this subsection, we verify the transferability of the model through experiments. As Table 5 shows, the following conclusions can be drawn. (i) Regardless of the modality, the experimental results based on graphs transferred from other datasets are worse than those based on the graph constructed from a dataset's own data, mainly because each dataset has unique global characteristics. (ii) However, owing to the commonalities between datasets when expressing the same emotions, the transferred models are not uniformly worse; for example, the same scenes and objects can appear in images from different datasets simultaneously. Therefore, graphs from different datasets have some transferability and can be used for other datasets. (iii) Across datasets, the experimental results of "X2Y-Text" are worse than those of "X2Y-Image"; that is, the text graph has worse transferability. The reason may be that text graphs with different node sets are created from the vocabularies of different datasets. Two situations in a transferred text graph seriously affect the results: fewer nodes lose information, while more nodes provide redundant information. (iv) When the gap between datasets is relatively wide, the transferability of text graphs is even worse. For example, transferring from the larger datasets to the smallest dataset (T2S-Text and M2S-Text) causes drops of 2.45% and 2.69%, respectively, while transferring from the smaller datasets to the largest dataset (S2T-Text and M2T-Text) causes significant drops of 4.81% and 4.09%, respectively.

Hyperparameter Settings
Hyperparameter ws: To obtain adequate information from neighboring nodes in the TGNN, we conduct experiments under different settings of the hyperparameter ws in Eq. 4; the results are shown in Fig. 4. The best ws varies among the datasets since the average text length of TumEmo is longer than that of the other datasets. The TGNN cannot obtain sufficient information from neighboring nodes when ws is too small, while larger values may degrade performance owing to the redundant information provided by neighboring nodes.

Table 5: Transferability experiment results (Acc and F1) on different datasets. S, M, and T denote MVSA-Single, MVSA-Multiple, and TumEmo, respectively. For modality "Z", "X2Y-Z" represents that the graph built on the "X" dataset is transferred to the "Y" dataset, where Z ∈ {Text, Image} and X, Y ∈ {MVSA-Single, MVSA-Multiple, TumEmo}. For example, "M2S-Text" means that the text graph built on the MVSA-Multiple dataset is transferred to the MVSA-Single dataset.
Hyperparameter β: We vary the value of the hyperparameter β in Eq. 11 for the binary co-occurrence matrix from different views; the results are shown in Fig. 5. We find that the best β differs across views and datasets. For MVSA-*, a smaller β preserves more edges and thereby captures more information, since the scene co-occurrence matrix is sparser than that of the object view. For TumEmo, with its large amount of data, preserving the top-5 scenes produces many noisy edges, so the value of scene-β is greater than that for MVSA-*.

Hyperparameter γ: As Fig. 6 shows, the model achieves the best performance on all three datasets when γ is 0.2. When γ is smaller, the neighboring nodes do not receive enough attention; when it is larger, the node's own information is not fully utilized.

Conclusions
This paper proposes a novel model, MGNNS, built on the global characteristics of the dataset for multimodal sentiment detection tasks. As far as we know, this is the first application of graph neural networks to image-text multimodal sentiment analysis. The experimental results on publicly available datasets demonstrate that our proposed model is competitive with strong baseline models.
In future work, we plan to construct a model that combines the advantages of the GNN with pretrained models such as BERT and VisualBERT. We also want to design a reasonable algorithm to characterize the quality of the objects and scenes selected from the image and further improve the representation ability of the model.

References

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207-212.

A Dataset
A.1 MVSA-Single and MVSA-Multiple

The statistics for the MVSA-Single and MVSA-Multiple datasets are listed in Table 1.

A.2 TumEmo
The statistics for the TumEmo dataset are listed in Table 2, containing a large number of image-text posts labeled by emotion.

B Preprocessing Data
The text data contain many characters that are useless for sentiment analysis, such as URLs, stopwords, and punctuation. We therefore preprocess the text data to enhance the effectiveness of multimodal emotion detection, as follows:
• remove URLs, as in "http://...";
• remove stopwords, such as "a", "an", and "the";
• remove useless punctuation, including periods, commas, semicolons, etc.;
• remove hashtags and their content (#content); in particular, the TumEmo dataset uses #emotion as a weakly supervised label;
• remove the posts for which the text length is less than 3.