Joint Multi-modal Aspect-Sentiment Analysis with Auxiliary Cross-modal Relation Detection

Aspect term extraction (ATE) and aspect sentiment classification (ASC) are two fundamental, fine-grained sub-tasks of aspect-level sentiment analysis (ALSA). In textual analysis, jointly extracting aspect terms and their sentiment polarities has drawn much attention because the joint formulation serves practical applications better than either sub-task alone. In the multi-modal scenario, however, existing studies handle each sub-task independently, which fails to model the innate connection between the two objectives and forgoes these applications. Therefore, in this paper, we are the first to jointly perform multi-modal ATE (MATE) and multi-modal ASC (MASC), and we propose a multi-modal joint learning approach with auxiliary cross-modal relation detection for multi-modal aspect-level sentiment analysis (MALSA). Specifically, we first build an auxiliary text-image relation detection module to control the proper exploitation of visual information. Second, we adopt a hierarchical framework to bridge the multi-modal connection between MATE and MASC, providing separate visual guidance for each sub-module. Finally, we obtain all aspect-level sentiment polarities conditioned on the jointly extracted aspects. Extensive experiments show the effectiveness of our approach against joint textual approaches as well as pipeline and collapsed multi-modal approaches.


Introduction
Multi-modal aspect-level (aka target-oriented) sentiment analysis (MALSA) is an important and fine-grained task in multi-modal sentiment analysis (MSA). Previous studies normally cast MALSA in social media as two independent sub-tasks: Multi-modal Aspect Terms Extraction (MATE) and Multi-modal Aspect Sentiment Classification (MASC). First, MATE aims to detect the set of all potential aspect terms in a free text given its accompanying image (Wu et al., 2020a). Second, MASC aims to classify the sentiment polarity of a multi-modal post towards a given aspect in the textual modality (Yu and Jiang, 2019).
To better serve practical applications, aspect term-polarity co-extraction, which solves ATE and ASC simultaneously, has received much attention recently in the textual scenario (Wan et al., 2020; Chen and Qian, 2020b; Ying et al., 2020). However, to the best of our knowledge, jointly performing MATE and MASC, i.e., joint multi-modal aspect-sentiment analysis (JMASA), has never been investigated in the multi-modal scenario. For this joint multi-modal task, we believe at least the following challenges exist.
On the one hand, the visual modality may provide no clues for one of the sub-tasks. For example, in Figure 1(a), the image shows most of the content described in the text, yet we cannot infer from the image at first glance which team has the advantage. In contrast, a direct reading of the text (e.g., the word "rout") suffices to judge the sentiment towards "Spurs" and "Thunder". Thus this image does not add to the tweet's meaning (Vempala and Preotiuc-Pietro, 2019). Conversely, in Figure 1(b), the textual information is so limited that we cannot directly infer the sentiment towards an aspect, while the visual modality provides rich clues (e.g., contrasting facial expressions) that help predict the correct sentiment of "OBAMA". Therefore, a well-behaved approach should determine whether the visual information adds to the textual modality (cross-modal relation detection) and how much it contributes.
On the other hand, the two multi-modal sub-tasks have different characteristics: one is a sequence labeling problem, the other an aspect-dependent classification problem, and each appears to rely on different image information. For example, in Figure 1(b), for the first sub-task MATE, attending to coarse-grained concepts in the image (e.g., the silhouette of a human face, a Person label) is enough to help identify the name "OBAMA" in the text as an aspect. For the second sub-task MASC, we should attend to the details of specific regions (e.g., the different facial expressions) to judge the sentiment of the aspect "OBAMA" accurately. Therefore, a well-behaved approach should mine the visual information separately for the two sub-tasks rather than feed both the same visual input under a collapsed tagging scheme.
To handle the above challenges, we propose a multi-modal joint learning approach with auxiliary cross-modal relation detection, namely JML. Specifically, we first design an auxiliary cross-modal relation detection module to control whether the image adds to the text's meaning. Second, we leverage a joint hierarchical framework to attend separately to the effective visual information for each sub-task, instead of a collapsed tagging framework. Finally, we obtain all potential aspect term-polarity pairs. Extensive experiments and analysis on two multi-modal Twitter datasets show that our approach performs significantly better than text-based joint approaches and collapsed multi-modal joint approaches.

Related Work
In the past five years, text-based aspect-level sentiment analysis has drawn much attention (Chen and Qian, 2019; Zhang and Qian, 2020; Zheng et al., 2020; Tulkens and van Cranenburgh, 2020; Akhtar et al., 2020). Meanwhile, multi-modal target-oriented sentiment analysis has become more and more vital due to the urgent need for industrial applications (Akhtar et al., 2019; Zadeh et al., 2020; Sun et al., 2021a; Tang et al., 2019; Zhang et al., 2020b, 2021a). In the following, we mainly review the limited studies of multi-modal aspect terms extraction and multi-modal aspect sentiment classification over text and image modalities. Besides, we also introduce some representative studies of text-based joint aspect terms extraction and sentiment polarity classification.

Multi-modal Aspect Terms Extraction (MATE). Sequence labeling approaches are typically employed for this sub-task (Ma et al., 2019; Chen and Qian, 2020a; Karamanolakis et al., 2019), but bridging the gap between text and image is challenging. Several related studies, focused on named entity recognition, propose to leverage the whole image encoded by ResNet to augment each word representation, e.g., (Moon et al., 2018) upon RNN, (Yu et al., 2020b) upon Transformer, and (Zhang et al., 2021b) upon GNN. Besides, several related studies propose to leverage fine-grained visual information via object detection, such as (Wu et al., 2020a,b). However, all the above studies ignore sentiment polarity analysis dependent on the detected target, which greatly facilitates practical applications such as e-commerce. Different from them, we propose to jointly perform the corresponding sentiment classification besides aspect terms extraction in a multi-modal scenario. Note that our multi-modal joint learning approach improves the performance of both MATE and MASC.
Multi-modal Aspect Sentiment Classification (MASC). Different from text-based aspect sentiment classification (Sundararaman et al., 2020;Ji et al., 2020;Liang et al., 2020b,a), it is challenging to effectively fuse the textual and visual information. As a pioneer, Xu et al. (2019) collect a benchmark Chinese dataset from a digital product review platform for multi-modal aspect-level sentiment analysis and propose a multi-interactive memory network to iteratively fuse the textual and visual representations.
Recently, Yu and Jiang (2019) annotate two datasets in Twitter for multi-modal target-oriented (aka aspect-level) sentiment classification and leverage BERT as backbone to effectively combine both textual and visual modalities. In the same period, Yu et al. (2020a) propose a target-sensitive attention and fusion network to address both text-based and multi-modal target-oriented sentiment classification.
However, all the above studies assume that the aspect or target is given, which limits their applicability. Different from them, we propose to jointly perform aspect terms extraction besides the corresponding sentiment classification in a multi-modal scenario. Note that our multi-modal joint learning approach also improves the performance of both MATE and MASC.
Text-based Joint Aspect Terms Extraction and Sentiment Classification. Some studies (Zhang et al., 2020a) have attempted to solve both sub-tasks in a more integrated way, jointly extracting aspect terms and predicting their sentiment polarities. The most recent and representative are a span-based extract-then-classify approach (Hu et al., 2019) and a directed GCN with syntactic information.
However, none of the above studies can model visual guidance for both sub-tasks. Different from them, we propose a multi-modal joint framework to handle both MATE and MASC.

Joint Multi-modal Aspect-Sentiment Analysis
In this section, we introduce our approach for jointly performing multi-modal aspect terms extraction and aspect sentiment classification. In the following, we first formalize the joint task, then introduce the text-image relation detection module, and finally detail our hierarchical framework for multi-modal learning.

Task Definition. We define the following notations, used throughout the paper. Let D = {(X_n, I_n, A_n, S_n)}_{n=1}^{N} be the set of data samples. Given a word sequence X = {x_1, x_2, ..., x_k} of length k and an image I, the joint task is to extract the aspect term list A = {a_1, a_2, ..., a_m} and classify the corresponding sentiment list S = {s_1, s_2, ..., s_m} simultaneously, where m denotes the number of aspects. Note that the word embeddings are obtained by pre-processing with BERT (Devlin et al., 2019) owing to its excellent textual representation ability, while the image region embeddings are obtained by pre-processing with ResNet (He et al., 2016) owing to its excellent visual representation ability.
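To make the notation concrete, here is a minimal Python sketch of how a data sample (X_n, I_n, A_n, S_n) might be represented; the class name, field names, and example tweet are our own illustration, not artifacts of the paper. Aspects are stored as inclusive start/end token spans, matching the span-based formulation used later.

```python
# Hypothetical data-structure sketch for the joint task (names are ours).
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    words: List[str]                 # X = {x_1, ..., x_k}
    image_path: str                  # I (region features come from ResNet)
    aspects: List[Tuple[int, int]]   # A: inclusive (start, end) token spans
    polarities: List[str]            # S: one sentiment label per aspect


sample = Sample(
    words="RT @ ESPN : Spurs rout Thunder in Game 1".split(),
    image_path="tweet_1.jpg",
    aspects=[(4, 4), (6, 6)],        # "Spurs", "Thunder"
    polarities=["positive", "negative"],
)


def aspect_terms(s: Sample) -> List[str]:
    # recover the surface aspect terms from their spans
    return [" ".join(s.words[i:j + 1]) for i, j in s.aspects]
```

Note that |A| = |S| = m by construction: each extracted aspect carries exactly one polarity.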

Cross-modal Relation Detection
Unlike traditional approaches, which take all visual information into account and ignore whether the image actually benefits the text, we incorporate the image-text relation into the model and retain only the visual information that is auxiliary to the text. Therefore, we build a relation module by pre-training to properly exploit the visual modality for our joint multi-modal tasks. The cross-modal relation detection module is shown in the bottom right corner of Figure 2.
We employ the TRC dataset (Vempala and Preotiuc-Pietro, 2019) for text-image relation detection to determine whether the image adds to the text's meaning. Table 1 shows the types of text-image relations and statistics of the TRC dataset.
Module Design. We first feed the two raw modalities into pre-trained BERT and ResNet modules respectively, noting that these pre-trained modules are used independently within the cross-modal relation detection module. Then, we pass each modal representation through a self-attention block to capture intra-modal interactions, and subsequently feed the output states into a cross-attention block to capture inter-modal interactions between text and image. Formally:

H_o = ATT_self(O_rel), H_x = ATT_self(T_rel)
H_{o→x} = ATT_cross(H_o, H_x), H_{x→o} = ATT_cross(H_x, H_o)

where ATT_self denotes intra-modal multi-head attention as in (Vaswani et al., 2017), ATT_cross denotes cross-modal multi-head attention as in (Ju et al., 2020), and O_rel and T_rel are the pre-trained embeddings of image I and text X. Finally, we obtain the relation probabilities p_r through a feed-forward neural network and a softmax activation applied to H, the concatenation of H_o, H_x, H_{o→x}, and H_{x→o}, where W_1 ∈ R^{4d_m×d_m} and W_2 ∈ R^{d_m×2} are two trainable parameter matrices. Since the relation score can also be binary (0 or 1), we additionally compute a hard score by thresholding: p_r < 0.5 maps to 0, otherwise 1. We then try both soft and hard relations to guide our multi-modal joint tasks.

Figure 2: The overview of our proposed JML.
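The flow above can be sketched in a few lines of pure Python; this is a toy, single-head, two-dimensional stand-in (the paper uses 8-head attention with d_m = 768), and the final scoring head is a placeholder for the learned W_1, W_2 layers.

```python
import math
from typing import List

Vec = List[float]


def softmax(xs: Vec) -> Vec:
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def attend(query: Vec, keys: List[Vec], values: List[Vec]) -> Vec:
    # single-head scaled dot-product attention (the paper uses the
    # multi-head version of Vaswani et al., 2017)
    d = len(query)
    weights = softmax([sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                       for key in keys])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]


# toy pooled embeddings standing in for T_rel (text) and O_rel (image)
t_rel = [0.2, 0.9]
o_rel = [0.8, 0.1]

h_x = attend(t_rel, [t_rel], [t_rel])   # intra-modal (self) attention
h_o = attend(o_rel, [o_rel], [o_rel])
h_xo = attend(h_x, [h_o], [h_o])        # cross-modal: text queries image
h_ox = attend(h_o, [h_x], [h_x])

# H: concatenation of the four states, fed to the FFN + softmax head
h = h_o + h_x + h_xo + h_ox
p_r = softmax([sum(h), -sum(h)])        # placeholder for the learned head
```

With pooled single-vector inputs, self-attention over one key is the identity, so the sketch is only meant to show the wiring; with real token/region sequences, each block mixes information across positions.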
Relation Loss. Let the TRC training data be a set of text-image pairs. The loss L_r of the binary relation classification is the cross entropy over these pairs, where p_r(x) is the softmax probability assigned to the correct class of pair x.
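Under the averaged cross-entropy reading of L_r (the normalization over the pair count is our assumption), the loss can be sketched as:

```python
import math
from typing import List


def relation_loss(p_correct: List[float]) -> float:
    # L_r = -(1/N) * sum_x log p_r(x), where p_r(x) is the softmax
    # probability the module assigns to the gold relation label of pair x
    return -sum(math.log(p) for p in p_correct) / len(p_correct)
```

A perfectly confident, correct classifier yields zero loss; a coin-flip probability of 0.5 yields log 2 per pair.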

Multi-modal Aspect Terms Extraction
The left part of Figure 2 shows the architecture of multi-modal aspect terms extraction. We first leverage the text-image relation to control the visual input, and then let the textual and visual information attend to each other.
Formally, G_r = RelDet(X, I) and O_r = G_r · O, where RelDet(·,·) denotes the relation detection module with inputs X and I, O is the output of a separate ResNet over I used for our main task, and G_r is the relation score. In this stage, the mask gate G_r controls the additive visual clues.
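The gating step reduces to scaling (soft relation) or masking (hard relation) the visual features; the following sketch, with our own function name and a scalar G_r over a flat feature vector, illustrates both variants:

```python
from typing import List


def gate_visual(visual: List[float], g_r: float,
                hard: bool = False) -> List[float]:
    # O_r = G_r * O: the relation score scales (soft) or masks (hard)
    # the ResNet features O before they reach the MATE/MASC modules
    if hard:
        g_r = 1.0 if g_r >= 0.5 else 0.0
    return [g_r * v for v in visual]
```

With a hard gate, an image judged irrelevant (G_r < 0.5) is zeroed out entirely, whereas the soft gate lets a weakly relevant image contribute proportionally.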
Subsequently, we let the text attend to the visual information effective for the first sub-task, MATE, yielding a cross-modal state via attention with element-wise addition ⊕ and a trainable matrix W_a ∈ R^{d_m×d_m}, where T is the output of a separate BERT over X used for our main task. Instead of finding aspects via BIO sequence tagging, we identify candidate aspects by their start and end positions in the sentence, inspired by previous research (Wang and Jiang, 2017; Hu et al., 2019), owing to the huge search space and the inconsistency of multi-word sentiment. From the above step, we obtain the unnormalized score as well as the probability distribution of the start position, where w_s ∈ R^{d_m} is a trainable weight vector. Correspondingly, we obtain the end position probability along with its confidence score. During training, considering that each sentence may contain multiple aspects, we label the span boundaries of all aspect entities in A. We thus obtain a vector y^s ∈ R^k, where each element y^s_i indicates whether the i-th position is the start of an aspect, and analogously a vector y^e for the end positions.
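The start/end scoring follows the span-extraction scheme of Hu et al. (2019): a dot product between each token's cross-modal hidden state and a weight vector, normalized over the sentence. A toy sketch under that assumption (d_m = 2, 3 tokens, hand-set weights):

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def boundary_probs(hidden: List[List[float]], w_s: List[float],
                   w_e: List[float]) -> Tuple[List[float], List[float]]:
    # unnormalized scores g^s = H w_s and g^e = H w_e, one per token,
    # then softmax over the sentence to get position distributions
    g_s = [sum(h_i * w for h_i, w in zip(h, w_s)) for h in hidden]
    g_e = [sum(h_i * w for h_i, w in zip(h, w_e)) for h in hidden]
    return softmax(g_s), softmax(g_e)


# toy 3-token sentence: token 0 looks like a span start, token 1 an end
p_s, p_e = boundary_probs([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]],
                          w_s=[1.0, 0.0], w_e=[0.0, 1.0])
```

Because each sentence may contain several aspects, training targets are the multi-hot vectors y^s and y^e rather than a single position.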

Multi-modal Aspect Sentiment Classification
Traditionally, aspect sentiment classification either uses sequence tagging methods or sophisticated neural networks that separately encode the target and the sentence. Instead, we obtain a summarized representation from the upper-layer cross-modal state H_a based on the position vectors (y^s and y^e); a feed-forward neural network then predicts the sentiment polarity, as shown in the upper right corner of Figure 2. From the upper network, we receive a list of aspect spans derived from y^s and y^e. Specifically, given an aspect span a with bounds (s_i, e_i), we summarize the hidden states of H_a within the span into a vector H^i_u with the attention mechanism (Bahdanau et al., 2015), where w_m ∈ R^{d_m} is a trainable weight vector. In addition, we integrate the visual representation O_r from formula (8) into the span vector set H_u with the assistance of the relation gate G_r. Similar to formulas (9-10), cross-modal multi-head attention is used for modal fusion, with W_u ∈ R^{d_m×d_m}, and we obtain H_s ∈ R^{m×d_m} as the final sentiment state set. Furthermore, we obtain the polarity score by applying two linear transformations with a Tanh activation in between, normalized with a softmax function to output the polarity probability, where W_p ∈ R^{d_m×C} and W_v ∈ R^{d_m×d_m} are two trainable weight matrices and C is the number of sentiment classes.
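The attention-based span summarization can be sketched as follows; this is our own minimal rendering of Bahdanau-style pooling over a span, with scalar scores from a weight vector w_m (the learned transformation inside the real scorer is omitted):

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def summarize_span(hidden: List[List[float]], start: int, end: int,
                   w_m: List[float]) -> List[float]:
    # attention-pool the hidden states over [start, end] into one
    # span vector H_u; weights come from scoring each token against w_m
    span = hidden[start:end + 1]
    alpha = softmax([sum(h_i * w for h_i, w in zip(h, w_m)) for h in span])
    return [sum(a * h[i] for a, h in zip(alpha, span))
            for i in range(len(span[0]))]
```

A single-token span returns that token's state unchanged; a multi-token span yields a weighted mixture, letting the head word of a multi-word aspect dominate its representation.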

Joint Loss
Since this is a joint task of aspect terms extraction and aspect sentiment classification, we compute the two loss terms simultaneously, where y^s, y^e, y^p are one-hot labels indicating the gold start position, end position, and true sentiment polarity respectively, and k, m are the numbers of sentence tokens and aspects respectively. At inference time, we select the most suitable spans (i, j) (i ≤ j) as final aspect predictions with the assistance of the position scores (g^s, g^e), following previous research (Hu et al., 2019). The sentiment polarity probability is then calculated for each candidate span, and the sentiment class with the maximum value in p^p is selected.
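Reading the joint objective as the sum of the negative log-likelihoods of the gold boundaries and the gold polarities (the exact weighting between the two terms is not specified, so an unweighted sum is our assumption), it can be sketched as:

```python
import math
from typing import List, Tuple


def joint_loss(p_start: List[float], p_end: List[float],
               p_pol: List[List[float]],
               gold_spans: List[Tuple[int, int]],
               gold_pols: List[int]) -> float:
    # extraction term: gold start/end positions of every aspect span
    l_extract = -sum(math.log(p_start[s]) + math.log(p_end[e])
                     for s, e in gold_spans)
    # classification term: gold polarity class of each aspect
    l_classify = -sum(math.log(p[g]) for p, g in zip(p_pol, gold_pols))
    return l_extract + l_classify
```

Both terms are standard cross entropies, so minimizing the sum pushes probability mass onto the gold boundaries and the gold sentiment class at once.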

Experimentation
In this section, we systematically evaluate our approach to aspect terms extraction and aspect sentiment classification.

Experimental Settings
Datasets. In the experiments, we use three datasets to evaluate performance: the TRC dataset, and two public Twitter datasets (i.e., Twitter2015 and Twitter2017) for MALSA. The detailed descriptions are as follows. TRC dataset of Bloomberg LP (Vempala and Preotiuc-Pietro, 2019). In this tweet dataset, we select the two types of text-image relation annotated by the authors, as shown in Table 1 (relation types R1 and R2, covering 44.2% and 55.8% of tweets respectively). The "image adds to the tweet meaning" relation focuses on the usefulness of the image to the semantics of the tweet, which is especially suitable for our task. We follow the same 8:2 train/test split as (Vempala and Preotiuc-Pietro, 2019).
Twitter datasets. As shown in Table 2, the Twitter2015 and Twitter2017 datasets were originally provided for multi-modal named entity recognition, with the sentiment polarity of each aspect annotated by (Lu et al., 2018). We use these datasets for our joint task. Implementation Details. We implement our approach with the PyTorch toolkit (torch-1.1.0) on a single GTX 1080 Ti. The hidden size d_m in our model is 768, the same as the dimension in BERT (Devlin et al., 2019). The number of heads in ATT_self and ATT_cross is 8.
During training, we train each model for a fixed number of 50 epochs and monitor its performance on the validation set. Once training is finished, we select the model with the best F1 score on the validation set as our final model and evaluate it on the test set. We adopt cross-entropy as the loss function and use the Adam optimizer (Kingma and Ba, 2015) to minimize the loss over the training data. To motivate future research, the code will be released via GitHub. Evaluation Metrics and Significance Test. We employ three evaluation metrics to measure the performance of different approaches on joint multi-modal aspect terms extraction and aspect sentiment classification: micro F1 (F1), Precision (P), and Recall (R). Besides, the paired t-test (via scipy) is performed to test the significance of the difference between two approaches, with a default significance level of 0.05. These metrics are popularly used in aspect extraction and sentiment classification problems.
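For the joint task, an extracted pair usually counts as correct only when both the span and its polarity match the gold annotation exactly; under that assumption (the paper does not spell out its matching criterion), micro-averaged P/R/F1 can be computed as:

```python
from typing import List, Set, Tuple

# (start, end, polarity) triples, one set per example
Pair = Tuple[int, int, str]


def micro_prf(pred: List[Set[Pair]], gold: List[Set[Pair]]):
    # a prediction is a true positive only if span boundaries AND
    # polarity both match a gold pair exactly
    tp = sum(len(p & g) for p, g in zip(pred, gold))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Micro-averaging pools the counts over all examples before dividing, so frequent aspects weigh more than under macro-averaging.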

Baselines
For a thorough comparison, we mainly compare four groups of baseline systems with our approach.
The first group contains the most related approaches to multi-modal aspect terms extraction: 1) RAN (Wu et al., 2020a), a co-attention approach for aspect terms extraction in a multi-modal scenario; 2) UMT (Yu et al., 2020b); 3) OSCGA (Wu et al., 2020b), a multi-modal NER approach based on object features with BIO tagging. Note that UMT and OSCGA focus on named entity recognition (NER) with BIO tagging in a multi-modal scenario, leveraging the representation ability of the Transformer and object-level fine-grained visual features, respectively.
The second group contains representative approaches to multi-modal aspect-dependent sentiment classification: 1) TomBERT (Yu and Jiang, 2019); 2) ESAFN (Yu et al., 2020a). Note that TomBERT is based on BERT, while ESAFN is based on LSTM but explicitly models textual contexts.
The third group contains text-based approaches to joint aspect terms extraction and aspect sentiment classification: 1) SPAN (Hu et al., 2019); 2) D-GCN. Note that SPAN also adopts a hierarchical framework but is limited to the textual scenario, while D-GCN leverages syntactic information with a GCN.

Experimental Results
Results on TRC. Table 4 shows the performance of our relation detection module on the TRC test set. Our attention-based visual-linguistic model equipped with BERT and ResNet outperforms both (Lu et al., 2018) and RpBERT: the F1 score of our model on the TRC test set increases significantly, by 8.8% over (Lu et al., 2018) and 1.7% over RpBERT, which demonstrates the effectiveness of our module on this task.
For JMASA. Table 3 shows the results of different approaches in the multi-modal scenario that process aspect terms extraction and aspect sentiment classification simultaneously. From this table, we observe that: 1) Text-based joint approaches perform much worse than multi-modal joint approaches, suggesting that the visual modality enriches the representation beyond the limited textual modality and helps correct predictions. 2) UMT-collapse, RpBERT, and OSCGA-collapse perform much worse than our joint approach, owing to collapsed tagging with the same visual feeding instead of mining the visual information separately for the two sub-tasks. 3) RpBERT performs worst among all baselines; it simultaneously processes text-image relation classification and visual-linguistic learning for aspect terms extraction and sentiment classification, suggesting that a vanilla BERT-based model cannot handle multiple tasks at the same time without greatly reducing task performance. 4) JML(hard) with a hard relation performs worse than its soft counterpart, indicating the wisdom of using the soft image-text relation. 5) Among all approaches, our proposed JML performs best on almost all metrics. For instance, on Twitter-2017, our approach outperforms D-GCN by 1.9%, 2.3%, and 1.4% in Micro-F1, Precision, and Recall, respectively. This is mainly because our joint framework leverages the clues that are indeed beneficial to the two sub-tasks via cross-modal relation detection and cross-modal attention integration.

Table 6: Performance of multi-modal aspect-level sentiment classification, compared with the sub-task in our joint approach.

For MATE. Table 5 shows the performance of different approaches that only perform multi-modal aspect terms extraction, compared with the corresponding sub-task performance of our joint approach.
From this table, we observe that: 1) UMT performs worst among the baselines; this is because RAN aligns the text with the object regions shown in the image and OSCGA combines object-level image information with character-level text information to predict aspects.
2) The sub-task performance of our joint approach is better on most metrics, suggesting that our joint framework promotes aspect terms extraction with the assistance of aspect sentiment information and the relation-gated visual modality.
For MASC. Table 6 shows the performance of different approaches that only perform multi-modal aspect sentiment classification, compared with the corresponding sub-task performance of our joint approach. From this table, we observe that: 1) TomBERT performs better than ESAFN, which clearly reveals that BERT, as an excellent pre-trained encoder, indeed improves the richness of the textual embeddings compared with an LSTM-based encoder. 2) Our approach outperforms the current baselines significantly. We speculate on the reasons as follows: First, the cross-modal relation module helps refine a high-quality visual representation. Second, our approach formulates aspect sentiment classification as a multi-aspect task, which considers the mutual interaction among the sentiments of multiple aspects.

Analysis
In this section, we give a further investigation of some experimental results and discussion of some meaningful cases.
Ablation Study. To further demonstrate the assistance of the image-text relation, we remove the relation components separately, i.e., remove all relations (W/o Relation All), remove the image-to-aspect relation (W/o Relation MATE), and remove the image-to-sentiment relation (W/o Relation MASC). Moreover, to demonstrate the importance of modeling the image in our joint task, we remove the visual information, i.e., the image-to-aspect vision (W/o Vision MATE) and the image-to-sentiment vision (W/o Vision MASC). From Table 7, we observe that removing either the visual information or the image-text relation significantly decreases performance. This illustrates the effectiveness of our approach in refining the visual information and assisting modality fusion.
Case Study. To further demonstrate the effectiveness of our multi-modal joint approach, Figure 3 presents three examples with the results predicted by JML and three representative baselines: D-GCN, OSCGA-collapse, and JML w/o Relation All. We can observe that: In example (a), although D-GCN accurately detects the two ground-truth aspect terms, it gives the wrong sentiment prediction for the aspect term "lionelmessi", mainly because of the lack of auxiliary visual information. In example (b), OSCGA-collapse predicts an erroneous aspect, owing to collapsed tagging with the same visual feeding when mining the visual information for the two sub-tasks. In example (c), JML w/o Relation All predicts the wrong sentiment for the aspect "miami", suggesting that without the assistance of the cross-modal relation, the approach suffers interference from useless image information. In contrast, our well-behaved approach JML obtains all correct aspect terms and aspect-dependent sentiments by controlling the inflow of image information and mining the visual information separately for the two sub-tasks in a joint framework.

Conclusion
In this paper, we propose a multi-modal joint approach to simultaneously handle the aspect terms extraction and sentiment classification. Our approach can not only model the cross-modal relation between text and image, determining how much