RethinkingTMSC: An Empirical Study for Target-Oriented Multimodal Sentiment Classification

Recently, Target-oriented Multimodal Sentiment Classification (TMSC) has gained significant attention among scholars. However, current multimodal models have reached a performance bottleneck. To investigate the causes of this problem, we perform extensive empirical evaluation and in-depth analysis of the datasets to answer the following questions: Q1: Are the modalities equally important for TMSC? Q2: Which multimodal fusion modules are more effective? Q3: Do existing datasets adequately support the research? Our experiments and analyses reveal that current TMSC systems primarily rely on the textual modality, as most targets' sentiments can be determined solely by text. Consequently, we point out several directions for future work on the TMSC task in terms of model design and dataset construction. The code and data can be found at https://github.com/Junjie-Ye/RethinkingTMSC.


Introduction
Target-oriented sentiment classification, also known as aspect-based sentiment classification, is a fundamental task of sentiment analysis (Pontiki et al., 2014, 2015, 2016). It aims to judge the sentiment polarity (positive, negative, or neutral) of a specific target within text. To improve performance by considering multimodal information, Target-oriented Multimodal Sentiment Classification (TMSC) was proposed to integrate both visual and textual information (Yu and Jiang, 2019).
Recently, the performance of TMSC systems has gradually reached a plateau, and progress on this task has slowed. Measured by F1-score on the popular datasets Twitter15 and Twitter17 (Yu and Jiang, 2019), we observe that state-of-the-art baselines only achieve
an F1-score of around 70. Therefore, in this paper, we aim to analyze the causes behind this at both the model level and the modality level. Roughly speaking, the modules in the model structures can be categorized into two types: 1) encoders that model the representations of different modalities; and 2) multimodal fusion modules that model the interactions between modalities. Moreover, we give a deep analysis of the characteristics of two widely-used datasets, aiming to answer the following three questions: Q1: Are the modalities equally important for TMSC? To explore this issue, we compare and analyze the performance of unimodal models on this task. For the textual modality, we use BERT (Devlin et al., 2019) as the backbone, as it is a widely-used pre-trained language model that outperforms earlier models such as LSTM (Hochreiter and Schmidhuber, 1997) and memory networks (Weston et al., 2015). For the visual modality, ResNet (He et al., 2016), ViT (Dosovitskiy et al., 2021), and Faster R-CNN (Ren et al., 2015) are adopted (see Figure 1).
Q2: Which multimodal fusion modules are more effective? Current models use various fusion strategies to model the interactions between modalities, yet obtain little improvement. To explore the effectiveness of different fusion approaches, we summarize the fusion strategies into six categories: Concatenation, Tensor Fusion (Zadeh et al., 2017), Self Attention, Image2Text, Text2Image, and Bi-direction. We then perform a comparative study of them under a unified setup to eliminate potential bias from model size and structure (see Figure 2).
Q3: Do existing datasets adequately support the research? We analyze the existing datasets (i.e., Twitter15 and Twitter17) in depth and obtain the following findings: 1) The size of existing datasets is limited; 2) The multimodal sentiment is much more consistent with the textual sentiment than with the visual sentiment; 3) A large number of targets do not appear in the images; 4) There are only a small number of samples where the sentiment is decided by both text and image.
The main contributions of this work are as follows: 1) We investigate the effectiveness of different model structures for TMSC, including various unimodal encoders and multimodal fusion modules; 2) We give an in-depth analysis of the limitations of existing widely-used datasets; 3) We derive several valuable observations and point out promising directions for future research on TMSC model design and dataset creation.

Empirical Study
We summarize the model structures and performance of the baselines for the TMSC task in Table 1. Their structural differences are mainly reflected in the different unimodal encoders and multimodal fusion modules used. Therefore, we carry out several experiments to analyze the impact of these two aspects on performance.

Unimodal Encoder
As previously mentioned in Section 1, we primarily focus on exploring the different image encoders, ResNet, ViT, and Faster R-CNN (see Figure 1), while using BERT as the text encoder.
ResNet. Following most of the baselines (e.g., mBERT (Yu and Jiang, 2019), TomBERT (Yu and Jiang, 2019), and EF-CapTrBERT (Khan and Fu, 2021)), we adopt ResNet-152 as one of the image encoders. Each image is resized to 224×224 and passed through the model to obtain 49 regions, which are used as the image representation $I = [v_1, v_2, \ldots, v_{49}]$, where $v_i \in \mathbb{R}^{2048}$.
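As a concrete illustration, the following is a minimal sketch of this encoding step using torchvision; the file path example.jpg is a placeholder.

```python
# Minimal sketch of the ResNet-152 region encoder described above:
# 224x224 input, 49 regions of dimension 2048 from the last conv block.
import torch
from torchvision import models, transforms
from PIL import Image

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
# Drop the average-pooling and classification layers to keep the
# 7x7x2048 spatial feature map.
backbone = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    fmap = backbone(image)              # (1, 2048, 7, 7)
I = fmap.flatten(2).transpose(1, 2)     # (1, 49, 2048): [v_1, ..., v_49]
```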
ViT. Following SMP (Ye et al., 2022), we adopt ViT to model the image by dividing it into 16×16 patches. A CLS token is prepended, and the sequence is fed into the Transformer (Vaswani et al., 2017) encoder to obtain the image representation $I = [v_{\mathrm{cls}}, v_1, v_2, \ldots, v_{196}]$, where $v_i \in \mathbb{R}^{768}$.
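This encoding can be sketched with the Hugging Face transformers library; the checkpoint name and image path below are assumptions for illustration.

```python
# Minimal sketch of the ViT encoder: 196 patch tokens plus a CLS token,
# each 768-dimensional, for a 224x224 image with 16x16 patches.
import torch
from transformers import ViTImageProcessor, ViTModel
from PIL import Image

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()

image = Image.open("example.jpg").convert("RGB")   # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = vit(**inputs)
I = outputs.last_hidden_state   # (1, 197, 768): [v_cls, v_1, ..., v_196]
```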
Faster R-CNN. Similar to VLP (Ling et al., 2022), we adopt a Faster R-CNN retrained on the Visual Genome dataset (Krishna et al., 2017). We select the top 36 object proposals as the image representation $I = [v_1, v_2, \ldots, v_{36}]$, where $v_i \in \mathbb{R}^{2048}$ is obtained from the ROI pooling layer of the Region Proposal Network (Ren et al., 2015).
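The extraction of proposal features can be sketched with torchvision's detection API. Note this sketch uses the COCO-pretrained model as a stand-in for the Visual Genome-retrained one from our experiments, so its box features are 1024-d rather than 2048-d, and the image path is a placeholder.

```python
# Hedged sketch: ROI-pooled features for the top proposals of an image.
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

model = fasterrcnn_resnet50_fpn(
    weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT).eval()

img = read_image("example.jpg").float() / 255.0     # placeholder path
with torch.no_grad():
    images, _ = model.transform([img])              # resize + normalize
    features = model.backbone(images.tensors)       # FPN feature maps
    proposals, _ = model.rpn(images, features)      # region proposals
    roi = model.roi_heads.box_roi_pool(features, proposals, images.image_sizes)
    roi = model.roi_heads.box_head(roi)             # (num_proposals, 1024)
I = roi[:36]   # keep the first (approximately score-ordered) 36 proposals
```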

Multimodal Fusion
We categorize the current multimodal fusion modules into six groups as follows (see Figure 2).
Concatenate is the simplest form of fusion, where the pooled text representation $H_p^T \in \mathbb{R}^{768}$ is directly combined with the pooled image representation $H_p^I \in \mathbb{R}^{768}$ to obtain the multimodal representation $H = H_p^I \oplus H_p^T$, where $\oplus$ is the concatenation operation and $H \in \mathbb{R}^{768+768}$.
Tensor Fusion is proposed for modeling interactions between modalities while preserving the characteristics of individual modalities. We obtain $H = H_p^T \otimes H_p^I$, where $\otimes$ is the outer product operation and $H \in \mathbb{R}^{768 \times 768}$.
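A minimal sketch of these two pooled-representation fusions, with random tensors standing in for the pooled text and image features:

```python
import torch

h_t = torch.randn(1, 768)   # pooled text representation H^T_p (stand-in)
h_i = torch.randn(1, 768)   # pooled image representation H^I_p (stand-in)

# Concatenate: join the two pooled vectors into a single 1536-d vector.
H_cat = torch.cat([h_i, h_t], dim=-1)         # (1, 768 + 768)

# Tensor Fusion: the outer product captures pairwise feature interactions.
H_tf = torch.einsum("bi,bj->bij", h_t, h_i)   # (1, 768, 768)
```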
Self Attention concatenates the image representation $H^I \in \mathbb{R}^{l_I \times 768}$ and the text representation $H^T \in \mathbb{R}^{l_T \times 768}$, where $l_I$ and $l_T$ are the lengths of the image and text sequences, respectively. The concatenation is then passed through three self-attention layers and a pooling layer to obtain $H \in \mathbb{R}^{768}$. Image2Text is a type of cross-attention that uses $H^I$ as the query and $H^T$ as the key and value, passing through three attention layers to get $H \in \mathbb{R}^{768}$. Text2Image instead uses $H^T$ as the query and $H^I$ as the key and value. Furthermore, we concatenate these two as the Bi-direction representation $H \in \mathbb{R}^{768+768}$.
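The attention-based variants can be sketched as follows. For brevity, a single PyTorch nn.MultiheadAttention layer with mean pooling stands in for the three BERT-initialized layers and the pooling used in our experiments, so this is illustrative rather than exact.

```python
import torch
import torch.nn as nn

d, l_i, l_t = 768, 50, 64            # hidden size and example sequence lengths
H_I = torch.randn(1, l_i, d)         # image token representations (stand-in)
H_T = torch.randn(1, l_t, d)         # text token representations (stand-in)
attn = nn.MultiheadAttention(d, num_heads=12, batch_first=True)

# Self Attention: attend over the concatenated sequence, then pool.
H_cat = torch.cat([H_I, H_T], dim=1)          # (1, l_i + l_t, d)
H_self = attn(H_cat, H_cat, H_cat)[0].mean(dim=1)   # (1, d)

# Image2Text: image tokens query the text; Text2Image swaps the roles.
H_i2t = attn(H_I, H_T, H_T)[0].mean(dim=1)    # (1, d)
H_t2i = attn(H_T, H_I, H_I)[0].mean(dim=1)    # (1, d)

# Bi-direction: concatenate the two cross-attention summaries.
H_bi = torch.cat([H_i2t, H_t2i], dim=-1)      # (1, 1536)
```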

Results Analysis
We perform experiments with different unimodal encoders and fusion modules on Twitter15 and Twitter17. Table 2 shows the results, from which we make the following observations (the experimental setup is described in Appendix B): First, the text-only model (i.e., BERT) performs well, while the visual-only models (i.e., ResNet, ViT, and Faster R-CNN) perform relatively poorly, revealing that the reliance on text is much greater than that on images for the TMSC task on these two datasets. This phenomenon is more pronounced on Twitter15.
Second, the performance of the model is affected by the choice of fusion method. Specifically, fusion modules that primarily focus on acquiring textual information (e.g., Image2Text) perform better than those focused on acquiring visual information (e.g., Text2Image). This again reveals the unequal importance of text and images.
Third, compared with the text-only model, the various fusion modules do not yield significant performance gains, and some are even worse. This is because some images do not provide related information, but rather distracting information instead.
Fourth, the impact of various image encoders is not clear, as evidenced by the low performance and high standard deviation on the two datasets (see the "Image" part of Table 2). Moreover, differences in performance among the image encoders are small in the multimodal fusion settings (see the "Multimodal" part of Table 2). This stems from the characteristics of the visual data in existing datasets, which we analyze in depth in the following section.
Based on the comprehensive experimental analyses above, we identify several key points to consider when designing future models for the TMSC task: 1) leveraging text information to fully exploit the advantages of textual data; 2) devising more effective image encoding methods to better extract semantic information from images; and 3) enhancing the noise immunity of the fusion module to enable more flexible selection and utilization of informative features from both textual and visual modalities.

Data Analysis
To gain a deeper understanding of the performance issues mentioned above, we conduct detailed analyses of the two datasets in terms of quantity, diversity, and annotation. Following the annotation procedure employed by Yu and Jiang (2019), we enlist three domain experts to annotate 400 randomly sampled test instances (200 from Twitter15 and 200 from Twitter17) across four aspects, with the majority vote taken as the final annotation (Figure 3; illustrative annotated examples are given in Appendix D). We have the following observations: First, as shown in Table 3, the sample size is relatively small, with an average of less than 1.5 targets per sample. Additionally, the distributions of the sentiment labels are unbalanced in both datasets, with neutral sentiment accounting for approximately 50% and negative sentiment for less than 15%. The reason is that Twitter15 and Twitter17 were originally constructed by Zhang et al. (2018) and Lu et al. (2018), respectively, for the named entity recognition task rather than specifically for TMSC.
Second, the multimodal sentiment has high consistency with the textual sentiment but low consistency with the visual sentiment. In Twitter15,
93% of the targets have the same textual sentiment as the multimodal sentiment, while only 47.5% have a matching visual sentiment. This indicates a biased distribution in the dataset, i.e., the textual information is more useful for determining the multimodal sentiment. Although this phenomenon is mitigated in Twitter17, the textual information is still more consistent with the multimodal sentiment than the visual information.
Third, a large number of targets do not appear in the images, which is also unsuitable for the target-oriented multimodal sentiment classification task. This phenomenon may stem from the construction of the two datasets, where the targets were selected directly from the text without taking the corresponding images into account (Yu and Jiang, 2019).
Fourth, because many images are irrelevant or do not contain the targets, there is only a small portion of the data where the sentiment is determined by both text and images. Specifically, only 22% of Twitter15 and 55% of Twitter17 data require both text and images for sentiment classification. In this respect, these two datasets may not be the best-suited for the multimodal task.
Based on our analyses of existing datasets, we propose that high-quality TMSC datasets should possess the following properties: 1) accurately reflecting the real-world data distribution, including factors such as unbalanced label distribution, while also providing sufficient data samples for different cases; 2) large data diversity, i.e., various data types and domains, to facilitate valid testing of models' generalization capability and robustness; and 3) multi-dimensional annotation information, including both multimodal and unimodal sentiment, to enable thorough analysis of a model's ability to handle different data sources.

Conclusion and Future Work
In this paper, we conduct a series of in-depth experiments on TMSC and analyze existing datasets. Our findings reveal that current multimodal models do not exhibit significant performance gains over text-only models on the TMSC task. This is largely attributable to the over-reliance on the textual modality in existing datasets, with the visual modality playing a comparatively minor role. Based on our experimental analyses, we propose future directions for designing TMSC models and for constructing more suitable datasets that better capture the multimodal nature of social media sentiment.

Limitations
Although we have conducted a series of experiments and data analyses for the TMSC task to the best of our ability, our work has at least the following limitations. First, our data analysis was performed mainly on the currently publicly available English datasets Twitter15 and Twitter17, neglecting the Chinese dataset Multi-ZOL, which has not been widely studied. Second, although our analysis revealed problems in using the current datasets to measure progress on the TMSC task, we did not construct a new and better dataset for academic use; we leave this as future work. Third, in our experiments, we did not specifically compare the impact of different text encoding methods on model performance. While we acknowledge that different text encoding methods may indeed have an impact, BERT is a well-established text encoder that already performs adequately, and most existing models use it. We therefore focused our study on image encoding methods and fusion modules, where we believe there is more room for improvement.

A Related Work
As one of the tasks of sentiment analysis, TMSC has gained great attention from scholars in recent years (Yu and Jiang, 2019). Xu et al. (2019) constructed a Chinese dataset named Multi-ZOL and proposed a multi-hop memory network to handle modal interactions. Subsequently, Yu and Jiang (2019) constructed two English datasets, Twitter15 and Twitter17, and applied BERT to this task. Subsequent research on TMSC can be divided into two directions: on the one hand, the continuous exploration of how to enhance the interactions between modalities (Khan and Fu, 2021); on the other hand, the application of pre-trained models to this task (Ye et al., 2022; Ling et al., 2022). Despite these efforts, current models have not yet achieved significant performance gains relative to text-only models. We have conducted a series of experiments and data analyses, hoping to provide insights for future TMSC research.

B Experimental Setup
For each set of experiments, we conduct tests using five different random seeds (i.e., 0, 42, 199, 2022, and 11122). We initialize the parameters of the BERT text encoder with bert-base-uncased. The image encoder parameters are frozen; we use resnet-152, vit-base, and a faster-rcnn retrained on the Visual Genome dataset as image encoders, respectively. For both the self-attention and cross-attention modules, we initialize parameters from the last three layers of bert-base-uncased. We utilize the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 2e-5 and run each experiment on a 3090 GPU for 8 epochs. We select the parameters of the best epoch on the validation set for testing and report the mean and standard deviation as the final result.
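For reference, a minimal sketch of the per-seed setup follows; the linear layer is a stand-in for the actual model, and data loading and the training loop are omitted.

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix all relevant random number generators for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

SEEDS = [0, 42, 199, 2022, 11122]
for seed in SEEDS:
    set_seed(seed)
    model = torch.nn.Linear(768, 3)  # stand-in for the actual TMSC model
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
    # ... train for 8 epochs, keep the best validation epoch, evaluate on
    # test; the mean and standard deviation over the five seeds are reported.
```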
To ensure a fair comparison, the models in Table 2 use a uniform setup without continued pre-training. As a result, they are not directly comparable to the results reported in existing papers (Table 1), since the overall structure and training details differ even when the same unimodal encoders and multimodal fusion modules are used.

C Model Performance Visualization
We select Image2Text (Faster R-CNN) as the representative of the multimodal models and compare its predictions with those of the text-only model in a Venn diagram (Figure 4). Second, a portion of the data is predicted correctly by the multimodal model but incorrectly by the text-only model, and vice versa; the proportions of these two parts are similar. This suggests that while images do contribute valuable information to the multimodal model, they also introduce noise. To improve performance, further investigation is required into how to properly incorporate visual information. Third, over 16% of the data has sentiments that neither the text-only model nor the multimodal model predicts correctly. This indicates a weakness of current models that calls for further exploration.

D Annotation Examples
To clearly and visually illustrate the various scenarios that arise during the dataset annotation process, four samples are presented in Table 4.
The first example demonstrates a scenario where the textual sentiment and the visual sentiment match, resulting in a multimodal sentiment determined by both modalities. In this example, the sentiment of the text is determined to be positive through words such as "Congratulations" and "winner". Similarly, the sentiment of the image can be inferred as positive by identifying the target (i.e., the first person on the right) and noticing his smiling face.
The second example shows a scenario where the textual sentiment aligns with the multimodal sentiment but not with the visual sentiment, leading to a multimodal sentiment determined by the textual modality only. Specifically, the sentiment conveyed by the text is negative due to the phrase "after husband's assassination", while the sentiment conveyed by the image is neutral, as it does not show an obvious facial expression on the person referred to in the text (i.e., the first person on the left). Therefore, the multimodal sentiment conveyed by both modalities is negative.

Corresponding to the second example, the third example illustrates a scenario where the visual sentiment aligns with the multimodal sentiment but not with the textual sentiment. In particular, the text simply states a fact with a neutral sentiment, while the image shows the target (i.e., the person waving his hand in front of the podium) with a positive facial expression and posture, resulting in an overall positive multimodal sentiment.
The fourth example presents a scenario where the target does not appear in the image, resulting in a multimodal sentiment determined solely by the textual modality. Here, the target is "Lebanon", but since there is only one person in the image and no information about "Lebanon", we can only conclude that the multimodal sentiment is neutral based on the text. It is worth mentioning that such a sample is not ideal for the TMSC task, as the image conveys no sentimental information towards the target.

Figure 4: Venn diagram for model performance visualization.

Table 1: The model structures of various baselines. All text encoders in the above models except for VLP are initialized with BERT.

Table 3: Statistics of the datasets. #Avg Targets denotes the average number of targets per sample.

Table 4: The annotation examples.