CN-HIT-MI.T at SemEval-2020 Task 8: Memotion Analysis Based on BERT

Emotion recognition for Internet memes has attracted the attention of many researchers. In this paper, we adopt BERT and ResNet to detect the emotions of Internet memes, focusing on the problems of data imbalance and noisy data. We use RandAugment to augment the image data, and use Training Signal Annealing (TSA) to mitigate the impact of label imbalance. At the same time, we design a new loss function to keep the model from being affected by input noise, which improves its robustness. We participated in sub-task a, and our BERT-based model obtains a macro F1 score of 34.58%, ranking 10th out of 32.


Introduction
Memes originate from people's culture or everyday social activities, and are usually composed of one or two of the following forms: image, video, GIF, and text (Park, 2020). Memes are widespread on social media, but as their number grows, offensive memes are also increasing (Williams et al., 2016). For many social media companies, identifying the type of a meme is therefore very important. Using machine learning to identify meme types is an important solution, and it has shown promising performance.
In SemEval-2020 Task 8: Memotion Analysis (Sharma et al., 2020), the organizers collected Internet memes, extracted the text on the images with OCR, and corrected it manually. The task is divided into three sub-tasks: a) sub-task a detects the emotion of an Internet meme, classified as positive, negative, or neutral; b) sub-task b detects the types of humor expressed by a meme, classified as sarcastic, humorous, offensive, and motivational, where a meme can belong to multiple categories; c) sub-task c is a semantic-level classification of the humor types of sub-task b, where each type is divided into four degrees: not, slightly, mildly, and very. The task uses macro F1 as the evaluation metric for sub-task a, and average macro F1 for sub-tasks b and c.
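Macro F1 averages the per-class F1 scores, so every class weighs equally regardless of its frequency. A minimal pure-Python sketch of the metric (the label names are illustrative):

```python
def macro_f1(y_true, y_pred, labels):
    """Average the per-class F1 scores, giving each class equal weight."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return sum(f1s) / len(f1s)

# A rare class contributes as much to the average as a frequent one.
score = macro_f1(["pos", "pos", "neg", "neu"],
                 ["pos", "neg", "neg", "neu"],
                 labels=["pos", "neg", "neu"])
```

This equal weighting is exactly why label imbalance hurts the reported score: a class that is rarely predicted correctly drags the average down by a full third in a three-class setting.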
There are three challenges in this task. First, how to fuse image and text features: text or images alone convey incomplete meaning, so image features and text semantics must be fused to obtain the best classification results. Second, the distribution of the training data is imbalanced, which increases the learning difficulty for the model. Finally, the text provided in the training data contains too much noise, which hinders the model's semantic understanding of the text, so the data needs to be cleaned. Because we did not find a suitable way to fuse text and image features, we used BERT (Devlin et al., 2018) for text feature extraction and ResNet (He et al., 2015) for image feature extraction. Through experiments, we found that the macro F1 score of BERT was higher than that of ResNet, so we submitted the predictions of BERT as our final result.
The rest of this paper is organized as follows: Section 2 introduces the background of the task; Section 3 introduces the models we used and some details; Section 4 introduces the data and preprocessing; Section 5 presents the experimental results with some analysis. Finally, we discuss our work and directions for future work.

Background
Pérez-Rosas et al. (2013) use Spanish videos collected from YouTube to fuse audio, video, and text at the feature level, and then perform sentiment classification on the video dataset. Dobrisek et al. (2013) use decision-level fusion to merge audio and video, designing a multimodal emotion recognition system. Wollmer et al. (2013) focus on automatically analyzing a speaker's sentiment in online videos containing movie reviews, using a mixed feature model to merge audio, video, and text. Truong and Lauw (2019) use images to extract salient text information for sentiment classification. Xu et al. (2019) study aspect-level sentiment classification of images and texts. Cai et al. (2019) extract image attribute features, which help multimodal models for photo-text sarcasm detection. Zadeh et al. (2017) mix unimodal, bimodal, and trimodal feature information to identify emotion in review videos. However, none of these works uses a Transformer-based pre-trained model, whose effectiveness has been demonstrated in many works, so we decided to use BERT instead of an LSTM.
We participated in sub-task a, which judges the emotion of memes, and we only used the data provided by the competition organizers (Sharma et al., 2020). In order to extract image and text features effectively, we first used single-modal models to extract features from the images and the text, and then designed a multimodal model on the basis of the single-modal models. However, we found the multimodal model to be less effective than the single-modal models; this is discussed in Section 5.

System Overview
In this task, we mainly used the ResNet-101 and BERT pre-trained models. We refer to VistaNet (Truong and Lauw, 2019), the Hierarchical Fusion Model (Cai et al., 2019), and the Tensor Fusion Network (Zadeh et al., 2017), respectively. We also refer to Liu (2019): we extract the salient features of the image through the sentences in the text, and then merge the text features with the salient image features. However, we found that the predictions of the multimodal models were not as good as those of the single-modal models, so we did not use a multimodal model in our final submission; see Section 5 for details.

Single-modal for Photo
ResNet He et al. (2015) proposed residual networks (ResNet). Residual networks solve the problem of network degradation in deep networks by adding the shallow output of the model to its deep output, successfully increasing the network depth to 152 layers. We use ResNet as the single-modal model for the images; through experiments, we decided to use ResNet-101. The pre-trained model we use is from the torchvision module of PyTorch.
RandAugment The Google research team released RandAugment (Cubuk et al., 2019), which performs automatic data augmentation with a reduced search space. RandAugment has two parameters, N and M, where N is the number of augmentation transformations to apply sequentially and M is the magnitude shared by all transformations. Because the training data is small, the model may not be fully trained, so we use RandAugment to augment the image data. In this task, we set N = 14 and M = 11, and in the submitted results this model obtained a macro F1 score of 0.334.
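The sampling policy behind RandAugment is simple: draw N operations uniformly at random from the transformation pool and apply each at the shared magnitude M. A minimal sketch of just that policy (the operation-name pool below is an illustrative subset; the actual image kernels live in the augmentation library):

```python
import random

# Illustrative subset of the RandAugment operation pool.
OPS = ["AutoContrast", "Equalize", "Rotate", "Solarize", "Color",
       "Posterize", "Contrast", "Brightness", "Sharpness", "ShearX",
       "ShearY", "TranslateX", "TranslateY", "Identity"]

def rand_augment_policy(n, m, rng=random):
    """Sample n operations (with replacement); every op shares magnitude m."""
    return [(rng.choice(OPS), m) for _ in range(n)]

# The setting used in this task: N = 14 sequential transformations at M = 11.
policy = rand_augment_policy(n=14, m=11)
```

Collapsing the search space to just (N, M) is what distinguishes RandAugment from AutoAugment, which searches per-operation probabilities and magnitudes.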

Single-modal for Text
BERT The Google research team released the pre-trained model BERT (Devlin et al., 2018) on the basis of the Transformer. Through pre-training on huge corpora, BERT has achieved state-of-the-art results on many NLP tasks. Considering that the dataset of SemEval-2020 Task 8 (Sharma et al., 2020) is not large, we adopted the BERT-base version, using a PyTorch implementation.
KL-divergence Because the text data provided by SemEval-2020 Task 8 (Sharma et al., 2020) contains many text fragments unrelated to the task, we cleaned the text data; see Section 4.2 for details. However, we only cleaned the training data, not the test data, resulting in an inconsistency between training data and test data. To solve this problem, we define a new loss as follows:

Loss = Loss_ce(O_c, y) + Loss_kl(O_c, O_d)

where F_bert(·) is the BERT-based model, O_c and O_d are the probability distributions the model produces for the cleaned and uncleaned text, y is the gold label, and Loss_kl and Loss_ce are the Kullback-Leibler divergence loss and the cross-entropy loss.
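A minimal numeric sketch of this combined objective in plain Python (the toy 3-class distributions are illustrative; in the actual system both outputs would come from BERT over the cleaned and uncleaned versions of the same text):

```python
import math

def cross_entropy(probs, gold):
    """Negative log-probability of the gold class."""
    return -math.log(probs[gold])

def kl_divergence(p, q):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def combined_loss(o_clean, o_dirty, gold):
    """Cross-entropy on the cleaned text plus KL between the two outputs,
    pushing the model to behave consistently on cleaned and noisy input."""
    return cross_entropy(o_clean, gold) + kl_divergence(o_clean, o_dirty)

# Toy outputs (positive, neutral, negative) for one meme.
o_clean = [0.7, 0.2, 0.1]
o_dirty = [0.5, 0.3, 0.2]
loss = combined_loss(o_clean, o_dirty, gold=0)
```

When the two outputs agree exactly the KL term vanishes and the loss reduces to the plain cross-entropy, so the extra term only penalizes sensitivity to the noise that cleaning removes.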

Data Setup
The Memotion Dataset 7k (Sharma et al., 2020) was collected for memotion analysis. The task judges the various emotions expressed by a meme based on the meme image and the text in the image. Table 1 lists the number of memes of each category in sub-task a. We divide the data into five parts, four of which form the training set and one the development set.

Preprocessing
We found that the text in the dataset mainly contains three kinds of errors: a) meaningless text, such as URLs; b) missing punctuation, causing two sentences to merge into one; c) disordered sentences, or multiple sentences randomly mixed together. We therefore corrected the text containing these errors.
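For error (a), URL-like fragments burned into meme images can be stripped with a simple regular expression; a sketch (the pattern below is a rough illustration, not the exact rule used in our cleaning):

```python
import re

# Rough pattern for bare URLs and domain fragments commonly found in memes.
URL_RE = re.compile(r"(https?://\S+|www\.\S+|\b\S+\.(com|net|org)\S*)",
                    re.IGNORECASE)

def strip_urls(text):
    """Remove URL-like fragments and collapse the leftover whitespace."""
    return re.sub(r"\s+", " ", URL_RE.sub(" ", text)).strip()

cleaned = strip_urls("ONE DOES NOT SIMPLY walk into mordor memegenerator.net")
```

Errors (b) and (c) cannot be fixed automatically with such rules, which is why manual correction was still needed for punctuation and sentence order.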
The training objectives are cross-entropy and KL divergence, and the Adam optimizer (Kingma and Ba, 2015) is adopted to compute and update all trainable parameters. The learning rate is set to 2e-5. We also use gradual warmup (Goyal et al., 2017) and a cosine annealing schedule for the learning rate.
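The resulting schedule ramps the learning rate up linearly during warmup and then decays it along a half-cosine; a minimal sketch (the warmup and total step counts are illustrative, not the values used in training):

```python
import math

def lr_at(step, base_lr=2e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to base_lr, then cosine annealing down toward zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# The peak rate is reached exactly when warmup ends.
peak = lr_at(100)
```

Warmup avoids large, destabilizing updates while the Adam moment estimates are still poor, and the cosine tail lets the model settle into a minimum at the end of training.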

Results
The evaluation metric of sub-task a is macro F1. We divide the dataset into training (5,595 memes) and development (1,397 memes) subsets. We initialize the model with different random seeds over many runs; the results are shown in Table 2.
From the results, we can see that the predictions of the multimodal models HFM and TFNet are not as good as those of the single-modal models. We tried VistaNet (Truong and Lauw, 2019), HFM (Cai et al., 2019), and TFNet (Zadeh et al., 2017). VistaNet's predictions on this task are the worst. We think this is because VistaNet's main purpose is to filter important text through multiple images; its focus is not simply the fusion of text and image information, so we gave it up early. However, the predictions of HFM and TFNet are also not as good as those of the single-modal models. By observing the dataset, we found that the meme dataset (Sharma et al., 2020) of SemEval-2020 Task 8 has a notable property: understanding a meme depends heavily on the alignment between text segments and image regions, which neither HFM nor TFNet focuses on. We therefore designed a model to align the image and the text. We split the text into sentences and obtain sentence features with BERT (Liu and Lapata, 2019), obtain 7x7 image regions through ResNet-101, and finally perform attention over the image regions conditioned on the sentences. But the actual effect is poor, worse than HFM and TFNet. We found that in many cases a text is split into only one sentence, so sentence-level division may be too coarse. We guess that phrase-level or word-level division may achieve better results, but we have not designed a suitable model. So in the end we did not use a multimodal model.
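The alignment step we tried can be sketched as dot-product attention of a sentence vector over the region vectors; a toy pure-Python version (the 3-region, 2-dimensional inputs are illustrative, whereas the real model attends over the 7x7 = 49 region features from ResNet-101):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(sentence, regions):
    """Score each image region by its dot product with the sentence vector
    and return the attention-weighted image feature plus the weights."""
    scores = [sum(s * r for s, r in zip(sentence, region)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    feature = [sum(w * region[d] for w, region in zip(weights, regions))
               for d in range(dim)]
    return feature, weights

# One sentence vector attending over three toy region vectors.
feature, weights = attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
```

With a single sentence per meme, the softmax collapses to one query against all regions, which is why sentence-level division gives the mechanism very little to align.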
We can also see that using text alone is better than using images alone, likely because the image information is more complex, especially when some images contain multiple sub-images. At the same time, we found that adding TSA and RA improves ResNet's results. Training Signal Annealing (TSA) gradually releases training data to the model: specifically, if the model's predicted probability for the correct category is higher than a threshold η, we remove that example from the current training step (Liu, 2019). As for RandAugment, data augmentation has the potential to significantly improve the generalization of deep learning models (Cubuk et al., 2019). When TSA and KL are added, the result of BERT improves greatly; the KL loss is described in Section 3.2. In the end, our best submitted result achieved a macro F1 of 0.3458, which is about 0.0088 lower than the first place's 0.3546 in sub-task a.
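The TSA step amounts to filtering out examples the model already predicts too confidently; a toy sketch (η is fixed here for clarity, though in TSA the threshold is annealed upward over the course of training):

```python
def tsa_filter(batch, eta):
    """Drop examples whose predicted probability for the gold label already
    exceeds eta, so easy examples stop contributing training signal."""
    kept = []
    for probs, gold in batch:
        if probs[gold] <= eta:
            kept.append((probs, gold))
    return kept

# Two confidently-predicted examples and one uncertain one; with eta = 0.8
# only the uncertain example keeps producing gradient this step.
batch = [([0.95, 0.03, 0.02], 0),
         ([0.10, 0.85, 0.05], 1),
         ([0.40, 0.35, 0.25], 0)]
remaining = tsa_filter(batch, eta=0.8)
```

Because the majority class is learned fastest, its examples cross the threshold early and are suppressed, which is how TSA counteracts label imbalance.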
The confusion matrix is shown in Figure 1. From the figure, we can see that because of the label imbalance in the dataset, the precision of the negative label is only 20%, while the precision of the positive label reaches 61.13%. This also explains why the macro F1 is relatively low.

Conclusion
Recognizing the sentiment of Internet memes is very important for social networks, but the main difficulty, and a very challenging job, is how to integrate the information of images and texts. Although our multimodal models did not achieve very good results, our single-modal model ranked 10th. Considering that our multimodal models may not align text features and image features correctly, we hope to find a suitable way to align them, for example, dividing the text at the phrase level and then aligning it with image features. At the same time, we found that the label imbalance in the dataset leads to a low final macro F1, so in the future we will consider other strategies to reduce the impact of label imbalance.