YNU-HPCC at SemEval-2020 Task 8: Using a Parallel-Channel Model for Memotion Analysis

this paper proposed a parallel-channel model to process the textual and visual information in memes and then analyze the sentiment polarity of memes. In the shared task of identifying and categorizing memes, we preprocess the dataset according to the language behaviors on social media. Then, we adapt and fine-tune the Bidirectional Encoder Representations from Transformers (BERT), and two types of convolutional neural network models (CNNs) were used to extract the features from the pictures. We applied an ensemble model that combined the BiLSTM, BIGRU, and Attention models to perform cross domain suggestion mining. The officially released results show that our system performs better than the baseline algorithm


Introduction
In recent years, memes that combine pictures and text have been widely used in social media. Using memes can help users to express richer meaning and emotion compared with using text or images alone; hence, it is worthwhile to analyze the sentiment expressions of memes.Moreover, recognizing and analyzing the meaning and sentiment of memes is much more difficult than analyzing social texts or pictures.
In SemEval-2020 Task 8:Memotion Analysis (Sharma et al., 2020), the organizers hoped that the task would increase the research attention given to the topic. The task is divided into three subtasks.
• Task A-Sentiment Classification: Given an Internet meme, the first task is to classify its sentiment polarity.
• Task B-Humor Classification: Given an Internet meme, the system has to identify the type of humor expressed.
• Task C-Scales of Semantic Classes: The third task is to quantify the extent to which a particular effect is being expressed.
Memes and this issue have attracted the attention of researchers. In a previous study, Borth (2013) pioneered the sentiment analysis of visual content with SentiBank. Another study implemented Optical Character Recognition (OCR) to extract the text captions of memes and then classified the sentiment polarity of the text using the Naive Bayes algorithm (Amalia et al., 2018). For a similar meme sentiment analysis task, Zhao (2019) developed a multimodal sentiment analysis method for image-text posts, and their experiments showed that this method achieves excellent performance on the Flickr benchmark dataset. Hu and Flaxman (2018) used GloVe to map the text to a high dimensional space and fine-tuned the pictures through Inception (a pretrained deep convolutional neural network).  Figure 1: The Multimodal architecture of the parallel channel model.
In this paper, we propose a parallel channel model that includes a text channel, which is implemented to process the text in memes, and an image channel for image analysis. The text channel implements the BiLSTM, BiGRU, and BiLSTM with attention models. For the image channel, a multilayer CNN model and ResNet152 (He et al., 2016) were applied to capture the image features. Then, the information in the two modalities is combined by a dense layer after concatenation. The experimental results show that our approach achieved good performance.
The remainder of this paper is organized as follows. Section 2 describes the proposed parallel channel model, and Section 3 presents the implementation details and experimental results. The conclusions of this study are presented in Section 4.

Overview
As illustrated in Figure 1, the proposed model consists of two channels: the image channel and the text channel. We propose two different types of pretraining vectors and three different models in the text channel and two different models in the image channel as a way to extract picture features. We combined multiple models, use the soft voting mechanism, and output the results. For an input meme, w s represents the extracted text and I is the image. Then, the proposed model can be expressed as follows: where i ∈ (1, 2, 3, 4) and j ∈ (1, 2) , f T and f I represent the way to obtain special text and image features. h T i and h I j are the text vector and the image vector, respectively,and f (w s , I) is the final result.

Text Channel
Embedding Layer. The embedding layer is the first layer of the text channel. We constructed the word vectors from a 768-dimensional BERT vector. Then, a word vector matrix was loaded into the embedding layer and then fed into different hidden layers. For longer posts, we only keep the first 128 words, which is a reasonable choice since 90% of the posts in the dataset contain less than 128 words. We also use the sentence-level vectors from a 768-dimensional BERT vector as the text features and fed them into the fully connected layer.  Figure 2: Residual learning: a building block. models based on LSTM, for instance: Wang (2020) proposed a tree-structured regional CNN-LSTM model for valence-arousal (VA) prediction. A capsule tree LSTM model introduces a dynamic routing algorithm to construct sentence representations (Wang et al., 2019), and experiments prove that the method improves the performance of the tree LSTM and the basic LSTM model. BiLSTM is based on LSTM and can better capture forward and backward semantic dependencies. We show how a memory block calculates the hidden state h T t and output C t using the following equations.
• TransformationC • Status update here, x t is the input vector; C t is the cell state vector; W and b are cell parameters; f t , i t and o t are gate vectors; and σ denotes the sigmoid function. Gated Recurrent Unit (GRU) (Cho et al., 2014) is a variant of LSTM that combines the forget gate and the input gate into a single update gate. It also mixes cell states and hidden states. The final model is simpler than the standard LSTM model. The effect is similar to LSTM but with fewer parameters, and it is not easy to overfit. Attention mechanism (Bahdanau et al., 2015) breaks the limitation that the traditional encoder-decoder structure depends on a fixed-length vector when encoding and decoding. Its implementation retains the intermediate output results of the input sequence via the LSTM encoder, trains a model to selectively learn these inputs and associates the output sequence with it when the model is output. Attention mechanisms have been widely used in various NLP fields such as the Transformer (Vaswani et al., 2017), Neural Machine Translation (Yang et al., 2016) and aspect-level sentiment analysis (Tang et al., 2019).

Image Channel
Convolutional neural networks (CNNs) (Krizhevsky et al., 2012) are often used to extract image representations. A CNN is usually divided into convolution layers and pooling layers. The convolution layers are used to extract n-gram features from the picture pixels. Pooling selects a part of the input matrix and chooses the best representative for the region. The max pooling layer selects the max feature.
ResNet model (He et al., 2016) is one of the widely used image recognition models, and it solves the deep vanishing gradient problem. The basic structure of the residual is shown in Figure 2. We used PyTorch's pretrained ResNet152 model for the feature extraction from pictures.

Experiments and Evaluation
In this section, experiments were conducted to evaluate the proposed models on both subtasks. We also report the results of the official review. The details of the experiment are described as follows.

Data Preparation
The organizers provided 7K human annotated Internet memes labeled with semantic dimensions, namely, sentiment and the type of humor that is sarcastic, humorous, or offensive. For subtask A and subtask B, the data distributions are a little unbalanced, which make the tasks much harder. We randomly used 20% of the memes from the provided data as the dev set to fine-tune the parameters. The Stanford tokenizer toolkit was employed to process the memes-text into an array of tokens. Meanwhile, before feeding the token array to any neural networks, they are preprocessed by following procedures: • Punctuation marks, websites URLs and mailing addresses are removed, • Common nonstandard expressions are restored, and • Non-English letters are treated as unknown words represented by <unk>.

Implementation Details
This experiment used Keras with the TensorFlow backend. For subtask A, we used two different pretrained word vectors, and we introduced other models. For subtask A, we tried different batch sizes and attempts, and the results are shown in Figure 3.The best batch size is 60, the best number of training epochs is 14 and the learning rate is set as 1e-5. We use Scikit-Learn to execute the grid search (Pedregosa et al., 2011) to adjust the hyperparameters, through which we can find the best parameters for evaluating the system. The parameters given are as follows: the time step of the RNN for hidden layers 1, 2 (h 1,2 ) and 3 (h 3 ); the dimension of the dense layer (d); and the dropout rate (r). For the image channel, we also have the number of convolution layers (c), the number of filters (m), the length of the filter (l) and the pool (p). Table 1 summarizes these fine-tuned parameters.

Evaluation Metrics
For the submission to subtask A and subtask B, its performance will evaluated based on the macro-F1 score. The F1-score is often used as an evaluation indicator of unbalanced data, and is defined as follows:  where P denotes the precision and R denotes the recall. A higher F1-score indicates better classification performance. Table 2 shows the detailed results of the proposed our model compared to the other baseline models in ours dev set.

Results and Discussion
Subtask A. Our system achieved a score that was 0.115 higher than the baseline score (0.2176). The results show that our proposed system significantly outperforms the baseline models. The main reason is that we have combined a variety of information from memes and used the BERT word embedding.
Subtask B. Our model score was lower than the baseline score of 0.5118. We guess that it may be caused by the inconsistent data distribution between the dev set and test set, and so we need to do more research on class imbalance in the future.

Conclusion
In this paper, we describe a task system that we submitted to SemEval-2020 for Memotion Analysis. We propose a two parallel channel model. In the text channel, we use 3 RNN models and 2 types of pretraining vectors. In the image channel, we used a pretrained model and a CNN model. We participated in subtasks A and B, and obtained nineteenth place in subtask A. In future work, we will test more novel fusion methods so that the picture features can be better combined with token embedding.