CSECU_KDE_MA at SemEval-2020 Task 8: A Neural Attention Model for Memotion Analysis

A meme is a pictorial representation of an idea or theme. With the growing number of social media platforms, memes spread rapidly from person to person and have become a trending way of expressing opinions. However, because meme content is multimodal, detecting and analyzing the underlying emotion of a meme is a formidable task. In this paper, we present our approach to detecting the emotion of a meme as defined in SemEval-2020 Task 8. Our team, CSECU_KDE_MA, employs an attention-based neural network model to tackle the problem. After extracting the text content of a meme using optical character recognition (OCR), we represent it using distributed word representations. Next, we apply convolution with multiple kernel sizes to obtain higher-level feature sequences. The feature sequences are then fed into an attentive time-distributed bidirectional LSTM model to learn long-term dependencies effectively. Experimental results show that our proposed neural model obtained competitive performance among the participants' systems.


Introduction
Nowadays, social media platforms such as Facebook, Instagram, and Twitter have become the most popular medium for sharing information because of their convenient features and real-time behavior. People use various modalities, such as text, images, and audio, to express their views, opinions, breaking news, and ideas on these platforms. Because of this richness, researchers and companies are trying to distill various kinds of information from social media content. However, most previous studies address only one modality: image information extraction has been addressed by the computer vision community and textual information extraction by the natural language processing community. With the growing ubiquity of Internet memes on social media, it is important to extract information from memes as well. Since a meme comprises an image with textual information, a hybrid approach is needed.
Memotion analysis (Sharma et al., 2020) is commonly defined as the process of detecting and analyzing the underlying emotion of a meme. It might have a significant impact on addressing various issues related to social media. For example, malicious users nowadays use memes to propagate anti-social behavior, including online harassment, cyber-bullying, and hate speech. Memotion analysis might therefore help to limit such behavior.
To address the challenges of memotion analysis on social media content, Sharma et al. (2020) proposed Task 8 at SemEval-2020. The task consists of three related subtasks. Task A defines a sentiment classification problem in which a system needs to predict whether a meme is positive, negative, or neutral. Task B defines a multilabel, multiclass humor classification problem in which a system needs to identify the types of humor expressed by a meme. The categories are sarcastic, humorous, offensive, and motivational. Task C defines a semantic class quantification problem in which a system needs to quantify the extent to which a meme expresses each humor class defined in Task B.
The rest of the paper is structured as follows: Section 2 provides a brief overview of prior research. In Section 3, we introduce our proposed neural attention model. Section 4 presents our experiments and evaluation along with an analysis of the proposed method. Concluding remarks and future directions of our work are given in Section 5.
Williams et al. (2016) studied racial microaggressions and perceptions of Internet memes. A few researchers (Amalia et al., 2018; Verma et al., 2020) have tried to distill the inherent sentiment of meme content.

Proposed Neural Attention Framework
In this section, we describe the details of our proposed neural attention framework. The goal of our approach is to identify several emotion orientations of Internet memes, including sentiment, humor types, and the scales of the semantic classes. Figure 1 depicts an overview of our proposed model. In our architecture, we use the text extracted by optical character recognition (OCR) to identify the emotional orientation of a meme. After extracting the meme text, we employ a pre-trained word embedding model to obtain high-quality distributed vector representations. Next, we apply multi-kernel convolution (MKC) and a time-distributed bidirectional long short-term memory (Bi-LSTM) model to extract higher-level feature sequences with sequential information from the meme text embeddings. An attention mechanism is employed to amplify the contribution of important elements in the obtained feature representation. The resulting feature sequences are then sent to the fully connected prediction module to determine the final category label. Next, we describe each component in detail.

Embedding Layer
Word embeddings are among the most popular representations of a document's vocabulary. They can capture the context of a word within a text document while accounting for its semantic similarity and relation to other words (Mikolov et al., 2013; Bojanowski et al., 2017). In our proposed framework, we employ a pre-trained fastText (Bojanowski et al., 2017) word embedding model to obtain distributed vector representations of meme texts. The embedding matrix has dimension L × D, where L is the meme text length and D is the word-vector dimension.
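As an illustration of the L × D embedding matrix, the following is a minimal sketch in plain Python rather than the actual pipeline. The `embed` helper, the toy three-dimensional vectors, and the zero-vector fallback for unknown words are assumptions made for this example; a real fastText model can instead compose vectors for unseen words from character n-grams.

```python
def embed(tokens, vectors, dim):
    """Map L tokens to an L x D matrix (a list of L rows, each of length D)."""
    zero = [0.0] * dim  # assumed fallback for words missing from the toy table
    return [list(vectors.get(tok, zero)) for tok in tokens]

# Toy 3-dimensional vectors standing in for pre-trained fastText vectors.
toy_vectors = {
    "one": [0.1, 0.2, 0.3],
    "does": [0.0, 0.5, 0.1],
    "not": [0.4, 0.4, 0.0],
    "simply": [0.2, 0.1, 0.9],
}

tokens = "one does not simply meme".split()
matrix = embed(tokens, toy_vectors, dim=3)  # L = 5 rows, D = 3 columns
```

Here each row of `matrix` corresponds to one word of the meme text, which is the layout the convolution layer consumes.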

Multi-kernel Convolution
We apply a convolution operation to the embedding matrix produced by the embedding layer to extract higher-level features. Previous studies have demonstrated that convolution with multiple kernel sizes is more effective than convolution with a single kernel size (Kim, 2014; Zhang and Wallace, 2015; Wang et al., 2017). In our multi-kernel convolution, we use four kernel sizes (2, 3, 4, and 5) to extract different kinds of effective features.
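To make the operation concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes "valid" padding, a ReLU activation, and randomly initialised filters; the names `conv1d` and `multi_kernel_conv`, the toy input, and the two filters per kernel size are illustrative choices, and only the kernel sizes 2, 3, 4, and 5 come from the text.

```python
import random

def conv1d(matrix, kernel, bias=0.0):
    """Slide one k x D kernel over an L x D matrix ('valid' padding, ReLU)."""
    k = len(kernel)
    out = []
    for start in range(len(matrix) - k + 1):
        total = bias
        for offset in range(k):
            row, weights = matrix[start + offset], kernel[offset]
            total += sum(x * w for x, w in zip(row, weights))
        out.append(max(0.0, total))  # ReLU is an assumed activation
    return out

def multi_kernel_conv(matrix, kernel_sizes=(2, 3, 4, 5), n_filters=2, seed=0):
    """Return {kernel_size: [feature sequence per filter]} for random filters."""
    rng = random.Random(seed)
    dim = len(matrix[0])
    features = {}
    for k in kernel_sizes:
        filters = [
            [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(k)]
            for _ in range(n_filters)
        ]
        features[k] = [conv1d(matrix, f) for f in filters]
    return features

# Toy 6 x 3 embedding matrix (L = 6 words, D = 3 dimensions).
demo = [[0.1 * (i + j) for j in range(3)] for i in range(6)]
features = multi_kernel_conv(demo)
```

With valid padding, a kernel of size k yields L - k + 1 positions, so the four kernel sizes produce sequences of different lengths; a real model would typically pad or pool them before combining.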

Bidirectional Long Short-Term Memory (Bi-LSTM)
To learn sequential correlations from the higher-level feature representations obtained from the multi-kernel convolution, we employ a time-distributed bidirectional LSTM (Bi-LSTM) (Liu et al., 2020) model in our proposed framework. The bidirectional model processes the feature representations in two directions, one from past to future and one from future to past. Unlike a unidirectional LSTM, the backward LSTM preserves information from the future. Therefore, at any point in time, the combination of the forward and backward LSTMs enables the Bi-LSTM to preserve information from both the past and the future. Bi-LSTMs have shown very good results because they capture context better than unidirectional RNNs and LSTMs.
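The two-direction idea can be illustrated with a short sketch. Note that this is a deliberate simplification: a scalar tanh recurrent cell stands in for the LSTM cell, and the names `run_cell` and `bidirectional` as well as the fixed weights are assumptions made for the example. Only the forward and backward passes and their pairing per time step reflect the mechanism described above.

```python
import math

def run_cell(sequence, weight_in=0.5, weight_rec=0.5):
    """Run a scalar tanh recurrent cell (a stand-in for an LSTM) over a sequence."""
    state, states = 0.0, []
    for x in sequence:
        state = math.tanh(weight_in * x + weight_rec * state)
        states.append(state)
    return states

def bidirectional(sequence):
    """Pair the forward and backward states for every time step."""
    forward = run_cell(sequence)
    # Process the reversed sequence, then re-align to the original time order.
    backward = run_cell(sequence[::-1])[::-1]
    return list(zip(forward, backward))

states = bidirectional([1.0, 0.0, 0.0, -1.0])
changed = bidirectional([1.0, 0.0, 0.0, 1.0])  # only the last input differs
```

Comparing `states[0]` with `changed[0]` shows the point of the design: the forward component at the first step is identical, while the backward component differs because it has already seen the final input.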

Attention Mechanism
Recently, the attention mechanism has been widely used in neural network frameworks to address long-term dependencies effectively. This mechanism helps the model learn what to attend to, or focus on, based on the input text (Vaswani et al., 2017; Fotso et al., 2018). To amplify the contribution of important elements in the final representation of the time-distributed Bi-LSTM module, we employ an attention mechanism similar to the one used in the DeepMoji architecture (Felbo et al., 2017). DeepMoji follows the idea of Bahdanau et al. (2015) and Yang et al. (2016) and aggregates all the hidden states according to their relative importance weights.
Let $h_t$ be the representation of a word at time step $t$ and $w_a$ the weight matrix of the attention layer. The attention score $a_t$ is obtained by multiplying $h_t$ by $w_a$ and normalizing the result over all time steps into a probability distribution. Finally, the attentional representation $v$ is obtained as a weighted sum over all time steps:
$$e_t = h_t w_a, \qquad a_t = \frac{\exp(e_t)}{\sum_{i=1}^{T} \exp(e_i)}, \qquad v = \sum_{i=1}^{T} a_i h_i$$
where $T$ is the number of time steps.
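The attention step described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the attention weights are treated as a single vector so that each score is a scalar, normalization is done with a softmax, and the function name `attention` and the toy inputs are hypothetical.

```python
import math

def attention(hidden_states, w_a):
    """Return (weights, context) for T hidden vectors and one weight vector."""
    # One scalar score per time step: the dot product of h_t and w_a.
    scores = [sum(h_i * w_i for h_i, w_i in zip(h, w_a)) for h in hidden_states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over time steps
    dim = len(hidden_states[0])
    # Weighted sum of the hidden states, one value per dimension.
    context = [
        sum(a * h[d] for a, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, context

weights, context = attention(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w_a=[0.5, 0.5]
)
```

In this toy input the third hidden state has the largest score, so it receives the largest weight and contributes most to `context`.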

Prediction Module and Model Training
After obtaining the high-level representation from the attentive Bi-LSTM module, we pass it to a fully connected softmax layer for category prediction. We use cross-entropy as the loss function and train the model by minimizing it:
$$J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y^{(i)} = j\} \log \hat{y}_j^{(i)}$$
where $x^{(i)}$ is the $i$-th of $m$ training samples with true label $y^{(i)}$, $\hat{y}_j^{(i)}$ is the estimated probability in [0, 1] that $x^{(i)}$ belongs to label $j$ among the $k$ labels, and $1\{\text{condition}\}$ is an indicator that is 1 if the condition is true and 0 otherwise. We learn the model parameters with stochastic gradient-based optimization, using the Adam optimizer (Kingma and Ba, 2014).
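As a small worked example of the softmax prediction and cross-entropy loss, the sketch below computes the mean negative log-probability of the true labels. It is a plain-Python illustration; the helper names `softmax` and `cross_entropy`, the integer label encoding, the averaging over samples, and the toy logits are assumptions for this example.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, labels):
    """Mean negative log-probability assigned to each sample's true label."""
    loss = 0.0
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        # The indicator in the loss keeps only the true label's term.
        loss -= math.log(probs[label])
    return loss / len(labels)

loss = cross_entropy([[2.0, 0.5, 0.1], [0.2, 0.2, 0.2]], labels=[0, 2])
uniform_loss = cross_entropy([[0.0, 0.0, 0.0]], labels=[1])
```

For three equally likely classes the loss is log 3, which is what `uniform_loss` evaluates to; confident correct predictions drive the loss toward zero.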

Dataset Collection and Evaluation Strategy
The organizers of SemEval-2020 Task 8 on memotion analysis (Sharma et al., 2020) provided a benchmark dataset to evaluate the performance of the participants' systems. The training set contained 6992 annotated memes along with the OCR-extracted text, and the test set contained 1878 annotated memes.
To evaluate system performance, the organizers used different strategies for Tasks A, B, and C (Sharma et al., 2020). For Task A, the macro-averaged F1 score was used to estimate the performance of a system. For Tasks B and C, the macro-averaged F1 score of each subtask is computed first, and the average of these scores is taken as the final evaluation measure.
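The evaluation measures described above can be sketched as follows. This is an illustrative plain-Python version, not the organizers' scoring script: the function names `macro_f1` and `averaged_macro_f1`, the choice to score every label that appears in either the gold or the predicted labels, and the toy data are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def averaged_macro_f1(per_subtask):
    """Average macro F1 over subtasks given as (y_true, y_pred) pairs."""
    return sum(macro_f1(t, p) for t, p in per_subtask) / len(per_subtask)

gold = ["pos", "neg", "neu", "pos"]
pred = ["pos", "neg", "pos", "pos"]
task_a_score = macro_f1(gold, pred)
combined = averaged_macro_f1([(gold, pred), (gold, gold)])
```

In the toy example the per-label F1 scores are 1.0 (neg), 0.0 (neu), and 0.8 (pos), so `task_a_score` is 0.6. Because every label counts equally, a class that is never predicted lowers the macro average noticeably even if it is rare.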

Model Configuration
In the following, we describe the parameters used to design our proposed neural network model. We used the OCR-extracted text provided by the task organizers and the TensorFlow (Abadi et al., 2016) framework to build our neural model. Our model was trained on a GPU (Owens et al., 2008) to take advantage of parallel tensor computation. We used a simple grid search to select the optimal hyper-parameters. At the embedding layer, we employed the pre-trained fastText embedding model (Bojanowski et al., 2017) for the vector representation of meme texts. We used 600 filters in our multi-kernel convolution and a single-layer Bi-LSTM model. We trained our model for 30 epochs with the Adam optimizer and an initial learning rate of 0.001. In addition, we set the L2 regularization factor to 0.01 in the softmax layer. Unless otherwise stated, default settings were used for the other parameters.
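For reference, the reported settings can be collected in a single configuration object. The sketch below is only a convenience summary: the dictionary name and key names are hypothetical, and the values are those stated in the text (the kernel sizes come from the multi-kernel convolution description).

```python
# Hyper-parameters as reported in the paper; key names are illustrative.
CONFIG = {
    "embedding": "fastText (pre-trained)",
    "kernel_sizes": (2, 3, 4, 5),
    # The text reports 600 filters without saying whether this is per
    # kernel size or in total, so the value is recorded as stated.
    "num_filters": 600,
    "bilstm_layers": 1,
    "epochs": 30,
    "optimizer": "adam",
    "learning_rate": 0.001,
    "softmax_l2": 0.01,
}
```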

Experimental Results
We now evaluate the performance of our proposed method. The comparative results with the top-5 performing systems (Sharma et al., 2020) and the baseline system for Task A, Task B, and Task C are presented in Table 1, Table 2, and Table 3, respectively. The systems are ranked by the primary evaluation measure, the macro-averaged F1 score. The results show that our system is competitive with the top-performing systems. However, our system does not take advantage of the image information, because our model depends only on the OCR-extracted text. We believe that incorporating the image information might have a significant impact on memotion analysis and improve the performance of our model. There is a long line of research (Islam and Zhang, 2016; Fengjiao and Aono, 2018) that uses various techniques to distill the emotion of an image. Employing such techniques might be beneficial for extracting the image information for this task.

Conclusion
In this paper, we presented our approach to SemEval-2020 Task 8: Memotion Analysis. We tackled the problem by employing an attention-based neural network model. Although we achieved competitive performance, there is much room to improve our method. We exploit only the information from the text content; however, the information extracted from the image is also necessary in this context. In the future, we plan to address this limitation and to introduce several more sophisticated deep learning techniques.