Infotec + CentroGEO at SemEval-2020 Task 8: Deep Learning and Text Categorization approach for Memes classification

The information shared on social media is increasingly important; it includes both images and text, and perhaps the most popular combination of these two kinds of data is the meme. This manuscript describes our participation in the Memotion task at SemEval 2020. The goal of this task is to classify memes into several categories related to their emotional content. To build the proposed system, we used different strategies; the best ones were based on deep neural networks and a text categorization algorithm. We obtained results analyzing the text and images separately, and also in combination. Our best performance was achieved in Task A, which is related to polarity classification.


Introduction
The information usually shared on social media can come in various modalities, such as text, images, audio, or video. Although this kind of information is often shared in a recreational way, many institutions have paid attention to what users or citizens share on the Internet. An aspect that has been analyzed in the past is the polarity of text (for instance, tweets, reviews, or news). The polarity reflects whether the text has negative, positive, or neutral content. More recently, other kinds of data have gained attention, for instance, images. The most popular and descriptive mix of image and text is the meme. Richard Dawkins proposed the term meme, derived from the Greek mimeme, which is associated with imitation (Dawkins, 2016). He expressed that a meme is a form of social engendering or cultural propagation. Memes are important because they express emotion, jokes, humor, or something that could not be expressed using words alone. The automatic classification of memes poses important challenges. The meaning of a meme can be interpreted in different ways by different people. Furthermore, to analyze and classify the meaning of a meme, both image and text processing must be tackled, and combining them is the best way to reach a good classification result (Iwazaki, 2018; Xia et al., 2020), which is not easy to achieve. Some works have tackled the meme classification problem. Kumar and Garg (2019) proposed a method for detecting only a sarcastic tone in memes. For sarcasm detection, a set of typographic-only memes was used. The authors extracted the text from the images and applied the standard TF-IDF approach over the words. Finally, they reported an accuracy of 88% with a multilayer perceptron classifier on a dataset called MemeBank, obtained from the Instagram social network using the hashtags #sarcasm, #sarcastic, and #irony as positive examples and #motivational and #inspirational as negative ones.
Smitha et al. (2018), Peirsoon and Tolunay (2018), and Kanai (2016) also dealt with meme classification.
Psychological studies and commercial marketing are some of the primary motivations for sentiment analysis of images with text; that is, the classification of memes can be useful in marketing, advertising, trend analysis, and so on (Smitha et al., 2018). Also, performing this task with automatic techniques allows response times to be optimized with respect to manual analyses.
In this paper, we present our proposed system for the Memotion Analysis task at SemEval 2020. The following sections describe our proposal in more detail.
The rest of the paper is organized as follows. Section 2 describes the data and the task description. Section 3 describes our approach to solve the problem and our implemented system. Section 4 details our experimental results, and finally, conclusions are given in Section 5.

Dataset and task description
The Memotion Analysis task was divided into three sub-tasks: A, B, and C. The overall task description can be consulted in the task overview (Sharma et al., 2020). Task A is sentiment classification: given a meme, the system must classify its content as positive, negative, or neutral. Task B is humor classification, where the system has to identify the kind of humor expressed by the meme. The types of humor considered were sarcastic, humorous, offensive, and motivational; a meme can be assigned one, several, or none of these categories. Task C concerns scales of semantic classes, where the goal is to quantify the degree of emotion expressed in the meme. The quantification levels are not (0), slightly (1), mildly (2), and very (3), for each of the sarcastic, humorous, and offensive emotions.
The sizes of the datasets provided by the competition organizers were 7000 memes for training, 1000 memes for the trial, and, finally, 2000 meme images for the test. Each dataset had an associated file with the text contained in the meme. For a detailed description of the datasets, see Sharma et al. (2020). The text used to feed all our models was obtained from the concatenation of the fields text ocr, text corrected, and image name. This concatenation was done to obtain as much text as possible, regardless of whether it was duplicated.
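As a minimal sketch of this assembly step (the field values below are hypothetical, not taken from the competition data), the three fields can be joined as follows:

```python
# Sketch of the text-assembly step with hypothetical field values: the
# three fields are concatenated even when their contents overlap.
row = {
    "image_name": "image_1.jpg",
    "text_ocr": "ONE DOES NOT SIMPLY",
    "text_corrected": "One does not simply",
}
text_input = " ".join([row["text_ocr"], row["text_corrected"], row["image_name"]])
print(text_input)
```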

Proposed system

Neural networks
A neural network is a machine learning method that has gained a lot of popularity since Krizhevsky et al. (2012) showed the potential of deep learning by winning the Large Scale Visual Recognition Challenge 2012 (ILSVRC12) competition with AlexNet. Following this trend, we decided to apply neural networks in this competition. In particular, we used the Inception architecture in our experimentation process; all details are explained below.

Inception modules
In recent years, one of the most notable concepts in neural network topology is the inception module, an architecture that produces efficient and accurate predictions. The inception module was first introduced by Szegedy et al. (2014) as a building block for creating neural networks by stacking modules one on top of another.
A later version, Inception-V3 (Szegedy et al., 2016b), improves the way the computations are done. The number of operations is reduced by replacing 5 × 5 filters with two consecutive 3 × 3 filters. With this small change, we get the inception module shown in Figure 1a, which is used for the early layers. In a similar way, the authors replace n × n filters with two asymmetric convolutions: one 1 × n and one n × 1 (see Figure 1b). With that, they replace n² parameters with 2n, which is a big saving in computation. A third type of module is used for the final layers, with the configuration shown in Figure 1c. In summary, the proposed architecture uses three types of inception modules. This architecture is the base of our proposed system.
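The savings from these factorizations can be checked with a back-of-the-envelope count (weights per filter for a single input channel, biases omitted; this is an illustration of the arithmetic, not code from the system):

```python
# Weight counts for the two factorizations used in Inception-V3.
def conv_weights(kh, kw):
    return kh * kw

# A 5x5 filter vs. two stacked 3x3 filters (same receptive field):
five_by_five = conv_weights(5, 5)             # 25 weights
two_three_by_three = 2 * conv_weights(3, 3)   # 18 weights

# An n x n filter vs. the asymmetric pair 1 x n followed by n x 1:
n = 7
square = conv_weights(n, n)                            # n^2 = 49 weights
asymmetric = conv_weights(1, n) + conv_weights(n, 1)   # 2n = 14 weights
print(five_by_five, two_three_by_three, square, asymmetric)
```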

Residual connections
Theoretically, deeper networks should be more powerful than shallow ones. Suppose we have two neural networks A and B, where A has n layers and B has n + m layers. The first n layers of B can compute the same function as A, and the remaining m layers can simply compute the identity; therefore, network B should be able to do everything A can. Unfortunately, in practice this usually does not happen: one problem with neural networks is that the deeper they get, the more difficult they are to train. Residual connections mitigate this by adding the input of a block to its output, so each block only has to learn a correction to the identity.
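A residual connection computes f(x) + x, so a block whose branch f outputs zero reduces exactly to the identity. The following toy NumPy sketch (a made-up linear-plus-ReLU branch, not the paper's actual layers) illustrates this fallback behavior:

```python
import numpy as np

# Minimal sketch of a residual (shortcut) connection: the block outputs
# f(x) + x. If the residual branch f produces zero, the whole block is
# the identity, which is what makes deep stacks easier to train.
def residual_block(x, W):
    return np.maximum(W @ x, 0.0) + x  # toy branch: linear map + ReLU

x = np.array([1.0, -2.0, 3.0])
W_zero = np.zeros((3, 3))              # a branch that has "learned" zero
print(residual_block(x, W_zero))       # output equals the input x
```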

The base model
We used the Keras (Chollet and others, 2015) implementation of the inception-resnetV2 (Szegedy et al., 2016a) network as our base model. Basically, this version uses the inception modules with residual connections, which we present in Figures 3a, 3b, and 3c. The authors showed that the residual connections help to improve the training speed of their architecture.

Figure 3c: Inception module C with residual connection.

Using text and images
Once we chose the base model for images, we tried to incorporate the text data into the network. We started with a network with two inputs: one for the image (224 × 224 pixels with RGB channels) and one for the vectorized text. For the image input, we used the inception-resnetV2 without the top layers and with no pre-trained weights, and we then flattened the output to get a vector. For the text input, we first vectorized the text using the CountVectorizer function of sklearn (Pedregosa et al., 2011) (producing a vocabulary of 11205 words after the removal of stopwords) and then applied a sequence of dense layers. The CountVectorizer method converts a set of text documents into a matrix of token counts; in the simplest case, a token is a word of the vocabulary. After that, we had two vectors of the same length, which we concatenated and followed with some extra dense layers to finally get a three-way output with a softmax activation predicting the negative, neutral, and positive labels for the meme.
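The vectorization step can be sketched on a toy corpus (the documents below are invented examples, not the competition data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer maps each document to a row of token counts over the
# vocabulary learned from the corpus; columns are in alphabetical order.
docs = [
    "surprised pikachu meme",
    "distracted boyfriend meme meme",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))  # ['boyfriend', 'distracted', 'meme', 'pikachu', 'surprised']
print(X.toarray())                     # [[0 0 1 1 1]
                                       #  [1 1 2 0 0]]
```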
Unfortunately, the results using both the images and the text were not better than using the images alone. Nevertheless, the results of using only a text vectorization method, ignoring the images, were better than using both. So the strategy was to use a neural network for the images and a vector model for the text, and to combine them with a neural network. To combine the two models, we used stacked generalization (Wolpert, 1992). For this method, we split the dataset into 5 parts and trained a neural network on 4 of the 5 partitions. With that network, we predicted the labels of the held-out part and stored the predictions in a file. We repeated this process in a similar way to cross-validation; in the end, after training 5 models, we had unbiased predictions for the full dataset. Finally, we trained a neural network again, with the same configuration, but this time on the full data; this model was stored to be used along with the text model. The model we used for the images was an inception-resnetV2 without the top layers and with no pre-trained weights, followed by a Global Average Pooling layer and a Dropout. During training, we used Adam optimization with a learning rate of 0.0001 for 50 epochs. We also tried the pre-trained version of the inception-resnetV2 without the top layers, plus a Global Average Pooling layer and some extra dense layers; since these changes did not improve the results, we focused on the untrained version.
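The out-of-fold prediction step of stacked generalization can be sketched with scikit-learn, using LogisticRegression on synthetic data as a stand-in for the neural networks and dataset used in the paper:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# 5-fold out-of-fold predictions: every training example is predicted by a
# model that never saw it, which is what makes the predictions "unbiased"
# and usable as input for a second-level combiner.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)
base_model = LogisticRegression(max_iter=1000)
oof_predictions = cross_val_predict(base_model, X, y, cv=5)
print(oof_predictions.shape)  # one out-of-fold prediction per example
```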
For the text model, we used our text classification algorithm based on Tellez et al. (2018), called TextCategorization.jl 1 . TextCategorization.jl is a Julia package inspired by µTC. The main difference with µTC is that it performs a full model selection; that is, the combinatorial problem represents the entire text-classification pipeline. Each configuration (model) describes all preprocessing functions, different tokenization schemes as in µTC, several term-weighting schemes, and several parts of the classifier used (a kernel-based and a prototype-based classifier). The selection of both the kernel and the prototyping scheme is part of the combinatorial problem, which is tackled using a local search algorithm (Beam Search): the configuration space is sampled randomly for the initial population and then explored with a mutation and crossover strategy. This algorithm was used for text representation; one of our efforts was the concatenation of the text representation it generated with the image representation extracted by the deep neural network approach. The neural network for combining the image and text inputs is depicted in Figure 4. It had two inputs: one branch for the image model and one for the text model. The outputs of the image and text models were each followed by two dense layers, then a concatenation, and then a series of further dense layers. The activation functions used were ReLU, except for the last layer, where we used softmax. Sadly, the results were not better than using just images or just text.
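At the shape level, this fusion architecture can be sketched in NumPy (toy dimensions and random weights chosen for illustration, not the trained model of Figure 4):

```python
import numpy as np

# Shape-level sketch of a two-branch fusion network: one dense branch per
# input, concatenation, then a softmax over the three polarity classes.
rng = np.random.default_rng(0)

def dense_relu(x, out_dim):
    W = rng.normal(scale=0.1, size=(out_dim, x.shape[0]))
    return np.maximum(W @ x, 0.0)  # Dense layer + ReLU

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

image_repr = rng.normal(size=16)  # stand-in for the image-model output
text_repr = rng.normal(size=8)    # stand-in for the text-model output
h = np.concatenate([dense_relu(image_repr, 32), dense_relu(text_repr, 32)])
probs = softmax(rng.normal(scale=0.1, size=(3, h.shape[0])) @ h)
print(probs)  # three class probabilities summing to 1
```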
For our final attempt to combine the images and text, we used XGBoost on the predictions (with its default configuration), but without better results (see Table 1).

The final model
Up to this point, we had been testing our image methods on Task A only, and we chose our best-ranked neural network: the inception-resnetV2 without the top layers and with no pre-trained weights, followed by a Global Average Pooling layer and a Dropout, trained using Adam optimization with a learning rate of 0.0001 for 50 epochs. From now on, the base model used for Task B and Task C was this one, with minor changes on the top layers to predict the more complex labels.

Figure 4: Neural network used for combining the image and text predictions.
The decision to apply similar models, based on Task A, to the other tasks was supported by seeing this method obtain better scores 2 on the three tasks.

Proposal for task B
For Task B, we tried the same approaches described for Task A with similar outcomes, so the final model was very similar to that of the previous section. The first change is that, after the Global Average Pooling layer, we added a Dense layer of size 512; the final layer is a four-dimensional output with sigmoid activation. We trained the network with the same parameters as in Task A.
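The sigmoid output head suits the multi-label nature of Task B: unlike a softmax, the four outputs are independent and need not sum to 1, so a meme can receive any subset of the labels. A sketch with invented logit values:

```python
import numpy as np

# Four independent sigmoid units, one per label in
# (humorous, sarcastic, offensive, motivational).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, -1.0, 0.5, -3.0])  # toy logits, not model output
probs = sigmoid(logits)
labels = probs > 0.5                        # independent per-label decision
print(labels)  # [ True False  True False]
```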

Proposal for task C
For Task C, we had to modify the model a bit more. After the Global Average Pooling layer, we split the network into four branches, so that the output has the form (humour, sarcasm, offensive, motivation). For each branch, we added a Dense layer of size 512 and, finally, a Dense layer with four neurons for the first three branches and two neurons for the motivation branch. These final layers use a softmax activation, so the output is the probability of each intensity level. We obtained better results training for 20 epochs.
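The shape of this four-branch output head can be sketched as follows (random logits and toy values, purely to illustrate the branch sizes and per-branch softmax):

```python
import numpy as np

# One softmax per branch: humour/sarcasm/offensive each predict four
# intensity levels, motivation is binary.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
branch_sizes = {"humour": 4, "sarcasm": 4, "offensive": 4, "motivation": 2}
outputs = {name: softmax(rng.normal(size=k)) for name, k in branch_sizes.items()}
print({name: p.shape[0] for name, p in outputs.items()})
```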

Experimental results
We have two types of experimental results: one evaluated without the test set and one evaluated on it. The evaluation without the test set was made using the predicted labels from the stacked generalization; with that, we obtained unbiased predictions for the whole training set. We show the results of XGBoost and our final method over the training set in Table 1. Note that the XGBoost results combine both the images and the text to make the prediction but show a lower Macro-F1 score on every task, although scoring high on Micro-F1.

Table 1: Internal results on the training set during our system construction.

Next, we show the results provided by the organizers on the test set in Table 2. The bold scores are the best on each task. Note that our final pick was the best on Tasks A and B, while the score on Task C was below the baseline.
Figure 5 shows the confusion matrix for Task A on the test set; here, we can see that the model has problems classifying the negative and neutral sentiments. For Task B, the confusion matrices in Figure 6 show that our solution struggles to identify the type of humor. For Task C, the prediction is more complicated. Figure 7a shows that the model fails to detect very humorous memes. Something similar happens in Figure 7b, with problems also on not-sarcastic memes. Figure 7c follows the same trend for offensive samples. The motivational detection in Figure 7d shows that most of the errors occur when assigning the negative label to positive examples. The official results are shown in Table 3, reported in Macro-F1 and Micro-F1; the best results in each task are in bold. In summary, our system obtained position six in Task A, position 24 in Task B, and position 25 in Task C. As can be seen, our best performance was in Task A; this is because we mainly designed our system based on the internal results obtained on Task A, but we also wanted to submit our system to the other tasks.

Conclusions
This paper describes our system submitted to the Memotion Analysis task at SemEval 2020. The proposed system was based on deep learning, specifically using the Inception-resnetV2 as a base model. Unfortunately, we reached our best results using the text and the images separately instead of using both. We designed our system based on Task A, in which we obtained sixth place overall (out of 35 results), which is a good result for our team; nevertheless, in Tasks B and C, our results were in positions 24 and 25.