UPB at SemEval-2020 Task 8: Joint Textual and Visual Modeling in a Multi-Task Learning Architecture for Memotion Analysis

Users in the online environment create various ways of expressing their thoughts, opinions, or sense of amusement. Internet memes were created specifically for these situations. Their main purpose is to transmit ideas through combinations of images and text that evoke a certain state in the receiver, depending on the message the meme is meant to convey. These posts can relate to various situations or events, thus adding a humorous side to almost any circumstance. In this paper, we describe the system developed by our team for SemEval-2020 Task 8: Memotion Analysis. More specifically, we introduce a novel system to analyze these posts: a multimodal multi-task learning architecture that combines ALBERT for text encoding with VGG-16 for image representation. In this manner, we show that the information behind them can be properly uncovered. Our approach achieves good performance on each of the three subtasks of the current competition, ranking 11th for Subtask A (0.3453 macro F1-score), 1st for Subtask B (0.5183 macro F1-score), and 3rd for Subtask C (0.3171 macro F1-score), while exceeding the official baseline results by wide margins.

memes as either positive, neutral, or negative. Furthermore, Subtask B builds upon the first subtask, challenging participants to perform binary classification of the posts with respect to four categories: humor, sarcasm, offense, and motivation. Finally, Subtask B is extended into Subtask C, where the four categories are expanded into classes of different granularity, gradually increasing from the lowest to the highest intensity for each category.
Proposed Approach. We intend to solve all the previously mentioned subtasks by introducing a neural network based on multi-task learning (MTL). The system contains modules dedicated to image analysis and modules specialized in text processing. The architecture simultaneously outputs answers for all the required subtasks.
The remainder of this work is structured as follows. In Section 2, we analyze existing solutions from related work. In Section 3, we outline the approaches we applied for memotion analysis. Section 4 details the performed experiments, the experimental setup, and the error analysis. Finally, we draw conclusions in Section 5.
2 Related Work

2.1 Humor Recognition

Yoshida et al. (2018) proposed a multimodal approach to humor identification, combining a Convolutional Neural Network (CNN) (Fukushima and Miyake, 1982) with a Long Short-Term Memory network (Hochreiter and Schmidhuber, 1997). The authors also used a ResNet-152 (He et al., 2016) model and developed a custom loss function that incorporates a funniness score. To compute the funniness of a post, they used the reviews from a humor website, where people evaluate posts with stars. The loss function revolves around a threshold, set by the authors to 100.
Humor recognition has also been addressed with a feature-based solution (Chandrasekaran et al., 2016). The authors created features by analyzing the inputs at different levels: cardinality, location, object-level, and instance-level features. Support Vector Regression is then used for humor score prediction. Furthermore, the authors introduced a new technique for adjusting humor scores by altering the funniness of a scene. First, they detected the objects that contribute to the humor of a particular scene; then, they identified replacements that can alter its funniness. For the former part, they used a multi-layer perceptron (MLP) that outputs a binary class for each object in the scene. For the latter part, altering the humor, they used another MLP trained to identify potential replacements for the original object.

Multimodal Classification
Besides humor identification, sentiment analysis is another task that can be applied to image-text pairs (Qian et al., 2019). The authors introduced an architecture based on a CNN inspired by AlexNet (Krizhevsky et al., 2012), as well as a combination of Support Vector Machines and AffectiveSpace 2 (Cambria et al., 2015). Regarding the visual features, the architecture differs from AlexNet by replacing the multiple fully connected layers with a single fully connected layer of size 4096 x 2. The textual features are extracted using 5-fold cross-validation based on AffectiveSpace 2. Their system surpasses the other existing solutions by a margin of 7.2%. The authors concluded that combining visual and textual features yields improved results compared to a single modality.
EmbraceNet (Choi and Lee, 2019) aims to improve the results of multimodal classification tasks by offering flexibility for any learning model, as well as various methods of dealing with absent data. Its architecture consists of Docking layers, which transform each input vector to a fixed size, and an Embracement layer, which combines the feature vectors into a single vector through multinomial sampling. The model also considers the correlation between modalities, handles missing data, and provides a regularization effect. Overall, EmbraceNet obtained considerably improved scores compared to its counterparts, making it a good choice for multimodal classification tasks.

Proposed Approaches
We propose three multi-task learning architectures to solve all the subtasks of the memotion analysis competition (i.e., sentiment classification, humor classification, and scales of semantic classes). Because the entries are bimodal, the resulting system can be divided into two main components based on their input (i.e., one for image-only and the other for text-only input). Both parts act as feature extractors, and the resulting encodings are fused to obtain the text-image representation. Based on the fused features, our system discriminates between the output classes of each subtask. Next, we detail the three systems.

Text-only Multi-task Architecture
In order to extract the most salient features from the text input, we opted for the ALBERT model (Lan et al., 2019), pre-trained on an English corpus. ALBERT obtained state-of-the-art results on the GLUE (Wang et al., 2018), RACE (Lai et al., 2017), and SQuAD (Rajpurkar et al., 2018) benchmarks, while being more memory efficient and requiring shorter training times than its predecessor, BERT (Devlin et al., 2019). Moreover, ALBERT removes the Next Sentence Prediction pre-training strategy and replaces it with Sentence Order Prediction, which better models inter-sentence coherence. As a consequence of the cross-layer parameter sharing used to decrease memory complexity, the model converges faster and more smoothly than BERT, proving to be a more stable architecture. These improvements of ALBERT over BERT are the main reasons for choosing it as our text encoder.
The ALBERT model comes in four variants based on the number of parameters: base (12M), large (18M), xlarge (60M), and xxlarge (235M). We opted for the xlarge variant in order to balance the trade-off between model size and performance. From an architectural standpoint, ALBERT xlarge has 24 layers with 16 attention heads each and a hidden dimension of 2,048. We extracted the pooled output of ALBERT as the feature vector encoding the textual input. We added a dropout layer (Srivastava et al., 2014) with rate 0.1 over the resulting feature vector as a regularization mechanism, in order to ensure the robustness of our model. The output of this layer is then fed to five independent classifiers, one for each category tracked in the competition. Each classifier consists of two fully connected layers of size 512 and 256, respectively, each followed by a dropout layer with rate 0.3. On top of each such layer stack, a fully connected output layer is added, where every neuron corresponds to a class. In the end, we obtained five outputs accounting for sentiment classification (three classes), humor classification (four classes), sarcasm classification (four classes), offense classification (four classes), and motivation classification (two classes). We used the softmax activation function on the output layers in order to obtain a probability distribution over the classes.
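The five independent classifier heads over the pooled feature vector can be sketched with NumPy as follows. This is a minimal sketch in which hypothetical random weights stand in for trained ones, and the dropout layers are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dense(x, n_out, relu=False):
    # Hypothetical randomly initialized weights stand in for trained ones.
    w = rng.standard_normal((x.shape[-1], n_out)) * 0.02
    h = x @ w
    return np.maximum(h, 0.0) if relu else h

# Pooled ALBERT xlarge output for a batch of two memes (2,048 features each).
features = rng.standard_normal((2, 2048))

# One independent classifier head per category; class counts follow the paper.
n_classes = {"sentiment": 3, "humor": 4, "sarcasm": 4, "offense": 4, "motivation": 2}
outputs = {}
for task, c in n_classes.items():
    h = dense(features, 512, relu=True)   # first fully connected layer (512 units)
    h = dense(h, 256, relu=True)          # second fully connected layer (256 units)
    outputs[task] = softmax(dense(h, c))  # per-task output with softmax
```

Each head sees the same shared feature vector but learns its own task-specific transformation, which is the core of the multi-task setup.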

Image-only Multi-task Architecture
To extract features from the image input, we opted for the VGG-16 architecture (Simonyan and Zisserman, 2014), which consists of five stacks of convolutional layers using a 3x3 kernel size. Each stack is followed by a max-pooling layer with a 2x2 window, the last one being linked to three fully connected layers, where the first two have 4,096 units each and the third has 1,000 units. We removed the last three layers of the original architecture to keep only the convolutional stacks used to extract features from the input image. Moreover, we connected the last layer to a global average pooling layer to obtain a one-dimensional feature vector. As in the text-only feature extractor, we connected the resulting feature vector to the independent classifier stacks. The architecture also adopts a transfer learning strategy by using weights pre-trained on the ImageNet dataset (Deng et al., 2009).
In our case, the meme templates vary extensively, from a single image with associated text to multiple image crops stacked together. Resizing the images to the input size required by the original VGG-16 architecture (i.e., a 224x224 resolution) might lose relevant aspects of the memes. To prevent this detrimental information loss, we replace the default input layer of VGG-16 so that it accepts larger images, resized to a resolution of 500x500 pixels.
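A quick shape check illustrates why the larger 500x500 input still yields a compact feature vector. The sketch below assumes VGG-16's five 2x2 max-pool stages (stride 2, valid padding) and its 512 final channels:

```python
import numpy as np

# With a 500x500 input, each of VGG-16's five 2x2 max-pool stages halves
# the spatial resolution: 500 -> 250 -> 125 -> 62 -> 31 -> 15.
side = 500
for _ in range(5):
    side = (side - 2) // 2 + 1  # valid pooling with window 2, stride 2

# The last convolutional stack emits 512 channels; global average pooling
# collapses the remaining spatial grid into a 512-dimensional feature vector.
feature_map = np.random.default_rng(0).standard_normal((1, side, side, 512))
feature_vector = feature_map.mean(axis=(1, 2))
```

Because the fully connected layers are removed and global average pooling is spatial-size agnostic, the feature vector keeps the same dimensionality regardless of the input resolution.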

Multimodal Multi-task Architecture
In order to take advantage of the image-text relationship, we combined the two separate textual and visual feature extraction components into a unified architecture. It receives both the input image and the text processed into the ALBERT input format, each input channel being handled independently by the corresponding specialized component. Passing the information through both models results in two feature embeddings, E_i ∈ ℝ^512 and E_t ∈ ℝ^2048, for the image and text, respectively. The two embeddings are then concatenated to obtain the image-text vector representation E_it ∈ ℝ^2560. This output is sent to the set of independent classifiers for each subtask. An overview of our architecture is illustrated in Figure 1.
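The fusion step itself reduces to a concatenation of the two branch outputs; a minimal NumPy sketch with random placeholder embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
e_image = rng.standard_normal((1, 512))   # E_i: VGG-16 branch output
e_text = rng.standard_normal((1, 2048))   # E_t: ALBERT branch output

# Concatenation yields the joint image-text representation E_it,
# which is then fed to the five independent classifier heads.
e_fused = np.concatenate([e_image, e_text], axis=-1)
```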

Data and Preprocessing
The dataset consists of 6,992 memes for training, out of which we randomly selected 10% for validation, and 2,000 memes for testing, provided in different image formats. Alongside the memes, an extra file specifies the labels (i.e., humor, sarcasm, offense, motivation, and overall sentiment), as well as the extracted text, for each of them. The class distribution is imbalanced, which motivated the use of class weights in the loss function (Khosla, 2018).
The text is preprocessed into the proper input format for the ALBERT model as follows: tokenization with the SentencePiece subword tokenizer (Kudo and Richardson, 2018) officially released with ALBERT, followed by computation of the input ids (i.e., each token's position in the SentencePiece vocabulary file), the input masks (values of 1 for actual tokens to be considered by the model and 0 for padding tokens), and the segment ids (indicating the sentence to which each input token belongs). As previously mentioned, the images in the dataset are resized to 500x500 pixels and then serialized to tfrecord files alongside the corresponding text and labels. The preprocessing pipeline for the test set is similar to the one used for training.
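The construction of input ids, input masks, and segment ids can be sketched as follows. Note that a toy whitespace tokenizer and vocabulary stand in for the actual SentencePiece model, so the tokens and ids here are purely illustrative:

```python
# Toy vocabulary; the real pipeline uses the SentencePiece model shipped
# with ALBERT, so these entries and ids are illustrative only.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "when": 3, "you": 4, "see": 5, "it": 6}
MAX_LEN = 10

def encode(text):
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)           # 1 for real tokens
    padding = MAX_LEN - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding     # 0 for padding tokens
    input_mask += [0] * padding
    segment_ids = [0] * MAX_LEN                 # single-sentence input
    return input_ids, input_mask, segment_ids

input_ids, input_mask, segment_ids = encode("when you see it")
```

Since every meme's caption forms a single sentence, the segment ids are uniformly zero; the mask is what lets the model ignore the padded positions.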

Experimental Setup
We conducted three experiments, one for each input-specific MTL architecture. Each of them was trained in two steps. In the first step, we froze all the layers of the feature extracting component (i.e., ALBERT or VGG-16), depending on the input type. During the second step, we unfroze the weights and allowed the networks to properly adjust them. The architectures were implemented using TensorFlow 2.1 (Abadi et al., 2016). Furthermore, we experimented with two different optimizers, Adam (Kingma and Ba, 2014) and LAMB (You et al., 2019), in order to maximize the performance of our solutions. We also used early stopping with a patience of 30, such that a model stops training if no improvement has occurred during the last 30 epochs. During the first training step, we set a learning rate of 5e-4 with a warm-up over the first 10% of the total training steps; the second step ran with a peak learning rate of 5e-5. Table 1 presents the hyper-parameters used for training the models during the experiments. Furthermore, we used five separate loss functions, namely L_sentiment, L_humor, L_sarcasm, L_offense, and L_motivation, where each one is a categorical cross-entropy of the following form:

L = − Σ_{i=1}^{C} y_i · log(ŷ_i)

where y_i represents the observation (the ground-truth indicator for class i), ŷ_i is the predicted probability for class i, and C is the number of output classes.

Table 2 contains the results obtained from running the experiments. As expected, the best results are yielded by the multimodal method, the MTL ensemble combining ALBERT for text encoding and VGG-16 for image processing. Moreover, the text-only MTL solution comes very close to the best results obtained by the multimodal MTL. For example, on the sentiment identification subtask, the multimodal MTL achieves a macro F1-score only 0.16% higher than the ALBERT-only MTL, even though its number of parameters is considerably larger than that of the text-only counterpart.
This result might be due to the lack of information contained in most of the meme images, the text often being the crucial factor in establishing the final classification. The same effect can be observed when comparing against the results obtained using only the image branch of the network, VGG-16. Notably, there is a difference of 15.62% macro F1-score on the offense identification subtask between the VGG-16 MTL network and the ALBERT MTL solution. This difference is further explained by the fact that the text yields a larger number of informative features than the images. Furthermore, on the test set, we obtained the following results, reported in the official competition format (i.e., the three main subtasks): 0.3453 macro F1-score for Subtask A, 0.5183 macro F1-score for Subtask B, and 0.3171 macro F1-score for Subtask C.
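The per-task weighted categorical cross-entropy described above can be sketched in NumPy; the class weights shown are illustrative, not the values used in our experiments:

```python
import numpy as np

def weighted_categorical_crossentropy(y_true, y_pred, class_weights=None):
    # y_true: one-hot observations; y_pred: softmax class probabilities.
    losses = -np.sum(y_true * np.log(y_pred), axis=-1)
    if class_weights is not None:
        # Scale each sample's loss by the weight of its true class,
        # counteracting the dataset's class imbalance.
        losses *= class_weights[y_true.argmax(axis=-1)]
    return losses.mean()

# Example: sentiment head (three classes) with illustrative weights
# that up-weight the rarer classes.
y_true = np.array([[0, 1, 0], [1, 0, 0]])
y_pred = np.array([[0.2, 0.7, 0.1], [0.5, 0.3, 0.2]])
weights = np.array([1.0, 2.0, 2.0])
loss = weighted_categorical_crossentropy(y_true, y_pred, weights)
```

One such loss is computed per classifier head, and the heads are optimized jointly over the shared feature extractor.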

Conclusions and Future Works
This paper presented our approaches to the memotion analysis shared task organized at SemEval-2020. We proposed several architectures that address the memotion analysis problem by using recent breakthroughs in computer vision and natural language processing: VGG-16 alongside ALBERT. By creating multi-task learning solutions, both as a joint textual and visual modeling network and as separate single-modality channels, we were able to achieve good scores on the aforementioned subtasks and provide good insight into how this relatively new challenge can be approached. We highlighted that factors such as image noise, typographical errors, or overlapping text hinder a proper analysis. Moreover, various combinations of texts and images can cause serious problems for the analysis performed by deep learning approaches.
In future work, we will investigate the effect of using improved visual models, such as VGG-19 (Simonyan and Zisserman, 2014), the deeper variant of VGG-16. Furthermore, considering that ALBERT is a lite version of BERT, we also intend to explore text processing models with a larger number of parameters, such as BERT-base or even BERT-large.