NIT-Agartala-NLP-Team at SemEval-2020 Task 8: Building Multimodal Classifiers to Tackle Internet Humor

This paper describes the systems submitted to SemEval-2020 Task 8: Memotion by the ‘NIT-Agartala-NLP-Team’. The task organizers made available a dataset of 8879 memes to train and test our models. Our systems include a Logistic Regression baseline, a BiLSTM + Attention-based learner and a transfer-learning approach with BERT. For the three sub-tasks A, B and C, we attained ranks 24/33, 11/29 and 15/26, respectively. We highlight our difficulties in harnessing image information, as well as some techniques and handcrafted features we employed to overcome these issues. We also discuss various modelling issues, theorize possible solutions, and consider why these problems persist.

task also poses unique research value, such as the training of models that can understand references derived from background knowledge and complex human expressions such as humor and sarcasm.
In this paper, we share the insights gained from our time working with the task and dataset. In Section 2, we discuss previous work and related approaches to tackling the problem. Then, in Section 3, we introduce the task and the dataset. Subsequently, in Section 4, we outline the features employed and the model architectures. Section 5 reports our results and model performance. In Sections 6 and 7, we analyze our results and discuss possible sources of, and solutions to, our errors and issues. We summarize our key results and conjectures in Section 8.

Related Work
Research on memes, in particular, is few and far between. Some works have experimented with memes as sentiment predictors: in their work on Facebook sentiment analysis, French (2017) reported a positive correlation between the category of the meme used and the affect of the discussion derived from its texts. French (2017) goes on to confirm that memes were far more successful in conveying the sentiment of the debate than textual data. Other works attempt to automate the meme generation process (Peirson and Tolunay, 2018; Oliveira et al., 2016) but fail at replicating the humor expressed by particular memes. While Peirson and Tolunay (2018) attempt to re-purpose image-captioning algorithms to capture humor, Oliveira et al. (2016) attempt to use macros within news headlines to the same end. However, both works fail to generate coherent humor that can persuade annotators consistently. Another similar work, by Wang and Wen (2015), applies multimodal techniques to generate captions for popular meme formats. Finally, the propagation and effect of memes on social media are also studied in works such as that of Zannettou et al. (2018a), which assesses the popularity and use of certain memes in the context of each community. In another work, Ferrara et al. (2013) aim to detect and analyze meme usage in social media streams, particularly using an unsupervised clustering framework. Work in this space also makes use of meta-information about the meme (Wang and Wen, 2015; Ferrara et al., 2013) to provide context to the frameworks. Apart from memes, other work on multimodal social media analysis and humor analysis can provide valuable insights. Shin et al. (2018) show how multimodal information can be used to enhance the analysis of unstructured data on social media. There are numerous works that attempt to predict the sentiment of images (Kanishcheva and Angelova, 2015; Xu et al., 2014). However, these works focus on a single modality: images.
Works on multimodal sentiment analysis, such as that of Hu and Flaxman (2018), use both image and text embeddings in their models to show slight improvements over text-only models. Comprehensive overviews of multimodal sentiment analysis can be found in works such as those of Kaur and Kautish (2019) and Soleymani et al. (2017). In the realm of humor analysis, work such as the recent paper by Weller and Seppi (2019) has shown that deep learning models can outperform humans in classifying humor, mainly due to a disparity between the sense of humor of the annotators and that of the testers. In an interesting work by Chandrasekaran et al. (2018), an attempt is made to capture wit by using synonyms of words in image descriptions to incite puns through the subversion of expectations. While the work reports impressive results and beats humans in a controlled-vocabulary setting, humans quickly regain dominance when the vocabulary is unconstrained.

Data
The task organizers made available a dataset (Sharma et al., 2020) of 8879 annotated memes scraped from various sources across the internet. Each meme was annotated by two annotators to ensure annotation quality. The text was extracted from the image using the Google OCR system and manually corrected by crowdsourced workers, so that model accuracy does not depend on the quality of the OCR techniques used. We briefly describe each subtask of the Memotion Shared Task below:
• Task A - Sentiment Classification: Given a meme, the task is to classify it as a positive, negative or neutral meme.
• Task B - Multilabel Characteristic Classification: Given a meme, the system has to identify the presence of the following characteristics: humor, sarcasm, offense and motivation. Being a multilabel classification task, a meme can exhibit any combination of the above characteristics, or none at all.
• Task C - Scales of Semantic Classes: The third task is to quantify the extent to which a particular effect is being expressed, i.e., if the meme is humorous, whether it is funny, very funny or hilarious, and so on.
The dataset shows a significant imbalance, particularly in the representation of the labels negative, hilarious, very twisted and hateful offensive of the sentiment, humor, sarcasm and offensive categories, respectively. We present the distribution of labels in Table 1.

Preprocessing and System Overview
Before we outline our features and models, we briefly explain the steps we undertook to reduce noise in the text. As an initial step, we perform basic lowercasing and remove unnecessary punctuation. The second step removes specific noise-inducing aspects such as URLs and user mentions (using regular expressions). We did not perform any global image preprocessing; however, we did perform basic steps such as resizing and grayscale conversion for specific feature-extraction techniques.
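The text-cleaning steps above can be sketched as follows. The exact regular expressions are our own assumptions; the paper only specifies lowercasing and the removal of punctuation, URLs and user mentions.

```python
import re

def clean_text(text: str) -> str:
    """Minimal sketch of the noise-reduction steps described above.

    The patterns below are illustrative assumptions, not the paper's
    exact expressions.
    """
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"@\w+", " ", text)                   # strip @user mentions
    text = re.sub(r"[^\w\s']", " ", text)               # drop stray punctuation
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

print(clean_text("Check THIS out @memelord https://example.com/meme!!"))
```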

Features
Depending on the model, we employ three different text vectorization techniques: word (1,2)-gram TF-IDF features for the Logistic Regression model; a GloVe (Pennington et al., 2014) embedding pretrained on 27 billion tweets for the BiLSTM + Attention-based model; and contextual BERT embeddings (Devlin et al., 2019) for our transfer-learning approach.
Beyond text vectorization, we explore a few handcrafted features to improve model performance. We use features previously employed by Bertero and Fung (2016) in their work on detecting humor in sitcoms, and in Mahajan and Zaveri (2017)'s submission to SemEval-2017 Task 6. The stylistic features are as follows: the number of words; the number of parts-of-speech (POS) tags such as nouns, adjectives and verbs; and their ratios to the number of words. The POS tags are obtained using the CMU POS Tagger (Owoputi et al., 2013).
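A sketch of the stylistic features, assuming POS tags have already been produced upstream. The tag names and the tagged example are placeholders; the paper uses the CMU POS Tagger, whose tagset differs.

```python
def stylistic_features(tokens, pos_tags):
    """Stylistic features: token count, per-POS counts, and POS-to-word
    ratios. `pos_tags` stands in for the output of a POS tagger (the
    paper uses the CMU POS Tagger); tags here are illustrative."""
    n_words = len(tokens)
    feats = {"n_words": n_words}
    for tag in ("NOUN", "ADJ", "VERB"):
        count = sum(1 for t in pos_tags if t == tag)
        feats[f"n_{tag}"] = count
        feats[f"ratio_{tag}"] = count / n_words if n_words else 0.0
    return feats

tokens = ["cats", "are", "funny"]
tags = ["NOUN", "VERB", "ADJ"]
print(stylistic_features(tokens, tags))
```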
Ambiguity features are useful in representing the multiple meanings that can be delivered simultaneously, as found in pun-related humor (Yang et al., 2015; Miller and Gurevych, 2015). For this purpose, we use a concept called a Synset (short for Synonym set). A Synset is defined as a set of one or more synonyms that can be used interchangeably in the same context to express the same meaning. We derive these synonyms using the NLTK corpus (Loper and Bird, 2002). For example, the Synset for the word 'new' would be { new, fresh, raw, newfangled, modern, newly }. The ambiguity features used are as follows: Mean Synset Length (the mean of the synset lengths of the words in the text), Maximum Synset Length (the largest synset length any single word in the text has), and Synset Length Gap (the difference between the Maximum Synset Length and the Mean Synset Length).
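The three ambiguity features can be computed as below. For self-containment we use a small hand-made synonym table rather than NLTK's WordNet interface, so the entries are illustrative only.

```python
# Illustrative synonym table; the paper derives these from the NLTK corpus.
SYNSETS = {
    "new": {"new", "fresh", "raw", "newfangled", "modern", "newly"},
    "car": {"car", "auto", "automobile", "machine", "motorcar"},
    "is": {"is"},
}

def ambiguity_features(words):
    """Mean Synset Length, Maximum Synset Length, and their gap, as
    defined in the text. Unknown words get a singleton synset."""
    lengths = [len(SYNSETS.get(w, {w})) for w in words]
    mean_len = sum(lengths) / len(lengths)
    max_len = max(lengths)
    return {"mean": mean_len, "max": max_len, "gap": max_len - mean_len}

print(ambiguity_features(["new", "car", "is"]))
```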
Initially, we experimented with pretrained feature extractors to extract image information. Our approach, using various models pretrained on ILSVRC (Deng et al., 2009) such as the popular Inception-ResNet V2 (Szegedy et al., 2017), exhibited underwhelming results and sometimes proved detrimental to model performance. We therefore elected to employ more handcrafted features rather than pretrained feature extractors. For each image, hue, saturation and value are calculated by converting the RGB image into HSV channels using the scikit-image toolkit (van der Walt et al., 2014). We average the hue, saturation and value over all the pixels in the image to obtain the hue (H_image), saturation (S_image) and luminance (V_image) of the image. We also include an RMS contrast feature for each image, which can simply be defined as the standard deviation of pixel intensities (i.e., the brightness). We draw inspiration from previous work by Zhang et al. (2015b) and explore further features that can be quantitatively derived from the HSV model, such as Colourfulness, a metric defined by Hasler and Suesstrunk (2003) that exhibits high correlation with human perception of colourfulness, and the Valdez and Mehrabian (1994) metrics, which relate brightness and saturation to the emotions Pleasure, Arousal and Dominance. Following this, we also entertain the idea of employing facial expressions as an extra image feature. To do this, we re-purpose an open-source emotion detection application 5 . The approach uses a convolutional neural network at its core to detect the following emotions: Angry, Disgusted, Fearful, Happy, Neutral, Sad and Surprised. The model is pretrained on the FER2013 Kaggle dataset 6 and used as a feature extractor in our pipeline.
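Two of these features can be sketched directly in numpy. The RMS contrast follows the definition above; the colourfulness formula is the Hasler and Suesstrunk (2003) metric computed on opponent colour axes. Using a plain channel mean as the intensity proxy (rather than scikit-image's conversion) is a simplifying assumption here.

```python
import numpy as np

def image_features(rgb):
    """Sketch of two handcrafted image features, assuming `rgb` is an
    (H, W, 3) float array with values in [0, 1].
    - RMS contrast: standard deviation of pixel intensities.
    - Colourfulness: Hasler and Suesstrunk (2003) metric."""
    gray = rgb.mean(axis=2)                    # simple intensity proxy
    rms_contrast = gray.std()
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg, yb = r - g, 0.5 * (r + g) - b          # opponent colour axes
    colourfulness = (np.hypot(rg.std(), yb.std())
                     + 0.3 * np.hypot(rg.mean(), yb.mean()))
    return rms_contrast, colourfulness

img = np.zeros((4, 4, 3))
img[:2] = 1.0                                  # top half white, bottom black
contrast, colour = image_features(img)
```

A pure black-and-white image like the one above has maximal mid-range contrast (0.5) but zero colourfulness, which matches the intent of the two features.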
At this juncture, it is apt to recognise the small dataset size and label imbalance exhibited by the Memotion dataset (Sharma et al., 2020). To tackle the imbalance and data scarcity, we consider techniques such as oversampling, weighted training and resampling (such as SMOTE (Chawla et al., 2002)) in our pipelines. We also consider a text augmentation technique found in the work of Zhang et al. (2015a), re-purposing open-source code found on GitHub 7 . The core idea is to extend the given dataset by simple word replacement, wherein we replace certain words in a sentence with corresponding synonyms or phrase replacements pulled from an auxiliary database/dictionary. The image features are then replicated for the newly created synthetic sample. Our database of choice for such replacements was the paraphrase bank 8 database. We apply this technique on the training split of the data and duplicate the corresponding image and dense features.
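The word-replacement augmentation can be sketched as follows. The replacement table here is a toy stand-in; the paper draws replacements from a paraphrase database instead.

```python
import random

# Toy replacement table (illustrative); the paper uses a paraphrase database.
REPLACEMENTS = {"funny": ["hilarious", "amusing"], "big": ["large", "huge"]}

def augment(sentence, rng):
    """Create one synthetic sample by swapping each replaceable word for
    a randomly chosen alternative. In the full pipeline, the image and
    dense features of the original sample are duplicated for the new one."""
    return " ".join(rng.choice(REPLACEMENTS[w]) if w in REPLACEMENTS else w
                    for w in sentence.split())

rng = random.Random(0)
print(augment("that meme is funny", rng))
```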

Model Descriptions
For a baseline approach to the given problem, we elected an L2-regularized Logistic Regression as an ideal starting point. The input representation for text was TF-IDF uni-grams and bi-grams. We also use the handcrafted text and image dense features, as well as the emotion vectors, with this model. All modelling was done using the scikit-learn toolkit (Pedregosa et al., 2011). This model uses the previously explained data augmentation technique, as well as balanced class weights, during training.
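A minimal sketch of this baseline in scikit-learn: (1,2)-gram TF-IDF into a class-weighted, L2-regularized Logistic Regression. The handcrafted dense features and the augmentation step are omitted, and the texts and labels below are invented toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Word-level (1,2)-gram TF-IDF feeding an L2-regularized Logistic
# Regression with balanced class weights, as in the baseline.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000),
)

texts = ["so funny lol", "great joke haha", "this is awful", "truly terrible"]
labels = ["positive", "positive", "negative", "negative"]
model.fit(texts, labels)
print(model.predict(["so funny lol"]))
```

In the full system, the dense image and emotion features would be concatenated to the TF-IDF matrix before the classifier.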
Our second model is an attention-based deep learning model. Attention is a technique first introduced in the machine translation research space by Bahdanau et al. (2014). The idea was further extended to text classification by works such as those of Yang et al. (2016) and Raffel and Ellis (2015), which are what we implement here. All modelling tasks for this model were carried out using Keras with a TensorFlow (Abadi et al., 2015) backend. The model uses the Adam optimizer and a categorical cross-entropy loss function during the training phase. The model uses a GloVe (Pennington et al., 2014) embedding pretrained on 27 billion tweets for text representation, and the handcrafted image features and emotion vectors for image representation (concatenated to the dense-layer output that follows the attention mechanism). The model was set to train for 100 epochs with a batch size of 2048 and an initial learning rate of 0.001, but we found that the model started overfitting at an average of 40 epochs.
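The attention pooling at the heart of this model can be sketched in numpy, in the feed-forward style of Raffel and Ellis (2015): score each timestep of the recurrent output, softmax over time, and take the weighted sum. The scoring vector is random here for illustration; in the real model it is learned.

```python
import numpy as np

def attention_pool(H, w):
    """Feed-forward attention pooling (Raffel and Ellis, 2015 style).
    `H` is (timesteps, hidden_dim), e.g. BiLSTM outputs; `w` is a
    learned (hidden_dim,) scoring vector (random here)."""
    scores = np.tanh(H @ w)                          # scalar score per timestep
    alpha = np.exp(scores) / np.exp(scores).sum()    # softmax attention weights
    return alpha @ H                                 # fixed-size context vector

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
w = rng.normal(size=8)
context = attention_pool(H, w)
```

The resulting context vector is what the model concatenates with the image features before the final dense layers.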
We also explore the possibility of leveraging dynamic contextual embeddings through a transfer learning model: BERT. The developers of BERT provide a simple classification API through the run_classifier script available on their GitHub page 9 . Our underlying model of choice was BERT-Base, Uncased, which trains a total of 110 million parameters and contains 12 transformer blocks and 12 self-attention heads with a hidden-layer dimension of 768. The model simply draws the [CLS] token embedding of the second-to-last layer of BERT for classification. During the training process, the weights of the model are modified slightly to better cater to the task at hand. We fine-tune hyperparameters such as the learning rate, batch size and maximum sequence length to improve performance for the different categories. In this case, the model only takes advantage of the data augmentation and preprocessing techniques mentioned above; no extra image or dense features were provided.
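The layer-selection step can be illustrated without loading BERT itself: given the stack of per-layer hidden states, classification uses the embedding of the [CLS] token (position 0) from the second-to-last layer. The random arrays below merely stand in for real BERT outputs; the shapes match BERT-Base (12 encoder layers plus the embedding layer, hidden size 768).

```python
import numpy as np

# Mock hidden states: 13 arrays (embeddings + 12 layers), each of shape
# (batch, sequence_length, hidden_size). Real values would come from BERT.
batch, seq_len, hidden = 2, 16, 768
hidden_states = [np.random.randn(batch, seq_len, hidden) for _ in range(13)]

# Take the [CLS] token (first position) from the second-to-last layer.
cls_embedding = hidden_states[-2][:, 0, :]
print(cls_embedding.shape)
```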

Experimental Setup and Results
The dataset was provided to task participants as a pre-annotated training dataset (containing 7001 samples) and an un-annotated test dataset (containing 1878 samples). We first perform a model and ablation analysis using 10-fold cross-validation on the training dataset, and follow up with the scores obtained by our systems on the test dataset, as provided by the task organizers. All results are reported using the Accuracy and macro F1 metrics, which were used by the task organizers to rank the systems.
For the training dataset, a validation split was initially maintained to diagnose variance and bias issues arising in the training phase and to aid hyperparameter tuning. Our overall train:validation:test split ratio was 80:10:10. However, for our final results, we merge the validation and training sets and report 10-fold cross-validation results. We report our results for Subtask A and the categories within Subtask C (as Subtasks B and C vary only by semantic level) in Table 2. We then perform an ablation analysis with a 10-fold cross-validation split using the Logistic Regression model to better understand the various techniques and features employed and how they affect model performance. Due to the task-specific nature of the experiment, we carry it out for all the categories: sentiment analysis, humor, sarcasm, offense and motivation classification. We report the results in Table 3. On the test dataset, the task organizers rank each system based on averaged macro F1. For Subtasks B and C, where there are four separate categories within the task, an average score over the four categories was provided. Ranks were provided only for the top submission of each team. Initially, we submitted three different systems for task evaluation: the Logistic Regression model, the BiLSTM + Attention model and the BERT model. However, we saw that certain models performed better on certain subtasks. Therefore, for our final evaluation, we submitted a mixed system (referred to in Table 4 as the Final Mixed System): BERT for Subtask A, Attention for Subtask B (except the motivational category) and Logistic Regression for Subtask C (and the motivational category of Subtask B). This represents our official submission to the task. We also report the macro F1 scores and potential ranks of the individual models in Table 4.
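For reference, the ranking metric (macro-averaged F1) is the unweighted mean of per-class F1 scores, which gives minority classes equal weight. A minimal sketch (the labels and predictions below are invented):

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: per-class F1, then the unweighted mean."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["pos", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neu"]
print(macro_f1(y_true, y_pred, ["pos", "neg", "neu"]))
```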
Our model study revealed that the Logistic Regression and BERT models exhibit the best results, trading places based on which metric is considered. The results of the attention-based learner were underwhelming and not competitive with the other models.
Our ablation analysis reveals that, in general, balancing and augmentation techniques provide the biggest performance improvements (on the basis of macro F1). We note that both these techniques were implemented to address the data scarcity and imbalance issues. Image features, too, provide slight improvements when added (in all combinations). However, when these techniques are considered in combination, the addition of augmentation over text + balancing + image features leads to a drop in performance for the humor category, and also in the sarcasm task, where the text + image features model outperforms the text + augmentation + image features model. In an additional test, we also observed that, for the attention-based learner, image information provided little to no improvement in performance.
On the test dataset, BERT and the Logistic Regression model again obtain the best results. We also observed that the attention-based learner exhibits lower performance on the addition of image features, across all subtasks.

Error Analysis
Due to the under-representation of specific labels, our models were unable to classify samples into these classes effectively. This was especially apparent for the highly under-represented hateful offensive (of the offensive category) and very twisted (of the sarcasm category) labels, where very few samples could be correctly classified. While we have reservations regarding the subjective nature of humor and offense, we generally found offensive memes wrongly classified as non-offensive, and funny memes wrongly classified as not funny, when the context derived mostly from the image. This points to our inability to capture image information effectively. It also indicates the absence of any background knowledge, which is imperative for understanding the references that most memes tend to invoke. On analyzing the errors and ground truth of the motivation category, we find many of the annotations puzzling. We are, therefore, curious as to what the annotation guidelines for this category were, and refrain from making any conjectures regarding these classification errors. Another issue we would like to explore is the underwhelming results of the BiLSTM + Attention model. The main issues faced during the training phase were the small overall dataset size and the under-representation of specific labels, which led to an inability to train larger and deeper models without facing high-variance issues. Another issue arising from the small dataset size is the inability to train an image feature extractor from scratch (to alleviate data dissimilarity between the dataset and pretrained models) and better harness image information.

Discussion
In our time with the task of meme emotion analysis, we have come to believe that there are many challenges inherent in modelling the task, apart from the machine learning algorithms being leveraged. We want to bring attention to the fact that memes are ever-changing: new memes enter and old memes leave the meme ecosystem very frequently, and they are quite often informed by current affairs. This implies that training a model on historical meme samples may provide little insight into the meme ecosystem of today; the trends learnt from historical samples might not translate well to present trends. Regarding the data collection and annotation phase, we make the following observations. Meme Heterogeneity: Memes come in different forms and styles, or so-called templates. Each template can represent a certain punchline or message, and templates conveying the same message may not share the same referential image. For example, the set of memes {"The Scroll of Truth" 10 , "Nancy Pelosi Ripping Paper" 11 , "Most People rejected his message" 12 } represent the same idea of 'ignoring a fact or supposed truth' but use different referential images to enhance the humor. Memes belonging to the same template can be assumed to convey a similar message. Countless such templates are used online. Therefore, we think datasets should not attempt to draw memes from all templates but only from certain ubiquitous ones: in the former case, the dataset does not contain enough samples to adequately represent each meme template's idea. As a result, models face the insurmountable task of learning trends from a large number of templates, each with a minimal set of samples. It may be apt to reverse this idea and create datasets with a few selected templates but a large number of samples corresponding to each template.
Annotator Bias: As humans, many of us have different tastes and, consequently, different senses of humor. What one person finds hilarious might not affect another. Therefore, it may be the case that, in tasks such as these, we are modelling humor specific to the annotators, thereby working with an annotator-specific bias. In the work of Weller and Seppi (2019), it was seen that a sample of random humans could not outperform a model trained on the same jokes. This indicates that the model has learned annotator-specific biases and may not generalise well to the populace. We are also curious as to what the human error on a task such as this would be.
To address these concerns, we perform an experiment in which we randomly sample 100 memes from the (previously annotated) training dataset and have four independent annotators tag them on all five categories: sentiment, humor, sarcasm, offensive and motivational. The annotation process was carried out in-house by the authors of this manuscript. We use basic guidelines (in the form of dictionary definitions and examples from the dataset) for tagging the categories and labels. We then calculate macro F1 scores and accuracy for each annotator and compare their performance with the Logistic Regression model's predictions on the same 100 memes. We also calculate inter-annotator agreement metrics to better understand whether all the annotators are on the same page. We report the results in Table 5. On calculating Randolph's (2005) free-marginal multi-rater kappa, we found poor agreement on annotations over all categories: sentiment (0.22), humor (0.10), sarcasm (0.09), offense (0.36), motivational (0.38). The poor agreement scores are an indicator of how subjective concepts such as humor and offense can be. We also find that the human annotators barely outperform our Logistic Regression model, an indicator of low human performance on such tasks. A few annotators also reported certain memes as random or nonsensical; on manual checking, we found these memes to relate to pieces of media (e.g., Star Wars, Lord of the Rings) with which the annotator was not familiar.
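The agreement metric can be computed directly: Randolph's free-marginal kappa fixes chance agreement at 1/k for k categories, so it needs only the observed per-item agreement. The ratings below are a toy example, not the paper's annotations.

```python
from collections import Counter

def free_marginal_kappa(ratings, k):
    """Randolph's (2005) free-marginal multi-rater kappa.
    `ratings`: one list of rater labels per item; `k`: number of
    possible categories. Chance agreement is fixed at 1/k."""
    n = len(ratings[0])                        # raters per item
    po = 0.0
    for item in ratings:
        counts = Counter(item)
        # Fraction of agreeing rater pairs for this item.
        po += sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    po /= len(ratings)
    pe = 1.0 / k
    return (po - pe) / (1 - pe)

# Toy example: 4 raters, 3 categories (values illustrative only).
ratings = [["pos", "pos", "pos", "neg"], ["neu", "neu", "neu", "neu"]]
print(round(free_marginal_kappa(ratings, k=3), 3))
```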

Conclusion
Our work on this dataset highlights the difficulty of effectively harnessing image information to understand the references and media that are part and parcel of a meme. Consistent with previous work, we were only able to obtain small performance improvements by employing image-based features. We also highlight modelling issues that arise due to the heterogeneous nature of the data (a small number of samples, a large number of templates) and the resulting high-variance issues. We also call into question annotation and data collection practices that might not immediately translate to the task of meme analysis. A general suggestion would be to use a large number of annotators, or to harness popularity metrics such as the likes and upvotes used on websites like Reddit and Instagram. Popularity metrics could provide a more generalized view of the humor or other characteristics of a meme. We think that memes can be imperative to understanding user sentiment in the present internet ecosystem and beyond. We look forward to the new interest that enters this space due to the work done here and in the Memotion shared task.