SESAM at SemEval-2020 Task 8: Investigating the Relationship between Image and Text in Sentiment Analysis of Memes

This paper presents our submission to task 8 (memotion analysis) of the SemEval 2020 competition. We explain the algorithms that were used to learn our models along with the process of tuning the algorithms and selecting the best model. Since meme analysis is a challenging task with two distinct modalities, we studied the impact of different multimodal representation strategies. The results of several approaches to dealing with multimodal data are therefore discussed in the paper. We found that alignment-based strategies did not perform well on memes. Our quantitative results also showed that images and text were uncorrelated. Fusion-based strategies did not show significant improvements and using one modality only (text or image) tends to lead to better results when applied with the predictive models that we used in our research.


Introduction
SemEval 2020 task 8 (Sharma et al., 2020) is a sentiment analysis task targeted at memes 1 divided into three sub-tasks of increasing complexity: Sub-task A is predicting the sentiment polarity of a meme, Sub-task B is a multi-label binary classification task which aims to predict whether a meme is humorous, offensive, sarcastic and/or motivational (it can also have neither of these attributes), Sub-task C is a multi-output ordinal classification task which aims to predict the degree of humour, offence, sarcasm and motivation of a meme.
The dataset used for this task contains memes images whose text has been extracted by optical character recognition (OCR) and manually corrected when needed. Each meme is annotated on different aspects: sentiment polarity for sub-task A and the degree of humour, sarcasm, offence and motivation for sub-tasks B and C.
Memes sentiment analysis is a challenging task as memes are multi-modal, rely heavily on implicit knowledge, and often use humour and sarcasm. While this topic is of growing interest for NLP community, the way image and text interact in memes has barely been explored, leading to sub-optimal representation learning.
In an attempt to shed some light on the role of both modalities, we investigate their correlation and their impact on each sub-task prediction. Our code is available at https://github.com/bonheml/SESAM.

Related work
Sentiment analysis of text is a very active research area which still faces multiple challenges such as irony and humour detection (Hernández Farias and Rosso, 2017) and low inter-annotator agreement caused by the high subjectivity of the content .
Research has been extended to multimodal sentiment analysis during the last years (Soleymani et al., 2017), but the focus was mostly on video and text or speech and text. The specific multi-modality of memes in sentiment analysis has only been addressed recently by French (2017), who investigated their correlation with other comments in online discussions.
The growing usage of memes as an alternative medium of communication on social media has also recently drawn the attention of the online abuse research community. Zannettou et al. (2018) studied the propagation of memes posted by fringe web communities 2 , and their influence and transmission between different social media. Sabat et al. (2019) performed hate speech detection on memes and showed that images were more important than text for the prediction.
However, as pointed out by Vidgen et al. (2019), memes completely make sense only if one takes both text and image content into account. These modalities can also lead to totally different perceived sentiment when recombined. For example, a meme whose image is a grumpy cat and the text is "happy birthday" will have a very different sentiment from a meme with the same text but with an image of a happy puppy.
We argue that having a better understanding of both modalities interaction will contribute to more informed joint representations and is a crucial topic to explore.
Thus, we investigate the impact of multiple embeddings applied to both modalities on several models with different types of decision boundaries and verify the consistency of our findings assessing the embeddings across the three different sub-tasks. As our main focus is to study the impact of representations used, we chose simple classification models from Scikit-learn (Pedregosa et al., 2011) such as K nearest neighbours or Gaussian Naïve Bayes over more complex ones such as the deep learning architecture composed of three bidirectional gated recurrent units networks with contextual intermodal attention proposed by Akhtar et al. (2019). We did not perform any hyper-parameter tuning.
In multi-view representation learning, different techniques can be used to represent the views, depending on the nature of the relationship between them (Guo et al., 2019). When they share latent traits, one can use alignment to project their embeddings into a common space given a constraint (e.g., distance, correlation) (Baltrušaitis et al., 2019). On the other hand, if they are complementary, fusion techniques will be more useful as they will group the meaningful latent variables of each view into a compact representation (Li et al., 2019).
In section 3.1 we assess the usefulness of aligned representation for memes by investigating the possible correlations between images and memes. Then, in section 3.2, we study the added value of voter-based fusion techniques such as the one proposed by Gaspar and Alexandre (2019) in their work on multimodal sentiment analysis.

How correlated are images and text?
Exploring the possible correlations between images and text can provide valuable insights into the efficiency of the aligned representation for memes. Canonical correlation analysis (CCA) (Hotelling, 1936) has proven to be very efficient for correlation-based multimodal representation learning alignment (Wang et al., 2015), and has been successfully used for cross-modal multimedia retrieval (Rasiwasia et al., 2010). In order to provide a broad analysis of correlation, we analyse both linear and non-linear relationships between image and text embeddings.
Linear CCA Introduced by Hotelling (1936), CCA aims to find the linear projections of two views which are maximally correlated.
More formally, let X ∈ R n×m and Y ∈ R n×p be two zero-mean matrices of n observations with m and p features respectively. We aim to find the K orthogonal linear projections A = [a 1 , . . . , a k ], B = [b 1 , . . . , b k ] such that: where Σ XX and Σ Y Y are the covariance matrices of X and Y respectively and Σ XY is their cross covariance matrix (Uurtio et al., 2017). Deep CCA As standard CCA is only able to discover linear relationships, other techniques such as Kernel CCA (KCCA) (Lai and Fyfe, 2000;Melzer et al., 2001;Van Gestel et al., 2001;Akaho, 2001) have been developed to discover non-linear associations. However, KCCA does not scale well to large datasets. Using the better scaling capacity of deep neural networks, Andrew et al. (2013) proposed deep CCA (DCCA), a version of CCA which stacks layers of non-linear transformations for both views and optimises the correlation between their transformed representations. Given the size of the dataset and the high dimensionality of the features used in this study, we chose to use DCCA over KCCA.
Application to the tasks Both CCA and DCCA are applied to the training dataset. CCA results are evaluated using the first canonical correlation scores and the assessment of their statistical significance.
As DCCA provides only aligned embeddings, it cannot be evaluated using the same techniques. Instead, we trained DCCA on the training dataset, predicted the aligned embeddings of the dev and testing dataset and compared the results of our different models, discussed in section 2.2, with aligned and non-aligned embeddings. We also investigated the intra-class correlation by performing CCA on each class of sub-task A and each label of sub-task B.

How image and text contribute to the predictions?
Fusion methods Over the years, various fusion techniques for predictive models have been developed. Some rely on a neural network to perform the fusion (Tanti et al., 2017), or just concatenate the modalities into one vector and treat it as a unimodal problem (Baltrušaitis et al., 2019). However, it is difficult to uncover the contribution of each modality with these techniques. In contrast, a voter-based fusion technique (Morvant et al., 2014;Gaspar and Alexandre, 2019) can be easily interpreted and will thus be used here. This technique is referred to as late fusion as the fusion is performed after the learning phase whereas techniques such as embedding concatenation, where the fusion occurs before the learning phase, are referred to as early fusion.
As voter fusion is model-agnostic, it also allows us to test it on different models and tasks to verify the generalisation of our findings. While late fusion has been shown to often provide better results in multimedia fusion (Snoek et al., 2005), early fusion tends to perform better when one of the modalities contribute more than the other to the predictions (Morvant et al., 2014).
To handle this possibility, our voter, illustrated in figure 1, is composed of three identical models which are trained on image, text and a concatenation of both embeddings respectively. Thus, we perform hybrid fusion, using the information provided by both late and early fusion. As we are only interested in exploring the impact of the different modalities, unlike in Gaspar and Alexandre (2019) where classifier decisions were weighted according to their quality, we gave all classifiers the same weights. To assess whether different modalities contribute to a different type of prediction, we also run each model independently and compare their results. Thus, if a modality is only helpful in some cases (e.g., only for negative polarity detection), the voter should provide better results than independent models.  Models The different predictive models used are logistic regression (LR), K nearest neighbours (KNN), Gaussian Naïve Bayes (GNB), Random forest (RF), and multi-layer perceptron (MLP). We chose them so that we can study the impact of different embeddings on several decision boundaries. To suit the different sub-tasks objectives, these models are wrapped in meta classifiers as described in table 1. We used implementation from Scikit-learn (Pedregosa et al., 2011) for the multi-output and multi-label classifiers and a custom implementation of Frank and Hall (2001) made compatible with Scikit-learn models for the ordinal classifier.

Experimental setup
Data cleaning and preprocessing We manually added the text values of seven memes which had neither OCR nor corrected text values in the training dataset and removed URLs corresponding to meme sources from transcribed texts. A number of websites used for meme generation add their URL to the final meme, and this was often caught and transcribed by the OCR extraction. Following Camacho-Collados and Pilehvar (2018) study on text preprocessing, we did not perform any lemmatisation or lowercasing. To obtain one text embedding per meme, the text of each meme was vectorised using a pre-trained Universal Sentence Encoder (USE) (Cer et al., 2018) retrieved from Tensorflow hub (Abadi et al., 2015). The images were processed with Xception (Chollet, 2017), pre-trained on ImageNet (Russakovsky et al., 2015), and the penultimate layer was used as embedding.
Dataset analysis As shown in figures 2 and 3, the training dataset is highly skewed towards positive memes which are mostly funny, motivational, slightly sarcastic and offensive. Occurrences of "extreme" memes such as hateful offensive are very rare (e.g., there are less than 500 hateful offensive memes). The word count distribution is equivalent over each label, and we did not find words specifically attached to a given label. No obvious cluster of memes was shown by the t-SNE (van der Maaten and Hinton, 2008) or UMAP (McInnes et al., 2018) projections of sentence and image embeddings.
Models training The same model types and embeddings combinations are used for the three sub-tasks, and we only varied the meta classifiers as listed in table 1 to adapt the models' output to the task at hand.
Experiment Model training and evaluation was performed in three phases:   Table 4: Averaged macro F1 scores for sub-task C on the dev dataset. The score from the model selected during the evaluation phase to be submitted to the competition is marked with * The training phase where a training and a dev dataset are provided. During this phase, each architecture 3 is trained on the training dataset and evaluated on the dev dataset using the macro F1 score for sub-task A, and averaged macro F1 scores for sub-tasks B and C 4 . No hyper-parameter tuning is performed and the dev dataset is only used to filter non-informative models which will not be submitted during the evaluation phase.
The evaluation phase where an unlabelled testing dataset is provided. During this phase, the predictions are done using the architectures previously selected, without retraining, and uploaded to Codalab. Similarly to the training phase, the results are evaluated using a macro F1 score for sub-task A and an averaged macro F1 scores for sub-tasks B and C. The combination of model type and embedding providing the best results on the testing dataset over the three sub-tasks is selected for the final ranking.
The ranking phase where the model selected during the evaluation phase is submitted for ranking. The final ranking is done using the testing dataset and the same metrics as in the previous phases.

Results
This section provides an analysis of our results at each step of our experiment. First, we investigate the results on the dev dataset which guided our model selection for the evaluation phase. The results retrieved from Codalab during the evaluation phase for the selected models are then analysed, and we finally conclude with the analysis of the scores provided during the final ranking.

Evaluation on dev dataset
Alignment approach (correlation-based) No statistically significant correlation between image and text was found with linear CCA, either over the entire training dataset or intra-class. Non-linear Deep CCA (DCCA) does not provide significant improvements compared to non-aligned concatenated embeddings or text modality only, and even often worsened the results. It thus seems that image and text are more complementary than correlated in the case of sentiment analysis of memes. This finding is consistent with the fact that memes often make sense when the combination of image and text is considered and changing one or the other can change the associated sentiments (Vidgen et al., 2019). Given these empirical results, we conclude that alignment-based approaches may not be suitable for meme analysis.
Fusion approach (voter-based) As shown in tables 2, 3 and 4, voter-based fusion technique did not lead to better results than one modality alone and consistently worsened the results. Interestingly, all the models tested, except KNN, performed better with one modality alone over all the sub-tasks.
Most of the models were not able to find very discriminative features in image embeddings and often ended up predicting every class as belonging to the most frequent class. This problem was also reflected in concatenated embeddings whose results were most of the time worse than the ones provided by the most informative modality. While early and hybrid fusion approaches (i.e., concatenated embeddings and voters) provide better scores than image-only, text-only generally gives the best results, especially for GNB, RF and MLP.
While it intuitively makes sense to consider images in memes, it seems that image representations such as the one we used may not be suitable for the task at hand and thus underperformed. Indeed, these representations are accurate enough to perform image captioning, but they lack the higher-level information we use to interpret memes. For example, a surprised cat and a grumpy cat will just be represented as cats when the sentiment attached to a meme "Me when I look at my grades" can drastically change depending on the type of cat used. Thus, using embeddings extracted from image sentiment classifiers could be more suitable to sentiment analysis of memes.
Because of the reuse of the same image with a different text leading to different sentiments, using image embeddings only can also introduce noise to the data with one image linked to contradictory outputs. Thus, it may be more efficient to merge both embeddings early on. However, early fusion did not show consistent improvements, indicating that more complex, model-dependent fusion techniques such as neural networks may be needed.
Except for KNN that obtained marginally better results with fusion techniques, most models seem to perform best with text, contrary to what was reported by Sabat et al. (2019) for hate speech detection. These apparently contradictory results may be due to the usage of different discriminative features on each task. This could be an interesting avenue to explore for assessing the potential of transfer learning with memes embeddings. Indeed, the more different discriminative features used for sentiment analysis and hate speech detection of memes are, the less efficient the usage of generic meme embeddings will be.
Finally, as memes often reflect the shared culture of the communities which create them (Lin et al., 2014), having some contextual knowledge would probably also be greatly beneficial for meme analysis.
Model selection Given that the results of linear regression are very low and do not provide much information on the impact of each modality, it is removed from the pool of models that will be used for the evaluation phase.

Evaluation on the testing dataset
During the evaluation phase, we used the results provided on Codalab, which are referenced in tables 5, 7 and 9 to select the model to submit for competition ranking and assess the generalisability of the conclusion made from the empirical results during the training phase. After the release of the final ranking,    Table 8: Corrected averaged macro F1 scores provided after the competition for sub-task B on the testing dataset of the evaluation phase. The score from the model submitted to the competition is marked with * the task organisers have notified us that the scores displayed on Codalab during the competition were incorrect and have released corrected scores which we discuss in section 4.3. In this section, we discuss the incorrect Codalab evaluation scores because this is what was available to us during the competition, and we used these to select the final model that we submitted.
Alignment approach (correlation-based) Similarly to the results observed during the training phase, no statistically significant correlation between image and text was found with linear CCA. The scores obtained with DCCA were also lower than the ones obtained during the training phase. Thus, we did not use the architecture with aligned embeddings.
Fusion approach (voter-based) Surprisingly, the results in the evaluation phase were very different from those obtained during the training phase. RF and MLP, which were both performing very well on text modality over all three sub-tasks had consistently lower scores with almost equivalent results over all the embeddings tested. KNN which was previously performing well on fusion-based embeddings also provided lower scores which were almost equivalent over all the embeddings tested, with a marginal improvement with image embeddings.

Model selection
The model best performing on the testing dataset of the evaluation phase, KNN with image embedding, was the one submitted for the final ranking. When this analysis was performed, we did not know that the results in tables 5, 7, 9 were incorrect.

Final results
In this section, we present the correct evaluation results which became available after the competition. Fusion approach (voter-based) As shown in tables 6, 8 and 10, the corrected results are very close to the original evaluation results for sub-task A, but vary importantly for sub-tasks B and C. Similarly to the results obtained during the training phase, GNB still favours text-only modality for all the tasks, but other models now show similar results for text, image, concat and voter. Interestingly, text embeddings provide much lower results than during the training phase, especially for RF and MLP. Various factors such as different label distributions between dev and testing dataset, more similar vocabulary between dev and training dataset, or memes with less informative text in the testing dataset could have influenced these results. We argue that an in-depth analysis of these possible factors could lead to new insights regarding the embedding features used by the models during the learning process. Thus, we will investigate it once the annotated testing dataset has been published.
Model selection Given the corrected scores, GNB with text embeddings would have been a better model to submit for final ranking, especially for sub-task B. Unfortunately, the correct evaluation scores were not available during the competition.

Conclusion
We have provided an overview of the impact of different representations on meme sentiment analysis. We tested alignment-based and fusion-based techniques on a range of models on each sub-task. While none of them seemed to be beneficial for the different sub-tasks, we found that 1) alignment-based techniques were not suitable for meme analysis as image and text of memes are not correlated 2) using only one modality (text or image) tends to perform better than a combination of both when we use standard (i.e. non-deep learning) predictive models. However, these conclusions should be taken with caution as the scores obtained on the dev and testing datasets vary greatly and other factors, such as the label distribution of the dataset can also have influenced these results. Finally, we argue that a more adapted image representation, possibly enriched with contextual knowledge, as well as more complex fusion techniques, may be promising to explore.
A Hyper-parameters of the models In addition to the code provided at https://github.com/bonheml/SESAM, the language, packages used and their version, and the hyper-parameters of the models are detailed in this appendix for reproducibility.

A.1 Language and library used
The experiment was implemented in Python 3.6 and the packages listed in