Emoji-Based Transfer Learning for Sentiment Tasks

Sentiment tasks such as hate speech detection and sentiment analysis, especially when performed on languages other than English, are often low-resource. In this study, we exploit the emotional information encoded in emojis to enhance the performance on a variety of sentiment tasks. This is done using a transfer learning approach, where the parameters learned by an emoji-based source task are transferred to a sentiment target task. We analyse the efficacy of the transfer under three conditions, i.e. i) the emoji content and ii) label distribution of the target task as well as iii) the difference between monolingually and multilingually learned source tasks. Among other findings, the transfer is most beneficial if the target task is balanced and has high emoji content. Monolingually learned source tasks have the benefit of taking into account the culturally specific use of emojis and gain up to F1 +0.280 over the baseline.


Introduction
Many natural language processing (NLP) tasks suffer from a lack of available data. This is especially true for sentiment tasks, such as hate speech (HS) detection, which depend on the availability of manually annotated data. When moving to languages other than English, many sentiment tasks quickly become very low-resourced.
On the other hand, noisy social media content is available in abundance and many sentiment tasks are based on user comments on such platforms. Emojis can be a valuable source for the distant supervision of sentiment tasks, as they correlate with the underlying emotion of a comment. In this study, we aim to exploit the emotional information encoded in emojis to improve the performance on various sentiment tasks using a transfer learning approach from an emoji-based source task (ST) to a sentiment target task (TT). Previous work has focused on the transfer from predicting single emojis (Felbo et al., 2017) or strictly pre-defined emoji clusters (Deriu et al., 2016). However, pre-defined emoji clusters do not take into account the culturally diverse usage of emojis (Park et al., 2012; Kaneko et al., 2019). We therefore introduce data-driven supervised and unsupervised emoji clusters and compare these with single emoji prediction tasks. Specifically, we analyze the efficacy of the transfer from a single emoji or (un)supervised emoji cluster prediction ST to a sentiment TT under three conditions, i.e. i) low vs. high amount of emoji content present in TT, ii) balanced vs. unbalanced label distribution in TT and iii) monolingually or multilingually learned ST. The first two conditions are based on typical qualities of sentiment corpora, which tend to be unbalanced in their label distribution with varying degrees of emoji content depending on the source of the data. The third condition is relevant for languages for which a TT is low-resource and which might benefit from a multilingually learned ST.
In Section 2 we give an outline of related work, followed by the introduction of our method (Section 3). The experimental setup in Section 4 details the data and models used as well as the (un)supervised clusters generated. In Section 5 we describe our results and conclude in Section 6.

Related Work
Emojis have been used as a type of distant supervision using pre-defined emotion classes based on psychological models (Suttles and Ide, 2013), binary (positive/negative) classes (Deriu et al., 2016) or a set of single emojis (Felbo et al., 2017). However, such pre-defined emoji classes often do not account for the culturally diverse use of emojis (Park et al., 2012; Kaneko et al., 2019). In contrast, our work does not pre-define the emotion classes found in emojis and instead learns these classes, or clusters, from the data itself. While our and the above approaches focus on exploiting emojis as additional labelled data, e.g. in a transfer setting, emoji embeddings (Eisner et al., 2016) have been used as additional features in downstream tasks such as sarcasm detection (Subramanian et al., 2019).
Transfer learning has recently been driven by transformer-based (Vaswani et al., 2017) language models (LM) such as BERT (Devlin et al., 2019) or XLM-R (Conneau et al., 2020). When learning a source task on these models, the representations in the encoder change to become informative to the task at hand. In a parameter transfer setting, a new but related target task then profits from the learned representations in the encoder. Transfer learning has been applied to sentiment analysis (SA) using parameter transfer methods such as pretrained sentiment embeddings (Dong and de Melo, 2018) or machine translation-based context vectors (McCann et al., 2017). Our approach forms part of the parameter transfer approach, as we use encoder representations learned using emoji-based source tasks and transfer these to sentiment target tasks.
Hate speech classification and sentiment analysis have in recent years been the object of many shared tasks (Rosenthal et al., 2017; Wiegand, 2018; Basile et al., 2019; Mandl et al., 2019; Ogrodniczuk and Łukasz Kobyliński, 2019). Classification models for these tasks often rely on feature engineering and statistical methods such as naive Bayes (Saleem et al., 2016), logistic regression over subwords (Waseem and Hovy, 2016) or neural approaches including convolutional neural networks (Park and Fung, 2017) or, as in our case, the representations of large LMs (Yang et al., 2019).

Method: Emoji-Prediction
For our parameter transfer, we rely on a single transformer-based LM which is shared among different tasks. A sequence x ∈ X is featurized by reading it into the encoder of the LM and retrieving its last hidden state. A linear layer is then used as a predictive function f : X → Y to predict labels y ∈ Y. A task T = {Y, f(x)} is then a set of labels Y and the predictive function f over the instances in X.
We follow a transfer learning approach, where source task T_S is an emoji-based classification task, i.e. given a sequence, predict the emoji (class) that it originally contained. Target task T_T is a downstream task such as SA or HS (Section 4.1). Each task has its own set of instances X, labels Y and predictive function f, while the feature-generating LM stays the same. The error of predictor f is back-propagated to the LM, which allows us to transfer learned parameters from T_S to T_T.
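The transfer can be written compactly as follows. The notation is introduced here for illustration only and is an assumption: h_θ denotes the LM encoder with parameters θ, (W_S, b_S) and (W_T, b_T) are the linear layers of the source and target task, D_S is the ST training data, and the softmax output with a generic loss 𝓛 is one common choice rather than a detail stated above.

```latex
\begin{align*}
f_S(x) &= \operatorname{softmax}\big(W_S\, h_\theta(x) + b_S\big), \\
(\theta_S, W_S, b_S) &= \arg\min_{\theta,\, W,\, b} \sum_{(x, y) \in D_S} \mathcal{L}\big(f_S(x),\, y\big), \\
f_T(x) &= \operatorname{softmax}\big(W_T\, h_{\theta}(x) + b_T\big),
\qquad \theta \text{ initialised to } \theta_S,\quad W_T, b_T \text{ freshly initialised}.
\end{align*}
```

Only the encoder parameters θ_S carry over between tasks; the source-task layer (W_S, b_S) is not reused, since the label sets of T_S and T_T differ.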

Source Tasks (ST)
We focus on 5 different emoji-based STs that can be divided into two types: emoji prediction (EP) and emoji cluster prediction. To sample emojis for EP or create clusters, we rely on a large collection of user generated comments. EP is a multi-class prediction task over the 64 most common emojis identified in the collection of comments. Concretely, given a tweet with all emojis removed, the classifier has to predict which of the 64 emojis was originally contained within it.
The emoji cluster prediction tasks can be supervised (PMI-{Target,Swear}) or unsupervised (KMeans-{2,3}). In this case the task is simplified: Given a tweet with all emojis removed, predict the cluster to which the emoji originally contained in the tweet belonged.
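The construction of training instances for both task types can be sketched as below. This is a minimal illustration under stated assumptions, not the pipeline used in the experiments: the four-emoji inventory `EMOJIS`, the function names, and the choice to label a tweet by its first labelled emoji are all hypothetical, and each emoji is treated as a single code point.

```python
from collections import Counter

# Hypothetical toy emoji inventory; a real pipeline would need a full emoji list.
EMOJIS = {"😂", "😍", "😡", "👍"}


def top_emojis(tweets, k):
    """Return the k most frequent emojis in the collection (k = 64 for EP)."""
    counts = Counter(ch for tweet in tweets for ch in tweet if ch in EMOJIS)
    return [emoji for emoji, _ in counts.most_common(k)]


def make_examples(tweets, label_of):
    """Turn tweets into (text_without_emojis, label) pairs.

    label_of maps an emoji to its class: the emoji itself for EP,
    or a cluster name for emoji cluster prediction.
    Tweets that contain no labelled emoji are skipped.
    """
    examples = []
    for tweet in tweets:
        labels = [label_of[ch] for ch in tweet if ch in label_of]
        if not labels:
            continue
        # Remove every emoji, then collapse the leftover whitespace.
        text = " ".join("".join(ch for ch in tweet if ch not in EMOJIS).split())
        # Assumption: the first labelled emoji determines the target class.
        examples.append((text, labels[0]))
    return examples


tweets = ["so cute 😍", "lol 😂 😂", "ugh 😡", "nice 👍 day", "plain text"]
ep_examples = make_examples(tweets, {e: e for e in top_emojis(tweets, 4)})
clusters = {"😍": "positive", "😂": "positive", "👍": "positive", "😡": "negative"}
cluster_examples = make_examples(tweets, clusters)
```

The same function serves both task types because only the mapping from emoji to label changes, which mirrors the way cluster prediction simplifies EP.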
Unsupervised Clusters In order to account for the cultural differences in the use of emojis, we learn emoji clusters directly from the user generated data. We generate 50-dimensional vector representations over the tokens in the collection of user comments using the continuous bag of words (Mikolov et al., 2013) approach. We then perform k-means clustering with 6 target clusters on the representations of emojis that occurred ≥ 1000 times. These clusters are manually merged into 2 (positive/negative) and 3 (positive/negative/neutral) clusters to create the binary KMeans-2 and ternary KMeans-3 emoji cluster prediction STs respectively. The following comment is classified as positive according to the KMeans-{2,3} tasks, as it originally contained an emoji that belonged to the positive cluster:
So beautiful and great advice → positive
Supervised Clusters As an alternative to the completely unsupervised clusters, we exploit the mutual information between emojis and swear words as a type of distant supervision for HS tasks. We calculate the pointwise mutual information (PMI) between the emojis that appear in our collection of user content and the two groups of comments, those containing slurs and those not containing slurs. An emoji is placed in the slur cluster if its PMI with comments containing swear words is larger than its PMI with comments without them; otherwise it is placed in the neutral cluster. PMI-Swear is then a binary classification task based on the resulting slur/neutral emoji clusters.
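The slur/neutral assignment can be sketched as follows. This is a minimal illustration, assuming that probabilities are estimated from comment-level counts and that emojis and slurs appear as whitespace-separated tokens; the function name `pmi_swear_clusters` and the placeholder slur `badword` are hypothetical.

```python
import math
from collections import Counter


def pmi_swear_clusters(comments, slurs, emojis):
    """Assign each observed emoji to 'slur' or 'neutral' by comparing its PMI
    with slur-containing comments against its PMI with slur-free comments."""
    n = len(comments)
    class_counts = Counter()  # number of comments per class
    emoji_counts = Counter()  # number of comments containing each emoji
    joint_counts = Counter()  # number of comments per (emoji, class) pair
    for comment in comments:
        tokens = set(comment.lower().split())
        cls = "slur" if tokens & slurs else "neutral"
        class_counts[cls] += 1
        for emoji in emojis & tokens:
            emoji_counts[emoji] += 1
            joint_counts[emoji, cls] += 1

    def pmi(emoji, cls):
        joint = joint_counts[emoji, cls]
        if joint == 0:
            return float("-inf")  # the emoji never co-occurs with this class
        p_joint = joint / n
        p_emoji = emoji_counts[emoji] / n
        p_class = class_counts[cls] / n
        return math.log(p_joint / (p_emoji * p_class))

    return {
        emoji: "slur" if pmi(emoji, "slur") > pmi(emoji, "neutral") else "neutral"
        for emoji in emoji_counts
    }


comments = ["you badword 😡", "badword 😡", "nice day 😍", "love it 😍", "hmm 😡"]
clusters = pmi_swear_clusters(comments, slurs={"badword"}, emojis={"😡", "😍"})
```

In this toy collection the angry emoji co-occurs with the placeholder slur more often than chance would predict, so it lands in the slur cluster, while the heart-eyes emoji stays neutral.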
While the unsupervised emoji cluster prediction STs and PMI-Swear are source-oriented, i.e. learned on user generated content, we also explore target-oriented clusters that rely on the shared information between emojis and the labels in each of the TTs. Concretely, we calculate the PMI between the label of an instance in the respective TT training data and the emojis it contains. The emoji is placed into the cluster of the label to which its PMI value is largest. PMI-Target is the ST based on these target-oriented emoji clusters.
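For reference, the target-oriented assignment can be stated with the standard definition of PMI. The notation is an assumption made for illustration: e is an emoji, y a label from the TT label set Y_T, and the probabilities are taken to be relative frequencies over the TT training instances.

```latex
\begin{align*}
\mathrm{PMI}(e, y) &= \log \frac{P(e, y)}{P(e)\, P(y)} = \log \frac{P(y \mid e)}{P(y)}, \\
\mathrm{cluster}(e) &= \arg\max_{y \in Y_T} \mathrm{PMI}(e, y).
\end{align*}
```

Read this way, an emoji is assigned to the label it is most over-represented in, relative to that label's prior frequency in the training data.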

Target Tasks (TT)
Once the classifier has been fully trained on the ST, and thus has adapted the underlying LM's representations to fit the ST at hand, we discard it and train a new classifier on top of the enriched LM to predict the TT. We evaluate this transfer from the various STs on two main categories of TTs, namely Hate Speech Detection and Sentiment Analysis. Given a user generated comment, Hate Speech Detection is the task of classifying the comment as either hate or none. Note, however, that concrete label names (e.g. offense, hate, harmful) may differ across specific HS tasks.
While HS in our case is a binary classification task, Sentiment Analysis is a ternary classification task which takes as input a user generated comment and classifies it as either positive, neutral or negative. The following is an example from the Sentiment Analysis in Twitter (Rosenthal et al., 2017) task:
Finally starting the 5th season of Dexter. See ya later, weekend! → positive
Both HS and SA are sentiment-based tasks, e.g. hate towards a group of people or positive sentiment towards a product. We therefore take these two types of tasks to have the potential to benefit from the emotion information encoded in emojis. In the following sections we explore the conditions under which the transfer from an emoji-based ST to a sentiment-based TT is beneficial for the TT.

Experimental Setup
We describe the data used for the STs and TTs respectively (Section 4.1), followed by the specifications of the encoding LM (Section 4.2) and the emoji cluster creation (Section 4.3).

Data
Source Tasks We use a collection of tweets that has been collected from the Twitter stream between 2011 and 2019 as the corpus needed to sample emojis and create emoji clusters for the STs. We perform language identification using the polyglot library over the tweets to create a corpus for German, English, Spanish, Polish and Arabic (TW-{DE,EN,ES,PL,AR}) respectively. To automatically identify swear words for PMI-Swear, we use a German and a multilingual swear word collection, namely WoltLab and Hatebase. In total, we collected 785 slurs for German, and 1531, 140, 306, 79 for English, Spanish, Polish and Arabic respectively.
Target Tasks We work with 6 target tasks in total, 3 HS and 3 SA tasks, taking into account their emoji content, class (im)balance and language.
For German, we use GermEval 2018 (Wiegand, 2018) Task 1 (offense/other) (HS-DE) and SB10k (Cieliebak et al., 2017) (SA-DE). In Table 1, we report the label distribution, hate/none for HS and positive/negative/neutral for SA, across all TT training and test sets, as well as ST Twitter corpora sizes. For both ST and TT corpora, we also report the percentage as well as total number of tweets containing emojis.
Preprocessing All data sets undergo the same preprocessing.
Tweets are tokenized using the NLTK (Bird and Loper, 2004) TweetTokenizer and user mentions, retweets and punctuation are removed. Repeated characters are shortened. We use token frequencies to determine the standard orthography of a word (e.g. coooool → cool instead of col).
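The frequency-based shortening of repeated characters can be sketched as follows. This is one possible realisation under stated assumptions: a run counts as elongated when a character is repeated three or more times, each such run is reduced to either one or two characters, and the candidate with the highest corpus frequency wins. The function name `normalize` and the toy frequency table are hypothetical.

```python
import itertools
import re
from collections import Counter

# Assumption: a run is "elongated" if a character repeats three or more times.
RUN = re.compile(r"(.)\1{2,}")


def normalize(token, freqs):
    """Shorten elongated character runs and return the candidate spelling
    that is most frequent according to freqs (a Counter of token counts)."""
    runs = list(RUN.finditer(token))
    if not runs:
        return token
    candidates = []
    # Try every combination of reducing each run to one or two characters.
    for lengths in itertools.product((1, 2), repeat=len(runs)):
        pieces, last = [], 0
        for match, length in zip(runs, lengths):
            pieces.append(token[last:match.start()])
            pieces.append(match.group(1) * length)
            last = match.end()
        pieces.append(token[last:])
        candidates.append("".join(pieces))
    # Ties (e.g. all candidates unseen) fall back to the shortest spelling.
    return max(candidates, key=lambda cand: freqs[cand])


freqs = Counter({"cool": 50, "col": 2, "so": 100})
```

With this toy table, `normalize("coooool", freqs)` prefers `cool` over `col` because `cool` is the more frequent spelling, which is the behaviour described above.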

Model Specifications
For the monolingual (German) experiments, we use the German BERT (BERT-DE) and for multilingual experiments we use Bert-Base-Multilingual-Cased (BERT-M) as the LM to encode the tweets. We base our code on the simpletransformers sequence classification implementations of the above models. Each classification task is trained for a maximum of 10 epochs using early stopping over the validation accuracy with δ = 0.01 and patience 3. Training was performed on a single Titan-X GPU, which took between 1 and 6 hours depending on the data size. We evaluate the resulting classifiers using the Macro F1 measure.
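The stopping rule can be illustrated with the sketch below. It assumes the common reading of the two hyperparameters: an epoch counts as an improvement only if validation accuracy exceeds the best value so far by more than δ, and training stops after `patience` consecutive epochs without improvement. The function name is hypothetical, and a list of per-epoch validation accuracies stands in for actual training.

```python
def run_early_stopping(epoch_accuracies, max_epochs=10, delta=0.01, patience=3):
    """Simulate early stopping over a sequence of validation accuracies.

    Returns (best_epoch, best_accuracy, epochs_run), with epochs counted from 1.
    """
    best_acc = float("-inf")
    best_epoch = 0
    waited = 0        # consecutive epochs without sufficient improvement
    epochs_run = 0
    for epoch, acc in enumerate(epoch_accuracies, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        if acc - best_acc > delta:
            # Sufficient improvement: record it and reset the patience counter.
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_acc, epochs_run


# Accuracy plateaus after epoch 2, so training stops at epoch 5.
result = run_early_stopping([0.60, 0.65, 0.655, 0.652, 0.651, 0.70])
```

In this example the late jump to 0.70 at epoch 6 is never reached, which shows the trade-off a small patience value makes between training time and the chance of missing a late improvement.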

Clusters
We describe the creation of the emoji clusters used for the emoji cluster STs.

Results
We train each model over 10 seeded runs and report the averaged Macro F1 with standard error (Figure 2). For each TT, we train a baseline, which is the same pre-trained BERT-{DE,M} model that is now fine-tuned directly on the TT classification task at hand, without prior training on the ST. We compare these baselines with those models that have undergone a transfer from ST to TT. We use the term equivalent to signify that two models lie within each other's error bounds.
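The reporting scheme can be sketched as follows. The equivalence check implements one literal reading of the definition above, namely that each model's mean lies within the other's mean ± standard error; this reading, the function names, and the toy scores (five runs instead of ten, for brevity) are assumptions made for illustration.

```python
import math
import statistics


def mean_and_stderr(scores):
    """Return the mean and the standard error of the mean over seeded runs."""
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


def equivalent(scores_a, scores_b):
    """True if each mean lies within the other's mean +/- standard error."""
    mean_a, se_a = mean_and_stderr(scores_a)
    mean_b, se_b = mean_and_stderr(scores_b)
    return abs(mean_a - mean_b) <= min(se_a, se_b)


# Hypothetical Macro F1 scores, not results from the experiments.
baseline = [0.70, 0.72, 0.71, 0.69, 0.73]
transfer_same = [0.71, 0.73, 0.70, 0.72, 0.69]
transfer_better = [0.80, 0.82, 0.81, 0.79, 0.83]
```

Here `equivalent(baseline, transfer_same)` holds because both sets of runs share the same mean, whereas `transfer_better` lies well outside the baseline's error bounds.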

Condition 1: Emoji Content
We evaluate the effect that STs have on TTs with different amounts of emoji content. We focus on the TTs with the lowest and highest amount of emoji content, namely SA-EN (1.9% emoji content) and SA-AR (22.5%). This is the multilingual case. For the monolingual case, we evaluate the effect on SA-DE (2%) and HS-DE (7.2%). All of these TTs are unbalanced, i.e. the minority class makes up 15.2-32.2% of the training data.
The monolingual, low emoji content SA-DE task does not profit from the transfer. Rather, the training on most STs leads to a slight drop in F1-Macro compared to the baseline (F1 0.600). On the other hand, high emoji content HS-DE greatly benefits from the transfer, with PMI-Swear (F1 0.730) being especially beneficial for the performance on the TT, yielding a gain of F1 +0.280 over the baseline. This shows that the shared information in emojis and slurs is relevant to the HS task at hand. Also beneficial are EP (F1 0.705), and the unsupervised KMeans-3 (F1 0.690) and KMeans-2 (F1 0.629) cluster prediction tasks. Only the supervised PMI-Target (F1 0.405) does not seem to be beneficial for the performance on the TT, leading to a drop in performance, which is due to the unbalanced nature of the TT (Section 5.2).
The multilingual case shows a slightly mixed trend. Low emoji content SA-EN does not benefit from the transfer, but unlike in the monolingual setting, it is not harmed by it either. All STs lead to a TT performance that is equivalent to the baseline (F1 0.578). High emoji content SA-AR only barely profits from the transfer, with EP (F1 0.509) leading to a small gain of F1 (+0.034) over the baseline (F1 0.475), while all other STs lead to an equivalent performance to the baseline. The overall trend is similar to the monolingual case but the positive and negative effects are dampened, which may be due to the multilingual aspect (Section 5.3).
The general trend shows that a decent amount of emoji content in the TT training data is crucial for the transfer to be beneficial.

Condition 2: Label Distribution
To analyze the effect that the STs have on differently (un)balanced TTs, we focus on HS-PL (the minority class makes up 8.5% of training data) and HS-ES (41.3%), as they are the two most (un)balanced TTs, while being comparable in terms of emoji content (13.7% and 14.5% respectively).
PMI-Target performs poorly on unbalanced HS-PL (and HS-DE etc.) due to its use of mutual information between emojis and the TT labels. This leads to it reproducing the class imbalance, making it less effective on unbalanced TTs.
PMI-Swear yields a gain on HS-ES (and HS-DE) but none on HS-PL, a difference that can be explained by the composition of the ST dataset. TW-PL is the smallest corpus in the multilingual collection of user comments, and this sparsity is further driven by the morphological complexity of Polish, such that the 306 slurs from the Polish slur list only resulted in 65k Polish training samples in PMI-Swear, as opposed to 1.8M and 3M for German and Spanish respectively.
Overall, if the label distribution in TT is balanced, the TT easily benefits from the transfer. Otherwise other conditions such as the multilinguality or emoji content become more relevant.

Condition 3: Multilinguality
We analyze the effectiveness of the transfer in a monolingual and multilingual setting. For this, we focus on the effect that the monolingually and multilingually learned STs have on HS-DE and SA-DE. Both TTs are unbalanced, while HS-DE has a high emoji content and SA-DE has a low emoji content.
The different effects of the emoji content in HS-DE and SA-DE have been discussed in Section 5.1, showing that in the monolingual setting, high emoji content HS-DE benefits from the transfer, while low emoji content SA-DE does not. In the multilingual case, we see a similar, but dampened, trend. SA-DE does not benefit from the transfer, with all STs leading to a performance equivalent to the baseline (F1 0.566), except KMeans-2 (F1 0.439), which is below the baseline. The STs have a similar effect on HS-DE, being equivalent to or below the baseline (F1 0.663). Only PMI-Swear (F1 0.678) is beneficial for the TT performance.
The effect of ST-oriented clusters KMeans-{2,3} was beneficial in the monolingual case (HS-DE), but this benefit is lost in the multilingual setting. This underlines our original idea that ST-oriented unsupervised emoji clusters learned on large amounts of user generated text have the advantage of accounting for cultural differences in the usage of emojis. When learned multilingually, this advantage is lost. An example of the culturally diverse use of emojis is the recycling emoji (♻), which is rather infrequent in Europe and might be used to point towards the importance of recycling. In TW-AR, this emoji is among the top 5 most frequent emojis, and is used to motivate other users to share their content.
The overall trend thus shows that monolingually learned STs are more beneficial than multilingual STs. However, if the training data of a TT is balanced, this effect is less pronounced.

Comparison to Benchmark Results
To put the results into a broader perspective, we compare to state-of-the-art (SOTA) models for each of the shared tasks/datasets that our TTs are based on (Table 2). For two of the Hate Speech benchmarks, the performance of our transfer approach is close to the SOTA, namely with a difference of F1 −0.038 (HS-DE) and F1 −0.03 (HS-ES). For HS-PL, we were able to achieve a gain of +0.031 over the SOTA. Across all three Sentiment Analysis benchmarks, our models are below the SOTA. This indicates that SA, in general, is a more difficult task for our transfer approach than HS, possibly due to its ternary, rather than binary, classification objective. This is another factor causing the transfer to be overall more beneficial for HS than for SA, next to the unbalanced (SA-{EN,AR}) and low emoji content (SA-DE) nature of the SA tasks.

Summary
We have evaluated and identified conditions under which the transfer from an emoji-based ST is beneficial for a sentiment TT. In the experiments in Section 5 we observed three major trends, namely i) TTs with high amounts of emoji content benefit more from the transfer, ii) PMI-Target tends to be detrimental to unbalanced TTs and iii) monolingually learned STs tend to perform better than their multilingual counterparts, due to their improved representation of culturally unique emoji usages. The latter underlines the importance of taking into account cultural differences when exploiting the information encoded in emojis. From these results, we can draw conclusions about the conditions under which a given emoji-based ST is beneficial. Due to the shared information between emojis and slurs, PMI-Swear is beneficial to HS tasks when the amount of training data that can be generated from the swear word list is sufficiently large. PMI-Target is beneficial when the TT is balanced; otherwise it replicates the already existing class imbalance. Unsupervised KMeans-{2,3} should be learned monolingually to be beneficial, and EP is a safe choice for TTs with high emoji content.