SemEval-2014 Task 9: Sentiment Analysis in Twitter

We describe the Sentiment Analysis in Twitter task, ran as part of SemEval-2014. It is a continuation of the last year's task that ran successfully as part of SemEval-2013. As in 2013, this was the most popular SemEval task; a total of 46 teams contributed 27 submissions for subtask A (21 teams) and 50 submissions for subtask B (44 teams). This year, we introduced three new test sets: (i) regular tweets, (ii) sarcastic tweets, and (iii) LiveJournal sentences. We further tested on (iv) 2013 tweets, and (v) 2013 SMS messages. The highest F1-score on (i) was achieved by NRC-Canada at 86.63 for subtask A and by TeamX at 70.96 for subtask B.


Introduction
In the past decade, new forms of communication have emerged and have become ubiquitous through social media. Microblogs (e.g., Twitter), Weblogs (e.g., LiveJournal) and cell phone messages (SMS) are often used to share opinions and sentiments about the surrounding world, and the availability of social content generated on sites such as Twitter creates new opportunities to automatically study public opinion.
Working with these informal text genres presents new challenges for natural language processing beyond those encountered when working with more traditional text genres such as newswire. The language in social media is very informal, with creative spelling and punctuation, misspellings, slang, new words, URLs, and genrespecific terminology and abbreviations, e.g., RT for re-tweet and #hashtags 1 . This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/ 1 Hashtags are a type of tagging for Twitter messages.
Moreover, tweets and SMS messages are short: a sentence or a headline rather than a document.
Several corpora with detailed opinion and sentiment annotation have been made freely available, e.g., the MPQA newswire corpus (Wiebe et al., 2005), the movie reviews corpus (Pang et al., 2002), or the restaurant and laptop reviews corpora that are part of this year's SemEval Task 4 (Pontiki et al., 2014). These corpora have proved very valuable as resources for learning about the language of sentiment in general, but they do not focus on tweets. While some Twitter sentiment datasets were created prior to SemEval-2013, they were either small and proprietary, such as the isieve corpus (Kouloumpis et al., 2011) or focused solely on message-level sentiment.
Thus, the primary goal of our SemEval task is to promote research that will lead to better understanding of how sentiment is conveyed in Social Media. Toward that goal, we created the Se-mEval Tweet corpus as part of our inaugural Sentiment Analysis in Twitter Task, SemEval-2013 Task 2 (Nakov et al., 2013). It contains tweets and SMS messages with sentiment expressions annotated with contextual phrase-level and messagelevel polarity. This year, we extended the corpus by adding new tweets and LiveJournal sentences.
Another interesting phenomenon that has been studied in Twitter is the use of the #sarcasm hashtag to indicate that a tweet should not be taken literally (González-Ibáñez et al., 2011;Liebrecht et al., 2013). In fact, sarcasm indicates that the message polarity should be flipped. With this in mind, this year, we also evaluate on sarcastic tweets.
In the remainder of this paper, we first describe the task, the dataset creation process and the evaluation methodology. We then summarize the characteristics of the approaches taken by the participating systems, and we discuss their scores.

Task Description
As SemEval-2013 Task 2, we included two subtasks: an expression-level subtask and a messagelevel subtask. Participants could choose to participate in either or both. Below we provide short descriptions of the objectives of these two subtasks.

Subtask A: Contextual Polarity Disambiguation
Given a message containing a marked instance of a word or a phrase, determine whether that instance is positive, negative or neutral in that context. The instance boundaries were provided: this was a classification task, not an entity recognition task.

Subtask B: Message Polarity Classification
Given a message, decide whether it is of positive, negative, or neutral sentiment. For messages conveying both positive and negative sentiment, the stronger one is to be chosen.
Each participating team was allowed to submit results for two different systems per subtask: one constrained, and one unconstrained. A constrained system could only use the provided data for training, but it could also use other resources such as lexicons obtained elsewhere. An unconstrained system could use any additional data as part of the training process; this could be done in a supervised, semi-supervised, or unsupervised fashion.
Note that constrained/unconstrained refers to the data used to train a classifier. For example, if other data (excluding the test data) was used to develop a sentiment lexicon, and the lexicon was used to generate features, the system would still be constrained. However, if other data (excluding the test data) was used to develop a sentiment lexicon, and this lexicon was used to automatically label additional Tweet/SMS messages and then used with the original data to train the classifier, then such a system would be considered unconstrained.

Datasets
In this section, we describe the process of collecting and annotating the 2014 testing tweets, including the sarcastic ones, and LiveJournal sentences.

Annotation
We annotated the new tweets as in 2013: by identifying tweets from popular topics that contain sentiment-bearing words by using SentiWordNet (Baccianella et al., 2010) as a filter. We altered the annotation task for the sarcastic tweets, displaying them to the Mechanical Turk annotators without the #sarcasm hashtag; the Turkers had to determine whether the tweet is sarcastic on their own. Moreover, we asked Turkers to indicate the degree of sarcasm as (a) definitely sarcastic, (b) probably sarcastic, and (c) not sarcastic.
As in 2013, we combined the annotations using intersection, where a word had to appear in 2/3 of the annotations to be accepted. An annotated example from each source is shown in Table 3.  Table 3: Example of polarity for each source of messages. The target phrases are marked in [. . .], and are followed by their polarity; the sentence-level polarity is shown in the last column.

Tweets Delivery
We did not deliver the annotated tweets to the participants directly; instead, we released annotation indexes, a list of corresponding Twitter IDs, and a download script that extracts the corresponding tweets via the Twitter API. 2 We provided the tweets in this manner in order to ensure that Twitter's terms of service are not violated. Unfortunately, due to this restriction, the task participants had access to different number of training tweets depending on when they did the downloading. This varied between a minimum of 5,215 tweets and the full set of 10,882 tweets. On average the teams were able to collect close to 9,000 tweets; for teams that did not participate in 2013, this was about 8,500. The difference in training data size did not seem to have had a major impact. In fact, the top two teams in subtask B (coooolll and TeamX) trained on less than 8,500 tweets.

Scoring
The participating systems were required to perform a three-way classification for both subtasks. A particular marked phrase (for subtask A) or an entire message (for subtask B) was to be classified as positive, negative or objective/neutral. We scored the systems by computing a score for predicting positive/negative phrases/messages. For instance, to compute positive precision, p pos , we find the number of phrases/messages that a system correctly predicted to be positive, and we divide that number by the total number it predicted to be positive. To compute positive recall, r pos , we find the number of phrases/messages correctly predicted to be positive and we divide that number by the total number of positives in the gold standard. We then calculate F1-score for the positive class as follows F pos = 2(ppos+rpos) ppos * rpos . We carry out a similar computation for F neg , for the negative phrases/messages. The overall score is then F = (F pos + F neg )/2.
We used the two test sets from 2013 and the three from 2014, which we combined into one test set and we shuffled to make it hard to guess which set a sentence came from. This guaranteed that participants would submit predictions for all five test sets. It also allowed us to test how well systems trained on standard tweets generalize to sarcastic tweets and to LiveJournal sentences, without the participants putting extra efforts into this. The participants were also not informed about the source the extra test sets come from.
We provided the participants with a scorer that outputs the overall score F and a confusion matrix for each of the five test sets.

Participants and Results
The results are shown in Tables 4 and 5, and the team affiliations are shown in Table 6. Tables 4 and 5 contain results on the two progress test sets (tweets and SMS messages), which are the official test sets from the 2013 edition of the task, and on the three new official 2014 testsets (tweets, tweets with sarcasm, and LiveJournal). The tables further show macro-and micro-averaged results over the 2014 datasets. There is an index for each result showing the relative rank of that result within the respective column. The participating systems are ranked by their score on the Twitter-2014 testset, which is the official ranking for the task; all remaining rankings are secondary.
As we mentioned above, the participants were not told that the 2013 test sets would be included in the big 2014 test set, so that they do not overtune their systems on them. However, the 2013 test sets were made available for development, but it was explicitly forbidden to use them for training. Still, some participants did not notice this restriction, which resulted in their unusually high scores on Twitter2013-test; we did our best to identify all such cases, and we asked the authors to submit corrected runs. The tables mark such resubmissions accordingly.
Most of the submissions were constrained, with just a few unconstrained: 7 out of 27 for subtask A, and 8 out of 50 for subtask B. In any case, the best systems were constrained. Some teams participated with both a constrained and an unconstrained system, but the unconstrained system was not always better than the constrained one: sometimes it was worse, sometimes it performed the same. Thus, we decided to produce a single ranking, including both constrained and unconstrained systems, where we mark the latter accordingly. Table 4 shows the results for subtask A, which attracted 27 submissions from 21 teams. There were seven unconstrained submissions: five teams submitted both a constrained and an unconstrained run, and two teams submitted an unconstrained run only. The best systems were constrained. All participating systems outperformed the majority class baseline by a sizable margin.

Subtask B
The results for subtask B are shown in Table 5. The subtask attracted 50 submissions from 44 teams. There were eight unconstrained submissions: six teams submitted both a constrained and an unconstrained run, and two teams submitted an unconstrained run only. As for subtask A, the best systems were constrained. Again, all participating systems outperformed the majority class baseline; however, some systems were very close to it.

Discussion
Overall, we observed similar trends as in SemEval-2013 Task 2. Almost all systems used supervised learning. Most systems were constrained, including the best ones in all categories. As in 2013, we observed several cases of a team submitting a constrained and an unconstrained run and the constrained run performing better.
It is unclear why unconstrained systems did not outperform constrained ones. It could be because participants did not use enough external data or because the data they used was too different from Twitter or from our annotation method. Or it could be due to our definition of unconstrained, which labels as unconstrained systems that use additional tweets directly, but considers unconstrained those that use additional tweets to build sentiment lexicons and then use these lexicons.
As in 2013, the most popular classifiers were SVM, MaxEnt, and Naive Bayes. Moreover, two submissions used deep learning, coooolll (Harbin Institute of Technology) and ThinkPositive (IBM Research, Brazil), which were ranked second and tenth on subtask B, respectively.
The features used were quite varied, including word-based (e.g., word and character ngrams, word shapes, and lemmata), syntactic, and Twitter-specific such as emoticons and abbreviations. The participants still relied heavily on lexicons of opinion words, the most popular ones being the same as in 2013: MPQA, SentiWord-Net and Bing Liu's opinion lexicon. Popular this year was also the NRC lexicon (Mohammad et al., 2013), created by the best-performing team in 2013, which is top-performing this year as well.
Preprocessing of tweets was still a popular technique. In addition to standard NLP steps such as tokenization, stemming, lemmatization, stopword removal and POS tagging, most teams applied some kind of Twitter-specific processing such as substitution/removal of URLs, substitution of emoticons, word normalization, abbreviation lookup, and punctuation removal. Finally, several of the teams used Twitter-tuned NLP tools such as part of speech and named entity taggers (Gimpel et al., 2011;Ritter et al., 2011).
The similarity of preprocessing techniques, NLP tools, classifiers and features used in 2013 and this year is probably partially due to many teams participating in both years. As Table 6 shows, 18 out of the 46 teams are returning teams.
Comparing the results on the progress Twitter test in 2013 and 2014, we can see that NRC-Canada, the 2013 winner for subtask A, have now improved their F1 score from 88.93 to 90.14, which is the 2014 best score. The best score on the Progress SMS in 2014 of 89.31 belongs to ECNU; this is a big jump compared to their 2013 score of 76.69, but it is less compared to the 2013 best of 88.37 achieved by GU-MLT-LT. For subtask B, on the Twitter progress testset, the 2013 winner NRC-Canada improves their 2013 result from 69.02 to 70.75, which is the second best in 2014; the winner in 2014, TeamX, achieves 72.12. On the SMS progress test, the 2013 winner NRC-Canada improves its F1 score from 68.46 to 70.28. Overall, we see consistent improvements on the progress testset for both subtasks: 0-1 and 2-3 points absolute for subtasks A and B, respectively.  Table 4: Results for subtask A. The * indicates system resubmissions (because they initially trained on Twitter2013-test), and the indicates a system that includes a task co-organizer as a team member. The systems are sorted by their score on the Twitter2014 test dataset; the rankings on the individual datasets are indicated with a subscript. The last two columns show macro-and micro-averaged results across the three 2014 test datasets.
Finally, note that for both subtasks, the best systems on the Twitter-2014 dataset are those that performed best on the 2013 progress Twitter dataset: NRC-Canada for subtask A, and TeamX (Fuji Xerox Co., Ltd.) for subtask B.
It is interesting to note that the best results for Twitter2014-test are lower than those for Twitter2013-test for both subtask A (86.63 vs. 90.14) and subtask B (70.96 vs 72.12). This is so despite the baselines for Twitter2014-test being higher than those for Twitter2013-test: 42.2 vs. 38.1 for subtask A, and 34.6 vs. 29.2 for subtask B. Most likely, having access to Twitter2013-test at development time, teams have overfitted on it. It could be also the case that some of the sentiment dictionaries that were built in 2013 have become somewhat outdated by 2014.
Finally, note that while some teams such as NRC-Canada performed well across all test sets, other such as TeamX, which used a weighting scheme tuned specifically for class imbalances in tweets, were only strong on Twitter datasets.

Conclusion
We have described the data, the experimental setup and the results for SemEval-2014 Task 9. As in 2013, our task was the most popular one at SemEval-2014, attracting 46 participating teams: 21 in subtask A (27 submissions) and 44 in subtask B (50 submissions).
We introduced three new test sets for 2014: an in-domain Twitter dataset, an out-of-domain Live-Journal test set, and a dataset of tweets containing sarcastic content. While the performance on the LiveJournal test set was mostly comparable to the in-domain Twitter test set, for most teams there was a sharp drop in performance for sarcastic tweets, highlighting better handling of sarcastic language as one important direction for future work in Twitter sentiment analysis.
We plan to run the task again in 2015 with the inclusion of a new sub-evaluation on detecting sarcasm with the goal of stimulating research in this area; we further plan to add one more test domain.  Table 5: Results for subtask B. The * indicates system resubmissions (because they initially trained on Twitter2013-test), and the indicates a system that includes a task co-organizer as a team member. The systems are sorted by their score on the Twitter2014 test dataset; the rankings on the individual datasets are indicated with a subscript. The last two columns show macro-and micro-averaged results across the three 2014 test datasets.
In the 2015 edition of the task, we might also remove the constrained/unconstrained distinction.
Finally, as there are multiple opinions about a topic in Twitter, we would like to focus on detecting the sentiment trend towards a topic.