AUEB: Two Stage Sentiment Analysis of Social Network Messages

This paper describes the system submitted for the Sentiment Analysis in Twitter Task of S EM E VAL 2014 and speciﬁcally the Message Polarity Classiﬁcation sub-task. We used a 2–stage pipeline approach employing a linear SVM classiﬁer at each stage and several features including morphological features, POS tags based features and lexicon based features


Introduction
Recently, Twitter has gained significant popularity among the social network services. Lots of users often use Twitter to express feelings or opinions about a variety of subjects. Analysing this kind of content can lead to useful information for fields, such as personalized marketing or social profiling. However such a task is not trivial, because the language used in Twitter is often informal presenting new challenges to text analysis.
In this paper we focus on sentiment analysis, the field of study that analyzes people's sentiment and opinions from written language (Liu, 2012). Given some text (e.g., tweet), sentiment analysis systems return a sentiment label, which most often is positive, negative, or neutral. This classification can be performed directly or in two stages; in the first stage the system examines whether the text carries sentiment and in the second stage, the system decides for the sentiment's polarity (i.e., positive or negative). 1 This decomposition is based on the assumption that subjectivity detection and sentiment polarity detection are different problems. This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/ 1 For instance a 2-stage approach is better suited to systems that focus on subjectivity detection; e.g., aspect based sentiment analysis systems which extract aspect terms only from evaluative texts.
We choose to follow the 2-stage approach, because it allows us to focus on each of the two problems separately (e.g., features, tuning, etc.). In the following we will describe the system with which we participated in the Message Polarity Classification subtask of Sentiment Analysis in Twitter (Task 9) of SEMEVAL 2014 (Rosenthal et al., 2014). Specifically Section 2 describes the data provided by the organizers of the task. Sections 3 and 4 present our system and its performance respectively. Finally, Section 5 concludes and provides hints for future work.

Data
At first, we describe the data used for this year's task. For system tuning the organizers released the training and development data of SEMEVAL 2013 Task 2 (Wilson et al., 2013). Both these sets are allowed to be used for training. The organizers also provided the test data of the same Task to be used for development only. As argued in (Malakasiotis et al., 2013) these data suffer from class imbalance. Concerning the test data, they contained 8987 messages broken down in the following 5 datasets: -LJ 14 : 2000 sentences from LIVEJOURNAL.
-SMS 13 : SMS test data from last year.
-TW 13 : Twitter test data from last year.
The details of the test data were made available to the participants only after the end of the Task. Recall that SMS 13 and TW 13 were also provided as development data. In this way the organizers were able to check, i) the progress of the systems since last year's task, and ii) the generalization capability of the participating systems.

System Overview
The main objective of our system is to detect whether a message M expresses positive, negative or no sentiment. To achieve that we follow a 2stage approach. During the first stage we detect whether M expresses sentiment ("subjective") or not; this process is called subjectivity detection. In the second stage we classify the "subjective" messages of the first stage as "positive" or "negative". Both stages utilize a Support Vector Machine (SVM (Vapnik, 1998)) classifier with linear kernel. 2 Similar approaches have also been proposed in (Pang and Lee, 2004;Wilson et al., 2005;Barbosa and Feng, 2010;Malakasiotis et al., 2013). Finally, we note that the 2-stage approach, in datasets such the one here (Malakasiotis et al., 2013), alleviates the class imbalance problem.

Data preprocessing
A very essential part of our system is data preprocessing. At first, each message M is passed through a twitter specific tokenizer and part-ofspeech (POS) tagger (Owoputi et al., 2013) to obtain the tokens and the corresponding POS tags, which are necessary for some sets of features. 3 We then use a dictionary to replace any slang with the actual text. 4 We also normalize the text of each message by combining a trie data structure (De La Briandais, 1959) with an English dictionary. 5 In more detail, we replace every token of M not in the dictionary with the most similar word of the dictionary. Finally, we obtain POS tags of all the new tokens.

Sentiment lexicons
Another key attribute of our system is the use of sentiment lexicons. We have used the following: -HL (Hu and Liu, 2004).
-The three lexicons created from the training data in (Malakasiotis et al., 2013).
Note that concerning the MPQA Lexicon we applied preprocessing similar to Malakasiotis et al. (2013) to obtain the following sub-lexicons: W 0 : Contains weak subjective expressions with neutral prior polarity.

Feature engineering
Our system employs several types of features based on morphological attributes of the messages, POS tags, and lexicons of section 3.2. 6
-The number of elongated tokens.
-The existence of date references.
-The existence of time references. 6 All the features are normalized to [−1, 1] -The number of tokens that contain only upper case letters.
-The number of tokens that contain both upper and lower case letters.
-The number of tokens that start with an upper case letter.
-The number of exclamation marks.
-The number of question marks.
-The sum of exclamation and question marks.
-The number of tokens containing only exclamation marks.
-The number of tokens containing only question marks.
-The number of tokens containing only exclamation or question marks.
-The existence of a subjective (i.e., positive or negative) emoticon at the message's end.
-The existence of an ellipsis and a link at the message's end.
-The existence of an exclamation mark at the message's end.
-The existence of a question mark at the message's end.
-The existence of a question or an exclamation mark at the message's end.
-The existence of slang. For a bigram b and a class c, F 1 is calculated as:

POS based features
where:

Sentiment lexicon based features
For each lexicon we use seven different features based on the scores provided by the lexicon for each word present in the message. 12 -Sum of scores.
-Maximum of scores.
-Minimum of scores.
-Average of scores.
-The count of words with scores.
-The score of the last word of the message that appears in the lexicon.
-The score of the last word of the message.
We also created features based on the P re and F 1 scores of MPQA and the train data generated lexicons in a similar manner to that described in (Malakasiotis et al., 2013), with the difference that the features are stage dependent. Thus, for subjectivity detection we use the subjective and neutral classes and for polarity detection we use the positive and negative classes to compute the scores.

Miscellaneous features
Negation. Negation not only is a good subjectivity indicator but it also may change the polarity of a message. We therefore add 7 more features, one indicating the existence of negation, and the remaining six indicating the existence of negation that precedes words from lexicons S ± , S + , S − , W ± , W + and W − . 13 Each feature is used in the appropriate stage. 14 We have not implement this type of feature for other lexicons but it might be a good addition to the system.
Carnegie Mellon University's Twitter clusters. Owoputi et al. (2013) released a dataset of 938 clusters containing words coming from tweets. Words of the same clusters share similar attributes. We try to exploit this observation by adding 938 features, each of which indicates if a message's token appears or not in the corresponding attributes.

Feature Selection
To allow our model to better scale on unseen data we have performed feature selection. More specifically, we first merged training and development data of SEMEVAL 2013 Task 2. Then, we ranked the features with respect to their information gain (Quinlan, 1986) on this dataset. To obtain the best set of features we started with a set containing the top 50 features and we kept adding batches of 50 features until we have added all of them. At each step we evaluated the corresponding feature set on the TW 13 and SMS 13 datasets and chose the feature set with the best performance. This resulted in a system which used the top 900 features for Stage 1 and the top 1150 features for Stage 2. 13 We use a list of words with negation. We assume that a token precedes a word if it is in a distance of at most 5 tokens.
14 The features concerning S± and W± are used in subjectivity detection and the remaining four in polarity detection.

Experimental Results
The official measure of the Task is the average F 1 score of the positive and negative classes (F 1 (±)). Table 1 illustrates the F 1 (±) score per evaluation dataset achieved by our system along with the median and best F 1 (±). In the same table AVG all corresponds to the average F 1 (±) across the five datasets while AVG 14 corresponds to the average F 1 (±) across LJ 14 , TW 14 and TWSARC 14 . We observe that in all cases our results are above the median. Table 2 illustrates the ranking of our system according to F 1 (±). Our system ranked 6th according to AVG all and 5th according to AVG 14 among the 50 participating systems. Note that our best results were achieved on the new test sets (LJ 14 , TW 14 , TWSARC 14 ) meaning that our system has a good generalization ability.

Conclusion and future work
In this paper we presented our approach for the Message Polarity Classification subtask of the Sentiment Analysis in Twitter Task of SEMEVAL 2014. We proposed a 2-stage pipeline approach, which first detects sentiment and then decides about its polarity. The results indicate that our system handles well the class imbalance problem and has a good generalization ability. A possible explanation is that we do not use bag-of-words fea-tures which often suffer from over-fitting. Nevertheless, there is still some room for improvement. A promising direction would be to improve the 1st stage (subjectivity detection) either by adding more data or by adding more features, mostly because the performance of stage 1 greatly affects that of stage 2. Finally, the addition of more data for the negative class on stage 2 might be a good improvement because it would further reduce the class imbalance of the training data for this stage.