TeamX: A Sentiment Analyzer with Enhanced Lexicon Mapping and Weighting Scheme for Unbalanced Data

This paper describes the system used by TeamX in SemEval-2014 Task 9 Subtask B. The system is a sentiment analyzer based on a supervised text categorization approach designed around the following two concepts. First, since lexicon features were shown to be effective in SemEval-2013 Task 2, various lexicons and pre-processors for them are introduced to enhance lexical information. Second, since the distribution of sentiment in tweets is known to be unbalanced, a weighting scheme is introduced to bias the output of a machine learner. For the test run, the system was tuned towards Twitter texts and achieved high scores on Twitter data: average F1 70.96 on Twitter2014 and average F1 56.50 on Twitter2014Sarcasm.


Introduction
The growth of social media has brought a rising interest in natural language technologies that work with informal texts. Sentiment analysis is one such technology, and several workshops such as SemEval-2013 Task 2 (Nakov et al., 2013), CLEF 2013 RepLab 2013 (Amigó et al., 2013), and TASS 2013 (Villena-Román and García-Morera, 2013) have recently targeted tweets or cell phone messages as analysis text. This paper describes a system that submitted a sentiment analysis result to Subtask B of SemEval-2014 Task 9 (Rosenthal et al., 2014). SemEval-2014 Task 9 is a rerun of SemEval-2013 Task 2 with different test data, and Subtask B is a message polarity classification task.
This work is licenced under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
The system we prepared is a sentiment analyzer based on a supervised text categorization approach. Various features and their extraction methods are integrated in the system following the works presented in SemEval-2013 Task 2. In addition to these features, we added the following notable functionalities to the system:
1. Processes to enhance word-to-lemma mapping.
(a) A spelling corrector to normalize out-of-vocabulary words.
(b) Two Part-of-Speech (POS) taggers to realize word-to-lemma mapping from two perspectives.
(c) A word sense disambiguator to obtain word senses and their confidence scores.
2. A weighting scheme to bias the output of a machine learner.
Functionalities 1a to 1c are introduced to enhance information based on lexical knowledge, since features based on lexicons were shown to be effective in SemEval-2013 Task 2 (Mohammad et al., 2013). Functionality 2 is introduced to make the system adjustable to the polarity imbalance known to exist in Twitter data (Nakov et al., 2013). The remaining sections of this paper are organized as follows. Section 2 describes resources such as labeled texts and lexicons used in our system. Section 3 explains the details of the system. Section 4 discusses the submission test run and some extra test runs that we performed after the test data release. Finally, Section 5 concludes the paper.

Sentiment Labeled Data
The system is a constrained system; therefore, only the sentiment labeled data distributed by the task organizers were used. However, because some tweets were no longer accessible, only a subset of the training, development, and development-test data was used. Table 1 shows the number of messages for each type.

Text Normalizer
The text normalizer performs the following three rule-based normalizations of an input text:
• Unicode normalization in form NFKC.
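As an illustration of the NFKC step only, the following minimal sketch uses Python's standard `unicodedata` module. The function name `normalize_nfkc` and the sample string are assumptions made for this example; the system's own implementation is not described in the text.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization to a string.

    NFKC replaces compatibility characters (full-width letters,
    ligatures, circled digits, ...) with their canonical equivalents.
    """
    return unicodedata.normalize("NFKC", text)


# Illustrative input: full-width letters, an "fi" ligature, a circled digit.
sample = "Ｇｏｏｄ ﬁlm ①"
print(normalize_nfkc(sample))  # prints "Good film 1"
```

Normalizing in this way lets variant spellings of the same word match a single lexicon entry or ngram feature.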

Spelling Corrector
A spelling corrector is included in the system to normalize misspellings. We used Jazzy, an open source spell checker, with the US English dictionaries provided along with it. Jazzy combines the DoubleMetaphone phonetic matching algorithm and a near-miss match algorithm based on Levenshtein distance to correct a misspelled word.
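The Levenshtein part of such a corrector can be sketched as below. This is not Jazzy's code: the phonetic matching step is omitted, and the function names, the toy dictionary, and the distance threshold of 2 are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(word: str, dictionary: set, max_distance: int = 2) -> str:
    """Return the closest dictionary word, or the input if none is close enough."""
    if word in dictionary or not dictionary:
        return word
    distance, best = min((levenshtein(word, cand), cand) for cand in dictionary)
    return best if distance <= max_distance else word


# Hypothetical toy dictionary, not the one shipped with Jazzy.
toy_dictionary = {"tomorrow", "today", "tonight"}
print(suggest("tommorow", toy_dictionary))  # prints "tomorrow"
```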

POS Taggers
The system includes two POS taggers to realize word-to-lemma mapping in two perspectives.
Stanford POS Tagger The Stanford Log-linear Part-of-Speech Tagger (Toutanova et al., 2003) is one POS tagger, which is used to map words to lemmas of 'FORMAL' criterion lexicons and to extract word sense features. A finite-state transducer based lemmatizer (Minnen et al., 2001) included in the POS tagger is used to obtain lemmas of tokenized words.
CMU ARK POS Tagger A POS tagger for tweets by the CMU ARK group (Owoputi et al., 2013) is the other POS tagger, which is used to map words to lemmas of 'INFORMAL' criterion lexicons and to extract ngram features and a cluster feature.

Word Sense Disambiguator
A word sense disambiguator is included in the system to determine the sense of a word. We used UKB, which implements graph-based word sense disambiguation based on the Personalized PageRank algorithm (Agirre and Soroa, 2009) over a lexical knowledge base. As the lexical knowledge base, WordNet 3.0 (Fellbaum, 1998) included in the UKB package is used.
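The idea behind Personalized PageRank can be sketched with a plain power iteration in which the teleport mass is restricted to the senses of context words. The code below is a simplified illustration, not UKB; the sense identifiers, the toy graph, and the damping factor of 0.85 are assumptions, and every node is assumed to have at least one neighbor.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with teleportation restricted to `seeds`.

    `graph` maps each node to a non-empty list of neighbors.
    """
    nodes = list(graph)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Every node first receives its share of the teleport mass.
        updated = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            neighbors = graph[node]
            # Each node spreads its damped rank evenly over its neighbors.
            for neighbor in neighbors:
                updated[neighbor] += damping * rank[node] / len(neighbors)
        rank = updated
    return rank


# Hypothetical sense graph: two senses of "bank" and three context senses.
toy_graph = {
    "bank#finance": ["money#1", "deposit#1"],
    "bank#river": ["water#1"],
    "money#1": ["bank#finance", "deposit#1"],
    "deposit#1": ["bank#finance", "money#1"],
    "water#1": ["bank#river"],
}
scores = personalized_pagerank(toy_graph, seeds={"money#1", "deposit#1"})
best_sense = max(["bank#finance", "bank#river"], key=scores.get)
print(best_sense)  # prints "bank#finance"
```

The rank assigned to each candidate sense can then serve as the confidence score that weights the corresponding word sense feature.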

Negation Detector
The system includes a simple rule-based negation detector. The detector is an implementation of the algorithm described in Christopher Potts' Sentiment Symposium Tutorial. The algorithm appends a negation suffix to words that appear within a negation scope, which starts at a negation key (e.g., 'no') and ends at certain punctuation (e.g., ':').
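A minimal sketch of this kind of negation marking is given below. The set of negation keys, the set of scope-ending punctuation marks, and the `_NEG` suffix are assumptions for illustration; the exact lists used in the system are not given here.

```python
# Assumed, deliberately small lists; the real detector may use larger ones.
NEGATION_KEYS = {"no", "not", "never", "cannot"}
SCOPE_END = {".", ":", ";", "!", "?", ","}


def mark_negation(tokens, suffix="_NEG"):
    """Append `suffix` to every token between a negation key and the next
    scope-ending punctuation mark."""
    marked = []
    in_scope = False
    for token in tokens:
        if token in SCOPE_END:
            in_scope = False          # punctuation closes the scope
            marked.append(token)
        elif in_scope:
            marked.append(token + suffix)
        else:
            marked.append(token)
            lowered = token.lower()
            if lowered in NEGATION_KEYS or lowered.endswith("n't"):
                in_scope = True       # negation key opens the scope
    return marked


print(mark_negation(["no", "good", "movie", ":", "but", "fun"]))
# prints ['no', 'good_NEG', 'movie_NEG', ':', 'but', 'fun']
```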

Features
The following are the features used in the system.
lexicons The values extracted from lexicon matches are word count, total score, maximal score, and last word score. For lexicons without sentiment scores, score 1.0 is used for all entries. Note that different POS taggers are used in word-to-lemma mapping as described in Section 3.1.3.
clusters Words are mapped to Twitter Word Clusters of the CMU ARK group. The largest clustering result, consisting of 1000 clusters from approximately 56 million tweets, is used as clusters.
word senses A result of the word sense disambiguator is extracted as weighted features according to their scores. Figure 2 shows an example of this feature.
Figure 2: An example of word senses feature
The ngram features are introduced as basic bag-of-words features in a supervised text categorization approach. Lexicon features are designed to strengthen the lexical features of Mohammad et al. (2013), which were shown to be effective in last year's task. Cluster features are implemented as an improvement for a supervised NLP system following the work of Turian et al. (2010). Word sense features are utilized to help subjectivity analysis and contextual polarity analysis (Akkaya et al., 2009).
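The four lexicon values named above (word count, total score, maximal score, and last word score) can be computed as in the following sketch. The function names, the feature keys, and the toy lexicon are assumptions; in particular, returning 0.0 when no lemma matches is a choice made for this example.

```python
def lexicon_features(lemmas, lexicon):
    """Compute count, total, maximum, and last score of lexicon matches."""
    scores = [lexicon[lemma] for lemma in lemmas if lemma in lexicon]
    return {
        "word_count": float(len(scores)),
        "total_score": float(sum(scores)),
        "max_score": max(scores, default=0.0),
        "last_word_score": scores[-1] if scores else 0.0,
    }


def unscored_lexicon(words):
    """A lexicon without sentiment scores gets score 1.0 for every entry."""
    return {word: 1.0 for word in words}


# Hypothetical toy lexicon with made-up scores.
toy_lexicon = {"good": 0.5, "bad": -0.75, "great": 0.25}
print(lexicon_features(["a", "good", "but", "bad", "great"], toy_lexicon))
# prints {'word_count': 3.0, 'total_score': 0.0, 'max_score': 0.5, 'last_word_score': 0.25}
```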

Machine Learner
Logistic Regression is used as the supervised machine learning algorithm, with LIBLINEAR (Fan et al., 2008) as its implementation. A Logistic Regression model is trained using the features of Section 3.2 with the three polarities (positive, negative, and neutral) as labels.
Table 3: The scores for each source in the test runs. The run marked with an asterisk (*) denotes the submission run. The values in the 'Sources' columns represent scores in the SemEval-2014 Task 9 metric (the average of positive F1 and negative F1).
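The task metric, the average of positive F1 and negative F1, can be sketched as follows. The label strings 'pos', 'neg', and 'neu' and the function names are assumptions for this example, and the official scorer may treat edge cases differently.

```python
def f1_for_label(gold, predicted, label):
    """F1 of one label, treating all other labels as the negative class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_pos_neg_f1(gold, predicted):
    """Average of positive F1 and negative F1; neutral only counts as an error source."""
    return (f1_for_label(gold, predicted, "pos")
            + f1_for_label(gold, predicted, "neg")) / 2


gold = ["pos", "pos", "neg", "neu"]
predicted = ["pos", "neu", "neg", "neg"]
print(round(average_pos_neg_f1(gold, predicted), 4))  # prints 0.6667
```

Because neutral F1 is not part of the average, shifting predictions away from the neutral class can raise the score, which is the effect the prediction adjuster exploits.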

Prediction Adjuster
Since the labels in the tweet data are unbalanced (Nakov et al., 2013), we prepared a prediction adjuster for the Logistic Regression output. For each polarity l, a weighting factor w_l that adjusts the probability output Pr(l) is introduced. The updated prediction label is decided by selecting the l that maximizes score(l), as expressed in equation 1.
\hat{l} = \arg\max_{l \in \{\mathrm{pos}, \mathrm{neg}, \mathrm{neu}\}} \mathrm{score}(l), \quad \mathrm{score}(l) = w_l \Pr(l) \qquad (1)
The approach we took in this prediction adjuster is a simple way to bias the output of Logistic Regression, but it may not be a typical approach to handling unbalanced data. For instance, LIBLINEAR includes the weighting option '-wi', which enables the use of a different cost parameter C for different classes. One advantage of our approach is that a change in w_l does not require retraining the Logistic Regression model. Various values of w_l can be tested at very low computational cost, which is helpful in a situation like SemEval tasks where the time for development is limited.
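Equation 1 amounts to a weighted argmax over the three class probabilities, as in the minimal sketch below. The weights are those reported for the submission run (w_pos = 2.4, w_neg = 3.3, w_neu = 1.0); the function name and the probability values are made up for illustration.

```python
def adjust_prediction(probabilities, weights):
    """Return the label l that maximizes score(l) = w_l * Pr(l)."""
    return max(probabilities, key=lambda label: weights[label] * probabilities[label])


# Weights of the submission run; the probabilities are illustrative only.
weights = {"pos": 2.4, "neg": 3.3, "neu": 1.0}
probabilities = {"pos": 0.20, "neg": 0.25, "neu": 0.55}

unadjusted = max(probabilities, key=probabilities.get)
adjusted = adjust_prediction(probabilities, weights)
print(unadjusted, adjusted)  # prints "neu neg"
```

Since only stored probabilities are rescaled, trying another set of weights requires no call to the learner.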

Submission Test Run
The system was trained using the 8,015 tweets included in Twitter(train) and Twitter(dev) described in Section 2.1. Three parameters were considered in the submission test run: the cost parameter C of Logistic Regression, the weight w_pos of the prediction adjuster, and the weight w_neg of the prediction adjuster. For w_neu of the prediction adjuster, a fixed value of 1.0 was used.
Prior to the submission test run, the following two steps were performed to select a parameter combination for the submission run.
Step 1
Step 2 The performance of the system for all these parameter combinations was calculated using Twitter(dev-test) described in Section 2.1.
As a result, the parameter combination C = 0.03, w_pos = 2.4, and w_neg = 3.3, which performed best on Twitter(dev-test), was selected for the submission run. Finally, the system with the selected parameters was applied to the test set of SemEval-2014 Task 9. The row 'Twitter(dev-test)' in Table 3 shows the scores of this submission run. The system achieved high performance on Twitter data: 72.12, 70.96, and 56.50 on Twitter2013, Twitter2014, and Twitter2014Sarcasm respectively.
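A weight search of this kind can be sketched as an exhaustive grid search over stored probabilities. The grid values, the toy development data, and all function names below are hypothetical; the actual ranges searched are not reproduced here, and the cost parameter C is left out because changing it requires retraining.

```python
from itertools import product


def adjust(probs, weights):
    """Label maximizing w_l * Pr(l)."""
    return max(probs, key=lambda label: weights[label] * probs[label])


def f1(gold, pred, label):
    """F1 of one label against all others."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def select_weights(dev_probs, dev_gold, grid):
    """Try every (w_pos, w_neg) pair with w_neu fixed to 1.0 and keep the
    first pair with the highest average of positive and negative F1."""
    best = None
    for w_pos, w_neg in product(grid, repeat=2):
        weights = {"pos": w_pos, "neg": w_neg, "neu": 1.0}
        pred = [adjust(p, weights) for p in dev_probs]
        score = (f1(dev_gold, pred, "pos") + f1(dev_gold, pred, "neg")) / 2
        if best is None or score > best[0]:
            best = (score, w_pos, w_neg)
    return best


# Hypothetical development data: classifier probabilities and gold labels.
dev_probs = [
    {"pos": 0.4, "neg": 0.1, "neu": 0.5},
    {"pos": 0.1, "neg": 0.4, "neu": 0.5},
    {"pos": 0.1, "neg": 0.1, "neu": 0.8},
]
dev_gold = ["pos", "neg", "neu"]
print(select_weights(dev_probs, dev_gold, grid=[1.0, 1.5, 2.0]))
# prints (1.0, 1.5, 1.5)
```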

Post-Submission Test Runs
The system performed quite well on Twitter data but not so well on other data in the submission run. After the release of the gold data of the 2014 test run, we conducted several test runs using different parameter combinations. 'Twitter(train)+Twitter(dev)', 'SMS(dev-test)', and 'SMS(dev-test)+Twitter(dev-test)' are the results of test runs with different data sources used for the parameter selection process. In 'Twitter(train)+Twitter(dev)', the parameter combination that maximizes the micro-average score of 5-fold cross validation was chosen, since the training data and the parameter selection data are identical.
The parameter combination selected with 'Twitter(train)+Twitter(dev)' showed results similar to the submission run, namely high performance on Twitter data. In the case of 'SMS(dev-test)', the system performed well on 'LiveJournal2014' and 'SMS(dev-test)', namely 72.99 and 68.92. However, with this parameter combination the scores on Twitter data were clearly lower than in the submission run. Finally, 'SMS(dev-test)+Twitter(dev-test)' yielded intermediate results, where the score for each source fell between the values of 'Twitter(dev-test)' and 'SMS(dev-test)'.

Conclusion
We proposed a system that is designed to enhance information based on lexical knowledge and to be adjustable to unbalanced training data. With parameters tuned towards Twitter data, the system achieved high scores on Twitter data: average F1 70.96 on Twitter2014 and average F1 56.50 on Twitter2014Sarcasm.
Additional test runs with different parameter combinations showed that the system can be tuned to perform well on non-Twitter data such as blogs or short messages. However, they also revealed a limitation of our approach of directly weighting a machine learner's output, since we could not find a general-purpose parameter combination that achieves high scores on all types of data.