Think Positive: Towards Twitter Sentiment Analysis from Scratch

In this paper we describe a deep convolutional neural network (CNN) approach to perform two sentiment detection tasks: message polarity classification and contextual polarity disambiguation. We apply the proposed approach to the SemEval-2014 Task 9: Sentiment Analysis in Twitter. Despite not using any handcrafted features or sentiment lexicons, our system achieves very competitive results for Twitter data.


Introduction
In this work we apply a recently proposed deep convolutional neural network (dos Santos and Gatti, 2014) that exploits character- to sentence-level information to perform sentiment analysis of Twitter messages (tweets). The network proposed by dos Santos and Gatti (2014), named Character to Sentence Convolutional Neural Network (CharSCNN), uses two convolutional layers to extract relevant features from words and messages of any size.
We evaluate CharSCNN in the unconstrained track of the SemEval-2014 Task 9: Sentiment Analysis in Twitter (Rosenthal et al., 2014). Two subtasks are proposed in the SemEval-2014 Task 9: contextual polarity disambiguation (SubtaskA), which consists in determining the polarity (positive, negative, or neutral) of a marked word or phrase in a given message; and message polarity classification (SubtaskB), which consists in classifying the polarity of the whole message. We use the same neural network to perform both tasks. The only difference is that in SubtaskA, CharSCNN is fed with a text segment composed of the words in a context window centered at the target word/phrase, while in SubtaskB, CharSCNN is fed with the whole message.
The use of deep neural networks for sentiment analysis has been the focus of recent research. However, instead of convolutional neural networks, most investigations have focused on the use of recursive neural networks (Socher et al., 2011; Socher et al., 2012).

Neural Network Architecture
Given a segment of text (e.g. a tweet), CharSCNN computes a score for each sentiment label τ ∈ T = {positive, negative, neutral}. In order to score a text segment, the network takes as input the sequence of words in the segment, and passes it through a sequence of layers where features with increasing levels of complexity are extracted. The network extracts features from the character-level up to the sentence-level.

Initial Representation Levels
The first layer of the network transforms words into real-valued feature vectors (embeddings) that capture morphological, syntactic and semantic information about the words. We use a fixed-sized word vocabulary V^wrd, and we consider that words are composed of characters from a fixed-sized character vocabulary V^chr. Given a sentence consisting of N words {w_1, w_2, ..., w_N}, every word w_n is converted into a vector u_n = [r^wrd; r^wch], which is composed of two subvectors: the word-level embedding r^wrd ∈ R^{d^wrd} and the character-level embedding r^wch ∈ R^{cl^0_u} of w_n. While word-level embeddings are meant to capture syntactic and semantic information, character-level embeddings capture morphological and shape information.

Word-Level Embeddings
Word-level embeddings are encoded by column vectors in an embedding matrix W^wrd ∈ R^{d^wrd × |V^wrd|}. Each column W^wrd_i ∈ R^{d^wrd} corresponds to the word-level embedding of the i-th word in the vocabulary. We transform a word w into its word-level embedding r^wrd by using the matrix-vector product

r^wrd = W^wrd v^w

where v^w is a one-hot vector of size |V^wrd| which has value 1 at index w and zero in all other positions. The matrix W^wrd is a parameter to be learned, and the size of the word-level embedding d^wrd is a hyper-parameter to be chosen by the user.
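The lookup above can be sketched in a few lines of NumPy. The sizes below (d^wrd = 4, a vocabulary of 3 words) are toy values chosen for illustration, not the paper's settings; in practice the matrix-vector product reduces to selecting one column of the embedding matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_wrd, vocab_size = 4, 3  # illustrative sizes, not the paper's
W_wrd = rng.standard_normal((d_wrd, vocab_size))  # one column per vocabulary word

def word_embedding(w):
    """Return r_wrd = W_wrd @ v_w, where v_w is the one-hot vector for word index w."""
    v_w = np.zeros(vocab_size)
    v_w[w] = 1.0
    return W_wrd @ v_w

# The product with a one-hot vector selects the w-th column of W_wrd.
assert np.allclose(word_embedding(1), W_wrd[:, 1])
```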

Character-Level Embeddings
In the task of sentiment analysis of Twitter data, important information can appear in different parts of a hashtag (e.g., "#SoSad", "#ILikeIt"), and many informative adverbs end with the suffix "ly" (e.g., "beautifully", "perfectly" and "badly"). Therefore, robust methods to extract morphological and shape information from these types of tokens must take into consideration all characters of the token and select the features that are most important for sentiment analysis. Like dos Santos and Zadrozny (2014), we tackle this problem using a convolutional approach (Waibel et al., 1989), which works by producing local features around each character of the word and then combining them using a max operation to create a fixed-sized character-level embedding of the word.

Given a word w composed of M characters {c_1, c_2, ..., c_M}, we first transform each character c_m into a character embedding r^chr_m. Character embeddings are encoded by column vectors in the embedding matrix W^chr ∈ R^{d^chr × |V^chr|}. Given a character c, its embedding r^chr is obtained by the matrix-vector product

r^chr = W^chr v^c

where v^c is a one-hot vector of size |V^chr| which has value 1 at index c and zero in all other positions.

The input for the convolutional layer is the sequence of character embeddings {r^chr_1, r^chr_2, ..., r^chr_M}. The convolutional layer applies a matrix-vector operation to each window of k^chr successive embeddings in this sequence. Let us define the vector z_m ∈ R^{d^chr k^chr} as the concatenation of the character embedding m, its (k^chr − 1)/2 left neighbors, and its (k^chr − 1)/2 right neighbors:

z_m = [r^chr_{m−(k^chr−1)/2}; ...; r^chr_{m+(k^chr−1)/2}]

The convolutional layer computes the j-th element of the vector r^wch, which is the character-level embedding of w, as follows:

[r^wch]_j = max_{1≤m≤M} [W^0 z_m + b^0]_j

where W^0 ∈ R^{cl^0_u × d^chr k^chr} is the weight matrix of the convolutional layer. The same matrix is used to extract local features around each character window of the given word.
Using the max over all character windows of the word, we extract a "global" fixed-sized feature vector for the word.
Matrices W^chr and W^0, and vector b^0 are parameters to be learned. The size of the character vector d^chr, the number of convolutional units cl^0_u (which corresponds to the size of the character-level embedding of a word), and the size of the character context window k^chr are hyper-parameters.
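The character-level convolution and max pooling described above can be sketched as follows. The sizes (d^chr = 5, cl^0_u = 7, k^chr = 3) are toy choices for illustration, and out-of-range character positions are padded with zero vectors, a stand-in for the padding token used for words at segment boundaries.

```python
import numpy as np

rng = np.random.default_rng(1)
d_chr, cl0_u, k_chr = 5, 7, 3  # illustrative hyper-parameter values
W0 = rng.standard_normal((cl0_u, d_chr * k_chr))  # shared convolution weights
b0 = rng.standard_normal(cl0_u)

def char_level_embedding(char_embs):
    """char_embs: list of M character embeddings r_chr_m, each of size d_chr.
    Returns r_wch of size cl0_u via convolution + max over character windows."""
    pad = (k_chr - 1) // 2
    padded = [np.zeros(d_chr)] * pad + list(char_embs) + [np.zeros(d_chr)] * pad
    windows = []
    for m in range(len(char_embs)):
        z_m = np.concatenate(padded[m:m + k_chr])  # char m with its neighbors
        windows.append(W0 @ z_m + b0)              # local features for window m
    return np.max(windows, axis=0)                 # element-wise max over windows

word = [rng.standard_normal(d_chr) for _ in range(6)]  # a 6-character word
r_wch = char_level_embedding(word)
assert r_wch.shape == (cl0_u,)  # fixed size, regardless of word length
```

Because the max is taken element-wise over all windows, the output has the same size cl^0_u for words of any length.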

Sentence-Level Representation and Scoring
Given a text segment x with N words {w_1, w_2, ..., w_N}, which have been converted to joint word-level and character-level embeddings {u_1, u_2, ..., u_N}, the next step in CharSCNN consists in extracting a segment-level representation r^seg_x. Methods to extract a segment-wide feature set must deal with two main problems: text segments have different sizes, and important information can appear at any position in the segment. A convolutional approach is a good option to tackle these problems, and therefore we use a convolutional layer to compute the segment-wide feature vector r^seg. This second convolutional layer works in a very similar way to the one used to extract character-level features for words: it produces local features around each word in the text segment and then combines them using a max operation to create a fixed-sized feature vector for the segment.
The second convolutional layer applies a matrix-vector operation to each window of k^wrd successive embeddings in the sequence {u_1, u_2, ..., u_N}. Let us define the vector z_n ∈ R^{(d^wrd + cl^0_u) k^wrd} as the concatenation of a sequence of k^wrd embeddings, centered at the n-th word¹:

z_n = [u_{n−(k^wrd−1)/2}; ...; u_{n+(k^wrd−1)/2}]

The convolutional layer computes the j-th element of the vector r^seg as follows:

[r^seg]_j = max_{1≤n≤N} [W^1 z_n + b^1]_j

where W^1 ∈ R^{cl^1_u × (d^wrd + cl^0_u) k^wrd} is the weight matrix of the convolutional layer. The same matrix is used to extract local features around each word window of the given segment. Using the max over all word windows of the segment, we extract a "global" fixed-sized feature vector for the segment. Matrix W^1 and vector b^1 are parameters to be learned. The number of convolutional units cl^1_u (which corresponds to the size of the segment-level feature vector), and the size of the word context window k^wrd are hyper-parameters to be chosen by the user.
Finally, the vector r^seg_x, the "global" feature vector of text segment x, is processed by two usual neural network layers, which extract one more level of representation and compute a score for each sentiment label τ ∈ T:

s_θ(x) = W^3 h(W^2 r^seg_x + b^2) + b^3

where matrices W^2 ∈ R^{hl_u × cl^1_u} and W^3 ∈ R^{|T| × hl_u}, and vectors b^2 ∈ R^{hl_u} and b^3 ∈ R^{|T|} are parameters to be learned. The transfer function h(.) is the hyperbolic tangent. The number of hidden units hl_u is a hyper-parameter to be chosen by the user.
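The two final scoring layers can be sketched as below. All sizes (cl^1_u = 6, hl_u = 4, |T| = 3 labels) are illustrative toy values, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(2)
cl1_u, hl_u, n_labels = 6, 4, 3  # illustrative sizes; |T| = 3 sentiment labels
W2, b2 = rng.standard_normal((hl_u, cl1_u)), rng.standard_normal(hl_u)
W3, b3 = rng.standard_normal((n_labels, hl_u)), rng.standard_normal(n_labels)

def score(r_seg):
    """s(x) = W3 tanh(W2 r_seg + b2) + b3: one raw score per sentiment label."""
    return W3 @ np.tanh(W2 @ r_seg + b2) + b3

s = score(rng.standard_normal(cl1_u))
assert s.shape == (n_labels,)  # one score each for positive, negative, neutral
```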

Network Training
Our network is trained by minimizing the negative log-likelihood over the training set D. Given a text segment x, the network with parameter set θ computes a score s_θ(x)_τ for each sentiment label τ ∈ T. In order to transform these scores into a conditional probability p(τ|x, θ) of the label given the segment and the set of network parameters θ, we apply a softmax operation over all labels:

p(τ|x, θ) = e^{s_θ(x)_τ} / Σ_{i∈T} e^{s_θ(x)_i}

¹ We use a special padding token for the words with indices outside of the text segment boundaries.
Taking the log, we arrive at the following conditional log-probability:

log p(τ|x, θ) = s_θ(x)_τ − log Σ_{i∈T} e^{s_θ(x)_i}

We use stochastic gradient descent (SGD) to minimize the negative log-likelihood with respect to θ:

θ ↦ Σ_{(x,y)∈D} −log p(y|x, θ)

where (x, y) corresponds to a text segment (e.g., a tweet) in the training corpus D and y represents its respective sentiment class label. We use the backpropagation algorithm to compute the gradients of the network (Lecun et al., 1998; Collobert, 2011). We implement the CharSCNN architecture using the automatic differentiation capabilities of the Theano library (Bergstra et al., 2010).
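The training criterion above can be sketched as follows: the softmax turns the label scores into a distribution, and the loss is the negative log-probability of the gold label. The score values and label indexing are illustrative.

```python
import numpy as np

def log_softmax(scores):
    """Numerically stable log of the softmax over all label scores."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    return shifted - np.log(np.sum(np.exp(shifted)))

def nll(scores, gold):
    """Negative conditional log-probability -log p(gold | x, theta)."""
    return -log_softmax(scores)[gold]

scores = np.array([2.0, 0.5, -1.0])  # s_theta(x) for {positive, negative, neutral}
probs = np.exp(log_softmax(scores))
assert np.isclose(probs.sum(), 1.0)     # softmax yields a proper distribution
assert nll(scores, 0) < nll(scores, 2)  # the higher-scored label has lower loss
```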

Unsupervised Learning of Word-Level Embeddings
Unsupervised pre-training of word embeddings has been shown to be an effective approach to improve model accuracy (Luong et al., 2013; Zheng et al., 2013). In our experiments, we perform unsupervised learning of word-level embeddings using the word2vec tool². We use two Twitter datasets as sources of unlabeled data: the Stanford Twitter Sentiment corpus (Go et al., 2009), which contains 1.6 million tweets; and a dataset containing 10.4 million tweets that were collected in October 2012 for a previous work by the author (Gatti et al., 2013). We tokenize these corpora using Gimpel et al.'s (2011) tokenizer, and remove messages that are less than 5 characters long (including white spaces) or have less than 3 tokens. Like in (Luong et al., 2013), we lowercase all words and substitute each numerical digit by a 0 (e.g., 1967 becomes 0000). The resulting corpus contains about 12 million tweets.
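The preprocessing steps above can be sketched as a small filter-and-normalize function. The whitespace split below is a simplified stand-in for the Gimpel et al. (2011) tokenizer, and the thresholds mirror the ones stated in the text.

```python
import re

def preprocess(tweet):
    """Drop too-short messages, lowercase, and map every digit to 0."""
    if len(tweet) < 5 or len(tweet.split()) < 3:
        return None  # discard: under 5 characters or under 3 tokens
    return re.sub(r"[0-9]", "0", tweet.lower())

assert preprocess("Best of 1967 !") == "best of 0000 !"
assert preprocess("hi") is None  # filtered out as too short
```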
We do not perform unsupervised learning of character-level embeddings, which are initialized by randomly sampling each value from a uniform distribution U(−r, r), where r = √(6 / (|V^chr| + d^chr)). The character vocabulary is constructed from the (not lowercased) words in the training set, which allows the neural network to capture relevant information about capitalization.
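This initialization can be sketched as below, assuming toy sizes |V^chr| = 50 and d^chr = 5 and reading the bound r as the usual fan-based value √(6 / (|V^chr| + d^chr)).

```python
import numpy as np

rng = np.random.default_rng(3)
v_chr, d_chr = 50, 5  # illustrative vocabulary and embedding sizes
r = np.sqrt(6.0 / (v_chr + d_chr))          # uniform bound from the fan sizes
W_chr = rng.uniform(-r, r, size=(d_chr, v_chr))  # one column per character

assert np.abs(W_chr).max() <= r  # every entry lies inside (-r, r)
```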

Sentiment Corpora and Model Setup
SemEval-2014 Task 9 is a rerun of the SemEval-2013 Task 2 (Nakov et al., 2013), hence the training set used in 2014 is the same as in the 2013 task. However, as we downloaded the Twitter training and development sets in 2014 only, we were not able to download the complete dataset, since some tweets had been deleted by their respective creators. In Table 1, we show the number of messages we were able to download. In SemEval-2014 Task 9, three different test sets are used: Twitter2014, Twitter2014Sarcasm and LiveJournal2014. While the first two contain Twitter messages, the last one contains sentences from LiveJournal blogs. In Table 2, we show the number of messages in the SemEval-2014 Task 9 test datasets.

Table 2: Number of messages in the SemEval-2014 Task 9 test sets.

Test Dataset          SubtaskA   SubtaskB
Twitter2014           2597       1939
Twitter2014Sarcasm    124        86
LiveJournal2014       1315       1142

We use the corpora Twitter2013 (test) and SMS2013 to tune CharSCNN's hyper-parameter values. In Table 3, we show the selected hyper-parameter values, which are the same for both SubtaskA and SubtaskB. We concatenate the SemEval-2013 Task 2 training and development sets to train the submitted model.

Sentiment Prediction Results
In this section, we present the sentiment prediction results of CharSCNN on the SemEval-2014 Task 9 test sets.

Conclusions
In this work we describe a sentiment analysis system based on a deep neural network architecture that analyses text at multiple levels, from the character level to the sentence level. We apply the proposed system to the SemEval-2014 Task 9 and achieve very competitive results for Twitter data in both the contextual polarity disambiguation and message polarity classification subtasks. As future work, we would like to investigate the impact on system performance for the LiveJournal2014 corpus when the unsupervised pre-training is performed using in-domain texts.