Credibility Adjusted Term Frequency: A Supervised Term Weighting Scheme for Sentiment Analysis and Text Classification

We provide a simple but novel supervised weighting scheme for adjusting term frequency in tf-idf for sentiment analysis and text classification. We compare our method to baseline weighting schemes and find that it outperforms them on multiple benchmarks. The method is robust and works well on both snippets and longer documents.


Introduction
Baseline discriminative methods for text classification usually involve training a linear classifier over bag-of-words (BoW) representations of documents. In BoW representations (also known as Vector Space Models), a document is represented as a vector where each entry is a count (or binary indicator) of a token's occurrences in the document. Given that some tokens are more informative than others, a common technique is to apply a weighting scheme that gives more weight to discriminative tokens and less weight to non-discriminative ones. Term frequency-inverse document frequency (tf-idf) (Salton and McGill, 1983) is a commonly employed unsupervised weighting technique. In tf-idf, each token $i$ in document $d$ is assigned the following weight,
$$w_{i,d} = tf_{i,d} \cdot \log\frac{N}{df_i}$$
where $tf_{i,d}$ is the number of times token $i$ occurred in document $d$, $N$ is the number of documents in the corpus, and $df_i$ is the number of documents in which token $i$ occurred.
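As a point of reference, the unsupervised baseline can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard form $w_{i,d} = tf_{i,d} \cdot \log(N/df_i)$ with a natural logarithm; the function name `tfidf_weights` and the toy documents are hypothetical and not part of the original experiments.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    # df[t] = number of documents that contain token t at least once
    df = Counter(tok for doc in docs for tok in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weighted.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weighted

# Toy corpus (illustrative only)
docs = [["good", "good", "movie"], ["bad", "movie"], ["good", "plot"]]
weights = tfidf_weights(docs)
```

Note that a token occurring in every document receives weight zero under this form, since $\log(N/df_i) = 0$.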
Many supervised and unsupervised variants of tf-idf exist (Debole and Sebastiani (2003); Martineau and Finin (2009); Wang and Zhang (2013)). The purpose of this paper is not to perform an exhaustive comparison of existing weighting schemes, and hence we do not list them here. Interested readers are directed to Paltoglou and Thelwall (2010) and Deng et al. (2014) for comprehensive reviews of the different schemes.
In the present work, we propose a simple but novel supervised method to adjust the term frequency portion of tf-idf by assigning a credibility adjusted score to each token. We find that it outperforms the traditional unsupervised tf-idf weighting scheme on multiple benchmarks. The benchmarks include both snippets and longer documents. We also compare our method against Wang and Manning (2012)'s Naive-Bayes Support Vector Machine (NBSVM), which has achieved state-of-the-art (or near state-of-the-art) results on many datasets, and find that it performs competitively against NBSVM. We additionally find that the traditional tf-idf performs competitively against other, more sophisticated methods when used with the right scaling and normalization parameters.

The Method
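Consider a binary classification task. Let $C_{i,k}$ be the count of token $i$ in class $k$, with $k \in \{-1, 1\}$. Denote $C_i$ to be the count of token $i$ over both classes, and $y^{(d)}$ to be the class of document $d$. For each occurrence of token $i$ in the training set, we calculate the following,
$$s_i^{(j)} = \begin{cases} C_{i,1}/C_i, & \text{if } y^{(d)} = 1 \\ C_{i,-1}/C_i, & \text{if } y^{(d)} = -1 \end{cases}$$
Here, $j$ is the $j$-th occurrence of token $i$, and $d$ is the document in which that occurrence appears. Since there are $C_i$ such occurrences, $j$ indexes from 1 to $C_i$. We assign a score to token $i$ by,
$$\hat{s}_i = \frac{1}{C_i}\sum_{j=1}^{C_i} s_i^{(j)}$$
Intuitively, $\hat{s}_i$ is the average likelihood of making the correct classification given token $i$'s occurrence in the document, if $i$ was the only token in the document. In a binary classification case, this reduces to,
$$\hat{s}_i = \frac{C_{i,1}^2 + C_{i,-1}^2}{C_i^2}$$
Note that by construction, the support of $\hat{s}_i$ is [0.5, 1].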
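The equivalence between the per-occurrence average and the closed form for the binary case can be checked with a short sketch. It assumes the per-occurrence score is $C_{i,k}/C_i$ for the class $k$ of the containing document; the function names and the toy corpus are hypothetical.

```python
from collections import Counter

def class_counts(docs, labels):
    """Count occurrences of each token per class: {token: Counter({label: count})}."""
    counts = {}
    for doc, y in zip(docs, labels):
        for tok in doc:
            counts.setdefault(tok, Counter())[y] += 1
    return counts

def score_by_occurrence(docs, labels):
    """Average the per-occurrence score over every occurrence of each token."""
    counts = class_counts(docs, labels)
    sums, totals = Counter(), Counter()
    for doc, y in zip(docs, labels):
        for tok in doc:
            c_i = sum(counts[tok].values())
            sums[tok] += counts[tok][y] / c_i  # score of this single occurrence
            totals[tok] += 1
    return {tok: sums[tok] / totals[tok] for tok in totals}

def score_closed_form(docs, labels):
    """Binary-case closed form: (C_{i,1}^2 + C_{i,-1}^2) / C_i^2."""
    counts = class_counts(docs, labels)
    return {tok: (c[1] ** 2 + c[-1] ** 2) / (c[1] + c[-1]) ** 2
            for tok, c in counts.items()}

# Toy corpus with labels in {-1, 1} (illustrative only)
docs = [["good", "good", "fun"], ["bad", "fun"], ["good", "bad"]]
labels = [1, -1, -1]
by_occ = score_by_occurrence(docs, labels)
closed = score_closed_form(docs, labels)
```

In this toy corpus, "bad" occurs only in the negative class and scores 1.0, while "fun" is split evenly across classes and scores 0.5, the two ends of the support.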

Credibility Adjustment
Suppose $\hat{s}_i = \hat{s}_j = 0.75$ for two different tokens $i$ and $j$, but $C_i = 5$ and $C_j = 100$. Intuition suggests that $\hat{s}_j$ is a more credible score than $\hat{s}_i$, and that $\hat{s}_i$ should be shrunk towards the population mean. Let $\hat{s}$ be the (weighted) population mean. That is,
$$\hat{s} = \sum_i \frac{C_i \cdot \hat{s}_i}{C}$$
where $C$ is the count of all tokens in the corpus.
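We define the credibility adjusted score for token $i$ to be,
$$\bar{s}_i = \frac{C_{i,1}^2 + C_{i,-1}^2 + \hat{s} \cdot \gamma}{C_i^2 + \gamma}$$
where $\gamma$ is an additive smoothing parameter. If the $C_{i,k}$'s are small, then $\bar{s}_i \approx \hat{s}$ (otherwise, $\bar{s}_i \approx \hat{s}_i$). This is a form of Buhlmann credibility adjustment from the actuarial literature (Buhlmann and Gisler, 2005). We subsequently define $\overline{tf}$, the credibility adjusted term frequency, to be,
$$\overline{tf}_{i,d} = (0.5 + \bar{s}_i) \cdot tf_{i,d}$$
and $tf$ is replaced with $\overline{tf}$. That is,
$$w_{i,d} = \overline{tf}_{i,d} \cdot \log\frac{N}{df_i}$$
We refer to the above as cred-tf-idf hereafter.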
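The shrinkage behaviour can be illustrated with a small sketch. It rests on two assumptions that should be read as such: that the adjusted score has the additive-smoothing form $(C_{i,1}^2 + C_{i,-1}^2 + \hat{s}\gamma)/(C_i^2 + \gamma)$, and that the adjusted term frequency is $(0.5 + \bar{s}_i) \cdot tf_{i,d}$. The function names and token counts are hypothetical.

```python
def cred_adjusted_scores(class_counts, gamma=1.0):
    """class_counts: {token: (count in class +1, count in class -1)}.
    Returns (adjusted scores per token, weighted population mean)."""
    total = sum(p + n for p, n in class_counts.values())
    raw = {t: (p * p + n * n) / (p + n) ** 2 for t, (p, n) in class_counts.items()}
    # Population mean of the raw scores, weighted by token count
    s_pop = sum((p + n) * raw[t] for t, (p, n) in class_counts.items()) / total
    adjusted = {t: (p * p + n * n + s_pop * gamma) / ((p + n) ** 2 + gamma)
                for t, (p, n) in class_counts.items()}
    return adjusted, s_pop

def cred_tf(tf, s_bar):
    """Credibility adjusted term frequency (assumed form)."""
    return (0.5 + s_bar) * tf

# Toy counts: "rare" and "common" share the same raw score of 0.625
counts = {"rare": (3, 1), "common": (75, 25), "neutral": (50, 50)}
adjusted, s_pop = cred_adjusted_scores(counts, gamma=1.0)
```

Under these assumptions the rare token is pulled further towards the population mean than the common one, even though their raw scores are identical. With $\gamma = 1$ the pull is small in absolute terms, because the denominator grows with $C_i^2$.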

Sublinear Scaling
It is common practice to apply sublinear scaling to $tf$. A word occurring (say) ten times more often in a document is unlikely to be ten times as important. Paltoglou and Thelwall (2010) confirm that sublinear scaling of term frequency results in significant improvements in various text classification tasks. We employ logarithmic scaling, where $tf$ is replaced with $\log(tf) + 1$. For our method, $\overline{tf}$ is simply replaced with $\log(\overline{tf}) + 1$. We found virtually no difference in performance between log scaling and other sublinear scaling methods (such as augmented scaling, where $tf$ is replaced with $0.5 + \frac{0.5 \cdot tf}{\max tf}$).
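Both scaling variants mentioned above are one-liners. The sketch below is illustrative only; mapping a zero frequency to zero under log scaling is an assumption (absent tokens are simply left out of the feature vector), and the function names are hypothetical.

```python
import math

def log_scale(tf):
    """Logarithmic scaling: log(tf) + 1 for tokens that occur, 0 otherwise (assumed)."""
    return math.log(tf) + 1.0 if tf > 0 else 0.0

def augmented_scale(tf, max_tf):
    """Augmented scaling: 0.5 + 0.5 * tf / max_tf, with max_tf the largest tf in the document."""
    return 0.5 + 0.5 * tf / max_tf

single = log_scale(1)    # a token seen once keeps weight 1
tenfold = log_scale(10)  # a token seen ten times gains far less than tenfold
```

The same `log_scale` function applies unchanged to the credibility adjusted term frequency, which is a positive real number rather than an integer count.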

Normalization
Using normalized features resulted in substantial improvements in performance versus using un-normalized features. We thus use $\hat{x}^{(d)} = x^{(d)} / \|x^{(d)}\|_2$ in the SVM, where $x^{(d)}$ is the feature vector obtained from cred-tf-idf weights for document $d$.
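For a sparse feature vector stored as a dictionary, the normalization step might look as follows. This is a minimal sketch; the dictionary representation and the handling of an all-zero vector are assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a sparse vector {token: weight} to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        return dict(vec)  # assumed: leave an all-zero vector unchanged
    return {t: v / norm for t, v in vec.items()}

unit = l2_normalize({"good": 3.0, "movie": 4.0})
```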

Naive-Bayes SVM (NBSVM)
Wang and Manning (2012) achieve excellent (sometimes state-of-the-art) results on many benchmarks using binary Naive Bayes (NB) log-count ratios as features in an SVM. In their framework,
$$w_{i,d} = \mathbb{1}\{tf_{i,d} > 0\} \cdot \log\frac{(df_{i,1} + \alpha) / \sum_{i'} (df_{i',1} + \alpha)}{(df_{i,-1} + \alpha) / \sum_{i'} (df_{i',-1} + \alpha)}$$
where $df_{i,k}$ is the number of documents that contain token $i$ in class $k$, $\alpha$ is a smoothing parameter, and $\mathbb{1}\{\cdot\}$ is the indicator function, equal to one if $tf_{i,d} > 0$ and zero otherwise. As an additional benchmark, we implement NBSVM with $\alpha = 1.0$ and compare against our results (see footnote 1).
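The binary log-count ratio features can be sketched as below. The sketch assumes the ratio is formed from smoothed per-class document frequencies, each normalized by its sum over the vocabulary; the function names and the two-document corpus are hypothetical, and the SVM that consumes these features is not shown.

```python
import math
from collections import Counter

def nb_log_count_ratios(docs, labels, alpha=1.0):
    """Return {token: log-count ratio} from per-class document frequencies."""
    vocab = sorted({t for doc in docs for t in doc})
    df = {1: Counter(), -1: Counter()}
    for doc, y in zip(docs, labels):
        df[y].update(set(doc))  # count each token once per document
    z_pos = sum(df[1][t] + alpha for t in vocab)
    z_neg = sum(df[-1][t] + alpha for t in vocab)
    return {t: math.log(((df[1][t] + alpha) / z_pos) / ((df[-1][t] + alpha) / z_neg))
            for t in vocab}

def nb_features(doc, ratios):
    """Binary features: the ratio for each known token present in the document."""
    return {t: ratios[t] for t in set(doc) if t in ratios}

docs = [["good", "fun"], ["bad", "fun"]]
labels = [1, -1]
ratios = nb_log_count_ratios(docs, labels, alpha=1.0)
feats = nb_features(["good", "good", "fun"], ratios)
```

Because the features are binary, the repeated "good" in the last document contributes its ratio only once.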

Datasets and Experimental Setup
We test our method on both long and short text classification tasks, all of which were used to establish baselines in Wang and Manning (2012). Table 1 has summary statistics of the datasets. The snippet datasets are:
• PL-sh: Short movie reviews with one sentence per review. Classification involves detecting whether a review is positive or negative (Pang and Lee, 2005).
• PL-sub: Dataset with short subjective movie reviews and objective plot summaries. The classification task is to detect whether the sentence is objective or subjective (Pang and Lee, 2004).
The longer document datasets are PL-2k, IMDB, and the Newsgroup datasets.
Footnote 1. Wang and Manning (2012) use the same $\alpha$, but their NBSVM differs from ours in two ways. First, they use squared ($\ell_2$) hinge loss, as opposed to the $\ell_1$ hinge loss in this paper. Second, they interpolate NBSVM weights with Multinomial Naive Bayes (MNB) weights to get the final weight vector. Further, their tokenization is slightly different. Hence our NBSVM results are not directly comparable. We list their results in Table 2.

Support Vector Machine (SVM)
For each document, we construct the feature vector $x^{(d)}$ using weights obtained from cred-tf-idf with log scaling and $\ell_2$ normalization. For cred-tf-idf, $\gamma$ is set to 1.0. NBSVM and tf-idf (also with log scaling and $\ell_2$ normalization) are used to establish baselines. Prediction for a test document is given by
$$y^{(d)} = \operatorname{sign}(w^T x^{(d)} + b)$$
In all experiments, we use a Support Vector Machine (SVM) with a linear kernel and penalty parameter of $C = 1.0$. For the SVM, $w, b$ are obtained by minimizing,
$$w^T w + C \sum_{d} \max\left(0, 1 - y^{(d)}(w^T x^{(d)} + b)\right)$$
using the LIBLINEAR library (Fan et al., 2008).
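To make the decision rule and the training objective concrete, the sketch below evaluates both for a given weight vector. It does not perform the optimization, which the paper delegates to LIBLINEAR. The unsquared hinge objective with a $w^T w$ regularizer is an assumed form, as is treating a decision value of exactly zero as the positive class; all names and numbers are hypothetical.

```python
def decision(w, b, x):
    """Linear decision value w^T x + b for sparse dicts w and x."""
    return sum(w.get(t, 0.0) * v for t, v in x.items()) + b

def predict(w, b, x):
    """Sign of the decision value; zero is mapped to +1 by assumption."""
    return 1 if decision(w, b, x) >= 0 else -1

def hinge_objective(w, b, xs, ys, C=1.0):
    """Assumed objective: w^T w + C * sum of unsquared hinge losses."""
    reg = sum(v * v for v in w.values())
    loss = sum(max(0.0, 1.0 - y * decision(w, b, x)) for x, y in zip(xs, ys))
    return reg + C * loss

# Hypothetical weights and two already-normalized toy documents
w, b = {"good": 1.0, "bad": -1.0}, 0.0
xs = [{"good": 1.0}, {"bad": 0.5}]
ys = [1, -1]
objective = hinge_objective(w, b, xs, ys, C=1.0)
preds = [predict(w, b, x) for x in xs]
```

Here the first document sits exactly on the margin and incurs no loss, while the second is correctly classified but falls inside the margin and contributes 0.5.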

Tokenization
We lower-case all words but do not perform any stemming or lemmatization. We restrict the vocabulary to all tokens that occurred at least twice in the training set.
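A sketch of this preprocessing is given below. The source does not specify how text is split into tokens, so whitespace splitting is an assumption here, as is counting total occurrences (rather than documents) for the frequency threshold. The function names are hypothetical.

```python
from collections import Counter

def tokenize(text):
    """Lower-case and split on whitespace (assumed); no stemming or lemmatization."""
    return text.lower().split()

def build_vocab(train_texts, min_count=2):
    """Keep tokens that occur at least min_count times in the training set."""
    counts = Counter(tok for text in train_texts for tok in tokenize(text))
    return {tok for tok, c in counts.items() if c >= min_count}

vocab = build_vocab(["Good movie", "good plot"])
```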

Results and Discussion
For the PL datasets, there are no separate test sets, and hence we use 10-fold cross-validation (as do other published results) to estimate errors. The standard train-test splits are used on the IMDB and Newsgroup datasets.
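The 10-fold procedure can be sketched as an index generator. Whether the folds in the reported experiments were shuffled or stratified is not stated, so the seeded shuffle below is an assumption, and `k_fold_indices` is a hypothetical helper.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over n examples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # assumed: shuffle before splitting
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

splits = list(k_fold_indices(25, k=10))
```

Each example appears in exactly one test fold, and the error estimate is the average over the ten held-out folds.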

cred-tf-idf outperforms tf-idf
Table 2 has the comparison of results for the different datasets. Our method outperforms the traditional tf-idf on all benchmarks for both unigrams and bigrams. While some of the differences in performance are significant at the 0.05 level (e.g. IMDB), some are not (e.g. PL-2k). The Wilcoxon signed ranks test is a non-parametric test that is often used in cases where two classifiers are compared over multiple datasets (Demsar, 2006). The Wilcoxon signed ranks test indicates that the overall outperformance is significant at the p < 0.01 level.
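For readers unfamiliar with the test, the signed-rank sums can be computed as in the sketch below. It covers only the rank sums, not the p-value (SciPy's `scipy.stats.wilcoxon` provides the full test). The paired scores are made-up integers chosen for easy checking and are not results from this paper.

```python
def wilcoxon_rank_sums(a, b):
    """Return (W+, W-): rank sums of positive and negative paired differences.
    Zero differences are dropped; tied absolute differences share an average rank."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal absolute differences
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Made-up paired scores for two methods on four datasets (illustrative only)
w_plus, w_minus = wilcoxon_rank_sums([3, 5, 8, 6], [1, 6, 4, 6])
```

The smaller of the two sums is the usual test statistic; a small value relative to the number of datasets indicates that one method is consistently ahead.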

NBSVM outperforms cred-tf-idf
cred-tf-idf did not outperform Wang and Manning (2012)'s NBSVM (Wilcoxon signed ranks test p-value = 0.1). But it did outperform our own implementation of NBSVM, implying that the extra modifications by Wang and Manning (2012) (i.e. using squared hinge loss in the SVM and interpolating between NBSVM and MNB weights) are important contributions of their methodology. This was especially true in the case of shorter documents, where our uninterpolated NBSVM performed significantly worse than their interpolated NBSVM.

tf-idf still performs well
We find that tf-idf still performs remarkably well with the right scaling and normalization parameters. Indeed, the traditional tf-idf outperformed many of the more sophisticated methods that employ distributed representations (Maas et al., 2011).

Conclusions and Future Work
In this paper we presented a novel supervised weighting scheme, which we call credibility adjusted term frequency, to perform sentiment analysis and text classification. Our method outperforms the traditional tf-idf weighting scheme on multiple benchmarks, which include both snippets and longer documents. We also showed that tf-idf is competitive against other state-of-the-art methods with the right scaling and normalization parameters.
From a performance standpoint, it would be interesting to see if our method is able to achieve even better results on the above tasks with proper tuning of the $\gamma$ parameter. Relatedly, our method could potentially be combined with other supervised variants of tf-idf, either directly or through ensembling, to improve performance further.

Table 2: Results of our method (cred-tf-idf) against baselines (tf-idf, NBSVM), using unigrams and bigrams. cred-tf-idf and tf-idf both use log scaling and $\ell_2$ normalization. Best results (that do not use external sources) are underlined, while top three are in bold. Rows 7-11 are MNB and NBSVM results from Wang and Manning (2012). Our NBSVM results are not directly comparable to theirs (see footnote 1). Methods with * use external data or software.
Appr. Tax: Uses appraisal taxonomies from WordNet (Whitelaw et al., 2005).
Str. SVM: Uses OpinionFinder to find objective versus subjective parts of the review (Yessenalina et al., 2010).
aug-tf-mi: Uses augmented term-frequency with mutual information gain (Deng et al., 2014).
Disc. Conn.: Uses discourse connectors to generate additional features (Trivedi and Eisenstein, 2013).
Word Vec.: Learns sentiment-specific word vectors to use as features combined with BoW features (Maas et al., 2011).
LLR: Uses log-likelihood ratio on features to select features (Aue and Gamon, 2005).
RAE: Recursive autoencoders (Socher et al., 2011).
MV-RNN: Matrix-Vector Recursive Neural Networks (Socher et al., 2012).