Opinion Mining and Topic Categorization with Novel Term Weighting

In this paper we investigate the efficiency of a novel term weighting algorithm for opinion mining and topic categorization of articles from newspapers and the Internet. We compare the novel term weighting technique with existing approaches such as TF-IDF and ConfWeight. Performance on data from the text-mining campaigns DEFT'07 and DEFT'08 shows that the proposed method competes with existing information retrieval models in classification quality and is computationally faster. The proposed text preprocessing method can be applied to large-scale information retrieval and data mining problems and can easily be transferred to different domains and different languages, since it does not require any domain-related or linguistic information.


Introduction
Nowadays, the Internet and social media generate a huge amount of textual information, and it is increasingly important to develop methods of text processing such as text classification. Text classification is very important for problems such as automatic opinion mining (sentiment analysis) and topic categorization of articles from newspapers and the Internet.
Text classification can be considered a part of natural language understanding: given a set of predefined categories, the task is to automatically assign new documents to one of these categories. The method of text preprocessing and text representation influences the results that are obtained even with the same classification algorithms. The most popular model for text classification is the vector space model, in which text categorization may be considered a machine learning problem. The complexity of text categorization with the vector space model is compounded by the need to extract numerical data from the text before applying machine learning methods. Therefore, text categorization consists of two parts: text preprocessing and classification using the obtained numerical data.
All text preprocessing methods are based on the idea that the category of a document depends on the words or phrases in it. The simplest approach is to take each word of the document as a binary coordinate, so that the dimension of the feature space is the number of words in the dictionary. The drawback of this representation is that it is high-dimensional and treats all words as equally important for every class.
More advanced text preprocessing approaches exist to overcome this problem, such as the TF-IDF (Salton and Buckley, 1988) and ConfWeight (Soucy and Mineau, 2005) methods. A novel term weighting method (Gasanova et al., 2013) is also considered, which has some similarities with ConfWeight but improved computational efficiency. It is important to note that we use no morphological or stop-word filtering before text preprocessing. This means that the text preprocessing can be performed without expert or linguistic knowledge and is language-independent.
In this paper we have used the k-nearest neighbors algorithm, a Bayes classifier, a support vector machine (SVM) generated and optimized with COBRA (Co-Operation of Biology Related Algorithms), which has been proposed by Akhmedova and Semenkin (2013), the Rocchio classifier or nearest centroid algorithm (Rocchio, 1971) and a neural network as classification methods. RapidMiner and Microsoft Visual Studio C++ 2010 have been used as implementation software.
For the application of the algorithms and the comparison of the results we have used the DEFT ("Défi Fouille de Texte") Evaluation Package 2008 (Proceedings of the 4th DEFT Workshop, 2008), which has been provided by ELRA, and the publicly available corpora from DEFT'07 (Proceedings of the 3rd DEFT Workshop, 2007).
The main aim of this work is to evaluate the competitiveness of the novel term weighting (Gasanova et al., 2013) in comparison with state-of-the-art techniques for opinion mining and topic categorization. The criteria used in the evaluation are classification quality and computational efficiency. This paper is organized as follows: in Section 2, we describe the corpora. Section 3 presents the text preprocessing methods. In Section 4 we describe the classification algorithms used to compare the different text preprocessing techniques. Section 5 reports the experimental results. Finally, we provide concluding remarks in Section 6.

Corpora Description
The focus of the DEFT 2007 campaign is sentiment analysis, also called opinion mining. We have used 3 publicly available corpora: reviews of books and movies (Books), reviews of video games (Games) and political debates about an energy project (Debates).
The topic of the DEFT 2008 edition is related to text classification by categories and genres. The data consists of two corpora (T1 and T2) containing articles of two genres: articles extracted from the French daily newspaper Le Monde and encyclopedic articles from the French-language Wikipedia. This paper reports the results obtained on both tasks of the campaign and focuses on detecting the category. All databases are divided into a training set (60% of the articles) and a test set (40%). To apply our algorithms we extracted all words which appear in the training set regardless of letter case, and we excluded dots, commas and other punctuation marks. We have not used any additional filtering such as the removal of stop words.

Binary preprocessing
We take each word of the document as a binary coordinate; the size of the feature space is then the size of our vocabulary ("bag of words").
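The binary representation can be sketched as follows; the helper names are illustrative and not part of the paper's RapidMiner/C++ implementation:

```python
def build_vocabulary(documents):
    """Collect every word seen in the training set (lowercased;
    punctuation is assumed to be stripped already, as in the paper)."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    return {w: i for i, w in enumerate(vocab)}

def binary_vector(document, vocab):
    """1 if the vocabulary word occurs in the document, 0 otherwise."""
    vec = [0] * len(vocab)
    for w in document.lower().split():
        if w in vocab:
            vec[vocab[w]] = 1
    return vec

train = ["the game is great", "the debate was long"]
vocab = build_vocabulary(train)
print(binary_vector("great game", vocab))
```

Words unseen during training are simply ignored at test time, which matches extracting the dictionary from the training set only.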

TF-IDF
TF-IDF is a well-known approach for text preprocessing based on the multiplication of the term frequency tf_ij (the ratio between the number of times the i-th word occurs in the j-th document and the document size) and the inverse document frequency idf_i. The term frequency is defined as

tf_ij = t_ij / T_j,   (1)

where t_ij is the number of times the i-th word occurs in the j-th document and T_j is the document size (the number of words in the document).
There are different ways to calculate the weight of each word. In this paper we run classification algorithms with the following variants.
1) TF-IDF 1

idf_i = log(|D| / n_i),   (2)

where |D| is the number of documents in the training set and n_i is the number of documents that contain the i-th word.
2) TF-IDF 2
The formula is given by equation (2), except that n_i is calculated as the number of times the i-th word appears in all documents from the training set.

3) TF-IDF 3

idf_i = (|D| / n_i)^α,   (3)

where n_i is calculated as in TF-IDF 1 and α is a parameter (in this paper we have tested α = 0.1, 0.5, 0.9).

4) TF-IDF 4
The formula is given by equation (3), except that n_i is calculated as in TF-IDF 2.
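As a minimal sketch, the TF-IDF 1 variant can be computed as follows, assuming tf_ij = t_ij / T_j and idf_i = log(|D| / n_i) with n_i the number of training documents containing word i:

```python
import math

def tfidf1(docs):
    """TF-IDF 1 weights for a list of tokenized documents.
    Returns one {word: weight} dict per document."""
    D = len(docs)
    # n_i: document frequency of each word
    df = {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    weights = []
    for doc in docs:
        T = len(doc)  # document size T_j
        counts = {}
        for w in doc:
            counts[w] = counts.get(w, 0) + 1
        weights.append({w: (c / T) * math.log(D / df[w])
                        for w, c in counts.items()})
    return weights

docs = [["good", "game"], ["bad", "game"]]
w = tfidf1(docs)
```

Note that a word occurring in every document (here "game") gets weight log(1) = 0, which is the usual IDF behavior.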

ConfWeight
Maximum Strength (Maxstr) is an alternative method for finding word weights, proposed by Soucy and Mineau (2005). It implicitly performs feature selection, since words that are frequent in every class receive zero weight. The main idea of the method is that a feature f has a non-zero weight in class c only if the frequency of f in documents of class c is greater than its frequency in all other classes. The ConfWeight method uses Maxstr as an analog of IDF:

ConfWeight_ij = log(tf_ij + 1) · Maxstr(i).

Numerical experiments (Soucy and Mineau, 2005) have shown that the ConfWeight method can be more effective than TF-IDF with SVM and k-NN as classification methods. The main drawback of the ConfWeight method is its computational complexity: it is more demanding than TF-IDF because it requires time-consuming statistical calculations such as evaluating the Student distribution and computing a confidence interval for each word.

Novel Term Weighting (TW)
The main idea of the method (Gasanova et al., 2013) is similar to ConfWeight, but it is not as time-consuming. The idea is that every word that appears in the article contributes some value to each class, and the class with the largest total is defined as the winner for this article.
For each term we assign a real number, the term relevance, that depends on the frequency of the term in each class. The term weight is calculated using a modified formula of fuzzy rule relevance estimation for fuzzy classifiers (Ishibuchi et al., 1999), in which the membership function has been replaced by the word frequency in the current class. The details of the procedure are the following. Let L be the number of classes; n_i is the number of articles which belong to the i-th class; N_ij is the number of occurrences of the j-th word in all articles from the i-th class; T_ij = N_ij / n_i is the relative frequency of occurrence of the j-th word in the i-th class.
Let R_j = max_i T_ij and let S_j = arg max_i T_ij be the number of the class which we assign to the j-th word. The term relevance C_j is given by

C_j = (1 / Σ_i T_ij) · (R_j − (1 / (L − 1)) · Σ_{i≠S_j} T_ij).

C_j is higher if the word occurs mostly in one class than if it appears in many classes. We use the novel TW as an analog of IDF for text preprocessing.
The learning phase consists of counting the C values for each term; it means that this algorithm uses the statistical information obtained from the training set.
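A minimal sketch of the relevance computation for a single term, assuming the formula C_j = (R_j − (1/(L−1)) Σ_{i≠S_j} T_ij) / Σ_i T_ij as reconstructed from Gasanova et al. (2013):

```python
def term_relevance(counts_per_class, docs_per_class):
    """Novel TW relevance for one term.
    counts_per_class[i] = occurrences of the term in class i (N_ij),
    docs_per_class[i]   = number of training articles in class i (n_i).
    Returns (C_j, winner class S_j)."""
    L = len(counts_per_class)
    # T_ij: relative frequency of the term in each class
    T = [N / n for N, n in zip(counts_per_class, docs_per_class)]
    S = max(range(L), key=lambda i: T[i])  # arg max_i T_ij
    R = T[S]                               # max_i T_ij
    others = sum(T) - R                    # sum over i != S_j
    C = (R - others / (L - 1)) / sum(T)
    return C, S

# Term seen 3 times in class 0, once in class 1, never in class 2
C, S = term_relevance([3, 1, 0], [4, 4, 4])
```

The learning phase then amounts to one pass over the training set to accumulate N_ij and n_i, which is why the method is much cheaper than ConfWeight's confidence-interval computations.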

Classification Methods
We have considered 11 different text preprocessing methods (4 modifications of TF-IDF, two of them with three different values of the α parameter, the binary representation, ConfWeight and the novel TW method) and compared them using different classification algorithms. The methods have been implemented using RapidMiner (Shafait, 2010) and Microsoft Visual Studio C++ 2010 (for the Rocchio classifier and SVM). The classification methods are:
- k-nearest neighbors algorithm with distance weighting (we have varied k from 1 to 15);
- kernel Bayes classifier with Laplace correction;
- neural network with error back propagation (standard settings in RapidMiner);
- Rocchio classifier with different metrics and values of the γ parameter;
- support vector machine (SVM) generated and optimized with Co-Operation of Biology Related Algorithms (COBRA).
The Rocchio classifier (Rocchio, 1971) is a well-known classifier based on the search for the nearest centroid. For each category c we calculate a weighted centroid

μ_c = (1 / |D_c|) · Σ_{d ∈ D_c} d − γ · (1 / k) · Σ_{i=1..k} d_i,

where D_c is the set of documents which belong to class c; d_1, …, d_k are the k documents which do not belong to class c and which are closest to the centroid; γ is a parameter corresponding to the relative importance of negative precedents. A given document is assigned to the class with the nearest centroid. In this work we have applied the Rocchio classifier with γ ∈ (0.1; 0.9) and with three different metrics: taxicab distance, Euclidean metric and cosine similarity.

COBRA is a new meta-heuristic algorithm proposed by Akhmedova and Semenkin (2013). It is based on the cooperation of biology-inspired algorithms such as Particle Swarm Optimization (Kennedy and Eberhart, 1995), Wolf Pack Search Algorithm (Yang, 2007), Firefly Algorithm (Yang, 2008), Cuckoo Search Algorithm (Yang and Deb, 2009) and Bat Algorithm (Yang, 2010). For generating the SVM the original COBRA is used: each individual in all populations represents a set of kernel function parameters (α, β, d). Then, for each individual, the constrained modification of COBRA is applied to find the weight vector w and the shift factor b. Finally, the individual that shows the best classification rate is chosen as the designed classifier.
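A simplified sketch of nearest-centroid classification follows; it omits the negative-precedent term (i.e. it assumes γ = 0) and uses the Euclidean metric, one of the three metrics tried in the paper. Function names are illustrative:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length document vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rocchio_predict(doc, centroids):
    """Assign doc to the class whose centroid is nearest (Euclidean)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda c: dist(doc, centroids[c]))

cents = {"pos": centroid([[1, 0], [1, 1]]),
         "neg": centroid([[0, 1], [0, 0]])}
label = rocchio_predict([0.9, 0.2], cents)
```

With γ > 0 one would additionally subtract the scaled mean of the k nearest negative documents from each centroid before prediction.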

Experimental Results
The DEFT ("Défi Fouille de Texte") Evaluation Package 2008 and the publicly available corpora from DEFT'07 (Books, Games and Debates) have been used to apply the algorithms and compare the results. In order to compare the obtained results with those of the campaign participants we use the same measures of classification quality: precision, recall and F-score.
Precision for class i is calculated as the number of correctly classified articles of class i divided by the number of all articles which the algorithm assigned to this class. Recall is the number of correctly classified articles of class i divided by the number of articles that should have been in this class. Overall precision and recall are calculated as the arithmetic mean of the precisions and recalls over all classes (macro-averaging). The F-score is calculated as the harmonic mean of precision and recall.
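The macro-averaged measures can be sketched as follows (illustrative code, not the official DEFT evaluation script):

```python
def macro_f_score(y_true, y_pred):
    """Macro-averaged F-score: per-class precision and recall,
    arithmetic mean over classes, then harmonic mean of the two."""
    classes = sorted(set(y_true))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        actual = sum(1 for t in y_true if t == c)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    P = sum(precisions) / len(classes)
    R = sum(recalls) / len(classes)
    return 2 * P * R / (P + R) if P + R else 0.0

f = macro_f_score(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Macro-averaging gives every class equal weight regardless of its size, which matters for the unbalanced DEFT categories.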
Tables 3-7 present the F-scores obtained on the test corpora. The best values for each problem are shown in bold. Results of all classification algorithms are presented with their best parameters. For each corpus we also present only the best TF-IDF modification. We can see from Tables 3-7 that the best F-scores have been obtained with either ConfWeight or the novel Term Weighting preprocessing. On the Games and Debates corpora the best results were achieved with ConfWeight; however, the F-scores obtained with the novel Term Weighting preprocessing are very similar (0.712 and 0.720 for Games; 0.700 and 0.714 for Debates). Almost all best results have been obtained with SVM, except on the Games database, where we achieved the highest F-score with the k-NN algorithm.
This paper focuses on text preprocessing methods which do not require language- or domain-related information; therefore, we have not tried to achieve the best possible classification quality. However, the result obtained on the Books corpus with the novel TW preprocessing and SVM (generated using COBRA) as the classification algorithm reaches an F-score of 0.619, which is higher than the best known performance of 0.603 (Proceedings of the 3rd DEFT Workshop, 2007). On the other corpora the F-scores achieved are close to the best submissions of the DEFT'07 and DEFT'08 participants.
We have also measured the computational efficiency of each text preprocessing technique. We have run each method 20 times on the Baden-Württemberg Grid (bwGRiD) Cluster Ulm (each blade comprises two 4-core Intel Harpertown CPUs with 2.83 GHz and 16 GByte RAM). After that we calculated the average values and checked the statistical significance of the results. Figure 1 and Figure 2 compare the average computational time in minutes for the different preprocessing methods applied to the DEFT'07 and DEFT'08 corpora. The average value over all TF-IDF modifications is presented, because the time variation between the modifications is not significant.
We can see in Figure 1 and Figure 2 that TF-IDF and the novel TW require almost the same computational time. The most time-consuming method is ConfWeight (CW): it requires approximately six times more time than TF-IDF and the novel TW on the DEFT'08 corpora and about three to four times more on the DEFT'07 databases.

Conclusion
This paper reported text classification experiments on 5 different corpora for opinion mining and topic categorization, using several classification methods with different text preprocessing techniques. We have used "bag of words", TF-IDF modifications, ConfWeight and the novel term weighting approach as preprocessing techniques. The k-nearest neighbors algorithm, a Bayes classifier, the Rocchio classifier, a support vector machine trained by COBRA and a neural network have been applied as classification algorithms.
The novel term weighting method gives similar or better classification quality than the ConfWeight method but it requires the same amount of time as TF-IDF. Almost all best results have been obtained with SVM generated and optimized with Co-Operation of Biology Related Algorithms (COBRA).
We can conclude that the numerical experiments have demonstrated the computational and classification efficiency of the proposed method (the novel TW) in comparison with existing text preprocessing techniques for opinion mining and topic categorization.