Casting the Same Sentiment Classification Problem

We introduce and study a variant of sentiment analysis, the "same sentiment classification problem": given a pair of texts, the task is to determine whether they have the same sentiment, disregarding the actual sentiment polarity. Among other things, our goal is to enable more topic-agnostic sentiment classification. We study the problem using the Yelp business review dataset, demonstrating how sentiment data needs to be prepared for this task, and then carry out sequence pair classification using the BERT language model. In a series of experiments, we achieve an accuracy above 83% for category subsets across topics, and 89% on average.


Introduction
At the sixth argument mining workshop, ArgMining 2019 (Stein and Wachsmuth, 2019), the same side stance classification problem was introduced by Stein et al. (2021) as a shared task for the argument mining community. Identifying the stance of an argument towards a topic is a fundamental problem in computational argumentation. The task presents a new problem variant, namely to classify whether two arguments share the same stance without the need to identify the stance itself. The underlying hypothesis is that this can be achieved in a topic-agnostic manner, since only the similarity of the two given arguments needs to be assessed. Similarly, in the authorship analysis community, the authorship verification problem (Koppel and Schler, 2004) is the task of determining for a given pair of texts whether they have been written by the same author. Here, too, instead of classifying a given text into predefined author classes, as is the case with authorship attribution, verification casts the problem as a pairwise similarity-based classification task.
In this paper, we recast sentiment analysis in the same manner: given two texts of unknown sentiment polarity, determine whether their sentiment is the same. Unlike the same side and same author classification problems, which suffer from a lack of large-scale training data, the same sentiment problem does not face this obstacle: ample resources are available for sentiment analysis, so scaling up training data is straightforward. We see three major contributions in studying this task variant: (1) Focused research on topic-agnosticity, enabling direct observations of the effect of topic and that of agnostic modeling. (2) Potentially easing generalization across domains. (3) In time, a new paradigm of approaches may emerge to complement the prevailing one. Our contributions are as follows: We demonstrate how to prepare standard sentiment data for meaningful training and evaluation, introduce an approach based on the transformer neural network architecture in which we adapt the sequence pair classification task to the same sentiment problem, and evaluate our model in various experiments. In what follows, Section 2 reviews related work, Section 3 introduces our approach and explains the dataset and its preparation, and Section 4 reports on our evaluation.

Related Work
Sentiment analysis has a wide range of applications in many languages, and a variety of methods have been developed to refine results and adapt to use cases (Feldman, 2013; Terán and Mancera, 2019). Its main task is to determine the opinion or attitude of an author, either a single person or a group, about something, be it a product, brand, or service (Tedmori and Awajan, 2019). It is important for businesses, in campaigns, and in the financial sector, among others, and as a result, much research has gone into improving accuracy using different models and forms of data representation.
In recent years, sentiment analysis has increasingly been performed using deep learning approaches (Zhang et al., 2018). Johnson and Zhang (2017) designed a deep pyramid CNN that can efficiently represent long-range associations in text, and thus more global information, for better sentiment classification. Howard and Ruder (2018) developed ULMFiT, a simple and efficient transfer learning method that achieves improvements on various NLP tasks, including sentiment classification. Another model that performs well on sentiment classification is BERT (Devlin et al., 2019), whose pre-trained language models can be fine-tuned without substantial effort to suit different tasks. Sun et al. (2019) showed that decreasing the learning rate layer-wise and further pre-training enhance the performance of BERT. Another approach, by Xie et al. (2019), improves the performance of BERT through data augmentation. The more recent language model XLNet (Yang et al., 2019) has been shown to achieve the best results on the sentiment classification task.
Based on the idea of the same side stance classification task by Stein et al. (2021) as well as the authorship verification problem (Koppel and Schler, 2004), our underlying hypothesis is that the more complex single-text sentiment problem can be reduced to assessing the semantic similarity of sentiment text pairs. This, in turn, can reduce the need for topic-specific sentiment vocabulary (Hammer et al., 2015; Labille et al., 2017). As there is no prior work on same sentiment classification, our work draws on well-known approaches from semantic textual similarity (STS), for which several shared tasks have been organized (Agirre et al., 2013; Xu et al., 2015; Cer et al., 2017) and a variety of datasets (Dolan and Brockett, 2005; Ganitkevitch et al., 2013) have been compiled. While earlier approaches employed syntactic, structural, and semantic similarity features to evaluate sentence similarity, single models have gained popularity in recent times. Mueller and Thyagarajan (2016) show the application of siamese recurrent networks to sentence similarity. With the introduction of contextualized word embeddings, Ranasinghe et al. (2019) evaluate their impact on STS methods, compared to traditional word embeddings, across different languages and domains.

The Same Sentiment Problem
In the following, we will introduce our model for same sentiment prediction and explain how to prepare training and test data.

Sequence Pair Classification Model
Our approach is based on the sequence pair classification task using well-known transformer language models. The classification model employs the standard pre-trained BERT architecture (Devlin et al., 2019) with an additional classification layer, consisting of dropout with probability 0.1 and a dense layer with sigmoid activation. This layer accepts a pooled vector representation based on the last hidden state of the [CLS] token, the first token of each input sequence, intended to represent the whole sequence.
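The classification head on top of the pooled [CLS] vector can be sketched in plain Python; the function and parameter names below are ours, and the actual model operates on batched tensors rather than single vectors:

```python
import math
import random

def classification_head(cls_vector, weights, bias, dropout_p=0.1, train=False):
    """Dropout (training only) followed by a dense layer with sigmoid
    activation, applied to the pooled [CLS] representation."""
    v = list(cls_vector)
    if train:
        keep = 1.0 - dropout_p
        # inverted dropout: drop each unit with probability dropout_p,
        # scale survivors by 1/keep so the expected activation is unchanged
        v = [x / keep if random.random() < keep else 0.0 for x in v]
    z = sum(w * x for w, x in zip(weights, v)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # probability that the pair shares a sentiment
```

The single sigmoid output is read as the probability that the two reviews have the same sentiment, matching the single-output binary setup described below.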
We fine-tuned the publicly available pre-trained model BERT-base-uncased using pairs of same- or different-sentiment reviews, generated as described in Section 3.2, with a training, validation, and test split of 80:10:10. We chose a maximum sequence length of 512 to include both input sequences with almost no truncation. Batch sizes were dependent on GPU memory and model sequence length: we used 32 samples per batch for a sequence length of 128, but only 6 for a length of 512. Gradient accumulation was used to compensate for the small batch sizes. We kept the Adam optimizer with a learning rate of 5e-5 and an epsilon of 1e-8. Typically, depending on the number of training samples, between 3 and 5 epochs of fine-tuning suffice to reach a plateau, with further epochs only marginally improving prediction accuracy. The best model setup, trained for 15 epochs, only added 1% of accuracy but may very well have lost its ability to generalize to unknown topics. We used a single output for binary classification with a sigmoid and binary cross-entropy loss, as this performed better than two outputs for the classes same and not same.
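The effect of gradient accumulation can be illustrated with a minimal sketch (the function name is ours; real frameworks accumulate inside the optimizer loop and handle unequal micro-batch sizes):

```python
def accumulated_step_gradient(micro_batch_grads):
    """Average the gradients of several micro-batches so that one optimizer
    step with accumulation approximates a single large-batch step
    (assuming equally sized micro-batches)."""
    n = len(micro_batch_grads)
    dim = len(micro_batch_grads[0])
    return [sum(g[i] for g in micro_batch_grads) / n for i in range(dim)]
```

For example, accumulating over 5 micro-batches of 6 samples yields an effective batch of 30, close to the batch size of 32 used at sequence length 128.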

Data Acquisition and Preparation
For our analysis, we required texts with clear stances or sentiments, with both positive and negative samples about the same topic. As we wanted to perform cross-topic comparisons, multiple topics with enough samples for standalone training or fine-tuning of a model were necessary.
These requirements were fulfilled by the sentiment datasets from the business reviews of the Yelp Dataset Challenge (Asghar, 2016) and Amazon product reviews (Ni et al., 2019). The IMDb dataset commonly used in sentiment analysis was not useful, as it contains only a single positive and a single negative review per movie and is, therefore, better suited for sentiment vocabulary analysis.
We chose to focus on the Yelp business review dataset, as it contains a variety of categories for cross evaluations and qualitatively better review texts compared to Amazon. The dataset is a snapshot with reviews no older than 14 days at its time of creation and is officially provided as several JSON files, from which we only used general business information, such as category, and the customer reviews with text and ratings. It contains 6,685,900 user reviews about 192,127 businesses in 22 main categories. Businesses are mostly assigned a single main category with related subcategories, and the categories seldom overlap. Previous examinations by Asghar (2016) show extreme variance in the number of reviews and businesses across categories. The reviews required no further textual preprocessing, as transformer models use a SentencePiece tokenizer (Kudo and Richardson, 2018) to handle arbitrary text input. It should be noted that these models can only handle certain predefined sequence lengths, so text sequences are truncated after tokenization to fit. With a sequence length of 512, we were able to sufficiently cover most review pairs, as the average number of tokens per review was about 150.
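Fitting a tokenized review pair into a fixed model length can be sketched as follows, assuming the longest-first truncation strategy commonly used for sequence pairs (the function name is ours):

```python
def truncate_pair(tokens_a, tokens_b, max_len=512):
    """Longest-first truncation so that [CLS] a [SEP] b [SEP] fits into
    max_len positions; 3 positions are reserved for the special tokens."""
    budget = max_len - 3
    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > budget:
        # trim the longer sequence first so both reviews stay represented
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    return a, b
```

With an average of about 150 tokens per review, a pair typically needs around 300 of the 509 available positions, which is why 512 covers most pairs without truncation.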
Training Data Generation: For the sequence pair classification, we matched random pairs of reviews about the same business. The star rating of 1 to 5 was translated into binary labels, good or bad, with reviews being considered good if their rating was above 3 stars. We filtered out businesses that had fewer than 5 positive and 5 negative reviews each. The remaining reviews were randomly combined per pair type, i.e., 2-4 sentiment pairs each for good-good, good-bad, bad-bad, and bad-good. This, as we will show in Section 4, sufficed to fine-tune the model, even though we thereby omitted in some cases more than 10,000 reviews for specific businesses. The pair generation resulted in a balance of positive and negative reviews as well as of same sentiment pairs (good-good, bad-bad) and different ones (good-bad, bad-good). The number of businesses varied considerably between major categories, so cross-category training data also varied in quantity.
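The pair generation described above can be sketched as follows (function and field names are ours; the actual pipeline reads the Yelp JSON files):

```python
import random
from collections import defaultdict

def make_pairs(reviews, pairs_per_type=2, min_each=5, seed=0):
    """reviews: list of (business_id, stars, text) tuples.
    Returns (text_a, text_b, same_label) pairs per qualifying business."""
    rng = random.Random(seed)
    by_business = defaultdict(lambda: {"good": [], "bad": []})
    for biz, stars, text in reviews:
        label = "good" if stars > 3 else "bad"  # above 3 stars counts as positive
        by_business[biz][label].append(text)
    pairs = []
    for biz, pols in by_business.items():
        # drop businesses lacking 5 positive and 5 negative reviews each
        if len(pols["good"]) < min_each or len(pols["bad"]) < min_each:
            continue
        for pol_a, pol_b in [("good", "good"), ("good", "bad"),
                             ("bad", "bad"), ("bad", "good")]:
            for _ in range(pairs_per_type):
                a = rng.choice(pols[pol_a])
                b = rng.choice(pols[pol_b])
                pairs.append((a, b, int(pol_a == pol_b)))
    return pairs
```

Because each of the four pair types is sampled equally often, the result is balanced both in sentiment polarity and in same-vs.-different labels.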

Evaluation
To thoroughly inspect our approach, we conducted a series of experiments to test which hyperparameters are necessary to fine-tune a model in general, how well the model is able to generalize when topics are artificially separated between training and evaluation, and how it performs for each category specifically.
Baseline As baseline models, we started with linear models, SVM and logistic regression classifiers, representing reviews as n-gram count vectors, TF-IDF word vectors, and Doc2Vec (Le and Mikolov, 2014) embeddings. Using count and TF-IDF vectors, we were only able to achieve about 50% accuracy. With Doc2Vec embeddings, our accuracy improved to about 57%. These results suggest that such approaches are not a good fit for sentiment pair similarity prediction.
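As an illustration of the count-vector pair representation, a minimal sketch with whitespace tokenization as a stand-in for the real vectorizer (function names are ours):

```python
from collections import Counter

def ngram_counts(text, n=1):
    """Whitespace-token n-gram counts (a simplified stand-in for the
    actual vectorizer used in the experiments)."""
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def pair_vector(text_a, text_b, vocab, n=1):
    """Concatenate both reviews' count vectors over a fixed vocabulary,
    yielding one feature vector for the linear classifier."""
    ca, cb = ngram_counts(text_a, n), ngram_counts(text_b, n)
    return [ca[g] for g in vocab] + [cb[g] for g in vocab]
```

Such surface-level counts carry little information about whether two reviews share a sentiment, which is consistent with the near-chance accuracy reported above.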
We then used a siamese recurrent network architecture (Neculoiu et al., 2016; Mueller and Thyagarajan, 2016) that has been successfully applied to semantic textual similarity problems. Words were represented by pre-trained 50-dimensional GloVe (Pennington et al., 2014) embeddings. We set a maximum input sequence length of 256, 50 LSTM cells in both bidirectional LSTM layers, and 50 hidden units. Training plateaued at 15 epochs with 83% accuracy. We use the same configuration in all the following experiments.
Overall performance Using BERT, we started with an initial sequence length of 128, a batch size of 32, and 5 epochs of fine-tuning, but otherwise standard parameter choices, to see how the model performs in general. The dataset consisted of 2 sentiment pairs for all 4 pair combinations for each business, with a train/dev/test split of 80:10:10. This achieved 81.3% accuracy overall. Increasing the sentiment pairs per business to 4 per type only increased accuracy to 82%, so the randomly chosen samples were enough to cover the dataset in general.

Per Major Category
Of special interest is the evaluation per category, which better shows where the model works well and where it has difficulties, assuming different categories employ varied and distinct vocabulary and even semantics. The analysis is complicated by the fact that the distribution of businesses per category is not uniform in the training data. The model was trained on the whole training dataset but evaluated with the test set split into the major categories. The prediction is therefore not fully unbiased, as examples for each topic were present in the training data. Accuracies, as reported in Table 2, span from 84% to 95% but show no clear correlation between the number of businesses or reviews and prediction accuracy.
Cross-Category A more realistic scenario consists of training on a single category and evaluating on the remaining categories, as well as category k-fold cross-validation. We chose to train models using a sequence length of 128 for Food and Arts & Entertainment. Results with and without overlapping businesses between training and validation categories did not show significant accuracy differences (less than 1%). However, we detected a difference of about 10% between the results for Arts & Entertainment and those for Food, which can be explained by Food containing about 4.5 times as many businesses. The Food model had a test accuracy of 76% on its own category but ranged from 71% to 83% on the other categories, whereas Arts & Entertainment had 62% accuracy on itself and between 63% and 72% on other categories. For the cross-validation experiment, we randomized the main categories and split them into 4 non-overlapping sets of businesses to simulate a situation where the model has to predict on completely unknown categories. We increased the number of sentiment pairs per pair type to 4, so that we had 16 sentiment pairs per business in total, since a not insignificant number of businesses with more than one main category had to be discarded. We then trained a BERT model with a sequence length of 128 for 3 epochs on each fold and evaluated (a) on the remaining folds together, (b) on each fold separately, and (c) on each main category not in the training fold (cf. Table 3). Results for (a) are as expected and slightly worse than in other tables due to the shorter sequence length. For (b), prediction accuracies span from 79.4% to 92.3%, with a difference of 6 pp. for each fold. This is possibly due to more diverse training data, which makes predictions on unknown categories more robust. Using the baseline siamese model, we achieve similar results that span from 80.7% to 90.5% accuracy.
Experiment (c) displays the highest variability, as small single categories may differ more extremely than larger ones or sets of categories. Our BERT model achieves 71.5% to 95.3% accuracy, while our baseline model again has a slightly tighter range, from 73.6% to 93.5%. The BERT model consistently performed slightly better, by 1-3 pp., in all cross-validation experiments, while only being able to use at most 64 tokens per review; it did, however, require much longer training.
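The category-disjoint fold construction used in the cross-validation experiment can be sketched as follows (a simplified sketch with our own names; businesses whose main categories straddle folds are discarded, as described above):

```python
import random

def category_folds(business_categories, k=4, seed=0):
    """business_categories: dict business_id -> set of main categories.
    Shuffle the categories into k folds and keep only businesses whose
    categories all land in the same fold, so folds share no categories."""
    rng = random.Random(seed)
    cats = sorted({c for cs in business_categories.values() for c in cs})
    rng.shuffle(cats)
    fold_of = {c: i % k for i, c in enumerate(cats)}  # round-robin assignment
    folds = [set() for _ in range(k)]
    for biz, cs in business_categories.items():
        fold_ids = {fold_of[c] for c in cs}
        if len(fold_ids) == 1:  # multi-fold businesses are discarded
            folds[fold_ids.pop()].add(biz)
    return folds
```

Discarding the straddling businesses is what motivated raising the number of sentiment pairs per remaining business in this experiment.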

Conclusion
Our contribution in this paper is the introduction of a new perspective on sentiment analysis. We showed how sequence pair classification can be used to achieve relatively good accuracy on the same sentiment pair problem. Initial results are promising, but applying same sentiment models to other domains remains to be explored. For future work, we plan to evaluate models such as DistilBERT (Sanh et al., 2019) or ALBERT (Lan et al., 2020), which have shown improved results on other sequence classification tasks compared to BERT, as well as more elaborate models. With the application to other domains, we hope to ultimately find some common features for sameness that can be exploited in various ways to support and improve existing models.

Ethics Statement
We used the Yelp dataset without any modifications to the data contained within. The dataset is a collection of opinionated texts obtained from publicly available and appropriately acknowledged sources, respecting their terms and conditions. By reusing pre-trained models via the Hugging Face transformers library, our approach might have inherited some forms of bias; we did not evaluate this potential problem. It is worth noting that our experiments show that our approach is far from being ready for use within a product.
Our goal is to advance research on this task. In terms of computational resources, we restricted ourselves to variants of pre-trained models that can be fine-tuned with relatively few resources and are accessible to the majority of researchers.