CORT: A New Baseline for Comparative Opinion Classification by Dual Prompts



Introduction
Comparative opinion classification (Liu, 2012) aims to determine the relative opinion preference on a specific aspect between two or more compared targets. Sentences containing comparative opinions may not express a direct positive or negative opinion, but a comparison. In the example sentence "BMW's handling is better than that of Mercedes-Benz.", there are two targets: BMW and Mercedes-Benz. The aspect in comparison is handling, and the opinion is that target t1 (BMW) is better than target t2 (Mercedes-Benz). Note that the comparison does not imply that the opinion towards Mercedes-Benz is negative. Hence, performing typical sentiment classification on the sentence as a whole is less applicable to comparative text.
Comparative opinion plays a vital role in consumers' purchasing decisions. It is common for a consumer to identify a few candidate products and compare them on all aspects of his/her interest. Comparative sentences are also widely observed in product reviews and online forums.
The Research Problem. In comparative opinion mining, there is a predefined set of opinions O = {o1, o2, ..., on}. Given a comparative sentence S = [w1, w2, ..., wm], the task is to predict a four-tuple (t1, t2, a, o). Here, t1 and t2 are the targets to be compared, a denotes the aspect, and o ∈ O is the opinion.
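The four-tuple can be sketched as a small data structure; this is our own illustration (the class and field names are not from the paper), using the BMW example above:

```python
from dataclasses import dataclass

@dataclass
class ComparativeInstance:
    """One comparative-opinion instance: a sentence plus the tuple (t1, t2, a, o)."""
    sentence: str
    t1: str       # first compared target
    t2: str       # second compared target
    aspect: str   # aspect under comparison
    opinion: str  # one of {"better", "worse", "same"}

ex = ComparativeInstance(
    sentence="BMW's handling is better than that of Mercedes-Benz.",
    t1="BMW", t2="Mercedes-Benz", aspect="handling", opinion="better",
)
```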
This task can be decomposed into two subtasks: (i) comparative element extraction, to extract the compared targets t1, t2 and the aspect a; and (ii) comparative opinion classification, to predict the opinion o under the assumption that the targets t1, t2 and the aspect a are given. In this paper, we focus on the second subtask; that is, we assume that targets and aspects are pre-extracted. Generally, the opinion set includes better, worse, same, and incomparable. However, incomparable instances can be filtered out in the element extraction stage, so we only consider the remaining three opinions.
Existing studies on comparative opinion are mainly rule-based or machine-learning methods. In general, comparative element extraction is performed first, to identify comparative sentences and to extract the compared targets and aspects. Jindal and Liu (2006a) first identify comparative sentences from reviews. Hu and Liu (2006); Ding et al. (2009); Xu et al. (2009) extract comparative elements from the identified comparative texts. Given a comparative text and its comparative elements, Ganapathibhotla and Liu (2008) design six rules, based on context and pre-defined pros and cons in reviews, to classify the comparative opinion. Panchenko et al. (2019) evaluate a few feature-based classifiers for comparative opinion classification. Although deep learning has significantly advanced sentiment analysis in recent years, to the best of our knowledge, no dedicated deep learning models have been proposed for comparative opinion classification.
A major challenge in comparative opinion classification is that the opinion o depends on the order of the targets t1, t2 (i.e., "t1 is better than t2" means that "t2 is worse than t1"), except when o is same. For this reason, sentiment analysis models that directly predict positive or negative cannot handle comparative opinion classification well. To overcome this problem, we design a novel twin framework to detect comparative opinion. In this framework, a primary channel and a mirror channel capture the original (i.e., for the order t1, t2) and the reversed (for the order t2, t1) comparative opinions, respectively. Both channels are realized by prompt-based learning.
Specifically, our proposed CORT (Comparative Opinion Representations from Twin network) model contains two opinion channels (i.e., a primary channel and a mirror channel) and a comparative module. Each channel includes three cells: prompter, encoder, and classifier. Given an input in the form of (text, target t1, target t2, aspect), the prompter generates a template like "[target t1] is [MASK] than [target t2] in [aspect]". The encoder then encodes the original input text and the template to obtain a global representation (i.e., the encoding at the [CLS] position) and an opinion representation (i.e., the encoding at the [MASK] position). Lastly, the comparative opinion is predicted by a classifier. The mirror channel shares the same configuration as the primary channel; the only difference is that the two targets are swapped in the generated template.
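The prompter step amounts to simple string templating; a minimal sketch of the two channels' templates (function names are ours, not from the paper):

```python
def make_prompt(t1: str, t2: str, aspect: str) -> str:
    # Template used by the prompter cell: "[t1] is [MASK] than [t2] in [aspect]".
    return f"{t1} is [MASK] than {t2} in {aspect}"

def twin_prompts(t1: str, t2: str, aspect: str):
    # Primary channel keeps the original target order; mirror channel swaps it.
    return make_prompt(t1, t2, aspect), make_prompt(t2, t1, aspect)

primary, mirror = twin_prompts("my D200", "D1X", "balance")
```

Here `primary` reads "my D200 is [MASK] than D1X in balance" and `mirror` reads "D1X is [MASK] than my D200 in balance", matching the example in Figure 1.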
To the best of our knowledge, this is the first attempt to design a prompt-based learning framework for comparative opinion classification. We demonstrate that CORT achieves state-of-the-art performance against all existing baselines on three public datasets, namely CameraReview, CompSent-19, and CompSent-08. More importantly, CORT is robust and insensitive to the order of the targets in comparison.

Related Work
Comparative opinion expresses opinions by comparing similar targets, which differs from directly expressing an opinion about targets and their aspects (Liu, 2012). Simply put, comparative opinion mining is the analysis of the contrast between multiple targets/objects (Ganapathibhotla and Liu, 2008). Generally, there are two main subtasks: (i) element extraction, to extract comparative sentences, targets, and aspects; and (ii) comparative opinion classification. This paper focuses on the latter. Very few studies consider both subtasks (Liu et al., 2013, 2021b). As our model is built on top of prompt-based learning, we also briefly review pre-trained language models for sentiment analysis.

Comparative Opinion Classification
The task of comparative opinion mining was formulated between 2006 and 2008 (Jindal and Liu, 2006b; Ganapathibhotla and Liu, 2008). Early approaches are mostly based on feature engineering and manually defined rules.
Rule-based Methods. Ganapathibhotla and Liu (2008) design six rules to identify which target is preferred. Later, Tkachenko and Lauw (2014) propose a generative model for comparative sentences from online reviews, to define the comparative directions of targets; again, rules are used to predict the preference between multiple targets. In general, rule-based methods are expensive to maintain and heavily domain-dependent.
Traditional Machine Learning. Feature engineering with a classifier (e.g., Logistic Regression, Random Forest, Support Vector Machine) was the mainstream approach in the past. Panchenko et al. (2019) build a corpus of comparative sentences and then evaluate multiple supervised models. In their evaluation, comparative sentences are divided into three categories: none, better, and worse. They do not take into account instances with the same opinion.

To the best of our knowledge, neither a dedicated PLM nor prompt learning has been applied to comparative opinion classification.

CORT Model
The Comparative Opinion Representations from Twin network (CORT) model has its roots in prompt learning. We first introduce the Pre-trained Language Model with Prompt (PLM w. Prompt) for comparative opinion classification. Then we detail the design of the CORT model and its optimization.

PLM w. Prompt
How to effectively use target information is vital when classifying the opinion in a comparative sentence with two targets. The straightforward approach is fine-tuning: the global representation obtained at the [CLS] position is used for prediction, as shown in Figure 1 (a). Here, the targets and the corresponding aspect are appended to the comparative sentence. As the opinion is sensitive to the order of the targets, it is more reasonable to adopt prompt learning, which injects more contextual information about the targets through prompts; see Figure 1 (b).
Preliminary: Prompt Learning. Prompt learning relies on a pre-defined set of label words V* and a template T. Given an input text x, T modifies the original text into a prompt input by adding some words, including [MASK], to the original input. Conventionally, the representation at the [MASK] location is used to predict the masked word w. Each label y in Y maps to a label word set V_y; all such sets together form the set of label words V*. Thus, prompt learning transfers a classification problem into a masked word prediction problem: p(y | x) = p([MASK] ∈ V_y | T(x)).

The architecture of the proposed PLM w. Prompt is shown in Figure 1 (b). Given a comparative sentence as input, we generate the input of the PLM with the template "[target t1] is [MASK] than [target t2] in [aspect]". Because comparative opinion is nuanced, we use the global representation to enhance the representation at the masked location. Hence, we do not use the vocabulary table to predict the opinion; instead, a softmax classifier predicts the opinion from the concatenation of the [CLS] representation and the [MASK] representation. The cross-entropy objective is used to optimize this model. Lastly, a comparative module contrasts the differences between opposite opinions, which improves the robustness of the model from the stance of contrastive representations.
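The classification head described above can be sketched in pure Python with toy vectors standing in for PLM outputs; the function and parameter names are ours, and the weights are illustrative only:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_opinion(r_cls, r_mask, W, b):
    """Softmax classifier over the concatenation r = r_cls ⊕ r_mask.

    W is a (num_opinions x dim(r)) weight matrix and b a bias vector,
    standing in for the learnable layer on top of the PLM.
    """
    r = r_cls + r_mask  # list concatenation models vector concatenation
    logits = [sum(w * x for w, x in zip(row, r)) + bk
              for row, bk in zip(W, b)]
    return softmax(logits)  # distribution over {better, worse, same}

# Toy 2-d [CLS] and [MASK] vectors with a 3-class toy weight matrix.
P = classify_opinion([0.1, 0.2], [0.3, 0.4],
                     W=[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]],
                     b=[0.0, 0.0, 0.0])
```

In a real implementation, `r_cls` and `r_mask` would be the PLM hidden states at the [CLS] and [MASK] positions, and `W`, `b` would be trained with cross-entropy.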

Twin Opinion Channels
The difference between the primary channel and the mirror channel is the order of the two targets in comparison. In the example shown in Figure 1, the template for the primary channel is "my D200 is [MASK] than D1X in balance", while the template for the mirror channel is "D1X is [MASK] than my D200 in balance". Correspondingly, the ground truth labels for the two templates are opposite in training, e.g., better and worse respectively in this example. Due to their similar structure, we only describe the primary channel.
Prompter Cell. The prompter cell in CORT has the same structure as in PLM w. Prompt. Given a sentence S, targets t1, t2, and aspect a, the prompter cell generates the input with template S_p for the primary channel: "[t1] is [MASK] than [t2] in [a]". Similarly, the generated text with template S_m for the mirror channel is: "[t2] is [MASK] than [t1] in [a]".

Encoder Cell. The generated text is encoded by a PLM. Taking the input S_p of the primary channel, we obtain r_s and r_o, denoting the representation of the entire context (i.e., the hidden representation at the [CLS] position) and the opinion representation (i.e., the hidden representation at the [MASK] position), respectively. The final representation is the concatenation of the two: r = r_s ⊕ r_o. In our experiments, we evaluate three popular PLMs, namely RoBERTa, BERT, and XLNet.
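Picking out r_s and r_o from a token-aligned encoding reduces to locating the [CLS] and [MASK] positions; a toy sketch with lists standing in for PLM hidden states (names are ours):

```python
def extract_representations(tokens, hidden_states):
    """Pick the [CLS] and [MASK] hidden states from a token-aligned encoding.

    tokens:        list of token strings, starting with "[CLS]"
    hidden_states: list of vectors, one per token (toy stand-in for PLM output)
    """
    r_s = hidden_states[tokens.index("[CLS]")]   # global context representation
    r_o = hidden_states[tokens.index("[MASK]")]  # opinion representation
    return r_s, r_o

tokens = ["[CLS]", "my", "D200", "is", "[MASK]", "than", "D1X"]
hidden = [[float(i)] for i in range(len(tokens))]  # toy 1-d "hidden states"
r_s, r_o = extract_representations(tokens, hidden)
```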
Classifier Cell. The opinion distribution P is computed by a softmax classifier on the learned representation r: P = softmax(W_p r + b_p), where W_p and b_p are learnable parameters.
The twin framework is designed to reflect the semantic symmetry of comparative opinion. For instance, "my D200 is better than the D1X in auto white balance" and "the D1X is worse than my D200 in auto white balance" mean the same, despite the change of the target order. Hence, the mirror channel is computed in the same manner: r′_s and r′_o are the global representation and the mask representation of the mirror channel, and the opinion distribution P′ is computed from the final representation r′ = r′_s ⊕ r′_o. Note that the ground truth labels of P and P′ are opposite unless the opinion is same.
During testing, CORT produces two probability distributions, P and P′. We take the maximum value across the two to assign the comparative opinion.
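Because the mirror channel sees the targets in swapped order, its prediction has to be flipped back before it can be compared with the primary channel. A sketch of this inference rule (our own reading of the max-over-channels decision; names are ours):

```python
LABELS = ["better", "worse", "same"]
FLIP = {"better": "worse", "worse": "better", "same": "same"}

def predict(P, P_mirror):
    """Pick the single most confident class across both channels.

    P and P_mirror are the primary- and mirror-channel distributions over
    LABELS; the mirror winner is flipped back to the (t1, t2) order.
    """
    best_p = max(range(len(LABELS)), key=lambda i: P[i])
    best_m = max(range(len(LABELS)), key=lambda i: P_mirror[i])
    if P[best_p] >= P_mirror[best_m]:
        return LABELS[best_p]
    return FLIP[LABELS[best_m]]
```

For example, if the primary channel says worse with 0.7 but the mirror channel says better with 0.8, the mirror channel wins and its better is flipped to worse for the original target order.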

Comparative Module
Again, due to the change of target order, when the opinion in the primary channel is better, the corresponding opinion in the mirror channel is worse, and vice versa. Hence, the comparative module aims to maximize the distance between the two opinion representations when the opinion is better or worse, and to minimize this distance when the opinion is same. As detailed next, our training objective incorporates the distance computed by the comparative module: d = cos(W_o r_o + b_o, W_o r′_o + b_o), where d is the distance computed by cosine similarity, and W_o and b_o are learnable parameters.
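The cosine similarity at the heart of the module is standard; a minimal pure-Python version (the projection by W_o, b_o is omitted here for brevity):

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

During training, the comparative module would push cosine_sim(r_o, r′_o) apart for better/worse instances and pull it together for same instances.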

Training Objective
The CORT model has two learning objectives: an opinion objective and a comparative objective. The opinion objective minimizes the cross-entropy of the opinion probability distributions of both channels.
The comparative objective maximizes the distance between the [MASK] representations if the opinion is better or worse, and minimizes the distance for same.
Opinion Objective. From the two channels, we obtain two opinion probability distributions P and P′ for an instance i. Inspired by Wang et al. (2016, 2019); Liu et al. (2022); Lin et al. (2022), the opinion objective J(θ) adopts cross-entropy losses on both channels: J(θ) = λΦ + µΦ′. Here, Φ and Φ′ are the cross-entropy losses of the two channels; y_i and y′_i are the annotated opinions of the two channels for instance i; λ and µ are hyperparameters.
Comparative Objective. We use a hinge loss for the comparative objective U(θ). Considering both objectives, the final objective L is the weighted sum of J and U: L(θ) = J(θ) + ξU(θ), where ξ is a hyperparameter.
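Putting the pieces together, a toy version of the combined loss. Note the exact form of the hinge term is an assumption on our part (the paper only states that a hinge loss is used on the cosine distance); all names and the margin value are illustrative:

```python
import math

def cross_entropy(P, y):
    """Negative log-likelihood of the gold class index y under distribution P."""
    return -math.log(P[y])

def comparative_hinge(d, same, margin=0.5):
    """Assumed hinge-style comparative term on the cosine similarity d of the
    two [MASK] representations: pull them together for `same` opinions,
    push them below the margin for `better`/`worse` opinions."""
    return (1.0 - d) if same else max(0.0, d - margin)

def total_loss(P, y, P_m, y_m, d, same, lam=1.0, mu=1.0, xi=1.0):
    """L = J + xi * U, with J = lam * Phi + mu * Phi' (cross-entropy on both
    channels) and U the comparative term."""
    J = lam * cross_entropy(P, y) + mu * cross_entropy(P_m, y_m)
    U = comparative_hinge(d, same)
    return J + xi * U
```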

Experiment
We now evaluate the proposed base model PLM w. Prompt and CORT for comparative opinion classification against the baselines.

Dataset
CameraReview. Created by Kessler and Kuhn (2014), this dataset contains comparative sentences about camera reviews in English. Each instance is annotated with labels (target t1, target t2, aspect, opinion). The set of opinions is {better, worse, same}. We select the sentences with a clear comparison direction between the two targets, and split the instances with a ratio of 7:1:2 for training, validation, and testing.
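The 7:1:2 split can be sketched as a small helper (our own utility; the paper does not specify the shuffling or seed):

```python
import random

def split_721(instances, seed=42):
    """Shuffle and split a list of instances into train/validation/test
    with a 7:1:2 ratio."""
    rng = random.Random(seed)
    items = list(instances)
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * 0.7)
    n_val = int(n * 0.1)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train, val, test = split_721(range(100))
```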

Compared Methods
For completeness, we compare our models with baselines covering rule-based methods, traditional machine learning methods, and neural models.
Rule. Ganapathibhotla and Liu (2008) develop six rules to determine which target is preferred; the same opinion is not considered. Because the code is not released and the same opinion is missing, we implement the same six rules and develop additional rules for same in our experiments. Specifically, our rules for the same opinion are built with opinion words (i.e., same, like, similar, equal) that reflect the same opinion.
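A minimal sketch of such a cue-word rule for the same opinion, using the four cue words named above (our simplified version; the real rules in the paper also involve context):

```python
SAME_CUES = {"same", "like", "similar", "equal"}

def is_same_opinion(sentence: str) -> bool:
    """Fire the `same` rule when a cue word appears in the sentence."""
    words = {w.strip(".,!?;:").lower() for w in sentence.split()}
    return bool(words & SAME_CUES)
```

Note such keyword rules are brittle: e.g., "like" also occurs in non-comparative uses ("I like this camera"), which is one reason rule-based methods are expensive to maintain.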
Traditional Machine Learning. We follow the methods in Panchenko et al. (2019) to experiment with traditional machine learning methods: Logistic Regression, XGBoost, SVM, AdaBoost, Decision Tree, SGD classifier, k-Neighbors, Random Forest, Extra Trees, and Majority Class.

CRF. Conditional Random Field (CRF) is used for comparative element extraction (Sutton et al., 2007). We build five features to adapt CRF to comparative opinion classification: word, position, entity, POS tag, and word label. Here, word label indicates whether a word is part of t1, t2, a, or the opinion words; if yes, the word label is a special tag, otherwise it is "None" (see Appendix A.2 for details about the other features).
RNN-Capsule. RNN-Capsule (Wang et al., 2018) is a powerful model for sentiment classification (e.g., positive, negative, and neutral). To adapt it to comparative opinion classification, we enrich the original input by appending the comparative elements (i.e., target t1, target t2, and aspect) to the end of the original sentence as the input to RNN-Capsule.
Multi-Stage BERT. Multi-Stage BERT (Liu et al., 2021b) extracts comparative elements and detects comparative opinion in a pipeline setting. For a fair comparison on the comparative opinion classification subtask, we use the ground truth targets and aspect as input, instead of the elements extracted by the model. Note that this model needs a text span with an opinion as input, but the datasets do not provide such annotations, so we only use the remaining elements, including target t1, target t2, and aspect.

PLM Fine-Tuning. Based on a PLM (e.g., BERT, RoBERTa, XLNet), the fine-tuning method makes predictions with an extra linear layer after the PLM (Chen et al., 2021). Similar to RNN-Capsule, we append the comparative elements (i.e., target t1, target t2, and aspect) to the end of the original text as the input to PLM Fine-Tuning; the [CLS] representation is then used to predict the opinion. For a fair comparison, we also experiment with PLM Fine-Tuning and PLM w. Prompt under data augmentation (by adding a copy of each training instance with the target order swapped and the opinion reversed, if the opinion is better or worse). Data augmentation benefits the model with additional training data and naturally avoids data imbalance. Conceptually, this setting is similar to training with dual channels.
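The augmentation step described above can be sketched directly on (sentence, t1, t2, aspect, opinion) tuples; a minimal version (our own helper name):

```python
def augment(instances):
    """For each better/worse instance, add a copy with the targets swapped
    and the opinion reversed; `same` instances are left unaugmented."""
    flip = {"better": "worse", "worse": "better"}
    out = list(instances)
    for sent, t1, t2, aspect, opinion in instances:
        if opinion in flip:
            out.append((sent, t2, t1, aspect, flip[opinion]))
    return out

data = [("The D200 beats the D1X in balance.", "D200", "D1X", "balance", "better")]
augmented = augment(data)
```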

Overall Performance Comparison
We use accuracy, macro-F1, and the detailed F1 of each opinion to compare all methods. Results on the CameraReview and CompSent-19 datasets are reported in Table 2.

Discussion. On both datasets, as expected, PLM-based models outperform all other baselines, revealing the power of PLMs. Among the PLM-based models, CORT is the winner, followed by RoBERTa w. Prompt with data augmentation. The performance gap between them clearly indicates the benefit of our proposed twin framework. As reported in Table 2, data augmentation leads to increases of 2.0 and 2.1 points in F1 on CameraReview for the two models, compared to their counterparts without data augmentation. In addition to the increase in F1, the models using data augmentation become stable on reversed data, and produce similar F1 scores (see Table 4). On the other hand, both models remain much poorer than CORT.
Compared to PLM Fine-Tuning or PLM w. Prompt (with or without data augmentation), CORT benefits from the following design choices: (1) The input to the classifier of each channel is the concatenation of the [CLS] and [MASK] representations, because both provide important information for classification. (2) The input to the comparative module considers the [MASK] representations only, not [CLS]. This is because the [CLS] representation encodes the whole input text; by design, the order of the targets (t1 and t2) differs between the two channels, so the [CLS] representations are always different and there is no need to compare them. For the [MASK] representations, the comparative module needs to distinguish the cases where the comparison is same from the cases where it is better or worse.
Traditional machine learning methods outperform the rule-based method. CRF is the best performing traditional machine learning method, with an F1 of 0.649 on CameraReview. RNN-Capsule does not deliver good performance, as it is not designed for comparative sentence classification; further, RNN-Capsule is based on pre-trained word embeddings, not a PLM. Surprisingly, Multi-Stage BERT is poorer than many traditional machine learning models. One reason is that its softmax classifier only takes the concatenation of the representations of the compared targets and the aspect as input. That is, even when a PLM is used, an effective design is essential for good performance. Methods like Extra Trees and Majority Class are very sensitive to the label distribution of the data; they tend to predict all instances as the majority label, i.e., better, resulting in very low F1 scores for worse and same.

Robustness of CORT on Reversal Data
As aforementioned, comparative opinion classification requires semantic understanding in the sense that, if a is better than b, then b is worse than a. In this set of experiments, we evaluate model robustness by swapping the targets in comparison. That is, a well-behaved model shall predict the opposite opinion for better and worse on the reversal data, and same when the original opinion is same.
Accordingly, in this set of experiments, we keep the training and validation sets unchanged, but swap the targets t1 and t2 in the test set, along with the corresponding ground truth labels. Table 3 reports model performance on CameraReview, and the performance changes against the original performance (see Table 2). In this set of experiments, RoBERTa is used as the PLM encoder. Our proposed CORT does not change in overall accuracy, F1, or detailed F1; the changes of CORT shown in Table 3 are solely caused by the swapped better and worse labels. However, big drops are observed for all other models, including RNN-Capsule, Multi-Stage BERT, RoBERTa Fine-Tuning, and RoBERTa w. Prompt. In particular, the F1 scores for the better and worse opinions decrease sharply. Thanks to the twin channel design in CORT, our model is trained to handle comparison targets in either order, and the experimental results well support the robustness of our design.

Choices of PLMs for CORT
PLMs have contributed to significant improvements in various NLP tasks. We evaluate the mainstream PLMs, including RoBERTa, BERT, and XLNet, as encoders of CORT. Table 4 reports the performance of the twin framework based on different PLMs on both the original and the reversal test sets of CameraReview.
The model based on RoBERTa performs the best, followed by XLNet and BERT. The other two models, i.e., PLM w. Prompt and Fine-Tuning, share the same trend, mainly due to the much larger training data used for pre-training RoBERTa. On the reversal test set, CORT is unaffected and delivers the same performance as on the original data for both measures. Significant performance drops happen to PLM w. Prompt and Fine-Tuning, and BERT gives slightly worse performance for these two models, compared to the other PLMs.

Cross Dataset Evaluation
To the best of our knowledge, CameraReview is the largest public dataset that comes with comparative opinion annotations. The CompSent-08 dataset has only 186 instances, which is insufficient to train a model. As the annotation scheme of CompSent-08 is similar to that of CameraReview, it is interesting to find out whether a CORT model trained on CameraReview could identify comparative opinion on CompSent-08. For completeness, we further evaluate the model trained on CameraReview on the full CompSent-19 dataset as a test set.
As reported in Table 5, all PLM-based models perform very well on CompSent-08, even though these models are trained on a different dataset, i.e., CameraReview. Interestingly, PLM w. Prompt performs better than CORT on CompSent-08. Through manual investigation, we note that a few incorrect predictions on CompSent-08 lead to big changes in the performance numbers due to its small size. When using the full CompSent-19 as a test set, the proposed CORT shows clear superiority over the alternatives.

Ablation Study
We conduct an ablation study on the CameraReview dataset. Thanks to the simple design of the twin framework, we can easily evaluate the effectiveness of the comparative module and the mirror channel. In addition, to study the effect of the [CLS] and [MASK] representations, we also conduct experiments without these representations. Table 6 reports the detailed comparison: (i) Removal of the comparative module leads to a 3.1-point decrease in F1; further, F1 decreases by 2.0 points after removal of the mirror channel. These results show that both the comparative module and the twin opinion channels are effective. (ii) Removal of the [CLS] representations (r_s and r′_s) on both channels leads to a 3.3-point decrease in F1, suggesting that the context of the entire sentence in both channels is helpful for the classification task. (iii) Removal of the [MASK] representations (r_o and r′_o) leads to a drop of 1.6 points in F1, showing that the opinion representations are important for classification, in addition to the text representations.

Conclusion
In this paper, we focus on comparative opinion classification, a specific sentiment analysis subtask. Built on top of powerful pre-trained language models, we show that comparative opinion classification can be addressed by prompt learning with promising accuracy. In our proposed CORT, we design two channels for the comparative targets arranged in either order, to help the model learn the semantics behind comparative opinion, e.g., "a is better than b" vs. "b is worse than a". Experiments show that CORT achieves state-of-the-art performance compared to various baselines on all comparative datasets. We show that CORT, based on the twin framework with different pre-trained language models, performs well on both the original and the reversal data. We also show that the model achieves good performance in the cross-dataset setting, demonstrating its effectiveness and robustness. We believe that, as a simple and effective model, CORT well serves as a new baseline for comparative opinion classification.

Limitations
There are two main limitations for comparative opinion classification: datasets and model design.
Comparative opinion statements comprise over 10% of all opinionated text (Kessler and Kuhn, 2013); hence it is important to study this common linguistic phenomenon. However, the largest dataset has only about 1k instances, which is considered small for neural models. The lack of high-quality and large datasets heavily limits the development of this area.
In this paper, we make the very first attempt to perform comparative opinion classification by dual prompts. By design, the proposed CORT only considers two targets on one aspect. However, comparative text may be expressed in a more complex way; for example, there may be multiple compared targets, on multiple compared aspects. Further, the proposed CORT does not consider the situation where one of the compared targets is a pronoun. All of these are important directions for further exploration.

Figure 1: The architectures of (a) PLM Fine-Tuning, (b) PLM w. Prompt, and (c) CORT. The prompter of CORT is the same as that of PLM w. Prompt. For each opinion channel, r_s and r_o denote the global representation ([CLS] position) and the opinion representation ([MASK] position), respectively. The classifier of each channel takes the concatenated representation r = r_s ⊕ r_o as input, and predicts the opinion distribution P.

Figure 1 (c) depicts the architecture of CORT. Based on the twin framework, CORT has two opinion channels (i.e., a primary channel and a mirror channel).
CompSent-19. Built by Panchenko et al. (2019), this dataset contains comparative sentences. The annotation format is (target t1, target t2, opinion), without aspect. The set of opinions is {better, worse}.

CompSent-08. Despite its small size, this dataset by Jindal and Liu (2006a) contains comparative sentences in various domains, ranging from digital cameras to soccer. The sentences are taken from reviews, blog posts, and forum discussions. The annotation format is similar to that of CameraReview, including target t1, target t2, aspect, and opinion. Due to its small size, we only use this dataset as a test set, to evaluate the generalization performance of our model.

Table 2: The accuracy, F1, and detailed F1 of methods on the CameraReview and CompSent-19 datasets. The baseline methods marked with * are our own implementations, and the methods marked with † are implemented based on the open code (Panchenko et al., 2019). Best results are in boldface and second best underlined.

Table 3: The accuracy, F1, and detailed F1 of neural models with swapped targets on the CameraReview dataset. The downward/upward arrows indicate the performance change compared to the setting without swapped targets in the test set.

Table 4: The accuracy and F1 of PLM-based models on the original and reversal test data of CameraReview.

Table 5: Models are trained on CameraReview, then evaluated on CompSent-08 and CompSent-19 as two test sets.

Table 6: Ablation study of CORT on CameraReview. Note that, due to differences in the train and test sets, the results in Table 5 cannot be directly compared to the numbers in Table 2. Nevertheless, the high accuracy and F1 numbers in Table 5 do suggest that CORT generalizes to similar comparative opinion classification tasks.