Comparative Opinion Quintuple Extraction from Product Reviews

As an important task in opinion mining, comparative opinion mining aims to identify comparative sentences in product reviews, extract the comparative elements, and obtain the corresponding comparative opinion tuples. However, most previous studies simply treated comparative tuple extraction as comparative element extraction, ignoring the fact that many comparative sentences contain multiple comparisons. The comparative opinion tuples defined in these studies also fail to explicitly provide comparative preferences. To address these limitations, we first introduce a new Comparative Opinion Quintuple Extraction (COQE) task, which identifies comparative sentences in product reviews and extracts all comparative opinion quintuples (Subject, Object, Comparative Aspect, Comparative Opinion, Comparative Preference). Secondly, based on existing comparative opinion mining corpora, we make supplementary annotations and construct three datasets for the COQE task. Finally, we benchmark the COQE task by proposing a new BERT-based multi-stage approach as well as three baseline systems extended from previous methods. Experimental results show that the new approach significantly outperforms the three baseline systems on all three datasets.


Introduction
Fine-grained opinion mining from product reviews has received considerable attention in the last decade. As around 10% of product reviews contain at least one comparison (Kessler and Kuhn, 2013), it is crucial to extract and analyze these comparative sentences to detect public opinions towards the compared entities and aspects.
As the pioneering work in this direction, Jindal and Liu (2006b) proposed the Comparative Sentence Mining (CSM) task, which first identifies comparative sentences in reviews, and then extracts pre-defined comparative quintuples, i.e., (Subject, Object, Comparative Aspect, Relation Word, Comparison Type), from the identified comparative sentences. For example, given the sentence "G6 has a worse zoom than G7, but G6's battery was more reliable than G7", one comparative quintuple is (G6, G7, battery, more, Non-Equal Gradable). They assumed that a comparative sentence contains only one comparative relation, and treated comparative quintuple extraction as a comparative element extraction (CEE) problem.
However, in real scenarios, a large number of comparative sentences contain more than one comparative relation. For example, 17.6% of the comparative sentences in the camera domain (Kessler and Kuhn, 2014) have at least two comparative relations. In this situation, simply applying CEE cannot extract comparative quintuples effectively. Moreover, in their definition, the relation word sometimes fails to explicitly reflect the comparative preference. For example, "more" in the above sentence is ambiguous, since it may refer to "more reliable" or "more expensive".
Some recent studies (Panchenko et al., 2019;Ma et al., 2020) proposed a new task named Comparative Preference Classification (CPC), to identify the explicit comparative preferences (e.g., Better, Worse, None) between the subject entity and the object entity. However, the CPC task requires that the subject and object entities have been annotated, which largely hinders its applications in real scenarios.
To address the limitations of CEE and CPC, we introduce a new Comparative Opinion Quintuple Extraction (COQE) task, with the emphasis on the identification of comparative sentences and the extraction of all comparative opinion quintuples from each comparative sentence. We define the comparative opinion quintuple as (sub, obj, ca, co, cp), where sub, obj, ca, co and cp refer to Subject, Object, Comparative Aspect, Comparative Opinion and Comparative Preference, respectively. On top of Subject, Object and Comparative Aspect, which were defined in previous work, we further define Comparative Opinion as an opinion expression in the form of a continuous textual span. It is similar to the relation word defined in Jindal and Liu (2006b), but includes more necessary context, e.g., adjectives/adverbs after the relation word "more" or "less", and negations.
We also include Comparative Preference as a part of the comparative quintuple, and jointly extract the comparative elements and classify the comparative preference. As shown in Table 1, the output of COQE contains a set of two comparative opinion quintuples: {(G6, G7, battery, more reliable, Better), (G6, G7, zoom, worse, Worse)}. Secondly, we construct three datasets for the COQE task based on three existing comparative opinion mining corpora. On the basis of the camera-domain corpus proposed by Kessler and Kuhn (2014), we further annotate the comparative opinion and preference for each comparative sentence. We also add the comparative opinion annotation to the datasets from the car and electronics domains released by COAE 2012/2013 (Tan et al., 2013). In addition, we annotate all the valid comparative quintuples and provide the start and end positions of each element in the quintuples.
Finally, we benchmark the task by proposing a new multi-stage neural network approach, including the stages of 1) Joint Comparative Sentence Identification and Comparative Element Extraction, 2) Comparative Element Combination and Filtering, and 3) Comparative Preference Classification. The new approach significantly outperforms the baseline systems extended from traditional comparative opinion mining methods on all three datasets.
The contributions of this work can be summarized as follows:
• We propose a new Comparative Opinion Quintuple Extraction (COQE) task, aiming to extract all the comparative quintuples from each review sentence.
• We construct three new datasets for the task, on the basis of existing comparative opinion mining corpora.
• We benchmark the task by proposing a multi-stage neural network approach which significantly outperforms baseline systems extended from traditional methods.

Related Work
As a branch of aspect-based sentiment analysis, Comparative Sentence Mining was first proposed by Jindal and Liu (2006b), to first identify comparative sentences (CSI) from reviews and then extract pre-defined comparative quintuples, i.e., (Subject, Object, Comparative Aspect, Relation Word, Comparison Type), from the identified comparative sentences. They assumed that a comparative sentence contains only one comparative relation, and regarded comparative quintuple extraction as a comparative element extraction (CEE) problem. This ignored the fact that a large percentage of comparative sentences contain more than one comparison. For the CSI task, several studies (Ganapathibhotla and Liu, 2008; Huang et al., 2008; Park and Blake, 2012) designed keyword-based or syntax-based rules to identify comparative sentences in product reviews and scientific articles. Others (Jindal and Liu, 2006a,b; Huang et al., 2008) employed a Class Sequential Rule (CSR) method to mine sequence rules and used them as features of statistical classifiers.
For the CEE task, Jindal and Liu (2006b) and He et al. (2012) employed a Label Sequential Rule (LSR) method to extract comparative elements. Several studies (Hou and Li, 2008; Song et al., 2009; Huang et al., 2010; Wang et al., 2015a) extracted comparative elements based on conditional random fields (CRF). Wang et al. (2010) and Kessler and Kuhn (2013) further employed semantic role labeling (SRL) to extract comparative elements. Arora et al. (2017) proposed an LSTM-CRF neural network to extract comparative elements.
In recent years, Panchenko et al. (2019) proposed the Comparative Preference Classification (CPC) task, to predict the preference (Better, Worse, None) between two annotated entities. Ma et al. (2020) further proposed a Graph Attention Network for this task. However, CPC requires the two compared entities to be annotated in advance, which greatly limits its application in real scenarios.
In comparison, the COQE task proposed in this work focuses on the identification of comparative sentences and the extraction of all comparative opinion quintuples from each comparative sentence, rather than comparative element extraction only. We support comparative quintuple extraction when a sentence contains multiple comparisons. Secondly, we re-define the comparative quintuple by incorporating comparative preference, and jointly perform comparative tuple extraction and comparative preference classification. Finally, most previous models for comparative opinion mining were based on rules or traditional machine learning methods. We establish a multi-stage deep learning approach for our task and significantly improve the performance on both CEE and COQE.
It is also worth noting that some recent studies on opinion tuple extraction (Liao et al., 2016; Peng et al., 2020) and quadruple extraction (Cai et al., 2021) have been proposed in traditional aspect-based sentiment analysis. Our work can be viewed as their extension from absolute opinion mining to comparative opinion mining.

Task Definition
Given a product review sentence containing n words, X = [x_1, · · · , x_n], the goal of COQE is to first identify whether it is a comparative sentence, and if so, to extract the set of comparative opinion quintuples as follows:

{(sub, obj, ca, co, cp)} (1)

where sub and obj refer to the subject and object entities being compared, ca denotes the comparative aspect (i.e., feature attribute) of the entities, co denotes the comparative opinion, which is an opinion expression indicating the comparative preference between the two entities, and cp ∈ {Worse, Equal, Better, Different} is the comparative preference denoting whether sub is worse than, equal to, better than, or different from obj.
Note that the first four elements of the comparative opinion quintuple need to be extracted from the sentence, while the fifth element needs to be classified from the pre-defined categories. Therefore, COQE is a challenging task that involves extracting four elements, classifying one element, and combining all the five elements into valid quintuples.
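To make the output structure concrete, the quintuple can be modeled as a small record type. This is an illustrative sketch only; the class and field names below are ours, not from any released code for the task.

```python
from dataclasses import dataclass
from typing import Optional

# The closed label set for the fifth element, as defined in the task.
PREFERENCES = ("Worse", "Equal", "Better", "Different")

@dataclass(frozen=True)
class ComparativeQuintuple:
    subject: Optional[str]    # sub: entity being compared
    object_: Optional[str]    # obj: entity compared against
    aspect: Optional[str]     # ca: compared feature/attribute
    opinion: str              # co: opinion span, e.g. "more reliable"
    preference: str           # cp: one of PREFERENCES

def is_valid(q: ComparativeQuintuple) -> bool:
    """The first four elements are extracted spans; the preference
    must come from the pre-defined category set."""
    return q.preference in PREFERENCES
```

This mirrors the asymmetry noted above: four elements are extracted text spans, while the fifth is classified from a closed set.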

Dataset Construction
In addition to the comparative sentence mining corpus proposed by Jindal and Liu (2006b), Kessler et al. (2010) proposed the JDPA corpus, which consists of blog posts about cameras and cars, where the camera domain contains 506 comparative sentences and the car domain contains 1100 comparative sentences. However, the corpus only annotates the comparative elements and cannot capture the comparative relation.
Kessler and Kuhn (2014) proposed a camera-domain corpus containing 1707 comparative sentences, which explicitly annotates the comparative quintuple and supports the case where a sentence contains multiple comparative relations. They defined the quintuple as (Subject, Object, Aspect, Scale, Predicate), where Predicate is the syntactic marker that introduces a comparison (e.g., "better", "more") and Scale, a modified adjective/adverb, is added when the predicate does not by itself fully describe the comparison (e.g., "reliable" after "more"). The joint annotation of Scale and Predicate addresses the shortcoming of the Relation Word in Jindal and Liu (2006b), but it still misses some necessary context that describes a comparative relation, e.g., negation and contrast.
The Chinese Opinion Analysis Evaluation (COAE) 2012/2013 (Tan et al., 2013) provided two Chinese comparative sentence mining corpora, in the domains of Car and Electronics, respectively. They annotated the comparative relation as a pair of triples, i.e., (subject, aspect, absolute sentiment) and (object, aspect, absolute sentiment).
We construct three datasets for our COQE task, on the basis of the above corpora.
• Camera-COQE: On the basis of the camera-domain corpus released by Kessler and Kuhn (2014), we completed the annotation of Comparative Opinion and Comparative Preference for 1705 comparative sentences, and introduced 1599 non-comparative sentences.

Table 2 displays the basic statistics of the three datasets, where #Comparative, #Non-Comparative and #Multi-Comparisons denote the number of comparative sentences, non-comparative sentences and comparative sentences with multiple comparative quintuples, respectively. #Comparison Per Sent denotes the average number of comparative quintuples per sentence, and Percentage denotes the percentage of sentences with multiple comparative quintuples among all comparative sentences. As we can see, at least 20% of the comparative sentences in each dataset contain multiple comparative opinion quintuples.

Approach
As stated in the task definition, COQE is a challenging task that includes four-element extraction, one-element classification, and five-element combination. To tackle the task, we propose a multi-stage neural network framework, in which the first stage is to identify comparative sentences and extract comparative elements, the second stage is to combine and filter the extracted four comparative elements (sub, obj, ca, co) to obtain valid comparative quadruples, and the third stage is to further classify each comparative quadruple into a pre-defined preference category, obtaining all the comparative opinion quintuples.
For the sentence in Table 1, in the first stage, we identify it as a comparative sentence and get the set of four comparative elements: S sub = {G6}, S obj = {G7}, S ca = {zoom, battery} and S co = {worse, more reliable}. In the second stage, we combine the four elements extracted in the first stage with Cartesian product to form a candidate set of comparative quadruples, i.e., (G6, G7, zoom, worse), (G6, G7, zoom, more reliable), (G6, G7, battery, worse), (G6, G7, battery, more reliable). Furthermore, we train a classifier to filter invalid combinations to get two valid comparative quadruples, i.e., (G6, G7, zoom, worse), (G6, G7, battery, more reliable). Finally, in the third stage, the two comparative quadruples are classified into the corresponding comparative preference category to obtain two valid comparative quintuples as shown in Table 1.
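The candidate generation and filtering steps above can be sketched in a few lines. The Cartesian product is exactly `itertools.product`; the learned filter is replaced here by a hard-coded toy stand-in that reproduces the two valid pairings of the running example, purely for illustration (the function names and the `VALID` set are ours).

```python
from itertools import product

def candidate_quadruples(subs, objs, aspects, opinions):
    """Stage 2 candidate generation: the Cartesian product of the
    four element sets extracted in Stage 1."""
    return list(product(subs, objs, aspects, opinions))

# Toy stand-in for the trained filter classifier: keep only the
# aspect/opinion pairings that actually co-occur in the example.
VALID = {("zoom", "worse"), ("battery", "more reliable")}

def toy_filter(quad):
    _, _, ca, co = quad
    return (ca, co) in VALID

cands = candidate_quadruples(["G6"], ["G7"],
                             ["zoom", "battery"],
                             ["worse", "more reliable"])
valid = [q for q in cands if toy_filter(q)]
```

On the example sentence this yields four candidates, of which two survive filtering, matching the walk-through above.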

Stage 1: Joint Comparative Sentence Identification and Comparative Elements Extraction
In the first stage, we propose a multi-task learning framework based on BERT to identify comparative sentences and extract comparative elements simultaneously. Specifically, given an input sentence X = [x_1, · · · , x_n], we first insert two special tokens (i.e., [CLS] and [SEP]) at the beginning and the end respectively, and then feed the transformed sentence to BERT to obtain the hidden representations in the last layer: h = [h_[CLS], h_1, · · · , h_n, h_[SEP]] = BERT(X).

Comparative Sentence Identification. First, we feed h_[CLS] to a softmax layer to predict whether the input sentence X is a comparative sentence: y^c = softmax(W_c h_[CLS] + b_c), where W_c and b_c are the weight matrix and bias to learn, and y^c ∈ {0, 1}.

Comparative Element Extraction. For the identified comparative sentences, we further adopt four separate linear transformations and CRF layers to extract the four elements sub, obj, ca and co, respectively: y^e = CRF(h^e), where h^e = W_e h + b_e, the Begin-Middle-End-Single-Outside (BMESO) tagging schema is adopted for sequence labeling, and e refers to sub, obj, ca and co, respectively. It should be noted that we adopt a separate output layer for each element in order to handle overlapping and nested entities.

During training, Comparative Sentence Identification and Comparative Element Extraction are optimized simultaneously in a multi-task learning framework. The final loss of the first stage is a weighted sum of L_csi and the L_cee_i: L_1 = λ_c L_csi + λ_e Σ_i L_cee_i, where L_csi is the cross-entropy loss for comparative sentence identification, and L_cee_i is the CRF loss for each individual element extraction. The two hyperparameters λ_e and λ_c are both set to 1 in our experiments.
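The BMESO tagging schema used for element extraction can be illustrated by a small encoder that converts gold spans into a tag sequence for one element type. This is a standard encoding sketch, not the paper's exact implementation; the function name is ours.

```python
def spans_to_bmeso(n_tokens, spans):
    """Encode element spans as BMESO tags over a sentence of
    n_tokens tokens. spans: list of (start, end) inclusive token
    indices. Single-token spans get 'S'; multi-token spans get
    'B' ... 'M' ... 'E'; everything else stays 'O'."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if start == end:
            tags[start] = "S"
        else:
            tags[start] = "B"
            for i in range(start + 1, end):
                tags[i] = "M"
            tags[end] = "E"
    return tags
```

Because each of the four element types gets its own output layer, four such tag sequences are produced per sentence, which is what allows overlapping or nested elements across types.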

Stage 2: Comparative Elements Combination and Filtering
In the first stage, we obtained four sets of comparative elements for each comparative sentence, i.e., S_sub, S_obj, S_ca and S_co. With the four element sets, we perform a Cartesian product over them to obtain the set of all possible comparative quadruple candidates: S_quad = {(sub_1, obj_1, ca_1, co_1), · · · , (sub_k, obj_l, ca_p, co_q)}.
For each quadruple, we obtain the representation of each element by concatenating its hidden representations for comparative sentence identification and comparative element extraction in Eqn. (2) and Eqn. (4): r_e = [avg(h_[start:end]); avg(h^e_[start:end])], where start and end denote the element's start and end indices in the sentence, and avg denotes the average pooling operation. We then concatenate the representations of the four elements as the representation of the quadruple: r = [r_sub; r_obj; r_ca; r_co].
Finally, we stack a softmax layer on top as a quadruple filter to detect the validity of a quadruple: y^quad = softmax(W_q r + b_q), where y^quad ∈ {0, 1} indicates whether the input quadruple is valid or not. During training, we employ a class-weighted cross-entropy loss to address the data imbalance between valid and invalid quadruples, weighting the two classes by λ and 1 − λ, where λ is a trade-off hyperparameter set to 0.4 in our experiments.
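A minimal sketch of a class-weighted binary cross-entropy of this form. The text does not fully specify which class λ scales, so the assignment below (λ for the valid class, 1 − λ for the invalid class) is our assumption, made explicit in the code.

```python
import math

def weighted_bce(y_true, p_valid, lam=0.4):
    """Class-weighted binary cross-entropy for the quadruple filter.
    y_true: gold labels (1 = valid quadruple, 0 = invalid).
    p_valid: predicted probability of the valid class.
    ASSUMPTION: lam weights the valid class, (1 - lam) the invalid one."""
    loss = 0.0
    for y, p in zip(y_true, p_valid):
        if y == 1:
            loss += -lam * math.log(p)
        else:
            loss += -(1 - lam) * math.log(1 - p)
    return loss / len(y_true)
```

Since the Cartesian product produces many more invalid candidates than valid ones, reweighting the two classes keeps the filter from collapsing to the majority (invalid) prediction.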

Stage 3: Comparative Preference Classification
After obtaining all the valid comparative quadruples in the second stage, we classify each quadruple into a pre-defined comparative preference category in the third stage. Specifically, we add another softmax layer over the representation of each quadruple in Eqn. (8) for comparative preference classification: y^s = softmax(W_s r + b_s), where y^s ∈ {Worse, Equal, Better, Different}. During training, the standard cross-entropy loss is used to optimize the parameters of the comparative preference classifier. Finally, we combine the comparative preference predictions with the valid quadruples predicted in the second stage to obtain the final comparative opinion quintuples.
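The final assembly step amounts to pairing each valid quadruple with its predicted preference label; a one-line sketch (the function name is ours):

```python
def assemble_quintuples(valid_quads, preferences):
    """Stage 3 output assembly: append the predicted comparative
    preference to each valid quadruple to form a quintuple.
    valid_quads: list of (sub, obj, ca, co) tuples;
    preferences: the parallel list of predicted labels."""
    return [quad + (pref,) for quad, pref in zip(valid_quads, preferences)]
```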

Experimental Settings
We evaluate the performance of the multi-stage neural network approach on the three COQE datasets. For comparison, we also develop three baseline systems extended from representative methods in previous comparative opinion mining work. We divide each dataset into a training set, a validation set and a test set, with proportions of 64%, 16% and 20%, respectively.
In Stage 1 of our BERT-based multi-stage approach Multi-Stage BERT, we adopt BERT-base for the English Camera dataset and a Chinese version of BERT (BERT-Chinese) for the Chinese Car and Ele datasets. During training for all three stages, we use the Adam optimizer and set the batch size to 16 and the dropout rate to 0.1. The learning rates for Stages 1, 2 and 3 are set to 2e-5, 5e-4 and 5e-4, respectively.

Evaluation Metrics
As Comparative Sentence Identification (CSI) and Comparative Element Extraction (CEE) are sub-tasks of our approach, we evaluate the performance on CSI, CEE and COQE respectively. For CSI, we use accuracy as the evaluation metric. For CEE, following (Marasović and Frank, 2018; Zhang et al., 2019, 2020), we calculate Precision, Recall and F1 for each element, and their micro-averaged F1. For COQE, we calculate Precision, Recall and F1 for the whole quintuple.
Precision, Recall and the F1 score are calculated as follows: P = #correct / #predict, R = #correct / #gold, and F1 = 2PR / (P + R), where #predict denotes the number of comparative elements (or quintuples for COQE) predicted by the model, #gold denotes the number of comparative elements (or quintuples) in the dataset, and #correct denotes the number of correct comparative elements (or quintuples) in the predictions. Meanwhile, we consider three matching strategies for counting correct predictions: Exact Match, Proportional Match, and Binary Match.
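These three counts combine into the metrics directly; a small helper makes the formulas above executable (with the usual zero-division guards):

```python
def prf1(n_correct, n_predict, n_gold):
    """Precision, recall and F1 from the three counts:
    P = #correct / #predict, R = #correct / #gold,
    F1 = 2PR / (P + R); each falls back to 0 when undefined."""
    p = n_correct / n_predict if n_predict else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```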
We first require that the predicted quintuple's comparative preference is identical to the gold one, and then define #correct_e, #correct_p and #correct_b for Exact Match, Proportional Match and Binary Match as follows, where g_k denotes the k-th element in the gold comparative quintuple and p_k denotes the k-th element in the predicted comparative quintuple. For Exact Match, a prediction counts as 1 if every p_k matches g_k exactly (k = 1, 2, 3, 4), and 0 otherwise.
For Proportional Match, len(·) denotes the length of a comparative element; if all p_k and g_k overlap, the prediction counts as the proportion of each gold element covered by the prediction, i.e., len(p_k ∩ g_k)/len(g_k) averaged over the four elements, and 0 otherwise. For Binary Match, the prediction counts as 1 if all p_k and g_k overlap, and 0 otherwise.
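The three matching strategies can be sketched for a single prediction–gold pair as follows, assuming each of the first four elements is represented as a tuple of token indices. The averaging of per-element overlap ratios in the proportional case is our reading of the description above, not a verbatim transcription.

```python
def span_overlap(p, g):
    """Number of token positions shared by two spans (index tuples)."""
    return len(set(p) & set(g))

def match_counts(pred, gold):
    """Return (exact, proportional, binary) correctness for one
    prediction/gold quintuple pair. Elements 0-3 are token-index
    tuples; element 4 is the preference label, which must match
    exactly under all three strategies."""
    if pred[4] != gold[4]:
        return 0, 0.0, 0
    overlaps = [span_overlap(p, g) for p, g in zip(pred[:4], gold[:4])]
    exact = int(all(p == g for p, g in zip(pred[:4], gold[:4])))
    binary = int(all(o > 0 for o in overlaps))
    # ASSUMPTION: proportional credit = mean gold-coverage over 4 elements.
    prop = (sum(o / len(g) for o, g in zip(overlaps, gold[:4])) / 4
            if binary else 0.0)
    return exact, prop, binary
```

Summing each of the three values over all predictions yields #correct_e, #correct_p and #correct_b respectively.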

Baseline Systems
In addition to Multi-Stage BERT, we also establish the following baseline systems: • CSI CSR -CEE CRF : In Stage 1, we use an SVM with CSR features (Jindal and Liu, 2006a) to identify comparative sentences and a CRF with standard lexical features to extract comparative elements. Stages 2 and 3 are the same as in Multi-Stage BERT.
• (CSI-CEE) CRF : In this approach, we employ a feature-enhanced CRF (Wang et al., 2015b) for joint comparative sentence identification and comparative element extraction, where an all-"O" labeling sequence indicates a non-comparative sentence.
• Multi-Stage LSTM : This is a variant of Multi-Stage BERT, where we replace the BERT text encoder with an LSTM.

Main Result
In Table 3, we report the performance of all four approaches on three tasks across three datasets. For CSI, we report accuracy. For CEE, we report the F 1 score for each element (Subject, Object, Comparative Aspect, Comparative Opinion) and their Micro-average (Micro). For COQE, we report the F 1 score for the quintuple. All results are reported under exact match.
It can be seen that, across different tasks and datasets, CSI CSR -CEE CRF generally yields the lowest performance, and (CSI-CEE) CRF is slightly better. But their overall performance is relatively low, especially on complex tasks such as COQE (lower than 10%). The two deep learning approaches achieve much better performance on all three tasks. The BERT-based multi-stage approach shows clear superiority over the LSTM-based one, due to its strong representation and generalization ability.
Among the three tasks, CSI is the easiest, where almost all methods obtain satisfactory accuracy. The performance of different approaches on CEE is also acceptable, but the gap between approaches widens. The COQE task is the most difficult: the traditional machine learning methods perform very poorly, and even Multi-Stage LSTM fails to achieve satisfactory results. This is reasonable, as exactly matching all five elements in a quintuple is very challenging.
By contrast, Multi-Stage BERT shows strong ability and greatly improves the performance of COQE, especially on Car-COQE and Ele-COQE, even though the task is very difficult.
In Table 4, we report the performance of Multi-Stage BERT under the three matching metrics, and that of the other approaches in Table A1 and Table A2. It can be observed that, under Proportional Match and Binary Match, the performance of all models improves significantly.

Table 6: Results of different approaches on the COQE task with or without the filter in Stage 2.

In-depth Analysis
Effects of Different Loss Strategies in Stage 1.
In Table 5, we conduct an ablation study of the multi-task learning framework in Stage 1, comparing only the CSI loss, only the CEE loss, and multi-task learning. It can be seen that the CSI performance increases when the CEE loss is added, and the CEE performance likewise increases when the CSI loss is added. This suggests that the CSI and CEE tasks are mutually indicative, and it is therefore reasonable to employ multi-task learning of CSI and CEE in Stage 1.
Effects of Comparative Quadruple Filtering in Stage 2. In Table 6, we investigate the effects of comparative quadruple filtering in Stage 2, by comparing the COQE F1 scores of different approaches with and without filtering, denoted by Filter and None respectively. The keep rate indicates the percentage of valid quadruples among all possible ones.

Effects of Different Element Representations in Stage 2. We further compare three ways of constructing the comparative element representation:
• BERT Embedding: Only the current element's BERT embedding is used in Eqn. (7): r_e = avg(h_[start:end]).
• High-layer Embedding: Only the current element's high-layer representation is used in Eqn. (7): r_e = avg(h^e_[start:end]).
• Concatenation: The concatenation of the two representations is used, as defined in Eqn. (7).
Based on the results in Table 7, we can clearly observe that the performance of using only the Element Feature as the comparative element representation is rather limited, and that concatenating the Element Feature and the BERT Embedding achieves significantly higher performance. This demonstrates that the two kinds of features complement each other. Therefore, we use their concatenation as the comparative element representation in our experiments.

Case Study
To validate the effectiveness of our task, we compare it with the CSM task proposed by Jindal and Liu (2006b) and the CPC task proposed by Panchenko et al. (2019). The outputs of the three tasks on two examples are shown in Table 8.
Comparing the outputs on the first example, we can clearly see that the comparative quintuple defined in our COQE task exactly paraphrases the input sentence. In contrast, the quintuple defined in the CSM task is not a paraphrase of the input sentence, since it is hard to judge whether "G6" or "G7" is preferred by the user. Moreover, unlike the CPC task, which requires the entity pair G6 and G7 to be provided, our task jointly performs entity pair extraction and preference classification.

Compared with the CSM task on the second example, our task is more suitable for comparative sentences with multiple comparative quintuples. Furthermore, compared with the CPC task, our task incorporates two additional preference categories, i.e., Equal and Different, which cover a wider range of comparisons.

Cross-Domain Experiments
In addition to the previous in-domain experiments, we further conducted cross-domain experiments on the two Chinese datasets, where the training and validation sets are chosen from the source domain, and the test set is from the target domain. The results are reported in Table 9. We use Source→Target to denote the different cross-domain tasks; e.g., in Ele→Car, Ele is the source domain and Car is the target domain.
It can be observed that there is a significant performance drop in the extraction of the subject, object, and aspect in comparison with the in-domain results in Table 3. This is reasonable since most entities and aspects in the source and target domains are quite different. In contrast, an interesting observation is that the comparative opinion extraction performance drops slightly in comparison with the in-domain setting, probably due to that the gap of comparative opinions in different domains is relatively small. As a whole, the quintuple extraction performance has a significant drop.
It can also be found that the drop in comparative sentence identification is very limited. This suggests that the patterns for distinguishing comparative sentences are similar across domains.

Conclusions and Future Work
In this work, we introduce a new Comparative Opinion Quintuple Extraction (COQE) task, to identify comparative sentences in reviews and extract all comparative opinion quintuples, each of which includes Subject, Object, Comparative Aspect, Comparative Opinion and Comparative Preference. We construct three datasets for the task, and benchmark the task by proposing a new multi-stage neural network approach which shows significant advantages over baseline systems extended from previous methods. In future work, we would like to consider more sophisticated approaches, for example, end-to-end deep learning models, for COQE.

A Experiment results under Proportional Match and Binary Match
In Table A1 and Table A2, we report the performance of all four approaches on three tasks across three datasets under the metrics of Proportional Match and Binary Match.