Self Question-answering: Aspect-based Sentiment Analysis by Role Flipped Machine Reading Comprehension

The pivot of unified Aspect-based Sentiment Analysis (ABSA) is to couple aspect terms with their corresponding opinion terms, which in turn makes sentiment prediction easier. In this paper, we investigate the unified ABSA task from the perspective of Machine Reading Comprehension (MRC), observing that aspect and opinion terms can serve as the query and the answer in MRC interchangeably. We propose a new paradigm named Role Flipped Machine Reading Comprehension (RF-MRC) to resolve it. At its heart, the predicted results of either Aspect Term Extraction (ATE) or Opinion Terms Extraction (OTE) are regarded as queries, and the matched opinion or aspect terms are extracted as answers. The queries and answers can be flipped for multi-hop detection. Finally, every matched aspect-opinion pair is fed to the sentiment classifier. RF-MRC solves the ABSA task without extra data annotation. Experiments on three widely used benchmarks and a challenging dataset demonstrate the superiority of the proposed framework.


Introduction
Aspect-based Sentiment Analysis (ABSA) aims at detecting opinions towards different targets (also known as aspects) instead of inferring the overall sentiment polarity of a given sentence (Liu, 2012). It generally consists of three fundamental sub-tasks, namely, aspect terms extraction (ATE), opinion terms extraction (OTE), and aspect sentiment classification (ASC). ATE and OTE extract aspect and opinion terms from sentences, respectively, and ASC predicts the sentiment polarities (i.e., positive, negative, and neutral) towards aspect terms.
Practically, the heart of ABSA is to capture the connection between aspect terms and their respective opinion terms, which makes it easier to predict the correct sentiment polarities. Such connections matter most when multiple aspects with different polarities exist. For example, we illustrate the connections within the sentence shown in Figure 1. The negative polarity of "falafel" can be derived by aggregating the relevant opinions "over cooked" and "dried", whereas the positive polarity of "chicken" is determined by its corresponding opinion word "fine". If aspect terms and their connected opinion words are mismatched, the prediction becomes difficult or even incorrect.
Hence, immense effort has been dedicated to grasping the relations between aspect terms and their potential corresponding opinion terms. Early methods focused only on the ASC task and relied on given aspect terms. Among them, a series of methods designed attention mechanisms (He et al., 2018; Tang et al., 2019) or gating mechanisms (Xue and Li, 2018) to collect aspect-related information (e.g., opinion terms) from context. Recently, Graph Neural Networks over different dependency trees (Huang and Carley, 2019; Tang et al., 2020; Hou et al., 2021b) were proposed to link aspect terms with interrelated opinion terms more directly. They can account for long-range word dependencies and refrain from attending to contextual words unrelated to aspect terms.
Despite their effectiveness, these methods are infeasible when the aspect terms are not given. As a result, some researchers proposed to incorporate all sub-tasks into a unified ABSA framework. These methods (He et al., 2019; Chen and Qian, 2020) formulated the sub-tasks of ABSA as sequence labeling tasks. Through multifarious interaction mechanisms over the sentence representations of the different sub-tasks, they brought aspect terms into contact with opinion terms. Furthermore, recent studies (Peng et al., 2020; Mao et al., 2021) proposed to extract (aspect, opinion, sentiment) triples from sentences without given aspect terms. They strive to clarify each aspect-opinion pair for sentiment prediction, but need additional triple-level labels compared to the previous unified ABSA.
In this paper, we examine unified ABSA from the perspective of Machine Reading Comprehension (MRC). The MRC framework operates on (context, query, answer) triples (Rajpurkar et al., 2016, 2018), in which a constructed natural language query is asked against the context, and the answer is extracted from the context. Observing that aspect terms and opinion terms can be naturally characterized as queries and answers, we propose a new paradigm named Role Flipped Machine Reading Comprehension (RF-MRC) to meet the heart of unified ABSA.
First, we extract the initial aspect and opinion terms from a given sentence. Then either the initial aspect terms or opinion terms are taken as a query to extract the corresponding opinion terms or aspect terms as answers. The roles of query and answer can be flipped to perform a multi-hop question-answering process. In this manner, we progressively obtain the aspect or opinion terms we need without manually designing queries. Meanwhile, aspect terms become associated with their relevant opinion terms as the multi-hop question answering proceeds. Furthermore, we propose a matching module to match all the extracted aspect and relevant opinion terms in pairs simultaneously, instead of extracting only one aspect-opinion pair at a time, considering that a complex sentence may contain multiple aspects with conflicting polarities. Experiments on three widely used benchmarks and a challenging dataset demonstrate the superiority of the proposed framework.

Aspect-based Sentiment Analysis
Existing methods for ABSA fall into separate learning and joint learning. Methods for separate learning focus on only one of the sub-tasks of ABSA. Joint learning methods strive to solve multiple sub-tasks simultaneously. Hu et al. (2019); Phan and Ogunbona (2020) used pipeline models to extract aspect terms and then predict the sentiment polarities, which are vulnerable to error accumulation. To tackle this issue, some studies proposed to solve all sub-tasks in a joint learning framework; (2020) integrated ATE and ASC in the same framework to make these two tasks benefit from each other. Some emerging methods (He et al., 2019; Chen and Qian, 2020; Peng et al., 2020; Yu et al., 2021) added OTE as an auxiliary task and connected aspects with their respective opinion terms to derive easier sentiment prediction. In addition, recent studies defined a task of (aspect, opinion, sentiment) triple extraction and resolved it in a two-stage framework (Peng et al., 2020) or a unified framework (Mao et al., 2021). Nevertheless, this task demands supplementary data with precisely marked (aspect, opinion, sentiment) triples.

Figure 2: An example to examine the unified ABSA from a perspective of MRC.

Solving NLP Tasks by MRC
Machine reading comprehension is a prevalent and flexible framework, which aims to extract answers from context according to a query. Many tasks in natural language processing can be framed as reading comprehension. McCann et al. (2018) introduced a natural language decathlon and transformed ten tasks into reading comprehension problems. He et al. (2015) used question-answering pairs to represent the predicate-argument structure in semantic role labeling annotations. Levy et al. (2017) showed that relation extraction can be reduced to answering simple reading comprehension questions. Li et al. (2020b) designed a unified machine reading comprehension framework to solve the task of nested named entity recognition. Further studies cast entity-relation extraction as a multi-turn question answering problem, or used a mention with its surrounding words as a query to extract its coreference words as answers. All the above methods have demonstrated that machine reading comprehension is an effective framework for solving natural language processing tasks.

The formulation of unified ABSA
Given a sequence of tokens X = {x_1, x_2, ..., x_n}, where n denotes the length of the sentence, Aspect Terms Extraction (ATE) aims to find all aspect terms in X and assign a label sequence Ŷ^A = {ŷ^A_1, ŷ^A_2, ..., ŷ^A_n} to it. Opinion Terms Extraction (OTE) aims to find all opinion terms in X and assign a label sequence Ŷ^O = {ŷ^O_1, ŷ^O_2, ..., ŷ^O_n}. The labels ŷ^A_i, ŷ^O_i ∈ {B, I, O} denote the beginning of, inside of, and out of aspect and opinion terms, respectively. Aspect Sentiment Classification (ASC) assigns a sentiment label sequence Ŷ^S = {ŷ^S_1, ŷ^S_2, ..., ŷ^S_n}, where ŷ^S_i ∈ {pos, neg, neu} denotes the positive, negative, and neutral sentiment polarities, respectively. Sentiment labels of tokens that are not aspects are set to "NULL".
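The label sequences defined above can be illustrated with a minimal sketch. The example sentence, spans, and the helper `make_labels` are invented here for illustration; only the BIO/sentiment tagging scheme follows the formulation.

```python
def make_labels(tokens, aspects, opinions, sentiments):
    """Build (Y^A, Y^O, Y^S) label sequences for one sentence.

    aspects/opinions: lists of (start, end) token spans (end exclusive);
    sentiments: one polarity per aspect span, aligned with `aspects`.
    """
    n = len(tokens)
    y_a = ["O"] * n
    y_o = ["O"] * n
    y_s = ["NULL"] * n  # non-aspect tokens carry no sentiment label
    for (s, e), pol in zip(aspects, sentiments):
        y_a[s] = "B"
        for i in range(s + 1, e):
            y_a[i] = "I"
        for i in range(s, e):
            y_s[i] = pol
    for s, e in opinions:
        y_o[s] = "B"
        for i in range(s + 1, e):
            y_o[i] = "I"
    return y_a, y_o, y_s

tokens = ["the", "falafel", "was", "over", "cooked"]
y_a, y_o, y_s = make_labels(tokens, aspects=[(1, 2)], opinions=[(3, 5)],
                            sentiments=["neg"])
# y_a == ["O", "B", "O", "O", "O"]
# y_o == ["O", "O", "O", "B", "I"]
# y_s == ["NULL", "neg", "NULL", "NULL", "NULL"]
```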

Examine ABSA from MRC perspective
Recall that the Machine Reading Comprehension (MRC) aims to determine the answer to a given query from context. The query encodes significant prior information and the answer can be extracted by detecting its association with the query within context. This configuration provides an elegant way to capture the connection between aspect terms and relevant opinion terms.
In light of this observation, we examine unified ABSA from the perspective of MRC. The input sentence is naturally regarded as the context. The query is then constructed from aspect terms (opinion terms), and the answer consists of the corresponding opinion terms (aspect terms) related to the query. In this manner, aspect terms come into contact with corresponding opinion terms, and vice versa, through the interaction between query and answer. We therefore believe unified ABSA can be solved within an MRC framework. For implementation, we can simply concatenate the query and context, then feed them into BERT and a feed forward neural network to get the answer, as exhibited in Figure 2.
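The query-context packing described above can be sketched as follows. This is an illustration only: `build_mrc_input` is a hypothetical helper, and in practice a BERT tokenizer would produce token ids rather than strings before the forward pass.

```python
def build_mrc_input(query_terms, context_tokens):
    """Pack query terms and the context sentence into one MRC input,
    following the standard [CLS] query [SEP] context [SEP] format."""
    return ["[CLS]"] + query_terms + ["[SEP]"] + context_tokens + ["[SEP]"]

inp = build_mrc_input(["falafel"],
                      ["the", "falafel", "was", "over", "cooked"])
# inp == ['[CLS]', 'falafel', '[SEP]',
#         'the', 'falafel', 'was', 'over', 'cooked', '[SEP]']
```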
In this paper, we propose a paradigm named Role Flipped Machine Reading Comprehension (RF-MRC) to meet the heart of unified ABSA and derive easier sentiment classification. The overall architecture is shown in Figure 3 and the algorithm is elaborated in the appendix of the supplementary materials.

Input Representations
We use BERT (Devlin et al., 2019) to obtain input representations, following prior work (Chen and Qian, 2020). For a sequence of tokens X^(0) = {x_1, x_2, ..., x_n}, we map the word sequence with the pre-trained BERT model to generate a sequence of unit vectors H^(0) = {h_1, h_2, ..., h_n}.

Initial Terms Extraction
In this section, we extract p candidate aspect terms and q opinion terms from the initial sentence with blank queries.
As shown in Figure 3, we perform the initial extraction of aspect or opinion terms with a blank query. For the word vectors H^(0) = {h_1, h_2, ..., h_n}, we first use a feed forward neural network to get the sequence labels for ATE:

(Ŷ^A)^(0) = FFNN(H^(0)),

where the superscript (0) denotes the initial question answering process with a blank query and FFNN denotes a feed forward neural network. We select the top p candidate aspect terms (X^A)^(0) = {x_{i_1}, ..., x_{i_p}} from X, where i_1, ..., i_p denote the indexes of the top p potential aspect terms in the sentence. Similarly, we get the sequence labels (Ŷ^O)^(0) for OTE and the top q candidate opinion terms (X^O)^(0) = {x_{j_1}, ..., x_{j_q}}, where j_1, ..., j_q denote the indexes of the top q potential opinion terms in the sentence.
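The top-p selection step can be sketched in a few lines. The per-token scores below are placeholders for whatever confidence the FFNN tagger would assign (an assumption for illustration); the selection logic itself is a plain top-k over token positions.

```python
def top_candidates(scores, p):
    """Return the (sorted) indexes of the p highest-scoring tokens,
    i.e., the candidate term positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:p])

# hypothetical per-token aspect scores for a 5-token sentence
scores = [0.1, 0.9, 0.2, 0.8, 0.05]
top_candidates(scores, 2)  # → [1, 3]
```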

Role Flipped Module
Based on the initial extraction results, we devise a role flipped module to grasp the connection between aspect terms and relevant opinion terms inside the sentence. The process is shown in the left part of Figure 3. First, given the sentence as the context, we take the extracted aspect terms as queries to extract the corresponding opinion words as answers. The queries are constructed from (X^A)^(0) and the context is the input sentence X^(0). In this round, the input can be formed as follows:

X^(1) = [CLS] (X^A)^(0) [SEP] X^(0) [SEP].    (3)

We feed it into BERT to get hidden vectors H^(1). Then a feed forward neural network is used to get the labels of opinion terms as answers:

(Ŷ^O)^(1) = FFNN(H^(1)),

where FFNN denotes a feed forward neural network and the superscript (1) represents the hop number. Then we flip the query and the answer for the next round of question answering. The above process can be iterated into a multi-hop question-answering process: the answers in the t-th round serve as queries in the (t+1)-th round. After T rounds of question answering, we get the final aspect terms (X^A)^(T−1) and opinion terms (X^O)^(T) based on the labels (Ŷ^A)^(T−1) and (Ŷ^O)^(T) of the last rounds. Specifically, we set the aspect terms as queries and the opinion terms as answers in the last round. Analogously, we can first take the extracted opinion terms (X^O)^(0) from the initial terms extraction as queries; the same multi-hop question-answering process then yields the final opinion terms (X^O)^(T−1) and aspect terms (X^A)^(T) after T rounds. For convenience, we call the process where aspect terms are first taken as queries "A2O", and the other "O2A".
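The control flow of the role-flipped loop can be sketched as below. The `extract` callable stands in for the BERT+FFNN answer extractor; here a toy lookup table over the "falafel" example plays that role, purely as an assumed stand-in.

```python
def role_flipped_qa(context, init_queries, extract, hops):
    """Alternate queries and answers for `hops` rounds: answers at hop t
    become queries at hop t+1, mirroring the A2O/O2A processes."""
    queries = init_queries
    history = [queries]
    for _ in range(hops):
        answers = extract(queries, context)
        history.append(answers)
        queries = answers  # flip roles for the next round
    return history

# toy extractor: fixed aspect <-> opinion associations for one sentence
PAIRS = {"falafel": ["over cooked", "dried"],
         "over cooked": ["falafel"], "dried": ["falafel"]}

def toy_extract(queries, context):
    out = []
    for q in queries:
        out.extend(t for t in PAIRS.get(q, []) if t not in out)
    return out

hist = role_flipped_qa("the falafel was over cooked and dried",
                       ["falafel"], toy_extract, hops=2)
# hist == [['falafel'], ['over cooked', 'dried'], ['falafel']]
```

With T = 2 hops, the A2O process above yields opinion terms in hop 1 and (flipped back) aspect terms in hop 2, matching the (X^A)^(T−1) / (X^O)^(T) convention.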

Matching Module
So far, we have extracted all the aspect terms and corresponding opinion terms. To exploit the captured connection between them, we propose a matching mechanism to match them in pairs and derive easier sentiment prediction. For A2O, after T rounds of cross question answering, we get a set of candidate aspect terms (X^A)^(T−1) and a set of opinion terms (X^O)^(T). We apply an attention mechanism to compute the correspondence a_{ij} between aspect term x_i and opinion term x_j over the hidden features H^(T) of the last round, which encapsulate the connections captured in the role flipped module. For each aspect term x_i, we can select the best opinion term x_j = argmax_j a_{ij}. We use the word vectors of (x_i)^(T−1) and (x_j)^(T) to compute the sentiment scores of the aspect term:

(s^A_i) = FFNN([h_i^(T−1) : h_j^(T)]),

where ":" represents concatenation. Similarly, for O2A, we make use of the opinion terms (X^O)^(T−1) and aspect terms (X^A)^(T) of the last round to compute sentiment scores s^O_i. For a candidate aspect term x_i, the corresponding sentiment score is the average:

s_i = (s^A_i + s^O_i) / 2.    (9)

Here we only calculate the sentiment scores of aspect terms; the label ŷ^S_i for any other word is set to "NULL".
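The matching step can be sketched with dot-product attention; the 2-d vectors below are toy stand-ins for the hidden features H^(T), and the pairing rule is the argmax selection described above.

```python
def match_best_opinion(aspect_vecs, opinion_vecs):
    """For each aspect vector, return the index of the opinion vector
    with the highest dot-product attention score."""
    best = []
    for a in aspect_vecs:
        scores = [sum(x * y for x, y in zip(a, o)) for o in opinion_vecs]
        best.append(max(range(len(scores)), key=scores.__getitem__))
    return best

aspects = [[1.0, 0.0], [0.0, 1.0]]   # e.g. "falafel", "chicken"
opinions = [[0.9, 0.1], [0.1, 0.9]]  # e.g. "over cooked", "fine"
match_best_opinion(aspects, opinions)  # → [0, 1]
```

In the full model, the matched pair's concatenated vectors would then be fed to the sentiment FFNN, and the A2O and O2A scores averaged.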
In this manner, we can deploy all the extracted connections inside the sentence at once, without using auxiliary labels of triples like (Peng et al., 2020;Mao et al., 2021).

Training
Referring to Figure 3, every round of question answering, including the initial extraction, produces two predicted results: aspect terms and opinion terms. Suppose in the t-th round the predicted labels are (Ŷ^A)^(t) and (Ŷ^O)^(t) for ATE and OTE, respectively. We use the cross-entropy over all tokens to compute the losses L_A^(t) and L_O^(t) of ATE and OTE in the t-th round, where N denotes the number of training instances and n_i denotes the number of tokens in the i-th instance. After T rounds of question answering, the losses of ATE and OTE are:

L_A = Σ_t λ^A_t L_A^(t),   L_O = Σ_t λ^O_t L_O^(t),

where λ^A_t and λ^O_t are the coefficients of ATE and OTE in the t-th round. In the last round, we obtain the final sentiment labels (Ŷ^S)^(T) and likewise use the cross-entropy to compute the ASC loss L_S. The overall loss is the weighted sum of the sub-tasks' losses:

L = α L_A + β L_O + γ L_S,

where α, β, γ are task coefficients.
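The loss combination can be sketched numerically. The per-round loss values below are placeholders; in practice each would be a cross-entropy computed from the model's outputs.

```python
def overall_loss(l_a, l_o, l_s, lam_a, lam_o,
                 alpha=1.0, beta=1.0, gamma=1.0):
    """Overall loss L = alpha * sum_t lam_a[t] * L_A^(t)
                      + beta  * sum_t lam_o[t] * L_O^(t)
                      + gamma * L_S."""
    loss_ate = sum(l * c for l, c in zip(l_a, lam_a))
    loss_ote = sum(l * c for l, c in zip(l_o, lam_o))
    return alpha * loss_ate + beta * loss_ote + gamma * l_s

# two rounds of ATE/OTE losses plus one ASC loss, all coefficients 1
overall_loss([0.5, 0.3], [0.6, 0.4], 0.2, [1.0, 1.0], [1.0, 1.0])
# → 2.0
```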

Datasets
We adopt three widely used datasets: Restaurant14 and Laptop14 from SemEval 2014 Task 4 (Pontiki et al., 2014), and Restaurant15 from SemEval 2015 Task 12 (Pontiki et al., 2015). Note that these three datasets originally contain aspect term labels and sentiment labels, while the labels for opinion terms were annotated by Wang et al. (2016b). We also use the challenging MAMS dataset constructed by Jiang et al. (2019), in which each sentence contains at least two aspects with different polarities, to perform a comprehensive investigation. There are no opinion labels in MAMS. The forms of all datasets are consistent with the description in 3.1 and the statistics are reported in Table 1. For Restaurant14, Laptop14, and Restaurant15, we randomly sample 20% of the training set as the validation set, while the original MAMS dataset already contains training, validation, and test sets.

Compared Methods
Pipeline Model. Following Chen and Qian (2020), we combine DECNN (Xu et al., 2018) and CMLA (Xu et al., 2018) for ATE with TNet and TCaps (Chen and Qian, 2019) for ASC to form four pipeline models. SPAN (Hu et al., 2019) performs a multi-target extractor for ATE and designs a sentiment polarity classifier for ASC.
Unified Model. MNN (Wang et al., 2018) and E2E-ABSA jointly solve ATE and ASC using a collapsed tagging schema. DOER (Luo et al., 2019a) used a dual cross-shared RNN mechanism to share information between different sub-tasks. IMN (He et al., 2019) is an interactive multi-task model for ATE and ASC, in which OTE is fused into ATE. RACL (Chen and Qian, 2020) is a joint learning framework which solves ATE, OTE, and ASC jointly and exploits four relations between the different sub-tasks. Our model only needs three annotation sequences corresponding to the three sub-tasks, while Peng et al. (2020); Mao et al. (2021) demand several (aspect, opinion, sentiment) triple labels for each sentence. For this reason, we did not include them among our compared models.

Settings
We used the pre-trained BERT large model to generate word vectors with d_h = 1024. We set the number of question-answering rounds T, the number of candidate aspect terms p, and the number of candidate opinion terms q to 2, 8, and 5, respectively. Since a word can be broken into multiple tokens by the BERT tokenizer, p and q are larger than the true numbers of aspect and opinion terms. We trained the model for 80 epochs using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-5 and batch size 8. The task coefficients {λ^A_t, λ^O_t, α, β, γ} are all set to 1. The code is implemented in PyTorch 1.9.0 and run on an Ubuntu server with an NVIDIA Tesla V100 (32GB).
Following the protocols in He et al. (2019), we use four metrics, i.e., AE-F1, OE-F1, AS-F1, and Overall-F1, representing macro F1 scores for ATE, OTE, ASC, and the overall performance on complete ABSA. For an aspect containing multiple tokens, we take the polarity of the first token as the final ASC result. For Overall-F1, a result is counted as correct only when both the ATE and ASC results are correct. The metrics of the compared methods are calculated in the same way. The model achieving the best Overall-F1 on the validation set is used for evaluation on the test set.
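The Overall-F1 matching rule can be sketched as follows. Note this sketch computes a plain per-set F1 over (aspect, sentiment) pairs for illustration; the paper reports macro F1 aggregated per the cited protocol, and the example pairs are invented.

```python
def overall_f1(pred, gold):
    """pred/gold: sets of (aspect_span, sentiment) tuples. A prediction
    counts as correct only when both the span and the sentiment match."""
    tp = len(pred & gold)
    if not pred or not gold or tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

gold = {(("falafel",), "neg"), (("chicken",), "pos")}
pred = {(("falafel",), "neg"), (("chicken",), "neu")}  # wrong sentiment
overall_f1(pred, gold)  # → 0.5
```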

Main Results
To fairly validate the proposed model, we first compare our method with all the baseline models on Restaurant14, Laptop14, and Restaurant15, which are the most widely used benchmarks for ABSA. Table 2 presents the main results.
We have several observations from Table 2. Firstly, the unified models perform better than the pipeline models, which proves the effectiveness of exploiting the connections between sub-tasks. Secondly, RACL is a strong baseline model compared with IMN and SPAN because RACL takes the relations between ATE and OTE into consideration.
Thirdly, our proposed model achieves the best or second-best performance among all the baseline models on the different sub-tasks. On the one hand, the AE-F1 and OE-F1 are higher than most baseline models. We deduce this is because the extraction results of the previous round of question answering can be corrected by the results of the current round. On the other hand, the sentiment prediction of RF-MRC is more accurate. In particular, RF-MRC achieves 1.45%, 1.91%, and 1.81% improvements over the strongest baseline on the Overall-F1 of the three datasets. The results indicate that RF-MRC exploits the relations between aspect and opinion terms at a finer-grained level, while other baseline models only consider relations between the sentence representations of the sub-tasks. More specifically, aspect terms and corresponding opinion terms are paired owing to the interaction between query and answer in the role flipped module. Consequently, the sentiment prediction based on these paired terms becomes more accurate.

Auxiliary Experiments
To demonstrate the ability of the proposed model to analyze sentiment in complex sentences, we run an auxiliary experiment on the more challenging MAMS dataset (Jiang et al., 2019). Each sentence in this dataset consists of at least two unique aspects with different polarities. Because opinion labels are not annotated in MAMS, we do not compute the loss L_O and use only three metrics, AE-F1, AS-F1, and Overall-F1, in evaluation. Three strong baseline models from the main results, namely SPAN, IMN, and RACL, are compared here. As the results in Table 3 demonstrate, our RF-MRC achieves the best performance. This suggests that RF-MRC still works on more nuanced and complex sentences. It is interesting to observe that AS-F1 improves more than AE-F1 in this comparison. We conjecture this is because our model can capture the relations between aspect terms and potential opinion terms, even though there are no opinion annotations in MAMS.

Ablation Test
To investigate the effect of the query-answer flipping process, we perform comprehensive ablation studies on the three datasets. Table 4 shows the results on the Overall-F1 measure. We remove the "A2O" and "O2A" processes, respectively, and derive two degraded variants denoted "w/o A2O" and "w/o O2A". As expected, both the "A2O" and "O2A" processes are effective for the whole task. Notably, the scores of the model without "A2O" decrease more than those of the model without "O2A" on Restaurant14 and Laptop14. This is probably because ATE is more accurate than OTE on these two datasets, as can be seen in Table 2. Conversely, the model "w/o A2O" performs better than "w/o O2A" on Restaurant15, since OTE is more accurate than ATE there (cf. Table 2).

Effect of Parameters
Next, we study the effects of different hyperparameters in our model, including the number of candidate aspect terms p and the number of question-answering rounds T, to evaluate how they contribute to the performance. We exhibit the Overall-F1 in Figure 4. Because the effect of q, the number of candidate opinion terms, is similar to that of p, we omit its display.
As Figure 4(a) shows, the model performs best on Restaurant14 and Laptop14 when p = 8. We believe the model ignores some true aspects when p decreases, while more inaccurate aspects are taken into consideration as p increases. In Figure 4(b), the model is less effective when T = 1, and the performance is best when T = 2. When T > 2, the Overall-F1 shows a decreasing trend. It is possible that too many rounds of question answering are prone to overfitting.

Case Study
Finally, we conduct a case study to illustrate the effectiveness and perform an error analysis. We select three cases from the MAMS dataset and compare our results with IMN and RACL. Table 5 reports the results.
In the first case, there are two aspects, i.e., "outdoor patio" and "ambience". Neither IMN nor RACL identifies "ambience" as an aspect term. We conjecture this is because they only consider relations between the sentence representations of sub-tasks, so the signal for the aspect term "ambience" is weakened in such a complex sentence. In addition, IMN extracts "crowds" as an extra aspect, possibly because it fails to consider the relations between aspect terms and relevant opinion terms. In contrast, our proposed model extracts all the aspect terms and predicts the corresponding sentiment polarities correctly.
The second case is a longer sentence with three aspects, expressing positive and neutral polarities. Our RF-MRC extracts all aspect and opinion terms and predicts the corresponding polarities successfully. However, IMN cannot extract "delivery"; we conjecture its ATE performance decreases on longer sentences. RACL extracts all aspect terms correctly, but the polarity of "seating" is misjudged. Because RACL exploits different semantic relations between sub-tasks, it possibly captures the inaccurate "rude" and "late" as evidence and predicts the sentiment of "seating" as negative. This case demonstrates that the proposed model is better suited to complex sentences.
We perform an error analysis on the third case. The sentence is much shorter than the former two, yet all three models predict the wrong sentiment for the aspect "dinner". We believe this is because "okay" is regarded as the opinion word for "dinner", and this word usually expresses positive polarity in the training set. Recall that our cross-entropy training loss seeks maximum likelihood on the training set, which may be the reason for the wrong prediction in this case. More interestingly, RACL and our RF-MRC, as two SOTA solutions, both extract "vegetarian options" incorrectly as an aspect. Looking closer at this sentence, we find that the limited choice of "vegetarian options" is the evidence for why the user says the "dinner" is just okay. Hence, understanding sentence structure through logical or even causal inference might shed new light on future research in this area.
Moreover, we select a sentence from the test set of Restaurant14, "the food is so good and so popular that waiting can really be a nightmare", and present a visualization of the extraction results and the matching process in Figure 5. Specifically, aspect terms are marked in red while opinion terms are marked in blue. According to Figures 5(a) and 5(b), our RF-MRC accurately extracts the aspect terms, i.e., "food" and "waiting", and the opinion terms, i.e., "good", "popular", and "nightmare". As Figure 5(c) shows, "food" has higher scores with "good" and "popular", while "waiting" is more relevant to "nightmare". Based on these observations, we infer that the proposed RF-MRC is capable of associating aspect terms with relevant opinion terms and matching them in pairs for sentiment classification.

Conclusion
In this paper, we investigate the unified ABSA from the perspective of MRC and propose a new paradigm named RF-MRC. Either extracted aspect terms or opinion terms are constructed as queries, and the related opinion terms and aspect terms are considered as answers. We further design a matching module to match all the extracted aspect terms and relevant opinion terms, and predict the sentiment polarities. Experiments on three widely used benchmarks and a challenging dataset demonstrate the superiority of the proposed framework.