A Unified Generative Framework for Aspect-based Sentiment Analysis

Aspect-based Sentiment Analysis (ABSA) aims to identify aspect terms, their corresponding sentiment polarities, and the associated opinion terms. There exist seven subtasks in ABSA. Most studies focus on only subsets of these subtasks, which has led to a variety of complicated ABSA models and has made it hard to solve these subtasks in a unified framework. In this paper, we redefine the target of every subtask as a sequence mixing pointer indexes and sentiment class indexes, which converts all ABSA subtasks into a unified generative formulation. Based on this unified formulation, we exploit the pre-trained sequence-to-sequence model BART to solve all ABSA subtasks in an end-to-end framework. Extensive experiments on four ABSA datasets covering seven subtasks demonstrate that our framework achieves substantial performance gains and provides a truly unified end-to-end solution for the whole set of ABSA subtasks, which could benefit multiple tasks.


Introduction
Aspect-based Sentiment Analysis (ABSA) is a fine-grained Sentiment Analysis (SA) task, which aims to identify the aspect term (a), its corresponding sentiment polarity (s), and the opinion term (o). For example, in the sentence "The drinks are always well made and wine selection is fairly priced", the aspect terms are "drinks" and "wine selection", their sentiment polarities are both "positive", and the opinion terms are "well made" and "fairly priced". Based on the combinations of a, s, and o, there exist seven subtasks in ABSA, which we summarize in Figure 1. Code is available at https://github.com/yhcc/BARTABSA. Specifically, their definitions are as follows:

• Aspect Term Extraction (AE): Extracting all the aspect terms from a sentence.
• Opinion Term Extraction (OE): Extracting all the opinion terms from a sentence.
• Aspect-level Sentiment Classification (ALSC): Predicting the sentiment polarity for every given aspect term in a sentence.
• Aspect-oriented Opinion Extraction (AOE): Extracting the paired opinion terms for every given aspect term in a sentence.
• Aspect Term Extraction and Sentiment Classification (AESC): Extracting the aspect terms as well as the corresponding sentiment polarities simultaneously.
• Pair Extraction (Pair): Extracting the aspect terms as well as the corresponding opinion terms simultaneously.
• Triplet Extraction (Triplet): Extracting all aspect terms with their corresponding opinion terms and sentiment polarities simultaneously.
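To make these definitions concrete, the expected outputs of the seven subtasks for the running example can be sketched as follows (an illustrative summary using hypothetical output structures, not the paper's released code):

```python
# The running example sentence and the seven ABSA subtask outputs.
# The output structures (lists/dicts/tuples) are illustrative choices.
sentence = "The drinks are always well made and wine selection is fairly priced"

subtasks = {
    "AE":      ["drinks", "wine selection"],
    "OE":      ["well made", "fairly priced"],
    "ALSC":    {"drinks": "positive", "wine selection": "positive"},
    "AOE":     {"drinks": ["well made"], "wine selection": ["fairly priced"]},
    "AESC":    [("drinks", "positive"), ("wine selection", "positive")],
    "Pair":    [("drinks", "well made"), ("wine selection", "fairly priced")],
    "Triplet": [("drinks", "well made", "positive"),
                ("wine selection", "fairly priced", "positive")],
}
```

Note how AE, OE, ALSC, and AOE each return a single type of element, while AESC, Pair, and Triplet return compound outputs.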
Although these ABSA subtasks are strongly related, most existing work focuses on only one to three subtasks individually. The following divergences make it difficult to solve all subtasks in a unified framework:
1. Input: Some subtasks (AE, OE, AESC, Pair, and Triplet) take only the text sentence as input, while the remaining subtasks (ALSC and AOE) take the text and a given aspect term as input.
2. Output: Some subtasks (AE, OE, ALSC, AOE) only output a single type from a, s, or o, while the remaining subtasks (AESC, Pair, and Triplet) return compound output combining a, s, and o.
3. Task Type: There are two kinds of tasks: extraction tasks (extracting aspects and opinions) and a classification task (predicting sentiment).
Because of the above divergences, a myriad of previous works focus on only a subset of these subtasks. However, solving the whole set of ABSA subtasks in a unified framework remains important. Recently, several works have made attempts along this track. Some methods (Peng et al., 2020; Mao et al., 2021) apply pipeline models that output a, s, and o from separate sub-models. However, the pipeline process is not end-to-end. Another line follows the sequence tagging method by extending the tagging schema; however, the compositionality of candidate labels hinders performance. In conclusion, existing methods can hardly solve all the subtasks in a unified framework without relying on sub-models or changing the model structure to adapt to each ABSA subtask.
Motivated by the above observations, we propose a unified generative framework to address all the ABSA subtasks. We first formulate all these subtasks as a generative task, which handles the obstacles on the input, output, and task type sides and adapts to all the subtasks without any changes to the model structure. Specifically, we model the extraction and classification tasks as pointer index and class index generation, respectively. Based on the unified task formulation, we use the sequence-to-sequence pre-trained model BART (Lewis et al., 2020) as our backbone to generate the target sequence in an end-to-end process. To validate the effectiveness of our method, we conduct extensive experiments on public datasets. The comparison results demonstrate that our proposed framework outperforms most state-of-the-art (SOTA) models on every subtask.
In summary, our main contributions are as follows:

• We formulate both the extraction and classification tasks of ABSA as a unified index generation problem. Unlike previous unified models, our method does not need specifically designed decoders for different output types.
• With our re-formulation, all ABSA subtasks can be solved within a sequence-to-sequence framework, which is easy to implement and can be built on pre-trained models such as BART.
• We conduct extensive experiments on four public datasets, each containing a subset of all the ABSA subtasks. To the best of our knowledge, ours is the first work to evaluate a model on all ABSA subtasks.
• The experimental results show that our proposed framework significantly outperforms recent SOTA methods.

ABSA Subtasks
In this section, we first review the existing studies on single-output subtasks, and then turn to studies focusing on compound-output subtasks.

Single Output Subtasks
Some studies mainly focus on the single-output subtasks. The AE, OE, ALSC, and AOE subtasks only output one type from a, s, or o.
AE Most studies treat the AE subtask as a sequence tagging problem (Li and Lam, 2017; Xu et al., 2018; Li et al., 2018b). Recent works explore sequence-to-sequence learning on the AE subtask and obtain promising results, especially with pre-trained language models (Ma et al., 2019).

OE Most studies treat the OE subtask as an auxiliary task (Wang et al., 2016a, 2017; Wang and Pan, 2018; Chen and Qian, 2020; He et al., 2019). Most works can only extract unpaired aspect and opinion terms; in this case, opinion terms are independent of aspect terms.

ALSC Tay et al. (2018) incorporate the attention mechanism into LSTM-based neural network models to model the relations between aspects and their contextual words. Other model structures, such as convolutional neural networks (CNNs) (Li et al., 2018a; Xue and Li, 2018), gated neural networks (Zhang et al., 2016; Xue and Li, 2018), and memory neural networks (Tang et al., 2016b; Chen et al., 2017), have also been applied.
AOE This subtask was first introduced by Fan et al. (2019), who propose the datasets for it. Most studies apply the sequence tagging method to this subtask (Wu et al., 2020; Pouran Ben Veyseh et al., 2020).

Compound Output Subtasks
Some researchers pay more attention to the subtasks with compound output. We review them as follows:

AESC One line of work follows the pipeline method to solve this problem. Other works utilize a unified tagging schema (Mitchell et al., 2013; Zhang et al., 2015; Li et al., 2019) or multi-task learning (He et al., 2019; Chen and Qian, 2020) to avoid the error-propagation problem (Ma et al., 2018). Span-based AESC works have also been proposed recently (Hu et al., 2019), which can tackle the sentiment inconsistency problem in the unified tagging schema.
Pair Recent work proposes to extract all (a, o) pairwise relations from scratch, via a multi-task learning framework based on the span-based extraction method.
Triplet This subtask was proposed by Peng et al. (2020) and has gained increasing interest recently. One line of work designs a position-aware tagging schema and applies models based on CRF (Lafferty et al., 2001) and Semi-Markov CRF (Sarawagi and Cohen, 2004); however, the time complexity limits such models in detecting aspect terms with long-distance opinion terms. Mao et al. (2021) formulate Triplet as a two-step MRC problem, which applies the pipeline method.

Sequence-to-Sequence Models
The sequence-to-sequence framework has long been studied in the NLP field to tackle various tasks (Sutskever et al., 2014; Cho et al., 2014; Vinyals et al., 2015; Luong et al., 2015). Inspired by the success of pre-trained models (PTMs) (Qiu et al., 2020; Peters et al., 2018; Devlin et al., 2019; Brown et al., 2020), Raffel et al. (2020) and Lewis et al. (2020), among others, pre-train sequence-to-sequence models. Among them, we use BART (Lewis et al., 2020) as our backbone, while other sequence-to-sequence pre-trained models that can use the pointer mechanism (Vinyals et al., 2015), such as MASS, can also be applied in our architecture.
BART is a strong sequence-to-sequence pre-trained model for Natural Language Generation (NLG). It is a denoising autoencoder composed of several Transformer (Vaswani et al., 2017) encoder and decoder layers. It is worth noting that the BART-Base model contains a 6-layer encoder and a 6-layer decoder, which gives it a similar number of parameters to the BERT-Base model. BART is pre-trained on denoising tasks where the input sentence is noised by methods such as masking and permutation: the encoder takes the noised sentence as input, and the decoder restores the original sentence in an autoregressive manner.
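The denoising pre-training described above can be illustrated with toy noising functions (a simplified sketch, not BART's actual span-masking implementation; the function names are ours):

```python
import random

def mask_tokens(tokens, mask_rate=0.3, mask_token="<mask>", seed=0):
    # Token masking: randomly replace a fraction of tokens with <mask>.
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_rate else t for t in tokens]

def permute_sentences(text, seed=0):
    # Sentence permutation: shuffle the order of the sentences in a document.
    rng = random.Random(seed)
    sents = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sents)
    return ". ".join(sents) + "."
```

In pre-training, the encoder would consume the noised output of such functions, and the decoder would be trained to reproduce the clean input.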

Methodology
Although there are two types of tasks among the seven ABSA subtasks, they can be formulated under a generative framework. In this part, we first introduce our sequential representation for each ABSA subtask. Then we detail our method, which utilizes BART to generate these sequential representations.

Task Formulation
As depicted in Figure 1, there are two types of tasks, namely extraction and classification, whose targets can be represented as sequences of pointer indexes and class indexes, respectively. Therefore, we can formulate these two types of tasks in a unified generative framework. We use a, s, and o to represent the aspect term, sentiment polarity, and opinion term, respectively. Moreover, we use the superscripts s and e to denote the start and end index of a term; for example, o^s and a^e represent the start index of an opinion term o and the end index of an aspect term a. We use s^p to denote the index of the sentiment polarity class. The target sequence for each subtask is as follows:

• AE: Y = [a^s_1, a^e_1, a^s_2, a^e_2, ...]
• OE: Y = [o^s_1, o^e_1, o^s_2, o^e_2, ...]
• AESC: Y = [a^s_1, a^e_1, s^p_1, a^s_2, a^e_2, s^p_2, ...]
• Pair: Y = [a^s_1, a^e_1, o^s_1, o^e_1, ...]
• Triplet: Y = [a^s_1, a^e_1, o^s_1, o^e_1, s^p_1, ...]

The above subtasks rely only on the input sentence, while the ALSC and AOE subtasks also depend on a specific aspect term a. Instead of putting the given aspect term on the input side, we put it on the target side, so that the target sequences are:

• ALSC: Y = [a^s, a^e, s^p]
• AOE: Y = [a^s, a^e, o^s_1, o^e_1, o^s_2, o^e_2, ...]

where the underlined a^s, a^e are given during inference. Detailed target sequence examples for each subtask are presented in Figure 3, which uses the sentence "The wine list is interesting and has good values, but the service is dreadful".

Figure 2: Overall architecture of the framework. It shows an example generation process for the Triplet subtask, where the source is "<s> the battery life is good </s>" and the target is "2 3 5 5 8 6" (only a partial decoder sequence is shown; 6 (</s>) should be the next generated index). The "Index2Token Conversion" converts indexes to tokens: a pointer index is converted to its corresponding token in the source text, and a class index is converted to the corresponding class token. Embedding vectors in all boxes are retrieved from the same embedding matrix. We use different position embeddings in the source and target for better generation performance.
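A minimal sketch of building such target sequences from span annotations (our illustration; it assumes 0-indexed token positions and, as in the Figure 2 example, that class indexes follow the n pointer indexes — the class list order here is an assumption):

```python
SENT_CLASSES = ["POS", "NEG", "NEU"]  # assumed order of the class token list C

def triplet_target(triplets, n):
    # Triplet: Y = [a_s, a_e, o_s, o_e, s_p, ...]. A sentiment class c is
    # encoded as the class index n + SENT_CLASSES.index(c).
    y = []
    for a_s, a_e, o_s, o_e, pol in triplets:
        y += [a_s, a_e, o_s, o_e, n + SENT_CLASSES.index(pol)]
    return y

def aoe_target(aspect, opinions):
    # AOE: the given aspect span forms the target prefix (the part given
    # during inference); the opinion spans are generated after it.
    a_s, a_e = aspect
    return [a_s, a_e] + [i for span in opinions for i in span]
```

For "The drinks are always well made and wine selection is fairly priced" with n = 12 tokens, the triplet with aspect span (1, 1) and opinion span (4, 5) yields [1, 1, 4, 5, 12].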

Our Model
As discussed in the last section, all subtasks can be formulated as taking X = [x_1, ..., x_n] as input and outputting a target sequence Y = [y_1, ..., y_m], where y_0 is the start-of-sentence token. Therefore, the different ABSA subtasks can be formulated as:

P(Y | X) = ∏_{t=1}^{m} P(y_t | X, Y_{<t}).   (1)

To get the index probability distribution P_t = P(y_t | X, Y_{<t}) for each step, we use a model composed of two components: (1) an Encoder and (2) a Decoder.
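The autoregressive factorization corresponds to a generic decoding loop; a greedy sketch is shown below (the paper uses beam search; step_fn is a stand-in for the model's per-step distribution P_t):

```python
def greedy_decode(step_fn, max_len, eos_index):
    # step_fn(prefix) returns a list of probabilities over the n + l
    # candidate indexes (P_t); pick the argmax each step, stop at </s>.
    y = []
    for _ in range(max_len):
        probs = step_fn(y)
        nxt = max(range(len(probs)), key=probs.__getitem__)
        y.append(nxt)
        if nxt == eos_index:
            break
    return y
```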
Encoder The encoder encodes X into the hidden representations H^e. Since we use the BART model, the start-of-sentence (<s>) and end-of-sentence (</s>) tokens are added to the start and end of X, respectively; we ignore the <s> token in our equations for simplicity. The encoder is formulated as:

H^e = BARTEncoder([x_1, ..., x_n]),

where H^e ∈ R^{n×d} and d is the hidden dimension.
Decoder The decoder takes the encoder outputs H^e and the previous decoder outputs Y_{<t} as inputs to get P_t. However, Y_{<t} is an index sequence, so for each y_t in Y_{<t} we first need the following Index2Token module to conduct a conversion:

ŷ_t = X_{y_t},       if y_t ≤ n (y_t is a pointer index),
ŷ_t = C_{y_t − n},   otherwise (y_t is a class index),

where C = [c_1, ..., c_l] is the class token list. After that, we use the BART decoder to get the last hidden state:

h^d_t = BARTDecoder(H^e; Ŷ_{<t}),

where h^d_t ∈ R^d. With h^d_t, we predict the index probability distribution P_t as follows:

E^e = BARTTokenEmbed(X),
Ĥ^e = MLP(H^e),
H̄^e = α Ĥ^e + (1 − α) E^e,
P_t = Softmax([H̄^e h^d_t; C^d h^d_t]),

where E^e, H^e, Ĥ^e, H̄^e ∈ R^{n×d}; α is a hyper-parameter; C^d ∈ R^{l×d} is the embedding of the class tokens; and P_t ∈ R^{n+l} is the final distribution over all indexes.

During the training phase, we use teacher forcing to train our model and the negative log-likelihood to optimize it. During inference, we use beam search to obtain the target sequence Y in an autoregressive manner. After that, a decoding algorithm converts this sequence into term spans and sentiment polarities. We use the Triplet task as an example and present its decoding algorithm in Algorithm 1; the decoding algorithms for the other tasks are depicted in the Supplementary Material.

Table 1: The statistics of the four datasets, where #s, #a, #o, and #p denote the numbers of sentences, aspect terms, opinion terms, and <a, o> pairs, respectively. We use "-" to denote missing statistics for some datasets. The "Subtasks" column refers to the ABSA subtasks that can be conducted on the corresponding dataset.
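The pointer-plus-class scoring that produces P_t can be sketched with NumPy (a simplified illustration; the fused encoder representation and class embeddings below are random placeholders, and the MLP/fusion steps are omitted):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def index_distribution(h_d_t, H_bar_e, C_d):
    # Pointer scores: the decoder state dotted with each encoder-side
    # vector (n scores); class scores: dotted with each class-token
    # embedding (l scores); normalized jointly over n + l candidates.
    pointer_scores = H_bar_e @ h_d_t   # shape (n,)
    class_scores = C_d @ h_d_t         # shape (l,)
    return softmax(np.concatenate([pointer_scores, class_scores]))
```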

Datasets
We evaluate our method on four ABSA datasets. All of them originate from the SemEval challenges (Pontiki et al., 2014a,b,c), where only the aspect terms and their sentiment polarities are labeled.
The first dataset (D17) is annotated by Wang et al. (2017), where the unpaired opinion terms are labeled. The second dataset (D19) is annotated by Fan et al. (2019), where opinion terms are paired with their corresponding aspect terms. The third dataset (D20a) is annotated by Peng et al. (2020) with <a, o, s> triplets, and the fourth dataset (D20b) is a refined version of D20a in which the missing triplets with overlapping opinions are corrected. We present the statistics for these four datasets in Table 1.

Table 2: Summary of baselines. To further demonstrate that our proposed method is a real unified end-to-end ABSA framework, we present our work in the last row. "E2E" is short for End-to-End, which means the model outputs all the subtasks' results synchronously rather than requiring any preconditions (e.g., pipeline methods). The "Datasets" column refers to the datasets on which each baseline is conducted.

Baselines
To have a fair comparison, we summarize the top-performing baselines for all ABSA subtasks. Given the different ABSA subtasks, datasets, and experimental setups, existing baselines can be roughly separated into three groups, as shown in Table 2. The baselines in the first group are conducted on the D17 dataset, covering the AE, OE, ALSC, and AESC subtasks. The span-based method SPAN-BERT (Hu et al., 2019) and the sequence tagging methods IMN-BERT (He et al., 2019) and RACL-BERT (Chen and Qian, 2020) are selected. Specifically, the IMN-BERT results are reproduced by Chen and Qian (2020). All these baselines are implemented on BERT-Large.
The baselines of the second group are conducted on the D19 dataset, mainly focusing on the AOE subtask. Interestingly, we find that the sequence tagging method is the main solution for this subtask (Fan et al., 2019; Wu et al., 2020; Pouran Ben Veyseh et al., 2020).
The baselines of the third group are mainly conducted on the D20a and D20b datasets, which cover almost all the ABSA subtasks, except for certain subtasks depending on the baseline structure. For the baselines RINANTE (Dai and Song, 2019), CMLA (Wang et al., 2017), and Li-unified (Li et al., 2019), the suffix "+" in Table 2 denotes the corresponding model variant modified by Peng et al. (2020) to be capable of AESC, Pair, and Triplet.

Implementation Details
Following previous studies, we use different metrics for different subtasks and datasets. Specifically, for the single-output subtasks AE, OE, and AOE, a predicted span is considered correct only if it exactly matches the start and end boundaries of a gold span. For the ALSC subtask, we require the generated sentiment polarity of the given aspect to be the same as the ground truth. As for the compound-output subtasks AESC, Pair, and Triplet, a prediction is correct only when all the span boundaries and the generated sentiment polarity are accurately identified. We report precision (P), recall (R), and F1 scores for all experiments.
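The exact-match scoring described above can be sketched as follows (our simplified illustration; items are tuples of span boundaries plus, for compound subtasks, the sentiment class):

```python
def span_f1(pred, gold):
    # A prediction counts as a true positive only if it exactly matches a
    # gold item (all boundaries and, where present, the sentiment class).
    pred_set, gold_set = set(pred), set(gold)
    tp = len(pred_set & gold_set)
    p = tp / len(pred_set) if pred_set else 0.0
    r = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```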

Main Results
On the D17 dataset (Wang et al., 2017), we compare our method on AE, OE, ALSC, and AESC. The comparison results are shown in Table 3, where the best results are highlighted in bold. Most of our results are better than or comparable to the baselines. Moreover, these baselines yield competitive results based on BERT-Large pre-trained models, while our results are achieved with the BART-Base model, which has almost half the parameters. This shows that our framework is more suitable for these ABSA subtasks.
On the D19 dataset (Fan et al., 2019), we compare our method on AOE. The comparison results are shown in Table 4. We can observe that our method achieves significant P/R/F1 improvements on 14res, 15res, and 16res. Additionally, we notice that our F1 score on 14lap is only close to the previous SOTA result. This is probably caused by the domain difference, as 14lap consists of laptop reviews while the other datasets consist of restaurant reviews.
On the D20a dataset (Peng et al., 2020), we compare our method on AESC, Pair, and Triplet. The comparison results are shown in Table 5. We can observe that our proposed method outperforms the other baselines on all datasets. In particular, we achieve better results for Triplet, which demonstrates the effectiveness of our method in capturing the interactions among aspect terms, opinion terms, and sentiment polarities. We also observe that the span-based methods show superior performance to the sequence tagging methods. This may be caused by the higher compositionality of candidate labels in sequence tagging methods (Hu et al., 2019). As the previous SOTA method, Dual-MRC shows competitive performance by utilizing the span-based extraction method and the MRC mechanism; however, its inference process is not end-to-end.
On the D20b dataset, we compare our method on Triplet. The comparison results can be found in Table 6. Our method achieves the best results, with nearly 7 F1-point improvements on 14res, 15res, and 16res, and nearly 13, 9, 7, and 12 point improvements in recall on each dataset compared with the other baselines. This also explains the drop in the precision score. Since D20b is refined from D20a, we specifically compare the Triplet results on the corresponding datasets in D20a and D20b. Interestingly, we discover that all baselines have a much bigger performance change on 15res; we conjecture that distribution differences may be the cause. In conclusion, all the experimental results confirm that our proposed method, which unifies training and inference into an end-to-end generative framework, provides a new SOTA solution for the whole ABSA task.

Framework Analysis
To better understand our proposed framework, we conduct analysis experiments on the D20b dataset.
To validate whether our proposed framework could adapt to the generative ABSA task, we measure the invalid predictions for Triplet. Specifically, since Triplet requires a prediction format like [a^s, a^e, o^s, o^e, s^p], a valid triplet prediction must have length 5, noted as "5-len", and every end index must be no smaller than its corresponding start index, noted as an "ordered prediction". We calculate (number of non-5-len predictions) / (total predictions), referred to as the "Invalid size", and (number of non-ordered predictions) / (total 5-len predictions), referred to as the "Invalid order". The "Invalid token" means that a^s is not the start of a token but the index of an inside subword. From Table 7, we can observe that BART learns this task format easily, as shown by the low rates for all three metrics, which demonstrates that the generative framework for ABSA is not only a theoretically unified task form but also a realizable framework in practice. We remove these invalid predictions in our experiments.
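The "Invalid size" and "Invalid order" rates can be sketched as follows (a simplified illustration of the two checks; the subword-boundary "Invalid token" check is omitted, and the exact denominators are our reading of the description above):

```python
def invalid_rates(predictions):
    # Invalid size: fraction of predictions whose length is not a
    # multiple of 5. Invalid order: fraction of 5-len [a_s, a_e, o_s,
    # o_e, s_p] chunks where an end index precedes its start index.
    non5 = sum(1 for p in predictions if len(p) % 5 != 0)
    chunks = [p[i:i + 5]
              for p in predictions if len(p) % 5 == 0
              for i in range(0, len(p), 5)]
    bad = sum(1 for a_s, a_e, o_s, o_e, _ in chunks
              if a_e < a_s or o_e < o_s)
    invalid_size = non5 / len(predictions) if predictions else 0.0
    invalid_order = bad / len(chunks) if chunks else 0.0
    return invalid_size, invalid_order
```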
As shown in Table 4, we also analyze the impact of the beam size, since ours is a generation method. However, the beam size seems to have little impact on the F1 scores.

Conclusion
This paper summarizes the seven ABSA subtasks and previous studies, showing that divergences exist on the input, output, and task type sides. Previous studies have limitations in handling all these divergences in a unified framework. We propose to convert all the ABSA subtasks into a unified generative task, and implement BART to generate the target sequence in an end-to-end process based on this unified task formulation. We conduct massive experiments on public datasets for the seven ABSA subtasks and achieve significant improvements on most datasets. The experimental results demonstrate the effectiveness of our method. Our work leads to several promising directions, such as applying the sequence-to-sequence framework to other tasks, and data augmentation.

A.1 Model Parameters

• BART-Base model: 12 layers, 768 hidden dimensions, and 16 heads, with 139M parameters in total;
• BERT-Base model: 12 layers, 768 hidden dimensions, and 12 heads, with 110M parameters in total.

A.2 Decoding Algorithm for Different Datasets
In this part, we introduce the decoding algorithms we use to convert the predicted target sequence Y into the target span set L. These algorithms can be found in Algorithms 2, 3, and 4.
Algorithm 2 Decoding Algorithm for the AOE subtask. Input: the number of tokens in the input sentence n, the target sequence Y = [y_1, ..., y_m] with y_i ∈ [1, n + |C|], and a given length L_T for the different tasks.

As the different subtasks are conducted on different datasets, we conduct the following experiments on each dataset:

• On the D17 dataset, we conduct AESC and OE with a multi-task learning method. To that end, we feed the pre-defined task tags "<AESC>" and "<OE>" to the decoder first. For example, for the input "The drinks are always well made and wine selection is fairly priced" from the D17 dataset, we define the AESC target sequence as "<AESC>, 1, 1, POS, 7, 8, POS, </s>" and the OE target sequence as "<OE>, 4, 5, 10, 11, </s>".
• On the D19 dataset, we conduct AOE. As the AOE subtask requires detecting the opinion terms for aspect terms given in advance, the aspect terms need to be fed to our decoder first. For the aforementioned example sentence, we define the AOE target sequences as "1, 1, 4, 5, </s>" and "7, 8, 10, 11, </s>", where the leading aspect indexes are given.
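The AOE decoding step can be sketched in Python (our reconstruction, assuming the first two indexes restate the given aspect prefix and generation stops at an end-of-sequence index):

```python
def decode_aoe(target_seq, eos_index):
    # Skip the given aspect prefix (positions 0-1), then read the
    # remaining indexes pairwise as opinion spans until </s>.
    spans, i = [], 2
    while i + 1 < len(target_seq):
        if target_seq[i] == eos_index:
            break
        spans.append((target_seq[i], target_seq[i + 1]))
        i += 2
    return spans
```

For the target "1, 1, 4, 5, </s>" above (with </s> encoded as some index, say 99), this recovers the single opinion span (4, 5).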

Specific Subtask Metrics
• On the D17 dataset, we get the AESC and OE results directly. Following previous work, we only calculate the metrics for AESC and ALSC from the true-positive AE predictions.

• On the D19 dataset, we get the AOE results directly. The metrics for AOE are the standard precision, recall, and F1 scores.
• On the D20a and D20b datasets, we get the Triplet results directly. We preserve the <AT, OT> pairs for the Pair metric and the <AT, SP> pairs for the AESC metric. The metrics for them are the standard precision, recall, and F1 scores.