Solving Aspect Category Sentiment Analysis as a Text Generation Task

Aspect category sentiment analysis (ACSA) has attracted increasing research attention. The dominant methods make use of pre-trained language models by learning effective aspect category-specific representations and adding specific output layers on top of the pre-trained representations. We consider a more direct way of making use of pre-trained language models, by casting the ACSA tasks into natural language generation tasks, using natural language sentences to represent the output. Our method allows more direct use of pre-trained knowledge in seq2seq language models by directly following the task setting during pre-training. Experiments on several benchmarks show that our method gives the best reported results, with large advantages in few-shot and zero-shot settings.


Introduction
Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that includes a number of subtasks, two of which are aspect category sentiment analysis (ACSA) and aspect category detection (ACD). Figure 1 shows an example, where the input is "The restaurant was expensive, but the menu was great". ACD detects the aspect categories, such as price and food, and ACSA predicts the sentiment polarities toward each aspect category. In this work, we focus on these two tasks as well as the joint task that combines both.
Previous studies have investigated various methods that treat ACSA and ACD as classification tasks, learning aspect-specific sentence representations (Wang et al., 2016; Ruder et al., 2016). Recently, pre-trained language models (PLM) have shown their effectiveness to this end (Jiang et al., 2019). The main idea is to make use of pre-trained models such as BERT (Devlin et al., 2019a) for representing an aspect-specific form of the input (e.g., by concatenating the aspect category to the end of the input sentence (Figure 3(a))), which provides useful semantic features for ACSA and ACD classifiers. Such methods have given highly competitive results (Li et al., 2020b).
The above classification models benefit from contextualized representations, which contain knowledge learned by pre-training over large data (Lin et al., 2019). However, their use of pre-trained knowledge can be viewed as indirect for at least two reasons. First, the classification task is performed by using a neural network on top of the pre-trained representation, with separate network parameters. Second, the integration of the aspect category makes the aspect-specific input representation not exactly a natural language sentence, which differs from the pre-training setting. Intuitively, more pre-trained knowledge could be leveraged by connecting pre-training and ACSA at the task level, rather than only at the representation level.
We investigate the above potentials by casting the sentiment classification tasks into language modelling tasks. In particular, as shown in Figure 2, both ACSA and ACD are transformed into sequence-to-sequence (seq2seq) tasks, where the encoder takes the input sentence and the decoder generates a natural language sentence. For ACD, the output follows a template stating whether the specific aspect is discussed (e.g., "The category_type category is discussed"); for ACSA, the sentiment polarity of a specific aspect is stated (e.g., "The sentiment polarity of given_category is polarity_type"). The setting corresponds closely to the denoising auto-encoder training scheme of BART (Lewis et al., 2020), which we use as the pre-trained model. Compared with classification-based methods, our method does not include more network parameters, and thus can potentially generalize better to new domains (Brown et al., 2020; Gao et al., 2020). Given a new domain with completely unseen aspect categories and sentiment labels, our method can be applied without changing the output layer structure.
[Figure 2: ACSA as a generation task. Input: "The restaurant was too expensive". Aspect category sentiment analysis: "The sentiment polarity of price is positive" (scoring: 0.1), "The sentiment polarity of price is neutral" (scoring: 0.2), "The sentiment polarity of price is negative" (scoring: 0.7). Aspect category detection: "The price category is discussed" (scoring: 0.9), "The price category is not discussed" (scoring: 0.1).]
In addition to classification-based methods, we take masked language models (MLM) as a baseline also, for which a natural counterpart of our method is a mask-refilling task. As shown in Figure 3(b), different from our method, the output template is concatenated to the input, with the keyword being masked for prediction. This MLM task corresponds closely to BERT (Devlin et al., 2019a) pre-training. In comparison to this MLM method, a generation method can better learn the correlation between the input and output template as two related sequences, which has been demonstrated by the strong performance of BART for abstractive text summarization (Lewis et al., 2020).
Experimental results on three standard benchmark datasets show that both generation and MLM methods outperform classification methods using the same pre-trained language models. Moreover, generation methods give stronger performances than MLM methods, outperforming the previous state-of-the-art methods by a large margin. In addition, using the generation method, we show that jointly performing ACSA and ACD leads to better results than the traditional pipeline. To our knowledge, we are the first to employ a generative pre-trained language model to address an ACSA/ACD problem. We release our code at https://github.com/lgw863/ACSA-generation.

Related Work
Aspect Category Sentiment Analysis Wang et al. (2016) propose an attention-based LSTM network, which can concentrate on different parts of a sentence when different aspect categories are taken as input. Ruder et al. (2016) model the interdependencies of sentences in a text with a hierarchical bidirectional LSTM. Yin et al. (2017) model the task as a machine comprehension problem by constructing pseudo question-answer pairs. Xue and Li (2018) incorporate aspect category information into sentence encoders in the context modeling stage. Other work constructs auxiliary sentences from the aspect categories and converts ACSA to a sentence-pair classification task. Li et al. (2020b) predict the sentiment of an aspect category mentioned in a sentence by aggregating the sentiments of the words indicating the aspect category in the sentence.
Several joint models were proposed to avoid error propagation by performing ACD and ACSA jointly. Schmitt et al. (2018) propose two joint models, an end-to-end LSTM and an end-to-end CNN, which produce all the aspect categories and their corresponding sentiment polarities at once. Hu et al. (2019) propose constrained attention networks (CAN) to constrain the attention weight allocation. The aspect-level sentiment capsules model (AS-Capsules) utilizes the correlation between aspect category and sentiment through shared components. Li et al. (2020a) propose a novel joint model which contains a shared sentiment prediction layer.
All the models above are classification methods, which use a separate output network to give the output label. In contrast, we investigate natural language generation methods by directly following the pre-training process of language models.
Masked Language Model Methods There is a line of work using the masked language model (MLM) for natural language understanding tasks. The basic idea is to leverage information from pre-trained models by defining a specific sentence prompt in a language modelling task. Brown et al. (2020) use prompts for few-shot learning in text classification tasks. Other work rephrases inputs as cloze questions for text classification. Follow-up work, including Gao et al. (2020), extends this approach by automatically generating label words and templates. Petroni et al. (2019) extract relations between entities from BERT by constructing cloze-style templates. We are the first to apply such methods to ACSA, taking it as a baseline. Different from these template-based models, our final model uses BART for text generation, which better models the correlations between the input sentence and the output sentence compared with BERT.
[Figure 3: (a) the classification method, (b) the MLM method and (c) the generation method, each built on a pre-trained encoder; example input: "The menu was great </s>".]
Generation Methods There has been work casting NLP problems as sequence generation tasks (Vinyals et al., 2015; Ma et al., 2017; Stanovsky and Dagan, 2018; Raffel et al., 2020), where the output is a sequence of tokens rather than a natural language sentence. Daza and Frank (2018) treat semantic role labelling as a sequence-to-sequence process. The entity-relation extraction task has also been solved as a multi-turn question answering generation problem. Our work is similar in casting an NLP task as a generation task. Different from the above methods, our goal is to make the most of pre-trained knowledge in BART for ACSA.

Methods
Formally, for ACD, the input is a sentence $X = \{x_1, \ldots, x_n\} = x_{1:n}$, where $x_i$ denotes the $i$-th word. For ACSA, a set of pre-identified aspect categories is also given. We introduce relevant pre-trained language models in Section 3.1, classification methods in Section 3.2, MLM methods in Section 3.3, and our generation method in Section 3.4.

Pre-trained Language Models
We take BERT (Devlin et al., 2019a) and BART (Lewis et al., 2020) as the pre-trained language models. Both are built on the Transformer (Vaswani et al., 2017) architecture. BERT (Devlin et al., 2019a) is a Transformer encoder stack for masked text filling, where the model uses the context words to predict masked words. BART (Lewis et al., 2020) is a denoising auto-encoder seq2seq model pre-trained for natural language generation. Its training applies document corruption, such as randomly deleting tokens from the input or corrupting text with an arbitrary noising function, and BART is trained to reconstruct the original text.
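As a rough illustration of the token-deletion noise mentioned above, the following sketch drops each token independently with a fixed probability and pairs the corrupted text with the original as the reconstruction target. The function names `delete_tokens` and `denoising_pair`, the deletion rate and the whitespace tokenization are assumptions made for illustration; they are not BART's actual pre-processing.

```python
import random
from typing import List, Tuple


def delete_tokens(tokens: List[str], rate: float = 0.3, seed: int = 0) -> List[str]:
    """Drop each token independently with probability `rate` (assumed value)."""
    rng = random.Random(seed)
    return [tok for tok in tokens if rng.random() >= rate]


def denoising_pair(text: str, rate: float = 0.3, seed: int = 0) -> Tuple[str, str]:
    """Return (corrupted input, original target) for reconstruction training."""
    tokens = text.split()  # whitespace tokenization, for illustration only
    return " ".join(delete_tokens(tokens, rate, seed)), text


corrupted, target = denoising_pair(
    "The restaurant was expensive, but the menu was great"
)
```

The encoder would read `corrupted` and the decoder would be trained to emit `target`, which is the same input-to-sentence shape that the generation method reuses for templates.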

The Classification Method
We use a multi-layer perceptron as the classifier, which takes a representation vector as input. Both BERT and BART are considered as the encoders.

BERT Classification BERT adopts "[CLS] input sentence [SEP] given_category [SEP]" as input.
The final hidden state corresponding to "[CLS]" is used as the representation for classification.
BART Classification BART adopts "<S> input sentence </S> given_category </S>" as input and predicts the sentiment polarity of the sentence towards the given category. The same input is fed into the encoder and decoder (see Figure 3(a)). Formally, suppose that the query category is $a$, $x_0$ = <S>, $x_{n+1}$ = </S>, $x_{n+2} = a$ and $x_{n+3}$ = </S>; then the input to BART is $x_{0:n+3}$ = <S> $x_1, \ldots, x_n$ </S> $a$ </S>. The output hidden vectors obtained by the BART encoder (ENCODER) and BART decoder (DECODER) are:
$$h^{enc} = \mathrm{ENCODER}(x_{0:n+3})$$
$$h_0, \ldots, h_{n+3} = \mathrm{DECODER}(h^{enc}; x_{0:n+3})$$
The output vector $h_{n+3}$ is then taken as the representation vector for classification.
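The input layout for BART classification can be sketched as plain string construction. The helper names and the literal `<s>` / `</s>` markers below are assumptions for illustration; a real tokenizer would insert its own special tokens and may split the category into several subword positions.

```python
def build_classification_input(sentence: str, category: str,
                               bos: str = "<s>", eos: str = "</s>") -> str:
    """Lay out 'bos sentence eos category eos' as a single string."""
    return f"{bos} {sentence} {eos} {category} {eos}"


def representation_index(num_sentence_tokens: int) -> int:
    """Index n + 3 of the final position, whose decoder state is used for
    classification, assuming the category occupies exactly one position."""
    return num_sentence_tokens + 3


example = build_classification_input("The menu was great", "food")
```

With the four-word sentence above, the layout has eight positions (0 to 7), so the classifier would read the state at index 7.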

The MLM Method
Masked language models (MLM) (Devlin et al., 2019a) complete a given prompt by filling in missing tokens. We refer to the template containing a given category and a MASK token as a prompt. For sentiment analysis tasks, BERT MLM takes the input sentence and the prompt as the model input and predicts the sentiment polarity label word for the given category. For BART MLM, the same input is fed into the encoder and decoder, and the label word with the highest decoder prediction at the MASK position is the predicted polarity label (see Figure 3(b)). We use the same template in the MLM method and the generation method, following the template creation method in Section 3.4.1.
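A minimal sketch of the prompt construction and label selection described above is given here. The names `build_mlm_prompt` and `pick_label`, the `[MASK]` string and the score dictionary are hypothetical; in practice the scores would come from the model's prediction at the masked position.

```python
from typing import Dict, Sequence

TEMPLATE = "The sentiment polarity of {category} is {polarity}"


def build_mlm_prompt(sentence: str, category: str, mask_token: str = "[MASK]") -> str:
    """Concatenate the sentence with the template, masking the polarity slot."""
    prompt = TEMPLATE.format(category=category, polarity=mask_token)
    return f"{sentence} {prompt}"


def pick_label(mask_scores: Dict[str, float],
               label_words: Sequence[str] = ("positive", "neutral", "negative")) -> str:
    """Choose the best-scoring word, restricted to the label words."""
    return max(label_words, key=lambda w: mask_scores.get(w, float("-inf")))


model_input = build_mlm_prompt("The menu was great", "food")
label = pick_label({"positive": 2.1, "negative": -0.3, "great": 5.0})
```

Restricting the argmax to the label words means a high-scoring non-label word such as "great" in the toy scores cannot be returned as the prediction.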

The Generation Method
We take both ACSA and ACD as language model ranking problems under a seq2seq framework (see Figure 3(c)). The target sequence $T_{a_i,p_k}$ ($T_{a_i}$) $= \{t_1, \ldots, t_m\}$ is a template filled by the given category $a_i$ and the polarity type $p_k$. We first introduce how to create templates in Section 3.4.1, and then show the inference and training details in Section 3.4.2 and Section 3.4.3, respectively.

Template Creation
For ACSA, we manually create templates containing one slot for the given_category and another slot for the polarity_type label. We define a category word set $A = \{a_1, \ldots, a_{|C|}\}$, where $|C|$ is the number of category types (e.g., $a_i$ = "price"), and a polarity type word set $P = \{p_1, \ldots, p_{|L|}\}$, where $|L|$ is the number of polarity types (e.g., $p_k$ = "positive"), and use these words to define templates $T_{a_i,p_k}$ (e.g., "The sentiment polarity of price is positive"). The template $T$ is "The sentiment polarity of $a_i$ is $p_k$". For a given category $a_i$, we can obtain a list of templates
$$T_{a_i} = [T_{a_i,p_1}, \ldots, T_{a_i,p_{|L|}}].$$
For ACD, we use $a_i$ to create a sentiment template $T^+_{a_i}$ for an existing aspect category, and a none-category template $T^-_{a_i}$. $T^+$ is "The $a_i$ category is discussed" and $T^-$ is "The $a_i$ category is not discussed".
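The template creation step amounts to filling slots in fixed strings, which can be sketched as follows. The template wording is taken from the text; the function and constant names are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

ACSA_TEMPLATE = "The sentiment polarity of {category} is {polarity}"
ACD_POSITIVE = "The {category} category is discussed"
ACD_NEGATIVE = "The {category} category is not discussed"


def acsa_templates(category: str, polarities: Sequence[str]) -> List[str]:
    """One filled template per polarity type for a given category."""
    return [ACSA_TEMPLATE.format(category=category, polarity=p) for p in polarities]


def acd_templates(category: str) -> Tuple[str, str]:
    """The (discussed, not discussed) template pair for a given category."""
    return (ACD_POSITIVE.format(category=category),
            ACD_NEGATIVE.format(category=category))


price_templates = acsa_templates("price", ["positive", "neutral", "negative"])
price_acd = acd_templates("price")
```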

Inference
For ACSA, we first enumerate all possible polarities for the given category of the sentence $X$ and fill them into the prepared templates, and then use the fine-tuned pre-trained generative language model to assign a score to each template $T_{a_i,p_k} = \{t_1, \ldots, t_m\}$, formulated as:
$$f(T_{a_i,p_k}) = \sum_{c=1}^{m} \log P(t_c \mid t_{1:c-1}, X) \quad (1)$$
We calculate a score $f(T_{a_i,p_k})$ for each possible polarity by employing the pre-trained generative language model (i.e., BART) to score the templates, and then choose the polarity of category $a_i$ with the largest score.
For ACD, we first create templates $T^+_{a_i}$ and $T^-_{a_i}$ for all possible categories of the sentence $X$, and then use the fine-tuned pre-trained generative language model to assign a score to each template $T_{a_i} = \{t_1, \ldots, t_m\}$, in a similar way to Equation 1. We then decide whether the $a_i$ category is discussed in the input sentence according to which of $T^+_{a_i}$ and $T^-_{a_i}$ receives the higher score.
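The ranking procedure can be sketched independently of any particular model: a template's score is the sum of its token log-probabilities given the input, and the best-scoring template determines the prediction. In the sketch below, `toy_token_prob` is a hand-written stand-in for a fine-tuned seq2seq model, and whitespace tokenization replaces subword tokenization; both are assumptions for illustration only.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

# token_prob(source, prefix_tokens, next_token) -> P(next_token | prefix, source)
TokenProb = Callable[[str, Sequence[str], str], float]


def score_template(source: str, template: str, token_prob: TokenProb) -> float:
    """Sum of log-probabilities of the template tokens, left to right."""
    tokens = template.split()
    return sum(math.log(token_prob(source, tokens[:c], tokens[c]))
               for c in range(len(tokens)))


def predict_polarity(source: str, category: str, polarities: Sequence[str],
                     token_prob: TokenProb) -> Tuple[str, Dict[str, float]]:
    """Score one filled template per polarity and return the best polarity."""
    scores = {
        p: score_template(source, f"The sentiment polarity of {category} is {p}",
                          token_prob)
        for p in polarities
    }
    return max(scores, key=scores.get), scores


def toy_token_prob(source: str, prefix: Sequence[str], token: str) -> float:
    """Hypothetical probabilities standing in for a fine-tuned model."""
    if token == "negative" and "expensive" in source:
        return 0.7
    if token in ("positive", "neutral", "negative"):
        return 0.15
    return 0.9


best, all_scores = predict_polarity(
    "The restaurant was too expensive", "price",
    ["positive", "neutral", "negative"], toy_token_prob,
)
```

ACD follows the same pattern with the two templates "discussed" and "not discussed" in place of the polarity list.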

Training
For ACSA, suppose that the polarity type of $a_i$ is $p_k$. We fill the given category $a_i$ and the polarity type $p_k$ into template $T$ to create a gold target output $T_{a_i,p_k}$. Similarly for ACD, if the category $a_i$ is discussed, the gold target $T^+_{a_i}$ is obtained by filling $a_i$ into $T^+$; otherwise the gold target is $T^-_{a_i}$. For ACSA, we use all gold polarities in the training set to construct $(X, T)$ pairs. For ACD, we use all gold categories in the training set to construct $(X, T^+)$ pairs, and additionally create negative samples $(X, T^-)$ from all categories that do not exist in the input. Finally, we obtain a set of $(X, T)$ pairs for training.
Given a sequence pair $(X, T)$, we feed the input $X = x_{1:n}$ to the BART encoder, obtaining hidden representations of the sentence:
$$h^{enc} = \mathrm{ENCODER}(x_{1:n})$$
At the $c$-th step of the decoder, $h^{enc}$ and the previous output tokens $t_{1:c-1}$ are taken as inputs, yielding a representation using attention (Vaswani et al., 2017):
$$h^{dec}_c = \mathrm{DECODER}(h^{enc}, t_{1:c-1})$$
The conditional probability of the word $t_c$ is defined as:
$$P(t_c \mid t_{1:c-1}, X) = \mathrm{SOFTMAX}(h^{dec}_c W_{lm} + b_{lm})$$
where $W_{lm} \in \mathbb{R}^{d_h \times |V|}$ and $b_{lm} \in \mathbb{R}^{|V|}$, and $|V|$ is the vocabulary size of pre-trained BART. The cross-entropy between the decoder's output and the original template is used as the loss function:
$$\mathcal{L} = -\sum_{c=1}^{m} \log P(t_c \mid t_{1:c-1}, X)$$
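The construction of training pairs described above can be sketched as follows. The function names and the dictionary-based annotation format are assumptions for illustration; the template strings follow the text, and every category absent from the gold annotation yields a negative ACD sample.

```python
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (input sentence X, target template T)


def acsa_pairs(sentence: str, gold: Dict[str, str]) -> List[Pair]:
    """One (X, T) pair per gold (category, polarity) annotation."""
    return [(sentence, f"The sentiment polarity of {cat} is {pol}")
            for cat, pol in gold.items()]


def acd_pairs(sentence: str, gold_categories: Iterable[str],
              all_categories: Iterable[str]) -> List[Pair]:
    """Positive pairs for gold categories, negative pairs for all others."""
    gold = set(gold_categories)
    pairs = []
    for cat in all_categories:
        if cat in gold:
            pairs.append((sentence, f"The {cat} category is discussed"))
        else:
            pairs.append((sentence, f"The {cat} category is not discussed"))
    return pairs


sentence = "The restaurant was expensive, but the menu was great"
sentiment_pairs = acsa_pairs(sentence, {"price": "negative", "food": "positive"})
detection_pairs = acd_pairs(sentence, ["price", "food"], ["price", "food", "service"])
```

Each pair would then be fed to the encoder-decoder model, with the template as the decoder target for the cross-entropy loss.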

Experiments
We choose the SemEval-2014 restaurant review (Rest14) (Pontiki et al., 2014a), a variant of Rest14 (Rest14-hard) (Xue and Li, 2018) and the multi-aspect multi-sentiment (MAMS) (Jiang et al., 2019) datasets for sentence-level sentiment, and the TripAdvisor (Wang et al., 2010) and BeerAdvocate datasets for document-level sentiment.
We use the pre-trained BERT-base (footnote 1) and BART-base (footnote 2) models for task fine-tuning. We select the fine-tuning learning rate from {4e-5, 2e-5, 1e-5} and the batch size from {8, 16, 24} for different models. The dropout probability is 0.1. The best model configuration is selected according to the highest performance on the development set. The details of the settings are given in Appendix B.
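The hyperparameter search described above is a small grid over learning rates and batch sizes, which could be enumerated as in this sketch. The value sets come from the text; the function names and the `dev_score` callback are hypothetical placeholders for an actual fine-tuning and evaluation run.

```python
from itertools import product
from typing import Callable, Dict, List

LEARNING_RATES = (4e-5, 2e-5, 1e-5)
BATCH_SIZES = (8, 16, 24)
DROPOUT = 0.1

Config = Dict[str, float]


def candidate_configs() -> List[Config]:
    """All learning-rate and batch-size combinations, with fixed dropout."""
    return [{"learning_rate": lr, "batch_size": bs, "dropout": DROPOUT}
            for lr, bs in product(LEARNING_RATES, BATCH_SIZES)]


def select_best(configs: List[Config], dev_score: Callable[[Config], float]) -> Config:
    """Keep the configuration with the highest development-set score."""
    return max(configs, key=dev_score)


configs = candidate_configs()
# Toy scoring function for demonstration; a real run would fine-tune and evaluate.
best_config = select_best(configs, lambda c: -abs(c["learning_rate"] - 2e-5) - c["batch_size"] * 1e-9)
```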

Baseline Methods
We compare our generation method with classification and MLM baselines (Figure 3) using the same encoder. In particular, BART generation (i.e., Figure 3(c)) is compared with BART classification (Figure 3(a)) and BART MLM (Figure 3(b)), as well as BERT classification and BERT MLM. In addition, our method is also compared with other models in the literature as follows.
For sentence-level ACSA, we also compare our method with the following state-of-the-art methods in the literature. (1) [...]
For document-level ACSA, we compare our method with the following methods. (1) non-BERT models: LSTM (Tang et al., 2015), HAN (Yang et al., 2016) and MR (machine comprehension pat- [...]
Footnote 1: https://github.com/google-research/bert
Footnote 2: https://huggingface.co/facebook/bart-base/tree/main

ACSA Template T | Dev accuracy
The sentiment polarity of $a_i$ is $p_k$ | 83.78
The sentiment is $p_k$ for $a_i$ | 83.44
The $a_i$ category has a $p_k$ label | 82.31
Table 1: ACSA results using different templates. $a_i$ indicates the given category, $p_k$ indicates the polarity type.

Template T+ | Template T- | Dev score
The $a_i$ category is discussed | The $a_i$ category is not discussed | 93.13
The sentence discusses the $a_i$ category | The sentence discusses no $a_i$ category | 92.67
It is about the $a_i$ category | It is not about the $a_i$ category | 92.44
Table 2: ACD results using different templates.

Development Experiments
Different templates can be used for expressing the same meaning. For instance, "The sentiment polarity of given_category is positive" can also be expressed by "The sentiment is positive for given_category". For ACSA, we investigate the impact of manual templates using the MAMS development set. Table 1 shows the impact of different choices of templates. For instance, "The given_category category has a polarity_type label" and "The sentiment polarity of given_category is polarity_type" give 82.31% and 83.78% accuracy, respectively, indicating that the template influences the final performance. This is consistent with the findings of Gao et al. (2020) for the few-shot task. Based on the development results, we use the top-performing template "The sentiment polarity of given_category is polarity_type" in our ACSA experiments.
For ACD, we investigate the impact of templates using the Rest14 development set. Table 2 shows the performance impact of different templates. We use the top performing template "The category_type category is discussed" as template T + and "The category_type category is not discussed" as template T − in our ACD experiments.

ACSA Experiments
The results of sentence-level ACSA are shown in Table 3. We can see that, first, the performance of BERT MLM and BART MLM is better than that of BERT classification and BART classification, respectively. In particular, BERT MLM gives a strong baseline, outperforming all non-BERT and BERT classification baselines. This shows that making use of pre-training at the task level can achieve better results than at the representation level. Also, the BART MLM and classification models perform better than the corresponding BERT models. Second, BART generation outperforms all baselines on all three datasets, which indicates that our model can better detect multiple sentiment polarities in one sentence toward different aspect categories. Third, BART generation performs significantly better than BART MLM, giving an absolute 3.89% higher accuracy on MAMS, demonstrating the effectiveness of the generation method. This shows the strength of BART pre-training for generating semantically related content, which was also reflected by the strong performance of BART on abstractive summarization (Lewis et al., 2020). In contrast, the MLM method concatenates the input and output into one sequence, and thus fails to model their correlation in encoder-decoder pre-training.
The performances of our model on document-level ACSA are shown in Table 4. BERT classification and BART classification outperform the LSTM, HAN and MR baselines, which shows the effectiveness of pre-training. BERT MLM and BART MLM surpass BERT classification and BART classification, respectively. Our BART generation model achieves improvements of 1.15% and 0.70% over BART MLM on TripAdvisor and BeerAdvocate, respectively, demonstrating that the generation method can more effectively make use of BART for ACSA.

ACD Experiments
Results on the Rest14 ACD subtask are presented in Table 5, where our model gives strong precision and F-1 scores. In particular, a precision score of more than 95% is achieved, which shows that our model can effectively exclude the aspect categories not mentioned in the input.
We also investigate the performance on the MAMS dataset, which consists of at least two unique aspect categories with different sentiment polarities in each input sentence. Table 7 shows that BART generation outperforms all baselines, indicating better ability of our model to detect multiple aspect categories in one sentence.

A Joint Model
The generation method allows us to build a straightforward joint model by extending the first template in Table 1, using "The sentiment polarity of <given_category> is none" as a template for non-existing aspect categories. The results on Rest14 and MAMS are presented in Table 6. We find that joint BART generation achieves better results on this task, with improvements over pipeline BART generation. Joint BART generation outperforms all baselines on precision, recall and F-1 score, which shows the advantage of joint learning.
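The joint model can be sketched as the same template ranking with one extra label, "none", for categories that are not discussed. The `score` callback below stands in for the fine-tuned model's template score, and the toy score table is entirely hypothetical.

```python
from typing import Callable, Dict, Sequence

Scorer = Callable[[str, str], float]  # (sentence, filled template) -> score


def joint_predict(sentence: str, categories: Sequence[str],
                  polarities: Sequence[str], score: Scorer) -> Dict[str, str]:
    """Detect categories and their polarities in one ranking step."""
    labels = list(polarities) + ["none"]
    result = {}
    for cat in categories:
        best = max(labels, key=lambda p: score(
            sentence, f"The sentiment polarity of {cat} is {p}"))
        if best != "none":  # "none" means the category is not discussed
            result[cat] = best
    return result


TOY_SCORES = {  # hypothetical scores, standing in for model log-probabilities
    "The sentiment polarity of price is negative": -0.4,
    "The sentiment polarity of food is positive": -0.3,
    "The sentiment polarity of service is none": -0.2,
}


def toy_score(sentence: str, template: str) -> float:
    return TOY_SCORES.get(template, -5.0)


prediction = joint_predict(
    "The restaurant was expensive, but the menu was great",
    ["price", "food", "service"], ["positive", "neutral", "negative"], toy_score,
)
```

Because detection and sentiment share one ranking, an ACD mistake cannot propagate into a separate ACSA step as it can in a pipeline.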

Few-Shot and Zero-Shot Learning
We evaluate the model performance on ACSA where only a small amount of labelled data is available for training, simulating low-resource scenarios by randomly sampling training instances from a large training set. In particular, we use different numbers of instances for training, randomly sampling a fixed number of instances per category type (10, 20, 50, 100, 200 and 500 instances per category type for Rest14 and MAMS). The results are shown in Figure 4, where BERT classification, BART classification and BART MLM are also compared. It can be seen that on all the datasets, our model outperforms BERT classification, BART classification and BART MLM, especially when the number of training instances is small. For example, when there are only 10 training instances, our model gives an accuracy of 82.01% on Rest14, as compared to 38.57% by BERT classification and 50.16% by BART classification. When the number of instances grows as large as 500, our model gives 2.24% and 2.65% better accuracies than BART MLM on Rest14 and MAMS, respectively. One possible reason is that our method makes more direct use of the sentiment knowledge in the pre-trained language model by adopting the original structure of BART, as mentioned earlier. In contrast, classification methods cannot achieve this because they transfer the sentiment bias indirectly.
The results of our zero-shot learning experiments are in Table 8. In all cases, our method outperforms all the baselines. In particular, the model trained on MAMS performs better on Rest14 than in the reverse zero-shot setting, which suggests that the MAMS dataset is more challenging.
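The per-category sampling used to simulate low-resource training can be sketched as follows. The instance format `(sentence, category, polarity)`, the function name and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Instance = Tuple[str, str, str]  # (sentence, category, polarity)


def sample_per_category(instances: List[Instance], k: int, seed: int = 0) -> List[Instance]:
    """Randomly keep at most k training instances for each category type."""
    by_category = defaultdict(list)
    for inst in instances:
        by_category[inst[1]].append(inst)
    rng = random.Random(seed)
    sampled = []
    for category in sorted(by_category):  # sorted for reproducibility
        pool = by_category[category]
        sampled.extend(rng.sample(pool, min(k, len(pool))))
    return sampled


# Toy data: 30 "price" instances and 5 "food" instances.
data = [(f"sentence {i}", "price", "negative") for i in range(30)]
data += [(f"sentence {i}", "food", "positive") for i in range(5)]
few_shot = sample_per_category(data, k=10)
```

Repeating this for k in (10, 20, 50, 100, 200, 500) would give the training subsets for the different few-shot settings.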

Influence of Category Frequency
Aspect categories can be implicit and do not necessarily occur as terms in the given sentence. To explore the correlation between ACSA accuracy and the occurrence frequency of a given category, we split the eight categories in the MAMS test set into four subsets based on the occurrence frequency. The category that never occurs in the given sentence (i.e., miscellaneous) is put into the zero frequency subset, the 15% least frequent (i.e., ambience, staff) are put into the low frequency subset, the 30% most frequent (i.e., menu, service) are put into the high frequency subset, and the remaining ones (i.e., price, food, place) are put into the mid frequency subset. Figure 5 shows the accuracy of BART classification and our model against the frequency. As the category occurrence frequency decreases, the relative accuracy gap between the two models increases. In the zero frequency subset, our method gives an absolute 8.03% higher accuracy than BART classification. This demonstrates that our method is more robust in summarizing the sentiment polarity of abstract or rare categories. Even if there are no explicit category terms in the sentence, the generation method can give the implicit category opinion of the whole sentence according to the context.
(a) Service was fine and the food delivered in reasonable time given the crowd, but for the price I was disappointed. <miscellaneous: neutral> <incorrect output: negative>
(b) The kids really enjoyed their food and the value on the kids menu is good. <menu: neutral> <incorrect output: positive>
(c) The decor could be a bit better, and if there was a small bar the overall atmosphere would be a bit more inviting. <place: negative> <incorrect output: neutral>
Figure 6: Examples of BART classification. (a) is an instance in which the category does not occur as a term in the sentence. (b) shows that our method is not affected by surrounding interfering information. (c) needs conditional reasoning for analysis. Our method obtains the correct sentiment polarity.
Figure 6 shows typical examples from the test set that the BART classification model fails to infer correctly. In sentence (a), the given category miscellaneous does not occur as a term in the given sentence. Our method can synthesize the different sentiment polarities of different aspects to obtain the correct polarity. In sentence (b), "the value on the kids menu is good", good modifies the value rather than the given category menu. Our method gives the correct polarity, without being affected by the sentiments of surrounding aspects. The last instance (c) involves conditional reasoning, which is difficult for BART classification. In contrast, BART generation gives the correct label by correctly recognizing the negativity in "if there was ... would be a bit more inviting". This is likely because our method makes use of pre-trained knowledge to infer the inter-sentential correlations between the input and the output sequences, which the BART classification model failed to achieve due to the indirect use of BART in the additional classification network.

Conclusion
We investigated a generation method for aspect category detection (ACD) and aspect category sentiment analysis (ACSA), which can make better use of BART's advantages in making semantic-level summaries of the input by not introducing additional model parameters. Experiments show that our proposed method obtains superior performance over the baseline models for both sentence-level and document-level aspect sentiment analysis. In contrast to the traditional sentiment classification methods, our method is also more powerful on zero-shot and few-shot tasks.
Statistics of the three sentence-level datasets are given in Table 9, and the two document-level datasets are described in Table 10.

B Settings
Each method is trained for 30 epochs, during which the model with the best performance on the validation set is saved. We also apply early stopping, which means that training stops if the performance on the validation set does not improve for 5 epochs.
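The training schedule above can be sketched as a loop with a patience counter. The callbacks `train_epoch` and `evaluate` are hypothetical placeholders for the actual training and validation steps, and remembering the best epoch stands in for saving the best checkpoint.

```python
from typing import Callable, Tuple


def train_with_early_stopping(train_epoch: Callable[[int], None],
                              evaluate: Callable[[int], float],
                              max_epochs: int = 30,
                              patience: int = 5) -> Tuple[int, float, int]:
    """Return (best epoch, best validation score, number of epochs run)."""
    best_score, best_epoch, stale = float("-inf"), -1, 0
    epochs_run = 0
    for epoch in range(max_epochs):
        train_epoch(epoch)
        score = evaluate(epoch)
        epochs_run = epoch + 1
        if score > best_score:
            # A real run would save the model checkpoint here.
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score, epochs_run


# Toy validation curve: peaks at epoch 2, then never improves again.
curve = [0.5, 0.6, 0.7, 0.65, 0.66, 0.64, 0.63, 0.62, 0.9]
result = train_with_early_stopping(lambda e: None, lambda e: curve[e])
```

In the toy curve, training stops after epoch 7 (eight epochs run), so the late value 0.9 at epoch 8 is never reached; this is the usual trade-off of a fixed patience.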