Text Generation Model Enhanced with Semantic Information in Aspect Category Sentiment Analysis



Introduction
Aspect-based sentiment analysis is an important task, which analyzes sentiments regarding an aspect of a product or a service. This task includes many subtasks, such as Aspect Category Detection (ACD) and Aspect Category Sentiment Analysis (ACSA). ACD is the task of detecting aspect categories, while ACSA concentrates on predicting the polarity of given aspect categories. This study focuses on ACSA only. Figure 1 shows an example of the ACSA task, where negative and positive are the polarities of the two provided categories, service and food.
The conventional approaches carry out ACSA as a classification task. Wang et al. (2016) and Cheng et al. (2017) employ attention mechanisms for discovering aspect-related words. To achieve better representations, pre-trained language models such as BERT (Devlin et al., 2019) are used (Sun et al., 2019; Jiang et al., 2019). Although achieving competitive results, fine-tuning a pre-trained language model for ACSA suffers from two drawbacks: the difference between the fine-tuning and pre-training tasks, and the gap between the newly initialized classification layers and the pre-trained model. Such inconsistency is often harmful to the training of an outstanding classifier for ACSA. To solve these problems, Liu et al. (2021) propose to transform the sentiment classification task into text generation, which better leverages the power of pre-trained language models following the seq2seq framework, such as BART (Lewis et al., 2020). However, the naive text generation method cannot fully capture the relations between opinion words and target words in sentences containing multiple aspects.
Abstract Meaning Representation (AMR; Banarescu et al., 2013), which is a semantic representation of a sentence in the form of rooted, labeled, directed and acyclic graphs, can model the relations between the target words and associated opinion words. For example, in Figure 1, the relation between "staff" and "rude" is captured by a directed edge between the two nodes "staff" and "rude-01". In addition, the AMR graph also provides high-level semantic information, meaning that words with the same meaning, like "but", "although" and "nevertheless", can be represented by the same node "contrast-01". This paper investigates the potential of combining AMR with the naive text generation model to perform ACSA.
Furthermore, we also design two regularizers for guiding the decoder's cross attentions over the AMR graph. We observe that words (in a sentence) and nodes (in an AMR graph) that are semantically similar should be paid the same amounts of attention. Therefore, we minimize the difference of the two cross attentions over aligned words and AMR nodes. Moreover, the decoded tokens should only be attentive to related AMR nodes. To achieve that, we minimize the entropy of the cross attentions over the AMR graph in the decoder layers.
We evaluate our model using the Rest14, Rest14-hard and MAMS benchmark datasets. The results show that our model is better than the baselines and achieves state-of-the-art performance.
Our contributions can be summarized as follows:
• We propose a model that incorporates an AMR graph encoder within a seq2seq framework, capturing the relations between the target words and opinion words and using this semantic information for ACSA. To the best of our knowledge, this is the first attempt to explore how to use AMR in ACSA.
• We propose two regularizers to improve the cross attention mechanism over the AMR graph using AMR alignments and information entropy.
• We demonstrate the effectiveness of our proposed method through experiments on three datasets.

Related Work
Aspect Category Sentiment Analysis Numerous attempts have been made to improve ACSA. Wang et al. (2016) propose an LSTM-based model combined with an attention mechanism to attend to words suitable for given aspects. To avoid error propagation, joint models which perform ACSA and ACD simultaneously have been proposed. Schmitt et al. (2018) propose two models with LSTM and CNN, which output an aspect category and its corresponding polarity at the same time. Hu et al. (2019) apply orthogonal and sparseness constraints on attention weights. Wang et al. (2019) design an AS-Capsules model to explore the correlation of aspects with sentiments through shared modules. Li et al. (2020a) propose a joint model with a shared sentiment prediction layer.
AMR With the development of AMR parsers, AMR-to-text generation models and larger parallel datasets, AMR has been applied successfully to many downstream text generation tasks. For example, it has been integrated into a machine translation model as additional information for the source side (Song et al., 2019; Nguyen et al., 2021; Xu et al., 2021). In text summarization, several researchers transform AMR representations of sentences into an AMR graph of a summary and generate a text summary from the extracted subgraph (Liu et al., 2015; Dohare and Karnick, 2017; Hardy and Vlachos, 2018; Inácio and Pardo, 2021). However, as explained in Section 1, there has been no attempt to apply AMR to ACSA.

Text Generation Model for ACSA
This section presents an overview of the text generation model for ACSA proposed by Liu et al. (2021), since our proposed model is based on it. The model follows the seq2seq framework that converts a review sentence to a target sentence indicating the polarity of the aspect. It fine-tunes the pre-trained BART model for the text generation task. The model takes a review sentence X = {x_1, x_2, ..., x_{|X|}} = x_{1:|X|} as input and generates a target sentence Y = {y_1, y_2, ..., y_{|Y|}} = y_{1:|Y|}, where |X| and |Y| are the numbers of tokens in the source and target sentences, respectively.

Target Sentence Creation
The target sentence is formed by filling an aspect category and a sentiment word into a predefined template. We denote the set of aspect categories by A = {a_1, a_2, ..., a_{|A|}} and the set of sentiment words by S = {s_1, s_2, ..., s_{|S|}}. The template is defined manually, like "The sentiment polarity of [ASPECT_CATEGORY] is [SENTIMENT_WORD]". For each review sentence X whose corresponding aspect category is a_p and sentiment polarity is s_t, we fill the slots in the template and get the target sentence "The sentiment polarity of ⟨a_p⟩ is ⟨s_t⟩" (e.g., "The sentiment polarity of service is negative").
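The template filling above can be sketched in a few lines; the template string mirrors the example in this section, while the helper name is our own.

```python
# Minimal sketch of target-sentence creation from a template.
TEMPLATE = "The sentiment polarity of {aspect} is {sentiment}"

def make_target(aspect: str, sentiment: str) -> str:
    # Fill the [ASPECT_CATEGORY] and [SENTIMENT_WORD] slots.
    return TEMPLATE.format(aspect=aspect, sentiment=sentiment)

print(make_target("service", "negative"))
# -> The sentiment polarity of service is negative
```

During training, one such sentence is built per (aspect, polarity) gold pair; during inference, one candidate is built per possible polarity.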

Training and Inference
For training, given a pair of sentences (X, Y), the method feeds the input sentence X into the encoder to get the vector representation h^enc of X, as in Equation (1). In the decoder, the hidden vector at time step j is calculated using h^enc and the hidden vectors of the previous time steps, as in Equation (2).

h^enc = Encoder(x_{1:|X|}) (1)

h_j^dec = Decoder(h^enc, h_{1:j-1}^dec) (2)

The conditional probability of the output token y_j is:

P(y_j | y_{1:j-1}, X) = softmax(W^T h_j^dec + b)

where W ∈ R^{d_h×|V|}, b ∈ R^{|V|}, and |V| represents the vocabulary size. The loss function of this model is the following cross entropy:

L_CE = − Σ_{j=1}^{|Y|} log P(y_j | y_{1:j-1}, X)

For inference, we calculate the probabilities of all possible target sentences with different sentiment polarity classes using the trained model and choose the one with the highest probability. For an input sentence X, aspect category a_p and sentiment polarity s_t, the probability of the target sentence Y_{a_p,s_t} = {y_1, y_2, ..., y_m} is calculated as follows:

f(Y_{a_p,s_t}) = (1/m) Σ_{j=1}^{m} log P(y_j | y_{1:j-1}, X)

Proposed Method

Figure 2 shows our proposed model, which follows general text generation methods (Liu et al., 2021). To encode semantic information from an AMR graph, we use a graph encoder module (Subsection 4.1). We incorporate that information by adding a new cross attention layer to the decoder (Subsection 4.2). We also introduce two types of regularizers to guide the attention scores of the new cross attention layer (Subsection 4.3). In addition, Subsection 4.4 introduces the loss function of the model, and Subsection 4.5 presents the pre-training procedure, which overcomes the difficulty of jointly training newly initialized layers and a pre-trained language model.
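The inference step — scoring each candidate target sentence by its length-normalized sum of token log-probabilities and picking the best — can be sketched as follows. The function names and the toy scores are ours; in the real model the per-token log-probabilities come from the fine-tuned BART decoder.

```python
import math

def sentence_score(token_log_probs):
    # Length-normalized log-probability of one candidate target sentence
    # (the normalization choice follows the description above).
    return sum(token_log_probs) / len(token_log_probs)

def predict(candidates):
    # candidates maps each polarity to the list of log P(y_j | y_<j, X)
    # for the tokens of its filled-in template sentence.
    return max(candidates, key=lambda s: sentence_score(candidates[s]))

# Toy token log-probabilities for three polarity templates of one aspect:
cands = {
    "positive": [math.log(0.9), math.log(0.8)],
    "negative": [math.log(0.2), math.log(0.3)],
    "neutral":  [math.log(0.4), math.log(0.5)],
}
print(predict(cands))  # -> positive
```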

AMR Encoder
The AMR encoder adopts Graph Attention Networks (GAT; Velickovic et al., 2018). For a given input sequence X = {x_1, x_2, ..., x_{|X|}}, we construct a corresponding AMR graph G = (V, E) with the pre-trained AMR parser (Bai et al., 2022), where V is the set of nodes and E is the adjacency matrix presenting the relations between the nodes. We treat the AMR graph as an undirected graph, which means e_ij = e_ji = 1 if the two nodes v_i and v_j are connected, and 0 otherwise. Given a graph G = (V, E) and a node v_i ∈ V, we obtain h′_i, the hidden state of node v_i, as follows:

α_ij = softmax_j( σ( a^T [W h_i ∥ W h_j] ) )

h′_i = Σ_{j ∈ N_i} α_ij W h_j

where a^T and W are trainable parameters, σ is the LeakyReLU function, ∥ denotes the concatenation of two vectors, N_i is the set of neighbor nodes of v_i in G, and h_i is the initial representation of v_i. Note that a node (word) generally consists of several subwords. Using the embedding of the AMR parser, h_i is defined as the average of the subword vectors. Applying the multi-head attention mechanism from the Transformer architecture (Vaswani et al., 2017), we obtain the updated representation of node v_i:

h′_i = ∥_{k=1}^{K} Σ_{j ∈ N_i} α^k_ij W^k h_j

where K is the number of attention heads, α^k_ij are the attention coefficients of the k-th head, W^k is the weight matrix of the k-th head, and ∥ stands for the concatenation of multiple vectors.
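A single-head GAT update of the kind described above can be sketched in numpy. This is a minimal illustration of the standard GAT equations, not the paper's multi-head implementation; the function name and the LeakyReLU slope of 0.2 are assumptions.

```python
import numpy as np

def gat_layer(h, adj, W, a):
    """One single-head GAT layer (Velickovic et al., 2018).
    h: (N, d_in) node features; adj: (N, N) 0/1 adjacency, assumed to
    include self-loops; W: (d_in, d_out); a: (2*d_out,) attention vector."""
    z = h @ W                                     # W h_i for every node
    n = z.shape[0]
    e = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            s = a @ np.concatenate([z[i], z[j]])  # a^T [W h_i || W h_j]
            e[i, j] = np.maximum(s, 0.2 * s)      # LeakyReLU, slope 0.2
    e = np.where(adj > 0, e, -np.inf)             # attend to neighbors only
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)     # softmax over N_i
    return alpha @ z                              # h'_i = sum_j alpha_ij W h_j
```

With a symmetric adjacency matrix (e_ij = e_ji) this matches the paper's treatment of the AMR graph as undirected.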

Decoder
After obtaining the graph information, we feed it into each decoder layer by adding a new cross attention module for AMR, referred to as "AMR Cross Attention" in Figure 2. We write h′ for the representations of the AMR nodes obtained from GAT, x for the vector representation of the input sentence, and y^l for the output of the l-th decoder layer. The output of the (l+1)-th decoder layer, y^{l+1}, is obtained as follows:

s^l = LN(y^l + SelfAttn(y^l))

c^l = LN(s^l + CrossAttn(s^l, x))

z^l = c^l + α · AMRCrossAttn(c^l, h′)

y^{l+1} = LN(z^l + FFN(z^l))

where LN is the layer normalization function, SelfAttn is the self-attention module, CrossAttn is the cross-attention module, and FFN is the feed-forward neural network. Training a deep model like the Transformer is hard, and even harder with one more cross-attention module. To overcome this difficulty, we employ ReZero (Bachlechner et al., 2021) for the AMR cross attention module instead of the normal residual module. This method is implemented as follows:

z^l = c^l + α · F(c^l)

where F denotes a non-trivial function (here, the AMR cross attention) and α is a trainable scalar which helps moderate the updating of the AMR cross attention.
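The ReZero residual connection is a one-liner; the sketch below shows why it eases the training of a freshly initialized module. The function name is ours, and `sublayer` stands in for the AMR cross attention. Note that ReZero as originally proposed initializes α at 0 (this paper reports an initial value of 1).

```python
import numpy as np

def rezero_sublayer(x, sublayer, alpha):
    # ReZero residual (Bachlechner et al., 2021): the sublayer's output is
    # scaled by a trainable scalar alpha before being added to the input.
    return x + alpha * sublayer(x)

x = np.ones(4)
# With alpha = 0 the new module is a no-op, so a randomly initialized
# sublayer cannot disturb the pre-trained network early in training:
print(rezero_sublayer(x, lambda v: v * 10.0, alpha=0.0))  # -> [1. 1. 1. 1.]
```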

AMR Cross Attention Regularizers
To incorporate the semantic information from the AMR graph more effectively, we propose two regularizers over the attention scores of the AMR cross attention module.
Identical Regularizer Intuitively, a word in a sentence and its aligned node in the AMR graph should receive the same attention, as they are supposed to represent similar semantic information. Two transformation matrices for the cross attention matrices over the source (input) sentences and the AMR graphs, align_src ∈ R^{|X|×|P|} and align_amr ∈ R^{|V|×|P|}, respectively, are defined in Equations (14) and (15), where |P| is the number of aligned pairs of words and nodes:

(align_src)_{ik} = 1/|T_i| if token x_i belongs to the aligned word at position k, and 0 otherwise (14)

(align_amr)_{jk} = 1/|T_j| if node v_j belongs to the aligned node at position k, and 0 otherwise (15)

Here, T_i denotes the set of subwords in the aligned word. With these matrices and the two cross attention matrices A_src ∈ R^{|Y|×|X|} and A_amr ∈ R^{|Y|×|V|} over the review sentence and the AMR graph, respectively, the identical regularizer is formulated as follows:

R_id = (1/L) Σ_{l=1}^{L} ∥ A^l_src · align_src − A^l_amr · align_amr ∥_F

where ∥·∥_F denotes the Frobenius norm and L is the number of decoder layers. The matrix A_src is obtained from an oracle fine-tuned text generation model. Also, the matrix A_amr is obtained by feeding the same input through the regular cross attention layer over the source sentence, which is indicated by the yellow line in Figure 2.
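One layer's contribution to the identical regularizer can be sketched as a Frobenius-norm difference between the two projected attention matrices. The function name is ours; in the paper, A_src comes from an oracle fine-tuned model, whereas here the inputs are plain arrays.

```python
import numpy as np

def identical_regularizer(A_src, A_amr, align_src, align_amr):
    """Single-layer identical-regularizer term.
    A_src: (|Y|, |X|) cross attention over the source sentence;
    A_amr: (|Y|, |V|) cross attention over the AMR nodes;
    align_src: (|X|, |P|) and align_amr: (|V|, |P|) map tokens/nodes to
    their |P| aligned (word, node) pairs."""
    diff = A_src @ align_src - A_amr @ align_amr   # (|Y|, |P|)
    return np.linalg.norm(diff, ord="fro")
```

When the attention mass on a word equals the mass on its aligned node for every pair, the two projections coincide and the term is zero.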

Entropy Regularizer
We expect that our model concentrates on a few important nodes. This means that the cross attention distribution of the tokens over the AMR nodes is supposed to be skewed. Therefore, we minimize the information entropy (Shannon, 1948) of the attention scores of the tokens over the AMR nodes. We first calculate the mean of the cross attention scores of token i at node j over the H attention heads:

Ā_{ij} = (1/H) Σ_{h=1}^{H} A^h_{ij}

Then, the entropy of the l-th decoder layer is calculated over the |V| nodes and |Y| output tokens:

E^l = −(1/|Y|) Σ_{i=1}^{|Y|} Σ_{j=1}^{|V|} Ā^l_{ij} log Ā^l_{ij}

The entropy regularizer is defined as the mean entropy of the L decoder layers:

R_ent = (1/L) Σ_{l=1}^{L} E^l
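A single layer's entropy term can be sketched directly; the function name is ours, and the small epsilon guarding log(0) is an implementation assumption.

```python
import numpy as np

def attention_entropy(A):
    """Mean Shannon entropy of each output token's attention distribution
    over the AMR nodes, for one head-averaged attention matrix A of shape
    (|Y|, |V|) with rows summing to 1."""
    eps = 1e-12  # avoid log(0) for exactly-zero attention weights
    return float(np.mean(-np.sum(A * np.log(A + eps), axis=1)))

peaked = np.array([[1.0, 0.0, 0.0]])     # all attention on one node
uniform = np.full((1, 3), 1 / 3)         # attention spread over all nodes
print(attention_entropy(peaked) < attention_entropy(uniform))  # -> True
```

Minimizing this quantity pushes the model toward the peaked case, i.e. attending to only a few AMR nodes.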

Loss Function
For training the proposed model, the loss function is the sum of the normal cross entropy loss and the aforementioned two regularizers:

L = L_CE + λ_1 R_id + λ_2 R_ent

where L_CE is the cross entropy loss, R_id and R_ent are the identical and entropy regularizers, and λ_1 and λ_2 are their respective scaling factors.

Pre-training
It is hard to fine-tune our model, which consists of randomly initialized modules, i.e., the AMR graph encoder and the AMR cross attention, together with the pre-trained BART. Following Bataa and Wu (2019) and Gheini et al. (2021), which show the positive effects of pre-training a language model with in-domain data and of fine-tuning a cross attention layer, after initializing the whole model we train it with text denoising tasks using review sentences. Following BART, we add noise into the input sentences using the following three methods:
• Token Masking: random tokens are sampled and replaced by the [MASK] token.
• Word Deletion: instead of deleting subwords as in BART, whole text spans are deleted.
• Text Infilling: random text spans, with lengths drawn from a Poisson distribution, are replaced by [MASK].
Algorithm 1 shows the pseudocode of the text corruption algorithm that adds noise by the above three methods.
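Algorithm 1 itself is not reproduced in this excerpt, so the following is our own illustrative sketch of a corruption routine combining the three noising methods; the probabilities, span-length distributions and function names are all assumptions, not the paper's settings.

```python
import math
import random

MASK = "[MASK]"

def sample_poisson(lam, rng):
    # Knuth's algorithm for sampling a Poisson-distributed span length.
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def corrupt(tokens, p_mask=0.15, p_delete=0.1, p_infill=0.05,
            infill_lam=3.0, rng=None):
    """Toy text corruption: token masking, span deletion, text infilling."""
    rng = rng or random.Random(0)
    out, i = [], 0
    while i < len(tokens):
        r = rng.random()
        if r < p_mask:                            # token masking
            out.append(MASK)
            i += 1
        elif r < p_mask + p_delete:               # delete a short span
            i += 1 + int(rng.expovariate(1.0))
        elif r < p_mask + p_delete + p_infill:    # infill: one MASK replaces
            span = sample_poisson(infill_lam, rng)  # a Poisson-length span
            out.append(MASK)
            i += max(1, span)
        else:                                     # keep the token
            out.append(tokens[i])
            i += 1
    return out
```

Because masking is one-for-one and the other two operations shrink the sequence, the corrupted output is never longer than the input.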

Dataset
The following three datasets are used in our experiments.
• Rest14: This dataset consists of reviews in the restaurant domain, included in the SemEval-2014 task (Pontiki et al., 2014). Samples labeled with "conflict" are removed, so the remaining samples have the labels "positive", "negative" and "neutral". In addition, we follow the development set split suggested by Tay et al. (2018) for the sake of a fair comparison.

Baselines
We compare our method with multiple baselines:
• GCAE (Xue and Li, 2018): employs a CNN model with a gating mechanism to selectively output the sentiment polarity related to a given aspect.
• AS-Capsules (Wang et al., 2019): exploits the correlation between aspects and the corresponding sentiments through a capsule-based model.
• CapsNet (Jiang et al., 2019): a capsule network-based model that learns the relations between the aspects and the contexts.
• BERT-pair-QA-B (Sun et al., 2019): performs ACSA as a sentence-pair classification task by fine-tuning the pre-trained BERT.
• AC-MIMLLN (Li et al., 2020b): predicts the polarity of a given aspect by combining the sentiments of the words indicating the aspect.
• BART generation (Liu et al., 2021): performs ACSA with a text generation model based on the pre-trained BART. It is almost equivalent to our model without AMR.
• BART generation with pre-training: the BART generation model combined with our pre-training method, except for the entropy regularization.

Implementation Details
The template used for constructing the target sentences in our experiments is "Quality of the [ASPECT_CATEGORY] is [SENTIMENT_WORD]". The [ASPECT_CATEGORY] slot is filled by the aspect word, while the [SENTIMENT_WORD] slot is filled by one of {excellent, awful, fine}, which correspond to {positive, negative, neutral} respectively. For AMR parsing, we use the pre-trained model of AMRBART (Bai et al., 2022). In addition, LEAMR (Blodgett and Schneider, 2021) is adopted to align the words in the input sentence with the nodes in the AMR graph.
In the pre-training step, we initialize the parameters of BART using the checkpoint of BART base. Unlike the parameters of BART, the parameters of the AMR graph encoder and the AMR cross attention modules are newly initialized with the uniform distribution. After pre-training, the last checkpoint is used for fine-tuning the ACSA model. The Adam optimizer (Kingma and Ba, 2015) is used for optimizing the model. The original parameters of BART's encoder and decoder are trained with a learning rate of 2e-5, while the learning rate is set to 3e-5 for the parameters of the AMR graph encoder and the AMR cross attention modules. We set the number of attention heads of the AMR encoder to 6, the number of AMR cross attention heads to 6, the batch size to 16, and the dropout value to 0.1. The initial value of the ReZero weight α is 1. The regularization coefficients λ_1 and λ_2 are set to (0.075, 0.1), (0.075, 0.1) and (0.025, 0.0075) for the three datasets, while λ_3 is always set to 5e-3. All hyperparameters are tuned based on the accuracy on the development set.

Experimental Results
The results of the experiments are presented in Table 2. The models were trained and evaluated five times with different initializations of the parameters. The table shows the average and standard deviation of the accuracy of the five trials using the format "mean (±std)". First, our model outperforms all baselines on the three datasets, which indicates the necessity of incorporating semantic information into the text generation model for ACSA. Second, compared with the models that learn relations between the aspect and the context, like CapsNet, AC-MIMLLN, BERT-pair-QA-B and BART generation, the dominance of our model proves that exploiting the AMR graph to learn relations between words is a better way to capture contextual information. The fact that our model also outperforms BART generation with pre-training further supports the effectiveness of the AMR. Third, the competitive results on the Rest14-hard and MAMS datasets show the effectiveness of the identical and entropy regularizers in enabling the model to concentrate on the correct aspect-related nodes, which is essential for identifying the polarity over multiple aspects.

Ablation Study
To further investigate the effects of the different modules in our model, we conducted ablation studies. The results are presented in Table 3. First, we find that the removal of the identical regularizer degrades the performance, which indicates the importance of precisely capturing the semantic information. Second, we also notice that the models without the entropy regularizer perform poorly, with reductions of 0.8, 1.1 and 0.4 percentage points in accuracy on Rest14, Rest14-hard and MAMS, respectively. This shows that the entropy regularizer is essential to prevent models from attending to unnecessary AMR nodes. In addition, removing both regularizers degrades the performance more than removing either regularizer alone, which confirms the essential roles of these regularizers in performing ACSA. Third, removing the pre-training procedure hurts the performance badly, leading to decreases of 1.4, 7.5 and 1.5 percentage points on the three datasets, respectively. This indicates the big gap between the newly initialized modules and the pre-trained model, and the necessity of the pre-training step for overcoming this problem. In summary, the ablation studies show that each component contributes to the entire model. The contribution of the pre-training step is the greatest, while those of the identical and entropy regularizers are comparable to each other.

Case Study
To further examine how well the semantic information of AMR and the two regularizers work in ACSA, a few examples are shown as a case study. Table 4 compares our model with the state-of-the-art method "BART generation". The symbols P, N and O denote positive, negative and neutral, respectively.
In contrast, our model pays attention to only the aspect-related AMR nodes, resulting in the correct predictions in both examples. However, our model also faces difficulties in some cases. In the last example, it wrongly predicts the sentiment polarity for "miscellaneous", because it is very hard to capture aspect-related AMR nodes for a coarse-grained aspect class like "miscellaneous".

Attention Visualization
To study the effectiveness of the two regularizers in guiding the AMR cross attention allocation, we illustrate the cross attention matrices produced by our full model and by the model without the two regularizers in Figure 3. The review sentence is "The food was good overall, but unremarkable given the price.". The polarity label of the aspect category "food" is positive and that of the aspect category "price" is negative. The model without the two regularizers has dense attention matrices that may act as noise in predicting the polarity. In contrast, the attention matrices of our full model are sparse. For example, for the food category, the words "food" and "excellent" in the target sentence pay more attention than in the model without the regularizers to the nodes "food" and "good-02" in the AMR graph. Similarly, for the price category, "price" in the target sentence pays a great deal of attention to the node "price-01" in the AMR graph, while "awful" pays less attention to "remarkable-02" than in the model without the regularizers. These cases indicate that our attention mechanism works well even when a review sentence contains multiple aspects.

Conclusions
In this paper, we proposed a model which integrates the semantic information from Abstract Meaning Representation (AMR) into the text generation method for the task of Aspect Category Sentiment Analysis (ACSA). Moreover, to more precisely capture the semantic correlations between the target words and the AMR nodes, we proposed two regularizers, the identical and entropy regularizers, over the AMR cross attention modules. The experimental results on three datasets showed that our model outperformed all baselines.

Table 4: Case studies of our model compared with the state-of-the-art method.
• "... gave it a shot. The atmosphere was wonderful, however the service and food were not." — {ambiance, service, food}: (P, P, N), (P, N, N), (P, N, N)
• "There are several specials that change daily, which the servers recite from memory." — {food, staff}: (O, P), (P, O), (P, O)
• "The place was busy and had a bohemian feel." — {place, miscellaneous}: (P, P), (N, P), (N, O)

Limitations
Currently, our model only exploits the direct relations between nodes in the AMR graph. In other words, only one-hop neighborhoods can be considered. However, there are a few cases where an opinion word and a related aspect word can be in a k-hop neighborhood. In the future, we will design a model that can capture long-distance relations in the AMR graph. Another limitation is that the errors of the pre-trained AMR parsers and AMR alignment models are propagated to the model as a whole. What is required is to improve the performance of those modules.

Figure 1: Example of ACSA with the corresponding AMR graph and alignment with the review sentence.

Figure 2: Architecture of our proposed model.

Figure 3: Attention scores of target sentences over the AMR graph in our models.
Zhu et al. (2019) [...] the categories of aspects by using a gated mechanism. Xing et al. (2019), Liang et al. (2019) and Zhu et al. (2019) incorporate aspect category information into a sentence decoder for generating representations specific to both the aspect and its context. Sun et al. [...]
[...] (2016) capture inter-sentence relations within a review using a hierarchical bidirectional LSTM model. Xue and Li (2018) extract features using CNN and output [...]

Table 1: Statistics of datasets.
For the same purpose as the Rest14-hard dataset, Jiang et al. (2019) propose a larger dataset (MAMS) for ACSA, in which each sentence contains at least two different aspects. Their details are shown in Table 1.