Exploring Graph Pre-training for Aspect-based Sentiment Analysis

Introduction
Aspect-based sentiment analysis (ABSA) has drawn increasing attention in the community. It involves four fine-grained sentiment elements: the aspect term, the opinion term, the aspect category, and the sentiment polarity. The first two appear as raw text spans in the review sentence, while the remaining two are classification results over the aspect and opinion respectively. Each set of four mapped sentiment elements forms an aspect-level sentiment quadruple. For instance, given the review "The apps are hard to use.", the corresponding quadruple is (apps, hard, Software, Negative).
The joint extraction of quadruples is the most complex and challenging subtask among all ABSA tasks. Previous work usually formulates it as either a sequence-level (Qiu et al., 2011; Peng et al., 2020; Cai et al., 2021) or a token-level classification problem (Tang et al., 2016), in a joint-learning or pipeline manner. However, these methods not only require sophisticated and complex modeling of sentiment elements but also suffer severely from error propagation, since the overall prediction performance hinges on the accuracy of every step (Peng et al., 2020).
In this situation, a natural question is how to

Figure 2: Overview of joint pre-training; the subtasks will be introduced in the following sections. We simplify the depiction of Task-level Graph Pre-training for readability; the detailed process is described below.
strengthen the generative model in modeling aspect-level sentiment structure. We believe the challenges lie in two aspects. The first is structural modeling: the huge gap between the pre-training and fine-tuning phases makes it difficult to model the opinion tree's succinct yet distinctive structure, in which certain components (e.g., the aspect term) are obviously more important than others. The second challenge is the generalization and robustness of the generative model: it should be generalizable and robust against irregular sentiment quadruples. This is crucial since the structure is built from the quadruples, and the challenging scenarios in real practice are usually caused by irregular sentiment quadruples.
In this study, we propose two novel graph pre-training paradigms to address the above challenges. As shown in Figure 1, we first introduce a self-encoding method called Element-level Graph Pre-training. We abandon the traditional indiscriminate masking strategy (randomly masking every node or edge with equal probability) and, based on the characteristics of the opinion tree, adopt sentiment-element-level masking. Given the opinion tree of the review "The apps are hard to use.", only sentiment nodes (namely apps, hard, Software, Negative) or the sub-trees they compose will be masked. In this way, the method serves as an effective addition to structural modeling in opinion tree generation.
We then propose a Task-level Graph Pre-training paradigm, which mimics the human learning procedure of handling a task in stages. Specifically, we first decompose the quadruple extraction task into multiple subtasks, each corresponding to a step in manually building an opinion tree from scratch. Afterwards, we use a prompt-based learning strategy to separately acquire the knowledge of the subtasks, and finally employ the learned knowledge to tackle the main task, i.e., generating the entire opinion tree. The decomposed subtasks build fundamental knowledge of irregular sentiment quadruples for generation.
As shown in Figure 2, we then jointly pre-train the model with the two paradigms above and fine-tune it with the Finetune task. The advantages of our pre-training method over previous learning methods are threefold: 1) both Element-level Graph Pre-training and Task-level Graph Pre-training are designed around the intrinsic characteristics of the opinion tree instead of treating it as a plain graph; 2) Element-level Graph Pre-training abandons the strategy of capturing the complex structure as a whole and focuses directly on the core elements; 3) Task-level Graph Pre-training explicitly forces the model to learn irregular quadruples with an easy-to-hard routine, making it easier for the model to acquire the fundamental knowledge required. A detailed evaluation shows that our model significantly advances the state-of-the-art performance on several benchmark datasets.

Figure 3: Opinion tree generation model.

Related Work
There are four aspect-level sentiment elements in ABSA, and their various combinations form its numerous sub-tasks. Research on ABSA generally follows a route from handling a single sub-task to complex compositions of sub-tasks. The starting point is usually the prediction of a single sentiment element, the target of fundamental sub-tasks such as extracting the aspect term (Qiu et al., 2011; Tang et al., 2016; Wang et al., 2021), classifying the aspect category mentioned in the sentence (Bu et al., 2021; Hu et al., 2019), and detecting the sentiment polarity for a given aspect (Tang et al., 2016; Chen et al., 2022a; Liu et al., 2021; Seoh et al., 2021; Zhang et al., 2022).
Since the sentiment elements are naturally correlated, many studies further explore the co-extraction of sentiment elements, including aspect and opinion term extraction (Xu et al., 2020; Li et al., 2022), aspect term extraction with polarity detection (Zhang and Qian, 2020), and aspect category and polarity detection (Cai et al., 2020). Furthermore, recent studies have employed end-to-end models to extract all the sentiment elements in triplet or quadruple format (Peng et al., 2020; Wan et al., 2020; Cai et al., 2021; Zhang et al., 2021a; Chen et al., 2022b; Mukherjee et al., 2021).
More recently, studies tend to design unified frameworks that extract quadruples in one pass with pre-trained encoder-decoder language models, achieving great improvements in ABSA (Zhang et al., 2021a). Their target sequences are formed by either class indices (Yan et al., 2021) or the desired sentiment elements (Zhang et al., 2021b). OTG (Bao et al., 2022) addressed the importance of semantic correlations among sentiment elements, proposed a sentiment tree structure called the opinion tree, and employed a generative model to extract the linearized tree. However, the generative model is pre-trained on textual sequence tasks (e.g., masked language modeling) but fine-tuned for structure generation; the huge gap between the two makes generative models sub-optimal for modeling structural knowledge.
Different from previous studies, we introduce two pre-training paradigms for opinion tree generation that do not treat the opinion tree as a plain graph. To our knowledge, we are the first to design methods that depend on the intrinsic characteristics of the opinion tree.

Opinion Tree Generation Model
In this section, we introduce the basic opinion tree generation model employed in the pre-training and fine-tuning phases, along with its objective function and training procedure.

Opinion Tree Construction
To further strengthen the relationships between elements, we build a structure called the opinion tree, which jointly models all sentiment elements of a given review sentence in a tree. The opinion tree can be considered a semantic representation that better captures the structure of the sentiment elements. Inside the opinion tree, each sentiment element is connected to another node in either a child or parent relation to represent the crucial relationships.
As shown in Figure 3, we construct the opinion tree as a rooted directed acyclic graph, including nodes for the aspect, opinion, category, and polarity, along with the semantic relations between them. We then linearize the opinion tree into the target sequence via depth-first traversal.
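The linearization step can be sketched as follows. The bracket format and node names (`Root`, `Quad`, `Aspect`, `Opinion`) below are modeled on the example tree quoted later in the paper; the exact serialization used by the authors may differ.

```python
def linearize(quads):
    """Linearize a list of sentiment quadruples into a bracketed tree
    string via depth-first traversal.

    Each quadruple is (aspect, opinion, category, polarity).  This is an
    illustrative sketch of the format, not the paper's exact scheme.
    """
    parts = []
    for aspect, opinion, category, polarity in quads:
        parts.append(
            f"(Quad, (Aspect ({category}, {aspect}), "
            f"(Opinion ({polarity}, {opinion}))))"
        )
    return "(Root, " + ", ".join(parts) + ")"

tree = linearize([("apps", "hard", "Software", "Negative")])
```

For the running example, this produces a sequence beginning `(Root, (Quad, (Aspect (Software, apps), ...`, which the generative model is trained to emit token by token.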

Generation Model
We employ the pre-trained language model T5 (Raffel et al., 2020) to generate the linearized opinion tree. As shown in Figure 3, it is an encoder-decoder architecture: the input is the raw review and the output is the linearized opinion tree. Given the token sequence $x = x_1, \ldots, x_{|x|}$ as input, the sequence-to-sequence model outputs the linearized representation $y = y_1, \ldots, y_{|y|}$. To this end, the model first computes the hidden vector representations of the input:

$$H = \mathrm{Encoder}(x_1, \ldots, x_{|x|})$$

After the input token sequence is encoded, the decoder predicts the output sequence token by token, attending to the input tokens' hidden vectors. At the $i$-th step of generation, the self-attention decoder predicts the $i$-th token $y_i$ of the linearized form and the decoder state $h_i^d$:

$$y_i, h_i^d = \mathrm{Decoder}([H; h_{<i}^d], y_{i-1})$$

The conditional probability of the whole output sequence $p(y|x)$ is the product of the per-step probabilities $p(y_i | y_{<i}, x)$:

$$p(y \mid x) = \prod_{i=1}^{|y|} p(y_i \mid y_{<i}, x)$$

where $y_{<i} = y_1 \ldots y_{i-1}$ and $p(y_i | y_{<i}, x)$ is a distribution over the target vocabulary $V$. The objective is to maximize the probability of the linearized opinion tree $X^T$ given the review sentence $X^O$. We therefore minimize the negative log-likelihood over the training set $\tau$:

$$\mathcal{L} = -\sum_{(X^O, X^T) \in \tau} \log p(X^T \mid X^O; \theta)$$

where $\theta$ denotes the model parameters and $(X^O, X^T)$ is a (sentence, tree) pair in the training set $\tau$.
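The factorization of the sequence probability can be made concrete with a tiny numeric sketch of the negative log-likelihood; the function name and the per-step probabilities are purely illustrative.

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood of a target sequence, given the model's
    probability of the gold token at each decoding step:
        L = -sum_i log p(y_i | y_<i, x)
    which equals -log of the product of the per-step probabilities.
    """
    return -sum(math.log(p) for p in step_probs)

# Toy example: gold-token probabilities at three decoding steps.
loss = sequence_nll([0.9, 0.8, 0.7])
```

Training drives each gold-token probability toward 1, so the loss approaches 0 as the model learns to reproduce the linearized tree.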

Pre-training Paradigms
In this study, we introduce two pre-training paradigms for opinion tree generation. As shown in Figure 2, the two paradigms and the fine-tuning task share the same input format, a joint input of prompt, encoded text, and tree; each paradigm consists of a set of subtasks focusing on its respective training targets. The combination of subtasks forms the joint pre-training in our work; we introduce the paradigms first in this section.

Element-level Graph Pre-training
The opinion tree is directly composed of sub-trees that represent the respective quadruples. This naturally means that the noteworthy information must reside in the aspect-level sentiment elements rather than in the other parts of the opinion tree, such as structural nodes. For instance, for the linearized opinion tree "(Root, (Quad, (Aspect (Software, apps), (Opinion (Negative, hard)", indiscriminate masking may mask a sub-sequence such as "(Opinion (" that 1) logically cannot be reformed into a valid structure due to its non-closing brackets, and 2) contains nodes (e.g., "Opinion") that are not among the crucial sentiment elements.
In contrast, our Element-level Graph Pre-training paradigm masks aspect-level element nodes (including the aspect term, opinion term, aspect category, and sentiment polarity) in the opinion tree. As shown in Figure 4, the masked sequence "(Software, apps)" represents a legitimate structure and covers core sentiment elements only. If contiguous nodes are masked, the corresponding sub-graph is masked as a whole. This method not only ensures that the masked nodes are crucial sentiment elements but also guarantees that the corresponding sub-sequence is logically legitimate.
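The element-level strategy can be sketched as follows; the function, span indices, and `<mask>` token are illustrative assumptions, not the paper's implementation.

```python
import random

def element_level_mask(tokens, element_spans, mask_rate, rng=None):
    """Mask whole sentiment-element spans in a linearized opinion tree.

    `element_spans` are (start, end) token-index pairs, each covering a
    complete, bracket-balanced element sub-sequence such as
    "( Software , apps )"; structural nodes like "Root" or "Quad" are
    never inside a span.  Replacing an entire span with one <mask> token
    keeps the remaining sequence structurally legitimate.
    """
    rng = rng or random.Random(0)
    out = list(tokens)
    # Process spans right-to-left so earlier indices stay valid.
    for start, end in sorted(element_spans, reverse=True):
        if rng.random() < mask_rate:
            out[start:end] = ["<mask>"]
    return out

tokens = ("( Root , ( Quad , ( Aspect ( Software , apps ) , "
          "( Opinion ( Negative , hard ) ) ) ) )").split()
# The "( Software , apps )" element occupies token positions 8..12.
masked = element_level_mask(tokens, [(8, 13)], mask_rate=1.0)
```

Because a span is always replaced as one unit, the masked sequence can never contain the dangling, unclosed brackets that indiscriminate token masking produces.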
With the element-level graph mask strategy introduced above, we propose a set of pre-training

subtasks. The inputs are a concatenation of a prompt, a sentence, and an opinion tree. The sentence and tree are masked with different masking rates, while the prompt indicates the output target, either the sentence or the tree. For a given review s = (x_1, x_2, ..., x_{n-1}, x_n) and linearized tree t = (t_1, t_2, ..., t_{n-1}, t_n), we design the five subtasks of the Element-level Graph Pre-training paradigm listed in Table 1. Among them, EGP1 and EGP4 are designed to help the model generate the complete tree t by adding text information, while EGP2, EGP3, and EGP5 help the model generate the full review s by adding structural information.
To further emphasize the interaction between the pre-training and fine-tuning phases, we design a dynamic masking rate for the Element-level Graph Pre-training paradigm: a small masking rate is used in the initial phase, and the rate then increases with the training rounds, so that by the end of pre-training all partially masked pre-training tasks are very close to the fine-tuning task (which can be considered a 100% masking rate); the specific masking rates are shown in Table 2. Note that our masking rates are obviously lower than in previous work (Bai et al., 2022); this is because recovering a nearly fully masked text from an opinion tree is unreasonable, since the opinion tree contains limited information, as discussed above.
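A simple linear schedule captures the idea; the start and end rates below are illustrative placeholders, not the values from Table 2.

```python
def dynamic_mask_rate(epoch, total_epochs, start=0.15, end=0.55):
    """Linearly increase the masking rate as pre-training proceeds, so
    the final pre-training tasks approach the fine-tuning setting
    (viewable as a 100% masking rate).  `start` and `end` are
    illustrative; the paper's actual schedule is given in its Table 2.
    """
    frac = epoch / max(1, total_epochs - 1)
    return start + (end - start) * frac
```

Each pre-training round then calls this function to decide how many element spans to mask, keeping early rounds easy and later rounds close to fine-tuning.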

Task-level Graph Pre-training
Inspired by the human learning process, we propose a Task-level Graph Pre-training paradigm whose subtasks follow the routine by which humans learn to build an opinion tree from scratch. Specifically, we first decompose the quadruple extraction task into multiple subtasks, each corresponding to a step in manually building an opinion tree from scratch. The paradigm consists of six subtasks: four of them (Aspect, Opinion, Category, Polarity) extract sentiment structure as the fundamental knowledge for building an opinion tree, while the remaining two (Pair, Triple) target the intermediate states of the procedure with co-extraction. The subtasks and the corresponding building steps can be found in Appendix A. In this way, we force the model to focus directly on irregular cases through a gradual process that builds the fundamental knowledge for OTG. The inputs of Task-level Graph Pre-training are similar to those of the previous paradigm: a concatenation of a prompt and a sentence. The subtasks of the Task-level Graph Pre-training paradigm are shown in Figure 5.

Joint Pre-training
We use a joint pre-training method to combine the advantages of the Element-level Graph Pre-training and Task-level Graph Pre-training paradigms. In addition, we include the fine-tuning task Finetune in the pre-training phase to narrow the gap between the two phases and avoid overfitting. During pre-training, the model is trained cyclically: each loop starts with the subtasks of Element-level Graph Pre-training, followed by Task-level Graph Pre-training, and the gradient is updated after accumulating the loss in each epoch. Afterwards, we save the model weights and fine-tune the model with the Finetune task.
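The cyclic ordering can be sketched as a task schedule; the subtask names follow the paper, but the helper function and loop structure are illustrative, and the real system would perform one accumulated gradient update per loop rather than merely listing tasks.

```python
EGP_SUBTASKS = ["EGP1", "EGP2", "EGP3", "EGP4", "EGP5"]   # Element-level
TGP_SUBTASKS = ["Aspect", "Opinion", "Category",
                "Polarity", "Pair", "Triple"]             # Task-level
FINETUNE_TASK = "Finetune"  # fine-tuning task mixed into pre-training

def pretrain_schedule(num_loops):
    """Materialize the cyclic training order: each loop runs the
    Element-level subtasks, then the Task-level subtasks, then the
    Finetune task.  This sketch only produces the order; in training,
    the losses of one loop would be accumulated before a gradient step.
    """
    order = EGP_SUBTASKS + TGP_SUBTASKS + [FINETUNE_TASK]
    return [task for _ in range(num_loops) for task in order]
```

A driver would iterate over `pretrain_schedule(n)`, run each subtask's batch, and step the optimizer once per completed loop.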

Experiments
In this section, we introduce the datasets used for evaluation and the baseline methods employed for comparison. We then report experimental results from different perspectives and analyze the effectiveness of the proposed model under different factors.

Setting
In this study, we use the ACOS dataset (Cai et al., 2021) for our experiments. Following the setting of Cai et al. (2021), we divide the original dataset into a training set, a validation set, and a test set. In addition, we choose 20,000 sentences from Yelp and 20,000 sentences from the laptop domain of Amazon to pre-train the opinion tree generation model; these sentences are annotated by the OTG model without pre-training. Following the setting of Bao et al. (2023), we divide the quadruples into four types: apart from the basic situation, there are three irregular situations: One-to-Many, Mono-Implicit, and Bi-Implicit. The statistics can be found in Figure 6.
We employ T5 and fine-tune its parameters for our opinion tree generation model. We tune the hyperparameters by grid search on the validation set and select the best models by early stopping based on the accuracy on the validation set. The model parameters are optimized with Adam (Kingma and Ba, 2015); the learning rates for pre-training and fine-tuning are 3e-5 and 1e-4 respectively, and the batch size is 16. Our experiments are carried out on an Nvidia RTX 3090 GPU. The reported results are averaged over ten runs with random initialization.
In evaluation, a quadruple is viewed as correct if and only if its four elements, as well as their combination, are exactly the same as those of a gold quadruple. On this basis, we calculate precision and recall, and use the F1 score as the final evaluation metric for aspect sentiment quadruple extraction (Cai et al., 2021; Zhang et al., 2021a).
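The exact-match metric can be sketched as follows; the function is a per-set illustration, whereas a full evaluation would pool true-positive, false-positive, and false-negative counts across all sentences before computing F1.

```python
def quad_f1(pred_quads, gold_quads):
    """Exact-match F1 over (aspect, opinion, category, polarity) tuples:
    a prediction counts as correct only if all four elements match a
    gold quadruple exactly."""
    pred, gold = set(pred_quads), set(gold_quads)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [("apps", "hard", "Software", "Negative")]
```

For instance, predicting the gold quadruple plus one spurious quadruple yields precision 0.5 and recall 1.0, hence F1 = 2/3.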
In particular, we build two Large Language Model (LLM) baselines. ChatGPT is a sibling model of InstructGPT (Ouyang et al., 2022), trained to follow instructions in a prompt and provide a detailed response; we ask it to generate all the sentiment elements from the input review sentences. LLaMA (Touvron et al., 2023) is a collection of foundation language models trained on trillions of tokens, which have shown that it is possible to train state-of-the-art models using publicly available datasets exclusively; we use LLaMA-7B and fine-tune it on the ABSA dataset.
As shown in Table 3, we find that generative models outperform previous classification-based methods and that the structural generative method surpasses non-structural methods, indicating that semantic structure does contribute to quadruple extraction. Meanwhile, our proposed model significantly outperforms all previous studies (p < 0.05), with an advantage of 2.36% and 0.92% in the Restaurant and Laptop domains respectively. The results show that the proposed joint pre-training is effective in modeling tree-structural constraints for the generative model, while the large gap between pre-training and fine-tuning significantly encumbers previous systems. Furthermore, the results indicate the effectiveness of our Element-level Graph Pre-training and Task-level Graph Pre-training paradigms, which unify the pre-training and fine-tuning tasks with designs based on the intrinsic characteristics of the opinion tree instead of treating it as a plain graph.

Analysis and Discussion
In this section, we first present analysis and discussion of the influence of the Element-level Graph Pre-training (EGP) and Task-level Graph Pre-training (TGP) paradigms. We then investigate our search over the masking rate and the influence of the pre-training subtasks.

Influence of Different Factors
We first investigate the difference between the two paradigms. From Table 4 we can see that both paradigms are beneficial for extracting the opinion tree, and that the TGP paradigm contributes more than the EGP paradigm, as removing TGP causes the larger performance drop. In this situation, one intuitive question arises: does the element-level masking design actually achieve better performance than the indiscriminate paradigm, as we expect?
We investigate this question through ablation experiments. We first design an indiscriminate paradigm under similar settings, then report the performance of the different paradigms in Table 5. As the table shows, our element-level paradigm outperforms the indiscriminate paradigm. This result demonstrates the superiority of our element-level masking design and also validates our motivation: for target graphs that contain limited knowledge, such as the opinion tree, indiscriminate masking strategies are sub-optimal and fine-grained masking should be adopted.
We then investigate the impact of the subtasks in the EGP paradigm by adding them to the paradigm gradually. As shown in Table 5, the subtask pair of EGP5 and EGP4 (+All EGPs) contributes the most to performance (0.58% and 0.42% in the two domains respectively); these subtasks integrate complementary information from both formats to generate the text and tree respectively, indicating the significance of the complementary association.

Effect of Task-level Graph Pre-training
As shown in Table 6, the OTG model clearly falls short in generalization and robustness against irregular sentiment quadruples compared with the basic situation. We therefore mimic the human procedure of building an opinion tree from scratch with Task-level Graph Pre-training to strengthen its fundamental knowledge.
We investigate the paradigm's effect by comparing the model's performance on each irregular quadruple situation. As shown in Table 6, our model's improvement on all of the irregular classes surpasses that on the basic situation when compared with OTG. This result indicates that our pre-training method significantly improves the model's generalization and robustness against irregular sentiment quadruples, which lays the foundation for building an opinion tree and should be considered in addition to improving structural awareness.
We then investigate the impact of the subtasks in the TGP paradigm by removing them from the paradigm gradually. Table 7 shows the results: the contributions of the subtasks stay within a similar range, with Aspect surpassing the others by a tiny gap, possibly due to the lower implicit rate of aspect terms.
In addition, all the subtasks are beneficial for extracting the opinion tree. It is worth noting that the participation of the fine-tuning task Finetune demonstrates an obviously positive effect in both paradigms, improving the two domains by an average of 0.31%. This leads us to conclude that adding the fine-tuning task to the pre-training phase is an effective way of narrowing the gap between the two phases.

Conclusion
In this study, we propose two novel pre-training paradigms for opinion tree generation, designed around the intrinsic characteristics of the opinion tree. Specifically, the Element-level Graph Pre-training paradigm abandons the strategy of capturing the complex structure as a whole and focuses directly on the core elements, while the Task-level Graph Pre-training paradigm explicitly focuses on improving generalization and robustness against irregular quadruples with an easy-to-hard routine. Furthermore, we explore a dynamic masking rate and a cyclical training method for jointly combining the pre-training paradigms in order to bridge the gap between the pre-training and fine-tuning phases in modeling structural knowledge.
Experimental results show that our proposed model achieves state-of-the-art performance in ABSA. In addition, the results validate that, for target graphs that contain limited knowledge, such as the opinion tree, improvement strategies should be based on the intrinsic characteristics of the structure instead of treating it as a plain graph.

Limitations
The limitations of our work can be stated from three perspectives. First, our pre-training method contains many subtasks, which consume a vast computational cost during pre-training (the inference cost is unchanged); future work should explore a more time-efficient pre-training method. Second, more tasks could be explored, including cross-domain and cross-lingual sentiment analysis. Finally, we focus on opinion tree generation in a single major language; performance in other languages remains unknown.
Based on the process introduced above, we design the subtasks of the Task-level Graph Pre-training paradigm. Each subtask corresponds to a step in manually building an opinion tree from scratch. The paradigm consists of six subtasks: Aspect, Opinion, Category, Polarity, Pair, and Triple. Their prompts and target graphs can be found in Figure 7. Among them, Aspect and Opinion focus on extracting the basic elements of each quadruple:
• Aspect: extract all the aspect terms in the review in the form of a tree, Figure 7 (a).
• Opinion: extract all the opinion terms in the review in the form of a tree, Figure 7 (b).
Category and Polarity further explore the classification results of the corresponding basic elements:
• Category: on the basis of Aspect, extract the category classification of the aspect terms in the review in the form of a tree, Figure 7 (c).
• Polarity: on the basis of Opinion, extract the polarity classification of the opinion terms in the review in the form of a tree, Figure 7 (d).
Pair and Triple fulfill the mapping between quadruple elements.
• Pair: on the basis of Aspect and Opinion, map the corresponding aspect term and opinion term within a quadruple, Figure 7 (e).
• Triple: on the basis of Aspect and Polarity, map the corresponding aspect term, opinion term, and polarity within a quadruple, Figure 7 (f).

Figure 6: Statistics of regular and irregular situations of opinion trees.

Figure 7: Building procedure of the subtasks in Task-level Graph Pre-training.

Table 1: Subtasks of Element-level Graph Pre-training.

Table 2: Dynamic masking rates.

Figure 5: Subtasks of the Task-level Graph Pre-training paradigm. Note that the fine-tuning task has been added to the pre-training phase.

Table 4: Impact of the pre-training paradigms.

Table 6: Average performance on different situations in the Restaurant and Laptop domains.

Table 7: Impact of the subtasks in the Task-level Graph Pre-training paradigm.