An Empirical Study of Sentiment-Enhanced Pre-Training for Aspect-Based Sentiment Analysis



Introduction
Aspect-Based Sentiment Analysis (ABSA) is an important problem in sentiment analysis (Pontiki et al., 2014). Its goal is to recognize opinions and sentiments towards specific aspects from user-generated content (Zhang et al., 2022). Traditional ABSA approaches generally develop several separate models (Xu et al., 2018; Xue and Li, 2018; Fan et al., 2019) or a joint model (He et al., 2019; Chen and Qian, 2020), establishing interactions between different sentiment elements through specific model structures.
In recent years, pre-trained models (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020) have yielded excellent results in extensive NLP tasks. This inspires many research efforts that leverage pre-training techniques to learn sentiment-aware representations. Among them, Xu et al. (2019a) reveal that pre-training on a sentiment-dense corpus through masked language modeling alone can result in significant improvements on three downstream ABSA tasks. Further, researchers undertake many explorations on integrating sentiment knowledge (e.g., sentiment words) in the pre-training phase (Tian et al., 2020; Zhou et al., 2020; Ke et al., 2020; Li et al., 2021; Fan et al., 2022), as sentiment knowledge has been widely demonstrated to be helpful in various ABSA tasks (Li and Lam, 2017; Zeng et al., 2019; He et al., 2019; Xu et al., 2020a; Wu et al., 2020b; Liang et al., 2022).
Despite significant gains in various ABSA tasks, there has not been a comprehensive evaluation and fair comparison of existing Sentiment-enhanced Pre-Training (SPT) approaches. Therefore, this paper conducts an empirical study of SPT-ABSA to systematically investigate and analyze the effectiveness of the existing approaches. We mainly concentrate on the following questions: (a) what impact do different types of sentiment knowledge have on downstream ABSA tasks?; (b) which knowledge integration method is most effective?; and (c) does injecting non-sentiment-specific linguistic knowledge (e.g., part-of-speech tags and syntactic relations) into pre-training have positive impacts? Based on the experimental investigation of these questions, we eventually obtain a powerful sentiment-enhanced pre-trained model. We evaluate it on a wide range of ABSA tasks to see how much SPT can facilitate the understanding of aspect-level sentiments.
To enable our study, we prepare a large-scale knowledge-annotated SPT corpus. We obtain and collate over 100 million user-generated reviews from Yelp and Amazon. Subsequently, we develop an effective semi-supervised method for sentiment knowledge mining and annotation. This method is driven by lexicons and syntactic rules, and we devise an Expectation-Maximization (EM) algorithm to estimate them. Experiments demonstrate that this method can mine more abundant and accurate sentiment knowledge than existing methods.
Our contributions can be summarized as follows:
• We develop an effective sentiment knowledge mining method and leverage it to build a large-scale knowledge-annotated SPT corpus.
• We systematically review and summarize the existing SPT approaches and empirically investigate and analyze their effectiveness.
• We conduct extensive experiments on ABSA tasks and illustrate how SPT can facilitate the understanding of aspect-level sentiments.
Analysis Setup

Pre-training Data
Following Xu et al. (2019a), we use user-generated reviews from the Yelp dataset and the Amazon reviews dataset (Ni et al., 2019) for pre-training. We remove reviews that are too short (<50 characters) or too long (>500 characters) and end up with a corpus containing 140 million reviews across 28 domains. Its statistics are detailed in Appendix A.1.

Sentiment Knowledge Mining
In this paper, we mainly investigate four typical types of sentiment knowledge: reviews' rating scores, sentiment words, word sentiment polarity, and aspect words. We illustrate them in Figure 1(a). Since only annotations of rating scores exist in the collected pre-training corpus, we develop an effective semi-supervised sentiment knowledge mining method.
Our method draws inspiration from the double propagation algorithm proposed by Qiu et al. (2011). They observe that some syntactic patterns link aspect words and sentiment words, as illustrated in Figure 1(b). Consequently, they define syntactic rules to expand the aspect lexicon and sentiment lexicon iteratively. However, their method requires careful manual selection of syntactic rules. This limitation hinders the exploitation of complex syntactic patterns, such as the pair (pizza, awful) in "we had a lamb pie pizza that was awful".
To overcome this limitation, we devise an Expectation-Maximization (EM) algorithm to learn syntactic rules. In our method, the annotations of sentiment words and aspect words in the reviews are treated as unobserved latent variables, and the lexicons and rules are treated as the parameters. We first initialize the parameters using MPQA (Wilson et al., 2005) and several simple syntactic rules; the E-step annotates the reviews with the current estimate for the parameters; the M-step updates the parameters according to the expected annotations. This process can be summarized as repeating the two steps until convergence, where $\theta = (L_S, L_A, P_{SS}, P_{AA}, P_{SA}, P_{AS})$ denotes the lexicons and syntactic rules. For each mined sentiment word, we use Pointwise Mutual Information (PMI) to determine its polarity (Turney, 2002; Tian et al., 2020). See Appendix B for more details.
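For concreteness, the following is a minimal Python sketch of this EM-style mining loop. It assumes the reviews have already been parsed into candidate word pairs with their syntactic paths, and it only shows the aspect-to-sentiment path set P_AS; the full method also maintains P_SS, P_AA, and P_SA. All names here are illustrative rather than the exact implementation.

```python
def em_mine(candidates, init_sentiment_lexicon, init_aspect_lexicon, n_iters=5):
    """Sketch of the EM-style mining loop.

    candidates: iterable of (w_i, w_j, path) tuples, where `path` is a hashable
    encoding of the syntactic path between the two words in a review sentence.
    Only the aspect -> sentiment direction (path set P_AS) is shown for brevity.
    """
    L_S, L_A = set(init_sentiment_lexicon), set(init_aspect_lexicon)
    P_AS = set()  # aspect -> sentiment syntactic paths

    for _ in range(n_iters):
        # E-step: annotate words with the current lexicons and path set.
        annotated = []
        for w_i, w_j, path in candidates:
            is_aspect = w_i in L_A
            is_sentiment = w_j in L_S or (is_aspect and path in P_AS)
            annotated.append((w_i, w_j, path, is_aspect, is_sentiment))

        # M-step: re-estimate lexicons and paths from the expected annotations.
        for w_i, w_j, path, is_aspect, is_sentiment in annotated:
            if is_aspect and is_sentiment:
                L_A.add(w_i)
                L_S.add(w_j)
                P_AS.add(path)

    return L_S, L_A, P_AS
```

In practice the M-step would additionally apply frequency or confidence thresholds before adding new words and paths, to avoid propagating noisy annotations.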

Syntax Knowledge Acquisition
We annotate four types of syntax knowledge in the reviews using spaCy (the trained pipeline we use is en_core_web_sm 3.3.0). For each word, we annotate its part-of-speech tag. If there is a dependency relation between two words, we annotate its direction and type. If a word is an ancestor of another word, we annotate their dependency distance.
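A minimal sketch of this annotation step with spaCy is shown below, assuming the en_core_web_sm pipeline mentioned above is installed; the output format we actually store is simplified here.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # the pipeline mentioned above

def annotate_syntax(review: str):
    doc = nlp(review)
    # part-of-speech tag of every word
    pos_tags = [(tok.text, tok.pos_) for tok in doc]
    # dependency direction (head -> child) and type for every relation
    dep_edges = [(tok.head.i, tok.i, tok.dep_) for tok in doc if tok.dep_ != "ROOT"]
    # dependency distance between each word and its ancestors
    distances = {}
    for tok in doc:
        node, d = tok, 0
        while node.head is not node:  # spaCy marks the root as its own head
            node, d = node.head, d + 1
            distances[(node.i, tok.i)] = d
    return pos_tags, dep_edges, distances

# e.g. annotate_syntax("we had a lamb pie pizza that was awful")
```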

Downstream Tasks and Datasets
An aspect-level opinion can be defined as a triplet consisting of an aspect term, the corresponding opinion term, and the sentiment polarity (Peng et al., 2020). Therefore, we select Aspect term Extraction (AE), Aspect-oriented Opinion term Extraction (AOE), and Aspect-level Sentiment Classification (ASC) to measure a model's understanding of these three sentiment elements, respectively. These downstream tasks are illustrated in Table 1. The datasets for these three ABSA tasks are derived from Wang et al. (2017) and Fan et al. (2019). Their statistics are detailed in Appendix A.2.

Method
Given a review X of length T, a pre-trained model produces its word-level contextualized representations and a review-level representation, which can be generally formulated as $h_1, \dots, h_T, s = \mathrm{PTM}(X)$. General-purpose pre-training (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; He et al., 2020) mostly learns parameters through Masked Language Modeling (MLM). In MLM, a certain proportion of words C in the review is masked, and the masked review $\tilde{X}$ is then input to the pre-trained model to recover the masked part: $\mathcal{L}_{\mathrm{MLM}} = -\sum_{t \in C} \log P(x_t \mid \tilde{X})$ with $P(x_t \mid \tilde{X}) = \mathrm{softmax}(\mathrm{FFNN}(h_t))$, where FFNN denotes a feed-forward neural network with non-linear activation, and for simplicity, we still use $h_t$ to denote the word-level representation of $\tilde{X}$.
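As a reference point for the objectives below, the MLM loss can be implemented roughly as follows. This is a PyTorch sketch, not the exact implementation; the two-layer head loosely follows BERT's MLM head and is illustrative.

```python
import torch
import torch.nn as nn

class MLMHead(nn.Module):
    """Feed-forward head that recovers masked tokens from the word-level
    representations h_t (a sketch of the FFNN in the MLM formulation above)."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.ffnn = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, vocab_size),
        )

    def forward(self, hidden_states, masked_positions, masked_labels):
        # hidden_states: (B, T, H); gather representations of the masked positions
        h_masked = torch.gather(
            hidden_states, 1,
            masked_positions.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1)),
        )
        logits = self.ffnn(h_masked)                      # (B, |C|, vocab_size)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), masked_labels.reshape(-1)
        )
```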
Existing SPT approaches integrate sentiment knowledge in two main ways: (1) knowledge-guided masking prioritizes masking sentiment knowledge in reviews and leverages MLM to increase the model's awareness of sentiment knowledge; (2) knowledge supervision directly converts sentiment knowledge into labels and then predicts them from the word-level representations and the review-level representation.

Integrating Aspect & Sentiment Words
A common way to integrate aspect and sentiment words is to increase their masking probabilities (Tian et al., 2020; Zhou et al., 2020; Ke et al., 2020; Li et al., 2021). There are two implementations: (1) mask-by-probability masks these words with probability x% and masks other words with probability 15%; (2) mask-by-proportion randomly masks these words up to y% of the total words and masks other words up to (15 − y)% of the total words. The main difference between these two implementations is that the former is more sensitive to the proportion of aspect and sentiment words in the review.
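The two implementations can be sketched as follows; the default values stand in for the x% and y% discussed above and are placeholders rather than the values used in prior work.

```python
import random

def mask_by_probability(is_knowledge, p_know=0.50, p_other=0.15):
    """Mask aspect/sentiment words with probability p_know (the x% above) and
    every other word with the usual 15%."""
    return [random.random() < (p_know if know else p_other) for know in is_knowledge]

def mask_by_proportion(is_knowledge, r_know=0.05, r_total=0.15):
    """Mask aspect/sentiment words up to r_know (the y% above) of the review
    length and other words up to (r_total - r_know), keeping the budget at 15%."""
    n = len(is_knowledge)
    know = [i for i, k in enumerate(is_knowledge) if k]
    other = [i for i, k in enumerate(is_knowledge) if not k]
    random.shuffle(know)
    random.shuffle(other)
    chosen = set(know[: round(r_know * n)]) | set(other[: round((r_total - r_know) * n)])
    return [i in chosen for i in range(n)]
```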
In addition to increasing their masking probability, we propose the strategy of masking their contexts. Our motivation stems from the observation that sentiment expressions often appear close to aspect words, and thus we can leverage aspect words to locate the sentiment-dense segments of a review. We assign a higher masking probability to words that are closer to aspect words. To achieve this, we empirically choose the normal distribution and the geometric distribution for the masking probability assignment, and the corresponding masking strategies are denoted as mask-context-norm and mask-context-geo. Figure 2 provides an illustration of these two masking strategies, and detailed implementations can be found in Appendix C.
Moreover, an alternative way is to convert the aspect and sentiment words to pseudo-labels and then use the word-level representations to predict these pseudo-labels, which can be formulated as $\mathcal{L}_{\mathrm{WORD}} = -\sum_{t=1}^{T} \log P(y_t \mid h_t)$ with $P(y_t \mid h_t) = \mathrm{softmax}(\mathrm{FFNN}(h_t))$, where $y_t \in \{\mathrm{AspW}, \mathrm{Other}\}$ for integrating aspect words, and $y_t \in \{\mathrm{SenW}, \mathrm{Other}\}$ for integrating sentiment words.
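A sketch of this knowledge-supervision head is given below; the same word-level form is reused later for word polarity and part-of-speech prediction. The label inventory and the ignore index are illustrative choices, not the exact implementation.

```python
import torch.nn as nn

class WordLabelHead(nn.Module):
    """Predict word-level pseudo-labels (e.g., {AspW, Other} or {SenW, Other})
    from the word-level representations h_t."""
    def __init__(self, hidden_size: int, num_labels: int = 2):
        super().__init__()
        self.ffnn = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.Tanh(),
            nn.Linear(hidden_size, num_labels),
        )
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 = padding / untagged

    def forward(self, hidden_states, labels):
        logits = self.ffnn(hidden_states)                        # (B, T, num_labels)
        return self.loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```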

Integrating Review Rating
The review's rating score reflects its overall sentiment. To integrate it, Zhou et al. (2020) and Ke et al. (2020) introduce rating prediction. They predict the rating score from the review-level representation $s$ and use the cross-entropy function to calculate the loss $\mathcal{L}_{\mathrm{RAT}} = -\log P(y_{\mathrm{RAT}} \mid s)$, where $y_{\mathrm{RAT}} \in \{1, 2, 3, 4, 5\}$.
Besides, Li et al. (2021) adopt the supervised-contrastive-learning objective (Khosla et al., 2020) to integrate review ratings. With this objective, representations from the same sentiment are pulled closer together than representations from different sentiments. Specifically, the loss for a batch B is calculated as follows:
$$\mathcal{L}_{\mathrm{SCL}} = \sum_{i \in B} \frac{-1}{|P(i)|} \sum_{j \in P(i)} \log \frac{\exp(\mathrm{sim}(i, j)/\tau)}{\sum_{k \in B \setminus \{i\}} \exp(\mathrm{sim}(i, k)/\tau)},$$
where $P(i) = \{j \in B \setminus \{i\} : y_j^{\mathrm{RAT}} = y_i^{\mathrm{RAT}}\}$ is the set of indices of all positives in batch B for i, and τ is the temperature. They use $\mathrm{sim}(i, j) = s_i^{\top} s_j$ on the normalized representations as the similarity metric.
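The objective can be sketched in PyTorch as follows; τ = 0.07 is a placeholder value, and anchors that have no positive in the batch are simply skipped, which is one possible convention rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def supcon_rating_loss(s, ratings, tau=0.07):
    """s: (B, H) review-level representations; ratings: (B,) rating labels."""
    s = F.normalize(s, dim=-1)                       # normalized representations
    sim = s @ s.t() / tau                            # sim(i, j) = s_i^T s_j / tau
    B = s.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=s.device)
    pos_mask = (ratings.unsqueeze(0) == ratings.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))           # exclude i itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)           # avoid -inf * 0

    n_pos = pos_mask.sum(dim=1)
    per_anchor = -(log_prob * pos_mask.float()).sum(dim=1) / n_pos.clamp(min=1)
    return per_anchor[n_pos > 0].mean()
```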

Integrating Other Sentiment Knowledge
Word Polarity. To integrate this knowledge, Tian et al. (2020) introduce word polarity prediction. In this objective, word polarity is inferred from the word-level representations, similar to Equation 6. There are two variants: one only predicts the polarity of masked sentiment words, i.e., y_t ∈ {POS, NEG}; the other predicts the polarity of all masked words, i.e., y_t ∈ {POS, NEG, Other}. The difference is that the latter also includes the label of sentiment words in the supervision.

Aspect-Sentiment Pair. Tian et al. (2020) regard a sentiment word with its nearest noun (the maximum distance is 3) as an aspect-sentiment pair. They argue that aspect-sentiment pairs reveal more information than sentiment words do. Therefore, they propose aspect-sentiment pair prediction to capture the dependency between aspect and sentiment. They randomly mask at most 2 aspect-sentiment pairs in each review and predict them through the review-level representation, i.e., each masked word $x_t$ with $t \in P$ is recovered from $s$, where P is the set of indices of words in the masked aspect-sentiment pairs.

Table 2: Performance of integrating different knowledge in pre-training (F1-score, %). The evaluation metric for ASC is Macro-F1. We boldface those results with significant advantages. For each type of knowledge, we mark the best integration approach with a ⋆. Among the two methods of integrating review ratings, although supervised contrastive learning performs slightly better than cross-entropy, the latter is simpler and more straightforward. Therefore, we mark cross-entropy as the preferred method.

Emoticon. Zhou et al. (2020) point out that integrating emoticons can capture more token-level sentiment knowledge. Consistent with Zhou et al. (2020), we treat emoticons as special tokens during the tokenization process and assign them a masking probability of 50% when masking.

Integrating Syntax Knowledge
Although syntax knowledge has been widely incorporated in fine-tuning various ABSA tasks (Zhang et al., 2019; Huang and Carley, 2019; Wang et al., 2020; Chen et al., 2022), few works explore its impact on SPT. In this paper, we cover four types of syntax knowledge and integrate them through knowledge supervision. We infer part-of-speech tags in the same way as in Equation 6 and transform the predictions of dependency direction, dependency type, and dependency distance into word-pair classification. The word-pair classification can be formulated as $P(y_{ij} \mid h_i, h_j) = \mathrm{softmax}(\mathrm{FFNN}([h_i; h_j]))$, where $[\cdot;\cdot]$ denotes concatenation.

To mitigate the effect of randomness, we run each pre-training approach twice, evaluate each pre-trained model on three downstream tasks 10 times, and report the average results. Moreover, we also pre-train for 400k steps on all domains to fully exploit the potential of SPT.
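Returning to the word-pair classification described above, a minimal PyTorch sketch could look as follows; the pair-sampling interface and the concatenation-plus-FFNN head are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class WordPairHead(nn.Module):
    """Classify word pairs for dependency direction, type, or distance by
    concatenating the two word-level representations."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.ffnn = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size), nn.GELU(),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, hidden_states, pair_indices, labels):
        # pair_indices: (N, 3) rows of (batch_idx, i, j) for the sampled word pairs
        b, i, j = pair_indices[:, 0], pair_indices[:, 1], pair_indices[:, 2]
        pair_repr = torch.cat([hidden_states[b, i], hidden_states[b, j]], dim=-1)
        logits = self.ffnn(pair_repr)
        return nn.functional.cross_entropy(logits, labels)
```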

Main Results
We continue to pre-train BERT with the different SPT approaches and subsequently fine-tune the resulting models on three ABSA tasks. Their performance is reported in Table 2. We see that MLM alone yields notable improvements, with the maximum, nearly 3%, achieved on the ASC task of Restaurant-14. Integrating sentiment and syntax knowledge leads to a variety of impacts.
What impact do different types of sentiment knowledge have on downstream ABSA tasks? Most sentiment knowledge contributes to performance improvement on the ASC task, with sentiment words, review ratings, and aspect-sentiment pairs showing the highest potential. Integrating aspect words provides general benefits, and masking their contexts improves performance on nearly all downstream tasks. The impact of integrating emoticons is minimal.
Which knowledge integration method is most effective? For aspect words, masking their contexts has a generally positive impact, while increasing their masking probabilities does not. This finding suggests that predicting the context of an aspect word is helpful, as the context often contains the key cues of its sentiment. For sentiment words, mask-by-proportion is better than mask-by-probability. This is because the former can better balance the masking proportion of sentiment knowledge and general knowledge. For review ratings, we observe that supervised contrastive learning does not show a salient advantage over cross-entropy, indicating that the application of contrastive learning to SPT still needs exploration.
Does integrating syntax knowledge have positive impacts? Aspect terms are often phrases, such as the orecchiette with sausage and chicken, and these phrases tend to follow certain part-of-speech patterns. Accordingly, integrating syntax knowledge helps the model extract aspect terms and aspect-oriented opinion terms.

Can integrating multiple types of knowledge simultaneously lead to better results? According to the results in Table 2, integrating multiple types of knowledge simultaneously does not necessarily lead to better performance. The most obvious evidence lies in the combination of aspect words and aspect-sentiment pairs. In most scenarios, the introduction of an additional type of knowledge brings both benefits and drawbacks. Experimental results highlight that the combination of aspect words, review ratings, and syntax knowledge achieves the best trade-off, yielding an average improvement of 2.42% over BERT. Further, we compare the best combination with previous SPT works and present the results in Table 3. These results show that under the same computational cost, this combination outperforms previous works in most cases. When pre-training for 400k steps, we observe an average improvement of 3.25% over BERT. In addition, we note an anomaly in the model pre-trained on all domains. Specifically, its performance on the AOE task of Laptop-14 does not increase with the number of pre-training steps. This may be because the dependency between the aspect term and the opinion term varies between domains. Further exploration of this phenomenon is warranted in future research.

Results on More Downstream Tasks
In addition to the three basic ABSA tasks, we further evaluate the SPT model on more ABSA tasks and datasets.
Aspect Sentiment Triplet Extraction (ASTE) aims to extract the aspect terms along with the corresponding opinion terms and the expressed sentiment (Peng et al., 2020). As a compound task, ASTE evaluates the model's understanding of aspect-level sentiments comprehensively. We take the pre-trained model as the language encoder and select three classical methods for triplet extraction: GTS (Wu et al., 2020a), BMRC (Chen et al., 2021), and Span-ASTE (Xu et al., 2021). We conduct experiments on ASTE-Data-v2 (Xu et al., 2020b) and present the results in Table 4.
Experimental results show that SPT can generally improve the performance of ASTE. The best pre-trained model is our SPT (400k), which achieves an average improvement of 3.31% and 3.93% on GTS and Span-ASTE, respectively. Additionally, we observe that SPT brings a smaller improvement on BMRC, suggesting that the machine reading comprehension paradigm relies more on the model's understanding of natural language statements than on the quality of the representations of individual words.
Cross-domain ABSA aims to transfer ABSA annotations from a resource-rich domain to a resource-poor domain (Gong et al., 2020). This task requires the model to learn domain-invariant sentiment knowledge. We leverage it to evaluate the cross-domain capabilities of pre-trained models. We conduct experiments on the datasets released by Gong et al. (2020) and list the results in Table 5. We find that SPT greatly boosts the performance of BERT on cross-domain ABSA. The best models are those pre-trained on a mixture of multiple domains (BERT REVIEW and Our SPT (400k)). Besides, we notice that removing syntax knowledge causes a significant drop in performance, highlighting the importance of syntax knowledge in cross-domain ABSA. Despite achieving notable improvements over BERT, SPT alone has not yet achieved satisfactory performance, suggesting that addressing cross-domain ABSA requires more than just employing SPT.
MAMS (Jiang et al., 2019) is a challenging benchmark dataset for ABSA, where each review contains at least two different aspects with different sentiments. We list the experimental results on MAMS in Table 6. We find that SPT still shows performance gains on this dataset, but only at most 1%, which is relatively lower than the gains observed on other datasets. This suggests that SPT for multi-aspect scenarios deserves further exploration.

Further Analysis
Effect of SPT on Data-scarce Scenarios. Data scarcity is a critical challenge in ABSA. We explore the effect of SPT under different amounts of training data. As depicted in Figure 3, the improvements from SPT become more obvious with less training data, with maximums of 5.11% and 6.65%. Furthermore, with SPT, the performance originally achieved using the entire training data can be attained using only 40% of it. This suggests that SPT is a feasible solution to alleviate the issue of data scarcity in ABSA.

Table 7: MPQA and Hu2004 denote annotating sentiment words through the sentiment lexicons provided by Deng and Wiebe (2015) and Hu and Liu (2004), respectively. The three baselines annotate the noun closest to every sentiment word as the aspect word.
Evaluation of Knowledge-mining Methods. We utilize Aspect term Extraction (AE) and Opinion term Extraction (OE) to indirectly evaluate knowledge-mining methods. Since opinion terms and aspect terms are typically phrases while mining results are at the word level, we use overlap-F1 as the evaluation metric. The difference from the normal F1-score is that overlap-F1 counts a prediction as correct as long as it overlaps with any gold-truth term. We conduct experiments on the datasets provided by Wang et al. (2017) and list the results in Table 7. According to these results, our knowledge-mining method exhibits significant improvements over previous methods in both pre-

Conclusion
In this paper, we perform an empirical study of Sentiment-enhanced Pre-Training (SPT). Our study investigates the impacts of integrating sentiment knowledge and other linguistic knowledge in pre-training on Aspect-Based Sentiment Analysis (ABSA). To enable our study, we first develop an effective knowledge-mining approach, leverage it to build a large-scale SPT corpus, and then select a range of ABSA tasks as the benchmark to systematically evaluate a pre-trained model's understanding of aspect-level sentiments. Experimental results reveal the following findings: (1) integrating aspect words brings general benefits to downstream tasks; (2) integrating sentiment words, review ratings, or aspect-sentiment pairs significantly improves the performance on aspect-level sentiment classification; (3) integrating syntax knowledge can help the model extract aspect terms and aspect-oriented opinion terms; and (4) the combination of aspect words, review ratings, and syntax knowledge achieves the best trade-off, yielding an average improvement of 3.25% over BERT. We further examine SPT's effectiveness on more ABSA tasks and find that SPT can improve the performance of a wide range of downstream tasks. Notably, SPT improves the model's cross-domain capabilities. In addition, we also demonstrate the effectiveness of our knowledge-mining method.

Existing works treat the nearest nouns to the sentiment words as aspect words or build an aspect lexicon based on the aspect annotations of the existing downstream datasets. However, these knowledge-mining methods either lack domain adaptability or are unscalable. Therefore, this paper develops an effective knowledge-mining method.
Hypothesis. Any two words in the same sentence are connected by a syntactic path. This paper denotes a syntactic path as a sequence of dependency relations and part-of-speech tags. For example, given the sentence "we had a lamb pie pizza that was awful", the syntactic path from pizza to awful is denoted as (NOUN, ...). Qiu et al. (2011) observe that there are some syntactic paths linking aspect words and sentiment words. Based on this observation, we assume that there exist lexicons $L_A$, $L_S$ and an aspect-sentiment path set $P_{AS}$ that satisfy: if $w_i$ and $w_j$ are linked by path $p$, then $w_i \in L_A \wedge p \in P_{AS} \Rightarrow w_j \in L_S$. (18) We leverage this assumption to mine the lexicons and path sets.
Initialization.We initialize the sentiment lexicon

B.2 Polarity Assignment
We leverage the reviews' rating scores for polarity assignment. We empirically treat 5-star reviews as positive and 1-, 2-, and 3-star reviews as negative. For a sentiment word $w_i$, we count its occurrences in positive and negative reviews, denoted as $\#w_i\_pos$ and $\#w_i\_neg$. The polarity score of this word can then be calculated by
$$\mathrm{score}(w_i) = \log_2 \frac{\#w_i\_pos \,/\, \#pos}{\#w_i\_neg \,/\, \#neg}, \qquad (22)$$
where $\#pos$ and $\#neg$ denote the total numbers of positive and negative reviews. Equation (22) is derived from Pointwise Mutual Information (PMI) (Turney, 2002).
For each domain, we calculate the polarity score of each sentiment word on all reviews.Then, we empirically assign those sentiment words whose polarity scores are greater than 0.2 as POS and those sentiment words whose polarity scores are less than -0.2 as NEG.
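A plausible instantiation of this scoring and assignment step is sketched below; the add-one smoothing is our own addition to avoid division by zero and is not part of Equation (22).

```python
import math

def polarity_score(n_w_pos, n_w_neg, n_pos, n_neg):
    """PMI-style polarity score of a sentiment word from its counts in
    positive (5-star) and negative (1-3 star) reviews."""
    p_w_pos = (n_w_pos + 1) / (n_pos + 1)   # add-one smoothing (our assumption)
    p_w_neg = (n_w_neg + 1) / (n_neg + 1)
    return math.log2(p_w_pos / p_w_neg)

def assign_polarity(score, threshold=0.2):
    """Scores above +0.2 become POS, below -0.2 become NEG, otherwise unassigned."""
    if score > threshold:
        return "POS"
    if score < -threshold:
        return "NEG"
    return None
```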

C Detailed Implementation of Masking Context
We leverage aspect words to locate the sentiment-dense segments of the review and increase the masking probability of their contexts.
Masking Context by Normal Distribution. Suppose there is only one aspect word in the review, and its position is t. We use the normal distribution N(t, σ) for the masking probability assignment, where σ is a hyper-parameter: the masking probability of the word at position i is proportional to the density of N(t, σ) at i. In practice, a review often contains more than one aspect word. Therefore, we sample k aspect words for each review, repeat the probability calculation k times, and take the maximum probability for each word. The value of k is determined by the review length T and a hyper-parameter z. Finally, we perform normalization to ensure that the masked part is 15% of the review.

Masking Context by Geometric Distribution. We also use a geometric distribution for the masking probability assignment, where the masking probability decays geometrically with the distance to the aspect word and p is a hyper-parameter. We also sample k aspect words and perform normalization.
We find that both masking context methods are highly sensitive to hyper-parameters.In our experiment, we set σ = 6, p = 0.4, and z = 0.1.
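Under the settings above (σ = 6, p = 0.4, z = 0.1), the probability assignment can be sketched as follows. The use of SciPy, the exact dependence of k on the review length, and the normalization scheme are our own assumptions for illustration.

```python
import random
import numpy as np
from scipy.stats import norm, geom

def context_mask_probs(length, aspect_positions, mode="norm",
                       sigma=6, p=0.4, z=0.1, mask_ratio=0.15):
    """Assign masking probabilities that decay with distance to sampled aspect words."""
    if not aspect_positions:                          # no aspect words: fall back to uniform
        return np.full(length, mask_ratio)

    k = max(1, int(z * length))                       # number of aspect words to sample
    sampled = random.sample(list(aspect_positions), min(k, len(aspect_positions)))
    positions = np.arange(length)

    scores = np.zeros(length)
    for t in sampled:
        if mode == "norm":                            # mask-context-norm
            s = norm.pdf(positions, loc=t, scale=sigma)
        else:                                         # mask-context-geo
            s = geom.pmf(np.abs(positions - t) + 1, p)
        scores = np.maximum(scores, s)                # keep the max over sampled aspects

    # normalize so the expected number of masked words is 15% of the review
    return scores * (mask_ratio * length / scores.sum())
```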

D Additional Notes for Existing SPT Approaches
Note for SKEP. Tian et al. (2020) propose aspect-sentiment pair prediction to capture the dependency between aspect and sentiment. They regard aspect-sentiment pair prediction as a multi-label classification task and further transform it into multiple binary classification tasks. However, their implementation only includes positive samples and ignores negative samples when calculating the loss. Therefore, we adopt a different implementation to correct this mistake. This implementation is described in Equations 12 and 13. Besides, since our pre-training corpus contains aspect annotations, we regard a sentiment word with its nearest aspect word as an aspect-sentiment pair.

E Hyper-Parameters
We list the detailed hyper-parameters of SPT in Table 12. In our SPT, we integrate ASPECTWORD, REVIEWRATING, and SYNTAX in pre-training. Therefore, the loss for SPT combines the MLM, aspect-word, review-rating, and syntax objectives, where the syntax term is calculated as $\mathcal{L}_{SYN} = \alpha_3 \mathcal{L}_{PoS} + \alpha_4 \mathcal{L}_{DIR} + \alpha_5 \mathcal{L}_{DIS}$. (27)

Figure 1: (a) left: four types of sentiment knowledge. (b) right: dependency links between aspect words and sentiment words. Aspect words and sentiment words are marked in blue and orange, respectively.

Figure 2: Illustration of masking strategies for integrating aspect words.

Figure 3: Performance of AE and ASC on Restaurant-14 under different amounts of training data.

Figure 4: Performance of three downstream ABSA tasks with different pre-training steps.

Table 1: Examples of three downstream ABSA tasks.
ASC: [CLS] delicious mushroom pizza but slow and rude delivery [SEP] mushroom pizza [SEP] → POS
ASC: [CLS] delicious mushroom pizza but slow and rude delivery [SEP] delivery [SEP] → NEG
AOE: [CLS] delicious mushroom pizza but slow and rude delivery [SEP] mushroom pizza [SEP] → S B O O O O O O O O O O E
AOE: [CLS] delicious mushroom pizza but slow and rude delivery [SEP] delivery [SEP] → S O O O O B O B O O O E

Table 3: Comparison results with the previous SPT works. Our SPT refers to the combination of aspect words, review ratings, and syntax knowledge. The original SKEP and SENTILARE are not pre-trained based on BERT-base-uncased, so we reproduce them on our SPT corpus. We convert the computational cost into training steps. See our notes in Appendix D for this conversion.
We initialize the model weights with BERT-base-uncased. We implement pre-training with a batch size of 1000 and an initial learning rate of 2e-4. See Appendix E for the detailed hyper-parameters. Our pre-training corpus covers 28 domains, such as Restaurant, Laptop, and Books. For most experiments, we only pre-train 10k steps on both Restaurant and Laptop.


Table 4: Performance on the ASTE task (F1-score, %). Results are the average of 5 runs.

Table 5: Performance on two cross-domain ABSA tasks (F1-score, %). This table only presents the average performance of AE and end-to-end ABSA; full results are listed in Appendix F. Performance of the previous SOTA comes from Yu et al. (2021). ASPECTWORD+RATING (10k) denotes removing syntax knowledge from Our SPT.

Table 8: Case study on the ASC task. The aspect terms are marked in orange.
Note for Pre-training Step. Existing SPT works have different setups. We estimate their computation based on batch size, maximum text length, model architecture, and training steps. Then, we calculate the number of training steps needed to reach this computation under our settings.