Sentiment-Aware Word and Sentence Level Pre-training for Sentiment Analysis

Most existing pre-trained language representation models (PLMs) are sub-optimal in sentiment analysis tasks, as they capture word-level sentiment information while under-considering sentence-level information. In this paper, we propose SentiWSP, a novel Sentiment-aware pre-trained language model with combined Word-level and Sentence-level Pre-training tasks. The word-level pre-training task detects replaced sentiment words, via a generator-discriminator framework, to enhance the PLM's knowledge about sentiment words. The sentence-level pre-training task further strengthens the discriminator via a contrastive learning framework, with similar sentences as negative samples, to encode sentiments in a sentence. Extensive experimental results show that SentiWSP achieves new state-of-the-art performance on various sentence-level and aspect-level sentiment classification benchmarks. We have made our code and model publicly available at https://github.com/XMUDM/SentiWSP.


Introduction
Sentiment analysis plays a fundamental role in natural language processing (NLP) and powers a broad spectrum of important business applications such as marketing (HaCohen-Kerner, 2019) and campaign monitoring (Sandoval-Almazán and Valle-Cruz, 2020). Two typical sentiment analysis tasks are sentence-level sentiment classification (Xu et al., 2019; Yin et al., 2020; Tang et al., 2022) and aspect-level sentiment classification (Li et al., 2021b).
Despite their progress, the application of general-purpose PLMs in sentiment analysis is limited, because they fail to distinguish the importance of different words to a specific task. For example, it is shown in (Kassner and Schütze, 2020) that general-purpose PLMs have difficulties dealing with contradictory sentiment words or negation expressions, which are critical in sentiment analysis. To address this problem, recent sentiment-aware PLMs introduce word-level sentiment information, such as token sentiments and emoticons (Zhou et al., 2020), aspect words (Tian et al., 2020), word-level linguistic knowledge (Ke et al., 2020), and implicit sentiment-knowledge information (Li et al., 2021b). These word-level pre-training tasks, e.g., sentiment word prediction and word polarity prediction, mainly learn from the masked words and are inefficient at capturing word-level information for all input words. Furthermore, the sentiment expressed in a sentence is beyond the simple aggregation of word-level sentiments. However, general-purpose PLMs and existing sentiment-aware PLMs under-consider sentence-level sentiment information.
In this paper, we propose a novel sentiment-aware pre-trained language model called SentiWSP, which combines word-level pre-training and sentence-level pre-training. Inspired by ELECTRA (Clark et al., 2020), which pre-trains a masked language model with significantly less computational resources, we adopt a generator-discriminator framework in the word-level pre-training. The generator aims to replace masked words with plausible alternatives, and the discriminator aims to predict whether each word in the sentence is an original word or a substitution. To tailor this framework for sentiment analysis, we mask two types of words for generation: sentiment words and non-sentiment words. We increase the portion of masked sentiment words so that the model focuses more on the sentiment expressions.
For sentence-level pre-training, we design a contrastive learning framework to improve the embeddings encoded by the discriminator. The query for the contrastive learning is constructed by masking sentiment expressions in a sentence. The positive example is the original sentence. The negative examples are first selected from in-batch samples and then from cross-batch similar samples using an asynchronously updated approximate nearest neighbor (ANN) index. In this way, the discriminator, which will be used as the encoder for downstream tasks, learns to distinguish different sentiment polarities even if the sentences are superficially similar.
Our main contributions are threefold: (1) SentiWSP strengthens word-level pre-training via masked sentiment word generation and detection, which is more sample-efficient and benefits various sentiment classification tasks; (2) SentiWSP combines word-level pre-training with sentence-level pre-training, which has been under-considered in previous studies. SentiWSP adopts contrastive learning in the pre-training, where sentences are progressively contrasted with in-batch and cross-batch hard negatives, so that the model is empowered to encode detailed sentiment information of a sentence; (3) We conduct extensive experiments on sentence-level and aspect-level sentiment classification tasks, and show that SentiWSP achieves new state-of-the-art performance on multiple benchmarking datasets.

Related Work
Pre-training and Representation Learning Pre-training models have shown great success across various NLP tasks (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019). Existing studies mostly use a Transformer-based (Vaswani et al., 2017) encoder to capture contextual features, along with masked language modeling (MLM) and/or next sentence prediction (Devlin et al., 2019) as the pre-training tasks. Yang et al. (2019) propose XLNet, which is pre-trained using a generalized autoregressive method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. ELECTRA (Clark et al., 2020) is a generator-discriminator framework, where the generator performs masked token generation and the discriminator performs the replaced token detection pre-training task. It is more efficient than MLM because the discriminator models all input tokens rather than the masked tokens only. Our work improves ELECTRA's performance on sentiment analysis tasks by specifying masked sentiment words in word-level pre-training and combining it with sentence-level pre-training.
In addition to the pre-training models that encode token representations, sentence-level and passage-level representation learning have undergone rapid development in recent years. A surge of work demonstrates that contrastive learning is an effective framework for sentence- and passage-level representation learning (Meng et al., 2021; Wei et al., 2021; Gao et al., 2021; Li et al., 2021a). The common idea of contrastive learning is to pull together an anchor and a "positive" sample in the embedding space, and push apart the anchor from "negative" samples. Recently, COCO-LM (Meng et al., 2021) creates positive samples by masking and cropping tokens from sentences. Gao et al. (2021) demonstrate that constructing positive pairs with only standard dropout as minimal data augmentation works surprisingly well on the Natural Language Inference (NLI) task. Karpukhin et al. (2020) investigate the impact of different negative sampling strategies for passage representation learning based on the tasks of passage retrieval and question answering. ANCE (Xiong et al., 2021) adopts approximate nearest neighbor negative contrastive learning, a learning mechanism that selects hard negatives globally from the entire corpus using an asynchronously updated Approximate Nearest Neighbor (ANN) index. Inspired by COCO-LM (Meng et al., 2021) and ANCE (Xiong et al., 2021), we construct positive samples by masking a span of words from a sentence, and construct cross-batch hard negative samples to enhance the discriminator in sentence-level pre-training.
Pre-trained Models for Sentiment Analysis In the field of sentiment analysis, BERT-PT (Xu et al., 2019) conducts post-training on corpora that belong to the same domain as the downstream tasks to benefit aspect-level sentiment classification. SKEP (Tian et al., 2020) constructs three sentiment knowledge prediction objectives in order to learn a unified sentiment representation for multiple sentiment analysis tasks. SENTIX (Zhou et al., 2020) learns sentiment knowledge from large-scale review datasets, and utilizes it for cross-domain sentiment classification tasks without fine-tuning. SentiBERT (Yin et al., 2020) proposes a two-level attention mechanism on top of the BERT representation to capture phrase-level compositional semantics. SentiLARE (Ke et al., 2020) devises a new pre-training task called label-aware masked language model to construct knowledge-aware language representations. SCAPT (Li et al., 2021b) captures both implicit and explicit sentiment orientation from reviews by aligning the representations of implicit sentiment expressions to those with the same sentiment label.

SentiWSP works in two stages. In word-level pre-training, an input sentence flows through a word-masking step, followed by a generator to replace the masked words and a discriminator to detect the replacements; the generator and discriminator are jointly trained in this stage. Then, the training of the discriminator continues in sentence-level pre-training: each input sentence is masked at sentiment words to construct a query, while the original sentence is treated as the positive sample, and their embeddings encoded by the discriminator are contrasted with two types of negative samples constructed in an in-batch warm-up training step and a cross-batch approximate nearest neighbor training step. Finally, the discriminator is fine-tuned on the downstream task.
Compared with previous studies, the discriminator in SentiWSP has three advantages. (1) Instead of random token replacement and detection, SentiWSP masks a large portion of sentiment words, and thus the discriminator pays more attention to word-level sentiments. (2) Instead of pure masked token prediction, SentiWSP incorporates context information from all input words via a replacement detection task. (3) SentiWSP combines sentence-level sentiments with word-level sentiments by progressively contrasting a sentence with missing sentiments against superficially similar sentences.

Word-Level Pre-training
Word masking Different from previous random word masking (Devlin et al., 2019; Clark et al., 2020), our goal is to corrupt the sentiment of the input sentence.
In detail, we first randomly mask 15% of the words, the same as ELECTRA (Clark et al., 2020). Then, we use SentiWordNet (Baccianella et al., 2010) to mark the positions of sentiment words in a sentence, and mask the sentiment words until a certain proportion p_w of them are hidden. We empirically find that the sentiment word masking proportion p_w = 50% achieves the best results.
In the example in Figure 1 (left), the sentiment words "sassy" and "charming" are masked while "smart" is not; "comedy" is masked as a random non-sentiment word.
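The two-step masking above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the tiny lexicon below stands in for SentiWordNet, and the function name and defaults (15% random masking, p_w = 50%) follow the text.

```python
import random

def sentiment_aware_mask(tokens, sentiment_lexicon, p_random=0.15, p_w=0.5, seed=0):
    """Mask 15% of tokens at random, then keep masking sentiment words
    until a proportion p_w of them is hidden (illustrative sketch)."""
    rng = random.Random(seed)
    masked = list(tokens)
    # Step 1: random masking, as in ELECTRA.
    for i in range(len(masked)):
        if rng.random() < p_random:
            masked[i] = "[MASK]"
    # Step 2: top up sentiment-word masking to proportion p_w.
    senti_pos = [i for i, t in enumerate(tokens) if t in sentiment_lexicon]
    target = int(p_w * len(senti_pos))
    hidden = [i for i in senti_pos if masked[i] == "[MASK]"]
    candidates = [i for i in senti_pos if masked[i] != "[MASK]"]
    rng.shuffle(candidates)
    for i in candidates[:max(0, target - len(hidden))]:
        masked[i] = "[MASK]"
    return masked

# Toy lexicon in place of SentiWordNet.
lexicon = {"sassy", "smart", "charming"}
tokens = "a sassy smart and charming comedy".split()
print(sentiment_aware_mask(tokens, lexicon))
```

The two-pass design guarantees that sentiment words reach the target masking proportion even when the initial random pass misses them.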
Generator Next, a generator G processes the masked sentence and generates a corrupted sentence. As in ELECTRA (Clark et al., 2020), the generator is a small-sized Transformer. Formally, the sentence is a sequence of words s = [w_1, w_2, ..., w_n], and the mask indicators are denoted as m = [m_1, m_2, ..., m_n], m_t ∈ {0, 1}. We obtain s_mask from the word masking step by replacing every masked-out word with the special token [MASK]. For a given position t (in our case, only positions where the token is [MASK]), the generator G outputs the probability of generating a particular token w_t with a softmax layer:

p_G(w_t | s_mask) = exp(e_{w_t}^T h_G(s_mask)_t) / Σ_{w'} exp(e_{w'}^T h_G(s_mask)_t),

where e_{w_t} denotes the word embedding of w_t and h_G(s_mask)_t is the generator's contextual representation at position t.
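The generator's output distribution can be illustrated on a toy vocabulary. This is a hypothetical sketch: the token names and logits below are made up, and a real generator would produce logits from a Transformer rather than a fixed list.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_replacement(vocab, logits, rng):
    """Sample a replacement token from p_G(. | s_mask)
    rather than taking the argmax."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Toy vocabulary and logits for one [MASK] position.
vocab = ["funny", "boring", "charming", "movie"]
logits = [2.0, 0.5, 1.8, -1.0]
rng = random.Random(42)
print(sample_replacement(vocab, logits, rng))
```

Sampling rather than argmax-decoding is what produces the diverse substitutions discussed next.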
We then replace each masked word w_t by randomly sampling from p_G(w_t | s_mask). Sampling introduces randomness, which is beneficial for training the discriminator; in contrast, always selecting the word with the highest probability is likely to regenerate the original word, leaving the discriminator with few informative substitutions to distinguish from original words. Formally, the replacing process is defined as ŵ_t ∼ p_G(w_t | s_mask) for all positions with m_t = 1. We denote the corrupted sequence as s_rep = [w_rep_1, ..., w_rep_n], where w_rep_t = ŵ_t for all m_t = 1 and w_rep_t = w_t otherwise.

Discriminator For the corrupted sentence, the discriminator D, i.e., a larger-sized Transformer, encodes the corrupted sentence to h_D(s_rep), and predicts whether each word w_t comes from the data or the generator, using a sigmoid output layer:

p_D(t | s_rep) = sigmoid(w^T h_D(s_rep)_t),

where w is a learnable vector. We jointly train the generator and the discriminator. The generator G is trained by maximum likelihood estimation, and the discriminator D is trained by cross entropy.
The joint objective is

min_{θ_G, θ_D} Σ_{x ∈ X} L_MLM(x, θ_G) + λ L_Disc(x, θ_D),

where X denotes a large corpus of raw text and λ is the coefficient of the discriminator loss.
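The joint objective can be sketched numerically. This is an illustrative toy, not the paper's implementation: λ = 50 is the discriminator weight reported for ELECTRA, not a value stated here, and the generator NLL and discriminator probabilities below are made-up numbers.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one token's replaced-vs-original prediction."""
    eps = 1e-12
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))

def joint_loss(gen_nll, disc_probs, disc_labels, lam=50.0):
    """Combine the generator MLM loss with the lambda-weighted
    discriminator detection loss, averaged over all input tokens."""
    disc_loss = sum(bce(p, y) for p, y in zip(disc_probs, disc_labels)) / len(disc_probs)
    return gen_nll + lam * disc_loss

# Toy numbers: generator NLL 2.3; discriminator predictions over 4 tokens,
# where label 1 marks a replaced token.
loss = joint_loss(2.3, [0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])
print(round(loss, 3))
```

Note that the discriminator loss runs over every input token, not just the masked ones, which is the source of ELECTRA-style sample efficiency mentioned in the text.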

Sentence-Level Pre-training
For sentence-level pre-training, we follow the contrastive framework in Chen et al. (2020).The goal of contrastive learning is to learn effective representations by pulling together similar samples (i.e., the positive samples) and pushing away different samples (i.e., the negative samples).
One critical question in contrastive learning is how to construct the query (anchor) and its positive/negative samples, i.e., the pairs (q_i, d_i^+) and (q_i, d_i^-). As the example shown in Figure 1 (right), given a sentence s_i from corpus C, we first mask out a certain percentage (70% in this research) of the sentiment words in the sentence to construct q_i, and treat the raw sentence as the positive example d_i^+.

In-batch warm-up training We then fetch the discriminator model D already trained in word-level pre-training and conduct a warm-up sentence-level training with in-batch negatives. In detail, we feed the input (q_i, d_i^+) to the encoder D to get the representations f_i and f_i^+, and train the encoder to minimize the distance between the positive pairs within a mini-batch using the negative log-likelihood loss

L_warm = - Σ_{i=1}^{|B|} log [ exp(sim(f_i, f_i^+)/τ) / Σ_{j=1}^{|B|} exp(sim(f_i, f_j^+)/τ) ],

where τ is a temperature hyperparameter, |B| denotes the size of the mini-batch B, and sim(·, ·) denotes cosine similarity between two vectors.

Cross-batch ANN training After the warm-up, we encode all sentences in corpus C and use ANN search to retrieve the top-k negative examples closest to each query; we then sample t negative examples from the top-k as hard negatives (the hyperparameters k and t are set to 100 and 7, respectively). To maintain an up-to-date ANN index, two operations are required: (1) inference: refresh the embeddings of all sentences in the corpus with the updated model D; and (2) indexing: rebuild the ANN index with the updated embeddings. Although indexing is efficient (Johnson et al., 2021), inference for each batch is expensive as it requires a forward pass over the entire corpus. To balance the time cost between inference and indexing, we use an asynchronous refresh mechanism similar to Guu et al. (2020) and update the ANN index every m steps. As illustrated in the top right part of Figure 1, we construct a trainer to optimize D, and an inferencer that uses the latest checkpoint (e.g., checkpoint k − 1) to recompute the encodings f^{k−1} of the entire corpus and update ANN_{f^{k−1}}. Then, the trainer optimizes a cross-entropy objective with negative samples generated from ANN_{f^{k−1}} and the original positive pair (q_i, d_i^+):
min_{θ_D} Σ_{(q_i, d_i^+) ∈ B_k} - log [ exp(sim(f_i, f_i^+)/τ) / ( exp(sim(f_i, f_i^+)/τ) + Σ_j exp(sim(f_i, f_j^-)/τ) ) ],

where B_k is the mini-batch at checkpoint k, and f_i, f_i^+, f_j^- indicate the discriminator D's embeddings of the query, the positive sample, and the negative samples generated from the asynchronously updated ANN index, respectively.
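The per-query contrastive term can be computed directly from embeddings. This is a minimal sketch with toy 2-dimensional vectors; τ = 0.05 is an assumed value (a common SimCSE-style setting), not one stated for this objective in the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_nll(f_q, f_pos, f_negs, tau=0.05):
    """Negative log-likelihood of the positive against a set of
    hard negatives, as in the cross-batch objective."""
    pos = math.exp(cosine(f_q, f_pos) / tau)
    negs = sum(math.exp(cosine(f_q, n) / tau) for n in f_negs)
    return -math.log(pos / (pos + negs))

# Toy embeddings: a query, its positive, and two ANN-retrieved hard negatives.
q = [1.0, 0.0]
pos = [0.9, 0.1]
hard_negs = [[0.7, 0.7], [0.0, 1.0]]
print(round(contrastive_nll(q, pos, hard_negs), 4))
```

The loss shrinks as the query moves toward its positive and grows as it approaches the hard negatives, which is exactly the pressure the ANN-refreshed negatives are meant to apply.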

Fine-tuning
After the pre-training, we fine-tune our model on downstream sentiment analysis tasks. For the sentence-level sentiment classification task, we format the input sequence as [CLS], e_1, ..., e_n, [SEP], and take the representation at the [CLS] token to predict the sentiment label y, which indicates the sentiment polarity of the sentence.
For the aspect-level sentiment classification task, we format the input sequence as [CLS], e_1, ..., e_n, [SEP], a_1, ..., a_m, [SEP], where a_1, ..., a_m denotes the phrase of a particular aspect. We fetch the representation at the [CLS] token to predict the sentiment label y of the sentence with respect to the aspect.
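The two input formats amount to simple token-sequence packing. The sketch below assumes standard BERT-style sequence-pair packing for the aspect input; the function names are illustrative, not from the released code.

```python
def sentence_input(tokens):
    """Sentence-level input: [CLS] e_1 ... e_n [SEP]."""
    return ["[CLS]"] + tokens + ["[SEP]"]

def aspect_input(tokens, aspect_tokens):
    """Aspect-level input: [CLS] e_1 ... e_n [SEP] a_1 ... a_m [SEP]."""
    return ["[CLS]"] + tokens + ["[SEP]"] + aspect_tokens + ["[SEP]"]

print(aspect_input("the battery lasts long".split(), ["battery"]))
```

In both cases the classifier head reads only the final-layer representation at the [CLS] position.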

Datasets
For SentiWSP pre-training, we use the same English Wikipedia corpus as Devlin et al. (2019). We select 2 million sentences with a maximum length of 128 for the word-level pre-training, and select 500,000 sentences with a 20%-30% proportion of sentiment words for the sentence-level pre-training.

Implementation Details
During pre-training, we use the AdamW optimizer and a linear learning rate scheduler, and we set the maximum sequence length to 128. The learning rate is initialized to 2e-5 and 1e-5 for the base and large models, respectively. For word-level pre-training, we use ELECTRA (Clark et al., 2020) to initialize G and D. We set the proportion of sentiment word masking to p_w = 0.5 and keep the other hyperparameters the same as ELECTRA. For sentence-level pre-training, we follow the settings of unsupervised SimCSE (Gao et al., 2021) for the warm-up training, and set the proportion of sentiment word masking to p_s = 0.7. The detailed batch sizes and training steps for the different levels of pre-training are listed in Appendix A.
For fine-tuning, we use the hyperparameters from Clark et al. (2019) for the most part. We fine-tune 3-5 epochs for sentence-level sentiment classification and 7-10 epochs for aspect-level sentiment classification tasks. The learning rate of the base and large models for fine-tuning is set to 2e-5 and 1e-5, respectively. We use a linear learning rate scheduler with 10% warm-up steps.
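The linear scheduler with 10% warm-up can be sketched as a small function. This is a generic sketch of that schedule shape, assuming warm-up to the peak rate followed by linear decay to zero; it is not taken from the authors' code.

```python
def linear_lr(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Linear warm-up for the first warmup_frac of steps,
    then linear decay of the learning rate to zero."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup))

total = 1000
print(linear_lr(99, total))   # last warm-up step: peak rate
print(linear_lr(999, total))  # final step: near zero
```

With 2e-5 for the base model, the rate climbs to the peak over the first 10% of steps and then decays linearly for the remaining 90%.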

Comparative Results
We list the performance of different models in Table 1. According to the results, we have several findings: (1) SentiWSP consistently outperforms all baselines on sentence-level classification tasks, which demonstrates the superiority of SentiWSP in capturing sentence-level semantics. (2) On the aspect-level sentiment classification tasks, SentiWSP boosts Acc by 0.93 and MF1 by 1.67 on the Laptop14 dataset. It also achieves a competitive performance on the Restaurant14 dataset, i.e., the second best among all competitors. (3) SentiWSP is significantly better than ELECTRA on both sentiment analysis tasks, on all datasets. This observation verifies the effectiveness of the proposed sentiment-aware pre-training strategy.

Ablation Study
Sentiment word masking proportion. From Table 4, we find that the model performs the worst when p_w = 0 (i.e., the same masking strategy as ELECTRA), which verifies our assumption that extra sentiment word masking is beneficial for the model to encode sentiment information. Besides, we find that masking and replacing 50% of the sentiment words yields the best result.

Negative sample size. In Table 6, we report the results with different negative samples in the sentence-level pre-training. From the table we have two findings: (1) When only in-batch negative samples are used, the model performs the worst. We argue the reason is that in-batch negatives are too simple for the model to distinguish from a positive sample, and continued training on these "easy" negatives does not bring further improvements.
(2) When we increase the cross-batch negative sample size from 1 to 10, the model is provided with more informative negative samples. Therefore, the model can learn more detailed sentiment information and the accuracy improves. However, when we use a large number of cross-batch negative samples (e.g., 13), the negative samples vary in quality, and the model suffers from less similar negatives.

Similarity and loss function. For sentence-level pre-training, we compare two commonly adopted similarity functions, i.e., cosine distance and dot product, to measure the similarity between two sentence embeddings. The difference between them is that the dot product does not incorporate L2 normalization. We also compare two widely used loss functions in contrastive learning and ranking problems, i.e., the negative log-likelihood (NLL) loss and the triplet loss (Schroff et al., 2015). The difference between them is that the triplet loss compares a positive example and a negative one directly with respect to a query. We report the comparisons in Table 7.
From the table we have two observations: (1) With different loss functions, the cosine distance appears to be a more accurate measurement for sentence similarity and outperforms the dot product.
(2) The NLL loss produces better results, with different similarity functions, than the Triplet loss.
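The relation between the two similarity functions is worth making concrete: cosine similarity is just the dot product after L2 normalization, so it is invariant to embedding scale while the raw dot product is not. A minimal sketch:

```python
import math

def dot(u, v):
    """Plain dot product (scale-sensitive)."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(u):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(a * a for a in u))
    return [a / n for a in u]

def cosine(u, v):
    """Cosine similarity: the dot product of L2-normalized vectors."""
    return dot(l2_normalize(u), l2_normalize(v))

# v is u scaled by 2: the dot product changes, the cosine does not.
u, v = [3.0, 4.0], [6.0, 8.0]
print(dot(u, v), cosine(u, v))
```

This scale-invariance is one plausible reason cosine distance behaves as the more stable similarity measure in the comparison above.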

Training Loss Convergence
Our final model is trained on 4 NVIDIA Tesla A100 GPUs with a total training time of less than 24 hours. For word-level pre-training, we can observe from Figure 3 that the generator and discriminator compete in joint training and gradually converge within 20,000 steps. For sentence-level pre-training, we can observe from Figure 2 that when the hard negatives are refreshed every 2,000 steps, the loss of the model increases temporarily, which indicates that the ANN search forms a more demanding test for the model and improves the model's capability on these hard cases in the following steps.

Conclusion
In this paper, we introduce SentiWSP, which improves pre-trained models on sentiment analysis tasks by capturing sentiment information at the word level and the sentence level simultaneously.
Extensive experimental results on five sentence-level sentiment classification benchmarks show that SentiWSP establishes new state-of-the-art performance on all of them. We also conduct experiments on two aspect-level sentiment classification benchmarks. The results show that SentiWSP beats most existing models on Restaurant14 and achieves new state-of-the-art performance on Laptop14. We further analyze several hyperparameters that may affect the model performance, and show that SentiWSP achieves satisfying performance under different hyperparameter settings.

Limitations
SentiWSP, like most current state-of-the-art pre-training models, requires relatively large computational resources. As shown in Table 2, SentiWSP-large performs better than SentiWSP-base, and the performance gap is more significant on the MR dataset. We also observe some bad cases when sentiment is expressed implicitly in a sentence: with very few sentiment words, SentiWSP has difficulty in masking, generating, and constructing positive samples.
In the future, we plan to devise an adaptive masking mechanism for sentiment words.

Figure 2: Training loss of sentence-level pre-training.

Figure 3: Training loss of word-level pre-training.

Table 1: Overall performance of different models on sentiment classification tasks. "Acc" and "MF1" denote accuracy and macro-F1 score, respectively.
Table 3 shows the statistics of these datasets, including the number of training, validation, and test samples.

Table 2: The ablation study results. SP and WP represent sentence-level pre-training and word-level pre-training, respectively. The model without both pre-training tasks is the original ELECTRA model.

Table 3: Statistics of the datasets used in our experiments. #C indicates the number of target classes in each dataset.

Table 4: Acc obtained by additionally masking p_w of sentiment words in word-level pre-training, on IMDB and MR.

Table 5: Acc obtained by additionally masking p_s of sentiment words in sentence-level pre-training, on IMDB and MR.

Table 6: Acc obtained by using only in-batch negative samples and top-k cross-batch negative samples.

It is worth pointing out that even the worst performances, i.e., when p_s = 0.5 on IMDB and p_s = 0.3 on MR, are better than most of the competitors in Table 1.

Table 7: The impact of loss and similarity functions in sentence-level pre-training.