Exploiting Position Bias for Robust Aspect Sentiment Classification

Aspect sentiment classification (ASC) aims to determine the sentiments expressed towards different aspects in a sentence. While state-of-the-art ASC models have achieved remarkable performance, they have recently been shown to suffer from a robustness issue. In particular, in two common scenarios, when the domains of test and training data are different (out-of-domain scenario) or when test data is adversarially perturbed (adversarial scenario), ASC models may attend to irrelevant words and neglect the opinion expressions that truly describe diverse aspects. To tackle this challenge, we hypothesize that position bias (i.e., that words closer to the aspect of concern carry a higher degree of importance) is crucial for building more robust ASC models, as it reduces the probability of mis-attending. Accordingly, we propose two mechanisms for capturing position bias, namely position-biased weight and position-biased dropout, which can be flexibly injected into existing models to enhance representations for classification. Experiments conducted on out-of-domain and adversarial datasets demonstrate that our proposed approaches largely improve the robustness and effectiveness of current models.


Introduction
Aspect sentiment classification (ASC) is an important sub-task of sentiment classification. It aims to identify the sentiment polarity (i.e., negative, neutral, or positive) of a specified aspect in a sentence. Take "Great food but the service was bad." as an example: for the aspects food and service, the corresponding sentiment polarities are positive and negative, respectively.

* Fang Ma and Chen Zhang contribute equally to this work. The order is determined alphabetically. † Dawei Song is the corresponding author. 1 The code and preprocessed data are available at https://github.com/BD-MF/POS4ASC.

[Figure 1: Examples of model predictions in different scenarios. I.D.: "Great food but the service was bad!" (neg./neg.); "The battery has never worked well." (pos./neg.); Adv.: "Awful food but the service was great!" (neg./pos.)]

A challenge in ASC is how to model the semantic relations between aspect terms and their contexts, which requires an ASC model to be sensitive only to the sentiment words that actually describe the target aspect terms. Although previous ASC models (Tang et al., 2016b; Li et al., 2018; Zhang et al., 2019a; Xu et al., 2019; Tang et al., 2020) have achieved promising results by modeling complex interactions between aspects and contexts, they have recently been shown to suffer from a lack of robustness (Xing et al., 2020). The issue is particularly severe in two scenarios: 1) out-of-domain (O.O.D.) scenario: ASC models that perform well on training data often fail to generalize to test data from another domain; 2) adversarial (Adv.) scenario: ASC models can be easily fooled by small adversarial perturbations of the input, e.g., substituting words with synonyms. To the best of our knowledge, no current ASC model has targeted the robustness issue in these two scenarios.
To fill this gap, inspired by a recent finding that highlighting words close to a target aspect (termed position bias) boosts the in-domain (I.D.) effectiveness of a model (Zhang et al., 2019b), we hypothesize that such position bias is also crucial for a robust ASC model in the O.O.D. and Adv. scenarios. Figure 1 shows an illustrative example of an ASC model that fails in the two scenarios due to mis-attending. In contrast, with position bias, a model tends to focus more on words nearer to the aspect, thus reducing the probability of mis-attending. Concretely, we propose two mechanisms: position-biased weight and position-biased dropout. The former assigns an inductive weight to each word according to its positional proximity to the aspect. The latter gives each word a probability of being preserved (or dropped) according to its proximity to the aspect. In doing so, position-biased weight degrades the significance of words that are not close enough to the aspect, while position-biased dropout drops likely irrelevant words with high probability.

Position bias is also quantitatively evidenced in commonly used benchmarks. With the annotated aspect-opinion pairs offered by Fan et al. (2019), we can calculate the positional proximity between any pair of aspect and opinion (in short, aspect proximity) in a sentence. The aspect proximity is computed by dividing the relative distance between an aspect-opinion pair by the length of the corresponding sentence. We can thus plot the aspect proximity distributions of these benchmarks with kernel density estimation, as shown in Figure 2. These distributions indicate that the aspect proximity is small with high probability, and position bias is therefore a reasonable inductive bias.
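The aspect proximity described above can be sketched as follows (a minimal illustration over token indices; the function name is ours, not from the released code):

```python
def aspect_proximity(aspect_idx: int, opinion_idx: int, sent_len: int) -> float:
    """Relative distance between an aspect token and an opinion token,
    normalized by sentence length, so values lie in [0, 1)."""
    return abs(aspect_idx - opinion_idx) / sent_len

# "Great food but the service was bad !" -> aspect "service" (index 4),
# opinion "bad" (index 6), 8 tokens in total.
print(aspect_proximity(4, 6, 8))  # 0.25
```

Aggregating this value over all annotated aspect-opinion pairs in a benchmark yields the samples underlying the kernel density estimates in Figure 2.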
Extensive experiments are conducted on the SemEval and ARTS datasets (Xing et al., 2020). The results show that incorporating the proposed position bias mechanisms leads to more robust ASC models in both the out-of-domain and adversarial scenarios. Furthermore, in terms of flexibility, the proposed methods can be easily adopted by subsequent models.

Capturing Position Bias
This section describes the proposed position-biased weight and position-biased dropout for capturing position bias. Formally, an n-word sentence containing a target m-word aspect term is formulated as S = {w_0, w_1, ..., w_γ, w_{γ+1}, ..., w_{n−1}}, where γ denotes the start index of the aspect term.

By resorting to either pre-trained word embeddings (Bengio et al., 2003) or a pre-trained language model (Devlin et al., 2019), we can represent the sentence as V = {e_0, e_1, ..., e_γ, e_{γ+1}, ..., e_{n−1}}. We then use position-biased weight or dropout to refine V and generate an enhanced representation, denoted as E = {h_0, h_1, ..., h_γ, h_{γ+1}, ..., h_{n−1}}. Any further structures of a model are built upon E, instead of V, to predict the sentiment polarities associated with diverse aspects.
Position-biased Weight Generally, the sentiment polarity of an aspect term is determined by its context, i.e., the words around the aspect term (Zhang et al., 2019b). We can thus leverage relative position information to compute weights for context words, with the aim of degrading the significance of words that are far away from the aspect. The position-biased weight, denoted as p_i ∈ (0, 1], is computed as:

p_i = 1 − (γ − i)/n,           if i < γ,
p_i = 1,                       if γ ≤ i < γ + m,
p_i = 1 − (i − γ − m + 1)/n,   if i ≥ γ + m,

where γ and m are the start index and length of the aspect term and n is the sentence length, so that aspect words keep full weight and the weight decays linearly with distance. The enhanced representation is then obtained as h_i = p_i · e_i.

Position-biased Dropout Dropout (Srivastava et al., 2014; Sennrich et al., 2016) randomly sets elements in a feature vector to zero. Word-level dropout can model semantic and syntactic compositionality and reduce input redundancy (Iyyer et al., 2015). Motivated by this idea, we give each word a probability of being preserved according to its positional proximity to the aspect, aiming to preserve words that are close enough to the aspect and drop the rest. The preservation of the i-th word is governed by:

z_i ~ Bernoulli(p_i),

where Bernoulli(p_i) denotes that z_i equals 1 with probability p_i and 0 with probability 1 − p_i. The i-th word is dropped if z_i is 0. Likewise, h_i is obtained by multiplying z_i and e_i.
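A minimal sketch of the two mechanisms, assuming a linear decay of p_i with distance from the aspect span [γ, γ + m) normalized by sentence length n (function names are ours, and plain Python lists stand in for embedding tensors):

```python
import random

def position_weights(n: int, gamma: int, m: int) -> list[float]:
    """Linear decay of importance with distance from the aspect span
    [gamma, gamma + m); aspect words keep full weight 1.0."""
    weights = []
    for i in range(n):
        if i < gamma:
            d = gamma - i              # distance to the left of the aspect
        elif i < gamma + m:
            d = 0                      # inside the aspect span
        else:
            d = i - gamma - m + 1      # distance to the right of the aspect
        weights.append(1.0 - d / n)
    return weights

def apply_weight(embeddings, weights):
    # Position-biased weight: h_i = p_i * e_i, scaling every dimension.
    return [[p * x for x in e] for p, e in zip(weights, embeddings)]

def apply_dropout(embeddings, weights, rng: random.Random):
    # Position-biased dropout: z_i ~ Bernoulli(p_i); keep word i iff z_i = 1.
    return [e if rng.random() < p else [0.0] * len(e)
            for p, e in zip(weights, embeddings)]
```

For "Great food but the service was bad !" (n = 8) with aspect "service" (γ = 4, m = 1), `position_weights` gives the aspect weight 1.0 and, e.g., "Great" (four positions away) weight 0.5.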

Experiments
Datasets

Target Models
We conduct experiments on a wide range of existing models for a comprehensive study of whether position bias is beneficial. Specifically, we examine these models' performance before and after injecting position bias, in terms of position-biased weight (pos-wt) and dropout (pos-dp) individually. The target models include: MemNet (Tang et al., 2016b) applies attention multiple times over word memories, and the output of the last attention is used for prediction; while the original work uses word embeddings as memories, we instead add a bidirectional LSTM layer on top of the embeddings for more abstract memories. (e) AOA (Huang et al., 2018) introduces an attention-over-attention network to model the interaction between aspects and contexts. (f) RoBERTa (Dai et al., 2021) is a strong baseline with an MLP built upon the pooled feature induced by RoBERTa.
Implementation Details In all our experiments, the 300-dimensional GloVe embeddings (Pennington et al., 2014) are used to initialize the input embedding. All model parameters are initialized with uniform distributions. Throughout, a bidirectional LSTM is adopted where necessary instead of a unidirectional one. If a model uses an attention mechanism, dot-product attention is employed. If a model has hidden states, their dimensionality is set to 300. The batch size is 64. We use Adam (Kingma and Ba, 2015) as the optimizer with a learning rate of 10^-3. The coefficient of L2 regularization is 10^-5. For experiments with RoBERTa as the input embedding, the settings differ: the dimensionality of hidden states is 768, the learning rate is 10^-5, and L2 regularization is removed.
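The hyperparameters above can be summarized as follows (an illustrative grouping only; the constant names are ours, not from the released code):

```python
# Settings for GloVe-based models, as described in Implementation Details.
GLOVE_CONFIG = dict(embed_dim=300, hidden_dim=300, batch_size=64,
                    optimizer="Adam", lr=1e-3, l2=1e-5)

# Settings for RoBERTa-based models: larger hidden size, smaller
# learning rate, and no L2 regularization.
ROBERTA_CONFIG = dict(hidden_dim=768, batch_size=64,
                      optimizer="Adam", lr=1e-5, l2=0.0)
```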
Evaluation Metrics For O.O.D., models are trained on one domain and evaluated on another. For Adv., models are trained on the SEMEVAL dataset and tested on its ARTS counterpart. For every test, a model is trained on the I.D. training set, selected on the I.D. development set, and tested on the O.O.D. or Adv. test set. The reported results are averaged over 5 runs with random initialization, and we adopt Accuracy and macro-averaged F1 as evaluation metrics.

Table 3 shows the I.D. performance of the LSTM on both the laptop and restaurant domains, which indicates that incorporating position bias does not noticeably harm a model's generalization on I.D. test sets. On the contrary, position bias, especially position-biased weight, can even boost I.D. performance.

[Table 2 caption: O.O.D. denotes a model trained on one domain (LAP or REST) and tested on the other (REST or LAP); Adv. denotes a model trained on a domain and tested on its ARTS counterpart. w/ pos-dp denotes a model with position-biased dropout; w/ pos-wt denotes a model with position-biased weight. The small number next to each score indicates the improvement (↑) or drop (↓) compared with the original model without position bias; scores highlighted in red are the best-performing among the two variants.]
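For concreteness, the macro-averaged F1 used here can be sketched in plain Python (a minimal illustration, not the evaluation script used in the experiments):

```python
def macro_f1(gold, pred, labels=("neg", "neu", "pos")):
    """Macro-averaged F1: per-class F1 averaged with equal class weight,
    so rare classes count as much as frequent ones."""
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because each class contributes equally, macro-F1 is more sensitive than Accuracy to failures on the infrequent neutral class, which is why both metrics are reported.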

Robustness Results
The robustness results are shown in Table 2. The performance of LSTM drops drastically on the O.O.D. and Adv. test sets compared with its I.D. performance, indicating the importance of studying the robustness issue. Our two position bias mechanisms improve the target models' O.O.D. and Adv. performance in most cases. With position-biased dropout, the F1 scores of models are improved by up to 5.32 pp on O.O.D. test sets and 2.36 pp on Adv. test sets, though its efficacy appears unstable across different target models and settings. In contrast, the impact of position-biased weight is much more prominent: with position-biased weight, Accuracy scores can be enhanced by up to 6.05 pp and 7.20 pp on O.O.D. and Adv. test sets, respectively, and F1 scores by up to 9.51 pp and 8.14 pp.
A highlight is that the experimental results with RoBERTa also exhibit the benefit of position bias, yet with caveats. Although pre-trained language models like RoBERTa employ positional encodings, such absolute position information is not sufficient to model the relative position relations between aspect terms and contexts. Therefore, position bias matters when fine-tuning pre-trained language models for robust ASC performance. However, we observe that position-biased dropout is not an appropriate choice for pre-trained language models.
Case Study To understand the effect of position bias, we conduct a case study on the two robustness scenarios, as shown in Table 4. Specifically, we visualize the attention scores produced by an ASC model (LSTM-Attn) with and without the position-biased weight mechanism, trained on SEMEVAL-REST.
We observe that before applying position bias, the model attends to irrelevant words and fails in both scenarios; specifically, in both cases, it mis-attends to irrelevant opinion expressions. After injecting position bias, the attention scores become more accurate and the model attends to the correct opinion spans.

[Table 4 examples, each shown with and without pos-wt: "The price is reasonable although the quality is poor." and "Awful food but the service was great!"]

Related Work
Fine-grained Sentiment Analysis ASC falls within the broad scope of fine-grained sentiment analysis. While ASC is typically formulated as determining the sentiment polarity of a given aspect in a sentence (Tang et al., 2016b,a; Wang et al., 2016; Chen et al., 2017; Huang et al., 2018; Li et al., 2018; Xu et al., 2019; Zhang et al., 2019b,a; Tang et al., 2020), there is an emerging trend of treating fine-grained sentiment as an opinion triplet extraction task (Peng et al., 2020; Wu et al., 2020). Recently, the robustness of ASC models has become a critical issue that urges researchers to pay more attention to improving it (Xing et al., 2020). Ours is the first work to enhance the universal robustness of ASC models by capturing position bias. On another note, we believe opinion triplet extraction is exposed to a similar robustness issue, which should be explored in the near future.
Robustness in NLP Broadly, there are two kinds of robustness in NLP, i.e., O.O.D. and Adv. robustness. O.O.D. robustness has attracted extensive attention in recent work (Ng et al., 2020; Hendrycks et al., 2020; Xie et al., 2020), where the cross-domain setting is often used to evaluate models (Benson and Ecker, 2020). Previous work mainly focuses on minimizing the domain discrepancy and improving the feature adaptability of models (Rietzler et al., 2020; Ye et al., 2020). On the other hand, adversarial learning has become the main method for improving the Adv. robustness of models (Xing et al., 2020). Prior methods augment data with semantic operations such as synonym replacement, random insertion, random swap, and random deletion (Wei and Zou, 2019). Other methods involve adding extra text (Wallace et al., 2019) and replacing sentences with semantically similar sentences (Ribeiro et al., 2018). Our work covers both forms of robustness and aims to achieve universal robustness for ASC with position bias.

Conclusion and Future Work
In this work, we find that state-of-the-art ASC models suffer from a robustness issue, particularly in two scenarios: i) the out-of-domain scenario and ii) the adversarial scenario. To address the issue, we propose a simple yet effective inductive bias, namely position bias, together with two mechanisms to capture it: position-biased weight and position-biased dropout. They can be injected into existing models to enhance representations. Extensive experiments demonstrate that the proposed methods largely improve the models' robustness, verifying our hypothesis that position bias is beneficial for building more robust ASC models.
This work can be improved in two facets: i) since the current way of incorporating position bias is straightforward yet somewhat naive, especially for pre-trained language models, it is worthwhile to design an architecture that injects position bias in a more elegant manner; ii) it has been shown that position bias in ASC is highly correlated with the syntactic structure of a sentence, so syntax can likewise be explored to enhance the robustness of ASC models.