An Empirical Study on Leveraging Position Embeddings for Target-oriented Opinion Words Extraction

Target-oriented opinion words extraction (TOWE) (Fan et al., 2019b) is a new subtask of target-oriented sentiment analysis that aims to extract opinion words for a given aspect in text. Current state-of-the-art methods leverage position embeddings to capture the relative position of a word to the target. However, the performance of these methods depends on the ability to incorporate this information into word representations. In this paper, we explore a variety of text encoders based on pretrained word embeddings or language models that leverage part-of-speech and position embeddings, aiming to examine the actual contribution of each component in TOWE. We also adapt a graph convolutional network (GCN) to enhance word representations by incorporating syntactic information. Our experimental results demonstrate that BiLSTM-based models can effectively encode position information into word representations while using a GCN only achieves marginal gains. Interestingly, our simple methods outperform several state-of-the-art complex neural structures.


Introduction
Target-oriented opinion words extraction (TOWE) (Fan et al., 2019b) is a fine-grained subtask of target-oriented sentiment analysis (Liu, 2012) that aims to extract opinion words with respect to an opinion target (or aspect) in text. Given the sentence "The food is good but the service is extremely slow", TOWE attempts to identify the opinion words "good" and "extremely slow" corresponding respectively to the targets "food" and "service". TOWE is usually treated as a sequence labeling problem using the BIO tagging scheme (Ramshaw and Marcus, 1999) to distinguish the Beginning, Inside and Outside of a span of opinion words. Table 1 shows an example of applying the BIO tagging scheme for TOWE.
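The BIO scheme above can be sketched in a few lines of code; the tokenization and the helper function below are illustrative, not part of the paper's implementation:

```python
# A minimal sketch of the BIO tagging scheme for TOWE, assuming simple
# whitespace tokenization; the helper name is hypothetical.
def bio_tags(tokens, opinion_spans):
    """Label each token as B (begin), I (inside) or O (outside) an opinion span."""
    tags = ["O"] * len(tokens)
    for start, end in opinion_spans:  # end index is exclusive
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

tokens = "The food is good but the service is extremely slow".split()
# For target "service", the opinion words are "extremely slow" (indices 8-9).
print(bio_tags(tokens, [(8, 10)]))
# → ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I']
```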

[Table 1: BIO tagging for the sentence "The food is good but the service is extremely slow."]

Learning effective word representations is a critical step towards tackling TOWE. Traditional work (Zhuang et al., 2006a; Hu and Liu, 2004a; Qiu et al., 2011) used hand-crafted features to represent words, which often do not generalize easily. More recent work (Liu et al., 2015; Fan et al., 2019b; Wu et al., 2020a; Veyseh et al., 2020) has explored neural networks to learn word representations automatically.
Previous neural-based methods (Liu et al., 2015; Fan et al., 2019b) have used word embeddings (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) to represent the input. However, TOWE is a complex task that requires a model to know the relative position of each word to the aspect in text. Words that are relatively closer to the target usually express the sentiment towards that aspect (Zhou et al., 2020). Fan et al. (2019b) employ Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) to encode the target position information in word embeddings. Wu et al. (2020a) transfer latent opinion knowledge into a Bidirectional LSTM (BiLSTM) network that leverages word and position embeddings (Zeng et al., 2014). Recently, Veyseh et al. (2020) have proposed ONG, a method that combines BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018), position embeddings, Ordered Neurons LSTM (ON-LSTM) (Shen et al., 2018), and a graph convolutional network (GCN) (Kipf and Welling, 2016) to introduce syntactic information into word representations. While this model achieves state-of-the-art results, previous studies have shown that the ON-LSTM does not actually perform much better than LSTMs in recovering latent tree structures (Dyer et al., 2019). Moreover, ON-LSTMs perform worse than LSTMs in capturing short-term dependencies (Shen et al., 2018). Since opinion words are usually close to targets in text, the ON-LSTM risks missing the relationship between the aspect and any information (e.g. position) relating to the opinion words.
In this paper, we empirically evaluate a battery of popular text encoders which, apart from words, take positional and part-of-speech information into account. Surprisingly, we show that methods based on BiLSTMs can effectively leverage position embeddings to achieve competitive, if not better, results than more complex methods such as ONG on standard TOWE datasets. Interestingly, combining a BiLSTM encoder with a GCN to explicitly capture syntactic information achieves only minor gains. This empirically highlights that BiLSTM-based methods have an inductive bias appropriate for the TOWE task, making a GCN less important.

Methodology
Given a sentence s = {w_1, ..., w_n} with aspect w_t ∈ s, our approach consists of a text encoder that takes as input a combination of word, part-of-speech and position information for TOWE. We further explore enhancing text encoding by incorporating information from a syntactic parse of the sentence through a GCN encoder.

Input Representation
Word Embeddings: We experiment with Glove word vectors (Pennington et al., 2014) as well as BERT-based representations, extracted from the last layer of a BERT base model (Devlin et al., 2018) fine-tuned on TOWE.
Position Embeddings (POSN): We compute the relative distance d_i from w_i to w_t (i.e., d_i = i − t) and look up its embedding in a randomly initialized position embedding table.
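The position lookup can be sketched as follows; the maximum distance, embedding dimension and clipping behavior here are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

# A sketch of relative-position embeddings with a randomly initialized
# lookup table; max_dist and dim are hypothetical choices.
rng = np.random.default_rng(0)
max_dist, dim = 100, 30
# One row per distance in [-max_dist, max_dist].
posn_table = rng.normal(size=(2 * max_dist + 1, dim))

def position_embeddings(n, t):
    """Embed the relative distance d_i = i - t of each token i to the target index t."""
    d = np.arange(n) - t
    d = np.clip(d, -max_dist, max_dist)  # clamp out-of-range distances
    return posn_table[d + max_dist]      # shift distances to non-negative row indices

emb = position_embeddings(n=10, t=6)     # e.g. target "service" at index 6
print(emb.shape)  # → (10, 30)
```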

Part-of-Speech Tag Embeddings (POST):
We assign part-of-speech tags to each word token using the Stanford parser and look up their embeddings in a randomly initialized POST embedding table.
Combined Input: We consider two types of input representations:
1. Glove Input (G): Constructed by concatenating Glove word embeddings, POST and POSN embeddings for each token.
2. BERT Input (B): Constructed by concatenating BERT vectors with POSN embeddings for each word token, following a similar approach to Veyseh et al. (2020). We ignore POST embeddings since BERT already captures such information effectively (Tenney et al., 2019).
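The Glove-style combined input is just a per-token concatenation, which can be sketched as below; the 300/30/30 dimensions mirror common choices and are illustrative:

```python
import numpy as np

# A sketch of the combined Glove input: word, POST and POSN vectors
# concatenated per token; the dimensions here are illustrative.
def combine(word_vecs, post_vecs, posn_vecs):
    """Concatenate per-token feature vectors along the last axis."""
    return np.concatenate([word_vecs, post_vecs, posn_vecs], axis=-1)

n = 10  # sentence length
x = combine(np.zeros((n, 300)),  # Glove word embeddings
            np.zeros((n, 30)),   # POST embeddings
            np.zeros((n, 30)))   # POSN embeddings
print(x.shape)  # → (10, 360)
```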

Text Encoders
We experiment with the following neural encoders that take word vector representations as input:
CNN: A single-layer convolutional neural network (LeCun et al., 1990). Given a word w_i ∈ s, the CNN takes a fixed window of words around it and applies a filter on their representations to extract a feature vector for w_i. We concatenate the feature vectors corresponding to different filters to compute the representation of w_i.
Transformer: A Transformer encoder (Vaswani et al., 2017) that takes a linear transformation of the input words to learn contextualized representations.
BiLSTM: A bi-directional LSTM that takes the input representation and models the context in a forward and backward direction.

ON-LSTM:
A variant of the LSTM network proposed by Shen et al. (2018) which has an inductive bias toward learning latent tree structures.
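As a rough illustration of the bidirectional encoding (not the paper's implementation), the sketch below substitutes a vanilla RNN cell for the LSTM cell and shows only the forward/backward passes and their concatenation:

```python
import numpy as np

# A toy bidirectional encoder, assuming a vanilla RNN cell in place of
# the LSTM cell for brevity; weights and dimensions are illustrative.
def birnn(X, Wf, Uf, Wb, Ub):
    """Run forward and backward recurrences over X and concatenate the states."""
    n = X.shape[0]
    h = Wf.shape[1]  # hidden size
    fwd, bwd = np.zeros((n, h)), np.zeros((n, h))
    hf = hb = np.zeros(h)
    for i in range(n):                       # left-to-right pass
        hf = np.tanh(X[i] @ Wf + hf @ Uf)
        fwd[i] = hf
    for i in reversed(range(n)):             # right-to-left pass
        hb = np.tanh(X[i] @ Wb + hb @ Ub)
        bwd[i] = hb
    return np.concatenate([fwd, bwd], axis=-1)  # shape [n, 2h]

rng = np.random.default_rng(0)
n, d, h = 10, 360, 50
out = birnn(rng.normal(size=(n, d)) * 0.1,
            rng.normal(size=(d, h)) * 0.1, rng.normal(size=(h, h)) * 0.1,
            rng.normal(size=(d, h)) * 0.1, rng.normal(size=(h, h)) * 0.1)
print(out.shape)  # → (10, 100)
```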

GCN Encoder
First, we interpret the syntactic parse tree as a binary adjacency matrix A ∈ {0,1}^{n×n} (n is the sentence length) with entries A_ij = 1 if there is a connection between nodes i and j, and A_ij = 0 otherwise. To apply a GCN on A, we add self-loops at each node (i.e., A_ii = 1), ensuring that nodes are informed by their own representations at previous layers. Formally, let H^(k) be the output at the k-th GCN layer, given by:

H^(k) = ReLU(A H^(k-1) W^(k)) + H^(0)    (1)

where k = 1, . . . , K and W^(k) is a parameter matrix at layer k. ReLU is used as the activation function and H^(0) is the set of word representations extracted by the text encoder. The second term in (1) induces a residual connection that retains the contextual information of H^(0) during the propagation process (Sun et al., 2020).
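One GCN layer of this form, H^(k) = ReLU(A H^(k-1) W^(k)) + H^(0), can be sketched as below; the parse edge and the dimensions are illustrative placeholders:

```python
import numpy as np

# A sketch of one GCN layer with a residual connection to H^(0);
# A includes self-loops, and all shapes here are illustrative.
def gcn_layer(A, H_prev, W, H0):
    """Propagate over the adjacency matrix, apply ReLU, and add the residual H0."""
    return np.maximum(A @ H_prev @ W, 0) + H0

n, d = 4, 8
A = np.eye(n)                    # self-loops A_ii = 1
A[0, 1] = A[1, 0] = 1            # a hypothetical edge from the parse tree
rng = np.random.default_rng(0)
H0 = rng.normal(size=(n, d))     # word representations from the text encoder
W = rng.normal(size=(d, d)) * 0.1
H1 = gcn_layer(A, H0, W, H0)     # first GCN layer (k = 1)
print(H1.shape)  # → (4, 8)
```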

Classification and Optimization
Our model takes the representation H^(l) (where l ≥ 0), applies a linear layer and then normalizes the output with a softmax function to produce a probability distribution over the tag set {B, I, O} for each word in the input. During training, we minimize the cross-entropy loss for each word in the text over the entire training set.
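The classification head and loss can be sketched as follows; the random parameters and label indices are illustrative placeholders:

```python
import numpy as np

# A sketch of the {B, I, O} classification head: a linear layer followed by a
# per-token softmax, with a cross-entropy loss; parameters are placeholders.
def classify(H, W, b):
    """Map token representations H to per-token probabilities over 3 tags."""
    logits = H @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the gold tags (e.g. 0=B, 1=I, 2=O)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

rng = np.random.default_rng(0)
n, d = 10, 100
probs = classify(rng.normal(size=(n, d)),
                 rng.normal(size=(d, 3)) * 0.1, np.zeros(3))
print(probs.shape)  # → (10, 3); each row sums to 1
```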

Implementation Details
Hyper-parameters are tuned on 20% of samples randomly selected from the train set since there is no development set. We use 300-dim Glove word vectors (Pennington et al., 2014) and apply a dropout of 0.8. Dimensions of part-of-speech and position embeddings are set to 30, while dimensions of position embeddings for pretrained models are set to 100. The CNN uses three filters with sizes 3, 4 and 5 and has a hidden dimension of 300. We use the Adam optimizer to train all models. Models that use Glove word vectors are optimized with learning rate 1e-3 and trained for 100 epochs with batch size 16. Models that use BERT hidden vectors are optimized with learning rate 1e-5 and trained with batch size 6. Note that LSTM_word and BiLSTM_word only use word embeddings as input. Our source code is publicly available.

Table 3 presents the results of all methods. Models that use Glove Input (or BERT Input) are appended with "G" (or "B") to distinguish them. We report precision (Prec), recall (Rec), F1 score and average F1 score (Avg.F1) across all datasets.

Comparison of Text Encoders:
We first observe that CNN(G) is adept at exploiting the information from simpler word representations (Glove), outperforming the Transformer(G) by +4.52 Avg.F1. We believe that this behavior is due to the fact that TOWE is a short-sequence task (see #ASL in Table 2). This aligns well with previous observations by Yin et al. (2021), who found that CNNs often perform better than Transformers at short-sequence tasks. However, the Transformer(B) improves performance and even outperforms CNN(B) by +0.71 Avg.F1 when using BERT. In addition, we find that ON-LSTM(G) and ON-LSTM(B) lag behind BiLSTM(G) and BiLSTM(B) by 4.33 and 0.54 Avg.F1 respectively. ON-LSTM performs worse than LSTMs on tasks that require tracking short-term dependencies (Shen et al., 2018). Since opinion words are usually close to the target in the sequence (see #ASL vs. #S.Dist. in Table 2), BiLSTM(G) (or BiLSTM(B)) achieves better performance than ON-LSTM(G) (or ON-LSTM(B)). The improvement of BiLSTM(G) over BiLSTM_word suggests that the substantial boost in performance comes from either part-of-speech or position embeddings. We later perform an ablation experiment to examine which information is more useful. Interestingly, BiLSTM(G) outperforms the current state-of-the-art ONG by +0.62 Avg.F1 despite its simple architecture, demonstrating the importance of first experimenting with simpler methods before designing more complex structures.
Comparison of Text+GCN Encoders: Adding a GCN over any text encoder generally improves performance, as the GCN provides additional syntactic information that is helpful for representation learning. We find that BiLSTM+GCN(G) achieves only small gains over BiLSTM(G), while other text encoders, including Transformer+GCN(G) and CNN+GCN(G), achieve relatively higher gains than their counterparts. This suggests that BiLSTM(G) has an inductive bias appropriate for the TOWE task and that performance mostly depends on the quality of the input representation. We observe that when using BERT embeddings, there is a minimal performance difference between using a GCN or not. We attribute this to the expressiveness of BERT embeddings and their ability to capture syntactic dependencies (Jawahar et al., 2019). Overall, the results suggest that our proposed methods outperform the state of the art consistently across datasets.

Ablation Study
We perform ablation experiments on the two best performing models, BiLSTM+GCN(G) and BiLSTM+GCN(B), to study the contribution of their different components. The results are shown in Table 4. On BiLSTM+GCN(G), as we consecutively remove the GCN and POST embeddings from the input representation, we observe only a slight drop in performance. This indicates that POST embeddings and the GCN are not critical components of BiLSTM+GCN(G), and they can be removed to reduce model complexity. However, removing the position embeddings from the input representation causes a substantial drop in performance, yielding an F1 score equivalent to BiLSTM_word across datasets. Similarly, removing the position embeddings from BiLSTM+GCN(B) causes a substantial drop in performance. These results suggest that leveraging position embeddings is crucial for TOWE performance.

Conclusion
We showed through extensive experiments that a simple BiLSTM architecture using input representations from pre-trained word embeddings or language models, POST embeddings and position embeddings obtains competitive, if not better, results than more complex current state-of-the-art methods (Veyseh et al., 2020). The BiLSTM succeeds in exploiting position embeddings to improve performance, while adapting a GCN to incorporate syntactic information from the sentence yields only marginal further gains. In future work, we will explore how to improve existing TOWE models by effectively leveraging position embeddings.