Enhancing Aspect-level Sentiment Analysis with Word Dependencies

Aspect-level sentiment analysis (ASA) has received much attention in recent years. Most existing approaches try to leverage syntactic information, such as the dependency parsing results of the input text, to improve sentiment analysis on different aspects. Although these approaches achieve satisfying results, they mainly focus on leveraging the dependency arcs among words while omitting the dependency type information, and they model all dependencies equally, so that noisy dependency results may hurt model performance. In this paper, we propose an approach to enhance aspect-level sentiment analysis with word dependencies, where the type information is modeled by key-value memory networks and different dependency results are selectively leveraged. Experimental results on five benchmark datasets demonstrate the effectiveness of our approach, which outperforms baseline models on all datasets and achieves state-of-the-art performance on three of them.


Introduction
Aspect-level sentiment analysis (ASA) determines the sentiment polarity of a given input text at a fine-grained level, where the sentiment towards a particular aspect in the text is predicted instead of that of the entire input. E.g., the sentiment towards the aspect "bar service" in the sentence "Total environment is fantastic although bar service is poor." is negative, although the text as a whole conveys a positive sentiment polarity. Due to its high practical value in many scenarios, e.g., product review analysis, social media tracking, etc., ASA has attracted much attention in the natural language processing (NLP) community for years (Tang et al., 2016a,b; He et al., 2018a; Huang and Carley, 2019).¹

* Equal contribution. † Corresponding author.
¹ The code and different models are released at https://github.com/cuhksz-nlp/ASA-WD.
In recent studies, neural networks, especially recurrent models with attention mechanisms, are widely applied to this task, where many of them (Wang et al., 2016; Tang et al., 2016a; Chen et al., 2017; Ma et al., 2017; Fan et al., 2018; Liang et al., 2019; Tang et al., 2020) model the semantic relatedness between context and aspect words to facilitate sentiment analysis on aspects. Other approaches use additional inputs such as word position (Gu et al., 2018), document information (He et al., 2018b; Li et al., 2018a), and commonsense knowledge (Ma et al., 2018). Among all such inputs, the dependency parsing results of the input text have proved to be particularly useful (He et al., 2018a; Huang and Carley, 2019; Tang et al., 2020), because they help the model locate important content that modifies the aspect words and thus suggests the sentiment towards them. Attention mechanisms (He et al., 2018a), graph neural networks (GNN) (Huang and Carley, 2019), and Transformers (Tang et al., 2020) have been applied to leverage such information. However, most of these approaches focus only on the dependency arcs among words and omit other information such as relation types, which could provide useful cues for predicting the sentiment. Also, they model all dependency instances equally, without weighting them according to their contribution to the task, so that noisy information from the automatically generated dependency tree may hurt model performance. Therefore, improved methods are needed to comprehensively and efficiently learn dependencies among words to enhance ASA.
To address the aforementioned limitations, in this paper we propose an effective and efficient neural approach to ASA that incorporates word dependencies, which are acquired from off-the-shelf toolkits and modeled by key-value memory networks (KVMN) (Miller et al., 2016). In detail, for each input text parsed by a dependency parser, we extract its dependency relations and feed them into the KVMN, in which word-word associations and their corresponding dependency types are mapped to keys and values, respectively. Then the KVMN learns and weights different dependency knowledge according to the contribution of the corresponding keys to the ASA task, and provides the resulting representations to a regular ASA model, i.e., a BERT-based classifier, for the final aspect-level sentiment prediction. In doing so, the proposed approach not only comprehensively leverages both word relations and their dependency types, but also effectively weights them through the memory mechanism according to their contributions to the ASA task. We evaluate the proposed approach on five benchmark datasets, where our approach outperforms the baselines on all datasets and achieves state-of-the-art results on three of them.

[Figure 1: The left part shows the encoder and decoder for ASA; the right part demonstrates the key-value memory networks (KVMN) for dependency information incorporation, where we use example word dependencies and their types (highlighted in yellow) of the aspect term "service" to show how they are extracted, weighted, and then fed into the left part for ASA.]

The Approach
The task of ASA aims to analyze the sentiment of a text towards a specific aspect, which is formalized as a classification task performed on sentence-aspect pairs (Tang et al., 2016b; Ma et al., 2017; Xue and Li, 2018; Hazarika et al., 2018; Fan et al., 2018; Huang and Carley, 2018; Tang et al., 2019; Chen and Qian, 2019; Tan et al., 2019). In detail, each input sentence and the aspect term in it are denoted by $\mathcal{X} = x_1, x_2, \cdots, x_n$ and $\mathcal{A} = a_1, a_2, \cdots, a_m$, respectively, where $\mathcal{A}$ is a sub-string of $\mathcal{X}$ ($\mathcal{A} \subset \mathcal{X}$), and $n$ and $m$ refer to the word-based lengths of $\mathcal{X}$ and $\mathcal{A}$. Following this paradigm, we design the architecture of our approach as shown in Figure 1, with a BERT-based (Devlin et al., 2019) encoder illustrated on the left to compute the sentence-aspect pair representation $\mathbf{r}$, which is enhanced by the word dependency information obtained from the KVMN module on the right; the result is then fed into a softmax decoder to predict the sentiment of the text towards the aspect. Therefore, ASA through our approach can be formalized as

$$\hat{y} = \arg\max_{y \in \mathcal{T}} p(y \mid \mathcal{X}, \mathcal{A}) \quad (1)$$

where $\mathcal{T}$ denotes the set of sentiment polarities for $y$, $p$ computes the probability of predicting $y \in \mathcal{T}$ given $\mathcal{X}$ and $\mathcal{A}$, and $\hat{y}$ refers to the predicted sentiment polarity for $\mathcal{A}$ in the context of $\mathcal{X}$. In the rest of this section, we first describe the KVMN for leveraging word dependencies, then explain how the resulting representations are integrated into the backbone sentiment classifier.
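As a concrete illustration of this input format, below is a minimal sketch of a sentence-aspect pair; the class and field names are our own and not from the released code.

```python
from dataclasses import dataclass
from typing import List

# The sentiment polarity set T used throughout the paper.
POLARITIES = ["positive", "neutral", "negative"]

@dataclass
class ASAInstance:
    """One sentence-aspect pair: A is a sub-string of X."""
    sentence: List[str]  # X = x_1, ..., x_n (word-based)
    aspect: List[str]    # A = a_1, ..., a_m, contained in X
    label: str           # gold polarity y in T

# The example from the introduction: the aspect "bar service" is
# negative although the sentence as a whole is positive.
example = ASAInstance(
    sentence="Total environment is fantastic although bar service is poor .".split(),
    aspect="bar service".split(),
    label="negative",
)
```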

KVMN for Word Dependencies
High-quality text representations always play a crucial role in obtaining good model performance for different NLP tasks (Song et al., 2017; Seyler et al., 2018; Babanejad et al., 2020), where contextual features, including n-grams and syntactic information, have been demonstrated to be effective in enhancing text representations and thus lead to improvements in different models (Song et al., 2006, 2009; Song and Xia, 2013; Dong et al., 2014; Miller et al., 2016; Seyler et al., 2018; Diao et al., 2019; Huang and Carley, 2019; Tian et al., 2020b,c,d,e; Chen et al., 2020). Among all these features, dependency ones have been widely used, especially for ASA. There are many options for incorporating word dependencies into the ASA task, including attention mechanisms (He et al., 2018a), where the dependency types among word pairs are omitted, and GNN- and Transformer-based methods (Tang et al., 2020) that require complicated architectures to model the entire dependency structure of an input text. Compared to these options, KVMN, whose variants have been demonstrated to be effective in incorporating contextual features (Miller et al., 2016; Guan et al., 2019; Tian et al., 2020a,f; Nie et al., 2020), not only provides an appropriate way to leverage both word-word relations and their corresponding dependency types, but also weights different dependency information according to its contribution to the ASA task.
In detail, to build the KVMN, we first collect all word-word relations extracted from the parse results of a corpus via an off-the-shelf toolkit and use them to form the key set, and map their corresponding dependency types to the value set. Then, two embedding matrices, $\mathbf{K}$ and $\mathbf{V}$, are applied to the key and value sets, with each vector representing a key or a value in the sets. At the training or prediction stage, given an input text, our model obtains its dependency parsing result; i.e., for each $w_i$ in a sentence-aspect pair, where $w_i$ comes from $\mathcal{X}$, $\mathcal{A}$, or both $\mathcal{X}$ and $\mathcal{A}$, we extract the words associated with $w_i$ and their corresponding dependency types from the parse results. Note that, for each word, we use its inbound and outbound dependency types to represent its governor and dependent words, respectively. For example, as illustrated in Figure 1, the words associated with the aspect word "service" are "poor" (governor) and "bar" (dependent); their corresponding dependency types are thus "nsubj" and "compound", respectively. Afterwards, we map the associated words and their corresponding dependency types to keys $\mathcal{K}_i = \{k_{i,1}, k_{i,2}, \cdots, k_{i,j}, \cdots, k_{i,q}\}$ and values $\mathcal{V}_i = \{v_{i,1}, v_{i,2}, \cdots, v_{i,j}, \cdots, v_{i,q}\}$ from $\mathbf{K}$ and $\mathbf{V}$ in the KVMN, where each item in $\mathcal{K}_i$ and $\mathcal{V}_i$ has its embedding denoted by $\mathbf{e}^k_{i,j}$ and $\mathbf{e}^v_{i,j}$, respectively. Once the keys and values are in place, we take the hidden vector $\mathbf{h}_i$ for $w_i$ from the encoder (i.e., BERT) and compute the weight assigned to each value $v_{i,j}$ by

$$p_{i,j} = \frac{\exp(\mathbf{h}_i \cdot \mathbf{e}^k_{i,j})}{\sum_{j'=1}^{q} \exp(\mathbf{h}_i \cdot \mathbf{e}^k_{i,j'})} \quad (2)$$

We then use $p_{i,j}$ to activate the corresponding values $v_{i,j}$ and compute the weighted sum by

$$\mathbf{o}_i = \sum_{j=1}^{q} p_{i,j} \, \mathbf{e}^v_{i,j} \quad (3)$$

where $\mathbf{o}_i$ refers to the output of the KVMN module for $w_i$ and carries its word dependency information.
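To make the memory addressing concrete, below is a minimal PyTorch sketch of Eqs. (2) and (3). The module interface (one word at a time, with pre-extracted key/value indices) is our own simplification, not necessarily how the released code is organized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVMN(nn.Module):
    """Key-value memory for one word w_i: keys index the associated
    context words, values index their dependency types."""

    def __init__(self, n_keys: int, n_values: int, dim: int):
        super().__init__()
        self.key_emb = nn.Embedding(n_keys, dim)      # matrix K
        self.value_emb = nn.Embedding(n_values, dim)  # matrix V

    def forward(self, h_i: torch.Tensor, key_ids: torch.Tensor,
                value_ids: torch.Tensor) -> torch.Tensor:
        e_k = self.key_emb(key_ids)      # (q, dim), e^k_{i,j}
        e_v = self.value_emb(value_ids)  # (q, dim), e^v_{i,j}
        # Eq. (2): weight each memory slot by how well its key matches h_i.
        p = F.softmax(e_k @ h_i, dim=0)  # (q,)
        # Eq. (3): o_i is the p-weighted sum of the value embeddings.
        return (p.unsqueeze(-1) * e_v).sum(dim=0)
```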

Word Dependency Integration for ASA
As shown in Figure 1, we feed the sentence-aspect pair into the BERT encoder by

$$[\mathbf{h}_0, \mathbf{h}_1, \cdots] = \text{BERT}([CLS] \oplus \mathbf{H}_\mathcal{X} \oplus [SEP] \oplus \mathbf{H}_\mathcal{A})$$

where $\mathbf{h}_0$ denotes the hidden vector for the text-initial symbol [CLS], and $\mathbf{H}_\mathcal{X}$ and $\mathbf{H}_\mathcal{A}$ the embedding matrices of the words in $\mathcal{X}$ and $\mathcal{A}$, respectively. Upon the modeling of word dependencies, the different $\mathbf{o}_i$ obtained for each $w_i$ are averaged and then concatenated with $\mathbf{h}_0$ by

$$\mathbf{r} = \mathbf{h}_0 \oplus \frac{1}{l} \sum_{i=1}^{l} \mathbf{o}_i \quad (4)$$

where $\mathbf{r}$ is the representation for the input sentence-aspect pair enhanced by word dependencies, and the value of $l$ equals $n$, $m$, or $n + m$ if all $w_i$ come from $\mathcal{X}$ only, $\mathcal{A}$ only, or $\mathcal{X}+\mathcal{A}$, respectively.² Then, we use a dense layer with a trainable matrix $\mathbf{W}$ and bias vector $\mathbf{b}$ to align the dimension of $\mathbf{r}$ to the output space by $\mathbf{u} = \mathbf{W} \cdot \mathbf{r} + \mathbf{b}$, with each dimension of $\mathbf{u}$ corresponding to a sentiment type. Finally, a softmax function is applied to $\mathbf{u}$ to predict the output sentiment $\hat{y}$ for the aspect $\mathcal{A}$ in $\mathcal{X}$:

$$\hat{y} = \arg\max_{t \in \mathcal{T}} \frac{\exp(u_t)}{\sum_{t'=1}^{|\mathcal{T}|} \exp(u_{t'})} \quad (5)$$

where $u_t$ is the value at dimension $t$ of $\mathbf{u}$.

[Table 1: The statistics of the five datasets, where the numbers of aspects with positive, neutral, and negative sentiment polarities are reported. We also report the number and percentage of the contrastive cases (DIFF.)³ where in a sentence the sentiments on aspect(s) are different from the entire sentence.

             LAP14          REST14         REST15         REST16         TWITTER
             TRAIN   TEST   TRAIN   TEST   TRAIN   TEST   TRAIN   TEST   TRAIN   TEST
POSITIVE #     994    341   2,164    728     907    326   1,229    469   1,561    173
NEUTRAL #      464    169     637    196      36     34      69     30   3,127    346
NEGATIVE #     870    128     807    182     254    207     437      …       …      … ]
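To keep the wiring explicit, here is a matching PyTorch sketch of Eqs. (4) and (5), under the same single-instance simplification as the KVMN sketch above; at training time one would return the logits $\mathbf{u}$ for the cross-entropy loss instead of taking the argmax.

```python
import torch
import torch.nn as nn

class ASADecoder(nn.Module):
    """Eqs. (4)-(5): average the KVMN outputs, concatenate them with
    the [CLS] vector h_0, and project to the |T| sentiment polarities."""

    def __init__(self, dim: int, n_polarities: int = 3):
        super().__init__()
        self.dense = nn.Linear(2 * dim, n_polarities)  # W and b

    def forward(self, h0: torch.Tensor, o: torch.Tensor) -> torch.Tensor:
        # h0: (dim,); o: (l, dim) with l = n, m, or n + m depending
        # on whether the w_i come from X, A, or X + A.
        r = torch.cat([h0, o.mean(dim=0)], dim=-1)  # Eq. (4)
        u = self.dense(r)                           # u = W . r + b
        return u.argmax(dim=-1)                     # Eq. (5) at inference
```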

Datasets
Five benchmark datasets, i.e., LAP14 and REST14 (Pontiki et al., 2014), REST15 (Pontiki et al., 2015), REST16 (Pontiki et al., 2016), and TWITTER (Dong et al., 2014), are used in our experiments. Specifically, LAP14 consists of laptop computer reviews; REST14, REST15, and REST16 consist of restaurant reviews from online users; TWITTER includes tweets collected by querying the Twitter API. For all datasets, we use their official train/test splits and follow Tang et al. (2016b) to clean them by filtering out the aspects with the conflict label⁴ as well as the sentences without an aspect. The statistics of the five processed datasets are reported in Table 1, where the numbers of aspects with positive, neutral, and negative polarities are given. Note that in some datasets, e.g., LAP14 and REST14, there are rather high percentages of sentences (e.g., the sentence in Figure 1) that contain different sentiments towards different aspects, as shown in the DIFF. rows in Table 1, which indicates a bigger challenge for ASA compared to sentiment analysis on an entire sentence.
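A minimal sketch of this cleaning step, assuming each raw instance has already been flattened into a (sentence, aspect, label) tuple (the actual datasets ship as SemEval XML and TWITTER text files):

```python
def clean_dataset(instances):
    """Follow Tang et al. (2016b): drop aspects with the 'conflict'
    label and sentences without an aspect."""
    return [
        (sent, asp, label)
        for sent, asp, label in instances
        if asp and label in {"positive", "neutral", "negative"}
    ]
```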

Word Dependency Extraction
Similar to previous studies (Tang et al., 2020) that also require dependency information, we employ the English version of SAPar⁵ (Tian et al., 2020e), the most effective constituency parser trained on the English Penn Treebank (PTB) (Marcus et al., 1993), to obtain the constituency trees of the input text, and then convert them into dependency trees with the Stanford converter⁶. Therefore, when a dependency tree is built on the entire input text, for each word in the text one can find its dependent words and types according to the dependency paths in the tree. Consequently, the dependency relations of each word to others can be extended along the dependency paths, so our model is not restricted to one-hop (first-order) relations. One could easily extend the coverage of word dependencies with two- or three-hop relations from a given word, known as second- and third-order dependencies; e.g., "poor → service → bar" in Figure 1 is a second-order dependency relation. As described in §2.1, extracting first-order word dependencies is straightforward; to extend it to higher-order ones, we follow the same principle to extract word dependencies and assign dependency types as follows: (1) for the governor $w_g$ of the target word $w$, we collect all governors and dependents of $w_g$ (except for $w$), associated with $w_g$'s inbound and outbound dependency types, respectively; (2) for each dependent $w_d$ of $w$, we find all dependents of $w_d$ and use outbound dependency types to represent $w_d$'s dependent words; (3) we include all context words and their corresponding dependency types collected in (1) and (2) as the input to the KVMN for $w$, and repeat the process for further higher-order word dependencies.

³ This is because many sentences have more than one aspect and such aspects usually have contrastive sentiment polarities.
⁴ The "conflict" label is used in LAP14 and REST14/16 to identify aspects that have conflicting sentiment polarities.
⁵ https://github.com/cuhksz-nlp/SAPar
For example, for the input text in Figure 1, the second-order word dependencies and types for "service" start from its governor "poor" and its dependent "bar". For "poor", we collect its governor "fantastic" with the inbound dependency type "advcl", and its dependents "although" and "is" with the outbound dependency types "mark" and "cop", respectively. "bar" cannot be expanded because it has no dependent; the collection thus stops there. Therefore, the resulting words (keys) in the second-order dependencies and their corresponding dependency types (values) for "service" are $\mathcal{K}_{11}$ = {bar, poor, fantastic, although, is} and $\mathcal{V}_{11}$ = {compound, nsubj, advcl, mark, cop}, respectively.

[Table 2: Experimental results (accuracy and F1 scores) of using different encoders (BERT-base and BERT-large) with and without KVMN on five benchmark datasets, where $\mathcal{X}$, $\mathcal{A}$, and $\mathcal{XA}$ refer to KVMN modeling word dependencies from $\mathcal{X}$ only, $\mathcal{A}$ only, and $\mathcal{X}+\mathcal{A}$, respectively.]
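The steps above can be traced with a short breadth-first sketch over a head-indexed parse; the array format (`heads[i]` pointing to word i's governor, `labels[i]` giving the arc type) is an assumption for illustration, since the actual parses come from SAPar plus the Stanford converter.

```python
from collections import defaultdict

def extract_dependencies(heads, labels, target, order=1):
    """Collect {context word index: dependency type} for `target`
    up to the given order. heads[i] is the governor index of word i
    (-1 for the root); labels[i] is the type of the arc i -> heads[i].
    Inbound types represent governors, outbound types dependents."""
    deps = defaultdict(list)
    for i, h in enumerate(heads):
        if h >= 0:
            deps[h].append(i)

    collected, frontier = {}, {target}
    for _ in range(order):
        nxt = set()
        for w in frontier:
            g = heads[w]
            if g >= 0 and g != target and g not in collected:
                collected[g] = labels[w]      # inbound type for the governor
                nxt.add(g)
            for d in deps[w]:
                if d != target and d not in collected:
                    collected[d] = labels[d]  # outbound type for a dependent
                    nxt.add(d)
        frontier = nxt
    return collected
```

On the Figure 1 example, running this sketch with order=2 for "service" yields exactly the keys {bar, poor, fantastic, although, is} and values {compound, nsubj, advcl, mark, cop} listed above.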

Implementation Details
We adopt BERT-base-uncased and BERT-large-uncased⁷ as the encoders in our approach, which are demonstrated to be the most effective encoders for many NLP tasks (Straková et al., 2019; Baldini Soares et al., 2019). In our experiments, we use the default settings for the two BERT encoders (i.e., for BERT-base-uncased, 12 layers with 768-dimensional hidden vectors; for BERT-large-uncased, 24 layers with 1024-dimensional hidden vectors). For all experiments, we use the Adam optimizer (Kingma and Ba, 2014) and try different combinations of learning rates, dropout rates, and batch sizes.⁸ In addition, we apply Xavier initialization (Glorot and Bengio, 2010) to all trainable parameters, including the embeddings for keys and values in the KVMN. Moreover, we use the cross-entropy loss function to optimize our model and follow the convention of evaluating models via accuracy and macro-averaged F1 scores over all sentiment polarities, i.e., positive, neutral, and negative.

⁷ We obtain the BERT models from https://github.com/huggingface/pytorch-pretrained-BERT.
⁸ We report the hyper-parameter settings of different models, as well as their size and running speed, in the Appendix.
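A sketch of this training setup, assuming the KVMN embeddings are the non-BERT parameters to initialize (re-initializing the pre-trained BERT weights would discard them); the learning rate is a placeholder, with the actual grids in the Appendix.

```python
import torch

def build_training_setup(model, lr=2e-5):
    """Adam optimizer + Xavier initialization + cross-entropy loss."""
    for name, param in model.named_parameters():
        # Xavier initialization for the newly added matrices, e.g., the
        # key/value embeddings of the KVMN (naming is assumed here).
        if "bert" not in name and param.dim() > 1:
            torch.nn.init.xavier_uniform_(param)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    return optimizer, criterion
```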

Effect of Using Word Dependencies
In the main experiments, we test our model with and without integrating word dependencies through KVMN, where both the base and large BERT encoders are used. In detail, when leveraging word dependencies, we run experiments with our proposed model to explore the effect of learning from different parts of the input; i.e., we try word dependencies from three sources: $\mathcal{X}$ only, $\mathcal{A}$ only, and both $\mathcal{X}$ and $\mathcal{A}$ (see §2.2). Experimental results are reported in Table 2, with the prefixes of KVMN denoting which part the dependencies are encoded from.
There are several observations. First, KVMN works well with both the base and large BERT. Although the BERT baselines already achieve good performance, improvements of our proposed model over the baselines are observed on all datasets with respect to both accuracy and F1 scores. Second, among the three settings of encoding from different parts of the input (i.e., $\mathcal{X}$, $\mathcal{A}$, and $\mathcal{X}+\mathcal{A}$), on most datasets (except for TWITTER) the highest performance is observed with A-KVMN. These results comply with the intuition that extracting and learning word dependencies from $\mathcal{A}$ ensures KVMN only incorporates information from the content directly associated with the aspect words, thus focusing the model on the words that are most likely to be helpful for ASA on a particular aspect in a sentence. Third, although the overall performance of X-KVMN and XA-KVMN is not as good as that of A-KVMN, they are still better than the baselines without word dependencies. Especially for X-KVMN, where word dependencies are extracted from the entire sentence, the dependency information still helps ASA even though it introduces some noise to the task when the entire sentence possesses a different sentiment polarity (as shown in the DIFF. rows in Table 1); such noise contributes to its inferior performance compared to the A-KVMN setting. Therefore, when the sentiment of the entire sentence agrees with that of its aspects (e.g., the TWITTER dataset, according to Table 1), X-KVMN and A-KVMN show similar performance.

[Table 3: Experimental results of our models with the best setting (i.e., using base and large BERT with A-KVMN) and dependency relations of different (i.e., 1st, 2nd, and 3rd) orders. The average percentage of words in a sentence covered by word dependencies of different orders is also reported in the CVGE. column.]

Effect of Different Dependency Orders
Previous experiments showed the effectiveness of our model with KVMN on first-order word dependencies. In this experiment, we use the best setting (i.e., models using A-KVMN) for base and large BERT and run them with higher-order dependencies to further investigate the effectiveness of our model with more dependency information. Particularly, we try second- and third-order word dependencies and compare their results with the previous first-order ones. The results on all datasets, as well as the average coverage (%) of words in each sentence with respect to different dependency orders,⁹ are reported in Table 3, where (a) and (b) show the results of models with BERT-base and BERT-large encoders, respectively. From the results, it is found that in most cases (e.g., for both base and large BERT), models using second-order word dependencies achieve the overall highest performance, which can be explained by the fact that the first-order dependencies of aspect words do not cover enough salient information to help ASA. This is a common phenomenon when negation is included in a sentence. For example, in "the pizza is not good", the first-order dependencies of the aspect "pizza" only link "pizza" with "good", so the classifier is misled into predicting a positive sentiment polarity. Compared to second-order word dependencies, third-order dependencies in general do not provide further improvement for ASA, because more irrelevant information is introduced to the encoder, which distracts the model from the final prediction. In fact, with third-order dependencies, around 75% of the words in each sentence are fed into KVMN, which could severely bias ASA towards sentence-level sentiment polarities and eventually harm model performance, especially when an aspect-level sentiment differs from the sentence-level sentiment.

⁹ This metric presents how many words in each input sentence are involved when different orders are applied for extracting word dependencies, so as to illustrate how much information in a sentence is helpful for ASA.
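Under one plausible (assumed) copular parse of the negation example, the first- versus second-order difference can be traced by reusing `extract_dependencies` from the sketch in the Word Dependency Extraction section:

```python
# "the pizza is not good" under an assumed copular parse:
# indices: 0 the, 1 pizza, 2 is, 3 not, 4 good (root)
heads = [1, 4, 4, 4, -1]
labels = ["det", "nsubj", "cop", "neg", "root"]

# First order links "pizza" only to "good" (and "the"), which can
# mislead the classifier towards a positive polarity.
print(extract_dependencies(heads, labels, target=1, order=1))
# {4: 'nsubj', 0: 'det'}

# Second order additionally reaches "is" ('cop') and, crucially,
# the negation "not" ('neg').
print(extract_dependencies(heads, labels, target=1, order=2))
```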

Comparison with Previous Studies
To further demonstrate the effectiveness of our approach, we compare our best-performing model, i.e., the BERT-large encoder with second-order word dependencies incorporated through A-KVMN, with previous studies. The comparisons on all datasets are reported in Table 4, where the results of the BERT-large baseline, as well as those using BERT-base, are also included for reference. It is observed that our model consistently outperforms the BERT-large baseline on all datasets and achieves state-of-the-art results on three of them (i.e., LAP14, REST15, and REST16) in terms of both accuracy and F1 scores. Specifically, compared with previous studies that also leverage dependency information, our approach outperforms them on most datasets. This observation makes sense because previous models weight or average the hidden vectors of (aspect-related) words rather than the relations among them, and omit dependency types, which provide guidance for emphasizing useful relations; e.g., the "amod" (adjectival modifier) type indicates that an adjectival modifier could be the sentiment word of the corresponding aspect. Therefore, the superiority of our model comes from two aspects: weighting word-word relations and leveraging dependency types. KVMN highlights salient dependency relations and learns from them and their dependency types, which alleviates the influence of noisy dependency information. In addition, we note that our approach achieves inferior results on the TWITTER dataset compared with Tang et al. (2020). One possible explanation is that a dependency parser trained in the general domain produces inferior parsing results on TWITTER texts from the social media domain, which makes it harder for our approach to improve over the BERT-large baseline compared to the other datasets. Nevertheless, the effectiveness of our approach still holds, given that it outperforms Tang et al. (2020) on all other datasets.

[Table 4: Performance comparison (accuracy and F1 scores) of our best model (BERT-large + A-KVMN with second-order word dependencies) with previous studies on all datasets. The results of the BERT-large baseline are also reported for reference. Models that use BERT-large as the encoder are marked by "†". The results marked by "*" indicate that our model is significantly better than the corresponding baseline model (t-test with p < 0.05).]

Ablation Study
To confirm the validity of using both word relations (keys) and their corresponding dependency types (values) for ASA, we conduct an ablation study by learning from only one of the two types of dependency information. We choose the models using BERT-base and BERT-large with our best setting (i.e., second-order word dependencies and A-KVMN) for this study and adapt the KVMN module to key-only or value-only input. The experimental results on all benchmark datasets are reported in Table 5, where, in general, excluding keys leads to larger performance drops than excluding values, suggesting that word-word relations contribute more than dependency types. Still, one cannot deny the contribution of dependency types, because the drop is still significant when values are excluded; on some datasets (e.g., REST15 and REST16), even higher drops are observed on F1 than for the key ablation. The results of this ablation study demonstrate that dependency types are of high importance for improving ASA when they are appropriately encoded.

[Table 5: Results on five datasets from our full models (base and large BERT with A-KVMN and second-order word dependencies) and the variants where keys ("-KEYS") and values ("-VALUES") are ablated. Δ refers to the drop in accuracy and F1 score when keys or values are excluded from the full model.]
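One way to realize the two ablated variants is to cut one side of the memory out of Eqs. (2)-(3). The following is a minimal sketch under the assumption that the key-only model ("-VALUES") sums the p-weighted key embeddings while the value-only model ("-KEYS") averages the value embeddings uniformly; the exact wiring in the actual experiments may differ.

```python
import torch.nn.functional as F

def kvmn_output(h_i, e_k, e_v, mode="full"):
    """'full': Eq. (3); '-values': no type information, keys both
    address and are summed; '-keys': no relation-based addressing,
    uniform weights over the value embeddings."""
    if mode == "-keys":
        return e_v.mean(dim=0)
    p = F.softmax(e_k @ h_i, dim=0)  # Eq. (2)
    target = e_k if mode == "-values" else e_v
    return (p.unsqueeze(-1) * target).sum(dim=0)
```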

Case Study
To illustrate the effect of the KVMN module in weighting salient word dependencies and thus improving ASA, we conduct a case study on the sentence "The falafel was rather overcooked and dried but the chicken was fine" shown in Figure 2, which contains two aspects with contrasting sentiment polarities, i.e., negative towards "falafel" and positive towards "chicken". For each aspect, we run our best model (BERT-large + A-KVMN with second-order word dependencies) and visualize the weights ($p_{i,j}$ in Eq. (2)) assigned to all associated dependency types and their corresponding words, where darker color refers to higher weights.
For the first aspect "falafel" (Figure 2(a)), although there are some adjectives carrying opposite sentiment polarities within its second-order relations, KVMN successfully distinguishes "overcooked" is more important to it and assigns a relatively higher weight. This is because that the corresponding type ("nsubjpass", passive nominal subject) to "overcooked" is intensively highlighted The falafel was rather overcooked and dried but fine advmod ROOT overcooked … … … chicken the was fine … Figure 2: Illustration of an example sentence with two aspects in different sentiment polarities. For each aspect, weights (from our best model) assigned to dependent words and dependency types are visualized with colors, where darker color refers to higher weights. so that the model identifies it as the main sentiment carrier for the aspect word "falafel" where other adjectives (i.e., "dried" and "fine") share the "conj" (conjunction) type and are distantly related to the aspect words, making them less important.
For the other aspect "chicken" (Figure 2(b)), similar to the first one, both "overcooked" and "fine" are included in its associated context words. In this case, "fine" is more closely dependent on "chicken" than "overcooked", where it has a "nsubj" (noun subject) type showing a predicate role thus receives higher weight from KVMN, resulting in a positive sentiment polarity prediction towards "chicken". Overall, this case study perfectly explains the ef-fectiveness of our model, where two aspects share the same context and the only change is the dependency information (o i ) comes from KVMN. Therefore, the different prediction results for the two aspects suggest that KVMN appropriately learns from salient dependency relations and types for each aspect, where different types have their own capabilities to enhance ASA accordingly (e.g., "nsubj" may contribute more than "conj").

Related Work
Different from sentiment analysis on larger granularities of text, such as documents and sentences, ASA focuses on predicting sentiment polarities for a specific aspect (e.g., "pizza") or category (e.g., "food") in a piece of text. To address this task, early approaches (Jiang et al., 2011; Dong et al., 2014) followed the sentence classification paradigm, and recent studies reformulated it as sentence-aspect pair classification with neural approaches (Wang et al., 2016; Tang et al., 2016a; Ma et al., 2017; Chen et al., 2017; Xue and Li, 2018; Li et al., 2018b; Hu et al., 2019), such as recurrent models (e.g., bi-LSTM) and pre-trained encoders (e.g., BERT), to effectively capture contextual information. In addition to improving the input form, advanced models such as memory networks (Tang et al., 2016b; Chen et al., 2017; Wang et al., 2018; Zhu and Qian, 2018), attention mechanisms (Wang et al., 2016; Ma et al., 2017; Hazarika et al., 2018), capsule networks (Chen and Qian, 2019), GNN (Huang and Carley, 2019), and Transformers (Tang et al., 2020) have been applied to this task, with other studies leveraging external resources, including position information (Gu et al., 2018), document information (He et al., 2018b), commonsense knowledge (Ma et al., 2018), etc. Among all resources, syntactic information has proved to be the most effective one and has been successfully adopted in recent studies with GNN (Huang and Carley, 2019). Compared with previous studies, our approach offers an alternative way of using KVMN and syntactic information for ASA. In those studies using memory networks, where the memories are represented by contextual features of the aspect terms, dependency information was not leveraged. In addition, compared with the approaches leveraging word dependencies (i.e., using attention mechanisms or GNN), which not only omit useful dependency information such as relation types but also demand complicated model structures to do so, our approach comprehensively encodes both word-word relations and their dependency types, and models them efficiently with KVMN.

Conclusion
In this paper, we propose an effective neural approach to improve ASA with word dependencies through KVMN, where for each aspect term we first extract the words associated with it, together with the corresponding dependency relation types, according to the dependency parse of the input sentence, and then use KVMN to encode and weight such information to enhance ASA accordingly. In our approach, not only word-word relations but also their dependency types are leveraged in a KVMN, which, to the best of our knowledge, is the first attempt among related syntax-driven studies for ASA. Experimental results on five widely used benchmark datasets demonstrate the effectiveness of our approach and show that second-order word dependencies are the best choice for ASA, with new state-of-the-art results achieved on three datasets. Moreover, further analyses illustrate the validity of applying KVMN to both dependency relations and type information, especially the effectiveness of dependency types, which are often omitted in previous studies.

[Table 6: The number of trainable parameters (PARA.) and the running speed (sentences/second) on the test sets of the baseline models (the ones without KVMN and dependency information) and our best-performing models (the ones with A-KVMN and second-order dependencies).]