To be Closer: Learning to Link up Aspects with Opinions

Dependency parse trees are helpful for discovering opinion words in aspect-based sentiment analysis (ABSA) (CITATION). However, the trees obtained from off-the-shelf dependency parsers are static, and can be sub-optimal for ABSA, because syntactic trees are not designed to capture the interactions between opinion words and aspect words. In this work, we aim to shorten the distance between aspects and their corresponding opinion words by learning an aspect-centric tree structure. Aspect and opinion words are expected to be closer along such a tree structure than along the standard dependency parse tree. The learning process allows the tree structure to adaptively correlate the aspect and opinion words, enabling us to better identify the polarity in the ABSA task. We conduct experiments on five aspect-based sentiment datasets, and the proposed model significantly outperforms recent strong baselines. Furthermore, our thorough analysis demonstrates that the average distance between aspect and opinion words is shortened by at least 19% on the standard SemEval Restaurant14 (CITATION) dataset.


Introduction
Aspect-based sentiment analysis (ABSA) (Pang et al., 2008; Liu, 2012) aims at determining the sentiment polarity expressed towards a particular target in a sentence. For example, in the sentence "The battery life of this laptop is very long, but the price is too high", the sentiment expressed towards the aspect term "battery life" is positive, whereas the sentiment towards the aspect term "price" is negative. Early research efforts (Wang et al., 2016; Chen et al., 2017; Liu and Zhang, 2017; Li and Lu, 2019; Xu et al., 2020a) focus on using an attention mechanism (Bahdanau et al., 2015) to model interactions between aspect and context words. However, such attention-based models may suffer from overly focusing on frequent words that express sentiment polarity while ignoring low-frequency ones (Tang et al., 2019; Sun and Lu, 2020). Recent efforts show that the syntactic structures of sentences can facilitate the identification of sentiment features related to aspect words (Zhang et al., 2019; Sun et al., 2019b; Huang and Carley, 2019). Nonetheless, these methods suffer from two shortcomings. First, the trees obtained from off-the-shelf dependency parsers are static, and thus cannot adaptively model the complex relationships between multiple aspects and opinion words. Second, an inaccurate parse tree can lead to error propagation downstream in the pipeline. Several research groups have explored these issues with more refined parse trees. For example, Chen et al. (2020) constructed task-specific structures by developing a gate mechanism to dynamically combine the parse tree information and a stochastic graph sampled from the HardKuma distribution (Bastings et al., 2019). On the other hand, Wang et al. (2020) greedily reshaped the dependency parse tree using manual rules to obtain aspect-related syntactic structures.

Figure 1: An example with different tree representations in the Twitter dataset. "Dist" returns the number of hops between two words in the tree. Words marked in red and blue are aspect and opinion words, respectively.
Despite being able to effectively alleviate the tree representation problem, existing methods still depend on external parse trees, leading to one potential problem: the dependency parse trees are not designed for the purpose of ABSA but to express syntactic relations. Specifically, the aspect term is usually a noun or a noun phrase, while the root of the dependency tree is often a verb or an adverb. According to our statistics (see Appendix A for details), for almost 90% of the sentences, the roots of their dependency trees are not aspect words. Such a root inconsistency issue may prevent the model from effectively capturing the relationships between opinion words and aspect words. For example, Figure 1(a) shows the dependency tree obtained by the spaCy toolkit (https://spacy.io/api/dependencyparser). The root is the gerund "Loving" while the aspect term is the noun phrase "harry potter". The distances between the aspect words "harry" and "potter" and the critical opinion word "Loving" under the dependency tree are four hops and three hops, respectively, whereas their relative distances in the sequential order are two and three. Intuitively, a closer distance enables us to better identify the polarity in the ABSA task. Figure 1(b) shows an aspect-centric tree rooted at the aspect words. The distances between aspect and opinion words are one hop and two hops, which is closer than in the standard dependency parse tree.
In this paper, we propose a model that learns Aspect-Centric Latent Trees (we name it the ACLT model), which are specifically tailored for the ABSA task. We assume that inducing tree structures whose roots are within the aspect term enables the model to better correlate the aspect and opinion words. We build our model based on the structured attention mechanism (Kim et al., 2017; Liu and Lapata, 2018) and a variant of the Matrix-Tree Theorem (MTT) (Tutte, 1984; Koo et al., 2007). Additionally, we propose to impose a soft constraint that encourages the aspect words to serve as the root of the tree structure induced by MTT. As a result, the search space for inferring the root is reduced during training.
Our contributions are summarized as follows:
• We propose Aspect-Centric Latent Trees (ACLT), specifically tailored for the ABSA task, to link up aspects with opinion words in an end-to-end fashion.
• Our ACLT model is able to learn an aspect-centric latent tree with a root refinement strategy that correlates the aspect and opinion words better than the standard parse tree.
• Experiments show that our model outperforms existing approaches and yields new state-of-the-art results on four ABSA benchmark datasets. Quantitative and qualitative experiments further justify the effectiveness of the learned aspect-centric trees. The analysis demonstrates that our ACLT is capable of shortening the average distance between aspect and opinion words by at least 19% on the standard SemEval Restaurant14 dataset.
To the best of our knowledge, we are the first to link up aspects with opinions through a specifically designed latent tree that imposes root constraints.

Model
In this section, we present the proposed Aspect-Centric Latent Tree (ACLT) model (Figure 2) for the ABSA task. We first obtain contextualized representations from the sentence encoder. Next, we use a tree inducer to produce a distribution over all possible latent trees; the tree inducer is a latent-variable model that treats tree structures as the latent variable. Once we have the distribution over latent trees, we adopt a root refinement procedure to obtain aspect-centric latent trees. We then encode the probabilistic latent trees with a graph or tree encoder. Finally, we use the structured representation from the tree encoder for sentiment classification.

Sentence Encoder
Given a sentence s = [w_1, ..., w_n] and the corresponding aspect term a = [w_i, ..., w_j] (1 ≤ i ≤ j ≤ n), we adopt the pre-trained language model BERT (Devlin et al., 2019) to obtain the contextualized representation for each word. We concatenate the words in the sentence and explicitly present the aspect term in the input representation, i.e., the input takes the form "[CLS] w_1 ... w_n [SEP] w_i ... w_j [SEP]", and the last-layer hidden states of the sentence tokens serve as the contextualized representations h_1, ..., h_n.
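As a concrete illustration of this input construction, below is a minimal sketch using the HuggingFace transformers library; the tokenizer call and variable names are our own assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of the sentence-pair encoding described above.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

sentence = "The battery life of this laptop is very long"
aspect = "battery life"

# Encodes "[CLS] sentence [SEP] aspect [SEP]": the aspect term is
# presented explicitly as the second segment of the input.
inputs = tokenizer(sentence, aspect, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# Contextualized representation for every word piece in the input.
h = outputs.last_hidden_state  # shape: (1, seq_len, 768)
```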

Aspect-Centric Tree Inducer
While prior efforts (Wang et al., 2020;Chen et al., 2020) on learning latent (or explicit) trees for the ABSA task exist, one of the major contributions of our work is that we link up aspects and opinion words by addressing the root inconsistency issue. Inspired by recent work (Liu and Lapata, 2018;Nan et al., 2020), we use a variant of Kirchhoff's Matrix-Tree Theorem (Tutte, 1984;Koo et al., 2007) to induce the latent dependency structure.
Given the contextualized representation h_i ∈ R^d of each node (token) in the sentence, where d is the dimension of the node representations, we first calculate pair-wise unnormalized edge scores e_ij between the i-th and the j-th node from the node representations h_i and h_j, by way of two feed-forward neural networks (FNNs) and a bilinear function:

e_ij = (tanh(W_p h_i))^T W_b (tanh(W_c h_j))

where W_p ∈ R^{d×d} and W_c ∈ R^{d×d} are the weights of the two feed-forward neural networks, tanh is applied as the activation function, and W_b ∈ R^{d×d} is the weight of the bilinear transformation. The resulting score matrix E ∈ R^{n×n}, with entries e_ij, can be viewed as a weighted adjacency matrix for a graph G with n nodes, where each node corresponds to a word in the sentence.
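The edge scoring can be sketched as follows; the module names mirror the notation above (W_p, W_c, W_b), but the code is an illustrative sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn

d = 768
W_p = nn.Linear(d, d)                      # feed-forward net, "parent" view
W_c = nn.Linear(d, d)                      # feed-forward net, "child" view
W_b = nn.Parameter(torch.empty(d, d))      # bilinear weight
nn.init.xavier_uniform_(W_b)

def edge_scores(h):
    """h: (n, d) contextualized word representations -> (n, n) scores e."""
    p = torch.tanh(W_p(h))                 # (n, d)
    c = torch.tanh(W_c(h))                 # (n, d)
    return p @ W_b @ c.t()                 # e[i, j]: score of edge i -> j
```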
Next, we calculate the root score r_i, representing the unnormalized probability of the i-th node being selected as the root of the structure:

r_i = W_r h_i

where W_r ∈ R^{1×d} is the weight of the linear transformation. Following Koo et al. (2007), we calculate the marginal probability of each dependency edge of the latent structure. We first assign non-negative weights A ∈ R^{n×n} to the edges, where A_ij is the weight of the edge between the i-th and the j-th node:

A_ij = exp(e_ij) if i ≠ j, and A_ij = 0 otherwise.

We then build the Laplacian matrix L ∈ R^{n×n} of graph G and its variant L̄, which takes the root node into consideration for further computation (Koo et al., 2007):

L_ij = Σ_{i'} A_{i'j} if i = j, and L_ij = -A_ij otherwise;
L̄_ij = exp(r_j) if i = 1, and L̄_ij = L_ij otherwise.

We use P_ij to denote the marginal probability of the dependency edge between the i-th and the j-th node, and P^r_i to denote the marginal probability of the i-th word being headed by the root of the tree. P_ij and P^r_i can then be derived as

P_ij = (1 - δ_{1,j}) A_ij [L̄^{-1}]_{jj} - (1 - δ_{i,1}) A_ij [L̄^{-1}]_{ji},
P^r_i = exp(r_i) [L̄^{-1}]_{i1},

where δ is the Kronecker delta. Here, P ∈ R^{n×n} can be interpreted as a weighted adjacency matrix of the word-level graph. We refer the interested reader to Koo et al. (2007) for more details.
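The matrix-tree computation itself is compact. The sketch below follows the formulas above (Koo et al., 2007) and favors readability over numerical stability (no log-space tricks); it is not the authors' code.

```python
import torch

def mtt_marginals(e, r):
    """e: (n, n) edge scores, r: (n,) root scores -> (P, P_root)."""
    n = e.size(0)
    A = torch.exp(e) * (1.0 - torch.eye(n))      # A_ii = 0
    L = torch.diag(A.sum(dim=0)) - A             # Laplacian of graph G
    L_bar = L.clone()
    L_bar[0, :] = torch.exp(r)                   # first row holds root scores
    inv = torch.inverse(L_bar)

    i_idx = torch.arange(n).unsqueeze(1).expand(n, n)
    j_idx = torch.arange(n).unsqueeze(0).expand(n, n)
    # Edge marginals P_ij and root marginals P^r_i (0-indexed analogue of
    # the equations above).
    P = ((j_idx != 0).float() * A * inv.diagonal().unsqueeze(0)
         - (i_idx != 0).float() * A * inv.t())
    P_root = torch.exp(r) * inv[:, 0]
    return P, P_root
```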
Root Refinement. Despite the successful application of tree information induced by MTT in previous work (Liu and Lapata, 2018; Guo et al., 2020; Nan et al., 2020), the MTT would still produce arbitrary trees that are inappropriate for the specific task if there is no structural supervision. Under the assumption that inducing tree structures whose roots are within the aspect term enables the model to better correlate the aspect and opinion words than the standard parse tree, we propose to impose a soft constraint that encourages the aspect words w ∈ a to serve as the root of the tree structure induced by MTT. Specifically, we introduce a cross-entropy loss for this assumption:

L_root = -Σ_{i=1}^{n} [ t_i log P^r_i + (1 - t_i) log(1 - P^r_i) ]

where t_i ∈ {0, 1} indicates whether the i-th token is an aspect word, and P^r_i is the probability of the i-th token being the root, as defined above. The nice property of this loss is that minimizing it essentially adjusts the aspect words to be the root of the latent trees. Moreover, this supervision reduces the search space of root inference for MTT during training.
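A possible realization of this loss is a token-level binary cross-entropy over the root marginals; averaging over tokens (rather than summing) is our own assumption here.

```python
import torch

def root_refinement_loss(P_root, aspect_mask, eps=1e-12):
    """P_root: (n,) root marginals; aspect_mask: (n,) 1 for aspect tokens."""
    t = aspect_mask.float()
    return -(t * torch.log(P_root + eps)
             + (1.0 - t) * torch.log(1.0 - P_root + eps)).mean()
```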
Intuitively, the tree inducer produces a largely random structure in early training iterations, since information propagates mostly between neighboring nodes. As the roots are adjusted towards the aspect words and the structure becomes more refined when the loss decreases, the tree inducer is more likely to generate an aspect-centric latent structure. Our experiment in Section 3.4 shows that the root refinement loss successfully guides the induction of latent trees in which the aspect words are consistent with the root.

Tree Encoder
Given the contextualized representations h and the corresponding aspect-centric graph P, we follow Kim et al. (2017) and Liu and Lapata (2018) and encode the tree information with a structured attention mechanism:

s^p_i = Σ_j P_ji h_j + P^r_i h_a,
s^c_i = Σ_j P_ij h_j,
s_i = tanh(W_s [h_i ; s^p_i ; s^c_i]),

where s^p_i ∈ R^d is the context representation gathered from the possible parents of h_i, s^c_i ∈ R^d is the context representation gathered from its possible children, and h_a is the representation of the root node. We concatenate s^p_i and s^c_i with h_i and transform them with the weights W_s ∈ R^{d×3d} to obtain the structured representation s_i of the i-th word.
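The aggregation can be sketched as below, following the structured attention of Liu and Lapata (2018); the concatenation order inside the linear layer is arbitrary in this sketch.

```python
import torch
import torch.nn as nn

d = 768
W_s = nn.Linear(3 * d, d)

def tree_encode(h, P, P_root, h_a):
    """h: (n, d); P: (n, n) edge marginals; P_root: (n,); h_a: (d,) root repr."""
    s_parent = P.t() @ h + P_root.unsqueeze(1) * h_a   # gathered from parents
    s_child = P @ h                                    # gathered from children
    return torch.tanh(W_s(torch.cat([h, s_parent, s_child], dim=-1)))
```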

Classifier
Following Xu et al. (2019) and Sun et al. (2019a), we leverage s_0, the structured aspect-aware representation of the sentence, to compute the probability distribution over the different sentiment polarities:

y_p = softmax(W_p s_0 + b_p)

where W_p and b_p are the model parameters of the classifier, and y_p is the predicted sentiment probability distribution. The objective of the classifier is to minimize the cross-entropy loss for an instance (x, y):

L_senti = -log y_p[y]

where y ∈ {positive, negative, neutral}. Our final objective is a multi-task learning objective, defined as the weighted sum of the root refinement loss and the classification loss:

L = α L_root + (1 - α) L_senti

where α ∈ (0, 1) is a coefficient that balances the contribution of each component during training. The hyper-parameter α is selected based on performance on the validation set.
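For completeness, a hedged sketch of the classifier and joint objective is shown below; the concrete value of alpha and the exact way it is distributed over the two losses are assumptions for illustration only.

```python
import torch
import torch.nn as nn

d, num_classes = 768, 3
classifier = nn.Linear(d, num_classes)
ce = nn.CrossEntropyLoss()
alpha = 0.3  # hypothetical value; in practice tuned on the development set

def total_loss(s0, gold_label, root_loss):
    """s0: (batch, d) sentence representations; gold_label: (batch,) class ids."""
    senti_loss = ce(classifier(s0), gold_label)
    return alpha * root_loss + (1.0 - alpha) * senti_loss
```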

Experimental Setup
We evaluate our proposed ACLT model on five benchmark datasets: the Laptop (Lap14) and Restaurant (Rest14) review datasets from SemEval 2014 Task 4 (Pontiki et al., 2014), the Restaurant15 (Rest15) review dataset from SemEval 2015 Task 12 (Pontiki et al., 2015), the Restaurant16 (Rest16) review dataset from SemEval 2016 Task 5 (Pontiki et al., 2016), and the Twitter posts from Dong et al. (2014). Following previous works (Tang et al., 2016; Chen et al., 2017; Wang and Lu, 2018), we remove the few examples that have conflicting labels. We randomly split 10% of the training data off as the development set, and the model is trained only on the remaining data. Detailed statistics of the datasets can be found in Table 1. All hyper-parameters are tuned on the development set. We employ the uncased version of the BERT-base (Devlin et al., 2019) model in PyTorch (Wolf et al., 2020). Following previous conventions, we repeat each experiment three times and average the results, reporting accuracy (Acc.) and macro-F1 (F1).

Baselines
The state-of-the-art baselines selected for comparison fall into three main categories: syntax-information-free models, dependency parse tree based models, and latent tree based models. Among the syntax-information-free models, TNet-AS (Li et al., 2018) implements a context-preserving mechanism to obtain aspect-specific representations. KumaGCN (Chen et al., 2020) constructs syntactic information by developing a gate mechanism to combine the HardKuma structure and the dependency parse tree. We reproduce the results for baselines whenever the authors provide the source code. For the AS-GCN+BERT and KumaGCN+BERT models, where the code is not available as of this writing, we implement them ourselves using the optimal hyper-parameter settings reported in their papers. Since we randomly split 10% of the training data off as the development set and train the model only on the remaining data, the results of R-GAT+BERT (Wang et al., 2020) and KumaGCN+BERT (Chen et al., 2020) are lower than those reported in the original papers. In our experiments, we report the average result and the mean absolute deviation over three runs with random initialization. We stop training after a maximum of 30 epochs.

Main Results
As shown in Table 2, dependency tree based models and latent tree based models generally achieve better results than syntax-information-free models, suggesting that syntactic information indeed benefits the ABSA task and enables these models to achieve promising results.
Our model consistently outperforms the models that do not use any syntactic information. For example, ACLT improves upon the BERT-SRC model by 3.56 F1 points on the Lap14 dataset, which suggests that our proposed model is able to induce an effective latent tree for ABSA in an end-to-end fashion. With the exception of R-GAT+BERT on the Rest14 dataset in terms of F1, our model surpasses all compared models by a significant margin. For example, our model achieves 72.08 and 78.64 F1 on the Rest15 and Rest16 datasets, significantly outperforming the current state-of-the-art model KumaGCN+BERT under the same setting. These results empirically show that, compared to the models that use syntactic information, ACLT induces a more informative, task-specific latent structure that establishes effective connections between aspect words and context. Our ACLT model also shows its superiority over all baselines in terms of accuracy.

Does ACLT shorten the distances between aspect and opinion words?
To gain further insight into the relationship between aspect and opinion words in the text, we inspect the distance between aspect words and selected opinion words. Specifically, we first select the top five most frequent positive and negative opinion words, respectively, in the Rest14 dataset. We define the distance between aspect and opinion words as the number of hops between them, and calculate this distance in a dependency parse tree and in an aspect-centric tree, respectively. Table 3 presents statistics on the average distance between aspect and opinion words in the trees produced by the spaCy dependency parser (Parser), by the Matrix-Tree Theorem without root refinement (MTT), and by our model (ACLT). As can be seen, in our aspect-centric latent tree, the average distance between opinion words and aspect words is shorter than in the dependency parse tree and in the MTT tree. We also observe that, without root refinement, the average distance between opinion words and aspect words in the MTT tree is roughly equivalent to that in the parse tree. These results confirm our hypothesis that inducing tree structures whose roots are within the aspect term enables the model to better correlate the aspect and opinion words than the standard parse tree.
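For reference, the hop distance used here is simply the shortest-path length between two words in the (undirected) tree; the helper below is our own illustration with networkx, not the evaluation script.

```python
import networkx as nx

def hop_distance(edges, u, v):
    """edges: list of (head, dependent) pairs defining a tree."""
    g = nx.Graph(edges)                       # treat the tree as undirected
    return nx.shortest_path_length(g, u, v)

# e.g. for a tree containing the edges ("Loving", "harry") and ("harry", "potter"):
# hop_distance([("Loving", "harry"), ("harry", "potter")], "Loving", "potter") -> 2
```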

Effect of different tree representations
Our proposed aspect-centric latent tree, the latent MTT tree, and the standard dependency parse tree all represent the structure of a sentence. Nevertheless, the differences between them, and how they affect the aspect-based results, need to be investigated further. In this section, we first use BERT-base as the contextual encoder, and then use a GCN to encode the dependency parse tree information (Parser+GCN), the latent MTT tree information (MTT+GCN), the latent Kuma structure (Kuma+GCN), and our aspect-centric tree information (ACLT+GCN). Table 4 summarizes the results.
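For the +GCN variants, the tree (supplied as a weighted adjacency matrix such as P) can be consumed by a standard graph convolutional layer; the sketch below is a generic single GCN layer, not the exact architecture used in these experiments.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.linear = nn.Linear(d, d)

    def forward(self, h, adj):
        """h: (n, d) node features; adj: (n, n) weighted adjacency matrix."""
        adj = adj + torch.eye(adj.size(0))                # add self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.linear(adj @ h / deg))     # mean aggregation
```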
We observe that models incorporating syntactic information generally outperform the vanilla BERT-SRC, indicating that syntactic information benefits the ABSA task. A similar phenomenon can also be observed in other fundamental NLP tasks (Jie and Lu, 2019; Xu et al., 2021). Moreover, both our ACLT and ACLT+GCN models consistently outperform the models equipped with other dependency trees by a significant margin. These results demonstrate that the aspect-centric tree induced by our model is indeed capable of effectively building relationships between aspect and context words for the ABSA task. Under the same setting, ACLT+GCN outperforms Parser+GCN, MTT+GCN, and Kuma+GCN on all datasets. In particular, ACLT+GCN obtains improvements of 1.8, 2.6, and 7 F1 points over Parser+GCN, MTT+GCN, and Kuma+GCN, respectively, on Rest15. Moreover, ACLT+GCN outperforms ACLT on the Rest14 and Twitter datasets, indicating that using a GCN as the tree encoder can further boost model performance to a certain extent.
We have similar observations for our ACLT model in terms of accuracy. These experimental results demonstrate that our proposed aspect-centric latent tree is a more effective structure for ABSA than the parse tree. Interestingly, we observe that BERT cannot achieve promising results on all datasets when combined with the parse tree structure. For example, Parser+GCN drops 1.4 F1 points on the Rest16 dataset in comparison with the vanilla BERT-SRC. This suggests that a dependency parse tree structure may not be able to effectively capture the complicated interactions between aspect and opinion words.

Table 4: The performance of BERT with our aspect-centric latent tree vs. BERT with other tree structures.

Did root refinement work?
We quantify the effectiveness of the root refinement that adjusts the aspect words to be the root. We experiment with three different structures: the dependency parse tree obtained by spaCy (Parser), the tree directly induced by MTT without root refinement (MTT), and the aspect-centric tree induced by our model (ACLT). Figure 3 shows, for each dataset, the number of sentences in which the aspect word is consistent with the root under the three different tree structures. Compared to the other two tree structures, the roots of our learned trees are consistent with the aspect words in most sentences. These results demonstrate that the problem of inconsistency between root and aspect is largely resolved by our ACLT model.

Effect of tree pruning
To further investigate the effect of different tree structures on model performance, we examine ACLT, R-GAT+BERT, and KumaGCN+BERT with different degrees of tree pruning. More specifically, for R-GAT+BERT, which uses the standard parse tree, we discard the dependency relations beyond first-order (k=1) and second-order (k=2) with respect to the aspect, respectively. Following Guo et al. (2020), we mask the entries of the adjacency matrix P that are beyond first-order (k=1) and second-order (k=2) with respect to the aspect for KumaGCN+BERT and our ACLT model. As shown in Table 5, on the Twitter dataset our ACLT yields the best performance with the entire tree, outperforming the first-order and second-order pruned trees by 0.76 and 1.79 F1 points, respectively. This indicates that it is necessary to induce an entire aspect-centric latent tree rather than a pruned subtree in our model. Interestingly, we observe that R-GAT+BERT and KumaGCN+BERT achieve their best results with the pruned tree (k=1) and the pruned tree (k=2), respectively. This is likely because both R-GAT+BERT and KumaGCN+BERT rely on the parse tree, and only a small part of the standard parse tree is related to the ABSA task; introducing the entire tree may prevent these models from effectively capturing the relationships between opinion words and aspect words.
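The k-th order masking can be sketched as follows; it assumes a binary (or thresholded) adjacency matrix is available for measuring hop distances to the aspect, which is an implementation assumption on our part.

```python
import torch

def prune_beyond_k(adj, P, aspect_idx, k):
    """adj: (n, n) binary adjacency; P: (n, n) weights; aspect_idx: aspect positions."""
    n = adj.size(0)
    dist = torch.full((n,), float("inf"))
    dist[aspect_idx] = 0.0
    frontier, hop = list(aspect_idx), 0
    while frontier and hop < k:                  # breadth-first expansion
        hop += 1
        nxt = []
        for u in frontier:
            for v in torch.nonzero(adj[u] + adj[:, u]).flatten().tolist():
                if dist[v] == float("inf"):
                    dist[v] = hop
                    nxt.append(v)
        frontier = nxt
    keep = (dist <= k).float()                   # 1 for nodes within k hops
    return P * keep.unsqueeze(0) * keep.unsqueeze(1)
```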

Ablation Study
We conduct experiments to examine the effectiveness of the major components of our ACLT model, and Table 6 shows the ablation results on the five datasets. We observe that both the latent tree and the root refinement components contribute to the main model. Specifically, when the root refinement module is removed, the performance of ACLT drops considerably, by 5.2 and 4.8 F1 points on the Rest15 and Rest16 datasets, respectively. This result illustrates that refining the root towards the aspect words plays a crucial role in learning a task-specific latent structure for ABSA. The performance drop with a fixed root indicates that computing each aspect word's probability of becoming the root is essential for achieving good performance.

Table 6: Ablation study of ACLT on the various datasets. w/o and w indicate without and with, respectively. Fixed Root means that the tree's root is fixed to the first word of the aspect term (Wang et al., 2020).

Case Study
To gain further insight into our induced aspect-centric trees, we use the Chu-Liu/Edmonds algorithm (Edmonds, 1967) to extract the aspect-centric trees, where each tree is represented by the weighted adjacency matrix P produced by the tree inducer. We select two examples from the Twitter and Rest16 datasets whose sentiments are correctly predicted by our ACLT model. Overall, we observe that aspect-centric trees differ from the standard dependency trees in the types of dependencies they create, which tend to be of shorter length.

Figure 4: Two examples from the Twitter and Rest16 datasets illustrating the difference between a dependency parse tree (a) and an aspect-centric tree (b). Red words indicate the aspect words of the sentence.

Specifically, as shown in Figure 4 top (a), the root of the dependency parse tree is the word "looks", which is inconsistent with the aspect words "google" and "wave", and the key opinion word "interesting" requires three-hop and two-hop interactions to establish a connection with the two aspect words, respectively. In contrast, as shown in Figure 4 top (b), our aspect-centric tree is rooted at the aspect word "wave". In addition, we observe that the opinion word and the aspect words can be connected by two-hop and one-hop interactions in our aspect-centric tree, which is more effective than the number of interaction hops needed in the dependency parse tree.
We have similar observations for the second case. As illustrated in Figure 4 bottom (a), the distances between the aspect words "lunch" and "menu" and the critical opinion word "awesome" are four hops and three hops, respectively, in the parse tree. In contrast, Figure 4 bottom (b) shows that in the aspect-centric tree extracted by our model, the distances between the aspect and opinion words are one hop and two hops, which is closer than in the standard dependency parse tree.
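The tree-extraction step mentioned at the beginning of this section can be approximated with networkx's implementation of Edmonds' algorithm; the sketch below ignores the root marginals for simplicity and is only an illustration, not the procedure used to produce Figure 4.

```python
import networkx as nx

def extract_tree(P, words):
    """P: n x n matrix of edge marginals P[i][j] for the edge i -> j."""
    g = nx.DiGraph()
    n = len(words)
    for i in range(n):
        for j in range(n):
            if i != j:
                g.add_edge(i, j, weight=float(P[i][j]))
    # Maximum spanning arborescence = highest-scoring single dependency tree.
    tree = nx.maximum_spanning_arborescence(g)
    return [(words[u], words[v]) for u, v in tree.edges()]
```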

Related Work
Aspect-based sentiment analysis. Early efforts on aspect-based sentiment analysis focused on predicting polarity by employing an attention mechanism (Bahdanau et al., 2015) to model interactions between aspect and context words (Wang et al., 2016; Chen et al., 2017; Liu and Zhang, 2017; Li et al., 2018). More recently, pre-trained neural language models such as BERT (Devlin et al., 2019) have enabled ABSA to achieve promising results. For example, Sun et al. (2019a) manually constructed auxiliary sentences from the aspect word to convert ABSA into a sentence-pair classification task. Huang and Carley (2019) propagated opinion features from syntactic neighborhood words to the aspect words in a BERT-based model. Another line of work in ABSA focused on leveraging explicit dependency parse trees to model the relationships between context and aspect words. Zhang et al. (2019) and Sun et al. (2019b) used GCNs to integrate dependency tree information, capturing structural and contextual information simultaneously for aspect-based sentiment analysis. Wang et al. (2020) greedily reshaped the dependency parse trees using manual rules to obtain task-specific syntactic structures.
Latent variable induction. Latent variable models (Maillard et al., 2017; Kim et al., 2017; Niculae et al., 2018; Mensch and Blondel, 2018; Liu and Lapata, 2018; Zou and Lu, 2019) have gained much popularity for building Natural Language Processing (NLP) pipelines and discovering task-specific linguistic structures (Kim et al., 2018; Martins et al., 2019). The crucial obstacle in designing structured latent variable models is that they typically involve computing an "argmax" (i.e., searching for the highest-scoring discrete structure, such as a parse tree) in the middle of a computation graph. End-to-end approaches replace the "argmax" with a continuous relaxation for which the exact gradient can be computed and back-propagated normally. For example, Nan et al. (2020) and Guo et al. (2020) used marginal inference to construct latent structures to improve information aggregation in relation extraction. More in line with our work, Chen et al. (2020) constructed task-specific structures by developing a gate mechanism to dynamically combine the parse tree information and the HardKuma structure. Our work differs from this prior work in three main aspects. First, we construct the aspect-specific tree for inference without relying on an external parser. Second, we facilitate the interactions between target and opinion words by introducing explicit supervision that adaptively adjusts the aspect to be the root in an end-to-end fashion. Third, we compute each aspect word's probability of becoming the root, which enables our model to reduce the search space of root inference for MTT during training.

Conclusion and Future Work
In this paper, we propose Aspect-Centric Latent Trees (ACLT), which are specifically tailored for the ABSA task, to link up aspects with opinion words in an end-to-end fashion. Experiments on five benchmark datasets show the effectiveness of our model. The qualitative and quantitative analyses illustrate that our model is able to increase the probability of aspect words becoming the root of the sentence by imposing root constraints. Moreover, a thorough analysis demonstrates that our model shortens the average distance between aspects and opinions by at least 19% on the SemEval Restaurant14 dataset. To the best of our knowledge, we are the first to link up aspects with opinions through a specifically designed latent tree that imposes root constraints. One possible future direction is to apply the proposed approach to other sentiment analysis tasks, such as aspect triplet extraction (Xu et al., 2020b).