Attention-based Relational Graph Convolutional Network for Target-Oriented Opinion Words Extraction

Target-oriented opinion words extraction (TOWE) is a subtask of aspect-based sentiment analysis (ABSA). It aims to extract the corresponding opinion words for a given opinion target in a review sentence. Intuitively, the relation between an opinion target and an opinion word mostly relies on syntax. In this study, we design a directed syntactic dependency graph based on a dependency tree to establish a path from the target to candidate opinions. Subsequently, we propose a novel attention-based relational graph convolutional neural network (ARGCN) to exploit syntactic information over dependency graphs. Moreover, to explicitly extract the corresponding opinion words for the given opinion target, we effectively encode target information in our model with a target-aware representation. Empirical results demonstrate that our model significantly outperforms all existing models on four benchmark datasets. Extensive analysis also demonstrates the effectiveness of each component of our model. Our code is available at https://github.com/wcwowwwww/towe-eacl.


Introduction
Target-oriented opinion words extraction (TOWE) (Fan et al., 2019) is a subtask of aspect-based sentiment analysis (ABSA) (Hu and Liu, 2004; Pontiki et al., 2016). Given a review and an opinion target in the sentence, the objective of TOWE is to extract the corresponding opinion words describing or evaluating the opinion targets from the review. Opinion targets are the words or phrases representing features or entities toward which users express their attitudes, whereas opinion words referring to those terms are used to express attitudes or opinions explicitly.

* These authors contributed equally to this work; the order is random. † Corresponding author.
Figure 1: Examples of the TOWE task. The words highlighted in orange represent the given opinion targets, whereas the words in blue represent the corresponding opinion words.

Figure 1 shows two examples of TOWE. In the review "The food is tasty and portion sizes are appropriate.", the terms "food" and "portion sizes" are two given opinion targets. TOWE needs to extract the word "tasty" as the opinion word for the opinion target "food" and the opinion word "appropriate" for the opinion target "portion sizes".
Therefore, the first challenge is to effectively introduce the opinion target information into our model. Fan et al. (2019) designed the IO-BiLSTM to encode the context before and after the given opinion targets separately, thereby representing the position of the opinion targets. Wu et al. (2020) introduced position embeddings based on the relative distance to the opinion targets. However, both studies only introduce part of the target information (the position of the targets). In this paper, we introduce the target-aware representation to fully exploit opinion target information in a concise way, which is especially important when our models are used on real-world reviews.
Because TOWE can be viewed as a syntactic task, a natural solution is analysing the relationship between opinion targets and opinion words by dependency parsing. Recently, owing to the great success of graph convolutional networks (GCNs) in various fields (Kipf and Welling, 2016; Chen et al., 2018; Marcheggiani et al., 2018), a few researchers have attempted to encode syntactic dependency information with GCNs to build a robust dependency encoder. For example, GCNs over the dependency tree have been exploited to perform semantic role labelling (Marcheggiani and Titov, 2017) and named entity recognition (Cetoli et al., 2017). In addition, several studies explore GCNs over a dependency graph to complete the ABSA task (Liang et al., 2020, among others).
However, it is worth mentioning that TOWE is defined as a sequence labelling task, and the manner in which GCNs are applied to TOWE effectively is yet to be explored. In this study, we first construct a directed graph based on a dependency tree to be more suitable for TOWE. Subsequently, we propose ARGCN, which can enhance our model by encoding syntactic information. ARGCN can be seen as extending the Relational Graph Convolutional Networks (R-GCNs) (Schlichtkrull et al., 2018) with the distance-aware attention mechanism. ARGCN can consider the semantic relevance and syntactic relevance between words simultaneously when it propagates information. In addition, sequential information is extremely important for sequence-labelling tasks. Therefore, after using multi-layer graph convolutions to encode syntactic information, we feed the syntactic representation to a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to capture the sequential information.
Experiments on four benchmark datasets demonstrate that our base model, Target-BiLSTM, which is a BiLSTM with target-aware inputs, performs similarly to or better than the state-of-the-art model, even though we do not introduce extra external knowledge. In addition, our full model ARGCN further improves the performance and significantly outperforms all of the existing models on the four benchmark datasets. Furthermore, extensive experiments demonstrate the effectiveness and necessity of all components in our full model. To the best of our knowledge, this is the first work applying GCNs to the TOWE task.
The contributions of this paper can be summarized as follows.
• We propose target-aware representation to effectively introduce opinion target information. An empirical study shows it is significant and extensible for the TOWE task.
• We exploit syntactic dependency graphs of sentences and establish the relations between opinion targets and the corresponding opinion words.
• We propose a novel attention-based relational graph convolutional network, ARGCN, an extension of R-GCNs suited to encode syntactic dependency information.
• We propose an ARGCN-based TOWE model. Experimental results show that it significantly outperforms the state-of-the-art model on all datasets of the TOWE task.

Related Work
As subtasks of ABSA, a series of early studies focused on opinion target extraction, including unsupervised/semi-supervised methods (Qiu et al., 2011; Liu et al., 2012, 2013) and supervised methods (Jakob and Gurevych, 2010; Li et al., 2010). Some recent studies extracted opinion targets and opinion words jointly in a uniform framework and achieved promising results (Wang et al., 2016; Li and Lam, 2017). However, they did not extract the corresponding relation between opinion targets and opinion words. Moreover, studies on extracting paired opinion relations are rare (Hu and Liu, 2004; Zhuang et al., 2006). Because this relation is important for downstream sentiment analysis and real-world applications, Fan et al. (2019) proposed a new subtask of ABSA, target-oriented opinion words extraction, aiming to extract the corresponding opinion words for the given opinion targets in a review. They released four benchmark datasets for evaluation, designed a target-fused model, and achieved excellent performance. Wu et al. (2020) adopted transfer learning to transfer latent opinion information from a sentiment analysis model to the TOWE model. In this study, we also focus on the TOWE task.
Since Kipf and Welling (2016) proposed their GCN with some simplifications on ChebNet (Defferrard et al., 2016), a variety of graph convolutional networks have appeared (Veličković et al., 2018; Schlichtkrull et al., 2018; Busbridge et al., 2019) and achieved great success in many fields, including computer vision (Chen et al., 2018; Garcia and Estrach, 2018; Wang et al., 2019), natural language processing (Marcheggiani and Titov, 2017; Marcheggiani et al., 2018; Yao et al., 2019), and even chemistry (De Cao and Kipf, 2018). One of the reasons why GCNs work well in several fields is that they can naturally process graph-structured data to greatly exploit the latent information behind the graph structure. Therefore, they have proven to be efficient, especially with only a small amount of data (Kipf and Welling, 2016; Garcia and Estrach, 2018).

Figure 2: An example of the syntactic dependency graph based on the dependency tree generated by the spaCy* dependency parser. "service" is the given opinion target. "quick" and "great" are two corresponding opinion words. The left figure shows the reshaped dependency graph before adding extra edges between words that have no dependency relation but are closer than a threshold D. In the right figure, each edge has two features: dependency relation and distance. As an example, we show all edges coming towards the word "quick" in the final graph.
Recently, a few studies have tried applying GCNs over the dependency graph to complete ABSA tasks. The CDT model performs GCN over a dependency tree together with contextual representations extracted by a BiLSTM. Liang et al. (2020) introduced dependency relational embeddings into the GCN (Kipf and Welling, 2016) to complete ABSA with their DREGCN. In particular, R-GAT-ABSA is a newly proposed architecture for the ABSA task; it builds on the GAT (Veličković et al., 2018) and extends it by introducing relational embeddings for calculating relational attention.

Task Formalization
TOWE aims to extract corresponding opinion words based on the given opinion targets. Formally, we have a review sentence s = {w_1, w_2, ..., w_n} containing n words. We adopt the BIO tagging scheme (Ramshaw and Marcus, 1999): for a given opinion target, each word w_i in the sentence is assigned a tag y_i ∈ {B, I, O}, where B, I, and O denote the beginning of, the inside of, and outside an opinion word span, respectively.
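The BIO scheme above can be illustrated with a minimal sketch. The helper name and span representation below are illustrative, not the authors' code:

```python
# Hedged sketch: BIO tagging for TOWE, following the formalization above.

def bio_tags(sentence_tokens, opinion_span):
    """Assign B/I/O tags: B at the first opinion word, I inside the span,
    O elsewhere. `opinion_span` is a (start, end) token index pair,
    end exclusive."""
    start, end = opinion_span
    tags = []
    for i, _ in enumerate(sentence_tokens):
        if i == start:
            tags.append("B")
        elif start < i < end:
            tags.append("I")
        else:
            tags.append("O")
    return tags

tokens = ["The", "food", "is", "tasty", "and", "portion", "sizes",
          "are", "appropriate", "."]
# For target "food", the gold opinion word is "tasty" (span [3, 4)).
print(bio_tags(tokens, (3, 4)))
```

Note that the tag sequence depends on which target is given: the same sentence yields a different labelling for the target "portion sizes".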

Target-Aware Representation
As described above, we should extract the corresponding opinion words based on the given opinion targets. Therefore, our model should be aware of which words are the opinion targets and identify the corresponding opinion words. All previous studies only encode the position information of targets. In contrast, we directly introduce category embeddings with respect to the target tag of words to fully introduce target information into the TOWE model. Figure 3 shows an overview of our model.

* https://spacy.io/
We denote the category embedding table as T_t ∈ R^{3×d_t}, where d_t is the dimension of the category embedding. With it, we obtain the target embedding e^t_i of each word and form the target embedding matrix of a sentence as E_t = [e^t_1; e^t_2; ...; e^t_n]. To retain the target information clearly when feeding to the next module, we concatenate it with the word representation:

e_i = [e^w_i, e^t_i],

where e^w_i is the word representation of word i and [,] denotes the concatenation operation. Thus, our model can understand which words are opinion targets. The target embedding table is jointly optimized during training so that our model can learn proper target embeddings specifically for the TOWE task.

For simplicity, we denote our target-aware representation as E = [e_1; e_2; ...; e_n] and feed it to the following modules.
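The lookup-and-concatenate step can be sketched with toy shapes. The dimensions, random vectors, and variable names below are illustrative, not the paper's:

```python
import numpy as np

# Illustrative sketch of the target-aware representation: each word's
# embedding is concatenated with a category embedding selected by its
# target tag (O=0, B=1, I=2). Shapes are toy values.
rng = np.random.default_rng(0)
n, d_w, d_t = 5, 8, 4                        # 5 words, toy dimensions
word_repr = rng.standard_normal((n, d_w))    # e^w_i, e.g. GloVe vectors
T_t = rng.standard_normal((3, d_t))          # category embedding table
target_tags = np.array([0, 1, 2, 0, 0])      # a two-word target: B, I

E_t = T_t[target_tags]                       # look up e^t_i for each word
E = np.concatenate([word_repr, E_t], axis=1)  # e_i = [e^w_i, e^t_i]
print(E.shape)                               # (5, 12)
```

In training, T_t would be a learnable parameter updated jointly with the rest of the model rather than a fixed random table.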

Syntactic Dependency Graph
In this section, we describe in detail how we build a syntactic dependency graph suitable for the TOWE task. For a given sentence s = {w_1, w_2, ..., w_n}, dependency parsing yields a dependency tree; Figure 2 shows the original dependency tree of an example sentence. Next, we add edges between word pairs whose relative distance in the sentence is smaller than a given threshold D.
Figure 3: Overview of ARGCN. We generate the target-aware representation as the input node representation. Then, L layers of ARGCN are applied over our syntactic dependency graph. After encoding, we capture sequential information with a BiLSTM. Finally, we perform prediction with a softmax classifier. Owing to space limits, we omit all edges except those with dependency relations.

We formally define the directed graph as G = (V, E, R, P). Here, V = {v_1, v_2, ..., v_n} is the set of nodes, each corresponding to a word, and E is the set of directed edges, where e_ij denotes an edge from v_i to v_j. R = {r_ij} is the set of edge relational types, where r_ij is the corresponding dependency relation from v_i to v_j. If there is no dependency relation between a pair v_i and v_j whose relative distance is smaller than D, we add an edge between them and assign it a special edge type, other. P = {p_ij} represents the set of relative positions, where p_ij is the relative position from v_i to v_j in the sentence. Note that e_ij indicates that v_i is a neighbour of v_j.
To ensure that the target information can correctly propagate to the latent opinion words, we redirect specific dependency relations linking to the target words. In dependency trees, when the edge type is nsubj or dobj, the edge points from the predicate to the subject or object, so the information of the subject or object cannot flow through the predicate. Thus, we reverse a dependency edge when it links target words and its type is nsubj (nominal subject) or dobj (direct object). In addition, we remove the root relation because it is a self-loop, which is not helpful for our model.
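The reshaping steps above can be sketched in a few lines. The input arc format, helper name, and the strict "< D" reading of the distance threshold are assumptions for illustration; a real pipeline would read the arcs from a dependency parser such as spaCy:

```python
# A minimal sketch of the graph reshaping: drop root, reverse nsubj/dobj
# edges touching a target word, and add "other" edges between close pairs.

def build_graph(n, arcs, target_idx, D=3):
    """arcs: list of (head, child, relation) with 0-based token indices.
    Returns directed edges {(i, j): relation} meaning v_i -> v_j."""
    edges = {}
    for head, child, rel in arcs:
        if rel == "root":
            continue                      # remove the root self-loop
        i, j = head, child
        # Reverse nsubj/dobj edges that touch a target word so target
        # information can flow through the predicate.
        if rel in ("nsubj", "dobj") and child in target_idx:
            i, j = child, head
        edges[(i, j)] = rel
    # Connect close word pairs that have no dependency relation.
    for i in range(n):
        for j in range(n):
            if i != j and abs(i - j) < D and (i, j) not in edges:
                edges[(i, j)] = "other"
    return edges

# "The service is quick" with target "service" (index 1); arcs are toy.
arcs = [(1, 0, "det"), (3, 1, "nsubj"), (3, 2, "aux"), (3, 3, "root")]
g = build_graph(4, arcs, target_idx={1})
print(g[(1, 3)])   # reversed nsubj edge: service -> quick
```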

Attention-Based Relational Graph Convolutional Network (ARGCN)
To encode the well-designed syntactic dependency graph, we start from R-GCNs (Schlichtkrull et al., 2018) and extend them with a distance-aware attention mechanism, yielding our attention-based relational graph convolutional network (ARGCN). The main purpose of our model is to consider the semantic and syntactic relevance between words simultaneously. R-GCNs (Schlichtkrull et al., 2018) update the hidden states of nodes by aggregating the node representations of their neighbours according to the edge type of their connections:

h'_i = σ( Σ_{r∈R} Σ_{j∈N^r_i} (1/c_{i,r}) W_r h_j + W_1 h_i ),

where R denotes the set of relations, h_i is the input representation of node v_i, h'_i is the output representation of node v_i, N^r_i is the set of neighbours of v_i under relation r ∈ R, W_r and W_1 are trainable parameters, and c_{i,r} is a problem-specific normalization constant, usually the number of neighbours of v_i under relation r. Moreover, σ is an element-wise activation function.
Each relation r corresponds to a relation-specific matrix W_r. To reduce the number of parameters, we perform a basis decomposition (Schlichtkrull et al., 2018). In particular, we set the number of bases to one:

W_r = b_r W_0,

where b_r is a coefficient depending on r. In this way, every W_r shares W_0 as the basis, so the number of parameters is greatly reduced; meanwhile, b_r captures the influence of the relation type. In ARGCN, we introduce a distance-aware attention mechanism to enhance the power of R-GCN:

h'_i = σ( Σ_{j∈N_i} (b_{r_ij} + c β_ij) W_0 h_j + W_1 h_i ),

where β_ij is the attention coefficient between v_i and v_j, and c is a trainable vector that adjusts the influence of the relation and the attention coefficient. σ is an activation function; we use ReLU in the ARGCN layers. We assume that the attention coefficient between two nodes depends on the node features and their relative position in the sentence. First, we obtain the query and key by projecting the node features h_i and h_j with the same projection matrix W_1. Next, we obtain the relative positional encoding p from a sinusoid encoding matrix as in Vaswani et al. (2017).
Then, we use a shared attention mechanism over the query, key, and relative positional encoding:

o_ij = a^T [W_1 h_i, W_1 h_j, p],

where a is a trainable vector mapping the concatenated representation to a scalar. Finally, we normalize o_ij across all neighbours of v_i using the softmax function:

β_ij = exp(o_ij) / Σ_{k∈N_i} exp(o_ik),

where β_ij indicates the importance of v_j toward v_i with respect to the node representations and the relative position.
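The scoring-and-normalization step can be sketched numerically. The toy dimensions and random parameters below are illustrative, and the absence of an extra nonlinearity before the softmax is an assumption:

```python
import numpy as np

# Toy sketch of the distance-aware attention coefficients: a shared
# projection W1 for query/key, a sinusoidal relative positional encoding,
# and a trainable vector a mapping the concatenation to a scalar.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sin_encoding(offset, d):
    """Sinusoid encoding of a relative position, Transformer-style."""
    i = np.arange(d // 2)
    angle = offset / (10000 ** (2 * i / d))
    return np.concatenate([np.sin(angle), np.cos(angle)])

rng = np.random.default_rng(1)
d = 6
h = rng.standard_normal((4, d))       # node features h_i
W1 = rng.standard_normal((d, d))      # shared projection for query/key
a = rng.standard_normal(3 * d)        # maps [query, key, p] to a scalar

i = 3                                 # score the neighbours of v_i
neighbours = [0, 1, 2]
scores = [a @ np.concatenate([W1 @ h[i], W1 @ h[j], sin_encoding(j - i, d)])
          for j in neighbours]
beta = softmax(np.array(scores))      # attention over the neighbours of v_i
print(beta)
```

The softmax guarantees that the coefficients over a node's neighbourhood are positive and sum to one, so they act as a convex mixture of neighbour messages.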
In addition, extending our mechanism to multi-head attention helps to stabilize the learning process and enhances the performance. Specifically, K independent attention mechanisms execute the transformation above, producing head representations head^k_i with parameters W^k_0 ∈ R^{d_j×d_k}, where d_j is the input dimension and d_k is the dimension of each head. The outputs of the K heads are then concatenated and projected:

h'_i = [head^1_i, head^2_i, ..., head^K_i] W_d,

where W_d ∈ R^{Kd_k×d_{j+1}}. Unlike Vaswani et al. (2017), who used d_{j+1}/K as the dimension of each head, we set d_k = d_{j+1}, which leads to slight performance gains in preliminary experiments.
We find that aspect and opinion terms often have direct or indirect relations in the graph based on the syntactic dependency tree. For example, Figure 2 shows that the relation between "service" and "quick" is direct, whereas that between "service" and "great" is indirect. To capture these direct and indirect relations, we stack L layers of ARGCN, because L successive layers propagate information across up to L-th order neighbours.
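The need for stacked layers can be illustrated with a tiny reachability check. The adjacency structure and helper below are illustrative, not part of the model:

```python
# One graph-convolution layer only mixes information from 1-hop
# neighbours, so an indirect target-opinion relation at distance L
# needs L stacked layers. Pure-Python sketch.

def receptive_set(adj, start, layers):
    """Nodes whose information can reach `start` after `layers` hops."""
    reached = {start}
    for _ in range(layers):
        reached |= {j for i in reached for j in adj.get(i, [])}
    return reached

# service - quick - great: "great" is a 2-hop neighbour of "service"
adj = {"service": ["quick"], "quick": ["service", "great"],
       "great": ["quick"]}
print("great" in receptive_set(adj, "service", 1))   # False: 1 layer
print("great" in receptive_set(adj, "service", 2))   # True: 2 layers
```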
Moreover, as the network deepens, ARGCN tends to over-smooth. To alleviate this problem, we add a residual connection to each ARGCN layer:

h^{l+1}_i = h^l_i + h̃^l_i,

where h^l_i is the input of v_i in the l-th ARGCN layer and h̃^l_i is its output; thus, h^{l+1}_i is the input of the (l+1)-th layer.

Sequential Layer
A shortcoming of ARGCN is that it cannot encode sequential information, which is extremely important for TOWE because it is defined as a sequence-labelling task. Intuitively, the prediction for a word relies on the labels of the words before and after it. Therefore, the performance of the model will not be satisfactory without capturing sequential information.
Consequently, we feed the syntactic representation extracted by the L layers of ARGCN to a BiLSTM to capture the sequential information:

ĥ_i = [LSTM_f(h_i), LSTM_b(h_i)],

where ĥ_i is the concatenation of the forward and backward output vectors at time-step i. Many other studies that used GCNs over the dependency graph (e.g., Marcheggiani and Titov, 2017) applied an LSTM to encode the sequential information first, fed the obtained contextual representation to GCNs, and used the result for prediction. We also attempted to first encode the sequential relationship with an LSTM and then feed it to ARGCN to predict the labels of words; however, the performance was impaired. We believe the reason is that the sequential relationship is essential for sequence-labelling tasks: if it is collected before encoding the dependency information, it is confounded during aggregation, leading to poor performance.

Model Training
After collecting the sequential information, we map the representations to the output space with a fully connected layer and calculate the probability of the labels of words with the softmax function:

ŷ_i = softmax(W_fc ĥ_i + b_fc),

where W_fc and b_fc are the trainable parameters of the fully connected layer. The cross-entropy loss

L = − Σ_i Σ_{c∈{0,1,2}} y_{i,c} log ŷ_{i,c}

is then minimized during training. Here, the opinion word tags {O, B, I} are numeralized as labels {0, 1, 2}, respectively, and y_i denotes the gold label.
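The prediction layer and loss can be sketched numerically. The hidden size, random weights, and gold labels below are toy values for illustration:

```python
import numpy as np

# Numerical sketch of the prediction layer: a fully connected map to the
# 3-dimensional tag space {O, B, I}, a row-wise softmax, and the
# cross-entropy loss over the gold labels.
rng = np.random.default_rng(2)
n, d = 4, 5                            # 4 words, toy hidden size
H = rng.standard_normal((n, d))        # BiLSTM outputs, one row per word
W_fc = rng.standard_normal((d, 3))
b_fc = np.zeros(3)

logits = H @ W_fc + b_fc
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

y = np.array([0, 1, 2, 0])             # gold labels: O, B, I, O
loss = -np.log(probs[np.arange(n), y]).mean()
print(round(float(loss), 4))
```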

Datasets and Metrics
Following previous work (Fan et al., 2019; Wu et al., 2020), we evaluate the models on four benchmark datasets: 14res, 14lap, 15res and 16res. The datasets 14res and 14lap are annotated from SemEval Challenge 2014 task 4 (Pontiki et al., 2014), while 15res and 16res are annotated from SemEval Challenge 2015 task 12 (Pontiki et al., 2015) and SemEval Challenge 2016 task 5 (Pontiki et al., 2016), respectively. The suffixes "res" and "lap" indicate that they are collected from restaurant reviews and laptop reviews, respectively. The original SemEval challenge datasets are very popular for ABSA subtasks but only contain annotations of aspect terms. Therefore, Fan et al. (2019) extended the annotation with the corresponding opinion words for the given opinion targets, ignoring cases without explicit opinion words. Detailed statistics are shown in Table 1. For evaluation, we adopt the commonly used metrics: precision, recall, and F1-score. An extraction is considered correct only when the opinion words from the beginning to the end are all predicted exactly as the ground truth.
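The exact-match criterion above can be sketched as follows. Representing predictions and gold annotations as (start, end) span pairs is an assumption for illustration:

```python
# Sketch of the exact-match evaluation: an extracted opinion span counts
# as correct only if it matches a gold span exactly, end to end.

def prf(pred_spans, gold_spans):
    correct = len(set(pred_spans) & set(gold_spans))
    p = correct / len(pred_spans) if pred_spans else 0.0
    r = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [(3, 4), (8, 9)]               # "tasty", "appropriate"
pred = [(3, 4), (5, 7)]               # one exact match, one wrong span
print(prf(pred, gold))                # (0.5, 0.5, 0.5)
```

Note that a span overlapping a gold span but not matching it exactly, such as (5, 7) above, contributes nothing to precision or recall under this criterion.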

Experimental Settings
For ARGCN and Target-BiLSTM, we adopted 300-dimension GloVe word embeddings (Pennington et al., 2014) as our word representations. For ARGCN-bert and Target-BiLSTM-bert, we adopted the last hidden states of the pre-trained BERT (Devlin et al., 2018) as word representations and fine-tuned it jointly. Inspired by Xu et al. (2018), we fine-tuned the GloVe vectors during training to obtain domain-specific representations. The dimension of the target embedding was 3 for our base model and 100 for the GCN-based models. We implemented our models with PyTorch (Paszke et al., 2019). We used 10 layers of ARGCN with 128 channels and 8 attention heads, and set the hidden size of the BiLSTM to 128.
We used spaCy (Honnibal and Johnson, 2015) as our dependency parser. To improve the generalization of ARGCN, dropout (Hinton et al., 2012) was applied after the activation with a probability of 0.5. The threshold of relative distance was set to 3. All parameters were optimized with the Adam optimizer (Kingma and Ba, 2014), with an initial learning rate of 1 × 10^-3. We randomly split 20% of the training set as the validation set to tune the hyperparameters and apply early stopping. Subsequently, we tested our models and averaged the results over 5 runs.

Compared Methods
We compare our model with several methods which can be categorized into three groups.
• Early solutions: Some early solutions, including rule-based methods and simple deep learning methods, form the first group. Inspired by Hu and Liu (2004) and Zhuang et al. (2006), Fan et al. (2019) proposed the Distance-rule and Dependency-rule as two representative rule-based methods; simple deep learning baselines following Liu et al. (2015) and Tang et al. are also included.

• TOWE models: IOG is the first TOWE model, proposed by Fan et al. (2019). It adopts six different positional and directional LSTMs to extract the opinion words. PE-BiLSTM is the base model of LOTN (Wu et al., 2020); it introduces target information by position embedding and extracts opinion words with a BiLSTM. Wu et al. (2020) further proposed an effective transfer learning method, LOTN, to identify latent opinions from a sentiment analysis model. Integrated with PE-BiLSTM, it achieved the state-of-the-art performance in TOWE.
• AOPE model: The aspect-opinion pair extraction (AOPE) task, which aims at extracting aspects and opinion expressions in pairs, is similar to TOWE. SDRN, the state-of-the-art AOPE model, mainly consists of an opinion entity extraction unit, a relation detection unit, and a synchronization unit; the synchronization unit enhances the mutual benefit between the opinion entity extraction unit and the relation detection unit. As a baseline, it extracts the targets and opinions and then collects the corresponding target-opinion pairs based on the predicted relations to complete the TOWE task.

• Base model: To show the effectiveness of the target-aware representation, we propose our base model, Target-BiLSTM. A BiLSTM receives the target-aware representation as input and predicts after a fully-connected layer and a softmax layer.

Table 2: Main experimental results (%). Comparison between our proposed models and baselines on four benchmark datasets. P, R and F1 are precision, recall and F1-score, respectively. A result in bold indicates that the model significantly outperforms all of the baselines above (p < 0.01). The results are averaged scores of 10 runs. The results of the baselines are copied from previous work (Wu et al., 2020). Note that the results of SDRN are obtained by training and evaluating on the TOWE datasets with their released code.

Main Results

Table 2 shows the main experimental results of the baselines and our models on the four benchmark datasets. We observe that, under the same condition of using GloVe word representations, our base model Target-BiLSTM outperforms PE-BiLSTM with large improvements ranging from 3.31% to 6.18% in F1-score. Note that PE-BiLSTM uses position embeddings, whereas Target-BiLSTM introduces target embeddings, which evidences the effectiveness of our target-aware representation. Moreover, it not only performs comparably with LOTN on 14res and 14lap but also significantly outperforms LOTN on 15res and 16res, although LOTN introduces a large-scale sentiment analysis dataset for transfer learning. In contrast, our base model requires no additional resources except the pre-trained word embeddings. Besides, our full model ARGCN outperforms Target-BiLSTM by a large margin on the four datasets. Therefore, we conclude that the syntactic information ARGCN encodes over the dependency graph is helpful for TOWE. Furthermore, ARGCN significantly outperforms LOTN, with large improvements in F1-score ranging from 1.54% to 3.43%, which proves its effectiveness on the TOWE task. With a pre-trained representation model, BERT, Target-BiLSTM achieves state-of-the-art performance by significant margins, demonstrating the power of pre-trained language models in this task. In addition, when we apply BERT as the representation layer for ARGCN, the performance improves further, which demonstrates the effectiveness of capturing syntactic information for sentiment analysis.

Table 3: Ablation study results (%). LSTM-ARGCN denotes the model that places the BiLSTM before ARGCN. R-GCN (#basis=1) denotes R-GCNs with basis decomposition and one basis. ARGCN (original) denotes ARGCN over the original dependency tree.

Visualization on Target Embedding
We also designed an experiment to evaluate whether our model learns suitable target embeddings during training and thus benefits from the target-aware representation. Intuitively, a good target embedding should have the following property: the representation of tag "O" is significantly different from those of tags "B" and "I", whereas the representations of "B" and "I" are similar. Indeed, after training, as shown in Figure 4, the cosine similarity between "B" and "I" is close to 1 (0.94), whereas the similarity between "B" and "O" is below 0, as is that between "I" and "O". Therefore, we conclude that our model learns suitable target embeddings during training, which confirms the effectiveness and interpretability of our target-aware representation.
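The property being checked can be illustrated with hypothetical embeddings. The vectors below are toy values chosen to exhibit the pattern, not the embeddings learned by the model:

```python
import numpy as np

# Illustrative check of the desired property: "B" and "I" embeddings are
# close in cosine similarity while both point away from "O".

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

T = {"O": np.array([1.0, 0.2, -0.5]),
     "B": np.array([-0.8, 1.0, 0.9]),
     "I": np.array([-0.7, 0.9, 1.0])}   # toy vectors, not the paper's

print(cos(T["B"], T["I"]) > 0.9)   # B and I are similar
print(cos(T["B"], T["O"]) < 0)     # B and O point apart
```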

Ablation Study
To evaluate the influence of each component of ARGCN, we conducted an ablation study. As shown in Table 3, we observe performance drops on the four datasets when replacing ARGCN layers with R-GCN layers, which verifies the effectiveness of the distance-aware attention mechanism in ARGCN. We also find that ARGCN outperforms GAT (Veličković et al., 2018), which shows that specifying the dependency relational type is crucial for applying the dependency graph to the TOWE task. Moreover, we compared ARGCN with RGAT (Busbridge et al., 2019), which is similar to our model, and observe that ARGCN performs much better. These results indicate that, for TOWE, the approach ARGCN takes to encode syntactic information is more suitable than that used in RGAT.
In addition, to confirm that the syntactic graph we constructed is effective and reasonable, we compared ARGCN over the original dependency tree with ARGCN over our reshaped graph. The latter outperforms the former, which proves the effectiveness of our reshaped syntactic dependency graph.

Model Analysis
We further analyzed the effects of the number of ARGCN layers, the number of attention heads, and the threshold of relative distance by varying each hyper-parameter while keeping the others fixed to the experimental settings mentioned above.

Because ARGCN involves an L-layer GCN, we investigated the effect of the number of layers L on the final performance. We varied L over {2, 4, 6, 8, 10, 12} and report the corresponding F1-scores of ARGCN on the 14res dataset. The results are illustrated in Figure 5: ARGCN achieves the best performance at L = 10. In this sense, our model benefits from an increasing number of layers; however, with more than 10 layers, the model tends to over-smooth, and the performance drops dramatically.
For the number of attention heads, we varied K over {1, 2, 4, 6, 8, 10, 12, 14} and report the corresponding F1-scores of ARGCN on the 14res dataset. The results are illustrated in Figure 6: ARGCN achieves the best performance at K = 8, which justifies the choice in the experimental settings. Comparing K = 1 and K = 8, the model with 8 attention heads performs clearly better than that with a single head, demonstrating the necessity of the multi-head attention mechanism in ARGCN.

Table 4: Evaluation of syntactic information (%). "Syntax only" means these models use only the target embedding without word representations.
For the threshold of relative distance, we performed experiments with thresholds D ranging from 1 to 6 and report the corresponding F1-scores of ARGCN on the 14res dataset. The results are illustrated in Figure 7: D = 3 is the best value for the threshold in ARGCN.

Evaluation of syntactic information
To understand the role of syntactic information in the TOWE task and measure the ability of ARGCN to encode it, we removed the word representations from our models, leaving only the target embedding. We then evaluated the models on the TOWE task with only syntactic and position information. The results are shown in Table 4.
We notice that the GNN models perform better than the Dependency-rule model, which indicates that GNN models can exploit syntactic information well from the dependency graph. Furthermore, our well-designed ARGCN outperforms the other GNN models, including the latest one, RGAT, because ARGCN considers the relative position of words in the sentence and the dependency relation type at the same time when it propagates information.

Conclusions
In this paper, we proposed a target-aware representation to efficiently introduce opinion target information into our TOWE model. Moreover, we proposed ARGCN, which extends R-GCNs with a distance-aware attention mechanism. Because sequential information is essential for a sequence-labelling task, we captured it with a BiLSTM after the ARGCN layers to complete the TOWE task. Empirical results show that our model significantly outperforms all baselines, including the state-of-the-art model, by large margins, which strongly proves its effectiveness. Extensive analysis also demonstrated the effectiveness and necessity of all components of our model. In addition, we found that a GNN, especially a well-designed one such as ARGCN, is suitable for encoding syntactic information. We hope that these findings can be insightful for other researchers in the community.