Dual Graph Convolutional Networks for Aspect-based Sentiment Analysis

Aspect-based sentiment analysis is a fine-grained sentiment classification task. Recently, graph neural networks over dependency trees have been explored to explicitly model connections between aspects and opinion words. However, the improvement is limited due to the inaccuracy of dependency parsing results and the informal expressions and complexity of online reviews. To overcome these challenges, we propose a dual graph convolutional network (DualGCN) model that considers the complementarity of syntax structures and semantic correlations simultaneously. In particular, to alleviate dependency parsing errors, we design a SynGCN module with rich syntactic knowledge. To capture semantic correlations, we design a SemGCN module with a self-attention mechanism. Furthermore, we propose orthogonal and differential regularizers to capture semantic correlations between words precisely by constraining attention scores in the SemGCN module. The orthogonal regularizer encourages the SemGCN to learn semantically correlated words with less overlap for each word. The differential regularizer encourages the SemGCN to learn semantic features that the SynGCN fails to capture. Experimental results on three public datasets show that our DualGCN model outperforms state-of-the-art methods and verify the effectiveness of our model.


Introduction
Sentiment analysis has become a popular topic in natural language processing (Liu, 2012; Li and Hovy, 2017). Aspect-based sentiment analysis (ABSA) is an entity-level oriented, fine-grained sentiment analysis task that aims to determine the sentiment polarities of given aspects in a sentence. In Figure 1, the comment is a restaurant review. The sentiment polarities of the two aspects "price" and "service" are positive and negative, respectively. Thus, ABSA can precisely identify users' attitudes towards a certain aspect, rather than simply assigning a sentiment polarity to a sentence.
The key point in solving the ABSA task is to model the dependency relationship between an aspect and its corresponding opinion expressions. However, a sentence may contain multiple aspects and different opinion expressions. To judge the sentiment of a particular aspect, previous studies (Wang et al., 2016; Tang et al., 2016a; Ma et al., 2017; Chen et al., 2017; Fan et al., 2018; Huang et al., 2018; Gu et al., 2018) have proposed various recurrent neural networks (RNNs) with attention mechanisms to generate aspect-specific sentence representations and have achieved appealing results. However, an inherent defect makes the attention mechanism vulnerable to noise in the sentence. Take Figure 1 as an example; for the aspect "service", the opinion word "reasonable" may receive more attention than the opinion word "poor", even though "reasonable" refers to another aspect, i.e., "price".
More recent efforts (Zhang et al., 2019; Sun et al., 2019b; Huang and Carley, 2019; Zhang and Qian, 2020; Chen et al., 2020; Liang et al., 2020; Wang et al., 2020; Tang et al., 2020) have been devoted to graph convolutional networks (GCNs) and graph attention networks (GATs) over dependency trees, which explicitly exploit the syntactic structure of a sentence. Consider the dependency tree in Figure 1; the syntactic dependency can establish connections between the words in a sentence. For example, a dependency relation exists between the aspect "price" and the opinion word "reasonable". However, two challenges arise when applying syntactic dependency knowledge to the ABSA task: 1) the dependency parsing results are inaccurate, and 2) GCNs over dependency trees do not work as well as expected on datasets that are not sensitive to syntactic dependency, owing to the informal expressions and complexity of online reviews.
In this paper, we propose a novel architecture, the dual graph convolution network (DualGCN), as shown in Figure 2, to solve the aforementioned challenges. For the first challenge, we use the probability matrix of all dependency arcs from a dependency parser to build a syntax-based graph convolutional network (SynGCN). The idea behind this approach is that the probability matrix representing dependencies between words contains rich syntactic information compared with the final discrete output of a dependency parser. For the second, we construct a semantic correlation-based graph convolutional network (SemGCN) by utilizing a self-attention mechanism. The idea behind this approach is that the attention matrix shaped by self-attending, also viewed as an edge-weighted directed graph, can represent semantic correlations between words. Moreover, motivated by the work of DGEDT (Tang et al., 2020), we utilize a BiAffine module to bridge relevant information between the SynGCN and SemGCN modules.
Furthermore, we design two regularizers to enhance our DualGCN model. We observe that the semantically related terms of each word should not overlap. Therefore, we encourage the attention probability distributions over words to be orthogonal. To this end, we incorporate an orthogonal regularizer on the attention probability matrix for the SemGCN module. Moreover, the two representations learned from the SynGCN and SemGCN modules should contain significantly distinct information captured by the syntactic dependency and the semantic correlation. Therefore, we expect that the SemGCN module could learn semantic representations different from syntactic representations. Thus, we propose a differential regularizer between the SynGCN and SemGCN modules.
Our contributions are highlighted as follows:
• We propose a DualGCN model for the ABSA task. Our DualGCN considers both the syntactic structure and the semantic correlation within a given sentence. Specifically, our DualGCN integrates the SynGCN and SemGCN networks through a mutual BiAffine module.
• We propose orthogonal and differential regularizers. The orthogonal regularizer encourages the SemGCN network to learn an orthogonal semantic attention matrix, whereas the differential regularizer encourages the SemGCN network to learn semantic features distinct from the syntactic ones built from the SynGCN network.
• We conduct extensive experiments on the SemEval 2014 and Twitter datasets. The experimental results demonstrate the effectiveness of our DualGCN model. Additionally, the source code and preprocessed datasets used in our work are provided on GitHub.

Related Work
Traditional sentiment analysis tasks are sentence-level or document-level oriented. In contrast, ABSA is an entity-level oriented and more fine-grained task for sentiment analysis. Earlier methods (Titov and McDonald, 2008; Jiang et al., 2011; Kiritchenko et al., 2014; Vo and Zhang, 2015) are usually based on handcrafted features and fail to model the dependency between the given aspect and its context. Recently, various attention-based neural networks have been proposed to implicitly model the semantic relation of an aspect and its context to capture the opinion expression component (Wang et al., 2016; Tang et al., 2016a,b; Ma et al., 2017; Chen et al., 2017; Fan et al., 2018; Huang et al., 2018; Gu et al., 2018; Li et al., 2018a; Tan et al., 2019). For instance, (Wang et al., 2016) proposed attention-based LSTMs for aspect-level sentiment classification. (Tang et al., 2016b) and (Chen et al., 2017) both introduced a hierarchical attention network to identify important sentiment information related to the given aspect. (Fan et al., 2018) exploited a multi-grained attention mechanism to capture the word-level interaction between aspects and their context. (Tan et al., 2019) designed a dual attention network to recognize conflicting opinions. In addition, the pre-trained language model BERT (Devlin et al., 2019) has achieved remarkable performance in many NLP tasks, including ABSA. (Sun et al., 2019a) transformed the ABSA task into a sentence pair classification task by constructing an auxiliary sentence. (Xu et al., 2019) proposed a post-training approach on BERT to enhance the performance of the fine-tuning stage for the ABSA task.
Another trend explicitly leverages syntactic knowledge. This type of knowledge helps to establish connections between the aspects and the other words in a sentence to learn syntax-aware feature representations of aspects. (Dong et al., 2014) proposed a recursive neural network to adaptively propagate the sentiment of words to the aspect along the dependency tree. (He et al., 2018) introduced an attention model that incorporated syntactic information to compute attention weights. (Phan and Ogunbona, 2020) utilized the syntactic relative distance to reduce the impact of irrelevant words.
Following this line, a few works extend the GCN and GAT models by means of a syntactical dependency tree and develop several outstanding models (Zhang et al., 2019;Sun et al., 2019b;Huang and Carley, 2019;Wang et al., 2020;Tang et al., 2020). These works explicitly exploit the syntactic structure information to learn node representations from adjacent nodes. Thus, the dependency tree shortens the distance between the aspects and opinion words of a sentence and alleviates the problem of long-range dependency.
Most recently, several works have explored the idea of combining different types of graphs for the ABSA task. For instance, (Chen et al., 2020) combined a dependency graph and a latent graph to generate the aspect representation. (Zhang and Qian, 2020) observed the characteristics of word co-occurrence in linguistics and designed hierarchical syntactic and lexical graphs. (Liang et al., 2020) constructed aspect-focused and inter-aspect graphs to learn dependency features of the key aspect words and sentiment relations between different aspects.
In this paper, we propose a GCN-based method combining syntactic and semantic features. We use a dependency probability matrix with richer syntactic information and design orthogonal and differential regularizers to enhance the ability to precisely capture semantic associations.

Graph Convolutional Network (GCN)
Motivated by conventional convolutional neural networks (CNNs) and graph embedding, a GCN is an efficient CNN variant that operates directly on graphs (Kipf and Welling, 2017). For graph-structured data, a GCN applies the convolution operation to directly connected nodes to encode local information. Through message passing across multiple GCN layers, each node in a graph can learn more global information. Given a graph with $n$ nodes, the graph can be represented as an adjacency matrix $A \in \mathbb{R}^{n \times n}$. Most previous works (Zhang et al., 2019; Sun et al., 2019b) extend GCN models by encoding dependency trees and incorporating dependency paths between words. They build the adjacency matrix $A$ over the syntactic dependency tree of a sentence. Thus, an element $A_{ij}$ indicates whether the $i$-th node is connected to the $j$-th node: $A_{ij} = 1$ if it is, and $A_{ij} = 0$ otherwise. The adjacency matrix $A$, composed of 0s and 1s, can therefore be regarded as the final discrete output of a dependency parser. Formally, the hidden state representation of the $i$-th node at the $l$-th layer, denoted as $h_i^l$, is updated by the following equation:

$$h_i^l = \sigma\Big(\sum_{j=1}^{n} A_{ij} W^l h_j^{l-1} + b^l\Big)$$

where $W^l$ is a weight matrix, $b^l$ is a bias term, and $\sigma$ is an activation function (e.g., ReLU).
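To make the layer update concrete, the following is a minimal pure-Python sketch of a single GCN layer as described above: each node sums the linearly transformed states of the nodes it is connected to, adds a bias, and applies ReLU. The function name `gcn_layer` and the toy graph are illustrative assumptions, not part of the original work.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def gcn_layer(adj, hidden, weight, bias):
    """One GCN update: h_i = ReLU(sum_j A_ij * (W h_j) + b)."""
    n = len(hidden)
    d_out = len(weight)
    # Apply the shared linear map W to every node state once.
    transformed = [
        [sum(w * x for w, x in zip(row, h_j)) for row in weight]
        for h_j in hidden
    ]
    updated = []
    for i in range(n):
        # Aggregate the transformed neighbours of node i, weighted by A_ij.
        agg = [
            sum(adj[i][j] * transformed[j][k] for j in range(n)) + bias[k]
            for k in range(d_out)
        ]
        updated.append([relu(v) for v in agg])
    return updated


# Toy chain graph 0 - 1 - 2 with self-loops (hypothetical example data).
adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity map, so the output is a plain neighbour sum
bias = [0.0, 0.0]
out = gcn_layer(adj, hidden, weight, bias)
```

With the identity weight matrix, each output row is simply the sum of the node's own state and its neighbours' states, which makes the effect of the adjacency matrix easy to inspect. Replacing the 0/1 entries of `adj` with real-valued weights requires no change to the function.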

Proposed DualGCN
Figure 2 provides an overview of DualGCN. In the ABSA task, a sentence-aspect pair $(s, a)$ is given, where $a = \{a_1, a_2, ..., a_m\}$ is an aspect and is also a sub-sequence of the entire sentence $s = \{w_1, w_2, ..., w_n\}$. We utilize either a BiLSTM or BERT as the sentence encoder to extract hidden contextual representations. For the BiLSTM encoder, we first obtain the word embeddings $x = \{x_1, x_2, ..., x_n\}$ of the sentence $s$ from an embedding lookup table $E \in \mathbb{R}^{|V| \times d_e}$, where $|V|$ is the size of the vocabulary and $d_e$ denotes the dimensionality of word embeddings. Next, the word embeddings of the sentence are fed into a BiLSTM to produce hidden state vectors $H = \{h_1, h_2, ..., h_n\}$, where $h_i \in \mathbb{R}^{2d}$ is the hidden state vector at time step $i$ from the BiLSTM and $d$ is the dimensionality of the hidden state vector output by a unidirectional LSTM.

Figure 2: The overall architecture of DualGCN, which is composed primarily of SynGCN and SemGCN. SynGCN uses the probability matrix generated by the dependency parser, while SemGCN leverages the attention score matrix generated by the self-attention layer. The orthogonal and differential regularizers are designed to further improve the ability to capture semantic correlations. Details of these components are described in the main text.

For the BERT encoder, we construct a sentence-aspect pair "[CLS] sentence [SEP] aspect [SEP]" as input to obtain aspect-aware hidden representations of the sentence. Moreover, in order to match the wordpiece-based representations of BERT with the word-based result of syntactic dependency parsing, we expand the dependencies of a word to all of its subwords. Then, the hidden representations of the sentence are input into the SynGCN and SemGCN modules, respectively. A BiAffine module is then adopted for effective information flow. Finally, we aggregate all the aspect nodes' representations from the SynGCN and SemGCN modules via pooling and concatenation to form the final aspect representation. Next, we elaborate on the details of our proposed DualGCN model.
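The subword expansion mentioned above can be sketched as follows. This is one plausible reading, under the assumption that every word-level entry is copied to all pairs of subwords belonging to the two words; the function name `expand_to_subwords`, the scores, and the subword counts are hypothetical, and the authors' released code may handle this differently.

```python
def expand_to_subwords(word_adj, subword_counts):
    """Copy each word-level entry A[i][j] to every (subword of i, subword of j) pair."""
    # owner[k] is the index of the word that the k-th subword belongs to.
    owner = [w for w, count in enumerate(subword_counts) for _ in range(count)]
    return [[word_adj[wi][wj] for wj in owner] for wi in owner]


# Hypothetical 2-word sentence: word 0 stays whole, word 1 splits into 2 wordpieces.
word_adj = [[1.0, 0.8], [0.2, 1.0]]
subword_counts = [1, 2]
sub_adj = expand_to_subwords(word_adj, subword_counts)
```

The resulting matrix has one row and column per wordpiece, so it can be paired directly with the wordpiece-level hidden states produced by a BERT-style encoder.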

Syntax-based GCN (SynGCN)
The SynGCN module takes the syntactic encoding as input. To encode syntactic information, we utilize the probability matrix of all dependency arcs from a dependency parser. Compared to the final discrete output of a dependency parser, the dependency probability matrix can capture rich structural information by providing all latent syntactic structures. Therefore, the dependency probability matrix is used to alleviate dependency parsing errors. Here, we use the state-of-the-art dependency parsing model LAL-Parser (Mrini et al., 2019). With the syntactic encoding of an adjacency matrix $A^{syn} \in \mathbb{R}^{n \times n}$, the SynGCN module takes the hidden state vectors $H$ from the BiLSTM as initial node representations in the syntactic graph. The syntactic graph representation $H^{syn} = \{h_1^{syn}, h_2^{syn}, ..., h_n^{syn}\}$ is then obtained from the SynGCN module by applying the GCN update with $A^{syn}$ as the adjacency matrix.

Semantic-based GCN (SemGCN)
Instead of utilizing additional syntactic knowledge, as in SynGCN, SemGCN obtains an attention matrix as an adjacency matrix via a self-attention mechanism. On the one hand, self-attention can capture the semantically related terms of each word in a sentence, which is more flexible than the syntactic structure. On the other hand, SemGCN can adapt to online reviews that are not sensitive to syntactic information.

Self-Attention. Self-attention (Vaswani et al., 2017) computes the attention score of each pair of elements in parallel. In our DualGCN, we compute the attention score matrix $A^{sem} \in \mathbb{R}^{n \times n}$ using a self-attention layer. We then take the attention score matrix $A^{sem}$ as the adjacency matrix of our SemGCN module, which can be formulated as:

$$A^{sem} = \mathrm{softmax}\left(\frac{Q W^Q \times (K W^K)^T}{\sqrt{d}}\right)$$

where matrices $Q$ and $K$ are both equal to the graph representations of the previous layer of our SemGCN module, while $W^Q$ and $W^K$ are learnable weight matrices. In addition, $d$ is the dimensionality of the input node feature. Note that we use only one self-attention head to obtain an attention score matrix for a sentence. Similar to the SynGCN module, the SemGCN module obtains the graph representation $H^{sem}$. Additionally, we use the symbols $\{h_{a_1}^{sem}, h_{a_2}^{sem}, ..., h_{a_m}^{sem}\}$ to denote the hidden representations of all aspect nodes.

BiAffine Module. To effectively exchange relevant features between the SynGCN and SemGCN modules, we adopt a mutual BiAffine transformation as a bridge. We formulate the process as follows:

$$H^{syn'} = \mathrm{softmax}\big(H^{syn} W_1 (H^{sem})^T\big) H^{sem}$$

$$H^{sem'} = \mathrm{softmax}\big(H^{sem} W_2 (H^{syn})^T\big) H^{syn}$$

where $W_1$ and $W_2$ are trainable parameters.

Finally, we apply average pooling and concatenation operations on the aspect nodes of the SynGCN and SemGCN modules. Thus, we obtain the final feature representation for the ABSA task, i.e.,

$$h_a^{syn} = f(h_{a_1}^{syn}, h_{a_2}^{syn}, ..., h_{a_m}^{syn})$$

$$h_a^{sem} = f(h_{a_1}^{sem}, h_{a_2}^{sem}, ..., h_{a_m}^{sem})$$

$$r = [h_a^{syn}, h_a^{sem}]$$

where $f(\cdot)$ is an average pooling function applied over the aspect node representations. Then, the obtained representation $r$ is fed into a linear layer, followed by a softmax function to produce a sentiment probability distribution $p$, i.e.,

$$p(a) = \mathrm{softmax}(W_p r + b_p)$$

where $W_p$ and $b_p$ are the learnable weight and bias.
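As an illustration of the two components just described, the sketch below builds a single-head attention adjacency matrix and performs the mutual BiAffine exchange in pure Python. It assumes standard scaled dot-product attention with a row-wise softmax, and a softmax-normalised bilinear score for the exchange; the helper names and toy inputs are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_adjacency(h, w_q, w_k):
    """A_sem = softmax((H W_Q)(H W_K)^T / sqrt(d)), using a single head."""
    d = len(h[0])
    scores = matmul(matmul(h, w_q), transpose(matmul(h, w_k)))
    return softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])


def biaffine(h_syn, h_sem, w1, w2):
    """Each branch attends over the other branch's nodes and gathers their states."""
    syn_new = matmul(softmax_rows(matmul(matmul(h_syn, w1), transpose(h_sem))), h_sem)
    sem_new = matmul(softmax_rows(matmul(matmul(h_sem, w2), transpose(h_syn))), h_syn)
    return syn_new, sem_new


# Toy inputs: 3 nodes with 2-dimensional states and identity projections.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
a_sem = attention_adjacency(h, eye, eye)
syn_new, sem_new = biaffine(h, h, eye, eye)
```

Every row of `a_sem` is a probability distribution over the words of the sentence, which is why it can be read as an edge-weighted directed graph and passed to a GCN layer in place of a parser-derived adjacency matrix.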
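The pooling, concatenation, and classification steps can be sketched in a few lines of pure Python. The names `average_pool` and `classify_aspect`, the toy hidden states, and the zero-initialised classifier weights are assumptions made for illustration only.

```python
import math


def average_pool(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify_aspect(h_syn, h_sem, aspect_idx, w_p, b_p):
    # Average the aspect nodes of each branch, then concatenate the two vectors.
    syn_a = average_pool([h_syn[i] for i in aspect_idx])
    sem_a = average_pool([h_sem[i] for i in aspect_idx])
    r = syn_a + sem_a
    # Linear layer followed by softmax over the sentiment polarities.
    logits = [sum(w * x for w, x in zip(row, r)) + b for row, b in zip(w_p, b_p)]
    return r, softmax(logits)


# Toy example: 3 tokens, the aspect spans tokens 1 and 2, three polarity classes.
h_syn = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h_sem = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
w_p = [[0.0] * 4 for _ in range(3)]  # untrained classifier: all-zero weights
b_p = [0.0, 0.0, 0.0]
r, p = classify_aspect(h_syn, h_sem, [1, 2], w_p, b_p)
```

Because only the aspect positions are pooled, the same sentence encoding yields a different representation `r` for each aspect it contains.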

Regularizer
To improve the semantic representation, we propose two regularizers for the SemGCN module, i.e., orthogonal and differential regularizers.

Orthogonal Regularizer. Intuitively, the related items of each word should lie in different regions of a sentence, so the attention score distributions should rarely overlap. Therefore, we expect a regularizer to encourage orthogonality among the attention score vectors of all words. Given an attention score matrix $A^{sem} \in \mathbb{R}^{n \times n}$, the orthogonal regularizer is formulated as follows:

$$R_O = \left\| A^{sem} {A^{sem}}^T - I \right\|_F$$

where $I$ is an identity matrix and the subscript $F$ denotes the Frobenius norm. As a result, each non-diagonal element of $A^{sem} {A^{sem}}^T$ is minimized to keep the matrix $A^{sem}$ orthogonal.

Differential Regularizer. We expect the two types of feature representations learned from the SynGCN and SemGCN modules to represent distinct information contained within the syntactic dependency trees and the semantic correlations. Therefore, we adopt a differential regularizer between the two adjacency matrices of the SynGCN and SemGCN modules. Note that the regularizer constrains only $A^{sem}$ and is given as

$$R_D = \frac{1}{\left\| A^{sem} - A^{syn} \right\|_F}$$
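The two regularizers can be checked numerically with a short pure-Python sketch. It takes the orthogonal term as the Frobenius norm of the attention Gram matrix minus the identity, and the differential term as the reciprocal of the Frobenius distance between the two adjacency matrices. The small `eps` guard against division by zero is an addition for this sketch and is not mentioned in the text.

```python
import math


def frobenius(m):
    return math.sqrt(sum(v * v for row in m for v in row))


def orthogonal_regularizer(a_sem):
    """|| A A^T - I ||_F : penalises overlap between rows of the attention matrix."""
    n = len(a_sem)
    gram = [
        [sum(x * y for x, y in zip(a_sem[i], a_sem[j])) for j in range(n)]
        for i in range(n)
    ]
    return frobenius(
        [[gram[i][j] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    )


def differential_regularizer(a_sem, a_syn, eps=1e-12):
    """1 / || A_sem - A_syn ||_F : small when the two matrices differ a lot."""
    diff = [[s - t for s, t in zip(rs, rt)] for rs, rt in zip(a_sem, a_syn)]
    return 1.0 / (frobenius(diff) + eps)  # eps is an assumption, not from the paper


identity = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

r_o_sharp = orthogonal_regularizer(identity)  # rows do not overlap
r_o_dense = orthogonal_regularizer(uniform)   # rows overlap completely
r_d = differential_regularizer(identity, swapped)
```

A sharp, non-overlapping attention matrix incurs no orthogonal penalty, whereas a uniform one is penalised; the differential term shrinks as the semantic matrix moves away from the syntactic one.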

Loss Function
Our training goal is to minimize the following total objective function:

$$\ell_T = \ell_C + \lambda_1 R_O + \lambda_2 R_D + \lambda_3 \|\Theta\|_2$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are regularization coefficients and $\Theta$ represents all trainable model parameters.
$\ell_C$ is a standard cross-entropy loss and is defined for the ABSA task as follows:

$$\ell_C = -\sum_{(s,a) \in D} \sum_{c \in C} \log p(a)$$

where $D$ contains all sentence-aspect pairs and $C$ is the collection of distinct sentiment polarities.
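A minimal sketch of how the total objective could be assembled is given below. It uses the usual reading of cross-entropy, in which only the log-probability of the gold polarity contributes for each sentence-aspect pair, and it uses the plain (non-squared) L2 norm of the parameters; both choices are assumptions, and practical implementations often realise the last term through optimizer weight decay instead.

```python
import math


def cross_entropy(prob_rows, gold_labels):
    """Sum of negative log-probabilities assigned to the gold polarity."""
    return -sum(math.log(p[y]) for p, y in zip(prob_rows, gold_labels))


def total_loss(prob_rows, gold_labels, r_o, r_d, params, lam1, lam2, lam3):
    """Cross-entropy plus the weighted orthogonal, differential and L2 terms."""
    l2 = math.sqrt(sum(w * w for w in params))
    return cross_entropy(prob_rows, gold_labels) + lam1 * r_o + lam2 * r_d + lam3 * l2


# Toy batch with one sentence-aspect pair whose gold class is index 0.
probs = [[0.5, 0.25, 0.25]]
gold = [0]
loss = total_loss(probs, gold, r_o=1.0, r_d=0.5, params=[3.0, 4.0],
                  lam1=0.2, lam2=0.3, lam3=1e-4)
```

Keeping the three coefficients as explicit arguments makes it straightforward to switch either regularizer off (by passing zero) when reproducing an ablation.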

Datasets
We conduct experiments on three public standard datasets. The Restaurant and Laptop datasets are made public from the SemEval ABSA challenge (Pontiki et al., 2014). Following (Chen et al., 2017), we remove the instances using the "conflict" label. In addition, the Twitter dataset is a collection of tweets (Dong et al., 2014). All three datasets have three sentiment polarities: positive, negative and neutral. Each sentence in these datasets is annotated with marked aspects and their corresponding polarities. Statistics for the three datasets are shown in Table 1.

Dataset      Division   Positive   Negative   Neutral
Restaurant   Training   2164       807        637
Restaurant   Testing    727        196        196
Laptop       Training   976        851        455
Laptop       Testing    337        128        167
Twitter      Training   1507       1528       3016
Twitter      Testing    172        169        336

Table 1: Statistics for the three experimental datasets.

Implementation Details
The LAL-Parser (Mrini et al., 2019), which is used for dependency parsing, provides an off-the-shelf parser (https://github.com/KhalilMrini/LAL-Parser). For all the experiments, we use pretrained 300-dimensional GloVe vectors (Pennington et al., 2014; https://nlp.stanford.edu/projects/glove/) to initialize the word embeddings. The dimensionality of the position embeddings (i.e., the relative position of each word in a sentence with respect to the aspect) and of the part-of-speech (POS) embeddings is set to 30. We concatenate the word, POS and position embeddings and then input them into a BiLSTM model, whose hidden size is set to 50. To alleviate overfitting, we apply dropout at a rate of 0.7 to the input word embeddings of the BiLSTM. The dropout rate of the SynGCN and SemGCN modules is set to 0.1, and the number of SynGCN and SemGCN layers is set to 2. All the model weights are initialized from a uniform distribution. We use the Adam optimizer with a learning rate of 0.002. The DualGCN model is trained for 50 epochs with a batch size of 16. The regularization coefficients $\lambda_1$ and $\lambda_2$ are set to (0.2, 0.3), (0.2, 0.2) and (0.3, 0.2) for the three datasets, respectively, and $\lambda_3$ is set to $10^{-4}$. For DualGCN+BERT, we use the bert-base-uncased English version (https://github.com/huggingface/transformers). See our code for more details about the BERT experiments. Additionally, following (Marcheggiani and Titov, 2017), we add a self-loop for each node in the SynGCN and SemGCN modules.
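For readers who want to reproduce the BiLSTM setting, the hyperparameters stated above can be collected into a small configuration object. The class and field names are hypothetical, and the mapping of the three coefficient pairs to Restaurant, Laptop and Twitter assumes that "respectively" follows the order in which the datasets are listed in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualGCNConfig:
    word_dim: int = 300          # GloVe embedding size
    pos_dim: int = 30            # part-of-speech embedding size
    position_dim: int = 30       # relative-position embedding size
    lstm_hidden: int = 50        # hidden size of the BiLSTM
    input_dropout: float = 0.7   # dropout on the BiLSTM input embeddings
    gcn_dropout: float = 0.1     # dropout inside SynGCN and SemGCN
    num_gcn_layers: int = 2
    learning_rate: float = 0.002  # Adam
    epochs: int = 50
    batch_size: int = 16
    lambda3: float = 1e-4        # weight of the parameter-norm term
    add_self_loops: bool = True

    @property
    def lstm_input_dim(self) -> int:
        # Word, POS and position embeddings are concatenated before the BiLSTM.
        return self.word_dim + self.pos_dim + self.position_dim


# (lambda1, lambda2) per dataset; the dataset order is an assumption.
LAMBDAS = {
    "restaurant": (0.2, 0.3),
    "laptop": (0.2, 0.2),
    "twitter": (0.3, 0.2),
}

config = DualGCNConfig()
```

With these values the BiLSTM receives 360-dimensional inputs, obtained by concatenating the three embedding types.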

Baseline Methods
We compare DualGCN with state-of-the-art baselines. The models are briefly described as follows.
1) ATAE-LSTM (Wang et al., 2016) utilizes aspect embedding and the attention mechanism in aspect-level sentiment classification.
2) IAN (Ma et al., 2017) employs two LSTMs and an interactive attention mechanism to generate representations for the aspect and sentence.
3) RAM (Chen et al., 2017) uses multiple attention and memory networks to learn the sentence representation.
4) MGAN (Fan et al., 2018) designs a multi-grained attention mechanism to capture word-level interactions between the aspect and context.
5) TNet (Li et al., 2018b) transforms BiLSTM embeddings into target-specific embeddings and uses a CNN to extract final embeddings for classification.
6) ASGCN (Zhang et al., 2019) first proposed using a GCN to learn aspect-specific representations for aspect-based sentiment classification.
7) CDT (Sun et al., 2019b) utilizes a GCN over a dependency tree to learn aspect representations with syntactic information.
8) BiGCN (Zhang and Qian, 2020) uses a hierarchical graph structure to integrate word co-occurrence information and dependency type information.
9) kumaGCN (Chen et al., 2020) employs a latent graph structure to complement syntactic features.
10) InterGCN (Liang et al., 2020) constructs aspect-focused and inter-aspect graphs to learn dependency features of the key aspect words and sentiment relations between different aspects.
11) R-GAT (Wang et al., 2020) proposes an aspect-oriented dependency tree structure and then encodes new dependency trees with a relational GAT.
12) DGEDT (Tang et al., 2020) proposes a dependency graph enhanced dual-transformer network by jointly considering flat representations and graph-based representations.
13) BERT (Devlin et al., 2019) is the vanilla BERT model, which is fed the sentence-aspect pair and uses the representation of [CLS] for predictions.
14) R-GAT+BERT (Wang et al., 2020) is the R-GAT model that uses a pre-trained BERT to replace the BiLSTM as an encoder.
15) DGEDT+BERT (Tang et al., 2020) is the DGEDT model that uses a pre-trained BERT to replace the BiLSTM as an encoder.

Comparison Results
To evaluate the ABSA models, we use accuracy and the macro-averaged F1-score as the main evaluation metrics. The main experimental results are reported in Table 2. Our DualGCN model consistently outperforms all attention-based and syntax-based methods on the Restaurant, Laptop and Twitter datasets. These results demonstrate that our DualGCN effectively integrates syntactic knowledge and semantic information. In addition, DualGCN fits datasets that contain formal, informal or complicated reviews. Compared to attention-based methods such as ATAE-LSTM, IAN and RAM, our DualGCN model utilizes syntactic knowledge to establish dependencies between words, so it can avoid the noise introduced by the attention mechanism. Moreover, syntax-based methods such as ASGCN, CDT and R-GAT achieve better performance than attention-based methods, but they ignore the semantic correlation between words. When sentences are informal or complicated, using only syntactic knowledge results in poor performance.
In addition, the results from the last group in Table 2 show that the basic BERT model outperforms most of the models based on static word embeddings. Moreover, based on BERT, our DualGCN+BERT achieves even better performance.
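For reference, the two evaluation metrics named above can be computed as in the following pure-Python sketch. The function names and the toy label sequences are illustrative; macro-F1 here is the unweighted mean of the per-class F1 scores, with a class that never occurs in either the gold or predicted labels contributing 0.

```python
def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Hypothetical predictions for four sentence-aspect pairs.
gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "neu"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred, ["pos", "neg", "neu"])
```

Because macro-F1 weights every polarity equally, it is more sensitive than accuracy to errors on the smaller classes of imbalanced datasets.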

Ablation Study
To further investigate the role of modules in the DualGCN model, we conduct extensive ablation studies. The results are reported in Table 2. The SynGCN-head model uses the discrete outputs of a dependency parser to construct the adjacency matrix of the GCNs. In contrast, SynGCN leverages the probability matrix generated in a dependency parser as the adjacency matrix.

Table 4 shows a few sample cases analyzed using different models. The notations P, N and O represent positive, negative and neutral sentiment, respectively. We highlight the aspect words in red and in blue. For the aspect "food" in the first sample, the attention-based methods, i.e., ATAE-LSTM and IAN, are prone to attend to the noisy word "dreadful". Although the syntactic dependency can establish direct connections between an aspect and some words, such a connection may not exist between the aspect and its opinion words in complicated sentences. Take the second sample as an example; the aspect "apple os" is far from the opinion word "happy" in terms of syntactic distance. Thus, the SynGCN model fails. Additionally, in the third sample, feature representations of the key words "did not" are not captured by the SynGCN model. In contrast, the SemGCN model can attend to the semantic correlation between words. The last two samples demonstrate that our DualGCN, which fully considers the complementarity of syntactic knowledge and semantic information, can address complicated and informal sentences with the help of the orthogonal and differential regularizers.

Attention Visualization
To investigate the effectiveness of the two regularizers in capturing the semantic correlations between words, we visualized the attention score matrices of DualGCN w/o R O &R D and the intact DualGCN. Consider the sample sentence "Web browsing is very quick with Safari browser." with "Safari browser" as the aspect. As shown in Figure 3 (a), the attention score matrix is dense, and the related terms of each word overlap in the DualGCN w/o R O &R D model. This result is attributed to the lack of semantic constraints in the self-attention layers. The overlap of semantic correlations leads to redundancy and noise during information propagation. The seventh and eighth rows of the attention score matrix are the attention probability distributions of "safari" and "browser", respectively. The information to which "safari browser" attends is redundant, and the aspect does not pay more attention to the key opinion word "quick". Thus, the DualGCN w/o R O &R D fails. In comparison, in Figure 3 (b), the attention score matrix produced by our DualGCN is relatively sparse. Both "safari" and "browser" are semantically related to "quick", and their other attended items are also semantically reasonable. In addition, the attention scores of the related terms of each word tend to be distinct and precise due to the semantic constraints of the two regularizers. Therefore, our DualGCN model can readily predict the correct sentiment polarity of the aspect "safari browser".

[Table 2 fragment recovered from extraction: ATAE-LSTM (Wang et al., 2016) 77.20, -, 68.70, -, -, -; IAN (Ma et al., 2017) 78.60, -, 72.10, -, -, -; the remaining rows were not recovered.]

Impact of the DualGCN Layer Number
To investigate the impact of the DualGCN layer number, we evaluate our DualGCN model with one to eight layers on the Restaurant and Laptop datasets. As shown in Figure 4, our model with two DualGCN layers performs the best. On the one hand, node representations cannot propagate far when the number of layers is small. On the other hand, if the number of layers is excessive, the model becomes unstable due to vanishing gradients and information redundancy.

Conclusion
In this paper, we propose a DualGCN architecture to address the disadvantages of attention-based and dependency-based methods for ABSA tasks. Our DualGCN model integrates syntactic knowledge and semantic information by means of the SynGCN and SemGCN modules. Moreover, to effectively capture the semantic correlation between words, we propose orthogonal and differential regularizers in the SemGCN module. These regularizers can attend to the semantically related items with less overlap of each word and capture feature representations that differ from the syntactic structure. Extensive experiments on benchmark datasets show that our DualGCN model outperforms baselines.

#   Review                                                             ATAE-LSTM   IAN      SynGCN   SemGCN   DualGCN
1   Great food but the service was dreadful!                           (N, N)      (N, N)   (P, N)   (P, N)   (P, N)
2   Works well, and I am extremely happy to be back to an apple OS.    (P, P)      (P, P)   (P, O)   (P, P)   (P, P)
3   Did not enjoy the new Windows 8 and touchscreen functions.         (predictions not recovered)

Table 4: Sample cases analyzed using different models (partially recovered from extraction).