BERT4GCN: Using BERT Intermediate Layers to Augment GCN for Aspect-based Sentiment Classification

Graph-based Aspect-based Sentiment Classification (ABSC) approaches have yielded state-of-the-art results, expecially when equipped with contextual word embedding from pre-training language models (PLMs). However, they ignore sequential features of the context and have not yet made the best of PLMs. In this paper, we propose a novel model, BERT4GCN, which integrates the grammatical sequential features from the PLM of BERT, and the syntactic knowledge from dependency graphs. BERT4GCN utilizes outputs from intermediate layers of BERT and positional information between words to augment GCN (Graph Convolutional Network) to better encode the dependency graphs for the downstream classification. Experimental results demonstrate that the proposed BERT4GCN outperforms all state-of-the-art baselines, justifying that augmenting GCN with the grammatical features from intermediate layers of BERT can significantly empower ABSC models.


Introduction
Aspect-based sentiment classification (ABSC), a fine-grained sentiment classification task in the field of sentiment analysis, aims at identifying the sentiment polarities of aspects explicitly given in sentences. For example, given the sentence "The price is reasonable although the service is poor.", ABSC needs to correctly assign a positive polarity to price and a negative one to service.
Intuitively, matching aspects with their corresponding opinion expressions is the core of ABSC task. Some of previous deep learning approaches (Tang et al., 2016;Ma et al., 2017;Huang et al., 2018; use various types of attention mechanisms to model the relationship between aspects and opinion expressions in an implicit way. However, these attention-based models do not take * Corresponding author (Qingliang Chen).
good advantage of the syntactic information of sentences, such as the dependency graph, to better associate the aspects with their sentimental polarities, thus leading to poor performance.
In order to better pair aspects and their corresponding opinion expressions, the syntactic features must be taken into account. Recent work such as (Huang and Carley, 2019;Sun et al., 2019;Tang et al., 2020) employs either Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017) or Graph Attention Networks (GATs) (Velickovic et al., 2018) on the dependency graph, in which words in the sentence are nodes in the graph and grammatical relations are edge labels. The dependency graph they use can be obtained from either ordinary dependency tree parser or reshaping it using heuristic rules. These models can achieve promisingly better performance with this additional grammatical features, especially when they incorporate the contextual word embedding, such as BERT (Devlin et al., 2019).
Meanwhile, BERTology researchers have investigated what linguistic knowledge can be learned from unlabeled data by language models (Clark et al., 2019;Hewitt and Liang, 2019;Hewitt and Manning, 2019;Jawahar et al., 2019). And it has been shown that BERT captures a rich hierarchy of linguistic information, with surface features in lower layers, syntactic features in middle layers and semantic features in higher layers (Jawahar et al., 2019).
Inspired by Jawahar et al. (2019), we propose a novel framework of BERT4GCN to make the best of the rich hierarchy of linguistic information from BERT in this paper. Specifically, we firstly encode the context with a BiLSTM (Bidirectional Long Short-Term Memory) to capture the contextual information regarding word orders. Then, we use the hidden states of BiLSTM to initiate node representations and employ multi-layer GCN on the dependency graph. Next, we combine the node representations of each layer of GCN with the hidden states of some intermediate layer of BERT, i.e., for the neighborhood aggregation of GCN. In this way, BERT4GCN can fuse grammatical sequential features with graph-based representations. Besides, we further prune and add the edge of the dependency graph based on self-attention weights in the Transformer encoder (Vaswani et al., 2017) of BERT to deal with parsing errors and make dependency graph better suit ABSC task. In addition, we develop a method which incorporates relative positional embedding in node representations to make GCN position-aware.
Our contributions in this paper are summarized as follows: • We propose a BERT4GCN model, which naturally incorporates grammatical sequential features and syntactic knowledge from intermediate layers in BERT to augment GCN and thus can produce better encodings for the downstream ABSC task. BERT4GCN can make the best of the hidden linguistic knowledge in BERT without learning from scratch on other sources. • The experiments have been conducted on Se-mEval 2014 Task 4 (Pontiki et al., 2014) and Twitter (Dong et al., 2014) datasets, with promising results showing that BERT4GCN achieves new state-of-the-art performance across these prestigious benchmarks.

Related Work
Modeling the connection between aspect terms and opinion expressions is the core of the ABSC task. State-of-the-art models combine GNNs (Graph Neural Networks) with dependency graphs whose syntactic information is very helpful. Some have stacked several GNN layers to propagate sentiment from opinion expressions to aspect terms (Huang and Carley, 2019;Sun et al., 2019;. Some have tried to convert the original dependency graph to the aspect-oriented one and encode dependency relations . More recently, a dual structure model has been proposed in (Tang et al., 2020), jointly taking the sequential features and dependency graph knowledge together. Our BERT4GCN model has a similar structure to that in (Tang et al., 2020), but ours utilizes the grammatical sequential features directly from intermediate layers in BERT, instead of learning from scratch on other sources.

Word Embedding and BiLSTM
Given s = [w 1 , w 2 , ..., w i , ..., w i+m−1 , ..., w n ] as a sentence of n words and a substring a = [w i , ..., w i+m−1 ] representing aspect terms, we first map each word to a low-dimensional word vector. For each word w i , we get a vector x i ∈ R de where d e is the dimensionality of the word embedding. After that, we employ a BiLSTM to word embeddings to produce hidden state vectors where each h t ∈ R 2d h represents the hidden state at time step t from bidirectional LSTM, and d h is the dimensionality of a hidden state output by an unidirectional LSTM. H will fuse with BERT hidden states to produce the input of the first GCN layer.

We consider BERT as a grammatical knowledge encoder that encodes input text features in hidden states and self-attention weights. The input of BERT model is formulated as [CLS]s[SEP]a[SEP]
, where s is the sentence and a is the aspect term.

Hidden States as Augmented Features
BERT captures a rich hierarchy of linguistic information, spreading over the Transformer encoder blocks (Clark et  i ∈ R n×d BERT , and d BERT is the dimensionality of the hidden state. When a word is split into multiple sub-words, we just use the hidden state corresponding to its first sub-word. Then we can get the augmented features G for GCN as: (1)

Supplementing Dependency Graph
Self-attention mechanism captures long-distance dependencies between words. Therefore, we apply self-attention weights of BERT to supplement dependency graphs to deal with parsing errors and make dependency graphs better suit the ABSC task. After getting dependency graphs from an ordinary parser, we substitute the unidirectional edges to bidirectional ones. Next, we obtain attention weight tensors A att = W att 1 , W att 5 , W att 9 , W att 12 , where each W att i ∈ R h×n×n is the self-attention weight tensor of i-th layer Transformer encoder of BERT, and h is the number of attention heads. Then, we average them over the head dimensionĀ att is the l-th element in A att , and l ∈ {1, 2, 3, 4}. Finally, we prune or add directed edges between words if the attention weight is larger or smaller than some thresholds. So the supplemented dependency graph for the l-th layer of GCN is formulated as follows: where A is the adjacency matrix of the original dependency graph, A sup l is the adjacency matrix of the supplemented dependency graph and A sup l,i,j is the element on the i-th row and j-th column of A sup l , α and β are hyperparameters.

GCN over Supplemented Dependency Graph
We then apply GCN over the supplemented dependency graph in every layer, whose input is a fusion of BERT hidden states and output of the previous layer as follows: where k ∈ {2, 3, 4} , l ∈ {1, 2, 3, 4}, G 1 and G k ∈ R n×d BERT is the first and k-th element in G, respectively. R 1 and R k are the node representations which fuse the hidden states of BERT with the hidden states of BiLSTM or the output of the preceding GCN layer, before feeding to the first and k-th GCN layer. O l,i is output of l-th layer in GCN. d i is the outdegree of node i. The weight W 1 ,W k , W l and bias b l are trainable parameters.

Taking Relative Positions
The GCN aggregates neighbouring nodes representations in an averaging way and ignores the relative linear positions in the original context. To address this issue, we learn a set of relative position embeddings P ∈ R (2w+1)×dp to encode positional information, where w is the positional window hyperparameter. Before aggregating neighbouring node representation, we add the relative position embedding to the node representation, formalizing the relative linear position to the current word as follows: clip (x, w) = max (−w, min(w, x)) (8) where R p l,j is the positional node representation, and clip function returns the embedding index.

The Training
Having obtained the representation of words after the last GCN layer, we average the representation of the current aspect terms as the final feature for classification: It is then fed into a fully-connected layer, followed by a softmax layer to yield a probability distribution p over polarity space: where W c and b c are trainable weights and biases, respectively. The proposed BERT4GCN is optimized by the gradient descent algorithm with the cross entropy loss and L 2 -regularization: where D denotes the training dataset, y is the ground-truth label, p y is the y-th element of p, Θ represents all trainable parameters, and λ is the coefficient of the regularization term.

Datasets and Experimental Setup
We have evaluated our model on three widely used datasets, including Laptop and Restaurant datasets    , and DGEDT-BERT (Tang et al., 2020). We divide them into three categories. The category Others includes models that are general and task agnostic ones. The LSTM acts as a baseline for models without using PLM, while BERT-SPC and RoBERTa-MLP are baselines for models with the corresponding PLMs. The categories Attention and Graph are models mainly based on attention mechanism and graph neural networks, respectively. The details of datasets and experimental setup can be found in Appendix A.1.

Overall Results
We now present the comparisons of performance of BERT4GCN with other models in terms of classification accuracy and Macro-F1 on

Results Analysis
BERT-SPC vs. RoBERTa-MLP. RoBERTa-MLP significantly outperforms BERT-SPC on Laptop and Restaurant datasets, while has similar results to BERT-SPC on Twitter dataset. The same pattern is also observed in the comparison of BERT4GCN and RoBERTa4GCN. One possible reason for this phenomenon is that the corpora which the two PLMs were pre-trained on are far different from Twitter dataset. Therefore, RoBERTa's superiority is not shown on the Twitter dataset, which we call the out-domain dataset. BERT-SPC vs. BERT-based models. From the table, we also see that BERT-SPC can parallel BERT-based models, indicating that model architecture engineering only has a marginal effect when using BERT. BERT-SPC vs. BERT4GCN. Observed from the experimental results, the improved performance of BERT4GCN over BERT-SPC on Twitter dataset is higher than the other datasets. A similar pattern also appears in the comparison of RoBERTa4GCN and RoBERTa-MLP, where RoBERTa4GCN is (2) When PLM is strong enough, with the current model framework, heavy model architecture engineering is unnecessary on the in-domain dataset. These two conjectures need to be further explored in future work.

Ablation Study
To examine the effectiveness of different modules in BERT4GCN, we have carried out ablation studies as shown in Table 1. As presented in Table 1, the full model of BERT4GCN has the best performance. And we observe that only fusing grammatical sequential features with GCN (i.e., w/o pos. & w/o att.) can still achieve state-of-the-art results.
With the supplemented the dependency graphs (i.e., w/o pos.), the performance increases just slightly. It is notable that adding the relative position module alone (w/o att) produces a negative effect, and the power of the relative position module can only be revealed when combined together with the supplemented dependency graphs. We conjecture that the supplemented dependency graphs prune some edges which connect nearby aspect terms to words that are irrelevant to sentiment classification, thus reducing the noise.

Effect of Relative Position Window
We have also investigated the effect of the relative position window w on BERT4GCN across three datasets. As shown in Figure 1, Twitter and Laptop datasets prefer a smaller window, while Restaurant dataset prefers a bigger one. We calculate the relative distances between aspect and opinion terms of Laptop and Restaurant datasets with annotated aspect-opinion pairs by Fan et al. (2019), and visualize the distributions with kernel density estimation in Figure 2. Although the two distributions are very similar, the optimal window size for the two datasets is not the same. Therefore we hypothesize that the preference of window size is also influenced by the dataset domain. The long-tailed distributions imply that we need to carefully set the window size for the trade-off between the benefits and losses of position biases.

Conclusion
In this paper, we propose a BERT4GCN model which integrates the grammatical sequential features from BERT along with the syntactic knowledge from dependency graphs. The proposed model utilizes intermediate layers of BERT, which contain rich and helpful linguistic knowledge, to augment GCN, and furthermore, incorporates relative positional information of words to be positionaware. Finally, experimental results show that our model achieves new state-of-the-art performance on prestigious benchmarks.

A.1 Datasets and Experimental Setup
We have evaluated our BERT4GCN model on three widely used datasets, including Laptop and Restaurant datasets from SemEval 2014 Task 4 and Twitter datasets. The statistics of these three datasets are listed in Table 2  We use 10-fold cross validation to evaluate the model performance. Specifically, we evaluate models at the end of epochs and save the checkpoint that achieves highest accuracy in the validation set. Finally, we test the models on the test set and report the average of 10-fold experimental results.

A.2 Implementation Details
We choose 300-dimensional Glove vectors for the word embeddings. During training, the learning rate of BERT and other modules linearly warms up from 0 to 0.00002 and 0.001, respectively, then linearly decreases to 0.
In this paper, we name our model as BERT4GCN and use the bert-base-uncased model as the source of grammatical sequential features, but it is easy to extend it to other transformer-based pre-training language models 2 .
The training of BERT4GCN model has been run on a Nvidia RTX 3090 that requires about 14GB of GPU memory. The PyTorch implementation of BERT 3 is used in the experiments. The dimensionality of unidirectional LSTM hidden state and positional embedding is set to 300. We set α to 0.01 and β to 0.25, respectively. The position window w is set to 2, 3, 5 on Twitter, Laptop and Restaurant datasets, respectively. And the batch size is set to 32, with Adam as the optimizer. As for the regularization, dropout is applied to BERT and GCN output with the rate of 0.8. And we use Spacy 4 toolkit to generate dependency trees.