Improving BERT with Syntax-aware Local Attention

Pre-trained Transformer-based neural language models, such as BERT, have achieved remarkable results on a variety of NLP tasks. Recent works have shown that attention-based models can benefit from more focused attention over local regions. Most of them restrict the attention scope to a linear span, or are confined to certain tasks such as machine translation and question answering. In this paper, we propose a syntax-aware local attention, where the attention scopes are restrained based on distances in the syntactic structure. The proposed syntax-aware local attention can be integrated with pre-trained language models, such as BERT, so that the model focuses on syntactically relevant words. We conduct experiments on various single-sentence benchmarks, including sentence classification and sequence labeling tasks. Experimental results show consistent gains over BERT on all benchmark datasets. Extensive studies verify that our model achieves better performance owing to more focused attention over syntactically relevant words.


Introduction
Recently, the Transformer (Vaswani et al., 2017) has performed remarkably well, building on multi-headed dot-product attention that fully takes into account global contextualized information. Several studies find that self-attention can be enhanced by local attention, where the attention scopes are restricted to important local regions. Luong et al. (2015); Yang et al. (2018); Xu et al. (2019); Nguyen et al. (2020) utilize dynamic or fixed windows to perform local attention. Strubell et al. (2018); Zhang et al. (2020); Bugliarello and Okazaki (2020) explore utilizing syntax to restrain attention for better performance, but each of them is confined to a specific task.
In this work, we propose a syntax-aware local attention (SLA) which is adaptable to several tasks, and integrate it with BERT (Devlin et al., 2019). We first apply dependency parsing to the input text, and calculate the distances between input words to construct the self-attention masks. The local attention scores are calculated by applying these masks to the dot-product attention. Then we combine the syntax-aware local attention with the Transformer global attention. A gate unit is employed for each token in each layer, which determines how much attention is paid to syntactically relevant words. We lift weights from existing pre-trained BERT, and evaluate our models on several single-sentence benchmarks, including sentence classification and sequence labeling tasks. Experimental results show that our method achieves consistent performance gains over BERT and outperforms previous syntax-based approaches on average. Furthermore, we compare our syntax-aware local attention with window-based local attention. We find that the syntax-aware local attention is more involved in the aggregation of local and global attention. The attention visualization also validates that the syntactic information helps capture important local regions.
To summarize, this paper makes the following contributions: i) SLA can capture the information of important local regions based on the syntactic structure. ii) SLA can be easily integrated into the Transformer, which allows initialization from pre-trained BERT while adding very few parameters. iii) Experiments show the effectiveness of SLA on various single-sentence benchmarks.

Self-Attention
The Transformer (Vaswani et al., 2017) uses stacked self-attention to encode contextual information for input tokens. The calculation of self-attention depends on three components: the queries Q, keys K and values V, which are projected from the hidden vectors of the previous layer. Then the attention output A of one head is computed as:

A = softmax(QK^T / √d + M) V

where d is the dimension of the keys and the mask matrix M controls whether two tokens can attend to each other. Within the standard self-attention layer, a global attention mechanism is employed in which each token provides information to every other token in the input sentence.
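As a concrete illustration, the masked self-attention described above can be sketched in a few lines of NumPy. The additive-mask convention here (0 keeps a token pair, a large negative value blocks it before the softmax) is our assumption about the form of M; the function name is illustrative.

```python
import numpy as np

def attention(Q, K, V, M):
    """Scaled dot-product attention with an additive mask M.

    M[i, j] = 0 lets token i attend to token j; a large negative
    value (effectively -inf) blocks the pair before the softmax.
    """
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + M             # raw attention scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# 3 tokens with dimension 4; an all-zero mask reproduces global attention
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
A = attention(Q, K, V, np.zeros((3, 3)))
```

With a non-zero mask, blocked entries simply receive (near-)zero weight in the softmax, which is what the local attentions below exploit.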

Local Attentions
Local attention limits each token to attend to a subset of the other tokens in the input. Many works utilize a fixed or dynamic window to derive the important local regions. Luong et al. (2015) first propose a Gaussian-based local attention and increase BLEU scores for neural machine translation. Yang et al. (2018) improve the method of Luong et al. (2015) by predicting a central position and window size to model localness. Compared with Yang et al. (2018), Nguyen et al. (2020) attempt to derive the local window span by a soft-masking method. However, Levy and Goldberg (2014) suggest that more informative representations can be learned from the syntactic structure, instead of from a window of surrounding tokens. Strubell et al. (2018) propose to train one attention head to attend to each token's syntactic parent for semantic role labeling. Zhang et al. (2020) also incorporate syntactic information into self-attention, but confine their method to question answering. Thus, we explore taking advantage of the syntactic structure to improve model performance on various benchmarks.

Approach
In this section, we first introduce the syntax-aware local attention, and then integrate it with standard Transformer attention. As shown in Figure 1, we extend the Transformer layer with the syntax-aware local attention. Syntax-based masking is applied to the dot-product of queries and keys. The final attention scores are computed by incorporating local attention with standard global attention. We stack new layers and initialize weights from pre-trained BERT.

Syntax-aware Local Attention
We derive the syntactic structure from dependency parsing, and treat it as an undirected tree. Each token x_i is mapped to a tree node v_i, and the distance between nodes v_i and v_j is denoted by dis(v_i, v_j). However, the input may be an ungrammatical sentence in some tasks, and the dependency parser is not perfectly accurate. Thus, we calculate the distance from the neighboring tokens of x_i to token x_j as:

D(i, j) = min(dis(v_{i-1}, v_j), dis(v_i, v_j), dis(v_{i+1}, v_j))

The motivation is that many attention heads specialize in attending heavily to the next or previous token (Clark et al., 2019). Then, in order to determine whether token x_j can attend to token x_i, a threshold m is applied to restrict the distance D(i, j). For simplicity, the mask matrix M_loc can be formulated as:

M_loc(i, j) = 0 if D(i, j) ≤ m, and -∞ otherwise

Given the query Q and key K projected from the hidden vectors H, the syntax-aware local attention scores S_loc are formally defined as:

S_loc = QK^T / √d + M_loc

where d is the dimension of the keys. In this local attention, two tokens can attend to each other only if they are close enough in the dependency tree.
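The two steps above — all-pairs tree distances and the thresholded mask — can be sketched as follows. The exact smoothed distance D(i, j) = min over the rows i-1, i, i+1 is our reading of the text ("the distance from neighboring tokens of x_i"), and -1e9 stands in for -∞; both are assumptions, and the function names are illustrative.

```python
from collections import deque
import numpy as np

def tree_distances(n, edges):
    """All-pairs distances over an undirected dependency tree via BFS."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = np.full((n, n), np.inf)
    for s in range(n):
        dist[s, s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if dist[s, v] == np.inf:
                    dist[s, v] = dist[s, u] + 1
                    q.append(v)
    return dist

def syntax_mask(dist, m):
    """Additive mask: 0 where the smoothed distance D(i, j) <= m,
    a large negative value (standing in for -inf) otherwise."""
    n = dist.shape[0]
    D = dist.copy()
    for i in range(n):
        for k in (i - 1, i + 1):     # neighboring tokens x_{i-1}, x_{i+1}
            if 0 <= k < n:
                D[i] = np.minimum(D[i], dist[k])
    return np.where(D <= m, 0.0, -1e9)

# toy 4-token dependency tree shaped as a chain 0-1-2-3
dist = tree_distances(4, [(0, 1), (1, 2), (2, 3)])
M_loc = syntax_mask(dist, m=1)
```

On this chain, token 0 can still reach token 2 (smoothed distance 1 via its neighbor), but token 3 is masked out at m=1.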

Attention Aggregation
As shown in Figure 1, the final attention is the aggregation of the syntax-aware local attention and the Transformer attention. We denote the Transformer attention scores by S_glb. A gated unit is used to combine the global and local attention scores. The gate value g_i for each token x_i is calculated as follows:

g_i = sigmoid(W_g h_i + b_g)

where h_i is the hidden vector of token x_i from the previous layer, W_g is a learnable linear transformation and b_g is the bias. Then the attention output A_i is calculated as a weighted average over the values V, where the weights are derived from the global and local attention scores:

A_i = (g_i · softmax(S_loc)_i + (1 - g_i) · softmax(S_glb)_i) V

A larger gate value means more focused attention over syntactically relevant words. It can be seen that, if all the outputs of the gated units are equal to 0, we recover the standard Transformer attention. Compared with the original architecture, our self-attention layer has one more input (M_loc) and two more trainable parameters (W_g and b_g). Thus, we can easily lift weights from existing pre-trained BERT models.
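A minimal sketch of this gated aggregation, assuming the gate mixes the two softmax-normalized score matrices row-wise before the weighted sum over V (the function names and shapes are illustrative, not the authors' implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def aggregate(S_glb, S_loc, H, V, W_g, b_g):
    """Per-token gate g_i = sigmoid(W_g h_i + b_g) mixes the local and
    global attention distributions before the weighted sum over V."""
    g = sigmoid(H @ W_g + b_g)  # shape (n, 1): one gate per token
    P = g * softmax(S_loc) + (1 - g) * softmax(S_glb)
    return P @ V
```

When the gate saturates at 0 (e.g. a large negative bias), the output collapses to standard global attention, matching the observation in the text.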

Experimental Setup
Benchmarks We use two English single-sentence classification datasets from the GLUE benchmark (Wang et al., 2019). We test on the CoLA and SST-2 datasets for acceptability and sentiment classification. Besides, we evaluate our method on two sequence labeling tasks: named entity recognition (NER) and grammatical error detection (GED). We use the CoNLL-2003 and FCE datasets for NER and GED, respectively. The training procedures are introduced in Appendix A.1.
Configuration All the training experiments are based on BERT. We use the uncased version of BERT for CoLA and SST-2, and the cased version for CoNLL-2003 and FCE. We derive dependency trees using spaCy. More implementation details are reported in Appendix A.2.
Baselines We apply the syntax-aware local attention (SLA) to BERT. In addition to comparing with BERT, we also investigate the following approaches: SGNet Zhang et al. (2020) present a syntax-guided self-attention layer, where each word is limited to interact with all of its syntactic ancestor words. They stack this layer on top of the pre-trained BERT model, instead of modifying the Transformer architecture.
LISA Strubell et al. (2018) restrict each token to attend to its syntactic parent in one attention head. We apply it to BERT and add the corresponding supervision at the last attention head in each Transformer layer.
Besides, we implement the window-based local attention (WLA), which allows each token to attend to the neighboring tokens within a window of size 2k + 1 (varying k in {3, 4, 5}). It is then integrated with BERT as shown in Section 3.2.
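The window-based baseline's mask is straightforward to construct; a sketch under the same additive-mask convention as above (0 keeps a pair, -1e9 stands in for -∞; the function name is illustrative):

```python
import numpy as np

def window_mask(n, k):
    """Additive mask letting token i attend only to tokens in the
    window [i-k, i+k] (size 2k+1); all other pairs get -1e9."""
    idx = np.arange(n)
    inside = np.abs(idx[:, None] - idx[None, :]) <= k
    return np.where(inside, 0.0, -1e9)

# 6 tokens, window size 2k+1 = 5
M_win = window_mask(6, k=2)
```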

Main Results
Experimental results are shown in Table 1. We report results on the dev sets of CoLA and SST-2 and the test sets of CoNLL-2003 and FCE. We employ t-tests to check whether the mean difference between the standard attention and our proposed attention differs from 0. It can be seen that our model achieves consistent improvements over BERT. Besides, the proposed local attention outperforms other approaches leveraging syntactic information on average. Compared with BERT, the syntax-aware local attention improves performance consistently but the window-based local attention does not. This suggests that BERT can benefit from more attention over syntactically relevant words on several datasets. However, there are still gaps between our model and the state-of-the-art models on these datasets. We argue that our method only modifies the standard Transformer attention without changing its main architecture, whereas those models are trained with more advanced pre-training methods (Sun et al., 2020), larger-scale datasets (Raffel et al., 2019), or a different learning framework.

Analysis
Gated Unit in Each Layer It can be seen from Equation (6) that a larger gate value means a more important role of local attention in the attention aggregation. We analyze the gate values in different layers on the SST-2 and FCE datasets. The gated unit outputs are collected from the best-trained base-size models, and are averaged over all input tokens in each layer. As shown in Figure 2, on the SST-2 dataset, the syntax-aware local attention has higher values than the window-based local attention in most layers. Even though the sentences of the FCE dataset are ungrammatical, our attention plays a more important role in 8 of 12 layers. This indicates that our local attention is more important in the attention score calculation process. Besides, Table 1 and Figure 2 illustrate that our model achieves better performance owing to more attention on syntactically relevant words.
Attention Visualization In order to compare the syntax-aware attention with the window-based attention, we plot their attention scores in Figure 3 for the input sentence "In a way, the film feels like a breath of fresh air, but only to those that allow it in." As formulated in Equation (6), the attention scores are calculated from the aggregation of global and local attention. We mainly focus on the interactions of tokens, excluding [CLS] and [SEP]. The attention scores are averaged over all heads and layers. This visualization validates the effectiveness of incorporating syntactic information into self-attention. As shown in Figure 3, there are many informative tokens overlooked by the window-based method (left) but captured by our method (right). For instance, the syntax-aware attention allows the tokens "fresh air" and "allow" to strongly attend to the token "film", while these tokens receive less attention in the window-based attention.
Testing on Sentence-Pair Classification We also evaluate our model on sentence-pair classification datasets. Given a single sentence, we can easily apply dependency parsing and restrain the attention scopes inside the sentence. But for pairwise classification, one problem is how to limit the scopes between a pair of sentences. We adopt a naive approach: each token in one sentence can attend to all tokens in the other sentence. We conduct experiments on four pairwise classification datasets from the GLUE benchmark (Wang et al., 2019), which cover paraphrase, textual entailment and text similarity. Experimental results are shown in Table 2. The syntax-aware local attention achieves better performance on MRPC and STS, but does not perform well on RTE and QNLI. We suspect that this is because cross-sentence interactions are more important for the textual entailment task.
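One way to read the naive sentence-pair scheme is as a block-diagonal mask: each sentence keeps its own syntax mask, and every cross-sentence pair stays visible. This construction is our interpretation, not the authors' code, and the function name is illustrative:

```python
import numpy as np

def pair_mask(mask_a, mask_b):
    """Combine per-sentence additive syntax masks for a sentence pair.

    Within each sentence its own mask applies; every cross-sentence
    entry is left at 0, i.e. fully visible.
    """
    na, nb = mask_a.shape[0], mask_b.shape[0]
    M = np.zeros((na + nb, na + nb))
    M[:na, :na] = mask_a   # sentence A attends under its own mask
    M[na:, na:] = mask_b   # sentence B attends under its own mask
    return M

# toy example: a fully-masked 2-token sentence paired with an open 3-token one
M_pair = pair_mask(np.full((2, 2), -1e9), np.zeros((3, 3)))
```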

Conclusion
This work verifies that BERT can be further improved by incorporating syntactic knowledge into the local attention mechanism. With more focused attention over syntactically relevant words, our model achieves better performance on various benchmarks. Additionally, the extensive experiments demonstrate the universality of our syntax-aware local attention.

A.1 Training Procedure
We extend the Transformer encoder layer and lift weights from BERT into our model. Following Devlin et al. (2019), we apply the fine-tuning procedure for various NLP tasks. For classification tasks, the final output of the first token [CLS] is taken as the representation of the input. The probability that the input sentence X is labeled as class c is predicted by a linear transformation with softmax:

p(c | X) = softmax(W_c h_[CLS] + b_c)

where h_[CLS] is the representation of the token [CLS], and W_c and b_c are task-specific parameters. For labeling tasks, we apply the BIO annotation scheme (Ratinov and Roth, 2009) to label outputs and compute the probability that token x_i belongs to class c as:

p(c | x_i) = softmax(W_t h_i + b_t)

where h_i is the representation of the token x_i, and W_t and b_t are task-specific parameters. Finally, the training objective for all tasks is to minimize the cross-entropy loss.
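The classification head is a single linear layer followed by a softmax; a minimal sketch (shapes and the function name are illustrative):

```python
import numpy as np

def classify(h_cls, W_c, b_c):
    """Class probabilities p(c | X) = softmax(W_c h_[CLS] + b_c)."""
    z = W_c @ h_cls + b_c
    z -= z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

# hidden size 8, two classes
rng = np.random.default_rng(0)
probs = classify(rng.standard_normal(8), rng.standard_normal((2, 8)), np.zeros(2))
```

The token-level head for labeling tasks is identical, applied per position with W_t and b_t.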

A.2 Implementation Details
We apply whitespace tokenization to the input sentence, and obtain the dependency tree using the spaCy parser (https://spacy.io/). However, the BERT inputs are tokenized by the WordPiece tokenizer, which means one word may be split into several sub-words. To address this issue, for each word in the dependency tree, the sub-words produced by the WordPiece tokenizer share the same masking value in the calculation of the syntax-aware local attention. An important detail is that BERT represents the input by adding a [CLS] token at the beginning as the special classification embedding and separating sentences with a [SEP] token. Clark et al. (2019) find that a substantial amount of BERT's attention is attached to these special tokens. Thus, the [CLS] and [SEP] tokens are never masked out in our local attention.
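The word-to-sub-word expansion described above can be sketched as follows, assuming a `word_ids`-style mapping from each sub-word position to its source word (with `None` for [CLS]/[SEP]); the function name and mapping format are assumptions, not the authors' code:

```python
import numpy as np

def expand_mask(word_mask, word_ids):
    """Expand a word-level additive mask to WordPiece level.

    word_ids maps each sub-word position to its word index, with None
    for special tokens. Special tokens are never masked out, and all
    sub-words of a word share that word's masking values.
    """
    n = len(word_ids)
    M = np.zeros((n, n))
    for i, wi in enumerate(word_ids):
        for j, wj in enumerate(word_ids):
            if wi is None or wj is None:
                continue  # rows/columns of [CLS]/[SEP] stay at 0 (visible)
            M[i, j] = word_mask[wi, wj]
    return M

# two words, the second split into two sub-words: [CLS] w0 w1a w1b [SEP]
word_mask = np.array([[0.0, -1e9], [-1e9, 0.0]])
M = expand_mask(word_mask, [None, 0, 1, 1, None])
```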
We use the uncased version of BERT for CoLA and SST-2, and the cased version for CoNLL-2003 and FCE. During training, we empirically select the threshold m from {3, 4}. The maximum sequence length is set to 128 for all tasks. We use Adam (Kingma and Ba, 2015) as our optimizer, and perform grid search over the learning rates {2e-5, 3e-5} and the numbers of epochs {3, 5, 10} for most tasks. In particular, we use smaller learning rates {5e-6, 1e-5, 2e-5} and train for more epochs {30, 60} on CoNLL-2003, but the average F1 of the best 5 runs still does not reach the results reported by Devlin et al. (2019). The batch size is fixed to 32 to reduce the search space, and we evaluate models every 500 training steps for all datasets. Furthermore, we experiment with the window-based attention on BERT, which allows each token to pay more attention to the neighboring tokens within a window of size 2k + 1. We vary k within {3, 4, 5}, and also incorporate the attention scores with the global attention scores.

A.3 Testing on Chinese Benchmarks
The ChnSentiCorp dataset is used for the sentiment classification task. We treat ChnSentiCorp as a single-sentence dataset although some examples include multiple sentences. The MSRA NER and CGED datasets are selected for named entity recognition and grammatical error detection in Chinese. Accuracy (Acc) is used as the metric for ChnSentiCorp; precision, recall and F1 are used as the metrics for MSRA NER and CGED. In particular, for a fair comparison with the results of iFLYTEK's single model (Fu et al., 2018), we construct the CGED test set from the CGED 2016 and 2017 test sets. We then report detection-level results computed by the official evaluation tool. Table 3 shows the main results on the Chinese datasets. All results are reported on the test sets. The proposed syntax-aware local attention outperforms the window-based attention and the basic BERT on all evaluated datasets. We attain 95.7 accuracy on ChnSentiCorp and 94.9 F1 on MSRA NER. Besides, BERT+SLA outperforms the state-of-the-art by a large margin on CGED.
Table 3: Experimental results on Chinese single-sentence benchmarks. We only show the results of base-size models because Google has not released the large-size model. Reported results are averaged over 5 runs.