Syntax-Enhanced Pre-trained Model

We study the problem of leveraging the syntactic structure of text to enhance pre-trained models such as BERT and RoBERTa. Existing methods utilize the syntax of text either in the pre-training stage or in the fine-tuning stage only, so they suffer from a discrepancy between the two stages. Such a discrepancy also makes high-quality human-annotated syntactic information necessary, which limits the application of existing methods to broader scenarios. To address this, we present a model that utilizes the syntax of text in both the pre-training and fine-tuning stages. Our model is based on the Transformer with a syntax-aware attention layer that considers the dependency tree of the text. We further introduce a new pre-training task of predicting the syntactic distance between tokens in the dependency tree. We evaluate the model on three downstream tasks: relation classification, entity typing, and question answering. Results show that our model achieves state-of-the-art performance on six public benchmark datasets. We have two major findings. First, we demonstrate that infusing automatically produced syntax of text improves pre-trained models. Second, global syntactic distances among tokens bring larger performance gains than local head relations between contiguous tokens.


Introduction
Pre-trained models such as BERT (Devlin et al., 2019), GPT (Radford et al., 2018), and RoBERTa have advanced the state of the art on various natural language processing tasks. The successful recipe is that a model is first pre-trained on a huge volume of unsupervised data with self-supervised objectives, and then fine-tuned on supervised data with the same data scheme. Dominant pre-trained models represent a text as a sequence of tokens. The merits are that such basic text representations are available from vast amounts of unsupervised data, and that models pre-trained and fine-tuned with the same paradigm usually achieve good accuracy in practice (Guu et al., 2020). However, an evident limitation of these methods is that the richer syntactic structure of text is ignored.

* Work done during an internship at Microsoft. † For questions, please contact D. Tang and Z. Xu. ‡ Corresponding author. 1 The source data is available at https://github.com/Hi-ZenanXu/Syntax-Enhanced Pre-trained Model.
In this paper, we seek to enhance pre-trained models with the syntax of text. Related studies attempt to inject syntax information either only in the fine-tuning stage (Nguyen et al., 2020; Sachan et al., 2020) or only in the pre-training stage (Wang et al., 2020), which results in a discrepancy between the two stages. When fusing syntax information only in the fine-tuning phase, Sachan et al. (2020) find that there is no performance boost unless high-quality human-annotated dependency parses are available. However, this requirement limits the application of the model to broader scenarios where human-annotated dependency information is not available.
To address this, we conduct a large-scale study on injecting automatically produced syntax of text in both the pre-training and fine-tuning stages. We construct a pre-training dataset by applying an off-the-shelf dependency parser to one billion sentences from Common Crawl news. With these data, we introduce a syntax-aware pre-training task, called dependency distance prediction, which predicts the syntactic distance between tokens in the dependency structure. Compared with the pre-training task of dependency head prediction (Wang et al., 2020), which only captures local syntactic relations among words, dependency distance prediction leverages the global syntax of the text. In addition, we develop a syntax-aware attention layer, which can be conveniently integrated into the Transformer (Vaswani et al., 2017) to allow tokens to selectively attend to contextual tokens based on their syntactic distance in the dependency structure.
We conduct experiments on entity typing, question answering, and relation classification over six benchmark datasets. Experimental results show that our method achieves state-of-the-art performance on all six datasets. Further analysis shows that our model can indicate how much syntactic information each downstream task requires, and that the newly introduced dependency distance prediction task captures the global syntax of the text and performs better than dependency head prediction. In addition, compared with injecting syntax information in either the pre-training or the fine-tuning stage alone, injecting it in both stages achieves the best performance.
In summary, the contribution of this paper is threefold.
(1) We demonstrate that infusing automatically produced dependency structures into pre-trained models yields superior performance on downstream tasks. (2) We propose a syntax-aware attention layer and a pre-training task for infusing syntactic information into the pre-trained model.
(3) We find that the newly introduced dependency distance prediction task performs better than the dependency head prediction task.

Related Work
Our work involves injecting syntax information into pre-trained models. First, we will review recent studies on analyzing the knowledge presented in pre-trained models, and then we will introduce the existing methods that enhance pre-trained models with syntax information.

Probing Pre-trained Models
With the huge success of pre-trained models (Devlin et al., 2019; Radford et al., 2018) on a wide range of NLP tasks, many works study what knowledge pre-trained models inherently capture. Here, we introduce recent works on probing linguistic information, factual knowledge, and symbolic reasoning ability in pre-trained models. In terms of linguistic information, Hewitt and Manning (2019) learn a linear transformation to predict the depth of each word in a syntax tree from its representation, which indicates that syntax information is implicitly embedded in the BERT model. However, Yaushian et al. (2019) find that the attention scores calculated by pre-trained models seem to be inconsistent with human intuitions about hierarchical structures, and indicate that certain complex syntax information may not be naturally embedded in BERT. In terms of probing factual knowledge, Petroni et al. (2019) find that pre-trained models are able to answer fact-filling cloze tests, which indicates that pre-trained models have memorized factual knowledge. However, Poerner et al. (2019) argue that BERT's outstanding performance on fact-filling cloze tests is partly due to reasoning over the surface form of entity names. In terms of symbolic reasoning, Talmor et al. (2020) test pre-trained models on eight reasoning tasks and find that the models completely fail on half of them. Although probing knowledge in pre-trained models is a worthwhile area, it is orthogonal to infusing knowledge into pre-trained models.

Integrating Syntax into Pre-trained Models
Recently, there has been growing interest in enhancing pre-trained models with the syntax of text. Existing methods attempt to inject syntax information only in the fine-tuning stage or only in the pre-training stage. We first introduce related works that inject syntax in the fine-tuning stage. Nguyen et al. (2020) incorporate tree-structured attention into the Transformer framework to help encode syntax information in the fine-tuning stage. Another line of work utilizes syntax to guide the Transformer model to ignore dispensable words in the fine-tuning stage, improving performance on machine reading comprehension. Sachan et al. (2020) investigate two distinct strategies for incorporating dependency structures in the fine-tuning stage and obtain state-of-the-art results on the semantic role labeling task. Meanwhile, Sachan et al. (2020) argue that the performance boost is mainly attributable to high-quality human-annotated syntax. However, human annotation is costly and difficult to extend to a wide range of applications. Syntax information can also be injected in the pre-training stage. Wang et al. (2020) introduce a head prediction task to inject syntax information into the pre-trained model, while syntax information is not provided during inference. Note that the head prediction task in Wang et al. (2020) only focuses on the local relationship between two related tokens, which prevents each token from perceiving information about the entire tree. Despite the success of utilizing syntax information, existing methods only consider the syntactic information of text in either the pre-training or the fine-tuning stage, so they suffer from a discrepancy between the two stages. To bridge this gap, we conduct a large-scale study on injecting automatically produced syntax information in both stages. Compared with the head prediction task (Wang et al., 2020), which captures local relationships, we introduce the dependency distance prediction task, which leverages global relationships to predict the distance between two given tokens.

Data Construction
In this paper, we adopt the dependency tree to express syntax information. Such a tree structure is concise and expresses only the information necessary for the parse (Jurafsky, 2000). Meanwhile, its head-dependent relations can be viewed as an approximation of the semantic relationships between tokens, which is directly useful for capturing semantic information. These advantages help our model make more effective use of syntax information. Another available type of syntax information is the constituency tree, which is used in Nguyen et al. (2020). However, as pointed out in Jurafsky (2000), the relationships between tokens in a dependency tree directly reflect important syntax information that is often buried in the more complex constituency tree; extracting relations among words from a constituency tree therefore requires extra techniques (Jurafsky, 2000; see https://web.stanford.edu/~jurafsky/slp3/).

The dependency tree takes linguistic words as one of its basic units. However, most pre-trained models take subwords (also known as word pieces) rather than entire linguistic words as the input unit, which requires extending the definition of the dependency tree to subwords. Following Wang et al. (2020), we add edges from the first subword of v to all subwords of u if there exists a relationship between linguistic word v and word u.
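As a concrete illustration, this edge-lifting rule can be sketched as follows. The function name, the head-array encoding, and the word-to-subword mapping are our own illustrative assumptions, not part of the original implementation:

```python
def extend_edges_to_subwords(heads, word_to_subwords):
    """Lift word-level dependency edges to subword level.

    heads[i] is the 0-based index of the head of word i, or -1 for the root.
    word_to_subwords[i] lists the subword token indices of word i.
    Following the scheme described above: if word v is the head of word u,
    we add edges from the FIRST subword of v to ALL subwords of u.
    """
    edges = []
    for u, v in enumerate(heads):  # v is the head of dependent u
        if v == -1:
            continue  # root has no incoming head relation
        first_sub_of_head = word_to_subwords[v][0]
        for sub in word_to_subwords[u]:
            edges.append((first_sub_of_head, sub))
    return edges
```

For example, if word 0 splits into subwords 0 and 1 and its head (word 1) is the single subword 2, the lifted edges are (2, 0) and (2, 1).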
Based on the above extended definition, we build a pre-training dataset from open-domain sources. Specifically, we randomly collect one billion sentences from the publicly released Common Crawl news dataset (Zellers et al., 2019), which contains English news articles crawled between December 2016 and March 2019. Considering its effectiveness and its ability to scale to multiple languages, we adopt the off-the-shelf Stanza parser to automatically generate syntax information for each sentence. The average sentence length is 25.34 tokens, and the average depth of the syntax trees is 5.15.

Methodology
In this section, we present the proposed Syntax-Enhanced PRE-trained Model (SEPREM). We first define the syntactic distance between two tokens. Based on this distance, we then introduce a syntax-aware attention layer to learn syntax-aware representations and a pre-training task that enables the model to capture global syntactic relations among tokens.

Syntax Distance over Syntactic Tree
Intuitively, the distance between two tokens in the syntactic tree may reflect the strength of their linguistic correlation: if two tokens are far from each other in the tree, their linguistic correlation is likely weak. Thus, we define the distance between two tokens over the dependency tree as their syntactic distance. Specifically, we define the distance between token v and token u as 1, i.e., d(v, u) = 1, if v is the head of u. If two tokens are not directly connected in the dependency graph, their distance is the sum of the distances between adjacent nodes on the path between them. If two tokens are disconnected in the graph, their distance is set to infinity. Taking the sentence "My dog is playing frisbee outside the room." in Fig 1 as an example, d(playing, frisbee) equals 1 since the token "playing" is the head of the token "frisbee".
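The syntactic distance defined above is simply the path length between two tokens in the (undirected) dependency tree. A minimal sketch, assuming the tree is given as a head array (an encoding of our own choosing):

```python
from collections import deque

def syntactic_distances(heads):
    """All-pairs syntactic distance over a dependency tree.

    heads[i] is the index of the head of token i (-1 for the root).
    Distances are path lengths in the undirected tree; tokens that are
    disconnected (if the input is a forest) get float('inf').
    """
    n = len(heads)
    adj = [[] for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:  # undirected edge between child and head
            adj[child].append(head)
            adj[head].append(child)
    dist = [[float('inf')] * n for _ in range(n)]
    for src in range(n):  # BFS from each token
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if dist[src][nxt] == float('inf'):
                    dist[src][nxt] = dist[src][cur] + 1
                    queue.append(nxt)
    return dist
```

For the fragment "My dog is playing frisbee" with heads [1, 3, 3, -1, 3], the sketch gives d(playing, frisbee) = 1 and d(My, frisbee) = 3, matching the definition.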

Syntax-Aware Transformer
We follow BERT (Devlin et al., 2019) and use the multi-layer bidirectional Transformer (Vaswani et al., 2017) as the model backbone. The model takes a sequence X as input and applies N transformer layers to produce contextual representations:

H^n = transformer_n((1 − α)H^{n−1} + αĤ^{n−1}),   (1)

where n ∈ [1, N] denotes the n-th layer of the model, Ĥ is the syntax-aware representation which will be described in Section 4.3, H^0 is the embedding of the input sequence X, and α is a learnable variable.
However, introducing the syntax-aware representation Ĥ in Equation 1 changes the architecture of the Transformer, invalidating the original weights of pre-trained models such as BERT and RoBERTa. We therefore introduce a learnable importance score α that controls the proportion in which the contextual and syntax-aware representations are integrated. When α equals zero, the syntax-aware representation is entirely excluded and the model is architecturally identical to the vanilla Transformer. Therefore, we initialize the parameter α to a small but nonzero value, which helps fuse syntactic information into existing pre-trained models. We discuss the importance score α in detail in Section 5.6.
Each transformer layer transformer_n contains an architecturally identical transformer block, composed of multi-headed self-attention MultiAttn (Vaswani et al., 2017) followed by a feed-forward layer FFN. Formally, the output H^n of the transformer block is computed as:

G^n = LN(MultiAttn(H̃^{n−1}) + H̃^{n−1}),
H^n = LN(FFN(G^n) + G^n),   (2)

where the input H̃^{n−1} is (1 − α)H^{n−1} + αĤ^{n−1} and LN denotes layer normalization.
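The dataflow of one such layer can be sketched as follows. This is a minimal NumPy sketch, not the actual implementation: the MultiAttn and FFN sublayers are passed in as stand-in callables, and α is treated as a plain scalar rather than a learned parameter:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-token layer normalization over the hidden dimension."""
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def syntax_aware_block(H_prev, H_hat_prev, alpha, attn, ffn):
    """One syntax-aware transformer layer.

    H_prev     : contextual representation from the previous layer
    H_hat_prev : syntax-aware representation from the previous layer
    alpha      : importance score controlling the fusion proportion
    attn, ffn  : stand-ins for the MultiAttn and FFN sublayers
    """
    H_in = (1.0 - alpha) * H_prev + alpha * H_hat_prev  # fuse the two representations
    G = layer_norm(attn(H_in) + H_in)                   # self-attention + residual
    return layer_norm(ffn(G) + G)                       # feed-forward + residual
```

With alpha = 0 the syntax-aware input is ignored entirely and the block reduces to a vanilla residual transformer layer, which is why the original pre-trained weights remain valid at initialization.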

Syntax-aware Attention Layer
In this section, we introduce how to obtain the syntax-aware representation Ĥ used in the syntax-aware transformer.
Tree Structure Encoding We adopt a distance matrix D to encode the tree structure. The advantages of the distance matrix D are that it preserves the hierarchical syntactic structure of the text and directly reflects the distance between two given tokens. Meanwhile, its uniqueness guarantees a one-to-one mapping with the tree structure. Given a dependency tree, the element D_{i,j} of the distance matrix D in the i-th row and j-th column is defined as:

D_{i,j} = d(v_i, v_j),   (3)

where v_i and v_j are tokens in the dependency tree. Based on the intuition that distance is inversely proportional to importance, we normalize the matrix D row-wise to obtain the normalized correlation strength matrix D̃:

D̃_{i,j} = exp(−D_{i,j}) / Σ_k exp(−D_{i,k}).   (4)

Syntax-aware Representation Given the tree structure representation D̃ and the contextual representation H^n, we fuse the tree structure into the contextual representation as:

Ĥ^n = σ(D̃ H^n W^1_n) W^2_n,   (5)

where σ is the activation function, and W^1_n, W^2_n ∈ R^{d_h×d_h} are model parameters. We can see that D̃H^n allows each token to aggregate information from other tokens along the tree structure: the closer two tokens are in the dependency tree, the larger the attention weight, and thus the more information is propagated between them, and vice versa.
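A minimal NumPy sketch of this layer follows. The row-wise softmax over negative distances is one normalization consistent with the inverse-distance intuition, and tanh is a placeholder activation; the paper's exact choices may differ:

```python
import numpy as np

def normalized_strength(D):
    """Row-normalize a syntactic-distance matrix into correlation strengths.

    Closer tokens receive larger weights, and disconnected tokens
    (distance = inf) receive weight 0, since exp(-inf) = 0.
    """
    E = np.exp(-D)
    return E / E.sum(axis=1, keepdims=True)

def syntax_aware_repr(D_norm, H, W1, W2, sigma=np.tanh):
    """Aggregate along the tree, then project: sigma(D_norm @ H @ W1) @ W2."""
    return sigma(D_norm @ H @ W1) @ W2
```

Multiplying H by the row-stochastic matrix D_norm is what lets each token pool information from its syntactic neighborhood, weighted by tree proximity.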

Syntax-aware Pre-training Task
To better understand sentences, it is beneficial for the model to be aware of the underlying syntax. To this end, a new pre-training task, named the dependency distance prediction (DP) task, is designed to enhance the model's ability to capture global syntactic relations among tokens. Specifically, we first randomly select some elements of the distance matrix D, say D_{i,j}, as prediction targets. The representations of tokens i and j from SEPREM are then concatenated and fed into a linear classifier, which outputs probabilities over the different distances. In all of our experiments, 15% of the distances are selected at random. Similar to BERT (Devlin et al., 2019) and RoBERTa, we conduct the following operations to boost robustness: a selected distance in matrix D is masked with 80% probability, replaced by a random integer with 10% probability, and kept unchanged with the remaining 10% probability.
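The corruption scheme above can be sketched as follows; the sentinel id, the distance cap, and the function name are illustrative assumptions rather than details from the paper:

```python
import random

MASK_DISTANCE = -1  # hypothetical sentinel id for a masked distance
MAX_DISTANCE = 16   # hypothetical cap on random replacement distances

def mask_distances(D, select_prob=0.15, seed=None):
    """Build DP-task inputs and labels from a distance matrix D (list of lists).

    15% of entries are selected as prediction targets; a selected entry is
    masked with 80% probability, replaced by a random distance with 10%
    probability, and kept unchanged otherwise (mirroring BERT-style MLM).
    Returns the corrupted matrix and a dict {(i, j): original_distance}.
    """
    rng = random.Random(seed)
    corrupted = [row[:] for row in D]
    labels = {}
    for i, row in enumerate(D):
        for j, d in enumerate(row):
            if rng.random() >= select_prob:
                continue  # not selected as a target
            labels[(i, j)] = d
            r = rng.random()
            if r < 0.8:
                corrupted[i][j] = MASK_DISTANCE       # 80%: mask
            elif r < 0.9:
                corrupted[i][j] = rng.randint(1, MAX_DISTANCE)  # 10%: random
            # else: 10% keep the true distance
    return corrupted, labels
```

The labels dictionary records the original distances so the linear classifier can be supervised at exactly the selected positions.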
During pre-training, in addition to the DP task, we also use the dependency head prediction (HP) task from Wang et al. (2020), which captures local head relations among words, and the dynamic masked language model (MLM) task, which captures contextual information. The final pre-training loss is the sum of the training losses of the three tasks, i.e., L = L_DP + L_HP + L_MLM.

Implementation Details
The implementation of SEPREM is based on HuggingFace's Transformers (Wolf et al., 2019). To accelerate training, we initialize parameters from the RoBERTa model released by HuggingFace, which contains 24 layers with 1024 hidden states in each layer. Our model has 464M parameters. We pre-train the model on 16 NVIDIA V100 32GB GPUs for approximately two weeks. The batch size is set to 2048, and the total number of steps is 500,000, of which 30,000 are warm-up steps.
In both the pre-training and fine-tuning stages, our model takes the syntax of the text as additional input, which is pre-processed in advance. Specifically, we obtain the dependency tree of each sentence via Stanza and then generate the normalized distance matrix.

Experiments
In this section, we evaluate the proposed SEPREM on six benchmark datasets over three downstream tasks, i.e., entity typing, question answering and relation classification.

Entity Typing
The entity typing task requires the model to predict the type of a given entity based on its context. Two fine-grained public datasets, Open Entity (Choi et al., 2018) and FIGER (Ling et al., 2015), are employed to evaluate our model; their statistics are shown in Table 1. During fine-tuning, "@" is added before and after the given entity, and the representation of the first special token "@" is adopted to predict the type of the entity. Experimental Results As shown in Table 2, SEPREM outperforms all baselines on both entity typing datasets. On the Open Entity dataset, by utilizing the syntax of text, SEPREM achieves an improvement of 3.6% in micro-F1 score compared with the RoBERTa-large (continue training) model. The result demonstrates that the proposed syntax-aware pre-training tasks and syntax-aware attention layer help capture the syntax of text, which is beneficial for predicting types more accurately. On the FIGER dataset, which contains more entity-type labels, SEPREM still brings improvements in strict accuracy, macro-F1, and micro-F1. This demonstrates the effectiveness of leveraging syntactic information on tasks with more fine-grained labels.
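The entity-marking step can be sketched as a small helper; the function name and half-open span convention are our own illustrative assumptions:

```python
def mark_entity(tokens, start, end, marker="@"):
    """Insert the special marker "@" before and after the entity span [start, end).

    The representation of the first inserted "@" token is then fed to the
    type classifier during fine-tuning.
    """
    return tokens[:start] + [marker] + tokens[start:end] + [marker] + tokens[end:]
```

For example, marking the span covering "Bill Gates" in "Bill Gates founded Microsoft" yields "@ Bill Gates @ founded Microsoft".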
Specifically, compared with the K-Adapter model, SEPREM brings an improvement of 2.6% F1 on the Open Entity dataset. It is worth noting that, like SEPREM, the K-Adapter model injects syntactic information during the pre-training stage. This improvement therefore indicates that injecting syntactic information in both the pre-training and fine-tuning stages makes fuller use of the syntax of the text, thereby benefiting downstream tasks.

Question Answering
We use the open-domain question answering (QA) task and the commonsense QA task to evaluate the proposed model. Open-domain QA requires models to answer open-domain questions with the help of external resources such as collected documents and webpages. We use SearchQA (Dunn et al., 2017) and Quasar-T (Dhingra et al., 2017) for this task, and adopt ExactMatch (EM) and loose F1 scores as evaluation metrics. In this task, we first retrieve paragraphs related to the question from the external materials via an information retrieval system, and then a reading comprehension technique is adopted to extract possible answers from the retrieved paragraphs. Following previous work (Lin et al., 2018), we use the retrieved paragraphs provided by Wang et al. (2017b) for the two datasets. For fair comparison, we follow Wang et al. (2020) and use [<sep>, question, </sep>, paragraph, </sep>] as the input, where <sep> is a special token placed in front of the two segments and </sep> is a special symbol that separates the two kinds of data. We formulate the task as classification over token positions when fine-tuning the model, and use two linear layers over the last hidden features to predict the start and end positions of the answer span. Commonsense QA aims to answer questions that require commonsense knowledge not explicitly expressed in the question. We use the public CosmosQA dataset (Huang et al., 2019) for this task, with accuracy as the evaluation metric. The statistics of the above three datasets are shown in Table 3. In CosmosQA, each question has four candidate answers; we concatenate the question with each answer separately as [<sep>, context, </sep>, paragraph, </sep>] for input. The representation of the first token is adopted to calculate a score for each answer, and the answer with the highest score is regarded as the predicted answer.

Experimental Results
The results of the open-domain QA task are shown in Table 4. The proposed SEPREM model brings significant gains of 3.1% and 8.4% in F1 score compared with the RoBERTa-large (continue training) model. This may be partly attributed to the fact that the QA task requires reading comprehension ability (Wang et al., 2020), and the introduced syntax information can guide the model away from concentrating on dispensable words, improving its reading comprehension capacity. Meanwhile, SEPREM achieves state-of-the-art results on the CosmosQA dataset, which demonstrates the effectiveness of the proposed model. It can also be seen that the performance gains on CosmosQA are not as substantial as those on the open-domain QA tasks. We speculate that CosmosQA requires the capacity for contextual commonsense reasoning, and the lack of explicit injection of commonsense knowledge into SEPREM limits its improvement.

Relation Classification
The relation classification task aims to predict the relation between two given entities in a sentence. We use the large-scale relation classification dataset TACRED (Zhang et al., 2017) for this task, and adopt micro precision, recall, and F1 scores as evaluation metrics. The statistics of TACRED are shown in Table 1. Following Wang et al. (2020), we add the special tokens "@" and "#" before and after the first and second entity, respectively. The representations of the leading "@" and "#" tokens are then concatenated to perform relation classification. We compare against RoBERTa-large, RoBERTa-large (continue training), and K-Adapter (Wang et al., 2020) for a comprehensive comparison. Table 5 shows the performance of the baseline models and the proposed SEPREM on TACRED. The proposed syntax-aware pre-training tasks and syntax-aware attention mechanism consistently bring gains on the relation classification task, and SEPREM outperforms the baseline models overall. This further confirms the generalization capacity of our proposed model. It can also be seen that, compared with K-Adapter, the performance gains of SEPREM on TACRED are not as substantial as those on Open Entity. This may be partly because K-Adapter also injects factual knowledge into the model, which may help in identifying relationships.

Ablation Study
To investigate the impact of the various components of SEPREM, experiments are conducted on the entity typing, question answering, and relation classification tasks using the corresponding benchmarks, i.e., Open Entity, CosmosQA, and TACRED, respectively. Note that, owing to the time cost of training the models on the entire data, we randomly sample 10 million sentences from the whole dataset to build a smaller dataset for this ablation study.
The results are illustrated in Figure 2, in which we eliminate the two syntax-aware pre-training tasks (i.e., HP and DP) and the syntax-aware attention layer to evaluate their effectiveness. Without the syntax-aware attention layer, an immediate performance degradation is observed, indicating that leveraging the syntax-aware attention layer to learn syntax-aware representations benefits SEPREM. Another observation is that, in all three experiments, eliminating the DP pre-training task leads to worse results. In other words, compared with the existing head prediction task, the proposed dependency distance prediction task is more advantageous for the various downstream tasks. This may be because leveraging global syntactic correlations is more beneficial than considering only local correlations. Moreover, significant performance gains are obtained by simultaneously exploiting the two pre-training tasks and the syntax-aware attention layer, which further confirms the superiority of our pre-training architecture.

Case Study
We conduct a case study to empirically explore the effectiveness of utilizing syntax information. In the relation classification task, we need to predict the relationship between two tokens in a sentence. As the three examples in Figure 3 show, SEPREM captures syntax information through the dependency tree and makes correct predictions, whereas RoBERTa, without syntax information, fails to recognize the correct relationships. To give further insight into how syntax information affects prediction, we analyze case 1 in detail. The extracted dependency tree captures the close correlation between "grew" and "Jersey", which indicates that "New Jersey" is more likely to be a place of residence. These results reflect that our model can better understand the global syntactic relations among tokens by utilizing the dependency tree.

Analysis of Importance Score α
Under the syntax-enhanced pre-training framework introduced here, the contextual representation (H^n) and the syntax-aware representation (Ĥ^n) are jointly optimized to abstract semantic information from sentences. An interesting question is how much syntactic information should be leveraged by our pre-trained model. We therefore investigate the effect of the importance score α on the aforementioned six downstream tasks; the learned weights α after fine-tuning SEPREM are shown in Table 6. We observe that the values of α lie between 13% and 15% on the six datasets, which indicates that these downstream tasks require syntactic information to obtain the best performance and once again confirms the effectiveness of utilizing syntax information.
To gain further insight into the effect of the importance score α, we conduct experiments on SEPREM w/o α, which eliminates α in Equation 1 and integrates the syntax-aware and contextual representations equally, i.e., H^n = transformer_n(H^{n−1} + Ĥ^{n−1}). The pre-training settings of the SEPREM w/o α model are the same as those of the proposed SEPREM model. As seen in Table 6, performance drops by 1%–3% on the six datasets when α is excluded. This observation indicates the necessity of introducing α to better integrate the syntax-aware and contextual representations.

Conclusion
In this paper, we present SEPREM, which leverages syntax information to enhance pre-trained models. To inject syntactic information, we introduce a syntax-aware attention layer and a newly designed pre-training task. Experimental results show that our method achieves state-of-the-art performance on six datasets. Further analysis shows that the proposed dependency distance prediction task performs better than the dependency head prediction task.