On the evolution of syntactic information encoded by BERT’s contextualized representations

The adaptation of pretrained language models to solve supervised tasks has become a baseline in NLP, and many recent works have focused on studying how linguistic information is encoded in the pretrained sentence representations. Among other information, it has been shown that entire syntax trees are implicitly embedded in the geometry of such models. As these models are often fine-tuned, it becomes increasingly important to understand how the encoded knowledge evolves along the fine-tuning. In this paper, we analyze the evolution of the embedded syntax trees along the fine-tuning process of BERT for six different tasks, covering all levels of the linguistic structure. Experimental results show that the encoded syntactic information is forgotten (PoS tagging), reinforced (dependency and constituency parsing) or preserved (semantics-related tasks) in different ways along the fine-tuning process depending on the task.


Introduction
Adapting unsupervised pretrained language models (LMs) to solve supervised tasks has become a widespread practice in NLP, with models such as ELMo (Peters et al., 2018) and, most notably, BERT (Devlin et al., 2019), achieving state-of-the-art results in many well-known Natural Language Understanding benchmarks like GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2018). Several studies investigate what the LMs learn, how and where the learned knowledge is represented, and what the best methods to improve it are; cf., e.g., Rogers et al. (2020). There is evidence that, among other information (e.g., PoS, syntactic chunks and roles (Tenney et al., 2019b; Lin et al., 2019; Belinkov et al., 2017), morphology in general (Peters et al., 2018), or sentence length (Adi et al., 2016)), the vector geometry of deep models such as BERT implicitly embeds entire syntax trees (Hewitt and Manning, 2019). However, rather little is understood about how these representations change when fine-tuned to solve downstream tasks (Peters et al., 2019).
In this work, we aim to understand how syntax trees implicitly embedded in the geometry of deep models evolve along the fine-tuning process of BERT on different supervised tasks, and shed some light on the importance of the syntactic information for those tasks. Intuitively, we expect morpho-syntactic tasks to clearly reinforce the encoded syntactic information, while tasks that are not explicitly syntactic in nature should maintain it if they benefit from syntax (Kuncoro et al., 2020) and lose it if they do not. In order to cover the three main levels of linguistic description (morphology, syntax and semantics), we select six different tasks: PoS tagging, constituency parsing, syntactic dependency parsing, semantic role labeling (SRL), question answering (QA) and paraphrase identification. The first three inherently deal with (morpho-)syntactic information, while the latter three, which traditionally draw upon the output of syntactic parsing (Carreras and Màrquez, 2005; Björkelund et al., 2010; Strubell et al., 2018; Wang et al., 2019, inter alia), deal with higher-level, semantic information. Almost all of our experiments are on English corpora; one is on multilingual dependency parsing.

Related work
BERT has become the default baseline in NLP, and consequently, numerous studies analyze its linguistic capabilities in general (Rogers et al., 2020; Henderson, 2020), and its syntactic capabilities in particular (Linzen and Baroni, 2020). Even if syntactic information is distributed across all layers (Durrani et al., 2020), BERT captures most phrase-level information in the lower layers, followed by surface features, syntactic features and semantic features in the intermediate and top layers (Jawahar et al., 2019; Tenney et al., 2019a; Hewitt and Manning, 2019). The syntactic structure captured by BERT adheres to that of the Universal Dependencies (Kulmizev et al., 2020); different syntactic and semantic relations are captured by self-attention patterns (Kovaleva et al., 2019; Limisiewicz et al., 2020; Ravishankar et al., 2021), and it has been shown that full dependency trees can be decoded from single attention heads (Ravishankar et al., 2021). BERT performs remarkably well on subject-verb agreement (Goldberg, 2019), and is able to do full parsing relying only on pretraining architectures and no decoding (Vilares et al., 2020), surpassing existing sequence labeling parsers on the Penn Treebank dataset (De Marneffe et al., 2006) and on the end-to-end Universal Dependencies Corpus for English (Silveira et al., 2014). It can generally also distinguish good from bad completions and robustly retrieves noun hypernyms, but shows insensitivity to the contextual impacts of negation (Ettinger, 2020).
Different supervised probing models have been used to test for the presence of a wide range of linguistic phenomena in the BERT model (Conneau et al., 2018; Liu et al., 2019; Tenney et al., 2019b; Voita and Titov, 2020; Elazar et al., 2020). The structural probe of Hewitt and Manning (2019) shows that entire syntax trees are embedded implicitly in BERT's vector geometry. Extending their work, Chi et al. (2020) show that multilingual BERT recovers syntactic tree distances in languages other than English and learns representations of syntactic dependency labels.
Regarding how fine-tuning affects the representations of BERT, Gauthier and Levy (2019) found a significant divergence between the final representations of models fine-tuned on different tasks when using the structural probe of Hewitt and Manning (2019), while Merchant et al. (2020) concluded that fine-tuning is conservative and does not lead to catastrophic forgetting of linguistic phenomena, a conclusion that our experiments do not confirm. Instead, we find that the encoded syntactic information is forgotten, reinforced or preserved differently along the fine-tuning process depending on the task.

Experimental setup
We study the evolution of the syntactic structures discovered during pretraining along the fine-tuning of BERT-base (cased) (Devlin et al., 2019) on six different tasks, drawing upon the structural probe of Hewitt and Manning (2019). We fine-tune the whole model on each task outlined below for 3 epochs, with a learning rate of 5e-5, saving 10 evenly spaced checkpoints per epoch. The output of the last layer is used as input representation for the classification components of each task. To mitigate the variance in performance induced by weight initialization and training data order (Dodge et al., 2020), we repeat this process 5 times per task with different random seeds and average results.
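The checkpointing and averaging scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how the evenly spaced checkpoints are placed, so the rounding rule, the function names `checkpoint_steps` and `mean_and_spread`, and the parameter `steps_per_epoch` are all hypothetical, and the use of the standard deviation as the measure of variability across runs is likewise an assumption.

```python
from statistics import mean, pstdev


def checkpoint_steps(steps_per_epoch, epochs=3, per_epoch=10):
    """Return the global training steps at which a checkpoint is saved.

    Assumption: checkpoints are placed at k/per_epoch of each epoch,
    rounded to the nearest whole optimisation step.
    """
    steps = []
    for epoch in range(epochs):
        for k in range(1, per_epoch + 1):
            local_step = round(k * steps_per_epoch / per_epoch)
            steps.append(epoch * steps_per_epoch + local_step)
    return steps


def mean_and_spread(runs):
    """Aggregate one metric over several seeds.

    `runs` holds one list of per-checkpoint scores per random seed.
    Returns the per-checkpoint mean and population standard deviation.
    """
    per_checkpoint = list(zip(*runs))  # group scores by checkpoint index
    means = [mean(scores) for scores in per_checkpoint]
    spreads = [pstdev(scores) for scores in per_checkpoint]
    return means, spreads
```

With 3 epochs and 10 checkpoints per epoch, each run yields 30 checkpoints, and each point of a curve is then the average over the 5 seeds.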
PoS tagging. We fine-tune BERT with a linear layer on top of the hidden-states output for token classification. Dataset: Universal Dependencies Corpus for English (UD 2.5 EN EWT; Silveira et al., 2014).
Constituency parsing. Following Vilares et al. (2020), we cast constituency parsing as a sequence labeling problem, and use a single feed-forward layer on top of BERT to directly map word vectors to labels that encode a linearized tree. Dataset: Penn Treebank (Marcus et al., 1993).
Dependency parsing. We fine-tune a Deep Biaffine neural dependency parser (Dozat and Manning, 2016) on three different datasets: i) UD 2.5 English EWT (Silveira et al., 2014); ii) a multilingual benchmark generated by concatenating the UD 2.5 standard data splits for German, English, Spanish, French, Italian, Portuguese, and Swedish (Zeman et al., 2019), with gold PoS tags; iii) PTB SD 3.3.0 (De Marneffe et al., 2006).
Semantic role labeling. Following Shi and Lin (2019), we decompose the task into i) predicate sense disambiguation and argument identification, and ii) classification. Both subtasks are cast as sequence labeling, feeding the contextual representations into a one-hidden-layer MLP for the first, and a one-layer BiLSTM followed by a one-hidden-layer MLP for the latter. Dataset: OntoNotes corpus (Weischedel et al., 2013).
Question answering. We fine-tune BERT with a linear layer on top of the hidden-states output to compute span start logits and span end logits. Dataset: Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2018).
Paraphrase identification. We fine-tune BERT with a linear layer on top of the pooled sentence representation. Dataset: Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005).

Evolution of syntax trees
Hewitt and Manning (2019)'s structural probe evaluates how well syntax trees are embedded in a linear transformation of the network representation space, performing two different evaluations: i) Tree distance evaluation, in which squared L2 distance encodes the distance between words in the parse tree, and ii) Tree depth evaluation, in which squared L2 norm encodes the depth of the parse tree.
Using their probe, Hewitt and Manning show that the 7th layer of BERT-base is the layer that encodes the most syntactic information. Therefore, to analyze the evolution of the encoded syntax trees, we train the probes on the 7th layer of the different checkpoint models generated along the fine-tuning process of each task.
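The two quantities that the structural probe reads off the representation space can be written compactly: the squared L2 distance between two linearly transformed word vectors, and the squared L2 norm of one transformed word vector. The sketch below is a plain-Python illustration, not the probe implementation used in the experiments. The matrix `B` stands for the linear transformation, which the probe learns from gold parse trees; the function names and the toy values in the usage note are hypothetical.

```python
def apply_linear(B, h):
    """Multiply matrix B (a list of rows) by vector h."""
    return [sum(b * x for b, x in zip(row, h)) for row in B]


def probe_distance(B, h_i, h_j):
    """Squared L2 distance between two word vectors after applying B.

    In the tree distance evaluation this value is meant to approximate
    the number of edges between the two words in the parse tree.
    """
    diff = [a - b for a, b in zip(h_i, h_j)]
    return sum(v * v for v in apply_linear(B, diff))


def probe_depth(B, h_i):
    """Squared L2 norm of a word vector after applying B.

    In the tree depth evaluation this value is meant to approximate
    the depth of the word in the parse tree.
    """
    return sum(v * v for v in apply_linear(B, h_i))
```

For instance, with the toy matrix `B = [[1, 0], [0, 2]]`, calling `probe_distance(B, [3, 1], [1, 1])` transforms the difference `[2, 0]` and sums its squared components.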

Tree distance evaluation
The probe evaluates how well the predicted distances between all pairs of words in a model reconstruct gold parse trees by computing the Undirected Unlabeled Attachment Score (UUAS). It also computes the Spearman correlation between true and predicted distances for each word in each sentence, averaging across all sentences with lengths between 5 and 50 (henceforth referred to as DSpr).
Morpho-syntactic tasks. As shown in Figures 1 and 2, both metrics follow a similar behaviour (shades represent the variability across the 5 model runs). PoS tagging shows an important loss of performance all along the fine-tuning process, especially noticeable for UUAS (Figure 1), suggesting that distance-related syntactic information is of less relevance to PoS tagging than could be intuitively assumed. As many words have a clear preference towards a specific PoS, especially in English, and most of the ambiguous cases can be resolved using information in the close vicinity (e.g., a simple 3-gram sequence tagger is able to achieve a very high accuracy (Manning, 2011)), syntactic structure information may not be necessary and, therefore, the model does not preserve it. This observation is aligned with Pimentel et al. (2020), who found that PoS tagging is not an ideal task for contemplating the syntax contained in contextual word embeddings. The loss is less pronounced on depth-related metrics, maybe because the root of the sentence usually corresponds to the verb, which may also help in identifying the PoS of surrounding words. Constituency parsing and dependency parsing share a very similar tendency, with a big improvement in the first fine-tuning steps preserved along the rest of the process. As both tasks heavily rely on syntactic information, this improvement intuitively makes sense. Dependency parsing fine-tuned on the Penn Treebank (PTB) shows even higher results since the probe is trained on the same dataset. Interestingly, the probe performs similarly even if the parsing task is modeled as a sequence labeling problem (as in constituency parsing), suggesting that the structure of syntax trees emerges in such models even when no tree is explicitly involved in the task. The initial drop observed for PoS tagging and monolingual dependency parsing with UD, both trained on UD EN EWT, may be related to the size of the dataset, since UD EN EWT is significantly smaller than the other datasets and therefore the models see fewer examples per checkpoint.
Semantics-related tasks. As shown in Figures 1 and 2, both metrics follow different behaviours (again, shades represent the variability across the 5 model runs). Paraphrase identification shows a small but constant UUAS loss along the fine-tuning, while QA shows a slightly steeper loss trend. Initially, SRL loses around 12 points, suggesting that it discards some syntactic information right at the beginning, and follows a similar downward trend afterwards. Those three tasks show a stable performance along the fine-tuning for the DSpr metric, which implies that even if there is a loss in UUAS information it does not impact the distance ordering.
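The UUAS metric introduced at the start of this subsection can be illustrated with a small sketch. It assumes, following the usual formulation of the structural probe, that the predicted tree is the minimum spanning tree over the predicted pairwise distances, and that UUAS is the fraction of gold edges recovered by that tree. The function names and the toy matrix in the usage note are hypothetical, and preprocessing details such as punctuation handling are left out.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm over a symmetric distance matrix (list of lists).

    Returns the tree as a set of undirected edges (frozensets of indices).
    """
    n = len(dist)
    in_tree = {0}
    edges = set()
    while len(in_tree) < n:
        best = None
        # Pick the cheapest edge that connects the tree to a new word.
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                if best is None or dist[i][j] < dist[best[0]][best[1]]:
                    best = (i, j)
        edges.add(frozenset(best))
        in_tree.add(best[1])
    return edges


def uuas(pred_dist, gold_edges):
    """Fraction of gold (undirected, unlabeled) edges found in the predicted tree."""
    predicted = minimum_spanning_tree(pred_dist)
    gold = {frozenset(edge) for edge in gold_edges}
    return len(predicted & gold) / len(gold)
```

As a usage example, for a four-word sentence whose gold tree is the chain `[(0, 1), (1, 2), (2, 3)]`, a predicted distance matrix that places word 3 closer to word 1 than to word 2 recovers only two of the three gold edges.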

Tree depth evaluation
The probe evaluates models with respect to their ability to recreate the order of words specified by their depth in the parse tree, assessing their ability to identify the root of the sentence as the least deep word (Root %) and computing the Spearman correlation between the predicted and the true depth ordering, averaging across all sentences with lengths between 5 and 50 (henceforth referred to as NSpr).
Morpho-syntactic tasks. Again, both metrics follow a similar behaviour, as shown in Figures 3 and 4. PoS tagging shows a sustained loss of performance, though softer than the loss observed for the distance metrics. This loss is slightly less pronounced for Root % than for NSpr, suggesting that while depth-related syntactic information may be of less relevance to PoS tagging than it is to the other morpho-syntactic tasks, identifying the root of the sentence may be important, as the root of the sentence is likely to bear one of the ambiguous tags and therefore identifying it may help to select the correct label. Constituency parsing and dependency parsing share a similar tendency, with a big improvement in the first steps preserved along the rest of the fine-tuning process, reinforcing the intuition previously introduced in Section 4.1 about the structure of syntax trees emerging in models even when no tree is explicitly involved in the task. Again, an initial drop can be observed for PoS tagging and monolingual dependency parsing with UD, most probably related to the smaller size of the UD EN EWT dataset used in both tasks.
Semantics-related tasks. Both metrics follow a similar behaviour, as shown in Figures 3 and 4, with all tasks following a soft but sustained loss of performance until the end of the fine-tuning process, especially noticeable for Root %.
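The two depth metrics defined at the start of this subsection can be sketched for a single sentence as follows. This is an illustrative plain-Python version under stated assumptions: the root is taken to be the word with the smallest predicted depth, Spearman correlation is computed as the Pearson correlation of average ranks (so tied gold depths share a rank), and the function names are hypothetical. Root % and NSpr are then obtained by averaging these per-sentence values over the evaluation set.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks.

    Not defined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def root_correct(pred_depths, gold_depths):
    """True if the least deep predicted word is the gold root."""
    return pred_depths.index(min(pred_depths)) == gold_depths.index(min(gold_depths))
```

Keeping the two measures separate mirrors the text above: a model can still identify the root correctly while the overall depth ordering degrades, or the other way round.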

Conclusions
We show that fine-tuning is not always a conservative process. Rather, the syntactic information initially encoded in the models is forgotten (PoS tagging), reinforced (parsing) or preserved (semantics-related tasks) in different (sometimes unexpected) ways along the fine-tuning, depending on the task. We expected morpho-syntactic tasks to clearly reinforce syntactic information. However, PoS tagging forgets it, which, on the other hand, can also be justified linguistically (cf. Section 4.1). In contrast, tasks closer to semantics mostly preserve the syntactic knowledge initially encoded. This interesting observation reinforces recent findings that models benefit from explicitly injecting syntactic information for such tasks (Singh Sachan et al., 2020).
Overall, we observed that morpho-syntactic tasks undergo substantial changes in the initial phases, while semantics-related tasks maintain a more stable trend, highlighting the importance of syntactic information in tasks that are not explicitly syntactic in nature (Kuncoro et al., 2020). These observations lead to some interesting insights, but also to further questions; for instance: Can we find a specific set of probes covering different linguistic phenomena to be used as a pretraining stopping criterion? Would this lead to an improvement in the encoding of the linguistic information in pretrained models?

A Target tasks performance evolution
To complement the results shown in the main paper, we include here the performance curves of the target tasks for which the models are fine-tuned, alongside the curves of the four structural probe metrics (UUAS, DSpr, Root % and NSpr) and a brief discussion of the results. This facilitates the comparison between the evolution of the encoded syntax tree information and the target task performance. Figure 5 shows the accuracy evolution of PoS tagging. Figures 6 to 9 cover the dependency parsing experiments (PTB SD, EN UD EWT and multilingual) and constituency parsing. Figure 10 shows the F1 score evolution of Question Answering. Figure 11 shows the F1 score and accuracy evolution of Paraphrase identification. Finally, Figure 12 shows the F1 score evolution of Semantic Role Labeling.
PoS tagging reaches an accuracy of 0.95 in only two checkpoints and ends at 0.97 on the last checkpoint (Figure 5a). All four probing metrics lose performance all along the fine-tuning process, especially noticeably for UUAS (Figure 5b) and Root % (Figure 5d), suggesting that syntactic information is of less relevance to PoS tagging than could be intuitively assumed. The loss is less pronounced on depth-related metrics, maybe because the root of the sentence usually corresponds to the verb, which may also help in identifying the PoS of surrounding words.
Dependency parsing with PTB SD shows a steep learning curve (Figure 6a), reaching a performance of 0.90 LAS on the third checkpoint, up to a final 0.94. All four probing metrics show an important improvement in the first fine-tuning step (Figures 6b, 6c, 6d and 6e), which is preserved along the rest of the process. As the task heavily relies on syntactic information, this improvement intuitively makes sense. Compared to the results of the other dependency parsing experiments, this one shows bigger improvements because the probe is trained on the same dataset.
Dependency parsing with EN UD EWT shows a shallower learning curve than the other experiments (Figure 7a), as the dataset is significantly smaller than the multilingual and PTB ones and therefore the models see fewer examples per checkpoint, ending up with a high performance of 0.9. After an initial drop (probably due to the dataset size, as mentioned before), the probing metrics show a big improvement in the first fine-tuning steps, preserved along the rest of the process (Figures 7b, 7c, 7d and 7e). As the task heavily relies on syntactic information, this improvement intuitively makes sense.
Multilingual dependency parsing shows a steeper learning curve than dependency parsing with EN UD EWT, as it is trained with a larger dataset (Figure 8a), reaching a performance of 0.87 in LAS. All four probing metrics show a big improvement in the first fine-tuning step, preserved along the rest of the process (Figures 8b, 8c, 8d and 8e). As the task heavily relies on syntactic information, this improvement intuitively makes sense.
Constituency parsing fine-tuning follows a steep curve, quickly reaching an accuracy of 0.87 that is further improved to 0.9 in the last checkpoint (Figure 9a). All four probing metrics show a big improvement in the first fine-tuning steps, preserved along the rest of the process (Figures 9b, 9c, 9d and 9e). As the task heavily relies on syntactic information, this improvement intuitively makes sense. Interestingly, even though the task is modeled as a sequence labeling problem, the probe performs similarly to the dependency parsing tasks, suggesting that the structure of syntax trees emerges in such models even when no tree is explicitly involved in the task.
Question answering fine-tuning quickly reaches an F1 score of 0.73 on the first step, which is further improved to 0.88 in the last checkpoint (Figure 10a). All four probing metrics show a clear loss trend (Figures 10b, 10c, 10d and 10e). The loss is especially noticeable for UUAS and Root %, while the Spearman correlations are more stable, suggesting that even if there is a loss of information it does not impact the distance and depth orderings.
Paraphrase identification fine-tuning starts with an F1 score of 0.81 on the first step that is further improved to 0.90 in the last checkpoint (Figure 11a). Regarding accuracy, after reaching 0.69 on the first checkpoint it follows a shallower curve to a final 0.86 (Figure 11b). All four probing metrics follow a loss trend (Figures 11c, 11d, 11e and 11f). The loss is especially noticeable for UUAS and Root %, while the Spearman correlations are more stable, suggesting that even if there is a loss of information it does not impact the distance and depth orderings.
Semantic Role Labeling fine-tuning follows a steep curve for F1, quickly reaching an F1 score of 0.71 on the first step that is further improved to 0.82 in the last checkpoint (Figure 12a). All four probing metrics follow a loss trend (Figures 12b, 12c, 12d and 12e). The loss is especially noticeable for UUAS, which initially loses around 12 points, while the Spearman correlations are more stable, suggesting that even if there is a loss of information it does not impact the distance and depth orderings.