How Does Transfer Learning Impact Linguistic Knowledge in Deep NLP Models?

Transfer learning from pre-trained neural language models towards downstream tasks has been a predominant theme in NLP recently. Several researchers have shown that deep NLP models learn a non-trivial amount of linguistic knowledge, captured at different layers of the model. We investigate how fine-tuning towards downstream NLP tasks impacts this learned linguistic knowledge. We carry out a study across the popular pre-trained models BERT, RoBERTa and XLNet using layer- and neuron-level diagnostic classifiers. We found that for some GLUE tasks the network relies on core linguistic information and preserves it deep into the network, while for others this information is forgotten. Linguistic information is distributed across the pre-trained language models but becomes localized to the lower layers post fine-tuning, reserving the higher layers for task-specific knowledge. The pattern varies across architectures, with BERT retaining linguistic information relatively deeper in the network than RoBERTa and XLNet, where it is predominantly delegated to the lower layers.


Introduction
Contextualized word representations learned in transformer-based language models capture rich linguistic knowledge, making them ubiquitous for transfer learning towards downstream NLP problems such as the Natural Language Understanding tasks in GLUE (Wang et al., 2018). The general idea is to pre-train representations on large-scale unlabeled data and adapt them to a downstream task using supervision.
Descriptive methods in neural interpretability investigate what knowledge is learned within the representations through relevant extrinsic phenomena, varying from word morphology (Vylomova et al., 2016; Belinkov et al., 2017a; Dalvi et al., 2017) to high-level concepts such as structure (Shi et al., 2016; Linzen et al., 2016) and semantics (Qian et al., 2016; Belinkov et al., 2017b), or more generic properties such as sentence length (Adi et al., 2016; Bau et al., 2019). These studies are carried out on representations from pre-trained models. However, it is also important to investigate how this learned knowledge evolves as the models are adapted towards a specific task, away from the more generic language modeling objective (Peters et al., 2018) that they are primarily trained on.
In this work, we analyze representations of three popular pre-trained models (BERT, RoBERTa and XLNet) with respect to morpho-syntactic and semantic knowledge, as they are fine-tuned towards GLUE tasks. More specifically, we investigate i) whether the fine-tuned models retain the same amount of linguistic information, and ii) how this information is redistributed across different layers and individual neurons. To this end, we use diagnostic classifiers (Hupkes et al., 2018; Conneau et al., 2018), a popular framework for probing knowledge in neural models. The central idea is to extract feature representations from the network and train an auxiliary classifier to predict the property of interest. The quality of the trained classifier on the given task serves as a proxy for the quality of the extracted representations w.r.t. the understudied property (Belinkov et al., 2020).
We carry out layer-wise (Liu et al., 2019a) and neuron-level probing analyses (Dalvi et al., 2019a) to study the fine-tuned representations. The former probes representations from individual layers w.r.t. a linguistic property, while the latter finds salient neurons in the network that capture the property. Fine-tuning involves adjusting feature weights; it is therefore important to look at individual neurons to uncover important details, in addition to the more holistic layer-wise view.
Our layer-wise analysis shows: i) that some GLUE tasks rely on core linguistic knowledge and the model preserves the information deeper in the network, while for others it is retained only in the lower layers; ii) interesting cross-architectural differences, with knowledge regressed to lower layers in RoBERTa and XLNet, as opposed to BERT, where it is still retained at the higher layers. Our neuron-wise analysis shows: i) that salient linguistic neurons are relocated from the higher to the lower layers, reinforcing our layer-wise results; ii) that linguistic information becomes less distributed and less redundant in the network post fine-tuning.
Finally, we show how our analysis explains patterns in layer pruning. Dropping the higher layers of the models maintains performance comparable to fine-tuning the full network, since linguistic information has regressed to the lower layers. Conversely, pruning the lower layers (which hold the core linguistic information) leads to substantial degradation in performance.
In comparison to related work in this direction, our findings resonate with Merchant et al. (2020), who found that fine-tuning primarily affects top layers and does not lead to "catastrophic forgetting of linguistic phenomena" in BERT. However, we found that other models such as RoBERTa and XLNet, which they did not study, see a substantial drop in accuracy even at the lower layers and start forgetting linguistic knowledge much earlier in the network. In contrast to Mosbach et al. (2020), we study core linguistic phenomena, whereas their study is based on sentence-level probing tasks. Differently from both, we carry out a fine-grained neuron analysis that sheds light on how neurons are distributed and relocated post fine-tuning; our work thus complements their findings while extending the layer-wise analysis to core linguistic tasks.

Methodology
Our methodology is based on the probing framework known as diagnostic classifiers. We train a classifier, using the activations generated from the trained neural network as static features, towards the task of predicting a certain linguistic property. The underlying assumption is that if the classifier can predict the property, the representations implicitly encode this information. We train layer- and neuron-wise probes using logistic-regression classifiers. Formally, consider a pre-trained neural language model M with L layers: {l_1, l_2, ..., l_L}.
Given a dataset D = {w_1, w_2, ..., w_N} with a corresponding set of linguistic annotations T = {t_{w_1}, t_{w_2}, ..., t_{w_N}}, we map each word w_i in the data D to its latent representation z_i, yielding z = {z_1, ..., z_N}. The classifier is trained by minimizing the cross-entropy loss

L(\theta) = -\sum_i \log P_\theta(t_{w_i} \mid z_i),

where P_\theta(t_{w_i} | z_i) is the probability that word i is assigned property t_{w_i}. We extract representations from individual layers for our layer-wise analysis and from the entire network for the neuron analysis. We use Linguistic Correlation Analysis, as described in Dalvi et al. (2019a), to generate a neuron ranking with respect to the understudied linguistic property: given the trained classifier \theta \in R^{D \times T}, the algorithm extracts a ranking of the D neurons in the model M based on the weight distribution. Elastic-net regularization (Zou and Hastie, 2005), a combination of \lambda_1 \|\theta\|_1 and \lambda_2 \|\theta\|_2^2, is used to strike a balance between identifying focused (L1) versus distributed (L2) neurons. The weights of the regularization terms are tuned using grid search.
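As a concrete illustration, the probe above amounts to a softmax regression trained on frozen representations. The following minimal NumPy sketch mirrors this setup; the hyperparameters and toy data are illustrative, not the paper's actual settings:

```python
import numpy as np

def train_probe(Z, y, num_labels, l1=1e-5, l2=1e-5, lr=0.1, epochs=200):
    """Train a linear diagnostic classifier on frozen representations
    Z (N x D) with elastic-net regularization; y holds integer
    property labels. A simplified sketch of the setup described above."""
    N, D = Z.shape
    theta = np.zeros((num_labels, D))                # probe weights
    Y = np.eye(num_labels)[y]                        # one-hot targets
    for _ in range(epochs):
        logits = Z @ theta.T
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)            # softmax probabilities
        # Gradient of cross-entropy plus the elastic-net penalty terms
        grad = (P - Y).T @ Z / N + l1 * np.sign(theta) + 2 * l2 * theta
        theta -= lr * grad
    return theta

def probe_accuracy(theta, Z, y):
    return float(((Z @ theta.T).argmax(axis=1) == y).mean())

# Toy sanity check on linearly separable "representations"
Z = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]])
y = np.array([0, 1, 0, 1])
theta = train_probe(Z, y, num_labels=2)
print(probe_accuracy(theta, Z, y))  # → 1.0
```

In practice the features Z would be activations extracted from a given layer (or the whole network) of the pre-trained or fine-tuned model.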
Following Durrani et al. (2020), we extract salient neurons for a linguistic property by iteratively choosing the top N neurons from the ranked list and retraining the classifier using only these neurons, until the classifier obtains accuracy close (within a specified threshold δ) to the Oracle, i.e., the accuracy of the classifier trained using all the features in the network.
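The iterative selection procedure can be sketched as follows; `accuracy_fn` is a hypothetical callback that retrains the probe on a neuron subset and returns its accuracy, and the step size and toy accuracy curve are illustrative:

```python
def minimal_neuron_set(ranking, accuracy_fn, oracle_acc, delta=0.01, step=50):
    """Grow the set of top-ranked neurons until the retrained probe
    reaches accuracy within `delta` of the oracle (all-neuron) accuracy."""
    for n in range(step, len(ranking) + 1, step):
        subset = ranking[:n]
        if accuracy_fn(subset) >= oracle_acc - delta:
            return subset
    return ranking  # fall back to using all neurons

# Toy example: accuracy grows linearly with the number of neurons kept
ranking = list(range(768))
toy_accuracy = lambda subset: min(1.0, 0.002 * len(subset))
subset = minimal_neuron_set(ranking, toy_accuracy, oracle_acc=0.90)
print(len(subset))  # → 450, the first multiple of 50 reaching 0.89
```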

Experimental Setup
Pre-trained Neural Language Models: We experimented with three transformer models: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b) and XLNet (Yang et al., 2019), using the base versions (12 transformer layers plus the embedding layer, 768 dimensions). This choice of architectures allows an interesting comparison between auto-encoder and auto-regressive models. The models were then fine-tuned towards GLUE tasks, of which we experimented with SST-2 for sentiment analysis with the Stanford Sentiment Treebank (Socher et al., 2013), MNLI for natural language inference (Williams et al., 2018), QNLI for question-answering NLI (Rajpurkar et al., 2016), RTE for recognizing textual entailment (Bentivogli et al., 2009), MRPC for paraphrase detection (Dolan and Brockett, 2005), and STS-B for the semantic textual similarity benchmark (Cer et al., 2017). All models were fine-tuned with identical settings, and we performed three independent runs.
Linguistic Properties: We evaluated our method on three linguistic tasks: POS tagging using the Penn Treebank (Marcus et al., 1993), syntactic chunking using the CoNLL-2000 shared task dataset (Tjong Kim Sang and Buchholz, 2000), and semantic tagging using the Parallel Meaning Bank data (Abzianidze et al., 2017). We used standard splits for training, development and test data.
Classifier Settings: We used a linear probing classifier with elastic-net regularization and a categorical cross-entropy loss, optimized by Adam (Kingma and Ba, 2014). Training is run with shuffled mini-batches of size 512 and stopped after 10 epochs. The regularization weights are tuned using grid search. For sub-word based models, we use the activation of the last sub-word as the representative of the word, following Durrani et al. (2019). We computed selectivity (Hewitt and Liang, 2019) to ensure that our results reflect properties of the representations and not the probe's capacity to memorize. Please see the Appendix for details.

Layer-wise Probing
First we train layer-wise probes to show how linguistic knowledge is redistributed across the network as we fine-tune it towards downstream tasks. Figure 1 shows results for the POS and Chunking tasks (see Appendix for semantic tagging). We observed varying patterns across different GLUE tasks.
Comparing GLUE tasks: We found that linguistic phenomena are more important for certain downstream tasks, for example STS, RTE and MRPC, where they are preserved in the higher layers post fine-tuning, as opposed to others, for example SST, QNLI and MNLI, where they are forgotten in the higher layers. It would be interesting to study this further by connecting linguistic probes with causation analysis on these tasks. Such an analysis would shed light on which concepts the network uses while making predictions and why such information is forgotten for certain tasks. We leave this exploration for future work.
Comparing Architectures: We found that pre-trained models behave differently in preserving information post fine-tuning. In the case of BERT, linguistic knowledge is fully preserved until layer 9, after which the different task-specific models drop to varying degrees, with SST and QNLI showing a significant drop compared to the others. An exception to this overall trend is MNLI, where we start seeing a decline in performance earlier (between layers 5-7). In contrast, RoBERTa and XLNet show a depreciation in linguistic knowledge as early as layer 5. The drop is also much more catastrophic in these two models, with accuracy falling by more than 35% in RoBERTa and 70% in XLNet. These results indicate that BERT retains its primarily learned linguistic knowledge and uses only a few of the final layers for fine-tuning, as opposed to XLNet and RoBERTa, where linguistic knowledge is retained only in the lower half of the network. Another cross-architectural observation is that in RoBERTa and XLNet, the fine-tuned models never reach the baseline performance (i.e., accuracy before fine-tuning; see Figure 1) at any layer, although the loss is < 2%. We conjecture this discrepancy is due to the knowledge being more redundant and polysemous in BERT, compared to XLNet, where it is more localized (as also observed in Durrani et al. (2020)). Consequently, during fine-tuning XLNet and RoBERTa are more likely to lose linguistic information that is unimportant to the downstream task. We discuss this further in our neuron-analysis section.

Neuron-wise Probing
In our second set of experiments, we conducted analysis at a more fine-grained neuron level using the Linguistic Correlation Analysis method (Dalvi et al., 2019a). We extract the most salient neurons w.r.t. a linguistic property (e.g. POS) and compare how the distribution of such neurons changes across the network as it is fine-tuned towards a downstream GLUE task. We use the weights of the trained classifier to rank neurons and select a minimal set of salient neurons that gives the same classifier accuracy as using the entire network in the baseline model. We found that 5% of the neurons were sufficient to achieve the baseline performance for the POS and SEM tagging tasks, and 10% for Chunking.
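A simplified version of this weight-based ranking could score each neuron by the largest absolute probe weight it receives across the property labels; the full Linguistic Correlation Analysis algorithm is more involved, so the sketch below is only an approximation of the idea:

```python
import numpy as np

def rank_neurons(theta):
    """Rank neurons by the maximum absolute weight they receive across
    all property labels in a trained probe (theta: labels x neurons)."""
    salience = np.abs(theta).max(axis=0)
    return np.argsort(-salience)  # most salient neuron first

# Toy probe weights over 4 neurons and 2 labels
theta = np.array([[0.1, 2.0, 0.0, 0.4],
                  [0.3, 0.1, 1.5, 0.2]])
print(rank_neurons(theta).tolist())  # → [1, 2, 3, 0]
```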
Information becomes less distributed in XLNet and RoBERTa post fine-tuning: Table 1 shows the accuracy of the classifier when selecting the most (top) and least (bottom) salient 5% of neurons on the task of POS tagging (see Appendix for SEM and Chunking tagging). We observed that the bottom neurons in the fine-tuned RoBERTa and XLNet models show a significant drop in performance compared to the baseline model. These results show that the information is more redundant in the baseline models, as the bottom neurons also preserve linguistic knowledge; on the contrary, the information becomes more localized and less distributed post fine-tuning.

How do salient neurons spread across the network layers? Previously we investigated how representations in each layer change w.r.t. a linguistic task. Now we study how the spread of the most salient neurons changes across the fine-tuned models. Figure 2 shows results for the selected GLUE tasks. Notice how the most salient linguistic neurons shift from the higher layers towards the lower layers in RoBERTa and XLNet. This is especially pronounced in the case of RoBERTa-SST and XLNet-QNLI (see Figures 2e and 2f), where the number of salient chunking neurons significantly increased in the lower layers and dropped in the higher layers compared to the baseline. These findings reinforce our layer-wise results and additionally show how more responsibility is delegated to the neurons in the lower layers. Contrastingly, BERT did not exhibit this behavior. These results are in line with Durrani et al. (2020), who also found linguistic properties in XLNet to be localized to the lower layers and to fewer, mutually exclusive neurons, as compared to BERT, where neurons are highly polysemous and therefore more redundant. Their finding helps explain why XLNet forgets linguistic information that is unimportant to the downstream task more catastrophically.

Network Pruning
Our layer- and neuron-wise analyses showed that core linguistic knowledge is redundant and distributed in the large pre-trained models, but as they are fine-tuned towards a downstream task, it is relocated and localized to the lower layers, with the higher layers focusing on task-specific information. In this section, we show that our findings explain patterns in layer pruning. We ask: how important is the linguistic knowledge for these downstream NLP tasks? Following Sajjad et al. (2020), we prune the top and the bottom six layers of the network (excluding the embedding layer) in two separate experiments and compare architectures. Table 2 shows that removing the bottom layers of the network in RoBERTa and XLNet leads to more damage compared to BERT. How do these findings resonate with our analysis? We showed that BERT retains linguistic information even at the higher layers of the model, as opposed to RoBERTa, where it is preserved predominantly at the lower layers. Removing the bottom six layers in RoBERTa leads to a bigger drop because the network is completely deprived of the linguistic knowledge. Linguistic knowledge is more distributed in BERT and also preserved at the higher layers, which leads to a smaller drop as the network can still access this information.
We leave a detailed exploration of this for future work.
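Treating the model as an embedding layer followed by a stack of transformer layers, the two pruning strategies can be sketched schematically as follows (this is an illustration of the idea, not the actual implementation of Sajjad et al. (2020)):

```python
def prune_top(layers, k):
    """Drop the top k transformer layers."""
    return layers[:-k]

def prune_bottom(layers, k):
    """Drop the bottom k transformer layers, keeping the embedding layer."""
    return layers[:1] + layers[1 + k:]

# A 12-layer model plus its embedding layer
model = ["emb"] + [f"layer{i}" for i in range(1, 13)]
print(prune_top(model, 6))     # emb + layers 1-6
print(prune_bottom(model, 6))  # emb + layers 7-12
```

With a real transformer implementation, the same effect is obtained by truncating the module list holding the encoder layers before fine-tuning.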

Conclusion
We studied how linguistic knowledge evolves as pre-trained language models are adapted towards downstream NLP tasks. We fine-tuned three popular models (BERT, RoBERTa and XLNet) towards the GLUE benchmark and analyzed their representations against core morpho-syntactic knowledge. We used probing classifiers to carry out layer- and neuron-wise analyses. Our results showed that morpho-syntactic knowledge is preserved at the higher layers for some GLUE tasks (e.g. STS, MRPC and RTE), while it is forgotten there and retained only at the lower layers for others (MNLI, QNLI and SST). Comparing architectures, we found that BERT retains linguistic knowledge deeper in the network; in the case of RoBERTa and XLNet, the information is preserved only in the lower half of the network. This discrepancy is due to the fact that neurons in BERT are more polysemous and distributed, as opposed to XLNet and RoBERTa, where they are more localized (towards the lower layers) and mutually exclusive. We showed that this architectural difference entails different patterns as we prune the top or bottom layers of the network. Our code is publicly available as part of the NeuroX toolkit (Dalvi et al., 2019b).

Ethics and Broader Impact
For this study, we used existing publicly available datasets while respecting the terms of their licenses. We do not foresee any harm or ethical issues resulting from our study and findings. Our study has implications for work on interpreting and analyzing deep models.

A.1 Data and Representations
We used standard splits for training, development and test data for the three linguistic tasks (POS, SEM and Chunking) that we used to carry out our analysis. The splits to preprocess the data are available through the git repository released with Liu et al. (2019a). See Table 3 for statistics. We obtained the understudied pre-trained models from the authors of the paper through personal communication.

A.2 Layer-wise Probing

Figure 4 shows results on semantic tagging. We see a similar pattern across architectures as in Figure 1.

A.3 Neuron-wise Probing
Section 4.2 presented neuron-wise probing results for Chunking. Figure 3 shows results on POS and SEM tagging. We see a similar pattern across architectures as in Figure 2: as the model is fine-tuned towards a downstream task, the number of salient neurons for a linguistic property in the lower layers increases.

A.4 Top versus Bottom Neurons
In Section 4.2 we showed, using the accuracy of the bottom-ranked neurons, that information is more distributed and redundant in the baseline network, as the bottom neurons also preserve linguistic knowledge, whereas the linguistic information becomes more localized and less distributed post fine-tuning. Tables 4 and 5 demonstrate the same pattern with respect to the Chunking and semantic tagging tasks, selecting 10% and 5% of the neurons respectively.

A.5 Pruning Layers
In Section 5 we showed that pruning the bottom layers in RoBERTa was more harmful than in BERT. We conjectured that this pattern follows from our analysis: in RoBERTa, linguistic information is preserved in the lower layers, as opposed to BERT, where linguistic knowledge is distributed deeper into the network. Table 6 shows that XLNet exhibits a similar pattern to RoBERTa.

A.6 Control Tasks
While there is a plethora of work demonstrating that contextualized representations encode a continuous analogue of discrete linguistic information, the question has recently been raised whether the representations actually encode linguistic structure or whether the probe simply memorizes the understudied task. We use selectivity as a criterion to put a "linguistic task's accuracy in context with the probe's capacity to memorize from word types" (Hewitt and Liang, 2019). It is defined as the difference between the linguistic task accuracy and the control task accuracy. An effective probe is recommended to achieve high linguistic task accuracy and low control task accuracy.
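A control task assigns each word type a random but consistent label. The following minimal sketch follows that idea from Hewitt and Liang (2019); the vocabulary, label count and accuracy values are illustrative:

```python
import random

def control_labels(words, num_labels, seed=0):
    """Map each word *type* to a random, but consistent, control label."""
    rng = random.Random(seed)
    mapping = {}
    return [mapping.setdefault(w, rng.randrange(num_labels)) for w in words]

def selectivity(task_acc, control_acc):
    """Selectivity = linguistic task accuracy minus control task accuracy."""
    return task_acc - control_acc

labels = control_labels(["the", "cat", "sat", "the"], num_labels=5)
print(labels[0] == labels[3])  # → True: same word type, same control label
```

A probe with high linguistic accuracy but low accuracy on these random labels has high selectivity, indicating it reads structure out of the representations rather than memorizing word identities.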

A.7 Infrastructure and Run Time
Our experiments were run on an NVIDIA GeForce GTX TITAN X GPU card. Grid search for finding the optimal regularization weights is expensive when the optimal number of neurons for the task is unknown. Running the full search would take O(M * N^2) steps, where M = 100 (if we increase the number of neurons by 1% in each step) and N is the number of candidate values {0, 0.1, ..., 1e-7} for each regularization weight. We instead fix the neuron budget at M = 20% to find the best regularization parameters first, reducing the grid search to O(N^2), and find the optimal number of neurons in a subsequent O(M) step. The overall running time of our algorithm is therefore O(M + N^2). Wall-clock time varies considerably with the number of examples in the training data and the number of tags to be predicted in the downstream task. Including a full forward pass over the pre-trained model to extract the contextualized vectors, running the grid search and finding the minimal set of neurons took 8 hours on average, ranging from 3 hours for the Chunking experiment to 12 hours for POS and SEM due to their larger training data.
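The two-stage search described above can be sketched as follows; `score` is a hypothetical callback that trains a probe with the given regularizers and neuron fraction and returns its accuracy:

```python
import itertools

# Candidate regularization weights: 0, 0.1, 0.01, ..., 1e-7
LAMBDAS = [0.0] + [10.0 ** -k for k in range(1, 8)]

def two_stage_search(score, fractions, fixed_frac=0.2):
    """Stage 1: grid-search (l1, l2) at a fixed 20% neuron budget,
    an O(N^2) step. Stage 2: sweep the neuron fraction with the best
    regularizers, an O(M) step."""
    l1, l2 = max(itertools.product(LAMBDAS, LAMBDAS),
                 key=lambda p: score(p[0], p[1], fixed_frac))
    frac = max(fractions, key=lambda f: score(l1, l2, f))
    return l1, l2, frac

# Toy score peaking at l1=1e-3, l2=1e-4 and favoring larger neuron budgets
toy = lambda l1, l2, f: f - abs(l1 - 1e-3) - abs(l2 - 1e-4)
print(two_stage_search(toy, [0.05, 0.1, 0.2]))
```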

A.8 Hyperparameters
We use elastic-net regularization to control the trade-off between selecting focused individual neurons versus groups of neurons, while maintaining the original accuracy of the classifier without any regularization. We do a grid search over the L1 and L2 weights, ranging over the values {0, 0.1, ..., 1e-7}. See Table 8 for the optimal values for each task across different architectures.

Table 7: Selecting the minimal number of neurons for each downstream NLP task. Accuracy numbers are reported on the blind test set (averaged over three runs). Neu_a = total number of neurons; Neu_t = top selected neurons; Acc_a = accuracy using all neurons; Acc_t = accuracy after retraining the classifier using the selected neurons; Sel = difference between linguistic task and control task accuracy when the classifier is trained on all neurons (Sel_a) and on the top neurons (Sel_t).