A Closer Look at How Fine-tuning Changes BERT

Given the prevalence of pre-trained contextualized representations in today’s NLP, there have been many efforts to understand what information they contain, and why they seem to be universally successful. The most common approach to use these representations involves fine-tuning them for an end task. Yet, how fine-tuning changes the underlying embedding space is less studied. In this work, we study the English BERT family and use two probing techniques to analyze how fine-tuning changes the space. We hypothesize that fine-tuning affects classification performance by increasing the distances between examples associated with different labels. We confirm this hypothesis with carefully designed experiments on five different NLP tasks. Via these experiments, we also discover an exception to the prevailing wisdom that “fine-tuning always improves performance”. Finally, by comparing the representations before and after fine-tuning, we discover that fine-tuning does not introduce arbitrary changes to representations; instead, it adjusts the representations to downstream tasks while largely preserving the original spatial structure of the data points.


Introduction
Pre-trained transformer-based language models (e.g., Devlin et al., 2019) form the basis of state-of-the-art results across NLP. The relative opacity of these models has prompted the development of many probes to investigate linguistic regularities captured in them (e.g., Kovaleva et al., 2019; Conneau et al., 2018; Jawahar et al., 2019).
Broadly speaking, there are two ways to use a pre-trained representation (Peters et al., 2019): as a fixed feature extractor (where the pre-trained weights are frozen), or by fine-tuning it for a task. The probing literature has largely focused on the former (e.g., Kassner and Schütze, 2020; Perone et al., 2018; Yaghoobzadeh et al., 2019; Krasnowska-Kieraś and Wróblewska, 2019; Wallace et al., 2019; Pruksachatkun et al., 2020; Aghajanyan et al., 2021). Some previous work (Merchant et al., 2020; Mosbach et al., 2020b; Hao et al., 2020) does provide insights about fine-tuning: fine-tuning changes higher layers more than lower ones, and linguistic information is not lost during fine-tuning. However, relatively little is understood about how the representation changes during the process of fine-tuning and why fine-tuning invariably seems to improve task performance.
In this work, we investigate the process of fine-tuning representations using the English BERT family (Devlin et al., 2019). Specifically, we ask: 1. Does fine-tuning always improve performance? 2. How does fine-tuning alter the representation to adjust for downstream tasks? 3. How does fine-tuning change the geometric structure of different layers? We apply two probing techniques, classifier-based probing (Kim et al., 2019; Tenney et al., 2019) and DIRECTPROBE (Zhou and Srikumar, 2021), to variants of BERT representations that are fine-tuned on five tasks: part-of-speech tagging, dependency head prediction, preposition supersense role and function prediction, and text classification. Beyond confirming previous findings about fine-tuning, our analysis reveals several new findings, briefly described below.
First, we find that fine-tuning introduces a divergence between the training and test sets, which in most cases is not severe enough to hurt generalization. However, we do find one exception where fine-tuning hurts performance; this setting also has the largest divergence between the training and test sets after fine-tuning (§4.1).
Second, we examine how fine-tuning changes labeled regions of the representation space. For a representation where task labels are not linearly separable, we find that fine-tuning adjusts it by grouping points with the same label into a small number of clusters (ideally one), thus simplifying the underlying representation. Doing so makes it easier to linearly separate labels with fine-tuned representations than with untuned ones (§4.2). For a representation whose task labels are already linearly separable, we find that fine-tuning pushes the clusters of points representing different labels away from each other, thus introducing large separating regions between labels. Rather than simply scaling the points, clusters move in different directions and to different extents (measured by Euclidean distance). Overall, these clusters become distant compared to the untuned representation. We conjecture that the enlarged region between groups admits a bigger set of classifiers that can separate them, leading to better generalization (§4.3).
We verify our distance hypothesis by investigating the effect of fine-tuning across tasks. We observe that fine-tuning for related tasks can also provide useful signal for the target task by altering the distances between clusters representing different labels (§4.4).
Finally, fine-tuning does not change the higher layers arbitrarily. This confirms previous findings. Additionally, we find that fine-tuning largely preserves the relative positions of the label clusters, while reconfiguring the space to adjust for downstream tasks (§4.5). Informally, we can say that fine-tuning only "slightly" changes higher layers.
These findings help us understand fine-tuning better, and justify why fine-tuned representations can lead to improvements across many NLP tasks.

Preliminaries: Probing Methods
In this work, we probe representations in the BERT family during and after fine-tuning. First, let us look at the two supervised probes we will employ: a classifier-based probe (e.g., Tenney et al., 2019; Jullien et al., 2022) to assess how well a representation supports classifiers for a task, and DIRECTPROBE (Zhou and Srikumar, 2021) to analyze the geometry of the representation.

Classifiers as Probes
Trained classifiers are the most commonly used probes in the literature (e.g., Hewitt et al., 2021; Whitney et al., 2021; Belinkov, 2021). To understand how well a representation encodes the labels for a task, a probing classifier is trained over it, with the embeddings themselves kept frozen while the classifier is trained.
For all our experiments, we use two-layer neural networks as our probe classifiers. We use grid search to choose the best hyperparameters. Each best classifier is trained five times with different initializations, and we report the average accuracy and its standard deviation.
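As an illustration, this probing setup can be sketched as follows. This is a minimal sketch using scikit-learn; the hidden-layer grid and iteration budget are illustrative placeholders, not the hyperparameters used in our experiments:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

def probe_accuracy(X_train, y_train, X_test, y_test, n_runs=5):
    """Train a two-layer MLP probe on frozen embeddings and report
    mean/std test accuracy over several random initializations."""
    # Grid-search the hidden-layer size on the training split.
    grid = GridSearchCV(
        MLPClassifier(max_iter=500),
        param_grid={"hidden_layer_sizes": [(32,), (64,), (128,)]},
        cv=3,
    )
    grid.fit(X_train, y_train)
    best = grid.best_params_["hidden_layer_sizes"]

    # Retrain the best configuration with different seeds; the
    # embeddings (X_*) are fixed, only the probe is trained.
    scores = [
        MLPClassifier(hidden_layer_sizes=best, max_iter=500,
                      random_state=seed)
        .fit(X_train, y_train)
        .score(X_test, y_test)
        for seed in range(n_runs)
    ]
    return float(np.mean(scores)), float(np.std(scores))
```

The embeddings here play the role of the frozen BERT vectors; any fixed feature matrix can be probed the same way.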
Classifier probes aim to measure how well a contextualized representation captures a linguistic property.The classification performance can help us assess the effect of fine-tuning.

DIRECTPROBE: Probing the Geometric Structure
Classifier probes treat the representation as a black box and only focus on the final task performance; they do not reveal how fine-tuning changes the underlying geometry of the space. To this end, we use DIRECTPROBE (Zhou and Srikumar, 2021), a recently proposed technique that analyzes embeddings from a geometric perspective. We briefly summarize the technique and refer the reader to the original work for details. For a given labeling task, DIRECTPROBE returns a set of clusters such that each cluster only contains points with the same label, and there are no overlaps between the convex hulls of these clusters. Any decision boundary must cross the regions between clusters that have different labels (see Figure 1). Since fine-tuning a contextualized representation creates different representations for different tasks, it is reasonable to probe the representation with respect to a given task. These clusters allow us to measure three properties of interest.

Number of Clusters:
The number of clusters indicates the linearity of the representation for a task. If the number of clusters equals the number of labels, then examples with the same label are grouped into one cluster; a simple linear multi-class classifier will suffice. If, however, there are more clusters than labels, then at least two clusters of examples with the same label cannot be grouped together (as in Figure 1, right). This scenario calls for a non-linear classifier.

Figure 1: Using the clustering to approximate the set of all decision boundaries. The left subfigure is a simple binary classification problem with a dashed circular decision boundary. The right subfigure is the result of DIRECTPROBE, where the gray area is the region that a separator must cross. The connected points represent the clusters that DIRECTPROBE produces.

Distances between Clusters: Distances between clusters can reveal the internal structure of a representation. By tracking these distances during fine-tuning, we can study how the representation changes. To compute these distances, we use the fact that each cluster represents a convex object, which allows us to use max-margin separators. We train a linear SVM (Chang and Lin, 2011) to find the maximum-margin separator and compute its margin. The distance between two clusters is twice the margin.

Spatial Similarity: Distances between clusters can also reveal the spatial similarity of two representations. Intuitively, if two representations have similar relative distances between clusters, the representations themselves are similar to each other for the task at hand.
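The margin-based distance can be sketched as follows. This is a minimal sketch using scikit-learn rather than the LIBSVM interface of Chang and Lin (2011); the large C value is an illustrative assumption to approximate a hard-margin separator:

```python
import numpy as np
from sklearn.svm import SVC

def cluster_distance(A, B):
    """Distance between two convex clusters A and B, computed as twice
    the margin of a (near) hard-margin linear SVM separating them."""
    X = np.vstack([A, B])
    y = np.array([0] * len(A) + [1] * len(B))
    svm = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin
    w = svm.coef_[0]
    # For a hard-margin SVM the geometric margin is 1 / ||w||;
    # the gap between the two clusters is twice that.
    return 2.0 / np.linalg.norm(w)
```

For two clusters separated by a gap of width 4 along one axis, this returns approximately 4.0.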
We use these distances to construct a distance vector v for a representation, where each element v_i is the distance between the clusters of a pair of labels. With n labels in a task, the size of v is n(n−1)/2. This construction works only when the number of clusters equals the number of labels (i.e., the dataset is linearly separable under the representation). Surprisingly, we find this to be the case for most representations we studied. As a measure of the similarity of two representations for a labeling task, we compute the Pearson correlation coefficient between their distance vectors. Note that this coefficient can also be used to measure the similarity between two labeled datasets with respect to the same representation.
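The distance-vector construction and the similarity measure can be sketched as follows. The dictionary-based input format is hypothetical; any encoding of the pairwise cluster distances from the previous step would work:

```python
import numpy as np
from itertools import combinations

def spatial_similarity(dist_a, dist_b, labels):
    """Correlate the pairwise label-cluster distances of two
    representations. `dist_a`/`dist_b` map a frozenset of two labels to
    the distance between the corresponding clusters."""
    # Enumerate the n(n-1)/2 label pairs in a fixed order, so both
    # distance vectors are aligned element-by-element.
    pairs = list(combinations(sorted(labels), 2))
    va = np.array([dist_a[frozenset(p)] for p in pairs])
    vb = np.array([dist_b[frozenset(p)] for p in pairs])
    # Pearson correlation coefficient of the two distance vectors.
    return float(np.corrcoef(va, vb)[0, 1])
```

A coefficient near 1 means the two spaces arrange the label clusters with the same relative geometry; near −1 means the arrangement is inverted.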

Experimental Setup
In this section, we describe the representations and tasks we use in our experiments.

Representations
We investigate several models from the BERT family (Devlin et al., 2019; Turc et al., 2019). These models all share the same basic architecture but differ in capacity, i.e., in the number of layers and the hidden sizes. Table 1 summarizes the models we investigate in this work. All of these models are uncased English models.
For tokens that are broken into subwords by the tokenizer, we average the subword embeddings to obtain the token representation. We use the models provided by HuggingFace v4.2.1 (Wolf et al., 2020) and PyTorch v1.6.0 (Paszke et al., 2019) for our experiments.
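The subword-averaging step can be sketched as follows. This is a minimal sketch operating on already-computed hidden states; the `word_ids` convention follows HuggingFace tokenizers, where each subword position maps back to its source word and special tokens map to None:

```python
import numpy as np

def average_subwords(hidden, word_ids):
    """Average subword vectors into one vector per word.

    hidden: array of shape (num_subwords, dim), one row per subword
        position (including special tokens such as [CLS]/[SEP]).
    word_ids: for each subword position, the index of the word it was
        split from, or None for special tokens.
    """
    n_words = max(w for w in word_ids if w is not None) + 1
    out = []
    for i in range(n_words):
        # Collect every subword position belonging to word i.
        rows = [j for j, w in enumerate(word_ids) if w == i]
        out.append(hidden[rows].mean(axis=0))
    return np.stack(out)
```

For example, if "playing ball" tokenizes to `[CLS] play ##ing ball`, the vectors of `play` and `##ing` are averaged into one representation for "playing".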

Tasks
We instantiate our analysis of the BERT models on a diverse set of five NLP tasks, covering both syntactic and semantic predictions. Here, we briefly describe the tasks, and refer the reader to the original sources of the data for further details.

Part-of-speech tagging (POS) predicts the part-of-speech tag for each word in a sentence. The task helps us understand whether a representation captures coarse-grained syntactic categorization. We use the English portion of the parallel universal dependencies treebank (ud-pud, Nivre et al., 2016).

Dependency relation (DEP) predicts the syntactic dependency relation between two tokens, i.e., (w_head, w_mod). This task can help us understand if, and how well, a representation characterizes syntactic relationships between words. Since the task involves assigning a category to a pair of tokens, we concatenate their contextualized representations from BERT and treat the concatenation as the representation of the pair. We use the same dataset as the POS task.

Preposition supersense disambiguation involves two categorization tasks: predicting a preposition's semantic role (PS-role) and semantic function (PS-fxn). These tasks are designed to disambiguate the semantic meanings of prepositions. Following previous work (Liu et al., 2019), we only train and evaluate on single-token prepositions from the Streusle v4.2 corpus (Schneider et al., 2018).

Text classification, in general, is the task of categorizing sentences or documents. We use the TREC-50 dataset (Li and Roth, 2002) with 50 semantic labels for sentences. As is standard practice, we use the representation of the [CLS] token as the sentence representation. This task shows how well a representation characterizes a whole sentence.

Fine-tuning Setup
We fine-tune the models in §3.1 on the five tasks from §3.2 separately. The fine-tuned models (along with the original models) are then used to generate contextualized representations. The probing techniques described in §2 are applied to study both original and fine-tuned representations.
Our preliminary experiments showed that the commonly used 3-5 epochs of fine-tuning are insufficient for the smaller representations, such as BERT tiny, which require more epochs. We fine-tuned all the representations for 10 epochs except BERT base, which we fine-tuned for the usual three epochs. Note that the fine-tuning phase is separate from the classifier training phase for probing; for the probe classifiers, we train two-layer neural networks (described in §2.1) from scratch on both original and fine-tuned representations, ensuring a fair comparison between them.
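The fine-tuning procedure can be sketched as follows. This is a simplified PyTorch sketch: the warmup schedule, task-specific heads, and batching details (see Appendix A) are omitted, and the epoch count and Adam optimizer mirror the setup described above:

```python
import torch

def fine_tune(model, loader, epochs=10, lr=3e-4):
    """Minimal fine-tuning loop. `loader` yields (inputs, labels)
    batches; `model` maps inputs to per-class logits."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(inputs), labels)
            loss.backward()
            optimizer.step()
    return model
```

In our actual experiments the loop additionally uses a linear warmup schedule over 10% of the update steps (Appendix A).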

Observations and Analysis
In this section, we use classifier probes to examine whether fine-tuning always improves classifier performance (§4.1). Then we propose a geometric explanation for why fine-tuning improves classification performance, using DIRECTPROBE (§4.2 and §4.3). Next, we confirm this geometric explanation by investigating cross-task fine-tuning (§4.4). Finally, we analyze how fine-tuning changes the geometry of different layers of BERT base (§4.5).

Fine-tuned Performance
It is commonly accepted that fine-tuning improves task performance. Does this always hold? Table 2 summarizes the relevant observations from our experiments. Appendix C presents the complete fine-tuning results.

Fine-tuning causes the training and test sets to diverge.
In Table 2, the last column shows the spatial similarity between the training and test sets for each representation. We apply DIRECTPROBE to the training and test sets separately. The spatial similarity is calculated as the Pearson correlation coefficient between the distance vectors of the training and test sets (described in §2). We observe that all the similarities decrease after fine-tuning, implying that the training and test sets diverge as a result of fine-tuning. In most cases, this divergence is not severe enough to decrease performance.
There are exceptions, where fine-tuning hurts performance. An interesting observation in Table 2 is that BERT small does not improve on the PS-fxn task after fine-tuning, contradicting the well-accepted impression that fine-tuning always improves performance. However, only one such exception is observed across all our experiments (see Appendix C), which is insufficient to draw any concrete conclusions about why this happens. We do observe that BERT small shows the smallest similarity (0.44) between the training and test sets after fine-tuning on the PS-fxn task. We conjecture that controlling the divergence between the training and test sets can help ensure that fine-tuning helps. Verifying or refuting this conjecture requires further study.

Linearity of Representations
Next, let us examine the geometry of the representations before and after fine-tuning, using DIRECTPROBE and counting the number of clusters. We focus on the overwhelming majority of cases where fine-tuning does improve performance.

Smaller representations require more complex classifiers. Table 3 summarizes the results. For brevity, we only present the results on BERT tiny.
The full results are in Appendix C. We observe that before fine-tuning, small representations (i.e., BERT tiny) are non-linear for most tasks. Although non-linearity does not imply poor generalization, it represents a more complex spatial structure and requires a more complex classifier. This suggests that when using small representations (say, due to limited resources), it is advisable to use a non-linear classifier rather than a simple linear one.
Fine-tuning makes the space simpler. In Table 3, we observe that the number of clusters decreases after fine-tuning. This tells us that after fine-tuning, the points associated with different labels are in a simpler spatial configuration. The same trend holds for TREC-50 (Table 4), even when the final representation is not linearly separable.

Spatial Structure of Labels
To better understand the changes in spatial structure, we apply DIRECTPROBE to every intermediate representation encountered during fine-tuning.
Here, we focus on BERT base. Since all the representations we consider here are linearly separable, the number of clusters equals the number of labels. As a result, each cluster exclusively corresponds to one label; in what follows, we use clusters and labels interchangeably.
Fine-tuning pushes the cluster for each label away from the others. This confirms the observation of Zhou and Srikumar (2021), who pointed out that fine-tuning pushes labels away from each other. However, they used the global minimum distance between clusters to support this argument, which only partially supports the claim: the distances between some clusters might decrease even as the global minimum distance increases. We instead track the minimum distance of each label to all other labels during fine-tuning, and find that all these minimum distances increase. Figure 2 shows how these distances change in the last layer of BERT base for the PS-role and POS tagging tasks. Appendix D includes the plots for all tasks. For clarity, we only show the three labels where the distance increases the most, and the three where it increases the least. We also observe that although the trend is increasing, the minimum distance associated with a label may decrease during the course of fine-tuning, e.g., the label STUFF in the PS-role task, suggesting a potential instability of fine-tuning.

To further examine how labels move during fine-tuning, we track the centroid of each cluster. We select the three closest labels from the POS tagging task and track the paths of their cluster centroids in the last layer of BERT base during fine-tuning. Figure 3 (right) shows the 2D PCA projection of these paths. We observe that before fine-tuning, the centroids of all three labels are close to each other. As fine-tuning proceeds, the centroids move in different directions, away from each other.
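The centroid-tracking analysis can be sketched as follows. This is a minimal sketch assuming a hypothetical checkpoint format of (embeddings, labels) pairs, one per fine-tuning checkpoint:

```python
import numpy as np
from sklearn.decomposition import PCA

def centroid_paths(snapshots, labels):
    """Project each label centroid's trajectory across fine-tuning
    checkpoints into 2D. `snapshots` is a list of (X, y) pairs, one
    per checkpoint; returns an array of shape (checkpoints, labels, 2)."""
    # Compute one centroid per label at each checkpoint.
    cents = np.array([[X[y == lab].mean(axis=0) for lab in labels]
                      for X, y in snapshots])      # (steps, n_labels, dim)
    steps, n_labels, dim = cents.shape
    # Fit a single PCA over all centroids so the paths share one plane.
    proj = PCA(n_components=2).fit_transform(cents.reshape(-1, dim))
    return proj.reshape(steps, n_labels, 2)
```

Plotting each label's row of the returned array over the checkpoint axis reproduces paths of the kind shown in Figure 3.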
We conclude that fine-tuning enlarges the gaps between label clusters and admits more classifiers consistent with the labels, allowing for better generalization. Note that neither the loss nor the optimizer explicitly mandates this change. Indeed, since the labels were originally linearly separable, the learner need not adjust the representation at all.

Cross-task Fine-tuning
In §4.3, we hypothesized that fine-tuning improves performance because it enlarges the gaps between label clusters. A natural implication of this hypothesis is that fine-tuning may shrink the gaps between the labels of an unrelated task, decreasing performance on that task. In this subsection, we investigate how fine-tuning for one task affects another.
We fine-tune BERT base on the PS-role and POS tagging tasks separately and use the fine-tuned models to generate contextualized representations for the PS-fxn task. Our choice of tasks is motivated by the observation that PS-role and PS-fxn are similar tasks that both predict supersense tags for prepositions. On the other hand, fine-tuning on POS tagging can adversely affect the PS-fxn task, because POS tagging requires all prepositions to be grouped together (label ADP), while PS-fxn requires different prepositions to be far away from each other. We apply DIRECTPROBE to both representations to analyze the geometric changes with respect to PS-fxn.
The effects of cross-task fine-tuning depend on how close the two tasks are. The third and fourth columns of Table 5 indicate the number of labels whose minimum distance increased or decreased after fine-tuning. The second column from the right shows the average distance change over all labels, e.g., fine-tuning on POS results in the minimum distances of the PS-fxn labels decreasing by 1.68 on average. We observe that fine-tuning on the same dataset (PS-fxn) increases the distances between labels (second row), consistent with the observations from §4.3; fine-tuning on a similar task also increases the distances between clusters (third row), but to a lesser extent. However, fine-tuning on an "opposing" task decreases the distances between clusters (last row). These observations suggest that cross-task fine-tuning can add or remove information from the representation, depending on how close the source and target tasks are.
Small distances between label clusters indicate poor performance. Based on our conclusion in §4.3 that a larger gap between labels leads to better generalization, we expect the performance on PS-fxn after fine-tuning on PS-role to be higher than after fine-tuning on POS tagging. To verify this, we train two-layer neural networks on the PS-fxn task using the representations fine-tuned on the PS-role and POS tagging tasks. Importantly, we do not further fine-tune the representations for PS-fxn. The last column of Table 5 shows the results. Fine-tuning on PS-fxn enlarges the gaps between all PS-fxn labels, which explains the highest performance; fine-tuning on PS-role enlarges the gaps between some labels in PS-fxn, leading to a slight improvement; fine-tuning on POS tagging shrinks the gaps between all labels in PS-fxn, leading to a decrease in performance.

In summary, based on the results of §4.2, §4.3, and §4.4, we conclude that fine-tuning injects or removes task-related information from representations by adjusting the distances between label clusters, even when the original representation is already linearly separable (i.e., when there is no need to change the representation). When the original representation does not support a linear classifier, fine-tuning groups points with the same label into a small number of clusters, ideally one.

Layer Behavior
Previous work (Merchant et al., 2020; Mosbach et al., 2020b) showed that during fine-tuning, lower layers change less than higher layers. In the following experiments, we confirm their findings and further show that: (i) fine-tuning does not change the representation arbitrarily, even in the higher layers; (ii) lower and higher layers change in qualitatively different ways, which we illustrate with a visual comparison. Here, we focus on the POS tagging task with BERT base. Our conclusions extend to other tasks, whose results are in Appendix E.
Higher layers do not change arbitrarily. Although previous work (Mosbach et al., 2020b) shows that higher layers change more than lower layers, we find that the higher layers still remain close to the original representations. To study the dynamics of fine-tuning, we compare each layer during fine-tuning to its corresponding original pre-trained layer. The spatial similarity between two representations is calculated as the Pearson correlation coefficient of their distance vectors, as described in §2. Intuitively, a classifier learns a decision boundary that traverses the region between clusters, which makes the distances between clusters more relevant to our analysis than the spatial structure of points within each cluster.
Figure 4 shows the results for all four tasks. To avoid visual clutter, we only show the plots for every alternate layer. For the higher layers, we find that the Pearson correlation coefficient between the original representation and the fine-tuned one is surprisingly high (more than 0.5), reinforcing the notion that fine-tuning does not change the representation arbitrarily. Instead, it attempts to preserve the relative positions of the labels. This means the fine-tuning process encodes task-specific information while largely preserving the pre-trained information encoded in the representation.

The labels of lower layers move only within a small region, and almost in the same directions. The unchanged nature of lower layers raises the question: do they not change at all? To answer this question, for every label, we compute the difference between its centroids before and after fine-tuning. Figure 5 shows the 2D PCA projection of these difference vectors. For brevity, we only present the plots for every alternate layer; a plot with all layers can be found in Appendix E. We observe that the movements of labels in lower layers concentrate in a few directions compared to the higher layers, suggesting that labels in lower layers do change, but do not separate the labels as much as the higher layers. Also, we observe that the labels INTJ and SYM have distinctive directions in the lower layers.
Note that in Figure 5, the motion range of the lower layers is much smaller than that of the higher layers. The projected two dimensions range from −1 to 3 and from −3 to 3 for layer 2, while for layer 12 they range from −12 to 13 and from −12 to 8, suggesting that labels in lower layers only move within a small region compared to higher layers. Figure 3 shows an example of this difference: compared to the layer 12 paths (right), the layer 1 paths (left) traverse almost the same trajectories, which is consistent with the observations from Figure 5.

Discussion
Does fine-tuning always improve performance? Fine-tuning almost always improves task performance, but rare cases exist where it decreases performance. Fine-tuning introduces a divergence between the training set and unseen examples (§4.1). However, it is unclear how this divergence affects the generalization ability of representations; e.g., does this divergence suggest a new kind of overfitting that is driven by representations rather than classifiers?

How does fine-tuning alter the representation to adjust for downstream tasks? Fine-tuning alters the representation by grouping points with the same label into a small number of clusters (§4.2) and pushing each label cluster away from the others (§4.3). We hypothesize that the distances between label clusters correlate with classification performance, and we confirm this hypothesis by investigating cross-task fine-tuning (§4.4). Our findings are surprising because fine-tuning for a classification task does not need to alter the geometry of a representation if the data is already linearly separable in the original representation. What we observe reveals geometric properties that characterize good representations. We do not present a theoretical analysis connecting our geometric findings to representation learnability, but the findings in this work may serve as a starting point for a learning theory for representations.

How does fine-tuning change the underlying geometric structure of different layers? It is established that higher layers change more than lower ones. In this work, we analyze this behavior more closely. We discover that higher layers do not change arbitrarily; instead, they remain similar to the untuned version. Informally, we can say that fine-tuning only "slightly" changes even the higher layers (§4.5). Nevertheless, our analysis does not reveal why higher layers change more than lower layers. A deeper analysis of model parameters during fine-tuning is needed to understand the difference between lower and higher layers.

Limitations of this work. Our experiments use the BERT family of models for English tasks. Given the architectural similarity of transformer language models, we may be able to extrapolate the results to other models, but further work is needed to confirm our findings for other languages and model architectures. In our analysis, we ignore the structure within each cluster, which is another source of information for studying the representation. We plan to investigate these aspects in future work. We make our code available for replication and extension by the community.

Related Work
There are many lines of work that focus on analyzing and understanding representations. The most commonly used technique is the classifier-based method. Early work (Alain and Bengio, 2017; Kulmizev et al., 2020) starts with using linear classifiers as probes. Hewitt and Liang (2019) pointed out that a linear probe is not sufficient to evaluate a representation, and some recent work employs non-linear probes (Tenney et al., 2019; Eger et al., 2019). There are also efforts to inspect representations from a geometric perspective (e.g., Ethayarajh, 2019; Mimno and Thompson, 2017), including the recently proposed DIRECTPROBE (Zhou and Srikumar, 2021), which we use in this work. Another line of probing work designs control tasks (Ravichander et al., 2021; Lan et al., 2020) to reverse-engineer the internal mechanisms of representations (Kovaleva et al., 2019; Wu et al., 2020). However, in contrast to our work, most studies (Zhong et al., 2021; Li et al., 2021; Chen et al., 2021) focus on pre-trained representations, not fine-tuned ones.
While fine-tuning pre-trained representations usually provides strong empirical performance (Wang et al., 2018; Talmor et al., 2020), how fine-tuning manages to do so has remained an open question. Moreover, instability (Mosbach et al., 2020a; Dodge et al., 2020; Zhang et al., 2020) and forgetting (Chen et al., 2020; He et al., 2021) make it harder to analyze fine-tuned representations. Despite these difficulties, previous work (Merchant et al., 2020; Mosbach et al., 2020b; Hao et al., 2020) draws valuable conclusions about fine-tuning. This work extends this line of effort and provides a deeper understanding of how fine-tuning changes representations.

Conclusions
In this work, we take a close look at how fine-tuning a contextualized representation for a task modifies it. We investigate the fine-tuned representations of several BERT models using two probing techniques: classifier-based probing and DIRECTPROBE. First, we show that fine-tuning introduces a divergence between the training and test sets, and in at least one case, hurts generalization. Next, we show that fine-tuning alters the geometry of a representation by pushing points belonging to the same label closer to each other, admitting simpler and better classifiers. We confirm this hypothesis with cross-task fine-tuning experiments. Finally, we discover that while adjusting representations to downstream tasks, fine-tuning largely preserves the original spatial structure of points across all layers. Taken collectively, the empirical study presented in this work not only helps justify the impressive performance of fine-tuning, but may also lead to a better understanding of learned representations.

A Fine-tuning Details
In this work, we fine-tune all tasks and representations using the HuggingFace library. We use a linear weight scheduler with a learning rate of 3e-4, which uses 10% of the total update steps as warmup steps. The same scheduler is used for all tasks. All models are optimized by Adam (Kingma and Ba, 2015) with a batch size of 32. All fine-tuning is run on a single Titan GPU. The best hidden-layer sizes for each task are shown in Table 7.

B Summary of Tasks
In this work, we conduct experiments on five NLP tasks, chosen to cover different usages of the representations we study. Table 6 summarizes these tasks.

C Probing Performance
Table 7 shows the complete table of probing results in our experiments. The last column is the spatial similarity between the training and test sets. Some entries are missing because the similarity can only be computed on representations that are linearly separable for the given task.

D Dynamics of Minimum Distances
Figure 6 shows the dynamics of the minimum distances for labels on all four tasks. For clarity, we only present the distances for the three labels whose distances increase the most and the three whose distances decrease the most.

E PCA Projections of the Movements
Figures 7-10 show the PCA projections of the difference vectors between the centroids of labels before and after fine-tuning, based on BERT base.

Figure 9: The PCA projection of the difference vector between the centroids of labels before and after fine-tuning, based on the supersense function task and BERT base.
Figure 10: The PCA projection of the difference vector between the centroids of labels before and after fine-tuning based on Supersense role task and BERT base .

F Cluster Number Revision
We discovered a bug in the implementation of DIRECTPROBE which causes the merging to stop early while the remaining clusters are still mergeable. The main paper (Table 3, Table 4, and Table 7) has been updated to report the correct results.
Table 8 shows the original results. This bug does not change the nature of the linearity of datasets and representations; all findings from the original experiments remain the same. The bug only affects the number of clusters when the representation is non-linear for a given task.

Figure 2 :
Figure 2: The dynamics of the minimum distances of the three labels where the distance increases the most, and the three where it increases the least. The horizontal axis is the number of fine-tuning updates; the vertical axis is the chosen label's minimum distance to the other labels. These results come from the last layer of BERT base. Full plots for all four tasks can be found in Appendix D.

Figure 3 :
Figure 3: The PCA projection of the three closest labels in the POS tagging task, based on the first (left) and last (right) layer of BERT base. These lines show the paths of the centroids of each label cluster during fine-tuning. The markers indicate the starting points. This figure is best seen in color.

Figure 4 :
Figure 4: Dynamics of spatial similarity during the fine-tuning process, based on BERT base. The horizontal axis is the number of updates during fine-tuning. The vertical axis is the Pearson correlation coefficient between the current space and its original version (before fine-tuning).

Figure 5 :
Figure 5: The PCA projection of the difference vector between the centroids of labels before and after fine-tuning, based on the POS tagging task and BERT base. Lower layers have a much smaller projection range than the higher layers. This figure is best seen in color.

Figure 8 :
Figure 8: The PCA projection of the difference vector between the centroids of labels before and after fine-tuning based on dependency prediction task and BERT base .

Table 2 :
Fine-tuned performance of BERT small based on the last layer. The last column shows the spatial similarity (described in §2) between the training and test sets. A complete table of all representations and tasks can be found in Appendix C.

Table 3 :
The linearity of the last layer of BERT tiny for each task.Other results are in Appendix C.

Table 4 :
The linearity of the last layer of all models on TREC-50 task.The number of clusters is always more than the number of labels (50).

Table 7 :
A complete table of the probing results of five representations on five tasks.

Table 8 :
Original table of the probing results of five representations on five tasks. These results were in the original version of the paper before we found a bug in the implementation of DIRECTPROBE. The updated results are in Table 7. See Appendix C for details.