Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers

Despite the success of fine-tuning pretrained language encoders like BERT for downstream natural language understanding (NLU) tasks, it is still poorly understood how neural networks change after fine-tuning. In this work, we use centered kernel alignment (CKA), a method for comparing learned representations, to measure the similarity of representations in task-tuned models across layers. In experiments across twelve NLU tasks, we discover a consistent block diagonal structure in the similarity of representations within fine-tuned RoBERTa and ALBERT models, with strong similarity within clusters of earlier and later layers, but not between them. The similarity of later layer representations implies that later layers only marginally contribute to task performance, and we verify in experiments that the top few layers of fine-tuned Transformers can be discarded without hurting performance, even with no further tuning.


Introduction
Fine-tuning pretrained language encoders such as BERT (Devlin et al., 2019) and its successors (Liu et al., 2019b; Lan et al., 2020; Clark et al., 2020; He et al., 2020) has proven to be highly successful, attaining state-of-the-art performance on many language tasks. But how do these models internally represent task-specific knowledge?
In this work, we study how learned representations change through fine-tuning by studying the similarity of representations between layers of untuned and task-tuned models. We use centered kernel alignment (CKA; Kornblith et al., 2019) to measure representation similarity and conduct extensive experiments across three pretrained encoders and twelve language understanding tasks.
We discover a consistent, block diagonal structure (Figure 1c) in the similarity of representations within fine-tuned RoBERTa and ALBERT models, where early layer representations and later layer representations form two distinct clusters, with high intra-cluster and low inter-cluster similarity. Given the strong representation similarity of later model layers, we hypothesize that many of the later layers only marginally contribute to task performance. We show in experiments that the later layers of task-tuned RoBERTa and ALBERT can indeed be discarded with minimal impact to performance, even without any further fine-tuning.

Experimental Setup
Models For the majority of our experiments, we consider three commonly used language-encoding models: RoBERTa (Liu et al., 2019b), ALBERT (Lan et al., 2020) and ELECTRA (Clark et al., 2020). Because of the large number of experiments being performed, we use RoBERTa BASE, ALBERT LARGE-V2 and ELECTRA BASE rather than the largest available versions of these models.
Optimization The representations learned over the course of training and the similarity of representations may be sensitive to the number of steps used in training. To control for this, and to avoid task-specific hyperparameter tuning, we fine-tune on each task for up to 10,000 steps. We use the Adam (Kingma and Ba, 2014) optimizer with a batch size of 4, a learning rate of 1e-5, and 1,000 warmup optimization steps.
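The optimization settings above can be collected into a small sketch. The four constants come from the text; the shape of the warmup (linear) and the constant rate afterwards are assumptions, since the text specifies neither, and `lr_at_step` is a hypothetical helper name.

```python
PEAK_LR = 1e-5        # learning rate stated in the text
WARMUP_STEPS = 1_000  # warmup steps stated in the text
MAX_STEPS = 10_000    # "up to 10,000 steps" per task
BATCH_SIZE = 4        # batch size stated in the text


def lr_at_step(step):
    """Learning rate at a 0-indexed optimizer step.

    Assumption: linear warmup followed by a constant rate. The text does
    not state the warmup shape or whether the rate decays afterwards.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR


# Sample the schedule at a few points of a full-length run.
schedule = [lr_at_step(s) for s in (0, 499, 999, MAX_STEPS - 1)]
```

If the actual runs used a decaying schedule after warmup, only the final `return` would need to change.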

Representation Similarity with CKA
To analyze how learned representations change via fine-tuning, we use centered kernel alignment (CKA; Kornblith et al., 2019) to measure representation similarity. CKA is invariant to both orthogonal transformation and isotropic scaling of the compared representations, making it well suited to measuring the similarity of neural network representations, and it has been applied to BERT-type models in prior work (Wu et al., 2020; Sridhar and Sarah, 2020). Given two sets of representations X ∈ R^{N×d_1} and Y ∈ R^{N×d_2}, where N is the number of examples and d_1, d_2 are the hidden dimensions, CKA computes a similarity score between 0 and 1, where a higher score indicates greater similarity. Further details on CKA are provided in Appendix A.
Using CKA, we can compare the similarity of representations between different layers of the same model or even different models. For our analysis, we use the representations of the CLS token, i.e., the token whose final layer representation is fed to the task output head. We compute CKA over the validation examples of each task.
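As a rough illustration of the computation, the sketch below implements the linear variant of CKA from Kornblith et al. (2019), i.e. ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F) on column-centered inputs. Whether the experiments here use the linear or a kernel variant is not stated in this section, so the choice is an assumption; the function names and the random stand-in data are hypothetical.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between X (N x d1) and Y (N x d2); rows are the same N examples."""
    # Center every feature (column) over the examples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)


def layerwise_cka(reps_a, reps_b):
    """Similarity matrix: entry (i, j) compares layer i of reps_a with layer j of reps_b."""
    return np.array([[linear_cka(a, b) for b in reps_b] for a in reps_a])


# Random stand-in for per-layer CLS representations:
# 3 "layers", 50 examples, 8 hidden dimensions (illustrative only).
rng = np.random.default_rng(0)
reps = [rng.normal(size=(50, 8)) for _ in range(3)]
sim = layerwise_cka(reps, reps)  # 3 x 3 matrix with ones on the diagonal
```

In practice, each entry of `reps` would hold the CLS vectors of one layer over the validation examples of a task; passing two different models' lists to `layerwise_cka` gives a cross-model comparison.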
To provide intuition for CKA scores, we first show in Figure 1 an example of the comparison formats using ALBERT fine-tuned on RTE.
ORIG-ORIG The top left plot shows the similarity of representations across the layers of the untuned ALBERT model on RTE inputs. Adjacent layers have high similarity scores, only gradually decreasing as more distant layers are compared.
FT-ORIG We show layers of the task-tuned model on the Y-axis and the untuned model on the X-axis. The CLS representations of the later layers in the task-tuned model appear highly dissimilar to those of any layer of the untuned model; in other words, the representations differ starkly from those used for ALBERT's masked language modeling (MLM) and sentence order prediction (SOP) pretraining. This coheres with prior work showing that representations of later layers are most likely to change during fine-tuning (Kovaleva et al., 2019; Wu et al., 2020).
FT-FT Next, we compare layers within a single fine-tuned model. We observe a block-diagonal structure in the representation similarities: two distinct clusters of earlier (approx. first 10) and later (approx. last 14) layers that have high intra-cluster but low inter-cluster similarity. When considered together with FT-ORIG, we can infer that the earlier layer representations resemble those used for pretraining, whereas the later layers encode a representation suitable for tackling the task. The high internal similarity between the top few layers and the sharp block diagonal structure of the similarity matrix imply that the representations of the two clusters starkly differ.

FT[1]-FT[2] Finally, we compare fine-tuned ALBERT models across two random restarts. We observe a similar block diagonal structure. In particular, the similarity of the CLS representations in the later layers indicates that CKA is able to recover the similarity of representations for tackling the same task across random restarts. This likely arises because the models are fine-tuned from the same initial pretrained parameters.
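One simple way to put a number on the block diagonal structure is to compare the mean similarity within the two clusters against the mean similarity between them. The sketch below is only an illustration and not a measure used in the paper; `block_scores`, its `split` parameter, and the synthetic matrix are hypothetical.

```python
import numpy as np


def block_scores(sim, split):
    """Mean similarity within and between two layer clusters.

    sim:   (L x L) symmetric layer-similarity matrix, e.g. CKA scores.
    split: index of the first layer of the later cluster; each cluster
           must contain at least two layers.
    """
    def off_diag_mean(block):
        # Exclude the diagonal, which is trivially 1 for self-comparison.
        mask = ~np.eye(block.shape[0], dtype=bool)
        return block[mask].mean()

    early = sim[:split, :split]
    late = sim[split:, split:]
    cross = sim[:split, split:]
    intra = (off_diag_mean(early) + off_diag_mean(late)) / 2
    inter = cross.mean()
    return intra, inter


# Synthetic 6-layer matrix with an idealised block structure (made-up numbers).
sim = np.full((6, 6), 0.2)
sim[:3, :3] = 0.9
sim[3:, 3:] = 0.9
np.fill_diagonal(sim, 1.0)
intra, inter = block_scores(sim, split=3)
```

A large gap between `intra` and `inter` corresponds to a sharp block diagonal pattern; a matrix whose similarity decays gradually with layer distance, as in ORIG-ORIG, would give a much smaller gap.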

Figure 2: Representation similarity between layers for task-tuned models (FT-FT), with panels for RoBERTa, ALBERT and ELECTRA. RoBERTa and ALBERT task models exhibit a 'block diagonal' structure in the representation similarity of CLS tokens across nearly all tasks.

Results
We extend our CKA analysis to all twelve tasks and all three pretrained models, showing the FT-FT results in Figure 2. We observe that the block diagonal structure of representation similarity identified in Section 3 appears in almost every RoBERTa and ALBERT model, sharply delineating the earlier and later clusters of representations. In fact, RoBERTa often has even more distinct clusters than ALBERT. We hypothesize that since ALBERT shares parameters across layers, it is more difficult for representations to sharply change across a single layer, whereas RoBERTa, which has no parameter sharing, has no such constraint.
The significant similarity of the later layers suggests that many of the later layers may not contribute much to the task. Given residual connections between Transformer layers, later layers could learn a 'no-op' or only slightly adjust the output representation if the task can be adequately 'solved' at an earlier layer. If this is true, we should be able to feed an intermediate representation from later layers to the output head with no further fine-tuning and retain most of the task performance. We investigate this hypothesis in Section 4.
In contrast, we do not see the same pattern in the ELECTRA models. The representations of the later layers are generally highly dissimilar even up to the penultimate layer in many tasks. A few tasks do exhibit a minor block diagonal structure, such as STS-B, Yelp Polarity and SST-2, but it is far less apparent than in the other two models. ELECTRA has a very different pretraining task from the other two models (replaced token detection), which may explain this difference.
The FT-ORIG and FT[1]-FT[2] results for all tasks and models are shown in Figure 5 and Figure 6, respectively. For RoBERTa and ALBERT, while the earlier layers of the task models have similar CLS representations to the untuned models, the later layers are largely dissimilar to any layer in the base model.

Truncating Fine-tuned Models
To test our hypothesis that the later layers of tuned task models only marginally contribute to task performance, we propose a simple experiment where we feed the representations from an intermediate layer directly to the task output head, effectively discarding the later layers. We refer to these as truncated models. We test three different configurations: (a) UNTUNED, where we feed intermediate representations from a fine-tuned model to the tuned task output head without any further fine-tuning, (b) TUNED, where we fine-tune only the output head, and (c) TUNEDORIG, where we use representations from the base model (not fine-tuned on the task), but we fine-tune the output head. Performance of the UNTUNED truncated models indicates the extent to which an intermediate representation can be directly substituted for the final layer's representation; the TUNED and TUNEDORIG models provide an upper bound on performance using the CLS representation of a given layer of a fine-tuned and non-fine-tuned encoder, respectively.
Our results are shown in Figure 3. For RoBERTa and ALBERT, we find that the UNTUNED truncated models perform comparably to the TUNED truncated and full fine-tuned models at the later layers. For instance, the top 4 layers of the RoBERTa model for Yelp Polarity can be discarded with no further tuning and minimal impact to performance (95.5 vs 96.1). On the other hand, TUNEDORIG models perform very poorly compared to the TUNED models across all layers, showing that task-tuned intermediate representations are crucial for good performance, even when fine-tuning the output head. For ALBERT, which shares parameters between layers, a larger fraction of layers can be discarded with minimal impact to performance for both UNTUNED and TUNED truncated models.
On the other hand, we do not find a similar pattern in ELECTRA models. The UNTUNED truncated models perform extremely poorly when discarding almost any number of layers, and even the TUNED truncated models quickly drop in performance with even one or two layers discarded. These results are consistent with our CKA analyses, which showed that the learned and task-tuned representations for ELECTRA do not share the same structure as those of RoBERTa and ALBERT. We speculate that this difference stems from the different pretraining objectives (replaced token detection is a binary prediction problem, whereas masked language modeling involves predicting a distribution over a large number of tokens), leading to differences in learned representations that propagate even to fine-tuned models. We leave further investigation of these differences to future work.
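The UNTUNED configuration can be sketched with a toy residual encoder: the existing output head is applied to the output of an intermediate layer and the remaining layers are never run. Everything below is a stand-in, not the paper's models. The layers are random `tanh` maps, and the later ones are deliberately scaled to be near no-ops so that the toy mirrors the hypothesis being tested; `encode` and `truncated_logits` are hypothetical names.

```python
import numpy as np


def encode(layers, x, num_layers=None):
    """Run the first `num_layers` residual layers (all of them if None)."""
    h = x
    for layer in layers[:num_layers]:
        h = h + layer(h)  # residual connection around each layer
    return h


def truncated_logits(layers, head_w, head_b, x, keep):
    """UNTUNED-style truncation: reuse the existing head on the output of layer `keep`."""
    return encode(layers, x, keep) @ head_w + head_b


rng = np.random.default_rng(0)
dim, n_layers, n_classes = 16, 6, 2
# Assumption built into the toy: the last three layers barely change h.
scales = [1.0, 1.0, 1.0, 1e-3, 1e-3, 1e-3]
weights = [rng.normal(size=(dim, dim)) * s / np.sqrt(dim) for s in scales]
layers = [lambda h, W=W: np.tanh(h @ W) for W in weights]
head_w, head_b = rng.normal(size=(dim, n_classes)), np.zeros(n_classes)

x = rng.normal(size=(32, dim))
full = truncated_logits(layers, head_w, head_b, x, keep=n_layers).argmax(axis=1)
for keep in range(n_layers, 0, -1):
    pred = truncated_logits(layers, head_w, head_b, x, keep).argmax(axis=1)
    print(keep, (pred == full).mean())  # agreement with the full model's predictions
```

The TUNED and TUNEDORIG configurations would differ only in refitting `head_w` and `head_b` on the truncated representation, taken from the fine-tuned or the base encoder respectively.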

Skipping Layers
We perform a smaller set of experiments on skipping intermediate layers in a model and measuring the impact on performance. We use fully fine-tuned RoBERTa models on a subset of the tasks we considered above, and evaluate task performance of the tuned models when we skip over contiguous spans of layers in the model without any further fine-tuning. We show the results for skipping every possible span of layers in Figure 4. Performance tends to drop as larger spans of layers are skipped, although in many cases skipping any single layer seems to make little to no impact to performance. The primary exception to this is the very first layer, where we observe that skipping just the first layer can heavily impact task performance, such as in CoLA, STS-B and Cosmos QA. On the other hand, we find that skipping several of the later layers can have minimal impact on performance, consistent with our results above. The profile of performance drops given the number of intermediate layers skipped also differs greatly across tasks: for instance, dropping more than two contiguous layers in the middle of the model seems to heavily impact MNLI and RTE performance, whereas for SST-2 the impact is not as large until 3-4 layers are skipped.
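The skipping procedure can be sketched as a forward pass that omits one contiguous span of layers, together with an enumeration of every span. The helper names are hypothetical, the 0-based indexing is a choice made for the sketch, and the integer "layers" exist only to make each skip easy to trace by hand; in the experiments each layer would be a full Transformer layer of the fine-tuned model.

```python
def forward_skipping(layers, x, first, last):
    """Apply `layers` in order, skipping indices first..last (inclusive, 0-indexed)."""
    h = x
    for i, layer in enumerate(layers):
        if first <= i <= last:
            continue  # skipped layer: its input passes straight to the next layer
        h = layer(h)
    return h


def all_spans(num_layers):
    """Every contiguous span (first, last) with first <= last."""
    return [(i, j) for i in range(num_layers) for j in range(i, num_layers)]


# Toy layers on integers so the effect of each skip is easy to follow.
layers = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3]
results = {span: forward_skipping(layers, 5, *span) for span in all_spans(len(layers))}
```

Evaluating task performance for each entry of `all_spans` and arranging the scores by `first` and `last` gives a triangular grid of the kind described for Figure 4.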

Figure 4: Layer Experiments: Task performance when skipping contiguous spans of Transformer layers, with the Y-axis and X-axis indicating the first and last (inclusive) skipped layers, with no further fine-tuning. Performance tends to drop as more layers are skipped, but in many cases skipping any single layer makes little to no impact to performance, except for the first layer. Consistent with results above, many of the higher layers can be skipped with minimal impact to performance.

Related Work
While CKA (Kornblith et al., 2019) was initially proposed as an interpretability method for computer vision models, it has more recently seen application to NLP models. Wu et al. (2020) applied CKA to pretrained Transformer models such as BERT and GPT-2, focusing on cross-model comparison. Our analysis builds on their findings, with greater focus on layer-wise comparisons and implications for fine-tuning and discarding layers. Sridhar and Sarah (2020) use CKA to measure the impact of a proposed model architecture change on the learned representations. Voita et al. (2019) and Merchant et al. (2020) apply similar representation similarity analyses to Transformers, with the latter also investigating freezing and dropping layers from models. More broadly, significant work has been done on better understanding and interpreting the capabilities of BERT-type models; Rogers et al. (2020) offer a thorough survey of this line of work. Of particular relevance to our work, research on model probing (Tenney et al., 2019b; Liu et al., 2019a; Tenney et al., 2019a) has studied the extent to which syntactic and semantic features are represented at different layers of BERT-type models.
Our results on model truncation also cohere with existing work on early exit in BERT models (Xin et al., 2020a,b; Zhou et al., 2020), wherein models are explicitly fine-tuned to dynamically skip the later layers of a BERT encoder and pass directly to the output head, often to reduce inference times of models. Our results somewhat differ, as we show that models can also be truncated or exited early without any explicit tuning. It has also been shown in the computer vision domain that models with residual connections behave akin to an ensemble of deep and shallow models (Veit et al., 2016).

Conclusion
We show a consistent pattern to the structure of representation similarity in task-tuned RoBERTa and ALBERT models, with strong representation similarity within clusters of earlier and later layers, but not between them. We further show that the later layers of task-tuned RoBERTa and ALBERT models can often be discarded without hurting task performance, verifying that the later layers of these models truly have similar representations. However, we find that ELECTRA models exhibit starkly different properties from the other two models, which prompts further investigation into how and why these models differ.

A Centered Kernel Alignment
Given two sets of representations X ∈ R^{N×d} and Y ∈ R^{N×d}, where N is the number of examples, CKA computes a similarity score between 0 and 1, with a higher score indicating greater similarity.

Figure 5 shows the FT-ORIG plots for all tasks and models. Figure 6 shows the FT[1]-FT[2] plots for all tasks and models. Figure 7 shows representation similarity between models.