Representation Projection Invariance Mitigates Representation Collapse

Fine-tuning contextualized representations learned by pre-trained language models remains a prevalent practice in NLP. However, fine-tuning can lead to representation degradation (also known as representation collapse), which may result in instability, sub-optimal performance, and weak generalization. In this paper, we propose Representation Projection Invariance (REPINA), a novel regularization method that maintains the information content of representations and reduces representation collapse during fine-tuning by discouraging undesirable changes to the representations. We study the empirical behavior of the proposed regularization in comparison to five comparable baselines across 13 language understanding tasks (the GLUE benchmark and six additional datasets). When evaluating in-domain performance, REPINA consistently outperforms the other baselines on most tasks (10 out of 13). We also demonstrate its effectiveness in few-shot settings and its robustness to label perturbation. As a by-product, we extend previous studies of representation collapse and propose several metrics to quantify it. Our empirical findings show that our approach is significantly more effective at mitigating representation collapse.


Introduction
Fine-tuning pre-trained language models has been shown to achieve remarkable performance on a variety of natural language processing (NLP) tasks (Kenton and Toutanova, 2019; Brown et al., 2020a; Zhang et al., 2022). A standard fine-tuning strategy involves adapting the pre-trained model to a supervised downstream task (Fig 1; left). Such a procedure can result in representation collapse (Aghajanyan et al., 2021; Zhou and Srikumar, 2022), a distortion of the pre-trained representations that limits their generalizability to other domains, styles or tasks.¹ An alternative approach to full model tuning is to fine-tune only several top layers, while keeping the rest of the model frozen (e.g., we could train solely a classification head, Fig 1; middle). This practice of freezing all/most of the model parameters can prevent unwanted changes to pre-trained representations, but it can also limit fine-tuning and negatively affect performance (Lee et al., 2019b; Kumar et al., 2021). This study aims to determine if it is possible to fine-tune the entire model without compromising representation quality.

¹ Our code is available at https://github.com/arazd/REPINA.
We introduce Representation Projection Invariance (REPINA), a regularization objective that prevents undesirable changes in the representations (Fig 2a). Our regularization applies an invariance loss on a tunable projection of the representation. In effect, this regularization allows the underlying representation to change mildly (e.g., shift and scaling) while not losing its expressivity (Fig 2b). Our regularization objective provides a knob that controls the amount of loss-free transformations allowed during fine-tuning.
We compare our method against several established regularization approaches which explicitly or implicitly address the issue of representation degradation (Section 5.1). We show that our approach consistently outperforms major fine-tuning methods on seven GLUE classification tasks and six additional non-GLUE tasks (Fig 3; left). We find that our approach is particularly effective in scenarios where data is limited (such as with only 250, 500, or 1000 examples), as the model is more likely to overfit and memorize the training data in these cases (Section 5.4). Furthermore, we thoroughly investigate fine-tuning under label perturbation (from 5% to 30% label noise) and observe that our approach is robust to incorrect labels, exceeding the performance of the standard fine-tuning procedure and common baseline methods (Section 5.3).
Finally, we quantify how much different methods mitigate the degradation of representations (Section 6). We use previously explored probing experiments (Aghajanyan et al., 2021), and propose a new set of metrics that quantify representation collapse in an objective way, without requiring extra datasets/training. We observe that REPINA shows the strongest resistance to representation degradation among all methods.

REPINA: Representation Projection Invariance
Our method avoids representation collapse by preventing undesirable changes in representations during the fine-tuning process. A straightforward implementation would anchor representations during fine-tuning to their pre-trained values. That is, the final loss would combine the standard fine-tuning objective and a regularizer of the deviation in representations:

$$\mathcal{L} = L + \lambda \sum_{x \in I} \big\| f_{pre}(x) - f_{fin}(x) \big\|_2^2, \qquad (1)$$

where L is the downstream task loss (e.g., cross entropy for classification tasks), λ is the regularization constant, I are the input samples of the task, and f_pre and f_fin are the representation functions defined by the pre-trained and fine-tuned networks. Optimizing full model parameters under this modified objective would prevent representation degradation. However, this formulation of the loss function could be very restrictive.
There exist various transformations of a representation that maintain its expressivity (such as a linear shift; Fig 2b). While such transformations do not change the information content of a representation, they incur a high regularization loss under equation 1.
To address this issue and allow flexibility in representations while preserving their expressive capacity, we propose representation projection invariance regularization (REPINA):

$$R(\Phi) = \min_{\phi \in \Phi} \sum_{x \in I} \big\| f_{pre}(x) - \phi(f_{fin}(x)) \big\|_2^2, \qquad (2)$$

which is added to the task loss L with coefficient λ. Here Φ is a class of dimension-preserving functions chosen before the fine-tuning process, and it defines the strength of the regularization. The intuition behind the regularization objective is to incentivize the representations to be invariant under some projection φ ∈ Φ: the pre-trained representations can be constructed from the fine-tuned representations by applying a function φ ∈ Φ. For instance, if Φ is the set of linear functions {φ | ∃W, b : φ(z) = W z + b}, then we bias the pre-trained representations to be linear transformations of the fine-tuned representations.
Thus, the regularization loss in the case of Fig 2b would be zero, since there exists a linear mapping from the fine-tuned representations to the pre-trained representations. However, the regularization loss for Fig 2a would be high, as there does not exist such a linear mapping from the fine-tuned to the pre-trained representations.
Choice of the class of functions Φ: Φ defines the strength of the regularizer. For instance, a singleton Φ containing only the identity function is the strongest regularizer, which keeps fine-tuned representations close to the pre-trained representations (equivalent to equation 1).
Conversely, for a rich Φ, e.g., deep and wide neural networks, φ can be chosen to reconstruct the lost information in the representation even after severe degradation; thus, it provides only weak regularization. The choice of Φ then depends on how prone fine-tuning is to over-fitting and how strong a regularization method is needed. For instance, a few-shot setting may require the strongest regularization, while larger training datasets may require milder regularization.
In this paper, we experiment with Φ containing identity function (REPINA I ) and shallow multilayer perceptrons (REPINA MLP ).
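To make the objective concrete, below is a minimal PyTorch sketch of the strictest instantiation (Φ containing only the identity, i.e., equation 1). The tensor and function names are our own illustration, not from the released code:

```python
import torch

def repina_identity_loss(h_pre: torch.Tensor, h_fin: torch.Tensor) -> torch.Tensor:
    # h_pre: [batch, dim] representations from a frozen copy of the pre-trained
    # encoder; h_fin: [batch, dim] representations from the encoder being tuned.
    # Penalizes any deviation of the fine-tuned representations (equation 1).
    return ((h_pre.detach() - h_fin) ** 2).sum(dim=-1).mean()

# Combined objective: task loss plus the invariance regularizer, weighted by
# the regularization constant lambda_reg (a tuned hyper-parameter):
# loss = task_loss + lambda_reg * repina_identity_loss(h_pre, h_fin)
```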
Choosing the right representation: Normally, sentence-level representations are obtained from the final encoder blocks of a transformer model. However, it may be more beneficial to use representations from the lower layers. In fact, Zhang et al. (2020) show that re-initializing the weights of the top layers of the encoder improves fine-tuning performance, suggesting that representation consistency may not be desirable for the top layers. Thus, we consider a variant of the regularization that uses representations from the intermediate layers of the encoder.
Explaining representation invariance regularization as implicitly learning multiple tasks: Consider the overfitting shown in Fig 2a again. It can be prevented by fine-tuning the representations for multiple tasks simultaneously instead of a single task; this well-known technique for preventing overfitting is called multi-task learning. It not only prevents overfitting of representations but can also improve generalization performance for all of the tasks.
We show that REPINA's regularization objective (equation 2) is equivalent to fine-tuning on multiple hypothetical tasks. Due to space constraints, we defer further discussion of the connection and the formal proof of equivalence to Appendix 10.

Related Work
Mitigating Representation Collapse: Aghajanyan et al. (2020) study representation collapse and propose two methods, R3F and R4F, to address it. R3F induces a bias towards solutions with a locally smooth prediction function, and R4F extends R3F by adding a Lipschitzness constraint on the top classification layer. Other methods that implicitly target representation collapse are FreeLB and SMART: FreeLB uses adversarial training to improve fine-tuning (Zhu et al., 2020), and SMART is a trust-region-based method that avoids aggressive updates during fine-tuning (Jiang et al., 2020). R3F (Aghajanyan et al., 2020) has been shown to outperform both of these methods; thus, we only include R3F in our set of baselines.
A method that specifically targets representations during fine-tuning is Supervised Contrastive Learning (SCL). SCL induces representations of examples with the same label to be close to each other and far from the examples of other classes (Gunel et al., 2020). A major disadvantage of SCL is its requirement for a large mini-batch size and, hence, heavy memory consumption. We implement a memory-efficient SCL version but exclude the original implementation from the baselines due to computational cost (see Appendix 17). Another method that can potentially mitigate representation collapse, and which forms part of our baselines, is data augmentation. It can be done via back-translation, synonym replacement, random deletion, and synthetic noise, and is known to improve generalization performance (Feng et al., 2021; Wei and Zou, 2019).
Catastrophic Forgetting (Kirkpatrick et al., 2017) is a phenomenon closely related to representation collapse. In a sequential training setting, it refers to forgetting information learnt from previous tasks while being trained on a new task. In our context, this means forgetting the pre-training language modeling task while training for the fine-tuning task. In contrast to catastrophic forgetting, we measure representation collapse as the loss in expressive power of the representations, irrespective of the performance on the pre-training task. A method known to alleviate catastrophic forgetting is Weight Consolidation (Chen et al., 2020; Kirkpatrick et al., 2017). It regularizes the fine-tuning process by encouraging fine-tuned weights to be close to the pre-trained weights. In contrast to weight consolidation, REPINA does not put direct constraints on weight updates, but rather tries to control the structural changes in representations.
Apart from working with representations instead of model weights, the key difference between our method and weight consolidation is that our method does not always require fine-tuned representations to be close to the pre-trained representations. Regularizing fine-tuned and pre-trained representations to be close is a special case and the strictest form of our method (REPINA I ). The knob (Φ) in our method can be chosen to range from this extreme to the other extreme of no regularization. Note that this knob is not the same as the regularization constant, which weighs the regularization loss relative to the cross-entropy loss.
Due to limited space, we discuss further related work in Appendix 9.

Experimental Set Up
4.1 Our Methods: REPINA I & REPINA MLP

Recall our methods REPINA I and REPINA MLP introduced in Section 2. For REPINA I , we observe that regularizing intermediate-layer representations (5th, 10th or 20th from the input) performs better than regularizing the top-layer representation (near the output, before the classifier). Thus, the regularization objective for REPINA I is:

$$R(\Phi = \{\mathrm{id}\}) = \sum_{x \in I} \big\| f^{\ell}_{pre}(x) - f^{\ell}_{fin}(x) \big\|_2^2,$$

where f^ℓ_pre and f^ℓ_fin are the representations from the ℓ-th layer (counted from the input) of the pre-trained and fine-tuned model. The choice of ℓ is a hyper-parameter with possible values 5, 10 and 20. Layer 5 is most effective for small training datasets and layer 20 is most effective for large training datasets (see Appendix 14).
Due to computational limitations, we experiment with only top-layer representations for REPINA MLP . Thus, the regularization loss for REPINA MLP , R(Φ = MLP), is:

$$R(\Phi = \mathrm{MLP}) = \min_{\Theta} \sum_{x \in I} \big\| f_{pre}(x) - \mathrm{MLP}_{\Theta}(f_{fin}(x)) \big\|_2^2,$$

where f_pre and f_fin are the representations from the top layer (before the classifier) of the pre-trained and fine-tuned model, and Θ are the parameters of a multi-layer perceptron (MLP) trained jointly with the model. We set the depth of the MLP to 2, keeping the width equal to the representation dimension. By varying the depth from 1 to 5, we observe that for smaller training datasets, lower depth performs better. Training with large datasets is robust to the choice of depth (see Appendix 13).
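As an illustration, below is a minimal PyTorch sketch of this regularizer. The class and variable names are our own, and details beyond the stated depth and width (e.g., the ReLU activation) are assumptions:

```python
import torch
import torch.nn as nn

class RepinaMLPRegularizer(nn.Module):
    """phi_Theta for REPINA_MLP: a depth-2 MLP with width equal to the
    representation dimension, trained jointly with the encoder."""

    def __init__(self, dim: int):
        super().__init__()
        # The ReLU between the two linear layers is our assumption.
        self.phi = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h_fin: torch.Tensor, h_pre: torch.Tensor) -> torch.Tensor:
        # || f_pre(x) - phi_Theta(f_fin(x)) ||_2^2, summed over dimensions and
        # averaged over the batch; h_pre comes from a frozen pre-trained encoder.
        return ((h_pre.detach() - self.phi(h_fin)) ** 2).sum(dim=-1).mean()
```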

Baselines
We use a diverse range of baselines for our study. STD++ is an improved variant of the standard fine-tuning scheme that includes bias correction in AdamW, following Zhang et al. (2020) and Mosbach et al. (2020), who show that the omission of bias correction is a major cause of instability in language model fine-tuning.
R3F (Aghajanyan et al., 2020) is a local smoothness regularization that prevents aggressive model updates by restricting the divergence of outputs upon input perturbation. For a model f(·) and input token embeddings x, R3F adds a regularization term KL_S(f(x) ∥ f(x + z)) to the loss function, where KL_S is the symmetric Kullback-Leibler divergence and the noise z is sampled from a normal distribution.
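A hedged PyTorch sketch of this term (our own naming; the noise scale follows the δ = 1e-5 covariance reported in Appendix 11):

```python
import torch
import torch.nn.functional as F

def r3f_term(logits_fn, embeds: torch.Tensor, delta: float = 1e-5) -> torch.Tensor:
    # z ~ N(0, delta * I) added to the token embeddings.
    noise = torch.randn_like(embeds) * (delta ** 0.5)
    log_p = F.log_softmax(logits_fn(embeds), dim=-1)
    log_q = F.log_softmax(logits_fn(embeds + noise), dim=-1)
    # Symmetric KL: KL(p || q) + KL(q || p), computed from log-probabilities.
    return (F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
            + F.kl_div(log_p, log_q, log_target=True, reduction="batchmean"))
```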
ReInit (Zhang et al., 2020) improves fine-tuning performance by re-initializing the top-k layers of the encoder (closest to the output) with Gaussian random samples from N(0, 0.02²). Following the original study, we perform a hyper-parameter search over k = 2, 4 or 6.
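With a HuggingFace BERT model, this re-initialization can be sketched as follows. We reuse the model's own `_init_weights` routine, which handles LayerNorm and biases appropriately; the exact released implementation may differ:

```python
from transformers import BertForSequenceClassification

def reinit_top_k_layers(model: BertForSequenceClassification, k: int) -> None:
    # Re-initialize the k encoder layers closest to the output (ReInit);
    # k is searched over {2, 4, 6}.
    for layer in model.bert.encoder.layer[-k:]:
        layer.apply(model._init_weights)
```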
Data Augmentation (DA) generates augmented samples by adding noise to the training data (keeping the label intact) (DeVries and Taylor, 2017).
For a detailed description and regularization coefficients of each baseline method, see Appendix 11.

Datasets
We evaluate methods on the GLUE benchmark (Wang et al., 2018) and six additional non-GLUE datasets (Table 2): biomedical relation extraction on CHEMPROT (Kringelum et al., 2016), sentiment classification on YELP (Zhang et al., 2015a) and IMDB (Maas et al., 2011), citation intent classification on SCICITE (Cohan et al., 2019), language inference on SCITAIL (Khot et al., 2018), and article topic classification on AGNEWS (Zhang et al., 2015b). For each task, we use its standard performance metric. On these 13 datasets, we conduct a variety of experiments with many and few supervision instances.
To keep the cost of fine-tuning manageable on extremely large datasets (such as MNLI and QQP), we limit their training sets to 10,000 data points and mark them with the suffix "-10k" henceforth.
For datasets with no available test set labels, we use the development set to report performance. For hyper-parameter selection, we use a subset of the original training split (equal in size to the validation set) that is not used for training.

Fine-tuning Settings
Due to the large scale of the experiments, and in order to have a meaningful comparison across approaches, we consistently use the BERT-large model for implementing both our proposed algorithm and the baselines. Existing works such as Zhang et al. (2020) and Mosbach et al. (2020) use similar experimental setups. Additionally, to verify the generality of our findings to other models, we performed limited experiments on RoBERTa-base, where we observe similar performance gains.
We fine-tune all models for 5 epochs (unless otherwise specified) at a learning rate of 2e-5, and report performance with 5 different seeds. Due to resource constraints, and in contrast to prior works (Kenton and Toutanova, 2019; Aghajanyan et al., 2020; Chen et al., 2020), we do not search for the optimal learning rate for each method-task combination. To verify the impact of this choice, we perform limited experiments selecting the best learning rate, dropout and number of epochs for each method on a subset of tasks (Appendix 15). We observe gains similar to those reported in the main paper.
For each method, we select optimal hyperparameters by performing evaluation on the unused fraction of the training set (see Appendix 12).
Since standard fine-tuning is susceptible to failed runs that substantially lower the average resulting performance (Mosbach et al., 2020), we filter out these failed runs and report average performance over 5 successful runs. We define a run as failed if its performance is close to or lower than that of the majority classifier (Dodge et al., 2020), a dummy model that always predicts the label of the majority class in the dataset. For each metric in Table 2, we define a threshold close to the performance of the majority classifier; a fine-tuning run is a failed run if the performance on the unused part of the training dataset is below this threshold. See Section 12.2 for the exact thresholds.
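In pseudocode, the filtering rule amounts to the following check; `threshold` here is a placeholder for the per-metric values listed in Section 12.2:

```python
def is_failed_run(score_on_unused_train: float, threshold: float) -> bool:
    # A run fails if its performance on the unused part of the training set
    # does not exceed the majority-classifier-based threshold.
    return score_on_unused_train < threshold

# Example: keep only successful runs before averaging.
# successful = [s for s in run_scores if not is_failed_run(s, threshold)]
```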

Results: Generalization Performance
In this section, we present experimental results with the baselines introduced in the earlier section.

Full dataset -Generalization performance
Table 1 shows that REPINA models outperform the baselines consistently across a variety of tasks: our methods outperform all others on 10/13 tasks. Both REPINA I and REPINA MLP outperform the baseline methods in terms of mean performance, with improvements over the corrected fine-tuning strategy STD++ of 1.7 and 2.0 points respectively on the GLUE benchmark, and 4.3 and 4.5 points on the non-GLUE benchmark.

Analyses on Fine-tuning Stability
Similar to the prior literature (Dodge et al., 2020; Mosbach et al., 2020; Wang et al., 2018), we observe that the standard fine-tuning procedure is prone to instability and sub-optimal convergence, leading to failed runs. Recall that we formally define a fine-tuning run as failed if the resulting performance is close to that of the majority classifier.
In the previous section, we reported the mean performance of only successful runs (complement of failed runs).Figure 4 shows the fraction of runs that were successful for each method.
We note that REPINA I has the fewest failed runs (the maximum number of successful runs). Moreover, if we do not filter out failed runs, our methods perform even better relative to all the baseline methods: REPINA I achieves an average 2.6 percentage point improvement over the next best baseline method (WC). Thus, we conclude that our methods demonstrate higher stability and a smaller fraction of failed runs than other approaches (additional experiments in Table 27 in Appendix 20).

Figure 4: Fraction of successful runs across all tasks. A run is defined as successful if its test performance is higher than the performance of a majority classifier. Our proposed regularization (RP I /RP M ) increases the fraction of successful runs, hence leading to more stable fine-tuning behavior.

Robustness to Label Perturbation
Real-world data can often contain mislabeled samples, which can hinder the training process. Hence, robustness to label noise is a desirable quality of fine-tuning approaches. Here, we study the performance of the fine-tuning methods under label perturbation. We introduce label noise as follows: let C = {1, . . . , c} be the set of labels and X = {(x, y)} be the true dataset for the fine-tuning task. Our fine-tuning method has access to a noisy dataset X̃ = {(x, ỹ)}, where ỹ = y with probability 1 − p, and ỹ is sampled uniformly from {1, . . . , c} \ {y} with probability p.
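This perturbation can be reproduced with a few lines of Python (labels assumed to be integers in {0, . . . , c − 1}; the seed handling is our own):

```python
import random

def perturb_labels(labels, num_classes: int, p: float, seed: int = 0):
    # With probability p, replace y with a uniformly random *different* class;
    # with probability 1 - p, keep the true label.
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            noisy.append(rng.choice([c for c in range(num_classes) if c != y]))
        else:
            noisy.append(y)
    return noisy
```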
REPINA I and REPINA MLP show the highest resistance to label perturbation, staying closest to the original performance upon introducing 5-10% noise to the labels (Table 3). The second most resistant approach, WC, is also conceptually close to our method, as it discourages the fine-tuned weights from deviating from the pre-trained weights.

Analyses on Few-shot Fine-tuning
To investigate our methods' robustness to small dataset sizes, we study REPINA MLP and REPINA I performance in limited data settings (250/500/1000 training data points). We fix the same data subset across all models to avoid performance changes related to data variability. Since fine-tuning in the few-shot setting is particularly prone to instability, and the performance on a single dataset can distort the mean statistics for the entire collection, we use average rank as a more stable metric to compare different methods. A method's rank for a task corresponds to the position of the method in a list of all methods sorted by performance on that dataset. The minimal and best possible rank is 1. The average rank of a method is obtained by averaging its ranks across all tasks.
We observe in Table 4 that REPINA I is the most effective method in the few-shot setting as measured by average rank. See Appendix 19 for a detailed analysis. Overall, we find that REPINA MLP yields performance gains on large-scale datasets, whereas REPINA I is effective for few-sample fine-tuning (since the newly introduced parameters in REPINA MLP are undertrained when the training data is limited). For a wall-time analysis, see Appendix 24. For experiments on hyper-parameter optimization over the learning rate, batch size and other hyper-parameters, see Appendix 15; we observe that RP I /RP M are the best performing models for all the datasets even in this setting. We also observe that our technique applied to RoBERTa achieves similar gains; our method is model-independent (see Appendix 16 for details).

Degradation of Representations: Analytical Perspective

Here we quantify representation collapse.

Probing Representation Collapse
We follow the setting of Aghajanyan et al. (2020) for studying representation collapse with probing experiments as follows: (i) fine-tune the model on a downstream task A; (ii) freeze the encoding layers and train the top linear layer for a different task B. Low performance in the second step implies representation collapse in the first step. To assess the robustness of the proposed approach to representation collapse, we perform a series of probing experiments. In our experiments, we use four GLUE and four non-GLUE datasets in the first step, and in the second step all datasets except the one used in the first step (Table 5). We observe that REPINA MLP and REPINA I show high resistance to representation collapse, outperforming other approaches in 6/8 cases (Table 5). For instance, fine-tuning on QNLI-10k in the first step with REPINA MLP results in a mean performance of 49.5 in the second step, whereas the next best baseline results in a mean performance of 44.5.
Note that the auxiliary tasks considered here are used only to evaluate the degradation of representations; they are not available during fine-tuning. During the fine-tuning stage, only one task dataset is available. Thus, we do not compare our methods to rehearsal-based learning methods.

Diversity of Fine-tuned Representations
Probing experiments rely on the availability of extra fine-tuning tasks and, thus, are limited in the amount of information they can assess, requiring additional fine-tuning rounds. Here, we propose metrics that can reliably quantify the expressive power of fine-tuned representations by capturing their geometric diversity. The intuition behind our metrics is the following: if all representations lie in a small-dimensional space, such as a straight line or a single point, then they contain little information and are not expressive. But if representations are well spread out and span the entire representation space, then they possess high information capacity.

We illustrate representation collapse metrics from the geometric perspective in Figure 5. The top three plots show three different distributions of data points (representations). The left distribution spans all three dimensions, indicating the highest degree of data diversity. Since the data points lie equally in all dimensions, all three eigenvectors (V(λ_i)'s) will be of equal importance and all three eigenvalues (λ_i's) will be approximately equal. In contrast, the central distribution spans two axes, leading to a smaller third eigenvalue that corresponds to the "redundant" dimension. The right distribution has all the data points collapsed along one axis, resulting in one eigenvalue being substantially higher than the others. Overall, a more uniform distribution of the eigenvalues corresponds to better representation matrix diversity. In the bottom bar plot, we show the distribution of the top-20 eigenvalues of the fine-tuned representations with REPINA I and STD++ after training on the QQP dataset with 250 points (Figure 5). REPINA I preserves a closer-to-uniform eigenvalue distribution, while STD++ results in representations with a much higher first eigenvalue, indicating representation collapse. Thus, REPINA I yields better representation matrix diversity and less representation collapse than STD++.
Next, we formalize this intuition by defining a representation diversity metric based on the geometric diversity of sentence-level representations.
Diversity Metrics: We compute the Gram matrix G of the representations, where G_{ij} = ⟨z_i, z_j⟩ for sentence-level representations z_1, . . . , z_N, and let λ_1 ≥ . . . ≥ λ_N be its eigenvalues. To measure the diversity of representations, we use the geometric mean (GM) and harmonic mean (HM) of the top-k eigenvalues:

$$\text{GM-}k = \Big(\prod_{i=1}^{k} \lambda_i\Big)^{1/k}, \qquad \text{HM-}k = \frac{k}{\sum_{i=1}^{k} 1/\lambda_i}.$$

These metrics attain a high value if the representations are well spread out, and are zero or close to zero if all/most of the representations lie in a lower-dimensional subspace. In contrast to the arithmetic mean, the geometric and harmonic means are not as sensitive to outliers. Computed over all eigenvalues, these metrics turn out to be always zero, as the representations typically lie in an approximately 20-dimensional space. Hence, we use only the top-k eigenvalues λ_i, for k = 5, 10, 20, where GM and HM are bounded away from 0.
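A minimal NumPy sketch of these metrics (our own implementation; the small epsilon guard is an assumption for numerical stability):

```python
import numpy as np

def gm_hm_k(reps: np.ndarray, k: int = 5, eps: float = 1e-12):
    # reps: [N, d] matrix of sentence-level representations (rows are z_i).
    gram = reps @ reps.T                      # G_ij = <z_i, z_j>
    eigvals = np.linalg.eigvalsh(gram)[::-1]  # eigenvalues, descending; G is PSD
    top = np.clip(eigvals[:k], eps, None)     # guard against numerical zeros
    gm = float(np.exp(np.log(top).mean()))    # GM-k: geometric mean of top-k
    hm = float(k / (1.0 / top).sum())         # HM-k: harmonic mean of top-k
    return gm, hm
```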
We compare REPINA I and REPINA MLP to the existing baselines using GM-k and HM-k with k = 5, 10, 20 (Table 6). Low GM-k and HM-k indicate representation collapse, where the fine-tuned representations lie in a low-dimensional space. High GM-k and HM-k indicate that representations are well spread out and span a higher-dimensional space. Table 6 shows that REPINA I results in the most diverse representations among all the methods and incurs the least representation collapse (see Appendix 23 for detailed results).

Conclusion
In this paper, we propose a novel representation invariance regularizer targeted at avoiding representation degradation during fine-tuning. It has a knob that can control the strength of the regularization. We experiment with two choices of this knob, REPINA I and REPINA MLP , and show that they both achieve significant performance gains on the GLUE benchmark and six additional tasks, including few-shot learning settings and label noise conditions. We also study the degradation of representations during fine-tuning (representation collapse) and propose new metrics to quantify it. Our methods reduce representation collapse both in terms of the newly proposed metrics and the previously studied metrics that use auxiliary task data.

Limitations
We conduct extensive experiments in our paper and show that the proposed approaches lead to significant gains. However, we did not exhaust all avenues of investigation due to limited resources. Firstly, we could experiment with choices of φ other than those in REPINA I (φ is the identity) and REPINA MLP (φ is an MLP). Other choices of φ may include deeper networks or transformer-based models, which could potentially improve performance even further. Secondly, we investigated how representations from intermediate layers affect REPINA I performance, and observe major improvements over using top-layer representations. Similar experiments for REPINA MLP may also yield further gains. Also, in REPINA I we could experiment with more choices of the representation layer (we tried the 5th, 10th and 20th layers). Since lower-layer representations are more computationally efficient to regularize (they do not require a full forward pass through the model), another interesting direction is finding a trade-off between the computational efficiency of the regularizer and the performance gains. This study primarily focused on medium-sized models due to computational constraints; it is the outcome of extensive experimentation, which would have been impractical with larger models given limited computational resources. Although we have experimented with masked language models, we believe the findings apply to other architectures that follow similar principles. We anticipate that future research will provide more insight into these issues.

Ethical Considerations
REPINA aims to improve performance and retain the quality of representations during fine-tuning. Practically, our method can help in suppressing potential biases of language models after fine-tuning on datasets that include biased content. REPINA can achieve this by reducing the collapse of representations and preserving their pre-trained knowledge. All our experiments are based on publicly available datasets and, hence, there is no immediate concern about harmful content.

Due to limited space in the main text, part of the related work is below; we will reintroduce it to the main text upon having more space.

(David et al., 2020). Data-centric approaches involve pseudo-labeling (Abney, 2007), using auxiliary tasks (Phang et al., 2018), and data selection (Moore and Lewis, 2010; Wang et al., 2017).
Mixout is a variant of Dropout regularization that replaces dropped neurons with the pre-trained model's neurons, thereby mixing pre-trained and fine-tuned parameters.

Measures of representation:
Aghajanyan et al. (2020) measure the quality of fine-tuned representations by fitting them on auxiliary tasks. CKA (Kornblith et al., 2019) measures correspondences between representations from different networks. Wu et al. (2020) study the similarity of internal representations and attention of different trained models using new similarity measures. Merchant et al. (2020) also study what happens during fine-tuning via probing classifiers and representation similarity analysis; they argue that fine-tuning does not necessarily incur catastrophic forgetting, and they analyze the effect of fine-tuning on different tasks on the changes in representations. Rongali et al. (2020) show that rehearsal-based learning can improve performance and perform better than Weight Consolidation. However, even though our method is inspired by multi-task learning and performs pseudo multi-task learning implicitly, we do not have access to any dataset beyond the single fine-tuning task. Thus, rehearsal-based learning does not apply in our setting.
Parameter-Efficient Fine-tuning: Rather than storing a model for each fine-tuning task, some approaches keep most of the model parameters frozen and only tune a subset of parameters. Rebuffi et al. (2017) and Pfeiffer et al. (2020) insert adapter layers between the layers of the pre-trained model and keep the original parameters frozen. Guo et al. (2021) constrain the change in model parameters to be sparse. Aghajanyan et al. (2021) learn a low-dimensional vector whose projection onto the high-dimensional parameter space, added to the pre-trained model parameters, yields the fine-tuned model. Zaken et al. (2022) show that fine-tuning only the bias parameters can also lead to competitive performance.
Multi-task learning: In multi-task learning, we jointly fine-tune for many tasks, where each task has its own classification head but shares the backbone with the other tasks (Caruana, 1997; Ruder, 2017; Zhang and Yang, 2021; Liu et al., 2019). This approach, however, requires access to a large amount of labeled data from other tasks, which is typically unavailable. Since our method focuses on the scenario where data for only a single fine-tuning task is available, we compare against works of a similar nature and do not provide an extensive comparison with these works. The clear difference between the methods likely makes them complementary, but exploring this is outside the scope of this paper.
Text-to-text fine-tuning: Autoregressive models such as T5 and GPT-3 cast fine-tuning in a text-to-text format (Raffel et al., 2020; Brown et al., 2020b). They can work in the few-shot learning setting by framing the fine-tuning task as the pre-training task. Autoregressive models make it easier to sample text, whereas masked language models such as BERT and RoBERTa are restricted to filling in blanks.

Connection to the Pseudo Meta-learning Idea: We view the pre-trained model as a multi-task learner over an infinite number of pseudo tasks T_1, T_2, . . .; that is, for each i there exists a linear layer that fits the pre-trained representations to pseudo task T_i. Our aim is to fine-tune the representations on a specific downstream task B while preserving their ability to perform well on T_1, T_2, . . .; namely, there must exist a linear model on the fine-tuned representations for each pseudo task T_i. The linear classification head for T_i does not have to be the same for the pre-trained and fine-tuned representations, but their outputs should be close. Let the training samples be x_1, . . . , x_N for the fine-tuning task B, and let z^j_pre, z^j_fin ∈ R^d be the representations (output of the encoder layer) for the pre-trained model and the model being fine-tuned. Let F be a family of functions operating on representations, such that each function signifies a task T_i. We can formalize our objective as follows: for any function g ∈ F on the pre-trained representations, there must exist a corresponding h ∈ F on the fine-tuned representations giving the same output. During fine-tuning, we do not expect an exact agreement and allow representations to lose some expressive power. Hence, we relax the constraint and consider the representation loss error:²

$$\text{For } g \in \mathcal{F}, \quad L_g = \min_{h \in \mathcal{F}} \sum_{j=1}^{N} \mathrm{loss}\big(g(z^j_{pre}),\, h(z^j_{fin})\big).$$

² If the T_i's were actual pre-training tasks and their data were available, we would compute the loss on the inputs of T_i. In the absence of that, we approximate it by the loss on the (unlabeled) inputs of the given fine-tuning task.
If g comes from a distribution D over F, then our representation loss is E_{g∼D}[L_g]. Here, we consider F to be the set of linear regression tasks, which are characterized by vectors u ∈ R^d, with tasks sampled from a standard Gaussian distribution with mean 0 and unit variance. We take loss to be the natural ℓ₂ loss function.
The inner minimization has a closed-form solution, and the resulting expectation can be reduced to obtain

$$L = \big\| Z_{pre} - Z_{fin} (Z_{fin}^T Z_{fin})^{\dagger} Z_{fin}^T Z_{pre} \big\|_F^2,$$

where the matrix Z_pre ∈ R^{N×d} has j-th row z^j_pre and Z_fin ∈ R^{N×d} has j-th row z^j_fin. Here ||·||₂ denotes the ℓ₂-norm of a vector, ||·||_F the Frobenius norm of a matrix, and A† the pseudo-inverse of a symmetric matrix A. The derivation of the inner minimizer and the reduction of the expectation can be found in Theorem 1 in Appendix 10.1. The loss function L is not easily decomposed into mini-batches, making it challenging to optimize directly. We therefore find an equivalent optimization problem whose objective is decomposable into mini-batches and whose optimum equals the representation loss L:

$$\tilde{L}(W) = \sum_{j=1}^{N} \big\| z^j_{pre} - W z^j_{fin} \big\|_2^2, \qquad \min_{W \in \mathbb{R}^{d \times d}} \tilde{L}(W) = L.$$

We can minimize L̃ over W along with the fine-tuning objective. The derivation of this equivalence can be found in Theorem 2 in Appendix 10.1. We can interpret the above loss as follows: there exists a linear function (φ_W : x → W x) which, applied to the fine-tuned representations, yields the pre-trained representations. We can generalize it to a class of functions Φ:

$$R(\Phi) = \min_{\phi \in \Phi} \sum_{j=1}^{N} \big\| z^j_{pre} - \phi(z^j_{fin}) \big\|_2^2. \qquad (3)$$

This corresponds to the aggregate loss for the pseudo tasks T_i if, instead of using a linear head for pseudo tasks on the fine-tuned representations, we use a function φ ∈ Φ followed by u_i. Thus, Φ defines the strength of the regularizer. A singleton Φ containing only the identity function enforces the use of the same linear head u_i for task T_i on both pre-trained and fine-tuned representations. This results in the strongest regularizer, which keeps fine-tuned representations close to the pre-trained representations. On the other hand, if Φ is a set of very deep neural networks, then we allow a deep neural network (followed by u_i) to fit the fine-tuned representations for task T_i. Such a neural network will almost always exist even if the fine-tuned representations have degraded significantly. Thus, it is a weak regularizer and puts only mild constraints on the change in the structure of representations.
Overall, this section can be summarized as follows: (i) L is the aggregate error in fitting fine-tuned representations to the pseudo pre-training tasks T_i; (ii) Φ controls the amount of structural change in representations allowed during fine-tuning.
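The equivalence can be sanity-checked numerically. The following NumPy snippet (synthetic data, our own naming) verifies that the mini-batch-decomposable objective attains the closed-form value of L:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 8
A = rng.normal(size=(N, d))  # rows: z_pre^j (pre-trained representations)
B = rng.normal(size=(N, d))  # rows: z_fin^j (fine-tuned representations)

# Decomposable objective: min_W sum_j ||z_pre^j - W z_fin^j||^2.
# lstsq solves B @ X ~= A, i.e. X = W^T (the optimal linear map, transposed).
X, *_ = np.linalg.lstsq(B, A, rcond=None)
loss_opt = np.linalg.norm(A - B @ X, "fro") ** 2

# Closed form: || Z_pre - Z_fin (Z_fin^T Z_fin)^+ Z_fin^T Z_pre ||_F^2.
P = B @ np.linalg.pinv(B.T @ B) @ B.T  # projection onto the column space of B
loss_closed = np.linalg.norm(A - P @ A, "fro") ** 2

assert np.isclose(loss_opt, loss_closed)
```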

Detailed Derivations
Lemma 1. For B ∈ R^{N×d} with rows b_1, . . . , b_N and y ∈ R^N with j-th entry y_j, the least squares problem

$$\min_{v \in \mathbb{R}^d} \sum_{j=1}^{N} (y_j - v^T b_j)^2$$

is minimized at v* = (B^T B)† B^T y.

Proof. The loss Σ_{j=1}^N (y_j − v^T b_j)² = ||y − Bv||₂² is a smooth function of v; hence the minimum is achieved at a stationary point, where the gradient vanishes: B^T B v = B^T y, so v* = (B^T B)† B^T y, where † is the pseudo-inverse, which equals the inverse if B^T B is invertible and otherwise restricts v* to the space spanned by B. Note that Σ_{j=1}^N b_j b_j^T = B^T B and Σ_{j=1}^N b_j y_j = B^T y.

Lemma 2. For any matrix M ∈ R^{d×d} and u ∼ N(0, I_d), E_u[||M u||₂²] = ||M||²_F, where ||M||_F is the Frobenius norm of the matrix M.

Proof. Let m_{i,j} be the (i, j)-th entry of M and u_j the j-th entry of u. Then ||M u||₂² = Σ_i (Σ_j m_{i,j} u_j)². Since u is a Gaussian random variable with mean 0 and identity covariance matrix, the cross terms vanish in expectation and E[||M u||₂²] = Σ_{i,j} m²_{i,j} = ||M||²_F.

Proof of Theorem 1. To simplify notation, let a_j = z^j_pre and b_j = z^j_fin, let B = Z_fin ∈ R^{N×d} be the matrix with j-th row b_j, and let A = Z_pre ∈ R^{N×d} be the matrix with j-th row a_j. For a task u, the targets of the inner minimization are y_j = u^T a_j, i.e., y = A u. From Lemma 1, the inner minimizer is v* = (B^T B)† B^T A u, giving L_u = ||A u − B v*||₂² = ||(A − B (B^T B)† B^T A) u||₂². Applying Lemma 2 with M = A − B (B^T B)† B^T A and substituting back A = Z_pre and B = Z_fin, we get L = E_u[L_u] = ||Z_pre − Z_fin (Z_fin^T Z_fin)† Z_fin^T Z_pre||²_F.

Proof of Theorem 2. With the notation above, let W ∈ R^{d×d} have i-th row w_i. We need to compute min_W Σ_{j=1}^N ||a_j − W b_j||₂², which decomposes into d independent least squares problems, one per output coordinate: min_{w_i} Σ_{j=1}^N (a_{j,i} − w_i^T b_j)², where a_{j,i} is the i-th entry of a_j. Applying Lemma 1 with y = c_i, the i-th column of A (whose j-th entry is a_{j,i}), gives w_i* = (B^T B)† B^T c_i. Stacking the coordinates yields min_W Σ_j ||a_j − W b_j||₂² = ||A − B (B^T B)† B^T A||²_F = L, which finishes the proof of the theorem.

Lemma 3. For a matrix M ∈ R^{N×N} and a set of vectors v_1, . . . , v_k ∈ R^N, Σ_{i=1}^k v_i^T M v_i = tr(V^T M V), where V ∈ R^{N×k} is the matrix with columns v_1, . . . , v_k.

Proof. The i-th diagonal entry of V^T M V is v_i^T M v_i; summing the diagonal entries gives the trace.
11 Baselines -detailed

Weight Consolidation (Kirkpatrick et al., 2017; Chen et al., 2020): Let P be the set of all model parameters and B be the subset of bias parameters (the affine components of the linear transformations) of the model. For a parameter θ_i, let θ^pre_i be its pre-trained value and θ^fin_i its value during the fine-tuning process. Then, the regularization loss is

$$R_{WC} = \sum_{\theta_i \in P \setminus B} \big(\theta^{fin}_i - \theta^{pre}_i\big)^2.$$
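A PyTorch sketch of this penalty (assuming, as the definitions of P and B above suggest, that bias parameters are excluded; identifying biases by parameter name is our own simplification):

```python
import torch

def weight_consolidation(model, pretrained_state) -> torch.Tensor:
    # Sum of squared deviations of non-bias parameters from pre-trained values;
    # pretrained_state maps parameter names to their pre-trained tensors.
    loss = 0.0
    for name, param in model.named_parameters():
        if name.endswith("bias"):
            continue  # parameters in B (biases) are skipped
        loss = loss + ((param - pretrained_state[name].detach()) ** 2).sum()
    return loss
```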
Local smoothness-inducing regularization R3F (Aghajanyan et al., 2020): For a classification problem, let f be the probability prediction function of the model being fine-tuned. Its input is the input to the first BERT encoder layer (the output of the token embedding layer). Let x_1, . . . , x_N be the outputs of the token embedding layer for the inputs of the fine-tuning task. For i ∈ [N], let ε_{δ,i} be a Gaussian noise term with mean 0 and covariance matrix δI; we set δ = 1e−5. Then, the regularization loss is

$$R_{R3F} = \sum_{i=1}^{N} KL_S\big(f(x_i)\, \|\, f(x_i + \varepsilon_{\delta,i})\big).$$

Data Augmentation: Let x_1, . . . , x_N be the outputs of the token embedding layer for the inputs of the fine-tuning task and y_1, . . . , y_N their associated labels. In data augmentation, we add noise to the data during the training process:
$$L_{DA} = \sum_{i=1}^{N} L_{CE}\big(f(x_i + \varepsilon_{\delta,i}),\, y_i\big),$$

where L_CE is the cross-entropy loss, f is the prediction function of the model, and ε_{δ,i} is Gaussian noise with mean 0 and covariance matrix δI added to x_i. We set δ = 1e−5. Our implementation is based on the HuggingFace library.

Experimental Setup
To avoid the excessively high computational cost of fine-tuning on large-scale datasets, we limit their full training sets to 10,000 data points (marked with the suffix -10k in Table 2). For few-sample experiments, we fix the same data subset across all models to avoid performance changes related to data variability. Since test set labels are unavailable, we use the development set to report performance.
Batch Size: Different methods have different memory requirements. For instance, R3F has the highest footprint, which limits the batch size, as we cannot process too many inputs at the same time.
Table 8 shows the batch size used for each dataset in our experiments.

Experiments for RoBERTa
In the results above, we observed that our methods achieve significant gains over the baseline methods for BERT-large. Table 19 shows the results when we compare REPINA I against STD++ on RoBERTa-base. We fine-tune the model for 10 epochs with a regularization coefficient of 0.01 and a learning rate of 1e-5. The mean and standard deviation across three runs are reported. We observe that REPINA I improves over STD++ in all cases.

Table 19: Results for RoBERTa-base on 3 GLUE datasets.
Let a mini-batch have m examples (x_1, y_1), . . . , (x_m, y_m), and let z_1, . . . , z_m be the representations (output of the encoder) from the model being fine-tuned. Supervised Contrastive Learning encourages the representations of examples with the same label in the mini-batch to be close to each other and far from the examples with different labels by adding the following loss to the objective:

$$L_{SCL} = \sum_{i=1}^{m} \frac{-1}{|\{j \neq i : y_j = y_i\}|} \sum_{\substack{j \neq i \\ y_j = y_i}} \log \frac{\exp(\langle z_i, z_j \rangle / \tau)}{\sum_{k \neq i} \exp(\langle z_i, z_k \rangle / \tau)},$$

where τ is a scalar temperature parameter that controls the separation of classes. The loss function during training is

$$L = L_{CE} + \lambda L_{SCL},$$

where L_CE is the cross-entropy loss and λ is a factor that can be tuned. This was shown to improve the fine-tuning process in (Gunel et al., 2020) for few-shot fine-tuning.
From the definition of L_SCL, we observe that SCL is only effective when the mini-batch size is large and each label class is sufficiently represented in the mini-batch; otherwise, the loss function L_SCL is vacuous. For instance, if the mini-batch size is 1, which is the case for many of our datasets, then L_SCL = 0 for all mini-batches, making it equivalent to standard fine-tuning. A large mini-batch size, however, requires a large amount of memory during the fine-tuning process, which is not always available, as in our case. Thus, we look for a relaxation of SCL that can be implemented in a memory-efficient manner.
We start by considering L_SCL over the entire input set instead of a mini-batch, and then replace the example x_j with the mean of the examples of the same class as x_j when computing similarity with another example. More formally, let the training data be (x_1, y_1), . . . , (x_N, y_N), let the set of labels be {1, . . . , ℓ}, and let z_i be the representation of x_i from the encoder of the fine-tuning model. Let C_j = {i | y_i = j} and c_j = (1/|C_j|) Σ_{i∈C_j} z_i be the center of the embeddings of inputs with label j. We consider the relaxation of L_SCL in which each z_j is replaced by its class center c_{y_j} when computing similarities (see the sketch below).
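One natural instantiation of this center-based relaxation, sketched in PyTorch (our own formulation; the exact relaxed loss may differ in normalization):

```python
import torch
import torch.nn.functional as F

def center_scl_loss(z: torch.Tensor, y: torch.Tensor,
                    centers: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # z: [batch, dim] representations; y: [batch] integer labels;
    # centers: [num_classes, dim] class centers c_1..c_l, refreshed only a few
    # times during fine-tuning (treated as fixed within this call).
    logits = (z @ centers.t()) / tau  # similarity of each z_i to every center
    # Pull z_i toward its own class center, push away from the other centers.
    return F.cross_entropy(logits, y)
```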
A naive implementation of this loss function would be very expensive, as the centers c_1, . . . , c_ℓ would change in each iteration. We observe that the centers change much more slowly than the individual examples; this is the reason for replacing individual training samples with the centers when computing the similarity ⟨z_i, z_j⟩. Thus, we do not update them in each iteration and instead update them only ten times during the fine-tuning process. Note that this increases the training time by roughly a factor of 10, which is also prohibitive for large datasets.

20 Results without any filtered runs

21 Detailed results for label noise

Representation Collapse -Continual learning perspective
Tables 36, 37, 38, 39, 40, 41 and 42 show the results for representation collapse when we fine-tune the model for task A using different methods and then fine-tune the top layer for task B.

Measuring representation collapse
Tables 43, 44, 45, 46 and 47 show the sum of the top-k normalized eigenvalues (each eigenvalue divided by the sum of all eigenvalues) for k = 1, 2, 5, 10 and 20. From this, we observe that almost all the normalized eigenvalues after the first twenty are close to zero. Tables 48, 49 and 50 show GM-k for k = 5, 10 and 20. Tables 51, 52 and 53 show HM-k for k = 5, 10 and 20. We observe that REPINA I achieves the highest values and is thus most effective in reducing representation collapse.

Walltime Analysis
STD++ uses a single forward and backward pass with the simplest loss function and thus has the least training time. ReInit is a close second, as it only differs in the initialization of the model. WC also uses a single forward and backward pass but is slower due to the computation of the regularization loss. R3F and DA use two forward passes and two (effective) backward passes. Our method, on the other hand, uses only one forward and backward pass, plus an extra forward pass of the pre-trained model. Thus, our method is slower than STD++, ReInit and WC, and faster than R3F and DA. Table 54 shows the training time for all the methods. We observe that R3F consistently takes more time than all the other methods. REPINA I runs faster than R3F and DA but slower than STD++, WC and ReInit. REPINA MLP runs slower than REPINA I .

Connection of GM and HM to parameter estimation error
Let the pseudo linear regression task on the fine-tuned representations be defined by w ∈ R^d, and let the noisy labels observed on the z_i's be y_i = z_i^T w + ε_i, where the ε_i's are Gaussian noise centered at 0 with identity covariance matrix. If ŵ is the least squares minimizer (the same as the log-likelihood maximizer), then ŵ = w + ε with ε ∼ N(0, G^{-1}). GM corresponds to minimizing the confidence ellipsoid of the error ŵ − w; HM corresponds to minimizing the expected squared ℓ₂ norm of the error vector ŵ − w. The derivation of ŵ and the explanation can be found in Madan et al. (2019).

Figure 1 :
Figure 1: Fine-tuning the whole architecture (left) generally leads to good performance, though it distorts the representations. Minimal tuning (of the classification head, for example; middle) mitigates the representation collapse but limits the model performance. Our proposal REPINA (right) leads to good performance while mitigating representation collapse.

Figure 2 :
Figure 2: Example a) shows representations collapsing into a single dimension and losing useful information after fine-tuning. Example b) shows changes in representations that preserve their expressive power (e.g., coordinate shift, rotation, scaling, etc.).

Figure 3 :
Figure 3: REPINA improves fine-tuning performance and mitigates representation collapse. STD++: improved variant of standard fine-tuning. REPINA I and REPINA MLP are our methods. GLUE & Non-GLUE: average test set performance across seven GLUE and six non-GLUE tasks. Probing Experiments: measure of representation collapse introduced by Aghajanyan et al. (2020) (higher is better). Representation Diversity: mathematical measure of the information content of representations (we report the GM-5 score, see Section 6; higher is better).

Figure 5 :
Figure 5: Top: λ_i and V(λ_i) correspond to the i-th eigenvalue and its associated eigenvector after eigendecomposition of the Gram matrix. Data from the left distribution is well spread out and spans all three dimensions, with all of its eigenvalues being similar. The right distribution shows all of the data collapsed along one eigenvector; hence one of the eigenvalues significantly exceeds the two others. Bottom: comparison of the top-20 eigenvalues of STD++ and REPINA I after fine-tuning on QQP with 250 points. The less skewed distribution of eigenvalues for REPINA I compared to STD++ indicates a more spread out distribution of the fine-tuned representations with REPINA I than with STD++.

Figure 6 :
Figure 6: Intuitive explanation of the proposed approaches from the multi-task learning perspective. The total loss consists of the cross-entropy loss for the fine-tuning task and the consistency losses for the pseudo tasks. The pre-trained model is non-trainable (frozen).


Table 1 :
Performance of our methods (REPINA I/MLP ) and the baselines on 7 GLUE and 6 non-GLUE datasets. Average gain of 2.1 over STD++ for GLUE datasets and 4.5 for non-GLUE datasets. REPINA beats all baseline methods in 10/13 cases. For QQP, MNLI, QNLI, AGNEWS, IMDB, YELP and SCITAIL, we only used 10K training datapoints.

Table 2 :
The datasets used in this study, their size, number of classes (C) and the corresponding evaluation metrics. MCC denotes the Matthews correlation coefficient.

Table 4 :
Average rank of different methods for few-shot learning. RP I : REPINA I , RP M : REPINA MLP .

Table 5 :
Results of probing experiments to measure representation collapse (higher score is better; rows: task A, columns: STD++, DA, WC, ReInit, R3F, RP I , RP M ). The model is fine-tuned for task A with different methods; then a new linear head is trained for the remaining 12 tasks, and the mean performance is reported. Aver. is the average over different choices of task A. RP I is REPINA I , RP M is REPINA MLP . AG: AGNEWS-10k, SCIT: SCITAIL-10k, SCIC: SCICITE-10k, QNLI: QNLI-10k, QQP: QQP-10k, MNLI: MNLI-10k.

Table 6 :
Diversity of fine-tuned representations. The mean value across all 13 tasks is presented. RP I is REPINA I , RP M is REPINA MLP . REPINA I yields fine-tuned representations with the maximum representation matrix diversity.
Domain shift between pre-training and fine-tuning data: Even though pre-trained models achieve high performance on a large number of NLP tasks, they tend to suffer if there is a significant domain shift between the pre-training data and the fine-tuning data. Domain adaptation bridges this gap by adapting the model to the fine-tuning task domain. This can be done by additional pre-training on task-domain data if such data is available (Gururangan et al., 2020), or by algorithmically finding such data in a general domain corpus if it is not (Madan et al., 2021).
Table 7 shows the regularization coefficients used for each method.

Table 8 :
Batch size used in our experiments

Table 10 :
Performance of REPINA MLP with different numbers of MLP layers.

Tables 11, 12 and 13 compare the results between regularizing the top layer and regularizing an intermediate layer in REPINA I . We observe that REPINA I consistently performs better when regularizing an intermediate layer.

Table 11 :
Performance for REPINA I -intermediate vs REPINA I -top.

Table 12 :
Performance for REPINA I -intermediate vs REPINA I -top.

Tables 14, 15, 16 and 17 show the results for REPINA I with representations chosen from the 5th, 10th or 20th layer of the encoder. Note that the 5th layer is the closest to the input, and this count does not include the token embedding layer. We note that all three choices perform roughly equally well; the mean performance is typically within a percentage point across choices. If one were to use a single layer, one

Table 13 :
Performance for REPINA I -intermediate vs REPINA I -top.

Table 14 :
Effect of embedding layer to be regularized in REPINA I .

Table 15 :
Effect of embedding layer to be regularized in REPINA I .

Table 16 :
Effect of embedding layer to be regularized in REPINA I .

Table 17 :
Effect of embedding layer to be regularized in REPINA I .

HPO over learning rate, epochs and dropout

Table 18 :
Results with HPO over epochs and learning rate.DR is a baseline method where we do additional HPO over dropout rate as well.
Tables 20, 21 and 22 show the comparison of memory-efficient SCL with our methods. We see that both REPINA I and REPINA MLP beat SCL consistently for 250, 500 and 1000 training datapoints. Moreover, SCL incurs significant losses on several datasets.

Table 21 :
Performance of memory-efficient SCL.

Table 23 :
Number of tasks for which REPINA I or REPINA MLP outperform the baseline method.

Tables 24, 25 and 26 show the performance in the few-sample fine-tuning setting.

Tables 32, 33, 34 and 35 show detailed results with varying amounts of label noise in the training data.

Table 24 :
Performance for different regularization methods.

Table 25 :
Performance for different regularization methods.

Table 26 :
Performance for different regularization methods.

Table 27 :
Stability of fine-tuning results. RP I : REPINA I .

Table 28 :
Performance for different regularization methods without filtering failed runs.

Table 29 :
Performance for different regularization methods without filtering failed runs.

Table 30 :
Performance for different regularization methods without filtering failed runs.

Table 31 :
Performance for different regularization methods without filtering failed runs.

Table 32 :
Training with at most 10k training datapoints on 13 datasets.

Table 33 :
Training with at most 10k training datapoints on 13 datasets.

Table 34 :
Training with at most 10k training datapoints on 13 datasets.

Table 35 :
Training with at most 10k training datapoints on 13 datasets.

Table 36 :
Results for training the top layer for different tasks after fine-tuning the entire model on QNLI-10k.

Table 37 :
Results for training the top layer for different tasks after fine-tuning the entire model on QQP-10k.

Table 38 :
Results for training the top layer for different tasks after fine-tuning the entire model on MNLI-10k.

Table 39 :
Results for training the top layer for different tasks after fine-tuning the entire model on AGNEWS-10k.

Table 40 :
Results for training the top layer for different tasks after fine-tuning the entire model on IMDB-10k.

Table 41 :
Results for training the top layer for different tasks after fine-tuning the entire model on SCICITE.

Table 42 :
Results for training the top layer for different tasks after fine-tuning the entire model on RTE.

Table 43 :
Normalized average of top-1 eigenvalues

Table 54 :
Training time for different methods