Towards Robust Low-Resource Fine-Tuning with Multi-View Compressed Representations

Due to their huge number of parameters, fine-tuning of pretrained language models (PLMs) is prone to overfitting in low-resource scenarios. In this work, we present a novel method that operates on the hidden representations of a PLM to reduce overfitting. During fine-tuning, our method inserts random autoencoders between the hidden layers of a PLM, which transform activations from the previous layers into multi-view compressed representations before feeding them into the upper layers. The autoencoders are plugged out after fine-tuning, so our method neither adds extra parameters nor increases computation cost during inference. Our method demonstrates promising performance improvement across a wide range of sequence- and token-level low-resource NLP tasks.


Introduction
Fine-tuning pretrained language models (PLMs) (Devlin et al., 2019; Conneau and Lample, 2019; Liu et al., 2020) provides an efficient way to transfer knowledge gained from large-scale text corpora to downstream NLP tasks, and has achieved state-of-the-art performance on a wide range of tasks (Yang et al., 2019; Yamada et al., 2020; Chi et al., 2021; Sun et al., 2021). However, most PLMs are designed for general-purpose representation learning (Conneau et al., 2020), so the learned representations unavoidably contain abundant features irrelevant to the downstream tasks. Moreover, PLMs typically possess a huge number of parameters (often 100+ million) (Brown et al., 2020; Min et al., 2021), which makes them more expressive than simple models, and hence more vulnerable to overfitting noise or irrelevant features during fine-tuning, especially in low-resource scenarios.

* Equal contribution. Linlin Liu and Xingxuan Li are under the Joint Ph.D. Program between Alibaba and Nanyang Technological University.
There has been a long line of research on devising methods to prevent large neural models from overfitting. The most common ones can be roughly grouped into three main categories: data augmentation (DeVries and Taylor, 2017; Feng et al., 2021), parameter/activation regularization (Krogh and Hertz, 1991; Srivastava et al., 2014; Lee et al., 2020) and label smoothing (Szegedy et al., 2016; Yuan et al., 2020), which, from bottom to top, operate on data samples, model parameters/activations and data labels, respectively.
Data augmentation methods, such as back-translation (Sennrich et al., 2016) and masked prediction (Bari et al., 2021), are usually designed based on our prior knowledge about the data. Though simple, many of them have proved to be quite effective. Activation (hidden representation) regularization methods are typically orthogonal to other methods and can be used together with them to improve model robustness from different aspects. However, since neural models are often treated as a black box, the features encoded in hidden layers are less interpretable. Therefore, it is more challenging to apply similar augmentation techniques to the hidden representations of a neural model.
Prior studies (Yosinski et al., 2015; Allen-Zhu and Li, 2020) observe that neural models trained with different regularization or initialization can capture different features of the same input for prediction. Inspired by this finding, in this work we propose a novel method for hidden representation augmentation. Specifically, we insert a set of randomly initialized autoencoders (AEs) (Rumelhart et al., 1985; Baldi, 2012) between the layers of a PLM, use them to capture different features from the original representations, and then transform them into Multi-View Compressed Representations (MVCR) to improve robustness during fine-tuning on target tasks. Given a hidden representation, an AE first encodes it into a compressed representation of smaller dimension d, and then decodes it back to the original dimension. The compressed representation can capture the main variance of the data. Therefore, with a set of AEs of varying d, if we select a random one to transform the hidden representations in each fine-tuning step, the same or similar input is compressed with varying compression dimensions, and the upper-level PLM layers are fed with more diverse compressed representations for learning, which illustrates the "multi-view" concept. We also propose a tailored hierarchical AE to further increase representation diversity. Crucially, after fine-tuning the PLM with the AEs, the AEs can be plugged out, so they do not add any extra parameters or computation during inference.

We have designed a toy experiment to help illustrate our idea in a more intuitive way. We add random Gaussian noise to the MNIST (LeCun et al., 1998) digits and train autoencoders with different compression ratios to reconstruct the noisy input images. As shown in Fig. 1, the compression dimension d controls the amount of information preserved in the latent space.
With a small d, the AE removes most of the background noise and preserves mostly the crucial shape information about the digits. In Fig. 1b, part of the shape information is also discarded due to the high compression ratio. In the extreme case, when d = 1, the AE would discard almost all of the information from the input. Thus, when a small d is used to compress a PLM's hidden representations during fine-tuning, the reconstructed representations help: (i) reduce overfitting to noise or task-irrelevant features, since the high compression ratio can help remove noise; (ii) force the PLM to utilize different relevant features to prevent it from becoming overconfident about a certain subset of them. Since the AE-reconstructed representation may preserve little information when d is small (Fig. 1b), the upper layers are forced to extract only relevant features from the limited information. Besides, the shortcut learning problem (Geirhos et al., 2020) is often caused by learning features that fail to generalize, for example relying on the grass in the background to predict sheep. Our method may force the model to use the compressed representation without grass features, so it is potentially helpful for mitigating shortcut learning. As shown in Fig. 1d, with a larger d, most information about the digits can be reconstructed, and noise also starts to appear in the background. Hence, AEs with varying d can transform a PLM's hidden representations into different views to increase diversity.
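As a rough analogy for the toy experiment above (not the paper's actual MNIST setup), a linear autoencoder with bottleneck dimension d is equivalent to a rank-d projection, so truncated SVD can stand in for the compression bottleneck. The sketch below, under that assumption, shows that a strong bottleneck reconstructs a low-rank "signal" from noisy data better than a weak one:

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-1 "signal" matrix (every row is the same sine curve) plus noise,
# mimicking the noisy digits: the signal is low-dimensional, the noise is not.
clean = np.outer(np.ones(50), np.sin(np.linspace(0, 3, 40)))
noisy = clean + rng.normal(scale=0.3, size=clean.shape)

def compress(X, d):
    """Rank-d reconstruction of X: the linear analogue of an AE with
    compression dimension d."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :d] * s[:d]) @ Vt[:d]

err_small = np.mean((compress(noisy, 1) - clean) ** 2)   # strong bottleneck
err_large = np.mean((compress(noisy, 30) - clean) ** 2)  # weak bottleneck
```

With the strong bottleneck (d = 1) most of the noise is discarded and the signal is recovered; with the weak bottleneck (d = 30) the noise is largely preserved, echoing Fig. 1.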
We conduct extensive experiments to verify the effectiveness of MVCR. Compared to many strong baseline methods, MVCR demonstrates consistent performance improvement across a wide range of sequence- and token-level tasks in low-resource adaptation scenarios. We also present abundant ablation studies to justify the design. In summary, our main contributions are:
• Propose a novel method to improve low-resource fine-tuning, which leverages AEs to transform representations of a PLM into multi-view compressed representations to reduce overfitting.
• Design an effective hierarchical variant of the AE to introduce more diversity in fine-tuning.
• Present a plug-in and plug-out fine-tuning approach tailored for our method, which does not add extra parameters or computation during inference.
• Conduct extensive experiments to verify the effectiveness of our method, and run ablation studies to justify our design.

Related Work
Overfitting is a long-standing problem in large neural model training, which has attracted broad interest from the research community (Sarle et al., 1996; Hawkins, 2004; Salman and Liu, 2019; Santos and Papa, 2022). To better capture the massive information in large-scale text corpora during pretraining, PLMs are often over-parameterized, which makes them prone to overfitting. The commonly used methods to reduce overfitting can be grouped into three main categories: data augmentation, parameter/hidden representation regularization, and label smoothing. Data augmentation methods (DeVries and Taylor, 2017; Feng et al., 2021) are usually applied to increase training sample diversity. Most of the widely used methods, such as synonym replacement, masked prediction (Bari et al., 2021) and back-translation (Sennrich et al., 2016), fall into this category. Label smoothing methods (Szegedy et al., 2016; Yuan et al., 2020) are applied to the data labels to prevent overconfidence and to encourage smoother decision boundaries. Some hybrid methods, like MixUp (Zhang et al., 2018; Verma et al., 2019), are proposed to manipulate both data sample and label, or hidden representation and label. Most parameter/hidden representation regularization methods are orthogonal to the methods discussed above, so they can be used as effective complements.
Neural models are often treated as a black box, so it is more challenging to design efficient parameter or hidden representation regularization methods. Existing methods mainly focus on reducing model expressiveness or adding noise to hidden representations. Weight decay (Krogh and Hertz, 1991) enforces an L2 norm penalty on the parameters to reduce model expressiveness. Dropout (Srivastava et al., 2014) randomly replaces elements in hidden representations with 0, which is believed to add more diversity and prevent overconfidence on certain features. Inspired by dropout, Mixout (Lee et al., 2020) stochastically mixes the current and initial model parameters during training. Mahabadi et al. (2021) leverage the variational information bottleneck (VIB) (Alemi et al., 2016) to help models learn more concise and task-relevant features. However, VIB is limited to regularizing last-layer sequence-level representations only, while our method can be applied to any layer, and also supports token-level tasks like NER and POS tagging.

Methodology
We first formulate the use of neural networks over the hidden representations of different layers of deep learning architectures as effective augmentation modules (§3.1), and then devise a novel hierarchical autoencoder (HAE) to increase stochasticity and diversity (§3.2). We utilise our novel HAE as a compression module for hidden representations within PLMs, and introduce our method, Multi-View Compressed Representation (MVCR) (§3.3). We finally discuss the training, inference, and optimization of MVCR. An overview of MVCR is presented in Fig. 2.

Generalizing Neural Networks as Effective Augmentation Modules
Data augmentation (Simard et al., 1998) aims at increasing the diversity of training data while preserving quality for more generalized training (Shorten and Khoshgoftaar, 2019). It can be formalized by the Vicinal Risk Minimization principle (Chapelle et al., 2001), which aims to enlarge the support of the training distribution by generating new data points from a vicinity distribution around each training example (Zhang et al., 2018). We conjecture that shallow neural networks can be used between the layers of large-scale PLMs to construct such a vicinity distribution in the latent space to facilitate diversity and generalizability. We consider a neural network g(·), and denote its forward pass F(·) for an input x with F(x) = g(x). We denote a set of M such networks with G = {g_1(·), . . . , g_M(·)}, where each candidate network g_i(·) outputs a different "view" of a given input. We treat G as a stochastic network and define a stochastic forward pass F_S(·, G), where a candidate network g_i(·) is randomly chosen from the pool G in each step, enabling diversity due to different non-linear transformations. Formally, for an input x, we obtain the output o of network g_i(·) using F_S as

o = F_S(x, G) = g_i(x),  i ~ Uniform{1, . . . , M}.  (1)

For the chosen candidate network g_i, the output o represents a network-dependent "view" of the input x.
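The stochastic forward pass F_S can be sketched as follows. This is a minimal illustration, assuming tiny random tanh maps as the candidate networks g_i (the paper does not specify their architecture at this point):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_candidate(d_in, d_out):
    """A tiny randomly initialized non-linear map standing in for one
    candidate network g_i(.) in the pool G (illustrative only)."""
    W = rng.normal(scale=0.02, size=(d_in, d_out))
    return lambda x: np.tanh(x @ W)

d = 8
# Pool G of M = 3 candidate networks, each producing a different "view".
G = [make_candidate(d, d) for _ in range(3)]

def stochastic_forward(x, G):
    """F_S(x, G): pick one candidate g_i uniformly at random per call."""
    g = G[rng.integers(len(G))]
    return g(x)

x = rng.normal(size=(d,))
o = stochastic_forward(x, G)  # a network-dependent "view" of x
```

Each call may route the same input through a different g_i, which is the source of the diversity described above.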
We now formalize using g(·) over the hidden representations of large-scale PLMs for effective generalization. Let f(·) denote any general transformer-based PLM containing N hidden layers, with f_n(·) being the n-th layer and h_n the activations at that layer for n ∈ {1, . . . , N}, and let h_0 denote the input embeddings. We consider a set of layers L ⊂ {1, . . . , N} where we insert our stochastic network G for augmenting the hidden representations during the forward pass. To this end, we substitute the standard forward pass of the PLM f(·) during the training phase with the stochastic forward pass F_S(·, G). Formally,

h_n = F_S(f_n(h_{n−1}), G) if n ∈ L;  h_n = f_n(h_{n−1}) otherwise.  (2)

We now devise a novel hierarchical autoencoder network which can be effectively used for augmenting hidden representations of large-scale PLMs.

Figure 2: Illustration of token-level Multi-View Compressed Representations (MVCR) with three stochastic hierarchical autoencoders (HAEs) inserted between the transformer layers. During training, the output of layer n is either passed through a randomly selected HAE or directly passed to layer n + 1 (denoted as "I" in the figure). If an HAE is picked, the output of the outer encoder is either passed through a randomly selected sub-AE or directly passed to the outer decoder (via "I"). In inference, we drop MVCR without adding parameters to the original PLM.

Stochastic Hierarchical Autoencoder
Autoencoders (AEs) are a special kind of neural network in which the input dimension is the same as the output dimension (Rumelhart et al., 1985).

For an input x ∈ R^d, we define a simple autoencoder AE_{d,d̄}(·) with compression dimension d̄ as a sequential combination of a feed-forward down-projection layer D_{d,d̄}(·) and an up-projection layer U_{d̄,d}(·). Given input x, the output o ∈ R^d of the autoencoder can be represented as:

o = AE_{d,d̄}(x) = U_{d̄,d}(D_{d,d̄}(x)).  (3)

Hierarchical Autoencoder (HAE) We extend a standard autoencoder to a hierarchical autoencoder, whose overview is presented in Fig. 2: a smaller sub-autoencoder AE_{d̄,d̄_i} is nested inside the outer autoencoder, operating on the output of the outer down-projection D(x). The reconstruction loss is enforced both on the intermediate representation D(x) and on the final output:

L_HAE = MSE(x, U_{d̄,d}(AE_{d̄,d̄_i}(D_{d,d̄}(x)))) + MSE(D_{d,d̄}(x), AE_{d̄,d̄_i}(D_{d,d̄}(x))).  (4)

Note that HAEs are different from AEs with multiple encoder and decoder layers, since an HAE also enforces a reconstruction loss on the intermediate output D(x), as expressed in Eq. 4. Thus HAEs can compress representations step by step, and provide the flexibility to reconstruct the inputs with or without the sub-autoencoders. By sharing the outer-layer parameters across a series of inner AEs, an HAE can introduce more diversity without adding significant parameters or training overhead.

Stochastic Hierarchical Autoencoder
We use a stochastic set of sub-autoencoders E_d̄ within an HAE to formulate a stochastic hierarchical autoencoder, where d̄ is the input dimension of the sub-autoencoders. While performing the forward pass of the stochastic hierarchical autoencoder, we randomly choose one sub-autoencoder AE_{d̄,d̄_i} ∈ E_d̄ within the autoencoder AE_{d,d̄}, and compute the output as

o = U_{d̄,d}(AE_{d̄,d̄_i}(D_{d,d̄}(x))) if z ≤ 0.7;  o = U_{d̄,d}(D_{d,d̄}(x)) otherwise,  (5)

where z is uniformly sampled from the range [0, 1], and AE_{d̄,d̄_i} ∈ E_d̄ is randomly selected in each step. So 30% of the time, we do not use the sub-autoencoders. This randomness introduces more diversity into the generated views to reduce overfitting. For the stochastic HAE, we only compute the reconstruction loss between x and o, since this also implicitly minimizes the distance between AE_{d̄,d̄_i}(D(x)) and D(x) in Eq. 5. See §A.2 for a detailed explanation. In the following sections, we use HAE to denote the stochastic hierarchical autoencoder. We also name the hyper-parameter d̄ the HAE compression dimension.
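The stochastic HAE forward pass can be sketched as below. This is a minimal numpy illustration of the control flow only (random sub-AE selection and the 70% gate); the layer shapes and linear maps are assumptions, not the paper's trained modules:

```python
import numpy as np

rng = np.random.default_rng(1)

def linear(d_in, d_out):
    """Randomly initialized linear projection (stand-in for a trained layer)."""
    W = rng.normal(scale=0.02, size=(d_in, d_out))
    return lambda x: x @ W

class StochasticHAE:
    """Sketch of a stochastic hierarchical autoencoder: an outer
    down/up-projection pair (d -> d_bar -> d) shared by a pool of
    inner sub-autoencoders (d_bar -> d_bar_i -> d_bar)."""
    def __init__(self, d, d_bar, sub_dims):
        self.down = linear(d, d_bar)       # D_{d, d_bar}
        self.up = linear(d_bar, d)         # U_{d_bar, d}
        self.subs = [(linear(d_bar, di), linear(di, d_bar))
                     for di in sub_dims]   # inner AE_{d_bar, d_bar_i}

    def forward(self, x):
        z_mid = self.down(x)
        # With probability 0.7 route through a random sub-autoencoder,
        # otherwise skip straight to the outer decoder (the "I" path in Fig. 2).
        if rng.uniform() <= 0.7:
            sub_down, sub_up = self.subs[rng.integers(len(self.subs))]
            z_mid = sub_up(sub_down(z_mid))
        return self.up(z_mid)

hae = StochasticHAE(d=16, d_bar=8, sub_dims=[2, 4, 6])
out = hae.forward(rng.normal(size=(16,)))
```

The output always has the input dimension d, so the module can be dropped between transformer layers without shape changes.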

Multi-View Compressed Representation
Autoencoders can represent data effectively as compressed vectors that capture the main variability in the data distribution (Rumelhart et al., 1985). We leverage this capability of autoencoders within our proposed stochastic networks to formulate Multi-View Compressed Representation (MVCR).
We give an illustration of MVCR in Fig. 2. We insert multiple stochastic HAEs (§3.2) between multiple layers of a PLM (shown for one layer in the figure). Following §3.1, we consider the transformer-based PLM f(·) with hidden dimension d and N layers. At layer n, we denote the set of stochastic HAEs with HAE_{S,n} = {HAE_n^1(·), . . . , HAE_n^M(·)}, where M is the total number of HAEs. To prevent discarding useful information in the compression process and for more stable training, we only use them 50% of the time. We modify the forward pass of f(·) with MVCR, denoted by F_MVCR(·, HAE_{S,n}), to obtain the output h_n for an input h_{n−1} as

h_n = HAE_n^i(f_n(h_{n−1})) if z ≤ 0.5;  h_n = f_n(h_{n−1}) otherwise,  (6)

where z is uniformly sampled from the range [0, 1] and HAE_n^i ∈ HAE_{S,n} is randomly selected in each step. Following Eq. 2, we finally define MVCR(·, HAE_S) for a layer set L of a PLM to compute the output h_n using stochastic HAEs as

h_n = F_MVCR(h_{n−1}, HAE_{S,n}) if n ∈ L;  h_n = f_n(h_{n−1}) otherwise.  (7)

Note that MVCR can be used either at layer level or token level. At layer level, all of the hidden (token) representations in the same layer are augmented with the same randomly selected HAE in each training step. At token level, each token representation can select a random HAE to use. We show that token-level MVCR performs better than layer-level MVCR on account of more diversity and stochasticity (§5) and choose it as the default setup.
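The MVCR routing at one inserted layer can be sketched as follows. This is a minimal illustration of the 50% gate and the token-level vs. layer-level HAE selection; the "HAEs" here are noisy identity maps, a stand-in assumption so that the control flow (not the learning) is what the sketch demonstrates:

```python
import numpy as np

rng = np.random.default_rng(2)

def toy_hae(scale):
    """Stand-in for a trained HAE: a noisy identity map (assumption)."""
    return lambda h: h + rng.normal(scale=scale, size=h.shape)

def mvcr_forward(h, hae_pool, token_level=True):
    """MVCR at one inserted layer: 50% of steps pass the hidden states
    through a randomly chosen HAE, otherwise pass them on unchanged.
    At token level every token draws its own HAE from the pool."""
    if rng.uniform() > 0.5:       # 50% of steps: no augmentation
        return h
    if token_level:
        return np.stack([hae_pool[rng.integers(len(hae_pool))](t)
                         for t in h])
    hae = hae_pool[rng.integers(len(hae_pool))]  # layer level: one HAE for all
    return hae(h)

pool = [toy_hae(s) for s in (0.01, 0.02, 0.03)]  # M = 3 HAEs on this layer
h = rng.normal(size=(5, 16))                     # L = 5 tokens, d = 16
h_aug = mvcr_forward(h, pool)
```

Token-level selection draws an independent HAE per token, which is the extra source of diversity compared to layer-level selection.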
Network Optimization We use two losses for optimizing MVCR. Firstly, our method does not make changes to the task layer, so the original task-specific loss L_task is used to update the PLM, task layer, and HAE parameters. We use a small learning rate α_task to minimize L_task. Secondly, to increase training stability, we also apply a reconstruction loss L_MSE to the output of the HAEs to ensure the augmented representations are not projected too far from the original ones. At layer n, we have

L_MSE = (1 / (M · L)) Σ_{i=1}^{M} Σ_{t=1}^{L} || h_n^t − HAE_n^i(h_n^t) ||_2^2,  (8)

where M is the number of HAEs on each layer, and L is the PLM input sequence length (in tokens). We use a larger learning rate α_MSE to optimize L_MSE, since the HAEs are randomly initialized. This reconstruction loss not only increases training stability, but also allows us to plug out the HAEs during inference, since it ensures that the generated views are close to the original hidden representations.
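The per-layer reconstruction loss (averaged over all M HAEs and all L token vectors, as described above) can be sketched as below; the toy "HAEs" are simple callables standing in for the real modules:

```python
import numpy as np

def mvcr_mse(h, hae_pool):
    """Per-layer reconstruction loss sketch: mean squared reconstruction
    error, averaged over all M HAEs in the pool and all L token vectors
    in h (an L x d matrix of hidden states)."""
    total = 0.0
    for hae in hae_pool:
        recon = np.stack([hae(t) for t in h])
        # squared L2 error per token, averaged over the L tokens
        total += np.mean(np.sum((h - recon) ** 2, axis=-1))
    return total / len(hae_pool)

h = np.ones((4, 8))                        # L = 4 tokens, d = 8
pool = [lambda t: t * 0.9, lambda t: t]    # toy stand-ins for two HAEs
loss = mvcr_mse(h, pool)                   # -> 0.04 for this toy input
```

In practice this loss would be minimized with its own (larger) learning rate α_MSE, separately from the task loss, as described above.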

Experiments
In this section, we present the experiments designed to evaluate our method. We first describe the baselines and training details, followed by the evaluation on both sequence- and token-level tasks across six different datasets.
Baselines We compare our method with several widely used parameter and hidden representation regularization methods, including Dropout (Srivastava et al., 2014), Weight Decay (WD) (Krogh and Hertz, 1991), Mixout (Lee et al., 2020) and VIB (Mahabadi et al., 2021). We tune BERT base (Devlin et al., 2019) for the sequence-level tasks. The token-level tasks are multilingual, so we tune XLM-R base (Conneau et al., 2020) on them. We use MVCR_l to denote our methods, where l is the PLM layer after which we insert HAEs. For example, in MVCR_1 the HAEs are inserted after the 1st transformer layer. On each layer, we insert three HAEs with compressed representation dimensions 128, 256 and 512, respectively. All HAEs are discarded during inference. For each experiment, we report the average result of 3 runs.
For each dataset, we randomly sample 100, 200, 500 and 1000 instances from the training set to simulate the low-resource scenarios. The same number of instances is also sampled from the dev sets for model selection. The full test set is used for evaluation. More details can be found in §A.5.
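The low-resource simulation described above amounts to simple random subsampling of the train and dev splits; a minimal sketch (the function name and data layout are illustrative assumptions):

```python
import random

def low_resource_split(train, dev, n, seed=0):
    """Sample n training and n dev instances without replacement to
    simulate a low-resource setting; the full test set stays untouched."""
    rng = random.Random(seed)
    return rng.sample(train, n), rng.sample(dev, n)

train = [f"train-{i}" for i in range(5000)]
dev = [f"dev-{i}" for i in range(1000)]
small_train, small_dev = low_resource_split(train, dev, 100)
```

Fixing the seed per run keeps the sampled subsets reproducible across the compared methods.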

Results of Sequence-Level Tasks
For the sequence-level tasks, we experiment with two natural language inference (NLI) benchmarks, namely SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018), and two text classification tasks, namely IMDb (Maas et al., 2011) and Yelp (Zhang et al., 2015). Since MNLI has matched and mismatched versions, we report the results on each of them separately. We present the experimental results for the NLI tasks in Table 1, and for the text classification tasks in Table 2. MVCR_1 and MVCR_12 are our methods that insert HAEs after the 1st and 12th transformer layer, respectively. We have the following observations: (a) MVCR_1 and MVCR_12 consistently outperform all the baselines in the low-resource settings, demonstrating the effectiveness of our method. (b) In very low-resource settings, such as with 100 and 200 training samples, MVCR_1 often performs better than MVCR_12. This indicates that bottom-layer hidden representation augmentation is more efficient than top-layer regularization on extremely low-resource data. We attribute this to the fact that bottom-layer augmentation impacts more parameters in the top layers. (c) We observe that MVCR outperforms strong baselines such as VIB. We believe this is because MVCR offers the best of both worlds: it acts as a data augmentation module promoting diversity, while also acting as a lossy compression module that randomly discards features to prevent overfitting.

Results of Token-Level Tasks
For token-level tasks, we evaluate our methods on WikiAnn (Pan et al., 2017) for NER and Universal Dependencies v2.5 (Nivre et al., 2017) for POS tagging. Both datasets are multilingual, so we conduct experiments in the zero-shot cross-lingual setting, where all models are fine-tuned on English first and then evaluated on the other languages directly.
As shown in Table 3, our methods also prove useful on the token-level tasks. MVCR_2,12 is our method that inserts HAEs after both the 2nd and 12th transformer layers. (More details about the hyper-parameter search space for insertion layers can be found in §A.6.) Comparing the results of MVCR_2 and MVCR_2,12, we can see that adding one more layer of HAEs indeed leads to consistent performance improvement on WikiAnn. However, MVCR_2 performs better on the POS tagging task. We also observe that Mixout generally does not perform well when the number of training samples is very small. A detailed analysis of HAE insertion layers is presented in §5.

Analysis and Ablation Studies
To better understand the improvements obtained by MVCR, we conduct in-depth analysis and ablation studies. We experiment with both sequence-and token-level tasks by sampling 500 instances from 3 datasets, namely MNLI, IMDb and WikiAnn.
• Insertion Layer(s) of MVCR According to Clark et al. (2019), each layer in BERT captures different types of information, such as surface, syntactic or semantic information. As above, we fix the dimensions of the three HAEs on each layer to 128, 256 and 512. To comprehensively analyse the impact of adding MVCR after different layers, we insert it after each layer in turn. As shown in Fig. 3, adding MVCR after layer 1 or layer 12 achieves better performance than the other layers. This result can be explained in terms of different augmentation methods (Feng et al., 2021). When MVCR is added after layer 1, the module acts as a data generator that creates multi-view representation inputs for all the upper layers. When MVCR is added after layer 12, which is more relevant to the downstream tasks, the module focuses on preventing the task layer from overfitting (Zhou et al., 2022). Apart from a single layer, we also experiment with adding MVCR to combinations of two layers (Table 4 and Table 5). For the token-level task, adding MVCR to both layers 2 and 12 performs even better than one layer. However, it does not help improve the performance on sequence-level tasks. Similar to the β parameter in VIB (Mahabadi et al., 2021), which controls the amount of random noise added to the representation, the number of layers in MVCR controls the trade-off between adding more variety to the hidden representations and keeping more information from the original representations. The optimal trade-off point can be different for sequence- and token-level tasks.
• Number of HAEs in MVCR We analyse the impact of the number of HAEs in MVCR on the diversity of the augmented hidden representations. We fix the compression dimension of the HAE's outer encoder to 256, and only insert HAEs in the bottom layers: layer 1 for MNLI and IMDb, and layer 2 for WikiAnn. As shown in Fig. 4, the performance improves with an increasing number of HAEs, which indicates that adding more HAEs leads to more diversity and better generalization. However, the additional performance gain is marginal after three HAEs. This is probably because, without varied compression dimensions, the variety only comes from the different initializations of the HAEs, which can only improve the performance to a limited extent. Hence, we fix the number of HAEs to three for the other experiments, including the main experiments.
• Diversity of HAE Dimensions The compression dimensions of HAEs control the amount of information passed to upper PLM layers, so using HAEs of varying dimensions may help generate more diverse views during training. To analyse the impact of compression dimension diversity, we run experiments with three types of combinations: "aaa", "aab" and "abc", which contain one, two and three unique dimensions, respectively. We compute the average performance of the HAEs with dimensions {32,32,32} to {512,512,512} for "aaa". We sample the dimensions for "aab" and "abc" since there are too many possible combinations. As we can see from Fig. 5, "abc" consistently outperforms "aab", while "aab" consistently outperforms "aaa", which indicates that increasing compression dimension diversity can help further improve model generalization.
• HAE vs. AE vs. VAE HAEs in MVCR serve as a bottleneck to generate diverse compressed views of the original hidden representations. There are many other possible alternatives to the HAE, so we replace it with the vanilla AE and the variational autoencoder (VAE) (Kingma and Welling, 2014) for comparison. The results in Fig. 6 show that HAE consistently outperforms AE and VAE.
• Token-Level vs. Layer-Level MVCR In our method, the selection of a random HAE can be at token level or layer level. For token-level MVCR, each token in the layer randomly selects an HAE from the pool, with the parameters shared within the same layer. Compared with layer-level MVCR, token-level MVCR adds more variety to the model, leading to better results, as observed in Fig. 7.
• More Results and Analysis We conduct experiments to compare the training overhead of MVCR with the other baselines. MVCR has a slightly longer training time yet a shorter time to convergence. We also experiment with inference with or without MVCR; detailed results are presented in the appendix.

Conclusions
In this work, we have proposed a novel method to improve low-resource fine-tuning robustness via hidden representation augmentation. We insert a set of autoencoders between the layers of a pretrained language model (PLM). The autoencoders are randomly selected during fine-tuning to generate more diverse compressed representations and prevent the top PLM layers from overfitting. A tailored stochastic hierarchical autoencoder is also proposed to add more diversity to the augmented representations. The inserted modules are discarded after training, so our method does not add extra parameters or computation cost during inference. Our method has demonstrated consistent performance improvement on a wide range of NLP tasks.

Limitations
We focus on augmenting the hidden representations of a PLM. Thus most of our baselines, such as dropout (Srivastava et al., 2014) and variational information bottleneck methods (Mahabadi et al., 2021), do not require unlabeled data. For a fair comparison, we assume that unlabeled data is not available. Therefore, only the limited labeled training set is used to train the autoencoders in our experiments. However, such unlabeled general- or in-domain data (e.g., Wikipedia text) is easy to obtain in practice, and could be used to pre-train the autoencoders with unsupervised language modeling tasks, which may help further improve the performance. We leave this for future work.

Ethical Impact
Deep learning has demonstrated encouraging performance on a wide range of tasks during the past few years. However, neural models are data hungry, usually requiring a large amount of training data to achieve reasonable performance, and annotating large amounts of data is expensive and time consuming. Pretrained language models (PLMs) (Devlin et al., 2019; Conneau and Lample, 2019; Liu et al., 2020) have proven useful for transferring knowledge from massive unlabeled text to downstream tasks, but they are also prone to overfitting during fine-tuning due to over-parameterization. In this work, we propose a novel method to help improve model robustness in low-resource scenarios, which is part of the attempt to reduce neural models' reliance on labeled data, and hence reduce annotation cost. Our method has also demonstrated promising performance improvement on cross-lingual NLP tasks, which is an attempt to break the language barrier and allow a larger population to benefit from advances in NLP techniques.

A.3 Variational Information Bottleneck (VIB)

VIB (Alemi et al., 2016) is used to suppress irrelevant features when fine-tuning the model on target tasks. It compresses the sentence representation x generated by PLMs into a representation z of smaller dimension with mean µ(x), and also introduces more diversity through Gaussian noise with variance Var(x).

A.4 Layer-level MVCR
The default setting of MVCR is token-level selection, which means that at each training step each token in the same layer randomly selects a possibly different HAE from the pool, where the weights are shared. A variation of token-level MVCR is layer-level MVCR (Fig. 8), where all tokens in the same layer randomly select the same HAE from the pool.

A.5 Training Details
We use the same hyper-parameters as Mahabadi et al. (2021) and Hu et al. (2020).

A.6 Hyper-Parameter Search
Most of the hyper-parameters we use in the downstream tasks are the same as in Mahabadi et al. (2021) and Hu et al. (2020). Therefore, we only need to decide the hyper-parameters specific to our methods.

Diversity of HAE Dimensions
We conduct extensive experiments to study the impact of different types of dimension combinations, including "aaa", "aab" and "abc". Results are presented in Table 7. On average, "abc" outperforms the other two types, while "aab" outperforms "aaa". Furthermore, as shown in the first row of Table 7, the results are close for "aaa" combinations ranging from {32,32,32} to {512,512,512}. As such, we choose {64,64,x} and {256,256,x} for the analysis of "aab", and {128,256,x} for "abc".

HAE vs. AE vs. VAE We replace the HAE in MVCR with the AE and the VAE and compare the results in Table 8. The results for IMDb are also plotted in Fig. 9.
Token- vs. Layer-level MVCR We compare the performance of token- and layer-level MVCR, and report detailed results in Table 9. The results for IMDb are also plotted in Fig. 10.

Training Overhead We compare the training runtime and convergence time on MNLI with 500 training samples in Table 10. The total runtime of 80 epochs for MVCR is only 3.07% more than the average runtime of all baselines. Moreover, MVCR converges faster than the other baselines.

Inference With or Without MVCR We compare the results of inference with and without MVCR on MNLI and IMDb. The results in Table 11 show that the performances are similar.