HiddenCut: Simple Data Augmentation for Natural Language Understanding with Better Generalization

Fine-tuning large pre-trained models with task-specific data has achieved great success in NLP. However, it has been demonstrated that the majority of information within the self-attention networks is redundant and not utilized effectively during the fine-tuning stage. This leads to inferior results when generalizing the obtained models to out-of-domain distributions. To this end, we propose a simple yet effective data augmentation technique, HiddenCut, to better regularize the model and encourage it to learn more generalizable features. Specifically, contiguous spans within the hidden space are dynamically and strategically dropped during training. Experiments show that our HiddenCut method outperforms state-of-the-art augmentation methods on the GLUE benchmark, and consistently exhibits superior generalization performance on out-of-distribution and challenging counterexamples. We have publicly released our code at https://github.com/GT-SALT/HiddenCut.


Introduction
Fine-tuning large-scale pre-trained language models (PLMs) has become a dominant paradigm in the natural language processing community, achieving state-of-the-art performance on a wide range of natural language processing tasks (Devlin et al., 2019; Yang et al., 2019a; Clark et al., 2019; Lewis et al., 2020; Bao et al., 2020; Raffel et al., 2020). Despite this great success, due to the huge gap between the number of model parameters and the amount of task-specific data available, the majority of the information within the multi-layer self-attention networks is typically redundant and ineffectively utilized for downstream tasks (Guo et al., 2020; Gordon et al., 2020; Dalvi et al., 2020). As a result, after task-specific fine-tuning, models are very likely to overfit and make predictions based on spurious patterns (Tu et al., 2020; Kaushik et al., 2020), making them less generalizable to out-of-domain distributions (Jiang et al., 2019; Aghajanyan et al., 2020).
In order to improve the generalization abilities of over-parameterized models with a limited amount of task-specific data, various regularization approaches have been proposed, such as adversarial training that injects label-preserving perturbations in the input space (Jiang et al., 2019), generating augmented data via carefully-designed rules (McCoy et al., 2019; Xie et al., 2020; Andreas, 2020; Shen et al., 2020), and annotating counterfactual examples (Kaushik et al., 2020). Despite substantial improvements, these methods often require significant computational and memory overhead (Jiang et al., 2019; Xie et al., 2020) or human annotations (Kaushik et al., 2020).
In this work, to alleviate the above issues, we rethink the simple and commonly-used regularization technique, dropout (Srivastava et al., 2014), in pre-trained transformer models (Vaswani et al., 2017). With multiple self-attention heads in transformers, dropout converts some hidden units to zeros in a random and independent manner. Although PLMs are already equipped with dropout regularization, they still suffer from inferior performance on out-of-distribution cases (Tu et al., 2020; Kaushik et al., 2020). The underlying reasons are two-fold: (1) the linguistic relations among words in a sentence are ignored when hidden units are dropped randomly. In reality, the masked features can be easily inferred from the surrounding unmasked hidden units through the self-attention networks. Therefore, redundant information still exists and gets passed to the upper layers.
(2) Through its random sampling procedure, standard dropout assumes that every hidden unit is equally important, failing to characterize the different roles these features play in distinct tasks. As a result, the learned representations are not generalizable enough when applied to other data and tasks. To drop information more effectively, Shen et al. (2020) recently introduced Cutoff to remove tokens/features/spans in the input space. Even though models never see the removed information during training, examples with large noise may be generated when key clues for prediction are completely removed from the input.
To overcome these limitations, we propose a simple yet effective data augmentation method, HiddenCut, to regularize PLMs during the fine-tuning stage. Specifically, the approach is based on the linguistic intuition that hidden representations of adjacent words are more likely to contain similar and redundant information. HiddenCut drops hidden units more structurally by masking all the hidden information of contiguous spans of tokens after every encoding layer. This encourages models to fully utilize all the task-related information during training, instead of learning spurious patterns. To make the dropping process more efficient, we dynamically and strategically select the informative spans to drop via an attention-based mechanism. By performing HiddenCut in the hidden space, the impact of the dropped information is only mitigated rather than completely removed, avoiding injecting too much noise into the input. We further apply a Jensen-Shannon Divergence consistency regularization between the original and augmented examples to model the consistent relations between them.
To demonstrate the effectiveness of our method, we conduct experiments comparing HiddenCut with previous state-of-the-art data augmentation methods on 8 natural language understanding tasks from the GLUE benchmark for in-distribution evaluation, and on 5 challenging datasets that cover single-sentence tasks, similarity and paraphrase tasks, and inference tasks for out-of-distribution evaluation. We further perform ablation studies to investigate the impact of different selection strategies on HiddenCut's effectiveness. Results show that our method consistently outperforms baselines, especially on out-of-distribution and challenging counterexamples. To sum up, our contributions are:
• We propose a simple data augmentation method, HiddenCut, to regularize PLMs during fine-tuning by cutting contiguous spans of representations in the hidden space.
• We explore and design different strategic sampling techniques to dynamically and adaptively construct the set of spans to be cut.
• We demonstrate the effectiveness of HiddenCut through extensive experiments on both in-distribution and out-of-distribution datasets.
Related Work

Adversarial Training
Adversarial training methods usually regularize models by applying perturbations to the input or hidden space (Szegedy et al., 2013; Goodfellow et al., 2014; Madry et al., 2017) with additional forward-backward passes, which influence the model's predictions and confidence without changing human judgements. Adversarial approaches have been actively applied to various NLP tasks in order to improve models' robustness and generalization abilities, such as sentence classification (Miyato et al., 2017), machine reading comprehension (MRC) (Wang and Bansal, 2018) and natural language inference (NLI) (Nie et al., 2020). Despite its success, adversarial training often requires extensive computational overhead to calculate the perturbation directions (Shafahi et al., 2019; Zhang et al., 2019a). In contrast, our HiddenCut adds perturbations in the hidden space in a more efficient way that requires no extra computation, as the perturbation positions can be directly derived from self-attention weights.

Data Augmentation
Another line of work to improve model robustness is to directly design data augmentation methods that enrich the original training set, such as creating syntactically-rich examples with specific rules (McCoy et al., 2019; Min et al., 2020), crowdsourcing counterfactual augmentation to avoid learning spurious features (Kaushik et al., 2020), or combining examples in the dataset to increase compositional generalizability (Jia and Liang, 2016; Andreas, 2020; Chen et al., 2020b,a). However, these methods either require careful design to infer labels for the generated data (McCoy et al., 2019; Andreas, 2020) or extensive human annotation (Kaushik et al., 2020), which makes them hard to generalize to different tasks/datasets. Recently, Shen et al. (2020) introduced a set of cutoff augmentations that directly create partial views to augment training in a more task-agnostic way. Inspired by this prior work, our HiddenCut aims to improve models' generalization to out-of-distribution data by strategically dropping spans of hidden information in transformers in a linguistically-informed manner.

Dropout-based Regularization
Variations of dropout (Srivastava et al., 2014) have been proposed to regularize neural models by injecting noise through dropping certain information so that models do not overfit the training data. However, most recent efforts have focused on convolutional neural networks and are tailored to structures in images, such as DropPath (Larsson et al., 2017), DropBlock (Ghiasi et al., 2018), DropCluster (Chen et al., 2020c) and AutoDropout (Pham and Le, 2021). In contrast, our work takes a closer look at transformer-based models and introduces HiddenCut for natural language understanding tasks. HiddenCut is closely related to DropBlock (Ghiasi et al., 2018), which drops contiguous regions from a feature map. However, different from images, the hidden dimensions in PLMs that contain syntactic/semantic information for NLP tasks are more closely related (e.g., NER and POS information), and simply dropping spans of features in certain hidden dimensions might still lead to information redundancy.

HiddenCut Approach
To regularize transformer models in a more structural and efficient manner, in this section we introduce a simple yet effective data augmentation technique, HiddenCut, which reforms dropout into cutting contiguous spans of hidden representations after each transformer layer (Section 3.1). Intuitively, the proposed approach encourages the models to fully utilize all the hidden information within the self-attention networks. Furthermore, we propose an attention-based mechanism to strategically and judiciously determine the specific spans to cut (Section 3.2). A schematic diagram of HiddenCut applied to the transformer architecture, and its comparison to dropout, is shown in Figure 1.

HiddenCut
For an input sequence s = {w_0, w_1, ..., w_L} with L tokens associated with a label y, we employ a pre-trained transformer model f_{1:M}(·) with M layers, such as RoBERTa, to encode the text into hidden representations. Thereafter, an inference network g(·) is learned on top of the pre-trained model to predict the corresponding labels.
In the hidden space, after layer m, every word w_i in the input sequence is encoded into a D-dimensional vector h_i^m ∈ R^D, and the whole sequence can be viewed as a hidden matrix H^m ∈ R^{L×D}.
With multiple self-attention heads in the transformer layers, it has been found that there is extensive redundant information across the h_i^m ∈ H^m that are linguistically related (Dalvi et al., 2020) (e.g., words that share similar semantic meanings). As a result, the information removed by the standard dropout operation may be easily inferred from the remaining unmasked hidden units. The resulting model might easily overfit to certain high-frequency features without utilizing all the important task-related information in the hidden space (especially when task-related data is limited). Moreover, the model also suffers from poor generalization ability when applied to out-of-distribution cases.
Inspired by Ghiasi et al. (2018) and Shen et al. (2020), we propose to improve dropout regularization in transformer models by creating augmented training examples through HiddenCut, which drops a contiguous span of hidden information encoded in every layer, as shown in Figure 1(c). Mathematically, in every layer m, a span of hidden vectors S ∈ R^{l×D} with length l = αL in the hidden matrix H^m ∈ R^{L×D} is converted to 0, and the corresponding attention masks are set to 0, where α is a pre-defined hyper-parameter indicating the dropping extent of HiddenCut. After being encoded and HiddenCut through all the hidden layers of the pre-trained encoder, augmented training data f_HiddenCut(s) is created for learning the inference network g(·) to predict task labels.
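The core operation can be sketched as follows. This is a minimal NumPy sketch of one HiddenCut step on a single sequence, not the authors' implementation; the function and argument names are illustrative:

```python
import numpy as np

def hidden_cut(hidden, attention_mask, alpha=0.1, start=None, rng=None):
    """Minimal sketch of one HiddenCut step on a single sequence.

    hidden:         (L, D) hidden matrix H^m after a transformer layer
    attention_mask: (L,) array with 1 for visible tokens, 0 for masked ones
    alpha:          dropping ratio; the cut span has length l = alpha * L
    start:          start index of the span (sampled uniformly if None)
    """
    if rng is None:
        rng = np.random.default_rng()
    L = hidden.shape[0]
    l = max(1, int(alpha * L))              # span length l = alpha * L
    if start is None:
        start = int(rng.integers(0, L - l + 1))
    hidden = hidden.copy()
    attention_mask = attention_mask.copy()
    hidden[start:start + l, :] = 0.0        # zero the whole contiguous span
    attention_mask[start:start + l] = 0     # hide the span from later layers
    return hidden, attention_mask
```

In the paper the operation is applied after the feed-forward network of every transformer layer, so in practice this would run once per layer with a freshly selected span.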

Strategic Sampling
Different tasks rely on learning distinct sets of information from the input to predict the corresponding task labels. Performing HiddenCut randomly might be inefficient, especially when most of the dropping happens at task-unrelated spans, which fails to effectively regularize the model to take advantage of all the task-related features. To this end, we propose to select the spans to be cut dynamically and strategically in every layer. In other words, we mask the most informative span of hidden representations in one layer to force models to discover other useful clues to make predictions, instead of relying on a small set of spurious patterns.

Figure 1: Illustration of the differences between Dropout (a) and HiddenCut (b), and the position of HiddenCut in transformer layers (c). A sentence in the hidden space can be viewed as an L × D matrix, where L is the length of the sentence and D is the number of hidden dimensions. The cells in blue are masked. Dropout masks random independent units in the matrix, while our HiddenCut selects and masks a whole span of hidden representations based on the attention weights received in the current layer. In our experiments, we perform HiddenCut after the feed-forward network in every transformer layer.
Attention-based Sampling Strategy The most direct way is to define the set of tokens to be cut by utilizing the attention weights assigned to tokens in the self-attention layers (Kovaleva et al., 2019). Intuitively, we can drop the spans of hidden representations that are assigned high attention by the transformer layers. As a result, information redundancy is alleviated and models are encouraged to attend to other important information. Specifically, we first derive the average attention for each token, a_i, from the attention weight tensor A ∈ R^{P×L×L} after the self-attention layers, where P is the number of attention heads and L is the sequence length:

a_i = (1 / (P·L)) Σ_{p=1}^{P} Σ_{j=1}^{L} A_{p,j,i}

We then sample the start token h_i for HiddenCut from the set containing the top βL tokens with the highest average attention weights (β is a pre-defined parameter). HiddenCut is then performed to mask the hidden representations between h_i and h_{i+l}. Note that the salient sets differ across layers and are updated throughout training.
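This selection step can be sketched in a few lines, assuming the layer's attention tensor has shape (P, L, L) with entry [p, j, i] being the attention that query token j pays to token i under head p (the function and variable names are ours, not the paper's):

```python
import numpy as np

def sample_cut_start(attn, beta=0.4, span_len=2, rng=None):
    """Sample a HiddenCut start position from the high-attention token set.

    attn:     (P, L, L) attention weights of one layer (P heads)
    beta:     fraction of tokens kept in the candidate ("important") set
    span_len: length l of the span that will be cut afterwards
    """
    if rng is None:
        rng = np.random.default_rng()
    P, L, _ = attn.shape
    a = attn.mean(axis=(0, 1))           # a_i: avg attention received by token i
    k = max(1, int(beta * L))
    candidates = np.argsort(-a)[:k]      # top beta*L tokens by average attention
    # keep only start positions that leave room for a full span
    candidates = candidates[candidates <= L - span_len]
    if candidates.size == 0:
        candidates = np.array([0])
    return int(rng.choice(candidates))
```

Because the attention weights change as training proceeds, the candidate set is recomputed in every layer at every step, matching the dynamic behavior described above.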
Other Sampling Strategies We also explore other widely used word-importance discovery methods to find the set of tokens to be strategically cut by HiddenCut, including:
• Random: All spans of tokens are viewed as equally important, and thus are randomly cut.
• LIME (Ribeiro et al., 2016) defines the importance of tokens by examining local faithfulness, where weights of tokens are assigned by classifiers trained on sentences whose words are randomly removed. We utilize LIME on top of an SVM classifier to pre-define a fixed set of tokens to be cut.
• GEM (Yang et al., 2019b) utilizes orthogonal bases to calculate novelty scores that measure the new semantic meaning in tokens, significance scores that estimate the alignment between the semantic meaning of tokens and the sentence-level meaning, and uniqueness scores that examine the uniqueness of the semantic meaning of tokens. We compute GEM scores using the hidden representations at every layer to generate the set of tokens to be cut, which is updated during training.
• Gradient (Baehrens et al., 2010): We define the set of tokens to be cut based on the rankings of the absolute values of the gradients they receive at every layer in the backward pass. This set is updated during training.
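As a toy illustration of removal-based importance scoring in the spirit of the strategies above (this is a leave-one-out occlusion score, not the exact procedure of LIME, GEM, or the gradient method), one can rank tokens by how much zeroing their hidden vector changes a model score:

```python
import numpy as np

def occlusion_importance(hidden, score_fn):
    """Toy leave-one-out token importance.

    hidden:   (L, D) matrix of hidden vectors for one sequence
    score_fn: maps an (L, D) matrix to a scalar model score; a stand-in
              for the classifier, not the paper's actual model.
    A token is deemed important if zeroing its hidden vector moves the
    score a lot.
    """
    base = score_fn(hidden)
    scores = np.empty(hidden.shape[0])
    for i in range(hidden.shape[0]):
        occluded = hidden.copy()
        occluded[i, :] = 0.0                      # remove token i's information
        scores[i] = abs(score_fn(occluded) - base)
    return scores
```

The resulting scores play the same role as the attention, LIME, GEM, or gradient rankings: they induce the candidate set from which HiddenCut samples spans to drop.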

Objectives
During training, for an input text sequence s with label y, we generate N augmented examples {f_HiddenCut^1(s), ..., f_HiddenCut^N(s)} by performing HiddenCut in the pre-trained encoder f(·). The whole model g(f(·)) is then trained with several objectives, including the general classification losses (L_ori and L_aug) on data-label pairs and a consistency regularization (L_js) (Miyato et al., 2017, 2018; Clark et al., 2018; Xie et al., 2019; Shen et al., 2020) across the different augmentations:

L_ori = CE(g(f(s)), y)
L_aug = (1/N) Σ_{i=1}^{N} CE(g(f_HiddenCut^i(s)), y)
L_js = (1/(N+1)) [ KL(g(f(s)) || p_avg) + Σ_{i=1}^{N} KL(g(f_HiddenCut^i(s)) || p_avg) ]

where CE and KL represent the cross-entropy loss and KL-divergence respectively, and p_avg stands for the average prediction across the original text and all the augmented examples.
Combining these three losses, our overall objective function is L = L_ori + γ·L_aug + η·L_js, where γ and η are the weights used to balance the contributions of learning from the original data and the augmented data.
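For a single example, the combined objective can be sketched as follows, assuming the model outputs class probability vectors (the helper names are ours, and the consistency term is written as the average KL to the mean prediction, consistent with the Jensen-Shannon regularizer described above):

```python
import numpy as np

def cross_entropy(p, y):
    """CE of predicted distribution p against gold label index y."""
    return float(-np.log(p[y]))

def kl(p, q):
    """KL-divergence between two discrete distributions (no zero entries)."""
    return float(np.sum(p * np.log(p / q)))

def hiddencut_loss(p_ori, p_augs, y, gamma=1.0, eta=1.0):
    """Sketch of L = L_ori + gamma * L_aug + eta * L_js for one example.

    p_ori:  predicted class distribution for the original example
    p_augs: list of distributions for the N HiddenCut-augmented views
    y:      gold label index
    """
    l_ori = cross_entropy(p_ori, y)
    l_aug = np.mean([cross_entropy(p, y) for p in p_augs])
    preds = [p_ori] + list(p_augs)
    p_avg = np.mean(preds, axis=0)               # average prediction over views
    l_js = np.mean([kl(p, p_avg) for p in preds])
    return l_ori + gamma * l_aug + eta * l_js
```

When all views agree exactly, the consistency term vanishes and the loss reduces to (1 + γ) times the cross-entropy, which is the intuition behind the regularizer: it only penalizes disagreement between the original and augmented predictions.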

Datasets
We conducted experiments on both in-distribution datasets and out-of-distribution datasets to demonstrate the effectiveness of our proposed HiddenCut.

In-Distribution Datasets
We mainly trained and evaluated our methods on the widely-used GLUE benchmark, which covers a wide range of natural language understanding tasks: single-sentence tasks, including (i) the Stanford Sentiment Treebank (SST-2), which predicts whether the sentiment of a movie review is positive or negative, and (ii) the Corpus of Linguistic Acceptability (CoLA), which predicts whether a sentence is linguistically acceptable or not; similarity and paraphrase tasks, including (i) Quora Question Pairs (QQP), which predicts whether two questions are paraphrases, (ii) the Semantic Textual Similarity Benchmark (STS-B), which predicts similarity ratings between two sentences, and (iii) the Microsoft Research Paraphrase Corpus (MRPC), which predicts whether two given sentences are semantically equivalent; and inference tasks, including (i) Multi-Genre Natural Language Inference (MNLI), which classifies the relationship between two sentences as entailment, contradiction, or neutral, (ii) Question Natural Language Inference (QNLI), which predicts whether a given sentence is the correct answer to a given question, and (iii) Recognizing Textual Entailment (RTE), which predicts whether an entailment relation holds between two sentences. Accuracy was used as the evaluation metric for most of the datasets, except that Matthews correlation was used for CoLA and Spearman correlation for STS-B.
Out-Of-Distribution Datasets

To demonstrate the generalization abilities of our proposed methods, we directly evaluated on 5 different out-of-distribution challenging sets, using the models fine-tuned on the GLUE benchmark datasets:
• Single-Sentence Tasks: Models fine-tuned on SST-2 are directly evaluated on two recent challenging sentiment classification datasets: the IMDB Contrast Set (Gardner et al., 2020) with 588 examples and the IMDB Counterfactually Augmented Dataset (Kaushik et al., 2020) with 733 examples. Both were constructed by asking NLP researchers (Gardner et al., 2020) or Amazon Mechanical Turkers (Kaushik et al., 2020) to make minor edits to examples in the original IMDB dataset (Maas et al., 2011) so that the sentiment label changes while the major content stays the same.
• Similarity and Paraphrase Tasks: Models fine-tuned on QQP are directly evaluated on the recently introduced challenging paraphrase dataset PAWS-QQP (Zhang et al., 2019b), which has 669 test cases. PAWS-QQP contains sentence pairs with high word overlap but different semantic meanings, created via word swapping and back-translation from the original QQP dataset.
• Inference Tasks: Models fine-tuned on MNLI are directly evaluated on the challenging HANS dataset (McCoy et al., 2019), which is constructed to exploit spurious correlations, such as lexical overlap between premise and hypothesis, that NLI models tend to rely on.

Table 2: Out-of-distribution evaluation results on 5 different challenging sets. † means our proposed method. For all the datasets, we did not use their training sets to further fine-tune the models derived from GLUE.

Baselines
We compare our methods with several baselines:
• RoBERTa is used as our base model. Note that RoBERTa is regularized with dropout during fine-tuning.
• ALUM is the state-of-the-art adversarial training method for neural language models, which regularizes fine-tuning via perturbations in the embedding space.
• Cutoff (Shen et al., 2020) is a recent data augmentation for natural language understanding tasks by removing information in the input space, including three variations: token cutoff, feature cutoff, and span cutoff.

Implementation Details
We used the RoBERTa-base model to initialize all the methods. Note that HiddenCut is agnostic to the type of pre-trained model. Following prior work, we used a linear decay scheduler with a warmup ratio of 0.06 for training. The maximum learning rate was selected from {5e-6, 8e-6, 1e-5, 2e-5} and the maximum number of training epochs was set to either 5 or 10. These hyper-parameters are shared across all models. The HiddenCut ratio α was set to 0.1 after a grid search over {0.05, 0.1, 0.2, 0.3, 0.4}. The selecting ratio β in the important-set sampling process was set to 0.4 after a grid search over {0.1, 0.2, 0.4, 0.6}. The weights γ and η in our objective function were both 1. All experiments were performed on a GeForce RTX 2080Ti.

Results on In-Distribution Datasets
Based on Table 1, we observed that, compared to RoBERTa-base with only dropout regularization, ALUM, which perturbs the embedding space through adversarial training, achieves better results on most GLUE tasks. However, the additional backward passes required to determine the perturbation directions in ALUM bring significantly more computational and memory overhead. By masking different types of input during training, Cutoff increased performance while being more computationally efficient. In contrast to Span Cutoff, HiddenCut not only introduces zero additional computational cost, but also demonstrates stronger performance on 7 out of 8 GLUE tasks, especially when the training set is small (e.g., an increase of 1.1 on RTE and 1.5 on CoLA). Moreover, HiddenCut achieved the best average result compared to previous state-of-the-art baselines. These in-distribution improvements indicate that, by strategically dropping contiguous spans in the hidden space, HiddenCut not only helps pre-trained models utilize hidden information more effectively, but also injects less noise during the augmentation process than Cutoff. For example, Span Cutoff might introduce additional noise for CoLA (which aims to judge whether an input sentence is linguistically acceptable or not), since removing a span from the input might change the label.

Table 3: Performances on SST-2 and QNLI with different strategies for dropping information in the hidden space. Different sampling strategies combined with HiddenCut are presented. "-R" means sampling outside the set to be cut given by these strategies.

Results on Out-Of-Distribution Datasets
To validate the better generalizability of HiddenCut, we tested our models trained on SST-2, QQP and MNLI directly on 5 out-of-distribution/out-of-domain challenging sets in zero-shot settings. As mentioned earlier, these sets were either constructed from in-domain/out-of-domain data further edited by humans to make them harder, or generated by rules that exploit spurious correlations such as lexical overlap, making them challenging for most existing models. As shown in Table 2, Span Cutoff slightly improved performance over RoBERTa by adding extra regularization through restricted input views. HiddenCut significantly outperformed both RoBERTa and Span Cutoff: for example, it beat Span Cutoff by 2.3% (87.8% vs. 85.5%) on IMDB-Conts, 2.7% (41.5% vs. 38.8%) on PAWS-QQP, and 2.8% (71.2% vs. 68.4%) on HANS. These superior results demonstrate that, by dynamically and strategically dropping contiguous spans of hidden representations, HiddenCut is able to better utilize all the important task-related information, which improves generalization to out-of-distribution and challenging adversarial examples.

Ablation Studies
This section presents our ablation studies on different sampling strategies and the effect of important hyper-parameters in HiddenCut.

Sampling Strategies in HiddenCut
We compared different ways to cut hidden representations (DropBlock (Ghiasi et al., 2018), which randomly drops spans in certain random hidden dimensions instead of across the whole hidden space) and the different sampling strategies for HiddenCut described in Section 3.2 (Random, LIME (Ribeiro et al., 2016), GEM (Yang et al., 2019b), Gradient (Yeh et al., 2019), and Attention), based on performance on SST-2 and QNLI. For these strategies, we also experimented with a reverse setting, denoted "-R", where we sampled outside the important set given by the above strategies.
From Table 3, we observed that (i) sampling from important sets results in better performance than random sampling, while sampling outside the defined important sets usually leads to inferior performance; this highlights the importance of strategically selecting the spans to drop. (ii) Sampling from dynamic sets often outperformed sampling from a predefined fixed set (LIME), indicating the effectiveness of adjusting the sampling sets during training. (iii) The attention-based strategy outperformed all other sampling strategies, demonstrating the effectiveness of our proposed sampling strategy for HiddenCut. (iv) Completely dropping spans of hidden representations generated better results than removing only certain dimensions in the hidden space, which further validates the benefit of HiddenCut over DropBlock for natural language understanding tasks.

The Effect of HiddenCut Ratios
The length of the spans dropped by HiddenCut is an important hyper-parameter, controlled by the HiddenCut ratio α and the length of the input sentence. α can also be interpreted as the extent of the perturbation added to the hidden space. We present the results of HiddenCut on MNLI with different values of α from {0.05, 0.1, 0.2, 0.3, 0.4} in Table 5. HiddenCut achieved the best performance with α = 0.1, and performance gradually decreased with higher α, since larger noise might be introduced when more hidden information is dropped. This suggests the importance of balancing the trade-off between applying proper perturbations to regularize models and injecting potential noise.

Table 4: Visualization of the attention weights at the last layer of the models. The sentences in the first section are from IMDB with positive labels; the sentences in the second section are constructed by changing ratings or diminishing sentiment via qualifiers (Kaushik et al., 2020).

The Effect of Sampling Ratios
The number of words considered important and selectable by HiddenCut is also an influential hyper-parameter, controlled by the sampling ratio β and the length of the input sentence. As shown in Table 6, we compared performance on SST-2 with different β from {0.1, 0.2, 0.4, 0.6}. When β is too small, the number of words in the important set is limited, which might lead HiddenCut to consistently drop the same hidden spans during the entire training process; this low diversity reduces the improvement over baselines. When β is too large, the important set might cover all the words in a sentence except stop words; as a result, the attention-based strategy effectively becomes random sampling, which also leads to lower gains. The best performance was achieved with β = 0.4, indicating a reasonable trade-off between diversity and efficiency.

Visualization of Attentions
To further demonstrate the effectiveness of HiddenCut, we visualize the attention weights that the special start token ("<s>") assigns to other tokens at the last layer, via several examples and their counterfactual versions in Table 4. We observed that RoBERTa only assigned high attention weights to certain tokens such as "8 stars", "intriguing" and especially the end special token "</s>", while largely ignoring other context tokens that were also important for making correct predictions, such as scale descriptions (e.g., "out of 10") and qualifier words (e.g., "more and more"). This is probably because words like "8 stars" and "intriguing" are highly correlated with the positive label, and RoBERTa might overfit to such patterns without proper regularization. As a result, when the scale of the ratings (e.g., from "10" to "20") or the qualifier words changed (e.g., from "more and more" to "only slightly more"), RoBERTa still predicted the label as positive even when the ground truth was negative. With HiddenCut, models mitigate the impact of the tokens with the highest attention weights and are encouraged to utilize all the related information, so the attention weights under HiddenCut were more uniformly distributed, which helped models make correct predictions on out-of-distribution counterfactual examples. Taken together, HiddenCut improves a model's generalizability by facilitating learning from more task-related information.

Conclusion
In this work, we introduced a simple yet effective data augmentation technique, HiddenCut, that improves model robustness on a wide range of natural language understanding tasks by dropping contiguous spans of hidden representations, directed by a strategic attention-based sampling strategy. Through HiddenCut, transformer models are encouraged to make use of all the task-related information during training rather than relying on certain spurious clues. Through extensive experiments on in-distribution datasets (the GLUE benchmark) and out-of-distribution datasets (challenging counterexamples), HiddenCut consistently and significantly outperformed state-of-the-art baselines and demonstrated superior generalization performance.