BatchMixup: Improving Training by Interpolating Hidden States of the Entire Mini-batch

Usually, we train a neural system on a sequence of mini-batches of labeled instances. Each mini-batch is composed of k samples, and each sample learns a representation vector. MIXUP implicitly generates synthetic samples by linearly interpolating the inputs and the corresponding labels of random sample pairs in the same mini-batch. This means that MIXUP only generates new points on the edges connecting every two original points in the representation space. We observed that the new points generated by the standard MIXUP cover a rather limited region of the mini-batch's representation space. In this work, we propose BATCHMIXUP, which improves model learning by interpolating the hidden states of the entire mini-batch. BATCHMIXUP can generate new points scattered throughout the space corresponding to the mini-batch. In experiments, BATCHMIXUP shows superior performance to competitive baselines on NLP tasks while using different ratios of training data.


Introduction
The study of data augmentation techniques has a long history in the NLP community. Typical data augmentations include synonym replacement (Kobayashi, 2018), back-translation (Fadaee et al., 2017), adding data noise (Xie et al., 2017), etc. Mostly, these techniques are combined with augmentation-free models in a pipeline. MIXUP (Zhang et al., 2018) is able to augment the data by linearly combining the hidden representations of pairs of examples, keeping the whole system trained end-to-end.
MIXUP has shown effectiveness in a range of NLP tasks (Sun et al., 2020). Nevertheless, it has two drawbacks. First, MIXUP generates new points merely and exactly on the edges connecting random point pairs; these new points cover a rather limited region of the mini-batch's representation space. Second, the training of a system equipped with MIXUP is considerably inefficient: generally, MIXUP slows down training by a factor of n if it generates n new points for each original point pair. In this work, we propose BATCHMIXUP, an improved mixup paradigm that generates new points scattered uniformly throughout the whole representation region of the mini-batch. Specifically, within a mini-batch, each example and its label first learn a representation vector; BATCHMIXUP then generates n new points (each consisting of a new input representation and a new label representation) simultaneously by non-linearly interpolating all the examples in the same mini-batch. The n new points are expected to better characterize the space represented by the mini-batch. Finally, the n mixed points act as one batch to update the model.
Our model BATCHMIXUP, a batch-wise non-linear MIXUP, shows advantages in two aspects. (i) Compared with the standard MIXUP, BATCHMIXUP further improves representation learning on downstream NLP tasks, yielding better performance. (ii) BATCHMIXUP works much more efficiently than the conventional MIXUP and other pair-wise mixup variants.

Related Work
MIXUP was originally proposed in the computer vision community. The standard MIXUP (Zhang et al., 2018) interpolates the raw pixels of pairs of images in a mini-batch. Verma et al. (2019) conducted interpolation in the hidden states of images. Guo et al. (2019b) discovered a limitation of MIXUP, called "manifold intrusion", which is the conflict between the synthetic labels of mixed-up points and the labels of the original examples. They came up with "AdaMixup", an adaptive MIXUP, where the mixing policies are automatically learned from the data using an additional network and an objective function designed to avoid manifold intrusion. Other work tried to explain the working mechanisms of MIXUP from different threads, such as "MIXUP as directional adversarial training" (Archambault et al., 2019) and "MIXUP training as complexity reduction" (Kimura, 2020). To date, only a couple of previous studies have explored the effectiveness of the standard MIXUP in NLP. Guo et al. (2019a) tried two strategies: interpolating word embeddings or sentence embeddings generated by convolutional/recurrent neural networks. Sun et al. (2020) incorporated MIXUP into BERT (Devlin et al., 2019), the state-of-the-art architecture in NLP. To improve the standard MIXUP, Guo (2020) added non-linearity to MIXUP for text classification tasks. However, that non-linear MIXUP works at the word embedding level, which is less applicable to Transformer-style (Vaswani et al., 2017) systems. All the work above is pair-wise mixup; ours is the first work that interpolates all the examples in the same mini-batch to better cover the representation space.

The Base Model: MIXUP
Given a pair of samples (x_i, y_i) and (x_j, y_j) from the original mini-batch (x: input, y: the one-hot label), the standard MIXUP (Zhang et al., 2018) generates a synthetic sample as follows:

x̂ = βx_i + (1 − β)x_j,   (1)
ŷ = βy_i + (1 − β)y_j,   (2)

where β is a mixing scalar, sampled from a Beta(α, α) distribution with a hyper-parameter α, for mixing both the inputs and the corresponding targets. The generated synthetic data are then fed into the model for training to minimize the loss function. From the same mini-batch, the standard MIXUP samples the β value n times, so that n new mixed points in total are generated sequentially for a sampled input pair. The model, as a result, is updated n times more often than the mixup-free model.
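As a minimal sketch, the pair-wise interpolation of Equations 1 and 2 can be written as follows (the function name and defaults are illustrative, not from the paper):

```python
import numpy as np

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.4, rng=None):
    """Standard MIXUP: linearly interpolate one sample pair.

    x_*: input (or hidden-state) vectors; y_*: one-hot label vectors.
    A single mixing scalar beta ~ Beta(alpha, alpha) is shared by
    the inputs and the targets (Equations 1-2).
    """
    rng = rng or np.random.default_rng(0)
    beta = rng.beta(alpha, alpha)            # mixing scalar in (0, 1)
    x_mix = beta * x_i + (1.0 - beta) * x_j  # Equation 1
    y_mix = beta * y_i + (1.0 - beta) * y_j  # Equation 2
    return x_mix, y_mix
```

Sampling β repeatedly for the same pair yields the n sequential mixed points described above.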

Our Model: BATCHMIXUP
BATCHMIXUP mixes all the samples in the same mini-batch at the level of hidden states generated by RoBERTa (Liu et al., 2019). To start, we first consider how a standard text classifier works: for a labeled input (x_i, y_i), RoBERTa (optionally with a multilayer perceptron block) first generates a representation v(x_i) ∈ R^d for x_i; v(x_i) is then fed to a logistic regression (LR) layer to classify it as y_i. The LR layer has a weight matrix W ∈ R^{c×d}, where c is the class size and d is the dimension of the representations. Each row of W, i.e., w_i ∈ R^d, can be treated as the representation vector of the class y_i. So, LR essentially uses the dot product to derive the matching score s_i ∈ R between the input x_i and the label y_i:

s_i = w_i · v(x_i).

For the same mini-batch of inputs {v(x_i)} and labels {w_i}, BATCHMIXUP deploys the same mixing policy to interpolate {v(x_i)} and {w_i}.
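This view of the LR layer as dot products against per-class representation vectors can be sketched as (function name assumed for illustration):

```python
import numpy as np

def classify(v_x, W):
    """Score an input representation v_x (shape (d,)) against every
    class row w_i of W (shape (c, d)) by dot product, and predict
    the argmax class: s_i = w_i . v(x)."""
    scores = W @ v_x
    return scores, int(np.argmax(scores))
```

The key observation is that both the inputs {v(x_i)} and the labels {w_i} live in the same d-dimensional space, so one mixing policy can interpolate both.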
We denote the whole batch of input representations {v(x_i)} as X ∈ R^{d×b}, where b is the batch size, and the whole mixing policy for this batch as M ∈ R^{n×d×b}. To generate a single mixed point x̂_i ∈ R^d, BATCHMIXUP applies the following mixing policy M[i] ∈ R^{d×b} (i = 1, · · · , n) to X, where each element of M[i] is independently sampled from a Beta(α, α) distribution:

x̂_i = (M[i] • X) 1_b,   (3)

where • is the Hadamard product and 1_b ∈ R^b is the all-ones vector that sums the weighted columns over the batch dimension. Equation 3 can be performed for all i values in [1, n] simultaneously; this means the original batch input X is transformed into a new batch of mixed inputs X̂ ∈ R^{n×d}. Similarly, the same mixing policy M is applied to the batch of label representations Y ∈ R^{d×b} (Y = {w_i}):

ŷ_i = (M[i] • Y) 1_b.   (4)

Each (x̂_i, ŷ_i) (i = 1, · · · , n) is a newly mixed point. All {(x̂_i, ŷ_i)} can be generated in parallel and are scattered throughout the space represented by X.
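A minimal NumPy sketch of this batch-wise mixing follows. One assumption is made explicit: the element-wise Beta weights are renormalized over the batch axis so that each mixed coordinate is a convex combination of the whole mini-batch (the function name and this normalization are not spelled out in the text above):

```python
import numpy as np

def batch_mixup(X, Y, n, alpha=0.4, rng=None):
    """Sketch of BATCHMIXUP mixing (Equations 3-4).

    X: (d, b) batch of input representations {v(x_i)}
    Y: (d, b) batch of label representations {w_i}
    Returns X_hat, Y_hat of shape (n, d), generated in parallel.
    Assumption: weights are renormalized over the batch axis.
    """
    rng = rng or np.random.default_rng(0)
    d, b = X.shape
    M = rng.beta(alpha, alpha, size=(n, d, b))  # mixing policy, element-wise Beta
    M = M / M.sum(axis=-1, keepdims=True)       # convex weights per coordinate (assumption)
    X_hat = (M * X[None]).sum(axis=-1)          # Hadamard product, then sum over batch axis
    Y_hat = (M * Y[None]).sum(axis=-1)          # same policy applied to the labels
    return X_hat, Y_hat
```

All n mixed points are produced in one vectorized step, which is the source of the efficiency advantage over sequential pair-wise mixing.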
For training, we minimize the negative dot-product loss between the mixed inputs and the mixed labels. In testing, an input x is still compared with all classes by computing the dot product between v(x) and every {w_i} to find the best class.

Experiments
In experiments, we check the effectiveness of our approach in NLP tasks with two settings: one is the full-shot setting that trains on the regular full training data; the other is the few-shot setting that trains with limited training data. Unfortunately, prior work about mixup never evaluated on few-shot scenarios.

[Table 1: Results including relation classification (FewRel (Han et al., 2018)) and intent classification (BANKING77 (Casanueva et al., 2020)). We decrease the size of training data from 100% to 1% with random sampling. All numbers are averaged over three random seeds.]
Tasks. We evaluate on the following three tasks.
• Textual Entailment. Textual entailment is a task that figures out the truth value of a hypothesis sentence given a premise sentence (Dagan et al., 2005). This is a binary classification ("entailment" or "non-entailment") problem where the input is a sentence pair. We use GLUE RTE (Wang et al., 2019); its small training set (compared with MNLI (Williams et al., 2018), for example) makes it a good testbed for data augmentation techniques.
• Relation Classification. FewRel (Han et al., 2018) is a large-scale relation classification dataset. It has 100 relation types, each with 700 labeled examples. The original FewRel relation set was split 64/16/20 for developing meta-learning techniques, which only allow a test instance to search for its relation type within the 20 candidates. This is not a practical setting because (i) in relation detection, an input should search for a label in the entire space of defined relations, and (ii) we should always define a "None" type in this problem because most span pairs in the input actually do not express a relation. Since the test relations of FewRel are not publicly available, we use the 64+16=80 relations as the entire relation set, in which 5 relations are treated as "None" (so this is effectively a regular "75+None" setting).
• Intent Classification ("intent"). We use the benchmark BANKING77 (Casanueva et al., 2020), a single-domain intent detection dataset comprising 13,083 annotated examples over 77 intents (average: 170 examples per intent). Each intent class is described by a short name, such as "get physical card", "lost or stolen card", etc.
Baselines. The augmentation-free system we use for the above tasks consists of a RoBERTa encoder and a final logistic regression layer. Based on this RoBERTa system, we compare our system BATCHMIXUP with (i) the standard MIXUP (Zhang et al., 2018; Sun et al., 2020), and (ii) non-linear MIXUP (Guo, 2020). Both baselines conduct data interpolation in the hidden states output by RoBERTa.
All systems are implemented with Huggingface's Transformers package (https://github.com/huggingface/).

Results and Analysis. Table 1 lists the main results. We notice that our approach RoBERTa+BATCHMIXUP consistently outperforms the baselines MIXUP and non-linear MIXUP. In "1% entailment", none of the systems really worked: all results are around the majority baseline. This is because the RTE task is very challenging given the extremely limited annotations.
With 2.5K × 1% = 25 labeled examples, RoBERTa cannot learn any useful representations. This is in line with observations in prior work showing that few-shot RTE (when k ∈ {1, 3, 5, 10}) makes RoBERTa fail. So, we conclude that when a system is close to random guessing, adding mixup is not helpful. In this situation, other conventional data augmentation techniques may make more sense, as the representation learning of synthetic data and that of the original data are decoupled.
To further study how BATCHMIXUP works, we simulate the classification process with a toy experiment: we generate a large amount of 2-dimensional data from Gaussian distributions for two classes (Figure 1(a)), and randomly sample 5 examples per class to conduct 5-shot classification. We use an MLP as the classifier, trained for 100 epochs. Comparing the final hyperplane of training with BATCHMIXUP with that of training without MIXUP, we observe that BATCHMIXUP improves the training considerably.
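The toy setup can be sketched as follows. This is a simplified reproduction under several assumptions: a linear logistic classifier stands in for the MLP, the mixing uses one convex Beta weight per mixed point over the whole 10-example batch rather than the per-coordinate policy of Equation 3, and all constants (means, learning rate) are illustrative:

```python
import numpy as np

def toy_experiment(seed=0, shots=5, epochs=100, lr=0.5):
    """Two 2-D Gaussian classes, 5 labeled shots each; train a
    linear logistic classifier on batch-mixed points and report
    accuracy on large held-out pools."""
    rng = rng_local = np.random.default_rng(seed)
    pool0 = rng.normal([-2.0, 0.0], 1.0, size=(500, 2))   # class 0 pool
    pool1 = rng.normal([+2.0, 0.0], 1.0, size=(500, 2))   # class 1 pool
    X = np.vstack([pool0[:shots], pool1[:shots]])         # 5-shot train set
    y = np.array([0.0] * shots + [1.0] * shots)
    w, b = np.zeros(2), 0.0
    for _ in range(epochs):
        # mix the whole mini-batch: convex Beta weights over all 10 points
        M = rng_local.beta(0.4, 0.4, size=(len(X), len(X)))
        M /= M.sum(axis=1, keepdims=True)
        Xm, ym = M @ X, M @ y                             # mixed inputs / soft labels
        p = 1.0 / (1.0 + np.exp(-(Xm @ w + b)))           # logistic forward pass
        g = p - ym                                        # cross-entropy gradient
        w -= lr * (Xm.T @ g) / len(g)
        b -= lr * g.mean()
    eval_X = np.vstack([pool0, pool1])
    eval_y = np.array([0] * 500 + [1] * 500)
    pred = (eval_X @ w + b > 0).astype(int)
    return (pred == eval_y).mean()
```

Even with only 5 shots per class, training on mixed points that fill the region between the clusters yields a reasonable decision boundary on this separable toy data.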
Last but not least, training for the same number of epochs with "w/ MIXUP" and "w/ Nonlinear MIXUP" takes much longer than with our system "w/ BATCHMIXUP". For example, when all systems separately run on a Tesla V100 GPU, our system BATCHMIXUP and the baseline "RoBERTa-large" both take about 1.5min to finish one epoch on RTE, but "w/ MIXUP" and "w/ Nonlinear MIXUP" take ∼20mins if β is sampled 15 times per point pair.

Conclusion
In this work, we proposed a novel MIXUP model, named BATCHMIXUP, to improve text classifiers. Different from prior MIXUP variants, which always interpolate two random points, our system interpolates all the hidden states in the mini-batch. The points mixed by our system are able to better cover the space expressed by the mini-batch. The experiments and visualization analysis both show the effectiveness of our model BATCHMIXUP.