Not Far Away, Not So Close: Sample Efficient Nearest Neighbour Data Augmentation via MiniMax

In Natural Language Processing (NLP), finding data augmentation techniques that can produce high-quality, human-interpretable examples has always been challenging. Recently, leveraging kNN retrieval, in which augmented examples are drawn from large repositories of unlabelled sentences, has been a step toward interpretable augmentation. Inspired by this paradigm, we introduce MiniMax-kNN, a sample efficient data augmentation strategy tailored for Knowledge Distillation (KD). We exploit a semi-supervised approach based on KD to train a model on augmented data. In contrast to existing kNN augmentation techniques that blindly incorporate all samples, our method dynamically selects a subset of augmented samples that maximizes the KL divergence between the teacher and student models. This step aims to extract the most efficient samples to ensure our augmented data covers regions of the input space with maximum loss value. We evaluated our technique on several text classification tasks and demonstrated that MiniMax-kNN consistently outperforms strong baselines. Our results show that MiniMax-kNN requires fewer augmented examples and less computation to achieve superior performance over state-of-the-art kNN-based augmentation techniques.


Introduction
Knowledge distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015) has been successful in improving the performance of various NLP tasks such as language modelling (Jiao et al., 2020; Sanh et al., 2019; Turc et al., 2019), machine translation (Tan et al., 2019), natural language understanding (Rashid et al., 2021), and multi-task learning (Clark et al., 2019). It aims to transfer the knowledge embedded in one model, called the teacher, to another model, called the student, without compromising on accuracy (Furlanello et al., 2018). Data plays a significant role in the success of KD. The importance of data becomes even more crucial when dealing with large teacher models (Lopez-Paz et al., 2015) or managing tasks with small amounts of labelled data (Rashid et al., 2020; Nayak et al., 2019). The training objective of KD focuses on minimizing the discrepancy between the representations of a teacher model and a student model. However, this minimization is not guaranteed in regions of the input space that are not covered by training data. Data augmentation comes into play as a natural solution for such circumstances.

* Equal Contribution. † Work done while at Huawei Noah's Ark Lab.
Most existing data augmentation techniques are not tailored for KD, as the dynamics of the teacher and student models are not considered in generating augmented data. Moreover, other model-based data augmentation techniques, such as adversarial approaches, do not generate interpretable samples for NLP tasks (Du et al., 2021). In this work, inspired by the success of retrieval-based augmentation techniques (Guu et al., 2020; Khandelwal et al., 2020; Du et al., 2021; Kassner and Schütze, 2020), we propose MiniMax-kNN, an interpretable data augmentation methodology. Our technique is interleaved with KD training to generate realistic-looking training points. For this purpose, we use a massive external repository of unlabelled sentences. In contrast to previous kNN augmentation techniques, which naively extract and incorporate k samples, we propose a minimax approach to adapt kNN augmentation to KD and select our augmented samples more efficiently.
Experimental results show that our technique requires significantly fewer samples, matches the state-of-the-art kNN augmentation technique (Du et al., 2021), and improves generalization to unseen data. Our key contributions can be summarized as follows:
• We tailor kNN-based data augmentation for KD via MiniMax to select more impactful augmented samples for training.
• We significantly improve sample efficiency of kNN-based data augmentation.
• We conduct extensive experiments to evaluate our proposed method and show that we can maintain test performance while training only on influential augmented examples.

Data Augmentation in KD
KD (Hinton et al., 2015) is a training method that incorporates the knowledge of a teacher network into the training of a student network. The teacher can be trained on the same dataset as the student and often provides a suitable approximation of the underlying distribution of the data. The training loss of the student using KD is formulated as in Eq. (1):

$$\mathcal{L} = (1 - \lambda)\, CE\big(y, \sigma(z_s)\big) + \lambda\, T^2\, KL\big(\sigma(z_t / T), \sigma(z_s / T)\big) \qquad (1)$$

where $z_s$ and $z_t$ refer to the logits of the student and teacher networks, $\sigma(\cdot)$ is the softmax prediction, and $CE$ and $KL$ refer to the cross-entropy and KL-divergence losses, respectively. $\lambda$ is a hyperparameter that controls the contribution of the KD loss with respect to the original cross-entropy loss, and $T$ is the temperature parameter, which determines the smoothness of the output probability.

Figure 1: Data sparsity problem in KD; $f$, $f_t$, and $f_s$ represent the underlying function, teacher output, and student output, respectively. We show 10 augmented samples around $x_2$ with small circles on the X-axis. The green circles show the augmented samples selected by our MiniMax-kNN because these points correspond to maximum-divergence regions of the teacher and student networks. The red circles are rejected augmented samples.

Although KD has been shown to be successful in model compression (Buciluǎ et al., 2006) and in improving the performance of neural networks (Furlanello et al., 2018), the core prerequisites for effective KD are often overlooked. Lopez-Paz et al. (2015) give a good insight into these conditions using VC-dimension analysis:

$$\mathcal{O}\Big(\frac{|F_t|_c}{n^{\alpha}}\Big) + \varepsilon_t + \mathcal{O}\Big(\frac{|F_s|_c}{n^{\alpha}}\Big) + \varepsilon_l \;\le\; \mathcal{O}\Big(\frac{|F_s|_c}{\sqrt{n}}\Big) + \varepsilon_s \qquad (2)$$
where $F_s$ and $F_t$ are the function classes corresponding to the student and teacher; $|\cdot|_c$ is a function class capacity measure; $\mathcal{O}(\cdot)$ is the estimation error of training the learner; $\varepsilon_s$ is the approximation error of the best estimator function belonging to the $F_s$ class with respect to the underlying function; $\varepsilon_t$ is the analogous approximation error for the teacher with respect to the underlying function; $\varepsilon_l$ is the approximation error of the best student function with respect to the teacher function; $n$ is the number of training samples; and $\frac{1}{2} \le \alpha \le 1$ is a parameter related to the difficulty of the problem.
According to Eq. (2), it is clear that when the capacity of the teacher is large or when the number of training samples is small, training with KD can be less beneficial. Figure 1 illustrates this problem through a synthetic example: the KD loss forces the student to follow the teacher on training samples, but there is no guarantee of this happening in regions of the input space that are not covered by training data. Therefore, the chance of a mismatch between the two networks is higher when training data is sparse or when there is a large gap between the two networks.
Data augmentation can be considered a remedy for this problem. To the best of our knowledge, most existing techniques are not sample efficient and blindly consider all generated samples in their training. As illustrated in Figure 1, different augmented samples can contribute differently to the final teacher/student loss. Moreover, these augmentation techniques are not tailored for KD. Our MiniMax-kNN solution addresses these two problems.
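To make Eq. (1) concrete, here is a minimal NumPy sketch of the combined objective for a single example (the function name `kd_loss` and the default values of λ and T are illustrative choices on our part, not taken from the paper):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over a logit vector."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_loss(z_s, z_t, y, lam=0.5, T=2.0):
    """Eq. (1): (1 - lam) * CE(y, softmax(z_s)) + lam * T^2 * KL(teacher || student)."""
    ce = -np.log(softmax(z_s)[y])          # cross entropy against the hard label y
    q_t = softmax(z_t, T)                  # teacher distribution, smoothed by T
    q_s = softmax(z_s, T)                  # student distribution, smoothed by T
    kl = np.sum(q_t * (np.log(q_t) - np.log(q_s)))
    return (1 - lam) * ce + lam * (T ** 2) * kl
```

When the student logits match the teacher logits exactly, the KL term vanishes and only the cross-entropy term against the hard label remains.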

Nearest Neighbour Data Augmentation
The kNN augmentation strategy consists of two main stages: (a) a paraphrastic nearest neighbour retrieval engine, and (b) a training method using augmented samples.
Initially, training examples are queried over a large sentence repository using a general-purpose paraphrastic encoder. The aim of this stage is to find interpretable, unannotated augmented samples that are semantically close to training data. For this purpose, we use one of the sentence repositories from SentAugment (Du et al., 2021), comprising 100M sentences collected from Common Crawl. We also employ the same paraphrastic sentence encoder, namely SASE, introduced in SentAugment. SASE is an XLM model (Lample and Conneau, 2019), fine-tuned on a number of well-known paraphrase datasets using a triplet loss to maximize the cosine similarity between representations of paraphrases. The similarity between a pair of sentence representations obtained from SASE can be adopted for unsupervised semantic similarity. Du et al. (2021) show that SASE achieves high correlation (0.73 on average) with human judgment on several STS benchmarks. Consequently, the kNN operation can be summarized as follows: suppose a dataset $\{x_i, y_i\}_{i=1}^{N}$, where $x_i$ and $y_i$ denote an example and its corresponding label, respectively. Given a large sentence repository $R$ encoded using SASE, the kNN are determined as the top $k$ sentences with respect to $\cos(\mathrm{SASE}(x_i), \mathrm{SASE}(s_j))$, where $s_j \in R$.
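The retrieval in stage (a) amounts to a top-k cosine-similarity search over pre-computed sentence embeddings. A simplified NumPy sketch (the function name `knn_retrieve` is ours; with a 100M-sentence repository, an approximate nearest neighbour index would replace the dense matrix product):

```python
import numpy as np

def knn_retrieve(query_vec, repo_vecs, k=8):
    """Return the indices of the k repository sentences most similar to the query.

    query_vec: (d,) paraphrastic embedding of a training example (e.g. from SASE).
    repo_vecs: (|R|, d) embeddings of the unlabelled sentence repository.
    """
    q = query_vec / np.linalg.norm(query_vec)                       # unit-normalize the query
    r = repo_vecs / np.linalg.norm(repo_vecs, axis=1, keepdims=True)
    sims = r @ q                                                    # cosine similarity to each sentence
    return np.argsort(-sims)[:k]                                    # indices of the k nearest neighbours
```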
Next, in step (b), a model is trained on the original data by minimizing $L_{CE}$ from Eq. (1). The trained model learns task-specific knowledge that is further useful in finding relevant augmented examples.

KD in Tandem with Data Augmentation
Adaptive data augmentation can strengthen the capacity of the teacher in transferring knowledge to the student during distillation (Fu et al., 2020). Numerous studies (Chen et al., 2020b; Xie et al., 2020b) have applied KD to self-training in image classification tasks. In NLP, however, generating semantically plausible examples that can be easily inspected by humans is more challenging. In TinyBERT (Jiao et al., 2020), a contextual augmentation method is used along with KD, but such augmentation does not take advantage of the teacher's or student's knowledge. A recent paradigm that heavily relies on data augmentation is zero-shot KD (Nayak et al., 2019; Rashid et al., 2020). In contrast, we explore the interpretability of augmentation in KD, which distinguishes our approach from the literature.

Data Augmentation in NLP
Word-level methods (Zhang et al., 2015; Xie et al., 2017; Wei and Zou, 2019) are heuristic-based and do not necessarily yield natural sentences. More recently, contextual augmentations (Kobayashi, 2018; Yi et al., 2021), which substitute words with other contextually plausible words, have been shown to be effective in text classification. However, these approaches do not produce diverse syntactic forms. Similarly, inspired by denoising auto-encoders, augmented examples can be sampled from the reconstruction distribution of corrupted sentences via Masked Language Modelling (Ng et al., 2020). Back-translation (Sennrich et al., 2016) is another strategy for obtaining augmented data (Yu et al., 2018; Xie et al., 2020a; Chen et al., 2020a; Qu et al., 2021).
Another line of work, mainly targeting model robustness, creates new data or counterfactual examples via human-in-the-loop perturbations (Kaushik et al., 2020; Khashabi et al., 2020; Jin et al., 2020). Nonetheless, these strategies are task-specific and do not scale to generating data at massive scale. Besides, our method diverges from these studies in that we intend to build a semi-supervised system with minimal human intervention.
Several models (Miyato et al., 2017; Zhu et al., 2020; Qu et al., 2021) leveraged adversarial training for data augmentation. These methods manipulate the input embedding space to construct synthetic examples. Neighbourhoods around training instances in the embedding space cannot be translated back to text and thus are not interpretable. Although we advocate for interpretable data augmentation, we do not compete with these techniques; in fact, gradient-based augmentation is complementary to our method.

Figure 2: A schematic view of MiniMax-kNN
Finally, kNN, a non-parametric search algorithm that probes an external data source to find nearest neighbours, is employed in several NLP tasks such as language modelling (Khandelwal et al., 2020), machine translation (Khandelwal et al., 2021), cloze question answering (Kassner and Schütze, 2020), and open-domain question answering. kNN offers access to an explicit memory that can retrieve factual knowledge from a data store. kNN is highly interpretable, as knowledge is stored in raw text, a format that is easy for humans to understand. Recently, SentAugment (Du et al., 2021) introduced a semi-supervised strategy built on unlabelled sentences. It retrieves augmented samples from a universal data store using kNN. Our proposed strategy is in line with SentAugment at heart, but differs in how it leverages the augmented examples during training. We focus on sample efficiency and show that we can reduce the size of the augmented data (e.g., by 60% in sentiment classification, as reported in Section 5.6) while reaching a competitive performance.

MiniMax-kNN Data Augmentation for KD
Inspired by Volpi et al. (2018) and Madry et al. (2018), we apply a minimax framework to tailor a sample efficient kNN data augmentation for KD. Minimizing the maximum expected risk is used in adversarial training (Volpi et al., 2018) and is shown to have guaranteed performance on distributions $P$ within a particular distance $\rho$ of a source distribution $P_0$:

$$\min_{\theta} \; \sup_{P : D(P, P_0) \le \rho} \; \mathbb{E}_{(x', y') \sim P}\big[ l(x', y'; \theta) \big] \qquad (3)$$

where $D$ is a notion of distance between distributions, $l$ refers to the loss function, $\theta$ represents the parameters of the estimator model, and, in our framework, $(x', y')$ are augmented data samples. Let us define the set of kNN augmented samples corresponding to a training sample $x_i \in \mathcal{X}$ from the training set $\mathcal{X}$ to be $A(x_i)$. In the maximization phase, we define the loss $l(x', y'; \theta) = KL\big(T(x'), S(x'; \theta)\big)$ between the softmax output of the teacher $T(x')$ and that of the student network $S(x'; \theta)$ with trainable parameters $\theta$, with respect to the given augmented samples. Note that the augmented samples are unlabelled in the maximization phase. Then, we sort the augmented samples based on their loss value and form our MiniMax-kNN augmentation set $\bar{A}(x_i)$ by selecting the top $n$ out of the $k$ samples in $A(x_i)$. $n$ is a hyper-parameter in our method that determines the sample efficiency of MiniMax-kNN. In order to enforce $D(P, P_0) \le \rho$ on the distance between the two distributions in Eq. (3), in our kNN search we set a maximum radial semantic distance $\epsilon$ between the sentence representation of accepted augmented samples in $\bar{A}(x_i)$ and the sentence representation of their corresponding input $x_i$, based on the angular distance metric:

$$d(x_i, x') = \frac{1}{\pi} \arccos\!\left( \frac{\langle h^t_{cls}(x_i), h^t_{cls}(x') \rangle}{\lVert h^t_{cls}(x_i) \rVert \, \lVert h^t_{cls}(x') \rVert} \right) \le \epsilon \qquad (4)$$

where $h^t_{cls}$ refers to the teacher's last-layer hidden representation of the [CLS] token, and $\langle \cdot \rangle$ denotes the dot product of two vectors. The discussion on how to adjust $\epsilon$ is given in Section 5.3.
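The radial constraint of Eq. (4) can be sketched in a few lines of NumPy (the helper names `angular_distance` and `within_radius` are our own, not from the paper):

```python
import numpy as np

def angular_distance(h_a, h_b):
    """Angular distance of Eq. (4) between two [CLS] representations, in [0, 1]."""
    cos = np.dot(h_a, h_b) / (np.linalg.norm(h_a) * np.linalg.norm(h_b))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi   # clip guards against rounding error

def within_radius(h_orig, h_augs, eps):
    """Indices of augmented samples within angular distance eps of the original."""
    return [i for i, h in enumerate(h_augs) if angular_distance(h_orig, h) <= eps]
```

Orthogonal representations sit at distance 0.5, identical ones at 0, so the threshold ε is a value in [0, 1].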
In summary, our technique equips kNN augmentation with minimax to improve its sample efficiency. In contrast to adversarial data augmentation methods, our approach uses the minimax loss for selecting augmented samples. The overall structure of our augmentation strategy is visualized in Figure 2. We essentially follow three steps in each iteration during training:
(1) We compute teacher logits and student logits for the augmented samples to measure the KL divergence between the two models.
(2) Out of all kNN samples, the n samples with the highest KL divergence are selected.
(3) The KD loss is minimized for the training data and the selected augmented samples.
Our experiments reveal that this modification to KD underscores sample efficiency while retaining the test performance.
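The maximization step, step (2) above, reduces to ranking augmented samples by teacher-student KL divergence and keeping the top n. A NumPy sketch under our own naming (`minimax_select`); a real implementation would obtain the logits from forward passes of the two networks:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def minimax_select(teacher_logits, student_logits, n):
    """Keep the n augmented samples on which teacher and student disagree most.

    teacher_logits, student_logits: (num_aug, num_classes) arrays.
    Returns indices of the n samples with the largest KL(teacher || student).
    """
    p_t = softmax(np.asarray(teacher_logits, dtype=float))
    p_s = softmax(np.asarray(student_logits, dtype=float))
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)   # per-sample divergence
    return np.argsort(-kl)[:n]                                # top-n by divergence
```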

FLOPs Analysis of MiniMax-kNN
Minimax computations in MiniMax-kNN incur additional overhead during training, but how much, precisely, do the minimax operations cost at runtime? To answer this question, we analyze the logical compute complexity of our algorithm in terms of floating point operations (FLOPs), because FLOPs can be measured regardless of hardware considerations (Clark et al., 2020).
To this end, we compare the FLOPs corresponding to each augmented example from MiniMax-kNN with vanilla kNN within an epoch. Suppose a forward pass and a backward pass for one batch take $F$ and $B$ FLOPs, respectively. The number of matrix operations between the forward pass and the backward pass is not considerably different and hence $F \approx B$ (Clark et al., 2020). For simplicity, we assume the batch size is 1. Assuming $k_1$ is the number of retrieved NNs, vanilla kNN requires $k_1 F + k_1 B$ additional FLOPs per epoch.
On the other hand, MiniMax-kNN selects $n$ neighbours from $k_2$ retrieved nearest neighbours, i.e., $n < k_2$. The algorithm first takes the logits of all $k_2$ neighbours to compute the KL-divergence vector, which needs $k_2 F$ FLOPs, similar to vanilla kNN. The extra operations of MiniMax-kNN occur in the maximization step, in which the top $n$ neighbours are determined with respect to their KL-divergence values. This operation can be carried out by sorting the KL-divergence vector, which costs $S$ FLOPs. Note that $S \ll F$ because obtaining an output from a deep neural network is far more costly than a sorting operation. The backward pass is then computed only for the $n$ selected neighbours. Accordingly, the overall FLOPs for MiniMax-kNN is $k_2 F + S + nB$. The difference between the two FLOP counts is:

$$\Delta = (k_1 F + k_1 B) - (k_2 F + S + nB)$$

Given that $B$ can be approximated by $F$ (as mentioned earlier) and $S \ll F$:

$$\Delta \approx (2 k_1 - k_2 - n) F$$

Thus, as long as $k_2 + n < 2 k_1$, MiniMax-kNN is more efficient than vanilla kNN. In our experiments, we illustrate how MiniMax-kNN surpasses vanilla kNN while satisfying this FLOPs condition.
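The bookkeeping above is simple enough to sanity-check in code (a sketch under our own naming; F stands for the abstract cost of one forward pass):

```python
def minimax_is_cheaper(k1, k2, n):
    """Condition derived above: MiniMax-kNN beats vanilla kNN when k2 + n < 2*k1."""
    return k2 + n < 2 * k1

def flops_saving(k1, k2, n, F):
    """Approximate per-example FLOPs saved, using B ~ F and S << F:
    (k1*F + k1*B) - (k2*F + S + n*B)  ~  (2*k1 - k2 - n) * F."""
    return (2 * k1 - k2 - n) * F
```

For example, with the same retrieval budget (k1 = k2 = 8) and n = 4 selected neighbours, MiniMax-kNN saves roughly 4F FLOPs per augmented example.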

Datasets
We evaluate MiniMax-kNN on five datasets: SST-2 and SST-5 (Socher et al., 2013) for sentiment analysis, TREC (Li and Roth, 2002) for question type classification, CR (Hu and Liu, 2004) for product review classification, and Impermium's hate-speech detection dataset (IMP). Information on all datasets is summarized in Table 1.

Experimental Setup
We adopt the publicly available pre-trained RoBERTa Large as the teacher and DistilRoBERTa as the student. On SST-2, SST-5, and TREC, augmentation takes effect at epochs 8, 6, and 6, respectively, whereas on IMP and CR, augmentation starts at the beginning of training. All experiments were conducted on two Nvidia Tesla V100 GPUs.
Few-shot learning setup We follow Du et al. (2021) to set up the environment for few-shot learning experiments. In particular, we sample 2 training subsets with replacement from the original training set for each task. Each subset is balanced and consists of 20 examples per label. The development set is reduced to 200 examples for all tasks except CR, for which we keep the entire original set. The label distribution is retained in the reduced development data. Evaluation is conducted on the actual test set. To obtain reliable results, we repeat training with 10 different seeds on each sampled dataset and report the average across all runs, i.e., 20 runs per task. Few-shot experiments were run on a single Nvidia Tesla V100 GPU.

MiniMax-kNN Results
First, we investigate the impact of kNN data augmentation at test time and compare MiniMax-kNN with vanilla kNN data augmentation. To this end, we train a RoBERTa Large teacher on the original data. Then, we distill a smaller DistilRoBERTa student from the teacher using the augmented data and the original data.
In Table 2, we report the performance of MiniMax-kNN as well as vanilla-kNN on the downstream tasks. In this experiment, the number of nearest neighbours (k) is set to 8, and for MiniMax-kNN, we empirically select the minimum number of augmented examples (n) out of the 8-NNs such that MiniMax-kNN exceeds vanilla-kNN. We observe that using KD alone leads to a marginal improvement on all tasks. Adding more data results in further improvements, but comes at the expense of substantially longer training time. On the other hand, MiniMax-kNN reduces the cost of training, as it learns from fewer than half of the NNs and yet consistently outperforms vanilla-kNN.

Varying the number of selected examples (n) in MiniMax-kNN
We explore the number of selected augmentations by varying n ∈ {1, 2, 4, 6} for 8-NNs on the downstream tasks. Results are reported in Table 3. Interestingly, picking n as small as 1 or 2 already yields superior performance for MiniMax-kNN, compared to vanilla-kNN, on all tasks. In TREC and SST-2, the sweet spot is n = 4. In SST-5 and CR, MiniMax-kNN performs better as n grows. On the contrary, in IMP, accuracy declines as n increases.

Varying the number of nearest neighbours (k)
In order to investigate the optimal number of NNs, we assess the effect of k on the downstream tasks. The results are reported in Table 4. We observe that more data sometimes makes training noisy and, as a result, performance deteriorates, e.g., k = 2 in SST-5 and IMP. Nonetheless, when the augmentation size is sufficiently large, test results improve, i.e., k = 8 in all datasets except CR. Apart from three cases, i.e., k = 2, 4 in SST-2 and k = 4 in CR, MiniMax-kNN is superior to vanilla-kNN while incorporating roughly 50% fewer examples.

Table 4: Test accuracy (↑) of DistilRoBERTa on the downstream tasks varying the number of nearest neighbours (k). KD refers to knowledge distillation with no data augmentation. For MiniMax, n is equal to half of k for k = 2, 4, and when k = 8, n is selected as in Table 2 (bold and underline indicate best and second best results per task).
Adjusting the maximum radial distance (ε) in MiniMax-kNN
We plot the distance distribution of augmented data for two cases: (a) the teacher predicts the same label for an augmented example as for its original example (matched labels); (b) the predicted label for the augmented example does not match that of the original example (mismatched labels). Figure 3 illustrates a clear distinction between these two groups. Based on these insights, we use an empirical heuristic to set ε: when the overlap between the groups is infinitesimal, we tune ε in the vicinity of the maximum distance of the matched labels. The rationale is to avoid altering the skewness of the original label distribution. Throughout our experiments, ε is set to 0.22 and 0.4 for SST-5 and SST-2, respectively. However, we find ε = ∞ works best on CR, IMP, and TREC.
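This heuristic can be expressed compactly (a sketch under our own naming; `distances` holds the angular distances of Eq. (4) and `matched` flags whether the teacher's label for a neighbour agrees with the original example's label):

```python
def pick_eps(distances, matched):
    """Tune eps near the maximum distance among matched-label neighbours,
    so that augmentation does not skew the original label distribution.
    Falls back to no constraint (eps = infinity) if nothing matches."""
    matched_d = [d for d, m in zip(distances, matched) if m]
    return max(matched_d) if matched_d else float("inf")
```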

Runtime Efficiency
In §4.1, we showed that MiniMax-kNN is computationally more efficient than vanilla-kNN when $k_2 + n < 2k_1$. Given that the number of nearest neighbours is identical ($k_1 = k_2$) in our experiments, any choice of $n$ makes MiniMax-kNN more efficient than vanilla-kNN in theory. However, in our implementation of MiniMax, we feed the selected examples to the student again, thereby triggering a redundant forward pass: all augmented examples are initially fed to the student within a PyTorch no_grad block, and since we want to backpropagate through only the selected examples, these must be fed to the student again. Although this change reduces the efficiency of MiniMax-kNN in practice, it significantly simplifies the implementation. Thus, the above condition becomes $k_2 + 2n < 2k_1$ in our experiments. Nonetheless, in Table 2, this new efficiency constraint still holds on all tasks except SST-2 and TREC. To calculate the exact amount of speed-up, we measure the average training time of one epoch for each task. The results are outlined in Table 5. On IMP, MiniMax-kNN saves more than 60% of training time, and on CR, MiniMax-kNN brings almost 30% speed-up. MiniMax-kNN is also slightly faster than vanilla-kNN on SST-5. However, MiniMax-kNN trains around 30% slower on SST-2 and TREC.

Ablation Study
We analyze each component of our augmentation strategy to understand how it impacts the overall effectiveness of MiniMax-kNN. To this end, three components of our strategy are targeted in an ablation study. First, the effect of nearest neighbours is measured by replacing them with random examples from the sentence repository. Then, to determine whether reranking neighbours by the teacher is helpful, we preserve the order of nearest neighbours returned by SASE. Finally, we relax the maximum radial distance to include all nearest neighbours. In Table 7, we report the results on SST-5. Surprisingly, random augmentation (row 3) scores only 0.3% lower than kNN augmentation (row 6). Reranking nearest neighbours by the teacher further boosts the results by 0.5% (row 7). The maximum radial distance is not helpful for vanilla-kNN, as it leads to a 0.4% drop in accuracy (row 8). Finally, our selection mechanism in MiniMax-kNN (row 9) leads to a 0.2% improvement over vanilla-kNN (row 7).

Few-shot experiments
Our data augmentation strategy can be applied to few-shot learning scenarios where only a minuscule amount of labelled data is available. Therefore, we simulate a few-shot learning setting as described in Section 5.2. In addition to the vanilla-kNN and no-augmentation baselines, we compare our results with SentAugment (Du et al., 2021), the state-of-the-art method in kNN data augmentation. In SentAugment, experiments are conducted on 5 randomly sampled small datasets and the top 3 results of 10 different runs are averaged across the sampled datasets, i.e., an average over 15 runs in total. To be comparable to SentAugment, we average across all 10 runs for 2 sampled datasets, i.e., an average over 20 runs in total. In SentAugment, augmented few-shot datasets contain 1000 examples including the original data. For MiniMax-kNN, we use 10-NNs in this experiment with a maximum radial distance. Table 6 shows the few-shot learning results. The performance of our baselines follows a similar trend to the full-size data experiments. In particular, KD without augmentation slightly improves test accuracy; vanilla-kNN brings almost 1.9% improvement on average, and MiniMax-kNN consistently surpasses vanilla-kNN by 0.7% on average. Compared to SentAugment, MiniMax-kNN reaches a competitive performance. The key advantage of MiniMax-kNN lies in sample efficiency. Specifically, MiniMax-kNN falls short by only 0.3% on SST-2 and IMP while using less than 40% of the SentAugment augmented data on average. On CR, MiniMax-kNN lags behind by 1.4%, but again with roughly 40% of the SentAugment data size. Moreover, SentAugment outperforms our approach by 0.3% on TREC, while the size of the augmentation is reduced by almost 20%. Lastly, MiniMax-kNN outperforms SentAugment on SST-5 by 2.4% with almost the same amount of data.

Qualitative Analysis
We study the quality of augmented examples retrieved from the sentence repository. Table 8 presents four examples from SST-5, CR, and TREC along with the corresponding top 3-NNs.

(i) SST-5: this is a stunning film, a one-of-a-kind tour de force. [very positive]
    Here is masterful film-making in action. (5) [very positive]
    It's an expertly-crafted spectacle-event movie. (1) [very positive]
    This is a unique cinematographic experience. (6) [very positive]

(ii) CR: one also exhibited extremely slow speed when going to the menu. [negative]
    No menu appears to make it very quick and easy to use. (15) [negative]
    Switching between options in the main menu is relatively slow. (13) [negative]
    the only niggle i have found is that the menus are a bit slow at times. (8) [negative]

(iii) SST-5: final verdict: you've seen it all before. [very negative]
    Below is the final result. (15) [neutral]
    The final verdict: Go ahead and buy (4) [positive]
    Nut in the end, the final result always pays out. (7) [positive]

(iv) TREC: What causes the body to shiver in cold temperatures? [DESC]
    How is it possible that a higher minimum wage could actually lead to more inequality within a country? (11) [DESC]
    How did the minimum wage increase come about? (13) [DESC]
    How is the new minimum wage hike impacting them? (7) [DESC]

Table 8: Examples derived from the augmented CR, SST-5, and TREC data, after teacher reranking (the numbers in parentheses indicate the initial rank by SASE). For the nearest neighbours, the teacher's predictions are also provided, although soft labels are used during training. Row (iii) shows an example of label mismatch and row (iv) highlights a mediocre paraphrase retrieval despite matching labels.

The top two rows show clear-cut examples in which the nearest neighbours are in fact paraphrased forms of the original samples. Also, the teacher predicts the same label for these augmented examples as for the original examples. We observe that the task-specific knowledge the teacher has learned from the original data helps to rank retrieved sentences, e.g., in the second row, reranking pushes the neighbours at ranks 15 and 13 into the top 3. However, the augmented data is not always perfect. To identify the limitations of kNN augmentation, we manually inspect 20 samples randomly drawn from SST-5 and TREC. We find that inaccurate paraphrase retrieval undercuts the quality of augmented examples, as shown in the bottom rows of Table 8. A side effect of this weakness is domain mismatch, meaning that augmentation can introduce out-of-domain data. For instance, the input data in TREC is expected to be in the interrogative mood, but the retrieval may return declarative sentences. A potential solution to this problem could be utilizing a different repository, entirely comprised of questions in this case, similar to that of Perez et al. (2020). However, curating such a repository for specialized domains can be challenging. Moreover, the improvements we observe in the experiments show that this issue is not prevalent in our selected tasks.

Conclusion
In this paper, we presented a sample efficient semi-supervised data augmentation technique, namely MiniMax-kNN. The augmentation procedure is framed as finding nearest neighbours in a massive repository of unannotated sentences. A crucial aspect of kNN augmentation is interpretability, as augmented examples are written in natural language. We adopt KD to learn from unlabelled data.
The key ingredient of our approach is finding the most impactful examples, those that maximize the KL divergence between the teacher and student models. We show that MiniMax-kNN can reduce the augmented data size by 50% while improving upon vanilla augmentation.