A Survey of Data Augmentation Approaches for NLP

Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of existing literature in data augmentation for NLP and motivate additional work in this area. We also present a GitHub repository with a paper list that will be continuously updated at https://github.com/styfeng/DataAug4NLP


Introduction
Data augmentation (DA) refers to strategies for increasing the diversity of training examples without explicitly collecting new data. It has received active attention in recent machine learning (ML) research in the form of well-received, general-purpose techniques such as UDA (Xie et al., 2020) (3.1), which used backtranslation (Sennrich et al., 2016), AutoAugment (Cubuk et al., 2018), and RandAugment (Cubuk et al., 2020), and MIXUP (Zhang et al., 2017) (3.2). These are often first explored in computer vision (CV), and DA's adaptation for natural language processing (NLP) seems secondary and comparatively underexplored, perhaps due to challenges presented by the discrete nature of language, which rules out continuous noising and makes it more difficult to maintain invariance.
Despite these challenges, there has been increased interest and demand for DA for NLP. As NLP grows due to the off-the-shelf availability of large pretrained models, there are increasingly more tasks and domains to explore. Many of these are low-resource and have a paucity of training examples, creating many use cases in which DA can play an important role. In particular, for many non-classification NLP tasks such as span-based tasks and generation, DA research is relatively sparse despite their ubiquity in real-world settings.
Our paper aims to sensitize the NLP community towards this growing area of work, which has also seen increasing interest in ML overall (as seen in Figure 1). As interest and work on this topic continue to increase, this is an opportune time for a paper of our kind to (i) give a bird's eye view of DA for NLP, and (ii) identify key challenges to effectively motivate and orient interest in this area. To the best of our knowledge, this is the first survey to take a detailed look at DA methods for NLP. This paper is structured as follows. Section 2 discusses what DA is, its goals and trade-offs, and why it works. Section 3 describes popular methodologically representative DA techniques for NLP, which we categorize into rule-based (3.1), example interpolation-based (3.2), or model-based (3.3). Section 4 discusses useful NLP applications for DA, including low-resource languages (4.1), mitigating bias (4.2), fixing class imbalance (4.3), few-shot learning (4.4), and adversarial examples (4.5). Section 5 describes DA methods for common NLP tasks including summarization (5.1), question answering (5.2), sequence tagging tasks (5.3), parsing tasks (5.4), grammatical error correction (5.5), neural machine translation (5.6), data-to-text NLG (5.7), open-ended and conditional text generation (5.8), dialogue (5.9), and multimodal tasks (5.10). Finally, Section 6 discusses challenges and future directions in DA for NLP. Appendix A lists useful blog posts and code repositories.
Figure 1: Weekly Google Trends scores for the search term "data augmentation", with a control, uneventful ML search term ("minibatch") for comparison.
Through this work, we hope to emulate past papers which have surveyed DA methods for other types of data, such as images (Shorten and Khoshgoftaar, 2019), faces (Wang et al., 2019b), and time series (Iwana and Uchida, 2020). We hope to draw further attention, elicit broader interest, and motivate additional work in DA, particularly for NLP.

Background
What is data augmentation? Data augmentation (DA) encompasses methods of increasing training data diversity without directly collecting more data. Most strategies either add slightly modified copies of existing data or create synthetic data, aiming for the augmented data to act as a regularizer and reduce overfitting when training ML models (Shorten and Khoshgoftaar, 2019; Hernández-García and König, 2020). DA has been commonly used in CV, where techniques like cropping, flipping, and color jittering are a standard component of model training. In NLP, where the input space is discrete, how to generate effective augmented examples that capture the desired invariances is less obvious.
What are the goals and trade-offs? Despite challenges associated with text, many DA techniques for NLP have been proposed, ranging from rule-based manipulations (Zhang et al., 2015) to more complicated generative approaches (Liu et al., 2020b). As DA aims to provide an alternative to collecting more data, an ideal DA technique should be both easy-to-implement and improve model performance. Most offer trade-offs between these two.
Rule-based techniques are easy-to-implement but usually offer incremental performance improvements (Li et al., 2017; Wei and Zou, 2019; Wei et al., 2021b). Techniques leveraging trained models may be more costly to implement but introduce more data variation, leading to better performance boosts. Model-based techniques customized for downstream tasks can have strong effects on performance but be difficult to develop and utilize.
Further, the distribution of augmented data should be neither too similar nor too different from the original: if too similar, it may lead to greater overfitting; if too different, it may hurt performance by training on examples that are not representative of the given domain. Effective DA approaches should aim for a balance.
Kashefi and Hwa (2020) devise a KL-Divergence-based unsupervised procedure to preemptively choose among DA heuristics, rather than a typical "run-all-heuristics" comparison, which can be very time and cost intensive.
Interpretation of DA Dao et al. (2019) note that "data augmentation is typically performed in an ad-hoc manner with little understanding of the underlying theoretical principles", and claim the typical explanation of DA as regularization to be insufficient. Overall, there indeed appears to be a lack of research on why exactly DA works. Existing work on this topic is mainly surface-level, and rarely investigates the theoretical underpinnings and principles. We discuss this challenge more in §6, and highlight some of the existing work below. Bishop (1995) show that training with noised examples is reducible to Tikhonov regularization (which subsumes L2 regularization). Rajput et al. (2019) show that DA can increase the positive margin for classifiers, but only when augmenting exponentially many examples for common DA methods. Dao et al. (2019) think of DA transformations as kernels, and find two ways DA helps: averaging of features and variance regularization. Chen et al. (2020d) show that DA leads to variance reduction by averaging over orbits of the group that keep the data distribution approximately invariant.

Techniques & Methods
We now discuss some methodologically representative DA techniques which are relevant to all tasks via the extensibility of their formulation. Table 1 compares several DA methods by various aspects relating to their applicability, dependencies, and requirements.

Rule-Based Techniques
Here, we cover DA primitives which use easy-to-compute, predetermined transforms sans model components. Feature space DA approaches generate augmented examples in the model's feature space rather than in the input data. Many few-shot learning approaches (Hariharan and Girshick, 2017; Schwartz et al., 2018) leverage estimated feature space "analogy" transformations between examples of known classes to augment for novel classes (see §4.4). Paschali et al. (2019) use iterative affine transformations and projections to maximally "stretch" an example along the class-manifold. Wei and Zou (2019) propose EASY DATA AUGMENTATION (EDA), a set of token-level random perturbation operations including random insertion, deletion, and swap. They show improved performance on many text classification tasks. UDA (Xie et al., 2020) shows how supervised DA methods can be exploited for unsupervised data through consistency training on (x, DA(x)) pairs.
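To make these token-level operations concrete, the following is a minimal sketch of EDA-style perturbations. It is our own simplified illustration: synonym replacement is omitted since it requires a lexical resource such as WordNet, and the function and parameter names are not from the original paper.

```python
import random

def eda_perturb(tokens, p=0.1, rng=random):
    """Apply one randomly chosen EDA-style operation (insertion, deletion,
    or swap) to a list of tokens and return the perturbed copy."""
    tokens = list(tokens)
    op = rng.choice(["insert", "delete", "swap"])
    if op == "insert" and tokens:
        # Random insertion: place a copy of a random token at a random position.
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(tokens))
    elif op == "delete":
        # Random deletion: drop each token independently with probability p.
        kept = [t for t in tokens if rng.random() > p]
        tokens = kept or tokens  # avoid returning an empty sentence
    elif op == "swap" and len(tokens) > 1:
        # Random swap: exchange the positions of two tokens.
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

print(eda_perturb("the movie was surprisingly good".split()))
```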
For paraphrase identification, Chen et al. (2020b) construct a signed graph over the data, with individual sentences as nodes and pair labels as signed edges. They use balance theory and transitivity to infer augmented sentence pairs from this graph. Motivated by image cropping and rotation, Şahin and Steedman (2018) propose dependency tree morphing. For dependency-annotated sentences, children of the same parent are swapped (à la rotation) or some are deleted (à la cropping), as seen in Figure 2. This is most beneficial for language families with rich case marking systems (e.g. Baltic and Slavic).

Example Interpolation Techniques
Another class of DA techniques, pioneered by MIXUP (Zhang et al., 2017), interpolates the inputs and labels of two or more real examples. This class of techniques is also sometimes referred to as Mixed Sample Data Augmentation (MSDA). Ensuing work has explored interpolating inner components (Verma et al., 2019;Faramarzi et al., 2020), more general mixing schemes (Guo, 2020), and adding adversaries (Beckham et al., 2019).
Another class of extensions of MIXUP, which has been growing in the vision community, attempts to fuse raw input image pairs together into a single input image, rather than improve the continuous interpolation mechanism. Examples of this paradigm include CUTMIX (Yun et al., 2019), CUTOUT (DeVries and Taylor, 2017), and COPY-PASTE (Ghiasi et al., 2020). For instance, CUTMIX replaces a small sub-region of Image A with a patch sampled from Image B, with the labels mixed in proportion to sub-region sizes. There is potential to borrow ideas and inspiration from these works for NLP, e.g. for multimodal work involving both images and text (see "Multimodal challenges" in §6).
Figure 2: Dependency tree morphing DA applied to a Turkish sentence (Şahin and Steedman, 2018).
A bottleneck to using MIXUP for NLP tasks was the requirement of continuous inputs. This has been overcome by mixing embeddings or higher hidden layers (Chen et al., 2020c). Later variants propose speech-tailored mixing schemes (Jindal et al., 2020b) and interpolation with adversarial examples (Cheng et al., 2020), among others. SEQ2MIXUP (Guo et al., 2020) generalizes MIXUP for sequence transduction tasks in two ways: the "hard" version samples a binary mask (from a Bernoulli with a Beta(α, α) prior) and picks from one of two sequences at each token position, while the "soft" version softly interpolates between sequences based on a coefficient sampled from Beta(α, α). The "soft" version is found to outperform the "hard" version and earlier interpolation-based techniques like SWITCHOUT (Wang et al., 2018a).
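As a concrete illustration of interpolation at the representation level, here is a minimal sketch (ours, not from the cited works) that mixes two fixed-size text representations, such as sentence embeddings or pooled hidden states, and their one-hot labels with a coefficient sampled from Beta(α, α):

```python
import numpy as np

def mixup_text_representations(x1, x2, y1, y2, alpha=0.2, rng=np.random):
    """Interpolate two example representations and their one-hot labels,
    following the MIXUP formulation with lambda ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x1 + (1.0 - lam) * x2   # mixed input representation
    y_mix = lam * y1 + (1.0 - lam) * y2   # mixed (soft) label
    return x_mix, y_mix

# Toy usage: 768-dimensional "embeddings" and one-hot labels for a 3-class task.
x1, x2 = np.random.randn(768), np.random.randn(768)
y1, y2 = np.eye(3)[0], np.eye(3)[2]
x_mix, y_mix = mixup_text_representations(x1, x2, y1, y2)
```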

Model-Based Techniques
Seq2seq and language models have also been used for DA. The popular BACKTRANSLATION method (Sennrich et al., 2016) translates a sequence into another language and then back into the original language. Kumar et al. (2019a) train seq2seq models with their proposed method DiPS, which learns to generate diverse paraphrases of input text using a modified decoder with a submodular objective, and show its effectiveness as DA for several classification tasks. Pretrained language models such as RNNs (Kobayashi, 2018) and transformers have also been used for augmentation.
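Backtranslation is simple to sketch in code. The example below assumes the Hugging Face transformers library and publicly available MarianMT checkpoints; the specific model names and the English-to-German pivot are illustrative choices on our part, not prescribed by the cited work.

```python
from transformers import pipeline

# Illustrative checkpoints; any pair of translation models could be used.
to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def backtranslate(sentence: str) -> str:
    """Translate English -> German -> English to obtain a paraphrase
    usable as an augmented training example."""
    german = to_de(sentence)[0]["translation_text"]
    return to_en(german)[0]["translation_text"]

print(backtranslate("Data augmentation increases the diversity of training data."))
```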
Kobayashi (2018) generate augmented examples by replacing words with others randomly drawn according to the recurrent language model's distribution based on the current context (illustration in Figure 3). Later work proposes G-DAUGc, which generates synthetic examples using pretrained transformer language models and selects the most informative and diverse set for augmentation. Gao et al. (2019) advocate retaining the full distribution through "soft" augmented examples, showing gains on machine translation. Nie et al. (2020) augment word representations with a context-sensitive attention-based mixture of their semantic neighbors from a pretrained embedding space, and show its effectiveness for NER on social media text. Inspired by denoising autoencoders, Ng et al. (2020) use a corrupt-and-reconstruct approach, with the corruption function q(x′|x) masking an arbitrary number of word positions and the reconstruction function r(x|x′) unmasking them using BERT (Devlin et al., 2019). Their approach works well on domain-shifted test sets across 9 datasets on sentiment, NLI, and NMT. Feng et al. (2019) propose a task called SEMANTIC TEXT EXCHANGE (STE), which involves adjusting the overall semantics of a text to fit the context of a new word/phrase that is inserted, called the replacement entity (RE). They do so using a system called SMERTI and a masked LM approach. While not proposed directly for DA, it can be used as such, as investigated in Feng et al. (2020).
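A rough sketch of this corrupt-and-reconstruct idea is shown below. It assumes the Hugging Face transformers library and masks one word position at a time, which is a simplification of the cited approach; the model choice and function names are ours.

```python
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # illustrative checkpoint

def corrupt_and_reconstruct(sentence: str, p: float = 0.15) -> str:
    """Mask a random subset of word positions (corruption q) and let a
    masked LM fill each one back in (reconstruction r), one at a time."""
    tokens = sentence.split()
    out = list(tokens)
    for i in range(len(tokens)):
        if random.random() < p:
            masked = " ".join(out[:i] + [fill_mask.tokenizer.mask_token] + out[i + 1:])
            out[i] = fill_mask(masked)[0]["token_str"]  # take the top prediction
    return " ".join(out)

print(corrupt_and_reconstruct("the plot was predictable but the acting was great"))
```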
Rather than starting from an existing example and modifying it, some model-based DA approaches directly estimate a generative process from the training set and sample from it (e.g., Anaby-Tavor et al., 2020).

Applications
In this section, we discuss several DA methods for some common NLP applications.

Low-Resource Languages
Low-resource languages are an important and challenging application for DA, typically for neural machine translation (NMT). Techniques using external knowledge such as WordNet (Miller, 1995) may be difficult to use effectively here. There are ways to leverage high-resource languages for low-resource languages, particularly if they have similar linguistic properties; Xia et al. (2019), for example, exploit such relationships to augment parallel data for low-resource NMT.

Fixing Class Imbalance

Techniques such as EDA (Wei and Zou, 2019) can possibly be used to oversample underrepresented classes.

Few-Shot Learning
DA methods can ease few-shot learning by adding more examples for novel classes introduced in the few-shot phase. Hariharan and Girshick (2017) use learned analogy transformations φ(z1, z2, x) between example pairs z1 → z2 from a non-novel class to generate augmented examples x → x′ for novel classes. Schwartz et al. (2018) generalize this beyond just linear offsets through their "∆-network" autoencoder, which learns the distribution P(z2 | z1, C) from all pairs (z1, z2) with y*(z1) = y*(z2) = C, where C is a class and y* is the ground-truth labelling function. Both these methods are applied only on image tasks, but their theoretical formulations are generally applicable, and hence we discuss them. Kumar et al. (2019b) apply these and other DA methods for few-shot learning of novel intent classes in task-oriented dialog. Wei et al. (2021a) show that data augmentation facilitates curriculum learning for training triplet networks for few-shot text classification. Lee et al. (2021) use T5 to generate additional examples for data-scarce classes.
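For intuition, the simplest linear-offset instance of such a feature-space analogy transformation can be sketched as follows (a simplified illustration of ours; the cited methods learn the transformation φ rather than assume a fixed offset):

```python
import numpy as np

def analogy_offset_augment(x_novel, z1, z2):
    """Apply the intra-class variation observed between two examples of a
    known class (z1 -> z2) to a novel-class example in feature space,
    i.e., x' = x + (z2 - z1)."""
    return x_novel + (z2 - z1)

# Toy usage with 128-dimensional feature vectors.
z1, z2 = np.random.randn(128), np.random.randn(128)
x_novel = np.random.randn(128)
x_augmented = analogy_offset_augment(x_novel, z1, z2)
```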

Adversarial Examples (AVEs)
Adversarial examples can be generated using innocuous label-preserving transformations (e.g. paraphrasing) that fool state-of-the-art NLP models, as shown in Jia et al. (2019). Specifically, they add sentences with distractor spans to passages to construct AVEs for span-based QA. Zhang et al. (2019d) construct AVEs for paraphrase detection using word swapping. Kang et al. (2018) and Glockner et al. (2018) create AVEs for textual entailment using WordNet relations.

Tasks
In this section, we discuss several DA works for common NLP tasks. We focus on non-classification tasks, as classification is worked on by default and well covered in earlier sections (e.g. §3 and §4).

Question Answering

One proposed approach substitutes a portion of the input text with its translation in another language, improving performance across multiple languages on NLI tasks as well as the SQuAD QA task. Asai and Hajishirzi (2020) use logical and linguistic knowledge to generate additional training data to improve the accuracy and consistency of QA responses by models. Yu et al. (2018) introduce a new QA architecture called QANet that shows improved performance on SQuAD when combined with augmented data generated using backtranslation.

Sequence Tagging Tasks

Dai and Adel (2020) modify DA techniques proposed for sentence-level tasks for named entity recognition (NER), including label-wise token and synonym replacement, and show improved performance using both recurrent and transformer models. Zhang et al. (2020) propose a DA method based on MIXUP called SEQMIX for active sequence labeling by augmenting queried samples, showing improvements on NER and Event Detection.
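As an illustration of label-wise token replacement for sequence tagging, the sketch below swaps tokens for other training tokens carrying the same tag. This is a simplified rendering of ours; the function names and replacement probability are not from the cited paper.

```python
import random
from collections import defaultdict

def label_wise_token_replacement(tokens, labels, train_pairs, p=0.3, rng=random):
    """Replace each token, with probability p, by another training token that
    carries the same label, so that the tag sequence remains valid.
    `train_pairs` is an iterable of (token, label) pairs from the training set."""
    pool = defaultdict(list)
    for tok, lab in train_pairs:
        pool[lab].append(tok)
    out = []
    for tok, lab in zip(tokens, labels):
        if pool[lab] and rng.random() < p:
            out.append(rng.choice(pool[lab]))  # swap in a token with the same tag
        else:
            out.append(tok)
    return out

train = [("Paris", "B-LOC"), ("London", "B-LOC"), ("visited", "O"), ("Alice", "B-PER")]
print(label_wise_token_replacement(["Alice", "visited", "Paris"],
                                   ["B-PER", "O", "B-LOC"], train))
```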

Parsing Tasks
Jia and Liang (2016) propose DATA RECOMBINATION for injecting task-specific priors into neural semantic parsers. A synchronous context-free grammar (SCFG) is induced from training data, and new "recombinant" examples are sampled. GRAPPA, a pretraining approach for table semantic parsing, similarly generates synthetic question-SQL pairs via an SCFG. Andreas (2020) use compositionality to construct synthetic examples for downstream tasks like semantic parsing.

Grammatical Error Correction (GEC)
Lack of parallel data is typically a barrier for GEC. Various works have thus looked at DA methods for GEC. We discuss some here, and more can be found in Table 2 in Appendix C.
There is work that makes use of additional resources. Boyd (2018) extract German edits from Wikipedia revision history, using those relating to GEC as augmented training data. Zhang et al. (2019b) explore multi-task transfer, or the use of annotated data from other tasks.
There is also work that adds synthetic errors to noise the text.

Neural Machine Translation (NMT)
There are many works which have investigated DA for NMT. We highlighted some in §3 and §4.1, e.g. (Sennrich et al., 2016; Fadaee et al., 2017; Xia et al., 2019). We discuss some further ones here, and more can be found in Table 3 in Appendix C. Wang et al. (2018a) propose SWITCHOUT, a DA method that randomly replaces words in both source and target sentences with other random words from their corresponding vocabularies. Gao et al. (2019) introduce SOFT CONTEXTUAL DA that softly augments randomly chosen words in a sentence using a contextual mixture of multiple related words over the vocabulary. Nguyen et al. (2020) propose DATA DIVERSIFICATION which merges original training data with the predictions of several forward and backward models.
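As an example, a SWITCHOUT-like corruption can be sketched as below. This is a simplification of ours: each token is replaced independently with a fixed probability, whereas the original method samples the number of replacements from a temperature-controlled distribution.

```python
import random

def switchout_like(src_tokens, tgt_tokens, src_vocab, tgt_vocab, tau=0.1, rng=random):
    """Independently replace source and target tokens with uniformly sampled
    vocabulary words, each with probability tau."""
    def corrupt(tokens, vocab):
        return [rng.choice(vocab) if rng.random() < tau else t for t in tokens]
    return corrupt(src_tokens, src_vocab), corrupt(tgt_tokens, tgt_vocab)

src, tgt = "the cat sat".split(), "die katze sass".split()
aug_src, aug_tgt = switchout_like(src, tgt,
                                  src_vocab=["dog", "mat", "the"],
                                  tgt_vocab=["hund", "matte", "die"])
```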

Data-to-Text NLG
Data-to-text NLG refers to tasks which require generating natural language descriptions of structured or semi-structured data inputs, e.g. game score tables (Wiseman et al., 2017). Randomly perturbing game score values without invalidating the overall game outcome is one DA strategy explored in game summary generation.

Open-Ended & Conditional Generation
There has been limited work on DA for open-ended and conditional text generation. Feng et al. (2020) experiment with a suite of DA methods for finetuning GPT-2 on a low-resource domain in attempts to improve the quality of generated continuations, which they call GENAUG. They find that WN-HYPERS (WordNet hypernym replacement of keywords) and SYNTHETIC NOISE (randomly perturbing non-terminal characters in words) are useful, and the quality of generated text improves to a peak at ≈ 3x the original amount of training data.

Dialogue
Most DA approaches for dialogue focus on task-oriented dialogue. We outline some below, and more can be found in Table 4 in Appendix C.
Quan and Xiong (2019) present sentence- and word-level DA approaches for end-to-end task-oriented dialogue. Louvan and Magnini (2020) propose LIGHTWEIGHT AUGMENTATION, a set of word-span and sentence-level DA methods for low-resource slot filling and intent classification.
Hou et al. (2018) present a seq2seq DA framework to augment dialogue utterances for dialogue language understanding (Young et al., 2013), including a diversity rank to produce diverse utterances. Zhang et al. (2019c) propose MADA to generate diverse responses using the property that several valid responses exist for a dialogue context.
There is also DA work for spoken dialogue.

Multimodal Tasks
DA techniques have also been proposed for multimodal tasks where aligned data for multiple modalities is required. We look at ones that involve language or text. Some are discussed below, and more can be found in Table 5 in Appendix C.
Beginning with speech, Wang et al. (2020) propose a DA method to improve the robustness of downstream dialogue models to speech recognition errors. Wiesner et al. (2018) and Renduchintala et al. (2018) propose DA methods for end-to-end automatic speech recognition (ASR).
Looking at images or video, Xu et al. (2020) learn a cross-modality matching network to produce synthetic image-text pairs for multimodal classifiers. Atliha and Šešok (2020) explore DA methods such as synonym replacement and contextualized word embeddings augmentation using BERT for image captioning.

Challenges & Future Directions
Looking forward, data augmentation faces substantial challenges, specifically for NLP, and with these challenges, new opportunities for future work arise.
Dissonance between empirical novelties and theoretical narrative: There appears to be a conspicuous lack of research on why DA works. Most studies show empirically that a DA technique works and provide some intuition, but it is currently challenging to measure the goodness of a technique without resorting to a full-scale experiment. A recent work in vision (Gontijo-Lopes et al., 2020) has proposed that affinity (the distributional shift caused by DA) and diversity (the complexity of the augmentation) can predict DA performance, but it is unclear how these results might translate to NLP.
Minimal benefit for pretrained models on in-domain data: With the popularization of large pretrained language models, it has come to light that a couple of previously effective DA techniques for certain English text classification tasks (Wei and Zou, 2019; Sennrich et al., 2016) provide little benefit for models like BERT and RoBERTa, which already achieve high performance on in-domain text classification (Longpre et al., 2020). One hypothesis is that using simple DA techniques provides little benefit when finetuning large pretrained transformers on tasks for which examples are well-represented in the pretraining data, but DA methods could still be effective when finetuning on tasks for which examples are scarce or out-of-domain compared with the training data. Further work could study under which scenarios data augmentation for large pretrained models is likely to be effective.
Multimodal challenges: While there has been increased work in multimodal DA, as discussed in §5.10, effective DA methods that jointly exploit multiple modalities have remained challenging to develop. Many works focus on augmenting a single modality or multiple ones separately. For example, there is potential to further explore simultaneous image and text augmentation for image captioning, such as a combination of CUTMIX (Yun et al., 2019) and caption editing.
Span-based tasks offer unique DA challenges as there are typically many correlated classification decisions. For example, random token replacement may be a locally acceptable DA method but possibly disrupt coreference chains for later sentences. DA techniques here must take into account dependencies between different locations in the text.
Working in specialized domains such as those with domain-specific vocabulary and jargon (e.g. medicine) can present challenges. Many pretrained models and external knowledge (e.g. WordNet) cannot be effectively used. Studies have shown that DA becomes less beneficial when applied to out-of-domain data, likely because the distribution of augmented data can substantially differ from the original data (Zhang et al., 2019a; Herzig et al., 2020; Campagna et al., 2020; Zhong et al., 2020).
Working with low-resource languages may present similar difficulties as specialized domains. Further, DA techniques successful in the high-resource scenario may not be effective for low-resource languages that are of a different language family or very distinctive in linguistic and typological terms, for example, those which are language isolates or lack high-resource cognates.
More vision-inspired techniques: Although many NLP DA methods have been inspired by analogous approaches in CV, there is potential for drawing further connections. Many CV DA techniques motivated by real-world invariances (e.g. many angles of looking at the same object) may have similar NLP interpretations. For instance, grayscaling could translate to toning down aspects of the text (e.g. plural to singular, "awesome" → "good"). Morphing a dependency tree could be analogous to rotating an image, and paraphrasing techniques may be analogous to changing perspective. For example, negative data augmentation (NDA) (Sinha et al., 2021) involves creating out-of-distribution samples. It has so far been exclusively explored for CV, but could be investigated for text.
Self-supervised learning: More recently, DA has been increasingly used as a key component of self-supervised learning, particularly in vision (Chen et al., 2020e). In NLP, BART (Lewis et al., 2020) showed that predicting deleted tokens as a pretraining task can achieve similar performance as the masked LM, and ELECTRA (Clark et al., 2020) found that pretraining by predicting corrupted tokens outperforms BERT given the same model size, data, and compute. We expect future work will continue exploring how to effectively manipulate text for both pretraining and downstream tasks.
Offline versus online data augmentation: In CV, standard techniques such as cropping and rotations are typically done stochastically, allowing for DA to be incorporated elegantly into the training pipeline. In NLP, however, it is unclear how to include a lightweight code module to apply DA stochastically. This is because DA techniques for NLP often leverage external resources (e.g. a dictionary for token substitution or a translation model for backtranslation) that are not easily transferable across training pipelines. Thus, a common practice for DA in NLP is to generate augmented data offline and store it as additional data to be loaded during training. Future work on a lightweight module for online DA in NLP could be fruitful, though another challenge will be determining when such a module will be helpful, which (compared with CV, where the invariances being imposed are well-accepted) can vary substantially across NLP tasks.
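To illustrate what such a lightweight online module could look like, here is a hypothetical sketch assuming PyTorch: a dataset wrapper that applies a stochastic augmentation function each time an example is accessed (all names are ours).

```python
import random
from torch.utils.data import Dataset

class OnlineAugmentedDataset(Dataset):
    """Wraps a list of (text, label) pairs and applies a stochastic text
    augmentation on the fly, mirroring how CV pipelines apply random transforms."""
    def __init__(self, examples, augment_fn, p=0.5):
        self.examples = examples        # list of (text, label) pairs
        self.augment_fn = augment_fn    # e.g., an EDA-style perturbation or backtranslation
        self.p = p                      # probability of augmenting a given access

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        text, label = self.examples[idx]
        if random.random() < self.p:
            text = self.augment_fn(text)  # augmentation re-sampled at every access
        return text, label
```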
Lack of unification is a challenge for the current literature on data augmentation for NLP, and popular methods are often presented in an auxiliary fashion. Whereas there are well-accepted frameworks for DA for CV (e.g. default augmentation libraries in PyTorch, RandAugment (Cubuk et al., 2020)), there are no such "generalized" DA techniques for NLP. Further, we believe that DA research would benefit from the establishment of standard and unified benchmark tasks and datasets to compare different augmentation methods.
Good data augmentation practices would help make DA work more accessible and reproducible to the NLP and ML communities. On top of unified benchmark tasks, datasets, and frameworks/libraries mentioned above, other good practices include making code and augmented datasets publicly available, reporting variation among results (e.g. standard deviation across random seeds), and more standardized evaluation procedures. Further, transparent hyperparameter analysis, explicitly stating failure cases of proposed techniques, and discussion of the intuition and theory behind them would further improve the transparency and interpretability of DA techniques.

Conclusion
In this paper, we presented a comprehensive and structured survey of data augmentation for natural language processing (NLP). We provided a background about data augmentation and how it works, discussed major methodologically representative data augmentation techniques for NLP, and touched upon data augmentation techniques for popular NLP applications and tasks. Finally, we outlined current challenges and directions for future research, and showed that there is much room for further exploration. Overall, we hope our paper can serve as a guide for NLP researchers to decide on which data augmentation techniques to use, and inspire additional interest and work in this area. Please see the corresponding GitHub repository at https://github.com/styfeng/DataAug4NLP.