Text Augmentation in a Multi-Task View

Traditional data augmentation aims to increase coverage of the input distribution by generating augmented examples that strongly resemble original samples, typically in an online fashion in which augmented examples dominate training. In this paper, we propose an alternative perspective—a multi-task view (MTV) of data augmentation—in which the primary task trains on original examples and the auxiliary task trains on augmented examples. In MTV data augmentation, both original and augmented samples are weighted substantively during training, relaxing the constraint that augmented examples must resemble original data and thereby allowing us to apply stronger augmentation functions. In empirical experiments using four common data augmentation techniques on three benchmark text classification datasets, we find that using the MTV leads to higher and more robust performance than traditional augmentation.


Introduction
Most data augmentation techniques aim to generate augmented examples for training that are similar to original data. In computer vision, operations such as flipping, cropping, and color jittering are both widely used and highly effective: it is self-evident that augmented examples closely resemble original data, and so we generate augmented data in an online fashion during each minibatch such that no original, unmodified examples are seen during training (Krizhevsky et al., 2012; Zagoruyko and Komodakis, 2016; Huang et al., 2017).
[Figure: Traditional data augmentation vs. the multi-task view (MTV) of data augmentation. Intuition: the auxiliary task of classifying augmented examples acts as regularization for the primary task of classifying original examples. Guideline: augmented samples may resemble original data, but they can be anything that boosts performance. Training: both original and augmented data receive substantive weighting during training.]

In language, on the other hand, even slight modifications can cause significant semantic changes, and so it is not always clear whether augmented examples resemble original data. Despite this uncertainty, many augmentation techniques in NLP still generate examples stochastically and ignore original data (Zhang et al., 2015; Sennrich et al., 2016a; Xie et al., 2017; Kobayashi, 2018; Wang et al., 2018). When it is unclear whether augmented examples resemble original data, as is often the case, is it wise to neglect the original training data? Our paper questions this practice by proposing to include original data during training. Specifically, we make two contributions:

1. We propose a multi-task view of data augmentation (MTV data augmentation), which trains on both original and augmented examples and therefore allows us to relax the constraint that augmented examples must resemble original data. The MTV facilitates augmentation using a higher strength parameter.
2. We show empirically that four common data augmentation techniques provide higher and more robust performance gains using the MTV compared with traditional augmentation.
Traditional Data Augmentation

In standard supervised learning, we minimize the expected training cost

J(θ) = E_{(x,y)∼p̂(X,Y)}[− log p_θ(y | x)],

where p̂(X, Y) is the empirical distribution of training pairs (x, y) and p_θ(y | x) is the parameterized model that we aim to learn (e.g., a neural network). As p̂(X, Y) is typically the observed data, it will likely have some mismatch with the true data distribution p(X, Y). When the mismatch is dramatic (for instance, when p̂(X, Y) does not sufficiently cover the training space), model performance will likely suffer.
Remedy. In practice, we often use data augmentation to mitigate the inadequacy of p̂(X, Y) by providing additional training data. We generate an augmented distribution q(X̂, Ŷ) and now minimize the cost on this augmented training set:

J_aug(θ) = E_{(x̂,ŷ)∼q(X̂,Ŷ)}[− log p_θ(ŷ | x̂)].

As we now optimize solely on q(X̂, Ŷ), our goal is to find (x̂, ŷ) pairs that are likely to fall in the true distribution p. Assuming the smoothness of p, similar (x, y) pairs will have similar probabilities, and therefore an augmented example that is more similar to an observed example is more likely to be sampled under the true distribution. In other words, good augmented examples resemble the observed data, and we aim to find them. Conversely, if an augmented example diverges too far from any observed data, it is likely invalid and thus harmful for training; we do not want to train on such examples. The majority of prior work follows this framework of augmented examples resembling real data. Among popular techniques, semantic noising substitutes tokens with synonyms (Wang and Yang, 2015; Zhang et al., 2015); pervasive dropout randomly removes words from the input sequence (Sennrich et al., 2016a); and SwitchOut (for machine translation) replaces some words in both source and target sentences with other words from their corresponding vocabularies (Wang et al., 2018).
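This online-only practice can be sketched in a few lines of Python. The function names and the dropout-style augmentation below are our illustrative stand-ins (in the spirit of pervasive dropout), not code from any of the cited works:

```python
import random

def dropout_augment(text, alpha, rng):
    # Hypothetical stand-in for any augmentation function: drop each
    # token with probability alpha (in the spirit of pervasive dropout).
    tokens = text.split()
    kept = [t for t in tokens if rng.random() >= alpha]
    return " ".join(kept or tokens[:1])  # never return an empty example

def traditional_epoch(dataset, alpha, seed=0):
    # Traditional augmentation: every example is augmented online each
    # epoch, so the model never trains on an unmodified original.
    rng = random.Random(seed)
    return [(dropout_augment(x, alpha, rng), y) for x, y in dataset]
```

Note that nothing in `traditional_epoch` ever yields an original example; this is exactly the assumption the MTV relaxes.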
Moreover, most of these techniques perform augmentation on every training example in an online fashion, implicitly assuming that augmented examples so closely resemble original data that directly training on original examples is not even worth considering.¹ As we shall see in the next section, adding in these original examples during training might actually be a worthwhile idea.

¹ We closely follow the intuition and notation of Wang et al. (2018).

MTV Data Augmentation
Multi-task optimization jointly trains on a primary task and one or more auxiliary tasks; the intuition is that requiring an algorithm to also learn an auxiliary task can act as better regularization than penalizing all complexity uniformly. Prior work has found that multi-task models work particularly well when the tasks are similar, but can also improve performance even on unrelated tasks (Paredes et al., 2012; Hajiramezanali et al., 2018).
We propose a multi-task view of data augmentation with a primary task that performs regular training on original examples and an auxiliary task that trains on augmented data. The MTV jointly optimizes the primary and auxiliary task(s) using a weighted cost function so that both original and augmented data receive substantial weight during training:

J_MTV(θ) = γ_O · E_{(x,y)∼p̂(X,Y)}[− log p_θ(y | x)] + γ_aug · E_{(x̂,ŷ)∼q(X̂,Ŷ)}[− log p_θ(ŷ | x̂)],

where γ_O is the weight of original data, γ_aug is the weight of augmented data, and γ_O + γ_aug = 1. In this notation, vanilla training uses γ_O = 1 and γ_aug = 0, and traditional data augmentation uses γ_O = 0 and γ_aug = 1.
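As a toy sketch of this weighted objective, the function below combines the empirical losses on an original and an augmented minibatch using γ_O and γ_aug = 1 − γ_O. It operates on the probabilities a model assigns to the correct labels; the function names are ours, not from any particular library:

```python
import math

def nll(correct_label_probs):
    # mean negative log-likelihood over a (mini)batch
    return -sum(math.log(p) for p in correct_label_probs) / len(correct_label_probs)

def mtv_loss(orig_probs, aug_probs, gamma_o=0.5):
    # J_MTV = gamma_O * J(original) + gamma_aug * J(augmented), with
    # gamma_O + gamma_aug = 1. gamma_o=1 recovers vanilla training;
    # gamma_o=0 recovers traditional data augmentation.
    gamma_aug = 1.0 - gamma_o
    return gamma_o * nll(orig_probs) + gamma_aug * nll(aug_probs)
```

With γ_O = 0.5, a batch on which the model is confident for original examples but not for augmented ones still incurs substantial loss, which is how the auxiliary task regularizes the primary one.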
The MTV gives us an important freedom that is not offered by the traditional data augmentation framework. Since traditional data augmentation only trains on augmented examples, performance suffers when augmented data differ too much from the true distribution; therefore, most studies aim to generate augmented examples that resemble original data. MTV data augmentation, however, jointly trains on both original and augmented data, thereby allowing us to relax the constraint that original and augmented examples come from the same distribution. In fact, accepting that the original and augmented distributions might differ or could even be unrelated—as work in multi-task learning has done (Paredes et al., 2012; Hajiramezanali et al., 2018; Rai and Daumé, 2010)—liberates us to apply stronger levels of data augmentation, which, as we will demonstrate in the next section, leads to higher and more robust performance.

Experiments
This section compares multi-task view augmentation to traditional augmentation for various datasets and augmentation techniques.

Models and Experimental Procedures.
For text classification, we use BERT (Devlin et al., 2019) (bert-base-uncased from HuggingFace) to extract features by averaging the last hidden states of the input tokens. To reduce the number of model hyperparameters and save computation time, we classify these features using a linear SVM trained for 1000 epochs.² Since training data size depends on the amount of augmented data, we adjust the number of training epochs so that all models receive the same number of updates. All experiments are run with five random seeds. Our baseline models without data augmentation achieve 84.5%, 93.1%, and 83.9% accuracy on the SST-2, SUBJ, and TREC tasks, respectively.
Augmentation Techniques. In this paper, we experiment with four simple and common data augmentation techniques studied in Wei and Zou (2019): token substitution, pervasive dropout, token injection, and positional shuffling, each controlled by a strength parameter α.

Table 2 summarizes results for data augmentation in the MTV using γ_O = γ_aug = 0.5 compared with traditional augmentation for the best-performing augmentation strength from α ∈ {0.05, 0.1, 0.2, 0.3, 0.4, 0.5}. In the traditional framework, pervasive dropout had the strongest performance boost of 1.8%, using α = 0.1. The MTV, however, allowed for stronger augmentation (i.e., α ≥ 0.3) that resulted in all four techniques achieving boosts of more than 2.0%. Perhaps strikingly, token injection and positional shuffling, which are less intuitive and less commonly used than token substitution and pervasive dropout, achieve the strongest gains (> 1.0%) from using the MTV. One potential reason is that, compared with token substitution and pervasive dropout, token injection and positional shuffling are non-destructive in that they do not remove any of the original words, and so the nature of examples augmented at high α could be more conducive to the MTV.
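For concreteness, here is one plausible operationalization of the four techniques. The exact procedures below are our illustrative reading (in the style of Wei and Zou, 2019), not the paper's released code; each function takes a token list and a strength parameter alpha:

```python
import random

def token_substitution(tokens, alpha, synonyms, rng):
    # replace a token with a random synonym with probability alpha
    return [rng.choice(synonyms[t]) if t in synonyms and rng.random() < alpha else t
            for t in tokens]

def pervasive_dropout(tokens, alpha, rng):
    # remove each token with probability alpha (keep at least one token)
    kept = [t for t in tokens if rng.random() >= alpha]
    return kept or tokens[:1]

def token_injection(tokens, alpha, vocab, rng):
    # insert roughly alpha * len(tokens) random vocabulary words
    out = list(tokens)
    for _ in range(max(1, int(alpha * len(tokens)))):
        out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
    return out

def positional_shuffle(tokens, alpha, rng):
    # swap roughly alpha * len(tokens) random pairs of positions
    out = list(tokens)
    for _ in range(max(1, int(alpha * len(tokens)))):
        i, j = rng.randrange(len(out)), rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out
```

Note that token injection and positional shuffling never delete a word, which matches the "non-destructive" distinction drawn above.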

More-robust gains at high α
When using data augmentation with high α, high levels of noising are employed, and augmented data are therefore more likely to diverge from their original examples. Figure 1 takes a closer look at how performance varies with α. In the traditional framework, improvements are largest when the augmentation strength α is small, and performance deteriorates for large α. The MTV, on the other hand, jointly optimizes for both original and augmented data, leveraging higher α to provide higher and more robust performance gains.

Choosing γ O and γ aug weighting
Our experiments so far have used the MTV with balanced weighting of original and augmented data (γ_O = γ_aug = 0.5); in this section we explore different weightings of γ_O and γ_aug. Figure 2 shows these results averaged over all three datasets and all four augmentation techniques. Traditional data augmentation, which uses modest augmentation strength (e.g., α ∈ {0.05, 0.1}) and does not train on original data (γ_O = 0.0), achieves reasonable performance gains. As expected, when stronger augmentation is applied (e.g., α ≥ 0.4), training with only augmented data hurts performance. When training on both augmented and original data, however, performance improves with stronger augmentation and remains robust for varying augmentation strengths 0.2 ≤ α ≤ 0.5 and original data weights 0.3 ≤ γ_O ≤ 0.7.
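The weighting sweep described above amounts to a one-dimensional grid search over γ_O. A minimal sketch, assuming a hypothetical `evaluate` callable that trains with the given γ_O (and γ_aug = 1 − γ_O) and returns dev-set accuracy:

```python
def choose_gamma_o(evaluate, gammas=(0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0)):
    # Try each original-data weight and keep the one with the best dev
    # accuracy; evaluate is a hypothetical train-then-score callable.
    scores = {g: evaluate(g) for g in gammas}
    return max(scores, key=scores.get)
```

With a toy `evaluate` that peaks at balanced weighting, this returns γ_O = 0.5, mirroring the robust middle band in Figure 2.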

Further Related Work
Prior work on data augmentation, to our knowledge, generally follows the traditional data augmentation framework. In addition to the methods mentioned in §2, Xie et al. (2017) replaced words with samples from the unigram frequency distribution; Yu et al. (2018) translated English sentences to French and back to English (back-translation); and Kobayashi (2018) replaced words with other words based on a language model. All of these methods could potentially be formulated in the MTV. Some prior work has also drawn connections between data augmentation and multi-task learning. Similar to how we optimize augmented data as a separate task, Meyerson and Miikkulainen (2018) created fake tasks by using multiple distinct decoders to train a shared structure to solve the same problem in different ways. In machine translation, Sennrich et al. (2016b) used monolingual training examples as parallel examples with an empty source side, noting that their setup could be seen as multi-task learning with the tasks of translation with known sources and language modeling with unknown sources. Compared with these papers, which create multiple tasks in specialized scenarios, the multi-task view that we have presented here can be used for any type of text data augmentation.
To be clear, our study is not the first to mix original and augmented data during training. For instance, Wang and Yang (2015) use a ratio of 1:5 original to augmented examples, but this weight on original data is much smaller than the 0.3 ≤ γ_O ≤ 0.7 that we advocate. Sennrich et al. (2016b) also include original data when training with back-translation augmentation, but the ratios of original to augmented data they use appear to be dictated by the speed of their back-translation models rather than by an intentional design choice. We see our work as the first to explicitly formulate the MTV, advocate a joint optimization function, and comprehensively explore its implications for common text augmentation techniques.
As a limitation, our study has focused on label-preserving augmentation techniques, and our line of reasoning may not apply when augmentation techniques intentionally change the label. Moreover, we have only studied text classification with simple models using task-agnostic augmentation techniques. Future work in this direction could experiment with larger-scale models or study task-specific augmentation.

Conclusions
We have proposed a multi-task view that gives both original and augmented examples substantial weight during training, in contrast to prior work that performs stochastic data augmentation and ignores original training data. For four common augmentation techniques, we found experimentally that this alternative view allows for stronger levels of augmentation, which in turn leads to better and more robust performance than traditional augmentation. We hope our paper inspires future work using text data augmentation to think more explicitly about how closely augmented examples resemble original data and to consider substantive weighting of original data when using data augmentation to improve model performance.
To close, we leave the enthusiastic reader with one last thought. Most existing text data augmentation techniques have obediently followed the paradigm from computer vision of generating augmented examples that are similar to the original data. Who's to say that's how data augmentation ought to work in NLP? In this paper, we've shown how to search for relative freedom from this constraint, simply by taking a different view of the underlying assumptions. Now, a bigger question arises on the horizon-what new text augmentation techniques are unlocked when augmented data are not forced to resemble the original?