Good-Enough Example Extrapolation

This paper asks whether extrapolating the hidden space distribution of text examples from one class onto another is a valid inductive bias for data augmentation. To operationalize this question, I propose a simple data augmentation protocol called “good-enough example extrapolation” (GE3). GE3 is lightweight and has no hyperparameters. Applied to three text classification datasets under various data imbalance scenarios, GE3 improves performance more than upsampling and other hidden-space data augmentation methods.


Introduction
Text classification is a fundamental task in NLP for which modern architectures have achieved high performance when training data is sufficient (Wang et al., 2019). In many applied settings where data collection and annotation is limited, however, a common challenge is data imbalance (Krawczyk, 2016), in which training data from certain categories is scarce. A classic example of such a scenario is intent classification, where developers may wish to update a conversational agent to classify new intents, but the amount of training data for these new intents lags behind that of existing ones (Bapna et al., 2017; Gaddy et al., 2020).
One common method for mitigating the weaknesses of limited training data is data augmentation, a paradigm that has become increasingly popular in the NLP landscape (see Feng et al., 2021, for a survey). While data augmentation may be of only incremental utility when training data is sufficient (Longpre et al., 2020), it can be particularly helpful for mitigating data scarcity in low-resource settings such as few-shot classification (Wei et al., 2021) or, as this paper will soon explore, data-imbalanced text classification. In this paper, I propose a simple data augmentation protocol called good-enough example extrapolation (GE3) for the class-imbalanced scenario. As shown in Figure 1, GE3 extrapolates the hidden-space distribution of examples from one class onto another class. GE3 has no hyperparameters, is model-agnostic, and requires little computational overhead, making it easy to use. In empirical experiments, I apply GE3 to intent classification, newspaper headline classification, and relation classification, finding that in a variety of class-imbalanced scenarios, GE3 substantially outperforms upsampling and other hidden-state augmentation techniques.

Hidden Space Extrapolation
Intuition. Representation learning aims to map inputs into a hidden space such that desired properties of inputs are easily extractable from their continuous representations (Pennington et al., 2014). For many representation learning functions, inputs with similar properties map to nearby points in hidden space, and the distances between hidden space representations capture meaningful relationships (Mikolov et al., 2013; Kumar et al., 2020).
For a given classification task, then, inputs from the same category form some distribution (i.e., a cluster) in hidden space: the distance between distributions reflects the relationship between categories, and the spread of points within a category models some random variable. GE3 leverages this intuition by extrapolating the within-class distribution of points from one category onto another. Figure 2 illustrates this intuition with a hypothetical instance from the HuffPost dataset of news headline classification, where the hidden-space relationship between two examples in the travel category is extrapolated to form a new example in the health category.

Method. Formally, I describe GE3 as follows. Given a text classification task with $k$ output classes $\{c_1, \ldots, c_k\}$, let $\bar{x}_c$ indicate the mean of all hidden space representations in class $c$. GE3 generates augmented examples by extrapolating the data distribution from a source class $c_s$ onto a target class $c_t$ in hidden space. For each hidden space representation $x_i^{c_s}$ in the source class, I generate a corresponding augmented example $\hat{x}_i^{c_t}$ in the target class:

$\hat{x}_i^{c_t} = x_i^{c_s} - \bar{x}_{c_s} + \bar{x}_{c_t}$

In total, for each class in the training set, I can generate a set of extrapolated points from every other class, augmenting the size of the original training set by a factor of $k$. I then train the classification model on the union of the original data and the extrapolated examples. Notably, this extrapolation method operates without any hyperparameters, as augmented examples are generated via distributions from other classes rather than a noising function (cf. augmentation techniques that typically have a strength parameter (Sennrich et al., 2016b; Wei and Zou, 2019)).
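As a concrete illustration, the extrapolation step can be sketched in a few lines of NumPy (the function and variable names here are my own, not from the paper):

```python
import numpy as np

def ge3_extrapolate(source_reps, source_mean, target_mean):
    """Extrapolate source-class hidden representations onto a target class.

    Each augmented point is x - mean(source) + mean(target): the source
    class's distribution around its mean is re-centered on the target mean.
    """
    return source_reps - source_mean + target_mean

# Toy example: 4 hidden vectors from a source class, extrapolated onto a target.
rng = np.random.default_rng(0)
source_reps = rng.normal(loc=0.0, scale=1.0, size=(4, 8))  # class c_s
source_mean = source_reps.mean(axis=0)
target_mean = np.full(8, 5.0)                              # mean of class c_t

augmented = ge3_extrapolate(source_reps, source_mean, target_mean)
# The augmented points keep the source class's spread but are centered
# on the target-class mean.
assert np.allclose(augmented.mean(axis=0), target_mean)
```

Note that the operation has no free parameters: the "strength" of the augmentation is determined entirely by the empirical spread of the source class.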

Experimental Setup
I evaluate the proposed hidden space extrapolation protocol in several data imbalance scenarios on three diverse text classification datasets.

Datasets
SNIPS. The Snips Voice Platform dataset (Coucke et al., 2018) is an intent classification dataset that maps utterances to 7 different intents (e.g., 'play music', 'get weather', etc.).

HUFF. The HuffPost dataset of news headlines maps headlines to topical categories (e.g., 'travel', 'health', etc.).

FEWREL. The few-shot relation classification dataset (Han et al., 2018) contains categorized relationships between specified tokens (e.g., 'capital of', 'birth name', etc.). The posted training set contains 64 classes, and I perform a train-test split such that each class has 500 examples in the training set and 100 examples in the evaluation set.

For all three datasets, I create artificially imbalanced datasets via random sampling. Specifically, I randomly select half the classes to maintain the original number of examples N_many (i.e., N_many = {1800, 700, 500} for SNIPS, HUFF, and FEWREL, respectively), and for the other half of the classes, I train on only a subset of N_few examples. I run experiments on a range of N_few values.
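The imbalance construction described above can be sketched as follows (the function name and toy data are my own illustration, not the paper's code):

```python
import random
from collections import defaultdict

def make_imbalanced(examples, n_many, n_few, seed=0):
    """Build an artificially imbalanced training set.

    `examples` is a list of (text, label) pairs. Half the labels, chosen at
    random, keep n_many examples; the other half are subsampled to n_few.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    labels = sorted(by_label)
    rng.shuffle(labels)
    many = set(labels[: len(labels) // 2])  # classes that keep full data

    train = []
    for label, items in by_label.items():
        k = n_many if label in many else n_few
        train.extend(rng.sample(items, min(k, len(items))))
    return train

# Toy check: 4 classes with 10 examples each; 2 keep 10, 2 are cut to 2.
data = [(f"text-{c}-{i}", c) for c in "abcd" for i in range(10)]
train = make_imbalanced(data, n_many=10, n_few=2)
assert len(train) == 2 * 10 + 2 * 2  # 24 examples total
```

Varying `n_few` while holding `n_many` fixed reproduces the range of imbalance scenarios evaluated in the experiments.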

Hidden Space Augmentation Baselines
As baselines for comparison, I also explore several other hidden space augmentation techniques:

Example interpolation. Given the hidden space representations of two examples $x_i^c$ and $x_j^c$ in the same class, I generate an augmented example $\hat{x}^c = \lambda x_i^c + (1 - \lambda) x_j^c$, where $\lambda$ is drawn uniformly from $[0, 1]$.

Noising. Given some example $x_i^c$, I add noise $n$ (drawn from either a Gaussian or a uniform distribution) to yield an augmented example $\hat{x}^c = x_i^c + n$.

For these techniques, I generate augmented examples until each class has $n_{aug} \cdot N_{many}$ training examples, where $n_{aug} = 5$ (a choice later explored in Figure 3).

Results

Table 1 shows results for GE3 on the three datasets for N_few = 20 and N_few = 50. GE3 outperforms the upsampling baseline by an average of 4.6%, with the strongest improvements on HUFF and FEWREL. Of the other augmentation techniques, Gaussian noising and uniform noising performed best, with average improvements of 2.7% and 1.3%, respectively. Whereas these techniques only enforce smoothness around the distribution of points in a single class, I hypothesize that GE3 improves performance more because it injects a stronger inductive bias: that the distribution of examples of the same class around their mean can be extrapolated to other classes.

Moreover, as Table 1 only shows results for N_few ∈ {20, 50}, in Figure 4 I compare GE3 with upsampling, as well as Gaussian and uniform noise (the strongest baselines), for N_few ∈ {10, 20, 40, 60, 100, 200, 300, 400, 500}. GE3 improves performance across a wide range of N_few values, with improvements over the baselines diminishing slightly as the training data becomes more balanced (as expected).

Finally, one reason that GE3 improves performance more than other techniques could be that each class receives extrapolated examples from every other class; if classes have distinct distributions, these extrapolated examples are valuable as additional training data. I therefore perform an ablation study with a variable n_aug, which restricts the number of other classes from which a given class can receive extrapolated examples.
For instance, if n_aug = 2, then any given class may only receive extrapolated examples from two other random classes, even if there are 63 other classes (as is the case in FEWREL). I also perform a similar ablation for Gaussian and uniform noise, in which I generate augmented examples until each class has n_aug · N_many training examples. Figure 3 shows these results. For uniform and Gaussian noise, additional augmented examples did not further improve performance after around n_aug = 16. For GE3, on the other hand, performance continued to improve as n_aug increased (although with diminishing marginal returns). This result supports the intuition that extrapolations from more classes provide additional value during training.

Related Work
Text data augmentation. Data augmentation methods for NLP have garnered increased interest in recent years. Many common techniques modify data using either token perturbations (Zhang et al., 2015;Sennrich et al., 2016a) or language models (Sennrich et al., 2016b;Kobayashi, 2018;Ross et al., 2021). These techniques occur at the input-level, where all augmented data is represented by discrete tokens in natural language.
Hidden space augmentation. A growing direction in data augmentation proposes to augment data in hidden space instead of at the input level. In computer vision, DeVries and Taylor (2017) explored noising, interpolation, and extrapolation in hidden space, and MIXUP (Zhang et al., 2018) interpolated both inputs and labels. In NLP, Ex2 (Lee et al., 2021) fine-tunes T5 (Raffel et al., 2020) to generate additional examples of a class, given some examples of that class as an input sequence. Because GE3 operates in hidden space, it is simpler and more computationally accessible than fine-tuning T5 for each classification task.

Discussion
The motivation for this work emerged from a mixture of failed experiments (I tried to devise an algorithm to select better augmented sentences in hidden space) and an admiration for the elegance of Ex2 (Lee et al., 2021). In hindsight, it would have been helpful to compare the performance of these two techniques in the same setting (notably, whereas I artificially restrict the sample size for certain classes in this paper, Ex2 uses the original data distributions of the datasets, a harder setting in which to show improvements from data augmentation).
I would be remiss not to mention at least one weakness that I see in my own work. There has been an influx of recent work proposing various augmentation techniques for different NLP tasks, and due to the lack of standardized evaluation datasets and models, many papers do not perform a full comparison with respect to relevant baselines. This paper circumvents comparing with many data augmentation baselines (e.g., Chawla et al. (2002)) by focusing on the question of whether hidden-space example extrapolation is a valid inductive bias (and not whether it is the best augmentation technique). Hence, although I find example extrapolation to be a nice idea, I should concede that the particular GE3 operationalization of example extrapolation should undergo more comprehensive comparison with baselines before I can recommend it as a go-to augmentation technique.
In summary, I have proposed a data augmentation protocol called GE3, which extrapolates the hidden space distribution of one class onto another. The empirical experiments in this paper suggest that example extrapolation in hidden space is a valid inductive bias for data augmentation. Moreover, GE3 is appealing because it has no hyperparameters, is model-agnostic, and is lightweight. If example extrapolation is an idea deserving of further exploration by our field, I hope this paper adds a leaf to the tree of knowledge in that space.