Text Augmentation Using Dataset Reconstruction for Low-Resource Classification

In the deployment of real-world text classification models, label scarcity is a common problem. As the number of classes increases, this problem becomes even more complex. One way to address it is by applying text augmentation methods, one of the more prominent of which uses the text-generation capabilities of language models. We propose Text AUgmentation by Dataset Reconstruction (TAU-DR), a novel method of data augmentation for text classification. We conduct experiments on several multi-class datasets, showing that our approach outperforms the current state-of-the-art techniques for data augmentation.


Introduction
The deployment of deep learning models in the real world requires an abundance of labels. However, labeled data is often difficult and expensive to obtain, especially when the models are deployed in highly specialized domains. Therefore, in this paper, we focus on data augmentation for text classification in low-resource environments.
Text classification (Sebastiani, 2002) is fundamental to machine learning and natural language processing. It includes various tasks, such as intent classification (Kumar et al., 2019; Rabinovich et al., 2022), which is a vital component of many automated chatbot platforms (Collinaszy et al., 2017); sentiment analysis (Tang et al., 2015); topic classification (Tong and Koller, 2001; Shnarch et al., 2022); and relation classification (Giridhara et al., 2019). The design and development of such AI applications may begin with a dataset containing only a limited amount of data.
To improve the performance of downstream models in such low-resource settings, a data augmentation mechanism is often implemented (Wong et al., 2016). To achieve this, new data are synthesized from existing training data. It has been demonstrated that the use of such mechanisms can significantly improve the performance of various neural network models. For computer vision and speech recognition, a number of well-established methods are available for synthesizing labeled data and enhancing classification accuracy. Some of the basic methods, which are also class preserving, include transformations such as cropping, padding, flipping, and shifting along time and space dimensions (Cui et al., 2015; Krizhevsky et al., 2017).
However, applying simple transformations for textual data augmentation is more challenging, since such transformations often invalidate and distort the text, producing grammatically and semantically incorrect texts that differ from the actual text distribution. Consequently, rule-based data augmentation methods for text typically involve replacing one word with a synonym, deleting a word, or changing a word (Wei and Zou, 2019; Dai and Adel, 2020).
Recent advances in text generation models (Radford et al., 2018) facilitate an innovative approach for handling scarce data situations. In an effort to reduce the cost of obtaining labeled in-domain data, Wang et al. (2021) use the self-training framework to generate pseudo-labeled training data from unlabeled in-domain data. Xu et al. (2021) have recently demonstrated the difficulty of extracting such domain-specific unlabeled data from general corpora.
A number of existing works (Ding et al., 2020; Anaby-Tavor et al., 2020; Yang et al., 2020) have overcome this difficulty by using the generation capabilities of pre-trained language models.
In this paper, we follow the latter paradigm and propose Text Augmentation by Dataset Reconstruction (TAU-DR), a novel text augmentation algorithm that generates new sentences based on the reconstruction of the original sentences from the hidden representations of a pre-trained classifier.
TAU-DR utilizes frozen auto-regressive language models via soft-prompt tuning, using a relatively small number of trainable parameters compared to the language model; unlike most existing methods that rely on language models, it does not require an additional pre-training phase. During training, we extract the hidden representation from the pre-trained classifier and use a Multi-Layer Perceptron (MLP) to turn the hidden representation into a soft prompt. The soft prompt is then fed into the frozen language model.
Our approach is motivated by the observation that if the pre-trained classifier is derived from a language model (e.g., BERT), then the hidden representation is a contextual embedding of the original sentence. Thus, the soft prompt will also summarize contextual information from a small neighborhood of the hidden representation, giving the frozen language model additional information for enriching the original dataset.
By using this training approach and manipulating the trained prompts, we are able to generate novel sentences with their corresponding pseudo-labels. Then, as in previous works (Anaby-Tavor et al., 2020; Wang et al., 2022), we apply a filtering mechanism and filter out low-quality sentences.
We conduct experiments on four multi-class datasets: TREC, ATIS, Banking77, and T-Bot (in various low-resource settings), and show that our approach consistently outperforms the current state-of-the-art approaches. We also conduct several experiments measuring the quality of the generated sentences.
Our contributions are two-fold, and can be summarized as follows:
• We propose a novel approach for data augmentation using dataset reconstruction. We demonstrate that our method achieves state-of-the-art performance on several text classification datasets.
• We suggest two novel filtering approaches for better exploitation of the generated sentences: one for cases where an evaluation set is available, and another for cases where it is absent.
The remainder of the paper is organized as follows: Section 2 introduces the problem framework and relevant studies. In Section 3, we present TAU-DR and our approach. In Section 4, we conduct the experiments. Section 5 concludes the paper and includes a discussion of future work. (Our implementation will be released after the anonymity period.)

Problem Setup and Related Work
In this section, we introduce the data augmentation setting in low-resource text classification. Let X_train = {(x_i, y_i)}_{i=1}^{N} be a text classification dataset with L classes, where x_i denotes an example and y_i its corresponding label. We assume that for each class we have m examples, where m is relatively low. As in previous works (e.g., Anaby-Tavor et al. 2020; Wang et al. 2022), we assume the existence of a validation set X_val and a test set X_test. Our goal is to create an augmented dataset X_gen from X_train, so that a classifier trained on the union of the generated and original data, X_train ∪ X_gen, improves over the same classifier trained on X_train alone. The performance of each classifier is measured on X_test.
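This evaluation protocol can be sketched schematically as follows. The classifier and accuracy function below are deliberately trivial stand-ins (a majority-class predictor), not the BERT classifier used in the paper; they only make the train-on-union comparison concrete.

```python
def evaluate_augmentation(train_classifier, accuracy, X_train, X_gen, X_test):
    """Train once on the original data and once on the union with the
    generated data, then score both classifiers on the held-out test set."""
    clf_base = train_classifier(X_train)           # trained on X_train only
    clf_aug = train_classifier(X_train + X_gen)    # trained on X_train ∪ X_gen
    return accuracy(clf_base, X_test), accuracy(clf_aug, X_test)

# Toy stand-ins: a "classifier" that always predicts the majority class.
def majority_classifier(data):
    labels = [y for _, y in data]
    return max(set(labels), key=labels.count)

def label_accuracy(clf, test_set):
    return sum(1 for _, y in test_set if y == clf) / len(test_set)

base_acc, aug_acc = evaluate_augmentation(
    majority_classifier, label_accuracy,
    X_train=[("text a", 0), ("text b", 1), ("text c", 1)],
    X_gen=[("gen d", 0), ("gen e", 0)],
    X_test=[("test f", 0), ("test g", 1)],
)
```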
The task of text augmentation is relatively challenging, since even small modifications can change the meaning and label of the text. By carefully setting up a rule-based approach, one can deal with this challenge. This was tried by Wei and Zou (2019), who proposed Easy Data Augmentation (EDA), which utilizes simple predefined rules to edit, remove, and substitute portions of the text while maintaining its meaning. Dai and Adel (2020) suggested a rule-based augmentation method named SDANER, tailored for named entity recognition.
A different line of research, which is the prominent approach, uses pre-trained language models. Wu et al. (2019) proposed Conditional BERT (CBERT) for contextual data augmentation. Given a sentence and its label, words in the sentence are masked randomly. The label is then used as context to predict substitute words while keeping the original sentence in the same class. Anaby-Tavor et al. (2020) introduced Language Model Based Data Augmentation (LAMBADA), which is also a conditional generation-based data augmentation method. LAMBADA fine-tunes an entire language model, GPT-2, by concatenating all of the sentences together with their corresponding labels, thereby creating additional textual data on which the language model can be fine-tuned. Due to the noisiness of the generation process, a filtering process is used to ensure that only high-quality sentences remain: a classifier trained on the original dataset keeps the sentences with the top-K softmax scores. Wang et al. (2022) recently suggested PromDA. This approach first trains an entire pre-trained language model on the task of converting keywords to sentences from a general corpus. Then, using RAKE (Rose et al., 2010), keywords are extracted from the original dataset. By concatenating these keywords to a learned prefix, the language model from the previous step is used to reconstruct the original sentence. The same filtering process as in LAMBADA is then used, except that all sentences for which the original classifier agrees with the pseudo-label are kept.
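LAMBADA's serialization step, concatenating each label with its sentence to form fine-tuning data, can be sketched as follows. The separator and end-of-sequence tokens here are illustrative stand-ins, not LAMBADA's exact format.

```python
def lambada_format(dataset, sep="[SEP]", eos="[EOS]"):
    """Serialize (sentence, label) pairs into single strings so that a
    language model fine-tuned on them learns to generate a sentence
    conditioned on its label."""
    return [f"{label} {sep} {text} {eos}" for text, label in dataset]

lines = lambada_format([("book a flight to boston", "flight")])
```

At generation time, prompting the fine-tuned model with `"flight [SEP]"` would then yield new candidate sentences for that class.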

Soft-Prompts
TAU-DR, as will be discussed in the next section, exploits the language-generation capabilities of language models by using soft prompts, one of the dominant approaches for parameter-efficient tuning. Prompt-based learning was introduced by Brown et al. (2020). Their study demonstrated that a large language model can be adapted to downstream tasks by carefully constructing prompts (i.e., textual instructions). A method proposed by Gao et al. (2020) simplifies the construction process by expanding prompts using pre-trained language models. However, each downstream task still requires manual construction of discrete prompts, an independent process that is difficult to optimize jointly with the downstream task.
Studies by Lester et al. (2021) and Li and Liang (2021) suggest using soft prompts. Soft prompts do not represent actual words, as opposed to hard prompts, and can be incorporated into frozen pre-trained language models. As demonstrated by Li and Liang (2021), pre-trained language models (PLMs) with soft prompts provide better performance in low-resource settings, and enable end-to-end optimization of downstream tasks.
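Mechanically, a soft prompt is just a sequence of trainable vectors prepended to the language model's input embeddings. A minimal shape-level sketch (toy sizes, plain lists standing in for tensors):

```python
def prepend_soft_prompt(prompt, token_embeddings):
    """The frozen LM consumes [soft prompt; token embeddings]: the n prompt
    vectors are the trainable part, while the LM weights never change."""
    return prompt + token_embeddings

n, d = 10, 4                                # toy prompt length and embedding size
prompt = [[0.0] * d for _ in range(n)]      # n learned d-dimensional vectors
sentence = [[1.0] * d for _ in range(7)]    # embeddings of a 7-token sentence
lm_input = prepend_soft_prompt(prompt, sentence)
```

The sequence the LM sees is simply n positions longer; only those n vectors receive gradients.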

TAU-DR consists of three stages: training, generation, and filtration, as described below.

Training
We now describe the training phase of TAU-DR, shown in Algorithm 1. Given an example x from the original dataset, we extract its hidden representation, h, from the pre-trained classifier, which we denote by C_base (line 3). For instance, if C_base is a BERT classifier, h can be the [CLS] token representation in the last layer. The next step, in line 4, applies a multi-layer perceptron (MLP) with parameters θ_MLP, turning the hidden representation h into a prompt of length n, denoted P. P is then fed into the frozen language model LM (line 5). The training objective of the language model is to reconstruct the original sentence using only the hidden representation. The training step is illustrated in Figure 1.
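The hidden-representation-to-prompt mapping can be sketched as follows. The dimensions and random weights are toy stand-ins for the real classifier representation and the trained MLP parameters; the point is the shape of the transformation, h ∈ R^8 → P ∈ R^{n×d}.

```python
import random

def matvec(W, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def mlp_to_prompt(h, W1, W2, n, d):
    """Map the classifier's hidden representation h to n soft-prompt
    vectors of size d. Only the MLP weights (W1, W2) are trainable."""
    z = [max(0.0, u) for u in matvec(W1, h)]   # ReLU hidden layer
    flat = matvec(W2, z)                       # n * d output values
    return [flat[i * d:(i + 1) * d] for i in range(n)]

rng = random.Random(0)
h = [rng.gauss(0, 1) for _ in range(8)]        # stand-in for a [CLS] vector
W1 = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
n, d = 10, 4
W2 = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(n * d)]
P = mlp_to_prompt(h, W1, W2, n, d)             # the soft prompt fed to the frozen LM
```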

Generation
To generate new sentences that will challenge the classifier and ultimately improve its accuracy, we perturb the learned soft prompts. We suggest two novel strategies for providing new soft prompts to the frozen language model.

Intra-class generation
The motivation behind the following approach is that by combining soft prompts from the same class, we can lexically and semantically enrich the class itself. The method can be described as follows: we select two sentences, x_1 and x_2, from the same class, and extract their corresponding hidden representations, h_1 and h_2, using the pre-trained classifier C_base. Using the trained MLP, we transform them into their corresponding soft prompts, P_1 and P_2. Then, by averaging the two prompts, we obtain a new aggregated soft prompt, P_agg, which is passed to the language model. The pseudo-label for the generated sentences is set to the class of x_1. This is illustrated in Figure 2.
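The combination step is a position-wise average of the two prompts. A minimal sketch (toy two-vector prompts; real prompts would come from the trained MLP):

```python
def intra_class_prompt(P1, P2):
    """Position-wise average of two soft prompts from the same class; the
    result is fed to the frozen LM and inherits that class as pseudo-label."""
    return [[(a + b) / 2 for a, b in zip(v1, v2)] for v1, v2 in zip(P1, P2)]

P1 = [[1.0, 3.0], [2.0, 0.0]]   # toy prompts: 2 vectors of dimension 2
P2 = [[3.0, 1.0], [0.0, 2.0]]
P_agg = intra_class_prompt(P1, P2)
```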
Inter-class generation

With inter-class generation, we help the classifier to better distinguish between the different classes. This is done by generating sentences using soft prompts created by combining soft prompts from two different classes. First, we randomly sample two sentences, x_1 and x_2, from two different classes and, as detailed above, extract their soft prompts, denoted P_1 and P_2, respectively. We then aggregate the two prompts by taking their weighted mean, P_agg = w·P_1 + (1 − w)·P_2, where 0 < w < 1 is sampled uniformly. In this case, we set the pseudo-label to the label of the closest prompt, as illustrated in Figure 3.
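A sketch of the weighted combination; since P_agg is a convex combination, the "closest" prompt is the one with the larger weight, which decides the pseudo-label:

```python
import random

def inter_class_prompt(P1, P2, rng=random):
    """Weighted mean of prompts from two different classes; the pseudo-label
    follows the prompt with the larger weight, i.e. the closer one."""
    w = rng.uniform(0.0, 1.0)
    P_agg = [[w * a + (1.0 - w) * b for a, b in zip(v1, v2)]
             for v1, v2 in zip(P1, P2)]
    label_index = 0 if w >= 0.5 else 1   # 0 → class of x1, 1 → class of x2
    return P_agg, label_index
```

For example, with a fixed weight w = 0.75, the pseudo-label follows x_1 and each prompt position lands three quarters of the way toward P_1.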

Dynamic Consistency Filtering
By generating new sentences for our classifier, we risk creating low-quality data. This can happen if we set an incorrect pseudo-label or if the language model generates out-of-domain examples. Therefore, it is common to apply a consistency filtering mechanism (Anaby-Tavor et al., 2020; Wang et al., 2022).
The consistency filtering suggested by Anaby-Tavor et al. (2020) uses the pre-trained classifier and keeps the top-K sentences (ordered by their softmax scores). Wang et al. (2022) also use the trained classifier; however, instead of the top-K approach, they keep all generated sentences for which the classifier agrees with the pseudo-label.
Clearly, the chosen filtration method has a large effect on the final classifier, as it controls the data quality of the final trained classifier. The top-K approach might be too conservative, keeping a large safety margin and filtering out most of the generated instances. On the other hand, keeping all instances on which the classifier agrees with the pseudo-label might include many noisy-label sentences, resulting in a degraded classifier.
We now present Dynamic Consistency Filtering, our filtering approach for the case where an evaluation set exists. In Section 4.5, we discuss the no-evaluation case. Our method relies on the evaluation dataset to approximate the optimal portion of the generated instances to include in the augmented dataset. We do so by training k classifiers, each on a different quantile of the generated instances, ordered by their softmax scores (received from the pre-trained classifier C_base). After training the k classifiers, we choose the best performing one using the evaluation dataset.
It is important to note that the filtering mechanism could also be applied recursively, for example, by training a classifier on the filtered data and running that classifier over the original generated data once more.
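The quantile search of Dynamic Consistency Filtering can be sketched as follows. `train_and_eval` is a stand-in for the real procedure (training a classifier on X_train ∪ subset and scoring it on the evaluation set); here it is an arbitrary callable so the selection logic stands alone.

```python
def dynamic_consistency_filter(generated, scores, k, train_and_eval):
    """Rank generated examples by classifier confidence, train one
    candidate classifier per quantile of the ranking, and keep the
    quantile whose classifier scores best on the evaluation set."""
    order = sorted(range(len(generated)), key=lambda i: scores[i], reverse=True)
    best_acc, best_subset = float("-inf"), []
    for j in range(1, k + 1):
        cutoff = max(1, round(j * len(generated) / k))   # j-th quantile of the ranking
        subset = [generated[i] for i in order[:cutoff]]
        acc = train_and_eval(subset)   # e.g. train on X_train ∪ subset, score on X_val
        if acc > best_acc:
            best_acc, best_subset = acc, subset
    return best_subset

generated = ["s0", "s1", "s2", "s3", "s4", "s5"]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
# Toy objective: pretend accuracy peaks when exactly 4 examples are kept.
best = dynamic_consistency_filter(generated, scores, 3,
                                  lambda s: -abs(len(s) - 4))
```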

Training and Generating in a Low-Resource Setting

We can observe in Figure 4 that DC, Precision and Recall, and MAUVE converge to 1. This suggests that without any control measures in place, the distribution of the generated text quickly converges to the training distribution. This is not a desired property, since our goal is to generate texts that expand the support of the training distribution. It is interesting to note that the nature of the results remains the same even when soft-prompt tuning is applied.
Therefore, to address the above, we deploy two heuristics. The first is to increase the number of training samples; we do so by using the EDA rule-based augmentation method discussed earlier (Wei and Zou, 2019). Note that in this enrichment we do not consider the pseudo-labels, since our goal is only to provide more reference points for the MLP training. The second heuristic is to checkpoint the MLP several times during training and generate sentences from the different checkpoints.
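A simplified sketch of the rule-based enrichment, using random swap and random deletion only; full EDA also performs WordNet-based synonym replacement and random insertion, which are omitted here to keep the sketch dependency-free.

```python
import random

def eda_like(sentence, rng, n_aug=4):
    """Produce cheap rule-based variants of a sentence: a random word swap,
    plus an occasional random deletion (a simplified sketch of EDA)."""
    words = sentence.split()
    variants = []
    for _ in range(n_aug):
        w = words[:]
        i, j = rng.randrange(len(w)), rng.randrange(len(w))
        w[i], w[j] = w[j], w[i]                   # random swap
        if len(w) > 1 and rng.random() < 0.1:
            del w[rng.randrange(len(w))]          # random deletion
        variants.append(" ".join(w))
    return variants

variants = eda_like("book a flight to boston", random.Random(0))
```

Since these variants only feed the MLP's reconstruction training, label preservation matters less here than in EDA's original classification use.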

Experiments

Setup
We conduct experiments on four multi-class classification datasets (described in the next subsection). Each benchmark dataset is split into 80% train, 10% evaluation, and 10% test. We compare TAU-DR to the methods discussed in Section 2: the rule-based data augmentation method EDA (Wei and Zou, 2019); CBERT (Wu et al., 2019); LAMBADA (Anaby-Tavor et al., 2020); and PromDA (Wang et al., 2022), which is implemented with a T5-large model (700M parameters). All hyperparameters used for these methods are those recommended by the authors. We repeat the experiments five times and report the averaged accuracy for each shot-K dataset.
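The shot-K subsets used throughout the experiments can be constructed as follows (a sketch; as in our setup, K examples are sampled per class and classes with fewer than K examples are dropped):

```python
import random
from collections import defaultdict

def make_shot_k(dataset, k, rng):
    """Build a shot-K training set: sample k examples per class and drop
    any class that has fewer than k examples."""
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append((x, y))
    subset = []
    for y in sorted(by_class):
        if len(by_class[y]) >= k:
            subset.extend(rng.sample(by_class[y], k))   # without replacement
    return subset

data = [("a", 0), ("b", 0), ("c", 0), ("d", 1)]   # class 1 has only one example
shot2 = make_shot_k(data, k=2, rng=random.Random(0))
```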
For TAU-DR, we used the T5-large model in all of our experiments. This model was fine-tuned for an additional 100k steps on the C4 dataset using the regular LM loss, to achieve better adaptivity to soft-prompt tuning (Lester et al., 2021; https://huggingface.co/google/t5-large-lm-adapt). We chose an MLP with 2 hidden layers and a ReLU activation. The prompt length is set to 10 in all of our experiments. TAU-DR was trained for 100 epochs; we checkpointed the model every 20 epochs, resulting in 5 checkpoints. The pre-trained classifier C_base used in our method is the same classifier discussed above. For the dynamic filtering, we use 10 classifiers with the same configuration as the pre-trained classifier, where each classifier is trained on a different portion of the generated dataset, ordered by the softmax score of C_base. The experimental results are shown in Table 1 for shot-5 and shot-10 on the different multi-class benchmarks.

Datasets
All datasets used are classification datasets, with different numbers of classes and spanning several domains; three of the datasets are publicly available.

Airline Travel Information Systems (ATIS, Hemphill et al. 1990): The ATIS dataset provides a large set of queries about flight information along with the intent, the subject of the various questions.
Text Retrieval Conference (TREC, Hovy et al. 2001): TREC is a question classification dataset that consists of a variety of questions from different areas and their intent.
Teleco-Bot (T-Bot): An internal intent classification dataset that includes data used to train chatbots deployed by telco companies for customer support.
The datasets used are summarized in Table 2.

Main results
First, we can observe that adding the generated data from TAU-DR to the classification models significantly improves the performance of C_base and outperforms the existing methods. Overall, the EDA rule-based approach does not lead to a significant improvement over C_base on the more challenging datasets, Banking77 and T-Bot, in both shot-5 and shot-10. On the other hand, the language-model-based approaches, i.e., CBERT, LAMBADA, PromDA, and TAU-DR, outperform the rule-based approach. PromDA provides better results than LAMBADA on the ATIS and TREC datasets; however, with the exception of Banking77 (shot-5), it falls short on domain-specific datasets with a larger number of classes, such as Banking77 and T-Bot.
On T-Bot and Banking77 in the shot-10 setting, none of the methods except TAU-DR was able to provide a statistically significant improvement over C_base.
The accuracy improvements of TAU-DR over C_base on the ATIS dataset are approximately 20% for both shot-5 and shot-10. For TREC, the improvement rate is 29% in the shot-5 setting and 9% in the shot-10 setting. For Banking77, the average improvement rate is 4.5%, and for the challenging T-Bot dataset, the average improvement rate is 5%.

Estimating the Generation Quality
We now turn to estimating the quality of the text generated by the different methods. We use the following measures:
• Recall and Precision (Sajjadi et al., 2018): Given two distributions P and Q, "precision" measures how much of Q can be generated by a "part" of P, while "recall" measures how much of P can be generated by a "part" of Q. Recall and precision are summarized by their F1 score.
• Complexity (Kour et al., 2021): Quantifies how difficult observations are, given their true class labels, and how much they will challenge the classifier. The measure can be used to automatically determine a baseline performance threshold.
• MAUVE (Pillutla et al., 2021): This metric measures the gap between two text distributions by calculating the area under the information divergence curve.
A recent study (Kour et al., 2022) compared several statistical and distributional measures over several desired criteria. In their experiments, MAUVE turned out to be the most robust measure of text generation quality.
In this set of experiments, we took the generated text and compared it to the test set, which represents the actual text distribution. A desired property of the augmented texts is that their distribution expands the intersection between the support of the train distribution and the test distribution. Thus, we can compare the generation quality of the different methods by looking at how close they are to the test distribution.
We report the average results in Table 3; the implementation details for this experiment are given in Appendix B. The text generated by TAU-DR is superior on 2 out of 3 measures in all configurations, showing that we can generate text that is close to the actual data distribution. In addition, looking at the ATIS dataset, we observe that we were able to produce more challenging and complex sentences for the classifier.
Whether or how these measures relate to the classifier's performance has not been explored. Nevertheless, they can provide some insight into how well a model can reproduce the test distribution.

Dynamic Consistency Filtering with no Evaluation
Our suggested dynamic filtering method is shown to be effective in filtering out low-quality generated data. However, an evaluation set is not always available in real-world scenarios.
In this subsection, we suggest an approach for filtering the generated data without relying on the existence of an evaluation dataset. The method can be described as follows: as in the Dynamic Consistency Filtering approach, for each class we order the generated examples according to their softmax scores obtained from the pre-trained classifier C_base. We then filter out all instances on which the classifier disagrees with the pseudo-label. Next, we train k classifiers, each on a different quantile of the ordered data (i.e., for k = 5, we train the i-th classifier, i = 1, ..., 5, on the i/5 quantile). We then use the obtained classifiers to filter the generated instances based on their majority vote; we denote this approach as TAU-DR_maj. As shown in Table 4, with the exception of ATIS (shot-5) and Banking77 (shot-5), TAU-DR_maj also outperforms the benchmark methods, and on average it only slightly degrades the performance relative to TAU-DR. Table 4 reports the average accuracy obtained by the base classifier C_base, by TAU-DR with the dynamic filtering approach, and by TAU-DR equipped with the weighted majority filtration approach, which does not rely on the existence of an evaluation dataset.
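The voting step of TAU-DR_maj can be sketched as follows. Here `classifiers` stands in for the k quantile-trained classifiers; toy callables replace real models so only the strict-majority logic is shown.

```python
def majority_vote_filter(generated, pseudo_labels, classifiers):
    """Keep a generated example only when a strict majority of the
    quantile-trained classifiers agrees with its pseudo-label."""
    kept = []
    for x, y in zip(generated, pseudo_labels):
        votes = sum(1 for clf in classifiers if clf(x) == y)
        if votes > len(classifiers) / 2:   # strict majority required
            kept.append((x, y))
    return kept
```

No held-out data is consulted: disagreement among the quantile classifiers themselves serves as the quality signal.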

Conclusion and Future Work
In this paper, we present TAU-DR, a novel text-augmentation method for low-resource classification using dataset reconstruction. We test our method on four multi-class classification datasets in various few-shot scenarios and show that our approach outperforms the state-of-the-art approaches.
In the future, we plan to explore the learned prompt space and examine how it can be used to generate helpful sentences. In our preliminary experiment, we found that the averages of the prompts were concentrated in a narrow cone; this concentration hinders the exploitation of the geometry of the learned prompt space. This observation is aligned with other findings regarding the anisotropy of the word embedding space in pre-trained language models (Li et al., 2020; Ethayarajh, 2019). Finally, we wish to explore if and how additional information (e.g., in-domain textual data) might improve the performance of text augmentation methods on highly specialized domains.

Limitations
To address the low-resource data in the training of TAU-DR, we apply two heuristics: dataset enrichment and generation from different checkpoints. Despite being effective, they require additional computational time, which might be challenging in applications with limited computational resources. A possible way to reduce the computational time is to average the checkpoints; we believe this might lead to competitive results with a significant reduction in computational time, since checkpoint averaging has proved to be an effective approach in low-resource settings. Another limitation arises when the original dataset is in a highly specialized domain that contains domain-specific phrases that were most likely not included in the pre-training data of the language model. In such cases, the results obtained by existing data augmentation approaches will most likely exhibit only marginal improvement.

Ethics Statement
Text generation by nature entails a number of ethical considerations when considering possible applications. The main failure mode is when the model generates text with undesirable properties (e.g., bias) for training the classifier, even though these properties are not present in the original training data. Because our model converges and learns to generate data close to the underlying source material, these considerations are negligible in our approach. Nevertheless, the generated text may be harmful if users of such models are unaware that such issues appear in their training data, or if they fail to address them, e.g., by not selecting and evaluating data carefully.

A Ablation
In our ablation studies, we evaluated the independent effect of five different components of our method: enrichment, intra-class generation, inter-class generation, checkpointing, and dynamic filtering. The goal of this study is to quantify the contribution of each module to the success of our method. Results are summarized in Table 5. During the MLP training, we used dataset enrichment in order to add more reference points; as the results show, this enrichment is an important aspect of our method, since removing it results in an average degradation of 4.5 accuracy points. In addition, we evaluated the effect of each generation method we proposed, intra-class and inter-class. The intra-class generation is meant to enrich the number of examples in a given class, whereas inter-class generation is meant to highlight the differences between classes. Both generation methods are vital components of our method, with degradations of 2 and 3.25 accuracy points when not using intra-class or inter-class generation, respectively.
Moreover, we measured the efficacy of the checkpointing paradigm, which we utilized to overcome the overtraining effects discussed in Section 3. Based on the results, the checkpointing paradigm plays an important role in the method's success.

Text Augmentation by Dataset Reconstruction (TAU-DR)

In this section, we introduce Text AUgmentation by Dataset Reconstruction (TAU-DR), our novel text augmentation algorithm.

Algorithm 1 Text Augmentation by Dataset Reconstruction (TAU-DR)
Require: Training dataset X_train, pre-trained classifier C_base, pre-trained language model LM
%% training phase
1: while training steps not done do
2:   for (x, y) in X_train do
3:     Extract h from C_base
4:     P ← MLP(h)   % transform the hidden representation into a soft prompt
5:     x ← LM(P)    % predict a sentence using the soft prompt
6:   end for
7:   Update θ_MLP with the reconstruction loss
8: end while
%% generation phase
9: Generate new sentences by perturbing the learned prompts (intra- and inter-class generation)

Figure 1 :
Figure 1: TAU-DR: We take a sentence from the dataset and pass it through the pre-trained classifier. Then, the last hidden representation (i.e., the [CLS] token) is used as the input to the multi-layer perceptron (MLP), whose parameters are the only trained parameters. The MLP outputs a soft prompt, and the generator reconstructs the original sentence.

Figure 2 :
Figure 2: Intra-class generation: We sample two instances from the same class and average their prompts. The averaged prompt is then used as a prompt for the generator. The (pseudo-)label is decided according to the class of the instances.

Figure 3 :
Figure 3: Inter-class generation: We sample two instances from two different classes. We then calculate the weighted average of their prompts (the weights are sampled randomly). The (pseudo-)label is decided according to the closest sample.

Figure 4 :
Figure 4: F1, DC and MAUVE of the generated text distribution compared to the train distribution as a function of the training steps.
We then take the train dataset and sample K examples for each class, where classes without K examples are removed, resulting in a shot-K dataset. In our experiments, we choose K ∈ {5, 10}. As the base classifier, we choose the BERT-base model, as in the studies of Anaby-Tavor et al. (2020) and Wang et al. (2021).

Table 1 :
The average accuracy results of the different benchmarks on the multi-class classification tasks. The best improvement in each configuration over the performance of the base model is marked in bold. The results of TAU-DR are significant compared to C_base (paired Student's t-test, p < 0.05). The same set of hyperparameters is used for the training of C_base, for training without the original data, and for training with the generated data. The performance of C_base is evaluated during training using X_val.

Table 2 :
Properties of the multi-class datasets used.

Table 3 :
The average of the generation quality measures for two of the multi-class classification tasks, ATIS and Banking77. The best performing approach in each configuration is marked in bold.

Table 5 :
The average accuracy results of the different components of TAU-DR on the multi-class classification tasks.