Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks

Text classification tasks often encounter few-shot scenarios with limited labeled data, and addressing data scarcity is crucial. Data augmentation with mixup has been shown to be effective on various text classification tasks. However, most mixup methods do not consider the varying degree of learning difficulty at different stages of training, and they generate new samples with one-hot labels, which makes the model over-confident. In this paper, we propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification, which can generate more adaptive and model-friendly pseudo samples for model training. SE focuses on the variation of the model's learning ability. To alleviate the model's over-confidence, we introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for mixing up. Through experimental analysis, we demonstrate that, in addition to improving classification accuracy, SE also enhances the model's generalization ability.


Introduction
Recently, there has been growing interest in generative large language models (LLMs), which have achieved impressive performance on various natural language processing (NLP) tasks. However, empirical evidence (Zhong et al., 2023) suggests that LLMs do not necessarily outperform BERT on semantic understanding tasks. Hence, using BERT is still a viable option in some applications. Data are expensive, and text classification tasks often encounter few-shot scenarios, where only limited labeled data are available for training. Data augmentation (DA) is an effective technique for alleviating data scarcity.
In text classification, DA methods can be divided into two categories: methods that only alter the inputs and methods that modify both inputs and labels. The former generate new data through operations such as semantic substitution or adding words to the original samples, while keeping the original labels unchanged. These methods are easy to implement, but they often generate samples with limited diversity, which may reduce the generalizability of the model. The latter transform the input samples in a certain way and simultaneously change the corresponding labels to generate new data. These methods generate samples with greater diversity, allowing the model to learn more accurate and comprehensive knowledge.
Mixup is a commonly used technique in the second category. It mixes up the inputs of two samples and their labels, where labels are commonly represented with one-hot encoding. Mixup methods are categorized into input-level mixup (Yun et al., 2019) and hidden-level mixup (Verma et al., 2019), depending on where the mix operation takes place. However, mixup typically selects two samples at random and mixes them into one, and the generated pseudo sample may not be adaptive and friendly to the model's learning ability. Moreover, in few-shot scenarios, using hard labels (one-hot labels) can cause the model to drive the logit of the correct class much higher than those of the incorrect classes, so that the incorrect logits become very different from one another. This results in a model that is over-confident about its predictions (Szegedy et al., 2016). Current label smoothing techniques generate soft labels that cannot dynamically adapt to the model's increasing ability as training proceeds and cannot adjust to the model's performance at the current stage.
In this paper, we propose self-evolution learning for data augmentation in text classification. To cater to the model's learning ability, we first divide the data into easy-to-learn and hard-to-learn subsets; we then start with mixup on the easy-to-learn data and gradually transition to mixup on the hard-to-learn data. Note that mixup is performed between similar samples. To avoid the model's over-confidence issue, we introduce an instance-specific label smoothing method, in which we linearly interpolate the predicted probability distribution of the original sample and its one-hot label to obtain a soft label. This dynamic label adapts to the model's increasing ability as training proceeds and adjusts to the model's performance at the current stage.
Our proposed method has empirically proven effective through extensive experiments on a wide range of text classification benchmarks. We show that mixing up data in order of increasing difficulty makes the generated samples more adaptive for model training than randomly selecting samples.
To summarize, our main contributions are: • We propose SE to strengthen mixup on text classification tasks. SE exploits the learning difficulty of samples to generate model-friendly samples.
• We propose a novel instance-specific label smoothing approach for regularization, which performs linear interpolation between the model's output and the one-hot label to obtain a dynamic and adaptive soft label.
• Extensive experiments show that SE significantly and robustly improves mixup methods on few-shot text classification tasks.

Few-shot Text Classification
Driven by the observation that humans can rapidly adapt existing knowledge to new concepts with limited examples, few-shot learning (Fei-Fei et al., 2006) has recently drawn a lot of attention. Few-shot text classification entails performing classification after training or tuning a model on only a few examples. Several studies have explored approaches for few-shot text classification. Yu et al. (2018) propose an adaptive metric learning approach that dynamically selects an optimal distance metric for different tasks. Geng et al. (2019) introduce the Induction Network, which utilizes the dynamic routing algorithm proposed by Sabour et al. (2017) to learn a generalized class-wise representation.
Pre-trained language models have also been employed in few-shot text classification. Bansal et al. (2020) present LEOPARD, which utilizes BERT (Devlin et al., 2019) within an optimization-based meta-learning framework to achieve good performance across diverse NLP classification tasks. Furthermore, GPT-3 (Brown et al., 2020) demonstrates that a language model itself can perform few-shot text classification without relying on meta-learning. These pre-trained models have been adapted and fine-tuned for few-shot text classification tasks, showing improved performance over traditional methods. Data augmentation techniques have also been explored to address the limited labeled data problem. Our work focuses on utilizing pre-trained models for few-shot text classification tasks.

Data Augmentation
Since the bottleneck in few-shot learning is the lack of data, performance can be easily improved if we can generate more labeled data. Hence, various NLP data augmentation techniques have been proposed. The most commonly used method is token replacement: randomly select tokens in a sentence and replace them with semantically similar tokens to synthesize a new sentence. Wei and Zou (2019) directly use WordNet (Miller, 1995) for replacement. Kobayashi (2018) suggests employing contextual augmentation to predict the probability distribution of replacement tokens using two causal language models. Wu et al. (2019) extend contextual augmentation by incorporating the masked language modeling (MLM) objective of BERT, thereby considering bi-directional context. However, the data augmentation methods above primarily focus on altering the original input, resulting in a lack of diversity in the generated samples. Zhang et al. (2018) propose a domain-independent data augmentation technique, mixup, which linearly interpolates image inputs in the pixel-based feature space. Guo et al. (2019) integrate mixup with CNN and LSTM for text applications; they only conduct mixup at the fixed word embedding level, as Zhang et al. (2018) did in image classification. Sun et al. (2020) incorporate a dynamic mixup layer on top of the final hidden layer of a pre-trained transformer-based model; this mixup layer is trained together with the complete text classification model. Chen et al. (2020) propose a mixup method named TMix, which takes two different inputs, passes them through m layers of hidden units, and then applies the traditional mixup method to merge the two hidden representations. Yoon et al. (2021) employ an ingenious approach that performs mixup directly on the input text, instead of predominantly applying it to the hidden layers as in previous mixup methods. While retaining the majority of important tokens, they introduce a novel span by replacing certain information, as a means of augmentation.
Although these mixup strategies achieve remarkable performance, they still have some limitations. First, they all select samples to mix at random and do not consider the model's learning ability. Second, most of them generate samples with one-hot labels. Along this research line, in this paper we improve mixup with a novel self-evolution learning mechanism.

Overview
For text classification tasks in the few-shot scenario, we propose an easy-to-hard mixup strategy. First, we use the BERT model for text classification and apply mixup for data augmentation to expand the amount of data (Sec. 3.2). To make mixup adaptive to the model's learning ability, we propose self-evolution learning for mixup (Sec. 3.3). To alleviate the model's over-confidence problem, we propose an instance-specific label smoothing regularization method, which linearly interpolates the model's output and the one-hot label of the original samples to generate new soft labels for mixing up (Sec. 3.4).

Text Classification Model and Mixup
We utilize BERT (Devlin et al., 2018) for text classification tasks. This model adopts a multi-layer bidirectional Transformer encoder architecture and is pretrained on plain text for masked word prediction and next sentence prediction. BERT takes a sequence as input and outputs a representation of the sequence. The sequence consists of one or two segments; the first token of the sequence is always [CLS], which contains the special classification embedding, and another special token [SEP] is used to separate segments.
For text classification tasks, BERT takes the final hidden state h of the first token [CLS] as the representation of the whole sequence. We then append a softmax classifier to predict the probability distribution over labels.
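The classification head described above can be sketched as follows. This is a minimal NumPy illustration of a linear layer plus softmax over the [CLS] hidden state; the names (`classify`, `softmax`) and the randomly initialized weights are ours for illustration, not from the paper.

```python
import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify(h_cls, W, b):
    """Probability distribution over labels from the [CLS] hidden state.

    h_cls: (batch, hidden) final hidden state of [CLS]
    W:     (hidden, num_labels) classifier weights
    """
    return softmax(h_cls @ W + b)

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 768))        # stand-in for BERT-base [CLS] states
W = rng.standard_normal((768, 2)) * 0.02
b = np.zeros(2)
probs = classify(h, W, b)                # (4, 2), rows sum to 1
```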
To address the lack of data in few-shot scenarios, we use data augmentation to generate new data for training the BERT model. The core idea of mixup is to select two labeled data points (x_i, y_i) and (x_j, y_j), where x is an input and y is a label. The algorithm then produces a new sample (x̃, ỹ) through linear interpolation:

x̃ = λ x_i + (1 − λ) x_j,  (1)
ỹ = λ y_i + (1 − λ) y_j,  (2)

where λ ∈ [0, 1] is the mixing ratio.
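The interpolation in Eqs. 1 and 2 can be sketched directly. A minimal NumPy version, assuming (as is standard for mixup) that λ is drawn from a Beta(α, α) distribution; the function name and the α = 0.2 default are illustrative, not specified in this paper:

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Vanilla mixup: linearly interpolate two inputs and their labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing ratio lambda in [0, 1]
    x_tilde = lam * x_i + (1 - lam) * x_j # Eq. 1
    y_tilde = lam * y_i + (1 - lam) * y_j # Eq. 2
    return x_tilde, y_tilde

# two samples with one-hot labels for a 2-class problem
x1, y1 = np.array([1.0, 0.0]), np.array([1.0, 0.0])
x2, y2 = np.array([0.0, 1.0]), np.array([0.0, 1.0])
x_new, y_new = mixup(x1, y1, x2, y2)
```

The mixed label y_new is a valid probability distribution (its entries sum to 1), which is what makes it usable as a soft training target.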

Self-Evolution Learning for Mixup
To make the mixed samples more adaptive and friendly to model training, we propose a novel mixup training strategy: progressive mixup training from easy to hard. This idea is inspired by human learning behavior: humans often start with simpler tasks and gradually progress to more challenging ones. We first propose a degree of difficulty to measure how hard a sample is for the model to learn, and then proceed in two stages: data division based on the degree of difficulty, and mixup in order from easy to hard. To obtain the degree of difficulty d_i, we calculate the difference between the predicted probability of the correct label p_{y_i} and the maximum predicted probability among the wrong labels, as in Eq. 3:

d_i = p_{y_i} − max_{j ≠ y_i} p_j,  (3)

where p_{y_i} denotes the predicted probability of the correct label and p_j denotes the predicted probability of an incorrect label. The value d_i directly reflects the model's understanding of the sample: the higher the d_i, the stronger the model's grasp of the sample, and the lower the difficulty of the sample for the model.
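The difficulty score of Eq. 3 is a one-liner over the model's predicted distribution. A small sketch (the function name is ours):

```python
import numpy as np

def degree_of_difficulty(probs, label):
    """d_i = p_{y_i} - max_{j != y_i} p_j  (Eq. 3).

    A higher d_i means the model separates the correct class from the
    strongest wrong class more confidently, i.e. the sample is easier.
    """
    p_correct = probs[label]
    p_wrong_max = np.delete(probs, label).max()
    return p_correct - p_wrong_max

probs = np.array([0.7, 0.2, 0.1])          # model's predicted distribution
d = degree_of_difficulty(probs, label=0)   # 0.7 - 0.2 = 0.5
```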
In the first stage of self-evolution learning (SE), we divide the training data into two datasets according to the degree of difficulty. Given a training set D, we calculate the difficulty level of each sample using Eq. 3 and use the median degree of difficulty as the criterion for partitioning the dataset. We then assign samples with a degree of difficulty greater than the median to the easy-to-learn dataset D_easy, and samples with a degree of difficulty less than the median to the hard-to-learn dataset D_hard.

[Algorithm 1: Self-evolution learning. For each sample x_i, select the partner x_j = arg max_{x_j} cos⟨x_i, x_j⟩ and mix them up; output: the generated data.]
In the second stage of self-evolution learning, we conduct mixup from D_easy to D_hard. For easy-to-learn data, we perform mixup operations on D_easy: given a sample x_i from D_easy, we search for the most similar sample x_j in D_easy (similarity measured by cosine similarity) and mix them up as in Eq. 1 and Eq. 2. The resulting generated data are added to model training. For the hard-to-learn dataset, we follow the same procedure: select the two most similar samples and mix them up to compose a pseudo sample, which serves as a new sample to augment the training data. Algorithm 1 summarizes this procedure.
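The two stages can be sketched as follows. This is an illustrative NumPy implementation under our own simplifications: samples are (feature vector, soft label) pairs, the mixing ratio is fixed at λ = 0.5, and function names are ours; the paper's actual pipeline operates on BERT representations.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def split_by_difficulty(samples, difficulties):
    """Stage 1: samples at or above the median difficulty score are 'easy'."""
    med = np.median(difficulties)
    easy = [s for s, d in zip(samples, difficulties) if d >= med]
    hard = [s for s, d in zip(samples, difficulties) if d < med]
    return easy, hard

def mix_most_similar(subset, lam=0.5):
    """Stage 2: pair each sample with its nearest neighbour (cosine) and mix."""
    generated = []
    for i, (x_i, y_i) in enumerate(subset):
        j = max((k for k in range(len(subset)) if k != i),
                key=lambda k: cosine(x_i, subset[k][0]))
        x_j, y_j = subset[j]
        generated.append((lam * x_i + (1 - lam) * x_j,
                          lam * y_i + (1 - lam) * y_j))
    return generated

samples = [(np.array([1.0, 0.0]), np.array([1.0, 0.0])),
           (np.array([0.9, 0.1]), np.array([1.0, 0.0])),
           (np.array([0.0, 1.0]), np.array([0.0, 1.0])),
           (np.array([0.1, 0.9]), np.array([0.0, 1.0]))]
difficulties = np.array([0.8, 0.7, 0.2, 0.1])   # d_i scores from Eq. 3

easy, hard = split_by_difficulty(samples, difficulties)
# mixup first on the easy subset, then on the hard subset
aug = mix_most_similar(easy) + mix_most_similar(hard)
```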

Instance-Specific Label Smoothing
To avoid the overfitting caused by hard labels in few-shot scenarios, we propose a novel instance-specific label smoothing (ILS) approach to adaptively regularize training and improve the generalization of the classification model.
The LS approach aims to minimize the cross-entropy between the soft label y'_i and the predicted output p_i of the model, formulated as:

L_LS = H(y'_i, p_i),  where  y'_i = (1 − α) y_i + α u_i,  (4)

where u_i is a fixed distribution, usually uniform, and α is a weighting factor. Additionally, following (Yuan et al., 2020), we reformulate the loss function for LS as:

L_LS = (1 − α) H(y_i, p_i) + α D_KL(u ∥ p_i),  (5)

where H denotes the ordinary cross-entropy loss and D_KL denotes the KL divergence.
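The reformulation of the LS loss into a cross-entropy term plus a KL term follows by expanding the cross-entropy with the smoothed label; it holds up to an additive constant. A sketch of the derivation (our own, using the identity H(u, p) = D_KL(u ∥ p) + H(u)):

```latex
\begin{aligned}
\mathcal{L}_{LS} &= H\big(y'_i,\, p_i\big)
  = -\sum_{c} \big[(1-\alpha)\, y_{i,c} + \alpha\, u_c\big] \log p_{i,c} \\
 &= (1-\alpha)\, H(y_i, p_i) + \alpha\, H(u, p_i) \\
 &= (1-\alpha)\, H(y_i, p_i)
    + \alpha\, \big[ D_{KL}(u \,\Vert\, p_i) + H(u) \big].
\end{aligned}
```

Since the entropy H(u) of the fixed prior is constant with respect to the model parameters, minimizing the smoothed cross-entropy is equivalent to minimizing (1 − α) H(y_i, p_i) + α D_KL(u ∥ p_i).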
We can regard D_KL(u ∥ p) as a knowledge distillation process, in which u represents a virtual teacher that guides the training of the classification model. However, u cannot follow the improvement of the student model and hardly provides enough linguistic information to guide the model's training.
Motivated by this, in our ILS we propose a more informative prior distribution to smooth the labels. Specifically, we replace the fixed distribution u with a dynamic and informative distribution that is adaptively generated by the classification model itself. In practice, for each sentence, in addition to its original one-hot label, we take the classification model's predicted probability distribution over the classes as reference probabilities r_i. Then, similar to Eq. 5, we obtain the smoothed label ỹ_i via:

ỹ_i = (1 − α) y_i + α r_i.  (6)

Finally, in the SE training stage, we use the cross-entropy loss function, defined as:

L = −Σ_c ỹ_{i,c} log p_{i,c}.  (7)
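Eqs. 6 and 7 together can be sketched in a few lines. A NumPy illustration (function names are ours; `eps` is an assumption added for numerical safety):

```python
import numpy as np

def instance_specific_smooth(y_onehot, model_probs, alpha=0.1):
    """ILS: replace the fixed prior u with the model's own prediction r_i.

    y_tilde = (1 - alpha) * y + alpha * r   (Eq. 6)
    """
    return (1 - alpha) * y_onehot + alpha * model_probs

def cross_entropy(y_tilde, probs, eps=1e-12):
    """Training loss on the smoothed label (Eq. 7)."""
    return -np.sum(y_tilde * np.log(probs + eps))

y = np.array([1.0, 0.0, 0.0])        # one-hot label
r = np.array([0.6, 0.3, 0.1])        # model's current prediction for this instance
y_tilde = instance_specific_smooth(y, r, alpha=0.1)   # [0.96, 0.03, 0.01]
loss = cross_entropy(y_tilde, r)
```

Because r_i is recomputed from the model as training proceeds, the soft target evolves with the model's ability, unlike a fixed uniform prior.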

Datasets
As presented in Table 8, we conducted experiments on various text classification benchmarks to evaluate the effectiveness of our method. We randomly selected 10 samples per class from each dataset to form the training set.

Baselines
To test the effectiveness of our method, we compared it with several recent models: BERT (Devlin et al., 2018) is a pre-trained model. We use the BERT-base-uncased model for the text classification tasks. Notably, the BERT baseline does not use any data augmentation technique.
EmbedMix is a method that applies mixup on the embedding layer, similar to Guo et al. (2019). TMix (Chen et al., 2020) is a mixup technique that interpolates the hidden states of two distinct inputs at a particular encoder layer and then feeds the combined hidden states forward through the remaining layers.
SSMix (Yoon et al., 2021) is a mixup method that operates on the input text and uses span-based mixing to synthesize a sentence.

Model Settings
We use BERT-base-uncased as the underlying model for text classification. We perform all experiments with five different seeds and report the average score. We set a maximum sequence length of 128 and a batch size of 32, with the AdamW optimizer and a weight decay of 1e-4. We use a linear scheduler with warmup for 10% of the total training steps. We update the best checkpoint by measuring validation accuracy at every epoch. We conduct training in two stages: we first train without mixup with a learning rate of 5e-5, and then train with mixup starting from the previous stage's best checkpoint, with a learning rate of 1e-5.
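The learning-rate schedule described above (linear warmup over the first 10% of steps, then linear decay) can be sketched as a plain function. This is our own minimal sketch of the schedule's shape; the paper presumably uses a library implementation (e.g. the Transformers linear scheduler), whose exact edge-case behavior may differ slightly.

```python
def linear_warmup_lr(step, total_steps, base_lr, warmup_frac=0.1):
    """Linear warmup for the first warmup_frac of steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # decay from base_lr down to 0 over the remaining steps
    return base_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

# e.g. stage-one training: base learning rate 5e-5 over 1000 steps
lr_start = linear_warmup_lr(0, 1000, 5e-5)     # 0.0
lr_peak = linear_warmup_lr(100, 1000, 5e-5)    # 5e-5 at the end of warmup
lr_end = linear_warmup_lr(1000, 1000, 5e-5)    # 0.0
```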
For self-evolution learning, we apply our method on top of various mixup methods. During mixup training, in each iteration we start with mixup on the easy-to-learn dataset and then apply mixup to the hard-to-learn dataset. For instance-specific label smoothing, we compared the smoothing ratio α from 0.1 to 0.9 and found that setting α to 0.1 is the optimal hyperparameter.

Main Comparison
Table 1 presents our main results. It is clear that our method outperforms the original mixup methods on various tasks and delivers a substantial improvement in low-resource scenarios. We also observe that mixup improves BERT across multiple tasks.

Ablation Studies
We performed ablation studies to show the effectiveness of each component in self-evolution learning.
Impact of Data Selection Methods. To investigate the impact of different data selection methods on mixup, we compared (1) randomly selecting data, (2) selecting data from easy to hard, and (3) selecting data from hard to easy. We used SSMix as the underlying mixup method, without label smoothing, in all cases. The results in Table 2 show that selecting data from easy to hard, the method we use, achieved the best performance. This also indicates that the model's learning process follows the pattern of human learning, which progresses from easy to hard.
Comparison of Different Label Smoothing. We compared our proposed instance-specific label smoothing with the standard label smoothing method (Szegedy et al., 2016) on the few-shot RTE, MRPC, and SST-2 datasets. We report the results in Table 3. Our instance-specific label smoothing outperforms the standard method in terms of model performance, confirming that it is more effective.
Comparison of Different α Settings. Figure 1 shows the results for different α settings in self-evolution learning. The model's performance is generally best when α is set to 0.1, so we conclude that setting α to 0.1 is a suitable choice for our method.
Removing Different Parts of SE. We also measured the performance of self-evolution learning by stripping one component at a time; the results are shown in Table 4. Performance drops after removing each part, suggesting that all components of self-evolution learning contribute to the final performance. The performance decreased most significantly after removing the easy-to-hard training process, as expected. Moreover, the model also showed a significant drop after removing label smoothing, indicating that our label smoothing regularization is effective.

Model Generalization Ability Discussion
We also examined the generalization ability of models trained with our method. We conducted out-of-distribution (OOD) generalization experiments by training models on the Rotten Tomatoes dataset with three different methods and evaluating them on other target datasets. The results in Table 5 show that our method produces models that perform better on multiple tasks, indicating that models trained with our method have better generalization ability and robustness.

Varying the Base Model
We upgraded BERT-base to BERT-large and ran tests on various datasets. According to the results in Table 6, our method still demonstrates a significant performance improvement. Notably, our approach yields a more pronounced improvement on BERT-large than on BERT-base. This can be attributed to the fact that BERT-large has more parameters, making it more susceptible to over-confidence when trained on a very small amount of data. Our approach therefore has a greater impact in such scenarios.

Impact of the Number of Labeled Data
In this study, we investigated the performance of our proposed method with varying amounts of training data, changing the percentage of training data used from 20% to 100%; the results are shown in Figure 2. As expected, our method showed significant improvement when the amount of training data was extremely limited, indicating that it is effective for few-shot text classification. Moreover, our method still showed some improvement even when more training data was available. These findings demonstrate that our method is effective and scalable for text classification tasks with varying amounts of training data.

Compare to Other DA Methods
We also compared our method with other non-mixup data augmentation methods: EDA (Wei and Zou, 2019) consists of four simple operations: synonym replacement, random insertion, random swap, and random deletion.
Back Translation (Shleifer, 2019) translates a sentence into a temporary language (EN-DE) and then translates the result back into the source language (DE-EN).
The results are presented in Table 7, which clearly shows that our method outperforms other non-mixup data augmentation methods on various text classification tasks. This suggests that mixup methods are relatively more effective in the few-shot scenario, and that our method can further enhance various mixup methods, demonstrating its value for data augmentation in few-shot learning.

Conclusion
In this paper, we propose an effective self-evolution learning (SE) mechanism to improve existing mixup methods on text classification tasks. SE for mixup proceeds in two stages: data division based on the degree of difficulty, and mixup in order from easy to hard. SE can be applied to various mixup methods to generate more adaptive and model-friendly pseudo samples for model training. In addition, to avoid over-confidence of the model, we propose a novel instance-specific label smoothing approach. Extensive experiments on three popular mixup methods, EmbedMix, TMix, and SSMix, verify the effectiveness of our method. Further analyses show that our method improves the generalization and robustness of the model.

Figure 2 :
Figure 2: Improvement (%) on Amazon and MRPC with varying numbers of labeled data per class for self-evolution learning.

Table 1 :
Experimental results of comparison with baselines. All values are average accuracy (%) of five runs with different seeds. Models are trained with 10 labeled data per class.

Table 2 :
Experimental results of different data selection methods. All values are average accuracy (%) of five runs with different seeds. Models are trained with 10 labeled data per class.

Table 3 :
Experimental results of different label smoothing. All values are average accuracy (%) of five runs with different seeds. Models are trained with 10 labeled data per class.

Table 4 :
Performance (test accuracy (%)) on Amazon Counterfactual with 10 labeled data per class after removing different parts of self-evolution learning.

Table 5 :
Experimental results of out-of-distribution generalization for each model. All values are average accuracy (%) of five runs with different seeds.

Table 6 :
Experimental results of comparison with BERT-large. All values are average accuracy (%) of five runs with different seeds. Models are trained with 10 labeled data per class.

Table 7 :
Experimental results of different data augmentation methods. All values are average accuracy (%) of five runs with different seeds. Models are trained with 10 labeled data per class.