Boosting Text Augmentation via Hybrid Instance Filtering Framework

Text augmentation is an effective technique for addressing the problem of insufficient data in natural language processing. However, existing text augmentation methods tend to focus on few-shot scenarios and usually perform poorly on large public datasets. Our research indicates that existing augmentation methods often generate instances with shifted feature spaces, which leads to a performance drop on the augmented data (for example, EDA generally loses ≈2% in aspect-based sentiment classification). To address this problem, we propose a hybrid instance-filtering framework (BoostAug) based on pre-trained language models that maintains a feature space similar to that of the natural datasets. BoostAug is transferable to existing text augmentation methods (such as synonym substitution and back translation) and significantly improves their performance by ≈2-3% in classification accuracy. Our experimental results on three classification tasks and nine public datasets show that BoostAug addresses the performance drop problem and outperforms state-of-the-art text augmentation methods. Additionally, we release our code to help improve existing augmentation methods on large datasets.


Introduction
Recent pre-trained language models (PLMs) (Devlin et al., 2019; Brown et al., 2020; He et al., 2021; Yoo et al., 2021) are able to learn from large amounts of text data. However, fine-tuning them still faces a critical problem of data insufficiency in many low-resource scenarios (Chen et al., 2020; Zhou et al., 2022a; Miao et al., 2021; Kim et al., 2022; Wang et al., 2022b; Yang et al., 2022). Despite this, existing augmentation studies still encounter failures on large public datasets. While some studies (Ng et al., 2020; Body et al., 2021; Chang et al., 2021; Luo et al., 2021) have attempted to leverage the language modeling capabilities of PLMs in text augmentation, these methods still suffer from performance drops on large datasets.
To explore the root cause of this failure mode, we conducted experiments to explain the difference between "good" and "bad" augmentation instances. Our study found that existing augmentation methods (Wei and Zou, 2019; Coulombe, 2018; Li et al., 2019; Kumar et al., 2019; Ng et al., 2020) usually fail to maintain the feature space of the augmentation instances, which leads to bad instances. This shift in feature space occurs in both edit-based and PLM-based augmentation methods. For the edit-based methods, the shifted feature space mainly comes from breaking text transformations, such as changing important words (e.g., 'but') in sentiment analysis. As for the PLM-based methods, they usually introduce out-of-vocabulary words through word substitution and insertion, which can lead to an adverse meaning in sentiment analysis tasks.
To address the performance drop caused by the shifted feature space in existing augmentation methods, we propose a hybrid instance-filtering framework (BoostAug) based on PLMs to guide augmentation instance generation. Unlike existing methods (Kumar et al., 2020), we use PLMs as a powerful instance filter to maintain the feature space, rather than as an augmentor. This is based on our finding that a PLM fine-tuned on a natural dataset is familiar with that dataset's feature space distribution. The proposed framework consists of four instance filtering strategies: perplexity filtering, confidence ranking, predicted label constraint, and a cross-boosting strategy; these strategies are discussed in more detail in Section 2.3. Compared to prominent studies, BoostAug is a pure instance-filtering framework that improves the performance of existing text augmentation methods by maintaining the feature space.
By mitigating the feature space shift, BoostAug can generate more valid augmentation instances and improve the performance of existing augmentation methods, whereas in other studies more augmentation instances generally trigger a performance sacrifice (Coulombe, 2018; Wei and Zou, 2019; Li et al., 2019; Kumar et al., 2020). According to our experimental results on three fine-grained and coarse-grained text classification tasks, BoostAug significantly alleviates the feature space shift for existing augmentation methods.
Our main contributions are:

Proposed Method
The workflow of BoostAug is shown in Figure 2 and the pseudocode is given in Algorithm 1. Different from most existing studies, which focus on unsupervised instance generation, BoostAug serves as an instance filter to improve existing augmentation methods. The framework consists of two main phases: 1) Phase #1: training the surrogate language models; 2) Phase #2: surrogate language model-guided augmentation instance filtering. The following paragraphs explain each step of the implementation in detail.

Surrogate Language Model Training
At the beginning of Phase #1, the original training dataset is divided into k ≥ 3 folds, where k−2 of them are used for training (denoted as the training folds) while the other two are used for validation and augmentation, denoted as the validation fold and the boosting fold, respectively (lines 4-6). Note that without this split, the generated augmentation instances, which will be introduced in Section 2.2, can be near-identical to the instances used to train the surrogate language model. This data overlapping problem would lead to a shifted feature space. We argue that the proposed k-fold augmentation approach, a.k.a. "cross-boosting", alleviates the feature space shift of the augmentation instances, which will be validated and discussed in detail in Section 4.3. The main crux of Phase #1 is to build a surrogate language model that serves as a filter to eliminate harmful and low-quality augmentation instances.
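The fold split above can be sketched as follows; this is an illustrative sketch of the cross-boosting split, not the released implementation, and the function and variable names are our own.

```python
import random

def cross_boosting_folds(dataset, k=5, seed=0):
    """Split the dataset into k folds and yield (training, validation,
    boosting) partitions. The boosting fold is the one that will be
    augmented; it is held out of surrogate-model training so the filter
    never sees near-duplicates of the instances it must judge."""
    assert k >= 3, "cross-boosting needs at least 3 folds"
    data = list(dataset)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        boosting = folds[i]                 # fold to be augmented
        validation = folds[(i + 1) % k]     # fold for early stopping
        training = [x for j, f in enumerate(folds)
                    if j not in (i, (i + 1) % k) for x in f]
        yield training, validation, boosting
```

Iterating over all k partitions reproduces the "Phase #1 and #2 repeated k times" loop described in the workflow.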
We construct a temporary classification model using the DeBERTa (He et al., 2021) architecture. This model is fine-tuned using the data in the k − 2 training folds and the validation fold to capture the semantic features present in the data (line 7). It is important to note that we do not use the full original training dataset for this fine-tuning process. Once the fine-tuning is complete, the language model extracted from the DeBERTa classification model is used as the surrogate language model in the instance filtering step in Phase #2 of BoostAug.
This is different from existing works that use a pre-trained language model to directly generate augmentation instances. We clarify our motivation from the following two aspects.
• In addition to modeling the semantic features, the surrogate language model provides information that is useful for quality control of the augmentation instances, such as text perplexity, classification confidence, and the predicted label.
• Compared to instance generation, the instance filtering approach can be readily integrated with any existing text augmentation approach.

Augmentation Instance Generation
As a building block of Phase #2, we apply prevalent data augmentation approaches as the back end to generate the augmentation instances in BoostAug (line 9). More specifically, let $\mathcal{D}_{\mathrm{org}} := \{d^i_{\mathrm{org}}\}_{i=1}^{N}$ be the original training dataset, where $d^i_{\mathrm{org}} := \langle s^i, \ell^i \rangle$ is a data instance, $s^i$ a sentence, and $\ell^i$ the corresponding label, $i \in \{1, \cdots, N\}$. By applying the transformation function $F(\cdot, \cdot, \cdot)$ upon $d^i_{\mathrm{org}}$ as follows, we expect to obtain a set of augmentation instances $\mathcal{D}^i_{\mathrm{aug}}$ for $d^i_{\mathrm{org}}$:

$$\mathcal{D}^i_{\mathrm{aug}} = F(d^i_{\mathrm{org}}, \tilde{N}, \Theta), \qquad (1)$$

where $\tilde{N} \geq 1$ controls the maximum number of generated augmentation instances. In the end, the final augmentation set is constituted as $\mathcal{D}_{\mathrm{aug}} := \bigcup_{i=1}^{N} \mathcal{D}^i_{\mathrm{aug}}$ (line 14). Note that depending on the specific augmentation back end, there can be more than one strategy constituting the transformation function. For example, EDA (Wei and Zou, 2019) has four transformation strategies: synonym replacement, random insertion, random swap, and random deletion. $\Theta$ consists of the parameters associated with the transformation strategies of the augmentation back end, e.g., the percentage of words to be modified and the mutation probability of a word.
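As a concrete illustration, a minimal EDA-style transformation function $F$ might look as follows; the synonym table, parameter names, and sampling loop are illustrative assumptions rather than the exact backend used in the paper.

```python
import random

# Toy synonym table; a real backend would use WordNet or similar.
SYNONYMS = {"good": ["great", "fine"], "bad": ["poor", "awful"]}

def transform(instance, n_max=4, p_modify=0.2, seed=0):
    """F(d_org, N_tilde, Theta): produce up to n_max distinct augmented
    sentences via synonym replacement, keeping the original label."""
    sentence, label = instance
    rng = random.Random(seed)
    augmented = set()
    for _ in range(10 * n_max):          # rejection-sample candidates
        words = sentence.split()
        for i, w in enumerate(words):
            if w in SYNONYMS and rng.random() < p_modify:
                words[i] = rng.choice(SYNONYMS[w])
        candidate = " ".join(words)
        if candidate != sentence:
            augmented.add(candidate)
        if len(augmented) >= n_max:
            break
    # each augmentation instance inherits the original label
    return [(s, label) for s in augmented]
```

Here `n_max` plays the role of $\tilde{N}$ and `p_modify` is one of the parameters in $\Theta$.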

Instance Filtering
Our preliminary experiments have shown that merely using data augmentation can be detrimental to the modeling performance, no matter how many augmentation instances are applied in the training process. In addition, our experiments in Section 4.3 have shown a surprising shift between the original data and the augmented instances in the feature space. To mitigate this issue, BoostAug proposes an instance filtering approach to control the quality of the augmentation instances. It consists of three filtering strategies: perplexity filtering, confidence ranking, and predicted label constraint, which will be delineated in the following paragraphs. Note that all these filtering strategies are built on the surrogate language model developed in Phase #1 of BoostAug (lines 12 and 13).

Perplexity Filtering
Text perplexity is a widely used metric to evaluate the modeling capability of a language model (Chen and Goodman, 1999; Sennrich, 2012). Our preliminary experiments have shown that low-quality instances have relatively high perplexity, which indicates that perplexity can be used to evaluate the quality of an augmentation instance. Since the surrogate language model built in Phase #1 is bidirectional, the text perplexity of an augmentation instance $d_{\mathrm{aug}}$ is calculated as:

$$\mathrm{P}(d_{\mathrm{aug}}) = \exp\Big(-\frac{1}{s}\sum_{i=1}^{s}\log p(w_i \mid \mathbf{w}_{\setminus i})\Big), \qquad (2)$$

where $w_i$ is the $i$-th token, $s$ is the number of tokens in $d_{\mathrm{aug}}$, and $p(w_i \mid \mathbf{w}_{\setminus i})$ is the probability of $w_i$ conditioned on its context according to the surrogate language model, $i \in \{1, \cdots, s\}$. Note that $d_{\mathrm{aug}}$ is treated as a low-quality instance and discarded if $\mathrm{P}(d_{\mathrm{aug}}) \geq \beta$, where $\beta \geq 0$ is the predefined perplexity threshold.
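A sketch of the perplexity filter, with the language model abstracted as a token probability function `prob(tokens, i)`; the threshold argument corresponds to the predefined perplexity threshold, and all names are illustrative.

```python
import math

def perplexity(tokens, prob):
    """Perplexity of a token sequence: the exponential of the mean
    negative log-probability assigned by the language model `prob`."""
    s = len(tokens)
    nll = -sum(math.log(prob(tokens, i)) for i in range(s)) / s
    return math.exp(nll)

def perplexity_filter(candidates, prob, ppl_threshold):
    # discard candidates whose perplexity meets or exceeds the threshold
    return [c for c in candidates if perplexity(c, prob) < ppl_threshold]
```

With a real surrogate model, `prob` would return the masked-token probability from the fine-tuned DeBERTa; here it is a stand-in callable.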

Confidence Ranking
We observe a significant feature space shift in the augmentation instances, and such instances are allocated low confidence by the surrogate language model. In this case, we can leverage the classification confidence as a driver to control the quality of the augmentation instances. However, long texts naturally yield far more augmentation instances than short texts, leading to an unbalanced distribution. Besides, the confidence of most augmentation instances is ≥ 95%, which makes raw confidence insufficiently selective as a filtering criterion. To mitigate the unbalanced distribution and still make use of confidence, we develop a confidence ranking strategy that eliminates the redundant augmentation instances generated from long texts while retaining the rare instances with relatively low confidence. More specifically, we apply a softmax operation on the output hidden state learned by the surrogate language model, denoted as $H(d_{\mathrm{aug}})$, to evaluate the confidence of $d_{\mathrm{aug}}$:

$$\mathrm{C}(d_{\mathrm{aug}}) = \max_{1 \le j \le c} \mathrm{softmax}\big(H(d_{\mathrm{aug}})\big)_j, \qquad (3)$$

where $c$ is the number of classes in the original training dataset. To conduct the confidence ranking, $2 \times \tilde{N}$ instances are generated at first, and only the top-$\tilde{N}$ instances ranked by confidence are retained. By doing so, we expect to obtain a balanced augmentation dataset even when there is a large variance in the confidence predicted by the surrogate language model. After the confidence ranking, the augmentation instances with $\mathrm{C}(d_{\mathrm{aug}}) \leq \alpha$ are discarded, where $\alpha \geq 0$ is the fixed confidence threshold.
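The confidence ranking can be sketched as follows, with `hidden_states` standing in for the surrogate model's output logits $H(d_{\mathrm{aug}})$; the names and the exact selection order are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def confidence_rank(candidates, hidden_states, n_keep, conf_threshold):
    """Generate 2*N candidates upstream, keep the n_keep (= N) most
    confident ones, then drop any whose confidence is still at or
    below conf_threshold. `hidden_states` maps a candidate to its
    output logits from the surrogate model."""
    scored = [(max(softmax(hidden_states[c])), c) for c in candidates]
    scored.sort(reverse=True)            # most confident first
    kept = scored[:n_keep]
    return [c for conf, c in kept if conf > conf_threshold]
```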

Predicted Label Constraint
Due to breaking text transformations, text augmentation can lead to noisy data; e.g., changing the word "greatest" to "worst" in a sentence leads to an adverse label in a sentiment analysis task.
Since the surrogate language model can predict the label of an augmentation instance from its confidence distribution, we develop another filtering strategy that eliminates augmentation instances whose predicted label $\hat{\ell}_{d_{\mathrm{aug}}}$ differs from the ground truth. By doing so, we expect to mitigate the feature space bias.
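A minimal sketch of the predicted label constraint, where `predict` stands in for the surrogate model's label prediction:

```python
def label_constraint(candidates, predict, ground_truth):
    """Keep only the augmentation instances whose predicted label
    matches the ground-truth label of the original instance."""
    return [c for c in candidates if predict(c) == ground_truth]
```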

Feature Space Shift Metric
To quantify the shift of the feature space, we propose an ensemble metric based on the overlapping ratio and distribution skewness of the augmented instances' feature space in the t-SNE projection.
The feature space overlapping ratio measures the diversity of the augmented instances: a larger overlapping ratio indicates that more natural instances have corresponding augmented instances. The distribution skewness, on the other hand, describes the uniformity of the distribution of the augmented instances: a smaller skewness indicates that the natural instances have approximately equal numbers of corresponding augmented instances. To quantify the shift, we first compute the overlapping ratio and the distribution skewness between the natural instances and their corresponding augmented instances. The feature space shift metric then combines the feature space convex hull overlapping ratio O and the feature space distribution skewness sk, which will be introduced in the following subsections.

Convex hull overlapping calculation
To calculate the convex hull overlapping rate, we use the Graham Scan algorithm (Graham, 1972), via the shapely library (https://github.com/shapely/shapely), to find the convex hulls of the test set and the target dataset in the t-SNE visualization, respectively.
Let $P_1$ and $P_2$ represent the convex hulls of the two datasets in the t-SNE visualization; we calculate the overlapping rate as:

$$O = \frac{\mathcal{A}(P_1 \cap P_2)}{\mathcal{A}(P_1 \cup P_2)},$$

where ∩ and ∪ denote the convex hull intersection and union operations, respectively, and $\mathcal{A}(\cdot)$ is the enclosed area; $O$ is the overlapping rate between $P_1$ and $P_2$.
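The overlapping rate could be computed directly with shapely, as the text suggests; the dependency-free sketch below instead uses Andrew's monotone chain (a Graham-scan variant) for the hulls and Sutherland-Hodgman clipping for the intersection, with the union area obtained as A(P1) + A(P2) − A(P1 ∩ P2). All function names are our own, and degenerate cases (collinear points, parallel overlapping edges) are not handled.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def poly_area(poly):
    # shoelace formula
    n = len(poly)
    if n < 3:
        return 0.0
    return abs(sum(poly[i][0]*poly[(i+1) % n][1]
                   - poly[(i+1) % n][0]*poly[i][1]
                   for i in range(n))) / 2.0

def clip(subject, clipper):
    """Sutherland-Hodgman clipping of convex subject by convex clipper."""
    def inside(p, a, b):
        return (b[0]-a[0])*(p[1]-a[1]) - (b[1]-a[1])*(p[0]-a[0]) >= 0
    def intersect(p1, p2, a, b):
        dx1, dy1 = p2[0]-p1[0], p2[1]-p1[1]
        dx2, dy2 = b[0]-a[0], b[1]-a[1]
        denom = dx1*dy2 - dy1*dx2
        t = ((a[0]-p1[0])*dy2 - (a[1]-p1[1])*dx2) / denom
        return (p1[0]+t*dx1, p1[1]+t*dy1)
    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i+1) % len(clipper)]
        inp, output = output, []
        if not inp:
            break
        for j in range(len(inp)):
            cur, prev = inp[j], inp[j-1]
            if inside(cur, a, b):
                if not inside(prev, a, b):
                    output.append(intersect(prev, cur, a, b))
                output.append(cur)
            elif inside(prev, a, b):
                output.append(intersect(prev, cur, a, b))
    return output

def overlap_ratio(points_a, points_b):
    """O = area(P1 ∩ P2) / area(P1 ∪ P2) on 2-D t-SNE coordinates."""
    h1, h2 = convex_hull(points_a), convex_hull(points_b)
    inter = poly_area(clip(h1, h2))
    union = poly_area(h1) + poly_area(h2) - inter
    return inter / union if union else 0.0
```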

Distribution skewness calculation
The skewness of an example distribution is computed as:

$$sk = \frac{m_3}{m_2^{3/2}},$$

where $m_i = \frac{1}{N}\sum_{j=1}^{N}(x_j - \bar{x})^i$ is the $i$-th central moment of the example distribution, $N$ is the number of instances, and $\bar{x}$ is the mean. Because the t-SNE space has two dimensions (namely the x and y axes), we measure the global skewness of the target dataset (e.g., the training set or the augmentation set) by summing the absolute value of the skewness on the two axes:

$$sk_g = |sk_x| + |sk_y|,$$

where $sk_g$ is the global skewness of the target dataset, and $sk_x$ and $sk_y$ are the skewness on the x and y axes, respectively. By combining the convex hull overlapping ratio and the distribution skewness, the proposed feature space shift metric offers a comprehensive view of how well the augmented instances align with the original data distribution. This metric can be used to evaluate the effectiveness of different data augmentation approaches, as well as to inform the fine-tuning process for better model performance.
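A minimal sketch of the skewness computation, using the standard central-moment definition $sk = m_3 / m_2^{3/2}$ and summing the absolute skewness over the two t-SNE axes; the names are illustrative.

```python
def skewness(values):
    """Skewness sk = m3 / m2**1.5 computed from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5 if m2 else 0.0

def global_skewness(points):
    """sk_g = |sk_x| + |sk_y| over the two t-SNE axes."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return abs(skewness(xs)) + abs(skewness(ys))
```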
Experimental Setup

Datasets
Our experiments are conducted on three classification tasks: sentence-level text classification (TC), aspect-based sentiment classification (ABSC), and natural language inference (NLI). The datasets used for the TC task are SST2 and SST5 (Socher et al., 2013) from the Stanford Sentiment Treebank, and AGNews10K (Zhang et al., 2015). The datasets used for the ABSC task are Laptop14, Restaurant14 (Pontiki et al., 2014), Restaurant15 (Pontiki et al., 2015), Restaurant16 (Pontiki et al., 2016), and MAMS (Jiang et al., 2019). The datasets used for the NLI task are SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). The splits of these datasets are summarized in Table 1. Following existing research (Wang et al., 2016; Zhou et al., 2022a), the commonly used accuracy (Acc) and macro F1 are adopted as the evaluation metrics. All experiments are repeated five times with different random seeds. Detailed information on the hyper-parameter settings and the sensitivity tests of α and β can be found in Appendix A. In our experiments, LSTM, BERT-BASE (Devlin et al., 2019), and DeBERTa-BASE (He et al., 2021) are used as the objective models for the TC task.
FastLCF is an additional objective model used for the ABSC task.

Experimental Results

Main Results
From the results shown in Table 2, it is clear that BoostAug consistently improves the performance of the text augmentation method EDA across all datasets and models. It is also worth noting that some traditional text augmentation methods can actually harm the performance of the classification models. Additionally, the performance improvement is relatively small for larger datasets such as SST2, SST5, and MAMS. Furthermore, the performance of LSTM is more affected by text augmentation, as it lacks the knowledge gained from the large-scale corpora available to PLMs.
When comparing the different text augmentation methods, it is apparent that EDA performs the best, despite being the simplest method. On the other hand, SplitAug performs the worst for LSTM because its augmentation instances are heavily biased in the feature space due to the word splitting transformation. The performance of SpellingAug is similar to EDA, which can be attributed to the fact that PLMs have already captured some common misspellings during pretraining. Additionally, PLM-based augmentation methods such as WordEmbsAug tend to generate instances with unknown words, further exacerbating the feature space shift of the augmented texts.
We also compare the performance of BoostAug with several state-of-the-art text augmentation methods; the results can be found in Table 3. Even when using EDA (Wei and Zou, 2019) as the backend, BoostAug outperforms other state-of-the-art methods such as AEDA (Karimi et al., 2021), AMDA (Si et al., 2021), and the Bayesian optimization-based TAA (Ren et al., 2021) on the full SST2 dataset.

Ablation Study
To gain a deeper understanding of the working mechanism of BoostAug, we conduct experiments to evaluate the effectiveness of cross-boosting, the predicted label constraint, confidence ranking, and perplexity filtering. The results in Table 4 show that the performance of the variant MonoAug is significantly lower than that of BoostAug. This is because MonoAug trains the surrogate language model using the entire training set, leading to a high degree of similarity between the original and augmentation instances. This data overlapping problem, as discussed in Section 2.1, results in biased instance filtering and overfitting of the instances to the training fold data distribution. Additionally, the variant without the perplexity filtering strategy performs the worst, indicating that perplexity filtering is crucial for removing instances with syntactical and grammatical errors. The performance of the variants without the predicted label constraint and without confidence ranking is similar: the label constraint helps prevent features from mutating into an adverse meaning, and the confidence ranking helps eliminate out-of-domain words and reduce the feature space shift.

Feature Space Shift Investigation
In this subsection, we explore the feature space shift problem in more detail using visualizations and the feature space shift metric. We use t-SNE to visualize the feature distribution of the testing set and compare it to different augmented variants. The full results of the feature space shift metrics are available in Figure 6. They show that the augmentation instances generated by BoostAug have the least feature space shift. Specifically, the overlapping ratio and skewness with respect to the testing set are consistently better than those of the training set, which explains the performance improvement seen when using BoostAug in the previous experiments. In contrast, the augmentation instances generated by EDA, the best peer text augmentation method, have a worse overlapping rate than even the training set, which explains the performance degradation when using EDA on the baseline classification models. It is also noteworthy that the quality of the augmentation instances generated by MonoAug is better than that of EDA.

Effect of Augmentation Instances Number
To further understand the effectiveness of BoostAug, we conduct an experiment to analyze the relationship between the number of augmentation instances generated and the performance of the classification models. We use Acc and F1 as the evaluation metrics and plot the trajectories of these metrics with error bars against the number of augmentation instances generated per example by BoostAug. The results are shown in Figure 3. For comparison, the trajectory visualization plots of MonoAug and EDA can be found in Figure 7. From the results, it is clear that the performance of the classification models improves as the number of augmentation instances increases, but eventually reaches a saturation point. Furthermore, the performance improvement achieved by BoostAug is consistently better than that of MonoAug and EDA. This further confirms the effectiveness of BoostAug in mitigating the feature space shift problem and improving the performance of the classification models.
However, it is also important to consider the computational budgets required to generate a large number of augmentation instances, as this can impact the overall efficiency of the text augmentation method being used.

Hyper-parameter Sensitivity Analysis
We find that there is no single best setting for the two hyper-parameters, α and β, across different situations, such as different datasets and backend augmentation methods. To explore the sensitivity of these hyper-parameters, we conduct experiments on the Laptop14 and Restaurant14 datasets and show the Scott-Knott rank test (Mittas and Angelis, 2013) plots and performance box plots in Figure 4 and Figure 5, respectively. The best value of α highly depends on the dataset: for the Laptop14 and Restaurant14 datasets, α = 0.5 is the best choice according to Figure 4. However, the smaller the value of α, the higher the computation complexity due to the larger number of surviving augmentation instances. To balance efficiency and performance, we recommend α = 0.99 (α = 1 means no augmentation instances survive) in BoostAug, which reduces the computation complexity. Additionally, we find that β is relatively easy to determine, with β = 4 being a common choice.
While recent studies have emphasized the importance of quality control for augmentation instances (Lewis et al., 2021;Kamalloo et al., 2022;Wang et al., 2022b), there remains a need for a transferable augmentation instance-filtering framework that can serve as an external quality controller to improve existing text augmentation methods.
Our work aims to address the failure mode of large dataset augmentation and to improve existing augmentation methods more widely. Specifically, BoostAug is a simple but effective framework that can work with a variety of existing augmentation backends, including EDA (Wei and Zou, 2019) and PLM-based augmentation (Kumar et al., 2020).

Conclusion
Existing text augmentation methods usually lead to performance degeneration on large datasets due to numerous low-quality augmentation instances, and the reason for this degeneration has not been well explained. We find that low-quality augmentation instances usually have a shifted feature space compared to natural instances. Therefore, we propose a universal augmentation instance filtering framework, BoostAug, to widely enhance existing text augmentation methods. BoostAug is an external and flexible framework; existing text augmentation methods can be seamlessly improved. Experimental results on three TC datasets and five ABSC datasets show that BoostAug is able to alleviate the feature space shift in augmentation instances and significantly improve existing augmentation methods.

A limitation is that BoostAug can only preserve the grammar and syntax to a certain extent. We apply the perplexity filtering strategy, but it is an implicit constraint and cannot ensure the syntax quality of the augmentation instances under breaking transformations, such as keyword deletions and modifications. However, precise grammar and syntax information is not needed in most classification tasks, especially PLM-based classification. For syntax-sensitive tasks, e.g., syntax parsing and syntax-based ABSC (Zhang et al., 2019; Phan and Ogunbona, 2020; Dai et al., 2021), ensuring the syntax quality of the augmented instances is an urgent problem. Therefore, BoostAug may not be the best choice for tasks or models that require syntax as an essential modeling objective (Zhang et al., 2019). In other words, the syntax quality of BoostAug depends on the backend.

B Baseline Augmentation Backends
• SynonymAug (Niu and Bansal, 2018) (NLPAug): it replaces words in the original text with their synonyms. This method has been shown to be effective in improving the robustness of models on certain tasks.
• SpellingAug (Coulombe, 2018) (NLPAug): it substitutes words according to a spelling mistake dictionary.
• SplitAug (Li et al., 2019) (NLPAug): it randomly splits some words in the sentence into two words.
• BackTranslationAug (Sennrich et al., 2016) (NLPAug): it is a sentence-level augmentation method based on sequence translation.
• ContextualWordEmbsAug (Kumar et al., 2020) (NLPAug): it substitutes similar words according to a PLM (i.e., RoBERTa-base (Liu et al., 2019)) given the context.

C Additional Experiments

C.1 Natural Language Inference Experiments
The experimental results for the NLI task are reported in Table 5.

C.2 Hyper-parameter Sensitivity Experiment
We provide the experimental results of BoostAug on the Laptop14 and Restaurant14 datasets in Figure 5.

C.3 Performance of BoostAug on Different Backends
To investigate the generalization ability of BoostAug, we evaluate its performance with the existing augmentation backends. From the results shown in Table 6, we find that the performance of these text augmentation backends can be improved by using our proposed BoostAug.
In particular, by cross-referencing the results shown in Table 2, we find that conventional text augmentation methods can be enhanced if appropriate instance filtering strategies are applied. Another interesting observation is that PLMs are not effective for text augmentation directly, e.g., WordEmbsAug is outperformed by EDA in most comparisons. Moreover, PLMs are resource-intensive and usually cause a biased feature space, because they can generate unknown words that do not appear in the testing set. Our experiments indicate that using a PLM as an augmentation instance filter, instead of directly as a text augmentation tool, helps alleviate the feature space shift.

Figure 1 :
Figure 1: The visualization of the feature space shift on the Laptop14 dataset based on t-SNE. We calculate the shift metric S of the feature space between augmented and natural instances. The augmentation methods are BoostAug, MonoAug, and EDA, respectively. BoostAug has the least feature space shift.

Figure 2 :
Figure 2: The workflow of BoostAug can be divided into two phases: Phase #1 and Phase #2. In Phase #1, we fine-tune a DeBERTa-based classification model using re-split training and validation sets and extract the fine-tuned DeBERTa to build a surrogate language model. In Phase #2, BoostAug employs a text augmentation backend to generate raw augmentations and filters out low-quality instances identified by the surrogate language model. To avoid data overlapping between the training folds and the boosting fold, BoostAug performs k-fold cross-boosting, meaning that Phases #1 and #2 are repeated k times.

Figure 3 :
Figure 3: Trajectories of the Acc and F1 values with error bars versus the number of augmentation instances generated per example by BoostAug(EDA). The trajectory visualization plots of MonoAug and EDA can be found in Figure 7.

Figure 4 :
Figure 4: The Scott-Knott rank test plots under different α and β in BoostAug(EDA). A bigger rank means better performance.

Figure 5 :
Figure 5: The performance box plots under different α and β in BoostAug(EDA).

Figure 6
Figure 6 shows the feature space shift of the ABSC datasets, where the augmentation backend of BoostAug is EDA.

C.5 Trajectory Visualization of RQ4

Figure 7 shows the performance trajectory visualization of MonoAug and EDA. Compared to BoostAug, MonoAug and the existing augmentation methods usually trigger a performance sacrifice once there are more than 3 augmentation instances per example.

Figure 7 :
Figure 7: The performance (i.e., classification accuracy and F1 score) visualization of how BoostAug performs as the number of augmentation instances per example increases.

Table 1 :
Statistics of the experimental datasets for the text classification, aspect-based sentiment analysis, and natural language inference tasks.

We use BoostAug to improve five state-of-the-art baseline text augmentation methods, all of which are used as the text augmentation backend of BoostAug. Please find the introductions of these baselines in Appendix B and refer to Table 6 for the detailed performance of BoostAug based on different backends. We also compare BoostAug-enhanced EDA with the following text augmentation methods:
• EDA (TextAttack) (Wei and Zou, 2019): it performs text augmentation via random word insertions, substitutions, and deletions.
• SynonymAug (NLPAug) (Niu and Bansal, 2018): it replaces words in the original text with their synonyms. This method has been shown to be effective in improving the robustness of models on certain tasks.
• TAA (Ren et al., 2021): a Bayesian optimization-based text augmentation method. It searches augmentation policies and automatically finds the best augmentation instances.
• AEDA (Karimi et al., 2021): based on EDA, it attempts to maintain the order of the words while changing their positions in the context. Besides, it alleviates breaking changes such as critical deletions and improves robustness.
• AMDA (Si et al., 2021): it linearly interpolates the representations of pairs of training instances, yielding a more diversified augmentation set compared to discrete text adversarial augmentation.

Table 3 :
The performance comparison on the augmented SST2 dataset between different augmentation methods. We list the standard deviations for each method; "-" indicates the standard deviation is not available.
* is derived from our experiments.

Table 4 :
The performance comparison between different ablated variants of BoostAug.

• The fixed confidence and perplexity thresholds are set as α = 0.99 and β = 5 based on grid search. We provide a sensitivity test of α and β in Appendix C.2.
• The learning rates of the base models LSTM and DeBERTa-BASE are set as $10^{-3}$ and $10^{-5}$, respectively.
• The batch size and maximum sequence modeling length are 16 and 80, respectively.
• The $L_2$ regularization parameter λ is $10^{-8}$; we use Adam as the optimizer for all models during the training process.
Table 5 shows that the performance of both the BERT and DeBERTa models can be improved by applying BoostAug. With BoostAug, the accuracy of the BERT model on SNLI improves from 70.72% to 73.08%, and on MNLI from 51.11% to 52.49%. The DeBERTa model also shows significant improvement with EDA as the backend, achieving 86.39% accuracy on SNLI and 78.04% on MNLI. These results demonstrate the effectiveness of BoostAug in improving the generalizability of natural language inference models, and its compatibility with different state-of-the-art pre-trained models such as BERT and DeBERTa.

Table 5 :
The additional experimental results on the SNLI and MNLI datasets for natural language inference. The backend of BoostAug is EDA.