Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification

Data augmentation aims to enrich training samples for alleviating the overfitting issue in low-resource or class-imbalanced situations. Traditional methods first devise task-specific operations such as Synonym Substitute, then manually preset the corresponding parameters, such as the substitution rate, which requires a lot of prior knowledge and is prone to falling into the sub-optimum. Besides, the number of editing operations is limited in previous methods, which decreases the diversity of the augmented data and thus restricts the performance gain. To overcome these limitations, we propose a framework named Text AutoAugment (TAA) to establish a compositional and learnable paradigm for data augmentation. We regard a combination of various operations as an augmentation policy and utilize an efficient Bayesian Optimization algorithm to automatically search for the best policy, which substantially improves the generalization capability of models. Experiments on six benchmark datasets show that TAA boosts classification accuracy in low-resource and class-imbalanced regimes by an average of 8.8% and 9.7%, respectively, outperforming strong baselines.


Introduction
Model performance on Natural Language Processing (NLP) tasks, such as text classification, often heavily depends on the size and quality of the training data. However, it is time-consuming and labor-intensive to obtain sufficient training instances, and models face low-resource regimes most of the time. Data augmentation (Simard et al., 1996; Szegedy et al., 2015; Wei and Zou, 2019) aims to enlarge the training dataset by synthesizing additional distinct and label-invariant instances based on the raw instances to improve model performance.1 Data augmentation approaches demonstrate their superiority in many scenarios, especially when data resources are insufficient (Hu et al., 2019b; Anaby-Tavor et al., 2020; Chen et al., 2021) or class-imbalanced (Ren et al., 2018; Hu et al., 2019b).

1 Our code is available at https://github.com/lancopku/text-autoaugment
Previous data augmentation methods can be roughly divided into two categories: generation-based (Sennrich et al., 2016; Imamura et al., 2018; Kobayashi, 2018; Anaby-Tavor et al., 2020; Quteineh et al., 2020) and editing-based methods (Wei and Zou, 2019; Xie et al., 2020). Generation-based methods utilize conditional generation models (Sutskever et al., 2014) to synthesize paraphrases (Kumar et al., 2019; Hu et al., 2019a) of the original sentences, which have advantages in instance fluency and label preservation but suffer from the heavy cost of model pre-training and decoding. Editing-based methods instead apply label-invariant sentence editing operations (swap, delete, etc.) to the raw instance, which is simpler and more efficient in practice. However, the editing-based methods are sensitive to the preset hyper-parameters, including the type of the applied operations and the proportion of words to be edited. As shown in Figure 1, we probe the classification accuracy on the IMDB dataset (Maas et al., 2011) with different editing operations and magnitudes (proportions of edited words). We find that inappropriate hyper-parameter settings, e.g., TF-IDF Substitute (Xie et al., 2020) with a magnitude larger than 0.4, lead to inferior results. Therefore, heuristic hyper-parameter setting is prone to falling into the sub-optimum and lacks effectiveness. Besides, most of the editing-based methods (Kobayashi, 2018; Hu et al., 2019b; Sennrich et al., 2016; Imamura et al., 2018; Andreas, 2020) only apply a single operation to the sentence at a time, which restricts the diversity of the augmented dataset and thus limits the performance gain.

Figure 2: The structure and usage of an augmentation policy P. A policy P consists of N atomic editing operations O = ⟨t, p, λ⟩. Given a text, we randomly pick N* operations from P and apply them sequentially. In this case, N = 4 and N* = 2. A red operation O indicates that it is finally applied to the text according to probability p; the modification it causes to the original text is also shown in red. We use RS (Random Swap), RD (Random Delete), TI (TF-IDF Insert), and TS (TF-IDF Substitute) for abbreviation. Best viewed in color.
To overcome the limitations above, we propose a framework named Text AutoAugment (TAA) to establish a learnable and compositional paradigm for data augmentation. Our goal is to automatically learn the optimal editing-based data augmentation policy for obtaining a higher-quality augmented dataset and thus enhancing the target text classification model. We design an augmentation policy as a set of various editing operations: policy = {op_1, ..., op_N}, where each operation is defined by a parameter triplet: op = ⟨type t, probability p, magnitude λ⟩. Such a compositional structure allows more than one operation to be applied to the original sentence and can improve the diversity of the synthetic instances. In conclusion, a policy solution consists of two kinds of knowledge to learn: the operation set and the editing parameters for each operation. To search for the optimal policy, we propose a novel objective function and utilize Sequential Model-based Global Optimization (SMBO) (Bergstra et al., 2011), an efficient and widely used method in AutoML (Yao et al., 2018), to learn the optimal configuration of the compositional policy.
Given a target dataset, our algorithm learns an augmentation policy automatically and adaptively. The implementation, based on distributed learning frameworks, can efficiently obtain promising results in several GPU hours. To summarize, our contribution is two-fold:

• We present a learnable and compositional framework for data augmentation. Our proposed algorithm automatically searches for the optimal compositional policy, which improves the diversity and quality of augmented samples.
• In low-resource and class-imbalanced regimes of six benchmark datasets, TAA significantly improves the generalization ability of deep neural networks like BERT and effectively boosts text classification performance.

Text AutoAugment
In this section, we introduce the Text AutoAugment framework for searching for the optimal augmentation policy automatically. Our augmentation policy is composed of various operations and forms a hierarchical structure, which we first detail in Section 2.1. Then we give an overview of the proposed learnable augmentation method and establish a global objective function for it (Section 2.2). The specific policy optimization algorithm is presented last (Section 2.3).

Compositional Augmentation Policy
To generate a much broader set of augmentations, we introduce a compositional policy instead of a single operation. Figure 2 shows the structure and usage of the policy P. Specifically, our policy is a set of N various editing operations:

P = {O_1, O_2, ..., O_N}.

The operation O is an atomic component responsible for applying an editing transformation to a text x to synthesize an augmented instance x_aug. Each operation O_i is defined as a triplet:

O_i = ⟨t_i, p_i, λ_i⟩,

meaning that a given text x has probability p_i of being applied with the operation O_i to synthesize the transformed text x_aug = O_i(x; t_i, λ_i) under the magnitude λ_i, and probability 1 − p_i of remaining identical, i.e., x_aug = x.
To further increase the diversity of augmented data and enlarge the support of the training distribution, our policy applies more than one operation to the original sentence in a recursive way. Specifically, we randomly sample N* editing operations from the policy P and apply them to a given text consecutively. The N* atomic operations can be combined into a compositional operation with recursive depth equal to N*. For example, when N* = 2 and the sampled operations are [O_i, O_j], the final augmented instance can be denoted as:

x_aug = O_j(O_i(x)).

In other words, one policy can synthesize up to A(N, N*) × 2^{N*} augmented instances, where A(N, N*) denotes the number of permutations of N* operations chosen from N. Note that our policy is determined by only N × 3 parameters, which does NOT cause the problem of search space explosion. As discussed before, the setting of these parameters in a policy has a great impact on the quality of the augmented data and the model performance, which motivates us to devise a learnable framework for automatically selecting the optimal parameters instead of a naive grid search or cumbersome manual tuning.
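The compositional policy and its sequential application can be sketched as follows. This is a minimal, self-contained illustration: the two word-level edits are toy stand-ins for the paper's operations (the function bodies and the concrete triplet values are assumptions, not the paper's implementation).

```python
import random

def random_swap(words, magnitude):
    """Swap a proportion `magnitude` of adjacent word pairs."""
    if len(words) < 2:
        return words
    words = words[:]
    for _ in range(max(1, int(len(words) * magnitude))):
        i = random.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return words

def random_delete(words, magnitude):
    """Delete each word with probability `magnitude` (keep at least one word)."""
    kept = [w for w in words if random.random() > magnitude]
    return kept or words[:1]

OPS = {"RS": random_swap, "RD": random_delete}

# A policy P is a set of N operation triplets <type t, probability p, magnitude lam>.
policy = [
    {"t": "RS", "p": 0.5, "lam": 0.2},
    {"t": "RD", "p": 0.7, "lam": 0.1},
    {"t": "RS", "p": 0.9, "lam": 0.3},
    {"t": "RD", "p": 0.4, "lam": 0.2},
]

def augment(text, policy, n_star=2):
    """Sample N* operations from the policy and apply them sequentially;
    each sampled operation fires with its own probability p."""
    words = text.split()
    for op in random.sample(policy, n_star):
        if random.random() < op["p"]:
            words = OPS[op["t"]](words, op["lam"])
    return " ".join(words)

print(augment("a deeply felt and vividly detailed story", policy))
```

Because the N* operations are both sampled and applied stochastically, repeated calls on the same text yield the varied outputs that drive the diversity argument above.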

Learnable Data Augmentation Policy
In this subsection, we first review data augmentation and model training with the traditional objective function. We then propose a new objective function to learn the optimal policy for obtaining an augmented dataset of higher quality and improving the model performance.
Given an input space X and output space Y of the text classification task, a model f is responsible for learning a mapping from input texts x ∈ X to target labels y ∈ Y. In some scenarios, the training set D_train is extremely small or imbalanced, which leads to a large generalization error on the test set. Therefore, data augmentation is incorporated as an implicit regularizer (Hernández-García and König, 2018) to help models learn better patterns (Simard et al., 1996) and further improve the generalization ability. Let D_aug(P) be the augmented dataset containing both the training set and the synthetic data generated by the policy P. The loss function L of model training on the augmented dataset can be formulated as a sum of an instance-level loss l such as cross-entropy:

L(f, D_aug(P)) = Σ_{(x, y) ∈ D_aug(P)} l(f(x), y).

In traditional methods, the parameters of data augmentation are preset before training and then tuned on the validation set D_val, which is similar to the paradigm of hyper-parameter tuning. For this reason, we cast the problem of augmentation policy optimization as the Combined Algorithm Selection and Hyper-parameter (CASH) optimization problem (Thornton et al., 2013; Feurer et al., 2019) in AutoML (Yao et al., 2018). Formally, let F and P be the search spaces of models and policies, respectively. Each model f is trained on D_aug(P) augmented by the policy P. We propose a novel metric to measure the loss of a policy and a model: J(f, D_aug(P), D_val), which denotes the loss that the model f achieves on the validation set D_val after being trained on the augmented set D_aug(P). It quantifies the generalization ability of the model after augmentation and thus reflects the quality of the augmented data. Consequently, the objective function of our policy optimization can be written as

P* = argmin_{P ∈ P} J(f, D_aug(P), D_val).    (7)

By minimizing the validation loss of the model, we can find the optimal configuration of the policy P and improve the quality of the augmented data.

Figure 3: The overview of the optimization procedure in the Text AutoAugment algorithm. In each iteration, the optimizer samples a policy and trains a corresponding model on the augmented training set. After that, the loss on the validation set is calculated to update the policy optimizer before executing the next iteration.

Augmentation Policy Optimization
Like the hyper-parameters of a deep learning model, our objective function for policy optimization depends on the validation loss, which cannot be optimized by gradient-based methods such as back-propagation. To tackle this problem, we introduce Sequential Model-based Global Optimization (SMBO) (Bergstra et al., 2011), a kind of Bayesian optimizer widely used in AutoML, into our TAA framework. Our optimization procedure is illustrated in Figure 3. In a nutshell, the SMBO optimizer builds a probability model of the objective function as a surrogate, uses it to sample the most promising policy, and then evaluates the policy on the true objective function. In practice, we use the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011) as a surrogate agent M to model the function between the policy and the objective loss in Eq. 7.
This process is carried out in an iterative manner. At each iteration, the SMBO optimizer samples an augmentation policy P_i to synthesize the augmented set D_aug(P_i), and then trains a model f_i based on it. The loss J_i of the policy P_i is then evaluated by Eq. 7 and merged into the observation history H to help update the surrogate model. The specific updating procedure of TPE is introduced in Appendix B. After that, we employ the following Expected Improvement (EI) criterion as an acquisition function to sample the next promising policy:

EI_{J†}(P) = ∫ max(J† − J, 0) p_M(J | P) dJ.    (8)

Here, J† is a threshold calculated from the observation history and the surrogate model. Eq. 8 is the expectation under the surrogate model that the loss J of a policy will exceed (negatively) the threshold J†. Since our target is to find the policy that minimizes the loss J, the policy that maximizes the expected improvement will be chosen in the next iteration. The TAA framework is summarized in Algorithm 1.
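The iterative sample-train-evaluate loop of Figure 3 can be sketched as below. This is a toy illustration, not the paper's system: the "validation loss" is a synthetic quadratic stand-in for training a model on D_aug(P) and evaluating it, and the naive nearest-to-best proposal rule stands in for the real TPE surrogate.

```python
import random

def sample_policy():
    # one operation triplet per policy, for brevity
    return {"p": random.random(), "lam": random.random()}

def validation_loss(policy):
    # stand-in for: train f on D_aug(policy), then evaluate on D_val
    return (policy["p"] - 0.6) ** 2 + (policy["lam"] - 0.3) ** 2

def propose(history, n_candidates=20):
    """Surrogate stand-in: prefer candidates close to the best quarter
    of the policies observed so far."""
    if len(history) < 5:
        return sample_policy()
    best = sorted(history, key=lambda h: h[1])[: max(1, len(history) // 4)]
    def score(c):  # lower = closer to the "good" region
        return min((c["p"] - b["p"]) ** 2 + (c["lam"] - b["lam"]) ** 2
                   for b, _ in best)
    cands = [sample_policy() for _ in range(n_candidates)]
    return min(cands, key=score)

# The SMBO loop: sample a policy, evaluate it, update the history.
history = []
for t in range(50):
    policy = propose(history)
    history.append((policy, validation_loss(policy)))

best_policy, best_loss = min(history, key=lambda h: h[1])
print(best_policy, best_loss)
```

The loop converges toward the low-loss region because each proposal conditions on the observation history, which is the essential mechanism SMBO shares with the full TPE optimizer described in Appendix B.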

Baselines
We compare our TAA method with the following representative baselines:

Back Translation (BT) (Sennrich et al., 2016; Imamura et al., 2018). We utilize WMT'19 English-German translation models based on the Transformer (Vaswani et al., 2017) to translate the text from English to German, and then translate it back to English. We use random sampling for decoding, as recommended by Xie et al. (2020) and Edunov et al. (2018), and set the temperature to 0.8 to generate more diverse paraphrases.
Contextual Word Substitute (CWS) (Kobayashi, 2018). CWS masks some words in the original text stochastically, then predicts new words for substitution using a label-conditional language model (LM). The proportion α of words to be masked and predicted is 0.15.
Easy Data Augmentation (EDA) (Wei and Zou, 2019). Given a text from the training set, EDA randomly applies one of four simple editing operations: Synonym Replacement, Random Insertion, Random Swap, and Random Deletion. We set the proportion of words to be edited to 0.05, following the recommendations in the original paper.
Learning Data Manipulation (LDM) (Hu et al., 2019b). LDM is also a learnable data augmentation method, but it uses only one operation: the aforementioned CWS. LDM establishes a reinforcement learning system to iteratively optimize the parameters of the task model and the label-conditional LM.
The main characteristics of the baselines are summarized in Table 1.

Experimental Settings
We choose the large-scale pre-trained BERT (base, uncased version) (Devlin et al., 2019) as the backbone model. We adopt the Adam optimizer (Kingma and Ba, 2015) and a linear warmup scheduler with an initial learning rate of 4e-5. The number of training epochs is 20 for TREC and 10 for the rest of the datasets. We pick the best checkpoint according to the validation loss. Note that the validation loss is also the criterion for policy optimization, so we split out two validation sets of the same size during the phase of policy optimization: one is used for evaluating the model checkpoints, and the other for evaluating the sampled policy. All experiments are conducted with 8 Tesla P40 GPUs. Following previous works (Hu et al., 2019b; Xie et al., 2020; Wei and Zou, 2019), we verify the performance of all data augmentation baselines in two special data scenarios.

Low-resource Regime. For every dataset in Section 3.1, we constrain the amount of available labeled data by sub-sampling a smaller training and validation set. In order to maintain the distribution of the original dataset, we apply StratifiedShuffleSplit (Esfahani and Dougherty, 2014) to split the training and validation sets. The final datasets IMDB, SST-5, TREC, YELP-2, and YELP-5 have 80, 200, 120, 80, and 200 labeled training samples, respectively, which poses significant challenges for learning a well-performing classifier. The numbers of validation samples are 60, 150, 60, 60, and 150, respectively. In the low-resource regime, we introduce a parameter n_aug representing the magnification of augmentation. For example, n_aug = 16 means that we synthesize 16 samples for each given sample.
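Stratified sub-sampling preserves the label distribution while shrinking the dataset. The paper uses scikit-learn's StratifiedShuffleSplit; the following is a simplified pure-Python equivalent (the helper name and the toy corpus are illustrative, not from the paper).

```python
import random
from collections import defaultdict

def stratified_subsample(examples, labels, n_total, seed=0):
    """Sub-sample n_total examples, keeping each class's share of the
    original label distribution."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(examples, labels):
        by_label[y].append(x)
    # per-class quota proportional to the class's frequency
    per_class = {y: round(n_total * len(xs) / len(examples))
                 for y, xs in by_label.items()}
    sampled_x, sampled_y = [], []
    for y, xs in by_label.items():
        for x in rng.sample(xs, min(per_class[y], len(xs))):
            sampled_x.append(x)
            sampled_y.append(y)
    return sampled_x, sampled_y

# toy balanced corpus: 1000 texts, two classes
texts = [f"text {i}" for i in range(1000)]
labels = [0] * 500 + [1] * 500
sub_x, sub_y = stratified_subsample(texts, labels, n_total=80)
print(len(sub_x), sub_y.count(0), sub_y.count(1))
```

On this balanced toy corpus the 80-sample subset comes out 40/40, mirroring the original 50/50 distribution, which is exactly the property the low-resource splits above rely on.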
Class-imbalanced Regime. For the binary sentiment classification datasets IMDB and SST-2, we sub-sample the training samples from an imbalanced class distribution. After sub-sampling, the negative class of the training set has 1000 samples, while the positive class has only 20/50/100 training samples in three respective experiments. In this regime, we only augment the samples in the positive class by 50/20/10 times, so that the final data amounts of the positive and negative classes are the same. We set up another baseline, Over-Sampling (OS), for this scenario, which over-samples the training samples in the positive class by 50/20/10 times. Following the settings of previous work (Ren et al., 2018; Hu et al., 2019b), the validation set for training and policy optimization is balanced.
In both low-resource and class-imbalanced regimes, the test set for final evaluation is balanced and intact without any reduction, which makes accuracy a fair metric for evaluation. The policy size N and the number of operations sampled for each text N* are 8 and 2, respectively. For a fair comparison, we use 3 random seeds for data sub-sampling and run 5 times under each seed.

Finally, we report results averaged over the 15 runs, with standard deviation (denoted as ±).

Main Results
We train the model on the training set augmented by baselines in Section 3.2, then conduct evaluations on the balanced and intact test set.
Low-resource Regime. Table 2 shows the test accuracy of TAA on five datasets. In the low-resource regime, the model suffers from a severe overfitting problem. For example, the accuracy on the sub-sampled training set of TREC is as high as 98.06%±1.34%, while it obtains only 69.3%±5.4% on the test set. Several kinds of data augmentation baselines greatly improve the generalization ability of the model. As shown in Figure 1, however, the performance of the augmentation operations is very sensitive to their parameters, such as the magnitude. Heuristic-based approaches such as EDA and CWS are likely to be trapped in sub-optima because of their manual parameter settings. On the contrary, our Text AutoAugment algorithm with a learnable and compositional policy outperforms all the baselines by a considerable margin. Compared to the model without augmentation, TAA boosts the accuracy by an average of 8.8%. Note that it does not cost much computation time to achieve such performance: when the number of iterations T (see Algorithm 1) is 200, finishing the optimization requires only 4.24±0.10 GPU hours on 8 NVIDIA Tesla P40 GPUs.
Class-imbalanced Regime. To verify that TAA can also improve model performance in the class-imbalanced regime, we conduct experiments based on the settings in Section 3.3. As illustrated in Table 3, the over-sampling method alleviates the overfitting problem to some extent but is not as efficient as the augmentation baselines. In contrast, our TAA boosts the test accuracy by an average of about 9.7%, which surpasses the other algorithms.
In both low-resource and class-imbalanced regimes, our TAA framework, based on multiple learnable operations, surpasses BT and CWS (single unlearnable operation), EDA (multiple unlearnable operations), and LDM (single learnable operation).

Analysis
We analyse the learned policies in four aspects: transferability, scalability, magnification, and structure. Besides, we evaluate the diversity and semantic preservation of the augmented data, and further conduct a case study. The final policies searched by our TAA algorithm are listed in Appendix D.

Transferability
We conduct experiments to study whether the policies searched on one source dataset can be applied to other target datasets. As illustrated in Figure 4, transferring a policy to other datasets only results in a slight performance degradation of 1.1% on average, and still boosts the test accuracy by 7.7% compared to no augmentation. The policies transfer better when the gap in class number and sequence length between the source and target datasets is smaller.

Table 3: Test accuracy (%) with standard deviation in the class-imbalanced regime. SST-2 (20:1000) means the positive class of SST-2 has 20 training samples while the negative class has 1000 training samples. * indicates statistically significant (p < .05) improvements over the best baseline.

Scalability
In order to investigate whether the TAA policy remains helpful as the amount of training data increases, we apply the policies searched in the low-resource regime to the corresponding full datasets. Note that the only difference between the experimental setting here and that in the low-resource and class-imbalanced regimes is the amount of training data; the test sets are intact and balanced in all three situations. The results are shown in Table 4.

Impact of the Augmentation Magnification

Figure 5 shows the results of each method with different augmentation magnifications n_aug. We find that CWS and LDM achieve their best performance on SST-5 when n_aug = 4, but the accuracy drops sharply as n_aug continues to increase. The performance improvement from methods with a single operation cannot scale with more augmented samples, presumably because of the lack of novel information. Worse, EDA, which is based on multiple unlearnable operations, even harms model training on TREC, due to its parameter setting based on human experience (Wei and Zou, 2019). By virtue of the compositional and learnable policy, the augmented data synthesized by TAA are effective and perform well in most cases.

Impact of the Policy Structure
We vary the policy size N and the number of operations sampled for each text N*, and re-run the experiments to explore the impact of the policy structure. The left panel in Figure 6 shows the results with different N*. As N* increases, the original text is likely to have more operations applied to it sequentially, which slightly hurts the performance of TAA. The right panel shows that the more operations a policy contains, the better TAA performs. Therefore, for the compositional data augmentation algorithm, it is helpful to generate more diverse samples. Other ablation studies on the operation types and the searching algorithm can be found in Appendix E.

Diversity of Augmented Data
We evaluate the diversity of the augmented data by computing Dist-2 (Li et al., 2016), which measures the number of distinct bi-grams in the generated sentences. The metric is normalized by the total number of bi-grams in the generated sentences and ranges from 0 to 1, where a larger value indicates higher diversity. As shown in Table 5, TAA achieves the best overall performance, even slightly outperforming the generation-based method, i.e., back-translation. The results validate the effectiveness of applying various editing operations in the compositional augmentation policy, and demonstrate the superiority of our algorithm for automatic hyper-parameter tuning.
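The Dist-2 metric described above is straightforward to compute: count distinct bi-grams and divide by the total bi-gram count. A minimal sketch (the function name and toy sentences are illustrative):

```python
def dist_n(sentences, n=2):
    """Distinct-n: number of unique n-grams divided by total n-grams
    across all generated sentences (0 = fully repetitive, 1 = all unique)."""
    ngrams = []
    for s in sentences:
        words = s.split()
        ngrams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# two augmented variants sharing the bi-gram ("a", "deeply")
augmented = ["a deeply felt story", "a deeply detailed story"]
print(dist_n(augmented))  # 6 bi-grams total, 5 distinct -> 5/6
```

Because the denominator is the total bi-gram count, augmenting many near-duplicate sentences drives the score down even when each sentence is individually fluent, which is why the metric captures corpus-level diversity.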

Semantics Preservation of Augmented Data
The augmented sentences are supposed to preserve the semantic meaning of the original sentence, so as to enrich rather than deviate from the support of the original dataset. We propose to evaluate the semantic similarity between the generated sentences and the original ones based on the cosine similarity of their sentence embeddings. In more detail, for a sentence pair (x, x_aug) consisting of the original sentence x and the corresponding augmented text x_aug, we utilize the Sentence-BERT (Reimers and Gurevych, 2019) library, which achieves state-of-the-art performance on various semantic textual similarity benchmarks, to obtain dense vector representations e(x) and e(x_aug). The semantic preservation score SP(x, x_aug) is defined as:

SP(x, x_aug) = cos(e(x), e(x_aug)) = e(x) · e(x_aug) / (‖e(x)‖ ‖e(x_aug)‖).

We compute the average semantic preservation score over the whole augmented datasets using different augmentation methods, and the results are listed in Table 6. Our TAA achieves the best results, comparable with the generation-based method back-translation. We note that our method achieves the highest semantic preservation score when the number of training samples is relatively small, which demonstrates that our method can generalize to extreme low-resource scenarios.

Figure 7 shows some cases of augmented texts generated by TAA. Given the original text with the label "positive", the modifications applied by TAA distort the semantics and sentence structure at the beginning. As the number of iterations T increases, TAA captures the features of the dataset adaptively, helping it achieve a better balance between the diversity and quality of the augmented samples.
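The SP score is a plain cosine similarity over sentence embeddings. The sketch below substitutes simple bag-of-words vectors for the Sentence-BERT embeddings the paper actually uses, so the computation is self-contained; the embedding choice is the assumption here, the cosine formula is not.

```python
import math
from collections import Counter

def embed(sentence):
    """Toy embedding: bag-of-words counts (stand-in for Sentence-BERT)."""
    return Counter(sentence.lower().split())

def sp_score(x, x_aug):
    """Cosine similarity between the embeddings of x and x_aug."""
    ex, ea = embed(x), embed(x_aug)
    dot = sum(ex[w] * ea[w] for w in ex)
    norm = math.sqrt(sum(v * v for v in ex.values())) * \
           math.sqrt(sum(v * v for v in ea.values()))
    return dot / norm if norm else 0.0

orig = "a deeply felt and vividly detailed story"
aug = "a deeply felt and vividly elaborate story"
print(round(sp_score(orig, aug), 3))
```

With real sentence embeddings, swapping "detailed" for the near-synonym "elaborate" would score higher than the word-overlap value here; the bag-of-words stand-in only illustrates the cosine computation itself.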

Related Work
Previous data augmentation algorithms in NLP can be categorized into generation-based and editing-based methods. Among generation-based methods, back-translation (Sennrich et al., 2016; Imamura et al., 2018; Luque, 2019; Zhang et al., 2020) synthesizes paraphrases by translating the text into another language and back. Among editing-based methods, Xie et al. (2020) propose to substitute uninformative words with low TF-IDF scores. These methods require a lot of prior knowledge to preset their parameters and are prone to falling into the sub-optimum.

Table 6: Cosine similarity of sentence embeddings between the augmented sentence and the original one. IMDB (80) means the number of training samples after sub-sampling is 80. Here n_aug = 16 and the number of iterations T is 200.

Figure 7: Examples of augmented texts generated by TAA for the original text "A deeply felt and vividly detailed story about newcomers in a strange new world." (label: Positive). Synthetic texts include: "A deeply felt and vividly depression detailed story about newcomers in a strange new world."; "A deeply felt vividly and about elaborate tale in newcomer earth a young unknown."; "A deeply felt and vividly detailed story particularly about newcomers in a strange new world."; "A deeply felt and vividly elaborate story about newcomers in a foreign new world."
Recently, some algorithms have been proposed for automatically learning an augmentation policy in the fields of Computer Vision (Cubuk et al., 2019; Ho et al., 2019; Lim et al., 2019; Hataya et al., 2020; Li et al., 2020) and NLP (Cai et al., 2020; Miao et al., 2021). However, their modeling differs from ours. Specifically, AutoAugment (Cubuk et al., 2019) establishes a Reinforcement Learning (RL) framework to search for the best augmentation policy. In order to reduce the time of policy exploration, Ho et al. (2019) replace the RL framework with a population-based algorithm. Lim et al. (2019) leverage 5-fold cross-validation and augment the validation set instead of the training set. In our proposed approach, by contrast, the objective function for policy optimization is designed as the loss that models achieve on the validation set after being trained on the augmented set. Besides, we utilize SMBO as the optimization algorithm, which can return a promising result efficiently. In the field of NLP, TAA surpasses both EDA (Wei and Zou, 2019) and LDM (Hu et al., 2019b) via a learnable and compositional augmentation policy.

Conclusion
In this paper, we propose an effective method called Text AutoAugment (TAA) to establish a compositional and learnable paradigm for data augmentation. TAA regards a combination of various editing operations as an augmentation policy and utilizes SMBO for policy learning. Experiments show that TAA can substantially improve the generalization ability of models as well as lighten the burden of manual augmentation design.

A Editing Operations for Augmentation
We use five simple and effective operations, including Random Swap (Wei and Zou, 2019), Random Delete (Wei and Zou, 2019), WordNet Substitute (Zhang et al., 2015; Mueller and Thyagarajan, 2016; Wei and Zou, 2019), TF-IDF Substitute (Xie et al., 2020), and TF-IDF Insert (Xie et al., 2020), as the basic components of our augmentation policy. All operations used here are word-level, considering their low complexity and high effectiveness. The detailed descriptions are shown in Table 7.

B Sequential Model-based Global Optimization (SMBO)
Recall that in the TAA framework of Section 2, our objective is to search for the optimal augmentation policy P that minimizes the following loss:

P* = argmin_{P ∈ P} J(f, D_aug(P), D_val).

We leverage Sequential Model-based Global Optimization (SMBO) as our optimizer for policy learning. The optimization procedure is carried out in an iterative manner. To build the relation between the policy P and the objective loss J and to sample the most promising policy at each iteration, we use the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011) as a surrogate agent M to model the conditional probability p_M(J | P). Besides, we employ the following Expected Improvement (EI) criterion as an acquisition function for policy sampling in the current iteration:

EI_{J†}(P) = ∫ max(J† − J, 0) p_M(J | P) dJ.    (8)

Here, J† is a threshold, and Eq. 8 stands for the expectation under the surrogate model that the loss J of a policy will exceed (negatively) the threshold J†. We expect the loss of the sampled policy in each iteration to be smaller than the current threshold, so the policy that maximizes the Expected Improvement will be chosen in the next iteration. Instead of directly representing p_M(J | P) for calculating the Expected Improvement, the TPE builds a model of p_M(P | J) by applying Bayes' rule. The TPE splits the historical observations of policies into two groups: the best-performing ones (e.g., the upper quartile) and the rest. The threshold J† is defined as the splitting value between the two groups, and the likelihood of a policy belonging to each group is modeled as:

p_M(P | J) = l(P) if J < J†,  and  p_M(P | J) = g(P) if J ≥ J†.

Here, l(P) models the distribution of previously sampled policies whose loss is less than the threshold, and g(P) models the distribution of the other policies whose loss is greater than the threshold. The two densities l and g are modeled using Parzen estimators (also known as kernel density estimators), which are a simple average of kernels centered on existing data points.
With Bayes' rule, we can show that the Expected Improvement we are trying to maximize is proportional to the ratio l(P)/g(P):

EI_{J†}(P) ∝ (γ + (1 − γ) g(P)/l(P))^{-1},  where γ = p(J < J†).

Accordingly, the maximization of the Expected Improvement can be achieved by maximizing the ratio l(P)/g(P). In other words, we should sample the policies that are more likely under l(P) than under g(P). The TPE works by drawing sample policies from l(P), evaluating them in terms of l(P)/g(P), and returning the one that yields the highest value under l(P)/g(P), corresponding to the greatest Expected Improvement. These policies are then evaluated on the objective function and the results are merged into the observation history. The algorithm builds l(P) and g(P) using the history to update the probability model M of the objective function, which improves with each iteration.
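The TPE selection step above can be sketched for a single continuous policy parameter: split the history at the loss quantile γ, fit Parzen estimators l and g over the two groups, and return the candidate maximizing l(x)/g(x). This is a simplified illustration (one-dimensional parameter, fixed Gaussian bandwidth), not the full TPE of Bergstra et al. (2011).

```python
import math, random

def parzen(points, x, bandwidth=0.1):
    """Parzen estimate: average of Gaussian kernels centered on the points."""
    if not points:
        return 1e-12
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) \
        / (len(points) * bandwidth * math.sqrt(2 * math.pi))

def tpe_suggest(history, gamma=0.25, n_candidates=50, rng=random):
    """history: list of (param, loss) pairs. Returns the next param to try,
    chosen to maximize the density ratio l(x)/g(x)."""
    history = sorted(history, key=lambda h: h[1])
    cut = max(1, int(len(history) * gamma))
    good = [p for p, _ in history[:cut]]   # losses below the threshold J†
    bad = [p for p, _ in history[cut:]]    # losses above the threshold J†
    cands = [rng.random() for _ in range(n_candidates)]
    return max(cands, key=lambda x: parzen(good, x) / parzen(bad, x))

# toy history: a quadratic loss with its minimum near param = 0.6
history = [(x, (x - 0.6) ** 2) for x in [0.1, 0.3, 0.5, 0.62, 0.9]]
print(tpe_suggest(history))
```

The suggestion lands near the best observed parameter (0.62) because l is peaked there while g spreads its mass over the worse points, which is precisely the l(P)/g(P) argument made above.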

Table 7: Descriptions of the editing operations.

Random Swap (RS): Swap two adjacent words randomly.
Random Delete (RD): Delete words randomly.
WordNet Substitute (WS): Substitute words with their synonyms according to WordNet.
TF-IDF Substitute (TS): Substitute uninformative words with low TF-IDF scores.
TF-IDF Insert (TI): Insert informative words with high TF-IDF scores.
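TF-IDF Substitute needs to decide which words are "uninformative" before replacing them. The sketch below shows that selection step over a toy corpus; the replacement step itself (sampling a new word) is omitted, and the helper names and corpus are illustrative, not the paper's implementation.

```python
import math
from collections import Counter

corpus = [
    "a deeply felt and vividly detailed story",
    "a strange new world for newcomers",
    "a detailed story about a new world",
]

def tfidf_scores(sentence, corpus):
    """TF-IDF score per unique word: term frequency in the sentence times
    smoothed inverse document frequency over the corpus."""
    words = sentence.split()
    tf = Counter(words)
    n_docs = len(corpus)
    scores = {}
    for w in set(words):
        df = sum(1 for doc in corpus if w in doc.split())
        scores[w] = (tf[w] / len(words)) * math.log((1 + n_docs) / (1 + df))
    return scores

def substitution_candidates(sentence, corpus, magnitude=0.3):
    """Return the lowest-TF-IDF `magnitude` proportion of words, i.e. the
    least informative ones, as targets for substitution."""
    scores = tfidf_scores(sentence, corpus)
    words = sorted(set(sentence.split()), key=scores.get)
    k = max(1, int(len(words) * magnitude))
    return words[:k]

print(substitution_candidates(corpus[0], corpus))
```

Frequent function words like "a" score lowest and are picked first, while content words such as "vividly" survive, matching the intent of replacing words that carry little label information.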

D Searched Policy
The policies searched by our TAA algorithm are listed in Table 9. Each policy finally consists of 8 atomic editing operations, and each operation takes the form O = ⟨t, p, λ⟩. All the policies can be used directly to augment the full training set to further boost model performance on the corresponding downstream tasks.

E Ablation Study of TAA Policy

E.1 Searching Algorithm Ablation
To verify the effectiveness of our optimization algorithm, we re-run the experiment on the same search space described in Section 2.1, but do NOT conduct any policy optimization: for each text in the training set, a random policy is sampled to synthesize the augmented data. We call this method Text AutoAugment-Random (TAA-R). Table 10 illustrates the results of TAA and TAA-R. Generally, the augmentation policy after optimization performs better than the random policy by 1.92%. Note that we incorporate prior knowledge in Section 2.1 to constrain the range of the operation magnitude, which ensures the performance of the operations and avoids generating bad samples.

E.2 Operation Type Ablation
We conduct an operation type ablation study to examine the effect of the operation type search space. Specifically, we eliminate one operation type while keeping the others for searching the optimal policy, and evaluate the task performance using the learned policy. The task performance differences with different ablated operations on the IMDB and SST-2 datasets are shown in Table 11. We find that eliminating an operation type generally leads to a decrease in task performance, indicating the contribution of the different operation types. We attribute this to the fact that more operation types improve the diversity of the combinations of text manipulations, thus boosting the dataset quality and benefiting the generalizability of the model.