Substructure Substitution: Structured Data Augmentation for NLP

We study a family of data augmentation methods, substructure substitution (SUB2), for natural language processing (NLP) tasks. SUB2 generates new examples by substituting substructures (e.g., subtrees or subsequences) with other substructures that have the same label; it can be applied to many structured NLP tasks such as part-of-speech tagging and parsing. For more general tasks (e.g., text classification) which do not have explicitly annotated substructures, we present variations of SUB2 based on constituency parse trees, introducing structure-aware data augmentation methods to general NLP tasks. In most cases, training with the dataset augmented by SUB2 achieves better performance than training with the original training set. Further experiments show that SUB2 has more consistent performance than other investigated augmentation methods, across different tasks and sizes of the seed dataset.


Introduction
Data augmentation has been shown to be effective for various natural language processing (NLP) tasks, such as machine translation (Fadaee et al., 2017; Gao et al., 2019; Xia et al., 2019, inter alia), text classification (Wei and Zou, 2019; Quteineh et al., 2020), semantic role labeling (Fürstenau and Lapata, 2009), and dialogue understanding (Hou et al., 2018; Niu and Bansal, 2019). Such methods enhance the diversity of the training set by generating examples based on existing ones and simple heuristics, and make the training process more consistent (Xie et al., 2019). Most existing work focuses on word-level manipulation (Kobayashi, 2018; Wei and Zou, 2019; Dai and Adel, 2020, inter alia) or global sequence-to-sequence style generation (Sennrich et al., 2016).
In this work, we study a family of general data augmentation methods, substructure substitution (SUB2), which generates new examples by same-label substructure substitution (Figure 1). SUB2 naturally fits structured prediction tasks such as part-of-speech tagging and parsing, where substructures exist in the task annotations. For more general NLP tasks such as text classification, we present a variation of SUB2 which (1) performs constituency parsing on existing examples, and (2) generates new examples by subtree substitution based on the parses.
Different from other investigated methods, which sometimes hurt model performance, we show through extensive experiments that SUB2 helps models achieve competitive or better performance than training on the original dataset, across tasks and original dataset sizes. When combined with pretrained language models (Conneau et al., 2019), SUB2 establishes new state-of-the-art results for low-resource part-of-speech tagging and sentiment analysis.
The question of whether explicit parse trees can help neural network-based approaches on downstream tasks has been raised by recent work (Shi et al., 2018b; Havrylov et al., 2019), in which non-linguistic balanced trees have been shown to rival the performance of trees from syntactic parsers. Our work shows that constituency parse trees are more effective than balanced trees as backbones for SUB2 on text classification, especially when only a few examples are available, introducing more potential applications for constituency parse trees in the neural network era.

Related Work
Data augmentation aims to generate new examples based on available ones, without actually collecting new data. Such methods reduce the cost of dataset collection, and usually boost model performance on the desired tasks. Most existing data augmentation methods for NLP tasks can be classified into the following categories:
Token-level manipulation. Token-level manipulation has been widely studied in recent years. An intuitive approach is to create new examples by substituting (word) tokens with others that share the same desired features, such as synonym substitution (Zhang et al., 2015; Wang and Yang, 2015; Fadaee et al., 2017; Kobayashi, 2018) or substitution with words having the same morphological features (Silfverberg et al., 2017). Such methods have been applied to generate adversarial or negative examples which help improve the robustness of neural network-based NLP models (Belinkov and Bisk, 2018; Shi et al., 2018a; Alzantot et al., 2018; Zhang et al., 2019; Min et al., 2020, inter alia), or to generate counterfactual examples which mitigate bias in natural language (Zmigrod et al., 2019; Lu et al., 2020).
Other token-level manipulation methods introduce extra noise such as random token shuffling and deletion (Wang et al., 2018;Wei and Zou, 2019). Models trained on the augmented dataset are expected to be more robust to the considered noise.
Label-conditioned text generation. Recent work has explored generating new examples by training a conditional text generation model (Bergmanis et al., 2017; Liu et al., 2020a; Ding et al., 2020; Liu et al., 2020b, inter alia), or applying post-processing to examples generated by pretrained models (Yang et al., 2020; Wan et al., 2020; Yoo et al., 2020). In the data augmentation stage, given labels in the original dataset as conditions, such models generate associated text accordingly. The generated examples, together with the original datasets, are used to further train models for the primary tasks. A representative among them is back-translation (Sennrich et al., 2016), which has been demonstrated effective not only on machine translation, but also on style transfer (Prabhumoye et al., 2018; Zhang et al., 2020a), conditional text generation (Sobrevilla Cabezudo et al., 2019), and grammatical error correction (Xie et al., 2018).
Another group of work on example generation is to generate new examples based on predefined templates (Kafle et al., 2017;Asai and Hajishirzi, 2020), where the templates are designed following heuristic, and usually task-specific, rules.
Soft data augmentation. In addition to explicit generation of concrete examples, soft augmentation, which directly represents generated examples in a continuous vector space, has been proposed: Gao et al. (2019) propose to perform soft word substitution for machine translation; recent work has adapted the mix-up method (Zhang et al., 2018), which augments the original dataset by linearly interpolating the vector representations of text and labels, to text classification (Guo et al., 2019; Sun et al., 2020), named entity recognition (Chen et al., 2020), and compositional generalization (Guo et al., 2020).
Structure-aware data augmentation. Existing work has also sought potential gains from structures associated with natural language: Xu et al. (2016) improve word relation classification by dependency path-based augmentation. Şahin and Steedman (2018) show that subtree cropping and rotation based on dependency parse trees can help part-of-speech tagging for low-resource languages, while Vania et al. (2019) have demonstrated that such methods also help dependency parsing when very limited training data is available.
SUB2 also falls into this category. The idea of same-label substructure substitution has improved over baselines on structured prediction tasks such as semantic parsing (Jia and Liang, 2016), constituency parsing (Shi et al., 2020), dependency parsing (Dehouck and Gómez-Rodríguez, 2020), named entity recognition (Dai and Adel, 2020), meaning representation-based text generation (Kedzie and McKeown, 2020), and compositional generalization (Andreas, 2020). To the best of our knowledge, however, SUB2 has not been systematically studied as a general data augmentation method for NLP tasks. In this work, we not only extend SUB2 to part-of-speech tagging and structured sentiment classification, but also present a variation that allows a broader range of NLP tasks (e.g., text classification) to benefit from syntactic parse trees. We evaluate SUB2 and several representative general data augmentation methods, which can be widely applied to various NLP tasks.
When constituency parse trees are used, there is a connection between SUB2 and tree substitution grammars (TSGs; Schabes, 1990): the approach can be viewed as (1) estimating a TSG from the given corpus and (2) drawing new sentences from the estimated TSG.

Method
We introduce the general framework we investigate in Section 3.1, and describe the variations of SUB2 which can be extended to text classification and other NLP applications.

Substructure Substitution (SUB2)
As shown in Figure 1, given the original training set D, SUB2 generates new examples using same-label substructure substitution, and repeats the process until the training set reaches the desired size. The general SUB2 procedure is presented in Algorithm 1.
For part-of-speech (POS) tagging, we let text spans be the substructures and use the corresponding POS tag sequences as substructure labels (Figure 1a); for constituency parsing, we use subtrees as the substructures, with phrase labels as the substructure labels (Figure 1b); for dependency parsing, we also use subtrees as the substructures, and let the label of the dependency arc linking the head of the subtree to its parent be the substructure label.
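To make the procedure concrete, the POS-tagging instantiation above can be sketched as follows. This is a minimal illustration rather than the exact Algorithm 1: the function name and the uniform sampling of span boundaries are our own choices.

```python
import random

def sub2_augment(dataset, num_new, seed=0):
    """Sketch of SUB2 for POS tagging: substructures are token spans,
    and substructure labels are the corresponding POS tag sequences."""
    rng = random.Random(seed)

    # Index every labeled span in the seed corpus by its tag-sequence label.
    spans_by_label = {}
    for tokens, tags in dataset:
        n = len(tokens)
        for i in range(n):
            for j in range(i + 1, n + 1):
                spans_by_label.setdefault(tuple(tags[i:j]), []).append(tokens[i:j])

    augmented = []
    while len(augmented) < num_new:
        # Pick a seed example and a span [i, j) inside it.
        tokens, tags = rng.choice(dataset)
        n = len(tokens)
        i = rng.randrange(n)
        j = rng.randrange(i + 1, n + 1)
        label = tuple(tags[i:j])
        # Substitute the span with another span carrying the same label;
        # the tag annotation of the new example stays consistent.
        replacement = rng.choice(spans_by_label[label])
        new_tokens = tokens[:i] + replacement + tokens[j:]
        new_tags = tags[:i] + list(label) + tags[j:]
        augmented.append((new_tokens, new_tags))
    return augmented
```

Because the replacement span carries the same tag sequence as the span it replaces, every generated example is annotation-consistent by construction.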

Variations of SUB2 for Text Classification
Text classification examples do not typically contain explicit substructures. However, we can obtain them by viewing all text spans as substructures (Figure 1d). This approach may be too unconstrained in practice and could introduce noise during augmentation, so we consider constraining substitution by matching several features of the spans:
• Number of words (SUB2+N): when considering this constraint, we can only substitute a span with another having the same number of words; otherwise, we can substitute a span with any other span.
• Phrase or not (SUB2+P): when considering this constraint, we can only substitute a phrase with another phrase (according to a constituency parse of the text); otherwise, the considered spans do not necessarily need to be phrases.
• Phrase label (SUB2+L): this constraint is only applicable when also using SUB2+P. When considering this constraint, we can only perform substitution between phrases with the same phrase label (from constituency parse trees).
• Text classification label (SUB2+T): when considering this constraint, we can only substitute a span with another span that comes from text annotated with the same class label as the original one; otherwise, we can choose the alternative from any example in the training corpus.
We also investigate combinations of the above constraints, requiring all the involved features to match in order to perform SUB2. For example, SUB2+T+N (Figure 1d) requires the original and the alternative span to have the same text classification label and the same number of words.
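The constraints above compose naturally as a conjunction of per-feature checks. A minimal sketch, where the span representation (a dict with `tokens`, `is_phrase`, `phrase_label`, and `text_label` fields) and the function name are our own illustrative choices:

```python
def can_substitute(span_a, span_b, constraints):
    """Return True iff span_b may replace span_a under the given SUB2
    constraints ("N", "P", "L", "T"); every active constraint must hold."""
    if "N" in constraints and len(span_a["tokens"]) != len(span_b["tokens"]):
        return False  # SUB2+N: same number of words
    if "P" in constraints and not (span_a["is_phrase"] and span_b["is_phrase"]):
        return False  # SUB2+P: both spans must be phrases
    if "L" in constraints and span_a["phrase_label"] != span_b["phrase_label"]:
        return False  # SUB2+L: same phrase label (used together with +P)
    if "T" in constraints and span_a["text_label"] != span_b["text_label"]:
        return False  # SUB2+T: source sentences share the class label
    return True
```

For instance, SUB2+T+N corresponds to calling `can_substitute(a, b, {"T", "N"})`, which accepts a candidate only when both spans come from examples with the same class label and contain the same number of words.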

Experiments
We evaluate SUB2 and other data augmentation baselines (Section 4.2) on four tasks: part-of-speech tagging, dependency parsing, constituency parsing, and text classification.

Setup
For part-of-speech tagging and text classification, we add a two-layer perceptron on top of XLM-R (Conneau et al., 2019) embeddings, where we calculate contextualized token embeddings as a learnable weighted average across layers. We use endpoint concatenation (i.e., the concatenation of the first and last token representations) to obtain fixed-dimensional span or sentence features, and keep the pretrained model frozen during training. For dependency parsing, we use the SuPar implementation of Dozat and Manning (2017). For constituency parsing, we use Benepar (Kitaev and Klein, 2018). For all data augmentation methods, including the baselines (Section 4.2), we only augment the training set, and use the original development set. Unless otherwise specified, we introduce 20 times more examples than the original training set when applying an augmentation method. When introducing k× new examples, we also replicate the original training set k times to ensure that the model can access sufficient examples from the original distribution.
All models are initialized with the XLM-R base model (Conneau et al., 2019) unless otherwise specified. We train models for 20 epochs in the high-resource settings (i.e., high-resource part-of-speech tagging, and sentiment classification trained on the full training set) or when applying data augmentation methods, and for 400 epochs in the low-resource settings without augmentation; we select the model with the highest accuracy or F1 score on the development set. All models are optimized using Adam (Kingma and Ba, 2015), where we try learning rates in {5 × 10^-4, 5 × 10^-5}. For the hidden size (i.e., the hidden size of the perceptron for part-of-speech tagging and text classification, the dimensionality of the span representation and scoring multi-layer perceptron for constituency parsing, and the dimensionality of the token representation and scoring multi-layer perceptron for dependency parsing), we vary between 128 and 512. We apply a dropout ratio of 0.2 to the contextualized embeddings during training. All other hyperparameters are the same as the default settings in the released codebases.
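The span featurization described above (a learnable weighted average over frozen layers, followed by endpoint concatenation) can be sketched in NumPy. The dimensions are illustrative, and whether the layer weights are softmax-normalized is our assumption, not something the setup above specifies:

```python
import numpy as np

def span_features(layer_states, layer_weights, start, end):
    """Endpoint-concatenation span features from frozen encoder layers.

    layer_states: (num_layers, seq_len, dim) activations of a frozen
        pretrained encoder (e.g., XLM-R).
    layer_weights: (num_layers,) learnable scalars, one per layer.
    Returns a (2 * dim,) vector: the concatenation of the first and last
    token representations of the span [start, end]."""
    # Normalize the learnable scalars into a distribution over layers
    # (softmax normalization is an assumption of this sketch).
    w = np.exp(layer_weights) / np.exp(layer_weights).sum()
    # Mix the layers: contract the layer axis -> (seq_len, dim).
    mixed = np.tensordot(w, layer_states, axes=1)
    # Endpoint concatenation of the span boundaries.
    return np.concatenate([mixed[start], mixed[end]])
```

The same function yields sentence features by taking the first and last tokens of the whole sequence as the span endpoints.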

Baselines
We compare SUB2 to the following baselines:
• No augmentation (NOAUG), where the original training and development sets are used.
• Contextualized substitution (CTXSUB), where we apply contextualized augmentation (Kobayashi, 2018): we mask out a random word token in an existing example and use multilingual BERT (mBERT; Devlin et al., 2019) to generate a different word in its place.
• Random shuffle (RAND), where we randomly shuffle all the words in the original sentence, while keeping the original structured or non-structured labels. It is worth noting that for dependency parsing, we shuffle the words while maintaining the dependency arcs between individual words; for constituency parsing, we shuffle the terminal nodes and insert them back into the tree structure. Our RAND method for constituency parsing is arguably noisier than that for dependency parsing.
For non-structured text classification tasks, we also introduce the following baselines:
• Random word substitution (RANDWORD), where we substitute a random word in an original example with another random word. This can be viewed as a less restricted version of CTXSUB.
• Binary balanced tree-based SUB2 (SUB2+P, balanced tree). Shi et al. (2018b) argue that binary balanced trees are better backbones for recursive neural networks (Zhu et al., 2015; Tai et al., 2015) on text classification. In this work, we use binary balanced trees as the backbone for SUB2: we (1) generate balanced trees by recursively splitting a span of n words into two consecutive groups, which consist of ⌈n/2⌉ and ⌊n/2⌋ words respectively, and (2) treat each nonterminal in the balanced tree as a substructure to perform SUB2.
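The balanced-tree construction above can be sketched as a short recursion; which half receives the extra word when n is odd (here, the left one) is our assumption:

```python
def balanced_tree(words):
    """Build a binary balanced tree over a span of words by recursively
    splitting n words into consecutive groups of ceil(n/2) and floor(n/2).
    Leaves are words; internal nodes are (left, right) pairs."""
    n = len(words)
    if n == 1:
        return words[0]
    mid = (n + 1) // 2  # ceil(n / 2) words go to the left subtree
    return (balanced_tree(words[:mid]), balanced_tree(words[mid:]))
```

Each internal node of the resulting tree spans a contiguous group of words and serves as a substructure for SUB2.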
All of the data augmentation baselines are explicit augmentations where concrete new examples are generated and used. The methods above are generally applicable to a wide range of NLP tasks.
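For dependency parsing, the RAND baseline above (shuffling words while maintaining the arcs between individual words) amounts to permuting the tokens and remapping head indices through the permutation. A minimal sketch, where the function name and the −1 root convention are our own choices:

```python
import random

def rand_shuffle_dep(tokens, heads, seed=0):
    """Shuffle a dependency-annotated sentence while preserving its arcs.
    heads[i] is the index of token i's head (-1 marks the root)."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)                              # order[new] = old position
    new_pos = {old: new for new, old in enumerate(order)}
    new_tokens = [tokens[old] for old in order]
    # Remap each head index so every original arc still links the same words.
    new_heads = [0] * len(tokens)
    for old, head in enumerate(heads):
        new_heads[new_pos[old]] = -1 if head == -1 else new_pos[head]
    return new_tokens, new_heads
```

The generated example has the same set of labeled word-to-word attachments as the original, only in a shuffled surface order.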

Part-of-Speech Tagging
We conduct our experiments using the Universal Dependencies (UD; Nivre et al., 2016, 2020) dataset. First, we compare both NOAUG and SUB2 to the previous state-of-the-art performance (Heinzerling and Strube, 2019) to ensure that our baselines are strong enough (Table 1). Heinzerling and Strube (2019) take the token-wise concatenation of mBERT last-layer representations, byte-pair encoding (BPE; Gage, 1994)-based LSTM hidden states, and character-LSTM hidden states as the input to the classifier, and fine-tune the pretrained mBERT during training. We find that our framework, with frozen mBERT and extra learnable layer-weight parameters, obtains competitive or better results than those reported by Heinzerling and Strube (2019); XLM-R yields further improvements, likely because it is pretrained on larger corpora than mBERT. In addition, by augmenting the training set with SUB2, we obtain competitive performance on all languages, and achieve better average accuracy on the low-resource languages.
Table 1: Part-of-speech tagging accuracy (×100) on the standard test sets of UD 1.2 high-resource (top) and low-resource (bottom) languages, across different pretrained models and augmentation methods. The best numbers in each row are bolded. SOTA: the best test accuracy for each language among all methods reported by Heinzerling and Strube (2019). Note that XLM-R is the same setting as NOAUG in Table 2.
We further test part-of-speech tagging accuracy on 5 selected low-resource treebanks in the UD 2.6 dataset (Table 2), following the official splits of the dataset. For four of the five investigated treebanks, SUB2 achieves the best performance among all methods, while maintaining competitive performance on te (mtg). In contrast, the other augmentation methods (CTXSUB and RAND) are harmful compared to NOAUG on all treebanks, indicating that the examples generated by SUB2 may be closer to the original data distribution.

Dependency Parsing
We evaluate model performance on the standard Penn Treebank dataset (PTB; Marcus et al., 1993), converted with the Stanford dependency converter v3.0, following the standard splits. We first compare the performance of SUB2 and the baselines in the low-resource setting (Table 3). All methods may help achieve better performance than NOAUG, though not always. CTXSUB helps achieve the best LAS when only an extremely small training set (e.g., 10 examples) is available; however, as the original training set grows, SUB2 begins to dominate, while CTXSUB and RAND sometimes start to hurt performance. In addition, a larger augmented

Constituency Parsing
We evaluate SUB2 and the baseline methods on few-shot constituency parsing, using the Foreebank (Fbank; Kaljahi et al., 2015) and NXT-Switchboard (SWBD; Calhoun et al., 2010) datasets. Foreebank consists of 1,000 English and 1,000 French sentences; for each language, we randomly select 50 sentences for training, 50 for development, and 250 for testing. We follow the standard splits of NXT-Switchboard, and randomly select 50 sentences from its training set and 50 from its development set for training and development respectively.
We compare different data augmentation methods using the setup of few-shot parsing from scratch (Table 5). Among all settings we tested, SUB2 achieves the best performance, and all investigated augmentation methods improve over training only on the original dataset (NOAUG). Surprisingly, we find that the seemingly meaningless RAND, which randomly shuffles the sentence and inserts the shuffled words back into the original parse tree structure as the terminals, also consistently helps few-shot parsing by a nontrivial margin.
For domain adaptation (Table 6), we first train Benepar (Kitaev and Klein, 2018) on the Penn Treebank dataset, and use the pretrained model as the initialization. While the gains from data augmentation are generally smaller than in few-shot parsing from scratch, SUB2 still works best across datasets.

Text Classification
We evaluate the methods introduced in Section 3.2 and the baselines on two text classification datasets in the low-resource setting (Table 7): the Stanford Sentiment Treebank (SST; Socher et al., 2013) and the single-sentence subset of AG News (Zhang et al., 2015). We obtain constituency parse trees using Benepar (Kitaev and Klein, 2018) trained on the standard PTB dataset. Since the SST dataset provides sentiment labels for phrases, it is also natural to use such phrase sentiment labels as substructure labels, where the substructures are phrases (SUB2+P+SENTI).
Across the two investigated settings, data augmentation usually improves over NOAUG, and most variations of SUB2 with the phrase-or-not (+P) constraint are among the best-performing methods on each task (except SUB2+P for SST-10%). Additionally, constituency tree-based SUB2 with phrase labels (+P+L) outperforms balanced tree-based SUB2 in both settings, indicating that phrase structures can be considered useful information for data augmentation in general. (The effectiveness of RAND in few-shot parsing may be related to training/optimization stability, but we leave a richer exploration of potential explanations for future work. Following Shi et al. (2018b), we only keep the single-sentence instances among all examples in each split of the original AG News dataset.)
We further use SUB2+P+T+SENTI to augment the full SST training set, since it is the best augmentation method for few-shot sentiment classification. In addition to sentences, we also add phrases (i.e., subtrees) as training examples, following most existing work (Socher et al., 2013; Kim, 2014; Brahma, 2018, inter alia), to boost performance. In this setting, we find that SUB2 helps set a new state of the art on the SST dataset.

Discussion
We investigate substructure substitution (SUB2), a family of data augmentation methods that generate new examples by same-label substructure substitution. Such methods help achieve competitive or better performance on part-of-speech tagging, few-shot dependency parsing, few-shot constituency parsing, and text classification. While other data augmentation methods (e.g., CTXSUB and RAND) sometimes improve performance, SUB2 is the only one that consistently helps low-resource NLP across tasks.
While existing work has shown that explicit constituency parse trees may not necessarily help improve recursive neural networks for text classification and other NLP tasks (Shi et al., 2018b), our work shows that such parse trees can be robust backbones for SUB2-style data augmentation, introducing more potential ways to help neural networks take advantage of explicit syntactic annotations.
One open question remains: it is still unclear why RAND helps improve few-shot constituency parsing, as the training process requires the model to output the correct parse tree of a sentence while only accessing shuffled words. We leave this question, as well as applications of SUB2 to more NLP tasks, for future work.