AugNLG: Few-shot Natural Language Generation using Self-trained Data Augmentation

Natural Language Generation (NLG) is a key component in a task-oriented dialogue system, which converts a structured meaning representation (MR) into natural language. For large-scale conversational systems, where it is common to have hundreds of intents and thousands of slots, neither template-based approaches nor model-based approaches are scalable. Recently, neural NLGs started leveraging transfer learning and showed promising results in few-shot settings. This paper proposes AugNLG, a novel data augmentation approach that combines a self-trained neural retrieval model with a few-shot learned NLU model to automatically create MR-to-Text data from open-domain texts. The proposed system mostly outperforms the state-of-the-art methods on the FewshotWOZ data in both BLEU and Slot Error Rate. We further confirm improved results on the FewshotSGD data and provide comprehensive analysis results on key components of our system. Our code and data are available at https://github.com/XinnuoXu/AugNLG.


Introduction
Large-scale conversational systems provide a natural interface to achieve various daily-life tasks. Natural Language Generation (NLG) is a key component in such a system, converting a structured meaning representation (MR) into natural language, as shown in Figure 1. In task-oriented dialogue systems, NLG is typically accomplished by filling out a basic set of developer-provided templates, leading to a conversational system that generates unnatural, robotic responses. In order to make the system sound more human-like, model-based NLG approaches, in particular neural models, have recently been gaining increasing traction (Gao et al., 2018; Wen et al., 2015). However, neither the template-based approaches nor the model-based approaches are sufficiently scalable for large-scale conversational systems, where it is common to have hundreds of intents and thousands of slots.
With the rise of neural transfer learning for NLP using pretrained LMs, neural NLGs have recently started to leverage transfer learning and shown promising results (Radford et al., 2019; Brown et al., 2020; Dai et al., 2019; Edunov et al., 2019). In particular, Peng et al. (2020) proposed FEWSHOTWOZ, the first NLG benchmark test in few-shot learning settings, and achieved SOTA performance by leveraging existing MR-to-Text data sets via task-specific continued pre-training. Despite the improved result, their approach leaves little room for further improvement, as MR-to-Text data are expensive to obtain for new domains, practically circling back to the same scalability problem after exhausting the existing data.
In order to go beyond this restriction, this paper proposes AUGNLG, a novel data augmentation approach that automatically creates MR-to-Text data from open-domain texts by combining a self-trained neural retrieval model with a few-shot learned NLU model. Since our data augmentation approach is orthogonal to prior transfer learning approaches, it can be used in conjunction with them. In experiments, we empirically show that AUGNLG mostly boosts the performance of both the fine-tuned GPT-2 (FT-GPT) (Radford et al., 2019) and SC-GPT (Peng et al., 2020), the continued pre-training approach with existing MR-to-Text data, on the FEWSHOTWOZ task. Furthermore, we construct another few-shot learning testbed, FEWSHOTSGD, out of the Schema-Guided Dialogue (SGD) corpus and confirm improved results by applying AUGNLG to the FT-GPT. 1 Finally, we provide comprehensive analysis results on the key components of our system to gain detailed insights into the relationship between component-wise behavior and various parameters.
Related Work

NLG for Dialogue Response Generation. There has been a body of work on neural NLG models adopting various architectures, such as RNNs (Wen et al., 2015), attention RNNs (Dušek and Jurčíček, 2016), SC-LSTM (Wen et al., 2016), T2G2 (Kale and Rastogi, 2020), AdapterCL (Madotto et al., 2020) and associated variants. Despite the improved flexibility and naturalness over template-based methods, neural approaches require large amounts of annotated data to reach good performance.

Data Augmentation. Data augmentation has been widely applied to a variety of NLP tasks, including sentence classification (Xie et al., 2020), natural language inference (Hu et al., 2019) and spoken language understanding (Li et al., 2019; Quan and Xiong, 2019). Prior approaches for text data utilized back-translation (Sennrich et al., 2016; Edunov et al., 2018), c-BERT word replacement (Jiao et al., 2020), mixed labels and representations (Guo et al., 2019) and paraphrase data. However, the range of augmented data is inherently limited, particularly in few-shot learning settings, because these approaches only leverage in-domain data. In contrast, we take a rarely explored approach, tapping into a wealth of open-domain text that covers almost all topics. Recently, Du et al. (2021) proposed a self-training method to augment data for NLU tasks by retrieving sentences from data crawled on the web. However, their method cannot be directly applied to the NLG problem since it does not yield MR annotations. Our approach, in contrast, generates MR-to-Text data by jointly employing a self-trained neural retrieval model and a few-shot learned NLU model.

Few-shot Transfer Learning for NLG
The goal of NLG is to translate an MR $\mathcal{A}$ into its natural language response $x = x_1, \ldots, x_T$, where $x_i$ is the $i$-th token in the sequence $x$ and $T$ is the sequence length. $\mathcal{A}$ is defined as the combination of an intent $\mathcal{I}$ and slot-value pairs, where the intent stands for the illocutionary type of the system action, while the slot-value pairs indicate category names and the values to embed in the utterance. For example, in the MR inform(food = chinese ; price = cheap), inform is the intent, food and price are two slot keys, and chinese and cheap are the corresponding slot values.
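For concreteness, here is a minimal sketch of parsing an MR string of this form into an intent and slot-value pairs; the function and its exact format handling are illustrative and not part of the released code:

```python
import re

def parse_mr(mr: str):
    """Parse an MR like 'inform(food = chinese ; price = cheap)' into
    (intent, {slot: value}). Illustrative only; the exact MR syntax may differ."""
    match = re.match(r"\s*(\w+)\s*\((.*)\)\s*$", mr)
    intent, body = match.group(1), match.group(2)
    slots = {}
    for pair in filter(None, (p.strip() for p in body.split(";"))):
        key, _, value = pair.partition("=")
        slots[key.strip()] = value.strip()
    return intent, slots

print(parse_mr("inform(food = chinese ; price = cheap)"))
# -> ('inform', {'food': 'chinese', 'price': 'cheap'})
```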
Given in-domain MR-to-Text data $\mathcal{D} = \{(\mathcal{A}_n, x_n)\}_{n=1}^{N}$ for training, where $N$ is the number of examples, a statistical neural language model parameterized by $\theta$ is adopted to characterize the conditional probability $p_\theta(x \mid \mathcal{A})$. By applying the chain rule to auto-regressive generation, the joint probability of $x$ conditioned on $\mathcal{A}$ is decomposed as $\prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, \mathcal{A})$. The training process, i.e. the learning of $\theta$, is then defined as maximizing the log-likelihood of the conditional probabilities over the entire training dataset:

$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p_{\theta}(x_{n,t} \mid x_{n,<t}, \mathcal{A}_n)$

In the few-shot learning setup, the number of training examples $N$ is extremely small (e.g. $\leq 50$), which easily leads to non-fluent generated sentences with many grammar mistakes or missing pieces of information. In order to combat this data sparseness problem, inspired by prior transfer learning approaches, we introduce a three-step pipeline that gradually evolves a general large-scale language model into a domain-specific NLG model (shown in Figure 2): (1) pre-training a base language model with massive amounts of text, (2) NLG-specific continued pre-training with automatically augmented MR-to-Text data, and (3) final fine-tuning with the limited in-domain MR-to-Text ground-truth data.

Specifically, in Step (1), we adopt GPT-2 (Radford et al., 2019) as our base language model, since GPT-2 has demonstrated remarkable performance on auto-regressive text generation tasks, which are close to MR-to-Text generation, in a variety of domains. However, GPT-2 is pre-trained on OpenWebText, whose language style and topics are quite different from those of daily conversations in a target domain. Furthermore, the generation task in NLG is conditioned on the input MR, as opposed to the unconditioned generation of the underlying GPT-2 pre-training task. Thus, to bring the model a step closer to the final NLG model in the target domain, in Step (2) we continue pre-training the GPT-2 model on an automatically constructed set of augmented MR-to-Text pairs $\mathcal{D}' = \{(\mathcal{A}_m, x_m)\}_{m=1}^{M}$, where $M$ is the number of augmented examples, which is much larger than the amount of in-domain ground-truth data. Data augmentation is achieved by retrieving a large amount of relevant text from Reddit (Henderson et al., 2019) with a self-trained neural retrieval model and then synthesizing MRs with a few-shot learned NLU model. The details of data augmentation are described in Section 4. Finally, in Step (3), we fine-tune the NLG model on the limited amount of in-domain ground-truth MR-to-Text pairs $\mathcal{D}$ for a final adaptation.
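As a minimal sketch of the per-example objective used in Steps (2) and (3), assuming a HuggingFace GPT-2 and a simple "MR & response" serialization (both illustrative choices, not necessarily the released implementation):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def mr_to_text_loss(mr: str, response: str) -> torch.Tensor:
    # Serialize the pair; the separator and format are illustrative assumptions.
    mr_ids = tokenizer.encode(mr + " & ")
    resp_ids = tokenizer.encode(response + tokenizer.eos_token)
    input_ids = torch.tensor([mr_ids + resp_ids])
    # Only response tokens contribute to the loss (-100 is ignored by the model),
    # so the model learns p(x | A) rather than the probability of the MR itself.
    labels = torch.tensor([[-100] * len(mr_ids) + resp_ids])
    return model(input_ids, labels=labels).loss

loss = mr_to_text_loss("inform(food=chinese; price=cheap)",
                       "there is a cheap chinese restaurant nearby.")
loss.backward()  # one optimization step toward maximizing log p(x | A)
```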

Data Augmentation
The data augmentation procedure aims to construct a large set of MR-to-Text pairs $\mathcal{D}'$ from open-domain texts that are relevant to the in-domain ground-truth MR-to-Text pairs $\mathcal{D}$. The augmentation process consists of two stages: (1) retrieving keyword-matching utterances and filtering out domain-irrelevant instances, and (2) generating synthetic MR annotations. Figure 3 illustrates the overall pipeline with some examples. For further analysis and studies, we release the data from all intermediate steps for each domain at https://github.com/XinnuoXu/AugNLG/tree/master/augmented_data.

Retrieval and Filtering
The utterance retrieval and filtering procedure consists of three steps: (1) keyword extraction, which collects n-gram keywords from all in-domain utterances $\mathcal{X} = \{x_n\}_{n=1}^{N}$; (2) keyword-based retrieval, which searches the open-domain texts for utterances that match any keyword extracted in the previous step, yielding a set of candidate utterances $\mathcal{X}_{cand}$; and (3) self-trained neural filtering, which removes retrieved utterances that are semantically irrelevant to the target domain. After the filtering, we form an augmented set of utterances $\mathcal{X}'$ from the remaining utterances.
Keyword Extraction. To efficiently extract keywords, we first gather all n-gram phrases that appear in $\mathcal{X}$. Since some phrases are too general to be effective, e.g. "I cannot" or "is your", we use TF-IDF scores to measure the specificity of a phrase (see Appendix A for more detail). We rank the collected n-grams according to their TF-IDF scores and filter out the n-gram phrases with relatively low scores.
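A minimal sketch of this extraction step; the n-gram range, the top-k cut-off and the helper names are illustrative placeholders (see Appendix A for the TF-IDF formulation):

```python
from collections import Counter
import math

def ngrams(text, n_max=3):
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(toks) - n + 1)]

def extract_keywords(in_domain_utts, open_domain_utts, top_k=200):
    # Treat all in-domain text as one document for TF; open-domain utterances
    # serve as the document collection for IDF (as in Appendix A).
    tf = Counter(g for u in in_domain_utts for g in ngrams(u))
    total = sum(tf.values())
    num_docs = len(open_domain_utts)
    df = Counter(g for u in open_domain_utts for g in set(ngrams(u)))

    def tfidf(g):
        return (tf[g] / total) * math.log(num_docs / (1 + df[g]))

    ranked = sorted(tf, key=tfidf, reverse=True)
    return ranked[:top_k]   # keep only the most domain-specific n-grams
```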
Keyword-based Retrieval. Having extracted the keywords, we retrieve from the open-domain utterance pool the utterances that contain at least one keyword. The aim of this step is to source a large amount of domain-relevant utterances $\mathcal{X}_{cand}$ based on surface-level overlap.
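A correspondingly simple sketch of the matching step, using whitespace-normalized substring matching as an illustrative stand-in for the actual retrieval over the open-domain pool:

```python
def retrieve_candidates(open_domain_utts, keywords):
    """Return every open-domain utterance containing at least one n-gram keyword."""
    candidates = []
    for utt in open_domain_utts:
        padded = " " + " ".join(utt.lower().split()) + " "
        if any(" " + kw + " " in padded for kw in keywords):
            candidates.append(utt)
    return candidates

X_cand = retrieve_candidates(
    ["With kids movies?", "Any cheap chinese food with kids menu?"],
    ["with kids", "chinese"])
# both utterances match a keyword, so both enter X_cand for later filtering
```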
Self-trained Neural Filtering. Although keyword-based retrieval is efficient, the retrieved utterances $\mathcal{X}_{cand}$ can be quite noisy, since an n-gram keyword only matches part of an utterance and fails to detect irrelevant pieces in other parts. For example, in Figure 3, even though the utterance "With kids movies?" contains the keyword "with kids", it is irrelevant to the target domain Restaurant given the word movies. Thus, we introduce a self-trained neural classifier that considers the semantic representation of an entire utterance to filter out domain-irrelevant utterances from $\mathcal{X}_{cand}$ and yield a domain-relevant set $\mathcal{X}'$.
The self-training and filtering process is listed in Algorithm 1, which takes as input the in-domain utterances $\mathcal{X}$ in the target domain and the retrieved utterances $\mathcal{X}_{cand}$. We adopt a BERT (Devlin et al., 2019) model with a binary classification layer on top as the base model, and train the initial classifier with the in-domain utterances $\mathcal{X}$ and randomly selected open-domain utterances 2 , serving as positive and negative examples ($U^{+}$ and $U^{-}$), respectively. After that, the self-training and filtering cycle starts. At each iteration, we make predictions on the utterances in $\mathcal{X}_{cand}$ with the classifier trained in the previous iteration. All utterances with a score over the threshold $\sigma^{+}$, together with the in-domain utterances $\mathcal{X}$, are taken as a new set of positive examples $E^{+}$, whereas all utterances with a score below the threshold $\sigma^{-}$ are collected as a new set of negative examples $E^{-}$. 3 The self-training loop terminates if either the increment in positive examples at the last iteration is less than the threshold $\delta$ or the number of iterations exceeds the pre-defined maximum. Otherwise, a new classifier is trained on $E^{+}$ and $E^{-}$ and the loop continues. Once the loop has terminated, we label all utterances in $\mathcal{X}_{cand}$ with the classifier from the last iteration. Finally, we build a domain-relevant set of augmented utterances $\mathcal{X}'$ by taking all utterances with a score over the threshold $\sigma$. 4
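A condensed sketch of Algorithm 1; the classifier training and scoring helpers, the thresholds and the stopping constants are illustrative stand-ins rather than the released implementation:

```python
import random

def self_train_filter(in_domain, candidates, open_domain_pool,
                      train_clf, score,          # hypothetical helpers: train / score a BERT classifier
                      sigma_pos=0.9, sigma_neg=0.1, sigma=0.5,
                      delta=100, max_iters=5):
    # Initial classifier: in-domain utterances vs. random open-domain utterances.
    pos = list(in_domain)
    neg = random.sample(open_domain_pool, len(in_domain))
    clf = train_clf(pos, neg)
    prev_pos_size = 0
    for _ in range(max_iters):
        scores = {u: score(clf, u) for u in candidates}
        pos = list(in_domain) + [u for u, s in scores.items() if s > sigma_pos]
        neg = [u for u, s in scores.items() if s < sigma_neg]
        if len(pos) - prev_pos_size < delta:     # too few new positives: stop
            break
        prev_pos_size = len(pos)
        clf = train_clf(pos, neg)                # next self-training iteration
    # Final filtering pass with the classifier from the last iteration.
    return [u for u in candidates if score(clf, u) > sigma]
```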

Synthetic MR Annotation
Having built the domain-relevant set of augmented utterances $\mathcal{X}'$, we now proceed to synthesize MR labels to produce a complete MR-to-Text dataset $\mathcal{D}'$. To this end, we build a few-shot NLU model by fine-tuning a BERT model on the in-domain ground-truth data. To put the data in the right format for the NLU task, we take MRs as labels and utterances as model inputs. Each token is annotated with a slot name if it is part of the associated slot value, and the final hidden state of the special token [CLS] is used to predict the intent (see Figure 5 in Appendix B). Finally, we generate an MR-to-Text dataset $\mathcal{D}'$ by pairing the utterances in $\mathcal{X}'$ with the synthetic MR labels predicted by the few-shot NLU model.
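A minimal sketch of the joint intent and slot model described above; the head sizes and label vocabularies are placeholders, and the real model follows Figure 5:

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

class FewShotNLU(nn.Module):
    """[CLS] representation -> intent; per-token representations -> slot tags."""
    def __init__(self, num_intents, num_slot_tags, name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        hidden = self.bert.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)
        self.slot_head = nn.Linear(hidden, num_slot_tags)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_state = out.last_hidden_state[:, 0]                    # final [CLS] state
        intent_logits = self.intent_head(cls_state)                # intent prediction
        slot_logits = self.slot_head(out.last_hidden_state)        # per-token slot tags
        return intent_logits, slot_logits

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = FewShotNLU(num_intents=8, num_slot_tags=20)                # placeholder sizes
batch = tokenizer(["the parker guest house is at 520 church street"],
                  return_tensors="pt")
intent_logits, slot_logits = model(batch["input_ids"], batch["attention_mask"])
```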

Dataset
Fewshot NLG Data. FEWSHOTWOZ is a few-shot NLG benchmark built upon RNNLG and MultiWOZ (Budzianowski et al., 2018). In each domain, MR-to-Text pairs are grouped according to their delexicalized MRs (i.e. with slot values masked), and a training set is created by taking one pair from each of 50 random groups; the rest are taken as the test set. Note that the average number of delexicalized MRs in a training set is 33, which means the number of training examples in some domains is less than 50. We also construct a new dataset, FEWSHOTSGD, by applying the same procedure to the SGD corpus; detailed statistics of both datasets are provided in Table 9 and Table 10 in Appendix C, where n-gram novelty is calculated by dividing the number of n-grams in the test set that do not appear in the training set by the number of n-grams in the test set.
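A sketch of this few-shot split construction, where delexicalization is reduced to keying each pair by its intent and sorted slot names (an illustrative simplification of masking slot values):

```python
import random
from collections import defaultdict

def delex_key(intent, slot_value_pairs):
    """Group key: intent plus sorted slot names, with slot values masked out."""
    return intent + "(" + ";".join(sorted(k for k, _ in slot_value_pairs)) + ")"

def make_fewshot_split(pairs, shots=50, seed=0):
    """pairs: list of ((intent, slot_value_pairs), utterance) tuples."""
    groups = defaultdict(list)
    for (intent, svs), utt in pairs:
        groups[delex_key(intent, svs)].append(((intent, svs), utt))
    rng = random.Random(seed)
    sampled = set(rng.sample(list(groups), min(shots, len(groups))))
    train, test = [], []
    for key, members in groups.items():
        if key in sampled:
            picked = rng.choice(members)     # one training pair per sampled group
            train.append(picked)
            test.extend(m for m in members if m is not picked)
        else:
            test.extend(members)             # unsampled groups go entirely to test
    return train, test
```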

Evaluation Metrics
Following Wen et al. (2015) and Peng et al. (2020), we use the BLEU score and the Slot Error Rate (ERR) for automatic evaluation. BLEU measures the surface-level similarity between generated responses and human-authored references, whereas ERR measures the semantic alignment in terms of slot-value insertion and omission. Specifically, ERR = (p + q)/M, where M is the total number of slots in the MR and p and q are the numbers of missing and redundant slots in the surface realisation, respectively. Since the SGD dataset does not provide enough information to compute ERR, we report ERR only on FEWSHOTWOZ.
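A sketch of the ERR computation using a naive value-matching heuristic; the actual scorer may match slots differently:

```python
def slot_error_rate(mr_slots, realisation):
    """mr_slots: dict of slot -> value from the input MR.
    Counts values the realisation omits (p) or repeats redundantly (q)."""
    text = realisation.lower()
    p = sum(1 for v in mr_slots.values() if v.lower() not in text)          # missing slots
    q = sum(max(text.count(v.lower()) - 1, 0) for v in mr_slots.values())   # redundant slots
    M = len(mr_slots)
    return (p + q) / M if M else 0.0

err = slot_error_rate({"name": "marlowe", "goodformeal": "dinner", "area": "mission bay"},
                      "marlowe is good for dinner in mission bay.")
print(err)   # 0.0: every slot value is realised exactly once
```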

Systems
We apply our data augmentation approach AUGNLG to two baseline systems:
• FT-GPT: GPT-2 is directly fine-tuned on the in-domain ground-truth MR-to-Text data. We introduce AUGNLG-FT, which further pre-trains GPT-2 on the augmented MR-to-Text data and then performs a final fine-tuning on the in-domain data.
• SC-GPT (Peng et al., 2020): GPT-2 is further pre-trained on existing MR-to-Text data borrowed from other NLG corpora and then fine-tuned on the in-domain data (implementation available at https://github.com/pengbaolin/SC-GPT). We introduce AUGNLG-SC, which pre-trains GPT-2 on both the existing MR-to-Text data and the automatically augmented data, and finally fine-tunes on the in-domain data.

Results
FEWSHOTWOZ. Table 2 reports the results on FEWSHOTWOZ. AUGNLG-FT substantially outperforms FT-GPT across all domains in both BLEU and ERR. Similarly, AUGNLG-SC performs better than SC-GPT and achieves state-of-the-art performance in most domains. Remarkably, AUGNLG-FT achieves a performance competitive with SC-GPT in many domains without leveraging any existing MR-to-Text data; it even outperforms SC-GPT in the "TV" and "Attraction" domains in both BLEU and ERR.
FEWSHOTSGD. Table 3 reports the results on FEWSHOTSGD, where AUGNLG-FT again improves over FT-GPT (generation examples are provided in Table 16 in Appendix E). Both FT-GPT and SC-GPT are prone to omitting important slots. Compared to SC-GPT, FT-GPT tends to over-generate and introduce hallucinations. In contrast, AUGNLG-FT and AUGNLG-SC manage to generate fluent, natural text while precisely reflecting the input MR.
We further examined 70 randomly sampled utterances generated by AUGNLG-SC whose BLEU scores are lower than those generated by SC-GPT, in the "Hotel", "Train" and "Taxi" domains, to understand the potential factors behind the lower BLEU scores. We found that the lower scores are mainly driven by BLEU penalizing semantically correct paraphrases, due to the nature of BLEU only checking surface-level matches. Examples of such penalization are provided in Table 15 in Appendix E. Only 7 out of the 70 manually checked examples generated by AUGNLG-SC are actually worse than SC-GPT. (We also examined 70 randomly sampled utterances generated by AUGNLG-SC whose BLEU scores are equal to or higher than those generated by SC-GPT.)
In sum, the results (1) verify the effectiveness of complementing existing transfer learning methods with our novel data augmentation approach; (2) reveal that automatically augmented MR-to-Text data alone can lead to a competitive performance, previously only achieved with existing MR-to-Text data; since existing MR-to-Text data is not a scalable data source, our approach brings more practical value to real-world applications; and (3) indicate that leveraging augmented MR-to-Text data on top of existing MR-to-Text data yields a new SOTA performance on the benchmark test.

In-depth Analysis
In this section, we provide comprehensive analysis results on the key components and parameters of our system to gain detailed insights: (1) intrinsic evaluation on augmented data, (2) influence of NLU quality, and (3) performance trends over varying amounts of augmented data.

Intrinsic Evaluation on Augmented Data
For an intrinsic evaluation of the augmented data, we first introduce four metrics:
• MR coverage (MR Cov.) evaluates the coverage of the delexicalized MRs of the test set in the augmented set:

$\mathrm{MR\ Cov.} = \frac{|\mathcal{A}' \cap \mathcal{A}_{test}|}{|\mathcal{A}_{test}|}$

where $\mathcal{A}'$ and $\mathcal{A}_{test}$ denote the sets of delexicalized MRs in the augmented set and the test set, respectively. Higher MR Cov. values indicate that more delexicalized MRs of the test set appear in the augmented set.
• Slot coverage (SL Cov.) evaluates the coverage of slot keys of the test set in the augmented set.
• Language model perplexity (PPL) is the perplexity of augmented utterances calculated by a GPT-2 language model fine-tuned on the test set.
Lower PPL values indicate that the distribution of augmented utterances is close to that of the test utterances.
• Average n-gram novelty (Nvt.) measures the fraction of the n-grams in the test set that do not appear in the augmented set:

$\mathrm{Nvt.}_n = \frac{|G_n(\mathcal{X}_{test}) \setminus G_n(\mathcal{X}')|}{|G_n(\mathcal{X}_{test})|}$

where $G_n(\cdot)$ denotes the set of n-grams appearing in a set of utterances, and $\mathcal{X}'$ and $\mathcal{X}_{test}$ denote the utterances in the augmented set and the test set, respectively. Lower Nvt. values indicate that more n-grams of the test set appear in the augmented set. We consider 1-grams to 4-grams and report the average value.
The results of MR Cov. / SL Cov. on FEWSHOTWOZ and FEWSHOTSGD are shown in Table 5 and Table 6, respectively. SL Cov. achieves 70% in most domains on both datasets, while MR Cov. has a wide range of values across domains. Notably, Table 6 strongly correlates with Table 3: the "Banks" and "Media" domains are worse than the other domains in both coverage metrics and NLG performance, whereas the "Restaurants" and "Events" domains are better than the others in both respects. Although we do not see the same pattern on FEWSHOTWOZ, this could be attributed to the large variance in the number of delexicalized MRs in each domain (see Table 2 in (Peng et al., 2020)).
The results of PPL and Nvt. on FEWSHOTWOZ are shown in Table 7, where we compare the augmented data (AUG) with the existing MR-to-Text data (EXIST). The top section shows that AUG achieves lower PPL values than EXIST in all seven domains. The bottom section shows that AUG also achieves lower Nvt. values in most domains. However, in the "Train" and "Taxi" domains EXIST attains lower novelty values, which matches the results in Table 2, where SC-GPT outperforms AUGNLG-SC in these two domains (the corresponding results on FEWSHOTSGD are shown in Table 12).

Generation examples in the "Restaurant", "Laptop", "TV" and "Attraction" domains of FEWSHOTWOZ ("% miss {...}" marks slots omitted by the model):

Domain: Restaurant
Input MR: inform(name=marlowe; goodformeal=dinner; area=mission bay)
Reference: marlowe serves dinner in the mission bay area.
FT-GPT: there is a restaurant marlowe in the mission bay area called dinner, it is good for dinner, a good area.
SC-GPT: marlowe is a good restaurant in mission bay that serves dinner.
AUGNLG-FT: marlowe is a good restaurant that serves good dinner in mission bay.
AUGNLG-SC: marlowe is good for dinner in mission bay.

Domain: Laptop
Input MR: inform(name=tecra proteus 23; type=laptop; battery=9 hour; memory=4 gb; isforbusiness=false)
Reference: the tecra proteus 23 laptop has a 9 hour battery life with 4 gb memory but is not for business computing.
FT-GPT: tecra proteus 23 laptop with 9 hour battery and 4 gb memory. % miss {isforbusiness=false}
SC-GPT: tecra proteus 23 laptop with 4 gb memory and 9 hour battery. % miss {isforbusiness=false}
AUGNLG-FT: the tecra proteus 23 laptop has a 9 hour battery, 4 gb of memory, and is not for business computing.
AUGNLG-SC: tecra proteus 23 laptop - 9 hour battery - 4 gb memory - not used for business computing.

Domain: TV
Input MR: inform(name=pontus 45; type=tv; ecorating=a+; screensizerange=medium; hdmiport=2)
Reference: the pontus 45 tv has a a+ eco rating and a medium screen size and 2 hdmi ports.
FT-GPT: the pontus 45 tv has 2 hdmi ports, a medium screen size, and 2 screensize. % miss {ecorating=a+}
SC-GPT: the pontus 45 tv has a medium screen size and 2 hdmi ports. % miss {ecorating=a+}
AUGNLG-FT: the pontus 45 tv has a+ eco rating, 2 hdmi ports, and a medium screen size.
AUGNLG-SC: the pontus 45 tv has a medium screen size of 2 hdmi ports and a+ eco rating.

Domain: Attraction
Input MR: inform(area=north; price=no entrance fee; type=park; phone=01223)
Reference: yes, it's a park located in the north area of town and has no entrance fee. the phone number is 01223.
FT-GPT: yes, there is a free entrance fee of £1. phone 01223 for more information. % miss {area=north}
SC-GPT: no problem. 01223 is the phone number and the park is north. the entrance fee is free.
AUGNLG-FT: yes, the entrance fee is no entrance fee. the park is in the north. phone is 01223.
AUGNLG-SC: yes, the park is in the north. no entrance fee. phone number 01223.

Influence of NLU
Few-shot NLU performance Since few-shot NLU models are a key component of our system, we report their performance in F1 score. For each domain, we evaluate the few-shot NLU model on the Text-to-MR test set, prepared in Section 4.2.
The average F1 scores over all domains on FEWSHOTWOZ and FEWSHOTSGD are 0.77 and 0.68, respectively. A further breakdown by domain is provided in Table 13 and Table 14 in Appendix D.
Influence of NLU Quality. The mediocre NLU performance on FEWSHOTSGD raises the following research question: can better NLU models boost NLG performance? To answer this question, we select four domains from FEWSHOTSGD with relatively low NLU performance: "Buses" (0.63), "Flights" (0.74), "Movies" (0.44) and "Ridesharing" (0.63). In each domain, we construct a new test set by randomly sampling 500 MR-to-Text pairs from the original test set, and take the rest as the NLU training pool. To obtain NLU models of varying quality, we train a set of models while varying the amount of training data with stratified sampling. The top row in Figure 4 clearly shows that the F1 score increases in proportion to the training size, reaching 0.95 in all four domains. We then annotate the augmented utterances with the different NLU models and pre-train NLG models on the augmented MR-to-Text data updated with the new MR labels. Finally, we fine-tune the NLG models on the in-domain training set $\mathcal{D}$ and evaluate them on the newly constructed 500-example test sets. The bottom row in Figure 4 confirms a general proportional relationship between NLU and NLG performance.

Table 8: BLEU scores for FT-GPT (FT) and AUGNLG-FT (AUG) with different training sizes (50, 100, 200, 500, 1500). "Bu", "Fl", "Mo" and "Ri" are short for the domain names "Buses", "Flights", "Movies" and "Ridesharing". All experiments are repeated 5 times with different samples.

Varying Amounts of Augmentation
Lastly, we investigate the relationship between the amount of in-domain ground-truth data and the effect of augmentation. As in the previous section, we build new test sets by randomly taking 500 examples and vary the size of the training set used to train both the NLU and NLG models. Table 8 shows that, in all four domains, the performance gap between AUGNLG-FT and FT-GPT is largest with the smallest training set and gradually diminishes as more training data becomes available.

Conclusion
In this paper, we proposed AUGNLG, a novel data augmentation approach that combines a self-trained retrieval model with a few-shot learned NLU model to automatically create MR-to-Text data from open-domain texts. Experimental results verify the effectiveness of our approach, establishing new SOTA performance on two benchmark tests. More importantly, we showed how our approach complements the previous SOTA approach, which hinges on unscalable data sources, with virtually unlimited open-domain data. Future work includes (1) technical innovations on each component of our system for further performance improvements, and (2) exploring self-training on the NLU side as well, to evolve both the NLU and NLG models at the same time.

A The calculation of TF-IDF
To calculate the TF-IDF score for an n-gram phrase, we take all in-domain texts $\mathcal{X}$ as one document $d$ to calculate the TF (Term Frequency) score, and randomly selected open-domain texts as the set of documents $D$ to calculate the IDF (Inverse Document Frequency) score 10 . Thus, we formulate the TF-IDF score for an n-gram phrase $ph_i$ as:

$\mathrm{TF\text{-}IDF}(ph_i) = \frac{freq(ph_i, d)}{\sum_{ph_j \in d} freq(ph_j, d)} \cdot \log\frac{|D|}{|\{d' \in D : ph_i \in d'\}|}$

in which $freq(ph_i, d)$ denotes the raw count of occurrences of the phrase $ph_i$ in the document $d$.
B The structure of the BERT-based NLU annotation

E Generation Examples
Domain: Hotel
Input MR: inform(name=parker guest house; area=mastro; address=520 church street; phone=4156213222)
Reference: parker guest house is near mastro at 520 church street. their phone number is 4156213222.
SC-GPT: the parker guest house is located in mastro and is at 520 church street. the phone number is 4156213222.
AUGNLG-SC: the parker guest house in mastro's address is 520 church street. the phone number is 4156213222.

Domain: Train
Input MR: inform(day=sunday; id=tr5413; arriveby=00:04) @ book(none)
Reference: train tr5413 arrives at 24:44 or 00:04 on sunday. would you like to book it now?
SC-GPT: tr5413 arrives at 00:04. would you like me to book it for you?
AUGNLG-SC: tr5413 arrives at 00:04 on sunday. would you like to book?

Domain: Taxi
Input MR: inform(arriveby=16:00)
Reference: yes, the taxi is due to arrive at 16:00.
SC-GPT: yes, the taxi will arrive by 16:00.
AUGNLG-SC: i'm on it, the taxi will arrive by 16:00.

Table 15: Generation examples in the "Hotel", "Train" and "Taxi" domains on FEWSHOTWOZ. In each example, the BLEU score of the SC-GPT generation is higher than that of the AUGNLG-SC generation.