Neural Data-to-Text Generation with LM-based Text Augmentation

For many new application domains for data-to-text generation, the main obstacle in training neural models is the lack of training data. While usually large numbers of instances are available on the data side, often only very few text samples are available. To address this problem, we here propose a novel few-shot approach for this setting. Our approach automatically augments the data available for training by (i) generating new text samples based on replacing specific values with alternative ones from the same category, (ii) generating new text samples based on GPT-2, and (iii) proposing an automatic method for pairing the new text samples with data samples. As the text augmentation can introduce noise to the training data, we use cycle consistency as an objective, in order to make sure that a given data sample can be correctly reconstructed after having been formulated as text (and that text samples can be reconstructed from data). On both the E2E and WebNLG benchmarks, we show that this weakly supervised training paradigm is able to outperform fully supervised sequence-to-sequence models with less than 10% of the training set. By utilizing all annotated data, our model can boost the performance of a standard sequence-to-sequence model by over 5 BLEU points, establishing a new state-of-the-art on both datasets.


Introduction
Neural data-to-text generation has been the subject of much recent research. The task aims at transforming source-side structured data into target-side natural language text (Reiter and Dale, 2000; Barzilay and Lapata, 2005). While neural end-to-end systems afford the advantage of easy adaptability (Lebret et al., 2016; Wiseman et al., 2017), huge amounts of data-text pairs are still necessary to perform on par with their rule-based counterparts (van der Lee et al., 2018). This makes using neural systems less appealing: oftentimes, in-domain text samples are not readily available, and there is a high cost to collecting in-domain texts which fit the data samples and annotating these texts with the data labels; this cost might hence even outweigh the effort of designing a rule-based system (Gkatzia, 2016). The goal of this work is to improve the performance of neural data-to-text models in scenarios where only very few text samples exist (we assume that these text samples are paired with corresponding data samples). We aim to answer how we can make the most of the scarce annotations, together with large amounts of unlabelled data, in order to push the limits of neural data-to-text models. Figure 1 illustrates the scenario.
To address the limited-data challenge, we propose a simple yet effective way of augmenting the text side with the pretrained language model (LM) GPT-2 (Radford et al., 2019). Unlike other text augmentation work employed in data-to-text generation systems (Freitag and Roy, 2018; Agarwal et al., 2018), our proposal assumes little to no domain-dependent heuristics. It consists of two steps: (1) information augmentation by slot-value replacement and (2) LM augmentation by GPT-2 generation.
Once we have augmented the set of text samples, we are essentially in a similar setting as previously proposed semi-supervised approaches to data-to-text generation (Schmitt and Schütze, 2019; Qader et al., 2019; Su et al., 2020), which assume the presence of vast amounts of unpaired data and text instances. These approaches exploit a cycle consistency objective in order to learn a pairing for the data samples. The cycle consistency objective tries to make sure that data samples can be reconstructed correctly from their textual formulations, and similarly that texts can be reconstructed after having been parsed into a data representation.
As the automatically generated text samples from GPT-2 might be very noisy and not pair well with data samples, we align each augmented text sample with its most similar unlabeled data sample, as defined in their encoded vector space. This idea is inspired by recent work on representation matching in MT (Artetxe and Schwenk, 2019; Ruiter et al., 2019). To ensure good quality of the training data, only pairs above a certain similarity threshold are retained as pseudo pairs for training. The quality of the pseudo pairs gradually improves as the encoder improves during training. In return, the learning of the encoder is also facilitated by the improved quality of the pseudo pairs, in a virtuous cycle.
On two data-to-text benchmarks, E2E (Novikova et al., 2017) and WebNLG (Gardent et al., 2017), we show that our LM-augmented weakly supervised model succeeds in outperforming a fully supervised seq2seq model while utilizing less than 10% of the data annotations. It even outperforms previous work which additionally has access to all unpaired text samples. When trained with full data annotations, it is able to boost model performance by up to 5 BLEU points, establishing a new state-of-the-art on both datasets.
In summary, this work makes the following contributions: 1. We study the few-shot data-to-text scenario where, unlike previous works, no further target-side text is available.
2. We present an effective way of automatically augmenting target text by resorting to the pretrained LM GPT-2.
3. We propose utilizing the augmented text through a combination of cycle consistency and representation matching. The resulting model outperforms a standard seq2seq model with less than 10% of the data annotations.
4. The proposed model is shown to be complementary with current seq2seq pretraining techniques, and can offer orthogonal improvements when combined with them.

Related Work
Building neural data-to-text systems with few paired samples (but a large set of unpaired samples) has been a hot research topic recently. Most works adopt the idea of cycle consistency (Zhu et al., 2017), which has been used in many text generation tasks like machine translation (Artetxe et al., 2017; Lample et al., 2017) and style transfer (Prabhumoye et al., 2018; Subramanian et al., 2018). Schmitt and Schütze (2019) and Chang et al. (2020, 2021a) applied this idea to the task of data-to-text generation and reported promising results. Ma et al. (2019) separate the generation process into few-shot content selection and surface realization components and learn them separately. Nonetheless, all of these approaches assume the existence of a huge quantity of unpaired text samples, which, as we mentioned, is an unrealistic assumption for the task of data-to-text generation. Freitag and Roy (2018) propose to reconstruct usable sequences re-written from data with rules for unsupervised data-to-text generation. Unfortunately, designing these rules requires effort similar to building a template-based system. Budzianowski and Vulić (2019) and Peng et al. (2020) tackle the few-shot challenge by finetuning a pretrained LM to incorporate prior knowledge from general-domain text or data-text pairs. We show that our technique is complementary with theirs and can offer orthogonal improvements when combined.

Besides the k annotated pairs, we assume access to a large set of unlabeled data samples D_U with |D_U| ≫ k > 0. This, we believe, is a more realistic setting, as unlabeled data are usually abundant and can also easily be fabricated from predefined schemata. Notably, we assume no access to outside resources containing in-domain text. The k annotations are all we know about the text side.

Approach
In this section, we first explain our proposed new method for text sample augmentation, and then discuss methods to remove noise and automatically align the data by elaborating on the ideas of cycle consistency and representation matching. Finally, we summarize the approach and present the detailed algorithm.

Text Augmentation
To mitigate the paucity of the set of text samples T , we propose a pipeline approach to augment the text samples by (1) information augmentation and (2) LM augmentation.

Information Augmentation
We generate additional text samples by performing slot-value replacements. As many data values are copied verbatim into the text samples, this copied information can easily be detected and replaced with other values (of the same slot type) to enrich the information space of the text samples. This can be considered a simplified version of traditional template mining methods, where keywords are extracted to construct templates (Kondadadi et al., 2013; Oya et al., 2014). An example is shown in Figure 2. Each text sample is augmented into up to 10 additional distinct text samples, or fewer if all possible value replacements are exhausted first.
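The replacement step can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; the function name, the slot inventory and the sampling budget are assumptions.

```python
import random

def augment_by_slot_replacement(data, text, slot_values, n_aug=10, seed=0):
    """Create new (data, text) pairs by swapping copied slot values.

    data: dict mapping slot -> value (values assumed to appear verbatim in text)
    slot_values: dict mapping slot -> list of alternative values for that slot
    Returns up to n_aug distinct augmented (data, text) pairs.
    """
    rng = random.Random(seed)
    augmented, seen = [], {text}
    for _ in range(100):  # sampling budget
        if len(augmented) >= n_aug:
            break
        new_data, new_text = dict(data), text
        for slot, value in data.items():
            # Only values copied verbatim into the text can be replaced.
            if value in new_text and slot in slot_values:
                alt = rng.choice(slot_values[slot])
                new_data[slot] = alt
                new_text = new_text.replace(value, alt)
        if new_text not in seen:  # keep only distinct new samples
            seen.add(new_text)
            augmented.append((new_data, new_text))
    return augmented
```

For instance, starting from the pair ({"name": "Blue Spice", "area": "riverside"}, "Blue Spice is located in riverside.") and an alternative-value table, the function yields new pairs with the name and area swapped consistently in both the data and the text.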
The slot-value replacement is efficient to implement. However, it can only detect identical values and augment text with the same combinatorial patterns as the few-shot annotations. To enrich the linguistic realizations of the text and enable new combinations of information, we further propose an LM augmentation approach using GPT-2.

LM Augmentation
GPT-2 (Radford et al., 2019) is a language model pretrained on the collected WebText. It has demonstrated remarkable zero-shot multitask adaptability by simply feeding the input of each task into the LM and continuing to generate words. GPT-2 has also been shown to improve classification tasks via in-domain text augmentation (Papanikolaou and Pierleoni, 2020; Sun et al., 2020). We use a similar technique by first finetuning GPT-2 on the few-shot annotations (Wolf et al., 2019), and then applying it to produce synthetic text through an iterative conditional generation process: with the initial seeds being samples of T_L plus new samples from information augmentation, the LM iteratively conditions on the previous output sentence to generate in-domain text. Each synthetic sentence is pruned if it (1) is shorter than 5 words or (2) contains only special tokens. The iterative generation is terminated when all tokens in the initial seeds are covered or when the maximum of 100 runs is reached. All unpruned synthetic text samples are added to the space of T to benefit the learning directions t → d → t and t̃ → t. Figure 2 depicts the generation process of GPT-2.
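The iterative generation loop with pruning and the coverage-based stopping criterion can be sketched as below. The LM call is abstracted as a `generate_fn` callback (in practice, a finetuned GPT-2, e.g. wrapping `model.generate` from Hugging Face transformers); the function name and the special-token check being folded into the length filter are simplifying assumptions.

```python
def iterative_lm_augmentation(seeds, generate_fn, max_runs=100, min_len=5):
    """Iteratively generate in-domain text by conditioning on previous outputs.

    generate_fn(prompt) -> one continuation sentence (e.g. a finetuned GPT-2).
    Sentences shorter than min_len words are pruned. Generation stops once
    every token in the seeds is covered, or after max_runs iterations.
    """
    target_tokens = {tok for s in seeds for tok in s.split()}
    synthetic, covered = [], set()
    prev = seeds[0]
    for _ in range(max_runs):
        out = generate_fn(prev).strip()
        words = out.split()
        if len(words) >= min_len:      # prune degenerate outputs
            synthetic.append(out)
            covered |= set(words)
        prev = out if out else prev    # condition on the previous output
        if target_tokens <= covered:   # all seed tokens covered: stop
            break
    return synthetic
```

Because each call conditions on the previous output, the loop drifts through the in-domain text space rather than paraphrasing a single seed.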
In practice, obtaining perfectly clean in-domain text would require extensive effort in designing heuristic rules. Nonetheless, the synthetic text from GPT-2 is generally sensible and can already provide useful signals to drive the learning process.

Cycle Consistency
The core idea of encouraging cycle consistency is that, starting from one sample in a domain, the model first maps it into the other domain, then maps it back (He et al., 2016). The resulting sample should be identical to the original sample. Specifically, let p_θ(t|d) be the probability distribution that maps a data sample d to its corresponding text t, and p_φ(d|t) be the probability distribution that maps text back to data. Starting from a data sample d ∈ D, the objective is:

L_d(φ) = E_{t ∼ p_θ(t|d)} [−log p_φ(d|t)],   (Eq. 1)

which ensures consistency in the direction d → t → d. Note that only p_φ is updated in this direction; p_θ serves only as an auxiliary function to provide pseudo samples t from d.
Though it is also possible to update θ at the same time through tricks like Gumbel-softmax (Jang et al., 2016) or REINFORCE (Williams, 1992), we find this did not lead to better performance, yet complicated the training. Similar observations have been made in related work.

[Figure 2: GPT-2 generates synthetic text from data-like seeds (e.g. "Blue Spice is located in Riverside and has a price range of less than £20."), which is then matched with unlabeled data samples by similarity score to form pseudo data-text pairs.]

Similarly, starting from a text t ∈ T, the objective ensures consistency in the direction t → d → t:

L_t(θ) = E_{d ∼ p_φ(d|t)} [−log p_θ(t|d)].   (Eq. 2)

Finally, we further add two denoising autoencoding objectives on the data and text sides:

L_ae(θ, φ) = E_d [−log p_φ(d|d̃)] + E_t [−log p_θ(t|t̃)],   (Eq. 3)

where d̃ and t̃ are corrupted versions of d and t. We use the same noise function as Lample et al. (2018), which randomly permutes and drops a portion of the input. This encourages the encoder to learn meaningful latent representations by reconstructing the input itself (Currey et al., 2017; Lample et al., 2018). Figure 3 illustrates all four directions of the cycle consistency objective.
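The corruption used for the denoising objectives can be sketched as below, in the spirit of Lample et al. (2018): drop a fraction of tokens, then apply a local permutation. The exact probabilities and window size here are assumptions, not the paper's reported settings.

```python
import random

def corrupt(tokens, drop_prob=0.1, shuffle_dist=3, seed=0):
    """Corrupt a token sequence for denoising autoencoding.

    Drops each token with probability drop_prob, then applies a local
    shuffle in which each token moves at most ~shuffle_dist positions.
    """
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > drop_prob]
    # Local shuffle: sort by original index plus a bounded random offset.
    keys = [i + rng.uniform(0, shuffle_dist) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]
```

Reconstructing the clean sequence from such a corrupted input forces the encoder to capture content rather than learn an identity copy.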
We use one shared encoder Enc for both the data and text sides. Each data sample is flattened into a sequence by making a list of slot value pairs and fed into the same encoder. Using the same encoder for both types of input gives the model an inductive bias to project similar data/text into surrounding latent space.
We will show later that encoder sharing is essential for good performance under the few-shot scenario. From the shared encoded space, two separate decoders Dec_d and Dec_t are used to decode d and t, respectively.

Representation Matching
Apart from training under the cycle consistency, we further consider matching each synthetic text with its most similar data sample and treating them as supplementary training pairs. Compared with the pseudo d obtained from back translation (Eq. 2), the matched data samples are extracted from the existing corpus D_U and are thereby guaranteed to be clean. This provides a much more stable training signal, especially at the initial training stage. Previous work has used representation matching to automatically extract pseudo training pairs for machine translation (Artetxe and Schwenk, 2019; Ruiter et al., 2019). Baziotis et al. (2019) and Chu and Liu (2019) also demonstrate that the representation similarity between input-output pairs can serve as a useful regularization for unsupervised text summarization. We adopt a similar idea to create pseudo pairs based on their cosine similarity in the representation space. To summarize, the process of representation matching can be described as:

(t, d*) with d* = argmax_{d ∈ D_U} cos(Enc(t), Enc(d)), retained iff 1[cos(Enc(t), Enc(d*)) > ε],   (Eq. 4)

where t ranges over the augmented text from the LM and 1 is the indicator function. We perform mean pooling over the encoded representations before matching them. ε is a threshold: pseudo pairs with a cosine similarity less than ε are discarded. Ideally, as the encoder improves, the pseudo pairs created by representation matching will make more sense, which in turn benefits the training of the encoder.
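A minimal, dependency-free sketch of the matching step (mean pooling over encoder states, cosine similarity, nearest-neighbour retrieval with a threshold). The function names and the example threshold are illustrative assumptions.

```python
import math

def mean_pool(states):
    """Average a list of hidden-state vectors into one sentence vector."""
    dim = len(states[0])
    return [sum(v[i] for v in states) / len(states) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_pairs(text_vecs, data_vecs, eps=0.5):
    """Pair each synthetic text with its nearest unlabeled data sample.

    text_vecs / data_vecs: mean-pooled encoder representations.
    Returns (text_idx, data_idx) pairs whose similarity exceeds eps.
    """
    pairs = []
    for ti, tv in enumerate(text_vecs):
        sims = [cosine(tv, dv) for dv in data_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] > eps:           # discard low-confidence matches
            pairs.append((ti, best))
    return pairs
```

Because the encoder is shared between data and text, both sides live in the same vector space, which is what makes this nearest-neighbour pairing meaningful.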

Summary
Apart from the above unsupervised objectives, on the few annotated data-text pairs we can impose the supervised objective:

L_sup(θ, φ) = Σ_{(d,t) ∈ (D_L, T_L)} [−log p_θ(t|d) − log p_φ(d|t)],   (Eq. 5)

where (D_L, T_L) contains the k data annotations. Putting it all together, we summarize the approach in Algorithm 1. In the training stage, we optimize the objectives of cycle consistency, representation matching and supervised learning sequentially, to maintain a constant ratio of signals from all sides.
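The overall training loop can be sketched as the following skeleton, a paraphrase of Algorithm 1 rather than the actual fairseq implementation; the `model` interface (loss and step methods) is an assumption introduced for illustration.

```python
def train(model, labeled_pairs, unlabeled_data, synthetic_text,
          steps=8000, eps=0.5):
    """Skeleton of the joint training loop (Algorithm 1, paraphrased).

    Each step optimizes cycle consistency, representation matching and the
    supervised loss in turn, keeping a constant ratio of training signals.
    """
    history = []
    for step in range(steps):
        losses = {
            # d -> t -> d, t -> d -> t, plus denoising autoencoding
            "cycle": model.cycle_loss(unlabeled_data, synthetic_text),
            # pseudo pairs above the similarity threshold eps
            "match": model.matching_loss(synthetic_text, unlabeled_data, eps),
            # the k annotated data-text pairs
            "supervised": model.supervised_loss(labeled_pairs),
        }
        for name, loss in losses.items():
            model.step(loss)           # one optimizer update per objective
        history.append(losses)
    return history
```

Interleaving the three updates in every step, rather than alternating phases, is what keeps the supervised signal from being drowned out by the much larger pool of pseudo pairs.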

Experiment Setting
Data We conduct experiments on the E2E (Novikova et al., 2017) and WebNLG (Colin et al., 2016) datasets. E2E is a crowd-sourced dataset containing 50k instances in the restaurant domain. The inputs are dialogue acts consisting of three to eight slot-value pairs. WebNLG contains 25k instances describing entities belonging to fifteen distinct DBpedia categories. The inputs are up to seven RDF triples of the form (subject, relation, object).

Configuration
The model is implemented based on fairseq (Ott et al., 2019). We use 600-dimensional token embeddings and the Adam optimizer with an initial learning rate of 0.0002. Batch size is kept at 48 with a dropout rate of 0.3. We employ beam search of size 3 for decoding and select models based on BLEU-4 scores on the development set. Each score is averaged over 10 runs with random initialization. In this work, the seq2seq models are built upon the long short-term memory (LSTM) architecture (Hochreiter and Schmidhuber, 1997). Both the encoder and decoder have 3 LSTM layers, amounting to 18M parameters for the seq2seq model (600-dimensional embeddings and 1024 hidden units). Maximum sequence length is set to 100 for E2E and 200 for WebNLG (SPM-based). All encoder parameters are shared between data and text samples. All models were trained on one Nvidia V100 GPU (32GB, CUDA 10.2) for 4k steps. The total batch size is around 48K tokens per GPU and we use the Adam optimizer (ε = 1e−6, β2 = 0.98) along with a linear learning rate decay schedule. The total number of updates is set to 8000 for all training runs and models are selected based on optimal validation BLEU-4. At decoding time, sentences are generated using greedy decoding.

Results and Analysis
In this section, we present experimental results and analysis. We first compare our model with other baselines on both datasets, then perform a set of ablation studies on the E2E dataset to see the effects of each component. Finally, we analyze how text augmentation helps improve the model, include example outputs, and show the human evaluation results.

Comparison with Other Models In Table 1, we compare our model with (1) a seq2seq baseline, (2) the cycle consistency model of Qader et al. (2019) (the authors did not open-source their code; we reproduced their model based on our implementation, and our results on 10k annotations match those reported in their paper) and (3) the finetuned GPT-2 model of Chen et al. (2020) (https://github.com/czyssrs/Few-Shot-NLG). For all models, we try running with 10% and 100% of the annotations to see how they perform under different data sizes. Our model is implemented with LSTM encoder-decoders, the same as the seq2seq baseline, for a fair comparison. Note that Qader et al. (2019) further utilized all the ground-truth unpaired text samples, while the other models run only on the few-shot annotations. We also include the results of SLUG (Juraska et al., 2018) and MELBOURNE (Gardent et al., 2017), the overall winners on automatic metrics in the E2E and WebNLG challenges, respectively (both seq2seq-based). SLUG uses a heuristic slot aligner based on a set of handcrafted rules and combines a complex pipeline of data augmentation, selection, model ensembling and reranking.

The results show that our proposed model significantly improves over the baseline in both the few-shot and fully supervised settings. The improvement is more evident when only 10% of the annotations are available, with a leap of 11 and 7 BLEU points on E2E and WebNLG respectively. It also outperforms systems relying on task-dependent heuristics. In comparison, Qader et al. (2019), though with access to all text samples at all percentages, still underperforms our model by a tangible margin. In the fully supervised setting, it brings little to no difference compared with the seq2seq baseline, as no extra data is incorporated in the training process. Likewise, we observe that the text augmentation from the finetuned GPT-2 model helps the proposed model in the few-shot setting, but its advantage also vanishes when all data annotations are available.
In Figure 4, we plot model performance with varying numbers of data annotations. All models are trained from scratch with 10 different random initializations and the standard deviation of the BLEU-4 score is visualized. Our model (LSTM-based), though with a relatively larger standard deviation due to the uncertainty of text augmentation sampling, still consistently and significantly outperforms the other baselines, and even surpasses the fully supervised seq2seq model with less than 10% of the data annotations.

Ablation Study on Cycle Consistency
In Table 2, we study how the four directions, input noise and parameter sharing affect the performance of the cycle consistency objective. The experiments are conducted with 10% of the annotations and no further unpaired text samples.
As can be observed, adding the training direction t → d → t (i.e. back translation) has little effect on top of the supervised seq2seq baseline. This is expected, since back translation is naturally designed to incorporate additional unpaired text samples; when run only on the few-shot annotations, its power is very limited. The backward direction d → t → d is surprisingly useful when the encoder is shared between the data and text. Though this direction does not affect the text decoder at all, the improvement suggests the model can benefit a lot simply by structuring its encoded space and mapping aligned data-text pairs to similar regions of the vector space. The autoencoding directions bring a small improvement; combined with input noise, the performance increases further. This matches previous findings that denoising autoencoding is more helpful in inducing a meaningful latent space (Lample et al., 2018) than simply learning to copy the original input.
The results also suggest encoder sharing is important for the cycle consistency objective to work in our few-shot setting. Decoder sharing, in contrast, has little or even a negative influence. This resembles multilingual machine translation, where sharing the decoder among languages can negatively interfere with performance (Johnson et al., 2017).

Ablation Study on Text Augmentation
On top of the four-direction cycle consistency training, we study the effects of text augmentation in Table 3. We compare our proposed info + LM augmentation with (1) random augmentation, where random text sequences are used as augmented samples, and (2) UDA augmentation, which paraphrases the available annotations. Random augmentation harms the model performance, suggesting that reasonable in-domain text augmentation is necessary for model improvement. UDA augmentation also makes rather little difference, as it simply paraphrases the currently available annotations but cannot bring in any new information. The information augmentation by slot-value replacement helps improve performance a bit. When combined with LM augmentation, the performance can be further boosted, especially in lower-resource scenarios. The representation matching always helps lift the performance, with gains of up to 10 BLEU points. As expected, the benefit from text augmentation gradually vanishes as more annotations are collected, especially for datasets with relatively simple patterns such as E2E.
How text augmentation helps Intuitively, the GPT-2 augmentation is expected to introduce new tokens and combination patterns beyond the few-shot annotations. To investigate whether this is the case, for the decoded text in the test phase we count the number of unique tokens (excluding copied data values) and unique information combination patterns (attribute combinations in E2E). The results in Fig. 6 show that LM augmentation indeed greatly enriches the vocabulary space, even doubling the number of unique generated tokens in low-resource scenarios. The same happens for new combination patterns. In contrast, info-aug cannot insert new tokens or combinations at all, since all it does is replace data values within the same text annotations. UDA can introduce new tokens by paraphrasing the annotations, but it hardly helps the model generalize to new combinations of information. Figure 5 shows a generation example with different text augmentation techniques. We train the systems in a toy setting with only 5 data annotations (Trainset in the Appendix) and pick an input with an unseen attribute combination to test whether the models generalize correctly. Seq2seq and info-aug produce wrong generations that overfit to the information in the 5 training instances. With LM augmentation, the model adapts to the new combination and connects the information correctly.
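The diversity statistics described above can be computed with a few lines; the function name and input format (generated text paired with its attribute names) are assumptions for illustration, not the paper's evaluation script.

```python
def diversity_stats(outputs, data_values):
    """Count unique non-copied tokens and attribute-combination patterns.

    outputs: list of (generated_text, attribute_names) pairs.
    data_values: values copied from the input data, excluded from the count.
    Returns (#unique non-copied tokens, #unique attribute combinations).
    """
    copied = set(data_values)
    tokens = {tok for text, _ in outputs for tok in text.split()} - copied
    patterns = {tuple(sorted(attrs)) for _, attrs in outputs}
    return len(tokens), len(patterns)
```

Excluding copied data values isolates the vocabulary the model itself contributes, so the token count reflects linguistic richness rather than input coverage.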
Human Evaluation We further run a human evaluation on the model outputs to closely check the generation quality. We compared four types of models: the seq2seq baseline, seq2seq plus cycle consistency as in Qader et al. (2019), finetuned GPT-2 as in Chen et al. (2020), and our proposed model. All models are LSTM-based apart from the finetuned GPT-2 one. We sample 100 data instances from the test set and apply all the models to generate corresponding text. The data and generated text are evaluated by 50 crowdworkers on Prolific. For each data-text pair, the annotator is instructed to evaluate (1) whether the text is fluent (score 0-5, with 5 being fully fluent), (2) whether it misses information contained in the source data and (3) whether it includes wrong information. The average fluency scores and the counts of missed and wrong information are presented in Table 4. The scores are generally consistent with the automatic evaluation results: our proposed model outperforms the others by a large margin, even though cycle-only can access all unpaired text and finetuned GPT-2 is significantly larger than our LSTM-based seq2seq.
The generated text is more fluent, while maintaining information completeness and correctness to a large extent.

Conclusion
We study few-shot data-to-text generation with only limited annotated data. We propose text augmentation with slot-value replacement followed by GPT-2 generation. The augmented text, when combined with cycle consistency and representation matching, is shown to help the model generalize to unseen new tokens and patterns of token combinations. With less than 10% of the annotations, our model outperforms a supervised seq2seq model trained on 100% of the annotations, and it is extensible enough to be combined with pretraining techniques.