Text-to-Text Pre-Training for Data-to-Text Tasks

We study the pre-train + fine-tune strategy for data-to-text tasks. Our experiments indicate that text-to-text pre-training in the form of T5 (Raffel et al., 2019) enables simple, end-to-end transformer-based models to outperform pipelined neural architectures tailored for data-to-text generation, as well as alternatives such as BERT and GPT-2. Importantly, T5 pre-training leads to better generalization, as evidenced by large improvements on out-of-domain test sets. We hope our work serves as a useful baseline for future research, as transfer learning becomes ever more prevalent for data-to-text tasks.


Introduction
Natural language generation from structured data, or data-to-text (Kukich, 1983; McKeown, 1985), is the task of generating a textual description conditioned on source content provided in the form of structured data such as a table or graph. Example applications include task-oriented dialog (Wen et al., 2015) and creating summaries from weather forecasts (Sripada et al.).
In this work we study the applicability of large-scale transfer learning for this task. We use the term "pre-train + fine-tune" to refer to the paradigm of first pre-training a high capacity model on massive text corpora before fine-tuning it on a downstream task. Our study shows that this form of transfer learning, which is now ubiquitous in many areas of NLP (Devlin et al., 2018), works well for text generation from structured data as well. In particular, we focus on pre-training in the form of the "Text-to-Text Transfer Transformer" (T5) models released by Raffel et al. (2019).
Fine-tuning T5 achieves state-of-the-art results on three diverse benchmarks spanning task-oriented dialogue (MultiWoz), tables-to-text (ToTTo) and graph-to-text (WebNLG). Empirical results further suggest the following:
• Transfer learning greatly improves the robustness of models to out-of-domain inputs.
• By leveraging pre-training, a single end-to-end model can outperform sophisticated, multi-stage pipelined approaches.
• With the addition of pre-training, simple transformer (Vaswani et al., 2017) models exceed the performance of more exotic architectures (e.g., pointer networks, graph neural networks) specifically tailored for data-to-text generation.
Our approach is simple, only scratching the surface of what is possible. There is much to be explored in leveraging unlabelled data and developing unsupervised objectives that are more tailored to generating text from structured data. We hope our work serves as a useful baseline for future research, as pre-training becomes ever more prevalent for this task.

Related Work
Transfer Learning Devlin et al. (2018) and Howard and Ruder (2018) showed that unsupervised pre-training can greatly benefit tasks like text classification, question answering, and summarization. In particular, Raffel et al. (2019) perform a large-scale study of different training objectives, model capacities, and dataset sizes. Peng et al. (2020) and Chen et al. (2019b) show that pre-training in the form of GPT-2 can improve performance on data-to-text tasks as well. Our experiments show that pre-training with T5, where both encoder and decoder are trained using a span masking objective, performs significantly better than alternatives such as BERT (encoder-only pre-training) and GPT-2 (decoder-only language modeling). Some works have also studied pre-training via supervised objectives, such as machine translation (Siddhant et al., 2019; Kale and Roy, 2020) and reading comprehension (Khashabi et al., 2020).
Data-to-Text Early work on data-to-text focused on rule-based, pipelined methods, while recent work has adopted neural approaches. Wen et al. (2015) proposed the Semantically Controlled LSTM and were among the first to show that neural networks can be successfully applied to this problem. Liu et al. (2018) proposed a structure-aware seq2seq model for table-to-text generation.

Pre-training
We rely on the T5 pre-trained models released by Raffel et al. (2019). They consist of a transformer-based encoder-decoder architecture. These models were pre-trained in a multitask fashion with an unsupervised "span masking" objective on the C4 dataset as well as supervised translation, summarization, classification, and question answering tasks. Note that none of the supervised tasks involve language generation from structured data. Disentangling the effects of the unsupervised and supervised objectives is an interesting area for future work.
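To make the unsupervised objective concrete, the sketch below illustrates the style of span-corruption example T5 is trained on: contiguous spans of the input are replaced with sentinel tokens and the decoder reconstructs them in order. The sentence and span choices are invented for illustration, not drawn from C4.

```python
# Illustrative example of T5-style "span masking" (span corruption).
# Contiguous spans of the input are replaced by sentinel tokens, and the
# target reproduces the dropped spans, each preceded by its sentinel.
# The sentence and the masked spans below are made up for illustration.

original = "the quick brown fox jumps over the lazy dog"

# Encoder input: masked spans replaced with <extra_id_0>, <extra_id_1>, ...
encoder_input = "the quick <extra_id_0> jumps over <extra_id_1> dog"

# Decoder target: each sentinel followed by the span it replaced,
# terminated by a final sentinel.
decoder_target = "<extra_id_0> brown fox <extra_id_1> the lazy <extra_id_2>"
```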

Fine-tuning
Our modeling approach is simple. The data-to-text task is cast in the text-to-text framework by representing the structured data as a flat string (linearization). Figure 1 shows examples of the input representation for each dataset.
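Since Figure 1 is not reproduced here, the sketch below only illustrates the general idea of linearization: each structured input is flattened into a single string before being passed to the text-to-text model. The separators, field names, and example values are hypothetical stand-ins, not the exact formats used in the paper.

```python
# Hypothetical linearizations of two of the structured input types.
# The concrete formats in the paper follow Figure 1; the separators and
# field names here are illustrative only. ToTTo tables can be flattened
# in a similar way (e.g. page title, section title, then highlighted cells).

def linearize_multiwoz(acts):
    """Flatten MultiWoz-style dialog acts with slot key-value pairs."""
    parts = []
    for act, slots in acts:
        slot_str = " , ".join(f"{key} = {value}" for key, value in slots)
        parts.append(f"{act} ( {slot_str} )")
    return " ; ".join(parts)

def linearize_webnlg(triples):
    """Flatten WebNLG-style subject-predicate-object triples."""
    return " && ".join(f"{s} | {p} | {o}" for s, p, o in triples)

print(linearize_multiwoz([("inform", [("name", "alexander b&b"), ("area", "centre")])]))
# inform ( name = alexander b&b , area = centre )

print(linearize_webnlg([("Alan_Bean", "birthPlace", "Wheeler,_Texas")]))
# Alan_Bean | birthPlace | Wheeler,_Texas
```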
We then fine-tune T5 on the data-to-text corpus for a small number of steps. The maximum number of training steps is set to 5K for MultiWoz and WebNLG, while the larger ToTTo dataset is trained for 10K steps. All model parameters are updated in the fine-tuning process.
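As a minimal, hypothetical illustration of this recipe, the sketch below fine-tunes a public T5 checkpoint through the Hugging Face transformers library rather than the original T5 codebase; the checkpoint name, optimizer choice, and toy training pair are assumptions, while updating all parameters, the small number of steps, and the constant learning rate of 0.001 follow the setup described in this paper.

```python
# Minimal fine-tuning sketch (assumes the Hugging Face transformers library,
# not the original T5 codebase used in the paper). All model parameters are
# updated; the learning rate and step counts follow the setup described here.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# A single hypothetical (linearized input, target text) pair; real training
# data would be the linearized MultiWoz / ToTTo / WebNLG examples.
train_pairs = [
    ("inform ( name = alexander b&b , area = centre )",
     "The Alexander B&B is located in the centre of town."),
]

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.train()

# Constant learning rate of 0.001; the optimizer (AdamW) is an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

max_steps = 5_000  # 5K steps for MultiWoz and WebNLG, 10K for ToTTo
for step in range(max_steps):
    source, target = train_pairs[step % len(train_pairs)]
    batch = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss  # standard cross-entropy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```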

Experimental Setup
The T5 vocabulary consists of 32,000 sentencepieces. Following Raffel et al. (2019), models are fine-tuned with a constant learning rate of 0.001.
The best checkpoint is chosen based on the BLEU score on the development set. Decoding is done via greedy search. For model development, we compute BLEU (Papineni et al., 2002) scores using sacrebleu (Post, 2018). In the final evaluation, for each dataset we rely on the metrics used by prior work.
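Continuing the hypothetical sketch above, greedy decoding and corpus-level BLEU with sacrebleu could look as follows; the input and reference strings are placeholders.

```python
# Greedy decoding and BLEU scoring with sacrebleu, continuing the sketch above.
import sacrebleu

model.eval()
inputs = tokenizer("Alan_Bean | birthPlace | Wheeler,_Texas", return_tensors="pt")
# generate() performs greedy search when beam search and sampling are disabled.
output_ids = model.generate(**inputs, max_length=64, num_beams=1, do_sample=False)
hypothesis = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# sacrebleu expects a list of hypotheses and a list of reference streams,
# each stream aligned with the hypotheses.
references = [["Alan Bean was born in Wheeler, Texas."]]
bleu = sacrebleu.corpus_bleu([hypothesis], references)
print(round(bleu.score, 2))
```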

Datasets
We conduct experiments on three English datasets spanning a variety of domains.
• MultiWoz (Budzianowski et al., 2018) is a corpus of 10K human-human dialogs for developing task-oriented dialogue systems. For the NLG task, a meaning representation encapsulating system actions must be verbalized into a natural language response. The meaning representation consists of dialog acts (inform, request etc.) and a list of slot key-value pairs.
• ToTTo (Parikh et al., 2020) is a table-to-text dataset in which a Wikipedia table with a set of highlighted cells must be described in a single sentence.
• WebNLG (Gardent et al., 2017), where the task is to convert a graph of subject-predicate-object triples into a textual description.
Each dataset uses a different kind of structured data (tables, meaning representations, and graphs/triples). Table 1 lists the sizes of the three datasets and Figure 1 shows examples for each.

WebNLG
The evaluation is done using BLEU and METEOR (Lavie and Agarwal, 2007), similar to Ferreira et al. (2019). The test set is split into two parts: seen and unseen. The examples in the unseen set are drawn from domains not present in the training set. It also features roughly 100 relations not seen during training. Some of the baselines we compare with are:
• Melbourne, a neural encoder-decoder approach, which scored the highest in the automatic evaluation of the WebNLG challenge (Gardent et al., 2017). The model relies on delexicalization, where entities are replaced with placeholders.
• GTR-LSTM (Distiawan et al., 2018), which employs a graph based triple encoder.
• Step-by-Step (Moryossef et al., 2019), which splits the generation procedure into a planning stage followed by a neural generation stage.
• PlanEnc (Zhao et al., 2020), the current state-of-the-art system. It consists of a graph convolutional network based planning model which first predicts the order of the triples. This is followed by an LSTM with attention and a copy mechanism to generate the text. To train the planning model, the approach relies on extra annotations for the triple ordering. Such annotations can be expensive and time consuming to obtain, especially for large, complex inputs.
Results are reported in Table 2, for the overall test set as well as the seen and unseen splits. T5-Large performs the best on both BLEU and METEOR, improving over PlanEnc by 4.3 BLEU on the overall test set. It also displays excellent generalization to new domains and relations, with a 14 BLEU improvement on the unseen test set. The results indicate that with pre-training, end-to-end neural models can surpass sophisticated pipelined approaches.
All the T5 models perform well on the Seen test set. On the Unseen test set, T5-Small scores substantially lower, indicating that pre-training with large capacity models is required for out-of-domain generalization.

ToTTo
Following Parikh et al. (2020), BLEU and PARENT are employed as evaluation metrics for this table-to-text generation task. PARENT is a word-overlap based metric that measures the factual accuracy of generated text relative to the structured data. Dhingra et al. (2019) find that PARENT correlates better with human judgements of factual accuracy than other generation metrics like ROUGE (Lin, 2004) and METEOR. The following baseline models are compared:
• Content Planner (Puduppully et al., 2019), a seq2seq model with separate content planning and generation stages.
• BERT-to-BERT (Rothe et al., 2020), a transformer encoder-decoder model initialized with BERT.
Notably, ToTTo features a hidden test set, which is split into two halves: Overlap and Non-Overlap. The Non-Overlap test set features examples that are out-of-domain. A submission must be made to the leaderboard in order to obtain metrics on the test sets.
Results are reported in Table 3. Our only submission (based on T5-3B) achieves state-of-the-art results, improving upon the BERT-based baseline by 5.5 BLEU and 5.8 PARENT. Moreover, the model is more robust to out-of-domain tables, with larger improvements of 6.6 BLEU and 7.5 PARENT on the Non-Overlap test set. Table 4 reports results on the development set for the different T5 model sizes. T5-Base, which has roughly the same number of parameters as BERT-to-BERT, shows large improvements (+3.7 BLEU, +4.5 PARENT). Even T5-Small, which has 3x fewer parameters, performs better than BERT-to-BERT.

MultiWoz
Evaluation on MultiWoz is done using BLEU and SER (Slot Error Rate). SER is the fraction of examples where at least one slot value from the structured data is not expressed in the predicted response (a sketch of this computation appears after the baseline list below). The metric is noisy, since the comparison is done via exact match and does not cover all slots. We compare with the following baselines:
• HDSA (Chen et al., 2019a), Hierarchically Disentangled Self-Attention, a transformer based architecture that encodes the dialog acts into a multi-layer hierarchical graph, with disentangled attention heads modeling specific nodes in the dialog act graph.
• SC-GPT (Peng et al., 2020), a GPT-2 (345M parameters) model that is further pre-trained on a large data-to-text dialog corpus and finally fine-tuned on MultiWoz. This two-stage pre-training approach is currently state-of-the-art for MultiWoz.
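Based only on the definition given above, the sketch below shows one way such a slot error rate could be computed; the exact normalization and slot coverage rules of the MultiWoz evaluation scripts may differ.

```python
# Minimal Slot Error Rate (SER) sketch, per the definition above: the fraction
# of examples where at least one slot value from the structured data does not
# appear (by exact string match) in the predicted response. The official
# MultiWoz evaluation may normalize text differently and skip some slots.

def slot_error_rate(examples):
    """examples: list of (slot_values, predicted_response) pairs."""
    errors = 0
    for slot_values, response in examples:
        if any(value.lower() not in response.lower() for value in slot_values):
            errors += 1
    return errors / len(examples)

examples = [
    (["alexander b&b", "centre"], "The Alexander B&B is in the centre of town."),
    (["19:30"], "Your table is booked for half past seven."),  # paraphrase counted as an error
]
print(slot_error_rate(examples))  # 0.5
```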
Results are reported in Table 5. All T5 based models (including T5-Small, which has 5x fewer parameters) outperform SC-GPT by 4-5 BLEU without any in-domain pre-training. While the SER scores are slightly worse, upon manual inspection we found that the difference can largely be attributed to false positives arising from annotation inconsistencies in the dataset coupled with the exact-match constraint, which does not account for paraphrases.

Conclusion and Future Work
In this study we evaluated pre-training in the form of T5 for the data-to-text task. We found that it leads to state-of-the-art results, while greatly improving robustness to out-of-domain inputs. Though we focused on automatic metrics, corroborating our findings via human evaluation is an important next step. In the future, we also hope to design unsupervised pre-training objectives that are specifically tailored for the data-to-text task.