Structure-Aware Pre-Training for Table-to-Text Generation

Table-to-text generation is a subtask of data-to-text generation which aims to generate natural language text based on an input table. Pre-training techniques have achieved great success on table-to-text generation. However, the pre-trained models used in previous works are typically trained on free-form natural language text, while the input of the table-to-text task is a structured table. In this paper, we propose STTP, a pre-trained model that is trained with tables and their contexts. The STTP model can understand the structured input table and generate fluent text. Experiments on two datasets show the efficacy of our model.


Introduction
Data-to-text generation (Reiter and Dale, 1997) is an important natural language generation task with many practical applications; it refers to the task of generating textual output from nonlinguistic input data. The input data can include tables of records, simulations of physical systems, spreadsheets, and so on. The output is a natural language text. Commonly used datasets include WEATHERGOV (Liang et al., 2009), ROTOWIRE (Wiseman et al., 2017), WebNLG (Gardent et al., 2017), and so on. Neural generation models with various improvements have achieved impressive results on the data-to-text task. Table-to-text generation is a subtask of data-to-text generation which takes tables as input.
The pretrain-and-finetune framework, which refers to first pre-training a high-capacity model on large corpora and then fine-tuning it on a downstream task, has outperformed the prior state of the art on both natural language understanding and natural language generation tasks. Inspired by the success of transfer learning, several recent works (Mager et al., 2020; Kale, 2020; Ribeiro et al., 2020) apply the pretrain-and-finetune framework to data-to-text generation. They fine-tune pre-trained models such as BART (Lewis et al., 2019) or T5 (Raffel et al., 2019) on several downstream data-to-text tasks and achieve state-of-the-art results.
Although transfer learning has achieved great success on data-to-text generation, the pre-trained models used in previous works are typically trained on free-form natural language text, while the input of the table-to-text task is a structured table. Text-to-text pre-trained models acquire rich knowledge and strong language modeling ability from large amounts of text, so they work well on the data-to-text generation task, but they still lack the ability to understand structured data. We therefore propose a structure-aware table-to-text pre-trained model, STTP, which is trained with tables and their contexts for the table-to-text task. STTP is built on top of the text-to-text pre-trained model BART, and it can understand a structured table and describe it with natural language text. We train our model based on BART because we hope our model can benefit from the knowledge and language model learned from large corpora. We propose three self-supervised tasks to train our model with a large number of tables and their contexts. The first self-supervised task is masked table language model (MTLM), which is analogous to the classic MLM objective of BERT. The second self-supervised task is adjacent cell prediction (ACP), which predicts the cells around the current cell. The third self-supervised task is context reconstruction (CR), which reconstructs the context of a table given the table and its corrupted context. The first two tasks aim to train the model to better understand the structured table, while the last task aims to align the table and text. We use tables extracted from the WDC Web Table Corpus (Lehmberg et al., 2016) and their contexts to train our model. Experimental results on the WEATHERGOV dataset and the WebNLG dataset show the efficacy of our model.
The main contributions of this work are:
• We propose a structure-aware table-to-text pre-trained model, STTP, which is trained with three self-supervised tasks.
• Experimental results on the WEATHERGOV dataset and the WebNLG dataset show the efficacy of our model. Code will be released at https://github.com/XingXinyu96/STTP.

Related Work
The data-to-text generation task takes structured data as input and generates text that describes this data. Traditional approaches (Stent et al., 2004; Walker et al., 2007) handle the task in two steps: selecting a subset of the input data to discuss and then performing surface realization of the selected content. More recent works combine both steps by learning content planning and surface realization jointly with end-to-end models (Wen et al., 2015; Peng et al., 2017). Although end-to-end models have achieved good results, many models (Perez-Beltrachini and Lapata, 2018; Puduppully et al., 2019) add content selection and content planning modules to the end-to-end framework to improve performance. Many other models with different improvements (Wiseman et al., 2018; Li and Wan, 2018; Roberti et al., 2019; Rebuffel et al., 2020) have been proposed to explore how to build an effective data-to-text generator. Inspired by the success of pre-trained models in other natural language generation tasks, Harkous et al. (2020), Kale (2020) and Ribeiro et al. (2020) achieve state-of-the-art results on different data-to-text benchmarks with different pre-trained models. However, the existing pre-trained models are usually designed to generate text based on textual input, thus lacking the ability to understand structured inputs. Several pre-training methods designed for table-related tasks have been proposed. Deng et al. (2020) present a weakly supervised Structure-Grounded pre-training framework (STRUG) for text-to-SQL that effectively learns to capture text-table alignment, but their model is only for the text-to-SQL task and needs parallel text-table data. Yin et al. (2020) propose TaBERT, a pre-trained model trained on a large number of tables with their contexts; their model is also used for the text-to-SQL task. Chen et al. (2020a) propose a knowledge-grounded pre-trained (KGPT) model which is trained on a massive knowledge-grounded text corpus crawled from the web. Li et al. (2020) propose two self-supervised tasks, Number Ordering and Significance Ordering, to help learn better table representations.

Approach
We use the same model architecture as BART and add several classification layers on top of the encoder for our new self-supervised tasks. We train our model based on the text-to-text pre-trained model BART instead of training from scratch because we hope our model can benefit from the knowledge and language model BART learned from large text corpora. The three self-supervised tasks are shown in Figure 1. The first two tasks, masked table language model (MTLM) and adjacent cell prediction (ACP), train the model to better understand the structured table; the third task, context reconstruction (CR), aims to align the representations of table and text. If we only trained the encoder with the first two tasks and did not consider the alignment of table and text, we might obtain a better representation of the input table, but that representation would be hard for the well-trained decoder provided by BART to interpret. Table-text alignment data are difficult to obtain, so we use tables with their text contexts (which are usually not strictly aligned with the table, but somewhat relevant to it) to help our model align table and text. We randomly mask 15% of the tokens in the context and reconstruct the corrupted context with our model. There is also a mismatch between pre-training and fine-tuning, since during fine-tuning the input of our model is usually just a table without its context. Therefore, we also train our model with a task that predicts the context based only on the input table. Since the table and its context are not exactly aligned, it is difficult to predict the context from the table alone, but this task helps our model mitigate the mismatch problem.
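As an illustration of the context corruption used for the CR task, the following is a minimal sketch assuming simple token-level masking with BART's <mask> token; the actual corruption scheme (e.g., whether spans rather than single tokens are masked) may differ.

```python
import random
from typing import List, Optional

def mask_context(tokens: List[str], mask_token: str = "<mask>", ratio: float = 0.15,
                 seed: Optional[int] = None) -> List[str]:
    """Randomly replace `ratio` of the context tokens with the mask token."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in positions else tok for i, tok in enumerate(tokens)]

# Example: corrupt a context sentence before asking the model to reconstruct it.
ctx = "Berlin is the capital and largest city of Germany".split()
print(" ".join(mask_context(ctx, seed=0)))
```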

Dataset for Pre-training
The first two tasks only need unsupervised tables, while the third task needs tables with context text. Yin et al. (2020) collect tables and their surrounding text from English Wikipedia and the WDC Web Table Corpus (Lehmberg et al., 2016) to train their model, and this data is also suitable for our tasks. We use the preprocessing tool they provide to process the WDC Web Table Corpus and obtain a large number of tables with context. We then filter out instances with a low matching degree between table and context, because such data are difficult to learn from for the third task. In addition, we filter out tables that contain too many numbers. Finally, we obtain 800k tables with context. We linearize the structure of the tables to make them compatible with our model. We insert a special token [cell] between cells in the same row and another special token [row] between the rows of the table. For the third task, the model input includes both text and table, so we concatenate them and insert a special token [TABLE] between the text and the table.
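To make the format concrete, below is a minimal sketch of this linearization; the exact whitespace around the separators and the order in which context and table are concatenated are assumptions for illustration.

```python
from typing import List, Optional

CELL, ROW, TABLE = "[cell]", "[row]", "[TABLE]"

def linearize_table(rows: List[List[str]]) -> str:
    """Join cells within a row with [cell] and join rows with [row]."""
    return f" {ROW} ".join(
        f" {CELL} ".join(cell.strip() for cell in row) for row in rows
    )

def build_input(rows: List[List[str]], context: Optional[str] = None) -> str:
    """For the context-reconstruction task, the (possibly masked) context is
    concatenated with the table, separated by the [TABLE] token."""
    table_str = linearize_table(rows)
    return f"{context} {TABLE} {table_str}" if context else table_str

# Example
rows = [["City", "Population"], ["Berlin", "3.6M"]]
print(build_input(rows, context="Berlin has a population of <mask>."))
# Berlin has a population of <mask>. [TABLE] City [cell] Population [row] Berlin [cell] 3.6M
```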

Pre-training Procedure
We train our model by alternating among the three self-supervised tasks described above. The first two tasks are only used to train the encoder while keeping the decoder frozen. In practice, the third task is divided into two subtasks: one reconstructs the context given the table and its corrupted context; the other predicts the context given only the table. Both subtasks train the encoder and decoder jointly. Since the latter subtask is very difficult (the context of a table often cannot be predicted from the table alone), we train on it less frequently.
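The following is a minimal sketch of this alternating procedure, assuming a BART-style encoder-decoder from the transformers library. The sampling weights are illustrative, and encoder_task_loss is a hypothetical stand-in for the MTLM/ACP classification heads, which are not shown.

```python
import random
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def set_decoder_trainable(trainable: bool) -> None:
    for p in model.model.decoder.parameters():
        p.requires_grad = trainable

# The harder context-prediction subtask is sampled less often (illustrative weights).
TASKS = ["mtlm", "acp", "reconstruct_context", "predict_context"]
WEIGHTS = [0.3, 0.3, 0.3, 0.1]

def training_step(source: str, target: str) -> None:
    task = random.choices(TASKS, weights=WEIGHTS)[0]
    if task in ("mtlm", "acp"):
        # Encoder-only tasks: freeze the decoder; the loss comes from
        # classification heads on top of the encoder (hypothetical helper).
        set_decoder_trainable(False)
        loss = encoder_task_loss(model, tokenizer, source, task)  # hypothetical
    else:
        # Context reconstruction / prediction: standard seq2seq loss,
        # updating encoder and decoder jointly.
        set_decoder_trainable(True)
        enc = tokenizer(source, return_tensors="pt", truncation=True)
        labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
        loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```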

Dataset
We perform experiments on the WEATHERGOV dataset (Liang et al., 2009) and the WebNLG dataset (Gardent et al., 2017). In the WEATHERGOV dataset, the output text is a weather report, and the source data provides a structured representation of the temperature, sky conditions, etc. The WEATHERGOV dataset consists of 29,528 scenarios, each with 36 weather records paired with a natural language weather forecast (28.7 words long on average). The WebNLG challenge consists of mapping sets of RDF triples to text. The newest WebNLG dataset contains 16,095 data inputs and 42,873 data-text pairs; the average length of the output text is 22.3 words. We convert the input data into a table and then linearize the structure of the table in the same way as during pre-training, as sketched below.
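As a concrete illustration, below is a minimal sketch of how a set of RDF triples might be flattened into table rows before applying the linearization described in the pre-training section; the subject/predicate/object column layout and the example triples are assumptions for illustration.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def triples_to_table(triples: List[Triple]) -> List[List[str]]:
    """Represent each RDF triple as one table row with three columns."""
    return [["subject", "predicate", "object"]] + [list(t) for t in triples]

rows = triples_to_table([("Alan_Bean", "occupation", "Test_pilot"),
                         ("Alan_Bean", "birthPlace", "Wheeler,_Texas")])
# `rows` can then be passed to the table linearization sketch above.
```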

Results on WeatherGov Dataset
We use a batch size of 4 and fine-tune for 100 epochs on the WEATHERGOV dataset. Results (BLEU and METEOR) are presented in Table 1. As shown in the table, the seq2seq model (Mei et al., 2015) already achieves very good results, but the pre-trained models further improve the results by a large margin. The BLEU scores of the pre-trained models are above 80, which indicates that the generated text is highly similar to the gold text. BART-Retrain refers to fine-tuning BART on both our pre-training dataset and the datasets of the downstream tasks. Our STTP model outperforms the BART-Retrain model, which shows that the improvement of STTP over BART comes from the proposed training objectives rather than from the additional training data.

Results on WebNLG Dataset
We use a batch size of 4 and fine-tune for 16 epochs on the WebNLG dataset. Results are presented in Table 2. The results of Seq2Seq, Seq2Seq+Delex and Seq2Seq+copy are taken from Shimorina and Gardent (2018), and the results of GCN and KGPT-Seq are taken from Chen et al. (2020b). As shown in the table, all pre-trained models outperform the models without pre-training, even though some of them do not explicitly consider the structure of the input data. This is because the pre-trained models learn a large amount of external knowledge and a good language model from large corpora. Structure-aware models such as GCN outperform the plain seq2seq model, which shows that structure understanding is important for this task.

We randomly sample 50 instances from the WebNLG dataset and perform human evaluation on them. Three graduate students are asked to rank the texts generated by each model on three aspects: readability (whether the generated text is fluent), accuracy (whether the information in the generated text is consistent with that contained in the input table) and overall quality. We use Best-Worst Scaling (Louviere et al., 2015), which has been shown to produce more reliable results than rating scales (Kiritchenko and Mohammad, 2017). Specifically, each model's score is computed as the percentage of times it was selected as best minus the percentage of times it was selected as worst, and ranges from -1 (unanimously worst) to +1 (unanimously best). Human evaluation results on the WebNLG dataset are shown in Table 3. Our model outperforms the KGPT and BART-base models, which further demonstrates the efficacy of our method. Running examples are provided in the supplementary materials.
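For clarity, the following is a minimal sketch of how the Best-Worst Scaling scores can be computed from the collected judgments; the data layout and the example counts are illustrative, not the paper's actual annotations.

```python
from collections import Counter
from typing import Dict, List, Tuple

def bws_scores(judgments: List[Tuple[str, str]], n_items: Dict[str, int]) -> Dict[str, float]:
    """judgments: (best_model, worst_model) pairs, one per comparison.
    n_items: number of comparisons each model appears in.
    Score = fraction selected best - fraction selected worst, in [-1, +1]."""
    best = Counter(b for b, _ in judgments)
    worst = Counter(w for _, w in judgments)
    return {m: (best[m] - worst[m]) / n for m, n in n_items.items()}

# Illustrative example: 4 comparisons, each involving all three systems.
judgments = [("STTP", "BART"), ("STTP", "KGPT"), ("KGPT", "BART"), ("STTP", "BART")]
print(bws_scores(judgments, {"STTP": 4, "KGPT": 4, "BART": 4}))
# {'STTP': 0.75, 'KGPT': 0.0, 'BART': -0.75}
```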

Conclusion
In this paper, we propose STTP, a pre-trained model trained with tables and their contexts. The STTP model achieves strong performance on two downstream tasks. In future work, we hope to collect more data and explore other self-supervised tasks to train a more effective model for the table-to-text task.