Sketch and Refine: Towards Faithful and Informative Table-to-Text Generation

Table-to-text generation refers to generating a descriptive text from a key-value table. Traditional autoregressive methods, though able to generate text with high fluency, suffer from low coverage and poor faithfulness. To mitigate these problems, we propose a novel Skeleton-based two-stage method that combines both Autoregressive and Non-Autoregressive generation (SANA). Our approach includes: (1) skeleton generation with an autoregressive pointer network that selects key tokens from the source table; (2) an edit-based non-autoregressive generation model that produces texts via iterative insertion and deletion operations. By integrating hard constraints from the skeleton, the non-autoregressive model improves the generation's coverage over the source table and thus enhances its faithfulness. We conduct automatic and human evaluations on both the WikiPerson and WikiBio datasets. Experimental results demonstrate that our method outperforms the previous state-of-the-art methods in both automatic and human evaluation, especially on coverage and faithfulness. In particular, we achieve a PARENT-T recall of 99.47 on WikiPerson, improving over the existing best results by more than 10 points.


Introduction
Table-to-text generation is a challenging task which aims at generating a descriptive text from a key-value table. There is a broad range of applications in this field, such as the generation of weather forecasts (Mei et al., 2016), sports news (Wiseman et al., 2017), biographies (Lebret et al., 2016), etc. Figure 1 illustrates a typical input and output example of this task. Previous methods (Nie et al., 2018; Bao et al., 2018) are usually trained in an end-to-end fashion with the encoder-decoder architecture (Bahdanau et al., 2015). Despite generating text with high fluency, their lack of control over the generation process leads to poor faithfulness and low coverage. As shown in Figure 1, the case of poor faithfulness hallucinates the occupation "singer", which is not entailed by the source table, and the case of low coverage misses the information about the place of birth. Even if trained on a cleaned dataset, end-to-end methods still encounter these problems, as it is too complicated to learn the probability distribution under the table constraints (Parikh et al., 2020).
To alleviate these problems, recent studies (Shao et al., 2019; Puduppully et al., 2019; Ma et al., 2019) propose two-stage methods to control the generation process. In the first stage, a pointer network selects the salient key-value pairs from the table and arranges them to form a content-plan. In the second stage, an autoregressive seq2seq model generates text conditioned on the content-plan. However, such methods can cause the following problems: (1) since the generated content-plan may contain errors, generating solely from the content-plan leads to inconsistencies; (2) even if a perfect content-plan is provided, the autoregressive model used in the second stage is still prone to hallucinating unfaithful content due to the well-known exposure bias problem (Wang and Sennrich, 2020); (3) there is no guarantee that the selected key-value pairs will actually be described in the generated text. As a result, these methods still struggle to generate faithful and informative text.
In this paper, we propose a Skeleton-based model that combines both Autoregressive and Non-Autoregressive generation (SANA). SANA divides table-to-text generation into two stages: skeleton construction and surface realization. At the stage of skeleton construction, an autoregressive pointer network selects tokens from the source table and composes them into a skeleton. We treat the skeleton as part of the final generated text. At the stage of surface realization, an edit-based non-autoregressive model expands the skeleton into a complete text via insertion and deletion operations. Compared with the autoregressive model, the edit-based model has the following advantages: (1) the model generates text conditioned on both the skeleton and the source table, alleviating the impact of an incomplete skeleton; (2) the model accepts the skeleton as decoder input, strengthening the consistency between the source table and the generated text; (3) the model generates text under hard constraints from the skeleton, improving the generation's coverage over the source table. Therefore, SANA is capable of generating faithful and informative text.
The contributions of this work are as follows: • We propose a skeleton-based model, SANA, which explicitly models skeleton construction and surface realization. The separated stages help the model better learn the correlation between the source table and the reference.
• To make full use of the generated skeleton, we use a non-autoregressive model to generate text based on the skeleton. To the best of our knowledge, we are the first to introduce a non-autoregressive model to the table-to-text generation task.
• We conduct experiments on the WikiPerson and WikiBio datasets. Both automatic and human evaluations show that our method outperforms previous state-of-the-art methods, especially on faithfulness and coverage. In particular, we obtain a near-optimal PARENT-T recall of 99.47 on the WikiPerson dataset.

Related Work

Table-to-text Generation Table-to-text generation has been widely studied for decades (Kukich, 1983; Goldberg et al., 1994; Reiter and Dale, 1997). Recent works that adopt end-to-end neural networks have achieved great success on this task (Mei et al., 2016; Lebret et al., 2016; Wiseman et al., 2017; Nema et al., 2018; Liu et al., 2019a). Despite generating fluent texts, these methods suffer from poor faithfulness and low coverage. Some works focus on generating faithful texts. For example, Tian et al. (2019) propose a confident decoding technique that assigns a confidence score to each output token to control the decoding process. Filippova (2020) introduces a "hallucination knob" to reduce the amount of hallucination in the generated text. However, these methods only focus on the faithfulness of the generated text and still struggle to cover most of the attributes in the source table. Our work is inspired by recently proposed two-stage methods (Shao et al., 2019; Puduppully et al., 2019; Moryossef et al., 2019; Ma et al., 2019; Trisedya et al., 2020), which show that table-to-text generation can benefit from separating the task into content planning and surface realization stages. Compared with these methods, SANA guarantees that the information provided by the first stage is preserved in the generated text, thus significantly improving both the coverage and the faithfulness of the generated text.

Non-autoregressive Generation Although autoregressive models have achieved remarkable success in natural language generation tasks, they are time-consuming and inflexible. To overcome these shortcomings, Gu et al. (2018) proposed the first non-autoregressive (NAR) model, which generates tokens simultaneously by discarding the generation history. However, since a source sequence may have several possible outputs, discarding the dependency among target tokens can degrade generation quality. This is also known as the "multi-modality" problem (Gu et al., 2018). Recent NAR approaches alleviate this problem via partially parallel decoding (Stern et al., 2019) or iterative refinement (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019). Specifically, Stern et al. (2019) perform partially parallel decoding through insertion operations, and Gu et al. (2019) further incorporate a deletion operation to perform an iterative refinement process. These edit-based models not only close the gap with autoregressive models on translation tasks, but also make generation flexible by allowing the integration of lexical constraints. However, the multi-modality problem still exists, making it difficult to apply NAR models to other generation tasks, such as table-to-text generation, story generation, etc. In this work, we use the skeleton as the initial input of our edit-based text generator. The skeleton provides sufficient context to the text generator, thus significantly reducing the impact of the multi-modality problem.

Methods
The task of table-to-text generation is to take a structured table T as input and output a descriptive text Y = {y_1, y_2, ..., y_n}. Here, the table T can be formulated as a set of attributes T = {a_1, a_2, ..., a_m}, where each attribute is a key-value pair a_i = ⟨k_i, v_i⟩. Figure 2 shows the overall framework of SANA. It contains two stages: skeleton construction and surface realization. At the stage of skeleton construction, we propose a Transformer-based (Vaswani et al., 2017) pointer network to select tokens from the table and compose them into a skeleton. At the stage of surface realization, we use an edit-based Transformer to expand the skeleton into a complete text via iterative insertion and deletion operations.

Table Encoder
The source table is a set of attributes represented as key-value pairs a_i = ⟨k_i, v_i⟩. Here, the value of an attribute a_i is flattened into a token sequence v_i = {w_i^1, w_i^2, ..., w_i^l}, where w_i^j is the j-th token and l is the length of v_i. Following Lebret et al. (2016), we linearize the source table by representing each token w_i^j as a 4-tuple (w_i^j, k_i, p_i^+, p_i^-), where p_i^+ and p_i^- are the positions of the token w_i^j counted from the beginning and the end of the value v_i, respectively. For example, the attribute ⟨Name ID, {Thaila Ayala}⟩ is represented as "(Thaila, Name ID, 1, 2)" and "(Ayala, Name ID, 2, 1)". To make the pointer network capable of selecting the special token ⟨EOS⟩, which denotes the end of the skeleton, we add a special tuple (⟨EOS⟩, ⟨EOS⟩, 1, 1) at the end of the table.
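The linearization above can be sketched in a few lines of Python. This is a minimal illustration under our own naming (the function and token names are not from any released implementation):

```python
EOS = "<EOS>"

def linearize_table(table):
    """Flatten a key-value table into 4-tuples (token, key, pos_fwd, pos_bwd).

    `table` is a list of (key, value_tokens) pairs. Positions are 1-based,
    counted from the beginning and from the end of each value, respectively.
    """
    tuples = []
    for key, value_tokens in table:
        length = len(value_tokens)
        for j, tok in enumerate(value_tokens, start=1):
            tuples.append((tok, key, j, length - j + 1))
    # Special tuple so the pointer network is able to emit <EOS>.
    tuples.append((EOS, EOS, 1, 1))
    return tuples
```

For the running example, `linearize_table([("Name_ID", ["Thaila", "Ayala"])])` yields the two tuples from the paper plus the trailing ⟨EOS⟩ tuple.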
To encode the source table, we first apply a linear projection to the concatenation [w_i^j; k_i; p_i^+; p_i^-] followed by an activation function:

f_i^j = σ(W_f [w_i^j; k_i; p_i^+; p_i^-] + b_f)

where σ is the activation function and W_f and b_f are trainable parameters. Then we use the Transformer encoder to transform each f_i^j into a hidden vector, flattening the source table into a vector sequence H = {h_1, h_2, ..., h_l}.

Pointer Network
After encoding the source table, we use a pointer network to directly select tokens from the table and compose them into a skeleton. Our pointer network uses a standard Transformer decoder to represent the tokens selected at the previous steps. Let r_t denote the decoder hidden state of the previously selected token ŷ_t. The pointer network predicts the next token based on attention scores, which are computed as follows:

α = softmax((r_t W_q)(H W_k)^T / √d_r)

where W_q and W_k are trainable parameters and d_r is the embedding dimension of r_t. According to the calculated probability distribution α, we select the next token based on the following formula:

ŷ_{t+1} = argmax_w P_copy(w),  P_copy(w) = Σ_{i: w_i = w} α_i

where ŷ_{t+1} represents the output at the next time step and P_copy(w) represents the probability of copying token w from the source. There may be multiple identical tokens in the table, so we sum up the attention scores over their corresponding positions.
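The effect of summing attention scores over duplicate table positions can be illustrated with a small Python sketch (hypothetical helper names; a real implementation would operate on batched tensors rather than Python lists):

```python
import math

def copy_distribution(scores, source_tokens):
    """Turn per-position attention scores into a per-token copy distribution.

    Scores are softmax-normalized over positions; identical tokens at
    different positions then have their probabilities summed.
    """
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    alpha = [e / z for e in exp]
    p_copy = {}
    for a, tok in zip(alpha, source_tokens):
        p_copy[tok] = p_copy.get(tok, 0.0) + a
    return p_copy

def select_next(scores, source_tokens):
    """Greedy selection of the next skeleton token."""
    p = copy_distribution(scores, source_tokens)
    return max(p, key=p.get)
```

Note that with scores `[1.0, 1.0, 1.2]` over tokens `["Paris", "Paris", "Lyon"]`, the duplicated "Paris" wins even though "Lyon" has the single highest position score, which is exactly why the aggregation matters.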
The pointer network needs target skeletons as supervision, which are not provided by table-to-text datasets. In this paper, we obtain the skeleton by collecting the tokens that appear in both the table and the description, excluding stop words. The token order in the skeleton remains the same as the tokens' relative positions in the description. More details are described in Appendix A. Given the skeleton S = {s_1, s_2, ..., s_q}, the pointer network is trained to maximize the conditional log-likelihood:

L_skeleton = Σ_{t=0}^{q} log P(s_{t+1} | s_0, ..., s_t, T)

where the special tokens s_0 = ⟨BOS⟩ and s_{q+1} = ⟨EOS⟩ denote the beginning and end of the target skeleton.
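The skeleton-extraction heuristic can be sketched as follows. This is a simplified illustration; the stop-word set below is a tiny stand-in for a real stop-word list:

```python
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "in", "and", "from"}

def annotate_skeleton(table_tokens, description, stop_words=STOP_WORDS):
    """Collect description tokens that also appear in the table, excluding
    stop words. Iterating over the description in order preserves the
    tokens' relative positions, as required for the target skeleton."""
    table_vocab = set(table_tokens)
    return [tok for tok in description
            if tok in table_vocab and tok not in stop_words]
```

For instance, a description "Thaila Ayala is an actress from Brazil" against a table containing "Thaila", "Ayala", "actress" and "Brazil" yields the skeleton ["Thaila", "Ayala", "actress", "Brazil"].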

Stage 2: Surface Realization
At the surface realization stage, we use the same encoder as in the skeleton construction stage. The decoder is an edit-based Transformer decoder (Gu et al., 2019) that generates text via insertion and deletion operations. Different from the original Transformer decoder, which predicts the next token in a left-to-right manner, the edit-based decoder can predict tokens simultaneously and independently. In this setting, we can use full self-attention without causal masking.

Model Structure
To perform the insertion and deletion operations, we remove the softmax layer at the top of the Transformer decoder and add three operation classifiers: a Deletion Classifier, a Placeholder Classifier and a Token Classifier. We denote the outputs of the Transformer decoder as (z_0, z_1, ..., z_n). The details of the three classifiers are as follows:

1. Deletion Classifier, which predicts for each token whether it should be "deleted" (1) or "kept" (0):

π_del(d | i, Y) = softmax(z_i W_del), d ∈ {0, 1}

2. Placeholder Classifier, which predicts the number of placeholders [PLH] to be inserted at each consecutive position pair:

π_plh(p | i, Y) = softmax([z_i; z_{i+1}] W_plh)

3. Token Classifier, which replaces each [PLH] with an actual token:

π_tok(t | i, Y) = softmax(z_i W_tok)

During decoding, we use the skeleton predicted in the first stage as the initial input of the decoder. We also use the full table information from the encoder side to mitigate the impact of an incomplete skeleton. As shown in Figure 2, the skeleton passes through the three classifiers sequentially for several iterations. Benefiting from full self-attention, each operation is allowed to condition on the entire skeleton, which reduces the probability of hallucinating unfaithful content in the final text.
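To make the interplay of the three classifiers concrete, the following toy sketch simulates one refinement iteration, with plain Python functions standing in for the learned classifiers (all names and policies here are illustrative, not the trained model):

```python
PLH = "[PLH]"

def refine_once(tokens, delete_policy, placeholder_policy, token_policy):
    """Simulate one edit iteration: deletion, placeholder insertion, then
    token prediction. The three policies stand in for the Deletion,
    Placeholder and Token classifiers."""
    # 1) Deletion: keep tokens for which the policy predicts 0 ("keep").
    kept = [t for t in tokens if delete_policy(t) == 0]
    # 2) Placeholder: for every slot between consecutive tokens (plus the
    #    boundary slots, marked <s> and </s>), insert the predicted number
    #    of [PLH] symbols.
    with_plh = []
    for i in range(len(kept) + 1):
        left = kept[i - 1] if i > 0 else "<s>"
        right = kept[i] if i < len(kept) else "</s>"
        with_plh.extend([PLH] * placeholder_policy(left, right))
        if i < len(kept):
            with_plh.append(kept[i])
    # 3) Token: replace every placeholder with a predicted token.
    return [token_policy(i, with_plh) if t == PLH else t
            for i, t in enumerate(with_plh)]
```

Starting from the skeleton ["Thaila", "Ayala", "actress"], a policy that inserts one placeholder between "Ayala" and "actress" and fills it with "an" produces ["Thaila", "Ayala", "an", "actress"]; further iterations would continue to insert and delete until the text converges.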

Training
Following Gu et al. (2019), we adopt imitation learning to train our model and simplify their training procedure. The iterative process of our model produces a variety of intermediate sequences. To simulate the iterative process, we need to construct the intermediate sequence and provide an optimal operation a* (either an oracle insertion ⟨p*, t*⟩ or an oracle deletion d*) as the supervision signal during training. Given an intermediate sequence Y, the optimal operation a* is computed as follows:

a* = argmin_a D(Y*, E(Y, a))

Here, D denotes the Levenshtein distance (Levenshtein, 1965) between two sequences, and E(Y, a) represents the output after performing operation a upon Y.
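For intuition, the oracle-operation search can be sketched at the token level. In this simplified sketch we restrict the candidate set to single-token deletions plus a no-op, whereas the full procedure also scores insertions:

```python
def levenshtein(a, b):
    """Token-level Levenshtein distance between two sequences,
    computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def oracle_deletion(y, y_star):
    """Among single-token deletions of y (and doing nothing), return the
    result closest to the reference y_star under Levenshtein distance."""
    candidates = [y] + [y[:i] + y[i + 1:] for i in range(len(y))]
    return min(candidates, key=lambda c: levenshtein(c, y_star))
```

For example, with y = ["Thaila", "singer", "Ayala"] and reference ["Thaila", "Ayala"], the oracle deletes the hallucinated "singer", reaching distance 0.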
To improve training efficiency, we construct the intermediate sequence in a simple yet effective way. Given a source table, skeleton and reference (T, S, Y*), we first calculate the longest common subsequence X between S and Y*, and then construct the intermediate sequence Y by applying random deletion to Y*, except for the part belonging to X. We use Y to learn the insertion and deletion operations. The learning objective of our model is computed as follows:

L = Σ log π_plh(p* | Y) + Σ log π_tok(t* | Y′) + λ Σ log π_del(d* | Y″)

where Y′ is the output after inserting the placeholders p* into Y, and Y″ is the output of applying the model's insertion policy π_tok to Y′ (we take the argmax of the token classifier instead of sampling). λ is a hyperparameter; in our experiments, λ = 1 gives reasonably good results.
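The intermediate-sequence construction can be sketched as follows (a simplified illustration with hypothetical function names; `drop_prob` controls how aggressively non-LCS reference tokens are removed):

```python
import random

def lcs_indices(a, b):
    """Indices into b of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    idx, i, j = set(), n, m
    while i and j:  # backtrack through the DP table
        if a[i - 1] == b[j - 1]:
            idx.add(j - 1); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return idx

def make_intermediate(skeleton, reference, drop_prob=0.5, rng=random):
    """Randomly delete reference tokens, but never those on the LCS with
    the skeleton, yielding a training input between skeleton and reference."""
    protected = lcs_indices(skeleton, reference)
    return [t for j, t in enumerate(reference)
            if j in protected or rng.random() >= drop_prob]
```

With `drop_prob=1.0` only the protected (skeleton-aligned) tokens survive, and with `drop_prob=0.0` the full reference is kept; training would sample something in between.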

Inference
As mentioned above, at the inference stage, we use the generated skeleton as the initial input of the decoder. The insertion and deletion operations are performed alternately for several iterations. We stop the decoding process when the current text no longer changes or a maximum number of iterations has been reached.
In order to completely retain the skeleton in the generated text, we follow Susanto et al. (2020) in enforcing hard constraints by forbidding the deletion operation on tokens in the skeleton. Specifically, we compute a constraint mask to indicate the positions of constrained tokens in the sequence and forcefully set the deletion classifier's prediction for these positions to "keep". The constraint masks are recomputed after each insertion and deletion operation.
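The masking step can be sketched as follows. This is an illustrative simplification: it matches constrained tokens by identity, whereas a real implementation would track constrained positions through the edits and override the classifier's logits rather than its final decisions:

```python
def mask_deletions(tokens, skeleton_tokens, delete_decisions):
    """Override the deletion classifier: positions holding skeleton tokens
    are forced to 'keep' (0); other positions retain the model's decision
    (1 = delete, 0 = keep)."""
    constrained = set(skeleton_tokens)
    return [0 if tok in constrained else d
            for tok, d in zip(tokens, delete_decisions)]
```

For example, if the classifier wanted to delete every token of ["Thaila", "singer", "Ayala"] but "Thaila" and "Ayala" are skeleton tokens, only the hallucinated "singer" remains deletable.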

Datasets
We conduct experiments on the WikiBio (Lebret et al., 2016) and WikiPerson datasets. Both datasets aim to generate a biography from a structured key-value table.

Implementation Details
We implement SANA using fairseq (Ott et al., 2019). The token vocabulary is limited to the 50K most common tokens in the training dataset. The dimensions of the token embedding, key embedding and position embedding are set to 420, 80 and 5, respectively. All Transformer components used in our method adopt the base Transformer (Vaswani et al., 2017) setting with d_model = 512, d_hidden = 2048, n_heads = 8 and n_layers = 6. All models are trained on 8 NVIDIA V100 Tensor Core GPUs.
For the skeleton construction model, the learning rate linearly warms up to 3e-4 within 4K steps, and then decays with the inverse square root scheduler. Training stops after 15 checkpoints without improvement according to the BLEU score. We set the beam size to 5 during inference.
For the surface realization model, the learning rate linearly warms up to 5e-4 within 10K steps, and then decays with the inverse square root scheduler. Training stops when the training steps reach 300K. We select the best checkpoint according to the validation BLEU.

Baselines
We compare SANA with two types of methods: end-to-end methods and two-stage methods.
For end-to-end methods, we select the following baselines: (1) DesKnow, a seq2seq model with table position self-attention to capture the inter-dependencies among related attributes; (2) PtGen (Pointer-Generator, See et al. (2017)), an LSTM-based seq2seq model with attention and a copy mechanism; (3) Struct-Aware, a seq2seq model using a dual attention mechanism to consider both key and value information; (4) OptimTrans, a Transformer-based model that incorporates an optimal transport matching loss and an embedding similarity loss; (5) Conf-PtGen (Tian et al., 2019), a pointer generator with a confidence decoding technique to improve generation faithfulness; (6) S2S+FA+RL (Liu et al., 2019b), a seq2seq model with a forced attention mechanism and a reinforcement learning training procedure; (7) Bert-to-Bert (Rothe et al., 2019), a Transformer encoder-decoder model where both the encoder and the decoder are initialized with BERT (Devlin et al., 2019).
For two-stage methods, we select the following baselines: (1) Pivot (Ma et al., 2019), a two-stage method that first filters noisy attributes in the table via sequence labeling and then uses the Transformer to generate text based on the filtered table; (2) Content-Plan (Puduppully et al., 2019), a two-stage method that first uses a pointer network to select important attributes to form a content-plan and then uses a pointer generator to generate text based on the content-plan.

Comparison with End-to-End Methods
We first compare SANA with end-to-end methods; Table 2 shows the experimental results. From Table 2, we can draw the following observations: (1) On the WikiPerson dataset, SANA outperforms existing end-to-end methods on all of the automatic evaluation metrics, indicating the high quality of the generated texts. In particular, we obtain a near-optimal PARENT-T recall of 99.47, which shows that our model is able to cover nearly all the content of the table.
(2) On the noisy WikiBio dataset, SANA outperforms previous state-of-the-art models on almost all of the automatic evaluation scores except PARENT precision, which confirms the robustness of our method. Although Conf-PtGen achieves the highest PARENT precision, its PARENT recall is significantly lower than that of any other method. Different from Conf-PtGen, SANA achieves the highest recall while maintaining good precision.
(3) It is necessary to prohibit deleting tokens in the skeleton. After removing this restriction (− hard constraints), our method declines to different degrees across the automatic metrics. (4) SANA performs poorly after removing the skeleton construction stage (− skeleton). This shows that the edit-based non-autoregressive model is difficult to apply directly to table-to-text generation; the skeleton is very important for the edit-based model, as it significantly reduces the impact of the multi-modality problem. Combining both autoregressive and non-autoregressive generation, SANA achieves state-of-the-art performance.

Table 2: Comparison with end-to-end methods. P, R, F1 represent precision, recall and F1 score, respectively. "− hard constraints" means removing the restriction that forbids the deletion operation on skeleton tokens; "− skeleton" means removing the skeleton construction stage.

Table 3: Comparison with two-stage methods. P, R, F1 represent precision, recall and F1 score, respectively. "+ Oracle" means using oracle information (i.e., the oracle skeleton or content-plan) as input.

Comparison with Two-Stage Methods
We further compare SANA with the two-stage methods. As shown in Table 3, there is an obvious margin between SANA and the two baselines, which shows that SANA models the two-stage process more effectively. To verify that SANA can make use of the information provided by the first stage, we use the gold standard (i.e., the oracle skeleton or content-plan extracted with heuristic methods) as the input of the models used in the second stage. Although the first stage of Content-Plan is similar to SANA's, its PARENT scores (precision, recall and F1) do not improve noticeably, especially on the WikiPerson dataset. This shows that the edit-based decoder of SANA can make use of the oracle skeleton to produce high-quality descriptions.

Human Evaluation
We report the human evaluation results on the WikiPerson dataset in Table 4. From the demonstrated results, it can be found that SANA outperforms the other end-to-end and two-stage models on all of the human evaluation metrics. This is consistent with our model's performance in the automatic evaluation. In the evaluation of fluency, though the models except for Struct-Aware reach similar performance, SANA performs the best, which demonstrates that its generations contain fewer grammatical and semantic mistakes. In the evaluation of coverage, SANA outperforms the Content-Plan model and defeats the other models by a large margin. This result is consistent with our claim that SANA can cover sufficient information in the source table and thus ensure the informativeness of its generations. As to correctness, the advantage of SANA over the other models indicates that our model generates more faithful content and suffers less from the hallucination problem. It should be noted that although Content-Plan and DesKnow are on par with SANA on coverage and correctness respectively, they fail to perform well on both metrics, in contrast with SANA. This indicates that our model generates content that is both informative and faithful.

Conclusion
In this paper, we focus on faithful and informative table-to-text generation. To this end, we propose a novel skeleton-based method that combines both autoregressive and non-autoregressive generation. The method divides table-to-text generation into skeleton construction and surface realization stages. The separated stages help the model better learn the correlation between the source table and the reference. In the surface realization stage, we further introduce an edit-based non-autoregressive model to make full use of the skeleton. We conduct experiments on the WikiBio and WikiPerson datasets. Both automatic and human evaluations demonstrate the effectiveness of our method, especially on faithfulness and coverage.

A Automatic Skeleton Annotation
Algorithm 1 describes the automatic skeleton annotation process. Given a table and its corresponding description, we first collect the tokens that appear in both the table and the description, excluding stop words; these tokens are then sorted by their positions in the description in ascending order. In this way, we obtain a sequence composed of the selected tokens, which we regard as the skeleton.

B More Generation examples
We further provide a case study, using another two examples (including a very challenging example which requires recovering a large number of facts), to show the effectiveness of our method SANA. In the following pages, we show the example outputs in Tables 6 and 7. In these examples, the SANA model shows a much better capability of generating informative and faithful descriptions compared with the baselines.

Algorithm 1 Automatic Skeleton Annotation
Input: tables {T_i}, descriptions {Y*_i}, stop-word list W
Output: skeleton list S
1: S ← empty list
2: for each pair (T_i, Y*_i) do
3:     V_i ← the set of tokens appearing in T_i
4:     S_i ← empty list
5:     for y*_j ∈ Y*_i do
6:         if y*_j ∈ V_i and y*_j ∉ W then
7:             Append token y*_j to the skeleton list S_i
8:         end if
9:     end for
10:    Collect the skeleton list: S += S_i
11: end for

Table 6: Example outputs from different methods. The red text stands for the hallucinated content in each generated description. Compared with DesKnow and Struct-Aware, SANA recovers all the table facts without generating any unfaithful content.