Attend, Memorize and Generate: Towards Faithful Table-to-Text Generation in Few Shots

Few-shot table-to-text generation is the task of composing fluent and faithful sentences that convey table content using limited training data. Although much effort has been devoted to generating impressively fluent sentences by fine-tuning powerful pre-trained language models, the faithfulness of the generated content still needs to be improved. To this end, this paper proposes Attend, Memorize and Generate (AMG), a novel approach inspired by the human text generation process. In particular, AMG (1) attends over multiple granularities of context, using a novel table-slot-level attention strategy alongside traditional token-by-token attention, to exploit both table structure and natural linguistic information; (2) dynamically memorizes the table slot allocation states; and (3) generates faithful sentences according to both the context and the memory allocation states. Comprehensive experiments with human evaluation on three domains of the Wiki dataset (i.e., humans, songs, and books) show that our model generates higher-quality texts than several state-of-the-art baselines, in terms of both fluency and faithfulness.


Introduction
Table-to-text generation, which aims to translate a semi-structured table into a natural language description while preserving the conveyed table information, has drawn increasing interest over the past few years. It has been widely applied in many real-world scenarios, such as automatically generating weather forecast reports (Liang et al., 2009), biographies (Lebret et al., 2016), restaurant descriptions (Novikova et al., 2017), task-oriented conversations (Budzianowski et al., 2018; Williams et al., 2013), and healthcare descriptions (DiMarco et al., 2007; Hasan and Farri, 2019). (All source code and the experimental dataset are available at https://github.com/wentinghome/AMG.) Despite such significant gains, current approaches are driven by large-scale, well-labeled training data, hindering generalization to scenarios with limited labeled data. In addition, the faithfulness of the generated content is still not well explored. Few-shot natural language generation (Brown et al., 2020; Schick and Schütze, 2021; Xia et al., 2020a) has been in increasing demand, since sufficient labeled data are unavailable in many scenarios. To improve table-to-text generation in few-shot scenarios, many existing works (Chen et al., 2020c; Gong et al., 2020; Peng et al., 2020) resort to the pre-training techniques widely adopted in NLP: a model is first pre-trained on large-scale unlabeled data, and the learned knowledge is then transferred to the few-shot table-to-text scenario. Although these pre-trained models achieve promising performance in generating fluent descriptions, our investigation shows that they still suffer from major limitations: (1) The structure of the table is not well preserved.
In terms of table representation, existing methods (Chen et al., 2020c; Gong et al., 2020; Chen et al., 2020a) tend to flatten the table into a sequential sentence, ignoring structural features (e.g., the correlation between words within each table slot) that are also critical for table-to-text generation.
(2) Generation bias. Approaches that directly fine-tune the model on the target data bias it toward the knowledge learned during pre-training rather than the specific target-task knowledge, hurting faithfulness because extra information irrelevant to the input table is introduced.
For example, given the table in the top box of Figure 1, the aim is to generate a coherent and faithful sentence with high coverage of the table slots and little out-of-table information. From this table, we can observe that current state-of-the-art models tend to generate sentences with hallucinated content; for example, GPT-2 introduces the wrong middle name "kelly" and the nationality "american". In addition, the table coverage of the content generated by current approaches is low; for example, BART does not mention the event "marathon". These observations motivate us to design a model that can generate faithful texts from tables while maintaining fluency.
To tackle the aforementioned limitations, this paper proposes Attend, Memorize and Generate (AMG), a novel approach for faithful table-to-text generation in few-shot settings. Inspired by the human generation process, which copies a consecutive slot span to compose a sentence in context, we propose a table slot attention mechanism that strengthens the dependency between the generated sentence and the input table, improving the model's generalization at inference time. In addition, to avoid generating hallucinated content, we design a memory unit to monitor the visits to each table slot. In particular, the memory unit is initialized with the meta-data of all table slots, and then updated by checking the generated words as well as the current memory state.
Looking back at Figure 1, we can also observe several advantages of AMG. First, AMG allows the to-be-predicted word "1998" from the "birth_date" table slot to attend to the table as well as the previously generated words "robert . . . born", while attention to words within the same table slot is prohibited. The model is thus forced to capture the table span structure and to rely on the table span values for generation; in this way, it learns a slot-level table representation. Furthermore, as shown in Figure 1, "M_0" is the initial memory state, in which all slots are available to be chosen (marked in green). After predicting the last word of the table slot "name", "M_1" is updated, since the model detects that the table slot "name" is present in the generated sentence, making the state of "name" unavailable (marked in red). In addition, the generation of the word "1998" takes both the context and the table slot allocation into account: "1998" is selected by locating the value of the table span "birth_date" as well as the activated signal of the table slot "birth_date" (marked in blue) in the memory allocation status.
To summarize, the primary contributions of this paper are as follows: (1) To better preserve the structure of the table, we design a multi-granularity attention that attends over both the token level and the table slot level. (2) To the best of our knowledge, we are the first to introduce a memory mechanism that improves the faithfulness of generated texts by tracking the allocation of table slots. (3) We conduct comprehensive experiments on three domains (i.e., Humans, Books, and Songs) of the Wiki dataset to validate the effectiveness of the proposed approach.

Problem Definition
Given a table T = {(a_1, v_1), · · · , (a_n, v_n)} with n slots, where a_i and v_i refer to the attribute name and value of the i-th table slot, respectively, the table-to-text generation task aims at producing a coherent text Y = (y_1, · · · , y_L) that describes the table information with fluency and faithfulness, where L denotes the length of the generated text.
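As a minimal illustration, the task input can be represented as a list of attribute-value pairs (the type alias and helper below are ours, not from the paper):

```python
from typing import List, Tuple

# A table T is a sequence of slots; each slot pairs an attribute name a_i
# with its value v_i (type alias name is illustrative only).
Table = List[Tuple[str, str]]

def attributes(table: Table) -> List[str]:
    # The attribute names a_1..a_n of the table's slots.
    return [a for a, _ in table]

table: Table = [
    ("name", "robert kiprono cheruiyot"),
    ("birth_date", "august 10 , 1998"),
]
print(attributes(table))  # → ['name', 'birth_date']
```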

UniLM
To alleviate the under-fitting issue caused by insufficient training examples in few-shot learning, AMG adopts the structure of the state-of-the-art pre-trained language model UniLM (Dong et al., 2019) to integrate external knowledge. UniLM is a multi-layer Transformer network that can be applied to both natural language understanding (NLU) and natural language generation (NLG) tasks. In this paper, we configure UniLM with a Seq2Seq self-attention mask so that the masked i-th to-be-predicted word y_i^[MASK] aggregates context from the source-sequence words of table T and the previously generated target words y_<i. The model computes the conditional probability of the to-be-predicted word with the masked language model objective:

p(y_i | T, y_<i) = softmax(W h_i + b),    (1)

where h_i is the final-layer hidden state at the masked position, and W and b are the parameters of the output layer.
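As a rough sketch (our own simplification, not the UniLM implementation), the Seq2Seq mask can be built as a boolean matrix: source positions attend bidirectionally among themselves, while target positions attend to the whole source and only to earlier target positions:

```python
def seq2seq_mask(src_len: int, tgt_len: int):
    # Returns an n x n matrix where entry [i][j] == 1 means position i may
    # attend to position j. Positions 0..src_len-1 are the linearized table;
    # the rest are the (partially generated) target sentence.
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < src_len:
                mask[i][j] = 1          # everyone sees the whole source
            elif i >= src_len and j <= i:
                mask[i][j] = 1          # causal attention within the target
    return mask

m = seq2seq_mask(3, 2)
```

In practice such a mask is applied additively to the attention scores (0 for allowed positions, a large negative value for prohibited ones) before the softmax.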

AMG Approach
Overview
Figure 2 illustrates the overall architecture of our model, which is composed of three components, i.e., attend, memorize, and generate.
(1) Attend. We propose a multi-granularity attention mechanism that attends over both the token level and the table slot level to capture linguistic knowledge as well as table structure information. We believe this knowledge can improve the faithfulness of the generated texts.
(2) Memorize. We develop a memory unit to store and keep track of the table slot allocation status.
(3) Generate. We take both the context representation and the table slot allocation states into account when making predictions. These three building blocks interweave and lead the model to generate descriptions from tables faithfully.

Table Representation
Table Linearization Table-to-text generation receives a semi-structured table as input. However, our proposed model AMG is built upon the UniLM architecture, which requires a natural sentence as input. Therefore, the first step is to translate the table into a natural sentence by linearization (Chen et al., 2020c). For the table example shown in Figure 1, the attribute-value pair "name: robert kiprono cheruiyot" can be linearized as "name is robert kiprono cheruiyot".

Representing the History of Table Slot Allocation AMG makes its prediction on the to-be-predicted token by taking the memory allocation status into account. The memory at each time step is updated from the previously generated table slots. Thus, we prepare the representation his_t of the previously generated table slots at time step t using the static UniLM model. For example, in Figure 2, when making the prediction for "[MASK]", the representation of the table slot allocation history is computed by feeding "robert kiprono cheruiyot" to the static UniLM model and averaging the hidden states.
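The two steps above can be sketched as follows (a simplified stand-in: the real hidden states come from the frozen UniLM encoder, here replaced by arbitrary small vectors):

```python
def linearize(table):
    # "name: robert kiprono cheruiyot" -> "name is robert kiprono cheruiyot"
    return " , ".join(f"{attr} is {value}" for attr, value in table)

def slot_history_rep(hidden_states):
    # his_t: average of the (frozen) encoder hidden states of the tokens in a
    # previously generated table slot, e.g. "robert kiprono cheruiyot".
    dim = len(hidden_states[0])
    n = len(hidden_states)
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]

print(linearize([("name", "robert kiprono cheruiyot"),
                 ("birth_date", "august 10 , 1998")]))
```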

Multi-Granularity Attention
AMG introduces multi-granularity attention (MA), which combines two granularities of attention, i.e., the token level and the table slot level. As shown in Figure 2, the memory-augmented attention A is the average of the token-level attention A_ta and the table-slot-level attention A_sa:

A = (A_ta + A_sa) / 2.    (2)

The token-level self-attention mechanism learns a unique series of query matrix W^l_Qta, key matrix W^l_Kta, and value matrix W^l_Vta at the l-th Transformer layer for each attention head. AMG then maps the (l − 1)-th Transformer layer output T^{l−1} to three matrices: query Q_ta, key K_ta, and value V_ta. The output of a self-attention head A_ta is computed as

A_ta = softmax(Q_ta K_ta^T / √d_k + Mask_ta) V_ta,    (3)

where Mask_ta ∈ R^{N×N} is the seq2seq attention mask, allowing the to-be-predicted token to attend to the table tokens as well as the previously generated tokens, and N refers to the total token length of the table, the previously generated tokens, and the current to-be-predicted token.
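A toy numeric version of a masked attention head and the multi-granularity average (pure Python, single head, tiny dimensions; purely illustrative, not the model's actual tensor code):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def masked_attention(Q, K, V, mask):
    # One attention head: scores = Q K^T / sqrt(d_k), with prohibited
    # positions (mask[i][j] == 0) pushed to -inf before the softmax.
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            s = sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d_k)
            scores.append(s if mask[i][j] else float("-inf"))
        w = softmax(scores)
        out.append([sum(wj * V[j][d] for j, wj in enumerate(w))
                    for d in range(len(V[0]))])
    return out

def combine(a_ta, a_sa):
    # Multi-granularity attention: element-wise average of the token-level
    # and slot-level attention outputs.
    return [[(x + y) / 2 for x, y in zip(r1, r2)] for r1, r2 in zip(a_ta, a_sa)]
```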
Table Slot Attention Table slot attention works similarly to self-attention; the major difference is that it learns new key and value mapping matrices W^l_Ksa and W^l_Vsa, and projects the memory M^{l−1} with them to obtain K_sa and V_sa. The query Q_sa is computed by projecting the UniLM hidden state h^{l−1} with the mapping matrix W^l_Qsa. The memory M in AMG is defined as a d_h × slot_n matrix, where slot_n is the maximum number of table slots. The j-th column of the memory at time step t is denoted M^t_j, and its initial state M^0_j is the average embedding of the j-th table slot value computed with the static UniLM model. The output of a slot-level attention head is

A_sa = softmax(Q_sa K_sa^T / √d_k + Mask_slot) V_sa.    (4)

Instead of applying the original seq2seq attention mask from UniLM to the input, a table slot attention mask Mask_slot ∈ R^{N×N} decides which words may be attended to. In our case, the to-be-predicted token is prohibited from attending to previously generated words within the same table slot, while it may attend to the rest of the generated words and to the table. As shown in Figure 2, "1998" in the descriptive sentence can attend to both the table "name is . . . , birth_date is . . ." and the previously generated words "robert kiprono cheruiyot ( born", but not to the words "august 10 ," within the same table slot.

Memory Updating The memory tracks which table slots have already been generated into the reference, and is updated with a gated mechanism, following (Henaff et al., 2016):

M̃^t_j = tanh(W_a M^{t−1}_j + W_b his_t),
z^t_j = δ(W_c M^{t−1}_j + W_d his_t),    (5)
M^t_j = (1 − z^t_j) ⊙ M^{t−1}_j + z^t_j ⊙ M̃^t_j,

where W_a, W_b, W_c and W_d are trainable parameters. First, M̃^t_j is the new candidate memory to be combined with the existing memory M^{t−1}_j. Then, the gate function z^t_j employs a sigmoid function δ to determine how much the memory M^t_j will be influenced. Finally, the gate controls how much each memory cell is updated, considering both the history of table slot appearance in the target sentence and the previous memory.
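Assuming the trainable projection matrices W_a–W_d are replaced by scalar weights, the gated update of one memory column can be sketched as (a toy version for intuition; the actual parameterization in the model uses full matrices):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_memory_column(m_prev, his, w_a, w_b, w_c, w_d):
    # Candidate memory from the previous memory column and the slot history.
    cand = [math.tanh(w_a * m + w_b * h) for m, h in zip(m_prev, his)]
    # Gate: how strongly this slot's memory cell is rewritten at this step.
    gate = [sigmoid(w_c * m + w_d * h) for m, h in zip(m_prev, his)]
    # Convex combination of old memory and candidate, cell by cell.
    return [(1 - z) * m + z * c for m, z, c in zip(m_prev, gate, cand)]
```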
Text Generation When predicting the next token at each time step, AMG considers both the context representation and the table slot allocation status from memory:

p(tk_t | tb, tk_0···t−1, M^t),    (6)

where tb refers to the table representation, tk_t denotes the token predicted at time t by AMG, and tk_0···t−1 denotes the tokens previously generated from time 0 to t − 1.

Task-Adaptive Pre-Training
AMG is built upon the pre-trained UniLM and introduces additional weights. The memory updater depends on W_a, W_b, W_c and W_d to project the memory and history values, as shown in Eq. (5). Besides, the newly added special tokens [E_CLS] and [E_SEP] must learn appropriate embedding weights from scratch. It is unrealistic to expect the newly introduced weights to be learned properly if we directly fine-tune AMG in the few-shot scenario.
Inspired by pre-trained language models and task-adaptive pre-training (Gururangan et al., 2020), we collect unlabelled table-side data for a second-phase, task-adaptive pre-training.
We first linearize the input tables. During pre-training, AMG modifies the UniLM architecture with a novel slot attention mask as well as a slot memory mechanism, which introduces additional weights. There are two goals for pre-training: 1) tune the UniLM weights to incorporate the slot attention mask, and 2) learn proper weights for the slot memory block. We divide the pre-training stage into two phases: slot-attention-based pre-training and slot-memory-based pre-training.
We incrementally incorporate the slot attention and slot memory components into the UniLM model across the two pre-training phases. First, slot-attention-based pre-training adds the slot attention mask to the last 6 layers of UniLM. We also learn the embeddings of the two special tokens [E_CLS] and [E_SEP] by adding them to the UniLM vocabulary, and we load the UniLM checkpoint weights as the initial weights for this phase. The second, slot-memory-based pre-training phase adopts the full AMG model and is initialized with the checkpoint obtained after slot-attention-based pre-training.

Fine-Tuning and Inference
In the fine-tuning stage, AMG first loads the model weights from the task-adaptive pre-training stage, which exploits valuable information from plentiful unlabelled task-relevant data. The input of our proposed model is the concatenation of the linearized table and the reference sentence. The model is trained end-to-end in masked-language-model fashion: around 70% of the words in the reference are masked, and a cross-entropy loss minimizes the discrepancy between the predicted tokens and the ground truth.
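The masking step can be sketched as follows (an illustrative helper of our own; the real implementation operates on subword IDs rather than whitespace tokens):

```python
import random

def mask_reference(tokens, mask_rate=0.7, mask_token="[MASK]", seed=13):
    # Randomly replace ~70% of reference tokens with [MASK]; the model is
    # trained to recover them with a cross-entropy loss.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # ground-truth token for the loss
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_reference("robert kiprono cheruiyot ( born 1998 )".split())
```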
For inference, the table-side data is present while the reference sentence is missing. Our approach generates the sentence auto-regressively. When predicting the t-th word, we inform the model of the previously generated table slots through the table slot history representation his_t.
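The decoding loop can be sketched as below, with a hypothetical `step_fn` standing in for the AMG model (the real system uses beam search over subwords; this greedy skeleton only shows how the slot history is threaded through the loop):

```python
def greedy_decode(step_fn, table, max_len=16, eos="[EOS]"):
    # step_fn: (table, prefix, history) -> next token. `history` collects
    # tokens that came from table slots, from which his_t would be computed.
    prefix, history = [], []
    for _ in range(max_len):
        tok = step_fn(table, prefix, history)
        if tok == eos:
            break
        prefix.append(tok)
        if any(tok in value.split() for _, value in table):
            history.append(tok)  # token was copied from a table slot
    return prefix

def scripted(tokens):
    # Dummy step function that replays a fixed token sequence.
    it = iter(tokens)
    return lambda table, prefix, history: next(it)

sentence = greedy_decode(scripted(["bob", "was", "born", "[EOS]"]),
                         [("name", "bob smith")])
```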

Experiment
In this section, we explore the following experimental questions: (1) Can the proposed model generate fluent sentences? (2) Is the generated sentence faithful to the facts given by the input table? We also perform an ablation analysis to investigate the two main components of AMG, namely the slot attention and the slot memory mechanism.

Dataset
Task-Adaptive Dataset for Pre-Training To pre-train AMG, we collect additional unlabelled table-side data from WikiBio (Lebret et al., 2016) and the Wiki dataset (Chen et al., 2020c) as the pre-training data.
Dataset for Fine-Tuning Following the few-shot natural language generation settings of Chen et al. (2020c), we conduct experiments on three domains of the Wiki dataset, i.e., humans, songs, and books, denoted as Wiki-Humans, Wiki-Songs, and Wiki-Books. For each domain, we fine-tune AMG under various few-shot settings by sampling different amounts of training examples (e.g., 500, 200, 100, 50). The validation set for each domain includes 1,000 instances, and the test sets of the humans, songs, and books domains contain 13,587, 11,879, and 5,252 examples, respectively. We set the maximum lengths of the linearized table and the generated sentence to 300 and 64, respectively.

Implementation Details
The base model for AMG is UniLM-base, with 12 Transformer layers, 768-dimensional hidden states, and 110M parameters in total. The implementation of AMG is divided into two stages: 1) two-phase task-adaptive pre-training, and 2) fine-tuning on the target Wiki dataset. We run the program on a single 1080Ti GPU with 12GB memory. Due to the memory constraint, the batch size in all stages is set to 4, and gradients are accumulated every 11 steps, which yields an effective batch size of 44. The learning rate is 5e-5. The Adam (Kingma and Ba, 2015) optimizer is used with a weight decay of 0.01. For fine-tuning, we train AMG on the target dataset for a maximum of 50 epochs. For inference, we decode on the test set using the best checkpoint according to the validation set result, with beam search of beam size 3 and length penalty 1.
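The gradient-accumulation arithmetic can be sketched as follows (a framework-free toy; in practice the deep-learning framework accumulates gradients in place before each optimizer step):

```python
def accumulated_update(micro_grads, accum_steps=11):
    # Average gradients over `accum_steps` micro-batches (batch size 4 each)
    # before a single optimizer step: effective batch size 4 * 11 = 44.
    dim = len(micro_grads[0])
    acc = [0.0] * dim
    for g in micro_grads[:accum_steps]:
        acc = [a + gi / accum_steps for a, gi in zip(acc, g)]
    return acc
```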

Baselines
We compare the proposed model with strong pre-trained language models. UniLM (Dong et al., 2019) is a pre-trained language model for both natural language understanding and generation, trained with three types of language modeling tasks. BART (Lewis et al., 2020) introduces a denoising autoencoder for pre-training sequence-to-sequence models. GPT-2 (Radford et al., 2019) is a powerful unidirectional model pre-trained auto-regressively on millions of webpages. GPT2+copy (Chen et al., 2020c), designed for few-shot table-to-text generation, learns to alternate between copying from the table and generating functional words using GPT-2. TableGPT (Gong et al., 2020) is a follow-up of Chen et al. (2020c) that additionally minimizes the parts of the generated sentence that contradict the given table information.

Automatic Evaluation
Following other generation tasks, we choose three automatic evaluation metrics, BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), to evaluate the overlap between the generated sentence and the reference sentence. Besides, to evaluate the faithfulness of the generated sentence to the source table, we adopt PARENT (Dhingra et al., 2019), which not only measures the overlap between the generated sentence and the reference, but also takes into account how much table slot information is reflected in the generated sentence. In addition, to further evaluate the faithfulness of the generated text, we include PARENT-T (Wang et al., 2020), which only measures the matching between the generated text and the corresponding table.
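For intuition only, a crude stand-in for table-grounded faithfulness can be computed as the fraction of table value tokens echoed in the output (this is our simplification; the actual PARENT metrics use entailed n-gram precision and recall, not plain token overlap):

```python
def table_coverage(generated, table):
    # Fraction of table value tokens that appear in the generated sentence.
    gen = set(generated.split())
    value_tokens = [tok for _, value in table for tok in value.split()]
    if not value_tokens:
        return 0.0
    return sum(tok in gen for tok in value_tokens) / len(value_tokens)
```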

Results
We first compare AMG with the state-of-the-art models described in Section 4.3. Table 1 shows the performance of AMG and the baseline models on the three domains of the Wiki dataset using 500 training examples. For Chen et al. (2020c), we use the code released by the authors on GitHub and replicate the results, denoted as GPT2+copy (our replication). Regarding the conventional overlap-based metrics BLEU-4, METEOR, and ROUGE-L, we can see that AMG provides the best overall performance across domains and evaluation metrics. AMG outperforms the base model UniLM by 3.71%/3.32%/2.46% on BLEU-4 in the Humans/Books/Songs domains, and gains 0.73%/0.53%/0.16% over the second-best model BART on METEOR. AMG outperforms BART by 1.07%/0.48%/0.90% on the F score of PARENT, a strong indication that AMG achieves the best balance between fluency and faithfulness. Regarding the overlap between the generated sentence and the table content, the F scores of the PARENT-T metric show that AMG provides the most informative results on the Humans and Songs domains, while remaining very competitive with the best model BART on the Books domain. Besides, to verify the stability of AMG when the amount of training data varies over 50, 100, 200, and 500, we report the PARENT scores of the proposed model and the baselines in Table 2. The Humans domain achieves the largest gain, since we collected the most pre-training data for its task-adaptive pre-training; it would therefore be beneficial for future work to collect more task-adaptive pre-training data for the Books and Songs domains to further boost model performance.

Analysis
We further analyze the faithfulness and overall quality of the generated descriptions by conducting a human evaluation. Then, we design ablation studies to investigate the importance of the two building blocks of AMG: the span attention and the memory mechanism. In addition, we sample a specific input table and compare the sentences generated by AMG and the state-of-the-art models, shown in Figure 3.
Table 4: Overall rating of BART vs. AMG.
            BART   AMG
50 shots    3.87   4.11   (p = 0.002)
500 shots   4.46   4.55   (p = 0.24)

Human Evaluation Following (Wang et al., 2020; Chen et al., 2020c), we recruit three human annotators who have passed the College English Test Band 6 (CET-6), a national English proficiency test in China, to judge the quality of the generated sentences. We sample 100 test tables and collect the corresponding outputs of AMG and the baseline models. The sentences are randomly shuffled to reduce human variance. We instruct the annotators to evaluate sentence quality from two aspects: faithfulness and overall quality. First, for faithfulness, they identify the entities mentioned in each sentence, compare them with those in the source table, and report the numbers of facts supported and contradicted by the table; we then compute the average numbers of supported and unsupported entities, denoted #sup and #con in Table 3. The second study evaluates the overall quality of the generated sentences in terms of fluency, grammatical correctness, and consistency with the table: annotators rank the sentences generated by the different models from 1 (best) to 6 (worst), and the "overall" column reports the average ranking of each model. Table 3 shows that AMG generates better-quality sentences than the other models. Specifically, the outputs of AMG contain the most information supported by the table, and their overall quality is ranked first.
Although AMG yields more table-unsupported facts than some other models, its overall quality still outperforms them. Since the overall rankings of BART and AMG in Table 3 are quite close, we ask three human evaluators to rate the generated sentences on three criteria and then calculate the statistical significance of the overall rating difference between BART and AMG. We randomly sample 50 sentences each for the 50- and 500-example few-shot cases. The three annotators are instructed to re-evaluate the overall sentence quality by rating it from 1 (worst) to 5 (best), considering the following three criteria: (1) #sup, (2) #con (see Table 3), and (3) naturalness and grammatical correctness. The results are reported in Table 4.
As shown in Table 4, comparing BART with AMG, the p-value of 0.002 from the Wilcoxon signed-rank test shows that, at the 95% confidence level, AMG is significantly better than BART in the lowest-resource (50-shot) setting. Figure 3 provides a sample input table from the test set along with the outputs of various models.

Case Study
The top box contains an input table, while the bottom box includes the model generations. In the bottom box, content supported by the table is left black, unsupported content is shown in light brown, and the remaining words are in blue. We find that the outputs of the pre-trained baseline models suffer from the following problems: (1) repetition, e.g., BART fails to generate the person name "wayne" correctly and repeats the last two letters as "waynene"; (2) hallucination, e.g., GPT-2 generates a middle name "wayne" which is out of the table, and GPT2+copy attempts to copy the "office" slot but fails to copy the entire information, introducing the unsupported content "the oak house" and "2003 ... brotherwayne.". By contrast, AMG provides the highest table coverage while keeping the sentence fluent, which demonstrates that the table slot span attention and the memory mechanism enable the model to copy correctly at the table slot level and enhance generation faithfulness.

Table-to-Text Generation Table-to-text generation has been studied extensively (Lebret et al., 2016; Liu et al., 2018; Wiseman et al., 2018; Ma et al., 2019; Liu et al., 2019a). Ma et al. (2019) extend table-to-text generation to the low-resource scenario and put forward a Transformer-based model. Of late, as pre-trained language models (e.g., BERT and GPT) have achieved significant successes in NLP, many works also propose to pre-train models for table understanding. Yin et al. (2020) pre-train a model for the joint understanding of tabular data and surrounding textual descriptions on large-scale paired data. Herzig et al. (2020) extend the architecture of BERT to encode tables as input and propose a weakly supervised pre-training model for question answering over tables. Kale (2020) investigates the performance of the pre-trained T5 (Raffel et al., 2019) on multiple table-to-text tasks and provides a benchmark for future research.
To maintain faithfulness to the table during generation, the work most closely related to ours is Wang et al. (2020), which introduces a table-text optimal-transport matching loss and a table-text embedding similarity loss on top of the Transformer model to enforce faithfulness during text generation.

Related Work

Pre-Trained Language Models Our work is also related to model pre-training for NLP, which has brought dramatic improvements in natural language understanding (Devlin et al., 2019; Liu et al., 2019c; Clark et al., 2020) and generation (Song et al., 2019; Dong et al., 2019; Liu et al., 2020b, 2019b). The widely used pre-trained models (PTMs) for table-to-text generation fall into two classes: text-to-text PTMs (Radford et al., 2018; Devlin et al., 2019; Dong et al., 2019; Lewis et al., 2020; Joshi et al., 2020) and structured data-to-text PTMs (Chen et al., 2020b; Herzig et al., 2020; Xing and Wan, 2021). Recently, many pre-training models (Liu et al., 2021, 2020a; Yao et al., 2019) have started to incorporate structured information from knowledge bases (KBs) or other structured semantic annotations into pre-training, which is also related to our work.
Few-Shot Text Generation Few-shot text generation learns from minimal data while maintaining decent generation capacity. It can be used to augment scarce training data to better assist downstream tasks, e.g., spoken language intent detection (Xia et al., 2020a,b) and opinion summary generation (Bražinskas et al., 2020). In addition, to better utilize the available resources, Chang et al. (2021) investigate training-instance selection on unlabelled data, and Schick and Schütze (2020) adapt a pattern-exploiting training strategy to fine-tune a PTM.

Conclusion
In this paper, we have proposed AMG, a novel approach for faithful table-to-text generation in few-shot settings. We first attend over multiple granularities of context, using a novel span-level attention strategy alongside traditional token-by-token attention, to exploit both the table structure and natural linguistic information. Then, we design a memory unit to dynamically memorize the table slot allocation states. Extensive experiments on three domains of the Wiki dataset verify the effectiveness of our proposed model in generating fluent and faithful descriptions from tables.