TWT: Table with Written Text for Controlled Data-to-Text Generation

Large pre-trained neural models have recently shown remarkable progress in text generation. In this paper, we propose to generate text conditioned on structured data (a table) and a prefix (the written text) by leveraging pre-trained models. We present a new data-to-text dataset, Table with Written Text (TWT), by repurposing two existing datasets: ToTTo and TabFact. TWT contains both factual and logical statements that are faithful to the structured data, aiming to serve as a useful benchmark for controlled text generation. Compared with existing data-to-text task settings, TWT is more intuitive: the prefix (usually provided by the user) controls the topic of the generated text. On TWT, existing methods often output hallucinated text that is not faithful to the data. Therefore, we design a novel approach with table-aware attention visibility and a copy mechanism over the table. Experimental results show that our approach outperforms state-of-the-art methods under both automatic and human evaluation metrics.


Introduction
Data-to-text refers to the task of generating a target textual description conditioned on structured source data such as tables, graphs, and meaning representations. Reiter and Dale (1997) suggest that a natural language generation (NLG) system consists of content planning (what to say) and surface realization (how to say it). Recent deep neural network-based approaches do not explicitly model these stages and are trained in an end-to-end fashion using the popular encoder-decoder architecture (Sutskever et al., 2014) with the attention mechanism (Bahdanau et al., 2015; Lebret et al., 2016). They achieve promising results on existing data-to-text datasets, such as WebNLG (Gardent et al., 2017), E2ENLG (Novikova et al., 2017), WikiBio (Lebret et al., 2016), ROTOWIRE (Wiseman et al., 2017), ToTTo (Parikh et al., 2020), and LogicNLG (Chen et al., 2020a). It should be noted that content planning is a key factor for data-to-text generation (Puduppully et al., 2019). Different users may be interested in different parts of the structured data. This issue may not be severe for datasets (e.g., WebNLG (Gardent et al., 2017)) that require the generated text to cover all records. However, when the golden sentence covers only part of the records (e.g., WikiBio (Lebret et al., 2016)), end-to-end methods that do not explicitly address content planning may produce open-ended targets, which leads to unreliable generated results and poses challenges for evaluation.
In NLG, one way to provide signals on what to generate is to add constraints on the model output, which falls under the task of controlled text generation (CTG). Most CTG tasks are conditioned on several key-value pairs of control factors such as tone, tense, length, and sentiment (Hu et al., 2017; Dong et al., 2017; Ficler and Goldberg, 2017). In data-to-text, Parikh et al. (2020) propose the dataset ToTTo to address content planning by highlighting some cells in the table; the highlighted cells provide strong guidance on what to generate. However, ToTTo is of limited practical use: in real applications, it would be difficult to obtain tables with highlighted cells or to ask users to highlight cells.
One important application of NLG is to provide writing assistance such as next-word prediction or text auto-completion. In this scenario, a natural content planning signal is the written text provided by the user, which could be a word, a phrase, or an incomplete sentence. For the example shown in Figure 1, given the table, users may interpret different parts of the data with different prefixes. Text generation under this scenario requires inferring the user's intention on content planning based on the structured data and the written prefix.

Figure 1: Data-to-text generation conditioned on the written text. The example table lists governors #74 Robert (took office 1868), #75 Franklin (1872), and #76 Daniel (1874); example sentences include #1 "Daniel was the 76th South Carolina Governor", #2 "Franklin took office in 1872", #3 "Robert was the Governor for 4 years", and #4 "Daniel is the second Governor in the 1870s".
To encourage research in controlled data-to-text generation, we present a new dataset, Table with Written Text (TWT), by repurposing two existing datasets: ToTTo (Parikh et al., 2020) and TabFact (Chen et al., 2019). See Section 3 for details about the dataset construction. TWT contains both factual and logical statements that are faithful to the structured data. Compared with other datasets, TWT is of practical use: the prefix controls the topic of the generated text, and the model output could assist users in writing with structured data. Note that TWT differs from datasets that provide only one golden sentence with no content planning signals.
To generate text faithful to the data, we design a novel approach that leverages large pre-trained models (Rothe et al., 2020) with table-aware attention visibility (based on the written text) and a copy mechanism (Vinyals et al., 2015; Gu et al., 2016) over the table. Experimental results show that our approach outperforms state-of-the-art methods under both automatic and human evaluation metrics, particularly in terms of faithfulness to the structured data. These results suggest that TWT could be a useful controlled data-to-text benchmark and may help drive models that provide intelligent assistance for writing with structured data.

Related Work
Data-to-Text aims to generate natural language from structured data and has been widely studied recently. Most prior works focus on surface-level text generation in a specific domain or schema, such as ROBOCUP (Chen and Mooney, 2008), WEATHERGOV (Liang et al., 2009), E2ENLG (Novikova et al., 2017), and WebNLG (Gardent et al., 2017). These datasets expect the generated text to describe all the records from the data. WikiBio (Lebret et al., 2016) requires the target text to cover salient records with no explicit guidance on the generated topic. ToTTo (Parikh et al., 2020) guides the topic of the generated target with a set of highlighted table cells. LogicNLG (Chen et al., 2020a) and Logic2Text (Chen et al., 2020b) address logical inference/generation in data-to-text. ROTOWIRE (Wiseman et al., 2017) and ToTTo also contain data that requires reasoning.
Many existing works train neural models in an end-to-end fashion (Liu et al., 2018; Wiseman et al., 2017, 2018; Chen et al., 2020c). Recently, large pre-trained models (Rothe et al., 2020; Raffel et al., 2020; Lewis et al., 2020) have also achieved new state-of-the-art results on data-to-text tasks. Reiter and Dale (1997) suggest that an NLG system consists of content planning and surface realization. Parikh et al. (2020) propose ToTTo to control the topics of generated text with highlighted cells. Gong et al. (2020) bring the sense of numerical value comparison into content planning. Li and Wan (2018) propose to generate templates and then fill the slots, while Iso et al. (2019) incorporate writers' information to generate text step by step. Gong et al. (2019) utilize hierarchical encoders with dual attention to consider both the table structure and history information. In NLG, controlled text generation is also an active research topic. It considers controlling attributes such as the identity of the speaker, sentiment (Dou et al., 2018), tense (Hu et al., 2017), politeness (Sennrich et al., 2016), and text length (Kikuchi et al., 2016). Our work can be considered a middle ground between data-to-text and controlled text generation, with more practical usage.

Task Definition
The task input is a tuple of a table T, metadata M, and a written prefix X. The metadata M may include the table caption, the title of the section that contains the table, or other context around the table. The output target is denoted by Y, such that concatenating the prefix X and the target Y results in a fluent sentence that is faithful to the table T. The goal is to learn a data-to-text model conditioned on the written prefix, P(Y | T, M, X).
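To make the interface concrete, a minimal sketch of the task tuple in Python follows; the container and field names are ours for illustration and are not part of the dataset.

```python
from dataclasses import dataclass

@dataclass
class TWTExample:
    """Illustrative container for one TWT instance (names are ours)."""
    table: list[list[str]]    # T: table cells, row-major
    metadata: dict[str, str]  # M: e.g. {"caption": ..., "section_title": ...}
    prefix: str               # X: text already written by the user
    target: str               # Y: continuation; X + Y is fluent and faithful to T

# The model learns P(Y | T, M, X): given (table, metadata, prefix), produce the target.
```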

Dataset Construction
Constructing a data-to-text dataset with clean targets is a significant challenge. We therefore build TWT by repurposing two existing datasets: 1) ToTTo (Parikh et al., 2020), a large-scale controlled table-to-text generation dataset with highlighted cells, and 2) TabFact (Chen et al., 2019), a table-based fact-checking dataset with rich annotated logical statements. As shown in Figure 2, in ToTTo, given the table, table metadata (such as the table title), and a set of highlighted cells, the goal is to produce text that describes the highlighted cells. In TabFact, the input is a table with the caption and some statements (Figure 3), and the task is to distinguish which statements are entailed or refuted. We use all annotated sentences from ToTTo and the entailed statements from TabFact as the clean targets. Chen et al. (2020a) argue that data-to-text models should be able to generate text with logical inference over the data. Note that both ToTTo and TabFact contain text with logical inference. In total, we collected 128,268 and 49,417 sentences from ToTTo and TabFact, respectively.

Now we can build the prefix and the golden target by simulating the user's writing process. An easy way to build prefix-target pairs is to break the sentence into two parts randomly: the first part becomes the written prefix, and the second part is the target text to generate. However, the difficulty of generating the correct target text differs across breakpoints. Therefore, we build the TWT evaluation benchmark with selected breakpoints on the test set. These breakpoints are carefully selected such that the target contains either a fact or logic derived from the table.
We employ a rule-based approach to choose challenging breakpoints. We consider words or phrases that co-exist in the sentence and the table (or table metadata) as aligned facts. Following Chen et al. (2019), we identify the aligned facts based on the proportion of common words and the word frequency of the longest common words between the text and each table cell or table metadata field. Some text contains numbers that do not exist in the table or table metadata (#3 and #4 in Figure 1); these numbers are usually logically inferred from the data, and we consider them inferred numbers. The position to break the sentence is the first starting token (excluded) of aligned facts and non-ordinal inferred numbers. For ordinal inferred numbers such as "first" and "second" (#4 in Figure 1), the position is the last token of the ordinal number (excluded). Once the positions are determined, we break the sentence at each position, with the requirement that the prefix contains at least one aligned fact. Note that for sentences with multiple aligned facts or numbers, we obtain multiple prefix-target pairs for one table-sentence pair. Table 1 shows the statistics of the resulting TWT evaluation benchmark. A simplified sketch of this breakpoint-selection procedure is shown below.
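A minimal sketch of the breakpoint selection, under the simplifying assumption that aligned facts are exact substring matches (the actual procedure uses the common-word heuristic of Chen et al. (2019)); all function names are ours:

```python
import re

def aligned_fact_spans(sentence, table_strings):
    """Character spans of phrases that co-occur in the sentence and the table.
    Simplified to exact substring matching for illustration."""
    spans = []
    for cell in table_strings:
        if not cell:
            continue
        for m in re.finditer(re.escape(cell), sentence):
            spans.append((m.start(), m.end()))
    return spans

def breakpoints(sentence, table_strings):
    """Positions where the sentence may split into (prefix, target): break before
    the first token of each aligned fact, requiring at least one aligned fact
    inside the prefix. Ordinal inferred numbers ('first', 'second', ...) would
    instead break after their last token (omitted here for brevity)."""
    spans = aligned_fact_spans(sentence, table_strings)
    points = set()
    for start, _ in spans:
        if any(end <= start for _, end in spans):  # prefix keeps >= 1 fact
            points.add(start)
    return sorted(points)
```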

Faithfulness Metrics
We propose two evaluation metrics to measure faithfulness: fact coverage and a modified PARENT (Dhingra et al., 2019). Fact Coverage is similar to the entity-centric metric (Liu et al., 2021) and the overall slot-filling metric. Let F_g be the set of aligned facts between the golden target and the table data, and F_p be that for the generated target. Fact coverage is calculated as |F_p ∩ F_g| / |F_g|. Note that the fact coverage of open-ended generated targets will be quite low. We use the same alignment method described in Section 3.2 to acquire F_g and F_p. PARENT (Dhingra et al., 2019) is a metric specifically designed for data-to-text evaluation that takes the input table into account. It computes smoothed n-gram precision and recall over both the generated target and the input table. Parikh et al. (2020) modify this metric by computing the recall on the highlighted cells of ToTTo. Similarly, we calculate the recall on the set of aligned facts between the golden target and the data.
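Given the aligned-fact sets, fact coverage is a simple set ratio; a minimal sketch, assuming facts are normalized strings:

```python
def fact_coverage(gold_facts, pred_facts):
    """|F_p ∩ F_g| / |F_g|: fraction of the golden target's aligned facts
    that also appear in the generated target."""
    f_g, f_p = set(gold_facts), set(pred_facts)
    return len(f_p & f_g) / len(f_g) if f_g else 0.0
```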

Text Prediction Metrics
In the scenario of providing writing assistance, whether the generated target is accepted by the user depends on 1) whether the generated text matches the user's intention and 2) how much writing effort can be saved. We design the following metrics for this scenario. EM@N is the ratio of generated text whose words exactly match the first N words of the golden text. Characters Saved is the number of matched characters between the generated and golden text; it measures how much writing effort the model can save.
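A per-example sketch of both metrics (the reported EM@N is the ratio over the test set); we read "matched characters" as the longest common prefix between the generated and golden text, which is our assumption:

```python
def em_at_n(pred, gold, n):
    """1 if the first n words of the prediction exactly match the golden text."""
    return int(pred.split()[:n] == gold.split()[:n])

def characters_saved(pred, gold):
    """Length of the longest common character prefix, i.e. keystrokes the user
    avoids by accepting the suggestion up to the first mismatch."""
    saved = 0
    for p, g in zip(pred, gold):
        if p != g:
            break
        saved += 1
    return saved
```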

Methodology
With transformer-based architectures, fine-tuning task-specific models from pre-trained parameters has achieved state-of-the-art results in text generation, reaching an impressive level of fluency and coherence. Pre-trained models with an encoder-decoder structure, such as BART (Lewis et al., 2020), BERT2BERT (Rothe et al., 2020), and T5 (Raffel et al., 2020), can be easily applied to data-to-text tasks. For example, on ToTTo, feeding the highlighted cells with row and column headers as input and fine-tuning BERT2BERT or T5 achieves relatively high performance. Figure 4 presents an overview of our model. We use a transformer-based encoder with additional positional (row/column) embeddings to encode the table structure. We introduce structured encoder-decoder attention visibility based on the prefix to attend to the prefix-relevant sub-structure of the original table. For the decoder, we employ bidirectional attention for the prefix and unidirectional attention for the generated target as the decoder self-attention visibility. We also introduce a copy mechanism over the table data to ensure the faithfulness of the generated target. Note that our model is based on the transformer encoder-decoder architecture (Rothe et al., 2020), and both the encoder and the decoder are initialized with pre-trained parameters.

Table-aware Additional Embeddings
A common way to encode structured data with a transformer is to create a linearized sequence of the data and treat it as text. For table linearization, similar to Yin et al. (2020), we use the template h_c | h_r | v to represent each table cell, where h_c and h_r are the column and row names of the cell value v. Following Herzig et al. (2020), to represent the table structure, we add a row embedding r and a column embedding c. We also use a type embedding t to represent the input type, where the type can be a table cell or one of the metadata types.
Given the input data, we first linearize the table row by row into a sequence of words and concatenate the words of the metadata before the table words. The words are further tokenized with the WordPiece (Johnson et al., 2017) or SentencePiece (Kudo and Richardson, 2018) tokenizer. Let p be the positional embedding, w the word embedding, and e the input representation; we have e = w + r + c + t + p.
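A sketch of the input representation e = w + r + c + t + p with PyTorch embedding tables; all dimensions and vocabulary sizes below are illustrative placeholders, not the values used in our experiments:

```python
import torch
import torch.nn as nn

class TableInputEmbedding(nn.Module):
    """e = w + r + c + t + p: word, row, column, type, and position embeddings summed."""
    def __init__(self, vocab, d=768, max_rows=64, max_cols=64, n_types=8, max_pos=512):
        super().__init__()
        self.word = nn.Embedding(vocab, d)
        self.row = nn.Embedding(max_rows, d)   # r: row index (0 for metadata tokens)
        self.col = nn.Embedding(max_cols, d)   # c: column index
        self.type = nn.Embedding(n_types, d)   # t: table cell vs. metadata type
        self.pos = nn.Embedding(max_pos, d)    # p: position in the linearized sequence

    def forward(self, tokens, rows, cols, types):
        # tokens/rows/cols/types: (batch, seq_len) integer tensors
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return (self.word(tokens) + self.row(rows) + self.col(cols)
                + self.type(types) + self.pos(positions))
```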

Encoder-Decoder Attention Visibility
The prefix provides the content planning signals on the structured data. For example, in Figure 4, the prefix "Daniel was the" indicates that the following text is related to the row or column that "Daniel" belongs to with high probability. Therefore, we build a visibility matrix V based on the prefix as the encoder-decoder attention mask to explicitly model the visible row and column structure during decoding. V_{i,j} = 1 means that token i (the encoder part) is visible to token j (the decoder part). We first extract the aligned facts for the prefix from the table records, and set V_{i,j} = 1 if token i (the encoder part) belongs to the table metadata M or is from the same row/column as an aligned fact.
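A sketch of how the visibility matrix V could be built from per-token row/column indices; the tensor layout and argument names are our assumptions:

```python
import torch

def visibility_matrix(enc_rows, enc_cols, is_metadata, fact_rows, fact_cols, dec_len):
    """V[i, j] = 1 iff encoder token i is metadata or shares a row/column with an
    aligned fact of the prefix; the same visibility applies to all decoder steps j.
    enc_rows, enc_cols: (enc_len,) int tensors; is_metadata: (enc_len,) bool tensor."""
    visible = is_metadata.clone()
    for r in fact_rows:
        visible |= enc_rows == r
    for c in fact_cols:
        visible |= enc_cols == c
    # shape (enc_len, dec_len): every decoder position sees the same table subset
    return visible.float().unsqueeze(1).expand(-1, dec_len)
```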

Decoder Self-attention Visibility
Typically, encoder-decoder based models generate text from the beginning, and the decoder adopts a causal mask to force the state at each decoder time step t to attend only to the states from previous time steps t' ≤ t, avoiding seeing tokens "from the future". We consider this type of attention unidirectional. In our task, we have the input prefix as the written text, and tokens in the prefix should be visible to each other. Therefore, we adopt the causal-with-prefix mask: a bidirectional attention mask is applied to the prefix, while unidirectional attention is used for decoding new tokens.
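The causal-with-prefix mask admits a compact implementation; a minimal sketch (1 = visible):

```python
import torch

def causal_with_prefix_mask(seq_len, prefix_len):
    """mask[i, j] = 1 if decoder position i may attend to position j.
    Prefix tokens see the whole prefix; generated tokens see everything before them."""
    mask = torch.tril(torch.ones(seq_len, seq_len))  # standard causal (unidirectional) mask
    mask[:prefix_len, :prefix_len] = 1.0             # bidirectional within the prefix
    return mask
```

For example, causal_with_prefix_mask(6, 3) yields full attention among the first three (prefix) positions and a lower-triangular pattern for the remaining positions.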

Copy Mechanism
To improve the faithfulness of the generated text, a copy mechanism (Vinyals et al., 2015; Gu et al., 2016) that copies from the data records is considered a promising solution (Li and Wan, 2018). Following Chen et al. (2020c), at each decoding step t, we maintain a soft copy switch p_copy to choose between generating from the distribution over the vocabulary and copying from the input data with the attention weights as the probability distribution:

p_copy = σ(w_x x_t + w_s s_t + w_{h*} h*_t + b)

where w_x, w_s, w_{h*}, and b are learnable parameters, x_t is the decoder input, s_t is the output of the last decoder layer, σ is the sigmoid function, and h*_t = Σ_i a^t_i h_i is the context vector, with a^t_i being the encoder-decoder attention weight masked with the visibility introduced in Section 5.2.
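A sketch of the copy switch and the resulting mixture over the extended vocabulary (the mixture is formalized in the text below); tensor shapes and names are our assumptions, and a single attention head is assumed for readability:

```python
import torch

def copy_switch(x_t, s_t, h_star, w_x, w_s, w_h, b):
    """p_copy = sigmoid(w_x x_t + w_s s_t + w_{h*} h*_t + b); one scalar per batch element."""
    logits = (x_t * w_x).sum(-1) + (s_t * w_s).sum(-1) + (h_star * w_h).sum(-1) + b
    return torch.sigmoid(logits)  # (batch,)

def copy_distribution(p_vocab, attn, src_ids, p_copy):
    """P(w) = (1 - p_copy) * P_vocab(w) + p_copy * (attention mass on source
    positions holding token w). p_vocab: (batch, vocab); attn: (batch, src_len);
    src_ids: (batch, src_len) long tensor of input token ids."""
    p_final = (1.0 - p_copy).unsqueeze(-1) * p_vocab   # generation part
    copy_mass = p_copy.unsqueeze(-1) * attn            # copy part, per source position
    return p_final.scatter_add(1, src_ids, copy_mass)  # add copy mass per source token
```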
Note that for the multi-head attention, we obtain p_copy by averaging it over all heads. Let P_vocab(w) be the probability of generating token w, which is calculated through two linear layers with the concatenation of s_t and h*_t as input (see See et al. (2017) for details). The final probability distribution over the vocabulary extended with tokens from the input data is then:

P(w) = (1 − p_copy) P_vocab(w) + p_copy Σ_{i ∈ I(w)} a^t_i

where I(w) is the set of input positions whose token is w. The copy mechanism was mainly proposed to handle out-of-vocabulary (OOV) words (Vinyals et al., 2015; Gu et al., 2016). However, in our task, many of the table values are not OOV. The reason we employ the copy mechanism is to explicitly "teach" the model when and which fact to copy from the input data to improve faithfulness. We consider tokens of the aligned facts in the golden target as copied tokens, denoted by V_a. Following Chen et al. (2020c), we maximize the copy probability p_copy with an extra loss term at the copied tokens:

L = L_c + λ Σ_{j: w_j ∈ V_a} (1 − p_copy^j)    (1)

where L_c is the original loss between the model's output and the golden target, w_j is the target token at position j, and λ is a hyper-parameter representing the weight of the copy loss.

Experiments

Baselines

We compare our approach with the following pre-trained encoder-decoder baselines:

• BERT2BERT (Rothe et al., 2020): A sequence-to-sequence model with both the encoder and the decoder initialized from pre-trained BERT checkpoints.

• T5 (Raffel et al., 2020): A pre-trained text-to-text model using the transformer framework. T5 achieved state-of-the-art results on many text generation benchmarks, including ToTTo.
Note that for the baseline models, the input is the metadata concatenated with the table flattened row by row, without the additional table-aware embeddings introduced in Section 5.1.

Setup
We build the prefix-target pairs for training and validation by randomly selecting two prefixes for each sentence. Both the encoder and the decoder are initialized with pre-trained BERT (Rothe et al., 2020) or T5 (Raffel et al., 2020) parameters, with the remaining parameters initialized randomly. When initialized with BERT, the encoder and decoder do not share parameters; the learning rate is 5e-5, we use the linear learning rate scheduler with the Adam optimizer (Kingma and Ba, 2015), and we use beam search with a beam size of 4 during decoding. When initialized with T5, following Raffel et al. (2020), we employ a constant learning rate of 1e-3 with the AdaFactor optimizer (Shazeer and Stern, 2018), and decoding is conducted via greedy search. For the other settings (including the baselines), the batch size is 56, and the maximum numbers of input and output tokens are 512 and 128, respectively; tokens that exceed the maximum length are truncated. We tune the hyper-parameter λ of the copy weight (Equation 1) and set it to 0.4, which achieves the best overall performance. We train both the baselines and our approach with 8 NVIDIA Tesla V100 32G GPUs. The best checkpoint is chosen based on the fact coverage metric on the validation set.

Experimental Results
Table 2 shows the comparison between our approach and the baselines. We observe that 1) our approach outperforms the baseline methods on all metrics, and 2) on both data sources, our approach initialized with T5 achieves the best performance. The improvements on the faithfulness metrics are more significant. The results on the writing-suggestion metrics also demonstrate that our approach could help reduce writing effort with structured data in real applications.

Ablation Study
We conduct ablation studies to investigate the model designs of our approach: 1) the table structure-aware additional embeddings, 2) the structured encoder-decoder attention visibility, 3) the copy mechanism, and 4) the "causal with prefix" decoding mask pattern. Due to limited computation resources, we conduct ablation studies mainly for our approach initialized with BERT on the ToTTo source. The results of the different variants are listed in Table 4, where "w/o causal with prefix" means we replace it with the (unidirectional) causal mask.
The overall performance drops when we employ the unidirectional decoding mask on both sources, whether initialized with BERT or T5, suggesting that applying the bidirectional attention mask to the prefix is effective. On the ToTTo source data, when the parameters are initialized with BERT, the overall performance on all metrics drops without the encoder-decoder attention visibility (enc-dec attn visibility) or the copy mechanism. The results also suggest that introducing the table structure-aware column and row embeddings does not show improvements (the results are comparable); we leave further study on how to represent tables in transformer-based model structures as future work. Overall, the results demonstrate that these designs are effective for achieving improved performance.

Human Evaluation
In our task, some correct and faithful generated text may differ from the golden targets, which results in low scores under the automatic evaluation metrics above; the predictions of our models in Case #2 of Figure 5 are an example. To further evaluate the faithfulness of the generated targets, we randomly select 200 samples from the test set and ask annotators to judge the predictions in terms of factual and logical correctness. We assign a score of 3/2/1 to each generated text, indicating that the facts or logic are all/partially/not correct. Table 3 shows the averaged human evaluation scores. Compared with the baselines, our approach generates more faithful text on data from the ToTTo source, and when initialized with T5, our approach achieves the best overall scores on data from both sources. We also find that the performance is rather poor when the golden target contains logical inference over the data; we leave this as future work. Figure 5 shows the generated text of several cases for the baselines and our approach.

Case Study
Case #1 shows how the copy mechanism affects the generated text. Increasing the value of λ makes the model "reluctant" to generate new text beyond the table content; we find that the larger λ is, the shorter the output text. Thus, λ balances between quality (faithfulness) and diversity. Note that "to 1876" in Case #1 is faithful to the table, although it is not included in the target.
In Case #2, all baseline models generate unfaithful results while our models generate faithful ones; the output of our approach should be considered correct even though it differs from the golden target. This case demonstrates that, with encoder-decoder attention visibility, our model can focus on a specific sub-structure of the table to generate more faithful results.
In Case #3, the prefix is not sufficient to guide the model to generate factual or logical content. Our model still outperforms the baseline models: it attempts to generate text that involves logical inference. Although our model does not explicitly model logic, the logic here is relatively simple and does not require algebraic calculation over the numbers, which may explain this behavior.

Figure 5: Case studies. Text segments colored in green indicate content that is faithful to the data, and those colored in red indicate unfaithful content.
Case #4 shows that when the logic involved is complex, all models including ours fail to generate the correct result. We leave generating text with logical inference over the data as our future work.

Task Challenges
Logical Inference. Text generation with logical inference over the data is challenging in our task. For example, the golden target of Case #4 in Figure 5 requires calculation over the numerical values in the table.
Choosing between Fact and Logic. In TWT, the golden target contains both factual and logical text. The model should be capable of choosing what type of content to generate. For example, in Case #3 of Figure 5, the target sentence is factual while the model attempts to generate logical text, which leads to low evaluation results even though the predicted text is correct.

Evaluation Metrics. A good text generation model should be capable of generating diverse and faithful content, not limited to results close to the provided target; Case #2 is an example of this, where the results of Ours (init. from BERT2BERT) should be considered correct. We also find that the evaluation metrics are often inconsistent with one another: for example, a high BLEU score does not necessarily imply high fact coverage or PARENT scores.

Conclusion
In this paper, we propose Table with Written Text (TWT), a new controlled data-to-text generation dataset. For this task, we design a novel approach with table-aware attention visibility and a copy mechanism over the table. Experimental results show that our approach generates more faithful text than state-of-the-art pre-trained models under both automatic and human evaluation. For future work, we will focus on generating text with logical inference on TWT.