Data-to-text Generation by Splicing Together Nearest Neighbors

We propose to tackle data-to-text generation tasks by directly splicing together retrieved segments of text from “neighbor” source-target pairs. Unlike recent work that conditions on retrieved neighbors but generates text token-by-token, left-to-right, we learn a policy that directly manipulates segments of neighbor text, by inserting or replacing them in partially constructed generations. Standard techniques for training such a policy require an oracle derivation for each generation, and we prove that finding the shortest such derivation can be reduced to parsing under a particular weighted context-free grammar. We find that policies learned in this way perform on par with strong baselines in terms of automatic and human evaluation, but allow for more interpretable and controllable generation.


Introduction
There has been recent interest in text generation systems that make use of retrieved "neighbors" (examples of good text retrieved from a database, perhaps paired with the source information on which these example texts condition), in the hope that these neighbors might make a generation task easier, or the system more interpretable or controllable (Song et al., 2016; Weston et al., 2018; Guu et al., 2018; Zhang et al., 2018; Peng et al., 2019, inter alia).
Whereas most work along these lines has adopted a conventional encoder-decoder approach, conditioning on the retrieved neighbors and then autoregressively generating text token-by-token from left to right, we instead propose to generate text by directly splicing together segments of text from retrieved neighbors. Generating in this way aligns with the intuition that in settings such as data-to-text generation it ought to be sufficient to retrieve sentences similar to the one that must be generated, and then merely change some details, such as names or dates.
There are notable advantages to a generation-by-splicing approach. First, generation becomes more interpretable: it is always clear from which neighbor a particular piece of generated text derives, and it is also clear how these pieces have come together to form the generated text. Generation-by-splicing may also increase our control over the generated text, and we suspect that approaches that make clear the provenance of each piece of generated text (as ours does) will be useful in preventing text generation systems from emitting harmful or biased text (Sheng et al., 2019; Wallace et al., 2019; Gehman et al., 2020, inter alia). That is, we might imagine preventing systems from emitting harmful or biased text by only allowing generation from approved neighbor examples.
Methodologically, we implement this generation-by-splicing approach by training a policy to directly insert or replace spans of neighbor text at arbitrary positions within a partially constructed generation, and we define a generalized insert function capable of such manipulations in Section 3. We train this policy with "teacher forcing" (Williams and Zipser, 1989), which requires, for each training example, an oracle sequence of insert actions that derive it. Accordingly, we define a shortest sequence of actions deriving a training generation from its neighbors to be an oracle one, and we prove that, given some neighbors, an oracle sequence of actions can be obtained by parsing under a particular weighted context-free grammar, introduced in Section 3.1.1.
Empirically, we find our proposed approach yields text of comparable quality to strong baselines under automatic metrics and human evaluation on the E2E (Novikova et al., 2017) and WikiBio (Lebret et al., 2016) datasets, but with added interpretability and controllability (see Section 5.2). Our reduction of minimum-insertion generation to WCFG parsing may also be of independent interest. Our code is available at https://github.com/swiseman/neighbor-splicing.

Background and Notation
Conditional text generation tasks involve generating a sequence of tokens $\hat{y}_1, \ldots, \hat{y}_T = \hat{y}_{1:T}$ conditioned on some $x \in \mathcal{X}$, where each generated token $\hat{y}_t$ is from a vocabulary $\mathcal{V}$. We will consider in particular the task of table-to-text generation, where $x$ is some tabular data and $\hat{y}_{1:T}$ is a natural language description of it.
For supervision, we will assume we have access to a dataset, which pairs an input $x$ with a true corresponding reference text $y_{1:T_x} \in \mathcal{V}^{T_x}$ consisting of $T_x$ tokens. Since we are interested in nearest neighbor-based generation, we will also assume that along with each input $x$ we have a set $\mathcal{N} = \{\nu^{(n)}_{1:T_n}\}_{n=1}^{N}$ of $N$ neighbor sequences, with each $\nu^{(n)}_t \in \mathcal{V}$. We will be interested in learning to form $y_{1:T_x}$ from its corresponding $x$ and neighbor set $\mathcal{N}$ in a way that will be made more precise below. We note that finding an appropriate set of neighbor sequences to allow for successful generation with respect to an input $x$ is an interesting and challenging problem (see, e.g., Hashimoto et al. (2018)), but for the purposes of our exposition we will assume these neighbor sequences are easy to obtain given only $x$ (and without knowledge of $y$). We give the details of our simple retrieval approach in Section 5.

Imitation Learning for Text Generation
Much recent work views conditional text generation as implementing a policy $\pi : \mathcal{X} \times \mathcal{V}^* \rightarrow \mathcal{A} \cup \{\text{stop}\}$ (Bengio et al., 2015; Ranzato et al., 2016); see Welleck et al. (2019) for a recent review. That is, we view a generation algorithm as implementing a policy that consumes an input $x \in \mathcal{X}$ as well as a partially constructed output in the Kleene closure of $\mathcal{V}$, which we will refer to as a "canvas" (Stern et al., 2019), and which outputs either an action $a \in \mathcal{A}$ or a decision to stop. Taking action $a$ leads (deterministically in the case of text generation) to a new canvas, and so generation is accomplished by following $\pi$ from some distinguished start canvas until a stop decision is made, and returning the resulting canvas. For example, sequence-to-sequence style generation (Sutskever et al., 2014; Cho et al., 2014) implements a policy $\pi$ that consumes $x$ and a canvas $\hat{y}_{1:M} \in \mathcal{V}^M$ representing a prefix, and produces either an action $a \in \mathcal{A} = \mathcal{V}$, or else a stop action, in which case generation terminates. When an action $a \in \mathcal{V}$ is chosen, this leads to the formation of a new prefix canvas $\hat{y}_{1:M} \cdot a$, where $\cdot$ is the concatenation operator.
Imitation learning of text generation policies conventionally proceeds by "rolling in" to a canvas $\hat{y}_{1:M}$ using a roll-in policy, and then training a parameterized policy $\pi_\theta$ to mimic the actions of an oracle policy $\pi^*$ run from $\hat{y}_{1:M}$. The most common form of such training in the context of neural text generation is known as "teacher forcing" (Williams and Zipser, 1989), which simply amounts to using $\pi^*$ to roll in to a canvas $y_{1:M}$, viewing $\pi_\theta$ as a probabilistic classifier, and training $\pi_\theta$ using the action $\pi^*(x, y_{1:M})$ as its target.
We will adopt this policy-based perspective on conditional text generation, and we will attempt to learn a policy that generates text by splicing together text from retrieved neighbors. In order to do so, we will make use of a more general form of policy than that implemented by standard sequence-to-sequence models. Our policies will allow for both inserting an arbitrary span of neighbor text (rather than a single token) anywhere in a canvas, as well as for replacing an arbitrary span of text in a canvas with a span of neighbor text, as we make more precise in the next section. Before doing so, we note that there has been much recent interest in generalizing the forms of policy used in generating text; see Section 6 for references and a discussion.

Splicing Nearest Neighbors
Given a canvas $\hat{y}_{1:M} \in \mathcal{V}^M$ and a set of neighbor sequences $\mathcal{N} = \{\nu^{(n)}_{1:T_n}\}_{n=1}^{N}$, we define a generalized insertion function, which forms a new canvas from $\hat{y}_{1:M}$. This generalized insertion function implements the following mapping:

$\text{insert}(\hat{y}_{1:M}, i, j, n, k, l) = \hat{y}_{1:i} \cdot \nu^{(n)}_{k:l} \cdot \hat{y}_{j:M}, \quad (1)$

where $\cdot$ is again the concatenation operator, the slice indexing is inclusive, $0 \le i < j \le M + 1$, and $1 \le k \le l \le T_n$. Note that this generalized insertion function allows both for inserting a span into any position in the canvas (when $j = i + 1$), as well as for replacing a span anywhere in the canvas with another span (when $j > i + 1$), which results in the removal of tokens from the canvas.

Figure 1: Deriving a sentence from the WikiBio dataset, "Ayelet Nahmias-Verbin (born 19 June 1970) is an Israeli lawyer and politician." Top right: neighbor sequences $\nu^{(0)}, \nu^{(1)}, \nu^{(2)}$; $\nu^{(0)}$ is from the corresponding table. Bottom right: a sequence of insert operations (see Equation (1)) deriving the sentence from the neighbors above. Left: the parse of the target sentence under the grammar in Section 3.1.1 corresponding to the derivation on the bottom right.
Intuitively, this generalized insertion function attempts to capture a generation scheme where text is generated by making only minor insertions or replacements in some existing text. For example, we might imagine generating a new sentence by copying a neighbor sentence to our canvas, and then simply replacing the names or dates in this neighbor sentence to form a new sentence; see Figure 1 for an example.
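To make the mapping in Equation (1) concrete, the following is a minimal Python sketch of the generalized insertion function, operating on canvases and neighbors represented as token lists. The function name and the 0-indexed list representation are our own illustrative choices; the arguments $i, j, n, k, l$ follow the 1-indexed, inclusive-slice convention of Equation (1).

```python
def splice_insert(canvas, neighbors, i, j, n, k, l):
    """Return a new canvas: the first i canvas tokens, then neighbor
    tokens k..l (inclusive) of neighbor n, then canvas tokens j..M.

    Preconditions, as in Equation (1): 0 <= i < j <= len(canvas) + 1
    and 1 <= k <= l <= len(neighbors[n]).  When j == i + 1 this is a
    pure insertion; when j > i + 1 the canvas tokens i+1..j-1 are
    replaced by the neighbor span.
    """
    assert 0 <= i < j <= len(canvas) + 1
    assert 1 <= k <= l <= len(neighbors[n])
    return canvas[:i] + neighbors[n][k - 1:l] + canvas[j - 1:]


# Toy example: copy a neighbor sentence, then replace its date span.
neighbors = [
    ["born", "19", "June", "1970"],
    ["john", "doe", "(", "born", "1", "may", "1950", ")", "is", "a", "lawyer", "."],
]
canvas = splice_insert([], neighbors, 0, 1, 1, 1, 12)     # copy neighbor 1 wholesale
canvas = splice_insert(canvas, neighbors, 4, 8, 0, 2, 4)  # replace "1 may 1950"
print(canvas)
# ['john', 'doe', '(', 'born', '19', 'June', '1970', ')', 'is', 'a', 'lawyer', '.']
```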
Having defined this insertion function, we can generate text with a policy that consumes an input $x$, a set of neighbors, and a canvas, and outputs the arguments of the insert function, or else the stop action. Thus, for a given canvas and neighbor set, we take our policy to produce actions in $\mathcal{A}_{\text{ins}}$, the set of argument tuples $(i, j, n, k, l)$ to the insert function in Equation (1), or else the stop action.

An Oracle Policy
As described in Section 2.1, we are interested in learning a parameterized text generation policy $\pi_\theta$. Since we would like to generate using the generalized insert function in Equation (1), we will attempt to learn a parameterized distribution $\pi_\theta(\cdot \mid x, \hat{y}_{1:M}, \mathcal{N})$ over the arguments to this insert function given input $x$, canvas $\hat{y}_{1:M}$, and neighbor set $\mathcal{N}$, by training it with the one-hot action distribution $\pi^*(x, y_{1:M}, \mathcal{N})$ as its target. In order to do so, however, we must first obtain an oracle policy $\pi^*$. That is, for each true output $y_{1:T_x}$ in our dataset, we require an oracle sequence of canvases paired with corresponding oracle actions in $\mathcal{A}_{\text{ins}}$, which derive $y_{1:T_x}$ from $x$ and $\mathcal{N}$. In this section we suggest an approach to obtaining these.
In what follows, we will assume that each word type represented in $y_{1:T_x}$ is also represented among the neighbor sequences; in practice, we can always ensure this is the case by using the expanded neighbor set $\mathcal{N} \cup \mathcal{V}$, where the vocabulary $\mathcal{V}$ is viewed as containing spans of length one. Thus, policies will be able to emit any word in our vocabulary. Furthermore, because the source table $x$ itself will often also contain spans of words that might be used in forming $y_{1:T_x}$, going forward we will also assume $\mathcal{N}$ includes these spans from $x$.
In arriving at an oracle policy, we first note that given $\mathcal{N}$ there will often be many sequences of actions in $\mathcal{A}_{\text{ins}}$ that derive the reference text $y_{1:T_x}$ from an empty canvas. For instance, we can simulate standard left-to-right, token-by-token generation with a sequence of actions $(t-1, t, n_t, k_t, k_t)$ for $t = 1, \ldots, T_x$, where $\nu^{(n_t)}_{k_t} = y_t$; that is, each action appends a single neighbor token to the end of the canvas. However, other derivations, which insert or replace spans at arbitrary canvas locations, will often be available. We posit that derivations with fewer actions will be more interpretable, all else equal, and so we define our oracle policy to be that which derives $y_{1:T_x}$ from $\mathcal{N}$ (starting from an empty canvas) in as few actions as possible. We show this optimization problem can be reduced to finding the lowest-cost parse of $y_{1:T_x}$ under a particular weighted context-free grammar (WCFG) (Salomaa, 1969), which can be done in polynomial time with, for instance, the CKY algorithm (Kasami, 1966; Younger, 1967; Baker, 1979).

Reduction to WCFG Parsing
Consider the following WCFG in Chomsky Normal Form (Chomsky, 1959), where $S$ is the start non-terminal and where the bracketed number gives the cost of each rule application. We can see that there is a cost (of 1) for introducing a new neighbor span with an $S$ non-terminal, but no cost for continuing with the remainder of a neighbor already introduced (as represented by the $C$ and $R$ non-terminals).

Claim 1. Given neighbors $\{\nu^{(n)}_{1:T_n}\}_{n=1}^{N}$, the length of the shortest derivation of a sequence $y_{1:T_x}$ using actions in $\mathcal{A}_{\text{ins}}$ is equal to the cost of its lowest-cost derivation under the WCFG above.
We prove the above claim in Appendix A. The proof proceeds by two simulation arguments. First, we show that, given a derivation of $y_{1:T_x}$ with a certain weight under the WCFG, we can simulate it with a number of insert operations at most the total weight of the derivation and still obtain $y_{1:T_x}$. This implies that the cost of the optimal derivation under the WCFG is at least the optimal number of insert operations. Second, we show that, given a derivation of $y_{1:T_x}$ with a certain number of insert operations, we can simulate a subset of these operations, paying a cost of 1 in the grammar derivation per simulated insert operation. This implies that the optimal number of insert operations is at least the cost of the optimal derivation under the WCFG. Together, the two simulation arguments imply the claim.
Complexity  If $T$ is the maximum length of all sequences in $\{y_{1:T_x}\} \cup \mathcal{N}$ and $|\mathcal{N}| = N$, parsing under the above WCFG with the CKY algorithm is $O(NT^6)$. The runtime is dominated by matching the $S \rightarrow Y^{(n)}_{k:l} C^{(n)}_s$ rule; there are $O(NT^3)$ sequences that match the right-hand side (all $k \le l < s$ for all $\nu^{(n)}$), and we must consider this rule for each span in $y_{1:T_x}$ and each split point.
Obtaining the policy  Using Claim 1, we obtain an oracle action sequence deriving $y_{1:T_x}$ from its neighbors $\mathcal{N}$ by first computing the minimum-cost parse tree. As noted in Section 3.1, $\mathcal{N}$ is guaranteed to contain any word type in $y_{1:T_x}$. In practice, we ensure this by only adding word types to $\mathcal{N}$ that are not already represented in some neighbor, so that computed oracle parses use the neighbor sequences rather than the vocabulary. Given the minimum-cost parse tree, we then obtain a sequence of insert actions by doing a depth-first, left-to-right traversal of the tree. In particular, we can obtain all the arguments for an insert operation after seeing all the children of its corresponding $S$ non-terminal. For example, in Figure 1, the insert operations on the bottom right follow the order in which $S$ non-terminals are encountered in a left-to-right, depth-first traversal of the tree on the left; the arguments of the operation that introduces $\nu^{(2)}$, for example, are determined by keeping track of the corresponding $S$'s distance from the left sentence boundary and the length of the span it yields. We precompute these oracle derivations for each $(x, y_{1:T_x}, \mathcal{N})$ triplet in our training corpus.

Additional Oracle Policies
We will refer to policies derived as above as "FULL" policies. While FULL policies minimize the number of insert operations used in deriving $y_{1:T_x}$, there are at least two other reasonable neighbor-based oracle policies that suggest themselves. One is the oracle policy that derives $y_{1:T_x}$ from left to right, one token at a time. This policy is identical to that used in training sequence-to-sequence models, except each token comes from $\mathcal{N}$. In particular, a generated token is always copied from a neighbor sequence if it can be. We will refer to this policy as "LRT," for "left-to-right, token-level." Another oracle policy one might consider would allow for inserting spans rather than single tokens left-to-right, but, like FULL policies, would attempt to minimize the number of span insertion operations. While a greedy algorithm is sufficient for deriving such policies, in preliminary experiments we found them to consistently underperform both FULL and LRT, and so we do not consider them further.
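As an illustration of the LRT oracle, the sketch below derives a left-to-right, token-level action sequence in which each target token is copied from a neighbor whenever possible and otherwise emitted from the vocabulary. The tuples reuse the $(i, j, n, k, l)$ arguments of Equation (1); the "VOCAB" marker for vocabulary emissions is our own illustrative convention.

```python
def lrt_oracle(target, neighbors):
    """Derive `target` left-to-right, one token per action.

    Each action appends a single token to the end of the canvas:
    (i, j, n, k, l) with i = t - 1, j = t, and k = l pointing at a
    single neighbor token equal to target[t - 1].  If no neighbor
    contains the token, we fall back to a vocabulary emission,
    marked here with n = "VOCAB" (an illustrative convention).
    """
    actions = []
    for t, tok in enumerate(target, start=1):
        hit = None
        for n, nu in enumerate(neighbors):
            if tok in nu:
                hit = (n, nu.index(tok) + 1)  # 1-indexed neighbor position
                break
        if hit is not None:
            n, k = hit
            actions.append((t - 1, t, n, k, k))
        else:
            actions.append((t - 1, t, "VOCAB", tok, tok))
    return actions
```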

Models, Training, and Generation
To avoid directly parameterizing a distribution over the impractically large number of combinations of arguments to the insert function, we factorize the distribution over its arguments as

$\pi_\theta(i, j, n, k, l) = \pi_\theta(i, n, k) \, \pi_\theta(j, l \mid i, n, k), \quad (2)$

where we have left out the explicit conditioning on $x$, $\hat{y}_{1:M}$, and $\mathcal{N}$ for brevity. Thus, our policy first predicts an insertion of token $\nu^{(n)}_k$ after the canvas token $\hat{y}_i$. Conditioned on this, the policy then predicts the final token $\nu^{(n)}_l$ of the inserted span, and which canvas token $\hat{y}_j$ immediately follows it.
More concretely, we obtain token-level representations $x_1, \ldots, x_S$ and $\hat{y}_0, \ldots, \hat{y}_{M+1}$, all in $\mathbb{R}^d$, of the source sequence $x = x_{1:S}$ and of the canvas sequence $\hat{y}_{1:M}$, padded on each side with a special token, by feeding them to an encoder-decoder style transformer (Vaswani et al., 2017) with no causal masking. We obtain neighbor token representations $\nu^{(n)}_k$ by feeding neighbor sequences through the same encoder transformer that consumes $x$. We provide additional architectural details in Appendix D.
Viewing the source sequence $x$ as the 0th neighbor, we then define $\pi_\theta(i, n, k)$ such that the normalization is over all pairings of a canvas token with either a neighbor or source token, where $W_0$, $W_1$, and $W_2$, all in $\mathbb{R}^{d \times d}$, are learnable transformations. Similarly, we define $\pi_\theta(j, l \mid i, n, k)$ such that the normalization only considers pairings of the $j$th canvas token with the $l$th neighbor or source token, where $j > i$ and $l \ge k$; $W_3$, $W_4$, and $W_5$ are again learnable and in $\mathbb{R}^{d \times d}$.
While the above holds for FULL policies, LRT policies always insert after the most recently inserted token, and so they require only a $\pi_\theta(n, k \mid i)$ policy, and only the $W_0$ and $W_2$ transformations.
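Although the exact scoring functions for these distributions are not reproduced above, the sketch below illustrates, purely for intuition, what a softmax over all pairings of a canvas token with a neighbor or source token can look like. The bilinear score and the tensor shapes are our own assumptions for illustration, not the paper's parameterization.

```python
import torch

def first_factor_logprobs(canvas_reps, nbr_reps, W0, W1):
    """Log-probabilities over all (i, (n, k)) pairings.

    canvas_reps: (M+2, d) canvas token representations (with padding).
    nbr_reps:    (N+1, T, d) neighbor token representations, with the
                 source x treated as the 0th neighbor.
    The bilinear score (W0 @ canvas_i) . (W1 @ nbr_nk) is only an
    illustrative stand-in for the model's actual scoring function.
    """
    a = canvas_reps @ W0.T                       # (M+2, d)
    b = nbr_reps @ W1.T                          # (N+1, T, d)
    scores = torch.einsum("id,ntd->int", a, b)   # (M+2, N+1, T)
    flat = scores.reshape(-1)
    # normalize over every (canvas token, neighbor-or-source token) pairing
    return torch.log_softmax(flat, dim=0).view_as(scores)
```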

Training
As noted in Section 3.1, we propose to train our policies to imitate the oracle derivations obtained from the CKY parse, using teacher forcing. Suppose that, for a given (oracle) canvas $y_{1:M}$ and set of neighbors $\mathcal{N}$, the oracle next action obtained from the parse is $(i^*, j^*, n^*, k^*, l^*)$. Since there may be multiple spans in $\mathcal{N}$ that are identical to $\nu^{(n^*)}_{k^*:l^*}$, we train the $\pi_\theta(i, n, k)$ policy to minimize the negative log of the total probability it assigns to all pairings $(i^*, n, k)$ whose corresponding neighbor spans match the oracle span (loss (3)). The $\pi_\theta(j, l \mid i, n, k)$ policy is simply trained to minimize the negative log probability of $(j^*, l^*)$ (loss (4)), since there is one correct target given $i^*$, $n^*$, $k^*$. Training proceeds by sampling a mini-batch of examples and their derivations, and minimizing the sums of the losses (3) and (4) over each action in each derivation, divided by the mini-batch size.
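A minimal sketch of these two teacher-forcing losses, assuming the first-factor log-probabilities are available as a tensor over (canvas position, neighbor, start token) and that a boolean mask marks the entries whose spans match the oracle span; the tensor layout and function names are illustrative assumptions.

```python
import torch

def full_policy_loss(logp1, match_mask, logp2, j_star, l_star):
    """Teacher-forcing loss for a single oracle action.

    logp1:      (M+2, N+1, T) log-probs for the (i, n, k) factor.
    match_mask: (M+2, N+1, T) boolean; True where i == i* and the
                neighbor span starting at (n, k) equals the oracle span.
    logp2:      (M+2, T) log-probs for the (j, l) factor given (i*, n*, k*).
    Loss (3) marginalizes over all matching spans; loss (4) has a
    single correct target.
    """
    loss3 = -torch.logsumexp(logp1[match_mask], dim=0)
    loss4 = -logp2[j_star, l_star]
    return loss3 + loss4
```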

Generation
For FULL models, we generate with beam search, following the factorization in Equation (2). At each iteration, the beam first contains the top-$K$ partial hypotheses that can be constructed by predicting the $i, n, k$ arguments to the insert function given the current canvas and neighbors. Given these, the remaining $j, l$ arguments are predicted, and the top $K$ of these are kept for the next iteration. We search up to a maximum number of actions, and in computing the final score of a hypothesis, we average the $\pi_\theta$ log probabilities over all the actions taken to construct the hypothesis (rather than summing).
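The following is a compact sketch of this two-stage beam expansion, with the trained policy exposed through two hypothetical callables (`first_candidates`, `second_candidates`) that return scored argument candidates. Stop handling is simplified to a fixed action budget, and hypotheses are ranked by the average of their action log-probabilities, as described above.

```python
def beam_search_full(first_candidates, second_candidates, apply_action,
                     init_canvas, K=5, max_actions=8):
    """Two-stage beam search over insert actions (illustrative sketch).

    first_candidates(canvas)            -> [((i, n, k), logp), ...]
    second_candidates(canvas, i, n, k)  -> [((j, l), logp), ...]
    apply_action(canvas, action)        -> new canvas (e.g. splice_insert).
    Hypotheses are scored by the mean log-prob of their actions; the
    stop action is omitted here and replaced by a fixed action budget.
    """
    beam = [(init_canvas, [])]  # (canvas, list of per-action log-probs)
    for _ in range(max_actions):
        partial = []
        for canvas, lps in beam:
            for (i, n, k), lp in first_candidates(canvas):
                partial.append((canvas, lps, (i, n, k), lp))
        # keep the top-K partial hypotheses before predicting (j, l)
        partial.sort(key=lambda h: sum(h[1]) + h[3], reverse=True)
        expanded = []
        for canvas, lps, (i, n, k), lp1 in partial[:K]:
            for (j, l), lp2 in second_candidates(canvas, i, n, k):
                new_canvas = apply_action(canvas, (i, j, n, k, l))
                expanded.append((new_canvas, lps + [lp1 + lp2]))
        expanded.sort(key=lambda h: sum(h[1]) / len(h[1]), reverse=True)
        beam = expanded[:K]
    return max(beam, key=lambda h: sum(h[1]) / len(h[1]))[0]
```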
For LRT models, we generate with standard left-to-right, token-level beam search. We note that in this setting it is common to marginalize over all occurrences of a word type (e.g., among neighbors or in the table) in calculating its probability. While this generally improves performance (see below), it also hurts interpretability, since it is no longer clear which precise neighbor or source token gives rise to a predicted token. Below we report results in both the standard marginalization setting, and in a no-marginalization ("no-marg") setting.

Experiments
Our experiments are designed to test the quality of the text produced under FULL policies and LRT policies, as well as whether such policies allow for more controllable or interpretable generation.
Datasets  We expect our approach to work best for tasks where different generations commonly share surface characteristics. Table-to-text tasks meet this requirement, and are accordingly often used to evaluate generation that makes use of retrieved neighbors (Lin et al., 2020) or induced templates (Wiseman et al., 2018; Li and Rush, 2020). Following recent work, we evaluate on the E2E (Novikova et al., 2017) and WikiBio (Lebret et al., 2016) datasets.
Preprocessing  We whitespace-tokenize the text, and mask spans in neighbor sequences that appear in their corresponding sources, which discourages derivations from copying content words from neighbors. We pad each $y$ and $\nu^{(n)}$ sequence with beginning- and end-of-sequence tokens, which encourages derivations that insert into the middle of sequences, rather than merely concatenating spans.
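As a rough sketch of this preprocessing, assuming a token-level approximation of the span masking described above (the pipeline as described operates on spans, and the `<mask>`, `<s>`, `</s>` symbols are illustrative):

```python
def preprocess_neighbor(neighbor_tokens, source_value_tokens,
                        mask="<mask>", bos="<s>", eos="</s>"):
    """Mask neighbor tokens that appear among the neighbor's own source
    values (discouraging copying content words from neighbors), then pad
    with beginning- and end-of-sequence tokens (encouraging insertions
    into the middle of sequences rather than pure concatenation)."""
    source_vocab = set(source_value_tokens)
    masked = [mask if tok in source_vocab else tok for tok in neighbor_tokens]
    return [bos] + masked + [eos]
```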
Obtaining Neighbors  We precompute neighbors for each example, taking the top-scoring 20 neighbors from the training data (excluding the example itself) under a simple score $s(\cdot, \cdot)$ defined over pairs of inputs in $\mathcal{X}$.
For the E2E and WikiBio datasets, we define $s(x, x') = F_1(\text{fields}(x), \text{fields}(x')) + 0.1\,F_1(\text{values}(x), \text{values}(x'))$, where fields extracts the field types (e.g., "name") from the table $x$, values extracts the unigrams that appear as values in $x$, and $F_1$ is the $F_1$-score.
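A minimal Python sketch of this scoring function, treating fields and values as sets of strings (the set-based $F_1$ is an assumption; the original may count duplicates differently):

```python
def f1(a, b):
    """Set-based F1 between two collections of strings."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def neighbor_score(fields_x, values_x, fields_x2, values_x2):
    """s(x, x') = F1 over field types + 0.1 * F1 over value unigrams."""
    return f1(fields_x, fields_x2) + 0.1 * f1(values_x, values_x2)
```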
Baselines  We compare FULL policies to LRT policies, to a transformer-based sequence-to-sequence model with a copy mechanism (Gu et al., 2016) that uses no retrieved neighbors (henceforth "S2S+copy"), and to recent models from the literature (see below). The S2S+copy model uses a generation vocabulary limited to the 30k most frequent target words. The neighbor-based policies, on the other hand, are limited to generating (rather than copying) only from a much smaller vocabulary consisting of target words that occur at least 50 times in the training set and which cannot be obtained from the target's corresponding neighbors.
Additional Details  All models are implemented using 6-layer transformer encoders and decoders, with model dimension 420, 7 attention heads, and feed-forward dimension 650; all models are trained from scratch. We train with Adam (Kingma and Ba, 2015); additional model and training details are given in Appendix D. We include sample generations from all systems in Appendix F, and additional visualizations of some FULL generations in Figure 3.

Quality Evaluation
We first evaluate our models and baselines using the standard automatic metrics associated with each dataset, including BLEU (Papineni et al., 2002), NIST, ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015), and METEOR (Banerjee and Lavie, 2005), in Table 1. There we also compare with a prior model that uses retrieved neighbors, with the model of Li and Rush (2020), which produces interpretable segmentations, and with the model of Chen et al. (2020) ("KGPT" in tables), which is a fine-tuned, large pretrained model and which we take to be close to the state of the art. We first note that our baselines are quite strong, largely outperforming previous work, including large pretrained models. In the case of E2E, we find that the FULL model slightly outperforms these strong baselines and attains, we believe, state-of-the-art performance in the setting where no pretrained models or data augmentation are used. (See Chang et al. (2021) for even better results without these restrictions.) On WikiBio, however, FULL slightly underperforms the strongest baselines.

Human Evaluation  In Table 2 we show the (average) results of a human evaluation conducted following the methodology described in Reiter (2017). We ask crowd-workers on Amazon Mechanical Turk to score generations in terms of their naturalness, their faithfulness to the source table, and their informativeness, on a 5-point Likert scale. (Note that, following Dhingra et al. (2019), we ask about informativeness rather than usefulness.) We score a total of 45 random examples from each test dataset, with each generation being rated by 3 crowd-workers, and each crowd-worker seeing a generation from each system. We ran multi-way ANOVAs with system type (i.e., FULL, LRT, or S2S+copy), example index, and crowd-worker id as independent variables, and the rating as the dependent variable.
The only significant interaction involving system type was with respect to "faithfulness" on the WikiBio dataset (p < 0.018), though this does not reflect a necessary correction accounting for the multiple comparisons implied by crowd-workers rating along 3 dimensions. Furthermore, under Tukey's HSD test no significant pairwise (i.e., between system pairs) interactions were found in any setting. Thus, we find no significant difference between any system pairs according to crowd-workers, although (as with the automatic metrics) FULL performs slightly better on E2E and worse on WikiBio. We give the precise p-values as well as more details about the questions crowd-workers were asked in Appendix E.
We also conduct a manual analysis of the faithfulness errors made by FULL generations in Appendix B; we generally find that FULL generations do not hallucinate more than S2S+copy generations (the most faithful generations according to crowd-workers), but they do more frequently contradict information in the source table. This generally occurs when a span containing information that contradicts the table is copied to the canvas, and this information is not subsequently replaced; see Appendix B for more details and a discussion.

Interpretability Evaluation
We first emphasize that, on an intuitive level, we believe FULL policies lead to significantly more interpretable generation than token-level policies. This is because FULL policies give an explicit (and often short) span-based derivation of a generated text in terms of neighbor text. We show visualizations of two randomly chosen generations from the WikiBio validation set, along with their derivations, in Figure 3.
However, it is difficult to precisely quantify the interpretability of a text generation model, and so we now quantify several aspects of our models' predictions that presumably correlate with their interpretability. Table 3 shows the average length of the derivations (i.e., how many insert operations are required to form a generation), the average number of neighbors used in forming a prediction, and the percentage of generated tokens copied from a neighbor or the source $x$ (rather than generated from the model's output vocabulary) for FULL and LRT-no-marg policies over 500 randomly chosen examples from the E2E and WikiBio validation sets. All else equal, we expect fewer insert operations, fewer neighbors, and more tokens copied from neighbors (respectively) to correlate with better interpretability. We find that FULL generations require many fewer insert operations and distinct neighbors per generation on average than LRT generations, although they use their output vocabulary slightly more than LRT-no-marg. Note that we use LRT-no-marg for this comparison because marginalization obscures whether a predicted token is from a neighbor.
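A small sketch of how these statistics can be computed from a predicted derivation, represented as a list of (i, j, n, k, l) actions plus a per-token record of whether each output token was copied; this representation is an illustrative assumption.

```python
def derivation_stats(actions, copied_flags):
    """Compute the interpretability statistics discussed above.

    actions:      list of (i, j, n, k, l) insert actions for one generation.
    copied_flags: list of booleans, one per generated token, True if the
                  token was copied from a neighbor or the source table.
    Returns (number of insert ops, number of distinct neighbors used,
    fraction of tokens copied).
    """
    num_ops = len(actions)
    num_neighbors = len({n for (_, _, n, _, _) in actions})
    frac_copied = sum(copied_flags) / max(len(copied_flags), 1)
    return num_ops, num_neighbors, frac_copied
```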
The fact that FULL policies use so few distinct neighbors per example motivates asking how well these policies perform at test time when given fewer neighbors than they are trained with (namely, 20 in all experiments). We plot the average validation ROUGE of FULL and LRT for both datasets (using ROUGE-4 for WikiBio and ROUGE-L for E2E, as is conventional) against the number of neighbors used at generation time in Figure 2. We see that while using fewer neighbors hurts both types of policies, FULL outperforms LRT when given very few neighbors.
Controllability  Another approach to evaluating the interpretability of a model is to use our understanding of the model's prediction process to control it, and then evaluate controllability. In Appendix C we describe, as a case study, attempting to control the number of sentences used in E2E dataset generations by controlling the neighbors; we find that FULL significantly outperforms LRT policies in ensuring that generations have at least three sentences.

Related Work
NLP systems have incorporated neighbors for decades. Early work focused on machine translation (Sumita and Hitoshi, 1991), syntactic disambiguation (Cardie, 1994), and tagging (Daelemans, 1993; Daelemans et al., 1996). While some more recent work has made use of retrieved neighbors for problems such as sequence labeling (Wiseman and Stratos, 2019), auditing multi-label text classification predictions (Schmaltz and Beam, 2020), and reasoning over knowledge bases (Das et al., 2020, 2021), the majority of recent NLP work involving neighbor-based methods has focused on conditioning neural text generation systems on retrieved neighbors. This conditioning is variously accomplished using a conventional encoder in an encoder-decoder setup (Song et al., 2016; Weston et al., 2018; Gu et al., 2018b; Cao and Xiong, 2018; Bapna and Firat, 2019), by allowing the parameters of the decoder to depend on the retrieved neighbor, or by viewing the unknown neighbor as a latent variable (Hashimoto et al., 2018; Guu et al., 2018; Chen et al., 2019; He et al., 2020). Recent work (Zhang et al., 2018; Khandelwal et al., 2019, 2020) has also used retrieved neighbors at decoding time to modify the next-token distribution of the decoder. Our work differs from these approaches in that we explicitly parameterize the splicing operations that form a generation from neighbors, rather than conditioning or otherwise modifying a left-to-right token generation model using retrieved neighbors. Our parameterization is motivated by trying to increase the interpretability and controllability of the generation process, which also motivates recent work making explicit the template or plan being followed, as well as recent insertion- and edit-based generation approaches (Gu et al., 2019b,a, inter alia). Our work differs from this last category in several important respects: first, we insert and replace (and model) full spans rather than tokens. Our policies are trained to minimize the number of insertion operations, rather than to insert (centrally positioned) correct tokens in available slots, as in the Insertion Transformer (Stern et al., 2019), or to mimic a Levenshtein distance-based oracle, as in LevT (Gu et al., 2019b). Our policies are also fundamentally sequential, unlike these partially autoregressive alternatives, which can generate tokens in parallel. The sequential nature of our approach makes using beam search straightforward (unlike in token-parallel approaches) and, we think, leads to interpretable, serial derivations. On the other hand, decoding serially with beam search will generally be slower than the iterated parallel decoding of partially autoregressive models.
Our work also relates to recent work on sentence-level transduction tasks, like grammatical error correction (GEC), which allows for directly predicting certain span-level edits (Stahlberg and Kumar, 2020). These edits are different from our insertion operations, requiring token-level operations except when copying from the source sentence, and are obtained, following a long line of work in GEC (Swanson and Yamangil, 2012; Xue and Hwa, 2014; Felice et al., 2016; Bryant et al., 2017), by heuristically merging token-level alignments obtained with a Damerau-Levenshtein-style algorithm (Brill and Moore, 2000).

Conclusion
We have presented an approach to data-to-text generation, which directly splices together retrieved neighbors. We believe this line of work holds promise for improved interpretability and controllability of text generation systems.
In future work we hope to tackle more ambitious text generation tasks, which will likely require retrieving many more neighbors, perhaps dynamically, from larger data-stores, and with more sophisticated retrieval techniques, such as those currently being used in retrieval-based pretraining (Lewis et al., 2020a;Guu et al., 2020).
We also hope to consider more sophisticated models, which explicitly capture the history of produced canvases, and more sophisticated training approaches, which search for optimal insertions while training, rather than as a preprocessing step (Daumé III et al., 2009;Ross et al., 2011).

A Proof of Claim 1
Given sequences $\nu^{(1)}, \ldots, \nu^{(N)}$ and a target sequence $y$, let $C$ be the minimum integer such that there exist sequences $y^{(0)}, \ldots, y^{(C)}$ with $y^{(0)} = \varepsilon$ being the empty sequence, $y^{(C)} = y$, and $y^{(c)} = \text{insert}(y^{(c-1)}, i, j, n, k, l)$ for $c = 1, \ldots, C$, where $M = |y^{(c-1)}|$, $0 \le i < j \le M + 1$, and $1 \le k \le l \le |\nu^{(n)}|$. Let $\text{cost}_{\text{ins}}(y)$ be this minimum $C$, equivalent to the length of the shortest derivation of $y$ with actions in $\mathcal{A}_{\text{ins}}$, and let $\text{cost}_{\text{CFG}}(y)$ be the cost of the minimum-cost derivation of $y$ under the WCFG in Section 3.1.1. We want to show that $\text{cost}_{\text{ins}}(y) = \text{cost}_{\text{CFG}}(y)$, which we accomplish by showing that $\text{cost}_{\text{ins}}(y) \le \text{cost}_{\text{CFG}}(y)$ and that $\text{cost}_{\text{ins}}(y) \ge \text{cost}_{\text{CFG}}(y)$.

Proposition 1. $\text{cost}_{\text{ins}}(y) \le \text{cost}_{\text{CFG}}(y)$.
Proof. Consider the minimum-cost derivation tree of $y$ under the WCFG in Section 3.1.1, and let $C$ be the minimum cost. We show inductively that there exists a sequence of at most $C$ insert operations that yields $y$ from $\varepsilon$, by considering two cases.

Case 1: $y$ is derived using only the first two grammar rules, and so is a concatenation of sequences $\nu^{(n)}_{k:l}$ derived from the first grammar rule. If there are $C$ such sequences, the WCFG derivation costs $C$. Also, constructing this sequence using insertions requires at most $C$ insertions.

Case 2: $y$ is derived using at least one application of the third grammar rule and can be written as $y = s^{(1)} \cdot y^{(1)} \cdot s^{(2)} \cdot y^{(2)} \cdot \ldots \cdot y^{(Q)} \cdot s^{(Q+1)}$, where the $s^{(q)}$ sequences are all from the same $\nu^{(n)}$ (using the last three grammar rules), and the $y^{(q)}$ sequences are derived from an $S$ non-terminal. We can derive $y$ using insert operations by inserting a substring of $\nu^{(n)}$ containing all the $s^{(q)}$ (which costs 1 under the grammar), then inserting the remaining $y^{(q)}$ recursively, which, by induction, costs at most $\text{cost}_{\text{CFG}}(y^{(q)})$. The total number of insertions is then at most $\text{cost}_{\text{CFG}}(y)$.

Proposition 2. $\text{cost}_{\text{ins}}(y) \ge \text{cost}_{\text{CFG}}(y)$.
Proof. By induction: if $\tilde{y}^{(c)}$ is non-interleaving, then so is $\tilde{y}^{(c+1)}$, by its construction from $\tilde{y}^{(c)}$. Now let $y' = \tilde{y}^{(C)}$ to simplify notation. Let $\text{distinct}(y') \le C$ be the number of distinct integers in the sequence $y'$. We show how to derive $y'$ from the grammar with derivation cost $\text{distinct}(y')$, which proves the proposition. We call an integer $c'$ contiguous in $y'$ if $y'$ can be written as $y' = y'', c', c', \ldots, c', c', y'''$ such that the sequences $y''$ and $y'''$ do not contain $c'$. We derive $y'$ from the grammar inductively, by considering two cases.
Case 1: all integers in $y'$ are contiguous, so $y'$ consists of contiguous blocks of repeated integers. Let $b$ be the number of blocks. We invoke the second grammar rule $b - 1$ times to get $S$ repeated $b$ times, and then invoke the first grammar rule for each $S$ to derive the contiguous sequence $\nu^{(n)}_{k:l}$ corresponding to the block. This costs $\text{distinct}(y') = b$ in total, as required.
Case 2: there is an integer in $y'$ that is not contiguous. Let $c'$ be the left-most non-contiguous integer in $y'$. $c'$ splits $y'$ into several shorter sequences $y'^{(1)}, \ldots, y'^{(Q)}$, where each sequence $y'^{(q)}$ does not contain any copy of the integer $c'$. Since $y'$ is non-interleaving (by Observation 1), $y'^{(q)}$ and $y'^{(q')}$ do not share any integers for $q \ne q'$. Therefore, $\text{distinct}(y') = 1 + \text{distinct}(y'^{(1)}) + \ldots + \text{distinct}(y'^{(Q)})$. Furthermore, each sequence $y'^{(q)}$ is non-interleaving. Therefore, we can derive the subsequence of $\tilde{y}^{(C)}$ corresponding to $y'^{(q)}$ from the non-terminal $S$, and it costs $\text{distinct}(y'^{(q)})$ by induction. To finish the proof, we need to show that we can combine the resulting sequences into the sequence corresponding to $y'$ by paying an additional cost of only 1. We can do that by using the last three rules of the grammar, where the rule of cost 1 is applied only once. In particular, we pay 1 to derive the sequence corresponding to the first block of integers $c'$ in $y'$. This sequence is derived from $Y^{(n)}_{k:l}$, which comes from the rule $S \rightarrow Y^{(n)}_{k:l} C^{(n)}_s$.

B Manual Analysis of WikiBio Errors
In Table 4 we report the results of a manual analysis of the faithfulness errors in FULL and S2S+copy generations on the WikiBio test set. We find that the FULL model hallucinates at approximately the same rate as does S2S+copy. However, it also generates more explicit contradictions. These tend to occur when a span containing contradictory information is copied to the canvas, but is not subsequently edited. We suspect that incorporating additional losses, such as a round-trip reconstruction loss (Tu et al., 2017), will be helpful here. It is notable that the FULL generations struggle with implicit contradiction more than S2S+copy. Some examples of implicit contradiction we observed include when a person's gender or nationality is strongly suggested by their name, or their area of study by their thesis title, despite this information not being explicit in the table. We suspect that bigger models, especially if they store state in addition to the canvas (which our FULL models do not), will better address these cases.

C Controllability Case Study
We briefly consider a case-study that exemplifies controlling generation by controlling the neighbors used at test time. We consider in particular a situation where control is much more easily accomplished under FULL policies than token-level policies.
Some examples in the E2E dataset consist of only a single sentence (e.g., "The Golden Curry is a non family friendly Indian restaurant with an average rating located in the riverside area near Cafe Rouge."), while others split the description into multiple sentences. We consider requiring the generated text to consist of at least 3 sentences, which is interesting and challenging for two reasons. First, only about 8% of the training examples have ≥ 3 sentences. Second, while it is sometimes possible to force the generations of token-level models to obey structural constraints by constraining beam search (e.g., by disallowing certain tokens depending on the context), a constrained beam search does not make it easy to guarantee the presence of certain structural features. Specifically, while it is easy to constrain beam search so that hypotheses with too many sentences are kept off the beam, it is unclear how to ensure beam search finds only (or even any) hypotheses with enough sentences.
We accordingly restrict both the FULL and LRT models to use only neighbors with ≥ 3 sentences (as determined by a regular expression) when generating on the E2E development set. We find that 87.2% of the resulting FULL generations have ≥ 3 sentences, while only 73.5% of the LRT generations do. Furthermore, the quality of the resulting text remains high, with a ROUGE score of 67.6 for the FULL generations and 71.7 for LRT. (Note this comparison unfairly favors LRT, which generates many fewer of the rare ≥ 3 sentence generations).
When the FULL model fails to respect the constraint it is because it has inserted text that replaces the end of a sentence (or two). We can reach 100% constraint satisfaction by simply constraining the FULL model's beam search to never replace a full sentence in the canvas. As noted above, we cannot easily constrain the LRT beam search to reach 100% constraint satisfaction.
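A small sketch of this additional constraint, written as a predicate that a constrained beam search could apply to candidate actions; representing sentence boundaries by the positions of sentence-final periods is an illustrative simplification.

```python
def removes_full_sentence(canvas, i, j):
    """Return True if the action's replaced span, canvas tokens i+1 .. j-1
    (1-indexed, inclusive), contains an entire sentence.  Candidate
    actions for which this returns True can be pruned from the beam,
    which yields 100% constraint satisfaction in the case study above."""
    period_positions = [p for p, tok in enumerate(canvas) if tok == "."]
    for p in period_positions:
        # the sentence ending at p starts just after the previous period
        start = max([q + 1 for q in period_positions if q < p], default=0)
        if start >= i and p <= j - 2:  # whole sentence lies inside the span
            return True
    return False
```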

D Additional Model and Training Details
Our models are BART (Lewis et al., 2020b)-style encoder-decoder transformers. They consume embeddings of the linearized source tokens $x$ and the current canvas $\hat{y}_{1:M}$ (plus positional embeddings). To allow the model to capture how recently tokens were added to the canvas, we add to each canvas token embedding an embedding of a feature indicating how many time-steps have elapsed since it was added. We also add to each $x$ token embedding the embedding of an indicator feature for whether it has been copied to $\hat{y}_{1:M}$. We obtain neighbor embeddings by feeding neighbor token embeddings, plus positional embeddings, plus the embedding of an indicator feature marking them as neighbor tokens, through the same encoder that consumes $x$.
We trained with Adam (Kingma and Ba, 2015). We linearly warm-up the learning rate during the first 4,000 training steps, and then use square-root learning rate decay as in Devlin et al. (2019) after warm-up. To stabilize training we accumulate gradients over 400 target sequences.
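A minimal sketch of the learning-rate schedule described here (linear warmup over the first 4,000 steps, then inverse-square-root decay); the base learning rate and the exact decay constant are placeholders.

```python
def learning_rate(step, base_lr, warmup_steps=4000):
    """Linear warmup followed by inverse-square-root decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (warmup_steps / step) ** 0.5
```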
We show the training and prediction hyperparameter bounds we considered in Table 5. We selected combinations at random for 50 one-epoch trials for each model, evaluating on validation negative log-likelihood. Our final hyperparameter values are in Table 6.

E Human Evaluation Details
We selected 45 random examples from each of the E2E and WikiBio test sets for use as crowd-worker prompts. Each example was rated by 3 crowd-workers, and each crowd-worker rated each of the 3 systems. We excluded the 11 responses that did not provide all 9 ratings (3 ratings for each of 3 examples). We show a screenshot of the questions asked of Mechanical Turk crowd-workers, given a table and generated description, in Figure 4. We show the results of significance tests in Table 7.

F Sample Generations
We provide 5 random generations from each of FULL, LRT, and S2S+copy on the E2E and WikiBio test-sets in Tables 9 and 10.