WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections

Datasets for data-to-text generation typically focus either on multi-domain, single-sentence generation or on single-domain, long-form generation. In this work, we cast generating Wikipedia sections as a data-to-text generation task and create a large-scale dataset, WikiTableT, that pairs Wikipedia sections with their corresponding tabular data and various metadata. WikiTableT contains millions of instances, covering a broad range of topics, as well as a variety of flavors of generation tasks with different levels of flexibility. We benchmark several training and decoding strategies on WikiTableT. Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they struggle with coherence and factuality, showing the potential for our dataset to inspire future work on long-form generation.


Introduction
Data-to-text generation (Kukich, 1983; McKeown, 1992) is the task of generating text based on structured data. Most existing data-to-text datasets focus on single-sentence generation, such as WIKIBIO (Lebret et al., 2016), LogicNLG (Chen et al., 2020), and ToTTo (Parikh et al., 2020). Other datasets are relatively small-scale and focus on long-form text generation, such as ROTOWIRE (Wiseman et al., 2017) and MLB (Puduppully et al., 2019). In this work, we cast generating Wikipedia sections as a data-to-text generation task and build a large-scale dataset targeting multi-sentence data-to-text generation with a variety of domains and data sources.
To this end, we create a dataset that we call WIKITABLET ("Wikipedia Tables to Text") that pairs Wikipedia sections with their corresponding tabular data and various metadata. The data resources we consider are relevant either to entire Wikipedia articles, such as Wikipedia infoboxes and Wikidata tables, or to particular sections. Data from the latter category is built automatically from either naturally-occurring hyperlinks or from named entity recognizers. This data construction approach allows us to collect large quantities of instances while still ensuring the coverage of the information in the table. We also perform various types of filtering to ensure dataset quality.
WIKITABLET contains millions of instances covering a broad range of topics and a variety of flavors of generation with different levels of flexibility. Figure 1 shows two examples from WIKITABLET. The first instance has more flexibility as it involves generating a fictional character biography in a comic book, whereas the second is more similar to standard data-to-text generation tasks, where the input tables contain all of the necessary information for generating the text. While the open-ended instances in WIKITABLET are to some extent similar to story generation (Propp, 1968; McIntyre and Lapata, 2009; Fan et al., 2018), the fact that these instances are still constrained by the input tables enables different evaluation approaches and brings new challenges (i.e., being coherent and faithful to the input tables at the same time).
Because of the range of knowledge-backed generation instances in WIKITABLET, models trained on our dataset can be used in assistive writing technologies for a broad range of topics and types of knowledge. For example, technologies can aid students in essay writing by drawing from multiple kinds of factual sources. Moreover, WIKITABLET can be used as a pretraining dataset for other relatively small-scale data-to-text datasets (e.g., ROTOWIRE). A similar idea that uses data-to-text generation to create corpora for pretraining language models has shown promising results (Agarwal et al., 2021).
In experiments, we train several baseline models on WIKITABLET and empirically compare training and decoding strategies. We find that the best training strategies still rely on enforcing hard constraints to avoid overly repetitive texts. Human evaluations reveal that (1) humans are unable to differentiate the human written texts from the generations from our neural models; (2) while the annotations show that grammatical errors in the reference texts and the generations may prevent humans from fully understanding the texts, the best decoding strategy (i.e., beam search with n-gram blocking (Paulus et al., 2018)) does not have such a problem and shows the best performance on several aspects; (3) the degree of topical similarity between the generations and the reference texts depends on the open-endedness of the instances.
Our analysis shows that the generations are fluent and generally have high quality, but the models sometimes struggle to generate coherent texts for all the involved entities, suggesting future research directions. For example, when the instance has a high degree of flexibility, we find the models making mistakes about what a particular entity type is capable of. We also find errors in terms of the factuality of the generated text, both in terms of contradictions relative to the tables and commonsense violations.

Related Work
There have been efforts in creating data-to-text datasets from various resources, including sports summaries (Wiseman et al., 2017; Puduppully et al., 2019), weather forecasts (Liang et al., 2009), and commentaries (Chen and Mooney, 2008). Most of the recent datasets focus on generating single sentences given tables, such as WIKIBIO, ToTTo, LogicNLG, and WikiTableText (Bao et al., 2018), or other types of data formats, such as data triples (Vougiouklis et al., 2017; Gardent et al., 2017; Nan et al., 2021), abstract meaning representations (Flanigan et al., 2016), minimal recursion semantics (Hajdik et al., 2019), or a set of concepts (Lin et al., 2020). Other than single sentences, there have been efforts in generating groups of sentences describing humans and animals, and generating a post-modifier phrase for a target sentence given a sentence context (Kang et al., 2019). In this work, our focus is long-form text generation and we are interested in automatically creating a large-scale dataset containing multiple types of data-to-text instances. As shown in Table 1, WIKITABLET differs from these datasets in that it is larger in scale and contains multi-sentence texts. More details are in the next section.
Wikipedia has also been used to construct datasets for other text generation tasks, such as generating Wikipedia movie plots (Orbach and Goldberg, 2020; Rashkin et al., 2020) and short Wikipedia event summaries (Gholipour Ghalandari et al., 2020), and summarizing Wikipedia documents (Zopf, 2018) or summaries of aspects of interests (Hayashi et al., 2020) from relevant documents.
As part of this work involves finding aligned tables and text, it is related to prior work on aligning Wikipedia texts to knowledge bases (Elsahar et al., 2018; Logan et al., 2019).

The WIKITABLET Dataset
The WIKITABLET dataset pairs Wikipedia sections 2 with their corresponding tabular data and various metadata; some of this data is relevant to entire Wikipedia articles ("article data") or article structure ("title data"), while some is section-specific ("section data"). Each data table consists of a set of records, each of which is a tuple containing an attribute and a value.
The instances in WIKITABLET cover a range of flavors of language generation. Some have more flexibility, requiring models to generate coherent stories based on the entities and knowledge given in the tables. The first instance in Figure 1 is such an example. The text is from the Wikipedia article entitled "Wolfsbane (comics)" and resides within two nested sections: the higher-level section "Fictional character biography" and the lower-level section "Messiah Complex". The task is challenging as models need to generate a coherent passage that can connect all the entities in the section data, and the story also needs to fit the background knowledge provided in the article data.
Other instances are more similar to standard data-to-text generation tasks, where the input tables contain all the necessary information for generating the text. The second instance in Figure 1 is an example of this sort of task. However, these tasks are still challenging due to the wide variety of topics contained in WIKITABLET.
2 We define a Wikipedia section to be all text starting after a (sub)section heading and proceeding until the next (sub)section heading. We include Wikipedia sections at various nesting levels. For example, a top level section may start with a few paragraphs describing general information followed by two subsections with more specific information, in which case the example will be converted into three instances in our dataset.
Figure 1 (first instance, reference text): During the 2007-2008 "Messiah Complex" storyline, Rahne helps Rictor infiltrate the Purifiers; she fakes being shot by Rictor. She is also a member of the new X-Force. During a battle against Lady Deathstrike and the Reavers, Rahne learns that Father Craig was in league with the Purifiers, supposedly divulging enough information about her that the Purifiers can claim to "know her well." She travels with X-Force to her former home Muir Island, now the base of the Marauders. During the climactic battle, Rahne is injured by Riptide, but her wounds, according to Professor X, are superficial and she will recover.

Dataset Construction
We begin by describing the steps we take to construct WIKITABLET. More details are in the supplementary material. In general, the steps can be split into two parts: collecting data tables and filtering out texts. When collecting data, we consider five resources: Wikidata tables, infoboxes in Wikipedia pages, 3 hyperlinks in the passage, named entities in the passage obtained from named entity recognition (NER), and Wikipedia article structure. For a given Wikipedia article, we use the same infobox and Wikidata table for all sections. These tables can serve as background knowledge for the article. For each section in the article, we create a second table corresponding to section-specific data, i.e., section data. The section data contains records constructed from hyperlinks and entities identified by a named entity recognizer. 4 We form records for named entities by using the type of the entity as the attribute and the identified entity as the value. We form records for hyperlinks as follows. For the attribute, for a hyperlink with surface text t and hyperlinked article ℓ, we use the value of the "instance of" or "subclass of" tuple in the Wikidata table for ℓ. For example, the first instance in Figure 1 will be turned into a record with attribute "superhero" and value "Wolfsbane (comics)". If ℓ does not have a Wikidata table or has no appropriate tuple, we consider the parent categories of ℓ. For the value of the tuple, we use the document title of ℓ rather than the actual surface text t to avoid giving away too much information in the reference text.
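The record-forming rules described in this subsection can be illustrated with a minimal sketch. The function names, the dictionary-based Wikidata lookup, the preference for "instance of" over "subclass of", and the choice of the first parent category as a fallback are all assumptions made for illustration; the text does not specify them.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, str]  # (attribute, value)


def ner_record(entity_type: str, entity_text: str) -> Record:
    """Named-entity record: entity type as attribute, entity text as value."""
    return (entity_type, entity_text)


def hyperlink_record(
    target_title: str,
    wikidata: Dict[str, Dict[str, str]],      # hypothetical: title -> {relation: value}
    parent_categories: Dict[str, List[str]],  # hypothetical: title -> category names
) -> Optional[Record]:
    """Hyperlink record: the value is the linked article's title, never the surface text."""
    table = wikidata.get(target_title, {})
    # Assumption: try "instance of" first, then "subclass of".
    for relation in ("instance of", "subclass of"):
        if relation in table:
            return (table[relation], target_title)
    # Fallback when no suitable Wikidata tuple exists: use a parent category.
    # Assumption: take the first listed category.
    categories = parent_categories.get(target_title, [])
    if categories:
        return (categories[0], target_title)
    return None


wikidata = {"Wolfsbane (comics)": {"instance of": "superhero"}}
print(hyperlink_record("Wolfsbane (comics)", wikidata, {}))
```

Using the article title as the value, rather than the anchor text, mirrors the stated goal of not leaking wording from the reference text into the input table.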
Complementary to the article data, we create a title table that provides information about the position in which the section is situated, which includes the article title and the section titles for the target section. As the initial sections in Wikipedia articles do not have section titles, we use the section title "Introduction" for these.
We also perform various filtering to ensure the quality of the data records, the coverage of the input data, and the length of the reference text. The final dataset contains approximately 1.5 million instances. We randomly sample 4533 instances as the development set and 4351 as the test set. We also ensure that there are no overlapping Wikipedia articles among splits. Table 1 shows statistics for WIKITABLET and related datasets. While the average length of a WIKITABLET instance is not longer than some of the existing datasets, WIKITABLET offers more diverse topics than the sports-related datasets ROTOWIRE and MLB, or the biography-related dataset WIKIBIO. Compared to the prior work that also uses Wikipedia for constructing datasets, WIKIBIO, LogicNLG, ToTTo, and DART (Nan et al., 2021) all focus on sentence generation, whereas WIKITABLET requires generating Wikipedia article sections, which are typically multiple sentences and therefore more challenging. WIKITABLET is also much larger than all existing datasets.
4 [...] on CoNLL03 data (Tjong Kim Sang and De Meulder, 2003).

Dataset Characteristics
To demonstrate the diversity of topics covered in WIKITABLET, we use either the "instance of" or "subclass of" relation from Wikidata as the category of the article. 5 We show the top 10 most frequent document categories in Table 2. Due to the criteria we use for filtering, only 1.05% of articles in WIKITABLET do not have these relations or Wikidata entries, and we omit these articles in the table. As the table demonstrates, more than 50% of the articles in WIKITABLET are not about people (i.e., the topic of WIKIBIO), within which the most frequent category covers only 4.61%.

Dataset Challenges
In this subsection, we highlight two challenges of WIKITABLET.
1. In contrast to work on evaluating commonsense knowledge in generation where reference texts are single sentences describing everyday scenes (Lin et al., 2020), WIKITABLET can serve as a testbed for evaluating models' abilities to use world knowledge for generating coherent long-form text.
2. Compared to other long-form data-to-text datasets such as ROTOWIRE where the input tables are box scores, the input tables in WIKITABLET are more diverse, including both numbers (e.g., economy and population data of an area throughout years), and short phrases. This

Methods
In this section, we describe details of models that we will benchmark on WIKITABLET.
Our base model is based on the transformer (Vaswani et al., 2017). To encode tables, we linearize the tables by using special tokens to separate cells and using feature embeddings to represent records in tables. For the title table in the first instance in Figure 1 the linearized table will be
\[ \text{boc}_1\ \text{Doc.}_1\ \text{title}_1\ \text{bov}_1\ \text{Wolfsbane}_1\ \text{(comics)}_1\ \text{boc}_2\ \text{Sec.}_2\ \text{title}_2\ \text{bov}_2\ \text{Fictional}_2\ \text{character}_2\ \text{biography}_2\ \text{boc}_3\ \cdots\ \text{eoc} \tag{1} \]
As shown in Eq. 1, we employ several techniques when encoding tables: (1) we use special tokens boc and bov to separate attributes and values, and eoc to indicate the end of a sequence; (2) we use subscript indices to indicate unique ID embeddings that are added to the embeddings for each record, which helps models align attributes with values; and (3) we restart the positional embeddings at each boc, such that models will not use the ordering of the input records. In addition, we add a special embedding to each record to indicate if it is from the section table or the article/title table. In Wikidata, there could be multiple qualifiers attached to a record, in which case we replicate the record for each qualifier separately.
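The linearization scheme can be sketched as follows. This is an illustration only: it tokenizes on whitespace rather than with BPE, and the choice of record ID 0 and position 0 for the final eoc token is an assumption, since the text does not say how eoc is indexed. The function name `linearize` is hypothetical.

```python
from typing import List, Tuple


def linearize(records: List[Tuple[str, str]]) -> Tuple[List[str], List[int], List[int]]:
    """Return (tokens, record_ids, positions) for a list of (attribute, value) records.

    record_ids index the per-record ID embedding; positions index the positional
    embedding, which restarts at 0 on every "boc" token.
    """
    tokens: List[str] = []
    record_ids: List[int] = []
    positions: List[int] = []
    for record_id, (attribute, value) in enumerate(records, start=1):
        # Whitespace tokenization is a simplification of the BPE used in the paper.
        cell = ["boc"] + attribute.split() + ["bov"] + value.split()
        for position, token in enumerate(cell):
            tokens.append(token)
            record_ids.append(record_id)
            positions.append(position)  # restarts for each record
    # Assumption: the end-of-sequence token gets record ID 0 and position 0.
    tokens.append("eoc")
    record_ids.append(0)
    positions.append(0)
    return tokens, record_ids, positions


title_table = [
    ("Doc. title", "Wolfsbane (comics)"),
    ("Sec. title", "Fictional character biography"),
]
tokens, record_ids, positions = linearize(title_table)
print(list(zip(tokens, record_ids, positions)))
```

Because positions restart at each record, swapping two records changes only the record IDs, not the positional signal inside each record.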
Similar linearization approaches have been used in prior work (Dhingra et al., 2019; Hwang et al., 2019; Herzig et al., 2020; Yin et al., 2020). With linearized tables, training and inference become similar to other sequence-to-sequence settings. We train our models with teacher-forcing and standard cross entropy loss unless otherwise specified.

Training Strategies
We experiment with three types of modifications to standard sequence-to-sequence training:
α-entmax. α-entmax is a mapping from scores to a distribution that permits varying the level of sparsity in the distribution. This mapping function has been used in machine translation and text generation (Martins et al., 2020). When using α-entmax in the decoder, we also replace the cross entropy loss with the α-entmax loss. Both α-entmax and the α-entmax loss have a hyperparameter α. We follow Martins et al. (2020) and use α = 1.2 as they found it to be the best value for reducing repetition in generation.
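To make the sparsity behavior of α-entmax concrete, here is a small pure-Python sketch. It relies on the standard characterization from the entmax literature, p_i = [(α − 1) z_i − τ]_+^{1/(α − 1)} with τ chosen so that the probabilities sum to one, and finds τ by bisection. This is not the implementation used in the experiments; the function name and iteration count are illustrative assumptions.

```python
from typing import List


def entmax_bisect(scores: List[float], alpha: float = 1.2, n_iter: int = 60) -> List[float]:
    """Map scores to a (possibly sparse) probability distribution for alpha > 1."""
    if alpha <= 1.0:
        raise ValueError("this sketch only handles alpha > 1 (alpha = 1 is softmax)")
    dim = len(scores)
    scaled = [(alpha - 1.0) * s for s in scores]
    exponent = 1.0 / (alpha - 1.0)

    def unnormalized(tau: float) -> List[float]:
        return [max(z - tau, 0.0) ** exponent for z in scaled]

    # The threshold tau is bracketed by these two values: at `lo` the total mass
    # is at least 1, and at `hi` it is at most 1.
    top = max(scaled)
    lo = top - 1.0
    hi = top - (1.0 / dim) ** (alpha - 1.0)
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if sum(unnormalized(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid
    probs = unnormalized(lo)
    total = sum(probs)
    return [p / total for p in probs]  # final renormalization removes bisection error


print(entmax_bisect([2.0, 1.0, 0.1], alpha=1.2))
print(entmax_bisect([1.0, 0.0], alpha=2.0))  # alpha = 2 corresponds to sparsemax
```

With α = 2 the mapping reduces to sparsemax and can assign exactly zero probability to low-scoring entries; values of α closer to 1 behave more like softmax.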
Copy Mechanism. Similar to prior work on data-to-text generation (Wiseman et al., 2017; Puduppully et al., 2019), we use pointer-generator network style copy attention (See et al., 2017) in the decoder.
Cyclic Loss. Cyclic losses have been shown to be effective in textual style transfer (Shetty et al., 2018; Pang and Gimpel, 2019) and neural machine translation (Cheng et al., 2016; He et al., 2016; Tu et al., 2017). Wiseman et al. (2017) also used this for data-to-text and found it helpful for generating long sequences. In this work, we experiment with adding the cyclic loss to our transformer models, where the backward model can be seen as an information extraction system. We expect that adding the cyclic loss should enable a data-to-text model to generate sentences that are more faithful to the conditioned tables. The cyclic loss is used during training only and does not affect the models during inference. More details are in the appendix.

Decoding Strategies
Massarelli et al. (2020) showed that the choice of decoding strategy can affect the faithfulness or repetitiveness of text generated by language models. We are also interested in these effects in the context of data-to-text generation, and therefore benchmark several decoding strategies on WIKITABLET. Our models use byte-pair encoding (BPE; Sennrich et al., 2016) and for all of the following strategies, we always set the minimum number of decoding steps to 100 as it improves most of the evaluation metrics, and the maximum number of decoding steps to 300. Specifically, we benchmark (1) greedy decoding; (2) nucleus sampling (Holtzman et al., 2020) with threshold 0.9 as suggested by Holtzman et al. (2020); (3) beam search; and (4) beam search with n-gram blocking (Paulus et al., 2018) where we set the probabilities of repeated trigrams to be 0 during beam search. We set the beam size to be 5 by default. The appendix has more details about the decoding strategies.
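The trigram-blocking rule in strategy (4) can be sketched as a filter applied to the next-token distribution of each beam hypothesis. The function names and the representation of tokens as integer IDs are assumptions for illustration; the sketch zeroes the blocked probabilities as described in the text and does not renormalize.

```python
from typing import List


def creates_repeated_ngram(prefix: List[int], candidate: int, n: int = 3) -> bool:
    """True if appending `candidate` would repeat an n-gram already present in `prefix`."""
    if len(prefix) < n - 1:
        return False
    new_ngram = tuple(prefix[len(prefix) - (n - 1):]) + (candidate,)
    seen = {tuple(prefix[i:i + n]) for i in range(len(prefix) - n + 1)}
    return new_ngram in seen


def block_repeated_ngrams(prefix: List[int], probs: List[float], n: int = 3) -> List[float]:
    """Set to 0 the probability of every token that would complete a repeated n-gram."""
    return [
        0.0 if creates_repeated_ngram(prefix, token_id, n) else p
        for token_id, p in enumerate(probs)
    ]


prefix = [1, 2, 3, 1, 2]               # token IDs generated so far
probs = [0.1, 0.1, 0.1, 0.5, 0.2]      # next-token distribution over a 5-token vocabulary
print(block_repeated_ngrams(prefix, probs))  # token 3 would repeat the trigram (1, 2, 3)
```

In a real decoder this filter would be applied per hypothesis at every step before the beam is expanded.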

Setup
We experiment with two sizes of transformer models. One is "Base", where we use a 1-layer encoder and a 6-layer decoder, each of which has 512 hidden size and 4 attention heads. The other one is "Large", where we use a 1-layer encoder and a 12-layer decoder, each of which has 1024 hidden size and 8 attention heads. [...] computational power, we parameterize our backward model as a transformer model with a 2-layer encoder and a 2-layer decoder. 7 We use BPE with 30k merging operations. We randomly sample 500k instances from the training set and train base models on them when exploring different training strategies. We train a large model with the best setting (using the copy mechanism and cyclic loss) on the full training set. We train both models for 5 epochs. During training we perform early stopping on the development set using greedy decoding.
Table 3: Test set results for our models. When training the large models, we use the "copy + cyclic loss" setting as it gives the best performance for the base models for most of the metrics.
We report BLEU (Papineni et al., 2002), ROUGE-L (RL) (Lin, 2004), METEOR (MET) (Banerjee and Lavie, 2005), and PARENT (Dhingra et al., 2019), including precision (PAR-P), recall (PAR-R), and F1 (PAR-F1) scores. The first three metrics consider the similarities between generated texts and references, whereas PARENT also considers the similarity between the generation and the table. When using PARENT, we use all three tables, i.e., the section, article, and title tables.
As we are also interested in the repetitiveness of generated texts, we define a metric based on n-gram repetitions which we call "REP". REP computes the ratio of the number of repeated n-grams to the total number of n-grams within a text, so a higher REP value indicates that the text has more repetitions. Here we consider n-grams that appear 3 or more times as repetitions and the n-grams we consider are from bigrams to 4-grams. When reporting REP scores for a dataset, we average the REP scores for each instance in the dataset. Similar metrics have been used in prior work (Holtzman et al., 2020; Welleck et al., 2020).
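One possible reading of REP is sketched below. The text leaves two details open, so the sketch makes explicit assumptions: every occurrence of an n-gram that appears at least three times counts as repeated, and counts are pooled over n = 2, 3, 4 before taking the ratio. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def rep_score(tokens: Sequence[str], min_n: int = 2, max_n: int = 4, min_count: int = 3) -> float:
    """Ratio of repeated n-gram occurrences to all n-gram occurrences in one text."""
    repeated = 0
    total = 0
    for n in range(min_n, max_n + 1):
        counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        total += sum(counts.values())
        # Assumption: count every occurrence of an n-gram seen min_count or more times.
        repeated += sum(c for c in counts.values() if c >= min_count)
    return repeated / total if total else 0.0


def dataset_rep(texts: List[Sequence[str]]) -> float:
    """Average of per-instance REP scores, as described for dataset-level reporting."""
    return sum(rep_score(t) for t in texts) / len(texts) if texts else 0.0


print(rep_score("a b a b a b".split()))  # 0.25: only the bigram (a, b) occurs three times
```

For "a b a b a b" there are 5 bigrams, 4 trigrams, and 3 four-grams (12 in total), and only the three occurrences of the bigram (a, b) reach the threshold, giving 3/12.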

Results
In Table 3, we report the test results for both our base models and large models. We also report a set of baselines that are based on simply returning the linearized tables and their concatenations with the references. The linearized table baselines show how much information is already contained in the table, while the reference baselines show the upper bound performance for each metric.
In comparing training strategies, we find that using α-entmax improves REP significantly but not other metrics. Adding the cyclic loss or the copy mechanism helps improve performance for the PAR scores and REP, and combining both further improves these metrics.
When comparing decoding strategies, we find that both nucleus sampling and n-gram blocking are effective in reducing repetition. Nucleus sampling harms the PAR scores, especially PAR-P, but has less impact on the other metrics, indicating that it makes the model more likely to generate texts that are less relevant to the tables. Using beam search improves all metrics significantly when compared to greedy decoding, especially the PAR-P and REP scores. Adding n-gram blocking further reduces the REP score, pushing it to be even lower than that from nucleus sampling, but still retains the improvements in PAR scores from beam search. The best overall decoding strategy appears to be beam search with n-gram blocking.

Analysis
We now describe a manual evaluation and analyze some generated examples. All results in this section use the development set. We also conduct experiments on analyzing the effect of using the section data and the article data during training, finding that the benefits that they bring to the model performance are complementary. See the appendix for more details.

Human Evaluation
We conduct a human evaluation using generations from the large model on the development set. We choose texts shorter than 100 tokens and that cover particular topics as we found during pilot studies that annotators struggled with texts that were very long or about unfamiliar topics. 8 We design two sets of questions. The first focuses on the text itself (i.e., grammaticality and coherence) and its faithfulness to the input article table. Since this set does not involve the reference, we can ask these questions about both generated texts and the reference texts themselves. The second set of questions evaluates the differences between the generations and the reference texts (i.e., relevance and support), allowing us to see if the generated text matches the human written section text. Specifically, relevance evaluates topical similarity between generations and references, and support evaluates whether the facts expressed in the generations are supported by or contradictory to those in the references. The full questions and numerical answer descriptions are in the appendix.
We report results in Tables 4 and 5. The scores are on a 1-5 scale with 5 being the best. For the first set, we collect 480 annotations from 38 annotators. For the second set, we collect 360 annotations from 28 annotators. We also ensure that each system has the same number of annotations. 9 It is interesting to note from Table 4 that human annotators are unable to differentiate the human written texts from the generations from our neural models. Since the Wikipedia section texts are parts of Wikipedia articles, showing the section texts in isolation can make them difficult to understand, potentially resulting in noisy annotations. As shown by the first instance in Table 6, the text uses the pronoun "he" without clarifying what the pronoun refers to. The paragraph is rated 3 for coherence, presumably due to this ambiguity. Also, Wikipedia texts are sometimes grammatically complex and annotators can mistake them for being ungrammatical, e.g., the second instance in Table 6.
On the other hand, the coherence errors in the generated texts are not always easy to spot. See, for example, the last two instances in Table 6, where the incoherence lies in the facts that (1) it is impossible to marry a person before the person is born, and (2) senior year takes place after junior year. These details are embedded in long contexts, which may be overlooked by annotators and lead to results favorable to these neural models.
To study the relationship between coherence and grammaticality, we compute Spearman's correlations between the human annotations for coherence and grammaticality after removing the ones with perfect scores for coherence. Table 7 shows the results. The correlations are much higher for references, beam search, and nucleus sampling than for n-gram blocking. This trend suggests that the imperfect coherence scores for the reference texts are likely because annotators find the texts to contain grammatical errors (or to possess grammatical complexity) which may prevent them from fully understanding the texts. However, n-gram blocking does not have this problem and thus achieves the best results for both coherence and grammaticality. We hypothesize that n-gram blocking is able to avoid the types of grammatical errors that prevent understanding because (1) unlike nucleus sampling, n-gram blocking does not rely on randomness to avoid repetition; (2) n-gram blocking does not suffer from repetitions like beam search.
We report results for the second set of questions in Table 5. The three evaluated systems show similar performance. To investigate the relationship between the degree of open-endedness of a WIKITABLET instance and its corresponding evaluation scores, we compute the averaged perplexities (based on our large models) for each option in Table 8. The most relevant generations are typically from more closed-ended or constrained instances. 10 Similarly for the support scores, more open-ended instances are distributed at score 3, which means that there is no fact supported by or contradictory to the shown tables.
Table 8: Averaged perplexities and the corresponding numbers of annotations for each option for the relevance and support questions (5 is the best option). We aggregate annotations for different decoding algorithms. We note that the perplexities are computed based on the reference texts using the large model.
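The coherence-grammaticality correlation analysis in this subsection can be sketched in plain Python as follows. The sketch assumes paired 1-5 ratings per annotation, drops pairs whose coherence score is perfect, and computes Spearman's correlation as the Pearson correlation of average ranks. The function names and the toy ratings are illustrative and are not data from the paper; in practice a library routine such as `scipy.stats.spearmanr` would serve the same purpose.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant ratings
    return cov / math.sqrt(var_x * var_y)


def spearman_excluding_perfect(coherence: Sequence[int], grammaticality: Sequence[int],
                               perfect: int = 5) -> float:
    """Spearman's rho over annotations whose coherence score is not perfect."""
    pairs = [(c, g) for c, g in zip(coherence, grammaticality) if c != perfect]
    if len(pairs) < 2:
        return float("nan")
    kept_c = [c for c, _ in pairs]
    kept_g = [g for _, g in pairs]
    return pearson(average_ranks(kept_c), average_ranks(kept_g))


# Toy ratings for illustration only.
print(spearman_excluding_perfect([5, 4, 3, 2, 5, 1], [5, 4, 3, 2, 1, 1]))
```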
While the open-endedness of an instance usually depends on its topics (e.g., movie plots are open-ended), there are many cases where the models can benefit from better entity modeling, such as understanding what a particular entity type is capable of (e.g., see the last example in Sec. 6.3). Recent work has also found conducting human evaluation for long-form generation to be challenging, for example in the context of question answering (Krishna et al., 2021) and story generation (Akoury et al., 2020). Our observations for data-to-text generation complement theirs and we hope that our dataset can inspire future research on human evaluation for long-form text generation.
10 Li and Hovy (2015)

Distribution of Perplexity
To determine the fraction of WIKITABLET that can be seen as constrained, we report the percentiles of perplexities for training and development splits in Table 9. From Table 8, it can be observed that instances with perplexities around 9.0 generally lead to model generations that are closely relevant to the reference texts and mostly supported by the input tables, and therefore are likely to be the constrained instances. From Table 9, we see that at least half of our dataset has perplexities lower than 9.0, so we conjecture that half of our dataset consists of constrained instances. Table 10 shows generation examples for nucleus sampling and beam search with n-gram blocking.
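The percentile analysis can be sketched as below. The sketch assumes that perplexity is computed per instance as the exponential of the negative mean token log-probability of the reference text, and it uses nearest-rank percentiles; both the granularity and the percentile convention are assumptions, as are the function names and the toy values.

```python
import math
from typing import List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability of the reference tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for 0 < q <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def fraction_below(values: Sequence[float], threshold: float = 9.0) -> float:
    """Share of instances whose perplexity is below the threshold."""
    return sum(v < threshold for v in values) / len(values)


# Toy per-instance perplexities for illustration only.
instance_perplexities: List[float] = [5.0, 8.0, 12.0, 20.0]
print(percentile(instance_perplexities, 50), fraction_below(instance_perplexities))
```

Under this reading, the conjecture in the text corresponds to `fraction_below(perplexities, 9.0)` being at least 0.5 on the training and development splits.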

Method | Generated Text
Nucleus sampling (threshold 0.9) | In a two-issue mini-series titled "Messiah Complex", portrayed by Maisie Williams, Father Craig is ordered to rescue the world from Reavers and run out of the X-Force causing the X-Force to permanently withdraw. The original humans were replaced by slots and Muir Island being pulled away. They also possessed the entire X-Force after issue #9 in which the X-Force appeared. Claremont and Sinclair later reprised their parts in "The Donderful World of Wolverine" with Wolfsbane as one of the new X-Men and her group taken back to their reservation for unknown reasons.
n-gram blocking | In the "Messiah Complex" storyline, Wolfsbane is a member of the X-Men, a group of mutants recruited by Father Craig to fight the Reavers in Muir Island. Father Craig is killed in the battle with Reavers, but Father Craig manages to rescue the mutant from the Rain of Gesham and retrieve the mutant. She is later revealed to have been killed by Rain in a battle with Father Craig. In the aftermath, Father Craig tells Father Craig that she is the only mutant to be killed by a mutant.
Nucleus sampling (threshold 0.9) | In May 2013, McCoy allegedly assault a Spoiler business official. The incident motivated him to take a business trip to Harrisburg, Pennsylvania and sign a letter from the Buffalo Bills paying him over $50,000 as well as verbally prohibiting legal kickoff work. In September 2014, McCoy shared the issue with some former players of the first two film franchises. McCoy played a leading role in the 2012 Marvel Cinematic Universe film "Erab of the Press", and was named one of the film's biggest stars on Twitter. On September 9, 2014, McCoy filed a lawsuit against PYT to change its Twitter name to LeSean McCoy Productions.
n-gram blocking | On September 9, 2014, McCoy was arrested in Harrisburg, Pennsylvania on suspicion of assault. He was charged with assault and battery. In May 2013, he was fined over $50,000 by the Buffalo Bills. In September 2014, he was suspended for two games by the PYT for violating the Marvel Cinematic Universe. He was released by the Bills in October of the same year. He was cleared of all charges on Twitter, and was banned from playing in the 2014 Pro Bowl due to his Twitter account.
Table 10: Generation examples from the large model. The first example corresponds to the first instance in Figure 1. The complete set of generations is in the appendix.
We observe very different trends between the two instances in Figure 1. For the first instance about the X-Men, although both generations look fluent, their stories differ dramatically. The generated text for nucleus sampling describes a story that starts by saying Father Craig rescues the world from Reavers and ends with Wolfsbane joining as one of the new X-Men. On the other hand, n-gram blocking generates a story where Wolfsbane already is a member of X-Men, and the story says Father Craig fought and was killed by the Reavers, but manages to rescue the mutant. For the less open-ended instances (e.g., the second instance in Figure 1), different decoding strategies mostly generate similar details (see the appendix for generations).
Despite having different details, these generations appear to try to fit in as many entities from the tables as possible, in contrast to beam search (shown in the appendix) which mostly degenerates into repetition for more open-ended instances. This explains our previous observation that n-gram blocking helps with the PAR-R score.
Even though the generations are of good quality for most instances, their implausibility becomes more apparent when readers have enough background knowledge to understand the involved entities. For example, the second instance in Table 10 comes from the Wikipedia page "LeSean McCoy" (a football player) under the sections "Personal life" and "Controversies" (details in the appendix). The generation from nucleus sampling is implausible/nonsensical in some places ("assault a Spoiler business official") and factually incorrect elsewhere (McCoy did not play a leading role in any film, and "Erab of the Press" is not an actual film). The fourth generation is implausible because a player is unlikely to be suspended for "violating the Marvel Cinematic Universe", and it is unlikely for a person to be cleared of all charges on Twitter. Our models have limited access to knowledge about entities, e.g., the capabilities of a social media company like Twitter. Future research may incorporate extra resources, make use of pretrained models, or incorporate factuality modules to solve these problems.

Conclusion
We created WIKITABLET, a dataset that contains Wikipedia article sections and their corresponding tabular data and various metadata. WIKITABLET contains millions of instances covering a broad range of topics and kinds of generation tasks. Our manual evaluation showed that humans are unable to differentiate the references and model generations, and n-gram blocking performs the best on grammaticality and coherence. However, qualitative analysis showed that our models sometimes struggle with coherence and factuality, suggesting several directions for future work.

Impact Statement
We highlight a few limitations as follows: (1) Wikipedia texts are generally written in objective tones, but some of the texts may contain controversial content that even the community contributors do not agree upon; (2) models trained on our dataset may generate deceitful texts that are unfaithful to what actually happened to particular entities; (3) though the instances in WIKITABLET cover various topics, the writing style is almost always the same. Future work may explore more diverse writing styles.

A Dataset Construction
When collecting data, we consider five resources: Wikidata tables, infoboxes in Wikipedia pages, hyperlinks in the passage, named entities in the passage obtained from named entity recognition (NER), and Wikipedia article structure. For each article in Wikipedia, we use the same infobox and Wikidata table for all sections. These tables can serve as background knowledge for the article. For each section in the article, we create a second table corresponding to section-specific data, i.e., section data. The section data contains records constructed from hyperlinks and entities identified by a named entity recognizer. Section data contributes around 25% of the records in WIKITABLET. For the records from NER, we filter out several entity types related to numbers (see footnote 11), as the specific meanings of these numbers in the section of interest are difficult to recover from the information in the tables. After filtering, we use the identified entities as the values and the entity types as the attributes. These NER records contribute roughly 12% of the records in our final dataset.
We also create records from hyperlinks in the section of interest. We first expand the hyperlinks available for each section with hyperlinks available in the parent categories. To do so, we group hyperlinks across all Wikipedia articles with those same categories, and then we perform string matching between these hyperlinks and the text in the section. If there are exact matches, we include those hyperlinks as part of the hyperlinks in this section.
Details for constructing a record with attribute a and value v for a hyperlink with surface text t and a hyperlinked article are as follows. To set a, we use the value of the "instance of" or "subclass of" tuple in the Wikidata table for the hyperlinked article. If the hyperlinked article does not have a Wikidata table or has no appropriate tuple, we consider the parent categories of the hyperlinked article as candidates for a. If there are multiple candidates for a, we first embed these candidates using GloVe (Pennington et al., 2014) embeddings and then choose the one that maximizes cosine similarity between the document titles or section titles and the candidates for a. For the value v of the tuple, we use the document title of the hyperlinked article rather than the actual surface text t to avoid giving away too much information in the reference text. The records formed by hyperlinks contribute approximately 13% of the records in WIKITABLET.
Footnote 11: List of filtered entity types: PERCENT, TIME, QUANTITY, ORDINAL, CARDINAL.
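The attribute selection step described above can be illustrated with a minimal sketch. It assumes that multi-word strings are embedded by averaging their token vectors and that a candidate is scored by its best cosine similarity against any title; neither detail is stated in the text. The function names and the two-dimensional toy vectors are hypothetical stand-ins for real GloVe vectors.

```python
import math

def average_vector(text, embeddings):
    """Average the vectors of the in-vocabulary tokens of `text` (an assumption)."""
    vectors = [embeddings[tok] for tok in text.lower().split() if tok in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def choose_attribute(candidates, titles, embeddings):
    """Return the candidate attribute most similar to any document or section title."""
    title_vecs = [v for v in (average_vector(t, embeddings) for t in titles) if v is not None]

    def score(candidate):
        vec = average_vector(candidate, embeddings)
        if vec is None or not title_vecs:
            return -1.0  # candidates without any known token rank last
        return max(cosine(vec, tv) for tv in title_vecs)

    return max(candidates, key=score)

# Hypothetical toy vectors; real GloVe vectors would be loaded from file.
TOY_EMBEDDINGS = {
    "comic": [0.9, 0.1], "book": [0.8, 0.2], "character": [0.7, 0.3],
    "football": [0.1, 0.9], "player": [0.2, 0.8],
}

chosen = choose_attribute(
    ["comic book character", "football player"], ["comic book"], TOY_EMBEDDINGS
)
```

With these toy vectors, the candidate that shares a direction with the title is preferred over the unrelated one.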
We shuffle the ordering of the records from NER and the hyperlinks to prevent models from relying on the ordering of records in the reference text.
The records from the section data can be seen as section-specific information that can make the task more solvable. Complementary to the article data, we create a title table that provides information about the position in which the section is situated, which includes the article title and the section titles for the target section. As the initial sections in Wikipedia articles do not have section titles, we use the section title "Introduction" for these (footnote 12).
As the records in our data tables come from different resources, we perform extra filtering to remove duplicates in the records. In particular, we give Wikidata the highest priority as it is a human-annotated, well-structured data resource (infoboxes are human-annotated but not well-structured due to the way they are stored on Wikipedia) and the entities from NER the lowest priority as they are automatically constructed. That is, when we identify duplicates across different resources, we keep the records from the higher-priority resource and drop those from the lower one. More specifically, duplicates between Wikidata records and infoboxes are determined by whether there are duplicate values or duplicate attributes; for hyperlinks and infoboxes or Wikidata, they are judged by duplicate values; for NER and hyperlinks, they are based on whether there is any token overlap between values.
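The priority-based duplicate removal described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the record format (dictionaries with "source", "attribute" and "value"), the case-insensitive matching, and the relative order of infoboxes and hyperlinks in the priority list are all assumed. The text specifies no rule for NER versus Wikidata or infoboxes, so the sketch never treats those pairs as duplicates.

```python
# Assumed priority order; the text only fixes Wikidata as highest and NER as lowest.
PRIORITY = ["wikidata", "infobox", "hyperlink", "ner"]

def is_duplicate(kept, new):
    """Apply the pairwise duplicate rules between two records from different sources."""
    pair = {kept["source"], new["source"]}
    same_value = kept["value"].lower() == new["value"].lower()
    same_attr = kept["attribute"].lower() == new["attribute"].lower()
    if pair == {"wikidata", "infobox"}:
        return same_value or same_attr
    if "hyperlink" in pair and pair & {"wikidata", "infobox"}:
        return same_value
    if pair == {"ner", "hyperlink"}:
        kept_tokens = set(kept["value"].lower().split())
        new_tokens = set(new["value"].lower().split())
        return bool(kept_tokens & new_tokens)
    return False  # no rule given in the text for the remaining pairs

def deduplicate(records):
    """Visit records from highest to lowest priority and drop later duplicates."""
    ordered = sorted(records, key=lambda r: PRIORITY.index(r["source"]))
    kept = []
    for record in ordered:
        if not any(is_duplicate(k, record) for k in kept):
            kept.append(record)
    return kept

records = [
    {"source": "ner", "attribute": "PERSON", "value": "LeSean McCoy"},
    {"source": "hyperlink", "attribute": "human", "value": "LeSean McCoy"},
    {"source": "infobox", "attribute": "position", "value": "Running back"},
    {"source": "wikidata", "attribute": "position", "value": "running back"},
]
result = deduplicate(records)
```

Here the infobox record is dropped in favor of the Wikidata record, and the NER record is dropped because its value shares tokens with the hyperlink record.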
After table collection, we have the following criteria for filtering out the texts: (1) we limit the text length to be between 50 and 1000 word tokens; (2) to ensure that there is sufficient information in the table, we only keep data-text pairs that contain more than 2 records per sentence and more than 15 records per 100 tokens from Wikidata and infoboxes; (3) to avoid texts such as lists of hyperlinks, we filter out texts where more than 50% of their word tokens are from hyperlink texts.
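The three filtering criteria above can be written as a single predicate. The sketch below is an assumed reading of the text: both record thresholds are applied to the number of records from Wikidata and infoboxes, and the counts (tokens, sentences, records, hyperlink tokens) are taken as precomputed inputs. The function and parameter names are hypothetical.

```python
def keep_instance(num_tokens, num_sentences, num_article_records, num_hyperlink_tokens):
    """Return True if a data-text pair passes the three filtering criteria."""
    # (1) Text length must be between 50 and 1000 word tokens.
    if not 50 <= num_tokens <= 1000:
        return False
    if num_sentences == 0:
        return False
    # (2) More than 2 records per sentence and more than 15 records per 100 tokens
    #     (assumed to count Wikidata and infobox records in both cases).
    if num_article_records / num_sentences <= 2:
        return False
    if 100 * num_article_records / num_tokens <= 15:
        return False
    # (3) At most 50% of the word tokens may come from hyperlink texts.
    if num_hyperlink_tokens / num_tokens > 0.5:
        return False
    return True
```

For example, a 200-token, 8-sentence section with 40 article-level records and 30 hyperlink tokens passes all three checks, while the same section with only 20 records fails the per-100-token threshold.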

B Human Evaluation
The selected topics for human evaluations are: human (excluding the introduction and biography section), film, single (song), song, album, television series. When evaluating grammaticality and coherence, only the generated text is shown to annotators. The question for grammaticality is "On a scale of 1-5, how much do you think the text is grammatical? (Note: repetitions are grammatical errors.)" (option explanations are shown in Table 11), and the question for coherence is "On a scale of 1-5, how much do you think the text is coherent? (Coherence: Does the text make sense internally, avoid self-contradiction, and use a logical ordering of information?)" (rating explanations are in Table 12).

Table 11: Rating explanations for grammaticality.
1 = it is completely ungrammatical, as it is impossible to understand the text.
2 = it has many grammatical errors, and these errors make the text very difficult to understand.
3 = it has grammatical errors, and some of them make part of the text difficult to understand.
4 = it has some grammatical errors, but they are minor errors that do not affect reading.
5 = it is completely grammatical, as it does not have any grammatical errors.

Table 12: Rating explanations for coherence.
1 = it is completely incoherent, as it is impossible to piece together information in the text.
2 = it is incoherent in most places. You can only understand part of the story.
3 = it is incoherent in many places, but if you spend time reading it, you still can understand the whole story.
4 = it is mostly coherent. Although the text is incoherent in some places, it does not affect reading.
5 = it is completely coherent.
When evaluating faithfulness, we show annotators the article data and the generation. The question is "On a scale of 1-5, how much do you think the text is supported by the facts in the following table?" (rating explanations are in Table 13).
When evaluating relevance and support, annotators were shown the reference text and the generation, as well as the Wikipedia article title and section titles for ease of understanding the texts. Annotators were asked two questions, with one being "On a scale of 1-5, how much do you think the text is relevant to the reference?" (Table 14), and the other being "On a scale of 1-5, how much do you think the text is supported by the facts in the reference?" (Table 15).

C Effect of α-entmax
In this section, we disentangle the effect of α-entmax and that of the α-entmax loss. We note that (1) when not using the α-entmax loss, we use the standard cross-entropy loss (e.g., in the case of "base+ent.", we maximize the log probabilities generated by α-entmax); (2) when combining α-entmax and the copy mechanism, we aggregate the probabilities generated by α-entmax and those from softmax. This is because we use the first attention head in the transformer decoder as the copy attention, following the implementation in OpenNMT (Klein et al., 2017). While it is feasible to combine α-entmax and the α-entmax loss with the copy mechanism if we use the sparse transformer (Correia et al., 2019), we leave this for future study. We report the results in Table 16. It is interesting to see that when using greedy decoding, "ent. + ent. loss" outperforms the baseline model by a significant margin on all the metrics; however, the improvement disappears (except for repetition) after we switch to beam search as the decoding strategy. This is likely because α-entmax promotes sparsity in the generated probabilities, making beam search decoding unnecessary. Removing the α-entmax loss hurts the performance, but its gains become larger when switching to beam search decoding. Adding the copy mechanism improves the performance, leading to comparable performance to the baseline model.
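To make the sparsity argument concrete, the following minimal sketch compares softmax with α-entmax on a small score vector. It uses the standard thresholded form of α-entmax, p_i = [(α − 1) z_i − τ]_+^{1/(α−1)}, and finds the threshold τ by bisection. The choice α = 1.5, the bisection solver, and the function names are assumptions made for illustration and are not taken from the experiments above.

```python
import math

def softmax(scores):
    """Dense baseline: every token receives non-zero probability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entmax_bisect(scores, alpha=1.5, iterations=60):
    """alpha-entmax for alpha > 1, with the threshold tau found by bisection."""
    if alpha <= 1.0:
        raise ValueError("this sketch only covers alpha > 1")
    scaled = [(alpha - 1.0) * s for s in scores]
    exponent = 1.0 / (alpha - 1.0)

    def probs(tau):
        return [max(x - tau, 0.0) ** exponent for x in scaled]

    top = max(scaled)
    lo = top - 1.0                                   # total mass >= 1 at this tau
    hi = top - (1.0 / len(scaled)) ** (alpha - 1.0)  # total mass <= 1 at this tau
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if sum(probs(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = probs((lo + hi) / 2.0)
    total = sum(p)
    return [v / total for v in p]  # remove the residual bisection error

scores = [2.0, 1.0, 0.0, -2.0]
dense = softmax(scores)
sparse = entmax_bisect(scores, alpha=1.5)
```

On this input, softmax keeps all four entries positive, while α-entmax assigns exactly zero probability to the two lowest-scoring entries, which illustrates why low-probability continuations are pruned even under greedy decoding.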
Although "base+ent.+copy" still underperforms "base+copy" when using beam search, we believe that combining α-entmax and α-entmax loss with the copy mechanism is promising as (1) α-entmax is not used in our large models and the initial results have shown that α-entmax and the copy mechanism are complementary, so it may further improve our current best performance; (2) α-entmax already shows the best performance when using greedy decoding, which has speed and optimization advantages compared to the beam search based decoding strategies, especially considering the long-form characteristic of WIKITABLET.

Rating explanations for how well the text is supported by the reference:
1 = it has quite a few facts contradictory to what is described in the reference.
2 = it has some facts contradictory to what is described in the reference.
3 = it is not supported by the reference, and it does not contradict the reference.
4 = some of the text is supported by the facts in the reference, and the rest of it does not contradict the reference.
5 = it is completely supported by the reference.

D Details of Cyclic Loss
In this section, we denote the linearized table where the values are replaced with a special mask token by $u_1, \cdots, u_n$, and denote the reference text by $x_1, \cdots, x_m$. Formally, the training loss is
$$\sum_{w \in S} -\log p(w \mid u_1, \cdots, u_n, v_1, \cdots, v_m) \tag{2}$$
where $S$ represents the set of masked tokens, and $v_1, \cdots, v_m$ is the sequence of token-level probabilities predicted by the forward model (in our experiments, these could either come from the softmax function, or the α-entmax function). Specifically, we multiply the backward transformer's input embedding matrix by the $v$ probability vectors to obtain the input representations to the first encoder layer. We find that it is helpful to add a "reference loss" while training with the cyclic loss, defined as
$$\sum_{w \in S} -\log p(w \mid u_1, \cdots, u_n, x_1, \cdots, x_m) \tag{3}$$
This loss does not contain the generation model in it explicitly, but it does lead to an improved backward model by training it with clean inputs. Improving the backward model then increases the benefits of the cyclic loss (footnote 13).
Table 18: Effect of dropping section or article data when using cyclic training. The results are based on the "base + copy" and "base + copy + cyclic loss" settings.
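The soft-input step and the masked loss in this appendix can be sketched in a few lines. The sketch assumes an embedding matrix with one row per vocabulary item and probability vectors over the same vocabulary; the function names and the tiny example values are hypothetical, and a real implementation would use a tensor library so that gradients can flow back to the forward model.

```python
import math

def soft_embed(prob_vectors, embedding_matrix):
    """Map each probability vector v to the expected embedding sum_k v[k] * E[k]."""
    dim = len(embedding_matrix[0])
    return [
        [sum(p * row[j] for p, row in zip(v, embedding_matrix)) for j in range(dim)]
        for v in prob_vectors
    ]

def masked_nll(predicted, gold_ids, masked_positions):
    """Sum of -log p(w) over the masked table positions S."""
    return sum(-math.log(predicted[i][gold_ids[i]]) for i in masked_positions)

# Toy vocabulary of three tokens with two-dimensional embeddings.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A one-hot vector reproduces an ordinary embedding lookup (the clean input of
# the reference loss); a soft vector mixes rows (the input of the cyclic loss).
inputs = soft_embed([[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]], E)

# Backward-model predictions for two table positions, of which only position 1 is masked.
loss = masked_nll([[0.5, 0.5], [0.25, 0.75]], gold_ids=[0, 1], masked_positions=[1])
```

Under this view, Equations (2) and (3) differ only in what is fed to `soft_embed`: predicted distributions for the cyclic loss, and one-hot reference tokens for the reference loss.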

E Effect of Article Data and Section Data
We report results in Table 17 for the models that are trained with partial data input, where art. only and sec. only indicate that we use only article data or section data, respectively. We always use title data. Section data contributes the most to the BLEU and PAR scores, but using section data and article data together is the best setting. We also investigate the effect of partial data input for the cyclic loss in Table 18, where "None" is the model that is not trained with the cyclic loss. We note that in this setting, we still use both data resources as the input to the forward model, but vary the input data and the gold standard for the backward model. Although using only section data gives the best REP score and improves the PAR-P score, it does not help the model in other metrics. Combining the article data with the section data gives significant improvements to the PAR-F1 score compared to section data alone.
Both experiments show that there are interactions between these two data resources that can help models to learn better from both kinds.

F Generation Examples
We show the full set of generations in Table 19. The part of input data and reference text for Table 19 is shown in Figure 2.
Figure 2: An example from WIKITABLET. Only parts of the tables and reference are shown due to space constraints. Underlined texts are hyperlinks. Records with the attributes "DATE", "ORG", "MONEY" are from NER. Subscripts for section titles indicate the ordering of nesting, where smaller numbers indicate higher level sections.
Method Generated Text sampling 0.9 In a two-issue mini-series titled "Messiah Complex", portrayed by Maisie Williams, Father Craig is ordered to rescue the world from Reavers and run out of the X-Force causing the X-Force to permanently withdraw. The original humans were replaced by slots and Muir Island being pulled away. They also possessed the entire X-Force after issue #9 in which the X-Force appeared. Claremont and Sinclair later reprised their parts in "The Donderful World of Wolverine" with Wolfsbane as one of the new X-Men and her group taken back to their reservation for unknown reasons. beam search (size=5) In the "Messiah Complex" storyline, Wolfsbane is a member of the X-Men. She is a member of the X-Factor Investigations team and a member of the X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/X-Force/ blocking (size=5) In the "Messiah Complex" storyline, Wolfsbane is a member of the X-Men, a group of mutants recruited by Father Craig to fight the Reavers in Muir Island. Father Craig is killed in the battle with Reavers, but Father Craig manages to rescue the mutant from the Rain of Gesham and retrieve the mutant. She is later revealed to have been killed by Rain in a battle with Father Craig. In the aftermath, Father Craig tells Father Craig that she is the only mutant to be killed by a mutant. sampling 0.9 Journey to the Center of the Earth is a 1959 American science fiction film based on Jules Verne's 1959 novel of the same name. 
The film, directed by Henry Levin, stars James Mason as Professor Otto Lidenbrock, a scientist who gets entangled in the power of the archenemy as he tries to impress his closest allies and compromises them with the future. The film marks the first screen appearance of future CinemaScope of 20th Century Fox lead singer Peter Ronson. "The 20th Century Fox" earned critical acclaim and was a box-office success, grossing $10 million worldwide against a production budget of $3.44 million . beam search (size=5) Journey to the Center of the Earth is a 1959 American science fiction film directed by Henry Levin and starring James Mason, Pat Boone, and Arlene Dahl. The screenplay by Charles Brackett and Walter Reisch is based on Jules Verne's novel "A Journey to the Center of the Earth" (1959). The film was produced by Charles Brackett and Walter Reisch and distributed by Twentieth Century-Fox Film Corporation. The film was nominated for an Academy Award for Best Art Direction. It was also nominated for an Academy Award for Best Art Direction and a Golden Globe Award for Best Visual Effects. It was also nominated for a Golden Globe Award for Best Motion Picture -Musical or Comedy, but lost to 20th Century Fox's "CinemaScope". blocking (size=5) Journey to the Center of the Earth is a 1959 American science fiction film directed by Henry Levin and starring James Mason, Pat Boone, and Arlene Dahl. The screenplay by Charles Brackett and Walter Reisch is based on Jules Verne's novel of the same name. The film was produced by CinemaScope and 20th Century Fox and distributed by Twentieth Century-Fox Film Corporation and Joseph M. Schenck Enterprises, Inc. It was nominated for an Academy Award for Best Art Direction.

G Details of Decoding Strategies
Nucleus Sampling. Generating long sequences usually suffers from repetitions. Nucleus sampling (Holtzman et al., 2020) aims to reduce the repetitions in generations by sampling from truncated probability distributions. At each step, the distribution is truncated to the most probable tokens whose cumulative probability reaches a threshold, and the next token is sampled from this truncated distribution. We set the threshold to be 0.9 as suggested in Holtzman et al. (2020).
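A single step of nucleus sampling can be sketched as below. The sketch assumes the next-token distribution is given as a plain list of probabilities indexed by token id; the function names are hypothetical, and only the 0.9 threshold comes from the text.

```python
import random

def nucleus_filter(probs, threshold=0.9):
    """Keep the most probable tokens until their cumulative mass reaches the
    threshold, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= threshold:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def nucleus_sample(probs, threshold=0.9, rng=random):
    """Sample one token id from the truncated, renormalized distribution."""
    truncated = nucleus_filter(probs, threshold)
    tokens = list(truncated)
    weights = [truncated[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

next_token_probs = [0.5, 0.3, 0.15, 0.05]
nucleus = nucleus_filter(next_token_probs, threshold=0.9)
token = nucleus_sample(next_token_probs, threshold=0.9)
```

In this example the three most probable tokens form the nucleus, so the least likely token (probability 0.05) can never be sampled.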
Beam Search with n-gram Blocking. Paulus et al. (2018) found it effective to reduce the repetitions during beam search by "blocking" n-grams that have been generated in previous decoding steps. We follow their approach by using trigram blocking and setting the probability of repeated trigrams to be 0 during beam search.
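The blocking rule can be sketched for a single hypothesis as follows. The sketch assumes tokens are strings and that the next-token distribution is a dictionary from token to probability; the function names are hypothetical. In beam search, the same check would be applied to every hypothesis on the beam before scoring its extensions.

```python
def blocked_next_tokens(prefix, n=3):
    """Return the tokens that would complete an n-gram already present in `prefix`."""
    if n < 2 or len(prefix) < n - 1:
        return set()
    context = tuple(prefix[-(n - 1):])  # the last n-1 generated tokens
    blocked = set()
    for i in range(len(prefix) - n + 1):
        if tuple(prefix[i:i + n - 1]) == context:
            blocked.add(prefix[i + n - 1])
    return blocked

def apply_blocking(prefix, next_probs, n=3):
    """Set the probability of every blocked continuation to 0 (no renormalization)."""
    blocked = blocked_next_tokens(prefix, n)
    return {tok: (0.0 if tok in blocked else p) for tok, p in next_probs.items()}

prefix = ["a", "member", "of", "the", "X-Force", "a", "member"]
blocked = blocked_next_tokens(prefix, n=3)
adjusted = apply_blocking(prefix, {"of": 0.7, "in": 0.3}, n=3)
```

Because the trigram "a member of" already occurs in the prefix, the token "of" is blocked after the second "a member", which forces the decoder away from the kind of loop seen in the plain beam search outputs.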