ASDOT: Any-Shot Data-to-Text Generation with Pretrained Language Models

Data-to-text generation is challenging due to the great variety of the input data in terms of domains (e.g., finance vs. sports) or schemata (e.g., diverse predicates). Recent end-to-end neural methods thus require substantial training examples to learn to disambiguate and describe the data. Yet, real-world data-to-text problems often suffer from various data-scarcity issues: one may have access to only a handful of or no training examples, and/or have to rely on examples in a different domain or schema. To fill this gap, we propose Any-Shot Data-to-Text (ASDOT), a new approach flexibly applicable to diverse settings by making efficient use of any given (or no) examples. ASDOT consists of two steps, data disambiguation and sentence fusion, both of which can be solved with off-the-shelf pretrained language models (LMs) with optional finetuning. In the data disambiguation stage, we employ a prompted GPT-3 model to understand possibly ambiguous triples from the input data and convert each into a short sentence with reduced ambiguity. The sentence fusion stage then uses an LM such as T5 to fuse all the resulting sentences into a coherent paragraph as the final description. We evaluate extensively on various datasets in different scenarios, including the zero-/few-/full-shot settings, and generalization to unseen predicates and out-of-domain data. Experimental results show that ASDOT consistently achieves significant improvement over baselines, e.g., a 30.81 BLEU gain on the DART dataset under the zero-shot setting.


Introduction
Data-to-text generation (Kukich, 1983a; Reiter and Dale, 1997) aims at generating natural language text conditioned on structured data content such as tables and graphs. The task has a broad range of applications such as task-oriented dialog (Wen et al., 2015), weather forecasting (Goldberg et al., 1994; Sripada et al., 2003), sports news reporting (Wiseman et al., 2017), and biography generation (Lebret et al., 2016a; Wang et al., 2018).
The problem is challenging in practice due to the vast diversity of the input data in terms of the domains (e.g., finance vs. sports), schemata (e.g., the set of predicates, table structures), etc. The inherent ambiguity makes it particularly difficult to learn to understand and describe the data. For instance, in the tuple <Fearless, time, 2008> from a music domain, the predicate word time means the release time of an album, while in <100 metres, time, 9.58> from sports it expresses the world record time. Recent approaches based on end-to-end neural models, e.g., by finetuning pretrained language models (LMs) (Puduppully et al., 2019a; Koncel-Kedziorski et al., 2019; Zhao et al., 2020), typically require massive training instances to resolve the ambiguity and are not applicable to many data-scarce scenarios.
In practice, a data-to-text problem of interest may have a varying number of training examples, ranging from a (small) set to only a few shots, or even no examples at all, and sometimes may rely on available examples outside the current domain to facilitate the generation. We refer to these diverse practical scenarios as the any-shot data-to-text problems. Recent work has studied data-to-text solutions when limited examples are available, but is often restricted to a single specific setting. For instance, Chen et al. (2020b) and Su et al. (2021) focused on few-shot problems but fail to apply when no examples are accessible, while the zero-shot neural pipeline by Kasner and Dusek (2022) relies on human-crafted templates and thus could not handle out-of-domain data.
In this paper, we develop Any-Shot Data-to-Text (ASDOT), a new flexible approach that makes efficient use of any given (or no) examples and achieves stronger generation quality compared to prior specialized methods. ASDOT draws inspiration from how humans describe data, namely by first disambiguating and understanding the data content, and then fusing and organizing the information together into text paragraphs. As a result, given input data (e.g., a table or graph), ASDOT consists of two intuitive steps, i.e., data disambiguation and sentence fusion. Importantly, each of the two steps can be solved with appropriate off-the-shelf pretrained LMs with optional finetuning, enabling the unique flexibility of ASDOT in the presence of any-shot training examples. More specifically, in data disambiguation, which aims to understand each data entry (e.g., the triple <Fearless, time, 2008>), we use the prompted GPT-3 model (Brown et al., 2020), which has encoded rich commonsense and world knowledge, to convert the triple into a short sentence (Fearless was released in 2008) with greatly reduced ambiguity. The subsequent sentence fusion stage then uses another LM, such as T5 (Raffel et al., 2020), to combine all the resulting sentences into a coherent paragraph as the final description. The sentence fusion sub-task allows us to incorporate any available in-/out-of-domain training examples as well as existing large weakly supervised corpora (Kasner and Dusek, 2022) to finetune the LM and boost the performance.
We evaluate the proposed approach in a wide range of practical any-shot scenarios, including (1) the zero-/few-/full-shot setting where we have access to a varying number of training examples, (2) the unseen-predicates setting where we describe the data of new predicates that are never seen in the training examples, and (3) the out-of-domain setting where we are presented only with examples from other domains. Extensive experiments show that our approach consistently achieves significant gains over the diverse previous methods specifically designed for each of the different scenarios.

Related Work
Data-to-text (D2T) generation is a long-standing problem in natural language processing with broad applications in practice. Early research on this task focused on rule-based and pipeline approaches (Kukich, 1983b; Reiter and Dale, 1997), decomposing the task into text planning, sentence planning, and linguistic realisation. Recent work has developed various neural approaches. Lebret et al. (2016b) used a neural encoder-decoder for the task, followed by attention (Bahdanau et al., 2015), content selection (Puduppully et al., 2019a), entity modeling (Puduppully et al., 2019b), and style imitation (Lin et al., 2020) for further improved performance. Recent studies have also incorporated pretrained LMs (Kale and Rastogi, 2020b; Ribeiro et al., 2021; Clive et al., 2021). Although previous fully-supervised methods have achieved remarkable performance, most of them require a large amount of in-domain training examples, leading to limited applicability to the common low-data scenarios in practice.
There has been growing interest in zero-/few-shot data-to-text generation. Chen et al. (2020b) first formulated the few-shot setting and incorporated a pretrained model with a pointer generator as a solution. Chen et al. (2020a) developed a knowledge-grounded pretrained LM for both zero- and few-shot data-to-text generation. Gong et al. (2020) and Chen et al. (2020b) proposed to solve the few-shot task with content matching and prototype memory, respectively. There are also studies on combining templates and pretrained LMs for zero-/few-shot generation. For example, Kale and Rastogi (2020a) trained a neural model to rewrite templates for few-shot task-oriented dialogue. Heidari et al. (2021) applied the idea of template rewriting to build a practical few-shot data-to-text system. Most of the previous methods have each focused on a specific setting (e.g., either zero- or few-shot). In comparison, our work studies a wide spectrum of any-shot scenarios with a varying number of training examples from the current or different domains. Of particular relevance to our work is the approach by Kasner and Dusek (2022), which performs zero-shot data-to-text generation by rephrasing given templates. However, the approach relies on human-written templates for data disambiguation and thus has limited applicability to wide domains. Besides, the approach involves several components (ordering, aggregation, compression) to fuse sentences, which restricts the use of any-shot examples for improvement. The approach is thus evaluated only in the zero-shot setting, while our work makes a comprehensive study on the diverse any-shot problems.

Any-Shot Data-to-Text Generation
We propose ASDOT for any-shot data-to-text generation. §3.1 describes the any-shot problems. We then provide an overview of our method (§3.2) and give details of each of its components (§3.3, §3.4). Figure 1 illustrates our method.

Figure 1: An overview of our method. Our approach consists of two core steps, i.e., data disambiguation (§3.3) and sentence fusion (§3.4). The approach first leverages a prompted GPT-3 to convert each data triple into short sentences with reduced ambiguity. The resulting sentences are then fused by a pretrained LM with optional finetuning, using either a public weakly-supervised corpus (weakly-supervised finetuning) or available training examples (any-shot finetuning).
The Any-Shot Data-to-Text Problems
In the data-to-text generation task, we are given structured data (e.g., a table or graph) as input, which can be represented as a set of triples {x_1, x_2, ..., x_n}, where each triple x_i = ⟨s_i, p_i, o_i⟩ consists of a subject, a predicate, and an object. In practice, one may have no training examples at all (i.e., zero-shot) or have access to only a few description examples (i.e., few-shot). Besides, the available examples may not even come from the same domain (out of domain), or may use different table structures (different schemata) and different table headers (different predicates). We refer to data-to-text training in these various practical scenarios as the any-shot problem. It is highly desirable to develop a general approach that is widely applicable to the different settings.

Method Overview
Intuitively, a data-to-text generation process consists of two core steps, namely, (1) disambiguating and understanding the data triples, and (2) producing the text description. Previous neural approaches typically model the task in an end-to-end manner and require a large number of training examples to learn the data-to-text mapping. In contrast, we take advantage of the task structure by formulating the two stages and solving each with appropriate resources (e.g., pretrained LMs) that are readily available. Figure 1 offers an overview of the approach. Specifically, since each data triple is inherently ambiguous given the compact predicate words, rich commonsense and world knowledge is required to correctly understand the content. For instance, in <Apollo 11, operator, NASA>, a model would need knowledge to determine that NASA operates Apollo 11 rather than the other way around. Therefore, in the data disambiguation stage, we leverage a powerful LM (GPT-3 in our case) that contains massive implicit knowledge in its parameters to convert each triple into a short sentence with reduced ambiguity (e.g., Apollo 11 is operated by NASA). Once we collect a set of short sentences, in the sentence fusion stage, we use another pretrained LM with optional finetuning to compose the sentences into a well-formed paragraph. This stage offers the flexibility to make use of any available training examples to boost performance.
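The two-stage pipeline can be sketched in a few lines of Python. This is an illustrative stand-in only: the hand-written template dictionary plays the role of the GPT-3 disambiguation stage, and the naive string join stands in for the finetuned T5 fusion model; all names here are hypothetical.

```python
# Minimal sketch of the ASDOT pipeline (illustrative, not the paper's code).

# Stage 1 stand-in: disambiguation templates that would be obtained from GPT-3.
TEMPLATES = {
    "operator": "<subject> is operated by <object>.",
    "time": "<subject> was released in <object>.",
}

def disambiguate(triple):
    """Convert one (subject, predicate, object) triple into a short sentence."""
    s, p, o = triple
    return TEMPLATES[p].replace("<subject>", s).replace("<object>", o)

def fuse(sentences):
    """Stage 2 stand-in: the paper uses a T5 model; here we simply join."""
    return " ".join(sentences)

def asdot(triples):
    return fuse([disambiguate(t) for t in triples])

print(asdot([("Apollo 11", "operator", "NASA")]))
# Apollo 11 is operated by NASA.
```

In the real system, the quality gains come from the two stand-ins being replaced by large pretrained LMs: GPT-3 supplies the world knowledge needed to phrase each triple correctly, and T5 produces a fluent paragraph rather than a flat concatenation.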

Data Disambiguation
In this stage, the goal is to generate a short sentence to describe each data triple precisely. As above, a triple can be highly abstract and ambiguous as it compresses complex relational information into the compact format x = ⟨s, p, o⟩, where the predicate p is often a concise word or phrase (e.g., the predicate time in triple <Fearless, time, 2008>). To reduce the ambiguity, we want to "recover" the missing information in the triple by augmenting it into a complete sentence (e.g., Fearless was released in 2008). Another advantage of converting the structured triples into free-form text is that a text sequence is more amenable to the LMs used in the subsequent sentence fusion stage (§3.4), as described shortly.
As the above examples show, augmenting a triple into a sentence naturally requires relevant external knowledge (e.g., Fearless is an album). Training a model specifically for the task could be expensive and could easily overfit to the training domain. Instead, we resort to the general GPT-3 model. Specifically, as shown in Figure 1 (middle panel), we provide GPT-3 with a few demonstrations of converting triples into short sentences, and then feed the target triple to elicit the desired sentence. Appendix A shows the complete demonstrations. We found that the same set of four demonstrations is sufficient for target data in any domain. We thus use the same prompt consisting of those demonstrations throughout our experiments.
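The prompt construction might look like the following sketch. The demonstrations below are illustrative placeholders, not the exact ones from Appendix A; the completion would then be obtained by querying the GPT-3 API with greedy decoding and the stop token "\n" (§4.2).

```python
# Sketch of assembling the few-shot disambiguation prompt for GPT-3.
# The demonstration pairs are hypothetical examples of the format.
DEMONSTRATIONS = [
    ("<Fearless, time, 2008>", "Fearless was released in 2008."),
    ("<Apollo 11, operator, NASA>", "Apollo 11 is operated by NASA."),
]

def build_prompt(triple_str):
    """Concatenate fixed demonstrations and the target triple; GPT-3 is
    expected to continue the text after the final 'Sentence:' marker."""
    lines = []
    for triple, sentence in DEMONSTRATIONS:
        lines.append(f"Triple: {triple}")
        lines.append(f"Sentence: {sentence}")
    lines.append(f"Triple: {triple_str}")
    lines.append("Sentence:")
    return "\n".join(lines)

prompt = build_prompt("<100 metres, time, 9.58>")
print(prompt)
```

Because the demonstrations are fixed, the prompt is identical across domains; only the final target triple changes per query.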
Querying the GPT-3 API can be slow and expensive. Given a set of target data in a domain, we reduce the number of queries by generating templates. More concretely, for each predicate in the set, we sample one triple containing the predicate, and generate a sentence for the triple with GPT-3. Then we replace the subject and object in the sentence with the placeholders <subject> and <object> to get a template. For instance, the template for the predicate birthPlace in Figure 1 is "<subject> was born in <object>". We then use the template to generate the sentences for all triples with the same predicate.
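The template extraction and reuse described above can be sketched as follows. The naive string replacement is an assumption on our part; a real implementation may need extra care, e.g., when the subject string also occurs inside the object.

```python
def make_template(sentence, subject, obj):
    """Turn a GPT-3-generated sentence for one sampled triple into a reusable
    template by masking the subject and object with placeholders."""
    return sentence.replace(subject, "<subject>").replace(obj, "<object>")

def apply_template(template, subject, obj):
    """Instantiate the template for any other triple with the same predicate."""
    return template.replace("<subject>", subject).replace("<object>", obj)

# Hypothetical example for the birthPlace predicate shown in Figure 1.
templ = make_template("Taylor Swift was born in Pennsylvania.",
                      "Taylor Swift", "Pennsylvania")
# templ == "<subject> was born in <object>."
print(apply_template(templ, "Elvis Presley", "Mississippi"))
# Elvis Presley was born in Mississippi.
```

With one GPT-3 call per predicate rather than per triple, the number of queries drops from the number of triples to the number of distinct predicates (4,299 across all datasets in §4.2).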
It is worth noting that many existing data-to-text approaches, ranging from the classical pipeline solutions (Reiter and Dale, 1997) to the recent neural methods (Kale and Rastogi, 2020a; Kasner and Dusek, 2022), have also included similar template components, but their templates are typically crafted by human annotators, making the approaches hard to apply to diverse new domains. In contrast, our ASDOT is fully automated with the pretrained LMs, without the need for human effort or training examples.

Sentence Fusion
In the second stage, we aim to fuse the sentences from the last step and produce a final coherent and fluent paragraph as the output data description. We naturally formulate the sentence fusion as a sequence-to-sequence problem, and use pretrained LMs, particularly T5 (Raffel et al., 2020), as the backbone for the solution. Specifically, we simply concatenate the short sentences, prepended with a prefix word "summarize:", and feed them into the T5 model to obtain the output text. We pick "summarize:" as the prefix for T5 to mimic its pretraining configuration, since the sentence fusion task is similar to the summarization task on which T5 was pretrained.
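The input formatting for the fusion model can be sketched as below; the example sentences are illustrative. The actual decoding would run the (optionally finetuned) T5 with beam search, as indicated in the trailing comment (assuming the HuggingFace transformers API).

```python
def fusion_input(sentences):
    """Format the disambiguated sentences as the T5 input, using the
    'summarize:' prefix to mimic T5's pretraining configuration."""
    return "summarize: " + " ".join(sentences)

# Hypothetical disambiguated sentences for one input graph.
inp = fusion_input([
    "Fearless was released in 2008.",
    "Fearless is a Taylor Swift album.",
])
print(inp)

# The string would then be tokenized and decoded with beam search, e.g.:
#   model = T5ForConditionalGeneration.from_pretrained("t5-large")
#   ids = tokenizer(inp, return_tensors="pt").input_ids
#   out = model.generate(ids, num_beams=5)
```

Keeping this stage as a plain sequence-to-sequence problem is what lets any available examples, in-domain or not, be used for finetuning.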
A key advantage of the sentence fusion stage is that the component permits easy finetuning with diverse available resources. On one hand, there are automatically constructed weak supervision datasets publicly available, such as WikiSplit (Botha et al., 2018), mined from Wikipedia's edit history, and DiscoFuse (Geva et al., 2019), constructed by rules. In our zero-/few-shot experiments (§4), we finetune the sentence fusion model with the public WikiFluent dataset (Kasner and Dusek, 2022), which was constructed by applying a sentence splitting model to Wikipedia sentences. On the other hand, one can also use any labeled data-to-text examples (by first converting with the data disambiguation stage), even if the examples are from different domains. This is because the general sentence fusion task tends to be domain-agnostic, since the operations to fuse sentences are usually similar across domains, e.g., inserting connective words or subsuming one sentence as the clause of another. We evaluate in our experiments the out-of-domain generalization ability of our approach.

Datasets
We experiment on three widely-used data-to-text benchmarks, based on which we study various any-shot settings.
WebNLG (Gardent et al., 2017) consists of data-text pairs, where each data instance is a set of triples extracted from DBpedia and the text is written by humans to describe the data. The dataset is split into training, validation, and test sets with 18,102/872/1,862 examples, respectively. The test set is further split into test-seen and test-unseen
subsets. The instances in the test-unseen set come from Wikipedia categories not seen in the training set; this subset is used in our "unseen predicates" experiments (§4.4). WebNLG contains 354 types of predicates in total.
E2E (Novikova et al., 2017) is a data-to-text corpus in the restaurant domain annotated by humans. The dataset has 42,061/547/629 examples in the training/validation/test sets, respectively. The dataset is relatively easy since it only contains 7 types of predicates and has limited patterns.
DART (Nan et al., 2021) is a large open-domain data-to-text corpus, constructed from WikiSQL (Zhong et al., 2017), WikiTableQuestions (Pasupat and Liang, 2015), as well as the WebNLG and E2E datasets. It contains 62,659/2,768/5,097 examples in the training/validation/test sets, respectively, and has 4,299 different predicates in total. Note that the predicates in DART include those in WebNLG and E2E. To evaluate model generalization to unseen predicates, we extract a subset of 271 test examples whose predicates are completely unseen in the training/validation sets, leading to a more difficult test-unseen set compared to that of WebNLG.

Experimental Setup
For ASDOT, the data disambiguation stage (§3.3) uses the GPT-3 Davinci API provided by OpenAI, with greedy decoding, a maximum generation length of 256, and the stop token "\n". Please refer to Appendix A for the full prompt we use. As discussed in Section 3.3, we require only a small number of GPT-3 queries by generating one template for each predicate. Therefore, we query GPT-3 4,299 times in total, generating templates for all the predicates in WebNLG, E2E, and DART, which costs only $23 with the GPT-3 pricing as of 10/21/2022. For the sentence fusion stage (§3.4), we use T5 models of varying sizes as the sentence fusion LM. In the zero-/few-shot settings (§4.3), we finetune the T5 with the large weakly-supervised dataset WikiFluent (Kasner and Dusek, 2022), as mentioned in §3.4. We use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 3 × 10^-5 and a batch size of 64, for 1 epoch. When any shots of labeled data-to-text examples are available, we further finetune the sentence fusion T5 with those examples. For generation, we use beam search decoding with a beam width of 5. We provide more details of the experimental setup in Appendix A.
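The finetuning and decoding hyperparameters above can be collected into a small configuration sketch; the class and field names below are our own, not from any released code.

```python
from dataclasses import dataclass

@dataclass
class FusionFinetuneConfig:
    """Hypothetical container for the fusion-stage settings described above."""
    model_name: str = "t5-large"   # sentence fusion backbone (zero-/few-shot)
    learning_rate: float = 3e-5    # Adam initial learning rate
    batch_size: int = 64
    epochs: int = 1                # one pass over WikiFluent
    num_beams: int = 5             # beam width at generation time

cfg = FusionFinetuneConfig()
print(cfg)
```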
Evaluation Metrics Following previous studies, we report performance in terms of BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005), as well as the recent PARENT-F1 metric (Dhingra et al., 2019), which measures the alignment of the generated text with both the references and the input data. We also report two embedding-based metrics, BERTScore (Zhang et al., 2019) and BLEURT (Sellam et al., 2020), in Appendix C. Besides, we perform a human evaluation in the few-shot setting, as detailed later.

Table 1: Full-shot learning results on WebNLG (Left) and DART (Right). ASDOT-X denotes our approach with T5-X as the sentence fusion model. The best scores are in bold. We also show the performance gains against respective baseline models in blue.

Zero-, Few-, to Full-Shot Learning
We evaluate ASDOT in the presence of a varying number of training examples, ranging from 0, 10, 20, 50, and 100 shots to the size of the full training set. We experiment on the WebNLG and DART datasets, respectively. In the zero-/few-shot settings, we use the T5-large model as our sentence fusion LM. In the full-shot setting, we test three T5 models of different sizes (small, 60M parameters; base, 220M; large, 770M) for sentence fusion. Besides, the recent Prefix-Tuning method (Li and Liang, 2021) shows competitive performance on the data-to-text generation task. We thus also incorporate it with the T5-large architecture and report the results.
Baselines In the zero-/few-shot settings, we compare with KGPT (Chen et al., 2020a), a knowledge-grounded LM pretrained on a large-scale automatically constructed data-to-text corpus, as it is one of the few methods applicable to both zero- and few-shot data-to-text generation. Besides, we compare with FS-KG (Li et al., 2021), a recent few-shot data-to-text approach enhanced with representation alignment between knowledge graphs and PLMs. We also compare with the end-to-end model based on T5-large, which has shown remarkable performance on data-to-text tasks with sufficient training examples (Ribeiro et al., 2020). Following Ribeiro et al. (2021), for the T5 baseline, we prepend <H>, <R>, and <T> before the subjects, predicates, and objects, respectively, and add a prefix "translate Graph to English:" to the input. We finetune the T5 model with the available shots of training examples.
On the WebNLG dataset, we report another baseline, Neural Pipeline (Kasner and Dusek, 2022), which is a template-based pipeline method also trained on the WikiFluent dataset and is applicable only to the zero-shot setting. However, the method cannot be used on the DART dataset since its templates are specifically written for WebNLG by humans.

Automatic Evaluation
The zero-/few-shot results are shown in Figure 2. Our method consistently outperforms baseline models on both datasets, demonstrating its strong zero-/few-shot learning ability. In particular, with fewer training examples, our ASDOT tends to outperform other methods by a larger margin. For instance, we achieve 16.06 higher BLEU than T5-large on 10-shot WebNLG, and 10.53 higher on 10-shot DART. This is because the two-stage ASDOT is designed to excel in low-data contexts by augmenting the generation process with rich external knowledge in pretrained LMs. Neural Pipeline is competitive with ours, but is restricted only to the zero-shot setting on WebNLG. DART contains more diverse types of predicates and thus is arguably more challenging than WebNLG. Our approach tends to achieve stronger performance gains on the more difficult dataset.
We report the results of the full-shot setting in Table 1. The performance gain tends to be less significant compared to the zero-/few-shot settings as all methods are presented with a large number of training examples. However, our method still achieves consistently stronger performance over the large diversity of baselines, thanks to ASDOT's proper modeling of the generation process and the incorporation of rich external implicit knowledge.

Human Evaluation
We conduct a human evaluation to further assess our ASDOT against other baselines under the 50-shot setting on WebNLG.
After training, we sample 50 test instances and ask three proficient English speakers in the university to score the model outputs. Following Chen et al. (2020b), each generated result is evaluated on three aspects: the number of facts that are consistent with the input table (Faithfulness), the number contradicting the table (Contradict), and the language fluency, on a 3-point Likert scale (0, 1, 2). The results are shown in Table 2. The Krippendorff alphas (Krippendorff, 2011) for Faithfulness, Contradict, and language fluency are 0.49, 0.42, and 0.36, respectively, indicating a fair inter-annotator agreement. Consistent with the automatic evaluation results, we observe that ASDOT is substantially better than the baselines on all three aspects, suggesting that our approach generates more faithful and fluent descriptions.

Ablation Studies
We conduct ablation studies to investigate the effects of both the data disambiguation and sentence fusion stages. Table 3 shows the results. Specifically, for the sentence fusion stage, we study the effect of the weakly-supervised finetuning on the WikiFluent corpus (§3.4). From the table, we can see that the performance drops sharply without weakly-supervised finetuning, i.e., by 8.86 BLEU points for the zero-shot setting. However, ASDOT without weak supervision still outperforms the baselines in most cases, validating the strong advantage of our approach under low-data settings. For the data disambiguation stage, we investigate the impact of the automatic templates produced by GPT-3. More concretely, we replace the GPT-3 templates with the human-written templates from Kasner and Dusek (2022). The performance is similar or decreases slightly, demonstrating that the short sentences or templates automatically generated in the data disambiguation stage are of competitive or slightly higher quality than the manually created ones (perhaps due to human errors when writing the hundreds of templates).

Generating for Unseen Predicates
We now assess the model's capability of describing new predicates that are never seen during training.
As mentioned in §4.1, WebNLG provides such an official test-unseen set for the evaluation, and we construct a similar (but more difficult) test set on DART where all the test predicates are not included in training. We train the models on WebNLG and DART, and evaluate on the corresponding test-unseen sets, respectively. As in §4.3, we compare ASDOT with the respective end-to-end T5 models (small, base, large, prefix-tuning). We also include the previously reported baseline results on the WebNLG test-unseen set, including Best-Plan (Moryossef et al., 2019), Pipeline-Trans (Castro Ferreira et al., 2019), and PlanEnc (Zhao et al., 2020). The experimental results are shown in Table 4 and Table 5, respectively. As can be seen, our method achieves consistent improvements over all the baseline methods, showing its robustness to unseen predicates given the rich commonsense and world knowledge introduced through the pretrained LMs in both stages. The superior performance of ASDOT over the corresponding end-to-end T5 again demonstrates the advantage of our modularization, which applies to and improves various pretrained LMs. Similar to the zero-/few-shot experiments, we observe that on the more difficult DART test-unseen set with more unseen predicates, our method achieves more significant gains than on WebNLG, which further shows the advantage of our method when generalizing to unseen predicates.

Learning with Out-of-Domain Examples
At last, we quantitatively measure the generalization ability of our approach across domains. To simulate the out-of-domain setting, we train our model on the WebNLG dataset and evaluate it on the test sets of DART and E2E, respectively. The DART test set includes instances from the WebNLG and E2E test sets; we remove those instances to avoid any in-domain test examples (w.r.t. the WebNLG training examples) and any overlap with the E2E evaluation. We compare our method with the end-to-end finetuned T5-large model. The experimental results in Table 6 show that our method outperforms the baseline models on both out-of-domain test sets, echoing the conclusions in previous experiments that our approach, with its two-stage design and integration of pretrained LMs, has a superior generalization ability to handle data-to-text generation in any-shot scenarios.

Case Study
Table 7 shows the outputs of our ASDOT (based on T5-large) after the data disambiguation stage and the sentence fusion stage, on two data instances in the out-of-domain and unseen-predicates settings, respectively. The generated words corresponding to different data triples are highlighted in different colors (as in Figure 1). We also provide the results of the T5-large baseline and the human-written references.
As can be seen, ASDOT develops a strong generalization ability to out-of-domain data and unseen predicates. In the first example, ASDOT successfully disambiguates the triple <Zolder, fastest Lap, Liverpool F.C.> into "Liverpool F.C. set the fastest lap in the Zolder", while the T5 baseline fails to do so and simply generates "Zolder's faster lap in Liverpool F.C.". Also, in the second example, the baseline directly copies "associated Band/associated Musical Artist" in the output, while ASDOT correctly converts it into "is associated with".

Conclusion
We have proposed ASDOT to deal with the diverse any-shot problems for data-to-text generation. ASDOT is composed of two stages: data disambiguation, which uses prompted GPT-3 to disambiguate input data triples into short sentences, and sentence fusion, which uses state-of-the-art pretrained LMs to fuse these sentences into the desired paragraphs. In the process, ASDOT integrates rich external implicit knowledge from the large LMs, which ensures strong generalization capability and broad applicability to zero-/few-/full-shot, unseen-predicates, and out-of-domain training scenarios. Extensive experiments show our approach consistently achieves significant improvements over diverse baselines.

Limitations
One limitation of our approach is that the data disambiguation stage is performed by the GPT-3 model locally, i.e., the GPT-3 model only observes one triple and does not utilize the full-table information. In some difficult cases, the full-table context may be needed for disambiguation. Besides, in this work we directly use the output from GPT-3 as the final disambiguation result, which may be problematic since GPT-3 does not always provide correct templates, especially when working with highly specialized domains. In addition, our current approach can only be applied to languages that have access to large LMs.

Ethics Statement
We are aware of the ACL Code of Ethics and the ACM Code of Ethics and Professional Conduct and strictly adhere to the rules throughout the course of this research. Our research does not present any new datasets but introduces a new algorithm for data-to-text generation, which generates text descriptions for a given graph or table. The intended usage of the work may potentially provide benefits to people with difficulties in reading graphs or tables, such as people with visual impairment. We do not anticipate direct harm with the intended usage.
Similar to most generation systems, if harmful input, such as unethical text or input designed for adversarial attacks, exists, our approach is likely to generate unintended output. Therefore, we do not recommend use of our approach outside controlled research environments before these risks are mitigated. We would also like to point out that a naive deployment of our method may allow malicious exploitation of the backbone large LMs, so precautions such as a filtering mechanism need to be implemented.
Our model makes use of the common sense reasoning ability of large LMs, which may reinforce existing social stereotypes, hence care must be taken when applying this approach to materials (e.g., tables and graphs) that are sensitive to populations that already experience marginalization.
Computation-wise, our finetuning procedure takes around 1,836 GPU-hours on NVIDIA GeForce RTX 3090 Ti GPUs. Throughout the study, our prompting module makes about 4,600 API calls to OpenAI's GPT-3 API.

Figure 2 :
Figure 2: Results of zero-/few-shot learning on WebNLG (left) and DART (right), respectively. The x-axis is the number of training examples, and the y-axis is the BLEU score. We report results of other metrics in Appendix C. Neural Pipeline (Kasner and Dusek, 2022) is applicable only to the zero-shot setting and the specific WebNLG data due to the need for human-written templates on the dataset. Our approach shows consistent improvement over the baselines, especially when the training size is small. We use paired bootstrap resampling (Koehn, 2004), which confirms that our method is superior to all the baselines at 95% statistical significance.

Table 2 :
Human evaluation results. ↑ means the higher the better and ↓ means the lower the better. ASDOT outperforms the baselines with p < 0.05 in Tukey's HSD test for all the measures.

Table 3 :
Ablation results (BLEU) for zero-/few-shot learning on WebNLG. The w/o weak-sup row shows the results of ASDOT without weakly supervised finetuning, and w/ manual templ. shows the results of using handcrafted templates in the data disambiguation stage.

Table 5 :
Results on DART test-unseen set.

Table 6 :
Out-of-Domain results.B, M and P represent BLEU, METEOR and PARENT-F1, respectively.

Table 7 :
Qualitative examples in the out-of-domain (top) and unseen-predicates (bottom) settings.

Table 12 :
DART few-shot results.x / y / z denotes the model performance on BLEU / METEOR / PARENT-F1.

Table 13 :
DART few-shot results.x / y denotes the model performance on BERTScore / BLEURT.