Compositional Generalization for Data-to-Text Generation

Data-to-text generation involves transforming structured data, often represented as predicate-argument tuples, into coherent textual descriptions. Despite recent advances, systems still struggle when confronted with unseen combinations of predicates, producing unfaithful descriptions (e.g., hallucinations or omissions). We refer to this issue as compositional generalization, and it motivated us to create a benchmark for assessing the performance of different approaches on this specific problem. Furthermore, we propose a novel model that addresses compositional generalization by clustering predicates into groups. Our model generates text in a sentence-by-sentence manner, relying on one cluster of predicates at a time. This approach significantly outperforms T5 baselines across all evaluation metrics. Notably, it achieves a 31% improvement over T5 on a metric focused on maintaining faithfulness to the input.


Introduction
Data-to-text generation (DTG) (Gardent et al., 2017; Dušek et al., 2020) aims to accurately generate textual descriptions from input tuples; the tuples should encompass all the information needed for generating the description, regardless of the narrative order. Typically, as shown in Figure 1, each tuple consists of two arguments and one predicate that conveys their relationship. Given the large number of pre-defined predicates, it is time-consuming to collect human-annotated training examples for each potential combination of them. Thus, models must be able to generalize and handle examples with previously unseen predicate combinations. We refer to this generalization scenario as compositional generalization.
Prior research (Mehta et al., 2022; Xu et al., 2021; Kale and Rastogi, 2020a; Peng et al., 2020; Chen et al., 2020) has focused on evaluating the compositional generalization (CG) abilities of DTG models. These studies created few-shot training splits from established benchmarks by reducing the number of training examples or limiting the number of distinct predicate combinations in the training set through random selection. However, these arbitrary selections overlook the practical effort required for annotating different examples. For example, annotating examples with a larger number of input tuples requires more time and effort.

Figure 1: An example from the WebNLG dataset (Gardent et al., 2017). DTG aims at transforming the input structured data (left) into a coherent textual description (right).
We introduce a test environment based on Gardent et al. (2017). During training, models are exposed to examples with fewer input tuples, while in the testing phase, examples with more input tuples are presented. To make it even more challenging, we combine CG with few-shot learning by reducing the number of training examples for each predicate combination to one. We also combine CG with domain adaptation by evaluating the models on unseen domains. Our results demonstrate that SoTA pre-trained language models (LMs; Raffel et al., 2020; Kale and Rastogi, 2020b) fail to generalize effectively in our experimental setup.
To tackle this issue, we propose a clustering-based method (Figure 2) that utilizes the graph weights learned from training data to decompose unfamiliar predicate compositions into smaller groups during inference. Each group consists of predicate combinations encountered by the model during training. Then, individual sentence descriptions are generated separately for each group, and combined to form the final description for the input.

Figure 2: Framework of our proposed inference procedure. We introduce a set of clustering-based methods that leverage the graph weights learned during training to decompose the unseen predicate compositions into familiar groups. For each group, we gather the input tuples associated with predicates in that group, and generate a sentence to describe them. The final text description is created by combining the generated sentences from all the groups.

In contrast to previous studies that primarily rely on self-training to improve CG in DTG (He et al., 2020; Heidari et al., 2021; Li et al., 2021; Mehta et al., 2022), or on data augmentation to improve CG in various tasks such as semantic parsing (Andreas, 2020; Qiu et al., 2022; Fang et al., 2023), our method relies solely on small training sets and does not require any additional human-annotated, automatically labeled, or unlabeled data. In the CG-centric testing scenario, we observe significant improvements across all metrics in the benchmark compared to the vanilla T5 model (Raffel et al., 2020; Kale and Rastogi, 2020b). A faithfulness-based metric (Dušek and Kasner, 2020) shows an impressive gain of 31% over T5. A similar trend is seen when combining the CG challenge with few-shot learning and domain adaptation. Our contributions are:
• We create a benchmark to assess the CG ability of DTG models and generate four testing scenarios of varying difficulty by combining CG with few-shot learning and domain adaptation.
• We present an innovative architecture that uses a clustering algorithm to decompose the text description generation for unfamiliar input predicate combinations into smaller, familiar ones.
• We show that our method produces outputs that are not only more faithful but also more similar to human-written references than vanilla pre-trained LMs when tested on the proposed benchmark.
• We also introduce an intrinsic evaluation framework for inspecting input decomposition.

CG-focused Benchmark
Existing benchmarks (Ratnaparkhi, 2000; Liang et al., 2009; Mairesse et al., 2010; Banik et al., 2013; Wen et al., 2015; Lebret et al., 2016; Wen et al., 2016; Gardent et al., 2017; Wiseman et al., 2017; Novikova et al., 2017; Parikh et al., 2020) have provided an important test-bed for DTG models. We specifically choose WebNLG 2017 (Gardent et al., 2017) to build our benchmark upon (reasons are given in Appendix B) and primarily focus on assessing the models' capability for CG. We present two sets of training splits and create four distinct testing scenarios with different difficulty levels by combining the training splits with the seen and unseen test sets offered in WebNLG.

Training Sets
CGFULL consists of a set of independent training splits {CGFULL-k}, with k ranging from 2 to 7. CGFULL-k exclusively consists of examples where the number of input tuples is equal to or less than k. We excluded CGFULL-5 and -6 due to the marginal increase in the amount of data within these two sets. Note that CGFULL-7 represents the full training set of WebNLG 2017. CGONESHOT mirrors CGFULL, except that each split CGONESHOT-k keeps only one training example per predicate combination (see Section 2.3).
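To make the split construction concrete, a minimal sketch is shown below; it assumes each WebNLG example is stored as a dict whose "tuples" field holds (subject, predicate, object) triples, which is an assumption about the data format rather than the paper's released code:

```python
import random
from collections import defaultdict

def build_cgfull_k(examples, k):
    """CGFULL-k: keep every training example with at most k input tuples."""
    return [ex for ex in examples if len(ex["tuples"]) <= k]

def build_cgoneshot_k(examples, k, seed=0):
    """CGONESHOT-k: as CGFULL-k, but keep one example per predicate combination."""
    random.seed(seed)
    by_combo = defaultdict(list)
    for ex in build_cgfull_k(examples, k):
        # the predicate combination identifies the example's tuple set
        combo = frozenset(p for _, p, _ in ex["tuples"])
        by_combo[combo].append(ex)
    return [random.choice(group) for group in by_combo.values()]
```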

Validation and Test Sets
The validation and test sets of the original WebNLG 2017 remain unchanged. The test set consists of two categories: the seen category includes examples from 9 domains present in training, while the unseen category consists of examples from 5 new domains with newly defined predicates and unseen entities not present in training. Note that both validation and test sets contain examples with different numbers of input tuples, ranging from 1 to 7.

Evaluation Scenarios
We create four evaluation scenarios (Figure 3) by pairing training splits with test sets. When trained on CGFULL-k, the models are exposed to numerous examples with predicate combinations of up to k predicates. When tested on the WebNLG seen category, their CG abilities are evaluated as they encounter novel combinations consisting of a greater number of predicates. To further intensify the challenge, the models are tested on the unseen category, requiring them to demonstrate both CG and adaptability to predicate combinations from new domains. They also need to handle newly introduced arguments. On the other hand, the models trained on CGONESHOT-k have only seen one example for each combination of up to k predicates. When tested on the seen set, their CG abilities are assessed together with few-shot learning skills. When evaluated on the unseen set, the examination of their capabilities further extends to domain adaptation.

Evaluation Metrics
Generation Evaluation focuses on evaluating the generated text w.r.t. its similarity to human-authored reference sentences. We adopt BLEU (Papineni et al., 2002), a token-level exact-matching metric that is incorporated in the WebNLG evaluation.
Faithfulness Evaluation tests whether the generated text is faithful to the input tuples (Wen et al., 2015; Reed et al., 2018; Dušek and Kasner, 2020). Unfaithful generations contain hallucinations (generations with extra or incorrect information) or omissions (generations missing important input information) (Dušek et al., 2019). We adopt PARENT (Dhingra et al., 2019), an entailment-based metric, where a higher score indicates a lower occurrence of hallucinations; and OK-percent (Dušek and Kasner, 2020), a natural language inference-based metric representing the proportion of system generations free from hallucinations or omissions.
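As a usage note, corpus BLEU can be computed with the sacrebleu library as sketched below; PARENT and OK-percent require their respective reference implementations (Dhingra et al., 2019; Dušek and Kasner, 2020) and are not sketched here. The helper name is ours:

```python
import sacrebleu

def webnlg_bleu(hypotheses, references):
    """Corpus BLEU via sacrebleu. `hypotheses` is a list of system outputs;
    `references` is a list of per-example reference lists, assumed to contain
    the same number of references for every example."""
    # sacrebleu expects references transposed into parallel reference streams
    ref_streams = [list(stream) for stream in zip(*references)]
    return sacrebleu.corpus_bleu(hypotheses, ref_streams).score
```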

Clustering-based CG Methods
The framework of our approach is shown in Figure 2. For clarity, we denote the input tuples and their corresponding predicates as $X = \{X_1, X_2, \cdots, X_N\}$ and $P = \{P_1, P_2, \cdots, P_N\}$ respectively, where $P_i$ is the predicate of the $i$-th tuple $X_i$ and $N$ is the total number of tuples. The output text is denoted as $Y = \{Y_1, Y_2, \cdots, Y_M\}$, where $Y_j$ represents the $j$-th sentence in the output and $M$ is the number of sentences.
Fine-tuned pre-trained LMs (Kale and Rastogi, 2020b; Ribeiro et al., 2021) aim to maximize the log likelihood of generating the ground-truth text $Y$ given the linearized input tuples $X$, denoted as $\log p(Y|X)$, during training. However, these models face challenges in generalizing to unseen predicate combinations. To overcome this, an intuitive approach is to decompose these unseen combinations into smaller groups, ensuring that the combination in each group has been seen during training; then, generate a sentence from each group individually and combine the sentences to form the final description.
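Following the linearization format visible in the appendix examples (`<SUB> ... <PRED> ... <OBJ> ...`), a minimal sketch of how tuples might be serialized into a T5 input string; the function name and the semicolon separator between tuples are assumptions inferred from those examples:

```python
def linearize(tuples):
    """Serialize (subject, predicate, object) tuples into a flat input string,
    using the <SUB>/<PRED>/<OBJ> markers from the WebNLG examples."""
    parts = [f"<SUB> {s} <PRED> {p} <OBJ> {o}" for s, p, o in tuples]
    return "; ".join(parts)

# e.g. linearize([("spain", "language", "spanish language")])
# -> "<SUB> spain <PRED> language <OBJ> spanish language"
```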
We denote the decomposition as $C = \{C_1, C_2, \cdots, C_M\}$, where $C_j$ represents the $j$-th predicate group responsible for generating sentence $Y_j$. Since DTG tasks require all the input information to be included in the output without repetition, the decomposition must fulfill
$$\bigcup_{j=1}^{M} C_j = P, \qquad C_j \cap C_{j'} = \emptyset \;\; \forall j \neq j'.$$
The text generation is then broken down into a set of parallel steps. Each step aims at creating a single sentence to describe the tuples associated with the predicates in one of the groups:
$$p(Y|X, C) = \prod_{j=1}^{M} p\left(Y_j \mid X_{C_j}\right), \qquad (1)$$
where $X_{C_j}$ is a subset of $X$; the tuple $X_i \in X_{C_j}$ iff its predicate $P_i \in C_j$. An alternative representation of a predicate decomposition involves the use of a matrix. Given a set of tuples with their corresponding predicates, we construct a fully connected undirected graph, denoted as $G = (V, E)$, where the predicates are represented as nodes in $V$.
In turn, a decomposition can be considered as a partitioned graph derived from the original graph G.
We encode the partitioned graph as a binary matrix $M$, where $M_{ij} = 1$ signifies that the predicates $P_i$ and $P_j$ belong to the same group, while $M_{ij} = 0$ indicates that they belong to different groups.
Unfortunately, annotated ground-truth decompositions are unavailable in the majority of DTG training sets. Therefore, the training objective becomes maximizing the marginal log likelihood of the output text $Y$ w.r.t. the latent $M$:
$$\log p(Y|X) = \log \sum_{M} p(M|X)\, p(Y|X, M). \qquad (2)$$
As the number of input tuples increases, exploring all possible decompositions for each example becomes intractable. Following Kim et al. (2017) and Deng et al. (2018), we approximate the marginal likelihood as:
$$\log \sum_{M} p(M|X)\, p(Y|X, M) \approx \log p\left(Y \mid X, \bar{M}\right),$$
which replaces the stochastic decomposition variable $M$ with the deterministic value $\bar{M} = \mathbb{E}_{M \sim p(M|X)}[M]$. Assuming that $M$ follows a Bernoulli distribution $\mathcal{B}(\gamma)$, with each element within the matrix being independent, we can represent the distribution of $M$ as
$$p(M|X) = \prod_{i,j} \gamma_{ij}^{M_{ij}} \left(1 - \gamma_{ij}\right)^{1 - M_{ij}}.$$
Thus, the expectation of the binary matrix $M$ is the Bernoulli parameter matrix $\gamma$. In Section 3.1 we present two training methods to predict $\gamma$.

Training
Inspired by Su et al. (2021) and Moryossef et al. (2019), we propose an automatic way of creating silver annotations for the training of $\gamma$. For each training example, we calculate the BLEU score of each input tuple w.r.t. each sentence in the reference output. Afterwards, the tuple is aligned with the sentence that achieves the highest BLEU score. This process yields a collection of tuple groups, where tuples within a group are described in the same sentence. By removing the arguments from each tuple, we obtain an annotated predicate decomposition from each DTG training example. We now introduce two methods to obtain the matrix $\gamma$ from these annotated predicate decompositions.

Numerical Weight Prediction determines $\gamma_{ij}$ from the individual occurrence vs. co-occurrence statistics of two predicates $P_i, P_j$ in the training data, where $\#(P_i, P_j)$ denotes the frequency with which both predicates $P_i$ and $P_j$ are mentioned in the same sentence throughout the corpus, and $\#(P_*)$ represents the frequency of predicate $P_*$ appearing in the corpus. However, this approach has a limitation: if either predicate is not included in the training set, the weight will always be zero. This becomes challenging when transitioning to a new domain, since most weights in the matrix $\gamma$ will then be zero.
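A minimal sketch of both steps follows, reusing the `linearize` helper from above. The BLEU smoothing choice is an assumption, and since the paper's exact normalization for the count-based weight is not reproduced here, a Dice-style ratio is used as an illustrative stand-in (note that it still yields zero for unseen predicates, matching the limitation just described):

```python
from collections import Counter
from itertools import combinations
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def align_tuples_to_sentences(tuples, sentences):
    """Align each tuple to the reference sentence with the highest BLEU score.
    Returns one (possibly empty) tuple group per sentence."""
    groups = [[] for _ in sentences]
    for t in tuples:
        hyp = linearize([t]).split()  # tuple as hypothesis, sentence as reference
        scores = [sentence_bleu([s.split()], hyp, smoothing_function=smooth)
                  for s in sentences]
        groups[max(range(len(sentences)), key=scores.__getitem__)].append(t)
    return groups

def count_based_gamma(decompositions):
    """Count-based co-occurrence weights for gamma; each decomposition is a
    list of predicate groups. A Dice-style ratio stands in for the paper's
    exact normalization, which is not shown in the extracted text."""
    single, pair = Counter(), Counter()
    for groups in decompositions:
        for g in groups:
            single.update(g)
            pair.update(frozenset(p) for p in combinations(sorted(set(g)), 2))
    def gamma(p_i, p_j):
        co = pair[frozenset((p_i, p_j))]
        denom = single[p_i] + single[p_j]
        return 2.0 * co / denom if denom else 0.0
    return gamma
```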
Neural Network based Prediction solves this problem by introducing a small-scale transformer-based neural network. The model takes two tokenized predicates, concatenated, as input, with a "[CLS]" token attached to the beginning. The embedding of the "[CLS]" token from the final transformer layer is fed into a classification head, which predicts whether the two predicates should be described in the same sentence (1) or not (0). Elements in $\gamma$ can be written as
$$\gamma_{ij} = \sigma\left(W h_{\texttt{[CLS]}} + b\right),$$
where $W$ is a linear transformation and $b$ is the bias. For classifier training, synthetic data is generated from the automatically annotated predicate decompositions. Positive examples are formed by pairing any two predicates within the same group, while negative examples consist of pairs from different groups. In cases where there is only one predicate group in the input, indicating a single-sentence output, each input predicate is randomly paired with a predicate from the dataset that is not part of the input to create the negative examples. The model is trained with the cross-entropy loss.

Algorithm 1 DTG with Predicate Decomposition
Require: Input tuples X; their predicates P; trained predicate clustering model PC; fine-tuned generator T5.
1: γ ← PC(P)
2: for k = 1 to |X| do
3:   C ← SpectralClustering(γ, k)
4:   if the minimum weight within every cluster of C exceeds ε then break
5: end for
6: Order the clusters in C
7: for each cluster Cj ∈ C do
8:   Yj ← T5({XCj})
9: end for
10: Y ← {Yj}
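A minimal PyTorch sketch of the pair classifier above, using the dimensions reported in Appendix D. The vocabulary handling, the class name, and the sigmoid head trained with binary cross-entropy (in place of a two-class head) are simplifying assumptions:

```python
import torch
import torch.nn as nn

class PredicatePairScorer(nn.Module):
    """Scores whether two predicates should be described in the same sentence.
    Dimensions follow Appendix D: d_model=128, ff=256, 4 heads, 2 layers, dropout 0.1."""
    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=256,
                                           dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 1)  # the W, b of the equation above

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); position 0 is the [CLS] token,
        # followed by the two tokenized predicates concatenated.
        h = self.encoder(self.embed(token_ids))
        return torch.sigmoid(self.head(h[:, 0])).squeeze(-1)  # gamma_ij in (0, 1)
```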
To train the text generation model, we use the annotated tuple groups introduced earlier in this section instead of relying on the predicate decompositions predicted by the introduced models. Since each tuple group is aligned to a sentence, we take the tuples as input and the sentence as output to fine-tune a sentence-level T5 for text generation.
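A short sketch of how the fine-tuning pairs might be assembled, reusing the alignment and linearization helpers above; the example dict layout is an assumption:

```python
def build_sentence_level_pairs(example):
    """Create (linearized tuple group -> sentence) pairs for fine-tuning the
    sentence-level T5 generator."""
    groups = align_tuples_to_sentences(example["tuples"], example["sentences"])
    return [{"input": linearize(g), "target": s}
            for g, s in zip(groups, example["sentences"]) if g]
```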

Testing
The testing procedure is outlined in Algorithm 1. Initially, we aim to obtain a predicate decomposition C for a given set of input tuples. To accomplish this, we begin by estimating the expectation of the binary matrix M, i.e., γ, using the models introduced in Section 3.1. Afterwards, we iterate through all possible values for the number of predicate groups k (ranging from 1 to |X|) and apply a clustering algorithm (specifically, spectral clustering in this study) to the matrix γ. If the minimum weight between two predicates within the same cluster exceeds a threshold ε, we halt the exploration. Our objective is to minimize the number of predicate clusters while ensuring that no cluster contains unfamiliar predicate pairs.
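One way to realize this search with scikit-learn's spectral clustering is sketched below; the function name and the epsilon default are placeholders, and gamma is assumed to be a symmetric affinity matrix with 1.0 on the diagonal:

```python
from sklearn.cluster import SpectralClustering

def decompose_predicates(gamma, epsilon=0.1):
    """Return cluster labels for the smallest k such that every within-cluster
    predicate pair has weight above epsilon."""
    n = gamma.shape[0]
    for k in range(1, n):
        labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                                    random_state=0).fit_predict(gamma)
        if all(gamma[i, j] > epsilon
               for i in range(n) for j in range(i + 1, n)
               if labels[i] == labels[j]):
            return labels.tolist()
    return list(range(n))  # fall back to singleton clusters
```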
To enhance the coherence of the generated text, we implement a simple method for ordering the predicate clusters. For each cluster, we calculate how often its predicates are described in the first sentence across all training examples, and choose the cluster with the highest frequency as the first cluster. Next, we select the subsequent cluster by identifying the one with the highest number of unique arguments also observed in the previous cluster. We repeat this step until all clusters are sorted. Finally, we use the fine-tuned T5 model to produce a sentence for each cluster following this order, and concatenate the generated sentences to form the final output describing the input tuples.
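A sketch of this greedy ordering; `first_sentence_counts` (predicate to count of first-sentence mentions in training) and `arguments_of` (predicate to its argument set in the current input) are hypothetical lookup helpers:

```python
def order_clusters(clusters, first_sentence_counts, arguments_of):
    """Start with the cluster whose predicates most often open a training
    description, then greedily chain clusters by argument overlap."""
    remaining = list(clusters)
    current = max(remaining, key=lambda c: sum(first_sentence_counts[p] for p in c))
    ordered = [current]
    remaining.remove(current)
    while remaining:
        prev_args = set().union(*(arguments_of(p) for p in ordered[-1]))
        current = max(remaining, key=lambda c: len(
            prev_args & set().union(*(arguments_of(p) for p in c))))
        ordered.append(current)
        remaining.remove(current)
    return ordered
```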

REINFORCE-enhanced Decomposition
Deterministic approaches rely heavily on automatically annotated predicate decompositions. However, the annotator, based on exact token matching, is weak at detecting paraphrases, which can misalign tuples to the wrong sentence. To address this, we propose a REINFORCE (Glynn, 1990; Williams, 1992) based approach that reduces reliance on silver annotations.
We first simplify the marginal distribution in Eq. 2 using Jensen's inequality. Since the logarithm function is concave, we have:
$$\log \sum_{M} p_\phi(M|X)\, p(Y|X, M) \;\geq\; \mathbb{E}_{M \sim p_\phi(M|X)}\left[\log p(Y|X, M)\right].$$
Our goal is to train the parameters $\phi$ in $p_\phi(M|X)$ to optimize this lower bound. The gradient w.r.t. $\phi$ is:
$$\nabla_\phi\, \mathbb{E}_{M \sim p_\phi(M|X)}\left[\log p(Y|X, M)\right] = \mathbb{E}_{M \sim p_\phi(M|X)}\left[\log p(Y|X, M)\, \nabla_\phi \log p_\phi(M|X)\right].$$
However, directly sampling a binary matrix from the Bernoulli distribution $M \sim \mathcal{B}(\gamma)$ does not guarantee that the matrix can be transformed into a valid set of clusters $C$. Therefore, we propose a method to replace the sampling process in the forward pass. We first sample a binary matrix $M$ from the Bernoulli distribution $\mathcal{B}(\gamma)$. Next, we perform element-wise multiplication between this matrix and $\gamma$, denoted as $M \odot \gamma$. Finally, we apply the spectral clustering algorithm to the resulting matrix to obtain discrete clusters $C$. As this process is not differentiable, akin to the straight-through estimator (Bengio et al., 2013), we perform the backward pass through the sampling step by computing the probability $p_\phi(M|X)$ from the sampled predicate decomposition $C$:
$$p_\phi(M|X) = \prod_{i,j} \gamma_{ij}^{\,k_{ij}} \left(1 - \gamma_{ij}\right)^{1 - k_{ij}},$$
where $k_{ij} \in \{0, 1\}$: $k_{ij} = 1$ means that predicates $P_i$ and $P_j$ belong to the same cluster in $C$, while $k_{ij} = 0$ indicates that they are assigned to different clusters. With this approach, gradients can be propagated through the Bernoulli distribution. We compute $\gamma_{ij} \in (0, 1)$ using the transformer classifier described in Section 3.1. To speed up convergence, we initialize the parameters with the classifier trained in the deterministic approach.
In turn, we align each cluster in the sampled decomposition C with a sentence from the ground-truth text description using the Hungarian algorithm (Kuhn, 1955). The cost matrix for the algorithm is derived from the negative BLEU score between a tuple group and a sentence. Then, we employ the fine-tuned T5 generator (Section 3.1) as the reward model to evaluate the sampled predicate decomposition. The reward is calculated based on Eq. 1, i.e., the product of the likelihoods of generating each sentence in the ground truth, conditioned on its corresponding aligned tuple group. Since REINFORCE is prone to high variance (Zaremba and Sutskever, 2015; Li et al., 2016), we subtract a baseline $\log p(Y|\bar{C}, X)$ from the reward, where $\bar{C}$ denotes a randomly generated predicate decomposition, obtained by randomly assigning each tuple to a cluster.
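A compact sketch of one REINFORCE update under the straight-through scheme above, plus the Hungarian alignment; `reinforce_step`, `clusters_from` (e.g. the spectral-clustering helper from the previous section), and `reward_of` (the baselined log-likelihood reward from the frozen T5 generator) are hypothetical names:

```python
import numpy as np
import torch
from scipy.optimize import linear_sum_assignment

def reinforce_step(gamma, clusters_from, reward_of, optimizer):
    """gamma: (n, n) tensor of Bernoulli parameters produced by the pair
    classifier (so gradients flow into phi through log p_phi(M|X))."""
    m = torch.bernoulli(gamma.detach())                 # forward-pass sample
    labels = clusters_from((m * gamma).detach().numpy())
    same = torch.tensor(np.equal.outer(labels, labels).astype(np.float32))
    # log p_phi(M|X) recomputed from the sampled decomposition (k_ij = same_ij)
    log_p = (same * torch.log(gamma + 1e-9)
             + (1 - same) * torch.log(1 - gamma + 1e-9)).sum()
    loss = -reward_of(labels) * log_p                   # score-function estimator
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def align_clusters_to_sentences(bleu_matrix):
    """Hungarian alignment of tuple groups to reference sentences;
    cost is the negative BLEU between each group and each sentence."""
    rows, cols = linear_sum_assignment(-bleu_matrix)
    return list(zip(rows, cols))
```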

Experiments and Results
We compare our methods (CG-Numerical, CG-NN, CG-RL) against fine-tuned T5 (Kale and Rastogi, 2020b) using the benchmarks introduced in Section 2. Additionally, we include another baseline called CG-Random, which picks a number of groups (ranging from 1 to |X|) and assigns each predicate to one of the groups at random. Comparing against CG-Random provides insight into the impact of our methods on predicate decomposition and how it affects text generation quality. To enhance readability given the large number of experiments conducted, we present a portion of the results and discuss them thoroughly in this section. Tables 2 and 3 show model performance in testing scenarios 2 and 4 (refer to Figure 3), respectively. For scenarios 1 and 3, the results can be found in Tables 10 and 11 in Appendix F.

Case Study on Pre-trained LMs
In all four testing scenarios, we observe a significant decline in both generation performance and faithfulness when T5 is trained on the splits with fewer input tuples. This suggests that pre-trained LMs are not well-suited for CG tasks. Additionally, we highlight the performance of T5 models trained using CGONESHOT-7, CGFULL-2, and CGFULL-7 in Table 4. Comparing CGFULL-7 and CGONESHOT-7, we observe that limiting the number of training examples for each predicate combination to one does not significantly hurt performance. However, when comparing CGFULL-7 with CGFULL-2, a significant decrease in performance is observed, even though the reduction in training data is smaller. These findings highlight the difficulty of our benchmark and emphasize the importance of models possessing CG abilities.

Results of Proposed Approaches
Our objective is to identify methods that excel across all four distinct testing scenarios introduced in Section 2.3.
Our approaches vs. T5. Our approaches demonstrate superior performance compared to T5 in all four testing scenarios (refer to Figure 3), measured across all three metrics, except for BLEU in the domain adaptation scenario (Table 3). This is due to the presence of new predicates in the out-of-domain test set. Our approaches tend to decompose unseen predicate compositions into smaller groups for text generation, which deviates from the human annotations. On average, the number of sentences in the descriptions generated by humans, T5, and our approaches is 1.35, 1.4, and 2.0 respectively (see Table 16 in Appendix I). This divergence is penalized by reference-based metrics like BLEU. However, our study encourages the decomposition behavior, as it allows for the generation of faithful texts while maintaining a reasonable level of similarity to the human-written references. Another observation is that, across all four scenarios, when trained solely on examples with fewer input tuples, the performance advantage of our approaches over T5 becomes more pronounced. For example, when trained on CGFULL-2 and tested on the seen set (Table 10 in Appendix F), the best-performing CG-based approach outperforms T5 by 2 points on BLEU, 4.2 points on PARENT, and 31.2 points on OK-percent. These findings highlight the effectiveness of our approaches in enhancing the CG capability of vanilla pre-trained LMs, particularly when trained on a very limited number of examples with simple predicate compositions.
Our approaches vs. CG-Random. Our approaches generally outperform CG-Random, particularly in terms of BLEU. However, there is a noticeable decrease in OK-percent when tested on the seen set (Table 2). This is because predicate compositions in the in-domain test set are more likely to have been seen in the training set, so our models tend to decompose input predicates into fewer groups. CG-Random, in contrast, selects the number of groups randomly, resulting in a higher average group number (Table 16 in Appendix I). This allows CG-Random to achieve slight gains in OK-percent, but at the cost of an 8-point decrease in BLEU compared to our approaches. Our study does not promote such unnecessary decompositions, which may lead to unnatural text descriptions. These findings indicate that breaking down predicate compositions into smaller groups generally results in more faithful generations; however, compared to the random approach, learned decompositions produce texts that are closer to human expressions.
CG-Numerical vs. CG-RL. In this section, we directly compare CG-Numerical to CG-RL, since CG-RL is an extension of CG-NN; the comparison between CG-NN and CG-RL can be found in Appendix F. CG-RL exhibits better performance than CG-Numerical in terms of BLEU across all scenarios, particularly when evaluated on out-of-domain examples (Tables 3 and 11). The results for PARENT and OK-percent are comparable between the two approaches, except that CG-Numerical consistently outperforms CG-RL in terms of OK-percent when tested on unseen domains (Table 3). The reason is that CG-RL utilizes neural networks to encode tokenized predicates, enabling some level of knowledge transfer when encountering out-of-domain predicates that consist of tokens seen in the in-domain predicates. CG-Numerical, however, is unable to process out-of-domain predicates, resulting in a higher number of decomposed clusters. In fact, this number (2.4) is even higher than that of CG-Random (1.8) (see Table 16 in Appendix I), which contributes to a decrease in BLEU.
In-depth Analysis and Discussion

We evaluate the quality of the predicate decompositions by computing the Normalized Mutual Information (NMI) between the hypothesis and the reference. Additionally, we require each model to generate predicate decompositions for every input example based on the number of clusters present in the selected reference. We evaluate the generated clusters by computing the NMI w.r.t. the reference. The results are shown in Table 6. When the models are trained using CGFULL-k and tested on the in-domain dataset, both of the proposed methods show superior performance compared to CG-Random, with CG-RL outperforming CG-Numerical. However, while not falling too far behind, none of these models achieve the same level of correlation as humans. Similar trends can be observed in the results for the other three testing scenarios, which can be found in Table 17 in Appendix I.
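NMI between two partitions of the same predicate set can be computed with scikit-learn; a minimal sketch (the helper name is ours):

```python
from sklearn.metrics import normalized_mutual_info_score

def decomposition_nmi(reference_labels, hypothesis_labels):
    """NMI between a reference predicate decomposition and a predicted one.
    Labels are cluster ids, one per input predicate, in the same order."""
    return normalized_mutual_info_score(reference_labels, hypothesis_labels)

# e.g. reference [0, 0, 1] vs. hypothesis [1, 1, 0] -> 1.0 (identical partitions)
```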

Test on Existing Few-shot Benchmarks
Chen et al. (2020) proposed few-shot splits for WebNLG 2020 by randomly selecting a certain portion of examples from the training set. We compare the performance of our best-performing system, CG-RL, with pre-trained BART and two prior works, FT-KGPT (Chen et al., 2020) and CBST (Ke et al., 2022), on these splits. FT-KGPT fine-tunes a knowledge-grounded language model that has been pre-trained on 7M tuple-to-sentence pairs collected from Wikipedia pages. CBST is a BART-based approach that is tuned in a self-training style on 0.37M structured-data examples without paired texts, collected from GenWiki. The results in Table 7 show that CG-RL outperforms BART across all splits and metrics, except for BLEU on the 10% split. FT-KGPT and CBST only reported BLEU scores. In the 1%, 5%, and 10% splits, CBST outperforms our approach. However, we acknowledge that leveraging task-specific pre-training and self-training-based tuning techniques on our text generator could potentially enhance few-shot generation performance.

Limitations and Future Work
We have constructed our benchmark exclusively using WebNLG 2017, as it exhibits several favorable characteristics (see Appendix B). Nonetheless, it would be advantageous to expand the benchmark to include data from a more diverse range of resources. Additionally, we recognize the importance of including multiple languages in the benchmark. The multilingual divisions introduced in WebNLG 2022 were not included because their training data were generated automatically using translation models, which resulted in noisy data.
In the future, we aim to expand our benchmark to include high-quality multilingual data resources.
Recent work (Axelsson and Skantze, 2023; Yuan and Färber, 2023) explored the zero-shot ability of LLMs on DTG. Across benchmarks including WebNLG, both studies find that GPT-3 and ChatGPT achieve lower BLEU scores compared to fine-tuned smaller-scale models. In addition, LLMs still face challenges in comprehending the semantic relationships between entities, and the generated text often includes hallucinations. Thus, we did not include LLMs in this study. Due to limited computational resources, we focused our performance testing on T5-base. However, it is important to evaluate models of different sizes, including LLMs. We anticipate that tasks with longer inputs/outputs, such as multi-document summarization, may derive even greater benefits from the proposed "solving CG through learning to decompose" idea. In the future, we aim to extend this idea to other tasks.

A Examples of system outputs

System outputs are shown in Table 8.

B Reasons for choosing WebNLG

We choose WebNLG because it offers examples with different input/output sizes. The training set covers multiple domains with diverse predefined predicates. It also includes out-of-domain test examples, allowing us to create challenging scenarios that involve both compositional generalization and cross-domain generalization. Moreover, several faithfulness evaluation metrics have been shown to be highly correlated with human judgments on the WebNLG dataset.

C Data Preprocessing for Tuple-to-Sentence Alignment

Concretely, we first preprocess the training examples by tokenizing the predicates in the input tuples, resolving coreferences in the output texts, and splitting the texts into sentences.

D Transformer-based Weight Prediction
The model dimension is 128; the feedforward dimension is 256; the number of attention heads is 4; the number of layers is 2; dropout is 0.1. Other details will be shared with the code.

E Human Evaluation Standards

F Extended Main Experiment Results
This section presents an overview of the experimental results. The performance of all models can be found in Table 10 and Table 11. The results for testing scenarios 1 and 2 can be found in the left and right sections of Table 10, respectively. Similarly, the results for testing scenarios 3 and 4 can be found in the left and right sections of Table 11, respectively.

CG-NN vs. CG-RL. CG-NN and CG-RL show similar performance overall. CG-RL tends to outperform CG-NN on faithfulness metrics when trained on examples with more input tuples, and on BLEU when trained on examples with fewer tuples. This trend is particularly evident when trained on CGONESHOT (right parts of Tables 10 and 11).

Table 17:
Predicate decomposition performance of all models evaluated using NMI. The NMI among human-annotated references is 70.06 and 67.31 for the seen and unseen test sets, respectively. Since we cannot control the number of sentences the vanilla T5 generates, we discard T5 for this experiment.

Table 1:
The number of examples within the splits in collections CGFULL and CGONESHOT.

Table 2:
Performance of models evaluated in scenario 2 (refer to Figure 3), i.e., trained on CGONESHOT-k and tested on the SEEN category. The top-performing system is highlighted in bold, while the second-best system for OK-percent is underlined. CG-Ra and CG-Nu are short for CG-Random and CG-Numerical, respectively.

Table 3:
Performance of models evaluated in scenario 4 (refer to Figure 3), i.e., trained on CGONESHOT-k and tested on the UNSEEN category. CG-Ra and CG-Nu are short for CG-Random and CG-Numerical, respectively.

Table 5:
Human evaluation with the metrics of Grammar, Repetition, Hallucination, and Omission. A higher score indicates better performance. Models are trained using CGFULL-k and tested on the SEEN set.

Table 6:
Predicate decomposition performance evaluated using NMI. Models are tested under scenario 1, i.e., trained using CGFULL-k and tested on the SEEN set.

Table 7:
Model performance in few-shot settings. Systems marked with * are from previous work. To ensure a fair comparison with CBST, for this set of experiments CG-RL is trained on BART. Note that the numbers in this table cannot be directly compared to those in the previous tables, due to the inclusion of additional domains in the WebNLG 2020 training set.

Table 8:
Cherry-picked examples of input and system-generated texts. Models are trained on CGFULL-2.

T5: the atatürk monument, made of bronze, is located in istanbul, turkey, where ahmet davutoglu is the leader.
CG-RL: the atatürk monument (izmir) is located in turkey where ahmet davutoglu is the leader. the inauguration date of the atatürk monument (izmir), made of bronze, is 1932-07-27. ankara is the capital of turkey where the currency is the turkish lira. istanbul is the largest city in turkey.

Input: <SUB> spain <PRED> language <OBJ> spanish language; <SUB> ajoblanco <PRED> region <OBJ> andalusia; <SUB> andalusia <PRED> leaderName <OBJ> susana díaz; <SUB> ajoblanco <PRED> country <OBJ> spain; <SUB> spain <PRED> ethnicGroup <OBJ> spaniards
T5: ajoblanco is from andalusia where spaniards are an ethnic group.
CG-RL: ajoblanco originates from the country of spain where spaniards are an ethnic group. susana diaz is the leader of andalusia where ajoblanco is from. spanish is a language spoken in spain.

Table 10:
Model performance on the seen category. The top-performing system is highlighted in bold, while the second-best system for OK-percent is underlined.

Table 11:
Model performance on the unseen category.

Table 12:
T5-Large based model performance on the seen category. The top-performing system is highlighted in bold, while the second-best system for OK-percent is underlined.

Table 15:
T5-Small based model performance on the unseen category.