Mind the Labels: Describing Relations in Knowledge Graphs With Pretrained Models

Pretrained language models (PLMs) for data-to-text (D2T) generation can use human-readable data labels, such as column headings, keys, or relation names, to generalize to out-of-domain examples. However, such models are well known to produce semantically inaccurate outputs if these labels are ambiguous or incomplete, which is often the case in D2T datasets. In this paper, we expose this issue on the task of describing a relation between two entities. For our experiments, we collect a novel dataset for verbalizing a diverse set of 1,522 unique relations from three large-scale knowledge graphs (Wikidata, DBPedia, YAGO). We find that although PLMs for D2T generation expectedly fail on unclear cases, models trained with a large variety of relation labels are surprisingly robust in verbalizing novel, unseen relations. We argue that using data with a diverse set of clear and meaningful labels is key to training D2T generation systems capable of generalizing to novel domains.


Introduction
D2T generation systems need to accurately capture the semantics of relations between values in the data. However, data labels such as relation names (Färber et al., 2018; Haller et al., 2022), table headings (Parikh et al., 2020), or meaning representation keys (Dušek et al., 2020) may provide only superficial hints about the data semantics, or none at all if the labels are abbreviations, as in the Rotowire dataset (Wiseman et al., 2017). Learning how to properly describe the data is thus a challenge for D2T systems, typically requiring in-domain training data of sufficient quality and quantity (Dušek et al., 2019).
PLMs such as BART (Lewis et al., 2020) or T5 (Raffel et al., 2020) can quickly adapt to new domains and exhibit robustness to out-of-domain inputs. However, PLMs for D2T generation are still limited by the expressivity of the data labels. Consider Figure 1 (a): the model can use its representation of "godparent" to understand that there is an "is-a-godparent-of" relation between the entities, but it has to infer (or guess) who is the godparent of whom. Even in the less ambiguous cases (b) and (c), the model still has to correctly capture the intended semantics of the relation (e.g., "occupant" meaning "home team").

[Figure 1: Data-to-text generation models use relation labels (such as godparent, occupant, and musicBy) to describe relations between entities. However, unclear labels can lead to various lexical or semantic incoherencies in the output descriptions, such as swapping the relation direction (a) or using too literal expressions (b).]
In this paper, we investigate to what extent PLMs are able to use arbitrary labels describing relations between entities. A suitable testing ground is the task of describing (i.e., verbalizing) individual triples in a knowledge graph (KG), which can be considered a trivial case of graph-to-text (G2T) generation (Ribeiro et al., 2021;Koncel-Kedziorski et al., 2019). In this task, there is a wide range of lexical choices for the relation label (see Table 1), while the entities can be copied verbatim or with only minor morphological changes.
Current human-annotated datasets for D2T generation contain only a small number of relations and rarely contain any unseen relations in the test set (Mille et al., 2021). We collect a novel dataset REL2TEXT (Re-writing edge labels to Text),² acting as a test bench for our experiments. It contains 4,097 single triples from three large-scale KGs (Wikidata, DBPedia, and YAGO) and their crowdsourced verbalizations, covering 1,522 unique relations (§3). Each relation is equipped with a label, a textual description, and up to five triples in which the relation occurs in the KG.

relation      possible verbalization
is part of    X is part of Y.
duration      X lasted for Y.
platform      X is available on Y. / X runs on Y.
country       X was born in Y. / X is located in Y.
parent        X is the parent of Y. / Y is the parent of X.
ChEMBL        X has an id Y in the ChEMBL database.

Table 1: Examples of relation labels and their possible verbalizations, with placeholders for head (X) and tail (Y) entities. Relations can be copied verbatim (is part of), have a unique verbalization (duration), or multiple equivalent lexical choices (platform). There is also ambiguity stemming from the semantics of the entities (country) or the relation itself (parent, ChEMBL).
Using the REL2TEXT dataset, we evaluate the ability of PLMs to verbalize relations which were not present in the training set. We consider both models finetuned on other relations in our dataset and models finetuned on datasets from a related domain. We also experiment with scenarios involving few-shot finetuning, training on masked labels, and extending the labels with descriptions (§4, §5).
We find that the PLMs are quite robust in verbalizing a diverse set of relations based on their labels (achieving ~90% overall entailment probability). We show that semantically unfaithful model outputs are often caused by incomplete, ambiguous, or noisy input data. Somewhat surprisingly, we also show that longer relation descriptions do not provide substantial improvements over using short labels. However, even for data using short relation labels, the model trained on verbalizing relations can achieve results comparable to verbalizing relations using manual templates in two downstream tasks (§6).
The contributions of our work are as follows:
• We examine the ability of PLMs to describe relations unseen during training, based on the relation labels alone.

² Or simply "Relations-to-Text".

Related Work
Earlier works in natural language generation from KGs exploited domain-specific ontologies for rule-based systems (Cimiano et al., 2013; Bouayad-Agha et al., 2012; Mellish, 2007, 2006). With the advance of PLMs, structure-aware modeling and task-specific pretraining have led to remarkable progress on D2T benchmarks such as WebNLG (Gardent et al., 2017b; Ferreira et al., 2020), AGENDA (Koncel-Kedziorski et al., 2019), or E2E (Dušek et al., 2020), indicated by both automatic and human evaluation metrics (Ke et al., 2021; Guo et al., 2020; Ribeiro et al., 2020; Harkous et al., 2020). Agarwal et al. (2021) used a multi-step approach with semantic filtering and distant supervision for verbalizing the English Wikidata, covering the wide range of relations present in the KG. The authors use the approach to generate the KeLM corpus, an automatically cleaned corpus with synthetic (model-generated) verbalizations of Wikidata triple sets. We use the KeLM corpus to investigate how models trained on large-scale synthetic data differ from models trained on a small-scale human-annotated dataset (cf. §4).
Other works have tried incorporating descriptions of data labels into the model inputs. In one set of experiments, descriptions of relations from Wikidata were used instead of the labels for relation embeddings, with the conclusion that this results in worse performance on downstream tasks. Conversely, Kale and Rastogi (2020) and Lee et al. (2021) improve the performance of their systems by including schema descriptions in the input for dialogue state tracking and dialogue response generation.
There has also been research interest in verbalizing single triples as a stand-alone preprocessing step for NLP tasks. This step has been shown to improve the generalization ability of downstream models for data-to-text generation (Laha et al., 2019; Dušek, 2020, 2022; Xiang et al., 2022) and response generation in dialogue systems (Kale and Rastogi, 2020). It can also serve to make the input similar to the format used during pretraining, e.g. for natural language inference (NLI) models (Gupta et al., 2020; Neeraja et al., 2021; Dušek and Kasner, 2020). The above works employ a variety of methods to convert triples to text, ranging from simple templates and rule-based systems to prompting large PLMs. However, none of them investigates how PLMs behave when presented with novel relations.
In a work concurrent to ours, Keymanesh et al. (2022) investigate aspects of the generalization performance of PLMs on the DART dataset (Nan et al., 2021). They compare prompt-based and finetuning-based approaches to D2T generation, focusing on the ability of models to perform on difficult examples. In contrast, we focus on finetuned encoder-decoder models, which were shown by Keymanesh et al. (2022) to be more efficient for D2T generation, and we evaluate the models on clean and manually curated data.

Data
For our experiments, we need data with diverse labels and their human verbalizations. In this section, we describe how we gather RDF triples from large-scale KGs (§3.1) and collect their verbalizations through crowdsourcing (§3.2, §3.3).

Input Data
An RDF triple is a tuple t = (e_h, r, e_t), where r denotes the relation between the head entity e_h and the tail entity e_t. We retrieve triples from three open large-scale KGs encoding factual knowledge:
• Wikidata (Vrandečić and Krötzsch, 2014) is a large-scale Wikipedia-based KG created using collaborative editing. With approximately 10,000 human-created relations equipped with descriptions, it is by far the largest source of variety in relation labels.
• YAGO (Tanon et al., 2020) is a KG which builds upon factual knowledge from Wikidata, but uses a limited set of 116 pre-defined relations from schema.org (Guha et al., 2016) mapped to a subset of Wikidata relations.
• DBPedia (Lehmann et al., 2015) is a KG that maps Wikipedia infotables to a predefined ontology containing 1,355 relations, about 350 of which are accompanied by a description.
We query all KGs using their openly available endpoints to retrieve a list of relations in each KG. For each relation, we retrieve up to five triples that use this relation, and the relation description, i.e. a short explanatory text. If present, we also retrieve descriptions for the head and tail entities.
We apply a set of filtering heuristics, leaving out e.g. relations describing KG metadata or identification numbers. In this way, we collect 7,334 triples with 1,716 relations in total. For full details of the data retrieval, please refer to Appendix A.

Annotation Process
We collect human-written verbalizations for all input triples using Prolific. We built a web interface in which the human annotators are shown a single triple t and asked to describe it in a single sentence. The annotators are encouraged to re-use the entities in their original form, but they are able to change the form if necessary. The annotators can also report noisy inputs. We employed 420 annotators in total, each of whom annotated 20 examples. We set the reward to £7.29 per hour, following the platform recommendations, and we accepted all inputs which passed our built-in checks. See Appendix B for more details on the annotation process.

Postprocessing the Data
A considerable portion of the collected verbalizations contain typos and grammatical errors, misunderstood meaning of the relation, or extra information in the input. To ensure high quality of our data, we manually examined all crowdsourced examples and annotated them as OK, noisy, corrupted or containing extra information. Appendix C includes postprocessing details. In the rest of the paper, we only use the subset of our dataset with OK annotations, one per input triple (4,097 examples, 1,522 distinct relations), although we also make the remaining noisy instances available for future research.

Analysis and Evaluation
In our analysis, we are interested in the following research questions: • RQ1: Are the PLMs finetuned for D2T generation able to describe relations not present in the finetuning corpus?
• RQ2: How many training examples do the PLMs need to generate satisfactory outputs?
• RQ3: How do the PLMs behave when provided limited lexical cues about the relation?
• RQ4: Can relation descriptions help to clarify ambiguous cases and improve semantic accuracy of the outputs?
To answer these questions, we divide our REL2TEXT dataset into training and test splits (see §4.1 for details). We then use the REL2TEXT test set to evaluate a finetuned BART model (Lewis et al., 2020), a pretrained encoder-decoder transformer which is used as a backbone of many recent data-to-text models (Ke et al., 2021; Xing and Wan, 2021; Ribeiro et al., 2021). To answer RQ1, we compare the performance of BART finetuned on the REL2TEXT training set with BART finetuned on two qualitatively different D2T datasets: WEBNLG and KELM. Using REL2TEXT only, we then prepare various setups for answering RQ2, RQ3, and RQ4 (details in §4.2). We analyze the outputs of the models both automatically (§4.3) and manually (§4.4).

Experimental Setup
Datasets We experiment with the following datasets, all of which focus on verbalizing factual information from KGs and use the same triple-based input data format: REL2TEXT (ours, §3), WEBNLG, and KELM. We also exclude any relations for which the maximum semantic similarity to any KELM/WEBNLG/REL2TEXT training relation exceeds a threshold of 0.9; we set this threshold empirically in order to exclude relations which are almost synonymous but slightly lexically different. We use 90% of the remaining examples for the training set and 10% for the validation set.
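The similarity-based filtering above can be sketched as follows. The embedding model behind the similarity metric is not specified here, so the functions operate on precomputed vectors; the function names and the toy vectors are purely illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_near_synonyms(test_embs, train_embs, threshold=0.9):
    """Drop test relations whose maximum similarity to any training relation
    exceeds the threshold (i.e., near-synonymous relations)."""
    return [name for name, vec in test_embs.items()
            if max(cosine(vec, t) for t in train_embs.values()) <= threshold]

# Toy example: "place of birth" nearly duplicates a training relation, "voltage" does not.
train = {"birth place": [1.0, 0.0]}
test = {"place of birth": [0.99, 0.1], "voltage": [0.0, 1.0]}
print(filter_near_synonyms(test, train))  # ['voltage']
```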

Data Preprocessing
We split the camel case in the relation labels. For finetuning the models, we linearize the input triples by marking the triple constituents with special tokens <head>, <rel> and <tail>, which we add to the model vocabulary.
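The preprocessing above can be sketched as follows (a minimal illustration; the function names and the example entities are ours, not taken from the paper's codebase):

```python
import re

def split_camel_case(label: str) -> str:
    """Split a camel-cased relation label into words, e.g. 'musicBy' -> 'music by'."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label).lower()

def linearize(head: str, rel: str, tail: str) -> str:
    """Mark the triple constituents with the special tokens added to the vocabulary."""
    return f"<head> {head} <rel> {split_camel_case(rel)} <tail> {tail}"

print(linearize("Scarface", "musicBy", "Giorgio Moroder"))
# <head> Scarface <rel> music by <tail> Giorgio Moroder
```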
Training and Decoding Setup In a default scenario, we finetune BART-BASE for 10 epochs and select the best checkpoint using validation BLEU score, then use greedy decoding to produce outputs. We repeat each experiment with five random seeds, averaging the results. See Appendix D for details.

Compared Systems
Copy Baseline We introduce a simple baseline that outputs the triple constituents separated by spaces: "e_h r e_t".
Full Training Data We use the default setup (§4.1) on the full REL2TEXT and WEBNLG training sets. For KELM (which is about 300× larger than WEBNLG), we finetune the model for 1 epoch only. We denote the trained models full-rel2text, full-webnlg, and full-kelm, respectively.
Limited Training Data For the limited training data setup, we prepare few-shot splits from REL2TEXT as subsets containing N ∈ {25, 50, 100, 200} relations with a single example per relation.
We select examples at random, ensuring that each few-shot split is a subset of the larger splits. We finetune the fewshot-N models for 10 epochs without validation, using the last checkpoint.
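One simple way to guarantee that each few-shot split is a subset of the larger ones is to shuffle the relation list once and take prefixes. This is an illustrative sketch under that assumption, not necessarily the authors' exact procedure:

```python
import random

def nested_fewshot_splits(relations, sizes=(25, 50, 100, 200), seed=0):
    """Shuffle once and take prefixes, so each split is a subset of all larger ones."""
    rng = random.Random(seed)
    shuffled = list(relations)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes}
```

Taking prefixes of a single shuffle yields the subset property by construction, with no extra bookkeeping.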
Limited Lexical Cues In D2T datasets (with certain exceptions, cf. Gardent et al. (2017a)), unclear labels are kept in original form, implicitly assuming that the models will learn the verbalizations from the training data. We investigate how the models behave if we take this issue to the extreme, i.e. if the relation labels are not available at all. We consider three scenarios: • mask-test -We train the model on REL2TEXT in the standard training setup. For testing, we replace the relation labels in REL2TEXT with the <mask> token.
• mask-train -For training, we replace the relation labels in REL2TEXT with the <mask> token. We test the model on REL2TEXT in the standard evaluation setup.
• mask-all -We replace the relation labels in REL2TEXT with the <mask> token for both training and testing.
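The three masking scenarios can be illustrated with a single flag on the input linearization (a hypothetical sketch reusing the linearized input format from §4.1):

```python
def prepare_input(head, rel, tail, mask_relation=False):
    """Linearized triple; the relation label is optionally replaced by <mask>."""
    label = "<mask>" if mask_relation else rel
    return f"<head> {head} <rel> {label} <tail> {tail}"

# mask-test:  mask_relation=False at training time, True at test time
# mask-train: mask_relation=True at training time, False at test time
# mask-all:   mask_relation=True for both training and testing
```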
Incorporating Descriptions Our dataset contains short textual descriptions of the relations, which may be useful to disambiguate their meaning and provide additional cues to the model. We consider two scenarios: • desc-repl -We replace the relation label with its description.
• desc-cat -We concatenate the relation description with the input, separated using the special token <rel_desc>.

Automatic Evaluation
To get a high-level overview of model behavior, we evaluate the generated outputs using the GEM-metrics package (Gehrmann et al., 2021), which provides an extensive set of automatic metrics for text generation.
Lexical Similarity We first measure lexical similarity between the model outputs and human references using BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and BLEURT (Sellam et al., 2020). The first two metrics focus on n-gram overlap; the latter is a trained metric that also captures semantic similarity between the output and the reference. Although these metrics should not be used in isolation (Gehrmann et al., 2022), they give us a better overview of the output quality in combination with other metrics.
Semantic Similarity and Legibility Lexical similarity metrics focus on the surface form, which may not be telling the whole story. For example, if the relation parent denotes that e t is the parent of e h , but the entities are swapped in the generated text, the output will be incorrect, although lexical similarity metrics will be high. To get deeper insights into semantic and lexical properties of the outputs, we use NUBIA (Kane et al., 2020), which is a trained metric combining several features to measure "interchangeability" (equivalence) of two texts. The metric outputs a single score (NB) with a value between 0 and 1. We also report its individual underlying features: the semantic similarity score (SS) on a 0-5 scale, predicted by RoBERTa (Liu et al., 2019) finetuned on the STS-B benchmark (Cer et al., 2017); the contradiction (C), neutral (N), and entailment (E) probabilities from RoBERTa finetuned on the MNLI challenge from the GLUE benchmark (Wang et al., 2018); and the perplexity score (PPL) from vanilla GPT-2 (Radford et al., 2019), computed as a geometric mean of probabilities of the tokens in each step (this score is referenceless).
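The referenceless perplexity feature, described above as a geometric mean of per-token probabilities, can be computed as follows (a sketch; obtaining the per-token probabilities from GPT-2 itself is omitted):

```python
import math

def geometric_mean_prob(token_probs):
    """Geometric mean of per-token probabilities, computed in log space for stability."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

print(round(geometric_mean_prob([0.25, 1.0]), 6))  # 0.5
```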
Lexical Diversity To assess lexical diversity of the generated texts, we use several metrics used in previous work (Dušek et al., 2020;van Miltenburg et al., 2018). We measure the number of unique n-grams (U-1), conditional entropy of bi-grams (CE-2), and the mean segmental type-token ratio over segment lengths of 100 (MSTTR; Johnson, 1944). We also measure the average output length in tokens (len).
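For illustration, the number of distinct n-grams and MSTTR can be computed as follows (a simplified sketch of these standard metrics; edge-case handling may differ from the GEM-metrics implementation):

```python
def unique_ngrams(tokens, n):
    """Number of distinct n-grams in a token sequence (U-1 corresponds to n=1)."""
    return len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})

def msttr(tokens, segment_len=100):
    """Mean segmental type-token ratio: average TTR over fixed-length segments."""
    segments = [tokens[i:i + segment_len]
                for i in range(0, len(tokens) - segment_len + 1, segment_len)]
    if not segments:  # text shorter than one segment: fall back to plain TTR
        return len(set(tokens)) / len(tokens)
    return sum(len(set(s)) / segment_len for s in segments) / len(segments)
```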

Manual Error Analysis
To examine the sources of errors, we perform an in-house annotation of the model outputs. We identify four model error types based on preliminary observations: semantic errors (SEM), with a swap of the relation direction (DIR) as a special case; too literal phrasing (LIT), i.e., awkward or misleading wording; and grammar/lexical errors (LEX). We further annotate two types of input data errors: ambiguous relations (ENT) and relations with unclear labels (LBL). Examples are shown in Table 3. We select 100 random examples together with their corresponding outputs from the full-rel2text, full-webnlg, full-kelm, fewshot-100, mask-all and desc-cat models. Without revealing the output sources, we ask three expert annotators to mark all error categories that apply.


Automatic Evaluation Results
Table 2 shows automatic scores for all our models. full-rel2text is the best among the fully trained models in terms of lexical overlap metrics (which is expected, as it is trained on the most similar reference distribution), but the full-webnlg and full-kelm models are almost equal in terms of semantic consistency, achieving around 90% average entailment probability, which is on par with the copy baseline. Semantic consistency is much lower for the fewshot models (e.g., the average entailment probability is between 65% and 85%), showing that there is a certain minimum amount of data needed to achieve consistent outputs. Using more examples for training the model generally helps to decrease variance and increase performance across various metrics (cf. Figure 2).
Interestingly, the models which do not see the relations during test time (mask-test and mask-all) still achieve around 60% average entailment probability, similarly to the worst few-shot model. Although their rate of contradictions is higher than for other models, the results suggest that in many cases, the guessed relation is compatible with the correct relation.
Another interesting observation is that the mask-train model (trained not to use the labels) is able to use the labels provided at test time to improve the outputs considerably (the contradiction rate drops from 17% to 5% compared to mask-all). That short labels are both sufficient and necessary for successful verbalization is underlined by two further observations: the desc-repl model is worse than full-rel2text (although the descriptions are longer and supposedly explain the relation semantics), and the benefits of concatenating descriptions alongside the relation labels (desc-cat) are negligible, only slightly improving lexical similarity metrics (a 0.5 BLEU point gain over full-rel2text).
In terms of lexical diversity, human references use more unique n-grams, but the model outputs are very similar in other aspects. It remains to be seen if the model outputs can stay semantically consistent with diversity-focused decoding techniques such as nucleus sampling (Holtzman et al., 2020).

Error Analysis Results
Results are summarized in Figure 3; complete results are presented in Appendix F. Examples of model outputs for each error type are shown in Table 3; more examples are given in Appendix G.
The full-kelm and full-webnlg models use expressions that are too literal (LIT) in 23 and 29 cases, respectively, while the full-rel2text and desc-cat models do so in only 11 cases (5 of which are marked as LBL, i.e., with an unclear label). This suggests that the variability of our dataset helps the models apply more natural expressions, especially if the relation is understandable from its label.
There is a near-constant portion of examples where the models make a semantic error (SEM) and the input is marked as needing an extra description (LBL). The models also make relatively many semantic errors on their own, most prominently in the case of the fewshot-100 and the mask-all models.
The mask-all model made a semantic error in 78 cases, suggesting that guessing the exact relation just from the entities is difficult (although still possible in 22 cases). Moreover, the outputs of this model are fluent (only 4 LEX errors), making it hard to detect faulty cases. The case of swapping the relation direction (DIR) is surprisingly not that common. This is probably down to having only a few examples in our dataset prone to this kind of error. Notably, the results for full-rel2text and desc-cat are very similar, rendering the impact of extra descriptions negligible.

[Figure 3 caption: see Table 3 for the description of error categories and §4.2 for the models. The striped part signifies that the label of the input was marked as unclear. See Appendix F for details.]
Finally, there were only 12 out of 100 examples annotated as ENT, which suggests that the verbalization of the relation can be mostly decided irrespective of the entities in the triple.

Downstream Tasks
Given that the full-rel2text model can describe relations from their labels with high accuracy, we investigate if we can use the model to replace manually created templates in downstream tasks. We select two qualitatively different tasks, both using the idea of transforming individual input triples to simple sentences as a preprocessing step: tabular reasoning (§6.1) and zero-shot data-to-text generation (§6.2). Neeraja et al. (2021) improve upon Gupta et al.'s approach, including a better paragraph representation for which they prepare a fine-grained set of rules for individual entity categories. The rules aim to minimize the number of ungrammatical sentences and improve the reasoning abilities of the NLI model.

Tabular Reasoning
We replicate the setup of Neeraja et al. (2021) for the original (OPR) and better (BPR) paragraph representation using their public codebase. We then replace their templates with our full-rel2text model, verbalizing the triple (title, key, value). The results are summarized in Table 4.
Our preliminary manual evaluation suggests that the sentences from our model are indeed more grammatical (even compared to BPR). However, we observe that the performance is comparable across all three test sets. In line with McCoy et al. (2019), we conclude that for classification tasks such as NLI, the input content appears to be more important than the input form.

Zero-shot Data-to-Text Generation
Kasner and Dušek (2022) proposed a setup for zero-shot D2T generation in which pretrained models are used to gradually transform text into the final description. The first step of the pipeline requires transforming individual triples into text. We focus on the WebNLG dataset, for which the authors manually created 354 templates.¹³ We replicate the authors' setup using their public code, applying full-rel2text instead of the templates. The results are summarized in Table 5. We note that the pipeline using our model for preprocessing is able to achieve improvements of ~2 BLEU points, at the cost of a slightly higher omission and hallucination rate, but crucially without needing the manual effort to create templates. Cursory examination shows that sentences produced by our model are qualitatively similar to the manual templates, but more varied. Unlike the templates, our model may verbalize a relation differently depending on the context. Overall, we argue that training a PLM on verbalizing individual relations can potentially replace the manual effort of creating simple templates, which will have a notable impact for scaling similar approaches to larger datasets.

¹² Formalized using more than 250 lines of Python code: https://github.com/utahnlp/knowledge_infotabs/blob/main/scripts/preprocess/bpr.py#L120

Discussion
Based on our experiments, we can conclude that PLMs are indeed able to verbalize novel relations (RQ1). However, there is a caveat: if the relation label is ambiguous or the cues about the relation are limited (RQ3), the model will resort to guessing, and the semantic accuracy of the output descriptions may drop. A takeaway for datasets which do not follow standard naming conventions, such as the Rotowire dataset with basketball summaries (Wiseman et al., 2017), which uses abbreviations for column headers (e.g., FG3A stands for "the number of shots the player attempted beyond the arc"), is that rephrasing the labels into natural language may increase the robustness of D2T systems applied to these datasets.
We have focused on finetuned PLMs, which in our case require at least several hundreds of examples to produce satisfactory results (RQ2). However, recent research suggests that prompting large PLMs capable of in-context learning (Brown et al., 2020) may help to bring the number of required examples down close to zero (Li and Liang, 2021; Reynolds and McDonell, 2021; Schucher et al., 2022; Chia et al., 2022; Xiang et al., 2022). In this case, the models do not have the possibility to learn the correct verbalizations from the training data, which will probably make using clear and unambiguous labels even more important: an issue to investigate in future work.

¹³ Available at https://github.com/kasnerz/zeroshot-d2t-pipeline/blob/main/templates/templates-webnlg.json
We showed that improving the outputs using longer relation descriptions is not straightforward (RQ4). To achieve more notable improvements, it may be necessary to combine a more detailed specification of the relation direction, type, acceptable values, etc. with a model able to reason about this specification. A promising direction could be chain-of-thought reasoning, so far applied to tasks such as open-domain question answering or solving math word problems (Gao et al., 2022; Wei et al., 2022; Yao et al., 2022; Nye et al., 2021).
The remaining open question is how to handle input data with noisy labels. We suggest that detecting these cases and fixing them prior to generation (for example with knowledge-augmented systems or a human-in-the-loop setup) could help to improve the robustness of D2T systems in real-world scenarios.

Conclusion
We analyzed the abilities of PLMs to verbalize unseen relations in KGs using the relation labels. Based on our findings, we believe that having expressive and unambiguous data labels is a good starting point for adapting D2T systems to new domains. For the analysis, we collected the REL2TEXT dataset, which can help to replace the hand-crafted templates on downstream tasks. Future work may investigate how our findings generalize to prompt-based few-shot or zero-shot D2T generation with large PLMs.

Limitations
Our analysis is limited to verbalizing single triples, which is only a stepping stone towards full-fledged G2T generation. To generate data for entire subgraphs, other issues need to be solved first, including compositional generalization and structure-aware modeling. Nevertheless, we believe that this simplified setting allows us to distill insights which are still applicable to G2T generation in general.
The factuality of the REL2TEXT dataset is tightly related to the data in the input KGs, which may contain outdated or incorrect information, and may be influenced by our processing methods (see Appendix A for details). Models trained on our dataset should be used with caution, since they can produce harmful, imprecise, or factually incorrect statements.
We focus only on the English parts of the KGs and English datasets. In the future, our approach could be extended to a multilingual setting using multilingual PLMs and non-English parts of the KGs. For morphologically richer languages, extra effort would have to be put into correctly inflecting the entities in the generated text.

Ethics Statement
As we are aiming to develop D2T systems which can robustly generate text for multiple domains, we are building upon PLMs which are known to reflect or amplify biases found in their pretraining corpus (Bender et al., 2021). Although the purpose of our study is to minimize these biases, the outputs of our models can still contain statements which are not aligned with the input data and user needs.
We collected our training and evaluation data through the Prolific crowdsourcing platform. We ensured that all annotators received an hourly reward in line with the platform recommendations, and we paid extra attention to informing the participants about the content and purpose of our study. We also manually filtered the outputs to minimize the amount of noisy references in our dataset. See §3.2 and Appendix B for more details on the annotation process.

A Data Retrieval
DBPedia We query DBPedia through its SPARQL access point: http://dbpedia.org/sparql. We retrieve relations as objects of type rdf:Property which have a property rdfs:comment (i.e., the relation description) with language 'en'.
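The query described above might look roughly as follows (an illustrative reconstruction; the exact query is in the paper's repository):

```python
# Approximate SPARQL query for DBPedia relations with English descriptions.
# (Illustrative reconstruction, not the authors' exact query.)
DBPEDIA_RELATIONS_QUERY = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?property ?comment WHERE {
  ?property rdf:type rdf:Property ;
            rdfs:comment ?comment .
  FILTER (lang(?comment) = "en")
}
"""

# The query could be sent to http://dbpedia.org/sparql, e.g. with SPARQLWrapper:
#   from SPARQLWrapper import SPARQLWrapper, JSON
#   sparql = SPARQLWrapper("http://dbpedia.org/sparql")
#   sparql.setQuery(DBPEDIA_RELATIONS_QUERY)
#   sparql.setReturnFormat(JSON)
#   results = sparql.query().convert()
```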
YAGO We download the English Wikipedia subset of the YAGO 4 database dump from https://yago-knowledge.org/downloads/yago-4. We retrieve all objects of type rdf:Property which have a property rdfs:comment. For the entity descriptions, we parse the entity page at the YAGO website http://yago-knowledge.org/resource/.
Wikidata We first use the Wikidata SPARQL access point https://query.wikidata.org/sparql to retrieve the list of relations as objects of type wikibase:Property with wikibase:language="en", together with their English descriptions (lang(?altLabel) = "en").
Second, we query Wikidata through the LDF endpoint https://query.wikidata.org/bigdata/ldf, which is better able to handle heavy requests, to retrieve the list of triples involved in the relation.
Finally, for retrieving the entity descriptions, we use the API at https://www.wikidata.org/w/api.php.
Filtering We apply a comprehensive set of filters for removing noisy triples, including triples with entities containing meta-information ("Category:", "XMLSchema#") or URLs, entities longer than 64 characters, and relations having the string "id", "number", or "code" in the label, or "Reserved for DBpedia" in the description. As a consequence, we lose some relations, most notably about 2/3 of the relations from Wikidata describing various identifiers (we opted for this step in order to maintain data diversity). If KGs contain relations with identical labels, we prefer the relations from DBPedia and YAGO (which have substantially fewer relations) to Wikidata relations.
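The filtering heuristics can be sketched as a single predicate. The markers come from the list above; the function itself and the example triples are illustrative:

```python
def is_noisy(head, tail, rel_label, rel_desc=""):
    """Illustrative sketch of the filtering heuristics listed above."""
    meta_markers = ("Category:", "XMLSchema#", "http://", "https://")
    if any(m in e for e in (head, tail) for m in meta_markers):
        return True                      # KG meta-information or URLs in entities
    if any(len(e) > 64 for e in (head, tail)):
        return True                      # overly long entities
    if any(w in rel_label.lower() for w in ("id", "number", "code")):
        return True                      # identifier-like relations
    if "Reserved for DBpedia" in rel_desc:
        return True
    return False

print(is_noisy("Karl Marx", "Jenny von Westphalen", "spouse"))  # False
print(is_noisy("Douglas Adams", "113230702", "VIAF id"))        # True
```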
Missing Units Our dataset mostly does not contain units for quantities. Although the units are usually present in the KGs, they are not part of the quantity itself: they may be either connected to the quantity with another property, or described informally in the relation label. Since our focus was on the relation labels, we decided not to put additional effort into retrieving and processing the units. In effect, we consider verbalizations that do not use the units (e.g., (Bommersheim substation, voltage, 20000) → "Bommersheim substation has a voltage of 20000.") as correct.
Factual Correctness A certain part of the data is factually incorrect, either because there was an error in the knowledge graph (e.g., (Catalans, population place, Italy)) or because there was a processing error (e.g., (Child Language Teaching and Therapy, final publication year, -1985)). Since our focus was not on judging the factuality of the inputs (which is a difficult problem in its own right), we decided to keep these examples in the dataset and consider the examples semantically consistent with the input triple as correct.

Other Notes
• All the data was retrieved in February 2022, except for YAGO, where we used the newest available dump (2020-02-24).
• Although we retrieved the entity descriptions wherever possible and we include them in our dataset, we decided not to use them in our experiments.
• The Python code for retrieving the data is available in the paper repository.

B Crowdsourcing Details
We built a web interface for collecting verbalizations of the triples. Figure 4 shows the introductory instructions displayed to the participants and Figure 5 shows the annotation interface. We hired annotators on the Prolific crowdsourcing platform https://app.prolific.co/. We required that the annotators be native speakers of English. After completing an introductory example, the annotators were given 20 randomly selected triples, presented in sequential order, and were asked to write a short, single-sentence description of each triple. To make the annotation easier, hovering the mouse over the relation revealed its description (this also applied to the entities, if a description was present).
The annotators could also click on an entity to insert it into the text, which motivated them to insert the entities in their original form. Once an entity appeared in the text (either typed or inserted), it was highlighted. We required that both entities (and at least two extra characters) be present in the text before proceeding to the next step. Thanks to this requirement, approximately 98.6% of the sentences in our dataset can be delexicalized using exact string matching. The users also had the option to modify an entity name (e.g., to make its form more natural), which would then be recorded as a new ground-truth input. However, this option was used only sparingly.
In total, we collected 8,265 responses for 7,334 examples. Multiple responses for some examples are a consequence of random selection combined with sessions running in parallel. In the final dataset used in our experiments, we selected at most one correct answer for each example (see Appendix C).
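The selection of at most one correct answer per example could be sketched as below; the field names (`triple`, `label`, `text`) are illustrative, not the actual dataset schema.

```python
def select_one_per_triple(responses: list[dict]) -> dict:
    """Keep the first response marked OK for each input triple."""
    selected = {}
    for r in responses:
        if r["label"] == "OK" and r["triple"] not in selected:
            selected[r["triple"]] = r["text"]
    return selected
```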

C Postprocessing the Dataset
Two of the paper authors manually postprocessed the dataset. We used the following criteria for marking the responses:
• OK - The sentence is fluent and semantically consistent with the input.
• Noisy - The sentence contains a minor typographical or grammatical error, or the sentence sounds "awkward" (e.g., the relation label is used too literally).
• Corrupted - The sentence is semantically incorrect, contains a major typographical or grammatical error, or generally does not make sense.
• Extra information - The sentence is correct, but contains extra information about the entities which cannot be derived from the triple itself (e.g., the country of origin of the person, found in the entity description).

Figure 6 shows the distribution of responses in our dataset. We marked 4,469 (54.1%) responses as OK, 1,314 (15.9%) as Noisy, 2,246 (27.2%) as Corrupted, and 235 (2.8%) as Extra information.
Because our priority was to have clean data for evaluation, we decided to use only the OK part of our dataset in our experiments. We only use one example for each input triple, which gives 4,097 instances. However, since we believe that the human outputs can also be an interesting research target, e.g., for investigating the feasibility of verbalizing the input data, we release all the annotations for future investigations.

We use BART-base (Lewis et al., 2020), as implemented in the Transformers library (Wolf et al., 2020), with 140M parameters as a basis for all our models.
Training We train the models for 10 epochs on a single GeForce RTX 3090 GPU with 24 GB RAM, except for the full-kelm model, which we train for 1 epoch. Training times were around 15 minutes for the datasets based on REL2TEXT, 2 hours for full-webnlg, and 3 days for full-kelm. We use greedy decoding in all our experiments.

G Examples
Examples of outputs of the models selected for manual evaluation are shown in Table 8.

Relation description: The series to which this episode or season belongs.
full-rel2text: The Fight Before Christmas is part of The Simpsons.
full-webnlg: The Simpsons is the series that has The Fight Before Christmas as one of its episodes.
full-kelm: The Fight Before Christmas is part of the Simpsons series.
desc-cat: The Fight Before Christmas is a part of The Simpsons.
fewshot-100: The Fight Before Christmas was part of series of The Simpsons.
mask-all: The Fight Before Christmas was a parody of The Simpsons.

Relation description: Construction material (e.g. concrete, steel, iron, stone, brick, wood).
full-rel2text: Landwasser Viaduct was built using Limestone.
full-webnlg: The construction material of the Landwasser Viaduct is Limestone.
full-kelm: The Landwasser Viaduct is made of limestone.

Relation description: The county where the thing is located.
full-rel2text: Sharnbrook Academy is in Bedfordshire.
full-webnlg: Sharnbrook Academy is in the county of Bedfordshire.
full-kelm: Sharnbrook Academy is located in Bedfordshire.
desc-cat: Sharnbrook Academy is in Bedfordshire.
fewshot-100: Sharnbrook Academy is in Bedfordshire.
mask-all: Sharnbrook Academy is in Bedfordshire.

Relation description: some sort of hardware architecture or software framework, that allows this software to run
full-rel2text: Loco-Motion is run on Tomy Tutor.
full-webnlg: Tomy Tutor is the computing platform for Loco-Motion.
full-kelm: Loco-Motion is available for Tomy Tutor.
desc-cat: Loco-Motion runs on Tomy Tutor.
fewshot-100: Loco-Motion is a computing platform for Tomy Tutor.
mask-all: Loco-Motion was inspired by Tomy Tutor.