Few-shot Knowledge Graph-to-Text Generation with Pretrained Language Models

This paper studies how to automatically generate a natural language text that describes the facts in a knowledge graph (KG). Considering the few-shot setting, we leverage the excellent capacities of pretrained language models (PLMs) in language understanding and generation. We make three major technical contributions, namely representation alignment for bridging the semantic gap between KG encodings and PLMs, relation-biased KG linearization for deriving better input representations, and multi-task learning for learning the correspondence between KG and text. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our model on the KG-to-text generation task. In particular, our model outperforms all comparison methods in both fully-supervised and few-shot settings. Our code and datasets are available at https://github.com/RUCAIBox/Few-Shot-KG2Text.


Introduction
Knowledge graphs (KGs), such as Wikidata and DBpedia, are essential for many natural language processing (NLP) applications (Ji et al., 2020). To understand the structured information in a KG, the task of KG-to-text generation has been proposed to automatically generate a descriptive text for a given knowledge graph (Koncel-Kedziorski et al., 2019; Ribeiro et al., 2020a). Figure 1 illustrates a KG with the corresponding descriptive text, in which the nodes (e.g., Stan Lee and Iron Man) represent entities and the edges (e.g., creator and alias) describe the relations between connected entities.
Figure 1: A knowledge graph (subgraph) with its descriptive text. The underlined words represent the context keywords about entities.
In recent years, with the help of crowdsourcing platforms and information extraction (IE) systems, large-scale labelled pairs of a KG and its descriptive text have been created, such as WikiBio (Lebret et al., 2016) and the WebNLG Challenge (Gardent et al., 2017). Based on these datasets, data-driven models have shown impressive capabilities to produce informative and fluent text for a given KG (Moryossef et al., 2019). However, due to the great expense of the annotation process, it is not always feasible to generate large-scale labelled datasets for a variety of domains in practice. Motivated by this, we propose to study the task of few-shot KG-to-text generation, which aims to produce satisfactory output text given only a handful of (several hundred) labelled instances.
To fulfil this task, we need to fully understand the complicated semantic relations between entities from various domains, which is challenging with limited labelled data. Our solution is inspired by the excellent few-shot capabilities of pretrained language models (PLMs) on language understanding and generation tasks (Brown et al., 2020; Chen et al., 2020; Li et al., 2021a). Pretrained on large-scale corpora, PLMs encode vast amounts of world knowledge into their parameters (Li et al., 2021b), which is potentially beneficial for understanding and describing the KG facts in our task.
However, applying PLMs to few-shot KG-to-text generation still faces two challenges. First, PLMs are usually pretrained on natural language text, while the KG inputs for our task are structured graphs. This semantic gap makes it difficult to effectively inject KG representations into PLMs, especially with limited labelled instances. Second, unlike many other text generation tasks, KG-to-text generation requires faithful generation based on the understanding of KG facts. It needs to learn an accurate semantic correspondence between input KG and output text, which is more difficult in few-shot settings.
To address the above issues, in this paper, we propose a few-shot KG-to-text generation model based on PLMs. There are three major technical contributions in our model. First, in order to bridge the semantic gap, we enforce the representation alignment by learning the correspondence between KG representations (encoded by graph neural networks) and PLM-based entity representations. Second, to feed KG into PLMs, we propose a relation-biased breadth-first search (RBFS) strategy to linearize KG into a well-planned entity sequence. Finally, we jointly train the primary text generation task and an auxiliary KG reconstruction task under the framework of multi-task learning. This step further enhances the semantic correspondence between input KG and output text, based on which our model can generate faithful text about KG.
To the best of our knowledge, this is the first study to investigate PLMs for few-shot KG-to-text generation. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our few-shot KG-to-text generation model.

Related Work
In this work, we mainly focus on generating text from knowledge graphs using PLMs.
KG-to-Text Generation. Early works mainly centered around statistical methods, applying grammar rules to generate text (Konstas and Lapata, 2013; Flanigan et al., 2016). Recently, neural approaches have been proposed to generate text from linearized KG triples (Gardent et al., 2017). Other work presented an unsupervised training method that iteratively back-translates between the text and graph data. Different from them, we explore how to utilize large PLMs for few-shot KG-to-text generation.
Pretrained Language Model. Recent years have witnessed prominent achievements of PLMs in NLP tasks (Devlin et al., 2019; Radford et al., 2019). Pretrained on massive corpora, pretrained models showcase unprecedented generalization ability to solve related downstream tasks (Li et al., 2021b). However, most existing PLMs were conditioned on text data (Radford et al., 2019; Lewis et al., 2020), lacking consideration of structured data input. Ribeiro et al. (2020b) proposed to utilize PLMs for KG-to-text generation by randomly linearizing the graph into a sequence of triples. Yet these methods do not explicitly model the structural relations of the KG, which is critical for generating faithful text. Our work aims to consider the KG structure and bridge the semantic gap between KG encodings and PLMs.

Problem Formulation
KG-to-text generation (Ribeiro et al., 2020a) aims to automatically generate a natural language text that describes the facts in KG.
Formally, the input KG consists of a set of triples, denoted as $\mathcal{G} = \{\langle e, r, e' \rangle \mid e, e' \in \mathcal{E}, r \in \mathcal{R}\}$, where $\mathcal{E}$ and $\mathcal{R}$ denote the entity set and relation set, respectively. A triple $\langle e, r, e' \rangle$ denotes the fact that relation $r$ exists between head entity $e$ and tail entity $e'$. Note that the input KG is a small and compact subgraph extracted from large-scale knowledge graphs (e.g., DBpedia). Following Koncel-Kedziorski et al. (2019), a text describing the input KG is usually available in this task. Let $\mathcal{V}$ denote the vocabulary. The target is to generate a natural language text $Y = \langle w_1, \ldots, w_j, \ldots, w_T \rangle$ ($w_j \in \mathcal{V}$) that represents the correct and concise semantics of entities and their relations in the given knowledge graph. The text contains a set of entity mentions $\mathcal{M} = \{m_e \mid m_e = \langle e, s_e, o_e \rangle, e \in \mathcal{E}\}$, where $e$ is the target entity, and $s_e$ and $o_e$ are the start and end indices of this mention in text $Y$, respectively. In other words, $\langle w_{s_e}, \ldots, w_{o_e} \rangle$ specifically corresponds to entity $e$. For entities with multiple mentions in the text, we only keep the first mention of each entity in $\mathcal{M}$. By replacing each word of the mentions with the token "[MASK]", we obtain a masked text, denoted as $Y_{[mask]}$, which is also taken as input for representation alignment in Section 4.1.
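To make the notation concrete, the following minimal Python sketch builds a masked text from a token list and a set of mention spans. The names `Mention` and `mask_mentions` and the toy sentence are illustrative assumptions rather than part of the released code, and indices are 0-based here for convenience.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


@dataclass
class Mention:
    entity: str  # target entity e
    start: int   # s_e: index of the first mention token in Y (0-based here)
    end: int     # o_e: index of the last mention token in Y (inclusive)


def mask_mentions(tokens: List[str], mentions: List[Mention]) -> List[str]:
    """Replace every token inside a mention span with "[MASK]"."""
    masked = list(tokens)
    for mention in mentions:
        for j in range(mention.start, mention.end + 1):
            masked[j] = "[MASK]"
    return masked


# Toy example, invented for illustration (loosely following Figure 1).
kg: List[Triple] = [("Iron Man", "creator", "Stan Lee")]
text = "Iron Man was created by Stan Lee".split()
mentions = [Mention("Iron Man", 0, 1), Mention("Stan Lee", 5, 6)]
masked_text = mask_mentions(text, mentions)
```

Only the first mention of each entity would be listed in `mentions`, so later repetitions of an entity name stay unmasked.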
In practice, it is difficult to collect massive pairs of a KG and its descriptive text for training. In this paper, we study the task of few-shot KG-to-text generation with a handful of training instances (e.g., 200 instances) based on a given PLM (e.g., GPT-2).
Figure 2: Overview of our proposed model. "RA" and "BP" denote representation alignment and back propagation, respectively. We organize the PLM into lower layers and higher layers. The former provides PLM-based entity representations for alignment with KG encodings, and the latter acts as a decoder for generating text and reconstructing KG facts. After representation alignment, KG embeddings can be directly fed into the higher layers of PLMs for generating text.

Approach
For our task, two major challenges are how to learn effective input representations and capture the semantic correspondence between KG and text. To address the two challenges, we propose three major technical contributions, namely representation alignment between KG encodings and PLMs, relation-biased BFS strategy for KG linearization, and multi-task learning with KG reconstruction. Figure 2 presents an illustrative overview of our model. Next we will describe each part in detail.

Representation Alignment
Unlike previous works (Ribeiro et al., 2020b) that directly transform the KG into a text sequence, we employ a graph neural network (GNN) as the knowledge graph encoder to explicitly encode entity relations in the KG. Based on the input KG, the GNN produces a set of entity embeddings, which can be regarded as the input word embeddings of the PLM for generating text. However, the GNN-based entity embeddings and the PLM-based word (entity) embeddings come from two distinct semantic spaces. To bridge this semantic gap, we propose a representation alignment method that aligns the GNN-based and PLM-based entity embeddings.
KG Encoder. The GNN-based KG encoder aims to generate entity embeddings for the KG. Let $v_e \in \mathbb{R}^{d_E}$ denote the entity embedding for a general entity $e$ in the KG, where $d_E$ is the embedding size. In our work, the entity embeddings are shared across different KGs and initialized with pretrained KG embeddings (Yang et al., 2015). We apply R-GCN (Schlichtkrull et al., 2018) to generate entity embeddings by leveraging multi-relational information in the KG. Then, the embedding of entity $e$ at the $(l+1)$-th layer of R-GCN can be computed as:
$$v_e^{(l+1)} = \sigma\Big(\sum_{r \in \mathcal{R}} \sum_{e' \in \mathcal{N}_e^r} W_r^{(l)} v_{e'}^{(l)} + W_0^{(l)} v_e^{(l)}\Big), \tag{1}$$
where $W_0^{(l)}$ and $W_r^{(l)}$ are trainable matrices, and $\mathcal{N}_e^r = \{e' \mid \langle e, r, e' \rangle \in \mathcal{G} \text{ or } \langle e', r, e \rangle \in \mathcal{G}\}$ denotes the set of neighbors of entity $e$ under relation $r$. Finally, after stacking $L$ layers, the output entity embedding $v_e^{(L)}$ from the last R-GCN layer is used as the final entity embedding $\tilde{v}_e$.
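As an illustration, one relational graph convolution update of the kind used by R-GCN can be sketched in plain Python. This is a simplified sketch and not the authors' implementation: it assumes sum aggregation without normalization, a ReLU activation, and edges that pass messages in both directions. The names `rgcn_layer` and `matvec` and the toy inputs are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Triple = Tuple[str, str, str]  # (head, relation, tail)


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rgcn_layer(
    embeddings: Dict[str, Vector],
    triples: List[Triple],
    w_rel: Dict[str, Matrix],
    w_self: Matrix,
) -> Dict[str, Vector]:
    """One update: self term plus relation-specific messages from neighbours."""
    updated = {}
    for entity, vec in embeddings.items():
        total = matvec(w_self, vec)  # self-connection term
        for head, rel, tail in triples:
            # A neighbour under relation `rel`, in either edge direction.
            if head == entity:
                neighbour = tail
            elif tail == entity:
                neighbour = head
            else:
                continue
            message = matvec(w_rel[rel], embeddings[neighbour])
            total = [a + b for a, b in zip(total, message)]
        updated[entity] = [max(0.0, x) for x in total]  # ReLU (assumed)
    return updated


# Toy example with identity weights, invented for illustration.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_embeddings = {"Iron Man": [1.0, 0.0], "Stan Lee": [0.0, 1.0]}
toy_triples = [("Iron Man", "creator", "Stan Lee")]
toy_output = rgcn_layer(toy_embeddings, toy_triples, {"creator": identity}, identity)
```

Stacking the layer means feeding `toy_output` back in as `embeddings`, with a separate set of weight matrices per layer.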
Note that, we represent an entity as a set of nodes. For instance, the entity Iron Man in Figure 1 will be represented by two nodes: one for the token Iron and the other for the token Man. This would enhance the generalization ability of KG encoder on unseen entities, since it learns entity embeddings at the token level.
Text Encoder. To obtain the PLM-based entity embeddings, we feed the masked text $Y_{[mask]}$ into the text encoder, i.e., the lower layers of the PLM. As shown in Figure 1, compared with short entity mentions, the masked text contains rich context information about entities. Therefore, similar to a masked language model (Devlin et al., 2019), the embeddings of the masked text can be computed as:
$$\langle \hat{v}_{w_1}, \ldots, \hat{v}_{w_T} \rangle = \text{Lower-Layers}(Y_{[mask]}), \tag{2}$$
where the entity mention $m_e$ corresponds to the embedding sequence $\langle \hat{v}_{w_{s_e}}, \ldots, \hat{v}_{w_{o_e}} \rangle$ and the PLM-based entity embedding $\hat{v}_e$ can be computed by an average pooling over this embedding sequence.
To bridge the semantic gap, we model the representation alignment by minimizing the Euclidean distance in semantic space between the GNN-based and PLM-based entity embeddings as:
$$\mathcal{L}_{RA} = \sum_{e \in \mathcal{E}} \lVert \tilde{v}_e - \hat{v}_e \rVert_2, \tag{3}$$
where $\tilde{v}_e$ and $\hat{v}_e$ are the GNN-based and PLM-based entity embeddings, respectively. With representation alignment, the GNN-based entity embeddings can be aligned with the PLM-based entity embeddings in semantic space, which enables us to effectively inject KG representations into the PLM for improving generation quality.
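The alignment objective can be illustrated with a short Python sketch that average-pools the token embeddings of each mention and sums the Euclidean distances to the GNN-based embeddings. The function names, the 0-based inclusive spans and the toy vectors are assumptions made for illustration; in practice the embeddings would come from the KG encoder and the lower PLM layers.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def alignment_loss(
    gnn_embeddings: Dict[str, Vector],
    token_embeddings: List[Vector],
    spans: Dict[str, Tuple[int, int]],
) -> float:
    """Sum of Euclidean distances between GNN and pooled PLM entity embeddings."""
    loss = 0.0
    for entity, (start, end) in spans.items():
        plm_embedding = mean_pool(token_embeddings[start:end + 1])
        loss += math.dist(gnn_embeddings[entity], plm_embedding)
    return loss


# Toy values, invented for illustration.
toy_tokens = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]]
toy_loss = alignment_loss({"e": [1.0, 3.0]}, toy_tokens, {"e": (0, 1)})
```

In the toy call, the mention covers the first two tokens, whose mean is `[1.0, 0.0]`, so the distance to `[1.0, 3.0]` is 3.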

Knowledge Graph Linearization
To feed the KG into the decoder (i.e., the higher layers of the PLM), we need to linearize the KG into an entity sequence. Previous work (Ribeiro et al., 2020b) usually relies on random or pre-defined rules, which are not flexible enough to model KG structures. Here, we propose to utilize a breadth-first search (BFS) strategy to traverse the KG. BFS, a graph traversal algorithm, starts at the root node and explores all the nodes at the present layer before moving on to the nodes at the next depth layer (footnote 1). Here, we assume that nodes at the same layer potentially express relevant semantics and should be placed in close positions of the entity sequence.
Furthermore, we observe that some relations are often lexicalized before others, e.g., the nationality of a person often precedes the birthplace in descriptive text. Considering such relation priority, in this paper, we propose a relation-biased breadth-first search (RBFS) strategy to traverse and linearize the KG into an entity sequence. Specifically, we first compute an RBFS weight $\alpha_e$ for each entity $e$ based on its relations (Eq. 4), using the relation matrix $W_r^{(L)}$ from Eq. 1. Then, for two sibling entities $e$ and $e'$ at the same layer, we traverse $e$ before $e'$ if $\alpha_e$ is greater than $\alpha_{e'}$, and vice versa. Finally, through RBFS, we obtain a linearized entity sequence that is taken as input of the decoder for text generation.
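The traversal order can be sketched as a breadth-first search in which siblings are visited in decreasing order of their weight. In this hedged Python sketch the weights are simply given as input, siblings are taken to be the children of the same parent, and the graph, entity names and weight values are invented for illustration.

```python
from collections import deque
from typing import Dict, List


def rbfs_linearize(
    children: Dict[str, List[str]],
    root: str,
    alpha: Dict[str, float],
) -> List[str]:
    """Breadth-first traversal that visits higher-weight siblings first."""
    order: List[str] = []
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Rank the unvisited children of this node by weight, largest first.
        siblings = [c for c in children.get(node, []) if c not in seen]
        siblings.sort(key=lambda c: alpha[c], reverse=True)
        for child in siblings:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


# Toy graph and weights, invented for illustration.
toy_children = {
    "Iron Man": ["Stan Lee", "Tony Stark"],
    "Stan Lee": ["American"],
}
toy_alpha = {"Stan Lee": 0.2, "Tony Stark": 0.9, "American": 0.5}
toy_sequence = rbfs_linearize(toy_children, "Iron Man", toy_alpha)
```

Because the queue is first-in first-out, every entity of one layer is emitted before any entity of the next layer, and the weights only reorder entities within a layer.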

KG-enhanced Multi-task Learning
After obtaining the linearized entity sequence, we next take it as input and perform text generation. Different from other text generation tasks, KG-to-text generation aims to generate text reflecting the concise facts in the KG. Inspired by prior work, we incorporate an auxiliary KG reconstruction task to reconstruct the facts in the KG for learning the semantic correspondence between text and KG.
1 https://en.wikipedia.org/wiki/Breadth-first_search
Text Generation. The text generation task is performed upon the higher layers of the PLM. The objective is to maximize the likelihood of the reference text, which is equivalent to minimizing the negative log-likelihood:
$$\mathcal{L}_{LM} = -\sum_{j=1}^{T} \log p_{gen}(w_j \mid w_1, \ldots, w_{j-1}; \mathcal{G}), \tag{5}$$
where $p_{gen}$ is the generative probability from the PLM. Besides, in KG-to-text generation, some tokens in the descriptive text correspond to KG entities, as shown in Figure 1. The ability to copy entities from the KG would enrich the generated text content, which can be achieved by the pointer generator (See et al., 2017). By feeding the hidden states of the PLM and the token embedding, the copy probability $p^j_{copy}$ of the $j$-th token $w_j$ can be computed as:
$$p^j_{copy} = \sigma(W_1 s_j + W_2 v_{w_j} + b_{copy}), \tag{6}$$
where $W_1$, $W_2$, and $b_{copy}$ are trainable parameters, $v_{w_j}$ is the embedding of token $w_j$, and $s_j$ is the $j$-th hidden state from the top layer of the PLM. Then, we explicitly "teach" our model how to switch between generation and copy via the copy loss:
$$\mathcal{L}_{PG} = \sum_{w_j \in \mathcal{V}} p^j_{copy} + \sum_{w_k \in \mathcal{E}} (1 - p^k_{copy}). \tag{7}$$
The intuition is to minimize the copy probability $p^j_{copy}$ of a token $w_j$ generated from the vocabulary and to maximize the copy probability $p^k_{copy}$ of a token $w_k$ copied from KG entities.
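The two objectives in this paragraph can be illustrated numerically. The Python sketch below computes a negative log-likelihood from per-token probabilities and one copy loss that matches the stated intuition (penalize copy probability on vocabulary tokens, reward it on entity tokens). The exact form of the copy loss is an assumption, and the function names and numbers are invented.

```python
import math
from typing import List


def negative_log_likelihood(token_probs: List[float]) -> float:
    """Sum of -log p over the probabilities assigned to the reference tokens."""
    return -sum(math.log(p) for p in token_probs)


def copy_loss(copy_probs: List[float], from_kg: List[bool]) -> float:
    """Assumed form: p for vocabulary tokens, (1 - p) for tokens copied from the KG."""
    return sum(
        (1.0 - p) if is_entity_token else p
        for p, is_entity_token in zip(copy_probs, from_kg)
    )


# Toy values, invented for illustration.
toy_nll = negative_log_likelihood([0.5, 0.25])
toy_copy = copy_loss([0.1, 0.8], [False, True])
```

Both losses reach zero only in the ideal case: every reference token has probability 1, every vocabulary token has copy probability 0, and every entity token has copy probability 1.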
KG Reconstruction. Following Song et al. (2020), we formalize the KG reconstruction task as predicting the relations between any two entities. In detail, given a head entity $e$ and a tail entity $e'$ in the generated text, we can obtain the hidden states of their mentions from the top layer of the decoder, i.e., $\langle s_{s_e}, \ldots, s_{o_e} \rangle$ and $\langle s_{s_{e'}}, \ldots, s_{o_{e'}} \rangle$. Then, the entity hidden states $h_e$ and $t_{e'}$ can be computed by an average pooling over their mention hidden states. The probability for a relation $r$ is calculated as:
$$p(r \mid e, e') = \text{softmax}(W_3 [h_e; t_{e'}] + b_2), \tag{8}$$
where $W_3$ and $b_2$ are trainable parameters. The loss for reconstructing the KG is also defined as the negative log-likelihood of all target triples in the KG:
$$\mathcal{L}_{GR} = -\sum_{\langle e, r, e' \rangle \in \mathcal{G}} \log p(r \mid e, e'). \tag{9}$$
By incorporating the KG reconstruction task, our model is able to capture the semantic correspondence between input KG and output text, which further improves the faithfulness of the generated text.
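A relation classifier of this kind can be sketched in a few lines of Python: pool the mention hidden states of the head and tail entities, combine them, apply a linear layer and normalize with a softmax. Combining the two pooled vectors by concatenation is an assumption of this sketch, and the function names, weights and hidden states are invented for illustration.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def relation_probs(
    head_states: List[Vector],
    tail_states: List[Vector],
    weight: List[Vector],
    bias: Vector,
) -> Vector:
    """Distribution over relations for one (head, tail) entity pair."""
    features = mean_pool(head_states) + mean_pool(tail_states)  # concatenation (assumed)
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)


# Toy example with two relations and 2-dimensional hidden states.
toy_weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
toy_probs = relation_probs([[2.0, 0.0]], [[0.0, 0.0]], toy_weight, [0.0, 0.0])
toy_triple_loss = -math.log(toy_probs[0])  # loss if relation 0 is the target
```

Summing `toy_triple_loss` over all target triples gives the reconstruction loss for one training instance.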
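Finally, the total training loss consists of the text generation loss $\mathcal{L}_{LM}$ (Eq. 5), copy loss $\mathcal{L}_{PG}$ (Eq. 7), representation alignment loss $\mathcal{L}_{RA}$ (Eq. 3) and KG reconstruction loss $\mathcal{L}_{GR}$ (Eq. 9):
$$\mathcal{L}_{total} = \mathcal{L}_{LM} + \lambda_1 \mathcal{L}_{PG} + \lambda_2 \mathcal{L}_{RA} + \lambda_3 \mathcal{L}_{GR}, \tag{10}$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are combination coefficients.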
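As a small worked example, the weighted combination can be written as a Python function. The default coefficients are the values reported in the Optimization paragraph (0.7, 0.5 and 0.5); pairing them with the copy, alignment and reconstruction losses in that order is an assumption based on the order in which the losses are listed, and the function name is hypothetical.

```python
def total_loss(
    loss_lm: float,
    loss_pg: float,
    loss_ra: float,
    loss_gr: float,
    lambda_1: float = 0.7,  # weight of the copy loss (assumed pairing)
    lambda_2: float = 0.5,  # weight of the representation alignment loss
    lambda_3: float = 0.5,  # weight of the KG reconstruction loss
) -> float:
    """Text generation loss plus weighted auxiliary losses."""
    return loss_lm + lambda_1 * loss_pg + lambda_2 * loss_ra + lambda_3 * loss_gr


# If every individual loss equals 1, the total is 1 + 0.7 + 0.5 + 0.5.
toy_total = total_loss(1.0, 1.0, 1.0, 1.0)
```

Setting one coefficient to zero removes the corresponding auxiliary loss from training, which is how the ablation variants can be expressed.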

Discussion and Learning
In this part, we present the model discussion and the model optimization.
Few-shot Learning. In few-shot KG-to-text generation, the key lies in how to bridge the semantic gap between KG and PLMs with limited data. To achieve this goal, we first utilize representation alignment in Section 4.1 to align the semantic space between KG encodings and PLMs, and then introduce a KG reconstruction task in Section 4.3 to further learn the semantic correspondence between input KG and output text. Besides, we observe that KG entities are often multi-word expressions. To deal with unseen entities in few-shot learning, we employ Byte Pair Encoding (BPE) (Sennrich et al., 2016) and a sub-word vocabulary (Radford et al., 2019) to split entity words into smaller semantic units. Our work is also empowered by the excellent few-shot capacities of PLMs with vast amounts of world knowledge learned from large-scale corpora.
Optimization. For the PLM, we employ the BART-Large model (Lewis et al., 2020). Specifically, we adopt the first 6 layers of the BART encoder as the lower layers, and the remaining 6 layers of the BART encoder and the BART decoder as the higher layers. Note that the target text and text encoder are not used at test time. In particular, the target text is only used at training time and encoded as PLM-based entity embeddings for representation alignment, while the alignment is not needed at test time. We optimize all parameters according to the total loss in Eq. 10 with the OpenAI AdamW optimizer (Loshchilov and Hutter, 2019). The learning rate, batch size, number of R-GCN layers and embedding size are set to 1e-5, 20, 2 and 1024, respectively. The weights $\lambda_1$, $\lambda_2$ and $\lambda_3$ in Eq. 10 are set to 0.7, 0.5 and 0.5, respectively, according to performance on the validation set. During inference, we apply beam search with a beam size of 8.

Experiments
In this section, we first set up the experiments, and then report the results and analysis.

Experimental Setup
Datasets. To evaluate our model on few-shot KG-to-text generation, we conduct experiments on three benchmarks, including AGENDA (Koncel-Kedziorski et al., 2019), WebNLG (Gardent et al., 2017) and GenWiki Fine. We adopt three large domains (i.e., Airport, Building and Food) for WebNLG and two large domains (i.e., Sports and Games) for GenWiki. Table 1 shows the statistics for each dataset. Each instance of these datasets contains a knowledge graph in the form of triples and a target text describing the graph. The three datasets originally provide the alignment records from entity mentions to KG entities. Take an example from the WebNLG dataset, "AGENT-1 is located in PATIENT-1": the entity mention is tagged as "AGENT-1" and the tag "AGENT-1" maps to the entity "11th_Mississippi_Infantry_Monument" in the KG. If such alignments are not available, we can utilize entity linking tools (e.g., NER packages) for preprocessing.
Baselines. We make a comparison against five KG-to-text generation models:
• GraphWriter (Koncel-Kedziorski et al., 2019) introduces a graph transformer encoder and a sequence decoder for generating text based on KG.
• CGE-LW (Ribeiro et al., 2020a) proposes a graph-to-text model by combining both global and local node aggregation strategies.
• CycleGT jointly learns two dual tasks (graph-to-text generation and text-to-graph relation classification) via cycle training.
Among these baselines, GraphWriter and CGE-LW are GNN-based generation models; CycleGT is an unsupervised model using cycle training; GPT2-Base/Large and BART-Base/Large are the most relevant comparisons, which also employ PLMs in KG-to-text generation. These baselines were trained on the whole training dataset, i.e., all KG-text pairs. Following previous few-shot work (Chen et al., 2020), we train our model in different few-shot settings with training set sizes of 50, 100, 200 and 500. All the comparison methods are optimized based on validation performance. In our model, the entity embeddings of the GNN are initialized with pretrained KG embeddings and the GNN weights are transferred from CGE-LW. We also pretrain GNN weights based on the large-scale KG, i.e., Wikipedia. Based on the pretrained entity embeddings and weights, we continue to train our model.
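The few-shot settings above can be reproduced in spirit by sampling fixed-size training subsets. The Python sketch below is one possible way to do so under stated assumptions: a fixed random seed and nested subsets (each smaller set is contained in the larger ones), neither of which is specified in the text. The function name and the placeholder data are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def few_shot_subsets(
    instances: Sequence,
    sizes: Tuple[int, ...] = (50, 100, 200, 500),
    seed: int = 0,
) -> Dict[int, List]:
    """Shuffle once with a fixed seed, then take prefixes of increasing size."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    # Sizes larger than the dataset are skipped rather than padded.
    return {k: shuffled[:k] for k in sizes if k <= len(shuffled)}


# Placeholder data: integers stand in for KG-text training pairs.
toy_subsets = few_shot_subsets(range(1000))
```

Taking prefixes of a single shuffle keeps the settings comparable, since a larger setting only adds instances to a smaller one.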
Evaluation Metrics. For performance comparison, we adopt five automatic evaluation metrics widely used by previous graph-to-text work, i.e., BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015) and CHRF++ (Popovic, 2015). Specifically, BLEU-n and ROUGE-n compute the ratios of overlapping n-grams between generated and real text, CIDEr computes the TF-IDF weights for each n-gram in generated/real text, and CHRF++ computes the F-score averaged on both character- and word-level n-grams.

Main Results
Tables 2, 3, and 4 present the fully-supervised and few-shot results of our model and other baselines. First, by combining global and local entity context, CGE-LW performs better than GraphWriter. Furthermore, with two elaborately designed dual tasks, CycleGT becomes the best non-PLM baseline, outperforming GraphWriter and CGE-LW.
Second, as the most direct comparison with our model, BART-Base/Large and T5-Base/Large perform better than the other baselines by leveraging encoded semantics in PLMs, which reveals the feasibility of utilizing PLMs for KG-to-text generation.
Finally, we observe that our model achieves the best performance in both fully-supervised and few-shot settings. Large-scale PLMs can encode world knowledge by reading a large amount of text, making it easier to recover KG facts. Given only a handful of examples, the performance of the baselines drops drastically, while the performance of our model declines only slightly. Furthermore, with only 500 labelled instances, our model improves over CGE-LW and CycleGT, and achieves the best performance in most cases. Compared to the PLM-based KG-to-text baselines, we adopt a GNN to explicitly encode KG structure and representation alignment to bridge the semantic gap between PLM and GNN. This helps produce effective semantic representations for few-shot learning. Furthermore, we incorporate an auxiliary KG reconstruction task to learn the semantic correspondence between input KGs and output text. These results indicate that our model achieves superior performance on the KG-to-text generation task in a few-shot setting.

Detailed Analysis
Next, we conduct detailed analysis experiments on our model. We only report the test results on the WebNLG dataset with 500 training instances due to similar findings on the other datasets.
Ablation Analysis. In our ablation study, we evaluate the effect of each loss $\mathcal{L}_{PG}$, $\mathcal{L}_{RA}$ and $\mathcal{L}_{GR}$ on the overall model performance. Here, we consider three variants:
• w/o PG: the variant removes the copy loss $\mathcal{L}_{PG}$.
• w/o RA: the variant removes the representation alignment loss $\mathcal{L}_{RA}$.
• w/o GR: the variant removes the KG reconstruction loss $\mathcal{L}_{GR}$.
As can be seen from Table 5, removing any of the three losses lowers the BLEU/ROUGE/CIDEr performance compared to the complete model, especially when removing $\mathcal{L}_{RA}$ or $\mathcal{L}_{GR}$. The proposed representation alignment bridges the semantic gap between PLM and GNN, which is helpful for adapting KG representations to the PLM. The KG reconstruction task learns the correspondence between KG and text, ensuring faithful generation about the KG. We also observe a small performance drop when removing $\mathcal{L}_{PG}$. This is likely because the PLM has learned some common phrase expressions about these KG facts from the large-scale pretraining corpus.
KG Linearization Analysis. In Section 4.2, we propose a novel relation-biased BFS (RBFS) strategy to linearize the input KG into an entity sequence. To verify the effectiveness of this strategy, we conduct a linearization analysis by comparing RBFS with three traversal strategies, including relation-biased depth-first search (RDFS), forest fire search (FFS) and random search (RS). Specifically, RDFS combines both DFS and the relation factor similar to RBFS, where DFS starts at the root node and explores as far as possible along each branch before backtracking (footnote 2); FFS is a randomized version of RBFS that randomly explores all the nodes at the same layer (Leskovec and Faloutsos, 2006); and RS randomly traverses all the nodes in the input KG. By re-training our model with the above three strategies, we report the comparison of BLEU results in Figure 3. It can be observed that the RBFS and FFS strategies achieve better results than the other strategies. Nodes at the same layer tend to express more relevant semantics, so searching by layer produces a more reasonable and coherent entity sequence, especially when the relations of entities are considered as in our RBFS strategy.

Table 7:
Reference: asam pedas is a food found in the region of sumatra and malay peninsula in malaysia , the capital of which is putrajaya , and whose ethnic groups include malaysian malay and malaysian chinese .
Reference: athens international airport serves the city athens in greece , greek language is spoken in greece and the leaders names in greece are alexis tsipras and nikos voutsis .
Baseline (BART-Large) Generated Text: athens in greece is led by alexis tsipras and is served by athens international airport greece speaks greek language .
Ours Generated Text: asam pedas comes from the region of sumatra and malay peninsula in malaysia , where the capital is putrajava , malaysian malay and malaysian chinese are ethnic groups .
Ours Generated Text: athens is served by athens international airport in greece , which speaks greek language . greece is led by alexis tsipras and nikos voutsis .
Human Evaluation. Following previous work in data-to-text (Chen et al., 2020), we conduct human evaluation on the generated text. We randomly sample 200 KG subgraphs along with the corresponding generated text from CGE-LW, BART-Large and our model. In order to reduce the variance caused by individual annotators, three workers were asked to score the text with respect to two aspects: factual correctness and language naturalness. The first criterion evaluates how well the generated text conveys information in the KG, by counting the number of facts in the text supported by the KG (denoted as #Supp.) and contradicting or missing from the KG (denoted as #Cont.). The second criterion evaluates whether the generated text is grammatically correct and fluent. The scoring mechanism adopts a 5-point Likert scale (Likert, 1932), ranging from 1 point ("very terrible") to 5 points ("very satisfying"). We further average the three scores from the three human judges over the 200 inputs. The results in Table 6 show that our model produces more faithful and fluent texts than previous models. In our approach, the KG reconstruction task and pointer generator enhance the awareness of KG facts and reduce the generation of incorrect facts. Also, with some learned common phrase expressions in PLMs, our model can generate natural text while keeping fidelity.
2 https://en.wikipedia.org/wiki/Depth-first_search
Qualitative Analysis. In this part, we present intuitive explanations of why our model performs well. Table 7 presents two descriptions and the corresponding generated entity sequences and texts by the BART-Large baseline and our model. As we can see, based on KG linearization, the texts generated by our model follow a reasonable content sketch similar to that of the real texts (e.g., peninsula (region)→malaysia (country)→putrajava (capital)).
Besides, the baseline model incorrectly merges and generates unfaithful facts (e.g., malaysia and sumatra) or misses facts (e.g., nikos voutsis), while our model describes all the KG facts correctly. This improvement could be attributed to the KG reconstruction task, which enables our model to learn the correspondence between the input KG facts and output text. Finally, the entity words in our generated text are enriched and connected by meaningful keywords (e.g., entity greek language and keyword speaks). The reason might be that, with the help of representation alignment, the GNN entity embeddings are aligned with the PLM word embeddings.

Conclusion
This paper presented a few-shot KG-to-text generation model based on PLMs. We make three important technical contributions, namely representation alignment for bridging the semantic gap between KG encodings and PLMs, relation-biased KG linearization for deriving better input KG representations, and multi-task learning for learning the correspondence between KG and text. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our few-shot KG-to-text generation model. As future work, we will consider adopting KG-enhanced PLMs (Zhang et al., 2019), which explicitly inject knowledge information into PLMs, for improving the task performance.