JointGT: Graph-Text Joint Representation Learning for Text Generation from Knowledge Graphs

Existing pre-trained models for knowledge-graph-to-text (KG-to-text) generation simply fine-tune text-to-text pre-trained models such as BART or T5 on KG-to-text datasets, which largely ignore the graph structure during encoding and lack elaborate pre-training tasks to explicitly model graph-text alignments. To tackle these problems, we propose a graph-text joint representation learning model called JointGT. During encoding, we devise a structure-aware semantic aggregation module which is plugged into each Transformer layer to preserve the graph structure. Furthermore, we propose three new pre-training tasks to explicitly enhance the graph-text alignment, including graph-enhanced text reconstruction, text-enhanced graph reconstruction, and graph-text alignment in the embedding space via Optimal Transport. Experiments show that JointGT obtains new state-of-the-art performance on various KG-to-text datasets.


Introduction
Knowledge-graph-to-text (KG-to-text) generation aims to generate high-quality texts which are consistent with input graphs (Gardent et al., 2017). This task requires simultaneously encoding the graph structure and content, and effectively leveraging the input graphs in the decoding process (Zhao et al., 2020). As a major natural language generation (NLG) task that connects knowledge graphs and texts, this task can further promote the applicability of knowledge graphs in more realistic NLG scenarios, such as knowledge-grounded dialogue generation (Zhou et al., 2018a) and story generation (Guan et al., 2019; Ji et al., 2020).
Due to the limited amount of graph-text parallel data, it is hard for typical neural text generation models to learn the alignments between source entities / relations and target tokens from scratch (Fu et al., 2020). Recent work resorts to constructing general-purpose pre-trained language models for KG-to-text generation. The most common and simple way is to linearize input graphs into text sequences, and directly fine-tune text-to-text Transformer-based pre-trained models like GPT (Radford et al., 2018, 2019), BART (Lewis et al., 2020) or T5 (Raffel et al., 2020) on KG-to-text datasets (Ribeiro et al., 2020a; Kale and Rastogi, 2020). Benefiting from self-supervised pre-training on large-scale unlabelled text corpora, pre-trained language models can generate high-quality texts via simple fine-tuning, and outperform other models with sophisticated structures.
Despite the superior performance of fine-tuning pre-trained models on KG-to-text datasets, we argue that building pre-trained models for KG-to-text generation still faces two major challenges: 1) Structural information loss during encoding. Most of the existing pre-trained models capture contextual information via bidirectional Transformers (Devlin et al., 2019), which include full attention connections. This model structure may neglect the structural information when encoding knowledge graphs, since the relation between each pair of input entities is not explicitly considered (Zhu et al., 2019). 2) Absence of explicit graph-text alignments. Existing work on pre-trained models for text generation commonly adopts auto-encoding or auto-regressive text reconstruction to learn text-text alignments, which encodes the corrupted text sequence and decodes the original sequence (Lewis et al., 2020; Raffel et al., 2020). Since knowledge graphs may possess more complex structures than text sequences, it is hard to explicitly learn graph-text alignments by directly using pre-training tasks based on text reconstruction.
Thus, we propose a graph-text joint representation learning framework called JointGT to deal with the above challenges. Firstly, to alleviate the structural information loss during encoding, we devise a simple structure-aware semantic aggregation module at each Transformer layer to aggregate contextual information following the graph structure. Secondly, we propose three pre-training tasks, including graph enhanced text reconstruction, text enhanced graph reconstruction, and graph-text embedding alignment, to explicitly build the connection between knowledge graphs and text sequences. The first two tasks are expected to enhance the graph-text alignment in the discrete vocabulary space, where our model is required to predict the masked information of graphs / texts based on the observed information of texts / graphs. The third task is designed to model the graph-text alignment in the continuous embedding space via Optimal Transport (Peyré and Cuturi, 2019) to match the hidden representations of graphs and texts. Our contributions are as follows:

• We propose a novel pre-trained model called JointGT for KG-to-text generation tasks. This model adopts a structure-aware semantic aggregation module to model the structure of an input graph at each Transformer layer, and utilizes three pre-training tasks to explicitly learn graph-text alignments in the discrete and continuous spaces.
• We conduct experiments on KG-to-text generation datasets including WebNLG, WebQuestions and PathQuestions. Results show that JointGT achieves new state-of-the-art performance on KG-to-text generation.

KG-to-Text Generation
Recent studies on KG-to-text generation mainly fall into three aspects: 1) Encoder modification: To alleviate the structural information loss of sequence encoders with the input of linearized graphs (Gardent et al., 2017; Trisedya et al., 2018; Moryossef et al., 2019), researchers focus on more complex encoder structures for better graph representations, such as graph neural networks (Marcheggiani and Perez-Beltrachini, 2018; Ribeiro et al., 2020b) and graph Transformers (Koncel-Kedziorski et al., 2019; Schmitt et al., 2020a). 2) Unsupervised training: Researchers devise unsupervised training objectives to jointly learn the tasks of graph-to-text and text-to-graph conversion with non-parallel graph-text data (Schmitt et al., 2020b). 3) Building pre-trained models: With the development of pre-trained NLG models such as GPT (Radford et al., 2018, 2019), BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), recent work directly fine-tunes these models on graph-to-text datasets and reports impressive performance (Ribeiro et al., 2020a; Kale and Rastogi, 2020; Chen et al., 2020b; Mager et al., 2020). Compared with the existing work on pre-trained models for KG-to-text generation, our model utilizes pre-training methods to explicitly learn graph-text alignments instead of directly fine-tuning text-to-text pre-trained models on KG-to-text datasets.

KG-Enhanced Pre-Trained Models
Another line of related studies is pre-trained models enhanced by knowledge graphs for natural language understanding (NLU). The motivation of these models is to incorporate knowledge graphs into pre-trained models to facilitate the understanding of entities and relations in natural language. Early work including ERNIE and KnowBERT (Peters et al., 2019) directly uses fixed entity embeddings based on TransE (Bordes et al., 2013) or word vectors (Mikolov et al., 2013) during pre-training. Recent work like KEPLER (Wang et al., 2021) and JAKET resorts to jointly pre-training graph-text representations. Specifically, they encode the textual descriptions of entities with pre-trained language models as entity embeddings, and jointly optimize the knowledge embedding objective and the masked language modeling objective.
In comparison, our model focuses on joint pre-training methods for knowledge graph encoding and sequence decoding in KG-to-text generation tasks, rather than on graph-text joint encoding methods for NLU tasks.

Task Definition and Model Overview
Given a knowledge graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ where $\mathcal{V} = \{e_1, e_2, \cdots, e_{|\mathcal{V}|}\}$ denotes the entity set and $\mathcal{E} = (r_{ij})_{|\mathcal{V}| \times |\mathcal{V}|}$ indicates the relations connecting the entities, and its linearized version $\mathcal{G}_{linear} = (w_1, w_2, \cdots, w_m)$ which consists of $m$ tokens, our goal is to generate a text sequence $X = (x_1, x_2, \cdots, x_n)$ which is consistent with the input graph.
Our model is built on pre-trained encoder-decoder models like BART (Lewis et al., 2020) and T5 (Raffel et al., 2020). First of all, we follow the existing work (Chen et al., 2020b) to linearize knowledge graphs in the form of triple lists (as shown in Figure 1), and devise a simple structure-aware semantic aggregation module which is plugged into each Transformer layer of the encoder to preserve the structural information of input graphs (§3.2). Then, we propose three pre-training tasks, including graph / text reconstruction in the discrete vocabulary space and graph-text matching in the continuous embedding space, which enable our model to jointly learn the representations of knowledge graphs and texts (§3.3).

Figure 2: Structure-aware semantic aggregation module at each layer of the Transformer encoder. This module contains a pooling layer to obtain the contextual semantic representations of entities ($z_i^l$) and relations ($q_{ij}^l$) from the output of the vanilla self-attention layer ($\tilde{h}_i^l$), a structure-aware self-attention layer to aggregate the entity representations ($z_i^l$) based on the graph structure, and a residual layer to fuse the contextual and structural representations ($h_i^l$).
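The triple-list linearization mentioned above can be sketched in a few lines; this is a minimal illustration, and the `[HEAD]` / `[REL]` / `[TAIL]` marker tokens are hypothetical placeholders rather than the exact special tokens used in the released implementation:

```python
def linearize_graph(triples):
    """Flatten a list of (head, relation, tail) triples into one token
    sequence.  The [HEAD]/[REL]/[TAIL] markers are illustrative; the
    actual special tokens depend on the chosen linearization scheme."""
    tokens = []
    for head, rel, tail in triples:
        tokens += ["[HEAD]"] + head.split()
        tokens += ["[REL]"] + rel.split()
        tokens += ["[TAIL]"] + tail.split()
    return tokens

graph = [("Acharya Institute of Technology", "state", "Karnataka"),
         ("Karnataka", "has to its northeast", "Telangana")]
linearized = linearize_graph(graph)
```

The resulting token sequence is what the Transformer encoder consumes; the position spans of each entity and relation inside it are what the pooling layer of §3.2 aggregates over.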
To simultaneously leverage the contextual representations from pre-trained models and preserve the structural information, we devise a structure-aware semantic aggregation module in the Transformer encoder. Assume that the input of our encoder during pre-training is the linearized graph $\mathcal{G}_{linear}$ and the corresponding text sequence $X$ (which may be corrupted or empty in some pre-training tasks); the self-attention layer in the $l$-th Transformer layer can be formulated as follows:

$$\tilde{h}_i^l = \sum_{j=1}^{m+n} \alpha_{ij}^l \left( h_j^{l-1} W_V^l \right), \quad \alpha_{ij}^l = \mathrm{softmax}_j \left( \frac{(h_i^{l-1} W_Q^l)(h_j^{l-1} W_K^l)^\top}{\sqrt{d_k}} \right) \quad (1)$$

where $W_Q^l, W_K^l, W_V^l$ are the model parameters and $d_k$ denotes the dimension of query / key / value vectors. The fully-connected attention captures rich contextual semantic relationships among the entities, relations and the tokens of text sequences, but is not sufficient to encode the structural information of input graphs. Thus, we devise a structure-aware semantic aggregation module on top of vanilla self-attention, as shown in Figure 2. First of all, we utilize a mean pooling layer to obtain the representation of each entity and relation from the output of the vanilla self-attention layer:

$$z_i^l = \underset{t \in \mathcal{P}(e_i)}{\mathrm{mean}} \, \tilde{h}_t^l, \qquad q_{ij}^l = \underset{t \in \mathcal{P}(r_{ij})}{\mathrm{mean}} \, \tilde{h}_t^l \quad (2)$$

where $\mathcal{P}(e_i)$ / $\mathcal{P}(r_{ij})$ means the set of positions occupied by $e_i$ / $r_{ij}$ in the linearized graph. Note that $q_{ij}^l$ will be set to an all-zero vector if there is no relation between $e_i$ and $e_j$. Then we update the entity representations with a structure-aware self-attention layer (Shaw et al., 2018):

$$\hat{z}_i^l = \sum_{j=1}^{|\mathcal{V}|} \beta_{ij}^l \left( z_j^l \tilde{W}_V^l + q_{ij}^l \tilde{W}_R^l \right), \quad \beta_{ij}^l = \mathrm{softmax}_j \left( \frac{(z_i^l \tilde{W}_Q^l)(z_j^l \tilde{W}_K^l + q_{ij}^l \tilde{W}_F^l)^\top}{\sqrt{d_k}} \right) \quad (3)$$

where $\tilde{W}_Q^l, \tilde{W}_K^l, \tilde{W}_V^l, \tilde{W}_R^l, \tilde{W}_F^l$ are the weight matrices in the structure-aware self-attention. This layer integrates the contextual semantic representations of entities and relations based on the graph structure, thereby injecting the structural information into the vanilla Transformer layer.
Finally, we use a residual layer to fuse the semantic and structural representations of entities, and obtain the hidden states for the following computation:

$$h_i^l = \tilde{h}_i^l + \hat{z}_j^l \cdot \mathbb{1}\left[ i \in \mathcal{P}(e_j) \right], \quad i = 1, \cdots, m+n; \; j = 1, \cdots, |\mathcal{V}| \quad (4)$$

Compared with existing structure-aware Transformer encoders (Zhu et al., 2019; Song et al., 2020) that either use the entity and relation embeddings from an external knowledge embedding model or directly learn them as model parameters, our encoder obtains the entity and relation embeddings via contextual semantic representations. This design fully employs the effective contextual representations from the existing pre-trained models while preserving the structural information, and enables our model to generalize better to new entities and relations when fine-tuned on datasets with a different knowledge graph.
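The pooling, structure-aware attention, and residual fusion steps above can be sketched in NumPy. This is a single-head, loop-based sketch for clarity only: the weight shapes, the exact placement of the relation term, and the residual fusion are simplifications of Shaw et al. (2018)-style relative attention, not the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def structure_aware_layer(h, ent_pos, rel_pos, Wq, Wk, Wv):
    """h: (m, d) token states from the vanilla self-attention layer.
    ent_pos[i]: token positions of entity e_i in the linearized graph.
    rel_pos[(i, j)]: positions of relation r_ij (absent -> zero vector).
    Weight names and shapes are illustrative, not the paper's exact ones."""
    d = h.shape[1]
    n_ent = len(ent_pos)
    # mean-pool token states into entity representations z_i
    z = np.stack([h[p].mean(axis=0) for p in ent_pos])
    # mean-pool relation spans into q_ij (all-zero when no relation)
    q = np.zeros((n_ent, n_ent, d))
    for (i, j), pos in rel_pos.items():
        q[i, j] = h[pos].mean(axis=0)
    # structure-aware self-attention: keys/values are shifted by q_ij
    scores = np.empty((n_ent, n_ent))
    for i in range(n_ent):
        for j in range(n_ent):
            scores[i, j] = (z[i] @ Wq) @ ((z[j] + q[i, j]) @ Wk) / np.sqrt(d)
    beta = softmax(scores, axis=-1)
    z_hat = np.stack([
        sum(beta[i, j] * ((z[j] + q[i, j]) @ Wv) for j in range(n_ent))
        for i in range(n_ent)
    ])
    # residual fusion: add structural entity states back at entity positions
    out = h.copy()
    for i, pos in enumerate(ent_pos):
        out[pos] = out[pos] + z_hat[i]
    return out, z_hat
```

Note that non-entity positions (relation tokens and text tokens) pass through unchanged, so the module only re-routes information among entities along graph edges.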

Pre-Training Task
Given the input graph $\mathcal{G}$ and its corresponding text sequence $X$, the goal of our pre-training is to jointly learn the graph encoder and sequence decoder to enhance graph-text alignments, which can benefit the downstream tasks of KG-to-text generation. We devise three pre-training tasks to explicitly learn graph-text alignments in both discrete and continuous spaces.

Graph Enhanced Text Reconstruction
The purpose of graph enhanced text reconstruction is to recover the masked text sequence based on the complete knowledge graph, as shown in Figure 3. Assume that $\hat{X}$ denotes the masked text sequence; we can formulate the loss function of this pre-training task as follows:

$$\mathcal{L}_{text} = -\sum_{t=1}^{n} \log P\left( x_t \mid x_{<t}, \hat{X}, \mathcal{G} \right) \quad (5)$$

To construct $\hat{X}$, we masked the entity words with a probability of 40% and other words with 20%, since entity words are more important in the task of KG-to-text generation. We also follow the existing work (Lewis et al., 2020) to merge consecutive mask tokens into one mask token to increase the difficulty of text reconstruction. This task enables our model to utilize the knowledge graph to reconstruct the corrupted text sequence, which explores the connection between them in the discrete vocabulary space.
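The masking scheme above might be implemented roughly as follows. This is an illustrative sketch: the `[MASK]` token string, the fixed seed, and the span-merging behavior are modeled after BART-style text infilling, not copied from the actual training code.

```python
import random

def mask_text(tokens, entity_spans, p_ent=0.4, p_other=0.2,
              mask="[MASK]", seed=0):
    """Mask entity tokens with probability p_ent and other tokens with
    p_other, then merge runs of consecutive mask tokens into one.
    entity_spans: set of token positions that belong to entity mentions."""
    rng = random.Random(seed)
    masked = [mask if rng.random() < (p_ent if i in entity_spans else p_other)
              else tok
              for i, tok in enumerate(tokens)]
    merged = []
    for tok in masked:
        if tok == mask and merged and merged[-1] == mask:
            continue  # collapse consecutive [MASK] tokens into one
        merged.append(tok)
    return merged
```

Merging consecutive masks means the model must also infer how many tokens a single mask stands for, which is what makes the reconstruction harder.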

Text Enhanced Graph Reconstruction
As shown in Figure 3, this pre-training task aims to recover the corrupted graph according to the information of the text sequence. Given the corrupted knowledge graph $\hat{\mathcal{G}}$ with masked entities and relations, and the complete text sequence $X$, the loss function is to recover the masked entities and relations in the linearized knowledge graph:

$$\mathcal{L}_{graph} = -\sum_{i=1}^{m} M_i \log P\left( w_i \mid \hat{\mathcal{G}}_{linear}, X \right) \quad (6)$$

where $M_i$ denotes an indicator function which equals 1 if and only if $w_i$ is masked. We empirically set the masking probability of entities / relations as 40% / 20%. This task explicitly exerts the impact of the text on the graph reconstruction, thereby guiding the encoder to focus more on the entities and relations that may appear in the text.
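The indicator-masked loss can be illustrated with a toy computation; the arrays below are small stand-ins for real decoder outputs, not values from training.

```python
import numpy as np

def masked_graph_nll(log_probs, target_ids, mask_indicator):
    """Toy illustration of the text enhanced graph reconstruction loss:
    negative log-likelihood accumulated only at positions of the
    linearized graph whose token was masked (M_i = 1)."""
    M = np.asarray(mask_indicator, dtype=bool)
    nll = -log_probs[np.arange(len(target_ids)), target_ids]
    return float(nll[M].sum())

# three graph positions, vocabulary of 5; positions 0 and 2 are masked
log_probs = np.log(np.full((3, 5), 0.2))  # uniform predictions
loss = masked_graph_nll(log_probs, [1, 3, 0], [1, 0, 1])
```

With uniform predictions over a vocabulary of 5, each masked position contributes $\log 5$, so the loss here is $2\log 5$; unmasked positions contribute nothing.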

Graph-Text Embedding Alignment
This pre-training task is devised to encourage the graph-text alignment in the embedding space. We use Optimal Transport (OT), which is commonly used in cross-domain alignment (Chen et al., 2020a), to calculate the minimum cost of transporting the graph representation from the encoder to the text representation from the decoder (and vice versa). As shown in Figure 3, the input of the encoder is the linearized knowledge graph $\mathcal{G}_{linear}$ while the input of the decoder is the text sequence $X$. Let $G_{seq} = \mathcal{V} \cup \mathcal{E} = (g_1, g_2, \cdots, g_{|\mathcal{V}|+|\mathcal{E}|})$ denote the sequence of all the entities and relations in $\mathcal{G}$. Assume that $H = (h_1, h_2, \cdots, h_m)$ indicates the final hidden states of the encoder; we can similarly acquire the contextual embedding vector of each entity and relation via mean pooling:

$$h_i^{\mathcal{G}} = \underset{t \in \mathcal{P}(g_i)}{\mathrm{mean}} \, h_t, \quad i = 1, \cdots, |\mathcal{V}| + |\mathcal{E}| \quad (7)$$

From Equation 7 we directly obtain the contextual embedding vectors $H_{\mathcal{G}} = (h_1^{\mathcal{G}}, \cdots, h_{|\mathcal{V}|+|\mathcal{E}|}^{\mathcal{G}})$ for all the entities and relations. We can also acquire the embedding vectors of $X$ from the decoder's final hidden states, which are denoted by $S = (s_1, s_2, \cdots, s_n)$.
To model the alignment between graphs and texts in the embedding space, we regard $G_{seq}$ and $X$ as two discrete distributions $\mu = \sum_{i=1}^{|\mathcal{V}|+|\mathcal{E}|} a_i \delta_{g_i}$ and $\upsilon = \sum_{j=1}^{n} b_j \delta_{x_j}$, where the weights satisfy $\sum_{i=1}^{|\mathcal{V}|+|\mathcal{E}|} a_i = \sum_{j=1}^{n} b_j = 1$, and $\delta_{g_i}$ / $\delta_{x_j}$ indicates the Dirac function centered on $g_i$ / $x_j$. Then, we utilize the OT distance between $\mu$ and $\upsilon$ as the loss function, which is defined as the solution of the following problem:

$$\mathcal{L}_{OT} = \min_{T} \sum_{i=1}^{|\mathcal{V}|+|\mathcal{E}|} \sum_{j=1}^{n} T_{ij} \, d(g_i, x_j), \quad \text{s.t.} \;\; T \mathbf{1}_n = a, \;\; T^\top \mathbf{1}_{|\mathcal{V}|+|\mathcal{E}|} = b \quad (8)$$

where $T$ denotes a transport plan, $\mathbf{1}_{|\mathcal{V}|+|\mathcal{E}|}$ / $\mathbf{1}_n$ indicates the $(|\mathcal{V}|+|\mathcal{E}|)$- / $n$-dimensional all-one vector respectively, and $d(g_i, x_j)$ is the cost function of transporting $g_i$ to $x_j$. We follow the existing work (Chen et al., 2020c) to adopt the cosine distance between the contextual embedding vectors of $g_i$ and $x_j$ as the cost function:

$$d(g_i, x_j) = 1 - \frac{h_i^{\mathcal{G}\top} s_j}{\| h_i^{\mathcal{G}} \| \, \| s_j \|} \quad (9)$$

Since the exact minimization over $T$ is computationally intractable, we utilize the IPOT algorithm (Xie et al., 2019) to approximate the OT distance and iteratively obtain the solution of $T$ (more details are provided in Appendix A). After solving $T$, $\mathcal{L}_{OT}$ can serve as an alignment loss to optimize the model parameters. This task builds the connection between the contextual embedding vectors of knowledge graphs and texts, and explicitly promotes the graph-text alignment in the continuous space.
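The IPOT approximation described above can be sketched as follows. This is a minimal NumPy version following the proximal point iterations of Xie et al. (2019) with a cosine-distance cost and uniform marginals; the hyper-parameters `beta`, `n_iter` and `k_inner` are illustrative, not the values used in the experiments.

```python
import numpy as np

def ipot(C, a, b, beta=0.5, n_iter=50, k_inner=1):
    """Inexact Proximal point method for Optimal Transport (Xie et al., 2019).
    C: (n, m) cost matrix; a, b: marginal distributions.
    Returns an approximate transport plan T."""
    n, m = C.shape
    T = np.ones((n, m))
    sigma = np.ones(m) / m
    A = np.exp(-C / beta)  # Gibbs kernel for the proximal step
    for _ in range(n_iter):
        Q = A * T
        for _ in range(k_inner):  # Sinkhorn-style scaling updates
            delta = a / (Q @ sigma)
            sigma = b / (Q.T @ delta)
        T = delta[:, None] * Q * sigma[None, :]
    return T

def ot_alignment_loss(H_graph, S_text):
    """OT alignment loss between graph-side and text-side embeddings
    with a cosine-distance cost and uniform marginals (a sketch)."""
    Hn = H_graph / np.linalg.norm(H_graph, axis=1, keepdims=True)
    Sn = S_text / np.linalg.norm(S_text, axis=1, keepdims=True)
    C = 1.0 - Hn @ Sn.T                      # cosine distance cost
    a = np.ones(len(H_graph)) / len(H_graph)
    b = np.ones(len(S_text)) / len(S_text)
    T = ipot(C, a, b)
    return float((T * C).sum())              # <T, C>, the transport cost
```

When the two embedding sets coincide, the plan concentrates near the zero-cost matches and the loss approaches zero, which is the behavior the alignment objective rewards.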

Pre-training Dataset and Implementation
We used KGTEXT (Chen et al., 2020b) as our pre-training dataset. Since our model can adapt to Transformer-based pre-trained models with the encoder-decoder framework, we chose BART (Lewis et al., 2020) and T5 (Raffel et al., 2020) as the base models in this paper, which are denoted by JointGT (BART) and JointGT (T5), respectively. The hyper-parameters of the Transformer blocks were the same as BART-base and T5-base because of the limited computational resources. We initialized our model parameters with the pre-trained checkpoints of BART-base / T5-base, except for the structure-aware semantic aggregation module, which was randomly initialized. We followed BART / T5 to use the Byte-Pair Encoding (BPE) vocabulary (Radford et al., 2019) with a size of 50,265 / the WordPiece vocabulary (Kudo and Richardson, 2018) with a size of 32,000. The batch size was 42 / 32 for JointGT (BART) / JointGT (T5). The maximum length of linearized input graphs was 600, while the maximum length of text sequences was 64. We adopted Adam (Kingma and Ba, 2015) as the optimizer and set the learning rate to 3e-5. The warmup ratio was 0.1. JointGT was pre-trained on KGTEXT for 1 epoch with the proposed pre-training tasks. Pre-training took 44 / 69 hours for JointGT (BART) / JointGT (T5) on 3 NVIDIA Quadro RTX 6000 GPUs.

Fine-Tuning Settings
We adopted WebNLG, WebQuestions and PathQuestions as the benchmark datasets during fine-tuning, and provided their statistics in Table 2.

WebNLG: This dataset aims to convert RDF triples into a textual description. We followed the existing work (Chen et al., 2020b) to use version 2.0 (Shimorina and Gardent, 2018). This dataset contains two official data splits: the traditional split (Unconstrained), which guarantees that there is no overlap of input graphs among the train / validation / test sets, and a more challenging split (Constrained), where the non-overlap constraint is applied to the triples of input graphs. We denote these two data splits as WebNLG(U) and WebNLG(C) in our paper. We followed the preprocessing steps of the existing work (Chen et al., 2020b) to replace the underlines in the entities and relations with spaces, and to split entities and relations in camel case into multiple words.

WebQuestions: This dataset (Yih et al., 2016; Talmor and Berant, 2018) is a benchmark for question generation over knowledge bases (KBQG), whose purpose is to generate natural language questions about the corresponding knowledge graphs (Serban et al., 2016). It is constructed from two question answering datasets, i.e., WebQuestionsSP (Yih et al., 2016) and ComplexWebQuestions (Talmor and Berant, 2018). These two datasets contain natural language questions, SPARQL queries and answer entities. We converted each SPARQL query to return a subgraph, and used the same preprocessing steps and data splits as the existing work (Kumar et al., 2019; Chen et al., 2020d).

PathQuestions: Similar to WebQuestions, the PathQuestions dataset is also a benchmark for KBQG, which is constructed from a question answering dataset (Zhou et al., 2018b). The difference is that the knowledge graph in PathQuestions is a 2-hop / 3-hop path between two entities. We used the same preprocessing steps and data splits as the existing work (Kumar et al., 2019; Chen et al., 2020d).
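The WebNLG label preprocessing described above (replacing underlines with spaces and splitting camel-case names) might look like the following sketch; the helper name and regex are ours, not from the released preprocessing scripts.

```python
import re

def normalize_label(label):
    """WebNLG-style label cleanup: replace underscores with spaces and
    split camelCase relation names into separate words (an illustrative
    re-implementation of the preprocessing described above)."""
    label = label.replace("_", " ")
    # insert a space before an upper-case letter that follows a lower-case one
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
```

For example, `normalize_label("Acharya_Institute_of_Technology")` yields `"Acharya Institute of Technology"`, and a camel-case relation name such as `sportsGoverningBody` becomes `"sports Governing Body"`.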
More detailed fine-tuning settings including the search space and the best assignment of hyperparameters on the downstream datasets are reported in the Appendix B.

Baselines
We chose the following two categories of models as our baselines:

Pre-Trained Models: We adopted KGPT (Chen et al., 2020b), BART (Lewis et al., 2020) and T5 (Raffel et al., 2020) as the pre-trained baselines. KGPT is a pre-trained model for KG-to-text generation, which utilizes the same pre-training dataset as our model and directly uses KG-to-text generation as the pre-training task. BART and T5, as the state-of-the-art pre-trained models for text generation, can be applied to KG-to-text generation with the input of linearized knowledge graphs and the output of text sequences (Ribeiro et al., 2020a).

Task-Specific Models without Pre-Training: We also chose the state-of-the-art task-specific models without pre-training for each dataset as our baselines, including Seq2Seq with copying or delexicalisation (Shimorina and Gardent, 2018) for WebNLG v2.0, and G2S (Chen et al., 2020d) for WebQuestions and PathQuestions.
We directly reprinted the results of baselines if they used the same datasets as ours. Otherwise, we implemented the baselines based on the codes and model parameters released by the original papers. All the results of our implemented models are reported as mean values over 5 runs.

Automatic Evaluation
We followed the existing work (Shimorina and Gardent, 2018; Chen et al., 2020d) to use BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE-L (Lin, 2004) as our automatic metrics. The main results on WebNLG, WebQuestions and PathQuestions are shown in Table 1. We can observe that JointGT based on BART / T5 outperforms vanilla BART / T5 on most of the metrics, respectively, and obtains state-of-the-art performance on all the datasets. This indicates that our method can promote graph-text alignments and further enhance the performance of the state-of-the-art pre-trained models on KG-to-text datasets.

Human Evaluation
To further evaluate the quality of generated results, we conducted human evaluation on the WebNLG(U) dataset. We followed the existing work (Ferreira et al., 2019; Ribeiro et al., 2020b) to select two criteria: fluency (whether a sentence is grammatically fluent) and adequacy (whether a sentence clearly describes the knowledge graph). We randomly sampled 100 knowledge graphs from the test set, and collected the generated results from our models and the most competitive baseline models (i.e., BART and T5). We used pairwise comparison between BART / T5 and JointGT (BART) / JointGT (T5). Specifically, for each pair of generated texts (one from JointGT and the other from the corresponding baseline, given the same input knowledge graph), three annotators were hired to label which text is better (i.e., win, lose or tie) in terms of the metrics mentioned above. Note that the two metrics were evaluated independently. Results in Table 3 show that JointGT can beat the corresponding baselines in both fluency and adequacy. Especially for adequacy, our model can significantly outperform BART / T5, which indicates that our model equipped with the structure-aware encoder and well-designed pre-training tasks can generate high-quality texts to describe knowledge graphs more clearly. To evaluate the agreement among different annotators, we calculated Fleiss' Kappa (Fleiss, 1971) for each pairwise comparison, where the results in Table 3 show moderate agreement (0.4 ≤ κ ≤ 0.6).
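For reference, Fleiss' kappa over the three-annotator pairwise comparisons can be computed as follows; this is an illustrative re-implementation of the standard formula, not the evaluation script used in the experiments.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for N items, each rated by the same number of
    annotators; ratings[i] is the list of labels item i received
    (e.g. 'win' / 'lose' / 'tie')."""
    N = len(ratings)
    n = len(ratings[0])          # annotators per item
    totals = Counter()
    P = []
    for r in ratings:
        counts = Counter(r)
        totals.update(counts)
        # per-item agreement P_i
        P.append((sum(v * v for v in counts.values()) - n) / (n * (n - 1)))
    P_bar = sum(P) / N
    # chance agreement from the marginal category proportions
    P_e = sum((c / (N * n)) ** 2 for c in totals.values())
    return (P_bar - P_e) / (1 - P_e)
```

With perfect agreement across items that use more than one category, the statistic is 1; values between 0.4 and 0.6 correspond to the "moderate agreement" band reported above.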

Encoder Structure
To investigate the effect of our proposed structure-aware semantic aggregation module, we fixed the pre-training tasks and compared our encoder with two Transformer-based encoders commonly used in the existing work:

SeqEnc: This sequence encoder takes linearized graphs as input and ignores structural information (Ribeiro et al., 2020a; Kale and Rastogi, 2020).

RelEnc: This relation-aware encoder regards the entity sequence as input and leverages the relation embedding in the self-attention layer. Both the entity and relation embedding vectors are directly learned as model parameters (Shaw et al., 2018; Zhu et al., 2019; Song et al., 2020).

Note that we only chose the encoder structures that can directly adapt to BART / T5 for fair comparison. Results in Table 4 show that our encoder structure performs better than the other baselines. Compared with the relation-aware encoder, which can also capture the structural information of knowledge graphs, our model fully utilizes the effective contextual semantic representations to initialize the entity / relation representations at each Transformer layer instead of directly using learnable entity / relation embedding vectors. This design equips JointGT with better generalization ability during fine-tuning, thereby enhancing our performance on downstream datasets.

To further demonstrate the effectiveness of our encoder, we divided the test set of WebNLG(U) into two subsets according to the number of triples in the knowledge graphs, and compared the performance of the three encoders. Results in Table 5 show that the improvement margin between our encoder and the other encoders is more evident when the number of input triples is large, which indicates that our model can facilitate the encoding of knowledge graphs with more complex structures.

Pre-Training Task

To study the effect of the three pre-training tasks, we maintained the encoder structure and removed each task respectively to test the performance. We also replaced all our pre-training tasks with the tasks of the existing work for comparison:

BARTPretrain: The pre-training tasks of BART, including text infilling and sentence permutation (Lewis et al., 2020). Since these tasks cannot be applied to graph data, we only used them on the text data of the pre-training dataset.

KGPTPretrain: The pre-training task of KGPT, i.e., KG-to-text generation on the pre-training dataset (Chen et al., 2020b).

Results in Table 6 show that each of our pre-training tasks contributes to the model performance. Compared with the other two tasks, graph enhanced text reconstruction plays a more important role in the task of KG-to-text generation, since it directly supervises the decoder with the conditional generation loss. We also observe an apparent performance drop if we replace our pre-training tasks with those proposed by the existing work, thereby indicating the effectiveness of our pre-training tasks in promoting KG-to-text generation.

Few-Shot Learning
To further analyze whether our pre-training tasks can learn a good graph-text joint representation that benefits the downstream KG-to-text generation tasks, we considered the few-shot setting where only a few training instances were used during fine-tuning. We still fixed our model structure and compared our pre-training tasks with the tasks of BART and KGPT mentioned in §4.6.2.

Figure 4: Generated results on WebNLG(U). We highlight the missing and unfaithful parts of each text in red and blue, respectively. Sample outputs: "The Acharya Institute of Technology is located in the state of Karnataka which has Telangana to its northeast and the Arabian Sea to its west. The Institute was given the 'Technical Campus' status by the All India Council for Technical Education in Mumbai. One of the sports offered at the Institute is tennis which is governed by the International Tennis Federation." JointGT (T5): "The Acharya Institute of Technology is located in the state of Karnataka. Karnataka has Telangana to its northeast and the Arabian Sea to its west. The Institute was given the 'Technical Campus' status by the All India Council for Technical Education in Mumbai. The Institute offers tennis which is governed by the International Tennis Federation."
Results in Table 7 show that our pre-training tasks perform better than the other tasks, especially when the amount of training data is small. This indicates that our proposed tasks can capture the graph-text alignments during pre-training, thereby helping our model generalize better to the downstream KG-to-text datasets with only a few training samples.

Case Study
To intuitively show the generation quality of our model, we provide some generated cases in Figure 4. We observe that JointGT can generate high-quality texts that describe the knowledge graph more completely and faithfully. For example, in the generated case on WebNLG(U), both BART and T5 fail to cover all the input triples: BART misses the triple (Acharya Institute of Technology, sports offer, Tennis), and T5 misses (Tennis, sports governing body, International Tennis Federation). Also, T5 generates non-existing facts that are unfaithful to the knowledge graph. Equipped with the structure-aware Transformer encoder and the well-designed pre-training tasks to learn graph-text alignments, JointGT (BART) and JointGT (T5) can generate descriptions which include all the input triples and express the relation between each pair of entities more faithfully.

Conclusion
We propose a novel graph-text joint representation learning model called JointGT for KG-to-text generation. This model plugs a simple structure-aware semantic aggregation module into the vanilla Transformer layer to preserve the structure of input graphs, and utilizes three pre-training tasks to learn graph-text alignments in the discrete vocabulary space and the continuous embedding space. Experiments show that JointGT can outperform state-of-the-art pre-trained NLG models on various datasets of KG-to-text generation.

Table 9: Hyper-parameter search space of JointGT during fine-tuning. uniform-integer means that the integers in the interval can be selected uniformly. In the search space of warmup step, total step denotes the total number of training steps on the corresponding dataset.
Thus, all the hyper-parameters reported in our paper were consistent with the codes of Huggingface's Transformers. We presented the hyper-parameter search space during pre-training in Table 8. The number of hyper-parameter search trials was 10. Manual search was adopted to select hyper-parameters, and the selection criterion was BLEU on the validation set when we fine-tuned the pre-trained model on WebNLG(U). The best assignment for pre-training was described in the main content.
We also provided the detailed settings of hyper-parameters during fine-tuning on the downstream datasets, including the hyper-parameter search space in Table 9 and the best assignments in Table 10. The number of hyper-parameter search trials was 20. BLEU was adopted as our criterion in the manual search on all the downstream tasks.