Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Prior work on Data-To-Text Generation, the task of converting knowledge graph (KG) triples into natural text, has focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG and discuss the unique challenges associated with broad, open-domain, large-scale verbalization. We further show that verbalizing a comprehensive, encyclopedic KG like Wikidata can be used to integrate structured KGs and natural language corpora. In contrast to the many architectures that have been developed to integrate these two sources, our approach converts the KG into natural text, allowing it to be seamlessly integrated into existing language models. It carries the further advantages of improved factual accuracy and reduced toxicity in the resulting language model. We evaluate this approach by augmenting the retrieval corpus in a retrieval language model and showing significant improvements on the knowledge-intensive tasks of open-domain QA and the LAMA knowledge probe.


Introduction
Data-To-Text Generation (Kukich, 1983; Goldberg et al., 1994) involves converting knowledge graph (KG) triples of the form (subject, relation, object) into natural language sentences. There are many standard datasets for this task, such as WebNLG (Gardent et al., 2017), and many systems have been developed to improve performance on them. However, to the best of our knowledge, no prior work has attempted to verbalize a full knowledge graph. Verbalizing a full KG poses additional challenges over small benchmark datasets, such as entity and relation coverage and the lack of grouped sets of triples that can together produce coherent sentences. In this paper, we convert the English Wikidata KG (Vrandečić and Krötzsch, 2014) into natural language text (Figure 1). The generated corpus, which we call the KELM Corpus, consists of ∼18M sentences spanning ∼45M triples with ∼1500 distinct relations. For training the verbalization system, we also create an English Wikidata KG-Wikipedia text aligned corpus covering a variety of entities, including dates and numerical quantities.

* Work done during internship at Google

Figure 1: An example of generating text from a KG. First, the entity subgraphs on the left are created and then converted to the sentence on the right.
We evaluate the quality of the generated corpus through human evaluation of a random sample. We further showcase the utility of this corpus in language model pre-training. Text provides only limited coverage of world knowledge, so we expect language models trained on text alone to be restricted to facts that are expressed in natural language. Moreover, facts may not be expressed as explicitly in text as they are in KGs, and variability in the quality of text can introduce biases into the resulting models (Bolukbasi et al., 2016; Sheng et al., 2019; Manzini et al., 2019). Building models that handle structured data and free-form text seamlessly has been a long-sought-after goal, but their integration is challenging due to their differing structural formats. KG verbalization provides a simple way to integrate KGs with natural text. We illustrate this by augmenting the retrieval corpus of REALM (Guu et al., 2020) with the verbalized KG.

TEKGEN
One of the challenges in converting an entire KG to text is the wide variety of entities and relations. Wikidata consists of ∼6M entities and ∼1500 relations; in comparison, the WebNLG dataset has ∼600 entities and ∼20 relations. In this section, we discuss the various components of TEKGEN, also illustrated in Figure 2:
1. Create a large yet noisy training dataset using distant supervision.
2. Sequentially fine-tune T5, first on the dataset from step 1 for improved coverage, then on a small clean dataset for reduced hallucination.
3. Build a filter for the generated text based on its semantic quality w.r.t. the KG triples.

Training Dataset
We first create training data using distant supervision by aligning Wikidata triples to Wikipedia text (see Figure 3).

KG-Text Alignment
For each entity, we constrain the candidate sentences to the root section of its Wikipedia page, because this section generally describes the relations of the subject entity with other entities. For each sentence in this section, we match all triples that have this entity as the subject. A triple is said to match if any alias of the object entity occurs in the sentence. We do not match relations to text, as there are too many ways to express them; constraining to the root section of the subject entity's page generally ensures that the relation is expressed in any sentence that mentions the object entity. Each triple can align to multiple sentences, and each sentence can have multiple triples aligned to it. If any alias of the subject entity occurs in the given sentence, the sentence is selected as is; otherwise, the first animate third-person personal or possessive pronoun is replaced by the subject entity's canonical name. This pronoun-replacement heuristic also works well because of the root-section constraint. All triples aligned to a given sentence are combined into a single example.
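The alignment and pronoun-replacement heuristics above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and the simple substring/pronoun matching are assumptions.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str

def align_triples_to_sentence(sentence, triples, aliases):
    """Return all triples (sharing a common subject) whose object alias
    occurs in the sentence; relations themselves are not matched."""
    low = sentence.lower()
    return [t for t in triples
            if any(a.lower() in low for a in aliases.get(t.obj, [t.obj]))]

def maybe_replace_pronoun(sentence, subject_aliases, subject_name,
                          pronouns=("he", "she", "his", "her")):
    """If no subject alias occurs in the sentence, replace the first
    third-person personal/possessive pronoun with the canonical name."""
    low = sentence.lower()
    if any(a.lower() in low for a in subject_aliases):
        return sentence  # subject already mentioned; keep sentence as is
    words = sentence.split()
    for i, w in enumerate(words):
        if w.lower().strip(".,") in pronouns:
            words[i] = subject_name
            break
    return " ".join(words)
```

For example, aligning the triple (Barack Obama, date of birth, 4 August 1961) to "He was born on 4 August 1961." matches on the object string, and the pronoun "He" is then replaced by the canonical subject name.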
Alignment statistics are shown in Table 1 and some alignment examples are shown in Table 2. There are a total of ∼45M triples, ∼35% of which were aligned to sentences. This results in ∼8M examples, covering ∼42% of the relations.
Note that each sentence in the aligned corpus is matched to triples with a common subject entity. While this introduces some noise, such errors should be rare due to the constraint that the text comes from the root section of the subject entity's page. This constraint lets us maintain the same common-subject-entity property as the entity subgraphs used at inference (§3). It also simplifies the alignment process, removing the need to match relations to text. In comparison, the T-REx corpus (Elsahar et al., 2018) contains errors due to entity linking and incorrect entailment, which are unlikely in our corpus because of this constraint.

Types of Triples
We extract several types of triples, each with a slightly different matching technique. Other alignment corpora built using Wikipedia hyperlinks would miss many triples whose entities lack Wikipedia pages, such as quantities, dates, and certain occupations, and hence relations such as date of birth, publication year, and distance from Earth. For example, a date triple aligns to the sentence "The 2012 reelection campaign of Barack Obama, the 44th President of the United States, was formally announced on April 4, 2011.", and the triple (Blue whale, parent taxon, Balaenoptera) aligns to "The blue whale (Balaenoptera musculus) is a marine mammal belonging to the baleen whale suborder Mysticeti." While the type of a triple matters during alignment, the verbalization model is agnostic to the type and treats all triples the same.

Model
We perform a two-step sequential fine-tuning of the pre-trained T5-large model to convert triples to text. Triples are concatenated as subject relation_1 object_1, ..., relation_n object_n for input to T5. The model is first fine-tuned on the aligned corpus for 5000 steps to increase the coverage of entities and relations. However, this results in the generation of Wikipedia-like sentences and in hallucination when an expected input triple is missing. For example, Wikipedia sentences generally mention date of birth, date of death, and occupation together; if the occupation is missing from the input, the system hallucinates a random one. The input "Neff Maiava date of birth 01 May 1924, date of death 21 April 2018" generates "Neff Maiava (1 May 1924 - 21 April 2018) was an Albanian actor.", hallucinating a profession. To overcome this, we further fine-tune the model on WebNLG 2017 data for 500 steps. While WebNLG has low coverage, the information in its input triples matches the target sentence exactly, and its sentence structure differs from Wikipedia's. This reduces conformity to Wikipedia sentence structure and hence reduces hallucination. We use a learning rate of 0.001, a batch size of 1048576 tokens, and a maximum decoding length of 256.
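The linearization of a triple group into the flat T5 input string can be sketched as below; the function name is ours, but the output format follows the concatenation scheme described above.

```python
def linearize_subgraph(subject, relation_object_pairs):
    """Concatenate triples sharing a subject into the flat string fed
    to T5: 'subject relation_1 object_1, ..., relation_n object_n'."""
    parts = [f"{rel} {obj}" for rel, obj in relation_object_pairs]
    return f"{subject} " + ", ".join(parts)
```

For the example above, `linearize_subgraph("Neff Maiava", [("date of birth", "01 May 1924"), ("date of death", "21 April 2018")])` yields "Neff Maiava date of birth 01 May 1924, date of death 21 April 2018".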

Quality Filtering
We perform semantic-quality-based filtering of the sentences generated by the triple-to-text module. This is a separate post-processing module used during inference and is not jointly optimized with the text generation module. A semantic quality score is assigned to each generated sentence w.r.t. the input triples, indicating whether the generated text captures the full meaning of the triples without hallucinating extra information. The score is generated using a BERT-base uncased model with input of the form [CLS] concatenated-triples [SEP] reference-or-generated-sentence. It is fine-tuned for 1000 steps on the WebNLG 2017 human assessment data, which contains system predictions submitted to the shared task, rated on a scale of 1-3 for semantics and fluency. We use the semantics score, scaled to 0-1, and also add gold references with a score of 1. This results in 2706 examples, 90% of which are used for fine-tuning and the remainder for evaluation. High correlations are obtained between the predicted scores and human scores on the evaluation split (Table 3).
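A sketch of the scorer's input construction and score rescaling is below. The [CLS]/[SEP] template matches the description above, but assembling it as a literal string (rather than via a tokenizer's special tokens) and the linear 1-3 to 0-1 rescaling are our assumptions for illustration.

```python
def scorer_input(triples_text, sentence):
    """Build the BERT scorer input: [CLS] concatenated-triples [SEP] sentence.
    In practice a tokenizer would add these special tokens itself."""
    return f"[CLS] {triples_text} [SEP] {sentence}"

def scale_semantic_score(raw):
    """Map a 1-3 human semantics rating onto a 0-1 regression target,
    assuming a linear rescaling."""
    if not 1.0 <= raw <= 3.0:
        raise ValueError("rating must lie in [1, 3]")
    return (raw - 1.0) / 2.0
```

Under this mapping, gold references (added with score 1 on the 0-1 scale) coincide with a perfect 3 on the original annotation scale.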

KELM Corpus
In this section, we utilize the TEKGEN model and filtering mechanism to build a synthetic corpus that captures the KG in natural language format.

Entity Subgraph
Datasets such as WebNLG have instances with grouped triples that can be expressed as a fluent sentence. Such groups are not available for a large KG, and using one triple at a time for inference would lead to hallucination because training uses multiple triples per example. Therefore, we develop a strategy to create entity subgraphs based on relation co-occurrence counts, i.e., the frequency with which two relations align to the same sentence in the training data. The algorithm is shown in Figure 4. It produces ∼18M entity subgraphs from ∼45M triples, so the final corpus has ∼18M generated sentences, one per entity subgraph.

Figure 4: Entity Subgraph Creation Algorithm using relation co-occurrence counts based on relation-sentence alignment in the training data. Each entity subgraph consists of a maximum of five triples, all with the same subject entity. The first triple is chosen at random. The second triple is chosen such that its relation has the highest co-occurrence count with the relation in the first triple, and so on.
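The greedy grouping in Figure 4 can be sketched as below. This follows the description (random first triple, then repeatedly adding the triple whose relation co-occurs most with the previously chosen relation, up to five triples per subject); tie-breaking and the exact randomization are our assumptions.

```python
import random
from collections import defaultdict

def build_entity_subgraphs(triples, cooccur, max_size=5, seed=0):
    """Group (subject, relation, object) triples sharing a subject into
    subgraphs of at most `max_size` triples. `cooccur[(r1, r2)]` counts how
    often relations r1 and r2 aligned to the same training sentence."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for t in triples:
        by_subject[t[0]].append(t)
    subgraphs = []
    for subject, pool in by_subject.items():
        pool = pool[:]
        while pool:
            # first triple of each subgraph is chosen at random
            group = [pool.pop(rng.randrange(len(pool)))]
            while pool and len(group) < max_size:
                prev_rel = group[-1][1]
                # next triple: relation with highest co-occurrence count
                best = max(pool, key=lambda t: cooccur.get((prev_rel, t[1]), 0))
                pool.remove(best)
                group.append(best)
            subgraphs.append(group)
    return subgraphs
```

Every triple ends up in exactly one subgraph, and all triples within a subgraph share the same subject entity, mirroring the common-subject property of the training examples.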

Generation
For each entity subgraph, we concatenate all its triples as before and perform top-5 sampling with a temperature of 0.5. The bottom 1% of the generated sentences are filtered out based on the semantic score assigned using the model in §2.3.

Table 4: Human evaluation of the generated corpora, on a scale of 1-5, for semantics and fluency.
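The bottom-1% filter amounts to a percentile cutoff over the scored generations; a minimal sketch (function name and pair-based interface are ours):

```python
def filter_bottom_percent(scored_sentences, percent=1.0):
    """Drop the lowest-scoring `percent` of generated sentences.
    `scored_sentences` is a list of (sentence, semantic_score) pairs."""
    ranked = sorted(scored_sentences, key=lambda p: p[1])
    cutoff = int(len(ranked) * percent / 100.0)
    return ranked[cutoff:]
```

With `percent=1.0`, exactly the lowest-scoring 1% of sentences (rounded down) are removed, regardless of any absolute score threshold.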

Human Evaluation
Generation quality of the KELM Corpus is evaluated using human ratings on a random sample of 200 entity subgraphs. Automatic metrics such as BLEU (Papineni et al., 2002) or BERTScore (Zhang et al., 2019) cannot be used due to the lack of gold references. Following prior work, the generated text is rated on two aspects, fluency and semantics, on a scale of 1-5, where 1 means not at all fluent/does not capture meaning at all and 5 means completely fluent/fully captures meaning with no hallucination. We have eight annotators in total, and each example is rated by two of them. All annotators are linguists, NLP researchers, or NLP practitioners who volunteered for the evaluation; we do not use any crowdsourcing platform. For each instance, the scores of the two annotators are averaged to get the final rating. The Pearson correlation between the two sets of ratings is 0.56 for semantics and 0.43 for fluency. We compare TEKGEN to two baseline systems. For both baselines, we fine-tune a T5-large model only on WebNLG 2017 data but use different inference inputs. For one system, we use one triple at a time as input, resulting in 524 instances from the 200 entity subgraphs. For the second, we use the entity subgraphs as input, resulting in 200 instances. Scores are shown in Table 4. Entity subgraphs during inference do not improve the mean scores but reduce the standard deviation of fluency. In comparison, TEKGEN with subgraph inference improves both the semantics and fluency of the generated text: both mean scores are higher and the standard deviations are lower. It also paraphrases canonical relation names in the KG into more natural expressions more often. Some examples of generation using the two systems are shown in Table 5. In the second example, the relation 'inception' is paraphrased to 'started' by WebNLG_finetuning+Triple_Inference and to 'founded' by TEKGEN+Subgraph_Inference, the latter being more appropriate for organizations.
For completeness, we evaluate two more baseline systems in which a T5-large model is fine-tuned only on the KG-text aligned corpus, again with the two different inference inputs: single triples and entity subgraphs. One annotator rated the same sample for semantics; the former had an average score of 2.34 and the latter 2.73. Since these scores were very low, we did not pursue the evaluation of these systems further. Using only the aligned corpus, which is somewhat noisy, yields the worst-performing system of all the methods.

Knowledge Enhanced LMs
In this section, we showcase an application of the generated KELM Corpus as a way to integrate KGs into natural text corpora for pre-training language models (LMs), as shown in Figure 5. We choose REALM (Guu et al., 2020) as a representative of the recently introduced family of retrieval language models, and we expect our work to be equally applicable to other such models. We show gains on the LAMA knowledge probe and open-domain QA with augmentation. We also perform experiments that use raw Wikidata triples instead of the KELM corpus, to confirm the effectiveness of verbalization.

Retrieval Language Models
REALM is a retrieval-based language model that uses two corpora for pre-training: a retrieval corpus and a pre-training corpus. During pre-training, a sentence is selected at random from the pre-training corpus and a random word or salient span (a date or entity) in it is masked. The masked word is then predicted using a joint representation of the masked sentence and each of the documents in the retrieval corpus. In the fine-tuning stage, the model is given a query/question as input in place of a masked sentence. It retrieves a small set of documents from the retrieval corpus based on the vector similarity between the query and document representations, and then selects a span of text from the retrieved documents as the answer.
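The retrieval step can be illustrated with a toy similarity search over document embeddings. This is a deliberately simplified sketch: REALM learns dense encoders and uses maximum inner product search at scale, whereas here we use plain cosine similarity over small hand-made vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the k documents whose embeddings are most
    similar to the query embedding (the retrieval step, in miniature)."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

Replacing or augmenting the retrieval corpus, as in the following experiments, simply changes the set of `doc_vecs` the model can draw evidence from.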

KELM Documents
We group sentences in the KELM corpus by subject entity to create 5722974 (∼5.7M) documents, which we call KELM documents. We then replace or augment the retrieval corpus in REALM with these synthetic documents. The KELM corpus has only ∼286M words (∼14%) compared to ∼2B words in English Wikipedia.
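The grouping of generated sentences into per-entity documents can be sketched as follows; the function name and the choice to join sentences with spaces are our assumptions.

```python
from collections import defaultdict

def build_kelm_documents(sentences_with_subjects):
    """Group generated sentences by subject entity; each group of
    concatenated sentences forms one retrieval document."""
    groups = defaultdict(list)
    for subject, sentence in sentences_with_subjects:
        groups[subject].append(sentence)
    return {subject: " ".join(sents) for subject, sents in groups.items()}
```

Each distinct subject entity thus yields exactly one document, which is how ∼18M sentences collapse into ∼5.7M KELM documents.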

Figure 5: Knowledge Graph verbalization for integration with natural text corpora for language model pre-training. An example KG fragment is verbalized as "Spork EP is an EP released by indie rock band Flake Music."

Evaluation Datasets
We perform evaluation using two open domain question answering datasets and one knowledge probing dataset.
WebQuestions (WQ) (Berant et al., 2013): question-answer pairs collected via the Google Suggest API.
Natural Questions (NQ) (Kwiatkowski et al., 2019): real user queries issued to Google Search.
LAMA (Petroni et al., 2019): a knowledge probe of cloze-style queries.
We keep the same settings as REALM for both NQ and WQ, i.e., the open-domain setting where no passage is provided as context for a question. Fine-tuning is performed on the respective training splits.
REALM did not include LAMA among its evaluation datasets, so we first evaluate REALM on LAMA using the original retrieval corpus and then using the KELM Corpus. No fine-tuning is performed; the masked-word predictions from the pre-trained models are used as answers.

Results
We evaluate REALM on WQ, NQ, and LAMA under three settings that modify the retrieval corpus: ORIGINAL, REPLACED, and AUGMENTED. The REPLACED and AUGMENTED models are evaluated using both the raw Wikidata triples and the generated sentences: Wikidata triples grouped by subject entity form Triple Documents, and KELM Corpus sentences grouped by subject entity form KELM Corpus Documents (§4.2). In all cases, the model is pre-trained for 200k steps on the CC-News pre-training corpus with default hyperparameters.
ORIGINAL For NQ and WQ, we fine-tuned the pre-trained REALM on the respective training splits. While we were able to reproduce the reported accuracy on WQ, the accuracy on NQ was ∼1.5% absolute below the reported number (rows 1&2 in Table 7). For the LAMA probe, we first evaluated the pre-trained REALM, reporting the results on the different sub-corpora in Table 6 (row Wikipedia under REALM). Even the ORIGINAL REALM model shows substantial improvement over prior models; its ability to access corpus documents during inference not only makes it interpretable but also stronger on knowledge-intensive tasks. It obtains an accuracy of 67.36% on Google-RE, 68.18% on T-REx, and 27.96% on a third sub-corpus (see Table 6).
REPLACED With the retrieval corpus replaced, performance improves on WQ but drops on NQ (rows 3&4 in Table 7). This can be attributed to the nature of the datasets: WQ is a KG-based dataset, whereas NQ consists of real queries issued to Google. On LAMA (rows 2&3 under REALM in Table 6), performance is lower than the ORIGINAL model but much higher than BERT. Triple Documents and KELM Corpus Documents perform similarly; when using just the KG, the format does not matter. However, a system trained on raw triples may not generalize to tasks where sentence structure is important.
AUGMENTED We observe improvements on all datasets (last two rows of Tables 6&7) with the AUGMENTED model, which uses both the Wikipedia text and the KELM Documents. There is an absolute gain of 2.63% and 3.10% on NQ and WQ respectively over the ORIGINAL model. Similarly, there are absolute gains of 12.94%, 0.95%, 3.61%, and 0.47% on Google-RE, T-REx, SQuAD, and ConceptNet in LAMA respectively. Unlike the REPLACED model, the improvement is higher when the generated sentences of the KELM Corpus are added rather than the raw Wikidata triples, confirming the effectiveness of verbalizing the KG into natural language sentences. Wikipedia is the dominant corpus with 2B words, whereas the KELM corpus sentences are succinct with a total of 286M words (§4.2), so the learned representations likely favour the Wikipedia format, i.e., natural language sentences. We expect that augmenting other retrieval-based models such as DPR (Karpukhin et al., 2020) and RAG (Lewis et al., 2020) with the KELM corpus would also improve their performance, given that their enhancements are orthogonal to our contribution. Moreover, our augmented corpus represents a scalable strategy for future QA systems: by adding only 14% more tokens to the original REALM model, we outperform much larger and computationally expensive models.
We inspected the errors of the AUGMENTED model with KELM Documents on LAMA. Apart from real errors, where the prediction is actually incorrect, there were some false errors that can be broadly classified into three categories:
1. Ambiguous Query: e.g., in "X was born in ____", the answer could be either the year or the place of birth, but only one of them is accepted depending on the sub-corpus.
2. Incomplete Answer Set: e.g., in "Konstantin Mereschkowski had a career as ____", the gold target is biologist and the prediction is botanist, but both should be correct.
3. Answer Granularity: the prediction is correct but more specific; e.g., in "On the CPI scale, Kenya ranks ____", the gold answer is low but the prediction is 101, which is in fact correct.

Related Work
Data-to-Text Generation Data-to-Text Generation has several benchmark datasets with slightly different objectives, such as WebNLG (Gardent et al., 2017), which converts a group of triples to text, and E2ENLG (Dušek et al., 2018). Many systems (e.g., Shimorina and Gardent, 2018) have been developed and evaluated on these datasets, such as graph transformers over structured data (Koncel-Kedziorski et al., 2019), latent templates for interpretability (Wiseman et al., 2018), and text-to-text generation with T5 (Kale, 2020). Open information extraction systems (Etzioni et al., 2008; Angeli et al., 2015; Clancy et al., 2019) inherently create such an aligned corpus, but these works generally do not release the extracted KG triples.

KG-Text alignment
Incorporating KGs Most prior work on incorporating KGs with text learns KG entity representations and adds them to the mention spans linked to the entity (Févry et al., 2020), or creates subgraphs relevant to the query that are expanded with text in the embedding space (Sun et al., 2019; Xiong et al., 2019). Others incorporate additional modules: Verga et al. (2020) extend Févry et al. (2020) by adding a triple memory with the (subject, relation) encoding as the key and the object encoding as the value; Das et al. (2017) use universal schema (Riedel et al., 2013), which embeds text and KGs in a shared space, for their integration; K M et al. (2018) learn a single representation for all triples mentioned in a sentence during pre-training and update it further in task-specific fine-tuning. In contrast, we convert the KG into text and use it to augment the pre-training data.

Future Work
The KELM corpus covers all facts in the KG, but each generated sentence is limited to a given entity and its direct relations to other entities. For example, given the triples (X, child, Y) and (Y, child, Z), it does not contain "Z is a grandchild of X". More complex sentences could be generated by incorporating multi-hop relations in the KG. Recent work has also shown promising results on generating multilingual text from English triples (Castro Ferreira et al., 2020; Agarwal et al., 2020). Our proposed approach could thus be applied to generate a multilingual corpus of facts in various languages from English Wikidata.

Conclusion
In this paper, we converted an entire KG (Wikidata) into natural text (the KELM Corpus), tackling challenges beyond those of verbalizing domain-specific benchmark datasets. We further showed that KG verbalization can be used to integrate KGs and natural text corpora by including the verbalized KG as additional pre-training data, augmenting a retrieval-based language model with the generated synthetic KELM corpus as part of its retrieval corpus.
We evaluated the augmented model on open-domain QA and a knowledge probe, showing improvements on both. The KELM Corpus is publicly available at https://github.com/google-research-datasets/KELM-corpus.