Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Oshin Agarwal, Heming Ge, Siamak Shakeri, Rami Al-Rfou


Abstract
Prior work on Data-To-Text Generation, the task of converting knowledge graph (KG) triples into natural text, focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG, and discuss the unique challenges associated with a broad, open-domain, large-scale verbalization. We further show that verbalizing a comprehensive, encyclopedic KG like Wikidata can be used to integrate structured KGs and natural language corpora. In contrast to the many architectures that have been developed to integrate these two sources, our approach converts the KG into natural text, allowing it to be seamlessly integrated into existing language models. It carries the further advantages of improved factual accuracy and reduced toxicity in the resulting language model. We evaluate this approach by augmenting the retrieval corpus in a retrieval language model and showing significant improvements on the knowledge-intensive tasks of open-domain QA and the LAMA knowledge probe.
Anthology ID:
2021.naacl-main.278
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
3554–3565
URL:
https://aclanthology.org/2021.naacl-main.278
DOI:
10.18653/v1/2021.naacl-main.278
Cite (ACL):
Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
Cite (Informal):
Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training (Agarwal et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-main.278.pdf
Video:
https://aclanthology.org/2021.naacl-main.278.mp4
Code:
google-research-datasets/KELM-corpus
Data:
KELM, TekGen, LAMA, Natural Questions, SQuAD, T-REx