2024
IEPile: Unearthing Large Scale Schema-Conditioned Information Extraction Corpus
Honghao Gui | Lin Yuan | Hongbin Ye | Ningyu Zhang | Mengshu Sun | Lei Liang | Huajun Chen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing the specific capabilities of LLMs, yet current IE datasets tend to be small in scale, fragmented, and lacking a standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPile enhances the performance of LLMs for IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.
2023
Schema-adaptable Knowledge Graph Construction
Hongbin Ye | Honghao Gui | Xin Xu | Xi Chen | Huajun Chen | Ningyu Zhang
Findings of the Association for Computational Linguistics: EMNLP 2023
Conventional Knowledge Graph Construction (KGC) approaches typically follow the static information extraction paradigm with a closed set of pre-defined schema. As a result, such approaches fall short when applied to dynamic scenarios or domains where new types of knowledge emerge. This necessitates a system that can handle evolving schema automatically to extract information for KGC. To address this need, we propose a new task called schema-adaptable KGC, which aims to continually extract entities, relations, and events based on a dynamically changing schema graph without re-training. We first split and convert existing datasets based on three principles to build a benchmark, i.e., horizontal schema expansion, vertical schema expansion, and hybrid schema expansion; we then investigate the schema-adaptable performance of several well-known approaches such as Text2Event, TANL, UIE and GPT-3.5. We further propose a simple yet effective baseline dubbed AdaKGC, which contains a schema-enriched prefix instructor and schema-conditioned dynamic decoding to better handle evolving schema. Comprehensive experimental results illustrate that AdaKGC can outperform baselines but still has room for improvement. We hope the proposed work can deliver benefits to the community.
2022
Generative Knowledge Graph Construction: A Review
Hongbin Ye | Ningyu Zhang | Hui Chen | Huajun Chen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Generative Knowledge Graph Construction (KGC) refers to methods that leverage the sequence-to-sequence framework for building knowledge graphs, an approach that is flexible and can be adapted to a wide range of tasks. In this study, we summarize the recent compelling progress in generative knowledge graph construction. We present the advantages and weaknesses of each paradigm in terms of different generation targets and provide theoretical insight and empirical analysis. Based on the review, we suggest promising research directions for the future. Our contributions are threefold: (1) We present a detailed, complete taxonomy for the generative KGC methods; (2) We provide a theoretical and empirical analysis of the generative KGC methods; (3) We propose several research directions that can be developed in the future.
DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population
Ningyu Zhang | Xin Xu | Liankuan Tao | Haiyang Yu | Hongbin Ye | Shuofei Qiao | Xin Xie | Xiang Chen | Zhoubo Li | Lei Li
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We present an open-source and extensible knowledge extraction toolkit DeepKE, supporting complicated low-resource, document-level and multimodal scenarios in knowledge base population. DeepKE implements various information extraction tasks, including named entity recognition, relation extraction and attribute extraction. With a unified framework, DeepKE allows developers and researchers to customize datasets and models to extract information from unstructured data according to their requirements. Specifically, DeepKE not only provides various functional modules and model implementations for different tasks and scenarios but also organizes all components by consistent frameworks to maintain sufficient modularity and extensibility. We release the source code on GitHub at https://github.com/zjunlp/DeepKE with Google Colab tutorials and comprehensive documentation for beginners. Besides, we present an online system at http://deepke.openkg.cn/EN/re_doc_show.html for real-time extraction of various tasks, and a demo video.
2020
Bridging Text and Knowledge with Multi-Prototype Embedding for Few-Shot Relational Triple Extraction
Haiyang Yu | Ningyu Zhang | Shumin Deng | Hongbin Ye | Wei Zhang | Huajun Chen
Proceedings of the 28th International Conference on Computational Linguistics
Current supervised relational triple extraction approaches require huge amounts of labeled data and thus suffer from poor performance in few-shot settings. However, people can grasp new knowledge by learning from a few instances. To this end, we take the first step toward studying few-shot relational triple extraction, which has not been well understood. Unlike previous single-task few-shot problems, relational triple extraction is more challenging as the entities and relations have implicit correlations. In this paper, we propose a novel multi-prototype embedding network model to jointly extract the composition of relational triples, namely, entity pairs and corresponding relations. To be specific, we design a hybrid prototypical learning mechanism that bridges text and knowledge concerning both entities and relations. Thus, implicit correlations between entities and relations are injected. Additionally, we propose a prototype-aware regularization to learn more representative prototypes. Experimental results demonstrate that the proposed method can improve the performance of few-shot triple extraction.