DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population

We present DeepKE, an open-source and extensible knowledge extraction toolkit supporting complicated low-resource, document-level and multimodal scenarios in knowledge base population. DeepKE implements various information extraction tasks, including named entity recognition, relation extraction and attribute extraction. With a unified framework, DeepKE allows developers and researchers to customize datasets and models to extract information from unstructured data according to their requirements. Specifically, DeepKE not only provides various functional modules and model implementations for different tasks and scenarios but also organizes all components within consistent frameworks to maintain sufficient modularity and extensibility. We release the source code on GitHub at https://github.com/zjunlp/DeepKE with Google Colab tutorials and comprehensive documentation for beginners. Besides, we present an online system at http://deepke.openkg.cn/EN/re_doc_show.html for real-time extraction of various tasks, together with a demo video.


Introduction
As Information Extraction (IE) techniques develop rapidly, many large-scale Knowledge Bases (KBs) have been constructed. These KBs can provide back-end support for knowledge-intensive tasks in real-world applications, such as language understanding (Che et al., 2021), commonsense reasoning (Lin et al., 2019) and recommendation systems (Wang et al., 2018). However, most KBs are far from complete due to emerging entities and relations in real-world applications. Therefore, Knowledge Base Population (KBP) (Ji and Grishman, 2011) has been proposed, which aims to extract knowledge from text corpora to fill in the missing elements of KBs. To this end, IE is an effective technology that can extract entities and relations from raw texts and link them to KBs (Yan et al., 2021; Sui et al., 2021).
To date, a few remarkable open-source and long-term maintained IE toolkits have been developed, such as spaCy (Vasiliev, 2020) for named entity recognition (NER), OpenNRE (Han et al., 2019) for relation extraction (RE), Stanford OpenIE (Martínez-Rodríguez et al., 2018) for open information extraction, RESIN (Wen et al., 2021) for event extraction, and so on (Jin et al., 2021). However, several non-trivial issues still hinder their applicability in real-world applications.
Firstly, there are various important IE tasks, but most existing toolkits support only one of them. Secondly, although IE models trained with those tools can achieve promising results, their performance may degrade dramatically when only a few training instances are available or in other complex real-world scenarios, such as document-level and multimodal instances. Therefore, it is necessary to build a knowledge extraction toolkit for knowledge base population that supports multiple tasks and complicated scenarios: low-resource, document-level and multimodal.
To address these issues, we develop DeepKE, an open-source and extensible knowledge extraction toolkit that enables users to perform extraction in various tasks without knowing too many technical details, writing tedious glue code, or conducting hyper-parameter tuning. We will provide maintenance to meet new requests, add new tasks and fix bugs in the future. We highlight our major contributions as follows:

• We develop and release a knowledge base population toolkit that supports low-resource, document-level and multimodal information extraction.
• We offer flexible usage of the toolkit with sufficient modularity as well as automatic hyperparameter tuning; thus, developers and researchers can implement customized models for information extraction.
• We provide detailed documentation, Google Colab tutorials, an online real-time extraction system and long-term technical support.

Core Functions
DeepKE is designed for different knowledge extraction tasks, including named entity recognition, relation extraction and attribute extraction. As shown in Figure 1, DeepKE supports diverse IE tasks in standard single-sentence supervised, low-resource few-shot, document-level and multimodal settings, which makes it flexible enough to adapt to practical and complicated application scenarios.

Named Entity Recognition
As an essential IE task, named entity recognition (NER) picks out entity mentions from plain text and classifies them into pre-defined semantic categories. For instance, given the sentence "It was one o'clock when we left Lauriston Gardens, and Sherlock Holmes led me to meet Gregson from Scotland Yard.", NER models will predict "Lauriston Gardens" as a location, "Sherlock Holmes" and "Gregson" as persons, and "Scotland Yard" as an organization. For supervised NER, DeepKE adopts a pre-trained language model (Devlin et al., 2019) to encode sentences and make predictions. DeepKE also implements NER in the few-shot setting (both in-domain and cross-domain) (Chen et al., 2022a) and the multimodal setting.
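To make this concrete, the following minimal sketch illustrates BERT-based sequence labeling of the kind the standard NER setting builds on; the checkpoint name and pipeline usage are illustrative assumptions, not DeepKE's actual API:

    # Sketch of BERT-based NER; the checkpoint is a generic public
    # model, not one shipped with DeepKE.
    from transformers import pipeline

    ner = pipeline("ner", model="dslim/bert-base-NER",
                   aggregation_strategy="simple")
    sentence = ("It was one o'clock when we left Lauriston Gardens, and "
                "Sherlock Holmes led me to meet Gregson from Scotland Yard.")
    for ent in ner(sentence):
        print(ent["word"], ent["entity_group"], round(float(ent["score"]), 3))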

Relation Extraction
Relation Extraction (RE), a common IE task for knowledge base population, predicts semantic relations between pairs of entities in unstructured texts (Wu et al., 2021). To allow users to customize their models, we adopt various models for standard supervised RE, including CNN (Zeng et al., 2015), RNN (Zhou et al., 2016), Capsule (Zhang et al., 2018a), GCN (Zhang et al., 2018c, 2019), Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2019). Meanwhile, DeepKE provides few-shot and document-level support for RE. For low-resource RE, DeepKE re-implements KnowPrompt (Chen et al., 2022b), a recent well-performing few-shot RE method based on prompt-tuning. Note that few-shot RE is significant for real-world applications, as it enables users to extract relations with only a few labeled instances. For document-level RE, DeepKE re-implements DocuNet (Zhang et al., 2021) to extract inter-sentence relational triples within one document. Document-level RE is a challenging task that requires integrating information within and across multiple sentences of a document (Nan et al., 2020). RE is also implemented in the multimodal setting described in Section 4.4.
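As a rough illustration of the BERT-based variant, supervised RE can be cast as sequence classification over a sentence with marked entities; the marker tokens, label set and checkpoint below are assumptions for exposition, not DeepKE's exact implementation:

    # Sketch of RE as sequence classification with entity markers.
    # The classifier head here is untrained, so the prediction is random.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    RELATIONS = ["no_relation", "member_of", "located_in"]  # hypothetical schema
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    tokenizer.add_special_tokens(
        {"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]})
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=len(RELATIONS))
    model.resize_token_embeddings(len(tokenizer))

    sent = "[E1] Gregson [/E1] works at [E2] Scotland Yard [/E2]."
    logits = model(**tokenizer(sent, return_tensors="pt")).logits
    print(RELATIONS[logits.argmax(-1).item()])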

Attribute Extraction
Attribute extraction (AE) plays an indispensable role in knowledge base population. Given a sentence, an entity and a queried attribute mention, AE infers the corresponding attribute type. For instance, given the sentence "诸葛亮，字孔明，三国时期杰出的军事家、文学家、发明家。" (Liang Zhuge, whose courtesy name was Kongming, was an extraordinary strategist, litterateur and inventor in the Three Kingdoms period.), the entity "诸葛亮" (Liang Zhuge) and the attribute mention "三国时期" (Three Kingdoms period), DeepKE can predict the corresponding attribute type "朝代" (Dynasty). DeepKE adopts various models for AE (Table 1).
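For exposition, an AE instance can be pictured as follows; the field names are hypothetical and chosen for clarity, not DeepKE's exact data format:

    # Illustrative AE instance (character-level offsets); a trained AE
    # model maps (sentence, entity, attribute value) to an attribute type.
    instance = {
        "sentence": "诸葛亮，字孔明，三国时期杰出的军事家、文学家、发明家。",
        "entity": "诸葛亮",
        "entity_offset": 0,
        "attribute_value": "三国时期",
        "attribute_value_offset": 8,
    }
    # expected prediction: "朝代" (Dynasty)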

Toolkit Design and Implementation
We introduce the design principles of DeepKE as follows: 1) Unified Framework: DeepKE utilizes the same framework for various task objectives with respect to the Data, Model and Core components; 2) Flexible Usage: DeepKE offers convenient training and evaluation with automatic hyper-parameter tuning and a Docker image for operational efficiency; 3) Off-the-shelf Models: DeepKE provides pre-trained models (Chinese models with pre-defined schemas) for information extraction. We introduce the details of DeepKE's components and the unified framework in the following sections.

Data Module
The data module is designed for preprocessing and loading input data. The tokenizer in DeepKE implements tokenization for both English and Chinese (see Appendix A.3). Global images and local visual objects are preprocessed as visual information in the multimodal setting. Developers can feed their own datasets into the tokenizer and preprocessor through the dataloader to obtain tokens or image patches.
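The tokenization step can be sketched as follows; the checkpoint names are illustrative assumptions rather than pinned DeepKE dependencies:

    # Sketch of bilingual tokenization in the data module.
    from transformers import AutoTokenizer

    en_tok = AutoTokenizer.from_pretrained("bert-base-cased")
    zh_tok = AutoTokenizer.from_pretrained("bert-base-chinese")

    print(en_tok.tokenize("Sherlock Holmes left Lauriston Gardens."))
    print(zh_tok.tokenize("诸葛亮，字孔明。"))  # Chinese BERT splits per character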

Model Module
The model module contains the main neural networks used for the three core tasks. Various neural networks, including CNN, RNN, Transformer and the like, can be utilized for model implementation; they encode texts into task-specific embeddings. To adapt to different scenarios, DeepKE utilizes diverse architectures in distinct settings, such as BERT for standard RE and BART (Lewis et al., 2020) for few-shot NER. We implement the BasicModel class with a unified model loader and saver to integrate multifarious neural models.
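A minimal sketch of such a base class is shown below, assuming PyTorch; the method names and the toy subclass are illustrative, not DeepKE's exact code:

    # Sketch of a unified loader/saver in the spirit of the BasicModel class.
    import torch
    import torch.nn as nn

    class BasicModel(nn.Module):
        """Gives every task model the same save/load interface."""

        def save(self, path: str) -> None:
            torch.save(self.state_dict(), path)

        def load(self, path: str, device: str = "cpu") -> None:
            self.load_state_dict(torch.load(path, map_location=device))

    class CNNEncoder(BasicModel):
        """Toy CNN text classifier reusing the unified interface."""

        def __init__(self, vocab_size: int, num_labels: int):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, 128)
            self.conv = nn.Conv1d(128, 64, kernel_size=3, padding=1)
            self.fc = nn.Linear(64, num_labels)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            x = self.embed(token_ids).transpose(1, 2)        # (B, C, T)
            x = torch.relu(self.conv(x)).max(dim=-1).values  # max-over-time
            return self.fc(x)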

Core Module
In the core code of DeepKE, the train, validate and predict methods are the pivotal components.
As for the train method, users can feed the expected parameters (e.g., the model, data, epochs, optimizer and loss function) into it without writing tedious glue code. The validate method is used for evaluation. Users can modify the sentences in the configuration for prediction and then utilize the predict method to obtain the results.
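Schematically, the three methods can be pictured as below; the signatures are simplified assumptions rather than DeepKE's exact ones:

    # Schematic train/validate/predict split of the core module.
    import torch

    def train(model, loader, optimizer, loss_fn, epochs):
        model.train()
        for _ in range(epochs):
            for tokens, labels in loader:
                optimizer.zero_grad()
                loss_fn(model(tokens), labels).backward()
                optimizer.step()

    @torch.no_grad()
    def validate(model, loader):
        model.eval()
        hits = total = 0
        for tokens, labels in loader:
            hits += (model(tokens).argmax(-1) == labels).sum().item()
            total += labels.numel()
        return hits / total  # accuracy

    @torch.no_grad()
    def predict(model, tokens, id2label):
        model.eval()
        return id2label[model(tokens).argmax(-1).item()]  # single instance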

Framework Module
The framework module integrates the three aforementioned components and the different scenarios. It supports various functions, including data processing, model construction and model implementation. Meanwhile, developers and researchers can customize all hyper-parameters by modifying configuration files formatted as "*.yaml", which DeepKE parses with Hydra to obtain users' configurations. We also offer an off-the-shelf automatic hyper-parameter tuning component. In DeepKE, we have implemented frameworks for all application functions mentioned in Section 2, and we have reserved interfaces for potential future ones.
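A Hydra-based entry point of this kind is sketched below; the configuration keys are hypothetical, not DeepKE's actual "*.yaml" schema:

    # Sketch of Hydra-driven configuration. Assumes a hypothetical
    # conf/config.yaml with keys: model, lr, epochs.
    import hydra
    from omegaconf import DictConfig

    @hydra.main(config_path="conf", config_name="config", version_base=None)
    def main(cfg: DictConfig) -> None:
        print(f"training {cfg.model} for {cfg.epochs} epochs at lr={cfg.lr}")
        # build the dataset/model from cfg and launch training here

    if __name__ == "__main__":
        main()  # CLI overrides, e.g.: python run.py model=cnn lr=1e-5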
Toolkit Usage

Single-sentence Supervised Setting

All tasks, including NER, RE and AE, can be implemented by DeepKE in the standard single-sentence supervised setting. Each instance in the datasets contains only one sentence. The datasets of these tasks are all annotated with specific information, such as entity mentions, entity categories, entity offsets, relation types and attributes; an illustrative instance is shown below.
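The following toy RE instance pictures this annotation; the field names are assumptions for exposition, not DeepKE's exact column names:

    # Illustrative single-sentence supervised RE instance
    # (character-level offsets; the relation label is hypothetical).
    instance = {
        "sentence": "Sherlock Holmes met Gregson from Scotland Yard.",
        "head": "Gregson", "head_offset": 20,
        "tail": "Scotland Yard", "tail_offset": 33,
        "relation": "member_of",
    }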

Low-resource Setting
In real-world scenarios, labeled data is often insufficient for deep learning models to make predictions that satisfy users' specific demands. Therefore, DeepKE provides low-resource few-shot support for NER and RE, a distinctive feature among IE toolkits. DeepKE offers a generative framework with prompt-guided attention to achieve in-domain and cross-domain NER. Meanwhile, DeepKE implements knowledge-informed prompt-tuning with synergistic optimization for few-shot relation extraction.
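To convey the flavor of prompt-tuning for RE, the sketch below queries a masked language model with a fixed template; the actual method (KnowPrompt) instead learns virtual type and answer words with synergistic optimization, so this is only a simplified illustration:

    # Simplified prompt-style relation query against a masked LM.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")
    sent = "Sherlock Holmes was born in London."
    prompt = f"{sent} The relation between Sherlock Holmes and London is [MASK]."
    for cand in fill(prompt, top_k=3):
        print(cand["token_str"], round(float(cand["score"]), 3))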

Document-Level Setting
Relations between two entities may emerge not only within one sentence but also across different sentences in a document. Compared with other IE toolkits, DeepKE can extract inter-sentence relations from documents by predicting an entity-level relation matrix to capture local and global information.

Online System & cnSchema-based Off-the-shelf Models

Besides this toolkit, we release an online system at http://deepke.zjukg.cn. As shown in Figure 3, we train our models in different scenarios with multilingual support (English and Chinese) and deploy them for online access. The system can be directly applied to recognize named entities, extract relations and classify attributes from plain texts, and it visualizes extracted relational triples as knowledge graphs. The models are trained with pre-defined schemas (the system cannot extract knowledge outside the schema scope) and offer flexible usage for users to obtain customized models with their own schemas. Furthermore, DeepKE provides off-the-shelf extraction models built on Chinese pre-trained language models (Cui et al., 2021b) and cnSchema, supporting 28 entity types and 50 relation categories.

Evaluation

Relation Extraction

Each sample in the Chinese dataset for standard RE contains one sentence, one head entity, one tail entity in the sentence, their offsets, and the relation between them. We utilize six different neural networks in DeepKE for evaluation. Users can select models before training by changing only one hyper-parameter. We report the performance of all models in Table 1.

Attribute Extraction
The Chinese dataset for AE comes from an online resource. In each sample, one entity is annotated with its attribute type, value and offset. Attributes in the dataset are classified into six categories. The training set contains 13,815 samples, the validation set 3,131 samples, and the test set 5,921 samples. As with RE, we leverage six neural models to extract attributes from the given sentences to evaluate DeepKE.

Low-resource Setting
We report the performance of the low-resource setting (NER and RE) in Tables 2, 3 and 4.

Named Entity Recognition
We conduct experiments in both in-domain and cross-domain few-shot settings with LightNER (Chen et al., 2022a). Following Cui et al. (2021a), for the in-domain few-shot scenario, we reduce the number of training samples for certain entity categories by down-sampling the CoNLL-2003 dataset.

Relation Extraction

For few-shot RE, we compare DeepKE with baseline methods such as GDPNet (Xue et al., 2021) and PTR (Han et al., 2021). Table 4 shows that DeepKE outperforms those baseline methods.

Document-level Setting
DeepKE can extract both intra- and inter-sentence relations among multiple entities within one document. We leverage a large-scale document-level RE dataset, DocRED (Yao et al., 2019), containing 3,053/1,000/1,000 instances for training, validation and testing, respectively. We use cased BERT-base and RoBERTa-large (Liu et al., 2019) as encoders.
Compared with BERT-based and RoBERTa-based models, including Coref (Ye et al., 2020) and ATLOP (Zhou et al., 2021), DeepKE achieves better or comparable performance, as shown in Table 1.

Multimodal Setting
We report the performance of NER and RE in the multimodal scenario in Table 1.
Relation Extraction We use MNRE (Zheng et al., 2021b), a multimodal RE dataset of sentence-image pairs covering 23 relation categories. Previous SOTA models, including BERT+SG (Zheng et al., 2021a), BERT+SG+Att (BERT+SG with attention computing the semantic similarity between textual and visual graphs) and MEGA (Zheng et al., 2021a), are leveraged for comparison. We observe that DeepKE yields better performance than these baselines.

Conclusion
In practical applications, knowledge base population struggles with low-resource, document-level and multimodal scenarios. To this end, we propose DeepKE, an open-source and extensible knowledge extraction toolkit. We conduct extensive experiments demonstrating that models implemented with DeepKE achieve performance comparable to state-of-the-art methods. Besides, we provide an online system supporting real-time extraction (with pre-defined schemas) without training. We will offer long-term maintenance to fix bugs, solve issues, add documents (tutorials) and meet new requests.

Broader Impact Statement
As noted by Manning (2022), linguistics and knowledge-based artificial intelligence developed rapidly, yet knowledge (explicit or implicit), as potential dark matter for language understanding, still faces obstacles to acquisition and representation. To this end, IE technologies that aim to extract knowledge from unstructured data can serve as valuable tools not only to govern domain resources (e.g., medical, business) but also to benefit deep language understanding and reasoning. Note that the proposed toolkit, DeepKE, offers flexible usage in widespread IE scenarios with pre-trained off-the-shelf models. We hope to deliver the benefits of DeepKE to the natural language processing community.

A Toolkit Usage Details
In this section, we introduce how to use DeepKE in detail.

Figure 1: Examples of tasks in different scenarios in DeepKE.

Figure 2: The architecture and example code.

Figure 3: An example of the online system.

Table 1: F1 score (%) in the single-sentence, document-level and multimodal scenarios. * means these baselines are from other papers.
A.1 Build a Model From Scratch

Prepare the Runtime Environment. Users can clone the source code from the DeepKE GitHub repository and create a runtime environment in one of two convenient ways: with Anaconda or with the Docker file provided in the repository. Besides, all dependencies can be installed by running pip install deepke directly. If developers would like to modify the source code of DeepKE, the following commands should be executed: run python setup.py install, modify the code, and then run python setup.py develop. Users can also use corresponding datasets (e.g., default or customized datasets) to obtain specific information extraction models. All datasets need to be downloaded or uploaded into the folder named data.