KLMo: Knowledge Graph Enhanced Pretrained Language Model with Fine-Grained Relationships

Interactions between entities in a knowledge graph (KG) provide rich knowledge for language representation learning. However, existing knowledge-enhanced pretrained language models (PLMs) focus only on entity information and ignore the fine-grained relationships between entities. In this work, we propose to incorporate the KG (including both entities and relations) into the language learning process to obtain a KG-enhanced pretrained language model, namely KLMo. Specifically, a novel knowledge aggregator is designed to explicitly model the interaction between entity spans in text and all entities and relations in a contextual KG. A relation prediction objective is utilized to incorporate relation information by distant supervision. An entity linking objective is further utilized to link entity spans in text to entities in the KG. In this way, structured knowledge can be effectively integrated into language representations. Experimental results demonstrate that KLMo achieves great improvements on several knowledge-driven tasks, such as entity typing and relation classification, compared with state-of-the-art knowledge-enhanced PLMs.


Introduction
Knowledge graphs (KGs), with entities and relations, provide rich knowledge for language learning (Wang et al., 2014, 2017). Recently, researchers have explored incorporating KG information into PLMs (Devlin et al., 2018; Radford et al.) to enhance language representations, such as ERNIE-THU (Zhang et al., 2019), WKLM (Xiong et al., 2019), KEPLER, KnowBERT (Peters et al., 2019), BERT-MK (He et al., 2019) and KALM (Rosset et al., 2020). However, these models only utilize entity information and ignore the fine-grained relationships between entities. The fine-grained semantic information of relations between entities is also critical to language representation learning. Taking Figure 1 as an example, for entity typing, without explicitly knowing the fine-grained relation Guest between Lang Lang and Trio of Happiness, which is different from the relation Performer between Lang Lang and BBC Proms, it is impossible to correctly predict the type of Trio of Happiness as TV Show, since the input sentence literally implies that Trio of Happiness belongs to the same type as BBC Proms. The fine-grained relations between entities in a KG impose specific constraints on entities, and thus can play an important role in language learning for knowledge-driven tasks.
To explicitly incorporate entities and fine-grained relations in a KG into PLMs, one main challenge we face is the Text-Knowledge Alignment (TKA) problem: it is difficult to make token-relation and token-entity alignments for the fusion of text and knowledge. To handle this problem, the KG-enhanced pretrained language model (KLMo) is proposed to integrate the KG (i.e., both entities and fine-grained relations) into language representation learning. The main component of KLMo is a knowledge aggregator, which is responsible for fusing text and knowledge information from two individual embedding spaces, i.e., the token embedding space and the KG embedding space. The knowledge aggregator models the interaction between entity spans in text and all entities and relations in a contextual KG via an entity span-level cross-KG attention that makes tokens attend to highly related entities and relations in the KG. Based on the KG-enhanced token representations, a relation prediction objective is utilized to predict the relation of each pair of entities in text under the distant supervision of the KG. Furthermore, an entity linking objective is utilized to predict entities in the KG based on the corresponding entity spans in text. The relation prediction and entity linking objectives are the key to integrating KG information into text representations.
We conduct experiments on two Chinese knowledge-driven NLP tasks, i.e., entity typing and relation classification. The experimental results demonstrate that KLMo obtains large improvements over BERT and existing knowledge-enhanced PLMs by taking full advantage of a structured KG including both entities and fine-grained relations. We will also release a Chinese entity typing dataset for the evaluation of Chinese PLMs.

Model Description
As shown in Figure 2, KLMo is designed as a multi-layer Transformer-based (Vaswani et al., 2017) model, which accepts a token sequence and the entities and relations in its contextual KG as input. The token sequence is first encoded by a multi-layer Transformer-based text encoder. The output of the text encoder is then used as input for the knowledge aggregator, which fuses the knowledge embeddings of entities and relations into the token sequence to obtain KG-enhanced token representations. Based on the KG-enhanced representations, the novel relation prediction and entity linking objectives are jointly optimized as pre-training objectives, which help incorporate highly related entity and relation information from the KG into the text representations.

Knowledge Aggregator
As shown in Figure 2, the knowledge aggregator is designed as an M-layer knowledge encoder to integrate knowledge in the KG into language representation learning. It accepts the hidden embeddings of the token sequence and the knowledge embeddings of the entities and relations in the KG as input, and fuses text and KG information from two individual embedding spaces. The knowledge aggregator contains two separate multi-head attention modules: a token-level self-attention and a knowledge graph attention (Veličković et al., 2017), which encode the input text and the KG independently.

[Figure 2: Overview of the model architecture.]

The entity representation is computed by pooling over all tokens in an entity span. Then the aggregator models the interaction between entity spans in text and all entities and relations in a contextual KG through an entity-level cross-KG attention to incorporate knowledge into the text representations.
Knowledge Graph Attention As the entities and relations in a KG compose a graph, it is critical to consider the graph structure during knowledge representation learning. We first represent the entities and relations in the contextual KG by TransE (Bordes et al., 2013) and then arrange them into an entity and relation embedding sequence {z_0, z_1, ..., z_q}, which serves as the input for the knowledge aggregator. The knowledge aggregator then encodes the entity and relation sequence with a knowledge graph attention that accounts for the graph structure by introducing a visible matrix M into the traditional self-attention mechanism (Liu et al., 2020). The visible matrix M only allows adjacent entities and relations in the KG to be visible to each other during representation learning, as shown at the bottom right of Figure 2.
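The visible-matrix mechanism can be sketched as follows. This is a minimal single-head illustration in numpy, not the paper's implementation: the function names and the edge-list input format are our own, and multi-head projections are omitted for brevity.

```python
import numpy as np

def visible_matrix(edges, q):
    """Build a q x q boolean mask where entry (i, j) is True only when elements
    i and j are identical or adjacent (e.g. an entity and its incident relation)
    in the contextual KG."""
    M = np.eye(q, dtype=bool)
    for i, j in edges:  # each edge links two adjacent positions in the KG sequence
        M[i, j] = M[j, i] = True
    return M

def masked_attention(Z, M):
    """Single-head self-attention over KG embeddings Z (q x d); scores of
    mutually invisible pairs are set to -inf before the softmax, so each
    element only aggregates information from its KG neighbours."""
    d = Z.shape[1]
    scores = Z @ Z.T / np.sqrt(d)
    scores = np.where(M, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ Z
```

An isolated element (no adjacent entities or relations) attends only to itself and keeps its original embedding, which is the intended effect of the visibility constraint.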
Entity-level Cross-KG Attention To compute the KG-enhanced entity representations, given an entity mention list C_e = {(e_0, start_0, end_0), ..., (e_m, start_m, end_m)}, the knowledge aggregator first computes the entity span representations {ê^i_0, ..., ê^i_m} by pooling over all tokens in an entity span with the self-attentive span pooling method of Lee et al. (2017). The entity span embeddings {ê^i_0, ..., ê^i_m} are expanded to all tokens {ê^i_0, ..., ê^i_n} by setting ê^i_j = t̂^i_j for tokens not in any entity span, where t̂^i_j denotes the representation of the j-th token from the token-level self-attention.
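Self-attentive span pooling in the spirit of Lee et al. (2017) can be sketched as below. This is a toy numpy version under our own naming; the learned scoring vector w stands in for the feed-forward scorer used in practice.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def span_pool(token_reps, start, end, w):
    """Self-attentive span pooling: score each token in the span [start, end)
    with a learned vector w, then return the softmax-weighted average of the
    span's token representations."""
    span = token_reps[start:end]   # (span_len, d)
    alpha = softmax(span @ w)      # one attention weight per token in the span
    return alpha @ span            # pooled span representation, shape (d,)
```

For a single-token span the pooled representation reduces to that token's representation, consistent with the expansion rule ê^i_j = t̂^i_j described above.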
In order to model the interaction between entity spans in text and all entities and relations in a contextual KG, the aggregator performs an entity-level cross-KG attention that lets tokens attend to highly related entities and relations in the KG, and thus computes the KG-enhanced entity representations. Specifically, the entity-level cross-KG attention in the i-th aggregator layer is a contextual multi-head attention with the entity span embeddings {ê^i_0, ..., ê^i_n} as the query and the entity and relation embeddings {z^i_0, ..., z^i_q} as the key and value.

KG-enhanced Token Representations To inject the KG-enhanced entity information into the token representations, the i-th layer of the knowledge aggregator computes the KG-enhanced token representations {t^i_0, ..., t^i_n} by adopting an information fusion operation between {t̂^i_0, ..., t̂^i_n} and {e^i_0, ..., e^i_n}. For the j-th token, the fusion operation is defined as follows:

u^i_j = σ(W^i_t t̂^i_j + W^i_e e^i_j + b^i_u)
t^i_j = σ(W^i_u u^i_j + b^i_t)

where u^i_j represents the hidden state integrating the information from both token and entity, σ is a nonlinear activation function, and W^i_* and b^i_* are learnable weights and biases respectively. The KG-enhanced token representations {t^i_0, ..., t^i_n} are fed into the next layer of the knowledge aggregator as input.
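The per-token fusion step can be sketched as a small numpy example. This is a minimal sketch under our own assumptions: toy dimensions, randomly initialized weights standing in for learned parameters, and GELU as one common choice for the nonlinearity σ (the paper does not name it).

```python
import numpy as np

rng = np.random.default_rng(0)
d_t, d_e = 8, 4  # toy dimensions (the paper uses 768 for tokens, 100 for KG)

# parameters of one fusion layer; learned in training, random here
W_t = rng.normal(size=(d_t, d_t))
W_e = rng.normal(size=(d_e, d_t))
W_u = rng.normal(size=(d_t, d_t))
b_u, b_t = np.zeros(d_t), np.zeros(d_t)

def gelu(x):
    """A common nonlinearity used as sigma in Transformer-style fusion layers."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def fuse(t_j, e_j):
    """Mix a token representation t_j with its aligned KG-enhanced entity
    representation e_j into a KG-enhanced token representation."""
    u_j = gelu(t_j @ W_t + e_j @ W_e + b_u)  # joint hidden state of token and entity
    return gelu(u_j @ W_u + b_t)             # output fed to the next aggregator layer
```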

Pre-training Objectives
To incorporate KG knowledge into language representation learning, KLMo adopts a multi-task loss function as the training objective:

L = L_MLM + L_RP + L_EL

In addition to the masked language model loss L_MLM (Devlin et al., 2018), a relation prediction loss L_RP and an entity linking loss L_EL are integrated to predict the relations and entities in the KG based on the corresponding KG-enhanced token representations {t^M_0, ..., t^M_n}. For each pair of entity spans, we utilize the relation between their corresponding entities in the KG as the distant supervision for relation prediction. The relation prediction and entity linking objectives are the key to integrating the relations and entities of the KG into the text. Since the number of entities in the KG is too large for the Softmax operation in the entity linking objective, we compute the Softmax over a small set of candidate entities rather than over the full KG.
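The distant-supervision labeling for relation prediction can be sketched as follows. This is an illustrative implementation, not the paper's code: the function name and the triple-list input format are our own, and the example reuses the Lang Lang / Trio of Happiness relation from the introduction.

```python
def distant_relation_labels(span_entities, kg_triples, na_label="NA"):
    """Distant supervision: for every ordered pair of linked entity spans,
    look up the relation their KG entities hold in the triple set; pairs
    without a matching triple receive the null label."""
    triple_index = {(h, t): r for h, r, t in kg_triples}
    labels = {}
    for i, head in enumerate(span_entities):
        for j, tail in enumerate(span_entities):
            if i != j:
                labels[(i, j)] = triple_index.get((head, tail), na_label)
    return labels
```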

Experiments
This section presents the details of KLMo pre-training and its finetuning on two specific knowledge-driven NLP tasks: entity typing and relation classification. We pretrain KLMo on a Chinese corpus of Baidu Baike webpages and the Baike Knowledge Graph. Details of the pre-training corpus and experimental settings are described in Appendix A.1.

Baselines
We compare KLMo with state-of-the-art PLMs pretrained on the same Baidu Baike corpus: (1) BERT-Base Chinese (Devlin et al., 2018), which is further pretrained on the Baidu Baike corpus for one epoch; (2) ERNIE (Zhang et al., 2019) and WKLM (Xiong et al., 2019), pretrained on the same corpus with their original objectives.

Entity Typing Results From the entity typing results, we observe that: (1) All knowledge-enhanced PLMs generally perform much better than the BERT baseline on all measures, which shows that entity knowledge is beneficial to entity type prediction with limited annotated resources.
(2) Compared with the existing knowledge-enhanced PLMs, KLMo largely improves the recall score over WKLM and ERNIE, leading to improvements of 1.58 and 0.57 in micro-F1 respectively. This indicates that fine-grained relationships between entities help KLMo predict appropriate categories for more entities.

Relation Classification
Dataset The CCKS 2019 Task 3 Inter-Personal Relational Extraction (IPRE) dataset (Han et al., 2020) is used for the evaluation on relation classification. The training set is automatically labeled by distant supervision, and the test set is manually annotated. There are 35 relations (including a null-relation class "NA"), where "NA" accounts for nearly 86% of the training set and 97% of the test set. The detailed statistics of the dataset and the finetuning settings are given in Appendix B.2.

Results
We adopt precision, recall and micro-F1 as the evaluation measures. The results are shown in Table 2. In addition to the BERT baseline, we also compare KLMo with an official CNN baseline, which takes the CNN output as the sentence embedding and feeds it into a relation classifier. From Table 2, we can see that neither the CNN nor the BERT baseline performs well, which indicates the high difficulty of the dataset. This is attributable to the large number of noisy labels in the training set, which was automatically generated by distant supervision. Although the dataset is very difficult, we can still observe that: (1) All knowledge-enhanced PLMs largely improve the precision and micro-F1 scores over the BERT baseline, which shows that both entity information and KG information can enhance language representations and accordingly improve the performance of relation classification.
(2) KLMo largely improves the precision score over WKLM and ERNIE, leading to improvements of 2.41 and 1.29 in micro-F1 respectively, which demonstrates that fine-grained relations in the KG help KLMo avoid fitting to noisy labels and predict relations correctly.

Effects of KG Information
Most NLP tasks only provide text inputs, and entity linking itself is a hard task. Thus, we investigate the effects of KG entities and relations on KLMo for entity typing. w/o KG refers to finetuning KLMo without the input of KG entities and relations. Table 3 shows the results of the ablation study. Without KG input for finetuning, KLMo still largely outperforms BERT on both precision and recall, leading to an improvement of 1.74 in micro-F1. Compared with KLMo finetuned with KG, KLMo without KG sees only a small decrease of 0.84 in micro-F1. This demonstrates that KG information has been integrated into KLMo during pre-training. For most specific NLP tasks, KLMo can thus be finetuned in the same way as BERT.

Conclusion
In this paper, we propose a novel KG-enhanced pretrained language model, KLMo, to explicitly integrate KG entities and fine-grained relations into language representation learning. Accordingly, a novel knowledge aggregator is designed to handle the heterogeneous information fusion and text-knowledge alignment problems. Further, the relation prediction and entity linking objectives are jointly optimized to encourage the integration of knowledge information. The experimental results show that KLMo outperforms other state-of-the-art knowledge-enhanced PLMs, which validates the intuition that fine-grained relationships in a KG can enhance language representation learning and benefit knowledge-driven NLP tasks.

A.2 Implementation Details
In the experiment, we first obtain the knowledge representations trained on Baike KG triples by the TransE (Bordes et al., 2013) algorithm, using the OpenKE toolkit (Han et al., 2018). These representations are used to initialize the entity and relation embeddings in KLMo. The embedding dimension is set to 100 and the number of epochs to 5000. As for the pre-training of KLMo, due to the expensive cost of pre-training from scratch, we inherit the parameters of BERT-Base Chinese to initialize the Transformer blocks for token encoding, while the parameters of the entity and relation encoding modules are all randomly initialized. The numbers of text encoder layers L and knowledge aggregator layers M are both 6. The hidden sizes of token embeddings d_t, knowledge embeddings d_z and entity span embeddings d_e are set to 768, 100 and 100 respectively. The numbers of token-oriented attention heads A_t, KG-oriented attention heads A_z and entity span-level attention heads A_e are set to 12, 4 and 12 respectively. The pre-training of KLMo runs for 3 epochs on 4 NVIDIA Tesla V100 (32GB) GPUs with a batch size of 128, a maximum sequence length of 512 and a learning rate of 5e-5.
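The TransE scoring function used to train these embeddings can be stated compactly. The sketch below is a plain numpy illustration of the scoring rule only (training with the OpenKE toolkit additionally involves negative sampling and a margin loss, which are omitted here).

```python
import numpy as np

def transe_score(h, r, t):
    """TransE models a valid triple (h, r, t) as a translation h + r ≈ t in
    embedding space; the score is the negative L2 distance, so more plausible
    triples receive higher scores."""
    return -np.linalg.norm(h + r - t)
```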

B.1 Entity Typing
To evaluate the performance of KLMo, two knowledge-driven tasks, i.e. entity typing and relation classification, are performed in this work.
Given a sentence with an entity mention, the entity typing task is to label the mention with its fine-grained semantic type.
Dataset Entity typing is not a new task. However, to the best of our knowledge, there is no public benchmark dataset available for Chinese fine-grained entity typing. Therefore, in this work, we create a Chinese entity typing dataset, a completely manually annotated dataset containing 23,100 sentences and 28,093 annotated entities distributed over 15 fine-grained categories of media works, such as Movie, Show and TV Play. We split the dataset into a training set with 15,000 sentences and a test set with 8,100 sentences. The detailed statistics of the dataset are shown in Table 4.
Finetuning The Chinese entity typing dataset lacks KG entity annotations, so we first use an entity linker accompanying the Baike Knowledge Base to recognize entity mentions in sentences and link them to their corresponding Baike KG entities. The statistics of the linked entity typing dataset are shown in Table 4. Over 50% of sentences contain at least one linked KG entity mention in both the training and test sets. To finetune KLMo for entity typing, we use the representation of the first token of each entity span to predict its entity type. The model is finetuned for 10 epochs on the training set with a batch size of 128, a maximum sequence length of 256 and a learning rate of 2e-5.
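The entity typing head described above can be sketched as a linear classifier over the first-token representation. This is a hypothetical numpy sketch with our own function names; the 15 output classes mirror the dataset's 15 fine-grained categories.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def type_probs(token_reps, span_start, W, b):
    """Predict a distribution over entity types from the representation of the
    FIRST token of the entity span, using a linear classifier head (W, b)."""
    return softmax(token_reps[span_start] @ W + b)
```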

B.2 Relation Classification
We also compare the results of various pretrained models on the task of relation classification. Given a pair of entities in a sentence, the relation classification task is to determine the relation type between the pair of entities.
Dataset The CCKS 2019 Task 3 Inter-Personal Relational Extraction (IPRE) dataset (Han et al., 2020) is used for the evaluation on relation classification. The training set is automatically labeled by distant supervision, and the test set is manually annotated. There are 35 relations (including a null-relation class "NA"), where "NA" accounts for nearly 86% of the training set and 97% of the test set. The detailed statistics of the dataset are shown in Table 5.
Finetuning The IPRE dataset also lacks KG entity annotations, so we recognize and link entity mentions to their corresponding Baike KG entities in the same way as for the entity typing dataset. The statistics of the linked dataset are shown in Table 5. Over 40% of sentences contain at least one linked KG entity mention. To finetune KLMo for relation classification, we concatenate the representations of the first tokens of the two candidate entity spans. The model is finetuned for 10 epochs with a batch size of 128, a maximum sequence length of 256 and a learning rate of 2e-5.
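The relation classification head can be sketched analogously to the typing head. Again a hypothetical numpy sketch under our own naming; the 35 output classes mirror the IPRE relation inventory (including "NA").

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def relation_probs(token_reps, start1, start2, W, b):
    """Concatenate the first-token representations of the two candidate entity
    spans and feed the pair to a linear relation classifier (W, b)."""
    pair = np.concatenate([token_reps[start1], token_reps[start2]])
    return softmax(pair @ W + b)
```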