Learning Knowledge-Enhanced Contextual Language Representations for Domain Natural Language Understanding

Knowledge-Enhanced Pre-trained Language Models (KEPLMs) improve the performance of various downstream NLP tasks by injecting knowledge facts from large-scale Knowledge Graphs (KGs). However, existing methods for pre-training KEPLMs with relational triples are difficult to adapt to closed domains due to the lack of sufficient domain graph semantics. In this paper, we propose a Knowledge-enhanced lANGuAge Representation learning framework for various clOsed dOmains (KANGAROO) that captures the implicit graph structure among the entities. Specifically, since the entity coverage rates of closed-domain KGs can be relatively low and may exhibit a global sparsity phenomenon for knowledge injection, we consider not only the shallow relational representations of triples but also the hyperbolic embeddings of deep hierarchical entity-class structures for effective knowledge fusion. Moreover, as two closed-domain entities under the same entity-class often have locally dense neighbor subgraphs, as measured by max point biconnected components, we further propose a data augmentation strategy based on contrastive learning over subgraphs to construct hard negative samples of higher quality. This makes the underlying KEPLMs better distinguish the semantics of these neighboring entities to further complement the global semantic sparsity. In the experiments, we evaluate KANGAROO on various knowledge-aware and general NLP tasks in both full and few-shot learning settings, significantly outperforming various KEPLM training paradigms in closed domains.


Introduction
The performance of downstream tasks (Wang et al., 2020) can be further improved by KEPLMs (Zhang et al., 2019; Peters et al., 2019; Liu et al., 2020a; Zhang et al., 2021a, 2022a), which leverage rich knowledge triples from KGs to enhance language representations. In the literature, most knowledge injection approaches for KEPLMs can be roughly categorized into two types: knowledge embedding and joint learning. (1) Knowledge embedding-based approaches aggregate representations of knowledge triples learned by KG embedding models with PLMs' contextual representations (Zhang et al., 2019; Peters et al., 2019; Su et al., 2021; Wu et al., 2023). (2) Joint learning-based methods convert knowledge triples into pre-training sentences without introducing additional parameters for knowledge encoders (Sun et al., 2020; Wang et al., 2021). These works mainly focus on building KEPLMs for the open domain based on large-scale KGs (Vulic et al., 2020; Lai et al., 2021; Zhang et al., 2022c).
Despite their success, these approaches for building open-domain KEPLMs can hardly be migrated directly to closed domains because they lack in-depth modeling of the characteristics of closed-domain KGs (Cheng et al., 2015; Kazemi and Poole, 2018; Vashishth et al., 2020). As shown in Figure 1, the coverage ratio of KG entities w.r.t. plain texts is significantly lower in closed domains than in open domains, showing that there exists a global sparsity phenomenon for domain knowledge injection. This means that directly injecting the few retrieved relevant triples into PLMs may not be sufficient for closed domains. We further notice that, in closed-domain KGs, the ratios of max point biconnected components are much higher, which means that entities under the same entity-class in these KGs are more densely interconnected and exhibit a local density property. Hence, the semantics of these entities are highly similar, making it difficult for the underlying KEPLMs to capture their differences. A few approaches employ continual pre-training over domain-specific corpora (Beltagy et al., 2019; Peng et al., 2019; Lee et al., 2020), or devise pre-training objectives over in-domain KGs to capture the unique domain semantics, which requires rich domain expertise (Liu et al., 2020b; He et al., 2020). Therefore, there is a lack of a simple but effective unified framework for learning KEPLMs for various closed domains.
To overcome the above-mentioned issues, we devise the following two components in a unified framework named KANGAROO, which aggregates the implicit structural characteristics of closed-domain KGs into KEPLM pre-training:
• Hyperbolic Knowledge-aware Aggregator: To address the semantic deficiency caused by the global sparsity phenomenon, we utilize the Poincaré ball model (Nickel and Kiela, 2017) to obtain hyperbolic embeddings of entities based on the entity-class hierarchies in closed-domain KGs, supplementing the semantic information of target entities recognized from the pre-training corpus. It captures not only richer semantic connections among triples but also the implicit graph structure of closed-domain KGs to alleviate the sparsity of global semantics.
• Multi-Level Knowledge-aware Augmenter: To exploit the local density property of closed-domain KGs, we employ the contrastive learning framework (Hadsell et al., 2006; van den Oord et al., 2018) to better capture fine-grained semantic differences of neighbor entities under the same entity-class structure and thus further alleviate global sparsity. Specifically, we focus on constructing high-quality multi-level negative samples of knowledge triples based on the relation paths around target entities in closed-domain KGs. With the proposed approach, the difficulty of classifying the negative samples is largely increased by searching within the max point biconnected components of the KG subgraphs. This enhances the robustness of domain representations and makes the model better distinguish subtle semantic differences.
In the experiments, we compare KANGAROO against various mainstream knowledge injection paradigms for pre-training KEPLMs over two closed domains (i.e., medical and finance).The results show that we gain consistent improvement in both full and few-shot learning settings for various knowledge-intensive and general NLP tasks.

Analysis of Closed-Domain KGs
In this section, we analyze the data distributions of open and closed-domain KGs in detail. Specifically, we employ OpenKG as the data source to construct a medical KG, denoted as MedKG. In addition, a financial KG (denoted as FinKG) is constructed from the structured data source of an authoritative financial company in China. As for the open domain, CN-DBpedia is employed for further data analysis, which is the largest open-source Chinese KG constructed from various Chinese encyclopedia sources such as Wikipedia.
To illustrate the difference between open and closed-domain KGs, we give five basic indicators (Cheng et al., 2015), which are described in detail in Appendix A due to the space limitation. From the statistics in Table 1, we can roughly draw the following two conclusions. Global Sparsity. The small data magnitude and the low coverage ratio lead to the global sparsity problem for closed-domain KGs. Here, data magnitude refers to the numbers of nodes and edges, and the low coverage ratio means that only a small fraction of the text tokens can be matched to KG entities. Local Density. Entities under the same entity-class are densely interconnected; hence these entities share similar semantics, whose differences are difficult for the model to learn. We therefore construct more robust, hard negative samples for deep contrastive learning to learn the fine-grained semantic differences of target entities in closed-domain KGs and further alleviate the global sparsity problem.

KANGAROO Framework
In this section, we introduce the various modules of the model in detail; the notations are described in Appendix B.5 due to the space limitation. The whole model architecture is shown in Figure 3.

Hyperbolic Knowledge-aware Aggregator

In this section, we describe how to learn the hyperbolic entity embeddings and aggregate the positive triples' representations to alleviate the global sparsity phenomenon in closed-domain KGs.

Learning Hyperbolic Entity Embedding
As discussed previously, embedding algorithms in the Euclidean space such as TransE (Bordes et al., 2013) have difficulty modeling complex hierarchical patterns due to the geometry of the embedding space. Following the Poincaré ball model (Nickel and Kiela, 2017), the hyperbolic space has a stronger representational capacity for hierarchical structures in terms of reconstruction effectiveness. To make up for the global semantic deficiency of closed domains, we employ the Poincaré ball model to learn structural and semantic representations simultaneously based on the hierarchical entity-class structure. The distance between two entities (e_i, e_j) is:

d(e_i, e_j) = arcosh(1 + 2 ||H(e_i) - H(e_j)||^2 / ((1 - ||H(e_i)||^2)(1 - ||H(e_j)||^2)))

where H(·) denotes the learned representation in the hyperbolic embedding space and arcosh is the inverse hyperbolic cosine. Let D = {r(e_i, e_j)} be the set of observed hyponymy relations between entities. We then minimize the distance between related entities to obtain the hyperbolic embeddings via the negative log-likelihood loss:

L_hyp = - Σ_{r(e_i, e_j) ∈ D} log( exp(-d(e_i, e_j)) / Σ_{e' ∈ Neg(e_i)} exp(-d(e_i, e')) )

where Neg(e_i) = {e'_j | r(e_i, e'_j) ∉ D} ∪ {e_j} is the set of negative samples for e_i. The entity-class embedding C_{j_e} of a token t is then derived from the learned hyperbolic embedding of the entity class it belongs to.
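As a concrete illustration, the Poincaré distance above can be sketched in a few lines of plain Python (a minimal sketch; function and variable names are our own, not from any released code):

```python
import math

def poincare_distance(u, v):
    """Distance between two points inside the unit Poincare ball:
    d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    """
    sq_norm = lambda x: sum(c * c for c in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)
```

Note how points near the boundary of the ball are exponentially far apart, which is what makes the geometry a good fit for tree-like entity-class hierarchies: the root can sit near the origin while leaf classes spread toward the boundary.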

Domain Knowledge Encoder
This module is designed for encoding input tokens and entities as well as fusing their heterogeneous features.

The entity representation is computed as h^e_j = LN(σ(W_e (C_{j_e} ∥ h^p_j))), where σ is the GELU activation function (Hendrycks and Gimpel, 2016), ∥ denotes concatenation, h^p_j is the positive sample entity representation (see Section 3.2.1), and LN is the LayerNorm function (Ba et al., 2016). To match relevant entities from the domain KGs against the textual token embeddings {h^t_i}^n_{i=1}, we adopt the entities whose number of overlapped words with the text is larger than a threshold. We leverage M-layer aggregators as the knowledge injector to integrate different levels of the learned fusion results. In each aggregator, both embeddings are fed into a multi-head self-attention layer F_m:

{h̃^e_j}, {h̃^t_i} = F_m^{(v)}({h^e_j}, {h^t_i})   (5)

where v denotes the v-th layer. We inject the entity embeddings into the context-aware representations and recapture them from the mixed representation:

ĥ_i = σ(W^t h̃^t_i + W^e h̃^e_j + b)
ĥ'_{e_j} = σ(W'_e ĥ_i + b'_e),   ĥ'_{t_i} = σ(W'_t ĥ_i + b'_t)

where ĥ_i is the mixed fusion embedding, and ĥ'_{e_j} and ĥ'_{t_i} are the regenerated entity and textual embeddings, respectively.
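To make the injection step concrete, the sketch below mixes a token embedding with its aligned entity embedding through a GELU-activated map. This is an illustrative simplification, not the authors' implementation: scalar weights `w_t` and `w_e` stand in for the weight matrices W^t and W^e, and the attention layers are omitted.

```python
import math

def gelu(x):
    # Exact GELU via the Gaussian error function (Hendrycks and Gimpel, 2016).
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def inject(h_t, h_e, w_t=0.5, w_e=0.5, b=0.0):
    """Mix token and entity embeddings element-wise.

    Scalar stand-ins for W^t, W^e: each output coordinate is
    gelu(w_t * token_coord + w_e * entity_coord + b).
    """
    return [gelu(w_t * t + w_e * e + b) for t, e in zip(h_t, h_e)]
```

In the full model, the mixed vector ĥ_i would then be projected back into separate entity and token views with further weight matrices.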

Multi-Level Knowledge-aware Augmenter
This module enables the model to learn more fine-grained semantic gaps of the injected knowledge triples, leveraging the locally dense characteristics to further remedy the global sparsity problem. We focus on constructing positive and negative samples of higher quality with multiple difficulty levels via the point-biconnected component subgraph structure. In this section, we focus on the sample construction process shown in Figure 4; the training task of this module is introduced in Sec. 3.3.

Positive Sample Construction
We extract K neighbor triples of the target entity e_0 as positive samples, which are closest to the target entity in the neighboring candidate subgraph structure. The semantic information contained in these triples is beneficial for enhancing contextual knowledge. To better aggregate the target entity and contextual token representations, the K neighbor triples are concatenated together into a sentence. We obtain the unified semantic representation via a shared Text Encoder (e.g., BERT (Devlin et al., 2019)). Due to the semantic discontinuity between triples sampled from discrete entities and relations, we modify the position embeddings such that tokens of the same triple share the same positional index. For example, in Fig. 4, all tokens of the triple (e_0, r(e_0, e_1), e_1) have positional index 1. To unify the representation space, we take the [CLS] representation (i.e., the first token of the BERT input format) as the positive sample embedding to represent the sample sequence information. We formulate h^p_j ∈ R^{d_1} as the positive embedding of an entity word t_{C_{j_e}}.
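The triple-level positional indexing described above can be sketched as follows. This is a hypothetical helper under simplifying assumptions (each triple element is a single token; the [SEP] separator inherits the index of the triple it closes); the real tokenizer-level details are not spelled out in the paper.

```python
def build_positive_input(triples, sep="[SEP]"):
    """Concatenate K neighbor triples into one sequence where all tokens of
    the j-th triple share positional index j ([CLS] gets index 0).

    `triples` is a list of (head, relation, tail) token strings.
    """
    tokens, positions = ["[CLS]"], [0]
    for j, (head, rel, tail) in enumerate(triples, start=1):
        for tok in (head, rel, tail):
            tokens.append(tok)
            positions.append(j)
        tokens.append(sep)       # assumption: separator shares the triple's index
        positions.append(j)
    return tokens, positions
```

Feeding these `positions` as the encoder's position ids makes triple boundaries explicit while keeping tokens within one triple positionally indistinguishable, matching the discontinuity argument above.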

Point-biconnected Component-based Negative Sample Construction
In closed-domain KGs, nodes are densely connected to their neighboring nodes owing to the local density property, which is conducive to graph searching. Therefore, we search for a large number of nodes that are further away from target entities as negative samples. For the example in Figure 4, we construct a negative sample by the following steps:
• STEP 1: Taking the starting node e_start (i.e., e_0) as the center point and searching outward along the relations, we obtain end nodes e_end with different hop distances Hop(P(G, e_start, e_end)), where Hop(·) denotes the hop distance and P(G, e_i, e_j) denotes the shortest path between e_i and e_j in the graph G. For example, Hop(P(G, e_0, e_10)) = 2 in Path 3 and Hop(P(G, e_0, e_11)) = 3 in Path 6.
• STEP 2: We leverage the hop distance to construct negative samples with different structural difficulty levels, where Hop(·) = 2 for Level-1 and Hop(·) = n + 1 for Level-n samples. We assume that the closer the hop distance, the more difficult it is to distinguish the semantic knowledge of the corresponding triples w.r.t. the starting node.
• STEP 3: Negative samples are constructed in the same pattern as positive samples, i.e., paths with the same distance are merged into sentences. Note that we choose the shortest path (e.g., Path 4) when a node pair is connected by at least two disjoint paths (i.e., a point-biconnected component). For each entity, we build negative samples of k levels.
Owing to the high proportion of point-biconnected components in closed-domain KGs, there are in most cases multiple disjoint paths between starting nodes and end nodes, such as Path 4 and Path 7 in Figure 4. We thus extend the data augmentation strategy to prefer end node pairs with multiple paths and add these paths to the same negative sample, enhancing sample quality with diverse information. The relationships among these node pairs contain richer and less distinguishable semantic information. Besides, our framework preferentially selects nodes in the same entity class as the target entity to enhance the difficulty and quality of the samples. The negative sample embeddings are formulated as h^n_j ∈ R^{d_1}, where l indexes the difficulty level.
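The three steps above amount to a breadth-first search over the KG followed by bucketing end nodes by hop distance. A minimal sketch over an adjacency-dict graph (all names are our own, not from the paper's Algorithm 1):

```python
from collections import deque

def hop_distances(graph, start):
    """BFS hop distance from `start` to every reachable node.

    `graph` is an adjacency dict {node: [neighbor, ...]} over an
    undirected view of the KG.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in graph.get(node, []):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def negatives_by_level(graph, start, max_level):
    """Level-n negatives are nodes at hop distance n + 1 from the target."""
    dist = hop_distances(graph, start)
    return {n: sorted(e for e, d in dist.items() if d == n + 1)
            for n in range(1, max_level + 1)}
```

The multi-path preference and same-entity-class filtering described above would be applied on top of these candidate buckets.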

Training Objectives
In our framework, the training objective mainly consists of two parts, the masked language model loss L_MLM (Devlin et al., 2019) and the contrastive learning loss L_CL, formulated as follows:

L = λ_1 L_MLM + λ_2 L_CL   (8)

where λ_1 and λ_2 are hyperparameters. As for the multi-level knowledge-aware contrastive learning loss, given the anchor representation ĥ', the positive sample embedding ĥ'_{e_j} and the negative sample embeddings h^n_{j,l}, we have:

L_CL = - log( exp(cos(ĥ', ĥ'_{e_j})/τ) / ( exp(cos(ĥ', ĥ'_{e_j})/τ) + Σ_l exp(cos(ĥ', h^n_{j,l})/τ) ) )   (9)

where τ is a temperature hyperparameter and cos is the cosine similarity function.
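Under the reconstruction above, the combined objective can be sketched in plain Python. This is a didactic re-implementation of the loss arithmetic (names are our own), not the training code:

```python
import math

def cos(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE with cosine similarity and temperature tau (Eq. 9)."""
    pos = math.exp(cos(anchor, positive) / tau)
    neg = sum(math.exp(cos(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

def total_loss(l_mlm, l_cl, lam1=1.0, lam2=1.0):
    """Weighted sum of MLM and contrastive losses (Eq. 8)."""
    return lam1 * l_mlm + lam2 * l_cl
```

When the anchor aligns with the positive and repels the negatives, the loss approaches zero; harder (closer) negatives raise the denominator and hence the loss, which is exactly why the multi-level hard negatives matter.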

Experiments
In this section, we conduct extensive experiments to evaluate the effectiveness of the proposed framework. Due to the space limitation, the details of the datasets and model settings are shown in Appendix B and the baselines are described in Appendix C.

Results of Downstream Tasks
Fully-Supervised Learning. We evaluate the model performance on the downstream tasks shown in Table 2. Note that the inputs of the NER task in the financial and medical domains are explicitly related to knowledge entities, while for the remaining tasks the entities are implicitly contained. The fine-tuning models use a structure similar to KANGAROO, simply adding a linear classifier on top of the backbone. From the results, we can observe that: (1) Compared with PLMs trained on open-domain corpora, KEPLMs with domain corpora and KGs achieve better results, especially for NER. This verifies that injecting domain knowledge can greatly improve the results. (2) ERNIE-THU and K-BERT achieve the best results among the baselines, and ERNIE-THU performs better on NER. We conjecture that it benefits from the ingenious knowledge injection paradigm of ERNIE-THU, which makes the model learn rich semantic knowledge in triples. (3) KANGAROO consistently outperforms the strong baselines, especially on the two NER datasets (+0.97%, +0.83%) and TC (+1.17%). This confirms that our model effectively utilizes the closed-domain KGs to enhance structural and semantic information.
Few-Shot Learning. To construct few-shot data, we sample 32 data instances from each training set and employ the same dev and test sets. We also fine-tune all the baseline models and ours using the same approach as previously. From Table 3, we observe that: (1) Model performance decreases sharply compared to the full-data experiments, as it is more difficult to fit the testing samples with limited training data. In general, our model performs best among all the baselines. (2) Although ERNIE-THU obtains the best score on Question Answering, its performance on the other datasets is far below our model. The performance of KANGAROO, ERNIE-THU and K-BERT is better than the others; we attribute this to their direct injection of external knowledge into textual representations.

Ablation Study
We conduct essential ablation studies on four important components with the Financial NER and Medical QM tasks. The simple triplet method simplifies the negative sample construction process by randomly selecting triplets unrelated to target entities. The other three ablation methods respectively detach the entity-class embeddings, the contrastive loss and the masked language model (MLM) loss from the model, and are re-trained in a manner consistent with KANGAROO. As shown in Table 4, we have the following observations: (1) Compared to the simple triplet method, our model achieves a significant improvement. This confirms that the Point-biconnected Component Data Augmenter builds rich negative sample structures and helps the model learn subtle structural semantics to further compensate for the global sparsity problem. (2) The entity-class embeddings and the multi-level contrastive learning pre-training task effectively complement semantic information and make large contributions to the complete model. Nonetheless, without these modules, the model is still comparable to the best baselines ERNIE-THU and K-BERT.

The Influence of Hyperbolic Embeddings
In this section, we comprehensively analyze why hyperbolic embeddings are better than Euclidean embeddings for the closed-domain entity-class hierarchical structure.
Visualization of Embedding Space. To compare the quality of features in Euclidean and hyperbolic spaces, we train KG representations with TransE (Bordes et al., 2013) and the Poincaré ball model (Nickel and Kiela, 2017), visualizing the embedding distributions using t-SNE dimensionality reduction (van der Maaten and Hinton, 2008), as shown in Figure 5. Both reflect embeddings grouped by classes, which are marked by different colors. However, the TransE embeddings are more chaotic: the colors of the points overlap and hardly have clear boundaries. In contrast, the hyperbolic representations reveal a clear hierarchical structure. The root node lies approximately in the center and links to the concept-level nodes such as drug, check and cure. This illustrates that hyperbolic embeddings fit the hierarchical data better and easily capture the boundaries between classes.
Performance Comparison of Different Embeddings. We replace the entity-class embeddings with Euclidean embeddings to verify the improvement brought by the hyperbolic space. To obtain entity-class embeddings in the Euclidean space, we learn embeddings of the closed-domain KGs with TransE (Bordes et al., 2013) and take them as a substitute for the entity-class embeddings. As shown in Table 5, we evaluate the Euclidean model on four downstream tasks, including the NER and TC tasks in the financial domain, together with NER and QM in the medical domain. The results show a clear performance degradation in all tasks with Euclidean entity-class embeddings. Overall, the experimental results confirm that the closed-domain data distribution fits the hyperbolic space better, which helps learn better representations that capture semantic and structural information.

The Influence of Point Biconnected Component-based Data Augmentation
To further confirm that our data augmentation technique for contrastive learning is effective, we analyze the correlation between positive and negative samples w.r.t. target entities. We choose two strategies for positive samples (i.e., dropout (Gao et al., 2021) and word replacement (Wei and Zou, 2019)) as baselines, with negative samples randomly selected from other entities. As shown in Table 6, we calculate the averaged cosine similarity between samples and target entities. For positive samples, the cosine similarity of our model is lower than that of the baselines, illustrating the diversity between positive samples and target entities. For negative samples, we design the multi-level sampling strategy in our model, in which Level-1 is the most difficult, followed by Level-2 and Level-3. The diversity and difficulty of the negative samples help to improve the quality of data augmentation. We also visualize the alignment and uniformity metrics (Wang and Isola, 2020) of the models during training, using the cosine distance to calculate the similarity between representations. Lower alignment indicates more similar features of positive pairs, and lower uniformity indicates that the representation preserves more information diversity. As shown in Figure 6, our model steadily improves uniformity and alignment to the best point.
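For reference, alignment and uniformity (Wang and Isola, 2020) over ℓ2-normalized features can be computed as below. This is a generic sketch of the two metrics, not the paper's evaluation script:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def alignment(pairs, alpha=2):
    """Mean ||f(x) - f(y)||^alpha over positive pairs (lower = tighter pairs)."""
    total = 0.0
    for x, y in pairs:
        x, y = normalize(x), normalize(y)
        d2 = sum((a - b) ** 2 for a, b in zip(x, y))
        total += d2 ** (alpha / 2)
    return total / len(pairs)

def uniformity(feats, t=2):
    """Log of the mean pairwise Gaussian potential e^{-t ||f(x)-f(y)||^2}
    (lower = features spread more evenly on the hypersphere)."""
    feats = [normalize(f) for f in feats]
    vals = []
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            d2 = sum((a - b) ** 2 for a, b in zip(feats[i], feats[j]))
            vals.append(math.exp(-t * d2))
    return math.log(sum(vals) / len(vals))
```

A well-trained contrastive encoder drives both quantities down: positive pairs collapse together (alignment) while the overall feature cloud spreads out (uniformity).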
Related Work

Closed-domain KEPLMs. Due to the lack of in-domain data and the unique distributions of domain-specific KGs (Cheng et al., 2015; Savnik et al., 2021), previous works on closed-domain KEPLMs focus on three domain-specific pre-training paradigms.
(1) Pre-training from Scratch. For example, PubMedBERT (Gu et al., 2022) derives the domain vocabulary and conducts pre-training using solely in-domain texts, alleviating the problem of out-of-vocabulary words and perplexing domain terms.
(2) Continue Pre-training.These works (Beltagy et al., 2019;Lee et al., 2020) have shown that using in-domain texts can provide additional gains over plain PLMs.
(3) Mixed-Domain Pre-training (Liu et al., 2020b; Zhang et al., 2021b). In this approach, out-of-domain texts are still considered helpful: these methods typically initialize domain-specific pre-training with a general-domain language model and inherit its vocabulary. Although these works inject knowledge triples into PLMs, they pay little attention to the in-depth characteristics of closed-domain KGs (Cheng et al., 2015; Kazemi and Poole, 2018; Vashishth et al., 2020), which is the major focus of our work.

Conclusion
In this paper, we propose a unified closed-domain framework named KANGAROO to learn knowledge-aware representations via the implicit structure of KGs. We utilize a hyperbolic knowledge-aware aggregator to supplement the semantic information of target entities and tackle the semantic deficiency caused by global sparsity. Additionally, we construct high-quality negative samples of knowledge triples by data augmentation over locally dense graph connections to better capture the subtle differences among similar triples.

Limitations
KANGAROO only captures the global sparsity structure in closed-domain KGs with two knowledge graph embedding methods, namely Euclidean (e.g., TransE (Bordes et al., 2013)) and hyperbolic embeddings. Besides, our model explores two representative closed domains (i.e., medical and financial), and hence we might omit other niche domains with unique data distributions.

Ethical Considerations
Our contribution in this work is fully methodological, namely a new pre-training framework for closed-domain KEPLMs that improves the performance of downstream tasks. Hence, there are no explicit negative social influences in this work. However, Transformer-based models may have some negative impacts, such as gender and social bias, and our work would unavoidably suffer from these issues. We suggest that users carefully address potential risks when the KANGAROO models are deployed online.

A Indicators of Closed-domain KGs
We use the following five indicators to analyze the closed-domain KGs.
• #Nodes and #Edges are the numbers of nodes and edges in the corresponding KG.
• Coverage Ratio is the entity coverage rate of the KG in its corresponding text corpus. We calculate it as the percentage of entity tokens matched in the KG over the number of full-text tokens, formulated as CR = t_e / t_t, where t_e is the number of entity tokens and t_t is the number of all textual tokens. The texts for the closed domains are the same as the pre-training corpora described in the experiments; the corpus for the open domain is taken from the CLUE benchmark.
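A token-level sketch of the coverage ratio CR = t_e / t_t, assuming a pre-tokenized corpus and a set of entity tokens (the matching procedure in the paper may differ in detail):

```python
def coverage_ratio(tokens, kg_entity_tokens):
    """CR = t_e / t_t: fraction of corpus tokens that match a KG entity token."""
    entity_tokens = sum(1 for tok in tokens if tok in kg_entity_tokens)
    return entity_tokens / len(tokens)
```

For instance, a three-token text where two tokens belong to KG entities yields CR = 2/3; a closed-domain corpus with few matchable entities drives CR down, which is the global sparsity indicator discussed in Section 2.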

D.2 K-hop Thresholds' Results
In Table 9, we consider three different situations to discuss the selection of the k-hop thresholds: (1) the k-hop triple path as positive and the k+1, k+2, k+3 hop triple paths as negative; (2) the k, k+1, k+2 hop triple paths as negative and the k+3 hop triple path as positive; (3) the k-hop triple path viewed as both positive and negative samples and k+1, k+2 as negative samples. Specifically, we select the original closed-domain pre-training and KG data to perform full-data fine-tuning tasks. For the above three situations, in order to prevent conflicts of overlapping triple paths between positives and negatives during sampling, we mask the sampled triple path data in the iterative sampling process. "S1" means "Situation 1". From the results, we observe that: (1) The closer the positive sample hop path is to the target entity, the better the model performance (see the S1 results). (2) When the negative samples are sampled closer to the target entity, or even closer than the positive samples, the performance of the model decreases sharply (see the S2 and S3 results). Hence, we should construct positive samples close to the target entity, while negative samples should not be too far away in the graph path, to avoid introducing too much knowledge noise.

Figure 2 :
Figure 2: Illustration of the hierarchical structure of entities and classes in closed-domain KGs (e.g., MedKG).

Figure 3 :
Figure 3: Model overview of KANGAROO. The Hyperbolic Entity Class Embedding module mainly leverages the hierarchical entity-class structure to provide more sufficient semantic knowledge. Positive and Negative Triple Construction obtains negative samples of higher quality at multiple difficulty levels. (Best viewed in color.)

Figure 4 :
Figure 4: Examples of positive and negative sample construction. We add the [SEP] token between paths to differentiate triplets. Note that the subscripts are positional embedding indexes.
Here h^n_j ∈ R^{d_1} and l denotes the level of the negative samples. The specific algorithm for the negative sample construction process is shown in Appendix Algorithm 1.
Here ĥ' is the textual embedding of a token in an entity word. We take the standard InfoNCE (van den Oord et al., 2018) as our loss function L_CL.

Figure 6 :
Figure 6: Results comparison of ours and other data augmentation methods of alignment and uniformity.

Table 1 :
The statistics of open and closed-domain KGs.
The weight matrices above are the parameters to be trained. Entity Knowledge Injector. It aims to fuse the heterogeneous features of the entity embeddings {h^e_j}^m_{j=1} and the textual token embeddings.

Table 2 :
The performance of fully-supervised learning in terms of F1 (%). ∆ and ‡ indicate that we pre-train our model on CN-DBpedia (i.e., the open-domain KG) and the corresponding closed-domain KG, respectively. The knowledge-aware PLM baselines are pre-trained on the closed-domain KGs. The best performance is shown in bold.

Table 3 :
The overall results of few-shot learning in terms of F1 (%).

Table 4 :
The performance of models for ablation study in terms of F1 (%).

Table 5 :
The model results when Euclidean and hyperbolic embeddings are employed in terms of F1 (%).

Table 7 :
The number of samples of the datasets in the financial and medical domains, respectively.

Table 8 :
Results on the music and social domains in terms of Acc (%).

The triplet set is S_t = {(e_i, r(e_i, e_j), e_j) | e_i, e_j ∈ E, r(e_i, e_j) ∈ R}, where e_i is the head entity with relation r(e_i, e_j) to the tail entity e_j. GreaseLM (Zhang et al., 2022b) fuses encoded representations from PLMs and graph neural networks over multiple layers of modality interaction operations. KALM (Feng et al., 2022) jointly leverages knowledge in local, document-level, and global contexts for long document understanding. In Table 8, we supplement data from two further domains, music and social, to demonstrate the effectiveness of the proposed KANGAROO model. We conduct experiments with the largest open-source Chinese knowledge graph database (i.e., OpenKG); both the music and social data are crawled from this database and Baidu Baike. We evaluate the baselines and our KANGAROO model in full-data fine-tuning settings. The downstream datasets Music D1, Music D2 and Social D1 are text classification tasks, where "D1" means Dataset 1. Specifically, Music D1 and D2 are music emotion text classification tasks built from song comment data downloaded from a music app, and Social D1 classifies the relationships between public figures in, e.g., history and entertainment. The evaluation metric is accuracy (Acc).

Table 9 :
Results w.r.t. k-hop thresholds. Fin and Med mean Financial and Medical, respectively.