KEPL: Knowledge Enhanced Prompt Learning for Chinese Hypernym-Hyponym Extraction



Introduction
Hypernym Discovery is a core task in taxonomy construction (Wang et al., 2019). A hypernym relation is a semantic relation between a term (hyponym) and a more general or abstract term (hypernym). Due to its capacity for representing semantic relations, hypernymy has become an essential concept in modern natural language research and a fundamental component of many natural language processing (NLP) tasks, such as question answering (Yang et al., 2017; Yu et al., 2021), taxonomy construction (Chen et al., 2019; Luo et al., 2020), and personalized recommendation (Huang et al., 2019).
Typical efforts in Hypernym Discovery can be roughly classified into two main types: rule-based methods and detection-based methods. Rule-based methods (Auger and Barrière, 2008; Kliegr et al., 2008; Seitner et al., 2016; Snow et al., 2004; Wang et al., 2017) rely on predefined linguistic rules or patterns to extract hyponym-hypernym relations. These methods capture specific syntactic or semantic structures that indicate a hierarchical association between terms; however, they lack sufficient capability to discover hyponym-hypernym relations embedded in the semantic content of sentences. For example, in the sentence 'Jay Chou, an acclaimed singer, mesmerizes audiences with his soulful performances.', a rule-based approach might fail to recognize the hyponym-hypernym relation between 'Jay Chou' and 'singer'.
Another effective way is to model this problem as a hypernym detection task (Dash et al., 2020a; Roller et al., 2018; Yamane et al., 2016). These methods handle hypernym detection in a pipelined manner, i.e., extracting the entities first and then recognizing hypernym relations. This separate framework makes each component more flexible, but it neglects the relevance between the two sub-entities.
To identify hyponym-hypernym relations by modeling the interaction of sparse attributes, many studies have been dedicated to constructing English hypernym-hyponym relationship datasets (Berend et al., 2018; Bernier-Colborne and Barrière, 2018). Meanwhile, the Chinese language has unique linguistic characteristics and categories that need to be considered (Zhang et al., 2022b). The lack of such datasets has hampered the progress of Chinese hyponymy extraction research: even though Chinese speakers account for a quarter of the world's population, there has been no existing high-quality dataset for Chinese hypernym relation extraction.
In this paper, we propose the Knowledge Enhanced Prompt Learning (KEPL) framework. Specifically, our method employs a Dynamic Adaptor for Knowledge, which adaptively constructs a unified representation for both the structured prior knowledge and the unstructured text context. To facilitate a more coherent integration of the structured prompts and unstructured text, we employ a mechanism that learns a unified representation of context through specific attention.
For span selection, we use the focal loss to counteract the issue of sample imbalance, a commonly observed phenomenon in extractive tasks.
The lack of specific hypernym relation datasets also leads to deficiencies in current works (Chen et al., 2019; Luo et al., 2020). To address this challenge, we propose the CHR dataset, which aims to improve the coverage and accuracy of hypernym relations in taxonomies. We believe that our dataset can contribute to the development of more accurate and comprehensive taxonomies.
The main contributions of our work are summarized as follows: 1. To the best of our knowledge, there is currently no commonly used dataset for Chinese hypernym-hyponym discovery. We construct a Chinese hypernym relation extraction dataset covering three typical scenarios: Baike, news, and We-media. With multiple data sources, the proposed dataset can well cover the specific expressions found in various corpora.
2. We propose a novel framework, Knowledge Enhanced Prompt Learning (KEPL), which leverages prior knowledge as prompts and transfers that prior knowledge into an extraction task. Our framework learns a unified representation of context through specific attention, proving effective for various natural language processing tasks, including but not limited to taxonomy construction and semantic search.
3. Our extensive experiments on the proposed dataset show that the KEPL framework achieves a 2.3% improvement in F1 over the best baseline, demonstrating the effectiveness of our approach. Further, we report the results of individually removing components from the trained KEPL model on the CHR dataset, proving the effectiveness of each component.

Related Work
Hypernym Detection Research into hypernym relation extraction has mainly used unsupervised methods, falling into two categories: pattern-based and distributional approaches.
The pattern-based approach (Navigli and Velardi, 2010; Boella and Di Caro, 2013; Vyas and Carpuat, 2017; Bott et al., 2021), established by (Hearst, 1992; Wang and He, 2020), employs specific predefined linguistic patterns, such as 'is-a' and 'including', to detect hypernym relations. While this approach is simple and widely applicable, it is constrained by its reliance on predefined patterns, its sensitivity to sentence structure, and the necessity for manual resource curation. Distributional approaches like (Fu et al., 2014) use a distant supervision method for extracting hypernyms from various sources; their models produce a list of hypernyms for a given entity. Subsequently, (Sanchez and Riedel, 2017) highlight the unique performance of the Baroni dataset in providing consistent results, attributing its effectiveness to its alignment with specific dimensions of hypernymy: generality and similarity. Additionally, hybrid methods (Bernier-Colborne and Barrière, 2018; Dash et al., 2020b; Yu et al., 2020) that amalgamate different techniques have been explored. For instance, (Held and Habash, 2019) propose a method that merges hyperbolic embeddings with Hearst-like patterns, resulting in better performance on various benchmark datasets.
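As an illustration of the pattern-based approach, the following is a minimal regex sketch that extracts (hyponym, hypernym) pairs from two common English constructions; the patterns and their coverage are our own simplifications, not the rules used by the cited systems.

```python
import re

# Two illustrative Hearst-style patterns (our own simplifications):
# each regex captures a (hyponym, hypernym) pair.
HEARST_PATTERNS = [
    # "X, a/an ... Y," apposition, e.g. "Jay Chou, an acclaimed singer,"
    re.compile(r"(?P<hypo>[A-Z][\w ]*?), an? (?:\w+ )*(?P<hyper>\w+),"),
    # "Y such as X", e.g. "singers such as Jay Chou"
    re.compile(r"(?P<hyper>\w+) such as (?P<hypo>[A-Z][\w ]*)"),
]

def hearst_extract(sentence: str):
    """Return (hyponym, hypernym) pairs matched by any pattern."""
    pairs = []
    for regex in HEARST_PATTERNS:
        for m in regex.finditer(sentence):
            pairs.append((m.group("hypo").strip(), m.group("hyper").strip()))
    return pairs
```

Such hand-written patterns illustrate the approach's simplicity, and also its brittleness: a sentence that expresses the same relation without a matching surface form yields nothing.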
Prompt learning Prompt-based learning, a novel paradigm for pretrained language models (Zhang et al., 2022a), restructures downstream tasks to better align with pre-training tasks, enhancing the model's performance. A notable application of this approach is demonstrated by (Schick and Schütze, 2021), where classification problems are transformed into cloze tasks. This is achieved by creating relevant prompts with blanks and establishing a mapping from specific filled words to the corresponding predicted categories, effectively bridging the gap between the task and the model's training. Furthermore, (Ma et al., 2022) introduce a model named PAIE, which leverages prompts for Event Argument Extraction (EAE) at both sentence and document levels. This use of prompts in EAE tasks showcases the versatility and efficiency of the prompt-based learning approach.

Data
We introduce the CHR (Chinese Hypernym Recognition) Dataset, a resource that specifically addresses the current shortcomings in Chinese hypernym discovery. The key idea behind constructing the CHR dataset is to enhance quality and diversity across various domains, both of which are insufficient in existing resources.

Data source
To address the existing limitations in Chinese hypernym discovery, particularly the lack of diversity, we constructed the CHR dataset.
Our dataset is constructed by incorporating data from three distinct sources: encyclopedic knowledge, We-Media public accounts and news.
We gather the Baike data from the Baidu Baike online encyclopedia. However, we strategically omitted entries that were too short, lacked contextual richness, or had content outside the scope of our study, such as stub entries and disambiguation pages. The We-Media data was gathered from a wide range of accounts covering lifestyle, entertainment, technology, and education. Our scraper was programmed to regularly check these accounts for the latest articles and to retrieve historical articles where possible. Advertisements and unrelated links were excluded from our dataset; we focused only on preserving the main content of the articles.
Our dataset includes news data from multiple high-profile news platforms such as Xinhua News Agency, Tencent News, and Sina News.To eliminate domain bias, we selected articles from a broad spectrum of categories including politics, economics, sports, culture, and science, and we eliminated articles with insufficient text, duplicate articles, and those not pertaining to our research from the dataset.

Data construction
Pre-processing After gathering data from the various sources, we remove irrelevant artifacts such as HTML tags, special characters, and formatting markup. We also performed tokenization, converting sentences into tokens for further processing.
Sentences with a character count below the established threshold of 10 characters were disregarded. Those exhibiting an overuse of punctuation marks such as commas, colons, and periods were excised. Additionally, sentences containing numerals or particular non-Chinese characters were filtered out.
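The filtering rules above can be sketched as a single predicate. The 10-character minimum comes from the paper; the punctuation cap and the treatment of ASCII letters as the filtered "non-Chinese characters" are illustrative assumptions, since the exact thresholds are not stated.

```python
import re

MIN_CHARS = 10   # minimum character count stated in the paper
MAX_PUNCT = 5    # illustrative cap on commas/colons/periods (not stated)

def keep_sentence(sent: str) -> bool:
    """Apply the filters described above; True means the sentence survives."""
    if len(sent) < MIN_CHARS:
        return False                       # too short
    if sum(sent.count(p) for p in "，：。,:.") > MAX_PUNCT:
        return False                       # punctuation overuse
    if re.search(r"\d", sent):
        return False                       # contains numerals
    if re.search(r"[A-Za-z]", sent):
        return False                       # particular non-Chinese characters
    return True
```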
The sentences that pass through these filters are then manually labeled with the appropriate hypernym-hyponym relation by our team. We mark each entity's position by determining the start and end positions of the answer in the text, aiming for accurate identification of the relation scope. If a sentence does not contain explicit hypernym-hyponym entities, the corresponding positions are marked as NONE.
Following the above extraction, the potential pairs were presented to a team of trained individuals for review.
Final Format We streamline and structure our data into a uniform and easily accessible format, facilitating subsequent analysis and modeling.
The data consists of triples (data, span1, span2), where span1 represents the latent position of the hypernym and span2 represents the latent position of the hyponym. An excerpt of one example: 链接检验器(link checker)是测试并报告网站的页面内的超文本链接有效性的程序.
In this example, 程序 (program) is the hypernym of 链接检验器 (link checker). The format of the output is set as <程序, 链接检验器>. We conducted a thorough examination of the data and implemented a meticulous annotation process to ensure the quality and reliability of the dataset; details can be seen in Table 1.
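A minimal sketch of one record in this (data, span1, span2) format, with character-offset spans computed from the example above; the field names and the end-exclusive offset convention are our assumptions.

```python
# One CHR-style record as a (data, span1, span2) triple; the field names
# and the character-offset convention (end exclusive) are assumptions.
sentence = "链接检验器(link checker)是测试并报告网站的页面内的超文本链接有效性的程序"

def char_span(text: str, term: str):
    """Locate a term's (start, end) character offsets within the text."""
    start = text.index(term)
    return (start, start + len(term))

record = {
    "data": sentence,
    "span1": char_span(sentence, "程序"),       # hypernym span
    "span2": char_span(sentence, "链接检验器"),  # hyponym span
}
```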

Dynamic Adaptor for Knowledge
In the proposed KEPL model, our aim is to construct a unified representation for both the structured prior knowledge and the unstructured text context. To achieve this, we use the Hearst method, which identifies hypernym ("is-a") relations within a sentence by exploiting certain lexico-syntactic patterns. By introducing this prior knowledge in the form of lexico-syntactic patterns, we aim to increase the model's capability to precisely identify hypernym and hyponym relations. Firstly, we represent each piece of structured prior knowledge (a lexico-syntactic pattern) in the form of a prompt and obtain the corresponding prompt representations E_p as follows:

E_p = L_dec(pm_i)

Here, pm_i represents a set of structured prompts with Hearst patterns, and E_p denotes the corresponding embeddings obtained from the decoder module L_dec of the pre-trained language model.
Given the prompt representations E_p and a prompt set pm_i, for each input s_i we generate a scoring matrix w ∈ R^{L×L} to select a suitable template combination under the specific semantics. This process can be expressed as:

w = softmax(s_i W^T),  E_p' = w · E_p

where E_p is the set of available prompt embeddings, E_p' ∈ R^{M×L×H} is a weighted representation that blends the adapted prompt embeddings, incorporating the semantic information of both the input sentence and the entire adjusted prompt set, and W ∈ R^{L×H} is a learned weight matrix used to adapt the selected prompts to match the semantics of the input sentence s_i.
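A dimensionally consistent numpy sketch of the adaptor step above; the random tensors stand in for learned parameters, and the exact form of the scoring is an assumption reconstructed from the stated shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, M = 6, 8, 3   # sequence length, hidden size, number of Hearst prompts

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

s_i = rng.standard_normal((L, H))     # encoded input sentence
E_p = rng.standard_normal((M, L, H))  # embeddings of the M prompts
W = rng.standard_normal((L, H))       # learned adaptation matrix (random here)

# Scoring matrix over token positions, then a blend of each prompt's
# embedding that adapts it to the input semantics.
w = softmax(s_i @ W.T)                          # (L, L)
E_p_adapted = np.einsum("lj,mjh->mlh", w, E_p)  # (M, L, H)
```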
The Hearst method (Hearst, 1992) is an efficient method for identifying hypernym (is-a) relations in a given sentence by exploiting certain lexico-syntactic patterns. To attenuate the impact of any single specific prompt, we use Hearst patterns as the representation of knowledge.
We give examples of these prompts in Table 2.

Unified representation for templates and context
To facilitate a more coherent integration of the structured prompts and unstructured text, we employ an attention mechanism and learn a unified representation of context. We first encode the input sentence:

L(s_i) ∈ R^{L×H}

where L(·) here refers to the pre-trained language model and H is the hidden dimension of the context representation. This representation is leveraged to effectively integrate context and prompts.
To capture useful information and uncertainty from each view (knowledge and context), we learn a unified representation through specific attention. Specifically, as shown in Figure 2, we use sem_q, sem_v, and Tem_p' to compute an attention output x_att ∈ R^{L×H}. This process can be expressed as follows:

x_att = softmax(Tem_p' · sem_q^T / √H) · sem_v

where sem_q and sem_v are linear projections of L(s_i), and Tem_p' ∈ R^{L×H} is E_p' passed through a linear layer. This step facilitates enhanced interplay between templates and input information, promoting more effective integration of structured prompts and unstructured text.
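The knowledge attention can be sketched as scaled dot-product attention between the projected context and the adapted prompt; the precise query/key/value assignment is an assumption, since the original equation is not fully specified.

```python
import numpy as np

rng = np.random.default_rng(1)
L, H = 6, 8   # sequence length and hidden size

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

ctx = rng.standard_normal((L, H))   # L(s_i): encoded context
tem = rng.standard_normal((L, H))   # Tem_p': adapted prompt after a linear layer
Wq = rng.standard_normal((H, H))    # projection producing sem_q
Wv = rng.standard_normal((H, H))    # projection producing sem_v

sem_q, sem_v = ctx @ Wq, ctx @ Wv
# Prompt tokens attend over the context to produce the unified
# representation x_att (scaled dot-product attention).
attn = softmax(tem @ sem_q.T / np.sqrt(H))  # (L, L) attention weights
x_att = attn @ sem_v                        # (L, H)
```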

Knowledge-guided extraction
In the process of knowledge-guided extraction, we aim to calculate the distribution of each token being the start or end of a given role based on the unified context-prompt representation. The logits reflect the unnormalized log probabilities of each token being the starting or ending position of a target hypernym or hyponym. They are calculated via linear projection from x_att as follows:

logits_i^{start|end} = x_att'^{start|end} · E_p'^T

Here, x_att'^{start|end} ∈ R^{1×H} are linear projections of x_att computed separately for the start and end positions of a target feature (either hypernym or hyponym), E_p'^T ∈ R^{H×L}, and logits_i ∈ R^L denotes the contextual token distribution. Note that different spans Span_{p_i} result in different corresponding x_att.
The logits are the scores assigned to each token in the input sentence and they measure the likelihood of each token being the start or end token of the span of interest.Once these logits are transformed into probabilities through a softmax operation, these values are utilized to determine the start and end positions for hypernym and hyponym spans in the text.
We observed sample imbalance in the extractive task; hence we utilize the focal loss:

FL(p_i) = -(1 - p_i)^γ log(p_i)

where p_i represents the predicted probability of the correct position. To avoid exhaustive threshold tuning, we use the softmax function to compute these probabilities.
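A minimal sketch of the focal loss over softmax probabilities; γ = 2 is the common default in the focal-loss literature, as the paper does not report its value.

```python
import math

def focal_loss(probs, targets, gamma=2.0):
    """Mean focal loss -(1 - p_t)^gamma * log(p_t) over gold positions.

    probs[i] is a softmax distribution for token i; targets[i] is the
    gold index. gamma=2 is a common default (the paper's value is not
    reported); gamma=0 recovers plain cross-entropy."""
    total = 0.0
    for p_row, t in zip(probs, targets):
        p_t = p_row[t]
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(targets)
```

The (1 - p_t)^γ factor down-weights well-classified positions, so the abundant "not a span boundary" tokens contribute less to the gradient than the rare boundary tokens.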
The score for the starting and ending positions of a target span s_k is computed as:

score_k(i, j) = logits_i^{start} + logits_j^{end},  i ≤ j

The pair (i, j) that maximizes score_k(i, j) gives us the span position of the hypernym or hyponym in the sentence.
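The span decoding step can be sketched as a brute-force search over (i, j) pairs; the optional max_len cap on span length is our addition, not part of the paper's description.

```python
def select_span(start_logits, end_logits, max_len=None):
    """Return the (i, j) pair, i <= j, maximizing start_logits[i] + end_logits[j].

    max_len (optional, our addition) bounds the span length."""
    n = len(start_logits)
    best_score, best_pair = float("-inf"), (0, 0)
    for i in range(n):
        j_stop = n if max_len is None else min(n, i + max_len)
        for j in range(i, j_stop):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best_pair = score, (i, j)
    return best_pair

start = [0.1, 2.0, -1.0, 0.3]   # toy start logits
end = [0.0, -0.5, 3.0, 0.2]     # toy end logits
```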

Experiment
In this section, we evaluate the efficacy of the proposed KEPL model. The evaluation experiments are presented in Section 5.1 and the results are discussed in Section 5.2. Furthermore, we assess the effect of varying the number of knowledge prompts in Section 5.3. An ablation study can be found in Section 5.4.

Evaluation and Settings
Dataset Our KEPL framework is evaluated using the CHR Dataset, compiled from three diverse sources: We-Media, Baidu Encyclopedia, and various news outlets. This dataset provides a practical perspective on the applicability of our KEPL framework, as it encompasses a wide range of language styles and topics.
Evaluation We conduct our experiments on the CHR dataset and employ precision, recall, and F1 as our primary evaluation indicators.
Baselines We perform a comparison with the current state-of-the-art and also with some classical models to show the efficiency and effectiveness of the proposed KEPL model.

Implementation Details
The CHR Dataset was employed in our experiments, divided into training, validation, and test sets in a 0.7:0.15:0.15 ratio. The model was trained with a batch size of 32 and a learning rate of 0.0001 using the Adam optimizer. The models were trained for five epochs, with early stopping implemented to prevent overfitting.
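The split and training hyperparameters above can be captured in a short sketch; the shuffling seed and any Adam settings beyond the learning rate are assumptions.

```python
import random

def split_dataset(examples, seed=42, ratios=(0.7, 0.15, 0.15)):
    """Shuffle and split into train/validation/test with the paper's ratios."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

# Hyperparameters reported in the paper; the seed above and any Adam
# settings beyond the learning rate are assumptions.
CONFIG = {"optimizer": "Adam", "lr": 1e-4, "batch_size": 32, "epochs": 5}
```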

Experiment Results
We report the results of different methods in Table 3; a more direct visual presentation of our experimental results appears in Table 4. Our method outperforms all other methods in F1 score and achieves a 2.3% improvement in F1 over the best baseline, UIE (Lu et al., 2022), which shows the effectiveness of our proposed method. Furthermore, from Table 1 we can also see that, unlike other methods that predominantly rely on localized features or word-level relationships, KEPL models effectively account for the broader semantic context. This enhances the accuracy of inferring hypernym relationships, even in the absence of explicit markers for hierarchical relationships.

Results on CHR Dataset
The KEPL-Bart and KEPL-Ernie3 models achieved higher scores than other models such as W2NER, Ernie3, and UIE. This demonstrates the efficacy of the KEPL approach, particularly the effectiveness of incorporating Hearst-like patterns as prompts and embedding patterns and text simultaneously.
The performance enhancements of our KEPL models are largely attributed to the implementation of the Dynamic Adaptor for Knowledge and the knowledge attention modules, which effectively mediate the interaction between structured prompts and unstructured text, contributing to superior hypernym-hyponym extraction. The evaluation results of different methods are shown in Table 5, from which we have several observations: 1. Unstructured text lacks a predefined structure or explicit markers for hierarchical relationships, making it difficult to discern the underlying organization. Hierarchies are often implied through contextual cues, such as sentence structure, semantic relationships, or proximity of concepts, requiring sophisticated natural language processing techniques to identify and extract these relationships accurately.
2. The other methods are unable to recognize that <Hypernym, Hyponym> entities appear in pairs; they predominantly rely on localized features or word-level relationships, thereby disregarding the encompassing semantic context required for accurate inference of hypernym relations.

Results on specific scene
We compare KEPL's performance with the baseline models on each scene in Table 3. Observing the results on the Baidu, We-Media, and News datasets, the performance of the KEPL models varies; compared to other models, KEPL shows higher scores in almost all metrics across the different datasets. This can be attributed to the Hearst pattern selector, which adapts to different sentence semantics, and to the unified representation for templates and context, which effectively captures useful information from both the pattern and the text.
The results highlight the effectiveness of the KEPL approach in handling different text complexities and styles, thereby exhibiting promising potential for real-world hypernym-hyponym extraction tasks.Nevertheless, improvements could be made to better handle datasets with informal and context-specific language like the We-Media data.

Effect on Knowledge Number
In this section, we evaluate the effect of varying the number of knowledge prompts on the performance of our KEPL model in Table 6. We vary the number of prompts from 50 to 300 in increments of 50 and report the Recall (R), Precision (P), and F1 score in each case. As seen in Table 6, there is a clear improvement in the KEPL model's performance as the number of prompts increases. This confirms our hypothesis that increasing the volume of knowledge prompts can enhance the model's effectiveness.

Ablation Study
In this section, we provide further insights through qualitative analysis and error analysis to address the remaining challenges. In Table 7, we display the results of individually removing components from the trained KEPL model on the CHR dataset.
In the prompt w/o experiment, we use a random matrix to replace the Dynamic Adaptor, which leads to a drop of 6.73% in F1 score, showing the importance of using the prompt as the connection between hypernym and hyponym. The prompt instructs the model to consider the hierarchical structure between concepts and to accurately identify hyponyms and hypernyms. This approach provides a structured framework for the model to leverage contextual cues, linguistic patterns, and semantic associations related to hyponymy and hypernymy, enabling it to capture and utilize the rich hierarchical information present in the text.
In the attn w/o experiment, we deliberately trained KEPL (Knowledge-Enhanced Prompt Learning) without knowledge-attention.This design choice resulted in a reduction in performance, primarily due to the model's compromised ability to effectively integrate context and prompts.Attention mechanisms play a crucial role in enabling the model to attend to relevant parts of the input and to appropriately align them with the given prompts.Therefore, the absence of attention mechanisms in our experiment negatively impacted the model's capacity to fully exploit contextual information and prompts, ultimately affecting its ability to accurately represent and extract hyponym-hypernym relationships from unstructured text.
In the Knowledge-guided extraction w/o experiment, we compute the KEPL logits with a plain linear MLP instead. We found that performance decreases by 4.97% in F1 score. This highlights the importance of incorporating prompts before directly merging them with the context, particularly in tasks related to acquiring hypernyms and hyponyms. The prompt serves as a guidance signal, providing explicit instructions to the model and aiding it in understanding the desired relationship or task. By incorporating prompts, the model gains a clearer direction and context-specific cues, which are crucial for accurately capturing and representing hierarchical relationships.

Conclusion
In this paper, we introduce Knowledge Enhanced Prompt Learning (KEPL) for extracting hypernym-hyponym relations in the Chinese language. KEPL utilizes the concept of prompt learning to incorporate prior knowledge in the form of patterns into the model, which simultaneously embeds both the pattern and the text.
The prompts in the framework use Hearst-like patterns, designed specifically for extracting hypernym-hyponym relations. Additionally, we have created a Chinese hypernym-hyponym relation extraction dataset, which includes three different types of scenarios: Baike, news articles, and We-media. The results of our experiments on this dataset show that our proposed model is both efficient and effective.

Figure 2 :
Figure 2: The overall framework of KEPL

Table 1 :
The details of the proposed CHR dataset. Each document is represented as a set doc = {d_1, ..., d_N} and is mapped to a set of spans S = {S_1, S_2, ..., S_k} using a specific prompt Pm_i. Here, each span consists of a pair (H_u, H_d), where H_d denotes a hyponym and H_u represents a hypernym.

Table 3 :
Different Performance of KEPL on Different Data Sources

Table 4 :
Different model results for Chinese sentence

Table 5 :
Metrics on the test set for the CHR

Table 6 :
Effect on Knowledge number for KEPL

Table 7 :
Ablation study with A for Knowledge attention, D for Dynamic Adaptor, E for Knowledge-guided extraction

While the KEPL model has demonstrated effective performance in scenarios such as Baike, News, and We-media, its applicability in other domains or contexts is yet to be confirmed. For example, the model may require further tuning and optimization for specialized domains such as technology, law, or medicine.

Data Dependency The performance of the KEPL model depends to a significant extent on the quality and quantity of available data. In cases where data is scarce, particularly in certain domains or for specific tasks, the model may require larger datasets for efficient training.