Hyperbolic Relevance Matching for Neural Keyphrase Extraction

Keyphrase extraction is a fundamental task in natural language processing that aims to extract a set of phrases with important information from a source document. Identifying important keyphrases is the central component of keyphrase extraction, and its main challenge is learning to represent information comprehensively and discriminate importance accurately. In this paper, to address the above issues, we design a new hyperbolic matching model (HyperMatch) to explore keyphrase extraction in hyperbolic space. Concretely, to represent information comprehensively, HyperMatch first takes advantage of the hidden representations in the middle layers of RoBERTa and integrates them as the word embeddings via an adaptive mixing layer to capture the hierarchical syntactic and semantic structures. Then, considering the latent structure information hidden in natural languages, HyperMatch embeds candidate phrases and documents in the same hyperbolic space via a hyperbolic phrase encoder and a hyperbolic document encoder. To discriminate importance accurately, HyperMatch estimates the importance of each candidate phrase by explicitly modeling the phrase-document relevance via the Poincaré distance and optimizes the whole model by minimizing the hyperbolic margin-based triplet loss. Extensive experiments are conducted on six benchmark datasets and demonstrate that HyperMatch outperforms the recent state-of-the-art baselines.


Introduction
Keyphrase Extraction (KE), a fundamental task in Natural Language Processing (NLP), aims to extract phrases related to the main points discussed in a source document. Because keyphrases express these points succinctly and accurately, keyphrase extraction is helpful for a variety of applications such as information retrieval (Kim et al., 2013) and text summarization (Liu et al., 2009a). Typically, keyphrase extraction methods consist of two main components: candidate keyphrase extraction and keyphrase importance estimation. Concretely, the former extracts candidate keyphrases from the source document via heuristics (e.g., the n-grams shown in Figure 1), and the latter determines which candidates are chosen as keyphrases. In other words, keyphrase importance estimation directly affects the performance of the keyphrase extraction model in most cases.

[Figure 1: Sample part of a document from the OpenKP dataset ("The Great Plateau is a large region of land that is secluded from …"). For ease of presentation, we assume "a large region of land" is a 5-gram phrase as an example.]
Generally, in neural supervised keyphrase extraction models, keyphrase importance estimation can be subdivided into information representation and importance discrimination. Specifically, information representation focuses on modeling the encoding procedure, while importance discrimination focuses on measuring and ranking the importance of candidate phrases. To represent information comprehensively, recent methods learn better representations via different backbones, such as Bi-LSTM (Meng et al., 2017), GCNs (Sun et al., 2019; Zhang et al., 2020), and pre-trained language models (e.g., ELMo (Xiong et al., 2019) and BERT (Sun et al., 2020)). To distinguish the importance of candidate phrases precisely, most existing supervised models (Sun et al., 2020; Mu et al., 2020; Song et al., 2021) estimate and rank the importance of candidate phrases using different approaches, such as classification and ranking models.
Although the methods mentioned above have achieved significant performance, the keyphrase extraction task still has room for improvement, with two main issues. The first issue lies in information representation. Typically, phrases exhibit inherent hierarchical structures ingrained with complex syntactic and semantic information (Dai et al., 2020; Alleman et al., 2021). In general, longer phrases contain more complex structures (as shown in Figure 1, the phrase "a large region of land" has a more complex inherent structure than "region" or "a large region"; similarly, "a large region" is more complex than "region"). Beyond phrases, since linguistic ontologies are intrinsically hierarchical (Dai et al., 2020), the conceptual relations between phrases and the document can also form hierarchical structures. Therefore, hierarchical structures need to be considered both when representing phrases and documents and when estimating phrase-document relevance. However, it is difficult to capture such structural information in Euclidean space, even with infinite dimensions (Linial et al., 1995). The second issue lies in distinguishing the importance of phrases. Keyphrases are typically used to retrieve and index their corresponding document, so they should be highly related to the main points of the source document (Hasan and Ng, 2014). However, most existing supervised keyphrase extraction methods do not explicitly model the relevance between phrases and their corresponding document, resulting in biased keyphrase extraction. Motivated by these issues, we explore the potential of hyperbolic space for the keyphrase extraction task and propose a new hyperbolic relevance matching model (HyperMatch) for neural supervised keyphrase extraction.
Firstly, to capture hierarchical syntactic and semantic structure information, HyperMatch integrates the hidden representations from all intermediate layers of RoBERTa into adaptive contextualized word embeddings via an adaptive mixing layer based on the self-attention mechanism. Then, considering the hierarchical structure hidden in natural language, HyperMatch encodes both phrases and documents in the same hyperbolic space via a hyperbolic phrase encoder and a hyperbolic document encoder. Meanwhile, we adopt the Poincaré distance to calculate the phrase-document relevance, taking the latent hierarchical structures between phrases and the document into account. In this setting, keyphrase extraction can be regarded as a matching problem and effectively implemented by minimizing a hyperbolic margin-based triplet loss. To the best of our knowledge, ours is the first work to explore supervised keyphrase extraction in hyperbolic space. Extensive experiments on six benchmark datasets demonstrate that HyperMatch outperforms the state-of-the-art baselines in most cases.

Preliminaries
Hyperbolic space is an important concept in hyperbolic geometry and can be considered a special case of Riemannian geometry (Hopper and Andrews, 2011). Before presenting our model, this section briefly introduces the basics of hyperbolic space.
In the traditional sense, hyperbolic spaces are not vector spaces: one cannot use standard operations such as summation and multiplication. To remedy this, one can use the formalism of Möbius gyrovector spaces, which generalizes many standard operations to hyperbolic spaces (Khrulkov et al., 2020). Similarly to previous work (Nickel and Kiela, 2017; Ganea et al., 2018; Tifrea et al., 2019), we adopt the Poincaré ball with an additional hyperparameter c that modifies its curvature; it is then defined as $\mathbb{D}^n_c = \{x \in \mathbb{R}^n : c\|x\|^2 < 1,\ c \geq 0\}$. The corresponding conformal factor takes the form $\lambda^c_x := \frac{2}{1 - c\|x\|^2}$. In practice, the choice of c allows one to balance between hyperbolic and Euclidean geometries: when $c \to 0$, all the formulas discussed below reduce to their usual Euclidean form.
We restate the definitions of fundamental mathematical operations for the generalized Poincaré ball model; we refer readers to (Ganea et al., 2018) for more details. Next, we give the closed-form formulas of several Möbius operations.

Möbius Addition. For a pair $x, y \in \mathbb{D}^n_c$, the Möbius addition is defined as

$$x \oplus_c y = \frac{(1 + 2c\langle x, y \rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y \rangle + c^2\|x\|^2\|y\|^2}. \quad (1)$$
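As a concrete illustration, Eq. (1) can be sketched in a few lines of plain Python (a minimal, dimension-agnostic sketch; the function name `mobius_add` and the list-based vector representation are our own choices, not part of the paper):

```python
def mobius_add(x, y, c=1.0):
    """Möbius addition x (+)_c y on the Poincaré ball with curvature parameter c."""
    xy = sum(a * b for a, b in zip(x, y))   # inner product <x, y>
    x2 = sum(a * a for a in x)              # ||x||^2
    y2 = sum(b * b for b in y)              # ||y||^2
    coef_x = 1.0 + 2.0 * c * xy + c * y2    # coefficient of x in the numerator
    coef_y = 1.0 - c * x2                   # coefficient of y in the numerator
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]
```

Note that adding the origin leaves a point unchanged, and for very small c the operation approaches ordinary vector addition, matching the c → 0 limit discussed above.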

[Figure 2: Overall framework of HyperMatch, showing the adaptive contextualized word embeddings and the phrase-document interactions in the Poincaré disk.]
Möbius Matrix-Vector Multiplication. For a linear map $M: \mathbb{R}^n \to \mathbb{R}^m$ and $\forall x \in \mathbb{D}^n_c$, if $Mx \neq 0$, the Möbius matrix-vector multiplication is defined as

$$M \otimes_c x = \frac{1}{\sqrt{c}} \tanh\!\left(\frac{\|Mx\|}{\|x\|} \tanh^{-1}(\sqrt{c}\,\|x\|)\right) \frac{Mx}{\|Mx\|}, \quad (2)$$

and $M \otimes_c x = 0$ if $Mx = 0$.

Poincaré Distance. The induced distance function is defined as

$$d_c(x, y) = \frac{2}{\sqrt{c}} \tanh^{-1}\!\left(\sqrt{c}\,\|{-x} \oplus_c y\|\right). \quad (3)$$

Note that with c = 1 one recovers the geodesic distance, while as $c \to 0$ we obtain the Euclidean distance: $\lim_{c \to 0} d_c(x, y) = 2\|x - y\|$.

Exponential and Logarithmic Maps. To perform operations in hyperbolic space, one first needs a mapping from $\mathbb{R}^n$ to $\mathbb{D}^n_c$ to move Euclidean vectors into the hyperbolic space. For $x \in \mathbb{D}^n_c$ and a tangent vector $v \in T_x\mathbb{D}^n_c$, the exponential map $\exp^c_x(\cdot): T_x\mathbb{D}^n_c \to \mathbb{D}^n_c$ is defined as

$$\exp^c_x(v) = x \oplus_c \left(\tanh\!\left(\sqrt{c}\,\frac{\lambda^c_x \|v\|}{2}\right) \frac{v}{\sqrt{c}\,\|v\|}\right). \quad (4)$$

As the inverse of $\exp^c_x(\cdot)$, the logarithmic map $\log^c_x(\cdot): \mathbb{D}^n_c \to T_x\mathbb{D}^n_c$ for $y \neq x$ is defined as

$$\log^c_x(y) = \frac{2}{\sqrt{c}\,\lambda^c_x} \tanh^{-1}\!\left(\sqrt{c}\,\|{-x} \oplus_c y\|\right) \frac{-x \oplus_c y}{\|{-x} \oplus_c y\|}. \quad (5)$$

Hyperbolic Averaging Pooling. Average pooling, an important operation common in natural language processing, averages a set of feature vectors. In the Euclidean setting, this operation takes the form

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i. \quad (6)$$

The extension of this operation to hyperbolic spaces is called the Einstein midpoint and takes its simplest form in Klein coordinates:

$$\bar{x}_K = \frac{\sum_{i=1}^{N} \gamma_i x^K_i}{\sum_{i=1}^{N} \gamma_i}, \quad (7)$$

where $\gamma_i = \frac{1}{\sqrt{1 - c\|x^K_i\|^2}}$ is the Lorentz factor. Recent work (Khrulkov et al., 2020) shows that the Klein model is supported on the same space as the Poincaré ball; however, the same point has different coordinate representations in the two models. Let $x_D$ and $x_K$ denote the coordinates of the same point in the Poincaré and Klein models, respectively. Then the following transition formulas hold:

$$x_D = \frac{x_K}{1 + \sqrt{1 - c\|x_K\|^2}}, \quad (8)$$

$$x_K = \frac{2 x_D}{1 + c\|x_D\|^2}. \quad (9)$$
Therefore, given points in the Poincaré ball, we can first map them to the Klein model via Eq.(9), compute the average using Eq.(7), and then map the result back to the Poincaré model via Eq.(8).
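The map-average-map-back recipe above can be sketched as follows (a minimal pure-Python illustration of Eqs.(7)-(9); the function names are hypothetical and the curvature defaults to c = 1):

```python
import math

def poincare_to_klein(x, c=1.0):
    # Eq.(9): x_K = 2 x_D / (1 + c ||x_D||^2)
    n2 = sum(a * a for a in x)
    return [2.0 * a / (1.0 + c * n2) for a in x]

def klein_to_poincare(x, c=1.0):
    # Eq.(8): x_D = x_K / (1 + sqrt(1 - c ||x_K||^2))
    n2 = sum(a * a for a in x)
    return [a / (1.0 + math.sqrt(1.0 - c * n2)) for a in x]

def hyperbolic_average(points, c=1.0):
    """Einstein-midpoint pooling of Poincaré-ball points (Eq. 7)."""
    klein = [poincare_to_klein(p, c) for p in points]
    gammas = [1.0 / math.sqrt(1.0 - c * sum(a * a for a in k)) for k in klein]
    total = sum(gammas)
    dim = len(points[0])
    mid = [sum(g * k[i] for g, k in zip(gammas, klein)) / total for i in range(dim)]
    return klein_to_poincare(mid, c)
```

The two transition maps are exact inverses of each other on the ball, so averaging a set containing a single point returns that point unchanged.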

HyperMatch
Given a document $D = \{w_1, ..., w_i, ..., w_M\}$, candidate phrases are first extracted from the source document via n-gram structures, where M indicates the maximum length of the input document. Then, to determine which candidates are keyphrases, we design a new hyperbolic relevance matching model (HyperMatch), which consists of two main procedures: information representation and importance discrimination. Figure 2 illustrates the overall framework of HyperMatch.
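For illustration, the n-gram-based candidate extraction step might look like the following sketch (the function name and tokenized input are our own assumptions; the paper does not prescribe this exact implementation, and real systems usually add filters such as part-of-speech patterns):

```python
def extract_candidates(tokens, max_n):
    """Enumerate all n-grams (n = 1..max_n) of a token list as candidate phrases."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)          # n-gram length
        for i in range(len(tokens) - n + 1)   # sliding start position
    ]
```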

Information Representation
Information representation is one of the essential parts of keyphrase importance estimation, which needs to represent information comprehensively.
To capture rich syntactic and semantic information, HyperMatch first embeds words with the pre-trained language model RoBERTa and the adaptive mixing layer. Then, phrases and documents are embedded in the same hyperbolic space by a hyperbolic phrase encoder and a hyperbolic document encoder. The following subsections describe the information representation procedure in detail.

Contextualized Word Encoder
Pre-trained language models (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019) have emerged as a critical technology for achieving impressive gains on natural language tasks. These models extend the idea of word embeddings by learning contextualized text representations from large-scale corpora using a language modeling objective. Thus, recent keyphrase extraction methods (Xiong et al., 2019; Sun et al., 2020; Mu et al., 2020) represent words and documents with the last intermediate layer of pre-trained language models. However, various probing tasks (Jawahar et al., 2019; de Vries et al., 2020) designed to discover the linguistic properties learned in contextualized word embeddings demonstrate that different intermediate layers of pre-trained language models encode different linguistic properties. Since each layer has its own specializations, combining features from different layers may be more beneficial than selecting the last one based on the best overall performance.
Motivated by the phenomenon above, we propose a new adaptive mixing layer to combine all intermediate layers of RoBERTa (Liu et al., 2019). Firstly, each word $w_i$ in the source document D is represented by all intermediate layers of RoBERTa, yielding a stacked contextualized embedding $h_i \in \mathbb{R}^{L \times d_r}$, where L and $d_r$ are set to 12 and 768. Then, the self-attention mechanism is adopted to aggregate the multi-layer representations of each word:

$$\alpha_i = \mathrm{softmax}\left(V_a^{\top} \tanh(W_a h_i^{\top})\right), \qquad \hat{h}_i = \alpha_i h_i,$$

where $V_a \in \mathbb{R}^{d_r}$ and $W_a \in \mathbb{R}^{d_r \times d_r}$ are learnable weights, and $\alpha_i \in \mathbb{R}^L$ represents the adaptive mixing weights of the proposed adaptive mixing layer. In this way, each word in the source document D is mapped to a sequence of vectors $\hat{H} = \{\hat{h}_1, ..., \hat{h}_i, ..., \hat{h}_M\}$. The adaptive mixing layer allows our model to obtain more comprehensive word embeddings, capturing more meaningful information (e.g., surface, syntactic, and semantic features).
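As an unbatched illustration of the adaptive mixing layer, the following sketch mixes the L per-layer embeddings of a single word using softmax attention scores of the form $v^{\top}\tanh(W h^l)$ (pure Python; the list-based shapes and names `W`, `v` are our simplification of the learnable weights $W_a$ and $V_a$):

```python
import math

def adaptive_mix(layers, W, v):
    """Softmax-attention mixture over the L per-layer embeddings of one word.

    layers: L lists of length d (one vector per RoBERTa layer)
    W:      d x d weight matrix (list of rows); v: length-d weight vector
    """
    scores = []
    for h in layers:
        Wh = [sum(W[r][k] * h[k] for k in range(len(h))) for r in range(len(W))]
        scores.append(sum(v[r] * math.tanh(Wh[r]) for r in range(len(v))))
    mx = max(scores)                              # numerically stable softmax
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]                # adaptive mixing weights
    d = len(layers[0])
    return [sum(a * h[k] for a, h in zip(alphas, layers)) for k in range(d)]
```

Because the weights are a softmax, they sum to one; if every layer produced the same vector, the mixture would simply return that vector.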

Hyperbolic Phrase Encoder
Phrases often exhibit inherent hierarchies ingrained with complex syntactic and semantic information (Zhu et al., 2020). Therefore, representing information requires sufficiently encoding semantic and syntactic information, especially the latent hierarchical structures hidden in natural language. Recent studies (Sun et al., 2020; Xiong et al., 2019) typically obtain phrase representations in Euclidean space, which makes it difficult to learn representations with such latent structural information, even with infinite dimensions (Linial et al., 1995). On the contrary, hyperbolic spaces are non-Euclidean geometric spaces that can naturally capture latent hierarchical structures (Sarkar, 2011; Sa et al., 2018). Lately, the use of hyperbolic space in NLP (Dhingra et al., 2018; Tifrea et al., 2019; Nickel and Kiela, 2017) has been motivated by the ubiquity of hierarchies (e.g., the latent hierarchical structures in phrases, sentences, and documents) in NLP tasks. Therefore, in this paper, we propose to embed phrases in hyperbolic space. Concretely, the phrase representation of the i-th n-gram $c^n_i$ is computed as follows:

$$\hat{h}^n_i = \mathrm{CNN}_n(\hat{h}_{i:i+n-1}),$$

where $\hat{h}^n_i \in \mathbb{R}^{d_h}$ represents the i-th n-gram representation, $n \in [1, N]$ indicates the length of the n-gram, and N is the maximum length of n-grams.
Each n-gram has its own set of convolution filters CNN n with window size n and stride 1.
To capture the latent hierarchies of phrases, we map phrase representations to the Poincaré ball using the exponential map:

$$\tilde{h}^n_i = \exp^c_0(\hat{h}^n_i),$$

where $\tilde{h}^n_i$ indicates the hyperbolic representation of the i-th n-gram $\hat{h}^n_i$. By mapping phrase representations into hyperbolic space, HyperMatch can implicitly model the latent hierarchical structure of phrases.
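At the origin the exponential map simplifies to $\exp^c_0(v) = \tanh(\sqrt{c}\,\|v\|)\, v / (\sqrt{c}\,\|v\|)$, which can be sketched for a single vector as follows (the function name is hypothetical):

```python
import math

def exp_map_zero(v, c=1.0):
    """Exponential map at the origin: projects a Euclidean (tangent) vector
    into the Poincaré ball, so the result has norm < 1/sqrt(c)."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return list(v)  # the origin maps to itself
    coef = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [coef * a for a in v]
```

Since tanh is bounded by 1, arbitrarily large Euclidean phrase vectors always land strictly inside the ball, with larger vectors pushed toward its boundary.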

Hyperbolic Document Encoder
When using the source document as a query to match keyphrases, the document representation should cover its main points (important information). Meanwhile, documents are usually long text sequences with richer semantic and syntactic information than phrases. Many current BERT-based methods (Mu et al., 2020; Zhong et al., 2020) in NLP obtain document representations by using the first output token (the [CLS] token) of pre-trained language models.
However, recent studies (Reimers and Gurevych, 2019) demonstrate that in many NLP tasks, document representations obtained by average pooling of word representations outperform the [CLS] token. Motivated by this, we use average pooling, a simple and effective operation, to encode documents. To further consider the latent hierarchical structures of documents, we transfer the word representations and the average pooling operation to the hyperbolic space. Specifically, we first map word representations to the hyperbolic space via the exponential map:

$$\tilde{h}_i = \exp^c_0(W_h \hat{h}_i),$$

where $W_h \in \mathbb{R}^{d_r \times d_h}$ maps the original RoBERTa embedding space to the tangent space at the origin of the Poincaré ball, and $\exp^c_0(\cdot)$ maps the tangent space into the Poincaré ball. Next, we use hyperbolic averaging pooling (Eqs.(7)-(9)) to encode the source document:

$$\tilde{h} = \mathrm{HyperAvg}(\{\tilde{h}_1, ..., \tilde{h}_M\}),$$

where $\tilde{h} \in \mathbb{R}^{d_h}$ indicates the hyperbolic document representation (called the Einstein midpoint pooling vector in the Poincaré ball (Gulcehre et al., 2019)). Hyperbolic average pooling emphasizes semantically specific words, which usually carry more information but occur less frequently than general ones. Note that points near the boundary of the Poincaré ball receive larger weights in the Einstein midpoint formula and are regarded as more representative content (carrying more helpful information, such as the latent hierarchies) of the source document (Dhingra et al., 2018; Zhu et al., 2020).

Importance Discrimination
Importance discrimination is one of the primary parts of keyphrase importance estimation: it measures and ranks the importance of candidate phrases to extract keyphrases. To reach this goal, we first calculate the scaled phrase-document relevance between each phrase and its corresponding document via the Poincaré distance as the importance score of each candidate phrase. Then, the importance score is optimized with a margin-based triplet loss to extract keyphrases.

Scaled Phrase-Document Relevance
Besides the intrinsic hierarchies of linguistic ontologies, the conceptual relations between candidate phrases and their corresponding document can also form hierarchical structures. Once the document representation $\tilde{h}$ and the phrase representations $\tilde{h}^n_i$ are obtained, highly relevant phrases and their corresponding document are expected to be embedded close to each other in terms of their geodesic distance. Specifically, the scaled phrase-document relevance of the i-th n-gram $c^n_i$ can be computed as follows:

$$S(c^n_i, D) = \frac{1}{\sqrt{d_h}}\left(-d_c(\tilde{h}^n_i, \tilde{h}) + f_c(\tilde{h}^n_i)\right), \quad (17)$$

where $S(\cdot)$ indicates the scaled phrase-document relevance, $d_c$ indicates the Poincaré distance introduced in Eq.(3), and $f_c$ indicates a linear transformation in the hyperbolic space. In Eq.(17), the first term models the phrase-document relevance explicitly, and the second term models it implicitly. Estimating the phrase-document relevance via the Poincaré distance allows HyperMatch to model the relationships between candidate phrases and their document while simultaneously considering semantics and latent hierarchical structures, which benefits ranking keyphrases accurately. Furthermore, we find that as the representation dimension $d_h$ increases, the magnitude of the phrase-document relevance also increases, which can make optimization collapse, with the loss value tending to infinity. To counteract this effect, we scale the phrase-document relevance by $1/\sqrt{d_h}$.
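The explicit part of the relevance score, the negative scaled Poincaré distance, can be sketched as follows (a minimal illustration of our reading of the scoring rule; the exact combination with the implicit term $f_c$ may differ in the original implementation, and all function names are hypothetical):

```python
import math

def mobius_add(x, y, c=1.0):
    """Möbius addition on the Poincaré ball (Eq. 1)."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [((1.0 + 2.0 * c * xy + c * y2) * a + (1.0 - c * x2) * b) / denom
            for a, b in zip(x, y)]

def poincare_distance(x, y, c=1.0):
    """d_c(x, y) = (2 / sqrt(c)) * atanh(sqrt(c) * ||-x (+)_c y||) (Eq. 3)."""
    diff = mobius_add([-a for a in x], y, c)
    n = math.sqrt(sum(a * a for a in diff))
    return (2.0 / math.sqrt(c)) * math.atanh(math.sqrt(c) * n)

def scaled_relevance(phrase, doc, d_h, c=1.0):
    """Explicit relevance term: negative Poincaré distance scaled by 1/sqrt(d_h),
    so closer phrase-document pairs receive higher scores."""
    return -poincare_distance(phrase, doc, c) / math.sqrt(d_h)
```

A phrase embedded at the document point itself scores 0, and scores decrease monotonically as the hyperbolic distance grows.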

Margin-based Triplet Loss
To select phrases with higher importance, we adopt the margin-based triplet loss and optimize for margin separation in the hyperbolic space. We first place the candidate phrases labelled as keyphrases in the positive set $P^+$ and the others in the negative set $P^-$ to obtain the matching labels. Then, the loss function is calculated as follows:

$$\mathcal{L} = \sum_{p^+ \in P^+} \sum_{p^- \in P^-} \max\left(0,\ \delta - S(p^+, D) + S(p^-, D)\right),$$

where $\delta$ indicates the margin. This enforces HyperMatch to rank the candidate keyphrases $p^+$ ahead of $p^-$ within their corresponding document. Through this training objective, our model tends to extract the keyphrases that are most relevant to the source document.
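At the level of already-computed relevance scores, this objective reduces to a standard hinge over positive/negative pairs, which can be sketched as (the actual loss is computed over differentiable hyperbolic scores during training; this is only an illustration of the hinge form):

```python
def triplet_loss(pos_scores, neg_scores, margin=1.0):
    """Margin-based triplet loss over positive/negative relevance scores.

    Each pair contributes max(0, margin - S(p+) + S(p-)); the loss is zero
    once every positive outscores every negative by at least the margin.
    """
    loss = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            loss += max(0.0, margin - sp + sn)
    return loss
```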

Implementation Details
Our implementation is built on RoBERTa (Liu et al., 2019), and HyperMatch was trained on eight NVIDIA RTX A4000 GPUs to achieve the best performance.

Evaluation Metrics
For the keyphrase extraction task, a model's performance is typically evaluated by comparing the top-K predicted keyphrases with the target keyphrases (ground-truth labels). The evaluation cutoff K can be a fixed number (e.g., F1@5 compares the top-5 keyphrases predicted by the model with the ground truth to compute an F1 score). Following previous work (Meng et al., 2017; Sun et al., 2019), we adopt macro-averaged recall and F-measure (F1) as evaluation metrics, with K set to 1, 3, 5, and 10. In the evaluation, we apply the Porter Stemmer (Porter, 2006) to both target keyphrases and extracted keyphrases when determining exact matches.
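For concreteness, exact-match F1@K over (already stemmed) phrase strings for a single document can be computed as follows (`f1_at_k` is a hypothetical helper name, not from the paper; the macro-averaged metric averages this value over documents):

```python
def f1_at_k(predicted, gold, k):
    """Exact-match F1@K: predicted is a ranked list of phrases, gold the targets."""
    topk = predicted[:k]
    hits = len(set(topk) & set(gold))   # correctly predicted keyphrases
    if hits == 0:
        return 0.0
    precision = hits / len(topk)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```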

Baselines
We compare against two kinds of strong baselines to give a comprehensive evaluation of HyperMatch: unsupervised keyphrase extraction models (e.g., TextRank (Mihalcea and Tarau, 2004) and TFIDF (Jones, 2004)) and supervised keyphrase extraction models (e.g., classification- and ranking-based variants of BERT (Sun et al., 2020)). Notably, HyperMatch extracts keyphrases without using additional features on the OpenKP dataset. Therefore, for fairness, we do not compare with methods that use additional features to extract keyphrases (Xiong et al., 2019). In addition, this paper mainly focuses on exploring keyphrase extraction in hyperbolic space via a matching framework (similar to the ranking model). Hence, the baselines we compare against are keyphrase extraction methods based on classification and ranking models rather than existing studies based on integration models (Ahmad et al., 2021; Wu et al., 2021) or multi-task learning (Song et al., 2021). The results of the baselines are reported in their corresponding papers; * denotes results that are not included in the original paper and are estimated from precision and recall scores.

Results and Analysis
In this section, we investigate the performance of HyperMatch on six widely-used benchmark keyphrase extraction datasets (OpenKP, KP20k, Inspec, Krapivin, Nus, and Semeval) from three facets. The first one demonstrates its superiority by comparing HyperMatch with the recent baselines in terms of several metrics. The second one is to verify the effect of each component via ablation tests. The last one is to analyze the sensitivity of the triplet loss with different margins.

Performance Comparison
The experimental results are given in Table 2 and Table 3. Overall, HyperMatch outperforms the recent BERT-based keyphrase extraction models (results as reported in their own papers) in most cases. Concretely, on the OpenKP and KP20k datasets, HyperMatch achieves better results than the best ranking model, RoBERTa-Ranking-KPE. We attribute this mainly to learning representations in hyperbolic space, which can capture more latent hierarchical structures than Euclidean space. Meanwhile, on the four zero-shot datasets (Inspec, Krapivin, Nus, and Semeval) in Table 3, HyperMatch outperforms both unsupervised and supervised baselines. We attribute this mainly to the scaled phrase-document relevance, which explicitly models a strong connection between phrases and their corresponding document via the Poincaré distance, yielding more robust performance across datasets.

Ablation Study
In this section, we report on several ablation experiments to analyze the effect of different components. The ablation experiment on the OpenKP dataset is shown in Table 4.
To measure the effectiveness of the hyperbolic space for the keyphrase extraction task, we compare HyperMatch with its Euclidean counterpart (EuclideanMatch), which uses the Euclidean distance to explicitly model the phrase-document relevance. As shown in Table 4, HyperMatch outperforms EuclideanMatch, which shows that hyperbolic space captures the latent hierarchical structures more effectively than Euclidean space.
To verify the effectiveness of the adaptive mixing layer, we evaluate HyperMatch w/o AML, which removes the adaptive mixing layer and only uses the last intermediate layer of RoBERTa to embed phrases and documents. As shown in Table 4, performance drops on all evaluation metrics without the adaptive mixing layer. These results demonstrate that combining all intermediate layers of RoBERTa via the self-attention mechanism captures more of the helpful information spread across different layers.

Unlike our model, most recent keyphrase extraction methods (e.g., RoBERTa-Ranking-KPE) model the relevance between candidate phrases and their corresponding document only implicitly, via a linear transformation layer. Therefore, to verify the effectiveness of explicitly modeling the phrase-document relevance, we build HyperMatch w/o Relevance, which only implicitly computes the phrase-document relevance through a hyperbolic linear transformation layer (Ganea et al., 2018). The results of HyperMatch w/o Relevance drop on all evaluation metrics, indicating that explicitly considering the relevance between phrases and the document is essential for estimating the importance of candidate phrases in the keyphrase extraction task.

Sensitivity of Hyperparameters
In this section, we examine the sensitivity of HyperMatch to different margins ($\delta$) of the hyperbolic triplet loss. For keyphrase extraction methods equipped with the margin-based triplet loss, the margin significantly impacts the final result, and a poorly chosen margin usually causes performance degradation. Figure 3 shows the effects of different margins on HyperMatch: it achieves the best results when $\delta = 1$.

Related Work
This section briefly describes the related work from two fields: keyphrase extraction and hyperbolic deep learning.

Keyphrase Extraction
Most existing KE models are based on a two-stage extraction framework consisting of two main parts: candidate keyphrase extraction and keyphrase importance estimation. Candidate keyphrase extraction extracts a set of candidate phrases from the source document via heuristics (e.g., essential n-gram-based phrases (Hulth, 2004; Medelyan et al., 2009; Xiong et al., 2019; Sun et al., 2020)). Keyphrase importance estimation first represents candidate phrases and documents with pre-trained language models (Devlin et al., 2019; Liu et al., 2019) and then implicitly estimates the phrase-document relevance as importance scores. Finally, the candidate phrases are ranked by their importance scores, which can be learned by either unsupervised (Mihalcea and Tarau, 2004; Liu et al., 2009b) or supervised (Xiong et al., 2019; Sun et al., 2020; Mu et al., 2020) ranking approaches. Different from existing KE models, we first embed phrases and documents with RoBERTa in Euclidean space and then map these representations to the same hyperbolic space to capture the latent hierarchical structures. Next, we adopt the Poincaré distance to explicitly model the phrase-document relevance as the importance score of each candidate phrase. Finally, the hyperbolic margin-based triplet loss is used to optimize the whole model for extracting keyphrases. To the best of our knowledge, ours is the first study to explore supervised keyphrase extraction in hyperbolic space.

Hyperbolic Deep Learning
Recent studies on representation learning (Nickel and Kiela, 2017; Tifrea et al., 2019; Mathieu et al., 2019) demonstrate that hyperbolic space is more suitable than Euclidean space for embedding symbolic data with hierarchies, since the tree-like properties of hyperbolic space (Hamann, 2018) make it efficient to learn hierarchical representations with low distortion (Sa et al., 2018; Sarkar, 2011). As linguistic ontologies are innately hierarchical, hierarchies are ubiquitous in natural language (Dai et al., 2020). Recent studies also show the superiority of hyperbolic space for many natural language processing tasks (Gulcehre et al., 2019; Zhu et al., 2020), demonstrating that mapping contextualized word embeddings (i.e., BERT-based embeddings) to hyperbolic space can capture richer hierarchical structure information than Euclidean space when encoding natural language text.

Conclusions and Future Work
We propose HyperMatch, a new hyperbolic relevance matching model that maps phrase and document representations into hyperbolic space and models the relevance between candidate phrases and the document via the Poincaré distance. Specifically, HyperMatch first combines the intermediate layers of RoBERTa via an adaptive mixing layer to capture richer syntactic and semantic information. Then, phrases and documents are encoded in the same hyperbolic space to capture the latent hierarchical structures. Next, the phrase-document relevance is estimated explicitly via the Poincaré distance and used as the importance score of each candidate phrase. Finally, we adopt the hyperbolic margin-based triplet loss to optimize the whole model for extracting keyphrases.
In this paper, we explore hyperbolic space to implicitly model the latent hierarchical structures when representing candidate phrases and documents. In the future, it will be interesting to introduce external knowledge (e.g., WordNet) to explicitly model these latent hierarchical structures. Our code is publicly available to facilitate further research.