Importance Estimation from Multiple Perspectives for Keyphrase Extraction

Keyphrase extraction is a fundamental task in Natural Language Processing that usually consists of two main parts: candidate keyphrase extraction and keyphrase importance estimation. When humans read a document, they typically judge the importance of a phrase by its syntactic accuracy, information saliency, and concept consistency simultaneously. However, most existing keyphrase extraction approaches focus on only some of these perspectives, which leads to biased results. In this paper, we propose a new approach that estimates the importance of keyphrases from multiple perspectives (called KIEMP) and thereby improves the performance of keyphrase extraction. Specifically, KIEMP estimates the importance of a phrase with three modules: a chunking module to measure its syntactic accuracy, a ranking module to check its information saliency, and a matching module to judge the concept (i.e., topic) consistency between the phrase and the whole document. These three modules are joined via an end-to-end multi-task learning model, which helps the three parts enhance each other and balances the effects of the three perspectives. Experimental results on six benchmark datasets show that KIEMP outperforms the existing state-of-the-art keyphrase extraction approaches in most cases.


Introduction
Keyphrase Extraction (KE) aims to select a set of reliable phrases (e.g., "harmonic balance method", "grobner base", "error bound", "algebraic representation", and "singular point" in Table 1) with salient information and central topics from a given document, which is a fundamental task in natural language processing. Most classic keyphrase extraction methods typically include two main components: candidate keyphrase extraction and keyphrase importance estimation (Medelyan et al., 2009; Liu et al., 2010; Hasan and Ng, 2014).
Input Document: harmonic balance ( hb ) method is well known principle for analyzing periodic oscillations on nonlinear networks and systems. because the hb method has a truncation error, approximated solutions have been guaranteed by error bounds. however, its numerical computation is very time consuming compared with solving the hb equation. this paper proposes proposes an algebraic representation of the error bound using grobner base. the algebraic representation enables to decrease the computational cost of the error bound considerably. moreover, using singular points of the algebraic representation, we can obtain accurate break points of the error bound by collisions.
Output / Target Keyphrases: harmonic balance method; grobner base; error bound; algebraic representation; singular point; quadratic approximation
Table 1: Sample input document with output / target keyphrases from the KP20k testing set. Keyphrases can typically be categorized into two types: present keyphrases, which appear in the given document, and absent keyphrases, which do not.
As shown in Table 1, each keyphrase usually consists of more than one word (Meng et al., 2017). To extract candidate keyphrases from the given document, which is typically characterized via word-level representations, researchers leverage heuristics (Wan and Xiao, 2008; Liu et al., 2009a,b; Nguyen and Phan, 2009; Grineva et al., 2009; Medelyan et al., 2009). For example, word embeddings are composed into n-grams by a Convolutional Neural Network (CNN) (Xiong et al., 2019; Sun et al., 2020). Usually, the candidate set contains many more phrases than the ground-truth keyphrase set. Therefore, it is critical to select the important keyphrases from the candidate set, and keyphrase importance estimation is an essential component of many keyphrase extraction models. Since the keyphrase extraction task concerns "the automatic selection of important and topical phrases from the body of a document" (Turney, 2000), its goal is to estimate the importance of the candidate keyphrases to determine which ones should be extracted. Recent approaches (Sun et al., 2020) recast keyphrase extraction as a classification problem and extract keyphrases with a binary classifier. However, a binary classifier classifies each candidate keyphrase independently, and consequently, it does not allow us to determine which candidates are better than the others (Hulth, 2004). Therefore, some methods (Jiang et al., 2009; Xiong et al., 2019; Sun et al., 2020) propose a ranking model to extract keyphrases, where the goal is to learn a phrase ranker that compares the saliency of two candidate phrases. Furthermore, many previous studies (Liu et al., 2010; Wang et al., 2019; Liu et al., 2009b) extract keyphrases according to the main topics discussed in the source document. For example, Liu et al. (2010) propose a topical PageRank approach to measure the importance of words with respect to different topics.
However, most existing keyphrase extraction methods estimate the importance of keyphrases from at most two perspectives, leading to biased extraction. Therefore, to improve the performance of keyphrase extraction, the importance of the candidate keyphrases needs to be estimated from multiple perspectives. Motivated by this observation, we propose a new approach that estimates importance from multiple perspectives simultaneously for the keyphrase extraction task. Concretely, it estimates importance from three perspectives (syntactic accuracy, information saliency, and concept consistency) with three modules. A chunking module, implemented as a binary classification layer, measures the syntactic accuracy of each candidate keyphrase. A ranking module checks the information saliency of each candidate phrase with a pairwise ranking approach, which introduces competition between the candidate keyphrases to extract the more salient ones. A matching module judges the concept relevance of each candidate phrase to the document via a metric learning framework. Furthermore, our model is trained jointly on the above three modules, balancing the effects of the three perspectives. Experimental results on two benchmark data sets show that KIEMP outperforms the existing state-of-the-art keyphrase extraction approaches in most cases.

Related Work
A good keyphrase extraction system typically consists of two steps: (1) candidate keyphrase extraction, extracting a list of words / phrases that serve as the candidate keyphrases using some heuristics (Wan and Xiao, 2008;Nguyen and Phan, 2009;Medelyan et al., 2009;Grineva et al., 2009;Liu et al., 2009a,b); and (2) keyphrase importance estimation, determining which of these candidate phrases are keyphrases using different importance estimation approaches.
In candidate keyphrase extraction, heuristic rules are usually designed to avoid spurious phrases and keep the number of candidates to a minimum (Hasan and Ng, 2014). Generally, the heuristics mainly include (1) leveraging a stop word list (Liu et al., 2009b), (2) allowing only words with certain part-of-speech tags (Mihalcea and Tarau, 2004; Liu et al., 2009a), and (3) composing words into n-grams to serve as the candidate keyphrases (Medelyan et al., 2009; Sun et al., 2020; Xiong et al., 2019). The above heuristics have proven effective, with high recall in extracting gold keyphrases from various sources. Motivated by the above methods, in this paper, we leverage CNNs to compose words into n-grams as the candidate keyphrases.
In keyphrase importance estimation, the existing methods can be mainly divided into two categories: unsupervised and supervised. Unsupervised methods are usually categorized into four groups, i.e., graph-based ranking (Mihalcea and Tarau, 2004), topic-based clustering (Liu et al., 2009b), simultaneous learning (Zha, 2002), and language modeling (Tomokiyo and Hurst, 2003). Early supervised approaches to keyphrase extraction recast this task as a binary classification problem (Witten et al., 1999; Turney, 2000, 2002; Jiang et al., 2009). Later, to determine which candidates are better than the others, many ranking approaches were proposed to rank the saliency of two phrases (Jiang et al., 2009; Sun et al., 2020). This pairwise ranking approach introduces competition between candidate keyphrases and has achieved good performance. Both supervised and unsupervised methods construct features or models from different perspectives to measure the importance of candidate keyphrases and determine which keyphrases should be extracted. However, the approaches mentioned above consider at most two perspectives when measuring the importance of phrases, which leads to biased keyphrase extraction.
Different from the existing methods, the proposed KIEMP considers estimating the importance of the candidate keyphrases from multiple perspectives simultaneously.

Methodology
We formally define the problem of keyphrase extraction as follows. In this paper, KIEMP takes a document $D = \{w_1, \ldots, w_i, \ldots, w_M\}$ and learns to extract a set of keyphrases $K$ (each keyphrase may be composed of one or several words) from their n-gram based representations under multiple perspectives.
This section describes the architecture of KIEMP, as shown in Figure 1. KIEMP mainly consists of two submodels: candidate keyphrase extraction and keyphrase importance estimation. The former first identifies and extracts the candidate keyphrases. Then the latter estimates the importance of keyphrases from three perspectives simultaneously with three modules to determine which one should be extracted.

Contextualized Word Representation
Recently, pre-trained language models (Peters et al., 2018; Devlin et al., 2019) have emerged as a critical technology for achieving impressive gains in a wide variety of natural language tasks (Liu and Lapata, 2019). These models extend the idea of word embeddings by learning contextual representations from large-scale corpora using a language modeling objective. In this setting, Xiong et al. (2019) propose to represent each word by its ELMo (Peters et al., 2018) embedding, and Sun et al. (2020) leverage variants of BERT (Devlin et al., 2019) to obtain contextualized word representations. Motivated by the above approaches, we represent each word by RoBERTa, which encodes $D$ into a sequence of vectors $H = \{h_1, \ldots, h_i, \ldots, h_M\}$:
$$H = \mathrm{RoBERTa}(\{w_1, \ldots, w_i, \ldots, w_M\})$$
where $h_i \in \mathbb{R}^d$ indicates the $i$-th contextualized word embedding of $w_i$ from the last transformer layer in RoBERTa.

Candidate Keyphrase Extraction
In the keyphrase extraction task, a keyphrase usually contains more than one word, as shown in Table 1. Therefore, it is necessary to identify the candidate keyphrases via some strategy. Previous work (Medelyan et al., 2009; Sun et al., 2020; Xiong et al., 2019) allows n-grams that appear in the document to be candidate keyphrases. Motivated by the previous approaches, we consider the language properties (Xiong et al., 2019) and compose the contextualized word representations into n-grams by CNNs (similar to Sun et al. (2020)). Specifically, the phrase representation of the $i$-th n-gram $c_i^n$ is computed as:
$$h_i^n = \mathrm{CNN}^n(\{h_i, \ldots, h_{i+n-1}\})$$
where $h_i^n \in \mathbb{R}^d$ indicates the $i$-th n-gram representation. Concretely, $n \in [1, N]$ is the length of the n-grams, and $N$ indicates the maximum length of allowed candidate n-grams. Each n-gram length has its own set of convolution filters $\mathrm{CNN}^n$ with window size $n$ and stride 1.
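The span structure behind this step can be illustrated without any neural components. The following minimal sketch only enumerates the n-gram windows (window size n, stride 1) that a length-specific convolution would slide over; the function name `enumerate_ngrams` and the whitespace tokenization are assumptions made for illustration and are not part of KIEMP.

```python
from typing import Dict, List, Tuple


def enumerate_ngrams(tokens: List[str], max_n: int) -> Dict[int, List[Tuple[int, str]]]:
    """Return, for each length n in [1, max_n], all (start, phrase) spans.

    Each length n uses a window of size n moved with stride 1, mirroring
    the receptive field of a length-specific convolution filter.
    """
    spans: Dict[int, List[Tuple[int, str]]] = {}
    for n in range(1, max_n + 1):
        spans[n] = [
            (i, " ".join(tokens[i:i + n]))
            for i in range(len(tokens) - n + 1)
        ]
    return spans


# Hypothetical example: the first words of the Table 1 document.
tokens = "harmonic balance method is well known".split()
candidates = enumerate_ngrams(tokens, max_n=3)
```

In a full model, each span would be mapped to a vector by the filters for its length; here the spans are kept as strings so the windowing is easy to inspect.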

Keyphrase Importance Estimation
In the keyphrase extraction models, keyphrase importance estimation commonly is one of the essential components. To improve the accuracy of keyphrase extraction, we estimate the importance of keyphrases from three perspectives simultaneously with three modules: chunking for syntactic accuracy, ranking for information saliency, and matching for concept consistency.

Chunking for Syntactic Accuracy
Many studies (Turney, 2002; Witten et al., 1999; Turney, 2000) regard keyphrase extraction as a classification task, in which a model is trained to determine whether a candidate phrase is a keyphrase from a syntactic perspective. For example, Xiong et al. (2019) and Sun et al. (2020) directly predict whether an n-gram is a keyphrase based on its corresponding representation. Motivated by these methods, in this paper, the syntactic accuracy of phrase $c_i^n$ is estimated by a chunking module:
$$I_1(c_i^n) = \mathrm{softmax}(W_1 h_i^n + b_1)$$
where $W_1$ and $b_1$ indicate a trainable matrix and a bias. The softmax is taken over all possible n-grams at each position $i$ and each length $n$. The whole model is trained using cross-entropy loss:
$$\mathcal{L}_c = \mathrm{CrossEntropy}(y_i^n, I_1(c_i^n))$$
where $y_i^n$ is the label indicating whether the phrase $c_i^n$ is a keyphrase of the original document.
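To make the classification view concrete, here is a minimal sketch of a cross-entropy loss over candidate phrases. It assumes one simple reading of the chunking module, in which every candidate receives two scores (non-keyphrase, keyphrase) from a linear layer; the function names and the two-class layout are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def chunking_loss(logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy over candidates.

    logits[j] holds two scores (non-keyphrase, keyphrase) for candidate j,
    standing in for the output of a linear layer on its n-gram vector
    (an assumption made for this sketch). labels[j] is 0 or 1.
    """
    losses = []
    for scores, y in zip(logits, labels):
        probs = softmax(scores)
        losses.append(-math.log(probs[y]))
    return sum(losses) / len(losses)
```

A candidate whose keyphrase score is confidently high contributes a small loss when its label is 1 and a large loss when its label is 0, which is the behaviour a chunking-style classifier relies on.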

Ranking for Information Saliency
The binary classifier-based keyphrase extraction model classifies each candidate keyphrase independently, and consequently, it does not allow us to determine which candidates are better than the others (Hulth, 2004). However, the goal of keyphrase extraction is to identify the most salient phrases for a document (Hasan and Ng, 2014). Therefore, a ranking model is required to rank the saliency of the candidate keyphrases. We leverage a pairwise learning approach to rank the candidate keyphrases globally and compare the information saliency between all candidates. First, to obtain the ranking labels, we put the candidate keyphrases in the document that are labeled as keyphrases into the positive set $P^+$ and the others into the negative set $P^-$. Then, the loss function is the standard hinge loss in the pairwise learning model:
$$\mathcal{L}_r = \sum_{p^+ \in P^+,\, p^- \in P^-} \max\left(0,\ \delta_1 - I_2(p^+) + I_2(p^-)\right)$$
where $I_2(\cdot)$ represents the estimation of information saliency and $\delta_1$ indicates the margin. It enforces KIEMP to rank the candidate keyphrases $p^+$ ahead of $p^-$ within the same document. Specifically, the information saliency of the $i$-th n-gram representation $c_i^n$ can be computed as follows:
$$I_2(c_i^n) = W_2 h_i^n + b_2$$
where $W_2$ is a trainable matrix, and $b_2$ is a bias.
Through the pairwise learning model, we can rank the information saliency of all candidates and extract the keyphrases with the most salient information.
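The pairwise hinge loss described above can be sketched in a few lines. The snippet below assumes the saliency scores have already been computed for the positive and negative candidates of one document; the function name and the default margin of 1.0 are assumptions chosen for illustration.

```python
from typing import List


def pairwise_hinge_loss(pos_scores: List[float],
                        neg_scores: List[float],
                        margin: float = 1.0) -> float:
    """Sum of max(0, margin - s_pos + s_neg) over all positive/negative pairs.

    A pair contributes zero loss once the positive candidate outscores
    the negative one by at least the margin.
    """
    loss = 0.0
    for s_pos in pos_scores:
        for s_neg in neg_scores:
            loss += max(0.0, margin - s_pos + s_neg)
    return loss
```

For one positive candidate scored 2.0 and negatives scored 0.0 and 1.5, only the second pair violates a margin of 1.0, so the loss is 0.5.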

Matching for Concept Consistency
As phrases are used to express various meanings corresponding to different concepts (i.e., topics), a phrase plays roles of different importance in different concepts of the document (Liu et al., 2010). A matching module is proposed via a metric learning framework to estimate the concept consistency between the candidate keyphrases and their corresponding document. We first apply a variational autoencoder (Rezende et al., 2014) to the documents $D$ and the candidate keyphrases $K$ to obtain their concepts. Each document $D$ is encoded via a latent variable $z \in \mathbb{R}^c$, which is assumed to be sampled from a standard Gaussian prior, i.e., $z \sim p(z) = \mathcal{N}(0, I_c)$. Such a variable is able to capture the latent concepts hidden in the documents and is useful for extracting keyphrases (Wang et al., 2019). During the encoding process, $z$ can be sampled via the reparameterization trick for Gaussian distributions, i.e., $z \sim q(z|D) = \mathcal{N}(\mu, \sigma^2)$. Specifically, we sample an auxiliary noise variable $\varepsilon \sim \mathcal{N}(0, I)$ and reparameterize $z = \mu + \sigma \odot \varepsilon$, where $\odot$ denotes element-wise multiplication. The mean vector $\mu \in \mathbb{R}^c$ and standard deviation vector $\sigma \in \mathbb{R}^c$ are inferred by a two-layer network with ReLU activation, i.e., $\mu = \mu_\phi(D)$ and $\sigma = \sigma_\phi(D)$, where $\phi$ is the parameter set. During the decoding process, the document can be reconstructed by a multi-layer network $f_k$ with Tanh activation, i.e., $\hat{D} = f_k(z)$. Furthermore, the candidate keyphrases are processed in the same way as the documents.
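The sampling step can be illustrated with a small, dependency-free sketch. It treats σ as a per-dimension standard deviation (an assumption of this sketch) and also shows the closed-form KL divergence to a standard Gaussian prior that VAE training typically penalizes; the function names are hypothetical.

```python
import math
import random
from typing import List


def reparameterize(mu: List[float], sigma: List[float], rng: random.Random) -> List[float]:
    """Draw z = mu + sigma * eps with eps ~ N(0, I), element-wise.

    sigma is treated as a standard deviation (assumption of this sketch).
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu: List[float], sigma: List[float]) -> float:
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        s * s + m * m - 1.0 - math.log(s * s)
        for m, s in zip(mu, sigma)
    )


# Hypothetical 2-dimensional posterior for one document.
rng = random.Random(0)
z = reparameterize([0.5, -1.0], [0.1, 0.2], rng)
```

Because the randomness enters only through `eps`, gradients can flow through `mu` and `sigma` in an automatic differentiation framework, which is the reason the trick is used.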
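Once we have the latent concept representations of the document, $z$, and of the phrase, $z_i^n$, the concept consistency can be estimated as follows:
$$I_3(c_i^n, D) = (z_i^n)^{\top} W_3\, z$$
Here, $W_3$ is a learnable mapping matrix. The loss function is the triplet loss in the metric learning framework, calculated as follows:
$$\mathcal{L}_m = \sum_{p^+ \in P^+,\, p^- \in P^-} \max\left(0,\ \delta_2 - I_3(p^+, D) + I_3(p^-, D)\right)$$
where $\delta_2$ represents the margin. It enforces KIEMP to match and rank the concept consistency of keyphrases $p^+$ ahead of the non-keyphrases $p^-$ within their corresponding document $D$.
Furthermore, to simultaneously minimize the reconstruction loss and penalize the discrepancy between the prior distribution and the posterior distribution of the latent variable $z$, the VAE process can be implemented by optimizing the following objective functions for the documents ($\mathcal{L}_d$) and the candidate keyphrases ($\mathcal{L}_k$):
$$\mathcal{L}_d = -\mathbb{E}_{q(z|D)}\left[\log p(D|z)\right] + D_{KL}\left(q(z|D)\,\|\,p(z)\right)$$
$$\mathcal{L}_k = -\mathbb{E}_{q(z|K)}\left[\log p(K|z)\right] + D_{KL}\left(q(z|K)\,\|\,p(z)\right)$$
where $D_{KL}$ indicates the Kullback-Leibler divergence between two distributions.
The loss of the matching module, $\mathcal{L}_t$, combines the triplet loss $\mathcal{L}_m$ with the VAE objectives $\mathcal{L}_d$ and $\mathcal{L}_k$, where $\lambda \in (0, 1)$ is the balance factor. Through concept consistency matching, we expect to align keyphrases with high-level concepts (i.e., topics or structures) in the document to assist the model in extracting keyphrases with more important concepts.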
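A minimal sketch of the matching step follows. It assumes that the learnable mapping matrix is used in a bilinear score between the phrase concept vector and the document concept vector, and that the triplet loss takes the usual margin form; both the bilinear form and the function names are assumptions for illustration rather than a confirmed description of KIEMP.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def bilinear_score(z_phrase: Vector, w: Matrix, z_doc: Vector) -> float:
    """Compute z_phrase^T W z_doc (assumed form of the consistency score)."""
    return sum(
        z_phrase[i] * w[i][j] * z_doc[j]
        for i in range(len(z_phrase))
        for j in range(len(z_doc))
    )


def triplet_loss(pos: List[Vector], neg: List[Vector], w: Matrix,
                 z_doc: Vector, margin: float = 1.0) -> float:
    """Margin loss that pushes keyphrase scores above non-keyphrase scores.

    The document vector acts as the anchor; pos/neg hold the concept
    vectors of keyphrases and non-keyphrases of that document.
    """
    loss = 0.0
    for z_pos in pos:
        s_pos = bilinear_score(z_pos, w, z_doc)
        for z_neg in neg:
            s_neg = bilinear_score(z_neg, w, z_doc)
            loss += max(0.0, margin - s_pos + s_neg)
    return loss
```

With an identity mapping matrix the score reduces to a dot product, which makes small hand-checked examples easy to build.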

Model Training and Inference
Multi-task learning has played an essential role in various fields (Srna et al., 2018) and has recently been widely used in natural language processing tasks (Sun et al., 2020; Mu et al., 2020). Therefore, our framework allows end-to-end learning of syntactic chunking, saliency ranking, and concept matching. We define the training objective of the entire framework as the linear combination of $\mathcal{L}_c$, $\mathcal{L}_r$, and $\mathcal{L}_t$:
$$\mathcal{L} = \epsilon_1 \mathcal{L}_c + \epsilon_2 \mathcal{L}_r + \epsilon_3 \mathcal{L}_t$$
where the hyper-parameters $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ balance the effects of the importance estimation from the three perspectives. Specifically, $\epsilon_1 + \epsilon_2 + \epsilon_3 = 1$.
In this paper, KIEMP aims to extract keyphrases according to their saliency. It contains three modules: syntactic accuracy chunking, information saliency ranking, and concept consistency matching. Chunking and matching are used to push the ranking module to rank the proper candidate keyphrases first. Therefore, only the ranking module is used in the inference process (test phase).
Table 2 summarizes the statistics of each testing set.
OpenKP consists of around 150K documents sampled from the index of the Bing search engine. In OpenKP, we follow the official split of training (134K documents), development (6.6K documents), and testing (6.6K documents) sets. The keyphrases for each document in OpenKP were labeled by expert annotators, with each document assigned 1-3 keyphrases. As a requirement, all the keyphrases appeared in the original document (Xiong et al., 2019).
KP20k contains a large amount of high-quality scientific metadata in the computer science domain from various online digital libraries (Meng et al., 2017). We follow the official setting of this dataset and split it into training (528K documents), validation (20K documents), and testing (20K documents) sets. From the training set of ...
To verify the robustness of KIEMP, we also test the model trained on the KP20k dataset on four widely adopted keyphrase extraction data sets: Inspec, Krapivin, Nus, and SemEval. In this paper, we focus on keyphrase extraction. Therefore, only the keyphrases that appear in the documents are used for training and evaluation.

Baselines
This paper focuses on the comparisons with the state-of-the-art baselines and chooses the following keyphrase extraction models as our baselines.
TextRank (Mihalcea and Tarau, 2004) is an unsupervised algorithm based on weighted graphs. Given a word graph built on co-occurrences, it calculates the importance of candidate words with PageRank. The importance of a candidate keyphrase is then estimated as the sum of the scores of its constituent words.
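As an illustration of this baseline, the sketch below runs a weighted PageRank over a word co-occurrence graph and scores a phrase as the sum of its word scores. The window size, damping factor, iteration count, and function names are assumptions for illustration; the original TextRank additionally filters words by part of speech, which is omitted here.

```python
from collections import defaultdict
from typing import Dict, List


def textrank_word_scores(tokens: List[str], window: int = 2,
                         damping: float = 0.85,
                         iterations: int = 50) -> Dict[str, float]:
    """PageRank over an undirected, weighted word co-occurrence graph."""
    weights: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        # Connect each word to the next `window` words (no self-loops).
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                weights[word][tokens[j]] += 1.0
                weights[tokens[j]][word] += 1.0
    nodes = list(weights)
    scores = {w: 1.0 for w in nodes}
    for _ in range(iterations):
        new_scores = {}
        for w in nodes:
            # Each neighbour v passes on its score in proportion to edge weight.
            rank = sum(
                weights[v][w] / sum(weights[v].values()) * scores[v]
                for v in weights[w]
            )
            new_scores[w] = (1.0 - damping) + damping * rank
        scores = new_scores
    return scores


def phrase_score(phrase: str, word_scores: Dict[str, float]) -> float:
    """Score a candidate phrase as the sum of its words' scores."""
    return sum(word_scores.get(w, 0.0) for w in phrase.split())
```

Summing word scores favours longer phrases, which is one reason later methods normalize or learn the phrase score instead.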
TFIDF (Jones, 2004) is computed based on candidate frequency in the given text and inverse document frequency.
CopyRNN (Meng et al., 2017) uses the attention mechanism as the copy mechanism to extract keyphrases from the given document.
BLING-KPE (Xiong et al., 2019) first concatenates the pre-trained language model (ELMo (Peters et al., 2018)) as word embeddings, visual as well as positional features, and then uses a CNN network to obtain n-gram phrase embeddings for binary classification.
SMART-KPE+R2J presents a multi-modal method for the keyphrase extraction task, which leverages lexical and visual features to enable strategy induction as well as meta-level features to aid in strategy selection.
DivGraphPointer (Sun et al., 2019) combines the advantages of traditional graph-based ranking methods and recent neural network-based approaches. Furthermore, they also propose a diversified point network to generate a set of diverse keyphrases out of the word graph in the decoding process.
Div-DGCN (Zhang et al., 2020) proposes to adopt the Dynamic Graph Convolutional Networks (DGCN) to acquire informative latent document representation and better model the compositionality of the target keyphrases set.
SKE-Large-CLS (Mu et al., 2020) obtains span-based representations for each keyphrase and further learns to capture the similarity between keyphrases in the source document to get better keyphrase predictions.
In this paper, for ease of presentation, all the baselines are divided according to three perspectives: syntax, saliency, and the combination of syntax and saliency. BLING-KPE, CopyRNN, and ChunkKPE belong to the first group; TFIDF, TextRank, and RankKPE belong to the second; and DivGraphPointer, Div-DGCN, SKE-Large-CLS, SMART-KPE+R2J, and JointKPE belong to the last.

Evaluation Metrics
For the keyphrase extraction task, the performance of a keyphrase model is typically evaluated by comparing the top k predicted keyphrases with the target keyphrases (ground-truth labels). The evaluation cutoff k can be a fixed number (e.g., $F_1$@5 compares the top-5 keyphrases predicted by the model with the ground truth to compute an $F_1$ score). Following previous work (Meng et al., 2017; Sun et al., 2019), we adopt macro-averaged recall and F-measure ($F_1$) as evaluation metrics, and k is set to 1, 3, 5, and 10. In the evaluation, we apply the Porter Stemmer (Porter, 2006) to both target keyphrases and extracted keyphrases when determining whether keyphrases and individual words match.
The results of JointKPE are evaluated via the code provided with its corresponding paper.
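The top-k evaluation described above can be sketched as follows. To stay dependency-free, the snippet replaces Porter stemming with simple lowercasing and whitespace normalization, and it divides precision by the number of predictions actually returned; both choices, along with the function names, are assumptions of this sketch and may differ from the evaluation scripts used in the experiments.

```python
from typing import List, Tuple


def normalize(phrase: str) -> str:
    """Stand-in for stemming: lowercase and collapse whitespace (assumption)."""
    return " ".join(phrase.lower().split())


def prf_at_k(predicted: List[str], gold: List[str], k: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 of the top-k predictions for one document."""
    top_k: List[str] = []
    for p in predicted:
        if len(top_k) >= k:
            break
        q = normalize(p)
        if q not in top_k:  # drop duplicates after normalization
            top_k.append(q)
    gold_set = {normalize(g) for g in gold}
    hits = sum(1 for p in top_k if p in gold_set)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def macro_average(per_doc: List[Tuple[float, float, float]]) -> Tuple[float, ...]:
    """Average each metric over documents (macro-averaging)."""
    n = len(per_doc)
    return tuple(sum(doc[i] for doc in per_doc) / n for i in range(3))
```

For a document with three gold keyphrases that all appear among five predictions, precision at 5 is 0.6, recall is 1.0, and $F_1$@5 is 0.75.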

Implementation Details
Implementation details of our proposed models are as follows. The maximum document length is 512 due to BERT limitations (Devlin et al., 2019), and documents are zero-padded or truncated to this length. Training used 6 GeForce RTX 2080 Ti GPUs and took about 31 hours and 77 hours for the OpenKP and KP20k datasets, respectively. Table 3 lists the parameters of our model. Furthermore, the model was implemented in PyTorch (Paszke et al., 2019) using the Hugging Face reimplementation of RoBERTa (Wolf et al., 2019).

Results and Analysis
This section investigates the performance of the proposed KIEMP on six widely-used benchmark datasets (OpenKP, KP20k, Inspec, Krapivin, Nus, and Semeval) from four facets. The first one demonstrates its superiority by comparing it with ten baselines in terms of several metrics. The second one is to verify the effect of each component via ablation tests. The third one is to verify the robustness of KIEMP. The last one is to explicitly show the keyphrase extraction results of KIEMP via an example (two testing documents).

Overall Accuracy
The overall performance of different algorithms on two benchmarks (OpenKP and KP20k) is summarized in Table 4. We can see that the supervised methods outperform all the unsupervised algorithms (TFIDF and TextRank). This is not surprising, since the supervised methods are trained end-to-end with supervised data. Among the supervised baselines, the methods using additional features are better than those without them. The reason is that models with additional features effectively encode keyphrases from multiple feature perspectives, which helps the model measure the importance of each keyphrase and thus improves keyphrase extraction. Intuitively, this is the same idea as our proposed method: KIEMP considers the importance of keyphrases from multiple perspectives and measures the importance of each keyphrase fairly. The difference is that we do not need additional features. In many practical applications of keyphrase extraction, no additional feature information (e.g., visual features) is available. Notably, even without additional features, the proposed KIEMP outperforms the best baseline in most cases. Specifically, the R@1, R@3, R@5, and $F_1$@1 results of KIEMP on OpenKP are 8.79%, 7.89%, 9.60%, and 2.89% higher than those of the previous SOTA method (SMART-KPE+R2J), respectively. Compared with recent baselines (ChunkKPE, RankKPE, and JointKPE), KIEMP performs consistently better on all metrics on both datasets. These results demonstrate the benefits of estimating the importance of keyphrases from multiple perspectives simultaneously and the effectiveness of our multi-task learning strategy.
Furthermore, to verify the robustness of KIEMP, we also test the KIEMP trained with KP20k dataset on four widely-adopted keyphrase extraction data sets. It can be seen from Figure 2 that KIEMP is superior to the best baseline (JointKPE). These results further confirm that concept consistency matching is helpful to learn phrase representation and properly estimate their saliency.
Figure 3: Performance of KIEMP ablations on the OpenKP development set. "S", "SS", and "SSC" indicate the syntactic accuracy chunking of KIEMP, the KIEMP without the concept consistency matching, and the KIEMP respectively.

Ablation Study
We further conducted a detailed analysis of the KIEMP model on the OpenKP dataset.
Number of Perspectives. Figure 3 shows ablation results on the variations of KIEMP. Each variation adds a perspective and keeps all others unchanged. It can be seen from the results that "SSC" improves on all evaluation metrics compared with "SS" and "S", which we attribute to our concept consistency matching module. We consider that this improvement comes from two benefits. One is that the high-level concepts captured by a deep latent variable model may contain topic and structure features, which are essential information for evaluating the importance of phrases. The other is that concept consistency matching uses a deep latent variable model to capture concepts. Here, the latent variable is characterized by a probability distribution over possible values rather than a fixed value, which lets the model account for uncertainty and can lead to more robust phrase representation learning.
Concept Dimension. We verify the effectiveness of using different concept dimensions. From Table 5, we can see that increasing the dimension of the latent concept representation has little effect on the result of keyphrase extraction; if anything, the smaller the dimension, the better the result. Furthermore, in Table 4, the improvement of our proposed KIEMP model on the $F_1$@1 evaluation metric is higher than on the $F_1$@3 and $F_1$@5 evaluation metrics on the KP20k dataset. We consider the main reason to be that our concept representation may capture the high-level conceptual information of phrases or documents, such as topics and structure information. Therefore, KIEMP with the concept consistency matching module focuses more on extracting keyphrases closest to the main topic of the given document.

Case Study
To further illustrate the effectiveness of the proposed model, we present a case study on the results of the keyphrases extracted by different algorithms. Table 6 presents the results of KIEMP without concept consistency matching and KIEMP. From the first example, we can see that our KIEMP model is more inclined to extract keyphrases closer to the central semantics of the input document, which benefits from our concept consistency matching model. From the second example, we can see that the keyphrases extracted by KIEMP without concept consistency matching contain some redundant or meaningless phrases. The main reason may be that the KIEMP without concept consistency matching does not measure the importance of phrases from multiple perspectives, which leads to biased extraction. On the contrary, the keyphrases extracted by KIEMP are all around the main concepts of the example document, i.e., "leadership". It further demonstrates the effectiveness of our proposed model.
(A) Part of the Input Document: The Great Plateau is a large region of land that is secluded from other parts of Hyrule, as its steep slopes prevent anyone from traveling to and from it without special equipment, such as the Paraglider. The only active inhabitant is the Old Man, a mysterious ... (URL: https://zelda.gamepedia.com/Great_Plateau)
Target Keyphrase: (1) great plateau ; (2) breath of the wild ; (3) hyrule
KIEMP without Concept Consistency Matching: (1) great plateau ; (2) hyrule ; (3) breath of the wild ; (4) paraglider ; (5) zelda
KIEMP: (1) great plateau ; (2) breath of the wild ; (3) hyrule ; (4) paraglider ; (5) starting region
(B) Part of the Input Document: Transformational leaders also depend on visionary leadership to win over followers, but they have an added focus on employee development. For example, a transformational leader might explain how her plan for the future serves her employees' interests and how she will support them through the changes ...

Conclusions and Future Work
We propose KIEMP, a new approach that estimates the importance of keyphrases from multiple perspectives. Benefiting from the designed syntactic accuracy chunking, information saliency ranking, and concept consistency matching modules, KIEMP can estimate keyphrase importance fairly. A series of experiments demonstrates that KIEMP outperforms the existing state-of-the-art keyphrase extraction methods. In the future, it will be interesting to introduce an adaptive approach in KIEMP to filter out meaningless phrases.