Unsupervised Keyphrase Extraction by Jointly Modeling Local and Global Context

Embedding-based methods are widely used for unsupervised keyphrase extraction (UKE) tasks. Generally, these methods simply calculate similarities between phrase embeddings and the document embedding, which is insufficient to capture the different kinds of context needed for a more effective UKE model. In this paper, we propose a novel method for UKE, where local and global contexts are jointly modeled. From a global view, we calculate the similarity between a certain phrase and the whole document in the vector space, as traditional embedding-based models do. From a local view, we first build a graph structure based on the document, in which phrases are regarded as vertices and the edges are similarities between vertices. Then, we propose a new centrality computation method to capture local salient information based on this graph structure. Finally, we combine the modeling of global and local context for ranking. We evaluate our model on three public benchmarks (Inspec, DUC 2001, SemEval 2010) and compare it with existing state-of-the-art models. The results show that our model outperforms most models while generalizing better to input documents of different domains and lengths. An additional ablation study shows that both local and global information are crucial for unsupervised keyphrase extraction.


Introduction
The keyphrase extraction (KE) task aims to extract a set of words or phrases from a document that represent the salient information of the document (Hasan and Ng, 2014). KE models can be divided into supervised and unsupervised ones. Supervised methods need large-scale annotated training data and often perform poorly when transferred to datasets from different domains or of different types. Compared with supervised methods, unsupervised methods are more universal and adaptive, since they extract phrases based on information from the input document itself. In this paper, we focus on unsupervised keyphrase extraction (UKE) models.

UKE has been widely studied (Mihalcea, 2004; Wan and Xiao, 2008a; Bougouin et al., 2013; Boudin, 2018; Bennani-Smires et al., 2018; Sun et al., 2020). Recently, with the development of text representation, embedding-based models (Bennani-Smires et al., 2018; Sun et al., 2020) have achieved promising results and become the new state of the art. Usually, these methods compute phrase embeddings and a document embedding with static embedding models (e.g. GloVe (Pennington et al., 2014), Doc2Vec (Le and Mikolov, 2014), Sent2Vec (Pagliardini et al., 2018)) or dynamic pre-trained language models (e.g. BERT (Devlin et al., 2019)). Then, they rank candidate phrases by computing the similarity between each phrase and the whole document in the vector space. Although these methods perform better than traditional methods (Mihalcea, 2004; Wan and Xiao, 2008a; Bougouin et al., 2013), the simple similarity between phrase and document is insufficient to capture different kinds of context and limits performance.

Figure 1 gives an intuitive explanation of the importance of context modeling. The nodes are candidate phrase embeddings and the star is the document embedding. Each black circle represents one local context: nodes in the same black circle are candidate phrases that all relate to one piece of vital local information (e.g. one topic or aspect of the document). Nodes in the red circle are candidate phrases that are similar to the document semantics. If we only model the global context by computing the similarity between candidate phrases and the document, the model will tend to select the red nodes and ignore the local salient information in the three clusters. To extract keyphrases accurately, we should take both the local context (black circles) and the global context (red circle) into consideration.

Figure 1: Visualization of the embedding space. Nodes refer to candidate phrase representations and the star is the document representation. Black circles mark clusters which contain local salient information; the red circle marks phrases with high global similarity.
To adequately obtain information from context, we propose a novel method that jointly models the local and global context of the input document. Specifically, we calculate the similarity between candidate phrases and the whole document to model the global context. For local context modeling, we first build a graph structure that represents each phrase as a node, with edges weighted by the similarity between nodes. Then, we propose a new centrality computation method, based on the insight that the most important information typically occurs at the start or end of a document (the document boundary) (Lin and Hovy, 1997; Teufel, 1997; Dong et al., 2021), to measure the salience of local context on the graph structure. Finally, we combine the measures of global similarity and local salience for ranking. To evaluate the effectiveness of our method, we compare it with recent state-of-the-art models on three public benchmarks (Inspec, DUC 2001, SemEval 2010). The results show that our model outperforms most models while generalizing better to input documents of different domains and lengths. It is worth mentioning that our model achieves a substantial improvement on long scientific documents.

Methodology
The overall framework of our model is shown in Fig. 2. We follow the general process of unsupervised keyphrase extraction. The main steps are as follows: (1) We tokenize the document and tag it with part-of-speech (POS) tags. (2) We extract candidate phrases based on the POS tags, keeping only noun phrases (NP) that consist of zero or more adjectives followed by one or more nouns (Wan and Xiao, 2008b). (3) We use a pre-trained language model to map the document text into a low-dimensional vector space and extract vector representations of the candidate phrases and the whole document. (4) We score each candidate phrase with a ranking algorithm which jointly models the global and local context. (5) We extract phrases according to the scores from the ranking algorithm.
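To make the pipeline concrete, the following sketch illustrates steps (1) and (2), using NLTK as an illustrative stand-in for the StanfordCoreNLP tools used in our implementation; the chunking grammar simply encodes the "zero or more adjectives followed by one or more nouns" pattern.

```python
# Illustrative sketch of steps (1)-(2): tokenize, POS-tag, and chunk noun phrases.
# NLTK is used here as a stand-in for StanfordCoreNLP.
# May require: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

# Candidate pattern: zero or more adjectives followed by one or more nouns.
GRAMMAR = "NP: {<JJ.*>*<NN.*>+}"

def extract_candidates(text):
    tokens = nltk.word_tokenize(text)              # step (1): tokenize
    tagged = nltk.pos_tag(tokens)                  # step (1): POS tag
    chunker = nltk.RegexpParser(GRAMMAR)
    tree = chunker.parse(tagged)                   # step (2): chunk noun phrases
    candidates = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        candidates.append(" ".join(word for word, tag in subtree.leaves()))
    return candidates

print(extract_candidates("Unsupervised keyphrase extraction jointly models local and global context."))
```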
The main contribution of the whole process is the ranking algorithm we propose in step (4), which can be divided into three components: 1) phrase-document similarity for modeling global context; 2) boundary-aware centrality for modeling local context; 3) the combination of global and local information. We introduce the details of these components in this section.

Document and Phrases Representations
Before introducing the ranking algorithm, we first make steps (1)-(3) clear. We follow common practice and use the StanfordCoreNLP tools to accomplish steps (1) and (2). After these universal steps, the document $D$ is tokenized into tokens $\{t_1, t_2, ..., t_N\}$ and candidate phrases $\{KP_0, KP_1, ..., KP_n\}$ are extracted from $D$. Different from previous works (Bennani-Smires et al., 2018) which use static vectors to represent the tokens in the document, we employ BERT, a strong pre-trained language model, to obtain contextualized dynamic vector representations by Equ. (1):

$$\{H_1, H_2, ..., H_N\} = \mathrm{BERT}(\{t_1, t_2, ..., t_N\}) \quad (1)$$
where $H_i$ is the vector representation of token $t_i$. Then, we obtain the vector representation $H_{KP_i}$ of each candidate phrase by averaging the phrase's token vectors. The document vector representation $H_D$ is computed with a max-pooling operation over the token vectors by Equ. (2):

$$H_D = \mathrm{MaxPooling}(\{H_1, H_2, ..., H_N\}) \quad (2)$$

Figure 2: The overall framework of our model. (1) Tokenize the document and tag it with part-of-speech (POS) tags. (2) Extract noun phrases that consist of zero or more adjectives followed by one or multiple nouns. (3) Obtain embeddings of tokens in the document with BERT. (4) Compute boundary-aware centrality and global relevance of each candidate phrase with global and local similarities. (5) Rank and extract keyphrases from candidate phrases with scores from the previous step.

Based on these representations, we introduce the core ranking algorithm of our model in the next section.
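As a rough illustration of step (3), the sketch below computes contextualized token embeddings with a HuggingFace BERT model, averages the subword vectors of each candidate phrase, and max-pools the token vectors for the document representation. The model name, the character-span interface for candidates, and the subword-alignment details are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of step (3): token embeddings from BERT, phrase vectors by
# averaging (Equ. (1) then averaging), document vector by max-pooling (Equ. (2)).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(document, candidate_spans):
    """candidate_spans: list of (start, end) character offsets of candidate phrases."""
    enc = tokenizer(document, return_tensors="pt", truncation=True,
                    return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]          # (num_subwords, 2)
    with torch.no_grad():
        H = model(**enc).last_hidden_state[0]       # (num_subwords, hidden)
    H_doc = H.max(dim=0).values                     # max-pooling over tokens
    phrase_vecs = []
    for start, end in candidate_spans:
        # subwords whose character span overlaps the candidate phrase
        mask = (offsets[:, 0] < end) & (offsets[:, 1] > start)
        phrase_vecs.append(H[mask].mean(dim=0))     # average of the phrase's token vectors
    return torch.stack(phrase_vecs), H_doc
```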

Phrase-Document Similarity
We first introduce the computation of the phrase-document similarity for modeling global context. Specifically, we empirically employ the Manhattan distance (i.e. L1 distance) to compute the similarity by Equ. (3):

$$R(H_{KP_i}) = \frac{1}{\|H_{KP_i} - H_D\|_1} \quad (3)$$

where $\|\cdot\|_1$ denotes the Manhattan distance and $R(H_{KP_i})$ represents the relevance between candidate phrase $i$ and the whole document.
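A minimal sketch of this relevance computation, assuming (as in Equ. (3) above) that relevance is the inverse of the L1 distance between a phrase vector and the document vector:

```python
import torch

def global_relevance(phrase_vecs, doc_vec):
    """Relevance of each candidate phrase to the whole document:
    the inverse of the Manhattan (L1) distance between the two vectors."""
    l1 = torch.cdist(phrase_vecs, doc_vec.unsqueeze(0), p=1).squeeze(1)  # (n,)
    return 1.0 / (l1 + 1e-8)   # small epsilon avoids division by zero
```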

Traditional Degree Centrality
Graph-based ranking algorithms for keyphrase extraction represent a document as a graph $G = (V, E)$, where $V = \{H_{KP_i}\}_{i=1,...,n}$ is the set of vectors that represent the nodes in the graph (i.e. the candidate phrases in the document), and $E = \{e_{ij}\}$ is the set of edges that represent interactions between candidate phrases. In this paper, we simply employ the degree of a node as its centrality to measure its importance. The degree centrality of candidate phrase $i$ is computed with Equ. (4):

$$C(H_{KP_i}) = \sum_{j=1, j \neq i}^{n} e_{ij} \quad (4)$$

where $e_{ij} = H_{KP_i}^{\top} \cdot H_{KP_j}$ is the dot-product similarity score for each pair $(H_{KP_i}, H_{KP_j})$. We could also use other similarity measures (e.g. cosine similarity), but we empirically find that the simple dot-product performs better.
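Degree centrality as in Equ. (4) can be sketched directly from the dot-product edge weights; zeroing the diagonal (ignoring self-loops) is an implementation assumption:

```python
import torch

def degree_centrality(phrase_vecs):
    """Degree centrality as the sum of dot-product edge weights."""
    E = phrase_vecs @ phrase_vecs.T          # e_ij = H_KPi^T . H_KPj
    E.fill_diagonal_(0.0)                    # ignore self-loops (assumption)
    return E.sum(dim=1), E                   # per-node centrality, edge matrix
```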

Boundary-Aware Centrality
Traditional centrality computation is based on the assumption that a candidate phrase's contribution to another phrase's importance is not affected by their relative positions in the document, and that the similarities between two graph nodes are symmetric. However, intuitively, phrases that appear at the start or the end of a document should be more important than others. To implement this insight, we propose a new centrality computation method called boundary-aware centrality, based on the assumption that important information typically occurs near the boundaries (the start and end) of documents (Lin and Hovy, 1997; Teufel, 1997).
We reflect this assumption by employing a boundary function $d_b(i)$ over the positions of candidate phrases, formulated as Equ. (5).
where $n$ is the number of candidate phrases and $\alpha$ is a hyper-parameter that controls the relative importance of the start and end of a document. For nodes $i$ and $j$, if $d_b(i) < d_b(j)$, then node $i$ is closer to the boundary than node $j$. When calculating the centrality of node $i$, we reduce the contribution of node $j$ to the centrality of node $i$. Based on this assumption and the boundary function $d_b(i)$, we reconstruct the centrality computation of node $i$ in the graph as Equ. (6), where $\lambda$ is used to reduce the influence, on the centrality of node $i$, of phrases that do not appear near the boundary. Besides, we employ a threshold $\theta = \beta(\max(e_{ij}) - \min(e_{ij}))$ to filter out the noise from nodes that are very different from node $i$: we remove their influence on the centrality by setting all $e_{ij} < \theta$ to zero. $\beta$ is a hyper-parameter that controls the filter boundary. With the introduction of this noise-filtering strategy, we rewrite Equ. (6) as Equ. (7), where $C(H_{KP_i})$ represents the local salience of candidate phrase $i$.
For most long documents and news articles, the author tends to put the key information at the beginning of the document. Florescu and Caragea (2017a) point out that position-biased weights can greatly improve keyphrase extraction performance; they use the sum of the inverses of a word's positions in the document as its weight. For example, a word appearing at the 2nd, 5th and 10th positions has a weight $p(w_i) = 1/2 + 1/5 + 1/10 = 0.8$. Our boundary-aware centrality already considers relative position information via the boundary function. To prevent double counting, we follow the simpler position-bias weight from (Sun et al., 2020), which only considers where a candidate phrase first appears. The position-bias weight is computed as $p(KP_i) = \frac{1}{p_1}$, where $p_1$ is the position of the candidate keyphrase's first appearance. A softmax function is then used to normalize the position-bias weights:

$$\hat{p}(KP_i) = \frac{\exp(p(KP_i))}{\sum_{j=1}^{n} \exp(p(KP_j))} \quad (8)$$

Boundary-aware centrality can then be rewritten as Equ. (9).
We finally employ $\hat{C}(H_{KP_i})$ to measure the local salience of candidate phrase $i$.
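For illustration, the sketch below implements one plausible reading of Equs. (5)-(9): the exact form of the boundary function $d_b$, the way $\lambda$ down-weights nodes farther from the boundary, and how the softmax-normalized position weight enters the centrality are assumptions drawn from the description above, not verbatim formulas.

```python
import torch

def boundary_aware_centrality(E, first_positions, alpha=1.0, beta=0.2, lam=0.9):
    """Hedged sketch of boundary-aware centrality.
    E               : (n, n) edge weights e_ij with zero diagonal
    first_positions : 1-based position of each candidate's first occurrence
    Assumes candidates are listed in order of appearance in the document."""
    n = E.size(0)
    idx = torch.arange(n, dtype=torch.float)
    # Assumed boundary function: distance to the nearer boundary, with alpha
    # trading off the start of the document against the end.
    d_b = torch.minimum(idx, alpha * (n - idx))

    # Noise filter: zero out edges below theta = beta * (max(e_ij) - min(e_ij)).
    theta = beta * (E.max() - E.min())
    E = torch.where(E < theta, torch.zeros_like(E), E)

    # Down-weight (by lambda) contributions from nodes farther from the
    # boundary than node i, as described in the text.
    farther = d_b.unsqueeze(0) > d_b.unsqueeze(1)     # farther[i, j]: d_b(j) > d_b(i)
    weights = torch.where(farther, torch.full_like(E, lam), torch.ones_like(E))
    C = (weights * E).sum(dim=1)

    # Position-bias weight 1/p1 of the first occurrence, normalized with softmax,
    # assumed to multiply the centrality.
    p = torch.softmax(1.0 / torch.as_tensor(first_positions, dtype=torch.float), dim=0)
    return p * C
```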

Rank with Global and Local Information
To consider global-level and local-level information at the same time, we simply combine the measures of global relevance $R(H_{KP_i})$ and local salience $\hat{C}(H_{KP_i})$ of each candidate phrase by multiplication to obtain the final score, as in Equ. (10):

$$S(H_{KP_i}) = R(H_{KP_i}) \cdot \hat{C}(H_{KP_i}) \quad (10)$$
Finally, we rank the candidate phrases by their final scores $S(H_{KP_i})$ and extract the top-ranked $k$ phrases as the keyphrases of the document.
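A minimal sketch of this final step, multiplying global relevance by local salience and taking the top-k candidates (inputs are assumed to be torch tensors aligned with the candidate list):

```python
def rank_keyphrases(candidates, relevance, centrality, k=10):
    """Final score as the product of global relevance and local salience;
    return the top-k candidate phrases."""
    scores = relevance * centrality
    order = scores.argsort(descending=True)
    return [candidates[int(i)] for i in order[:k]]
```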

Datasets and Evaluation Metrics
We evaluate our model on three public datasets: Inspec, DUC2001 and SemEval2010. The Inspec dataset (Hulth, 2003) consists of 2,000 short documents from scientific journal abstracts. We follow previous works (Bennani-Smires et al., 2018; Sun et al., 2020) in using the 500 test documents and the uncontrolled annotated keyphrases as ground truth. The DUC2001 dataset (Wan and Xiao, 2008a) is a collection of 308 long news articles with an average length of 828.4 tokens. The SemEval2010 dataset (Kim et al., 2010) contains full-length ACM papers. In our experiments, we use the 100 test documents and the combined set of author- and reader-annotated keyphrases. We follow common practice and evaluate the performance of our models in terms of the F1 measure at the top N extracted keyphrases (F1@N), applying stemming to both the extracted keyphrases and the ground truth. Specifically, we report F1@5, F1@10 and F1@15 of each model on the three datasets.
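For reference, a sketch of the F1@N metric with Porter stemming applied to both predictions and ground truth; deduplicating predictions after stemming is an assumption about the exact matching protocol.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_phrase(phrase):
    return " ".join(stemmer.stem(w) for w in phrase.lower().split())

def f1_at_n(predicted, gold, n):
    """F1@N with stemmed exact match between predicted and gold keyphrases."""
    pred = []
    for p in map(stem_phrase, predicted[:n]):
        if p not in pred:           # drop duplicates after stemming (assumption)
            pred.append(p)
    gold_set = {stem_phrase(g) for g in gold}
    correct = sum(1 for p in pred if p in gold_set)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```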

Comparison Models and Implementation Details
We compare our method with three types of models to comprehensively demonstrate its effectiveness. First, we compare with the traditional statistical methods TF-IDF and YAKE (Campos et al., 2018). Second, we compare with five strong graph-based ranking methods. TextRank (Mihalcea and Tarau, 2004) is the first attempt to convert text into a graph via word co-occurrence and employ PageRank to rank phrases. SingleRank (Wan and Xiao, 2008a) improves the graph construction with a sliding window. TopicRank (Bougouin et al., 2013) considers keyphrase extraction with topic distributions. PositionRank (Florescu and Caragea, 2017b) employs position information to weight the importance of phrases. MultipartiteRank (Boudin, 2018) splits the whole graph into sub-graphs and ranks them with graph-theoretic methods. Finally, we compare with three state-of-the-art embedding-based models. EmbedRank (Bennani-Smires et al., 2018) first embeds texts with Doc2Vec/Sent2Vec and measures the relevance between phrases and the document to select keyphrases. SIFRank (Sun et al., 2020) improves EmbedRank with contextualized embeddings from a pre-trained language model. KeyGames (Saxena et al., 2020) creatively introduces a game-theoretic approach into automatic keyphrase extraction.

Results
We report the results of our model in Tab. 1. Our model consistently outperforms most of the existing systems across the three datasets, which differ in document length and cover two different domains. SIFRank and SIFRank+ perform remarkably well on datasets with short inputs because document embeddings of short documents better represent the semantic information of the full document, and short documents contain less local information (e.g. fewer aspects), both of which help embedding-based models perform well. We can further see that models with global similarity (i.e. EmbedRank and SIFRank) all outperform graph-based models on shorter documents (i.e. DUC2001 and Inspec).
Compared with other works, our model and KeyGames, which is based on game theory, generalize better and can handle both short and long input documents well. The advantage of our model is most obvious on the long scientific document dataset SemEval2010, which mainly benefits from the boundary-aware centrality for modeling local context.

Ablation Study
We evaluate the contributions of the global and local components of our model with an ablation study; the results are shown in Tab. 2. From the results, we find that modeling local context is more important than modeling global context. When we remove the local information from our model, it reduces to an embedding-based model. The performance on SemEval2010 is not sensitive to the removal of the relevance-aware weighting. We conjecture that embeddings of long documents may contain multi-aspect information that distorts the measured similarity between a phrase and the whole document, which limits the influence of global information. Overall, these results show that jointly modeling global and local context is crucial for unsupervised keyphrase extraction, and that revisiting degree centrality is effective for modeling local context and meaningful for future work.

Impact of Hyper-Parameters
In this section, we first analyze the impact of the hyper-parameter β and then report the best settings on each dataset. We employ three hyper-parameters in our model. β is used to filter noise, and its impact can be seen in Fig. 3; β = 0.2 is a proper setting for all three datasets. α is used to control the relative importance of the start and end of a document: α < 1 means the start of the document is more vital, and α > 1 means the end of the document is more vital.
The best settings are α = 0.8, β = 0.2, λ = 0.9 on DUC2001; α = 0.5, β = 0.2, λ = 0.9 on Inspec; and α = 1.5, β = 0.2, λ = 0.8 on SemEval2010. From these settings, we can draw the following conclusions, which conform to the characteristics of these datasets. For DUC2001 and Inspec, most vital information occurs at the start of the document, since DUC2001 consists of news articles and Inspec of abstracts. For SemEval2010, the setting of α is the opposite because SemEval2010 consists of long scientific documents, where much key information occurs at the end of the document (the conclusion section). The settings of λ on the three datasets show that long documents need to reduce more of the influence from contexts not near the boundary, which is intuitive.

Impact of Different Similarity Measure Methods
Our model employs the Manhattan distance to measure the similarity between phrases and the whole document. We also experiment with other similarity measures. The results are shown in Tab. 3; the advantage of the Manhattan distance is obvious. We can also see that cosine similarity performs poorly and is not suitable for our model.

Case Study
In this section, we show an example from DUC2001 in Fig. 4. DUC2001 is a dataset of news articles. The correct keyphrases are underlined: red text marks the gold-truth keyphrases and blue text marks the phrases extracted by our model.
We can see that all keyphrases occur at the start of the document. Our model extracts many correct phrases that match the gold truth, as well as the phrase "existing word record", which is semantically the same as "word record" in the gold truth. It is worth mentioning that our model focuses on the boundary of the document, and most extracted phrases are located at the start of the document, which is controlled by our setting of α. This proves the effectiveness of our boundary-aware centrality. From the figure, we can also see that the wrong phrases are highly relevant to the topics of this document, which is influenced by our phrase-document relevance weighting. This example shows that jointly modeling global and local context can improve keyphrase extraction performance and that our model really captures local and global information.

Pre-trained Language Model
A pre-trained language model is a model trained on a large-scale unlabeled corpus to learn prior knowledge and then fine-tuned on downstream tasks. Even without fine-tuning, pre-trained language models can provide high-quality embeddings of natural text for unsupervised tasks. Different from static word embeddings such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Joulin et al., 2017), pre-trained language models encode words and sentences dynamically in context and avoid the OOV problem. In addition, pre-trained language models can provide document-level or sentence-level embeddings which contain more semantic information than Sent2Vec (Pagliardini et al., 2018) or Doc2Vec (Le and Mikolov, 2014).
ELMo (Peters et al., 2018) employs a Bi-LSTM structure and concatenates forward and backward representations to capture bidirectional information.
BERT (Devlin et al., 2019) is a pre-trained language model with a bidirectional Transformer structure. Compared with concatenating bidirectional information, BERT can capture context information better. There are also many other pre-trained language models, such as RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2020), etc. In this paper, we choose BERT, the most widely used one, to obtain vector representations of documents and phrases by merging the embeddings of tokens.

Unsupervised Keyphrase Extraction
Unsupervised keyphrase extraction methods can be divided into four main types: statistics-based, graph-based, topic-based, and embedding-based models. Statistics-based models (Campos et al., 2018) mainly analyze an article's statistical features such as word frequency, position, and other linguistic features. Topic-based models (Jardine and Teufel, 2014; Liu et al., 2009) focus on mining keyphrases by making use of the topic probability distribution of articles.
Graph-based models, which convert the document into a graph, are the most common in early works. Inspired by PageRank (Page et al., 1999), Mihalcea (2004) proposed TextRank, which converts the keyphrase extraction task into ranking nodes in a graph. After this, various works extended TextRank. Wan and Xiao (2008a) proposed SingleRank, which employs co-occurrences of tokens as edge weights. Bougouin et al. (2013) proposed TopicRank, which assigns a significance score to each topic by clustering candidate keyphrases. MultipartiteRank (Boudin, 2018) encodes topical information within a multipartite graph structure. More recently, Wang (2015) proposed WordAttractionRank, which adds the distance between word embeddings into SingleRank, and Florescu and Caragea (2017b) use node position weights, favoring words appearing earlier in the text. This position-biased weighting strategy is very useful for news articles and long documents.
Embedding-based models benefit from the development of representation learning, which maps natural language into low-dimensional vector representations. In recent years, embedding-based keyphrase extraction (Wang et al., 2016; Bennani-Smires et al., 2018; Papagiannopoulou and Tsoumakas, 2018; Sun et al., 2020) has therefore achieved good performance. Bennani-Smires et al. (2018) proposed EmbedRank, which ranks phrases by measuring the similarity between phrase embeddings and the document embedding. Sun et al. (2020) proposed SIFRank, which improves the static embeddings of EmbedRank with a pre-trained language model.
However, embedding-based models only measure the similarity between the document and candidate phrases and ignore local information. To jointly model global and local context (Zheng and Lapata, 2019; Liang et al., 2021), we revisit degree centrality, which can model local context, and turn it into a boundary-aware centrality. We then combine the global similarity with the boundary-aware centrality, which captures local salient information, to rank and extract phrases.

Conclusion and Future Work
In this paper, we point out that embedding-based models ignore local information, and we propose a novel model which jointly models global and local context. Our model revisits degree centrality and modifies it with a boundary function for modeling local context. We combine global similarity with our proposed boundary-aware centrality to extract keyphrases. Experiments on three public benchmarks demonstrate that our model effectively captures global and local information and achieves remarkable results. In future work, we will focus on introducing our boundary-aware mechanism into supervised end-to-end keyphrase extraction/generation models.