MDERank: A Masked Document Embedding Rank Approach for Unsupervised Keyphrase Extraction

Keyphrase extraction (KPE) automatically extracts phrases in a document that provide a concise summary of the core content, which benefits downstream information retrieval and NLP tasks. Previous state-of-the-art methods select candidate keyphrases based on the similarity between learned representations of the candidates and the document. They suffer performance degradation on long documents due to the discrepancy between sequence lengths, which causes a mismatch between the representations of keyphrase candidates and the document. In this work, we propose a novel unsupervised embedding-based KPE approach, Masked Document Embedding Rank (MDERank), to address this problem by leveraging a mask strategy and ranking candidates by the similarity between the embeddings of the source document and the masked document. We further develop a KPE-oriented BERT (KPEBERT) model through a novel self-supervised contrastive learning method, which is more compatible with MDERank than vanilla BERT. Comprehensive evaluations on six KPE benchmarks demonstrate that the proposed MDERank outperforms the state-of-the-art unsupervised KPE approach by an average improvement of 1.80 F1@15. MDERank further benefits from KPEBERT and overall achieves an average improvement of 3.53 F1@15 over SIFRank.


Introduction
Keyphrase extraction (KPE) automatically extracts a set of phrases in a document that provide a concise summary of the core content. KPE is highly beneficial for readers to quickly grasp the key information of a document and for numerous downstream tasks such as information retrieval and summarization. Previous KPE works include supervised and unsupervised approaches. Supervised approaches model KPE as sequence tagging (Sahrawat et al., 2019; Alzaidy et al., 2019; Martinc et al., 2020; Santosh et al., 2020; Nikzad-Khasmakhi et al., 2021) or sequence generation tasks (Kulkarni et al., 2021) and require large-scale annotated data to perform well. Since KPE annotations are expensive and large-scale KPE annotated data is scarce, unsupervised KPE approaches, such as TextRank (Mihalcea and Tarau, 2004), YAKE (Campos et al., 2018), and EmbedRank (Bennani-Smires et al., 2018), are the mainstay in industry deployment.

* Work done during an internship at Speech Lab, Alibaba Group.
Among unsupervised KPE approaches, embedding-based approaches including EmbedRank (Bennani-Smires et al., 2018) and SIFRank (Sun et al., 2020) yield the state-of-the-art (SOTA) performance. After selecting keyphrase (KP) candidates from a document using rule-based methods, embedding-based KPE approaches rank the candidates in descending order of a scoring function, which computes the similarity between the embeddings of the candidates and the source document. The top-K candidates are then chosen as the final KPs. We refer to these approaches as Phrase-Document-based (PD) methods. PD methods have two major drawbacks: (i) As a document is typically much longer than a candidate KP and usually contains multiple KPs, it is challenging for PD methods to reliably measure their similarities in the latent semantic space. Hence, PD methods are naturally biased towards longer candidate KPs, as shown by the example in Table 1. (ii) The embedding of a candidate KP in PD methods is computed without contextual information, further limiting the effectiveness of the subsequent similarity match.
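To make the PD scheme concrete, its ranking step can be sketched as follows. This is a minimal sketch with names of our own choosing; the count-based `embed` used in the test is purely illustrative (any phrase/document encoder can be plugged in), and even this toy setup reproduces the length bias described above.

```python
from typing import Callable, List

def cosine(u: List[float], v: List[float]) -> float:
    # Plain cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def pd_rank(doc_tokens: List[str],
            candidates: List[List[str]],
            embed: Callable[[List[str]], List[float]],
            top_k: int = 5) -> List[str]:
    # Phrase-Document ranking: embed each candidate in isolation (no
    # surrounding context) and the whole document, then keep the top-K
    # candidates by *descending* similarity to the document embedding.
    doc_emb = embed(doc_tokens)
    scored = [(" ".join(c), cosine(embed(c), doc_emb)) for c in candidates]
    scored.sort(key=lambda pair: -pair[1])
    return [phrase for phrase, _ in scored[:top_k]]
```

With a simple count-vector embedding, a longer candidate overlaps more of the document vector and is ranked first, illustrating drawback (i).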
In this paper, we propose a novel unsupervised embedding-based KPE method, denoted Masked Document Embedding Rank (MDERank), to address the above-mentioned drawbacks of PD methods. The architecture of MDERank is shown in Figure 1. The basic idea of MDERank is that a keyphrase plays an important role in the semantics of a document, so its absence from the document should cause a significant change in the document's semantics. We therefore propose to compare the embeddings of the original document and a variant in which the occurrence(s) of a candidate KP are masked. This leads to a new ranking principle based on the increasing order of the resulting similarities: a lower semantic similarity between the original document and its masked variant indicates a higher significance of the masked candidate.
Our proposed method can be deemed a Document-Document method, and it addresses the two weaknesses of the Phrase-Document methods: (i) Since the sequence lengths of the original document and the masked document are the same, comparing their similarities in the semantic space is more meaningful and reliable. (ii) The embedding of the masked document is computed from a sufficient amount of contextual information and hence can capture the semantics reliably using SOTA contextualized representation models such as BERT. Inspired by (Lewis et al., 2020; Han et al., 2021), where pre-trained language models (PLMs) trained on objectives close to the final downstream tasks achieve enhanced representations and improve fine-tuning performance, we further propose a novel self-supervised contrastive learning method on top of BERT-based models (dubbed KPEBERT).
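The Document-Document ranking above can be sketched as follows. The `embed` argument is a stand-in for any document encoder (BERT in the paper); the function names are ours, and the count-based embedding in the test is only for illustration.

```python
from typing import Callable, List, Tuple

def cosine(u: List[float], v: List[float]) -> float:
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def mask_candidate(tokens: List[str], cand: List[str]) -> List[str]:
    # Replace every occurrence of the candidate with [MASK] tokens,
    # one [MASK] per candidate token, preserving the sequence length.
    out, i, n = [], 0, len(cand)
    while i < len(tokens):
        if tokens[i:i + n] == cand:
            out.extend(["[MASK]"] * n)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

def mde_rank(tokens: List[str],
             candidates: List[List[str]],
             embed: Callable[[List[str]], List[float]],
             top_k: int = 5) -> List[str]:
    doc_emb = embed(tokens)
    scored: List[Tuple[str, float]] = []
    for cand in candidates:
        masked = mask_candidate(tokens, cand)
        scored.append((" ".join(cand), cosine(doc_emb, embed(masked))))
    # Ascending similarity: the more the masking changed the document
    # embedding, the more important the candidate.
    scored.sort(key=lambda pair: pair[1])
    return [phrase for phrase, _ in scored[:top_k]]
```

Note the sort is ascending, the opposite of PD methods: masking an important phrase moves the masked document far from the original in embedding space.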
The main contributions of this work include: (i) We propose a novel embedding-based unsupervised KPE approach (MDERank) that improves the reliability of computing KP candidate embeddings from contextualized representation models and improves robustness to different lengths of KPs and documents.
(ii) We propose a novel self-supervised contrastive learning method and develop a new pre-trained language model, KPEBERT.
(iii) We conduct extensive evaluations of MDERank on six diverse KPE benchmarks and demonstrate the robustness of MDERank to different document lengths. MDERank with BERT achieves 17.00, 21.99, and 23.85 average F1@5, F1@10, and F1@15 respectively, i.e., absolute gains of 1.69, 2.18, and 1.80 over the SOTA results from SIFRank (Sun et al., 2020), and of 4.44, 3.58, and 2.95 over EmbedRank with BERT. MDERank with KPEBERT achieves further absolute gains of 1.70, 2.18, and 1.73. Ablation analysis provides further insights into the effects of document lengths, encoder layers, and pooling methods.

Related Work
Unsupervised KPE Unsupervised KPE approaches do not require annotated data and there has been much effort in this line of research, as summarized in (Papagiannopoulou and Tsoumakas, 2020). Unsupervised KPE approaches can be categorized into statistics-based, graph-based, and embedding-based methods. Statistics-based models such as YAKE (Campos et al., 2018), EQPM (Li et al., 2017), and CQMine explore both conventional position and frequency features and new statistical features capturing context information. TextRank (Mihalcea and Tarau, 2004) is a representative graph-based method, which converts a document into a graph based on lexical unit co-occurrences and applies PageRank iteratively. Many graph-based methods can be considered modifications to TextRank that introduce extra features to compute weights for the edges of the constructed graph, e.g., SingleRank (Wan and Xiao, 2008), PositionRank (Florescu and Caragea, 2017), and ExpandRank (Wan and Xiao, 2008). The graph-based TopicRank (Bougouin et al., 2013) and MultipartiteRank (Boudin, 2018) methods enhance keyphrase diversity by constructing graphs based on clusters of candidate keyphrases. Among embedding-based methods, (Wang et al., 2015) first attempted to utilize word embeddings as an external knowledge base for keyphrase extraction and generation. Key2Vec (Mahata et al., 2018) uses FastText to construct phrase/document embeddings and then applies PageRank to select keyphrases from the candidates. EmbedRank (Bennani-Smires et al., 2018) measures the similarity between phrase and document embeddings for ranking. SIFRank (Sun et al., 2020) improves the static embeddings in EmbedRank with the pre-trained language model ELMo and the sentence embedding model SIF (Arora et al., 2017). KeyBERT (https://maartengr.github.io/KeyBERT/) is a toolkit for keyphrase extraction with BERT, following the PD paradigm. AttentionRank (Ding and Luo, 2021) uses a pre-trained language model to calculate the self-attention of a candidate within the context of a sentence and the cross-attention between a candidate and sentences within a document, in order to evaluate the local and global importance of candidates. As analyzed in Section 1, for embedding-based methods, using contextualized embedding models to compute candidate embeddings can be unreliable due to the lack of context, and these methods lack robustness to different lengths of keyphrases and documents. Our proposed MDERank approach effectively addresses these drawbacks.

Document: The paper presents a method for pruning frequent itemsets based on background knowledge represented by a Bayesian network. The interestingness of an itemset is defined as the absolute difference between its support estimated from data and from the Bayesian network. Efficient algorithms are presented for finding interestingness of a collection of frequent itemsets, and ...
SIFRank (best PD method): notation database attributes, research track paper dataset #attrs max, bayesian network bn output, bayesian network computing, interactive network structure improvement process
MDERank (proposed method): interestingness, pruning, frequent itemsets, pruning frequent itemsets, interestingness measures
Table 1: An example showing the bias of Phrase-Document (PD) methods towards longer candidate keyphrases at K = 5. Extracted keyphrases are shown in ranked order and those matching the gold labels are marked in red.
Contextual Embedding Models Early embedding models include static word embeddings such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017), the phrase embedding model HCPE (Li et al., 2018), and sentence embeddings such as Sent2Vec (Pagliardini et al., 2018) and Doc2Vec (Lau and Baldwin, 2016), all of which render word or sentence representations that do not depend on context. In contrast, pre-trained contextual embedding models, such as ELMo (Peters et al., 2018), incorporate rich syntactic and semantic information from context for representation learning and yield more expressive representations. BERT (Devlin et al., 2019) captures context information better through a bidirectional transformer encoder than the Bi-LSTM based ELMo, and has established SOTA results in a wide variety of NLP tasks. In one line of research, RoBERTa, XLNet, and many other BERT-variant PLMs have been proposed to further improve language representation capability. In another line of research, Longformer (Beltagy et al., 2020), BigBird (Zaheer et al., 2020), and other efficient transformers have been proposed to reduce the quadratic complexity of the transformer in sequence length in order to model long-range dependencies.
In this paper, we mainly use BERT as the default contextual embedding model. We also evaluate the performance of MDERank with these efficient transformers on long documents.

MDERank
In this section, we describe the proposed Masked Document Embedding Rank (MDERank) approach. To address the mismatch between the sequence lengths of a document and a candidate phrase, as well as the lack of contextual information in PD methods discussed in Section 1, we hypothesize that it is more reasonable to conduct the similarity comparison at the document-document level rather than at the phrase-document level.
Based on this hypothesis, for each candidate KP c_i of a document d, given its occurrence positions in d as [p_1, p_2, ..., p_t], MDERank replaces every occurrence {p_i}_{i=1}^{t} with the special placeholder token [MASK]. Note that the number of [MASK] tokens used for each occurrence equals the number of tokens in c_i, so the masked document has the same length as the original. We denote the resulting masked variant of the original document as d_M^{c_i}. We define the similarity score f(c_i) for ranking the significance of candidates as the cosine similarity between the document embeddings:

f(c_i) = cos(E(d), E(d_M^{c_i}))

where E(·) represents the embedding of a document. Note that a higher f(c_i) value indicates a lower ranking for c_i, which is opposite to the PD methods: the higher the similarity, the less important the candidate c_i is, since the semantics of the masked document change little when only a trivial phrase is masked. We use BERT (Devlin et al., 2019) as the default embedding model and investigate other contextual embedding models in Section 5.4. BERT is pre-trained through the self-supervised tasks of masked language modeling (MLM) and next sentence prediction (NSP) on large-scale unlabeled text from English Wikipedia (2,500M words) and BooksCorpus (800M words). A document d = {w_1, w_2, ..., w_n} is prepended with a special token [CLS] and then encoded by BERT to obtain the hidden representations of its tokens, {H_1, H_2, ..., H_n}. The document embedding E(d) is computed by pooling over these hidden representations:

E(d) = MaxPooling({H_1, H_2, ..., H_n})

We also investigate average pooling in Section 5.4 and other masking methods in Appendix A.
KPEBERT

To make the document encoder more compatible with MDERank, we pre-train KPEBERT with a self-supervised contrastive objective. Given a document d, a positive example d^+ (with a pseudo-label keyphrase masked) and a negative example d^- (with a non-keyphrase masked), we optimize a triplet loss:

L_triplet = max(sim(H_d, H_{d^+}) - sim(H_d, H_{d^-}) + m, 0)

where sim(H_x, H_y) denotes the similarity between the embeddings of documents x and y. We use cosine similarity (same as used for MDERank), and m is a margin parameter. We initialize KPEBERT from BERT-base-uncased and then incorporate the standard MLM pre-training task of BERT into the overall learning objective, to avoid forgetting the previously learned general linguistic knowledge:

L = L_triplet + λ L_MLM

where λ is a hyper-parameter balancing the two losses in the multi-task learning setting. KPEBERT differs from SimCSE in two major aspects: (i) KPEBERT uses pseudo labeling and positive/negative example sampling strategies (below), different from the standard dropout used by SimCSE to construct paired examples; (ii) KPEBERT uses a triplet loss whereas SimCSE uses a contrastive loss.
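The multi-task objective can be sketched in plain Python over pre-computed document embeddings. The sign convention below is our assumption, chosen to be consistent with MDERank's ranking principle: masking a true keyphrase should lower the similarity to the original document.

```python
from typing import List

def cosine(u: List[float], v: List[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def kpebert_loss(e_doc: List[float],
                 e_pos: List[float],
                 e_neg: List[float],
                 mlm_loss: float,
                 margin: float = 0.2,
                 lam: float = 0.1) -> float:
    # Triplet term: push the keyphrase-masked ("positive") variant to be
    # LESS similar to the original document than the non-keyphrase-masked
    # ("negative") variant (our sign-convention assumption).
    triplet = max(cosine(e_doc, e_pos) - cosine(e_doc, e_neg) + margin, 0.0)
    # Multi-task objective: triplet loss plus the lambda-weighted MLM loss.
    return triplet + lam * mlm_loss
```

The defaults mirror the paper's settings (m = 0.2, λ = 0.1); in real pre-training the embeddings and MLM loss would come from the BERT forward pass.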
Absolute Sampling For a document d, we first select candidate keyphrases C using POS tags with regular expressions, as described in Section 3. Then we obtain a set of keyphrases C′ extracted from d by another unsupervised KPE approach θ, as pseudo labels. We define the "keyphrases" as C′ and the "non-keyphrases" as C \ C′. We mask a "keyphrase" in the document with [MASK] tokens to construct a positive example d^+ for d, and select a "non-keyphrase" and perform the same mask operation to construct a negative example d^-.
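Absolute sampling can be sketched as below. Names are ours; candidates and pseudo labels are token tuples, and the masking helper mirrors MDERank's length-preserving mask.

```python
import random
from typing import List, Sequence, Set, Tuple

def mask_phrase(tokens: List[str], phrase: Sequence[str]) -> List[str]:
    # Replace each occurrence of the phrase with one [MASK] per token.
    out, i, n = [], 0, len(phrase)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == tuple(phrase):
            out.extend(["[MASK]"] * n)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

def absolute_sample(tokens: List[str],
                    candidates: List[Tuple[str, ...]],
                    pseudo_labels: Set[Tuple[str, ...]],
                    rng: random.Random) -> Tuple[List[str], List[str]]:
    # "Keyphrases" are candidates confirmed by the pseudo labels from the
    # auxiliary extractor theta; "non-keyphrases" are the remaining candidates.
    keyphrases = [c for c in candidates if c in pseudo_labels]
    non_keyphrases = [c for c in candidates if c not in pseudo_labels]
    pos = mask_phrase(tokens, rng.choice(keyphrases))      # d+
    neg = mask_phrase(tokens, rng.choice(non_keyphrases))  # d-
    return pos, neg
```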
Relative Sampling In this approach, after obtaining the set of KPs C′ extracted by θ, we randomly select a pair of KPs from C′ and use the higher-ranked one to construct a positive example and the other to construct a negative example through the mask operation. On one hand, since the decisions on "keyphrases" and "non-keyphrases" are fully based on the ranking predicted by θ, relative sampling may increase the influence of θ on the inductive bias of KPEBERT. On the other hand, relative sampling mines more hard negative samples, which may improve the performance of triplet-loss-based learning. We study the efficacy of these two sampling approaches for KPEBERT in Section 5.3.
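A sketch of relative sampling, under the same conventions as the absolute-sampling sketch (names ours, token-tuple phrases, length-preserving masking):

```python
import random
from typing import List, Sequence, Tuple

def mask_phrase(tokens: List[str], phrase: Sequence[str]) -> List[str]:
    # Replace each occurrence of the phrase with one [MASK] per token.
    out, i, n = [], 0, len(phrase)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == tuple(phrase):
            out.extend(["[MASK]"] * n)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

def relative_sample(tokens: List[str],
                    ranked_pseudo: List[Tuple[str, ...]],
                    rng: random.Random) -> Tuple[List[str], List[str]]:
    # Draw two pseudo-label keyphrases; the higher-ranked one (smaller
    # index in theta's ranking) yields the positive example, the other
    # the negative one. Both are hard examples, since theta considers
    # both of them keyphrases.
    i, j = sorted(rng.sample(range(len(ranked_pseudo)), 2))
    return mask_phrase(tokens, ranked_pseudo[i]), mask_phrase(tokens, ranked_pseudo[j])
```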

Datasets and Metrics
The pre-training data for KPEBERT is the WikiText language modeling dataset, with 100+ million tokens extracted from a set of verified Good and Featured articles on Wikipedia. We use six KPE benchmarks for evaluation. Four of them are scientific publications, including Inspec (Hulth, 2003), Krapivin (Krapivin et al., 2009), NUS (Nguyen and Kan, 2007), and SemEval-2010 (Kim et al., 2010), all widely used for evaluation in previous works (Meng et al., 2017; Sahrawat et al., 2019; Bennani-Smires et al., 2018; Meng et al., 2021). We also evaluate on the DUC2001 dataset (news articles) (Wan and Xiao, 2008) and the SemEval2017 dataset (science journals) (Augenstein et al., 2017). Table 2 shows the data statistics. For a fair comparison with SIFRank, we use the entire documents, including the abstract and main body. Following previous works, predicted KPs are deduplicated and KPE performance is evaluated as F1 at the top K predicted KPs (K ∈ {5, 10, 15}). Stemming is applied when computing F1.
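The evaluation protocol (stemming, deduplication, then F1 over the top-K predictions) can be sketched as follows. `simple_stem` is a crude stand-in for the Porter stemmer typically used in KPE evaluation, and the function names are ours.

```python
from typing import List

def simple_stem(w: str) -> str:
    # Crude suffix stripper, standing in for a real Porter stemmer.
    for suf in ("ing", "es", "s"):
        if w.endswith(suf) and len(w) > len(suf) + 2:
            return w[: -len(suf)]
    return w

def f1_at_k(predicted: List[str], gold: List[str], k: int) -> float:
    def norm(phrase: str) -> str:
        return " ".join(simple_stem(w) for w in phrase.lower().split())
    # Deduplicate predictions after stemming, keeping rank order.
    seen, preds = set(), []
    for p in predicted:
        n = norm(p)
        if n not in seen:
            seen.add(n)
            preds.append(n)
    preds = preds[:k]
    gold_n = {norm(g) for g in gold}
    tp = sum(1 for p in preds if p in gold_n)
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gold_n) if gold_n else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```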

Baselines and Training Details
The first group for each K in Table 3 lists the baselines; these are strong baselines for unsupervised keyphrase extraction, with SIFRank considered to be the previous SOTA.
We use YAKE (Campos et al., 2018) as θ to extract "keyphrases" for KPEBERT pre-training, due to its high efficiency and consistent performance. Effects of the choice of θ on KPEBERT are analyzed in Section 5.4, where we compare YAKE and TextRank as θ. The number of pseudo labels for absolute and relative sampling in KPEBERT pre-training is 10 and 20, respectively. λ is set to 0.1. The default parameter setting is the same as (Gao et al., 2021), except that we set the margin m for the triplet loss to 0.2 and the learning rate to 3e-5. We train for 10 epochs on 4 NVIDIA V100 GPUs, with a batch size of 2 per device and gradient accumulation of 4. Table 3 shows F1 at the top K ∈ {5, 10, 15} predictions. For each K, the first group shows the baseline results, and the second group shows results from our MDERank(BERT) (using BERT for embedding by default) and from MDERank using KPEBERT for embedding, MDERank(KPEBERT). MDERank(BERT) and MDERank(KPEBERT) perform consistently well on all benchmarks. MDERank(BERT) outperforms EmbedRank(BERT) by 2.95 average F1@15 and outperforms the previous SOTA SIFRank by 1.80 average F1@15. MDERank further benefits from KPEBERT: MDERank(KPEBERT) achieves a 3.53 average F1@15 gain over SIFRank, especially on the long-document datasets NUS and Krapivin. We also compute the average recall of KPs of different phrase lengths (PL) among the top-15 extracted KPs on all six benchmarks, for both EmbedRank(BERT) and MDERank(BERT), as shown in Table 4. We observe that EmbedRank has a strong bias towards longer phrases, with the PLs of its extracted KPs concentrated in [2, 3], whereas the PLs of KPs extracted by MDERank are more evenly distributed across the diverse datasets. This analysis confirms that MDERank indeed alleviates EmbedRank's bias towards longer phrases.

Performance Comparison
However, we observe that MDERank(BERT) has a large gap to SIFRank on DUC2001 and performs worse than EmbedRank(BERT) on Inspec. We investigate the reasons for this poorer performance. Different from the other datasets, which are collected from scientific publications, DUC2001 consists of open-domain news articles. The previous SOTA SIFRank introduces domain adaptation by combining weights from a common corpus and a domain corpus in the word weight function for computing sentence embeddings, which may contribute significantly to its superior performance on DUC2001. On Inspec, the average PL of the gold keyphrases is relatively high (see Table 2). Also, on this dataset, when we move candidates with only one token to the end of the ranking, MDERank(BERT) improves F1@5, F1@10, and F1@15 to 29.71, 38.15, and 39.46, improvements of 3.54, 4.34, and 3.29, respectively. These analyses show that the gold labels for Inspec are biased towards long PLs; EmbedRank, with its inductive bias towards long PLs, may benefit from this annotation bias and perform well. Nevertheless, MDERank still outperforms the baselines based on its best average F1 and top average rank among all methods on all datasets, proving its robustness across domains without any domain adaptation. Notably, MDERank particularly outperforms the baselines on long-document datasets, verifying that MDERank mitigates the performance degradation on long documents suffered by PD methods. We further investigate the effects of document length in Section 5.4. Absolute and relative sampling for KPEBERT achieve comparable performance on the six benchmarks, with absolute sampling ahead by a very small margin. Relative sampling performs better on NUS but worse on Inspec and SemEval2017. We plan to continue exploring sampling approaches in future work, to reduce the dependency on θ and improve KPEBERT.

Analyses
Effects of Document Length Section 5.3 demonstrates the superior performance of MDERank especially on long documents. We conduct two experiments to further analyze the effects of document length on the performance of PD methods and MDERank, choosing EmbedRank(BERT) to represent PD methods. In the first experiment, both approaches use BERT for embedding and we truncate each document to its first 128, 256, or 512 words. As shown in Table 5, the performance of EmbedRank(BERT) degrades as the document length grows, which verifies the weakness of EmbedRank(BERT): the increased document length exacerbates the discrepancy between the sequence lengths of the document and the KP candidates and the mismatch between their embeddings, which degrades KPE performance. In contrast, the performance of MDERank(BERT) improves steadily with increased document length, demonstrating the robustness of MDERank to document length and its capability to improve KPE with more context from longer documents.

Table 6: KPE performance on DUC2001 from EmbedRank(BERT) and MDERank(BERT) using different BERT layers for embedding and different pooling methods. AvgPooling and MaxPooling are applied to the output of a specific layer to produce document embeddings.
In the second experiment, we investigate the effects of document lengths beyond 512 on EmbedRank and MDERank. To accommodate documents longer than 512 words, we choose BigBird (Zaheer et al., 2020) as the embedding model. BigBird replaces the full self-attention in the Transformer with sparse global, local, and random attentions, reducing the complexity in sequence length from quadratic to linear. To select suitable datasets for this evaluation, we count the average percentage of gold-label KPs appearing in the first m words of a document on the three longest datasets, DUC2001, NUS, and Krapivin. We observe that the first 500 words cover nearly 90% of the gold KPs in DUC2001, whereas 50% of the gold KPs in Krapivin are in the first 2500 words and 50% of the gold KPs in NUS are in the first 2000 words. Therefore, we drop DUC2001 and use NUS and Krapivin for the second experiment, keeping the first 2500 and 2000 words of the documents in Krapivin and NUS, respectively. Table 7 shows that on NUS, when increasing the document length from 512 to 2000, MDERank(BigBird) outperforms MDERank(BERT) by 2.38 F1@15. On Krapivin, when increasing the document length from 512 to 2500, MDERank(BigBird) also improves over MDERank(BERT), by 0.12 F1@15.
In contrast, the performance of EmbedRank degrades dramatically with longer context, since more context introduces more candidates into the ranking and also worsens the discrepancy between the lengths of the document and the phrases, which in turn greatly reduces the accuracy of the similarity comparison.

Effects of Encoder Layers and Pooling Methods
The findings in (Jawahar et al., 2019; Kim et al., 2020; Rogers et al., 2020) show that BERT captures a rich hierarchy of linguistic information, with surface features in lower layers, syntactic features in middle layers, and semantic features in higher layers. We conduct experiments to understand the effects on MDERank and EmbedRank of using different BERT layers for embedding. We choose the third, the sixth, and the last layer of BERT-Base, and study the interactions between encoder layers and pooling methods. As shown in Table 6, for both AvgPooling and MaxPooling, F1 from MDERank(BERT) gains steadily as the layer index increases. On the contrary, with AvgPooling, F1 from EmbedRank(BERT) drops drastically as the layer rises from 3 to 12, probably because lower BERT layers provide rougher and more generic representations, which may alleviate the mismatch in similarity comparison for Phrase-Document methods. We test the average F1@5, F1@10, and F1@15 on all six datasets with the configuration that yields the best EmbedRank(BERT) results on DUC2001, i.e., AvgPooling and layer 3; the results are 3.7, 1.8, and 1.6 absolute lower than MDERank(BERT). Compared to AvgPooling, MaxPooling produces weaker document embeddings, which severely degrades the performance of EmbedRank and slightly degrades the performance of MDERank. For both pooling methods, MDERank using the last BERT layer achieves the best results, demonstrating that MDERank can fully benefit from stronger contextualized semantic representations.

Effects of the Choice of θ on KPEBERT We also investigate the effects of choosing different unsupervised KPE methods as θ for generating pseudo labels for KPEBERT pre-training. Balancing extraction speed and KPE quality, TextRank is another choice for θ besides YAKE.
As shown in Table 3, YAKE performs better than TextRank on long-document datasets but worse on short-document datasets. After replacing YAKE with TextRank as θ for producing pseudo labels and training KPEBERT, the KPE results of the respective MDERank(KPEBERT) with absolute sampling are shown in Table 8. We observe that MDERank(KPEBERT) using YAKE as θ significantly outperforms MDERank(KPEBERT) using TextRank as θ on both short-document and long-document datasets (except for worse results on Inspec and comparable results on SemEval2017). Although YAKE performs worse than TextRank on average over the six benchmarks, its better and more stable performance on long documents may be a crucial factor when choosing θ for pre-training KPEBERT. Results in Table 3 show that MDERank(KPEBERT) with YAKE for pseudo labeling yields superior performance on both short and long documents. In other words, KPEBERT benefits from YAKE's stable performance on long documents for pseudo labeling while exhibiting robustness to YAKE's relatively low performance on short documents.
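The pooling step in the layer ablation above can be sketched independently of the encoder. In a real pipeline the per-token vectors would come from a specific encoder layer (e.g., via Hugging Face's `output_hidden_states=True`, a tooling assumption on our part); the function name is ours.

```python
from typing import List

def pool_tokens(token_vectors: List[List[float]], method: str = "avg") -> List[float]:
    # Collapse one encoder layer's per-token hidden states
    # (n_tokens x hidden_size) into a single document embedding.
    if not token_vectors:
        raise ValueError("empty token sequence")
    dims = range(len(token_vectors[0]))
    if method == "avg":
        return [sum(vec[d] for vec in token_vectors) / len(token_vectors)
                for d in dims]
    if method == "max":
        # Element-wise max keeps only the strongest activation per
        # dimension, discarding more information than averaging.
        return [max(vec[d] for vec in token_vectors) for d in dims]
    raise ValueError(f"unknown pooling method: {method}")
```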

Conclusion
We propose a novel embedding-based unsupervised KPE approach, MDERank, which improves the reliability of the similarity match compared to previous embedding-based methods. We also propose a novel self-supervised learning method and develop a KPE-oriented PLM, KPEBERT. Experiments demonstrate that MDERank outperforms the SOTA on diverse datasets and further benefits from KPEBERT. Analyses further verify the robustness of MDERank to different lengths of keyphrases and documents, and show that MDERank benefits from longer context and stronger embedding models. Future work includes improving KPEBERT for MDERank by optimizing the sampling strategies and pre-training methods.
... significantly, especially on long documents. Mask Subset can partially address the diversity problem by reducing the number of nested candidates selected by MDERank. Figure 3 shows a comparison of diversity between Mask Subset and the other methods, where the evaluation metric for diversity is defined in Equation 4. The Phrase-Document method refers to EmbedRank(BERT). Figure 3 shows that MDERank with Mask Subset indeed boosts diversity over Mask All and even exceeds the gold labels on several datasets.

Diversity(d) = (t_u / t_n) × 100    (4)

Figure 3: Diversity scores from different methods on various datasets. A higher bar indicates better diversity. The diversity of the gold keyphrases is shown in blue on the right.
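Reading Equation 4 with t_u as the number of unique tokens and t_n as the total number of tokens over the extracted keyphrases (our assumption, since the symbols are not defined in this excerpt), the metric can be computed as:

```python
from typing import List

def diversity(keyphrases: List[str]) -> float:
    # Diversity(d) = t_u / t_n * 100, with t_u = unique tokens and
    # t_n = total tokens over the extracted keyphrases (our reading).
    # Nested phrases repeat tokens and therefore lower the score.
    toks = [w for kp in keyphrases for w in kp.lower().split()]
    return 100.0 * len(set(toks)) / len(toks) if toks else 0.0
```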

B Impact of Similarity Measure
Common similarity measures include cosine similarity and Euclidean distance. We conduct experiments to investigate the impact of the similarity measure on the performance of MDERank, with results shown in Table 10. We observe that the choice between cosine and Euclidean similarity is not a salient factor in the ranking results for either EmbedRank(BERT) or MDERank(BERT).