Improving Embedding-based Unsupervised Keyphrase Extraction by Incorporating Structural Information



Introduction
Keyphrase extraction is the fundamental task of automatically extracting a set of salient phrases that concisely describe the primary content of a document (Hasan and Ng, 2014; Song et al., 2023a). Figure 1 shows an example of a source document and its corresponding keyphrases.
Recent developments in pre-trained language models (Devlin et al., 2019) have heightened the need for utilizing pre-trained embeddings in natural language processing tasks, which significantly improves the performance of embedding-based unsupervised keyphrase extraction models (Sun et al., 2020; Liang et al., 2021; Zhang et al., 2022). Existing embedding-based models mainly consist of two components: candidate keyphrase extraction and keyphrase importance estimation (Hasan and Ng, 2014; Song et al., 2021, 2022a). The former extracts contiguous words from the document as candidate keyphrases through heuristic rules, and the latter estimates the importance of each candidate phrase by measuring its similarity to the corresponding document.
Generally, the source document contains both salient information and noise (redundant content). Hence, directly using the phrase-document relevance as the importance score of each candidate may introduce a deviation when selecting keyphrases. For many domain-specific documents (e.g., news or scientific articles), the highlights (the title or the first sentence) typically contain the central information of the source document (as shown in Figure 1) and thus provide stronger guidance for extracting keyphrases. However, recent embedding-based unsupervised keyphrase extraction models ignore this highlight information, which leads them to extract incorrect keyphrases.
Motivated by the above issues, we propose a new Highlight-Guided Unsupervised Keyphrase Extraction model (HGUKE), which estimates the importance of candidate keyphrases under the guidance of the highlight information of the document.

Candidate Keyphrase Extraction
To extract candidate keyphrases from the source document, we follow previous studies (Liang et al., 2021; Song et al., 2022b; Ding and Luo, 2021) and extract noun phrases as candidates via part-of-speech-based heuristic rules. Next, we leverage word embeddings to obtain candidate keyphrase representations. To capture the central semantics of each candidate keyphrase, we obtain its representation via the max pooling operation, a simple and effective parameter-free approach, calculated as h_{p_i} = Max({h_k}_{k=1}^{|p_i|}), where h_{p_i} is the representation of the i-th candidate keyphrase, |p_i| indicates the length of p_i, and h_k represents the embedding of a word in the document belonging to the candidate keyphrase p_i. At the same time, we use the mean pooling operation to obtain the highlight representation h_s of the document.
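The pooling step above can be sketched in a few lines of plain Python. The function names and the list-of-lists embedding format are illustrative assumptions, not the paper's actual implementation (which operates on BERT tensors).

```python
# Minimal sketch of the pooling operations, assuming word embeddings have
# already been extracted (e.g., from BERT's last layer) as lists of floats.
# Names such as `max_pool` and `mean_pool` are illustrative, not from the paper.

def max_pool(word_vecs):
    """Element-wise max over the word vectors of one candidate phrase (h_{p_i})."""
    return [max(dims) for dims in zip(*word_vecs)]

def mean_pool(word_vecs):
    """Element-wise mean over the word vectors of the highlight (h_s)."""
    n = len(word_vecs)
    return [sum(dims) / n for dims in zip(*word_vecs)]

# Example: a two-word candidate phrase with 2-dimensional embeddings.
phrase_vecs = [[0.1, 0.9], [0.4, 0.2]]
h_p = max_pool(phrase_vecs)   # [0.4, 0.9]
h_s = mean_pool(phrase_vecs)  # approximately [0.25, 0.55]
```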

Phrase-Document Relevance
To obtain more relevant candidates, we model the similarity between each candidate phrase and the highlight representation as follows: p^h_i = 1 / ||h_{p_i} − h_s||_1, where p^h_i denotes the phrase-document relevance of the i-th candidate keyphrase and ||·||_1 indicates the Manhattan Distance.
For news and scientific articles, keyphrases often appear at the beginning of the document (Florescu and Caragea, 2017a,b), which means that position information is important and indicative for extracting keyphrases. Following PositionRank (Florescu and Caragea, 2017b), each candidate is first weighted by the sum of the reciprocals of its occurrence positions; for example, a word appearing at the 2nd, 5th, and 10th positions has a weight of 1/2 + 1/5 + 1/10 = 0.8. Inspired by previous work (Florescu and Caragea, 2017b; Liang et al., 2021), we adopt a position regularization computed as a softmax over these weights, ρ_i = e^{w_i} / Σ_j e^{w_j}, where ρ_i is the position regularization factor of the i-th candidate phrase. Then, the weighted phrase-document relevance is re-calculated as p̃^h_i = ρ_i · p^h_i. We finally employ p̃^h_i to estimate the phrase-document relevance of the i-th candidate phrase.
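The position regularization described above can be sketched as follows; `position_weight` and the softmax normalization are our reading of the (garbled) original formula, written in plain Python with illustrative names.

```python
import math

def position_weight(positions):
    # PositionRank-style weight: sum of reciprocal occurrence positions,
    # e.g. occurrences at positions 2, 5, 10 give 1/2 + 1/5 + 1/10 = 0.8.
    return sum(1.0 / p for p in positions)

def position_regularization(weights):
    # Softmax over the per-candidate position weights -> factors rho_i.
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

# Candidates occurring earlier (and more often) receive larger rho_i.
weights = [position_weight(occ) for occ in ([2, 5, 10], [1], [20])]
rho = position_regularization(weights)
weighted_relevance = [r * p for r, p in zip(rho, [0.3, 0.5, 0.4])]  # rho_i * p^h_i
```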

Cross-Phrase Relevance
Generally, the phrase-document relevance is calculated between the highlight information and each candidate independently; consequently, it cannot determine which candidates are better than the others. To determine which candidate phrases are more salient than the others, we average the semantic relatedness between the i-th candidate phrase and all other candidates as the cross-phrase relevance: δ_i = Mean_{j=1, j≠i}(h_{p_i} · h_{p_j}). Here, we treat δ_i as a de-noising factor to filter out noisy candidates, i.e., those that are far different from the other candidate keyphrases in the document.
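A plain-Python sketch of this cross-phrase term, assuming phrase representations are given as lists of floats and relatedness is measured by a dot product (our assumption; the function names are illustrative):

```python
def dot(u, v):
    # Dot product as the semantic-relatedness measure (our assumption).
    return sum(a * b for a, b in zip(u, v))

def cross_phrase_relevance(phrase_vecs):
    # delta_i: mean relatedness between candidate i and every other candidate.
    n = len(phrase_vecs)
    return [
        sum(dot(phrase_vecs[i], phrase_vecs[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

# An outlier candidate (orthogonal to the rest) receives a low delta_i.
deltas = cross_phrase_relevance([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```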

Relevance Aggregation
We aggregate the phrase-document relevance and the cross-phrase relevance into the importance score of each candidate via a simple multiplication, r_i = p̃^h_i × δ_i, where r_i indicates the importance score of the i-th candidate phrase. Then, we rank all candidates by their importance scores r_i and extract the top-ranked k phrases as keyphrases of the source document.
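Putting the two signals together, the final ranking step might look like the sketch below (the function name and example scores are ours, not the paper's):

```python
def rank_candidates(candidates, doc_relevance, cross_relevance, k):
    # r_i = weighted phrase-document relevance * cross-phrase relevance.
    scores = [d * c for d, c in zip(doc_relevance, cross_relevance)]
    # Sort candidates by descending importance score and keep the top k.
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [cand for cand, _ in ranked[:k]]

top = rank_candidates(
    ["neural networks", "the results", "keyphrase extraction"],
    [0.8, 0.3, 0.9],   # weighted phrase-document relevance
    [0.7, 0.2, 0.8],   # cross-phrase relevance
    k=2,
)  # -> ["keyphrase extraction", "neural networks"]
```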
Experiments and Results

Experimental Settings
This paper conducts experiments on three popular benchmark keyphrase datasets: DUC2001 (Wan and Xiao, 2008), Inspec (Hulth, 2003), and SemEval2010 (Kim et al., 2010). Due to page limits, please refer to the corresponding articles for the details of the three datasets. Following previous work (Liang et al., 2021; Ding and Luo, 2021; Song et al., 2023b), we follow standard practice and evaluate the performance of our model in terms of the F1 measure at the top-K predicted keyphrases (F1@K), applying stemming to both the extracted keyphrases and the gold-standard keyphrases. Concretely, we report F1@5, F1@10, and F1@15 of each model on the three benchmark datasets.
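For reference, F1@K on a single document can be computed as in the sketch below (stemming omitted; the inputs are assumed to be already-stemmed phrase strings, and the function name is ours):

```python
def f1_at_k(predicted, gold, k):
    # predicted: ranked list of extracted keyphrases; gold: set of gold keyphrases.
    topk = predicted[:k]
    correct = sum(1 for p in topk if p in gold)
    if correct == 0:
        return 0.0
    precision = correct / len(topk)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Two of the top-2 predictions checked against four gold phrases: P=1/2, R=1/4.
score = f1_at_k(["a", "b", "c", "d"], {"a", "c", "x", "y"}, k=2)
```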
We adopt the pre-trained language model BERT (Devlin et al., 2019) as the backbone of our model, initialized from its pre-trained weights. In our experiments, λ is set to 0.9 for all three benchmark datasets.

Overall Performance
Table 1 shows the performance of the baselines and our model on the three benchmark datasets (DUC2001, Inspec, and SemEval2010).
Compared with EmbedRank (Bennani-Smires et al., 2018), KeyGames (Saxena et al., 2020), and SIFRank (Sun et al., 2020), HGUKE achieves significant improvements, which benefit from using the highlights to calculate the importance score of each candidate keyphrase. Compared with the best baseline JointGL, our model achieves better performance on all three benchmark keyphrase extraction datasets across all evaluation metrics. The main reason for this improvement is that we use the highlights as the guidance information instead of the whole document when estimating the importance of keyphrases.

Ablation Test
The ablation experiments on the three benchmark keyphrase extraction datasets are shown in Figure 3. The results show that using the highlight information significantly improves the performance of keyphrase extraction, which benefits from estimating the importance score of each candidate via its corresponding highlight information rather than the whole document. We consider the main reason to be that the title or the first sentence of a document usually provides strong guidance for extracting keyphrases.

Impact of Pooling Methods
In this section, we study different pooling methods, including the mean- and max-pooling operations. For all pooling methods, HGUKE using the last BERT layer achieves the best results, demonstrating that HGUKE benefits from stronger contextualized semantic representations. As shown in Table 2, encoding the document via the mean-pooling operation obtains the best performance.

Impact of Different Similarity Measures
Our model adopts the Manhattan Distance to measure the textual similarity between candidate phrases and the highlight information. Furthermore, we attempt to employ different measures to estimate the phrase-document relevance. The results of the different similarity measures are shown in Table 3, which show that the Manhattan Distance has a clear advantage.
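For completeness, the similarity measures compared here are standard and can be written as:

```python
import math

def manhattan(u, v):
    # L1 distance (the measure adopted by our model).
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    # L2 distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    # Cosine similarity between two vectors.
    num = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return num / (norm_u * norm_v)
```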

Related Work
Most existing unsupervised keyphrase extraction methods can be divided into four categories: statistics-based, topic-based, graph-based, and embedding-based models. Specifically, statistics-based models (Salton and Buckley, 1988; Witten et al., 1999) usually extract keyphrases by estimating the importance of candidate phrases with different statistical features, such as word frequency, phrase position, and linguistic features of natural language. Topic-based models (Liu et al., 2009, 2010) typically utilize topic information to determine whether a candidate phrase is a keyphrase. Graph-based models (Mihalcea and Tarau, 2004; Grineva et al., 2009) represent the document as a graph and rank candidate phrases by graph-based similarities.
Embedding-based models usually adopt pre-trained embeddings to obtain document and candidate phrase representations and calculate the importance score of each candidate based on the obtained representations. Benefiting from the development of transformer-based pre-trained language models (Devlin et al., 2019) in the natural language processing field, embedding-based models (Bennani-Smires et al., 2018; Sun et al., 2020; Liang et al., 2021) have achieved outstanding performance. Concretely, embedding-based models mainly consist of two procedures: candidate keyphrase representation and keyphrase importance estimation (Hasan and Ng, 2014; Song et al., 2023a). The first procedure utilizes linguistic features to construct candidate keyphrases and represents them with pre-trained embedding approaches (e.g., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019)). The second procedure estimates the importance of candidate phrases from different perspectives to determine whether a candidate phrase is a keyphrase.
Unlike the existing unsupervised keyphrase extraction models, we use the highlight information of the document instead of the whole document to calculate the phrase-document relevance.

Conclusion and Future Work
In this paper, we incorporate structural information to improve the performance of embedding-based unsupervised keyphrase extraction. Specifically, we propose a new Highlight-Guided Unsupervised Keyphrase Extraction model (HGUKE), which calculates the phrase-document relevance via the highlight information instead of the whole document to select relevant candidate phrases. Extensive experiments demonstrate that HGUKE outperforms the state-of-the-art unsupervised baselines. Future research may investigate adopting different structural information of the source document to further improve the performance of unsupervised keyphrase extraction.

Acknowledgments
We thank the three anonymous reviewers for carefully reading our paper and their insightful comments and suggestions.This work was partly supported by the Fundamental Research Funds for the Central Universities (2019JBZ110); the National Natural Science Foundation of China under Grant 62176020; the National Key Research and Development Program (2020AAA0106800); the Beijing Natural Science Foundation under Grant L211016; CAAI-Huawei MindSpore Open Fund; and Chinese Academy of Sciences (OEIP-O-202004).

Limitations
There are still some limitations to our work. In the future, we plan to enhance the procedure of extracting candidate keyphrases to improve the upper bound of keyphrase extraction performance. One possible way is to generate candidate phrases of the document by utilizing high-level semantic relatedness (e.g., attention weights) instead of surface- or syntactic-level information.

Figure 1: A randomly sampled document with its corresponding keyphrases from the benchmark keyphrase extraction dataset Inspec. Bold orange represents content related to the title, and underlining indicates content related to the first sentence.

Figure 3: The results of calculating the phrase-document relevance via the whole document and the highlights.

Table 1: Performance of the selected baselines and our model on the DUC2001, Inspec, and SemEval2010 test sets. F1 scores at the top 5, 10, and 15 keyphrases are reported. The best results are in bold.

Table 3: Results of different similarity measures for the phrase-document relevance.