Mingyang Song


2023

A Survey on Recent Advances in Keyphrase Extraction from Pre-trained Language Models
Mingyang Song | Yi Feng | Liping Jing
Findings of the Association for Computational Linguistics: EACL 2023

Keyphrase Extraction (KE) is a critical component in Natural Language Processing (NLP) systems: it selects a set of phrases that summarize the important information discussed in a document. A keyphrase extraction system can significantly speed up information retrieval and help people obtain first-hand information from a long document quickly and accurately. Specifically, keyphrases provide semantic metadata that characterizes a document and offers an overview of its content. In this paper, we introduce keyphrase extraction, review recent studies based on pre-trained language models, offer insights on the different approaches, highlight open issues, and present a comparative experimental study of popular supervised and unsupervised techniques on several datasets. To encourage further work, we release the related files mentioned in this paper.

Improving Embedding-based Unsupervised Keyphrase Extraction by Incorporating Structural Information
Mingyang Song | Huafeng Liu | Yi Feng | Liping Jing
Findings of the Association for Computational Linguistics: ACL 2023

Keyphrase extraction aims to extract a set of phrases that convey the central ideas of the source document. In a structured document, there are certain locations (e.g., the title or the first sentence) where a keyphrase is most likely to appear. However, most existing embedding-based unsupervised keyphrase extraction models ignore the indicative role of these highlighted locations, leading to incorrect keyphrase extraction. In this paper, we propose a new Highlight-Guided Unsupervised Keyphrase Extraction model (HGUKE) to address this issue. Specifically, HGUKE first models phrase-document relevance via the highlights of the document. Next, it calculates the cross-phrase relevance between all candidate phrases. Finally, it aggregates these two relevance scores into an importance score for each candidate phrase, which is used to rank and extract keyphrases. Experimental results on three benchmarks demonstrate that HGUKE outperforms state-of-the-art unsupervised keyphrase extraction baselines.
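
The scoring idea can be illustrated with a minimal sketch, assuming precomputed embeddings (e.g., from an off-the-shelf sentence encoder); the function names and the equal-weight aggregation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of highlight-guided scoring: combine phrase-highlight
# relevance with cross-phrase relevance. Embeddings are assumed precomputed.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def score_candidates(cand_embs, highlight_emb, alpha=0.5):
    """Aggregate phrase-highlight relevance and cross-phrase relevance."""
    n = len(cand_embs)
    scores = []
    for i, emb in enumerate(cand_embs):
        doc_rel = cosine(emb, highlight_emb)            # relevance to the highlight (e.g., the title)
        cross_rel = np.mean([cosine(emb, cand_embs[j])  # average similarity to the other candidates
                             for j in range(n) if j != i]) if n > 1 else 0.0
        scores.append(alpha * doc_rel + (1 - alpha) * cross_rel)
    return scores
```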

Unsupervised Keyphrase Extraction by Learning Neural Keyphrase Set Function
Mingyang Song | Haiyun Jiang | Lemao Liu | Shuming Shi | Liping Jing
Findings of the Association for Computational Linguistics: ACL 2023

In this paper, we propose a paradigm shift in how unsupervised keyphrase extraction systems are built. Instead of modeling the relevance between an individual candidate phrase and the document, as in the commonly used framework, we formulate unsupervised keyphrase extraction as a document-set matching problem from a set-wise perspective, in which the document and the candidate set are globally matched in the semantic space so that the interactions among all candidate phrases are taken into account. Since exactly extracting the keyphrase set with the matching function is intractable during inference, we propose an approximate approach that obtains candidate subsets via a set extractor agent learned by reinforcement learning. Extensive experimental results demonstrate the effectiveness of our model, which outperforms recent state-of-the-art unsupervised keyphrase extraction baselines by a large margin.
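
As a rough illustration of the set-wise view, the sketch below scores a candidate subset by matching its pooled embedding against the document embedding; the reinforcement-learned set extractor from the paper is replaced here by plain greedy selection, purely for illustration.

```python
# Hypothetical sketch of set-wise document-candidate matching with a greedy
# stand-in for the learned set extractor agent.
import numpy as np

def set_match_score(doc_emb, subset_embs):
    """Score a candidate subset by matching its mean embedding to the document."""
    set_emb = np.mean(subset_embs, axis=0)
    return float(np.dot(doc_emb, set_emb) /
                 (np.linalg.norm(doc_emb) * np.linalg.norm(set_emb) + 1e-12))

def greedy_extract(doc_emb, cand_embs, k=5):
    """Greedily grow the subset that best matches the document embedding."""
    chosen, remaining = [], list(range(len(cand_embs)))
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda i: set_match_score(
            doc_emb, [cand_embs[j] for j in chosen] + [cand_embs[i]]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```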

HyperRank: Hyperbolic Ranking Model for Unsupervised Keyphrase Extraction
Mingyang Song | Huafeng Liu | Liping Jing
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Given the exponential growth in the number of documents on the web in recent years, there is an increasing demand for accurate models to extract keyphrases from such documents. Keyphrase extraction is the task of automatically identifying representative keyphrases from the source document. Typically, candidate keyphrases exhibit latent hierarchical structures embedded with intricate syntactic and semantic information. Moreover, the relationships between candidate keyphrases and the document also form hierarchical structures. Therefore, it is essential to consider these latent hierarchical structures when extracting keyphrases. However, many recent unsupervised keyphrase extraction models overlook this aspect, resulting in incorrect keyphrase extraction. In this paper, we address this issue by proposing a new hyperbolic ranking model (HyperRank). HyperRank is designed to jointly model global and local context information for estimating the importance of each candidate keyphrase within the hyperbolic space, enabling accurate keyphrase extraction. Experimental results demonstrate that HyperRank significantly outperforms recent state-of-the-art baselines.
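
For intuition, the sketch below ranks candidates by their Poincaré distance to the document point in the unit ball; this reduces HyperRank's joint global/local modeling to a single hyperbolic distance and is only a simplified assumption.

```python
# Minimal sketch of distance-based ranking in the Poincare ball.
# All points are assumed to lie strictly inside the unit ball.
import numpy as np

def poincare_distance(u, v, eps=1e-7):
    """Standard Poincare-ball distance between two points."""
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return float(np.arccosh(1 + 2 * sq / (denom + eps)))

def rank_candidates(doc_point, cand_points):
    """Smaller distance to the document point -> more important candidate."""
    dists = [poincare_distance(doc_point, p) for p in cand_points]
    return np.argsort(dists)
```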

Mitigating Over-Generation for Unsupervised Keyphrase Extraction with Heterogeneous Centrality Detection
Mingyang Song | Pengyu Xu | Yi Feng | Huafeng Liu | Liping Jing
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Over-generation errors occur when a keyphrase extraction model correctly identifies a candidate as a keyphrase because it contains a word that frequently appears in the document, but at the same time erroneously outputs other candidates as keyphrases because they contain the same word. To mitigate this issue, we propose a new heterogeneous centrality detection approach (CentralityRank), which extracts keyphrases by simultaneously identifying both implicit and explicit centrality within a heterogeneous graph as the importance score of each candidate. More specifically, CentralityRank detects centrality by taking full advantage of the content of the input document to construct graphs whose semantic nodes span varying granularity levels, not limited to phrases alone. These additional nodes act as intermediaries between candidate keyphrases, enhancing cross-phrase relations. Furthermore, we introduce a novel adaptive boundary-aware regularization that leverages the position information of candidate keyphrases to adjust their importance. Extensive experimental results demonstrate the superiority of CentralityRank over recent state-of-the-art unsupervised keyphrase extraction baselines across three benchmark datasets.
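
A simplified sketch of the core idea, assuming precomputed phrase and sentence embeddings: centrality is computed over a similarity graph that mixes node granularities, and a simple position prior stands in for the paper's adaptive boundary-aware regularization.

```python
# Hypothetical sketch of heterogeneous centrality scoring with a position prior.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def centrality_scores(phrase_embs, sent_embs, phrase_positions, doc_len):
    """Degree centrality of phrase nodes in a phrase+sentence similarity graph."""
    nodes = list(phrase_embs) + list(sent_embs)        # phrase nodes plus coarser sentence nodes
    n_phrases = len(phrase_embs)
    adj = np.array([[cosine(u, v) for v in nodes] for u in nodes])
    np.fill_diagonal(adj, 0.0)
    centrality = adj.sum(axis=1)[:n_phrases]           # keep scores for phrase nodes only
    position_prior = np.array([1.0 / (1.0 + p / doc_len)  # earlier phrases get a mild boost
                               for p in phrase_positions])
    return centrality * position_prior
```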

2022

Hyperbolic Relevance Matching for Neural Keyphrase Extraction
Mingyang Song | Yi Feng | Liping Jing
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Keyphrase extraction is a fundamental task in natural language processing that aims to extract a set of phrases with important information from a source document. Identifying important keyphrases is the central component of keyphrase extraction, and its main challenge is learning to represent information comprehensively and discriminate importance accurately. In this paper, to address the above issues, we design a new hyperbolic matching model (HyperMatch) to explore keyphrase extraction in hyperbolic space. Concretely, to represent information comprehensively, HyperMatch first takes advantage of the hidden representations in the middle layers of RoBERTa and integrates them as the word embeddings via an adaptive mixing layer to capture the hierarchical syntactic and semantic structures. Then, considering the latent structure information hidden in natural languages, HyperMatch embeds candidate phrases and documents in the same hyperbolic space via a hyperbolic phrase encoder and a hyperbolic document encoder. To discriminate importance accurately, HyperMatch estimates the importance of each candidate phrase by explicitly modeling the phrase-document relevance via the Poincaré distance and optimizes the whole model by minimizing the hyperbolic margin-based triplet loss. Extensive experiments are conducted on six benchmark datasets and demonstrate that HyperMatch outperforms the recent state-of-the-art baselines.
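
Two of the ingredients named in the abstract can be sketched directly; the layer-mixing weights and the margin below are illustrative assumptions rather than the values used by HyperMatch.

```python
# Hypothetical sketch of (1) adaptively mixing intermediate-layer embeddings and
# (2) a hyperbolic margin-based triplet loss over Poincare distances.
import numpy as np

def adaptive_mix(layer_embs, weights):
    """Softmax-weighted sum of per-layer word embeddings, shape (layers, tokens, dim)."""
    w = np.exp(weights) / np.sum(np.exp(weights))
    return np.tensordot(w, np.stack(layer_embs), axes=1)

def poincare_distance(u, v, eps=1e-7):
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return float(np.arccosh(1 + 2 * sq / (denom + eps)))

def triplet_loss(doc, pos_phrase, neg_phrase, margin=0.1):
    """Keyphrases should sit closer to the document than non-keyphrases."""
    return max(0.0, poincare_distance(doc, pos_phrase)
                    - poincare_distance(doc, neg_phrase) + margin)
```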

Utilizing BERT Intermediate Layers for Unsupervised Keyphrase Extraction
Mingyang Song | Yi Feng | Liping Jing
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)

2021

Importance Estimation from Multiple Perspectives for Keyphrase Extraction
Mingyang Song | Liping Jing | Lin Xiao
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Keyphrase extraction is a fundamental task in Natural Language Processing that usually consists of two main parts: candidate keyphrase extraction and keyphrase importance estimation. From the perspective of how humans understand documents, we typically judge the importance of a phrase by its syntactic accuracy, information saliency, and concept consistency simultaneously. However, most existing keyphrase extraction approaches focus on only some of these aspects, which leads to biased results. In this paper, we propose a new approach that estimates the importance of a keyphrase from multiple perspectives (called KIEMP) and further improves the performance of keyphrase extraction. Specifically, KIEMP estimates the importance of a phrase with three modules: a chunking module to measure its syntactic accuracy, a ranking module to check its information saliency, and a matching module to judge the concept (i.e., topic) consistency between the phrase and the whole document. These three modules are seamlessly joined together via an end-to-end multi-task learning model, which helps the three parts enhance each other and balances the effects of the three perspectives. Experimental results on six benchmark datasets show that KIEMP outperforms existing state-of-the-art keyphrase extraction approaches in most cases.
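
Schematically, the three perspectives can be fused as a weighted score and trained with a summed multi-task loss; the weights below are placeholders, not the configuration reported in the paper.

```python
# Hypothetical sketch of fusing the three module outputs and their training losses.

def importance(chunk_score, rank_score, match_score, w=(1.0, 1.0, 1.0)):
    """Combine syntactic-accuracy, saliency, and concept-consistency scores."""
    return w[0] * chunk_score + w[1] * rank_score + w[2] * match_score

def multitask_loss(chunk_loss, rank_loss, match_loss, lambdas=(1.0, 1.0, 1.0)):
    """End-to-end training sums the three task losses with task weights."""
    return lambdas[0] * chunk_loss + lambdas[1] * rank_loss + lambdas[2] * match_loss
```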