Clustering-based Sampling for Few-Shot Cross-Domain Keyphrase Extraction

Prakamya Mishra, Lincy Pattanaik, Arunima Sundar, Nishant Yadav, Mayank Kulkarni


Abstract
Keyphrase extraction is the task of identifying a set of keyphrases present in a document that capture its most salient topics. Scientific domain-specific pre-training has led to state-of-the-art keyphrase extraction performance, with a majority of benchmarks being within that domain. In this work, we explore how to effectively enable the cross-domain generalization capabilities of such models without requiring the same scale of data. We primarily focus on the few-shot setting in non-scientific domain datasets such as OpenKP from the Web domain and StackEx from the StackExchange forum. We propose to leverage topic information intrinsically available in the data to build a novel clustering-based sampling approach that selects a few samples to label from the target domain, which facilitates building robust and performant models. This approach leads to large gains in performance of up to 26.35 F1 points compared to selecting few-shot samples uniformly at random. We also explore the setting where we have access to labeled data from the model’s pre-training domain corpora and perform gradual training, which involves slowly folding target domain data into the source domain data. Here we demonstrate further improvements in model performance of up to 12.76 F1 points.
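The abstract does not spell out the clustering method, the selection rule, or the gradual training schedule, so the following is only a minimal sketch under stated assumptions. It assumes each target-domain document already carries a cluster ID derived from its topic information, spreads a small labeling budget round-robin across clusters, and then builds staged training sets that fold in a growing share of target data. The names `cluster_balanced_sample` and `gradual_mix`, the round-robin rule, and the linear schedule are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def cluster_balanced_sample(cluster_ids, budget, seed=0):
    """Pick up to `budget` document indices, spreading picks across clusters.

    Assumption: cluster_ids[i] is a precomputed topic-cluster ID for document i.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    # Shuffle within each cluster so the pick inside a cluster is random.
    pools = [by_cluster[cid] for cid in sorted(by_cluster)]
    for pool in pools:
        rng.shuffle(pool)

    # Round-robin over clusters until the budget is spent or all pools are empty.
    chosen = []
    while len(chosen) < budget and any(pools):
        for pool in pools:
            if pool and len(chosen) < budget:
                chosen.append(pool.pop())
    return chosen


def gradual_mix(source, target, num_stages):
    """Yield one training set per stage with a linearly growing share of target data.

    Assumption: a linear schedule; the paper's actual schedule may differ.
    """
    for stage in range(1, num_stages + 1):
        n_target = round(len(target) * stage / num_stages)
        yield source + target[:n_target]


if __name__ == "__main__":
    # Toy data: six documents in cluster 0, two in cluster 1, one in cluster 2.
    toy_clusters = [0, 0, 0, 0, 0, 0, 1, 1, 2]
    picked = cluster_balanced_sample(toy_clusters, budget=3)
    print("picked indices:", picked)
    for stage_data in gradual_mix(["s1", "s2"], ["t1", "t2", "t3", "t4"], 2):
        print(len(stage_data), "training examples")
```

With a budget equal to the number of clusters, this sketch picks one document per cluster, whereas uniform random sampling on the same toy data would often draw only from the largest cluster.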
Anthology ID:
2024.findings-eacl.82
Volume:
Findings of the Association for Computational Linguistics: EACL 2024
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1232–1250
URL:
https://aclanthology.org/2024.findings-eacl.82
Cite (ACL):
Prakamya Mishra, Lincy Pattanaik, Arunima Sundar, Nishant Yadav, and Mayank Kulkarni. 2024. Clustering-based Sampling for Few-Shot Cross-Domain Keyphrase Extraction. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1232–1250, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Clustering-based Sampling for Few-Shot Cross-Domain Keyphrase Extraction (Mishra et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-eacl.82.pdf