A Large-Scale Dataset for Biomedical Keyphrase Generation

Maël Houbre, Florian Boudin, Beatrice Daille


Abstract
Keyphrase generation is the task of generating a set of words or phrases that highlight the main topics of a document. Few datasets exist for keyphrase generation in the biomedical domain, and those that do are too small for training generative models. In this paper, we introduce kp-biomed, the first large-scale biomedical keyphrase generation dataset, collected from PubMed abstracts. We train and release several generative models and conduct a series of experiments showing that using large-scale datasets significantly improves performance for present and absent keyphrase generation. The dataset and models are available online.
Anthology ID:
2022.louhi-1.6
Volume:
Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Alberto Lavelli, Eben Holderness, Antonio Jimeno Yepes, Anne-Lyse Minard, James Pustejovsky, Fabio Rinaldi
Venue:
Louhi
Publisher:
Association for Computational Linguistics
Pages:
47–53
URL:
https://aclanthology.org/2022.louhi-1.6
DOI:
10.18653/v1/2022.louhi-1.6
Cite (ACL):
Maël Houbre, Florian Boudin, and Beatrice Daille. 2022. A Large-Scale Dataset for Biomedical Keyphrase Generation. In Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI), pages 47–53, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
A Large-Scale Dataset for Biomedical Keyphrase Generation (Houbre et al., Louhi 2022)
PDF:
https://aclanthology.org/2022.louhi-1.6.pdf
Video:
https://aclanthology.org/2022.louhi-1.6.mp4