Pre-training data selection for biomedical domain adaptation using journal impact metrics

Mathieu Lai-king, Patrick Paroubek


Abstract
Domain adaptation is a widely used technique in natural language processing (NLP) for improving the performance of a language model on a specific domain. It is particularly common in the biomedical domain, where large numbers of scientific articles are published regularly and PubMed, a large corpus of scientific abstracts, is a frequent source of pre-training data. The primary objective of this study is to explore whether refining a pre-training dataset with quality metrics for scientific papers can improve the performance of the resulting model. To that end, we employ two straightforward journal impact metrics and continually pre-train BERT on various subsets of the full PubMed training set; we then evaluate the resulting models on biomedical language understanding tasks from the BLURB benchmark. Our results show that pruning with journal impact metrics is not effective. However, we also show that pre-training on fewer abstracts (but with the same number of training steps) does not necessarily degrade the resulting model's performance.
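The abstract describes a three-step pipeline: select PubMed abstracts with a journal impact metric, continually pre-train BERT on the selected subset with a fixed step budget, and evaluate on BLURB. As a rough illustration only (this page does not include the authors' code; the impact-score mapping, field names, threshold, and hyperparameters below are all hypothetical), a minimal Python sketch of the first two steps using the Hugging Face transformers and datasets libraries might look like:

    # Hypothetical sketch of the workflow the abstract describes; not the
    # authors' released code or exact setup.
    from datasets import Dataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    def select_by_impact(records, impact_by_journal, threshold):
        """Keep abstracts whose journal meets the impact threshold.

        `records` and `impact_by_journal` are illustrative structures:
        a list of dicts with "journal" and "abstract" keys, and a
        journal -> impact-metric mapping (e.g., an impact factor).
        """
        return [r["abstract"] for r in records
                if impact_by_journal.get(r["journal"], 0.0) >= threshold]

    def continual_pretrain(texts, base_model="bert-base-uncased", steps=10_000):
        """Continue masked-language-model pre-training on the pruned subset."""
        tokenizer = AutoTokenizer.from_pretrained(base_model)
        model = AutoModelForMaskedLM.from_pretrained(base_model)

        dataset = Dataset.from_dict({"text": texts}).map(
            lambda batch: tokenizer(batch["text"], truncation=True,
                                    max_length=512),
            batched=True,
            remove_columns=["text"],
        )
        collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                   mlm_probability=0.15)
        args = TrainingArguments(
            output_dir="bert-pubmed-pruned",
            max_steps=steps,  # fixed step budget regardless of subset size
            per_device_train_batch_size=16,
        )
        Trainer(model=model, args=args, train_dataset=dataset,
                data_collator=collator).train()

Holding max_steps fixed while varying the subset size mirrors the abstract's observation that pre-training on fewer abstracts, with the same number of training steps, does not necessarily hurt downstream performance.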
Anthology ID:
2024.bionlp-1.27
Volume:
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, Kirk Roberts, Junichi Tsujii
Venues:
BioNLP | WS
SIG:
SIGBIOMED
Publisher:
Association for Computational Linguistics
Pages:
363–369
URL:
https://aclanthology.org/2024.bionlp-1.27
Cite (ACL):
Mathieu Lai-king and Patrick Paroubek. 2024. Pre-training data selection for biomedical domain adaptation using journal impact metrics. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages 363–369, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Pre-training data selection for biomedical domain adaptation using journal impact metrics (Lai-king & Paroubek, BioNLP-WS 2024)
PDF:
https://aclanthology.org/2024.bionlp-1.27.pdf