BioMedBERT: A Pre-trained Biomedical Language Model for QA and IR

Souradip Chakraborty, Ekaba Bisong, Shweta Bhatt, Thomas Wagner, Riley Elliott, Francesco Mosconi


Abstract
The SARS-CoV-2 (COVID-19) pandemic spotlighted the importance of moving quickly with biomedical research. However, as the number of biomedical research papers continues to increase, the task of finding relevant articles to answer pressing questions has become a significant challenge. In this work, we propose a textual data mining tool that supports literature search to accelerate the work of researchers in the biomedical domain. We achieve this by building a neural-based deep contextual understanding model for Question-Answering (QA) and Information Retrieval (IR) tasks. We also leverage the new BREATHE dataset, one of the largest available datasets of biomedical research literature, which contains abstracts and full-text articles from ten different biomedical literature sources and on which we pre-train our BioMedBERT model. Our work achieves state-of-the-art results on the QA fine-tuning task on the BioASQ 5b, 6b and 7b datasets. In addition, we observe more relevant results when BioMedBERT embeddings are used with Elasticsearch for the Information Retrieval task on the intelligently formulated BioASQ dataset. We believe our diverse dataset and our unique model architecture are what led us to achieve the state-of-the-art results for QA and IR tasks.
Anthology ID:
2020.coling-main.59
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Donia Scott, Nuria Bel, Chengqing Zong
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
669–679
URL:
https://aclanthology.org/2020.coling-main.59
DOI:
10.18653/v1/2020.coling-main.59
Cite (ACL):
Souradip Chakraborty, Ekaba Bisong, Shweta Bhatt, Thomas Wagner, Riley Elliott, and Francesco Mosconi. 2020. BioMedBERT: A Pre-trained Biomedical Language Model for QA and IR. In Proceedings of the 28th International Conference on Computational Linguistics, pages 669–679, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
BioMedBERT: A Pre-trained Biomedical Language Model for QA and IR (Chakraborty et al., COLING 2020)
PDF:
https://aclanthology.org/2020.coling-main.59.pdf