Re-train or Train from Scratch? Comparing Pre-training Strategies of BERT in the Medical Domain

Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, Pierre Zweigenbaum


Abstract
BERT models used in specialized domains all seem to be the result of a simple strategy: initializing with the original BERT and then resuming pre-training on a specialized corpus. This method yields rather good performance (e.g. BioBERT (Lee et al., 2020), SciBERT (Beltagy et al., 2019), BlueBERT (Peng et al., 2019)). However, it seems reasonable to think that training directly on a specialized corpus, using a specialized vocabulary, could result in more tailored embeddings and thus help performance. To test this hypothesis, we train BERT models from scratch using many configurations involving general and medical corpora. Based on evaluations using four different tasks, we find that the initial corpus only has a weak influence on the performance of BERT models when these are further pre-trained on a medical corpus.
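The two strategies compared in the paper can be illustrated with a minimal sketch (not the authors' code) using the Hugging Face transformers and tokenizers libraries: "re-training" resumes masked-language-model pre-training from the original BERT checkpoint and its general-domain vocabulary, while "training from scratch" first learns a specialized WordPiece vocabulary on the medical corpus and then pre-trains a randomly initialized model. The corpus path, vocabulary size, and other hyper-parameters below are illustrative placeholders.

```python
# Minimal sketch of the two pre-training strategies (assumed setup, not the
# authors' implementation). Corpus path and hyper-parameters are placeholders.
from tokenizers import BertWordPieceTokenizer
from transformers import BertConfig, BertForMaskedLM, BertTokenizerFast

SPECIALIZED_CORPUS = "medical_corpus.txt"  # hypothetical path, e.g. MIMIC-III notes

# Strategy 1: "re-train" -- start from the original BERT checkpoint
# (and its general-domain WordPiece vocabulary), then resume MLM
# pre-training on the specialized corpus.
retrain_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
retrain_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Strategy 2: "train from scratch" -- learn a specialized WordPiece
# vocabulary on the medical corpus, then pre-train a randomly
# initialized BERT that uses this vocabulary.
wordpiece = BertWordPieceTokenizer(lowercase=True)
wordpiece.train(files=[SPECIALIZED_CORPUS], vocab_size=30522)
wordpiece.save_model(".")  # writes vocab.txt to the current directory

scratch_tokenizer = BertTokenizerFast(vocab_file="vocab.txt", do_lower_case=True)
scratch_config = BertConfig(vocab_size=scratch_tokenizer.vocab_size)
scratch_model = BertForMaskedLM(scratch_config)  # random initialization

# Both models would then be pre-trained with the usual MLM objective
# (e.g. via transformers.Trainer with DataCollatorForLanguageModeling)
# on the specialized corpus before fine-tuning on downstream tasks.
```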
Anthology ID:
2022.lrec-1.281
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2626–2633
URL:
https://aclanthology.org/2022.lrec-1.281
Cite (ACL):
Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, and Pierre Zweigenbaum. 2022. Re-train or Train from Scratch? Comparing Pre-training Strategies of BERT in the Medical Domain. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2626–2633, Marseille, France. European Language Resources Association.
Cite (Informal):
Re-train or Train from Scratch? Comparing Pre-training Strategies of BERT in the Medical Domain (El Boukkouri et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.281.pdf
Data
MIMIC-III, OpenWebText