Pretrained Biomedical Language Models for Clinical NLP in Spanish

Casimiro Pio Carrino, Joan Llop, Marc Pàmies, Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Joaquín Silveira-Ocampo, Alfonso Valencia, Aitor Gonzalez-Agirre, Marta Villegas


Abstract
This work presents the first large-scale biomedical Spanish language models trained from scratch, using biomedical corpora totalling 1.1B tokens and an EHR corpus of 95M tokens. We compared them against general-domain and other domain-specific models for Spanish on three clinical NER tasks. Our models outperform the alternatives on all three NER tasks, making them better suited for clinical NLP applications. Furthermore, our findings indicate that, when enough data is available, pre-training from scratch outperforms continual pre-training on clinical tasks, raising the open question of which approach is optimal. Our models and fine-tuning scripts are publicly available on HuggingFace and GitHub.
Anthology ID:
2022.bionlp-1.19
Volume:
Proceedings of the 21st Workshop on Biomedical Language Processing
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Dina Demner-Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, Junichi Tsujii
Venue:
BioNLP
Publisher:
Association for Computational Linguistics
Pages:
193–199
URL:
https://aclanthology.org/2022.bionlp-1.19
DOI:
10.18653/v1/2022.bionlp-1.19
Bibkey:
Cite (ACL):
Casimiro Pio Carrino, Joan Llop, Marc Pàmies, Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Joaquín Silveira-Ocampo, Alfonso Valencia, Aitor Gonzalez-Agirre, and Marta Villegas. 2022. Pretrained Biomedical Language Models for Clinical NLP in Spanish. In Proceedings of the 21st Workshop on Biomedical Language Processing, pages 193–199, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Pretrained Biomedical Language Models for Clinical NLP in Spanish (Carrino et al., BioNLP 2022)
PDF:
https://aclanthology.org/2022.bionlp-1.19.pdf
Video:
https://aclanthology.org/2022.bionlp-1.19.mp4
Code:
PlanTL-GOB-ES/lm-biomedical-clinical-es