AliBERT: A Pre-trained Language Model for French Biomedical Text

Aman Berhe, Guillaume Draznieks, Vincent Martenot, Valentin Masdeu, Lucas Davy, Jean-Daniel Zucker


Abstract
Over the past few years, domain-specific pretrained language models have been investigated and have shown remarkable achievements in different downstream tasks, especially in the biomedical domain. These achievements stem from the well-known BERT architecture, which uses attention-based self-supervision to learn context from textual documents. However, these domain-specific biomedical pretrained language models mainly use English corpora. Non-English, domain-specific pretrained models therefore remain quite rare, as both requirements are hard to meet together. In this work, we propose AliBERT, a biomedical pretrained language model for French, and investigate different learning strategies. AliBERT is trained using a regularized Unigram-based tokenizer trained for this purpose. AliBERT achieves state-of-the-art F1 and accuracy scores on different downstream biomedical tasks. Our pretrained model outperforms some French non-domain-specific models such as CamemBERT and FlauBERT on diverse downstream tasks, with less pretraining and training time and with much smaller corpora.
Anthology ID:
2023.bionlp-1.19
Volume:
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Dina Demner-Fushman, Sophia Ananiadou, Kevin Cohen
Venue:
BioNLP
Publisher:
Association for Computational Linguistics
Pages:
223–236
URL:
https://aclanthology.org/2023.bionlp-1.19
DOI:
10.18653/v1/2023.bionlp-1.19
Cite (ACL):
Aman Berhe, Guillaume Draznieks, Vincent Martenot, Valentin Masdeu, Lucas Davy, and Jean-Daniel Zucker. 2023. AliBERT: A Pre-trained Language Model for French Biomedical Text. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 223–236, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
AliBERT: A Pre-trained Language Model for French Biomedical Text (Berhe et al., BioNLP 2023)
PDF:
https://aclanthology.org/2023.bionlp-1.19.pdf