Biomed-Enriched: Data-Efficient Biomedical Pretraining via Paragraph-Level Annotation

Rian Touchent, Nathan Godey, Éric Villemonte de la Clergerie

Abstract

We annotate PubMed Central paragraphs for document type, domain, and educational quality using a two-stage pipeline: Llama-3.1-70B labels 400K paragraphs, then a fine-tuned XLM-RoBERTa propagates annotations to the full corpus. This paragraph-level approach captures content diversity within scientific articles that document-level labels miss. The resulting Biomed-Enriched corpus contains 2M clinical case paragraphs, providing a publicly available alternative to restricted clinical datasets. For decoders, continual pretraining experiments show targeted improvements: clinical upsampling boosts MMLU ProfMed by 4 points, and educational filtering improves MedQA and MedMCQA by ~1 point. Combining these techniques led to faster convergence, reaching the same performance with a third of the training tokens. For encoders, our best recipe matches BioClinical-ModernBERT on 11 tasks (77.3% vs 77.1% F1) while using 2.5x fewer tokens and only public data.
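
As a rough illustration of how paragraph-level annotations could drive the educational filtering and clinical upsampling mentioned in the abstract, here is a minimal Python sketch. It is not the authors' code: the `Paragraph` fields, the label strings, the score threshold, and the upsampling factor are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Paragraph:
    text: str
    doc_type: str     # hypothetical label, e.g. "clinical_case", "study", "review"
    domain: str       # hypothetical label, e.g. "clinical", "biomedical"
    edu_score: float  # hypothetical educational-quality score (higher is better)


def build_mix(
    paragraphs: Iterable[Paragraph],
    min_edu_score: float = 3.0,   # assumed threshold, not taken from the paper
    clinical_upsample: int = 3,   # assumed repetition factor, not taken from the paper
) -> List[Paragraph]:
    """Filter by educational score, then repeat clinical case paragraphs."""
    mix: List[Paragraph] = []
    for p in paragraphs:
        # Educational filtering: drop paragraphs below the threshold.
        if p.edu_score < min_edu_score:
            continue
        # Clinical upsampling: repeat clinical cases, keep everything else once.
        copies = clinical_upsample if p.doc_type == "clinical_case" else 1
        mix.extend([p] * copies)
    return mix
```

In this sketch the two strategies are simply composed: a paragraph must first pass the quality filter, and clinical cases that pass are then repeated. Other combinations, such as upsampling clinical cases regardless of score, are equally plausible.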

Anthology ID: 2026.findings-acl.1713
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 34276–34287
URL: https://aclanthology.org/2026.findings-acl.1713/
Cite (ACL): Rian Touchent, Nathan Godey, and Éric Villemonte de la Clergerie. 2026. Biomed-Enriched: Data-Efficient Biomedical Pretraining via Paragraph-Level Annotation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34276–34287, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Biomed-Enriched: Data-Efficient Biomedical Pretraining via Paragraph-Level Annotation (Touchent et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1713.pdf
Checklist: 2026.findings-acl.1713.checklist.pdf
