A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT

Louis Estève; Christophe Servan; Thomas Lavergne; Agata Savary

Abstract

Diversity has attracted growing interest in the NLP community in recent years. At the same time, state-of-the-art transformer models such as ModernBERT use very large pre-training datasets, which are driven by size rather than by diversity. This calls for investigating the impact of diversity on pre-training. We do so in this study, with the express intent of reducing pre-training dataset size while retaining at least comparable performance. We compare diversity-driven sampling algorithms, and we use the best one to pre-train several ModernBERT models on French with a fixed compute budget. We fine-tune and evaluate them on a variety of French benchmarks. We compare them with models pre-trained on randomly sampled data of commensurate size, with the same compute budget. We find that both random and diversity-driven sampling may reduce the pre-training dataset by up to 94% and the pre-training time by up to 73% while maintaining performance. Moreover, in some tasks, the inherent quality of models, estimated via head-only fine-tuning, is up to 10 points higher with diversity sampling than with random sampling.

Anthology ID: 2026.findings-acl.1707
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 34168–34181
URL: https://aclanthology.org/2026.findings-acl.1707/
Cite (ACL): Louis Estève, Christophe Servan, Thomas Lavergne, and Agata Savary. 2026. A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34168–34181, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT (Estève et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1707.pdf
Checklist: 2026.findings-acl.1707.checklist.pdf
