Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

Bettina Messmer, Vinko Sabolčec, Martin Jaggi


Anthology ID:
2025.swisstext-1.4
Volume:
Proceedings of the 10th edition of the Swiss Text Analytics Conference
Month:
May
Year:
2025
Address:
Winterthur, Switzerland
Editors:
Jonathan Gerber, Mark Cieliebak, Don Tuggener, Manuela Hürlimann
Venue:
SwissText
SIG:
SIGSEM
Publisher:
Association for Computational Linguistics
Pages:
31–56
URL:
https://aclanthology.org/2025.swisstext-1.4/
Cite (ACL):
Bettina Messmer, Vinko Sabolčec, and Martin Jaggi. 2025. Enhancing Multilingual LLM Pretraining with Model-Based Data Selection. In Proceedings of the 10th edition of the Swiss Text Analytics Conference, pages 31–56, Winterthur, Switzerland. Association for Computational Linguistics.
Cite (Informal):
Enhancing Multilingual LLM Pretraining with Model-Based Data Selection (Messmer et al., SwissText 2025)
PDF:
https://aclanthology.org/2025.swisstext-1.4.pdf