Tokenizer Choice For LLM Training: Negligible or Crucial?

Mehdi Ali; Michael Fromm; Klaudia Thellmann; Richard Rutmann; Max Lübbering; Johannes Leveling; Katrin Klug; Jan Ebert; Niclas Doll; Jasper Buschhoff; Charvi Jain; Alexander Weber; Lena Jurkschat; Hammam Abdelwahab; Chelsea John; Pedro Ortiz Suarez; Malte Ostendorff; Samuel Weinbach; Rafet Sifa; Stefan Kesselheim; Nicolas Flores-Herr

doi:10.18653/v1/2024.findings-naacl.247

Abstract

The recent success of large language models (LLMs) has been predominantly driven by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model’s downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for it. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary three times as large as that needed for English. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
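The abstract names fertility and parity without defining them. As a minimal sketch, assuming the definitions commonly used in the tokenizer literature (fertility as the average number of tokens per whitespace-separated word, parity as the ratio of token counts a tokenizer produces on parallel texts in two languages), the two metrics could be computed as follows. The `toy_tokenize` function and the example sentences are hypothetical stand-ins for illustration only; they are not the tokenizers or data used in the paper.

```python
from typing import Callable, List, Sequence

Tokenizer = Callable[[str], List[str]]


def fertility(tokenize: Tokenizer, texts: Sequence[str]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("texts contain no words")
    return n_tokens / n_words


def parity(tokenize: Tokenizer, texts_a: Sequence[str], texts_b: Sequence[str]) -> float:
    """Ratio of total token counts on parallel texts (assumed definition).

    A value close to 1.0 means both languages are encoded about equally compactly.
    """
    if len(texts_a) != len(texts_b):
        raise ValueError("parallel corpora must have the same number of sentences")
    tokens_a = sum(len(tokenize(t)) for t in texts_a)
    tokens_b = sum(len(tokenize(t)) for t in texts_b)
    if tokens_b == 0:
        raise ValueError("reference texts produced no tokens")
    return tokens_a / tokens_b


def toy_tokenize(text: str, piece_len: int = 4) -> List[str]:
    """Hypothetical tokenizer: cut every word into chunks of at most piece_len characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


# Hypothetical parallel sentences, for illustration only.
english = ["the cat sat on the mat"]
german = ["die Katze saß auf der Matte"]

print("fertility (en):", fertility(toy_tokenize, english))
print("fertility (de):", fertility(toy_tokenize, german))
print("parity (de/en):", parity(toy_tokenize, german, english))
```

Both metrics measure only how compactly a tokenizer encodes text, which is consistent with the abstract's caution that they are not always predictive of downstream performance.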

Anthology ID: 2024.findings-naacl.247
Volume: Findings of the Association for Computational Linguistics: NAACL 2024
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3907–3924
URL: https://aclanthology.org/2024.findings-naacl.247
DOI: 10.18653/v1/2024.findings-naacl.247
Cite (ACL): Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Buschhoff, Charvi Jain, Alexander Weber, Lena Jurkschat, Hammam Abdelwahab, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Samuel Weinbach, Rafet Sifa, et al. 2024. Tokenizer Choice For LLM Training: Negligible or Crucial? In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3907–3924, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Tokenizer Choice For LLM Training: Negligible or Crucial? (Ali et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-naacl.247.pdf