Tokenizer Choice For LLM Training: Negligible or Crucial?

Authors: Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Buschhoff, Charvi Jain, Alexander Weber, Lena Jurkschat, Hammam Abdelwahab, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Samuel Weinbach, Rafet Sifa, Stefan Kesselheim, Nicolas Flores-Herr
Published in: Findings of the Association for Computational Linguistics: NAACL 2024
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Publisher: Association for Computational Linguistics
Location: Mexico City, Mexico
Date: 2024-06
Pages: 3907-3924
Type: conference publication
Anthology ID: ali-etal-2024-tokenizer
DOI: 10.18653/v1/2024.findings-naacl.247
URL: https://aclanthology.org/2024.findings-naacl.247/