Johannes Leveling


2024

Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali | Michael Fromm | Klaudia Thellmann | Richard Rutmann | Max Lübbering | Johannes Leveling | Katrin Klug | Jan Ebert | Niclas Doll | Jasper Buschhoff | Charvi Jain | Alexander Weber | Lena Jurkschat | Hammam Abdelwahab | Chelsea John | Pedro Ortiz Suarez | Malte Ostendorff | Samuel Weinbach | Rafet Sifa | Stefan Kesselheim | Nicolas Flores-Herr
Findings of the Association for Computational Linguistics: NAACL 2024

The recent success of large language models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model’s downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for the model’s downstream performance. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases of factor three in comparison to English. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in a severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.

2014

Findings of the 2014 Workshop on Statistical Machine Translation
Ondřej Bojar | Christian Buck | Christian Federmann | Barry Haddow | Philipp Koehn | Johannes Leveling | Christof Monz | Pavel Pecina | Matt Post | Herve Saint-Amand | Radu Soricut | Lucia Specia | Aleš Tamchyna
Proceedings of the Ninth Workshop on Statistical Machine Translation

Automatic Prediction of Aesthetics and Interestingness of Text Passages
Debasis Ganguly | Johannes Leveling | Gareth Jones
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Porting a Summarizer to the French Language
Rémi Bois | Johannes Leveling | Lorraine Goeuriot | Gareth J. F. Jones | Liadh Kelly
Proceedings of TALN 2014 (Volume 2: Short Papers)

2012

Cross-Lingual Topical Relevance Models
Debasis Ganguly | Johannes Leveling | Gareth Jones
Proceedings of COLING 2012

Approximate Sentence Retrieval for Scalable and Efficient Example-Based Machine Translation
Johannes Leveling | Debasis Ganguly | Sandipan Dandapat | Gareth Jones
Proceedings of COLING 2012

2010

Building a Domain-specific Document Collection for Evaluating Metadata Effects on Information Retrieval
Walid Magdy | Jinming Min | Johannes Leveling | Gareth J. F. Jones
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes the development of a structured document collection containing user-generated text and numerical metadata for exploring the exploitation of metadata in information retrieval (IR). The collection consists of more than 61,000 documents extracted from YouTube video pages on basketball in general and NBA (National Basketball Association) in particular, together with a set of 40 topics and their relevance judgements. In addition, a collection of nearly 250,000 user profiles related to the NBA collection is available. Several baseline IR experiments report the effect of using video-associated metadata on retrieval effectiveness. The results surprisingly show that searching only the video titles performs significantly better than also searching additional metadata text fields of the videos, such as the tags or the description.

A Road Map for Interoperable Language Resource Metadata
Christopher Cieri | Khalid Choukri | Nicoletta Calzolari | D. Terence Langendoen | Johannes Leveling | Martha Palmer | Nancy Ide | James Pustejovsky
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

LRs remain expensive to create and thus rare relative to demand across languages and technology types. The accidental re-creation of an LR that already exists is a nearly unforgivable waste of scarce resources that is unfortunately not so easy to avoid. The number of catalogs the HLT researcher must search, with their different formats, makes it possible to overlook an existing resource. This paper sketches the sources of this problem and outlines a proposal to rectify it, along with a new vision of LR cataloging that will facilitate the documentation and exploitation of a much wider range of LRs than previously considered.

2007

FUH (FernUniversität in Hagen): Metonymy Recognition Using Different Kinds of Context for a Memory-Based Learner
Johannes Leveling
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)