Katrin Klug
2025
Robustness Evaluation of the German Extractive Question Answering Task
Shalaka Satheesh | Katharina Beckh | Katrin Klug | Héctor Allende-Cid | Sebastian Houben | Teena Hassan
Proceedings of the 31st International Conference on Computational Linguistics
To ensure the reliable performance of Question Answering (QA) systems, evaluating their robustness is crucial. Common evaluation benchmarks typically include only performance metrics, such as Exact Match (EM) and the F1 score. However, these benchmarks overlook factors that are critical for the deployment of QA systems. This oversight can result in systems that are vulnerable to minor perturbations in the input, such as typographical errors. While several methods have been proposed to test the robustness of QA models, there has been minimal exploration of these approaches for languages other than English. This study focuses on the robustness evaluation of German-language QA models, extending methodologies previously applied primarily to English. The objective is to foster the development of robust models by defining an evaluation method specifically tailored to the German language. We assess the applicability of perturbations used in English QA models for German and perform a comprehensive experimental evaluation with eight models. The results show that all models are vulnerable to character-level perturbations. Additionally, the comparison of monolingual and multilingual models suggests that the former are less affected by character- and word-level perturbations.
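As a concrete illustration of the character-level perturbations discussed in the abstract, the sketch below injects typographical noise by swapping adjacent letters in a question. The swap probability and swap strategy are illustrative assumptions, not the paper's actual implementation.

```python
import random

def swap_adjacent_chars(text: str, prob: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent alphabetic characters to simulate typos."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so perturbations do not overlap
        else:
            i += 1
    return "".join(chars)

# Example: perturb a German question before feeding it to a QA model.
print(swap_adjacent_chars("Wie viele Einwohner hat Berlin?", prob=0.2))
```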
2024
Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali | Michael Fromm | Klaudia Thellmann | Richard Rutmann | Max Lübbering | Johannes Leveling | Katrin Klug | Jan Ebert | Niclas Doll | Jasper Buschhoff | Charvi Jain | Alexander Weber | Lena Jurkschat | Hammam Abdelwahab | Chelsea John | Pedro Ortiz Suarez | Malte Ostendorff | Samuel Weinbach | Rafet Sifa | Stefan Kesselheim | Nicolas Flores-Herr
Findings of the Association for Computational Linguistics: NAACL 2024
The recent success of large language models (LLMs) has been predominantly driven by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model’s downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics, fertility and parity, are not always predictive of model downstream performance, rendering these metrics a questionable proxy for it. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary roughly three times larger than that of English tokenizers. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
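The fertility and parity metrics mentioned in the abstract can be computed in a few lines. The sketch below is a minimal illustration using a Hugging Face tokenizer; the checkpoint and the example sentence pair are assumptions for demonstration, not the paper's setup.

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any subword tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def fertility(text: str) -> float:
    """Average number of subword tokens produced per whitespace-separated word."""
    return len(tokenizer.tokenize(text)) / len(text.split())

def parity(text_a: str, text_b: str) -> float:
    """Token-count ratio for a parallel sentence pair; values near 1 indicate
    the tokenizer treats both languages comparably."""
    return len(tokenizer.tokenize(text_a)) / len(tokenizer.tokenize(text_b))

en = "The tokenizer choice can significantly impact downstream performance."
de = "Die Wahl des Tokenizers kann die Downstream-Leistung erheblich beeinflussen."
print(f"fertility(en)={fertility(en):.2f}  fertility(de)={fertility(de):.2f}  parity={parity(de, en):.2f}")
```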
Co-authors
- Hammam Abdelwahab 1
- Mehdi Ali 1
- Héctor Allende-Cid 1
- Katharina Beckh 1
- Jasper Buschhoff 1
- Niclas Doll 1
- Jan Ebert 1
- Nicolas Flores-Herr 1
- Michael Fromm 1
- Teena Hassan 1
- Sebastian Houben 1
- Charvi Jain 1
- Chelsea John 1
- Lena Jurkschat 1
- Stefan Kesselheim 1
- Johannes Leveling 1
- Max Lübbering 1
- Pedro Ortiz Suarez 1
- Malte Ostendorff 1
- Richard Rutmann 1
- Shalaka Satheesh 1
- Rafet Sifa 1
- Klaudia Thellmann 1
- Alexander Weber 1
- Samuel Weinbach 1