Timm Dill


2024

Sparks of Fairness: Preliminary Evidence of Commercial Machine Translation as English-to-German Gender-Fair Dictionaries
Manuel Lardelli | Timm Dill | Giuseppe Attanasio | Anne Lauscher
Proceedings of the 2nd International Workshop on Gender-Inclusive Translation Technologies

Bilingual dictionaries are bedrock components for several language tasks, including translation. However, dictionaries are traditionally fixed in time, thus excluding the neologisms and neo-morphemes that challenge a language's nominal morphology. The need for a more dynamic, mutable alternative makes machine translation (MT) systems an extremely valuable avenue. This paper investigates whether commercial MT systems can serve as bilingual dictionaries for gender-neutral translation. We focus on the English-to-German pair, where notional gender in the source requires gender inflection in the target. We translated 115 person-referring terms using Google Translate, Microsoft Bing, and DeepL and found that while every system is heavily biased towards the masculine gender, DeepL often provides gender-fair alternatives to users, especially for plurals.
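
The kind of probe described above can be approximated programmatically: query an MT system for a person-referring term and check whether the German output carries gender-fair morphology (e.g. the gender star or colon) or falls back to a single, typically masculine, form. The sketch below is only an illustration of that idea, not the paper's actual pipeline; the translate helper is a hypothetical placeholder for a real MT client, and the marker list is an assumption.

```python
# Illustrative sketch: probing an MT system for gender-fair German output.
# `translate` is a hypothetical stand-in for a commercial MT API call
# (e.g. a thin wrapper around a DeepL or Google Translate client).

GENDER_FAIR_MARKERS = ("*in", ":in", "_in", "*innen", ":innen", "_innen")  # assumed marker set

def translate(term: str, target_lang: str = "DE") -> list[str]:
    """Hypothetical: return one or more candidate German translations for `term`."""
    raise NotImplementedError("plug in a real MT client here")

def classify_translation(candidates: list[str]) -> str:
    """Label the output as gender-fair or as a single (likely masculine) form."""
    for text in candidates:
        if any(marker in text for marker in GENDER_FAIR_MARKERS):
            return "gender-fair"
    return "single form (likely masculine default)"

if __name__ == "__main__":
    for term in ["the doctors", "the teacher", "the lawyers"]:
        try:
            label = classify_translation(translate(term))
        except NotImplementedError:
            label = "n/a (no MT client configured)"
        print(f"{term:>15} -> {label}")
```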

Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ
Carolin Holtermann | Paul Röttger | Timm Dill | Anne Lauscher
Findings of the Association for Computational Linguistics: ACL 2024

Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers. However, most LLMs today, and open LLMs in particular, are often intended for use only in English (e.g. Llama2, Mistral) or in a small handful of high-resource languages (e.g. Mixtral, Qwen). Recent research shows that, despite limits in their intended use, people prompt LLMs in many different languages. Therefore, in this paper, we investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use. For this purpose, we introduce MultiQ, a new silver-standard benchmark for basic open-ended question answering with 27.4k test questions across a typologically diverse set of 137 languages. With MultiQ, we evaluate language fidelity, i.e. whether models respond in the prompted language, and question answering accuracy. All LLMs we test respond faithfully and/or accurately for at least some languages beyond their intended use. Most models are more accurate when they respond faithfully. However, differences across models are large, and there is a long tail of languages where models are neither accurate nor faithful. We explore differences in tokenization as a potential explanation for our findings, identifying possible correlations that warrant further investigation.
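
The language-fidelity check described above can be illustrated with a short sketch: identify the language of a model's response and compare it to the language of the prompt. This is not the MultiQ codebase; model_generate is a hypothetical stand-in for an open LLM call, and language identification here relies on the langdetect package as an assumed off-the-shelf choice.

```python
# Illustrative sketch: scoring language fidelity of an LLM response,
# in the spirit of the MultiQ evaluation described above.

from langdetect import detect  # pip install langdetect

def model_generate(prompt: str) -> str:
    """Hypothetical stand-in for querying an open LLM."""
    raise NotImplementedError("plug in a real model call here")

def language_fidelity(prompt: str, expected_lang: str) -> bool:
    """True if the model answers in the language it was prompted in."""
    response = model_generate(prompt)
    return detect(response) == expected_lang

if __name__ == "__main__":
    try:
        ok = language_fidelity("¿Cuál es la capital de Francia?", expected_lang="es")
        print("faithful response:", ok)
    except NotImplementedError:
        print("no model configured; this is only an illustrative sketch")
```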