Marion Thaler
2025
Construction-Based Reduction of Translationese for Low-Resource Languages: A Pilot Study on Bavarian
Peiqin Lin | Marion Thaler | Daniela Goschala | Amir Hossein Kargaran | Yihong Liu | André F. T. Martins | Hinrich Schütze
Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
When translating into a low-resource language, a language model tends to produce translations that stay close to the source (e.g., word-by-word translations) because rich low-resource training data was lacking in pretraining. The output is thus often translationese that differs considerably from what native speakers would produce naturally. To remedy this, we synthetically create a training set in which the frequency of a construction unique to the low-resource language is artificially inflated. For the case of Bavarian, we show that, after training, the language model has learned the unique construction and that native speakers judge its output as more natural. Our pilot study suggests that construction-based mitigation of translationese is a promising approach. Code and artifacts are available at https://github.com/cisnlp/BayernGPT.
MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions
Abdullatif Köksal | Marion Thaler | Ayyoob Imani | Ahmet Üstün | Anna Korhonen | Hinrich Schütze
Transactions of the Association for Computational Linguistics, Volume 13
Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to creating instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Using reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach's effectiveness for both NLU and open-ended generation. We publicly release datasets and models at https://github.com/akoksal/muri.