German Rigau

Also published as: G. Rigau

2026

Machine Translation for Low-Resource Languages through Monolingual Data and LLM: A Case Study of English-to-Basque
Nam Luu | Aitor Soroa | German Rigau | Ondřej Bojar
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Developing a machine translation (MT) system requires a considerable amount of high-quality parallel data, which is often limited for low-resource languages. This paper explores the use of synthetic data for training an LLM-based MT system in the English-to-Basque direction. Using Basque monolingual corpora as a starting point, we apply back-translation to generate parallel corpora, taking advantage of the fact that current LLMs do not translate well from English to Basque, but they yield an acceptable performance in the reverse direction. We conduct experiments in a multi-stage approach, from a simple Supervised Fine-tuning (SFT) step, to preference learning with the Direct Preference Optimization (DPO) technique. We then evaluate the approach with both automatic metrics and manual assessment. Experimental results suggest that for this task, SFT brings a clear improvement in translation quality, while DPO only yields marginal enhancement.

pdf bib abs

Can LLMs Reason Like Doctors? Exploring the Limits of Large Language Models in Complex Medical Reasoning
Flavio Merenda | Jose Manuel Gomez-Perez | German Rigau
Findings of the Association for Computational Linguistics: EACL 2026

Large language models (LLMs) have shown remarkable progress in reasoning across multiple domains. However, it remains unclear whether their abilities reflect genuine reasoning or sophisticated pattern matching, a distinction critical in medical decision-making, where reliable multi-step problem-solving is required. Accordingly, we conduct one of the largest evaluations to date, assessing 77 LLMs with diverse fine-tuning approaches, ranging from 1 billion parameters to frontier models. Guided by medical problem-solving theory, we select three medical question answering (QA) benchmarks targeting key reasoning skills: reasoning processes, susceptibility to cognitive biases, and metacognitive abilities. Additionally, we manually annotate a subset of questions to assess the abduction, deduction, and induction capabilities of LLMs, offering detailed insight into the reasoning mechanisms followed by physicians, an aspect that has received relatively limited attention in this domain. Most models, particularly smaller ones, struggle even with specialized fine-tuning or advanced prompting. Larger models perform better but still show clear limitations in complex medical reasoning. Our findings highlight the need to improve specific reasoning strategies to better reflect medical decision-making. The datasets and code used in this study are publicly available at: https://github.com/expertailab/Can-LLMs-Reason-Like-Doctors

2025

The current best practice to measure the performance of base Large Language Models is to establish a multi-task benchmark that covers a range of capabilities of interest. Currently, however, such benchmarks are only available in a few high-resource languages. To address this situation, we present IberoBench, a multilingual, multi-task benchmark for Iberian languages (i.e., Basque, Catalan, Galician, European Spanish and European Portuguese) built on the LM Evaluation Harness framework. The benchmark consists of 62 tasks divided into 179 subtasks. We evaluate 33 existing LLMs on IberoBench on 0- and 5-shot settings. We also explore the issues we encounter when working with the Harness and our approach to solving them to ensure high-quality evaluation.

pdf bib abs

Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, with synthetic instructions yielding robust models, and, most importantly, that using as backbone an instruction-tuned model outperforms using a base non-instructed model. Scaling up to Llama 3.1 Instruct 70B as backbone, our model comes near frontier models of much larger sizes for Basque, without using any Basque instructions. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.

German Rigau

2026

2025

2024

2023

2022

2021

2020

2019

2018

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2004

2002

2001

2000

1999

1998

1997

1996

1994

1992

Co-authors

Venues