Anna Salles


2025

IberoBench: A Benchmark for LLM Evaluation in Iberian Languages
Irene Baucells | Javier Aula-Blasco | Iria de-Dios-Flores | Silvia Paniagua Suárez | Naiara Perez | Anna Salles | Susana Sotelo Docio | Júlia Falcão | Jose Javier Saiz | Robiert Sepulveda Torres | Jeremy Barnes | Pablo Gamallo | Aitor Gonzalez-Agirre | German Rigau | Marta Villegas
Proceedings of the 31st International Conference on Computational Linguistics

The current best practice for measuring the performance of base Large Language Models is to establish a multi-task benchmark that covers a range of capabilities of interest. Currently, however, such benchmarks are only available in a few high-resource languages. To address this situation, we present IberoBench, a multilingual, multi-task benchmark for Iberian languages (i.e., Basque, Catalan, Galician, European Spanish, and European Portuguese) built on the LM Evaluation Harness framework. The benchmark consists of 62 tasks divided into 179 subtasks. We evaluate 33 existing LLMs on IberoBench in 0- and 5-shot settings. We also discuss the issues we encountered when working with the Harness and our approaches to solving them to ensure high-quality evaluation.