Javier Aula-Blasco


2025

pdf bib
VeritasQA: A Truthfulness Benchmark Aimed at Multilingual Transferability
Javier Aula-Blasco | Júlia Falcão | Susana Sotelo | Silvia Paniagua | Aitor Gonzalez-Agirre | Marta Villegas
Proceedings of the 31st International Conference on Computational Linguistics

As Large Language Models (LLMs) become available in a wider range of domains and applications, evaluating the truthfulness of multilingual LLMs is an issue of increasing relevance. TruthfulQA (Lin et al., 2022) is one of few benchmarks designed to evaluate how models imitate widespread falsehoods. However, it is strongly English-centric and starting to become outdated. We present VeritasQA, a context- and time-independent truthfulness benchmark built with multilingual transferability in mind, and available in Spanish, Catalan, Galician and English. VeritasQA comprises a set of 353 questions and answers inspired by common misconceptions and falsehoods that are not tied to any particular country or recent event. We release VeritasQA under an open license and present the evaluation results of 15 models of various architectures and sizes.

pdf bib
IberoBench: A Benchmark for LLM Evaluation in Iberian Languages
Irene Baucells | Javier Aula-Blasco | Iria de-Dios-Flores | Silvia Paniagua Suárez | Naiara Perez | Anna Salles | Susana Sotelo Docio | Júlia Falcão | Jose Javier Saiz | Robiert Sepulveda Torres | Jeremy Barnes | Pablo Gamallo | Aitor Gonzalez-Agirre | German Rigau | Marta Villegas
Proceedings of the 31st International Conference on Computational Linguistics

The current best practice to measure the performance of base Large Language Models is to establish a multi-task benchmark that covers a range of capabilities of interest. Currently, however, such benchmarks are only available in a few high-resource languages. To address this situation, we present IberoBench, a multilingual, multi-task benchmark for Iberian languages (i.e., Basque, Catalan, Galician, European Spanish and European Portuguese) built on the LM Evaluation Harness framework. The benchmark consists of 62 tasks divided into 179 subtasks. We evaluate 33 existing LLMs on IberoBench on 0- and 5-shot settings. We also explore the issues we encounter when working with the Harness and our approach to solving them to ensure high-quality evaluation.

2024

pdf bib
Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan
Aitor Gonzalez-Agirre | Montserrat Marimon | Carlos Rodriguez-Penagos | Javier Aula-Blasco | Irene Baucells | Carme Armentano-Oller | Jorge Palomar-Giner | Baybars Kulebi | Marta Villegas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Current LLM-based applications are becoming steadily available for everyone with a reliable access to technology and the internet. These applications offer benefits to their users that leave those without access to them at a serious disadvantage. Given the vastly large amount of data needed to train LLMs, the gap between languages with access to such quantity of data and those without it is currently larger than ever. Aimed at saving this gap, the Aina Project was created to provide Catalan with the necessary resources to keep being relevant in the context of AI/NLP applications based on LLMs. We thus present a set of strategies to consider when improving technology support for a mid- or low-resource language, specially addressing sustainability of high-quality data acquisition and the challenges involved in the process. We also introduce a large amount of new annotated data for Catalan. Our hope is that those interested in replicating this work for another language can learn from what worked for us, the challenges that we faced, and the sometimes disheartening truth of working with mid- and low-resource languages.