Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
Eva Sánchez Salido, Roser Morante, Julio Gonzalo, Guillermo Marco, Jorge Carrillo-de-Albornoz, Laura Plaza, Enrique Amigo, Andrés Fernandez García, Alejandro Benito-Santos, Adrián Ghajari Espinosa, Victor Fresno
Proceedings of the 31st International Conference on Computational Linguistics, 2025
In this article we present UNED-ACCESS 2024, a bilingual dataset consisting of 1003 multiple-choice questions at university entrance level in Spanish and English. Questions were originally formulated in Spanish and manually translated into English, and have never been publicly released, ensuring minimal contamination when evaluating Large Language Models with this dataset. A selection of current open-source and proprietary models is evaluated in a uniform zero-shot experimental setting both on the UNED-ACCESS 2024 dataset and on an equivalent subset of MMLU questions. Results show that (i) smaller models not only perform worse than the largest models but also degrade faster in Spanish than in English: the performance gap between the two languages is negligible for the best models but grows up to 37% for smaller models; (ii) model ranking on UNED-ACCESS 2024 is almost identical (0.98 Pearson correlation) to the one obtained with MMLU, a similar but publicly available benchmark, suggesting that contamination affects all models similarly; and (iii) as in publicly available datasets, reasoning questions in UNED-ACCESS 2024 are more challenging for models of all sizes.
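To make finding (ii) concrete, ranking agreement between the two benchmarks can be quantified as the Pearson correlation over per-model accuracy scores. The sketch below is a minimal illustration of that computation, not the paper's evaluation code; the model names and accuracy values are hypothetical placeholders, not results reported in the article.

```python
from scipy.stats import pearsonr

# Hypothetical per-model accuracies (fraction of questions answered correctly)
# on each benchmark. These values are illustrative placeholders only.
uned_access = {"model-a": 0.82, "model-b": 0.74, "model-c": 0.61, "model-d": 0.48}
mmlu_subset = {"model-a": 0.85, "model-b": 0.77, "model-c": 0.63, "model-d": 0.52}

# Fix a common model order so both score vectors are aligned.
models = sorted(uned_access)
x = [uned_access[m] for m in models]
y = [mmlu_subset[m] for m in models]

r, p_value = pearsonr(x, y)
print(f"Pearson correlation between benchmark scores: r={r:.2f} (p={p_value:.3f})")
```

A correlation near 1.0 on such aligned score vectors would indicate that the two benchmarks order the models almost identically, which is the sense in which the paper reports a 0.98 Pearson correlation between UNED-ACCESS 2024 and the MMLU subset.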