Comparing Human and Machine Translations of Generative Language Model Evaluation Datasets

Sander Bijl de Vroe, George Stampoulidis, Kai Hakala, Aku Rouhe, Mark van Heeswijk, Jussi Karlgren


Abstract
The evaluation of Large Language Models (LLMs) is one of the crucial current challenges in the field of Natural Language Processing (NLP) and becomes even more challenging in the multilingual setting. Since the majority of the community’s benchmarks exist only in English, test sets are now being machine translated at scale into dozens of languages. This work explores the feasibility of that approach, comparing a Finnish machine translation (MT) of ARC-Challenge with a new human-translated version. Our findings suggest that, since absolute scores are fairly close and model size rankings are preserved, machine translation is adequate in this case. Surprisingly, however, the datasets reverse the order of base models compared to their chat-finetuned counterparts.
Anthology ID:
2025.nodalida-1.9
Volume:
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Month:
March
Year:
2025
Address:
Tallinn, Estonia
Editors:
Richard Johansson, Sara Stymne
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
80–85
URL:
https://aclanthology.org/2025.nodalida-1.9/
Cite (ACL):
Sander Bijl de Vroe, George Stampoulidis, Kai Hakala, Aku Rouhe, Mark van Heeswijk, and Jussi Karlgren. 2025. Comparing Human and Machine Translations of Generative Language Model Evaluation Datasets. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), pages 80–85, Tallinn, Estonia. University of Tartu Library.
Cite (Informal):
Comparing Human and Machine Translations of Generative Language Model Evaluation Datasets (de Vroe et al., NoDaLiDa 2025)
PDF:
https://aclanthology.org/2025.nodalida-1.9.pdf