The 2025 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results

Anya Belz; Craig Thomson; Javier González Corbelle; Malo Ruelle

Abstract

This paper presents an overview of, and the results from, the 2025 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP’25), which followed on from four previous shared tasks on reproducibility of evaluations: ReproNLP’24, ReproNLP’23, ReproGen’22 and ReproGen’21. This shared task series forms part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP and machine learning, against a backdrop of increasing recognition of the importance of the topic across the two fields. We describe the ReproNLP’25 shared task, summarise results from the reproduction studies submitted, and provide additional comparative analysis of their results, including, for the first time, additional ‘sanity-check’ evaluations by LLMs.

Anthology ID: 2025.gem-1.78
Volume: Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month: July
Year: 2025
Address: Vienna, Austria and virtual meeting
Editors: Ofir Arviv, Miruna Clinciu, Kaustubh Dhole, Rotem Dror, Sebastian Gehrmann, Eliya Habba, Itay Itzhak, Simon Mille, Yotam Perlitz, Enrico Santus, João Sedoc, Michal Shmueli Scheuer, Gabriel Stanovsky, Oyvind Tafjord
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 1002–1016
URL: https://aclanthology.org/2025.gem-1.78/
Cite (ACL): Anya Belz, Craig Thomson, Javier González Corbelle, and Malo Ruelle. 2025. The 2025 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 1002–1016, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal): The 2025 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results (Belz et al., GEM 2025)
PDF: https://aclanthology.org/2025.gem-1.78.pdf
