Reliable, Reproducible, and Really Fast Leaderboards with Evalica

Dmitry Ustalov


Abstract
The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), calls for modern evaluation protocols built on human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.
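As a taste of the Python API mentioned in the abstract, the snippet below scores a toy set of pairwise comparisons with Elo ratings. It is a minimal sketch based on the toolkit's public examples; the elo function and Winner enum follow Evalica's documented interface, but verify names against the current release before relying on them.

# Minimal sketch of building a leaderboard from pairwise comparisons
# with Evalica's Python API (names per the project's public docs).
from evalica import Winner, elo

# Each comparison pits one item against another: Winner.X means the
# first item won, Winner.Y the second, Winner.Draw a tie.
xs = ["pizza", "burger", "pizza"]
ys = ["burger", "sushi", "sushi"]
winners = [Winner.X, Winner.Y, Winner.Draw]

result = elo(xs, ys, winners)

# result.scores is a pandas Series mapping each item to its rating;
# sorting it descending yields the leaderboard.
print(result.scores.sort_values(ascending=False))

Other rating methods described in the paper (e.g., Bradley-Terry) are exposed through analogous functions that take the same three sequences.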
Anthology ID: 2025.coling-demos.6
Volume: Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Brodie Mather, Mark Dras
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 46–53
URL: https://aclanthology.org/2025.coling-demos.6/
Cite (ACL): Dmitry Ustalov. 2025. Reliable, Reproducible, and Really Fast Leaderboards with Evalica. In Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations, pages 46–53, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Reliable, Reproducible, and Really Fast Leaderboards with Evalica (Ustalov, COLING 2025)
PDF: https://aclanthology.org/2025.coling-demos.6.pdf