Reliable, Reproducible, and Really Fast Leaderboards with Evalica

Dmitry Ustalov


Abstract
The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), calls for modern evaluation protocols built on human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.
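As a taste of the Python API mentioned in the abstract, the snippet below scores a toy set of pairwise comparisons with Elo ratings. It is a minimal sketch based on the toolkit's public examples; the elo function and Winner enum follow Evalica's documented interface, but verify names against the current release before relying on them.

# Minimal sketch of building a leaderboard from pairwise comparisons
# with Evalica's Python API (names per the project's public docs).
from evalica import Winner, elo

# Each comparison pits one item against another: Winner.X means the
# first item won, Winner.Y the second, Winner.Draw a tie.
xs = ["pizza", "burger", "pizza"]
ys = ["burger", "sushi", "sushi"]
winners = [Winner.X, Winner.Y, Winner.Draw]

result = elo(xs, ys, winners)

# result.scores is a pandas Series mapping each item to its rating;
# sorting it descending yields the leaderboard.
print(result.scores.sort_values(ascending=False))

Other rating methods described in the paper (e.g., Bradley-Terry) are exposed through analogous functions that take the same three sequences.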
Anthology ID: 2025.coling-demos.6
Volume: Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Brodie Mather, Mark Dras
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 46–53
URL: https://aclanthology.org/2025.coling-demos.6/
Cite (ACL): Dmitry Ustalov. 2025. Reliable, Reproducible, and Really Fast Leaderboards with Evalica. In Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations, pages 46–53, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Reliable, Reproducible, and Really Fast Leaderboards with Evalica (Ustalov, COLING 2025)
PDF: https://aclanthology.org/2025.coling-demos.6.pdf