Elo Uncovered: Robustness and Best Practices in Language Model Evaluation

Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, Marzieh Fadaee


Abstract
In Natural Language Processing (NLP), the Elo rating system, well-established for ranking dynamic competitors in games like chess, has seen increasing adoption for evaluating Large Language Models (LLMs) through “A vs B” paired comparisons. However, while popular, the system’s suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. Our study investigates the sensitivity and reproducibility of Elo scores for LLMs, integrating both synthetic and human feedback. We show that Elo ratings for LLMs stabilize with 100 or more comparison permutations. A lower K-factor is preferable for closely matched models, whereas a higher K-factor better distinguishes models with clear performance differences. We also report that transitivity (A ≻ B and B ≻ C implies A ≻ C) does not consistently hold, particularly when models demonstrate similar performance. Our empirical findings provide guidelines for more reliable LLM evaluation.
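
To make the K-factor's role concrete, the minimal Python sketch below implements the standard Elo update rule referenced in the abstract (expected score plus rating adjustment). The ratings, K values, and match outcomes here are illustrative assumptions, not the paper's experimental configuration.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 16.0):
    """Update both ratings after one 'A vs B' comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    The K-factor controls how far a single outcome moves the ratings.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: two equally rated models; A wins a single comparison.
print(elo_update(1000, 1000, 1.0, k=16))  # -> (1008.0, 992.0)
print(elo_update(1000, 1000, 1.0, k=32))  # -> (1016.0, 984.0)

Running the example shows that a larger K moves ratings further on a single outcome, which is consistent with the paper's guidance: a lower K-factor for closely matched models, a higher K-factor when performance gaps are clear.
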
Anthology ID:
2023.gem-1.28
Volume:
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Month:
December
Year:
2023
Address:
Singapore
Editors:
Sebastian Gehrmann, Alex Wang, João Sedoc, Elizabeth Clark, Kaustubh Dhole, Khyathi Raghavi Chandu, Enrico Santus, Hooman Sedghamiz
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
339–352
URL:
https://aclanthology.org/2023.gem-1.28
Cite (ACL):
Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, and Marzieh Fadaee. 2023. Elo Uncovered: Robustness and Best Practices in Language Model Evaluation. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 339–352, Singapore. Association for Computational Linguistics.
Cite (Informal):
Elo Uncovered: Robustness and Best Practices in Language Model Evaluation (Boubdir et al., GEM-WS 2023)
PDF:
https://aclanthology.org/2023.gem-1.28.pdf