Arena-lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons

Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, KunTae Kim


Abstract
As Large Language Models (LLMs) expand across domains, LLM judges have become essential for system evaluation. Current benchmarks typically compare system outputs against baselines. This baseline-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-Lite, which integrates a tournament structure on top of head-to-head comparisons. Combining a tournament structure with direct comparison eliminates the need for baseline outputs, reduces the number of required comparisons, and yields more reliable system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. These experiments collectively demonstrate that Arena-Lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. We release an easy-to-use web demonstration and code to foster adoption of Arena-Lite, streamlining model selection across research and industry communities. The Arena-Lite demo and code are available at https://huggingface.co/spaces/NCSOFT/ArenaLite
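To make the comparison-count argument concrete, here is a minimal sketch of tournament-based ranking by direct pairwise comparison, in the spirit of the abstract. The `judge` callable and the bracket logic below are illustrative assumptions, not the paper's actual method: Arena-Lite's prompting, per-instance tournaments, and aggregation are specified in the paper itself. The key property the sketch shows is that a single-elimination bracket over n systems needs only n - 1 judge calls and no baseline outputs.

```python
# Hypothetical sketch: single-elimination ranking via an LLM-judge comparator.
# `judge(a, b)` is a stand-in for a pairwise LLM judgment returning the winner.
import random
from typing import Callable, List


def tournament_rank(systems: List[str],
                    judge: Callable[[str, str], str]) -> List[str]:
    """Return systems ordered from eliminated-first to champion.

    Each round pairs up the remaining systems and advances the judge's
    pick, so n systems are ranked with only n - 1 comparisons in total.
    """
    remaining = systems[:]
    random.shuffle(remaining)          # random bracket seeding
    ranking: List[str] = []
    while len(remaining) > 1:
        next_round = []
        # Pair adjacent entries; an odd one out gets a bye.
        for i in range(0, len(remaining) - 1, 2):
            a, b = remaining[i], remaining[i + 1]
            winner = judge(a, b)
            loser = b if winner == a else a
            ranking.append(loser)      # eliminated this round
            next_round.append(winner)
        if len(remaining) % 2 == 1:
            next_round.append(remaining[-1])
        remaining = next_round
    ranking.extend(remaining)          # the champion comes last
    return ranking


# Toy usage with a hypothetical numeric "quality" judge:
if __name__ == "__main__":
    quality = {"sys_a": 0.9, "sys_b": 0.4, "sys_c": 0.7, "sys_d": 0.6}
    order = tournament_rank(list(quality),
                            judge=lambda x, y: max(x, y, key=quality.get))
    print(order[::-1])  # best first
```

Note that systems eliminated in the same round are not distinguished by this sketch; producing a full ranking from repeated or per-instance tournaments is part of what the paper addresses.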
Anthology ID:
2025.emnlp-main.360
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7068–7086
URL:
https://aclanthology.org/2025.emnlp-main.360/
Cite (ACL):
Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, and KunTae Kim. 2025. Arena-lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7068–7086, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Arena-lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons (Son et al., EMNLP 2025)
PDF:
https://aclanthology.org/2025.emnlp-main.360.pdf
Checklist:
2025.emnlp-main.360.checklist.pdf