Evaluating the Consistency of LLM Evaluators

Noah Lee, Jiwoo Hong, James Thorne


Abstract
Large language models (LLMs) have shown potential as general evaluators, with evident benefits in speed and cost. While their correlation with human annotators has been widely studied, their consistency as evaluators remains understudied, raising concerns about the reliability of LLM evaluators. In this paper, we conduct extensive studies on two aspects of consistency in LLM evaluations, Self-Consistency (SC) and Inter-scale Consistency (IC), across different scoring scales and levels of criterion granularity with open-source and proprietary models. Our comprehensive analysis demonstrates that strong proprietary models are not necessarily consistent evaluators, highlighting the importance of considering consistency when assessing the capability of LLM evaluators.
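The paper's exact SC and IC formulations are not reproduced on this page; the sketch below is only one plausible way to operationalize the two notions, under assumptions of my own: self-consistency measured as per-item score dispersion across repeated evaluations of the same output, and inter-scale consistency as the Spearman rank correlation between scores assigned on two different scoring scales. The function names and toy data are illustrative, not the authors' method.

```python
# Illustrative sketch only (assumed definitions, not the paper's exact metrics).
# SC: mean per-item standard deviation over repeated LLM scorings (lower = more consistent).
# IC: Spearman correlation between mean scores given on two different scales (higher = more consistent).
import numpy as np
from scipy.stats import spearmanr


def self_consistency(scores: np.ndarray) -> float:
    """scores: (n_items, n_repeats) array of scores from repeated evaluations."""
    return float(np.mean(np.std(scores, axis=1)))


def inter_scale_consistency(scores_scale_a: np.ndarray,
                            scores_scale_b: np.ndarray) -> float:
    """Per-item mean scores on two scales (e.g., 1-5 vs. 1-10)."""
    rho, _ = spearmanr(scores_scale_a, scores_scale_b)
    return float(rho)


# Toy example with random scores (not data from the paper).
rng = np.random.default_rng(0)
scores_5pt = rng.integers(1, 6, size=(50, 3)).astype(float)   # 50 items, 3 repeated scorings
scores_10pt = rng.integers(1, 11, size=(50, 3)).astype(float)

print("SC (5-pt scale):", self_consistency(scores_5pt))
print("IC (5-pt vs. 10-pt):", inter_scale_consistency(scores_5pt.mean(axis=1),
                                                      scores_10pt.mean(axis=1)))
```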
Anthology ID:
2025.coling-main.710
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
10650–10659
URL:
https://aclanthology.org/2025.coling-main.710/
Cite (ACL):
Noah Lee, Jiwoo Hong, and James Thorne. 2025. Evaluating the Consistency of LLM Evaluators. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10650–10659, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Evaluating the Consistency of LLM Evaluators (Lee et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.710.pdf