Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks

Rajarshi Haldar, Julia Hockenmaier
Abstract
As Natural Language Generation (NLG) systems see ever wider adoption, evaluating their outputs properly has become increasingly difficult. Lately, using large language models (LLMs) to evaluate these generations has gained traction, as LLM judges tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability: the scores they assign vary across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, and makes it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and examine whether, with proper guidelines, judicious use of LLM judges can still be worthwhile.
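To make the notion of intra-rater reliability concrete, here is a minimal sketch (not the authors' code, and with hypothetical ratings in place of real LLM-judge output) of how one might measure it: score the same items over several independent judge runs, then check per-item spread, exact-agreement rate, and rank stability across runs.

```python
# Sketch: quantifying an LLM judge's self-(in)consistency across runs.
# The 1-5 ratings below are hypothetical stand-ins for real judge output.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

# Rows = independent judge runs on identical prompts, cols = items.
scores = np.array([
    [4, 3, 5, 2, 4, 3],
    [5, 3, 4, 2, 3, 3],
    [4, 2, 5, 3, 4, 2],
])

# Per-item spread: a self-consistent judge has (near-)zero std per item.
print("per-item std:", scores.std(axis=0))

# Exact-agreement rate: fraction of items where every run gives the same score.
print("exact agreement:", np.mean(np.all(scores == scores[0], axis=0)))

# Rank stability: mean pairwise Spearman correlation between runs.
rhos = [spearmanr(scores[i], scores[j]).correlation
        for i, j in combinations(range(len(scores)), 2)]
print("mean pairwise Spearman rho:", np.mean(rhos))
```

With ratings like these, low per-item standard deviation and high pairwise correlation would indicate a reliable judge; the paper's finding is that real LLM judges often fall short on such measures.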
Anthology ID:
2025.findings-emnlp.1361
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
24986–25004
URL:
https://aclanthology.org/2025.findings-emnlp.1361/
Cite (ACL):
Rajarshi Haldar and Julia Hockenmaier. 2025. Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 24986–25004, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks (Haldar & Hockenmaier, Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.1361.pdf
Checklist:
2025.findings-emnlp.1361.checklist.pdf