Do Large Language Models understand how to be judges?

Nicolò Donati, Paolo Torroni, Giuseppe Savino


Abstract
This paper investigates whether Large Language Models (LLMs) can effectively act as judges for evaluating open-ended text generation tasks, such as summarization, by interpreting nuanced editorial criteria. Traditional metrics like ROUGE and BLEU rely on surface-level overlap, while human evaluations remain costly and inconsistent. To address this, we propose a structured rubric with five dimensions: coherence, consistency, fluency, relevance, and ordering, each defined with explicit sub-criteria to guide LLMs in assessing semantic fidelity and structural quality. Using a purpose-built dataset of Italian news summaries generated by GPT-4o, each tailored to isolate specific criteria, we evaluate LLMs’ ability to assign scores and rationales aligned with expert human judgments. Results show moderate alignment (Spearman’s ρ = 0.6–0.7) for criteria like relevance but reveal systematic biases, such as overestimating fluency and coherence, likely inherited from training data. We identify challenges in rubric interpretation, particularly for hierarchical or abstract criteria, and highlight limitations in cross-genre generalization. The study underscores the potential of LLMs as scalable evaluators but emphasizes the need for fine-tuning, diverse benchmarks, and refined rubrics to mitigate biases and enhance reliability. Future directions include expanding to multilingual and multi-genre contexts and exploring task-specific instruction tuning to improve alignment with human editorial standards.
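As an illustration of how the rank alignment reported in the abstract could be computed, the sketch below correlates LLM-judge scores with human ratings for a single rubric dimension using scipy.stats.spearmanr. All scores and variable names are invented for illustration and are not the paper's data or code.

```python
# Minimal sketch: rank agreement between an LLM judge and human raters
# for one rubric dimension (e.g. relevance). Scores are hypothetical,
# NOT the paper's data.
from scipy.stats import spearmanr

# One score per evaluated summary, on an assumed 1-5 scale.
human_scores = [5, 3, 4, 2, 5, 1, 4, 3]
llm_scores = [4, 3, 5, 2, 4, 2, 4, 3]

# Spearman's rho measures rank agreement; the paper reports
# rho = 0.6-0.7 for criteria like relevance.
rho, p_value = spearmanr(human_scores, llm_scores)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```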
Anthology ID: 2025.luhme-1.9
Volume: Proceedings of the 2nd LUHME Workshop
Month: October
Year: 2025
Address: Bologna, Italy
Editors: Henrique Lopes Cardoso, Rui Sousa-Silva, Maarit Koponen, Antonio Pareja-Lora
Venues: LUHME | WS
Publisher: LUHME
Pages: 85–102
URL: https://aclanthology.org/2025.luhme-1.9/
Cite (ACL): Nicolò Donati, Paolo Torroni, and Giuseppe Savino. 2025. Do Large Language Models understand how to be judges? In Proceedings of the 2nd LUHME Workshop, pages 85–102, Bologna, Italy. LUHME.
Cite (Informal): Do Large Language Models understand how to be judges? (Donati et al., LUHME 2025)
PDF: https://aclanthology.org/2025.luhme-1.9.pdf