@inproceedings{li-etal-2025-llm-judge,
title = "{LLM}-as-a-Judge Failures at Automating the Identification of Poor Quality Outputs in Free-Form Texts",
author = "Li, Zongxia and
Wu, Xiyang and
Mondal, Ishani and
Siu, Alexa and
Boyd-Graber, Jordan Lee and
Nenkova, Ani",
editor = "Dong, Yue and
Xiao, Wen and
Zhang, Haopeng and
Zhang, Rui and
Ernst, Ori and
Wang, Lu and
Liu, Fei",
booktitle = "Proceedings of The 5th New Frontiers in Summarization Workshop",
month = nov,
year = "2025",
address = "Hybrid",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.newsum-main.1/",
pages = "1--16",
ISBN = "979-8-89176-337-1",
abstract = "Large language models (LLMs) such as GPT-4, Claude and LLaMA are routinely used to evaluate long-form text generated by language models. We study the ability of these models to identify low quality texts, an increasingly rare subset of output which is of great interest to pinpoint during development. We present experiments with a panel of LLM judges, and crowd-sourced approximations of reference judgments. Pinpointing sub-par outputs is a difficult task for both models and crowdworkers, with models doing overall better. Moreover, unlike findings in prior work on factoid question answering, panels of cheaper models do not agree as well with high quality developer judgments of low quality as panels of frontier models. We present qualitative and quantitative analysis of the relative strengths of models in the panel, gleaning insights why they yield better results over a single model."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="li-etal-2025-llm-judge">
<titleInfo>
<title>LLM-as-a-Judge Failures at Automating the Identification of Poor Quality Outputs in Free-Form Texts</title>
</titleInfo>
<name type="personal">
<namePart type="given">Zongxia</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xiyang</namePart>
<namePart type="family">Wu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ishani</namePart>
<namePart type="family">Mondal</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alexa</namePart>
<namePart type="family">Siu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jordan</namePart>
<namePart type="given">Lee</namePart>
<namePart type="family">Boyd-Graber</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ani</namePart>
<namePart type="family">Nenkova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of The 5th New Frontiers in Summarization Workshop</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yue</namePart>
<namePart type="family">Dong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wen</namePart>
<namePart type="family">Xiao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Haopeng</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rui</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ori</namePart>
<namePart type="family">Ernst</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lu</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Fei</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Hybrid</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-337-1</identifier>
</relatedItem>
<abstract>Large language models (LLMs) such as GPT-4, Claude and LLaMA are routinely used to evaluate long-form text generated by language models. We study the ability of these models to identify low quality texts, an increasingly rare subset of output which is of great interest to pinpoint during development. We present experiments with a panel of LLM judges, and crowd-sourced approximations of reference judgments. Pinpointing sub-par outputs is a difficult task for both models and crowdworkers, with models doing overall better. Moreover, unlike findings in prior work on factoid question answering, panels of cheaper models do not agree as well with high quality developer judgments of low quality as panels of frontier models. We present qualitative and quantitative analysis of the relative strengths of models in the panel, gleaning insights why they yield better results over a single model.</abstract>
<identifier type="citekey">li-etal-2025-llm-judge</identifier>
<location>
<url>https://aclanthology.org/2025.newsum-main.1/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>1</start>
<end>16</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T LLM-as-a-Judge Failures at Automating the Identification of Poor Quality Outputs in Free-Form Texts
%A Li, Zongxia
%A Wu, Xiyang
%A Mondal, Ishani
%A Siu, Alexa
%A Boyd-Graber, Jordan Lee
%A Nenkova, Ani
%Y Dong, Yue
%Y Xiao, Wen
%Y Zhang, Haopeng
%Y Zhang, Rui
%Y Ernst, Ori
%Y Wang, Lu
%Y Liu, Fei
%S Proceedings of The 5th New Frontiers in Summarization Workshop
%D 2025
%8 November
%I Association for Computational Linguistics
%C Hybrid
%@ 979-8-89176-337-1
%F li-etal-2025-llm-judge
%X Large language models (LLMs) such as GPT-4, Claude and LLaMA are routinely used to evaluate long-form text generated by language models. We study the ability of these models to identify low quality texts, an increasingly rare subset of output which is of great interest to pinpoint during development. We present experiments with a panel of LLM judges, and crowd-sourced approximations of reference judgments. Pinpointing sub-par outputs is a difficult task for both models and crowdworkers, with models doing overall better. Moreover, unlike findings in prior work on factoid question answering, panels of cheaper models do not agree as well with high quality developer judgments of low quality as panels of frontier models. We present qualitative and quantitative analysis of the relative strengths of models in the panel, gleaning insights why they yield better results over a single model.
%U https://aclanthology.org/2025.newsum-main.1/
%P 1-16
Markdown (Informal)
[LLM-as-a-Judge Failures at Automating the Identification of Poor Quality Outputs in Free-Form Texts](https://aclanthology.org/2025.newsum-main.1/) (Li et al., NewSum 2025)
ACL