LLM-as-a-qualitative-judge: automating error analysis in natural language generation

Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian


Abstract
Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but it is primarily used as a quantitative tool, i.e., with numerical scores as the main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the NLG system outputs. Our approach aims to provide developers with meaningful insights into how a given NLG system can be improved, and consists of two main steps: open-ended per-instance issue analysis, and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 of cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling those composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve NLG system performance.
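The cumulative clustering step mentioned above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function names and the matching criterion are assumptions, and in the paper's setting the membership decision would be made by an LLM judge rather than the injected `match_fn` callable used here.

```python
def cluster_issues(issues, match_fn):
    """Cumulatively group per-instance issue descriptions into issue types.

    Issues are processed one by one: each is merged into the first existing
    cluster whose representative it matches, or starts a new cluster.
    match_fn(issue, representative) -> bool is the (assumed) matching oracle.
    """
    clusters = []  # each cluster: {"representative": str, "members": [str, ...]}
    for issue in issues:
        for cluster in clusters:
            if match_fn(issue, cluster["representative"]):
                cluster["members"].append(issue)
                break
        else:
            # no existing cluster matched: open a new one
            clusters.append({"representative": issue, "members": [issue]})
    return clusters


def report(clusters):
    """Render a simple error-type report, most frequent issue types first."""
    ordered = sorted(clusters, key=lambda c: len(c["members"]), reverse=True)
    return "\n".join(f"{len(c['members'])}x  {c['representative']}" for c in ordered)


# Toy usage with a trivial keyword-based matcher standing in for the LLM judge
issues = ["hallucinated entity", "hallucinated date", "wrong output language"]
clusters = cluster_issues(issues, lambda a, b: a.split()[0] == b.split()[0])
print(report(clusters))
```

The cumulative (single-pass) formulation avoids fixing the number of issue types in advance, at the cost of making cluster assignments depend on processing order.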
Anthology ID:
2026.mme-main.7
Volume:
Proceedings of the First Workshop on Multilingual Multicultural Evaluation
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Pinzhen Chen, Vilém Zouhar, Hanxu Hu, Simran Khanuja, Wenhao Zhu, Barry Haddow, Alexandra Birch, Alham Fikri Aji, Rico Sennrich, Sara Hooker
Venues:
MME | WS
Publisher:
Association for Computational Linguistics
Pages:
99–132
URL:
https://aclanthology.org/2026.mme-main.7/
Cite (ACL):
Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, and Markarit Vartampetian. 2026. LLM-as-a-qualitative-judge: automating error analysis in natural language generation. In Proceedings of the First Workshop on Multilingual Multicultural Evaluation, pages 99–132, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
LLM-as-a-qualitative-judge: automating error analysis in natural language generation (Chirkova et al., MME 2026)
PDF:
https://aclanthology.org/2026.mme-main.7.pdf