Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks

Qintong Li, Leyang Cui, Lingpeng Kong, Wei Bi


Abstract
Previous work has adopted large language models (LLMs) as evaluators for natural language processing (NLP) tasks. However, current LLM evaluators still suffer from shortcomings in, e.g., fairness, scope, and accuracy. To analyze whether LLMs can serve as reliable alternatives to humans, we examine the fine-grained alignment between LLM evaluators and human annotators, particularly in understanding the target evaluation tasks and conducting evaluations that meet diverse criteria. This paper covers both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning), each with different evaluation criteria. Our analysis shows that 1) LLM evaluators can generate unnecessary criteria or omit crucial ones, deviating slightly from expert judgments, and 2) LLM evaluators excel on general criteria, such as fluency, but struggle with complex criteria, such as numerical reasoning. We also find that having the LLM pre-draft evaluations before human annotation reduces the impact of human subjectivity and minimizes annotation outliers compared with purely human evaluation, leading to more objective assessments. All resources are available at https://github.com/qtli/CoEval.
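As a rough illustration of the pre-drafting workflow the abstract describes, the sketch below has an LLM propose task-specific evaluation criteria and draft per-criterion judgments that a human annotator can then revise. The `call_llm` helper, the prompt wording, and the 1-5 scale are hypothetical placeholders, not the paper's actual CoEval implementation.

```python
# Hypothetical sketch of LLM pre-drafting before human evaluation.
# `call_llm` stands in for any text-completion client; it is not part
# of the CoEval codebase.
from typing import Callable, Dict, List


def draft_criteria(call_llm: Callable[[str], str], task_description: str) -> List[str]:
    """Ask the LLM to propose evaluation criteria for the target task."""
    prompt = (
        f"Task: {task_description}\n"
        "List the most important criteria for evaluating system outputs "
        "on this task, one criterion per line."
    )
    # Keep non-empty lines, dropping any leading bullet markers.
    return [line.strip("- ").strip()
            for line in call_llm(prompt).splitlines() if line.strip()]


def draft_scores(call_llm: Callable[[str], str],
                 output_text: str,
                 criteria: List[str]) -> Dict[str, str]:
    """Ask the LLM for a draft judgment per criterion; humans revise these later."""
    drafts = {}
    for criterion in criteria:
        prompt = (
            f"Criterion: {criterion}\n"
            f"System output: {output_text}\n"
            "Give a 1-5 score and a one-sentence justification."
        )
        drafts[criterion] = call_llm(prompt)
    return drafts
```

In this setup a human annotator reviews and corrects the drafted criteria and scores rather than scoring from scratch, which is the scenario the paper contrasts with purely human evaluation.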
Anthology ID:
2025.coling-main.688
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
10325–10344
URL:
https://aclanthology.org/2025.coling-main.688/
Cite (ACL):
Qintong Li, Leyang Cui, Lingpeng Kong, and Wei Bi. 2025. Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10325–10344, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks (Li et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.688.pdf