CAMIEval: Enhancing NLG Evaluation through Multidimensional Comparative Instruction-Following Analysis

Ziyue Fan; Junliang He; Li Xiaoqing; Shaohui Kuang; Kai Song; Yaqian Zhou; Xipeng Qiu (邱锡鹏)

doi:10.18653/v1/2025.naacl-long.438

Abstract

Driven by the rapid development of large language models (LLMs) and their strong performance across many fields, LLM-based evaluation methods (LLM-as-a-Judge) have become widely used in natural language generation (NLG) evaluation. However, these methods face three challenges: (1) distinguishing instruction-following ability, (2) remaining applicable across diverse NLG tasks, and (3) identifying low-quality outputs. To address these issues, we propose CAMIEval, a multidimensional comparative evaluation method based on instruction-following. Specifically, we define three fundamental dimensions of instruction-following: relevance, factuality, and adherence. We then introduce a concrete Chain-of-Thoughts (ConcreteCoT) process to improve evaluation accuracy. In addition, we train a “regrettable model”, RegretLM, to generate low-quality outputs; comparing these low-quality outputs with reference outputs helps the evaluator identify potential shortcomings of the candidate output. From this comparison, the evaluator generates instruction-specific dimensions that complement the fundamental dimensions, forming a more comprehensive set of evaluation metrics. Experiments on two NLG evaluation benchmarks show that CAMIEval consistently outperforms existing methods in correlation with human evaluations, providing a general and accurate framework for evaluating LLM outputs.

Anthology ID: 2025.naacl-long.438
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 8708–8733
URL: https://aclanthology.org/2025.naacl-long.438/
DOI: 10.18653/v1/2025.naacl-long.438
Cite (ACL): Ziyue Fan, Junliang He, Li Xiaoqing, Shaohui Kuang, Kai Song, Yaqian Zhou, and Xipeng Qiu. 2025. CAMIEval: Enhancing NLG Evaluation through Multidimensional Comparative Instruction-Following Analysis. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8708–8733, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): CAMIEval: Enhancing NLG Evaluation through Multidimensional Comparative Instruction-Following Analysis (Fan et al., NAACL 2025)
PDF: https://aclanthology.org/2025.naacl-long.438.pdf
