CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation

Pei Ke, Bosi Wen, Andrew Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang


Abstract
Since the natural language processing (NLP) community began using large language models (LLMs) as critics to evaluate the quality of generated texts, most existing work has trained critique generation models on evaluation data labeled by directly prompting GPT-4. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison, especially without references. As a result, their critiques cannot distinguish generated texts at a fine-grained level, leading to unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which first acquires pointwise grading critiques with pseudo references and then revises these critiques via multi-path prompting to obtain informative evaluation data for different tasks and settings, including pointwise grading and pairwise comparison with/without references. After fine-tuning on these data, the resulting model, CritiqueLLM, empirically outperforms ChatGPT and all open-source baselines, and even achieves evaluation performance comparable to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can serve as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.
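The abstract describes a two-stage data-construction pipeline: first obtain pointwise grading critiques with pseudo references, then revise them via multi-path prompting into reference-free and pairwise evaluation data. For concreteness, below is a minimal sketch of what such a pipeline could look like. The prompt wording, the `ask`/`pointwise_critique`/`revise_*` helper names, and the use of the OpenAI chat API are all illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of an Eval-Instruct-style two-stage pipeline.
# All prompts and helper names are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    """Single-turn call to the labeling model (GPT-4 in the paper)."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def pointwise_critique(question: str, answer: str, pseudo_reference: str) -> str:
    """Stage 1: grade one answer against a pseudo reference (referenced setting)."""
    return ask(
        f"Question: {question}\nReference answer: {pseudo_reference}\n"
        f"Answer to evaluate: {answer}\n"
        "Critique the answer against the reference, then give a 1-10 score."
    )

def revise_reference_free(question: str, answer: str, critique: str) -> str:
    """Stage 2, one revision path: rewrite the critique so it no longer
    relies on the reference, yielding reference-free pointwise data."""
    return ask(
        f"Question: {question}\nAnswer: {answer}\nCritique: {critique}\n"
        "Rewrite this critique so that it stands without any reference answer, "
        "keeping every concrete, informative judgment."
    )

def revise_to_pairwise(question: str, critique_a: str, critique_b: str) -> str:
    """Stage 2, another revision path: merge two pointwise critiques into a
    pairwise comparison with an explicit preference."""
    return ask(
        f"Question: {question}\nCritique of answer A: {critique_a}\n"
        f"Critique of answer B: {critique_b}\n"
        "Compare the two answers based on these critiques and state which is better."
    )
```

Fine-tuning a model on the resulting referenced, reference-free, pointwise, and pairwise critiques would then yield the evaluator (CritiqueLLM in the paper).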
Anthology ID: 2024.acl-long.704
Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 13034–13054
URL: https://aclanthology.org/2024.acl-long.704
Cite (ACL): Pei Ke, Bosi Wen, Andrew Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, and Minlie Huang. 2024. CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13034–13054, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation (Ke et al., ACL 2024)
PDF: https://aclanthology.org/2024.acl-long.704.pdf