Large Language Models are not Fair Evaluators

Peiyi Wang (王培懿); Lei Li; Liang Chen; Zefan Cai; Dawei Zhu; Binghuai Lin; Yunbo Cao; Lingpeng Kong; Qi Liu; Tianyu Liu; Zhifang Sui

doi:10.18653/v1/2024.acl-long.511

Abstract

In this paper, we uncover a positional bias in the evaluation paradigm that adopts large language models (LLMs), e.g., GPT-4, as referees to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 out of 80 tested queries with ChatGPT as an evaluator. We propose a simple yet effective calibration framework to address this positional bias. To evaluate the effectiveness of our framework, we manually annotate the “win/tie/lose” outcomes of responses from ChatGPT and Vicuna-13B on the question prompts of the Vicuna Benchmark. Extensive experiments demonstrate that our approach successfully alleviates evaluation bias, resulting in closer alignment with human judgments.
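The abstract does not spell out the calibration framework, so the following is only a minimal sketch of the general idea it points at: query the evaluator with the two responses in both orders, check whether the verdict flips, and average each response's scores across the two positions. The `Judge` signature, the function names, and the toy judge are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
from typing import Callable, Tuple

# Hypothetical judge interface (an assumption, not an API from the paper):
# (question, first_response, second_response) -> (score_first, score_second)
Judge = Callable[[str, str, str], Tuple[float, float]]


def verdict(score_a: float, score_b: float, eps: float = 1e-9) -> str:
    """Return the win/tie/lose outcome from response A's point of view."""
    if abs(score_a - score_b) <= eps:
        return "tie"
    return "win" if score_a > score_b else "lose"


def is_order_sensitive(judge: Judge, question: str, resp_a: str, resp_b: str) -> bool:
    """True if swapping the presentation order changes A's outcome."""
    a_first, b_second = judge(question, resp_a, resp_b)
    b_first, a_second = judge(question, resp_b, resp_a)
    return verdict(a_first, b_second) != verdict(a_second, b_first)


def order_balanced_scores(
    judge: Judge, question: str, resp_a: str, resp_b: str
) -> Tuple[float, float]:
    """Average each response's score over both presentation orders."""
    a_first, b_second = judge(question, resp_a, resp_b)
    b_first, a_second = judge(question, resp_b, resp_a)
    return (a_first + a_second) / 2, (b_first + b_second) / 2


def toy_biased_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    """Toy stand-in for an LLM judge: scores by length, plus a bonus for slot one."""
    return len(first) / 10 + 2.0, len(second) / 10


if __name__ == "__main__":
    q, a, b = "example question", "a" * 50, "b" * 60
    print(is_order_sensitive(toy_biased_judge, q, a, b))
    print(order_balanced_scores(toy_biased_judge, q, a, b))
```

With the toy judge, whichever response is shown first wins, which mirrors the kind of order-driven flip the abstract describes; averaging over both orders cancels the position bonus. A real evaluator would replace `toy_biased_judge` with a call to an LLM, and its bias need not be a constant offset, so averaging should be read as a mitigation rather than a guarantee.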

Anthology ID: 2024.acl-long.511
Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 9440–9450
URL: https://aclanthology.org/2024.acl-long.511/
DOI: 10.18653/v1/2024.acl-long.511
Cite (ACL): Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, and Zhifang Sui. 2024. Large Language Models are not Fair Evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9440–9450, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Large Language Models are not Fair Evaluators (Wang et al., ACL 2024)
PDF: https://aclanthology.org/2024.acl-long.511.pdf
