Explaining Length Bias in LLM-Based Preference Evaluations

Zhengyu Hu; Linxin Song; Jieyu Zhang; Zheyuan Xiao; Tianfu Wang; Zhengyu Chen; Nicholas Jing Yuan; Jianxun Lian; Kaize Ding; Hui Xiong

doi:10.18653/v1/2025.findings-emnlp.358

Abstract

The use of large language models (LLMs) as judges, particularly in preference comparisons, has become widespread, but such judges exhibit a notable bias towards longer responses, which undermines the reliability of these evaluations. To better understand this bias, we propose to decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass. The former is length-independent and relates to trustworthiness, such as correctness, toxicity, and consistency; the latter is length-dependent and represents the amount of information in the response. We empirically demonstrate the decomposition through controlled experiments and find that response length impacts evaluations by influencing information mass. To derive a reliable evaluation metric that assesses content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win rate measurement. Specifically, AdapAlpaca ensures a fair comparison of response quality by aligning the lengths of reference and test model responses under equivalent length intervals.
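The idea of comparing responses only within equivalent length intervals can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of word count as the length measure, the interval width of 200, and the `judge` callable are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional


def length_interval(text: str, width: int = 200) -> int:
    """Return the index of the word-count interval that `text` falls into.

    Word count and the default width of 200 are illustrative assumptions.
    """
    return len(text.split()) // width


def length_matched_win_rate(
    test_responses: List[str],
    reference_pools: List[Dict[int, str]],
    judge: Callable[[str, str], bool],
    width: int = 200,
) -> Optional[float]:
    """Win rate of test responses against length-matched references.

    `reference_pools[i]` maps an interval index to a reference response
    for prompt i (a hypothetical data layout). `judge(test, reference)`
    is a hypothetical stand-in for an LLM judge and returns True when the
    test response is preferred. Prompts with no reference in the matching
    interval are skipped. Returns None if nothing could be compared.
    """
    wins = 0
    compared = 0
    for test, pool in zip(test_responses, reference_pools):
        reference = pool.get(length_interval(test, width))
        if reference is None:
            continue  # no reference of comparable length for this prompt
        compared += 1
        if judge(test, reference):
            wins += 1
    return wins / compared if compared else None
```

In this sketch, the judge only ever sees pairs whose lengths fall in the same interval, so a preference for the test response cannot be attributed to a large length gap between the two.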

Anthology ID: 2025.findings-emnlp.358
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6763–6794
URL: https://aclanthology.org/2025.findings-emnlp.358/
DOI: 10.18653/v1/2025.findings-emnlp.358
Cite (ACL): Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhengyu Chen, Nicholas Jing Yuan, Jianxun Lian, Kaize Ding, and Hui Xiong. 2025. Explaining Length Bias in LLM-Based Preference Evaluations. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 6763–6794, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Explaining Length Bias in LLM-Based Preference Evaluations (Hu et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.358.pdf
Checklist: 2025.findings-emnlp.358.checklist.pdf
