Removing Prompt-template Bias in Reinforcement Learning from Human Feedback

Chaojie Wang; Haonan Shi; Long Tian; Bo An; Shuicheng Yan

doi:10.18653/v1/2025.findings-acl.1237

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become an essential technique for enhancing pre-trained large language models (LLMs) so that they generate responses aligned with human preferences and societal values. Although RLHF has shown promise, the training of reward models (RMs) still suffers from reward hacking, which has motivated recent work on preventing RMs from finding shortcuts that bypass the intended optimization objective by latching onto simplistic patterns such as response length. Beyond length bias, our work is the first to reveal that prompt-template bias learned by RMs can also cause reward hacking on some marginal samples, with the result that, after RLHF fine-tuning, LLMs prefer to generate responses in a specific format regardless of the format requested in the prompt. To address this, we propose a low-cost but effective method, Prompt Bias Calibration (PBC), which estimates the prompt-template bias term during reward modeling; the estimate can then be used to calibrate reward scores in the subsequent RL fine-tuning process. We further show that PBC can be flexibly combined with existing methods for removing length bias, leading to a further improvement in the quality of generated responses.
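
As a rough illustration of what "calibrating reward scores" with an estimated prompt-level bias could look like, the sketch below subtracts a per-prompt bias estimate from each raw reward. The subtraction form, the names `ScoredResponse`, `calibrate`, and `prompt_bias`, and all numbers are assumptions made for illustration; the abstract does not specify how PBC estimates or applies the bias term.

```python
from dataclasses import dataclass


@dataclass
class ScoredResponse:
    prompt_id: str     # hypothetical identifier for the prompt (and its template)
    raw_reward: float  # score produced by the reward model


def calibrate(samples, prompt_bias):
    """Subtract a per-prompt bias estimate from each raw reward.

    Assumption: the bias enters additively, so calibration is a subtraction.
    Prompts without an estimate are left unchanged (bias treated as 0.0).
    """
    return [s.raw_reward - prompt_bias.get(s.prompt_id, 0.0) for s in samples]


# Made-up scores: two responses to prompt "p1", one response to prompt "p2".
samples = [
    ScoredResponse("p1", 2.5),
    ScoredResponse("p1", 1.0),
    ScoredResponse("p2", 0.5),
]

# Made-up bias estimates, one per prompt.
prompt_bias = {"p1": 1.5, "p2": -0.5}

calibrated = calibrate(samples, prompt_bias)
print(calibrated)
```

Under this assumed additive form, subtracting a constant per prompt leaves the ranking of responses to the same prompt unchanged, while scores for different prompts become comparable on a shared scale.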

Anthology ID: 2025.findings-acl.1237
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 24110–24122
URL: https://aclanthology.org/2025.findings-acl.1237/
DOI: 10.18653/v1/2025.findings-acl.1237
Cite (ACL): Chaojie Wang, Haonan Shi, Long Tian, Bo An, and Shuicheng Yan. 2025. Removing Prompt-template Bias in Reinforcement Learning from Human Feedback. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24110–24122, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Removing Prompt-template Bias in Reinforcement Learning from Human Feedback (Wang et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.1237.pdf
