2024
Prior Constraints-based Reward Model Training for Aligning Large Language Models
Zhou Hang | Wang Chenglong | Hu Yimin | Xiao Tong | Zhang Chunliang | Zhu Jingbo
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
“Reinforcement learning with human feedback for aligning large language models (LLMs) typically trains a reward model using ranking loss with comparison pairs. However, the training procedure suffers from an inherent problem: the uncontrolled scaling of reward scores during reinforcement learning due to the lack of constraints while training the reward model. This paper proposes a Prior Constraints-based Reward Model (PCRM) training method to mitigate this problem. PCRM incorporates prior constraints, specifically the length ratio and cosine similarity between the outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins. We comprehensively evaluate PCRM by examining its rank correlation with human preferences and its effectiveness in aligning LLMs via RL. Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling. As another bonus, our method is easily integrated into arbitrary rank-based alignment methods, such as direct preference optimization, and can yield consistent improvement. The code is available at https://github.com/wangclnlp/DeepSpeed-Chat-Extension/tree/PCRM.”
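
The abstract describes adding prior constraints (length ratio and cosine similarity between the two outputs in a comparison pair) to reward model training so that score margins stay controlled. Below is a minimal sketch of one way such a constrained pairwise ranking loss could look; the function name, the exact margin formula, and the use of sentence embeddings for the cosine term are illustrative assumptions, not the paper's precise formulation.

import torch
import torch.nn.functional as F

def prior_constrained_ranking_loss(score_chosen, score_rejected,
                                   len_chosen, len_rejected,
                                   emb_chosen, emb_rejected,
                                   alpha=1.0):
    """Pairwise ranking loss with a prior-derived margin (illustrative sketch).

    score_*: reward-model scalars for the preferred / dispreferred response (tensors)
    len_*:   token lengths of the two responses (tensors)
    emb_*:   sentence embeddings of the two responses, e.g. from a frozen encoder
    alpha:   scale of the prior-based margin (hypothetical hyperparameter)
    """
    # Length ratio in [0, 1]: similar lengths give a ratio near 1.
    len_ratio = torch.minimum(len_chosen, len_rejected) / torch.maximum(len_chosen, len_rejected)

    # Cosine similarity between the two outputs, mapped from [-1, 1] to [0, 1].
    cos_sim = (F.cosine_similarity(emb_chosen, emb_rejected, dim=-1) + 1.0) / 2.0

    # Prior margin: near-duplicate outputs (high similarity, comparable lengths)
    # get a small target score gap; dissimilar outputs may get a larger one.
    margin = alpha * (1.0 - 0.5 * (len_ratio + cos_sim))

    # Standard pairwise ranking loss, shifted by the prior-based margin.
    return -F.logsigmoid(score_chosen - score_rejected - margin).mean()

In this sketch, pairs whose outputs are nearly identical are pushed toward a small score gap, which is one way a prior constraint can keep reward scores from scaling without bound during subsequent RL.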