Prior Constraints-based Reward Model Training for Aligning Large Language Models

Zhou Hang, Wang Chenglong, Hu Yimin, Xiao Tong, Zhang Chunliang, Zhu Jingbo


Abstract
Reinforcement learning with human feedback for aligning large language models (LLMs) trains a reward model typically using ranking loss with comparison pairs. However, the training procedure suffers from an inherent problem: the uncontrolled scaling of reward scores during reinforcement learning due to the lack of constraints while training the reward model. This paper proposes a Prior Constraints-based Reward Model (PCRM) training method to mitigate this problem. PCRM incorporates prior constraints—specifically, length ratio and cosine similarity between outputs of each comparison pair—during reward model training to regulate optimization magnitude and control score margins. We comprehensively evaluate PCRM by examining its rank correlation with human preferences and its effectiveness in aligning LLMs via RL. Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling. As another bonus, our method is easily integrated into arbitrary rank-based alignment methods, such as direct preference optimization, and can yield consistent improvement. The code is available at https://github.com/wangclnlp/DeepSpeed-Chat-Extension/tree/PCRM.
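
The abstract does not give the exact training objective; as a rough, hedged reading, the prior constraints can be thought of as a per-pair margin added to the standard pairwise ranking loss, so that highly similar comparison pairs are required to be separated by a smaller score gap. The sketch below is only illustrative: the margin function, the weighting alpha, and all tensor names are assumptions, not the paper's formulation.

import torch
import torch.nn.functional as F

def pcrm_ranking_loss(r_chosen, r_rejected, emb_chosen, emb_rejected,
                      len_chosen, len_rejected, alpha=1.0):
    # Length ratio in (0, 1]: equals 1.0 when both responses have the same length.
    len_ratio = torch.minimum(len_chosen, len_rejected) / torch.maximum(len_chosen, len_rejected)
    # Cosine similarity between (e.g. mean-pooled) embeddings of the two responses.
    cos_sim = F.cosine_similarity(emb_chosen, emb_rejected, dim=-1)
    # Illustrative margin (an assumption, not the paper's definition): more similar
    # pairs (high length ratio, high cosine similarity) get a smaller target margin.
    margin = alpha * (1.0 - 0.5 * (len_ratio + cos_sim.clamp(min=0.0)))
    # Pairwise ranking loss with the constraint-derived margin on the score gap.
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Hypothetical usage with a batch of two comparison pairs.
loss = pcrm_ranking_loss(
    r_chosen=torch.tensor([1.2, 0.3]), r_rejected=torch.tensor([0.4, 0.1]),
    emb_chosen=torch.randn(2, 768), emb_rejected=torch.randn(2, 768),
    len_chosen=torch.tensor([40.0, 55.0]), len_rejected=torch.tensor([38.0, 20.0]))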
Anthology ID:
2024.ccl-1.107
Volume:
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
Month:
July
Year:
2024
Address:
Taiyuan, China
Editors:
Maosong Sun, Jiye Liang, Xianpei Han, Zhiyuan Liu, Yulan He
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
1395–1407
Language:
English
URL:
https://aclanthology.org/2024.ccl-1.107/
Cite (ACL):
Zhou Hang, Wang Chenglong, Hu Yimin, Xiao Tong, Zhang Chunliang, and Zhu Jingbo. 2024. Prior Constraints-based Reward Model Training for Aligning Large Language Models. In Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference), pages 1395–1407, Taiyuan, China. Chinese Information Processing Society of China.
Cite (Informal):
Prior Constraints-based Reward Model Training for Aligning Large Language Models (Hang et al., CCL 2024)
PDF:
https://aclanthology.org/2024.ccl-1.107.pdf