Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning

Lu Chen, Rui Zheng, Binghai Wang, Senjie Jin, Caishuang Huang, Junjie Ye, Zhihao Zhang, Yuhao Zhou, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang


Abstract
Reinforcement Learning from Human Feedback (RLHF) is a crucial approach to aligning language models with human values and intentions. A fundamental challenge in this method lies in ensuring that the reward model accurately understands and evaluates human preferences. Current methods rely on ranking losses to teach the reward model to assess preferences, but these losses are susceptible to noisy and ambiguous data and often fail to capture human intentions in depth. To address this issue, we introduce contrastive learning into the reward modeling process: alongside the supervised ranking loss, we add an unsupervised contrastive loss that enables the reward model to fully capture the distinctions in contrastive data. Experimental results demonstrate that the proposed contrastive learning-based reward modeling method effectively enhances the generalization of the reward model, stabilizes the reinforcement learning training process, and improves the final alignment with human preferences.
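
To make the idea concrete, the following is a minimal PyTorch sketch of a combined objective in this spirit: a standard pairwise ranking loss on scalar rewards plus an InfoNCE-style contrastive term on the hidden representations of the chosen and rejected responses. The function names, the InfoNCE formulation, and the weighting coefficient alpha are illustrative assumptions, not the paper's exact loss.

import torch
import torch.nn.functional as F

def ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Standard pairwise (Bradley-Terry style) ranking loss used in RLHF reward
    # modeling: -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # InfoNCE contrastive loss: for each row i, z1[i] and z2[i] form a positive
    # pair; all other rows in the batch serve as in-batch negatives.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature            # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def reward_model_loss(r_chosen, r_rejected, h_chosen, h_rejected, alpha: float = 0.1):
    # Combined objective: supervised ranking loss on scalar rewards plus an
    # unsupervised contrastive term on the response representations;
    # alpha (a hypothetical hyperparameter) weights the contrastive term.
    return ranking_loss(r_chosen, r_rejected) + alpha * info_nce_loss(h_chosen, h_rejected)

if __name__ == "__main__":
    # Toy usage with random tensors standing in for reward-model outputs.
    batch, dim = 8, 768
    r_c, r_r = torch.randn(batch), torch.randn(batch)
    h_c, h_r = torch.randn(batch, dim), torch.randn(batch, dim)
    print(reward_model_loss(r_c, r_r, h_c, h_r).item())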
Anthology ID:
2024.emnlp-main.852
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Association for Computational Linguistics
Pages:
15270–15283
URL:
https://aclanthology.org/2024.emnlp-main.852
Cite (ACL):
Lu Chen, Rui Zheng, Binghai Wang, Senjie Jin, Caishuang Huang, Junjie Ye, Zhihao Zhang, Yuhao Zhou, Zhiheng Xi, Tao Gui, Qi Zhang, and Xuanjing Huang. 2024. Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15270–15283, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning (Chen et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.852.pdf