Binghai Wang
2024
Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning
Lu Chen | Rui Zheng | Binghai Wang | Senjie Jin | Caishuang Huang | Junjie Ye | Zhihao Zhang | Yuhao Zhou | Zhiheng Xi | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Reinforcement Learning from Human Feedback (RLHF) is a crucial approach to aligning language models with human values and intentions. A fundamental challenge in this method lies in ensuring that the reward model accurately understands and evaluates human preferences. Current methods rely on ranking losses to teach the reward model to assess preferences, but they are susceptible to noise and ambiguous data, often failing to deeply understand human intentions. To address this issue, we introduce contrastive learning into the reward modeling process. In addition to supervised ranking loss, we introduce an unsupervised contrastive loss to enable the reward model to fully capture the distinctions in contrastive data. Experimental results demonstrate that the proposed contrastive learning-based reward modeling method effectively enhances the generalization of the reward model, stabilizes the reinforcement learning training process, and improves the final alignment with human preferences.
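The supervised ranking loss mentioned in the abstract is commonly written as the pairwise form -log(sigmoid(r_chosen - r_rejected)). The following is a minimal sketch of that standard term only, not the authors' implementation. The function name and the scalar-reward inputs are assumptions for illustration, and the unsupervised contrastive term is left out because the abstract does not specify its form.

```python
import math


def pairwise_ranking_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Standard pairwise ranking loss: -log(sigmoid(chosen - rejected)).

    Hypothetical helper for illustration; inputs are scalar reward-model
    scores for the preferred and the dispreferred response.
    """
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the preferred response scores higher,
# and large when the ordering is reversed.
loss_correct_order = pairwise_ranking_loss(2.0, 0.0)
loss_wrong_order = pairwise_ranking_loss(0.0, 2.0)
```

In this sketch a pair that the reward model ranks incorrectly contributes a larger loss, which is why noisy or ambiguous labels can dominate training when this term is used alone.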
Reward Modeling Requires Automatic Adjustment Based on Data Quality
Binghai Wang | Rui Zheng | Lu Chen | Zhiheng Xi | Wei Shen | Yuhao Zhou | Dong Yan | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2024
In Reinforcement Learning from Human Feedback (RLHF), the reward model plays a crucial role in aligning language model outputs with human values. Each example in the human preference data used to train the reward model consists of a prompt and a pair of responses, with humans annotating which response better aligns with human value preferences. Due to the complexity and subjectivity of the annotation task, multiple organizations including OpenAI and Anthropic report significant noise in human preference datasets, which makes reward model training unstable and causes it to deviate from human values. We discover that the difference in scores assigned to response pairs by the reward model effectively indicates the quality of data, and that data of varying quality behave in significantly different ways during reward model training. We introduce a method that automatically adjusts reward modeling based on data quality, reducing the impact of noise and making full use of the dataset. Experiments on multiple human preference datasets demonstrate that our method stabilizes reward model training and significantly enhances the alignment performance of RLHF.
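The quality signal described in the abstract, the difference between the scores the reward model assigns to the two responses of a pair, can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the function names, the `(chosen_score, rejected_score)` tuple layout, and the `threshold` parameter are hypothetical, and the abstract does not say how the differences are turned into training adjustments.

```python
from typing import List, Sequence, Tuple


def score_differences(pairs: Sequence[Tuple[float, float]]) -> List[float]:
    """Return chosen_score - rejected_score for each annotated pair."""
    return [chosen - rejected for chosen, rejected in pairs]


def split_by_difference(
    pairs: Sequence[Tuple[float, float]], threshold: float = 0.0
) -> Tuple[List[int], List[int]]:
    """Partition pair indices by whether the score difference exceeds threshold.

    The threshold is a hypothetical knob for this sketch; the abstract only
    states that the difference indicates data quality.
    """
    above: List[int] = []
    at_or_below: List[int] = []
    for index, diff in enumerate(score_differences(pairs)):
        (above if diff > threshold else at_or_below).append(index)
    return above, at_or_below


# Made-up reward scores for three annotated pairs (chosen, rejected).
example_pairs = [(1.5, -0.5), (0.25, 0.75), (-1.0, 1.0)]
example_diffs = score_differences(example_pairs)
above_idx, at_or_below_idx = split_by_difference(example_pairs)
```

A pair whose difference is negative is one where the reward model disagrees with the human label, so in this sketch such pairs land in the second group and could then be inspected or treated differently during training.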