The Accuracy Paradox in RLHF: When Better Reward Models Don’t Yield Better Language Models

Yanjun Chen, Dawei Zhu, Yirong Sun, Xinghao Chen, Wei Zhang, Xiaoyu Shen


Abstract
Reinforcement Learning from Human Feedback (RLHF) substantially improves language models for natural language processing tasks by aligning them with human expectations. A critical factor in this alignment is the strength of the reward model used during training. This study examines whether stronger reward models invariably lead to better language models. Through experiments on relevance, factuality, and completeness tasks using the QA-FEEDBACK dataset and Longformer-based reward models, we uncover a surprising paradox: language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This challenges the widely held belief that stronger reward models always yield better language models, and it opens new avenues for research into the key factors driving model performance and into how to choose the most suitable reward model.
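To make the setup concrete, below is a minimal sketch (not the authors' released code) of how a Longformer-based reward model for one of the fine-grained aspects, such as relevance, could be queried to score a question–answer pair with Hugging Face Transformers. The checkpoint path and function name are illustrative placeholders; the paper trains separate reward models for relevance, factuality, and completeness.

```python
# Sketch only: scoring a QA pair with a Longformer-based reward model.
# "path/to/relevance-reward-model" is a hypothetical fine-tuned checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def reward_score(question: str, answer: str,
                 model_dir: str = "path/to/relevance-reward-model") -> float:
    """Return a scalar reward for (question, answer) under one reward model."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir, num_labels=1)
    # Longformer supports long contexts (up to 4096 tokens), which suits
    # long-form QA inputs like those in QA-FEEDBACK.
    inputs = tokenizer(question, answer, return_tensors="pt",
                       truncation=True, max_length=4096)
    with torch.no_grad():
        score = model(**inputs).logits.squeeze().item()
    return score
```

In an RLHF loop, scores like this would serve as the reward signal for policy optimization; the paper's central finding is that a reward model with higher held-out accuracy does not necessarily produce a better policy after such training.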
Anthology ID: 2024.emnlp-main.174
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 2980–2989
URL: https://aclanthology.org/2024.emnlp-main.174
Cite (ACL):
Yanjun Chen, Dawei Zhu, Yirong Sun, Xinghao Chen, Wei Zhang, and Xiaoyu Shen. 2024. The Accuracy Paradox in RLHF: When Better Reward Models Don’t Yield Better Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2980–2989, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
The Accuracy Paradox in RLHF: When Better Reward Models Don’t Yield Better Language Models (Chen et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.174.pdf