Semi-Supervised Reward Modeling via Iterative Self-Training

Yifei He, Haoxiang Wang, Ziyan Jiang, Alexandros Papangelis, Han Zhao


Abstract
Reward models (RMs) capture the values and preferences of humans and play a central role in Reinforcement Learning with Human Feedback (RLHF) to align pretrained large language models (LLMs). Traditionally, training these models relies on extensive human-annotated preference data, which poses significant challenges in terms of scalability and cost. To overcome these limitations, we propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data. Given an unlabeled dataset, SSRM iterates over three key steps: pseudo-labeling unlabeled examples, selecting high-confidence examples through a confidence threshold, and supervised fine-tuning on the refined dataset. Across extensive experiments on various model configurations, we demonstrate that SSRM significantly improves reward models without incurring additional labeling costs. Notably, SSRM can achieve performance comparable to models trained entirely on an equivalent volume of labeled data. Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
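
The abstract describes an iterative loop of pseudo-labeling, confidence-based filtering, and supervised fine-tuning. The following minimal Python sketch illustrates that loop under stated assumptions: the scoring function, the fine-tuning step, the Bradley-Terry-style confidence, and the toy data are illustrative placeholders, not the authors' implementation.

import math
import random

def score(model, prompt, response):
    # Hypothetical scalar reward; a toy stand-in so the sketch runs end to end.
    return model["bias"] + 0.1 * len(response)

def finetune(model, labeled_triples):
    # Placeholder for supervised fine-tuning on (prompt, chosen, rejected) triples.
    model["bias"] += 0.01 * len(labeled_triples)
    return model

def ssrm(model, unlabeled_pairs, rounds=3, threshold=0.8):
    pseudo_labeled = []
    for _ in range(rounds):
        for prompt, resp_a, resp_b in unlabeled_pairs:
            # Step 1: pseudo-label -- the current model picks the preferred response.
            margin = score(model, prompt, resp_a) - score(model, prompt, resp_b)
            # Step 2: keep only high-confidence examples (Bradley-Terry-style confidence).
            confidence = 1.0 / (1.0 + math.exp(-abs(margin)))
            if confidence >= threshold:
                chosen, rejected = (resp_a, resp_b) if margin > 0 else (resp_b, resp_a)
                pseudo_labeled.append((prompt, chosen, rejected))
        # Step 3: supervised fine-tuning on the refined, pseudo-labeled dataset.
        model = finetune(model, pseudo_labeled)
    return model, pseudo_labeled

if __name__ == "__main__":
    random.seed(0)
    data = [("q", "a" * random.randint(1, 40), "b" * random.randint(1, 40)) for _ in range(20)]
    model, kept = ssrm({"bias": 0.0}, data)
    print(f"kept {len(kept)} pseudo-labeled pairs")

In this sketch, no additional human labels are used: every training triple the fine-tuning step sees is produced by the reward model's own high-confidence predictions, which is the core idea the abstract attributes to SSRM.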
Anthology ID:
2024.findings-emnlp.434
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7365–7377
URL:
https://aclanthology.org/2024.findings-emnlp.434
Cite (ACL):
Yifei He, Haoxiang Wang, Ziyan Jiang, Alexandros Papangelis, and Han Zhao. 2024. Semi-Supervised Reward Modeling via Iterative Self-Training. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7365–7377, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Semi-Supervised Reward Modeling via Iterative Self-Training (He et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.434.pdf