Reward Difference Optimization For Sample Reweighting In Offline RLHF

Shiqi Wang, Zhengze Zhang, Rui Zhao, Fei Tan, Nguyen Cam-Tu


Abstract
With the wide deployment of Large Language Models (LLMs), aligning LLMs with human values has become increasingly important. Although Reinforcement Learning from Human Feedback (RLHF) is effective, it is complicated and highly resource-intensive. As such, offline RLHF has been introduced as an alternative solution, which directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF only captures the ordering relationship between responses, overlooking the crucial aspect of “how much” one response is preferred over another. To address this issue, we propose a simple yet effective solution based on reward difference prediction. Specifically, we introduce reward difference coefficients to reweight sample pairs in offline RLHF. We then propose a difference model that considers rich interactions between a pair of responses to predict these difference coefficients. Experiments with 7B LLMs on the HH and TL;DR datasets verify the effectiveness of our method under both automatic metrics and human evaluation, highlighting its potential for aligning LLMs with human values.
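The abstract describes reweighting each preference pair's ranking loss with a coefficient derived from a predicted reward difference. The following is a minimal PyTorch sketch of that general idea on top of a DPO-style pairwise loss; the specific loss form, the sigmoid mapping from reward gap to coefficient, and all function and tensor names are illustrative assumptions, not the paper's exact formulation (which trains a dedicated difference model to predict the gap).

# Illustrative sketch only: reward-difference-weighted pairwise ranking loss.
# The DPO-style loss and the sigmoid-normalized coefficient are assumptions,
# not the exact objective from the paper.
import torch
import torch.nn.functional as F

def reward_difference_weighted_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
    reward_diff: torch.Tensor,            # predicted reward gap r(y_w) - r(y_l), shape (batch,)
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO-style ranking loss with per-pair reward-difference reweighting (sketch)."""
    # Implicit reward margin of the policy relative to the reference model.
    logits = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    # Standard pairwise ranking loss: -log sigmoid(margin), one value per pair.
    per_pair_loss = -F.logsigmoid(logits)
    # Map the predicted reward difference to a bounded, non-negative coefficient,
    # so pairs with a larger predicted gap contribute more to the objective.
    coeff = torch.sigmoid(reward_diff).detach()
    return (coeff * per_pair_loss).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    batch = 4
    # Random stand-ins for per-sequence log-probabilities and predicted reward gaps.
    loss = reward_difference_weighted_loss(
        policy_chosen_logps=torch.randn(batch),
        policy_rejected_logps=torch.randn(batch),
        ref_chosen_logps=torch.randn(batch),
        ref_rejected_logps=torch.randn(batch),
        reward_diff=torch.randn(batch),
    )
    print(f"weighted ranking loss: {loss.item():.4f}")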
Anthology ID: 2024.findings-emnlp.115
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 2109–2123
URL: https://aclanthology.org/2024.findings-emnlp.115
Cite (ACL): Shiqi Wang, Zhengze Zhang, Rui Zhao, Fei Tan, and Nguyen Cam-Tu. 2024. Reward Difference Optimization For Sample Reweighting In Offline RLHF. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2109–2123, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Reward Difference Optimization For Sample Reweighting In Offline RLHF (Wang et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.115.pdf