On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization

Yong Lin, Skyler Seto, Maartje Ter Hoeve, Katherine Metcalf, Barry-John Theobald, Xuan Wang, Yizhe Zhang, Chen Huang, Tong Zhang


Abstract
Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an EXplicit Reward Model (EXRM) as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown that the implicit reward model of DPO (denoted as DPORM) can approximate an EXRM on the limit infinite samples. However, it is unclear how effective is DPORM in practice. DPORM’s effectiveness directly implies the optimality of learned policy of DPO and also has practical implication for more advanced alignment methods, such as iterative DPO. We compare the accuracy at distinguishing preferred and rejected answers using both DPORM and EXRM. Our findings indicate that even though DPORM can fit the training dataset, it generalizes less effective than EXRM, especially when the validation datasets contain distributional shifts. Across five out-of-distribution settings, DPORM has a mean drop in accuracy of 3% and a maximum drop of 7%. These findings highlight that DPORM has limited generalization ability and substantiates the integration of an explicit reward model in iterative DPO approaches.
Anthology ID:
2024.findings-emnlp.940
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
16015–16026
Language:
URL:
https://aclanthology.org/2024.findings-emnlp.940
DOI:
Bibkey:
Cite (ACL):
Yong Lin, Skyler Seto, Maartje Ter Hoeve, Katherine Metcalf, Barry-John Theobald, Xuan Wang, Yizhe Zhang, Chen Huang, and Tong Zhang. 2024. On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16015–16026, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization (Lin et al., Findings 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.findings-emnlp.940.pdf