Don’t Forget Your Reward Values: Language Model Alignment via Value-based Calibration

Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Wang Chen, Anh Tuan Luu


Abstract
While Reinforcement Learning from Human Feedback (RLHF) significantly enhances the generation quality of Large Language Models (LLMs), recent studies have raised concerns about the complexity and instability of the Proximal Policy Optimization (PPO) algorithm and proposed a series of order-based alignment methods as viable alternatives. This paper examines existing order-based methods, unifies them within a single framework, and analyzes their inefficiency in utilizing reward values. Building on these findings, we propose a new Value-based Calibration (VCB) method to better align LLMs with human preferences. Experimental results demonstrate that VCB surpasses existing alignment methods on AI-assistant and summarization datasets, showing strong generalizability, robustness, and diversity across different settings.
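To make the reward-value point concrete, the sketch below contrasts an order-based pairwise objective (which uses only the preference order between two responses, as in DPO-style methods) with a value-calibrated objective that also uses the magnitude of the reward scores. This is a minimal illustration under assumed loss shapes, not the paper's exact VCB objective; the function names, the KL-based formulation, and the temperature beta are hypothetical.

```python
import torch
import torch.nn.functional as F

def order_based_loss(logp_chosen, logp_rejected, beta=0.1):
    # DPO-style pairwise loss: depends only on which response was preferred,
    # not on how far apart their reward scores are.
    return -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()

def value_calibrated_loss(logps, rewards, beta=0.1):
    # Calibration-style loss (illustrative): push the policy's distribution
    # over K sampled responses toward a softmax over their reward values,
    # so reward magnitudes matter rather than just the ordering.
    target = torch.softmax(rewards / beta, dim=-1)    # reward-derived target distribution
    log_policy = torch.log_softmax(logps, dim=-1)     # policy distribution over the K responses
    return F.kl_div(log_policy, target, reduction="batchmean")

# Toy usage: a batch of 2 prompts with K = 4 candidate responses each.
logps = torch.randn(2, 4)     # per-response sequence log-probabilities from the policy
rewards = torch.randn(2, 4)   # scalar reward-model scores for the same responses
print(order_based_loss(logps[:, 0], logps[:, 1]).item())
print(value_calibrated_loss(logps, rewards).item())
```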
Anthology ID:
2024.emnlp-main.976
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
17622–17642
URL:
https://aclanthology.org/2024.emnlp-main.976/
DOI:
10.18653/v1/2024.emnlp-main.976
Cite (ACL):
Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Wang Chen, and Anh Tuan Luu. 2024. Don’t Forget Your Reward Values: Language Model Alignment via Value-based Calibration. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17622–17642, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Don’t Forget Your Reward Values: Language Model Alignment via Value-based Calibration (Mao et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.976.pdf
Software:
 2024.emnlp-main.976.software.zip