GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets

Oh Joon Kwon, Daiki Matsunaga, Kee-Eung Kim


Abstract
A critical component of the current generation of language models is preference alignment, which aims to precisely control the model's behavior to meet human needs and values. The most notable such methods are Reinforcement Learning from Human Feedback (RLHF) and its offline variant, Direct Preference Optimization (DPO), both of which seek to maximize a reward model based on human preferences. In particular, DPO derives reward signals directly from offline preference data, but in doing so it overfits the reward signal and generates suboptimal responses that may reflect human biases present in the dataset. In this work, we propose a practical application of a diversity-seeking RL algorithm, called GFlowNet-DPO (GDPO), to the offline preference alignment setting to mitigate these challenges. Empirical results show that GDPO generates far more diverse responses than the baseline methods while remaining relatively well aligned with human values in dialog generation and summarization tasks.
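For reference, a minimal sketch of the two objectives the abstract contrasts; these are the standard DPO loss and the reward-proportional sampling target typical of GFlowNet-based fine-tuning, not necessarily the exact GDPO objective from the paper. DPO maximizes an implicit preference reward by minimizing

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
\]

where \(y_w\) and \(y_l\) are the preferred and dispreferred responses, \(\pi_{\mathrm{ref}}\) is the reference policy, and \(\beta\) is a temperature. A GFlowNet-style objective instead trains the policy to sample responses with probability proportional to the reward,

\[
\pi_\theta(y \mid x) \propto R(x, y),
\]

so that probability mass is spread over all high-reward responses rather than concentrated on the single reward-maximizing one, which is the source of the diversity the abstract emphasizes.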
Anthology ID: 2024.emnlp-main.951
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 17120–17139
URL: https://aclanthology.org/2024.emnlp-main.951
Cite (ACL): Oh Joon Kwon, Daiki Matsunaga, and Kee-Eung Kim. 2024. GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17120–17139, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets (Kwon et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.951.pdf