Curiosity-Driven Reinforcement Learning from Human Feedback

Haoran Sun; Yekun Chai; Shuohuan Wang; Yu Sun; Hua Wu (吴华); Haifeng Wang

doi:10.18653/v1/2025.acl-long.1146

Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We will make our code publicly available.
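As a rough illustration of the idea in the abstract, the sketch below adds a per-step novelty bonus to a sparse extrinsic reward that arrives only at the end of a response. It is not the authors' implementation: the abstract does not say how novelty is measured, so a simple count-based bonus is used here as a stand-in, and the names `combine_rewards`, `visit_counts`, and `eta` are hypothetical.

```python
from collections import Counter
from math import sqrt


def combine_rewards(states, extrinsic_final, visit_counts, eta=0.1):
    """Return one reward per step.

    Assumptions (not from the paper):
    - `states` is a sequence of hashable state identifiers.
    - Novelty is approximated by a count-based bonus eta / sqrt(count),
      which shrinks as a state is visited more often.
    - The extrinsic reward is sparse: it is given only at the last step.
    """
    rewards = []
    last = len(states) - 1
    for t, state in enumerate(states):
        visit_counts[state] += 1
        intrinsic = eta / sqrt(visit_counts[state])  # larger for rarer states
        extrinsic = extrinsic_final if t == last else 0.0
        rewards.append(extrinsic + intrinsic)
    return rewards


# Visiting the same states twice: the novelty bonus decays on the second pass.
counts = Counter()
first = combine_rewards(["a", "b", "c"], extrinsic_final=1.0, visit_counts=counts)
second = combine_rewards(["a", "b", "c"], extrinsic_final=1.0, visit_counts=counts)
```

In this toy setup the bonus on the second pass is smaller than on the first, so repeated states contribute less reward while the final extrinsic signal is unchanged.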

Anthology ID: 2025.acl-long.1146
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 23517–23534
URL: https://aclanthology.org/2025.acl-long.1146/
DOI: 10.18653/v1/2025.acl-long.1146
Cite (ACL): Haoran Sun, Yekun Chai, Shuohuan Wang, Yu Sun, Hua Wu, and Haifeng Wang. 2025. Curiosity-Driven Reinforcement Learning from Human Feedback. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23517–23534, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Curiosity-Driven Reinforcement Learning from Human Feedback (Sun et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.1146.pdf
