Joint Optimization of Training Data and Policy in RLHF

Zhuohao Yu; Jiali Zeng; Weizheng Gu; Mengyuan Sun; Yidong Wang; Fandong Meng; Jie Zhou; Shikun Zhang; Wei Ye

Abstract

Traditional reinforcement learning from human feedback (RLHF) optimizes policies on fixed training inputs, limiting the diversity of learning signals. We propose JODP (Joint Optimization of Data and Policy), a framework where the evolving policy model generates improved variants of training problems to enhance its own learning. While training problems remain fixed, JODP optimizes how they are presented: the policy generates specification hints that guide rollout generation, then learns to reproduce the discovered high-reward behaviors without the hints. This "if you can solve it with a hint, learn to solve it without one" principle creates a co-evolutionary dynamic where better policies discover better specifications, which enable further policy improvement. JODP operates as a plug-and-play enhancement to existing algorithms: specifications are selected via UCB bandits for exploration-exploitation balance, used only during training rollouts, and discarded at deployment. Through evaluation on safety alignment tasks, we demonstrate consistent improvements with GRPO, RLOO, and REINFORCE++, allowing 4B models to approach 8B model performance using less than 1% additional computational overhead.
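The abstract says that specification hints are selected via UCB bandits but gives no formulation. As a rough illustration only, the sketch below uses the standard UCB1 rule (mean reward plus an exploration bonus of c * sqrt(ln t / n)) to choose among a fixed set of candidate hints. The class name `UCB1Selector`, the helper `run_demo`, the constant `c`, and the fixed per-hint rewards are all hypothetical and are not taken from the paper; JODP's actual reward signal and bandit variant may differ.

```python
import math


class UCB1Selector:
    """Standard UCB1 over a fixed set of arms (here: candidate hints).

    Hypothetical helper for illustration; not the paper's implementation.
    """

    def __init__(self, n_arms, c=math.sqrt(2)):
        self.counts = [0] * n_arms    # number of times each arm was chosen
        self.means = [0.0] * n_arms   # running mean reward per arm
        self.total = 0                # total number of updates so far
        self.c = c                    # exploration strength (assumed value)

    def select(self):
        # Try every arm once before applying the UCB rule.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        scores = [
            mean + self.c * math.sqrt(math.log(self.total) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental mean update for the chosen arm.
        self.counts[arm] += 1
        self.total += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def run_demo(rewards, rounds):
    """Run the selector against fixed (deterministic) per-arm rewards."""
    selector = UCB1Selector(len(rewards))
    for _ in range(rounds):
        arm = selector.select()
        selector.update(arm, rewards[arm])
    return selector.counts
```

In a training loop of this shape, each arm would stand for one candidate specification hint, and the reward passed to `update` would be whatever rollout-level score the training algorithm already computes. The exploration bonus shrinks as a hint is tried more often, so hints that rarely lead to high-reward rollouts are sampled less over time without being dropped entirely.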

Anthology ID: 2026.findings-acl.2109
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 42498–42514
URL: https://aclanthology.org/2026.findings-acl.2109/
Cite (ACL): Zhuohao Yu, Jiali Zeng, Weizheng Gu, Mengyuan Sun, Yidong Wang, Fandong Meng, Jie Zhou, Shikun Zhang, and Wei Ye. 2026. Joint Optimization of Training Data and Policy in RLHF. In Findings of the Association for Computational Linguistics: ACL 2026, pages 42498–42514, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Joint Optimization of Training Data and Policy in RLHF (Yu et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.2109.pdf
Checklist: 2026.findings-acl.2109.checklist.pdf
