VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training

Dingwei Zhu; Shihan Dou; Zhiheng Xi; Senjie Jin; Guoqiang Zhang; Jiazheng Zhang; Junjie Ye (叶俊杰); Mingxu Chai; Enyu Zhou; Ming Zhang; Yuhui Wang; Caishuang Huang; Chenhao Huang; Yunke Zhang; Yuran Wang; Tao Gui; Qi Zhang; Xipeng Qiu (邱锡鹏); Xuan-Jing Huang (黄萱菁)

Abstract

Reinforcement learning (RL) in real-world environments often suffers from ambiguous or incomplete reward supervision, which undermines policy stability and generalization. Such noise may cause models to ignore key information or even cause advantage estimation to collapse. We find that a strong value model is essential for absorbing unstable signals and producing reliable advantages, offering denser and more robust supervision than the reward model. To optimize more reliably under noisy supervision, we propose VRPO, a framework that enhances value modeling for robust RL in LLM post-training. VRPO integrates (1) auxiliary losses guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck, enabling the value model to filter noise and capture key words. This design allows the value model to correct noisy rewards and generate more reliable advantage estimates, transforming it from a passive predictor into an active noise regulator. Experiments on multi-turn dialogue, math reasoning, and science QA with both rule-based and model-based rewards show that VRPO consistently outperforms baselines such as PPO and GRPO. Our work highlights the central role of the value model in robust RL and provides a principled and practical approach to policy optimization under noisy supervision.

Anthology ID: 2026.acl-long.1103
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 24046–24067
URL: https://aclanthology.org/2026.acl-long.1103/
Cite (ACL): Dingwei Zhu, Shihan Dou, Zhiheng Xi, Senjie Jin, Guoqiang Zhang, Jiazheng Zhang, Junjie Ye, Mingxu Chai, Enyu Zhou, Ming Zhang, Yuhui Wang, Caishuang Huang, Chenhao Huang, Yunke Zhang, Yuran Wang, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. 2026. VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24046–24067, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training (Zhu et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1103.pdf
Checklist: 2026.acl-long.1103.checklist.pdf
