Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

Can Xie; Ruotong Pan; Xiangyu Wu; Zhang Yunfei; Jiayi Fu; Tingting Gao; Guorui Zhou

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has shown significant promise for enhancing the reasoning capabilities of large language models (LLMs). However, prevailing algorithms like GRPO broadcast a uniform advantage signal across all tokens in a sequence. This coarse-grained approach overlooks the pivotal role of uncertain, high-stakes decisions during reasoning, leading to inefficient exploration and the well-documented problem of entropy collapse. To address this, we introduce UnCertainty-aware Advantage Shaping (UCAS), a model-free method that refines credit assignment by leveraging the model’s internal uncertainty signals. UCAS operates in two stages: it first modulates the response-level advantage using the model’s overall self-confidence, and then applies a token-level penalty based on raw logit certainty. This dual mechanism encourages exploration of high-uncertainty paths that yield correct answers while penalizing overconfident yet erroneous reasoning, effectively balancing the exploration-exploitation trade-off. Extensive experiments on five mathematical reasoning benchmarks show that UCAS significantly outperforms strong RLVR baselines across multiple model scales, including 1.5B and 7B. Our analysis confirms that UCAS not only achieves higher rewards but also promotes greater reasoning diversity and successfully mitigates entropy collapse.
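
The two-stage shaping described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration and not the authors' implementation: the confidence measure (geometric-mean token probability), the weighting rule, the hypothetical hyperparameters `alpha` and `beta`, and the per-token certainty scores, which are assumed to be already scaled to [0, 1].

```python
import math


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def response_confidence(token_logprobs):
    """ASSUMPTION: self-confidence is the geometric-mean token probability."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def shape_response(adv, conf, alpha=0.5):
    """Stage 1 (ASSUMED form): modulate the response-level advantage.

    Correct responses (adv > 0) gain more weight when confidence is low;
    incorrect responses (adv <= 0) are penalized more when confidence is high.
    """
    weight = 1.0 + alpha * ((1.0 - conf) if adv > 0 else conf)
    return adv * weight


def shape_tokens(adv, token_certainty, beta=0.1):
    """Stage 2 (ASSUMED form): subtract a per-token certainty penalty.

    token_certainty holds hypothetical scores in [0, 1] derived from raw logits.
    """
    return [adv - beta * c for c in token_certainty]


# Example: four sampled responses to one prompt, rewards from a verifier.
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
confidence = response_confidence([math.log(0.5), math.log(0.5)])
shaped = shape_response(advantages[0], confidence)
per_token = shape_tokens(shaped, [0.2, 0.9])
```

Under these assumptions, a correct but low-confidence response receives a larger positive advantage than plain GRPO would assign, and tokens with high certainty receive slightly less credit than uncertain ones within the same response.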

Anthology ID: 2026.findings-acl.951
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19057–19076
URL: https://aclanthology.org/2026.findings-acl.951/
Cite (ACL): Can Xie, Ruotong Pan, Xiangyu Wu, Zhang Yunfei, Jiayi Fu, Tingting Gao, and Guorui Zhou. 2026. Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 19057–19076, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning (Xie et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.951.pdf
Checklist: 2026.findings-acl.951.checklist.pdf
