Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

Renren Jin; Pengzhi Gao; Yuqi Ren; Zhuowen Han; Tongxuan Zhang; Wuwei Huang; Wei Liu; Jian Luan; Deyi Xiong (德意 熊)

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a prominent paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, the entropy of LLMs usually collapses during RLVR training, leading to premature convergence to suboptimal local minima and hindering further performance improvement. Although various approaches have been proposed to mitigate entropy collapse, a comprehensive study of entropy in RLVR remains lacking. To bridge this gap, we conduct extensive experiments to investigate the entropy dynamics of LLMs trained with RLVR and analyze how model entropy correlates with response diversity, calibration, and performance across various benchmarks. Our results identify three key factors that influence entropy: the clipping thresholds in the optimization objective, the number of off-policy updates, and the diversity of the training data. Furthermore, through both theoretical analysis and empirical validation, we demonstrate that tokens with positive advantages are the primary drivers of entropy collapse. Motivated by this insight, we propose Positive-Advantage Reweighting, a simple yet effective approach that regulates model entropy by adjusting the loss weights assigned to tokens with positive advantages during RLVR training, while maintaining competitive performance.
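The abstract names the reweighting idea but does not give its exact form. As a rough illustration only, the sketch below assumes a PPO-style clipped token-level objective and a single scalar weight applied to tokens with positive advantages. The function names, the `pos_weight` parameter, and the default clipping thresholds are hypothetical and are not taken from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reweighted_clipped_loss(ratios, advantages, pos_weight=0.5,
                            clip_low=0.2, clip_high=0.2):
    """PPO-style clipped surrogate loss, averaged over tokens.

    Assumption: tokens with a positive advantage are scaled by `pos_weight`
    (hypothetical name and default); all other tokens keep weight 1.0.
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        # Clip the importance ratio to [1 - clip_low, 1 + clip_high].
        clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
        # Pessimistic (minimum) surrogate, as in standard PPO clipping.
        surrogate = min(ratio * adv, clipped * adv)
        weight = pos_weight if adv > 0 else 1.0
        total += -weight * surrogate
    return total / len(ratios)
```

Under these assumptions, setting `pos_weight` below 1.0 shrinks the loss contribution of positively rewarded tokens, while `pos_weight=1.0` recovers the unweighted clipped loss. `token_entropy` can be averaged over sampled positions to track how the policy's entropy evolves during training.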

Anthology ID: 2026.findings-acl.1266
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 25300–25322
URL: https://aclanthology.org/2026.findings-acl.1266/
Cite (ACL): Renren Jin, Pengzhi Gao, Yuqi Ren, Zhuowen Han, Tongxuan Zhang, Wuwei Huang, Wei Liu, Jian Luan, and Deyi Xiong. 2026. Revisiting Entropy in Reinforcement Learning for Large Reasoning Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 25300–25322, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Revisiting Entropy in Reinforcement Learning for Large Reasoning Models (Jin et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1266.pdf
Checklist: 2026.findings-acl.1266.checklist.pdf
