Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward

Xuexiang Wen; Hang Yu; Linchao Zhu; Gaoang Wang

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising post-training paradigm for Large Language Models (LLMs), its dependence on gold labels or domain-specific verifiers limits its scalability to new tasks and domains. In this work, we propose **Verifier-free Intrinsic Gradient-Norm Reward (VIGOR)**, a simple reward that uses only the policy model itself. Given a prompt, VIGOR samples a group of completions and assigns higher within-group rewards to outputs that induce smaller ℓ₂ norms of the teacher-forced negative log-likelihood gradients under the current parameters. Intuitively, a lower gradient norm suggests that the completion aligns better with the current policy, so it can serve as an intrinsic preference signal for policy optimization. To make this intrinsic signal practical for RL, we correct the systematic length bias of averaged token-level gradients with a √T scaling and apply group-wise rank shaping to stabilize reward scales across prompts. Across mathematical reasoning benchmarks, VIGOR outperforms the state-of-the-art Reinforcement Learning from Internal Feedback (RLIF) baseline INTUITOR, and it also exhibits cross-domain transfer to code benchmarks when trained only on math data. For instance, on Qwen2.5-7B-Base post-trained on MATH, VIGOR improves the average math accuracy by +3.31% and the average code accuracy by +1.91% over INTUITOR, while exhibiting more stable training dynamics.
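The reward shaping described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption: the function name `vigor_style_rewards` is hypothetical, the length correction is taken to be a multiplication of the gradient norm by √T (where T is the number of completion tokens), and the rank shaping is taken to be a linear map from within-group rank to a reward in [0, 1]. The gradient norms are supplied as precomputed inputs rather than derived from a model.

```python
import math


def vigor_style_rewards(grad_norms, lengths):
    """Hypothetical sketch of a within-group, rank-shaped gradient-norm reward.

    grad_norms[i]: l2 norm of the gradient of the token-averaged NLL
                   for completion i (assumed to be precomputed).
    lengths[i]:    number of completion tokens T_i (positive).
    Returns one reward per completion, in the input order.
    """
    if len(grad_norms) != len(lengths):
        raise ValueError("grad_norms and lengths must have the same size")
    n = len(grad_norms)
    if n == 0:
        return []

    # Assumed length correction: scale each norm by sqrt(T) so that long
    # completions are not favored merely because averaging over more
    # tokens tends to shrink the gradient norm.
    scores = [g * math.sqrt(t) for g, t in zip(grad_norms, lengths)]

    # Rank completions by corrected score, smallest first.
    # sorted() is stable, so ties are broken by input position.
    order = sorted(range(n), key=lambda i: scores[i])

    # Assumed rank shaping: linear in rank, 1.0 for the smallest
    # corrected norm and 0.0 for the largest.
    rewards = [0.0] * n
    for rank, i in enumerate(order):
        rewards[i] = 1.0 if n == 1 else 1.0 - rank / (n - 1)
    return rewards


# Example group of three completions for one prompt.
print(vigor_style_rewards([0.5, 0.2, 0.3], [4, 100, 16]))  # [1.0, 0.0, 0.5]
```

In this example the second completion has the smallest raw norm (0.2), yet it receives the lowest reward once its length of 100 tokens is accounted for. Because the rewards depend only on ranks within the group, they stay on the same scale for every prompt, which is one plausible reading of how rank shaping stabilizes reward scales.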

Anthology ID: 2026.findings-acl.1606
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 32089–32102
URL: https://aclanthology.org/2026.findings-acl.1606/
Cite (ACL): Xuexiang Wen, Hang Yu, Linchao Zhu, and Gaoang Wang. 2026. Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward. In Findings of the Association for Computational Linguistics: ACL 2026, pages 32089–32102, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward (Wen et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1606.pdf
Checklist: 2026.findings-acl.1606.checklist.pdf
