Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning

Sanghwan Bae; Jiwoo Hong; Min Young Lee; Hanbyul Kim; Jeongyeon Nam; Donghyun Kwak

Abstract

Recent advances in reinforcement learning with verifiable rewards (RLVR) show that large language models enhance their reasoning abilities when trained with verifiable signals. However, due to reward sparsity, effectiveness depends heavily on selecting samples of appropriate difficulty. In this work, we present a formal analysis of online difficulty-aware filtering and establish its theoretical foundations. We show that expected policy improvement is lower-bounded by the variance of task-level success probabilities, implying that selecting tasks of intermediate difficulty maximizes learning efficiency. Building on this, we demonstrate that balanced filtering maximizes this lower bound, leading to superior performance and sample efficiency. Evaluations across multiple math reasoning benchmarks validate that balanced filtering consistently enhances convergence speed and final performance, achieving up to +12% gains in less than half the training steps of standard GRPO. By extending our analysis to various reward distributions, we provide a principled foundation for future RLVR curriculum strategies, confirmed through both theoretical analysis and extensive empirical results.
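As a rough illustration of why intermediate difficulty is favored, consider the simplest case of a binary verifiable reward: a task solved with probability p yields a reward with variance p(1 - p), which is zero at p = 0 or p = 1 and largest at p = 0.5. The sketch below estimates each prompt's pass rate from a handful of rollouts and keeps only prompts inside a band symmetric around 0.5. The function names, the example batch, and the thresholds 0.25 and 0.75 are assumptions made for illustration; they are not taken from the paper, whose exact filtering rule is not given in the abstract.

```python
from typing import Dict, List


def pass_rate(rewards: List[int]) -> float:
    """Empirical success probability from binary rollout rewards (1 = verified correct)."""
    return sum(rewards) / len(rewards)


def bernoulli_variance(p: float) -> float:
    """Variance of a binary reward with success probability p; largest at p = 0.5."""
    return p * (1.0 - p)


def balanced_filter(
    rollouts: Dict[str, List[int]],
    low: float = 0.25,   # hypothetical lower threshold, not from the paper
    high: float = 0.75,  # hypothetical upper threshold, not from the paper
) -> List[str]:
    """Keep prompt ids whose empirical pass rate lies in [low, high]."""
    return [
        prompt_id
        for prompt_id, rewards in rollouts.items()
        if low <= pass_rate(rewards) <= high
    ]


if __name__ == "__main__":
    # Hypothetical batch: eight sampled rollouts per prompt.
    batch = {
        "too_easy": [1, 1, 1, 1, 1, 1, 1, 1],
        "balanced": [1, 0, 1, 0, 0, 1, 1, 0],
        "too_hard": [0, 0, 0, 0, 0, 0, 0, 0],
    }
    for prompt_id, rewards in batch.items():
        p = pass_rate(rewards)
        print(prompt_id, p, bernoulli_variance(p))
    print(balanced_filter(batch))
```

In this toy batch, the always-solved and never-solved prompts have zero reward variance and are dropped, so only the mixed-outcome prompt would contribute to the update. Because filtering is online, the pass rates would be re-estimated from the current policy's rollouts at each step rather than fixed in advance.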

Anthology ID: 2026.eacl-long.30
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 700–719
URL: https://aclanthology.org/2026.eacl-long.30/
Cite (ACL): Sanghwan Bae, Jiwoo Hong, Min Young Lee, Hanbyul Kim, Jeongyeon Nam, and Donghyun Kwak. 2026. Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 700–719, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning (Bae et al., EACL 2026)
PDF: https://aclanthology.org/2026.eacl-long.30.pdf
Checklist: 2026.eacl-long.30.checklist.pdf
