PARIF: Pushing the Pareto Frontier of Instruction Following and Reasoning with Curriculum Reinforcement Learning

Rongchuan Mu; Zexin Wang; Qianyu Wang; MingHua Ma; Zekun Wang; Ming Liu; Bing Qin (秦兵)

Abstract

Large Reasoning Models (LRMs) excel at complex problem-solving but frequently overlook specific instruction constraints. Existing alignment methods struggle to balance general reasoning with instruction-following (IF), hindered by dependence on teacher models, reward hacking, and reasoning-answer inconsistencies. We propose PARIF, a two-stage curriculum learning framework based on Reinforcement Learning from Verifiable Rewards (RLVR) to enhance both IF and general reasoning capabilities. The framework employs a correctness proxy across different stages to mitigate reward hacking. Stage I employs a dynamic weighting strategy simultaneously to optimize the model’s reasoning paradigm regarding constraints. Stage II introduces Decoupled-GRPO, which builds upon the first stage to enhance the logical consistency between the reasoning process and the final answer, enabling the model to better leverage its optimized reasoning paradigm. To support the framework, we curate 26,000 high-quality instructions featuring diverse constraints. Extensive experiments demonstrate PARIF’s effectiveness: our 7B model achieves a 21.25% relative average improvement over the original model across six representative IF tasks, while our 8B model outperforms leading models such as DeepSeek-V3 on these IF tasks, effectively pushing the Pareto frontier of instruction following and reasoning for models of comparable scale. We open-source our code and models to facilitate future research.

Anthology ID: 2026.acl-long.1136
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 24753–24783
URL: https://aclanthology.org/2026.acl-long.1136/
Cite (ACL): Rongchuan Mu, Zexin Wang, Qianyu Wang, MingHua Ma, Zekun Wang, Ming Liu, and Bing Qin. 2026. PARIF: Pushing the Pareto Frontier of Instruction Following and Reasoning with Curriculum Reinforcement Learning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24753–24783, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): PARIF: Pushing the Pareto Frontier of Instruction Following and Reasoning with Curriculum Reinforcement Learning (Mu et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1136.pdf
Checklist: 2026.acl-long.1136.checklist.pdf
