Shengjie Bi
2026
Beyond Reasoning Gains: Mitigating General-Capability Forgetting in Large Reasoning Models
Hoang Phan | Xianjun Yang | Yuanshun Yao | Jingyu Zhang | Shengjie Bi | Xiaocheng Tang | Madian Khabsa | Lijuan Liu | Deren Lei
Findings of the Association for Computational Linguistics: ACL 2026
Hoang Phan | Xianjun Yang | Yuanshun Yao | Jingyu Zhang | Shengjie Bi | Xiaocheng Tang | Madian Khabsa | Lijuan Liu | Deren Lei
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe introduces a significant risk of capability regression, where models forget foundational skills after prolonged training without employing regularization strategies. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While imposing regularization terms like KL divergence can help prevent deviation from the base model, these terms are calculated on the current task, thus they do not guarantee broader knowledge. Meanwhile, commonly used experience replay across heterogeneous domains makes it nontrivial to decide how much training focus each objective should receive. To address this, we propose RECAP—a replay strategy with dynamic objective reweighting for general knowledge preservation. Our reweighting mechanism adapts in an online manner using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning. Extensive experiments on benchmarks based on Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.
AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
Yun He | Wenzhe Li | Hejia Zhang | Songlin Li | Karishma Mandyam | Sopan Khosla | Yuanhao Xiong | Nanshu Wang | Xiaoliang Peng | Beibin Li | Shengjie Bi | Shishir G Patil | Qi Qi | Shengyu Feng | Julian Katz-Samuels | Richard Yuanzhe Pang | Sujan Kumar Gonugondla | Hunter Lang | Yue Yu | Yundi Qian | Maryam Fazel-Zarandi | Licheng Yu | Amine Benhalloum | Hany Hassan Awadalla | Manaal Faruqui
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yun He | Wenzhe Li | Hejia Zhang | Songlin Li | Karishma Mandyam | Sopan Khosla | Yuanhao Xiong | Nanshu Wang | Xiaoliang Peng | Beibin Li | Shengjie Bi | Shishir G Patil | Qi Qi | Shengyu Feng | Julian Katz-Samuels | Richard Yuanzhe Pang | Sujan Kumar Gonugondla | Hunter Lang | Yue Yu | Yundi Qian | Maryam Fazel-Zarandi | Licheng Yu | Amine Benhalloum | Hany Hassan Awadalla | Manaal Faruqui
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF)—especially for complex, multi-turn, and system-prompted instructions—remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF, a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs’ ability to follow complex, multi-turn, and system-level instructions. We also open-source the evaluation script of AdvancedIF. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.
Search
Fix author
Co-authors
- Amine Benhalloum 1
- Manaal Faruqui 1
- Maryam Fazel-Zarandi 1
- Shengyu Feng 1
- Sujan Kumar Gonugondla 1
- Hany Hassan Awadalla 1
- Yun He 1
- Julian Katz-Samuels 1
- Madian Khabsa 1
- Sopan Khosla 1
- Hunter Lang 1
- Deren Lei 1
- Beibin Li 1
- Songlin Li 1
- Wenzhe Li 1
- Lijuan Liu 1
- Karishma Mandyam 1
- Richard Yuanzhe Pang 1
- Shishir G Patil 1
- Xiaoliang Peng 1
- Hoang Phan 1
- Qi Qi 1
- Yundi Qian 1
- Xiaocheng Tang 1
- Nanshu Wang 1
- Yuanhao Xiong 1
- Xianjun Yang 1
- Yuanshun Yao 1
- Licheng Yu 1
- Yue Yu 1
- Hejia Zhang 1
- Jingyu Zhang 1