Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

Bowen Ding; Yuhan Chen; Jiayang Lyu; Jiyao Yuan; Qi Zhu; Shuangshuang Tian; Dantong Zhu; Futing Wang; Heyuan Deng; Fei Mi; Lifeng Shang; Tao Lin

Abstract

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) dominate the post-training landscape for mathematical reasoning, yet differ fundamentally in their reliance on expert trajectories. To understand the optimal way to harness these trajectories for maximizing performance, we propose the Plasticity-Ceiling Framework. This framework empirically grounds the post-training landscape by decomposing the final performance ceiling into the foundational SFT performance and the subsequent RL plasticity (i.e., the maximum improvement via RL). Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability and premature convergence deficits inherent in synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the Stable or Mild Overfitting Regime of SFT maximizes the final ceiling by securing a robust SFT foundation with substantial RL plasticity; (2) Refuting the “Less is More” hypothesis in SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) The Minimum Validation Loss of SFT serves as a reliable indicator for selecting the expert trajectories that maximize the ultimate performance ceiling. Our findings provide actionable guidelines for extracting maximum value from expert trajectories.
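The decomposition and the selection rule described in the abstract can be sketched in a few lines. This is a minimal illustration and not the authors' code. The names `Candidate`, `rl_plasticity`, and `pick_by_min_val_loss` are hypothetical, and all numbers are made-up placeholders. It also assumes that a lower minimum validation loss indicates the better trajectory set, which the abstract does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical expert-trajectory set and its SFT validation curve."""
    name: str
    val_losses: tuple  # SFT validation loss per checkpoint (placeholder values)


def rl_plasticity(sft_score: float, final_ceiling: float) -> float:
    """RL plasticity: the improvement RL adds on top of the SFT score.

    Rearranges the decomposition: ceiling = SFT performance + RL plasticity.
    """
    return final_ceiling - sft_score


def pick_by_min_val_loss(candidates):
    """Pick the candidate whose minimum SFT validation loss is lowest.

    Assumption: lower minimum validation loss predicts a higher ceiling.
    """
    return min(candidates, key=lambda c: min(c.val_losses))


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not results from the paper.
    print(rl_plasticity(sft_score=40.0, final_ceiling=55.0))
    sets = [
        Candidate("set_a", (0.9, 0.7, 0.8)),
        Candidate("set_b", (0.8, 0.6, 0.65)),
    ]
    print(pick_by_min_val_loss(sets).name)
```

In this reading, the minimum is taken over each candidate's whole SFT validation curve, so candidates can be compared before any RL is run.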

Anthology ID: 2026.acl-long.1528
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 33081–33106
URL: https://aclanthology.org/2026.acl-long.1528/
Cite (ACL): Bowen Ding, Yuhan Chen, Jiayang Lyu, Jiyao Yuan, Qi Zhu, Shuangshuang Tian, Dantong Zhu, Futing Wang, Heyuan Deng, Fei Mi, Lifeng Shang, and Tao Lin. 2026. Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33081–33106, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning (Ding et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1528.pdf
Checklist: 2026.acl-long.1528.checklist.pdf
