Di Zhang
2026
SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning
Zheng Li | Qingxiu Dong | Jingyuan Ma | Di Zhang | Kai Jia | Zhifang Sui
Findings of the Association for Computational Linguistics: ACL 2026
Large reasoning models have recently demonstrated exceptional performance on various tasks. However, they often consume excessive tokens even for simple queries, wasting resources and prolonging user latency. To address this challenge, we propose SelfBudgeter, a self-adaptive reasoning strategy for efficient and controllable reasoning. Specifically, we first train the model to self-estimate the required reasoning budget based on the query. We then introduce budget-guided GRPO for reinforcement learning, which effectively maintains accuracy while reducing output length. Experimental results demonstrate that SelfBudgeter dynamically allocates budgets according to problem complexity, achieving an average response length compression of 61% on math reasoning tasks while maintaining accuracy. Furthermore, SelfBudgeter allows users to see how long generation will take and decide whether to continue or stop. Additionally, users can directly control the reasoning length by setting token budgets upfront.
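To make the idea of a budget-guided reward concrete, here is a minimal sketch, not the paper's actual formulation. The abstract does not give the reward shape, so `budget_reward`, its `penalty` parameter, and the linear overrun penalty are illustrative assumptions. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the general GRPO recipe, and the use of the population standard deviation here is also an assumption.

```python
import statistics


def budget_reward(correct, used_tokens, budget, penalty=0.5):
    """Hypothetical reward: 1.0 for a correct answer, reduced when the
    response exceeds its self-estimated token budget.

    The linear, capped overrun penalty is an assumption for illustration.
    """
    overrun = max(0, used_tokens - budget) / budget  # relative overrun, 0 if within budget
    return (1.0 if correct else 0.0) - penalty * min(1.0, overrun)


def group_advantages(rewards):
    """Group-relative advantages: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    if std == 0:
        # All responses scored the same, so the group carries no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


# Four sampled responses to one query, each with a self-estimated budget of 200 tokens.
rewards = [
    budget_reward(True, 100, 200),   # correct, within budget
    budget_reward(True, 300, 200),   # correct, 50% over budget
    budget_reward(False, 150, 200),  # wrong, within budget
    budget_reward(False, 400, 200),  # wrong, far over budget
]
advantages = group_advantages(rewards)
```

Under these assumptions, a correct response that stays within budget receives the largest advantage in its group, which is one way a length-aware reward could favor shorter outputs without sacrificing accuracy.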
Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts
Di Zhang | Xun Wu | Shaohan Huang | Lingjie Jiang | Yaru Hao | Li Dong | Zewen Chi | Zhifang Sui | Furu Wei
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for improving reasoning capabilities. However, training RLVR with Mixture-of-Experts (MoE) policies remains fragile and is often prone to reward collapse. We identify an MoE-specific source of instability, referred to as router shift (RS), where changes in expert routing across policy updates exacerbate off-policy mismatch. This effect leads to increasingly volatile importance-ratio signals and bursty clipping behavior, which consistently precede training collapse. Motivated by this diagnosis, we propose Router-Shift Policy Optimization (RSPO). RSPO computes a per-token router-shift ratio conditioned on the previously activated experts, applies stop-gradient and a lower-bound floor, and softly rescales importance ratios prior to clipping and aggregation. This design explicitly accounts for routing-induced distributional drift during off-policy optimization. We evaluate the effect of RSPO under two settings: a synthetic countdown task and real-world reasoning tasks on MATH and Code. Across both settings, RSPO achieves better performance and exhibits greater stability compared to recent MoE-based RLVR methods.
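The rescale-then-clip order described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's formulation: the abstract does not define the router-shift ratio, so the drift measure (mean absolute log-probability change over the previously activated experts), the `floor` value, and both function names are hypothetical. The clipping step is the standard PPO-style clipped surrogate.

```python
import math


def router_shift_weight(old_probs, new_probs, active, floor=0.8):
    """Hypothetical per-token router-shift weight in [floor, 1].

    old_probs / new_probs: router probabilities over experts before and
    after the policy update. active: indices of the experts that were
    activated under the old policy. The drift measure is an assumption.
    """
    drift = sum(
        abs(math.log(new_probs[e]) - math.log(old_probs[e])) for e in active
    ) / len(active)
    # Lower-bound floor: a token is damped but never fully silenced.
    # In an autodiff framework this weight would be treated as a constant
    # (stop-gradient); plain Python has no gradients to block.
    return max(floor, math.exp(-drift))


def clipped_token_objective(ratio, advantage, weight, eps=0.2):
    """PPO-style clipped surrogate, with the importance ratio rescaled first."""
    scaled = ratio * weight  # soft rescaling happens before clipping
    clipped = min(max(scaled, 1.0 - eps), 1.0 + eps)
    return min(scaled * advantage, clipped * advantage)


# Routing unchanged: the weight is 1 and the ratio passes through untouched.
stable = router_shift_weight([0.5, 0.3, 0.2], [0.5, 0.3, 0.2], active=[0, 1])

# Routing shifted sharply away from the old experts: the weight hits the floor.
shifted = router_shift_weight([0.5, 0.3, 0.2], [0.05, 0.05, 0.9], active=[0, 1])

obj_stable = clipped_token_objective(1.5, -1.0, stable)
obj_shifted = clipped_token_objective(1.5, -1.0, shifted)
```

In this sketch, a token whose routing has drifted contributes a smaller-magnitude term to the objective than the same token with stable routing, which is the damping effect that rescaling before clipping is meant to provide.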