Kun Li
Other people with similar names: Kun Li
Unverified author pages with similar names: Kun Li
2026
Discovery and Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees
Kun Li | Zenan Xu | Junan Li | Zengrui Jin | Jinghao Deng | Zexuan Qiu | Bo Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Kun Li | Zenan Xu | Junan Li | Zengrui Jin | Jinghao Deng | Zexuan Qiu | Bo Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tool-Integrated Reasoning has emerged as a key paradigm to augment Large Language Models (LLMs) with computational capabilities, yet integrating tool-use into long Chain-of-Thought (long CoT) remains underexplored, largely due to the scarcity of training data and the challenge of integrating tool-use without compromising the model’s intrinsic long-chain reasoning. In this paper, we introduce DART (Discovery And Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees), a reinforcement learning framework that enables spontaneous tool-use during long CoT reasoning without additional human annotation. DART operates by constructing dynamic rollout trees during training to discover valid tool-use opportunities, branching out at promising positions to explore tool-integrated trajectories. Subsequently, a tree-based process advantage estimation identifies and credits specific sub-trajectories where tool invocation positively contributes to the solution, effectively reinforcing these beneficial behaviors during training. Extensive experiments on challenging benchmarks like AIME and GPQA-Diamond demonstrate that DART significantly outperforms existing methods, successfully harmonizing tool execution with long CoT reasoning.
Reinforcement Learning on Pre-Training Data
Siheng Li | Kejiao Li | Zenan Xu | Guanhua Huang | Kun Li | Haoyuan Wu | Wujiajia | Zihao Zheng | Chenchen Zhang | Kun Shi | Xue Gong | Qi Yi | Ruibin Xiong | Tingqiang Xu | Yuhao Jiang | Jianfeng Yan | Yuyuan Zeng | Guanghui Xu | Jinbao Xue | Zhijiang xu | Zheng Fang | Shuai LI | Qibin Liu | Xiaoxue Li | Zhuoyu Li | Yangyu Tao | Fei Gao | Cheng Jiang | Bochao Wang | Kai Liu | Jianchen Zhu | Wai Lam | Bo Zhou | Di Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Siheng Li | Kejiao Li | Zenan Xu | Guanhua Huang | Kun Li | Haoyuan Wu | Wujiajia | Zihao Zheng | Chenchen Zhang | Kun Shi | Xue Gong | Qi Yi | Ruibin Xiong | Tingqiang Xu | Yuhao Jiang | Jianfeng Yan | Yuyuan Zeng | Guanghui Xu | Jinbao Xue | Zhijiang xu | Zheng Fang | Shuai LI | Qibin Liu | Xiaoxue Li | Zhuoyu Li | Yangyu Tao | Fei Gao | Cheng Jiang | Bochao Wang | Kai Liu | Jianchen Zhu | Wai Lam | Bo Zhou | Di Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent progress in large language models (LLMs) is largely driven by scaling training compute through either pre-training with next-token prediction (NTP) or post-training with reinforcement learning (RL). The former contributes to learning broad knowledge and skills from general data, while struggling with data inefficiency and catastrophic forgetting in continual learning settings. The latter incentivizes reasoning capabilities with strong generalization, but is constrained by limited data availability due to its reliance on human annotation. To alleviate these issues, we propose Reinforcement Learning on Pre-Training data (RLPT), which combines the advantages of learning from general data and RL. In particular, RLPT derives reward signals directly from general text data through a next-segment reasoning objective, rewarding the policy for correctly predicting next text segments conditioned on the prefix text. Experiments across multiple benchmarks and models demonstrate the effectiveness of . For example, RLPT yields substantial improvements in continual pre-training (+4.6%) and provides a strong foundation for post-training (+3.4%) on Qwen3-8B-Base.
Search
Fix author
Co-authors
- Zenan Xu 2
- Bo Zhou 2
- Jinghao Deng 1
- Zheng Fang 1
- Fei Gao 1
- Xue Gong 1
- Guanhua Huang 1
- Cheng Jiang 1
- Yuhao Jiang 1
- Zengrui Jin 1
- Shuai LI 1
- Wai Lam 1
- Junan Li 1
- Kejiao Li 1
- Siheng Li 1
- Xiaoxue Li 1
- Zhuoyu Li 1
- Kai Liu 1
- Qibin Liu 1
- Zexuan Qiu 1
- Kun Shi 1
- Yangyu Tao 1
- Bochao Wang 1
- Di Wang 1
- Haoyuan Wu 1
- Wujiajia 1
- Ruibin Xiong 1
- Guanghui Xu 1
- Tingqiang Xu 1
- Jinbao Xue 1
- Jianfeng Yan 1
- Qi Yi 1
- Yuyuan Zeng 1
- Chenchen Zhang 1
- Zihao Zheng 1
- Jianchen Zhu 1
- Zhijiang xu 1
Venues
- ACL2