Hao Yi

2025

Enhancing the numerical and logical reasoning capabilities of Large Language Models (LLMs) has become a prominent research focus. Existing approaches exhibit notable limitations: inference-phase techniques, such as Chain of Thought, depend on prompt engineering and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle to ensure step-wise mathematical correctness and often rely on model distillation or human annotations; Reinforcement Learning (RL) methods entail high GPU memory consumption and training instability. To overcome these challenges, we propose Self-training with Process Preference learning using Dynamic value margin (SPPD). SPPD formulates reasoning as a process-based Markov Decision Process (MDP), leveraging the Bellman optimality equation to derive a dynamic value margin for step-level preference optimization. It further incorporates tree-based self-sampling of model responses, eliminating the need for distillation. We theoretically establish that SPPD is equivalent to on-policy policy gradient methods under constrained reward functions. Experimental results on 7B-scale models show consistent superiority across both in-domain and out-of-domain mathematical benchmarks.

Co-authors

Venues

Findings1

Fix author