SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin

Hao Yi, Qingyang Li, Yulan Hu, Fuzheng Zhang, Di Zhang, Yong Liu


Abstract
Enhancing the numerical and logical reasoning capabilities of Large Language Models (LLMs) has become a prominent research focus. Existing approaches exhibit notable limitations: inference-phase techniques, such as Chain of Thought, depend on prompt engineering and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle to ensure step-wise mathematical correctness and often rely on model distillation or human annotations; Reinforcement Learning (RL) methods entail high GPU memory consumption and training instability. To overcome these challenges, we propose Self-training with Process Preference learning using Dynamic value margin (SPPD). SPPD formulates reasoning as a process-based Markov Decision Process (MDP), leveraging the Bellman optimality equation to derive a dynamic value margin for step-level preference optimization. It further incorporates tree-based self-sampling of model responses, eliminating the need for distillation. We theoretically establish that SPPD is equivalent to on-policy policy gradient methods under constrained reward functions. Experimental results on 7B-scale models show consistent superiority across both in-domain and out-of-domain mathematical benchmarks.
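The abstract describes a step-level preference objective whose margin is derived from Bellman-style value estimates, but does not reproduce the exact formula. The following is a minimal PyTorch sketch of what a DPO-style step loss with a dynamic value margin might look like; the function name `sppd_step_loss`, the scaling factor `gamma`, and the value inputs are illustrative assumptions, not the authors' actual formulation.

```python
import torch
import torch.nn.functional as F

def sppd_step_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected,
                   value_chosen, value_rejected,
                   beta=0.1, gamma=1.0):
    """Hypothetical step-level preference loss with a dynamic value margin.

    logp_*     : summed policy log-probs over the tokens of one reasoning step
    ref_logp_* : the same quantities under a frozen reference policy
    value_*    : scalar value estimates for the chosen/rejected steps
                 (e.g. from tree-based self-sampling rollouts); their scaled
                 difference plays the role of the dynamic margin
    """
    # Implicit step rewards, as in DPO: beta * log(pi_theta / pi_ref)
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)

    # Dynamic margin: a larger value gap between the two steps demands a
    # larger separation in implicit rewards (gamma is an assumed scale).
    margin = gamma * (value_chosen - value_rejected)

    # Logistic preference loss with the margin subtracted inside the sigmoid.
    return -F.logsigmoid(chosen_reward - rejected_reward - margin).mean()

# Example with dummy per-step statistics for a batch of two preference pairs.
loss = sppd_step_loss(torch.tensor([-3.2, -2.8]), torch.tensor([-4.1, -3.9]),
                      torch.tensor([-3.5, -3.0]), torch.tensor([-4.0, -3.7]),
                      torch.tensor([0.8, 0.6]), torch.tensor([0.3, 0.2]))
```

Compared with vanilla DPO, the only change in this sketch is the value-dependent margin, which makes the required reward separation vary per step pair rather than being a fixed constant.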
Anthology ID: 2025.findings-emnlp.19
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 324–337
URL: https://aclanthology.org/2025.findings-emnlp.19/
Cite (ACL): Hao Yi, Qingyang Li, Yulan Hu, Fuzheng Zhang, Di Zhang, and Yong Liu. 2025. SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 324–337, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin (Yi et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.19.pdf
Checklist: 2025.findings-emnlp.19.checklist.pdf