SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Tianyi Wang; Yixia Li; Long Li; Yibiao Chen; Shaohan Huang; Yun Chen; Peng Li; Yang Liu; Guanhua Chen

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen

Abstract

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.

Anthology ID:: 2026.acl-long.356
Volume:: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 7831–7853
Language:
URL:: https://aclanthology.org/2026.acl-long.356/
DOI:
Bibkey:
Cite (ACL):: Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, and Guanhua Chen. 2026. SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7831–7853, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks (Wang et al., ACL 2026)
Copy Citation:
PDF:: https://aclanthology.org/2026.acl-long.356.pdf
Checklist:: 2026.acl-long.356.checklist.pdf

PDF Cite Search Checklist Fix data