STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization

Yuhan Chen; Yuxuan Liu; Long Zhang; Pengzhi Gao; Jian Luan; Wei Liu

Abstract

Multi-turn interaction remains challenging for online reinforcement learning. Current GRPO-based methods, whether at the trajectory level or the step level, still suffer from fundamental problems in multi-turn settings: they allocate sampling uniformly across tasks regardless of difficulty, propagate misleading learning signals that penalize correct intermediate actions in failed trajectories, and incur high sample-collection costs in long-horizon environments. Step-level variants (e.g., GIGPO) mitigate some interaction-cost constraints by decomposing trajectories, yet they retain GRPO’s sampling imbalance and still struggle with heterogeneous multi-turn tasks. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy Optimization), a framework that dynamically allocates sampling based on per-task success rates and performs fine-grained step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples, followed by a step-level GRPO augmentation that strengthens updates on low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over both trajectory-level and existing step-level GRPO variants, converging faster and generalizing better under the same sampling budget.
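
The success-rate-guided sampling idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' method: the abstract does not specify the smoothing rule or the allocation formula, so the sketch uses an exponential moving average and a budget split proportional to `1 - success_rate`. The function names `update_success_rate` and `allocate_rollouts` and the parameters `alpha`, `prior`, and `min_per_task` are hypothetical.

```python
from typing import Dict, List


def update_success_rate(record: Dict[str, float], task: str, success: bool,
                        alpha: float = 0.3, prior: float = 0.5) -> float:
    """Update a smoothed per-task success rate.

    Assumption: exponential moving average with weight `alpha`; unseen tasks
    start from `prior`. The paper's actual smoothing rule may differ.
    """
    old = record.get(task, prior)
    record[task] = (1.0 - alpha) * old + alpha * (1.0 if success else 0.0)
    return record[task]


def allocate_rollouts(record: Dict[str, float], tasks: List[str], budget: int,
                      min_per_task: int = 1, prior: float = 0.5) -> Dict[str, int]:
    """Split a rollout budget so lower-success (harder) tasks get more samples.

    Assumption: weights are proportional to (1 - success_rate); this is one
    simple choice, not necessarily the allocation used by STEP.
    """
    if budget < min_per_task * len(tasks):
        raise ValueError("budget too small for the per-task minimum")
    # Small epsilon keeps the total weight positive even if every task is solved.
    weights = {t: 1.0 - record.get(t, prior) + 1e-6 for t in tasks}
    total = sum(weights.values())
    spare = budget - min_per_task * len(tasks)
    alloc = {t: min_per_task + int(spare * weights[t] / total) for t in tasks}
    # Flooring can leave a few rollouts unassigned; give them to the hardest tasks.
    leftover = max(0, budget - sum(alloc.values()))
    for t in sorted(tasks, key=lambda name: weights[name], reverse=True)[:leftover]:
        alloc[t] += 1
    return alloc


if __name__ == "__main__":
    record: Dict[str, float] = {}
    update_success_rate(record, "easy_task", True)
    update_success_rate(record, "hard_task", False)
    print(allocate_rollouts(record, ["easy_task", "hard_task"], budget=10))
```

Under these assumptions, a task that has just failed receives a larger share of the next sampling round than one that has just succeeded, while the per-task minimum keeps easy tasks from being dropped entirely.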

Anthology ID: 2026.findings-acl.1532
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 30681–30692
URL: https://aclanthology.org/2026.findings-acl.1532/
Cite (ACL): Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan, and Wei Liu. 2026. STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization. In Findings of the Association for Computational Linguistics: ACL 2026, pages 30681–30692, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization (Chen et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1532.pdf
Checklist: 2026.findings-acl.1532.checklist.pdf
