Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping

Miao Peng; Weizhou Shen; Nuo Chen; Chenliang Li; Ming Yan; Jia Li

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs’ short-context reasoning but falters in long-context scenarios requiring precise grounding and multi-hop reasoning. We identify the "almost-there" phenomenon—trajectories that are largely correct but fail at the final step—in long-context reasoning RL and attribute this failure to two factors: (1) the lack of high reasoning density in long-context QA data, and (2) indiscriminate penalization of partially correct trajectories during long-context RL. To overcome this bottleneck, we propose DeepReasonQA, a KG-driven synthesis framework that controllably constructs high-difficulty, multi-hop long-context QA pairs with inherent reasoning chains. Building on this, we introduce Long-context Process Advantage Shaping (LongPAS), a simple yet effective method that performs fine-grained credit assignment by measuring reasoning steps along Validity and Relevance dimensions, which captures critical signals from "almost-there" trajectories. Experiments on three long-context reasoning benchmarks show that our approach substantially outperforms RLVR baselines and matches frontier LLMs while using far fewer parameters. Further analysis confirms the effectiveness of our methods in strengthening long-context reasoning while maintaining stable RL training.

Anthology ID: 2026.findings-acl.1306
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 26201–26228
URL: https://aclanthology.org/2026.findings-acl.1306/
Cite (ACL): Miao Peng, Weizhou Shen, Nuo Chen, Chenliang Li, Ming Yan, and Jia Li. 2026. Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping. In Findings of the Association for Computational Linguistics: ACL 2026, pages 26201–26228, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping (Peng et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1306.pdf
Checklist: 2026.findings-acl.1306.checklist.pdf