Failure makes the agent stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions

Junhao Su; Yuanliang Wan; Junwei Yang; Hengyu Shi; Tianyang Han; Yurui Qiu; Junfeng Luo

doi:10.18653/v1/2026.findings-acl.618

Abstract

Tool-augmented large language models (LLMs) are typically trained via supervised imitation learning or coarse-grained reinforcement learning, approaches that primarily optimize one-shot tool calls. Existing self-reflection practices largely rely on heuristic prompting or unidirectional reasoning traces: the model is encouraged to “think more,” rather than to treat error diagnosis and correction as a learnable capability. This makes such models fragile in multi-turn interaction settings: once a call fails, the model tends to repeat the same mistake instead of recovering. To address this issue, we propose structured reflection, which turns the “from error to repair” process into a first-class, controllable, and trainable action. The agent produces a concise yet precise reflection: it diagnoses the error based on evidence from the previous step and then proposes a correct and executable follow-up call. During training, we combine the objective functions of DAPO and GSPO and design a more principled reward mechanism tailored to tool calling, optimizing the stepwise strategy Reflect → Call → Final. To evaluate this capability, we introduce Tool-Reflection-Bench, a lightweight benchmark dataset that programmatically verifies structural validity, executability, parameter correctness, and result consistency. Tasks in the benchmark are constructed as miniature trajectories of Erroneous Call → Reflection → Corrected Call and are split into disjoint training and testing sets. Experiments on BFCL v3 and Tool-Reflection-Bench show that our method achieves significant improvements in multi-turn tool-call success rates and error recovery, while also reducing redundant calls. These results demonstrate that making reflection explicit and treating it as an optimization objective can substantially enhance the reliability of tool interaction, providing a reproducible pathway for agents to grow stronger by learning from failure.
We will release all the code and datasets as open source once the paper is accepted by the community.
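
As a rough illustration of the four programmatic checks named in the abstract (structural validity, executability, parameter correctness, result consistency), the following Python sketch verifies one miniature Erroneous Call → Corrected Call pair. It is not the authors' benchmark code: the `get_weather` tool, the `TOOLS` registry, the JSON call format, and the `verify_call` helper are all hypothetical names assumed for this example.

```python
import json
from typing import Any, Callable, Dict


# Hypothetical tool with stand-in data; the benchmark's real tools are not given here.
def get_weather(city: str, unit: str = "celsius") -> Dict[str, Any]:
    table = {"Paris": 18, "Oslo": 7}
    if city not in table:
        raise KeyError(f"unknown city: {city}")
    if unit not in ("celsius", "fahrenheit"):
        raise ValueError(f"unknown unit: {unit}")
    temp = table[city] if unit == "celsius" else table[city] * 9 / 5 + 32
    return {"city": city, "temp": temp, "unit": unit}


# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def verify_call(raw_call: str, gold_args: Dict[str, Any], gold_result: Any) -> Dict[str, bool]:
    """Run four checks on a tool call given as a JSON string (assumed format)."""
    checks = {"structure": False, "executable": False, "parameters": False, "result": False}

    # 1. Structural validity: parses as JSON with a string name and a dict of arguments.
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return checks
    if not (
        isinstance(call, dict)
        and isinstance(call.get("name"), str)
        and isinstance(call.get("arguments"), dict)
    ):
        return checks
    checks["structure"] = True

    # 2. Parameter correctness: arguments match the reference arguments exactly.
    checks["parameters"] = call["arguments"] == gold_args

    # 3. Executability: the tool exists and runs without raising.
    tool = TOOLS.get(call["name"])
    if tool is None:
        return checks
    try:
        result = tool(**call["arguments"])
    except Exception:
        return checks
    checks["executable"] = True

    # 4. Result consistency: the output equals the reference result.
    checks["result"] = result == gold_result
    return checks


# A miniature trajectory: a failing call, then the call proposed after reflection.
gold_args = {"city": "Paris"}
gold_result = {"city": "Paris", "temp": 18, "unit": "celsius"}
bad_call = '{"name": "get_weather", "arguments": {"city": "Pariss"}}'
fixed_call = '{"name": "get_weather", "arguments": {"city": "Paris"}}'

print(verify_call(bad_call, gold_args, gold_result))
print(verify_call(fixed_call, gold_args, gold_result))
```

In this sketch the misspelled city passes only the structural check, while the corrected call passes all four. A reward for the Reflect → Call → Final strategy could plausibly be built from such per-check signals, though the abstract does not specify how the actual reward is composed.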

Anthology ID: 2026.findings-acl.618
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 12712–12734
URL: https://aclanthology.org/2026.findings-acl.618/
DOI: 10.18653/v1/2026.findings-acl.618
Cite (ACL): Junhao Su, Yuanliang Wan, Junwei Yang, Hengyu Shi, Tianyang Han, Yurui Qiu, and Junfeng Luo. 2026. Failure makes the agent stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions. In Findings of the Association for Computational Linguistics: ACL 2026, pages 12712–12734, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Failure makes the agent stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions (Su et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.618.pdf
Checklist: 2026.findings-acl.618.checklist.pdf
