ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

Beong-woo Kwak; Minju Kim; Dongha Lim; Hyungjoo Chae; Dongjin Kang; Sunghwan Mac Kim; Dongil Yang; Jinyoung Yeo

Abstract

Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often struggle significantly in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.

Anthology ID: 2025.findings-emnlp.1344
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 24696–24727
URL: https://aclanthology.org/2025.findings-emnlp.1344/
Cite (ACL): Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, and Jinyoung Yeo. 2025. ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 24696–24727, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions (Kwak et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.1344.pdf
Checklist: 2025.findings-emnlp.1344.checklist.pdf
