Benchmarking Failures in Tool-Augmented Language Models

Eduardo Treviño; Hugo Contant; James Ngai; Graham Neubig; Zora Zhiruo Wang

doi:10.18653/v1/2025.naacl-long.149

Abstract

The integration of tools has extended the capabilities of language models (LMs) beyond vanilla text generation to versatile scenarios. However, tool-augmented language models (TaLMs) often assume ‘perfect’ information access and tool availability, which may not hold in the real world. To systematically study TaLM imperfections, we introduce the FAIL-TaLMs benchmark, featuring two major failures: under-specified user queries and unavailable tools. FAIL-TaLMs contains 1,749 examples using 906 tools across 21 categories, including single- and multi-tool usage. We evaluate top-performing proprietary and open-source models, and find that all current models except Claude struggle to recognize missing tools or information. Further, to study possible mitigations of these failures, we enable real-time human interaction, a method we call Ask-and-Help, which lets a human provide missing information or replace non-functional tools. While Ask-and-Help can help models solve tasks more correctly when queries are under-specified, it brings minimal benefit when complex tools are broken.

Anthology ID: 2025.naacl-long.149
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 2916–2934
URL: https://aclanthology.org/2025.naacl-long.149/
DOI: 10.18653/v1/2025.naacl-long.149
Cite (ACL): Eduardo Treviño, Hugo Contant, James Ngai, Graham Neubig, and Zora Zhiruo Wang. 2025. Benchmarking Failures in Tool-Augmented Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2916–2934, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): Benchmarking Failures in Tool-Augmented Language Models (Treviño et al., NAACL 2025)
PDF: https://aclanthology.org/2025.naacl-long.149.pdf
