AgentDiagnose: An Open Toolkit for Diagnosing LLM Agent Trajectories

Tianyue Ou; Wanyao Guo; Apurva Gandhi; Graham Neubig; Xiang Yue

doi:10.18653/v1/2025.emnlp-demos.15

Abstract

Large Language Model (LLM) agents produce rich, multi-step trajectories that interleave observations, internal reasoning, and tool actions. However, most evaluation pipelines focus solely on end-task success, leaving the agent’s decision-making process opaque and poorly understood. We introduce AgentDiagnose, an open-source, modular framework for diagnosing agent trajectories. The present release fully supports the web domain, and AgentDiagnose is architected as an extensible, open platform compatible with most agent trajectories. AgentDiagnose consists of (i) an evaluation module that quantifies five core agentic competencies (backtracking & exploration, task decomposition, observation reading, self-verification, and objective quality) and (ii) a visualization module that highlights trajectory semantics through t-SNE action embeddings, interactive word clouds, and state-transition timelines. On a set of 30 manually annotated trajectories, our automatic metrics achieve a mean Pearson correlation of 0.57 with human judgments, rising to 0.78 for task decomposition. Furthermore, filtering the 46k-example NNetNav-Live dataset with AgentDiagnose and fine-tuning a Llama-3.1-8B model on the top 6k trajectories improves WebArena success rates by 0.98, despite using only 13% of the original data. AgentDiagnose thus serves as both a diagnostic lens for agent analysis and a practical tool for curating higher-quality training data. The toolkit and demo are publicly available.
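
The data-curation step described in the abstract (keep the top-scoring trajectories, then fine-tune on that subset) can be illustrated with a minimal sketch. This is not the toolkit's API: the function names, the single scalar score per trajectory, and the toy scores are all assumptions made for illustration. The last lines only check the abstract's arithmetic that 6k out of 46k examples is roughly 13%.

```python
from typing import Dict, List


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the ids of the k highest-scoring trajectories.

    Hypothetical helper: assumes each trajectory has one scalar quality
    score. Ties are broken by trajectory id so the result is deterministic.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [traj_id for traj_id, _ in ranked[:k]]


def kept_fraction(kept: int, total: int) -> float:
    """Fraction of the dataset retained after filtering."""
    return kept / total


# Toy scores (made up for illustration, not taken from the paper).
toy_scores = {"traj_a": 0.9, "traj_b": 0.2, "traj_c": 0.7, "traj_d": 0.4}
top_two = select_top_k(toy_scores, k=2)

# Dataset sizes quoted in the abstract: top 6k of 46k examples.
fraction = kept_fraction(6_000, 46_000)
print(top_two, f"{fraction:.1%}")
```

How the five competency scores are combined into a single ranking is not stated in the abstract, so the sketch deliberately leaves that choice to the caller.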

Anthology ID: 2025.emnlp-demos.15
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month: November
Year: 2025
Address: Suzhou, China
Editors: Ivan Habernal, Peter Schulam, Jörg Tiedemann
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 207–215
URL: https://aclanthology.org/2025.emnlp-demos.15/
DOI: 10.18653/v1/2025.emnlp-demos.15
Cite (ACL): Tianyue Ou, Wanyao Guo, Apurva Gandhi, Graham Neubig, and Xiang Yue. 2025. AgentDiagnose: An Open Toolkit for Diagnosing LLM Agent Trajectories. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 207–215, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): AgentDiagnose: An Open Toolkit for Diagnosing LLM Agent Trajectories (Ou et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-demos.15.pdf
