@inproceedings{liu-etal-2025-mcpeval,
title = "{MCPE}val: Automatic {MCP}-based Deep Evaluation for {AI} Agent Models",
author = "Liu, Zhiwei and
Qiu, Jielin and
Wang, Shiyu and
Zhang, Jianguo and
Liu, Zuxin and
Ram, Roshan and
Chen, Haolin and
Yao, Weiran and
Heinecke, Shelby and
Savarese, Silvio and
Wang, Huan and
Xiong, Caiming",
editor = {Habernal, Ivan and
Schulam, Peter and
Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-demos.27/",
pages = "373--402",
ISBN = "979-8-89176-334-0",
abstract = "The rapid adoption of Large Language Models (LLMs) as intelligent agents has underscored the necessity for robust evaluation frameworks capable of assessing agent performance in realistic, interactive environments. Existing evaluation methodologies often suffer from limitations such as static task benchmarks, limited scope, and inadequate integration with practical applications. In response, we introduce MCPEval, an open-source, Model Context Protocol (MCP)-based evaluation framework specifically tailored for comprehensive and systematic assessment of LLM-powered agents. MCPEval standardizes evaluations across diverse domains through automated task generation and verification, supports multiple performance metrics, and integrates seamlessly with native agent capabilities. We empirically validate the effectiveness of MCPEval across five distinct real-world domains, highlighting significant variations in performance across various LLM architectures and prompting strategies. Our results illustrate the framework{'}s capacity to uncover nuanced performance patterns and identify domain-specific strengths and weaknesses, providing valuable insights beyond traditional binary success metrics. We publicly release MCPEval to foster reproducible research and promote standardized evaluation practices within the LLM agent community."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="liu-etal-2025-mcpeval">
<titleInfo>
<title>MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models</title>
</titleInfo>
<name type="personal">
<namePart type="given">Zhiwei</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jielin</namePart>
<namePart type="family">Qiu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shiyu</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jianguo</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zuxin</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Roshan</namePart>
<namePart type="family">Ram</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Haolin</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Weiran</namePart>
<namePart type="family">Yao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shelby</namePart>
<namePart type="family">Heinecke</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Silvio</namePart>
<namePart type="family">Savarese</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Huan</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Caiming</namePart>
<namePart type="family">Xiong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</title>
</titleInfo>
<name type="personal">
<namePart type="given">Ivan</namePart>
<namePart type="family">Habernal</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Peter</namePart>
<namePart type="family">Schulam</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jörg</namePart>
<namePart type="family">Tiedemann</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-334-0</identifier>
</relatedItem>
<abstract>The rapid adoption of Large Language Models (LLMs) as intelligent agents has underscored the necessity for robust evaluation frameworks capable of assessing agent performance in realistic, interactive environments. Existing evaluation methodologies often suffer from limitations such as static task benchmarks, limited scope, and inadequate integration with practical applications. In response, we introduce MCPEval, an open-source, Model Context Protocol (MCP)-based evaluation framework specifically tailored for comprehensive and systematic assessment of LLM-powered agents. MCPEval standardizes evaluations across diverse domains through automated task generation and verification, supports multiple performance metrics, and integrates seamlessly with native agent capabilities. We empirically validate the effectiveness of MCPEval across five distinct real-world domains, highlighting significant variations in performance across various LLM architectures and prompting strategies. Our results illustrate the framework’s capacity to uncover nuanced performance patterns and identify domain-specific strengths and weaknesses, providing valuable insights beyond traditional binary success metrics. We publicly release MCPEval to foster reproducible research and promote standardized evaluation practices within the LLM agent community.</abstract>
<identifier type="citekey">liu-etal-2025-mcpeval</identifier>
<location>
<url>https://aclanthology.org/2025.emnlp-demos.27/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>373</start>
<end>402</end>
</extent>
</part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
%A Liu, Zhiwei
%A Qiu, Jielin
%A Wang, Shiyu
%A Zhang, Jianguo
%A Liu, Zuxin
%A Ram, Roshan
%A Chen, Haolin
%A Yao, Weiran
%A Heinecke, Shelby
%A Savarese, Silvio
%A Wang, Huan
%A Xiong, Caiming
%Y Habernal, Ivan
%Y Schulam, Peter
%Y Tiedemann, Jörg
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-334-0
%F liu-etal-2025-mcpeval
%X The rapid adoption of Large Language Models (LLMs) as intelligent agents has underscored the necessity for robust evaluation frameworks capable of assessing agent performance in realistic, interactive environments. Existing evaluation methodologies often suffer from limitations such as static task benchmarks, limited scope, and inadequate integration with practical applications. In response, we introduce MCPEval, an open-source, Model Context Protocol (MCP)-based evaluation framework specifically tailored for comprehensive and systematic assessment of LLM-powered agents. MCPEval standardizes evaluations across diverse domains through automated task generation and verification, supports multiple performance metrics, and integrates seamlessly with native agent capabilities. We empirically validate the effectiveness of MCPEval across five distinct real-world domains, highlighting significant variations in performance across various LLM architectures and prompting strategies. Our results illustrate the framework’s capacity to uncover nuanced performance patterns and identify domain-specific strengths and weaknesses, providing valuable insights beyond traditional binary success metrics. We publicly release MCPEval to foster reproducible research and promote standardized evaluation practices within the LLM agent community.
%U https://aclanthology.org/2025.emnlp-demos.27/
%P 373-402

Markdown (Informal):
[MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models](https://aclanthology.org/2025.emnlp-demos.27/) (Liu et al., EMNLP 2025)

ACL:
Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Shelby Heinecke, Silvio Savarese, Huan Wang, and Caiming Xiong. 2025. MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 373–402, Suzhou, China. Association for Computational Linguistics.
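
To use the BibTeX record above, a minimal LaTeX sketch follows; it assumes the entry is saved to a local file named anthology.bib and uses natbib with the plainnat style (the filename and the choice of bibliography package are illustrative, not part of the Anthology export):

\documentclass{article}
\usepackage[round]{natbib}  % author-year citations; biblatex would work equally well

\begin{document}

% Cite the entry by its citekey, liu-etal-2025-mcpeval
MCPEval \citep{liu-etal-2025-mcpeval} evaluates LLM agents over MCP tool servers.

\bibliographystyle{plainnat}
\bibliography{anthology}  % anthology.bib contains the BibTeX record shown above

\end{document}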