Shuyi Guo
2025
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
Kunlun Zhu | Hongyi Du | Zhaochen Hong | Xiaocheng Yang | Shuyi Guo | Zhe Wang | Zhenhailong Wang | Cheng Qian | Xiangru Tang | Heng Ji | Jiaxuan You
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/ulab-uiuc/MARBLE.
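To make the four coordination topologies named in the abstract concrete, the following is a minimal sketch that builds each one as a set of undirected communication links between agents. It is not the MARBLE API: the function names, the `branching` parameter, and the reading of "graph" as a fully connected network are all assumptions made for illustration.

```python
from itertools import combinations


def star(agents):
    """Hub-and-spoke: the first agent (assumed hub) links to every other agent."""
    hub, *rest = agents
    return {frozenset((hub, a)) for a in rest}


def chain(agents):
    """Each agent links only to its successor in the list."""
    return {frozenset(pair) for pair in zip(agents, agents[1:])}


def tree(agents, branching=2):
    """Agent i links to its parent at index (i - 1) // branching.

    `branching` is a hypothetical parameter, not taken from the paper.
    """
    return {
        frozenset((agents[(i - 1) // branching], agents[i]))
        for i in range(1, len(agents))
    }


def graph(agents):
    """Assumption: 'graph' is read here as fully connected (every pair linked)."""
    return {frozenset(pair) for pair in combinations(agents, 2)}


if __name__ == "__main__":
    team = ["a", "b", "c", "d"]
    for build in (star, chain, tree, graph):
        links = sorted(tuple(sorted(link)) for link in build(team))
        print(build.__name__, links)
```

Under these assumptions, star, chain, and tree each use n - 1 links for n agents, while the fully connected graph uses n(n - 1)/2, so the graph topology allows more direct agent-to-agent communication at the cost of more message channels.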