AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments

Zhiheng Xi; Yiwen Ding; Wenxiang Chen; Boyang Hong; Honglin Guo; Junzhe Wang; Xin Guo; Dingwen Yang; Chenyang Liao; Wei He; Songyang Gao; Lu Chen; Rui Zheng; Yicheng Zou; Tao Gui; Qi Zhang; Xipeng Qiu (邱锡鹏); Xuan-Jing Huang (黄萱菁); Zuxuan Wu; Yu-Gang Jiang

doi:10.18653/v1/2025.acl-long.1355

AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments

Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Xin Guo, Dingwen Yang, Chenyang Liao, Wei He, Songyang Gao, Lu Chen, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang

Abstract

Large language models (LLMs) have emerged as a promising foundation to build generally-capable agents (LLM-based agents) that can handle multi-turn decision-making tasks across various environments. However, the community lacks a unified interactive framework that covers diverse environments for comprehensive evaluation of agents, and enables exploration and learning for their self-improvement. To address this, we propose AgentGym, a framework featuring 7 real-world scenarios, 14 environments, and 89 tasks for unified, real-time, and concurrent agent interaction. We construct expanded instruction set, high-quality trajectories, and comprehensive benchmarking suite for developing LLM-based agents. Moreover, AgentGym supports interactive exploration and learning for agents through multi-turn interactions and real-time feedback. Based on AgentGym, we take the initial step to develop LLM-based agents that can handle diverse tasks via methods like self-improvement or reinforcement learning. Experimental results show that the trained agents can achieve results comparable to commercial models. We hope our work can help the community develop more advanced LLM-based agents. We release the code, dataset, benchmark, and checkpoints at https://agentgym.github.io/.

Anthology ID:: 2025.acl-long.1355
Volume:: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2025
Address:: Vienna, Austria
Editors:: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 27914–27961
Language:
URL:: https://aclanthology.org/2025.acl-long.1355/
DOI:: 10.18653/v1/2025.acl-long.1355
Bibkey:
Cite (ACL):: Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Xin Guo, Dingwen Yang, Chenyang Liao, Wei He, Songyang Gao, Lu Chen, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. 2025. AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27914–27961, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):: AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments (Xi et al., ACL 2025)
Copy Citation:
PDF:: https://aclanthology.org/2025.acl-long.1355.pdf

PDF Cite Search Fix data