Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

Siyuan Wang (王思远); Zhuohan Long; Zhihao Fan; Xuan-Jing Huang (黄萱菁); Zhongyu Wei (魏忠钰)

Abstract

This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs). We use a multi-agent system to reframe original instances into new evolving instances with high confidence, extending existing benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances that test LLMs against diverse queries and shortcut biases and probe their problem-solving sub-abilities. With this framework, we extend datasets across general and specific tasks over multiple iterations. Experimental results show that most LLMs perform worse than on the original benchmarks under scalable and robust evaluations, which, together with our fine-grained evaluation, offers a more accurate reflection of model capabilities. In addition, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks. We hope this framework helps the research community continuously evolve benchmarks alongside LLM development.

Anthology ID: 2025.coling-main.223
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 3310–3328
URL: https://aclanthology.org/2025.coling-main.223/
Cite (ACL): Siyuan Wang, Zhuohan Long, Zhihao Fan, Xuanjing Huang, and Zhongyu Wei. 2025. Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 3310–3328, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation (Wang et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.223.pdf
