@inproceedings{alyahya-etal-2025-zerosumeval,
  title     = {{ZeroSumEval}: An Extensible Framework For Scaling {LLM} Evaluation with Inter-Model Competition},
  author    = {Alyahya, Hisham Abdullah and
               Khan, Haidar and
               Alnumay, Yazeed and
               Bari, M Saiful and
               Yener, Bulent},
  editor    = {Mishra, Pushkar and
               Muresan, Smaranda and
               Yu, Tao},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)},
  month     = jul,
  year      = {2025},
  address   = {Vienna, Austria},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.acl-demo.33/},
  doi       = {10.18653/v1/2025.acl-demo.33},
  pages     = {340--350},
  isbn      = {979-8-89176-253-4},
  abstract  = {We introduce ZeroSumEval, a dynamic, competition-based, and evolving evaluation framework for Large Language Models (LLMs) that leverages competitive games. ZeroSumEval encompasses a diverse suite of games, including security challenges (Capture the Flag), classic board games (chess), and knowledge tests (MathQuiz). These games are designed to evaluate a range of capabilities such as strategic reasoning, planning, knowledge application, safety, and adaptability. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework for easily implementing games and leverages DSPy to provide a better abstraction for LLM player strategies.},
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="alyahya-etal-2025-zerosumeval">
<titleInfo>
<title>ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition</title>
</titleInfo>
<name type="personal">
<namePart type="given">Hisham</namePart>
<namePart type="given">Abdullah</namePart>
<namePart type="family">Alyahya</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Haidar</namePart>
<namePart type="family">Khan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yazeed</namePart>
<namePart type="family">Alnumay</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">M</namePart>
<namePart type="given">Saiful</namePart>
<namePart type="family">Bari</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Bulent</namePart>
<namePart type="family">Yener</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Pushkar</namePart>
<namePart type="family">Mishra</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Smaranda</namePart>
<namePart type="family">Muresan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tao</namePart>
<namePart type="family">Yu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-253-4</identifier>
</relatedItem>
<abstract>We introduce ZeroSumEval, a dynamic, competition-based, and evolving evaluation framework for Large Language Models (LLMs) that leverages competitive games. ZeroSumEval encompasses a diverse suite of games, including security challenges (Capture the Flag), classic board games (chess), and knowledge tests (MathQuiz). These games are designed to evaluate a range of capabilities such as strategic reasoning, planning, knowledge application, safety, and adaptability. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework for easily implementing games and leverages DSPy to provide a better abstraction for LLM player strategies.</abstract>
<identifier type="citekey">alyahya-etal-2025-zerosumeval</identifier>
<identifier type="doi">10.18653/v1/2025.acl-demo.33</identifier>
<location>
<url>https://aclanthology.org/2025.acl-demo.33/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>340</start>
<end>350</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition
%A Alyahya, Hisham Abdullah
%A Khan, Haidar
%A Alnumay, Yazeed
%A Bari, M. Saiful
%A Yener, Bulent
%Y Mishra, Pushkar
%Y Muresan, Smaranda
%Y Yu, Tao
%S Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-253-4
%F alyahya-etal-2025-zerosumeval
%X We introduce ZeroSumEval, a dynamic, competition-based, and evolving evaluation framework for Large Language Models (LLMs) that leverages competitive games. ZeroSumEval encompasses a diverse suite of games, including security challenges (Capture the Flag), classic board games (chess), and knowledge tests (MathQuiz). These games are designed to evaluate a range of capabilities such as strategic reasoning, planning, knowledge application, safety, and adaptability. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework for easily implementing games and leverages DSPy to provide a better abstraction for LLM player strategies.
%R 10.18653/v1/2025.acl-demo.33
%U https://aclanthology.org/2025.acl-demo.33/
%U https://doi.org/10.18653/v1/2025.acl-demo.33
%P 340-350
Markdown (Informal)
[ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition](https://aclanthology.org/2025.acl-demo.33/) (Alyahya et al., ACL 2025)
ACL