FlagEval-Arena: A Side-by-Side Comparative Evaluation Platform for Large Language Models and Text-Driven AIGC

Jing-Shu Zheng; Richeng Xuan; Bowen Qin; Zheqi He; Tongshuai.ren Tongshuai.ren; Xuejing Li; Jin-Ge Yao; Xi Yang

doi:10.18653/v1/2025.acl-demo.56

Abstract

We introduce FlagEval-Arena, an evaluation platform for side-by-side comparisons of large language models and text-driven AIGC systems. Unlike the well-known LM Arena (LMSYS Chatbot Arena), we implement our own framework, which gives us the flexibility to introduce new mechanisms and features. Our platform supports side-by-side evaluation not only of language models and vision-language models, but also of text-to-image and text-to-video synthesis. We specifically target a Chinese audience, with a stronger focus on the Chinese language, more models developed by Chinese institutes, and broader general usage beyond the technical community. As a result, we currently observe interesting differences from the results usually presented by LM Arena. Our platform is available at https://flageval.baai.org/#/arena.

Anthology ID: 2025.acl-demo.56
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Pushkar Mishra, Smaranda Muresan, Tao Yu
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 583–591
URL: https://aclanthology.org/2025.acl-demo.56/
DOI: 10.18653/v1/2025.acl-demo.56
Cite (ACL): Jing-Shu Zheng, Richeng Xuan, Bowen Qin, Zheqi He, Tongshuai.ren Tongshuai.ren, Xuejing Li, Jin-Ge Yao, and Xi Yang. 2025. FlagEval-Arena: A Side-by-Side Comparative Evaluation Platform for Large Language Models and Text-Driven AIGC. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 583–591, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): FlagEval-Arena: A Side-by-Side Comparative Evaluation Platform for Large Language Models and Text-Driven AIGC (Zheng et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-demo.56.pdf
