AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs

Xuanwen Ding; Chengjun Pan; Zejun Li (李泽君); Jiwen Zhang (张霁雯); Siyuan Wang (王思远); Zhongyu Wei (魏忠钰)

Abstract

Evaluating multimodal large language models (MLLMs) is becoming increasingly expensive as benchmarks grow in scale and cross-modality complexity. Inspired by structuralism in cognitive psychology, we address this cost with an adaptive evaluation framework for efficient benchmarking, namely **AutoJudger**. Instead of passively scoring a model on a fixed test set, AutoJudger treats evaluation as an interview-like process: it keeps a hypothesized ability structure of the evaluated model and actively selects informative questions to refine the estimated ability boundaries. Specifically, AutoJudger has three core components: **ability decomposition** to organize evaluation along meaningful capability dimensions, **ability estimation** to maintain an up-to-date quantitative profile of the model's competence, and **adaptive question selection** to choose the most informative questions. To operationalize this paradigm, we introduce **A²-Judger**, a novel MLLM-based **A**gentic instantiation of **A**uto**Judger** equipped with semantic-aware retrieval and dynamic memory. Experiments on four representative multimodal benchmarks show that A²-Judger significantly improves sample efficiency while maintaining reliable evaluation results.
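
The three components named in the abstract can be pictured with a small adaptive-testing loop. The sketch below is illustrative only and is not the paper's A²-Judger: it assumes a Rasch (one-parameter IRT) response model, a standard normal prior on each ability, and maximum Fisher information as the selection rule, none of which the abstract specifies. The names `adaptive_evaluate`, `question_pool`, and `answer_fn` are hypothetical, and the MLLM agent, semantic-aware retrieval, and dynamic memory are not modeled.

```python
import math


def p_correct(ability, difficulty):
    """Rasch (1PL) probability of a correct answer. The response model is an assumption."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fisher_information(ability, difficulty):
    """Information a question carries about the ability under the Rasch model."""
    p = p_correct(ability, difficulty)
    return p * (1.0 - p)


def update_ability(ability, responses, steps=25):
    """Newton-Raphson MAP estimate of ability with a standard normal prior.

    `responses` is a list of (difficulty, correct) pairs for one dimension.
    """
    for _ in range(steps):
        grad = -ability  # gradient of the log prior
        hess = -1.0      # Hessian of the log prior
        for difficulty, correct in responses:
            p = p_correct(ability, difficulty)
            grad += (1.0 if correct else 0.0) - p
            hess -= p * (1.0 - p)
        ability -= grad / hess
    return ability


def adaptive_evaluate(question_pool, answer_fn, budget):
    """Hypothetical adaptive loop.

    question_pool: dict mapping a capability dimension to a list of
        (question_id, difficulty) pairs (ability decomposition).
    answer_fn: callable taking a question_id and returning True if the
        evaluated model answered correctly.
    budget: maximum number of questions to ask.
    """
    abilities = {dim: 0.0 for dim in question_pool}
    history = {dim: [] for dim in question_pool}
    remaining = {dim: list(qs) for dim, qs in question_pool.items()}

    for _ in range(budget):
        # Adaptive question selection: most informative unused question.
        best = None
        for dim, questions in remaining.items():
            for question in questions:
                info = fisher_information(abilities[dim], question[1])
                if best is None or info > best[0]:
                    best = (info, dim, question)
        if best is None:  # pool exhausted before the budget ran out
            break
        _, dim, question = best
        remaining[dim].remove(question)

        # Ability estimation: refresh the profile for the probed dimension.
        correct = answer_fn(question[0])
        history[dim].append((question[1], correct))
        abilities[dim] = update_ability(abilities[dim], history[dim])

    return abilities
```

With a pool split into, say, `"ocr"` and `"math"` dimensions, the loop asks at most `budget` questions and returns one ability estimate per dimension. Each new question is chosen near the current estimate for its dimension, which is where the assumed Rasch model is most informative.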

Anthology ID: 2026.acl-long.685
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 15009–15034
URL: https://aclanthology.org/2026.acl-long.685/
Cite (ACL): Xuanwen Ding, Chengjun Pan, Zejun Li, Jiwen Zhang, Siyuan Wang, and Zhongyu Wei. 2026. AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15009–15034, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs (Ding et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.685.pdf
Checklist: 2026.acl-long.685.checklist.pdf
