Stop Guessing When to Stop Testing: Efficient Model Evaluation with Just Enough Data

Ofir Arviv; Kristjan Greenewald; Yotam Perlitz; Hadar Mulian; Michal Shmueli-Scheuer; Leshem Choshen

Abstract

The inherent rigidity of fixed-size benchmarks makes them an inefficient tool for model evaluation. Diverse evaluation objectives, including model ranking, model selection, and testing throughout development, demand varying levels of statistical power. The mismatch between fixed sample sizes and these diverse needs results in either excessive computational cost or compromised reliability, a critical concern for model evaluation. To overcome these limitations, we call for the adoption of sequential testing in our field. We present an adaptive evaluation framework that provides a principled way to navigate the trade-off between efficiency and reliability in model evaluation. Our framework combines the established statistical paradigm of sequential testing with stopping criteria tailored to common evaluation needs, such as diminishing-returns detection and minimum detectable effect size. We demonstrate its ability to adaptively manage the efficiency-reliability trade-off on the Open VLM Leaderboard, including, for example, an 80% reduction in computational cost compared to fixed-size evaluation (with a 2.5-point CI width allowance) while maintaining statistical significance.
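To make the idea of a width-based stopping criterion concrete, here is a minimal sketch and not the authors' implementation. It assumes binary per-example correctness, uses a Wilson score interval checked at batch boundaries, and reads the "2.5-point CI width allowance" as a full interval width of 0.025 on a 0 to 1 accuracy scale; all of these are assumptions. The function names, `batch_size`, and `min_samples` are hypothetical. This naive version also does not correct for repeated looks at the data, which a proper sequential test would need to do.

```python
import math
import random


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half, center + half


def evaluate_until_narrow(outcomes, max_width=0.025, batch_size=50,
                          min_samples=100, z=1.96):
    """Consume 0/1 outcomes one at a time and stop once the interval is narrow.

    Returns (accuracy estimate, (low, high), number of examples used).
    Assumption: the interval is only checked every `batch_size` examples,
    and no correction is applied for checking it repeatedly.
    """
    successes = 0
    n = 0
    for outcome in outcomes:
        successes += int(outcome)
        n += 1
        if n >= min_samples and n % batch_size == 0:
            low, high = wilson_interval(successes, n, z)
            if high - low <= max_width:
                return successes / n, (low, high), n
    if n == 0:
        raise ValueError("no outcomes were provided")
    # The stream ran out before the width target was met: report what we have.
    low, high = wilson_interval(successes, n, z)
    return successes / n, (low, high), n


if __name__ == "__main__":
    # Hypothetical usage: a simulated model that is correct ~90% of the time.
    rng = random.Random(0)
    stream = (rng.random() < 0.9 for _ in range(20000))
    estimate, interval, used = evaluate_until_narrow(stream)
    print(f"accuracy={estimate:.3f} interval={interval} examples_used={used}")
```

Because the interval width depends on the observed accuracy as well as on the sample count, a model scoring near 0 or 1 stops much earlier than one scoring near 0.5, which is the kind of adaptivity a fixed-size benchmark cannot offer.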

Anthology ID: 2026.findings-acl.43
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 871–881
URL: https://aclanthology.org/2026.findings-acl.43/
Cite (ACL): Ofir Arviv, Kristjan Greenewald, Yotam Perlitz, Hadar Mulian, Michal Shmueli-Scheuer, and Leshem Choshen. 2026. Stop Guessing When to Stop Testing: Efficient Model Evaluation with Just Enough Data. In Findings of the Association for Computational Linguistics: ACL 2026, pages 871–881, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Stop Guessing When to Stop Testing: Efficient Model Evaluation with Just Enough Data (Arviv et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.43.pdf
Checklist: 2026.findings-acl.43.checklist.pdf