GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

Yoo Yeon Sung; Eve Fleisig; Yu Hou; Ishan Upadhyay; Jordan Lee Boyd-Graber

doi:10.18653/v1/2025.acl-long.962

Abstract

Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams’ timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.

Anthology ID: 2025.acl-long.962
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 19586–19587
URL: https://aclanthology.org/2025.acl-long.962/
DOI: 10.18653/v1/2025.acl-long.962
Cite (ACL): Yoo Yeon Sung, Eve Fleisig, Yu Hou, Ishan Upadhyay, and Jordan Lee Boyd-Graber. 2025. GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19586–19587, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration (Sung et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.962.pdf
