ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky


Abstract
LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of *reliable evaluation* that accounts for prompt sensitivity, and suggest ReliableEval, a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
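To make the idea of estimating a resampling budget concrete, here is a minimal sketch. It is not the authors' procedure: it assumes each prompt resampling yields one benchmark score, estimates the first two moments from a small pilot run, and uses a normal approximation to pick the number of resamplings for a target confidence-interval half-width. The function names, the pilot scores, and the `half_width` and `z` parameters are all hypothetical.

```python
import math
import statistics


def moment_estimates(scores):
    """Return the sample mean and unbiased sample variance of per-prompt scores."""
    mean = statistics.fmean(scores)
    var = statistics.variance(scores) if len(scores) > 1 else 0.0
    return mean, var


def resamplings_needed(scores, half_width, z=1.96):
    """Smallest n with z * sqrt(var / n) <= half_width (normal approximation).

    Assumption: the pilot variance is a reasonable stand-in for the true
    variance of scores across meaning-preserving prompt perturbations.
    """
    _, var = moment_estimates(scores)
    if var == 0.0:
        return 1
    return max(1, math.ceil(z * z * var / (half_width ** 2)))


# Hypothetical pilot run: benchmark accuracy under 8 paraphrased prompts.
pilot_scores = [0.71, 0.64, 0.78, 0.69, 0.74, 0.66, 0.80, 0.70]
mean, var = moment_estimates(pilot_scores)
n = resamplings_needed(pilot_scores, half_width=0.02)
print(f"mean={mean:.3f} var={var:.5f} resamplings={n}")
```

A tighter `half_width` or a noisier pilot run raises the estimated budget quadratically, which is why single-prompt scores can be misleading for prompt-sensitive models.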
Anthology ID: 2025.findings-emnlp.594
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 11146–11153
URL: https://aclanthology.org/2025.findings-emnlp.594/
Cite (ACL): Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, and Gabriel Stanovsky. 2025. ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11146–11153, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments (Lior et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.594.pdf
Checklist: 2025.findings-emnlp.594.checklist.pdf