Characterizing the Confidence of Large Language Model-Based Automatic Evaluation Metrics

Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara


Abstract
There has recently been a growing interest in using Large Language Models (LLMs) to evaluate NLP tasks automatically. Considerable research effort has been put into improving such systems towards achieving high correlations with human judgement. However, it is still unclear what level of correlation is good enough for practical applications of LLM-based automatic evaluation systems. This paper characterizes these LLM evaluators’ confidence in ranking candidate NLP models and develops a configurable Monte Carlo simulation method. We show that even automatic metrics with low correlation with human judgement can reach high-confidence rankings of candidate models with reasonable evaluation set sizes (100s of examples). Further, we describe tradeoff curves between the LLM evaluator performance (i.e., correlation with humans) and evaluation set size; loss in correlation can be compensated for with modest increases in the evaluation set size. We validate our results on RoSE, a text summarization dataset, and find our estimates of confidence align with empirical observations. Code available at https://github.com/rickardstureborg/llm-eval-confidence
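As a rough illustration of the kind of Monte Carlo estimate the abstract describes (not the paper's actual procedure; see the linked repository for that), the Python sketch below assumes per-example human scores are standard normal, that the truly better model's scores are shifted by a hypothetical margin `delta`, and that evaluator scores are synthesized to correlate with human scores at level `rho`. The function name and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ranking_confidence(rho: float, n: int, delta: float = 0.2,
                       trials: int = 10_000) -> float:
    """Fraction of trials in which the evaluator's mean score prefers the
    truly better model (whose human scores are higher by `delta`)."""
    correct = 0
    for _ in range(trials):
        # Per-example human scores for candidate models A (better) and B.
        h_a = rng.normal(delta, 1.0, n)
        h_b = rng.normal(0.0, 1.0, n)
        # Evaluator scores built to have Pearson correlation `rho` with
        # the human scores (unit-variance Gaussian mixing).
        e_a = rho * h_a + np.sqrt(1 - rho**2) * rng.normal(size=n)
        e_b = rho * h_b + np.sqrt(1 - rho**2) * rng.normal(size=n)
        # Did the evaluator rank the truly better model first?
        correct += e_a.mean() > e_b.mean()
    return correct / trials

# Even a weakly correlated evaluator (rho = 0.3) ranks two candidates
# fairly reliably once the evaluation set reaches a few hundred examples.
print(ranking_confidence(rho=0.3, n=400))
```

Under this toy setup, sweeping `rho` and `n` traces out the kind of tradeoff curve the abstract mentions: lower evaluator-human correlation can be offset by a larger evaluation set.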
Anthology ID:
2024.eacl-short.9
Volume:
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
76–89
URL:
https://aclanthology.org/2024.eacl-short.9
Cite (ACL):
Rickard Stureborg, Dimitris Alikaniotis, and Yoshi Suhara. 2024. Characterizing the Confidence of Large Language Model-Based Automatic Evaluation Metrics. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 76–89, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Characterizing the Confidence of Large Language Model-Based Automatic Evaluation Metrics (Stureborg et al., EACL 2024)
PDF:
https://aclanthology.org/2024.eacl-short.9.pdf