Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA

Sher Badshah; Hassan Sajjad

doi:10.18653/v1/2025.winlp-main.37

Abstract

The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as EM and F1, while useful, are inadequate for capturing the full semantics and contextual depth of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs as judges. Through experiments on free-form question-answering tasks, we demonstrate that combining multiple models improves the reliability and accuracy of evaluations, especially in tasks where a single model may struggle. The results indicate a strong correlation with human evaluations, establishing the proposed method as a reliable alternative to traditional metrics.
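To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes SQuAD-style answer normalization for EM and token-level F1, and it assumes that judge verdicts are binary and combined by simple majority vote; the abstract does not specify the aggregation rule. The function names `normalize`, `exact_match`, `token_f1`, and `majority_verdict` are hypothetical, and the LLM judge calls themselves are left out.

```python
from collections import Counter
import re
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> bool:
    """EM: the normalized prediction must equal the normalized reference."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def majority_verdict(verdicts: list[bool]) -> bool:
    """Assumed aggregation: correct only if a strict majority of judges agree."""
    return sum(verdicts) * 2 > len(verdicts)


if __name__ == "__main__":
    reference = "Paris"
    prediction = "The capital of France is Paris."
    # A correct but verbose free-form answer is penalized by both metrics.
    print(exact_match(prediction, reference))
    print(round(token_f1(prediction, reference), 3))
    # Hypothetical verdicts from three judges, each shown the reference answer.
    print(majority_verdict([True, True, False]))
```

In this sketch a correct but verbose answer fails EM and receives a low F1, while a panel of judges that sees the reference can still mark it correct. Requiring a strict majority means a tie on an even-sized panel counts as incorrect, which is one conservative choice among several.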

Anthology ID: 2025.winlp-main.37
Volume: Proceedings of the 9th Widening NLP Workshop
Month: November
Year: 2025
Address: Suzhou, China
Editors: Chen Zhang, Emily Allaway, Hua Shen, Lesly Miculicich, Yinqiao Li, Meryem M'hamdi, Peerat Limkonchotiwat, Richard He Bai, Santosh T.y.s.s., Sophia Simeng Han, Surendrabikram Thapa, Wiem Ben Rim
Venues: WiNLP | WS
Publisher: Association for Computational Linguistics
Pages: 251–267
URL: https://aclanthology.org/2025.winlp-main.37/
DOI: 10.18653/v1/2025.winlp-main.37
Cite (ACL): Sher Badshah and Hassan Sajjad. 2025. Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA. In Proceedings of the 9th Widening NLP Workshop, pages 251–267, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA (Badshah & Sajjad, WiNLP 2025)
PDF: https://aclanthology.org/2025.winlp-main.37.pdf
