HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation

Mingxuan Li; Hanchen Li; Chenhao Tan

Abstract

Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous LLM-as-a-judge frameworks fall short in two ways: they either use a zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples. Moreover, previous methods often provide little reasoning behind automated evaluations. In this paper, we propose HypoEval, a Hypothesis-guided Evaluation framework, which first uses a small corpus of human evaluations to generate more detailed rubrics for human judgments and then incorporates a checklist-like approach to combine the LLM’s assigned scores on each decomposed dimension into overall scores. With only 30 human evaluations, HypoEval achieves state-of-the-art performance in alignment with both human rankings (Spearman correlation) and human scores (Pearson correlation), on average outperforming G-Eval by 11.86% and fine-tuned Llama-3.1-8B-Instruct with at least 3 times more human evaluations by 11.95%. Furthermore, we conduct systematic studies to assess the robustness of HypoEval, highlighting its effectiveness as a reliable and interpretable automated evaluation framework.
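
The abstract's two ingredients, combining per-dimension scores into an overall score and measuring alignment with human judgments through Spearman and Pearson correlation, can be illustrated with a minimal sketch. The abstract does not specify the aggregation rule, so the plain average below is an assumption, and the function names and dimension names are hypothetical rather than taken from the paper.

```python
from statistics import mean


def combine_dimension_scores(dimension_scores):
    """Average per-dimension scores into one overall score.

    Assumption: a plain mean. The paper's actual combination rule may differ.
    """
    return mean(dimension_scores.values())


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-dimension LLM scores for three generated texts.
llm_scores = [
    combine_dimension_scores({"coherence": 4, "relevance": 5}),
    combine_dimension_scores({"coherence": 2, "relevance": 3}),
    combine_dimension_scores({"coherence": 3, "relevance": 3}),
]
human_scores = [5, 2, 4]  # hypothetical human ratings for the same texts

ranking_alignment = spearman(llm_scores, human_scores)
score_alignment = pearson(llm_scores, human_scores)
```

Spearman correlation only compares the orderings of the two score lists, while Pearson correlation is also sensitive to how far apart the scores are, which is why the abstract reports them separately as alignment with human rankings and with human scores.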

Anthology ID: 2026.acl-long.1963
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 42424–42443
URL: https://aclanthology.org/2026.acl-long.1963/
Cite (ACL): Mingxuan Li, Hanchen Li, and Chenhao Tan. 2026. HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 42424–42443, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation (Li et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1963.pdf
Checklist: 2026.acl-long.1963.checklist.pdf
