LMUNIT: Fine-grained Evaluation with Natural Language Unit Tests

Jon Saad-Falcon; Rajan Pathe Vivek; William Berrios; Nandita Shankar Naik; Matija Franklin; Bertie Vidgen; Amanpreet Singh; Douwe Kiela; Shikib Mehri

doi:10.18653/v1/2025.findings-emnlp.176

Abstract

As language models become integral to critical workflows, assessing their behavior remains a fundamental challenge – human evaluation is costly and noisy, while automated metrics provide only coarse, difficult-to-interpret signals. We introduce natural language unit tests, a paradigm that decomposes response quality into explicit, testable criteria, along with a unified scoring model, LMUnit, which combines multi-objective training across preferences, direct ratings, and natural language rationales. Through controlled human studies, we show this paradigm significantly improves inter-annotator agreement and enables more effective LLM development workflows. LMUnit achieves state-of-the-art performance on evaluation benchmarks including FLASK, BigGenBench, and RewardBench 2, while maintaining competitive results on the original RewardBench. These results validate both our proposed paradigm and scoring model, suggesting a promising path forward for language model evaluation and development. Our code has been released at github.com/ContextualAI/LMUnit with an MIT license.
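To make the idea of decomposing response quality into explicit, testable criteria concrete, here is a minimal sketch in Python. It is not the released LMUnit code or API. The `UnitTest` class, the `evaluate` function, the per-test weights, the [0, 1] score range, and the weighted-mean aggregation are all assumptions made for illustration. The scorer is a hand-filled lookup table standing in for a learned scoring model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class UnitTest:
    """One natural-language criterion with a hypothetical importance weight."""
    criterion: str
    weight: float = 1.0


# Assumed interface: (query, response, criterion) -> score in [0, 1].
Scorer = Callable[[str, str, str], float]


def make_table_scorer(table: Dict[str, float]) -> Scorer:
    """Build a toy scorer that looks up a hand-assigned score per criterion.

    A real system would call a scoring model here; this stub only keeps
    the example deterministic and self-contained.
    """
    def score(query: str, response: str, criterion: str) -> float:
        return table[criterion]
    return score


def evaluate(
    query: str,
    response: str,
    tests: List[UnitTest],
    scorer: Scorer,
) -> Tuple[Dict[str, float], float]:
    """Score a response on each unit test and return (per-test scores, weighted mean)."""
    if not tests:
        raise ValueError("at least one unit test is required")
    per_test = {t.criterion: scorer(query, response, t.criterion) for t in tests}
    total_weight = sum(t.weight for t in tests)
    if total_weight <= 0:
        raise ValueError("unit test weights must sum to a positive value")
    overall = sum(t.weight * per_test[t.criterion] for t in tests) / total_weight
    return per_test, overall


if __name__ == "__main__":
    tests = [
        UnitTest("Is the response factually accurate?", weight=2.0),
        UnitTest("Does the response answer the question that was asked?"),
        UnitTest("Is the response free of unnecessary repetition?"),
    ]
    # Hand-assigned scores, standing in for model outputs.
    scorer = make_table_scorer({
        tests[0].criterion: 1.0,
        tests[1].criterion: 0.5,
        tests[2].criterion: 0.0,
    })
    per_test, overall = evaluate("What is the capital of France?", "Paris.", tests, scorer)
    for criterion, value in per_test.items():
        print(f"{value:.2f}  {criterion}")
    print(f"overall: {overall:.3f}")
```

Keeping the per-test scores alongside the aggregate is the point of the sketch: a single overall number hides which criterion failed, while the per-criterion breakdown shows it directly.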

Anthology ID: 2025.findings-emnlp.176
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3303–3324
URL: https://aclanthology.org/2025.findings-emnlp.176/
DOI: 10.18653/v1/2025.findings-emnlp.176
Cite (ACL): Jon Saad-Falcon, Rajan Pathe Vivek, William Berrios, Nandita Shankar Naik, Matija Franklin, Bertie Vidgen, Amanpreet Singh, Douwe Kiela, and Shikib Mehri. 2025. LMUNIT: Fine-grained Evaluation with Natural Language Unit Tests. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3303–3324, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): LMUNIT: Fine-grained Evaluation with Natural Language Unit Tests (Saad-Falcon et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.176.pdf
Checklist: 2025.findings-emnlp.176.checklist.pdf
