Position: Toward a Metric Typology for Language Model Evaluation

Jasper Kyle Catapang

Abstract

The critique of scalar benchmark rankings as proxies for model quality is now well established (Raji et al., 2021; Wallach et al., 2025; Bean et al., 2025; Gehrmann et al., 2021). What the field still lacks is a shared structural vocabulary for comparing, combining, and contextualizing metric design choices. This paper provides that vocabulary: a four-primitive typology of representation (𝜙), comparison (D), aggregation (A), and context (C), under which existing metrics (BLEU, BERTScore, nDCG, LLM-as-judge, calibration scores, agentic outcome measures) are explicit parameterizations of a common form. The typology is paired with a measurement–decision split: metrics are noisy estimators of latent constructs, and model selection is context-dependent Pareto optimization over construct estimates, not over raw scores. The typology makes implicit metric assumptions comparable and debatable rather than hidden inside a single number.
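As an illustration only, the idea of metrics as parameterizations of a common form can be sketched in Python. Everything named here (`Metric`, `jaccard`, `pareto_front`, the two toy metrics, and the example scores) is an assumption made for this sketch and not the paper's formalization. In particular, context C is simply recorded as metadata, and raw scores stand in for the construct estimates that the abstract treats as a separate measurement step.

```python
from dataclasses import dataclass, field
from statistics import fmean
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Metric:
    """A metric as four pluggable pieces (hypothetical structure)."""
    represent: Callable[[str], object]             # phi: text -> representation
    compare: Callable[[object, object], float]     # D: score for one example
    aggregate: Callable[[Sequence[float]], float]  # A: example scores -> one number
    context: Dict[str, str] = field(default_factory=dict)  # C: stated assumptions

    def score(self, outputs: Sequence[str], references: Sequence[str]) -> float:
        per_example = [
            self.compare(self.represent(o), self.represent(r))
            for o, r in zip(outputs, references)
        ]
        return self.aggregate(per_example)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Set overlap; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two toy metrics that differ only in their choice of phi and D.
exact_match = Metric(
    represent=str.strip,
    compare=lambda a, b: float(a == b),
    aggregate=fmean,
    context={"task": "short-answer QA"},
)
token_overlap = Metric(
    represent=lambda text: frozenset(text.lower().split()),
    compare=jaccard,
    aggregate=fmean,
    context={"task": "short-answer QA"},
)


def pareto_front(estimates: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Names of models not dominated on every dimension (higher is better)."""
    def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return sorted(
        name
        for name, vec in estimates.items()
        if not any(dominates(other, vec)
                   for other_name, other in estimates.items()
                   if other_name != name)
    )
```

Under these assumptions, swapping `represent` or `compare` changes the metric while the overall shape stays fixed, which is the sense in which the design choices become comparable. `pareto_front` returns a set of candidates rather than a single winner, so the final selection still depends on the deployment context.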

Anthology ID: 2026.gem-main.78
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 1015–1020
URL: https://aclanthology.org/2026.gem-main.78/
Cite (ACL): Jasper Kyle Catapang. 2026. Position: Toward a Metric Typology for Language Model Evaluation. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 1015–1020, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Position: Toward a Metric Typology for Language Model Evaluation (Catapang, GEM 2026)
PDF: https://aclanthology.org/2026.gem-main.78.pdf
