Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks

Dongjun Kim; Gyuho Shim; Yongchan Chun; Minhyuk Kim; Chanjun Park; Heui-Seok Lim

doi:10.18653/v1/2025.emnlp-main.789

Abstract

Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability since they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify whether these benchmarks actually measure what their labels claim. We introduce **BENCHMARK PROFILING**, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model’s success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task can negatively affect performance. **BENCHMARK PROFILING** therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.
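The abstract names the two ingredients of the method (gradient-based importance scoring and targeted parameter ablation) but does not give formulas. The sketch below illustrates the general idea on a toy linear scorer. Everything in it is an assumption made for illustration and is not taken from the paper: the saliency formula `|w * dL/dw|`, the top-k selection, the relative-accuracy-drop definition used for the score, and all function names and data.

```python
# Toy illustration only: the formulas and names below are assumptions,
# not the paper's definitions of importance or of the Ability Impact Score.

def score(weights, features):
    """Linear scorer: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))

def importance_scores(weights, ability_data):
    """Assumed saliency: |w_i * dL/dw_i| summed over ability examples,
    with squared-error loss L = 0.5 * (score - target)^2."""
    totals = [0.0] * len(weights)
    for features, target in ability_data:
        residual = score(weights, features) - target  # dL/dscore
        for i, (w, x) in enumerate(zip(weights, features)):
            totals[i] += abs(w * residual * x)        # |w_i * dL/dw_i|
    return totals

def top_k_parameters(importances, k):
    """Indices of the k most important parameters, most important first."""
    order = sorted(range(len(importances)), key=lambda i: -importances[i])
    return order[:k]

def ablate(weights, indices):
    """Return a copy of the weights with the selected parameters zeroed."""
    drop = set(indices)
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def accuracy(weights, benchmark_data):
    """Fraction of benchmark items where (score > 0) matches the 0/1 label."""
    correct = sum(int(score(weights, f) > 0) == label for f, label in benchmark_data)
    return correct / len(benchmark_data)

def ability_impact_score(weights, ability_data, benchmark_data, k):
    """Assumed score: relative accuracy drop after ablating the top-k
    ability-specific parameters."""
    base = accuracy(weights, benchmark_data)
    if base == 0:
        return 0.0  # nothing to lose; avoids division by zero
    selected = top_k_parameters(importance_scores(weights, ability_data), k)
    return (base - accuracy(ablate(weights, selected), benchmark_data)) / base

# Hypothetical toy data: four parameters, one "ability" set, one "benchmark".
WEIGHTS = [1.5, -2.0, 0.3, 0.1]
ABILITY_DATA = [([1.0, 0.0, 0.0, 0.0], 1.0), ([0.0, 1.0, 0.0, 0.0], 0.0)]
BENCHMARK_DATA = [
    ([1.0, 0.0, 1.0, 0.0], 1),
    ([0.0, 1.0, 1.0, 0.0], 0),
    ([1.0, 1.0, 0.0, 0.0], 0),
    ([0.0, 0.0, 1.0, 1.0], 1),
]

if __name__ == "__main__":
    print(ability_impact_score(WEIGHTS, ABILITY_DATA, BENCHMARK_DATA, k=2))
```

On this toy data, ablating the two parameters most tied to the ability drops benchmark accuracy from 1.0 to 0.75, so the assumed score is 0.25. In a real setting the gradients would come from an autodiff framework and the ablation would target parameters of a full language model; the paper should be consulted for the actual procedure.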

Anthology ID: 2025.emnlp-main.789
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 15635–15650
URL: https://aclanthology.org/2025.emnlp-main.789/
DOI: 10.18653/v1/2025.emnlp-main.789
Cite (ACL): Dongjun Kim, Gyuho Shim, Yongchan Chun, Minhyuk Kim, Chanjun Park, and Heuiseok Lim. 2025. Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15635–15650, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks (Kim et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.789.pdf
Checklist: 2025.emnlp-main.789.checklist.pdf
