Are LLM Benchmarks Already Contaminated? A Systematic Review of Contamination Detection Methods

Erfan Nourbakhsh; Mohammad Sadegh Sirjani; Amir Mousavi; Khoa Nguyen; John Quarles; Mimi Xie; Rocky Slavin

Abstract

Large Language Models (LLMs) are trained on web-scale corpora, increasing the risk that benchmark test data appears in training sets and inflates reported performance. We present a systematic literature review of 55 studies on LLM benchmark contamination through late 2025. Our contributions are: (1) a four-tier contamination taxonomy (Exact, Syntactic, Semantic, Task-Level; T1–T4); (2) a comparative analysis of five detection families (string-matching, likelihood-based, membership inference, LLM-prompted detection, and benchmark auditing), including access assumptions and failure modes; (3) a synthesis of contamination evidence on MMLU, GSM8K, HumanEval, and HellaSwag by measurement construct; (4) a comparative evaluation of mitigation strategies across lifecycle points, access assumptions, and evidence maturity; and (5) a Contamination Transparency Card (CTC) framework for future releases. Across studies, no detection method is consistently reliable across contamination tiers, model-access settings, and training stages. We identify instruction tuning as a persistent blind spot, note that RL/post-training contamination auditing is only beginning to mature, and report inflation estimates spanning roughly 6%–40% under benchmark- and setting-dependent assumptions.

Anthology ID: 2026.gem-main.50
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 518–539
URL: https://aclanthology.org/2026.gem-main.50/
Cite (ACL): Erfan Nourbakhsh, Mohammad Sadegh Sirjani, Amir Mousavi, Khoa Nguyen, John Quarles, Mimi Xie, and Rocky Slavin. 2026. Are LLM Benchmarks Already Contaminated? A Systematic Review of Contamination Detection Methods. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 518–539, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Are LLM Benchmarks Already Contaminated? A Systematic Review of Contamination Detection Methods (Nourbakhsh et al., GEM 2026)
PDF: https://aclanthology.org/2026.gem-main.50.pdf
