Test of Time: Rethinking Temporal Signal of Benchmark Contamination

Terry Jingchen Zhang; Gopal Dev; Ning Wang; Max Obreiter; Wenyuan Jiang; Punya Syon Pandey; Keenan Samway; Yinya Huang; Bernhard Schölkopf; Mrinmaya Sachan; Zhijing Jin

Abstract

Post-cutoff performance decay has been widely interpreted as a temporal signal for benchmark contamination. We critically examine this belief and demonstrate that this temporal signal is highly sensitive to how benchmark questions are constructed. Specifically, we show that LLM-generated questions can produce remarkably different temporal patterns from fill-in-the-blank questions retrieved directly from the very same materials. We validate this finding on previous benchmarks that reported clear post-cutoff performance decay, such as LiveCodeBench, and further show that a simple LLM transformation can effectively remove this temporal pattern when evaluated on the same models. We also provide a mechanistic understanding of our observation using influence function analysis. Overall, this work offers a new perspective on the sensitivity of the temporal contamination signal and highlights the need for more robust contamination detection methods for reliable AI evaluation.

Anthology ID: 2026.acl-long.1693
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 36538–36555
URL: https://aclanthology.org/2026.acl-long.1693/
Cite (ACL): Terry Jingchen Zhang, Gopal Dev, Ning Wang, Max Obreiter, Wenyuan Jiang, Punya Syon Pandey, Keenan Samway, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, and Zhijing Jin. 2026. Test of Time: Rethinking Temporal Signal of Benchmark Contamination. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 36538–36555, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Test of Time: Rethinking Temporal Signal of Benchmark Contamination (Zhang et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1693.pdf
Checklist: 2026.acl-long.1693.checklist.pdf
