An Open-Source Data Contamination Report for Large Language Models

Yucheng Li, Yunhao Guo, Frank Guerin, Chenghua Lin


Abstract
Data contamination in model evaluation has become increasingly prevalent with the growing popularity of large language models. It allows models to “cheat” via memorisation instead of demonstrating true capabilities. Contamination analysis has therefore become a crucial part of reliable model evaluation to validate results. However, existing contamination analysis is usually conducted internally by large language model developers and often lacks transparency and completeness. This paper presents an extensive data contamination report for over 15 popular large language models across six multiple-choice QA benchmarks. We also introduce an open-source pipeline that enables the community to perform contamination analysis on customised data and models. Our experiments reveal contamination levels ranging from 1% to 45% across benchmarks, with the degree of contamination increasing rapidly over time. Performance analysis indicates that data contamination does not necessarily inflate model metrics: while significant accuracy boosts of up to 14% and 7% are observed on the contaminated C-Eval and HellaSwag benchmarks, only a minimal increase is noted on contaminated MMLU. We also find that larger models appear to gain greater advantages than smaller models on contaminated test sets.
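The paper's open-source pipeline is not reproduced here; purely as an illustration of the kind of check such pipelines perform, the sketch below flags a benchmark item as contaminated when a sufficiently long n-gram from it also appears in a training corpus. All function names and the 13-gram window are assumptions made for this sketch (13-grams echo earlier internal reports such as GPT-3's), not the paper's actual method.

    # Hypothetical contamination check: an item counts as contaminated if any
    # of its n-grams occurs verbatim in the training corpus. Names and the
    # n=13 window are illustrative assumptions, not the paper's pipeline.

    def ngrams(text: str, n: int = 13):
        """Yield whitespace-tokenised n-grams of `text`."""
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

    def build_corpus_index(corpus_docs, n: int = 13) -> set:
        """Index every n-gram observed in the training corpus."""
        index = set()
        for doc in corpus_docs:
            index.update(ngrams(doc, n))
        return index

    def is_contaminated(test_item: str, corpus_index: set, n: int = 13) -> bool:
        """Flag a test item if any of its n-grams appears in the corpus index."""
        return any(g in corpus_index for g in ngrams(test_item, n))

    # Usage: estimate the contaminated fraction of a benchmark.
    corpus_index = build_corpus_index(training_documents)  # iterable of strings
    rate = sum(is_contaminated(q, corpus_index) for q in benchmark_items) / len(benchmark_items)
    print(f"Estimated contamination: {rate:.1%}")

A real pipeline would additionally normalise text, deduplicate matches, and stream the corpus rather than holding the full n-gram index in memory; the set-based index above trades memory for simplicity.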
Anthology ID:
2024.findings-emnlp.30
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
528–541
URL:
https://aclanthology.org/2024.findings-emnlp.30
Cite (ACL):
Yucheng Li, Yunhao Guo, Frank Guerin, and Chenghua Lin. 2024. An Open-Source Data Contamination Report for Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 528–541, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
An Open-Source Data Contamination Report for Large Language Models (Li et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.30.pdf