Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

Martin Riddell; Ansong Ni; Arman Cohan

doi:10.18653/v1/2024.acl-long.761

Abstract

While large language models have achieved remarkable performance on various code generation benchmarks, there have been growing concerns that these benchmarks may be contaminated, as they may have leaked into pretraining and finetuning data. While recent work has investigated contamination in natural language generation and understanding tasks, there has been less research into how data contamination impacts the evaluation of code generation, which is critical for understanding the robustness and reliability of LLMs in programming contexts. In this work, we perform a comprehensive study of data contamination of popular code generation benchmarks, and precisely quantify their overlap with pretraining corpora through both surface-level and semantic-level matching. In our experiments, we show that there is substantial overlap between popular code generation benchmarks and open training corpora, and that models perform significantly better on the subset of the benchmarks for which similar solutions were seen during training. We also conduct extensive analysis of the factors that affect model memorization and generalization, such as model size, problem difficulty, and question length. We release all resulting files from our matching pipeline for future research.
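To make the idea of surface-level matching concrete, the following is a minimal sketch, not the pipeline used in the paper. It assumes a character-level similarity from Python's standard `difflib` as a stand-in for whatever metric the authors use, and the function names `surface_similarity` and `best_window_match` are hypothetical. It slides a window the length of a benchmark solution across a training document and reports the best score on a 0 to 100 scale.

```python
from difflib import SequenceMatcher


def surface_similarity(a: str, b: str) -> float:
    """Return a 0-100 character-level similarity score (assumed stand-in metric)."""
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_window_match(solution: str, document: str, stride: int = 1) -> float:
    """Slide a window of len(solution) over document and return the best score."""
    width = len(solution)
    if width == 0 or len(document) == 0:
        return 0.0
    if len(document) <= width:
        # Document is no longer than the solution: compare them directly.
        return surface_similarity(solution, document)
    best = 0.0
    for start in range(0, len(document) - width + 1, stride):
        window = document[start:start + width]
        best = max(best, surface_similarity(solution, window))
    return best


if __name__ == "__main__":
    gold = "def add(a, b):\n    return a + b"
    corpus_doc = "# utils\ndef add(a, b):\n    return a + b\n"
    print(best_window_match(gold, corpus_doc))
```

A benchmark problem could then be flagged as potentially contaminated when its best score over the corpus exceeds some threshold; the threshold, like the metric, would be a choice made by whoever runs the analysis. Semantic-level matching, which tolerates renamed variables or reordered statements, needs a program-aware comparison and is not covered by this sketch.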

Anthology ID: 2024.acl-long.761
Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 14116–14137
URL: https://aclanthology.org/2024.acl-long.761
DOI: 10.18653/v1/2024.acl-long.761
Cite (ACL): Martin Riddell, Ansong Ni, and Arman Cohan. 2024. Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14116–14137, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models (Riddell et al., ACL 2024)
PDF: https://aclanthology.org/2024.acl-long.761.pdf
