KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

Zhangchen Xu; Yang Liu; Yueqin Yin; Mingyuan Zhou; Radha Poovendran

doi:10.18653/v1/2025.findings-acl.365

Abstract

We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question–solution–test triplets that are systematically validated via a self-verification procedure. Our pipeline first synthesizes a broad range of coding questions, then generates solutions and test cases, allocating additional attempts to challenging problems. Finally, post-training data are synthesized by rewriting questions into diverse formats and generating responses from a reasoning model (DeepSeek R1) under a test-based rejection sampling procedure. The pipeline yields a large-scale, robust, and diverse coding dataset that is suitable for supervised fine-tuning; its paired unit tests also make it a promising resource for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
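
The self-verification and test-based filtering idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `passes_tests` and `self_verify`, the `max_attempts` parameter, and the convention that tests are plain functions prefixed with `test_` are all assumptions made for illustration. The sketch keeps a candidate solution only if its paired unit tests run without error, and it allows a configurable number of attempts per question, so that harder questions could be given a larger budget.

```python
from typing import Iterable, Optional, Tuple


def passes_tests(solution_src: str, test_src: str) -> bool:
    """Return True if every test_* function runs without raising.

    Hypothetical helper. It uses exec() with no sandboxing, so it is
    only suitable for trusted toy inputs, not for model-generated code.
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the candidate solution
        exec(test_src, namespace)      # define the paired unit tests
        for name, obj in list(namespace.items()):
            if name.startswith("test_") and callable(obj):
                obj()                  # a failed assert raises AssertionError
    except Exception:
        return False
    return True


def self_verify(
    candidates: Iterable[Tuple[str, str]],
    max_attempts: int,
) -> Optional[Tuple[str, str]]:
    """Return the first (solution, tests) pair that passes, or None.

    max_attempts is an assumed knob: a larger value models giving
    challenging questions more attempts before they are discarded.
    """
    for attempt, (solution_src, test_src) in enumerate(candidates):
        if attempt >= max_attempts:
            break
        if passes_tests(solution_src, test_src):
            return solution_src, test_src
    return None


# Toy usage: the first candidate is wrong, the second is correct.
bad = "def add(a, b):\n    return a - b\n"
good = "def add(a, b):\n    return a + b\n"
tests = "def test_add():\n    assert add(2, 3) == 5\n"

kept = self_verify([(bad, tests), (good, tests)], max_attempts=2)
```

With a budget of two attempts the correct pair is kept; with a budget of one the question would be dropped, which is the trade-off that motivates giving difficult problems extra attempts.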

Anthology ID: 2025.findings-acl.365
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6980–7008
URL: https://aclanthology.org/2025.findings-acl.365/
DOI: 10.18653/v1/2025.findings-acl.365
Cite (ACL): Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. 2025. KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6980–7008, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding (Xu et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.365.pdf
