Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

Somshubra Majumdar; Vahid Noroozi; Mehrzad Samadi; Sean Narenthiran; Aleksander Ficek; Wasi Ahmad; Jocelyn Huang; Jagadeesh Balam; Boris Ginsburg

doi:10.18653/v1/2025.acl-industry.16

Abstract

Large Language Models (LLMs) require high-quality instruction data for effective alignment, particularly in code generation tasks, where expert-curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high-quality coding instructions using evolutionary principles. Starting from a small set of seed instructions, Genetic-Instruct generates diverse and challenging instruction-code pairs by leveraging an Instructor-LLM for instruction generation, a Coder-LLM for code synthesis, and a Judge-LLM for automatic quality evaluation. The approach is highly parallelizable and remains effective even with a small amount of seed data and weaker generator models. We generated more than 7.5 million coding instructions with the proposed approach. We then evaluated the data by fine-tuning LLMs on the synthetic samples, which significantly improved their code generation capability compared to other synthetic generation approaches and publicly available datasets. Our results highlight the efficiency, scalability, and generalizability of the Genetic-Instruct framework.
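The three-role loop described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the function `evolve`, the stub models `stub_instructor`, `stub_coder`, and `stub_judge`, the choice of two parent instructions per step, and the rule that accepted instructions rejoin the population are all assumptions made for the example. In practice each stub would be a call to an actual LLM.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (instruction, code)


def evolve(
    seeds: List[str],
    instructor: Callable[[List[str]], str],
    coder: Callable[[str], str],
    judge: Callable[[str, str], bool],
    generations: int,
    batch_size: int,
    rng: random.Random,
) -> List[Sample]:
    """Grow a pool of instruction-code pairs from seed instructions.

    Assumed loop (hypothetical): pick parent instructions, ask the
    instructor for a new instruction, ask the coder for a solution,
    and keep the pair only if the judge accepts it.
    """
    population = list(seeds)
    accepted: List[Sample] = []
    for _ in range(generations):
        new_instructions: List[str] = []
        for _ in range(batch_size):
            # Sample up to two parents from the current population.
            parents = rng.sample(population, min(2, len(population)))
            instruction = instructor(parents)
            code = coder(instruction)
            if judge(instruction, code):
                accepted.append((instruction, code))
                new_instructions.append(instruction)
        # Accepted instructions become candidates for the next generation.
        population.extend(new_instructions)
    return accepted


# Hypothetical stand-ins for the three LLM roles.
def stub_instructor(parents: List[str]) -> str:
    return "Variant of: " + parents[0] if parents else "Write a function."


def stub_coder(instruction: str) -> str:
    return "def solution():\n    pass  # placeholder for: " + instruction[:40]


def stub_judge(instruction: str, code: str) -> bool:
    return code.startswith("def ")


if __name__ == "__main__":
    pairs = evolve(
        seeds=["Reverse a string.", "Sum a list of integers."],
        instructor=stub_instructor,
        coder=stub_coder,
        judge=stub_judge,
        generations=2,
        batch_size=3,
        rng=random.Random(0),
    )
    print(len(pairs))
```

Because each inner step depends only on the current population, the inner loop in this sketch could be run by independent workers, which is one way to read the abstract's claim that the approach is highly parallelizable.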

Anthology ID: 2025.acl-industry.16
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Georg Rehm, Yunyao Li
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 208–221
URL: https://aclanthology.org/2025.acl-industry.16/
DOI: 10.18653/v1/2025.acl-industry.16
Cite (ACL): Somshubra Majumdar, Vahid Noroozi, Mehrzad Samadi, Sean Narenthiran, Aleksander Ficek, Wasi Uddin Ahmad, Jocelyn Huang, Jagadeesh Balam, and Boris Ginsburg. 2025. Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 208–221, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models (Majumdar et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-industry.16.pdf
