@inproceedings{zhou-etal-2025-difflm,
title = "{D}iff{LM}: Controllable Synthetic Data Generation via Diffusion Language Models",
author = "Zhou, Ying and
Wang, Xinyao and
Niu, Yulei and
Shen, Yaojie and
Tang, Lexin and
Chen, Fan and
He, Ben and
Sun, Le and
Wen, Longyin",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.1061/",
doi = "10.18653/v1/2025.findings-acl.1061",
pages = "20638--20658",
ISBN = "979-8-89176-256-5",
abstract = "Recent advancements in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in leveraging LLMs for high-quality data synthesis. However, synthetic data generation via prompting LLMs remains challenging due to LLMs' limited understanding of target data distributions and the complexity of prompt engineering, especially for structured formatted data. To address these issues, we introduce DiffLM, a controllable data synthesis framework based on variational autoencoder (VAE), which further (1) leverages diffusion models to reserve more information of original distribution and format structure in the learned latent distribution and (2) decouples the learning of target distribution knowledge from the LLM{'}s generative objectives via a plug-and-play latent feature injection module. As we observed significant discrepancies between the VAE{'}s latent representations and the real data distribution, the latent diffusion module is introduced into our framework to learn a fully expressive latent distribution. Evaluations on seven real-world datasets with structured formatted data (i.e., Tabular, Code, and Tool data) demonstrate that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2{\%}{--}7{\%} in certain cases. Data and code are available at https://github.com/bytedance/DiffLM."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="zhou-etal-2025-difflm">
<titleInfo>
<title>DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models</title>
</titleInfo>
<name type="personal">
<namePart type="given">Ying</namePart>
<namePart type="family">Zhou</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xinyao</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yulei</namePart>
<namePart type="family">Niu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yaojie</namePart>
<namePart type="family">Shen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lexin</namePart>
<namePart type="family">Tang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Fan</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ben</namePart>
<namePart type="family">He</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Le</namePart>
<namePart type="family">Sun</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Longyin</namePart>
<namePart type="family">Wen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-256-5</identifier>
</relatedItem>
<abstract>Recent advancements in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in leveraging LLMs for high-quality data synthesis. However, synthetic data generation via prompting LLMs remains challenging due to LLMs’ limited understanding of target data distributions and the complexity of prompt engineering, especially for structured formatted data. To address these issues, we introduce DiffLM, a controllable data synthesis framework based on variational autoencoder (VAE), which further (1) leverages diffusion models to reserve more information of original distribution and format structure in the learned latent distribution and (2) decouples the learning of target distribution knowledge from the LLM’s generative objectives via a plug-and-play latent feature injection module. As we observed significant discrepancies between the VAE’s latent representations and the real data distribution, the latent diffusion module is introduced into our framework to learn a fully expressive latent distribution. Evaluations on seven real-world datasets with structured formatted data (i.e., Tabular, Code, and Tool data) demonstrate that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2%–7% in certain cases. Data and code are available at https://github.com/bytedance/DiffLM.</abstract>
<identifier type="citekey">zhou-etal-2025-difflm</identifier>
<identifier type="doi">10.18653/v1/2025.findings-acl.1061</identifier>
<location>
<url>https://aclanthology.org/2025.findings-acl.1061/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>20638</start>
<end>20658</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models
%A Zhou, Ying
%A Wang, Xinyao
%A Niu, Yulei
%A Shen, Yaojie
%A Tang, Lexin
%A Chen, Fan
%A He, Ben
%A Sun, Le
%A Wen, Longyin
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Findings of the Association for Computational Linguistics: ACL 2025
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-256-5
%F zhou-etal-2025-difflm
%X Recent advancements in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in leveraging LLMs for high-quality data synthesis. However, synthetic data generation via prompting LLMs remains challenging due to LLMs’ limited understanding of target data distributions and the complexity of prompt engineering, especially for structured formatted data. To address these issues, we introduce DiffLM, a controllable data synthesis framework based on variational autoencoder (VAE), which further (1) leverages diffusion models to reserve more information of original distribution and format structure in the learned latent distribution and (2) decouples the learning of target distribution knowledge from the LLM’s generative objectives via a plug-and-play latent feature injection module. As we observed significant discrepancies between the VAE’s latent representations and the real data distribution, the latent diffusion module is introduced into our framework to learn a fully expressive latent distribution. Evaluations on seven real-world datasets with structured formatted data (i.e., Tabular, Code, and Tool data) demonstrate that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2%–7% in certain cases. Data and code are available at https://github.com/bytedance/DiffLM.
%R 10.18653/v1/2025.findings-acl.1061
%U https://aclanthology.org/2025.findings-acl.1061/
%U https://doi.org/10.18653/v1/2025.findings-acl.1061
%P 20638-20658
Markdown (Informal)
[DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models](https://aclanthology.org/2025.findings-acl.1061/) (Zhou et al., Findings 2025)

ACL
Ying Zhou, Xinyao Wang, Yulei Niu, Yaojie Shen, Lexin Tang, Fan Chen, Ben He, Le Sun, and Longyin Wen. 2025. DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20638–20658, Vienna, Austria. Association for Computational Linguistics.