Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement

Maosong Cao; Taolin Zhang; Mo Li; Chuyu Zhang; Yunxin Liu; Haodong Duan; Songyang Zhang; Kai Chen

doi:10.18653/v1/2025.acl-long.1091

Abstract

The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, the availability of high-quality human-annotated SFT data has become a significant bottleneck for LLMs, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a two-stage synthetic data generation framework that incorporates World Knowledge Trees and Self-Reflection Refinement to produce high-quality SFT data at scale. Our experimental results demonstrate that a base model fine-tuned on only 20K Condor-generated samples achieves superior performance compared to an instruct model trained with RLHF. The additional refinement stage in Condor further enables iterative self-improvement for LLMs at various scales (up to 72B), validating the effectiveness of our approach. Furthermore, our investigation into the scaling of synthetic data in post-training reveals substantial unexplored potential for performance improvements, opening promising avenues for future research.

Anthology ID: 2025.acl-long.1091
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 22392–22412
URL: https://aclanthology.org/2025.acl-long.1091/
DOI: 10.18653/v1/2025.acl-long.1091
Cite (ACL): Maosong Cao, Taolin Zhang, Mo Li, Chuyu Zhang, Yunxin Liu, Haodong Duan, Songyang Zhang, and Kai Chen. 2025. Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22392–22412, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement (Cao et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.1091.pdf
