A Strategic Coordination Framework of Small LMs Matches Large LMs in Data Synthesis

Xin Gao; Qizhi Pei; Zinan Tang; Yu Li; Honglin Lin; Jiang Wu; Lijun Wu; Conghui He

doi:10.18653/v1/2025.acl-long.566

Abstract

While data synthesis and distillation are promising strategies to enhance small language models, current approaches rely heavily on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework involving multiple small LMs that aggregates specialized roles across them to achieve the iterative refinement and quality control typically achieved by a single large LM. In this collaborative framework, multiple small LMs assume distinct roles (Generator, Reviewer, and Adjudicator) to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborative small LMs can achieve data-level parity with distillation from large LMs. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of outputs from a single large LM, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents.
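The Generator, Reviewer, and Adjudicator roles described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function `synthesize`, the `Review` type, the use of two reviewers, the accept/reject verdicts, and the rule that the Adjudicator is consulted only when reviewers disagree are all assumptions made for illustration, and the toy role functions stand in for calls to small LMs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    accept: bool   # the reviewer's verdict on the sample
    comment: str   # free-text critique


# Hypothetical role signatures; in practice each would wrap a small LM call.
Generator = Callable[[str], str]
Reviewer = Callable[[str], Review]
Adjudicator = Callable[[str, List[Review]], bool]


def synthesize(seeds: List[str], generator: Generator,
               reviewers: List[Reviewer], adjudicator: Adjudicator) -> List[str]:
    """Keep the generated samples that pass review, asking the
    adjudicator only when the reviewers' verdicts conflict (assumed rule)."""
    kept = []
    for seed in seeds:
        sample = generator(seed)
        reviews = [review(sample) for review in reviewers]
        verdicts = {r.accept for r in reviews}
        if len(verdicts) == 1:
            accepted = verdicts.pop()                  # reviewers agree
        else:
            accepted = adjudicator(sample, reviews)    # conflict: adjudicate
        if accepted:
            kept.append(sample)
    return kept


# Toy stand-ins for the three roles (illustrative only, no LM involved).
def toy_generator(seed: str) -> str:
    return f"Q: What is {seed}?"


def length_reviewer(sample: str) -> Review:
    return Review(len(sample) > 14, "length check")


def keyword_reviewer(sample: str) -> Review:
    return Review("bad" not in sample, "keyword check")


def toy_adjudicator(sample: str, reviews: List[Review]) -> bool:
    # Sides with the content check over the length check.
    return "bad" not in sample


kept = synthesize(["entropy", "bad", "x"], toy_generator,
                  [length_reviewer, keyword_reviewer], toy_adjudicator)
```

In this toy run the first sample is accepted by both reviewers, while the other two trigger a disagreement that the adjudicator settles, rejecting one and keeping the other.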

Anthology ID: 2025.acl-long.566
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 11552–11570
URL: https://aclanthology.org/2025.acl-long.566/
DOI: 10.18653/v1/2025.acl-long.566
Cite (ACL): Xin Gao, Qizhi Pei, Zinan Tang, Yu Li, Honglin Lin, Jiang Wu, Lijun Wu, and Conghui He. 2025. A Strategic Coordination Framework of Small LMs Matches Large LMs in Data Synthesis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11552–11570, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): A Strategic Coordination Framework of Small LMs Matches Large LMs in Data Synthesis (Gao et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.566.pdf
