@inproceedings{chen-etal-2025-mig,
title = "{MIG}: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space",
author = "Chen, Yicheng and
Li, Yining and
Hu, Kai and
Ma, Zerun and
Ye, Haochen and
Chen, Kai",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.515/",
doi = "10.18653/v1/2025.findings-acl.515",
pages = "9902--9915",
ISBN = "979-8-89176-256-5",
abstract = "Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5{\%} Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73{\%} on AlpacaEval and +6.89{\%} on Wildbench."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="chen-etal-2025-mig">
<titleInfo>
<title>MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yicheng</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yining</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kai</namePart>
<namePart type="family">Hu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ma</namePart>
<namePart type="family">Zerun</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">HaochenYe</namePart>
<namePart type="family">HaochenYe</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kai</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-256-5</identifier>
</relatedItem>
<abstract>Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.</abstract>
<identifier type="citekey">chen-etal-2025-mig</identifier>
<identifier type="doi">10.18653/v1/2025.findings-acl.515</identifier>
<location>
<url>https://aclanthology.org/2025.findings-acl.515/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>9902</start>
<end>9915</end>
</extent>
</part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
%A Chen, Yicheng
%A Li, Yining
%A Hu, Kai
%A Ma, Zerun
%A Ye, Haochen
%A Chen, Kai
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Findings of the Association for Computational Linguistics: ACL 2025
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-256-5
%F chen-etal-2025-mig
%X Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.
%R 10.18653/v1/2025.findings-acl.515
%U https://aclanthology.org/2025.findings-acl.515/
%U https://doi.org/10.18653/v1/2025.findings-acl.515
%P 9902-9915

Markdown (Informal)
[MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space](https://aclanthology.org/2025.findings-acl.515/) (Chen et al., Findings 2025)

ACL
Yicheng Chen, Yining Li, Kai Hu, Zerun Ma, Haochen Ye, and Kai Chen. 2025. MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9902–9915, Vienna, Austria. Association for Computational Linguistics.
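
The selection loop described in the abstract (score each candidate by the information its labels would add to a running label-graph distribution, take the best, update, repeat) can be sketched in a few lines. The Python sketch below is illustrative only: the `labels` and `quality` fields, the square-root utility, and the omission of information propagation along label-graph edges are all simplifying assumptions, not the authors' implementation.

```python
import math


def utility(mass: float) -> float:
    """Concave per-label utility; diminishing returns reward diversity.

    The square root is an assumption, not the paper's exact scoring.
    """
    return math.sqrt(mass)


def mig_select(samples: list[dict], budget: int) -> list[dict]:
    """Greedily pick `budget` samples maximizing marginal information gain.

    Each sample is a dict with hypothetical fields: 'labels' (a set of
    semantic tags) and 'quality' (a non-negative score). Propagation of
    information along label-graph edges is omitted in this sketch.
    """
    mass: dict[str, float] = {}  # accumulated information per label node
    pool, selected = list(samples), []
    for _ in range(min(budget, len(pool))):

        def gain(s: dict) -> float:
            # Marginal gain: utility increase summed over the sample's labels.
            return sum(
                utility(mass.get(l, 0.0) + s["quality"]) - utility(mass.get(l, 0.0))
                for l in s["labels"]
            )

        best = max(pool, key=gain)
        pool.remove(best)
        selected.append(best)
        for l in best["labels"]:
            mass[l] = mass.get(l, 0.0) + best["quality"]
    return selected


if __name__ == "__main__":
    data = [
        {"labels": {"math", "reasoning"}, "quality": 0.9},
        {"labels": {"math"}, "quality": 0.8},
        {"labels": {"coding"}, "quality": 0.7},
    ]
    # Picks the math/reasoning sample first, then the coding one:
    # the saturated 'math' node makes the second math sample less valuable.
    print(mig_select(data, budget=2))
```

Because the utility is concave, a label node that already holds mass yields a smaller marginal gain, so the greedy step naturally trades raw sample quality against coverage of under-represented regions of the semantic space.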