DIDS: Domain Impact-aware Data Sampling for Large Language Model Training

Weijie Shi; Jipeng Zhang; Yaguang Wu; Jingzhi Fang; Shibo Zhang; Yao Zhao; Hao Chen; Ruiyuan Zhang; Yue Cui; Jia Zhu; Sirui Han; Jiajie Xu; Xiaofang Zhou

doi:10.18653/v1/2025.emnlp-main.215

Abstract

Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model’s output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency. The code is available at https://github.com/shiweijiezero/DIDS.
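The gradient clustering step described in the abstract (group training examples by their learning effect, with dimensionality reduction to cut cost) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which reduction or clustering method DIDS uses, so a Gaussian random projection and plain k-means are swapped in, the toy vectors stand in for per-example gradients that would come from a proxy model, and the function names `random_projection` and `kmeans` are hypothetical.

```python
import random


def random_projection(vectors, out_dim, seed=0):
    """Map each vector to out_dim dimensions using one shared Gaussian matrix.

    Assumption: random projection stands in for the paper's unspecified
    dimensionality reduction step.
    """
    rng = random.Random(seed)
    in_dim = len(vectors[0])
    scale = out_dim ** -0.5  # keeps squared norms unchanged in expectation
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
              for _ in range(out_dim)]
    return [[sum(w * x for w, x in zip(row, vec)) for row in matrix]
            for vec in vectors]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point.

    Assumption: k-means stands in for the paper's clustering algorithm.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy stand-ins for per-example gradients: two groups pointing in
# opposite directions along the first coordinate.
group_a = [[1.0 + 0.01 * i, 0.0, 0.0, 0.0, 0.01 * i, 0.0, 0.0, 0.0]
           for i in range(4)]
group_b = [[-1.0 - 0.01 * i, 0.0, 0.0, 0.0, 0.01 * i, 0.0, 0.0, 0.0]
           for i in range(4)]

reduced = random_projection(group_a + group_b, out_dim=3)
labels = kmeans(reduced, k=2)
```

In this sketch, examples whose gradients point in similar directions end up with the same label, which is the sense in which a cluster could be treated as one "domain" for sampling. Real gradient vectors would be far higher-dimensional, which is why a reduction step of some kind precedes the clustering.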

Anthology ID: 2025.emnlp-main.215
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 4330–4350
URL: https://aclanthology.org/2025.emnlp-main.215/
DOI: 10.18653/v1/2025.emnlp-main.215
Cite (ACL): Weijie Shi, Jipeng Zhang, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Yao Zhao, Hao Chen, Ruiyuan Zhang, Yue Cui, Jia Zhu, Sirui Han, Jiajie Xu, and Xiaofang Zhou. 2025. DIDS: Domain Impact-aware Data Sampling for Large Language Model Training. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4330–4350, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): DIDS: Domain Impact-aware Data Sampling for Large Language Model Training (Shi et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.215.pdf
Checklist: 2025.emnlp-main.215.checklist.pdf
