Demystifying Data Organization for Enhanced LLM Training

Yalun Dai; Yangyu Huang; Tongshen Yang; Yonghan Wang; Xin Zhang; Wenshan Wu; Qihao Zhao; Hao Li; Yuanyuan Gao; Kim-Hui Yap; Scarlett Li

Abstract

Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency depends heavily on effective data curation. While data selection has been widely studied, how to organize data strategically for training remains underexplored, particularly since current LLMs are often trained for only one or a few epochs. This paper systematically explores the influence of data organization on LLM training by reusing pre-computed sample-level scores originally generated for data efficiency, thereby incurring minimal additional computational overhead. We identify and formalize four key guidelines for optimizing data organization: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity. Guided by these, we introduce two novel data ordering methods, STR and SAW. Extensive experiments across different model scales and data sizes, covering both the pre-training and SFT stages, validate the effectiveness of these guidelines. They also demonstrate the robustness of the proposed data ordering methods in enhancing the stability and performance of LLM training.

Anthology ID: 2026.acl-long.1262
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 27358–27375
URL: https://aclanthology.org/2026.acl-long.1262/
Cite (ACL): Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang, Xin Zhang, Wenshan Wu, Qihao Zhao, Hao Li, Yuanyuan Gao, Kim-Hui Yap, and Scarlett Li. 2026. Demystifying Data Organization for Enhanced LLM Training. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27358–27375, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Demystifying Data Organization for Enhanced LLM Training (Dai et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1262.pdf
Checklist: 2026.acl-long.1262.checklist.pdf
