MLDataForge: Accelerating Large-Scale Dataset Preprocessing and Access for Multimodal Foundation Model Training

Andrea Blasi Núñez, Lukas Paul Achatius Galke, Peter Schneider-Kamp


Abstract
Preprocessing large and possibly multimodal datasets remains a key bottleneck in many machine learning workflows, particularly when random access to samples is needed for global shuffling and sorting. Existing approaches, including widely used formats like JSONL and frameworks such as Huggingface Datasets and MosaicML Streaming, typically incur substantial computational, memory, and storage overhead in such settings. Here, we introduce MLDataForge, a Python-based open-source framework designed for scalable dataset preprocessing and access. Our key contributions are: (1) optimized readers for Mosaic Data Shards (MDS) that substantially improve throughput, reduce peak storage usage, and support sample-level compression; (2) JINX (JSON Indexed ’N’ eXtended), a novel, index-augmented JSONL-compatible format supporting structured footers and binary sidecar files; and (3) a lazy-loading mechanism that defers data loading, decompression, and decoding of JINX files until sample fields are accessed. We empirically evaluate MLDataForge and our contributions on a representative 200 GB supervised fine-tuning dataset for vision-language models. Our best configuration – zstd-compressed JINX with binary sidecar and lazy loading – yields at least a decimal order-of-magnitude throughput increase compared to the best baselines for iteration, global shuffling, and sorting. These advances enable substantial gains in data preprocessing performance, facilitating more scalable and resource-efficient model training pipelines.
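
To illustrate the lazy-loading idea described in the abstract, the sketch below shows a minimal index-augmented JSONL reader in Python. The file layout, class names, and methods are hypothetical stand-ins and are not taken from the JINX specification or the MLDataForge API; the sketch only assumes that per-line byte offsets are stored in a small sidecar index, so that a sample can be located in O(1) and its JSON payload is read and decoded only when a field is actually accessed.

    # Illustrative sketch only: a minimal index-augmented JSONL reader with lazy
    # sample decoding. File layout and names are hypothetical, not the JINX format.
    import json

    class IndexedJsonlReader:
        """Random access into a JSONL file via a sidecar index of byte offsets."""

        def __init__(self, data_path: str, index_path: str):
            self._file = open(data_path, "rb")
            # Assumption: the index is a JSON list with one byte offset per line.
            with open(index_path, "r", encoding="utf-8") as f:
                self._offsets = json.load(f)

        def __len__(self) -> int:
            return len(self._offsets)

        def __getitem__(self, i: int) -> "LazySample":
            # No read or decode happens here; work is deferred to field access.
            return LazySample(self._file, self._offsets[i])

    class LazySample:
        """Decodes the underlying JSON line only on first field access."""

        def __init__(self, file, offset: int):
            self._file = file
            self._offset = offset
            self._decoded = None

        def __getitem__(self, key: str):
            if self._decoded is None:
                self._file.seek(self._offset)
                self._decoded = json.loads(self._file.readline())
            return self._decoded[key]

    # Usage (hypothetical file names):
    # ds = IndexedJsonlReader("train.jsonl", "train.index.json")
    # sample = ds[42]         # no I/O or decoding yet
    # text = sample["text"]   # the line is read and decoded here

Global shuffling or sorting then only needs to permute the lightweight offset index rather than rewriting the data file, which is the kind of saving the abstract's throughput claims refer to.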
Anthology ID:
2025.ranlp-1.21
Volume:
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Month:
September
Year:
2025
Address:
Varna, Bulgaria
Editors:
Galia Angelova, Maria Kunilovskaya, Marie Escribe, Ruslan Mitkov
Venue:
RANLP
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
175–183
URL:
https://aclanthology.org/2025.ranlp-1.21/
Cite (ACL):
Andrea Blasi Núñez, Lukas Paul Achatius Galke, and Peter Schneider-Kamp. 2025. MLDataForge: Accelerating Large-Scale Dataset Preprocessing and Access for Multimodal Foundation Model Training. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, pages 175–183, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
MLDataForge: Accelerating Large-Scale Dataset Preprocessing and Access for Multimodal Foundation Model Training (Blasi Núñez et al., RANLP 2025)
PDF:
https://aclanthology.org/2025.ranlp-1.21.pdf