sDPO: Don’t Use Your Data All at Once

Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, Chanjun Park


Abstract
As large language models (LLMs) continue to advance, aligning them with human preferences has become a critical objective. In this paper, we introduce stepwise DPO (sDPO), an innovative extension of the recently popularized Direct Preference Optimization (DPO) technique for alignment tuning. sDPO systematically partitions the available preference datasets and applies them incrementally, rather than utilizing the entire dataset simultaneously. This stepwise approach enables the integration of progressively more aligned reference models within the DPO training framework. Our empirical results demonstrate that sDPO not only enhances the alignment precision of reference models but also significantly improves the overall performance of the final model, surpassing other prominent LLMs with larger parameter counts.
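
The abstract describes sDPO as a loop over preference-data partitions in which the policy trained on one partition becomes the reference model for the next. The following is a minimal PyTorch sketch of that loop; the helper `sequence_logprob`, the chunking scheme, and the hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
import copy
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, response_ids):
    # Sum of token log-probabilities of the response given the prompt
    # (assumes a Hugging Face-style causal LM that returns `.logits`).
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(input_ids).logits[:, prompt_ids.size(-1) - 1 : -1, :]
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1).sum(-1)

def dpo_loss(policy, ref, batch, beta=0.1):
    # Standard DPO objective: reward margin of the chosen response over the
    # rejected one, measured relative to a frozen reference model.
    pi_w = sequence_logprob(policy, batch["prompt"], batch["chosen"])
    pi_l = sequence_logprob(policy, batch["prompt"], batch["rejected"])
    with torch.no_grad():
        ref_w = sequence_logprob(ref, batch["prompt"], batch["chosen"])
        ref_l = sequence_logprob(ref, batch["prompt"], batch["rejected"])
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(margin).mean()

def sdpo_train(policy, dataset_chunks, optimizer, beta=0.1):
    # Stepwise DPO: train on one preference-data partition at a time and
    # promote the just-trained policy to be the reference for the next step,
    # rather than using the whole dataset with a single fixed SFT reference.
    ref = copy.deepcopy(policy).eval()
    for chunk in dataset_chunks:            # each chunk is one partition of the preference data
        for batch in chunk:
            loss = dpo_loss(policy, ref, batch, beta)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        ref = copy.deepcopy(policy).eval()  # more-aligned reference for the next step
    return policy
```

Promoting the previous step's policy to the reference role is what distinguishes this loop from vanilla DPO, which keeps the initial SFT model as the reference for the entire dataset.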
Anthology ID: 2025.coling-industry.31
Volume: Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Darwish, Apoorv Agarwal
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 366–373
URL: https://aclanthology.org/2025.coling-industry.31/
Cite (ACL): Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, and Chanjun Park. 2025. sDPO: Don’t Use Your Data All at Once. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 366–373, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): sDPO: Don’t Use Your Data All at Once (Kim et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-industry.31.pdf