2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision

Shilong Li; Yancheng He; Hui Huang; Xingyuan Bu; Jiaheng Liu; Hangyu Guo; Weixun Wang; Jihao Gu; Wenbo Su; Bo Zheng

doi:10.18653/v1/2025.findings-naacl.455

Abstract

Recent advancements in Direct Preference Optimization (DPO) have significantly enhanced the alignment of Large Language Models (LLMs) with human preferences, owing to its simplicity and effectiveness. However, existing methods typically optimize a scalar score or ranking reward, thereby overlooking the multi-dimensional nature of human preferences. In this work, we propose to extend the preference of DPO to two dimensions: segments and aspects. We first introduce a 2D supervision dataset called HelpSteer-2D. For the segment dimension, we divide the response into sentences and assign scores to each segment. For the aspect dimension, we meticulously design several criteria covering the response quality rubrics. With the 2-dimensional signals as feedback, we develop a 2D-DPO framework, decomposing the overall objective into multi-segment and multi-aspect objectives. Extensive experiments on popular benchmarks demonstrate that 2D-DPO performs better than methods that optimize for scalar or 1-dimensional preferences.
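
The 2-dimensional supervision described above can be pictured as a grid with one row per segment (sentence) and one column per aspect. The snippet below is only an illustrative sketch of that data layout, not the authors' implementation: the aspect names, the regex sentence splitter, the example scores, and the weighted-mean aggregation are all assumptions made for the example, and the abstract does not specify how HelpSteer-2D or the 2D-DPO objective combine the scores.

```python
import re

# Hypothetical aspect names, chosen only for illustration.
ASPECTS = ["helpfulness", "correctness", "clarity"]


def split_into_segments(response):
    """Naively split a response into sentence-level segments."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def aggregate_segment_scores(score_grid, aspect_weights):
    """Collapse a segments-by-aspects grid into one score per segment.

    A weighted mean over aspects is an assumption of this sketch.
    """
    total = sum(aspect_weights[a] for a in ASPECTS)
    return [
        sum(aspect_weights[a] * row[a] for a in ASPECTS) / total
        for row in score_grid
    ]


response = "Paris is the capital of France. It has about two million residents."
segments = split_into_segments(response)

# One row per segment, one entry per aspect (made-up scores).
score_grid = [
    {"helpfulness": 4, "correctness": 5, "clarity": 3},
    {"helpfulness": 2, "correctness": 3, "clarity": 4},
]
weights = {a: 1.0 for a in ASPECTS}  # equal weights, also an assumption
segment_scores = aggregate_segment_scores(score_grid, weights)

for segment, score in zip(segments, segment_scores):
    print(f"{score:.2f}  {segment}")
```

Keeping the grid as one mapping per segment makes both dimensions explicit: iterating over rows gives segment-level signals, while reading a single key across rows gives an aspect-level view of the same response.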

Anthology ID: 2025.findings-naacl.455
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8164–8188
URL: https://aclanthology.org/2025.findings-naacl.455/
DOI: 10.18653/v1/2025.findings-naacl.455
Cite (ACL): Shilong Li, Yancheng He, Hui Huang, Xingyuan Bu, Jiaheng Liu, Hangyu Guo, Weixun Wang, Jihao Gu, Wenbo Su, and Bo Zheng. 2025. 2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 8164–8188, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): 2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision (Li et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-naacl.455.pdf
