Reasoning-Guided Exploration for Online DPO

Zetian Hu; Shunyu Liu; Ting-En Lin; Fei Huang; Yongbin Li; Dacheng Tao

Abstract

Recent work has aimed to enhance the reasoning capabilities of language models, but these methods are often limited to domains with objectively verifiable answers. To overcome this limitation, we introduce Reasoning-Guided Exploration for Online DPO (RGE-DPO), a novel self-play framework designed to improve reasoning on general-domain data. RGE-DPO employs a dual-reward mechanism that evaluates each response along two dimensions: (1) reasoning quality, using a self-rewarding rubric that provides structured evaluation of logical coherence, reasoning depth, and verification behaviors; and (2) response quality, using an established reward model trained for aspects like helpfulness and correctness. These two orthogonal evaluation signals enable a comprehensive assessment of different response dimensions without conflating reasoning processes with response content. We then integrate the two signals through a weighted ranking mechanism to construct the preference pairs, which ensures that responses with superior reasoning processes are preferred when response quality is comparable. Experiments demonstrate that RGE-DPO achieves substantial improvements in instruction-following benchmark performance while maintaining competitive performance on verifiable academic benchmarks.
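
The weighted ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the function names `ranks` and `build_preference_pair`, the use of average ranks, the linear combination of ranks, and the `reasoning_weight` value of 0.3 are all hypothetical and should not be read as the authors' implementation.

```python
from typing import List, Tuple


def ranks(scores: List[float]) -> List[float]:
    """Return average ranks (0 = worst); tied scores share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores equal to the one at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def build_preference_pair(
    responses: List[str],
    reasoning_scores: List[float],  # assumed: scores from the self-rewarding rubric
    response_scores: List[float],   # assumed: scores from the external reward model
    reasoning_weight: float = 0.3,  # hypothetical weight, not taken from the paper
) -> Tuple[str, str]:
    """Return (chosen, rejected) using a weighted sum of the two rank lists."""
    n = len(responses)
    if not (n == len(reasoning_scores) == len(response_scores)):
        raise ValueError("all inputs must have the same length")
    if n < 2:
        raise ValueError("need at least two responses to form a pair")

    reasoning_ranks = ranks(reasoning_scores)
    response_ranks = ranks(response_scores)
    combined = [
        reasoning_weight * a + (1.0 - reasoning_weight) * b
        for a, b in zip(reasoning_ranks, response_ranks)
    ]
    best = max(range(n), key=combined.__getitem__)
    worst = min(range(n), key=combined.__getitem__)
    return responses[best], responses[worst]


# Responses "a" and "b" tie on response quality; "a" has the stronger reasoning score.
chosen, rejected = build_preference_pair(
    ["a", "b", "c"],
    reasoning_scores=[0.9, 0.2, 0.5],
    response_scores=[0.7, 0.7, 0.1],
)
```

In this toy example the two responses with equal response-quality scores are separated by their reasoning ranks, which mirrors the stated goal of preferring better reasoning when response quality is comparable. Ranking rather than adding raw scores is one way to avoid depending on the two signals sharing a scale; whether the paper does the same is not stated in the abstract.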

Anthology ID: 2026.findings-acl.1370
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 27526–27542
URL: https://aclanthology.org/2026.findings-acl.1370/
Cite (ACL): Zetian Hu, Shunyu Liu, Ting-En Lin, Fei Huang, Yongbin Li, and Dacheng Tao. 2026. Reasoning-Guided Exploration for Online DPO. In Findings of the Association for Computational Linguistics: ACL 2026, pages 27526–27542, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Reasoning-Guided Exploration for Online DPO (Hu et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1370.pdf
Checklist: 2026.findings-acl.1370.checklist.pdf