@inproceedings{jang-etal-2026-irpo,
title = "{IRPO}: Implicit Policy Regularized Preference Optimization",
author = "Jang, Youngsoo and
Kim, Yu Jin and
Kim, Geon-Hyeong and
Lee, Honglak and
Lee, Moontae",
editor = "Demberg, Vera and
Inui, Kentaro and
Marquez, Llu{\'i}s",
booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {EACL} 2026",
month = mar,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.findings-eacl.281/",
pages = "5304--5325",
ISBN = "979-8-89176-386-9",
abstract = "Training complexity often scales with the size of hyperparameter space for Large Language Models (LLMs). While Direct Preference Optimization (DPO) offers learning stability through reparameterizing the reward function, its regularization against the reference policy can lead to suboptimal outcomes when the reference policy is not optimal. Recent DPO variants address this concern, but at a cost: they introduce additional hyperparameters, reducing feasibility for LLM fine-tuning. To overcome this challenge, we introduce Implicit policy Regularized Preference Optimization (IRPO), which tackles suboptimality while maintaining training simplicity. By treating the winning policy that generated the chosen responses in a pairwise dataset as an implicit policy, IRPO maximizes KL-regularized reward without extra hyperparameters. Then we propose a novel PO algorithm that directly optimizes the IRPO objective by estimating the likelihood ratio between implicit policies. As the winning policy generally outperforms the reference policy, IRPO can effectively address suboptimality. Our experiments show that IRPO significantly outperforms baseline algorithms with the same hyperparameter complexity. Moreover, IRPO demonstrates comparable performance to recent algorithms that rely on a larger number of hyperparameters, offering a practical solution for scalable LLM fine-tuning."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="jang-etal-2026-irpo">
<titleInfo>
<title>IRPO: Implicit Policy Regularized Preference Optimization</title>
</titleInfo>
<name type="personal">
<namePart type="given">Youngsoo</namePart>
<namePart type="family">Jang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yu</namePart>
<namePart type="given">Jin</namePart>
<namePart type="family">Kim</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Geon-Hyeong</namePart>
<namePart type="family">Kim</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Honglak</namePart>
<namePart type="family">Lee</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Moontae</namePart>
<namePart type="family">Lee</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2026-03</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: EACL 2026</title>
</titleInfo>
<name type="personal">
<namePart type="given">Vera</namePart>
<namePart type="family">Demberg</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kentaro</namePart>
<namePart type="family">Inui</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lluís</namePart>
<namePart type="family">Marquez</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Rabat, Morocco</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-386-9</identifier>
</relatedItem>
<abstract>Training complexity often scales with the size of hyperparameter space for Large Language Models (LLMs). While Direct Preference Optimization (DPO) offers learning stability through reparameterizing the reward function, its regularization against the reference policy can lead to suboptimal outcomes when the reference policy is not optimal. Recent DPO variants address this concern, but at a cost: they introduce additional hyperparameters, reducing feasibility for LLM fine-tuning. To overcome this challenge, we introduce Implicit policy Regularized Preference Optimization (IRPO), which tackles suboptimality while maintaining training simplicity. By treating the winning policy that generated the chosen responses in a pairwise dataset as an implicit policy, IRPO maximizes KL-regularized reward without extra hyperparameters. Then we propose a novel PO algorithm that directly optimizes the IRPO objective by estimating the likelihood ratio between implicit policies. As the winning policy generally outperforms the reference policy, IRPO can effectively address suboptimality. Our experiments show that IRPO significantly outperforms baseline algorithms with the same hyperparameter complexity. Moreover, IRPO demonstrates comparable performance to recent algorithms that rely on a larger number of hyperparameters, offering a practical solution for scalable LLM fine-tuning.</abstract>
<identifier type="citekey">jang-etal-2026-irpo</identifier>
<location>
<url>https://aclanthology.org/2026.findings-eacl.281/</url>
</location>
<part>
<date>2026-03</date>
<extent unit="page">
<start>5304</start>
<end>5325</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T IRPO: Implicit Policy Regularized Preference Optimization
%A Jang, Youngsoo
%A Kim, Yu Jin
%A Kim, Geon-Hyeong
%A Lee, Honglak
%A Lee, Moontae
%Y Demberg, Vera
%Y Inui, Kentaro
%Y Marquez, Lluís
%S Findings of the Association for Computational Linguistics: EACL 2026
%D 2026
%8 March
%I Association for Computational Linguistics
%C Rabat, Morocco
%@ 979-8-89176-386-9
%F jang-etal-2026-irpo
%X Training complexity often scales with the size of the hyperparameter space for Large Language Models (LLMs). While Direct Preference Optimization (DPO) offers learning stability through reparameterizing the reward function, its regularization against the reference policy can lead to suboptimal outcomes when the reference policy is not optimal. Recent DPO variants address this concern, but at a cost: they introduce additional hyperparameters, reducing feasibility for LLM fine-tuning. To overcome this challenge, we introduce Implicit policy Regularized Preference Optimization (IRPO), which tackles suboptimality while maintaining training simplicity. By treating the winning policy that generated the chosen responses in a pairwise dataset as an implicit policy, IRPO maximizes KL-regularized reward without extra hyperparameters. We then propose a novel PO algorithm that directly optimizes the IRPO objective by estimating the likelihood ratio between implicit policies. As the winning policy generally outperforms the reference policy, IRPO can effectively address suboptimality. Our experiments show that IRPO significantly outperforms baseline algorithms with the same hyperparameter complexity. Moreover, IRPO demonstrates comparable performance to recent algorithms that rely on a larger number of hyperparameters, offering a practical solution for scalable LLM fine-tuning.
%U https://aclanthology.org/2026.findings-eacl.281/
%P 5304-5325
Markdown (Informal):
[IRPO: Implicit Policy Regularized Preference Optimization](https://aclanthology.org/2026.findings-eacl.281/) (Jang et al., Findings 2026)

ACL:
Youngsoo Jang, Yu Jin Kim, Geon-Hyeong Kim, Honglak Lee, and Moontae Lee. 2026. IRPO: Implicit Policy Regularized Preference Optimization. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5304–5325, Rabat, Morocco. Association for Computational Linguistics.
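For readers skimming this record, the objective the abstract describes can be sketched in standard notation. This is a hedged reconstruction from the abstract alone, not the paper's exact formulation: the first objective is DPO's well-known KL-regularized reward maximization (Rafailov et al., 2023); the second swaps the reference policy $\pi_{\mathrm{ref}}$ for the implicit winning policy $\pi_w$ that generated the chosen responses in the pairwise dataset, as the abstract describes. The reward $r$, temperature $\beta$, and dataset $\mathcal{D}$ follow DPO's conventions and are assumptions here.

```latex
% DPO's KL-regularized objective (standard form, Rafailov et al., 2023):
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, D_{\mathrm{KL}}\!\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% IRPO as inferred from the abstract (a sketch, not the paper's exact objective):
% regularize toward the implicit winning policy \pi_w instead of \pi_{ref},
% reusing DPO's single temperature \beta, hence no extra hyperparameters.
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, D_{\mathrm{KL}}\!\big( \pi(\cdot \mid x) \,\big\|\, \pi_{w}(\cdot \mid x) \big)
```

Per the abstract, $\pi_w$ is never trained or sampled explicitly; the proposed algorithm instead estimates the likelihood ratio between implicit policies, which is what keeps the hyperparameter count at DPO's level.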