Direct Judgement Preference Optimization

PeiFeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty


Abstract
To meet the increasing need for timely and accurate evaluation of large language model (LLM) responses, training LLM-as-judges to evaluate and critique other model responses has emerged as a popular paradigm. However, existing judge models are largely trained with supervised finetuning (SFT) on small data scales to perform limited types of evaluation tasks, fundamentally limiting generalization. To meet the need for strong, generalized judge models, we explore training foundational judge models at large data scales (680K) with direct preference optimization (DPO). Using four training tasks, we form three types of DPO preference pairs targeting different aspects of evaluation: generating meaningful critiques, making accurate judgements, and understanding what comprises good and bad responses. To demonstrate the effectiveness of our method, we train judge models of three sizes (8B, 12B, and 70B parameters) and evaluate on a comprehensive suite of 13 benchmarks (7 pairwise, 4 single rating, and 2 classification). Our models achieve the best aggregate performance, with even our 8B model outperforming GPT-4o on pairwise benchmarks. Further analysis shows that our judge models produce factual and actionable critiques and serve as strong foundational judges for continued finetuning.
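The abstract's core training objective is standard DPO over (chosen, rejected) judgement pairs. As a minimal sketch of that objective (not the paper's implementation; function name, arguments, and `beta` value are illustrative assumptions), the per-pair loss compares the policy's log-probability margin against a frozen reference model:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    logp_* are sequence log-probabilities of the preferred (chosen) and
    dispreferred (rejected) judgement under the policy being trained;
    ref_logp_* are the same quantities under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # output than the reference model does, relative to the rejected output.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log sigmoid(beta * margin): small when the policy favors the chosen output.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With a zero margin the loss is ln 2 ≈ 0.693; a policy that increasingly prefers the chosen judgement drives the loss toward zero.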
Anthology ID:
2025.emnlp-main.103
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Note:
Pages:
1979–2009
URL:
https://aclanthology.org/2025.emnlp-main.103/
Cite (ACL):
PeiFeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, and Shafiq Joty. 2025. Direct Judgement Preference Optimization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1979–2009, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Direct Judgement Preference Optimization (Wang et al., EMNLP 2025)
PDF:
https://aclanthology.org/2025.emnlp-main.103.pdf
Checklist:
 2025.emnlp-main.103.checklist.pdf