Generative Reward Modeling via Synthetic Criteria Preference Learning

Xiaobo Liang; Haoke Zhang; Juntao Li; Kehai Chen (陈科海); Qiaoming Zhu (朱巧明); Min Zhang

doi:10.18653/v1/2025.acl-long.1297

Abstract

Generative Reward Models (GenRMs) leverage synthesized Chains of Thought (CoTs) to reduce the need for massive labeled data, but this approach introduces a risk of overoptimization because the correctness of the CoTs cannot be guaranteed. Identifying and optimizing unexpected behaviors within these synthesized CoTs remains a challenge, as it depends heavily on precise annotations of intermediate behavior, similar to process supervision. In this work, we introduce a criteria-based preference tree for reward modeling, where each path in the tree represents a reasoning trajectory based on synthesized criteria. Crucially, each reasoning trajectory can be independently optimized through an RL algorithm. These fine-grained process reward signals are derived from inference-time computation and predefined rules, eliminating the need for human supervision. In experiments, our method, SyncPL, showed significant improvements over baselines on multiple human preference benchmarks. We further demonstrate that the synthesized data can be learned in a long CoT format, analogous to an o1-like model, which further enhances performance while maintaining stability and efficiency during training.

Anthology ID: 2025.acl-long.1297
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 26755–26769
URL: https://aclanthology.org/2025.acl-long.1297/
DOI: 10.18653/v1/2025.acl-long.1297
Cite (ACL): Xiaobo Liang, Haoke Zhang, Juntao Li, Kehai Chen, Qiaoming Zhu, and Min Zhang. 2025. Generative Reward Modeling via Synthetic Criteria Preference Learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26755–26769, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Generative Reward Modeling via Synthetic Criteria Preference Learning (Liang et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.1297.pdf
