DUAL RM: Beyond Rule-based Preference Reward Modeling via Meta-Reward

Xiaobo Liang; Wanfu Wang; Qipeng Huang; Yuyang Ding; Zecheng Tang (汤泽成); Yixin Ji (纪一心); Qianben Chen; Zhe Zhao; Kehai Chen (陈科海); Juntao Li; Min Zhang

Abstract

The ability to model sparse and underspecified rewards, characteristic of human preferences, is fundamental to scaling Reinforcement Learning (RL). Current preference-based reward modeling largely relies on verifiable rewards, where human-annotated labels define rule-based signals. However, these methods face a fundamental bottleneck we term the Matryoshka Doll Problem: a recursive dependency where each reward verifier requires a meta-verifier, leading to continuous and costly dependence on human annotation. In this work, we propose Dual RM, which couples discriminative and generative reward models (DisRMs and GenRMs) under a non-parametric meta-reward. Rather than verifying the correctness of GenRM’s reasoning, the meta-reward evaluates its practical impact on response quality. Specifically, GenRM identifies multi-dimensional evaluation rubrics and iteratively refines the response, while DisRM quantifies the quality shifts induced by each rubric. Furthermore, we implement rubric-based test-time scaling to improve sample efficiency and preference alignment under both DPO and GRPO. Our experiments demonstrate that Dual RM achieves strong performance across major preference benchmarks. Notably, even when trained exclusively on the language modality, it exhibits robust cross-modal transfer on Omni-RewardBench.
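
The rubric-level credit assignment described in the abstract can be illustrated with a minimal sketch. The abstract gives no formula, so everything here is an assumption made for illustration: the function names, the signatures, and in particular the choice to define a rubric's meta-reward as the difference between the discriminative score after and before the rubric-guided refinement. The `score` and `refine` callables are hypothetical stand-ins for a DisRM and a GenRM, not the authors' implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces; the abstract does not specify these signatures.
ScoreFn = Callable[[str, str], float]      # DisRM stand-in: (prompt, response) -> scalar quality
RefineFn = Callable[[str, str, str], str]  # GenRM stand-in: (prompt, response, rubric) -> refined response


def rubric_meta_rewards(
    prompt: str,
    response: str,
    rubrics: List[str],
    score: ScoreFn,
    refine: RefineFn,
) -> List[Tuple[str, float]]:
    """Apply rubrics in sequence and credit each with the score shift it induces.

    Assumption: meta-reward = score(after refinement) - score(before refinement).
    No trainable parameters are involved in this quantity.
    """
    rewards: List[Tuple[str, float]] = []
    current = response
    current_score = score(prompt, current)
    for rubric in rubrics:
        refined = refine(prompt, current, rubric)
        refined_score = score(prompt, refined)
        rewards.append((rubric, refined_score - current_score))
        # The next rubric refines the already-refined response.
        current, current_score = refined, refined_score
    return rewards


# Toy stand-ins so the sketch runs without any model.
def toy_score(prompt: str, response: str) -> float:
    # Pretend that longer responses are better: count whitespace-separated tokens.
    return float(len(response.split()))


def toy_refine(prompt: str, response: str, rubric: str) -> str:
    # A "noop" rubric changes nothing; any other rubric appends one token.
    return response if rubric == "noop" else f"{response} [{rubric}]"


if __name__ == "__main__":
    for rubric, delta in rubric_meta_rewards(
        "q", "short answer", ["clarity", "noop"], toy_score, toy_refine
    ):
        print(f"{rubric}: {delta:+.1f}")
```

Under this assumed definition, a rubric whose refinement leaves the response unchanged receives a meta-reward of zero, which matches the abstract's emphasis on measuring practical impact rather than verifying the rubric's reasoning.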

Anthology ID: 2026.acl-long.1729
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 37281–37296
URL: https://aclanthology.org/2026.acl-long.1729/
Cite (ACL): Xiaobo Liang, Wanfu Wang, Qipeng Huang, Yuyang Ding, Zecheng Tang, Yixin Ji, Qianben Chen, Zhe Zhao, Kehai Chen, Juntao Li, and Min Zhang. 2026. DUAL RM: Beyond Rule-based Preference Reward Modeling via Meta-Reward. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 37281–37296, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): DUAL RM: Beyond Rule-based Preference Reward Modeling via Meta-Reward (Liang et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1729.pdf
Checklist: 2026.acl-long.1729.checklist.pdf
