CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization

Jing Ye, Rui Wang, Yuchuan Wu, Victor Ma, Feiteng Fang, Fei Huang, Yongbin Li


Abstract
Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers (e.g., code generation, mathematical reasoning), yet struggles with open-ended subjective tasks like role-playing dialogue. Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals. Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization (CPO). CPO redefines the reward evaluation paradigm by shifting from sample-wise scoring to comparative group-wise scoring. Building on the same principle, we introduce the CharacterArena evaluation framework, which comprises two stages: (1) Contextualized Multi-turn Role-playing Simulation, and (2) Trajectory-level Comparative Evaluation. By operationalizing subjective scoring via objective trajectory comparisons, CharacterArena minimizes contextual bias and enables more robust and fair performance evaluation. Empirical results on CharacterEval, CharacterBench, and CharacterArena confirm that CPO effectively mitigates reward ambiguity and leads to substantial improvements in dialogue quality.
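To make the contrast in the abstract concrete, below is a minimal conceptual sketch (not the authors' released implementation) of sample-wise reward scoring versus the group-wise comparative scoring that CPO advocates; the judge interface, the rank-to-reward mapping, and the normalization are illustrative assumptions only.

```python
# Conceptual sketch: independent sample-wise scoring vs. comparative
# group-wise scoring. All names and the normalization scheme are
# assumptions for illustration, not the paper's actual algorithm.
from typing import Callable, List


def sample_wise_rewards(responses: List[str],
                        score_fn: Callable[[str], float]) -> List[float]:
    """Traditional reward modeling: each response is scored in isolation,
    which is brittle when the evaluation criteria are subjective."""
    return [score_fn(r) for r in responses]


def comparative_group_rewards(responses: List[str],
                              rank_fn: Callable[[List[str]], List[int]]) -> List[float]:
    """Comparative scoring: a judge ranks all responses in the group jointly
    (rank 0 = best); ranks are mapped to zero-mean, unit-variance rewards so
    the policy update depends only on relative quality within the group."""
    ranks = rank_fn(responses)                      # e.g. [2, 0, 1] for 3 responses
    n = len(responses)
    raw = [float(n - 1 - r) for r in ranks]         # better rank -> higher raw reward
    mean = sum(raw) / n
    std = (sum((x - mean) ** 2 for x in raw) / n) ** 0.5 or 1.0
    return [(x - mean) / std for x in raw]          # group-relative, normalized


if __name__ == "__main__":
    group = ["reply A", "reply B", "reply C"]
    # A hypothetical judge that prefers B > C > A:
    judge = lambda rs: [2, 0, 1]
    print(comparative_group_rewards(group, judge))  # B receives the highest reward
```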
Anthology ID: 2025.findings-emnlp.18
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 297–323
URL: https://aclanthology.org/2025.findings-emnlp.18/
Cite (ACL): Jing Ye, Rui Wang, Yuchuan Wu, Victor Ma, Feiteng Fang, Fei Huang, and Yongbin Li. 2025. CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 297–323, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization (Ye et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.18.pdf
Checklist: 2025.findings-emnlp.18.checklist.pdf