Not All Voices Are Rewarded Equally: Probing and Repairing Reward Models across Human Diversity

Zihao Li, Feihao Fang, Xitong Zhang, Jiaru Zou, Zhining Liu, Wei Xiong, Ziwei Wu, Baoyu Jing, Jingrui He


Abstract
The advancement of Large Language Models (LLMs) has made ensuring their trustworthiness increasingly critical, especially in terms of fairness across diverse human groups. While modern LLMs are aligned with user preferences through Reinforcement Learning from Human Feedback (RLHF), the reward models used for alignment are trained on preference data that may both reflect societal biases and suffer from demographic skew, as labeler populations are often uneven due to systemic accessibility or participation gaps. In this work, we reveal that reward models can exhibit significant discrepancies across different demographic groups, posing a fundamental challenge to fair and robust alignment. Using real-world datasets, we conduct the most comprehensive study to date, auditing various state-of-the-art reward models across nine sensitive attributes, including age, gender, and ethnicity. Our evaluation spans both (1) the agreement level between reward models and specific user groups, and (2) the reward model’s preference toward responses associated with different groups. Based on these findings, we propose the first method to mitigate group disparities in reward modeling. Code is available at https://github.com/Violet24K/FaRM.
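The first audit mentioned in the abstract, per-group agreement, can be sketched as follows: for each demographic group, count how often the reward model scores the human-preferred response above the rejected one. This is only an illustrative sketch, not the paper's implementation; the `reward_fn` callable and the record fields (`prompt`, `chosen`, `rejected`, `group`) are hypothetical placeholders for a trained reward model and a demographically annotated preference dataset.

```python
# Minimal sketch (assumed interfaces, not the paper's code): per-group agreement
# between a reward model and human preference labels.
from collections import defaultdict
from typing import Callable, Iterable


def group_agreement(
    records: Iterable[dict],
    reward_fn: Callable[[str, str], float],
) -> dict:
    """For each group, compute the fraction of preference pairs where the
    reward model scores the human-chosen response higher than the rejected one."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        # Each record is assumed to hold: prompt, chosen, rejected, group.
        score_chosen = reward_fn(r["prompt"], r["chosen"])
        score_rejected = reward_fn(r["prompt"], r["rejected"])
        total[r["group"]] += 1
        if score_chosen > score_rejected:
            correct[r["group"]] += 1
    return {g: correct[g] / total[g] for g in total}


if __name__ == "__main__":
    # Toy reward model (prefers longer responses) and two toy groups.
    toy_reward = lambda prompt, response: float(len(response))
    data = [
        {"prompt": "q1", "chosen": "a longer answer", "rejected": "short", "group": "A"},
        {"prompt": "q2", "chosen": "no", "rejected": "a verbose reply", "group": "B"},
    ]
    print(group_agreement(data, toy_reward))  # {'A': 1.0, 'B': 0.0}
```

A gap between groups in this agreement rate is the kind of disparity the paper's audit is designed to surface.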
Anthology ID:
2025.findings-emnlp.183
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3426–3455
URL:
https://aclanthology.org/2025.findings-emnlp.183/
Cite (ACL):
Zihao Li, Feihao Fang, Xitong Zhang, Jiaru Zou, Zhining Liu, Wei Xiong, Ziwei Wu, Baoyu Jing, and Jingrui He. 2025. Not All Voices Are Rewarded Equally: Probing and Repairing Reward Models across Human Diversity. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3426–3455, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Not All Voices Are Rewarded Equally: Probing and Repairing Reward Models across Human Diversity (Li et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.183.pdf
Checklist:
2025.findings-emnlp.183.checklist.pdf