Feihao Fang


2025

Not All Voices Are Rewarded Equally: Probing and Repairing Reward Models across Human Diversity
Zihao Li | Feihao Fang | Xitong Zhang | Jiaru Zou | Zhining Liu | Wei Xiong | Ziwei Wu | Baoyu Jing | Jingrui He
Findings of the Association for Computational Linguistics: EMNLP 2025

The advancement of Large Language Models (LLMs) has made ensuring their trustworthiness increasingly critical, especially with respect to fairness across diverse human groups. While modern LLMs are aligned with user preferences through Reinforcement Learning from Human Feedback (RLHF), the reward models used for alignment are trained on preference data that may both reflect societal biases and suffer from demographic skew, as labeler populations are often uneven due to systemic accessibility or participation gaps. In this work, we reveal that reward models can exhibit significant discrepancies across demographic groups, posing a fundamental challenge to fair and robust alignment. Using real-world datasets, we conduct the most comprehensive study to date, auditing various state-of-the-art reward models across nine sensitive attributes, including age, gender, and ethnicity. Our evaluation spans both (1) the agreement level between reward models and specific user groups, and (2) the reward model’s preference toward responses associated with different groups. Based on these findings, we propose the first method to mitigate group disparities in reward modeling. Code is available at https://github.com/Violet24K/FaRM.
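
As a rough illustration of the first evaluation axis described in the abstract (agreement between a reward model and specific user groups), the sketch below computes a per-group agreement rate over pairwise preference data. It is not taken from the paper or the FaRM repository: the `group_agreement` helper, the `reward_fn` callable, and the `prompt`/`chosen`/`rejected`/`group` record fields are all assumed for illustration.

```python
# Hypothetical sketch: per-group agreement between a reward model and human
# labelers' pairwise preferences. Data layout and reward_fn are assumptions,
# not the paper's actual code.
from collections import defaultdict
from typing import Callable, Dict, Iterable


def group_agreement(
    examples: Iterable[dict],
    reward_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """For each demographic group, return the fraction of preference pairs
    where the reward model scores the human-chosen response higher."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        group = ex["group"]  # e.g. an age, gender, or ethnicity bucket
        chosen_score = reward_fn(ex["prompt"], ex["chosen"])
        rejected_score = reward_fn(ex["prompt"], ex["rejected"])
        correct[group] += int(chosen_score > rejected_score)
        total[group] += 1
    return {g: correct[g] / total[g] for g in total}
```

Large gaps in these per-group rates would indicate the kind of disparity the paper audits; the paper's actual metrics and mitigation method are described in the publication itself.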