Identifying and Mitigating Social Bias Knowledge in Language Models

Ruizhe Chen, Yichen Li, Jianfei Yang, Yang Feng, Joey Tianyi Zhou, Jian Wu, Zuozhu Liu


Abstract
Generating fair and accurate predictions plays a pivotal role in deploying pre-trained language models (PLMs) in the real world. However, existing debiasing methods can inadvertently produce incorrect or nonsensical predictions: they are designed and evaluated to achieve parity across social groups while disregarding individual commonsense facts, so the modified knowledge may elicit unreasonable or undesired outputs. This paper introduces a novel debiasing framework that first identifies where biases are encoded within language models and then applies Fairness-Stamp (FAST). FAST performs fine-grained, individual bias mitigation by integrating a lightweight network into PLMs that specifically targets the identified biases while preserving essential knowledge and maintaining factual integrity. We also present BiaScope, a new benchmark comprising datasets and metrics designed to evaluate the retention of commonsense knowledge and generalization across paraphrased social biases. Extensive experiments across multiple datasets demonstrate that FAST surpasses state-of-the-art baselines in debiasing performance without compromising overall model capability in knowledge retention and downstream prediction. This highlights the potential of fine-grained debiasing strategies for achieving fairness in PLMs. Code will be publicly available.
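The abstract does not describe the implementation, but the idea of "integrating a lightweight network into PLMs" at an identified layer can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration only: the module name FairnessPatch, the helper attach_patch, the bottleneck size, the zero-initialized residual design, and the GPT-2-style attribute path are not the authors' FAST implementation.

```python
import torch
import torch.nn as nn


class FairnessPatch(nn.Module):
    """Hypothetical bottleneck module: a small residual correction applied to the
    hidden states of one identified layer, while the PLM itself stays frozen."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        # Zero-init the up-projection so the patch starts as an identity mapping
        # and the model's original knowledge is untouched before training.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))


def attach_patch(model: nn.Module, layer_idx: int, hidden_size: int) -> FairnessPatch:
    """Freeze the PLM and wrap the MLP output of one layer with the patch.
    The attribute path model.transformer.h[layer_idx].mlp follows GPT-2-style
    Hugging Face models and is an assumption; adjust it per architecture."""
    for p in model.parameters():
        p.requires_grad = False

    patch = FairnessPatch(hidden_size)
    mlp = model.transformer.h[layer_idx].mlp
    original_forward = mlp.forward

    def patched_forward(*args, **kwargs):
        return patch(original_forward(*args, **kwargs))

    mlp.forward = patched_forward
    return patch  # only these few parameters would be trained on a debiasing objective
```

In this sketch only the patch's parameters would be optimized, which mirrors the abstract's claim of targeting identified biases while leaving the rest of the model, and hence its factual knowledge, intact.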
Anthology ID: 2025.findings-naacl.39
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 651–672
URL: https://aclanthology.org/2025.findings-naacl.39/
Cite (ACL): Ruizhe Chen, Yichen Li, Jianfei Yang, Yang Feng, Joey Tianyi Zhou, Jian Wu, and Zuozhu Liu. 2025. Identifying and Mitigating Social Bias Knowledge in Language Models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 651–672, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): Identifying and Mitigating Social Bias Knowledge in Language Models (Chen et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-naacl.39.pdf