Jianfei Yang
2025
Identifying and Mitigating Social Bias Knowledge in Language Models
Ruizhe Chen | Yichen Li | Jianfei Yang | Yang Feng | Joey Tianyi Zhou | Jian Wu | Zuozhu Liu
Findings of the Association for Computational Linguistics: NAACL 2025
Generating fair and accurate predictions plays a pivotal role in deploying pre-trained language models (PLMs) in the real world. However, existing debiasing methods can inadvertently produce incorrect or nonsensical predictions: they are designed and evaluated to achieve parity across social groups while leaving aside individual commonsense facts, so the modified knowledge can elicit unreasonable or undesired predictions. This paper introduces a novel debiasing framework that first identifies where biases are encoded within language models and then applies the Fairness-Stamp (FAST). FAST focuses on fine-grained, individual bias mitigation and integrates a lightweight network into PLMs, specifically targeting identified biases while preserving essential knowledge and maintaining factual integrity. We also present BiaScope, a new benchmark comprising datasets and metrics designed to evaluate the retention of commonsense knowledge and generalization to paraphrased social biases. Extensive experiments across multiple datasets demonstrate that FAST surpasses state-of-the-art baselines with superior debiasing performance while not compromising overall model capability in knowledge retention and downstream prediction. This highlights the potential of fine-grained debiasing strategies for achieving fairness in PLMs. Code will be publicly available.
2021
Unsupervised Energy-based Adversarial Domain Adaptation for Cross-domain Text Classification
Han Zou | Jianfei Yang | Xiaojian Wu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021