Unlearning Bias in Language Models by Partitioning Gradients

Charles Yu, Sullam Jeoung, Anish Kasi, Pengfei Yu, Heng Ji


Abstract
Recent research has shown that large-scale pretrained language models, specifically transformers, tend to exhibit issues relating to racism, sexism, religious bias, and toxicity in general. Unfortunately, these pretrained language models are used almost universally in downstream tasks, and natural language processing is often applied to make real-world predictions. Thus, debiasing these language models as early in development as possible is increasingly crucial for preventing unintentional harms caused by natural language systems. To this end, we propose a new technique called partitioned contrastive gradient unlearning (PCGU), a gray-box method for debiasing pretrained masked language models. PCGU optimizes only the weights that contribute most to a specific domain of bias, identifying them via a first-order approximation based on the gradients of contrastive sentence pairs. Our experiments show that PCGU is low-cost and appears particularly effective at pinpointing the sources of implicit social bias in large pretrained transformers. Although we train with PCGU in the gender-profession domain only, we find that doing so can also partially mitigate bias across other domains. All code for our implementation and experiments can be found at https://github.com/CharlesYu2000/PCGU-UnlearningBias.
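To make the gradient-partitioning idea concrete, here is a minimal sketch in PyTorch with Hugging Face transformers: a masked language model scores a gender-contrastive sentence pair, the gradient of the probability gap gives a first-order estimate of which weights drive the bias, and only the top-scoring weight partitions are updated. The model name, per-row partitioning, gap-based scoring, and single update step are simplifying assumptions for illustration, not the exact PCGU procedure described in the paper.

```python
# Minimal sketch of gradient-partitioned unlearning for a masked LM.
# Assumes PyTorch + Hugging Face transformers. The per-row partitioning,
# probability-gap objective, and single update step are simplified
# illustrations of the idea, not the exact PCGU procedure.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bert-base-uncased"  # assumption: any masked LM could be used
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def masked_log_prob(sentence: str, target: str) -> torch.Tensor:
    """Log-probability of `target` at the [MASK] position of `sentence`."""
    inputs = tok(sentence, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
    logits = model(**inputs).logits[0, mask_pos]
    return torch.log_softmax(logits, dim=-1)[tok.convert_tokens_to_ids(target)]

# Contrastive pair in the gender-profession domain: the same sentence, with
# only the demographic term differing. The "gap" measures how strongly the
# model prefers the stereotypical completion over the anti-stereotypical one.
sentence = "The doctor said [MASK] would be late."
gap = masked_log_prob(sentence, "he") - masked_log_prob(sentence, "she")
gap.backward()

# Partition weights by matrix row, score each partition by its gradient norm
# (a first-order estimate of how much it contributes to the gap), and update
# only the k highest-scoring partitions.
k, lr = 1000, 1e-4
params = [p for p in model.parameters() if p.dim() == 2 and p.grad is not None]
scores = [p.grad.norm(dim=1) for p in params]        # one score per row
threshold = torch.cat(scores).topk(k).values.min()

with torch.no_grad():
    for p, s in zip(params, scores):
        row_mask = (s >= threshold).unsqueeze(1).to(p.dtype)
        p.sub_(lr * row_mask * p.grad)                # descend on the gap
model.zero_grad()
```

In practice this step would be repeated over many contrastive pairs; the sketch shows only how a single pair's gradients select and update the weight partitions.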
Anthology ID:
2023.findings-acl.375
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6032–6048
URL:
https://aclanthology.org/2023.findings-acl.375
DOI:
10.18653/v1/2023.findings-acl.375
Cite (ACL):
Charles Yu, Sullam Jeoung, Anish Kasi, Pengfei Yu, and Heng Ji. 2023. Unlearning Bias in Language Models by Partitioning Gradients. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6032–6048, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Unlearning Bias in Language Models by Partitioning Gradients (Yu et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.375.pdf
Video:
https://aclanthology.org/2023.findings-acl.375.mp4