Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection

Shadi Iskander, Kira Radinsky, Yonatan Belinkov


Abstract
Natural language processing models tend to learn and encode social biases present in their training data. One popular approach to addressing such biases is to eliminate the encoded information from the model’s representations. However, current methods are restricted to removing only linearly encoded information. In this work, we propose Iterative Gradient-Based Projection (IGBP), a novel method for removing non-linearly encoded concepts from neural representations. Our method iteratively trains neural classifiers to predict a particular attribute we seek to eliminate, then projects the representations onto a hypersurface such that the classifiers become oblivious to the target attribute. We evaluate the effectiveness of our method on the task of removing gender and race information as sensitive attributes. Our results demonstrate that IGBP is effective in mitigating bias under both intrinsic and extrinsic evaluations, with minimal impact on downstream task accuracy.
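The iterative scheme described in the abstract — train a classifier on the attribute, project representations onto the classifier's decision boundary, and repeat until the attribute is unrecoverable — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: it assumes a binary attribute, synthetic 2-D representations, and uses a linear logistic-regression probe (for which the gradient-based projection onto the hypersurface f(x) = 0 has a closed form), whereas IGBP itself targets non-linear classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 2-D representations in which the sensitive
# attribute z is encoded along the first axis.
n = 400
z = rng.integers(0, 2, n)            # binary attribute labels
X = rng.normal(size=(n, 2))
X[:, 0] += 3.0 * (2 * z - 1)         # attribute-carrying direction

def train_probe(X, z, lr=0.1, steps=500):
    """Train a logistic-regression probe to predict z from X."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        g = p - z                                # gradient of the log-loss
        w -= lr * X.T @ g / len(z)
        b -= lr * g.mean()
    return w, b

def accuracy(X, z, w, b):
    return float((((X @ w + b) > 0).astype(int) == z).mean())

w, b = train_probe(X, z)
acc_before = accuracy(X, z, w, b)    # probe recovers the attribute

# Iterative gradient-based projection: retrain a probe, then move each
# representation onto the probe's decision boundary (the hypersurface
# f(x) = w.x + b = 0) along the gradient of f, until the probe is at
# chance level.  For a linear probe this projection is exact in one step.
Xp = X.copy()
for _ in range(5):
    w, b = train_probe(Xp, z)
    if accuracy(Xp, z, w, b) < 0.55:  # probe is oblivious; stop
        break
    Xp -= np.outer((Xp @ w + b) / (w @ w), w)

acc_after = accuracy(Xp, z, *train_probe(Xp, z))  # near chance (~0.5)
```

With a non-linear classifier, the closed-form step above would be replaced by iterative gradient steps toward the classifier's decision hypersurface, which is the setting the paper addresses.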
Anthology ID:
2023.findings-acl.369
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5961–5977
URL:
https://aclanthology.org/2023.findings-acl.369
DOI:
10.18653/v1/2023.findings-acl.369
Bibkey:
Cite (ACL):
Shadi Iskander, Kira Radinsky, and Yonatan Belinkov. 2023. Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5961–5977, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection (Iskander et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.369.pdf
Video:
https://aclanthology.org/2023.findings-acl.369.mp4