Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination

Nakyeong Yang, Taegwan Kang, Stanley Jungkyu Choi, Honglak Lee, Kyomin Jung


Abstract
Instruction-following language models often show undesirable biases. These biases may be amplified in real-world usage of language models, where a wide range of instructions is used through zero-shot example prompting. To address this problem, we first define the bias neuron, a neuron that significantly affects biased outputs, and empirically demonstrate its existence. We then propose a novel and practical bias mitigation method, CRISPR, which eliminates bias neurons of language models in instruction-following settings. CRISPR automatically determines biased outputs and uses an explainability method to categorize the neurons that affect those outputs as bias neurons. Experimental results demonstrate that our method mitigates biases under zero-shot instruction-following settings without degrading the model’s task performance or existing knowledge. The results also indicate that the method generalizes, as it remains robust across various instructions and datasets. Surprisingly, our method can mitigate bias in language models by eliminating only a few neurons (as few as three).
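The pipeline described in the abstract (score neurons with an explainability method, mark the highest-scoring ones as bias neurons, then eliminate them) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which explainability method CRISPR uses, so the activation-times-gradient score below is an assumption, as are the toy one-hidden-layer network and every name in the code (`hidden`, `logits`, `attribution_scores`, `bias_neuron_mask`, `W1`, `W2`).

```python
from typing import List, Sequence

Matrix = List[List[float]]


def hidden(W1: Matrix, x: Sequence[float]) -> List[float]:
    """ReLU hidden activations of a toy one-hidden-layer network (hypothetical model)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def logits(W2: Matrix, h: Sequence[float], mask: Sequence[float]) -> List[float]:
    """Output logits; a mask entry of 0.0 eliminates the corresponding hidden neuron."""
    return [sum(w * hj * m for w, hj, m in zip(row, h, mask)) for row in W2]


def attribution_scores(W1: Matrix, W2: Matrix,
                       inputs: Sequence[Sequence[float]],
                       biased_label: int) -> List[float]:
    """Mean activation-times-gradient score of each hidden neuron toward the biased label.

    Assumption: for this linear output layer, d(logit_b)/d(h_j) equals W2[b][j],
    so the score of neuron j on one input is h_j * W2[b][j].
    """
    n = len(W1)
    totals = [0.0] * n
    for x in inputs:
        h = hidden(W1, x)
        for j in range(n):
            totals[j] += h[j] * W2[biased_label][j]
    return [t / len(inputs) for t in totals]


def bias_neuron_mask(scores: Sequence[float], k: int) -> List[float]:
    """Return a 0/1 mask that zeroes the k neurons with the highest scores."""
    top = set(sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k])
    return [0.0 if j in top else 1.0 for j in range(len(scores))]


# Toy data (made up for illustration): label 0 plays the role of the biased output.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [[0.1, 0.2, 2.0], [0.3, 0.1, 0.0]]
inputs = [[1.0, 0.0], [0.0, 1.0]]

scores = attribution_scores(W1, W2, inputs, biased_label=0)
mask = bias_neuron_mask(scores, k=1)

x = inputs[0]
before = logits(W2, hidden(W1, x), [1.0] * len(W1))
after = logits(W2, hidden(W1, x), mask)
```

In this toy setup the third hidden neuron dominates the biased logit, so masking that single neuron moves the prediction for `x` from label 0 to label 1. In a real instruction-following model the same idea would operate on far larger layers, and the scoring and selection rules would follow the paper rather than this sketch.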
Anthology ID:
2024.acl-long.490
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
9061–9073
URL:
https://aclanthology.org/2024.acl-long.490
Cite (ACL):
Nakyeong Yang, Taegwan Kang, Stanley Jungkyu Choi, Honglak Lee, and Kyomin Jung. 2024. Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9061–9073, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination (Yang et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.490.pdf