Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists

Giuseppe Attanasio, Debora Nozza, Dirk Hovy, Elena Baralis


Abstract
Natural Language Processing (NLP) models risk overfitting to specific terms in the training data, thereby reducing their performance, fairness, and generalizability. For example, neural hate speech detection models are strongly influenced by identity terms such as "gay" or "women", resulting in false positives, severe unintended bias, and lower performance. Most mitigation techniques use lists of identity terms or samples from the target domain during training. However, this approach requires a priori knowledge and introduces further bias if important terms are neglected. Instead, we propose a knowledge-free Entropy-based Attention Regularization (EAR) to discourage overfitting to training-specific terms. An additional objective function penalizes tokens with low self-attention entropy. We fine-tune BERT via EAR: the resulting model matches or exceeds state-of-the-art performance for hate speech classification and bias metrics on three benchmark corpora in English and Italian. EAR also reveals overfitting terms, i.e., terms most likely to induce bias, to help identify their effect on the model, task, and predictions.
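The abstract describes EAR only as an additional objective that penalizes tokens with low self-attention entropy. The sketch below shows one way such a term could be computed, in plain Python on nested lists rather than a deep-learning framework. Several details are assumptions, not taken from the abstract: attention is averaged over heads before computing each token's entropy, token entropies are averaged within a layer and summed across layers, the sum is negated so that minimizing the total loss raises entropy, and the weight `alpha=0.01` is a hypothetical value. All function names (`ear_regularizer`, `ear_loss`, `lowest_entropy_tokens`) are hypothetical and do not come from the authors' `g8a9/ear` code.

```python
import math
from typing import List

# Hypothetical layout: attentions[layer][head][query_token][key_token],
# where each query row is a probability distribution (sums to 1).
Matrix = List[List[float]]


def head_averaged(attn_layer: List[Matrix]) -> Matrix:
    """Average the (square) self-attention matrices of all heads in one layer."""
    n_heads = len(attn_layer)
    seq_len = len(attn_layer[0])
    return [
        [sum(head[i][j] for head in attn_layer) / n_heads for j in range(seq_len)]
        for i in range(seq_len)
    ]


def token_entropies(attn: Matrix) -> List[float]:
    """Shannon entropy (natural log) of each token's attention distribution."""
    return [-sum(p * math.log(p) for p in row if p > 0.0) for row in attn]


def ear_regularizer(attentions: List[List[Matrix]]) -> float:
    """Negative sum over layers of the mean per-token attention entropy.

    Assumed formulation: minimizing this term increases attention entropy,
    discouraging tokens from concentrating attention on a few terms.
    """
    total = 0.0
    for layer in attentions:
        ents = token_entropies(head_averaged(layer))
        total += sum(ents) / len(ents)
    return -total


def ear_loss(task_loss: float, attentions: List[List[Matrix]], alpha: float = 0.01) -> float:
    """Task loss (e.g., cross-entropy) plus the weighted entropy penalty."""
    return task_loss + alpha * ear_regularizer(attentions)


def lowest_entropy_tokens(tokens: List[str], attentions: List[List[Matrix]], k: int = 3) -> List[str]:
    """Hypothetical helper: rank tokens by mean attention entropy across layers,
    lowest first, as candidates for terms the model may be overfitting to."""
    per_layer = [token_entropies(head_averaged(layer)) for layer in attentions]
    mean_ent = [
        sum(layer_ents[i] for layer_ents in per_layer) / len(per_layer)
        for i in range(len(tokens))
    ]
    ranked = sorted(zip(tokens, mean_ent), key=lambda pair: pair[1])
    return [tok for tok, _ in ranked[:k]]
```

With a single layer and head, uniform attention over two tokens gives each token an entropy of ln 2, so the regularizer is -ln 2, while fully peaked attention gives 0; the uniform case therefore yields the lower total loss, which is the direction the penalty is meant to encourage. In a real training loop the same computation would run on framework tensors so that gradients flow through the attention weights.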
Anthology ID: 2022.findings-acl.88
Volume: Findings of the Association for Computational Linguistics: ACL 2022
Month: May
Year: 2022
Address: Dublin, Ireland
Venues: ACL | Findings
Publisher: Association for Computational Linguistics
Pages: 1105–1119
URL: https://aclanthology.org/2022.findings-acl.88
DOI: 10.18653/v1/2022.findings-acl.88
Cite (ACL): Giuseppe Attanasio, Debora Nozza, Dirk Hovy, and Elena Baralis. 2022. Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1105–1119, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists (Attanasio et al., Findings 2022)
PDF: https://aclanthology.org/2022.findings-acl.88.pdf
Software: 2022.findings-acl.88.software.zip
Code: g8a9/ear
Data: MLMA Hate Speech