BiasX: “Thinking Slow” in Toxic Content Moderation with Explanations of Implied Social Biases

Yiming Zhang, Sravani Nanduri, Liwei Jiang, Tongshuang Wu, Maarten Sap


Abstract
Toxicity annotators and content moderators often default to mental shortcuts when making decisions. This can lead to subtle toxicity being missed, and seemingly toxic but harmless content being over-detected. We introduce BiasX, a framework that enhances content moderation setups with free-text explanations of statements’ implied social biases, and explore its effectiveness through a large-scale crowdsourced user study. We show that participants indeed benefit substantially from explanations when identifying subtly (non-)toxic content. The quality of explanations is critical: imperfect machine-generated explanations (+2.4% on hard toxic examples) help less than expert-written human explanations (+7.2%). Our results showcase the promise of using free-text explanations to encourage more thoughtful toxicity moderation.
Anthology ID:
2023.emnlp-main.300
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4920–4932
URL:
https://aclanthology.org/2023.emnlp-main.300
DOI:
10.18653/v1/2023.emnlp-main.300
Cite (ACL):
Yiming Zhang, Sravani Nanduri, Liwei Jiang, Tongshuang Wu, and Maarten Sap. 2023. BiasX: “Thinking Slow” in Toxic Content Moderation with Explanations of Implied Social Biases. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4920–4932, Singapore. Association for Computational Linguistics.
Cite (Informal):
BiasX: “Thinking Slow” in Toxic Content Moderation with Explanations of Implied Social Biases (Zhang et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.300.pdf
Video:
https://aclanthology.org/2023.emnlp-main.300.mp4