On the Effectiveness of Adversarial Robustness for Abuse Mitigation with Counterspeech

Yi-Ling Chung; Jonathan Bright


Abstract

Recent work on automated approaches to counterspeech has mostly focused on synthetic data but seldom looks into how the public deals with abuse. While systems that identify and generate counterspeech have the potential to mitigate abuse, it remains unclear how robust such models are against adversarial attacks across multiple domains, and how models trained on synthetic data handle unseen, user-generated abusive content in the real world. To tackle these issues, this paper first explores the dynamics of abuse and replies using our novel dataset of 6,955 labelled tweets targeting footballers, built to study abuse of public figures. We then curate DynaCounter, a new English dataset of 1,911 pairs of abuse and replies addressing nine minority identity groups, collected in an adversarial human-in-the-loop process over four rounds. Our analysis shows that adversarial attacks do not necessarily result in better generalisation. We further present a study of multi-domain counterspeech generation, comparing Flan-T5 and T5 models. We observe that handling certain abuse targets is particularly challenging.

Anthology ID:: 2024.naacl-long.386
Volume:: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:: June
Year:: 2024
Address:: Mexico City, Mexico
Editors:: Kevin Duh, Helena Gomez, Steven Bethard
Venue:: NAACL
Publisher:: Association for Computational Linguistics
Pages:: 6988–7002
URL:: https://aclanthology.org/2024.naacl-long.386
Cite (ACL):: Yi-Ling Chung and Jonathan Bright. 2024. On the Effectiveness of Adversarial Robustness for Abuse Mitigation with Counterspeech. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6988–7002, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):: On the Effectiveness of Adversarial Robustness for Abuse Mitigation with Counterspeech (Chung & Bright, NAACL 2024)
PDF:: https://aclanthology.org/2024.naacl-long.386.pdf
