Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations

Chris Emmery, Ákos Kádár, Grzegorz Chrupała, Walter Daelemans


Abstract
A limited number of studies investigate the role of model-agnostic adversarial behavior in toxic content classification. As toxicity classifiers predominantly rely on lexical cues, (deliberately) creative and evolving language use can be detrimental to the utility of current corpora and state-of-the-art models when they are deployed for content moderation. The less training data is available, the more vulnerable models might become. This study is, to our knowledge, the first to investigate the effect of adversarial behavior and augmentation for cyberbullying detection. We demonstrate that model-agnostic lexical substitutions significantly hurt classifier performance. Moreover, when these perturbed samples are used for augmentation, we show that models become robust against word-level perturbations at a slight trade-off in overall task performance. Augmentations proposed in prior work on toxicity prove to be less effective. Our results underline the need for such evaluations in online harm areas with small corpora.
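The paper's actual perturbation method is released in the linked cmry/augtox repository. As a rough, hypothetical illustration of what a word-level lexical substitution attack looks like, the toy sketch below swaps words from a small hand-made lexicon for alternatives; the `SUBSTITUTIONS` table and `perturb` function are invented for this example and are not the authors' implementation:

```python
import random

# Hypothetical toy lexicon mapping toxic cue words to substitutes.
# The paper derives substitutions model-agnostically; this fixed
# dictionary only illustrates the word-level nature of the attack.
SUBSTITUTIONS = {
    "stupid": ["dense", "slow"],
    "idiot": ["fool", "clown"],
    "hate": ["despise", "loathe"],
}


def perturb(text, rng=None):
    """Replace words found in the lexicon; leave all others intact."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    out = []
    for tok in text.split():
        candidates = SUBSTITUTIONS.get(tok.lower())
        out.append(rng.choice(candidates) if candidates else tok)
    return " ".join(out)


print(perturb("you are a stupid idiot"))
```

A classifier that relies on the surface forms "stupid" and "idiot" as lexical cues may no longer flag the perturbed sentence; conversely, adding such perturbed copies of training samples back into the training set is the augmentation strategy the paper evaluates.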
Anthology ID:
2022.lrec-1.319
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2976–2988
URL:
https://aclanthology.org/2022.lrec-1.319
Cite (ACL):
Chris Emmery, Ákos Kádár, Grzegorz Chrupała, and Walter Daelemans. 2022. Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2976–2988, Marseille, France. European Language Resources Association.
Cite (Informal):
Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations (Emmery et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.319.pdf
Code
 cmry/augtox