Generation-Based Data Augmentation for Offensive Language Detection: Is It Worth It?

Camilla Casula, Sara Tonelli


Abstract
Generation-based data augmentation (DA) has been presented in several works as a way to improve offensive language detection. However, the effectiveness of generative DA has been shown only in limited scenarios, and the potential injection of biases when using generated data to classify offensive language has not been investigated. We aim to analyze the feasibility of generative data augmentation in more depth, with two main focuses. First, we investigate the robustness of models trained on generated data in a variety of data augmentation setups, both novel and already presented in previous work, and compare their performance on four widely used English offensive language datasets that differ inherently in content and complexity. In addition, we analyze models using the HateCheck suite, a series of functional tests created to challenge hate speech detection systems. Second, we investigate potential lexical bias issues through a qualitative analysis of the generated data. We find that the potential positive impact of generative data augmentation on model performance is unreliable, and that generative DA can also have unpredictable effects on lexical bias.
Anthology ID:
2023.eacl-main.244
Volume:
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
3359–3377
URL:
https://aclanthology.org/2023.eacl-main.244
DOI:
10.18653/v1/2023.eacl-main.244
Cite (ACL):
Camilla Casula and Sara Tonelli. 2023. Generation-Based Data Augmentation for Offensive Language Detection: Is It Worth It? In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3359–3377, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Generation-Based Data Augmentation for Offensive Language Detection: Is It Worth It? (Casula & Tonelli, EACL 2023)
PDF:
https://aclanthology.org/2023.eacl-main.244.pdf
Video:
https://aclanthology.org/2023.eacl-main.244.mp4