Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection

Camilla Casula, Sebastiano Vecellio Salto, Alan Ramponi, Sara Tonelli


Abstract
The use of synthetic data for training models for a variety of NLP tasks is now widespread. However, previous work reports mixed results with regard to its effectiveness on highly subjective tasks such as hate speech detection. In this paper, we present an in-depth qualitative analysis of the potential and specific pitfalls of synthetic data for hate speech detection in English, with 3,500 manually annotated examples. We show that, across different models, synthetic data created through paraphrasing gold texts can improve out-of-distribution robustness from a computational standpoint. However, this comes at a cost: synthetic data fails to reliably reflect the characteristics of real-world data on a number of linguistic dimensions, it results in drastically different class distributions, and it heavily reduces the representation of both specific identity groups and intersectional hate.
Anthology ID:
2024.emnlp-main.1099
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
19709–19726
URL:
https://aclanthology.org/2024.emnlp-main.1099/
DOI:
10.18653/v1/2024.emnlp-main.1099
Cite (ACL):
Camilla Casula, Sebastiano Vecellio Salto, Alan Ramponi, and Sara Tonelli. 2024. Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19709–19726, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection (Casula et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1099.pdf
Data:
 2024.emnlp-main.1099.data.zip