Is More Data Better? Re-thinking the Importance of Efficiency in Abusive Language Detection with Transformers-Based Active Learning

Hannah Kirk, Bertie Vidgen, Scott Hale


Abstract
Annotating abusive language is expensive, logistically complex and creates a risk of psychological harm. However, most machine learning research has prioritized maximizing effectiveness (i.e., F1 or accuracy score) rather than data efficiency (i.e., minimizing the amount of data that is annotated). In this paper, we use simulated experiments over two datasets at varying percentages of abuse to demonstrate that transformers-based active learning is a promising approach to substantially raise efficiency whilst still maintaining high effectiveness, especially when abusive content is a smaller percentage of the dataset. This approach requires a fraction of labeled data to reach performance equivalent to training over the full dataset.
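As a rough illustration of the pool-based active learning loop described in the abstract, the sketch below pairs least-confidence (uncertainty) sampling with a stand-in scikit-learn classifier over TF-IDF features; the paper itself fine-tunes transformer models, and the toy texts, labels, seed set, query size, and number of rounds here are illustrative assumptions rather than the authors' experimental setup.

# Minimal sketch of pool-based active learning with least-confidence sampling.
# A scikit-learn logistic regression over TF-IDF features stands in for the
# transformer classifier used in the paper; the texts, labels, seed set, and
# query size below are toy assumptions for illustration only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "you are awful", "have a great day", "i hate you all", "nice work team",
    "go away loser", "thanks for the help", "nobody likes you", "see you tomorrow",
]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # 1 = abusive, 0 = not abusive

X = TfidfVectorizer().fit_transform(texts)

labelled = [0, 1]                      # small seed set with one example per class
pool = [i for i in range(len(texts)) if i not in labelled]
query_size = 2                         # items sent for annotation each round

for round_id in range(3):              # annotation rounds (illustrative budget)
    model = LogisticRegression().fit(X[labelled], labels[labelled])

    # Least-confidence sampling: query the pool items whose top predicted class
    # probability is lowest, i.e. where the model is most uncertain.
    probs = model.predict_proba(X[pool])
    confidence = probs.max(axis=1)
    queried = [pool[i] for i in np.argsort(confidence)[:query_size]]

    # "Annotate" the queried items (here the gold labels are simply revealed)
    # and move them from the unlabelled pool to the labelled set.
    labelled.extend(queried)
    pool = [i for i in pool if i not in queried]
    print(f"round {round_id}: labelled {len(labelled)} of {len(texts)} examples")

In practice the stand-in classifier would be replaced by re-training or fine-tuning a transformer at each round; the loop structure stays the same, and the efficiency gain comes from stopping once performance plateaus rather than annotating the full pool.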
Anthology ID:
2022.trac-1.7
Volume:
Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022)
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Ritesh Kumar, Atul Kr. Ojha, Marcos Zampieri, Shervin Malmasi, Daniel Kadar
Venue:
TRAC
Publisher:
Association for Computational Linguistics
Pages:
52–61
URL:
https://aclanthology.org/2022.trac-1.7
Cite (ACL):
Hannah Kirk, Bertie Vidgen, and Scott Hale. 2022. Is More Data Better? Re-thinking the Importance of Efficiency in Abusive Language Detection with Transformers-Based Active Learning. In Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022), pages 52–61, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Cite (Informal):
Is More Data Better? Re-thinking the Importance of Efficiency in Abusive Language Detection with Transformers-Based Active Learning (Kirk et al., TRAC 2022)
PDF:
https://aclanthology.org/2022.trac-1.7.pdf
Code:
hannahkirk/activetransformers-for-abusivelanguage