Just Collect, Don’t Filter: Noisy Labels Do Not Improve Counterspeech Collection for Languages Without Annotated Resources

Pauline Möhle, Matthias Orlikowski, Philipp Cimiano


Abstract
Counterspeech on social media is rare. Consequently, it is difficult to collect naturally occurring examples, in particular for languages without annotated datasets. In this work, we study methods to increase the relevance of social media samples for counterspeech annotation when we lack annotated resources. We use the example of sourcing German data for counterspeech annotations from Twitter. We monitor tweets from German politicians and activists to collect replies. To select relevant replies, we either (a) find replies that match German abusive keywords or (b) label replies for counterspeech using a multilingual classifier fine-tuned on English data. For both approaches and a baseline setting, we annotate a random sample and use bootstrap sampling to estimate the amount of counterspeech. We find that neither the multilingual model nor the keyword approach achieves significantly higher counts of true counterspeech than the baseline. Thus, keyword lists or multilingual classifiers are likely not worth the added complexity beyond purposive data collection: even without additional filtering, we gather a meaningful sample with 7.4% true counterspeech.
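The bootstrap estimate mentioned in the abstract can be illustrated with a minimal sketch. It assumes a percentile bootstrap over binary annotations (1 = counterspeech, 0 = other); the abstract does not specify the exact procedure, so the function name `bootstrap_proportion`, its parameters, and the label list are hypothetical placeholders rather than the authors' code or data.

```python
import random


def bootstrap_proportion(labels, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap estimate of the share of positive labels.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    for _ in range(n_resamples):
        # Draw n annotations with replacement and record the positive share.
        resample = rng.choices(labels, k=n)
        estimates.append(sum(resample) / n)
    estimates.sort()
    # Read the interval bounds off the sorted resample estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    mean = sum(estimates) / n_resamples
    return mean, lower, upper


# Synthetic placeholder annotations, NOT the paper's data.
baseline_labels = [1] * 10 + [0] * 90
mean, lower, upper = bootstrap_proportion(baseline_labels)
print(f"estimated share: {mean:.3f} (95% interval {lower:.3f} to {upper:.3f})")
```

Running the same function on the annotated sample of each collection setting would give intervals that can be compared across settings; whether overlapping intervals are the criterion the authors used for significance is not stated in the abstract.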
Anthology ID:
2023.cs4oa-1.4
Volume:
Proceedings of the 1st Workshop on CounterSpeech for Online Abuse (CS4OA)
Month:
September
Year:
2023
Address:
Prague, Czechia
Editors:
Yi-Ling Chung, Helena Bonaldi, Gavin Abercrombie, Marco Guerini
Venues:
CS4OA | WS
Publisher:
Association for Computational Linguistics
Pages:
44–61
URL:
https://aclanthology.org/2023.cs4oa-1.4
Cite (ACL):
Pauline Möhle, Matthias Orlikowski, and Philipp Cimiano. 2023. Just Collect, Don’t Filter: Noisy Labels Do Not Improve Counterspeech Collection for Languages Without Annotated Resources. In Proceedings of the 1st Workshop on CounterSpeech for Online Abuse (CS4OA), pages 44–61, Prague, Czechia. Association for Computational Linguistics.
Cite (Informal):
Just Collect, Don’t Filter: Noisy Labels Do Not Improve Counterspeech Collection for Languages Without Annotated Resources (Möhle et al., CS4OA-WS 2023)
PDF:
https://aclanthology.org/2023.cs4oa-1.4.pdf