Improving the Detection of Multilingual Online Attacks with Rich Social Media Data from Singapore

Janosch Haber, Bertie Vidgen, Matthew Chapman, Vibhor Agarwal, Roy Ka-Wei Lee, Yong Keong Yap, Paul Röttger


Abstract
Toxic content is a global problem, but most resources for detecting toxic content are in English. When datasets are created in other languages, they often focus exclusively on one language or dialect. In many cultural and geographical settings, however, it is common to code-mix languages, combining and interchanging them throughout conversations. To shine a light on this practice, and enable more research into code-mixed toxic content, we introduce SOA, a new multilingual dataset of online attacks. Using the multilingual city-state of Singapore as a starting point, we collect a large corpus of Reddit comments in Indonesian, Malay, Singlish, and other languages, and provide fine-grained hierarchical labels for online attacks. We publish the corpus with rich metadata, as well as additional unlabelled data for domain adaptation. We share comprehensive baseline results, show how the metadata can be used for granular error analysis, and demonstrate the benefits of domain adaptation for detecting multilingual online attacks.
Anthology ID:
2023.acl-long.711
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
12705–12721
URL:
https://aclanthology.org/2023.acl-long.711
DOI:
10.18653/v1/2023.acl-long.711
Cite (ACL):
Janosch Haber, Bertie Vidgen, Matthew Chapman, Vibhor Agarwal, Roy Ka-Wei Lee, Yong Keong Yap, and Paul Röttger. 2023. Improving the Detection of Multilingual Online Attacks with Rich Social Media Data from Singapore. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12705–12721, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Improving the Detection of Multilingual Online Attacks with Rich Social Media Data from Singapore (Haber et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.711.pdf
Video:
https://aclanthology.org/2023.acl-long.711.mp4