Offensive Content Detection via Synthetic Code-Switched Text

Cesa Salaam, Franck Dernoncourt, Trung Bui, Danda Rawat, Seunghyun Yoon


Abstract
The prevalent use of offensive content in social media has become an important reason for concern for online platforms (customer service chat-boxes, social media platforms, etc). Classifying offensive and hate-speech content in online settings is an essential task in many applications that needs to be addressed accordingly. However, online text from online platforms can contain code-switching, a combination of more than one language. The non-availability of labeled code-switched data for low-resourced code-switching combinations adds difficulty to this problem. To overcome this, we release a real-world dataset containing around 10k samples for testing for three language combinations en-fr, en-es, and en-de, and a synthetic code-switched textual dataset containing ~30k samples for training In this paper, we describe the process for gathering the human-generated data and our algorithm for creating synthetic code-switched offensive content data. We also introduce the results of a keyword classification baseline and a multi-lingual transformer-based classification model.
Anthology ID:
2022.coling-1.575
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue:
COLING
SIG:
Publisher:
International Committee on Computational Linguistics
Note:
Pages:
6617–6624
Language:
URL:
https://aclanthology.org/2022.coling-1.575
DOI:
Bibkey:
Cite (ACL):
Cesa Salaam, Franck Dernoncourt, Trung Bui, Danda Rawat, and Seunghyun Yoon. 2022. Offensive Content Detection via Synthetic Code-Switched Text. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6617–6624, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Offensive Content Detection via Synthetic Code-Switched Text (Salaam et al., COLING 2022)
Copy Citation:
PDF:
https://aclanthology.org/2022.coling-1.575.pdf
Data
HateXplain