Cesa Salaam


2022

Offensive Content Detection via Synthetic Code-Switched Text
Cesa Salaam | Franck Dernoncourt | Trung Bui | Danda Rawat | Seunghyun Yoon
Proceedings of the 29th International Conference on Computational Linguistics

The prevalence of offensive content on social media has become a significant concern for online platforms (customer service chat-boxes, social media platforms, etc.). Classifying offensive and hate-speech content in online settings is an essential task in many applications. However, text from online platforms can contain code-switching, the mixing of more than one language. The lack of labeled code-switched data for low-resource code-switching combinations adds difficulty to this problem. To overcome this, we release a real-world dataset containing around 10k samples for testing for three language combinations (en-fr, en-es, and en-de) and a synthetic code-switched textual dataset containing ~30k samples for training. In this paper, we describe the process for gathering the human-generated data and our algorithm for creating synthetic code-switched offensive content data. We also present the results of a keyword classification baseline and a multilingual transformer-based classification model.