Investigating Multilingual Abusive Language Detection: A Cautionary Tale

Kenneth Steimel, Daniel Dakota, Yue Chen, Sandra Kübler


Abstract
Abusive language detection has received much attention in recent years, and recent approaches perform the task in a number of different languages. We investigate which factors affect performance in multilingual settings, focusing on the compatibility of data and annotations. In the current paper, we focus on English and German. Our findings show large differences in performance between the two languages: the best performance is achieved by different classification algorithms, and sampling to address class imbalance is detrimental for German but beneficial for English. The only similarity we find is that neither data set shows clear topics when we compare the results of topic modeling to the gold standard. Based on these findings, we conclude that a multilingual optimization of classifiers is not possible even in settings where comparable data sets are used.
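The sampling to address class imbalance that the abstract mentions could be, for instance, random oversampling of the minority class (duplicating minority examples until the classes are balanced). A minimal sketch of that general technique, purely for illustration; the paper's actual sampling method and data are not specified in the abstract, and the example tweets below are invented:

```python
import random
from collections import Counter

def random_oversample(examples, labels, seed=0):
    """Duplicate minority-class examples at random until all classes
    match the size of the largest class. A generic illustration of the
    kind of sampling evaluated in the paper, not its exact method."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(examples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(examples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

# Hypothetical toy corpus: 1 abusive vs. 3 non-abusive tweets
X = ["tweet1", "tweet2", "tweet3", "tweet4"]
y = ["abuse", "other", "other", "other"]
Xb, yb = random_oversample(X, y)
print(Counter(yb))  # both classes now have 3 examples
```

Whether such rebalancing helps is exactly what the paper probes: per the abstract, it improved results for English but hurt them for German.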
Anthology ID:
R19-1132
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
1151–1160
URL:
https://aclanthology.org/R19-1132
DOI:
10.26615/978-954-452-056-4_132
Cite (ACL):
Kenneth Steimel, Daniel Dakota, Yue Chen, and Sandra Kübler. 2019. Investigating Multilingual Abusive Language Detection: A Cautionary Tale. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 1151–1160, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
Investigating Multilingual Abusive Language Detection: A Cautionary Tale (Steimel et al., RANLP 2019)
PDF:
https://aclanthology.org/R19-1132.pdf