BanglaHateBERT: BERT for Abusive Language Detection in Bengali

Md Saroar Jahan, Mainul Haque, Nabil Arhab, Mourad Oussalah


Abstract
This paper introduces BanglaHateBERT, a retrained BERT model for abusive language detection in Bengali. The model was trained on a large-scale Bengali offensive, abusive, and hateful corpus that we collected from different sources and have made available to the public. Furthermore, we collected and manually annotated a balanced 15K-post Bengali hate speech dataset and made it publicly available to the research community. We took the existing pre-trained BanglaBERT model and retrained it with 1.5 million offensive posts. We present the results of a detailed comparison between the generic pre-trained language model and its abuse-inclined retrained version. On all datasets, BanglaHateBERT outperformed the corresponding generic BERT model.
Anthology ID:
2022.restup-1.2
Volume:
Proceedings of the Second International Workshop on Resources and Techniques for User Information in Abusive Language Analysis
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Johanna Monti, Valerio Basile, Maria Pia Di Buono, Raffaele Manna, Antonio Pascucci, Sara Tonelli
Venue:
ResTUP
Publisher:
European Language Resources Association
Pages:
8–15
URL:
https://aclanthology.org/2022.restup-1.2
Cite (ACL):
Md Saroar Jahan, Mainul Haque, Nabil Arhab, and Mourad Oussalah. 2022. BanglaHateBERT: BERT for Abusive Language Detection in Bengali. In Proceedings of the Second International Workshop on Resources and Techniques for User Information in Abusive Language Analysis, pages 8–15, Marseille, France. European Language Resources Association.
Cite (Informal):
BanglaHateBERT: BERT for Abusive Language Detection in Bengali (Jahan et al., ResTUP 2022)
PDF:
https://aclanthology.org/2022.restup-1.2.pdf