L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT Models

Hrushikesh Patil, Abhishek Velankar, Raviraj Joshi


Abstract
Social media platforms are used prominently by a large number of people to express their thoughts and opinions. However, these platforms have contributed to a substantial amount of hateful and abusive content as well. Therefore, it is important to curb the spread of hate speech on these platforms. In India, Marathi is one of the most popular languages used by a wide audience. In this work, we present L3Cube-MahaHate, the first major Hate Speech Dataset in Marathi. The dataset is curated from Twitter and annotated manually. Our dataset consists of over 00 distinct tweets labeled into four major classes, i.e., hate, offensive, profane, and not. We present the approaches used for collecting and annotating the data and the challenges faced during the process. Finally, we present baseline classification results using deep learning models based on CNN, LSTM, and Transformers. We explore mono-lingual and multi-lingual variants of BERT like MahaBERT, IndicBERT, mBERT, and xlm-RoBERTa and show that mono-lingual models perform better than their multi-lingual counterparts. The MahaBERT model provides the best results on the L3Cube-MahaHate Corpus.
Anthology ID:
2022.trac-1.1
Volume:
Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022)
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Ritesh Kumar, Atul Kr. Ojha, Marcos Zampieri, Shervin Malmasi, Daniel Kadar
Venue:
TRAC
Publisher:
Association for Computational Linguistics
Pages:
1–9
URL:
https://aclanthology.org/2022.trac-1.1
Cite (ACL):
Hrushikesh Patil, Abhishek Velankar, and Raviraj Joshi. 2022. L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT Models. In Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022), pages 1–9, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Cite (Informal):
L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT Models (Patil et al., TRAC 2022)
PDF:
https://aclanthology.org/2022.trac-1.1.pdf
Code:
l3cube-pune/MarathiNLP
Data:
MOLD