Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi

Saurabh Sampatrao Gaikwad, Tharindu Ranasinghe, Marcos Zampieri, Christopher Homan


Abstract
The widespread presence of offensive language on social media has motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments with state-of-the-art cross-lingual transformers using existing data in Bengali, English, and Hindi.
Anthology ID:
2021.ranlp-1.50
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month:
September
Year:
2021
Address:
Held Online
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
437–443
URL:
https://aclanthology.org/2021.ranlp-1.50
Cite (ACL):
Saurabh Sampatrao Gaikwad, Tharindu Ranasinghe, Marcos Zampieri, and Christopher Homan. 2021. Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 437–443, Held Online. INCOMA Ltd.
Cite (Informal):
Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi (Gaikwad et al., RANLP 2021)
PDF:
https://aclanthology.org/2021.ranlp-1.50.pdf
Code
 tharindudr/mold
Data
MOLD, OLID