L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources

Raviraj Joshi


Abstract
We present L3Cube-MahaCorpus, a Marathi monolingual dataset scraped from various internet sources. It expands the existing Marathi monolingual corpus by 24.8M sentences and 289M tokens. We further present MahaBERT, MahaAlBERT, and MahaRoBERTa, all BERT-based masked language models, and MahaFT, FastText word embeddings, each trained on the full Marathi corpus of 752M tokens. We demonstrate the effectiveness of these resources on downstream Marathi sentiment analysis, text classification, and named entity recognition (NER) tasks. We also release MahaGPT, a generative GPT model trained on the Marathi corpus. Marathi is a widely spoken language in India but still lacks such resources. This work is a step toward building open resources for the Marathi language. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .
Anthology ID:
2022.wildre-1.17
Volume:
Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Girish Nath Jha, Sobha L., Kalika Bali, Atul Kr. Ojha
Venue:
WILDRE
Publisher:
European Language Resources Association
Pages:
97–101
URL:
https://aclanthology.org/2022.wildre-1.17
Cite (ACL):
Raviraj Joshi. 2022. L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources. In Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference, pages 97–101, Marseille, France. European Language Resources Association.
Cite (Informal):
L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources (Joshi, WILDRE 2022)
PDF:
https://aclanthology.org/2022.wildre-1.17.pdf
Code
 l3cube-pune/MarathiNLP
Data
CC100
L3CubeMahaSent