Generating Monolingual Dataset for Low Resource Language Bodo from old books using Google Keep

Sanjib Narzary, Maharaj Brahma, Mwnthai Narzary, Gwmsrang Muchahary, Pranav Kumar Singh, Apurbalal Senapati, Sukumar Nandi, Bidisha Som


Abstract
Bodo is a scheduled Indian language spoken largely by the Bodo community of Assam and other northeastern Indian states. Due to a lack of resources, it is difficult for young languages to communicate effectively with the rest of the world. This in turn leads to a lack of research on low-resource languages. Creating a dataset is a tedious and costly process, particularly for languages with no participatory research. This is especially visible for languages that are young and have only recently adopted standard writing scripts. In this paper, we present a methodology that uses Google Keep for OCR to generate a monolingual Bodo corpus from different books. A Bodo text corpus of 192,327 tokens and 32,268 unique tokens is generated using free, accessible, everyday applications. Moreover, we discuss some essential characteristics of the Bodo language that have been neglected by Natural Language Processing (NLP) researchers.
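The token and unique-token figures quoted in the abstract are the kind of statistic that can be computed from plain OCR output with a few lines of code. The following is a minimal sketch, not the authors' procedure: it assumes whitespace tokenization and an illustrative set of edge punctuation (including the Devanagari danda), and the names `tokenize` and `corpus_stats` are hypothetical. The paper's own tokenization rules may differ, so counts from this sketch would not necessarily match the reported ones.

```python
from collections import Counter

# Punctuation stripped from token edges. Includes the Devanagari danda (U+0964)
# and double danda (U+0965). This set is an assumption, not taken from the paper.
EDGE_PUNCT = ".,;:!?\"'()[]-\u0964\u0965"


def tokenize(text):
    """Split on whitespace and strip edge punctuation (assumed scheme)."""
    tokens = []
    for raw in text.split():
        tok = raw.strip(EDGE_PUNCT)
        if tok:  # skip items that were punctuation only
            tokens.append(tok)
    return tokens


def corpus_stats(lines):
    """Return (total tokens, unique tokens) for an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return sum(counts.values()), len(counts)


# Placeholder input standing in for OCR output; not real corpus text.
sample = ["a b a.", "c b \u0964"]
total, unique = corpus_stats(sample)
print(total, unique)
```

In practice the lines would be read from the text files exported after OCR, for example by passing an open file object to `corpus_stats`.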
Anthology ID:
2022.lrec-1.705
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6563–6570
URL:
https://aclanthology.org/2022.lrec-1.705
Cite (ACL):
Sanjib Narzary, Maharaj Brahma, Mwnthai Narzary, Gwmsrang Muchahary, Pranav Kumar Singh, Apurbalal Senapati, Sukumar Nandi, and Bidisha Som. 2022. Generating Monolingual Dataset for Low Resource Language Bodo from old books using Google Keep. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6563–6570, Marseille, France. European Language Resources Association.
Cite (Informal):
Generating Monolingual Dataset for Low Resource Language Bodo from old books using Google Keep (Narzary et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.705.pdf