Sukumar Nandi


2023

Image Caption Synthesis for Low Resource Assamese Language using Bi-LSTM with Bilinear Attention
Pankaj Choudhury | Prithwijit Guha | Sukumar Nandi
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

IndiSocialFT: Multilingual Word Representation for Indian languages in code-mixed environment
Saurabh Kumar | Ranbir Sanasam | Sukumar Nandi
Findings of the Association for Computational Linguistics: EMNLP 2023

The increasing number of Indian language users on the internet necessitates the development of Indian language technologies. In response to this demand, our paper presents a generalized representation vector for diverse text characteristics, including native scripts, transliterated text, multilingual, code-mixed, and social media-related attributes. We gather text from both social media and well-formed sources and utilize the FastText model to create the “IndiSocialFT” embedding. Through intrinsic and extrinsic evaluation methods, we compare IndiSocialFT with three popular pretrained embeddings trained over Indian languages. Our findings show that the proposed embedding surpasses the baselines in most cases and languages, demonstrating its suitability for various NLP applications.

2022

Generating Monolingual Dataset for Low Resource Language Bodo from old books using Google Keep
Sanjib Narzary | Maharaj Brahma | Mwnthai Narzary | Gwmsrang Muchahary | Pranav Kumar Singh | Apurbalal Senapati | Sukumar Nandi | Bidisha Som
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Bodo is a scheduled Indian language spoken largely by the Bodo community of Assam and other northeastern Indian states. Due to a lack of resources, it is difficult for young languages to communicate more effectively with the rest of the world. This leads to a lack of research in low-resource languages. The creation of a dataset is a tedious and costly process, particularly for languages with no participatory research. This is more visible for languages that are young and have recently adopted standard writing scripts. In this paper, we present a methodology using Google Keep for OCR to generate a monolingual Bodo corpus from different books. In this work, a Bodo text corpus of 192,327 tokens and 32,268 unique tokens is generated using free, accessible, and daily-usable applications. Moreover, we discuss some essential characteristics of the Bodo language that are neglected by Natural Language Processing (NLP) researchers.
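
The token and unique-token figures quoted above are the kind of statistics a short script can produce from OCR output. The following is a minimal sketch, assuming plain whitespace tokenization; the abstract does not state how the authors tokenized the corpus, and the function name `corpus_stats` and the placeholder input are hypothetical.

```python
from collections import Counter


def corpus_stats(lines):
    """Return (total tokens, unique tokens) for an iterable of text lines.

    Assumption: tokens are whitespace-separated. The paper's actual
    tokenization scheme is not described in the abstract.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return sum(counts.values()), len(counts)


# Placeholder strings standing in for OCR-extracted lines (not real Bodo text).
sample = ["a b a", "c b"]
total, unique = corpus_stats(sample)
print(total, unique)
```

In practice the lines would come from the text files exported after OCR, read with an explicit UTF-8 encoding so that Devanagari characters survive intact.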

AsNER - Annotated Dataset and Baseline for Assamese Named Entity recognition
Dhrubajyoti Pathak | Sukumar Nandi | Priyankoo Sarmah
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present AsNER, a named entity annotation dataset for the low-resource Assamese language, along with a baseline Assamese NER model. The dataset contains about 99k tokens comprising text from the speech of the Prime Minister of India and an Assamese play. It also contains person names, location names and addresses. The proposed NER dataset is likely to be a significant resource for deep neural network based Assamese language processing. We benchmark the dataset by training and evaluating NER models using state-of-the-art architectures for supervised named entity recognition (NER) such as FastText, BERT, XLM-R, FLAIR, and MuRIL. We implement several baseline approaches with the state-of-the-art Bi-LSTM-CRF sequence tagging architecture. The best baseline achieves an F1-score of 80.69% when using MuRIL as the word embedding method. The annotated dataset and the top performing model are made publicly available.
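
For readers unfamiliar with the metric reported above, F1 is the harmonic mean of precision and recall. The sketch below computes it over sets of entity spans, assuming exact-match, entity-level scoring; the abstract does not specify the evaluation protocol, and the function name `f1_score` and the example spans are hypothetical.

```python
def f1_score(gold, predicted):
    """Exact-match F1 over entity spans given as (start, end, label) tuples.

    Assumption: an entity counts as correct only if its boundaries and
    label both match a gold entity. This may differ from the paper's setup.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative spans only; not taken from the AsNER dataset.
gold = [(0, 1, "PER"), (3, 4, "LOC"), (6, 7, "LOC")]
predicted = [(0, 1, "PER"), (3, 4, "PER")]
print(f1_score(gold, predicted))
```

Here one of two predictions is correct (precision 1/2) and one of three gold entities is found (recall 1/3), so the harmonic mean works out to 0.4.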