Sunil Regmi

2025

Paramananda@NLU of Devanagari Script Languages 2025: Detection of Language, Hate Speech and Targets using FastText and BERT
Darwin Acharya | Sundeep Dawadi | Shivram Saud | Sunil Regmi
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)

This paper presents a comparative analysis of FastText and BERT-based approaches for Natural Language Understanding (NLU) tasks in Devanagari script languages. We evaluate these models on three critical tasks: language identification, hate speech detection, and target identification across five languages: Nepali, Marathi, Sanskrit, Bhojpuri, and Hindi. Our experiments, although with raw tweet dataset but extracting only devanagari script, demonstrate that while both models achieve exceptional performance in language identification (F1 scores > 0.99), they show varying effectiveness in hate speech detection and target identification tasks. FastText with augmented data outperforms BERT in hate speech detection (F1 score: 0.8552 vs 0.5763), while BERT shows superior performance in target identification (F1 score: 0.5785 vs 0.4898). These findings contribute to the growing body of research on NLU for low-resource languages and provide insights into model selection for specific tasks in Devanagari script processing.

2024

pdf bib abs

Exploring the Potential of Large Language Models (LLMs) for Low-resource Languages: A Study on Named-Entity Recognition (NER) and Part-Of-Speech (POS) Tagging for Nepali Language
Bipesh Subedi | Sunil Regmi | Bal Krishna Bal | Praveen Acharya
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large Language Models (LLMs) have made significant advancements in Natural Language Processing (NLP) by excelling in various NLP tasks. This study specifically focuses on evaluating the performance of LLMs for Named Entity Recognition (NER) and Part-of-Speech (POS) tagging for a low-resource language, Nepali. The aim is to study the effectiveness of these models for languages with limited resources by conducting experiments involving various parameters and fine-tuning and evaluating two datasets namely, ILPRL and EBIQUITY. In this work, we have experimented with eight LLMs for Nepali NER and POS tagging. While some prior works utilized larger datasets than ours, our contribution lies in presenting a comprehensive analysis of multiple LLMs in a unified setting. The findings indicate that NepBERTa, trained solely in the Nepali language, demonstrated the highest performance with F1-scores of 0.76 and 0.90 in ILPRL dataset. Similarly, it achieved 0.79 and 0.97 in EBIQUITY dataset for NER and POS respectively. This study not only highlights the potential of LLMs in performing classification tasks for low-resource languages but also compares their performance with that of alternative approaches deployed for the tasks.

2021

pdf bib abs

An End-to-End Speech Recognition for the Nepali Language
Sunil Regmi | Bal Krishna Bal
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

In this era of AI and Deep Learning, Speech Recognition has achieved fairly good levels of accuracy and is bound to change the way humans interact with computers, which happens mostly through texts today. Most of the speech recognition systems for the Nepali language to date use conventional approaches which involve separately trained acoustic, pronunciation and language model components. Creating a pronunciation lexicon from scratch and defining phoneme sets for the language requires expert knowledge, and at the same time is time-consuming. In this work, we present an End-to-End ASR approach, which uses a joint CTC- attention-based encoder-decoder and a Recurrent Neural Network based language modeling which eliminates the need of creating a pronunciation lexicon from scratch. ESPnet toolkit which uses Kaldi Style of data preparation is the framework used for this work. The speech and transcription data used for this research is freely available on the Open Speech and Language Resources (OpenSLR). We use about 159k transcribed speech data to train the speech recognition model which currently recognizes speech input with the CER of 10.3%.

Co-authors

Bipesh Subedi 1

Venues

Fix author