Praveen Acharya


2024

pdf bib
Bidirectional English-Nepali Machine Translation(MT) System for Legal Domain
Shabdapurush Poudel | Bal Krishna Bal | Praveen Acharya
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

Nepali, a low-resource language belonging to the Indo-Aryan language family and spoken in Nepal, India, Sikkim, and Burma has comparatively very little digital content and resources, more particularly in the legal domain. However, the need to translate legal documents is ever-increasing in the context of growing volumes of legal cases and a large population seeking to go abroad for higher education or employment. This underscores the need for developing an English-Nepali Machine Translation for the legal domain. We attempt to address this problem by utilizing a Neural Machine Translation (NMT) System with an encoder-decoder architecture, specifically designed for legal Nepali-English translation. Leveraging a custom-built legal corpus of 125,000 parallel sentences, our system achieves encouraging BLEU scores of 7.98 in (Nepali → English) and 6.63 (English → Nepali) direction

pdf bib
Exploring the Potential of Large Language Models (LLMs) for Low-resource Languages: A Study on Named-Entity Recognition (NER) and Part-Of-Speech (POS) Tagging for Nepali Language
Bipesh Subedi | Sunil Regmi | Bal Krishna Bal | Praveen Acharya
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large Language Models (LLMs) have made significant advancements in Natural Language Processing (NLP) by excelling in various NLP tasks. This study specifically focuses on evaluating the performance of LLMs for Named Entity Recognition (NER) and Part-of-Speech (POS) tagging for a low-resource language, Nepali. The aim is to study the effectiveness of these models for languages with limited resources by conducting experiments involving various parameters and fine-tuning and evaluating two datasets namely, ILPRL and EBIQUITY. In this work, we have experimented with eight LLMs for Nepali NER and POS tagging. While some prior works utilized larger datasets than ours, our contribution lies in presenting a comprehensive analysis of multiple LLMs in a unified setting. The findings indicate that NepBERTa, trained solely in the Nepali language, demonstrated the highest performance with F1-scores of 0.76 and 0.90 in ILPRL dataset. Similarly, it achieved 0.79 and 0.97 in EBIQUITY dataset for NER and POS respectively. This study not only highlights the potential of LLMs in performing classification tasks for low-resource languages but also compares their performance with that of alternative approaches deployed for the tasks.