Tapas Kumar Mishra


2025

Development of a Low-Cost Named Entity Recognition System for Odia Language using Deep Active Learning
Tusarkanta Dalai | Tapas Kumar Mishra | Pankaj Kumar Sa | Prithviraj Mohanty | Chittaranjan Swain | Ajit Kumar Nayak
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models

A thresholding method for Improving translation Quality for Indic MT task
Sudhansu Bala Das | Leo Raphael Rodrigues | Tapas Kumar Mishra | Bidyut Ku Patra
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages

The conversion of content from one language to another using a computer system is known as Machine Translation (MT). Various techniques have been developed to produce effective translations that preserve the contextual and lexical interpretation of the source and target languages. One of these is end-to-end Neural Machine Translation (NMT), which is frequently used in real-world machine translation systems. NMT requires large parallel datasets to translate effectively: the system must see them during the training phase to learn the linguistic patterns and structures of both languages. One such dataset is Samanantar, the largest publicly accessible parallel dataset for Indian languages (ILs). Because these datasets are gathered from various sources, they contain many incorrect or dissimilar translations, so MT systems built on them cannot perform to their full potential. This paper proposes an algorithm to remove dissimilar translations from the training dataset and evaluates its effect on model performance. Two Indic languages (ILs), Hindi (HIN) and Odia (ODI), were chosen for the experiment. A baseline NMT system is built for these languages, and the effect of different dataset sizes is investigated. Translation quality is evaluated using standard metrics. The results show that removing dissimilar translations from the training dataset improves translation quality. It is also observed that, even though the ILs-English and English-ILs systems are trained on the same dataset, ILs-English performs better across all evaluation metrics.
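The abstract does not spell out the thresholding criterion itself. Below is a minimal sketch of one common way such filtering could be done, scoring each sentence pair with multilingual sentence embeddings (here LaBSE via sentence-transformers, an assumption rather than the authors' method) and dropping pairs whose cosine similarity falls below a threshold.

```python
# Hypothetical sketch of threshold-based filtering of a parallel corpus.
# The similarity model (LaBSE) and the threshold value are assumptions;
# the abstract does not specify the authors' actual scoring method.
from sentence_transformers import SentenceTransformer, util

def filter_parallel_corpus(src_sents, tgt_sents, threshold=0.75):
    """Keep only sentence pairs whose embedding similarity meets the threshold."""
    model = SentenceTransformer("sentence-transformers/LaBSE")
    src_emb = model.encode(src_sents, convert_to_tensor=True, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, convert_to_tensor=True, normalize_embeddings=True)
    kept = []
    for i, (s, t) in enumerate(zip(src_sents, tgt_sents)):
        score = util.cos_sim(src_emb[i], tgt_emb[i]).item()
        if score >= threshold:  # discard dissimilar translations
            kept.append((s, t))
    return kept

pairs = filter_parallel_corpus(
    ["मौसम आज अच्छा है।"],          # Hindi source
    ["The weather is nice today."],  # English target
)
print(len(pairs))
```

In practice the threshold would be tuned on held-out data, trading corpus size against pair quality, which matches the paper's investigation of different dataset sizes.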

2022

NIT Rourkela Machine Translation (MT) System Submission to WAT 2022 for MultiIndicMT: An Indic Language Multilingual Shared Task
Sudhansu Bala Das | Atharv Biradar | Tapas Kumar Mishra | Bidyut Kumar Patra
Proceedings of the 9th Workshop on Asian Translation

Multilingual Neural Machine Translation (MNMT) achieves strong performance by developing a single translation model for many languages. Previous studies on multilingual translation show that multilingual training is effective for languages with limited corpora. This paper presents our submission (Team ID: NITR) to the WAT 2022 “MultiIndicMT” shared task, whose objective is translation between English and 5 Indic languages from the OPUS corpus (newly added to the WAT 2022 corpus), in both directions, using the corpus provided by the WAT organizers. Our system is a transformer-based NMT model built with the fairseq modelling toolkit and combined with ensemble techniques. Heuristic pre-processing is applied before training the model. Our multilingual NMT systems are trained with shared encoder and decoder parameters, with language embeddings assigned to each token in both the encoder and the decoder. The final multilingual system is evaluated using BLEU and RIBES scores. In future work, we plan to fine-tune both the encoder and the decoder during monolingual unsupervised training to improve the quality of the synthetic data generated in the process.
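To illustrate "language embeddings assigned to each token", here is a minimal PyTorch sketch: a learned per-language vector is broadcast over every token position and added to the token embeddings. The dimensions, vocabulary handling, and surrounding transformer are assumptions; fairseq's multilingual translation task implements this machinery internally rather than as shown here.

```python
# Minimal sketch of adding a language embedding to every token embedding.
# All sizes are illustrative assumptions, not the submission's actual setup.
import torch
import torch.nn as nn

class TokenPlusLanguageEmbedding(nn.Module):
    def __init__(self, vocab_size, num_langs, dim):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.lang_emb = nn.Embedding(num_langs, dim)

    def forward(self, token_ids, lang_id):
        # token_ids: (batch, seq_len); lang_id: index of the sentence's language.
        tok = self.token_emb(token_ids)
        # Broadcast the same language vector over every token position.
        lang = self.lang_emb(torch.tensor(lang_id)).view(1, 1, -1)
        return tok + lang

emb = TokenPlusLanguageEmbedding(vocab_size=32000, num_langs=6, dim=512)
out = emb(torch.randint(0, 32000, (2, 10)), lang_id=3)
print(out.shape)  # torch.Size([2, 10, 512])
```

The same embedding module would feed both the shared encoder and the shared decoder, letting one parameter set serve all 5 Indic-English directions while the language vector tells the model which language it is processing.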