Biswajit Paul


2023

CAIR-NLP at SemEval-2023 Task 2: A Multi-Objective Joint Learning System for Named Entity Recognition
Sangeeth N | Biswajit Paul | Chandramani Chaudhary
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper describes the NER system designed by the CAIR-NLP team for submission to the Multilingual Complex Named Entity Recognition (MultiCoNER II) shared task, which presents a novel challenge of recognizing complex, ambiguous, and fine-grained entities in a low-context, multilingual, multi-domain dataset, with evaluation on a noisy subset. We propose a Multi-Objective Joint Learning System (MOJLS) for NER, which aims to enhance the representation of entities and improve label predictions through the joint implementation of a set of learning objectives. Our official submission, MOJLS, implements four objectives. Two concern representations: the representation of a named entity should be close to that of its entity type definition, and low-context inputs should have representations close to those of their augmented contexts. The other two minimize label prediction errors, one for CRF-based and one for biaffine-based predictions, with both producing similar output label distributions. The official results ranked our system 2nd in five tracks (Multilingual, Spanish, Swedish, Ukrainian, and Farsi) and 3rd in three (French, Italian, and Portuguese) out of 13 tracks. In the evaluation on the noisy subset, our model also achieved relatively better ranks. Official results indicate the effectiveness of the proposed MOJLS in dealing with the contemporary challenges of NER.
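
One way to picture how the four objectives might be trained jointly is as a single scalar loss. The sketch below is illustrative only: the weighted sum, the symmetric KL agreement term between the CRF and biaffine label distributions, and all names (`joint_loss`, `weights`, `agreement_weight`) are assumptions, not details taken from the abstract.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same label set."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p); zero when p equals q."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def joint_loss(type_def_loss, context_loss, crf_loss, biaffine_loss,
               crf_probs, biaffine_probs,
               weights=(1.0, 1.0, 1.0, 1.0), agreement_weight=1.0):
    """Hypothetical combination of the four objectives into one loss.

    The first four arguments are per-objective scalar losses computed
    elsewhere; crf_probs and biaffine_probs are label distributions for
    the same token. Weights are assumed hyperparameters.
    """
    terms = (type_def_loss, context_loss, crf_loss, biaffine_loss)
    total = sum(w * t for w, t in zip(weights, terms))
    # Assumed agreement term: penalize the two heads for disagreeing.
    return total + agreement_weight * symmetric_kl(crf_probs, biaffine_probs)
```

When the two heads agree exactly, the agreement term vanishes and the loss reduces to the weighted sum of the four objectives.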

2022

WebCrawl African : A Multilingual Parallel Corpora for African Languages
Pavanpankaj Vegi | Sivabhavani J | Biswajit Paul | Abhinav Mishra | Prashant Banjare | Prasanna K R | Chitra Viswanathan
Proceedings of the Seventh Conference on Machine Translation (WMT)

WebCrawl African is a mixed-domain multilingual parallel corpora for a pool of African languages, compiled by the ANVITA machine translation team of the Centre for Artificial Intelligence and Robotics Lab, primarily to accelerate research on low-resource and extremely low-resource machine translation. It is part of the submission to the WMT 2022 shared task on Large-Scale Machine Translation Evaluation for African Languages under the data track. The corpora are compiled through web data mining and comprise 695K parallel sentences spanning 74 different language pairs from English and 15 African languages, many of which fall under the low and extremely low resource categories. As a measure of the corpora's usefulness, an MNMT model for 24 African languages to English is trained by combining the WebCrawl African corpora with existing corpora. Evaluation on FLORES200 shows that including the WebCrawl African corpora could improve BLEU score by 0.01-1.66 for 12 out of 15 African to English translation directions, and by 0.18-0.68 for 4 out of the 9 African to English translation directions that are not part of the WebCrawl African corpora. The WebCrawl African corpora include more parallel sentences for many language pairs than the OPUS public repository. This data description paper covers the creation of the corpora and the results obtained, along with datasheets. The WebCrawl African corpora are hosted in a GitHub repository.

ANVITA-African: A Multilingual Neural Machine Translation System for African Languages
Pavanpankaj Vegi | Sivabhavani J | Biswajit Paul | Prasanna K R | Chitra Viswanathan
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes the ANVITA African NMT system submitted by team ANVITA for the WMT 2022 shared task on Large-Scale Machine Translation Evaluation for African Languages under the constrained translation track. The team participated in 24 African languages to English MT directions. For better handling of relatively low resource language pairs and effective transfer learning, models are trained in a multilingual setting. Heuristic-based corpus filtering is applied; it improved performance by 0.04-2.06 BLEU across 22 out of 24 African to English directions and also sped up training by 5x. Using a deep transformer with 24 encoder layers and 6 decoder layers significantly improved performance, by 1.1-7.7 BLEU across all 24 African to English directions, compared to the base transformer. For effective selection of source vocabulary in the multilingual setting, joint and language-wise vocabulary selection strategies are explored at the source side. Language-wise vocabulary selection, however, did not consistently improve performance on low resource languages compared to joint vocabulary selection. Empirical results indicate that training a deep transformer on filtered corpora seems to be a better choice than training a base transformer on the whole corpora, in terms of both accuracy and training time.
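
The abstract does not say which heuristics the filtering used. As a rough illustration of what heuristic parallel-corpus filtering can look like, the sketch below applies a few commonly used rules (token-count bounds, a length-ratio cap, dropping copied and duplicate pairs). The rules, the thresholds, and the names `keep_pair` and `filter_corpus` are all assumptions, not the ANVITA team's actual pipeline.

```python
def keep_pair(src, tgt, min_tokens=1, max_tokens=200, max_len_ratio=3.0):
    """Return True if a sentence pair passes the assumed heuristics."""
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    # Drop empty or overly long sentences on either side.
    if not (min_tokens <= src_len <= max_tokens):
        return False
    if not (min_tokens <= tgt_len <= max_tokens):
        return False
    # Drop pairs whose lengths differ too much to be likely translations.
    if max(src_len, tgt_len) / min(src_len, tgt_len) > max_len_ratio:
        return False
    # Drop pairs where the target is just a copy of the source.
    if src.strip() == tgt.strip():
        return False
    return True


def filter_corpus(pairs):
    """Remove exact duplicate pairs, then apply keep_pair to the rest."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:
            continue
        seen.add(key)
        if keep_pair(src, tgt):
            kept.append((src, tgt))
    return kept
```

Filters of this kind shrink the training set, which is one plausible reason filtering can shorten training time as well as improve BLEU.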

2021

ANVITA Machine Translation System for WAT 2021 MultiIndicMT Shared Task
Pavanpankaj Vegi | Sivabhavani J | Biswajit Paul | Chitra Viswanathan | Prasanna Kumar K R
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

This paper describes the ANVITA-1.0 MT system, architected for submission to the WAT2021 MultiIndicMT shared task by the mcairt team, where the team participated in 20 translation directions: English→Indic and Indic→English, with the Indic set comprising 10 Indian languages. The ANVITA-1.0 MT system comprised two multilingual NMT models, one for the English→Indic directions and the other for the Indic→English directions, with shared encoder-decoder, catering to 10 language pairs and 20 translation directions. The base models were built on the Transformer architecture and trained on the MultiIndicMT WAT 2021 corpora, with back translation and transliteration employed for selective data augmentation and model ensembling for better generalization. Additionally, the MultiIndicMT WAT 2021 corpora were distilled using a series of filtering operations before training. ANVITA-1.0 achieved the highest AM-FM score for English→Bengali, 2nd for English→Tamil, and 3rd for the English→Hindi and Bengali→English directions on the official test set. In general, the performance achieved by ANVITA for the Indic→English directions is relatively better than that for the English→Indic directions for all 10 language pairs when evaluated using BLEU and RIBES, although the same trend is not observed consistently under AM-FM based evaluation. Compared to BLEU, RIBES and AM-FM based scoring placed ANVITA relatively better among all the task participants.