Pavanpankaj Vegi
2022
WebCrawl African : A Multilingual Parallel Corpora for African Languages
Pavanpankaj Vegi | Sivabhavani J | Biswajit Paul | Abhinav Mishra | Prashant Banjare | Prasanna K R | Chitra Viswanathan
Proceedings of the Seventh Conference on Machine Translation (WMT)
WebCrawl African is a mixed-domain multilingual parallel corpus for a pool of African languages, compiled by the ANVITA machine translation team of the Centre for Artificial Intelligence and Robotics Lab. It is intended primarily to accelerate research on low-resource and extremely low-resource machine translation, and is part of the submission to the WMT 2022 shared task on Large-Scale Machine Translation Evaluation for African Languages under the data track. The corpus is compiled through web data mining and comprises 695K parallel sentences spanning 74 different language pairs drawn from English and 15 African languages, many of which fall under the low-resource and extremely low-resource categories. As a measure of the corpus's usefulness, an MNMT model for 24 African languages to English is trained by combining the WebCrawl African corpus with existing corpora. Evaluation on FLORES200 shows that including WebCrawl African improved the BLEU score by 0.01-1.66 for 12 out of 15 African-to-English translation directions, and by 0.18-0.68 for 4 of the 9 African-to-English translation directions that are not part of the WebCrawl African corpus. WebCrawl African also includes more parallel sentences for many language pairs than the OPUS public repository. This data description paper covers the creation of the corpus and the results obtained, along with datasheets. The WebCrawl African corpus is hosted in a GitHub repository.
ANVITA-African: A Multilingual Neural Machine Translation System for African Languages
Pavanpankaj Vegi | Sivabhavani J | Biswajit Paul | Prasanna K R | Chitra Viswanathan
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper describes the ANVITA African NMT system submitted by team ANVITA to the WMT 2022 shared task on Large-Scale Machine Translation Evaluation for African Languages under the constrained translation track. The team participated in 24 African-language-to-English MT directions. To better handle relatively low-resource language pairs and enable effective transfer learning, the models are trained in a multilingual setting. Heuristic-based corpus filtering is applied; it improved performance by 0.04-2.06 BLEU across 22 out of 24 African-to-English directions and also reduced training time by 5x. Using a deep transformer with 24 encoder layers and 6 decoder layers improved performance by 1.1-7.7 BLEU across all 24 African-to-English directions compared to the base transformer. For effective selection of the source vocabulary in the multilingual setting, joint and language-wise vocabulary selection strategies are explored on the source side. Language-wise vocabulary selection, however, did not consistently improve performance on low-resource languages in comparison to joint vocabulary selection. Empirical results indicate that training a deep transformer on the filtered corpora appears to be a better choice than training a base transformer on the whole corpora, in terms of both accuracy and training time.
2021
ANVITA Machine Translation System for WAT 2021 MultiIndicMT Shared Task
Pavanpankaj Vegi | Sivabhavani J | Biswajit Paul | Chitra Viswanathan | Prasanna Kumar K R
Proceedings of the 8th Workshop on Asian Translation (WAT2021)
This paper describes the ANVITA-1.0 MT system, architected for submission to the WAT2021 MultiIndicMT shared task by the mcairt team. The team participated in 20 translation directions, English→Indic and Indic→English, where the Indic set comprised 10 Indian languages. The ANVITA-1.0 MT system comprised two multilingual NMT models, one for the English→Indic directions and the other for the Indic→English directions, with a shared encoder-decoder, catering to 10 language pairs and 20 translation directions. The base models were built on the Transformer architecture and trained on the MultiIndicMT WAT 2021 corpora; back translation and transliteration were further employed for selective data augmentation, and model ensembling for better generalization. Additionally, the MultiIndicMT WAT 2021 corpora were distilled using a series of filtering operations before training. ANVITA-1.0 achieved the highest AM-FM score for English→Bengali, the 2nd highest for English→Tamil, and the 3rd highest for the English→Hindi and Bengali→English directions on the official test set. In general, the performance achieved by ANVITA for the Indic→English directions is relatively better than that for the English→Indic directions for all 10 language pairs when evaluated using BLEU and RIBES, although the same trend is not observed consistently under AM-FM based evaluation. Compared to BLEU, RIBES and AM-FM based scoring placed ANVITA relatively better among all the task participants.
Co-authors
- Sivabhavani J 3
- Biswajit Paul 3
- Chitra Viswanathan 3
- Prasanna K R 2
- Prasanna Kumar K R 1