Samsung R&D Institute Poland submission to WAT 2021 Indic Language Multilingual Task

Adam Dobrowolski; Marcin Szymański; Marcin Chochowski; Paweł Przybysz

doi:10.18653/v1/2021.wat-1.27

Abstract

This paper describes the submission to the WAT 2021 Indic Language Multilingual Task by Samsung R&D Institute Poland. The task covered translation between 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu) and English. We combined a variety of techniques: transliteration, filtering, backtranslation, domain adaptation, knowledge distillation and, finally, ensembling of NMT models. We applied an effective approach to low-resource training that consists of pretraining on backtranslations and tuning on parallel corpora. We experimented with two different domain-adaptation techniques which significantly improved translation quality when applied to monolingual corpora. We researched and applied a novel approach for finding the best hyperparameters for ensembling a number of translation models. All techniques combined gave a significant improvement of up to +8 BLEU over baseline results. The quality of the models was confirmed by human evaluation, in which SRPOL models scored best for all 5 manually evaluated languages.

Anthology ID: 2021.wat-1.27
Volume: Proceedings of the 8th Workshop on Asian Translation (WAT2021)
Month: August
Year: 2021
Address: Online
Editors: Toshiaki Nakazawa, Hideki Nakayama, Isao Goto, Hideya Mino, Chenchen Ding, Raj Dabre, Anoop Kunchukuttan, Shohei Higashiyama, Hiroshi Manabe, Win Pa Pa, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Katsuhito Sudoh, Sadao Kurohashi, Pushpak Bhattacharyya
Venue: WAT
Publisher: Association for Computational Linguistics
Pages: 224–232
URL: https://aclanthology.org/2021.wat-1.27/
DOI: 10.18653/v1/2021.wat-1.27
Cite (ACL): Adam Dobrowolski, Marcin Szymański, Marcin Chochowski, and Paweł Przybysz. 2021. Samsung R&D Institute Poland submission to WAT 2021 Indic Language Multilingual Task. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), pages 224–232, Online. Association for Computational Linguistics.
Cite (Informal): Samsung R&D Institute Poland submission to WAT 2021 Indic Language Multilingual Task (Dobrowolski et al., WAT 2021)
PDF: https://aclanthology.org/2021.wat-1.27.pdf
