Crosslingual Embeddings are Essential in UNMT for distant languages: An English to IndoAryan Case Study

Tamali Banerjee, Rudra V Murthy, Pushpak Bhattacharya


Abstract
Recent advances in Unsupervised Neural Machine Translation (UNMT) have minimized the gap between supervised and unsupervised machine translation performance for closely related language pairs. However, the situation is very different for distant language pairs. Lack of overlap in lexicon and low syntactic similarity, such as between English and IndoAryan languages, lead to poor translation quality in existing UNMT systems. In this paper, we show that initialising the embedding layer of UNMT models with cross-lingual embeddings leads to significant BLEU score improvements over existing UNMT models where the embedding layer weights are randomly initialized. Further, freezing the embedding layer weights leads to better gains compared to updating the embedding layer weights during training. We experimented using Masked Sequence to Sequence (MASS) and Denoising Autoencoder (DAE) UNMT approaches for three distant language pairs. The proposed cross-lingual embedding initialization yields BLEU score improvements of as much as ten times over the baseline for English-Hindi, English-Bengali, and English-Gujarati. Our analysis shows that initialising the embedding layer with a static cross-lingual embedding mapping is essential for training UNMT models for distant language pairs.
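The two ideas in the abstract, initialising the embedding layer from cross-lingual embeddings and then freezing it, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `init_embedding` and `sgd_step`, the two-dimensional toy vectors, and the gradient values are all hypothetical and chosen only to make the contrast between frozen and updated embeddings visible.

```python
import random


def init_embedding(vocab, pretrained, dim, seed=0):
    """Build an embedding table, one row per vocabulary item.

    Rows are copied from the pretrained cross-lingual vectors when available
    and drawn at random otherwise (the random case mimics the baseline).
    """
    rng = random.Random(seed)
    table = []
    for token in vocab:
        if token in pretrained:
            table.append(list(pretrained[token]))
        else:
            table.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return table


def sgd_step(table, grads, lr, freeze):
    """Return the table after one SGD step; a frozen table is left unchanged."""
    if freeze:
        return [list(row) for row in table]
    return [[w - lr * g for w, g in zip(row, grad_row)]
            for row, grad_row in zip(table, grads)]


# Hypothetical shared-space vectors: an English word and its Hindi translation
# are placed close together, as a cross-lingual mapping would aim to do.
vocab = ["water", "पानी"]
pretrained = {"water": [1.0, 0.0], "पानी": [1.0, 0.5]}

table = init_embedding(vocab, pretrained, dim=2)
grads = [[1.0, 1.0], [1.0, 1.0]]  # made-up gradients for illustration

frozen = sgd_step(table, grads, lr=0.5, freeze=True)    # embeddings stay fixed
updated = sgd_step(table, grads, lr=0.5, freeze=False)  # embeddings drift
```

In a PyTorch model, the same choice is commonly expressed with `torch.nn.Embedding.from_pretrained(weights, freeze=True)`; whether the paper's systems use that call is not stated in the abstract.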
Anthology ID:
2021.mtsummit-research.3
Volume:
Proceedings of Machine Translation Summit XVIII: Research Track
Month:
August
Year:
2021
Address:
Virtual
Editors:
Kevin Duh, Francisco Guzmán
Venue:
MTSummit
Publisher:
Association for Machine Translation in the Americas
Pages:
23–34
URL:
https://aclanthology.org/2021.mtsummit-research.3
Cite (ACL):
Tamali Banerjee, Rudra V Murthy, and Pushpak Bhattacharya. 2021. Crosslingual Embeddings are Essential in UNMT for distant languages: An English to IndoAryan Case Study. In Proceedings of Machine Translation Summit XVIII: Research Track, pages 23–34, Virtual. Association for Machine Translation in the Americas.
Cite (Informal):
Crosslingual Embeddings are Essential in UNMT for distant languages: An English to IndoAryan Case Study (Banerjee et al., MTSummit 2021)
PDF:
https://aclanthology.org/2021.mtsummit-research.3.pdf