Improving Embedding Transfer for Low-Resource Machine Translation

Van Hien Tran, Chenchen Ding, Hideki Tanaka, Masao Utiyama


Abstract
Low-resource machine translation (LRMT) poses a substantial challenge due to the scarcity of parallel training data. This paper introduces a new method to improve the transfer of the embedding layer from the Parent model to the Child model in LRMT, exploiting the trained token embeddings of the Parent model’s high-resource vocabulary. Our approach projects all tokens into a shared semantic space and measures the semantic similarity between tokens in the low-resource and high-resource languages. These similarity measures are then used to initialize token representations in the Child model’s low-resource vocabulary. We evaluated our approach on three benchmark datasets of low-resource language pairs: Myanmar-English, Indonesian-English, and Turkish-English. The experimental results demonstrate that our method outperforms previous methods in translation quality. Additionally, our approach is computationally efficient, reducing training time compared to prior work.
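To make the transfer step concrete, below is a minimal sketch of a similarity-weighted embedding initialization of the kind the abstract describes. It assumes cross-lingual semantic vectors are available for both vocabularies; the function name, the softmax weighting, and the temperature parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def init_child_embeddings(parent_emb, parent_sem, child_sem, temperature=0.1):
    """Hypothetical sketch of similarity-based embedding transfer.

    parent_emb : (V_p, d)  trained embeddings from the Parent NMT model
    parent_sem : (V_p, k)  parent-vocabulary tokens in a shared semantic space
    child_sem  : (V_c, k)  child-vocabulary tokens in the same space
    Returns an (V_c, d) initialization for the Child model's embedding layer.
    """
    # Cosine similarity between every child token and every parent token.
    p = parent_sem / np.linalg.norm(parent_sem, axis=1, keepdims=True)
    c = child_sem / np.linalg.norm(child_sem, axis=1, keepdims=True)
    sim = c @ p.T  # (V_c, V_p)

    # Convert similarities into weights (temperature-scaled softmax) and
    # initialize each child embedding as a weighted sum of parent embeddings.
    w = np.exp(sim / temperature)
    w /= w.sum(axis=1, keepdims=True)
    return w @ parent_emb  # (V_c, d)
```

In this sketch, a low temperature concentrates the weights on the most semantically similar high-resource tokens, so each low-resource token starts near the trained representation of its closest counterparts rather than from a random initialization.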
Anthology ID:
2023.mtsummit-research.11
Volume:
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
Month:
September
Year:
2023
Address:
Macau SAR, China
Editors:
Masao Utiyama, Rui Wang
Venue:
MTSummit
Publisher:
Asia-Pacific Association for Machine Translation
Pages:
123–134
URL:
https://aclanthology.org/2023.mtsummit-research.11
Cite (ACL):
Van Hien Tran, Chenchen Ding, Hideki Tanaka, and Masao Utiyama. 2023. Improving Embedding Transfer for Low-Resource Machine Translation. In Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track, pages 123–134, Macau SAR, China. Asia-Pacific Association for Machine Translation.
Cite (Informal):
Improving Embedding Transfer for Low-Resource Machine Translation (Tran et al., MTSummit 2023)
PDF:
https://aclanthology.org/2023.mtsummit-research.11.pdf