Disentangling Pretrained Representation to Leverage Low-Resource Languages in Multilingual Machine Translation

Frederikus Hudi, Zhi Qu, Hidetaka Kamigaito, Taro Watanabe


Abstract
Multilingual neural machine translation aims to encapsulate multiple languages in a single model. However, it requires enormous amounts of data, leaving low-resource languages (LRLs) underdeveloped. As LRLs may benefit from the shared knowledge of a multilingual representation, we aspire to find effective ways to integrate unseen languages into a pre-trained model. Nevertheless, the intricacy of the representation shared among languages hinders its full utilisation. To resolve this problem, we employed target language prediction and a central language-aware layer to improve the representation when integrating LRLs. Focusing on improving LRLs in the linguistically diverse country of Indonesia, we evaluated five languages using a parallel corpus of 1,000 instances each, with experimental results measured by BLEU showing a zero-shot improvement of 7.4 from the baseline score of 7.1 to a score of 15.5 at best. Further analysis showed that the performance gains are attributed more to the disentanglement of the multilingual representation in the encoder, along with the shift of the target language-specific representation in the decoder.
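The abstract names two mechanisms, target language prediction and a central language-aware layer, without giving implementation detail. The following is a minimal, hypothetical PyTorch sketch of how such components could be realised; the class names, pooling strategy, layer sizes, and the placement of the language-aware layer between encoder and decoder are assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class TargetLanguagePredictionHead(nn.Module):
    """Auxiliary classifier over pooled encoder states that predicts the
    target language ID (hypothetical design; sizes are illustrative)."""

    def __init__(self, d_model: int, num_languages: int):
        super().__init__()
        self.classifier = nn.Linear(d_model, num_languages)

    def forward(self, encoder_states: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # Mean-pool encoder states over non-padded positions, then classify.
        mask = attention_mask.unsqueeze(-1).float()          # [B, T, 1]
        pooled = (encoder_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)                        # [B, num_languages]


class CentralLanguageAwareLayer(nn.Module):
    """One extra Transformer layer conditioned on a learned embedding of the
    target language, re-encoding the encoder output (assumed placement)."""

    def __init__(self, d_model: int, num_languages: int, nhead: int = 8):
        super().__init__()
        self.lang_embedding = nn.Embedding(num_languages, d_model)
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, encoder_states: torch.Tensor,
                tgt_lang_id: torch.Tensor) -> torch.Tensor:
        # Add the target-language embedding to every position before re-encoding.
        lang = self.lang_embedding(tgt_lang_id).unsqueeze(1)  # [B, 1, D]
        return self.layer(encoder_states + lang)
```

In training, a cross-entropy loss over the predicted language logits would be added to the usual translation loss with a small weight (for example, a hypothetical total loss of translation_loss + 0.1 * lang_prediction_loss), encouraging the encoder output to carry a disentangled signal about the intended target language.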
Anthology ID:
2024.lrec-main.446
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
4978–4989
URL:
https://aclanthology.org/2024.lrec-main.446
Cite (ACL):
Frederikus Hudi, Zhi Qu, Hidetaka Kamigaito, and Taro Watanabe. 2024. Disentangling Pretrained Representation to Leverage Low-Resource Languages in Multilingual Machine Translation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4978–4989, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Disentangling Pretrained Representation to Leverage Low-Resource Languages in Multilingual Machine Translation (Hudi et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.446.pdf