Neural Machine Translation between Low-Resource Languages with Synthetic Pivoting

Khalid Ahmed, Jan Buys


Abstract
Training neural models for translating between low-resource languages is challenging due to the scarcity of direct parallel data between such languages. Pivot-based neural machine translation (NMT) systems overcome data scarcity by including a high-resource pivot language in the process of translating between low-resource languages. We propose synthetic pivoting, a novel approach to pivot-based translation in which the pivot sentences are generated synthetically from both the source and target languages. Synthetic pivot sentences are generated through sequence-level knowledge distillation, with the aim of changing the structure of pivot sentences to be closer to that of the source or target languages, thereby reducing pivot translation complexity. We incorporate synthetic pivoting into two paradigms for pivoting: cascading and direct translation using synthetic source and target sentences. We find that the performance of pivot-based systems depends heavily on the quality of the NMT model used for sentence regeneration. Furthermore, training back-translation models on these sentences can make the models more robust to input-side noise. The results show that synthetic data generation improves pivot-based systems translating between low-resource Southern African languages by up to 5.6 BLEU points after fine-tuning.
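The two ideas named in the abstract, cascading through a pivot language and regenerating pivot sentences with a model, can be illustrated with a minimal sketch. This is one reading of the abstract, not the authors' implementation: word-for-word toy lexicons stand in for trained NMT models, the tokens (`s1`, `p1`, `t1`) are placeholders rather than real words, and the names `make_toy_translator`, `cascade`, and `regenerate_pivot` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]


def make_toy_translator(lexicon: Dict[str, str]) -> Translator:
    """Build a word-for-word translator (a stand-in for a trained NMT model)."""
    def translate(sentence: str) -> str:
        # Unknown words are copied through unchanged.
        return " ".join(lexicon.get(word, word) for word in sentence.split())
    return translate


def cascade(src_to_pivot: Translator, pivot_to_tgt: Translator) -> Translator:
    """Compose two translators into a source -> pivot -> target cascade."""
    def translate(sentence: str) -> str:
        return pivot_to_tgt(src_to_pivot(sentence))
    return translate


def regenerate_pivot(
    pairs: List[Tuple[str, str]], to_pivot: Translator
) -> List[Tuple[str, str]]:
    """Replace each reference pivot sentence with the model's own output.

    Assumption: this mirrors sequence-level knowledge distillation, where
    the model-generated pivot sentence takes the place of the human one.
    """
    return [(low_resource, to_pivot(low_resource)) for low_resource, _ in pairs]


# Placeholder lexicons: source tokens s*, pivot tokens p*, target tokens t*.
src_to_pivot = make_toy_translator({"s1": "p1", "s2": "p2"})
pivot_to_tgt = make_toy_translator({"p1": "t1", "p2": "t2"})

cascaded = cascade(src_to_pivot, pivot_to_tgt)
synthetic_pairs = regenerate_pivot([("s1 s2", "human pivot")], src_to_pivot)
```

Here `cascaded("s1 s2")` passes the sentence through the pivot before producing target tokens, and `synthetic_pairs` holds source sentences paired with model-generated pivot sentences that a second model could then be trained on.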
Anthology ID:
2024.lrec-main.1063
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
12144–12158
URL:
https://aclanthology.org/2024.lrec-main.1063
Cite (ACL):
Khalid Ahmed and Jan Buys. 2024. Neural Machine Translation between Low-Resource Languages with Synthetic Pivoting. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12144–12158, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Neural Machine Translation between Low-Resource Languages with Synthetic Pivoting (Ahmed & Buys, LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1063.pdf