Improving Translation of Out Of Vocabulary Words using Bilingual Lexicon Induction in Low-Resource Machine Translation

Jonas Waldendorf, Alexandra Birch, Barry Hadow, Antonio Valerio Micele Barone


Abstract
Dictionary-based data augmentation techniques have been used in domain adaptation to learn words that do not appear in the parallel training data of a machine translation model. These techniques aim to learn correct translations of such words by generating a synthetic corpus from in-domain monolingual data, using a dictionary obtained through bilingual lexicon induction. This paper applies these techniques to low-resource machine translation, where the distribution of content often shifts between the parallel data and any available monolingual data. English-Pashto machine translation systems are trained using a novel approach that introduces monolingual data into existing joint learning techniques for bilingual word embeddings, combined with word-for-word back-translation, to improve the translation of words that appear rarely or not at all in the parallel training data. For an En->Ps model, BLEU, chrF and word translation accuracy all improve, both relative to a baseline and when the approach is combined with back-translation.
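
As an illustration of the word-for-word back-translation idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the induced lexicon is a plain target-to-source word mapping, that sentences are whitespace-tokenised, and that words missing from the lexicon are copied through unchanged. The function names and the placeholder tokens (`tgt_...`) are hypothetical and stand in for real Pashto words.

```python
from typing import Dict, List, Tuple


def word_for_word_translate(sentence: str, lexicon: Dict[str, str]) -> str:
    """Translate token by token; tokens missing from the lexicon are copied unchanged."""
    return " ".join(lexicon.get(token, token) for token in sentence.split())


def build_synthetic_corpus(
    target_monolingual: List[str], lexicon: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Pair each real target sentence with a word-for-word synthetic source sentence."""
    return [(word_for_word_translate(s, lexicon), s) for s in target_monolingual]


# Toy induced lexicon (target word -> source word); placeholder tokens, not real data.
toy_lexicon = {"tgt_i": "i", "tgt_water": "water", "tgt_drink": "drink"}

# Toy target-side monolingual sentences; the second contains a word the lexicon lacks.
toy_monolingual = ["tgt_i tgt_water tgt_drink", "tgt_i tgt_rareword tgt_drink"]

synthetic_pairs = build_synthetic_corpus(toy_monolingual, toy_lexicon)
for synthetic_source, real_target in synthetic_pairs:
    print(f"{synthetic_source}\t{real_target}")
```

The synthetic source keeps the target word order, so such pairs are noisy. Under these assumptions they would typically be mixed with the genuine parallel data, so that the model sees target-side words that are otherwise absent from it.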
Anthology ID: 2022.amta-research.11
Volume: Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Month: September
Year: 2022
Address: Orlando, USA
Editors: Kevin Duh, Francisco Guzmán
Venue: AMTA
Publisher: Association for Machine Translation in the Americas
Pages: 144–156
URL: https://aclanthology.org/2022.amta-research.11
Cite (ACL): Jonas Waldendorf, Alexandra Birch, Barry Hadow, and Antonio Valerio Micele Barone. 2022. Improving Translation of Out Of Vocabulary Words using Bilingual Lexicon Induction in Low-Resource Machine Translation. In Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 144–156, Orlando, USA. Association for Machine Translation in the Americas.
Cite (Informal): Improving Translation of Out Of Vocabulary Words using Bilingual Lexicon Induction in Low-Resource Machine Translation (Waldendorf et al., AMTA 2022)
PDF: https://aclanthology.org/2022.amta-research.11.pdf