Improving Machine Translation with Phrase Pair Injection and Corpus Filtering

Akshay Batheja, Pushpak Bhattacharyya


Abstract
In this paper, we show that the combination of Phrase Pair Injection and Corpus Filtering boosts the performance of Neural Machine Translation (NMT) systems. We extract parallel phrases and sentences from the pseudo-parallel corpus and add them to the parallel corpus to train the NMT models. With the proposed approach, we observe improvements of up to 2.7 BLEU points on the FLORES test data for 3 low-resource language pairs (Hindi-Marathi, English-Marathi, and English-Pashto) across 6 translation directions. These improvements are measured against models trained on the whole pseudo-parallel corpus augmented with the parallel corpus.
Anthology ID:
2022.emnlp-main.361
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5395–5400
URL:
https://aclanthology.org/2022.emnlp-main.361
DOI:
10.18653/v1/2022.emnlp-main.361
Cite (ACL):
Akshay Batheja and Pushpak Bhattacharyya. 2022. Improving Machine Translation with Phrase Pair Injection and Corpus Filtering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5395–5400, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Improving Machine Translation with Phrase Pair Injection and Corpus Filtering (Batheja & Bhattacharyya, EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.361.pdf