Do Not Discard – Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation

Steinþór Steingrímsson, Pintu Lohar, Hrafn Loftsson, Andy Way


Abstract
When parallel corpora are preprocessed for machine translation (MT) training, part of the parallel data is commonly discarded and deemed non-parallel due to an odd length ratio, overlapping text in the source and target sentences, or failure of some other semantic equivalence test. For language pairs with limited parallel resources this can be costly, as even modest amounts of acceptable data may help build MT systems that generate higher-quality translations. In this paper, we refine parallel corpora for two language pairs, English–Bengali and English–Icelandic, by extracting sub-sentence fragments from sentence pairs that would otherwise be discarded, thereby increasing recall when compiling training data. We find that including the fragments significantly improves the translation quality of NMT systems trained on the data when translating from English to Bengali and from English to Icelandic.
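The abstract mentions two of the heuristics commonly used to discard sentence pairs during preprocessing: an odd length ratio and overlapping text between source and target. The sketch below is not the authors' implementation; it is a minimal illustration of such filters, with illustrative thresholds (`max_ratio`, `max_overlap`) chosen for the example.

```python
def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Reject pairs whose token-length ratio is suspiciously large."""
    s, t = len(src.split()), len(tgt.split())
    if min(s, t) == 0:
        return False
    return max(s, t) / min(s, t) <= max_ratio


def overlap_ok(src: str, tgt: str, max_overlap: float = 0.6) -> bool:
    """Reject pairs where the target largely copies the source tokens."""
    s, t = set(src.lower().split()), set(tgt.lower().split())
    if not t:
        return False
    return len(s & t) / len(t) <= max_overlap


def keep_pair(src: str, tgt: str) -> bool:
    """A pair survives only if it passes both heuristic filters."""
    return length_ratio_ok(src, tgt) and overlap_ok(src, tgt)
```

Pairs rejected by filters like these are exactly the material the paper targets: rather than discarding them outright, it mines their useful sub-sentence fragments for training data.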
Anthology ID:
2023.mtsummit-coco4mt.1
Volume:
Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation
Month:
September
Year:
2023
Address:
Macau SAR, China
Venue:
MTSummit
Publisher:
Asia-Pacific Association for Machine Translation
Pages:
1–13
URL:
https://aclanthology.org/2023.mtsummit-coco4mt.1
Cite (ACL):
Steinþór Steingrímsson, Pintu Lohar, Hrafn Loftsson, and Andy Way. 2023. Do Not Discard – Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation. In Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation, pages 1–13, Macau SAR, China. Asia-Pacific Association for Machine Translation.
Cite (Informal):
Do Not Discard – Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation (Steingrímsson et al., MTSummit 2023)
PDF:
https://aclanthology.org/2023.mtsummit-coco4mt.1.pdf