Refining an Almost Clean Translation Memory Helps Machine Translation

Shivendra Bhardwa, David Alfonso-Hermelo, Philippe Langlais, Gabriel Bernier-Colborne, Cyril Goutte, Michel Simard


Abstract
While recent studies have been dedicated to cleaning very noisy parallel corpora to improve Machine Translation training, we focus in this work on filtering a large and mostly clean Translation Memory. This problem of practical interest has not received much consideration from the community, in contrast with, for example, filtering large web-mined parallel corpora. We experiment with an extensive, multi-domain proprietary Translation Memory and compare five approaches involving deep-, feature-, and heuristic-based solutions. We propose two ways of evaluating this task: manual annotation and resulting Machine Translation quality. We report significant gains over a state-of-the-art, off-the-shelf cleaning system, using two MT engines.
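The five approaches are detailed in the paper itself; purely as an illustrative sketch (not the authors' method), a heuristic-based TM filter could combine simple checks such as an implausible source/target length ratio or an untranslated copy of the source. The thresholds and function names below are assumptions chosen for illustration only.

```python
# Illustrative sketch only (not the authors' method): a simple heuristic filter
# over Translation Memory segment pairs. Thresholds are arbitrary assumptions.

def keep_pair(src: str, tgt: str,
              min_ratio: float = 0.5, max_ratio: float = 2.0) -> bool:
    """Return True if the (src, tgt) pair passes the heuristic checks."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return False          # empty segment on either side
    if src == tgt:
        return False          # target is an untranslated copy of the source
    ratio = len(tgt) / len(src)
    return min_ratio <= ratio <= max_ratio  # reject implausible length ratios


def filter_tm(pairs):
    """Yield only the TM pairs that pass all heuristic checks."""
    for src, tgt in pairs:
        if keep_pair(src, tgt):
            yield (src, tgt)


if __name__ == "__main__":
    tm = [
        ("Hello world.", "Bonjour le monde."),
        ("Hello world.", "Hello world."),   # copy: filtered out
        ("Hi", "Une traduction beaucoup trop longue pour la source."),
    ]
    print(list(filter_tm(tm)))
```

In practice such heuristics would typically be complemented by feature- or model-based scores, as the paper's comparison of deep-, feature-, and heuristic-based solutions suggests.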
Anthology ID:
2022.amta-research.16
Volume:
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Month:
September
Year:
2022
Address:
Orlando, USA
Editors:
Kevin Duh, Francisco Guzmán
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
215–226
URL:
https://aclanthology.org/2022.amta-research.16
Cite (ACL):
Shivendra Bhardwa, David Alfonso-Hermelo, Philippe Langlais, Gabriel Bernier-Colborne, Cyril Goutte, and Michel Simard. 2022. Refining an Almost Clean Translation Memory Helps Machine Translation. In Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 215–226, Orlando, USA. Association for Machine Translation in the Americas.
Cite (Informal):
Refining an Almost Clean Translation Memory Helps Machine Translation (Bhardwa et al., AMTA 2022)
PDF:
https://aclanthology.org/2022.amta-research.16.pdf