DivEMT: Neural Machine Translation Post-Editing Effort Across Typologically Diverse Languages

Gabriele Sarti, Arianna Bisazza, Ana Guerberof-Arenas, Antonio Toral


Abstract
We introduce DivEMT, the first publicly available post-editing study of Neural Machine Translation (NMT) over a typologically diverse set of target languages. Using a strictly controlled setup, 18 professional translators were instructed to translate or post-edit the same set of English documents into Arabic, Dutch, Italian, Turkish, Ukrainian, and Vietnamese. During the process, their edits, keystrokes, editing times and pauses were recorded, enabling an in-depth, cross-lingual evaluation of NMT quality and post-editing effectiveness. Using this new dataset, we assess the impact of two state-of-the-art NMT systems, Google Translate and the multilingual mBART-50 model, on translation productivity. We find that post-editing is consistently faster than translation from scratch. However, the magnitude of productivity gains varies widely across systems and languages, highlighting major disparities in post-editing effectiveness for languages at different degrees of typological relatedness to English, even when controlling for system architecture and training data size. We publicly release the complete dataset, including all collected behavioral data, to foster new research on the translation capabilities of NMT systems for typologically diverse languages.
Anthology ID:
2022.emnlp-main.532
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7795–7816
URL:
https://aclanthology.org/2022.emnlp-main.532
DOI:
10.18653/v1/2022.emnlp-main.532
Cite (ACL):
Gabriele Sarti, Arianna Bisazza, Ana Guerberof-Arenas, and Antonio Toral. 2022. DivEMT: Neural Machine Translation Post-Editing Effort Across Typologically Diverse Languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7795–7816, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
DivEMT: Neural Machine Translation Post-Editing Effort Across Typologically Diverse Languages (Sarti et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.532.pdf