Exploring Paracrawl for Document-level Neural Machine Translation

Yusser Al Ghussin, Jingyi Zhang, Josef van Genabith


Abstract
Document-level neural machine translation (NMT) has outperformed sentence-level NMT on a number of datasets. However, document-level NMT is still not widely adopted in real-world translation systems, mainly due to the lack of large-scale general-domain training data for document-level NMT. We examine the effectiveness of using Paracrawl for learning document-level translation. Paracrawl is a large-scale parallel corpus crawled from the Internet and contains data from various domains. The official Paracrawl corpus was released as parallel sentences (extracted from parallel webpages) and therefore previous works only used Paracrawl for learning sentence-level translation. In this work, we extract parallel paragraphs from Paracrawl parallel webpages using automatic sentence alignments and we use the extracted parallel paragraphs as parallel documents for training document-level translation models. We show that document-level NMT models trained with only parallel paragraphs from Paracrawl can be used to translate real documents from TED, News and Europarl, outperforming sentence-level NMT models. We also perform a targeted pronoun evaluation and show that document-level models trained with Paracrawl data can help context-aware pronoun translation.
Anthology ID:
2023.eacl-main.94
Volume:
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
1304–1310
URL:
https://aclanthology.org/2023.eacl-main.94
DOI:
10.18653/v1/2023.eacl-main.94
Cite (ACL):
Yusser Al Ghussin, Jingyi Zhang, and Josef van Genabith. 2023. Exploring Paracrawl for Document-level Neural Machine Translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1304–1310, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Exploring Paracrawl for Document-level Neural Machine Translation (Al Ghussin et al., EACL 2023)
PDF:
https://aclanthology.org/2023.eacl-main.94.pdf
Video:
https://aclanthology.org/2023.eacl-main.94.mp4