Corpora for Document-Level Neural Machine Translation

Siyou Liu; Xiaojun Zhang

Abstract

Instead of translating sentences in isolation, document-level machine translation aims to capture discourse dependencies across sentences by considering a document as a whole. In recent years, there has been growing interest in modelling larger context for state-of-the-art neural machine translation (NMT). Although various document-level NMT models have shown significant improvements, three main problems remain: 1) compared with sentence-level translation tasks, the data available for training robust document-level models are relatively scarce; 2) experiments in previous work are conducted on their own datasets, which vary in size, domain and language; 3) proposed approaches are implemented on distinct NMT architectures such as recurrent neural networks (RNNs) and self-attention networks (SANs). In this paper, we aim to alleviate the low-resource and under-universality problems for document-level NMT. First, we collect a large number of existing document-level corpora, which cover 7 language pairs and 6 domains. To address resource sparsity, we construct a novel document-level parallel corpus in Chinese-Portuguese, which is a non-English-centred and low-resourced language pair. In addition, we implement and evaluate the commonly cited document-level method on top of the advanced Transformer model with universal settings. Finally, we not only demonstrate the effectiveness and universality of document-level NMT, but also release the preprocessed data, source code and trained models for comparison and reproducibility.

Anthology ID: 2020.lrec-1.466
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 3775–3781
Language: English
URL: https://aclanthology.org/2020.lrec-1.466
Cite (ACL): Siyou Liu and Xiaojun Zhang. 2020. Corpora for Document-Level Neural Machine Translation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3775–3781, Marseille, France. European Language Resources Association.
Cite (Informal): Corpora for Document-Level Neural Machine Translation (Liu & Zhang, LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.466.pdf
