JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus

Makoto Morishita; Jun Suzuki; Masaaki Nagata

Abstract

Recent machine translation algorithms mainly rely on parallel corpora. However, since the availability of parallel corpora remains limited, only some resource-rich language pairs can benefit from them. We constructed a parallel corpus for English-Japanese, a language pair for which publicly available parallel corpora are still limited, by broadly crawling the web and automatically aligning parallel sentences. The resulting corpus, called JParaCrawl, contains over 8.7 million sentence pairs. We show that it covers a broader range of domains and that a neural machine translation model trained on it works as a good pre-trained model for fine-tuning on specific domains. The pre-training and fine-tuning approach matched or surpassed the performance of models trained from the initial state while reducing training time. Additionally, we trained a model on an in-domain dataset together with JParaCrawl and show that this combination achieved the best performance. JParaCrawl and the pre-trained models are freely available online for research purposes.

Anthology ID: 2020.lrec-1.443
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 3603–3609
Language: English
URL: https://aclanthology.org/2020.lrec-1.443/
Cite (ACL): Makoto Morishita, Jun Suzuki, and Masaaki Nagata. 2020. JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3603–3609, Marseille, France. European Language Resources Association.
Cite (Informal): JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus (Morishita et al., LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.443.pdf
