Constructing a Chinese—Japanese Parallel Corpus from Wikipedia

Chenhui Chu; Toshiaki Nakazawa; Sadao Kurohashi

Abstract

Parallel corpora are crucial for statistical machine translation (SMT), but they are scarce for most language pairs, including Chinese-Japanese. Because comparable corpora are far more widely available, many studies have tried to construct parallel corpora from them automatically. This paper presents a robust parallel sentence extraction system for constructing a Chinese-Japanese parallel corpus from Wikipedia. Following previous studies, the system consists mainly of a parallel sentence candidate filter and a binary classifier for parallel sentence identification. We improve on it by using common Chinese characters for filtering and two novel feature sets for classification. Experiments show that our system performs significantly better than the previous studies in both parallel sentence extraction accuracy and SMT performance. Using the system, we construct a Chinese-Japanese parallel corpus of more than 126k highly accurate parallel sentences from Wikipedia. The constructed parallel corpus is freely available at http://orchid.kuee.kyoto-u.ac.jp/chu/resource/wiki_zh_ja.tgz.
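The abstract does not spell out how the candidate filter works, so the following is only a rough sketch of what a filter based on shared Chinese characters might look like, not the authors' implementation. The mapping table `KANJI_TO_SIMPLIFIED`, the function names, the overlap ratio, and the `threshold` value are all assumptions made for illustration; a real system would need a full common-character table and would pass surviving candidates to a trained classifier.

```python
# Illustrative sketch only: a candidate filter that keeps a sentence pair when
# the two sentences share enough Chinese characters. Every name and value here
# is hypothetical and not taken from the paper.

# Tiny hand-written mapping from Japanese kanji to simplified Chinese forms.
# A real system would rely on a complete common-character table.
KANJI_TO_SIMPLIFIED = {
    "機": "机",
    "訳": "译",
    "語": "语",
    "漢": "汉",
}


def is_cjk_ideograph(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def ideographs(sentence: str, mapping=None) -> set:
    """Set of ideographs in a sentence, optionally mapped to another form."""
    mapping = mapping or {}
    return {mapping.get(ch, ch) for ch in sentence if is_cjk_ideograph(ch)}


def common_character_ratio(zh: str, ja: str) -> float:
    """Shared ideographs divided by the size of the smaller ideograph set."""
    zh_chars = ideographs(zh)
    ja_chars = ideographs(ja, KANJI_TO_SIMPLIFIED)
    if not zh_chars or not ja_chars:
        return 0.0
    return len(zh_chars & ja_chars) / min(len(zh_chars), len(ja_chars))


def passes_filter(zh: str, ja: str, threshold: float = 0.3) -> bool:
    """Keep the pair as a candidate if the overlap reaches the threshold."""
    return common_character_ratio(zh, ja) >= threshold


if __name__ == "__main__":
    zh = "机器翻译很有用"
    ja = "機械翻訳は役に立つ"
    print(common_character_ratio(zh, ja), passes_filter(zh, ja))
```

Dividing by the smaller set is one choice among several; it keeps a short sentence from being penalized when paired with a longer one, at the cost of letting through more loosely related pairs for the classifier to reject.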

Anthology ID: L14-1209
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 642–647
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/21_Paper.pdf
Cite (ACL): Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2014. Constructing a Chinese—Japanese Parallel Corpus from Wikipedia. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 642–647, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): Constructing a Chinese—Japanese Parallel Corpus from Wikipedia (Chu et al., LREC 2014)
