An Iterative Approach for Mining Parallel Sentences in a Comparable Corpus

Lise Rebout, Phillippe Langlais


Abstract
We describe an approach for mining parallel sentences in a collection of documents in two languages. While several approaches have been proposed for doing so, our proposal differs in several respects. First, we use a document-level classifier to focus on potentially fruitful document pairs, an understudied approach. We show that mining fewer, but more parallel, documents can lead to better gains in machine translation. Second, we compare different strategies for post-processing the output of a classifier trained to recognize parallel sentences. Last, we report a simple bootstrapping experiment which shows that promising sentence pairs extracted in a first stage can help to mine new sentence pairs in a second stage. We applied our approach to the English-French Wikipedia. Gains of a statistical machine translation (SMT) engine are analyzed across different test sets.
Anthology ID:
L14-1368
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
648–655
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/43_Paper.pdf
Cite (ACL):
Lise Rebout and Phillippe Langlais. 2014. An Iterative Approach for Mining Parallel Sentences in a Comparable Corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 648–655, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
An Iterative Approach for Mining Parallel Sentences in a Comparable Corpus (Rebout & Langlais, LREC 2014)