A Holistic Approach to Bilingual Sentence Fragment Extraction from Comparable Corpora

Mahdi Khademian; Kaveh Taghipour; Saab Mansour; Shahram Khadivi

Abstract

Accurate statistical machine translation, especially for documents spanning multiple domains, requires ever larger amounts of bilingual text, and this need is most acute for language pairs with scarce training data. In recent years, researchers have explored a new source of parallel text: documents that are not necessarily parallel but are comparable. Because existing methods search for possible translation equivalences greedily, they cannot consider all possible parallel texts in comparable documents. This paper investigates a different approach that considers the relationships between all words of two comparable documents and works fairly well even in the worst case of comparability. We represent each document pair as a matrix and then transform it into a new space to find parallel fragments. Evaluations show that the system successfully extracts useful fragment pairs.
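As a rough illustration of the matrix representation mentioned in the abstract, the sketch below builds a word-by-word score matrix for one document pair. It is only a minimal sketch under stated assumptions: the abstract does not say how matrix cells are scored or which transform is applied, so the toy `lexicon`, the function `association_matrix`, and the example sentences are all hypothetical, and the transform step is not reproduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: (source word, target word) -> association score.
# The paper's actual scoring function is not given in the abstract.
Lexicon = Dict[Tuple[str, str], float]


def association_matrix(
    src_tokens: List[str], tgt_tokens: List[str], lexicon: Lexicon
) -> List[List[float]]:
    """Return a matrix with one row per source token and one column per
    target token; each cell holds the lexicon score, or 0.0 if the word
    pair is not in the lexicon."""
    return [
        [lexicon.get((s, t), 0.0) for t in tgt_tokens]
        for s in src_tokens
    ]


# Illustrative data only, not taken from the paper.
lexicon = {("casa", "house"): 0.9, ("roja", "red"): 0.8}
src = ["la", "casa", "roja"]
tgt = ["the", "red", "house"]

matrix = association_matrix(src, tgt, lexicon)
for token, row in zip(src, matrix):
    print(token, row)
```

Scoring every source word against every target word, rather than committing to one greedy match at a time, is what lets an approach of this kind consider all candidate correspondences in the pair at once.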

Anthology ID: L12-1531
Volume: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month: May
Year: 2012
Address: Istanbul, Turkey
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 4073–4079
URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/892_Paper.pdf
Cite (ACL): Mahdi Khademian, Kaveh Taghipour, Saab Mansour, and Shahram Khadivi. 2012. A Holistic Approach to Bilingual Sentence Fragment Extraction from Comparable Corpora. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 4073–4079, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal): A Holistic Approach to Bilingual Sentence Fragment Extraction from Comparable Corpora (Khademian et al., LREC 2012)