Identifying Word Translations from Comparable Documents Without a Seed Lexicon

Reinhard Rapp, Serge Sharoff, Bogdan Babych


Abstract
The extraction of dictionaries from parallel text corpora is an established technique. However, as parallel corpora are a scarce resource, in recent years the extraction of dictionaries using comparable corpora has received increasing attention. To find a mapping between languages, almost all approaches suggested in the literature rely on a seed lexicon. The work described here achieves competitive results without requiring such a seed lexicon. Instead, it presupposes mappings between comparable documents in different languages. For some common types of textual resources (e.g. encyclopedias or newspaper texts) such mappings are either readily available or can be established relatively easily. The current work is based on Wikipedias, where the mappings between languages are determined by the authors of the articles. We describe a neural-network-inspired algorithm which first characterizes each Wikipedia article by a number of keywords, and then considers the identification of word translations as a variant of word alignment in a noisy environment. We present results and evaluations for eight language pairs involving Germanic, Romance, and Slavic languages as well as Chinese.
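The setting described in the abstract (aligned articles, each reduced to a set of keywords, with translations found by aligning keywords across article pairs) can be illustrated with a minimal sketch. This is not the authors' neural-network-inspired algorithm, whose details are not given here; it substitutes a plain Dice-coefficient co-occurrence score as a stand-in. The function name `keyword_translation_candidates`, the `top_n` parameter, and the toy English-German data are all assumptions made for illustration.

```python
from collections import defaultdict


def keyword_translation_candidates(aligned_docs, top_n=1):
    """Rank target-language keywords for each source-language keyword.

    aligned_docs: list of (source_keywords, target_keywords) pairs, one pair
    per aligned article. Hypothetical input format, not taken from the paper.
    """
    src_count = defaultdict(int)   # articles containing each source keyword
    tgt_count = defaultdict(int)   # articles containing each target keyword
    pair_count = defaultdict(int)  # aligned article pairs containing both

    for src_keywords, tgt_keywords in aligned_docs:
        src_set, tgt_set = set(src_keywords), set(tgt_keywords)
        for s in src_set:
            src_count[s] += 1
        for t in tgt_set:
            tgt_count[t] += 1
        for s in src_set:
            for t in tgt_set:
                pair_count[(s, t)] += 1

    # Dice coefficient: 2 * joint / (source count + target count).
    scores = defaultdict(list)
    for (s, t), joint in pair_count.items():
        dice = 2.0 * joint / (src_count[s] + tgt_count[t])
        scores[s].append((dice, t))

    # Sort by descending score, breaking ties alphabetically.
    return {
        s: [t for _, t in sorted(cands, key=lambda c: (-c[0], c[1]))[:top_n]]
        for s, cands in scores.items()
    }


# Toy aligned "articles", each already reduced to keywords (invented data).
toy_pairs = [
    (["dog", "animal"], ["Hund", "Tier"]),
    (["cat", "animal"], ["Katze", "Tier"]),
    (["dog", "bone"], ["Hund", "Knochen"]),
]
candidates = keyword_translation_candidates(toy_pairs)
```

No seed lexicon is involved: the only cross-lingual signal is which article pairs are aligned. With realistic Wikipedia data the keyword sets are far noisier than in this toy example, which is presumably why the paper treats the task as alignment in a noisy environment.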
Anthology ID:
L12-1529
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
460–466
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/888_Paper.pdf
Cite (ACL):
Reinhard Rapp, Serge Sharoff, and Bogdan Babych. 2012. Identifying Word Translations from Comparable Documents Without a Seed Lexicon. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 460–466, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Identifying Word Translations from Comparable Documents Without a Seed Lexicon (Rapp et al., LREC 2012)
PDF:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/888_Paper.pdf