Towards producing bilingual lexica from monolingual corpora

Jingyi Han, Núria Bel


Abstract
Bilingual lexica are the basis for many cross-lingual natural language processing tasks. Recent work has succeeded in learning bilingual dictionaries by taking advantage of comparable corpora and a diverse set of signals derived from monolingual corpora. In the present work, we describe an approach to automatically learn bilingual lexica by training a supervised classifier on word embedding-based vectors of only a few hundred translation-equivalent word pairs. The word embedding representations of the translation pairs were obtained from source and target monolingual corpora, which are not necessarily related. Our classifier is able to predict whether or not a new word pair is in a translation relation. We tested it on two quite distinct language pairs, Chinese-Spanish and English-Spanish. The classifiers achieved more than 0.90 precision and recall for both language pairs in different evaluation scenarios. These results show a high potential for this method to be used in bilingual lexica production for language pairs with limited amounts of parallel or comparable corpora, in particular for phrase table expansion in Statistical Machine Translation systems.
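The general idea described in the abstract, a binary classifier over embedding-based representations of word pairs, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy three-dimensional vectors, the seed lexicon, the outer-product pair representation, the logistic regression model, and the names pair_features, train and is_translation are all hypothetical. The abstract does not state which classifier or pair representation the authors used.

```python
import math

# Hypothetical toy embeddings (assumption): in practice these would come from
# models trained separately on unrelated source and target monolingual corpora,
# so the two spaces share no axes. Here the target axes are deliberately permuted.
SOURCE_VECS = {"dog": [1.0, 0.1, 0.0], "house": [0.0, 1.0, 0.1], "water": [0.1, 0.0, 1.0]}
TARGET_VECS = {"perro": [0.0, 0.1, 1.0], "casa": [1.0, 0.0, 0.1], "agua": [0.1, 1.0, 0.0]}
SEED_LEXICON = {("dog", "perro"), ("house", "casa"), ("water", "agua")}


def pair_features(src_word, tgt_word):
    """Flattened outer product of the two embeddings, plus a constant bias term.

    The outer product is one possible pair representation (an assumption); it
    lets a linear model learn interactions between the two unaligned spaces.
    """
    s, t = SOURCE_VECS[src_word], TARGET_VECS[tgt_word]
    return [si * tj for si in s for tj in t] + [1.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=5000, lr=1.0):
    """Fit logistic regression weights with full-batch gradient descent."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in examples:
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi / len(examples)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def is_translation(weights, src_word, tgt_word):
    """Return True if the classifier scores the pair as a translation."""
    x = pair_features(src_word, tgt_word)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x))) > 0.5


# Seed pairs are positive examples; every other combination is a negative one.
EXAMPLES = [
    (pair_features(s, t), 1.0 if (s, t) in SEED_LEXICON else 0.0)
    for s in SOURCE_VECS
    for t in TARGET_VECS
]
WEIGHTS = train(EXAMPLES)
```

On this toy data the sketch only shows that the classifier can fit its seed pairs. With real embeddings, is_translation would instead be applied to candidate pairs that were not in the seed lexicon, which is the setting the abstract evaluates.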
Anthology ID:
L16-1353
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2222–2227
URL:
https://aclanthology.org/L16-1353
Cite (ACL):
Jingyi Han and Núria Bel. 2016. Towards producing bilingual lexica from monolingual corpora. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2222–2227, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Towards producing bilingual lexica from monolingual corpora (Han & Bel, LREC 2016)
PDF:
https://aclanthology.org/L16-1353.pdf