Learning Bilingual Projections of Embeddings for Vocabulary Expansion in Machine Translation

Pranava Swaroop Madhyastha, Cristina España-Bonet


Abstract
We propose a simple log-bilinear softmax-based model to deal with vocabulary expansion in machine translation. Our model uses word embeddings trained on large unlabelled monolingual corpora and is trained on a fairly small word-to-word bilingual dictionary. Given an out-of-vocabulary source word, the model generates a list of possible translations in the target language, each with a probability, using the trained bilingual embeddings. We integrate these translation options into a standard phrase-based statistical machine translation system and obtain consistent improvements in translation quality on the English–Spanish language pair. On an out-of-domain test set, we get a significant improvement of 3.9 BLEU points.
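The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes a generic log-bilinear form, p(t | s) proportional to exp(s^T W t), where s and t are monolingual embeddings and W is a projection matrix that has already been learned from the dictionary. The abstract does not give the exact parametrisation or training procedure, so the function names, the shape of W, and the toy vocabulary below are all hypothetical.

```python
import numpy as np


def translation_distribution(src_vec, W, tgt_matrix):
    """Softmax over target words of the bilinear score s^T W t.

    src_vec:    (d_src,) embedding of the source word
    W:          (d_src, d_tgt) projection matrix, assumed already trained
    tgt_matrix: (V, d_tgt) one target-word embedding per row
    """
    scores = tgt_matrix @ (W.T @ src_vec)  # one bilinear score per target word
    scores = scores - scores.max()         # shift for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()


def top_k(src_vec, W, tgt_matrix, tgt_words, k=3):
    """Return the k most probable target words with their probabilities."""
    probs = translation_distribution(src_vec, W, tgt_matrix)
    best = np.argsort(-probs)[:k]          # indices sorted by descending probability
    return [(tgt_words[i], float(probs[i])) for i in best]


if __name__ == "__main__":
    # Toy data only: three target words with one-hot embeddings and an
    # identity projection, so the source vector lines up with "gato".
    tgt_words = ["perro", "gato", "casa"]
    tgt_matrix = np.eye(3)
    W = np.eye(3)
    src_vec = np.array([0.0, 1.0, 0.0])
    for word, prob in top_k(src_vec, W, tgt_matrix, tgt_words, k=2):
        print(word, round(prob, 3))
```

In a pipeline like the one the abstract outlines, the ranked candidates for each out-of-vocabulary source word would be handed to the phrase-based system as extra translation options. How they are weighted there is not stated in the abstract and is left out of this sketch.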
Anthology ID:
W17-2617
Volume:
Proceedings of the 2nd Workshop on Representation Learning for NLP
Month:
August
Year:
2017
Address:
Vancouver, Canada
Editors:
Phil Blunsom, Antoine Bordes, Kyunghyun Cho, Shay Cohen, Chris Dyer, Edward Grefenstette, Karl Moritz Hermann, Laura Rimell, Jason Weston, Scott Yih
Venue:
RepL4NLP
SIG:
SIGREP
Publisher:
Association for Computational Linguistics
Pages:
139–145
URL:
https://aclanthology.org/W17-2617
DOI:
10.18653/v1/W17-2617
Cite (ACL):
Pranava Swaroop Madhyastha and Cristina España-Bonet. 2017. Learning Bilingual Projections of Embeddings for Vocabulary Expansion in Machine Translation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 139–145, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal):
Learning Bilingual Projections of Embeddings for Vocabulary Expansion in Machine Translation (Madhyastha & España-Bonet, RepL4NLP 2017)
PDF:
https://aclanthology.org/W17-2617.pdf