word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs

Yo Joong Choe; Kyubyong Park; Dongwoo Kim

Abstract

We present word2word, a publicly available dataset and an open-source Python package for cross-lingual word translations extracted from sentence-level parallel corpora. Our dataset provides top-k word translations in 3,564 (directed) language pairs across 62 languages in OpenSubtitles2018 (Lison et al., 2018). To obtain this dataset, we use a count-based bilingual lexicon extraction model based on the observation that not only source and target words but also source words themselves can be highly correlated. We illustrate that the resulting bilingual lexicons have high coverage and attain competitive translation quality for several language pairs. We wrap our dataset and model in an easy-to-use Python library, which supports downloading and retrieving top-k word translations in any of the supported language pairs as well as computing top-k word translations for custom parallel corpora.
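To make the idea of count-based lexicon extraction concrete, the following is a minimal sketch of a plain co-occurrence baseline scored with pointwise mutual information (PMI) on a toy English-French corpus. It is not the extraction model described in the abstract and not the word2word package API: the function name `top_k_translations`, the toy sentences, and the tie-breaking rule are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product


def top_k_translations(pairs, k=2):
    """Rank target words for each source word by sentence-level PMI.

    `pairs` is a list of (source_sentence, target_sentence) strings.
    Hypothetical helper for illustration; a simple baseline only.
    """
    n = len(pairs)
    src_count, tgt_count, co_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        # Count each word at most once per sentence pair.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        co_count.update(product(src_words, tgt_words))

    candidates = {}
    for (s, t), c in co_count.items():
        # PMI = log( p(s, t) / (p(s) * p(t)) ) with sentence-level counts.
        pmi = math.log(c * n / (src_count[s] * tgt_count[t]))
        candidates.setdefault(s, []).append((pmi, c, t))

    # Sort by PMI, then by raw co-occurrence count, then alphabetically.
    return {
        s: [t for _, _, t in sorted(c, key=lambda x: (-x[0], -x[1], x[2]))[:k]]
        for s, c in candidates.items()
    }


toy_corpus = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
    ("a dog eats", "un chien mange"),
]
lexicon = top_k_translations(toy_corpus, k=2)
print(lexicon["cat"])
print(lexicon["dog"])
```

In this toy corpus, "dog" receives the same PMI score for "chien" and "un", because "a" and "dog" appear in the same source sentence; the sketch separates them only through the raw-count tie-break. This kind of confusion between co-occurring source words is one plausible reading of the correlation among source words that the abstract says the extraction model is built to handle.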

Anthology ID: 2020.lrec-1.371
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 3036–3045
Language: English
URL: https://aclanthology.org/2020.lrec-1.371/
Cite (ACL): Yo Joong Choe, Kyubyong Park, and Dongwoo Kim. 2020. word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3036–3045, Marseille, France. European Language Resources Association.
Cite (Informal): word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs (Choe et al., LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.371.pdf