An Alignment-Based Approach to Semi-Supervised Bilingual Lexicon Induction with Small Parallel Corpora

Kelly Marchisio, Philipp Koehn, Conghao Xiong


Abstract
Aimed at generating a seed lexicon for use in downstream natural language tasks, unsupervised methods for bilingual lexicon induction have received much attention in the academic literature recently. While interesting, fully unsupervised settings are unrealistic; small amounts of bilingual data are usually available due to the existence of massively multilingual parallel corpora, or because linguists can create small amounts of parallel data. In this work, we demonstrate an effective bootstrapping approach for semi-supervised bilingual lexicon induction that capitalizes upon the complementary strengths of two disparate methods for inducing bilingual lexicons. Whereas statistical methods are highly effective at inducing correct translation pairs for words frequently occurring in a parallel corpus, monolingual embedding spaces have the advantage of having been trained on large amounts of data, and therefore may induce accurate translations for words absent from the small corpus. By combining these relative strengths, our method achieves state-of-the-art results on 3 of 4 language pairs in the challenging VecMap test set using minimal amounts of parallel data and without the need for a translation dictionary. We release our implementation at www.blind-review.code.
Anthology ID:
2021.mtsummit-research.24
Volume:
Proceedings of Machine Translation Summit XVIII: Research Track
Month:
August
Year:
2021
Address:
Virtual
Editors:
Kevin Duh, Francisco Guzmán
Venue:
MTSummit
Publisher:
Association for Machine Translation in the Americas
Pages:
293–304
URL:
https://aclanthology.org/2021.mtsummit-research.24
Cite (ACL):
Kelly Marchisio, Philipp Koehn, and Conghao Xiong. 2021. An Alignment-Based Approach to Semi-Supervised Bilingual Lexicon Induction with Small Parallel Corpora. In Proceedings of Machine Translation Summit XVIII: Research Track, pages 293–304, Virtual. Association for Machine Translation in the Americas.
Cite (Informal):
An Alignment-Based Approach to Semi-Supervised Bilingual Lexicon Induction with Small Parallel Corpora (Marchisio et al., MTSummit 2021)
PDF:
https://aclanthology.org/2021.mtsummit-research.24.pdf
Code
 kellymarchisio/align-semisup-bli