Data Selection for Bilingual Lexicon Induction from Specialized Comparable Corpora

Martin Laville, Amir Hazem, Emmanuel Morin, Phillippe Langlais


Abstract
Narrow specialized comparable corpora are often small. This makes it difficult to build effective models for acquiring translation equivalents, especially for infrequent and rare words. One way to overcome this issue is to enrich the specialized corpora with out-of-domain resources. Although some recent studies have shown improvements using data augmentation, the enrichment was done crudely, by adding out-of-domain data with no particular attention to which words to enrich or how to do so optimally. In this paper, we contrast several data selection techniques to improve bilingual lexicon induction from specialized comparable corpora. We first apply two well-established data selection techniques often used in machine translation, namely TF-IDF and cross entropy. Then, we propose to exploit BERT for data selection. Overall, all the proposed techniques improve the quality of the extracted bilingual lexicons by a large margin. The best-performing technique is cross entropy, which obtains a gain of about 4 points in MAP while decreasing computation time by a factor of 10.
Anthology ID:
2020.coling-main.527
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Donia Scott, Nuria Bel, Chengqing Zong
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
6002–6012
URL:
https://aclanthology.org/2020.coling-main.527
DOI:
10.18653/v1/2020.coling-main.527
Cite (ACL):
Martin Laville, Amir Hazem, Emmanuel Morin, and Phillippe Langlais. 2020. Data Selection for Bilingual Lexicon Induction from Specialized Comparable Corpora. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6002–6012, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
Data Selection for Bilingual Lexicon Induction from Specialized Comparable Corpora (Laville et al., COLING 2020)
PDF:
https://aclanthology.org/2020.coling-main.527.pdf