Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora

Amir Hazem, Emmanuel Morin


Abstract
Comparable corpora are the main alternative to parallel corpora for extracting bilingual lexicons. Although comparable corpora are easier to build, specialized comparable corpora are often of modest size compared with general-domain corpora. Consequently, the observations of word co-occurrences, which are the basis of context-based methods, are unreliable. In this article, we propose to improve the word co-occurrences of specialized comparable corpora, and thus their context representation, by using general-domain data. This idea, which has been used in machine translation for more than a decade, is not straightforward to apply to bilingual lexicon extraction from domain-specific comparable corpora. We go against the mainstream of this task, where many studies support the idea that adding out-of-domain documents decreases the quality of lexicons. Our empirical evaluation shows the advantages of this approach, which yields a significant gain in the accuracy of the extracted lexicons.
Anthology ID:
C16-1321
Volume:
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Yuji Matsumoto, Rashmi Prasad
Venue:
COLING
Publisher:
The COLING 2016 Organizing Committee
Pages:
3401–3411
URL:
https://aclanthology.org/C16-1321
Cite (ACL):
Amir Hazem and Emmanuel Morin. 2016. Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3401–3411, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora (Hazem & Morin, COLING 2016)
PDF:
https://aclanthology.org/C16-1321.pdf