Multilingual Dictionary Based Construction of Core Vocabulary

Winston Wu, Garrett Nicolai, David Yarowsky


Abstract
We propose a new functional definition and construction method for core vocabulary sets for multiple applications, based on the relative coverage of a target concept in thousands of bilingual dictionaries. Our newly developed core concept vocabulary list, derived from these dictionary consensus methods, achieves high overlap with existing widely used core vocabulary lists targeted at applications such as first and second language learning or field linguistics. Our in-depth analysis illustrates multiple desirable properties of the proposed core vocabulary set, including its non-compositionality. We employ a cognate prediction method to recover missing coverage of this core vocabulary in massively multilingual dictionary construction, and we argue that this core vocabulary should be prioritized for elicitation when creating new dictionaries for low-resource languages for multiple downstream tasks, including machine translation and language learning.
Anthology ID:
2020.lrec-1.519
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4211–4217
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.519
Cite (ACL):
Winston Wu, Garrett Nicolai, and David Yarowsky. 2020. Multilingual Dictionary Based Construction of Core Vocabulary. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4211–4217, Marseille, France. European Language Resources Association.
Cite (Informal):
Multilingual Dictionary Based Construction of Core Vocabulary (Wu et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.519.pdf