Towards Lower Bounds on Number of Dimensions for Word Embeddings

Kevin Patel, Pushpak Bhattacharyya


Abstract
Word embeddings are a relatively new addition to the modern NLP researcher’s toolkit. However, unlike other tools, word embeddings are used in a black-box manner, and there are very few studies of their hyperparameters. One such hyperparameter is the dimension of the embeddings, which is usually chosen by rule of thumb, in the range 50 to 300. In this paper, we show that the dimension should instead be chosen based on corpus statistics. More specifically, we show that the number of pairwise equidistant words of the corpus vocabulary (as defined by some distance/similarity metric) gives a lower bound on the number of dimensions, and that going below this bound degrades the quality of the learned word embeddings. Through evaluations on standard word embedding evaluation tasks, we show that dimensions greater than or equal to the bound give better results than dimensions below it.
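As a rough illustration of the quantity the abstract refers to, the sketch below finds the largest set of pairwise equidistant words in a toy vocabulary. Everything in it is an assumption made for illustration: the function name `largest_equidistant_set`, the tolerance `tol`, the brute-force search, and the hand-written distance matrix. It is not the authors' procedure, and the step that turns this count into a dimension bound depends on the chosen metric and is not reproduced here.

```python
from itertools import combinations

def largest_equidistant_set(words, dist, tol=1e-6):
    """Return the largest subset of `words` whose pairwise distances
    all agree to within `tol`.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    The search is brute force and exponential, so toy inputs only.
    """
    n = len(words)
    # Try candidate subsets from largest to smallest (down to size 3).
    for size in range(n, 2, -1):
        for idx in combinations(range(n), size):
            ds = [dist[i][j] for i, j in combinations(idx, 2)]
            if max(ds) - min(ds) <= tol:
                return [words[i] for i in idx]
    # Any two words form a trivially equidistant set.
    return list(words[:2])

# Hypothetical toy data: "a", "b", "c" are mutually at distance 1.0,
# while "d" sits at differing distances from them.
words = ["a", "b", "c", "d"]
dist = [
    [0.0, 1.0, 1.0, 2.0],
    [1.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 3.0],
    [2.0, 1.0, 3.0, 0.0],
]

result = largest_equidistant_set(words, dist)
print(result, len(result))
```

On a real vocabulary the distances would come from corpus statistics under the chosen metric, and an exhaustive search like this one would be infeasible, so a heuristic or approximate search would be needed instead.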
Anthology ID:
I17-2006
Volume:
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Month:
November
Year:
2017
Address:
Taipei, Taiwan
Editors:
Greg Kondrak, Taro Watanabe
Venue:
IJCNLP
Publisher:
Asian Federation of Natural Language Processing
Pages:
31–36
URL:
https://aclanthology.org/I17-2006
Cite (ACL):
Kevin Patel and Pushpak Bhattacharyya. 2017. Towards Lower Bounds on Number of Dimensions for Word Embeddings. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 31–36, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Cite (Informal):
Towards Lower Bounds on Number of Dimensions for Word Embeddings (Patel & Bhattacharyya, IJCNLP 2017)
PDF:
https://aclanthology.org/I17-2006.pdf
Note:
 I17-2006.Notes.pdf