Statistical Uncertainty in Word Embeddings: GloVe-V

Andrea Vallebueno, Cassandra Handan-Nader, Christopher Manning, Daniel Ho


Abstract
Static word embeddings are ubiquitous in computational social science applications and contribute to practical decision-making in a variety of fields including law and healthcare. However, assessing the statistical uncertainty in downstream conclusions drawn from word embedding statistics has remained challenging. When using only point estimates for embeddings, researchers have no streamlined way of assessing the degree to which their model selection criteria or scientific conclusions are subject to noise due to sparsity in the underlying data used to generate the embeddings. We introduce a method to obtain approximate, easy-to-use, and scalable reconstruction error variance estimates for GloVe, one of the most widely used word embedding models, using an analytical approximation to a multivariate normal model. To demonstrate the value of embeddings with variance (GloVe-V), we illustrate how our approach enables principled hypothesis testing in core word embedding tasks, such as comparing the similarity between different word pairs in vector space, assessing the performance of different models, and analyzing the relative degree of ethnic or gender bias in a corpus using different word lists.
Anthology ID:
2024.emnlp-main.510
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
9032–9047
URL:
https://aclanthology.org/2024.emnlp-main.510
Cite (ACL):
Andrea Vallebueno, Cassandra Handan-Nader, Christopher Manning, and Daniel Ho. 2024. Statistical Uncertainty in Word Embeddings: GloVe-V. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9032–9047, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Statistical Uncertainty in Word Embeddings: GloVe-V (Vallebueno et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.510.pdf