Evaluating the Stability of Embedding-based Word Similarities

Maria Antoniak, David Mimno


Abstract
Word embeddings are increasingly being used as a tool to study word associations in specific corpora. However, it is unclear whether such embeddings reflect enduring properties of language or if they are sensitive to inconsequential variations in the source documents. We find that nearest-neighbor distances are highly sensitive to small changes in the training corpus for a variety of algorithms. For all methods, including specific documents in the training set can result in substantial variations. We show that these effects are more prominent for smaller training corpora. We recommend that users never rely on single embedding models for distance calculations, but rather average over multiple bootstrap samples, especially for small corpora.
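As a rough illustration of the recommended bootstrap averaging (a sketch, not the authors' exact experimental setup), the following Python snippet assumes gensim 4.x and illustrative word2vec hyperparameters: it resamples tokenized documents with replacement, trains one embedding model per bootstrap sample, and reports the mean and spread of the cosine similarity between a query word pair rather than trusting a single model.

import random
from statistics import mean, stdev
from gensim.models import Word2Vec  # assumes gensim 4.x is installed

def bootstrap_similarity(docs, w1, w2, n_samples=25, seed=0):
    """Estimate the similarity of w1 and w2 by averaging over embedding
    models trained on bootstrap samples of the document collection.
    `docs` is a list of tokenized documents (lists of strings)."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_samples):
        # Resample documents with replacement to form one bootstrap corpus.
        sample = rng.choices(docs, k=len(docs))
        # Hyperparameters here are illustrative, not the paper's settings.
        model = Word2Vec(sentences=sample, vector_size=100, window=5,
                         min_count=5, workers=4, seed=rng.randrange(2**31))
        # A rare word may drop out of some bootstrap vocabularies.
        if w1 in model.wv and w2 in model.wv:
            sims.append(float(model.wv.similarity(w1, w2)))
    if not sims:
        raise ValueError("query words missing from every bootstrap vocabulary")
    spread = stdev(sims) if len(sims) > 1 else 0.0
    return mean(sims), spread

Given a tokenized corpus such as docs = [["the", "court", "ruled", ...], ...], bootstrap_similarity(docs, "justice", "law") returns the averaged similarity together with its standard deviation across samples, the latter serving as a rough indicator of how stable the estimate is for that corpus size.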
Anthology ID:
Q18-1008
Volume:
Transactions of the Association for Computational Linguistics, Volume 6
Year:
2018
Address:
Cambridge, MA
Editors:
Lillian Lee, Mark Johnson, Kristina Toutanova, Brian Roark
Venue:
TACL
Publisher:
MIT Press
Pages:
107–119
URL:
https://aclanthology.org/Q18-1008
DOI:
10.1162/tacl_a_00008
Cite (ACL):
Maria Antoniak and David Mimno. 2018. Evaluating the Stability of Embedding-based Word Similarities. Transactions of the Association for Computational Linguistics, 6:107–119.
Cite (Informal):
Evaluating the Stability of Embedding-based Word Similarities (Antoniak & Mimno, TACL 2018)
PDF:
https://aclanthology.org/Q18-1008.pdf
Video:
https://aclanthology.org/Q18-1008.mp4
Data
New York Times Annotated Corpus