Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words

Kaitlyn Zhou, Kawin Ethayarajh, Dallas Card, Dan Jurafsky


Abstract
Cosine similarity of contextual embeddings is used in many NLP tasks (e.g., QA, IR, MT) and metrics (e.g., BERTScore). Here, we uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgements, cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts, even after controlling for polysemy and other factors. We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words and provide a formal argument for the two-dimensional case.
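The measurement the abstract describes — cosine similarity between BERT contextual embeddings of the same word in different contexts — can be sketched in a few lines. The following is a minimal illustration, not the authors' released code (see the Code link below); the model choice, mean-pooling over subword tokens, and the example sentences are assumptions for demonstration:

    # Minimal sketch: cosine similarity between contextual embeddings of the
    # same word in two sentences. Not the authors' implementation; model and
    # pooling strategy are illustrative assumptions.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.eval()

    def word_embedding(sentence, word):
        """Mean-pool final-layer hidden states over the word's subword tokens."""
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
        word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        ids = enc["input_ids"][0].tolist()
        # Find the word's subword span in the tokenized sentence.
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i:i + len(word_ids)] == word_ids:
                return hidden[i:i + len(word_ids)].mean(dim=0)
        raise ValueError(f"{word!r} not found in {sentence!r}")

    a = word_embedding("She sat on the river bank.", "bank")
    b = word_embedding("He deposited money at the bank.", "bank")
    print(torch.nn.functional.cosine_similarity(a, b, dim=0).item())

The paper's claim is about scores like this one: for high-frequency words, they tend to be systematically lower than human similarity judgements would suggest.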
Anthology ID: 2022.acl-short.45
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month: May
Year: 2022
Address: Dublin, Ireland
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 401–423
URL: https://aclanthology.org/2022.acl-short.45
DOI: 10.18653/v1/2022.acl-short.45
Cite (ACL): Kaitlyn Zhou, Kawin Ethayarajh, Dallas Card, and Dan Jurafsky. 2022. Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 401–423, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words (Zhou et al., ACL 2022)
PDF: https://aclanthology.org/2022.acl-short.45.pdf
Code: katezhou/cosine_and_frequency
Data: WiC