Investigating the Frequency Distortion of Word Embeddings and Its Impact on Bias Metrics

Francisco Valentini; Juan Sosa; Diego Slezak; Edgar Altszyler

doi:10.18653/v1/2023.findings-emnlp.9

Abstract

Recent research has shown that static word embeddings can encode words’ frequencies. However, this behavior has received little study. In the present work, we study how frequency and semantic similarity relate to one another in static word embeddings, and we assess the impact of this relationship on embedding-based bias metrics. We find that Skip-gram, GloVe and FastText embeddings tend to produce higher similarity between high-frequency words than between other frequency combinations. We show that the association between frequency and similarity also appears when words are randomly shuffled, and that it holds for different hyperparameter settings. This proves that the patterns we find are due neither to real semantic associations nor to specific parameter choices, but are instead an artifact produced by the word embeddings. To illustrate how frequencies can affect the measurement of biases related to gender, ethnicity, and affluence, we carry out a controlled experiment that shows that biases can even change sign or reverse their order when word frequencies change.
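As an illustration of what an embedding-based bias metric can look like, the sketch below scores a target word by the difference between its mean cosine similarity to two groups of attribute words. This is one common form of such metrics and is not necessarily the exact metric used in the paper. The function names, the toy two-dimensional vectors, and the word lists are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity(target, attribute_words, emb):
    """Average cosine similarity between a target word and a word group."""
    sims = [cosine(emb[target], emb[w]) for w in attribute_words]
    return sum(sims) / len(sims)


def bias(target, group_a, group_b, emb):
    """Positive values mean the target is closer to group_a than to group_b."""
    return mean_similarity(target, group_a, emb) - mean_similarity(target, group_b, emb)


# Hypothetical toy embeddings (assumption: real embeddings would come from
# a trained Skip-gram, GloVe or FastText model).
toy_emb = {
    "nurse": [1.0, 0.0],
    "she": [1.0, 0.0],
    "he": [0.0, 1.0],
}

score = bias("nurse", ["she"], ["he"], toy_emb)
```

Because a score of this kind is built entirely from similarities, any systematic link between word frequency and similarity can shift it, which is the concern the abstract raises.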

Anthology ID: 2023.findings-emnlp.9
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 113–126
URL: https://aclanthology.org/2023.findings-emnlp.9
DOI: 10.18653/v1/2023.findings-emnlp.9
Cite (ACL): Francisco Valentini, Juan Sosa, Diego Slezak, and Edgar Altszyler. 2023. Investigating the Frequency Distortion of Word Embeddings and Its Impact on Bias Metrics. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 113–126, Singapore. Association for Computational Linguistics.
Cite (Informal): Investigating the Frequency Distortion of Word Embeddings and Its Impact on Bias Metrics (Valentini et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.9.pdf
