Diego Slezak


2023

pdf bib
Investigating the Frequency Distortion of Word Embeddings and Its Impact on Bias Metrics
Francisco Valentini | Juan Sosa | Diego Slezak | Edgar Altszyler
Findings of the Association for Computational Linguistics: EMNLP 2023

Recent research has shown that static word embeddings can encode words’ frequencies. However, little has been studied about this behavior. In the present work, we study how frequency and semantic similarity relate to one another in static word embeddings, and we assess the impact of this relationship on embedding-based bias metrics. We find that Skip-gram, GloVe and FastText embeddings tend to produce higher similarity between high-frequency words than between other frequency combinations. We show that the association between frequency and similarity also appears when words are randomly shuffled, and holds for different hyperparameter settings. This proves that the patterns we find are neither due to real semantic associations nor to specific parameters choices, and are an artifact produced by the word embeddings. To illustrate how frequencies can affect the measurement of biases related to gender, ethnicity, and affluence, we carry out a controlled experiment that shows that biases can even change sign or reverse their order when word frequencies change.