Assessing the Reliability of Word Embedding Gender Bias Measures

Yupei Du, Qixiang Fang, Dong Nguyen


Abstract
Various measures have been proposed to quantify human-like social biases in word embeddings. However, bias scores based on these measures can suffer from measurement error. One indication of measurement quality is reliability, concerning the extent to which a measure produces consistent results. In this paper, we assess three types of reliability of word embedding gender bias measures, namely test-retest reliability, inter-rater consistency and internal consistency. Specifically, we investigate the consistency of bias scores across different choices of random seeds, scoring rules and words. Furthermore, we analyse the effects of various factors on these measures’ reliability scores. Our findings inform better design of word embedding gender bias measures. Moreover, we urge researchers to be more critical about the application of such measures.
Anthology ID:
2021.emnlp-main.785
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
10012–10034
URL:
https://aclanthology.org/2021.emnlp-main.785
DOI:
10.18653/v1/2021.emnlp-main.785
Cite (ACL):
Yupei Du, Qixiang Fang, and Dong Nguyen. 2021. Assessing the Reliability of Word Embedding Gender Bias Measures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10012–10034, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Assessing the Reliability of Word Embedding Gender Bias Measures (Du et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.785.pdf
Video:
https://aclanthology.org/2021.emnlp-main.785.mp4
Data
WikiText-103, WikiText-2