Assessing the Reliability of Word Embedding Gender Bias Measures

Various measures have been proposed to quantify human-like social biases in word embeddings. However, bias scores based on these measures can suffer from measurement error. One indication of measurement quality is reliability, concerning the extent to which a measure produces consistent results. In this paper, we assess three types of reliability of word embedding gender bias measures, namely test-retest reliability, inter-rater consistency and internal consistency. Specifically, we investigate the consistency of bias scores across different choices of random seeds, scoring rules and words. Furthermore, we analyse the effects of various factors on these measures' reliability scores. Our findings inform better design of word embedding gender bias measures. Moreover, we urge researchers to be more critical about the application of such measures.


Introduction
Despite their success in various applications, word embeddings have been shown to exhibit a range of human-like social biases. For example, Caliskan et al. (2017) find that both GloVe (Pennington et al., 2014) and skip-gram (Mikolov et al., 2013) embeddings associate pleasant terms (e.g. love and peace) more with European-American names than with African-American names, and that they associate career words (e.g. profession and business) more with male names than with female names.
Various measures have been proposed to quantify such biases in word embeddings (Ethayarajh et al., 2019; Zhou et al., 2019; Manzini et al., 2019). These measures allow us to assess biases in word embeddings and the performance of bias-mitigation methods (Bolukbasi et al., 2016). They also enable us to study social biases in a new way, complementary to traditional qualitative methods.1

A key challenge in developing bias measures is that social biases are abstract concepts that cannot be measured directly but have to be inferred from some observable data. This renders the resulting bias scores more prone to measurement errors. Therefore, it is important to carefully assess these measures' measurement quality. Jacobs and Wallach (2021) also highlight similar measurement issues in the context of fairness research.

1 Our code is available at https://github.com/nlpsoc/reliability_bias.
In this paper, we focus on one aspect of measurement quality: reliability. It concerns the extent to which a measure produces consistent results. In particular, we investigate the reliability of word embedding gender bias measures. Figure 1 illustrates how gender biases in word embeddings are typically measured, using gender biases in the concept of arts as an example. To calculate the gender bias score of a target word w (e.g. poetry, a word that relates to the concept of interest), we need to specify a gender base pair (m, f) (a gendered word pair, e.g. father/mother) and a scoring rule. A scoring rule is a function that takes the embedding vectors of w, m and f as input and returns a bias score bias^{m,f}_w as output. In practice, an ensemble of gender base pairs is often used. In this case, we aggregate (e.g. average) all the bias scores w.r.t. the different gender base pairs to obtain an overall bias score for a target word. Furthermore, multiple conceptually related target words (e.g. poetry, drama) may be specified to form a query q (e.g. arts). By aggregating the individual bias scores of these target words, we can compute an overall bias score for a concept.
Clearly, the choice of target words, gender base pairs and scoring rules may influence the resulting bias scores. Reliable measurements of word embedding gender biases thus require target words, gender base pairs and scoring rules that produce consistent results. In this work, by drawing from measurement theory, we propose a comprehensive approach ( §4) to evaluate three types of reliability for these different components: • First, we assess the consistency of bias scores associated with different target words, gender base pairs and scoring rules, over different random seeds used in word embedding training (i.e. test-retest reliability; §5.2).
• Second, we assess the consistency of bias scores associated with different target words and gender base pairs, across different scoring rules (i.e. inter-rater consistency; §5.3).
• Third, we assess the consistency of bias scores over 1) different target words within a query and 2) different gender base pairs (i.e. internal consistency; §5.4).
Furthermore, we use multilevel regression to model the effects of various factors (e.g. word properties, embedding algorithms, training corpora) on the reliability scores of target words ( §5.5).
Our experiments show that word embedding gender bias scores are mostly consistent across different random seeds (i.e. high test-retest reliability) and across target words within the same query (i.e. high internal consistency). However, different scoring rules generally fail to agree with one another (i.e. low inter-rater consistency). Moreover, word embedding algorithms have a large influence on the reliability of bias scores.
Contributions First, we connect measurement theory to word embedding bias measures. Specifically, we propose a reliability evaluation framework for word embedding (gender) bias measures. Second, we provide a comprehensive assessment of the reliability of word embedding gender bias measures. Based on our findings, we urge researchers to be more critical about applying such measures.

Related Work
Measuring gender biases in word embeddings has been receiving a growing amount of research interest in NLP. Various gender bias measures have been proposed. They are based on different techniques, such as linear gender subspace identification (Bolukbasi et al., 2016; Vargas and Cotterell, 2020; Manzini et al., 2019), psychological tests (Ethayarajh et al., 2019; Caliskan et al., 2017), inference from nearest neighbours (Gonen and Goldberg, 2019) and regression (Sweeney and Najafian, 2019; Badilla et al., 2020).
However, recent studies have raised concerns over the reliability of such measures. Zhang et al. (2020) show that gender bias scores easily vary in their direction and magnitude when different forms (e.g. capitalisation) of target words or different gender base pairs are used. Similarly, Antoniak and Mimno (2021) look into 178 different gender base pairs from previous works and find that the choice of gender base pairs can greatly impact bias measurements. They therefore urge future work to examine and document the choices of gender base pairs. Moreover, D'Amour et al. (2020) find that underspecification of models can lead to unstable contextualised word embedding bias scores. These findings call for a more systematic evaluation of the reliability of word embedding gender bias measures, which is the goal of our study.
Such measures' lack of reliability may partly stem from the fact that word embeddings themselves are often unstable, sensitive to choices of, for instance, word embedding algorithms (Antoniak and Mimno, 2018; Hellrich et al., 2019), hyper-parameters (Levy et al., 2015; Mimno and Thompson, 2017; Hellrich et al., 2019) and even random seeds (Hellrich and Hahn, 2016; Bloem et al., 2019) during word embedding training.

Preliminaries
In this section, we first review three popular scoring rules used for measuring word embedding gender biases ( §3.1). Then, we introduce the conceptual framework of reliability and motivate its use in word embedding gender bias measurements ( §3.2).

Scoring Rules
Following Zhang et al. (2020), we focus on three popular scoring rules: DB/WA, RIPA and NBM.2

DB/WA DB/WA (Direct Bias / Word Association) is one of the most commonly used scoring rules in previous work (Bolukbasi et al., 2016; Caliskan et al., 2017). Given a gender base pair (m, f), the DB/WA score of a target word w is

bias^{m,f}_w = cos(w, m) − cos(w, f),

where the bold symbols denote the corresponding word vectors, and cos(x, y) refers to the cosine similarity of x and y.
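To make the rule concrete, here is a minimal NumPy sketch of DB/WA as a difference of cosine similarities; the function names and toy vectors are our own illustration, not code from the paper's repository.

```python
import numpy as np

def cos(x, y):
    """Cosine similarity of two vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def db_wa(w, m, f):
    """DB/WA bias score: cos(w, m) - cos(w, f).

    Positive scores indicate a masculine association, negative a feminine one.
    """
    return cos(w, m) - cos(w, f)

# Toy 2-d vectors: w points in the same direction as m, so the score is positive.
w = np.array([1.0, 0.0])
m = np.array([2.0, 0.0])
f = np.array([0.0, 1.0])
print(db_wa(w, m, f))  # 1.0: maximally masculine-leaning in this toy setup
```

Note that normalisation happens inside each cosine, i.e. at the word level, which is the property RIPA changes below.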
RIPA Another scoring rule based on vector similarity is Relational Inner Product Association (RIPA; Ethayarajh et al., 2019). The main difference between DB/WA and RIPA is that RIPA performs normalisation at the gender base pair level instead of at the word level. Formally,

bias^{m,f}_w = w · (m − f) / ||m − f||,

where ||·|| refers to the L2 norm.
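The pair-level normalisation can be sketched in the same style (again with illustrative names and toy vectors). Unlike a cosine, RIPA is not bounded in [−1, 1]: the magnitude of the target word's vector carries through.

```python
import numpy as np

def ripa(w, m, f):
    """RIPA bias score: projection of w onto the normalised gender direction.

    Only the base-pair difference vector (m - f) is normalised, not w itself;
    this is the key difference from DB/WA.
    """
    b = m - f
    return float(np.dot(w, b) / np.linalg.norm(b))

w = np.array([3.0, 0.0])
m = np.array([1.0, 0.0])
f = np.array([-1.0, 0.0])
print(ripa(w, m, f))  # 3.0: the projection scales with the length of w
```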
NBM Unlike DB/WA and RIPA, which are based on vector similarities, NBM (Neighbourhood Bias Metric) is based on a word's k nearest neighbours (Gonen and Goldberg, 2019). Specifically,

bias^{m,f}_w = (|masculine(w)| − |feminine(w)|) / k,

where |masculine(w)| and |feminine(w)| are the numbers of words among w's k nearest neighbours biased towards the respective gender based on their DB/WA scores. Following Zhang et al. (2020) and Gonen and Goldberg (2019), we use k = 100.
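A sketch of NBM in the same style. The (|masculine| − |feminine|) / k normalisation is our reconstruction from the surrounding text (the exact equation is not shown in this excerpt), and the brute-force neighbour search is purely illustrative.

```python
import numpy as np

def _cos(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def db_wa(w, m, f):
    return _cos(w, m) - _cos(w, f)

def nbm(w, m, f, candidate_vectors, k=100):
    """Neighbourhood Bias Metric sketch.

    Each of w's k nearest neighbours is labelled masculine or feminine by the
    sign of its own DB/WA score; the bias is the normalised count difference
    (|masculine| - |feminine|) / k. The normalisation is our reconstruction.
    """
    # Rank candidate vectors by cosine similarity to w and keep the top k.
    ranked = sorted(candidate_vectors, key=lambda v: _cos(v, w), reverse=True)[:k]
    masculine = sum(1 for v in ranked if db_wa(v, m, f) > 0)
    feminine = sum(1 for v in ranked if db_wa(v, m, f) < 0)
    return (masculine - feminine) / k

# Toy example: all three candidate neighbours lean masculine, so NBM = (3-0)/3.
m = np.array([1.0, 0.0])
f = np.array([0.0, 1.0])
neighbours = [np.array([1.0, 0.1]), np.array([1.0, -0.1]), np.array([0.9, 0.0])]
print(nbm(np.array([1.0, 0.0]), m, f, neighbours, k=3))
```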

Reliability
In measurement theory, reliability is the extent to which a measure produces consistent results over a variety of measurement conditions, in which basically the same results should be obtained (Drost, 2011). In this work, we focus on three important types of reliability: test-retest reliability, inter-rater consistency and internal consistency.
Test-retest reliability concerns the consistency of measurements across different measurement occasions (assuming no substantial change in the true value; Weir, 2005). For example, if gender bias scores vary substantially across different measurement occasions (e.g. different random seeds during embedding training; different random data samples), they should be considered to have low test-retest reliability. In this case, conclusions derived from these scores are likely to be untrustworthy.
Inter-rater consistency is the degree to which different raters produce consistent measurements (Shrout and Fleiss, 1979). For example, consider scoring rules as the raters of word embedding bias scores. In this case, if different scoring rules measure gender biases in a similar way, they should produce bias scores that tend to agree with one another in both signs and normalised magnitude.
Internal consistency is defined as the agreement among multiple components that make up a measure of a single construct (Cronbach, 1951). In the example from Figure 1, we specify a query consisting of various arts-related target words to measure gender biases of the concept arts. We then compute individual bias scores for all target words before aggregating them to obtain an overall bias score. If the bias scores of target words are distinct from one another (i.e. low or negative correlation), the query has low internal consistency. In this case, one should question whether these target words measure the arts concept in comparable ways.

Estimating Reliability of Word Embedding Gender Bias Measures
In this section, we propose an evaluation framework to assess the reliability of word embedding gender bias measures. We present our operational definitions for test-retest reliability (§4.1), inter-rater consistency (§4.2) and internal consistency (§4.3), respectively. See Table 1 for an overview.
Notation Suppose we have s scoring rules, g gender base pairs and t target words. We train k word embedding models with the same hyper-parameters but k different random seeds. For each scoring rule, we calculate the bias score of each target word w.r.t. each gender base pair, on each word embedding model. As a result, we get a four-dimensional bias score matrix B ∈ R^{s×g×t×k}. For calculating inter-rater consistency and internal consistency, we average the gender bias scores derived from the k word embedding models. Averaging over these embedding models partials out the influence of random seeds, and therefore leads to more accurate estimation of the other types of reliability. In this way, we get another bias score matrix B̄ ∈ R^{s×g×t}.
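The tensor construction and the seed-averaging step can be sketched as follows; random numbers stand in for actual bias scores, and the dimensions are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# s scoring rules, g gender base pairs, t target words, k random seeds.
s, g, t, k = 3, 10, 50, 5
B = rng.normal(size=(s, g, t, k))  # stand-in for the computed bias scores

# Average over the seed axis to partial out random-seed variation before
# estimating inter-rater consistency and internal consistency.
B_bar = B.mean(axis=-1)
assert B_bar.shape == (s, g, t)
```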

Test-retest Reliability
We measure test-retest reliability as the consistency of bias scores associated with each target word, gender base pair and scoring rule across different random seeds.3 Specifically, we focus on two questions: First, are the bias scores of a single target word (averaged across gender base pairs) consistent across random seeds (i.e. test-retest reliability of target words)? Second, are the average bias scores of all target words w.r.t. one gender base pair consistent across random seeds (i.e. test-retest reliability of gender base pairs)? We explore both questions for each scoring rule separately.
To calculate test-retest reliability for each target word, we slice B to obtain a bias score matrix in R^{g×k}. We then use intra-class correlation (ICC) to estimate test-retest reliability.
ICC is a popular family of estimators for both test-retest reliability and inter-rater consistency. There are different forms of ICC estimators, each of which can involve distinct assumptions and can therefore lead to very different interpretations (Koo and Li, 2016). Shrout and Fleiss (1979) define 6 forms of ICC and present them as "ICC" with 2 additional numbers in parentheses (e.g., ICC(2,1) and ICC(3,1)). The first number refers to the model and can take on three possible values (1, 2 or 3). The second number refers to the intended use of raters/measurements in an application and can take on two values (1 or k). See Appendix C for a detailed description of these value options. We adopt ICC(2,1) as the estimator for test-retest reliability:

ICC(2,1) = (MS_R − MS_E) / (MS_R + (k − 1) MS_E + (k/g)(MS_C − MS_E)),   (1)

where MS_E, MS_R and MS_C are the mean squares of error, of rows and of columns, respectively.4 Similarly, for the test-retest reliability of each gender base pair, we slice B to get a bias score matrix in R^{t×k}. We then calculate its ICC value using Equation 1 (by substituting g with t).
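The ICC(2,1) estimator can be computed from the two-way ANOVA mean squares; below is a from-scratch sketch (our own implementation of the standard Shrout and Fleiss formula; in practice one might use an existing statistics package instead).

```python
import numpy as np

def icc_2_1(X):
    """ICC(2,1) for an (n subjects x k raters) score matrix.

    Two-way random-effects model, absolute agreement, single rater.
    Here subjects are e.g. gender base pairs and raters are random seeds.
    """
    n, k = X.shape
    grand = X.mean()
    row_means = X.mean(axis=1)
    col_means = X.mean(axis=0)
    # Mean squares from the two-way ANOVA decomposition.
    ms_r = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between rows
    ms_c = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between columns
    ss_e = np.sum((X - row_means[:, None] - col_means[None, :] + grand) ** 2)
    ms_e = ss_e / ((n - 1) * (k - 1))                       # residual
    return float((ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n))

# Perfectly reproducible scores across raters give an ICC of 1.
X = np.array([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]])
print(icc_2_1(X))  # 1.0
```

Because ICC(2,1) measures absolute agreement, a systematic offset between raters (e.g. one seed shifting all scores upward) lowers the value, which is the behaviour we want for test-retest reliability.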

Inter-rater Consistency
We define inter-rater consistency as the consistency of bias scores across different scoring rules. We again investigate two questions: First, are the bias scores of a single target word (averaged across all gender base pairs) consistent across scoring rules (i.e. inter-rater consistency of a target word)? Second, are the average bias scores of all target words w.r.t. a single gender base pair consistent across scoring rules (i.e. inter-rater consistency of a gender base pair)?
To calculate the inter-rater consistency of each target word, we slice and transpose B̄ to obtain a bias score matrix in R^{g×s}. Following Koo and Li (2016), we adopt ICC(3,1) as the estimator:

ICC(3,1) = (MS_R − MS_E) / (MS_R + (s − 1) MS_E),   (2)

Similarly, for the second question, we get a bias score matrix in R^{t×s} for each gender base pair and calculate the inter-rater consistency of the gender base pair via Equation 2.
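ICC(3,1) drops the rater-variance term from the denominator, so a systematic offset between raters (here, scoring rules on different scales) does not hurt the score; only relative agreement matters. A sketch in the same style as above (our own implementation):

```python
import numpy as np

def icc_3_1(X):
    """ICC(3,1) for an (n subjects x k raters) score matrix.

    Two-way mixed-effects model, consistency, single rater. Rater main
    effects (e.g. a scoring rule's constant offset) are excluded.
    """
    n, k = X.shape
    grand = X.mean()
    row_means = X.mean(axis=1)
    col_means = X.mean(axis=0)
    ms_r = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ss_e = np.sum((X - row_means[:, None] - col_means[None, :] + grand) ** 2)
    ms_e = ss_e / ((n - 1) * (k - 1))
    return float((ms_r - ms_e) / (ms_r + (k - 1) * ms_e))

# The second rater is offset by a constant, yet consistency is perfect.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
print(icc_3_1(X))  # 1.0
```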

Internal Consistency
We investigate the internal consistency of both queries and the ensemble of gender base pairs. We thus focus on two questions: First, are the bias scores of different target words within a query consistent (i.e. internal consistency of a query)? Second, are the average bias scores of all target words consistent across different gender base pairs (i.e. internal consistency of the ensemble of gender base pairs)? We examine both questions for each scoring rule separately.
To calculate the internal consistency of a query consisting of t target words, we first slice and transpose B̄ to get a bias score matrix in R^{g×t}. We then use Cronbach's alpha (Cronbach, 1951) as the estimator of internal consistency. Cronbach's alpha is the most common estimator of internal consistency; it assesses how closely related a set of test items is as a group (e.g. the different target words of the same query). Specifically,

α = (t / (t − 1)) (1 − Σ_{i=1}^{t} σ_i² / σ_X²),   (3)

where σ_i² is the variance of the bias scores of target word i in the query w.r.t. different gender base pairs, and σ_X² is the variance of the sum score X, i.e. the sum of all σ_i² and all covariances of bias scores between target words.
We calculate the internal consistency of the ensemble of gender base pairs in a similar way, by creating a bias score matrix in R^{t×g} and applying Equation 3 (substituting t with g).
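Cronbach's alpha follows directly from the variance decomposition in Equation 3; a minimal sketch (our own illustrative implementation):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (g observations x t items) score matrix,
    e.g. bias scores of t target words across g gender base pairs.
    """
    g, t = scores.shape
    item_vars = scores.var(axis=0, ddof=1)      # sigma_i^2 for each item
    total_var = scores.sum(axis=1).var(ddof=1)  # sigma_X^2 of the item sum
    return float(t / (t - 1) * (1 - item_vars.sum() / total_var))

# Perfectly parallel items (identical up to a constant shift) yield alpha = 1.
x = np.arange(5.0)
S = np.stack([x, x + 1.0, x + 2.0], axis=1)
print(cronbach_alpha(S))  # approximately 1.0
```

If the items were uncorrelated, the covariances in σ_X² would vanish and alpha would drop towards zero, matching the interpretation of low internal consistency in the text.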

Experimental Setup
Training Embeddings We select three corpora with different characteristics to train word embeddings. Two are from subreddits: r/AskScience (∼158 million tokens) and r/AskHistorians (∼137 million tokens, also used by Antoniak and Mimno 2018); the third is WikiText-103.

Target Words We consider three target word lists used in previous work. However, these three lists are very specific (i.e. only concerning occupation words and adjectives) and thus unlikely to be applicable to other (future) research where different biases are of interest and different target words might be used (e.g. measuring gender biases of a whole corpus). Therefore, we also consider two additional, larger target word lists: 4) the top 10,000 most frequent words of Google's trillion word corpus (Google10K)7 and 5) the full vocabulary of each corpus (Full).
Queries For the assessment of internal consistency, we examine six gender bias related queries from Caliskan et al. (2017): math, arts, arts_2, career, science, and family, each consisting of eight target words. Note that target word lists are different from queries: the former do not necessarily consist of conceptually related words.

Figure 2 shows the distribution of test-retest reliability scores of target words and gender base pairs across target word lists and scoring rules. Here, word embeddings are trained with SGNS on WikiText-103. Similar results are found for other corpora and algorithms (see Appendix D.1).

Results: Test-retest Reliability
First, we observe that the majority of target words and gender base pairs have acceptable test-retest reliability, with ICC values greater than 0.6, regardless of the scoring rule used.8 Nevertheless, quite a few target words and gender base pairs fall below the lower whiskers of the box-plots (indicating low test-retest reliability).
Moreover, compared with Google10K, which consists of frequent words, a higher proportion of words in the full vocabulary have very low test-retest reliability. For example, 0.01% of the target words in Google10K have a test-retest reliability lower than 0.50 for word embeddings trained with SGNS on WikiText-103. In contrast, for the full vocabulary this is 0.1%, approximately 10 times that of Google10K. These results suggest that we should be careful when creating word lists that contain infrequent words (e.g. when studying less common concepts). If we do need to use infrequent words, we should check their test-retest reliability before deriving further conclusions.

Figure 3 shows the distributions of inter-rater consistency scores of both target words and gender base pairs across different corpora (word embeddings trained with GloVe). More (similar) results on different algorithms are in Appendix D.2.

Results: Inter-rater Consistency
We observe that the inter-rater consistency of the majority of both target words and gender base pairs is rather low. This finding suggests that different scoring rules may measure very different aspects of word embedding gender biases, and hence their resulting bias scores differ substantially. Looking more closely, we observe that for target words, the bias scores are the least similar between RIPA and NBM (Pearson's r: 0.836, p < 0.05), while they are much more similar between DB/WA and RIPA (Pearson's r: 0.923, p < 0.05), and between DB/WA and NBM (Pearson's r: 0.897, p < 0.05). A possible reason is that DB/WA and RIPA scores are both based on cosine similarities, and that NBM scores are based on the DB/WA scores of a word's closest neighbours. In contrast, RIPA and NBM scores are computed in less comparable ways. Nevertheless, future studies are needed to further investigate the differences among scoring rules.

8 Generally, ICC values less than 0.5, between 0.5 and 0.75, between 0.75 and 0.9, and greater than 0.9 are considered to indicate poor, moderate, good, and excellent reliability, respectively (Koo and Li, 2016). This also holds for ICC values of inter-rater consistency (Section 5.3).

Results: Internal Consistency
Figure 4 presents the distribution of the internal consistency scores of every query and of the ensemble of all gender base pairs across corpora. Each box-plot contains six scores from the combinations of embedding algorithms and scoring rules. We make four observations: First, the internal consistency of most queries and of the ensemble of gender base pairs is acceptable (Cronbach's alpha values ≥ 0.7).9 This indicates that most target words in the same query likely measure gender bias of the same concept. Bias scores of a target word are also generally consistent across gender base pairs.
Second, however, the patterns of internal consistency vary substantially across queries. For example, on the WikiText-103 corpus, the internal consistency scores of family are much higher and less varied than the scores of math. Third, the internal consistency of a query and the ensemble of gender base pairs seems dependent on specific corpora. For instance, the internal consistency scores of math are high and have a low variance on the corpus r/AskScience, but they are low and have a very high variance on r/AskHistorians.
Fourth, the high variance of scores for some queries (e.g. math on r/AskHistorians) suggests that a query's internal consistency may depend also on word embedding algorithms and scoring rules.

Factors Influencing the Reliability of Gender Base Pairs and Target Words
In this section, we investigate factors influencing the test-retest reliability and inter-rater consistency of both gender base pairs and target words. Because we only have a small number of gender base pairs, we qualitatively inspect them using visualisations; for (the large number of) target words, we turn to regression analyses.

Gender Base Pairs: Visualisation The distributions of test-retest reliability and inter-rater consistency of gender base pairs (on full vocabularies) are shown in Figure 5. We make two observations: First, gender base pairs in singular form usually have higher test-retest reliability (e.g. boy∼girl versus boys∼girls), which is consistent with findings by Zhang et al. (2020). The median difference in test-retest reliability between singular and plural gender base pairs is statistically significant (t(7) = 2.45, p < .05). In contrast, such a statistical difference is not found for inter-rater consistency (t(7) = −0.13, p > .05).
Second, gender base pairs of higher test-retest reliability also tend to be of higher inter-rater consistency, evidenced by the moderate correlation between the median test-retest reliability scores and the median inter-rater consistency scores of gender base pairs (r = 0.644, p < .05).
Target Words: Regression Analyses We use multilevel regression to study potential influencing factors of the test-retest reliability and inter-rater consistency of target words.10 Compared with OLS regression and its variants, multilevel models allow for dependent observations. They therefore better suit our data, where reliability scores are nested within groups (e.g. different training algorithms and corpora of embeddings) and are thus correlated. Multilevel models have the further advantage that they estimate not only the effects of fixed factors (i.e. standard features) but also the amount of variance explained by each grouping factor.

Figure 5: Test-retest reliability and inter-rater consistency of different gender base pairs on full vocabularies. Each gender base pair has multiple reliability scores across combinations of embedding algorithms and corpora (as well as scoring rules for test-retest reliability). Gender base pairs in singular form tend to have higher test-retest reliability. Also, gender base pairs with higher test-retest reliability are more likely to score higher in inter-rater consistency.
We collect a range of word-level features as fixed factors, mostly inspired by previous studies (Pierrejean and Tanguy, 2018; Hellrich and Hahn, 2016). These include 1) word-intrinsic features: log number of WordNet synsets (log #senses) and the most common Part-of-Speech tag (PoS) in the Brown corpus (Francis and Kucera, 1979); 2) corpus-related features: log frequencies of words in the training corpus (log freq) and their squares (log² freq); 3) embedding-related features: cosine similarity to the nearest neighbour (NN Sim), L2 norm (L2 Norm) and embedding stability (ES). We calculate ES as follows: for each pair of word embedding models, we first fit an orthogonal transformation Q that minimises the Frobenius norm of their difference. The stability of a word across multiple random seeds is then calculated as the average pairwise cosine similarity of its embedding vectors after transformation by the Qs. We also consider scoring rules as a fixed factor, because we are interested in comparing the influence of the three scoring rules on target words' test-retest reliability. The two grouping factors are embedding algorithms and training corpora.
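The ES computation described above can be sketched with an SVD-based orthogonal Procrustes alignment (our own from-scratch sketch; function and variable names are illustrative, and `runs` is assumed to be a list of embedding matrices sharing one vocabulary ordering):

```python
import numpy as np

def procrustes_align(A, B):
    """Orthogonal Q minimising ||A @ Q - B||_F, via the SVD of A^T B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def embedding_stability(runs, word_idx):
    """Average pairwise cosine similarity of one word's vectors across
    embedding models trained with different random seeds, after aligning
    each pair of full embedding matrices with orthogonal Procrustes.
    """
    sims = []
    for i in range(len(runs)):
        for j in range(i + 1, len(runs)):
            Q = procrustes_align(runs[i], runs[j])
            a = runs[i][word_idx] @ Q   # map run i's vector into run j's space
            b = runs[j][word_idx]
            sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    return float(np.mean(sims))

# Two identical "runs" should give perfect stability for every word.
rng = np.random.default_rng(1)
M = rng.normal(size=(6, 4))  # toy embedding matrix: 6 words, 4 dimensions
print(embedding_stability([M, M.copy()], word_idx=0))  # approximately 1.0
```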
We summarise the results in Table 2. For test-retest reliability, the model has a satisfactory total explained variance (R²_model: 51.77%). Fixed factors (including scoring rules) together explain a substantial part of the variation (R²_fixed: 32.61%). Among these factors, embedding stability (ES) appears to be the most important one, indicated by the largest standardised effect estimate and ΔR². The higher the embedding stability, the higher the test-retest reliability, which is expected. L2 norm and word frequency also account for a considerable amount of variance: when the L2 norm is lower or when the frequency is higher, test-retest reliability is higher. This observation is also consistent with prior research findings. For instance, Hellrich and Hahn (2016) show that word frequency positively correlates with embedding stability when word frequency is not too high. Also, Arora et al. (2016) find that the L2 norm correlates negatively with word frequency. This finding agrees with our observation in Figure 2 as well. In contrast, the choice of scoring rules has only a minor impact on test-retest reliability (R²: 0.4%). Despite a statistically significant difference between DB/WA and RIPA, the difference is very small (−0.0102) and therefore unlikely to be important.
Among the group level factors, embedding algorithms alone explain 18.02% of the total variance. This suggests that the test-retest reliability of a target word is determined by the word embedding training to a considerable degree. In contrast, the choice of corpora is much less influential.
For inter-rater consistency, the resulting model also has good explanatory power (R²_model: 51.60%). However, word-level features fail to explain much of the variance (5.81%). Between the two grouping factors, algorithm dominates, with an R² score of 0.4307. This indicates that the inter-rater consistency of a target word is largely determined by the word embedding algorithm used.
Note that we also explored potential interactions between the fixed factors and how they might impact the outcome test-retest and inter-rater reliability scores. However, it turned out that interaction effects generally had small effect sizes and did not considerably improve overall model fit. We therefore excluded them from the final models.

Conclusion & Discussion
In this paper, we propose to leverage measurement theory to examine the reliability of word embedding bias measures. We find that bias scores are mostly consistent across different random seeds (i.e. high test-retest reliability), as well as across gender base pairs and across target words within a query (i.e. high internal consistency). In contrast, the three scoring rules fail to agree with one another (i.e. low inter-rater consistency). Furthermore, our regression results suggest that the consistency of bias scores across different random seeds is mostly influenced by various word-level features as well as by the word embedding algorithm used, whereas the consistency of target words' bias scores across different scoring rules is dominated by the word embedding algorithm used. We thus urge future studies to be more critical about applying such measures.
Nevertheless, our work has limitations. First, we only consider gender bias measures. Future work should apply our reliability evaluation framework to other types of bias (e.g. racial bias). Second, we focus on static word embeddings. Future work should investigate the reliability of bias measures for contextualised embeddings. Third, we do not address validity, the other crucial aspect of measurement quality. We thus call for future studies on the validity of word embedding bias measures. Fourth, Goldfarb-Tarrant et al. (2021) argue that intrinsic (word embeddings) biases sometimes fail to agree with extrinsic biases (measured in downstream tasks, e.g. coreference resolution). One potential research direction is to assess the reliability of extrinsic bias measurements as well, to shed further light on the disconnect between intrinsic and extrinsic biases. Lastly, while ICC and Cronbach's Alpha are established reliability estimators in many scientific disciplines, correct interpretation of their values is often challenging and requires both statistical and field-specific expertise (Lee et al., 2012;Streiner, 2003). Future work should address the appropriate use of these estimators and their limitations in the context of NLP research.

Ethical Statement
Intended Usage As mentioned in §1, word embedding bias measures are often used to analyse word embedding models, to assess the effect of bias-mitigation methods, and to study societal biases. Our work thus intends to evaluate the quality of these measures and the conclusions derived from them. Moreover, our framework can also be used to assess the reliability of bias measures that consist of target words, gender base pairs and scoring rules not included in this study. In this way, our framework can contribute to the development of models that are less biased and hence potentially less harmful.
Limitations In this study, we focused on common measures of gender biases in word embeddings. Measurements of gender biases in word embeddings typically rely on manually crafted sets of target words and pairs of gendered words (i.e. gender base pairs, such as he vs. she). In our experiments we use existing lists of words and word pairs that have been frequently used in related work. However, these word pairs were constructed by taking the very narrow view of binary gender. We hope to see more work on measures of bias in embeddings that considers non-binary gender identities as well as intersectional identities.

Data Preprocessing For the Reddit data (i.e. r/AskScience and r/AskHistorians), we lowercased the text, removed redundant spaces/URLs, and used the spaCy11 library to tokenise each sentence. For training GloVe embeddings, we substituted "<unk>" symbols in WikiText-103 with "<raw_unk>" symbols.

C Choosing ICC Estimators
Despite ICC being a commonly used tool for estimating test-retest and inter-rater reliability, there exist distinct forms of ICC estimators. Different forms of ICC can involve distinct assumptions and can therefore lead to very different interpretations (Koo and Li, 2016).
In this work, we follow the framework proposed by Shrout and Fleiss (1979). They define 6 forms of ICC and present them as "ICC" with 2 additional numbers in parentheses (e.g., ICC(2,1) and ICC(3,1)). The first number refers to the model and can take on three possible values (1, 2 or 3): 1 is a one-way random-effects model, where each subject receives a unique, random set of raters; 2 is a two-way random-effects model, where all subjects receive the same randomly chosen set of raters and the reliability results are assumed to be generalisable to unseen raters; 3 is a two-way mixed-effects model, where the selected raters are the only raters of interest and thus the reliability results are not generalisable to other raters. The second number refers to the intended use of raters/measurements in an application and can take on two values (1 or k). 1 refers to having only a single rater or measurement; k means using the mean of k raters or measurements.
Therefore, depending on the specific research data and goals, one of the 6 ICC forms may be used. For test-retest reliability, we use ICC(2,1) for the following three reasons. First, the raters (i.e. different random seeds) are a random sample of the population (of all possible random seeds). Second, each bias score receives the same raters (i.e. random seeds). Third, in actual research practices, researchers would normally use only one rater (one random seed) to measure word embedding biases.
For inter-rater consistency, we use ICC(3,1) based on two considerations. First, we are only interested in comparing three specific scoring rules (i.e. raters). Second, in practice, researchers would use only the result from one scoring rule (i.e. rater) to measure word embedding biases.
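As a concrete illustration, both estimators can be computed from an (n subjects × k raters) score matrix using the Shrout and Fleiss mean-square formulas. The following is a minimal numpy sketch, not the exact implementation used in our experiments:

```python
import numpy as np

def icc(ratings):
    """ICC(2,1) and ICC(3,1) per Shrout & Fleiss (1979),
    from an (n subjects x k raters) matrix of scores."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means
    # Two-way ANOVA mean squares.
    ssr = k * np.sum((row_means - grand) ** 2)        # subjects
    ssc = n * np.sum((col_means - grand) ** 2)        # raters
    sse = np.sum((ratings - grand) ** 2) - ssr - ssc  # residual
    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    icc2 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc3 = (msr - mse) / (msr + (k - 1) * mse)
    return icc2, icc3

# Raters agree perfectly up to a constant offset: ICC(3,1) is 1,
# while ICC(2,1) is penalised for the systematic rater difference.
scores = [[1, 2], [2, 3], [3, 4], [4, 5]]
icc2, icc3 = icc(scores)
```

The toy matrix illustrates the key interpretive difference: the mixed-effects ICC(3,1) ignores consistent rater offsets, whereas the random-effects ICC(2,1) treats them as error.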

E Effect of Hyper-parameters
In our study, we did not fine-tune the hyper-parameters while training the word embeddings, because the main goal of this paper is not to compare different setups of word embedding algorithms. In this section, we explore whether the results are sensitive to the choice of hyper-parameters used for training the word embeddings. In the main paper, for SGNS, we use 300 dimensions and 5 iterations. Here, we experiment with two alternative settings on r/AskHistorians: 1) 3 iterations instead of 5, and 2) 100 dimensions instead of 300. The results are shown in Figures 18 to 24. Compared with the default hyper-parameters, we observe that although the specific values of the different types of reliability change, the overall trends remain the same.

F Multilevel Regression
This section provides a detailed description of the multilevel regression experiments in §5.5.
Multilevel Models In a similar research setup, prior work used variants of OLS regression to study the factors influencing word embedding stability. However, OLS regression makes the strong assumption that observations are unconditionally independent of one another. If this assumption is violated, the standard errors of the regression coefficients will be underestimated, leading to overstated statistical significance. In our study, the observations (reliability scores of target words) are nested within different corpora and word embedding algorithms. As a result, the residual variance is naturally partitioned into a within-corpus-algorithm component and a between-corpus-algorithm component. If the between component is not explicitly accounted for, the observations within a corpus or algorithm will be correlated (and thus no longer independent).
We therefore use multilevel regression instead of OLS regression (or its variants). Multilevel regression (also known as mixed-effects models or hierarchical models, among many other names) accounts for the grouping/hierarchical structure of the data by explicitly modelling residuals at different levels. In this way, the model not only relaxes the assumption of unconditionally independent data, but also estimates group effects. The latter is especially beneficial because it allows us to quantify how much variance in the data is explained by the corpora and the algorithms, in addition to word-level features.
Formally, our multilevel regression model is:
$$y_{i(a,c)} = \mathbf{X}_{i(a,c)} \boldsymbol{\beta} + \nu_{0a} + \mu_{0c} + \epsilon_{i(a,c)}$$
where $y_{i(a,c)}$ is the reliability score of a target word $i$ nested within an algorithm $a$ and corpus $c$; $\mathbf{X}_{i(a,c)}$ is a row vector containing the observations on the word-level explanatory factors, with a leading element of one; $\boldsymbol{\beta}$ is a column vector of parameter estimates for those factors; and $\nu_{0a}$, $\mu_{0c}$ and $\epsilon_{i(a,c)}$ are the residual error terms for algorithm $a$, corpus $c$ and target word $i$, respectively. They are assumed to be normally distributed around zero, each with its own variance.
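To make the grouping structure concrete, the following numpy sketch simulates data from this kind of model; the group counts, variances and slope are illustrative assumptions, not values from our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 5000 word-level observations nested within
# 4 algorithms and 5 corpora (made-up numbers for demonstration).
n_obs, n_algos, n_corpora = 5000, 4, 5
beta = 2.0                                # true word-level fixed effect
nu = rng.normal(0, 0.3, n_algos)          # algorithm-level residuals
mu = rng.normal(0, 0.2, n_corpora)        # corpus-level residuals

x = rng.normal(0, 1, n_obs)               # one word-level feature
a = rng.integers(0, n_algos, n_obs)       # algorithm of each observation
c = rng.integers(0, n_corpora, n_obs)     # corpus of each observation
eps = rng.normal(0, 0.1, n_obs)           # word-level residual

# Data-generating process of the multilevel model above.
y = beta * x + nu[a] + mu[c] + eps

# Since x is independent of the grouping, a plain OLS slope still
# recovers beta; what OLS misses is that observations sharing an
# algorithm or corpus are correlated through nu and mu, which is
# exactly what the multilevel model accounts for.
slope = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
```

In practice one would fit such a model with a dedicated mixed-effects routine rather than OLS; the simulation only illustrates why the group-level residuals matter.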
Note that we cannot conduct such regression analyses for reliability scores at the level of queries, gender base pairs or scoring rules, due to limited sample sizes (i.e. only 6 queries, 23 gender base pairs and 3 scoring rules).

Table 3: Results of multilevel regression on the test-retest and inter-rater reliability of target words. Estimates are standardized (bold if p < 0.05). $\Delta R^2$ is the reduction in explained variance when the corresponding factor is left out. $R^2_{\text{fixed}}$, $R^2_{\text{corpus}}$, $R^2_{\text{algorithm}}$ and $R^2_{\text{total}}$ refer to the variance explained by the fixed factors (i.e. word-level features and scoring rules), the embedding training corpus, the embedding training algorithm, and the total effects of all three parts, respectively.

Given two word embedding matrices $W_1$ and $W_2$, we fit a transformation matrix $Q$. The stability of a word $w$ is then given by the cosine similarity of $(\mathbf{w}_1, Q\mathbf{w}_2)$, where $\mathbf{w}_1$ and $\mathbf{w}_2$ are the corresponding word vectors of $w$ in $W_1$ and $W_2$.
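The word-stability measure can be sketched as follows. Note that fitting $Q$ by orthogonal Procrustes is an assumption made here for illustration, as a common way of aligning embedding spaces; the exact fitting procedure for $Q$ is not specified above:

```python
import numpy as np

def stability(W1, W2):
    """Per-word stability between two embedding matrices whose rows
    are word vectors for the same vocabulary. Q is fit by orthogonal
    Procrustes (an illustrative assumption)."""
    # Orthogonal Q minimising ||W2 @ Q - W1||_F.
    u, _, vt = np.linalg.svd(W2.T @ W1)
    Q = u @ vt
    aligned = W2 @ Q
    # Row-wise cosine similarity between W1 and the aligned W2.
    num = np.sum(W1 * aligned, axis=1)
    den = np.linalg.norm(W1, axis=1) * np.linalg.norm(aligned, axis=1)
    return num / den

# Sanity check: if W2 is simply a rotated copy of W1, Procrustes
# undoes the rotation, so every word is (near-)perfectly stable.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(50, 10))
R, _ = np.linalg.qr(rng.normal(size=(10, 10)))   # random rotation
scores = stability(W1, W1 @ R)
```

With real embeddings trained under different seeds, the scores would fall below 1 in proportion to how much each word's neighbourhood shifts between runs.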
Furthermore, for the regression model on test-retest reliability, we also include scoring rules as a fixed factor. Scoring rules could alternatively be treated as a grouping factor, since the bias scores are also clustered within scoring rules. However, that approach would yield only a single variance estimate for the scoring rules as a whole, rather than a specific effect estimate for each of the three scoring rules. Because we are interested in comparing the effects of the three scoring rules on target words' test-retest reliability scores, we explicitly model scoring rules as a fixed factor.
Lastly, we include word embedding training corpora and word embedding training algorithms as the two group-level factors.

Results
The full results are in Table 3. The interpretation of the table is the same as in §5.5. The main difference between this table and Table 2 is that the parameter estimates of the PoS feature are no longer omitted. We can see that this feature as a whole explains less than 1% of the variation in both the test-retest reliability and the inter-rater consistency of target words. This suggests that, in the presence of the other features, PoS is not a very important factor. Nevertheless, we do see some statistically significant and moderately sized parameter estimates for individual PoS categories. For instance, compared to adjectives, determiners score on average 0.0212 lower on test-retest reliability, and numerals score on average 0.0886 lower on inter-rater consistency.