No Word Embedding Model Is Perfect: Evaluating the Representation Accuracy for Social Bias in the Media



Introduction
Social bias describes prejudices and stereotypical thinking towards certain groups in society, such as genders or ethnicities (Fiske, 1998). Media bias, by contrast, refers to the tendency of media entities (e.g., a news outlet) to favor certain facts, views, or framings of events over others (Chen et al., 2021). In this work, we focus on media bias induced by political orientations (henceforth, political bias). While social and political bias differ in appearance, it can be expected that they relate to and mutually influence each other. A particular political bias, for example, may transport ideas of stereotypes, manifesting as social bias, that strengthen specific political ideas in society (Seiter, 1986; Domke et al., 1999). Vice versa, holding particular stereotypical views may make people more susceptible to a political view promoted by a news outlet (Schwarz and Jalbert, 2020). Figure 1 shows excerpts of two news articles, conveying potential gender bias.
Figure 1, first excerpt: "… If we want to raise our daughters to be different kind of women - nonconformists in a world run amok, insurgents for the gospel - we must be sure to give them strategic and specialized training. We must teach them both the beauty and the basics of biblical womanhood through our faithful (though flawed) example and our gracious teaching. We must also pluck the weeds of feminism that our culture sows and which can take root in our daughters' hearts. …"
Figure 1, second excerpt: "… Falling in love with a woman helped me move out of my marriage and into a new world of women. I discovered that intimate relationships with women were based on parity--there were no predetermined roles. Both partners were women, born and raised with similar gender expectations. …"
The outlined kinds of bias are also relevant to NLP methods that employ news articles to train models (Mikolov et al., 2013) or as a knowledge source (Slonim et al., 2021). For example, bias present in the articles may be learned and amplified by word embeddings if not explicitly accounted for. This impacts generalization performance negatively (Shah et al., 2020) and may have harmful consequences in practical applications (Bender et al., 2021; Joseph and Morgan, 2020). So far, one hurdle to mitigating these problems is the limited reliability of common measures of social bias present in a corpus (Spliethöver and Wachsmuth, 2021), stemming from embedding training algorithms not tailored to low-resource situations (Knoche et al., 2019; Spinde et al., 2021).
In this paper, we investigate how to assess social bias more reliably while empirically studying the interaction of social bias and political bias in US online news outlets. In particular, we identify low-resource settings and token frequency differences as two main issues with existing embedding-based bias measures. We consider social bias towards genders, ethnicities, and religions, and measure it with the widely used bias measure WEAT (Caliskan et al., 2017). We restrict our political bias view to the unidimensional spectrum from left to right (Duckitt and Sibley, 2010), ignoring objectivity and fairness aspects (Chen et al., 2020).
In psychology literature, stereotypical views have been shown to coincide with political orientations (Section 2), suggesting that the political views of news outlets coincide with social biases. Under this premise, we aim to find out which word embedding algorithm best serves to reliably measure social bias. We investigate weaknesses of a standard algorithm that stem from the reliance on word lists, infrequent tokens in the data, and the quality of embeddings. We suggest (1) training frequency-agnostic embeddings to compensate for the lower quality of rare tokens, (2) a fine-tuned language model to account for smaller datasets, and (3) decontextualized embeddings to alleviate the "unnatural input" problem with contextualized models.
For our experiments, we introduce a large-scale media bias corpus in Section 4, covering more than 500,000 news articles from 47 English-language US online news outlets over 12 years (2010-2021). Given the corpus, we evaluate each potential improvement and compare their capability to encode and represent the social bias in a text corpus (Section 5). To this end, we systematically generate word embedding models from subsets of different political biases. In a second analysis, we explore the development of social bias in outlets over time in the same manner. We quantify the considered types of social bias for all models using WEAT.
Our results in Section 5 provide evidence that the general embedding quality of the proposed algorithms improves notably over standard static embeddings. Additionally, the proposed algorithms better model the expected social bias, though they still do not fully align with the literature.
This work provides three contributions to computational research on bias in language:
1. Findings on how to combine embedding models and bias measures to adequately quantify social bias in text corpora;
2. a large-scale news resource annotated for political bias; and
3. empirical insights into the interaction of social and media bias in US online news, and its development over time.

Related work
We consider social bias that manifests as stereotypes, that is, generalized beliefs about social outgroups based on experiences with single members (Fiske, 1998). Such beliefs may lead to prejudices and discrimination that cause lasting harm. Stereotypes are usually transported through language, uttered either implicitly or explicitly (Wodak, 2008).
If entities with high public outreach, such as politicians and media outlets, spread stereotypes, this may therefore profoundly impact their audiences (Seiter, 1986; Domke et al., 1999).
Psychology and political science literature study the relation of stereotypes with political aspects. As part of this, multiple layers of partisan biases have been evaluated (Hayes, 2011; Bauer, 2015; Clifford, 2020). Focusing on social values, Valentino and Sears (2005) find a general shift of public social values related to a shift in voting outcomes. Other works compare social values of conservatives and liberals: While liberals seem more likely to reject "ingroup values", conservatives emphasize tradition and religion (Sylwester and Purver, 2015). Accordingly, Webster et al. (2014) observe a higher level of self-reported prejudices towards social groups that "challenge or violate traditional social values" among conservative participants. Chirumbolo et al. (2016), finally, report that liberals tend to value social equality, whereas conservatives justify social inequality with "the preservation of status quo". We use these connections between political bias and social values as a reference for our analyses.
Media bias can be evaluated from many angles, too. For instance, Chen et al. (2020) explore media bias in political news, automatically detecting incomplete reporting and evaluating its linguistic manifestations. One of their results is that words expressing negative emotions are most correlated with selective biases. Kenix and Jarvandi (2019), in turn, focus on conservative and liberal news articles from the US, Australia, and the UK to understand the construction of media frames. They find that the report framing of specific outlets aligns with their political bias. Rather than unfairness or issue perception, our work targets the interaction of media bias with social bias in news articles.
In particular, we quantify social bias in word embeddings with the widely used Word Embedding Association Test, WEAT (Caliskan et al., 2017). WEAT's main idea is to calculate the cumulative distance between groups of word vectors that describe a social group and attributes. Similar measures exist, such as ECT (Dev and Phillips, 2019), RNSB (Sweeney and Najafian, 2019), MAC (Manzini et al., 2019), RIPA (Ethayarajh et al., 2019), WEATVEC (Knoche et al., 2019), the Smoothed First-Order Co-occurrence (Rekabsaz et al., 2021), and SAME (Schröder et al., 2021), but our goal is not to find the best measure. Rather, we seek to learn how measures like WEAT behave for different embedding algorithms. We are not aware of works that have done so.
Similar to the analysis we carry out, Garg et al. (2018) exploit the properties of word embeddings to evaluate temporal relationships between changes of social bias and empirical demographic changes in the US. They evaluate embedding models trained on texts from different decades, for example finding that gender bias decreased with the women's movement in the 1960s. In a comparable analysis, Rios et al. (2020) find that gender bias reduced in biomedical research over time for some areas, but not in others. In this work, we utilize WEAT to evaluate social bias in news articles. Unlike previous work, however, we compare word embedding algorithms to model social bias in texts and their alignment with the literature reviewed above.
Closest to our work is the research of Knoche et al. (2019) and Spinde et al. (2021). The former use WEAT to compare social biases present in word embeddings trained on different ideological online wikis. All wikis are found to have similar biases for gender, race, and religion, but to varying degrees. Spinde et al. (2021) collect US news articles from a liberal and a conservative media outlet.
By training one embedding model for each outlet and measuring the differences of all words in the embedding spaces, they determine the most biased words. The underlying hypothesis is that words whose context varies more strongly will also be more biased. We apply a data collection method similar to Spinde et al. (2021), but cover 47 outlets. Additionally, instead of just focusing on two extreme communities, our corpus spans a wider spectrum of political opinions. Our main goal is to deepen the understanding of the social bias in word embeddings for different training algorithms.

Method
This paper studies how to best evaluate a text corpus for social bias, harnessing the ability of word embeddings to encode direct contexts. In particular, we quantify the social bias encoded in models trained on a corpus. The models are thus used as a proxy from which we derive the social bias in the original corpus. In the following, we present our evaluation method, discuss potential issues, and describe the employed embedding algorithms.

Evaluating Social Bias in Embeddings
We seek to analyze to what extent word embedding models encode the social bias of their training data. For further insights, we also investigate the models' quality.
Word Similarity The quality of the semantic space of word embedding models benefits from larger datasets (Pennington et al., 2014). Since most social bias measures rely on this space, better embeddings should also yield more accurate bias evaluations. To gain a better understanding of the quality, we conduct word-similarity evaluations (Spinde et al., 2021) of all models we explore. These evaluations are based on a list of word pairs, human-annotated for similarity. For each pair, the cosine similarity between the vectors generated by a model is computed. The score is Spearman's ρ between the vector similarities and the annotations. While this intrinsic evaluation is not able to predict the performance on downstream tasks, it provides insights into the semantic quality of the embeddings (Faruqui et al., 2016). The results also enable us to put the social bias evaluation into context. We apply two tests, MEN (Bruni et al., 2014) and WordSim353 (Finkelstein et al., 2001).
Social Bias To quantify social bias, we report results of WEAT (Caliskan et al., 2017). At its core, WEAT relies on four word lists, each describing a concept. Two lists describe social groups that are evaluated in the context of attributes, which are represented by the other two lists. Common combinations are:
• Gender. Male/female and career/family terms
• Ethnicity. African-/European-American names and pleasant/unpleasant terms
• Religion. Christianity/Islam terms and pleasant/unpleasant terms
Using a given embedding model, all words are transformed into word vectors in order to measure the cumulative distance between the vectors. Let $G$ and $G'$ be the word embeddings for the two social group lists, and $A$ and $B$ those for the attribute lists. Now, let $\Delta(w, G, G')$ be the mean difference between the cosine similarity of a word embedding $w$ to all word embeddings in $G$ and to the embeddings in $G'$:
\[
\Delta(w, G, G') = \mathrm{mean}_{g \in G} \cos(w, g) - \mathrm{mean}_{g' \in G'} \cos(w, g')
\]
Then, the WEAT score is defined as the effect size of the difference between $A$ and $B$:
\[
\mathrm{WEAT}(G, G', A, B) = \frac{\mathrm{mean}_{a \in A}\, \Delta(a, G, G') - \mathrm{mean}_{b \in B}\, \Delta(b, G, G')}{\mathrm{std}_{w \in A \cup B}\, \Delta(w, G, G')}
\]
This results in a value from -2 to 2, where 0 represents the least possible bias. Using WEAT makes our results comparable with related work. We calculate WEAT scores using the implementation of the WEFE framework (Badilla et al., 2020) and use the word lists of Spliethöver and Wachsmuth (2021).
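The effect size described above can be sketched in a few lines of NumPy. This is a minimal illustration that follows the verbal definition in this section, not the WEFE implementation used for the reported results. The function names, the argument names `group_1` and `group_2` for the two social group lists, and the use of the sample standard deviation (`ddof=1`) are assumptions made for this example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mean_difference(w, group_1, group_2):
    """Mean cosine similarity of w to group_1 minus that to group_2."""
    sim_1 = np.mean([cosine(w, g) for g in group_1])
    sim_2 = np.mean([cosine(w, g) for g in group_2])
    return sim_1 - sim_2

def weat_effect_size(group_1, group_2, attr_a, attr_b):
    """Effect size of the difference between the two attribute lists."""
    d_a = np.array([mean_difference(a, group_1, group_2) for a in attr_a])
    d_b = np.array([mean_difference(b, group_1, group_2) for b in attr_b])
    pooled = np.concatenate([d_a, d_b])
    # Assumption: sample standard deviation over all attribute words.
    return float((d_a.mean() - d_b.mean()) / pooled.std(ddof=1))

# Toy 2-d vectors (hypothetical, for illustration only).
group_1 = [np.array([1.0, 0.0])]
group_2 = [np.array([0.0, 1.0])]
attr_a = [np.array([1.0, 0.1]), np.array([1.0, 0.2])]
attr_b = [np.array([0.1, 1.0]), np.array([0.2, 1.0])]
score = weat_effect_size(group_1, group_2, attr_a, attr_b)
```

In this toy setup, the first attribute list lies closer to the first group list, so the score is positive; swapping the two attribute lists flips its sign.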

Accuracy of Bias Evaluation
Evaluating social bias in a word embedding model assumes that its semantic space is meaningful. As different word embedding algorithms achieve this with varying success, they likely also differ in their accuracy in encoding social bias. Assuming that the bias measure at hand (here, WEAT) works as intended, it is possible to evaluate differences between algorithms, given a corpus with known social bias. Below, we thus compare models of different embedding algorithms on training data for which the social bias is known from literature (see Section 2). While we cannot derive exact WEAT values for a corpus, we can infer relative differences for liberal and conservative texts. Together with the results of the word similarity evaluation, we can draw conclusions regarding the reliability of the results.

Potential Evaluation Issues
As previous research (Spliethöver and Wachsmuth, 2020; Spinde et al., 2021) points out, evaluating text corpora for social bias with static word embeddings (e.g., word2vec) entails three main problems:
1. Limited Corpus Size. The training data influences the semantic quality of the embeddings.
2. Representation Degeneration. Token frequency differences in the training data entail embeddings of differing quality.
3. Out-of-Vocabulary Tokens. Limited vocabularies cause unknown tokens during evaluation.
In the following, we describe these issues in more detail. To alleviate them, we train word embedding models with different algorithms below.
Limited Corpus Size To generate a meaningful semantic space based on context, word embedding models tend to require large datasets. For example, the pre-trained word2vec model (Mikolov et al., 2013) was trained on 100B tokens, the largest GloVe model (Pennington et al., 2014) on 840B. Thus, the quality of the embeddings may suffer from small corpora. In turn, the results of the bias evaluation may not be as accurate as with larger corpora.
Representation Degeneration Representation degeneration describes the dependence of meaningful embeddings on the number of token occurrences, which reflects the number of contexts available for a token (Karampatsis et al., 2020). It implies that infrequent tokens (rare tokens) tend to have lower-quality embeddings than more frequent ones (popular tokens). While frequency fluctuations are expected due to Zipf's law, they result in less reliable semantic encodings (Gong et al., 2018; Karampatsis et al., 2020; Wolfe and Caliskan, 2021). In the context of social bias measures, this issue is especially relevant, since they implicitly assume a similar quality for all word vectors. The frequency difference between tokens can be high for certain corpora (Spliethöver and Wachsmuth, 2020). Even more problematic, the occurrences also tend to vary within a single test (e.g., more male term occurrences than female ones), potentially influencing the social bias measure results negatively.
While a frequency difference can itself be a form of social bias, it makes the evaluation less straightforward, which is why we ideally seek to abstract from it. A naïve way would be to artificially augment the data by duplicating contexts of rare tokens. As we intend to keep the original signals, though, we explore more direct means of abstraction.
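The frequency imbalance within a single test, as described above, can be made visible with a simple count. The following sketch sums the corpus occurrences of two word lists and reports their ratio. The function names and the toy word lists are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

def list_occurrences(word_list, token_counts):
    """Total number of corpus occurrences of all words in a list."""
    return sum(token_counts.get(word, 0) for word in word_list)

def frequency_ratio(list_1, list_2, token_counts):
    """Ratio between the more and the less frequent of two word lists."""
    n_1 = list_occurrences(list_1, token_counts)
    n_2 = list_occurrences(list_2, token_counts)
    if min(n_1, n_2) == 0:
        return float("inf")  # at least one list never occurs in the corpus
    return max(n_1, n_2) / min(n_1, n_2)

# Toy corpus and word lists (hypothetical, for illustration only).
token_counts = Counter("he said he would help her".split())
ratio = frequency_ratio(["he", "him"], ["she", "her"], token_counts)
```

A ratio far from 1 signals that the embeddings of one list are based on many more contexts than those of the other.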
Out-Of-Vocabulary Tokens Static word embedding models have a fixed vocabulary, determined by the tokens in their training corpus, and are unable to generate embeddings for tokens not included (henceforth, OOV tokens). However, most embedding bias measures rely on pre-defined word lists and assume that an embedding is available for each word. OOV tokens hence need to be ignored in the evaluation, reducing the comparability of multiple models. This can be alleviated by sub-word tokenization, as used for BERT (Devlin et al., 2019).
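To illustrate why OOV tokens reduce comparability, the sketch below splits a word list into words that a model's vocabulary covers and words that must be ignored. The function names and the toy vocabularies are assumptions for this example; they do not correspond to a specific library.

```python
def split_by_vocabulary(word_list, vocabulary):
    """Separate a word list into in-vocabulary and OOV words."""
    known = [word for word in word_list if word in vocabulary]
    missing = [word for word in word_list if word not in vocabulary]
    return known, missing

def coverage(word_list, vocabulary):
    """Share of the word list for which an embedding is available."""
    known, _ = split_by_vocabulary(word_list, vocabulary)
    return len(known) / len(word_list) if word_list else 0.0

# Two hypothetical models with different vocabularies see different tests.
word_list = ["career", "salary", "office", "business"]
vocab_model_a = {"career", "salary", "office", "business"}
vocab_model_b = {"career", "office"}
coverage_a = coverage(word_list, vocab_model_a)
coverage_b = coverage(word_list, vocab_model_b)
```

If the coverage differs between models, their bias scores are effectively computed on different word lists.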

Word Embedding Algorithms
We hypothesize that no existing word embedding algorithm is able to account for all issues discussed. Therefore, we train models with multiple algorithms, and we evaluate them against each other. For implementation details on the different algorithms, see Appendix A.
Static As a baseline, we train static embedding models with word2vec (Mikolov et al., 2013). An advantage of this method is the fast training process. Also, static word embeddings are by now well researched and interpretable (Bommasani et al., 2020). In turn, the algorithm requires large training data to generate high-quality embeddings. Furthermore, due to the representation degeneration problem, the measured bias may be less comparable if the token frequencies vary strongly between two corpora. The static models will be referred to as Static in the following.
Frequency-Agnostic Frequency-agnostic word embeddings (FRAGE) (Gong et al., 2018) aim to approach the representation degeneration problem by accounting for the frequency of tokens. FRAGE does so by training a long short-term memory model (LSTM) on a language modeling task and introducing an adversarial discriminator that classifies tokens as rare or popular. During training, the LSTM tries to minimize the ability of the adversarial discriminator to predict the class of each token. While this reduces the impact of token frequency, the model is trained from scratch, increasing training time and requiring much data to obtain high-quality embeddings. The models will be referred to as FrecAgn.
Fine-Tuned To account for the shortcomings of FRAGE, we additionally fine-tune BERT. On the one hand, it provides a good basis for embeddings, as it is pre-trained on large corpora. This should offer a certain level of base quality for semantic embeddings, potentially reducing the negative effect of size differences in the fine-tuning data. Moreover, it may minimize quality differences between embeddings of rare and popular tokens. Due to sub-word tokenization, OOV tokens are also not an issue. However, BERT contextualizes embeddings dynamically during generation, requiring the context of a token (e.g., the sentence it appears in) as input. Since bias measures usually work with single token embeddings, we need to generate embeddings by querying the model for unnatural inputs (e.g., inputs containing only the token in question without context) (Bommasani et al., 2020). The resulting models will be referred to as Fine-Tuned.
Decontextualized As an alternative to fine-tuned BERT, we employ the averaged pooling strategy presented by Bommasani et al. (2020) to generate decontextualized embeddings. The general idea is to embed all contexts of a specific token in a context dataset using a language model. To receive a single embedding per token, the contextualized embeddings are then averaged. Since the final embeddings are contextualized by the context dataset, they can also be expected to encode its social bias. We thus use the corpus we aim to evaluate for social bias as context. Since this method is also based on BERT, we expect the embeddings to have similar advantages over static embeddings, while accounting for the unnatural-input problem. Moreover, since the resulting embeddings are static rather than contextualized, they should retain benefits such as better interpretability. The time needed to generate decontextualized word embeddings is, however, more dependent on the size of the context dataset, since all contexts need to be embedded separately. This results in a potentially long generation time. The models will be referred to as Decontext.
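The averaging step of the decontextualization idea can be sketched as follows. To keep the example self-contained, `toy_contextual_encoder` is a hypothetical stand-in for a language model such as BERT; it only mimics the property that the vector of a token depends on its context. Sub-word pooling, which a real BERT-based setup would additionally need, is omitted here.

```python
from collections import defaultdict
import numpy as np

def toy_contextual_encoder(tokens):
    """Hypothetical stand-in for a language model: one vector per token,
    depending on the token, its position, and the sentence length."""
    return [np.array([len(tok), i, len(tokens)], dtype=float)
            for i, tok in enumerate(tokens)]

def decontextualize(sentences, encoder):
    """Average the contextualized vectors of each token over all contexts."""
    sums = {}
    counts = defaultdict(int)
    for tokens in sentences:
        for tok, vec in zip(tokens, encoder(tokens)):
            sums[tok] = vec if tok not in sums else sums[tok] + vec
            counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}

# Toy context dataset (hypothetical): "women" occurs in two contexts.
contexts = [["women", "lead"], ["the", "women"]]
static_embeddings = decontextualize(contexts, toy_contextual_encoder)
```

The result is one static vector per token, so word list-based measures such as WEAT can be applied without querying the model for single tokens. The loop over all contexts also shows why the generation time grows with the size of the context dataset.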

Data
We now present the large-scale corpus that we acquired to study the existence of social bias in news articles across the political spectrum in the US.
Source Data Using media bias ratings from the news aggregation platform allsides.com, we collected articles from liberal (left and lean-left labels) and conservative (right and lean-right labels), as well as neutral (center label) outlets. While this unidimensional view on the political spectrum is limited (Duckitt and Sibley, 2010), it provides us with a clear distinction and makes results easier to interpret. We refer to news articles with liberal, neutral, and conservative labels as data subsets in Section 5.
Similar to Spinde et al. (2021), we collected news articles from Common Crawl. Since the media bias rating history is not available, we mapped each outlet to its current rating. To extract the pure text from the collected files in WARC format, we used the library news-please (Hamborg et al., 2017).
For our experiments, we also extracted the articles' date of publication automatically as far as possible.
Preprocessing To filter out non-English articles, we classified the language of each text automatically using the langdetect library. In contrast, we intentionally did not filter news categories (e.g., keeping only news articles about politics), in order to avoid selection bias. Furthermore, the different embedding algorithms require varying preprocessing steps. For word2vec, sentence splitting is required. In order to train the FRAGE model, we tokenized the data and replaced ultra-rare tokens with "<unk>", since the model expects the preprocessing of the WikiText-2 corpus (Merity et al., 2016).
To do so, we used the huggingface tokenizer and ended up with a vocabulary of around 39k tokens.
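The replacement of ultra-rare tokens can be sketched with a frequency counter. The threshold `min_count` and the function name are assumptions for illustration; the threshold actually used for the corpus is not stated here.

```python
from collections import Counter

def replace_rare_tokens(tokenized_sentences, min_count=2, unk="<unk>"):
    """Replace tokens occurring fewer than min_count times with unk."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return [
        [tok if counts[tok] >= min_count else unk for tok in sent]
        for sent in tokenized_sentences
    ]

# Toy input (hypothetical): "cat" and "dog" occur only once each.
sentences = [["the", "cat"], ["the", "dog"]]
cleaned = replace_rare_tokens(sentences)
vocabulary = {tok for sent in cleaned for tok in sent}
```

Raising the threshold shrinks the vocabulary, at the cost of mapping more tokens to the same placeholder.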
Statistics In total, we collected 520,798 news articles from 47 different outlets, 19 of which are liberal, 10 neutral, and 18 conservative. Table 1 reports detailed dataset statistics, showing that the number of articles increases over time, more or less monotonically. For about 20% of all articles (107,206), no publication date could be extracted.

Experiments
We now describe our experiments to evaluate embedding algorithms regarding their capabilities to accurately represent social bias in text corpora. To do so, we assess an algorithm's ability to generate a meaningful embedding space and to avoid the issues detailed in Section 3 arising from sparse data.
In particular, we systematically train models on all news articles with either political bias from our corpus (Section 4), once with each of the four word embedding algorithms from Section 3. To increase the data available for each bias, we aggregate news articles for lean-left and left as liberal as well as for lean-right and right outlets as conservative.
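The aggregation of rating labels into data subsets amounts to a simple mapping. In the sketch below, the label strings follow the ratings named in Section 4, while the article structure (dictionaries with `rating` and `text` keys) and the function name are assumptions for illustration.

```python
# Mapping from outlet rating to data subset, as described above.
SUBSET_BY_RATING = {
    "left": "liberal",
    "lean-left": "liberal",
    "center": "neutral",
    "lean-right": "conservative",
    "right": "conservative",
}

def aggregate_by_subset(articles):
    """Collect article texts per data subset based on the outlet rating."""
    subsets = {"liberal": [], "neutral": [], "conservative": []}
    for article in articles:
        subsets[SUBSET_BY_RATING[article["rating"]]].append(article["text"])
    return subsets

# Toy articles (hypothetical).
articles = [
    {"rating": "left", "text": "a"},
    {"rating": "lean-left", "text": "b"},
    {"rating": "center", "text": "c"},
    {"rating": "right", "text": "d"},
]
subsets = aggregate_by_subset(articles)
```

Each resulting subset then serves as the training or context data for one model per embedding algorithm.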

Word Similarity Tests
To better understand the models' quality, we first evaluate their performance on word-similarity tests.
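A word-similarity test of the kind described in Section 3 can be sketched as follows: compute the cosine similarity for each annotated pair and correlate the similarities with the human ratings using Spearman's ρ. The function names, the toy embeddings, and the choice to skip pairs with missing words are assumptions for this example; the sketch does not reproduce the MEN or WordSim353 data.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def word_similarity_score(embeddings, rated_pairs):
    """Return (rho, number of pairs used); pairs with OOV words are skipped."""
    sims, ratings = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in embeddings and w2 in embeddings:
            u, v = embeddings[w1], embeddings[w2]
            sims.append(float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))))
            ratings.append(rating)
    return spearman_rho(sims, ratings), len(sims)

# Toy embeddings and ratings (hypothetical, for illustration only).
embeddings = {
    "a": np.array([1.0, 0.0]), "b": np.array([1.0, 0.1]),
    "c": np.array([0.5, 1.0]), "d": np.array([0.0, 1.0]),
}
rated_pairs = [("a", "b", 9.0), ("a", "c", 5.0), ("a", "d", 1.0), ("a", "zzz", 3.0)]
rho, n_used = word_similarity_score(embeddings, rated_pairs)
```

Reporting the number of pairs used alongside ρ helps to see how strongly OOV tokens shrink the test for a given model.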
Table 2 indicates that all proposed algorithms produce more meaningful embedding spaces compared to the Static models. The scores of the latter are close to 0.00, suggesting little to no correlation with the actual word similarities. A potential reason for the low scores is the limited training data, as discussed in Section 3.2, which may not be large enough to train high-quality models from scratch. The Decontext models, which are pretrained on a larger dataset, on the other hand, achieve the highest scores for all data subsets on both tests (ranging from 0.62 to 0.77), also notably outperforming the underlying BERT model. The fine-tuning process of Fine-Tuned only marginally improves upon the base model. Considering that the liberal and conservative data subsets are notably larger than the neutral subset, it also seems that more data hurts the Fine-Tuned performance. This might be an issue of over-fitting to the fine-tuning data, decreasing the applicability of the resulting embeddings for the general similarity task. Further, the "unnatural" input used to generate Fine-Tuned models, compared to the averaging strategy of the Decontext models, potentially impacts the embedding quality (Bommasani et al., 2020).
Table 3: WEAT values of the models trained with each evaluated algorithm for the three types of social bias. ∆ denotes the difference between the values of the models trained on conservative and liberal articles respectively; the highest ∆ for each bias type is marked bold. For reference, the WEAT values of pre-trained BERT are shown.
These results suggest that, while the size of the training corpus does have an impact on the quality of the word embeddings, it is not the only contributing factor. For example, comparing the results in Table 2 across algorithms for the same dataset, the choice of the algorithm seems to be important as well. That said, some algorithms do seem to benefit from the additional data. While the models trained on the liberal data perform slightly better on MEN tests compared to the other two models trained on smaller data, the benefit seems to be mostly negligible considering the increase in data needed (the liberal dataset contains nearly twice as many articles compared to the conservative dataset) and the additional training time. Furthermore, it is unclear whether this performance difference might partially also be due to the selection of tested words in the respective word similarity tests.
Considering consistency, the frequency-agnostic and the decontextualized model appear most stable across all tests and data subsets. As a result, the models are also more comparable in the WEAT evaluation across data subsets, as the quality of the embedding models seems to be less dependent on the corpus size and content.
Overall, the suggested algorithms seem to improve the quality of the embedding space and abstract reasonably from the corpus size. For Decontext and Fine-Tuned, OOV tokens are less of a problem, as they train on sub-word tokens. The impact of fewer OOV tokens seems smaller in Table 3 than previously assumed. The performance of the FrecAgn model does not vary notably from the two models trained on sub-word tokens. Fewer OOV tokens should, however, result in more accurate social bias evaluations, as more word embeddings exist from which associations can be measured. As noted before, this can be more important when testing smaller datasets, as done in Section 5.3.
To analyze the representation degeneration, we repeated the evaluation with token pairs for which at least one was among the 100 least used tokens of the respective data subset. In general, results were similar to those in Table 2, indicating that Decontext and FrecAgn also perform well with rare tokens. While the results seem convincing, they must be interpreted with care. The similarity evaluations test word embedding models for general words and meaning rather than for social biases. Furthermore, the relation between these tests and the social bias measures is not fully clear.
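The selection of rare-token pairs for this repeated evaluation can be sketched as a filter over the test pairs. The function name is hypothetical, and the sketch assumes that the rarest tokens are determined from the token counts of the whole data subset; other readings (e.g., ranking only the test words) are possible.

```python
from collections import Counter

def rare_token_pairs(pairs, token_counts, n_rarest=100):
    """Keep pairs in which at least one token is among the n_rarest tokens."""
    by_frequency = sorted(token_counts.items(), key=lambda item: item[1])
    rarest = {token for token, _ in by_frequency[:n_rarest]}
    return [pair for pair in pairs if pair[0] in rarest or pair[1] in rarest]

# Toy counts and pairs (hypothetical, for illustration only).
token_counts = Counter({"news": 50, "report": 20, "gazette": 1, "paper": 7})
pairs = [("news", "report"), ("news", "gazette"), ("paper", "report")]
selected = rare_token_pairs(pairs, token_counts, n_rarest=2)
```

With `n_rarest=2`, the rare tokens are "gazette" and "paper", so only the pairs containing one of them remain.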

Bias Representation Accuracy
As detailed in Section 3, each model is evaluated for social bias using WEAT. Following Caliskan et al. (2017), positive values indicate potential biases towards women compared to men (Gender), African-American compared to European-American names (Ethnicity), and Islam compared to Christianity (Religion). Based on our literature review presented in Section 2, we expect the liberal models to be biased against men, European-American names, and Christianity, which should be reflected in positive WEAT values. Accordingly, we expect the opposite for the conservative models, and the neutral models should receive WEAT values located between the others.
Table 3 shows the results. The ∆ columns indicate the difference in WEAT values between the models trained on conservative and on liberal news articles. It is a rough measure of an algorithm's accuracy in encoding social bias. The closer ∆ is to the maximum (2 − (−2) = 4), the better the models represent the expected social bias detailed in Section 3. Our discussion relies on this relative measure, as the exact WEAT value of the data subsets is unknown. We chose not to use absolute values, as a negative ∆ highlights cases that contradict our initial expectations, providing additional information on the quality of the word embedding models and applied measures. Since Fine-Tuned and Decontext are based on BERT, we report BERT's WEAT values for reference.
For all three evaluated bias types, at least one of the suggested algorithms receives a better accuracy than Static. While the fine-tuned models achieve the highest ∆ in the gender bias evaluation, FrecAgn performs closest to expectation in the ethnicity bias evaluation. For the religion bias evaluation, Decontext shows the highest ∆, even though the absolute differences are comparatively small. It is noteworthy that models trained with Static achieve the second-best accuracy for the gender and ethnicity bias evaluations.
In general, the WEAT values for liberal and conservative models are less divergent than expected. Also, ∆ is consistently close to 0 for Decontext and FrecAgn. When comparing the models for a single data subset (e.g., for liberal outlets only), the WEAT value strongly depends on the applied algorithm. The variance for the same data subset across all evaluations is in all cases above 0.7, with an average of 1.005. This is an intriguing finding, indicating that the choice of a particular algorithm is an important parameter when interpreting WEAT results, making exact WEAT values less meaningful and relative comparisons to a reference necessary.
A further interesting result is the fact that the Fine-Tuned and Decontext models have lower WEAT values than the BERT model they are based on. We hypothesize that this is due to our data being less biased, which changed the word associations during the fine-tuning and decontextualization. With the analysis at hand, however, this phenomenon cannot be explained conclusively. While there does not seem to be one "best" algorithm for evaluating social bias according to Table 3, the combination of data, algorithm, and bias type seems to matter for the final result. A potential explanation is that the social bias present in the data is not as we hypothesized in Section 2, and the political bias does not correlate with social bias to the expected degree. Neither psychology literature nor our manual inspection of samples of the corpus make this seem likely, though.
Rather, stereotypes, ideas of society, and with that social bias may be expressed more implicitly (see the example in Figure 1), potentially rendering word list-based measures ineffective for quantifying bias. Similarly, the word lists applied by the WEAT evaluations might not be fully applicable to the evaluated datasets, requiring adaptation to the given linguistic style (Chaloner and Maldonado, 2019). For example, while liberal media may use the term "immigrant" to describe people coming to the US from a different country, conservative media may rather use the term "alien" (Webson et al., 2020). If a word list only includes one of the terms, it cannot properly reflect the associations with the target group and thus the social bias in the data. We suspect that both issues might contribute to the negative ∆ values presented in Table 3. In this regard, future work may investigate measures that do not rely on predefined word lists, but adapt to the corpus being evaluated.

Temporal Evaluation
In our final experiment, we evaluated the change of social bias over time for the three political bias subsets. This also allows for insights into how the presented algorithms work with even fewer data, similar to the analyses of Garg et al. (2018) and Rios et al. (2020). In particular, we trained one word embedding model for articles of each year from each political bias considered (liberal, neutral, and conservative). We excluded all 107,206 articles for which we could not extract dates automatically. Here, we only used the decontextualization algorithm, given that it produced meaningful embeddings across all data subsets above. This is an important property for this evaluation, as the year-based sub-corpora are comparatively small. We evaluate the models for social bias using WEAT.
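The construction of the year-based sub-corpora can be sketched as a grouping step that drops articles without a publication date. The article structure (dictionaries with `subset`, `year`, and `text` keys) and the function name are assumptions for illustration.

```python
from collections import defaultdict

def group_by_subset_and_year(articles):
    """Group article texts by (data subset, year); skip undated articles."""
    groups = defaultdict(list)
    for article in articles:
        year = article.get("year")
        if year is None:
            continue  # no publication date could be extracted
        groups[(article["subset"], year)].append(article["text"])
    return dict(groups)

# Toy articles (hypothetical).
articles = [
    {"subset": "liberal", "year": 2012, "text": "a"},
    {"subset": "liberal", "year": 2012, "text": "b"},
    {"subset": "neutral", "year": 2015, "text": "c"},
    {"subset": "conservative", "year": None, "text": "d"},
]
sub_corpora = group_by_subset_and_year(articles)
```

Each group would then serve as the context dataset for one model, which is subsequently scored with WEAT.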
Figure 2 plots the results for each type of bias. While gender bias does not change notably, ethnicity and religious bias increase over the 12 years. We do not attribute this to the amount of data, as the fluctuations of the neutral model happen mostly during years in which its number of articles is similar to that of the conservative subset (see Table 1). Similar to the evaluation of the full models, we find that the relative social bias levels match the expected results to a certain degree. The liberal model generally shows lower WEAT values in the gender and religious evaluation compared to the conservative model. For the ethnicity evaluation, the liberal and conservative models are less distinct, though, and show a very similar trend. As in the analysis of the full models, the small differences in WEAT values, compared to the full WEAT scale, might indicate that the absolute WEAT numbers are less meaningful and only work in relative comparisons.
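A trend over yearly WEAT scores, such as the regression lines in Figure 2, can be obtained with an ordinary least-squares fit. The sketch below is only an illustration under assumptions: the function name `linear_trend` is hypothetical, and the scores are made-up numbers, not values from our experiments.

```python
def linear_trend(years, scores):
    """Ordinary least-squares line through (year, score) points.

    Returns (slope, intercept); a positive slope means the score
    increases over time.
    """
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example: year offsets and WEAT-like scores for one model.
year_offsets = [0, 1, 2, 3]
toy_scores = [0.1, 0.2, 0.3, 0.4]
slope, intercept = linear_trend(year_offsets, toy_scores)
```

Comparing the slopes of the liberal, neutral, and conservative models is a relative comparison, in line with the caveat on absolute WEAT numbers.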

Conclusion
In this paper, we have compared word embedding algorithms for the task of evaluating text corpora for social bias. To this end, we have introduced a US online news corpus that covers three political bias directions at five levels. Our literature review has motivated that specific political bias coincides with social bias with respect to gender, ethnicity, and religion. We have taken advantage of this property to train three word embedding algorithms and evaluate them for social bias using WEAT. Lastly, we have presented an example application, analyzing the development of social bias in news articles over a 12-year period.
We find that particularly the frequency-agnostic and decontextualized embedding spaces are more meaningful and encode the social bias more accurately than word2vec. They fail, however, to do so consistently for all bias types. While the respective algorithms should be more reliable, especially when evaluating sparse datasets, the exact WEAT results should be considered with care. The values do not seem to quantify social bias in the same way for all embedding algorithms. Future research should investigate the relation between WEAT values of an algorithm and the encoded bias.
Our findings give insights into the role of word embedding algorithms within the social bias evaluation of texts, and they demonstrate what type of embedding models work even in sparse data scenarios. Thereby, we contribute to understanding social bias in texts and NLP applications in general.

Limitations
One limitation of our evaluation is the distantly supervised approach used to label articles for political bias based on the outlet they were published by. We recognize that not all articles of an outlet are necessarily politically biased in the same way and to the same degree. Similarly, the political bias of an outlet could have changed over the evaluated period. A more refined approach could label articles based on their content, rather than the publishing outlet. The same can be said for the social bias labels. Ultimately, it is not guaranteed that the social bias present in the analyzed 500k news articles statistically matches knowledge from psychology literature. Under the premise that the literature is right, however, we are convinced that our inference from political to social bias is sound, even if it may not apply to the same extent to all articles.
We also acknowledge that we did not account for the completeness of the word lists used in the WEAT evaluations, which might therefore suffer from selection bias and not comprehensively represent the target groups. As the WEAT values depend on the contents of the word lists, the presented values might not be fully accurate. A potential improvement to account for representation issues is to adapt the word lists to the language of each data subset, since outlets might use different vocabularies to describe the same groups.
Lastly, we were only able to evaluate a limited number of word embedding algorithms that account for token frequency issues. Potential alternatives include KAFE (Ashfaq et al., 2022), which relies on a knowledge graph to improve token representations, and AGG (Yu et al., 2022), for which the code was not available at the time of conducting the experiments. Similarly, we chose to fine-tune our BERT model for four epochs in all cases to obtain a comparable setting. Other choices might yield varying results.

Ethical Statement
We generate word embedding models for encoding social bias, as we train explicitly on texts that we expect to be biased. The models might therefore also contain more bias than other pre-trained models. They were, however, solely trained for the purpose of analyzing the training data. Due to the nature of the corpus and the comparatively sparse training data, we believe that the resulting models are hardly applicable to other tasks.
We also note that, as already mentioned in the limitations, the word lists that we used in the WEAT evaluation are not complete. They might therefore not represent the social groups to a satisfying degree for real-world applications.
Fine-Tuned. For the fine-tuned language models, we chose uncased BERT as the starting point. We fine-tuned the model for each political bias for four epochs with a standard masked language modeling objective using the Transformers library. We subsequently extracted the embeddings using the flair library (Akbik et al., 2019).
Decontextualized. To generate decontextualized embeddings, we again chose uncased BERT as the starting point and the flair library for contextualization. For each token of interest, we collect the sentences it occurs in within the context datasets, generate contextualized embeddings for each of the sentences, and average them, as suggested by Bommasani et al. (2020).
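The averaging step described above can be sketched as follows. This is a simplified illustration, not our implementation: `encode` is a hypothetical stand-in for a contextual encoder such as BERT, and the toy encoder, the function names, and the example sentences are assumptions. Sub-word pooling is assumed to happen inside `encode`.

```python
from collections import defaultdict


def decontextualize(sentences, encode, vocabulary):
    """Average contextualized vectors of each token of interest.

    sentences:  list of token lists
    encode:     hypothetical function returning one vector per token
    vocabulary: set of tokens of interest
    """
    sums = {}
    counts = defaultdict(int)
    for tokens in sentences:
        vectors = encode(tokens)
        for token, vec in zip(tokens, vectors):
            if token not in vocabulary:
                continue
            if token not in sums:
                sums[token] = list(vec)
            else:
                sums[token] = [s + v for s, v in zip(sums[token], vec)]
            counts[token] += 1
    # Divide each accumulated vector by the number of occurrences.
    return {t: [s / counts[t] for s in sums[t]] for t in sums}


def toy_encode(tokens):
    """Toy encoder (assumption): the vector depends on the context via
    the token position and the sentence length."""
    return [[float(i), float(len(tokens))] for i, _ in enumerate(tokens)]


toy_sentences = [["the", "nurse", "smiled"], ["a", "nurse"]]
static_emb = decontextualize(toy_sentences, toy_encode, {"nurse"})
```

The resulting static vectors can then be evaluated like any other word embedding model, for example with WEAT.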

Figure 1: Excerpts of two articles, from a right and a left news outlet according to allsides.com. Both show potential gender bias, but of different kinds. The articles are included in the corpus presented in Section 4.

Figure 2: Plots of the development of the WEAT scores of the word embedding models for each bias type over time. Each model was trained on the data subset of one pair of year and political orientation. Gender bias slightly decreases over time, while ethnicity and religious bias increase (dashed regression lines).

Table 1: Number of news articles per year for each orientation in our corpus (liberal, neutral, conservative) and for their sub-groups (e.g., left). The total number of articles (All) includes those for which no date could be extracted.

Table 2: Spearman's ρ of the word embedding similarity evaluation on the two tests, WordSim353 and MEN. The embedding models were trained using the evaluated algorithms on liberal, neutral, or conservative articles. Bold values indicate the best score in each column. For comparison, the values of pre-trained BERT are shown.