Measuring Intersectional Biases in Historical Documents



Introduction
The availability of large-scale digitised archives and modern NLP tools has enabled a number of sociological studies of historical trends and cultures (Garg et al., 2018; Kozlowski et al., 2019; Michel et al., 2011). Analyses of historical biases and stereotypes, in particular, can shed light on past societal dynamics and circumstances (Levis Sullam et al., 2022) and link them to contemporary challenges and biases prevalent in modern societies (Payne et al., 2019). For instance, Payne et al. (2019) consider implicit bias the cognitive residue of past and present structural inequalities and highlight the critical role of history in shaping modern forms of prejudice.
Thus far, research on bias in historical documents has focused either on gender (Rios et al., 2020; Wevers, 2019) or on ethnic biases (Levis Sullam et al., 2022). While Garg et al. (2018) analyse both separately, their work does not engage with their intersection. Yet, in the words of Crenshaw (1995), an intersectional perspective is important because "the intersection of racism and sexism factors into black women's lives in ways that cannot be captured wholly by looking separately at the race or gender dimensions of those experiences." Analysing historical documents poses particular challenges for modern NLP tools (Borenstein et al., 2023; Ehrmann et al., 2020). Words misspelt due to wrongly recognised characters in the digitisation process, as well as archaic language unknown to modern NLP models, i.e. historical spelling variants and words that have become obsolete in the current language, increase the task's complexity (Bollmann, 2019; Linhares Pontes et al., 2019; Piotrowski, 2012). However, while most previous work on historical NLP acknowledges the unique nature of the task, only a few studies address these challenges within their experimental setup.
In this paper, we address the shortcomings of previous work and make the following contributions: (1) To the best of our knowledge, this paper presents the first study of historical language associated with entities at the intersection of two axes of oppression: race and gender. We study biases associated with identified entities at the word level and, to this end, employ distributional models and analyse semantics extracted from word embeddings trained on our historical corpora. (2) We conduct a temporal case study on historical newspapers from the Caribbean in the colonial period between 1770 and 1870. During this time, the region suffered both the consequences of European wars and political turmoil and several uprisings of the local enslaved populations, which had a significant impact on Caribbean social relationships and cultures (Migge and Muehleisen, 2010). (3) To address the challenges of analysing historical documents, we probe the applied methods for their stability and their ability to cope with the noisy, archaic corpora.
We find that there is a trade-off between the stability of word embeddings and their compatibility with the historical dataset. Further, our temporal analysis connects changes in biased word associations to historical shifts taking place in the period. For instance, we couple the high association between Caribbean countries and "manual labour", prevalent mostly in the earlier time periods, to waves of white labour migrants coming to the Caribbean from 1750 onward. Finally, we provide evidence supporting intersectionality theory by observing conventional manifestations of gender bias solely for white people. While unsurprising, this finding underlines the need for intersectional bias analysis of historical documents.

Related Work
Intersectional Biases. Most prior work has analysed bias along one axis, e.g. race or gender, but not both simultaneously (Field et al., 2021; Stańczak and Augenstein, 2021). In such work, research on racial biases is generally centred around the gender majority group, such as Black men, while research on gender bias emphasises the experience of individuals who hold racial privilege, such as white women. Therefore, discrimination towards people with multiple minority identities, such as Black women, remains understudied. Addressing this, the intersectionality framework (Crenshaw, 1989) investigates how different forms of inequality, e.g. gender and race, intersect with and reinforce each other. Drawing on this framework, Tan and Celis (2019a), May et al. (2019), Lepori (2020), Maronikolakis et al. (2022), and Guo and Caliskan (2021) analyse the compounding effects of race and gender encoded in contextualised word representations and downstream tasks. Recently, Lalor et al. (2022) and Jiang and Fellbaum (2020) have shown the harmful implications of intersectionality effects in pre-trained language models. Less attention has been dedicated to unveiling intersectional biases prevalent in natural language, with the notable exception of Kim et al. (2020), who provide evidence of intersectional bias in datasets of hate speech and abusive language on social media. As far as we know, this is the first paper on intersectional biases in historical documents.
Bias in Historical Documents. Historical corpora have been employed to study societal phenomena such as language change (Kutuzov et al., 2018; Hamilton et al., 2016) and societal biases. Gender bias has been analysed in biomedical research over a span of 60 years (Rios et al., 2020), in English-language books published between 1520 and 2008 (Hoyle et al., 2019), and in Dutch newspapers from the second half of the 20th century (Wevers, 2019). Levis Sullam et al. (2022) investigate the evolution of the discourse on Jews in France during the 19th century. Garg et al. (2018) study the temporal change in stereotypes and attitudes toward women and ethnic minorities in the 20th and 21st centuries in the US. However, they neglect emergent intersectional bias.
When analysing the transformations of biases in historical texts, researchers rely on conventional tools developed for modern language. However, historical texts can be viewed as a separate domain due to their unique challenges of small and idiosyncratic corpora and noisy, archaic text (Piotrowski, 2012). Prior work has attempted to overcome the challenges such documents pose for modern tools, including recognition of spelling variations (Bollmann, 2019) and misspelt words (Boros et al., 2020), and ensuring the stability of the applied methods (Antoniak and Mimno, 2018).
We study the dynamics of intersectional biases and their manifestations in language while addressing the challenges of historical data.

Datasets
Newspapers are considered an excellent source for the study of societal phenomena since they function as transceivers, both producing and demonstrating public discourse (Wevers, 2019). As part of this study, we collect newspapers written in English from the "Caribbean Newspapers, 1718-1876" database (https://www.readex.com/products/caribbean-newspapers-series-1-1718-1876-american-antiquarian-society), the largest collection of Caribbean newspapers from the 18th-19th century available online. We extend this dataset with English-Danish newspapers published between 1770 and 1850 in the Danish colony of Santa Cruz (Saint Croix), downloaded from the Danish Royal Library's website (https://www2.statsbiblioteket.dk/mediestream/). As mentioned in §1, the Caribbean islands experienced significant changes and turmoil during the 18th-19th century. Although chronologies can differ from island to island, key moments in Caribbean history can be divided into roughly four periods (Higman, 2021; Heuman, 2018): 1) colonial trade and plantation system (1718 to 1750); 2) international conflicts and slave rebellions (1751 to 1790); 3) revolutions and nation building (1791 to 1825); 4) end of slavery and decline of European dominance (1826 to 1876). In our experimental setup, we conduct a temporal study on data split into these periods (see Tab 2 for the number of articles in each period). As the number of newspapers for the first period is very small (< 10), we focus on the three later periods. As some of the newspapers downloaded from the Danish Royal Library contain Danish text, we use spaCy to tokenise the OCRed newspapers into sentences and the Python package langdetect to filter out non-English sentences.
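The sentence-level language filter can be sketched as follows. This is a minimal stand-in for the real pipeline: it assumes the sentences are already split (the paper uses spaCy for that) and replaces langdetect with a simple function-word heuristic; the marker list and threshold are illustrative, not from the paper.

```python
import re

# Very common English function words; a sentence containing a reasonable
# fraction of these is likely English rather than Danish.
ENGLISH_MARKERS = {"the", "of", "and", "to", "in", "a", "is", "was",
                   "that", "for", "on", "with", "by", "at", "this"}

def keep_english(sentences, threshold=0.15):
    """Keep sentences whose fraction of English marker words exceeds
    the threshold. A crude stand-in for langdetect's classifier."""
    kept = []
    for sent in sentences:
        tokens = re.findall(r"[a-zA-Z]+", sent.lower())
        if not tokens:
            continue
        score = sum(t in ENGLISH_MARKERS for t in tokens) / len(tokens)
        if score >= threshold:
            kept.append(sent)
    return kept
```

In practice langdetect makes a full probabilistic decision per sentence; the heuristic above only illustrates where the filter sits in the pipeline.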

Bias and its Measures
Biases can manifest themselves in natural language in many ways (see the surveys by Stańczak and Augenstein (2021), Field et al. (2021), and Lalor et al. (2022)). In the following, we state the definition of bias we follow and describe the measures we use to quantify it.

Definition
Language is known to reflect common perceptions of the world (Hitti et al., 2019), and differences in its usage have been shown to reflect societal biases (Hoyle et al., 2019; Marjanovic et al., 2022). In this paper, we define bias in a text as the use of words or syntactic constructs that connote or imply an inclination or prejudice against a certain sensitive group, following the bias definition of Hitti et al. (2019).
To quantify bias under this definition, we analyse word embeddings trained on our historical corpora. These representations are assumed to carry lexical semantic signals from the data and to encode information about language usage in the proximity of entities. However, even words that are not used as direct descriptors of an entity influence its embedding, and thus its learnt meaning. Therefore, we additionally conduct an analysis focusing exclusively on words that describe identified entities.

Measures
WEAT. The Word Embedding Association Test (Caliskan et al., 2017) is arguably the most popular benchmark for assessing bias in word embeddings and has been adopted in numerous studies (May et al., 2019; Rios et al., 2020). WEAT employs cosine similarity to measure the association between two sets of attribute words and two sets of target concepts. Here, the attribute words relate to a sensitive attribute (e.g. male and female), whereas the target concepts are composed of words from a specific domain of bias (e.g. career- and family-related words). For instance, the WEAT statistic tells us whether the learned embeddings representing the concept of family are more associated with females than with males. Following Caliskan et al. (2017), the differential association between two sets of target concept embeddings, denoted X and Y, and two sets of attribute embeddings, denoted A and B, is calculated as:

s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B)

where s(w, A, B) measures the embedding association between one target word w and each of the sensitive attributes:

s(w, A, B) = \mathrm{mean}_{a \in A} \cos(w, a) - \mathrm{mean}_{b \in B} \cos(w, b)

The resulting effect size is then a normalised measure of association:

d = \frac{\mathrm{mean}_{x \in X} s(x, A, B) - \mathrm{mean}_{y \in Y} s(y, A, B)}{\mathrm{std}_{w \in X \cup Y} s(w, A, B)}

Larger effect sizes imply a more biased word embedding; under an unbiased embedding, concept-related words should be equally associated with either sensitive attribute group.

PMI. We use point-wise mutual information (PMI; Church and Hanks, 1990) as a measure of association between a descriptive word and a sensitive attribute (gender or race). In particular, PMI measures the difference between the probability of the co-occurrence of a word and an attribute and their joint probability if they were independent:

PMI(a, w) = \log \frac{p(a, w)}{p(a)\, p(w)}    (1)

A strong association with a specific gender or race leads to a high PMI. For example, a high value of PMI(female, wife) is expected, because the co-occurrence probability of the two is higher than their independent probabilities would suggest. Accordingly, in an ideal unbiased world, words such as honourable would have a PMI of approximately zero for all gender and racial identities.
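The WEAT statistic can be computed directly from embedding vectors. The sketch below, using numpy, follows the standard definitions of Caliskan et al. (2017); the function names are ours, and any vectors passed in are assumed to come from a trained embedding model.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def s(w, A, B):
    """Association of one target word vector w with attribute sets A and B:
    mean cosine similarity to A minus mean cosine similarity to B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Normalised differential association (WEAT effect size).
    X, Y: lists of target concept vectors; A, B: lists of attribute vectors."""
    assoc = [s(w, A, B) for w in X + Y]
    return (np.mean([s(x, A, B) for x in X])
            - np.mean([s(y, A, B) for y in Y])) / np.std(assoc, ddof=1)
```

A positive effect size indicates that targets in X are more associated with attribute A (and Y with B); swapping X and Y flips the sign.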

Experimental Setup
We perform two sets of experiments on our historical newspaper corpus. First, before employing word embeddings to measure bias, we investigate the stability of the word embeddings trained on our dataset and evaluate how well they cope with the noisy nature of the corpora. Second, we assess gender and racial biases using the tools defined in §4.2.

Embedding Stability Evaluation
We use word embeddings as a tool to quantify historical trends and word associations in our data. However, prior work has called attention to the lack of stability of word embeddings trained on small and potentially idiosyncratic corpora (Antoniak and Mimno, 2018; Gonen et al., 2020). We therefore compare different embedding setups, testing their stability and ability to capture meaning while controlling for the tokenisation algorithm, the embedding size, and the minimum number of occurrences.
We construct the word embeddings employing the continuous skip-gram negative sampling model from Word2vec (Mikolov et al., 2013b) using gensim. Following prior work (Antoniak and Mimno, 2018; Gonen et al., 2020), we test two common vector dimension sizes, 100 and 300, and two minimum numbers of occurrences, 20 and 100. The remaining hyperparameters are set to their default values. We use two different methods for tokenising documents: the spaCy tokeniser and a subword-based tokeniser, Byte-Pair Encoding (BPE; Gage, 1994). We train the BPE tokeniser on our dataset using the Hugging Face tokeniser implementation. For each word in the vocabulary, we identify its 20 nearest neighbours and calculate the Jaccard similarity across five algorithm runs. Next, we test how well the word embeddings deal with the noisy nature of our documents. We create a list of 110 frequently misspelt words (see App A.2). We construct the list by first tokenising our dataset using spaCy and filtering out proper nouns and tokens that appear in the English dictionary. We then order the remaining tokens by frequency and manually scan the top 1,000 tokens for misspelt words. We calculate the percentage of words (averaged across five runs) for which the misspelt word is in immediate proximity to the correct word (among the top 5 nearest neighbours in terms of cosine similarity).
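The stability check described above reduces to a mean pairwise Jaccard similarity over each word's nearest-neighbour lists. A minimal sketch, assuming the 20-nearest-neighbour lists have already been extracted from each trained model:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two neighbour lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def neighbour_stability(runs):
    """runs: list of {word: [nearest neighbours]} dicts, one per training run.
    Returns the Jaccard similarity of each word's neighbour set, averaged
    over all pairs of runs and over the shared vocabulary."""
    vocab = set.intersection(*(set(r) for r in runs))
    scores = []
    for w in vocab:
        pair_scores = [jaccard(r1[w], r2[w])
                       for r1, r2 in combinations(runs, 2)]
        scores.append(sum(pair_scores) / len(pair_scores))
    return sum(scores) / len(scores)
```

A score of 1.0 would mean every word keeps exactly the same neighbours across runs; lower values indicate instability of the learned space.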
Based on the results of the stability and compatibility study, we select the most suitable model with which we conduct the following bias evaluation.

WEAT Evaluation
As discussed in §4.2, WEAT evaluates how two attributes are associated with two target concepts in an embedding space, here that of the model selected by the method described in §5.1.
In this work, we focus on the attribute pairs (female, male) and (white, non-white). Usually, the sensitive attributes (white, non-white) are compared by collecting the embeddings of popular white names and popular non-white names (Tan and Celis, 2019b). However, this approach can introduce noise when applied to our dataset (Handler and Jacoby, 1996). First, non-whites are less likely to be mentioned by name in historical newspapers than whites. Second, popular non-white names of the 18th and 19th centuries differ substantially from popular non-white names of modern times, and, to the best of our knowledge, there is no list of common historical non-white names. For these reasons, instead of comparing the pair (white, non-white), we compare the pairs (African countries, European countries) and (Caribbean countries, European countries).
Following Rios et al. (2020), we analyse the association of the above-mentioned attributes with the target concepts (career, family), (strong, weak), (intelligence, appearance), and (physical illness, mental illness). Following a consultation with a historian, we add further target concepts relevant to this period: (manual labour, non-manual labour) and (crime, lawfulness). Tab 6 (in App A.3) lists the target and attribute words we use for our analysis.
We also train a separate word embedding model on each of the dataset splits defined in §3 and run WEAT on the resulting three models. Comparing the obtained WEAT scores allows us to visualise temporal changes in the bias associated with the attributes and to understand its dynamics.

PMI Evaluation
Unlike WEAT, calculating PMI requires first identifying entities in the OCRed historical newspapers and then classifying them into predefined attribute groups. The next step is collecting descriptors, i.e. words that are used to describe the entities. Finally, we use PMI to measure the association strength of the collected descriptors with each attribute group.
Entity Extraction. We apply F-coref (Otmazgin et al., 2022), a model for English coreference resolution that simultaneously performs entity extraction and coreference resolution on the extracted entities. The model's output is a set of entities, each represented as a list of all the references to that entity in the text. We filter out non-human entities using nltk's WordNet package, retaining only entities for which the synset "person.n.01" is a hypernym of one of their references.
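The person filter keeps an entity only if "person" is reachable via the hypernym relation from one of its references. The paper uses nltk's WordNet for this; the sketch below runs the same closure test over a tiny hand-made taxonomy, so the hypernym edges here are illustrative, not WordNet data.

```python
# Toy hypernym taxonomy standing in for WordNet: each word maps to its
# (single) hypernym. These edges are illustrative only.
TOY_HYPERNYMS = {
    "captain": "person",
    "merchant": "person",
    "widow": "person",
    "schooner": "vessel",
    "vessel": "artifact",
}

def is_person(reference: str) -> bool:
    """Walk up the hypernym chain and test whether it reaches 'person'."""
    node = reference.lower()
    seen = set()
    while node in TOY_HYPERNYMS and node not in seen:
        seen.add(node)
        node = TOY_HYPERNYMS[node]
    return node == "person"

def keep_human_entities(entities):
    """entities: list of reference lists, one per coreference cluster.
    Keep a cluster if any of its references resolves to a person."""
    return [refs for refs in entities if any(is_person(r) for r in refs)]
```

With real WordNet, the walk would follow `synset.hypernyms()` over all senses of a reference rather than a single-parent dictionary, but the closure logic is the same.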
Entity Classification. We use a keyword-based approach (Lepori, 2020) to classify the entities into groups corresponding to the gender and race axes and their intersection. Specifically, we classify each entity as a member of male vs female and of white vs non-white. Additionally, entities are classified into intersectional groups (e.g. we classify an entity into the group non-white females if it belongs to both female and non-white). Formally, we classify an entity e with references {r_e^1, ..., r_e^m} into attribute group G with keyword set K_G if at least one of its references appears in K_G (see App A.3 for the keyword sets of the different groups).

In Tab 3, we present the number of entities classified into each group; note the unbalanced representation of the groups in the dataset. Further, it is important to state that, because it is highly unlikely that an entity in our dataset would be explicitly described as white, we classify an entity into the whites group if it was not classified as non-white. See the Limitations section for a discussion of the limitations of this keyword-based classification approach.

Table 3: The entities in our Caribbean newspapers dataset. Notice that #males and #females do not sum to #entities, as some entities could not be classified. Similarly, #non-white males and #non-white females do not sum to #non-whites.
To evaluate our classification scheme, an author of this paper manually labelled a random sample of 56 entities. The keyword-based approach assigned the correct gender and race label to ∼80% of the entities; see additional details in Tab 7 in App B. From a preliminary inspection, it appears that many of the entities wrongly classified as female were actually ships or other vessels (which have traditionally been referred to with female gender). As F-coref was developed and trained on modern corpora, we also evaluate its accuracy on the same set of 56 entities. Two authors of this paper validated its performance on the historical data as satisfactory, with especially good results on shorter texts with fewer OCR errors.
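The keyword-based classification can be sketched as follows. The keyword sets below are small illustrative samples, not the full sets from the paper's appendix, and the white-by-default rule mirrors the scheme described above.

```python
# Illustrative keyword samples; the paper's full sets are in its App A.3.
GENDER_KEYWORDS = {
    "female": {"she", "her", "mrs", "miss", "woman", "wife", "widow"},
    "male": {"he", "him", "mr", "man", "husband"},
}
NON_WHITE_KEYWORDS = {"negro", "mulatto", "african", "creole"}

def classify(references):
    """Classify one entity (a list of its references) along both axes.
    An entity joins a gender group if any reference matches that group's
    keywords; it counts as white unless a non-white keyword is found."""
    refs = {r.lower() for r in references}
    gender = next((g for g, kw in GENDER_KEYWORDS.items() if refs & kw), None)
    race = "non-white" if refs & NON_WHITE_KEYWORDS else "white"
    return gender, race
```

Intersectional groups then follow by conjunction, e.g. an entity belongs to non-white females if `classify` returns ("female", "non-white").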
Descriptors Collection. Finally, we use spaCy to collect descriptors for each classified entity. Here, we define the descriptors as the lemmatised forms of tokens that share a dependency arc labelled "amod" (i.e. adjectives that describe the token) with one of the entity's references. Every target group G_j is then assigned a descriptor list D_j. To calculate PMI according to Eq (1), we estimate the joint distribution of a target group and a descriptor using a simple plug-in estimator based on the counts of each descriptor in each group's descriptor list.
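Given the per-group descriptor lists, PMI can be estimated from raw counts. A minimal sketch, where the plug-in probabilities are simple count ratios over the pooled descriptor lists (an assumption about the estimator's exact form; the function name is ours):

```python
import math
from collections import Counter

def pmi_table(descriptors_by_group):
    """descriptors_by_group: {group: [descriptor, ...]} as collected above.
    Returns {(group, word): PMI}, with p(group, word), p(group) and p(word)
    estimated as count ratios over all collected descriptors."""
    total = sum(len(d) for d in descriptors_by_group.values())
    word_counts = Counter(w for d in descriptors_by_group.values() for w in d)
    pmi = {}
    for group, descs in descriptors_by_group.items():
        group_total = len(descs)
        for word, joint in Counter(descs).items():
            p_joint = joint / total
            p_group = group_total / total
            p_word = word_counts[word] / total
            pmi[(group, word)] = math.log(p_joint / (p_group * p_word))
    return pmi
```

A descriptor used proportionally more often for one group than for the corpus overall gets a positive PMI with that group, and a negative PMI with groups that use it less than chance would predict.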

Lexicon Evaluation
Another popular approach for quantifying different aspects of bias is the application of specialised lexica (Stańczak and Augenstein, 2021). These lexica assign words continuous values that represent how well each word aligns with a specific dimension of bias. Given a lexicon B = {(w_1, a_1), ..., (w_n, a_n)}, where (w_i, a_i) are word-value pairs, we calculate the association of B with a sensitive attribute G_j using:

\mathrm{score}(B, G_j) = \frac{\sum_i a_i \cdot \mathrm{count}(w_i, D_j)}{\sum_i \mathrm{count}(w_i, D_j)}

where count(w_i, D_j) is the number of times the word w_i appears in the descriptor list D_j.

Figure 3: a) WEAT results of females vs males. The location of a marker measures the association strength of females with the concept (compared to males). For example, according to the modern model, females are associated with "weak" and non-manual labour, while males are associated with "strong" and manual labour. b) WEAT results of Caribbean countries vs European countries. The location of a marker measures the association strength of Caribbean countries with the concept (compared to European countries).
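The lexicon-based association reduces to a count-weighted mean of lexicon values over a group's descriptor list, a minimal sketch of which follows (the weighted-average form is our reading of the association score described above; the function name is illustrative):

```python
from collections import Counter

def lexicon_association(lexicon, descriptors):
    """lexicon: {word: value} on one bias dimension (e.g. dominance);
    descriptors: the descriptor list collected for one attribute group.
    Returns the count-weighted mean lexicon value of the descriptors
    that appear in the lexicon."""
    counts = Counter(descriptors)
    num = sum(value * counts[word] for word, value in lexicon.items())
    den = sum(counts[word] for word in lexicon)
    return num / den if den else 0.0
```

Comparing this score across groups (e.g. white vs non-white females) then indicates which group's descriptors sit higher on the lexicon's dimension.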

Results
First, we investigate which training strategies of word embeddings optimise their stability and compatibility on historical corpora (§6.1). Next, we analyse how bias manifests along the gender and racial axes and whether there are any noticeable differences in bias across different periods of Caribbean history (§6.2).

Embedding Stability Evaluation
In Tab 4, we present the results of the study on the influence of word embedding training strategies. We find that there is a trade-off between the stability of word embeddings and their compatibility with the dataset. While BPE achieves a higher Jaccard similarity across the top 20 nearest neighbours for each word across all runs, it loses the meaning of misspelt words. Interestingly, this phenomenon arises despite the misspelt words occurring frequently enough to be included in the BPE model's vocabulary.
For the remainder of the experiments, we aim to select a model which effectively manages this trade-off, achieving high stability while capturing meaning despite the noisy nature of the underlying data. We thus opt for a spaCy-based embedding with a minimum number of occurrences of 20 and an embedding size of 100, which achieves competitive results in both of these aspects. Finally, we note that our results remain stable across different algorithm runs and do not suffer from substantial variations, which corroborates the reliability of the findings we make henceforth.

WEAT Analysis
Fig 3 displays the results of the WEAT analysis measuring the association of the six targets described in §5.2 with the attributes (females, males) and (Caribbean countries, European countries), respectively. We calculate the WEAT score using the embedding model from §6.1 and compare it with an embedding model trained on modern news corpora (word2vec-google-news-300, Mikolov et al. (2013a)). We notice interesting differences between the historical and modern embeddings. For example, while in our dataset females are associated with the target concept of manual labour, this notion is more aligned with males in the modern corpora. A likely cause is that during this period, women's intellectual and administrative work was not commonly recognised (Wayne, 2020). It is also interesting to note that the attribute Caribbean countries has a much stronger association in the historical embedding with the target career (as opposed to family) compared to the modern embedding. A possible explanation is that Caribbean newspapers referred to locals by profession or similar titles, while Europeans were referred to as relatives of the Caribbean population.
This finding is potentially related to several historical shifts taking place in the period. For instance, while in the earlier years it was common for plantation owners to be absentees and continue to live in Europe, from 1750 onward waves of white migrants with varied professional backgrounds came to the Caribbean.

PMI Analysis
We report the results of the intersectional PMI analysis in Fig 1. As can be seen, an intersectional analysis can shed a unique light on the biased nature of some words in a way that a single-dimensional analysis cannot. White males are "brave" and "ingenious", and non-white males are described as "active" and "tall". Interestingly, while words such as "pretty" and "beautiful" (and, peculiarly, "murdered") are biased towards white as opposed to non-white females, the word "lovely" is not, whereas "elderly" is strongly aligned with non-white females. Another intriguing dichotomy is the word pair "sick" and "blind", which are both independent along the gender axis but manifest a polar racial bias. In Tab 8 in App B, we list some examples from our dataset featuring those words.
Similarly to §6.2.1, we perform a temporal PMI analysis by comparing results obtained from separately analysing the three dataset splits. In Fig 5, we follow the trajectory over time of the biased words "free", "celebrated", "deceased", and "poor". Each word displays different temporal dynamics. For example, while the word "free" moved towards the male attribute, "poor" became more associated with the attributes female and non-white over time (potentially due to its meaning shifting from an association with poverty to one with pity).
These results provide evidence for the claims of intersectionality theory. We observe conventional manifestations of gender bias, i.e. "beautiful" and "pretty" for white females, and "ingenious" and "brave" for white males. While unsurprising given the societal status of non-white people in that period, this finding underlines the need for intersectional bias analysis of historical documents in particular.

Lexicon Evaluation
Finally, we report the lexicon-based evaluation results in Fig 6 and Fig 7. Unsurprisingly, we observe lower dominance levels for the non-white and female attributes compared to white and male, a finding previously uncovered in modern texts (Field and Tsvetkov, 2019; Rabinovich et al., 2020). While Fig 7 indicates that the level of dominance associated with these attributes rose over time, a noticeable disparity to white males remains. Perhaps more surprising is the valence dimension. We see the highest and lowest levels of association with the intersectional attributes non-white female and non-white male, respectively. We hypothesise that this connects to the nature of advertisements for lending the services of or selling non-white women, where being agreeable is a valuable asset.

Conclusions
In this paper, we examine biases present in historical newspapers published in the Caribbean during the colonial era by conducting a temporal analysis of biases along the axes of gender, race, and their intersection. We evaluate the effectiveness of different embedding strategies and find a trade-off between the stability and compatibility of word representations on historical data. We link changes in biased word usage to historical shifts, coupling the development of the association between manual labour and Caribbean countries to waves of white labour migrants coming to the Caribbean from 1750 onward. Finally, we provide evidence to corroborate intersectionality theory by observing conventional manifestations of gender bias solely for white people.

Limitations
We see several limitations regarding our work. First, we focus on documents in the English language only, neglecting many Caribbean newspapers and islands with other official languages. While some of our methods can easily be extended to non-English material (e.g. the WEAT analysis), methods that rely on the pre-trained English model F-coref (i.e. the PMI and lexicon-based analyses) cannot.
On the same note, F-coref and spaCy were developed and trained on modern corpora, and their capabilities on the noisy historical newspaper dataset are noticeably lower than on modern texts. Contributing to this issue is the unique, sometimes archaic language in which the newspapers were written. While we validate F-coref's performance on a random sample (§5.2), this is a significant limitation of our work. Similarly, increased attention is required to adapt the keyword sets used by our methods to historical settings.
Moreover, our historical newspaper dataset is inherently imbalanced and skewed. As can be seen in Tab 2 and Fig 8, there is an over-representation of a handful of specific islands and time periods. While it is likely that in different regions and periods less source material survived to modern times, part of the imbalance (e.g. the prevalence of the US Virgin Islands) can also be attributed to current research funding and policies. Compounding this further, minority groups are traditionally under-represented in news sources. This introduces noise and imbalance into our results, which rely on a large amount of textual material referring to each attribute on the gender/race plane that we analyse.
Relating to that, our keyword-based method of classifying entities into groups corresponding to the gender and race axes is limited. While we devise specialised keyword sets targeting the attributes female, male, and non-white, we classify an entity into the white group if it was not classified as non-white. This discrepancy is likely to introduce noise into our evaluation, as can also be observed in Tab 7. This tendency may be intensified by the NLP systems that we use, as many tend to perform worse on gender- and race-minority groups (Field et al., 2021).
Finally, in this work, we explore intersectional bias only along the race and gender axes. We thus neglect the effects of other confounding factors (e.g. societal position, occupation) that affect asymmetries in language.

Ethical Considerations
Studying historical texts from the era of colonisation and slavery poses ethical issues to historians and computer scientists alike, since vulnerable groups still suffer the consequences of this history in the present. Indeed, racist and sexist language is not only a historical artefact of bygone days but has a real impact on people's lives (Alim et al., 2020).
We note that the newspapers we consider for this analysis were written foremost by the European oppressors. Moreover, only a limited number of affluent people (white males) could afford to place advertisements in those newspapers (which constitute a large portion of the raw material). This skews our study toward language used by privileged individuals and their perceptions.
This work aims to investigate racial and gender biases, as well as their intersection. Both race and gender are considered social constructs and can encompass a range of perspectives, including one's reflected, observed, or self-perceived identity. In this paper, we classify entities as observed by the author of an article and infer their gender and race based on the pronouns and descriptors used in relation to each entity. We follow this approach in the absence of explicit demographic information. However, we warn that this method poses a risk of misclassification. Although the people referred to in the newspapers are no longer among the living, we should be considerate when conducting studies addressing vulnerable groups.

Figure 1: PMI analysis of our historical corpora. Words are placed on the intersectional gender/race plane.
See Tab 1 and Fig 8 (in App A.1) for details.

Figure 2: An example of a scanned newspaper (a) and the output of the OCR tool Tesseract (b). We fix simple OCR errors (highlighted) using a rule-based approach.
Preprocessing. Starting with scans of entire newspaper issues (Fig 2.a), we first OCR them using the popular software Tesseract with default parameters and settings. We then clean the dataset by applying the DataMunging package, which uses a simple rule-based approach to fix basic OCR errors (e.g. a long 's' being OCRed as 'f'; Fig 2.b).

Figure 4: Temporal WEAT analysis conducted for the periods 1751-1790 (rebellions), 1791-1825 (revolutions), and 1826-1876 (abolishment). As in Fig 3, the height of each bar represents how strong the association of the attribute is with each concept.

Figure 6: Association of attributes with the lexicon of dominance, valence, and arousal.
In Fig 4 and Fig 10 (in App B), we present a dynamic WEAT analysis that unveils trends on a temporal axis. In particular, we see an increase over time in the magnitude of association between the target of family vs career and the attributes (females, males) and (Caribbean countries, European countries). It is especially interesting to compare Fig 3 with Fig 4. One intriguing result is that the high association between Caribbean countries and manual labour can be attributed to the earlier periods.

Figure 8: The geographical distribution of the curated Caribbean newspapers dataset.

Figure 9: WEAT results of African countries vs European countries.

Table 1: Statistics of the newspapers dataset.

Table 2: Total number of articles in each period and decade.

Table 4: Results of the stability analysis of different word embedding methods (measured with Jaccard similarity) and their compatibility with the historical corpora (ability to recognise misspelt words).

Now, we can assign every word d_i two continuous values representing its bias in the gender and race dimensions by calculating PMI(female, d_i) − PMI(male, d_i) and PMI(non-white, d_i) − PMI(white, d_i). These two continuous values can be seen as d_i's coordinates on the intersectional gender/race plane.

Table 6: Keywords used for performing WEAT evaluation.