Measuring Gender Bias in West Slavic Language Models

Pre-trained language models have been known to perpetuate biases from the underlying datasets to downstream tasks. However, these findings are predominantly based on monolingual language models for English, whereas there are few investigative studies of biases encoded in language models for languages beyond English. In this paper, we fill this gap by analysing gender bias in West Slavic language models. We introduce the first template-based dataset in Czech, Polish, and Slovak for measuring gender bias towards male, female and non-binary subjects. We complete the sentences using both mono- and multilingual language models and assess their suitability for the masked language modelling objective. Next, we measure gender bias encoded in West Slavic language models by quantifying the toxicity and genderness of the generated words. We find that these language models produce hurtful completions that depend on the subject’s gender. Perhaps surprisingly, Czech, Slovak, and Polish language models produce more hurtful completions with men as subjects, which, upon inspection, we find is due to completions being related to violence, death, and sickness.


Introduction
The societal impact of large pre-trained language models, including the nature of the biases they encode, remains unclear (Bender et al., 2021). Prior research has shown that language models perpetuate biases, gender bias in particular, from the training corpora to downstream tasks (Webster et al., 2018; Nangia et al., 2020). However, Sun et al. (2019) and Stańczak and Augenstein (2021) identify two issues within the gender bias landscape as a whole.
Firstly, most of the research focuses on high-resource languages such as English, Chinese and Spanish. Limited research exists for other languages. French, Portuguese, Italian, and Romanian (Nozza et al., 2021) have received some attention, as have Danish, Swedish, and Norwegian language models (Touileb and Nozza, 2022). Research into Slavic languages has been limited to gender bias in Slovenian and Croatian word embeddings (Supej et al., 2019; Ulčar et al., 2021). To the best of our knowledge, we present the first work on gender bias in West Slavic language models. Because West Slavic languages are gendered languages, results from prior work on non-gendered languages might not apply, which makes this a relevant research direction.
Secondly, most of the gender-related research focuses on gender as a binary variable (Stańczak and Augenstein, 2021). While we recognise that including the full gender spectrum might be challenging, moving away from binary to include neutral language and non-binary language is strongly desirable (Sun et al., 2021).
This work addresses both of these limitations. We focus on West Slavic languages, i.e., Czech, Slovak and Polish, with the intention of answering the following research questions:
• RQ1: Are current multilingual models suitable for use in West Slavic languages?
• RQ2: Do West Slavic language models exhibit gender bias in terms of toxicity and genderness scores?
• RQ3: Are language models in Czech, Slovak and Polish generating more toxic content when exposed to non-binary subjects?
Our main contribution is a set of templates with masculine, feminine, neutral and non-binary subjects, which we use to assess gender bias in language models for Czech, Slovak, and Polish. First, we generate sentence completions using mono- and multilingual language models and test their suitability for the masked language modelling objective for West Slavic languages. Next, we quantify gender bias by measuring the toxicity (HONEST; Nozza et al. 2021) and valence, arousal, and dominance (VAD; Mohammad 2018) scores. We find that Czech and Slovak models are likely to produce completions containing violence, illness and death for male subjects. Finally, we do not find substantial differences in valence, arousal, or dominance of completions.

Gender Bias in Language Models
Gender bias refers to the tendency to make judgments or assumptions based on gender, rather than objective factors or individual merit (Sun et al., 2019). For high-resource languages, there is a respectable amount of research on automatic bias detection and mitigation, including investigating stereotypical bias of contextualised word embeddings (Kurita et al., 2019), amplification of dataset-level bias by models (Zhao et al., 2017), gender bias in the translation of neutral pronouns (Cho et al., 2019), and gender bias mitigation (Bartl et al., 2020). Kurita et al. (2019) proposed querying the underlying language model as a method for measuring bias in contextualised word embeddings. Similarly, Stańczak et al. (2021) rely on a simple template structure to quantify bias in multilingual language models for 7 languages. Bartl et al. (2020) find that English BERT reflects the real-world gender bias of typical professions and are able to fine-tune the model to reduce this bias. Additionally, Bartl et al. (2020) show that methods effective for English language models are not necessarily effective for other languages, in particular German. Recently, Nangia et al. (2020) curate template sentences to evaluate biases, including racial and gender ones, while Névéol et al. (2022) adapt this dataset to French while incorporating culture-specific issues into the templates. Subsequently, the specific task of exploring gender bias in lower-resource languages was investigated for Scandinavian languages (Touileb and Nozza, 2022).
In this paper, we aim to quantify gender bias in West Slavic language models based on the sentence completion task.
[CS] Ta nebinární osoba je ____ . (non-binary) The non-binary person is a ____ .
(XLM-R; Conneau et al. 2020). Since SlovakBERT is the only available model for the Slovak language, the other monolingual models are chosen to be BERT-like as well in order to provide a fair comparison without the influence of model architecture. We list the selected models, including their training data and the number of parameters, in Table 3 in the Appendix.
We measure the internal bias of the selected language models using the template-filling task, as the monolingual language models for West Slavic languages were pre-trained using the cloze-style masked language model objective. In particular, we directly query the model to generate a word for the masked token and then measure bias in the generated word. We use simple template sentences containing the target word for bias, i.e., a gendered subject such as man, woman, or non-binary person.

Dataset
To the best of our knowledge, we introduce the first template-based dataset to measure gender bias in language models for West Slavic languages. In particular, we use two types of templates:
1. Translated templates - originally developed to evaluate gender bias in Scandinavian languages (Touileb and Nozza, 2022)
2. Manually created templates
The manual templates encompass attributes, preferences, and perceived roles in society, work and studies, inspired by the categorisation in Baluchova (2010) and Kolek and Valdrová (2020). These categories, together with their explanations and number of templates, can be found in Table 4 in the Appendix. We translate the first set of templates into Slovak, Czech and Polish using the Google Translate API,² and the translations are then manually validated by native speakers of these languages. The second set of templates extends the templates from the first set with neutral and non-binary subjects. Our dataset includes four gender categories of subjects: male (men, boys, etc.), female (women, girls, etc.), neutral (person, children, etc.), and non-binary (non-binary person, non-binary people, etc.).
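The way subjects from the four gender categories combine with sentence frames can be sketched as follows. This is only an illustration under stated assumptions: the English subjects and frames are hypothetical stand-ins for the Czech, Slovak, and Polish originals, the `[MASK]` token is an assumed placeholder that depends on the model's tokenizer, and `build_templates` is a name introduced here, not part of the released dataset.

```python
from itertools import product
from typing import Dict, List, Tuple

MASK = "[MASK]"  # assumed mask token; the real one depends on the tokenizer

# Hypothetical English stand-ins for the subjects in each gender category.
SUBJECTS: Dict[str, List[str]] = {
    "male": ["The man", "The boy"],
    "female": ["The woman", "The girl"],
    "neutral": ["The person"],
    "non-binary": ["The non-binary person"],
}

# Hypothetical sentence frames with a slot for the subject and the mask.
FRAMES: List[str] = ["{subject} is a {mask}.", "{subject} likes {mask}."]


def build_templates(
    subjects: Dict[str, List[str]], frames: List[str], mask: str = MASK
) -> List[Tuple[str, str]]:
    """Return (gender category, masked sentence) pairs for every subject and frame."""
    pairs = []
    for (category, names), frame in product(subjects.items(), frames):
        for name in names:
            pairs.append((category, frame.format(subject=name, mask=mask)))
    return pairs


templates = build_templates(SUBJECTS, FRAMES)
for category, sentence in templates[:4]:
    print(category, "|", sentence)
```

In Czech, Slovak, and Polish, adjectives and past-tense verbs agree with the grammatical gender of the subject, so pure string substitution as above would likely need to be replaced by hand-written variants per gender category.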
We demonstrate the usability of the dataset by evaluating gender bias in the monolingual language models for West Slavic languages.

Bias Measures
We use toxicity and genderness as proxies for gender bias. Specifically, we define toxicity as the use of language that is harmful to a gender group (Bassignana et al., 2018) and genderness of language as the use of unnecessarily gendered or stereotype-carrying words or language structures. Lexicon matching has been frequently adopted to measure both toxicity (Nozza et al., 2022) and genderness (Marjanovic et al., 2022; Field and Tsvetkov, 2019) on a word level. We measure gender bias in West Slavic language models using two popular methods which are available in all analysed languages: the HONEST score (Nozza et al., 2021) and the Valence, Arousal, and Dominance lexicon (Mohammad, 2018).
HONEST We rely on the HurtLex lexicon (Bassignana et al., 2018), which has been published in more than 100 languages, to quantify the toxicity of a generated word. Recently, based on the toxicity scores in the HurtLex lexicon, Nozza et al. (2021) propose the HONEST score as a gender bias measure. More formally, the HONEST score is defined as:
\[ \mathrm{HONEST}(LM) = \frac{\sum_{t \in T} \sum_{c \in C(LM, t, K)} \mathbb{1}_{\mathrm{HurtLex}}(c)}{|T| \cdot K} \]
where T is the set of templates and C(LM, t, K) is the set of K completions for a given language model LM and template t. The indicator function marks whether a completion c is included in the HurtLex lexicon. A high value for the HONEST score indicates a high level of toxicity within the completions, hence a high level of bias. We use HurtLex (Bassignana et al., 2018) to determine which completions are harmful as it is available in all three West Slavic languages.
² https://cloud.google.com/translate
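A minimal sketch of the HONEST computation, assuming the normalisation by |T| · K from Nozza et al. (2021): the number of top-K completions found in the lexicon, divided by the number of templates times K. The function name `honest_score`, the lowercase matching, and the toy lexicon and completions are assumptions made for illustration and are not taken from the dataset or from HurtLex.

```python
from typing import Dict, List, Set


def honest_score(completions: Dict[str, List[str]], hurtlex: Set[str], k: int) -> float:
    """Share of top-k completions found in the lexicon, averaged over templates."""
    if not completions or k <= 0:
        raise ValueError("need at least one template and k > 0")
    hurtful = 0
    for words in completions.values():
        # Only the k highest-ranked completions of each template are counted.
        # Assumption: the lexicon stores lowercase entries.
        hurtful += sum(1 for w in words[:k] if w.lower() in hurtlex)
    return hurtful / (len(completions) * k)


# Toy data, invented for illustration only.
toy_lexicon = {"idiot", "killer"}
toy_completions = {
    "The man is a [MASK].": ["teacher", "killer", "doctor", "idiot"],
    "The woman is a [MASK].": ["teacher", "nurse", "doctor", "mother"],
}

score_k4 = honest_score(toy_completions, toy_lexicon, k=4)
score_k1 = honest_score(toy_completions, toy_lexicon, k=1)
print(score_k4, score_k1)
```

Computing the score separately on the templates of each gender category gives the per-gender values that are compared in the results.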
VAD Lexicon Further, we measure the dimensions of valence, arousal, and dominance for the generated words employing the Valence, Arousal, Dominance lexicon (VAD; Mohammad 2018).
Studies into the differences in the way language is used by different genders, including Coates and Pichler (1998), Newman et al. (2008), and Boudersa (2020), suggest that language used by women is less bold and/or dominant than the language used by men. Since dominance is stereotypically associated with men in West Slavic languages, we would expect gender bias to translate to more dominant language being used in association with the male gender. Similarly, for the valence and arousal dimensions, the stereotype is that men are more powerful, competent, and active, and so a biased model is expected to generate more words with high valence and arousal values associated with men.
When it comes to the templates including neutral and non-binary subjects, these could very well follow the male default of West Slavic languages. Another possibility is that, in particular, the non-binary setting could be quite unknown to the models, as such language is not commonly used in Slovak, Czech or Polish.

Experiments and Results
First, we analyse template completions using both mono- and multilingual language models to evaluate their suitability for use in West Slavic languages (RQ1). Next, we quantify gender bias in language models for West Slavic languages based on the toxicity and the valence, arousal, and dominance of the words they generate (RQ2). Finally, we compare the results for gender-binary template completion with the results for templates including non-binary subjects (RQ3).

Comparison of mono- and multilingual LMs
In Table 2, we show examples of completions generated by the analysed multilingual language models, m-BERT and XLM-R. The completions highlighted in red are incorrect completions, i.e., the final sentence is nonsensical and/or grammatically incorrect. We find that a substantial proportion of the completions is of low quality, showing that multilingual language models are not well suited for the sentence completion task for West Slavic languages. In the following, we target monolingual language models due to the poor performance of the multilingual language models for these languages.
Table 2: Multilingual completions for the m-BERT and XLM-R language models. We provide translations in italics for completions that are actual words in the target language. The completions highlighted in red are incorrect.
HONEST Following Touileb and Nozza (2022), we generate top k (for k ∈ {5, 10, 20}) completions of templates using the selected language models and calculate the HONEST score and percentages of completions with high VAD values.
In Figure 1, we show the HONEST scores for all language models and template types. We report higher percentages in red, and lower ones in green. The scores range between 0.005 and 0.132. Most scores for manually created templates fall between 0.03 and 0.06, which is relatively high in and of itself. Comparing the manually created and translated templates, we see that all models score worse for the translated templates, for which scores are between 0.073 and 0.132. In other words, these models produce a completion harmful to gender groups for up to 13.2% of completions. These results can be compared directly with HONEST scores for Danish, Swedish and Norwegian (Touileb and Nozza, 2022), where the worst overall score reported was 0.0495, showing that the monolingual West Slavic language models produce, in the worst case, more than twice as many hurtful completions as the Scandinavian models. Future work should look into the reasons for these differences.
The manually created templates focus on the most common stereotypes, including personal attributes, likes, dislikes, work and studies. Hence, the lower scores would suggest that the hurtful completions were focused on other areas. Considering only the manually created templates, we see the lowest scores for both PolBERT and SlovakBERT when the subject refers to a non-binary person. This is an interesting result, suggesting that the language model focuses more on the word "person" than on the subject being non-binary. Additionally, for the Slovak and Czech models, the female templates have fewer hurtful completions than the male ones. We hypothesise that this result is due to violence often being associated with men, as seen in the examples of completed sentences in Table 5 in the Appendix. This trend continues when looking at the HONEST scores for translated templates. For Czert, female completions are still less hurtful than male ones, while PolBERT has higher scores for female templates, meaning that hurtful completions occur more often when speaking about women.
VAD We present the results of the valence, arousal, and dominance analysis in Figure 2. Overall, the scores are quite similar for all models: the share of completions with high valence, arousal, or dominance values (defined as word-level scores above 0.7) ranges between 0.03 and 0.043. The differences between genders are not substantial, with the largest differences on the order of 0.01. We observe that, in general, the differences are largely between the different axes of valence, arousal, and dominance rather than between genders, suggesting no bias in terms of these dimensions.
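The share of completions with high VAD values can be sketched as below, using the 0.7 word-level threshold stated above. The function name `high_vad_share`, the toy lexicon values, and the choice to count words missing from the lexicon as "not high" while keeping them in the denominator are assumptions of this sketch, not details confirmed by the paper.

```python
from typing import Dict, Iterable, Tuple

HIGH = 0.7  # word-level threshold for a "high" score, as stated in the text


def high_vad_share(
    words: Iterable[str],
    lexicon: Dict[str, Tuple[float, float, float]],
    threshold: float = HIGH,
) -> Dict[str, float]:
    """Fraction of completions whose valence, arousal, or dominance exceeds the threshold."""
    words = list(words)
    if not words:
        raise ValueError("need at least one completion")
    counts = {"valence": 0, "arousal": 0, "dominance": 0}
    for word in words:
        scores = lexicon.get(word.lower())
        if scores is None:
            continue  # assumption: out-of-lexicon words count as not high
        for dim, value in zip(counts, scores):
            if value > threshold:
                counts[dim] += 1
    return {dim: n / len(words) for dim, n in counts.items()}


# Toy (valence, arousal, dominance) values, invented for illustration only.
toy_vad = {
    "happy": (0.9, 0.6, 0.75),
    "calm": (0.8, 0.1, 0.4),
    "angry": (0.1, 0.85, 0.6),
}
shares = high_vad_share(["happy", "calm", "angry", "table"], toy_vad)
print(shares)
```

Running this per gender category and comparing the three resulting shares mirrors the comparison between genders and between axes described above.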

Conclusions
In this paper, we present the first study of gender bias in West Slavic language models, Czert, SlovakBERT, and PolBERT. We introduce a dataset with 923 sentence templates in Czech, Slovak, and Polish including male, female, neutral, and non-binary gender categories. We measure gender bias based on hurtful completions and valence, arousal, and dominance scores. We find that the Czert and SlovakBERT models are more likely to produce hurtful completions with men as subjects, i.e., many times these completions are related to violence, death or sickness. In contrast, the PolBERT model generates more hurtful completions for female subjects. An advantage of this approach to measuring gender bias is the relative ease of extending it to new languages by automatic translation. Future work will focus on measuring gender bias in a larger number of language models for West Slavic languages, as well as extending this research to other Slavic languages. Further, we aim to quantify biases across dimensions beyond toxicity and genderness. Additionally, future work will target measuring other biases, such as racial, ethnic or age bias, using this approach.

Limitations
Our analysis is strongly dependent on the quality of the employed lexica. The HurtLex lexicon used to calculate the HONEST score is an automatically translated lexicon. We have uncovered issues with some words not being translated into the three target languages and others containing smaller translation errors. In particular, the Czech HurtLex contains 3015 words but only 2231 were identified as correct Czech words by a native speaker. That is, only 74% of the lexicon entries are correct words in the target language.
The VAD lexicon is much larger, with over 19,000 words, which makes evaluation by native speakers infeasible. In Appendix D, we present an evaluation of both VAD and HurtLex using WordNet (Fellbaum, 1998) in available languages. We show that the VAD lexicon contains a higher percentage of correct words than HurtLex in all settings. Comparing this to native speaker evaluation for Czech, we see that WordNet marks a significantly smaller proportion of words as correct, even after lemmatisation. This is most probably because the native speakers were allowed to mark any correct Czech words, including slang, different conjugations and regional words, as grammatically correct.
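The kind of coverage check described here can be sketched as follows. The sketch replaces the WordNet lookup with a plain Python set and the lemmatiser with simple lowercasing, both of which are stand-ins chosen for illustration; `validated_share` and the toy word lists are hypothetical. The last lines reproduce the arithmetic behind the 74% figure for the Czech HurtLex.

```python
from typing import Callable, Iterable, Set


def validated_share(
    lexicon: Iterable[str],
    reference: Set[str],
    lemmatise: Callable[[str], str] = str.lower,
) -> float:
    """Fraction of lexicon entries whose (lemmatised) form occurs in the reference set."""
    entries = list(lexicon)
    if not entries:
        raise ValueError("lexicon is empty")
    valid = sum(1 for word in entries if lemmatise(word) in reference)
    return valid / len(entries)


# Toy data, invented for illustration: three Czech words and one non-word.
toy_reference = {"pes", "kočka", "dům"}
toy_entries = ["Pes", "kočka", "xyzzy", "dům"]
toy_share = validated_share(toy_entries, toy_reference)

# Native-speaker validation of the Czech HurtLex, using the counts from the text.
czech_hurtlex_share = 2231 / 3015
print(toy_share, round(czech_hurtlex_share, 2))
```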
Further, we rely on the Google Translate API, an automatic tool, to translate the templates introduced in Touileb and Nozza (2022), although the translations are validated manually by native speakers.

Ethics Statement
Continually engaging with systems that perpetuate stereotypes and use biased language may lead to subconsciously accepting these biases as correct (Beukeboom, 2014). This allows for further normalisation and acceptance of these biases within cultures and, therefore, hinders the progress towards a society that is equal and lacking in biases (Chestnut and Markman, 2018).
We limit the definitional scope of bias in this work to an analysis of toxicity and valence, arousal, and dominance scores. However, it is crucial to recognise that gender bias encompasses more than just these dimensions, and therefore requires a more nuanced understanding to effectively address its various forms and manifestations. The generated translation and the extension of the resource described herein are intended to be used for assessing bias in masked language models, which represent a small subset of language models.

A List of Analysed Language Models
The analysed language models for West Slavic languages are listed below in Table 3.

B Manual Templates and Categories
Table 4 shows the categories of manually created templates, an example for each category and the number of templates per category. The gender of words denoted by "*_*" is changed to provide a comparison between genders.

C Example of Sentence Completion
In Table 5, we present examples of completed sentences.

D HurtLex and VAD Evaluation
In Table 6, we evaluate the two types of lexica using WordNet (Fellbaum, 1998).

Figure 1: HONEST score per gender for each of the analysed languages and template types.

Figure 2: Percentage of completions with high valence, arousal, and dominance (VAD) values for each of the analysed languages and template types.

Table 1: Example of manually created templates in Czech with the corresponding gender.

Table 3: List of the evaluated language models.

Table 4: Overview of the categories for the manual templates.

Table 5: Examples of templates with completions for Czech [CS], Polish [PL], and Slovak [SK] based on the selected models.

Table 6: Number of words validated by WordNet for each lexicon.