Examining Covert Gender Bias: A Case Study in Turkish and English Machine Translation Models

As Machine Translation (MT) has become increasingly powerful, accessible, and widespread, the potential for the perpetuation of bias has grown alongside its advances. While overt indicators of bias have been studied in machine translation, we argue that covert biases expose a problem that is further entrenched. Through the use of the gender-neutral language Turkish and the gendered language English, we examine cases of both overt and covert gender bias in MT models. Specifically, we introduce a method to investigate asymmetrical gender markings. We also assess bias in the attribution of personhood and examine occupational and personality stereotypes through overt bias indicators in MT models. Our work explores a deeper layer of bias in MT models and demonstrates the continued need for language-specific, interdisciplinary methodology in MT model development.


Introduction
Various forms of bias are encoded in the way that people use language (Rudinger et al., 2018; Judith, 1990). As in other Natural Language Processing (NLP) tasks, the learned models used in MT systems absorb social biases because they learn correlations from training data in which stereotypes are encoded. Specifically, several studies (Prates et al., 2020; Cho et al., 2019; Baeza-Yates, 2019) have shown that translations from a gender-neutral language to a language with gendered pronouns are biased in the selection of pronouns in the target language.
However, this is not the only way bias can manifest in MT. For example, Figure 1 demonstrates marked gender in the female case of the same sentence while remaining neutral in the male case. Since both translations are accurate, the asymmetry in gender reference is not immediately obvious unless the two sentences are presented together. The example demonstrates the use of optional referential gender in Turkish, highlighting the need to frame gender bias in MT around language-specific social and cultural knowledge.

* Equal contribution.

Figure 1: Using Google Translate, "My sister is a soccer player" accurately translates to "My female sibling is a soccer player", while "My brother is a soccer player" is translated to "My sibling is a soccer player". Gender is overtly marked only when the subject is female.
While previous mitigation efforts have focused on debiasing training data (Elaraby et al., 2018;Costa-jussà and de Jorge, 2020;Stafanovičs et al., 2020;Saunders and Byrne, 2020), the issue of covert bias has not been adequately addressed, and goes far beyond the perpetuation of outdated stereotypes. In order to ensure that the true meaning of the source is accurately represented during the translation process, understanding the linguistic and social context of the utterance is necessary.
In this paper, we examine both overt and covert gender biases in commercially-used MT models through the use of a gender-neutral language, Turkish, and a gendered language, English. Our study investigates explicit stereotype bias through the assignment of pronouns according to stereotypes regarding occupation and personality. We also investigate how additional qualifiers to job descriptions affect results: for example, are "good doctors" more likely to be men than "bad" ones? Similarly, we measure how a reference to personhood changes pronoun results. Lastly, we shed light on the presence of asymmetrical gender in MT models by analyzing explicit gender markings in Turkish translations of gender-specific English sentences. We not only ask if gender markings occur more for female subjects, but also if gender markings are more likely when the stereotype of the predicate does not align with the gender of the subject.
To this end, we created a parallel corpus of 1,617 Turkish and English job titles. We also compiled a list of descriptive adjectives based on Turkish stereotypes and formed appropriate Turkish sentences with and without a reference to personhood. Lastly, we formed a dataset of English sentences by pairing a gendered English subject word (that has no gendered translation in Turkish) with a gender-stereotyped action or description. Our code and data can be found in our GitHub repository. 1

Related Work
Previous works on bias in embeddings and models (Bolukbasi et al., 2016;Zhao et al., 2019;Stanovsky et al., 2019), as well as corpora (Babaeianjelodar et al., 2020), have demonstrated that gender bias exists in the core of MT models. Additionally, Stanovsky et al. (2019) introduced a challenge set in measuring bias from English to languages with morphological gender.
One common approach in bias evaluation is to translate from a gender-neutral language to a gendered language and examine the pronouns selected for occupations and adjectives (Prates et al., 2020; Farkas and Németh, 2020; Cho et al., 2019). We used a modified version of these methods by ensuring that the occupation exists in the target language as well as the source language and that the adjectives used are actual stereotypes in Turkey (Sakallı et al., 2018) 2 . Our remaining experiments are inspired by sociolinguistic research in Turkish. First, Braun (2001) discusses how neutral Turkish words describing people, such as insan ("human"), tend to be biased towards male interpretations. In NLP, Mehrabi et al. (2019) examine a related bias in English named-entity recognition, where fewer female names are recognized as "person" entities than male ones. Our work will similarly examine gender and personhood bias, but in MT models. Second, Braun (2001) describes asymmetrical gender markings in the Turkish language, concluding that male gender remains unmarked regardless of context, whereas female gender tends to be overly expressed. For example, female children are more likely to be referred to with marked gender (kız çocuğu "girl child" instead of çocuk "child") than male children. The exception to this pattern is when the subject is exceptionally stereotyped as feminine (e.g. hizmetçi "househelper"). We will extend the study of this phenomenon to MT.

Experiments
We used four commercially available MT models in our experiments: Google Translate, Amazon Translate, Microsoft Translator, and SYSTRAN. For reproducibility purposes, all translations were executed in April of 2021. All datasets can be found on our GitHub 1 .

He is a Doctor, She is a Nurse? Gender Bias in Job Occupation
We examined the distributions of the pronouns selected in English when Turkish sentences following the template 3 "He/She is a(n) <occupation>" were translated, and compared them to the 2020 Turkish (Türkiye İstatistik Kurumu, 2020) and US (U.S. Bureau of Labor, 2020) workforce statistics. Inspired by Farkas and Németh (2020), a second template, "He/She is a <adjective> <occupation>", was also formed using the words çok kötü ("very bad"), kötü ("bad"), iyi ("good"), and çok iyi ("very good") as attributive adjectives to determine their influence. We retrieved occupation lists from Turkish and US government agencies 4 and matched occupations that exist in both countries 5 . Some occupation titles were modified for clarity, and some were removed due to gender requirements or a lack of census data, as described in detail in Appendix A. This process yielded 1,617 matched occupations.
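The probe construction above can be sketched as a simple template expansion. The occupation titles and the exact Turkish wording below are illustrative assumptions, not the paper's data: the full 1,617-title list is in the paper's repository, and "O bir <occupation>" is one plausible Turkish rendering of the gender-neutral source sentence, since the Turkish pronoun O carries no gender.

```python
# A minimal sketch of the probe-sentence generation, assuming the
# Turkish source template "O bir <occupation>". The occupation list
# here is a two-item placeholder, not the paper's 1,617-title corpus.
ADJECTIVES = ["çok iyi", "iyi", "kötü", "çok kötü"]  # "very good" .. "very bad"

def occupation_probes(occupations):
    """Yield (occupation, adjective, sentence) tuples.

    The pronoun "O" is gender-neutral in Turkish, so any he/she
    appearing in the English output is supplied by the MT model.
    """
    for occ in occupations:
        yield occ, None, f"O bir {occ}"          # plain template
        for adj in ADJECTIVES:                    # adjective-modified template
            yield occ, adj, f"O {adj} bir {occ}"

probes = list(occupation_probes(["doktor", "hemşire"]))
# 2 occupations x (1 plain + 4 adjective) templates = 10 probes
```

Each probe sentence would then be sent to the four MT systems and the English output inspected for the chosen pronoun.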

He is Smart, She is Beautiful? Bias in Adjective Use
We pulled stereotypes from a study in which Turkish undergraduate students were asked to provide adjectives that describe men and women (Sakallı et al., 2018). We compiled the list of adjectives presented by this work and removed any that were lexically gendered, leaving 97 adjectives in total. Each adjective was then labeled as masculine-coded (e.g. agresif "aggressive") or feminine-coded (e.g. güçsüz "weak") if it was used to describe that gender more than 60% of the time; all others were considered neutral. The adjectives were first placed into the template "O <adjective>" ("He/She is <adjective>") 6 to assess the adjective stereotypes, and then into the template "O <adjective> birisidir" ("He/She is someone who is <adjective>") 7 in order to assess whether the introduction of the "personhood factor" changed the assumed gender in the translations. 6 Since Turkish is an agglutinative language, the proper suffixes were also appended to each adjective in order to fit the first template. 7 Note that although the translation may seem unnatural in English, this is a common utterance in Turkish.
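The 60% coding rule can be written down directly. The mention counts in the usage example are invented for illustration; the actual counts come from the Sakallı et al. (2018) survey data.

```python
# A sketch of the adjective-coding rule: an adjective is
# masculine- or feminine-coded if strictly more than 60% of its
# survey mentions described that gender, and neutral otherwise.
def code_adjective(male_mentions, female_mentions):
    total = male_mentions + female_mentions
    if total == 0:
        return "neutral"
    if male_mentions / total > 0.6:
        return "masculine"
    if female_mentions / total > 0.6:
        return "feminine"
    return "neutral"

# Invented example counts, not survey data:
code_adjective(9, 1)   # "masculine"
code_adjective(2, 8)   # "feminine"
code_adjective(3, 2)   # "neutral" (60% does not exceed the threshold)
```

Note that a strict inequality means an exact 60/40 split still counts as neutral, which matches the "more than 60%" phrasing above.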

Bias Through Asymmetrical Gender Markings
English sentences were formed with grammatically gendered subjects, followed by a predicate including a stereotypical occupation, description, or activity. For example, "My sister is an engineer" contains a female subject and a stereotypically masculine predicate. These sentences were then translated to Turkish to measure if the subject was gender-marked. We aim to answer several questions. First, are sentences with male subjects less likely to mark gender than sentences with female subjects? Second, is gender more likely to be marked when the stereotype of the predicate does not align with the gender of the subject?
We selected four subject words that are gendered in English but are grammatically neutral in Turkish. For example, there are no commonly used words for "brother" and "sister"; the only options are "sibling" (kardeş), "male sibling" (erkek kardeş), or "female sibling" (kız kardeş). For each of the predicate categories (occupation, description, and activity), we selected five that were stereotypically masculine and five stereotypically feminine according to Turkish gender stereotypes (Sakallı et al., 2018; Vatandaş, 2011). By checking the translations for overt gender markings, the MT systems can be evaluated for asymmetry. We compared the results across the gender of each original English subject word as well as the stereotypical gender of each predicate. With 10 sentence templates in each category for the four gendered subject words, we constructed 120 sentences for each gender in total.
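Checking a Turkish translation for overt gender marking amounts to scanning it for an explicit gender word. The marker sets below are a simplified assumption for illustration (e.g. kız/kadın for female, erkek for male), not the paper's full detection rule.

```python
# A minimal sketch of the overt-marking check on a Turkish MT output,
# assuming a small hand-picked set of explicit gender words.
FEMALE_MARKERS = {"kız", "kadın"}
MALE_MARKERS = {"erkek"}

def overt_gender(translation):
    """Return "female", "male", or "unmarked" for a Turkish sentence."""
    tokens = set(translation.lower().split())
    if tokens & FEMALE_MARKERS:
        return "female"
    if tokens & MALE_MARKERS:
        return "male"
    return "unmarked"

overt_gender("Kız kardeşim bir mühendis")   # "female": gender overtly marked
overt_gender("Kardeşim bir mühendis")       # "unmarked": neutral kardeş used
```

Comparing the rate of "unmarked" outputs between originally male and originally female English subjects then quantifies the asymmetry.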

Results
In this section, we evaluate aggregate results across all experiments 8 .

Gender Bias in Occupations
Overall, the percentages of female pronouns selected by the MT models were: 1.11% with Google, 1.18% with Amazon, 3.83% with Microsoft, and 5.07% with SYSTRAN. Figure 2 demonstrates that these rates are drastically low compared to female participation in the 2020 workforce in Turkey (31.78%) and the US (47%).
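Aggregate shares like these come from tallying which pronoun each English translation uses. A minimal sketch, assuming sentence-initial "He"/"She" as in the template outputs (the sentences below are invented examples, not model output):

```python
import re

def pronoun_of(sentence):
    """Return "he" or "she" if the English sentence starts with one, else None."""
    m = re.match(r"\s*(he|she)\b", sentence, re.IGNORECASE)
    return m.group(1).lower() if m else None

def female_share(sentences):
    """Fraction of pronoun-bearing translations that use "she"."""
    counts = {"he": 0, "she": 0}
    for s in sentences:
        p = pronoun_of(s)
        if p:
            counts[p] += 1
    total = counts["he"] + counts["she"]
    return counts["she"] / total if total else 0.0

female_share(["He is a doctor.", "She is a nurse.", "He is an engineer."])
# one "she" out of three pronoun-bearing sentences
```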
The SOC 2018 group breakdown reveals that, for occupation groups where female participation is either approximately equivalent to or greater than male participation, the models tended to translate the occasional occupation with a female pronoun. Occupations where women are the minority tended to receive no or almost no female translations. Additionally, stereotypical occupations like nurses, fashion designers, and beauticians 9 were consistently translated with female pronouns. Overall, assuming the translation results in each job category should match the corresponding labor statistic, our results were statistically significant (p < 0.01).

Impact of an Attributive Adjective Preceding the Occupation
As shown in Table 1, when an adjective was introduced, sentences originally assigned a female pronoun were more likely to be assigned a male pronoun instead. For each attributive adjective, this shift was statistically significant (p < 0.01). Furthermore, as the adjective changed from çok iyi ("very good") to çok kötü ("very bad"), the number of female pronouns that changed to male increased, while the reverse occurred for male pronouns. For example, using Google, Amazon, and SYSTRAN, the Turkish sentence "O bir Yoğun Bakım Hemşiresi" yielded the translation "She is an intensive care unit nurse", but the sentence "O çok kötü bir Yoğun Bakım Hemşiresi" yielded "He is a very bad intensive care unit nurse".

8 One-sided t-tests were performed with equal variance and p < 0.01 unless specified otherwise.
9 A full list of occupations assigned female pronouns can be found in the appendix.

Turkish Gender Stereotypes in Person Descriptors
For the first sentence template ("He/She is <adjective>"), the first notable result is that only 6.74% of the assigned pronouns were female (SYSTRAN: 24.5%, Google: 2%, Microsoft: 3.1%, Amazon: 2%), which indicates a strong overall bias towards male pronouns. Secondly, sentences that were translated with a female pronoun were much more likely to have contained a female-coded adjective. This was highly significant (p < .01) in comparison to the number of female pronouns generated by sentences with male-coded adjectives, and significant (p < .05) in comparison to neutral ones. The reverse did not hold for male pronouns: while 83.34% of all sentences assigned a female pronoun contained a female-coded adjective, only 46.70% of translations with male pronouns contained a male-coded adjective.

Analyzing Gendered Personhood
Following from the previous section, we analyze if adding a personhood modifier to the adjective sentences affects pronoun use. Of the sentences that were assigned female gender in the first template, 74.07% changed to male pronouns in the second template when personhood was introduced. The opposite is not the case; only 2.76% of adjectives with male pronouns in Template 1 were female in Template 2. Overall, each translator was significantly more likely to assign a male pronoun when the original sentence contained a personhood modifier (p < 0.01).
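Change rates like the 74.07% and 2.76% above can be computed from paired pronoun outcomes across the two templates. The pairs in the usage example are invented for illustration, not experimental data:

```python
# A sketch of the pronoun-flip tally between two templates, given
# (template1_pronoun, template2_pronoun) pairs per adjective.
def flip_rate(pairs, source, target):
    """Among pairs starting as `source`, the fraction that became `target`."""
    src = [(a, b) for a, b in pairs if a == source]
    if not src:
        return 0.0
    return sum(1 for a, b in src if b == target) / len(src)

# Invented outcome pairs for illustration:
pairs = [("she", "he"), ("she", "she"), ("he", "he"), ("he", "he")]
flip_rate(pairs, "she", "he")  # half the "she" cases flipped to "he"
flip_rate(pairs, "he", "she")  # no "he" case flipped to "she"
```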

Asymmetrical Gender Analysis
As shown in Figure 3, for male subject words, 47.7% of the translations did not mark gender and used the neutral form. However, only 25% of the female subject words used the neutral case. This was due to one word, yeğen ("niece/nephew"), which remained neutral in 100% of the sentences for both male and female subject words. We theorize that this derives from spoken Turkish, as yeğen is not frequently gender-marked.

Figure 4 demonstrates that when the predicate was stereotypically masculine and the subject word was male, the MT models assumed that the gender of the subject did not need to be overtly expressed, and gender was not preserved 52.1% of the time. For example, "The young men are soccer players" (masculine predicate) did not preserve gender in the translation, while "The young men are secretaries" (feminine predicate) did. However, gender was overtly expressed in 56.6% of translations when a stereotypically feminine predicate was paired with a male subject. Female subject words did not follow this pattern; in fact, for all subject words other than niece/nephew, gender was overtly marked in 75% of the translations. In summary, although male gender was only marked when the content of the sentence deviated from the masculine social norm, female gender was marked in the overwhelming majority of cases and was consistently treated as aberrational regardless of context.

Figure 4: The percentage of translations that used the neutral case and did not preserve gender, across male- and female-stereotyped predicates, as well as masculine or feminine subjects. For male subject words, gender is significantly more likely to be overtly expressed if the predicate is stereotypically feminine (p < 0.05).

Conclusion
We have examined gender bias exhibited by commercially used MT models in the case of Turkish and English translations. We have shown evidence of overt gender bias through occupation and adjective stereotypes, and covert gender bias through asymmetrical gender and personhood bias. Furthermore, our experiments show consistent evidence of male bias in a neutral context. Male gender was assumed in reference to gender-equal occupations and stereotype-neutral adjectives, and the same phenomenon extends to the manifestation of overt gender markings where male subjects were more likely to be assigned the neutral case. However, when the context was not neutral, stereotype bias routinely affected results across all experiments.
Previous bias mitigation discussions have focused on fair pronoun assignments (Prates et al., 2020; Cho et al., 2019; Baeza-Yates, 2019). Additionally, Google Translate has recently implemented a gender-specific translation feature (Kuczmarski, 2018; Johnson, 2020). While pronoun assignment is a salient and ongoing concern, our study demonstrates how the problem of gender bias can be far more complex. Our experiments show that domain and cultural knowledge are required and that these techniques are not necessarily transferable across languages. We advocate for the inclusion of language-specific differences and the design of mitigation models that are linguistically aware and socially grounded. We hope that our work will bring more attention to such interdisciplinary work, prompt continued research into how gender bias is expressed in NLP, and assist with mitigation efforts.
Lastly, we list all occupation group names and their abbreviations in Tables 2 and 3. The matching Turkish occupation titles can be found in the GitHub 1 .

Table 5: Turkish sentence templates. In the third template, the adjective was one of: "çok iyi" (very good), "iyi" (good), "kötü" (bad), or "çok kötü" (very bad).