Gender Inflected or Bias Inflicted: On Using Grammatical Gender Cues for Bias Evaluation in Machine Translation

Neural Machine Translation (NMT) models are state-of-the-art for machine translation. However, these models are known to have various social biases, especially gender bias. Most of the work on evaluating gender bias in NMT has focused primarily on English as the source language. For source languages other than English, most studies use gender-neutral sentences to evaluate gender bias. In practice, however, many sentences that we encounter do carry gender information, so it makes more sense to evaluate for bias using such sentences. This allows us to determine whether NMT models can identify the correct gender based on the grammatical gender cues in the source sentence rather than relying on biased correlations with, say, occupation terms. To demonstrate our point, in this work we use Hindi as the source language and construct two sets of gender-specific sentences, OTSC-Hindi and WinoMT-Hindi, which we use to automatically evaluate different Hindi-English (HI-EN) NMT systems for gender bias. Our work highlights the importance of considering the nature of the language when designing such extrinsic bias evaluation datasets.


Introduction
Various models trained to learn from data are susceptible to picking up spurious correlations in their training data, which can lead to multiple social biases. In NLP, such biases have been observed in different forms: Bolukbasi et al. (2016) found that word embeddings exhibit gender stereotypes, Zhao et al. (2017) observed that models for visual semantic role labelling amplify the gender bias present in their data, and similar biased behaviour has been observed in NLP tasks like coreference resolution (Lu et al., 2019) and Natural Language Inference (Rudinger et al., 2017).
Even state-of-the-art NMT models develop such biases (Prates et al., 2019). These models can express gender bias in different ways. One is when, due to their poor coreference resolution ability, they rely on biased associations with, say, occupation terms to disambiguate the gender of pronouns (Stanovsky et al., 2019; Saunders et al., 2020). Another is when these models translate gender-neutral sentences into gendered ones (Prates et al., 2019; Cho et al., 2019). In many cases, NMT models give a 'masculine default' translation.
This problem also exists for HI-EN Machine Translation (Ramesh et al., 2021). When put to use, such systems can cause various harms (Savoldi et al., 2021). Thus, evaluating and mitigating such biases in NMT models is critical to ensure fairness.
Prior research evaluating gender bias in machine translation has predominantly centered around English as the source language (Stanovsky et al., 2019). However, these evaluation methods and benchmarks don't seamlessly extend to other source languages, especially ones with grammatical gender. For instance, in Hindi, elements like pronouns, adjectives, and verbs are often inflected with gender. Nonetheless, prior studies on other source languages often use gender-neutral sentences (Cho et al., 2019; Ramesh et al., 2021) for bias evaluation. Yet, in practice, many sentences inherently carry gender information.
Therefore, in this work, we propose to evaluate NMT models for bias using sentences with grammatical gender cues of the source language. This allows us to ascertain whether NMT models can discern the correct gender from context or whether they depend on biased correlations. In this work, we contribute the following:
• Using Hindi as the source language in NMT, we highlight the limitations of existing bias evaluation methods that use gender-neutral sentences.
• Additionally, we propose context-based gender bias evaluation using grammatical gender markers of the source language. We construct two evaluation sets for bias evaluation of NMT models: Occupation Testset with Simple Context (OTSC-Hindi) and WinoMT-Hindi.
• Using these evaluation sets, we evaluate various black-box and open-source HI-EN NMT models for gender bias.
• We highlight the importance of creating such benchmarks for source languages with expressive gender markers.
Code and data are publicly available.

Experimental Setup
NMT Models: We test HI-EN NMT models that are widely popular and represent the state of the art in commercial and academic research: (1) IndicTrans (Ramesh et al., 2022), (2) Google Translate, (3) Microsoft Translator, and (4) AWS Translate.
Hindi inflects pronouns, adjectives, and verbs with grammatical gender (Agnihotri, 2007). However, the variety of gender markers can differ across languages. It is therefore essential to study the gender-related rules of the specific language when creating benchmarks for such tasks.
For translation into English, TGBI uses the fraction of sentences in a sentence set S translated as "masculine", "feminine" or "neutral" in the target, i.e., p_m, p_f and p_n respectively, to calculate P_S as:

P_S = √(p_m · p_f) + p_n, where p_m + p_f + p_n = 1.

P_i is calculated for each sentence set S_i (S_1 to S_n) to finally calculate TGBI = avg(P_i). Using lists from Ramesh et al. (2021), we evaluate four HI-EN NMT models using the TGBI score to create a comparison for our evaluation methods.
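The computation above can be sketched in a few lines. This is a minimal illustration assuming each translated sentence has already been labelled "male", "female" or "neutral"; the function names are ours, not from the original implementation.

```python
import math

def p_score(labels):
    """P_S for one sentence set: sqrt(p_m * p_f) + p_n,
    where p_m + p_f + p_n = 1 (following Cho et al., 2019)."""
    n = len(labels)
    p_m = labels.count("male") / n
    p_f = labels.count("female") / n
    p_n = labels.count("neutral") / n
    return math.sqrt(p_m * p_f) + p_n

def tgbi(sentence_sets):
    """TGBI = average of P_S over all sentence sets S_1..S_n."""
    return sum(p_score(s) for s in sentence_sets) / len(sentence_sets)
```

Under this definition, a system that keeps every gender-neutral input neutral scores P_S = 1, while one that splits outputs evenly between the two genders scores P_S = 0.5.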
Often, using a metric like TGBI is not very practical. For example, when the original intent is not gender-neutral but constraints of the source language make the sentence gender-neutral, then showing all versions, or random guessing with a 50% chance of choosing either gender in translation, is more practical. Also, gender-specific sentences are more common, and making errors on such sentences makes for a more unfair system. Hence, we propose to expose gender bias by evaluating NMT models on such source-language sentences.

Approach
We construct two sets of sentences, one with a simple gender-specified context and another with a more complex context. In creating these sets, we focus on the gender markers of the source language, i.e., Hindi. We also use template sentences, which allow us to automatically evaluate bias without using additional tools on the target side.

OTSC-Hindi
Figure 1: Sentence template for OTSC-Hindi. The possessive pronoun ("mera"/"meri") and the verb ("karta"/"karti") specify the friend's gender; the object pronoun references the speaker's friend.

Escudé Font and Costa-jussà (2019) created a test set with custom template sentences to evaluate gender bias for English-to-Spanish translation. Inspired by this template, we create a Hindi version with grammatical gender cues, glossed in English as: "I have known [him/her] for a long time, my friend works as a [occupation]." Figure 1 explains the template and its gender-related information. Note that, unlike the English version, this template specifies the gender of the speaker (first person) using a gender-inflected verb: "jaanta" for a male speaker and "jaanti" for a female speaker. The possessive pronoun is also gender-inflected based on the gender of the speaker's friend. In Hindi, the possessive pronoun agrees with the word following it, so "mera" is used for a male friend while "meri" is used for a female friend. Correspondingly, the verb "karta" is used for a male friend and "karti" for a female friend. This template therefore has four possibilities based on the gender of the speaker and the gender of the speaker's friend. Using 1071 occupations, we construct these four sets with 1071 sentences each and check the percentage of sentences where the speaker's friend is translated as male or female. This is because the English translation only specifies the gender of the friend, while the gender of the speaker is lost in translation.
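Filling the template can be sketched as below. This is a rough illustration using a romanized approximation of the Hindi template (the exact Devanagari wording follows the paper's Figure 1, not this sketch), with a placeholder occupation list.

```python
# Romanized approximation of the OTSC-Hindi template; the inflected
# slots are the ones described above (jaanta/jaanti, mera/meri,
# karta/karti). The object pronoun "use" (him/her) is gender-neutral.
TEMPLATE = ("main use kaafi samay se {v1} hoon, "
            "{poss} dost {occ} ke roop mein kaam {v2} hai.")

FORMS = {  # (speaker gender, friend gender) -> inflected forms
    ("male", "male"):     dict(v1="jaanta", poss="mera", v2="karta"),
    ("male", "female"):   dict(v1="jaanta", poss="meri", v2="karti"),
    ("female", "male"):   dict(v1="jaanti", poss="mera", v2="karta"),
    ("female", "female"): dict(v1="jaanti", poss="meri", v2="karti"),
}

def build_sets(occupations):
    """Return the four OTSC sets, keyed by (speaker, friend) gender."""
    return {key: [TEMPLATE.format(occ=occ, **forms) for occ in occupations]
            for key, forms in FORMS.items()}
```

With the paper's 1071 occupations, each of the four keys maps to 1071 sentences.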

WinoMT-Hindi
In the real world, NMT models deal with more complex sentences: long sentences with further context and multiple entities. The WinoMT benchmark (Stanovsky et al., 2019) consists of such sentences.

Figure 2: Sentence template for WinoMT-Hindi. When Entity 1 is referenced, we use a gender-inflected verb to specify its gender. When Entity 2 is referenced, its gender is specified using a gender-inflected relational postposition or an adjective. The phrase after the conjunction (containing the pronoun that refers to either entity) is gender-neutral.
Figure 3: Sample WinoMT-Hindi sentences (English glosses):
The lawyer shouts at the secretary as he got angry.
The lawyer yelled at the secretary there because she had done a bad job.
The lawyer shouts at the secretary as she got angry.
The lawyer yelled at the secretary there because he had done a bad job.

However, since WinoMT is in English, using it to evaluate bias for other source languages is not possible. Therefore, we contextualize this test set for the evaluation of bias in HI-EN translation by manually creating "WinoMT-Hindi", which consists of 704 WinoBias-like sentences in Hindi, modified to include gender cues of the language, mainly gender-inflected adjectives, postpositions, and verbs.
The construction of "WinoMT-Hindi" is explained in Figure 2, and sample sentences are shown in Figure 3. In WinoBias, only the English pronoun carries the gender of the referenced entity; here, to provide the gender of the referenced entity, we use gender-inflected verbs for Entity 1 and postpositions or adjectives for Entity 2. The phrase after the conjunction is gender-neutral, challenging the model to look for a more extended context. We only specify the gender of the referenced entity, to avoid confusing the model with too much information.
We don't need reference translations in English, as automatic evaluation is possible. Due to the nature of our source sentences, we can mark the gender of the target by simply checking for the presence of male pronouns (he, him or his) or female pronouns (she or her) in the translation. Interestingly, we also observe that a few sentences are translated into a gender-neutral form. For example, the sentence glossed as "Secretary asks mover what he should do to help" is translated as "The secretary asks the mover what to do to help" by Google Translate. While there is increased interest in promoting gender-neutral translation for inclusivity (Piergentili et al., 2023), others call for gender preservation in translation (Cabrera and Niehues, 2023). The presence of neutral output sentences can be modelled as false negatives or true positives depending upon the goals of the evaluation. For this study, we model their presence as false negatives for the male and female classes, i.e., equivalent to misgendering. Nonetheless, due to the limited fraction of such sentences, the metrics largely reflect bias due to misgendering.
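The pronoun check described above can be sketched as follows; this is a minimal illustration, with the function name ours, using the same pronoun lists as in the text.

```python
import re

# Word-boundary matches for the pronoun lists given in the text.
MALE = re.compile(r"\b(he|him|his)\b", re.IGNORECASE)
FEMALE = re.compile(r"\b(she|her)\b", re.IGNORECASE)

def predicted_gender(translation):
    """Mark a translation's gender by pronoun presence; a sentence
    with no unambiguous gendered pronoun counts as 'neutral'."""
    m = bool(MALE.search(translation))
    f = bool(FEMALE.search(translation))
    if m and not f:
        return "male"
    if f and not m:
        return "female"
    return "neutral"
```

The "neutral" label covers both genuinely gender-neutral outputs and ambiguous ones; as noted above, we then score it as a false negative for the male and female classes.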
For gender bias evaluation, we use the metrics Acc, ∆G and ∆S given by Stanovsky et al. (2019). To measure the difference in F1 score between the male and female classes, i.e., ∆G, we use class-wise F1 scores. We divided our sentences into pro-stereotypical and anti-stereotypical sets using translated and transliterated versions of the occupation list by Zhao et al. (2018). This was done manually to ensure the gender-neutrality of these occupation terms (and to avoid their gender-inflected versions) in Hindi. To measure the difference in overall performance between the pro-stereotypical and anti-stereotypical groups, i.e., ∆S, we use a macro-F1 score obtained by averaging F1 over the male and female classes only. We also report the percentage of sentences translated as gender-neutral, i.e., N, for each NMT system.
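The three metrics can be sketched as below, assuming gold and predicted gender labels per sentence plus index lists for the pro- and anti-stereotypical subsets; the helper names are ours, not from the cited evaluation code.

```python
def f1(gold, pred, cls):
    """Class-wise F1 for one gender class."""
    tp = sum(g == p == cls for g, p in zip(gold, pred))
    fp = sum(p == cls and g != cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def metrics(gold, pred, pro_idx, anti_idx):
    """Acc; Delta_G = male F1 - female F1; Delta_S = macro-F1 on the
    pro-stereotypical subset minus that on the anti-stereotypical one
    (macro-F1 averages the male and female F1 only)."""
    acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    d_g = f1(gold, pred, "male") - f1(gold, pred, "female")
    def macro(idx):
        g = [gold[i] for i in idx]
        p = [pred[i] for i in idx]
        return (f1(g, p, "male") + f1(g, p, "female")) / 2
    d_s = macro(pro_idx) - macro(anti_idx)
    return acc, d_g, d_s
```

A system that always outputs the masculine default gets a large positive ∆G but, as discussed later, a ∆S near zero, since it performs equally on both subsets.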

TGBI Evaluation
The results are shown in Table 1. For most translation systems, sentences in the "Negative (S5)", "Polite (S6)" and "Positive (S7)" sets have higher P values. With the highest TGBI score, IndicTrans performs best at translating gender-neutral Hindi sentences into English with minimal gender bias.
The problem with the TGBI metric is that it may not accurately capture the true fairness of an NMT system, since evaluation is done only on gender-neutral sentences.

Evaluation using OTSC-Hindi
The results are shown in Table 2. Based on these results, the IndicTrans system shows heavy bias against the feminine gender. Even though it has the highest TGBI score, IndicTrans fails to use the given context to disambiguate the gender of occupation terms and gives the "male default" for most translations. Similarly, the Microsoft and AWS Translate systems also show bias against women by translating most sentences into their "male default" versions. Of all the NMT models, Google Translate performs best at disambiguating gender from the given context. This shows that such a set of sentences, together with extrinsic metrics that take the gendered nature of the source sentence into account, is better at exposing the gender bias of an NMT system otherwise hidden by a metric such as TGBI.

Evaluation using WinoMT-Hindi
The results are shown in Table 3. We also observe that ∆S values are very low for all NMT systems. There are two potential reasons. First, these HI-EN NMT systems strongly prefer masculine outputs irrespective of occupation stereotypes; hence they give the "masculine default" in most cases, leading to similar performance on pro-stereotypical and anti-stereotypical sentences. Second, the occupation stereotypes may be poorly contextualised. We rely on the stereotype labels provided by the original English occupation lists of Zhao et al. (2018) to divide the occupations into pro-stereotypical and anti-stereotypical sets. However, these lists were based on data from the US Department of Labor and might not contextualise well for Hindi. Culturally relevant occupation statistics are required to create such stereotype labels for Hindi, and these were difficult to obtain in our case.
However, WinoMT-Hindi provides a way to generalise and motivate the creation of such evaluation benchmarks for other languages.

Related Work
Many works have focused on evaluating gender translation accuracy by creating various benchmarks. The WinoMT benchmark by Stanovsky et al. (2019) is widely used for gender bias evaluation. It contains sentences from the WinoBias (Zhao et al., 2018) and Winogender (Rudinger et al., 2018) coreference test sets in English. Without requiring reference translations, it devises an automatic evaluation method for eight diverse target languages.
Bias evaluation of NMT models on source languages other than English has mainly focused on the translation of gender-neutral sentences. Cho et al. (2019) proposed the TGBI measure to evaluate gender bias in the translation of gender-neutral Korean sentences to English. Ramesh et al. (2021) used the TGBI measure for Hindi-English machine translation. Our work emphasises the creation of gender-unambiguous evaluation benchmarks for source languages other than English, accounting for gender inflections in the language to test a model's ability to find these gender-related cues.

Conclusion and Future Work
To conclude our study, we highlighted the need for contextualising NMT bias evaluation for non-English source languages, especially languages that capture gender-related information in different forms. We demonstrated this using Hindi as the source language by creating evaluation benchmarks for HI-EN Machine Translation and comparing various state-of-the-art translation systems. In the future, we plan to extend our evaluation to more languages and to use natural sentences that do not follow a particular template. We also look forward to developing evaluation methods that are more inclusive of all gender identities.

Table 2: Evaluation of IndicTrans (IT), Google Translate (GT), Microsoft Translator (MS) and AWS Translate (AWS) using the OTSC-Hindi test set. Here p_m and p_w are the percentages of sentences where the speaker's friend is translated as male and female, respectively. * marks the percentage of sentences translated into the true label for each sentence set. Bold values indicate the maximum percentage of sentences translated into a single gender class.

Table 3: Comparison of the performance of various NMT models on WinoMT-Hindi using the Acc, ∆G, ∆S and N measures (all in %). ⋆ indicates the significantly highest value, ⋄ the significantly lowest value, and • near-equal values, for Acc, ∆G and ∆S respectively.

Since Acc (accuracy) should be high while ∆G and ∆S should be low, Google Translate outperforms the other models as the least gender-biased model. IndicTrans and AWS Translate are heavily biased toward a particular gender: these models have lower Acc values (almost equal to the 50% probability of a random guess) and higher ∆G values, indicating that the F1 score for the male class is much larger than that for the female class.