Evaluating Gender Bias in Hindi-English Machine Translation

With language models being increasingly deployed in the real world, it is essential to address the fairness of their outputs. The word embedding representations of these language models often implicitly draw unwanted associations that form a social bias within the model. The nature of gendered languages like Hindi poses an additional problem for the quantification and mitigation of bias, since the forms of words in a sentence change with the gender of the subject. Additionally, there is sparse work on measuring and debiasing systems for Indic languages. In our work, we attempt to evaluate and quantify the gender bias within a Hindi-English machine translation system. We implement a modified version of the existing TGBI metric based on the grammatical considerations for Hindi. We also compare and contrast the resulting bias measurements across multiple metrics for pre-trained embeddings and the ones learned by our machine translation model.


Introduction
There has been a recent increase in studies on gender bias in natural language processing, considering bias in word embeddings, bias amplification, and methods to evaluate bias (Savoldi et al., 2021), with some evaluation methods introduced primarily to measure gender bias in MT systems. In MT systems, bias can be identified when gender-neutral sentences are translated into gendered ones. There has been little work done on bias in language models for Hindi, and to the best of our knowledge, there has been no previous work that measures and analyses bias for MT of Hindi. Our approach uses two existing and broad frameworks for assessing bias in MT, the Word Embedding Fairness Evaluation (Badilla et al., 2020) and the Translation Gender Bias Index (Cho et al., 2019), on Hindi-English MT systems. We modify some of the existing procedures within these metrics as required for compatibility with Hindi grammar. This paper contains the following contributions: 1. Construction of an equity evaluation corpus (EEC) (Kiritchenko and Mohammad, 2018) for Hindi of size 26,370 utterances using 1558 sentiment words and 1100 occupations, following the guidelines laid out in Cho et al. (2019).
2. Evaluation of gender bias in MT systems for Indic languages.
3. An emphasis on a shift towards inclusive models and metrics. The paper is also demonstrative of language that should be used in NLP papers working on gender bias.
All our code and files are publicly available. 1

Related Work
The prevalence of social bias within a language model is caused by the model inadvertently drawing unwanted associations from the data. Previous works addressing bias include Bolukbasi et al. (2016), which used multiple gender-definition pairs and principal component analysis to infer the direction of the bias. To mitigate the bias, each word vector had its projection on this subspace subtracted from it. However, this does not entirely debias the word vectors, as noted in subsequent work.
There have been various attempts to measure the bias in existing language models. Huang et al. (2020) measure bias based on whether the sentiment of the generated text would alter if there were a change in entities such as the occupation, gender, etc. Kurita et al. (2019) performed experiments on evaluating the bias in BERT using the Word Embedding Association Test (WEAT) as a baseline for their own metric, which involved calculating the mean of the log probability bias score for each attribute.
Concerning the measurement of bias in existing MT systems, Stanovsky et al. (2019) came up with a method to evaluate gender bias for 8 target languages automatically. Their experiments aligned translated text with the source text and then mapped the English entity (source) to the corresponding target translation, from which the gender is extracted.
Most of the focus in mitigating bias has been on English, which is not a gendered language. Languages like Hindi and Spanish contain grammatical gender, where the gender of verbs, articles, and adjectives must remain consistent with the gender of the noun. In Zhou et al. (2019), a modified version of WEAT was used to measure bias in Spanish and French based on whether the noun was inanimate or animate, with the latter containing words like 'doctor' that have separate 'male' and 'female' variants. Subsequent work addressed the problem with such inanimate nouns as well, attempting to neutralize the grammatical gender signal of these words during training by lemmatizing the context words and changing the gender of these words.
While there has been much work on quantifying and mitigating bias in many languages in NLP, the same cannot be said for Hindi and other Indic languages, possibly because they are low-resource. Pujari et al. (2019) was the first work in this area; they use geometric debiasing, where a bias subspace is first defined and the word is decomposed into two components, of which the gendered component is reduced. Finally, SVMs were used to classify the words and quantify the bias.

Dataset and Data Preprocessing
The trained model that we borrowed from Gangar et al. (2021) was trained on the IIT-Bombay Hindi-English parallel corpus (Kunchukuttan et al., 2018), which contains approximately 1.5 million examples across multiple topics. Gangar et al. (2021) used back-translation to improve the performance of the existing model: they trained an English-Hindi model on the IIT-Bombay corpus and subsequently used it to translate 3 million records from the WMT-14 English monolingual dataset to augment the existing parallel training data. The model was trained on this back-translated data, which was split into 4 batches.
The dataset cleaning involved removing special characters, punctuation, and other noise, and the text was subsequently converted to lowercase. Any duplicate records within the corpus were also removed, word-level tokenization was implemented, and the most frequent 50,000 tokens were retained. In the subword level tokenization, where byte-pair encoding was implemented, 50,000 subword tokens were created and added to this vocabulary.
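The cleaning steps above can be sketched as follows. This is a simplified illustration, not the exact pipeline of Gangar et al. (2021); word-level tokenization and BPE vocabulary construction are omitted.

```python
import re

def preprocess(lines):
    """Clean corpus lines: strip punctuation and other noise, lowercase,
    collapse whitespace, and drop duplicate records (a simplified sketch)."""
    seen, cleaned = set(), []
    for line in lines:
        line = re.sub(r"[^\w\s]", " ", line)          # remove special characters/punctuation
        line = re.sub(r"\s+", " ", line).strip().lower()
        if line and line not in seen:                 # remove duplicate records
            seen.add(line)
            cleaned.append(line)
    return cleaned
```

Note that Python's `\w` matches Unicode word characters by default, so Devanagari text passes through the punctuation filter unharmed.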

NMT Model Architecture
For our experiments in building the neural machine translation model, we used the OpenNMT-tf (Klein et al., 2020) library, with the model's configuration borrowed from Gangar et al. (2021). The model uses the Transformer architecture (Vaswani et al., 2017), with 6 layers each in the encoder and decoder and 512 hidden units in every hidden layer. The dimension of the embedding layer was set to 512 with 8 attention heads, and the LazyAdam optimizer was used to optimize the model parameters. The batch size was 64 samples, and the effective batch size for each step was 384.

WEFE
The Word Embedding Fairness Evaluation framework is used to rank word embeddings using a set of fairness criteria. WEFE takes in a query, which consists of a pair of sets of target words and a pair of sets of attribute words, where the attributes are generally assumed to be characteristics related to the targets.
The WEFE ranking process takes as input a set of queries Q, which serve as the tests across which bias is measured, a set of pre-trained word embeddings M, and a set of fairness metrics F.

The Score Matrix
Assume a fairness metric K is chosen from the set F, along with a query template s = (t, a) that all subqueries must satisfy. Q_i(s) then forms the set of all subqueries that satisfy the query template. The value of K(m, Q) is computed for every pre-trained embedding m ∈ M, for each query in the set. The matrix produced after doing this for each embedding has dimensions |M| × |Q|.
The rankings are created by aggregating the scores in each row of the aforementioned matrix, where each row corresponds to one embedding. The aggregation function agg must be consistent with the fairness metric's ordering ≤_F: for real values x, x′, y, y′, if x ≤_F y and x′ ≤_F y′, then agg(x, x′) ≤ agg(y, y′) must hold for the aggregation function to be usable. The result after performing this operation for every row is a vector of dimensions 1 × |M|, and we use ≤_F to create a ranking over the embeddings, with smaller scores ranked higher than larger ones.
After performing this process for every fairness metric over each embedding m ∈ M, the resultant matrix, with dimensions |M| × |F|, consists of the ranking indices of every embedding for every metric, allowing us to compare and analyze the correlations between the different metrics across word embeddings.
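The ranking procedure can be sketched as a toy illustration. The metric callables and aggregation function here are hypothetical stand-ins, not the WEFE library's actual API:

```python
def rank_embeddings(embeddings, queries, metrics, agg=None):
    """Build one row of scores per embedding for each metric, aggregate
    each row, and rank embeddings by aggregated score (smaller is better)."""
    agg = agg or (lambda scores: sum(abs(s) for s in scores))
    rankings = {}
    for metric_name, metric in metrics.items():
        aggregated = {
            name: agg([metric(emb, q) for q in queries])  # one row of the score matrix
            for name, emb in embeddings.items()
        }
        ordered = sorted(aggregated, key=aggregated.get)  # smaller score ranks higher
        rankings[metric_name] = {name: i + 1 for i, name in enumerate(ordered)}
    return rankings
```

The resulting dictionary of per-metric rankings corresponds to the |M| × |F| ranking matrix described above.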

WEAT
The WEAT (Word Embedding Association Test) (Caliskan et al., 2017) metric, inspired by the IAT (Implicit Association Test), takes in a set of queries as its input, with each query consisting of sets of target words and attribute words. In our case, we have defined two sets of target words catering to masculine and feminine gendered words, respectively. In addition, we have defined multiple pairs of sets of attribute words, as listed in the Appendix. WEAT calculates the association of the target set T_1 with the attribute set A_1 over the attribute set A_2, relative to T_2. For example, as observed in Table 1, the masculine words tend to have a greater association with career than family compared to the feminine words. Thus, given a word w in the word embedding, the association is estimated as the difference of the means of the cosine similarities between the word's embedding vector and the word embedding vectors of the two attribute sets: s(w, A_1, A_2) = mean_{a ∈ A_1} cos(w, a) − mean_{a ∈ A_2} cos(w, a).
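A minimal, self-contained sketch of this association computation on raw vectors (our experiments use the WEFE implementation rather than this code):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(w, A1, A2):
    # s(w, A1, A2): mean cosine similarity with A1 minus mean with A2
    return (sum(cosine(w, a) for a in A1) / len(A1)
            - sum(cosine(w, a) for a in A2) / len(A2))

def weat_score(T1, T2, A1, A2):
    # Positive scores indicate T1 is more associated with A1 (and T2 with A2)
    return (sum(association(t, A1, A2) for t in T1)
            - sum(association(t, A1, A2) for t in T2))
```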

RND
The objective of the Relative Norm Distance (RND) (Garg et al., 2018) is to average the embedding vectors within each target set; for every attribute a ∈ A, the norm of the difference between each average target vector and the attribute word vector is calculated, and the two resulting distances are subtracted.
The higher the value of the relative distance from the norm, the more associated the attributes are with the second target group, and vice versa.
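A sketch of this computation, assuming two target sets and plain Python vectors:

```python
import math

def mean_vector(vecs):
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def rnd(T1, T2, A):
    """Relative norm distance: for each attribute, the distance to the
    T1 centroid minus the distance to the T2 centroid, summed over A.
    Higher values mean the attributes sit closer to the second target set."""
    m1, m2 = mean_vector(T1), mean_vector(T2)
    dist = lambda u, v: math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return sum(dist(m1, a) - dist(m2, a) for a in A)
```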

RNSB
The Relative Negative Sentiment Bias (RNSB) (Sweeney and Najafian, 2019) takes in multiple target sets and two attribute sets and creates a query. First, a binary classifier is constructed using the first attribute set A_1 as training examples for the first class and A_2 for the second class. The classifier then assigns every word w a probability indicating its association with the first attribute set, i.e., p(A_1) = C_(A_1, A_2)(w), where C_(A_1, A_2)(x) represents the binary classifier applied to any word x. The probability of the word's association with the attribute set A_2 is therefore 1 − C_(A_1, A_2)(w). A probability distribution P is formed over the words in the target sets by computing this degree of association for each of them. Ideally, P should be the uniform distribution U, which would indicate that there is no bias in the word embeddings with respect to the two attribute sets; the less uniform the distribution, the greater the bias. We calculate the RNSB as the Kullback-Leibler divergence of P from U to assess the similarity of these distributions.
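The RNSB computation can be sketched as follows. The tiny hand-rolled logistic regression is a stand-in for whatever probabilistic classifier an implementation actually uses; only the shape of the computation matters here:

```python
import math

def train_classifier(A1, A2, lr=0.5, epochs=300):
    """Minimal logistic regression: A1 examples labelled 1, A2 labelled 0."""
    dim = len(A1[0])
    w, b = [0.0] * dim, 0.0
    data = [(x, 1.0) for x in A1] + [(x, 0.0) for x in A2]
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
            b -= lr * (p - y)
    return lambda x: 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def rnsb(targets, A1, A2):
    clf = train_classifier(A1, A2)
    probs = [clf(t) for t in targets]            # association of each target with A1
    total = sum(probs)
    P = [p / total for p in probs]               # normalised distribution over targets
    U = 1.0 / len(P)                             # uniform reference distribution
    return sum(p * math.log(p / U) for p in P)   # KL(P || U); 0 indicates no bias
```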

ECT
The Embedding Coherence Test (Dev and Phillips, 2019) compares the vectors of the two target sets T_1 and T_2, averaged over all their terms, with vectors from an attribute set A. It does so by computing a mean vector for each target set. After calculating the mean vectors, we compute their cosine similarities with every attribute vector a ∈ A, resulting in s_1 and s_2, vector representations of the similarity scores for the two target sets. The ECT score is computed as the Spearman's rank correlation between the rank orders of s_1 and s_2, with a higher correlation implying lower bias.
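A self-contained sketch of ECT; the no-ties Spearman formula used here is a simplification of the general rank-correlation computation:

```python
import math

def mean_vector(vecs):
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def spearman(x, y):
    """Spearman's rank correlation (assumes no ties, for simplicity)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for pos, i in enumerate(order):
            r[i] = pos
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(rx, ry)) / (n * (n * n - 1))

def ect(T1, T2, A):
    m1, m2 = mean_vector(T1), mean_vector(T2)
    s1 = [cosine(m1, a) for a in A]
    s2 = [cosine(m2, a) for a in A]
    return spearman(s1, s2)   # closer to 1 implies lower bias
```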

TGBI
In the Translation Gender Bias Index (TGBI) (Cho et al., 2019), the authors create a test set of words and phrases that are gender-neutral in the source language, Korean. These lists were then translated using three different models and evaluated for bias using their evaluation scheme. The evaluation methodology proposed in the paper quantifies associations with 'he,' 'she,' and related gendered words present in the translated text. We carry out this methodology for Hindi, a gendered language that is low-resource in natural language processing tasks.

Occupation and Sentiment Lists
Considering all of the requirements laid out by Cho et al. (2019), we created lists of unique occupations and of positive and negative sentiment words in our source language, Hindi. The occupation list was generated by translating the list in the original paper. The translated lists were manually checked by native Hindi speakers for spelling and grammatical errors and for gender associations. The sentiment lists were generated by translating existing English sentiment lists (Liu et al., 2005; Hu and Liu, 2004) and then manually checked for errors by the authors. This method of generating sentiment lists in Hindi via translation was also used in Bakliwal et al. (2012). The lists of unique occupations and positive and negative sentiment words come out to 1100, 820, and 738 entries respectively. These lists have also been made available online. 2

Pronouns and Suffixes
Hindi, unlike Korean, does not have gender-specific pronouns in the third person. Cho et al. (2019) considered 그 사람 (ku salam), 'the person,' as a formal gender-neutral pronoun and 걔 (kyay) as an informal gender-neutral pronoun for part of their gender-neutral corpus. For Hindi, however, we directly use the third-person gender-neutral pronouns वह (vah), वे (ve), and वो (vo), corresponding to formal impolite (familiar), formal polite (honorary), and informal (colloquial) use respectively (Jain, 1969).
As demonstrated by Cho et al. (2019), the performance of the MT system would be best evaluated with different sentence sets used as input. We apply the three categories of Hindi pronouns to make three sentence sets for each lexicon set (sentiment and occupations): (i) formal polite, (ii) formal impolite, and (iii) informal (colloquial use).
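The construction of the sentence sets can be sketched as below. The pronouns are romanized and the template is a hypothetical placeholder, not the actual Hindi templates used to build the corpus:

```python
# Romanised placeholders for illustration only; the real corpus uses
# Devanagari lexicon entries and templates following Cho et al. (2019).
PRONOUNS = {
    "formal_polite": "ve",      # honorary
    "formal_impolite": "vah",   # familiar
    "informal": "vo",           # colloquial
}

def build_sentence_sets(lexicon, template="{pronoun} {word}"):
    """Cross each pronoun category with every lexicon entry to produce
    one sentence set per category."""
    return {
        category: [template.format(pronoun=p, word=w) for w in lexicon]
        for category, p in PRONOUNS.items()
    }
```

Applying this over the sentiment and occupation lexicons yields the three sentence sets per lexicon described above.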

Evaluation
We evaluate two systems, Google Translate and the Hi-En OpenNMT model, on seven gender-neutral lists: (a) informal, (b) formal, (c) impolite, (d) polite, (e) negative, (f) positive, and (g) occupation. Using these lists, we attempt to find bias that exists in different types of contexts. The individual and cumulative scores help us assess contextual bias and overall bias in Hi-En translation, respectively.
TGBI uses the number of translated sentences containing she, he, or they pronouns (and conventionally associated words such as girl, boy, or person) to measure bias, associating these pronouns with the proportions p_he, p_she, and p_they (a convention chosen to disassociate pronouns from gender and sex) for the scores P_1 to P_7 corresponding to the seven sets S_1 to S_7, such that P_i = sqrt(p_he * p_she) + p_they, with p_he + p_she + p_they = 1 for each set, and finally, TGBI = avg(P_i).
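Following Cho et al. (2019), each set's score combines the three pronoun proportions as P = sqrt(p_he * p_she) + p_they. A minimal computation of the per-set scores and the final index:

```python
import math

def p_score(p_he, p_she, p_they):
    """Score for one sentence set. The proportions must sum to 1;
    P = 1 only when every translation is gender-neutral (p_they = 1)."""
    assert abs(p_he + p_she + p_they - 1.0) < 1e-9
    return math.sqrt(p_he * p_she) + p_they

def tgbi(proportions):
    """proportions: one (p_he, p_she, p_they) triple per sentence set."""
    scores = [p_score(*p) for p in proportions]
    return sum(scores) / len(scores)
```

A perfectly balanced but fully gendered output (p_he = p_she = 0.5) scores 0.5, while fully neutral output scores 1, so the metric rewards neutrality over mere balance.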

Results and Discussion
The BLEU score of the OpenNMT model we used was 24.53, and the RIBES score was 0.7357 across 2478 samples.

WEAT
We created multiple sets of categories for the attributes associated with 'masculine' and 'feminine,' including the subqueries listed in the supplementary material. We used both the embeddings from the encoder and the decoder, that is, the source and the target embeddings, as the input to WEFE, alongside the set of words defined in the target and attribute sets. We also tested pre-trained word embeddings available with the gensim (Rehurek and Sojka, 2011) package on the same queries. The results of the measurement of bias using the WEFE framework are listed in Table 1. For the English embeddings, there is a significant disparity in the WEAT measurement for the Math vs. Arts and the Science vs. Arts categories. This could be because there is little data in the corpus the MT system was trained on that is relevant to the attributes in these sets. Hence the bias is minimal compared to the pre-trained word2vec embeddings, which are learned over a dataset containing 100 billion words and are likely to learn more social bias than the embeddings learned in the training of the MT system. We notice a skew in some of the other results, which could be due to the MT model picking up on gender signals that have strong associations of the target set with the attribute set, implying a strong bias in the target set training data samples themselves. However, all of these metrics and the pre-trained embeddings used are in positive agreement with each other regarding the inclination of the bias.
For the Hindi embeddings, while the values agree with each other for the first two metrics, there is a much more noticeable skew in the RND and ECT metrics. The pre-trained embeddings seem to exhibit much more bias, but the estimation of bias within the embeddings learned by the MT system may not be accurate, since the corresponding word vectors carry less information, given the low frequency of these terms in the corpus the NMT was trained on. In addition, several words in the English attribute sets had no equivalent Hindi translation or produced multiple identical attribute words in Hindi. Consequently, we had to modify the Hindi attribute lists.
While these metrics can be used to quantify gender bias, they are not necessarily robust, as illustrated in Ethayarajh et al. (2019), which delves into the flaws of WEAT. They also treat gender in binary terms, a consistent trend across research in the field.
Our findings show a heavy tendency for Hi-En MT systems to produce gendered outputs when the gender-neutral equivalent is expected. We see that many stereotypical biases are present in the source and target embeddings used in our MT system. Further work to debias such models is necessary, and the development of a more advanced NMT would be beneficial to produce more accurate translations to be studied for bias.

TGBI
The final TGBI score, which is the average of the different P_i values, lies between 0 and 1. A score of 0 corresponds to high bias (or gendered associations in translated text) and 1 corresponds to low bias (Cho et al., 2019).
The bias values tabulated in Table 2 show that, within both models, occupations show a greater bias than the sentiment lexicons, with the p_she value being low. This points directly to social biases projected onto the lexicons (S_bias). For politeness and impoliteness, the former shows the least bias and the latter the most across all lists. Between the formal and informal lists, the informal pronoun lists show higher bias. There are a couple of things to consider within these results: a) the polite pronoun वे (ve) is most often used in the plural in modern text (V_bias), leading to a lower measured bias; b) both polite and impolite pronouns are included in the formal lists, which could correspond to the formal lists' comparatively lower index value compared to the informal ones.
Bias in MT outputs, whether attributed to S_bias or V_bias, is harmful in the long run. Therefore, in our understanding, the ideal outcome is TGBI = 1, with corresponding p_they, p_she, p_he values of 1, 0, and 0 respectively.

Bias Statement
In this paper, we examine gender bias in Hi-En MT comprehensively across different categories of occupations, sentiment words, and other aspects. We consider bias to be the stereotypical association of words from these categories with gender or, more specifically, with gendered words. Following the suggestions of Blodgett et al. (2020), we consider the two main categories of harms generated by bias: 1) representational and 2) allocational. The observed biased underrepresentation of certain groups in areas such as Career and Math, and that of another group in Family and Art, causes direct representational harm. Due to these representational harms in MT and other downstream applications, people who already belong to systematically marginalized groups are put further at risk of being negatively affected by stereotypes. Inevitably, gender bias causes errors in translation (Stanovsky et al., 2019), which can contribute to allocational harms through disparities in how useful the system proves to be for different people, as described in an example in Savoldi et al. (2021). The range of applications that MT systems augment or directly power increases the risks associated with these harms.
Only a small percentage of the population of India, the second most populated country in the world, speaks English, while English is the most used language on the internet. It is inevitable that much of the content consumed now or in the future will be translated. It therefore becomes imperative to evaluate and mitigate the bias within MT systems concerning all Indic languages.

Ethical Considerations and Suggestions
There has been a powerful shift towards ethics within the NLP community in recent years, with plenty of work on bias focusing on gender. However, most of these works lack a critical understanding of what gender means. The term has often been used interchangeably with 'female' and 'male,' which refer to sex, or the external anatomy, of a person. Most computational studies on gender see it strictly as a binary and do not account for the difference between gender and sex. Scholars in gender theory define gender as a social construct or a learned association. Not accounting for this definition in computational studies not only oversimplifies gender but also possibly furthers stereotypes (Brooke, 2019). It is also important to note that pronouns in computational studies have been used to identify gender, and while the English pronouns he and she do have a gender association, pronouns are essentially a replacement for nouns. A person's pronouns, like their name, are a form of self-identity, especially for people whose gender identity falls outside of the gender binary (Zimman, 2019). We believe research specifically working towards making language models fair and ethically sound should employ language neutralization whenever possible and necessary, and should make existing or future methodologies more inclusive; this reduces further stereotyping (Harris et al., 2017; Tavits and Pérez, 2019). Reinforcing the gender binary or the association of pronouns with gender may be invalidating for people who identify outside of the gender binary (Zimman, 2019).

Conclusion and Future Work
In this work, we have attempted to gauge the degree of gender bias in a Hi-En MT system. We quantify gender bias (so far only within the gender binary) using metrics that take data in the form of queries, and we apply slight modifications to TGBI to extend it to Hindi. We believe this could pave the way for the comprehensive evaluation of bias across other Indic and/or gendered languages. Through this work, we look forward to developing a method to debias such systems and a metric to measure gender bias without treating it as an immutable binary concept.