Gender bias amplification during Speed-Quality optimization in Neural Machine Translation

Is bias amplified when neural machine translation (NMT) models are optimized for speed and evaluated on generic test sets using BLEU? We investigate architectures and techniques commonly used to speed up decoding in Transformer-based models, such as greedy search, quantization, average attention networks (AANs), and shallow decoder models, and show their effect on gendered noun translation. We construct a new gender bias test set, SimpleGEN, based on gendered noun phrases in which there is a single, unambiguous, correct answer. While we find minimal overall BLEU degradation as we apply speed optimizations, we observe that gendered noun translation performance degrades at a much faster rate.


Introduction
Optimizing machine translation models for production, where they have the most impact on society at large, will invariably involve speed-accuracy trade-offs, where accuracy is typically approximated by BLEU scores (Papineni et al., 2002) on generic test sets. However, BLEU is notably insensitive to specific biases such as gender. Even when speed optimizations are evaluated in shared tasks, they typically use BLEU (Papineni et al., 2002; Heafield et al., 2020) to approximate quality, thereby missing gender bias. Furthermore, these biases probably evade detection in shared tasks that focus on quality without a speed incentive (Guillou et al., 2016), because participants would not typically optimize their systems for speed. Hence, it is not clear if Neural Machine Translation (NMT) speed-accuracy optimizations amplify biases. This work attempts to shed light on the algorithmic choices made during speed-accuracy optimizations and their impact on gender biases in an NMT system, complementing existing work on data bias.

source:    That physician is a funny lady!
reference: ¡Esa médica/doctora es una mujer graciosa!
system A:  ¡Ese médico es una dama graciosa!
system B:  ¡Ese médico es una dama divertida!
system C:  ¡Ese médico es una mujer divertida!
system D:  ¡Ese médico es una dama divertida!

Table 1: Translation of a simple source sentence by 4 different commercial English to Spanish MT systems. All of these systems fail to consider the token "lady" when translating the occupation noun, rendering it with the masculine gender "doctor/médico".
We explore optimization choices such as (i) search (changing the beam size in beam search); (ii) architecture configurations (changing the number of encoder and decoder layers); (iii) model-based speedups (using average attention networks (Zhang et al., 2018)); and (iv) 8-bit quantization of a trained model.
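As a rough illustration of the last of these choices, the core idea behind post-training 8-bit quantization can be sketched in a few lines. This is a toy symmetric per-tensor scheme, not Fairseq's actual implementation; real toolkits quantize per layer and rely on optimized int8 kernels:

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: choose a scale so the largest
    # magnitude maps to 127, then round every weight to an int in [-127, 127].
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    # Recover approximate float weights; the rounding error is the price paid
    # for the 4x memory reduction and faster integer arithmetic.
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.27]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

The round trip is lossy in general; here the restored values match the originals to within floating-point error only because the toy weights align with the scale.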
Prominent prior work on gender bias evaluation forces the system to "guess" the gender (Stanovsky et al., 2019a) of certain occupation nouns in the source sentence. Consider the English source sentence "That physician is funny.", which contains no information regarding the physician's gender. When translating this sentence into Spanish (where occupation nouns are explicitly specified for gender), an NMT model is forced to guess the gender of the physician and choose between the masculine forms doctor/médico and the feminine forms doctora/médica. While investigating bias in these settings is valuable, in this paper we hope to highlight that the problem is much worse: even when a sentence contains an explicit gender reference, NMT systems still generate the wrong gender in translation (see Table 1), resulting in egregious errors where not only is the gender specification incorrect but the generated sentence also fails in morphological gender agreement. To focus on these egregious errors, we construct a new data set, SimpleGEN. In SimpleGEN, all source sentences include an occupation noun (such as "mechanic", "nurse", etc.) and an unambiguous "signal" specifying the gender of the person referred to by the occupation noun. For example, we modify the previous example to "That physician is a funny lady". We call our dataset "Simple" because it contains all the information needed by a model to produce correctly gendered occupation nouns. Furthermore, our sentences are short (up to 12 tokens) and do not contain complicated syntactic structures. Ideally, SimpleGEN should obviate the need for an NMT model to guess the gender of occupation nouns, but using this dataset we show that gender translation accuracy, particularly in female context sentences (see Section 2), is negatively impacted by various speed optimizations at a greater rate than the drop in BLEU scores. A small drop in BLEU can hide a large increase in biased behavior in an NMT system.
This further illustrates how insensitive BLEU is, as a metric, to such biases.

SimpleGEN: A gender bias test set
Similar to Stanovsky et al. (2019b), our goal is to provide English input to an NMT model and evaluate whether it correctly genders occupation nouns. We focus on the English to Spanish (En-Es) and English to German (En-De) translation directions, as occupation nouns are explicitly specified for gender in these target languages, while English is underspecified for this morphological phenomenon, which forces the model to attend to contextual clues. Furthermore, these language directions are considered "high-resource" and often cited as exemplars of advancement in NMT.
A key differentiating characteristic of our test set is that there is no ambiguity about the gender of the occupation noun. We achieve this by using carefully constructed templates such that there is enough contextual evidence to unambiguously specify the gender of the occupation noun. Our templates specify a "scaffolding" for sentences, with keywords acting as placeholders for values (see Table 2). For occupation keywords such as f-occ-sg and m-occ-sg, we select the occupations for our test set using the U.S. Department of Labor statistics of high-demand occupations. A full list of templates, keywords and values is in Table A6. Using our templates, we generate English source sentences which fall into two categories: (i) pro-stereotypical (pro) sentences contain either stereotypically male occupations situated in male contexts (MOMC) or female occupations in female contexts (FOFC), and (ii) anti-stereotypical (anti) sentences in which the context gender and occupation gender are mismatched, i.e., male occupations in female contexts (MOFC) and female occupations in male contexts (FOMC). Note that we use the terms "male context" and "female context" to categorize sentences in which there is an unambiguous signal that the occupation noun refers to a male or female person, respectively. We generated 1332 pro-stereotypical and anti-stereotypical sentences: 814 in the MOMC and MOFC subgroups and 518 in the FOMC and FOFC subgroups (we collect more stereotypically male occupations than female ones, which causes this disparity).
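The template expansion described above can be sketched as follows. The keyword names follow Tables A5/A6, but the template string and the truncated value lists here are illustrative, not the full SimpleGEN inventory:

```python
from itertools import product

# Illustrative subsets of the keyword-value tables (full lists: Tables A5/A6).
values = {
    "m-occ-sg": ["mechanic", "physician"],  # stereotypically male occupations
    "f-occ-sg": ["nurse", "secretary"],     # stereotypically female occupations
    "f-n-sg":   ["woman", "lady"],          # unambiguous female-context signal
}

TEMPLATE = "That {occ} is a funny {noun}!"

def expand(occ_key, noun_key):
    # Fill the template with every combination of keyword values.
    return [TEMPLATE.format(occ=o, noun=n)
            for o, n in product(values[occ_key], values[noun_key])]

# Male occupations in female contexts (MOFC):
mofc = expand("m-occ-sg", "f-n-sg")
```

Swapping in f-occ-sg instead of m-occ-sg would yield the corresponding FOFC sentences.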
To evaluate the translations of NMT models on SimpleGEN, we also create an occupation-noun bilingual dictionary that accounts for number and gender, as well as synonyms for the occupations. For example, for the En-Es direction, the English occupation term "physician" has entries for its feminine forms in Spanish, "doctora" and "médica", and for its masculine forms, "doctor" and "médico" (see Table A8 for our full dictionary). By design, non-occupation keywords such as f-rel and f-n-sg specify the expected gender of the occupation noun on the target side, enabling dictionary-based correctness verification.
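The dictionary-based verification might be sketched as below. This uses a toy fragment of the En-Es dictionary and naive whitespace tokenization; the real protocol handles morphology, plural forms, and tokenization more carefully:

```python
# Toy fragment of the En-Es occupation-noun dictionary (full version: Table A8).
DICT = {
    "physician": {"f": {"doctora", "médica"}, "m": {"doctor", "médico"}},
}

def check_gender(src_occupation, expected_gender, translation):
    # Classify a translation as correct / incorrect / inconclusive depending
    # on which gendered target form of the occupation noun appears in it.
    entry = DICT[src_occupation]
    tokens = set(translation.lower().split())
    other_gender = "m" if expected_gender == "f" else "f"
    hit_expected = tokens & entry[expected_gender]
    hit_other = tokens & entry[other_gender]
    if hit_expected and not hit_other:
        return "correct"
    if hit_other and not hit_expected:
        return "incorrect"
    return "inconclusive"  # e.g. a plural or unlisted form was used
```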

Speeding up NMT
There are several "knobs" that can be tweaked to speed up inference in NMT models. Setting the beam size (bs) to 1 during beam search, i.e., greedy search, is likely the simplest. Shallow decoder (SD) models move layers from the decoder, which runs once for every generated token, to the encoder, which runs only once per sentence. Alternatively, one can employ SD models without increasing the number of encoder layers, resulting in smaller (and faster) models.
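A back-of-the-envelope cost model shows why trimming decoder layers matters more than trimming encoder layers at inference time. Per-layer costs, batching, and hardware effects are deliberately abstracted away here:

```python
def layer_cost(enc_layers, dec_layers, tgt_len):
    # The encoder processes the source once, but the autoregressive decoder
    # runs once per generated token, so decoder layers dominate decoding cost.
    return enc_layers + dec_layers * tgt_len

baseline = layer_cost(6, 6, tgt_len=20)    # (6, 6) baseline configuration
shallow = layer_cost(10, 2, tgt_len=20)    # (10, 2) shallow-decoder configuration
```

Under this crude model the (10, 2) configuration does far less decoder work than the (6, 6) baseline despite having the same total layer count.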
Average Attention Networks (AAN): Average Attention Networks reduce the quadratic complexity of the decoder attention mechanism to linear time by replacing the decoder-side self-attention with an average-attention operation using a fixed weight for all time-steps (Zhang et al., 2018). This results in a ≈ 3-4x decoding speedup over the standard transformer.
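The average-attention operation itself reduces to a cumulative mean over decoder states, a_t = (1/t) Σ_{j≤t} h_j; with a running sum, each incremental decoding step costs O(1) instead of O(t). A minimal sketch, using lists of floats instead of real tensors and omitting the gating and feed-forward sublayers of the full AAN:

```python
def average_attention(states):
    # Replace decoder self-attention with a cumulative average over the
    # current and previous positions, with equal weight for every time-step.
    out, running = [], [0.0] * len(states[0])
    for t, h in enumerate(states, start=1):
        running = [r + x for r, x in zip(running, h)]  # incremental sum
        out.append([r / t for r in running])           # a_t = running_sum / t
    return out

averaged = average_attention([[2.0], [4.0], [6.0]])
```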

Experimental Setup
Our objective is not to compare the various optimization methods against each other, but rather to surface the impact of these algorithmic choices on gender biases. We treat all the optimization choices described in Section 3 as data points available to conduct our analysis. To this end, we train models with all combinations of the optimizations described in Section 3 using the Fairseq toolkit (Ott et al., 2019). Our baseline is a standard large transformer with a (6, 6) encoder-decoder layer configuration. For our SD models we use the following encoder-decoder layer configurations: {(8, 4), (10, 2), (11, 1)}. We also train smaller shallow decoder (SSD) models without increasing the encoder depth: {(6, 4), (6, 2), (6, 1)}. For each of these 7 configurations, we train AAN versions. Next, we save quantized and non-quantized versions of the 14 models, and decode with beam sizes of 1 and 5. We repeat our analysis for the English to Spanish and English to German directions, using the WMT13 En-Es and WMT14 En-De data sets, respectively. For En-Es, we limited the training data to 4M sentence pairs (sampled at random without replacement) to ensure that the training sets for the two language directions are of comparable size. We apply Byte-Pair Encoding (BPE) with 32k merge operations to the data (Sennrich et al., 2016).
We measure decoding times and BLEU scores for each model's translations on the WMT test sets. Next, we evaluate each model's performance on SimpleGEN, specifically calculating the percentage of correctly gendered nouns and incorrectly gendered nouns, as well as inconclusive results. Table 3 shows our evaluation protocol for an example source sentence and four possible translations. We deem the first two correct, even though the second translation incorrectly renders "funny" as "feliz", since we focus on the translation of "physician" only. The third translation is deemed incorrect because the masculine form "médico" is used, and the last is deemed inconclusive since it is in the plural form. We average these metrics over 3 trials, each initialized with a different random seed. We obtained 56 data points for each language direction.

Table 4a shows the performance of 6 selected models, including a baseline transformer model with 6 encoder and 6 decoder layers. The first two columns (time and BLEU) were computed using the WMT test sets. The remaining columns report metrics computed on SimpleGEN. The algorithmic choices yielding the highest speed-up result in a 1.5% and 4% relative drop in BLEU for En-Es and En-De, respectively (compared to the baseline model). The pro-stereotypical (pro) column shows the percentage of correctly gendered translations for sentences in which the occupation gender matches the context gender. As expected, the accuracies are relatively high (77.7 to 80.9) for all the models.

Table 4: Performance of selected models under individual optimizations (Table 4a) and stacked optimizations (Table 4b). We selected 6 models in both sections to highlight their effect on decoding time, BLEU, and % correctness on the gender-bias metrics. The last row of each section (and each direction) shows the relative % drop in each metric between the fastest optimization method and the baseline. For example, for En-Es the relative % drop in decoding time for Table 4a is calculated as 100 * (3662.8 − 1993.5)/3662.8.
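The relative % drop used throughout the tables is a one-liner; the 3662.8 and 1993.5 figures are the En-Es decoding times quoted above:

```python
def relative_drop(baseline, optimized):
    # Relative % drop between the baseline and an optimized configuration.
    return 100.0 * (baseline - optimized) / baseline

# En-Es decoding time, baseline vs. fastest configuration:
time_drop = relative_drop(3662.8, 1993.5)
```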

Analysis
The last row in each section shows the maximum relative drop in each metric. We find that for the pro-stereotypical column the maximum relative drop is 1.5 and 6.5 for Spanish and German, respectively, which is similar to the relative change in BLEU scores. However, we find that the models perform better on MOMC than on FOFC, suggesting biases even within the pro-stereotypical setting. In the anti-stereotypical (anti) column, we observe below-chance accuracies of only 44.2% and 39.7% for the two language directions, even from our best model.

Columns FOFC and MOFC show the difference in performance on sentences in the female context (FC) category in the presence of a stereotypically female occupation versus a stereotypically male occupation. We see a large imbalance in performance between these two columns, summarized in ∆FC. Similarly, ∆MC summarizes the drop in performance when the model is confronted with a stereotypically female occupation in a male context, compared to a male occupation in a male context. This suggests that the transformer's handling of grammatical agreement, especially in cases where the occupation and contextual gender mismatch, leaves room for improvement. The speedups disproportionately affect female context (FC) sentences across all categories.

In terms of model choices, we find that AANs deliver moderate speed-ups and minimal BLEU reduction compared to the baseline. However, AANs suffer the most degradation in terms of gender bias: ∆, ∆FC and ∆MC are the highest for the AAN model in both language directions. On the other hand, greedy decoding with the baseline model causes the smallest degradation in terms of gender bias.
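The gap metrics can be written down directly. The subgroup accuracies below are invented for illustration, and the sign convention (stereotypical minus anti-stereotypical subgroup) is our reading of the table:

```python
def bias_gaps(acc):
    # dFC: female-context gap, FOFC accuracy minus MOFC accuracy.
    # dMC: male-context gap, MOMC accuracy minus FOMC accuracy.
    # `acc` maps subgroup names to gendered-translation accuracy in %.
    return {"dFC": acc["FOFC"] - acc["MOFC"],
            "dMC": acc["MOMC"] - acc["FOMC"]}

# Invented numbers, for illustration only:
gaps = bias_gaps({"MOMC": 82.0, "FOFC": 78.0, "MOFC": 45.0, "FOMC": 60.0})
```

A large positive dFC or dMC indicates that accuracy collapses precisely when the occupation's stereotypical gender conflicts with the contextual gender.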
While Table 4a reveals the effect of select individual model choices, NMT practitioners typically "stack" optimization techniques together for large-scale deployment of NMT systems. Table 4b shows that stacking can provide a ≈ 80-81% relative drop in decoding time. However, we again see a disturbing trend in which large speedups and small BLEU drops are accompanied by large drops in gender test performance. Again, FC sentences suffer disproportionately large drops in accuracy, particularly MOFC in the En-De direction, where we see a 53.2% relative drop between the baseline and the fastest optimization stack.
While Tables 4a and 4b show select models, we illustrate and further confirm our findings using all the data points (56 trained models) in the scatter plots shown in Figure 1. We see that the relative % drop in BLEU aligns closely with the relative % drop in correctly gendered translation in the pro-stereotypical setting; in the case of German, the two trendlines are virtually overlapping. However, we see a steep drop in the anti-stereotypical setting, suggesting that BLEU scores computed on a typical test set capture only the stereotypical cases, and that even a small reduction in BLEU can correspond to many more instances of biased translation, especially in female context sentences.

Related Work
Previous research investigating gender bias in NMT has focused on data bias, ranging from assessment to mitigation. For example, Stanovsky et al. (2019b) adapted an evaluation data set for co-reference resolution to measure gender biases in machine translation. The sentences in this test set were created with ambiguous syntax, thus forcing the NMT model to "guess" the gender of the occupations. In contrast, there is always an unambiguous signal specifying the occupation noun's gender in SimpleGEN. Similar work in speech translation also studies contextual hints, but uses real-world sentences with complicated syntactic structures, and sometimes the contextual hints cross sentence boundaries, resulting in gender-ambiguous sentences (Bentivogli et al., 2020). Other work (2020) proposes a data-annotation scheme in which the NMT model is trained to obey gender-specific tags provided with the source sentence, while Escudé Font and Costa-jussà (2019) employ pre-trained word embeddings which have undergone a "debiasing" process (Bolukbasi et al., 2016; Zhao et al., 2018). Saunders and Byrne (2020) and Costa-jussà and de Jorge (2020) propose domain adaptation on a carefully curated data set that "corrects" the model's misgendering problems. Costa-jussà et al. (2020) consider variations in the amount of parameter sharing between different language directions in multilingual NMT models.

Conclusion
With the current mainstreaming of machine translation, and its impact on people's everyday lives, bias mitigation in NMT should extend beyond data modifications and counter bias amplification due to algorithmic choices as well. We focus on the algorithmic choices typically considered in speed-accuracy trade-offs during the productionization of NMT models. Our work illustrates that such trade-offs, under current algorithmic practices, significantly degrade gendered noun translation, amplifying biases. In the process of this investigation, we construct a new gender translation evaluation set, SimpleGEN, and use it to show that modern NMT architectures struggle to overcome gender biases even when translating source sentences that are syntactically unambiguous and clearly marked for gender.

A.1 Impact Statement
This work identifies a weakness of NMT models where they appear to ignore contextual evidence regarding the gender of an occupation noun and apply an incorrect gender marker. It is difficult to measure the adverse effects of biases in NMT, but errors like the ones we highlight reduce trust in the NMT system.
Intended use: We hope that this type of error is further studied by NMT researchers, leading to a solution. Furthermore, we expect the speed-optimization aspect of our work to provide NMT engineers with an extra point of consideration, as we show that gender bias (errors on our dataset) increases rapidly compared to metrics like BLEU on standard datasets. In this work, we limit ourselves to viewing gender in the linguistic sense. SimpleGEN is not meant to be a replacement for traditional MT evaluation.
Risks: We recognize that socially, gendered language evolves (e.g. in English, "actress" is rarely used anymore). To the best of our knowledge, we selected occupations that are typically gendered (in Spanish and German) at present. Furthermore, we only regard the gender binary as a linguistic construct. It would be incorrect to use this work in the context of gender identity or gender expression etc.
Dataset: The dataset is "synthetic" in that it has been constructed using templates. We did not use crowd-sourcing or private data.

A.2 Full Template and Terms
Keyword          Values
f-n              female, women
m-n              male, men
f-n-pl           women, ladies, females, gals
m-n-pl           men, guys, males, fellows
f-n-sg           gal, woman, lady
m-n-sg           man, guy, fellow
f-obj-prn        her
m-obj-prn        him
f-pos-prn        her
m-pos-prn        his
f-obj-pos-prn    her
m-obj-pos-prn    his
f-sbj-prn        she
m-sbj-prn        he
f-rel            wife, mother, sister, girlfriend
m-rel            husband, father, brother, boyfriend

Table A5: Keywords and the values they can take.
Occupation Keyword    Values
f-occ-sg      clerk, designer, hairdresser, housekeeper, nanny, nurse, secretary
m-occ-sg      director, engineer, truck driver, farmer, laborer, mechanic, physician, president, plumber, carpenter, groundskeeper
f-occ-pl      clerks, designers, hairdressers, housekeepers, nannies, nurses, secretaries
m-occ-pl      directors, engineers, truck drivers, farmers, laborers, mechanics, physicians, presidents, plumbers, carpenters, groundskeepers
f-occ-sg-C    clerk, designer, hairdresser, housekeeper, nanny, nurse, secretary
m-occ-sg-C    director, truck driver, farmer, laborer, mechanic, physician, president, plumber, carpenter, groundskeeper
f-occ-pl-C    clerks, designers, hairdressers, housekeepers, nannies, nurses, secretaries
m-occ-pl-C    directors, truck drivers, farmers, laborers, mechanics, physicians, presidents, plumbers, carpenters, groundskeepers
f-occ-sg-V    —
m-occ-sg-V    engineer
f-occ-pl-V    —
m-occ-pl-V    engineers

Table A6: Occupation keywords and the values they can take. The prefixes "m-" and "f-" indicate that, according to the U.S. Department of Labor, these occupations have a higher percentage of male and female workers, respectively.

Table A7 shows the templates we use to generate our source sentences in SimpleGEN. We can generate sentences in one of the four sub-categories (MOMC, MOFC, FOFC, FOMC) by setting occupation keywords with the prefix m- or f- from our terminology set (Table A6). For example, to generate MOFC sentences, we set occupation keywords with the prefix m- and non-occupation keywords with the prefix f-.

A.3 Breakdown of scatter plots
Figures A2a and A2b further divide pro-stereotypical into male occupations in male contexts (MOMC) and female occupations in female contexts (FOFC), and anti-stereotypical into male occupations in female contexts (MOFC) and female occupations in male contexts (FOMC). Table A8 shows the dictionary we use for evaluation.