Contrastive Conditioning for Assessing Disambiguation in MT: A Case Study of Distilled Bias

Lexical disambiguation is a major challenge for machine translation systems, especially if some senses of a word are trained less often than others. Identifying patterns of overgeneralization requires evaluation methods that are both reliable and scalable. We propose contrastive conditioning as a reference-free black-box method for detecting disambiguation errors. Specifically, we score the quality of a translation by conditioning on variants of the source that provide contrastive disambiguation cues. After validating our method, we apply it in a case study to perform a targeted evaluation of sequence-level knowledge distillation. By probing word sense disambiguation and translation of gendered occupation names, we show that distillation-trained models tend to overgeneralize more than other models with a comparable BLEU score. Contrastive conditioning thus highlights a side effect of distillation that is not fully captured by standard evaluation metrics. Code and data to reproduce our findings are publicly available.


Introduction
Erroneous disambiguation of words makes translations inadequate and can even constitute a form of bias when it occurs systematically. However, detecting disambiguation errors in machine translation (MT) is a non-trivial task. Previous work has focused on automatic post-hoc analysis of translations (Raganato et al., 2019; Stanovsky et al., 2019), but rules of what makes a disambiguation correct or incorrect tend to be imprecise. While contrastive evaluation (Sennrich, 2017; Rios et al., 2017) eliminates the need for post-hoc analysis by scoring pre-defined pairs of hypotheses, such probability estimates cannot be obtained from black-box systems, e.g., from commercial APIs that only return a 1-best translation to the user. In addition, contrastive hypotheses need to be carefully crafted for every target language of interest.
Figure 1: Our case study is motivated by an analysis of the training data: In the English-German WMT19 news corpus, doctor is mostly translated into male forms such as Arzt and rarely into female forms (center: other variants). Student models trained on a machine-translated version of the data amplify this imbalance.
We propose contrastive conditioning as a scalable black-box alternative for evaluating disambiguation in machine translation. The evaluated translations are paired with contrastive source sequences and are then scored by a white-box translation model. The contrastive sources are variants of the original source, slightly modified to provide a stronger disambiguation cue. For example, consider a model that translates the English source 'doctor' into German as 'Ärztin' (female doctor). This translation will receive a better score when conditioned on the source 'female doctor' than on 'male doctor', indicating that it is a feminine form. Given sufficient disambiguation cues, the white-box translation model can thus serve as an evaluator for the original translation.
Since the contrastive sources are written in the source language, contrastive conditioning does not rely on references in the target language. This makes it easier to scale the evaluation across multiple target languages. In addition, the method is reliable compared to alternative evaluation methods, according to our human validation.
In a case study, we utilize contrastive conditioning to answer a specific research question. We probe models trained using sequence-level knowledge distillation (SeqKD; Kim and Rush, 2016) and quantify their overgeneralization bias, i.e., their tendency to err on the side of frequent word senses. This question is of emerging interest because the distilled training data used for SeqKD are known to have reduced entropy (Zhou et al., 2020). Figure 1 illustrates to what degree a rare word sense can vanish in distillation, raising the question of how this affects disambiguation quality. Our case study is based on English-German and English-Russian systems and applies contrastive conditioning to two distinct types of disambiguation. The first type is word sense disambiguation in general, as represented by the MuCoW test suite (Raganato et al., 2019). The second type is the special case of gendered occupation names, for which the WinoMT challenge set has been released (Stanovsky et al., 2019). For both types of disambiguation, our results based on contrastive conditioning confirm that models trained via SeqKD tend to have a more pronounced overgeneralization bias than other models with a comparable BLEU score.
[Figure 2. Left panel, Word Sense Disambiguation: source "We ought to have avocados as a starter."; contrastive variants: appetizer / motor. Right panel, Gendered Occupation Names: source "The assistant asked the doctor if she needs any help."; contrastive variants: female person / male person.]

Evaluation of Disambiguation in MT
In the context of translation, word sense disambiguation (WSD; Navigli, 2009) can be formally defined as follows: Let us assume that every instance of a content word $w$ conveys one out of a set $\{s_1, s_2, \ldots\}$ of senses. Then a WSD error occurs if a source instance $w_i$ is translated into a target word that does not convey the sense of $w_i$ but another sense of the word $w$ (Popescu-Belis, 2019).
Automated approaches for evaluating MT systems on WSD can be grouped into pattern-matching and scoring approaches. In a pattern-matching evaluation, translations are searched for phrases that are known to be correct or incorrect. For example, Vickrey et al. (2005) create a test set from ambiguous source words in a parallel corpus, and Raganato et al. (2019, 2020) use this approach to assemble a large-scale benchmark (MuCoW) with multiple translation variants for ambiguous words. However, it is usually not feasible to create an exhaustive list of all translation variants.
Scoring-based evaluation alleviates this problem by directly comparing probabilities for pre-defined contrastive translation variants as estimated by the model (Sennrich, 2017; Rios et al., 2017). An example of a contrastive translation pair for WSD is presented in the left part of Figure 2. The scoring of contrastive translations has some drawbacks in that it depends on a non-standard interface to the MT system, and, like pattern-matching evaluation, on language-specific references. Furthermore, there is no guarantee that the actual 1-best translation would be similar to the preferred variant.

Translation of Gendered Occupations
The translation of gendered occupation nouns can be seen as a special case of WSD. When translating occupations from a language that does not tend to mark their gender into a language that does, gender has to be either inferred from the context, e.g., from anaphoric pronouns, or expressed neutrally. Such a challenge arises when translating from English into German, Russian or other morphologically rich languages. WinoMT (Stanovsky et al., 2019) is a challenge set for this phenomenon, which combines several datasets for gender coreference in English (Rudinger et al., 2018; Zhao et al., 2018). See Figure 2 for an example in contrastive format.

Overgeneralization Bias
Carbonell et al. (1983) describe overgeneralization as a tendency to learn concepts that extend not only to positive but also to negative examples, which can arise if a system sees mostly positive examples. More recently, overgeneralization has been discussed as a category of social impact of NLP systems (Hovy and Spruit, 2016), and it has been hypothesized that overgeneralization of the training data leads to a loss of lexical diversity and to an exacerbation of gender bias in MT (Vanmassenhove et al., 2019; Roberts et al., 2020). In the case of WSD, Rios et al. (2017) have found that neural MT systems handle frequent word senses well but perform poorly on rare word senses. The influence of sense distribution on WSD has been further examined by Tang et al. (2018), Raganato et al. (2020) and Emelin et al. (2020). With regard to gendered occupations, Stanovsky et al. (2019) show that MT translates stereotypes more reliably, and WinoMT or similar datasets have subsequently been used to quantify bias in various translation settings (Kocmi et al. 2020; Costa-jussà et al. 2020a,b; Tomalin et al. 2021; Choubey et al. 2021; Renduchintala and Williams 2021; among others).

Effects of Knowledge Distillation
Overgeneralization bears some resemblance to compression, which is significant in the context of knowledge distillation (Hinton et al., 2015). The process of sequence-level knowledge distillation (SeqKD) can be described as follows (Kim and Rush, 2016):
1. A generative model is trained, to be used as an intermediate teacher;
2. The teacher re-generates the target side of the training data;
3. A student model, which is usually smaller, is trained on the new data.
In its simplest form, SeqKD replaces the original target side of the training data with the teacher's best translation as generated with beam search. Kim and Rush (2016) report that small student models can approximate the translation quality of more complex teachers and that student models excel under greedy decoding, making them an attractive choice for large-scale deployment of MT (Kim et al., 2019). The effectiveness of SeqKD raises the question of how distilled data differ from the original training data and how such a difference might affect model behavior.
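The data-regeneration step can be illustrated with a deliberately extreme toy example. This is only a sketch: `train_majority_teacher` is a hypothetical stand-in that memorizes the most frequent target per source, not a neural teacher with beam search, and the ten-pair corpus is invented. It merely shows why replacing references with 1-best outputs can remove a rare variant from the training data.

```python
from collections import Counter
from typing import Callable, List, Tuple

Corpus = List[Tuple[str, str]]

def train_majority_teacher(corpus: Corpus) -> Callable[[str], str]:
    """Toy 'teacher' (assumption): returns the most frequent target per source."""
    counts = {}
    for src, tgt in corpus:
        counts.setdefault(src, Counter())[tgt] += 1
    table = {src: c.most_common(1)[0][0] for src, c in counts.items()}
    return lambda src: table[src]

def distill_corpus(sources: List[str], teacher: Callable[[str], str]) -> Corpus:
    """SeqKD step 2: replace each reference with the teacher's 1-best output."""
    return [(src, teacher(src)) for src in sources]

# Invented corpus: the female form is rare in the original references.
original = [("doctor", "Arzt")] * 9 + [("doctor", "Ärztin")]
teacher = train_majority_teacher(original)
distilled = distill_corpus([src for src, _ in original], teacher)

print(Counter(tgt for _, tgt in original))   # both variants present
print(Counter(tgt for _, tgt in distilled))  # only the majority variant remains
```

A real teacher is far less deterministic than this lookup table, so the effect in practice is a shift in proportions rather than a complete loss of the rare variant.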
Previous analyses of SeqKD have focused on general linguistic metrics rather than probing tasks such as lexical disambiguation. Distilled data have been characterized as less noisy and more deterministic than the original target (Gu et al., 2018), as having a reduced fertility and distortion (Zhang et al., 2018), reduced lexical diversity (Xu et al., 2021), and as being less complex while preserving faithfulness (Zhou et al., 2020). Concurrent work (Silva et al., 2021) studies distillation in the context of masked language models, showing that distilled models have a more pronounced bias according to standard metrics. Ding et al. (2021) examine SeqKD in non-autoregressive MT, where it is shown to decrease translation accuracy with respect to rare words in word-aligned parallel data. Finally, Renduchintala et al. (2021) show that MT models optimized for speed have an increased gender bias, analyzing techniques that are complementary to distillation, namely reduction of beam size, shallow decoders, efficient attention and quantization. In this paper, we perform a case study on distillation based on established probing tasks in MT, using a novel evaluation protocol in order to reliably identify patterns of overgeneralization.

Linguistic Motivation
Recall of Pattern-matching Approaches Evaluation methods for disambiguation can have limited recall, which adds noise to comparisons of related systems. For example, Scherrer et al. (2020) find that on average, 20-28% of translations remain undecidable given the MuCoW gold variants.
WinoMT follows a pattern-matching approach as well, but without using reference translations. The predicted gender is inferred based on word alignment and language-specific morphosyntactic analysis. Stanovsky et al. (2019) have found that such an analysis yields an average agreement with human annotations of 87%, and manual whitelisting was proposed to further improve accuracy (Kocmi et al., 2020). However, a certain amount of noise cannot be avoided. In a small percentage of cases, the analyzers cannot determine the grammatical gender; Kocmi et al. (2020) counted around 13% such unresolved translations for Russian. In addition, errors in alignment or morphosyntactic analysis can cause a small number of false positives or negatives.
Classification of Gender In WinoMT, the gender of an English pronoun is compared to the morphosyntactic properties of a translated noun. However, an English pronoun conveys the notion of a speaker about a real-world referent (notional gender; McConnell-Ginet, 2014), while in languages such as German, Romance languages or Russian, the gender of nouns is sometimes arbitrary (grammatical gender; Bender, 2013). For example, the German noun Wache 'guard' is grammatically feminine but can refer to any person. Thus, morphosyntactic analysis, despite being a good heuristic in general, can lead to classification errors regarding gender.
Furthermore, a disagreement between notional and grammatical gender can at times be interpreted as a generic, rather than false, translation. However, this seems unlikely for WinoMT because most sentences describe concrete individuals of known gender. A notable exception may be Russian, where masculine nouns are used for many occupations (e.g., врач 'doctor') and the choice of grammatical gender can be influenced by factors such as prestige or historical connotation in addition to the referent's gender (Wade, 2011). Finally, an exclusive focus on grammatical gender may penalize efforts to create gender-neutral translations. For example, neutral terms in German tend to coincide with a feminine grammatical gender, such as Pflegekraft 'care worker' or Fachperson 'specialist person'.

Proposed Evaluation Protocol
Given those considerations, we propose an alternative evaluation protocol that does not impose hard constraints on a translation (Figure 3):
1. Translate the original source with the model that is being evaluated.
2. Construct variants of the source that provide a stronger disambiguation cue. The variants are contrastive: Some disambiguate the source correctly, others do so incorrectly.
3. Use a translation model (evaluator) to score the translation from (1) conditioned on the contrastive sources. Compute $\text{score}_{\text{correct}}$ as the best evaluator score over the correctly modified sources, and $\text{score}_{\text{incorrect}}$ as the best evaluator score over the incorrectly modified sources. The overall score for the translation is defined as:
$$\text{score} = \frac{\text{score}_{\text{correct}}}{\text{score}_{\text{correct}} + \text{score}_{\text{incorrect}}}$$
Note that the evaluated model could be used as its own evaluator. To make comparisons between many models, however, we prefer to evaluate all models with the same state-of-the-art ensemble.
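The protocol can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: `scorer` is a hypothetical interface that returns the evaluator's probability of a translation given a source, and `toy_scorer` is an invented stand-in for a real white-box MT model.

```python
from typing import Callable, Sequence

# Hypothetical interface: probability of `translation` given `source`.
Scorer = Callable[[str, str], float]

def contrastive_score(
    translation: str,
    correct_sources: Sequence[str],
    incorrect_sources: Sequence[str],
    scorer: Scorer,
) -> float:
    """Overall score in [0, 1]; values above 0.5 favor the correct reading."""
    score_correct = max(scorer(src, translation) for src in correct_sources)
    score_incorrect = max(scorer(src, translation) for src in incorrect_sources)
    return score_correct / (score_correct + score_incorrect)

def toy_scorer(source: str, translation: str) -> float:
    """Invented stand-in: prefers German '-in' forms after a 'female' cue."""
    feminine = translation.endswith("in")
    cue_is_female = source.startswith("female")
    return 0.75 if feminine == cue_is_female else 0.25

# Gold label 'female': the cue 'female doctor' is the correct contrastive source.
print(contrastive_score("Ärztin", ["female doctor"], ["male doctor"], toy_scorer))
print(contrastive_score("Arzt", ["female doctor"], ["male doctor"], toy_scorer))
```

Because the evaluated translation is only passed in as a string, the system that produced it can remain a black box; only the evaluator needs to expose probabilities.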

Disambiguation Cues
The disambiguation cues used for contrastive conditioning can have textual form, or have the more generic form of an additional input feature (Sennrich and Haddow, 2016). In our experiments, we insert a textual cue into the source sentence because this enables us to use an off-the-shelf MT model as evaluator.
MuCoW We prefix the ambiguous word with another word that hints at a specific sense:
Original source: We ought to have avocados as a starter.
Modified source (correct): We ought to have avocados as a dinner starter.
Modified source (incorrect): We ought to have avocados as a motor starter.
In order to automatically find disambiguation cues to insert, we use a masked language model (Liu et al., 2019) to generate a set of candidates for each sense. In a second step, we select the insertions that prove to be most discriminative for contrastive conditioning; we use the reference translations provided by MuCoW as a validation set. We select 3 correct and 3 incorrect insertions per sense. Implementation details are provided in Appendix D.
Table 1: […] based on contrastive sources. The example illustrates that contrastive conditioning is informative: The first two translations (#1-2), which disambiguate doctor correctly while differing in word choice, receive an overall score > 0.5. On the other hand, the two incorrect translations (#3-4) receive a score < 0.5, and the overall score is close to 0.5 for translations that are ambiguous either due to gender-neutral language (#5) or word omission (#6).

Gendered Occupations
We add the adjectives 'female' and 'male' in brackets. We treat the disambiguation cue that agrees with the WinoMT gold label as correct and the other cue as incorrect.
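Constructing the contrastive sources amounts to inserting a cue word before the ambiguous word. The helper below is a hypothetical sketch: `insert_cue` is not part of any released code, matching by whitespace tokenization is a simplification, and the exact bracket format `[female]` is an assumption about how the adjectives are rendered.

```python
def insert_cue(sentence: str, target_word: str, cue: str) -> str:
    """Insert `cue` directly before the first occurrence of `target_word`."""
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        # Ignore trailing punctuation and case when matching the target word.
        if token.strip(".,;:!?").lower() == target_word.lower():
            return " ".join(tokens[:i] + [cue] + tokens[i:])
    raise ValueError(f"{target_word!r} not found in sentence")

# Word sense disambiguation (cues taken from the MuCoW example in the text).
wsd_source = "We ought to have avocados as a starter."
print(insert_cue(wsd_source, "starter", "dinner"))  # correct sense
print(insert_cue(wsd_source, "starter", "motor"))   # incorrect sense

# Gendered occupations (bracket format is an assumption).
gender_source = "The assistant asked the doctor if she needs any help."
print(insert_cue(gender_source, "doctor", "[female]"))  # agrees with 'she'
print(insert_cue(gender_source, "doctor", "[male]"))    # contrastive cue
```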

Weighting of Samples
Unweighted Accuracy In the simplest form, we can define the accuracy of the evaluated model to be the proportion of samples where score > 0.5.
Category-wise Weighted Accuracy However, the evaluator score can be interpreted as a form of confidence, since the likelihood that contrastive conditioning misjudges a translation is highest where score ≈ 0.5. We propose to downweight samples where the evaluator has low confidence, using a weighting scheme that maintains the balance of categories. For each category (e.g. 'male'), we rank the samples in decreasing order by |score − 0.5|, and assign to each sample i a weight proportional to n − rank(i), where n is the size of the category. We use the weights to compute a weighted accuracy. Table 1 illustrates the scoring process on the example of hand-crafted translations. Samples with low evaluator confidence tend to be difficult to judge, either because they carry over the ambiguity, or because the ambiguous part of the source has been omitted in the translation.
Table 2: In a human validation study, we compare different automated evaluation methods to human evaluation of machine translations. Each figure is the proportion of agreement between human evaluation and automated evaluation. The agreement in the last row is weighted according to category-wise weighting.
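Both accuracy variants can be sketched as follows for the samples of a single category. The text does not state whether ranks start at 0 or 1; this sketch assumes 0-based ranks so that the least confident sample still receives a non-zero weight. The scores are invented.

```python
from typing import Sequence

def unweighted_accuracy(scores: Sequence[float]) -> float:
    """Proportion of samples with score > 0.5."""
    return sum(s > 0.5 for s in scores) / len(scores)

def category_weighted_accuracy(scores: Sequence[float]) -> float:
    """Weighted accuracy within one category (e.g. 'male').

    Samples are ranked by decreasing |score - 0.5|; sample i gets a weight
    proportional to n - rank(i), with 0-based ranks (an assumption).
    """
    n = len(scores)
    ranked = sorted(scores, key=lambda s: abs(s - 0.5), reverse=True)
    weights = [n - rank for rank in range(n)]
    correct = sum(w for w, s in zip(weights, ranked) if s > 0.5)
    return correct / sum(weights)

# Invented scores: two confident correct samples, one uncertain incorrect one.
scores = [0.9, 0.8, 0.45]
print(unweighted_accuracy(scores))         # 2 of 3 samples are correct
print(category_weighted_accuracy(scores))  # the uncertain error counts less
```

Computing the weights separately per category keeps the categories balanced: a category with many low-confidence samples does not lose total weight relative to the others.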

Validation of Contrastive Conditioning
Human Validation We perform a human validation study to verify that contrastive conditioning is a viable alternative to the pattern-matching baselines. English-German and English-Russian machine translations, together with the sources, are blindly annotated by 2 language professionals (up to 200 samples per language, task and annotator). For MuCoW samples, we ask: Is the translation closer to the correct sense cluster than to the wrong one? For WinoMT, the question asked is: Does the occupation name convey the gender implied by its context?
The inter-annotator agreement is in the range expected for the tasks, ranging from κ = 0.95 for English-German WinoMT to κ = 0.20 for English-Russian WinoMT. The low agreement for the latter task confirms that the degree to which Russian occupation nouns are generic is highly subjective, as discussed in Section 3.1.
Figure 4: A Sankey diagram visualizing how sequence-level knowledge distillation amplifies gender imbalances for the most frequent occupations in the WMT19 English-German corpus. Redistribution of grammatical gender is observed mostly when the teacher regenerates the target side of the training data, but the student further increases the imbalance when prompted to translate the same data. Unidentified translation variants are classified as 'Other'.
Our results (Table 2) show that contrastive conditioning has a high proportion of agreement to the human annotations. Details of the human validation study are described in Appendix E.
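For readers who want to reproduce agreement figures of this kind, the two quantities can be computed as below. This is a sketch under an assumption: the text reports κ without naming the variant, and Cohen's kappa for two annotators is used here. The labels are invented.

```python
from collections import Counter
from typing import Hashable, Sequence

def proportion_of_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Share of samples on which two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = proportion_of_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Invented judgments (1 = translation judged correct, 0 = incorrect).
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 1, 0, 1]
print(proportion_of_agreement(annotator_1, annotator_2))
print(cohens_kappa(annotator_1, annotator_2))
```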

Agreement of Different Evaluator Models
As an additional validation of our method, we analyze whether the choice of evaluator model can influence the results. Namely, we rank all 27 models evaluated in the case study (Section 4) with 3-4 state-of-the-art models that have been independently trained with various random seeds (Ng et al., 2019). The evaluators largely agree in their rankings. When averaging Spearman's rank correlation coefficients over all pairs of evaluators for each task and language pair, the maximum average is 0.99 for English-German WinoMT, and the minimum is 0.88 for English-Russian WinoMT. In the latter case, a lower rank correlation is expected given that some of the evaluated models perform very closely.
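The averaging over evaluator pairs can be sketched as follows. This is an illustrative implementation, not the one used for the reported numbers: it uses the closed-form Spearman formula, which assumes there are no tied accuracies, and the accuracy values are invented.

```python
from itertools import combinations
from typing import List, Sequence

def ranks(values: Sequence[float]) -> List[int]:
    """Rank 1 = highest value. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

def mean_pairwise_spearman(accuracies: Sequence[Sequence[float]]) -> float:
    """Average rho over all pairs of evaluators.

    accuracies[e][m] is the accuracy evaluator e assigns to evaluated model m.
    """
    pairs = list(combinations(accuracies, 2))
    return sum(spearman(x, y) for x, y in pairs) / len(pairs)

# Invented accuracies of four evaluated models under three evaluators.
by_evaluator = [
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.5, 0.6, 0.9],
    [0.5, 0.7, 0.6, 0.8],  # swaps the two middle models
]
print(mean_pairwise_spearman(by_evaluator))
```

With tied accuracies, a rank-averaging implementation would be needed instead of the closed-form formula.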

Effect on Gender-Neutral Language
The WinoMT dataset includes a small number of source sentences that contain the pronoun they. Such examples show that translation of gender cannot solely be understood as a disambiguation problem, and has more complex aspects. In this study, we follow Kocmi et al. (2020) and exclude neutral inputs from the dataset since we use the dataset as a proxy to quantify disambiguation error. However, contrastive conditioning does not depend on binary labels, and we hope that the proposed method could also aid the assessment of gender-neutral or nonbinary translation (Cao and Daumé III, 2020; Saunders et al., 2020). Furthermore, it was mentioned in Section 3.1 that gender-neutral language can be a valid way to translate ambiguous source sentences, but one that, in some languages, is difficult to evaluate based on grammatical gender. While our approach does not directly recognize neutral language, such edge cases can likely be identified by a low evaluator confidence. We downweight cases with low confidence using the above weighting scheme, given that within the framework of WSD, translations that preserve the ambiguity of the source are usually not considered to be disambiguation errors (Popescu-Belis, 2019).

Case Study: Assessing Disambiguation Bias in Distilled Models

Hypothesis
We hypothesize that distillation could also impair disambiguation quality. Our hypothesis is motivated by a simple data analysis, which is visualized in Figure 4.
Motivating Analysis When searching for English occupation nouns in the English-German parallel training data from the WMT19 news translation task (Barrault et al., 2019), we find that the gender ratio is considerably skewed: Of the corresponding German references, 39.4% contain common translation variants that convey the notion of a male person (e.g., Fahrer 'male driver') and only 3.6% contain common female variants (e.g., Fahrerin 'female driver'); the remaining 57.0% contain neutral, impersonal, or lexically rare translations. In addition to real-world labor inequalities, this skew can be explained by linguistic phenomena such as generic masculines (Lessinger, 2020). As shown in Figure 4, we also analyze the translations across two additional iterations of the data: the distilled data generated by the teacher, and translations of the same training data by the student model. We average our counts across three teachers and three students trained with different random seeds, which allows us to report standard deviations. As expected, the distilled data have a higher imbalance, with male forms increasing by 5.1 percentage points (±0.4) and female forms decreasing by 0.2 percentage points (±0.0). Moreover, the student further increases this imbalance (despite having the same size and capacity as the teacher), with male forms growing by 0.7 percentage points (±0.2) and female forms slightly decreasing by 0.1 percentage points (±0.1). It seems plausible that smaller students would develop an even stronger bias.
Limitations of Data Analysis However, such a word count analysis of distillation has clear limitations (we describe implementation details in Appendix A and address further limitations in the impact statement). One limitation is that there is a large number of translations that cannot be automatically classified. Focusing on precision not only leaves a very large group of translations classified as 'Other' but could itself be a source of bias.
Secondly, the word count analysis does not take into account whether a source sentence provides sufficient context for disambiguation, or whether a source would be inherently ambiguous even to human translators. While it seems likely that the lexical overgeneralization observed in the above analysis will also cause translation errors, such errors can only be quantified using source sentences that are known to have a salient context. Our preliminary data analysis as well as its potential limitations thus strongly motivate a targeted evaluation of SeqKD models with respect to disambiguation.

Experimental Setup
To have a controlled setup, we trained teachers and students from scratch using Fairseq (Ott et al., 2019). In addition, we distilled state-of-the-art MT systems (Ng et al., 2019). The main difference between the 'Scratch' and the 'SOTA' teachers is that Ng et al. (2019) used advanced filtering, backtranslation and domain-adaptive fine-tuning.
Architecture For the teachers, we used the big Transformer architecture (Vaswani et al., 2017) with a doubled feed-forward size of 8192. For the students and for further baselines we trained two additional sizes, small and mini. Table A1 compares the three sizes and their parameter numbers.
Data To train our teachers we used the English-German and English-Russian parallel training data from the WMT19 news translation task (Barrault et al., 2019). We reused the BPE vocabularies computed by Ng et al. (2019), which are joint for English-German and disjoint for English-Russian. We also filtered sentences longer than 250 tokens as well as pairs with a length ratio larger than 1.5.
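The length-based filter can be sketched as a simple predicate. The details are assumptions, since the text only gives the two thresholds: here the length limit applies to both sides, the ratio is computed symmetrically (longer side divided by shorter side), and empty sides are dropped.

```python
from typing import Sequence

def keep_pair(
    src_tokens: Sequence[str],
    tgt_tokens: Sequence[str],
    max_len: int = 250,
    max_ratio: float = 1.5,
) -> bool:
    """Return True if a sentence pair passes the length filters."""
    if not src_tokens or not tgt_tokens:
        return False  # assumption: empty sides are discarded
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    return longer / shorter <= max_ratio

print(keep_pair(["a"] * 10, ["b"] * 15))    # ratio exactly 1.5 is kept
print(keep_pair(["a"] * 10, ["b"] * 16))    # ratio 1.6 is filtered out
print(keep_pair(["a"] * 251, ["b"] * 251))  # too long
```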
Hyperparameters For each language pair we trained 3 'Scratch' teachers with different random seeds. Each teacher was then used to train an individual student per size. We repeated this procedure with the 'SOTA' teachers. We used Adam with an initial learning rate of 5e-4, FP16 training, label smoothing with ε = 0.1, and a dropout of 0.3. We trained with a token batch size of 16k, and we selected the best checkpoint with respect to BLEU based on the newstest sets from the preceding years.
For decoding, beam search with size 5 was used.
Figure 5: Performance of English-German (left) and English-Russian (right) translation models in terms of BLEU on newstest19, and MuCoW (top) as well as WinoMT (bottom) minimum accuracy. In order to highlight compression effects, students of varying capacity that share the same teacher are connected with a line. Additional lines connect baseline models of varying capacity that were trained from scratch with the same random seed.
[…] sets such as MuCoW and WinoMT. We use minimum accuracy over the categories of those test sets.
For MuCoW, we use the minimum accuracy over the categories 'frequent' and 'rare'. Word senses are categorized as 'frequent' if they occur more often than alternative senses in the training data. The other, alternative senses are categorized as 'rare' (our word counting is described in Appendix A).
For WinoMT, we use the minimum accuracy over the categories 'male' and 'female'. This is a deviation from what has been used in previous work on WinoMT, but we believe that it serves as an adequate measure of overgeneralization. While in theory, minimum accuracy is motivated by the difference principle of distributive justice (Rawls, 1971), in practice we find that minimum accuracy is consistently found in the categories 'rare' (MuCoW) and 'female' (WinoMT). This confirms that minimum accuracy captures overgeneralization bias. Absolute differences ($\Delta_G$ or $\Delta_S$; Stanovsky et al., 2019) or a ratio of categories (M:F; Saunders and Byrne, 2020) do not take overall performance into account and thus assign good scores to models with low accuracy. In addition, minimum accuracy is easy to interpret, ranging from 0 to 100, with higher meaning better.
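The argument for minimum accuracy can be made concrete with two hypothetical models. The numbers below are invented, and `absolute_difference` is a simplified stand-in for difference-based measures (it compares per-category accuracies rather than reproducing the exact definition used in prior work).

```python
from typing import Dict

def minimum_accuracy(accuracy_by_category: Dict[str, float]) -> float:
    """Accuracy of the worst-performing category (0 to 100, higher is better)."""
    return min(accuracy_by_category.values())

def absolute_difference(accuracy_by_category: Dict[str, float]) -> float:
    """Simplified gap measure: best minus worst category (lower looks better)."""
    values = accuracy_by_category.values()
    return max(values) - min(values)

# Invented accuracies for two hypothetical models.
models = {
    "low_but_balanced": {"male": 40.0, "female": 38.0},
    "high_but_skewed": {"male": 90.0, "female": 75.0},
}
for name, acc in models.items():
    print(name, minimum_accuracy(acc), absolute_difference(acc))
```

The gap measure prefers the first model even though it translates both categories poorly, whereas minimum accuracy prefers the second model, whose weakest category is still handled far better.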

Results
Figure 5 shows our results, which are listed in tabular form in Appendix H. While BLEU is positively correlated with minimum accuracy according to our overgeneralization probing tasks, student models tend to perform worse on the probing tasks than other models with a similar BLEU score. In order to statistically confirm this observation, we perform a multiple linear regression analysis for each task-language pair (Table 3). We find that BLEU has a significant positive correlation to accuracy on the overgeneralization probing tasks and that SeqKD has a significant negative correlation. The overall regression is statistically significant (p < 0.05).
We also note that for all English-Russian models, the minimum accuracy on gendered occupations is worse than random, which is in line with previous findings by Kocmi et al. (2020). Remarkably, English-Russian teachers trained from scratch are far outperformed by their best students in terms of BLEU, but the students are more biased towards overgeneralization.

| Task | Variable | Coefficient |
| --- | --- | --- |
| MuCoW EN-DE | BLEU score | 1.78* |
| MuCoW EN-DE | SeqKD is used | -6.47* |
| MuCoW EN-RU | BLEU score | 1.39* |
| MuCoW EN-RU | SeqKD is used | -2.14* |
| WinoMT EN-DE | BLEU score | 5.00* |
| WinoMT EN-DE | SeqKD is used | -7.56* |
| WinoMT EN-RU | BLEU score | 2.52* |
| WinoMT EN-RU | SeqKD is used | -5.21* |

Table 3: A multiple regression analysis confirms that students trained with SeqKD tend to perform worse on probing tasks for overgeneralization. The dependent variable is minimum accuracy on the probing task; as independent variables the BLEU score is used, as well as a binary variable describing whether the model was trained as a SeqKD student (*: significant at p < 0.05).
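The structure of such a regression (an intercept, the BLEU score, and a binary SeqKD indicator) can be sketched with ordinary least squares in plain Python. This is an illustration only: the data are synthetic and generated from a known linear relation, the helper names are hypothetical, and significance testing is not included.

```python
from typing import List, Sequence

def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Least-squares coefficients [intercept, b_1, ..., b_k] via normal equations."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(xi[p] * xi[q] for xi in x) for q in range(k)] for p in range(k)]
    xty = [sum(xi[p] * yi for xi, yi in zip(x, y)) for p in range(k)]
    return solve(xtx, xty)

# Synthetic models: (BLEU score, 1 if trained as a SeqKD student else 0).
rows = [(30.0, 0), (35.0, 0), (40.0, 0), (32.0, 1), (36.0, 1), (41.0, 1)]
# Synthetic minimum accuracies generated from a known relation (not real results).
y = [10 + 2 * bleu - 5 * seqkd for bleu, seqkd in rows]

intercept, coef_bleu, coef_seqkd = fit_ols(rows, y)
print(round(intercept, 6), round(coef_bleu, 6), round(coef_seqkd, 6))
```

A negative coefficient on the SeqKD indicator means that, at equal BLEU, students are predicted to score lower on the probing task.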

Discussion
Contrastive conditioning is a new protocol for evaluating and comparing MT systems with regard to word sense disambiguation. In our analysis we build on established test sets, but have replaced the standard pattern-matching analysis with contrastive conditioning. A human validation study, with English as a source language, showed that the approach is reliable, especially if the samples are weighted according to evaluator confidence. An advantage of our approach over pattern-matching is that it can process any potential translation, not just translations containing pre-defined lemmas or translations that have certain morphosyntactic properties. Furthermore, an advantage over conventional contrastive evaluation methods is that the decoding mode of the evaluated model does not need to be constrained. Thus, black-box systems, for example APIs of commercial systems, can be evaluated too. Finally, test sets can remain reference-free, and we even believe that neither strong assumptions nor deeper expertise of a target language are strictly required to perform contrastive conditioning (even though in this paper, we put forward some linguistically informed arguments to motivate our approach).
A limitation of contrastive conditioning is that a disambiguation cue needs to be provided in the source language. For error types other than disambiguation, such a cue might be difficult to create. In this paper, we have built on purely textual cues, which enabled us to use an off-the-shelf translation model for scoring. Linguistic input features (Sennrich and Haddow, 2016; Stafanovičs et al., 2020) could provide an alternative disambiguation cue in future work.
Our case study of knowledge distillation, which is based on contrastive conditioning, shows that SeqKD can lead to lexical overgeneralization, and to a loss of adequacy in disambiguation that is generally not captured by BLEU.

Conclusion
In order to evaluate MT models on disambiguation, we have devised a novel evaluation method, contrastive conditioning. It allows for a reference-free, black-box evaluation of MT models with respect to disambiguation, requiring only that a strong disambiguation cue can be provided in contrastive sources. Based on this evaluation method, we have presented a case study of translation models trained with sequence-level knowledge distillation. Focusing on the issue of lexical overgeneralization in word sense disambiguation, we have tested the models on word sense disambiguation and the translation of gendered occupations. Our results indicate that sequence-level knowledge distillation can amplify existing imbalances in the training data, and typically leads to an increased overgeneralization bias. We encourage future work to develop methods that reap the benefits of knowledge distillation with minimal increases in bias.

Broader Impact
We use the term 'bias' to describe a behavioral tendency of NLP systems that goes undetected by common evaluation metrics. While we focus on how it affects the accuracy of translations, overgeneralization bias does not just have a technical dimension but also a social one (Hovy and Spruit, 2016; Sheng et al., 2021), especially with regard to sensitive categories such as gender. Therefore, our findings could also inform a socio-political discussion of model compression, provided that such a discussion is normatively well-founded (Blodgett et al., 2020).
Our preliminary data analysis (Section 4.1) is based on gender as a variable, which warrants some ethical reflection (Larson, 2017). Our analysis is based on a very large collection of English occupation nouns and their translations into German. We categorize the notional gender of the German translations as 'male' or 'female' in cases where grammatical gender is a valid indication. While the automatic inference of gender is discouraged in many research contexts, we believe that our approach is adequate, since in this case, rather than the personal gender of human subjects, the notional gender of nouns is inferred (McConnell-Ginet, 2014). However, gender-neutral or alternative ways of expressing gender are not separately counted. Thus, the preliminary data analysis should be understood as a motivating example of lexical overgeneralization, and does not constitute a comprehensive corpus analysis of gender.

A Word Count Methodology
There are two instances where we count word occurrences in the training data:
• We count the senses of ambiguous English words in order to divide the samples into the categories 'frequent' and 'rare'. For this we use an approximate method.
• We count the genders of German occupation names to inform Section 4.1 and Appendix C. We make sure to use a method with high precision for this.
Word Senses For an approximate count of English word senses, we use a method similar to that of Raganato et al. (2020). The MuCoW dataset represents a sense of an English source word as a cluster of target-language lemmas. We thus count a sentence pair as an occurrence of a sense if the source word appears in the source and at least one of the target lemmas appears in the lemmatized reference. We lemmatize the data using Stanza (Qi et al., 2020). The count is approximate since (a) the provided variants in the target language do not cover all possible translations and (b) the lemmatization is noisy. Still, we expect the counts to be proportional to the true sense distribution.
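The counting rule for word senses can be illustrated with a minimal sketch. It assumes that sentence pairs are already tokenized and lemmatized (for example with Stanza, which is not called here), and the function name `count_senses` as well as the two German lemma clusters are hypothetical placeholders rather than actual MuCoW entries.

```python
from collections import Counter

def count_senses(pairs, source_word, sense_clusters):
    """Approximate sense counts for one ambiguous source word.

    pairs: iterable of (source_tokens, reference_lemmas) tuples.
    sense_clusters: dict mapping a sense id to a set of target lemmas.
    """
    counts = Counter()
    for source_tokens, reference_lemmas in pairs:
        # The pair only counts if the source word appears in the source.
        if source_word not in source_tokens:
            continue
        lemmas = set(reference_lemmas)
        for sense_id, cluster in sense_clusters.items():
            # One occurrence if at least one cluster lemma is in the reference.
            if lemmas & cluster:
                counts[sense_id] += 1
    return counts

# Hypothetical clusters for the English word "strain".
clusters = {"stress": {"Belastung", "Anstrengung"}, "lineage": {"Stamm"}}
pairs = [
    (["it", "was", "a", "strain"], ["es", "sein", "ein", "Belastung"]),
    (["a", "new", "strain"], ["ein", "neu", "Stamm"]),
    (["no", "match", "here"], ["kein", "Stamm"]),
]
print(count_senses(pairs, "strain", clusters))
```

A sense whose cluster lemmas never appear in a reference is simply not counted, which is one reason the resulting numbers are only approximate.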
Occupation Names To count the genders of German occupations, we list common German translations of each English occupation name. We only list variants that have an identifiable gender across the full morphological paradigm, and whose grammatical gender usually matches the notional gender. (Most German occupational terms meet this criterion, but there are exceptions such as Angestellte, whose male and female paradigms intersect, and the gender-neutral Wache mentioned above.) On average, we list 3-4 male lemmas per occupation, and the same number of female variants. For each lemma, we enumerate the complete paradigm and search the data for each inflectional form. Note that masculine occupation nouns usually have more inflectional forms than feminine nouns, but we do not expect this to influence our results since the totals over the full paradigm should be comparable. We count each sentence pair as an occurrence of 'male' if the English occupation is found in the source and one of the male forms is found in the target sequence. If one of the female forms is found, we count the occurrence as 'female', and if no known forms, or both male and female forms, are found, we classify the translation as 'other'.

Table A1: Transformer sizes used for students and non-distilled baselines.
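The three-way classification of occupation translations described in the paragraph on occupation names can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name `classify_gender` is hypothetical, and the inflectional forms listed for doctor are an abridged example rather than the lists used in the analysis.

```python
def classify_gender(source_tokens, target_tokens, occupation,
                    male_forms, female_forms):
    """Classify one sentence pair as 'male', 'female' or 'other'.

    Returns None if the English occupation is not in the source,
    in which case the pair is not counted at all.
    """
    if occupation not in source_tokens:
        return None
    target = set(target_tokens)
    has_male = bool(target & male_forms)
    has_female = bool(target & female_forms)
    if has_male and not has_female:
        return "male"
    if has_female and not has_male:
        return "female"
    # No known form, or both male and female forms, were found.
    return "other"

# Abridged, illustrative paradigms for the occupation "doctor".
male = {"Arzt", "Arztes", "Ärzte", "Ärzten"}
female = {"Ärztin", "Ärztinnen"}
print(classify_gender(["the", "doctor", "arrived"],
                      ["Die", "Ärztin", "kam", "an"],
                      "doctor", male, female))
```

Requiring an exact match against enumerated forms favors precision over recall: unlisted or gender-neutral variants end up in 'other' instead of being guessed.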

C Occupational Stereotypes
Since WinoMT also uses a notion of stereotypes (∆S), we considered using this metric for our analysis. However, the top 25% of occupations in the English-German training data are all predominantly associated with male word forms. In the top 50% of occupations, which together have a relative frequency of 95%, there are just two occupations that are mostly associated with female forms in the data (nurse and cleaner). We did find some correlation between the female ratios in the training data and the percentages derived by Zhao et al. (2018) from U.S. labor statistics, with a Pearson coefficient of r = 0.69. Still, since the predominant stereotype in the German training data is 'male' for all but 2 occupations that we searched, we did not extend our analysis to occupational stereotypes.

D Disambiguation Cues for WSD
After some experimentation, we decided to rely on a fully automated approach for finding suitable insertions, which involves a masked language model to generate inserted words, and a validation process to select the most discriminative insertions. Insertions are generated based on the MuCoW source sentences. For every sentence, we place a <mask> token before the ambiguous word and predict a probability distribution using RoBERTa (Liu et al., 2019). For each sense cluster, we select the 10 words with the highest predicted probability of occurring in the example sentences but not in the counterexamples, and 10 words vice versa. We then use the reference translations provided by MuCoW as a validation set to reduce these potential disambiguation cues to the 3 correct and 3 incorrect cues that are most discriminative for contrastive conditioning. Correct disambiguation cues are discriminative if the evaluator assigns a good score to the reference, and vice versa.
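The two selection steps can be sketched under explicit assumptions. The sketch takes masked-language-model probabilities as plain dictionaries (no model is called), and it assumes that "occurring in the example sentences but not in the counterexamples" is measured as the difference of mean probabilities; the text does not specify the exact criterion, so this margin and the function names `top_candidates` and `most_discriminative` are hypothetical.

```python
def mean_probs(distributions):
    """Average a list of word -> probability dicts, one dict per sentence."""
    if not distributions:
        return {}
    totals = {}
    for dist in distributions:
        for word, prob in dist.items():
            totals[word] = totals.get(word, 0.0) + prob
    return {word: total / len(distributions) for word, total in totals.items()}

def top_candidates(example_probs, counter_probs, k=10):
    """Words likely before the ambiguous word in examples but not in counterexamples."""
    pos = mean_probs(example_probs)
    neg = mean_probs(counter_probs)
    # Assumed criterion: margin between the two mean probabilities.
    margin = {word: prob - neg.get(word, 0.0) for word, prob in pos.items()}
    return sorted(margin, key=margin.get, reverse=True)[:k]

def most_discriminative(cue_scores, k=3, correct=True):
    """Keep the k cues that best separate on the validation references.

    cue_scores: cue -> evaluator score of the reference translation when
    conditioned on the source containing that cue. Correct cues should
    yield high scores, incorrect cues low scores.
    """
    return sorted(cue_scores, key=cue_scores.get, reverse=correct)[:k]

examples = [{"stress": 0.4, "the": 0.3}, {"stress": 0.2, "the": 0.3}]
counterexamples = [{"genetic": 0.5, "the": 0.3}]
print(top_candidates(examples, counterexamples, k=1))
print(most_discriminative({"stress": 0.9, "muscle": 0.7, "the": 0.5, "a": 0.4}))
```

Function words such as "the" receive similar probabilities on both sides and therefore drop out of the candidate list under this margin.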

E Human Validation
For each language and each task we annotate a subsample of machine translations and compute the proportion of agreement between human evaluation and automated evaluation methods. The annotations are created as follows:
• For MuCoW, we translate the in-domain source sentences with state-of-the-art ensembles (Ng et al., 2019). We first evaluate the translations using the pattern-matching evaluation method proposed by Scherrer et al. (2020), using Stanza (Qi et al., 2020) for lemmatization. We then randomly select a subset of translations for validation: up to 200 translations that are undecidable given pattern-matching evaluation, and a larger subset of decidable translations proportionate to the overall ratio of decidable translations. For the former we collect human annotations; for the latter we assume that the pattern-matching evaluation is correct.
• With regard to WinoMT, we annotate translations originally collected by Stanovsky et al. (2019).
In both datasets, annotators have found some edge cases, which we handle as follows when converting the raw labels to binary labels: In MuCoW, we skip some samples that have been marked by our annotators as badly defined, e.g. because the sense definitions overlap too much, or because the gold label is wrong due to a misaligned or mistranslated reference. This only affects the subset of the samples that are undecidable for the original MuCoW algorithm, since we do not manually review the other samples. In WinoMT, we skip samples with a neutral gold label because they are out of the scope of this study (Section 3.6). Furthermore, some translations have been marked as neutral because they preserve the ambiguity of the source (e.g. gender-neutral translations); we treat those cases as correct translations. Finally, evaluators have marked a small number of translations as undecidable, e.g. because the ambiguous part of the input was ignored by the MT system; we treat those cases as disambiguation errors.
7 https://github.com/gabrielStanovsky/mt_gender
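The conversion of raw annotator labels to binary labels can be summarized in a short sketch. The label strings below are hypothetical normalized names, not the exact labels shown to annotators, and the sketch assumes that the same mapping applies to both datasets; skipping WinoMT samples with a neutral gold label is a separate filter on the gold data and is not shown.

```python
def to_binary(raw_label):
    """Map a raw annotation to True (correct), False (error) or None (skip)."""
    if raw_label == "bad_sample":
        # Badly defined samples are skipped.
        return None
    if raw_label in ("correct", "neutral"):
        # Translations that preserve the source ambiguity count as correct.
        return True
    if raw_label in ("wrong", "undecidable"):
        # Undecidable translations count as disambiguation errors.
        return False
    raise ValueError(f"unknown label: {raw_label!r}")

raw = ["correct", "neutral", "undecidable", "bad_sample", "wrong"]
binary = [label for label in map(to_binary, raw) if label is not None]
print(binary)  # [True, True, False, False]
```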
Inter-annotator agreement is reported in Table A3. In Table A4 we compare the annotations originally collected by Stanovsky et al. (2019) to the corresponding subset of our own WinoMT annotations. The moderate agreement on WinoMT underlines that especially in Russian, the interpretation of occupation nouns can be subjective.
Based on the human annotations, we compute the proportion of agreement of different evaluation approaches. To ensure a fair comparison with the pattern-matching approach to MuCoW, we do not treat all indecisions as disagreements. Instead we follow the notion of recall (B) proposed by Scherrer et al. (2020) and judge undecidable translations as wrong, which may be in agreement or disagreement with the human annotator. For pattern-matching evaluation of WinoMT, we use the reference implementation and make sure that the word alignment is computed on the full dataset, rather than the validation subset.
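As a minimal sketch of this agreement computation, the function below treats an undecidable automatic verdict (represented here as `None`, which is an assumption of this sketch) as a 'wrong' judgment before comparing it with the binary human label. The function name `proportion_of_agreement` is hypothetical.

```python
def proportion_of_agreement(human, automatic):
    """Share of samples where the automatic verdict matches the human label.

    human: list of bool (True = correct disambiguation).
    automatic: list of bool or None, where None means undecidable.
    """
    if len(human) != len(automatic):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("need at least one sample")
    matches = 0
    for human_label, verdict in zip(human, automatic):
        # Undecidable translations are judged as wrong, which may or may
        # not agree with the human annotator.
        if verdict is None:
            verdict = False
        matches += int(verdict == human_label)
    return matches / len(human)

human = [True, False, True, False]
automatic = [True, None, None, True]
print(proportion_of_agreement(human, automatic))  # 0.5
```

In the usage example, the second sample is undecidable but still agrees with the annotator, since both end up as 'wrong'.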

Annotator Guidelines for MuCoW
The goal of this annotation is to evaluate machine translation of ambiguous nouns: Is the translation closer to the correct sense cluster than to the wrong one?
Explanation of the data provided to you:
Correct Sense The translation is closer to the correct sense cluster.
Wrong Sense The translation is closer to the other sense clusters.
Both / Neutral / Ambiguous The translation preserves the ambiguity found in the source sentence.
Bad sample / Ill-defined senses The sample is not adequate for evaluating word sense disambiguation, e.g. due to overlapping sense clusters or because the gold senses are not consistent with the source sentence.
Translation too bad to tell / Third sense It is impossible to assign a label due to bad translation quality.

Annotator Guidelines for WinoMT
The goal of this annotation is to evaluate machine translation of occupation names: Does the occupation name convey the gender implied by its context?
Explanation of the data provided to you:
Entity The evaluated occupation in English. Please note that only one occupation name per sentence is evaluated, even though the sentence might contain multiple occupations.
Translation The machine-translated sentence that is evaluated.
Source Sentence The original English sentence.
Explanation of the labels:
Male The occupation name conveys a male gender.
Female The occupation name conveys a female gender.
Both / Neutral / Ambiguous The translation preserves the ambiguity found in the source sentence.
Translation too bad to tell It is impossible to assign a label due to bad translation quality.
Other remarks:
• Please annotate semantic gender, not grammatical gender.
• Please only take into consideration the occupation noun itself and associated articles. Specifically, try to ignore any pronouns referring to the occupation noun. Pronouns will often disagree with the occupation noun. It is of utmost importance that the pronouns do not influence your annotation. To give an example in English, the label for the following sentence should be 'female', not 'male': The actress looked at himself in the mirror.

Ambiguous source: It never used to feel like a strain.
Machine translation: Никогда не было ощущения напряжения.
Human rating: correct
- Source with correct disambiguation cue: It never used to feel like a stress strain. Score: 0.62
- Source with incorrect disambiguation cue: It never used to feel like a genetic strain. Score: 0.33

Ambiguous source: The conductor must be complete for our plan to succeed.
Machine translation: Дирижер должен быть полным, чтобы наш план увенчался успехом.
Human rating: incorrect
- Source with correct disambiguation cue: The light conductor must be complete for our plan to succeed. Score: 0.63
- Source with incorrect disambiguation cue: The choir conductor must be complete for our plan to succeed. Score: 0.69

Table A5: Examples of contrastive conditioning based on the MuCoW testset (Raganato et al., 2019) for WSD. For each example, we show the correctly modified source and the incorrectly modified source with the highest evaluator scores. The modifications were generated with a masked language model and selected as described in Appendix D. A higher score means a better score, thus, contrastive conditioning generally agrees with the human gold rating. A special case is the third German example, where the ambiguous word (tag) is deleted in the translation. Contrastive conditioning assigns very similar scores to the contrastive sources of that example.

- Source with correct disambiguation cue: The receptionist helped the [female] construction worker and got a present from her. Score: 0.50
- Source with incorrect disambiguation cue: The receptionist helped the [male] construction worker and got a present from her. Score: 0.56

Ambiguous source: The guard asked the librarian to leave because it was late and she will be leaving soon.
Machine translation: Die Wache bat den Bibliothekar zu gehen, weil es spät war und sie wird bald gehen.
Human rating: neutral
- Source with correct disambiguation cue: The [female] guard asked the librarian to leave because it was late and she will be leaving soon. Score: 0.57
- Source with incorrect disambiguation cue: The [male] guard asked the librarian to leave because it was late and she will be leaving soon. Score: 0.56
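The way these examples are read can be illustrated with a simplified decision rule. This is an assumption made for illustration only: the sketch compares the highest evaluator score among correct cues with the highest score among incorrect cues, mirroring the scores shown in the examples, whereas the scoring actually used for evaluation is the one defined in the main text. The function name `judge_translation` is hypothetical.

```python
def judge_translation(correct_cue_scores, incorrect_cue_scores):
    """Return True if the translation looks like a correct disambiguation.

    Each argument lists evaluator scores of the same translation when
    conditioned on sources with correct or incorrect disambiguation cues.
    """
    if not correct_cue_scores or not incorrect_cue_scores:
        raise ValueError("need at least one score per cue type")
    # Assumed rule: the translation is judged correct if it is scored
    # higher under the best correct cue than under the best incorrect cue.
    return max(correct_cue_scores) > max(incorrect_cue_scores)

print(judge_translation([0.62], [0.33]))  # 'strain' example: True
print(judge_translation([0.63], [0.69]))  # 'conductor' example: False
```

Both outcomes match the human ratings of the two WSD examples. For near-ties such as 0.57 versus 0.56 in the guard example, a hard threshold like this one is fragile, which is consistent with that translation being rated as neutral.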