Understanding the Properties of Minimum Bayes Risk Decoding in Neural Machine Translation

Neural Machine Translation (NMT) currently exhibits biases such as producing translations that are too short and overgenerating frequent words, and shows poor robustness to copy noise in training data or domain shift. Recent work has tied these shortcomings to beam search – the de facto standard inference algorithm in NMT – and Eikema & Aziz (2020) propose to use Minimum Bayes Risk (MBR) decoding on unbiased samples instead. In this paper, we empirically investigate the properties of MBR decoding on a number of previously reported biases and failure cases of beam search. We find that MBR still exhibits a length and token frequency bias, owing to the MT metrics used as utility functions, but that MBR also increases robustness against copy noise in the training data and domain shift.


Introduction
Neural Machine Translation (NMT) currently suffers from a number of issues such as underestimating the true length of translations (Koehn and Knowles, 2017; Stahlberg and Byrne, 2019; Kumar and Sarawagi, 2019), underestimating the probability of rare words and over-generating very frequent words (Ott et al., 2018), or being susceptible to copy noise in the training data (Khayrallah and Koehn, 2018). In out-of-domain translation, hallucinations (translations that are fluent but unrelated to the source) are common (Koehn and Knowles, 2017; Lee et al., 2018; Müller et al., 2020).
Recently, Eikema and Aziz (2020) have highlighted the role of the decision rule, namely searching for the highest-scoring translation, and have argued that it is at least partially to blame for some of these biases and shortcomings. They found that sampling from an NMT model is faithful to the training data statistics, while beam search is not. They recommend the field look into alternative inference algorithms based on unbiased samples, such as Minimum Bayes Risk (MBR) decoding.
We believe MBR has potential to overcome several known biases of NMT. More precisely, if a bias can be understood as being caused by the mode-seeking nature of beam search then we hypothesize that MBR could exhibit less bias. We view short translations, copies of the source text and hallucinations as hypotheses that are probable, but quite different to other probable hypotheses. If such pathological hypotheses are in a pool of samples, it is unlikely that MBR would select them as the final translation.
While Eikema and Aziz (2020) compare the statistical properties of samples and beam search outputs, and show that MBR can perform favourably compared to beam search according to automatic metrics, our paper aims to perform a targeted study of MBR and its properties, specifically its effects on the biases and shortcomings discussed previously. In our experiments we find that:
• If used with a utility function that favours short translations, MBR inherits this bias;
• MBR still exhibits a token probability bias in that it underestimates the probability of rare tokens and overestimates very common tokens;
• Compared to beam search, MBR decoding is more robust to copy noise in the training data;
• MBR exhibits higher domain robustness than beam search. We demonstrate that MBR reduces the amount of hallucinated content in translations.
Beam search belongs to a broader class of inference procedures called maximum-a-posteriori (MAP) algorithms. What MAP algorithms have in common is that they attempt to find the most probable translation under a given model. Essentially, they try to recover the mode of the output distribution over sequences.
An exact solution to this search problem is usually intractable. Beam search is an approximation that is tractable, but it also frequently fails to find the true mode of the distribution (Stahlberg and Byrne, 2019).

Known deficiencies of NMT systems
NMT systems are known to be deficient in a number of ways. We describe here only the ones relevant to our discussion and experiments.
Length bias: Systems underestimate the true length of translations. On average, their translations are shorter than references (Koehn and Knowles, 2017; Stahlberg and Byrne, 2019; Kumar and Sarawagi, 2019).
Skewed word frequencies: In translations, tokens that occur frequently in the training data are overrepresented. On the other hand, rare tokens occur fewer times than their probability in the training data would suggest (Ott et al., 2018).
Beam search curse: Increasing the beam size leads to finding translations that are more probable under the model. In theory, this should improve translation quality. Paradoxically, empirical results show that large beam sizes decrease quality (Koehn and Knowles, 2017; Ott et al., 2018).
Susceptibility to copy noise: Copied content in the training data disproportionately affects translation quality. More specifically, the most detrimental kind are copies of the source sentence on the target side of the training data (Khayrallah and Koehn, 2018). If such copies are present in the training data, copy hypotheses will be overrepresented in beam search (Ott et al., 2018).
Low domain robustness: Systems are not robust under distribution shifts such as domain shift. Having a system translate in an unknown test domain often does not gradually degrade translation quality, but leads to complete failure cases called hallucinations (Lee et al., 2018; Koehn and Knowles, 2017; Müller et al., 2020).
Much past research has attributed those deficiencies to model architectures or training algorithms, while treating beam search as a fixed constant in experiments. In contrast, Eikema and Aziz (2020) argue that the fit of the model is reasonable, which means that neither the model itself nor its training can be at fault. Rather, they argue that the underlying problem is beam search.
Inadequacy of the mode: Stahlberg and Byrne (2019) and Eikema and Aziz (2020) suggest that the mode of the distribution over output sequences is in fact not the best translation. On the contrary, it seems that in many cases the mode is the empty sequence (Stahlberg and Byrne, 2019). In addition, it appears that the probability of the mode is not much different from very many other sequences, as the output distribution is quite flat in an extensive region of output space (Eikema and Aziz, 2020).
Intuitively, it makes sense that such a situation could arise in NMT training: maximum likelihood estimation training does not constrain a model to be characterized well by its mode only. If the mode is inadequate, then obviously that is problematic for a mode-seeking procedure such as beam search, and MAP inference in general. In fact, MAP decoding should be used only if the mode of the output distribution can be trusted (Smith, 2011).
An alternative is a decision rule that considers how different a translation is from other likely translations.

Minimum Bayes Risk Decoding
MBR decoding was used in speech recognition (Goel and Byrne, 2000) and statistical machine translation (Kumar and Byrne, 2004; Tromble et al., 2008). More recently, MBR was also used to improve beam search decoding in NMT (Stahlberg et al., 2017; Shu and Nakayama, 2017; Blain et al., 2017). Eikema and Aziz (2020) are the first to test a variant of MBR that operates on samples instead of an n-best list generated by beam search.
We give here a simplified, accessible definition of MBR in the context of NMT. Essentially, the goal of MBR is to find not the most probable translation, but the one that minimizes the expected risk for a given loss function and the true posterior distribution. In practice, the set of all possible candidate translations can be approximated by drawing from the model a pool of samples S of size n:
$$S = (s_1, \ldots, s_n) \sim p(y \mid x, \theta) \qquad (1)$$
where x is the source sentence and \theta denotes the model parameters. The same set of samples can also be used to approximate the true posterior distribution. Then for each sample s_i in S, its expected utility (the inverse risk) is computed by comparing it to all other samples in the pool. The sample with the highest expected utility is selected as the final translation:
$$y^{*} = \operatorname*{argmax}_{s_i \in S} \frac{1}{n} \sum_{j=1}^{n} u(s_i, s_j) \qquad (2)$$
The size of the pool n and the utility function u are hyperparameters of the algorithm. A particular utility function typically computes the similarity between a hypothesis and a reference translation. Therefore, MBR "can be thought of as selecting a consensus translation [...] that is closest on average to all likely translations" (Kumar and Byrne, 2004).
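The selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the names `mbr_decode` and `unigram_f1` are hypothetical, the toy utility is a token-set F1 that merely stands in for sentence-level BLEU, CHRF or METEOR, and the pool is hand-written rather than sampled from a model. The average runs over the whole pool, including the sample itself; for utilities where u(s, s) is the same constant for every s, this selects the same translation as averaging over the other samples only.

```python
from typing import Callable, Sequence


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility (assumption): F1 over whitespace-token sets."""
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(samples: Sequence[str],
               utility: Callable[[str, str], float]) -> str:
    """Return the sample with the highest average utility against the pool."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)

    def expected_utility(i: int) -> float:
        # Every sample in the pool serves as a pseudo-reference for sample i.
        return sum(utility(samples[i], samples[j]) for j in range(n)) / n

    best = max(range(n), key=expected_utility)
    return samples[best]


# Hand-written pool: three plausible translations, one that is too short,
# and one copy of a (hypothetical) German source sentence.
pool = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat is on the mat",
    "the cat",
    "die Katze sitzt auf der Matte",
]
print(mbr_decode(pool, unigram_f1))
```

In this toy pool, the short hypothesis and the source copy agree with few other samples and therefore receive a low expected utility, which is the intuition behind the hypotheses we test in the following sections. Note that the cost is quadratic in the pool size n, since every sample is compared with every other sample.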

Motivation for experiments
We hypothesize that MBR decoding is useful for a certain class of failure cases encountered with beam search. Namely, if an incorrect translation from beam search can be characterized as a hypothesis that is likely but fairly different from other hypotheses with similar probability, then MBR is expected to improve over beam search.
Several known deficiencies of NMT systems outlined in Section 2.2 belong to this class of beam search failures. For instance, length bias occurs when a beam search translation is shorter than other hypotheses with comparable probability. Likewise, translations that are copies of the input sentence or hallucinations (translations that are fluent, but unrelated to the input) can be avoided with MBR if they are not common in a pool of samples.
Finally, we study the skewedness of token frequencies in translations. Eikema and Aziz (2020) study lexical biases in NMT models, showing that model samples have higher agreement with the training distribution than MAP output. We investigate whether this is also true for MBR decoding, focusing on the well-known bias towards frequent tokens.

Data
We use data for a number of language pairs from the Tatoeba Challenge (Tiedemann, 2020). Individual language pairs are fairly different in terms of language families, scripts and training set sizes. See Appendix A for details about our data sets.
For one additional experiment on out-of-domain robustness we use data from Müller et al. (2020). This data set is German-English and defines 5 different domains of text (medical, it, koran, law and subtitles). Following Müller et al. (2020) we train our model on the medical domain, and use data in other domains to test domain robustness.
We hold out a random sample of the training data for testing purposes. The size of this sample varies between 1k and 5k sentences, depending on the overall size of the training data.

Models
Our preprocessing and model settings are inspired by OPUS-MT (Tiedemann and Thottingal, 2020). We use Sentencepiece (Kudo, 2018) with subword regularization as the only preprocessing step, which takes care of both tokenization and subword segmentation. The desired number of pieces in the vocabulary varies with the size of the data set.
We train NMT models with Sockeye 2 (Domhan et al., 2020). The models are standard Transformer models (Vaswani et al., 2017), except that some settings (such as word batch size and dropout rate) vary with the size of the training set. Following Eikema and Aziz (2020) we disable label smoothing so as to get unbiased samples.

Decoding and evaluation
In all experiments, we compare beam search to MBR decoding and in most cases also to single samples. For beam search, we always use a beam size of 5. Single samples are drawn at least 100 times to show the resulting variance.
If not stated otherwise, all results presented are on a test set held out from the training data, i.e. are certainly in-domain, which avoids any unintended out-of-domain effects.
We evaluate automatic translation quality with BLEU (Papineni et al., 2002), CHRF (Popović, 2016) and METEOR (Denkowski and Lavie, 2014). We compute BLEU and CHRF with SacreBLEU (Post, 2018). See Appendix B for details.
Figure 1: CHRF1 scores of MBR decoding on two test corpora: the standard Tatoeba test set (out-of-domain) and a test set of held-out training data (in-domain). Plots show the difference between MBR and beam search, as a function of the number of samples used for MBR.
MBR also depends on samples, so we repeat each MBR experiment twice to show the resulting variance. We also vary the number of samples used with MBR, from 5 to 100 in increments of 5. Finally, we produce MBR translations with different utility functions. All of the utility functions are sentence-level variants of our evaluation metrics: BLEU, CHRF or METEOR. See Table 1 for an overview of utility functions. If not stated otherwise, MBR results are based on 100 samples and use chrf-1 as the utility function.

Length bias
We evaluate MBR decoding with different utility functions. There is no single utility function which performs best on all evaluation metrics. Instead, any of our evaluation metrics can be optimized by choosing a closely related utility function (see Figure 2 and Appendix D). For instance, chrf-2 as the utility function leads to the best CHRF2 evaluation scores.
Number of samples: We find that the translation quality of MBR increases steadily as the number of samples grows (see Figure 2). This means that MBR does not suffer from the beam search curse where single pathological hypotheses in a large beam can jeopardize translation quality.
We analyze the lengths of translations produced by different decoding methods in Table 2 (see Appendix E for additional statistics). We find that in terms of mean length of translations, beam search underestimates the true length of translations, even when hypotheses are normalized. Hypotheses generated by sampling better match the reference length. This is in line with the findings of Eikema and Aziz (2020).
For MBR decoding, it is clear that the choice of utility function has an impact on the mean length of the resulting translations. For instance, employing sentence-level BLEU as the utility function leads to translations that are too short. BLEU is a precision-based metric known to prefer shorter translations on the sentence level (Nakov et al., 2012).
chrf-2 and meteor emphasize recall more, and the resulting MBR translations overestimate the true length of translations. On the other hand, chrf-0.5, a CHRF variant with a bias for precision, leads to the shortest translations overall.
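To make the link between the precision/recall weighting and length more concrete, the following sketch evaluates the standard F-beta score for two idealized hypotheses. The function name `f_beta` and the precision and recall values are illustrative assumptions, not measurements from our experiments: the "short" hypothesis is fully precise but covers only half of the pseudo-reference, while the "long" hypothesis covers all of it at half the precision.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score; beta > 1 weights recall more, beta < 1 precision."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Idealized (precision, recall) pairs, chosen for illustration only.
short_hyp = (1.0, 0.5)  # short: everything it says is right, but half is missing
long_hyp = (0.5, 1.0)   # long: covers everything, but half of it is superfluous

for beta in (0.5, 1.0, 2.0):
    s = f_beta(*short_hyp, beta)
    l = f_beta(*long_hyp, beta)
    print(f"beta={beta}: short={s:.3f} long={l:.3f}")
```

With beta = 2 the long hypothesis scores higher, with beta = 0.5 the short one does, and with beta = 1 both score the same. A utility function with this weighting would push MBR towards longer or shorter translations accordingly.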
We test whether we can reduce length biases by symmetrizing our utility functions u as follows:
$$u_{\mathrm{sym}}(s_i, s_j) = H\big(u(s_i, s_j),\, u(s_j, s_i)\big)$$
where H is the harmonic mean. This should avoid favouring either recall or precision, but in practice even symmetric utility functions lead to translations that are shorter than references on average.
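A minimal sketch of such a symmetrization is given below, assuming that the symmetric utility is the harmonic mean of the utility computed in both directions. The wrapper `symmetrize` and the toy utility `token_precision` are hypothetical names introduced for illustration; the toy utility is deliberately asymmetric and is not one of the utility functions used in our experiments.

```python
from typing import Callable

Utility = Callable[[str, str], float]


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative scores; 0 if either score is 0."""
    if a <= 0.0 or b <= 0.0:
        return 0.0
    return 2 * a * b / (a + b)


def symmetrize(u: Utility) -> Utility:
    """Wrap u so that swapping hypothesis and pseudo-reference gives the same score."""
    def u_sym(s_i: str, s_j: str) -> float:
        return harmonic_mean(u(s_i, s_j), u(s_j, s_i))
    return u_sym


def token_precision(hyp: str, ref: str) -> float:
    """Toy asymmetric utility (assumption): share of hypothesis token types found in ref."""
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / len(h) if h else 0.0


u_sym = symmetrize(token_precision)
a, b = "the cat", "the cat sat"
print(token_precision(a, b), token_precision(b, a))  # differs by direction
print(u_sym(a, b), u_sym(b, a))                      # same in both directions
```

The wrapped function can be passed to an MBR implementation in place of the original utility without any other change.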
Based on these observations we conclude that MBR inherits length biases associated with its utility function.

Token frequency bias
Beam search overgenerates tokens that are very common in the training data and undergenerates rare tokens (see Section 2.2). Sampling on the other hand assigns correct probabilities to common and rare tokens. Given that MBR is based on samples, does it share this property with sampling? In Figure 3 we show that this is not the case. Although the skewedness of probabilities is less severe for MBR than for beam search, MBR still assigns too high a probability to frequent events. A reason for this is that our utility functions are based on surface similarity between samples, so rare tokens, which will be sampled rarely, will thus also have low utility.
Unfortunately, there is a trade-off between correct probability statistics for very common and very rare words and translation quality. The most faithful statistics can be obtained from sampling, but sampling leads to the worst overall translation quality.

Domain robustness
In general, as the number of samples grows, MBR approaches but does not outperform beam search on our in-domain data (see Figure 1). On our out-of-domain data, the gap between MBR and beam search is smaller. We hypothesize that MBR may be useful for out-of-domain translation.
We evaluate MBR on a domain robustness benchmark by Müller et al. (2020). Figure 4 shows that on this benchmark MBR outperforms beam search on 2 out of 4 unknown test domains. A possible reason why MBR is able to outperform beam search in unknown domains is that it reduces hallucinated translations. To test this hypothesis, we define a hallucination as a translation that has a CHRF2 score of less than 0.01 when compared to the reference, inspired by Lee et al. (2018).
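The threshold-based definition above can be sketched as follows. This is an illustration under explicit assumptions: `simple_chrf` is a simplified, self-contained character n-gram F-score on a 0 to 1 scale and is not the SacreBLEU CHRF implementation used for the reported numbers; `hallucination_rate` is a hypothetical helper name, and the example sentences are invented.

```python
from collections import Counter
from typing import Sequence


def char_ngrams(text: str, n: int) -> Counter:
    """Counts of character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hyp: str, ref: str, beta: float = 2.0, max_n: int = 6) -> float:
    """Simplified character n-gram F-beta score in [0, 1] (stand-in for CHRF)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # one side is too short for this n-gram order
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p == 0.0 or r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def hallucination_rate(hyps: Sequence[str], refs: Sequence[str],
                       threshold: float = 0.01) -> float:
    """Fraction of hypotheses scoring below the threshold against their reference."""
    flags = [simple_chrf(h, r) < threshold for h, r in zip(hyps, refs)]
    return sum(flags) / len(flags) if flags else 0.0


hyps = ["the cat sat on the mat", "zzz qqq"]
refs = ["the cat sat on a mat", "hello world"]
print(hallucination_rate(hyps, refs))
```

Because the threshold is so low, only hypotheses with almost no character-level overlap with the reference are counted, which makes this a conservative estimate of hallucinated output.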
Given this definition of hallucination, Figure 5 shows that on average, MBR assigns a lower utility score to hypotheses that are hallucinations. Similarly, MBR reduces the percentage of hallucinations found in the final translations, compared to beam search or sampling. To summarize, we find that MBR decoding has a higher domain robustness than beam search.

Impact of copy noise in the training data
If copies of source sentences are present on the target side of training data, copies are overrepresented in beam search (Section 2.2). Here we test whether MBR suffers from this copy bias as well.
We create several versions of our training sets where source copy noise is introduced with a probability between 0.1% and 50%. As shown in Figure 6, MBR and beam search are comparable if there are few copies in the training data. However, if between 5% and 25% of all training examples are copies, then MBR outperforms beam search by a large margin (> 10 BLEU for Arabic-German).
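One way to introduce such noise is sketched below. It assumes that, independently for each training pair and with the given probability, the target side is replaced by a copy of the source side; whether noise replaces existing pairs or is added as extra pairs is an assumption of this sketch. The function name `add_copy_noise` and the synthetic sentence pairs are hypothetical.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def add_copy_noise(pairs: List[Pair], prob: float, seed: int = 0) -> List[Pair]:
    """Replace the target by the source with probability `prob` for each pair."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so that noisy corpora are reproducible
    return [(src, src) if rng.random() < prob else (src, tgt)
            for src, tgt in pairs]


# Synthetic corpus for illustration only.
corpus = [(f"Quellsatz {i}", f"target sentence {i}") for i in range(10000)]
for prob in (0.001, 0.05, 0.25, 0.5):
    noisy = add_copy_noise(corpus, prob)
    share = sum(src == tgt for src, tgt in noisy) / len(noisy)
    print(f"prob={prob}: {share:.3%} copies")
```

The source side is left untouched, so the noisy corpus has the same size and the same source sentences as the clean one.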
As further evidence for the ability of MBR to tolerate copy noise we present an analysis of copies in Figure 7. We define a copy as a translation with a word overlap with the reference of more than 0.9. We show that MBR assigns a much lower utility to copy hypotheses than to all hypotheses taken together. In the final translations, MBR manages to reduce copies substantially. For instance, if around 10% of the training examples are copies, beam search produces around 50% copies, while MBR reduces this number to below 10%.
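The copy criterion can be sketched as a simple overlap check. The exact overlap measure is an assumption here: the sketch uses the number of shared word tokens divided by the length of the longer of the two texts, and the names `word_overlap`, `is_copy` and `copy_rate` are hypothetical. The second argument is the text a translation is compared against (the reference, in the definition above).

```python
from collections import Counter
from typing import Sequence


def word_overlap(hyp: str, other: str) -> float:
    """Shared word tokens divided by the length of the longer text (assumption)."""
    h, o = Counter(hyp.split()), Counter(other.split())
    longest = max(sum(h.values()), sum(o.values()))
    if longest == 0:
        return 0.0
    return sum((h & o).values()) / longest


def is_copy(hyp: str, other: str, threshold: float = 0.9) -> bool:
    """A translation counts as a copy if the overlap exceeds the threshold."""
    return word_overlap(hyp, other) > threshold


def copy_rate(hyps: Sequence[str], others: Sequence[str],
              threshold: float = 0.9) -> float:
    """Fraction of translations flagged as copies."""
    flags = [is_copy(h, o, threshold) for h, o in zip(hyps, others)]
    return sum(flags) / len(flags) if flags else 0.0


hyps = ["das ist ein Test", "this is a test"]
others = ["das ist ein Test", "das ist ein Test"]
print(copy_rate(hyps, others))
```

Dividing by the longer text means that a translation which merely contains the other text as a small part of a much longer output is not flagged as a copy.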
We conclude from this experiment that MBR is more robust to copy noise in the training data. We acknowledge that this setting is artificial because copy noise can easily be removed from data sets. Nonetheless, it is a striking example of a known shortcoming of NMT systems usually attributed to the model or training procedure, when in fact beam search is at least partially to blame.

Conclusion and future work
MBR decoding has recently regained attention in MT as a decision rule with the potential to overcome some of the biases of MAP decoding in NMT. We empirically study the properties of MBR decoding with common MT metrics as utility functions, and find it still exhibits a length bias and token frequency bias similar to beam search. The length bias is closely tied to the utility function. However, we also observe that MBR decoding successfully mitigates a number of well-known failure modes of NMT, such as spurious copying, or hallucinations under domain shift. The mechanism by which MBR achieves such robustness is that copies or hallucinated hypotheses in a pool of samples are assigned low utility and are therefore unlikely to be selected as the final translation.
In our experiments, MBR did not generally outperform beam search according to automatic metrics, but we still deem it a promising alternative to MAP decoding due to its robustness. For future work, we are interested in exploring more sophisticated similarity metrics to be used as utility functions, including trainable metrics such as COMET (Rei et al., 2020), and investigating how these utility functions affect the overall quality and biases of translations.

Note on reproducibility
We will not only release the source code used to train our models (as is common in NLP papers at the moment), but also a complete pipeline of code that can be run on any instance in a fully automated fashion. This will allow others to reproduce our results, including the graphs and tables shown in this paper, in a consistent way with minimal changes. We encourage the community to attempt to reproduce our results and to publish their findings.

B Evaluation details
For evaluation metrics that require tokenization (BLEU and METEOR), we use the standard mteval-v13a tokenization implemented in SacreBLEU. We do not use any language-specific tokenization rules even if they are available for the target language. The SacreBLEU signatures for our CHRF and BLEU evaluation metrics are listed in Table 4.

C Comments on the development sets distributed with the Tatoeba challenge
The Tatoeba Challenge (Tiedemann, 2020) distributes training, development and test data for a large number of language pairs. What is peculiar about the challenge is that the training data is assembled from various sources through OPUS (Tiedemann, 2012), while the development and test data are contributed by users of Tatoeba. This means that the development and test set can be considered out-of-domain material. We investigated this issue and conclude that it does not constitute a problem. When both the development and test data are sampled from the training data, the results are similar to the ones we present in this paper, except for a small overall shift.

E Additional length tables
We provide additional length statistics for utility functions used with MBR in Table 5.