Sampling-Based Approximations to Minimum Bayes Risk Decoding for Neural Machine Translation

In NMT we search for the mode of the model distribution to form predictions. The mode and other high-probability translations found by beam search have been shown to often be inadequate in a number of ways. This prevents improving translation quality through better search, as these idiosyncratic translations end up selected by the decoding algorithm, a problem known as the beam search curse. Recently, an approximation to minimum Bayes risk (MBR) decoding has been proposed as an alternative decision rule that would likely not suffer from the same problems. We analyse this approximation and establish that it has no equivalent to the beam search curse. We then design approximations that decouple the cost of exploration from the cost of robust estimation of expected utility. This allows for much larger hypothesis spaces, which we show to be beneficial. We also show that mode-seeking strategies can aid in constructing compact sets of promising hypotheses and that MBR is effective in identifying good translations among them. We conduct experiments on three language pairs varying in the amount of available resources: English into and from German, Romanian, and Nepali.


Introduction
NMT systems (Sutskever et al., 2014; Bahdanau et al., 2015) are trained to predict a conditional probability distribution over translation candidates of any given source sentence. After training, choosing a translation for a given input requires a decision rule: a criterion to elect a 'preferred' translation. MAP decoding, the most common decision rule in NMT, seeks the most probable translation under the model (i.e., the mode of the distribution). MAP decoding and its approximations such as beam search (Graves, 2012) have been under scrutiny. Stahlberg and Byrne (2019) show that the true mode is oftentimes inadequately short or empty. Better approximate search is known to hurt quality (Koehn and Knowles, 2017; Murray and Chiang, 2018; Kumar and Sarawagi, 2019), a problem known as the beam search curse. The success of beam search depends on search biases introduced by hyperparameters such as beam size and length normalisation, which are tuned not to correlate with the objective of MAP decoding, but rather to strike a compromise between mode-seeking search and properties of reasonable translations. Despite its success, a number of problems have been observed: length bias (Cho et al., 2014; Sountsov and Sarawagi, 2016), word frequency bias (Ott et al., 2018), susceptibility to copy noise (Khayrallah and Koehn, 2018; Ott et al., 2018), and hallucination under domain shift (Lee et al., 2019; Müller et al., 2020; Wang and Sennrich, 2020).

[1] Code is available at github.com/roxot/mbr-nmt.
Eikema and Aziz (2020) argue that the inadequacy of the mode in NMT is a reasonable consequence of the translation space being combinatorial and unbounded. They show that, while distributions predicted by NMT do reproduce various statistics of observed data, they tend to spread probability mass almost uniformly over a large space of translation candidates. This makes their precise ranking in terms of probability mass a fragile criterion for prediction. While some of these candidates are possibly inadequate (e.g., the empty sequence), most of them are similar to one another and exhibit appreciable structural similarity to reference translations. To make better use of the statistics predicted by NMT models, Eikema and Aziz (2020) recommend MBR decoding (Kumar and Byrne, 2004), a decision rule that seeks the translation candidate which maximises an external notion of utility (e.g., an MT evaluation metric) in expectation under the model distribution. While MBR decoding promises robustness to idiosyncratic translations, it remains intractable, much like MAP decoding. Eikema and Aziz (2020) propose an approximation based on Monte Carlo (MC) sampling, which, although tractable in principle, requires a prohibitive number of assessments of the utility function.
In this work, we first analyse the procedure of Eikema and Aziz (2020) and establish that it does not suffer from a counterpart to the beam search curse. That is, better search does not hurt translation quality. Their approximation is, however, computationally expensive, requiring a number of assessments of the utility function that is quadratic in sample size. We propose algorithms that scale linearly, allowing us to explore large hypothesis spaces, and considerably improve upon existing approximations to MBR with less computation. Finally, we find that mode-seeking strategies such as nucleus sampling and beam search can still aid MBR decoding by constructing compact sets of high expected utility hypotheses, relying on MBR to filter out any idiosyncratic translations that may be present.

NMT and Decision Rules
NMT employs neural networks (NNs) to predict a conditional probability distribution Y | θ, x over translation candidates of any given source sentence x. The sample space Y is the set of all sequences of known target-language symbols (e.g., sub-word units). NMT factorises the distribution as a chain of random draws from Categorical distributions parameterised in context:

p(y | x, θ) = ∏_{j=1}^{|y|} Cat(y_j | f(x, y_{<j}; θ)) . (1)

The prefix translation y_{<j} starts empty and grows one symbol at a time until a special end-of-sequence symbol is drawn. At each step j, f maps from varying inputs (x, y_{<j}) to a probability distribution over the vocabulary.
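The generative story above can be sketched in code. This is a minimal illustration, not the paper's implementation: `next_token_dist` stands in for the NN `f` and is assumed to return a distribution over the vocabulary given a prefix; the interface and names are ours.

```python
import random

def ancestral_sample(next_token_dist, eos, max_len=50, rng=random):
    """Draw one translation by sampling token-by-token from the model.

    `next_token_dist(prefix)` is assumed to return a dict mapping each
    vocabulary symbol to its conditional probability given the prefix,
    i.e. the Categorical distribution Cat(y_j | f(x, y_{<j}; theta)).
    """
    prefix = []
    for _ in range(max_len):
        dist = next_token_dist(tuple(prefix))
        symbols, probs = zip(*dist.items())
        # One random draw from the Categorical over the vocabulary.
        tok = rng.choices(symbols, weights=probs, k=1)[0]
        if tok == eos:
            break
        prefix.append(tok)
    return tuple(prefix)
```

Repeating this procedure N times yields the independent samples that sampling-based MBR relies on later in the paper.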
Common choices for f include recurrent networks (Sutskever et al., 2014; Bahdanau et al., 2015) and Transformers (Vaswani et al., 2017). Given a dataset of observed translation pairs, the NN parameters θ are estimated to attain a local optimum of the regularised log-likelihood function.
After training, and for a given input, choosing a translation requires a decision rule to map from a distribution over translation candidates to a single 'preferred' translation. The most common decision rule in NMT is MAP decoding, which outputs the mode of the conditional distribution. Despite the widespread intuition that MAP decoding is an obvious choice, maximum likelihood estimation (MLE) is oblivious to our desire to form predictions.

MAP Decoding
Maximum-a-posteriori (MAP) decoding outputs the most probable translation under the model:

y_MAP := argmax_{y ∈ Y} p(y | x, θ) . (2)

As this is intractable, beam search (Graves, 2012; Sutskever et al., 2014) is used. Beam search is a pruned version of breadth-first search which maintains an active set of k partial translations. For large beam size k, translation quality degrades (Koehn and Knowles, 2017) and the exact y_MAP is often the empty sequence (Stahlberg and Byrne, 2019). Therefore, in practice, the beam size is kept small and the objective in Equation (2) is regularised to up-rank longer hypotheses (Wu et al., 2016; Murray and Chiang, 2018).
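The up-ranking of longer hypotheses cited above (Wu et al., 2016) is commonly implemented as a length-penalised score. A minimal sketch of that GNMT-style penalty, for illustration only:

```python
def length_normalised_score(log_prob, length, alpha=0.6):
    # GNMT-style length penalty (Wu et al., 2016): divide the hypothesis
    # log-probability by ((5 + |y|) / 6) ** alpha, so that at equal
    # log-probability a longer hypothesis receives a higher score.
    penalty = ((5 + length) / 6) ** alpha
    return log_prob / penalty
```

With alpha = 0 this reduces to the unregularised MAP objective; larger alpha pushes the beam towards longer outputs.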

MBR Decoding
Minimum Bayes risk (MBR) decoding stems from the principle of maximisation of expected utility (Berger, 1985). A utility function u(y, h) measures the benefit of choosing h ∈ Y when y ∈ Y is the ideal decision. When forming predictions, we lack knowledge about ideal translations and must decide under uncertainty. MBR lets the model fill in 'ideal decisions' probabilistically as we search through the space of candidates for the one which is assigned highest utility in expectation:

y_MBR := argmax_{h ∈ Y} µ_u(h; x, θ), where µ_u(h; x, θ) := E_{p(y|x,θ)}[u(y, h)] . (3)

MBR has a long history in parsing (Goodman, 1996; Sima'an, 2003), speech recognition (Stolcke et al., 1997; Goel and Byrne, 2000), and MT (Kumar and Byrne, 2002, 2004).
In MT, u can be a sentence-level evaluation metric (e.g., METEOR (Denkowski and Lavie, 2011) or sentence-level BLEU (Chen and Cherry, 2014)). Intuitively, whereas the MAP prediction is the translation to which the model assigns highest probability, no matter how idiosyncratic, the MBR prediction is the translation that is closest (under the chosen u) to all other probable translations. See Figure 1 for an illustration of this concept. Seeking support for a prediction not only in terms of probability but also in terms of utility makes MBR decoding robust to situations where inadequate translations are assigned high probability, as often happens with the empty string (Stahlberg and Byrne, 2019), or when the training data are noisy (Ott et al., 2018), too small (Eikema and Aziz, 2020), or distant from the test domain (Müller and Sennrich, 2021).
It is a well-known result that for the 'exact match' utility, u(y, h) := 1{y = h}, the expected utility of h is p(h | x, θ), hence MBR and MAP decoding have the same optimum under this choice (Kumar and Byrne, 2002). This view justifies MAP decoding as an instance of MBR, where decisions are optimised with respect to a strict notion of translational equivalence. In machine translation evaluation, exact match is a questionable choice of utility function. For example, it is unable to capture paraphrases or any other form of semantic equivalence.
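This reduction of MBR to MAP under exact match can be verified on a toy distribution. The distribution and strings below are made up for illustration; `expected_utility` computes the expectation in Equation (3) exactly, which is only feasible for such a small explicit distribution.

```python
def expected_utility(h, dist, u):
    # Exact expectation of u(y, h) under a small explicit distribution
    # dist: {candidate: probability}.
    return sum(p * u(y, h) for y, p in dist.items())

def exact_match(y, h):
    # The 'exact match' utility: 1{y = h}.
    return 1.0 if y == h else 0.0

# Toy model distribution over three candidates (illustrative numbers).
dist = {"guten tag": 0.5, "hallo": 0.3, "": 0.2}

# Under exact match, the expected utility of h equals p(h | x): the
# MBR optimum coincides with the MAP optimum.
for h, p in dist.items():
    assert abs(expected_utility(h, dist, exact_match) - p) < 1e-12
```

Swapping `exact_match` for a graded metric is precisely what lets MBR reward candidates that are merely similar to probable translations.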
As in MAP decoding, exhaustive enumeration of all hypotheses is impossible, so we must resort to a finite subset H(x) of candidates. Unlike MAP decoding, the objective function µ_u(h; x, θ) cannot be evaluated exactly. Most approximations to MBR decoding, from Kumar and Byrne (2004) to recent instances (Stahlberg et al., 2017; Shu and Nakayama, 2017; Blain et al., 2017), use k-best lists from beam search both for H(x) and to form a biased estimate of expected utility. Eikema and Aziz (2020) use unbiased samples from the model for both approximations: i) they follow the generative story in Equation (1) to obtain N independent samples y^(1), ..., y^(N), a procedure known as ancestral sampling (Robert and Casella, 2010); then, ii) for a hypothesis h, they compute an MC estimate of µ_u(h; x, θ):

µ̂_u(h; x, N) := (1/N) ∑_{n=1}^{N} u(y^(n), h) , (4)

which is unbiased for any sample size N. Eikema and Aziz (2020) use the same N samples as candidates and approximate Equation (3) by

y_N-by-N := argmax_{h ∈ {y^(1), ..., y^(N)}} µ̂_u(h; x, N) . (5)

We note that the candidates do not need to be obtained using ancestral sampling, and investigate alternative strategies in Section 5.4. It is important, however, to use ancestral samples to obtain an unbiased estimate of expected utility, as we show in Section 5.1. We call this class of MBR algorithms using unbiased MC estimation instances of sampling-based MBR decoding.
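The N-by-N procedure of Eikema and Aziz (2020) can be sketched in a few lines. This is an illustrative skeleton under our own toy interface, not the paper's code; the utility function is a plug-in argument.

```python
def mbr_n_by_n(samples, u):
    """Sampling-based MBR, N-by-N: the N ancestral samples serve both as
    the hypothesis space and as pseudo-references for the MC estimate of
    expected utility (Equations 4 and 5), costing N^2 utility calls."""
    def mc_estimate(h):
        # Unbiased MC estimate of expected utility of h (Equation 4).
        return sum(u(y, h) for y in samples) / len(samples)
    # Argmax over the same N samples (Equation 5).
    return max(samples, key=mc_estimate)
```

Note that duplicate samples naturally up-weight a candidate: a translation drawn twice contributes twice to every estimate, which is exactly how the MC approximation encodes probability mass.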

Coarse-to-Fine MBR Decoding
A big disadvantage of MBR N-by-N is that it requires N² assessments of the utility function. If U is an upper bound on the time necessary to assess the utility function once, then MBR N-by-N runs in time O(N² × U). For a complex utility function, this can grow expensive even for a modest hypothesis space. As NMT distributions have been shown to be high-entropy (Ott et al., 2018; Eikema and Aziz, 2020), the quadratic cost prevents us from sufficiently exploring the space of translations. Therefore, we investigate and propose more flexible algorithms.
An important property of sampling-based MBR decoding is that MC estimation of expected utility, Equation (4), and approximation of the hypothesis space, Equation (5), really are two independent approximations. Tying the two together is no more than a design choice that can be reconsidered. We start by obtaining N translation candidates from the model, which form the hypothesis space H(x). Then, we use any number S < N of ancestral samples for approximating expected utility in Equation (4). We call this version MBR N-by-S, which takes time O(N × S × U). Compared to MBR N-by-N, this variant is able to scale to much larger hypothesis spaces H(x). In practice, however, robust MC estimation for the utility of interest may still require an S that is too large for the N we are interested in.
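The decoupling described above amounts to giving the decoder two separate sample sets. A minimal sketch under assumed inputs (any generation strategy may supply `hypotheses`; `pseudo_refs` must be ancestral samples for the estimate to stay unbiased):

```python
def mbr_n_by_s(hypotheses, pseudo_refs, u):
    """MBR N-by-S: the hypothesis space (size N) is decoupled from the
    ancestral samples used as pseudo-references for the unbiased MC
    estimate of expected utility (size S < N). Cost: N * S utility calls
    instead of N^2."""
    def mc_estimate(h):
        return sum(u(y, h) for y in pseudo_refs) / len(pseudo_refs)
    return max(hypotheses, key=mc_estimate)
```

Setting `pseudo_refs = hypotheses` (with ancestral samples) recovers the N-by-N special case.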
An idea that we explore in this work is to make use of a proxy utility that correlates with the target utility but is cheaper to compute. Even when the two do not correlate perfectly, we can use the proxy utility to filter the hypothesis space down to a manageable size T on which we can then perform robust MC estimation of expected utility. We coin this approach coarse-to-fine MBR decoding (or MBR C2F). It filters the hypothesis space to a manageable size in the coarse step and performs robust MC estimation of expected utility in the fine step:

y_C2F := argmax_{h ∈ H_T(x)} µ̂_u(h; x, L) , (6a)
H_T(x) := the top-T elements of H(x) ranked by µ̂_{u_proxy}(h; x, S) . (6b)

Upper-bounding the complexity of the proxy utility by U_proxy and that of the target utility by U_target, and using S samples for MC estimation in the coarse step (6b) and L in the fine step (6a), the complexity of this algorithm is O(N × S × U_proxy + T × L × U_target). MBR C2F decouples robust MC estimation (large L) from exploration (large N), and the cost of exploration from the cost of the target utility. As illustrated in Figure 2, we can find proxy utilities that correlate reasonably well with our target utility and are able to give us a rough but useful ordering of the hypothesis space. Rather than using a proxy utility, we could use the target utility itself in the coarse step, provided we pick a small S. This, however, most likely leads to too much variability in the ranking, as shown in Figure 2 (left).

Figure 2 (caption): Motivation for coarse-to-fine MBR. We sort 300 candidates sampled from the model along the x-axis from best to worst according to a robust MC estimate (using 1,000 samples) of expected BEER under the model. Left: feasible MC estimates (5 samples) of each candidate's expected BEER. Right: robust and inexpensive MC estimates (100 samples) of expected utility w.r.t. a simpler metric (skip-bigram F1). As estimates are stochastic, we perform 100 repetitions and plot mean ± two standard deviations. We can see that the robust estimates (right) correlate fairly well with the expensive ranking we intend to approximate (x-axis), despite the simpler utility. As we can afford more evaluations of the proxy utility, we obtain estimates of reduced variance, which leads to safer pruning. Example — src: Convercent erhielt $10 Millionen bei der Finanzierung im Februar von Firmen wie Sapphire Ventures und Tola Capital, womit das gesamte Kapital auf $47 Millionen angehoben wurde. ref: Convercent raised $10 million in funding in February from firms such as Sapphire Ventures and Tola Capital, bringing its total capital raised to $47 million.
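The coarse-to-fine procedure can be sketched as a two-stage filter followed by a re-ranking. This is an illustrative skeleton under our own interface, not the paper's implementation; both sample sets are assumed to be ancestral samples.

```python
def mbr_coarse_to_fine(hypotheses, coarse_refs, fine_refs,
                       u_proxy, u_target, top_t):
    """Coarse step: rank all N hypotheses by a cheap MC estimate of
    expected proxy utility (S = len(coarse_refs) samples) and keep the
    top-T. Fine step: re-rank the survivors by a robust MC estimate of
    expected target utility (L = len(fine_refs) samples).
    Cost: O(N*S*U_proxy + T*L*U_target) instead of O(N^2 * U_target)."""
    def expect(h, refs, u):
        return sum(u(y, h) for y in refs) / len(refs)
    survivors = sorted(hypotheses,
                       key=lambda h: expect(h, coarse_refs, u_proxy),
                       reverse=True)[:top_t]
    return max(survivors, key=lambda h: expect(h, fine_refs, u_target))
```

With `top_t = len(hypotheses)` the coarse step becomes a no-op and the procedure reduces to plain N-by-L decoding with the target utility.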

Data, Systems and Utilities
We perform experiments on three language pairs with varying amounts of resources for training: English into and from German, Romanian and Nepali. For German-English (de-en) we use all available WMT'18 (Bojar et al., 2018) news data except for Paracrawl, resulting in 5.9 million sentence pairs. We train a Transformer base model (Vaswani et al., 2017) until convergence and average the last 10 epoch checkpoints to obtain our final model. We test our models on newstest2018. For Romanian-English (ro-en) we use all available WMT'16 (Bojar et al., 2016a) news data, amounting to 565k sentence pairs. We train a Transformer base model until convergence and pick the best epoch checkpoint according to the validation loss. We test our models on newstest2016. Finally, for Nepali-English (ne-en) we use the data setup of Guzmán et al. (2019). We apply the pre-processing step of removing duplicates as in Eikema and Aziz (2020), resulting in 235k sentence pairs. We test our models on the FLORES test set, which is from a very different domain than the training data. We mimic the training setup and models used in Guzmán et al. (2019). In all models we disable label smoothing, as it has been found to negatively impact model fit, which would compromise the performance of MBR (Eikema and Aziz, 2020).
For computational efficiency, we opt for non-neural evaluation metrics for use as utility function in MBR. BEER (Stanojević and Sima'an, 2014) is a non-neural trained metric that has shown good correlation with human judgements in previous WMT metrics shared tasks (Macháček and Bojar, 2014; Stanojević et al., 2015; Bojar et al., 2016b). In experiments shown in Table 4 in Appendix B we found that using BEER as utility function performed well at pushing translation performance higher across a range of automatic evaluation metrics. We therefore use BEER as the utility of choice in our experiments and, as a consequence, consistently report corpus-level BEER scores of MBR translations as well. We also report SacreBLEU (Papineni et al., 2002; Post, 2018a) scores where relevant, to be able to detect overfitting to the utility and for comparison with other work.

Figure 3 (caption): Estimates of expected utility for various hypotheses. We plot practical estimates of expected utility (x-axis) using either ancestral, nucleus or 'beam' samples against an accurate MC estimate using 1,000 ancestral samples. The gray line depicts a perfect estimator.

Estimation of Expected Utility
We start by motivating the importance of unbiased estimates of expected utility using ancestral samples (i.e., sampling-based MBR). In Figure 3 we verify the biasedness of alternatives to ancestral sampling for this computation: nucleus sampling (Holtzman et al., 2020) and 'beam sampling' (i.e., using k-best outputs from beam search for estimating expected utility; Blain et al. (2017)). We can see, rather clearly, that estimates using nucleus samples or beam search outputs are biased away from expected utility under the model, while ancestral sampling is unbiased by design and hence should be preferred when approximating the objective function in search. Therefore, in all experiments that follow, we use ancestral samples for making unbiased estimates of expected utility, even when different methods are used to construct the hypothesis space.

N-by-N MBR
Now, we look into scaling MBR N-by-N. Eikema and Aziz (2020) only explored 30-by-30 approximations to the MBR objective. Our aim is to investigate whether MBR decoding is indeed able to scale to better translation performance with more computation. In Figure 4, we explore N from 30 to 405. As MBR optimises a specific utility (we use BEER), we report translation quality along both BEER and BLEU to detect overfitting to the metric.
We find that MBR steadily improves across language pairs as N grows larger. BLEU scores improve at a similar rate to that of BEER, showing no signs of overfitting to the utility. This is strong empirical evidence that sampling-based MBR has no equivalent to the beam search curse. We see this as an important property of a decoding objective.

N-by-S MBR
MBR N-by-N couples two approximations: tractable exploration and unbiased estimation of expected utility are based on the same N ancestral samples. Our aim is to learn more about the impact of these two approximations, for which we look into MBR N-by-S. Moreover, with fewer than N² assessments of the utility per decoding, we can also investigate larger H(x). We explore N ranging from 210 to 1005, while keeping the number of samples used for approximating the expected utility of each hypothesis smaller, with S ranging from 10 to 200. We argue that S does not need to grow at the same pace as N, as MC estimates should stabilise after a certain point. See our results in Figure 5.
We find that growing N beyond 405 improves translation quality further, even when the estimates of expected utility are less accurate. Increasing S also steadily improves translation quality, with diminishing returns in the magnitude of improvement. On the other hand, smaller values of S lead to notable deterioration of translation quality, and we note higher variance in results. For all language pairs it is possible to improve upon the best MBR N-by-N results by considering a larger hypothesis space and a smaller S. This experiment shows that the two approximations can be controlled independently and that better results are within reach if we explore more. On top of that, whereas the best setting of MBR N-by-N takes 164,025 utility assessments per decoding, MBR N-by-S with S = 100 brings this number down to 100,500 for the largest N considered, while improving BEER scores on all language pairs. We note again that increasing either N or S generally improves translation quality in our experiments. This further strengthens our previous finding that sampling-based MBR does not seem to have an equivalent of the beam search curse.

Choice of Hypothesis Space
While our focus thus far has been on reducing the number of target utility calls, allowing the exploration of larger H(x), one should also take sampling time into consideration. For example, we found that in MBR N-by-N with N = 100, sampling time made up about 60% of the total translation time for our setup. Therefore, it is computationally attractive to construct compact H(x) with promising translation candidates. Ideally, for better search in MBR, we enumerate a set of high expected utility hypotheses. Up until now we have constructed H(x) using ancestral samples, following Eikema and Aziz (2020). Strategies like nucleus sampling and beam search are known empirically to produce higher-quality translations than ancestral sampling and might therefore also enumerate outcomes that have high expected utility. We explore ancestral sampling, nucleus sampling and beam search. In a hyperparameter search we found p = 0.7 to work best for nucleus sampling. For beam search we use a length penalty of 1.2 (ne) or 0.6 (de, ro). We compare each strategy by the expected BEER values of the translations generated, using accurate estimates of expected BEER (1,000 samples for MC estimation). We show results in Figure 6.
We find ancestral sampling to produce hypotheses across the entire range of expected BEER scores. Nucleus sampling and beam search generally produce translations at the higher end of expected BEER. These therefore seem more suitable for generating effective H(x) at smaller N. Nucleus sampling seems to yield the largest proportion of high expected utility translations across language pairs. Beam search has a noticeably high proportion of poor translations for English-Nepali, a low-resource language pair where mode-seeking search has been observed to be less reliable. Results for the opposite translation directions were similar. We explore both nucleus sampling and beam search for constructing H(x) in the next experiment, as well as combining all three strategies.

Coarse-to-Fine MBR
We now turn to the coarse-to-fine procedure (MBR C2F ) described in Section 3.

Choice of Proxy Utility
We compare various proxy utilities by their effectiveness as filtering strategies in obtaining high expected utility sets, where we again use accurate estimates of expected utility (1,000 samples for MC estimation). We filter the top-20 hypotheses from an initial 100 hypotheses obtained using ancestral sampling. This ensures a high variety of expected utilities in the initial set. We also compare each proxy utility on its runtime performance. We compare cheap estimates of expected BEER using either 1 or 5 samples for MC estimation (BEER-1 and BEER-5, respectively) as well as cheap-to-compute proxy metrics: unigram F1 using 50 samples for MC estimation (UF-50) and skip-bigram F1 using 50 samples for MC estimation (SBF-50). We use expected BEER with 100 samples for MC estimation (BEER-100) as a reference point. See our results on the English-German system in Figure 7.

Figure 6 (caption): Proportion plots of expected utility for 3 strategies for constructing H(x), using 100 translation candidates per strategy. We estimate expected utility using 1,000 samples. Results are aggregated over 100 source sentences.

Figure 7 (caption): Comparison of proxy utilities on English to German: BEER using 1, 5 or 100 samples for MC estimation, and unigram F1 (UF) and skip-bigram F1 (SBF) each using 50 samples for MC estimation. We use each proxy utility to filter a top-20 from 100 ancestral samples. We show the resulting expected target utilities (BEER, an accurate estimate) (left), as well as a runtime comparison (right). Results are aggregated over 100 source sequences.
Surprisingly, we find nearly all strategies lead to filtered sets as good as those of BEER-100 in terms of expected BEER. The only strategy that performs slightly worse than the others is BEER-1, which is likely too noisy to be a reliable filtering strategy. We observed very similar results for the other five language pairs. In terms of runtime we find BEER-1 to be the fastest, followed by UF-50 at a 22.2x speed-up over BEER-100. In follow-up experiments, we will use UF-50 as a proxy utility, providing high-quality filtered sets at good runtime performance.
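The unigram F1 proxy (UF) selected above is cheap enough to compute many times per hypothesis. A minimal sketch, assuming simple whitespace tokenisation (the paper does not specify its tokenisation, so this detail is our assumption):

```python
from collections import Counter

def unigram_f1(y, h):
    """Unigram F1 between a pseudo-reference y and a hypothesis h,
    whitespace-tokenised; a cheap proxy for a trained metric like BEER."""
    ref, hyp = Counter(y.split()), Counter(h.split())
    # Clipped token overlap: multiset intersection of the two bags.
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Averaging this over S ancestral samples yields the UF-S estimates used in the coarse filtering step.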

Coarse-to-Fine MBR Results
In Table 1 we compare MBR C2F with MBR N-by-S, using N = 405 nucleus samples (p = 0.7) to construct the hypothesis space. We filter the top T = 50 hypotheses using UF-50 as proxy utility and use L = 100 samples for MC estimation on the top set, following our findings in Sections 5.5.1 and 5.3, respectively. For MBR N-by-S we set S = 13 to roughly match the amount of computation available to MBR C2F, based on the 22.2x speed-up of UF-50 relative to BEER-100 observed in Figure 7. We find that across language pairs MBR C2F consistently outperforms MBR N-by-S, showing improvements between +0.4 and +1.1 BEER and +0.2 to +1.9 BLEU. MBR C2F is thus effective at obtaining higher translation quality than MBR N-by-S given the same amount of computation available for MBR.

[7] Our Python implementations of unigram and skip-bigram F1 are not optimised and we deem it likely that a greater speed-up is possible with a more efficient implementation.
We also explore the effects on translation quality of changing and combining strategies for constructing H(x). We find that using a beam of N = 405 (with the same length penalty as in Section 5.4) to construct H(x) produces better results than nucleus sampling for most language pairs. Notably, reordering a large beam considerably improves over standard beam search decoding (with the usual beam size of 5 (ro, ne) or 4 (de)) for all language pairs in terms of BEER, and for most language pairs in terms of BLEU. Combining all strategies for creating hypothesis spaces (ancestral sampling, nucleus sampling and beam search) leads to the best results overall. For all language pairs both BEER and BLEU scores either improve or remain similar. This is further empirical evidence that expected utility is a robust and reliable criterion for picking translations: enlarging the hypothesis space or improving MC estimation under reasonable choices of hyperparameters seemingly never meaningfully hurts translation quality, but generally improves it.
A Multi-Reference Test Set We also test three systems from Table 1 on a multi-reference test set (we use translators A, C and D). We show results in Table 2. We find a similar pattern to that of Table 1. MBR C2F greatly outperforms MBR N-by-S given the same amount of available compute (see Section 5.5 for details). MBR C2F outperforms beam search in terms of BEER, but is much closer to beam search this time in terms of BLEU.

Table 1 (caption): We use BEER as utility, UF-50 as proxy utility, set top-T = 50 and use L = 100 samples for MC estimation. We use various strategies for constructing H(x): 405 nucleus samples (N), the 405-best list from beam search (B), and both of these combined with 1,005 ancestral samples (all). We use S = 13 in MBR N-by-S to mimic the computational cost of MBR C2F at N = 405. The last row shows standard beam search performance using a typical beam size of 4 or 5 depending on the language. MBR results are averaged over 3 runs. Standard deviations of BEER/BLEU scores are below 0.1/0.2 (NxS), 0.1/0.1 (C2F) and 0 (BS).

Runtime
We measure runtime performance of hypothesis generation, sampling for MC estimation of expected utilities, and decoding separately, for the various algorithms explored in this work on the English to German language pair. We run all experiments on an Intel Xeon Bronze 3104 processor and a single NVIDIA GeForce 1080Ti GPU. For generating samples and beam search outputs we set the batch size as large as possible, constrained by the available GPU memory. MBR using BEER as utility runs on CPU, while sampling and beam search run on GPU. We mimic the MBR N-by-N and MBR C2F setups from Table 1 using a hypothesis space of 405 nucleus samples. We additionally include runtime results for MBR N-by-N with N = 405 and a more expensive MBR N-by-S variant with S = 100 (NxS large). For beam search we report results for a beam size of 4, as used throughout the paper for this language pair. Results are shown in Table 3. As can be seen, collecting hypotheses and unbiased sampling make up a large part of the total decoding time in MBR algorithms. We do note that sampling operations are easily parallelisable and can be split across multiple GPUs when available. In terms of the decoding time itself, we greatly reduce the amount of computation needed to perform MBR, going from 23,156 seconds of decoding time for MBR N-by-N to only 726 seconds for MBR C2F. This can be attributed to the great reduction in the number of utility calls in our proposed approximations.

Related Work
In recent NMT literature MBR has started being explored, either in combination with MAP decoding or replacing it altogether. Stahlberg et al. (2017) adapt lattice minimum Bayes risk decoding (Tromble et al., 2008) on SMT translation lattices to be incorporated in left-to-right beam search decoding in NMT, thereby proposing a hybrid decoding scheme. They adapt lattice MBR to work on partial hypotheses and perform beam search to find translations that are both high-probability under the NMT model and of high expected utility under the SMT model. Shu and Nakayama (2017) also combine beam search with MBR decoding to find low-risk hypotheses, after which they re-rank all hypotheses with MBR again. They report having to restrict the number of hypotheses so as not to degrade the effectiveness of MBR re-ranking, a finding that is likely due to biased estimation of expected utility, as in our work we find that increasing the number of hypotheses always improves translation quality. Blain et al. (2017) explore the quality of k-best lists obtained from beam search in NMT models and find that, while MAP is not a good criterion for ranking the resulting hypotheses, re-ranking using MBR with BEER as a utility leads to improvements on top of standard beam search decoding (with a small beam size), in terms of both BLEU scores and human evaluation scores. Borgeaud and Emerson (2020) approach decoding from a voting theory perspective and derive a decoding strategy similar to MBR. They explore a range of utility functions, achieving similar BLEU scores to beam search, but showing improvements in terms of length, diversity and human judgement.
All of the above works make use of beam search both to provide the hypothesis space and to make a biased estimate of expected utility. Eikema and Aziz (2020) are the first in NMT to propose using samples from the model both to make unbiased estimates of expected utility, the importance of which we confirm in experiments, and to form the hypothesis space. The authors only explore MBR N-by-N, however, and never explore hypothesis spaces larger than N = 30 samples. We show that it is beneficial to scale MBR to much larger hypothesis spaces and that it can be beneficial to construct them using mode-seeking strategies. Müller and Sennrich (2021) study the properties of the sampling-based algorithm proposed in Eikema and Aziz (2020) and explore hypothesis spaces up to a size of N = 100 as well as multiple utility functions. They find that MBR decoding outputs exhibit a similar but smaller bias towards short translations and frequent tokens compared to beam search, but observe that this is dependent on the choice of utility function. They further find that MBR decoding mitigates spurious copying and hallucinations under domain shift. Similar to our work, they find that MBR decoding scales well with larger hypothesis spaces and better estimation of expected utility. Freitag et al. (2021) explore the use of large hypothesis spaces and a range of utilities, including neural utilities, with the MBR N-by-N approximation. They find that using BLEURT as utility leads to significantly better translations in a human evaluation, while producing considerably lower-probability translations.
We provide a more extensive overview of historical approximations to the MBR objective as well as an overview of alternatives for tackling the inadequacy of the mode in Appendix A.

Conclusion
We have shown MBR to be a robust decision rule for NMT that can find high quality translations.
In particular, we have found that MBR, under reasonable hyperparameter choices, generally leads to improved translation quality with more computation (i.e., searching a larger hypothesis space and/or using more samples for more accurate MC estimation). The big challenges in decoding with MBR are constructing the hypothesis space and keeping the computational cost of estimating expected utility tractable. We have proposed effective strategies for both, by exploring more efficient ways of forming the hypothesis space and proposing an approximation to MBR that is linear in the size of this hypothesis space. Our coarse-to-fine MBR procedure considerably reduces the number of calls to the utility function without compromising translation quality. We have shown that sampling-based MBR can outperform beam search on all the language pairs we explored and can continue to improve with better and more accurate search. We believe sampling-based MBR to be a promising, albeit still more expensive, alternative to beam search decoding. Unlike beam search, where it is not obvious how to further improve translation quality, sampling-based MBR is likely to benefit from improvements to different aspects of the algorithm. We believe fruitful avenues of research include i) clever algorithms for constructing hypothesis spaces, ii) more robust estimates of expected utility using fewer samples, iii) the use of modern neural utilities, and iv) improving the modelling capacity of NMT systems. We hope that this work motivates researchers and practitioners to make more conscious considerations of the choice of decision rule and that it paves the way for the use of tractable sampling-based MBR decoding in NMT.
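The coarse-to-fine idea can be illustrated with a minimal sketch (function names and the pruning size `keep` are ours for illustration; the exact procedure and hyperparameters are described in the paper): a cheap proxy utility first prunes the hypothesis space, and the expensive target utility is only evaluated on the survivors.

```python
def mbr_coarse_to_fine(hypotheses, samples, proxy_utility, target_utility, keep=20):
    """Coarse step: rank all N hypotheses by expected *proxy* utility (cheap).
    Fine step: re-rank only the `keep` survivors with the expensive target
    utility, cutting target-utility calls from N*S down to keep*S."""
    def expected(utility, h):
        return sum(utility(h, y) for y in samples) / len(samples)
    survivors = sorted(hypotheses,
                       key=lambda h: expected(proxy_utility, h),
                       reverse=True)[:keep]
    return max(survivors, key=lambda h: expected(target_utility, h))
```

The sketch keeps the same samples for both steps; the savings come purely from evaluating the target utility on far fewer hypotheses.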

Limitations
This work has proposed a number of algorithms for more efficient decoding under the minimum Bayes risk decision rule. However, in terms of runtime MBR decoding is still outperformed by beam search. MBR will likely always be more expensive than current applications of beam search, in which very small beam sizes are employed, since on top of generating translation candidates, MBR decoding potentially needs a separate set of samples for estimating expected utility, and performs additional computations in the form of utility assessments. While this currently makes MBR less attractive in real-time translation scenarios, we believe that the demonstrated scalability and robustness of the decoding objective make MBR interesting in scenarios in which translation speed is not the highest priority. Furthermore, continued research into algorithmic improvements to MBR approximations and optimised implementations of existing algorithms may make MBR attractive for real-time translation in the future.
MBR also relies on a utility function, a hyperparameter of the decision rule (decoding algorithm). On the one hand, this allows us to inject some domain expertise into the decoding algorithm. On the other hand, in machine translation we do not have a gold-standard metric that we trust to judge translation quality perfectly. This means we will have to choose a utility that we know is suboptimal and that may have peculiarities, such as ranking unreasonably highly bad hypotheses that exploit certain aspects of the utility. Nonetheless, it is unlikely that the NMT model puts a lot of mass on such translations, reducing the likelihood of encountering such situations. We believe there are also positives to incorporating a utility function into the decoding algorithm: MBR can benefit from advances in the field of machine translation evaluation, as some recent works have already exploited (Freitag et al., 2021; Fernandes et al., 2022).
Finally, current MBR algorithms do not permit incremental generation of translations. A translation hypothesis can only be assessed once it has been fully generated by the NMT model. This is a bottleneck to its speed and does not make optimal use of the factorisation of modern-day NMT systems. We do think this is a promising direction for future work.

A.1 Historical Approximations to the MBR Objective
in this subset. This has the undesirable effect of exaggerating differences in probability due to underestimation of the normalisation constant, and, like MAP decoding, it over-represents pathologies around the mode. Similarly, most prior work uses mode-seeking search to explore a tractable subset of the hypothesis space. Mode-seeking approximations bias the decoder towards the mode, making MBR decoding less robust to idiosyncratic outcomes in the hypothesis space (Eikema and Aziz, 2020). This is in stark contrast with our work, where we sample from the model to construct unbiased estimates of expected utility, as well as to enumerate a tractable hypothesis space.
There are cases in statistical machine translation (SMT) where the computation of expected utility can be factorised along a tractable directed acyclic graph (DAG) via dynamic programming (Tromble et al., 2008; Zhang and Gildea, 2008; DeNero et al., 2009; Kumar et al., 2009). In such cases, the DAG contains a much larger subset of the sample space than any practical k-best list, yet some pruning is still necessary to construct a compact DAG containing only the most probable outcomes. These strategies are only available for models and utility functions that make strong Markov assumptions. For example, Tromble et al. (2008) and DeNero et al. (2009) develop linearisation strategies for BLEU, and Zhang and Gildea (2008) maximise expected trigram counts as a proxy to BLEU proper. The idea of utilising a proxy utility is something we also explore in this paper, though only as an intermediate step to decoding with the target utility.
In some (rarer) cases, unbiased (or asymptotically unbiased) samples have been used to approximate the MBR objective and/or to reduce the search space. For example, Stanojević and Sima'an (2015) use ancestral sampling in MBR decoding for permutation-trees-based reordering models, and Arun et al. (2009) use Gibbs sampling for MBR decoding in phrase-based MT. Unbiased samples for estimation of expected utility or exploration of a tractable hypothesis space are simply not common in machine translation. In SMT, the reason is a technical one: most SMT models are not based on a left-to-right factorisation of the joint distribution, thus unbiased sampling requires MCMC (DeNero et al., 2008; Blunsom et al., 2009) or expensive adaptive rejection sampling (Aziz et al., 2013). This limitation does not extend to NMT models, but NMT most likely simply inherited from SMT the practice of using beam-search-based approximations, at least until the work of Eikema and Aziz (2020).

A.2 Tackling the Inadequacy of the Mode
Eikema and Aziz (2020) link the inadequacy of the mode in NMT to the entropy of the conditional distribution, or, more precisely, to the fact that NMT models tend to spread probability mass over large subsets of the sample space (Ott et al., 2018). It is plausible that strategies to concentrate probability mass (e.g., reducing entropy or pruning the support of the model) will do so by making inadequate translations less probable. For example, Forster et al. (2021) find that the inadequacy-of-the-mode problem does not seem to affect sequence-to-sequence models of morphological inflection, an essentially deterministic task, whose combinatorial space is built upon a smaller vocabulary (i.e., characters instead of sub-word units) and whose observations are typically very short (i.e., words rather than sentences). Peters and Martins (2021) train sparse sequence-to-sequence models (Peters et al., 2019) which assign zero probability to many outcomes, dramatically reducing the support of the conditional distribution over complete sequences. They show that sparsity leads to inadequate candidates such as the empty string being pruned out of the support. They also find that label smoothing increases the rate at which the empty string is more probable than the beam-search output. Meister et al.
(2020) interpret the algorithmic approximations of beam search as an inductive bias towards outputs with uniform information density (Jaeger and Levy, 2007). They develop variants of beam search where this preference is a tunable hyperparameter and show that deviating from the mode with this type of bias can lead to improved translation quality. Another way to deviate from the mode is to augment the decoding objective with an auxiliary model. Li and Jurafsky (2016) re-rank a k-best list using a combination of two model probabilities, namely p_Y|X(h|x, θ_fwd) and p_X|Y(x|h, θ_bwd). They think of this as maximising the mutual information (MI) between source and translation. The motivation is that the target-to-source component will push against inadequate candidates, as those are unlikely to be mapped back to the source with high probability. Bhattacharyya et al. (2021) find that 100 samples from an NMT model contain better candidates (measured in terms of BLEU) than the output of beam search (an observation Eikema and Aziz (2020) also make, based on 30 samples and METEOR instead). They propose to re-rank these samples using an energy-based model (EBM) trained to order candidates as sentence-BLEU would. Like these works, sampling-based MBR decoding can be seen as a form of explore-and-rank approach; however, the ranking function in MBR is derived from the NMT model itself, whereas both MI- and EBM-based re-ranking involve an auxiliary trained model. For the EBM in particular, in the limit of a too-large hypothesis space, the beliefs of the NMT model are completely overwritten by the EBM. MBR, instead, does not overwrite the model's beliefs; it re-expresses those beliefs in terms of utility. Leblond et al.
(2021) recast NMT as a reinforcement learning problem and learn both a policy (i.e., a mechanism to explore the space of translations one word at a time from left to right) and a value function (i.e., an estimate of the expected reward of finishing a given prefix translation). For reward they investigate what they call privileged metrics, which require access to references (e.g., sentence-level BLEU), and unprivileged metrics, which do not use references but access the source (e.g., a quality estimation score). Compared to sampling-based MBR, their work tightly integrates search and value estimation, thus going beyond ranking a fixed set of candidates. The objective function of MBR can be thought of as an 'unprivileged metric' in their terminology, one that is based on the NMT model itself (and a choice of utility). But the policy in sampling-based MBR (i.e., the NMT model) is not trained to be aware of the evaluation metric.

B Comparing Target Utilities
We compare a number of utility functions for use in MBR decoding. In principle, any function that measures some notion of similarity across sequences and can be reliably assessed at the sentence level is suitable as a utility function for MBR. As BLEU is the predominant automatic evaluation metric on which translation quality is assessed, we experiment with a smoothed version of BLEU (Papineni et al., 2002) that can work at the sentence level: sentence-BLEU (Chen and Cherry, 2014), using the default parameters in Post (2018b). We further try METEOR (Denkowski and Lavie, 2011), as this was used in Eikema and Aziz (2020) and showed good results.8 BEER (Stanojević and Sima'an, 2014) is a character-based metric that has been shown to correlate well with human judgements in many WMT metrics tasks (Macháček and Bojar, 2014; Stanojević et al., 2015; Bojar et al., 2016b). Finally, we also explore ChrF++ (Popović, 2017), another character-based metric that is an improved version of ChrF (Popović, 2015).
We perform MBR N-by-S with N = 405 and S = 100 in order to perform the comparisons. We measure the performance of each utility on BEER, BLEU, METEOR and ChrF++. Our results are shown in Table 4. As expected, decoding with a given utility achieves the best performance under that same metric. We find small deviations from this when BEER or METEOR outperforms sentence-BLEU in terms of BLEU score. This is likely due to sentence-BLEU only being an approximation to BLEU itself. We find that overall BEER seems to do best across metrics, followed by ChrF++. One way to quantify this more clearly is to normalise each score by the maximum score obtained by the best-scoring system for that evaluation metric and language pair. This leads to the following average performances per utility: BEER 0.978, METEOR 0.968, ChrF++ 0.964, and sentence-BLEU 0.955. This indeed shows a slight edge of BEER over the other utilities tested in pushing scores up across our evaluation metrics. Therefore, in the main paper, we have used BEER as the utility of choice. BEER was also previously found to work well as a utility function in MBR by Blain et al. (2017).
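The normalisation described above can be computed as follows. This is a sketch of the calculation, not code from our experiments; the `scores` layout (one value per utility per (language pair, metric) cell of Table 4) is an assumption made for illustration.

```python
def normalised_average(scores):
    """scores: {utility: {(lang_pair, metric): value}}.
    Divide each score by the column maximum (the best-scoring system for
    that language pair and evaluation metric), then average the
    normalised scores per utility."""
    columns = {cell for per_cell in scores.values() for cell in per_cell}
    col_max = {cell: max(scores[u][cell] for u in scores) for cell in columns}
    return {u: sum(scores[u][c] / col_max[c] for c in columns) / len(columns)
            for u in scores}
```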

Figure 1 :
Figure 1: NMT spreads probability roughly uniformly over a large set of promising hypotheses (left). MBR (right) assigns hypotheses an expected utility, revealing clear preferences against those that are too idiosyncratic.

Figure 4 :
Figure 4: MBR N-by-N for various sizes of N, using BEER as target utility. We report both BEER and BLEU scores.

Figure 5 :
Figure 5: MBR N-by-S: we estimate the expected utility of N hypotheses using S samples. We show average performance over 3 runs with 1 standard deviation. The dashed line shows MBR N-by-N performance at N = 405.

Table 1 (NxS, C2F and beam search) on a multi-reference test set. We use the English to German systems trained on WMT18 news data and translate newstest2021, which has three separate translations for each source sentence (we

Table 1 :
Comparing MBR N-by-S, MBR C2F and beam search (BS) in terms of BEER and BLEU performance.

Table 2 :
English to German MBR N-by-S and MBR C2F results on the newstest2021 multi-reference test set. We use N = 405 nucleus samples as hypothesis space and the same hyperparameters as in Table 1.

Table 3 :
A runtime comparison of MBR variants and beam search. We separate the time taken for i) hypothesis generation, ii) sampling (for estimation of expected utility), and iii) running the decoder itself. We use N = 405 nucleus samples, S = 13 and S_large = 100 ancestral samples for NxS variants, and the hyperparameter settings for C2F as used in Table 1.

Table 4 :
Comparing BEER, sentence-BLEU, METEOR and ChrF++ as utility functions in MBR N-by-S using N = 405 and S = 100.