It’s MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk

Minimum Bayes Risk (MBR) decoding is a method for choosing the outputs of a machine learning system based not on the output with the highest probability, but the output with the lowest risk (expected error) among multiple candidates. It is a simple but powerful method: for an additional cost at inference time, MBR provides reliable several-point improvements across metrics for a wide variety of tasks without any additional data or training. Despite this, MBR is not frequently applied in NLP works, and knowledge of the method itself is limited. We first provide an introduction to the method and the recent literature. We show that several recent methods that do not reference MBR can be written as special cases of MBR; this reformulation provides additional theoretical justification for the performance of these methods, explaining some results that were previously only empirical. We provide theoretical and empirical results about the effectiveness of various MBR variants and make concrete recommendations for the application of MBR in NLP models, including future directions in this area.


Introduction
"Sometimes innovation is only old ideas reappearing in new guises . . . [b]ut the new costumes are better made, of better materials, as well as more becoming: so research is not so much going round in circles as ascending a spiral." (Jones, 1994)
Minimum Bayes Risk (MBR) decoding (Bickel and Doksum (1977); §2) is a decoding method following a simple intuition: when choosing a best output from a set of candidates, the desirable output should be both 1) high probability and 2) relatively consistent with the rest of the outputs (i.e., outputs that are not consistent with the other outputs are high risk: they may be dramatically better or worse than the consensus). MBR thus provides an alternative to the more standard maximum-likelihood decoding; when a sample of sufficient size is taken, MBR almost uniformly outperforms beam search and single-output sampling across tasks, metrics, and datasets (see §6). It is also notable in its flexibility; in §3 we organize and discuss several different design decisions that go into the use of MBR and how they affect the efficacy of the method.
While MBR is rarely applied by name in modern NLP, a number of methods with similar intuitions have gained popularity. In §4, we demonstrate that a number of generation techniques widely used with modern language models can be viewed as special instances of MBR: self-consistency (Wang et al., 2023) and its extensions, range voting (Borgeaud and Emerson, 2020), output ensembling (DeNero et al., 2010; Martínez Lorenzo et al., 2023), and some types of density estimation (Kobayashi, 2018). This view exposes connections between seemingly disparate methods and presents theoretical justifications for existing empirical results using these methods. We also discuss how insights from the MBR literature can inform the use of these other MBR-like methods.
With the framing of MBR, the theoretical justification for the empirical performance of several methods becomes clear; the extension of self-consistency to open-ended generations becomes trivial; and several promising modifications to self-consistency and output ensembling are exposed. In particular, modern MBR-like methods often do not apply the insights from research on MBR, suggesting that these methods could be further improved. In §5, we show that some design choices, though seemingly intuitive to a practitioner accustomed to search-based decoding methods, should be avoided when applying MBR.

Standard decoding
Decoding from an autoregressive model (such as a transformer decoder) is performed tokenwise. The distribution at each decoding step is conditioned on the prior tokens and the input text:
\[ p(y_i \mid y_{<i}, x) \tag{1} \]
The model is locally normalized; the probabilities of next tokens sum to 1. The probability of a sequence of length T under this global model distribution is
\[ p(y \mid x) = \prod_{i=1}^{T} p(y_i \mid y_{<i}, x) \tag{2} \]
Given this distribution, there are several ways of extracting an output: by sampling at each decoding step from the distribution over next tokens (often with some modification to the distribution, e.g. temperature, nucleus, or epsilon sampling; Holtzman et al. (2019)); by always choosing the most probable next token (i.e. greedy decoding); or by performing a search over some subset of the output space, guided by the distribution (e.g. beam search, best-first search). These methods generally return a single output; if multiple output candidates are present, the one with the maximum likelihood under the model distribution is returned.
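The chain-rule factorization and two of the decoding strategies above can be illustrated on a toy model. This is a minimal sketch: the lookup table TOY_MODEL, the end marker "EOS", and all function names are hypothetical stand-ins for a real autoregressive model, not part of any library.

```python
import math
import random

# Hypothetical toy next-token model: maps a prefix (tuple of tokens) to a
# distribution over next tokens. "EOS" ends the sequence.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"a": 0.1, "EOS": 0.9},
    ("b",): {"EOS": 1.0},
    ("a", "a"): {"EOS": 1.0},
}


def sequence_log_prob(tokens):
    """Chain rule: log p(y|x) is the sum of log p(y_i | y_<i, x)."""
    return sum(
        math.log(TOY_MODEL[tuple(tokens[:i])][tok]) for i, tok in enumerate(tokens)
    )


def greedy_decode():
    """Always pick the most probable next token."""
    prefix = []
    while not prefix or prefix[-1] != "EOS":
        dist = TOY_MODEL[tuple(prefix)]
        prefix.append(max(dist, key=dist.get))
    return prefix


def ancestral_sample(rng):
    """Sample each token from the unmodified next-token distribution."""
    prefix = []
    while not prefix or prefix[-1] != "EOS":
        dist = TOY_MODEL[tuple(prefix)]
        tokens, probs = zip(*dist.items())
        prefix.append(rng.choices(tokens, weights=probs, k=1)[0])
    return prefix


greedy = greedy_decode()
sample = ancestral_sample(random.Random(0))
```

Because the toy model is locally normalized, the probabilities of its three complete sequences sum to 1, which is the property the sampling-based risk estimates later in the paper rely on.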

Minimum Bayes Risk decoding
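The traditional formulation of MBR is as a minimization objective. Given an output space 𝒴 and a probability distribution over this space p(y|x), we compute the risk R(y′) of a candidate decoding y′ as the expected error (also called loss) L under this distribution (Bickel and Doksum, 1977; Kumar and Byrne, 2004; Tromble et al., 2008):
\[ R(y') = \mathbb{E}_{y \sim p(y \mid x)} \left[ L(y, y') \right] = \sum_{y \in \mathcal{Y}} L(y, y') \, p(y \mid x) \tag{3} \]
The MBR decoding is then the y′ within 𝒴 that minimizes risk:
\[ \hat{y} = \operatorname*{argmin}_{y' \in \mathcal{Y}} R(y') \tag{4} \]
We can trivially rewrite the risk minimization as a maximization of gain (also called utility) rather than a minimization of error, where G(y, y′) = −L(y, y′):
\[ \hat{y} = \operatorname*{argmax}_{y' \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} G(y, y') \, p(y \mid x) \tag{5} \]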
Approximating risk. Computing this sum over the space of all possible outputs 𝒴 is intractable for most models. In these cases, we approximate the risk R(y′) by using a subset of the full space Y ⊂ 𝒴; that is, instead of exact computation of the expectation, we approximate it with a sum over independent samples from p(y|x). Generally, this is performed by sampling repeatedly from a model (or several models) and estimating the probability of each individual output as proportional to the relative frequency with which the output occurs. For an unbiased sampling method (e.g. ancestral sampling), as the number of outputs drawn goes to infinity, this recovers the model's true distribution of probability over sequences. Thus, we approximate risk using this sample:
\[ R(y') \approx \frac{1}{|Y|} \sum_{y \in Y} L(y, y') \tag{6} \]
Thus, given a sample Y (which may include duplicates) and a function (e.g. a metric) that compares two sequences, G : 𝒴 × 𝒴 → ℝ, we approximate the true MBR decoding rule as:
\[ \hat{y} = \operatorname*{argmax}_{y' \in Y} \frac{1}{|Y|} \sum_{y \in Y} G(y, y') \tag{7} \]
Separation of evidence and hypothesis sets. In many cases, the same subset of the output space is used for both the risk estimate and the candidate outputs. However, when the sample is substantially smaller than the full output space, it is often beneficial to use separate sets (Eikema and Aziz, 2022; Yan et al., 2023). Following prior work (§2.2), we refer to these as the evidence set (Y_e) and hypothesis set (Y_h). This separation is beneficial because there are distinct and potentially contradictory desiderata for the two sets. We wish for our evidence set to cover a large, representative portion of the search space to obtain a more accurate estimate of risk. However, we want our hypothesis set to only cover the narrower, high-quality region of the space, as we do not want to consider candidate hypotheses that are low-quality. Applying the separation of evidence and hypothesis sets yields the equation for MBR over two subsets of the output space:
\[ \hat{y} = \operatorname*{argmax}_{y' \in Y_h} \frac{1}{|Y_e|} \sum_{y \in Y_e} G(y, y') \tag{8} \]

Taxonomy of MBR
In this section, we examine how these four factors (the hypothesis set, the evidence set, the gain metric, and the probability distribution used to estimate risk) affect the efficacy of MBR and give recommendations for each; in Section 4, we discuss how these apply to other MBR-like methods.
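The four design choices map directly onto the arguments of a sampled MBR decoder. The sketch below is illustrative only: `mbr_decode`, `unigram_f1`, and the optional `weights` argument (standing in for a non-uniform evidence distribution) are hypothetical names, and the toy gain is an assumption chosen to keep the example dependency-free.

```python
from collections import Counter


def unigram_f1(y, y_prime):
    """Toy gain: unigram-overlap F1 between two whitespace-tokenized strings."""
    a, b = Counter(y.split()), Counter(y_prime.split())
    overlap = sum((a & b).values())
    total = sum(a.values()) + sum(b.values())
    return 2 * overlap / total if total else 0.0


def mbr_decode(hypotheses, evidence, gain, weights=None):
    """Return the hypothesis with the highest estimated expected gain.

    hypotheses: candidate outputs (the hypothesis set)
    evidence:   samples used to estimate the expectation (the evidence set)
    gain:       function gain(evidence_item, hypothesis) -> float
    weights:    optional per-evidence probabilities; uniform if omitted
    """
    if weights is None:
        weights = [1.0 / len(evidence)] * len(evidence)

    def expected_gain(h):
        return sum(w * gain(e, h) for w, e in zip(weights, evidence))

    return max(hypotheses, key=expected_gain)


evidence = ["the cat sat", "the cat sat down", "a dog ran"]
hypotheses = ["the cat sat", "a dog ran"]
best = mbr_decode(hypotheses, evidence, unigram_f1)
```

Passing the same list for `hypotheses` and `evidence` gives the single-set rule; passing different lists gives the two-set rule.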

Sampling a hypothesis set
Several recent works show benefits from improving the quality of the hypothesis space. Fernandes et al. (2022) apply a two-stage approach where they first apply an N-best (referenceless) reranker and then do MBR over only the most highly ranked hypotheses, which they also use as the evidence set. Eikema and Aziz (2022) introduce a method, Coarse-to-Fine MBR, that first uses MBR with a cheap-to-compute metric to filter a large hypothesis space to a smaller set, then uses MBR with a better but more expensive-to-compute metric over the smaller set; they separate evidence and hypothesis sets. Freitag et al. (2023) further investigate sampling strategies for MBR, finding that epsilon sampling (Hewitt et al., 2022) outperforms other strategies in automated and human evaluations.
Another, earlier line of work has considered growing the hypothesis set post hoc in order to obtain hypotheses with higher expected gain (González-Rubio et al., 2011; González-Rubio and Casacuberta, 2013; Hoang et al., 2021).
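The two-stage filtering idea described above (a cheap metric to shortlist, then a more expensive metric to decide) can be sketched as follows. This is a simplified illustration, not the published Coarse-to-Fine MBR implementation; the two toy gains and the `keep` parameter are assumptions.

```python
from collections import Counter


def word_jaccard(y, y_prime):
    """Cheap stand-in gain: Jaccard overlap of word sets."""
    a, b = set(y.split()), set(y_prime.split())
    return len(a & b) / len(a | b) if a | b else 1.0


def bigram_f1(y, y_prime):
    """Stand-in for a more expensive gain: bigram-overlap F1."""
    def bigrams(s):
        t = s.split()
        return Counter(zip(t, t[1:]))
    a, b = bigrams(y), bigrams(y_prime)
    total = sum(a.values()) + sum(b.values())
    return 2 * sum((a & b).values()) / total if total else 0.0


def expected_gain(h, evidence, gain):
    return sum(gain(e, h) for e in evidence) / len(evidence)


def coarse_to_fine_mbr(hypotheses, evidence, cheap_gain, fine_gain, keep):
    """Shortlist `keep` hypotheses with the cheap gain, then decide with the fine gain."""
    ranked = sorted(
        hypotheses, key=lambda h: expected_gain(h, evidence, cheap_gain), reverse=True
    )
    shortlist = ranked[:keep]
    return max(shortlist, key=lambda h: expected_gain(h, evidence, fine_gain))


samples = ["the cat sat on the mat", "the cat sat on a mat", "the cat sat", "dogs bark loudly"]
choice = coarse_to_fine_mbr(samples, samples, word_jaccard, bigram_f1, keep=2)
```

The expensive gain is evaluated only for the `keep` shortlisted hypotheses, which is where the saving comes from when the hypothesis set is large.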

Sampling an evidence set
Comparatively less work has studied strategies for sampling the evidence set. Most recent work has adopted the unbiased sampling strategy of Eikema and Aziz (2020), i.e. drawing i.i.d. samples from the model distribution p(y|x) (Equation 2). This strategy is motivated by their observation that unbiased sampling is reasonably reflective of the data distribution, much more so than beam search. However, their approach is incompatible with models trained via label smoothing (Szegedy et al., 2016). Yan et al. (2023) attempt to remedy this by sampling the evidence set with temperature τ < 1, sharpening the model distribution.
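To make the sharpening effect of τ < 1 concrete, the sketch below rescales a single categorical distribution. It assumes the usual convention that temperature divides the logits, which is equivalent to raising probabilities to the power 1/τ and renormalizing; in a real decoder this would be applied to the next-token distribution at every step. The function name is hypothetical.

```python
def apply_temperature(probs, tau):
    """Rescale a categorical distribution: p_tau(i) is proportional to p(i) ** (1 / tau)."""
    scaled = [p ** (1.0 / tau) for p in probs]
    z = sum(scaled)
    return [s / z for s in scaled]


original = [0.5, 0.3, 0.2]
sharpened = apply_temperature(original, tau=0.5)   # tau < 1 concentrates mass on the mode
flattened = apply_temperature(original, tau=2.0)   # tau > 1 spreads mass out
```

With τ = 1 the distribution is unchanged, which corresponds to ancestral sampling.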

What metric do we want to maximize?
The gain G (alternatively, error L) may be an arbitrary function Y_e × Y_h → ℝ. Early work focused on simple, token-level metrics like word error rate and BLEU (Kumar and Byrne, 2004; Ehling et al., 2007), but more recent work has explored the use of neural metrics (Amrhein and Sennrich, 2022; Freitag et al., 2022), as well as executing outputs in code generation (Shi et al., 2022; Li et al., 2022).
Generally, for both neural and non-neural metrics, MBR with metric G as a gain function will yield the largest downstream improvements on G (Müller and Sennrich, 2021; Freitag et al., 2022; Fernandes et al., 2022). In other words, if one aims to optimize system performance on metric M, one should perform MBR with M as gain.
However, MBR also inherits the weaknesses and biases of the gain metric used. MBR has been shown to suffer from length and token frequency biases brought on by the metric, i.e. MBR with BLEU prefers shorter sentences (Nakov et al., 2012; Müller and Sennrich, 2021). Similarly, Amrhein and Sennrich (2022) find that MBR over COMET causes higher rates of errors for named entities and numbers due to a lack of sensitivity in the metric. Moreover, MBR is susceptible to overfitting to the metric; Freitag et al. (2023) show that the MBR setting that maximizes the metric is not the one that humans prefer.
Note that in the most trivial case, where the metric is exact match, G(y, y′) = 1(y = y′), MBR recovers mode-seeking methods like beam search; i.e., MBR under this metric, in expectation, yields the maximum likelihood decoding.
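A small sketch of this trivial case, assuming the gain is the exact-match indicator: the expected gain of a candidate is then just its relative frequency in the sample, so sampled MBR returns the most frequent output, which approximates the mode of the model distribution. The function names are illustrative.

```python
from collections import Counter


def exact_match(y, y_prime):
    """Indicator gain: 1 if the two outputs are identical, else 0."""
    return 1.0 if y == y_prime else 0.0


def mbr_decode(hypotheses, evidence, gain):
    return max(
        hypotheses, key=lambda h: sum(gain(e, h) for e in evidence) / len(evidence)
    )


samples = ["B", "A", "B", "C", "B", "A"]
mbr_choice = mbr_decode(samples, samples, exact_match)
mode = Counter(samples).most_common(1)[0][0]
```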

What probability distribution should we use to estimate risk?
Most MBR decoding methods use the model's score distribution over outputs, s, as the (unnormalized) evidence distribution. Alternately, this distribution may be normalized by a temperature (during minimum risk training (Smith and Eisner, 2006) or decoding (Yan et al., 2023)). Some work (Suzgun et al., 2023) interprets this as a weak proxy for the human or true distribution, arguing that the true objective is to minimize error under the human distribution:
\[ \hat{y} = \operatorname*{argmin}_{y'} \; \mathbb{E}_{y \sim p_{\text{human}}(y \mid x)} \left[ L(y, y') \right] \]
Note that this is not the only reasonable choice of p(y|x); other possible distributions include a distribution over outputs from multiple models (§4.2) or the length-penalized distribution over a single model's outputs p_l(y|x) (§5.3).

Method                                      Evidence Gen.               Hypothesis Gen.            Metric               p(y|x)
Lattice MBR (Tromble et al., 2008)          N-best list                 N-best list                BLEU                 translation lattice
Coarse-to-fine MBR (Eikema and Aziz, 2022)  ancestral sampling          filter(sample)             BEER                 single model
Wiher et al. (2022)                         ancestral sampling          evidence + more decodings  BEER                 single model
MBR-DC (Yan et al., 2023)                   temperature sampling 1      temperature sampling 1     BLEURT               single model
Ours (§3.3)                                 ancestral sampling          temperature sampling       BERTScore            single model
Ours (§3.4)                                 ancestral sampling          temperature sampling       BERTScore            length-corrected scores
Freitag et al. (2023)                       epsilon sampling            (same as evidence)         BLEURT               single model
Crowd sampling 2 (Suzgun et al., 2023)      temperature sampling        (same as evidence)         neural score metric  single model
MBR-Exec (Shi et al., 2022)                 temperature sampling        (same as evidence)         execution match      single model
Self-consistency (SC) (Wang et al., 2023)   temperature sampling        (same as evidence)         exact answer match   single model
Complex SC (Fu et al., 2022)                filter(temperature sample)  (same as evidence)         exact answer match   single model
SC for open-ended gen (Jain et al., 2023)   temperature sampling        (same as evidence)         n-gram overlap       single model
Range voting (Borgeaud and Emerson, 2020)   beam search                 (same as evidence)         n-gram overlap       single model
Post-Ensemble (Kobayashi, 2018)             beam
2 While Suzgun et al. (2023) coin the new term crowd sampling, they also explicitly refer to their method as MBR.

MBR as a frame for other methods
Self-consistency, range voting, output ensembling, and density estimation can all be viewed through the framing of MBR. This exposes unstated connections between the methods and provides some theoretical backing to the empirical success of these methods. We discuss each in turn.

Self-consistency as MBR
Self-consistency (Wang et al., 2023) is a method for choosing outputs from language models. In self-consistency, the model is prompted to generate an explanation and then an answer. Multiple outputs O = {y_1, . . . , y_m} are sampled from the model, the answers A = {a_1, . . . , a_m} are extracted as a_i = ans(y_i), and the most frequent answer is returned:
\[ \hat{a} = \operatorname*{argmax}_{a \in A} \sum_{i=1}^{m} \mathbb{1}(a_i = a) \]
Self-consistency only computes exact match over the answer, not the reasoning chain. It is possible to recover this method as an instance of MBR by either taking the hypothesis/evidence sets to be the set of resulting answers, Y_h = Y_e = A, discarding the reasoning chain, or by defining a gain function G(y, y′) = 1(ans(y) = ans(y′)) over full outputs O; though notationally different, they are mathematically equivalent.
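The equivalence between majority voting over answers and MBR with an answer-match gain can be checked on a toy example. In this sketch the `ans` extractor (text after the last "Answer:") and the sample outputs are hypothetical; real prompts and parsers will differ.

```python
from collections import Counter


def ans(output):
    """Hypothetical answer extractor: the text after the last 'Answer:'."""
    return output.rsplit("Answer:", 1)[-1].strip()


def self_consistency(outputs):
    """Majority vote over extracted answers."""
    return Counter(ans(y) for y in outputs).most_common(1)[0][0]


def mbr_answer(outputs):
    """MBR over full outputs with gain G(y, y') = 1(ans(y) == ans(y'))."""
    def gain(y, y_prime):
        return float(ans(y) == ans(y_prime))

    best = max(outputs, key=lambda h: sum(gain(e, h) for e in outputs))
    return ans(best)


outputs = [
    "3 + 4 = 7. Answer: 7",
    "4 + 3 is 7. Answer: 7",
    "3 * 4 = 12. Answer: 12",
]
voted = self_consistency(outputs)
decoded = mbr_answer(outputs)
```

Both functions return the same answer here because the summed answer-match gain of an output is exactly the vote count of its answer.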
Thus, self-consistency is a type of MBR decoding in which we approximate the risk with a Monte Carlo estimate (cf. Eq. 6), the answers are sampled from the model (conditioned on the prompt), and the metric is exact match of the "final answer." This framing additionally explains some results from the self-consistency paper. Wang et al. (2023) compare the performance of self-consistency across sampling strategies, finding that the best of the strategies they tried are those that are closest to ancestral sampling (nucleus sampling with p = 0.95 and τ = 0.7 without top-k sampling). They also find that self-consistency works better with a sampled output rather than outputs from beam search (their Table 6). Through the lens of MBR, this empirical result has a clear theoretical justification: ancestral sampling of evidence sets generally yields the best performance for MBR because this provides an unbiased estimator of the probabilities of the sampled sequences. This also presents an opportunity for improvement: while Wang et al. (2023) do not evaluate on ancestral sampling, it is possible that this would outperform their best results.
Self-consistency is a special case of MBR. Proposed extensions to self-consistency have recovered aspects of generalized MBR decoding, including filtering to smaller hypothesis/evidence sets (Fu et al., 2022) and the use of alternative gain metrics (Jain et al., 2023). As a result, the term self-consistency has widened in definition from a specific type of MBR to a catch-all for MBR-based decoding methods on large language models.

Output Ensembling as MBR
Model ensembling techniques that operate on completed outputs of models may also be cast in MBR terms. Note that this does not include methods that operate on model weights or partial outputs. Common ensembling methods such as averaging model weights (Izmailov et al., 2018) or averaging token-level probabilities (Sennrich et al., 2016; Manakul et al., 2023) cannot be explicitly formulated as MBR.
The connection to MBR is most straightforward in methods that perform MBR decoding over the outputs of multiple models (DeNero et al., 2010; Duh et al., 2011; Barzdins and Gosko, 2016; Lee et al., 2022, inter alia). Representative of this family of methods is Post-Ensemble (Kobayashi, 2018), which ensembles multiple text generation models θ_1, θ_2, . . . , θ_n by separately decoding from each model, computing pairwise sentence embedding similarity between all pairs of outputs, and yielding the output with greatest average similarity. Observe that this may be framed as MBR minimizing the expected risk over the mixture distribution
\[ p(y \mid x) = \sum_{i=1}^{n} \pi_i \, p_{\theta_i}(y \mid x) \]
where Σ_{i=1}^{n} π_i = 1. While π_i is usually taken to be uniform over the ensemble, this need not always be the case (Duan et al., 2010).
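A minimal sketch of MBR over a weighted mixture of models: each model contributes its own outputs, and the expected gain of a hypothesis is the π-weighted average of its per-model mean gains. The word-set Jaccard gain is an assumed stand-in for the sentence-embedding similarity used in practice, and all names here are hypothetical.

```python
import math


def word_jaccard(y, y_prime):
    """Stand-in similarity: Jaccard overlap of word sets."""
    a, b = set(y.split()), set(y_prime.split())
    return len(a & b) / len(a | b) if a | b else 1.0


def ensemble_mbr(outputs_per_model, weights, gain):
    """MBR where the evidence distribution is a mixture over models.

    outputs_per_model: one list of decoded outputs per model
    weights:           mixture weights pi_i, which must sum to 1
    """
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("mixture weights must sum to 1")
    hypotheses = [y for outs in outputs_per_model for y in outs]

    def expected_gain(h):
        return sum(
            w * sum(gain(y, h) for y in outs) / len(outs)
            for w, outs in zip(weights, outputs_per_model)
        )

    return max(hypotheses, key=expected_gain)


outputs = [["the cat sat on the mat"], ["the cat sat"], ["a cat sat on a mat"]]
uniform_choice = ensemble_mbr(outputs, [1 / 3, 1 / 3, 1 / 3], word_jaccard)
skewed_choice = ensemble_mbr(outputs, [0.1, 0.8, 0.1], word_jaccard)
```

In this toy case the uniform mixture selects the first model's output, while placing most of the weight on the second model shifts the choice to that model's output, showing that the mixture weights are a genuine design decision.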
Other methods may be viewed as relaxations of MBR decoding. Assemble! (Martínez Lorenzo et al., 2023) ensembles Abstract Meaning Representation (AMR) graph parsers by computing the pairwise perplexities of each output under each parser. While this is not precisely MBR, it may be viewed as a variation where the evidence set is a set of models, not a set of model outputs.

MBR as Density Estimation
Interestingly, Post-Ensemble (Kobayashi, 2018) (§4.2) was not formulated as MBR (and in fact was never referred to by name as MBR), but rather as kernel density estimation. Kernel density estimation is a non-parametric method for estimating the probability density function p of an unknown distribution, given samples (x_1, x_2, . . . , x_n) from that distribution (Rosenblatt, 1956; Parzen, 1962):
\[ \hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} K(x, x_i) \tag{11} \]
where K is a kernel function.
Indeed, Equation 11 very closely resembles the Monte Carlo estimator of expected loss in Equation 6. This connection allowed Kobayashi (2018) to propose approximation error bounds on MBR, drawing from the density estimation literature. Note that the kernel function K(x, x_i) is more commonly written as K(x − x_i), or K(xᵀx_i) for directional statistics. While this may seem limiting, we can rewrite commonly used MBR metrics in this form; we show this for ROUGE-n as an example. For a sequence y, define T_n(y) to be a vector of size |V|^n, where |V| is the size of the vocabulary, containing the number of times every possible n-gram appears in y. Then we can rewrite ROUGE-n as a kernel over these count vectors, ROUGE-n(y, y′) = K(T_n(y), T_n(y′)).
The similarity between density estimation and MBR yields an alternative interpretation of MBR as a mode-seeking search. However, we are not seeking the mode of the model's distribution over outputs, p(y|x), but rather that of a distribution over some features ϕ(y) of our output, p′(ϕ(y)|x). For instance, in the case of ROUGE-n MBR, the features are the n-gram counts, ϕ(y) = T_n(y), and the search is for the mode of p′(T_n(y′)|x). We posit that this alternative distribution p′(T_n(y′)|x) may be better correlated with performance on specific downstream metrics than the original model distribution, potentially adding an additional justification for MBR's effectiveness.
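The count-vector view of ROUGE-n can be sketched with sparse n-gram counters. This is an illustration under stated assumptions: the kernel below is the F1 form of n-gram overlap, which is one common ROUGE-n variant and not necessarily the exact form intended here, and the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n):
    """Sparse version of T_n(y): counts of each n-gram in a whitespace-tokenized string."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_kernel(x, x_i):
    """Assumed F1-style kernel on count vectors: 2 * sum(min(x, x_i)) / (|x|_1 + |x_i|_1)."""
    overlap = sum((x & x_i).values())  # Counter '&' takes elementwise minimum
    total = sum(x.values()) + sum(x_i.values())
    return 2 * overlap / total if total else 0.0


def density_estimate(hypothesis, evidence, n=1):
    """Kernel-density-style score: average kernel value against the evidence samples."""
    x = ngram_counts(hypothesis, n)
    return sum(rouge_n_kernel(x, ngram_counts(e, n)) for e in evidence) / len(evidence)


score = density_estimate("the cat sat", ["the cat sat", "the cat ran"], n=2)
```

Maximizing `density_estimate` over a hypothesis set is the same computation as sampled MBR with this kernel as the gain, which is the correspondence the paragraph above describes.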
We hope this may inspire future work investigating the theoretical underpinnings of MBR.

Range Voting as MBR
Methods that take inspiration from outside of NLP may also be MBR-like; in particular, some MBR-like algorithms in the literature are formulated from a voting theory perspective where candidate hypotheses are assigned votes based on similarity to some set of voters (Wang et al., 2023; Jain et al., 2023; Suzgun et al., 2023; Hoang et al., 2021).
We show here that range voting (Borgeaud and Emerson, 2020), which broadly encapsulates these proposed voting methods, reduces to MBR. Range voting describes a family of voting systems in which each voter assigns each candidate a score and the candidate with the greatest total or average score is elected. Observe that the set of candidates C corresponds to the hypothesis set Y_h and the set of voters V corresponds to the evidence set Y_e. Then, if voter v's score for candidate c is taken to be a gain G(v, c) and each voter is assigned uniform weight, range voting is equivalent to the MBR decision rule in Equation 8:
\[ \hat{c} = \operatorname*{argmax}_{c \in C} \frac{1}{|V|} \sum_{v \in V} G(v, c) = \operatorname*{argmax}_{y' \in Y_h} \frac{1}{|Y_e|} \sum_{y \in Y_e} G(y, y') \]
Other range-voting methods can similarly be cast as MBR variants.
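A minimal sketch of this correspondence: a generic range-voting routine, applied with evidence samples as voters and a gain function as the ballot, selects the same output as a sampled MBR decoder. The function names and the toy word-overlap gain are assumptions for illustration.

```python
def range_vote(candidates, voters, score):
    """Elect the candidate with the highest average score across voters."""
    averages = [sum(score(v, c) for v in voters) / len(voters) for c in candidates]
    winner = max(range(len(candidates)), key=averages.__getitem__)
    return candidates[winner], averages


def mbr_decode(hypotheses, evidence, gain):
    return max(
        hypotheses, key=lambda h: sum(gain(e, h) for e in evidence) / len(evidence)
    )


def word_jaccard(y, y_prime):
    a, b = set(y.split()), set(y_prime.split())
    return len(a & b) / len(a | b) if a | b else 1.0


# Classic range voting with explicit ballots (voter -> candidate -> score).
ballots = {"v1": {"x": 0.9, "y": 0.4}, "v2": {"x": 0.2, "y": 0.7}, "v3": {"x": 0.6, "y": 0.5}}
elected, _ = range_vote(["x", "y"], list(ballots), lambda v, c: ballots[v][c])

# The same routine with samples as voters and a gain as the score is MBR.
samples = ["the cat sat", "the cat sat down", "a dog ran"]
voted, _ = range_vote(samples, samples, word_jaccard)
decoded = mbr_decode(samples, samples, word_jaccard)
```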

Design Decisions Impact MBR Performance
Although all the methods in Section 4 are MBR-like, they make very different decisions about the four design choices in our MBR taxonomy. To demonstrate the importance of the method design, we consider empirically two cases where changing design impacts the performance of the method.

Experimental Details
We run MBR experiments for abstractive summarization on CNN/DM (Nallapati et al., 2016) with a fine-tuned BART-Large released by the BART authors (Lewis et al., 2020) as our base model. In §5.3, we additionally report results for translation on WMT'16 Romanian-English (Ro-En) (Bojar et al., 2016) using mBART-50 (Liu et al., 2020). We draw n_e ancestral samples for our evidence set and n_t temperature samples (τ = 0.5 for CNN/DM, τ = 0.3 for WMT'16 Ro-En) for our hypothesis set. We set n_e = n_t = 30 in §5.2 and n_e = n_t = 50 in §5.3. Unless otherwise specified, we take ROUGE-1 (Lin, 2004) as our gain metric for summarization and BLEU-4 (Papineni et al., 2002; see footnote 7) as our gain metric for translation.

The MBR metric matters, but perhaps not as much as the hypothesis set
We find that using MBR with the summarization n-gram metric ROUGE-1 (Lin, 2004) improves abstractive summarization performance over beam search on CNN/DM, even when evaluating performance with neural metrics; using the general-purpose neural metric BERTScore (Zhang et al., 2020) as the MBR metric yields the highest BERTScore but smaller gains on non-neural metrics, a finding consistent with past work; and even BEER (Stanojević and Sima'an, 2014), a translation metric, works as an MBR metric for this task.
However, prior work using the same dataset and model (Wiher et al., 2022) found that MBR with BEER (Stanojević and Sima'an, 2014) underperforms beam search. This divergence in results is likely due to our different choices of hypothesis set: Wiher et al. (2022) use the evidence set plus additional outputs from other decoding methods as hypotheses, while we use temperature samples at τ = 0.5. While reusing the evidence set is more efficient than sampling a separate set of hypotheses, it leads to performance degradation in this case; this further emphasizes the importance of choosing the hypothesis set in MBR.
7 We use the implementation from sacrebleu (Post, 2018) with signature nrefs:1|case:mixed|eff:yes|tok:13a|smooth:exp|version:2.3.1

Varying the risk distribution: lessons from beam search don't translate to MBR
By nature, autoregressive text generation models suffer from length bias: sequence probability monotonically decreases with increasing length, causing shorter, potentially less informative sequences to be favored by the model distribution (Koehn and Knowles, 2017; Stahlberg and Byrne, 2019).
For non-sampling methods such as beam search, the sequence probabilities are generally modified with a length-dependent term when comparing sequences (Murray and Chiang, 2018; Cho et al., 2014). Hence, it stands to reason that a length-corrected distribution with these biases alleviated may provide a better estimate of the risk R(y′).
Vanilla Monte Carlo MBR (as depicted in Equation 6) yields an estimate of the expected risk under the distribution that our evidence samples are drawn from.To modify the distribution used in our estimate, we turn to importance sampling, a method for estimating the expected value of a quantity under target distribution p, given samples from proposal distribution q (Kloek and van Dijk, 1978).For a brief tutorial on importance sampling and description of our estimator, see Appendix A.
We take the score of a sequence to be the log probability:
\[ s(y \mid x) = \log p(y \mid x) \]
We then experiment with two of the strategies described in Murray and Chiang (2018) for constructing the length-corrected score s_l(y|x):
(a) Length normalization: The model distribution is smoothed with temperature T^β, where T is the sequence length and β is the length penalty, a hyperparameter. A larger β more heavily prioritizes longer sequences.
\[ s_l(y \mid x) = \frac{s(y \mid x)}{T^{\beta}} \tag{16} \]
(b) Length reward (He et al., 2016): A fixed reward γ is added to the score per token generated.
\[ s_l(y \mid x) = s(y \mid x) + \gamma T \tag{17} \]
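The two length-correction strategies can be sketched as simple functions of a sequence's log probability and length. This reflects one reading of the descriptions above (dividing the score by T^β, and adding γ per token); the function names and the numbers in the example are illustrative assumptions.

```python
def length_normalized_score(log_prob, length, beta):
    """Length normalization: divide the log probability by length ** beta."""
    return log_prob / (length ** beta)


def length_reward_score(log_prob, length, gamma):
    """Length reward: add a fixed bonus gamma for every generated token."""
    return log_prob + gamma * length


# A short sequence with higher probability versus a longer, less probable one.
short = {"log_prob": -2.0, "length": 4}
long_ = {"log_prob": -3.0, "length": 8}

raw_prefers_short = short["log_prob"] > long_["log_prob"]
norm_short = length_normalized_score(short["log_prob"], short["length"], beta=1.0)
norm_long = length_normalized_score(long_["log_prob"], long_["length"], beta=1.0)
reward_short = length_reward_score(short["log_prob"], short["length"], gamma=0.5)
reward_long = length_reward_score(long_["log_prob"], long_["length"], gamma=0.5)
```

In this example the uncorrected score prefers the short sequence, while both corrected scores prefer the longer one, which is the bias the corrections are meant to counteract.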
The length-corrected distribution is then p_l(y|x) ∝ exp s_l(y|x). We apply normalized importance sampling (Rubinstein and Kroese, 2016) to estimate the risk under the length-corrected distribution, i.e. R(y′) = E_{y∼p_l}[L(y, y′)], given samples drawn from the model distribution p(y|x). We compare our MBR results against beam search both with and without length normalization. We use the models' default values for length penalty (β = 2 for BART, β = 1 for mBART).
Table 4: MBR results for various length correction schemes on WMT'16 Romanian-English. We report BLEU, chrF, BLEURT, BERTScore, and length ratio, respectively. We use the chrF (Popović, 2015) implementation from sacrebleu.
Our results are shown in Tables 3 and 4. In line with past work, we find that beam search generally benefits from incorporating a length penalty. However, we find that length-corrected MBR underperforms vanilla MBR. This may be due to a gap between the sampling and length-corrected distributions, leading to a high-variance estimator of risk.
However, our results are also emblematic of a wider trend among minimum-risk techniques. Past work has found that models trained with Minimum Error Rate Training (Och, 2003; Shen et al., 2016), an error-aware training method, do not require length correction in beam search (Neubig, 2016). Similarly, we find that MBR without length correction generates outputs relatively close in length to the references, more so than length-normalized beam search. This suggests that MBR may be to some extent immune to length biases, when they are not introduced by the MBR metric (Müller and Sennrich, 2021).

MBR applications in NLP
The use of minimum Bayes risk decoding in NLP predates these MBR-like methods; MBR has been applied by name in NLP since the 1990s.
Historical context. Minimum Bayes Risk decoding has roots in Bayesian decision theory, a field of study that dates as far back as the Age of Enlightenment (Bernoulli, 1738; Parmigiani, 2001). Central to Bayesian decision theory is the principle of risk minimization: in the face of uncertainty, an optimal decision maker should choose the option that minimizes the amount of error they can expect to suffer or, in other terms, maximizes the amount of utility they can expect to enjoy (DeGroot, 1970; Bickel and Doksum, 1977). This is precisely the intuition encoded in MBR (i.e. Equation 3).
Adoption in NLP. MBR was adopted by the speech and NLP communities in the 1990s and early 2000s, finding applications in syntactic parsing (Goodman, 1996; Sima'an, 2003), automatic speech recognition (Stolcke et al., 1997; Goel and Byrne, 2000), and statistical machine translation (Kumar and Byrne, 2004; Tromble et al., 2008; Kumar et al., 2009). Many NLP tasks during this time relied upon graph structures as inductive biases (i.e. parse trees or translation lattices/hypergraphs). As such, early MBR works often used these graphical models as hypothesis and evidence spaces. Work on lattice MBR (Tromble et al., 2008), for instance, treated the set of all hypotheses encoded in a word lattice, of which there are exponentially many, as both evidence and hypothesis sets. This is in contrast to most later MBR work, which operates on a relatively small list of text outputs obtained from a neural model. As a result, early work relied on rather involved dynamic programming algorithms for exact MBR decoding and was restricted to token-factorizable metrics such as BLEU and edit distance. Later work additionally demonstrated the efficacy of MBR for question answering (Duan, 2013) and for joining statistical and neural approaches to translation (Stahlberg et al., 2017).
Recent usage. In an effort to move past beam search, which has well-known pathologies (Stahlberg and Byrne, 2019), MBR has in recent years resurfaced as a decision rule for text generation models (Eikema and Aziz, 2020). As discussed earlier in §3, several lines of work have sprung up investigating the properties of MBR in modern neural text generation setups. Notably, however, most of these works have focused on applications of the method to neural machine translation, with only a few very recent works studying its applications in other text generation tasks (Shi et al., 2022; Wiher et al., 2022; Suzgun et al., 2023).
Outside of these areas, the method has largely been applied in shared task papers (e.g.Manakul  2022); Barzdins and Gosko (2016)), as it provides a reliable boost in performance.The fraction of papers in the ACL Anthology that reference MBR (at least by this name) has declined from its peak around 2009 (Figure 1).

Conclusion
Minimum Bayes Risk decoding has declined in popularity, but the underlying concept of sampling a set from a distribution and choosing an output to minimize risk according to that set has remained. This concept now takes many surface forms (from self-consistency to range voting to output ensembles), and current research in these areas rarely draws connections to MBR. While rediscovery is a key part of science, so is recontextualizing new methods within a broader research narrative. This can often reveal new insights or cast findings in a different light. For instance, the empirical benefits of self-consistency can be justified through an MBR framing; work on extensions to self-consistency has rediscovered other properties of MBR; and work on ensembling has raised questions about how to weight mixtures of models that can be reasoned about within the framework of noisy estimates of global probability distributions.
The adoption of newer terms for MBR-like methods may be a type of terminology drift. Related phenomena have been studied in the philosophy of science literature, including pressures to coin new terms (Dyke, 1992; Merton, 1957), potential negative consequences of divergent terminology (Calvert, 1956; Samigullina et al., 2020), and decreased citation of older methods in NLP (Singh et al., 2023). For a more involved discussion of the literature on term coining and possible connections, see Appendix B.
Language is not static, so some degree of terminology drift in scientific literature is unavoidable.However, recognizing the connections between modern techniques and older work is crucial to understanding why such methods are effective.We must not forget the lessons of the past as we search for the methods of the future.

A More details on importance sampling for MBR
We present in this section the normalized importance sampling estimator of risk used in our experiments in §5.3. The core insight of importance sampling is that we can rewrite the expected value of a random variable f(x) under target distribution p as another expectation under some proposal distribution q:
\[ \mathbb{E}_{x \sim p} \left[ f(x) \right] = \mathbb{E}_{x \sim q} \left[ f(x) \, \frac{p(x)}{q(x)} \right] \]
Importance sampling can be particularly useful when sampling from the proposal distribution is easy, but sampling from the target distribution is costly or intractable; this is indeed the case for MBR, as sampling from the length-corrected distribution p_l(y|x) requires computation of its partition function, which has exponential complexity.
Hence, for MBR, if we draw evidence samples Y_e according to model distribution p(y|x) but wish to compute the risk under some length-corrected distribution p_l(y|x), we may compute
\[ R(y') = \mathbb{E}_{y \sim p_l} \left[ L(y, y') \right] = \mathbb{E}_{y \sim p} \left[ w(y) \, L(y, y') \right] \approx \frac{1}{|Y_e|} \sum_{y \in Y_e} w(y) \, L(y, y') \]
where we let w(y) = p_l(y|x)/p(y|x), commonly referred to as the importance weight. Note, however, that importance sampling requires us to be able to exactly compute the probabilities p(y|x) and p_l(y|x); while the former can be computed efficiently (Equation 2), the latter is intractable, again because it requires the partition function. What we can efficiently compute is the unnormalized probability p̃_l(y|x) = exp s_l(y|x), where s_l is the length-corrected score given by either Equation 16 or 17.
Fortunately, we can use normalized importance sampling to obtain a consistent estimator of the risk by adjusting importance weights (Rubinstein and Kroese, 2016):
\[ R(y') \approx \frac{\sum_{y \in Y_e} \tilde{w}(y) \, L(y, y')}{\sum_{y \in Y_e} \tilde{w}(y)} \]
where w̃(y) = p̃_l(y|x)/p(y|x). As it is the ratio of two estimates, the normalized importance sampling estimator is biased for finite sample sizes.
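A minimal sketch of the normalized (self-normalized) importance sampling estimator of risk, working in log space for numerical stability. The function and argument names are hypothetical, and the toy probabilities are chosen only to make the arithmetic easy to follow.

```python
import math


def snis_risk(hypothesis, evidence, loss, log_p, log_p_tilde):
    """Estimate risk under a target known only up to a constant.

    evidence:    samples drawn from the proposal (model) distribution p
    log_p:       function returning log p(y|x) for a sample
    log_p_tilde: function returning the unnormalized target log score
    """
    log_w = [log_p_tilde(y) - log_p(y) for y in evidence]
    shift = max(log_w)  # subtracting the max cancels in the ratio below
    w = [math.exp(lw - shift) for lw in log_w]
    return sum(wi * loss(y, hypothesis) for wi, y in zip(w, evidence)) / sum(w)


def mismatch(y, y_prime):
    """Toy 0/1 loss."""
    return 0.0 if y == y_prime else 1.0


model_log_p = {"a": math.log(0.25), "b": math.log(0.75)}
evidence = ["a", "b", "b"]

# Target equal to the proposal up to a constant: weights are uniform.
same_target = snis_risk("b", evidence, mismatch, model_log_p.get,
                        lambda y: model_log_p[y] + 3.0)

# Target that upweights "a" by a factor of 4 relative to the proposal.
boost = {"a": math.log(4.0), "b": 0.0}
shifted_target = snis_risk("b", evidence, mismatch, model_log_p.get,
                           lambda y: model_log_p[y] + boost[y])
```

When the target matches the proposal up to a constant, the estimate reduces to the plain Monte Carlo average (1/3 here); upweighting the mismatching sample raises the estimated risk to 2/3. The unknown normalizing constant never needs to be computed because it cancels in the ratio.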

B Contextualizing this work within philosophy of science
In this section, we contextualize our work in the broader framings of meta-analysis of scientific research.
Patterns of citation in NLP. Several factors have been shown to correlate with citation rate in NLP, including author geographic location (Rungta et al., 2022), author gender (Mohammad, 2020), and publication date (Bollmann and Elliott, 2020; Singh et al., 2023). Bollmann and Elliott (2020) conduct a bibliometric analysis of the ACL Anthology, finding that the mean age of papers cited decreased significantly from 2010 to 2019. Singh et al. (2023) expand this analysis to the full anthology, finding that, while citations of older papers rose briefly in the mid-2010s, they have since declined, with 2021 marking a historic low for the percentage of citations that went to older papers. They term this citational amnesia and discuss several possible reasons for the result, including the shift to neural methods and the rise of new areas of NLP.
Our work raises another potential explanation: some citational amnesia is due to terminology drift over time, as old methods begin to be referred to by newer names.
Term coining in science. Work in science and technology studies has examined the broader phenomenon of term coining in science. Dyke (1992) argues that neologisms emerge more frequently in fields that prize novelty and see science as fundamentally about leaps of discovery, and fields that are perceived as synthesizing findings from multiple fields are most likely to recycle terms from other disciplines. She cites computer science as an example of a field where most new terms of art emerge from recycling common words, often those that draw a metaphor to some basic physical or human concept; this is reflected in the adoption of the humanizing "self-consistency" and the political-science-inspired "range voting" in decoding. Raad (1989) suggests that evocative, metaphor-laden names are more likely to emerge as a scientific field grows more public-facing and in times where many new terms are being coined; both of these descriptors apply to modern NLP. While several works in linguistics and STS have considered the coining of new terms for new phenomena, relatively little work has focused on the divergence of terminology for previously observed phenomena.
The consequences of divergent or distinct terminology have also been studied, with differences in terminology across fields blamed for slow adaptation of research to practical applications (e.g. in studying visual distortions during plane takeoff (Calvert, 1956)).Borrowing terminology from another language (often Latin or Greek) or from another field has been described as a method to build common ground between researchers (Samigullina et al., 2020) and as a possibly concerning pressure against developing language-specific scientific terminology in lower-resourced languages (Hultgren, 2013).However, most work on lexical divides in science has focused on divides across language or field rather than divides across time in the same field.

Figure 1: The use of MBR (by name) peaked in the mid-2010s. This graph shows the percentage of ACL Anthology papers that mention several MBR-related phrases by year, from 2000 to 2022.

Table 1: Recent work under our taxonomy. The line separates methods that are explicitly MBR (above) from those that we identify as MBR-like (below).
1 Different temperatures used for evidence and hypothesis.