Machine Translation Decoding beyond Beam Search

Beam search is the go-to method for decoding auto-regressive machine translation models. While it yields consistent improvements in terms of BLEU, it is only concerned with finding outputs with high model likelihood, and is thus agnostic to whatever end metric or score practitioners care about. Our aim is to establish whether beam search can be replaced by a more powerful metric-driven search technique. To this end, we explore numerous decoding algorithms, including some which rely on a value function parameterised by a neural network, and report results on a variety of metrics. Notably, we introduce a Monte-Carlo Tree Search (MCTS) based method and showcase its competitiveness. We provide a blueprint for how to use MCTS fruitfully in language applications, which opens promising future directions. We find that which algorithm is best heavily depends on the characteristics of the goal metric; we believe that our extensive experiments and analysis will inform further research in this area.


Introduction
Sequence to sequence model decoding remains something of a paradox. The most widely adopted training method for these models is maximum likelihood estimation (MLE), which aims at maximising the probability of the ground truth outputs provided in the training datasets. Consequently, decoding from MLE-trained models is done by trying to find the output to which the model assigns maximum likelihood. Unfortunately, as models usually predict tokens one by one, exact search is not feasible in the general case and practitioners resort to heuristic mechanisms instead.
The most popular of these heuristics is beam search (Reddy, 1977), which maintains several hypotheses in parallel and is guaranteed to find a more likely output than the more basic greedy decoding. This approach has some obvious flaws: for one, it is completely agnostic to the metrics (or scores) practitioners actually want to optimise.
Even more crucially, in most cases beam search fails at the one thing it is supposed to do: finding the optimal output sequence (w.r.t. the model), as shown by Stahlberg & Byrne (2019). Also alarming are the findings of Welleck et al. (2020), who prove that traditional search mechanisms can yield infinite-length outputs, to which the model assigns zero probability. Interestingly, the use of likelihood as a training objective has a spectacular side-effect: it causes trained models to have an inordinate fondness for empty outputs. By using exact search on the output likelihood in machine translation, Stahlberg & Byrne (2019) show that in more than half of cases the highest scoring output according to the model is the empty sentence! All told, we rely on models placing a surprising emphasis on empty outputs, and on a decoding mechanism which usually fails to find optimal outputs; and both ignore the relevant metrics. One can then justifiably wonder why we observe impressive MT results. Stahlberg & Byrne (2019) provide an apparently paradoxical explanation: it is precisely because the decoding mechanisms are imperfect that models produce outputs of high quality. Meister et al. (2020a) elaborate on this assumption; they show that beam search optimises for a slightly modified likelihood objective, one that promotes a uniform distribution of probability within sentences. This state of affairs seems highly unsatisfactory. While a whole body of work has been devoted to alleviating these issues, most approaches have been concerned with training (Bengio et al., 2015; Ranzato et al., 2016; Shen et al., 2016; Norouzi et al., 2016; Bahdanau et al., 2017; Edunov et al., 2018; Leblond et al., 2018), or with making the search mechanism differentiable (Collobert et al., 2019). These have resulted in performance increases, but they still rely on likelihood as an objective for decoding. Further, Choshen et al. (2019) show that performance improvements using RL are limited and poorly understood.
In this paper, we focus instead on contrasting the performance of beam search to alternative decoding algorithms aimed at optimising various metrics of interest directly, via a value function (or the metric itself when available). Notably, we experiment with variants of the powerful Monte Carlo Tree Search (MCTS) mechanism (Coulom, 2006; Kocsis & Szepesvári, 2006), which has a proven track record in other sequential applications (Browne et al., 2012; Silver et al., 2017). We investigate whether, by optimising the metric of interest at test time, one can obtain improved performance compared to likelihood-based approaches, and whether performance scales with the amount of computation, as opposed to that of beam search, which has been shown to degrade with large beam sizes (Cohen & Beck, 2019).
We concentrate on machine translation (MT), an emblematic and well-studied sequence to sequence task, which comes with readily available data and well-defined benchmarks.
Contributions. (i) We distinguish two different types of metrics: privileged scores, which rely on ground truth translations, in contrast to unprivileged ones. We design a new score, Multilingual BERTScore, as an imperfect but illustrative example of the latter. (ii) We introduce several new decoding algorithms, detailing their implementation and how best to use them for MT. In particular, we provide a blueprint for how to use MCTS profitably in NLP (as well as pseudocode for a batched Numpy-based (Harris et al., 2020) implementation), which opens the door for many exciting applications. (iii) We run extensive experiments to study the performance of decoding mechanisms for different metrics. We show that beam search is the best option only for privileged metrics. For those, value-based alternatives falter as the value problem is too hard -since it ultimately relies on reconstructing hidden information. For unprivileged scores, beam search is outperformed by its competitors, including MCTS.
Outline. We go over the related work thoroughly in Section 2. In Section 3, we contrast several types of metrics, and introduce illustrative examples. We review beam search and introduce alternative algorithms in Section 4. We explain how we train the required value function for value-based methods in Section 5. In Section 6 we introduce necessary architecture adaptations for inference-intensive applications and then go over experimental details and results. Finally, we discuss our results and their limitations as well as possible next steps in Section 7.

Related Work
Incremental models for sequence generation typically output more coherent sequences, as each token prediction takes into account its predecessors (Gu et al., 2018). However, this gain comes at a cost in terms of tractability: finding the sequence with maximum probability according to the model, argmax_{y∈Y} π(y|x), becomes a search problem over the combinatorial space Y. Given the size of the (token) action space A, exact search appears out of the realm of possibility. So we have to resort to incremental prediction; but then how do we pick individual tokens, without knowing how these choices will impact the likelihood of the final sequence? We start by describing the three most widely used methods, which all pick tokens one by one from left to right.
The sampling method predicts tokens by directly sampling from the model policy π(y_{t+1} | x, y_1...y_t), computed via a softmax operator applied to the model logits (Ackley et al., 1985), possibly after applying a temperature parameter. The greedy search method incrementally picks the token with highest probability according to the model. This inexpensive approach can be seen as a special case of the sampling method, with very low temperature. Finally, beam search maintains a beam of k possible translations, updating them incrementally by ranking their extensions via the model likelihood. While k times more expensive than the previous approaches, beam search has stood the test of time, resulting in steady performance improvements on MT tasks.
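As a concrete illustration, the three methods above differ only in how they turn the model's next-token distribution into a choice. A minimal sketch of the first two, where a toy, hypothetical logit vector stands in for real model outputs:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution, with an optional temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding step: take the most probable token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature=1.0, rng=random):
    """Sampling step: draw a token from the (tempered) distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Toy logits over a 4-token vocabulary (purely illustrative).
logits = [2.0, 1.0, 0.5, -1.0]
print(greedy_pick(logits))               # token 0, the argmax
print(softmax(logits, temperature=0.1))  # near one-hot: low temperature ≈ greedy
```

Lowering the temperature concentrates the sampling distribution on the argmax token, which is why greedy search is the zero-temperature limit of sampling.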
Building on these methods, a number of improvements have been proposed. Welleck et al. (2019) explore out-of-order decoding, where the model additionally learns the order in which to decode tokens. This provides benefits in a variety of tasks, but unfortunately not MT. Wang et al. (2020) use look-ahead in the beam search to take into account future likelihood, which yields improvements on low-data tasks, but again does not outperform beam search on MT. Meister et al. (2020b) speed up beam search for monotonic scores.
Several works focus on studying the interplay between the incremental models and beam search. Cohen & Beck (2019) show that performance is not monotonically increasing with beam size, but degrades after a fairly small value of k. Stahlberg & Byrne (2019) devise a clever exact search mechanism, relying on the fact that likelihoods are monotonically decreasing with sequence length. While still prohibitively expensive, this approach underlines several key facts. First, beam search does not recover argmax_{y∈Y} π(y|x) in most cases, even with increased computational budget. Second, argmax_{y∈Y} π(y|x) is the empty sequence more than half the time in MT. Eikema & Aziz (2020) propose an interesting explanation for this observation: while models are good at spreading probability mass over a large quantity of acceptable outputs, they are unable to effectively pick the best one. Indeed, the mode of the distribution might even be disjoint from the area where the model assigns the majority of probability mass. They propose minimum Bayes risk decoding, which leverages the whole distribution rather than only its mode, and can outperform vanilla beam search in low-resource scenarios.
A large body of work has been dedicated to improving sampling diversity, which plays a key role in many NLP applications, though not usually in machine translation. Fan et al. (2018) propose only sampling from the top k tokens according to the policy, to avoid sampling from the tail of the distribution. Holtzman et al. (2019) adopt a similar approach, but instead of fixing k they fix p, the size of the 'nucleus' of the distribution from which sampling is allowed to select tokens. This performs better on open-ended tasks. Kool et al. (2019) propose a search mechanism in-between sampling and beam search, which produces provably unique samples by leveraging the Gumbel-Max trick (Gumbel, 1954). Yu et al. (2020) use a different, much more expensive flavor of MCTS to add diverse samples to a larger NMT system: instead of relying on direct value estimation, they rely on (expensive) rollouts to estimate node values.
Finally, the most pertinent approach to optimising various metrics, and the most closely related to our proposed MCTS decoding, is value-guided beam search, as developed by He et al. (2017) and Ren et al. (2017) for MT and image captioning. Contrary to all other methods presented in this section, this approach does not solely rely on model likelihood. In both papers a value network, estimating the eventual score from an unfinished sample, is trained in addition to the policy network. Then, instead of following the likelihood to select the hypotheses on the beam, one uses a linear combination of the policy logits and the value. This approach has shown improved performance compared to vanilla beam search; notably, it is less sensitive to the chosen beam size. While this method uses the value exclusively for one-step lookaheads, MCTS can be leveraged to explore further into the future. Additionally, it requires evaluating the value of all tokens at each step, which can be prohibitively expensive if the action space is big (in MT, one routinely uses vocabularies of size larger than 30,000).

Machine Translation metrics
There are two main evaluation strategies for MT outputs. The first one crucially relies on having access to a heldout test set of high quality (input, output) sentence pairs (x, y_x)_{x∈X}. One can then compute a monolingual similarity score between the system's outputs (ŷ_x)_{x∈X} and the ground truth outputs (y_x)_{x∈X}. Common metrics include BLEU (Papineni et al., 2002); METEOR (Denkowski & Lavie, 2014), which takes into account synonyms; and BERTScore (Zhang et al., 2019). We refer to this type of metric as privileged, as these require access to ground truth translations.
The second is concerned with assessing translation quality for source sentences for which one does not have reference translations. To determine whether machine-generated outputs are accurate enough or require human modification, one relies on multilingual quality estimation metrics (Specia et al., 2018). These do not rely on ground truth sequences; instead, they compare produced samples to source sentences directly. Expert human evaluation is perhaps the most relevant such score, but many automated alternatives exist (Martins et al., 2017); see e.g. Bhattacharyya et al. (2021). We refer to these metrics as unprivileged.
Privileged metrics provide high quality evaluation signal, and are well-suited to comparing average model performance (trusting that results on the unseen test set generalise to other domains of interest). However, they rely on the quality of the test set translations (which are usually unique, hence somewhat arbitrary), and cannot be used to evaluate the quality of models' predictions for specific unseen inputs. In contrast, unprivileged metrics are harder to access or approximate, but can be used without ground truth translations.
We use two privileged metrics in our experiments: BLEU and BERTScore. We introduce another, unprivileged metric: Multilingual BERTScore. Note that while a translation model likelihood can be considered an unprivileged metric, it comes with the unusual property that it is decomposable. We thus treat it as a special case.
BLEU (Papineni et al., 2002). The BLEU score computes modified precisions for n-grams (typically with n ranging from 1 to 4) between a corpus of candidate sentences and a reference corpus. These precisions are then averaged geometrically, and multiplied by a brevity penalty. This metric is meant to be used at the corpus level; it is unstable at the sentence level. It is the de facto gold standard for comparing MT algorithms, though as it crucially relies on access to a dataset of reference translations, it is not available to assess translation quality at decoding time.
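To make the computation concrete, here is a minimal pure-Python sketch of clipped n-gram precisions and a brevity penalty in the spirit of BLEU. Real implementations (e.g. sacrebleu) additionally handle corpus-level aggregation, smoothing and tokenisation; this toy, unsmoothed single-pair version is for illustration only:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate counts are capped by reference counts."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

def toy_bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n clipped precisions, times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:  # no smoothing in this toy version
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_avg)

ref = "the cat sat on the mat".split()
assert toy_bleu(ref, ref) == 1.0  # a perfect match scores 1
```

The clipping in `modified_precision` is what penalises degenerate repetition: "the the the the" gets unigram precision 0.5 against the reference above, since only two occurrences of "the" can be matched.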
BERTScore (Zhang et al., 2019). By contrast, BERTScore is a sentence-level metric to compare a candidate sentence to a reference translation. It relies on several consecutive steps: first, computing contextual embeddings for each token in both sentences with a shared BERT (Devlin et al., 2019) model; second, computing all pairwise cosine similarities between embeddings of the two sentences; third, greedily aligning tokens based on these similarities; finally, averaging the similarities of the aligned tokens. Compared to BLEU, BERTScore is found to correlate slightly better with human judgement. Importantly for decoding purposes, it is a sentence-level metric (which is averaged to produce a corpus-level statistic).
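The matching stage can be sketched independently of the BERT model. Below, made-up 2-dimensional "embeddings" stand in for contextual BERT embeddings; only the cosine-similarity and greedy-matching logic mirrors the description above (shown here in the recall direction, matching each reference token to its best candidate token):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_bertscore_recall(ref_embs, cand_embs):
    """For each reference token embedding, greedily take its best-matching
    candidate token, then average the similarities of the matched pairs."""
    sims = [[cosine(r, c) for c in cand_embs] for r in ref_embs]
    return sum(max(row) for row in sims) / len(sims)

# Identical "embeddings" align perfectly and score 1.
embs = [[1.0, 0.0], [0.0, 1.0]]
assert abs(toy_bertscore_recall(embs, embs) - 1.0) < 1e-9
```

The full metric computes both precision and recall directions and combines them into an F-score; the sketch above only shows the core alignment step.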
Multilingual BERTScore. While BERTScore is designed as a monolingual metric, we repurpose it as a multilingual one by using it to compare a candidate to its source sentence (instead of a reference translation). Both sentences are in different languages, but this is fine as long as the underlying BERT model is itself multilingual. We call this new metric Multilingual BERTScore. Its performance relies on the underlying BERT model's ability to map related tokens in different languages to similar embeddings. Because of the one-to-one nature of the alignment phase, we expect it to assign higher scores to translation pairs that have a one-to-one token correspondence than to syntactically different pairs. We thus expect it to make sense for pairs of syntactically similar languages. We stress that we do not advocate widespread adoption of this imperfect score for MT; we consider it however a convenient illustrative example of an unprivileged metric. In practice, we observe that it behaves reasonably for our two evaluation language pairs (WMT '14 English/German and English/French) as shown in Table 1. Interestingly, it scores trained model outputs higher than ground truth outputs. We hypothesise that the former follow the source sentence more closely than the latter.

Decoding algorithms
In this section, we go over the details of each algorithm, and its adaptations to better suit privileged or unprivileged metrics. We separate them into three categories: (i) algorithms based on likelihood maximisation, (ii) value-based mechanisms which rely on approximating metrics via a value function, and (iii) ranking-based methods which access the metrics directly and pick the highest-scoring example out of a pool of finished candidates. Of course, ranking-based methods are only usable for unprivileged metrics, as privileged metrics are not computable at test time. We provide a high-level comparison of all algorithms in Table 2.

Likelihood-based decoding
Greedy decoding (GD) is our first baseline; it consists in picking the token with maximum likelihood at each step.
Beam search (BS) maintains a beam of k candidate translation prefixes at each time step t, (p^i_t)_{i=1..k}. Prefixes are updated incrementally as follows: each prefix p^i_t is extended with each of its k most probable next tokens (given p^i_t), resulting in at most k × k new prefixes of length t+1. Among these, the k prefixes with the highest likelihood are selected, yielding (p^i_{t+1})_{i=1..k}. This method aims at optimising likelihood, and is agnostic to any metric of interest. It is therefore at a disadvantage if we change the objective of the search. Consequently, we also study the performance of value- or score-based variants.
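One step of this update can be sketched as follows, with a hypothetical `log_probs(prefix)` function standing in for the model (here a fixed toy table over a 3-token vocabulary):

```python
import math

def log_probs(prefix):
    # Toy next-token log-probabilities, standing in for a real model's output;
    # a real model would condition on the prefix.
    return [math.log(p) for p in (0.6, 0.3, 0.1)]

def beam_step(beam, k):
    """Extend each (prefix, log-likelihood) hypothesis with its k best tokens,
    then keep the k highest-scoring of the (at most) k*k extensions."""
    candidates = []
    for prefix, score in beam:
        lp = log_probs(prefix)
        top_k = sorted(range(len(lp)), key=lambda t: -lp[t])[:k]
        for t in top_k:
            candidates.append((prefix + [t], score + lp[t]))
    return sorted(candidates, key=lambda c: -c[1])[:k]

beam = [([], 0.0)]       # start from the empty prefix
for _ in range(2):       # two decoding steps with beam size 2
    beam = beam_step(beam, k=2)
print([p for p, _ in beam])  # best prefixes: [[0, 0], [0, 1]]
```

Greedy decoding is the k = 1 special case of this loop.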

Value-based decoding
To motivate the introduction of value functions in our decoding mechanisms, it is helpful to understand how machine translation can be construed as a Reinforcement Learning task, with an underlying Markov Decision Process (MDP).
In MT, we work with a vocabulary V of tokens, and a dataset contains pairs of sentences (x, y_x) where x, y_x ∈ V+. We can define a somewhat trivial MDP, where:
• the states are pairs containing a source sentence x ∈ V+ and a sample under construction ŷ_1...ŷ_t ∈ V+;
• the action space A is the output vocabulary V (taking an action means appending a specific token to the sample);
• the transitions are deterministic: picking token ŷ_{t+1} ∈ A in state s_t = (x, ŷ_1...ŷ_t) leads to the unique possible successor state s_{t+1} = (x, ŷ_1...ŷ_t ŷ_{t+1});
• the reward is 0 for any non-terminal state; for terminal states, it is m(y_x, ŷ) for privileged metrics and m(x, ŷ) for unprivileged metrics of interest. A terminal state is entered by picking a special <EOS> token.
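This MDP can be written down almost verbatim; a minimal sketch, with a placeholder metric standing in for a real score:

```python
EOS = "<EOS>"

def step(state, action, metric):
    """Deterministic MDP transition: append `action` to the sample under
    construction. Reward is 0 until <EOS> is picked, at which point the
    metric of the finished sample is returned as the terminal reward."""
    source, sample = state
    if action == EOS:
        return (source, sample), metric(source, sample), True  # terminal state
    return (source, sample + (action,)), 0.0, False

# Placeholder unprivileged metric: fraction of source tokens present in the sample.
toy_metric = lambda src, smp: sum(t in smp for t in src) / len(src)

state, done = (("a", "b"), ()), False
for tok in ["a", "b", EOS]:
    state, reward, done = step(state, tok, toy_metric)
assert done and reward == 1.0  # reward only arrives at the end of the episode
```

The sparse, terminal-only reward is exactly why a value function is needed: it is the only way to get forward-looking signal at intermediate steps.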
A value function v for a policy π approximates the final score one might expect to obtain, starting from a nonterminal state s, and following π thereafter. It thus provides forward-looking guidance during decoding, as opposed to likelihood (accessible during decoding but myopic) or a score (only computable on finished sentences).
Value-guided beam search (VGBS), as developed by He et al. (2017) and Ren et al. (2017), augments the decision mechanism in beam search (when picking the top k prefixes amongst k × k candidates) with a value network v. The internal score is a linear combination between the (length-normalised) log-likelihood of a prefix and its value approximated by the value network, with a contribution factor α: bs(s_t, a_t) = (α/t) log π(s_t, a_t) + (1 − α) v(s_t, a_t). Note that this method does not use the score; thus it is applicable to both unprivileged and privileged metrics.

Value-based Monte-Carlo Tree Search (V-MCTS). This search method combines both a policy π and a value network v. For every decoding step, a fixed budget of simulations is allocated to build a tree of possible future trajectories. Each simulation consists in 3 steps:
• selection: recursively picking children nodes according to the pUCT formula, starting at the root and until reaching an unopened (i.e. not yet expanded) node s_o: a* = argmax_a [Q(s, a) + c_puct π_τ(a|s) √(Σ_b N(s, b)) / (1 + N(s, a))], where Q(s, a) is a statistic representing the value of taking action a in state s, updated online during the search; c_puct is a tunable constant; τ is a temperature parameter applied to the policy, π_τ(a|s) = π(a|s)^{1/τ} / Σ_b π(b|s)^{1/τ}; and N(s, a) is the number of times action a has been chosen from state s while building the tree (also called visit count);
• expansion: opening the selected node s_o by computing the policy (π(a|s_o))_{a∈A} at the associated new state, as well as the value v(s_o);
• backup: updating the Q statistics encountered during the tree traversal leading to s_o via an aggregation mechanism, such as averaging the previous statistic with v(s_o), or taking their maximum.
Once the tree is built, the decision for the current decoding step is made according to the statistics of the root's children nodes. A popular option consists in picking the root child with the most visit counts, but one may also select the one with maximum aggregated value instead.
While it is customary to allow MCTS to use the score directly when encountering a terminal state, we opt for a pure value implementation instead (i.e. using the value instead of the score on terminal nodes). This makes V-MCTS applicable to privileged metrics, which it wouldn't be otherwise.
One of the keys to successful MCTS performance is properly balancing the breadth and depth of the exploratory trees. We found two adaptations to be helpful. First, we used an adaptive value scale as described by Schrittwieser et al. (2020, Appendix B): in the selection phase, we rescale Q(s, a) to the [0, 1] interval by replacing it with (Q(s, a) − Q_min) / (Q_max − Q_min), where Q_min and Q_max correspond to the minimum and maximum values observed in the tree, updated online. Second, we tune the logits temperature τ jointly with the c_puct hyperparameter.
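Putting the selection rule and the adaptive value scale together, a single selection step might look like the following sketch (the statistics are hard-coded for illustration; in a real search they are updated online, and the temperature is already folded into the priors):

```python
import math

def puct_select(priors, q_values, visit_counts, c_puct=1.25):
    """Pick the child maximising the pUCT score, with Q rescaled to [0, 1]
    using the min/max values observed in the tree (adaptive value scale)."""
    q_min, q_max = min(q_values), max(q_values)
    span = (q_max - q_min) or 1.0  # avoid division by zero early in the search
    total_visits = sum(visit_counts)

    def score(a):
        q_norm = (q_values[a] - q_min) / span
        exploration = c_puct * priors[a] * math.sqrt(total_visits) / (1 + visit_counts[a])
        return q_norm + exploration

    return max(range(len(priors)), key=score)

# Three actions: the second has the best value, the third is barely explored.
priors = [0.5, 0.3, 0.2]
q_values = [0.1, 0.9, 0.2]
visits = [10, 10, 0]
print(puct_select(priors, q_values, visits))  # → 2: the unvisited action wins on exploration
```

Once all actions have comparable visit counts, the exploration term shrinks and the best-value action is selected instead, which is the intended breadth/depth trade-off.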

Reranking-based decoding
Value-driven decoding methods are well-suited to optimise metrics which we cannot evaluate at test time, such as privileged metrics. One might also prefer them for especially expensive unprivileged metrics, e.g. expert human evaluation. For tractable unprivileged metrics though, we can directly compute the scores of finished candidate sentences, without having to resort to approximation. We study two specific decoding mechanisms that take advantage of this option.
Sampling and reranking (S+R, S+RV) simply consists in sampling a fixed number of finished candidate sentences ŷ_1, ..., ŷ_n from the policy (with a carefully tuned temperature applied to its logits), scoring all of them and picking the highest-performing one: argmax_{i=1..n} m(x, ŷ_i). To measure the loss of performance associated with using a value, we also introduce a variant, S+RV, that ranks candidates according to the value (rather than the score).
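S+R is straightforward to sketch; here a deterministic toy "sampler" and a toy score stand in for the real policy and metric:

```python
def sample_and_rerank(sample_fn, score_fn, n):
    """Draw n finished candidates from the policy, return the best under the score."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=score_fn)

# Toy stand-ins: a "policy" that yields three fixed candidate sentences in turn,
# and a "metric" that simply prefers longer outputs.
pool = iter([["a"], ["a", "b"], ["a", "b", "c"]])
best = sample_and_rerank(lambda: next(pool), len, n=3)
assert best == ["a", "b", "c"]
```

S+RV is the same loop with `score_fn` replaced by a learned value estimate, which is exactly what makes the two directly comparable.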

MCTS with rollouts (MCTS+Roll) is a variant of V-MCTS where we replace the value approximation for a given node s by a more expensive one based on the actual score. From s, we perform a greedy rollout (w.r.t. the policy π) until we arrive at a terminal node s_T. We then compute the score with the finished sample and the source as inputs, use this scalar as the value of node s, and continue as in V-MCTS. Of course greedy rollouts are expensive in MT, so this method is not directly comparable to V-MCTS. It is however useful as a proof of concept which enables us to measure how much performance we lose by relying on a value function rather than directly on the score.

Training a value network
Several of the algorithms we detail in Section 4 make use of a value network. We train such models in several steps.
• First, we train a plain supervised policy model π sup on our bilingual datasets.
• Second, we update each data item (x, y_x), which contains a source x and a reference sentence y_x, by replacing y_x with a sample ŷ_x obtained via greedy decoding from our trained policy π_sup, and adding a score m comparing either ŷ_x to y_x (for privileged metrics) or ŷ_x to x (for unprivileged metrics).
• Finally, we train a dual-headed network on the augmented dataset, with a shared transformer encoder-decoder torso (Vaswani et al., 2017) taking source x and sample ŷ_x as inputs, and two heads, one predicting the policy π_d and the other the value v. This approach provides a powerful regulariser for the value, greatly reducing its tendency to overfit (Silver et al., 2017).
The second step is mandatory to obtain a score distribution on which to train the value model, in the case of privileged metrics. Indeed, the scores of the optimal supervised policy are all perfect (comparing y_x to y_x), thus uninformative, making it impossible to train a value network on them. Relying on a sample rather than on the ground truth sentence to compute the score has another advantage: the samples follow the policy π_sup, so the value will be the one associated with a trained policy, like the one we use during decoding, rather than with the optimal supervised policy.
Losses. We train the policy by minimising its Kullback-Leibler divergence with the initial supervised policy π_sup: L_π = D_KL(π || π_sup). We reframe the value regression problem as classification by discretising the score interval into buckets. We emulate training our value function on unfinished samples by adding a value loss term at every step, and reusing the transformer decoder causality mask.
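The discretisation step can be sketched as follows (the bucket count and score range are hypothetical choices for illustration; the paper does not specify them):

```python
def score_to_bucket(score, num_buckets=10, lo=0.0, hi=1.0):
    """Map a score in [lo, hi] to one of num_buckets classes for the value head."""
    clipped = min(max(score, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * num_buckets)
    return min(idx, num_buckets - 1)  # score == hi falls in the last bucket

def bucket_to_score(bucket, num_buckets=10, lo=0.0, hi=1.0):
    """Recover a scalar value as the bucket's midpoint, e.g. for decoding-time use."""
    return lo + (bucket + 0.5) * (hi - lo) / num_buckets

assert score_to_bucket(0.0) == 0
assert score_to_bucket(1.0) == 9
assert abs(bucket_to_score(score_to_bucket(0.42)) - 0.45) < 1e-9
```

Framing value estimation as classification over such buckets lets the network express uncertainty over scores via a softmax, rather than committing to a single scalar regression target.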
The trouble with privileged metrics. In practice, we find that learning a value function for privileged metrics (such as BLEU or BERTScore) is difficult. To understand why, we run an ablation to distinguish between the three subtasks a value function must perform: (i) approximate the score, (ii) predict the end of a trajectory from an unfinished prefix, and (iii) assess the translation quality of a pair of finished sentences in different languages. To separate concerns, we run the following experiments: for (i), we train our network to predict BLEU given ground truth targets (rather than source sentences) and finished samples (instead of prefixes). For (ii), we give the network ground truth targets and unfinished samples. Finally, for (iii), we give the network source sentences and finished samples (thus removing the need to predict the future of trajectories). We observe that: the error is very low for (i); higher, but significantly improved over the full setup for (ii); and surprisingly, roughly identical to the full setup for (iii). Thus the real difficulty lies in (iii).
One possible explanation for this result is that the value network is missing a key input. Indeed, in the case of privileged metrics, the score is computed between a sampleŷ x and a ground truth reference y x ; but the value network only has access to the source sentence x and a prefix ofŷ x . Thus before it can compute a precise score approximation, it first has to infer y x from x. But of course, inferring y x from x is exactly the original machine translation problem, which makes the value problem empirically harder than its policy counterpart on our dataset.
Hybrid architecture for privileged metrics. We see that a "cheating" value network (as in (ii)) performs strongly. Unfortunately, we cannot allow our model to cheat at test time. However, we can still leverage privileged information at training time through representation shaping, by distilling a cheating value network into a non-cheating one. We propose a new training paradigm, where we call the transformer model twice per step. The first call is the regular pathway, with the source sentence fed to the encoder and the sample to the decoder. The second call is the "cheating" pathway, with the ground truth reference fed to the encoder and the sample to the decoder. We apply the policy and value losses described earlier in this section to the regular pathway outputs. To the cheating pathway outputs, we apply only a value loss. Finally, we add a simple ℓ2 loss between the final layers of both pathways. The idea is to use the powerful cheating representation to help the weaker regular representation. An illustration can be found in Appendix A.
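The combination of losses can be sketched at the level of final-layer representations (the equal weightings below are hypothetical; the paper tunes the loss hyper-parameters carefully):

```python
def l2_distance(u, v):
    """Squared l2 distance between two flat representation vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def hybrid_loss(policy_loss, value_loss_regular, value_loss_cheating,
                h_regular, h_cheating, distill_weight=1.0):
    """Total training loss: the regular pathway gets policy + value losses,
    the cheating pathway only a value loss, plus an l2 term pulling the
    regular pathway's final-layer representation towards the cheating one."""
    return (policy_loss + value_loss_regular + value_loss_cheating
            + distill_weight * l2_distance(h_regular, h_cheating))

# When both pathways' representations agree, the distillation term vanishes.
h = [0.1, -0.2, 0.3]
assert abs(hybrid_loss(1.0, 0.5, 0.4, h, h) - 1.9) < 1e-9
```

At inference time only the regular pathway is evaluated, so the distillation term influences training dynamics but adds no test-time cost.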
At inference time, we only compute the regular pathway, which does not cheat. In practice, with careful tuning of the loss hyper-parameters we are able to significantly reduce the gap in performance between this new hybrid model and the cheating one (as in (ii)), so we use this training regimen for our experiments on privileged metrics.

Experiments
We start by detailing our general setup, then report results for all 3 metrics we consider, move on to describe how we tuned all algorithms for best performance, and finally study how they scale with increasing search budget.

Experimental setup
We consider two established machine translation datasets: WMT'14 English to German (ENDE) and WMT'14 English to French (ENFR). The first dataset contains roughly 4.5 million training sentence pairs, while the second is much bigger with just under 41 million training sentence pairs, which enables us to account for scale in our experiments. All dev and test sets contain approximately 3000 sentences.
Our joint policy/value model is based on the Transformer encoder-decoder architecture (Vaswani et al., 2017), which is typically used in machine translation studies. Encoder and decoder have 6 attention blocks, hidden dimensionality 512, 16 heads, and our dictionary size is about 32k. As we test inference-intensive methods, we use a few adaptations (while still matching or outperforming the original transformer models, as shown in Table 4).

Table 4. Comparison of decoding mechanisms on ENDE and ENFR. The top row contains general metric statistics and the original transformer baseline; the second row, the performance of supervised models with likelihood-based decodings; the third, results for value-based algorithms with joint policy/value models (a specific one for each metric); the last, numbers for score-based methods. Best overall performance is in bold; best value-based performance in blue. Beam search performs strongly for privileged metrics, while value-based methods prevail for unprivileged scores. Score-based methods outperform their value-based counterparts, but V-MCTS remains competitive.
First, we use multi-query attention, as defined by Shazeer (2019). Counter-intuitively, the performance bottleneck for small transformer architectures on our hardware of choice, TPUv3, is memory access by a very large margin. This is driven by the need to store and read keys and values from memory to enable faster, incremental inference. We reduce this memory footprint by only computing a single set of keys and values per attention block, that we share across all attention heads. This simple change yields an impressive, almost linear speedup with respect to the number of attention heads. While it comes at a small cost in terms of accuracy, this can easily be offset by reallocating the attention weights we removed to the feed-forward layer of the attention blocks.
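The memory saving behind this speedup is easy to quantify. A sketch of the incremental-inference KV-cache sizes, using the architecture figures above (6 blocks, 16 heads, key/value dimensionality 128; per-element byte size omitted):

```python
def kv_cache_elements(seq_len, num_layers, num_heads, head_dim, multi_query=False):
    """Number of cached key + value elements for incremental decoding.
    Standard attention stores one K/V set per head; multi-query attention
    stores one shared set per layer, cutting the cache (and the memory
    reads that dominate small-transformer inference) by a factor of num_heads."""
    kv_sets = 1 if multi_query else num_heads
    return 2 * num_layers * kv_sets * seq_len * head_dim  # 2 = keys and values

standard = kv_cache_elements(seq_len=100, num_layers=6, num_heads=16, head_dim=128)
multi = kv_cache_elements(seq_len=100, num_layers=6, num_heads=16, head_dim=128,
                          multi_query=True)
assert standard // multi == 16  # reduction factor equals the number of heads
```

This factor-of-heads reduction in memory traffic is what yields the near-linear inference speedup described above.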
Second, we use key and value dimensionality 128, rather than the more customary 64, which enables faster inference by removing an expensive padding operation on TPUv3.
Finally, we allow a budget of 50 inferences per token in the sampled solutions for all methods; compared to 1 for greedy decoding, and 4 for beam search. We use incremental sampling for speed. Both the running time and the memory footprint are directly proportional to the number of inferences for all methods.

Main results analysis
Privileged metrics: BLEU and BERTScore. We report our results on privileged metrics in Table 4. Plain beam search is a strong contender in this setup, often matching or outperforming other methods, while using a fraction of the inference budget (unfortunately performance degrades rapidly with larger beam sizes so we cannot leverage more computational budget). In this setup, value-based methods struggle to justify their higher complexity and cost.
Value-based algorithms for unprivileged metrics. The results for this alternative use case, also presented in Table 4, paint a completely different picture. We see that while regular beam search obtains a small but consistent improvement, value-guided methods perform significantly better. Among the latter, MCTS is particularly promising, as its performance scales nicely with the size of the dataset.
We observe that the policies of our joint policy/value models perform slightly worse than their supervised counterparts (see Table 12 in Appendix). If we use the initial supervised model policy in conjunction with the multilingual value (see Table 13 in Appendix), we obtain promising results: notably 40.31 BLEU when optimising MLBERTScore with MCTS on the ENFR dataset, more than a full BLEU point above the performance of beam search. From a qualitative point of view, we see a confirmation of our conjecture: multilingual BERTScore is not perfectly aligned with BLEU. It seems to encourage word-for-word translations, which has a positive effect initially (more consistency between the source and the sample sentences), but ultimately leads to less natural translations if used with enough budget.
Score-based approaches for unprivileged metrics. The bottom of Table 4 gives results when we allow direct access to our two unprivileged metrics, without having to go through a value approximation. They reinforce our finding that the choice of algorithm heavily depends on the use case.
Two additional properties stand out. First, all the methods that access the score directly perform significantly better than their value-guided counterparts.
Second, the purely value-based V-MCTS is competitive with, and can even outperform, the score-based approach S+R. This is promising, as MCTS is more widely applicable (some scores are expensive to obtain). However, S+R performs surprisingly well, which may warrant more exploration of sampling methods optimising for diversity (e.g. Fan et al.).

Algorithms tuning and ablations
We detail in this section which hyper-parameters we tuned (and how) to generate the results reported in Table 4. Precise ranges and ablations are provided in Appendix B.
Beam search. We experiment with the beam size, the normalisation constant, and try to apply temperature to the logits. We found that the best performance was achieved with the default hyper-parameters for most metrics.
Value-guided beam search. In this variant, the ranking rule is a linear combination of the log-likelihoods of tokens and their values provided by a neural network. We tested various value scaling schemes (e.g. taking the logarithm of the value as detailed by He et al. (2017)), as well as log-likelihood normalisations; we found that using the plain value, with a length normalisation for the log-likelihoods worked best. We tuned the weight α of the value in the score. For privileged metrics, α = 0.5 performed best. For unprivileged metrics larger weights were preferable (α = 0.9 for ENDE and α = 0.95 for ENFR), which confirms our observation that value functions are of higher quality in this setup.
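As a sketch, the ranking rule might look as follows. The paper only specifies a linear combination of length-normalised log-likelihoods and plain values with value weight α; the exact combination form and the `vgbs_score` name are our assumptions.

```python
import numpy as np

def vgbs_score(log_liks, values, alpha, length):
    """Ranking rule sketch for value-guided beam search: a linear
    combination of the length-normalised candidate log-likelihood and
    the plain value network output. alpha is the value weight (0.5 for
    privileged metrics, 0.9-0.95 for unprivileged ones in the paper)."""
    norm_ll = np.asarray(log_liks) / length    # length normalisation
    return (1.0 - alpha) * norm_ll + alpha * np.asarray(values)
```

Candidates on the beam would then be ranked by this score instead of the plain log-likelihood.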
Value-guided MCTS. We find that selecting the best pair of logits temperature and scaling factor c_puct is key for optimising performance.
Specifically for unprivileged metrics, we made two other key decisions. First, instead of using the weighted average rule in the backup phase (where the value of the newly opened node is used to update the value of all its predecessors), we found that the max operator worked to better effect. Second, we observed better results when selecting as action (once the full tree is completed) the root child with maximum aggregated value, instead of the more customary visit count. Both generic options are meant to make the search more robust to value outliers, as they average statistics. We essentially found that for unprivileged metrics, our value approximation was good enough that following the value more aggressively led to improved performance. The reverse was true on privileged metrics, where our value networks are of lower quality.
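A minimal plain-Python sketch of the two backup operators (single search path; the paper's batched Numpy version differs):

```python
def backup(values, parents, visit_counts, leaf, leaf_value, use_max):
    """Backup-phase sketch: propagate the value of a newly expanded leaf
    to all its ancestors. use_max=False gives the customary running
    average; use_max=True gives the max operator found better for
    unprivileged metrics. Nodes are indexed into flat lists, with -1
    marking the root's (absent) parent."""
    node = leaf
    while node != -1:
        visit_counts[node] += 1
        if use_max:
            values[node] = max(values[node], leaf_value)
        else:
            # Incremental running-average update.
            values[node] += (leaf_value - values[node]) / visit_counts[node]
        node = parents[node]
```

The max variant follows promising values more aggressively, which only pays off when the value network is trustworthy.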
Sampling and reranking. Given the simplicity of this approach, there really is a single hyper-parameter to tune: the temperature applied to the model likelihoods before sampling. We do find, however, that properly picking this temperature is critical, as the method is sensitive to it: high temperatures (τ ≥ 1) lead to nonsensical outputs, while conversely low temperatures (τ ≤ 0.5) lead to almost no diversity in outputs. On the dev set, our best performing runs use τ = 0.75 for MLBERTScore, but τ = 0.25 for privileged metrics (using S+RV, which relies on a value approximation).
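A hedged sketch of the S+R procedure; `logits_fn` and `score_fn` are hypothetical stand-ins for the transformer policy and the metric (or value, for S+RV).

```python
import numpy as np

def sample_and_rerank(logits_fn, score_fn, num_samples, tau, max_len, eos_id, rng):
    """S+R sketch: draw num_samples sequences with temperature tau,
    then return the one the scorer ranks highest. logits_fn maps a
    token prefix to next-token logits; score_fn scores a full sequence."""
    best, best_score = None, -np.inf
    for _ in range(num_samples):
        seq = []
        for _ in range(max_len):
            logits = logits_fn(seq) / tau          # temperature scaling
            p = np.exp(logits - logits.max())
            p /= p.sum()
            tok = int(rng.choice(len(p), p=p))
            seq.append(tok)
            if tok == eos_id:
                break
        s = score_fn(seq)
        if s > best_score:
            best, best_score = seq, s
    return best
```

Low τ collapses all samples onto the greedy output; high τ yields diverse but increasingly ungrammatical candidates, matching the trade-off described above.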

MCTS + rollouts.
We find that the adaptations we made for value-guided MCTS also perform best for this proof-of-concept algorithm. As the quality of the value estimation is good (though expensive), following the value aggressively also yields the best results.

Scaling search with computational budget
We study how our decoding algorithms scale with their search computational budget. We report results on ML-BERTScore in Table 5 and more detailed numbers in Appendix C. Our findings are fairly unsurprising: the higher the quality of the score signal algorithms can leverage, the better they scale. We see that on privileged metrics, where value networks are hard to train and thus quite imperfect, performance quickly stops increasing with more computation and starts degrading instead. When using higher quality value networks (in the unprivileged metrics setup), performance increases more steadily with computation (almost everywhere), plateauing rather than degrading. Finally, when accessing the score directly (for the ranking approaches), the more computation the better: performance keeps increasing with more inferences.

Discussion
We now summarise and contextualise our findings, and discuss potential next steps for optimising relevant MT metrics.
The main takeaway from our experiments is that which algorithm is best depends heavily on the metric to optimise. This reinforces the notion that one should choose a decoding mechanism for a machine translation pipeline carefully, rather than defaulting to beam search.
Second, we find that optimising privileged metrics (e.g. BLEU) via a value function is surprisingly hard. While distinguishing large gaps in quality is easier than modelling the policy, discriminating between good candidates is in practice as hard as the policy problem, since a required first step is to estimate the ground truth sentence. Indeed, in our experiments we observe relatively low-quality value networks, and comparatively small improvements with value-based decoding methods (especially on the small ENDE dataset). Using a value function to optimise unprivileged metrics is more promising.
Third, we show that MCTS is not only a valid way of decoding for machine translation tasks, but also the best option in some use cases (given some necessary adjustments). We study its strengths and weaknesses, and demonstrate that its performance is crucially linked to the ease of learning a good value function. We include pseudo-code for an easily reproducible Numpy implementation in Appendix F. All told, we provide a blueprint for how to use MCTS efficiently in NLP with state-of-the-art transformer models.
Finally, and somewhat surprisingly, we find that whenever access to the score is possible, the deceptively simple S+R method performs well. More experimentation is required to understand why; but at any rate, it should be a strong contender in this specific setup.
Future directions. We have shown that optimising for unprivileged metrics is easier than for privileged ones. The ultimate unprivileged metric for machine translation is human translation assessment. Thus it seems natural to consider training a score directly from human evaluation of translation pairs, and to later focus on optimising it via MCTS.
Another natural extension is a full-blown RL algorithm; iteratively improving policies via value-guided search and training value functions on search-improved policies, getting closer to the optimal policy and value at each step.

Outline
Appendix A provides details about the network architectures and the optimisation hyper-parameters used in our work. We also describe the hybrid architecture that is used to improve the training of value networks in the hard case of privileged metrics. In Appendix B, we lay out the details of the tuning of the different decoding mechanisms used throughout the paper. In Appendix C, we give results about how performance of our decoding algorithms scale with more computational budget. In Appendix D, we discuss the trade offs between learning a joint policy and value network on sampled (or distilled) trajectories, versus training a separate value network to predict the value of a policy network trained on supervised trajectories. In Appendix E, we give a few examples of MCTS exploratory trees. Finally, in Appendix F we provide a simple implementation of a batched version of MCTS in plain Numpy.

A. Network architectures and training
In this section, we detail our basic dual-headed architecture, our training regimen and our optimisation hyper-parameters. We also describe our hybrid architecture (which we use for privileged metrics) in more depth.
Dual-head transformer architecture. We start from the original transformer encoder-decoder architecture (Vaswani et al., 2017), with a few modifications. Both encoder and decoder have num_layers = 6 attention layers. The hidden size is 512. The embedding vocabulary size is just short of 32000 tokens for both language pairs. The unroll length of our models is 128. We use "normal GPT-2"-style initialisers, i.e. initial values are sampled from a Gaussian distribution with mean 0 and standard deviation 0.02/√num_layers. The one exception to this rule is for embeddings, where we use truncated normal initialisers with standard deviation 1.0.
On top of the decoder, we add two "heads". The first one is the policy head: a linear projection from the hidden dimensionality to the vocabulary size, followed by a softmax operator to output a distribution over the whole vocabulary. The second head is the value head: a linear projection from the hidden dimensionality to the number of value buckets we define (|B| = 500 in our experiments), followed by a softmax operator. We compute the value loss as the cross-entropy between the softmax distribution (v_i)_{i∈B} and a one-hot encoding of the target value of the same dimension. To output the value, we compute the sum of the softmax distribution multiplied by the average value b_i of each bucket: Σ_{i∈B} v_i b_i.
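The value head's output computation can be sketched as follows (bucket count reduced from |B| = 500 to 5 for illustration):

```python
import numpy as np

def bucketed_value(logits, bucket_values):
    """Value head output sketch: a softmax over |B| buckets, then the
    expectation sum_i v_i * b_i, where b_i is the average value of
    bucket i and (v_i) is the softmax distribution."""
    z = np.exp(logits - logits.max())
    v = z / z.sum()                     # distribution (v_i) over buckets
    return float(np.dot(v, bucket_values))
```

Training against a one-hot bucket target with cross-entropy, but reading out the expectation, gives a smooth scalar value at inference time.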
Compared to the original architecture, we apply several changes related to inference speed. First, instead of 8 attention heads we use 16. Second, the dimensionality of the keys and values is 128, compared to 64 in the original architecture, thus avoiding a costly padding operation on our hardware accelerators, TPUv3. Finally, we use multi-query attention (Shazeer, 2019), only computing a single set of keys and values per attention block and sharing them across all attention heads. This reduces the memory footprint of the keys and values by a factor of the number of attention heads (16 here), considerably decreasing the time spent reading and writing from memory, which ultimately results in a near-linear inference speedup with respect to the number of attention heads. As this alternative attention mechanism requires fewer trainable weights than the more conventional one, we reallocate some of those to the feed-forward layer of the attention blocks by using a bigger internal hidden dimensionality of 3072 instead of 2048.
Optimisation. We use the Adam (Kingma & Ba, 2015) optimiser with learning rate 0.001 and the following hyper-parameters: β1 = 0.9; β2 = 0.98; ε = 1e-9. Our batch size is 4096, and we train for 100000 steps on the ENDE dataset and 300000 steps on the larger ENFR dataset. As regularisation, we use dropout with rate 0.1, but no weight decay. We also use label smoothing with hyper-parameter 0.1 (although we see little impact when removing it).
The last difference with the original transformer encoderdecoder is where we place the layer norm operator. We put it at the beginning of the attention and the feed-forward layers, rather than at the end, which allows for fully-residual layers.
Hybrid architecture for privileged metrics. In Section 5, we show that learning good value functions on privileged metrics such as BLEU and BERTScore is very difficult. This is mainly because our value networks lack access to the ground truth targets, which are required for precise score computation. Our ablation study shows that if we remove this difficulty by allowing the value model to "cheat", using the ground truth targets rather than the source sentences as encoder inputs, we obtain much more precise values. Downstream results using MCTS with such a "cheating" value show very large BLEU improvements. Unfortunately, in practice one cannot rely on such a trick when decoding.
Our idea is to try to leverage "cheating" information indirectly at training time to shape the representation of a regular (i.e. non-cheating) value network. Another way to look at it is that we try to distill the knowledge of the cheating value model into the regular one.
To achieve this, we propose a new training regimen, as detailed in Figure 1. The basic idea is to compute the final layer of a cheating value model, and to use it as an auxiliary target for the final layer of a regular value model, in addition to its normal value loss. We thus have two pathways. On the left, the regular value model encoder receives the source sentence as input (which is available at test time). On the right, the cheating value model encoder receives the ground truth target sentence as input (which is not available at test time). For both pathways, the decoder's input is a sample sentence. Crucially, both pathways rely on the same transformer encoder-decoder: they share all weights; the only difference is in their inputs.

Table 6. Greedy vs MCTS (50 simulations) performance on the ENDE dataset for BLEU (top rows) and BERTScore (bottom rows). Using the normal architecture, improvements are very small. Using the hybrid architecture yields more significant improvements. The middle column contains (greyed-out) results when using a cheating model which takes ground truth targets as inputs. Improvements are enormous in this prohibited setting, which is unsurprising as the value function receives the optimal output as its own input.
To train such a model, we use four losses. First, we apply the regular policy loss Lπ and value loss Lv on the regular pathway. Second, we apply a value loss Lvc on the cheating pathway. Finally, we add an ℓ2 loss D between the final layers of both pathways, with a stop gradient on the cheating one, so that its representation is not directly affected by D.
We do not add a policy loss on the cheating pathway. This seems natural, as such a loss would only encourage the model to reproduce its inputs exactly, effectively pushing it towards the identity function.
Using such a hybrid architecture yields performance improvements when using the value model with MCTS, as shown in Table 6.
We find that to obtain best performance, three things need to be combined: (i) sharing weights across both pathways, (ii) the distillation ℓ2 loss and (iii) the cheating value loss. Each loss is added with a linear weight, and proper tuning of these weights is important. We find that using weights 1.0 for Lπ and Lvc, as well as 0.1 for Lv and D, leads to best performance.
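A sketch of the combined objective with these weights (the stop-gradient on the cheating pathway is part of the D term and omitted here; the loss terms are assumed precomputed scalars):

```python
def hybrid_loss(l_pi, l_v, l_vc, distill):
    """Hybrid training objective sketch: the four losses of the text
    combined with the best-performing weights (1.0 for the policy loss
    l_pi and the cheating value loss l_vc, 0.1 for the regular value
    loss l_v and the distillation loss distill)."""
    return 1.0 * l_pi + 0.1 * l_v + 1.0 * l_vc + 0.1 * distill
```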

B. Experimental details and ablations
We detail how we tuned each algorithm for best performance in this section. We used the dev datasets to determine which options were best. Wherever we report numbers, we compute those on the test set for comparison purposes, but the experiments were run after dev set selection. In practice, we find a very good correlation between observations on the dev and the test sets (although absolute values were lower on the dev set, rankings remained mostly unchanged). Unless otherwise indicated, our findings hold for both the ENDE and ENFR datasets, and for all 3 metrics we consider (BLEU, BERTScore, Multilingual BERTScore).

Figure 1. Hybrid architecture. Since learning a good value network is hard when dealing with privileged metrics, we propose a training mechanism to distill the information from a cheating network that has access to the full privileged information in order to predict the value score. In detail, we have two encoder/decoder networks that share their weights. The first network (left) is a regular network that takes as input the input sentence in the source language and is trained to output (i) a policy predicting the words of the output sentence in the target language and (ii) a value score for that output, via a policy and a value loss Lπ and Lv, respectively. The second network (right), dubbed cheating, is given as input the ground truth sentence in the target language and is trained to output the value score of the output sentence in the target language against that ground truth, via the loss Lvc. This simplifies its task considerably, as it has direct access to the privileged information to compute the value. A distillation loss D is added to transfer knowledge from the cheating model to the regular network. Results in Table 6 show that such an approach yields significant improvements by easing the training of the value network.
Beam Search. As this method is known for degrading with large beam sizes, we add a length normalisation term, as advocated by Wu et al. (2016). The resulting score for a candidate y_1..y_t is thus: bs(y_1..y_t) = (6 / (t + 5))^θ · log π(y_1..y_t). We tuned three hyper-parameters for beam search:
• the beam size: we tried 2, 4, 6, 8, 10 and 20. We find that the best performance is attained at 6, plateaus until 10 and starts slowly decreasing by 20 (see Table 7).
• the logits temperature: we tried 0.6, 0.8, 1.0, 1.2, 1.4. 1.0 performs best by a wide margin. Low values degrade to greedy search performance while high values yield non-sensical sentences.
• the normalisation temperature parameter θ: we tried 0.4, 0.6, 0.8 and 1.0. We find θ = 0.6 performs best, as in the original paper (θ = 0.4 is on par but slightly worse, and performance degrades as soon as θ ≥ 0.8).
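The length-normalised score above can be written directly as:

```python
import numpy as np

def beam_score(log_probs, theta=0.6):
    """Length-normalised beam score following Wu et al. (2016):
    bs(y_1..y_t) = (6 / (t + 5))**theta * sum_i log pi(y_i),
    with theta = 0.6 the best-performing setting in our sweeps."""
    t = len(log_probs)
    return (6.0 / (t + 5.0)) ** theta * float(np.sum(log_probs))
```

With θ = 0, the score reduces to the plain (unnormalised) log-likelihood, which is what favours short outputs at large beam sizes.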
Value-guided beam search. The hyper-parameters to tune are slightly different from those of plain beam search. As we do not see performance decrease with larger beam sizes, we no longer need to find the optimal one; which one to use largely depends on how much computation one can afford. Performance as a function of this quantity is reported in Section C.
Further, because of the additional value term, we have a new linear combination weight α to tune. Here are the hyper-parameter ranges we consider:
• the logits temperature: as for plain beam search, we tried 0.6, 0.8, 1.0, 1.2, 1.4. We find similar results: 1.0 is the best-performing temperature by a significant margin; the reliance on the value mitigates the effect for unprivileged metrics (where the value is a good approximation).
• the value normalisation: we tried using the logarithm of the value instead of the value itself, with appropriate scaling: log((v(y_1..y_t) − d) / (D − d)). We find that by carefully tuning the minimum and maximum bounds d and D we can match the performance we obtain with the plain value, but not outperform it. As a result we opt for the simpler formulation.
• the linear combination weight α: we swept between 0 and 1 in 0.1 increments (with an additional measure at 0.95). For privileged metrics, we find α = 0.5 to perform best (on both datasets). For Multilingual BERTScore on the other hand, much larger values are required to achieve best performance: α = 0.9 for ENDE and α = 0.95 for ENFR. Our hypothesis is that the quality of our value function is much higher for this last unprivileged metric, which allows us to lean more heavily on its guidance. See Table 8.

Table 9. BLEU and Multilingual BERTScore performance when using different value aggregation mechanisms and action selection rules. We observe that on the unprivileged metric, the best options rely more heavily on the value function, contrary to what we see for the privileged metric.
MCTS variants. MCTS is a complex algorithm with a large number of hyper-parameters. We found that three main aspects are important for performance: making sure the policy and the value terms are well-balanced in the UCT formula, picking the best value aggregation mechanism during the backup phase, and selecting the best acting criteria (once the tree is finished).
To ensure balance in the UCT formula, we tuned two things:
• we optimised the logits temperature τ and the multiplicative constant c_puct jointly. We tried temperatures 0.9, 1.1 and 1.3, in conjunction with c_puct in 1.0, 2.0, 3.0, 4.0, 6.0, 8.0. For privileged metrics, the pair (τ = 0.9; c_puct = 3.0) performed best across both datasets and both metrics; while for Multilingual BERTScore the best performer was (τ = 1.1; c_puct = 8.0) across both datasets. Note that both a larger temperature and a larger c_puct encourage exploration in the UCT formula, thus reducing the relative weight of the policy in favour of the value; that we can use larger scalars for unprivileged metrics is yet another indication that the associated value functions are more trustworthy.
• as we detailed in the main text, we rescale the values dynamically during the tree construction so that all the values encountered until the current step are more evenly distributed in the [0, 1] interval by mapping the minimum value to 0 and the maximum value to 1.
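A minimal sketch of this dynamic rescaling (the helper name is ours):

```python
import numpy as np

def rescale_values(values):
    """Dynamic value rescaling sketch: map the minimum value seen so far
    in the tree to 0 and the maximum to 1, so that the value term in the
    UCT formula is evenly spread over [0, 1]."""
    values = np.asarray(values, dtype=np.float64)
    lo, hi = values.min(), values.max()
    if hi - lo < 1e-8:              # all values equal: avoid division by zero
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)
```

This keeps the value term comparable to the prior term in the UCT formula regardless of the metric's raw scale.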
We tested two value aggregation operators: running average and maximum. We also tried two action selection mechanisms: picking among the root's children nodes the one with maximum visit count, or the one with maximum aggregated value.
Our observations once again underline the contrast between privileged and unprivileged metrics, as illustrated in Table 9. For the former, the best choice is to use the running average as the value aggregation operator during the backup phase, and to pick the root child with maximum visit count. Conversely, for Multilingual BERTScore we obtained best performance with the maximum aggregation operator and by picking the root child with maximum aggregated value. We thus see that for our unprivileged metric we can rely on the value function aggressively, while for privileged metrics we need to limit our exposure to it.
Sampling + Ranking variants. For these algorithms, we really only have a single hyper-parameter to tune: the policy temperature τ. Small temperatures lead to little diversity across different samples, but ensure that samples are highly ranked by the model and hence syntactically correct. On the other hand, large temperatures encourage diversity, at the price of correctness. We sweep over the [0.15, 0.95] interval in 0.1 increments, and report results in Table 10. We find that for the score-based S+R, balancing diversity with correctness means we have to use τ = 0.75.
The story is more nuanced for S+RV. As this variant relies on the value function rather than the score to rerank samples, we can use it even for privileged metrics. We observe that for these, the optimal temperature is much smaller (τ = 0.25), which in effect means that the algorithm relies more heavily on the policy, compared to the value. The reason why is once again that the value for this type of metrics is of lower quality.
In contrast, the optimal temperature for our unprivileged metric is τ = 0.75, similar to what we find for the scorebased S+R.

C. Scaling search with computational budget
We present more detailed scaling results in this section. These confirm our main observation: the higher the quality of the metric signal we use, the better the method scales with additional computation.
We see for instance that on privileged metrics, where value functions are hard to train, the performance of value-based methods reaches its peak quickly and then starts degrading. Comparatively, on unprivileged metrics, value-based methods keep improving with more inferences, eventually plateauing. Finally, score-based methods do not even plateau. As a result, S+R ends up outperforming MCTS after 200 simulations per token, although MCTS remains the best performer under 100 simulations. This motivates investigating a variant of MCTS which is allowed to use the score on completed sentences (our current algorithm is purely value-based).
Beam search, which does not use the metric at all, behaves thus more similarly across all metrics, quickly reaching its peak performance and then plateauing. The length penalty is crucial to prevent performance degradation.
Another interesting observation is that S+RV, the value-based alternative of S+R, performs worse than VGBS or MCTS. It appears that the crucial ingredient in S+R's good performance is direct access to the score, rather than its simple search mechanism.
Finally, we note that for VGBS, each token costs k + k² inferences (k to compute the policy for every beam, k² to compute the value for the k² possible follow-up tokens). As a result, we use the smallest k such that k + k² ≥ n when allowing n simulations for other decoding algorithms.
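A small helper makes this budget-matching rule concrete (the name is hypothetical):

```python
def vgbs_beam_size(n):
    """VGBS costs k + k**2 inferences per token (k policy calls plus
    k**2 value calls), so for a budget of n simulations we pick the
    smallest beam size k with k + k**2 >= n."""
    k = 1
    while k + k * k < n:
        k += 1
    return k
```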

D. Supervised policy vs Distilled policy
As we study decoding mechanisms on privileged metrics based on the ground truth, we cannot train our value networks on the initial supervised dataset (where the value target would be 1 for all items, as the ground truth targets are considered optimal). As a result we go through an intermediate step, first training a supervised policy model, and then replacing ground truth targets by a greedy sample from the said policy to create a new distillation dataset.
We observe that the performance of the policy models trained on the distillation datasets is slightly lower than that of their plainly supervised counterparts, as illustrated in Table 13. On the larger dataset, we see that MCTS decoding outperforms any other type of decoding. Interestingly, we obtain an improvement of more than 1 BLEU point over the supervised baseline when using a Multilingual BERTScore value function; and we obtain this result with a relatively low number of simulations (25). Unfortunately, adding more computational budget does not help, as the decoding targets a different metric than BLEU. But with a low enough number of simulations, we see that trying to optimise our unprivileged metric yields benefits.
While this result is close to the state of the art for such a small policy model, the comparison is not fair, as the approach requires twice the number of parameters (since the value net is a separate network). This could be alleviated by training a supervised policy, fixing its weights, and adding a lightweight value head on top in a second training step. We leave this for future work.

F. Batched Numpy-friendly MCTS
Accelerator hardware such as GPUs or TPUs allows us to execute neural networks faster; but to fully leverage their computing power, we have to run on batches of several inputs. This does not mix easily with an algorithm such as MCTS, which requires a queuing mechanism between the search itself and the neural network computations, potentially leading to inefficiencies. To circumvent this issue, we introduce a Numpy-compatible version of MCTS, which can then be run completely on the accelerator device.
The basic idea is that we use storage tensors which are indexed by the number of the current node or simulation in the MCTS tree. The root node has index 0 for all elements in a batch, and we then build all subsequent elements recursively.
We start by creating a NumpyMCTS object, whose fields store all the necessary tree information to compute a single batched instance of search (i.e. MCTS for one token, not MCTS applied to the full sequence). In detail, for each node and each item in the batch we store:
• visit counts: the number of times the node has been visited during the search,
• raw values: the initial value of the node as returned by our value network,
• values: the aggregated value of the node at this point in the search,
• parents: which node is its parent in the tree,
• action from parents: which action was taken to transition from the parent to the node itself,
• depth: the tree depth of the node,
• is terminal: whether or not the node is terminal.
All these variables are tensors of shape (B, S), where B is the batch size and S is the number of simulations plus one.
For ease of tree manipulation, we also store for each node the indices of its children, its prior over its possible children, the values of each child, and the visit count of each child. The associated tensors would be of shape (B, S, V), where V is the total number of possible actions. However, this makes for large tensors, on which Numpy operations can become costly. To alleviate this issue, we store a sparse version of these tensors instead, only keeping the top A children according to the policy for each node. The shapes are thus (B, S, A) instead. We maintain a mapping from the sparse indices 0 to A − 1 back to the original actions in the topk mapping tensor, itself of shape (B, S, A).
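A minimal sketch of this storage layout (field names follow the text; dtypes and the constructor signature are our assumptions):

```python
import numpy as np

class NumpyMCTS:
    """Storage sketch for batched MCTS: one search instance per batch
    item, all tree statistics held in flat tensors indexed by node."""
    def __init__(self, batch_size, num_simulations, num_sparse_actions):
        B, A = batch_size, num_sparse_actions
        S = num_simulations + 1              # +1 for the root node
        # Per-node, per-batch-item statistics, shape (B, S).
        self.visit_counts = np.zeros((B, S), dtype=np.int32)
        self.raw_values = np.zeros((B, S), dtype=np.float32)
        self.values = np.zeros((B, S), dtype=np.float32)
        self.parents = np.full((B, S), -1, dtype=np.int32)
        self.action_from_parents = np.full((B, S), -1, dtype=np.int32)
        self.depth = np.zeros((B, S), dtype=np.int32)
        self.is_terminal = np.zeros((B, S), dtype=bool)
        # Sparse per-child statistics, shape (B, S, A): only the top-A
        # children by prior are kept; topk_mapping recovers the original
        # token id of each sparse slot.
        self.children_index = np.full((B, S, A), -1, dtype=np.int32)
        self.children_prior = np.zeros((B, S, A), dtype=np.float32)
        self.children_values = np.zeros((B, S, A), dtype=np.float32)
        self.children_visits = np.zeros((B, S, A), dtype=np.int32)
        self.topk_mapping = np.full((B, S, A), -1, dtype=np.int32)
```

The root always sits at node index 0 for every batch item, and subsequent simulations fill indices 1..S−1.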
Finally, the object also stores for each node its associated transformer state, so that we can use incremental inference during the search. These states can be kept on the accelerator device itself.
We now define a method to perform the search itself. As stated in the main text, MCTS consists of applying the same three steps for each simulation, so we iterate over S. First, we use the simulate() method to select which new nodes to explore. Second, we expand these new nodes (calling our neural network to compute both the policy and the value at these nodes). Finally, we back the newly computed values up the tree.
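The per-node selection rule used inside simulate() is not spelled out here; assuming the standard AlphaZero-style PUCT criterion, a sketch over the sparse children statistics would be:

```python
import numpy as np

def select_child(prior, child_values, child_visits, parent_visits, c_puct):
    """Simulation-phase sketch: pick the child slot maximising the PUCT
    criterion Q(a) + c_puct * P(a) * sqrt(N) / (1 + n(a)), where Q is
    the aggregated child value, P the policy prior, N the parent visit
    count and n the child visit count. The exact formula in the paper
    may differ; this is the customary AlphaZero form."""
    uct = child_values + c_puct * prior * np.sqrt(parent_visits) / (1.0 + child_visits)
    return int(np.argmax(uct))
```

Larger c_puct (or a higher logits temperature feeding the prior) shifts the balance from the policy towards the value, matching the tuning discussion in Appendix B.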
The dense visit counts method allows us to map back our sparse action representation into the original action space. Once we have selected nodes to expand, we can proceed. The expand method is where we call our neural networks to compute policies and values. We then create the nodes in the object fields through the create node method. Finally, we update the tree topology to connect the new nodes to the tree.