Amortized Noisy Channel Neural Machine Translation

Noisy channel models have been especially effective in neural machine translation (NMT). However, recent approaches like "beam search and rerank" (BSR) incur significant computation overhead during inference, making real-world application infeasible. We aim to study if it is possible to build an amortized noisy channel NMT model such that when we do greedy decoding during inference, the translation accuracy matches that of BSR in terms of reward (based on the source-to-target log probability and the target-to-source log probability) and quality (based on BLEU and BLEURT). We attempt three approaches to train the new model: knowledge distillation, one-step-deviation imitation learning, and Q learning. The first approach obtains the noisy channel signal from a pseudo-corpus, and the latter two approaches aim to optimize toward a noisy-channel MT reward directly. For all three approaches, the generated translations fail to achieve rewards comparable to BSR, but the translation quality approximated by BLEU and BLEURT is similar to the quality of BSR-produced translations. Additionally, all three approaches speed up inference by 1-2 orders of magnitude.


Introduction
Noisy channel models have been traditionally used in many tasks, including speech recognition (Jelinek, 1997), spelling correction (Brill and Moore, 2000), question answering (Echihabi and Marcu, 2003), and statistical machine translation (Koehn et al., 2003). In machine translation (MT), the probability of the source sentence conditioned on the target-language generation is taken into account when generating a translation. In modern neural machine translation (NMT), the noisy channel approach is successful and often indispensable in many recent top-performing machine translation systems (Yee et al., 2019; Ng et al., 2019; Chen et al., 2020; Yu et al., 2020; Tran et al., 2021).
One widely used approach to noisy channel NMT is "beam search and rerank" (BSR). Given a trained forward translator and a trained reverse translator, BSR decoding consists of two steps: first, decode using beam search with a large beam size from the forward translation model and store the entire beam; second, rerank the beam using a reward which is the sum of the forward translation log probability and the reverse log probability. This approach incurs significant computational overhead, given the need to decode a large beam (usually with beam size 50-100) from the forward translator and the need to feed the large beam through the reverse translator. The computational cost is especially problematic if the practitioner has a large volume of translation requests, or if the system is mobile-based and requires offline translation.
We thus aim to learn a separate neural network with an architecture identical to that of the forward translator, and to investigate how much translation accuracy is sacrificed when we do greedy decoding from this new network at inference time. Specifically, we investigate how the forward and reverse rewards of the translations, as well as the translation quality (approximated by BLEU and BLEURT), compare to those of BSR-generated translations. The paper explores three approaches, with increasingly more exploration when optimizing the reward. (1) Knowledge distillation (KD) from a pseudo-training-corpus generated by BSR: we can treat the BSR-generated corpus as the oracle, and KD can be interpreted as behavioral cloning.
(2) A one-step-deviation imitation learning (IL) strategy where, given a fixed sequence of target-language tokens, we adjust each time-step's probability distribution over the vocabulary such that the resulting distribution minimizes an energy function used in BSR reranking. (3) Q learning, which explicitly learns the scoring function used in BSR reranking.
We experiment on three datasets: IWSLT'14 De-En, WMT'16 Ro-En, and WMT'14 De-En. Experimental results show that all three approaches speed up inference by 50-100 times. The approaches fail to achieve rewards comparable to BSR, but compared to the non-BSR baselines, they achieve much higher reverse rewards (i.e., log p_r(x | y), where p_r is the reverse translator) at the expense of forward rewards (i.e., log p_f(y | x), where p_f is the forward translator). Meanwhile, the approaches achieve a translation quality (approximated by BLEU and BLEURT) that is comparable to that of BSR. In particular, IL's BLEURT scores are significantly higher than those of beam search across all three datasets, and they are not significantly different from BSR's scores.

Neural Machine Translation
NMT systems usually model the distribution p(y | x), where x = (x_1, x_2, ..., x_{T_s}) is a source-language sequence and y = (y_1, y_2, ..., y_T) is a target-language sequence. Most NMT systems use an autoregressive factorization:

p_\theta(y \mid x) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, x),

where y_{<t} = (y_1, y_2, ..., y_{t-1}), and p_\theta is parameterized with a neural network. At test time, to decode a translation given a source sentence, greedy decoding and beam search are most commonly used. Both are approximate search methods for finding the highest-scoring translations.
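To make the decoding step concrete, the sketch below implements greedy decoding under this factorization. It is a minimal illustration rather than the implementation used here: `step_log_probs(x, prefix)` is a hypothetical interface returning the model's conditional log-probabilities log p_θ(· | y_<t, x), and `EOS` is an assumed end-of-sequence symbol.

```python
from typing import Callable, Dict, List

EOS = "<eos>"  # assumed end-of-sequence symbol


def greedy_decode(
    step_log_probs: Callable[[List[str], List[str]], Dict[str, float]],
    x: List[str],
    max_len: int = 200,
) -> List[str]:
    """Greedy decoding: at each step pick argmax_v log p(v | y_<t, x).

    `step_log_probs(x, prefix)` is an assumed interface returning a dict
    mapping each vocabulary item to its conditional log-probability.
    """
    y: List[str] = []
    for _ in range(max_len):
        log_p = step_log_probs(x, y)
        y_t = max(log_p, key=log_p.get)  # most probable next token
        y.append(y_t)
        if y_t == EOS:
            break
    return y
```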

Beam Search and Rerank (BSR)
BSR has appeared in a number of top-performing models, including many winning submissions to the WMT competitions (Ng et al., 2019; Chen et al., 2020; Yu et al., 2020; Tran et al., 2021). The intuition of BSR is to take advantage of the reverse translator during decoding. Specifically, we do beam search with a large beam size b (usually 50-100) to obtain b candidate translations. Then, we rerank the candidates using the scoring function

\log p_f(y \mid x) + \gamma_1 \log p_r(x \mid y) + \gamma_2 \log p_{lm}(y),

where \gamma_1 and \gamma_2 are tuned in [0, 2]. Without access to a language model trained on a huge target-language monolingual external corpus, if we use log p_f(y | x) + γ log p_r(x | y) as the ranking criterion, BSR still provides a significant performance gain. With a large beam size, this approach performs better than the "two-step beam search" approach (Yu et al., 2017; Yee et al., 2019).
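Putting the two steps together, a minimal BSR sketch (without the language-model term) looks as follows. The `beam_candidates`, `forward_log_prob`, and `reverse_log_prob` callables are assumed interfaces to beam search and to sequence-level scoring with p_f and p_r; they do not refer to any particular toolkit.

```python
from typing import Callable, List, Tuple


def bsr_decode(
    beam_candidates: Callable[[List[str], int], List[List[str]]],
    forward_log_prob: Callable[[List[str], List[str]], float],   # log p_f(y | x)
    reverse_log_prob: Callable[[List[str], List[str]], float],   # log p_r(x | y)
    x: List[str],
    beam_size: int = 100,
    gamma: float = 0.9,
) -> List[str]:
    """Beam search and rerank (BSR), without the optional LM term.

    `beam_candidates(x, b)` is an assumed interface returning the b
    translations kept in the beam; the two *_log_prob callables score a
    full sequence pair. The candidate maximizing
    log p_f(y | x) + gamma * log p_r(x | y) is returned.
    """
    candidates = beam_candidates(x, beam_size)
    scored: List[Tuple[float, List[str]]] = [
        (forward_log_prob(y, x) + gamma * reverse_log_prob(x, y), y)
        for y in candidates
    ]
    return max(scored, key=lambda pair: pair[0])[1]
```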
3 Amortized Noisy-Channel NMT
One common problem with the above approaches is the inference-time computation overhead. If a translation system needs to translate a high volume of text, then test-time computational efficiency is crucial. Thus, our goal is to use a network to approximate such a noisy channel NMT system, while having the same inference-time computational cost as greedily decoding from p_f. Specifically, we want our translations to maximize the following objective:

\log p_f(y \mid x) + \gamma \log p_r(x \mid y),   (1)

where γ > 0 is some fixed coefficient. Using the autoregressive factorization, the forward reward log p_f(y | x) equals \sum_{t=1}^{|y|} \log p_f(y_t \mid y_{<t}, x), and the reverse reward log p_r(x | y) equals \sum_{t=1}^{|x|} \log p_r(x_t \mid x_{<t}, y).
Goal: investigate whether greedily decoding from our new models leads to successful amortization. Three approaches are shown in this section. We do greedy decoding from the obtained models, and we investigate whether amortization is successful as follows.
• First, we examine if decoding is faster than BSR. This aspect is guaranteed given that we do greedy decoding from our new network, which has the same architecture as p_f.
• Next, we examine if both the forward and reverse rewards of the translations are close to the forward and reverse rewards of the translations generated by BSR, respectively.
• Finally, we examine the translation quality by checking if BLEU and BLEURT scores of our model's translations are close to those of BSR-produced translations.
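As a reference for the rest of this section, the sketch below computes the reward in Eq. (1) for a fixed pair (x, y) by teacher forcing, summing token-level log-probabilities under p_f and p_r. The per-step interfaces are hypothetical, mirroring the greedy-decoding sketch above.

```python
from typing import Callable, Dict, List


def sequence_log_prob(
    step_log_probs: Callable[[List[str], List[str]], Dict[str, float]],
    src: List[str],
    tgt: List[str],
) -> float:
    """Sum of token-level conditional log-probabilities (teacher forcing)."""
    total = 0.0
    for t, token in enumerate(tgt):
        total += step_log_probs(src, tgt[:t])[token]
    return total


def noisy_channel_reward(
    fwd_step_log_probs,  # assumed interface for p_f(. | y_<t, x)
    rev_step_log_probs,  # assumed interface for p_r(. | x_<t, y)
    x: List[str],
    y: List[str],
    gamma: float = 0.9,
) -> float:
    """Reward in Eq. (1): log p_f(y | x) + gamma * log p_r(x | y)."""
    forward = sequence_log_prob(fwd_step_log_probs, x, y)
    reverse = sequence_log_prob(rev_step_log_probs, y, x)
    return forward + gamma * reverse
```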

Approach 1: Knowledge Distillation (KD)
KD has been used to amortize beam search (Chen et al., 2018). It is also effective in NMT in general (Kim and Rush, 2016; Freitag et al., 2017; Tan et al., 2019; Tu et al., 2020). Here we adapt a simple version of KD for amortized noisy-channel decoding.
First, train a forward translator p_f and a reverse translator p_r using maximum likelihood estimation. Then, run BSR on the entire training set to obtain the pseudo-corpus. In particular, we ignore the p_lm term in this paper given that it usually requires a big language model, and the inclusion of the term is orthogonal to our goal of reducing inference time. (Generating the pseudo-corpus can be parallelized; if the system is deployed in the real world, we argue that the amount of computation used to generate the pseudo-corpus is negligible compared to the aggregate amount of computation for inference.) Next, we train a separate "knowledge distilled" model p_KD on this new pseudo-corpus (i.e., with the original source-language sentences and the BSR-generated target-language sentences). This objective is equivalent to minimizing the KL divergence between the distribution induced by the pseudo-corpus obtained through BSR and our model distribution.
At inference time, we greedily decode from p_KD.
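The whole KD pipeline can be summarized by the sketch below: run BSR once over the training sources to build the pseudo-corpus, then fit p_KD by ordinary MLE on the resulting pairs. The `bsr_decode_fn` and `mle_train` callables are assumed interfaces, standing in for the BSR decoder above and for a standard MLE training loop.

```python
from typing import Callable, List, Tuple

Pair = Tuple[List[str], List[str]]  # (source sentence, target sentence)


def build_pseudo_corpus(
    bsr_decode_fn: Callable[[List[str]], List[str]],  # assumed: x -> BSR translation
    sources: List[List[str]],
) -> List[Pair]:
    """Run BSR once over the training sources to create the distilled corpus."""
    return [(x, bsr_decode_fn(x)) for x in sources]


def train_kd_model(
    mle_train: Callable[[List[Pair]], object],  # assumed: standard MLE trainer
    pseudo_corpus: List[Pair],
):
    """Train p_KD with ordinary maximum likelihood on (x, BSR(x)) pairs; this
    minimizes the KL divergence to the BSR-induced corpus distribution."""
    return mle_train(pseudo_corpus)
```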

Approach 2: One-Step-Deviation Imitation Learning (IL)
Define a network A_φ such that it takes in the source sentence and a target-language prefix, and A_φ(· | x, y_{<t}) outputs a |V|-dimensional probability distribution corresponding to the t-th time-step. Moreover, A_φ and p_f have the same architecture.
In autoregressive text generation, to learn A_φ such that it is close to an existing network p_θ, imitation learning seeks to optimize φ by minimizing, at each time-step, a divergence L between A_φ(· | x, y_{<t}) and p_θ(· | x, y_{<t}); one example of L is the cross entropy.
Forward energy. Inspired by ENGINE (Tu et al., 2020), in the context of noisy channel NMT, we define the forward sub-energy E^f_t, which is a function of φ, as follows. Suppose we have a source sentence x and a sequence of prefix distributions ŷ_{<1}, ..., ŷ_{<T}, and call A_φ(· | x, ŷ_{<t}) the t-th-step distribution according to A_φ. The forward sub-energy is the cross entropy between this distribution and the forward conditional distribution measured by p_f:

E^f_t(\phi) = -\sum_{v \in \mathcal{V}} A_\phi(v \mid x, \hat{y}_{<t}) \log p_f(v \mid \hat{y}_{<t}, x).

Intuitively, given a source and a fixed sequence of prefixes, we learn A_φ such that the resulting t-th-step distribution matches the forward conditional probability (measured by p_f); the latter depends on the source x and the prefix distributions.
Reverse energy. Next, we define the reverse sub-energy as the cross entropy between the one-hot distribution of each source word and the reverse conditional distribution measured by p_r:

E^r_t(\phi) = -\log p_r(x_t \mid x_{<t}, \hat{y}).

Intuitively, we also learn A_φ such that the one-hot distributions corresponding to the source words match the reverse conditional probability (measured by p_r).
Trajectories. In the above equations, ŷ = (ŷ_1, ..., ŷ_T). ŷ_t comes from one of two sources, chosen with probability p and 1 − p for each minibatch during training (Section 4.2): (i) ŷ_t = arg max_{v ∈ V} A_φ(v | x, ŷ_{<t}), with ŷ_{<1} = ∅; in other words, given that A_φ(· | x, ŷ_{<t}) is a probability distribution, we use its most likely token as ŷ_t. (ii) For the second source, ŷ_t is the t-th token of the BSR-obtained sequence, so that we expose our model to BSR prefixes, which are the optimal prefixes.
Final objective. We train A_φ by minimizing the forward and reverse sub-energies, summed over all time-steps (cf. the reward in Eq. (1)). During inference, we greedily decode from A_φ.
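The sketch below illustrates one IL training loss under our reading of the sub-energies above: the forward term is the cross entropy between A_φ's step distributions and p_f's conditionals, the reverse term is the negative log-likelihood of the source tokens under p_r, and weighting the reverse term by γ is an assumption made to match Eq. (1). All tensors are assumed to be precomputed for a fixed prefix sequence ŷ.

```python
import torch


def il_loss(
    a_log_probs: torch.Tensor,   # [T, V]   log A_phi(. | x, y_hat_<t), one row per step
    pf_log_probs: torch.Tensor,  # [T, V]   log p_f(. | y_hat_<t, x)
    pr_log_probs: torch.Tensor,  # [Ts, Vs] log p_r(. | x_<t, y_hat)
    src_ids: torch.Tensor,       # [Ts]     source token ids (long tensor)
    gamma: float = 0.9,          # weight on the reverse term; assumed to mirror Eq. (1)
) -> torch.Tensor:
    """One IL training loss for a fixed prefix sequence y_hat (obtained greedily
    from A_phi or taken from the BSR output, mixed with probability p)."""
    a_probs = a_log_probs.exp()
    # Forward sub-energy: cross entropy between A_phi's step distribution and
    # p_f's conditional distribution, summed over target time-steps.
    e_forward = -(a_probs * pf_log_probs).sum(dim=-1).sum()
    # Reverse sub-energy: negative log-likelihood of the true source tokens
    # under p_r conditioned on y_hat, summed over source time-steps.
    e_reverse = -pr_log_probs.gather(1, src_ids.unsqueeze(1)).sum()
    return e_forward + gamma * e_reverse
```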

Approach 3: Q Learning
A well-motivated approach is to use Q learning (Watkins and Dayan, 1992; Sutton and Barto, 1998) to explicitly learn a reward function Q, with the goal that when we greedily decode from Q, the generations maximize the reward shown in Eq. (1).
Let us view machine translation as a sequential decision-making process. At time-step t, given a state s_t = (y_{<t}, x), a policy takes an action a_t ∈ V, transitions to the next state s_{t+1} = (y_{<(t+1)}, x), and receives a reward r_t.

Background on Q Learning
The action-value function Q^π: S × A → R produces the expected return after seeing state s_t, taking action a_t, and following policy π; i.e.,

Q^\pi(s_t, a_t) = \mathbb{E}\Big[\textstyle\sum_{t'=t}^{\infty} r_{t'} \,\Big|\, s_t, a_t, \pi\Big],

assuming discount factor 1. We further define Q^*: S × A → R to be the optimal action-value function

Q^*(s_t, a_t) = \max_\pi Q^\pi(s_t, a_t),

which is the maximum return achievable by following any strategy after seeing a state s_t and taking an action a_t. In particular, Q^* solves the Bellman equation (Sutton and Barto, 1998):

Q^*(s_t, a_t) = r_t + \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}),

assuming discount factor 1 and given the deterministic transition dynamics (in our machine translation scenario) after taking action a_t in state s_t.
Traditionally, the Q function is implemented as a matrix of size |S| × |A|, which is intractable in the case of MT due to the combinatorial nature of the state space. We thus use function approximation to tackle this issue of intractability: we follow Mnih et al. (2015) and use a deep neural network trained with experience replay and target networks to approximate Q.
Deep Q learning draws samples from a set of trajectories B, and the neural network Q aims to predict Q^* by minimizing the following squared loss:

L(\phi) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim B}\big[(R_t - Q_\phi(s_t, a_t))^2\big], \quad R_t = r_t + \max_{a'} \bar{Q}(s_{t+1}, a') \ \text{for } t < T, \quad R_T = r_T,

where φ is the parameter of Q, and Q̄ is a slightly old copy of Q.

Algorithm 1: Q learning for amortized noisy channel NMT
Given p_f, p_r, and a parallel translation dataset D.
while not converged do
  Collect training trajectories (§3.3), and sample a mini-batch B.
  Compute the target R_t for each transition in B.
  Update φ (using gradient descent) by the objective above.
  Every K updates, synchronize Q̄ with Q.

To model the noisy-channel NMT reward, given a target-language sequence y and its length T, we have reward r = (r_1, ..., r_T), where

r_t = \log p_f(y_t \mid y_{<t}, x) \ \text{for } t < T, \qquad r_T = \log p_f(y_T \mid y_{<T}, x) + \gamma \log p_r(x \mid y).   (2)

We construct Q to have the same architecture as p_f without the final softmax layer. Q is trained using Algorithm 1 above, which is adapted from deep Q learning originally applied to Atari games (Mnih et al., 2015), given that we aim to best leverage the existing off-policy trajectories from different sources.
In short, our algorithm says that given a trajectory (x, y, r), at time-step t < T, we want the scalar Q(s_t, a_t) to be close to the sum of the t-th step reward and the most optimistic future return, had we taken action a_t at time-step t. At time-step T, we want Q(s_T, a_T) = Q((y_{<T}, x), eos) to be close to r_T, as defined in Eq. (2).
To generate the t-th token at inference time, we do greedy decoding using Q as follows: ŷ_t = arg max_{a_t ∈ V} Q(s_t, a_t).
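A minimal sketch of the corresponding update and of greedy decoding from Q is given below, assuming discount factor 1, the per-step rewards of Eq. (2), and a target network Q̄; the tensor shapes are illustrative assumptions rather than the exact implementation.

```python
import torch


def q_targets(rewards: torch.Tensor, q_bar_next: torch.Tensor) -> torch.Tensor:
    """Bellman targets with discount factor 1.

    rewards:    [T]      per-step rewards r_t, as in Eq. (2).
    q_bar_next: [T-1, V] target-network values Q_bar(s_{t+1}, .) for t < T
                (assumed precomputed; the terminal step has no successor state).
    Returns R_t = r_t + max_a' Q_bar(s_{t+1}, a') for t < T, and R_T = r_T.
    """
    targets = rewards.detach().clone()
    targets[:-1] += q_bar_next.max(dim=-1).values
    return targets


def q_loss(q_values: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Squared error between Q(s_t, a_t) and the (non-differentiated) targets."""
    return ((q_values - targets.detach()) ** 2).mean()


def greedy_action(q_row: torch.Tensor) -> int:
    """Inference: pick the vocabulary item with the largest Q value at step t."""
    return int(q_row.argmax().item())
```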
Trajectories. The off-policy algorithm shown in Algorithm 1 requires trajectories, i.e., (x, y, r) tuples. The trajectories come from two sources.
(1) Q-based trajectories. In this category, we have two ways of obtaining y: (1a) Boltzmann exploration (Sutton, 1990) and (1b) greedy decoding based on Q. At the start of the optimization, however, most of the Q-generated sequences are very far from the target sequences. The lack of high-reward sequences prevents Q learning from optimizing efficiently. Therefore, we also inject reasonably good trajectories from the beginning of training by utilizing both ground-truth sequences and p_f-based sequences. We thus need the next category of sources.
(2) p_f-based trajectories. The target-language sequences are obtained by decoding using p_f; please find more details in Appendix A.

Experimental Setup
We experiment on three datasets: IWSLT'14 De-En, WMT'16 Ro-En, and WMT'14 De-En (Bojar et al., 2014), the last of which has a moderately large training set (train/dev/test size: 4,500,966/3,000/3,003). Each of the transformer models (the p_KD in KD, the A_φ in IL, the Q function in Q learning) has the same number of parameters as the original MLE-trained forward translator p_f. The model for IWSLT'14 De-En is the smallest, and the model for WMT'14 De-En is the largest. The detailed settings can be found in Appendix B. BLEU scores in this paper are computed with sacreBLEU (Post, 2018). BLEURT scores are computed using BLEURT-20-D12 (Sellam et al., 2020), a recent RemBERT-based checkpoint that achieves high human agreement. The models we experiment on are shown in Table 1.

Hyperparameters
The architecture and optimization details of p_f and p_r are shown in Appendix B. When training p_f and p_r, we validate the model performance after each epoch and select the model that corresponds to the best dev set BLEU.
γ is the coefficient on the reverse reward when computing the total reward in Eq. (1); γ and the BSR beam size b are tuned on dev set BLEU using BSR. We choose γ = 0.9 and b = 100 for IWSLT'14 De-En; γ = 0.5 and b = 70 for WMT'16 Ro-En; and γ = 0.5 and b = 50 for WMT'14 De-En. See Appendix B for details.
For training the IL-based network, the learning rate is selected from {10^-6, 5 × 10^-6, 10^-5, 3 × 10^-5, 5 × 10^-5}. We use weight decay of 10^-4. The dropout rate is selected from {0, 0.05, 0.1, 0.3}; we find that a dropout rate of 0 or 0.05 always works the best. We use a fixed max batch length (i.e., the max number of input tokens in a batch) of 4,096 tokens. The probability p, described in Section 3, is selected from {0, 0.1, 0.5, 0.9, 1}; we find that p = 0.1 or p = 0.5 usually works the best. We accumulate gradients and do gradient descent once every k steps for computational reasons; k is selected from {4, 8, 16}. We find that the IL approach relies on a good initialization, so we use p_KD/nc (the KD model distilled from BSR outputs) to initialize the new network.
For Q learning, the synchronization frequency K in Algorithm 1 is selected from {10, 20, 30, 50, 150}. The learning rate is tuned in {10^-5, 3 × 10^-5, 5 × 10^-5, 10^-4}. We use weight decay of 10^-4. The dropout rate is tuned in {0, 0.01, 0.05, 0.1}; we find that a dropout rate of 0 always works the best. We use a fixed max batch length of 4,096. We tune the number of steps per gradient update in {4, 8, 16}; a large number effectively increases the batch size. The ratio for different trajectories is described in Appendix A.1. Furthermore, we find that training Q with a small γ at the beginning stabilizes the training, so we first use γ = 0.1 and train until convergence, then increase γ in increments of 0.2, and we reiterate the process until reaching the desired γ.

Preliminary Analysis
Inference speed. Using any of the three proposed approaches achieves a significant speedup, given that the three approaches all use greedy decoding. We quantify this speedup experimentally.
During inference, we maximize the memory usage of a single NVIDIA RTX 8000 GPU by finding the largest batch length of the form 2^k, where k is a positive integer. In the IWSLT'14 De-En task, the inference speed (sequences per second) for BSR is 11. The speed for "greedy by p_f" is around 1,050, and the decoding speed for any of the three proposed approaches is similar.
Rewards. First, comparing the three approaches to greedy decoding or beam search from p_f, we see that the three approaches achieve smaller forward rewards, but much larger reverse rewards. This observation is expected given that the three approaches consider both the forward and reverse rewards, while greedy decoding or beam search from p_f only considers the forward reward. Second, comparing the three approaches against BSR, the three approaches achieve both smaller forward rewards and smaller reverse rewards. However, we find this a reasonable trade-off between decoding latency and rewards, as all these approaches are 1-2 orders of magnitude faster in decoding. Among the three approaches, KD and IL achieve a better balance between forward and reverse rewards. This observation can be explained by the difference in how the reverse reward is presented among the three approaches. In KD and IL, the learning signal from the reverse reward is implicitly spread throughout all the steps in a sequence. In other words, changing the conditional distribution in each time-step would adjust the loss in KD and the reverse energies in IL. In Q learning, the reverse reward is sparse: it only appears at the end of the sequence, unlike the forward reward, which is spread throughout all the steps. This makes it easier for Q learning to maximize the forward reward than the reverse reward, which requires many more updates to be propagated toward the earlier time-steps.
Translation quality. The three approaches achieve BLEU and BLEURT scores that are comparable to those by BSR. Moreover, the three approaches achieve BLEU scores that are much better than "greedy decoding from p_f," which has the same computational budget; they are often better than "beam search from p_f" as well. In particular, Table 3 shows that IL's BLEURT scores are significantly higher than the scores of beam search across all three datasets. In addition, IL's BLEURT scores are not significantly different from BSR's across all three datasets. Therefore, our approaches are able to generate translations of similar quality to those by BSR, while being 5-7 times as fast as beam search and 50-100 times as fast as BSR.

Analysis of Translations
In Q learning, the reverse reward is only presented as a learning signal at the end of each sequence. As observed earlier by Welleck et al. (2020), the length of the generations may inform us of possible degeneracies, such as excessive repetitions.
Therefore, we analyze WMT'16 Ro-En translations generated by different systems, and we first examine the lengths of translations in different source length buckets. Figure 1 shows that the lengths by different systems are similar in the first four buckets, but in the longest source length bucket (81, ∞), Q learning produces longer translations.
Closer examination of the translations reveals that Q learning produces degenerate translations with extensive repetitions when the source sentences are among the longest in the entire dev set; other systems do not, as illustrated in Table 6.
To confirm this finding, we analyze repetitions by source-length buckets. We define "token rep" to be the percentage of tokens that have appeared in the immediately preceding 5-gram:

\text{token rep} = \frac{100}{N} \sum_{i=1}^{N} \frac{1}{|y^{(i)}|} \sum_{t=1}^{|y^{(i)}|} \mathbb{1}\big[y^{(i)}_t \in (y^{(i)}_{t-5}, \ldots, y^{(i)}_{t-1})\big],

where the superscript indicates the i-th example, and N indicates the number of translations.
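A simple implementation of this metric, under our reading of the definition above, might look as follows (the uniform average over translations is an assumption):

```python
from typing import List


def token_rep(translations: List[List[str]], window: int = 5) -> float:
    """Percentage of tokens that also occur within the preceding `window`
    tokens, averaged over translations."""
    per_example = []
    for y in translations:
        if not y:
            continue
        repeated = sum(
            1 for t, tok in enumerate(y) if tok in y[max(0, t - window):t]
        )
        per_example.append(100.0 * repeated / len(y))
    return sum(per_example) / max(len(per_example), 1)
```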
We see from Figure 2 that for the longest source-sentence length bucket (81, ∞), Q learning produces translations with a significantly larger 5-gram repetition rate. Moreover, beam search from the forward-only model p_f exhibits the behavior most similar to the reference translations. We leave studying the cause behind the elevated level of repetition in noisy-channel decoding to future work.
Next, to compare translation similarity among different approaches, we examine the corpus-level BLEU score between each pair of approaches, averaged between the two directions. Table 5 shows that translations by BSR are more similar to those produced by p_f and by Q learning than to those by KD and IL. Now we compare the translations produced by the three approaches. Translations by KD are more similar to those by IL than to those by BSR and Q learning. This is in line with our intuition that KD and IL differ from Q learning, given that how the reverse reward is presented differs between KD/IL and Q learning.
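Concretely, each pairwise number can be obtained with sacreBLEU roughly as in the sketch below, scoring one system against the other as a pseudo-reference and vice versa, then averaging; this is a sketch of the procedure described above, not necessarily the exact script we use.

```python
from typing import List

import sacrebleu


def symmetric_corpus_bleu(system_a: List[str], system_b: List[str]) -> float:
    """Corpus-level BLEU between two systems' outputs, averaged over the two
    directions (A scored against B as reference, and vice versa)."""
    a_vs_b = sacrebleu.corpus_bleu(system_a, [system_b]).score
    b_vs_a = sacrebleu.corpus_bleu(system_b, [system_a]).score
    return 0.5 * (a_vs_b + b_vs_a)
```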

Further Analysis
KD. One may wonder whether the improvements in KD arise from the KD procedure or because we use BSR when constructing the pseudo-corpus. We therefore experiment with another model, p_KD/beam: we generate the pseudo-corpus Y_beam from the training set by beam search from p_f, and then use MLE to train p_KD/beam on the parallel corpora (X, Y_beam). Table 4 suggests that the forward rewards of the two approaches are similar, but the reverse rewards for p_KD/nc are much larger. Meanwhile, p_KD/nc produces translations with higher BLEU. It is therefore necessary to use BSR to generate the pseudo-corpus in order to amortize noisy-channel NMT using KD.
Q learning. Why does Q learning, the best-understood approach among the three, fail to achieve rewards comparable to BSR? The two challenges of a general deep Q learning algorithm are exploration and optimization.
Exploration refers to whether we can find high-quality trajectories. We hypothesize that exploration is not an issue given the diversity of trajectories we use, as shown in Appendix A.1. We even attempt adding high-reward trajectories from BSR as well as trajectories from a deep ensemble of multiple p_f's, but neither BLEU nor reward improves.
We thus suspect optimization is the challenge. The reverse reward log p_r(x | y) is sparse in that it is non-zero only at the terminal state (y_{1:T}, x) where y_T = eos. The difficulty in maximizing the sparse reverse reward comes from using one-step bootstrapping in Q learning. Such bootstrapping allows Q learning to cope with very long episodes or even an infinite horizon, but it slows down the propagation of future reward to the past. Because we always work with relatively short episodes in machine translation, we should investigate other learning paradigms from reinforcement learning, such as R learning (Mahadevan, 1996). We leave this further investigation to future work.

Related Work
One of our approaches adapts knowledge distillation (KD) for the noisy channel NMT setting. KD (Hinton et al., 2015; Kim and Rush, 2016) has been shown to work well for sequence generation. Chen et al. (2018) propose trainable greedy decoding, in which they use knowledge distillation to train a greedy decoder so as to amortize the cost of beam search. More subsequent studies have demonstrated the effectiveness of KD in neural machine translation (Freitag et al., 2017; Tan et al., 2019); Gu et al. (2017) show that it is difficult for on-policy reinforcement learning (RL) to work better than KD. Recently, KD has greatly boosted performance of non-autoregressive MT models (Gu et al., 2018; Lee et al., 2018; Tu et al., 2020). KD is also used to speed up speech synthesis, and the approach has been widely deployed in real products (van den Oord et al., 2018).
RL for sequence generation has been greatly inspired by Sutton and Barto (1998). Ranzato et al. (2016) and Bahdanau et al. (2016) apply on-policy RL (REINFORCE and actor-critic algorithms) to MT, but the major optimization challenge lingers given that the reward is usually sparse. Choshen et al. (2020) recently find that the improvements in MT performance may rely on a good initialization. To address the sparsity issue, Norouzi et al. (2016) attempt a hybrid maximum likelihood (ML) and RL approach. More recently, Pang and He (2021) use an offline RL setting with a per-token reward based on a translator trained using standard MLE.
In recent years, off-policy RL methods have been used to better leverage existing trajectories in text generation. For instance, in the chatbot setting (Serban et al., 2017; Zhou et al., 2017), the periodically collected human feedback is treated as the trajectory. In our case, we leverage the expensive BSR-obtained trajectories as well as trajectories from many different models and sources, although the sparse reward issue still lingers.
Finally, we point out a recent endeavor to speed up noisy channel NMT inference (Bhosale et al., 2020). They reduce the size of the channel model, the size of the output vocabulary, and the number of candidates during beam search. Our solution is orthogonal: we aim to use a separate network to amortize the decoding cost, while not changing the network's architecture.

Conclusion
We describe three approaches (KD, IL, Q learning) to train an amortized noisy-channel NMT model. We investigate whether greedily decoding from these models leads to accurate translations in terms of reward and quality. Although all three approaches fail to achieve rewards comparable to BSR, the reverse rewards are much higher than those from non-BSR baselines, often at the expense of forward rewards. However, we find the translation quality (measured by BLEU and BLEURT) to be comparable to that of BSR, while inference is much faster. For future work, the research community could further investigate better ways to optimize toward a sparse reward in the language generation context. Another way to approach the Q learning optimization challenge is to find better reward functions, including denser rewards.
A More Information on Q learning for Amortized Noisy Channel NMT

A.1 Details on trajectories
We have obtained trajectories from different sources in the off-policy algorithm (Algorithm 1). Each trajectory contains a source-language sequence x, a target-language sequence y, and the corresponding sequence of rewards r = (r_1, ..., r_T).
One natural category of trajectories to consider is the ones obtained by Q during training. Source (1a) and source (1b) correspond to Q-based trajectories.
Source (2) corresponds to p_f-obtained trajectories. Specifically, we split this category into a few sub-sources. (2a) The y is obtained through sampling from p_f with a temperature sampled from Uniform([0, 1]). (2b) The y is obtained through greedy decoding from p_f. (2c) The y is obtained through beam search from p_f with a beam size randomly chosen from 2 to 10. (2d) The y is obtained through beam search from p_f: we first obtain the 50 candidate sequences with the largest p_f probabilities using beam search with beam size 50; next, we pick a random sequence out of these 50 sentences.
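For illustration, one way to draw a p_f-based trajectory from sub-sources (2a)-(2d) is sketched below. The decoding callables are assumed interfaces to p_f, and the uniform choice among sub-sources is an assumption made only for this sketch.

```python
import random
from typing import Callable, List


def sample_pf_trajectory(
    sample_fn: Callable[[List[str], float], List[str]],     # assumed: temperature sampling from p_f
    greedy_fn: Callable[[List[str]], List[str]],            # assumed: greedy decoding from p_f
    beam_fn: Callable[[List[str], int], List[List[str]]],   # assumed: beam search returning the beam
    x: List[str],
) -> List[str]:
    """Draw one p_f-based target sequence using sub-sources (2a)-(2d)."""
    sub_source = random.choice(["2a", "2b", "2c", "2d"])
    if sub_source == "2a":
        return sample_fn(x, random.uniform(0.0, 1.0))        # temperature ~ U([0, 1])
    if sub_source == "2b":
        return greedy_fn(x)
    if sub_source == "2c":
        beam = beam_fn(x, random.randint(2, 10))             # beam size in [2, 10]
        return beam[0]                                        # highest-probability candidate
    beam = beam_fn(x, 50)                                     # (2d): beam of 50 candidates
    return random.choice(beam)                                # pick one at random
```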
We have also experimented with gold-standard trajectories from the parallel translation dataset D, but the inclusion of such trajectories does not lead to better rewards (of translations generated from Q).
B More Discussion on Experiments

BSR hyperparameters. γ is tuned in {0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5}, and b is tuned in {5, 10, 20, ..., 100} for the first two datasets and {5, 10, 20, ..., 50} for WMT'14 De-En due to memory constraints. The best γ is 0.9, 0.5, and 0.5 for IWSLT'14 De-En, WMT'16 Ro-En, and WMT'14 De-En, respectively; the best b is 100, 70, and 50 for the three datasets, respectively.

Details on p_f and p_r. Recall that p_f is the forward translator (from the source language to the target language) and p_r is the reverse translator (from the target language to the source language). We use transformer-based architectures for all experiments. Refer to Table 7 for the architecture settings.
Number of parameters in the models. The IWSLT'14 De-En transformer has 39,469,056 parameters, the WMT'16 Ro-En transformer has 62,046,208 parameters, and the WMT'14 De-En transformer has 209,911,808 parameters.
Discussion on Q learning. In Section 5.3, to investigate whether better trajectories can improve Q learning results, we attempt adding high-reward trajectories from BSR as well as trajectories from a deep ensemble of two p_f's. Deep ensembling two models (using different seeds) can produce high-quality translations. In this case, we simply want to use deep ensembling to diversify the sources of high-reward and high-BLEU trajectories. However, the result is that neither BLEU nor reward improves.

C Ethical Considerations
IWSLT and WMT datasets are standard machine translation benchmarks. The datasets come from a variety of sources: phone conversations, parliament proceedings, news, and so on. There may be naturally occurring social biases in the datasets, which have not undergone thorough cleansing. Training on data with these potential biases may lead to biased generations. There has been recent work studying such biases (Kocmi et al., 2020).
This work deals with speeding up inference, but not pretraining or training (Liu et al., 2021; Hou et al., 2022). The standard practice of creating the pseudo-corpus requires a significant amount of computation. This step is optional, but it gives a boost in performance. We argue that if the MT system is put into production, then the benefit from efficient inference will outweigh the cost of generating the pseudo-corpus.

Figure 1 :
Average length bucketed by length of the source sentence. The five buckets contain 453, 877, 376, 92, and 26 sentences, respectively. The six systems are KD, IL, Q learning, beam search by p_f, BSR, and reference translations, respectively. In the longest length bucket, Q learning produces translations that are longer than translations by other systems.

Table 1 :
Mean and standard deviation (across sequences) of test set forward and reverse rewards for translations. b refers to beam size during inference.

Table 2 :
Test set sacreBLEU (mean & standard deviation of three runs using different random seeds). IL performs the best among the three proposed methods.

Table 3 :
Test set BLEURT-20-D12 (mean & standard deviation of three runs). IL performs the best among the three proposed methods. The significance test is conducted in Table 8, which shows that IL's scores are significantly better than the scores by beam search; in addition, IL's scores are not significantly different from BSR's scores.

Table 4 :
The rewards and BLEU scores using two KD approaches: p_KD/beam uses the pseudo-corpus generated by doing beam search from p_f; p_KD/nc uses the pseudo-corpus generated by BSR.

Table 5 :
Corpus-level BLEU between translations by pairs of systems. Each reported BLEU is averaged between the two directions.

Table 6 :
WMT'16 Ro-En examples produced by different systems. The top example is randomly selected. The bottom example has a long source sentence, and Q learning produces repetitions.
source: acum , insa , tsipras cere grecilor sa ii incredinteze din nou mandatul de premier , in cadrul unor alegeri despre care sustine ca ii vor intari pozitia politica .
KD: now , however , tsipras is urging greeks to entrust the prime minister &apos;s mandate again , in an election he claims will strengthen his political position .
IL: now , however , tsipras is asking greeks to reentrust them with the prime minister &apos;s term , in an election that they claim will strengthen his political position .
Q learning: now , however , tsipras is urging greeks to reentrust his term as prime minister in an election that he claims will strengthen his political position .
beam search by p_f: now , however , tsipras is urging greeks to re-entrust the prime minister &apos;s term in an election that he claims will strengthen his political position .
BSR: now , however , tsipras is urging greeks to reentrust the prime minister &apos;s term , in an election that he claims will strengthen his political stance .
reference: now , however , tsipras asks the greeks again to entrust him with the prime minister position , during an election which he says will strengthen his political position .

source: adomnitei a fost trimis in judecata de directia nationala antico <unk> ruptie ( dna ) , fiind acuzat de favorizarea faptuitorului si fals intelectual dupa ce , spun pro <unk> curorii , ar fi incercat sa mascheze un control de audit in urma caruia se descoperise o serie de nereguli cu privire la receptia dintr-un contranct public semnat intre cj si firma laser co .
KD: adomnitei was sued by the national anti-co nistelrooij ruptie ( dna ) as accused of favouring the perpetrator and false intellectual after , pro nistelrooij curorii says , he would have tried to disguise an audit control as a result of which a number of irregularities concerning reception in a public contranct signed between the cj and laser co were discovered .
IL: adomnitei was sued by the national directorate antico iel ruptie ( dna ) and accused of favouring the perpetrator and forgery an intellectual after , pro iel curorii says , he had tried to disguise an audit control that found a number of irregularities regarding the reception in a public conctrant signed between cj and laser .
Q learning: the runner the runner , the runner-the runner-in-ranging runner-up is given to the latter , as he is accused of promoting the perpetrator and faltering intellectual after , says pro or: curors , tried to disguise an audit control , as a result of which a number of irregularities concerning a reception signed between cj and laser had been discovered in a public cross-border convoy .
beam search by p_f: adomnitei was sued by the national anti-co nistelrooij rupture ( dna , accused of favouring the perpetrator and forgery intellectual after allegedly attempting to disguise an audit control line between cj and lasco .
BSR: adomnitei was sued by the national anti-co xiated department ( dna , accused of favouring the perpetrator and forgery intellectual after allegedly attempting to disguise an audit control line signed between cj and the lasco firm .
reference: adomni«unk» ei was indicted by the national anticorruption directorate ( dna ) , being accused of favouring the offender and forgery after , according to the prosecutors , he tried to mask an audit which discovered a number of irregularities regarding the acceptance of a public contract entered into by the county council and the company laser co .

Table 7 :
Settings for the forward model p_f and the reverse (channel) model p_r.

Table 8 :
Test set BLEURT-20-D12 (mean & standard deviation of three runs). IL performs the best among the three proposed methods. *: The score is significant (p-value smaller than 0.05) compared to the beam search results. †: The score is significantly higher (p-value smaller than 0.05) than the BSR results, or the score is not significantly different (p-value larger than 0.05) from the BSR results.