Exploring Supervised and Unsupervised Rewards in Machine Translation

Reinforcement Learning (RL) is a powerful framework to address the discrepancy between loss functions used during training and the final evaluation metrics to be used at test time. When applied to neural Machine Translation (MT), it minimises the mismatch between the cross-entropy loss and non-differentiable evaluation metrics like BLEU. However, the suitability of these metrics as reward functions at training time is questionable: they tend to be sparse and biased towards the specific words used in the reference texts. We propose to address this problem by making models less reliant on such metrics in two ways: (a) with an entropy-regularised RL method that not only maximises a reward function but also explores the action space to avoid peaky distributions; (b) with a novel RL method that explores a dynamic unsupervised reward function to balance between exploration and exploitation. We base our proposals on the Soft Actor-Critic (SAC) framework, adapting the off-policy maximum entropy model for language generation applications such as MT. We demonstrate that SAC with BLEU reward tends to overfit less to the training data and performs better on out-of-domain data. We also show that our dynamic unsupervised reward can lead to better translation of ambiguous words.


Introduction
Autoregressive sequence-to-sequence (seq2seq) neural architectures have become the de facto approach in Machine Translation (MT). Such models include Recurrent Neural Networks (RNN) (Sutskever et al., 2014) and Transformer networks (Vaswani et al., 2017), among others. However, a serious limitation of these models is the discrepancy between their training and inference time regimes. They are traditionally trained using Maximum Likelihood Estimation (MLE), which aims to maximise the log-likelihood of a categorical ground truth distribution (samples in the training corpus) using loss functions such as cross-entropy. Such losses are very different from the evaluation metrics used at inference time, which generally compare string similarity between the system output and reference outputs. Moreover, during training, the generator receives the ground truth as input and is trained to minimise the loss of a single token at a time without taking the sequential nature of language into account. At inference time, however, the generator takes the previously sampled output as the input at the next time step, rather than the ground truth word. MLE training thus causes: (a) the problem of "exposure bias" as a result of recursive conditioning on its own errors at test time, since the model has never been exclusively "exposed" to its own predictions during training; (b) a mismatch between the training objective and the test objective, where the latter relies on evaluation using discrete and non-differentiable measures such as BLEU (Papineni et al., 2002).
The current solution for both problems is mainly based on Reinforcement Learning (RL), where a seq2seq model (Sutskever et al., 2014) is used as the policy, which generates actions (tokens) and at each step receives rewards based on a discrete metric, taking into account the importance of immediate and future rewards. However, RL methods for seq2seq MT models also have their challenges: high-dimensional discrete action space, efficient sampling and exploration, choice of baseline reward, among others (Choshen et al., 2020). The typical metrics used as rewards (e.g., BLEU) are often biased and sparse. They are measured against one or a few human references and do not take into account alternative translation options that are not present in the references.
One way to address this problem is to use entropy-regularised RL frameworks. They incorporate the entropy measure of the policy into the reward to encourage exploration. The expectation is that this leads to learning a policy that acts as stochastically as possible while still being able to succeed at the task. Specifically, we focus on the Soft Actor-Critic (SAC) (Haarnoja et al., 2018a,b) RL framework, which to the best of our knowledge has not yet been explored for MT or for other natural language processing (NLP) tasks. The main advantage of this architecture, as compared to other entropy-regularised architectures (Haarnoja et al., 2017; Ziebart et al., 2008), is that it is formulated in the off-policy setting, which enables reusing previously collected samples for more stability and better exploration. We demonstrate that SAC prevents the model from overfitting, and as a consequence leads to better performance on out-of-domain data. Another way to address the problem of sparse or biased reward is to design an unsupervised reward. Recently, in Robotics, SAC has been successfully used in unsupervised reward architectures, such as the "Diversity is All You Need" (DIAYN) framework (Eysenbach et al., 2018). DIAYN allows the learning of latent-conditioned sub-policies ("skills") in an unsupervised manner, which allows better exploration and modelling of target distributions. Inspired by this work, we propose a formulation of an unsupervised reward for MT. We thoroughly investigate the effects of this reward and conclude that it is useful in lexical choice, particularly for the translation of rare senses of ambiguous words.
Our main contributions are thus twofold: (a) the re-framing of the SAC framework such that it can be applied to MT and other natural language generation tasks (Section 3). We demonstrate that SAC results in improved generalisation compared to the MLE training, leading to better translation of out-of-domain data; (b) the proposal of a dynamic unsupervised reward within the SAC framework (Section 3.4). We demonstrate its efficacy in translating ambiguous words, particularly the rare senses of such words. Our datasets and settings are described in Section 4, and our experiments in Section 5.

Related Work
Reinforcement Learning for MT. RL has been successfully applied to MT to bridge the gap between training and testing by optimising the sequence-level objective directly (Yu et al., 2017; Ranzato et al., 2015; Bahdanau et al., 2016). However, thus far mainly the REINFORCE (Williams, 1992) algorithm and its variants have been used (Ranzato et al., 2015; Kreutzer et al., 2018). These are simpler algorithms that handle the large natural language action space, but they employ a sequence-level reward which tends to be sparse.
To reduce model variance, Actor-Critic (AC) models consider the reward at each decoding step and use the Critic model to guide future actions (Konda and Tsitsiklis, 2000). This approach has also been explored for MT (Bahdanau et al., 2016; He et al., 2017). However, more advanced AC models with Q-Learning are rarely applied to language generation problems. This is due to the difficulty of approximating the Q-function for the large action space. The large action space is one of the bottlenecks for RL for text generation in general. Pre-training of the agent parameters to be close to the true distribution is thus necessary to make RL work (Choshen et al., 2020). Further RL training of the agent makes the overfitting problem even more pronounced, resulting in peaky distributions. Such problems are traditionally addressed by entropy-regularised RL.
Entropy Regularised RL. The main goal of this type of RL is to learn an efficient policy while keeping the entropy of the agent actions as high as possible. The paradigm promotes exploration of actions, suppresses peaky distributions and improves robustness. In this work, we explore the effectiveness of the maximum entropy SAC framework (Haarnoja et al., 2018a).
The work closest to ours is that of Dai et al. (2018), where the Entropy-Regularised AC (ERAC) model leads to better MT performance. The major difference between ERAC and SAC is that the former is an on-policy model and the latter is an off-policy model. On-policy approaches use consecutive samples collected in real time that are correlated with each other. In the off-policy setting, our SAC algorithm uses samples drawn uniformly from the memory, which reduces their correlation. This key characteristic of SAC ensures better model generalisation and stability (Mnih et al., 2015). There are also differences in the architectures of SAC and ERAC, i.a., using 4 Q-value networks instead of two. These differences will be covered in detail in Section 3.
Unsupervised reward RL. Significant work has been done in Robotics to improve the learning capability of robots. These approaches do not rely on a single objective but rather promote intrinsic motivation and exploration. Such an approach to learning diverse skills (latent-conditioned sub-policies; in practice, skills like walking or jumping) in an unsupervised manner was recently proposed by Eysenbach et al. (2018). The approach relies on the SAC model and inspired our approach to designing our unsupervised reward for MT. We are not aware of other attempts to design dynamic unsupervised RL rewards (learnt together with the network) in seq2seq in general, or MT in particular. Recent work on unsupervised rewards in NLP (Gao et al., 2020) explores mainly static rewards computed against synthetic references.

Methodology
In this section we start by describing the underlying MT architecture and its variant using RL, to then introduce our SAC formulation and the reward functions used.

Neural Machine Translation (NMT)
A typical Neural Machine Translation (NMT) system is a seq2seq architecture (Sutskever et al., 2014), where each source sentence $x = (x_1, x_2, \cdots, x_n)$ is encoded by the encoder into a series of hidden states. At each decoding step $t$, a target word $y_t$ is generated according to $p(y_t | y_{<t}, x)$, conditioned on the input sequence $x$ and the decoded sequence $y_{<t} = (y_1, \cdots, y_{t-1})$ up to the $t$-th time step. Given the corpus of pairs of source and target sentences $\{x^i, y^i\}_{i=1}^{N}$, the training objective function, maximum likelihood estimation (MLE), is defined as:
$$L_{MLE} = -\sum_{i=1}^{N} \sum_{t=1}^{T} \log p(y_t^i | y_{<t}^i, x^i) \quad (1)$$
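As a minimal sketch of the MLE objective, the snippet below sums the negative log-probabilities that a model assigns to the reference tokens under teacher forcing. The function name `mle_loss`, the input format and the toy probabilities are illustrative assumptions and are not taken from the authors' code.

```python
import math


def mle_loss(token_probs):
    """Negative log-likelihood of the reference tokens.

    token_probs: list of sentences; each sentence is a list with the model
    probability p(y_t | y_<t, x) of every reference token (hypothetical
    input format, chosen only for illustration).
    """
    return -sum(math.log(p) for sentence in token_probs for p in sentence)


# Two toy sentences with per-token probabilities of the reference words.
toy_probs = [[0.5, 0.25], [1.0, 0.5]]
toy_loss = mle_loss(toy_probs)
```

Each term depends only on the probability of a single reference token given the gold prefix, which is why this loss never exposes the model to its own predictions.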

Reinforcement Learning for NMT
Within the RL framework, the task of NMT can be formulated as a sequential decision making process, where the state is defined by the previously generated words ($y_{<t}$) and the action is the next word to be generated. Given the state $s_t$, the agent picks an action $a_t$ (for seq2seq it is the same as $y_t$), according to a (typically stochastic) policy $\pi_\theta$ and observes a reward $r_t$ for that action. The reward can be calculated based on any evaluation metric, e.g. BLEU.
The objective of the RL training is to maximise the expected reward:
$$L_{RL} = \mathbb{E}_{a_1, \cdots, a_T \sim \pi_\theta(a_1, \cdots, a_T)} [r(a_1, \cdots, a_T)] \quad (2)$$
Under the policy $\pi$, we can also define the values of the state-action pair $Q(s_t, y_t)$ and the state $V(s_t)$ as follows:
$$Q(s_t, y_t) = \mathbb{E}_{s_{t+1}} [r_t + \gamma V(s_{t+1})], \quad V(s_t) = \mathbb{E}_{y \sim \pi(\cdot|s_t)} [Q(s_t, y)] \quad (3)$$
Intuitively, the value function $V$ measures how good the model could be when it is in a specific state $s_t$. The $Q$ function measures the value of choosing a specific action when we are in such a state.
Given the above definitions, we can define a function called advantage, denoted by $A^\pi$, relating the value function $V$ and the $Q$ function as follows:
$$A^\pi(s_t, y_t) = Q(s_t, y_t) - V(s_t) \quad (4)$$
Therefore, the focus is on maximising one of the following objectives:
$$\max_{y_t} A^\pi(s_t, y_t) \quad \text{or} \quad \max_{y_t} Q(s_t, y_t) \quad (5)$$
Different RL algorithms have different ways to search for the optimal policy. Algorithms such as REINFORCE, as well as its variant MIXER (Ranzato et al., 2015), popular in language tasks, search for the optimal policy via Eq. 2 using the Policy Gradient. Actor-Critic (AC) models typically improve the performance of Policy Gradient models by solving Eq. 5 (left part) (Bahdanau et al., 2016). Q-learning models aim at maximising the Q function (Eq. 5, right part) to improve over both the Policy Gradient and AC models (Dai et al., 2018).
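To make the relation between the value function, the Q function and the advantage concrete, here is a small sketch for a single state with three candidate words. It assumes the standard identity that the state value is the policy-weighted average of the Q-values; the function names and the numbers are hypothetical.

```python
def state_value(policy, q_values):
    """V(s): expectation of Q(s, a) under the policy distribution."""
    return sum(p * q for p, q in zip(policy, q_values))


def advantages(policy, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action a."""
    v = state_value(policy, q_values)
    return [q - v for q in q_values]


# Toy state with three candidate next words.
toy_policy = [0.5, 0.25, 0.25]
toy_q = [1.0, 2.0, 4.0]
toy_adv = advantages(toy_policy, toy_q)
```

Under these assumptions the advantage is positive only for actions that are better than the average behaviour of the policy in that state, and its expectation under the policy is zero.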

Soft Actor-Critic (SAC)
The SAC algorithm (Haarnoja et al., 2018a) adds an entropy term to Eq. 2:
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} [r(s_t, a_t) + \alpha H(\pi(\cdot|s_t))] \quad (6)$$
where $\rho_\pi$ is the distribution of state-action pairs induced by the policy, and $\alpha$ controls the stochasticity of the optimal policy, i.e. the trade-off between the relative importance of the entropy term $H$ and the reward $r(s_t, a_t)$ that the agent receives by taking action $a_t$ when the state of the environment is $s_t$. The aim is to maximise the entropy of actions at the same time as maximising the rewards.
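The effect of the entropy term can be illustrated with a short sketch that compares a peaky and a flat distribution over four actions. The helper names `entropy` and `entropy_augmented_reward` are assumptions made for this example; the value 0.01 for α is the one reported later in the Training section.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_augmented_reward(reward, probs, alpha):
    """Per-step quantity r(s_t, a_t) + alpha * H(pi(.|s_t))."""
    return reward + alpha * entropy(probs)


peaky = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
peaky_value = entropy_augmented_reward(1.0, peaky, alpha=0.01)
flat_value = entropy_augmented_reward(1.0, flat, alpha=0.01)
```

For the same task reward, the flatter distribution receives the larger entropy-augmented value, which is what discourages peaky policies.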
As mentioned earlier, SAC is an off-policy Q-learning AC algorithm. Like other AC algorithms, it consists of two parts: the actor (the policy function) and the critic (the action-value function $Q$), parameterised by $\phi$ and $\theta$, respectively.
During off-policy learning, the history of states, actions and respective rewards are stored in a memory (D), a.k.a. the replay buffer.
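A replay buffer of this kind can be sketched in a few lines. The class below is a generic illustration, not the authors' implementation: the class name, the tuple layout and the fixed capacity are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory D of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniform sampling without replacement, which reduces correlation."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(step, step, float(step), step + 1, False)
batch = buffer.sample(2)
```

Sampling uniformly from stored transitions, rather than using consecutive on-policy samples, is the property the off-policy setting relies on.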

• Critic Training
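The Q-function estimates the value of an action at a given state based on its future rewards. The soft-Q value is computed recursively by applying a modified Bellman backup operator:
$$T^\pi Q(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1}} [V(s_{t+1})] \quad (7)$$
where
$$V(s_t) = \mathbb{E}_{a_t \sim \pi} [Q(s_t, a_t) - \alpha \log \pi(a_t|s_t)] \quad (8)$$
is the expected future reward of a state and $-\log \pi(a_t|s_t)$ is the entropy term of the policy.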
The parameters of the Q-function are updated towards minimising the mean squared error between the estimated Q-values and the assumed ground-truth Q-value. The assumed ground-truth Q-values are estimated based on the current reward ($r(s_t, a_t)$) and the discounted future reward of the next state ($\gamma V_{\bar\theta}(s_{t+1})$). This mean squared error objective function of the Q network is as follows:
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D} \left[ \frac{1}{2} \left( Q_\theta(s_t, a_t) - \left( r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1}} [V_{\bar\theta}(s_{t+1})] \right) \right)^2 \right] \quad (9)$$
Note that the parameters of the main and target critic networks are denoted as $\theta$ and $\bar\theta$, respectively. This follows the best practice where the critic is modelled with two neural networks with the exact same architecture but independent parameters (Mnih et al., 2015).
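For a discrete action space such as a vocabulary, the expectation in the soft value can be computed exactly as a sum over actions. The sketch below assumes this discrete formulation and shows the soft state value, the critic target and the squared-error loss for one transition; all names and the `done` flag are illustrative assumptions.

```python
import math


def soft_value(policy, q_values, alpha):
    """V(s) = sum_a pi(a|s) * (Q(s, a) - alpha * log pi(a|s))."""
    return sum(p * (q - alpha * math.log(p))
               for p, q in zip(policy, q_values) if p > 0.0)


def critic_target(reward, next_policy, next_q_target, alpha, gamma, done):
    """Assumed ground-truth Q-value: r + gamma * V_target(s_{t+1})."""
    if done:  # no future reward after the end of the sentence
        return reward
    return reward + gamma * soft_value(next_policy, next_q_target, alpha)


def critic_loss(q_estimate, target):
    """Squared error between the estimated and the target Q-value."""
    return 0.5 * (q_estimate - target) ** 2


toy_v = soft_value([0.5, 0.5], [1.0, 3.0], alpha=1.0)
toy_target = critic_target(1.0, [0.5, 0.5], [1.0, 3.0],
                           alpha=0.0, gamma=0.5, done=False)
```

With α set to zero the soft value reduces to the ordinary expected Q-value, which makes it easy to see what the entropy bonus adds.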
The parameters of the target critic network ($Q_{\bar\theta}$) are iteratively updated with the exponential moving average of the parameters of the main critic network ($Q_\theta$). This constrains the parameters of the target network to update at a slower pace toward the parameters of the main critic, which has been shown to stabilise the training process (Lillicrap et al., 2016).
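The exponential moving average update can be sketched as follows. The smoothing coefficient `tau` and its value are assumptions for illustration; the text does not state the value used in the experiments.

```python
def ema_update(target_params, main_params, tau):
    """Move each target parameter a fraction tau towards the main parameter."""
    return [(1.0 - tau) * t + tau * m
            for t, m in zip(target_params, main_params)]


target = [0.0, 0.0]
main = [1.0, 2.0]
target = ema_update(target, main, tau=0.1)
```

A small `tau` keeps the target critic close to its previous values, so the regression target of the main critic changes slowly.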
Another advantage of SAC is double Q-learning (Hasselt, 2010). In this approach, two Q networks are maintained for each of the main and the target critic functions. When estimating the current Q values or the discounted future rewards, the minimum of the outputs of the two Q networks is used. Thus the estimated Q values do not grow too large, which improves the policy training (Haarnoja et al., 2018a).
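The minimum over the two critics can be written as an element-wise operation over the per-action Q-values. This is a generic sketch with hypothetical names and numbers.

```python
def clipped_double_q(q_first, q_second):
    """Element-wise minimum of the per-action outputs of two Q networks."""
    return [min(a, b) for a, b in zip(q_first, q_second)]


q_net_1 = [1.0, 5.0, 2.0]
q_net_2 = [1.5, 3.0, 2.0]
q_min = clipped_double_q(q_net_1, q_net_2)
```

Taking the minimum means that an overestimate by one network is kept only if the other network agrees with it.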
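• Actor Training
SAC updates the policy by minimising the KL-divergence, so that the distribution given by the policy function $\pi_\phi(s_t)$ looks more like the distribution induced by the Q function:
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim D} \left[ \mathbb{E}_{a_t \sim \pi_\phi} [\alpha \log \pi_\phi(a_t|s_t) - Q_\theta(s_t, a_t)] \right] \quad (10)$$
where softmax is used in the final layer of the policy to output a probability distribution over the actions.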
We note that some versions of the SAC algorithm allow the α parameter to be tuned automatically, so that the policy satisfies a minimum entropy criterion while maximising the expected return. In our experiments, however, we used a fixed α, since updating α during training resulted in output sentences that were too short.
Finally, we note that Eq. 10 does not simply add an entropy term to the standard Policy Gradient. The critic $Q_\theta$ trained by Eq. 9 additionally captures the entropy from future steps.
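The actor objective for a discrete action space can be sketched as an exact expectation over actions, in the spirit of the discrete formulation the text refers to. The code assumes the loss takes the form of the policy-weighted sum of α log π minus Q; function names and values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_loss(logits, q_values, alpha):
    """sum_a pi(a|s) * (alpha * log pi(a|s) - Q(s, a)) for one state."""
    policy = softmax(logits)
    return sum(p * (alpha * math.log(p) - q)
               for p, q in zip(policy, q_values) if p > 0.0)


toy_q = [1.0, 2.0]
loss_matched = actor_loss([1.0, 2.0], toy_q, alpha=1.0)  # pi = softmax(Q / alpha)
loss_uniform = actor_loss([0.0, 0.0], toy_q, alpha=1.0)
loss_reversed = actor_loss([2.0, 1.0], toy_q, alpha=1.0)
```

In this toy case the loss is smallest when the policy matches the softmax of the Q-values scaled by α, which is the sense in which the policy is pulled towards the distribution induced by the critic.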
For more details on SAC for the discrete setting (like MT) we refer to Christodoulou (2019). For more formal details on the architecture, see Haarnoja et al. (2018a,b).

Reward functions
Below we define the reward functions we use in our SAC architecture.
Supervised BLEU reward (SAC BLEU). In the supervised setup, we employ the sequence-level BLEU score (Papineni et al., 2002) with add-1 smoothing (Chen and Cherry, 2014). As an additional length constraint at each time step, we deduct from the respective score the length penalty $lp = |l_y - l_{\hat{y}}|$, where $l_y$ and $l_{\hat{y}}$ are the lengths of the reference translation $y$ and of the hypothesis $\hat{y}$. This penalty discourages overly long translations, which are not penalised by the brevity penalty of BLEU.
BLEU has been chosen in our study to ensure better comparability with the related work in RL MT traditionally using the BLEU reward (Bahdanau et al., 2016;Dai et al., 2018).
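A sentence-level version of this reward can be sketched as below. Several details are assumptions made for illustration: add-1 smoothing is applied to n-gram orders above one, BLEU is kept on a 0 to 1 scale, the length penalty is subtracted without rescaling, and the reward is computed once per sentence rather than at each time step.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-1 smoothing for n-gram orders above 1."""
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
        match = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n > 1:  # add-1 smoothing keeps higher-order precisions non-zero
            match, total = match + 1, total + 1
        if match == 0:
            return 0.0
        log_precision += math.log(match / total) / max_n
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)


def bleu_reward(hyp, ref):
    """Smoothed BLEU minus the length penalty lp = |len(ref) - len(hyp)|."""
    return smoothed_bleu(hyp, ref) - abs(len(ref) - len(hyp))


reference = "a man rides a bike".split()
too_long = reference + ["very", "fast"]
```

The example pair shows the intended behaviour: a hypothesis that is longer than the reference is not affected by the brevity penalty, so only the explicit length penalty lowers its reward.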
Unsupervised reward (SAC unsuper). As discussed above, using automatic metrics as reward functions can lead to a number of issues, e.g. reward sparsity and overfitting towards a single reference. Moreover, designing a good reward can be challenging.
Inspired by recent work on the SAC algorithm in unsupervised RL (Eysenbach et al., 2018), we have designed an unsupervised reward that balances the quality and diversity in the model search space.
The pseudo-reward function we use is as follows:
$$r_z(x, a) = \log q_\delta(z|x, a) - \log p(z) \quad (11)$$
where $p(z)$ is a categorical uniform distribution for a latent variable $z$, and $q_\delta(z|x, a)$ is provided by a discriminator parametrised by a neural network. $z$ is randomly assigned to a word sampled at each step from the actor distribution. The discriminator is a Bag-of-Words model that takes as input the encoded source sequence and the word itself to predict its $z$.
More intuitively, every time a word appears in the translation hypothesis for a source sentence (within the Bag-of-Words formulation) it is randomly assigned a certain value of $z$. The more often this word appears in the sampled hypotheses (for a given source), the closer $q_\delta(z|x, a)$ will be to the uniform prior $p(z)$, and hence the closer the reward $r_z(x, a)$ will be to 0. Thus, frequent translations will be suppressed and the search for less frequent translations will be encouraged in order to receive a reward larger than 0.
Such a reward is less sparse than the traditional ones and is also dynamic which prevents memorising and overfitting.
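The behaviour of the pseudo-reward can be checked numerically with a short sketch. It assumes the reward is the log-probability that the discriminator assigns to the sampled latent value minus the log of a uniform prior; the function name and the argument `num_values` are hypothetical, and the value 4 follows the number of latent values reported in the Training section.

```python
import math


def pseudo_reward(discriminator_prob, num_values):
    """log q(z|x, a) - log p(z), with p(z) uniform over num_values values."""
    return math.log(discriminator_prob) - math.log(1.0 / num_values)


reward_at_prior = pseudo_reward(0.25, num_values=4)   # discriminator is at chance
reward_confident = pseudo_reward(0.90, num_values=4)  # z is easy to predict
reward_wrong = pseudo_reward(0.10, num_values=4)      # z is predicted below chance
```

When the discriminator cannot do better than the uniform prior, as is expected for words that were sampled often with different random assignments, the reward vanishes.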

Data
We perform experiments on the Multi30K dataset (Elliott et al., 2016) of image description translations and focus on the English-German (EN-DE) and English-French (EN-FR) (Elliott et al., 2017) language directions. Following best practices, we use sub-word segmentation (BPE (Sennrich et al., 2016)) only on the target side of the corpus. The dataset contains 29,000 instances for training, 1,014 for development, and 1,000 for testing. We use flickr2016 (2016), flickr2017 (2017) and coco2017 (COCO) test sets for model evaluation.
2016 is the most in-domain test set since it was taken from the same superset of descriptions as the training set, whereas 2017 and COCO are from different image description corpora and are thus considered out-of-domain.
For more fine-grained assessment of our models with unsupervised reward, we use the MLT test set (Lala and Specia, 2018;Lala et al., 2019), an annotated subset of the Multi30K corpus where each instance is a 3-tuple consisting of an ambiguous source word, its textual context (a source sentence), and its correct translation. The test set contains 1,298 sentences for English-French and 1,708 for English-German. It was designed to benchmark models in their ability to select the right lexical choice for words with multiple translations, especially when some of these translations are rarer.
Additionally, to allow for comparison with previous work, we evaluate on the IWSLT 2014 German-to-English dataset (Cettolo et al., 2012) from TED talks, which has been used as testbed in most work on RL for MT. The training set contains 153K sentence pairs. We followed the pre-processing procedure described in (Dai et al., 2018).
When compared to the IWSLT 2014 dataset, all three Multi30K test sets are more out-of-domain. This was found by analysing the perplexities of language models trained with the respective training data for each dataset (see Appendix A.4).

Training
We modify the original SAC architecture to adapt it to MT following best practices (Bahdanau et al., 2016) in the area. The functions $\pi_\phi$ and $Q_\theta$ are parameterised with neural networks: $\pi_\phi$ is an RNN seq2seq model with a 2-layer GRU encoder and a 2-layer Conditional GRU decoder (Sennrich et al., 2017) with attention. For SAC BLEU, $Q_\theta$ duplicates the structure of the former, but encodes the reference instead of the source sentence to mimic inputs to the actual BLEU function.
We first pretrain the actor and then pretrain the critic, before the actor-critic training. The pretraining of actors is done until convergence according to an early stopping criterion of 10 epochs with respect to the MLE loss. We have also found that our critics require much less pretraining (3-5 epochs as compared to 10-20 epochs in general for AC architectures with the MSE loss). Also, to prevent divergence during the actor-critic training, we continue performing MLE training using a smaller weight $\lambda_{mle}$. We set α to 0.01. Following Haarnoja et al. (2018a), we rescale the reward to the value inverse to α. Note that we did not find it useful to add to SAC the smoothing objective minimising the variance of Q-values (Bahdanau et al., 2016; Dai et al., 2018). We presume that double Q-learning significantly contributes to the stability of the network and that additional smoothing is not required.
For SAC unsuper, we parameterise $q_\delta$ by a 2-layer feed-forward neural network, which takes the source as encoded by the actor and $a_t$ and outputs $q_\delta(z|x, a)$. We set $z$ to take one of 4 values. 2 For this unsupervised setting, we do not train a Q-function. Instead, we operate in the oracle mode and, following Keneshloo et al. (2018), define true Q-value estimates and use them to update our actor. Details on training are given in Appendix A. We use pysimt (Caglayan et al., 2020) with PyTorch (Paszke et al., 2019) v1.4 for our experiments. 3

Evaluation
We use the standard set of MT evaluation metrics: BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and TER (Snover et al., 2006). We perform significance testing via bootstrap resampling using the Multeval tool (Clark et al., 2011). For the lexical translation task, we measure the Lexical Translation Accuracy (LTA) score (Lala et al., 2019). The score provides an average estimation of how accurately the words have been translated. For each ambiguous word, a score of +1 is awarded if the correct translation of the word is found in the output translation; a score of 0 is assigned if a known incorrect translation is found, or none of the candidate words are found in the translation. We also propose a metric that not only rewards correctly translated ambiguous words, but also penalises words translated with the wrong sense: the Ambiguous Lexical Index (ALI). ALI assigns -1 for wrong translations in the given context, whereas LTA simply does not reward them.
2 This hyperparameter is tuned on the validation set. It typically varies from 2 to several hundreds in the related work (Haarnoja et al., 2018b).
3 https://github.com/ImperialNLP/pysimt
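The two lexical scores described in this section can be sketched as follows. Several points are assumptions, since the text does not fix them: both scores are averaged over instances, a correct translation takes precedence when both a correct and a known incorrect candidate occur, and each instance is represented as a triple of output tokens, correct candidates and known incorrect candidates.

```python
def score_instance(output_tokens, correct, incorrect, wrong_score):
    """+1 for a correct candidate, wrong_score for a known wrong one, else 0."""
    tokens = set(output_tokens)
    if tokens & set(correct):
        return 1.0
    if tokens & set(incorrect):
        return wrong_score
    return 0.0


def lta(instances):
    """Lexical Translation Accuracy: wrong translations are not rewarded."""
    return sum(score_instance(o, c, w, 0.0) for o, c, w in instances) / len(instances)


def ali(instances):
    """Ambiguous Lexical Index: wrong translations are penalised with -1."""
    return sum(score_instance(o, c, w, -1.0) for o, c, w in instances) / len(instances)


toy_instances = [
    (["la", "colline"], ["colline"], ["pente"]),  # correct sense
    (["la", "pente"], ["colline"], ["pente"]),    # wrong sense
    (["la", "route"], ["colline"], ["pente"]),    # no candidate found
]
```

On the toy data the two metrics differ only on the second instance, where ALI subtracts a point and LTA adds nothing.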

Comparison to state-of-the-art
We first compare our SAC models against the MLE model (baseline) and ERAC (state-of-the-art, SOTA), both trained and tested on the Multi30K data (Table 1). Compared to SAC, ERAC differs in that it uses the on-policy setting (i.e., using samples collected in real time). Our SAC algorithm is an off-policy algorithm and uses samples from the memory to promote generalisation.
We clearly observe the tendency of ERAC models to perform better on the more in-domain 2016 data (+1.9 BLEU, +1.6 METEOR, -0.8 TER).
To further confirm our hypothesis that SAC reduces overfitting and performs better on the out-of-domain data, we train our models on the IWSLT 2014 train set and test on the out-of-domain Multi30K test sets (in the reverse direction, German into English, Table 2).
We observe similar performance for the complete set of outputs (including sentences with UNK tokens) for MLE and SAC BLEU. If the lines with UNK words are not taken into account, 5 we observe an improvement for the 2016 and 2017 test sets (+0.5 BLEU, +0.1 METEOR, -0.5 TER on average), and a much bigger improvement for the more out-of-domain COCO set (+2.5 BLEU, +0.3 METEOR, -2 TER on average). This confirms our hypothesis that SAC helps to reduce overfitting.
Finally, we compare SAC to the SOTA AC-based RL architectures, namely ERAC and AC, on the IWSLT 2014 set that is commonly used for this task. Compared to SAC, AC differs in that it does not use entropy regularisation. We also provide the performance for the popular MIXER algorithm. Results are shown in Table 3.
In terms of the general performance, our SAC performs on par with the MLE model. SAC BLEU even slightly lowers this score (-0.2 BLEU, -0.2 METEOR). We note that SAC BLEU results contain an increased count of UNK words as compared to MLE (+2.8%). This increased generation of UNK words due to the entropy regularisation is partially responsible for this similar performance. Another cause is that SAC does not overfit to the BLEU distribution of the target data.
5 The original corpus pre-processing pipeline that we followed to increase comparability does not include subword segmentation. We take the intersection of hypothesis sentences across Multi30K test setups that contain no generated UNK token with respect to the IWSLT 2014 vocabulary. Reference files may still contain the UNK token; we focus on the generated text here.
Table 3 (recoverable rows):
Model                                BLEU    METEOR  TER
MIXER (Ranzato et al., 2015)         20.73   -       -
AC (Bahdanau et al., 2016)           28.53   -       -
ERAC (w/ feed) (Dai et al., 2018)    29.36   -       -
ERAC (w/o feed) (Dai et al., 2018)

Translation of ambiguous words
To further investigate the effect of the unsupervised reward, we have evaluated SAC unsuper on the MLT dataset. Results are shown in Table 4. We calculate the scores under two conditions: All Cases takes into account all possible lexical translations, while Rare Cases considers only the instances where the gold-standard translation is not the most frequent translation for that particular ambiguous word. We observe that both SAC BLEU and SAC unsuper   Table 9 in Appendix). Moreover, SAC unsuper is particularly successful when evaluated on 2016 and 2017 and outperforms both MLE and SAC BLEU across setups. This demonstrates the potential of the unsupervised reward function for the cases when we have to choose between possible translations for an ambiguous word (i.e., better exploration of the search space). The BLEU reward, on the other hand, is more reliable when we have to adjust distributions to produce one single possible translation. Manual inspection of these SAC unsuper improvements confirmed their increased accuracy (see Table 5). For example, the ambiguous English source word 'hill' ('colline' in French) is translated as 'pente' ('slope') by both MLE and SAC BLEU, while only SAC unsuper produces the correct sentence: 'adolescent saute la colline 'hill' avec son vélo'.

Qualitative analysis
To get further insights into the general results, we also performed human evaluation of the outputs for MLE, SAC BLEU, and SAC unsuper using professional in-house expertise. This was done for COCO EN-FR and 2016 EN-DE as two sets with contrastive results in the lexical translation experiment.
For this human analysis, we randomly selected test samples (50 samples per language pair per group) with source words of different frequency in the training data: rare words (frequency 1) and other words (frequency ≥ 10). These other words are randomly chosen from the sentences that differ in their translation across setups. The resulting average frequency of those words is around 40 for both language pairs. A rank of quality (both fluency and adequacy together) is assigned by the human evaluator from 1 to 3, allowing ties. Following the common practice in MT, each system was then assigned a score which reflects how often it was judged to be better or equal to other systems (Bojar et al., 2017).
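One way to turn such rankings into a per-system score is sketched below. It assumes the score is the fraction of pairwise comparisons in which a system is ranked better than or equal to another system (rank 1 is best); the exact aggregation used in the evaluation is not spelled out in the text, and the system labels and ranks are toy values.

```python
def better_or_equal_score(rankings, system):
    """Fraction of pairwise comparisons where `system` is ranked <= the other."""
    wins, total = 0, 0
    for ranks in rankings:
        for other, other_rank in ranks.items():
            if other == system:
                continue
            total += 1
            if ranks[system] <= other_rank:
                wins += 1
    return wins / total if total else 0.0


# One dict per evaluated sample; ties are allowed.
toy_rankings = [
    {"MLE": 1, "SAC_BLEU": 1, "SAC_unsuper": 2},
    {"MLE": 3, "SAC_BLEU": 1, "SAC_unsuper": 2},
]
```

Counting ties as wins for both systems means the scores of different systems do not need to sum to one.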
Results are in Table 6. We observe a tendency of SAC BLEU to do well on the translation of rare source words, but not so well on the translation of words in the middle frequency range (this observation is confirmed by the analysis of the frequency of output words, see Appendix A.5, Table 10). Our unsupervised reward tends to increase the performance on more frequent words ('Other' in Table 6) by promoting their less common translations in the distribution, hence the better translations for ambiguous words in our previous experiment. These ambiguous words are quite frequent; they potentially have multiple possible translations but only one correct translation in a given context.

Conclusions
We propose and reformulate SAC reinforcement learning approaches to help machine translation through better exploration and less reliance on the reward function. To provide a good trade-off between exploration and quality, we devise two reward methods, one supervised and one dynamic and unsupervised. The maximum entropy off-policy SAC algorithm mitigates the overfitting problem when evaluated in the out-of-domain space; both rewards introduced in our SAC architecture can achieve better quality for lexical translation of ambiguous words, particularly the rare senses of words. The formulation of the unsupervised reward and its potential to influence translation quality open perspectives for future studies on the subject. We leave the exploration of how those supervised and unsupervised rewards could be combined to improve MT for future work.

A.1 Hyperparameters
For the NMT RNN agent, the dimensions of embeddings and GRU hidden states are set to 200 and 320, respectively. The decoder's input and output embeddings are shared (Press and Wolf, 2017). We use Adam (Kingma and Ba, 2014) as the optimiser and set the learning rate and mini-batch size to 0.0004 and 64, respectively. A weight decay of 1e−5 is applied for regularisation. We clip the gradients if the norm of the full parameter vector exceeds 1 (Pascanu et al., 2013). The four Q-networks are identical to the agent (see Table 7). For the unsupervised reward setting, we use a two-layer feed-forward neural network (both dimensionalities are equal to 100). We again use Adam as the optimiser and set the learning rate and mini-batch size to 0.0001 and 64, respectively.
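Gradient clipping by the global norm can be sketched on a flat list of gradient values. This is a generic illustration of the technique with a threshold of 1, not the training code; in practice a deep learning framework would apply the same rescaling to all parameter tensors.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # norm 5 is rescaled
unchanged = clip_by_global_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 is kept
```

Rescaling the full vector preserves the direction of the update and only limits its size.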

Hyper-parameters table (recoverable column header: Pre-train Critic).
[...] the IWSLT 2014 training set. 7 Table 8 [...] and COCO are more distant from the train partition than the 2016 test set.
7 We train Transformer language models using the fairseq toolkit (Ott et al., 2019).

A.5 Analysis of distributions
We argue that the improvement over MLE can be partially attributed to a better handling of less frequent words. It has been shown that rare words tend to be under-represented in NMT (Koehn and Knowles, 2017; Shen et al., 2016). RL training with entropy regularisation might mitigate this issue due to a better exploration of the action space. To illustrate this point, we compute the training frequency of the words generated by the NMT systems for the sentences where an improvement over MLE is observed. Figure 1 shows the training frequency percentiles for MLE and SAC BLEU English-French translations of the COCO test set. Reference frequencies are also provided for comparison. We observe that although both MLE and SAC outputs contain more frequent words than the reference, this tendency is less pronounced for SAC. We relate this observation to the fact that our SAC outperforms MLE for the ambiguous word translation (Table 4), where the most frequent translation is not always the correct one.
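The frequency analysis can be sketched as a lookup of every output word in the training counts followed by a percentile computation. The tokenised toy data, the function names and the nearest-rank percentile definition are assumptions for illustration; the text does not say which percentile definition was used for Figure 1.

```python
import math
from collections import Counter


def training_frequencies(train_sentences, output_sentences):
    """Sorted training-set counts of every token in the system output."""
    counts = Counter(tok for sent in train_sentences for tok in sent)
    return sorted(counts[tok] for sent in output_sentences for tok in sent)


def percentile(sorted_values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    rank = max(1, math.ceil(q / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


toy_train = [["a", "b", "a"], ["a", "c"]]
toy_output = [["a", "c", "d"]]
toy_freqs = training_frequencies(toy_train, toy_output)
```

Comparing such percentiles between systems and the reference shows whether a system leans towards words that were frequent in training.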