Prevent the Language Model from being Overconfident in Neural Machine Translation

The Neural Machine Translation (NMT) model is essentially a joint language model conditioned on both the source sentence and the partial translation. Therefore, the NMT model naturally involves the mechanism of the Language Model (LM), which predicts the next token based only on the partial translation. Despite its success, NMT still suffers from the hallucination problem, generating fluent but inadequate translations. The main reason is that NMT pays excessive attention to the partial translation while neglecting the source sentence to some extent, namely overconfidence of the LM. Accordingly, we define the Margin between the NMT and the LM, calculated by subtracting the predicted probability of the LM from that of the NMT model for each token. The Margin is negatively correlated with the overconfidence degree of the LM. Based on this property, we propose a Margin-based Token-level Objective (MTO) and a Margin-based Sentence-level Objective (MSO) to maximize the Margin, preventing the LM from being overconfident. Experiments on the WMT14 English-to-German, WMT19 Chinese-to-English, and WMT14 English-to-French translation tasks demonstrate the effectiveness of our approach, with 1.36, 1.50, and 0.63 BLEU improvements, respectively, over the Transformer baseline. The human evaluation further verifies that our approaches improve translation adequacy as well as fluency.


Introduction
Neural Machine Translation (NMT) has achieved great success in recent years (Sutskever et al., 2014; Luong et al., 2015; Vaswani et al., 2017; Meng and Zhang, 2019; Zhang et al., 2019a; Yan et al., 2020b), generating accurate and fluent translations by modeling the next word conditioned on both the source sentence and the partial translation. However, NMT faces the hallucination problem, i.e., translations are fluent but inadequate with respect to the source sentences. One important reason is that the NMT model pays excessive attention to the partial translation to ensure fluency while failing to translate some segments of the source sentence (Weng et al., 2020b), which is in fact overconfidence of the Language Model (LM). In the rest of this paper, "the LM" refers to the LM mechanism involved in NMT.

(* Equal contribution. This work was done when Mengqi Miao was interning at Pattern Recognition Center, WeChat AI, Tencent Inc, China. † Corresponding author. Code is available at https://github.com/Mlair77/nmt_adequacy)
Many recent studies attempt to deal with the inadequacy problem of NMT from two main aspects. One is to improve the architecture of NMT, such as adding a coverage vector to track the attention history (Tu et al., 2016), enhancing the cross-attention module (Meng et al., 2016; Weng et al., 2020b), and dividing the source sentence into past and future parts (Zheng et al., 2019). The other aims to propose a heuristic adequacy metric or objective based on the output of NMT: Tu et al. (2017) and Kong et al. (2019) enhance the model's reconstruction ability and increase the coverage ratio of the source sentences by translations, respectively. Although some studies (Tu et al., 2017; Kong et al., 2019; Weng et al., 2020b) point out that the lack of adequacy is due to the overconfidence of the LM, they unfortunately do not propose effective solutions to the overconfidence problem.
From the perspective of preventing the overconfidence of the LM, we first define an indicator of the overconfidence degree of the LM, called the Margin between the NMT and the LM, obtained by subtracting the predicted probability of the LM from that of the NMT model for each token. A small Margin implies that the NMT model might concentrate on the partial translation and degrade into the LM, i.e., the LM is overconfident. Accordingly, we propose a Margin-based Token-level Objective (MTO) to maximize the Margin. Furthermore, we observe that if target sentences in the training data contain many words with negative Margin, they often do not correspond to their source sentences; such data are harmful to model performance. Therefore, based on the MTO, we further propose a Margin-based Sentence-level Objective (MSO) that adds a dynamic weight function to alleviate the negative effect of this "dirty data".
We validate the effectiveness and superiority of our approaches on the Transformer (Vaswani et al., 2017), and conduct experiments on the large-scale WMT14 English-to-German, WMT19 Chinese-to-English, and WMT14 English-to-French translation tasks. Our contributions are:
• We explore the connection between inadequate translation and the overconfidence of the LM in NMT, and propose an indicator of the overconfidence degree, i.e., the Margin between the NMT and the LM.
• To prevent the LM from being overconfident, we propose two effective optimization objectives to maximize the Margin, i.e., the Margin-based Token-level Objective (MTO) and the Margin-based Sentence-level Objective (MSO).
• Experiments on WMT14 English-to-German, WMT19 Chinese-to-English, and WMT14 English-to-French show that our approaches bring significant improvements of +1.36, +1.50, and +0.63 BLEU points, respectively. Additionally, the human evaluation verifies that our approaches improve both translation adequacy and fluency.

Background
Given a source sentence x = {x_1, x_2, ..., x_N}, the NMT model predicts the probability of a target sentence y = {y_1, y_2, ..., y_T} word by word:

p(y|x) = ∏_{t=1}^{T} p(y_t | y_<t, x), (1)

where y_<t = {y_1, y_2, ..., y_{t−1}} is the partial translation before y_t. From Eq. 1, the source sentence x and the partial translation y_<t are considered at the same time, suggesting that the NMT model is essentially a joint language model with the LM instinctively involved. Based on the encoder-decoder architecture, the encoder of NMT maps the input sentence x to hidden states. At time step t, the decoder of NMT employs the output of the encoder and y_<t to predict y_t. The training objective of NMT is to minimize the negative log-likelihood, also known as the cross-entropy loss:

L_ce^NMT = − Σ_{t=1}^{T} log p(y_t | y_<t, x). (2)

The LM measures the probability of a target sentence similarly to NMT but without knowledge of the source sentence x:

p(y) = ∏_{t=1}^{T} p(y_t | y_<t). (3)

The LM can be regarded as the part of the NMT decoder responsible for fluency, which only takes y_<t as input. The training objective of the LM is almost the same as that of NMT, except for the source sentence x:

L_ce^LM = − Σ_{t=1}^{T} log p(y_t | y_<t). (4)

The NMT model predicts the next word y_t according to the source sentence x while ensuring that y_t is fluent with the partial translation y_<t. However, when NMT pays excessive attention to translation fluency, some source segments may be neglected, leading to the inadequacy problem. This is exactly what we aim to address in this paper.
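As a concrete toy illustration of Eqs. 2 and 4: the sentence-level cross-entropy is the sum of negative log-probabilities of the gold tokens, and the only difference between the NMT loss and the LM loss is whether each per-token probability is conditioned on the source sentence. The probabilities below are hypothetical, not from a real model.

```python
import math

def cross_entropy(token_probs):
    """Negative log-likelihood of a target sentence (Eq. 2 / Eq. 4),
    given the probability the model assigns to each gold token:
    p(y_t | y_<t, x) for the NMT model, or p(y_t | y_<t) for the LM."""
    return -sum(math.log(p) for p in token_probs)

# Toy per-token probabilities for a 4-token target sentence.
nmt_probs = [0.9, 0.7, 0.8, 0.6]  # hypothetical p_NMT(y_t | y_<t, x)
loss_nmt = cross_entropy(nmt_probs)
```

The same function scores both models; only the source of the probabilities differs.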

The Approach
In this section, we first define the Margin between the NMT and the LM (Section 3.1), which reflects the overconfidence degree of the LM. Then we put forward the token-level (Section 3.2) and sentence-level (Section 3.3) optimization objectives to maximize the Margin. Finally, we elaborate our two-stage training strategy (Section 3.4).

Margin between the NMT and the LM
When the NMT model excessively focuses on the partial translation, i.e., the LM is overconfident, the NMT model degrades into the LM, resulting in hallucinated translations. To prevent this, we expect the NMT model to outperform the LM as much as possible in predicting golden tokens. Consequently, we define the Margin between the NMT and the LM at the t-th time step as the difference of their predicted probabilities:

∆(t) = p_NMT(y_t | y_<t, x) − p_LM(y_t | y_<t), (5)

where p_NMT denotes the predicted probability of the NMT model, i.e., p(y_t|y_<t, x), and p_LM denotes that of the LM, i.e., p(y_t|y_<t).
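Eq. 5 can be sketched in code as follows, again with toy probabilities rather than real model outputs:

```python
def margin(p_nmt, p_lm):
    """Token-level Margin (Eq. 5): Delta(t) = p_NMT(t) - p_LM(t),
    computed for each gold token of one target sentence."""
    return [pn - pl for pn, pl in zip(p_nmt, p_lm)]

# A token the NMT model explains far better than the LM has a large
# Margin; a token the LM predicts just as well (or better) has a
# Margin near zero or below, signaling possible overconfidence.
deltas = margin([0.9, 0.41, 0.2], [0.5, 0.40, 0.6])
```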
The Margin ∆(t) is negatively correlated with the overconfidence degree of the LM, and different values of the Margin indicate different cases:
• If ∆(t) is large, the NMT model is clearly better than the LM, and y_t is strongly related to the source sentence x. Hence the LM is not overconfident.
• If ∆(t) is medium, the LM may be slightly overconfident and the NMT model has the potential to be enhanced.
• If ∆(t) is small, the NMT model might degrade into the LM and fail to correctly translate the source sentence, i.e., the LM is overconfident.
Note that sometimes the model needs to focus more on the partial translation, e.g., when the word to be predicted is a determiner in the target language. In this case, although a small ∆(t) does not indicate that the LM is overconfident, enlarging ∆(t) can still enhance the NMT model.

Margin-based Token-level Objective
Based on the Margin, we first define the Margin loss L_M and then fuse it into the cross-entropy loss to obtain the Margin-based Token-level Objective (MTO). Formally, we define the Margin loss L_M to maximize the Margin as follows:

L_M = Σ_{t=1}^{T} (1 − p_NMT(t)) · M(∆(t)), (6)

where we abbreviate p_NMT(y_t|y_<t, x) as p_NMT(t), and M(∆(t)) is a monotonically decreasing function of ∆(t), namely the Margin function (e.g., 1 − ∆(t)). Moreover, when two tokens have the same ∆(t) but different p_NMT(t), the token with the smaller p_NMT(t) more urgently needs optimization, so the weight of M(∆(t)) should be larger for it. Therefore, as the weight of M(∆(t)), 1 − p_NMT(t) enables the model to treat tokens wisely.
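A minimal sketch of the Margin loss in Eq. 6, using the Linear Margin function M(∆) = 1 − ∆ for concreteness (toy probabilities):

```python
def margin_loss(p_nmt, p_lm, M=lambda d: 1.0 - d):
    """Margin loss (Eq. 6): L_M = sum_t (1 - p_NMT(t)) * M(Delta(t)).
    The (1 - p_NMT(t)) weight makes tokens the NMT model is still
    unsure about contribute more to the loss."""
    return sum((1.0 - pn) * M(pn - pl) for pn, pl in zip(p_nmt, p_lm))

# A confidently predicted token (p_NMT = 1.0) contributes nothing;
# an unsure token with a negative Margin contributes the most.
```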
Variations of M(∆). We abbreviate the Margin function M(∆(t)) as M(∆) hereafter. A simple and intuitive definition is the Linear function M(∆) = 1 − ∆, which has the same gradient for all ∆. However, as illustrated in Section 3.1, different values of ∆ have quite different meanings and need to be treated differently. Therefore, we propose three non-linear Margin functions M(∆) (Cube, Quintic, and Log), e.g.: • Cube: M(∆) = (1 − ∆^3)/2.
As shown in Figure 1, the four variations have quite different slopes. Specifically, the three non-linear functions are more stable around ∆ = 0 (e.g., ∆ ∈ [−0.5, 0.5]) than Linear, especially Quintic. We report the performance of the four M(∆) and analyze why the three non-linear M(∆) perform better than Linear in Section 5.4.
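The Linear and Cube functions below follow the definitions in the text; the Quintic form is an assumption, written by analogy with Cube (the exact Quintic and Log definitions are not reproduced here). The comparison at the end shows why the higher-degree variants are flatter, i.e., more stable, around ∆ = 0:

```python
def m_linear(d):
    """Linear Margin function: M(Delta) = 1 - Delta."""
    return 1.0 - d

def m_cube(d):
    """Cube Margin function: M(Delta) = (1 - Delta**3) / 2."""
    return (1.0 - d ** 3) / 2.0

def m_quintic(d):
    """Assumed Quintic variant, by analogy with Cube: (1 - Delta**5) / 2."""
    return (1.0 - d ** 5) / 2.0

# Change of M over Delta in [-0.1, 0.1]: the higher the polynomial
# degree, the flatter the function near 0 (smaller change).
change = {f.__name__: f(-0.1) - f(0.1) for f in (m_linear, m_cube, m_quintic)}
```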
Finally, based on L_M, we propose the Margin-based Token-level Objective (MTO):

L_T = L_ce^NMT + λ_M · L_M, (7)

where L_ce^NMT is the cross-entropy loss of the NMT model defined in Eq. 2 and λ_M is the hyperparameter for the Margin loss L_M.

Figure 2: A parallel-sentence example. Target: "How did your mother succeed in keeping the peace between these two very different men?" Expert translation: "Although they are twins, they are quite different in character." The parallel sentences, i.e., the source and target sentences, are sampled from the WMT19 Chinese-to-English training dataset. We also list an expert translation of the source sentence. The words in bold red have negative Margin. This target sentence has more than 50% of its tokens with negative Margin, and these tokens are almost irrelevant to the source sentence. Apparently, the target sentence is a hallucination and will harm the model performance.

Margin-based Sentence-level Objective
Furthermore, by analyzing the Margin distribution of target sentences, we observe that target sentences in the training data with many negative-Margin tokens are almost always "hallucinations" of the source sentences (i.e., dirty data) and thus harm the model performance. Therefore, based on MTO, we further propose the Margin-based Sentence-level Objective (MSO) to address this issue.

Compared with the LM, the NMT model predicts the next word with more prior knowledge (i.e., the source sentence). Therefore, it is intuitive that when predicting y_t, the NMT model should predict more accurately than the LM:

p_NMT(y_t | y_<t, x) > p_LM(y_t | y_<t). (8)

The above inequality is equivalent to ∆(t) > 0; the larger ∆(t) is, the more the NMT model exceeds the LM. However, analyzing the Margin distribution reveals many tokens with negative Margin. We conjecture the reason is that the target sentence does not correspond to the source sentence in the training corpus, i.e., the target sentence is a hallucination. Indeed, we observe that if a large proportion of tokens in a target sentence have negative Margin (e.g., 50%), the sentence probably does not correspond to the source sentence, such as the case in Figure 2. These "dirty" data harm the performance of the NMT model. To measure the "dirty" degree of the data, we define the Sentence-level Negative Margin Ratio of a parallel sentence pair (x, y) as follows:

R(x, y) = #{y_t ∈ y : ∆(t) < 0} / #{y_t : y_t ∈ y}, (9)

where #{y_t ∈ y : ∆(t) < 0} denotes the number of tokens with negative ∆(t) in y, and #{y_t : y_t ∈ y} is the length of the target sentence y. When R(x, y) is larger than a threshold k (e.g., k = 50%), the target sentence may be severely inadequate, or even completely unrelated to the source sentence, as shown in Figure 2. To eliminate the impact of these seriously inadequate sentences, we ignore their loss during training via the Margin-based Sentence-level Objective (MSO):

L_S = I_{R(x,y)<k} · L_T, (10)

where I_{R(x,y)<k} is a dynamic sentence-level weight function.
The indicator function I_{R(x,y)<k} equals 1 if R(x, y) < k and 0 otherwise, where k is a hyperparameter, and L_T is the MTO defined in Eq. 7. I_{R(x,y)<k} is dynamic during training: as the model gets better, its ability to distinguish hallucinations improves, and thus I_{R(x,y)<k} becomes more accurate. We analyze the changes of I_{R(x,y)<k} in Section 5.4.
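Eqs. 9 and 10 can be sketched as follows (toy probabilities; `k` as in the text): a sentence whose negative-Margin ratio reaches the threshold is simply dropped from the loss.

```python
def negative_margin_ratio(p_nmt, p_lm):
    """Sentence-level Negative Margin Ratio (Eq. 9): the fraction of
    target tokens whose Margin Delta(t) is negative."""
    deltas = [pn - pl for pn, pl in zip(p_nmt, p_lm)]
    return sum(1 for d in deltas if d < 0) / len(deltas)

def mso_loss(token_level_loss, p_nmt, p_lm, k=0.5):
    """MSO (Eq. 10): L_S = I_{R(x,y) < k} * L_T. The indicator zeroes
    out the loss of sentences judged to be hallucinations."""
    return token_level_loss if negative_margin_ratio(p_nmt, p_lm) < k else 0.0
```

Because the indicator is recomputed from the current model's probabilities, the set of sentences it filters out changes as training proceeds, matching the "dynamic weight" behavior described above.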

Two-stage Training
We elaborate our two-stage training in this section: 1) jointly pretraining an NMT model and an auxiliary LM, and 2) finetuning the NMT model.
Jointly Pretraining. The language model mechanism in NMT cannot be directly evaluated, so we train an auxiliary LM to represent it. We pretrain the two models together using a fused loss function:

L_P = L_ce^NMT + λ_LM · L_ce^LM, (11)

where L_ce^NMT and L_ce^LM are the cross-entropy loss functions of the NMT model and the LM, defined in Eq. 2 and Eq. 4, respectively, and λ_LM is a hyperparameter. Specifically, we jointly train the two models by sharing their decoders' embedding layers and their pre-softmax linear transformation layers (Vaswani et al., 2017). There are two reasons for joint training: (1) making the auxiliary LM as consistent as possible with the language model mechanism in NMT; (2) avoiding abundant extra parameters.
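The joint pretraining loss above reduces to a weighted sum of the two cross-entropies over the same gold tokens; a minimal sketch with toy per-token probabilities (in the actual setup the two models also share the target embedding and pre-softmax layers):

```python
import math

def joint_pretrain_loss(nmt_probs, lm_probs, lam_lm=0.01):
    """Joint pretraining loss: L_P = L_ce^NMT + lambda_LM * L_ce^LM,
    with both cross-entropies computed over the same gold tokens.
    lam_lm = 0.01 is the value used for all three tasks in the paper."""
    ce = lambda probs: -sum(math.log(p) for p in probs)
    return ce(nmt_probs) + lam_lm * ce(lm_probs)
```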
Finetuning. We finetune the NMT model by minimizing the MTO (L_T in Eq. 7) and the MSO (L_S in Eq. 10). Note that the LM is not involved at the inference stage.
Experiments

Datasets. For En→De, we use 4.5M sentence pairs of training data. Following the setting in Vaswani et al. (2017), we use newstest2013 as the validation set and newstest2014 as the test set, which contain 3000 and 3003 sentences, respectively. For En→Fr, the training dataset contains about 36M sentence pairs, and we use newstest2013 (3000 sentences) as the validation set and newstest2014 (3003 sentences) as the test set. For Zh→En, we use 20.5M sentence pairs of training data, with newstest2018 as the validation set and newstest2019 as the test set, which contain 3981 and 2000 sentences, respectively. For Zh→En, the number of merge operations in byte pair encoding (BPE) (Sennrich et al., 2016a) is set to 32K for both source and target languages. For En→De and En→Fr, we use a shared vocabulary generated by 32K BPE operations.
Evaluation. We measure case-sensitive BLEU scores using multi-bleu.perl for En→De and En→Fr. For Zh→En, case-sensitive BLEU scores are calculated with the Moses mteval-v13a.pl script. Moreover, we use paired bootstrap resampling (Koehn, 2004) for significance testing. We select the model that performs best on the validation sets and report its performance on the test sets.
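Paired bootstrap resampling (Koehn, 2004) can be sketched as below. This is a simplified stand-in that resamples sentence-level proxy scores; the actual test recomputes corpus BLEU on each resampled test set:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Estimate how often system A beats system B when the test set is
    resampled with replacement. A fraction near 1.0 suggests A's win
    is significant (roughly, p ~ 1 - fraction)."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample test set
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples
```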
Model and Hyperparameters. We conduct experiments based on the Transformer (Vaswani et al., 2017) and implement our approaches with the open-source toolkit OpenNMT-py (Klein et al., 2017). Following the Transformer-Base setting in Vaswani et al. (2017), we set the hidden size to 512 and the number of encoder/decoder layers to 6. All three tasks are trained with 8 NVIDIA V100 GPUs, with a batch size of 4096 tokens per GPU. The beam size is 5 and the length penalty is 0.6. The Adam optimizer (Kingma and Ba, 2014) is used for all models. The LM architecture is the decoder of the Transformer excluding the cross-attention layers, sharing the embedding layer and the pre-softmax linear transformation with the NMT model. For En→De, Zh→En, and En→Fr, the number of training steps is 150K for the joint pretraining stage and 150K for finetuning. During pretraining, we set λ_LM to 0.01 for all three tasks. Experimental results shown in Appendix A indicate that the LM has converged after pretraining for all three tasks. During finetuning, the Margin function M(∆) in Section 3.2 is set to Quintic; we analyze the four M(∆) in Section 5.4. λ_M in Eq. 7 is set to 5, 8, and 8 on En→De, En→Fr, and Zh→En, respectively. For MSO, the threshold k in Eq. 10 is set to 30% for En→De and Zh→En, and 40% for En→Fr. The two hyperparameters (i.e., λ_M and k) are searched on the validation sets; the selection details are shown in Appendix B. The baseline model (i.e., the vanilla Transformer) is trained for 300K steps for En→De, En→Fr, and Zh→En. Moreover, we use a joint training model as our secondary baseline, namely NMT+LM, by jointly training the NMT model and the LM throughout the training stage with 300K steps. The training steps of all models are consistent, so the experimental results are strictly comparable.

Results and Analysis
We first evaluate the main performance of our approaches (Sections 5.1 and 5.2). Then, the human evaluation further confirms the improvements in translation adequacy and fluency (Section 5.3). Finally, we analyze the positive impact of our models on the Margin distribution and explore how each component of our method works (Section 5.4).

Results on En→De
The results on WMT14 English-to-German (En→De) are summarized in Table 1. We list the result from Vaswani et al. (2017) and several competitive NMT systems using various methods, such as the Minimum Risk Training (MRT) objective (Shen et al., 2016), Simple Fusion of NMT and LM (Stahlberg et al., 2018), optimizing adequacy metrics (Kong et al., 2019; Feng et al., 2019), and improving the Transformer architecture (Zheng et al., 2019; Yang et al., 2019; Weng et al., 2020b; Yan et al., 2020a).

Table 1: Case-sensitive BLEU scores (%) of existing NMT systems on the test set of WMT14 En→De:
System | En→De
Transformer (Vaswani et al., 2017) | 27.3
MRT* (Shen et al., 2016) | 27.71
Simple Fusion** (Stahlberg et al., 2018) | 27.88
Localness | 28.11
Context-Aware | 28.26
AOL (Kong et al., 2019) | 28.01
Eval. Module (Feng et al., 2019) | 27.55
Past&Future (Zheng et al., 2019) | 28.10
Dual (Yan et al., 2020a) | 27.86
Multi-Task (Weng et al., 2020b) | 28

Compared with the baseline, NMT+LM yields a +0.75 BLEU improvement. Based on NMT+LM, our MTO achieves a further improvement of +0.50 BLEU, indicating that preventing the LM from being overconfident significantly enhances model performance. Moreover, MSO performs better than MTO by +0.11 BLEU, which implies that the "dirty data" in the training dataset indeed harm model performance, and the dynamic weight function I_{R(x,y)<k} in Eq. 10 reduces this negative impact. In conclusion, our approaches improve up to +1.36 BLEU on En→De compared with the Transformer baseline and substantially outperform the existing NMT systems. The results demonstrate the effectiveness and superiority of our approaches.

Table 2: Case-sensitive BLEU scores (%) on the test sets of WMT14 En→Fr and WMT19 Zh→En. ↑ denotes the improvement compared with the NMT baseline (i.e., Transformer). "†": significantly better than NMT (p<0.01). "‡": significantly better than the joint model NMT+LM (p<0.01). * denotes results taken from the cited paper.

Results on En→Fr and Zh→En
The results on WMT14 English-to-French (En→Fr) and WMT19 Chinese-to-English (Zh→En) are shown in Table 2. We also list the result of Vaswani et al. (2017) and our reimplemented Transformer as baselines. On En→Fr, our reimplemented result is higher than that of Vaswani et al. (2017), since we train for 300K update steps while Vaswani et al. (2017) only train for 100K; many studies obtain results similar to ours (e.g., 41.1 BLEU in Ott et al. (2019)). Compared with the baseline, NMT+LM yields +0.07 and +0.15 BLEU improvements on En→Fr and Zh→En, respectively. The improvement of NMT+LM on En→De in Table 1 (i.e., +0.75) is greater than on these two datasets. We conjecture the reason is that the training data for En→De is much smaller than that for En→Fr and Zh→En, so NMT+LM is more likely to improve model performance on En→De.
Compared with NMT+LM, our MTO achieves further improvements with +0.42 and +1.04 BLEU scores on En→Fr and Zh→En, respectively, which demonstrates the performance improvement is mainly due to our Margin-based objective rather than joint training. Moreover, based on MTO, our MSO further yields +0.14 and +0.31 BLEU improvements. In summary, our approaches improve up to +0.63 and +1.50 BLEU scores on En→Fr and Zh→En compared with the baselines, respectively, which demonstrates the effectiveness and generalizability of our approaches.

Human Evaluation
We conduct a human evaluation of translations in terms of adequacy and fluency. First, we randomly sample 100 sentences from the test set of WMT19 Zh→En. Then we invite three annotators to evaluate translation adequacy and fluency on a five-point scale (1–5). For adequacy, "1" means totally irrelevant to the source sentence, and "5" means semantically equal to the source sentence. For fluency, "1" means not fluent and incomprehensible; "5" means very "native". Finally, we take the average of the scores from the three annotators as the final score.
The results of the baseline and our approaches are shown in Table 3. Compared with the NMT baseline, NMT+LM, MTO, and MSO improve adequacy by 0.08, 0.22, and 0.37, respectively. Most of the improvements come from our Margin-based methods MTO and MSO, and MSO performs the best. For fluency, NMT+LM achieves a 0.2 improvement over NMT. Based on NMT+LM, MTO and MSO yield further improvements of 0.01 and 0.05, respectively. The human evaluation indicates that our MTO and MSO approaches remarkably improve translation adequacy and slightly enhance translation fluency.

Figure 3: Compared with NMT+LM, both MTO and MSO effectively reduce the percentage of ∆ < 0 and improve the average ∆.

Analysis
Margin between the NMT and the LM. First, we analyze the distribution of the Margin between the NMT and the LM (i.e., ∆ in Eq. 5). As shown in Figure 3, for the joint training model NMT+LM, although most Margins are positive, there are still many tokens with negative Margin and a large number of Margins around 0. This indicates that the LM is probably overconfident for many tokens, and addressing the overconfidence problem is meaningful for NMT. By comparison, the Margin distribution of MSO is dramatically different from that of NMT+LM: tokens with Margin around 0 are significantly reduced, and tokens with Margin in [0.75, 1.0] are apparently increased. More precisely, we list the percentage of tokens with negative Margin and the average Margin for each model in Table 4. Compared with NMT+LM, MTO and MSO reduce the percentage of negative Margins by 2.28 and 1.56 points, respectively. MSO performs slightly worse than MTO here because MSO neglects hallucinations during training: since hallucinations contain many tokens with negative Margin, the ability of MSO to reduce the proportion of ∆ < 0 is weakened. We further analyze the effects of MTO and MSO on the average Margin: both improve it by 33% (from 0.33 to 0.44). In conclusion, both MTO and MSO indeed increase the Margin between the NMT and the LM.
Variations of M(∆). We compare the performance of the four Margin functions M(∆) defined in Section 3.2. We list the BLEU scores of the Transformer baseline, NMT+LM, and our MTO approach with the four M(∆) in Table 5. All four variations bring improvements over NMT and NMT+LM. The results of Log with different α are similar to Linear, but far below Cube and Quintic, and Quintic performs the best among all four variations. We speculate the reason is that ∆ ∈ [−0.5, 0.5] is the main range for improvement, and Quintic updates more carefully in this range (i.e., with smaller slopes), as shown in Figure 1.

Table 6: Case-sensitive BLEU scores (%) on the Zh→En validation set and test set of MTO with (w/) and without (w/o) the weight 1 − p_NMT(t).
Effects of the Weight of M(∆). In MTO, we propose the weight 1 − p_NMT(t) of the Margin function M(∆) in Eq. 6. To validate its importance, we remove the weight, so that the Margin loss degrades to L_M = Σ_{t=1}^{T} M(∆(t)). The results are listed in Table 6. Compared with NMT+LM, MTO without the weight performs worse, with 0.25 and 0.05 BLEU decreases on the validation set and test set, respectively. Compared with MTO with the weight, it decreases by 0.73 and 1.09 BLEU on the validation set and test set, respectively. This demonstrates that the weight 1 − p_NMT(t) is indispensable for our approach.
Changes of I_{R(x,y)<k} During Training. In MSO, we propose the dynamic weight function I_{R(x,y)<k} in Eq. 10. Figure 4 shows the changes of I_{R(x,y)<k} in MSO and the BLEU scores of MSO and MTO during finetuning. As training continues, the model becomes more competent, and the proportion of sentences judged to be "dirty data" by the model increases rapidly at first and then stabilizes.

Case Study. To better illustrate the translation quality of our approach, we show several translation examples in Appendix C. Our approach grasps more segments of the source sentences, which are mistranslated or neglected by the Transformer.

Related Work
Translation Adequacy of NMT. NMT has long suffered from the hallucination and inadequacy problems (Tu et al., 2016; Müller et al., 2020; Wang and Sennrich, 2020; Lee et al., 2019). Many studies improve the architecture of NMT to alleviate the inadequacy issue, including tracking translation adequacy with coverage vectors (Tu et al., 2016; Mi et al., 2016), modeling a global representation of the source side (Weng et al., 2020a), dividing the source sentence into past and future parts (Zheng et al., 2019), and multi-task learning to improve the encoder and the cross-attention modules in the decoder (Meng et al., 2016; Weng et al., 2020b). These methods inductively increase translation adequacy, while our approaches directly maximize the Margin between the NMT and the LM to prevent the LM from being overconfident. Other studies enhance translation adequacy with adequacy metrics or additional optimization objectives. Tu et al. (2017) minimize the difference between the original source sentence and a source sentence reconstructed from the NMT output. Kong et al. (2019) propose a coverage ratio of the source sentence by the model translation. Feng et al. (2019) evaluate the fluency and adequacy of translations with an evaluation module. However, the metrics or objectives in the above approaches may not wholly represent adequacy. In contrast, our approaches are derived from the criteria of the NMT model and the LM, and are thus credible.
Language Model Augmented NMT. Language models are often used to provide additional information to improve NMT. For low-resource tasks, an LM trained on extra monolingual data can rerank translations by fusion (Gülçehre et al., 2015; Sriram et al., 2017; Stahlberg et al., 2018), enhance NMT's representations (Clinchant et al., 2019; Zhu et al., 2020), and provide prior knowledge for NMT (Baziotis et al., 2020). For data augmentation, LMs are used to replace words in sentences (Kobayashi, 2018; Gao et al., 2019). In contrast, we focus on the Margin between the NMT and the LM, and no additional data is required. Stahlberg et al. (2018) propose the Simple Fusion approach to model the difference between NMT and LM. However, Simple Fusion is trained to optimize the residual probability, positively correlated with p_NMT/p_LM, which is hard to optimize, and the LM is still required at inference, largely slowing down the inference speed.
Data Selection in NMT. Data selection and data filter methods have been widely used in NMT. To balance data domains or enhance the data quality generated by back-translation (Sennrich et al., 2016b), many approaches have been proposed, such as utilizing language models (Moore and Lewis, 2010;van der Wees et al., 2017;Zhang et al., 2020), translation models (Junczys-Dowmunt, 2018;Wang et al., 2019a), and curriculum learning (Zhang et al., 2019b;Wang et al., 2019b). Different from the above methods, our MSO dynamically combines language models with translation models for data selection during training, making full use of the models.

Conclusion
We alleviate the problem of inadequate translation from the perspective of preventing the LM from being overconfident. Specifically, we first propose an indicator of the overconfidence degree of the LM in NMT, i.e., the Margin between the NMT and the LM. Then we propose Margin-based Token-level and Sentence-level objectives to maximize the Margin. Experimental results on three large-scale translation tasks demonstrate the effectiveness and superiority of our approaches. The human evaluation further verifies that our methods improve both translation adequacy and fluency.

A Loss of the Language Model
To validate whether the LM has converged after pretraining, we plot the loss of the LM in Figure 5. The loss of the LM remains stable after training for about 80K steps for En→De, Zh→En, and En→Fr, indicating that the LM has converged during the pretraining stage.

B Hyperparameters Selection
The results of our approaches with different λ_M (defined in Eq. 7) and k (defined in Eq. 10) on the validation sets of WMT14 En→De, WMT14 En→Fr, and WMT19 Zh→En are shown in Figure 6. We first search for the best λ_M based on MTO. All three datasets achieve better performance for λ_M ∈ [5, 10], with peaks at λ_M = 5, 8, and 8, respectively. Then, fixing the best λ_M for each dataset, we search for the best threshold k. As shown on the right of Figure 6, the best k is 30% for En→De and Zh→En, and 40% for En→Fr. This is consistent with our observations: when the proportion of tokens with negative Margin in a target sentence is greater than 30% or 40%, the sentence is most likely a hallucination.

C Case Study
As shown in Figure 7, our approach outperforms the base model (i.e., the Transformer) in translation adequacy. In Case 1, the base model generates "on Tuesday", which is unrelated to the source sentence, i.e., a hallucination, and under-translates the "November 5" and "the website of the Chinese embassy in Mongolia" information in the source sentence; our approach translates both segments well. In Case 2, the base model reverses the chronological order of the source sentence and thus generates a mistranslation, while our model translates it correctly. In Case 3, the base model neglects two main segments of the source sentence (the text in bold blue font), leading to the inadequacy problem, whereas our model takes them into account. From these three examples, we conclude that our approach alleviates the inadequacy problem, which is extremely harmful to NMT.

REF 2
For this, CCTV issued a quick comment: this was the first Memorial Day after the implementation of the law for the protection of heroes and martyrs in China.

BASE 2 CCTV released a quick comment on this: this is our heroic martyrs protection law after the implementation of the first martyr anniversary.
OURS 2 CCTV issued a quick comment on this: this is the first martyr memorial day after the implementation of our country's heroic martyr protection law.

REF 3
According to foreign media reports, it was hard for people to find anything unusual in two little lions playing in a conservation center located in the suburb in Pretoria, the capital of South Africa, but they were absolutely unique.

BASE 3
It's hard to see anything unusual in a nursing home in a suburb of Pretoria, South Africa's capital, where two lions play together.

OURS 3
According to foreign media reports, in a care center on the outskirts of Pretoria, South Africa, two lions play together, it is difficult to see any abnormalities, but they are unique.