Bilingual Mutual Information Based Adaptive Training for Neural Machine Translation

Recently, token-level adaptive training has achieved promising improvements in machine translation, where the cross-entropy loss function is adjusted by assigning different training weights to different tokens in order to alleviate the token imbalance problem. However, previous approaches only use static word frequency information in the target language without considering the source language, which is insufficient for bilingual tasks like machine translation. In this paper, we propose a novel bilingual mutual information (BMI) based adaptive objective, which measures the learning difficulty of each target token from the perspective of bilingualism and assigns an adaptive weight accordingly to improve token-level adaptive training. This method assigns larger training weights to tokens with higher BMI, so that easy tokens are updated with coarse granularity while difficult tokens are updated with fine granularity. Experimental results on WMT14 English-to-German and WMT19 Chinese-to-English demonstrate the superiority of our approach over the Transformer baseline and previous token-level adaptive training approaches. Further analyses confirm that our method can improve lexical diversity.


Introduction
Neural machine translation (NMT) (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017; Chen et al., 2018; Yan et al., 2020; Liu et al., 2021) has achieved remarkable success. As a data-driven model, the performance of NMT depends on the training corpus, and balanced training data is a crucial factor in building a superior model. However, natural languages conform to Zipf's law (Zipf, 1949): word frequencies exhibit long-tail characteristics, which leads to an imbalanced distribution of words in training corpora. Some studies (Jiang et al., 2019; Gu et al., 2020) assign different training weights to target tokens according to their frequencies. These approaches alleviate the token imbalance problem and indicate that tokens should be treated differently during training.
However, there are two issues in existing approaches. First, these approaches assume that low-frequency words are not sufficiently trained and thus amplify their weights. Nevertheless, low-frequency tokens are not always difficult as the model competence increases (Wan et al., 2020). Second, previous studies only use monolingual word frequency information in the target language without considering the source language, which is insufficient for bilingual tasks such as machine translation. The bilingual mapping is a more appropriate indicator. As shown in Table 1, the word frequencies of pleasing and bearings are both 847. In Chinese, pleasing has multiple mappings, while the mapping of bearings is relatively single. The more multivariate the mapping is, the less confident the model is in predicting the target word given the source context. He et al. (2019) also confirm this view: words with multiple mappings contribute more to the BLEU score.
To tackle the above issues, we propose bilingual mutual information (BMI), which has two characteristics: 1) BMI measures the learning difficulty of each target token by considering the strength of association between the token and the source sentence; 2) for each target token, BMI adjusts dynamically according to the context. BMI-based adaptive training can therefore dynamically adjust the learning granularity over tokens: easy tokens are updated with coarse granularity, while difficult tokens are updated with fine granularity.

Table 1: Two English words with the same frequency and their Chinese translations (frequencies in parentheses).
pleasing (847): gāoxìng (81); yúkuài (74); xǐyuè (63); qǔyuè (49); ...
bearings (847): zhóuchéng (671); ...

We evaluate our approach on both the WMT14 English-to-German and WMT19 Chinese-to-English translation tasks. Experimental results on the two benchmarks demonstrate the superiority of our approach over the Transformer baseline and previous token-level adaptive training approaches. Further analyses confirm that our method can improve lexical diversity. The main contributions of this paper can be summarized as follows:
• We propose a training objective based on bilingual mutual information (BMI), which reflects the learning difficulty of each target token from the perspective of bilingualism and assigns an adaptive weight accordingly to guide the adaptive training of machine translation.
• Experimental results show that our method can improve not only the machine translation quality, but also the lexical diversity.

Neural Machine Translation
An NMT system is a neural network that translates a source sentence x with n words into a target sentence y with m words. During training, NMT models are optimized by minimizing the cross-entropy loss:

$$\mathcal{L}_{ce} = -\sum_{j=1}^{m} \log p(y_j \mid y_{<j}, x), \qquad (1)$$

where y_j is the ground-truth token at the j-th position and y_{<j} is the translation history known before predicting token y_j.
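As a rough illustration (not the paper's implementation), the sentence-level cross-entropy over the gold tokens can be sketched in plain Python, where `gold_token_probs` stands for the model probabilities p(y_j | y_<j, x) at each target position:

```python
import math

def cross_entropy_loss(gold_token_probs):
    """Sentence-level cross-entropy: sum over positions of the negative
    log-probability the model assigns to each ground-truth token y_j."""
    return -sum(math.log(p) for p in gold_token_probs)

# Toy example: probabilities of the three gold tokens of one sentence.
probs = [0.9, 0.5, 0.25]
loss = cross_entropy_loss(probs)
```

A confident model (probabilities near 1) yields a loss near zero; any low-probability gold token inflates the loss.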

Token-level Adaptive Training Objective
Following Gu et al. (2020), the token-level adaptive training objective is

$$\mathcal{L}_{adaptive} = -\sum_{j=1}^{m} w_j \log p(y_j \mid y_{<j}, x), \qquad (2)$$

where w_j is the weight assigned to the target token y_j. Gu et al. (2020) used monolingual word frequency information in the target language to calculate w_j. Such weights contain no information about the source language and cannot be dynamically adjusted with the context.
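A minimal sketch of this token-level adaptive objective, assuming the per-token weights w_j have already been produced by some weighting scheme:

```python
import math

def adaptive_loss(gold_token_probs, weights):
    """Token-level adaptive cross-entropy: each gold token y_j contributes
    -w_j * log p(y_j | y_<j, x) instead of the uniform -log p."""
    assert len(gold_token_probs) == len(weights)
    return -sum(w * math.log(p)
                for p, w in zip(gold_token_probs, weights))
```

With all weights equal to 1 this reduces to the standard cross-entropy; weights above or below 1 amplify or dampen individual tokens' contributions.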

BMI-based Adaptive Training
In this section, we start with the definition of bilingual mutual information (BMI). Then we analyze the relationship between BMI and translation difficulty. Based on this, we introduce our BMI-based token-level adaptive training objective.

Definition of BMI
Mutual information measures the strength of association between two random variables by comparing the numbers of their individual and joint occurrences. We develop BMI, which is calculated by summing the mutual information between the target token and each token in the source sentence, to measure the learning difficulty for the model. Token pairs with high BMI are considered easy, since they have high co-occurrence relative to their frequencies. Given the source sentence x and target token y_j, we define the bilingual mutual information as

$$\mathrm{BMI}(x, y_j) = \sum_{i=1}^{n} \log \frac{f(x_i, y_j) \cdot K}{f(x_i) \cdot f(y_j)}, \qquad (3)$$

where f(x_i) and f(y_j) are the total numbers of sentences in the corpus containing at least one occurrence of x_i and y_j, respectively, f(x_i, y_j) is the total number of sentences in the corpus containing at least one occurrence of the word pair (x_i, y_j), and K denotes the total number of sentences in the corpus.
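The counting scheme above can be sketched as follows. This is a toy illustration rather than the authors' code: it sums the pointwise mutual information over source tokens (following "summarizing" in the definition) and skips pairs that never co-occur to avoid taking log of zero:

```python
import math
from collections import Counter

def bmi_statistics(parallel_corpus):
    """Collect sentence-level counts f(x_i), f(y_j), f(x_i, y_j) and the
    corpus size K from a list of (source_tokens, target_tokens) pairs.
    Each sentence counts at most once per token and per pair."""
    f_src, f_tgt, f_pair = Counter(), Counter(), Counter()
    for src, tgt in parallel_corpus:
        src_set, tgt_set = set(src), set(tgt)
        f_src.update(src_set)
        f_tgt.update(tgt_set)
        f_pair.update((x, y) for x in src_set for y in tgt_set)
    return f_src, f_tgt, f_pair, len(parallel_corpus)

def bmi(src_sentence, y_j, f_src, f_tgt, f_pair, K):
    """BMI(x, y_j): summed pointwise mutual information between target
    token y_j and every source token x_i in the sentence."""
    total = 0.0
    for x_i in src_sentence:
        joint = f_pair[(x_i, y_j)]
        if joint:  # skip pairs that never co-occur
            total += math.log(joint * K / (f_src[x_i] * f_tgt[y_j]))
    return total

# Toy 3-sentence corpus with hypothetical tokens.
corpus = [(["a", "b"], ["A"]), (["a"], ["A"]), (["b"], ["B"])]
f_src, f_tgt, f_pair, K = bmi_statistics(corpus)
```

The stable pair ("a", "A") co-occurs in 2 of 3 sentences, so BMI(["a"], "A") = log(2·3 / (2·2)) > 0, reflecting an easy, well-associated mapping.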

What Does BMI Measure?
We use an example to illustrate our idea. Figure 1 shows two sentence pairs; the words in red and bold fonts have the same word frequency. As shown in Table 1, pleasing has multiple mappings, while the mapping of bearings is relatively single. As a result, the appearance of the corresponding English word lends different degrees of confidence to the appearance of the Chinese word, which can be reflected by BMI. Further statistical results are shown in Figure 2: high BMI means a relatively stable mapping, which is easy for the model to learn and has low lexical diversity.

BMI-based Objective
We calculate the token-level weight by scaling BMI and adjusting the lower limit as follows:

$$w_j = S \cdot \mathrm{BMI}(x, y_j) + B, \qquad (4)$$

where the two hyperparameters S (scale) and B (base) control the magnitude of change and the lower limit, respectively.
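Assuming that "scaling BMI and adjusting the lower limit" denotes the linear map w_j = S · BMI(x, y_j) + B (consistent with S controlling the magnitude of change and B the lower limit), the weight computation is a one-liner:

```python
def bmi_weight(bmi_value, S, B):
    """Map a token's BMI score to a training weight.
    B sets the lower limit of the weight; S scales how strongly the
    BMI score shifts the weight above that limit."""
    return S * bmi_value + B

# With the En-De settings reported in the hyperparameter section
# (B = 0.8, S = 0.15), a high-BMI (easy) token gets a weight above B.
w = bmi_weight(2.0, S=0.15, B=0.8)
```

Higher BMI thus yields a larger weight, so easy tokens incur a larger penalty when mispredicted, matching the coarse-granularity update described below.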
During training, the loss of easy tokens is amplified and the model updates them with coarse granularity, because our strategy assumes that the model can easily predict these target tokens given the source sentence and should be penalized more heavily when the prediction is wrong. For difficult tokens, the model has a higher tolerance, because their translation errors may not be absolute. As a result, their loss is small due to the small weight, and difficult tokens are always updated in a fine-grained way.

Experiments
We evaluate our method on the Transformer (Vaswani et al., 2017) and conduct experiments on two widely-studied NMT tasks, WMT14 English-to-German (En-De) and WMT19 Chinese-to-English (Zh-En).

Data Preparation
EN-DE. The training data consists of 4.5M sentence pairs from WMT14. Each word in the corpus is segmented into subword units using byte pair encoding (BPE) (Sennrich et al., 2016) with 32K merge operations, and the vocabulary is shared between the source and target languages. We select newstest2013 for validation and report BLEU scores on newstest2014.
ZH-EN. The training data is from WMT19 which consists of 20.5M sentence pairs. The number of merge operations in byte pair encoding (BPE) is set to 32K for both source and target languages. We use newstest2018 as our validation set and newstest2019 as our test set, which contain 4k and 2k sentences, respectively.

Systems
Transformer. We implement our approach with the open source toolkit THUMT (Zhang et al., 2017) and strictly follow the setting of Transformer-Base in (Vaswani et al., 2017).
Exponential (Gu et al., 2020). This method adds additional training weight to low-frequency target tokens:

$$w_j = A \cdot e^{-T \cdot \mathrm{Count}(y_j)} + 1.$$

Chi-Square (Gu et al., 2020). The weighting function of this method is similar in form to the chi-square distribution:

$$w_j = A \cdot \mathrm{Count}^2(y_j) \cdot e^{-T \cdot \mathrm{Count}(y_j)} + 1.$$

BMI-based (ours). We train the model with the adaptive objective using the weights of Equation 4 for 100k steps; the same procedure is used for the competing methods. To eliminate the influence of noise, we set the weight of tokens with BMI lower than 0.4 to zero during training.
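For reference, the two frequency-based weighting heuristics of Gu et al. (2020) and the BMI noise filter can be sketched as follows. A and T are their hyperparameters; the values in the test below are illustrative only, and the exponential form is written by direct analogy with the chi-square one:

```python
import math

def exponential_weight(count, A, T):
    """Exponential heuristic of Gu et al. (2020): boost low-frequency
    target tokens, w_j = A * exp(-T * Count(y_j)) + 1."""
    return A * math.exp(-T * count) + 1.0

def chi_square_weight(count, A, T):
    """Chi-square-shaped heuristic of Gu et al. (2020):
    w_j = A * Count(y_j)^2 * exp(-T * Count(y_j)) + 1."""
    return A * count ** 2 * math.exp(-T * count) + 1.0

def filter_noise(weight, bmi_value, threshold=0.4):
    """Zero out the training weight of tokens whose BMI falls below
    the noise threshold used in our experiments."""
    return weight if bmi_value >= threshold else 0.0
```

Both frequency-based weights decay toward 1 as Count(y_j) grows, i.e. high-frequency tokens keep the plain cross-entropy weight, whereas our filter acts on BMI rather than on frequency.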

Hyperparameters
We introduce two hyperparameters, B and S, to adjust the weight distribution based on BMI, as shown in Equation 4. In our experiments, we restricted B to the narrow range [0.7, 1] to reduce the search space and tuned the other hyperparameter S on the validation sets. The results are shown in Table 2. Finally, we use the best hyperparameters found on the validation set for the final evaluation on the test set: for En-De, B = 0.8 and S = 0.15; for Zh-En, B = 1.0 and S = 0.1.

Main Results
As shown in Table 3, our Transformer baseline outperforms that of Vaswani et al. (2017) by 0.67 BLEU points; we use this strong baseline to make the evaluation convincing. Our method further improves over the existing methods of Gu et al. (2020) on the En-De and Zh-En tasks. The significant and consistent improvement on the two large-scale datasets demonstrates the effectiveness of our method.
Table 3: BLEU scores (%) on the WMT14 En-De test set and the WMT19 Zh-En test set. Results of our method marked with '*' are statistically significant (Koehn, 2004) compared with all other models (p < 0.01).

Results on Different BMI Intervals
We score each target sentence of newstest2014 by the average BMI of its tokens, and then divide newstest2014 into two equal-size subsets according to this score, denoted HIGH and LOW. As shown in Figure 3, compared with the Transformer, the frequency-based methods perform better on the HIGH subset but show no obvious improvement on the LOW subset. By contrast, our method not only brings a stable improvement on the HIGH subset; the improvement is even more pronounced on the LOW subset. Low BMI means a relatively rich mapping. We believe the model should have a higher tolerance for these tokens because their translation errors may not be absolute; for example, the model may output another token with a similar meaning. Therefore, our method improves more on the LOW subset.
The results in Table 4 show that, in improving the lexical diversity of translations, our method is superior to the existing frequency-based methods (Chi-Square and Exponential).
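The HIGH/LOW split described above can be sketched as follows, where the sentence IDs and per-token BMI scores are hypothetical inputs:

```python
def split_by_sentence_bmi(scored_sentences):
    """Split a test set into equal-size HIGH and LOW halves by the
    average BMI of the tokens in each target sentence.
    `scored_sentences` is a list of (sentence_id, [token BMI scores])."""
    averaged = [(sid, sum(scores) / len(scores))
                for sid, scores in scored_sentences]
    averaged.sort(key=lambda t: t[1], reverse=True)  # high BMI first
    half = len(averaged) // 2
    high = [sid for sid, _ in averaged[:half]]
    low = [sid for sid, _ in averaged[half:]]
    return high, low

# Hypothetical sentences with per-token BMI scores.
high, low = split_by_sentence_bmi([
    ("s1", [2.0, 2.0]),
    ("s2", [0.5]),
    ("s3", [1.0, 1.0]),
    ("s4", [3.0]),
])
```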

Contrast with Label Smoothing
There are similarities between token-level adaptive training and label smoothing, because both adjust the loss function of the model by token weighting. In particular, for some smoothing methods guided by prior or posterior knowledge of the training data (Gao et al., 2020; Pereyra et al., 2017), different tokens are treated differently. But these similarities are not the key points of the two methods, which are essentially different. The first and most important point is that their motivations differ: label smoothing is a regularization method for avoiding overfitting, whereas our method treats samples of different difficulty differently for adaptive training. Second, the two methods work in different ways. Label smoothing is applied when calculating the cross-entropy loss; it determines how to assign weight to tokens other than the golden one and only indirectly affects the training of the golden token. Our method is applied after the cross-entropy loss is calculated and is computed from the golden token at each position in the reference, which is more direct. In all experiments, we employed uniform label smoothing with value ls = 0.1; the results show that the two methods do not conflict when used together.
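To illustrate that the two mechanisms act at different stages and compose cleanly, here is a sketch of one target position with uniform label smoothing applied inside the cross-entropy and the adaptive weight applied afterwards. This spreads the smoothing mass ls uniformly over the whole vocabulary, one common formulation; it is not the authors' code:

```python
import math

def smoothed_weighted_loss(model_probs, gold_index, w_j, ls=0.1):
    """One target position: label-smoothed cross-entropy, scaled
    afterwards by the token-level adaptive weight w_j.
    The smoothed target distribution puts (1 - ls) extra mass on the
    gold token and spreads ls uniformly over the vocabulary."""
    vocab_size = len(model_probs)
    smooth = ls / vocab_size
    loss = 0.0
    for k, p in enumerate(model_probs):
        q = (1.0 - ls) + smooth if k == gold_index else smooth
        loss -= q * math.log(p)
    return w_j * loss  # adaptive weighting happens after the CE loss
```

With ls = 0 and w_j = 1 this reduces to the plain negative log-probability of the gold token, making the two stages easy to separate.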

Conclusion
We propose a novel bilingual mutual information based adaptive training objective, which can measure the learning difficulty for each target token from the perspective of bilingualism, and adjust the learning granularity dynamically to improve token-level adaptive training. Experimental results on two translation tasks show that our method can bring a significant improvement in translation quality, especially on sentences that are difficult to learn by the model. Further analyses confirm that our method can also improve the lexical diversity.