Focus on the Target’s Vocabulary: Masked Label Smoothing for Machine Translation

Label smoothing and vocabulary sharing are two widely used techniques in neural machine translation models. However, we argue that naively applying both techniques can be conflicting and even lead to sub-optimal performance. When allocating smoothed probability, original label smoothing treats source-side words that would never appear in the target language equally to real target-side words, which could bias the translation model. To address this issue, we propose Masked Label Smoothing (MLS), a new mechanism that masks the soft label probability of source-side words to zero. Simple yet effective, MLS manages to better integrate label smoothing with vocabulary sharing. Our extensive experiments show that MLS consistently yields improvements over original label smoothing on different datasets, including bilingual and multilingual translation, in terms of both translation quality and model calibration. Our code is released at https://github.com/PKUnlp-icler/MLS


Introduction
Recent advances in Transformer-based (Vaswani et al., 2017) models have achieved remarkable success in Neural Machine Translation (NMT). In most NMT studies (Vaswani et al., 2017; Song et al., 2019; Lin et al., 2020; Pan et al., 2021), two techniques are widely used to improve translation quality: Label Smoothing (LS) and Vocabulary Sharing (VS). Label smoothing (Pereyra et al., 2017) turns the hard one-hot labels into a soft weighted mixture of the golden label and the uniform distribution over the whole vocabulary, which serves as an effective regularization technique to prevent over-fitting and overconfidence (Müller et al., 2019) of the model. Vocabulary sharing (Xia et al., 2019) is another commonly used technique, which unifies the vocabularies of the source and target languages into a single shared vocabulary. It enhances the semantic correlation between the two languages and reduces the total number of parameters in the embedding matrices.

However, in this paper, we argue that jointly adopting label smoothing and vocabulary sharing can be conflicting and leads to sub-optimal performance. Specifically, with vocabulary sharing, the shared vocabulary can be divided into three parts, as shown in Figure 1. With label smoothing, however, the soft label still assigns probability to source-side words that can never appear on the target side. This misleads the translation model and exerts a negative effect on translation performance. As shown in Table 1, although introducing label smoothing or vocabulary sharing alone improves the vanilla Transformer, jointly adopting both of them yields no further improvement and achieves sub-optimal results.
To address the conflict between label smoothing and vocabulary sharing, we first propose a new mechanism named Weighted Label Smoothing (WLS) to control the smoothed probability distribution, together with its parameter-free version, Masked Label Smoothing (MLS). Simple yet effective, MLS constrains the soft label so that no probability is assigned to words that belong only to the source side. In this way, we not only keep the benefits of both label smoothing and vocabulary sharing, but also resolve the conflict between these two techniques to improve translation quality.
According to our experiments, MLS leads to better translation not only in scores like BLEU but also in model calibration. Compared with original label smoothing with vocabulary sharing, MLS outperforms on WMT'14 EN-DE (+0.47 BLEU), WMT'16 EN-RO (+0.33 BLEU) and 7 other language pairs, including the DE,RO-EN multilingual translation task.

Background
Label Smoothing The original label smoothing can be formalized as:

ŷ^LS = (1 − α) · ŷ + (α / K) · 1

where K denotes the number of classes, α is the label smoothing parameter, α/K is the soft label assigned to every class, ŷ is a one-hot vector in which the correct label equals 1 and all others equal 0, and ŷ^LS is the modified target.
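As a minimal sketch of this formula (plain Python; the helper name is ours), the smoothed target distribution can be built as:

```python
def smooth_labels(correct_idx, vocab_size, alpha=0.1):
    """Original label smoothing: mix the one-hot gold label with a
    uniform distribution over the whole vocabulary of size K."""
    # every class receives the uniform share alpha / K ...
    soft = [alpha / vocab_size] * vocab_size
    # ... and the gold class additionally keeps the (1 - alpha) mass
    soft[correct_idx] += 1.0 - alpha
    return soft
```

For example, with α = 0.1 and K = 5, every incorrect class receives 0.02 and the gold class receives 0.92, so the distribution still sums to 1.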
Label smoothing was first introduced for image classification (Szegedy et al., 2016). Gao et al. (2020) explore the best recipe for applying label smoothing to machine translation. To generate more reliable soft labels, Lukasik et al. (2020) take the overlap of semantically similar n-grams into consideration for label smoothing. Another line of work proposes Graduated Label Smoothing, which generates soft labels according to the model's confidence scores. To the best of our knowledge, we are the first to investigate label smoothing's influence on machine translation from the perspective of languages.

Vocabulary Sharing Vocabulary sharing is widely applied in most neural machine translation studies (Vaswani et al., 2017; Song et al., 2019; Lin et al., 2020), and researchers have conducted in-depth studies on it. One proposal is shared-private bilingual word embeddings, which give a closer relationship between the source and target embeddings, while Kim et al. (2019) point out that there is a vocabulary mismatch between parent and child languages in shared multilingual word embeddings.

Conflict Between Label Smoothing and Vocabulary Sharing
Words or subwords in a language pair's joint dictionary can be categorized into three classes, source, common and target, according to the language(s) they belong to, as depicted in the Venn diagram in Figure 1. This can be done by checking whether each token in the joint vocabulary also belongs to the source/target vocabulary. We formalize the categorization algorithm in Appendix A. We then compute the tokens' distribution in different translation directions, as shown in Table 2. Tokens in the source class account for a large proportion, up to 50%. When label smoothing and vocabulary sharing are applied together, smoothed probability is allocated to words in the source class. Those words have zero overlap with the possible target words, so they have no chance of appearing in the target sentence. Allocating smoothed probability to them introduces extra bias into the translation system during training, unavoidably leading to a higher translation perplexity, as also revealed by Müller et al. (2019). Table 3 confirms the conflict: the joint use of label smoothing and vocabulary sharing underperforms using either technique alone on all language pairs, with a maximum loss of 0.32 BLEU.
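The categorization (formalized in Appendix A) amounts to simple set-membership tests; a sketch in plain Python, assuming the vocabularies are available as sets of token strings:

```python
def categorize(joint_vocab, src_vocab, tgt_vocab):
    """Partition the joint vocabulary into source-only, common,
    and target-only token classes (the Venn diagram of Figure 1)."""
    classes = {"source": set(), "common": set(), "target": set()}
    for tok in joint_vocab:
        if tok in src_vocab and tok in tgt_vocab:
            classes["common"].add(tok)
        elif tok in src_vocab:
            classes["source"].add(tok)
        elif tok in tgt_vocab:
            classes["target"].add(tok)
    return classes
```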

Weighted Label Smoothing
To deal with this conflict when applying label smoothing, we propose a plug-and-play Weighted Label Smoothing mechanism to control the distribution of smoothed probability.
Weighted Label Smoothing (WLS) has three parameters β_t, β_c, β_s in addition to the label smoothing parameter α, where the ratio of the three parameters represents the portion of smoothed probability allocated to the target, common and source classes, and the three parameters sum to 1. Within each token class, the distribution is uniform. WLS can be formalized as:

ŷ^WLS = (1 − α) · ŷ + β

where ŷ is a one-hot vector in which the element corresponding to the correct token equals 1 and all others equal 0, and β is a vector that controls the distribution of probability allocated to incorrect tokens, with Σ_{i=1}^{K} β_i = α. We use t_i, c_i, s_i to denote the probability allocated to the i-th token of the target, common and source category, respectively, all of which form the distribution-controlling vector β. The restriction can be formalized as:

t_i = α · β_t / |T|,  c_i = α · β_c / |C|,  s_i = α · β_s / |S|,  β_t + β_c + β_s = 1

where |T|, |C| and |S| are the numbers of tokens in the target, common and source classes.
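Under these definitions, WLS can be sketched in plain Python (here `token_class` maps each vocabulary index to its class name, an assumed representation):

```python
def weighted_smooth(correct_idx, token_class, alpha, beta):
    """Weighted Label Smoothing: split the smoothed mass alpha among the
    target/common/source classes by beta, uniformly within each class."""
    sizes = {c: token_class.count(c) for c in ("target", "common", "source")}
    soft = []
    for cls in token_class:
        # each token in class cls receives alpha * beta_cls / |cls|
        soft.append(alpha * beta[cls] / sizes[cls])
    soft[correct_idx] += 1.0 - alpha
    return soft
```

With β = (1/2, 1/2, 0), for instance, source-class tokens receive exactly zero smoothed probability.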

Masked Label Smoothing
Based on the Weighted Label Smoothing mechanism, we can implement Masked Label Smoothing by setting β_s to 0 and treating the target and common categories as a single category. In this way, Masked Label Smoothing is parameter-free and implicitly injects external knowledge into the model. We have found that this simple setting reaches satisfactory results in our experiments. We illustrate the different label smoothing methods in Figure 2. It is worth noticing that MLS is different from setting WLS's parameters to 1-1-0, since the common and target vocabularies may contain different numbers of tokens. (Figure 2: the height of each bar denotes the probability allocated to each token; y is the current token during the current decoding step. We assume there are only 10 tokens in the joint vocabulary, with t1-t3 in the target class, c1-c3 in the common class and s1-s3 in the source class.)
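MLS then reduces to masking out the source class and spreading the smoothed mass uniformly over the remaining tokens; a parameter-free sketch under the same assumed `token_class` representation:

```python
def masked_smooth(correct_idx, token_class, alpha=0.1):
    """Masked Label Smoothing: assign zero soft probability to
    source-only tokens and share alpha uniformly over the rest."""
    allowed = [i for i, c in enumerate(token_class) if c != "source"]
    soft = [0.0] * len(token_class)
    for i in allowed:
        soft[i] = alpha / len(allowed)  # uniform over target + common
    soft[correct_idx] += 1.0 - alpha
    return soft
```

Because the mass is uniform over the union of the target and common classes, this differs from WLS with β set to 1/2-1/2-0 whenever the two classes have different sizes.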
Experiments
For multilingual translation, we combine the WMT'16 RO-EN and IWSLT'14 DE-EN datasets to form a RO,DE-EN translation task. We also build a balanced multilingual dataset with equal numbers of DE-EN and RO-EN training examples, both to reduce the impact of language imbalance and to explore how MLS performs under different data distribution conditions in multilingual translation.
We adopt the Transformer base (Vaswani et al., 2017) model as our baseline. We fix the label smoothing parameter α to 0.1 in the main experiments and separately examine the performance of MLS under different values of α.
We use compound_split_bleu.sh from fairseq to compute the final BLEU scores. The inference ECE and chrF scores are computed with open-source scripts. We list the concrete training and evaluation settings in Appendix B.

Results
Bilingual The bilingual results are shown in Table 4 for both BLEU and chrF scores. Similar to Gao et al. (2020)'s conclusion, we find that a higher α can generally improve bilingual translation quality, and applying MLS further improves the results. This shows that not only the amount of probability placed on the target vocabulary, but also how the smoothed probability is allocated across languages, matters for translation performance.
Multilingual As shown in Table 3, MLS achieves consistent improvement over the original label smoothing in both the original and the balanced multilingual translation dataset under all translation directions. In the original combined dataset, direction RO-EN (400K) has much more samples than DE-EN (160K). We do not apply a resampling strategy during training in order to investigate how the imbalance condition affects different models' performance. The balanced version cuts down samples in RO-EN direction to the same number as in DE-EN direction.
Compared with the imbalanced version, the balanced version gives better BLEU scores in the DE-EN direction but much worse performance in RO-EN translation for both the original label smoothing and MLS, indicating that cutting down the RO-EN samples harms that direction.

According to the results, although the best WLS setting varies across tasks and there seems to be a more complex relation between the probability allocation and the BLEU score, we still make two observations. First, applying WLS can generally boost translation quality compared to the original label smoothing. Second, only WLS with β_t, β_c, β_s set to 1/2, 1/2, 0 outperforms the original label smoothing on all tasks, suggesting that this setting is the most robust one. We therefore recommend it as the initial setting when applying WLS.
Furthermore, the most robust setting agrees with the form of MLS, since both allocate zero probability to tokens in the source category, which further demonstrates the robustness of MLS.

Improvement in Model's Calibration and Translation Perplexity
Müller et al. (2019) point out that label smoothing prevents the model from becoming overconfident, thereby improving model calibration. Since there is a training-inference discrepancy in NMT models, the inference ECE score better reflects a model's real calibration.
To compute the ECE score, we split the model's predictions into M bins according to output confidence and calculate the weighted average of each bin's confidence/accuracy gap, weighting by the number of samples in each bin:

ECE = Σ_{i=1}^{M} (|B_i| / N) · |acc(B_i) − conf(B_i)|

where N is the total number of prediction samples, |B_i| is the number of samples in the i-th bin, and acc(B_i) and conf(B_i) are the average accuracy and confidence in the i-th bin.
The score denotes the difference between accuracy and confidence of models' output during inference. Less ECE implies better calibration.
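The computation above can be sketched in a few lines of plain Python (equal-width confidence bins; variable names are ours):

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE: bin predictions by confidence, then take the size-weighted
    average of each bin's |accuracy - confidence| gap."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # clamp confidence 1.0 into the last bin
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece
```

A perfectly calibrated model, whose per-bin accuracy always equals its average confidence, scores an ECE of 0.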
The inference ECE scores of our models are shown in Table 6. It turns out that models with MLS have lower Inference ECE scores on different datasets. The results indicate that MLS will lead to better model calibration.
We also find that MLS leads to a significantly lower perplexity than LS during the early stage of training in all of our experiments. This is not surprising, since zeroing the source-side words' smoothed probability decreases the perplexity. It can be another reason for the model's better translation performance, as it provides a better training initialization.

Conclusion
We reveal the conflict between the label smoothing and vocabulary sharing techniques in NMT: jointly adopting the two can lead to sub-optimal performance. To address this issue, we introduce Masked Label Smoothing, which eliminates the conflict by reallocating the smoothed probabilities according to the differences between the languages. Simple yet effective, MLS shows improvement over original label smoothing in terms of both translation quality and model calibration on a wide range of tasks.