Con-NAT: Contrastive Non-autoregressive Neural Machine Translation



Introduction
Neural machine translation has developed rapidly with the progress of deep learning. Traditional neural machine translation models (Sutskever et al., 2014; Bahdanau et al., 2015; Wu et al., 2016; Vaswani et al., 2017) are autoregressive (AT): they predict target tokens one by one, conditioning on the source tokens and the previously predicted tokens. This dependency limits translation speed, and the time required for translation is directly proportional to the sentence length.
Recently, non-autoregressive machine translation (NAT) has become a research hotspot. The non-autoregressive generation mode eliminates token dependencies within the target sentence and generates all tokens in parallel, considerably improving translation speed. However, the speedup is accompanied by a decrease in translation quality. Many iterative models have been developed to trade off translation speed against quality: they improve translation quality by repeatedly refining the generated target sentence, usually by predicting masked tokens in the target sentence, as in BERT (Devlin et al., 2019).
The masked tokens are usually chosen at random, so a sentence can be masked in many different ways. Across different masked versions of the same sentence, the token predicted at a given position should be the same, and this should be reflected in the token representations: representations of the same masked token should be similar, because they come from the same token and carry the same semantics in a similar context (the same source sentence and different masked versions of the same target sentence). We therefore ask how to make these different representations of the same token more similar. Inspired by the successful use of contrastive learning in NLP pre-trained models (e.g., Gao et al., 2021), we explore combining contrastive learning with the conditional masked language model, treating different representations of the same masked token as positive pairs and representations of different tokens as negative pairs. Contrastive learning then pulls positive pairs together and pushes negative pairs apart.
As illustrated in Figure 1, we propose two strategies for constructing positive pairs. Contrastive Common Mask uses representations of the same token in different masked versions of the same sentence. As shown in Figure 1(a), "fell" is masked both in "he [mask] asleep almost [mask]" and in "he [mask] asleep [mask] instantly", two different random maskings of "he fell asleep almost instantly". The other strategy, inspired by Gao et al. (2021), feeds the same input to the decoder twice and obtains two different representations due to the dropout setting; we call it Contrastive Dropout. The two representations of the same token should be similar, as shown in Figure 1(b).
We use the constructed positive and negative pairs to calculate a contrastive loss and optimize it jointly with the cross-entropy loss. We verify the effectiveness of our model in six translation directions across three standard datasets of varying sizes. Experiments show that, at the same translation speed, our model beats CMLM (Ghazvininejad et al., 2019) by 0.80-1.46 BLEU and GLAT (Qian et al., 2021) by 0.18-0.65 BLEU. It also outperforms other CMLM-based models and beats the state-of-the-art NAT model on WMT'16 Ro-En.
The main contributions of this work can be summarized as follows:
• To the best of our knowledge, our work is the first effort to combine token-level contrastive learning and the conditional masked language model.
• We propose two methods to construct positive pairs for the contrastive conditional masked language model: Contrastive Common Mask and Contrastive Dropout.
• Our model Con-NAT achieves consistent and significant improvements in six translation directions on both fully and iterative NAT, and is state-of-the-art on WMT'16 Ro-En (34.18 BLEU).

Preliminaries

Non-Autoregressive Machine Translation
The machine translation task is defined as generating a target sentence $Y = \{y_1, \ldots, y_{T_y}\}$ given a source sentence $X = \{x_1, \ldots, x_{T_x}\}$. Most models factorize the conditional probability $P_\theta(Y \mid X)$ as

$$P_\theta(Y \mid X) = \prod_{t=1}^{T_y} P_\theta(y_t \mid Y_{<t}, X),$$

where $Y_{<t}$ denotes the target tokens generated before time step $t$, $T_y$ denotes the target sentence length, and $\theta$ denotes the model parameters. This autoregressive mode makes decoding time-consuming, because the target tokens are generated step by step. Non-autoregressive models break the conditional dependency between target tokens and generate all target tokens in parallel, factorizing the conditional probability as

$$P_\theta(Y \mid X) = \prod_{t=1}^{T_y} P_\theta(y_t \mid X).$$

Although the conditional independence assumption improves translation speed, it also impairs model performance.
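To make the contrast concrete, here is a minimal Python sketch of the two decoding modes. The model interface (encode, decode_step, decode_parallel) is a hypothetical placeholder for illustration, not the paper's actual API.

```python
import torch

def autoregressive_decode(model, src, max_len, bos_id, eos_id):
    # Tokens are produced one at a time; each step conditions on Y_<t.
    enc = model.encode(src)
    ys = [bos_id]
    for _ in range(max_len):
        logits = model.decode_step(enc, torch.tensor([ys]))  # P(y_t | Y_<t, X)
        next_tok = logits[0, -1].argmax().item()
        ys.append(next_tok)
        if next_tok == eos_id:
            break
    return ys

def non_autoregressive_decode(model, src, tgt_len, mask_id):
    # All positions are predicted in parallel from a fully masked target.
    enc = model.encode(src)
    ys = torch.full((1, tgt_len), mask_id)
    logits = model.decode_parallel(enc, ys)  # P(y_t | X) for every t at once
    return logits.argmax(dim=-1)
```

The loop in the first function is exactly what makes AT latency grow with sentence length; the second function replaces it with a single parallel pass.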

The Conditional Masked Language Model
The mainstream iterative NAT model (CMLM) and the fully NAT model (GLAT) take the masked language model as the training objective (Devlin et al., 2019). The objective allows the model to learn to predict any arbitrary subset of the target sentence in parallel:

$$\mathcal{L}_{mlm} = -\sum_{y_t \in Y_{ms}} \log P_\theta(y_t \mid Y_{obs}, X),$$

where $Y_{ms}$ is the set of target tokens randomly replaced by the special token [mask], and $Y_{obs}$ is the set of observed target tokens.
Contrastive Learning

Contrastive learning algorithms compare positive and negative pairs to learn representations, and they have achieved remarkable success in computer vision, natural language processing, recommendation systems, and other fields. The idea is to pull positive pairs together and push negative pairs apart in the feature space; different algorithms and applications use different strategies for selecting positive and negative pairs. Assume a mini-batch of $2N$ examples. For example $i$, there is one positive pair $(i, j(i))$, and the other $2(N-1)$ examples are treated as negatives of $i$. The training objective for example $i$ is

$$\mathcal{L}_i = -\log \frac{\exp(\mathrm{sim}(z_i, z_{j(i)})/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)},$$

where $z$ denotes the example feature, $\tau$ is a temperature hyper-parameter, and $\mathrm{sim}$ is the similarity function (e.g., the cosine similarity $\mathrm{sim}(z_i, z_j) = z_i^\top z_j / (\lVert z_i \rVert \, \lVert z_j \rVert)$).
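As a reference, a minimal PyTorch sketch of this objective (often called NT-Xent) might look as follows; the batch layout and the temperature value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(z, pos_index, tau=0.1):
    """Contrastive loss over a batch of 2N features.

    z: (2N, d) feature matrix; pos_index[i] = j(i), the index of the
    positive partner of example i. All other 2(N-1) examples act as
    negatives. A sketch of the objective described above.
    """
    z = F.normalize(z, dim=-1)            # cosine similarity via dot product
    sim = z @ z.t() / tau                 # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float('-inf'))     # exclude self-similarity from the sum
    # Cross-entropy with the positive partner as the "label" implements
    # -log softmax(sim)[i, j(i)], averaged over the batch.
    return F.cross_entropy(sim, pos_index)
```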

Methodology
In this section, we present how we incorporate contrastive learning into NAT. We first introduce the structure of our model Con-NAT, then the two positive pair construction methods for contrastive learning, and finally the training objective combined with the contrastive loss. Figure 2 shows the overall framework.

Model
We use the standard CMLM or GLAT as our base model $f_{base}$. The encoder is a standard Transformer encoder, and the decoder is a Transformer decoder without the causal mask. As the token representation we use the output of the last decoder layer, denoted $h$. A projection head $f_{proj}$ maps the representation $h$ to a vector representation $z$ that is more suitable for the contrastive loss; such a projection head has been shown to be important in improving the representation quality of the layer before it (Chen et al., 2020). The projection head is implemented as a multi-layer perceptron with a single hidden layer. We obtain $z$ as

$$h = f_{base}(X, Y_{obs}), \qquad z = f_{proj}(h).$$
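A sketch of how $f_{proj}$ could be implemented in PyTorch, assuming a ReLU hidden layer as in Chen et al. (2020); the hidden width and output dimension are illustrative assumptions.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """MLP with one hidden layer mapping decoder output h to feature z."""

    def __init__(self, d_model, d_proj):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_proj),
        )

    def forward(self, h):      # h: (batch, T_y, d_model) decoder output
        return self.net(h)     # z: (batch, T_y, d_proj) contrastive feature
```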

Contrastive Learning
Positive pairs are different representations of the same token in the same sentence, while negative pairs are representations of other tokens in the same mini-batch. To obtain different representations of the same token, we adopt two methods. One is to randomly mask the same sentence twice; the tokens that are masked in both versions form positive pairs, which we call Contrastive Common Mask. The other, inspired by Gao et al. (2021), simply feeds the same input to the decoder twice: applying the standard dropout twice yields two different representations of the same token as a positive pair, which we call Contrastive Dropout.

Contrastive Common Mask
During training, the model randomly masks some of the tokens in the target sentence. We perform this process on the same target sentence twice and get two sets of results, $\{Y_{obs}^1, Y_{ms}^1\}$ and $\{Y_{obs}^2, Y_{ms}^2\}$. Using the different decoder inputs, we obtain $z^{(m_1)}$ and $z^{(m_2)}$:

$$z^{(m_1)} = f_{proj}(f_{base}(X, Y_{obs}^1)), \qquad z^{(m_2)} = f_{proj}(f_{base}(X, Y_{obs}^2)).$$

Contrastive Dropout

There are dropout modules in the fully-connected layers and the multi-head attention layers. Because of their randomness, we get different features if we feed the same input sentence into the model multiple times. With the same decoder input but different dropout masks, we obtain $z^{(d_1)}$ and $z^{(d_2)}$:

$$z^{(d_1)} = f_{proj}(f_{base}(X, Y_{obs}; \theta_{drop_1})), \qquad z^{(d_2)} = f_{proj}(f_{base}(X, Y_{obs}; \theta_{drop_2})),$$

where $\theta_{drop_1}$ and $\theta_{drop_2}$ denote different dropout masks.
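The following sketch shows one way to construct the two maskings and the two dropout passes in PyTorch. The uniform per-sentence mask ratio and the placeholder `model` are assumptions, not the paper's exact implementation.

```python
import torch

def random_mask(tgt, mask_id, pad_id):
    # Mask a randomly sampled fraction of non-pad target tokens (CMLM-style);
    # each sentence draws its own mask ratio.
    valid = tgt.ne(pad_id)
    ratio = torch.rand(tgt.size(0), 1, device=tgt.device)
    mask = (torch.rand_like(tgt, dtype=torch.float) < ratio) & valid
    return tgt.masked_fill(mask, mask_id), mask

# Two independent maskings of the same batch give {Y_obs1, Y_ms1} and
# {Y_obs2, Y_ms2}; tokens masked in BOTH are the Contrastive Common Mask pairs:
#   in1, mask1 = random_mask(tgt, MASK, PAD)
#   in2, mask2 = random_mask(tgt, MASK, PAD)
#   common = mask1 & mask2
# With dropout active (model.train()), two forward passes on the SAME input
# yield the Contrastive Dropout pair:
#   z_d1 = model(src, in1)
#   z_d2 = model(src, in1)   # different dropout mask, different features
```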
If we combine these two construction methods, we get four sets of features: $z^{(m_1,d_1)}$, $z^{(m_1,d_2)}$, $z^{(m_2,d_1)}$, and $z^{(m_2,d_2)}$.

Contrastive Loss
Now that we have different representations of the same token in the same sentence, we use them to calculate the contrastive loss. Let $Y^1$ and $Y^2$ denote two randomly masked versions of the same sentence, which may or may not be identical, and let $z^1$ and $z^2$ denote the corresponding features. Let $N = |Y^1 \cap Y^2|$ be the number of common masked tokens. We select the representations of the common masked tokens from $z^1$ and $z^2$ to form $Z$, where $|Z| = 2N$. Let $i, k \in I \equiv \{1, \ldots, 2N\}$ index one representation of an arbitrary token, and let $j(i) \in I$ index the other representation of the same token. The contrastive loss is then

$$\mathcal{L}_{con}(z^1, z^2) = \sum_{i \in I} -\log \frac{\exp(\mathrm{sim}(z_i, z_{j(i)})/\tau)}{\sum_{k \in I, k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}.$$

As described above, for both $Y_{ms}^1$ and $Y_{ms}^2$ we get two representations for contrastive learning: $z^{(m_1,d_1)}, z^{(m_1,d_2)}$ and $z^{(m_2,d_1)}, z^{(m_2,d_2)}$, respectively. Different combinations of these representations are used to calculate different contrastive losses. For Contrastive Common Mask, we get two losses:

$$\mathcal{L}_{cm} = \mathcal{L}_{con}(z^{(m_1,d_1)}, z^{(m_2,d_1)}) + \mathcal{L}_{con}(z^{(m_1,d_2)}, z^{(m_2,d_2)}). \quad (1)$$

For Contrastive Dropout, we also get two losses:

$$\mathcal{L}_{cd} = \mathcal{L}_{con}(z^{(m_1,d_1)}, z^{(m_1,d_2)}) + \mathcal{L}_{con}(z^{(m_2,d_1)}, z^{(m_2,d_2)}). \quad (2)$$

We could additionally use $\mathcal{L}_{con}(z^{(m_1,d_1)}, z^{(m_2,d_2)})$ and $\mathcal{L}_{con}(z^{(m_1,d_2)}, z^{(m_2,d_1)})$, but too many contrastive loss terms occupy a large amount of GPU memory, forcing a small batch size that is not conducive to training. We therefore use only (1) and (2).
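Reusing the `info_nce` sketch from above, the selection of common masked tokens and the loss terms in (1) and (2) could be assembled as follows; the variable names are illustrative, not the paper's code.

```python
import torch

def paired_contrastive_loss(z_a, z_b, common, tau=0.1):
    # z_a, z_b: (B, T, d) features from two views; common: (B, T) bool mask
    # of tokens present in both views. Select the N common tokens from each
    # view and stack them so token i's two representations sit at rows
    # i and i + N.
    za, zb = z_a[common], z_b[common]     # (N, d) each
    z = torch.cat([za, zb], dim=0)        # (2N, d)
    n = za.size(0)
    idx = torch.arange(n, device=z.device)
    pos = torch.cat([idx + n, idx])       # j(i): index of each partner
    return info_nce(z, pos, tau)          # defined in the earlier sketch

# Eq. (1), Contrastive Common Mask: different masks, same dropout pass.
# l_cm = paired_contrastive_loss(z_m1_d1, z_m2_d1, common) \
#      + paired_contrastive_loss(z_m1_d2, z_m2_d2, common)
# Eq. (2), Contrastive Dropout: same mask, different dropout passes.
# l_cd = paired_contrastive_loss(z_m1_d1, z_m1_d2, mask1) \
#      + paired_contrastive_loss(z_m2_d1, z_m2_d2, mask2)
```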

Training Losses
Masked Language Model CMLM-based models are optimized with the cross-entropy loss over every masked token in the target sentence. We calculate this loss for both $\{Y_{obs}^1, Y_{ms}^1\}$ and $\{Y_{obs}^2, Y_{ms}^2\}$:

$$\mathcal{L}_{mlm} = -\sum_{y_t \in Y_{ms}^1} \log P_\theta(y_t \mid Y_{obs}^1, X) - \sum_{y_t \in Y_{ms}^2} \log P_\theta(y_t \mid Y_{obs}^2, X).$$

Length Prediction The length of the target sentence must be known in advance for CMLM-based models to predict the entire sentence in parallel. We follow Ghazvininejad et al. (2019) and add a special token [LENGTH] to the encoder; the model uses the output at the [LENGTH] position to predict the length of the target sentence. The length loss is

$$\mathcal{L}_{len} = -\log P_\theta(T_y \mid X), \qquad T_y \in \{1, \ldots, L_{max}\},$$

where $L_{max}$ represents the maximum length of the target sentence.
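A sketch of how the full objective could be combined in PyTorch; the weight `lam` on the contrastive terms and the exact tensor shapes are assumptions, since the combination weights are not specified here.

```python
import torch.nn.functional as F

def training_loss(logits1, logits2, tgt, mask1, mask2,
                  len_logits, tgt_len, l_cm, l_cd, lam=1.0):
    # logits1/2: (B, T, V) predictions for the two masked versions;
    # mask1/2: (B, T) bool masks of the [mask] positions; tgt: (B, T) gold.
    mlm1 = F.cross_entropy(logits1[mask1], tgt[mask1])
    mlm2 = F.cross_entropy(logits2[mask2], tgt[mask2])
    # Length prediction as classification over L_max classes; tgt_len is
    # the gold length given as a class index.
    len_loss = F.cross_entropy(len_logits, tgt_len)
    # l_cm, l_cd: the contrastive losses from Eqs. (1) and (2).
    return mlm1 + mlm2 + len_loss + lam * (l_cm + l_cd)
```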

Models
Baselines We adopt the Transformer (AT) and existing NAT models for comparison. NAT models can be divided into fully NAT models and iterative NAT models; see Table 1 for details. Iterative NAT models with a sufficient number of iterations generally outperform fully NAT models. Noisy parallel decoding (NPD) is an important technique for improving fully NAT performance, but it requires an additional AT model for re-ranking. Models trained with the CTC loss are usually better than models trained with the cross-entropy loss because of CTC's inherent de-duplication mechanism. The current state-of-the-art model is the Imputer, which combines CTC and the masked language model.

Overall Results
Table 1 shows the main results on the WMT'14 En-De and WMT'16 En-Ro test sets. For iterative NAT, our model significantly and consistently improves translation quality across all four translation directions compared to existing NAT models, except for the Imputer. Furthermore, our model outperforms the Imputer on Ro-En and is state-of-the-art (34.18 BLEU). Our model Con-CMLM outperforms the standard CMLM by margins of 0.80 to 1.04 BLEU points, demonstrating the usefulness of our methods. It is also significantly superior to other CMLM-based models, such as SMART, CMLM+LFR, CMLM+PMG, and MvCR. For fully NAT, Con-GLAT likewise outperforms GLAT.
Table 2 shows the results on the large-scale dataset WMT'17 En-Zh. Our approach still achieves a consistent and substantial improvement over CMLM.
We also compare Con-CMLM to other iterative NAT models trained on raw data, without sequence-level knowledge distillation. Table 3 shows that Con-CMLM still significantly outperforms the other iterative NAT models, and here it even performs better than the Imputer, which it does not achieve on distilled data. The better performance on raw data suggests that our method is more general and robust.
It is worth noting that the contrastive module is only used during training and is discarded at inference, so translation latency is not increased: Con-CMLM and Con-GLAT have the same speedup as CMLM and GLAT, respectively.

Analysis
Similarity of Token Representations We further verify the idea of optimizing the similarity of different representations of the same token in the same sentence. We mask the gold target twice with the same mask rate, predict the masked tokens, and calculate the cosine similarity of the two representations. Table 4 shows the average similarity over all common masked tokens with mask rates in {0.2, 0.4, 0.6, 0.8}. Our approach makes representations of the same masked token more similar, and as the mask ratio increases, the similarity gap between CMLM and Con-CMLM widens.
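The probing procedure above can be sketched as follows, assuming `h1` and `h2` are the decoder outputs for the two maskings and `common` marks the common masked positions; the function names are illustrative.

```python
import torch.nn.functional as F

def avg_common_similarity(h1, h2, common):
    # h1, h2: (B, T, d) decoder outputs for the two maskings of the gold
    # target; common: (B, T) bool mask of tokens masked in both versions.
    # Returns the average cosine similarity over all common masked tokens.
    return F.cosine_similarity(h1[common], h2[common], dim=-1).mean()
```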

Comparison of Different Iterations Iterative NAT can effectively improve model performance by increasing the number of iterations. Naturally, the larger the number of iterations, the slower the translation, so we need to strike a balance between translation speed and model performance. One, four, and ten iterations are widely used for CMLM-based models. We compare the performance of CMLM and Con-CMLM in the six translation directions in Table 2 and Table 5. Con-CMLM consistently beats CMLM at every iteration count and on every task, and the fewer the iterations, the more significant the improvement. Furthermore, Con-CMLM with four iterations outperforms CMLM with ten iterations, which no previous CMLM-based model achieves.

Repeated Translation
In NAT, a major issue is repeated translation: illogical consecutive repeated tokens frequently appear in translated sentences, especially in long ones. We calculate the average number of consecutive repeated tokens per sentence on the WMT'16 En-Ro test set. Table 6 shows the results; according to whether the sentence length is below 25, all samples are divided into Short and Long groups. After adding the contrastive module, the number of consecutive repeated tokens is significantly reduced.
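A minimal sketch of this metric, under the assumption that a "consecutive repeated token" is any token equal to its immediate predecessor.

```python
def consecutive_repeats(tokens):
    """Count consecutive repeated tokens in one translation.

    Each token equal to its immediate predecessor counts as one repeat,
    e.g. ["a", "a", "a", "b"] -> 2.
    """
    return sum(1 for prev, cur in zip(tokens, tokens[1:]) if cur == prev)

# Average per sentence over a test set of hypotheses `hyps`:
# avg = sum(consecutive_repeats(s) for s in hyps) / len(hyps)
```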
Contrastive Layer For contrastive learning, we can obtain representations from different layers of the decoder; we discuss the impact of the layer choice here. First, we use the output of the decoder's fourth, fifth, and sixth layers independently. Second, we combine the contrastive losses of the fifth and sixth layers, with projection heads that are either shared or separate. Finally, we also compare the word embedding output of the decoder. Table 9 shows the results: using the representation of the sixth layer alone performs best, and the shallower the representation used, the worse the performance, with the word embedding output performing worst. Combining the contrastive losses of different layers is not helpful, whether using the same head or different heads.
Dropout Probability Since we use dropout explicitly in Contrastive Dropout and implicitly in Contrastive Common Mask, we conduct ablation experiments on WMT'16 En-Ro with dropout rates in {0.1, 0.2, 0.3, 0.4, 0.5}. As Table 10 shows, dropout rates that are too high or too low hurt model performance; the best choice of dropout rate is 0.3.

Model Stability We switch random seeds and run additional experiments to test the stability of the model. As Table 7 shows, the results of our model are not obtained by chance: even with other random seeds, the results remain better.

Conclusion
In this work, we propose Con-NAT, the first effort to combine token-level contrastive learning with the conditional masked language model. Con-NAT uses contrastive learning to optimize the similarity of different representations of the same token in the same sentence. We propose Contrastive Common Mask and Contrastive Dropout to construct positive pairs, using different random masks and different dropout masks, respectively. Our model achieves consistent and significant improvements on six translation tasks and is state-of-the-art on WMT'16 Ro-En. The lightweight contrastive module is removed during inference, so it does not affect translation speed. In the future, we will focus on combining this idea with CTC and with pre-trained masked language models.

Figure 2: The overall framework of our Con-NAT model. [M] is the special token [mask]. Left: the model structure. Right: the combination of Contrastive Common Mask and Contrastive Dropout; for different masked versions of the same sentence, vertical combinations give Contrastive Common Mask and horizontal combinations give Contrastive Dropout.

Table 3: The performance (BLEU) of Con-CMLM on raw data, compared to other non-autoregressive models.
Table 1: Performance (BLEU) comparison between our proposed models Con-NAT (Con-GLAT and Con-CMLM) and existing models. Iter. denotes the number of iterations, Adv. means adaptive, and m is the number of re-ranking candidates.

Table 4: The similarity of token representations.

Table 6: The average number of consecutive repeated tokens per sentence with different iterations on the WMT'16 En-Ro test set.

Table 7: Performance (BLEU) of Con-CMLM with different random seeds. The first row is the result in Table 1.

Table 8: Ablation experiments on the two methods of constructing positive pairs.

Table 10: Performance (BLEU) on WMT'16 En-Ro with different dropout rates.