Manifold Adversarial Augmentation for Neural Machine Translation

Improving the robustness of neural machine translation models on variations of input sentences is an active area of research. In this paper, we propose a simple data augmentation approach by sampling virtual sentences from the vicinity distributions in higher-level representations, constructed either from individual training samples via adversarial learning or pairs of training samples through mixup. By simplifying and extending previous work that operates at the token level, our method can construct virtual training samples in a broader space and achieve improved translation accuracy compared to the previous state-of-the-art. In addition, we present a simple variation of the mixup strategy to better utilize the pseudo training samples created from back-translation, obtaining further improvement in performance.


Introduction
In recent years, neural machine translation (NMT) models (Sutskever et al., 2014; Bahdanau et al., 2014; Vaswani et al., 2017) have dramatically improved the quality of machine translation, especially with the introduction of the seminal Transformer architecture (Vaswani et al., 2017), which has become the de facto modeling choice. NMT training aims to learn a parameterized function that, given a source language sentence, models the prediction of the translation in a target language from labeled training data, which is often limited in volume, especially for low-resource domains or languages. As in other fields of deep learning, model robustness is a concern for NMT, as a minor change in the input sentence may result in a different or incorrect translation. In practice this can happen with spelling or grammar errors (Provilkov et al., 2019), speech recognition errors (Ruiz et al., 2019; Di Gangi et al., 2019), or even a sentence with the same meaning but a slightly different choice of words or expressions. Studies (Belinkov and Bisk, 2017) have shown that the performance of NMT models can drop significantly when small perturbations are added to input sentences.
This problem can be attributed to overfitting, as it is difficult to reliably model the translation distribution for the part of the input space that has little or no training data. There have been several attempts to address this problem by filling the space via data augmentation. One direction is to create new training samples by adding perturbations at the token level (Wang et al., 2018; Belinkov and Bisk, 2017; Sperber et al., 2017; Ebrahimi et al., 2018; Li et al., 2019; Cheng et al., 2018, 2019, 2020; Levy et al., 2019), through either token insertion, deletion, and substitution operations or by introducing noise into token embedding vectors. Among these approaches, Cheng et al. (2019) demonstrated the effectiveness of incorporating adversarial training samples that are natural sentences, with their semantic relevance to the original sentence safeguarded by language modeling. Cheng et al. (2020) achieved further improvement by creating more diverse but virtual sentences, mixing up actual training samples or synthesized adversarial samples via interpolation of word embeddings, but again at the token level.
Inspired by the success of manifold mixup in computer vision (Verma et al., 2019) and recent evidence of separable manifolds in deep language representations (Mamou et al., 2020), we propose to simplify and extend previous work on adversarial learning and mixup augmentation to operate on high-level hidden representations; as such, we name the method manifold adversarial augmentation. Specifically, we create adversarial representations on a randomly selected hidden layer to attack the NMT model by adding gradient-based perturbations at a random scale to randomly selected positions. Because the adversarial representations diverge slightly from the original representation, but in many different ways, they can be viewed as many diverse sentences that differ in expression but share similar meanings. We also create virtual samples by mixing up the hidden representations of two randomly selected samples at a randomly selected hidden layer. Similarly, the mixup representations can be viewed as many diverse sentences that fill the semantic space between the two original samples, which helps obtain smoother decision boundaries in sparsely populated regions of the data space. We further extend the mixup strategy to back-translation, another effective data augmentation method for machine translation, creating virtual samples to bridge the gap between pseudo samples and gold samples.
Experiments on the LDC Chinese-English and IWSLT English-French benchmark tasks demonstrate that our method can significantly improve the vanilla Transformer model by more than 4 and 3 BLEU respectively, averaged over multiple data sets for each task. Compared to the recent state-of-the-art AdvAug method in (Cheng et al., 2020), our method achieves an average improvement of 0.39 and 1.10 BLEU respectively. Further improvement can be achieved with the use of back-translation data.

Method
As our manifold adversarial augmentation method is closely related to the AdvAug method (Cheng et al., 2020), we start by highlighting, and also depicting in Figure 1, their similarities and differences.
AdvAug uses both adversarial learning and mixup augmentation at the token level. The adversarial samples are obtained by randomly replacing a small subset of input words (on either the source or target side) with words adjacent in the direction of the gradient that also fit the context according to language modeling. The generated adversaries tend to be natural sentences; however, the variation is limited, as the method cannot handle word insertion, deletion, reordering, or more general variations in language expression. Its mixup operation creates virtual samples by interpolating the word embedding vectors of two randomly selected training samples or adversarial samples. While it can generate more training samples, it is hard to interpret the virtual samples as representations of natural sentences, limiting its potential in dealing with natural texts.
In contrast, our method operates on higher-level hidden representations for both adversarial learning and mixup augmentation, relying on multiple neural layers to extract semantic meaning, which makes it easier to perform arithmetic operations on semantics. Although we do not explicitly construct adversarial samples that are natural texts, we conjecture that our method has the potential to cover more variations that can occur naturally. We next describe the details of our approach.

Adversarial Learning
Let x be the input sequence to our model, which could be either a source language sentence or a target language text representing the translation history. We use h^(j) and z^(k) to denote the hidden representations at the j-th encoder layer and the history portion of the k-th decoder layer, respectively. Enc_{>j} denotes the function composed of the encoder layers above layer j, and Dec_{>k} the function composed of the decoder layers above layer k plus the output layer, which computes the generation distribution over output words. We generate the perturbation δ_{h^(j)} to the encoder representations h^(j) as follows:

δ_{h^(j),i} = γ · η_i · g^(j)_i / ||g^(j)_i||_2, for 0 < i ≤ |x|

where g^(j) is the gradient of the NMT training loss L_nmt back-propagated to h^(j), γ is a hyper-parameter controlling the maximum amount of perturbation, and η = [η_i ∼ Beta(α_adv, β_adv); 0 < i ≤ |x|] is a random vector providing more fine-grained control of the perturbation. By setting α_adv < 1 and β_adv < 1, η_i concentrates close to 0 or 1 and acts like a gate independently controlling whether to add perturbation at a specific position, mimicking the random selection of positions for word replacement in AdvAug. Similarly, we generate the perturbation δ_{z^(k)} to z^(k) on the decoder side. The manifold adversarial learning loss L^m_adv is computed by:

L^m_adv = KL( ω ∥ Dec_{>k}( z^(k) + δ_{z^(k)}, Enc_{>j}( h^(j) + δ_{h^(j)} ) ) )

Here ω represents the prediction distribution of the NMT model on the original training sample, and we base the adversarial loss on KL-divergence instead of MLE, following the VAT work in (Miyato et al., 2018).
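A minimal NumPy sketch of the per-position perturbation described above (function and variable names are illustrative, not from the authors' code; the gradient would come from back-propagation in a real NMT model):

```python
import numpy as np

rng = np.random.default_rng(0)

def adversarial_perturbation(h, grad, gamma=1.0, alpha=0.5, beta=0.5):
    """Add a gradient-direction perturbation to hidden states h.

    h, grad: arrays of shape (seq_len, d_model) holding the hidden states
    and the gradient of the NMT loss w.r.t. them.
    """
    # Normalize the gradient per position so gamma bounds the step size.
    norms = np.linalg.norm(grad, axis=-1, keepdims=True) + 1e-12
    direction = grad / norms
    # Beta(alpha, beta) with alpha, beta < 1 concentrates near 0 or 1,
    # acting as a soft per-position gate on where perturbation is added.
    eta = rng.beta(alpha, beta, size=(h.shape[0], 1))
    delta = gamma * eta * direction
    return h + delta
```

The per-position norm of the added perturbation is at most gamma, matching the role of γ as the maximum perturbation amount.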

Mixup Augmentation
Verma et al. (2019) investigated manifold mixup augmentation as a way to leverage semantic interpolations at hidden representations as additional training signals for the image classification task. They demonstrated that it results in neural models with smoother decision boundaries at multiple layers, avoiding overconfidence in regions with little or no training data, and can improve model performance and robustness. Inspired by this work, we extend the mixup augmentation method in AdvAug (Cheng et al., 2020) from word embeddings to hidden representations at higher layers for NMT training. Specifically, given two training samples, we first compute their hidden representations h^(j) and h′^(j) at the j-th encoder layer, hidden representations z^(k) and z′^(k) at the history portion of the k-th decoder layer, and their output distributions ω and ω′. We then construct the hidden representations and the output distribution of the virtual mixup sample as follows:

h̃^(j) = m_λ(h^(j), h′^(j)),  z̃^(k) = m_λ(z^(k), z′^(k)),  ω̃ = m_λ(ω, ω′)

where m_λ(x, y) = λx + (1 − λ)y denotes the interpolation of two vectors, with the interpolation weight λ ∼ Beta(α_mixup, β_mixup) randomly sampled from a Beta distribution for each pair of training samples. The manifold mixup augmentation loss L^m_mixup is computed by:

L^m_mixup = KL( ω̃ ∥ Dec_{>k}( z̃^(k), Enc_{>j}( h̃^(j) ) ) )

Finally, our manifold adversarial augmentation method optimizes the combination of the original NMT training loss, the adversarial learning loss, and the mixup augmentation loss:

L = L_nmt + L^m_adv + L^m_mixup
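The interpolation m_λ applied to a pair of samples can be sketched as follows (a simplified NumPy version with illustrative names; in practice the two sequences must be padded to a common length, and the same interpolation with a skewed Beta(α_bt, β_bt) realizes the back-translation variant described later):

```python
import numpy as np

rng = np.random.default_rng(0)

def manifold_mixup(h_a, h_b, omega_a, omega_b, alpha=8.0, beta=8.0):
    """Mix two samples at a hidden layer.

    h_*:     (seq_len, d_model) hidden states of the two samples.
    omega_*: (seq_len, vocab) output distributions (soft targets).
    A single lambda is drawn per pair and applied to both the
    representations and the target distributions.
    """
    lam = rng.beta(alpha, beta)
    mix = lambda x, y: lam * x + (1.0 - lam) * y   # m_lambda(x, y)
    return mix(h_a, h_b), mix(omega_a, omega_b)
```

Because the same λ mixes both inputs and targets, the interpolated target distributions remain valid probability distributions (each row still sums to one).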

Extension to Back-Translation
Back-translation is an effective data augmentation method for machine translation. However, it is well known that pseudo training samples created from back-translation have different characteristics from the gold training samples, due to factors such as domain mismatch and translation errors. To bridge this gap, we extend the manifold mixup augmentation strategy to create virtual training samples that interpolate between a pseudo training sample and a gold training sample, again at hidden representations. We can adjust the parameters of the distribution Beta(α_bt, β_bt) generating the interpolation weight, biasing it toward the gold training sample to alleviate the aforementioned problems with back-translation. Let h̃_bt^(j), z̃_bt^(k), and ω̃_bt be the interpolation results; we define an additional training loss:

L^{m,bt}_mixup = KL( ω̃_bt ∥ Dec_{>k}( z̃_bt^(k), Enc_{>j}( h̃_bt^(j) ) ) )

Experiments

Setup
We conduct experiments on two language pairs: Chinese-English and English-French. For the Chinese-English translation task, we use the LDC corpus with 1.2M sentence pairs for training, NIST06 for validation, and NIST02, NIST03, NIST04, NIST05, NIST08 as the test sets. For the English-French translation task, we use the IWSLT 2016 corpus with 230k sentence pairs for training, test2012 for validation, and test2013 and test2014 as the test sets. All models are based on the Transformer architecture. Details of the data processing, model configuration, and training settings can be found in the appendix. We compare with the following methods:
• The vanilla Transformer model (Vaswani et al., 2017).
• The virtual adversarial regularization method in (Sano et al., 2019), which adds a proportion of the normalized gradient to the source and target word embeddings for adversarial training.
• The doubly adversarial inputs method in (Cheng et al., 2019), which performs adversarial learning with word substitutions in the source and target text based on language modeling and gradients at word embeddings.
• The AdvAug method in (Cheng et al., 2020), a state-of-the-art adversarial learning method for NMT, also described in Section 2.

Ablation Study

Table 2 presents the ablation study results for different loss functions. In addition to the two manifold adversarial augmentation loss functions described in Section 2, we also include their counterparts computed at the word embeddings for comparison. First, we always achieve better MT results with loss functions computed at the hidden representations than at the word embeddings, further validating our motivation that operating at higher hidden layers is superior. Second, we observe that adversarial learning and mixup augmentation are complementary, with the combination of the two achieving the best performance.

Results with Back Translation
We conduct back-translation experiments on the English-French task as it has a smaller training set and can potentially benefit more from back-translation. 25M French sentences from newscrawl07-11 are used as additional monolingual data and are translated to English using a Transformer model trained on only the parallel training data. As shown in Table 3, both the Transformer baseline and our method benefit from back-translation, although our method obtains a smaller improvement since it starts from a significantly higher BLEU score (in fact higher than the Transformer baseline with back-translation). With the addition of our specially designed mixup loss L^{m,bt}_mixup, which biases toward the gold training samples in mixup augmentation, our method achieves an extra gain of 0.65 and 0.21 BLEU on the two test sets.

Conclusion
In this paper, we present a simple yet effective manifold adversarial augmentation method for NMT. By training on virtual samples constructed through adversarial learning and mixup augmentation at higher-level hidden representations, our method can train more robust NMT models with improved translation performance.

A.1 Implementation Details

We follow the network settings of the original Transformer work. The models have 64,757,760 and 83,247,104 parameters for the English-French and Chinese-English translation tasks, respectively. The dropout ratio is 0.3. The model is optimized with Adam, using the inverse square root learning rate schedule with a peak learning rate of 5e-4 and 4000 warm-up steps. During decoding, the beam size is 4 and the length penalty is 0.6. We search the hyper-parameters for producing adversarial examples according to BLEU on the validation set. Finally, the maximum numbers of layers for manifold data augmentation on the source side K_src and target side K_tgt are both set to 3. We let α_adv = 0.5 and β_adv = 0.5 on the source side, and α_adv = 0.3 and β_adv = 0.7 on the target side. When mixing training example pairs, the hyper-parameters α_mixup and β_mixup are both set to 8 for the English-French translation task, and to 0.2 for the Chinese-English translation task. When mixing gold parallel sentences with back-translated ones, we let α_bt = 8 and β_bt = 4.
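For reference, the reported hyper-parameters can be collected in a single configuration dictionary (a hypothetical summary; the key names are illustrative and not from the authors' code):

```python
# Hypothetical configuration collecting the hyper-parameters reported above.
CONFIG = {
    "dropout": 0.3,
    "optimizer": "adam",
    "lr_schedule": "inverse_sqrt",
    "peak_lr": 5e-4,
    "warmup_steps": 4000,
    "beam_size": 4,
    "length_penalty": 0.6,
    "K_src": 3,
    "K_tgt": 3,
    "adv_beta_src": (0.5, 0.5),    # (alpha_adv, beta_adv) on the source side
    "adv_beta_tgt": (0.3, 0.7),    # (alpha_adv, beta_adv) on the target side
    "mixup_beta_en_fr": (8.0, 8.0),
    "mixup_beta_zh_en": (0.2, 0.2),
    "bt_beta": (8.0, 4.0),         # biased toward the gold sample
}
```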

We use 1 V100 GPU for the IWSLT English-French translation task and 4 V100 GPUs for the NIST Chinese-English translation task, taking about 24 and 72 hours of training, respectively.

A.2 Effect of K_src and K_tgt for manifold adversarial augmentation

Instead of performing manifold adversarial augmentation on a predetermined hidden layer, we randomly select j ∈ [0, K_src] and k ∈ [0, K_tgt] among a range of encoder and decoder layers, allowing more variation. We study their effect on the validation set of the NIST Chinese-English translation task, fixing K_src = 3 (or K_tgt = 3) while varying K_tgt (or K_src). As shown in Table 4, setting K_src or K_tgt too large or too small degrades performance by up to about 1 BLEU.
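The random layer selection described above can be sketched as follows (an illustrative helper, not the authors' implementation):

```python
import random

def pick_augmentation_layers(k_src=3, k_tgt=3, rng=None):
    """Sample the encoder layer j and decoder layer k uniformly from
    [0, K_src] and [0, K_tgt] (0 denoting the embedding layer), so a
    different pair of layers is augmented for each training batch."""
    rng = rng or random.Random()
    j = rng.randint(0, k_src)   # randint is inclusive on both ends
    k = rng.randint(0, k_tgt)
    return j, k
```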

A.3 Impact of different weights for losses
To further study the impact of the different losses, we set the training loss of our model to L = L_nmt + μ1 L^m_adv + μ2 L^m_mixup and compare performance under different values of μ1 and μ2. We conduct experiments on the IWSLT English-French translation task. As shown in Table 5, setting μ1 or μ2 too small degrades performance.