mixSeq: A Simple Data Augmentation Method for Neural Machine Translation

Data augmentation, which refers to manipulating the inputs (e.g., adding random noise, masking specific parts) to enlarge the dataset, has been widely adopted in machine learning. Most data augmentation techniques operate on a single input, which limits the diversity of the training corpus. In this paper, we propose a simple yet effective data augmentation technique for neural machine translation, mixSeq, which operates on multiple inputs and their corresponding targets. Specifically, we randomly select two input sequences, concatenate them into a longer input, concatenate their corresponding target sequences into an enlarged target, and train models on the augmented dataset. Experiments on nine machine translation tasks demonstrate that such a simple method boosts the baselines by a non-trivial margin. Our method can be further combined with single-input data augmentation methods to obtain further improvements.


Introduction
Data augmentation, which enlarges the training corpus by manipulating the inputs according to given rules, has been widely used in machine learning tasks. For image classification, there are various data augmentation methods, including cropping, flipping, rotating, cut-out (DeVries and Taylor, 2017), etc. For natural language processing (briefly, NLP), similar data augmentation methods also exist, such as randomly swapping words (Lample et al., 2018a), dropping words (Iyyer et al., 2015), and masking specific words (Xie et al., 2017). With data augmentation, the main content of the input is not affected, but noise is introduced so as to increase the diversity of the training set. The effectiveness of the above data augmentation methods has been verified by their strong performance improvements in both image processing and NLP tasks. For example, by combining data augmentation with meta-learning, Cubuk et al. (2019) achieved state-of-the-art results on image classification.
Most existing data augmentation methods take one sample from the training set as input, which might limit the scope and diversity of the training corpus. Mixup (Zhang et al., 2018) is a recently proposed data augmentation method where two samples from the training corpus are leveraged to build a synthetic sample. Specifically, let x_1, x_2 denote two images from the training set, and y_1, y_2 denote their corresponding labels. The synthetic sample (λx_1 + (1 − λ)x_2, λy_1 + (1 − λ)y_2), where λ is randomly generated, is added to the augmented dataset. Such a strategy is further enhanced in follow-up works (Zhang et al., 2019; Berthelot et al., 2019). Pair sampling (Inoue, 2018) is another data augmentation method where the synthetic sample is built as (0.5x_1 + 0.5x_2, y_1). To our knowledge, such ideas have not been leveraged in NLP tasks (e.g., machine translation). Therefore, in this work, we explore this direction to see whether augmenting data by mixing multiple sentences is helpful.
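For concreteness, a minimal sketch of the mixup operation for image classification might look as follows; drawing λ from a Beta distribution and the array shapes are illustrative assumptions rather than details taken from this paper.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Build one synthetic (image, label) pair by linear interpolation.

    x1, x2: two images as float arrays of identical shape.
    y1, y2: their one-hot label vectors of identical shape.
    alpha:  parameter of the Beta distribution used to draw lambda (assumed).
    """
    lam = np.random.beta(alpha, alpha)      # mixing coefficient lambda in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2         # interpolate the inputs
    y = lam * y1 + (1.0 - lam) * y2         # interpolate the labels
    return x, y
```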
In sequence learning tasks, two inputs x_1 and x_2 might contain different numbers of units (e.g., words or subwords). Besides, for sequence generation tasks, the labels y_1 and y_2 are also of different lengths. Therefore, it is not practical to sum them up directly. Instead, we choose to concatenate the two inputs and the two labels to obtain the synthetic data. We find that it is important to use a special token to separate the two sentences in a synthetic sample. We name our proposed method mixSeq.
mixSeq is a simple yet very efficient and effective data augmentation method. We conduct experiments on nine machine translation tasks and find that mixSeq boosts the baselines by 0.66 BLEU on average. In particular, on FLORES Sinhala↔English, our method improves the baseline by 1.03 points. mixSeq can be further combined with data augmentation methods that work on a single input, e.g., randomly dropping, swapping or masking words, to further improve the performance (see Table 3).
Normally, mixSeq randomly samples the two concatenated sequences. However, if the two concatenated sequences are contextually related, we can enhance mixSeq to a context-aware version, ctxMixSeq, which results in better performance (see Table 4).

Our Method
Notations: Let X and Y denote two language spaces, i.e., collections of sentences in the corresponding languages. The goal of neural machine translation (briefly, NMT) is to learn a mapping from X to Y.
Let D = {(x_i, y_i)}_{i=1}^{N} denote the bilingual NMT training corpus, where x_i ∈ X, y_i ∈ Y, and N is the number of training samples. Let concat(···) denote the concatenation operation, which merges the input sequences into a longer one, with each input separated by a space.

Training Algorithm: We propose mixSeq, a simple yet effective data augmentation method, which generates new samples by operating on two existing samples. The algorithm is shown in Algorithm 1.
Algorithm 1: mixSeq
Input: training corpus D = {(x_i, y_i)}_{i=1}^{N}; augmented dataset size Ñ.
1: D̃ ← ∅;
2: for k = 1, 2, ..., Ñ do
3:     (i, j) ← SamplingFunc(D);
4:     retrieve (x_i, y_i) and (x_j, y_j) from D;
5:     x̃_k ← concat(x_i, <sep>, x_j); ỹ_k ← concat(y_i, <sep>, y_j);
6:     D̃ ← D̃ ∪ {(x̃_k, ỹ_k)};
7: end for
8: Upsample or downsample D to size Ñ and get a new dataset D̂;
9: Train an NMT model on D̂ ∪ D̃, which is of size 2Ñ.
In mixSeq, the most important step is to build the augmented dataset D̃. As shown in lines 3 to 7 of Algorithm 1, we first sample two aligned sequence pairs (x_i, y_i) and (x_j, y_j) (the design of the sampling rule SamplingFunc is left to the next part). Then we concatenate their source sentences and their target sentences respectively, with a special token <sep> separating the two samples, and obtain two longer sequences, x̃_k and ỹ_k (line 5 in Algorithm 1). We eventually obtain the augmented dataset D̃ with size Ñ. After that, we upsample or downsample D to the same size Ñ and obtain D̂. Finally, we train our translation models on D̂ ∪ D̃.

Design of SamplingFunc: We have two forms of SamplingFunc, which correspond to two variants of our algorithm: (1) In the general case, SamplingFunc randomly samples i and j from {1, 2, ..., N}. For ease of reference, we still use mixSeq to denote this variant.
(2) When contextual information is available, i.e., the parallel data is extracted from a pair of aligned documents, SamplingFunc only samples consecutive sentences in a given document. Assuming x_i/y_i denote the i-th sentence in the document, SamplingFunc only samples (i, i + 1) index pairs. We use ctxMixSeq to denote this variant. ctxMixSeq is related to context-aware machine translation (Tiedemann and Scherrer, 2017). The difference is that, during inference, ctxMixSeq uses a single sequence as the input, while Tiedemann and Scherrer (2017) use multiple sequences including the contextual information. A sketch of both sampling variants is given below.

Discussions: mixSeq operates on two sequences, while previous data augmentation methods such as randomly dropping, swapping or masking words usually operate on a single sequence. These methods can be combined with mixSeq to bring further improvements (see Table 3).
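To make the two sampling variants concrete, below is a minimal Python sketch of the augmentation step, assuming the corpus is held in memory as a list of whitespace-tokenized (source, target) string pairs; the function and variable names are ours for illustration and do not come from a released implementation.

```python
import random

SEP = "<sep>"  # special separator token, added to the shared vocabulary


def mix_seq(corpus, n_aug):
    """mixSeq: build the augmented dataset by concatenating random sample pairs.

    corpus: list of (src, tgt) pairs, each a whitespace-tokenized string.
    n_aug:  number of synthetic pairs to generate (the paper's N-tilde).
    """
    augmented = []
    for _ in range(n_aug):
        # SamplingFunc, variant (1): draw two distinct samples uniformly at random.
        (x_i, y_i), (x_j, y_j) = random.sample(corpus, 2)
        x_k = " ".join([x_i, SEP, x_j])   # concatenated source with <sep>
        y_k = " ".join([y_i, SEP, y_j])   # concatenated target with <sep>
        augmented.append((x_k, y_k))
    return augmented


def ctx_mix_seq(corpus, doc_spans):
    """ctxMixSeq: concatenate only consecutive sentences within a document.

    corpus:    list of (src, tgt) pairs stored in document order.
    doc_spans: list of (start, end) index ranges, one per document
               (assumed to be available from the document-aligned data).
    """
    augmented = []
    for start, end in doc_spans:
        # SamplingFunc, variant (2): only (t, t + 1) index pairs inside a document.
        for t in range(start, end - 1):
            x_i, y_i = corpus[t]
            x_j, y_j = corpus[t + 1]
            augmented.append((" ".join([x_i, SEP, x_j]),
                              " ".join([y_i, SEP, y_j])))
    return augmented


def resample(corpus, size):
    """Upsample or downsample the original corpus D to the same size as the augmented set."""
    return [random.choice(corpus) for _ in range(size)]
```

Under this sketch, the final training set would simply be resample(corpus, n_aug) + mix_seq(corpus, n_aug); concatenating three sequences (see the analysis section) only requires sampling three indices instead of two.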

Experiments
We conduct experiments on the following machine translation tasks to evaluate our method: IWSLT'14 German↔English and Spanish↔English; FLORES English↔Nepali and English↔Sinhala; and WMT'14 English→German.
We abbreviate English, German, Spanish, Nepali and Sinhala as En, De, Es, Ne and Si.

Setup
Datasets: For IWSLT'14 De↔En, following Edunov et al. (2018), we lowercase all words, tokenize them, and apply BPE with 10k merge operations (Sennrich et al., 2016) to obtain the subword representations. The validation set is split from the training set, and the test set is the concatenation of tst2010, tst2011, tst2012, dev2010 and dev2012. For IWSLT'14 Es↔En, the preprocessing is the same as that for De↔En except that words are not lowercased. We use tst2013 and tst2014 as the validation and test sets respectively. For the FLORES En↔Ne and En↔Si datasets, we use the BPE version of the data provided by Guzmán et al. (2019). For WMT'14 En→De, we concatenate newstest2012 and newstest2013 as the validation set and use newstest2014 as the test set. The statistics of the datasets are shown in Table 1. On all tasks, the vocabulary is shared between the source language and the target language.

Models and Training Strategy: For mixSeq, we set Ñ as 5N; for ctxMixSeq, we set Ñ as N. We choose Transformer (Vaswani et al., 2017) as our translation model. For the IWSLT tasks, the embedding dimension, feed-forward network dimension and number of layers of the Transformer models are 256, 1024 and 6 respectively. The dropout rate is 0.3. The batch size is 6000 tokens, and we train the models for 300k steps. For the FLORES tasks, we use exactly the same architecture and training strategy as those in Guzmán et al. (2019) for fair comparison. The model is a 5-layer Transformer with embedding dimension 512 and feed-forward network dimension 2048. The batch size is 16k. The baseline model is trained for 100 epochs, while mixSeq is trained for 10 epochs, considering that our enlarged dataset is 10 times larger than the original one. For the WMT task, the embedding dimension, feed-forward network dimension and number of layers of the Transformer models are 1024, 4096 and 6 respectively. The batch size is 4096 tokens per GPU. We train on eight V100 GPUs and accumulate gradients for 16 steps before updating.
For all models, we use Adam with learning rate 5 × 10^-4 and the inverse-sqrt learning rate scheduler. All models are trained until convergence.

Evaluation: We use beam search with beam width 5 and length penalty 1.0 to generate sequences. The generation quality is evaluated by BLEU.
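For reference, the BLEU computation could be reproduced with the sacrebleu Python package as below; the paper does not state which BLEU implementation was used, so this is only one reasonable choice, not necessarily the original evaluation script.

```python
import sacrebleu

def corpus_bleu_score(hypotheses, references):
    """Corpus-level BLEU for lists of hypothesis and reference sentence strings."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

# Toy usage: an identical hypothesis and reference give a BLEU of 100.
print(corpus_bleu_score(["the cat sat on the mat"],
                        ["the cat sat on the mat"]))
```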

Results
The results of the standard Transformer and mixSeq on the small-scale datasets are shown in the first section of Table 2. We adapt another baseline, pair sampling (Inoue, 2018), to NMT for comparison, which produces a synthetic dataset D̃_ps made up of pairs (concat(x_1, <sep>, x_2), y_1), where (x_1, y_1) ∈ D and (x_2, y_2) ∈ D. The results of pair sampling (briefly, PS) are in the third column of Table 2. mixSeq generally brings good improvements and significantly outperforms the baseline on all tasks except two (En→De and En→Si). The pair sampling baseline performs poorly on all tasks. This is because pair sampling requires the translation model to translate the first part of the input (i.e., x_1) while ignoring the second part (i.e., x_2), which is against the goal of NMT. It is also worth noting that the time and number of steps required to converge on the augmented dataset and on the original dataset are similar. We also evaluate mixSeq on a large-scale dataset, WMT'14 En→De, and the results are shown in the second section of Table 2. Due to resource limitations, we do not try pair sampling. Our method improves the BLEU score by 0.46, which shows that mixSeq is a generally effective method for NMT.
We further compare and combine our method with data augmentation methods that operate on a single sequence, including randomly dropping, masking and swapping words. We conduct experiments on IWSLT'14 De↔En. As shown in Table 3, our method brings further improvements when combined with existing single-sequence data augmentation methods. The baseline is improved by up to 0.82 BLEU.
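As an illustration of the single-sequence operations we combine with mixSeq, a rough sketch is given below; the dropping/masking probabilities and the number of swaps are illustrative defaults of ours, not the exact settings of the cited methods.

```python
import random

def word_drop(tokens, p=0.1):
    """Randomly drop each word with probability p (keep at least one token)."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else tokens[:1]

def word_swap(tokens, n_swaps=1):
    """Randomly swap n_swaps pairs of adjacent words."""
    tokens = list(tokens)
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i = random.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

def word_mask(tokens, p=0.1, mask="<mask>"):
    """Replace each word with a mask token with probability p."""
    return [mask if random.random() < p else t for t in tokens]
```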

To verify the effectiveness of ctxMixSeq, we conduct experiments on IWSLT'14 En↔De, where contextual information is available. As discussed in Section 2, Tiedemann and Scherrer (2017) is similar to ctxMixSeq, except that it takes two sequences concat(x_{t−1}, <sep>, x_t) as the input during inference. We denote this inference method as 2in (two inputs). Another baseline proposed by Tiedemann and Scherrer (2017) trains the NMT model on the dataset D ∪ D̃_a, where D̃_a = {(concat(x_{t−1}, <sep>, x_t), y_t)}_{t=2}^{N}. This can be seen as a context-aware version of pair sampling, and we briefly denote it as ctxPS. The results are in Table 4. ctxMixSeq outperforms all baselines proposed by Tiedemann and Scherrer (2017). Compared to mixSeq, ctxMixSeq brings consistent improvements, especially when combined with mixSeq.

With mixSeq, we find that the source-target alignment is enhanced. We visualize the source-target attention maps obtained by our method. Given the input concat(x_i, <sep>, x_j) and the corresponding translation concat(y_i, <sep>, y_j), we find that most of the attention weight of y_i is assigned to x_i, with little assigned to x_j. A similar phenomenon is observed for y_j. In this way, the attention mechanism is enhanced, which might explain the performance improvements.
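A minimal plotting sketch for inspecting such a source-target attention map is shown below; it assumes the attention weights have already been extracted from the model as a (target length × source length) array, which depends on the toolkit and is not covered here.

```python
import matplotlib.pyplot as plt

def plot_attention(attn, src_tokens, tgt_tokens, path="attention.png"):
    """Plot a target-by-source attention heatmap.

    attn: array of shape (len(tgt_tokens), len(src_tokens)); each row is the
          attention distribution of one target position over the source tokens.
    """
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.imshow(attn, aspect="auto", cmap="viridis")
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=90)
    ax.set_yticks(range(len(tgt_tokens)))
    ax.set_yticklabels(tgt_tokens)
    ax.set_xlabel("source: x_i <sep> x_j")
    ax.set_ylabel("target: y_i <sep> y_j")
    fig.tight_layout()
    fig.savefig(path)
```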

Analysis
In this section, we conduct an ablation study on the usage of <sep> and examine the effect of concatenating more than two sequences.

Ablation Study of the Usage of <sep>
To evaluate the effect of the <sep> token, we remove <sep> from the concatenated sequences as another baseline. We conduct experiments on the IWSLT En↔Es and FLORES En↔Ne datasets, and report the results in Table 5. We find that our method performs poorly without <sep>, sometimes even worse than the standard Transformer. Our conjecture is that <sep> helps the model learn to align each part of the input to the corresponding part of the output, which improves representation learning.

Concatenating More Sequences
We wonder whether the BLEU scores can be further boosted by concatenating more sequences. We move a step forward by randomly concatenating three sequences, and build a synthetic dataset D̃_3 with Ñ_3 examples. Experiments are conducted on the FLORES En↔{Ne, Si} datasets, and the results are shown in Table 6. In the third and fourth rows, Ñ = Ñ_3 = 5N. In the last row, we set Ñ = Ñ_3 = 2.5N to ensure the number of synthetic samples remains the same. The results show that, although both the D ∪ D̃_3 and D ∪ D̃ ∪ D̃_3 settings can bring some improvements, the improvements are not consistent across datasets. Further work is needed on how to use more samples for data augmentation.

Related Work
Most existing data augmentation methods in NMT operate on a single input. Fadaee et al. (2017) replaced common words with rare words under the guidance of language models to improve the translation of rare words. In unsupervised learning, Lample et al. (2018b) proposed to randomly drop, swap, or mask words, and Gao et al. (2019) verified the effectiveness of such methods in supervised NMT. RAML (Norouzi et al., 2016) randomly inserted, deleted or substituted words in the target sequence with a probability exponentially decreasing in the edit distance. SwitchOut (Wang et al., 2018) extended RAML by manipulating both the source side and the target side. Gao et al. (2019) also proposed to "softly replace" words by replacing the one-hot representation of a word with a distribution over the vocabulary. A concurrent work similar to ours is Kondo et al. (2021), where <sep> is not leveraged. In other fields, data augmentation methods operating on multiple samples have been proposed. Mixup (Zhang et al., 2018) generated a synthetic sample by averaging two inputs and their labels, and was further applied to semi-supervised learning to enlarge the dataset (Berthelot et al., 2019). Pair sampling (Inoue, 2018) only averaged the two inputs but not the labels.

Conclusion and Future Work
In this work, we proposed a simple yet effective data augmentation method for NMT, which randomly concatenates two training samples to enlarge the dataset. Experiments on nine machine translation tasks demonstrate the effectiveness of our method. For future work, there are several directions to explore. First, we will apply our method to more NLP tasks. Second, we will theoretically analyze when and why our method works. Third, we will study and design more effective data augmentation methods.