Data Augmentation for Text Generation Without Any Augmented Data

Data augmentation is an effective way to improve the performance of many neural text generation models. However, current data augmentation methods need to define or choose proper data mapping functions that map the original samples into the augmented samples. In this work, we derive an objective to formulate the problem of data augmentation on text generation tasks without any use of augmented data constructed by specific mapping functions. Our proposed objective can be efficiently optimized and applied to popular loss functions on text generation tasks with a convergence rate guarantee. Experiments on five datasets of two text generation tasks show that our approach can approximate or even surpass popular data augmentation methods.


Introduction
End-to-end neural models are generally trained in a data-driven paradigm. Many researchers have proposed powerful network structures to fit training data well. It has also become ubiquitous to increase the amount of training data to improve model performance. Data augmentation is an effective technique to create additional samples in both vision and text classification tasks (Perez and Wang, 2017; Shorten and Khoshgoftaar, 2019; Wei and Zou, 2019) by perturbing samples without changing their labels. For text generation tasks, there can be more types of data perturbation to construct augmented samples, including corrupting the input text (Xie et al., 2017), the output text (Norouzi et al., 2016; Kurata et al., 2016), or both (Zhang et al., 2020). As such, classification tasks can be regarded as special cases of generation tasks in terms of incorporating data augmentation techniques, and this work mainly discusses text generation tasks.
The focus of previous work on text data augmentation has been to design proper augmentation techniques to create augmented samples. Some augmentation methods have been proposed for general text tasks. For example, different general replacement operations have been explored to edit words in a text sample, ranging from simple look-up tables (Zhang et al., 2015) to pretrained masked language models (Kobayashi, 2018; Wu et al., 2019). Sennrich et al. (2016) propose to augment text sequences by back-translation. For some generation tasks such as dialogue generation, general augmentation methods may not yield stable improvements, and useful augmented samples have to be designed carefully around the properties of the task (Zhang et al., 2020). All these methods need to explicitly construct augmented samples, and the data mapping functions from the original samples to the augmented samples are mostly defined a priori. This motivates us to ask whether we can skip the step of defining or choosing proper augmented data mapping functions and still accomplish effective data augmentation.
To answer this question, we aim to formulate the problem of data augmentation for general text generation models without any use of augmented data mapping functions. We start from a conventional data augmentation objective, which is a weighted combination of loss functions associated with the original and augmented samples. We show that the loss parts of the augmented samples can be re-parameterized by variables that do not depend on the augmented data mapping functions, if a simple Euclidean loss function between the sentence representations is applied. Based on this observation, we propose to directly define a distribution on the re-parameterized variables. Then we optimize the expectation of the augmented loss parts over this distribution to approximate the original augmented loss parts computed with various augmented data mapping functions. We make different assumptions on the variable distributions and find that our proposed objective can be computed and optimized efficiently by simple gradient weighting. If stochastic gradient descent (SGD) is used, our objective is guaranteed to converge at a rate of $O(1/\sqrt{T})$. Our objective can be coupled with popular loss functions on text generation tasks, including the word mover's distance (Kusner et al., 2015) and the cross-entropy loss.
Our approach, which utilizes the proposed objective and optimizes it by SGD, has two advantages. First, it provides a unified formulation of various data perturbation types in general text generation models, which sheds light on the working mechanism of data augmentation. Second, the optimization of our approach is simple and efficient. Because no new sample is introduced during training, we avoid additional computation on augmented samples, whose total size is often much larger than the original data size. Hence, our approach maintains high training efficiency.
Extensive experiments are conducted to validate the effectiveness of our approach. We mainly use the LSTM-based network structure (Bahdanau et al., 2015; Luong et al., 2015b) and perform experiments on two text generation tasks: neural machine translation and single-turn conversational response generation. Results on five datasets demonstrate that the proposed approach can approximate or even surpass popular data augmentation methods such as masked language models (Devlin et al., 2019) and back-translation (Sennrich et al., 2016).

Related Work
Data augmentation has shown promising improvements on neural models for different text generation tasks such as language modeling (Xie et al., 2017), machine translation (Sennrich et al., 2016) and dialogue generation (Niu and Bansal, 2019; Cai et al., 2020). Existing text data augmentation methods can be mainly categorized into word-level augmentation and sentence-level augmentation.
Word-level augmentation methods perturb words within the original sentence. Common operations include word insertion and deletion (Wei and Zou, 2019), synonym replacement (Zhang et al., 2015), and embedding mix-up (Guo et al., 2019). Masked language models can be used by masking some percentage of tokens at random and predicting the masked words based on their context (Wu et al., 2019; Cai et al., 2020).
Sentence-level data augmentation is not limited to editing only a few words in the original sentence, but generates a complete sentence. For example, back-translation was originally proposed to translate monolingual target language data into the source language to augment training pairs in machine translation (Sennrich et al., 2016). It was later extended to paraphrase sentences in any text dataset, in which two translation models are applied: one translation model from the source language to the target language and another from the target to the source. GAN-based and VAE-based models have also achieved impressive results in creating entire sentences to augment the training data (Hu et al., 2017; Cheng et al., 2019). For dialogue generation, retrieved sentences can be a good supplement to the original corpus (Zhang et al., 2020).
Both word-level and sentence-level augmentation methods need to define their augmented data mapping functions (i.e. operations to edit words or models to generate sentences) a priori. Some works train policies to sample a set of word-level operations (Niu and Bansal, 2019), but the operation candidates are still pre-defined. A few works learn to construct augmented samples and optimize the network jointly (Hu et al., 2019; Cai et al., 2020). Different from previous work, our goal is not to propose or learn novel augmented data mapping functions. Instead, we investigate whether the effectiveness of data augmentation can be achieved without using any specific augmented data mapping function.
Besides data augmentation, data weighting is another useful way to improve model learning. It assigns a weight to each sample to adapt its importance during training. The sample weights are often carefully defined (Freund and Schapire, 1997; Bengio et al., 2009) or learnt by another network (Jiang et al., 2018; Shu et al., 2019). Data augmentation is often combined with data weighting to weight the original and augmented samples.

Background
We are given original samples $\mathcal{D} = \{(x, y)\}$, where $x$ and $y$ are both text sequences. Without loss of generality, a deep generation model learns a mapping function $f_{x,y}$, realized by a deep neural network, that outputs $y$ given $x$. As mentioned in the introduction, text generation tasks mainly have three types of augmented data:
• one (or several) perturbed input text $\hat{x}$ produced by one (or several) augmented data mapping functions $\phi_{\hat{x}}$;
• one (or several) perturbed output text $\hat{y}$ produced by one (or several) augmented data mapping functions $\phi_{\hat{y}}$;
• one (or several) perturbed paired text $(\hat{x}, \hat{y})$ produced by corresponding augmented data mapping functions.
Proper augmented data mapping functions are often supposed to generate perturbed sequences or sequence pairs that are close to the original one. For now, they are assumed to be given a priori when optimizing the generation model.
Let $\ell(f_{x,y}(x), y)$ denote the loss function to be minimized for each sample. We first use augmented data in the input domain as an example to present the problem formulation and introduce our approach, and later discuss the other types of augmented data. Data augmentation methods generally apply an augmented loss per sample together with its augmented samples:

$$\ell_{aug} = \ell(f_{x,y}(x), y) + \sum_{\phi_{\hat{x}} \in \mathcal{F}} w_{\hat{x}} \, \ell(f_{x,y}(\hat{x}), y), \tag{1}$$

where $w_{\hat{x}}$ is the importance weight associated with each augmented sample, $\phi_{\hat{x}}$ is the augmented data mapping function that constructs $\hat{x}$, and $\mathcal{F}$ is the function space containing all feasible augmented data mapping functions.
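To make the conventional objective concrete, the following minimal sketch evaluates an augmented loss of this form with a Euclidean loss. It is only an illustration: sentences and model outputs are stood in for by small numeric vectors, and the names `euclidean_loss`, `augmented_loss`, `shift` and `scale` are hypothetical and do not come from the paper.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def euclidean_loss(u: Vector, v: Vector) -> float:
    """Euclidean distance between two (toy) sentence representations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def augmented_loss(
    model: Callable[[Vector], Vector],
    x: Vector,
    y: Vector,
    mappings: Sequence[Callable[[Vector], Vector]],
    weights: Sequence[float],
) -> float:
    """Original loss plus a weighted sum of losses on perturbed inputs."""
    total = euclidean_loss(model(x), y)
    for phi, w in zip(mappings, weights):
        x_hat = phi(x)  # explicitly constructed augmented input
        total += w * euclidean_loss(model(x_hat), y)
    return total


# Toy usage: an identity "model" and two hypothetical perturbations.
identity = lambda v: list(v)
shift = lambda v: [a + 1.0 for a in v]
scale = lambda v: [2.0 * a for a in v]
loss = augmented_loss(identity, [0.0, 0.0], [3.0, 4.0], [shift, scale], [0.5, 0.5])
```

Every mapping in `mappings` costs one extra forward pass per sample, which is the overhead the proposed objective is meant to avoid.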

Our Approach
In this section, we aim to formulate the problem of data augmentation for general text generation models without any use of augmented data mapping functions. We introduce our approach by assuming that the loss function is the simplest one, the Euclidean distance, i.e.

$$\ell(u, v) = \|u - v\|_2, \tag{2}$$

where $u$ and $v$ are the sentence representations of two sentences, i.e. the target sequence and the predicted sequence. Other conventional loss functions in text generation are discussed in Section 5. In Sec 4.1, we first rewrite each loss part of an augmented data point in (1) in a polar coordinate system. In this way, we can regard the total augmented loss part with multiple augmented data mapping functions as sampling different points in the polar coordinate system. This suggests that we can skip defining any augmented data mapping function and only design a joint distribution of the perturbation radius and perturbation angle in the polar coordinate system. In Sec 4.2, we present two instantiations of this probability distribution and find that our approach can be optimized efficiently by simply re-weighting the gradients. In Sec 4.3, we discuss the extension of our approach to other augmented data mapping function types.

Proposed Objective
By treating $f_{x,y}(x)$, $f_{x,y}(\hat{x})$ and $y$ as three vertices in the Euclidean space, we can form a triangle (illustrated in Fig. 1a) with the three vertices and the losses between them as edges. For a given augmented data mapping function $\phi_{\hat{x}}$ and a sample $(x, y)$, we can rewrite $\ell(f_{x,y}(\hat{x}), y)$ using the polar coordinate system with $f_{x,y}(x)$ as the pole and $(f_{x,y}(x), y)$ as the polar axis:

$$\ell(f_{x,y}(\hat{x}), y) = \sqrt{\ell(f_{x,y}(x), y)^2 + r^2 - 2\, r\, \ell(f_{x,y}(x), y) \cos\theta}, \tag{3}$$

where $\theta$ is the polar angle (in radians) of $f_{x,y}(\hat{x})$. We can observe that the rewritten augmented sample loss part depends on the original sample loss $\ell(f_{x,y}(x), y)$ as well as the radius $r$ and the angle $\theta$ of $f_{x,y}(\hat{x})$. Here $r$ is the data perturbation distance $\ell(f_{x,y}(x), f_{x,y}(\hat{x}))$. Therefore, we can map each augmented data mapping function $\phi_{\hat{x}} \in \mathcal{F}$ into $(r, \theta) \in P$, where $P$ is a joint distribution of $(r, \theta)$. A weighted summation of the augmented loss parts from different augmented data mapping functions can be seen as an empirical estimate of the expectation of the rewritten loss, obtained by sampling different $(r, \theta)$'s from their joint distribution $P$, though the corresponding ground truth $P$ is not observed.
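The polar rewriting is the law of cosines, which can be checked numerically. The sketch below assumes toy 2-D points standing in for $f_{x,y}(x)$, $f_{x,y}(\hat{x})$ and $y$; the function names `polar_coordinates` and `phi` are illustrative, and the sketch requires the original loss and the perturbation distance to be non-zero.

```python
import math


def polar_coordinates(fx, fx_hat, y):
    """Return (e, r, theta) with fx as the pole and the ray fx -> y as the polar axis."""
    e = math.dist(fx, y)       # original loss, i.e. length of the polar axis
    r = math.dist(fx, fx_hat)  # perturbation distance
    # Angle between the vectors (y - fx) and (fx_hat - fx).
    dot = sum((a - p) * (b - p) for a, b, p in zip(y, fx_hat, fx))
    theta = math.acos(max(-1.0, min(1.0, dot / (e * r))))
    return e, r, theta


def phi(e, r, theta):
    """Length of the third triangle side given two sides and the enclosed angle."""
    return math.sqrt(max(0.0, e * e + r * r - 2.0 * e * r * math.cos(theta)))


fx, fx_hat, y = (0.0, 0.0), (1.0, 1.0), (3.0, 0.0)
e, r, theta = polar_coordinates(fx, fx_hat, y)
rewritten = phi(e, r, theta)      # loss of the augmented sample via (e, r, theta)
direct = math.dist(fx_hat, y)     # the same loss computed directly
```

In this example `rewritten` and `direct` agree up to floating-point error, so the augmented loss is fully described by the original loss `e` and the pair `(r, theta)`.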
This suggests how to avoid specifically designing or choosing several augmented data mapping functions and their weights used in (1). We can directly design the distribution $P$ of $(r, \theta)$ and optimize the expectation of the rewritten loss (i.e. the right-hand side of (3)) under this distribution. Hence, we propose to optimize the following objective to mimic the effect of data augmentation:

$$\ell_{our} = \ell(f_{x,y}(x), y) + \mathbb{E}_{(r,\theta) \in P}\big[\Phi(\ell(f_{x,y}(x), y); r, \theta)\big], \tag{4}$$

where

$$\Phi(e; r, \theta) = \sqrt{e^2 + r^2 - 2 e r \cos\theta}. \tag{5}$$

Optimization
We now design specific distributions of $(r, \theta)$ for the proposed objective (4) and describe how to optimize it. We assume the two variables are independent:

$$p(r, \theta) = p(r)\, p(\theta). \tag{6}$$

In the following corollary, we first show the result obtained by assuming that both $r$ and $\theta$ follow uniform distributions. Recall that proper data mapping functions augment samples close to the original one. An ideal case is thus to perturb samples so that their output representations uniformly surround that of the original sample. The uniform distribution with a small perturbation radius upper bound $R$ can simulate this ideal case.
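Under the uniform assumption, the expectation of the rewritten loss can be estimated by plain Monte Carlo sampling. The sketch below is only an illustration of that expectation, not part of the training procedure; `expected_phi_uniform`, its sample count and its seed are hypothetical choices.

```python
import math
import random


def phi(e, r, theta):
    """Rewritten augmented loss as a function of the original loss e and (r, theta)."""
    return math.sqrt(max(0.0, e * e + r * r - 2.0 * e * r * math.cos(theta)))


def expected_phi_uniform(e, R, num_samples=20000, seed=0):
    """Monte Carlo estimate of E[phi(e; r, theta)] for r ~ U(0, R), theta ~ U(0, pi)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        r = rng.uniform(0.0, R)            # perturbation radius
        theta = rng.uniform(0.0, math.pi)  # perturbation angle, independent of r
        total += phi(e, r, theta)
    return total / num_samples


estimate = expected_phi_uniform(e=2.0, R=0.5)
```

For a small radius bound `R` the estimate stays close to the original loss `e`, which matches the intuition that proper augmentations remain near the original sample.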
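Corollary 1. We are given the perturbation distance upper bound $R$ and assume that

$$r \sim U(0, R), \quad \theta \sim U(0, \pi). \tag{7}$$

Then $\mathbb{E}_{(r,\theta) \in P}[\Phi(\ell; r, \theta)]$, with $\ell$ short for $\ell(f_{x,y}(x), y)$, is upper bounded by $C_1 \cdot \ell^2 + C_2(R)$, where $C_1$ is a constant and $C_2(R)$ is another constant dependent on $R$.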
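As a rough plausibility check on a bound of this shape, and not necessarily the argument given in the Appendix, Jensen's inequality combined with $\sqrt{a} \le (a + 1)/2$ gives a quadratic upper bound in the original loss $e$. The constants $1/2$ and $R^2/6 + 1/2$ below are artifacts of this particular sketch and should be read as assumptions, not as the $C_1$ and $C_2(R)$ of the corollary.

```latex
\begin{align*}
\mathbb{E}_{r,\theta}\big[\Phi(e; r, \theta)\big]
  &\le \sqrt{\mathbb{E}_{r,\theta}\big[e^2 + r^2 - 2 e r \cos\theta\big]}
     && \text{(Jensen, $\sqrt{\cdot}$ is concave)} \\
  &= \sqrt{e^2 + \mathbb{E}[r^2] - 2 e\, \mathbb{E}[r]\, \mathbb{E}[\cos\theta]}
     && \text{($r$ and $\theta$ are independent)} \\
  &= \sqrt{e^2 + R^2/3}
     && \text{($\mathbb{E}[\cos\theta] = 0$, $\mathbb{E}[r^2] = R^2/3$)} \\
  &\le \tfrac{1}{2} e^2 + \tfrac{1}{6} R^2 + \tfrac{1}{2}
     && \text{($\sqrt{a} \le (a + 1)/2$ for $a \ge 0$)}
\end{align*}
```

The dependence on the model enters only through $e^2$, which is why the gradient of such a bound is the original gradient scaled by the loss value.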
The proof is in the Appendix. With the above result, we can optimize the objective in (4) by minimizing the derived upper bound. We calculate its gradient:

$$\nabla_\Theta \ell_{our} = \nabla_\Theta \ell + 2 C_1\, \ell\, \nabla_\Theta \ell, \tag{8}$$

where $\ell$ is short for $\ell(f_{x,y}(x), y)$ and $\Theta$ contains all neural model parameters. The major difference between the above gradient and that of the objective in (1) lies in the second part of (8), which weights the original gradient by the loss value. This means that the performance improvement brought by data augmentation under our formulation can be equivalently accomplished by specialized data weighting. Indeed, many data weighting methods (Lin et al., 2017) favor hard examples by reducing the gradient contribution from easy examples and increasing the importance of hard examples (examples with a large loss value in our approach), which significantly boosts performance. This in turn suggests that the simple uniform distributions assumed here are reasonable and effective.
Instead of a uniform distribution on both variables, we can assume a uniform distribution on $\theta$ but an exponential distribution on $r$, such that a small perturbation distance is preferred with a higher probability.
Corollary 2. We are given the expected value of the perturbation distance as $R$ and assume that

$$r \sim \mathrm{Exp}(1/R), \quad \theta \sim U(0, \pi).$$

Then $\mathbb{E}_{(r,\theta) \in P}[\Phi(\ell; r, \theta)]$ is upper bounded by $C_1(R) \cdot \ell^2 + C_2(R)$, where $C_1(R)$ and $C_2(R)$ are constants dependent on $R$.
The proof is in the Appendix. The above corollary shows that even if different distributions are assumed, we can still use gradient weighting to optimize the proposed objective, where $C_1(R)$ can be set as a hyper-parameter.
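The gradient weighting described above can be sketched in a few lines. The sketch assumes the surrogate objective has the form loss + C1 * loss^2 (up to an additive constant), so that the per-sample gradient scale is 1 + 2 * C1 * loss; the toy one-parameter loss, the function names, and the clipping range are hypothetical and only illustrate the mechanism.

```python
def gradient_weight(loss, c1, w_min=0.0, w_max=float("inf")):
    """Per-sample gradient scale 1 + 2 * c1 * loss, clipped to [w_min, w_max]."""
    return max(w_min, min(w_max, 1.0 + 2.0 * c1 * loss))


def sgd_step(theta, target, lr, c1, w_max=10.0):
    """One SGD step on the toy loss |theta - target| with loss-based weighting."""
    loss = abs(theta - target)
    # Sub-gradient of |theta - target| with respect to theta.
    grad = 0.0 if loss == 0.0 else (1.0 if theta > target else -1.0)
    w = gradient_weight(loss, c1, 1.0, w_max)
    return theta - lr * w * grad


theta = 0.0
for _ in range(5):
    # Samples with a larger loss take larger effective steps.
    theta = sgd_step(theta, target=3.0, lr=0.1, c1=0.5)
```

No augmented sample is constructed: the only change relative to plain SGD is the scalar `w`, with `c1` playing the role of the hyper-parameter and the clipping keeping the weight bounded.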
If the loss is Lipschitz smooth, which is the case for the Euclidean distance, we can prove the convergence of our approach with the convergence rate $O(1/\sqrt{T})$. The proof, which extends results in Reddi et al. (2016), is provided in the Appendix.
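Theorem 1. Suppose $\ell_{our}$ is in the class of finite-sum Lipschitz smooth functions, has $\delta$-bounded gradients, and the weight of the loss gradient is clipped to be bounded by $[w_1, w_2]$. Let the learning rate of SGD be $\alpha_t = c/\sqrt{T}$, where the constant $c$ depends on the Lipschitz constant $L$ and on an optimal solution $\Theta^*$. Then the iterates of SGD of our approach with $\ell_{our}$ satisfy:

$$\min_{0 \le t \le T-1} \mathbb{E}\big[\|\nabla \ell_{our}(\Theta^t)\|^2\big] \le O(1/\sqrt{T}).$$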

Other Types of Augmented Data
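We now discuss how our approach can be applied to other types of augmented data. For augmented data on the output domain, the objective in (1) becomes:

$$\ell_{aug} = \ell(f_{x,y}(x), y) + \sum_{\phi_{\hat{y}} \in \mathcal{F}} w_{\hat{y}} \, \ell(f_{x,y}(x), \hat{y}). \tag{11}$$

The augmented loss part can be rewritten using the polar coordinate system with $y$ as the pole and $(y, f_{x,y}(x))$ as the polar axis, illustrated in Fig. 1b:

$$\ell(f_{x,y}(x), \hat{y}) = \sqrt{\ell(f_{x,y}(x), y)^2 + r^2 - 2\, r\, \ell(f_{x,y}(x), y) \cos\theta}. \tag{12}$$

Similarly, the augmented data mapping function $\phi_{\hat{y}}$ can be re-parameterized into a function of the radius $r = \ell(y, \hat{y})$ (still the perturbation distance) and the polar angle $\theta$ of $\hat{y}$. The objective turns out to be the same as (4).
For data perturbation on both the input and output space, we have:

$$\ell_{aug} = \ell(f_{x,y}(x), y) + \sum_{\phi_{\hat{x},\hat{y}} \in \mathcal{F}} w_{\hat{x},\hat{y}} \, \ell(f_{x,y}(\hat{x}), \hat{y}). \tag{13}$$

As illustrated in Fig. 1c, we first make use of the triangle inequality:

$$\ell(f_{x,y}(\hat{x}), \hat{y}) \le \ell(f_{x,y}(\hat{x}), y) + \ell(y, \hat{y}).$$

Using (3) and (12), the objective is rewritten as:

$$\ell(f_{x,y}(x), y) + \mathbb{E}_{(r,\theta) \in P}\big[\Phi(\ell(f_{x,y}(x), y); r, \theta)\big] + \mathbb{E}_{(r,\theta) \in P}[r].$$

Note that $\mathbb{E}_{(r,\theta) \in P}[r]$ is a scalar that does not depend on any learnable parameter. Thus optimizing the above objective is equivalent to optimizing (4).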
From the above analysis, we can see that our proposed objective in (4) can be applied to handle all three kinds of augmented data mapping functions in text generation models.

Loss Function
In theory, our approach can be applied to any Lipschitz smooth loss function for which equation (3) holds. In this section, we show another valid loss function in our approach, the word mover's distance (WMD) (Kusner et al., 2015; Zhao et al., 2019), which has previously been used in various text generation tasks. Next, we discuss the cross-entropy loss, for which the proposed objective is not an upper bound of the data augmentation objective. However, our approach still converges at the same rate, and experimental results in the next section validate the effectiveness of our approach with the cross-entropy loss.

Word Mover's Distance
WMD, also named the optimal transport distance (Chen et al., 2018a), leverages optimal transport to find an optimal matching of similar words between two sequences, providing a way to measure their semantic similarity:

$$\ell_{WMD}(u, v) = \min_{T_{i,j} \ge 0} \sum_{i,j} T_{i,j}\, d_{i,j} \quad \text{s.t.} \quad \sum_j T_{i,j} = p_{u,i}\ \forall i, \quad \sum_i T_{i,j} = p_{v,j}\ \forall j,$$

where $p_{u,i}$ / $p_{v,j}$ is the probability distribution of the sentence, i.e. $\sum_i p_{u,i} = 1$ and $\sum_j p_{v,j} = 1$. $d_{i,j}$ is the cost for mis-predicting $u_i$ to $v_j$, where the squared Euclidean distance $d_{i,j} = \|u_i - v_j\|^2$ is used and $u_i$ / $v_j$ is the word embedding vector. Note that the Euclidean distance in (2) is a special case of WMD, obtained by replacing the 1-gram used in WMD with an n-gram whose n is larger than the sentence's length. WMD is the squared $L_2$ Wasserstein distance. We take its square root, i.e. $\ell_{WD} = \sqrt{\ell_{WMD}}$, for which the right-hand side of (3) holds as an upper bound. Also, $\ell_{WD}$ is Lipschitz smooth.
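For intuition, WMD can be computed exactly on tiny inputs by brute force. The sketch below assumes two sequences of equal length with uniform word masses, in which case an optimal transport plan can be taken to be a permutation; it is not the approximate solver used in the experiments, and the function names are hypothetical.

```python
import math
from itertools import permutations


def squared_euclidean(a, b):
    """Squared Euclidean cost between two word embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def wmd_uniform(u, v):
    """Exact WMD for two equal-length sequences with uniform word masses 1/n."""
    n = len(u)
    if n == 0 or n != len(v):
        raise ValueError("this sketch needs two non-empty sequences of equal length")
    # With uniform masses, some optimal plan is a permutation matrix scaled by 1/n.
    best = min(
        sum(squared_euclidean(u[i], v[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best / n


def wd(u, v):
    """Square root of WMD, i.e. the L2 Wasserstein distance in this setting."""
    return math.sqrt(wmd_uniform(u, v))


u = [(0.0, 0.0), (2.0, 0.0)]
v = [(2.0, 1.0), (0.0, 1.0)]
distance = wd(u, v)
```

The enumeration is factorial in the sequence length, so this is useful only as a reference for checking an approximate solver on very short sequences.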
Theorem 2. For the $L_2$ Wasserstein distance $W_2(\cdot, \cdot)$ on the Wasserstein space $W_2(\mathbb{R}^n)$ and any $x, y, z \in W_2(\mathbb{R}^n)$, we have

$$W_2(y, z)^2 \le W_2(x, y)^2 + W_2(x, z)^2 - 2\, W_2(x, y)\, W_2(x, z) \cos\theta.$$

Here $\theta$ is the angle between $\gamma_{xy}$ and $\gamma_{zx}$, $\gamma_{xy}$ is the geodesic (shortest path) connecting $x, y$ in $W_2(\mathbb{R}^n)$, and $\gamma_{zx}$ is the geodesic connecting $z, x$ in $W_2(\mathbb{R}^n)$.
Theorem 3. Let $u$ and $v$ be given and fixed, and assume that $u_\Theta$ is Lipschitz continuous with respect to the parameters $\Theta$. Then $\ell_{WD}(u_\Theta, v)$ is Lipschitz continuous with respect to the parameters $\Theta$.
Roughly speaking, according to Sturm et al. (2006, Proposition 2.10), the sectional curvature of the Wasserstein space $W_2(\mathbb{R}^n)$ is non-negative. Hence, every geodesic triangle in $W_2(\mathbb{R}^n)$ is fatter than the one with the same side lengths in $\mathbb{R}^2$. As a consequence, an inequality like the law of cosines is satisfied on $W_2(\mathbb{R}^n)$, i.e., Theorem 2 holds. A formal proof of the above two theorems is provided in the Appendix. Thus, all our derivations in Section 4 hold.
The exact computation of $\ell_{WD}$ is expensive during training. In our experiments, we resort to the inexact proximal point method for optimal transport to compute it (Chen et al., 2018a).

Cross-entropy Loss
Although WMD is effective for various sequence generation tasks, the most conventional loss function adopted in existing generation models is the cross-entropy loss. It measures the word difference at each word $y_i$ of the output sequence $y$:

$$\ell_{CE}(y_i, p_i) = -y_i^{\top} \log(p_i),$$

where $y_i$ is the target one-hot vector, with 1 in the correct dimension and 0 elsewhere, and $p_i$ is the predicted probability output by a softmax layer. We adopt maximum likelihood estimation as the training paradigm, assuming the ground truth for the preceding words when predicting $p_i$.
The cross-entropy loss is also Lipschitz smooth, and thus we can guarantee its convergence from Theorem 1. Unfortunately, it does not satisfy the equation in (3), and thus minimizing our objective in (4) does not necessarily approximate the data augmentation objective in (1). In our experiments, we also try the cross-entropy loss, and the results show that our objective effectively improves model performance compared with the base model. This is not surprising, since our approach is optimized by gradient weighting and is thus at least a useful data weighting method.
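The per-word cross-entropy can be written out directly. In the sketch below the one-hot target is represented by the index of the correct word, and summing over positions to obtain a sequence-level loss is an assumption of this illustration; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def token_cross_entropy(probs, target_index):
    """-y^T log(p) for a one-hot y that is 1 at target_index."""
    return -math.log(probs[target_index])


def sequence_cross_entropy(step_logits, target_indices):
    """Sum of per-word losses, each computed given the ground-truth prefix."""
    return sum(
        token_cross_entropy(softmax(logits), t)
        for logits, t in zip(step_logits, target_indices)
    )


loss = sequence_cross_entropy([[0.0, 0.0], [2.0, 0.0, 0.0]], [0, 0])
```

Under the gradient weighting scheme, this scalar loss would also serve as the value that scales the sample's gradient.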

Experiments
The proposed approach provides a new paradigm and understanding of data augmentation for text generation. To evaluate whether our approach can mimic the effect of data augmentation, we conduct experiments on two text generation tasks: neural machine translation and conversational response generation. We compare our approach with the two most popular data augmentation methods (one token-level and one sentence-level augmentation method) that can be applied to various text generation tasks:
• Masked Language Model (MLM): We use a pretrained BERT (Devlin et al., 2019; Wolf et al., 2020) and randomly mask 15% of the words in each sentence; BERT then predicts new words for the masked positions. We augment one sample from each original training sample. Thus the data size increases to twice the original one. Note that we only augment the English side of translation datasets.
• Back-translation (BT): For neural machine translation, we employ a fixed target-to-source translation model trained on the original dataset. For conversational response generation, we perturb both the input and output text of the original sample pair using two pretrained translation models: an English-to-German model and its backward counterpart, which are obtained using the WMT14 corpus with 4.5M sentence pairs. We again augment one sample from each original training sample.
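The masking step of the MLM baseline can be sketched as follows. This only selects and masks positions; the prediction of replacement words by a pretrained masked language model is omitted. The rounding rule, the `[MASK]` symbol and the function names are assumptions of this illustration rather than details taken from the experiments.

```python
import random


def choose_mask_positions(tokens, ratio=0.15, seed=None):
    """Pick roughly `ratio` of the token positions uniformly at random."""
    if not tokens:
        return []
    rng = random.Random(seed)
    # Assumed rounding rule: mask at least one and at most all tokens.
    k = min(len(tokens), max(1, round(ratio * len(tokens))))
    return sorted(rng.sample(range(len(tokens)), k))


def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Replace the chosen positions by the mask symbol."""
    chosen = set(positions)
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


tokens = [f"w{i}" for i in range(20)]
positions = choose_mask_positions(tokens, ratio=0.15, seed=0)
masked = mask_tokens(tokens, positions)
```

A masked language model would then fill each masked position, and the filled sentence would be added as one extra training sample.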
We set the same weight $w$ for all augmented loss parts used in $\ell_{aug}$ as a hyper-parameter, and tune it on the development set of each dataset. Since Euclidean distance is a special case of WMD as dis-

Neural Machine Translation
We use the translation benchmarks IWSLT14 En-De, En-Fr, En-It, and IWSLT15 En-Vi in our experiments. The IWSLT14 datasets are pre-processed with the script in Fairseq. For the IWSLT14 datasets, we use tst2011 as the validation set and tst2012 as the test set. The IWSLT15 dataset is the same as that used in Luong et al. (2015a), and the validation and test sets are tst2012 and tst2013, respectively. Table 1 shows the BLEU scores on their test sets. For both the cross-entropy loss and the $L_2$ Wasserstein distance, all data augmentation methods (MLM, BT and OURS) perform better than the corresponding base models in most cases. The improvement margins differ across the datasets. The reason may be that the datasets are of different scales and that the alignment difficulty between different languages also varies. The performance of MLM is not stable in our results, largely because masked tokens may be filled in with words of different meaning, which changes the semantics of the sentence. Therefore, the augmented data are not truly aligned, and the learning of the translation model can be distracted. Note that we also evaluate our method using the Transformer model and obtain similar findings.
Experimental results of the Transformer model are presented in the appendix.
Compared to BT and MLM, our approach, which mimics the effect of data augmentation without actually constructing augmented samples, shows encouraging results. Note that our proposed objective may not have a theoretical guarantee on the cross-entropy loss. Yet, it still manages to improve the base model except for Fr⇒En, and surpasses MLM on all datasets. With the use of the $L_2$ Wasserstein distance, our approach even outperforms BT and achieves the best performance on half of the test sets. This validates the benefit of not using any specific augmented data mapping function, as in our proposed objective. We provide further analysis of the performance of our approach versus BT. In Fig. 2, we compare test BLEU scores obtained by models updated with the same number of samples. Since we construct one augmented sample from each original training sample, the total number of samples used in BT is twice that of our approach. We can see that our approach achieves performance comparable to BT while requiring only half of the training data. This shows that our approach, without involving additional computation on extra samples, can effectively save computational expense. Fig. 3 shows the sensitivity of performance to different hyper-parameters. For our approach, we vary $C_1(R)$; for BT, we vary the sample weight $w$ of the augmented samples. We re-scale $C_1(R)$ by $10^{-4}$ and $w$ by $10^{-1}$ in order to visualize them within the same range on the x-axis. Both BT and our approach are robust under different settings of their hyper-parameters.

Conversational Response Generation
We use the English single-round Reddit conversation dataset (Zhou et al., 2018). Following previous work on data augmentation for dialogue systems (Cai et al., 2020; Zhang et al., 2020), we simulate a low data regime so that data augmentation is expected to be more effective. Thus, we select data pairs in which both the query and the response are shorter than 20, and randomly split them into 200K for training, 2K for validation and 5K for testing. Automatic evaluation for each method is performed on all test data. We report Perplexity, BLEU and BLEU-k (k=1,2) to measure the response coherence, and Distinct-k (k=1,2) (Li et al., 2016) to measure the response diversity. We also hire five annotators from a commercial annotation company for manual evaluation on 200 pairs randomly sampled from the test set. Results of all methods are shuffled for annotation fairness. Each annotator rates each response on a 5-point scale (1: not acceptable; 3: acceptable; 5: excellent; 2 and 4: used in unsure cases) from two perspectives: Fluency and Relevance.
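As an illustration of the diversity metric, the sketch below computes a Distinct-k score over a set of tokenized responses. It uses one common variant, the number of unique k-grams divided by the total number of k-grams; whether this matches the exact normalization used in the experiments is an assumption, and the function name is hypothetical.

```python
def distinct_k(responses, k):
    """Ratio of unique k-grams to all k-grams over a list of tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - k + 1):
            unique.add(tuple(tokens[i:i + k]))
            total += 1
    return len(unique) / total if total else 0.0


responses = [
    ["i", "do", "not", "know"],
    ["i", "do", "not", "care"],
]
dist1 = distinct_k(responses, 1)  # 5 unique unigrams out of 8
dist2 = distinct_k(responses, 2)  # 4 unique bigrams out of 6
```

Higher values indicate that the generated responses repeat fewer k-grams across the test set.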
Results are summarized in Table 2. On automatic metrics, BT shows only marginal improvements on a few metrics and does not exhibit the strength it shows in translation tasks. MLM effectively increases the response diversity (Dist1&2). This is due to the nature of conversation data: a conversation pair often remains coherent even if the semantics of the query or response has been slightly changed. Thus, MLM can increase data diversity, which is valuable in training response generation models. In terms of human evaluation, BT and MLM barely improve the base model. Our approach achieves the best or second-best results on most metrics for both loss functions, demonstrating more robust performance than BT and MLM. This is consistent with our statement in the introduction that proper augmented data mapping functions often need to be designed carefully for a target generation task, which requires non-trivial work. As such, it is meaningful to avoid the use of specific data augmentation techniques and find a unified formulation of data augmentation for general generation tasks. In our results, the proposed objective achieves the effect of data augmentation across different generation tasks.

Conclusions and Future Work
We have proposed an objective that formulates data augmentation without using any augmented data mapping function. We show how to optimize it and provide the corresponding convergence rate. Both the $L_2$ Wasserstein distance and the cross-entropy loss are discussed with respect to their use in our objective and their corresponding theoretical guarantees. Different from previous data augmentation works that need to add manipulated data into the training process, our gradient-based approach provides a potential way to obtain the performance improvements that may come from augmented data without incurring the computational expense. Experiments on both neural machine translation and conversational response generation validate the effectiveness of our objective compared to existing popular data augmentation methods: masked language models and back-translation.
We believe this work provides a new understanding of data augmentation. Our approach can also be useful for a wide range of tasks, including text classification tasks, which can be seen as special cases of text generation tasks, and cross-modality generation tasks such as image captioning, in which we can skip the step of applying various image augmentation techniques.
We would like to point out that some parts of our approach can be improved in the future, which may lead to better performance and generalization. Firstly, the distributions we currently choose in the re-parameterized loss are relatively simple. Some points under the current continuous distributions may not correspond to valid text sequences in the original text space, due to the discreteness of natural languages. One possible remedy is to use more informative distributions, such as prior distributions computed from several augmented samples. Secondly, our method is derived under the framework of SGD, and it is possible to extend it to the Adam framework (Kingma and Ba, 2014; Chen et al., 2018b; Reddi et al., 2019). We leave a more general version of our work to the future.

C Proof of Theorem 1
We study the nonconvex finite-sum problems of the form

$$\min_{\Theta} L(\Theta) := \frac{1}{n} \sum_{i=1}^{n} \ell_{our,i}(\Theta),$$

where both $L$ and $\ell_{our}$ may be nonconvex. For ease of notation, we use $\ell$ to denote $\ell_{our}$ in the rest of the proof. We denote the class of such finite-sum Lipschitz smooth functions by $\mathcal{F}_n$. We optimize functions in $\mathcal{F}_n$ with the gradient in Eq. 8 by SGD. For $L \in \mathcal{F}_n$, SGD takes an index $i \in [n]$ and a sample in the training set, and returns the pair $(\ell_i(\Theta), \nabla \ell_i(\Theta))$.
Definition 1. We say
Let $\alpha_t$ denote the learning rate at iteration $t$, and $w_{i_t}$ be the gradient weight assigned to sample $i$ by our approach. By SGD, we have

$$\Theta^{t+1} = \Theta^{t} - \alpha_t\, w_{i_t} \nabla \ell_{i_t}(\Theta^{t}). \tag{25}$$

Definition 4. We say the positive gradient weight $w$ in our approach is bounded if there exist constants $w_1$ and $w_2$ such that $w_1 \le w_i \le w_2$ for all $i \in [n]$.
Proof of Theorem 1. According to the Lipschitz continuity of $\nabla \ell$, the iterates of our approach satisfy the following bound: After substituting (25) into (26), we have: The first inequality follows from the unbiasedness of the stochastic gradient, $\mathbb{E}_{i_t}[\nabla \ell_{i_t}(\Theta^t)] = \nabla \ell(\Theta^t)$.
The second inequality uses the assumption on gradient boundedness in Definition 3. we obtain Summing (28) from $t = 0$ to $T - 1$ and using that $\alpha_t$ is a fixed $\alpha$, we obtain The first step holds because the minimum is less than the average. The second step is obtained from (28). The third step follows from the assumption on gradient weight boundedness in Definition 4. The fourth step is obtained from the fact that $\ell(\Theta^*) \le \ell(\Theta^T)$. The final inequality follows upon using $\alpha = c/\sqrt{T}$.

D Proof of W D
We begin with some concepts in mathematics. Let (X, |·,·|) be a complete metric space.

Definition 5. A rectifiable curve γ(t) : I ⊂ R+ → X connecting two points p, q is called a geodesic if its length is equal to |p, q| and it has unit speed. Here, we say that γ(t) : I → X has unit speed if, for any s, t ∈ I with s < t, the length of the restriction γ : [s, t] → X is t − s. A metric space X is called a geodesic space if, for every pair of points p, q ∈ X, there exists a geodesic connecting them.

Definition 6. We say that a geodesic space (X, |·,·|) has non-negative curvature in the sense of Alexandrov if it satisfies the following property:
• for any p ∈ X and any unit-speed geodesics γ(s) : I → X and σ(t) : J → X with γ(0) = σ(0) := p, the comparison angle ∠γ(s)pσ(t) := arccos( (t² + s² − |γ(s), σ(t)|²) / (2 · s · t) ) is non-increasing with respect to each of the variables t and s.

The angle between γ and σ at p is defined as the limit of the comparison angle ∠γ(s)pσ(t) as s, t → 0. In other words, every geodesic triangle in X is fatter than the Euclidean triangle in R² with the same side lengths (Figure 4). Hence, we complete the proof.
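To make the comparison angle of Definition 6 concrete, the following sketch evaluates it in two spaces chosen purely for illustration: the Euclidean plane, where it is constant, and the unit sphere, a standard example of non-negative curvature, where it shrinks as s and t grow. The helper names `comparison_angle`, `euclid_dist`, and `sphere_dist` are hypothetical and do not come from the paper.

```python
import math


def comparison_angle(s, t, d):
    """Angle at the apex of the Euclidean triangle with adjacent sides s, t
    and opposite side d, i.e. arccos((s^2 + t^2 - d^2) / (2 s t))."""
    cos_val = (s * s + t * t - d * d) / (2.0 * s * t)
    return math.acos(max(-1.0, min(1.0, cos_val)))  # clamp against rounding


def euclid_dist(s, t, theta):
    """Distance between points at arc lengths s and t on two rays in the
    plane that leave the same point at angle theta (law of cosines)."""
    return math.sqrt(s * s + t * t - 2.0 * s * t * math.cos(theta))


def sphere_dist(s, t, theta):
    """Same quantity on the unit sphere for two great-circle arcs
    (spherical law of cosines)."""
    c = math.cos(s) * math.cos(t) + math.sin(s) * math.sin(t) * math.cos(theta)
    return math.acos(max(-1.0, min(1.0, c)))


theta = math.pi / 2

# Flat space: the comparison angle equals the true angle for every s, t.
flat = comparison_angle(1.0, 2.0, euclid_dist(1.0, 2.0, theta))

# Unit sphere: the comparison angle decreases as s and t increase.
near = comparison_angle(0.1, 0.1, sphere_dist(0.1, 0.1, theta))
far = comparison_angle(1.0, 1.0, sphere_dist(1.0, 1.0, theta))
print(flat, near, far)
```

The monotone decrease on the sphere is the property that Definition 6 requires, and the small-scale value approaches the actual angle between the two geodesics.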
Proof of Theorem 3. From the definition of W D and the triangle inequality for the L2 Wasserstein distance, we derive a bound that holds for any Θ, Θ′ and for any T_{i,j} that satisfies Σ_j T_{i,j} = p_{u_Θ,i} ∀i and Σ_i T_{i,j} = p_{u_Θ,j} ∀j.
Take T_{i,j} = δ_{ij} · p_{u_Θ,i}. By the assumption that u_Θ is Lipschitz continuous with respect to the parameters Θ, the resulting bound holds for some constant L > 0. Hence, we complete the proof.
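The key step above, plugging in the diagonal coupling T_{i,j} = δ_{ij} · p_i, works because any feasible coupling gives an upper bound on the squared L2 Wasserstein distance. The sketch below checks this numerically on a toy example. It is an illustration under stated assumptions, not the paper's code: both measures have uniform weights, the exact distance is found by brute force over permutations (valid for equal-mass atoms), and all function names are hypothetical.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def diagonal_coupling_cost(p, xs, ys):
    """Transport cost of T_ij = delta_ij * p_i: atom i of the first measure
    is sent entirely to atom i of the second measure."""
    return sum(p_i * sq_dist(x, y) for p_i, x, y in zip(p, xs, ys))


def exact_w2_squared_uniform(xs, ys):
    """Exact squared W_2 between the uniform measures on xs and ys
    (n equal-mass atoms each), by brute force over all permutations."""
    n = len(xs)
    return min(
        sum(sq_dist(xs[i], ys[perm[i]]) for i in range(n)) / n
        for perm in permutations(range(n))
    )


xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ys = [(1.0, 0.1), (0.0, 0.1), (0.0, 1.0)]  # first two atoms roughly swapped
p = [1.0 / 3.0] * 3

upper = diagonal_coupling_cost(p, xs, ys)
exact = exact_w2_squared_uniform(xs, ys)
print(exact, upper)
```

Here the diagonal coupling is far from optimal, yet it still bounds the exact value from above. In the proof, the bound becomes useful because a small change in Θ moves each atom only slightly.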

E Experimental Results of Transformer
We also evaluate our method using the Transformer architecture on two translation tasks. To prevent the model from over-fitting, we use a Transformer model with a 2-layer encoder and a 2-layer decoder.
Other hyper-parameters are almost the same as in Vaswani et al. (2017), except for the optimizer. In our experiment, we use SGD to train the model instead of Adam (Vaswani et al., 2017), since our approach is derived under SGD. Results are shown in Table 3, which are consistent with the observations from the LSTM model. We hope that our approach and theoretical analysis can be extended to the Adam framework (Kingma and Ba, 2014; Chen et al., 2018b; Reddi et al., 2019) in the future.

[Figure 1 panels: (a) with a perturbed input x; (b) with a perturbed output ŷ]

Figure 1: Illustration of the polar coordinate systems for three kinds of data perturbation. Rays in the figures are the polar axes. Our approach expresses the dotted edges by their corresponding polar coordinates.

Figure 2: BLEU scores of models updated with the same number of samples.

Figure 3: BLEU scores of models trained with different hyper-parameters. Values on the x-axis are rescaled so that they can be visualized in the same range.

Figure 4: Geodesic space with non-negative curvature.

According to Sturm et al. (2006) [Proposition 2.10], the Wasserstein space W_2(R^n) has non-negative curvature in the sense of Alexandrov. Precisely,

Lemma 1 (Sturm et al. (2006) [Proposition 2.10]). Let n ≥ 1. The Wasserstein space W_2(R^n) equipped with the L2 Wasserstein distance W_2(·,·) has non-negative curvature in the sense of Alexandrov.

Proof of Theorem 2. Let X = W_2(R^n) and let |·,·| be the L2 Wasserstein distance. For any x, y, z ∈ X, we denote by γ_xy (resp. γ_zx) the geodesic connecting x and y (resp. z and x). By the above lemma, X has non-negative curvature in the sense of Alexandrov. Hence, according to Definition 6, one can define the angle between γ_xy and γ_zx at x, denoted by θ, and we have

θ ≥ ∠yxz := arccos( (|x, y|² + |z, x|² − |y, z|²) / (2 · |x, y| · |z, x|) ),

which, since the cosine is decreasing on [0, π], is equivalent to |y, z|² ≥ |x, y|² + |z, x|² − 2 · |x, y| · |z, x| · cos θ.

Table 1: BLEU scores on various translation datasets. CE: Cross-Entropy loss; WD: L2 Wasserstein distance. The best results are in bold, and the second-best results are underlined.

Table 2: Automatic and human evaluation results on Reddit. Human: the gold reference of the query. The best results are in bold, and the second-best results are underlined.

Table 3: BLEU scores on two translation datasets using the Transformer model. CE: Cross-Entropy loss; WD: L2 Wasserstein distance. The best results are in bold, and the second-best results are underlined.