Data Augmentation with Adversarial Training for Cross-Lingual NLI

Due to recent pretrained multilingual representation models, it has become feasible to exploit labeled data from one language to train a cross-lingual model that can then be applied to multiple new languages. In practice, however, we still face the problem of scarce labeled data, leading to subpar results. In this paper, we propose a novel data augmentation strategy for better cross-lingual natural language inference by enriching the data to reflect more diversity in a semantically faithful way. To this end, we propose two methods of training a generative model to induce synthesized examples, and then leverage the resulting data using an adversarial training regimen for more robustness. In a series of detailed experiments, we show that this fruitful combination leads to substantial gains in cross-lingual inference.


Introduction
There is a growing need for NLP systems that support low-resource languages, for which task-specific training data may be lacking, while domain-specific parallel corpora may be too scarce to train a reliable machine translation engine. To overcome this, zero-shot cross-lingual systems can be trained on a source language L_S and subsequently applied to other languages L_T despite a complete lack of labeled training data for those target languages. In the past, such systems typically drew on translation dictionaries, lexical knowledge graphs, or parallel corpora to build a cross-lingual model that exploits simple connections between words and phrases across different languages (de Melo and Siersdorfer, 2007). Recently, pretrained language model architectures such as BERT (Devlin et al., 2019) have been shown capable of learning joint multilingual representations with self-supervised objectives under a shared vocabulary, simply by combining the input from multiple languages (Devlin et al., 2019; Artetxe and Schwenk, 2019; Conneau and Lample, 2019). Such representations greatly facilitate cross-lingual applications. Still, the success of such cross-lingual transfer hinges on how close the involved languages are, with substantial drops observed for more distant language pairs (Lauscher et al., 2020).
For our study, we focus on natural language inference (NLI), i.e., classifying whether a premise sentence entails, contradicts, or is neutral with regard to a hypothesis sentence (Williams et al., 2017). This is a useful building block for applications involving semantic understanding (Zhu et al., 2018;Reimers and Gurevych, 2019). However, the task is also very challenging, as it not only requires accounting for very subtle differences in meaning but also inferring presuppositions and implications that are not explicitly stated. Due to these intricate subtleties, zero-shot cross-lingual models are often fairly brittle, while obtaining in-language training data is fairly costly.
Data Augmentation. To boost the performance of cross-lingual models, an intuitive thought is to draw on unlabeled data from the target language so as to enable the model to better account for the specifics of that language, rather than just being fine-tuned on the source language. A natural way of exploiting unlabeled data is to consider standard semi-supervised learning methods that leverage a model's own predictions on unlabeled target language inputs (Dong and de Melo, 2019). However, this strategy fails when the predictions are too noisy to serve as reliable training signals. In this paper, we hence explore data augmentation to circumvent this problem. The idea, widespread in computer vision and speech recognition, is to generate new training data from existing labeled data. For images, a common approach is to apply transformations such as rotation and flipping, as these typically preserve the original label assigned to an image (Krizhevsky et al., 2012). For text, in contrast, data augmentation is more challenging, and straightforward techniques include simple operations on words within the original training sequences, such as synonym replacement, random insertion, random swapping, or random deletion (Wei and Zou, 2019). In practice, however, there are two notable problems. One is that the synthesized data from data augmentation techniques may as well be noisy and unreliable. Second, new examples may diverge from the distribution of the original data.
On NLI, these problems are particularly pronounced, as the very nature of this task is to account for subtle differences between sentences. Modified versions of the original sentences may no longer have the same meaning and entailments. Hence, existing data augmentation techniques often fail to boost the result quality.
Overview and Contributions. In this paper, we propose a novel data augmentation scheme to synthesize controllable and much less noisy data for cross-lingual NLI. This augmentation consists of two parts. One serves to encourage language adaptation by reordering source language words based on word alignments, so as to better cope with typological divergence between languages; we denote it as Reorder Augmentation (RA). The other seeks to enrich the set of semantic relationships between a premise and pertinent hypotheses, denoted as Semantic Augmentation (SA). Both are achieved by learning corresponding sequence-to-sequence (Seq2Seq) models.
The resulting samples along with their new labels serve as an enriched training set for the final cross-lingual training. During this phase, we invoke a special adversarial training regimen that enables the model to better learn from such automatically induced training samples and transfer more information to the target languages while better bridging the gap between typologically distinct languages. Our empirical study demonstrates the necessity of incorporating adversarial training into training with synthetic samples and the superiority of our new augmentation method on cross-lingual Natural Language Inference (Conneau et al., 2018). Remarkably, our cross-lingual approach even outperforms in-language supervised learning.

Method
Our proposed method consists of two steps. The first involves inducing training examples with two data augmentation models. Next, a task-specific classifier is trained on both the original and the newly generated training instances, with adversarial perturbation for improved robustness and generalization.

Reorder Augmentation
Reorder augmentation is based on the intuition of making a model more robust with respect to differences in word order typology. If our training examples consist entirely of instances from a language L_S with a fairly strict subject-verb-object (SVO) word order such as English, the model will be less well equipped to pay attention to subtle semantic differences between sentences from a target language L_T obeying subject-object-verb (SOV) order. To alleviate this problem, we can rely on auxiliary data to diversify the training data. For this, we obtain word alignments for unannotated bilingual parallel sentence pairs covering L_S and an auxiliary language L_A that need not be the same as L_T. We then reorder all source sentences to match the word order of L_A based on the alignments, and train a model to apply such reordering on the NLI training instances. Formally, suppose we have obtained l unlabeled parallel sentences in the source language L_S and the auxiliary language L_A, C = {(s_i, a_i) | i = 1, ..., l}, where (s, a) is a source-auxiliary sentence pair. Based on a word alignment model, in our case FastAlign (Dyer et al., 2013), which uses Expectation Maximization to compute lexical translation probabilities, we obtain a word pair table for each sentence pair (s, a), denoted as A(s, a) = {(i_1, j_1), ..., (i_m, j_m)}.
Following the word order of L_A, we then reorder the source sequence s by consulting the table A(s, a), yielding the new sentence pair (s, s̃). Next, we consider a pretrained Seq2Seq model, denoted as r(·; θ). The model is assumed to have been pretrained with an encoder and a decoder in the source language, and we fine-tune this generative model on the new parallel corpus C̃ = {(s_i, s̃_i) | i = 1, ..., l}. This generative Seq2Seq model can then reorder the sequences in the labeled training dataset D = {(x_i, y_i) | i = 1, ..., n}, where n is the number of labeled instances, each x_i consists of a sentence pair (s_1, s_2), and each y_i ∈ Y is the corresponding ground-truth label describing their relationship.
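The alignment-based construction of reordering targets can be sketched as follows. This is a simplified illustration, not the authors' released code: the function name `reorder_by_alignment` and the tie-breaking rule for unaligned tokens are our assumptions.

```python
# Sketch: reorder a source sentence to follow the auxiliary-language word
# order implied by a word alignment. Each source token is sorted by the
# position of its earliest aligned auxiliary word; unaligned tokens fall
# back to their own index so they stay roughly in place.

def reorder_by_alignment(src_tokens, alignment):
    """alignment: list of (src_idx, aux_idx) pairs, e.g. from FastAlign."""
    # Map each source position to the smallest aligned auxiliary position.
    aux_pos = {}
    for i, j in alignment:
        aux_pos[i] = min(aux_pos.get(i, j), j)
    # Sort source positions by (auxiliary position, original position).
    order = sorted(range(len(src_tokens)),
                   key=lambda i: (aux_pos.get(i, i), i))
    return [src_tokens[i] for i in order]

# English SVO rearranged toward a pseudo-SOV order via a toy alignment
# in which the verb aligns to the final auxiliary position.
src = ["she", "reads", "books"]
align = [(0, 0), (1, 2), (2, 1)]
print(reorder_by_alignment(src, align))  # -> ['she', 'books', 'reads']
```

The (s, s̃) pairs produced this way would then serve as input/output sequences for fine-tuning the Seq2Seq reordering model r(·; θ).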

Semantic Augmentation
Our second augmentation strategy involves training a controllable model that, given a sentence and a label describing the desired relationship, seeks to emit a second sentence that stands in said relationship to the input sentence. Thus, given an existing training sentence pair, we can consider different variations of one sentence in the pair and invoke the model to generate a suitable second sentence. However, such automatically induced samples from SA are inordinately noisy, precluding their immediate use as training data, so we exploit a large pretrained Teacher model trained on available source language samples to rectify the labels of these synthetic samples with appropriate strategies.
Generation. As we wish to control the label of a generated example, the requested label is prepended to the input as a (textual) prefix before it is fed into a Seq2Seq model. We adopt the ground-truth label of each example as the respective prefix, resulting in a new input sequence (y_i : s_1) coupled with s_2 as the desired output, forming a training pair for the generation model.
Given the resulting labeled training dataset D_SA, we can fine-tune a pretrained Seq2Seq model, denoted as g(·; θ). This generative Seq2Seq model can then be invoked for semantic data augmentation to generate new training instances. For each (ȳ : s_1) as a labeled input sequence, where ȳ ∈ Y \ {y_i}, we generate an s̃_2 via the fine-tuned Seq2Seq model, yielding a new training instance (s_1, s̃_2, ȳ).
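The construction of label-prefixed inputs can be sketched as follows. This is illustrative only: the exact prefix formatting, label strings, and helper names are assumptions, not taken from the paper.

```python
# Sketch: build (input, target) pairs for fine-tuning the generator, and
# label-prefixed inputs requesting every label other than the gold one.

LABELS = ["entailment", "contradiction", "neutral"]

def seq2seq_training_pairs(dataset):
    """dataset: iterable of (s1, s2, label) triples from the original NLI
    data. Returns (input_text, target_text) pairs for fine-tuning g."""
    return [(f"{y}: {s1}", s2) for (s1, s2, y) in dataset]

def augmentation_inputs(dataset):
    """For each example, request each label in Y \\ {y_i} as the prefix."""
    return [(f"{y_bar}: {s1}", y_bar)
            for (s1, s2, y) in dataset
            for y_bar in LABELS if y_bar != y]

data = [("A man plays guitar.", "A person makes music.", "entailment")]
print(seq2seq_training_pairs(data))
print(augmentation_inputs(data))  # requests contradiction and neutral
```

At generation time, each prefixed input would be fed to the fine-tuned Seq2Seq model to produce the synthetic hypothesis s̃_2.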
Label Rectification. The semantic augmentation induces s̃_2 automatically based on s_1 and the requested label ȳ. However, the obtained s̃_2 may not always genuinely stand in the desired relationship ȳ to s_1. Thus, we treat this data as inherently noisy and propose a rectifying scheme based on a Teacher model. We wish for this Teacher to be as accurate as possible, so we start off with a large pretrained language model specifically for the source language L_S, which we assume obtains better performance on L_S than a pretrained multilingual model. We train the Teacher network h(·; θ) for K epochs on the set of original labeled data D. This Teacher model is then invoked to verify and potentially rectify labels from the automatically induced augmentation data D̃_a = {(x̃_i, ȳ_i) | i = 1, ..., m} obtained in the previous step (where m is the number of instances). We assume (ỹ_i, c) = h(x̃_i; θ) denotes the predicted label along with the confidence score c ∈ [0, 1] emitted by the classifier, and assume a confidence threshold T has been predetermined. There are several strategies to determine the final labels.
• Teacher Strategy: when the confidence score is above T, we trust the Teacher model to provide a reliable label; all other instances are discarded.
• TR Strategy: an alternative scheme in which labels remain unchanged when the Teacher's predictions match the originally requested labels. In case of an inconsistency, we adopt the Teacher model's label if it is sufficiently confident, and otherwise retain the requested label.
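The rectification strategies (including the Agreement and Requested variants evaluated later in the ablations) can be summarized as a single dispatch function. This is a sketch under our reading of the strategies; the function name and interface are hypothetical.

```python
def rectify(requested, predicted, confidence, threshold=0.8,
            strategy="teacher"):
    """Return the final label for one synthetic example, or None to
    discard it. requested: label used as the generation prefix;
    (predicted, confidence): the Teacher model's output."""
    if strategy == "teacher":
        # Keep only examples the Teacher labels with high confidence.
        return predicted if confidence > threshold else None
    if strategy == "tr":
        # Agreement keeps the label; on disagreement, trust a confident
        # Teacher, otherwise fall back to the requested label.
        if predicted == requested:
            return requested
        return predicted if confidence > threshold else requested
    if strategy == "agreement":
        # Keep only examples where Teacher and request agree.
        return requested if predicted == requested else None
    if strategy == "requested":
        # Always trust the label requested at generation time.
        return requested
    raise ValueError(f"unknown strategy: {strategy}")

print(rectify("neutral", "entailment", 0.9, strategy="teacher"))  # entailment
print(rectify("neutral", "entailment", 0.5, strategy="tr"))       # neutral
```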

Adversarial Training
Upon completing the two kinds of data augmentation, we possess synthesized data that is substantially less noisy, denoted as D_r, which can be incorporated into the original training data D to yield the final augmented training set D_a = D ∪ D_r. With this, we proceed to train a new model f(·; θ) for the final cross-lingual sentence pair classification. As a special training regimen, we adopt adversarial training, which seeks to minimize the maximal loss incurred by label-preserving adversarial perturbations (Szegedy et al., 2014; Goodfellow et al., 2015), thereby promising to make the model more robust. Nonetheless, the gains observed from it in practice have been somewhat limited in both monolingual and cross-lingual settings. We conjecture that this is because it has previously merely been invoked as an additional form of monolingual regularization (Miyato et al., 2017).
In contrast, we hypothesize that adversarial training is particularly productive in a cross-lingual framework when used to exploit augmented data, as it encourages the model to be more robust towards the divergence among similar words and word orders in different languages and to better adapt to the new modestly noisy data. This hypothesis is later confirmed in our experimental results.
Adversarial training is based on the notion of finding optimal parameters θ to make the model robust against any perturbation r within a norm ball on a continuous multilingual (sub-)word embedding space. Hence, the loss function becomes:

  min_θ E_{(x_i, y_i) ∼ D_a} [ max_{||r|| ≤ ε} ℓ(f(x_i + r; θ), y_i) ]  (2)

Generally, a closed form for the optimal perturbation r_adv(x_i, y_i) cannot be obtained for deep neural networks. Goodfellow et al. (2015) proposed approximating this worst-case perturbation by linearizing the loss around x_i. With a linear approximation and an L_2 norm constraint in Equation 2, the adversarial perturbation is

  r_adv = ε g / ||g||_2,  where g = ∇_x ℓ(f(x_i; θ), y_i).  (3)

However, neural networks are typically not linear even over a relatively small region, so this approximation cannot guarantee finding the optimal point within the bound. Madry et al. (2017) demonstrated that projected gradient descent (PGD) allows us to find a better perturbation r_adv(x_i, y_i).
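The closed-form FGM perturbation under the L_2 constraint can be illustrated numerically. This is a toy NumPy sketch operating on a raw gradient vector, not tied to any particular model.

```python
import numpy as np

def fgm_perturbation(grad, eps):
    """Linearized worst-case perturbation: r_adv = eps * g / ||g||_2.
    A small constant guards against division by zero."""
    return eps * grad / (np.linalg.norm(grad) + 1e-12)

# A gradient of [3, 4] has L2 norm 5, so the unit-ball perturbation
# points in the same direction with length eps.
g = np.array([3.0, 4.0])
r = fgm_perturbation(g, eps=1.0)
print(r)                  # approximately [0.6, 0.8]
print(np.linalg.norm(r))  # approximately 1.0 (= eps)
```

In practice the gradient g would be taken with respect to the (sub-)word embeddings of the input, and r would be added to those embeddings before the forward pass.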
In particular, for the norm ball constraint ||r|| ≤ ε, the projection Π_{||r|| ≤ ε} maps a point r_0 to the admissible perturbation r closest to it:

  Π_{||r|| ≤ ε}(r_0) = argmin_{||r|| ≤ ε} ||r − r_0||  (4)

To find better optima, K-step PGD is applied during training, which requires K forward-backward passes through the network. With a linear approximation and an L_2 norm constraint, PGD takes the following step in each iteration:

  r_{t+1} = Π_{||r|| ≤ ε} ( r_t + α g_t / ||g_t||_2 ),  where g_t = ∇_x ℓ(f(x_i + r_t; θ), y_i).  (5)

Here, α is the step size and t is the step index.
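K-step PGD with L_2 projection can be sketched as below. This is a toy NumPy illustration: `grad_fn` stands in for a forward-backward pass through the network, and the helper names are ours.

```python
import numpy as np

def l2_project(r, eps):
    """Project r onto the L2 ball of radius eps (identity if inside)."""
    norm = np.linalg.norm(r)
    return r if norm <= eps else eps * r / norm

def pgd_perturbation(grad_fn, dim, eps, alpha, steps):
    """K-step PGD: ascend along the normalized loss gradient, projecting
    back onto the norm ball after every step."""
    r = np.zeros(dim)
    for _ in range(steps):
        g = grad_fn(r)
        r = l2_project(r + alpha * g / (np.linalg.norm(g) + 1e-12), eps)
    return r

# Toy loss L(r) = c . r with constant gradient c: the maximizer on the
# ball is eps * c / ||c||, which PGD reaches and then stays on.
c = np.array([1.0, 0.0])
r_adv = pgd_perturbation(lambda r: c, dim=2, eps=0.5, alpha=1.0, steps=3)
print(r_adv)  # approximately [0.5, 0.0]
```

With the paper's settings (α = 1.0, ε = 3.0, K = 3), each training example would incur three such gradient computations on the embedding-space perturbation before the final parameter update.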

Experimental Setup
Tasks and Datasets. For evaluation, we used XNLI (Conneau et al., 2018), the most prominent cross-lingual Natural Language Inference corpus, which extends the MultiNLI dataset (Williams et al., 2017) to 15 languages. In our experiments, we considered 20k training instances, i.e., ∼5% of the original training size, to study lower-resource settings requiring augmentation. Following previous work, we consider English as the source language in our experiments.
Model Details. To show that our reorder augmentation strategy does not require auxiliary data from a low-resource target language, we only give it access to parallel data for another closely related high-resource language. Specifically, we use the English-German bilingual parallel corpus from JW300 (Agić and Vulić, 2019). Like English, German commonly adopts an SVO word order, but in some instances also mandates SOV and is generally less rigid than English. This allows us to demonstrate the utility of reorder augmentation even in the absence of data from a language similar to the target language. We relied on FastAlign to induce 200k training pairs for Seq2Seq fine-tuning on reordering.
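FastAlign emits one line per sentence pair in the standard Pharaoh format: whitespace-separated `i-j` tokens giving 0-indexed source and target word positions. A minimal parser (our own illustration, not part of FastAlign) recovers the word pair table A(s, a):

```python
def parse_pharaoh(line):
    """Parse one line of FastAlign output ('i-j' pairs, 0-indexed,
    source-target) into a list of (src_idx, tgt_idx) tuples."""
    pairs = []
    for tok in line.split():
        i, j = tok.split("-")
        pairs.append((int(i), int(j)))
    return pairs

print(parse_pharaoh("0-0 1-2 2-1"))  # -> [(0, 0), (1, 2), (2, 1)]
```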
As the pretrained Seq2Seq model, we used Google's T5-base (Raffel et al., 2020), a unified text-to-text Transformer, to generate new training examples. During generation, we set the beam size to 1 and use sampling instead of greedy decoding. For the Teacher model in semantic augmentation, we relied on RoBERTa-Large (Liu et al., 2019), a robustly optimized BERT model, fine-tuned for NLI on English. As the multilingual model, we employ XLM-RoBERTa-base (XLM-R), trained on over 100 different languages. For PGD, the step size α, norm constraint size ε, and number of steps K are set to 1.0, 3.0, and 3, respectively. All hyperparameter tuning is conducted based on the accuracy on the English validation set. The Teacher strategy is then used for the rectification of semantically augmented texts for XNLI, as inference requires particularly clean data; the threshold T for this is 0.8. An overview of the basic network parameter values is given in Table 1. We rely on early stopping as a termination criterion. For all NLI classification results, we repeat each experiment 5 times and report the averaged accuracy.

Main Results
Cross-lingual Inference Classification. Table 2 compares our approach against several strong baselines on XNLI. The first part considers in-language supervised learning, where we relied on genuine training data from the target language rather than a cross-lingual setting. These results are merely provided for comparison. The second part considers zero-shot cross-lingual transfer, i.e., the setting we are targeting in this paper: We first used English training data to train the XLM-R model and then applied it to non-English languages without any training data in the target language. We also trained the model with PGD adversarial training to assess how well PGD works without any data augmentation. Next, we evaluate XLM-R when trained on original and augmented examples from several augmentation methods, with and without adversarial training, respectively. The first of these is Easy Augmentation (EA) by Wei and Zou (2019), a state-of-the-art method for data augmentation in NLP. It mixes 4 strategies, namely synonym replacement, random insertion, random swapping, and random deletion, applying each of these to 20% of the words in a sentence. Additionally, we consider our proposed RA and SA strategies, as well as combinations of EA or RA with SA. Compared with vanilla XLM-R without adversarial training, XLM-R with PGD works better across a range of non-English languages, which shows the effectiveness of adversarial training for more robustness in cross-lingual settings. We observe that XLM-R, when trained with EA or RA, outperforms the setting without augmentation for English and some non-English languages, though it does not achieve substantially stronger results in terms of the average accuracy across different languages. This suggests that XLM-R struggles to benefit from the augmented instances from RA for better generalizability.
In contrast, when trained with SA, XLM-R performs better than without SA examples for most languages, confirming that our semantic augmentation is beneficial. Remarkably, XLM-R with SA examples even succeeds at outperforming in-language training with an average absolute improvement of about 1.1% in accuracy, suggesting that cross-lingual models trained with automatically generated English examples can be more informative with regard to inference than target language examples. Next, we also observe that the accuracy of XLM-R with additional examples from EA, RA, and SA is boosted by PGD. This suggests that adversarial training is particularly useful to boost generalizability and robustness when operating on artificially augmented examples.
Beyond this, our full zero-shot approach further outperforms all baselines across 14 languages, including in-language training. This demonstrates the value of improving generalizability and robustness by adding diverse forms of augmentation in an adversarial training framework that can cope with noisy examples.

Ablation Studies and Analysis
Comparisons on Different Rectifying Strategies. One key part of our method is the label rectification mechanism. We compare different rectification strategies in Table 3. The results show that the Teacher and TR methods introduced in Section 2.1.2 yield fairly similar results. This confirms the robustness of our approach with regard to the choice of strategy. The same also holds for an additional option, Agreement, which retains only those examples on which the prediction from the Teacher agrees with the originally requested label. Finally, for comparison, we evaluated yet another strategy, Requested, which always adopts the originally requested labels as chosen for generation. We find that this strategy introduces too many unreliable labels, preventing the model from working well. This confirms that rectifying labels with a Teacher model is a crucial ingredient.

Comparisons on Adversarial Perturbations.
For assessing the value of PGD for adversarial perturbation, Table 4 compares PGD with the standard Fast Gradient Method (FGM) for adversarial perturbation (Goodfellow et al., 2015), as introduced in Section 2.2. We ran experiments on XNLI with 10k and 20k training data, each augmented with 80k induced semantic examples. We observe that FGM obtains a lower average accuracy than PGD with the same amount of training data, confirming the superiority of PGD in providing better adversarial perturbations to improve both generalization and robustness.

Effectiveness on Different Training Sizes.
Data augmentation is an important approach to deal with scarce labels. The results in Table 4 further show that when fine-tuning T5 using 10k XNLI training instances with 80k semantic and 10k reorder augmented examples, we obtain substantially better results than when using 20k training instances without augmentation. We can also observe the improvement of XLM-R with RA, SA, and adversarial training over vanilla XLM-R on each language, as plotted in Figure 2. The relative gains with 10k training data are larger than with 20k training data across a range of languages, which shows that our method is most beneficial when training data is scarce.
Influence of Amount of Augmentation. To assess the role of the amount of data augmentation, we conducted experiments on XNLI with 20k training examples, and evaluated the effect of adding either 20k or 80k augmented examples from EA, RA, and SA. The results are given in Table 5. When trained without PGD, one can often benefit from using up to 80k augmented examples. Due to the inherent reordering differences between English and German, there are limits regarding the amount of such data one ought to incorporate; we find that 20k instances from RA can suffice.

Table 5: Accuracy (in %) on XNLI experiments trained using 20k vs. 80k augmentation data from EA, RA, SA, with and without PGD.

Case Studies. To better illustrate the principles of our data augmentation technique, we provide several examples. Table 6 shows two examples of the three data augmentation processes on XNLI. For the first example, the original label is contradiction, so entailment and neutral serve as requested labels to generate new training text. Next, our Teacher model attempts to rectify these labels. Although our generative model treats "Vrenna and I fought him in a fight, but he had just gotten us" as neutral to S_1 ("Vrenna and I both fought him and he nearly took us"), the Teacher model changes the label to entailment. For the second example, both the generative and Teacher models are unable to conclude that "The rice ripens in the summer" contradicts the premise. From the two EA outputs, we can observe that "him" is randomly deleted in Example (1) and "the" and "rice" are swapped in Example (2), which loses some information, whereas the RA Seq2Seq-generated examples maintain all crucial information despite the reordering.

Related Work
Data Augmentation. Data augmentation is a promising technique, especially when dealing with scarce data, imbalanced data, or semi-supervised learning problems. Back-translation (Sennrich et al., 2015) obtains alternative examples that preserve the original semantics by translating an existing example in language L_A into another language L_B and then translating it back into L_A. Yu et al. (2018) and Xie et al. (2020) applied it to question answering and semi-supervised monolingual training scenarios, respectively. However, this requires high-quality translation engines that often do not exist in the settings in which one wishes to apply cross-lingual systems.
Table 6 (excerpt). Augmentation outputs for Example (2):
O (contradiction). S_1: In summer the rice forms a green velvety blanket, then turns golden in autumn when it ripens and is harvested. S_2: The rice is golden and harvestable in the summer, but turns green in autumn.
EA (contradiction). S_1: Harvested summer the rice forms a green velvety blanket then turns golden in autumn when is ripens and it in. S_2: The the is golden and harvestable in rice summer, but turns green in autumn.
RA (contradiction). S_1: In summer forms the rice a green velvety blanket, turns then in autumn golden when it ripens and harvested is. S_2: The rice is golden and harvestable in the summer, but turns in autumn green.
SA (requested: entailment; rectified: entailment). S_2: The rice turns golden in autumn when it ripens.
SA (requested: neutral; rectified: entailment). S_2: The rice ripens in the summer and then turns golden in the autumn.

Wei and Zou (2019) instead combined synonym replacement, random insertion, random swapping, and random deletion in a method named EDA. Since insertion and deletion may affect the semantics of the utterance, some studies opt to control
the selection of words to be replaced with indicators such as TF-IDF scores (Xie et al., 2020), while other work uses pretrained language models (Radford et al., 2018) to generate a single new sequence in each instance. Our work, in contrast, presents a novel augmentation scheme designed to cope with the special challenges of sentence pair classification, where a Seq2Seq Transformer enables augmentation based on a paired input sentence. Our method also introduces a Teacher model to rectify labels. Beyond this, we extend the idea of language-model-based augmentation to cross-lingual settings and leverage noisy instances with adversarial training.
Adversarial Training. Many approaches have been advanced for improving the robustness of a machine learning system against adversarial perturbations (Szegedy et al., 2014). Goodfellow et al. (2015) proposed a fast gradient method based on linear perturbation of non-linear models. Later, Madry et al. (2017) presented PGD-based adversarial training, which takes multiple projected gradient ascent steps to adversarially maximize the loss. In NLP, Belinkov and Bisk (2017) exploited structure-invariant word manipulation and robust training on noisy texts for improved robustness. Iyyer et al. (2018) proposed syntactically controlled paraphrase networks with back-translated data and used them to generate adversarial examples. Adversarial training also plays a role in improving a neural model's generalization. For instance, Cheng et al. (2019) used adversarial source examples to improve a translation model. Further work exploits FGM-based adversarial training in self-learning for improved cross-lingual text classification. In our setting, we apply adversarial training in the word embedding space and show that PGD-based adversarial training remains effective when the adversarial perturbation is applied to noisy augmented examples.

Conclusion
While multilingual pretrained models have enabled better cross-lingual learning, we still often encounter data scarcity due to the high cost of collecting labeled data, which weakens the generalization ability of multilingual models.
To address this, this paper proposes a novel data augmentation strategy with label rectification to build synthetic examples, outperforming even models trained with larger amounts of ground-truth data. We show that we can best learn from such noisy instances with adversarial training, which enables the classifier to transfer more information from the source language to other languages and to become more robust. Remarkably, with this, our models trained without any target language training data at all are able to outperform models trained fully on in-language training data. Moreover, the amount of augmented data from our Seq2Seq-based reorder augmentation used in training is much less than that required by the state-of-the-art EDA method in order to achieve comparable performance. Finally, in our series of follow-up experiments comparing different training regimens and variants, one notable finding is that our overall augmented approach can even outperform non-augmented supervision with twice as many ground truth labels. Overall, this suggests our combination of data augmentation with adversarial training as a valuable way of learning substantially more accurate and more robust models without any target-language training data.

Broader Impact
Research on cross-lingual NLP is often motivated by a desire to provide state-of-the-art advances to linguistic communities that have been underserved. Such advances may enable better access to information as well as to products and services. However, there is a risk that such technological advances may not always be desired by the relevant communities and may indeed also cause harm to them (Bird, 2020). Moreover, cross-lingual systems in particular may exhibit biases with regard to the source language used for training and the general cultural assumptions reflected in such data. In light of this, special care needs to be taken to analyze potential outcomes and risks before deploying cross-lingual systems in real-world applications.