Learning to Perturb Word Embeddings for Out-of-distribution QA

QA models based on pretrained language models have achieved remarkable performance on various benchmark datasets. However, they do not generalize well to unseen data that falls outside the training distribution, due to distributional shifts. Data augmentation (DA) techniques which drop or replace words have been shown to be effective in regularizing the model against overfitting to the training data. Yet, they may adversely affect QA tasks, since they incur semantic changes that may lead to wrong answers. To tackle this problem, we propose a simple yet effective DA method based on a stochastic noise generator, which learns to perturb the word embeddings of the input questions and contexts without changing their semantics. We validate the performance of QA models trained with our word embedding perturbation on a single source dataset, on five different target domains. The results show that our method significantly outperforms the baseline DA methods. Notably, the model trained with our method outperforms the model trained with more than 240K artificially generated QA pairs.


Introduction
Deep learning models have achieved impressive performance on a variety of real-world natural language understanding tasks, such as text classification, machine translation, question answering, and text generation, to name a few (Vaswani et al., 2017; Seo et al., 2017). Recently, language models pretrained on a large amount of unlabeled data have achieved breakthroughs in performance on these downstream tasks (Devlin et al., 2019), even surpassing human performance on some of them.
The success of such data-driven language model pretraining heavily depends on the amount and diversity of the training data available, since a model trained with a small amount of highly biased data can overfit and may not generalize well to out-of-distribution data. Data augmentation (DA) techniques (Krizhevsky et al., 2012; Verma et al., 2019a; Yun et al., 2019; Sennrich et al., 2016) can prevent this to a certain extent, but most of them are developed for image domains and are not directly applicable to augmenting words and texts. Perhaps the most important desideratum for an augmentation method in supervised learning is that it should not change the label of an example. For image domains, there exist several well-defined data augmentation techniques that produce diverse augmented images without changing the semantics. In contrast, for Natural Language Processing (NLP), it is not straightforward to augment the input texts without changing their semantics. Simple augmentation techniques that preserve semantics are replacing words with synonyms or using back-translation (Sennrich et al., 2016). However, they do not effectively improve generalization performance because the diversity of viable transformations with such techniques is highly limited (Pham et al., 2021).
Some recent works (Wei and Zou, 2019; Ng et al., 2020) propose data augmentation methods tailored for NLP tasks based on dropping or replacing words, and show that such augmentation techniques improve performance on out-of-domain as well as in-domain tasks. As shown in Fig. 1, however, we observe that most existing data augmentation methods for NLP change the semantics of the original inputs. While such a change in semantics may not be a serious problem for certain tasks, it can be critical for the Question Answering (QA) task due to its sensitivity to the semantics of the inputs. For instance, replacing a single word with a synonym (Hesburgh → Vanroth in Fig. 1) might cause a drastic semantic drift of the answer (Jia and Liang, 2017). Consequently, such word-level augmentations are ineffective for QA tasks, and most existing works on data augmentation for QA tasks resort to question or QA-pair generation. Yet, this approach requires a large amount of training time, since we have to train a separate generator, generate QA pairs from it, and then use the generated pairs to train the QA model. Moreover, QA-pair generation methods are not sample-efficient, since they usually require a large number of generated pairs to achieve meaningful performance gains. To address such limitations of the existing data augmentation techniques for QA, we propose a novel DA method based on learnable word-level perturbation, which effectively regularizes the model to improve its generalization to unseen questions and contexts with distributional shifts. Specifically, we train a stochastic perturbation function to learn how to perturb each word embedding of the input without changing its semantics, and augment the training data with the perturbed samples. We refer to this data augmentation method as Stochastic Word Embedding Perturbation (SWEP).
The objective of the noise generator is to maximize the log-likelihood of the answer given the perturbed input, while minimizing the Kullback-Leibler (KL) divergence between the prior noise distribution and the conditional noise distribution given the input. Since the perturbation function maximizes the likelihood of the answer for the perturbed input, it learns how to add noise without changing the semantics of the original input. Furthermore, minimizing the KL divergence prevents the generator from producing identical noise, as the variance of the prior distribution is non-zero, i.e., we can sample diverse noise for the same input.
We empirically validate our data augmentation method on both extractive and generative QA tasks.
We train the QA model on the SQuAD dataset (Rajpurkar et al., 2016) with our learned perturbations, and evaluate the trained model on five different target domains: BioASQ (Tsatsaronis et al., 2012), New York Times, Reddit posts, Amazon reviews, and Wikipedia (Miller et al., 2020), as well as on SQuAD itself, to measure the generalization performance on out-of-domain and in-domain data. The experimental results show that our simple yet effective method improves both the in-domain performance and the out-of-domain robustness of the model, while existing baseline methods often degrade the performance of the QA model due to semantic changes in the words. Notably, our model trained only with the SQuAD dataset shows even better performance than the model trained with 240,422 synthetic QA pairs generated by a question generation model. Our contribution in this work is threefold.
• We propose a simple yet effective data augmentation method to improve the generalization performance of pretrained language models for QA tasks.
• We show that our learned input-dependent perturbation function transforms the original input without changing its semantics, which is crucial to the success of DA for question answering.
• We extensively validate our method for domain generalization tasks on diverse datasets, on which it largely outperforms strong baselines, including a QA-pair generation method.

Related Work
Data Augmentation. As in image domains (Krizhevsky et al., 2012; Volpi et al., 2018; Yun et al., 2019), data augmentation methods are known to be an effective regularizer in the text domain (Sennrich et al., 2016). However, unlike image transformations, which do not change the semantics, transforming raw texts without changing their semantics is difficult since they are composed of discrete tokens. The most common approach for data augmentation in NLP is applying simple perturbations to raw words, by either deleting a word or replacing it with a synonym (Wei and Zou, 2019). In addition, back-translation with neural machine translation has also been shown to be effective, as it paraphrases the original sentence with a different set and ordering of words while preserving the semantics to some extent (Xie et al., 2020). Beyond such simple heuristics, Ng et al. (2020) propose to mask tokens and reconstruct them with a pretrained language model to augment the training data for text classification and machine translation. For QA tasks, question or QA-pair generation (Zhang and Bansal, 2019) is also a popular augmentation technique, which generates questions or question-answer pairs from an unlabeled paragraph that can then be used as additional data to train the model.

Domain Generalization
Unlike domain adaptation, in which the target domains are fixed and we can access unlabeled data from them, domain generalization aims to generalize to unseen target domains without access to data from the target distribution. Several prior works (Li et al., 2018; Balaji et al., 2018; Tseng et al., 2020) tackle this problem, mostly with meta-learning approaches.

Brief Summary of Backgrounds
The goal of extractive Question Answering (QA) is to predict the start and end positions of the answer span y = (y_start, y_end) in a paragraph (context) c = (c_1, ..., c_L) of length L, given a question x = (x_1, ..., x_M). Generative QA instead aims to generate the answer y = (y_1, ..., y_K) rather than predicting the positions of the answer span in the context. A typical approach to QA is to train a neural network to model the conditional distribution p_θ(y|x, c), where θ is composed of θ_f and θ_g, the parameters of the encoder f(·; θ_f) and of the classifier or decoder g(·; θ_g) on top of the encoder. We estimate the parameters θ to maximize the log-likelihood of N observations {(x^(i), c^(i), y^(i))}_{i=1}^N, drawn from some unknown distribution p_train, as follows:

L_MLE(θ) = Σ_{i=1}^N log p_θ(y^(i) | x^(i), c^(i))    (1)

For convenience, we set the total length T := L + M + 3 and abuse notation to define the concatenated sequence of the question x and context c as x := (x_0, x_1, ..., x_M, c_0, c_1, ..., c_{L+1}), where x_0, c_0, and c_{L+1} denote the start, separation, and end symbols, respectively. However, the model trained to maximize the likelihood in Eq. (1) is prone to overfitting and brittle to distributional shifts where the target distribution p_test differs from p_train. In order to tackle this problem, we train the model with additional data drawn from a different generative process to increase the support of the training distribution, achieving better generalization to novel data with distributional shifts. We describe this in the next section.
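To make the extractive objective concrete, the per-example log-likelihood in Eq. (1) can be sketched as below; the logit arrays, span indices, and function names are illustrative stand-ins for the model's start/end classifier outputs, not part of the paper.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def qa_log_likelihood(start_logits, end_logits, y_start, y_end):
    # log p_theta(y | x, c) factorizes over the start and end positions.
    return log_softmax(start_logits)[y_start] + log_softmax(end_logits)[y_end]

# Toy example: a length-6 concatenated sequence with answer span (2, 4).
rng = np.random.default_rng(0)
start_logits = rng.normal(size=6)
end_logits = rng.normal(size=6)
ll = qa_log_likelihood(start_logits, end_logits, 2, 4)
```

Summing this quantity over the N training examples gives L_MLE(θ); maximizing it is ordinary fine-tuning.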

Learning to Perturb Word Embeddings
Several methods for data augmentation have been proposed for the text domain; however, unlike in image domains (Verma et al., 2019a,b; Yun et al., 2019), there does not exist a set of well-defined data augmentation methods that transform the input without changing its semantics. We propose a new data augmentation scheme in which we sample a noise vector z = (z_1, ..., z_T) from a distribution q_φ(z|x) and perturb the input x with the sampled noise without altering its semantics. To this end, the likelihood p_θ(y|x, z) should remain high even after the perturbation, while the perturbed instance should not collapse to the original input. We estimate such parameters φ and θ by maximizing the following objective:

L_noise(φ, θ) = E_{z ∼ q_φ(z|x)}[log p_θ(y|x, z)] − β · D_KL(q_φ(z|x) || p_ψ(z))    (2)

where β ≥ 0 is a hyperparameter which controls the effect of the KL term. We assume that z_t and z_{t'} are conditionally independent given x for t ≠ t', i.e., q_φ(z|x) = Π_{t=1}^T q_φ(z_t|x). The parameter ψ of the prior is a hyperparameter to be specified. When β = 1, the objective corresponds to the Evidence Lower BOund (ELBO) of the marginal likelihood.
Maximizing the expected log-likelihood term in Eq. (2) increases the likelihood evaluated on the perturbed embeddings, and therefore the semantics of the inputs are likely to be preserved after perturbation. The KL divergence term in Eq. (2) penalizes the perturbation distribution q_φ(z|x) for deviating too far from the prior distribution p_ψ(z). We assume that the prior distribution is fully factorized, p_ψ(z) = N(1, αI_d), where 1, I_d, and α denote a vector of ones, the identity matrix, and a positive real number, respectively. Hence, we expect the inputs perturbed with the multiplicative noise to remain close to the original inputs on average. Note that this choice of prior is closely related to Gaussian dropout (Srivastava et al., 2014); we elaborate on this connection later.
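Because both q_φ(z_t|x) and the prior are diagonal Gaussians, the KL term in Eq. (2) has a closed form per coordinate. A minimal sketch; the function name and the toy values of µ, σ², and α are ours:

```python
import numpy as np

def kl_to_prior(mu, sigma_sq, alpha):
    # Closed-form KL( N(mu, diag(sigma_sq)) || N(1, alpha * I) ), summed over dims.
    return float(np.sum(
        0.5 * np.log(alpha / sigma_sq)
        + (sigma_sq + (mu - 1.0) ** 2) / (2.0 * alpha)
        - 0.5
    ))

alpha, d = 0.1, 4
# The KL term vanishes exactly when q matches the prior N(1, alpha * I) ...
kl_zero = kl_to_prior(np.ones(d), alpha * np.ones(d), alpha)
# ... and grows as the mean moves away from the all-ones vector.
kl_pos = kl_to_prior(1.5 * np.ones(d), alpha * np.ones(d), alpha)
```

With a non-zero α, driving the KL to zero still leaves the noise stochastic, which is why the sampled perturbations stay diverse.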
The parameterization of the perturbation function q_φ heavily affects the success of learning with the objective in Eq. (2). The function needs to control the intensity of the perturbation for each token of x without changing the semantics. Since the meaning of each word varies across linguistic contexts, the function should be expressive enough to encode the sentence x into a meaningful latent embedding that contextualizes the subtle meaning of each word in the sentence.
To this end, we share the encoder function f(·; θ_f) to contextualize the input x into hidden representations (h_1, ..., h_T) and feed them into the perturbation function, as shown on the left side of Fig. 2. However, we stop the gradients of the perturbation function from propagating back to the encoder f(·; θ_f). Intuitively, this prevents noisy gradients from flowing into p_θ during the early stage of training. On top of the encoder, we stack a two-layer feed-forward neural network with ReLU activation, which outputs a mean µ_t ∈ R^d and a variance σ_t^2 ∈ R^d for each token, following Kingma and Welling (2014):

(µ_t, σ_t^2) = MLP(h_t),  z_t = µ_t + σ_t ⊙ ε_t,  ε_t ∼ N(0, I_d)    (3)

where we leverage the reparameterization trick (Kingma and Welling, 2014) to sample z_t ∈ R^d. Since x is a sequence of discrete tokens, we map each token x_t to its corresponding word embedding e_t and multiply it with the noise z_t in an element-wise manner as follows:

ẽ_t = e_t ⊙ z_t

where ⊙ denotes element-wise multiplication. We feed (ẽ_1, ..., ẽ_T) to g ∘ f to compute the likelihood p_θ(y|x, z), as shown in Fig. 2.

Figure 2: Overview of how the input is perturbed with SWEP. The model encodes the input into hidden representations with transformers and outputs a suitable noise for each word embedding. The noise is multiplied with the word embedding.
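A minimal numpy sketch of this forward pass; the layer sizes, parameter names, and log-variance parameterization of σ_t^2 are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h, d = 5, 8, 4  # toy sequence length, hidden size, embedding size

# Hypothetical two-layer MLP head; W1, b1, W2, b2 stand in for learned parameters.
W1, b1 = rng.normal(scale=0.1, size=(d_h, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 2 * d)), np.zeros(2 * d)

def perturb(h, e):
    """Sample z_t = mu_t + sigma_t * eps_t and return e_t * z_t for every token."""
    out = np.maximum(h @ W1 + b1, 0.0) @ W2 + b2   # two-layer MLP with ReLU
    mu, log_var = out[:, :d], out[:, d:]
    sigma = np.exp(0.5 * log_var)                  # keeps sigma strictly positive
    eps = rng.standard_normal(size=mu.shape)
    z = mu + sigma * eps                           # reparameterization trick
    return e * z                                   # element-wise perturbation

h = rng.normal(size=(T, d_h))   # contextualized hidden states from the encoder
e = rng.normal(size=(T, d))     # word embeddings of the same tokens
e_tilde = perturb(h, e)
```

Because sampling is expressed as a deterministic function of (µ, σ, ε), gradients flow through µ and σ, which is what lets the perturbation function be trained end to end.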

Learning Objective
As described in Section 3.2, we can jointly optimize the parameters θ and φ with gradient ascent. However, we want to train the QA model with additional data drawn from a different generative process, alongside the given training data, to increase the support of the training distribution, which leads to better regularization and robustness to distributional shift. Therefore, our final learning objective is a convex combination of L_MLE(θ) and L_noise(φ, θ) as follows:

L(φ, θ) = λ · L_MLE(θ) + (1 − λ) · L_noise(φ, θ)    (4)

where 0 < λ < 1 is a hyperparameter which controls the importance of each objective. For all the experiments, we set λ to 0.5. In other words, we train the QA model to maximize the conditional log-likelihood of both the original and perturbed inputs with stochastic gradient ascent.
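Putting Eq. (2) and Eq. (4) together, one training step maximizes a single scalar. A sketch with hypothetical scalar inputs standing in for the clean-batch log-likelihood, the perturbed-batch log-likelihood, and the KL penalty:

```python
def swep_objective(log_lik_clean, log_lik_perturbed, kl, lam=0.5, beta=1.0):
    # L(phi, theta) = lam * L_MLE + (1 - lam) * (E[log p(y|x,z)] - beta * KL)
    assert 0.0 < lam < 1.0
    return lam * log_lik_clean + (1.0 - lam) * (log_lik_perturbed - beta * kl)

# With lam = 0.5 the clean and perturbed terms are weighted equally.
obj = swep_objective(-1.0, -1.2, 0.3)
```

In practice the two log-likelihood terms come from two forward passes over the same mini-batch, one with and one without the sampled noise.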

Connection to Dropout
Since the components of the perturbation vector z_t = (z_{t,1}, ..., z_{t,d}) are independent, we consider only the i-th coordinate. With the reparameterization trick, we can write z_{t,i} = µ_{t,i} + σ_{t,i} · ε_i, where each ε_i ∼ N(0, 1) i.i.d., and µ_{t,i}, σ_{t,i} are the i-th components of µ_t and σ_t, the outputs of the neural network described in Eq. (3). Simply put, each noise element z_{t,i} is sampled from N(µ_{t,i}, σ_{t,i}^2). Assume that z̄ is noise sampled from the prior distribution N(1, α), i.e., z̄ = 1 + √α · ε with ε ∼ N(0, 1). Then z_{t,i} can be expressed in terms of z̄ as follows:

z_{t,i} = µ_{t,i} + (σ_{t,i} / √α) · (z̄ − 1)

If we set α = (1 − p)/p, where p is the retention probability, we can regard z̄ as a Gaussian dropout mask sampled from N(1, (1 − p)/p), which shows performance comparable to a dropout mask sampled from a Bernoulli distribution with retention probability p (Srivastava et al., 2014). We can therefore interpret our perturbation function as an input-dependent dropout that scales and translates the Gaussian dropout mask, and thus flexibly controls the intensity of the perturbation adaptively for each word embedding of the input x.
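The identity above can be checked numerically. A small sketch using only the standard library; the values chosen for µ_{t,i}, σ_{t,i}, and the retention probability are hypothetical:

```python
import math
import random

random.seed(0)
p = 0.9                    # retention probability
alpha = (1.0 - p) / p      # variance of the Gaussian dropout mask, here 1/9
mu, sigma = 1.3, 0.4       # hypothetical per-coordinate outputs of the MLP

for _ in range(5):
    eps = random.gauss(0.0, 1.0)
    z_direct = mu + sigma * eps                 # z_{t,i} = mu + sigma * eps
    z_bar = 1.0 + math.sqrt(alpha) * eps        # Gaussian dropout mask ~ N(1, alpha)
    z_rescaled = mu + (sigma / math.sqrt(alpha)) * (z_bar - 1.0)
    assert abs(z_direct - z_rescaled) < 1e-12   # the two samples coincide
```

Since both expressions are the same affine function of the same ε, the learned noise is exactly a scaled and translated Gaussian dropout mask.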

Task
Our goal is to regularize the QA model to generalize to unseen domains, such that it is able to answer questions from a new domain. We consider a challenging setting where the model is trained on a single source dataset and evaluated on datasets from unseen domains, as well as on unseen examples from the source domain. Specifically, we train the QA model on the SQuAD dataset (Rajpurkar et al., 2016) as the source domain and test it on several target-domain QA datasets: BioASQ (Tsatsaronis et al., 2012), New Wikipedia (Wiki), New York Times (NYT), Reddit posts, and Amazon Reviews (Miller et al., 2020). We evaluate the QA model with F1 and Exact Match (EM) scores, following the convention for extractive QA tasks. For the BioASQ dataset, we use the version provided in the MRQA shared task (Fisch et al., 2019). We downloaded the other datasets from the official website of Miller et al. (2020).

Experimental Setup
Implementation Detail. As the encoder f, we use the pretrained language models BERT-base (Devlin et al., 2019) and ELECTRA-small (Clark et al., 2020) for extractive QA, and randomly initialize an affine transformation layer for g. For the generative QA task, we use T5-small (Raffel et al., 2020) as the encoder-decoder model g ∘ f. For the perturbation function q_φ, we stack two feed-forward layers with ReLU on top of the encoder, as described in Section 3.2. For the extractive QA task, we train the model for 2 epochs with batch size 8 and use the AdamW optimizer (Loshchilov and Hutter, 2019) with learning rate 3·10^−5. For the T5 model, we train for 4 epochs with batch size 64 and use the Adafactor optimizer (Shazeer and Stern, 2018) with learning rate 10^−4. We use beam search with width 4 to generate answers for generative question answering.

Baselines
We experiment with our model SWEP and its variant against several baselines.
1. MLE: The base QA model fine-tuned to maximize L_MLE(θ).
2. Adv-Aug: Following Volpi et al. (2018), we perturb the word embeddings of the input x with an adversarial objective and use them as additional training data to maximize L_MLE(θ). We assume that the answer for each question and context remains the same after the adversarial perturbation.
3. Gaussian-Dropout: The model whose word embeddings are perturbed with a dropout mask sampled from a Gaussian distribution N(1, (1−p)/p), where p is the retention probability, with the dropout probability 1 − p set to 0.1 (Srivastava et al., 2014).
4. Bernoulli-Dropout: The model whose word embeddings are perturbed with a dropout mask sampled from a Bernoulli distribution Ber(1 − p), where p is the dropout probability, set to 0.1 (Srivastava et al., 2014).
5. Word-Dropout: The model trained to maximize L_MLE(θ) with word dropout (Sennrich et al., 2016), where tokens of x are randomly set to the zero embedding.
6. SSMBA: The QA model trained to maximize L_MLE(θ) with additional examples generated by the technique proposed in Ng et al. (2020), which corrupts the target sequences and reconstructs them using a masked language model, BERT.
7. Prior-Aug: A variant of SWEP trained with additional perturbed data, where the noise is drawn from the prior distribution p_ψ(z) rather than q_φ(z|x).
8. SWEP: Our full model, which maximizes the objective function in Eq. (4).
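The two dropout baselines differ only in how the multiplicative mask is drawn. A toy sketch; the array shapes, seed, and variable names are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
drop_prob = 0.1                      # dropout probability used by the baselines
retain = 1.0 - drop_prob
alpha = (1.0 - retain) / retain      # variance of the Gaussian mask, about 0.111

emb = rng.normal(size=(5, 4))        # toy word embeddings (T=5 tokens, d=4)

# Gaussian-Dropout: multiplicative mask centered at 1 with variance alpha.
gaussian_mask = rng.normal(1.0, np.sqrt(alpha), size=emb.shape)
# Bernoulli-Dropout: binary keep/drop mask with retention probability 0.9.
bernoulli_mask = rng.binomial(1, retain, size=emb.shape)

perturbed_g = emb * gaussian_mask
perturbed_b = emb * bernoulli_mask
```

SWEP replaces these fixed, input-independent masks with the learned, per-token noise z_t.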

Experimental Result
We compare SWEP and its variant Prior-Aug against the baselines described in Section 4.1. As shown in Table 1, our model outperforms all the baselines, whose backbone networks are BERT or ELECTRA, on most of the datasets. Data augmentation with SSMBA improves the performance of ELECTRA on the in-domain dataset SQuAD and on Wiki. However, it significantly underperforms ours on out-of-domain datasets, even though data augmentation with SSMBA uses 4.8 times more data than ours. Similarly, Table 2 shows that the T5 model trained with our method consistently improves over the model trained with MLE on most of the datasets. Contrary to ours, SSMBA significantly degrades the performance of the BERT and T5 models on both in-domain and out-of-domain datasets. Since masking and reconstructing some of the tokens of a sentence with a masked language model may cause a semantic drift, those transformations make some questions unanswerable. As a result, data augmentation with SSMBA often hurts the performance of the QA model. Similarly, Word-Dropout randomly zeroes out the word embeddings of tokens, but some of the zeroed-out words are critical for answering the questions. Adv-Aug marginally improves performance, but it requires an additional backward pass to compute the gradient for the adversarial perturbation, which slows down the training procedure.

Low Resource QA
We empirically show that our data augmentation method SWEP is an effective regularizer in settings where only a few annotated training examples are available. To simulate such a scenario, we reduce the number of labeled SQuAD examples to 80%, 50%, 30%, and 10%, and train the model with the same experimental setup described in Section 4.2. Fig. 3 shows accuracy as a function of the percentage of QA pairs. Ours consistently improves the performance of the QA model at all ratios of labeled data. Even with 10% of the labeled data, it increases the EM and F1 scores by 1%.

Data augmentation with QG
We show that our data augmentation is sample-efficient and further improves the performance of the QA model trained with additional synthetic data generated by a question-answer generation (QG) model. We use Info-HCVAE to generate QA pairs from unlabeled paragraphs and train the BERT model with human-annotated and synthetic QA pairs, while varying the number of generated pairs. As shown in Fig. 4, SWEP trained only with SQuAD already outperforms the model trained with 240,422 synthetic QA pairs generated with Info-HCVAE. Moreover, when combining the two methods, we achieve even larger performance gains than when using either SWEP or Info-HCVAE alone, as the two approaches are orthogonal.

Ablation Study
We further perform an ablation study to verify the effectiveness of each component of SWEP. In Table 3, we present experimental results obtained while removing various parts of our model. First, we replace the element-wise multiplicative noise with element-wise additive noise and set the prior distribution to N(0, αI_d). We observe that the noise generator then fails to learn meaningful perturbations, which leads to performance degradation. Moreover, instead of learning µ_t or σ_t from the data, we fix either of them and perform experiments, which we denote w/ fixed µ and w/ fixed σ. For all time steps t, we set µ_t to (1, ..., 1) ∈ R^d for w/ fixed µ. For w/ fixed σ, we set σ_t^2 to (1, ..., 1) ∈ R^d, i.e., we use the identity matrix I_d as the covariance of q_φ(z|x). As shown in Table 3, fixing µ_t or σ_t^2 to predefined values achieves slightly better performance than Prior-Aug, but degrades the performance of the full model. Based on these experimental results, we verify that learning µ_t and σ_t^2 for each word embedding e_t is crucial to the success of the perturbation function, as it allows each word to be perturbed delicately and with more flexibility.
Furthermore, we convert the stochastic perturbation into a deterministic one, which we denote as w/o ε ∼ N(0, I_d). Specifically, the MLP(h_t) in Eq. (3) outputs µ_t alone, and we multiply it with e_t without any sampling, i.e., ẽ_t = e_t ⊙ µ_t. As shown in Table 3, the deterministic perturbation largely underperforms the full model. In terms of the objective function, we observe that removing L_MLE(θ) results in larger performance drops, suggesting that using both the augmented and original instances in a single batch is crucial for the performance improvement. In addition, the experiment without D_KL shows the importance of imposing a constraint on the distribution of the perturbation with the KL term.

Quantitative Analysis
We quantitatively analyze the intensity of the perturbations applied to the input during training. To measure the semantic drift, we count how many words are replaced with another word during training for each data augmentation method, and plot the results in Fig. 6. Unlike SSMBA, which replaces a predefined percentage of words with others, the adversarial augmentation (Adv-Aug) and SWEP perturb the word embeddings in the latent space. We project the perturbed embeddings back to the input space to count how many words are changed. Specifically, each word w_t ∈ R^|V| is represented as a one-hot vector and mapped to a word vector as e_t = W_e w_t, where V denotes the vocabulary of the training data and W_e ∈ R^{d×|V|} is the word embedding matrix. The perturbed word embedding ẽ_t is then projected back to a one-hot vector w̃_t as follows:

w̃_t = one-hot(argmin_j ||ẽ_t − (W_e)_j||_2, |V|)

where (W_e)_j is the j-th column of W_e and one-hot(j, |V|) produces a one-hot vector of length |V| whose j-th component is one.
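A sketch of this projection and of the replaced-word ratio on a toy vocabulary; the embedding matrix, token ids, and noise magnitudes are made up for illustration:

```python
import numpy as np

def project_to_vocab(e_tilde, W_e):
    # Nearest vocabulary word (in l2 distance) for each perturbed embedding.
    # W_e: (d, V) embedding matrix; e_tilde: (T, d) perturbed embeddings.
    dists = np.linalg.norm(e_tilde[:, None, :] - W_e.T[None, :, :], axis=-1)
    return dists.argmin(axis=-1)

def replaced_ratio(token_ids, e_tilde, W_e):
    # Fraction of tokens whose nearest word changed after perturbation.
    return float(np.mean(project_to_vocab(e_tilde, W_e) != np.asarray(token_ids)))

# Toy vocabulary of 3 well-separated words in a 2-d embedding space.
W_e = np.array([[0.0, 10.0, 0.0],
                [0.0, 0.0, 10.0]])
tokens = [0, 1, 2]
e = W_e.T[tokens]                                     # original token embeddings
ratio_small = replaced_ratio(tokens, e * 1.01, W_e)   # mild multiplicative noise
ratio_large = replaced_ratio(tokens, e + 8.0, W_e)    # large additive shift
```

A mild multiplicative perturbation leaves every token mapped to its own word, while a large shift starts flipping tokens to other words, which is exactly the quantity plotted in Fig. 6.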
In Fig. 6, we plot the ratio of words replaced with others in the raw data before and after each perturbation, for each batch, as training progresses. In Fig. 1, for example, SSMBA changes about 11 raw words while SWEP does not change any. We observe that around 20% of the perturbed words are not projected back to their original words when we apply the adversarial augmentation. That is, the adversarial augmentation largely changes the semantics of the words, even though the perturbation at the final layer stays within the epsilon neighborhood of the latent embedding. In contrast, the perturbation by SWEP rarely changes the original words, except at the very early stage of training. This observation implies that SWEP learns a range of perturbations that preserves the semantics of the original input, which is important when augmenting data for QA tasks, and verifies the concept described in Fig. 1.

Qualitative Analysis
In Fig. 5, we visualize the l2 distance between each original word embedding and its perturbed counterpart after training. We observe that the perturbation function q_φ learns to generate adaptive perturbations for each word (e.g., the lowest perturbation intensity falls on the answer-like words "professor jerome green"). However, it remains unknown why the intensity for certain words is higher than for others, and how such differences affect the training dynamics. We include further observations, such as embedding space visualizations, in Figure 7.

Conclusion
We proposed a simple yet effective data augmentation method based on stochastic word embedding perturbation for out-of-distribution QA tasks. Specifically, our stochastic noise generator learns to generate adaptive noise depending on the contextualized embedding of each word. It maximizes the likelihood of the perturbed input, such that it learns to modulate the intensity of the perturbation for each word embedding without changing the semantics of the given question and paragraph. We augmented the training data with perturbed samples generated by our method, trained the model on a single source dataset, and evaluated it on datasets from five different domains as well as on the in-domain dataset. The experimental results verify that our method improves both in-domain generalization performance and robustness to distributional shifts, outperforming the baseline data augmentation methods. Further quantitative and qualitative analyses suggest that our method learns to generate adaptive perturbations without semantic drift.

Broader Impact
Our data augmentation method SWEP efficiently improves the robustness of QA models to unseen out-of-domain data at little additional computational cost. This robustness is crucial to the success of real-world QA models, since they frequently encounter questions from unseen domains posed by end-users. While previous works require a set of several heterogeneous datasets to learn domain-invariant representations, which is not sample-efficient, our method is simple yet effective and can improve the robustness of the QA model when trained on only a single source dataset.

A Experimental Setup
A.1 Dataset Statistics

Table 4 describes detailed dataset statistics.

A.2 Baselines
1. Word-Dropout: We set the dropout probability to 0.1, which is the same as the dropout probability of the backbone networks: BERT, ELECTRA, and T5.

2. Adv-Aug: We follow the adversarial perturbation of Volpi et al. (2018). We set the number of iterations for the perturbation to 5, which is far fewer steps than in the original paper, due to the computational cost.
3. SSMBA: We use the official code released with the original paper to augment the training data from SQuAD. We set the masking probability to 0.25 and sample 8 different examples for each training instance. In total, we synthesize 426,266 additional training instances.

4. Prior-Aug: We set α to 0.1, which corresponds to the dropout probability of the backbone networks.

A.3 Data Augmentation with QG
Following the experimental setup of the original Info-HCVAE paper, we split the original SQuAD validation set in half into new validation and test sets. We download the synthetic QA pairs generated by their generative model Info-HCVAE from the official GitHub repository and augment the SQuAD training data with them. They leverage the generative model to sample QA pairs from unlabeled paragraphs of the HarvestingQA dataset (Du and Cardie, 2018), varying the portion of unlabeled paragraphs used (denoted as H×5%-H×50%). We first fine-tune the BERT-base QA model with the synthetic QA pairs for 2 epochs and further train it with the original SQuAD training data for another 2 epochs. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with learning rates of 2·10^−5 and 3·10^−5 for pretraining and fine-tuning, respectively, with batch size 32. We choose the best checkpoint based on the F1 score on the new validation set and report F1 and Exact Match (EM) scores on the new test set.

Algorithm 1: Training with SWEP

while not converged do
  for each mini-batch (x^(i), y^(i)) in the training data do
    Forward the data without perturbation to compute log p_θ(y^(i)|x^(i))
    Sample z ∼ q_φ(z|x^(i)), forward the perturbed data, and compute L_noise(φ, θ)
    Update θ and φ with L(φ, θ)
  end for
end while

C Further Analysis

Motivated by observations from prior work, we further analyze the adaptive perturbation for each word. Prior work observes that low-frequency words disperse sparsely while high-frequency words concentrate densely in the word embedding space of BERT. Following this setting, we first measure the l2 distance between the k nearest neighbors of each word embedding. Specifically, we rank each word (wordpiece token) by frequency, counted on the SQuAD training set, and sample 100 examples from the SQuAD training set for the analysis. In Table 5, we also observe that low-frequency words lie farther from their neighbors than high-frequency words. We then measure the average l2 distance between word embeddings before and after perturbation, and the average perturbation size for each word, computed as (1/d) Σ_{i=1}^d µ_{t,i}, after training. We observe that low-frequency words tend to be perturbed more than high-frequency words. This observation suggests that the noise generator recognizes the acceptable extent to which words can be perturbed depending on the word embedding distribution, and tends to generate larger perturbations for sparsely dispersed low-frequency words and smaller perturbations for densely concentrated high-frequency words. Note that we use beta annealing to magnify the differences for this analysis, so that β becomes zero in the second epoch.

Table 5: The l2 distance to the k nearest neighbors, the l2 distance between embeddings before and after perturbation, and the average µ value of the word embeddings from BERT, segmented by word frequency rank (lower rank indicates a higher-frequency word).
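The k-NN distance statistic in Table 5 can be computed as below; a toy sketch in which a tight cluster of points stands in for high-frequency words and an isolated point for a low-frequency word (embeddings and k are made up):

```python
import numpy as np

def avg_knn_dist(W, k=1):
    # Mean l2 distance from each word embedding to its k nearest neighbors.
    d = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)  # (V, V) pairwise
    np.fill_diagonal(d, np.inf)       # exclude each word itself
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

# Toy 1-d embeddings: words 0-2 form a dense cluster, word 3 sits alone.
W = np.array([[0.0], [0.1], [0.2], [5.0]])
dists = avg_knn_dist(W, k=1)
```

On real BERT embeddings the same computation, bucketed by frequency rank, yields the per-bucket averages reported in Table 5.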

D Embedding Space Visualization
In Figure 7, we visualize the embedding space using t-SNE (Maaten and Hinton, 2008) for both the word embeddings ((a), (b)) and the contextualized embeddings ((c), (d)), before and after perturbation, from the ELECTRA-small model. We sample the example from the SQuAD training set, the same example as in Figure 1 of the main paper. SWEP encodes each input token x_t into a hidden representation h_t with transformers and outputs a suitable noise for each word embedding. The noise z_t is multiplied with the word embedding e_t of each token x_t. We observe that the perturbed word embeddings are mapped to a different region of the space than the original word embeddings; however, the contextualized embeddings are not much changed by the perturbation. Note that absolute positions differ between plots because of the randomness inherent in the t-SNE algorithm.

Figure 7: (a) word embeddings before perturbation, (b) word embeddings after perturbation, (c) contextualized embeddings before perturbation, (d) contextualized embeddings after perturbation.