Adversarial Training for Machine Reading Comprehension with Virtual Embeddings

Adversarial training (AT), as a regularization method, has proved effective on various tasks. Though there are successful applications of AT to some NLP tasks, the distinguishing characteristics of NLP tasks have not been exploited. In this paper, we aim to apply AT to machine reading comprehension (MRC) tasks. Furthermore, we adapt AT for MRC by proposing a novel adversarial training method called PQAT that perturbs the embedding matrix instead of word vectors. To differentiate the roles of passages and questions, PQAT uses additional virtual P/Q-embedding matrices to gather the global perturbations of words from passages and questions separately. We test the method on a wide range of MRC tasks, including span-based extractive RC and multiple-choice RC. The results show that adversarial training is universally effective, and that PQAT further improves performance.


Introduction
Neural networks have achieved superior performance on many tasks, but they are vulnerable to adversarial examples (Szegedy et al., 2014): examples that have been mixed with certain perturbations. Adversarial training (AT) (Goodfellow et al., 2015) uses both clean and adversarial examples to improve the robustness of models for image classification.
In the field of NLP, Miyato et al. (2017) applied adversarial training to text classification tasks and improved model performance. Since then, many AT methods have been proposed (Wu et al., 2017; Yasunaga et al., 2018; Bekoulis et al., 2018; Zhu et al., 2020; Jiang et al., 2019; Pereira et al., 2020). They mostly adopt a general AT strategy, but focus less on the adaptation of AT to NLP tasks. To explore this adaptation, in this work we aim to apply adversarial training on machine reading comprehension (MRC) tasks, which exhibit complex NLP characteristics.

Passage: ... The rock cycle is an important concept in geology which illustrates the relationships between these three types of rock, and magma. When a rock crystallizes from melt (magma and/or lava), it is an igneous rock. ...

Question: An igneous rock is a rock that crystallizes from what?

Table 1: An example from the SQuAD dataset. We highlight two words, rock and igneous, for better demonstration. Words with the same color are injected with the same perturbation by PQAT. Different occurrences of the same word (for example, rock in the passage and question) are perturbed differently depending on their roles.
The objective of MRC is to let a machine read the given passages and ask it to answer the related questions. There are several types of MRC tasks. In this work we focus on span-based extractive RC (Rajpurkar et al., 2016, 2018; Yang et al., 2018) and multiple-choice RC (Lai et al., 2017). To apply adversarial training on MRC tasks, we notice that there are several salient characteristics of MRC compared to other tasks such as image classification: (1) The inputs are discrete. Unlike pixels, which can take continuous values, words are discrete tokens.
(2) The tokens in the input sequences are not independent. A word may occur in an input sequence several times. After the embedding layer, these occurrences are represented by the word vectors with the same value and hold the same semantic meaning (although the word may be polysemous). (3) The roles of passages and questions are different. Given a question as the query, the model needs to look up the correct answer in the passage.
People have utilized the first characteristic to apply adversarial training by perturbing input word vectors instead of tokens. However, the second and third characteristics have been largely ignored. For example, in Table 1, which is a passage-question pair from the SQuAD dataset, the word rock appears multiple times. In standard adversarial training, the perturbations added to each occurrence of rock are different, ignoring the fact that they share the same meaning. On the other hand, occurrences of the same word in the passage and in the question play different roles, such as rock in the passage versus the question. It is appropriate to treat them differently.
To take the second and third characteristics into consideration, we propose a novel adversarial training method called PQAT. The core of PQAT is the virtual P/Q-embeddings, two independent embedding spaces for passages and questions. Each time we calculate perturbations, the P/Q-embeddings gather the perturbations from passages and questions for each word, then generate a global, role-aware perturbation for each word from the passages and questions separately. For example, in Table 1, the perturbations on all occurrences of rock in the passage and in the question are gathered into the two matrices separately, forming global, role-aware perturbations of rock. PQAT is as efficient as standard AT, with nearly no extra time cost. Also, the virtual P/Q-embeddings are only used during training; they are discarded once training is finished. Thus PQAT increases neither the model size nor the inference time for predictions.
We have applied adversarial training on several MRC tasks, including span-based extractive RC and multiple-choice RC. Results show that adversarial training improves the MRC model performance universally and consistently, even over the strong pre-trained model baseline. Furthermore, the PQAT method outperforms the standard AT on both normal datasets and adversarial datasets. Lastly, our results verify the usefulness of incorporating information of task form into the design of the adversarial training method.

Standard Adversarial Training
Adversarial training first constructs adversarial examples by generating worst-case perturbations that maximize the current loss, then minimizes the loss on those adversarial examples. In NLP tasks, a popular approach to generating perturbations is to perturb the word vectors from the embedding layer (Miyato et al., 2017). We denote the input token sequence as X and the operation of looking up in an embedding layer E as emb(E, ·). The objective of AT is

min_θ E_(X,y) [ max_{||δ|| ≤ ε} L(f_θ(X_vec + δ), y) ]

where f_θ(·) is the model parametrized by θ, excluding the word embedding layer; X_vec = emb(E, X) is the word-vector sequence of the input; and L is the loss function. We perturb the word vectors with the adversarial perturbation δ. δ can be estimated by linearizing L(f_θ(X_vec + δ), y) around X and performing multi-step projected gradient descent (PGD) (Madry et al., 2018):

δ_{t+1} = Π_{||δ|| ≤ ε} ( δ_t + α g_t / ||g_t|| ),   g_t = ∇_δ L(f_θ(X_vec + δ_t), y)

where t is the gradient descent step and Π_{||δ|| ≤ ε} denotes the projection of δ back onto the ε-ball. g_t is the gradient of the loss with respect to the perturbation δ. More gradient descent steps give a better approximation of δ, but are also more expensive in computation.
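The PGD procedure described above can be sketched in a few lines of PyTorch. This is a minimal illustration under our own assumptions (function names, step size alpha, and the L2 projection are ours), not the paper's exact implementation:

```python
import torch


def pgd_adversarial_loss(model, loss_fn, x_vec, y, eps=1e-2, alpha=1e-2, steps=2):
    """Estimate a worst-case perturbation delta on the word vectors x_vec via
    K-step PGD, then return the loss on the perturbed input."""
    delta = torch.zeros_like(x_vec, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x_vec + delta), y)
        # Gradient of the loss w.r.t. the perturbation only.
        (g,) = torch.autograd.grad(loss, delta)
        # Ascent step in the normalized gradient direction.
        delta = delta + alpha * g / (g.norm() + 1e-12)
        # Project delta back onto the eps-ball (radial scaling).
        norm = delta.norm()
        if norm > eps:
            delta = delta * (eps / norm)
        delta = delta.detach().requires_grad_(True)
    # Loss on the adversarial input; backpropagating it updates the model.
    return loss_fn(model(x_vec + delta), y)
```

A small `steps` value (the paper uses K = 2) keeps the extra cost low while still producing a useful approximation of the worst-case perturbation.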

Adversarial Training for MRC
In the above algorithm, when generating the perturbations on X vec through backward propagation, each word vector X i vec is perturbed independently, like the pixels in an image. It ignores the semantic relationship among the word vectors of a word's different occurrences. To make the perturbation on each occurrence aware of other occurrences of the same word, we adapt AT by gathering not only the perturbations on each word vector, but also the perturbations on the embedding matrix. The latter can be seen as the global perturbations, which provide context-insensitive semantic information.
The global perturbations are rather coarse-grained, since all occurrences of the same word receive the same global perturbation. Note that in MRC tasks, words in passages and questions play different roles. Thus, to preserve this information, we distinguish the words in passages and questions by creating two virtual embedding matrices P and Q: the P-embedding matrix P collects the perturbations of all words from the passages, and the Q-embedding matrix Q collects those from the questions. We give an illustration in Figure 1. The P/Q-embedding matrices are virtual since they only provide perturbations, not real word vectors. During training, the perturbations from the virtual embeddings and the word vectors are summed up to form the adversarial input Z_vec. The final objective is obtained by replacing X_vec in the AT objective with

Z_vec = [X^P_vec + P_vec ; X^Q_vec + Q_vec] + δ   (5)

where P_vec = emb(P, X^P) and Q_vec = emb(Q, X^Q) are the perturbations from the virtual embeddings, X^P and X^Q stand for the passage and question sections in X, and [· ; ·] denotes concatenation. In this way, we generate fine-grained local perturbations δ by standard AT, and global, role-aware perturbations P_vec and Q_vec by the virtual P/Q-embeddings. We call the latter process PQAT, which is the main adaptation of adversarial training for MRC.
We list the overall algorithm of adversarial training for MRC in Algorithm 1. We initialize P and Q from a Gaussian distribution. For each batch, we perform K-step gradient descent (lines 9-22): we look up the original word vectors and the P/Q-embedding vectors from the embedding layer E and the P/Q-embedding matrices. The adversarial inputs are constructed by summing them with the local perturbations δ. Then we compute the gradients of the model parameters g_t, the local perturbations g_δ, and the P/Q-embedding matrices g_P and g_Q. These gradients can be calculated in a single backward pass. Lastly, we update the virtual embeddings and local perturbations (lines 18-21).
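The gathering step at the heart of PQAT can be illustrated as follows. This is our own simplified one-step sketch (function name, normalization, and the scatter-add formulation are ours); the paper's Algorithm 1 updates P and Q iteratively across the K PGD steps:

```python
import torch


def pqat_perturbations(grad_wordvecs, token_ids, is_passage, vocab_size, eps_p, eps_q):
    """Gather per-occurrence gradients into virtual P/Q-embedding matrices so
    that all occurrences of a word in the passage (resp. question) share one
    global, role-aware perturbation."""
    dim = grad_wordvecs.size(-1)
    P = torch.zeros(vocab_size, dim)
    Q = torch.zeros(vocab_size, dim)
    # Scatter-add gradients of passage tokens into P, question tokens into Q;
    # repeated occurrences of a word accumulate into the same row.
    P.index_add_(0, token_ids[is_passage], grad_wordvecs[is_passage])
    Q.index_add_(0, token_ids[~is_passage], grad_wordvecs[~is_passage])
    # Normalize each virtual embedding matrix and scale by its strength eps.
    P = eps_p * P / (P.norm() + 1e-12)
    Q = eps_q * Q / (Q.norm() + 1e-12)
    # Look up the role-aware global perturbation for every input position.
    return torch.where(is_passage.unsqueeze(-1), P[token_ids], Q[token_ids])
```

Because the perturbation is looked up per word id, all occurrences of a word within the same role receive an identical perturbation, while its occurrences in the passage and in the question are perturbed differently.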
Note that the P/Q-embedding matrices serve only as containers for perturbations. When training is finished, they are no longer needed and can be discarded. The strengths ε_δ, ε_P and ε_Q control standard AT and PQAT: if ε_δ = 0, we have a purely P/Q-embedding-based adversarial training, i.e., PQAT; while if ε_P = ε_Q = 0, we recover standard AT.
To construct the inputs, for span-based extractive RC we concatenate each passage P and question Q with model-dependent special tokens; for multiple-choice RC with m options per example, we append each option to the concatenation of P and Q, constructing m input sequences per example.
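The two input-construction schemes can be sketched as below. The BERT-style `[CLS]`/`[SEP]` tokens are illustrative only; as noted above, the actual special tokens are model-dependent:

```python
def build_inputs(passage_tokens, question_tokens, options=None,
                 cls="[CLS]", sep="[SEP]"):
    """Build model input sequences from a passage and a question.
    Returns one sequence for extractive RC, or m sequences (one per
    option) for multiple-choice RC with m options."""
    base = [cls] + passage_tokens + [sep] + question_tokens + [sep]
    if options is None:
        # Span-based extractive RC: a single passage-question sequence.
        return [base]
    # Multiple-choice RC: append each option, yielding m sequences.
    return [base + opt + [sep] for opt in options]
```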
When applying AT or PQAT, we only perturb the word embeddings and leave the position embeddings unchanged. For PQAT on RACE, we let the Q-embedding matrix collect perturbations from both questions and options.

Training Settings and Hyperparameters. All models are implemented with Transformers (Wolf et al., 2019) and trained on a single Nvidia V100 GPU. To improve stability and reduce the uncertainty of the results, we run each experiment four times with different seeds and report the mean performance. We use AdamW as our optimizer, with batch size 24 and learning rate 3e-5 for RoBERTa BASE and 2e-5 or 1e-5 for RoBERTa LARGE. The maximum number of epochs is set to 3 for SQuAD and 5 for RACE and HotpotQA. We use a linear learning rate decay schedule with warmup ratio 0.1. For PQAT, ε_δ is set to 0, and ε_P and ε_Q are set to 4e-2 for RACE and 2e-2 for the other tasks. The initialization variance σ is 1e-2. We set the number of gradient descent steps K = 2 to balance speed and performance.
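The warmup-then-decay schedule described above can be reproduced with a plain PyTorch `LambdaLR`; this is a sketch of the schedule shape, not the paper's code (the Transformers library also ships an equivalent `get_linear_schedule_with_warmup` helper):

```python
import torch


def linear_schedule_with_warmup(optimizer, total_steps, warmup_ratio=0.1):
    """Linearly warm up to the peak learning rate over the first
    warmup_ratio of training, then decay linearly to zero."""
    warmup_steps = int(total_steps * warmup_ratio)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```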

Overall Results
The overall results are summarized in Table 2. PQAT is able to boost model performance across all MRC tasks and outperforms the RoBERTa baseline significantly. On HotpotQA, which is a complicated MRC task that features multi-hop questions and asks for multiple kinds of predictions, PQAT still outperforms the baseline by 1.3/1.1 on Joint EM/Joint F1. On RACE, PQAT improves the performance significantly, by 1.5% in accuracy. The universal improvements on various kinds of MRC tasks prove the wide applicability of PQAT.

Comparison
We compare different adversarial training methods and their combinations by tuning the perturbation strengths {ε_δ, ε_P, ε_Q}. The results are in Table 3. The underlined scores are the ones reported in Table 2. First, to test the effectiveness of standard AT, we disable PQAT by setting ε_P = ε_Q = 0 and enable standard AT with ε_δ = 2e-3 for RACE and 1e-2 for the other tasks¹. Other settings are unchanged, and we still follow Algorithm 1. PQAT consistently outperforms standard AT on the three tasks. Then we enable both PQAT and standard AT by setting all the strengths {ε_δ, ε_P, ε_Q} to non-zero values.
The performance gets slightly better on SQuAD 1.1 and RACE, but gets worse on SQuAD 2.0.
Compared with standard AT, PQAT achieves higher performance by itself. Therefore, PQAT could be a better alternative when applying adversarial training to MRC tasks.

Robustness on Adversarial Datasets
We assess the robustness of MRC models with AddSent and AddOneSent, two adversarial datasets built on SQuAD 1.1. In both datasets, passages are appended with distracting sentences. MRC models that rely heavily on text matching may be easily fooled into predicting wrong answers from the distracting sentences.
The results are shown in Table 4. With the standard adversarial training (AT), the MRC model improves its robustness by about 5% over RoBERTa BASE in F1. PQAT further improves the performance over AT by about 1% on both AddSent and AddOneSent.

Conclusion
We have applied adversarial training on a wide range of MRC tasks, including span-based extractive RC and multiple-choice RC. In particular, we have proposed a novel adversarial training method, PQAT, which uses virtual P/Q-embedding matrices to generate global, role-aware perturbations that account for the characteristics of MRC tasks. Our experiments demonstrate that adversarial training improves MRC model performance universally and consistently, even over the strong pre-trained model baseline. PQAT further improves performance over standard AT on both normal and adversarial datasets.