Learning Invariant Representation Improves Robustness for MRC Models



Introduction
Machine reading comprehension (MRC) (Zeng et al., 2020) aims to answer a question based on a passage as context. It has experienced rapid development thanks to the evolution of deep neural networks (Seo et al., 2017; Devlin et al., 2019) and the release of large-scale, high-quality datasets (Rajpurkar et al., 2016; Dzendzik et al., 2021). By introducing different kinds of attack sets, however, a considerable amount of literature (Jia and Liang, 2017; Gan and Ng, 2019; Liu et al., 2020b; Si et al., 2021) has shown that results on in-domain test sets tend to overestimate models' performance. For example, Jia and Liang (2017) exposed models' over-stability by adding one distracting sentence, which suggests that models cannot distinguish a sentence that actually answers the question from one that merely shares sufficient words with it while being semantically different. Gan and Ng (2019) showed that a question paraphrased in a slightly different but semantically similar way can mislead a model into outputting a wrong answer.
Up to now, researchers have proposed several solutions to alleviate these robustness issues, including utilizing external knowledge to create adversarial examples that enrich the training data (Wang and Bansal, 2018; Zhou et al., 2020) and adversarial-training-based methods (Yang et al., 2019; Liu et al., 2020a; Yang et al., 2021). However, recent literature (Liu et al., 2020c,a; Si et al., 2021) suggests that models trained with specific augmented data are still easily attacked by other, unseen perturbations. Adversarial training can improve models' robustness under general attacks without requiring any explicit adversarial examples, but at the cost of an iterative training schedule.
To handle more general attacks rather than just one specific type of attack, and without a laborious iterative schedule, we propose SCQA, which addresses the above robustness issues by learning invariant representations of similar examples, inspired by Le-Khac et al. (2020). In detail, the data augmentation module first constructs an example similar to the input example to form a positive pair. SCQA then utilizes a stability loss to scale down the change in probability distribution caused by small label-preserving perturbations. In addition, SCQA introduces a contrastive loss to pull semantically close pairs together, further improving the alignment property of the representation space.
In the experimental part, we organize different MRC tasks and several attack tests into a benchmark for MRC robustness, covering span-based extractive, multiple choice, and Yes/No MRC. The results show that SCQA with dropout noise as implicit data augmentation can reduce the distance between embeddings of paired examples, and therefore consistently and significantly improves the robustness of MRC models against different types of adversarial perturbations. Moreover, it is worth noting that our approach can further boost models' performance when combined with specific explicit data augmentation strategies.

SCQA Architecture
As shown in Figure 1, SCQA has five modules:
• A data augmentation module that constructs positive example pairs.
• An encoder that learns contextual representations for the input sequence.
• A contrastive loss layer on top of the encoder, which pulls positive pairs together and keeps each representation distant from the other, negative representations in the same batch.
• A predictor that maps the contextual representation into a probability distribution to predict the answer.
• A stability calculator that quantifies the change in probability distribution caused by small label-preserving perturbations.
Each input sequence x_i ∈ R^{L×d} consists of a question q_i and a context c_i, where L is the sequence length and d is the hidden dimension. We first apply the data augmentation module to generate a semantically similar example x′_i for each x_i. The encoder E then produces contextual representations h_i and h′_i, which, together with the representations of the other examples in the same batch, are passed into the contrastive layer to compute the contrastive loss l_contrastive. The predictor module P computes the probability distributions P_i and P′_i over answers. Then, the MRC task-specific loss l_mrc and the stability loss l_stability are calculated separately. Finally, all losses are combined into the optimization objective for the model parameters. The model is expected to learn invariant representations of similar input sequences and to be robust against label-preserving attacks. A minimal sketch of one such training step is given below.
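To make the flow concrete, here is a minimal PyTorch sketch of one SCQA training step under the implicit (dropout) augmentation setting. The names encoder, predictor, and labels, the InfoNCE form of the contrastive loss, and the KL-divergence form of the stability loss are our illustrative assumptions, not the authors' exact implementation; the paper itself only specifies the cosine similarity, the temperature τ, and the combined objective.

import torch
import torch.nn.functional as F

def scqa_training_step(encoder, predictor, batch, labels,
                       w1=1e-4, w2=3e-5, tau=0.05):
    # Two forward passes over the same batch; different dropout masks
    # act as the implicit augmentation, yielding the positive view x'_i.
    h = encoder(batch)        # (n, d) pooled contextual representations
    h_prime = encoder(batch)  # (n, d) second view of the same inputs

    # Contrastive loss (assumed InfoNCE): pull (h_i, h'_i) together and
    # push h_i away from the other examples' views in the same batch.
    sim = F.cosine_similarity(h.unsqueeze(1), h_prime.unsqueeze(0), dim=-1) / tau  # (n, n)
    targets = torch.arange(h.size(0), device=h.device)
    l_contrastive = F.cross_entropy(sim, targets)

    # Predictor maps representations to answer distributions (logits).
    # A classification-style head is used here for brevity; span-based
    # tasks use start/end heads instead (see Appendix A.1).
    p = predictor(h)          # (n, num_classes)
    p_prime = predictor(h_prime)

    # Stability loss (assumed KL divergence): penalize the change in the
    # predicted distribution under the label-preserving perturbation.
    l_stability = F.kl_div(F.log_softmax(p_prime, dim=-1),
                           F.softmax(p, dim=-1), reduction="batchmean")

    # MRC task-specific loss on the original view.
    l_mrc = F.cross_entropy(p, labels)

    # Combined objective: l_total = l_mrc + w1 * l_contrastive + w2 * l_stability
    return l_mrc + w1 * l_contrastive + w2 * l_stability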

Data Augmentation Strategies
The purpose of the data augmentation module is to generate semantically similar example pairs, which are used for calculating the contrastive and stability losses. We apply two different data augmentation strategies in the SCQA framework, each transforming the training batch size from n to 2 × n.

Adversarial Examples as Explicit Augmentation
Augmenting the original training data with adversarial examples created by the same rules as the attacks is a known way to improve models' robustness (Wang and Bansal, 2018), although it can only defend against that specific attack (Liu et al., 2020a). For this setting, we follow the strategy of Wang and Bansal (2018) and mix in adversarial examples amounting to 25% of the original training data. For instances x_i without a corresponding adversarial example, we fall back to the dropout approach to supplement the data; a sketch of this mixing follows.
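As a concrete illustration of this mixing strategy, here is a hedged Python sketch. The helper name build_augmented_pairs, the adversarial_map structure, and the example "id" field are hypothetical, and the paper does not specify how the 25% subset is sampled.

import random

def build_augmented_pairs(train_set, adversarial_map, ratio=0.25, seed=13):
    # adversarial_map: example id -> its rule-generated adversarial
    # counterpart (hypothetical structure). Examples with a counterpart
    # form explicit (x_i, x'_i) pairs; the rest fall back to dropout,
    # i.e. x'_i is a copy whose second encoding differs only in noise.
    rng = random.Random(seed)
    budget = int(ratio * len(train_set))   # at most 25% adversarial pairs
    with_adv = [ex for ex in train_set if ex["id"] in adversarial_map]
    chosen = set(ex["id"] for ex in rng.sample(with_adv, min(budget, len(with_adv))))

    pairs = []
    for ex in train_set:
        if ex["id"] in chosen:
            pairs.append((ex, adversarial_map[ex["id"]]))  # explicit pair
        else:
            pairs.append((ex, ex))  # implicit pair: rely on dropout noise
    return pairs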

Training Loss
Our training loss is a weighted sum of the MRC task-specific loss l_mrc, the contrastive loss l_contrastive, and the stability loss l_stability:

l_total = l_mrc + w_1 · l_contrastive + w_2 · l_stability    (1)

This section describes the stability and contrastive losses in detail. For completeness, we provide a description of the MRC task-specific losses in Appendix A.1.

Stability Loss The aim of the stability loss is to scale down the change in probability distribution caused by the small label-preserving perturbations introduced in the data augmentation module. Given the probability distributions P_i and P′_i output by the predictor P, the stability loss penalizes the change between these two distributions.

Contrastive Loss The contrastive loss pulls the representations h_i and h′_i of a positive pair together while pushing h_i away from the representations of the other examples in the same batch:

l_contrastive = −log [ exp(sim(h_i, h′_i)/τ) / Σ_{j=1..n} exp(sim(h_i, h′_j)/τ) ]    (2)

where τ is the temperature hyper-parameter and sim(h_1, h_2) is the cosine similarity.

Main Results

Table 2 reports the robustness of models against the AddSent and AddOneSent attacks built on SQuAD 1.1, under the explicit and implicit data augmentation strategies respectively. With dropout as implicit augmentation, SCQA consistently improves performance by about 0.6%. Training with adversarial samples effectively improves robustness against the specific attack, and SCQA can further boost performance by about 0.5%.
Overall results on the other datasets and adversarial attack sets are presented in Appendix A.4.

Ablation Study
We conduct ablation experiments to observe the impact of the stability and contrastive losses. Additionally, we explore how to combine the stability and contrastive losses to better optimize a robust model. Table 3 shows the results of BERT on the RACE dataset, which suggest that training with the stability loss alone, the contrastive loss alone, or the two combined serially can each improve model performance, while optimizing the two objectives at the same time achieves the best overall effect.

Analysis of Embedding Space
To verify the hypothesis that SCQA trains enhanced contextual representations with better invariance, we compute the alignment property l_align between the representation of an example x in the test set and that of its corresponding label-preserving adversarial example x′. A lower alignment value indicates a better representation. The results in Figure 2 show that SCQA improves model alignment and therefore the robustness of QA models against test-time label-preserving perturbations. We can also observe from the alignment values that the AddSent attack is more challenging than CharSwap. A sketch of the alignment computation is given below.
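For reference, the alignment metric can be computed as follows. This assumes a common definition of alignment in the contrastive learning literature (the mean powered L2 distance between normalized paired embeddings), which the paper does not spell out.

import torch.nn.functional as F

def alignment(h, h_adv, alpha=2):
    # l_align: mean alpha-powered L2 distance between L2-normalized
    # embeddings of test examples x and their label-preserving
    # adversarial counterparts x'. Lower means better-aligned.
    h = F.normalize(h, dim=-1)        # (n, d)
    h_adv = F.normalize(h_adv, dim=-1)
    return (h - h_adv).norm(p=2, dim=-1).pow(alpha).mean()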

Conclusion
A number of studies have shown that MRC models are vulnerable to label-preserving adversarial perturbations. In this work, we proposed SCQA, which learns invariant representations of semantically similar examples through a combination of contrastive and stability losses. Experiments on span-based extractive, multiple choice, and Yes/No MRC tasks show that SCQA consistently improves robustness against different types of attacks, and that its performance can be further boosted by explicit data augmentation.

Limitations
Our current data augmentation strategy only covers creating positive training pairs to defend against label-preserving perturbations. SCQA pays little attention to the construction of high-quality negative sample pairs, which needs to be further explored to fully exploit the effect of contrastive learning.

A.1 MRC Task-specific Loss
Different forms of machine reading comprehension tasks require task-specific output heads and loss functions, which are as follows (a minimal sketch of the span-based head follows this list):
• Span-based Extractive MRC requires the model to predict the start/end position probability distributions of the answer. The training loss l_span is the negative log-likelihood of the correct start and end boundaries:

l_span = −(log P_i^s(y_i^s) + log P_i^e(y_i^e))

where P_i^s and P_i^e are the predicted start and end distributions, and y_i^s and y_i^e are the ground-truth start and end positions of input example x_i.
• Span-based Extractive MRC with Unanswerable Questions requires the model not only to predict the correct span answer when the question is answerable, but also to identify when no answer can be inferred from the context. We therefore use the negative log-likelihood of the correct start and end positions as the training objective, and if the question has no answer, we simply treat position 0 as both the start and end position. During the inference phase, if the best candidate span has a score that is less than the score of the no-answer (the sum of the start and end probabilities at position 0) minus a threshold, the no-answer is selected for this example.
• Multiple Choice MRC requires the model to find the only correct option among the given candidate options based on the context. Given an input example x_i consisting of a question q_i, a context c_i, and k candidate options o_i^j, 1 ≤ j ≤ k, we feed the k question-context-option sequences into the encoder E to get contextual representations h_i^j ∈ R^d, where d is the hidden size. The final predictor module P then calculates the probability of each option being the answer. The loss function is the cross-entropy loss generally used in multi-class classification problems.
• Yes/No MRC expects the model to answer yes or no given a Yes/No question and its related context. It can be modeled as a binary classification problem, and the training objective is the cross-entropy loss.
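As promised above, here is a minimal PyTorch sketch of the span-based head: the boundary loss l_span and the SQuAD 2.0-style no-answer decision. The brute-force span search and the tunable threshold parameter are simplifications for clarity, not the authors' exact decoding procedure.

import torch.nn.functional as F

def span_loss(start_logits, end_logits, y_start, y_end):
    # l_span: negative log-likelihood of the gold start/end boundaries.
    # start_logits/end_logits: (n, L); y_start/y_end: (n,) gold indices.
    return F.cross_entropy(start_logits, y_start) + F.cross_entropy(end_logits, y_end)

def predict_with_no_answer(start_logits, end_logits, threshold=0.0):
    # SQuAD 2.0-style decoding for one example (logits of shape (L,)).
    # Position 0 holds the no-answer score; if the best span scores below
    # the no-answer score minus a threshold, abstain (return None).
    p_start = start_logits.softmax(dim=-1)
    p_end = end_logits.softmax(dim=-1)
    no_answer = (p_start[0] + p_end[0]).item()
    best_score, best_span = float("-inf"), None
    for s in range(1, p_start.size(0)):          # brute force for clarity
        for e in range(s, p_end.size(0)):
            score = (p_start[s] + p_end[e]).item()
            if score > best_score:
                best_score, best_span = score, (s, e)
    return None if best_score < no_answer - threshold else best_span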

A.2 Details of Datasets
Datasets Table 4 presents the dataset statistics in detail.
• SQuAD 1.1 (Rajpurkar et al., 2016) contains 100k+ questions created by crowdworkers on 536 Wikipedia articles.
• SQuAD 2.0 (Rajpurkar et al., 2018) combines the existing SQuAD 1.1 data with over 50k unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
• RACE (Lai et al., 2017) contains nearly 100k questions from English exams for Chinese middle and high school students. Answering them demands more reasoning ability from the model, as the questions were carefully designed to evaluate students' understanding and reasoning.
• ReClor (Yu et al., 2019) was constructed from standardized graduate admission examinations. What makes ReClor challenging is that every sentence in the context is important; therefore the model must not only extract relevant information from the context but also have logical reasoning ability.
• BoolQ (Clark et al., 2019) was created in unprompted and unconstrained ways and focuses on Yes/No questions. The questions were written by people who did not know the answer to the question they were asking, and each question is paired with a paragraph from Wikipedia that an independent annotator has marked as containing the answer.
• NP-BoolQ (Khashabi et al., 2020) also requires the machine to understand which facts can be inferred to be true or false from the context. It was constructed by applying human-driven natural perturbations to BoolQ (Clark et al., 2019).
Adversarial Attacks We test the robustness of models against the following different types of label-preserving perturbations, each of which attacks MRC models from its own perspective:
• AddSent and AddOneSent (Jia and Liang, 2017) append distracting sentences to the SQuAD 1.1 context that share sufficient words with the question but do not actually answer it.
• CharSwap perturbs words in the question at the character level, producing typo-like inputs that preserve the question's meaning.
• Contrast Sets perturb the test instances of BoolQ and nine other NLP datasets in small but meaningful ways that typically change the gold label, creating contrast sets that do not look explicitly adversarial but significantly reduce performance.
Augmented Adversarial Examples One straightforward way to defend against attacks is to augment the training dataset with adversarial examples generated by the same rules as the attacks. To further illustrate the effectiveness of our approach, we also train our model on datasets mixed with such adversarial examples.
• AddSentDiverse (Wang and Bansal, 2018) was proposed based on the observation that retraining models on data generated by AddSent (Jia and Liang, 2017) has limited effect on robustness. The authors further enriched the SQuAD 1.1 training data by dynamically generating the fake answers and varying the locations where the distractor sentences are placed. The mixed adversarial examples account for 20% of the total SQuAD 1.1 dataset.
• NP-BoolQ (Khashabi et al., 2020) focuses on the value of natural perturbations for robust model design. The authors asked workers to change each question by adding or removing up to four terms, resulting in modified questions that are challenging for a RoBERTa model trained on the BoolQ dataset. The mixed adversarial examples account for 5% of the total dataset. We utilize the training part of NP-BoolQ to augment the BoolQ training data.

A.3 Parameters Settings
We use the base versions of BERT, RoBERTa, and ALBERT. During training, the batch size and number of epochs vary by task, and all models share the same batch size and number of epochs on the same dataset. For SQuAD 1.1 and SQuAD 2.0, the batch size and number of epochs are 12 and 3, respectively. For BoolQ and NP-BoolQ, they are 12 and 10. The batch size for RACE and ReClor is 6 since each question has 4 options, and the number of epochs is 3. We set the learning rate lr to 3e-5 for all models on all datasets, except that lr is 2e-5 for all models on ReClor, 3e-6 for RoBERTa on RACE and NP-BoolQ, and 1e-5 for ALBERT on RACE, BoolQ, and NP-BoolQ. We keep the other hyper-parameters at their defaults. The weights in the combined loss are simply set to 1e-4 for w_1 and 3e-5 for w_2. The temperature τ is 0.05. These settings are summarized in the sketch below.
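For convenience, the settings above can be collected into a single configuration sketch; the values are taken verbatim from this section, and anything not listed keeps the models' defaults.

# Hyper-parameter settings as stated above (sketch).
DEFAULTS = {"lr": 3e-5, "epochs": 3, "batch_size": 12,
            "w1": 1e-4, "w2": 3e-5, "tau": 0.05}

OVERRIDES = {
    # Dataset-level overrides.
    "BoolQ":    {"epochs": 10},
    "NP-BoolQ": {"epochs": 10},
    "RACE":     {"batch_size": 6},            # 4 options per question
    "ReClor":   {"batch_size": 6, "lr": 2e-5},
    # (model, dataset)-level learning rates.
    ("RoBERTa", "RACE"):     {"lr": 3e-6},
    ("RoBERTa", "NP-BoolQ"): {"lr": 3e-6},
    ("ALBERT", "RACE"):      {"lr": 1e-5},
    ("ALBERT", "BoolQ"):     {"lr": 1e-5},
    ("ALBERT", "NP-BoolQ"):  {"lr": 1e-5},
}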

A.4 Detailed Results
Tables 5-8 present the experimental results of the models on BoolQ, NP-BoolQ, SQuAD 2.0, and ReClor, together with the corresponding adversarial sets, respectively.