Evaluating Neural Model Robustness for Machine Comprehension

We evaluate neural model robustness to adversarial attacks using perturbations of different linguistic units (character and word), and propose a new method for strategic sentence-level perturbations. We experiment with different amounts of perturbation to examine model confidence and misclassification rate, and contrast model performance with two embeddings, BERT and ELMo, on two benchmark datasets, SQuAD and TriviaQA. We demonstrate how to improve model performance during an adversarial attack by using ensembles. Finally, we analyze factors that affect model behavior under adversarial attack, and develop a new model to predict errors during attacks. Our novel findings reveal that (a) models that use ELMo embeddings are more susceptible to adversarial attacks than models that use BERT, (b) character perturbations affect the model more than word and paraphrase perturbations but are most easily compensated for by adversarial training, (c) word perturbations lead to more high-confidence misclassifications than sentence- and character-level perturbations, (d) the type of question and the length of the model's answer (the longer the answer, the more likely it is to be incorrect) are the most predictive of model errors in the adversarial setting, and (e) conclusions about model behavior are dataset-specific.


Introduction
Deep neural models have recently gained popularity, leading to significant improvements in many Natural Language Understanding (NLU) tasks (Goldberg, 2017). However, the research community still lacks in-depth understanding of how these models work and what kind of linguistic information is actually captured by neural networks (Feng et al., 2018). Evaluating model robustness to manipulated inputs and analyzing model behavior during adversarial attacks can provide deeper insights into how much language understanding models actually have (Hsieh et al., 2019; Si et al., 2020). Moreover, as has been widely discussed, models should be optimized not only for accuracy but also for other important criteria such as reliability, accountability and interpretability (Lipton, 2018; Doshi-Velez and Kim, 2017; Ribeiro et al., 2016; Goodman and Flaxman, 2017).
Table 1:
Context: One of the most famous people born in Warsaw was Maria Skłodowska-Curie, who achieved international recognition for her research on radioactivity and was the first female recipient of the Nobel Prize.
Question: What was Maria Curie the first female recipient of?
Answer: Nobel Prize
In this work, we evaluate neural model robustness on machine comprehension (MC), a task designed to measure a system's understanding of text. In this task, given a context paragraph and a question, the machine is tasked to provide an answer. We focus on span-based MC, where the model selects a single contiguous span of tokens in the context as the answer (Tab. 1). We (1) quantitatively measure when and how the model is robust to manipulated inputs, when it generalizes well, and when it is less susceptible to adversarial attacks, (2) demonstrate that relying on ensemble models increases robustness, and (3) develop a new model to predict model errors during attacks. Our novel contributions shed light on the following questions:
• Which embeddings are more susceptible to noise and adversarial attacks?
• What types of text perturbation lead to the most high-confidence misclassifications?
• How does the amount of text perturbation affect model behavior?
• What factors explain model behavior under perturbation?
• Are the above dataset-specific?
Broader Implications. We would like to stress the importance of this type of work to ensure diversity and progress for the computational linguistics community. We as a community know how to build new models for language understanding, but we do not fully understand how these models work. When we deploy these models in production, they fail to perform well in real-world conditions, and we fail to explain why they fail, because we have not performed thorough evaluation of model performance under different experimental conditions. Neural model evaluation and thorough error analysis, especially for tasks like machine comprehension, are critical to make progress in the field. We have to ensure our research community goes beyond F1 scores and incremental improvements and gains a deeper understanding of models' decision-making processes to drive revolutionary rather than evolutionary research.

Background
There is much recent work on adversarial NLP, surveyed in Belinkov and Glass (2019) and Zhang et al. (2019). To situate our work, we review relevant research on the black-box adversarial setting, in which one has no access to or information about the model's internals, only the model's output and its confidence about the answer. In an adversarial setting, the adversary seeks to mislead the model into producing an incorrect output by slightly tweaking the input. Recent work has explored input perturbations at different linguistic levels: character, word, and sentence. For character-level perturbations, NLP systems generally do not take into account the visual characteristics of characters. Researchers have explored the effects of adding noise by randomizing or swapping characters on machine translation (MT) (Heigold et al., 2018; Belinkov and Bisk, 2018), sentiment analysis and spam detection (Gao et al., 2018), and toxic content detection (Li et al., 2018). Eger et al. (2019) replaced characters with similar-looking symbols, and developed a system to replace characters with nearest neighbors in visual embedding space. For word-level perturbations, Alzantot et al. (2018) used a genetic algorithm to replace words with contextually similar words, evaluating on sentiment analysis and textual entailment. For sentence-level perturbations, prior work generated adversarial paraphrases by controlling the syntax of sentences, evaluating on sentiment analysis and textual entailment tasks. Hu et al. (2019) found that augmenting the training data with paraphrases can improve performance on natural language inference, question answering, and MT. Niu and Bansal (2018) use adversarial paraphrases for dialog models.
Other related work includes Zhao et al. (2018) and Hsieh et al. (2019), who generated natural-looking adversarial examples for image classification, textual entailment, and MT. Specifically for MC, Jia and Liang (2017) added a distractor sentence to the end of the context; Ribeiro et al. (2018) extracted sentence perturbation rules from paraphrases that were created by translating to and then from a foreign language and then manually judged for semantic equivalence; and Si et al. (2020) focused on evaluating model robustness for MC.
Unlike earlier work, we empirically show how neural model performance degrades under multiple types of adversarial attacks by varying the amount of perturbation, the type of perturbation, model architecture and embedding type, and the dataset used for evaluation. Moreover, our deep analysis examines factors that can explain neural model behavior under these different types of attacks.

Methods
We perform comprehensive model evaluation for machine comprehension over several dimensions: the amount of perturbation, perturbation type, model and embedding variation, and datasets.

Perturbation Type
We examine how changes to the context paragraph (excluding the answer span) affect the model's performance using the following perturbations:
• Character-level. In computer security, this is known as a homograph attack. These attacks have been investigated to identify phishing and spam (Fu et al., 2006b,a; Liu and Stamm, 2007) but to our knowledge have not been applied in the NLP domain. We replace 25% of characters in the context paragraph with deceptive Unicode characters that to a human are indistinguishable from the original.
• Word-level. We randomly replace 25% of the words in the context paragraph with their nearest neighbor in the GloVe (Pennington et al., 2014) embedding space.
• Sentence-level. We use Improved ParaBank Rewriter (Hu et al., 2019), a machine translation approach for sentence paraphrasing, to paraphrase sentences in the context paragraph. We perform sentence tokenization, paraphrase each sentence with the paraphraser, then recombine the sentences.
For character and word perturbations, we use 25% as this is where the performance curve in Heigold et al. (2018) flattens out. Regardless of the type of perturbation, we do not perturb the context that contains the answer span, so that the answer can always be found in the context unperturbed. Because paraphrasing is per sentence, we only modify sentences that do not contain the answer span. An example of each perturbation type is shown in Tab. 2.
Table 2:
Original: The connection between macroscopic nonconservative forces and microscopic conservative forces is described by detailed treatment with statistical mechanics.
Character: The connection between macroscopic nonconservative forces and microscopic conservative forces is described by detailed treatment with statistical mechanics.
Word: The connection between macroscopic nonconservative forces and insects conservative troops is referred by detailed treatment with statistical mechanics.
Sentence: The link between macroscopic non-conservative forces and microscopic conservative forces is described in detail by statistical mechanics.
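The character- and word-level perturbations can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `HOMOGLYPHS` table, the function names, and the toy `neighbors` dictionary (a stand-in for a GloVe nearest-neighbor lookup) are hypothetical, and the sketch perturbs 25% of the eligible positions rather than 25% of all characters or words.

```python
import random

# Hypothetical homoglyph table: Latin letters mapped to visually
# similar Cyrillic code points (an assumption for illustration).
HOMOGLYPHS = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
}


def perturb_characters(context, answer_start, answer_end, rate=0.25, seed=0):
    """Swap a fraction of eligible characters for look-alikes.

    Characters inside [answer_start, answer_end) are never touched,
    so the answer span stays intact.
    """
    rng = random.Random(seed)
    chars = list(context)
    candidates = [
        i for i, ch in enumerate(chars)
        if ch in HOMOGLYPHS and not (answer_start <= i < answer_end)
    ]
    for i in rng.sample(candidates, int(len(candidates) * rate)):
        chars[i] = HOMOGLYPHS[chars[i]]
    return "".join(chars)


def perturb_words(tokens, protected, neighbors, rate=0.25, seed=0):
    """Replace a fraction of eligible tokens with their nearest neighbor.

    `protected` holds token indices that belong to the answer span;
    `neighbors` maps a word to its replacement (toy stand-in for GloVe).
    """
    rng = random.Random(seed)
    candidates = [
        i for i, tok in enumerate(tokens)
        if i not in protected and tok in neighbors
    ]
    out = list(tokens)
    for i in rng.sample(candidates, int(len(candidates) * rate)):
        out[i] = neighbors[tokens[i]]
    return out
```

Fixing the seed keeps the perturbed data reproducible across training runs, which matters when the same perturbed set is reused in several training settings.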

Amount of Perturbation
For each perturbation type, we experiment with perturbing the training data at differing amounts. All models are tested on fully perturbed test data.
• None: clean training data.
• Half: perturb half the training examples.
• Full: perturb the entire train set.
• Both: append the entire perturbed data to the entire clean data.
• Ens: ensemble model that relies on none, half and full perturbed data; we rely on ensemble voting and only include the word in the predicted answer if any two models agree.
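The four data-based training settings listed above can be sketched as a single helper. This is an illustrative sketch: the function name and the `perturb` callable are hypothetical, and choosing the perturbed half uniformly at random is an assumption, since the text does not say how the half is selected.

```python
import random


def build_training_set(clean, perturb, setting, seed=0):
    """Return the training examples for one perturbation-amount setting.

    clean   : list of clean training examples
    perturb : callable mapping one example to its perturbed version
    setting : one of "none", "half", "full", "both"
    """
    rng = random.Random(seed)
    if setting == "none":
        return list(clean)
    if setting == "half":
        # Assumption: the perturbed half is sampled uniformly at random.
        chosen = set(rng.sample(range(len(clean)), len(clean) // 2))
        return [perturb(ex) if i in chosen else ex
                for i, ex in enumerate(clean)]
    if setting == "full":
        return [perturb(ex) for ex in clean]
    if setting == "both":
        # Clean data followed by a fully perturbed copy (twice the size).
        return list(clean) + [perturb(ex) for ex in clean]
    raise ValueError(f"unknown setting: {setting}")
```

Note that the "both" setting doubles the number of training examples, so any comparison against the other settings also changes the amount of training data.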

Model Architecture and Embeddings
BiDAF model with ELMo (Seo et al., 2017). ELMo is a deep, contextualized, character-based word embedding method using a bidirectional language model. The Bi-Directional Attention Flow (BiDAF) model is a hierarchical model with embeddings at multiple levels of granularity: character, word, and paragraph. We use pre-trained ELMo embeddings in the BiDAF model implemented in AllenNLP.
BERT (Devlin et al., 2019). BERT is another contextualized embedding method that uses Transformers (Vaswani et al., 2017). It is trained to recover masked words in a sentence as well as on a next-sentence prediction task. The output layer of BERT is fed into a fully-connected layer for the span classification task. Pre-trained embeddings can be fine-tuned to a specific task, and we use the Huggingface PyTorch-Transformers package, specifically the bert-large-cased-whole-word-masking-finetuned-squad model. We fine-tune for two epochs in each experimental setting.

Benchmark Datasets
We experiment on two benchmark MC datasets:
SQuAD (Rajpurkar et al., 2016). The Stanford Question Answering Dataset is a collection of over 100K crowdsourced question-answer pairs. The context containing the answer is taken from Wikipedia articles.
TriviaQA (Joshi et al., 2017). A collection of over 650K crowdsourced question-answer pairs, where the context is from web data or Wikipedia. The construction of the dataset differs from SQuAD in that question-answer pairs were first constructed, then evidence was found to support the answer. We utilize the Wikipedia portion of TriviaQA, whose size is comparable to SQuAD. To match the span-based setting of SQuAD, we convert TriviaQA to the SQuAD format using the scripts in the official repo and remove answers without evidence.

Evaluation Results
Fig. 1 summarizes our findings on how model behavior changes under noisy perturbations and adversarial attacks. Here, we briefly discuss how perturbation type, perturbation amount, model, and embeddings affect model misclassification rate. In addition, we contrast model performance across datasets and report how to mitigate model error rate using ensembling. Detailed analyses are presented in Sec. 5. Key findings are italicized.
The effect of perturbation type. To assess whether perturbations changed the meaning, we ran a human study on a random sample of 100 perturbed contexts from SQuAD. We found (as expected) that the two annotators we employed could not distinguish character-perturbed text from the original. For word perturbations, the meaning of the context remained in 65% of cases, but annotators noted that sentences were often ungrammatical. For sentence-level perturbations, the meaning remained in 83% of cases.
For a model trained on clean data, character perturbations affect the model the most, followed by word perturbations, then paraphrases. To a machine, a single character perturbation results in a completely different word; handling this type of noise is important for a machine seeking to beat human performance. Word perturbations are context independent and can make the sentence ungrammatical. Nevertheless, the context's meaning generally remains coherent. Paraphrase perturbations are ideal because they retain meaning while allowing more drastic phrase and sentence structure modifications. In Sec. 4.2, we present a more successful adversarially targeted paraphrasing approach.
The effect of perturbation amount. Perturbed training data improves the model's performance for character perturbations (first column of Fig. 1a), likely due to the models' ability to handle unseen words: BiDAF with ELMo utilizes character embeddings, while BERT uses word pieces. Our results corroborate the findings of Heigold et al. (2018) (though on a different task) that without adversarial training, models perform poorly on perturbed test data, but when models are trained on perturbed data, the amount of perturbed training data does not make much difference. We do not see statistically significant results for word and paraphrase perturbations (second and third columns in each heatmap in Fig. 1). We conclude that perturbing 25% of the words and the non-strategic paraphrasing approach were not aggressive enough.
The effect of model and embedding. As shown in Fig. 1a and b, the BERT model had fewer errors than the ELMo-based model regardless of the perturbation type and amount on SQuAD data. While the two models are not directly comparable, our results indicate that the BERT model is more robust to adversarial attacks than the ELMo-based model.
Table 3: Example result from response ensembling under the SQuAD ELMo setting. The question is "What was used by the West to justify control over eastern territories?" The answer is "Orientalism", and in all three settings, the ensemble was correct.
The effect of the data. Holding the model constant (Fig. 1b and c), experiments on TriviaQA resulted in more errors than SQuAD regardless of perturbation amount and type, indicating that TriviaQA may be a harder dataset for MC and may contain data bias, discussed below.

Adversarial Ensembles
Ensemble adversarial training has recently been explored (Tramèr et al., 2018) as a way to ensure robustness of ML models. For each perturbation type, we present results ensembled from the none, half, and full perturbed settings. We tokenize answers from these three models and keep all tokens that appear at least twice as the resulting answer (Tab. 3). Even when all three model answers differ (e.g. in the word perturbation case), ensembling can often reconstruct the correct answer. Nevertheless, we find that this ensembling only helps for TriviaQA, which has an overall higher error rate (bottom row of each figure in Fig. 1).
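The token-level voting described above can be sketched in a few lines. This is a hedged illustration: whitespace tokenization, counting one vote per model per token, and emitting the kept tokens in order of first appearance are assumptions, and `ensemble_answer` is a hypothetical name rather than the authors' code.

```python
from collections import Counter


def ensemble_answer(answers, min_votes=2):
    """Combine model answers by keeping tokens that enough models agree on.

    answers   : list of answer strings, one per model
    min_votes : number of models that must contain a token to keep it
    """
    tokenized = [answer.split() for answer in answers]

    # One vote per model per token, so repeats inside a single
    # answer do not count as agreement.
    votes = Counter()
    for tokens in tokenized:
        votes.update(set(tokens))

    kept, seen = [], set()
    for tokens in tokenized:
        for token in tokens:
            if votes[token] >= min_votes and token not in seen:
                kept.append(token)
                seen.add(token)
    return " ".join(kept)
```

With three models and `min_votes=2` this is a majority vote per token, which is why the ensemble can recover an answer even when no two complete answer strings match; if no token is shared, the result is an empty string.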

Strategic Paraphrasing
We did not observe a large increase in errors with paraphrase perturbations (Fig. 1), perhaps because paraphrasing, unlike the char and word perturbations, is not a deliberate attack on the sentence.
Here we experiment with a novel strategic paraphrasing technique that targets specific words in the context and then generates paraphrases that exclude those words. We find the most important words in the context by individually modifying each word and obtaining the model's prediction and confidence, a process similar to Li et al. (2018). Our modification consists of removing the word and examining its effect on the model prediction.
The most important words are those which, when removed, lower the model confidence of a correct answer or increase the confidence of an incorrect answer. The Improved ParaBank Rewriter supports constrained decoding, i.e. specifying positive and negative constraints to force the system output to include or exclude certain phrases. We specify the top five important words in the context as negative constraints to generate strategic paraphrases. We experimented on 1000 instances in the SQuAD dev set as shown in Tab. 4. Our results indicate that strategic paraphrasing with negative constraints is a successful adversarial attack, lowering the F1-score from 89.96 to 84.55. Analysis shows that many words in the question are important and thus excluded from the paraphrases. We also notice that paraphrasing can occasionally turn an incorrect prediction into a correct one. Perhaps paraphrasing makes the context easier to understand by removing distractor terms; we leave this for future investigation.
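The word-importance step can be sketched as a leave-one-out loop over the context. This is an illustrative sketch only: `predict` is a hypothetical black-box callable returning an answer and a confidence, and folding the two criteria into a single signed-confidence score (positive for a correct answer, negative for an incorrect one) is an assumption made here to rank words, not a detail given in the text.

```python
def word_importance(tokens, question, gold, predict, top_k=5, protected=frozenset()):
    """Rank context words by how much removing them hurts the model.

    tokens    : list of context tokens
    gold      : the correct answer string
    predict   : hypothetical callable (tokens, question) -> (answer, confidence)
    protected : token indices that are never removed (e.g. the answer span)
    """
    def signed_confidence(toks):
        answer, conf = predict(toks, question)
        return conf if answer == gold else -conf

    base = signed_confidence(tokens)
    scores = []
    for i, tok in enumerate(tokens):
        if i in protected:
            continue
        reduced = tokens[:i] + tokens[i + 1:]
        # A larger drop means the word mattered more to the prediction.
        scores.append((base - signed_confidence(reduced), i, tok))

    # Highest drop first; ties broken by position in the context.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [tok for _, _, tok in scores[:top_k]]
```

The returned words would then be passed to the paraphraser as negative constraints. The loop issues one model query per context word, so its cost grows linearly with context length.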

Model Confidence
In a black-box setting, model confidence is one of the only indications of the model's inner workings. The models we employed do not provide a single confidence value; AllenNLP gives a probability that each word in the context is the start and end span, while the BERT models only give the probability for the start and end words. We compute the model's confidence using the normalized entropy $H_n$ of the distribution across the context words, where $n$ is the number of context words, and take the mean for both the start and end word:
$$1 - \frac{H_n(s) + H_n(e)}{2},$$
where $s$ and $e$ are probability distributions for the start and end words, respectively. Low entropy indicates certainty about the start/end location. Since the BERT models only provide probabilities for the start and end words, we approximate the entropy by assuming a flat distribution, dividing the remaining probability equally across all other words in the context.
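The confidence score can be sketched as below. This sketch assumes that "normalized entropy" means Shannon entropy divided by log n, which is the usual definition but is not spelled out in the text; the function names are hypothetical, and `flat_fill` illustrates the flat-distribution approximation for the case where only the probability of the chosen word is available.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by log(n), so the result lies in [0, 1]."""
    n = len(probs)
    if n <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n)


def confidence(start_probs, end_probs):
    """One minus the mean normalized entropy of the start and end distributions."""
    mean_entropy = (normalized_entropy(start_probs)
                    + normalized_entropy(end_probs)) / 2.0
    return 1.0 - mean_entropy


def flat_fill(p_top, n):
    """Approximate a full distribution from the chosen word's probability.

    The remaining mass (1 - p_top) is spread equally over the other
    n - 1 context words.
    """
    if n <= 1:
        return [1.0]
    rest = (1.0 - p_top) / (n - 1)
    return [p_top] + [rest] * (n - 1)
```

A uniform distribution gives a confidence near 0 and a one-hot distribution gives 1. For the approximated case, one would call `confidence(flat_fill(p_start, n), flat_fill(p_end, n))`.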
Comparing confidence across models (Fig. 2a), the BERT model has lower confidence for misclassifications, which is ideal: a model should not be confident about errors. Fig. 2b compares confidence across perturbation types. In the none training setting, character perturbations introduce the most uncertainty compared to word or paraphrase perturbations. This is expected, since character perturbations result in unknown words. Under adversarial training, word perturbations lead to the highest number of high-confidence errors. Thus, to convincingly mislead the model into being highly confident about errors, one should use word perturbations.

Robustness Analysis
Here, we do a deeper dive into why models make errors with noisy input. We investigate data characteristics and their association with model errors by utilizing CrossCheck (Arendt et al., 2020), a novel interactive tool designed for neural model evaluation. Unlike several recently developed tools for analyzing NLP model errors (Agarwal et al., 2014; Wu et al., 2019) and understanding ML model outputs (Poursabzi-Sangdeh et al., 2018; Hohman et al., 2019), CrossCheck is designed to allow rapid prototyping and cross-model comparison to support experimentation.

The Effect of Question Type, Question and Context Lengths
We examine whether models make more errors on specific types of questions in adversarial training, i.e., some questions could just be easier than others. We first examine question type: who, what, which, when, where, why, how, and other. The majority of SQuAD questions are what questions, while most TriviaQA questions are other questions, perhaps indicating more complex questions (Fig. 4a). We see that models usually choose answers appropriate for the question type; even if they are incorrect, answers to when questions will be dates or time word spans, and answers to how many questions will be numbers. Fig. 4a presents key findings on differences in model misclassifications between the two datasets given specific question types. On the SQuAD dataset, the model finds certain question types, e.g. when and how, easiest to answer regardless of the perturbation type. Responses to these questions, which generally expect numeric answers, are not greatly affected by perturbations. For TriviaQA, in general we observe more errors across question types compared to SQuAD, i.e. more errors in what, which and who questions.
Regarding question length, SQuAD and TriviaQA have similar distributions (Fig. 4b). Both datasets have a mode question length around 10 words; TriviaQA has a slightly longer tail in the distribution. We did not find question length to impact the error rate.
Regarding context length, SQuAD and TriviaQA have vastly differing context length distributions (Fig. 4c), partly due to how the two datasets were constructed (see Sec. 3.4 for details). For both datasets, the error distribution mirrors the context length distribution, and we did not find any relation between model errors and context length.
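The question-type grouping used in this analysis could be approximated with a simple keyword rule. This is a hypothetical heuristic for illustration: the text does not state how question types were assigned, and taking the first wh-word found anywhere in the question is an assumption.

```python
import re

# The seven named types from the analysis; everything else is "other".
QUESTION_TYPES = ("who", "what", "which", "when", "where", "why", "how")


def question_type(question):
    """Return the first wh-word found in the question, else "other"."""
    for token in re.findall(r"[a-z]+", question.lower()):
        if token in QUESTION_TYPES:
            return token
    return "other"
```

Scanning the whole question rather than only its first word lets questions such as "In which city ..." be typed as which; imperative prompts with no wh-word fall into other.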

The Effect of Answer Length
Our analysis shows that the length of the model's answer is a strong predictor of model error in the adversarial setting: the longer the answer length, the more likely it is to be incorrect. Fig. 3 plots the proportion of correct to incorrect answers. We notice a downward trend which is mostly consistent across experimental settings. For both SQuAD and TriviaQA, the models favored shorter answers, which mirrors the data distribution.

The Effect of Complexity: Annotator Agreement and Reading Level
Here, we examine the effect of task complexity on model performance under adversarial training, using inter-annotator agreement as a proxy for question complexity and paragraph readability as a proxy for context complexity. Inter-annotator agreement represents a question's complexity: low agreement indicates that annotators did not come to a consensus on the correct answer; thus the question may be difficult to answer. We examine SQuAD, whose questions have one to six annotated answers. In Fig. 5, we present inter-annotator agreement (human confidence) plotted against model confidence over the four training perturbation amounts, looking only at the incorrect predictions. The setting is SQuAD BERT with character perturbation. We observe that the models are generally confident even when the humans are not, which is noticeable across all perturbation amounts. However, we see interesting differences in model confidence in adversarial training: models trained in the none and half settings have confidence ranging between 0 and 1, whereas models trained in the full and both settings have confidence above 0.8, indicating that training with more perturbed data leads to more confident models.
To evaluate the effect of context complexity, we use the Flesch-Kincaid reading level (Kincaid et al., 1975) to measure readability. For questions the model answered incorrectly, the median readability score was slightly higher than the median score for correct responses (Tab. 5), indicating that contexts with a higher reading level are harder for the model to understand. TriviaQA contexts have higher reading levels than SQuAD contexts.
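For reference, the Flesch-Kincaid grade level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below computes it from raw text. The formula is the standard one, but the vowel-group syllable counter and the regex-based sentence and word splitting are rough assumptions, so scores will differ somewhat from those of a dedicated readability tool.

```python
import re


def fk_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return (0.39 * (n_words / n_sentences)
            + 11.8 * (n_syllables / n_words)
            - 15.59)


def count_syllables(word):
    """Crude estimate: one syllable per run of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade_text(text):
    """Estimate the grade level of a non-empty passage of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade(len(words), len(sentences), syllables)
```

Longer sentences and longer words both raise the score, which is consistent with the observation that higher-scoring contexts are harder for the model.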

Predicting Model Errors
Our in-depth analysis reveals many insights on how and why models make mistakes during adversarial training. Using the characteristics we analyzed above, we developed a binary classification model to predict whether the answer would be an error, given the model's answer and attributes of the context paragraph. We one-hot-encode categorical features (training amount, perturbation type, question type) and use other features (question length, context length, answer length, readability) as is. For each setting of embedding and perturbation type on SQuAD, we train an XGBoost model with default settings with 10-fold cross validation (shuffled). We present the model's average F1 scores (Tab. 6) and feature importance as computed by the XGBoost model (Fig. 6). We see that performance (micro F1) is slightly better to better than a majority baseline (picking the most common class), indicating that certain features are predictive of errors. Specifically, we find that for character perturbations, the fact that the training data is clean is a strong predictor of errors, since a model trained on clean data is most disrupted by character perturbations; for word and paraphrase perturbations, question types are important predictors of errors.
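The feature encoding and the majority baseline can be sketched with the standard library alone. This is an illustrative sketch: the field names and category lists are hypothetical, and the XGBoost training step itself is deliberately left out. For a single-label binary task, micro F1 equals accuracy, so the baseline below is directly comparable to the reported metric.

```python
from collections import Counter

# Hypothetical field names and category values for illustration.
CATEGORICAL = {
    "train_amount": ["none", "half", "full", "both"],
    "perturbation": ["char", "word", "sentence"],
    "question_type": ["who", "what", "which", "when",
                      "where", "why", "how", "other"],
}
NUMERIC = ["question_len", "context_len", "answer_len", "readability"]


def encode(example):
    """One-hot-encode categorical fields and append numeric fields as is."""
    row = []
    for name, values in CATEGORICAL.items():
        row.extend(1.0 if example[name] == v else 0.0 for v in values)
    row.extend(float(example[name]) for name in NUMERIC)
    return row


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)
```

The rows produced by `encode` could then be passed to any gradient-boosted tree classifier; a classifier is informative about errors only to the extent that it beats `majority_baseline_accuracy` on the same labels.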

Conclusion and Future Work
Our in-depth analysis of neural model robustness sheds light on how and why MC models make errors in adversarial training. Through our error prediction model, we discovered features of the data (e.g., question types) that are strongly predictive of when a model makes errors during adversarial attacks with noisy inputs. Our results on the effect of the data (e.g., question and context length) will not only explain model performance in the context of the data, but will also help build future neural models that are more resilient to adversarial attacks, and advance understanding of the strengths and weaknesses of neural models across a variety of NLU tasks and datasets.
For future work, we see many avenues for extension. We plan to experiment with more aggressive and more natural perturbations, and deeper counterfactual evaluation (Pearl, 2019). While recent research has made great strides in increasing model performance on various NLP tasks, it is still not clear what linguistic patterns these neural models are learning, or whether they are learning language at all (Mudrakarta et al., 2018).