Evaluating the Robustness of Neural Language Models to Input Perturbations

High-performance neural language models have obtained state-of-the-art results on a wide range of Natural Language Processing (NLP) tasks. However, results for common benchmark datasets often do not reflect model reliability and robustness when applied to noisy, real-world data. In this study, we design and implement various types of character-level and word-level perturbation methods to simulate realistic scenarios in which input texts may be slightly noisy or differ from the data distribution on which NLP systems were trained. Conducting comprehensive experiments on different NLP tasks, we investigate the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo to handle different types of input perturbations. The results suggest that language models are sensitive to input perturbations and that their performance can decrease even when small changes are introduced. We highlight that models need to be further improved and that current benchmarks do not reflect model robustness well. We argue that evaluations on perturbed inputs should routinely complement widely-used benchmarks in order to yield a more realistic understanding of NLP systems' robustness.


Introduction
High-performance deep neural language models such as BERT (Devlin et al., 2018), XLNet (Z. Yang et al., 2019), and GPT-2 (Radford et al., 2019) have brought breakthroughs to a wide range of Natural Language Processing (NLP) tasks including text classification, sentiment analysis, textual entailment, natural language inference, machine translation, and question answering. Their ability to capture various linguistic properties has enabled these state-of-the-art language models to master different NLP tasks, even surpassing human accuracy on some benchmarks such as SQuAD.
However, recent studies have revealed that there is a gap between performing well on benchmarks and actually working in real-world situations (Belinkov and Bisk, 2018; Ribeiro et al., 2020). Even a well-trained, high-performance deep language model can be sensitive to negligible changes in the input that cause the model to make erroneous decisions (M. Sun et al., 2018). This raises serious concerns regarding the robustness and reliability of neural language models utilized in real-world applications. The terms 'robustness' and 'reliability' refer to a system's ability to perform consistently well: when changes to the input should not affect the output, the system's output remains unchanged; when they should, the system properly reflects the change and produces a correct outcome.
Applying automatic or human-controlled perturbations to textual inputs has been shown to be effective for evaluating the robustness of NLP systems, investigating their vulnerabilities, and finding their bugs. Recently, CheckList (Ribeiro et al., 2020) provided a framework for behavioral testing of NLP systems inspired by black-box testing in software engineering. CheckList enables the generation of new (perturbed) test samples by abstracting different test types aimed at testing linguistic capabilities. Other studies focused on evaluating robustness to perturbed inputs for machine translation (Belinkov and Bisk, 2018; Niu et al., 2020), perturbation sensitivity analysis for detecting unintended model biases (Prabhakaran et al., 2019), or robustness to adversarial perturbations (Alshemali and Kalita, 2020; Ebrahimi et al., 2018; Liang et al., 2018). However, a comprehensive methodology for evaluating the performance of NLP models under real-world conditions is still missing.
In a realistic scenario, the input text may contain typos and misspellings that should not cause any changes in the NLP system's outcome. Minor grammatical errors may appear in the text while the semantics are still preserved; in this case, the NLP system is expected to treat the input as if it were error-free. Some deliberate or unintentional changes may modify the semantics, and the NLP model is expected to reflect the changes in its outcome. These are only a few examples of natural noise in text data that NLP systems should be able to deal with properly.
In this paper, we design and implement a wide range of character-level and word-level systematic perturbations to textual inputs in order to simulate different types of noise that an NLP system may face in real-world use cases. Conducting extensive experiments on various NLP tasks, we investigate the ability of four neural language models, i.e. BERT, RoBERTa, XLNet, and ELMo, to handle slightly perturbed inputs. The results reveal that the neural models are sensitive to small changes that can be easily handled by humans, e.g. misspellings, missing words, repeated words, and synonyms. The systematic input perturbations can expose the vulnerabilities of NLP systems and bring more insight into how high-performance models behave when they encounter noisy yet understandable inputs. This study suggests that the performance of NLP models should not be overestimated by relying only on accuracy scores obtained on benchmark datasets. Similar to CheckList, our perturbation framework treats NLP systems as black boxes. This facilitates the comparison of different models without needing to know the model structure and internals. CheckList focuses on testing linguistic capabilities of NLP systems, e.g. handling coreferences, identifying named entities, semantic role labeling, and vocabulary. Our perturbation methods, on the other hand, aim at evaluating the robustness of NLP systems to noisy inputs. Our input perturbation framework can act as a complement to the CheckList testing methodology.
In CheckList, many test types rely on the user creating synthetic samples from scratch, which is a time-consuming task that requires considerable creativity and effort. Moreover, synthetic samples may suffer from low coverage (Ribeiro et al., 2020). In contrast, most of the perturbation methods introduced in this paper do not need human intervention; they can automatically generate perturbed samples that still preserve the semantics and are sufficiently meaningful to users.
Some types of perturbation utilized in this work have already been tested in previous work on adversarial attacks on NLP systems (Zeng et al., 2020; Zhang et al., 2020). However, adversarial perturbations are worst-case scenarios that do not occur frequently in real-world situations and represent a very specific type of noise (Fawzi et al., 2016). In order to generate effective adversarial examples, most attack methods need access to the NLP model's structure, internal weights, and hyperparameters, which may not be possible in every testing scenario (Zhang et al., 2020). Furthermore, adversarial perturbations should not be perceptible to humans (Liang et al., 2018). This is a serious challenge, since even small changes to a text may be easily recognized by the user.
To the best of our knowledge, this paper is the first study that presents empirical results achieved with a comprehensive set of non-adversarial perturbation methods for testing the robustness of NLP systems on non-synthetic text. An important contribution of this work is to evaluate the robustness of several high-performance language models on various NLP tasks using different types of character-level and word-level input perturbations. Moreover, to ascertain the usefulness of the perturbations (i.e. how effectively they can be used to automatically generate meaningful and understandable perturbed samples), we conducted an extensive user study.

NLP tasks
In our experiments, we used five datasets covering five different NLP tasks. Table 1 summarizes some statistics of the datasets. A short description of the datasets is given in the following.
TREC (Li and Roth, 2002) is a Text Classification (TC) dataset containing more than 6,000 questions and 50 different class labels that specify the type of question.
Stanford Sentiment Treebank (SST) (Socher et al., 2013) is a Sentiment Analysis (SA) dataset containing more than 11,000 movie reviews from 'Rotten Tomatoes'. Every review is classified into one of five classes: very positive, positive, neutral, negative, and very negative.
CoNLL-2003 is a Named Entity Recognition (NER) dataset containing news stories from the Reuters corpus, with more than 200K tokens annotated as Person, Organization, Location, Miscellaneous, or Other.
STS benchmark (Cer et al., 2017) is a Semantic Similarity (SS) dataset comprising more than 8K text pairs extracted from image captions, news headlines, and user forums. Each pair of sentences is assigned a similarity score between 0 and 5.
WikiQA (WQA) (Y. Yang et al., 2015) is a Question Answering (QA) dataset composed of more than 3,000 questions and 29,000 sentences as answers extracted from Wikipedia.

Language models
In our experiments, we utilized four neural language models that have been shown to be effective in learning bidirectional contexts and have obtained state-of-the-art results in recent years: BERT (Devlin et al., 2018) is composed of deep encoder transformer layers and uses two pretraining objectives, i.e. masked language modelling and next sentence prediction. We used the BERT-Large architecture (along with the cased model) containing 24 transformer layers, 1024 hidden units per layer, 16 attention heads per layer, and 340 million parameters.
RoBERTa (Liu et al., 2019) uses a model architecture similar to BERT, but adopts an optimized pretraining approach. It was pretrained on more data, with bigger batch sizes and longer sequences than BERT. Furthermore, the next sentence prediction objective was removed and a dynamic masking strategy replaced the basic masking method. We used RoBERTa-Large, which applies these optimizations to the same architecture as BERT-Large.
ELMo (Peters et al., 2018) is a contextualized word representation method that utilizes character convolutions along with shallow concatenation of backward and forward LSTMs to implement bidirectional language modeling. We used the original ELMo model composed of two highway layers with an LSTM hidden size of 4096, an output size of 512, and 93.6 million parameters in total. The contextualized embeddings computed by ELMo were fed into a dense layer containing 128 hidden units, followed by an output layer with a softmax activation in the TC and QA tasks, a linear activation in the SA and SS tasks, and a CRF layer with a linear activation in the NER task.
We retrieved the pretrained models, fine-tuned them separately on each downstream task using the training and development sets, and tested them on the test sets. We utilized the Huggingface transformers (Wolf et al., 2020) and FARM libraries to implement the transformer-based models. A complete list of hyperparameter values is presented in Appendix A.
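For illustration, the following is a minimal sketch of how such fine-tuning can be set up with the Huggingface transformers and datasets libraries. The checkpoint, dataset loading, column names, and hyperparameters shown here are assumptions for the sake of the example, not the exact configuration used in our experiments (see Appendix A for the actual hyperparameters).

```python
# Minimal fine-tuning sketch with Huggingface transformers; the checkpoint,
# dataset columns, and hyperparameters are illustrative, not our exact setup.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-large-cased"  # BERT-Large, cased variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=50)   # e.g. the 50 fine-grained TREC classes

dataset = load_dataset("trec")   # column names may differ across versions
dataset = dataset.rename_column("fine_label", "label")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True)

args = TrainingArguments(output_dir="trec-bert", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=dataset["train"],
        eval_dataset=dataset["test"]).train()
```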

Perturbation methods
We designed and implemented various character-level and word-level perturbation methods that simulate different types of noise an NLP system may encounter in real-world situations. The perturbations can be produced for every dataset regardless of the underlying language model or NLP system being tested. Table 2 presents an example for every perturbation method. The perturbation methods were implemented in Python using the NLTK library. The source code is available at https://github.com/mmoradiiut/NLP-perturbation. Almost all the character-level perturbations presented here were already tested in adversarial attack scenarios (Heigold et al., 2018; Zeng et al., 2020; Zhang et al., 2020), but had not yet been implemented in a non-adversarial testing framework, except for the misspelling perturbation implemented by CheckList. Among the word-level perturbations, Deletion, Repetition, Singular/plural verbs, Word order, and Verb tense had not previously been used for robustness testing. However, Negation was included in CheckList, and Replacement with synonyms has been used in adversarial attacks (Dong et al., 2020; Ren et al., 2019).

Character-level perturbation
These perturbation methods randomly select a word, denoted as Word_i, and apply perturbations to its characters. They are described in the following.
Insertion. A character is randomly selected and inserted at a random position (except the first and last position) if Word_i contains at least three characters.
Deletion. A character is randomly selected and deleted if Word_i contains at least three characters. The first and last characters of Word_i are never deleted.
Replacement. A character is randomly selected and is replaced by an adjacent character on the keyboard.

Repetition. A character in a random position (except the first and last position) is selected and a copy of it is inserted right after the selected character.
Common misspelled words. If a word in the input text appears in the Wikipedia list of commonly misspelled words, it is replaced by its misspelling.
Letter case changing toggles the letter case, i.e. converts a lower case character to its upper case form and vice versa. The letter case changing is done for either the first or all the characters of Word_i. The type of case change is chosen at random.
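To make the character-level methods concrete, the following is a minimal Python sketch of their core operations. It is an illustration under simplifying assumptions (e.g. the keyboard-adjacency map is abbreviated and the misspelling lookup is omitted); function names do not necessarily match our released implementation.

```python
import random
import string

def insert_char(word):
    """Insertion: add a random character at an inner position
    (only if the word has at least three characters)."""
    if len(word) < 3:
        return word
    pos = random.randint(1, len(word) - 1)   # never the first or last position
    return word[:pos] + random.choice(string.ascii_lowercase) + word[pos:]

def delete_char(word):
    """Deletion: remove a random inner character."""
    if len(word) < 3:
        return word
    pos = random.randint(1, len(word) - 2)   # keep first and last characters
    return word[:pos] + word[pos + 1:]

# Assumed, abbreviated keyboard-adjacency map for the Replacement perturbation.
ADJACENT_KEYS = {"a": "qwsz", "s": "awedxz", "e": "wsdr", "o": "iklp"}

def replace_char(word):
    """Replacement: swap a random character with a keyboard neighbour."""
    pos = random.randrange(len(word))
    neighbours = ADJACENT_KEYS.get(word[pos].lower())
    if not neighbours:
        return word
    return word[:pos] + random.choice(neighbours) + word[pos + 1:]

def repeat_char(word):
    """Repetition: duplicate a random inner character."""
    if len(word) < 3:
        return word
    pos = random.randint(1, len(word) - 2)
    return word[:pos + 1] + word[pos] + word[pos + 1:]

def toggle_case(word):
    """Letter case changing: flip either the first or every character."""
    if random.random() < 0.5:
        return word[0].swapcase() + word[1:]
    return word.swapcase()
```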

Word-level perturbation
Deletion randomly selects a word from the input sample and removes it.
Repetition selects a random word, makes a copy of it, and inserts it right after the selected word.
Replacement with synonyms replaces words in the sample with their synonyms extracted from the WordNet lexical database (Miller, 1995).
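A minimal sketch of this perturbation, assuming NLTK's WordNet interface (the function name and the selection strategy are illustrative):

```python
import random
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

def synonym_replace(words, num_replacements=1):
    """Replace up to `num_replacements` words with a WordNet synonym."""
    words = list(words)
    candidates = list(range(len(words)))
    random.shuffle(candidates)
    replaced = 0
    for i in candidates:
        synonyms = {lemma.name() for synset in wordnet.synsets(words[i])
                    for lemma in synset.lemmas() if lemma.name() != words[i]}
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
            replaced += 1
        if replaced == num_replacements:
            break
    return words

# e.g. synonym_replace("the golf ball diameter".split())
# may yield ['the', 'golf_game', 'ball', 'diameter']
```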
Negation. It identifies verbs in the sample, then injects negations by converting positive verbs to negative, or removes negations by converting negative verbs to positive. The goal is to investigate the ability of the NLP system to adapt its outcome to reflect the injected or removed negation. This perturbation method operates based on a set of rules that assess verbs, subjects, and verb tenses based on POS tags, then applies an appropriate rule to construct the test sample. For example, if the POS tag of a verb is VBZ, the verb appears in the third person simple present form. Therefore, the verb is replaced by [does not + VBP], where VBP is the basic form of the verb, in order to inject negation into the sample.
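A minimal sketch of the VBZ rule described above, assuming NLTK's POS tagger and WordNet lemmatizer (the full method includes additional rules for other tenses and for removing negation; the function name is illustrative):

```python
import nltk  # requires punkt and averaged_perceptron_tagger models
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def inject_negation(sentence):
    """Replace a verb tagged VBZ (third person simple present)
    with 'does not' + its base (VBP) form."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    out = []
    for token, tag in tagged:
        if tag == "VBZ":
            base = lemmatizer.lemmatize(token, pos="v")  # VBZ -> base form
            out.extend(["does", "not", base])
        else:
            out.append(token)
    return " ".join(out)

# inject_negation("The model handles noisy input")
# -> 'The model does not handle noisy input'
```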
Singular/plural verbs. It simulates a common error in real use cases, i.e. using the plural form of a verb instead of the singular form, and vice versa, usually with a third-person subject. This perturbation typically does not change the text's meaning if the task does not rely on subject-verb agreement; therefore, the NLP system should treat the perturbed sample as if it were unperturbed text.
Word order. It randomly selects M consecutive words from the sample and changes the order in which they appear in the text. The goal is to investigate whether the NLP system is sensitive to word ordering or whether it decides based only on the presence of words in the input.
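A minimal sketch of this perturbation (the function name and the default M=3 are illustrative):

```python
import random

def perturb_word_order(words, m=3):
    """Shuffle M consecutive words while keeping the rest of the text intact."""
    words = list(words)
    if len(words) < m:
        return words
    start = random.randint(0, len(words) - m)  # pick a random window of M words
    window = words[start:start + m]
    random.shuffle(window)
    words[start:start + m] = window
    return words

# perturb_word_order("how far is it from Denver to Aspen".split(), m=3)
# may yield ['how', 'far', 'is', 'from', 'it', 'Denver', 'to', 'Aspen']
```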
Verb tense. It converts present simple or continuous verbs to their corresponding past simple or continuous forms, or vice versa. The goal is to assess the sensitivity of the NLP system to changing the verb tense in tasks where the verb tense is not important to the output. In this case, the system's output should not change after modifying the verb tense. This method first extracts POS tags to identify verbs and their subjects. It then converts the verb tense using the mlconjug3 package and reconstructs the sentence with the new verb tense.

Experimental results
All the experiments were performed on a computer with an Intel Core i5-9600K CPU at 3.70GHz, 32 GB of RAM, and a GeForce RTX 2080 Ti graphics card (GPU) with 11 GB of dedicated memory. The perturbation methods ran on the CPU; fine-tuning on the training sets, as well as evaluation on the test sets and perturbed samples, ran on the GPU.

Performance on perturbed inputs
Since it has been shown that sentences containing a few typos, misspellings, or minor character-level errors can still be fully understandable to humans (Belinkov and Bisk, 2018; Xu and Du, 2020), character-level perturbations are not expected to change the text's meaning in most cases. Therefore, they can be automatically produced and used for testing the robustness of NLP systems.
On the other hand, some word-level perturbations may change the text's meaning. Consequently, the perturbed samples should be monitored to make sure they are still meaningful with respect to the NLP task at hand and are consistent with the original label in the dataset. Otherwise, they should not be used for testing robustness, or the label should be changed to reflect the modification and preserve consistency. We separately applied every character-level perturbation method to all test samples in a dataset, and all the resulting perturbed samples were used to evaluate the robustness of the language models. A hyperparameter named Perturbation Per Sample (PPS) specified the maximum number of perturbations in a sample.
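As an illustration, the PPS cap can be enforced with a small driver like the following; this is a hypothetical sketch, not our exact implementation, and it reuses `insert_char` from the character-level sketch above:

```python
import random

def apply_perturbations(tokens, perturb_fn, pps=1):
    """Apply a single perturbation method to at most `pps` randomly chosen
    words of a test sample (PPS = Perturbation Per Sample).

    `perturb_fn` maps one word to its perturbed form, e.g. `insert_char`
    from the character-level sketch above.
    """
    tokens = list(tokens)
    positions = random.sample(range(len(tokens)), k=min(pps, len(tokens)))
    for pos in positions:
        tokens[pos] = perturb_fn(tokens[pos])
    return tokens

# With PPS=2, at most two words of the sample are perturbed:
# apply_perturbations("what is the capital of France".split(), insert_char, pps=2)
```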
We monitored and filtered the perturbed samples resulting from three word-level perturbations that may change the text's meaning: Deletion, Negation, and Replacement with synonyms. For every sample whose meaning was changed by these three methods, where a change in the test set label was necessary to preserve consistency, we altered the label if applicable. If a proper label could not be assigned to the perturbed sample, or the resulting text was no longer meaningful, we excluded the sample from the evaluations. Since monitoring and filtering every single perturbed sample was extremely time-consuming (approximately one minute was needed on average to check the meaningfulness of a perturbed sample and its consistency with the test set label), we corrected labels and filtered perturbed samples for the above three methods until 200 samples were collected for every dataset; we then used these samples to evaluate the models on perturbed inputs. We performed this manual curation of perturbed samples for all PPS values that we experimented with, i.e. values in the range [1,4]. Appendix B presents the number of perturbed samples checked in the manual curation procedure until reaching 200 test samples for every dataset and different values of PPS. The manual curation was performed by three annotators who had sufficient English language knowledge to properly judge the meaningfulness and consistency of perturbed samples.
Since the remaining word-level perturbations are not expected to change the text's meaning with respect to the NLP tasks in our experiments, we did not monitor and filter them; they were produced and used automatically. Again, the PPS hyperparameter controlled the maximum number of perturbations in every sample. Table 3 and Table 4 present the performance of the language models on character-level and word-level perturbed samples, respectively. These results are reported for PPS=1. The performance of the language models on the original, unperturbed test sets is also reported in both tables for every NLP task. We performed five separate fine-tuning runs to test whether the performance of the NLP models on the original test set and perturbed samples varies between individual runs. Since there was no statistically significant difference between multiple runs (with respect to a t-test with a significance level of p=0.05), we only report the results of the first fine-tuning and testing run. The language models were neither pretrained nor fine-tuned on perturbed samples. The perturbation methods were only applied to the test sets. As the results show, the language models are sensitive to the perturbations and their performance decreases when the input is slightly noisy. However, RoBERTa still performs better than the other models, and ELMo obtains the lowest scores in general.
The results suggest that some language models can handle specific types of perturbation more effectively than other models. ELMo obtains higher scores than BERT and even performs on par with XLNet and RoBERTa on some character-level perturbations. This can be due to its pure character-based representation, which enables the model to use morphological clues, making it more robust against character-level noise.
XLNet is shown to handle perturbations to word ordering more efficiently than the others. This can be an effect of the permutation language modelling that may allow the model to still capture the context and perform more accurately when some context words appear in a different order. The results also suggest that models pretrained on larger corpora, such as RoBERTa and XLNet, are more robust when words are replaced by their synonyms. Furthermore, when the negation perturbation has more impact on the task at hand, e.g. sentiment analysis, the models are less stable and handle the noise less efficiently than on other tasks. Observing the results, we can also point out that the LSTM-based model, i.e. ELMo, is more sensitive to the order of words in a sample than the transformer-based models.

Table 5 presents the absolute decrease in the performance of the language models for different PPS values in the range [1,4]. For every language model, the average absolute decrease in performance is separately reported on character-level and word-level perturbations for every NLP task. As can be seen, the models are generally more sensitive to character-level perturbations than word-level ones. Perturbed inputs cause the models to produce erroneous outcomes on the sentiment analysis task more often than on the other tasks. On the other hand, the question answering task suffers less than the other tasks from noisy inputs.

Figure 1 presents six examples for which perturbations to the input led the RoBERTa model to make wrong decisions, while the model made correct decisions on the respective original inputs. As can be seen, examples 1-3 contain minor character-level noise that causes the model to make wrong decisions, although the perturbed text is still understandable. In example 4, 'diameter' was replaced by 'diam' and 'golf' was replaced by 'golf_game', but the model failed to handle these changes. In example 5, two repeated words led the model to estimate a lower similarity score, although the semantics remained unchanged. Finally, example 6 shows how removing a single word led the model to choose a wrong answer.

User study
We conducted a user study with 20 participants to investigate how understandable the perturbed texts are to humans. We created an evaluation set by randomly selecting perturbed samples from the datasets used in the experiments. The samples covered all types of character-level and word-level perturbations.
In the first part of the study, each participant was given 30 perturbed samples from those perturbation methods that are not expected to change the text's meaning with respect to the NLP tasks at hand. These are all the character-level perturbations and three word-level perturbations, i.e. Repetition, Singular/plural verbs, and Verb tense. The participants were also given the original text along with every perturbed sample, and were asked to judge if the perturbed text is understandable and still conveys the same meaning. Every sample contained one, two, or three perturbations.
According to the user evaluations, on average, 94% of the perturbed samples from this set were understandable and still conveyed the same meaning as the original text. These results are well in agreement with our discussion in Section 6.2, i.e. the majority of our proposed perturbations can be automatically produced and used without needing human supervision to ensure understandability and consistency.
In the second part of the study, each participant was given 20 perturbed samples from those perturbation methods that may change the text's meaning or result in meaningless text. These are the remaining word-level perturbations, i.e. Deletion, Replacement with synonyms, Negation, and Word order. The participants were also given the original text along with every perturbed sample, and were asked to judge (with respect to the task at hand) whether the perturbed text is still meaningful and consistent with the test set label.
According to the user evaluations, on average, 39% of the perturbed samples from this set were still meaningful and consistent with the label, 12% of the perturbed samples were meaningful but the label should be changed, and 49% of the perturbed samples were no longer meaningful. These results imply that some perturbations need to be monitored, corrected, or filtered to make sure they are understandable, meaningful, and consistent with the test set label. This helps to fairly estimate the robustness of NLP systems to input perturbations.

Related work
Typical performance measures such as accuracy, precision, and recall may not properly reflect how NLP systems behave in real-world use cases. This has motivated many studies to devise novel methods for investigating different capabilities and vulnerabilities of text processing systems. Behavioral testing introduces targeted changes to textual inputs to test linguistic capabilities of systems (Ribeiro et al., 2020). Explanations provide simplified representations of what a complex NLP model has learned (Moradi and Samwald, 2021a, b; Ribeiro et al., 2016); this can help to identify biases and errors in NLP models. Adversarial perturbations have been widely studied to assess the robustness of NLP systems against adversarial samples crafted to fool a model (Alshemali and Kalita, 2020; Ren et al., 2019; Zhang et al., 2020). However, adversarial samples represent a very specific type of noise. Moreover, most previous work on adversarial perturbation of NLP models has focused on misspelling attacks (Jones et al., 2020; Pruthi et al., 2019; L. Sun et al., 2020). The perturbation methods implemented in this paper represent a wide range of noise types that an NLP system may face in real-world situations.
Introducing noise and changing textual inputs have already been adopted to assess the ability of models to capture specific linguistic features such as syntax-sensitive dependencies (Linzen et al., 2016), for specific NLP tasks such as machine translation (Belinkov and Bisk, 2018), for detecting biases in language models (Prabhakaran et al., 2019), or to identify susceptible entities in text documents (M. Sun et al., 2018). In this paper, we investigated robustness on a wide range of tasks and for various types of character-level and word-level noise in text.

Conclusion
In this paper, we introduced and implemented a set of non-adversarial perturbation methods that can be used to evaluate the robustness of NLP systems. We extensively investigated the robustness of high-performance neural language models to noisy input texts. The evaluations on various NLP tasks imply that these models are sensitive to different character-level and word-level perturbations to the input, and the models' performance can decrease when the input contains slight noise. The results suggest that it may be too simplistic to only rely on accuracy scores obtained on benchmark datasets when evaluating the robustness of NLP systems.
The proposed perturbations can be used, along with other methodologies such as CheckList, to test how robustly and reliably NLP systems operate in real-world settings. The experimental results demonstrated that the perturbation methods are effective tools for evaluating NLP systems against noisy data. The user study revealed that only a few perturbation methods need to be monitored to make sure they produce meaningful and consistent samples. Most of the perturbation methods can be used automatically to produce noisy test samples. They can also be used as a baseline for comparing adversarial attacks against non-adversarial perturbations.
Future work may include helping users assess meaning preservation and grammatical correctness in a semi-automatic manner. Sentence encoders such as InferSent (Conneau et al., 2017), the Universal Sentence Encoder (Cer et al., 2018), and BERT trained for semantic similarity (Reimers and Gurevych, 2019) can be used to give users clues about how semantically similar the original and perturbed sentences are. Moreover, users can be provided with information about grammatical errors in the perturbed text using LanguageTool (Naber, 2003).