How Does Data Corruption Affect Natural Language Understanding Models? A Study on GLUE Datasets

A central question in natural language understanding (NLU) research is whether high performance demonstrates the models' strong reasoning capabilities. We present an extensive series of controlled experiments where pre-trained language models are exposed to data that have undergone specific corruption transformations. These involve removing instances of specific word classes and often lead to nonsensical sentences. Our results show that performance remains high on most GLUE tasks when the models are fine-tuned or tested on corrupted data, suggesting that they leverage other cues for prediction even in nonsensical contexts. Our proposed data transformations can be used to assess the extent to which a specific dataset constitutes a proper testbed for evaluating models' language understanding capabilities.


Introduction
The super-human performance of recent Transformer-based pre-trained language models (Devlin et al., 2019; Liu et al., 2019) on natural language understanding (NLU) tasks has raised scepticism regarding the quality of the benchmarks used for evaluation (Wang et al., 2018, 2019). There is increasing evidence that these datasets contain annotation artefacts and other statistical irregularities that can be leveraged by machine learning models to perform the tasks (Gururangan et al., 2018; Poliak et al., 2018b; Tsuchiya, 2018; Glockner et al., 2018; Talman and Chatzikyriakidis, 2019; Pham et al., 2020; Talman et al., 2021). These studies have so far largely focused on the natural language inference (NLI) and textual entailment tasks. The scope of our work is wider, in the sense that we address all but one of the NLU tasks comprised in the GLUE benchmark, specifically: linguistic acceptability (COLA), natural language inference (MNLI, QNLI, RTE), paraphrasing (MRPC and QQP), sentiment prediction (SST-2), and semantic textual similarity (STS-B).

An example non-paraphrase sentence pair from the MRPC training dataset (cf. Table 1):

Arison said Mann may have been one of the pioneers of the world music movement and he had a deep love of Brazilian music.

Arison said Mann was a pioneer of the world music movement - well before the term was coined - and he had a deep love of Brazilian music.
We present a series of experiments where the datasets used for model training and evaluation undergo a number of corruption transformations, which involve removing specific word classes from the data. We remove words pertaining to a specific class (e.g., nouns, verbs), instead of random words, in order to assess the relative importance of each word class for the NLU tasks. For instance, verbs arguably play a significant role in sentence-level semantics, and removing them is expected to have a bigger impact on the GLUE scores than removing, say, determiners.
The transformations seriously affect the quality of the sentences found in the datasets, making them in many cases unintelligible (cf. examples in Table 1); a decrease in performance for models fine-tuned on these corrupted datasets would, thus, be expected. High performance would, instead, indicate that the models rely on lexical cues that remain after corruption, and possibly on other dataset artefacts, to perform a task without necessarily understanding the meaning of the processed utterances.
Our results show that performance after the corruptions remains high for most GLUE tasks, suggesting that the models leverage other cues for prediction even in nonsensical contexts.

Related Work
Annotation artefacts and statistical biases in NLI datasets are easily leveraged by the models and can guide prediction (Lai and Hockenmaier, 2014; Marelli et al., 2014; Poliak et al., 2018a; Gururangan et al., 2018). Examples include explicit negation being indicative of contradiction, and generic nouns suggesting entailment. Artefacts are also present in other types of datasets, for example in the ROC Story dataset, where models can provide story endings without looking at the actual stories (Schwartz et al., 2017; Cai et al., 2017). Several works have proposed more challenging and cleaner NLI datasets where artefacts have been removed (McCoy et al., 2019). An efficient way to do this is using adversarial filtering (Nie et al., 2020; Zellers et al., 2018). The superior quality of the resulting NLI datasets is confirmed by Talman et al. (2021) in a series of experiments showing that data corruption affects these higher-quality datasets to a greater extent than previous datasets.
This work follows the same experimental direction, where text perturbations serve to explore the sensitivity of language models to specific phenomena (Futrell et al., 2019; Ettinger, 2020; Taktasheva et al., 2021; Dankers et al., 2021). It has been shown, for example, that shuffling word order causes significant performance drops on a wide range of QA tasks (Si et al., 2019; Sugawara et al., 2019), but that state-of-the-art NLU models are not sensitive to word order (Pham et al., 2020; Sinha et al., 2021). Syntax-based perturbations have also been studied in relation to the robustness and faithfulness of machine translation models (Parthasarathi et al., 2021).
We add to this line of research by applying data corruption transformations that involve removing entire word classes (Talman et al., 2021) to all but one of the GLUE tasks.1 We interpret high performance of models fine-tuned and/or tested on corrupted datasets as an indication of the presence of lexical cues, and possibly artefacts, guiding prediction, since the meaning of the corrupted utterances is often hard to recover.

Datasets and Corruptions
In our experiments, we address eight tasks included in the General Language Understanding Evaluation (GLUE) benchmark for the English language (Wang et al., 2018): COLA, MNLI, MRPC, QNLI, QQP, RTE, SST-2, STS-B. Following Talman et al. (2021), we corrupt the training and development sets available for these tasks by removing words of specific word classes.2 We use the development sets for evaluation, since annotated test data have not been made publicly available.3 We create three configurations for each task: (a) CORRUPT-TRAIN: fine-tuning on the corrupted training set, evaluation on the original development set; (b) CORRUPT-TEST: fine-tuning on the original training set, evaluation on the corrupted test set; (c) CORRUPT-TRAIN AND TEST: training and evaluation on corrupted data.

Table 3: Example results for the RoBERTa-base model fine-tuned on CORRUPT-TRAIN and tested on the original evaluation set (columns 2 and 3); fine-tuned on the original data and tested on CORRUPT-TEST; fine-tuned on CORRUPT-TRAIN and tested on CORRUPT-TEST (columns 6 and 7). ∆ is the difference to the baseline scores obtained by RoBERTa-base on the original dataset, given in Table 2.
The corruption procedure involves removing all instances of a specific word class from the corresponding dataset (ADJ, ADV, CONJ, DET, NOUN, NUM, PRON, VERB). We label the corrupted datasets by indicating the class of the words that have been removed (e.g., COLA-NOUN, QNLI-VERB). Given the possible combinations of tasks, datasets and corruptions, we end up with 192 setups for our experiments. Note that the resulting sentence fragments do not constitute propositions. Although not ideal, this is not necessarily problematic for tasks such as sentiment analysis. For inference, the assumption that the task can only be performed at the propositional level is a strong claim, especially given that examples which are not propositions are abundant in existing benchmarks such as MNLI (e.g., examples extracted from dialogue).
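To make the procedure concrete, the following is a minimal sketch of a possible implementation using spaCy part-of-speech tags. The choice of tagger and the mapping from the word-class labels above to tag sets (e.g., splitting CONJ into CCONJ and SCONJ, or grouping PROPN with NOUN) are our assumptions, not necessarily those of the original setup.

import spacy

nlp = spacy.load("en_core_web_sm")

# Assumed mapping from the paper's word-class labels to spaCy universal POS tags.
CLASS_TO_POS = {
    "ADJ": {"ADJ"}, "ADV": {"ADV"}, "CONJ": {"CCONJ", "SCONJ"},
    "DET": {"DET"}, "NOUN": {"NOUN", "PROPN"}, "NUM": {"NUM"},
    "PRON": {"PRON"}, "VERB": {"VERB", "AUX"},
}

def corrupt(text: str, word_class: str) -> str:
    """Remove every token whose POS tag falls into the given word class."""
    doc = nlp(text)
    kept = [tok.text_with_ws for tok in doc
            if tok.pos_ not in CLASS_TO_POS[word_class]]
    return "".join(kept).strip()

# e.g. corrupt("He had a deep love of Brazilian music.", "NOUN")
# -> "He had a deep of Brazilian ." (exact output depends on the tagger)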

Models
We fine-tune the pre-trained RoBERTa-base model (Liu et al., 2019) from the Huggingface Transformers library (Wolf et al., 2020a) in each of our 192 configurations. We use the same fine-tuning and evaluation setup for all the experiments. We retrieve the GLUE datasets using the Huggingface Datasets library (Wolf et al., 2020b). We fine-tune the models for 3 epochs, using a batch size of 32 and a learning rate of 2e-5.
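A minimal sketch of one such fine-tuning run is given below, using the reported hyperparameters (3 epochs, batch size 32, learning rate 2e-5); the task choice, the column name, and the corruption hook are illustrative assumptions rather than a verified reproduction of the experimental code.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base",
                                                           num_labels=2)

dataset = load_dataset("glue", "sst2")  # one of the eight GLUE tasks

def preprocess(batch):
    # In a CORRUPT-TRAIN configuration, the corruption (e.g. corrupt(s, "NOUN"))
    # would be applied to batch["sentence"] before tokenisation.
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,              # settings reported above
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
print(trainer.evaluate())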

Results
The baseline results using the original (non-corrupted) datasets are shown in Table 2. Given the large number of configurations, we only report the exact evaluation results for the -NOUN and -VERB settings in Table 3, as these content word classes arguably contribute a lot to the meaning of utterances. For the remaining configurations, we visualise the effect of the corruptions using heatmaps that show the difference in performance compared to the baseline results (Figures 1 to 3). Our results for the -NOUN and -VERB corruptions in CORRUPT-TRAIN (Table 3), and for all configurations in Figure 1, show a notable decrease in performance on COLA and RTE, especially when nouns are removed. The impact on the MNLI-M and QNLI datasets is small, confirming previous findings regarding the presence of annotation artefacts and lexical cues that can guide model prediction. Our results suggest that this is also the case in other GLUE datasets, such as MRPC and SST-2, where the models still manage to perform fairly well compared to the baseline when fine-tuned on corrupted data.
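For illustration, a heatmap of this kind can be produced as follows; the delta values below are random placeholders, not our actual results, which in practice are the corrupted-run scores minus the Table 2 baselines.

import matplotlib.pyplot as plt
import numpy as np

tasks = ["COLA", "MNLI-M", "MRPC", "QNLI", "QQP", "RTE", "SST-2", "STS-B"]
classes = ["ADJ", "ADV", "CONJ", "DET", "NOUN", "NUM", "PRON", "VERB"]
delta = np.random.uniform(-30, 0, size=(len(tasks), len(classes)))  # placeholders

fig, ax = plt.subplots()
im = ax.imshow(delta, cmap="RdYlGn", vmin=-30, vmax=0)
ax.set_xticks(range(len(classes)))
ax.set_xticklabels(classes)
ax.set_yticks(range(len(tasks)))
ax.set_yticklabels(tasks)
fig.colorbar(im, ax=ax, label="Δ to baseline")
plt.show()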
Our CORRUPT-TEST results in Table 3 and in Figure 2 show that removing nouns from the data used for evaluation has a much larger impact across tasks, compared to CORRUPT-TRAIN. The biggest drop in performance is observed on COLA, MNLI-M and STS-B. However, accuracy on MRPC and SST-2 is still very high, suggesting that good performance does not require sentence-level understanding but can be achieved by relying on lexical cues present in the data. In the CORRUPT-TRAIN AND TEST setting (Table 3 and Figure 3), we observe the biggest drop in performance on COLA, MNLI-M and STS-B, and a lower impact on QNLI, QQP and SST-2.

Discussion and Analysis

Lexical Cues
Our results show that model performance in many tasks is only marginally affected by the imposed corruptions, which, however, in many cases alter the meaning of utterances. We conduct additional analyses aimed at identifying the lexical cues that remain after corruption and can guide model prediction. We focus on MRPC (Microsoft Research Paraphrase Corpus) and SST-2 (Stanford Sentiment Treebank), where the impact of the CORRUPT-TEST transformations was the smallest. MRPC addresses the paraphrase relationship between sentence pairs. We explore the semantic similarity of the information that remains after corruption. Our assumption is that if a sentence pair (from which nouns or verbs have been removed) still contains synonyms or longer paraphrases, this can guide the model towards detecting a similarity or entailment relationship. For this analysis, we use the unigram paraphrases in the L (large) package of PPDB 2.0 (Pavlick et al., 2015). We find that in the CORRUPT-TEST-NOUN MRPC dataset, 76% of the sentence pairs for which the model made correct predictions still include a lexical paraphrase.
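A rough sketch of this PPDB-based check is given below; the file name and the simple word-level containment test are assumptions about the analysis, not a verified reproduction of it.

def load_unigram_paraphrases(path="ppdb-2.0-l-lexical"):
    """Collect unigram paraphrase pairs from a PPDB 2.0 file ('|||'-separated fields)."""
    pairs = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split(" ||| ")
            if len(fields) < 3:
                continue
            phrase, paraphrase = fields[1].strip(), fields[2].strip()
            if " " not in phrase and " " not in paraphrase:
                pairs.add((phrase, paraphrase))
    return pairs

def has_lexical_paraphrase(sent1, sent2, pairs):
    """True if some word of sent1 is a PPDB paraphrase of some word of sent2."""
    w1, w2 = set(sent1.lower().split()), set(sent2.lower().split())
    return any((a, b) in pairs or (b, a) in pairs for a in w1 for b in w2)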
SST-2 involves detecting the sentiment expressed in individual sentences. We use the NRCLex tool4 to measure the sentiment expressed by lexical cues in the CORRUPT-TEST sentences for which model predictions are correct. Given that sentiment can be expressed in a text by words pertaining to different grammatical categories, we explore whether lexical cues indicating the polarity of the text still remain after removing instances of a specific word class. In Table 4, we show the labels predicted by NRCLex for corrupted test sentences where the nouns and adjectives have been dropped. We observe that even if sentences become nonsensical after corruption, it is still possible to detect the (positive or negative) polarity of the sentences from the remaining words. Relying on these lexical cues, RoBERTa often manages to predict the correct sentiment. Specifically, according to the NRCLex predictions, the correct sentiment is still present in 383 out of the 761 corrupted sentences where RoBERTa made correct predictions in the CORRUPT-TEST-NOUN setting. If both nouns and adjectives are removed (CORRUPT-TEST-NOUN-ADJ), NRCLex detects that the correct sentiment is still present in 125 out of the 672 examples that were correctly predicted by RoBERTa.
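The following is a minimal sketch of such a polarity check with NRCLex; deriving a binary label from the positive/negative affect frequencies is our assumption rather than the paper's exact criterion.

from nrclex import NRCLex  # pip install NRCLex

def nrclex_polarity(text: str) -> str:
    """Derive a coarse polarity label from NRCLex affect frequencies."""
    freqs = NRCLex(text).affect_frequencies
    pos, neg = freqs.get("positive", 0.0), freqs.get("negative", 0.0)
    if pos == neg:
        return "neutral"
    return "positive" if pos > neg else "negative"

# A corrupted-looking fragment can still carry polarity through remaining words:
print(nrclex_polarity("utterly charming and of the most"))  # likely "positive"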

Can RoBERTa Guess the Missing Tokens?

As RoBERTa has been pre-trained using a masked word prediction task, it is reasonable to ask if high model performance on our corrupted datasets could be due to the model's ability to "fill in the gaps" and predict the missing words. To test this, in each sentence of the MRPC development set, we replace the first token targeted by a specific corruption procedure (-NOUN/VERB) with the model's mask token. We do this in the original sentence (removing only the first noun/verb instance) and in the corrupted sentence (where all other nouns/verbs are missing). For example, from the first sentence in Table 4, we generate two cloze-task queries in the -NOUN setting, and we use these queries to test RoBERTa's token prediction capability. As shown in Table 5, it is easier to predict the masked token in the original sentences, but the model is still able to make correct predictions in the corrupted sentences. This could partly explain the high performance observed for MRPC in the corrupted setting (cf. Section 5).
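A minimal sketch of this cloze test using the Transformers fill-mask pipeline is shown below; the query strings are illustrative, not the actual examples from Table 4. Note that RoBERTa's mask token is <mask>.

from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")
mask = fill.tokenizer.mask_token  # "<mask>" for RoBERTa

# Illustrative queries: the same position masked in the original sentence
# and in its -NOUN corrupted counterpart (all other nouns removed).
original = f"He had a deep {mask} of Brazilian music."
corrupted = f"He had a deep {mask} of Brazilian ."

for query in (original, corrupted):
    top = fill(query, top_k=1)[0]
    print(f"{top['token_str']!r} (p={top['score']:.2f}) <- {query}")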

Conclusion
We apply a set of controllable corruption transformations to the datasets of NLU tasks in the GLUE benchmark, and study their impact on model performance. The proposed transformations are generic enough to be applicable to other NLU tasks, and can enrich the available arsenal for dataset quality assessment, in terms of how efficiently datasets trigger and test the language understanding capabilities of the models. Our results indicate that understanding the meaning of utterances is not required for high performance in most GLUE tasks. This finding suggests caution in interpreting leaderboard results and the conclusions that can be drawn regarding the language understanding capabilities of the models. We make our code available5 in order to promote the application of these tests to other NLU datasets, and to favour the development of benchmarks addressing the actual capability of the models to reason about language.

Figure 1: Impact of specific data corruptions in the CORRUPT-TRAIN setting. The columns correspond to the removed word class and the rows to the GLUE tasks.

Figure 2: Impact of specific data corruptions in the CORRUPT-TEST setting for each task.

Figure 3: Impact of specific data corruptions in the CORRUPT-TRAIN AND TEST setting for each task.

Table 1: Example sentence pairs from the corrupted MRPC training dataset where all instances of nouns have been removed.

Table 2: Baseline results obtained for different GLUE tasks with RoBERTa-base and the relevant metric (e.g., COLA: 64.05 Matthews correlation, Warstadt et al., 2018).

Table 5: Accuracy of RoBERTa-base in predicting a masked word in the MRPC development set.