Out of Order: How Important Is The Sequential Order of Words in a Sentence in Natural Language Understanding Tasks?

Do state-of-the-art natural language understanding models care about word order, one of the most important characteristics of a sequence? Not always! We found that 75% to 90% of the correct predictions of BERT-based classifiers trained on many GLUE tasks remain constant after input words are randomly shuffled. Although BERT embeddings are famously contextual, the contribution of each individual word to downstream tasks is almost unchanged even after the word's context is shuffled. BERT-based models are able to exploit superficial cues (e.g. the sentiment of keywords in sentiment analysis, or the word-wise similarity between sequence-pair inputs in natural language inference) to make correct decisions when tokens are arranged in random orders. Encouraging classifiers to capture word order information improves the performance on most GLUE tasks, SQuAD 2.0, and out-of-sample data. Our work suggests that many GLUE tasks are not challenging machines to understand the meaning of a sentence.


Introduction
Machine learning (ML) models recently achieved excellent performance on state-of-the-art benchmarks for evaluating natural language understanding (NLU). In July 2019, RoBERTa (Liu et al., 2019) was the first to surpass a human baseline on GLUE (Wang et al., 2019). Since then, 13 more methods have also outperformed humans on the GLUE leaderboard. Notably, at least 8 out of the 14 solutions are based on BERT (Devlin et al., 2019), a transformer architecture that learns representations via a bidirectional encoder. Given their superhuman GLUE scores, how do BERT-based models solve NLU tasks? How does their NLU capability differ from that of humans?
We shed light on these important questions by examining model sensitivity to the order of words. Word order is one of the key characteristics of a sequence and is tightly constrained by many linguistic factors including syntactic structures, subcategorization, and discourse (Elman, 1990). Thus, arranging a set of words in a correct order is considered a key problem in language modeling (Hasler et al., 2017; Zhang and Clark, 2015).

Figure 1: A RoBERTa-based model achieving 91.12% accuracy on QQP, here, correctly labeled a pair of Quora questions "duplicate" (a). Interestingly, the prediction remains unchanged when all words in question Q2 are randomly shuffled (b-c). QQP models also often incorrectly label a real sentence and its shuffled version as "duplicate" (d). We found evidence that GLUE models rely heavily on individual words to make decisions, e.g. here, "marijuana" and "cancer" (more important words are highlighted by LIME). Also, there exist self-attention matrices tasked explicitly with extracting word correspondences between the two input sentences regardless of the positions of those words. Here, the top-3 pairs of words assigned the highest self-attention weights at (layer 0, head 7) are inside the red, green, and blue rectangles, respectively.
Therefore, a natural question is: Do BERT-based models trained on GLUE care about the order of words in a sentence? Lin et al. (2019) found that pretrained BERT captures word-order information in the first three layers. However, it is unknown whether BERT-based classifiers actually use word order information when performing NLU tasks. Recently, Wang et al. (2020) showed that incorporating additional word-ordering and sentence-ordering objectives into BERT pretraining could lead to text representations (StructBERT) that enabled improved GLUE scores. However, the StructBERT findings are inconclusive across different GLUE tasks and models. For example, in textual entailment (Wang et al., 2019, RTE), StructBERT improved the performance of BERT-large but hurt the performance of RoBERTa (Table 2d).
The findings of Wang et al. (2020) motivate interesting questions: Are state-of-the-art BERT-based models using word order information when solving NLU tasks? If not, what cues do they rely on? To the best of our knowledge, our work is the first to study the above questions for an NLU benchmark (GLUE). We tested BERT-, RoBERTa-, and ALBERT-based (Lan et al., 2020) models on 7 GLUE tasks where the words of only one selected sentence in the input text are shuffled at varying degrees. An ideal agent that truly understands language is expected to choose a "reject" option when asked to classify a sentence whose words are randomly shuffled. Alternatively, given shuffled input words, true NLU agents are expected to perform at random chance in multi-way classification that has no "reject" options (Fig. 1b). Our findings include:
1. 65% of the groundtruth labels of 5 GLUE tasks can be predicted when the words in one sentence in each example are shuffled (Sec. 3.1).
2. Although pretrained BERT embeddings are known to be contextual, in some GLUE tasks, the contribution of an individual word to classification is almost unchanged even after its surrounding words are shuffled (Sec. 3.3).
3. In sentiment analysis (SST-2), a single salient word is highly predictive of an entire sentence's label (Sec. 3.4.1).
4. BERT-based models trained on sequence-pair GLUE tasks use a set of self-attention heads for finding similar tokens shared between the two inputs (Sec. 3.4).

5. Encouraging RoBERTa-based models to be more sensitive to word order improves the performance on SQuAD 2.0 and most GLUE tasks tested (i.e. except for SST-2) (Sec. 3.5).
Despite their superhuman scores, most GLUE-trained models behave similarly to Bag-of-Words (BOW) models, which are prone to naive mistakes (Fig. 1b-d). Our results also suggest that GLUE does not necessarily require syntactic information or complex reasoning.

Datasets
We chose GLUE for three reasons: (1) GLUE is a common benchmark for NLU evaluation (Wang et al., 2019); (2) there exist NLU models (e.g. RoBERTa) that outperform humans on GLUE, making an important case for studying their behaviors; (3) it is unknown how sensitive GLUE-trained models are to word order and whether GLUE requires them to be sensitive (Wang et al., 2020).
Tasks Out of the 9 GLUE tasks, we chose all 6 binary-classification tasks because they share the same random baseline of 50% accuracy, which enables us to compare models' word-order sensitivity across tasks. The six tasks range from acceptability (CoLA; Warstadt et al. 2019) to natural language inference (QNLI; Rajpurkar et al. 2016). We also performed our tests on STS-B (Cer et al., 2017), a regression task of predicting the semantic similarity of two sentences. While CoLA and SST-2 require single-sentence inputs, all other tasks require sequence-pair inputs.
Reject options For all binary-classification tasks (except SST-2), the negative label is considered the reject option (e.g. QQP models can choose "not duplicate" in Fig. 1b to reject shuffled inputs).
Metrics We use accuracy scores to evaluate the binary classifiers (for ease of interpretation) and Spearman correlation to evaluate STS-B regressors, following Wang et al. (2019).

Classifiers
We tested BERT-based models because (1) they outperformed humans on the GLUE leaderboard; and (2) the pretrained BERT was shown to capture word positional information (Lin et al., 2019).
Pretrained BERT encoders We tested three sets of classifiers finetuned from three different pretrained BERT variants: BERT, RoBERTa, and ALBERT, downloaded from Huggingface (2020). The pretrained models are the "base" versions, i.e. bidirectional transformers with 12 layers and 12 self-attention heads. The pretraining corpus varies from uncased (BERT, ALBERT) to case-sensitive English (RoBERTa).
Classifiers For each of the seven GLUE tasks, we added one classification layer on top of each of the three pretrained BERT encoders and finetuned the entire model. Unless otherwise noted, the mean performance per GLUE task was averaged over three classifiers. Each model's performance matches either those reported on Huggingface (2020) or the original papers (Table A6).
Hyperparameters Following Devlin et al. (2019), we finetuned classifiers for 3 epochs using Adam (Kingma and Ba, 2015) with a learning rate of 0.00002, β1 = 0.9, β2 = 0.999, ε = 10^-8. We used a batch size of 32, a max sequence length of 128, and dropout on all layers with a probability of 0.1.
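For concreteness, the following is a minimal finetuning sketch under these hyperparameters, assuming the SST-2 task loaded via the Hugging Face datasets library; the output directory and the choice of task are illustrative, not part of our setup description.

```python
# A minimal finetuning sketch (assumed: SST-2, Hugging Face transformers/datasets).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

data = load_dataset("glue", "sst2")
def tokenize(batch):
    # Single-sentence tasks use one text column; pair tasks would pass two columns.
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)
data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="sst2-roberta",           # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,                  # Adam with beta1=0.9, beta2=0.999, eps=1e-8
    adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8,
)
Trainer(model=model, args=args,
        train_dataset=data["train"], eval_dataset=data["validation"]).train()
```

Dropout of 0.1 on all layers is already the default configuration of these pretrained checkpoints, so it is not set explicitly here.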

Constructing sets of real and shuffled examples for experiments
Modifying one sentence As GLUE tasks vary in the number of inputs (one or two input sequences) and the sequence type per input (a sentence or a paragraph), we re-ordered the words of a single sentence in only one input while keeping the rest of the inputs unchanged. Constraining the modifications to a single sentence enables us to measure (1) the importance of word order in a single sentence; and (2) the interaction between the shuffled words and the unchanged, real context.

Random shuffling methods
To understand model behaviors across varying degrees of word-order distortion, we experimented with three tests: randomly shuffling n-grams where n ∈ {1, 2, 3}.
Shuffling 1-grams is a common technique for analyzing word-order sensitivity (Sankar et al., 2019;Zanzotto et al., 2020). We split a given sentence by whitespace into a list of n-grams, and re-combined them, in a random order, back into a "shuffled" sentence (see Table 1 for examples). The ending punctuation was kept intact. We re-sampled a new random permutation until the shuffled sentence was different from the original sentence.
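A minimal sketch of this shuffling procedure, assuming whitespace tokenization and the sentence-final punctuation handling described above (the function name is ours):

```python
import random

def shuffle_ngrams(sentence, n=1, seed=None):
    """Randomly re-order consecutive n-grams of a whitespace-tokenized sentence,
    keeping the ending punctuation in place and re-sampling until the result
    differs from the original."""
    rng = random.Random(seed)
    tokens = sentence.split()
    end = []
    if tokens and tokens[-1] in {".", "?", "!"}:    # keep ending punctuation intact
        end, tokens = [tokens[-1]], tokens[:-1]
    ngrams = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    if len(ngrams) < 2:                             # nothing to shuffle
        return sentence
    original = " ".join(tokens + end)
    shuffled = original
    while shuffled == original:                     # re-sample until different
        rng.shuffle(ngrams)
        shuffled = " ".join([w for g in ngrams for w in g] + end)
    return shuffled

# e.g. shuffle_ngrams("the film 's performances are thrilling .", n=2, seed=0)
```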
As the label distributions, dev-set sizes, and the performance of models vary across GLUE tasks, to compare word-order sensitivity across tasks we tested each model on two sets: (1) dev-r, i.e. a subset of the original dev-set (Sec. 2.3.1); and (2) dev-s, i.e. a clone of dev-r in which each example has one sentence with re-ordered words (Sec. 2.3.2).

Table 1: A real question on Quora (QQP dataset) and its three modified versions (Q3 to Q1) created by randomly shuffling 3-grams, 2-grams, and 1-grams, respectively. Qs was created by swapping two random nouns.

Selecting real examples
For each pair of (task, classifier), we selected a subset of dev-set examples via the following steps: 1. For tasks with either a single-sequence or a sequence-pair input, we used examples where the input sequence to be modified has only one sentence with more than 3 tokens (so that shuffling 3-grams can produce a sentence different from the original sentence).
2. We only selected the examples that were correctly classified by the classifier (to study what features were important for high accuracy).
3. We balanced the numbers of positive and negative examples by removing random examples from the larger-sized class.
In total, these steps filtered out ∼34% of the original data on average. See Table A4 for the total number of examples remaining after each filtering step above.

Creating shuffled sets
For each task, we cloned the dev-r sets above and modified each example to create a "shuffled" set (a.k.a. dev-s) per shuffling method.
Specifically, CoLA and SST-2 examples contain only a single sentence, and we modified that sentence. Each QQP, MRPC, and STS-B example has two sentences, and we modified the first sentence. An RTE example is a (premise, hypothesis) pair, and we modified the hypothesis since it is a single sentence while premises are paragraphs. Each QNLI example is a (question, answer) pair, and we modified the question, which is a sentence, while the answer is often a paragraph.
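The per-task choice above can be summarized as a simple mapping; the column names below follow the GLUE datasets as distributed on the Hugging Face hub and are an assumption about the data layout, not part of GLUE itself.

```python
# Which input field is shuffled per task (the other field is left intact).
# Column names assume the Hugging Face "glue" dataset distribution.
FIELD_TO_SHUFFLE = {
    "cola": "sentence",    # single-sentence input
    "sst2": "sentence",    # single-sentence input
    "qqp":  "question1",   # first question of the pair
    "mrpc": "sentence1",   # first sentence of the pair
    "stsb": "sentence1",   # first sentence of the pair
    "rte":  "sentence2",   # hypothesis (premises are paragraphs)
    "qnli": "question",    # question (answers are often paragraphs)
}
```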

Experiments and Results
3.1 How much is word order information required for solving GLUE tasks?
GLUE has been a common benchmark for evaluating NLU progress. But do GLUE tasks require models to use word order and syntactic information? We shed light on this question by testing model performance when word order is increasingly randomized. If a task strictly requires words to form a semantically meaningful sentence, then randomly re-positioning words in correctly-classified sentences will cause model accuracy to drop from 100% to 50% (i.e. the random baseline b for binary-classification tasks with two balanced classes). Thus, to compare model sensitivity across tasks, we use a Word-Order Sensitivity (WOS) score:

s = (100 − p) / (100 − b)

where p ∈ [50, 100] is the accuracy of a GLUE-trained model evaluated on a dev-s set (described in Sec. 2.3.2), b = 50, and s ∈ [0, 1].
Experiments For each GLUE task, we computed the mean accuracy and confidence score over three classifiers (BERT-, RoBERTa-, and ALBERT-based) on dev-s sets created by shuffling 1-grams, 2-grams, and 3-grams. The results reported in Table 2 were averaged over 10 random shuffling runs (i.e. 10 random seeds) per n-gram type, and then averaged over 3 models per task.
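A small helper reflecting the WOS definition above; the example values reuse accuracies reported in the following subsections and are arithmetic under that definition.

```python
def word_order_sensitivity(p, b=50.0):
    """WOS = (100 - p) / (100 - b), where p is the dev-s accuracy (%) on examples the
    model originally classified correctly and b is the random baseline (50% for
    balanced binary tasks). p = 100 gives 0 (fully insensitive); p = b gives 1."""
    return (100.0 - p) / (100.0 - b)

# e.g. word_order_sensitivity(84.04) -> ~0.32 (SST-2 accuracy after 1-gram shuffling)
#      word_order_sensitivity(50.69) -> ~0.99 (CoLA, near random chance)
```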

Results
We found that for CoLA, i.e. detecting grammatically incorrect sentences, model accuracy, on average, drops to near random chance, i.e. between 50.69% and 56.36% (Table 2b), when n-grams are shuffled. That is, most examples were classified as "unacceptable" after n-gram shuffling, yielding ∼50% accuracy (see Fig. A2 for qualitative examples).
Surprisingly, for the remaining 5 of the 6 binary-classification tasks (i.e. except CoLA), between 75% and 90% of the originally correct predictions remain constant after 1-grams are randomly re-ordered (Table 2b; 1-gram). These numbers increase as the shuffled n-grams become longer (i.e. as n increases from 1 to 3), up to 95.32% (Table 2b; QNLI). Importantly, given an average dev-set accuracy of 86.35% for these 5 tasks, at least 86.35% × 75% ≈ 65% of the groundtruth labels of these 5 GLUE tasks can be predicted when all input words in one sentence are randomly shuffled.
Additionally, on average over the three n-gram types, models trained on these five GLUE tasks are 2 to 10 times less sensitive to word-order randomization than CoLA models (Table 2c). That is, if not explicitly tasked with checking for grammatical errors, GLUE models mostly do not care about the order of words in a sentence (see qualitative examples in Figs. 1, A2-A4). Consistently, the confidence scores of BERT-based models on the five non-CoLA tasks only dropped ∼2% when 1-grams were shuffled (Table 2).
Consistently across three different BERT "base" variants and a RoBERTa "large" model (Table A5), our results suggest that word order and syntax, in general, are not necessarily required to solve GLUE.
2-noun swaps Besides shuffled n-grams, we also repeated all experiments with more syntactically-correct modified inputs where only two random nouns in a sentence were swapped (Table 1; Qs). This is a harder test for NLU models since the meaning of a sentence with two nouns swapped often changes while its syntax remains correct. We found our conclusions to generalize to this setting. That is, the models hardly changed their predictions although the meanings of the original sentence and its swapped version are different (Table 2b; 2-noun swap vs. 1-gram).
3.2 How sensitive are models trained to predict the similarity of two sentences?
An interesting hypothesis is that models trained explicitly to evaluate the semantic similarity of two sentences should be able to tell real from shuffled examples. Intuitively, word order information is essential for understanding what an entire sentence means and, therefore, for predicting whether two sentences convey the same meaning. We tested this hypothesis by analyzing the sensitivity of models trained on QQP and STS-B, two prominent GLUE tasks for predicting the semantic similarity of a sentence pair. While QQP is a binary classification task, STS-B is a regression task where a pair of sentences is given a score ∈ [0, 5] denoting their semantic similarity.
Experiments We tested the models on dev-r and dev-s sets (see Sec. 2.3.2) where in each pair, the word order of the first sentence was randomized while the second sentence was kept intact.
QQP results Above 83% of QQP models' correct predictions on real pairs remained unchanged after word-order randomization (see Figs. 1a-c for examples).

STS-B results
Similarly, STS-B model performance only drops marginally, i.e. by less than 2 points, from 89.67 to 87.80 in Spearman correlation (Table 2; STS-B). Since an STS-B model outputs a score ∈ [0, 5], we binned the scores into 6 ranges. One might expect STS-B models to assign near-zero similarity scores to most modified pairs. However, the distributions of similarity scores for the modified and real pairs still closely match (Fig. 2). In sum, despite being trained explicitly to predict the semantic similarity of sentence pairs, QQP and STS-B models are surprisingly insensitive to n-gram shuffling, exhibiting a naive understanding of sentence meanings.

Figure 2: The distribution of similarity scores over 6 ranges for the (real, shuffled) pairs in dev-s (green) is highly similar to that for the (real, real) STS-B pairs in dev-r (red). The statistics in each range were computed over 3 models (BERT, RoBERTa, and ALBERT).

3.3 How important are words to classification after their context is shuffled?
BERT representations for tokens are known to be highly contextual (Ethayarajh, 2019). However, after finetuning on GLUE, would the importance of a word to classification drop after its context is shuffled?
To answer the above question, we used LIME (Ribeiro et al., 2016) to compute word importance. LIME computes a score ∈ [−1, 1] for each token in the input denoting how much its presence contributes for or against the network's predicted label (Fig. 1; highlights). Intuitively, the importance score of a word w is the mean confidence-score drop over a set of randomly-masked versions of the input when w is masked out.
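A sketch of this attribution step using the lime package together with a finetuned Hugging Face classifier; the checkpoint path is a placeholder, and the two-class setup assumes an SST-2-style model.

```python
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "path/to/finetuned-sst2-roberta"      # placeholder for a finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

def predict_proba(texts):
    """LIME calls this on perturbed (word-masked) copies of the input;
    it must return one row of class probabilities per text."""
    enc = tokenizer(list(texts), return_tensors="pt", padding=True,
                    truncation=True, max_length=128)
    with torch.no_grad():
        return torch.softmax(model(**enc).logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["negative", "positive"])
exp = explainer.explain_instance("the film 's performances are thrilling .",
                                 predict_proba, num_features=7, num_samples=1000)
print(exp.as_list())    # [(word, signed importance weight), ...]
```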
Experiments We chose to study RoBERTa-based classifiers here because they have the highest GLUE scores among the three BERT variants considered. We observed that 62.5% (RTE) to 79.6% (QNLI) of the dev-r examples were consistently classified into the same, correct labels across all 5 different random shuffles (i.e. 5 different random seeds). We randomly sampled 100 such examples per binary-classification task and computed their LIME attribution maps to compare the similarity between the LIME heatmaps before and after unigrams are randomly misplaced.
Results On CoLA and RTE, the importance of words (i.e. the mean absolute LIME attribution per word) decreased substantially, by 0.036 and 0.019, respectively. That is, individual words become less important after their context is distorted, a behavior expected given that CoLA and RTE have the highest WOS scores (Table 2). In contrast, for the other 4 tasks, word importance only changed marginally (by 0.008, i.e. 4.5× smaller than the 0.036 change in CoLA). That is, except for CoLA and RTE models, the contribution of a word to classification is almost unchanged even after the context of each word is randomly shuffled (Fig. 1a-c). This result suggests that word embeddings after finetuning on GLUE become much less contextual than the pretrained BERT embeddings (Ethayarajh, 2019).

3.4 If not word order, then what do classifiers rely on to make correct predictions?
Given that all non-CoLA models are highly insensitive to word-order randomization, how did they arrive at correct decisions when words are shuffled?
We chose to answer this question for SST-2 and QNLI because they have the lowest WOS scores across all 6 GLUE tasks tested (Table 2) and because they are representative of single-sentence and sequence-pair tasks, respectively.
In addition to the small accuracy drops, the mean confidence scores of all classifiers, reported in parentheses (e.g. "(0.93)"), also changed only marginally after words were shuffled (a vs. b).
3.4.1 SST-2: Salient words are highly predictive of sentence labels
As 84.04% of the SST-2 correct predictions did not change after word shuffling (Table 2b), a natural hypothesis is that the models rely heavily on a few key words to classify an entire sentence.
Figure 3: An original SST-2 dev-set example (S) and its three shuffled versions (S1 to S3) were all correctly labeled "positive" by a RoBERTa-based classifier with high confidence scores (right column):
S: the film 's performances are thrilling . (1.00)
S1: the film thrilling performances are 's . (1.00)
S2: 's thrilling film are performances the . (1.00)
S3: 's thrilling are the performances film . (1.00)
Experiments To test this hypothesis, we took all SST-2 dev-r examples for which all 5 randomly shuffled versions were correctly labeled by a RoBERTa-based classifier (this "5/5" subset is ∼65% of the dev-set). We used LIME to produce a heatmap of word importance for each example.
We identified the polarity of the top-1 most important word (i.e. the word with the highest LIME attribution score) in each example by looking it up in the Opinion Lexicon (Hu and Liu, 2004) of 2,006 positive and 4,783 negative words. ∼57% of these top-1 words were found in the lexicon and labeled either "positive" or "negative" (see Table A3).
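A sketch of this lexicon lookup and the conditional-probability computation that follows; `top1_words` and `labels` are assumed to be lists collected from the LIME step above, and the file names are those of the Hu and Liu (2004) lexicon as commonly distributed.

```python
def load_lexicon(path):
    # Skip the ";"-prefixed header lines of the Opinion Lexicon files.
    with open(path, encoding="latin-1") as f:
        return {line.strip() for line in f if line.strip() and not line.startswith(";")}

positive_lex = load_lexicon("positive-words.txt")   # assumed file name; 2,006 words
negative_lex = load_lexicon("negative-words.txt")   # assumed file name; 4,783 words

def cond_prob(top1_words, labels, lexicon, target_label):
    """P(sentence label == target_label | the top-1 LIME word is in `lexicon`)."""
    hits = [lab for w, lab in zip(top1_words, labels) if w.lower() in lexicon]
    return sum(lab == target_label for lab in hits) / max(len(hits), 1)

# e.g. cond_prob(top1_words, labels, positive_lex, "positive")
#      cond_prob(top1_words, labels, negative_lex, "negative")
```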

Results
We found that if the top-1 word has a positive meaning, then there is a 100% probability that the sentence's label is "positive". For example, the word "thrilling" in a movie review indicates a "positive" sentence (see Fig. 3). Similarly, the conditional probability of a sentence being labeled "negative" given a negative top-1 word is 94.4%. That is, given these statistics, the SST-2 label distribution, and model accuracy, at least 60% of the SST-2 dev-set examples can be correctly predicted from only a single top-1 salient word.
We also reached similar conclusions when experimenting with ALBERT classifiers and the SentiWords dictionary (Gatti et al., 2015) (see Table A3).

3.4.2 Self-attention layers matching similar words in both the question and the answer
For sequence-pair tasks, e.g. QNLI, how can models correctly predict "entailment" when the question words are randomly shuffled (Fig. 4; Q1) or when the question syntax is correct but its meaning changes entirely (Fig. 4; Qs)? We hypothesize that inside the model there might be a self-attention (SA) layer that extracts pairs of similar words that appear in both the question and the answer (e.g. "manage" vs. "managed" in Fig. 4).

Figure 4: QNLI sentence-pair inputs and their LIME attributions (negative -1, neutral 0, positive +1). A RoBERTa-based model's correct prediction of "entailment" on the original input pair (Q, A) remains unchanged when the question is randomly shuffled (Q1 & Q2) or when two random nouns in the question are swapped (Qs). The salient words in the questions, e.g. manage and missions, remain similarly important after their context has been shuffled. Also, the classifier harnessed self-attention to detect the correspondence between similar words that appear in both the question and the answer, e.g. manage (Q) and managed (A). That is, the top-3 pairs of words that were assigned the largest question-to-answer weights in a self-attention matrix (layer 0, head 7) are inside the red, green, and blue rectangles.

Experiments To test this hypothesis, we analyzed the 5,000 QNLI dev-r examples (Table A4) of RoBERTa-based classifiers trained on QNLI. For each example, we identified the one SA matrix (among all 144, as the base model has 12 layers and 12 heads per layer) that assigns the highest weights to pairs of similar words shared between the question and the answer, i.e. excluding intra-question and intra-answer attention weights (see the procedure in Sec. A).
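A sketch of this head search, under the assumptions that (i) the finetuned checkpoint path is a placeholder, (ii) RoBERTa inserts two separator tokens between the paired inputs, and (iii) subword pieces are compared directly without merging them back into words.

```python
import torch
from nltk import edit_distance
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "path/to/finetuned-qnli-roberta"      # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, output_attentions=True).eval()

def best_word_matching_head(question, answer):
    """Return (total edit distance, layer, head, top-3 token pairs) for the head whose
    top-3 question-to-answer attention weights connect the most similar tokens."""
    enc = tokenizer(question, answer, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    with torch.no_grad():
        attn = model(**enc).attentions               # one (1, heads, seq, seq) tensor per layer
    sep = tokens.index(tokenizer.sep_token)
    q_idx = list(range(1, sep))                      # question tokens (skip <s>)
    a_idx = list(range(sep + 2, len(tokens) - 1))    # answer tokens (skip </s></s> ... </s>)
    best = None
    for layer, a in enumerate(attn):
        for head in range(a.shape[1]):
            block = a[0, head][q_idx][:, a_idx]      # question-to-answer weights only
            top = torch.topk(block.flatten(), k=3).indices.tolist()
            pairs = [(tokens[q_idx[i // len(a_idx)]], tokens[a_idx[i % len(a_idx)]])
                     for i in top]
            dist = sum(edit_distance(x.lstrip("Ġ"), y.lstrip("Ġ")) for x, y in pairs)
            if best is None or dist < best[0]:
                best = (dist, layer, head, pairs)
    return best
```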
Results First, in ∼58% of the examples, we found at least three pairs of matching words (i.e. the sum of Levenshtein character-level edit-distances over the 3 pairs is ≤ 4). Second, we found, in total, 15 SA heads (out of 144) that are explicitly tasked with capturing such question-to-answer word correspondence, regardless of word order (see Fig. 4).
Remarkably, 87% of the work of matching similar words that appear in both the QNLI question and the answer was handled by only 3 self-attention heads at (layer, head) locations (0, 7), (1, 9), and (2, 6).
We found consistent results when repeating the same analysis for the other three sequence-pair tasks. That is, interestingly, the three SA heads at exactly the same locations, (0, 7), (1, 9), and (2, 6), account for 76%, 89%, and 83% of the "word-matching" work on QQP, RTE, and MRPC, respectively. This coincidence is likely due to the fact that these classifiers were finetuned for different downstream tasks starting from the same pretrained RoBERTa encoder. See Figs. 1, 4, A3-A4 for qualitative examples from these three tasks.
How important are the 15 word-matching attention heads to QNLI model performance?
We found that zeroing out 15 random heads had almost no effect on correctly-classified predictions, i.e. accuracy dropped marginally (−1% to −3%, Table 3) across different groups of examples. However, ablating the 15 word-matching heads caused the performance to drop substantially, i.e. (a) by 9.6% on the 1,453 "positive" examples identified in Sec. A; (b) by 22.1% on a set of 2,906 random examples including both "positive" and "negative" examples (at a 50/50 ratio); and (c) by 24.5% on the entire QNLI 5,000-example dev-r set. That is, the 15 SA heads that learned to detect similar words play an important role in solving QNLI, enabling at least ∼50% of the correct predictions (Table 3d; accuracy dropped from 100% to 75.54% when the random chance is 50%). In sum, we found overlap between the words in the question and the answer of QNLI examples, and strong evidence that QNLI models harness self-attention to exploit this overlap to make correct decisions in spite of a random word order.
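One way to reproduce such an ablation is to prune the selected heads, which removes their contribution (comparable to zeroing them out). The checkpoint path is a placeholder, and only three of the 15 (layer, head) locations are listed here for brevity.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("path/to/finetuned-qnli-roberta")
# Prune three of the word-matching heads reported above; extend the dict to all 15 heads
# (or to 15 random heads) to reproduce the full ablation comparison.
model.prune_heads({0: [7], 1: [9], 2: [6]})
# ...then re-run the dev-r / dev-s evaluation with the ablated model.
```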
3.5 Does increasing word-order sensitivity lead to higher model performance?
Here, we test whether encouraging BERT representations to be more sensitive to word order (i.e. more syntax-aware) improves model performance on GLUE and SQuAD 2.0 (Rajpurkar et al., 2018). We performed this test on the five GLUE binary-classification tasks (i.e. excluding CoLA because its WOS score is already 0.99; Table 2).
Experiments Inspired by the fact that CoLA models are highly sensitive to word order, we finetuned the pretrained RoBERTa on a synthetic, CoLA-like task first, before finetuning the model on downstream tasks. The synthetic task is to classify a single sentence as "real" vs. "fake", where a fake sentence is formed by taking a real sentence and swapping two random words in it. For every downstream task (e.g. SST-2), we directly used its original training and dev sets to construct a balanced, 2-class, synthetic dataset. After finetuning the pretrained RoBERTa on this synthetic binary classification task, we re-initialized the classification layer (keeping the rest unchanged) and continued finetuning the model on the downstream task.
For both finetuning steps, we trained 5 models per task and followed the standard BERT finetuning procedure (described in Sec. 2.2).
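A sketch of building the synthetic pretext dataset from a task's own sentences: each real sentence (label 1) gets a "fake" counterpart (label 0) with two random words swapped. The function and field names are illustrative.

```python
import random

def make_fake(sentence, rng):
    """Swap two random words in a whitespace-tokenized sentence."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

def build_synthetic(sentences, seed=0):
    """Return a balanced real-vs-fake dataset built from the given sentences."""
    rng = random.Random(seed)
    examples = []
    for s in sentences:
        examples.append({"text": s, "label": 1})                 # real
        examples.append({"text": make_fake(s, rng), "label": 0})  # fake
    return examples
```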
Results After the first finetuning on the synthetic tasks, all models obtained a ∼99% training-set accuracy and a ∼95% dev-set accuracy. After the second finetuning on the downstream tasks, all models were substantially more sensitive to word order than the baseline models (which were only finetuned on the downstream tasks). That is, we repeated the 1-gram shuffling test (Sec. 3.1) and found a ∼1.5 to 2× increase in the WOS scores of all models (Table 4a).
GLUE On the GLUE dev sets, on average over 5 runs, our models outperformed the RoBERTa baseline on all tasks except SST-2 (Table 5). The largest improvement is on RTE (from 72.2% to 73.21% on average, and to 74.73% for the best single model), which is consistent with the fact that RTE has the highest WOS score among non-CoLA tasks (Sec. 3.1).
SQuAD 2.0 Our models also outperformed the RoBERTa baseline on the SQuAD 2.0 dev set, with the highest F1 gain from 80.62% to 81.08% (Table 5). In sum, leveraging the insight that the original BERT-based models are largely word-order invariant, we showed that increasing model sensitivity to word order via a simple extra finetuning step directly improves GLUE and SQuAD 2.0 performance.

Related Work
Pretrained BERT Lin et al. (2019) found that positional information is encoded in the first three layers of BERT base and fades out starting from layer 4. Ettinger (2020) found that BERT relies heavily on word order when predicting missing words in masked sentences from the CPRAG-102 dataset. That is, shuffling words in the context sentence caused the word-prediction accuracy to drop by ∼1.3 to 2×. While the above work studied the pretrained BERT, we instead study BERT-based models finetuned on downstream tasks.

Word-ordering as an objective
In text generation, Elman (1990) found that recurrent neural networks were sensitive to regularities in word order in simple sentences. Language models (Mikolov et al., 2010) with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997) were able to recover the original word order of a sentence from randomly-shuffled words even without any explicit syntactic information (Schmaltz et al., 2016). Wang et al. (2020) also observed an increase in GLUE performance after pretraining BERT with the two additional objectives of word-ordering and sentence-ordering. Their work differs from ours in three points: (1) they did not study the importance of word order alone; (2) the StructBERT improvements were inconsistent across tasks and models (Table 2d), which motivated us to compare word-order importance across GLUE tasks; and (3) we propose to improve model performance by finetuning, not pretraining.

Word-order insensitivity in other NLP tasks
Insensitivity to word order has also been reported in other NLP tasks (Mitchell and Bowers, 2020; Ettinger, 2020). Compared to this prior work, we are the first to perform a word-order analysis on an NLU benchmark and to contrast this sensitivity across tasks.
Humans can also be word-order invariant
A recent human study interestingly showed that sentences with scrambled word orders elicit a response as high as that elicited by the original sentences, as long as the local mutual information among words is high enough (Mollica et al., 2020). Gibson et al. (2013) found that humans can also exhibit word-order-invariance effects, especially when one interpretation is much more semantically plausible. Our work therefore documents an important similarity between humans and advanced NLU models.

Invariance to patch-order in computer vision
In computer vision, the accuracy of state-of-the-art image classifiers was found to only drop marginally when the patches in an image were randomly shuffled (Chen et al., 2020; Zhang and Zhu, 2019).

Discussion and Conclusion
Consistently across three BERT variants and two model sizes, we found that GLUE-trained BERT-based models are often word-order invariant unless the task explicitly requires otherwise (e.g. in CoLA). We present a reflection on the progress of NLU by studying GLUE, a benchmark on which many models have surpassed humans in the past 18 months. As our work suggests, however, these models may use neither syntactic information nor complex reasoning. We revealed how self-attention, a key building block in modern NLP, is used to extract superficial cues to solve sequence-pair GLUE tasks even when words are out of order.
Adversarial NLI We also replicated our shuffling experiments on ANLI (Nie et al., 2020), a task considered challenging for existing models, on which RoBERTa-based models only obtained a 56% accuracy. We found RoBERTa-based models to remain not always sensitive to word-order randomization on ANLI (Table A2; WOS of 0.63), suggesting a common issue across existing benchmarks.
A Self-attention layers that match question-words to similar words in the answer
QNLI models being so insensitive to word shuffling (i.e. 89.4% of the correct predictions remain correct) suggests that inside the finetuned BERT there might be a self-attention (SA) layer that extracts pairs of similar words that appear in both the question and the answer. We started by analyzing all 2,500 "positive" dev-r examples (Table A4) of RoBERTa-based classifiers trained on QNLI, because there were fewer and more consistent ways of labeling a sentence "positive" than of labeling it "negative" (shown in Sec. 3.3).
Experiment There were 1,776 (out of 2,500) examples whose predictions did not change under 5 random shufflings (a.k.a. the 5/5 subset). For each such example, we followed four steps to identify the one SA matrix (among all 144, as the base model has 12 layers and 12 heads per layer) that captures the strongest attention connecting the question and answer words.

1. For each example x, we created its shuffled version x̂ by randomly shuffling the words in the question, and fed x̂ into the classifier.
2. For each SA matrix obtained, we identified the top-3 highest-attention weights that connect the shuffled question tokens and the real answer tokens (i.e. excluding attention weights between question tokens or answer tokens only).
3. For each shuffled example x̂, we identified the one matrix M whose top-3 word pairs are nearest in Levenshtein character-level edit-distance (computed with NLTK). For instance, the distance between manage and managed is 1 (Fig. 4).
4. For each matrix M identified for x̂, we fed the corresponding real example x through the network and re-computed the edit-distance for each of the top-3 word pairs.
Results At step 3, there were 1,590 SA matrices (out of 1,776) whose top-3 SA weights connected three pairs of matching words (i.e. the total edit-distance over the 3 pairs is ≤ 4) that appear in both the shuffled question and the original answer (see example top-3 pairs in Fig. 4). At step 4, this number only dropped slightly, to 1,453 matrices, when the shuffled question was replaced by the original one (see Table A1 for detailed statistics). However, there are only 15 unique RoBERTa self-attention matrices among these 1,453 examples (see Fig. A1). Also at step 4, 83% of the same word pairs remained within the top-3 of the same SA matrices after the question replacement, i.e. 17% of the attention shifted to different pairs, e.g. from ("managed", "manage") to ("it", "it").
First, our results showed that there is a set of 15 self-attention heads explicitly tasked with capturing question-to-answer word correspondence regardless of word order. Second, for ∼58% (i.e. 1,453 / 2,500) of QNLI "positive" examples: (1) there exist ≥ 3 words in the question that can be found in the accompanying answer; and (2) these correspondences are captured by at least one of the 15 SA matrices. We also found similar results for 2,500 "negative" dev-r examples (data not shown).
"unacceptable" 0.96 Figure A2: Each CoLA example contains a single sentence. Here, we shuffled the words in the original sentence (S) five times to create five new sentences (S 1 to S 5 ) and fed them to a RoBERTa-based classifier for predictions. Words that are important for or against the prediction (by LIME Ribeiro et al. 2016) are in orange and blue, respectively. Most of the shuffled examples were classified into "unacceptable" label (i.e. grammatically incorrect) with even higher confidence score than the original ones.
Figure A3: Each MRPC example contains a pair of sentences (A, B). Here, we shuffled the words in the original first sentence (A) to create modified sentences (A1 & A2) and fed them together with the original second sentence (B) to a RoBERTa-based classifier for predictions; the pairs remained labeled "equivalent" with high confidence. Also, the classifier harnessed self-attention to detect the correspondence between similar words that appear in both sentences. That is, the top-3 pairs of words that were assigned the largest cross-sentence weights in a self-attention matrix (layer 0, head 7) are inside the red, green, and blue rectangles.

Figure A4: Each RTE example contains a pair of a premise and a hypothesis (P, H). We shuffled the words in the original hypothesis H to create modified hypotheses (H1 & H2) and fed them together with the original premise (P) to a RoBERTa-based classifier for predictions; the pairs remained labeled "entailment" with high confidence. Also, the classifier harnessed self-attention to detect the correspondence between similar words that appear in both the premise and the hypothesis. That is, the top-3 pairs of words that were assigned the largest premise-to-hypothesis weights in a self-attention matrix (layer 0, head 7) are inside the red, green, and blue rectangles.

Table A6: The dev-set performance of models finetuned from three different BERT "base" variants (12 self-attention layers and 12 heads) and one RoBERTa "large" model (24 self-attention layers and 16 heads) on seven GLUE tasks. These results match either those reported by the original papers, Huggingface (2020), or the GLUE leaderboard.