UnNatural Language Inference

Recent investigations into the inner workings of state-of-the-art large-scale pre-trained Transformer-based Natural Language Understanding (NLU) models indicate that they appear to understand human-like syntax, at least to some extent. We provide novel evidence that complicates this claim: we find that state-of-the-art Natural Language Inference (NLI) models assign the same labels to permuted examples as they do to the originals, i.e., they are invariant to random word-order permutations. This behavior notably differs from that of humans; we struggle to understand the meaning of ungrammatical sentences. To measure the severity of this issue, we propose a suite of metrics and investigate which properties of particular permutations lead models to be word-order invariant. For example, in the MNLI dataset we find that almost all (98.7%) examples contain at least one permutation which elicits the gold label. Models are even able to assign gold labels to permutations that they originally failed to predict correctly. We provide a comprehensive empirical evaluation of this phenomenon, and further show that the issue exists for pre-Transformer RNN- and ConvNet-based encoders, as well as across multiple languages (English and Chinese). Our code and data are available at https://github.com/facebookresearch/unlu.


Premise | Hypothesis | Predicted Label
Boats in daily use lie within feet of the fashionable bars and restaurants. | There are boats close to bars and restaurants. | E
restaurants and use feet of fashionable lie the in Boats within bars daily . | bars restaurants are There and to close boats . | E
He and his associates weren't operating at the level of metaphor. | He and his associates were operating at the level of the metaphor. | C
his at and metaphor the of were He operating associates n't level . | his the and metaphor level the were He at associates operating of . | C

Table 1: Examples from the MNLI Matched development set. Both the original example and the permuted one elicit the same classification label (entailment and contradiction, respectively) from RoBERTa (large). A simple demo is provided in an associated Google Colab notebook.
tion (Hewitt and Manning, 2019; Jawahar et al., 2019; Wu et al., 2020), with their self-attention layers being capable of surprisingly effective learning (Rogers et al., 2020). In this work, we question such claims that current models "know syntax". Since there are many ways to investigate "syntax", we must be clear on what we mean by the term. Knowing the syntax of a sentence means being sensitive to the order of the words in that sentence (among other things). Humans are sensitive to word order, so clearly, "language is not merely a bag of words" (Harris, 1954, p. 156). Moreover, it is easier for us to identify or recall words presented in canonical orders than in disordered, ungrammatical sentences; this phenomenon is called the "sentence superiority effect" (Cattell 1886; Scheerer 1981; Toyota 2001; Baddeley et al. 2009; Grainger 2017, 2019; Wen et al. 2019, i.a.). In our estimation then, if one wants to claim that a model "knows syntax", one should minimally show that the model is sensitive to word order (at least for, e.g., English or Mandarin Chinese).
Generally, knowing the syntax of a sentence is taken to be a prerequisite for understanding what that sentence means (Heim and Kratzer, 1998). Models should have to know the syntax first then, if performing any particular NLU task that genuinely requires a humanlike understanding of meaning (cf. Bender and Koller 2020). Thus, if our models are as good at NLU as our current evaluation methods suggest, we should expect them to be sensitive to word order (see Table 1). We find, based on a suite of permutation metrics, that they are not.
We focus here on textual entailment, one of the hallmark tasks used to measure how well models understand language (Condoravdi et al., 2003; Dagan et al., 2005). This task, often also called Natural Language Inference (NLI; Bowman et al. 2015, i.a.), typically consists of two sentences: a premise and a hypothesis. The objective is to predict whether the premise entails the hypothesis, contradicts it, or is neutral with respect to it. We find rampant word-order insensitivity in purportedly high-performing NLI models. For nearly all premise-hypothesis pairs, there are many permuted examples that fool the models into providing the correct prediction. In the case of MNLI, for example, the current state-of-the-art accuracy of 90.5% can be increased to 98.7% merely by permuting the word order of test set examples. We even find drastically increased cross-dataset generalization when we reorder words. This is not just a matter of chance: we show that the model output probabilities are significantly different from uniform.
We verify our findings with three popular English NLI datasets, SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018b), and ANLI (Nie et al., 2020), and one Chinese dataset, OCNLI (Hu et al., 2020a). It is thus less likely that our findings result from some quirk of English or a particular tokenization strategy. We also observe the effect for various transformer architectures pre-trained on language modeling (BERT, RoBERTa, DistilBERT), and for non-transformers, including a ConvNet, an InferSent model, and a BiLSTM.
Our contributions are as follows: (i) we propose a suite of metrics (Permutation Acceptance) for measuring model insensitivity to word order (§3), (ii) we construct multiple permuted test datasets for measuring NLI model performance at a large scale (§5), (iii) we show that NLI models focus on words more than word order, but can partially reconstruct syntactic information from words alone (§6), (iv) we show the problem persists on out-of-domain data, (v) we show that humans struggle with UnNatural Language Inference, underscoring the non-humanlikeness of SOTA models (§7), and (vi) finally, we explore a simple maximum-entropy-based method (§8) to encourage models not to accept permuted examples.

Related Work
Researchers in NLP have recognized the importance of syntactic structure for neural networks since at least Tabor (1994). An early hand-annotation effort on PASCAL RTE (Dagan et al., 2006) suggested that "syntactic information alone was sufficient to make a judgment" for roughly one third of examples (Vanderwende and Dolan, 2005). Anecdotally, large generative language models like GPT-2 or -3 exhibit a seemingly humanlike ability to generate fluent and grammatical text (Goldberg, 2019; Wolf, 2019). However, the jury is still out as to whether transformers genuinely acquire syntax.
Models appear to struggle with syntax. Several works have cast doubt on the extent to which NLI models in particular know syntax (although each work adopts a slightly different idea of what "knowing syntax" entails). For example, McCoy et al. (2019) argued that the knowledge acquired by models trained on NLI (for at least some popular datasets) is actually not as syntactically sophisticated as it might have initially seemed; some transformer models rely mainly on simpler, non-humanlike heuristics. In general, transformer LM performance has been found to be patchy and variable across linguistic phenomena (Dasgupta et al., 2018; Naik et al., 2018; Ravichander et al., 2019; Jeretic et al., 2020). This is especially true for syntactic phenomena (Marvin and Linzen, 2018; Hu et al., 2020b; Gauthier et al., 2020; McCoy et al., 2020), where transformers are, for some phenomena and settings, worse than RNNs. From another angle, many have explored architectural approaches for increasing a network's sensitivity to syntactic structure (Chen et al., 2017; Li et al., 2020). Williams et al. (2018a) showed that learning jointly to perform NLI and to parse resulted in parse trees that match no popular syntactic formalisms. Furthermore, models trained explicitly to differentiate acceptable sentences from unacceptable ones (i.e., one of the most common syntactic tests used by linguists) have, to date, come nowhere near human performance (Warstadt et al., 2019b).
Insensitivity to Perturbation. Most relatedly, several concurrent works (Pham et al., 2020; Alleman et al., 2021; Gupta et al., 2021; Parthasarathi et al., 2021) investigated the effect of word-order permutations on transformer networks. Pham et al. (2020) is very nearly a proper subset of our work, except that it investigates additional tasks (i.e., from the GLUE benchmark of Wang et al. 2018) and performs a by-layer analysis. Gupta et al. (2021) also relies on the GLUE benchmark, but additionally investigates other types of "destructive" perturbations. Our contribution differs from these works in that we additionally (i) outline theoretically informed predictions for how models should be expected to react to permuted input (we outline a few options), (ii) show that permuting can "flip" an incorrect prediction to a correct one, (iii) show that the problem is not specific to Transformers, (iv) show that the problem persists on out-of-domain data, (v) offer a suite of flexible metrics, and (vi) analyze why models might be accepting permutations (BLEU and POS-tag neighborhood analysis). Finally, we replicate our findings in another language. While our work (and Pham et al.; Gupta et al.) only permutes data during fine-tuning and/or evaluation, Sinha et al. recently explored sensitivity during pre-training and found that models trained on n-gram-permuted sentences perform remarkably close to regular MLM pre-training. In the context of generation, Parthasarathi et al. (2021) crafted linguistically relevant perturbations (on the basis of part-of-speech tagging and dependency parsing) to evaluate whether permutation hinders automatic machine translation models. Relatedly, but not for translation, Alleman et al. (2021) investigated a smaller inventory of perturbations, with emphasis on phrasal boundaries and the effects of n-gram perturbations on different layers of the network.
NLI Models are very sensitive to words. NLI models often over-attend to particular words when predicting the correct answer (Gururangan et al., 2018; Clark et al., 2019). Wallace et al. (2019) show that some short sequences of non-human-readable text can fool many NLU models, including NLI models trained on SNLI, into predicting a specific label. In fact, Ettinger (2020) observed that for one of three test sets, BERT loses some accuracy on word-perturbed sentences, but that there exists a subset of examples for which BERT's accuracy remains intact. If performance isn't affected (or if permutation helps, as we find it does in some cases), it suggests that these state-of-the-art models actually perform somewhat similarly to bag-of-words models (Blei et al., 2003; Mikolov et al., 2013).

Our Approach
As we mentioned, linguists generally take syntactic structure to be necessary for humans to know what sentences mean. Many also find the NLI task to be a very promising approximation of human natural language understanding, in part because it is rooted in the tradition of logical entailment. In the spirit of propositional logic, sentence meaning is taken to be truth-conditional (Frege, 1948; Montague, 1970; Chierchia and McConnell-Ginet, 1990; Heim and Kratzer, 1998). That is to say, understanding a sentence is equivalent to knowing the actual conditions of the world under which the sentence would be (judged) true (Wittgenstein, 1922). If grammatical sentences are required for sentential inference, as per a truth-conditional approach (Montague, 1970), then permuted sentences should be meaningless. Put another way, the meanings of highly permuted sentences (if they exist) are not propositions, and thus those sentences have no truth conditions. Only from the truth conditions of sentences can we tell whether one sentence entails another. In short, the textual entailment task is technically undefined in our "unnatural" setting.
Since existing definitions don't immediately extend to UnNatural Language Inference (UNLI), we outline several hypothetical, systematic ways that a model might perform had it been sensitive to word order. We hypothesize two models that operate on the first principles of NLI, and one that doesn't. In the first case, Model A deems permuted sentences meaningless (devoid of truth values), as formal semantic theories of human language would predict. Thus, it assigns "neutral" to every permuted example. Next, Model B does not deem permuted sentences meaningless, and attempts to understand them. Humans find understanding permuted sentences difficult (see our human evaluations in §7). Model B could similarly struggle to decipher the meaning, and simply sample labels uniformly for each example (i.e., assign equal probability mass to each label). Finally, we hypothesize a non-systematic model, Model C, which treats permuted sentences as though they weren't permuted at all. This model could operate similarly to a bag-of-words (BOW) model, and thus always assign the same label to permuted examples as it would to the un-permuted originals. If the model fails to assign the original gold label to an unpermuted example, it will also fail to assign the gold label to its permutations; it will never achieve higher accuracy on permuted examples than on unpermuted ones.
We find in our experiments that state-of-the-art Transformer-based NLI models (as well as the pre-Transformer class of models) do not perform like any of the above hypothetical models. They perform closest to Model C, but are, in some cases, actually able to achieve higher accuracy on permuted examples. To quantitatively describe this behaviour, we introduce our suite of Permutation Acceptance metrics, which enable us to quantify how accepting models are of permuted sentences.

Methods
Constructing the permuted dataset. For a given dataset D with splits D_train and D_test, we first train an NLI model M on D_train to achieve accuracy comparable to what was reported in the original papers. We then construct a randomized version of D_test, which we term D̃_test, as follows: for each example (p_i, h_i, y_i) ∈ D_test (where p_i and h_i are the premise and hypothesis sentences of the example and y_i is the gold label), we use a permutation operator F that returns a list (P̂_i, Ĥ_i) of q permuted sentences (p̂_i and ĥ_i), where q is a hyperparameter. F permutes all positions of the words in a given sentence (i.e., either premise or hypothesis) with the restriction that no word maintains its original position. In our initial setting, we do not explicitly control the placement of words relative to their original neighbors, but we analyze clumping effects in §5. D̃_test thus consists of |D_test| × q examples, with q different permutations of hypothesis and premise for each original test example pair. If a sentence S (e.g., h_i) contains w words, the total number of available permutations of S is (w − 1)!, so the output of F is a list of q permutations sampled from these (w − 1)! possibilities. For us, the space of possible outputs is larger still, since we permute p_i and h_i separately (and ignore examples for which any |S| ≤ 5).
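The permutation operator F can be sketched as follows (a minimal illustration, assuming whitespace word-level tokenization; the paper's actual implementation may differ in tokenization and sampling details):

```python
import random

def derangement(n, rng):
    """Random permutation of range(n) with no fixed points,
    found by rejection sampling (fast for sentence-length n)."""
    assert n >= 2
    while True:
        idx = list(range(n))
        rng.shuffle(idx)
        if all(i != j for i, j in enumerate(idx)):
            return idx

def permute_q(sentence, q=100, seed=0):
    """Sketch of F: return q word-order permutations of `sentence`
    in which no word keeps its original position."""
    rng = random.Random(seed)
    words = sentence.split()
    return [" ".join(words[j] for j in derangement(len(words), rng))
            for _ in range(q)]
```

Rejection sampling terminates quickly here, since a random shuffle of n ≥ 2 items is fixed-point-free with probability approaching 1/e.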
Defining Permutation Acceptance. The choice of q allows us to take a statistical view of the predictability of a model on permuted sentences. To that end, we define the following notational conventions. Let A be the original accuracy of a given model M on a dataset D, and let c be the number of examples in a dataset marked correct according to the standard formulation of accuracy on the original dataset (i.e., assigned the ground-truth label). Typically A is given by c/|D_test| or c/|D_dev|. Let Pr^cor_M(P̂_i, Ĥ_i) be the percentage of the q permutations of an example (p_i, h_i) that M assigns the ground-truth label y_i:

Pr^cor_M(P̂_i, Ĥ_i) = (1/q) Σ_{j=1}^{q} 1[M(p̂_j, ĥ_j) = y_i]  (1)

To get an overall summary score, we let Ω_x be the percentage of examples (p_i, h_i) ∈ D_test for which Pr^cor_M(P̂_i, Ĥ_i) exceeds a predetermined threshold 0 < x < 1. Concretely, a given example counts as correct according to Ω_x if more than x percent of its permutations (P̂_i and Ĥ_i) are assigned y_i by the model M. Mathematically,

Ω_x = (1/|D_test|) Σ_{i} 1[Pr^cor_M(P̂_i, Ĥ_i) > x]  (2)

There are two specific cases of Ω_x that we are most interested in. First, we define Ω_max, or Maximum Accuracy, where x = 1/|D_test|. In short, Ω_max gives the percentage of examples (p_i, h_i) ∈ D_test for which there is at least one permutation (p̂_j, ĥ_j) that model M assigns the gold label y_i. Second, we define Ω_rand, or Random Baseline Accuracy, where x = 1/m, the chance probability for balanced m-way classification (m = 3 in NLI). This metric is more stringent than Ω_max, as it counts an example only if at least one third of its permutations are assigned the gold label (hence it provides a lower bound on Ω_max). See Figure 1 for a graphical representation of Ω_x.
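Given per-example model predictions on the q permutations, the Permutation Acceptance metrics reduce to a few lines (a sketch; the label encoding and data layout are illustrative assumptions):

```python
def pr_correct(perm_preds, gold):
    """Pr^cor_M: fraction of one example's q permutations that
    the model labels with the gold label."""
    return sum(p == gold for p in perm_preds) / len(perm_preds)

def omega(all_perm_preds, golds, x):
    """Omega_x: percentage of examples whose Pr^cor exceeds x."""
    hits = sum(pr_correct(preds, y) > x
               for preds, y in zip(all_perm_preds, golds))
    return 100 * hits / len(golds)
```

Here Ω_max corresponds to a threshold just below 1/q (at least one accepted permutation), and Ω_rand to x = 1/3 for balanced three-way NLI.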
We also define D_f to be the list of examples originally marked incorrect according to A but now deemed correct according to Ω_max, and D_c to be the list of examples originally marked correct according to A. Thus, we should expect |D_f| < |D_c| for models that have high accuracy. Additionally, we define P_c and P_f as the dataset-average percentage of permutations which predicted the gold label when the examples were originally correct (D_c) and when the examples were originally incorrect as per A, hence flipped (D_f), respectively.
Note that for a classic BOW model, P_c = 100 and P_f = 0, because it would rely on the words alone (not their order) to make its classification decision. Since permuting removes no words, BOW models should come to the same decisions for permuted examples as for the originals.

Results
We present results for two types of models: (a) Transformer-based models (RoBERTa, BART, DistilBERT) and (b) pre-Transformer models, including InferSent, a ConvNet (Zhao et al., 2015), and a BiLSTM. We train all models on MNLI, and evaluate on in-distribution (SNLI and MNLI) and out-of-distribution (ANLI) datasets. We independently verify the results of (a) using our own models fine-tuned with HuggingFace Transformers. We observe very high Ω_max: for almost all examples in D_test, there exists at least one permutation for which model M predicts the gold label. We also observe a high Ω_rand of 79.4%, showing that there are many examples for which the models outperform even a random baseline in accepting permuted sentences (see Appendix E for more Ω values). Evaluating out-of-domain generalization on the ANLI dataset splits yields an Ω_max notably higher than A (89.7% Ω_max for RoBERTa compared to 45.6% A). As a consequence, we encounter many flips, i.e., examples where the model is unable to predict the gold label, but at least one permutation of that example elicits it. However, recall that this analysis assumes we know the gold label up front, so the test can be thought of as running a word-order probe on the model until it predicts the gold label (or gives up after exhausting our set of q permutations). For out-of-domain generalization, Ω_rand decreases considerably (36.4% Ω_rand on A1), which means fewer permutations are accepted by the model. Next, recall that a classic bag-of-words model would have P_c = 100 and P_f = 0. No model performs strictly like a classic bag of words, although they do perform somewhat BOW-like (P_c >> P_f for all test splits, Figure 5). We find this BOW-likeness to be higher for certain non-Transformer models (InferSent), as they exhibit higher P_c (84.2% for InferSent compared to 70.7% for RoBERTa on MNLI).
Models are very confident. The phenomenon we observe would be of less concern if correct label prediction were just an outcome of chance, which could occur when the entropy of the log probabilities of the model output is high (suggesting near-uniform probabilities over the entailment, neutral, and contradiction labels; recall Model B from §3). We first investigate the model probabilities for the Transformer-based models on the permutations that lead to the correct answer (Figure 2). We find overwhelming evidence that model confidences on in-distribution datasets (MNLI, SNLI) are highly skewed, resulting in low entropy, though skewness varies among model types. BART proves to be the most skewed Transformer-based model. This skewness is not a property of model capacity: we observe that DistilBERT's log probabilities have similar skewness to those of the RoBERTa (large) model, while DistilBERT exhibits lower A, Ω_max, and Ω_rand.
For non-Transformers, whose accuracy A is lower, the Ω_max achieved is also predictably lower. We observe roughly the same relative performance in terms of Ω_max (Figure 5 and Appendix Table 2) and average entropy (Figure 2). However, comparing the averaged entropy of the model predictions, it is clear that there is some benefit to being a worse model: non-Transformer models are not as overconfident on randomized sentences as Transformers are. The high confidence of Transformer models can be attributed to the overthinking phenomenon commonly observed in deep neural networks (Kaya et al., 2019) and BERT-based models (Zhou et al., 2020).
Similar artifacts in Chinese NLU. We extended the experiments to the Original Chinese NLI dataset (Hu et al., 2020a, OCNLI), and reused the pre-trained RoBERTa-Large and InferSent (non-Transformer) models on OCNLI. Our findings are similar to the English results (Table 3), thereby suggesting that the phenomenon is not just an artifact of English text or tokenization.
Other Results. We investigated the effect of sentence length (which correlates with the number of possible permutations; Appendix A), and hypothesis-only randomization (models exhibit a similar phenomenon even when only the hypothesis is permuted; Appendix C).

Analyzing Syntactic Structure Associated with Tokens
A natural question to ask following our findings: what is it about particular permutations that leads models to accept them?

Figure 3: BLEU-2 score versus acceptability of permuted sentences across all test datasets. RoBERTa and BART performance is similar but differs considerably from the performance of non-Transformer-based models, such as InferSent and ConvNet.

Since the permutation operation is drastic and only rarely preserves local word relations, we first investigate whether there exists a relationship between Permutation Acceptance scores and local word-order preservation. Concretely, we compare bi-gram word overlap (BLEU-2) with the percentage of permutations that are deemed correct (Figure 3). Although the probability of a permuted sentence being predicted correctly does appear to track the BLEU-2 score (Figure 3), the percentage of examples assigned the gold label by the Transformer-based models is still higher than we would expect from permutations with low BLEU-2 (66% for the lowest BLEU-2 range of 0-0.15), suggesting that preserved relative word order alone cannot explain the high permutation acceptance rates. Thus, local order preservation does correlate with Permutation Acceptance, but it doesn't fully explain the high Permutation Acceptance scores. We now further ask whether Ω is related to a more abstract measure of local word relations, i.e., part-of-speech (POS) neighborhood.
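The local-order statistic can be approximated in a few lines (a simplified clipped bigram precision, not the full BLEU-2 with its unigram component and brevity penalty, but enough to rank how much local order a permutation preserves):

```python
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def bleu2_precision(permuted, original):
    """Clipped bigram precision of a permuted token list against
    the original -- a stand-in for BLEU-2 on one sentence pair."""
    cand, ref = Counter(bigrams(permuted)), Counter(bigrams(original))
    hits = sum(min(c, ref[g]) for g, c in cand.items())
    return hits / max(1, sum(cand.values()))
```

An unpermuted sentence scores 1.0, and a permutation that breaks every original bigram scores 0.0.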
Many syntactic formalisms, like Lexical Functional Grammar (Kaplan and Bresnan, 1995; Bresnan et al., 2015, LFG), Head-driven Phrase Structure Grammar (Pollard and Sag, 1994, HPSG), or Lexicalized Tree Adjoining Grammar (Schabes et al., 1988; Abeille, 1990, LTAG), are "lexicalized": individual words or morphemes bear syntactic features telling us which other words they can combine with. For example, "buy" could be associated with (at least) two lexicalized syntactic structures, one containing two noun phrases (as in Kim bought cheese), and another with three (as in Lee bought Logan cheese). We speculate that our NLI models might accept permuted examples at high rates because they are (perhaps noisily) reconstructing the original sentence from abstract, word-anchored information about common neighbors.
To test this, we POS-tagged D_train using the 17 Universal Part-of-Speech tags (using spaCy; Honnibal et al. 2020). For each word w_i ∈ S_i, we compute the occurrence probability of POS tags on tokens in the neighborhood of w_i, where the neighborhood is specified by a radius r (a symmetrical window of r tokens to the left and right of w_i ∈ S_i). We denote this sentence-level probability of neighboring POS tags for a word w_i as ψ^r_{w_i,S_i} ∈ R^17 (see the example in Figure 7 in the Appendix). Sentence-level POS-neighbor scores can be averaged across D_train to get a type-level score ψ^r_{w_i,D_train} ∈ R^17 for all w_i ∈ D_train. Then, for a sentence S_i ∈ D_test and each word w_i ∈ S_i, we compute a POS minitree overlap score:

β^k_{w_i,S_i} = (1/k) |top-k(ψ^r_{w_i,S_i}) ∩ top-k(ψ^r_{w_i,D_train})|  (3)

Concretely, β^k_{w_i,S_i} computes the overlap of the top-k POS tags in the neighborhood of a word w_i in S_i with that of the train statistic. If a word has the same minitree in a given sentence as it has in the training set, the overlap is 1. For a given sentence S_i, the aggregate β^k_{S_i} is defined as the average of the overlap scores of all its words:

β^k_{S_i} = (1/|S_i|) Σ_{w_i ∈ S_i} β^k_{w_i,S_i}  (4)

and we call it a POS minitree signature. We can likewise compute the POS minitree signature β^k_{Ŝ_i} of a permuted sentence Ŝ_i. If the permuted sentence's POS signature comes close to that of the true sentence, their ratio (i.e., β^k_{Ŝ_i} / β^k_{S_i}) will be close to 1. Also, since the POS signature is computed with respect to the train distribution, a ratio > 1 indicates that the permuted sentence is closer to the overall train statistic than the original unpermuted sentence is, in terms of POS signature. If high overlap with the training distribution correlates with the percentage of permutations deemed correct, then our models treat words as if they project syntactic minitrees. We investigate the relationship between the percentage of permuted sentences accepted and β^k in Figure 4.
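The ψ and β computations can be sketched as follows (illustrative code over toy (word, POS) pairs; the paper's version uses spaCy tags over the full training set):

```python
from collections import defaultdict, Counter

def neighbor_pos_counts(tagged_sents, radius=2):
    """psi (sketch): for each word type, counts of POS tags occurring
    within `radius` tokens, aggregated over (word, tag) sentences."""
    psi = defaultdict(Counter)
    for sent in tagged_sents:
        for i, (word, _) in enumerate(sent):
            for j in range(max(0, i - radius),
                           min(len(sent), i + radius + 1)):
                if j != i:
                    psi[word][sent[j][1]] += 1
    return psi

def minitree_overlap(psi_sent, psi_train, k=2):
    """beta^k (sketch): fraction of the top-k neighborhood POS tags
    shared between a sentence-level and a train-level signature."""
    top_s = {t for t, _ in psi_sent.most_common(k)}
    top_t = {t for t, _ in psi_train.most_common(k)}
    return len(top_s & top_t) / k
```

A word whose in-sentence neighborhood matches its training-set neighborhood scores 1.0; a word in an entirely unfamiliar POS context scores 0.0.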
We observe that the POS-tag minitree hypothesis holds for the Transformer-based models RoBERTa, BART, and DistilBERT: the percentage of accepted pairs increases as sentences have higher overlap with the un-permuted sentence in terms of POS signature. For non-Transformer models such as InferSent, ConvNet, and BiLSTM, the percentage of correct permutations remains flat or decreases as the POS signature ratio grows, suggesting that the reasoning process employed by these models does not preserve local abstract syntactic structure (i.e., POS neighbor relations).

Human Evaluation
We expect humans to struggle with UNLI, given our intuitions and the sentence superiority findings (but see Mollica et al. 2020). Indeed, our expert annotators performed much worse than RoBERTa (Table 4), although their accuracy was a bit higher than random. We also find that, for both experts, accuracy on permutations from D_c was higher than on D_f, which corroborates findings that high word overlap can give hints about the ground-truth label (Dasgupta et al., 2018; Poliak et al., 2018; Gururangan et al., 2018; Naik et al., 2019).

Training by Maximizing Entropy
We propose an initial attempt to mitigate the models' tendency to predict correctly on permuted examples. As we observed in §5, model entropy on permuted examples is significantly lower than expected. Neural networks tend to output higher-than-random confidence even for unknown inputs (Gandhi and Lake, 2020), which might be an underlying cause of the high Permutation Acceptance. An ideal model would be ambivalent about randomized, ungrammatical sentences. Thus, we train NLI models that bake in the principle of mutual exclusivity (Gandhi and Lake, 2020) by maximizing model entropy. Concretely, we fine-tune RoBERTa on MNLI while maximizing the entropy (H) on a subset of n randomized examples (p̂_i, ĥ_i) for each example (p, h) in MNLI. We modify the loss function as follows:

L = L_CE(M(p, h), y) − (1/n) Σ_{i=1}^{n} H(M(p̂_i, ĥ_i))

Using this maximum-entropy method (n = 1), we find that the model improves considerably in its robustness to randomized sentences, while taking no hit to accuracy (Table 5). We observe that no model reaches an Ω_max score close to 0, suggesting further room to explore other methods for decreasing models' Permutation Acceptance. Similar approaches have proven useful for other tasks as well (Gupta et al., 2021).
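The modified objective can be sketched in plain Python at the logit level (a sketch only: the weighting λ is an assumed hyperparameter not taken from the paper, and real training applies this to the model's batched outputs):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def max_entropy_loss(clean_logits, gold, permuted_logits, lam=1.0):
    """Cross-entropy on the clean pair minus lam times the mean
    output entropy on its n permuted versions, so minimizing the
    loss pushes permuted outputs toward uniform."""
    ce = -math.log(softmax(clean_logits)[gold])
    avg_h = sum(entropy(softmax(z))
                for z in permuted_logits) / len(permuted_logits)
    return ce - lam * avg_h
```

Confident (low-entropy) outputs on permuted inputs raise the loss, while a uniform output over the three labels lowers it by the maximum amount, log 3.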

Future Work & Conclusion
We show that state-of-the-art models do not rely on sentence structure the way we think they should: NLI models (Transformer-based models, RNNs, and ConvNets) are largely insensitive to permutations of word order that corrupt the original syntax. We also show that reordering words can cause models to flip classification labels. We do find that models seem to have learned some syntactic information, as evidenced by a correlation between the preservation of abstract POS neighborhood information and the rate of acceptance by models, but these results do not discount the high rates of Permutation Acceptance, and they require further verification. Coupled with the finding that humans cannot perform UNLI at all well, the high rate of permutation acceptance that we observe leads us to conclude that current models do not yet "know syntax" in the fully systematic and humanlike way we would like them to. A few years ago, Manning (2015) encouraged NLP to consider "the details of human language, how it is learned, processed, and how it changes, rather than just chasing state-of-the-art numbers on a benchmark task." We expand upon this view and suggest one particular future direction: we should train models not only to do well on clean test data, but also not to overgeneralize to corrupted input.

A Effect of Length on Permutation Acceptance
We investigate the effect of length on Permutation Acceptance in Figure 6. We observe that shorter sentences in general have a somewhat higher probability of acceptance for examples which were originally predicted correctly, since shorter sentences have fewer unique permutations. However, for examples which were originally incorrect, the trend is not present.

B Example of POS Minitree
In §6, we developed a POS signature for each word appearing in at least one example of a test set, and compared that signature to the distribution of the same word in the training set. Figure 7 provides a snapshot of the word "river" from the test set and shows how the POS signature distribution of the word in a particular example matches that of the aggregated training statistic. In practice, we select the top k POS tags for the word in the test signature as well as in the train signature, and calculate their overlap. When comparing model performance on permuted sentences, we compute a ratio between the original test overlap score and an overlap score calculated instead from the permuted test. In Figure 7, "river" would have a POS-tag minitree score of 0.75.

C Effect of Hypothesis-Only Randomization
In recent years, the impact of the hypothesis sentence (Gururangan et al., 2018; Tsuchiya, 2018; Poliak et al., 2018) on NLI classification has been a topic of much interest. As we discussed in §3, logical entailment is only defined for pairs of propositions. We investigate the setting where we randomize only the hypothesis sentences while keeping the premise intact. Figure 9(a) shows that the Ω_max value is almost the same for the two schemes; randomizing the hypothesis alone also leads the model to accept many permutations.

D Effect of clumped words in random permutations
Since our original permuted dataset consists of extremely randomized word orders, we observe very low BLEU-3 (< 0.2) and BLEU-4 (< 0.1) scores. To study the effect of overlap across a wider range of permutations, we devised an experiment in which we clump certain words together before performing random permutations. Concretely, we clump 25%, 50%, or 75% of the words in a sentence and then permute the clumped units as wholes along with the remaining words. This type of clumped permutation allows us to study the full range of BLEU-2/3/4 scores, which we present in Figure 10. As expected, the acceptability of permuted sentences increases linearly with BLEU score overlap.
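A sketch of the clumping procedure (freezing one contiguous span per sentence is an illustrative assumption; the paper may clump several spans):

```python
import random

def clumped_permute(words, clump_frac=0.5, seed=0):
    """Freeze a contiguous span covering `clump_frac` of the tokens,
    then shuffle it as a single unit together with the remaining
    words, so a higher clump_frac preserves more local n-grams."""
    rng = random.Random(seed)
    n = max(1, int(len(words) * clump_frac))
    start = rng.randrange(len(words) - n + 1)
    units = ([[w] for w in words[:start]]
             + [words[start:start + n]]          # the frozen clump
             + [[w] for w in words[start + n:]])
    rng.shuffle(units)
    return [w for unit in units for w in unit]
```

Because the clump survives the shuffle intact, its internal bigrams (and higher n-grams) are guaranteed to appear in the output, raising the BLEU-2/3/4 overlap with the original.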

E Effect of the threshold of Ω x in various test splits
We defined two variations of Ω_x, Ω_max and Ω_rand, but in principle any threshold percentage x can be used to evaluate the unnatural language inference behavior of different models. In Figure 8 we show the effect of different thresholds, including Ω_max where x = 1/|D_test| and Ω_rand where x = 0.34. For in-distribution datasets (top row, MNLI and SNLI splits), even in the extreme setting where x = 1.0, more than 10% of examples still qualify, and more than 25% in the case of InferSent and DistilBERT. For out-of-distribution datasets (bottom row, ANLI splits) we observe much lower values, suggesting that generalization itself is the bottleneck in permuted sentence understanding.

F Training with permuted examples
In this section, we hypothesize that if NLU models are mostly insensitive to word order, then training on permuted examples should yield the same or comparable accuracy as training on grammatically correct data (i.e., the standard setup). To test this, we train Transformer-based models on D̃_train, which is computed by applying F to each example of D_train with q = 1. This ensures a controlled setting in which we keep the same amount of training data as the standard setup (so that models do not benefit from data augmentation). We also use the same hyperparameters during training as in the standard setup. Concretely, D̃_train consists of n hypothesis-premise pairs from the MNLI training data, where each example is a permuted version of the original pair.

Table 6: Statistics for Transformer-based models when trained on the permuted MNLI corpus. We compare the accuracy of models trained on unpermuted data (A) and on permuted data (Â). We use the original test sets during inference.
We present the results of such training in Table 6, comparing the resulting accuracy (Â) with that of the standard setup (A). Note that during inference, for all models, we use the un-permuted examples. As we can see, models perform surprisingly close to the original accuracy A even when trained on ungrammatical sentences. This provides further evidence of the BOW-like nature of NLU models.

Figure 9: Comparing the effect of randomizing both premise and hypothesis versus only the hypothesis for two Transformer-based models, RoBERTa and BART (for more comparisons, please refer to the Appendix). In 9(a), we observe that the difference in Ω_max is marginal on in-distribution datasets (SNLI, MNLI), while hypothesis-only randomization is worse on out-of-distribution datasets (ANLI). In 9(b), we compare the mean number of permutations which elicited a correct response; naturally, hypothesis-only randomization yields a higher percentage of correct permutations.

G Reproducibility Checklist
As per the prescribed Reproducibility Checklist, we provide the following information:
• A clear description of the mathematical setting, algorithm and/or model: We provide details of the models used in §5.
• Description of the computing infrastructure used: We used 8 NVIDIA V100 32GB GPUs to train the models and perform all necessary inference. We did not run hyperparameter tuning for Transformer-based models, as we used the published hyperparameters from the original models.
• Average runtime for each approach: On average, each model inference experiment, consisting of 100 permutations per example, takes roughly 1 hour to complete.
• Relevant statistics of the datasets used: We provide the statistics of the datasets used in Table 7.
• Explanation of any data that were excluded, and all pre-processing steps: We exclude examples where either the hypothesis or the premise consists of fewer than 6 tokens. This ensures that we have 100 unique permutations for each example.
• Link to downloadable version of data and code: We provide downloadable version of our data and code at https://github.com/facebookresearch/unlu.