Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high accuracy after fine-tuning on many downstream tasks—including tasks specifically designed to be challenging for models that ignore word order. Our models perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.


Introduction
The field of natural language processing (NLP) has become dominated by the pretrain-and-finetune paradigm, where we first obtain a good parametric prior in order to subsequently model downstream tasks accurately. In particular, masked language model (MLM) pre-training, as epitomized by BERT (Devlin et al., 2019), has proven wildly successful, although the precise reason for this success has remained unclear. On one hand, we can view BERT as the newest in a long line of NLP techniques (Deerwester et al., 1990; Landauer and Dumais, 1997; Collobert and Weston, 2008; Mikolov et al., 2013; Peters et al., 2018) that exploit the well-known distributional hypothesis (Harris, 1954). 1 On the other hand, it has been claimed that BERT "rediscovers the classical NLP pipeline" (Tenney et al., 2019), suggesting that it has learned "the types of syntactic and semantic abstractions traditionally believed necessary for language processing" rather than "simply modeling complex co-occurrence statistics" (ibid., p. 1).

1 One might even argue that BERT is not actually all that different from earlier distributional models like word2vec (Mikolov et al., 2013); see Appendix A.
In this work, we aim to uncover how much of MLM's success comes from learning simple distributional information, as opposed to grammatical abstractions (Tenney et al., 2019;Manning et al., 2020). We disentangle these two hypotheses by measuring the effect of removing word order information during pre-training: any sophisticated (English) NLP pipeline would presumably depend on the syntactic information conveyed by the order of words. Surprisingly, we find that most of MLM's high performance can in fact be explained by the "distributional prior" rather than its ability to replicate the classical NLP pipeline.
Concretely, we pre-train MLMs (RoBERTa, Liu et al. 2019) on various corpora with permuted word order while preserving some degree of distributional information, and examine their downstream performance. We also experiment with training MLMs without positional embeddings, making them entirely order agnostic, and with training on a corpus sampled from the source corpus's unigram distribution. We then evaluate these "permuted" models in a wide range of settings and compare with regularly-pre-trained models.
We demonstrate that pre-training on permuted data has surprisingly little effect on downstream task performance after fine-tuning (on non-shuffled training data). It has recently been found that MLMs are quite robust to permuting downstream test data (Sinha et al., 2021; Pham et al., 2020; Gupta et al., 2021) and even do quite well using permuted "unnatural" downstream train data (Sinha et al., 2021; Gupta et al., 2021). Here, we show that downstream performance for "unnatural language pre-training" is much closer to standard MLM pre-training than one might expect.
In an effort to shed light on these findings, we experiment with various probing tasks. We verify via non-parametric probes that the permutations do in fact make the model worse at syntax-dependent tasks. However, just like on the downstream fine-tuning tasks, permuted models perform well on parametric syntactic probes, in some cases almost matching the unpermuted model's performance, which is quite surprising given how important word order is cross-linguistically (Greenberg 1963; Dryer 1992; Cinque 1999, i.a.).
Our results can be interpreted in different ways. One could argue that our downstream and probing tasks are flawed, and that we need to examine models with examples that truly test strong generalization and compositionality. Alternatively, one could argue that prior works have overstated the dependence of human language understanding on word order, and that human language understanding depends less on the structure of the sentence and more on the structure of the world, which can be inferred to a large extent from distributional information. This work is meant to deepen our understanding of MLM pre-training and, through this, move us closer to finding out what is actually required for adequately modelling natural language.

Related Work
Sensitivity to word order in NLU. Information order has been a topic of research in computational linguistics since Barzilay and Lee (2004) introduced the task of ranking sentence orders as an evaluation for language generation quality, an approach which was subsequently also used to evaluate readability and dialogue coherence (Barzilay and Lapata, 2008; Laban et al., 2021).
More recently, several research groups have investigated information order for words rather than sentences as an evaluation of model human-likeness. Sinha et al. (2021) investigate the task of natural language inference (NLI) and find high accuracy on permuted examples for different Transformer and pre-Transformer era models, across English and Chinese datasets (Hu et al., 2020). Gupta et al. (2021) use targeted permutations on RoBERTa-based models and show word order insensitivity across natural language inference (MNLI), paraphrase detection (QQP) and sentiment analysis (SST-2) tasks. Pham et al. (2020) show insensitivity on a larger set of tasks, including the entire GLUE benchmark, and find that certain tasks in GLUE, such as CoLA and RTE, are more sensitive to permutations than others. Ettinger (2020) recently observed that BERT accuracy decreases for some word order perturbed examples, but not for others. In all these prior works, models were given access to normal word order at (pre-)training time, but not at test time or (sometimes) fine-tuning time. It was not clear whether the model acquires enough information about word order during the fine-tuning step, or whether it is ingrained in the pre-trained model. In this work, we take these investigations a step further: we show that the word order information needed for downstream tasks does not need to be provided to the model during pre-training. Since models can learn whatever word order information they do need largely from fine-tuning alone, this likely suggests that our downstream tasks don't actually require much complex word order information in the first place (cf. Glavaš and Vulić 2021).
Randomization ablations. Random controls have been explored in a variety of prior work. Wieting and Kiela (2019) show that random sentence encoders are surprisingly powerful baselines. Gauthier and Levy (2019) use random sentence reordering to label some tasks as "syntax-light", making them more easily decodable from images of the brain. Shen et al. (2021) show that entire layers of MLM transformers can be randomly initialized and kept frozen throughout training without detrimental effect, and that those layers perform better on some probing tasks than their trained counterparts. Models have been found to be surprisingly robust to randomizing or cutting syntactic tree structures they were hoped to rely on (Scheible and Schütze, 2013; Williams et al., 2018a), and randomly permuting attention weights often induces only minimal changes in output (Jain and Wallace, 2019). In computer vision, it is well known that certain architectures constitute good "deep image priors" for fine-tuning (Ulyanov et al., 2018) or pruning (Frankle et al., 2020), and that even randomly wired networks can perform well at image recognition (Xie et al., 2019). Here, we explore randomizing the data, rather than the model, to assess whether certain claims about which phenomena the model has learned are in fact well founded.
Synthetic pre-training. Kataoka et al. (2020) found that pre-training on synthetically generated fractals for image classification is a very strong prior for subsequent fine-tuning on real image data. In language modeling, Papadimitriou and Jurafsky (2020) show that training LSTMs (Hochreiter and Schmidhuber, 1997) on non-linguistic data with latent structure, such as MIDI music or Java code, provides better test performance on downstream tasks than a randomly initialized model. They observe that even when there is no vocabulary overlap between source and target languages, LSTM language models leverage the latent hierarchical structure of the input to obtain better performance than on a random, Zipfian corpus of the same vocabulary.

On the utility of probing tasks. Many recent papers provide compelling evidence that BERT contains a surprising amount of syntax, semantics, and world knowledge (Giulianelli et al., 2018; Rogers et al., 2020; Lakretz et al., 2019; Jumelet et al., 2019, 2021). Many of these works involve diagnostic classifiers (Hupkes et al., 2018) or parametric probes, i.e. a function atop learned representations that is optimized to find linguistic information. How well the probe learns a given signal can be seen as a proxy for the linguistic knowledge encoded in the representations. However, the community is divided on many aspects of probing (Belinkov, 2021), including how complex probes should be. Many prefer simple linear probes over complex ones (Alain and Bengio, 2017; Hewitt and Manning, 2019; Hall Maudslay et al., 2020). However, complex probes with strong representational capacity are able to extract the most information from representations (Voita and Titov, 2020; Pimentel et al., 2020b; Hall Maudslay et al., 2020). Here, we follow Pimentel et al. (2020a) and use both simple (linear) and complex (non-linear) models, as well as "complex" tasks (dependency parsing).
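As a schematic illustration of the parametric-probing idea described above, a probe is just a small classifier fit on frozen representations. The sketch below uses toy features and a hypothetical binary linguistic label rather than actual MLM hidden states:

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Fit a linear probe (logistic regression via plain gradient descent)
    on frozen representations X to predict a binary linguistic label y.
    How well it fits is then read as a proxy for information in X."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        grad = p - y                             # dLoss/dlogits for log-loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy demonstration: a "linguistic property" perfectly encoded in feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))       # stand-in for frozen representations
y = (X[:, 0] > 0).astype(float)     # hypothetical binary label
w, b = train_linear_probe(X, y)
accuracy = (((X @ w + b) > 0).astype(float) == y).mean()
```

A non-linear (MLP) probe replaces the single linear map with a small multi-layer network; the debate referenced above is precisely about how much capacity such a probe should be allowed.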
As an alternative to parametric probes, stimulus-based non-parametric probing (Linzen et al., 2016; Jumelet and Hupkes, 2018; Marvin and Linzen, 2018; Gulordava et al., 2018a; Warstadt et al., 2019a, 2020a; Ettinger, 2020; Lakretz et al., 2021) has been used to show that even without a learned probe, BERT can predict syntactic properties with high confidence (Goldberg, 2019; Wolf, 2019). We use this class of non-parametric probes to investigate RoBERTa's ability to learn word order during pre-training.

Approach
We first describe the data generation and evaluation methodology used in this paper. We use the RoBERTa (base) (Liu et al., 2019) MLM architecture, due to its relative computational efficiency and good downstream task performance. We expect that other variants of MLMs would provide similar insights, given their similar characteristics.

Models
In all of our experiments, we use the original 16GB BookWiki corpus (the Toronto Books Corpus, Zhu et al. 2015, plus English Wikipedia) from Liu et al. (2019). 2 We denote the model trained on the original, unmodified BookWiki corpus as M N (for "natural"). We use two types of word order randomization methods: permuting words at the sentence level, and resampling words at the corpus level.

Sentence word order permutation. To investigate to what extent the performance of MLM pre-training is a consequence of distributional information, we construct a training corpus devoid of natural word order but preserving local distributional information. We construct word order-randomized versions of the BookWiki corpus, following the setup of Sinha et al. (2021). Concretely, given a sentence S containing N words, we permute the sentence using a seeded random function F 1 such that no word can remain in its original position. In total, there exist (N − 1)! possible permutations of a given sentence. We randomly sample a single permutation per sentence, to keep the total dataset size similar to the original.
We extend the permutation function F 1 to a function F n that preserves n-gram information. Specifically, given a sentence S of length N and n-gram value n, we sample a starting position i ∈ {0, . . . , N − n} for a contiguous n-gram and convert the span S[i, i + n] to a single token, to form Ŝ of length N̂ = N − (n − 1). We continue this process repeatedly (without reusing the previously created n-grams) until there exists no starting position for selecting a contiguous n-gram in Ŝ. For example, given a sentence of length N = 6, F 4 will first convert one span of 4 tokens into a single token, so that Ŝ consists of three tokens (one conjoined token of 4 contiguous words, and two leftover words). Then, the resulting sentence Ŝ is permuted using F 1 . We train RoBERTa models on four permutation variants of the BookWiki corpus, M 1 , M 2 , M 3 , M 4 , one for each n-gram value ∈ {1, 2, 3, 4}. More details on the process, along with pseudo code and sample quality, are provided in Appendix B.

Corpus word order bootstrap resample. The above permutations preserve higher-order distributional information by keeping words from the same sentence together. However, we need a baseline to understand how a model would perform without such co-occurrence information. We construct a baseline, M UG , that captures word/subword information without access to co-occurrence statistics. To construct M UG , we sample unigrams from BookWiki according to their frequencies, while also treating named entities as unigrams. We leverage Spacy (Honnibal et al., 2020) 3 to extract unigrams and named entities from the corpus, and construct M UG by drawing words from this set according to their frequency. This allows us to construct M UG such that it has exactly the same size as BookWiki but without any distributional (i.e. co-occurrence) information beyond the unigram frequency distribution.
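The two permutation functions can be sketched in plain Python as follows. This is a minimal re-implementation for illustration, assuming pre-tokenized sentences whose words contain no internal whitespace; the paper's actual implementation (Appendix B) may differ in details:

```python
import random

def permute_f1(words, rng):
    """F_1: shuffle a sentence so that no word remains in its original
    position (resample until a valid permutation of indices is found)."""
    n = len(words)
    if n < 2:
        return list(words)
    while True:
        idx = list(range(n))
        rng.shuffle(idx)
        if all(i != j for i, j in enumerate(idx)):
            return [words[j] for j in idx]

def permute_fn(words, n, rng):
    """F_n: repeatedly conjoin a randomly chosen contiguous n-gram into a
    single token (never reusing previously conjoined tokens), then apply
    F_1 to the shortened sequence."""
    tokens = list(words)
    while True:
        # valid starts: n consecutive tokens, none of them already conjoined
        # (conjoined tokens are marked by the spaces joining their words)
        starts = [i for i in range(len(tokens) - n + 1)
                  if all(" " not in t for t in tokens[i:i + n])]
        if not starts:
            break
        i = rng.choice(starts)
        tokens[i:i + n] = [" ".join(tokens[i:i + n])]
    return permute_f1(tokens, rng)
```

For a six-word sentence, `permute_fn(words, 4, rng)` conjoins one 4-gram and shuffles the resulting three tokens, matching the F 4 example above.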
Our hypothesis is that any model pre-trained on this data will perform poorly, but it should provide a baseline indicating how much can be learned about language from the model's inductive bias in isolation.

Further baselines.
To investigate what happens if a model has absolutely no notion of word order, we also experiment with pre-training RoBERTa on the original corpus without positional embeddings. Concretely, we modify the RoBERTa architecture to remove the positional embeddings from the computation graph, and then proceed to pre-train on the natural-order BookWiki corpus. We denote this model M NP . Finally, we consider a randomly initialized RoBERTa model, M RI , to observe the extent to which each task can be learned with only the model's base inductive bias.
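To see why removing positional embeddings yields a purely bag-of-words encoder, note that self-attention without a positional term is permutation-equivariant: a token's output depends only on which tokens are present, not where. A minimal numpy illustration (not the paper's code; the actual M NP is a full RoBERTa model with its positional embeddings excised):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Single-head self-attention with identity projections for brevity.
    With no positional term added to X, permuting the input rows simply
    permutes the output rows."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    return softmax(scores, axis=-1) @ X

# Token embeddings via lookup: a toy sentence and a shuffled version of it
rng = np.random.default_rng(0)
E = rng.normal(size=(10, 8))                       # toy embedding table
orig = self_attention(E[[1, 2, 3, 4]]).mean(axis=0)
perm = self_attention(E[[3, 1, 4, 2]]).mean(axis=0)
# After order-invariant pooling, the two word orders are indistinguishable
```

The same argument applies layer by layer to a full Transformer encoder, which is why M NP can at best exploit bag-of-words information.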
Pre-training details. Each model ∈ {M N , M 1 , M 2 , M 3 , M 4 , M UG , M NP } is a RoBERTa-base model (12 layers, hidden size of 768, 12 attention heads, 125M parameters), trained for 100k updates using 8k batch-size, 20k warmup steps, and 0.0006 peak learning rate. These are identical hyperparameters to Liu et al. (2019), except for the number of warmup steps which we changed to 20k for improved training stability. Each model was trained using 64 GPUs for up to 72 hours each. We train three seeds for each data configuration. We validate all models on the public Wiki-103 validation set (see Appendix C). We use FairSeq (Ott et al., 2019) for the pre-training and fine-tuning experiments.

Fine-tuning tasks
We evaluate downstream performance using the General Language Understanding and Evaluation (GLUE) benchmark, the Paraphrase Adversaries from Word Scrambling (PAWS) dataset, and various parametric and non-parametric tasks (see §5).

GLUE. The GLUE (Wang et al., 2018) benchmark is a collection of 9 datasets for evaluating natural language understanding systems, of which we use the Corpus of Linguistic Acceptability (CoLA, Warstadt et al., 2019b), Stanford Sentiment Treebank (SST, Socher et al., 2013), Microsoft Research Paraphrase Corpus (MRPC, Dolan and Brockett, 2005), Quora Question Pairs (QQP) 4 , Multi-Genre NLI (MNLI, Williams et al., 2018b), Question NLI (QNLI, Rajpurkar et al., 2016; Demszky et al., 2018), and Recognizing Textual Entailment (RTE, Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009). Pham et al. (2020) show the word order insensitivity of several GLUE tasks (QQP, SST-2), evaluated on public regularly pre-trained checkpoints.

PAWS. The PAWS task (Zhang et al., 2019) consists of predicting whether a given pair of sentences are paraphrases. This dataset contains both paraphrase and non-paraphrase pairs with high lexical overlap, which are generated by controlled word swapping and back translation. Since even a small word swap or perturbation can drastically modify the meaning of the sentence, we hypothesize that the randomized pre-trained models will struggle to attain high performance on PAWS.

Fine-tuning details. We use the same fine-tuning methodology as Liu et al. (2019), running hyperparameter search over the learning rates {1 × 10 −5 , 2 × 10 −5 , 3 × 10 −5 } and batch sizes {16, 32} for each model. For the best hyperparameter configuration of each model, we fine-tune with 5 different seeds and report the mean and standard deviation for each setting. M NP is fine-tuned without positional embeddings, matching the way it was pre-trained.
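The fine-tuning sweep described above amounts to a small grid search; schematically, with `train_fn` as a hypothetical stand-in for one fine-tuning run returning dev-set accuracy:

```python
import itertools

def best_finetune_config(train_fn,
                         lrs=(1e-5, 2e-5, 3e-5),
                         batch_sizes=(16, 32)):
    """Return the (learning rate, batch size) pair from the grid above
    that maximizes dev-set accuracy under `train_fn`."""
    return max(itertools.product(lrs, batch_sizes),
               key=lambda cfg: train_fn(lr=cfg[0], batch_size=cfg[1]))
```

The winning configuration is then re-run with 5 seeds to report mean and standard deviation.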

Downstream task results
In this section, we present the downstream task performance of the models defined in §3. For evaluation, we report Matthews correlation for CoLA and accuracy for all other tasks.

Word order permuted pre-training
In our first set of experiments, we fine-tune the pre-trained models on the GLUE and PAWS tasks. We report the results in Table 1. 5 First, we observe that the model without access to distributional or word order information, M UG (unigram), performs much worse than M N overall: M UG is 18 points worse than M N on average across the accuracy-based tasks in Table 1 and has essentially no correlation with human judgments on CoLA. M UG , M NP and M RI perform comparably on most of the tasks, while achieving surprisingly high scores on QQP and SST-2. However, all three models perform significantly worse on GLUE and PAWS compared to M N (Table 1, bottom half). M UG reaches up to 71.9 on MNLI, possibly because access to (bags of) words and some phrases (from NER) is beneficial for MNLI. For the majority of tasks, the difference between M NP and M RI is small: a pure bag-of-words model performs comparably to a randomly initialized model.
Next, we observe a significant improvement on all tasks when we give models access to sentence-level distributional information during pre-training. M 1 , the model pre-trained on completely shuffled sentences, is on average only 3.3 points lower than M N on the accuracy-based tasks, and within 0.3 points of M N on QQP. Even on PAWS, which was designed to require knowledge of word order, M 1 is within 5 points of M N . Randomizing n-grams instead of words during pre-training results in a (mostly) smooth increase on these tasks: M 4 , the model pre-trained on shuffled 4-grams, trails M N by only 1.3 points on average, and even comes within 0.2 points of M N on PAWS. We observe a somewhat different pattern on CoLA, where M 2 does almost as well as M N and outperforms M 3 and M 4 , though we also observe very high variance across random seeds for this task. Crucially, we observe that M 1 outperforms M NP by a large margin. This shows that positional embeddings are critical for learning, even when the word orders themselves are not natural. 6 Overall, these results confirm our hypothesis that RoBERTa's strong performance on downstream tasks can be explained for a large part by the distributional prior.

5 The M N results are not directly comparable with those of the publicly released roberta-base model of Liu et al. (2019), as that model uses the significantly larger 160GB corpus and is trained for 500K updates. For computational reasons, we restrict our experiments to the 16GB BookWiki corpus and 100K updates, mirroring the RoBERTa ablations.

Word order permuted fine-tuning
There are two possible explanations for the results in §4.1: either the tasks do not need word order information to be solved, or any necessary word order information can be acquired during fine-tuning. To examine this question, we permute the word order during fine-tuning as well. Concretely, for each task, we construct a unigram order-randomized version of each example in the fine-tuning training set using F 1 . We then fine-tune our pre-trained models on this shuffled data and evaluate task performance. For all experiments, we evaluate and perform early stopping on the original, natural word order dev set, in order to conduct a fair evaluation on the exact same optimization setup for all models.
Our results in Figure 1 provide some evidence for both hypotheses. On QQP and QNLI, accuracy decreases only slightly for models fine-tuned on shuffled data. Models can also achieve above 80% accuracy on MNLI, SST-2, and MRPC when fine-tuned on shuffled data, suggesting that purely lexical information is quite useful on its own. 7 On the other hand, for all datasets besides QQP and QNLI, we see noticeable drops in accuracy when fine-tuning on shuffled data and testing on normal order, both for M N and for the shuffled models M 1 through M 4 . This suggests both that word order information is useful for these tasks, and that shuffled models must be learning to use word order information during fine-tuning. 8 Having word order during fine-tuning is especially important for achieving high accuracy on CoLA, RTE (cf. Pham et al. 2020), as well as PAWS, suggesting that these tasks are the most word order reliant. Recent research (Yu and Ettinger, 2021) has raised some questions about potential artefacts inflating performance on PAWS: their swapping-distance cue appears consistent both with our finding of high PAWS performance for n-gram shuffled models in Table 1, and with our PAWS results in Figure 1, which suggest that PAWS performance does in fact rely to some extent on natural word order at the fine-tuning stage.
Finally, for CoLA, MRPC, and RTE, performance is higher after fine-tuning on shuffled data for M 1 than M N . We hypothesize that M N represents shuffled and non-shuffled sentences very differently, resulting in a domain mismatch problem when fine-tuning on shuffled data but evaluating on non-shuffled data. 9 Since M 1 never learns to be sensitive to word order during pre-training or fine-tuning, it does not suffer from that issue. Our results in this section also highlight issues with these datasets, concurrent with the findings that many GLUE tasks do not need sophisticated linguistic knowledge to be solved, as models typically tend to exploit statistical artefacts and spurious correlations during fine-tuning (cf. Gururangan et al. 2018; Poliak et al. 2018; Tsuchiya 2018; McCoy et al. 2019). However, our results overwhelmingly support the conclusion that word order does not matter during pre-training, provided the model has the opportunity to learn the necessary information about word order during fine-tuning.

Probing results
To investigate how much syntactic information is contained in the MLM representations, we evaluate several probing tasks on our trained models. We consider two classes of probes: parametric probes, which make use of learnable parameters, and non-parametric probes, which directly examine the language model's predictions.

Parametric Probing
To probe our models for syntactic, semantic and other linguistic properties, we investigate dependency parsing using Pareto probing (Pimentel et al., 2020a) and the probing tasks from Conneau et al. (2018) in SentEval (Conneau and Kiela, 2018).

Syntactic Probing
Pimentel et al. (2020a) proposed a framework based on Pareto optimality to probe for syntactic information in contextual representations. They suggest that an optimal probe should balance optimal performance on the probing task with the complexity of the probe. Following their setup, we use the "difficult" probe: dependency parsing (DEP). We also investigate the "easy" probes, dependency arc labeling (DAL) and POS tag prediction (POS); results are reported in Appendix K. We probe with Linear and MLP probes, and inspect the task accuracy in terms of Unlabeled Attachment Score (UAS). 10 We triple this experiment size by evaluating on three pre-trained models of different seeds for each model configuration. We consider Pimentel et al.'s English dataset, derived from Universal Dependencies EWT (UD EWT) (Bies et al., 2012; Silveira et al., 2014), which contains 12,543 training sentences. Additionally, we experiment on the Penn Treebank dataset (PTB), which contains 39,832 training sentences. 11 We report the mean test accuracy over three seeds for the best dev set accuracy for each task. 12

10 We experimented with a much stronger, state-of-the-art second-order Tree CRF Neural Dependency Parser (Zhang et al., 2020), but did not observe any difference in UAS with different pre-trained models (see Appendix G).
11 PTB data (Kitaev et al., 2019) is used from github.com/nikitakit/self-attentive-parser/tree/master/data.
12 Pimentel et al. (2020a) propose computing the Pareto hypervolume over all hyperparameters in each task. We did not observe a significant difference in the hypervolumes for the models, as reported in Appendix K.

Results. We observe that the UAS scores follow a trend similar to the fine-tuning results (Table 2). Surprisingly, M UG probing scores seem to be somewhat better than M 1 (though with large overlap in their standard deviations), even though M UG cannot learn information related to either word order or co-occurrence patterns.
The performance gap appears to be task- and probe-specific. We observe a low performance gap in several scenarios, the lowest being between M N and M 3 /M 4 on PTB, using both the MLP and Linear probes.

SentEval Probes
We also investigate the suite of 10 probing tasks (Conneau et al., 2018) available in the SentEval toolkit (Conneau and Kiela, 2018). This suite contains a range of semantic, syntactic and surface-level tasks. Jawahar et al. (2019) utilize this set of probing tasks to arrive at the conclusion that "BERT embeds a rich hierarchy of linguistic signals: surface information at the bottom, syntactic information in the middle, semantic information at the top". We re-examine this hypothesis by using the same probing method and comparing against models trained with random word order.

Training setup. We run the probes on the final layer of each of our pre-trained models for three seeds, while keeping the encoder frozen. SentEval trains probes on top of fixed representations individually for each task. We follow the recommended setup and run grid search over the following hyperparameters: number of hidden layer dimensions ([0, 50, 100, 200]), dropout ([0, 0.1, 0.2]), 4 epochs, and a batch size of 64. We select the best performance based on the dev set, and report the test set accuracy.
Results. We provide the results in Table 3. The M N pre-trained model scores better than the unnatural word order models for only one out of five semantic tasks and in none of the lexical tasks. However, M N does score higher for two out of three syntactic tasks. Even for these two syntactic tasks, the gap between M UG and M N is much larger than that between M 1 and M N . These results show that while natural word order is useful for at least some probing tasks, the distributional prior of randomized models alone is enough to achieve reasonably high accuracy on syntax-sensitive probing.

Non-Parametric Probing
How to probe effectively with parametric probes is a matter of much recent debate (Hall Maudslay et al., 2020; Belinkov, 2021). From our results so far, it is unclear whether parametric probing meaningfully distinguishes models trained with corrupted word order from those trained with normal orders. Thus, we also investigate non-parametric probes.

We consider a set of non-parametric probes that use a range of sentences varying in their linguistic properties. For each, the objective is for a pre-trained model to assign higher probability to a grammatically correct word than to an incorrect one. Since both the correct and incorrect options occupy the same sentential position, we call them "focus words". Linzen et al. (2016) use sentences from Wikipedia containing present-tense verbs, and compare the probability assigned by the encoder to plural vs. singular forms of the verb; they focus on sentences containing at least one noun between the verb and its subject, known as "agreement attractors". Gulordava et al. (2018b) instead replace focus words with random substitutes from the same part-of-speech and inflection. Finally, Marvin and Linzen (2018) construct minimal pairs of grammatical and ungrammatical sentences, and compare the model's probability for the words that differ.

Setup. In our experiments, we mask the focus words in the stimuli and compute the probability of the correct and incorrect token respectively. The Linzen et al. (2016) and Gulordava et al. (2018b) datasets are skewed towards singular focus words, which could disproportionately help weaker models that just happen to assign more probability mass to singular focus words. To counter this, we balance these datasets to have an equal number of singular and plural focus words by upsampling, and report the aggregated and balanced results in Table 4 (see Appendix L for more detailed results). We verify our experiments by using three pre-trained models with different seeds for each model configuration.
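The balancing step can be sketched as follows. This is a simplified upsampling routine for illustration, not the paper's exact preprocessing:

```python
import random

def balance_by_upsampling(examples, label_fn, rng):
    """Equalize class sizes (e.g. singular vs. plural focus words) by
    resampling every minority class up to the majority-class size."""
    groups = {}
    for ex in examples:
        groups.setdefault(label_fn(ex), []).append(ex)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choices(g, k=target - len(g)))
    return balanced
```

The probe itself then just checks, for each balanced stimulus, whether the MLM assigns a higher probability to the grammatical focus word than to the ungrammatical one in the masked position.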
Results. We observe for the Linzen et al. (2016) and Marvin and Linzen (2018) datasets that the gap between M N and the randomization models is relatively large. The Gulordava et al. (2018b) dataset shows a smaller gap between M N and the randomization models. While some randomization models (e.g., M 2 , M 3 , and M 4 ) performed quite similarly to M N according to the parametric probes, they are all markedly worse than M N according to the non-parametric ones. This suggests that non-parametric probes identify certain syntax-related modeling failures that parametric ones do not.

Discussion
The assumption that word order information is crucial for any classical NLP pipeline (especially for English) is deeply ingrained in our understanding of syntax itself (Chomsky, 1957): without order, many linguistic constructs are undefined. Our fine-tuning results in §4.1 and parametric probing results in §5.1, however, suggest that MLMs do not need to rely much on word order to achieve high accuracy, calling into question previous claims that they learn a "classical NLP pipeline." One might ask, though, whether an NLP pipeline would really need natural word order at all: can transformers not simply learn what the correct word order is from unordered text? First, the lower non-parametric probing accuracies of the randomized models indicate that they are not able to accurately reconstruct the original word order (see also Appendix D). But even if models were able to "unshuffle" the words under our unnatural pre-training setup, they would only be doing so based on distributional information. Models would then abductively learn only the most likely word order. While models might infer a distribution over possible orders and use that information to structure their representations (Papadimitriou et al., 2021), syntax is not about possible or even the most likely orders: it is about the actual order. That is, even if one concludes in the end that Transformers are able to perform word order reconstruction based on distributional information, and recover almost all downstream performance based solely on that, we ought to be a lot more careful when making claims about what our evaluation datasets are telling us.
Thus, our results seem to suggest that we may need to revisit what we mean by "linguistic structure," and perhaps subsequently acknowledge that we may not need human-like linguistic abilities for most NLP tasks. Or, our results can be interpreted as evidence that we need to develop more challenging and more comprehensive evaluations, if we genuinely want to measure linguistic abilities, however those are defined, in NLP models.
There are many interesting and potentially exciting avenues for future work that we could not explore due to space limitations. An interesting question is whether this phenomenon is more pronounced for English than for other languages. It is natural to wonder whether more word-order-flexible or morphologically rich languages would suffer from the same problem. Using the methods discussed in this work, we could imagine devising a way to determine the degree of order-dependence for tasks across languages. Another possible extension pertains to other tasks, including extractive question answering (QA) or sequence tagging, for which we could also determine whether word order information is acquired downstream or during pre-training.
The sensitivity of generative models to word-order-permuted input could also be investigated further. Recent work by Parthasarathi et al. (2021) begins this discussion by showing that a Machine Translation (MT) model can often arrive at the gold translation when provided with input sentences whose words have been permuted using parse trees. Relatedly, Alleman et al. (2021) also investigate targeted parse-tree-based perturbations as a means of evaluating model robustness. O'Connor and Andreas (2021) demonstrate that Transformers are largely insensitive to syntactic manipulations while still achieving low perplexity in language modeling. Exploring model sensitivity to word order permutations for approaches that unify generation and classification (e.g., multitasking) could also be interesting future work.

Conclusion
In this work, we revisited the hypothesis that masked language modelling's impressive performance can be explained in part by its ability to learn classical NLP pipelines. We investigated targeted pre-training on sentences with various degrees of randomization in their word order, and observed overwhelmingly that MLM's success is most likely not due to its ability to discover syntactic and semantic mechanisms necessary for a traditional language processing pipeline during pre-training. Instead, our experiments suggest that MLM's success can largely be explained by it having learned higher-order distributional statistics that make for a useful prior for subsequent fine-tuning. These results should hopefully encourage the development of better, more challenging tasks that require sophisticated reasoning, and harder probes to narrow down what exact linguistic information is present in the representations learned by our models.

A From Word2vec to BERT in 4 steps
Take the basic parameterization of skip-gram word2vec (Mikolov et al., 2013):

P(w | t) = exp(f(w, t)) / Σ_{w' ∈ V} exp(f(w', t))

where t is the target word, w is a word in the context, V is the set of all possible context words, and f is simply the dot product.
In actual word2vec, we would use negative sampling within a given window size and optimize

log σ(w · t) + k · E_{t' ∼ P} [log σ(−w · t')]

computed over the context C(w) = {w_{i−k}, ..., w_{i−1}, w_{i+1}, ..., w_{i+k}} for word index i, window size 2k, and unigram probability distribution P. It has been shown that optimizing this objective is close to learning the shifted PPMI distribution (Levy et al., 2015).
Step 1: BPE One reason for not computing the full softmax is that it becomes a prohibitively expensive matrix multiplication with large vocabulary V . A solution is to tokenize based on subword units, e.g. BPE, to ensure a smaller total vocabulary U in the softmax denominator. Doing so makes the matrix multiplication feasible, at least on GPU. It also ensures we have sufficient coverage over the words in our vocabulary.
Step 2: Defenestration Next, replace the local context window with the entire sentence, while masking out the target word, i.e., C(t) = {w ∈ S : w ≠ t}, where S is the sentence containing t.
Step 3: Non-linearity Replace the pairwise word-level dot product f(w, t) with a fancy non-linear function, say a sequence of multi-head self-attention layers, g(t, C(t)), that takes as input the entire sentence-with-mask, and you get:

P(t | C(t)) = exp(g(t, C(t))) / Σ_{t' ∈ U} exp(g(t', C(t)))

Step 4: Sprinkle data and compute You have BERT. Now all you need is enough data and compute, and perhaps some optimization tricks. Make sure to update the parameters in your model g when fine-tuning, rather than keeping them fixed, for optimal performance on downstream tasks. This correspondence is probably (hopefully) trivial to most NLP researchers, but worth pointing out, lest we forget.

Table 5: BLEU-2,3,4 scores (mean and std dev) on a sample of 1M sentences drawn from the corpus used to train M 1 , M 2 , M 3 and M 4 compared to M N .
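The two parameterizations above can be sketched numerically. The following is a minimal, illustrative toy (our own sketch, not BERT or word2vec internals): the skip-gram form scores each context word against the target via a dot product, while the masked-prediction form scores each candidate token against an encoded context through an arbitrary non-linear function g (here a stand-in for the self-attention stack).

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the vocabulary axis.
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

def skipgram_prob(w_idx, t_idx, W_ctx, W_tgt):
    # P(w | t) = exp(w . t) / sum_{w' in V} exp(w' . t): f is a dot product.
    scores = W_ctx @ W_tgt[t_idx]          # one score per context word
    return softmax(scores)[w_idx]

def masked_lm_prob(t_idx, context_vec, W_out, g):
    # P(t | C(t)) = exp(g(t, C(t))) / sum_{t' in U} exp(g(t', C(t))):
    # every candidate token is scored against the encoded masked context.
    scores = W_out @ g(context_vec)        # one score per vocabulary item
    return softmax(scores)[t_idx]

rng = np.random.default_rng(0)
V, d = 8, 4
W_ctx = rng.normal(size=(V, d))
W_tgt = rng.normal(size=(V, d))
W_out = rng.normal(size=(V, d))
g = lambda c: np.tanh(c)                   # toy stand-in for multi-head self-attention

p1 = skipgram_prob(w_idx=2, t_idx=5, W_ctx=W_ctx, W_tgt=W_tgt)
p2 = masked_lm_prob(t_idx=5, context_vec=rng.normal(size=d), W_out=W_out, g=g)
```

Swapping the dot product for g, and the window for the full masked sentence, is exactly Steps 2 and 3 above; everything else is data and compute.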

B Data generation
We provide pseudo-code for F i in Algorithm 1. Following Sinha et al. (2021), we do not explicitly control whether the permuted words retain any of their original neighbors. Thus, a certain number of extra n-grams is expected to co-occur purely as a product of random shuffling. We quantify the amount of such residual overlap on a sample of 1 million sentences drawn from the BookWiki random corpus, and present the BLEU-2, BLEU-3 and BLEU-4 scores in Table 5. We provide a sample snapshot of the generated data in Table 18.
Algorithm 1
1: procedure Randomize a sentence S with seed t and n-gram size n
2: W = tokenize the words in S
3: Set the seed to t
4: if n > 1 then
5:   while True do
6:     K = Sample all possible starting points from [0, |W| − n]
7:     Ignore the starting points in K which overlap with conjoined tokens (conjoined tokens consist of joined unigrams)
8:     if |K| ≥ 1 then
9:       Sample one position p ∈ K
10:      g = Extract the n-gram W[p : p + n]

We use the Wiki 103 validation and test set to validate and test the array of pre-trained models, as validation on this small dataset is quick, effective, and reproducible for comparison among publicly available datasets (Figure 2). We observe that perplexity monotonically increases from M N , through M 4 -M 1 , to M UG , and finally M NP .
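The per-sentence randomization can be sketched in a few lines of Python. This is a simplified re-implementation, not the exact Algorithm 1: here, for n > 1, non-overlapping n-grams are grouped greedily from left to right rather than sampled, but the key properties (seeded determinism, word multiset preserved) are the same.

```python
import random

def shuffle_sentence(sentence, n=1, seed=42):
    """Shuffle a sentence at the n-gram level (simplified sketch of F_n).

    For n > 1, consecutive words are first grouped into non-overlapping
    n-grams (a simplification: the actual algorithm samples n-gram start
    positions and avoids overlaps with already-conjoined tokens); the
    resulting units are then permuted with the given seed.
    """
    words = sentence.split()
    rng = random.Random(seed)          # seeded for reproducibility, as in the paper
    if n <= 1:
        units = list(words)
    else:
        units = [" ".join(words[i:i + n]) for i in range(0, len(words), n)]
    rng.shuffle(units)
    return " ".join(units)

s = "the quick brown fox jumps over the lazy dog"
shuf1 = shuffle_sentence(s, n=1, seed=0)   # unigram shuffle (M_1-style)
shuf2 = shuffle_sentence(s, n=2, seed=0)   # bigrams kept intact (M_2-style)
```

Note that because the seed fully determines the permutation, two sentences of the same length permuted with the same seed undergo the same reordering, which is exactly the concern the M 1 * ablation in Appendix D addresses.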

D Word-order pre-training ablations
We also train further model ablations with low to high distributional priors. Following the construction of the corpus bootstrap resample, we train a model where words are drawn uniformly from the BookWiki corpus, thus destroying the natural frequency distribution (M UF ). We further study an ablation with a high distributional prior, M 512 , where we shuffle words (unigrams) within a buffer created by joining multiple sentences such that the maximum token length of the buffer is 512. This ablation, which is similar to the paragraph word shuffle condition in Gauthier and Levy (2019), allows us to study the effect of unigram shuffling in a window larger than the one used for M 1 . The buffer size is chosen to be 512 because BERT/RoBERTa is typically trained with that maximum sequence length. We report dev set results on the GLUE benchmark for these ablations, along with the baselines M UG , M RI and M NP and the random shuffles, in Table 6 and Figure 3. We observe that M 512 exhibits worse overall scores than M 1 ; however, it is still significantly better than the M NP or M UG baselines. We also observe that destroying the natural frequency distribution of words (M UF ) yields comparable or slightly better results than the random corpus model M UG . This shows that merely replicating the natural frequency distribution of words, without any context, is not useful for the model. These results indicate that at least some form of distributional prior is required for MLM-based models to learn a good downstream representation.
One might argue that the superior results displayed by the unnatural models are due to the ability of RoBERTa to "reconstruct" the natural word order from shuffled sentences. The data generation algorithm F i requires a seed t for every sentence.
In our experiments, we set the same seed for every sentence in the corpus to ensure reproducibility. This could be problematic: if sentences of the same length are permuted with the same seed, it could be easier for the model to "reconstruct" the natural word order and thereby learn the necessary syntax. We tested this hypothesis by constructing a new corpus with a different seed for every sentence in every shard of the corpus (1/5th of the BookWiki corpus is typically referred to as a shard for computational purposes), and used it to build the model M 1 * . We observe minimal differences in the raw numbers between M 1 and M 1 * for most tasks (Table 7), with the exception of CoLA, which performs similarly to M 2 , possibly due to a difference in initialization. This result suggests that even with the same seed for every sentence, it is difficult for the model to simply reconstruct the natural word order from the unnatural sentences during pre-training.

E Measuring Relative difference
In this section, we further measure the difference in downstream task performance reported in §4.1 using relative difference as a metric. Let us denote the downstream task performance as A(T | D), where T is the task and D is the pre-trained model. We primarily aim to evaluate the relative performance gap, i.e., how much the performance differs between our natural and unnatural models. Thus, we define the Relative Difference:

∆_D(T) = (A(T | M_N) − A(T | D)) / (A(T | M_N) − A(T | ∅))

where A(T | ∅) is the random performance on task T (0.33 for MNLI, 0 for CoLA, and 0.5 for the rest). ∆_D(T) → 0 when the performance of a pre-trained model reaches that of the model pre-trained with natural word order.
We report the relative difference for each task in Table 8. CoLA has the largest ∆_D(T) of all tasks, suggesting the expected high reliance on word order. ∆_D(T) is lowest for QQP.
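The metric can be computed directly; a small sketch follows, with illustrative scores rather than numbers from our tables:

```python
def relative_difference(score_d, score_natural, score_random):
    """Relative performance gap between an unnatural model D and M_N.

    Returns 0 when model D matches the naturally pre-trained model M_N,
    and 1 when D is no better than random performance A(T | empty).
    """
    return (score_natural - score_d) / (score_natural - score_random)

# Illustrative values only (not taken from the paper's tables):
# a binary task where M_N reaches 0.85 and the unnatural model 0.80.
delta = relative_difference(score_d=0.80, score_natural=0.85, score_random=0.50)
```

Normalizing by the distance to random performance makes gaps comparable across tasks with different chance baselines (0.33 for MNLI, 0 for CoLA, 0.5 otherwise).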

F Fine-tuning with randomized data
We perform additional experiments using the fine-tuned models from §4.1. Specifically, we construct unigram-randomized train and test sets (denoted shuffled) for a subset of tasks, to evaluate whether models fine-tuned on natural or unnatural task data (with a natural or unnatural pre-training prior) can handle unnatural data at test time. Sinha et al. (2021) showed that for many MNLI examples there exists at least one permutation that the model predicts correctly; however, they also showed that every sentence can have many permutations that are predicted incorrectly. Following their evaluation, we construct 100 permutations for each example in the dev set of each task and report the overall accuracy.
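The shuffled-evaluation protocol can be sketched at a toy scale. In this illustrative snippet, `predict` is a stand-in for a real fine-tuned classifier; the toy model is deliberately order-invariant (bag-of-words), so shuffling cannot hurt it, which is the degenerate case the protocol is designed to detect.

```python
import random

def permutation_accuracy(examples, predict, k=100, seed=0):
    """Average accuracy of `predict` over k unigram permutations per example.

    `examples` is a list of (sentence, label) pairs; `predict` maps a
    sentence string to a label. This mirrors the shuffled-evaluation
    protocol, at a much smaller scale than the paper's experiments.
    """
    rng = random.Random(seed)
    correct = total = 0
    for sentence, label in examples:
        words = sentence.split()
        for _ in range(k):
            perm = list(words)
            rng.shuffle(perm)
            correct += predict(" ".join(perm)) == label
            total += 1
    return correct / total

# Toy bag-of-words "classifier": order-invariant, so shuffling has no effect.
predict = lambda s: "pos" if "good" in s.split() else "neg"
acc = permutation_accuracy(
    [("a good movie", "pos"), ("a bad movie", "neg")], predict, k=10
)
```

A genuinely order-sensitive classifier would instead show a gap between natural and shuffled evaluation, which is what Table 9 quantifies for the real models.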
Concretely, we use M N , M 1 and M UG as our pre-trained representations (trained with natural, unigram sentence shuffle and corpus shuffle data respectively) and evaluate the effect of training and evaluation on natural and unnatural data in Table 9. We observe that all models perform poorly on the shuffled test set, compared to natural evaluation. However, interestingly, models have a slight advantage with a unigram randomized prior (M 1 ), with CoLA having the biggest performance gain. PAWS task suffers the biggest drop in performance (from 94.49 to 62.22) but the lowest gain in M 1 , confirming our conclusion from §4.1 that most of the word order information necessary for PAWS is learned from the task itself.
Furthermore, training on shuffled data surprisingly leads to high performance on natural data for M N on several tasks, the effect being weakest for CoLA and PAWS. This suggests that for tasks other than CoLA and PAWS, spurious correlations are leveraged by the models during fine-tuning (cf. Gururangan et al. 2018; Poliak et al. 2018; Tsuchiya 2018). We also observe evidence of domain matching, where models improve their performance on shuffled evaluation data when the training data source is changed from natural to shuffled (for M N , MNLI shuffled evaluation improves from 68.11 to 79.96 just by changing the training corpus from natural to shuffled). We observe this behavior consistently for all tasks with all pre-trained representations.

Table 9: Fine-tuning evaluation by varying different sources of word order (with mean and std dev). We vary the word order contained in the pre-trained model (M N , M 1 , M UG ); in the fine-tuning training set (natural and shuffled); and in the fine-tuning evaluation (natural and shuffled). Here, shuffled corresponds to unigram shuffling of words in the input. When the fine-tuning evaluation uses shuffled input, we evaluate on a sample of 100 unigram permutations for each data point in the dev set of the corresponding task.

G Dependency parsing using Second order Tree CRF Neural Dependency Parser
We also conduct extensive experiments with the second-order Tree CRF neural dependency parser from Zhang et al. (2020), using their provided codebase. 13 We report results on the UD EWT and PTB corpora in Table 10. Strangely enough, we find the gap to be even smaller across the different randomization models; in some cases the performance of R 1 even improves over OR. We suspect this result is due to two reasons:

H Perplexity analysis
We measure the perplexity of various pre-trained randomization models on text that is randomized using the same function F. Conventional language models compute the perplexity of a sentence S using the past tokens S_{<t} = (w_1, w_2, ..., w_{t−1}) and the chain rule:

log P(S) = Σ_{t=1}^{|S|} log P_LM(w_t | S_{<t})

However, this formulation is not defined for MLMs, as a word is predicted using the entire sentence as context. Following Salazar et al. (2020), we measure Pseudo-Perplexity: given a sentence S, we compute the log-probability of each word in S by iteratively masking it out, and average the log-probability per word:

PLL(S) = (1/|S|) Σ_{t=1}^{|S|} log P_MLM(w_t | S_{\t})

where S_{\t} denotes S with w_t masked out. We bootstrap the PLL score of a test corpus T by drawing 100 samples five times with replacement. We also compute the bootstrap pseudo-perplexity following Salazar et al.:

PPPL(W) = exp(−(1/N) Σ_{S ∈ W} PLL(S))

where W is the combined bootstrap sample containing N sentences drawn with replacement from T. We compute this score for 6 pre-trained models over four randomization schemes applied to the bootstrapped sample W (i.e., we use the same n-gram randomization function F i ). Thus, we obtain a 5x6 matrix of BPLL scores, which we plot in Figure 4.
We observe that the pre-trained model M N has the lowest perplexity on sentences with natural word order. The models pre-trained with random word order exhibit significantly higher perplexity on the natural word order sentences (top row). With the exception of M 1 , the models pre-trained on randomized data (M 2 , M 3 and M 4 ) each display the lowest perplexity on their respective n = 2, 3, 4 randomizations. These results indicate that the models retain and detect the specific word order scheme on which they were trained.
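The pseudo-perplexity computation can be sketched independently of any particular model. Below, `masked_logprob` is a placeholder for a real MLM such as RoBERTa (which would mask each position and score the held-out token); the normalization follows the per-word-average reading of PLL described above.

```python
import math

def pseudo_log_likelihood(tokens, masked_logprob):
    """Average per-word PLL(S): mean over t of log P_MLM(w_t | S with w_t masked).

    `masked_logprob(tokens, position, token)` should return the model's
    log-probability of `token` at `position` given the sentence with that
    position masked; here it is a stand-in for a real masked LM.
    """
    total = sum(masked_logprob(tokens, t, tok) for t, tok in enumerate(tokens))
    return total / len(tokens)

def bootstrap_pppl(sentences, masked_logprob):
    # Bootstrap pseudo-perplexity over a sample W of N sentences:
    # PPPL(W) = exp(-(1/N) * sum_S PLL(S)).
    total_pll = sum(pseudo_log_likelihood(s, masked_logprob) for s in sentences)
    return math.exp(-total_pll / len(sentences))

# A uniform toy "model" over a 10-word vocabulary, for sanity checking:
# every word gets probability 0.1, so the pseudo-perplexity is exactly 10.
uniform = lambda ctx, pos, tok: math.log(1.0 / 10.0)
pppl = bootstrap_pppl([["a", "b", "c"], ["d", "e"]], uniform)
```

In the real experiments, each of the five randomization schemes is applied to the bootstrap sample before scoring, yielding one cell of the matrix in Figure 4 per (scheme, model) pair.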

I The usefulness of word order
The results in §4.1 suggest that, with proper fine-tuning, an unnaturally trained model can reach a level of performance comparable to that of a naturally pre-trained model. However, we want to understand whether natural word order pre-training offers any advantage during the early stages of fine-tuning. Towards that end, we compute the Minimum Description Length (MDL; Rissanen, 1984). MDL is designed to characterize the complexity of data as the length of the shortest program required to generate it. Thus, the length of the minimum description (in bits) should provide a fair estimate of how useful word order is for fine-tuning in a few-shot setting. Specifically, we leverage the Rissanen Data Analysis (RDA) framework of Perez et al. (2021) to evaluate the MDL of pre-trained models on our set of downstream tasks. Under mild assumptions, if a pre-trained model θ 1 is more useful than θ 2 for solving a particular task T, then the MDL in bits obtained using θ 1 should be shorter than that obtained using θ 2 . We follow the experimental setup of Perez et al. to compute the MDL on several tasks using θ = {M N , M 1 , M 2 , M 3 , M 4 }, over three seeds and three epochs of training. Concretely, RDA involves sampling 9 blocks of data from the dataset at random, where the size of each block increases monotonically, training on 8 blocks while evaluating the model's loss (or codelength) on the ninth. The minimum number of data samples in the smallest block is set at 64, while the largest number of data samples used in the last block is 10,000.
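The prequential flavor of MDL can be illustrated with a toy online code. This is a sketch of the idea, not the Perez et al. (2021) implementation: a simple smoothed label-frequency model stands in for the fine-tuned classifier, and each label is encoded under the model fit to all previously seen data, so easier (more predictable) data yields a shorter codelength.

```python
import math

def prequential_codelength(blocks, vocab_size=2):
    """Online (prequential) codelength in bits; a simplified RDA sketch.

    Each block is a list of labels. Every label is encoded with a
    Laplace-smoothed frequency model fit on all labels seen so far
    (so the first label is encoded with a uniform code). The total
    number of bits is the MDL estimate.
    """
    counts = {}
    seen = 0
    bits = 0.0
    for block in blocks:
        for label in block:
            p = (counts.get(label, 0) + 1) / (seen + vocab_size)
            bits += -math.log2(p)
            counts[label] = counts.get(label, 0) + 1
            seen += 1
    return bits

# A predictable label stream compresses better than an alternating one.
easy = prequential_codelength([[0, 0], [0, 0], [0, 0]])
hard = prequential_codelength([[0, 1], [1, 0], [0, 1]])
```

In RDA proper, the "model" is the pre-trained network fine-tuned on the preceding blocks, so a representation that makes the task easier to learn early on (e.g., one with word order information) earns a shorter codelength.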
We observe that the value of the MDL is consistently lowest for the naturally pre-trained model (Figure 5). For purportedly word order reliant datasets such as RTE, CoLA and PAWS, the gap between the MDL scores of the natural and unnatural models is large. PAWS, specifically, shows the largest advantage at the beginning of optimization; however, with more fine-tuning, the model re-learns the correct word order (§4.1). The present analyses, taken in conjunction with our main results in §4.1, suggest that fine-tuning on large training datasets with complex classifiers in pursuit of state-of-the-art results has mostly nullified the impact of word order in the pre-trained representations. Few-shot learning (Bansal et al., 2020) may thus be a setting where pre-trained word order information matters more.

We also compare the downstream task performance of models pre-trained on the natural corpus for varying numbers of epochs with that of the random word order pre-trained models. The idea is to find the point during pre-training on the natural corpus at which the model exceeds the task performance of the random pre-training model.
Performance on all tasks ( Figure 6) increases rapidly during the first 20-25 epochs of pre-training. For some tasks, the word order information only helps after 30-50 pre-training epochs.

K More results from Syntactic Probes
We computed the Pareto Hypervolume for the dependency parsing task (Pimentel et al., 2020a). The Pareto Hypervolume is computed as an Area Under the Curve (AUC) score over all hyperparameter runs, where the models are arranged by their complexity. We observe minimal differences in the Pareto Hypervolumes (Table 13) between M N and the randomization models for both datasets. We also investigated two "easy" tasks from the Pareto Probing framework: Part-of-Speech tagging (POS) and Dependency Arc Labeling (DAL). For POS (Table 11) and DAL (Table 12), since these tasks are simpler than DEP, the gap between M N and the unnaturally pre-trained models shrinks even more drastically. The gap between M N and M 1 reduces to just 3.5 points on average on PTB for both POS and DAL.
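In two dimensions, the Pareto Hypervolume reduces to the area dominated by the probe runs once both axes are oriented so that larger is better. The following is our own small sketch with made-up points, not the Pimentel et al. (2020a) code:

```python
def hypervolume_2d(points):
    """Area of the union of rectangles [0, x] x [0, y] for (x, y) in points.

    Both coordinates are assumed oriented so larger is better (e.g. probe
    accuracy vs. a negated/normalized complexity), with reference point (0, 0).
    Dominated points contribute nothing to the total.
    """
    hv, best_y = 0.0, 0.0
    # Sweep from largest x to smallest, adding only the newly dominated strip.
    for x, y in sorted(points, key=lambda p: p[0], reverse=True):
        if y > best_y:
            hv += x * (y - best_y)
            best_y = y
    return hv

# Three hypothetical probe runs trading off simplicity (x) against accuracy (y).
hv = hypervolume_2d([(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)])
```

Two models with similar hypervolumes offer similar accuracy at every probe-complexity budget, which is why near-identical hypervolumes for M N and the randomization models (Table 13) indicate that the parametric probe extracts comparable information from both.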

L Non parametric probes
Probability difference. In the original formulation (Goldberg, 2019; Wolf, 2019), the effectiveness of each stimulus is determined by the accuracy metric, computed as the number of times the probability of the correct focus word is greater than that of the incorrect one (P(good) > P(bad)). We observed that this metric might not be reliable on its own, since the probabilities may be extremely low for all tokens, even when the focus word probability decreases drastically from M N to M UG . Thus, we also report the mean difference of probabilities, (1/N) Σ_i (P(good_i) − P(bad_i)), scaled up by a factor of 100 for ease of reading, in Figure 9, Figure 8 and Figure 7. We observe the highest difference between the probabilities of the correct and incorrect focus words for the model pre-trained on natural word order (M N ). Moreover, with each step from M 1 to M 4 , the difference between the probabilities of correct and incorrect focus words increases, albeit marginally, showing that models pre-trained with less-perturbed word order (larger intact n-grams) capture more word order information. M UG , the model with the distributional prior ablated, performs worst, as expected.

Accuracy comparison. We provide the accuracy as measured by Goldberg (2019) and Wolf (2019) on the probing stimuli in Table 14, Table 15 and Table 16. We also report the difference in probability (P(good) − P(bad)) in the tables to provide a more accurate picture. All experiments were conducted on three pre-trained seeds for each model in our set of models. However, the low token probabilities in M UG tend to produce unreliable scores. We further found that both datasets are imbalanced: in many conditions, the singular form is the correct inflection (as opposed to plural). This dataset imbalance caused weak models (such as M UG ) to have surprisingly high scores, as the weak models were consistently assigning higher probability to the singular inflection (Table 17). We upsample both datasets, balancing the frequency of correct singular and plural inflections.
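Both metrics can be sketched together. The probability pairs below are illustrative, not drawn from our stimuli: they show the exact situation the mean-difference metric is meant to expose, where accuracy is perfect yet all probabilities, and hence the margins, are tiny.

```python
def cloze_metrics(pairs):
    """Accuracy and scaled mean probability difference over (P(good), P(bad)) pairs.

    Accuracy follows Goldberg (2019)/Wolf (2019): the fraction of stimuli with
    P(good) > P(bad). The mean difference averages P(good) - P(bad) and is
    scaled by 100 for readability, matching the figures.
    """
    acc = sum(pg > pb for pg, pb in pairs) / len(pairs)
    mean_diff = 100.0 * sum(pg - pb for pg, pb in pairs) / len(pairs)
    return acc, mean_diff

# Perfect accuracy, but margins of only a few parts in ten thousand:
# the binary metric saturates while the difference metric stays near zero.
acc, diff = cloze_metrics([(0.0012, 0.0010), (0.0008, 0.0005)])
```

This is why a model like M UG can look deceptively competent under accuracy alone while its probability margins reveal that it has captured little word order information.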
We compute the upsampling target as the next multiple of 100 of the count of original singular inflections. For example, in condition 4 of the Linzen et al. (2016) dataset, we upsample both S and P to 300 rows each. This balancing via upsampling largely alleviated the inconsistencies we observed, and may prove useful when evaluating other models on these datasets in the future.

Table: Marvin and Linzen (2018) stimuli results in raw accuracy. Values in parentheses reflect the standard deviation over different seeds of pre-training. Values in square brackets indicate the mean probability difference between correct and incorrect words. Abbreviations: Simple Verb Agreement (SVA), In a sentential complement (SCM), Short VP Coordination (SVC), Long VP Coordination (LVC), Across a prepositional phrase (APP), Across a subject relative clause (ASR), Across an object relative clause (AOR), Across an object relative (no that) (AOR-T), In an object relative clause (IOR), In an object relative clause (no that) (IOR-T), Simple Reflexive (SRX), In a sentential complement (ISC), Across a relative clause (ARC), Simple NPI (SNP).

Table 18: Sample snapshot of the generated data: original BookWiki sentences (e.g., "They are commonly known as daturas, but also known as devil's trumpets, not to be confused with angel's trumpets, its closely related genus "Brugmansia".") alongside their shuffled variants under each randomization scheme.