Studying word order through iterative shuffling

As neural language models approach human performance on NLP benchmark tasks, their advances are widely seen as evidence of an increasingly complex understanding of syntax. This view rests upon a hypothesis that has not yet been empirically tested: that word order encodes meaning essential to performing these tasks. We refute this hypothesis in many cases: in the GLUE suite and in various genres of English text, the words in a sentence or phrase can rarely be permuted to form a phrase carrying substantially different information. Our surprising result relies on inference by iterative shuffling (IBIS), a novel, efficient procedure that finds the ordering of a bag of words having the highest likelihood under a fixed language model. IBIS can use any black-box model without additional training and is superior to existing word ordering algorithms. Coalescing our findings, we discuss how shuffling inference procedures such as IBIS can benefit language modeling and constrained generation.


Introduction
Is a model's understanding of syntax a precondition for its understanding of natural language? Recent work on large language models (Devlin et al., 2019; Tenney et al., 2019; Rogers et al., 2021) has made this a popular hypothesis. Yet, models that consume only bag-of-words features but rival those that understand syntax have surprised researchers time and again (Iyyer et al., 2015; Joulin et al., 2017). New concerns have emerged that natural language understanding benchmarks may not be challenging enough to make sentence structure relevant (McCoy et al., 2019; Niven and Kao, 2019).
Syntax is an essential aspect of language (Chomsky, 1965). Sentence structure can be quite important: two sentences with very different meanings may use the same set of words (Fig. 1). But how much does syntax, as realized in word order, matter in typical English text? Given the words that make up a sentence, but not their order, is the order usually recoverable? If so, word order rarely encodes more information than is found in the bag of words.
Figure 1 (example sentences): The scared mouse chased the hungry cat. / The hungry cat chased the scared mouse.
In the past, linguists could not have answered this question empirically. Manually ordering words into sentences is too laborious, and when there are multiple orders that satisfy grammatical constraints, one needs a way to choose among them.
With the power of large language models, we can reduce this question to a computational one and resolve both issues: given the bag of words, find the word order that is most likely under a trained LM. To make this search tractable, we develop inference by iterative shuffling (IBIS), a procedure inspired by techniques in combinatorial optimization, that is superior to existing approaches to this problem. Armed with IBIS, we answer the question above statistically and explore the implications.
First, we measure how often sentences and phrases are permutable in text of various genres.
Next, we analyze the effect of word order on the GLUE suite (Wang et al., 2018) and on the task of autoregressive language modeling. Randomly reordering input words drops the performance of models on nearly all tasks, but when we infer the order with the aid of a pretrained LM, this drop is small or absent. Thus, NLP pipelines can effectively consume bags of words as input, and order carries much less meaning than we might imagine.
We conclude with the implications of our results for language modeling. A computationally feasible search for word order clears a path for models that focus on content, rather than syntax, enabling a range of constrained generation applications.
Figure 2 (recoverable rows; NLL followed by text):
IBIS, intermediate step: exterior is complemented by the beautiful ::: the period furniture housed inside.
IBIS, final step: 88.0 Restored in 1967, the beautiful exterior is complemented by the fine period furniture housed inside.
Beam search (64) without future costs: 116.9 is complemented by the beautiful exterior in the fine period furniture housed inside. Restored 1967,
Beam search (64) with future costs: 110.9 Restored exterior is complemented by the beautiful furniture housed in the fine 1967, period inside.
Figure 2: Above: IBIS iteratively infers the word order that has lowest negative log-likelihood (left column) under GPT-2. At each step, the sentence is cut into pieces, which are then rearranged. After several such $k$-opt moves, the original order is found. Below: Reconstructions of the same sentence using algorithms from prior work.

Related work
Research in cognitive science and psycholinguistics has raised the notion that syntax is a convention optimized for communicating bags of concepts over a linear channel. The emergence of syntactic phenomena is explained by information structure constraints (Jaeger, 2010; Levy and Jaeger, 2006; Hahn et al., 2020). Ours is the first large computational study to lend support to this view of syntax.
Circumstantial evidence for the redundancy of word order comes from work such as that of Niven and Kao (2019), which showed that language models' predictions in certain tasks are largely explained by word-level triggers. Concurrently with this work, Sinha et al. (2021a,b), Pham et al. (2021), and Gupta et al. (2021) probed and demonstrated, in various ways, the surprising insensitivity of infilling LMs' performance on GLUE tasks to word order in training and evaluation data. These studies complement our discovery that nearly all of models' accuracy on GLUE tasks can be explained by bags of words only (§5.2) to show that word order rarely carries information useful for classifying textual similarity, entailment, or sentiment.
In the domain of text generation, Khandelwal et al. (2018) found that the order of distant context words has little effect on prediction of the next word in a text. In §5.1 we confirm that the order of recent context strongly affects next-word prediction, but also show that this order can nearly always be inferred from the bag of words.
The problem of inferring word order from bags of words (text linearization) dates back to Elman (1990). This problem has been studied using both treelike and autoregressive models (de Gispert et al., 2014; Zhang and Clark, 2011; Song et al., 2018). Horvat and Byrne (2014) reduce linearization under an $n$-gram model to generalized traveling salesman problems (TSP), but stop short of extending TSP algorithms to neural models, as we do in this work.
Algorithms based on best-first search were proposed in earlier work and by Schmaltz et al. (2016). The latter introduced a beam search with future costs, a key baseline in this paper. In the basic beam search algorithm for ordering a target bag of words, an LSTM model generates text from left to right, expanding a horizon of fixed size. The next-word distributions at each step are restricted to the words in the target bag that have not yet been used. The innovation of future costs is to modify the beam scoring function (which is usually the log-likelihood of the partially generated text) by adding the sum of log-likelihoods under a unigram model of the yet-unused tokens in the target bag.
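The beam search baseline described above can be illustrated with a minimal sketch. It is not the implementation of Schmaltz et al. (2016): the toy bigram scorer `toy_next_nll`, the table `UNIGRAM_NLL`, and the function name `beam_search_order` are hypothetical stand-ins for an LSTM or GPT-2 and a corpus-estimated unigram model. Scores are kept as negative log-likelihoods, so the future cost is added as a sum of unigram NLLs of the unused words.

```python
from collections import Counter

# Toy stand-ins for the base LM and the unigram model (hypothetical values).
GOOD_BIGRAMS = {("<s>", "the"), ("the", "cat"), ("cat", "sat"), ("sat", "down")}
UNIGRAM_NLL = {"the": 1.0, "cat": 3.0, "sat": 3.0, "down": 3.0}


def toy_next_nll(prefix, word):
    """NLL of `word` given the words generated so far (only the last one matters here)."""
    prev = prefix[-1] if prefix else "<s>"
    return 0.1 if (prev, word) in GOOD_BIGRAMS else 2.0


def beam_search_order(bag, next_nll, unigram_nll=None, beam_size=4):
    """Order `bag` left to right, keeping the `beam_size` best partial orders.

    If `unigram_nll` is given, partial orders are ranked by
    NLL(prefix) + sum of unigram NLLs of the still-unused words (future costs).
    """
    beams = [((), 0.0, Counter(bag))]  # (prefix, NLL of prefix, unused words)
    for _ in range(len(bag)):
        expanded = []
        for prefix, nll, unused in beams:
            for word in unused:  # next word is restricted to unused bag words
                rest = unused - Counter([word])
                new_nll = nll + next_nll(prefix, word)
                future = 0.0
                if unigram_nll is not None:
                    future = sum(unigram_nll(w) * c for w, c in rest.items())
                expanded.append((new_nll + future, prefix + (word,), new_nll, rest))
        expanded.sort(key=lambda e: e[0])
        beams = [(p, s, r) for _, p, s, r in expanded[:beam_size]]
    prefix, nll, _ = beams[0]
    return list(prefix), nll


bag = ["sat", "down", "the", "cat"]
order, nll = beam_search_order(bag, toy_next_nll, beam_size=2)
order_fc, nll_fc = beam_search_order(bag, toy_next_nll, UNIGRAM_NLL.__getitem__, beam_size=4)
```

On the final step no words remain unused, so the future cost vanishes and the returned order is the one with the lowest NLL among the surviving beams.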

Text linearization by iterative shuffling
IBIS is motivated by a need not only to generate text from left to right when inferring the most likely word order under a base LM, but to reason over an entire sentence and permute spans or words.
The bottom of Fig. 2 shows some of the failure modes of beam search, with and without future costs. The reconstructions of a short sentence (in italics) by beam search using the GPT-2 Small model (Radford et al., 2019) are ungrammatical. Beam search without future costs is unable to reason that the capitalized word is unlikely to occur in the middle of a text and should come first. Beam search with future costs suffers from the same inability to 'plan ahead': by the end of the sentence, the algorithm begins to fail as it is left with a set of words that cannot be arranged coherently. However, some long spans that appear in the original sentence are generated, such as "is complemented by the" and "the fine period furniture housed inside". The IBIS algorithm, which we will now describe, enables reasoning over the entire text to excise and recombine such coherent spans.
$k$-opt moves. A $k$-opt move is the following operation: a sentence is 'cut' at $k$ positions, creating $k-1$ spans between the cuts. These $k-1$ spans are then permuted (in one of $(k-1)!$ ways) to form a new sentence. Note that the cuts may come immediately before the first or after the last word.
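As an illustration of the definition above, the following sketch applies a single k-opt move to a list of words. The function names `k_opt_move` and `all_k_opt_moves` are hypothetical and chosen for this example only; the code assumes the cut positions are given explicitly.

```python
from itertools import permutations


def k_opt_move(words, cuts, order):
    """Cut `words` at the k positions in `cuts` and permute the k - 1 spans between them.

    cuts:  k positions with 0 <= c_1 <= ... <= c_k <= len(words)
    order: a permutation of range(k - 1) giving the new order of the spans.
    Words before the first cut and after the last cut stay in place.
    """
    cuts = sorted(cuts)
    spans = [words[a:b] for a, b in zip(cuts, cuts[1:])]
    assert sorted(order) == list(range(len(spans))), "order must permute the spans"
    middle = [w for i in order for w in spans[i]]
    return words[:cuts[0]] + middle + words[cuts[-1]:]


def all_k_opt_moves(words, cuts):
    """All (k - 1)! rearrangements obtainable from one choice of cuts."""
    k = len(cuts)
    return [k_opt_move(words, cuts, p) for p in permutations(range(k - 1))]


sentence = "the hungry cat chased the scared mouse".split()
# A 4-opt move: cuts before the first word, around "chased", and after the last word
# give three spans, which are then reversed.
moved = k_opt_move(sentence, cuts=(0, 3, 4, 7), order=(2, 1, 0))
print(" ".join(moved))  # the scared mouse chased the hungry cat
```

With k = 4 cuts there are three spans and hence 3! = 6 possible rearrangements, one of which is the identity.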
Such operations were introduced by Lin and Kernighan (1973) in the context of the TSP on graphs. Recall that a tour of a weighted directed graph is a closed path that visits each vertex exactly once. A $k$-opt move on a tour is the operation of removing $k$ edges and inserting $k$ new edges to create a new tour. Many heuristic algorithms for solving the TSP (finding the tour of lowest total weight) use $k$-opt moves as the core search step; a maximum of $k = 5$ is typical (Helsgaun, 2000).
There is a precise equivalence between text linearization and the TSP on graphs when the base language model is a bigram model. Suppose that the likelihood of a string $w_1 w_2 \dots w_n$ has a factorization
$$p(w_1 w_2 \dots w_n) = \prod_{i=0}^{n} p(w_{i+1} \mid w_i), \tag{1}$$
where $w_0$ and $w_{n+1}$ are a fixed start/end token. We form a bidirected graph with vertices corresponding to the words of the sentence (and the start/end token) and set the weight of the edge from $w_i$ to $w_j$ to $-\log p(w_j \mid w_i)$. An ordering of the words is then equivalent to a tour of the graph, and its negative log-likelihood (NLL) is the weight of this tour. Finding the most likely order is equivalent to solving the TSP on this graph.
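The equivalence can be made concrete with a small sketch. The bigram probabilities in `PROBS` are invented for illustration, and `tour_weight` and `best_order` are hypothetical names; the brute-force search over all permutations is only feasible for very short bags and stands in for a real TSP solver.

```python
import math
from itertools import permutations

BOS = "<s>"  # fixed start/end token (name chosen for this example)

# Toy bigram probabilities; any pair not listed gets a small default probability.
PROBS = {("<s>", "the"): 0.9, ("the", "cat"): 0.8,
         ("cat", "sleeps"): 0.7, ("sleeps", "<s>"): 0.9}


def toy_nll(prev, nxt):
    """Edge weight from `prev` to `nxt`: -log p(nxt | prev)."""
    return -math.log(PROBS.get((prev, nxt), 0.01))


def tour_weight(order, nll):
    """NLL of an ordering = weight of the closed tour <s> -> w_1 -> ... -> w_n -> <s>."""
    path = [BOS] + list(order) + [BOS]
    return sum(nll(a, b) for a, b in zip(path, path[1:]))


def best_order(bag, nll):
    """Exhaustive TSP: the permutation of `bag` whose tour has the lowest weight."""
    return min(permutations(bag), key=lambda o: tour_weight(o, nll))


bag = ["sleeps", "the", "cat"]
best = best_order(bag, toy_nll)
```

Here the tour weight of the best order equals the negative log of the product 0.9 × 0.8 × 0.7 × 0.9, exactly as in the factorization of the string likelihood under a bigram model.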
Local search and the IBIS heuristic. While $k$-opt search was developed with graph tours in mind, it can be applied to any scoring function, such as a language model, which produces an NLL of the next word depending on a long sequence of words preceding it, $-\log p(w_{i+1} \mid w_0 w_1 \dots w_i)$.
A naïve form of $k$-opt local search would find the order of a bag of words most likely under a base LM by beginning with a random candidate order, then repeatedly performing a random $k$-opt move, scoring the resulting order with the model, and accepting it as the new best candidate if it decreases the NLL. However, this approach is inefficient, as we will show below.
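A minimal sketch of this naïve search follows, assuming an arbitrary scoring function `score(order)` that returns an NLL. The toy scorer `toy_score`, the function names, and the default arguments are assumptions made for this example; the stopping rule (no improvement for `patience` consecutive steps) is one simple way to terminate such a search.

```python
import random


def random_k_opt(order, k, rng):
    """Cut `order` at k random distinct positions and shuffle the k - 1 spans."""
    cuts = sorted(rng.sample(range(len(order) + 1), k))
    spans = [order[a:b] for a, b in zip(cuts, cuts[1:])]
    rng.shuffle(spans)
    return order[:cuts[0]] + [w for s in spans for w in s] + order[cuts[-1]:]


def naive_local_search(bag, score, ks=(3, 4, 5), patience=128, seed=0):
    """Random k-opt local search: accept a move only if it lowers the NLL."""
    rng = random.Random(seed)
    best = list(bag)
    rng.shuffle(best)  # random initial order
    best_nll = score(best)
    usable_ks = [k for k in ks if k <= len(best) + 1]
    stale = 0
    while usable_ks and stale < patience:
        candidate = random_k_opt(best, rng.choice(usable_ks), rng)
        candidate_nll = score(candidate)
        if candidate_nll < best_nll:
            best, best_nll, stale = candidate, candidate_nll, 0
        else:
            stale += 1
    return best, best_nll


# Toy scorer (stand-in for an LM): number of adjacent pairs not found in a target sentence.
TARGET = "restored in 1967 the exterior is beautiful".split()
GOOD = set(zip(TARGET, TARGET[1:]))


def toy_score(order):
    return float(sum((a, b) not in GOOD for a, b in zip(order, order[1:])))


found, found_nll = naive_local_search(TARGET, toy_score)
```

Because a candidate is accepted only when it strictly lowers the score, the NLL of the returned order never exceeds that of the random starting order, but nothing guides the search toward promising moves.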
Instead, we propose a heuristic to improve the search. Let the current best order be $w_0 w_1 \dots w_n$. We form an auxiliary graph as above, but set the weight of the edge from $w_i$ to $w_j$ to
$$-\log p(w_j \mid w_0 w_1 \dots w_i), \tag{2}$$
that is, the NLL of word $w_j$ at position $i+1$ given by the base LM. If the LM is a bigram model, then (2) reduces to $-\log p(w_j \mid w_i)$, as before. Now, we observe that the current order is a tour of this graph: $w_0 \to w_1 \to \cdots \to w_n$. We rank all possible $k$-opt moves by how much they decrease the weight of this tour. Then, we create a batch of new candidate orders by performing $k$-opt moves sampled from near the top of this ranking and we score this batch with the LM. The move that decreases the NLL most, if it exists, is accepted, yielding the new best candidate.
We emphasize that this is a heuristic, not an exact method. It is possible that a $k$-opt move decreases the weight of the tour in the auxiliary graph, but that when this move is performed and the sentence rescored by the LM, the NLL does not decrease. This is the case because the next-word probabilities given by the LM may depend on all preceding words. A $k$-opt move may change the context preceding a word $w_i$, which will modify the weights of the edges from $w_i$ to other words in the graph. Nevertheless, the likelihood of a word depends mostly on recent context, especially on the preceding word, making IBIS an efficient heuristic.
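One step of this heuristic might be sketched as follows. This is a simplified illustration, not the released implementation: it enumerates only 3-opt moves (swaps of two adjacent spans), takes the top of the ranking deterministically instead of sampling near it, ignores the end token, and uses a toy scorer `toy_next_nll` whose small position-dependent term makes the auxiliary estimate inexact. All function names are hypothetical.

```python
from itertools import combinations


def full_nll(order, next_nll):
    """Exact NLL: each word is scored given its true prefix."""
    return sum(next_nll(tuple(order[:i]), order[i]) for i in range(len(order)))


def auxiliary_weights(order, next_nll):
    """W[i][j]: NLL of order[j] when it follows the current prefix order[:i].

    Row 0 is the empty prefix (start token); row i + 1 is the edge out of order[i].
    """
    n = len(order)
    return [[next_nll(tuple(order[:i]), order[j]) for j in range(n)] for i in range(n + 1)]


def estimated_nll(idx, W):
    """Tour weight in the auxiliary graph for the arrangement given by indices `idx`."""
    rows = [0] + [i + 1 for i in idx[:-1]]
    return sum(W[r][j] for r, j in zip(rows, idx))


def three_opt_moves(n):
    """All 3-opt moves on n words: swap adjacent spans [i:j] and [j:k]."""
    base = list(range(n))
    for i, j, k in combinations(range(n + 1), 3):
        yield base[:i] + base[j:k] + base[i:j] + base[k:]


def ibis_step(order, next_nll, batch_size=8):
    """Rank moves by estimated NLL, rescore the top batch exactly, accept the best if it improves."""
    W = auxiliary_weights(order, next_nll)
    ranked = sorted(three_opt_moves(len(order)), key=lambda idx: estimated_nll(idx, W))
    batch = [[order[i] for i in idx] for idx in ranked[:batch_size]]
    if not batch:
        return order
    best = min(batch, key=lambda o: full_nll(o, next_nll))
    return best if full_nll(best, next_nll) < full_nll(order, next_nll) else order


GOOD_BIGRAMS = {("<s>", "the"), ("the", "cat"), ("cat", "sat"), ("sat", "down")}


def toy_next_nll(prefix, word):
    """Toy LM: mostly a bigram model, plus a small dependence on prefix length."""
    prev = prefix[-1] if prefix else "<s>"
    return (0.1 if (prev, word) in GOOD_BIGRAMS else 2.0) + 0.01 * len(prefix)


start = ["sat", "down", "the", "cat"]
improved = ibis_step(start, toy_next_nll)
```

For the unchanged order the auxiliary tour weight equals the exact NLL; for a proposed move it is only an estimate, which is why each batch is rescored with the full scorer before a move is accepted.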
Practical considerations. This heuristic for proposing $k$-opt moves is limited by computational constraints, mainly the difficulty of ranking all possible $k$-opt moves in memory. These difficulties arise in classical TSP solvers as well and are typically resolved by additional heuristics and sampling procedures. Our precise answers to these difficulties are described in Appendix A.
All our experiments are initialized with a random order of the target bag and use the heuristic described above to iteratively decrease the NLL. The search is terminated when there is no improvement for a specified number of steps (the 'patience' constant). For our experiments, we use a proposal batch size of 128 and a patience of 128 and limit the search to 3-, 4-, and 5-opt moves.
Our search for optimal $k$-opt moves uses core tensor operations and can run on a GPU (the first implementation of this kind, to the best of our knowledge). Runnable example code is provided in the associated repository: https://github.com/malkin1729/ibis. The reader can run the provided program and see a text of their choice iteratively shuffled into the most likely order.
[...] this choice. First, it enables direct comparison between beam search and IBIS under the same base LM, which is difficult with the LSTM model of Schmaltz et al. (2016) due to an incompatible code base. Second, it simultaneously allows us to measure the effect of the base model on results using the same inference procedure: IBIS is model-agnostic and can work with any base LM that produces next-token likelihoods. Third, GPT-2 is trained on generic English text and can handle arbitrary strings, making it a natural candidate for use in all parts of this paper. We chose the Small variant of the model for computational efficiency.
We used the published code of Schmaltz et al. (2016) to tokenize the sentences, with minor processing for compatibility with GPT-2's tokenizer. We then evaluated beam search with beam sizes 512 and 1024, with and without future costs. The future cost function used unigram frequencies estimated from the GPT-2 training data. Any input word that is broken into multiple tokens by GPT-2's subword tokenizer was always generated as a single unit. These baselines are thus directly comparable with the results of Schmaltz et al. (2016).

IBIS inference.
We ran the IBIS algorithm on this data, also using GPT-2 as the base LM, to infer an order of each sentence in the evaluation set.
To ensure that the bags of input and output words coincided, we did not allow $k$-opt moves that broke intact words between GPT-2 subword tokens.
Results and discussion. The BLEU scores of reconstructed sentences with respect to the original orders, and their NLLs per word, are shown in Table 1. IBIS outperforms beam search with size 512 and future costs (the strongest procedure in past work) by a large margin. Doubling the beam size, and the computation time, closes less than half of the gap in BLEU and in mean log-likelihood.
A less surprising, yet still meaningful, comparison is between base LMs: beam search with future costs with Schmaltz et al. (2016)'s LSTM model and with GPT-2 Small (first and third rows). The former model is trained on a mix of target-domain (PTB) data, other datasets of news articles, and the Gigaword corpus, yet still resorts to using OOV tokens in place of infrequent words; GPT-2 is trained on a (biased) crawl of the Internet and processes rare words as sequences of subword tokens.
Finally, the mean log-likelihood per word under the base LM of sentences reconstructed with IBIS exceeds that of the original text. There are two ways to interpret this result. On one hand, it shows the strength of IBIS as an optimization algorithm: it may indeed be possible to permute the words in the original text, perhaps into an order more acceptable to human judgment, making word order more normative without changing the meaning. On the other hand, it shows that IBIS approaches the limit of what a text linearization algorithm that optimizes for GPT-2 Small likelihood can achieve as measured in BLEU score, which can be seen as a limitation of the base LM itself.
Computation cost. It is difficult to directly compare computation costs of beam search and IBIS due to the very different nature of these algorithms. A measure that corresponds well to the evaluation time is the number of calls to the base LM. During IBIS inference over the dataset of 2416 sentences, 67m strings were scored by GPT-2. Beam search with beam size 1024 and future costs would make approximately 60m calls to GPT-2 if all words were single tokens; handling of subwords increases this number to 76m. The computation time for the two algorithms was approximately equal. Thus, IBIS is comparable in computation to the beam search baseline with beam size 1024 and future costs, but performs significantly better.
Unlike beam search, which infers order incrementally from left to right, IBIS works with the entire string at every step and can be stopped early to set a balance between time and output quality. As we note below, IBIS exceeds beam search in log-likelihood per word after far fewer search steps than were performed in our experiment.

Table 2: A sentence reordered by IBIS using five base LMs, with the original order for comparison.
Distil-GPT-2: Ibises, all mud and crustaceans, usually feed as a group, have long downcurved bills, usually probing for food items.
GPT-2 Small: Ibises, all mud crustaceans, usually as a group, have long downcurved bills, usually probing for food items and feed.
GPT-2 Medium: Ibises as a group usually have long downcurved bills, usually probing for feed, mud, crustaceans, and all food items.
GPT-2 Large: Ibises usually feed as a group, usually have long, downcurved bills, all probing for food items, mud and crustaceans.
GPT-2 XL: Ibises usually feed as a group, and all have long, downcurved bills, probing for food items, usually mud crustaceans.
original: Ibises all have long, downcurved bills, and usually feed as a group, probing mud for food items, usually crustaceans.
Dependence on base LM. We tested IBIS on the PTB dataset using four other GPT-2 variants: the lighter Distil-GPT-2 and the larger GPT-2 Medium, Large, and XL; the BLEU scores are shown in Table 1. More powerful base LMs improve the performance of IBIS, due to their greater 'world knowledge' or understanding of syntax, yet IBIS using the smallest model, Distil-GPT-2, still outperforms beam search with GPT-2 Small. Table 2 shows a sentence reordered using all five models.
Importance of the heuristic. We demonstrate the importance of the IBIS heuristic for proposing $k$-opt moves by performing the naïve $k$-opt local search described in §2 (randomly sampling $k$-opt moves, but keeping all other search parameters the same as for IBIS). On the PTB test data, the sentence reconstructions by this algorithm are significantly worse than those by IBIS: the search tends to exceed the patience (128 steps without improvement) at a higher negative log-likelihood.
Figure 4: Mean NLL per word (cf. Table 1) as a function of the number of $k$-opt batches, with and without the IBIS heuristic. IBIS tends to converge faster than an unguided $k$-opt search and stabilizes at lower NLL, yet both reach lower NLL than the computationally comparable beam search baseline (beam size 1024 with future costs).
Fig. 4 shows the mean negative log-likelihood per word as a function of the number of search steps, averaged over all sentences in PTB. IBIS reaches a lower NLL than the strongest beam search algorithm we evaluate after just 59 size-128 batches (equivalent to about 1/4 of the number of calls to GPT-2 made by beam search) and a lower NLL than the original text after 169 batches. Random $k$-opt search requires 6 times as many steps to reach the NLL of beam search.
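The "about 1/4" figure can be checked with a short calculation from numbers stated in this section (2416 sentences, 76m beam search calls, batches of 128). The sketch assumes that the 59 batches apply to each sentence and that every string in a batch counts as one call to GPT-2; the variable names are chosen for this example.

```python
n_sentences = 2416          # size of the evaluation set
beam_calls_total = 76e6     # calls to GPT-2 by beam search (size 1024, future costs)
ibis_batches = 59           # batches after which IBIS passes the beam search NLL
batch_size = 128            # strings scored per batch

ibis_calls_per_sentence = ibis_batches * batch_size          # 7552
beam_calls_per_sentence = beam_calls_total / n_sentences     # roughly 31,500
ratio = ibis_calls_per_sentence / beam_calls_per_sentence    # roughly 0.24

print(f"IBIS calls per sentence: {ibis_calls_per_sentence}")
print(f"Beam search calls per sentence: {beam_calls_per_sentence:.0f}")
print(f"Ratio: {ratio:.2f}")
```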
Curiously, random $k$-opt search reaches a better NLL but a worse BLEU score than the beam search baseline, suggesting that beam search is good at correctly generating short spans of text (benefiting BLEU), but $k$-opt search is better at reasoning over the entire sentence (benefiting total likelihood).

Experiments: (Im)permutability
Using the IBIS algorithm, we analyze the importance of word order in English text of different genres. We use three publicly available corpora covering different domains of textual expression:
Yelp: About 560k Yelp reviews, commonly used as a text classification benchmark (kaggle.com/ilhamfp31/yelp-review-dataset).
Wiki: 2m Wikipedia articles (Shaoul, 2010) (www.psych.ualberta.ca/~westburylab/).
arXiv: 1.7m scientific preprint abstracts (Clement et al., 2019), filtered to remove TeX (kaggle.com/Cornell-University/arxiv).
IBIS can be applied to any bag of tokens, including punctuation marks, as shown in Fig. 2 and Table 2. In §3, we shuffled all tokens in a sentence to be consistent with the setup in prior work. However, to measure whether order is essential to conveying meaning, we face the problem of discerning whether punctuation marks are used to structure a compound thought or to separate distinct thoughts: the pathological example is a stylistic choice to replace all periods with semicolons.
To simply investigate how long a phrase needs to be before it becomes possible to find significant rearrangements with higher likelihoods, we limit our analysis to text spans of two kinds: sentences that contain no punctuation and spans of text between two consecutive punctuation marks. We sample 1000 such sentences and spans from each of the three domains with lengths (in words) falling into each of several buckets (Table 4) and infer their most likely word orders using IBIS. For the spans between punctuation, 50 words of ordered context before the initial punctuation mark are provided for scoring of candidate word orders.
We analyze the reconstructions automatically using BLEU scores and via human evaluation.
Results and discussion. Table 4 shows the BLEU scores of the IBIS-inferred spans with respect to the original orders, as well as the ratio of perplexities under GPT-2 of the original and reconstructed texts. IBIS often finds orders that are more likely than the original ones in all three domains. The similarity of reconstructed and original spans at small lengths (< 30 tokens) is remarkably high. We also see some differences between the domains. Especially at higher lengths, sentences and spans from Wikipedia and Yelp are difficult to permute into sentences with higher likelihood (PR > 1). Reconstructed sentences from Yelp and arXiv retain fewer of the original 2-, 3-, and 4-grams (lower BLEU). Indeed, long Yelp sentences tend to 'ramble' using many frequent words, arXiv sentences are full of scientific terms that a nonexpert can easily permute without losing grammaticality, and Wikipedia sentences have a measured style more familiar to GPT-2 (see Table 3).

Table 3 (original order, then IBIS-inferred order):
Yelp, original: I was just there again in April and bragged to my friends about how great it was and we were all horribly disappointed.
Yelp, inferred: just bragged to all my friends about how great it was in April and we were there again and I was horribly disappointed.
Wiki, original: Mao Zedong's philosophical essay furthered Marx and Lenin's thesis and suggested that all existence is the result of contradiction.
Wiki, inferred: Marx and Lenin's philosophical essay suggested and furthered Mao Zedong's thesis that all existence is the result of contradiction.
arXiv, original: We introduce a longevity feature to the classical optimal dividend problem by adding a constraint on the time of ruin of the firm.
arXiv, inferred: We introduce the classical problem of the optimal ruin of a firm by adding longevity to the feature time constraint on a dividend.

Table 4: Comparison of original and IBIS-inferred orders of punctuationless sentences (above) and spans between punctuation (below) of different lengths. We report BLEU score and ratio of perplexities (PR). A PR less than 1 indicates that IBIS reaches lower NLL per word than the original text.
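The perplexity ratio can be computed from total NLLs as in the sketch below. It assumes that PR is the perplexity of the IBIS-inferred order divided by that of the original order, which is the direction implied by "a PR less than 1 indicates that IBIS reaches lower NLL per word"; the NLL values in the usage lines are invented for illustration, and the function names are hypothetical.

```python
import math


def perplexity(total_nll, n_words):
    """Per-word perplexity from a total negative log-likelihood (natural log)."""
    return math.exp(total_nll / n_words)


def perplexity_ratio(nll_inferred, nll_original, n_words):
    """PR < 1 means the inferred order is more likely than the original one."""
    return perplexity(nll_inferred, n_words) / perplexity(nll_original, n_words)


# Hypothetical NLLs for a 16-word sentence.
pr_better = perplexity_ratio(nll_inferred=80.0, nll_original=88.0, n_words=16)
pr_same = perplexity_ratio(nll_inferred=88.0, nll_original=88.0, n_words=16)
```

Both orders contain the same number of words, so the ratio simplifies to the exponential of the difference in NLL per word.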
Human evaluation. Three human subjects were asked to rate the relationship of IBIS-inferred punctuationless sentences to the original texts. We sampled 50 sentences from each of the five length buckets from the Wiki dataset; annotators ranked each pair (original sentence, IBIS-inferred order) on a scale of 0 (the inferred order is unreadable or completely dissimilar to the original sentence) to 3 (the original and inferred orders are identical, achieved for 47 of the 250 sentences).
For punctuationless sentences with fewer than 20 words, more than half of pairs were given scores of 2 or 3 (similar or identical meaning). This number sharply drops with increasing length, but long punctuationless sentences are rare in normal text: most sentences without punctuation have fewer than 20 words. More details can be found in Appendix D.

Experiments: (Dis)order in NLP tasks
In this section we answer the question: How well could language models perform on standard NLP tasks if they were not given access to word order?

Word order and text generation
Left-to-right (autoregressive) text generation remains a principal direction of NLP research. How well could models such as GPT-2 generate text if, when prompted to generate the next word in a text, they did not know the order of the previous words? We measure the performance of GPT-2 in generating the next token in a text where the order of the previous words is treated as a latent variable.
GPT-2 is a generative model of tokens, where preceding tokens (context) are used as predictor variables. We break this context into two parts: the distant ordered context $C$, followed by a bag of tokens $B = \{w_1, \dots, w_n\}$ whose order is not known (50 tokens in total). We evaluate the perplexity and word prediction accuracy of GPT-2 on a sample from OpenWebText, a reacquired version of the model's training data, under three schemes for predicting the next word $w_{n+1}$ given $C$ and $B$:

Latent. The order of the bag of tokens $B$ is treated as a latent variable. We denote the order by $\sigma$, a discrete latent variable taking values in permutations ($\sigma : \{1, \dots, n\} \to \{1, \dots, n\}$). By scoring each of these permutations following the context $C$ under the base LM, we compute the posterior distribution over this latent variable conditioned on $C$ and $B$:
$$p(\sigma \mid C, B) \propto p(w_{\sigma(1)} \dots w_{\sigma(n)} \mid C).$$
Under any order $\sigma$ of the past tokens, the LM gives a distribution over the word $w_{n+1}$,
$$p(w_{n+1} \mid \sigma, C, B) = p(w_{n+1} \mid C\, w_{\sigma(1)} \dots w_{\sigma(n)}).$$
In this setting, we predict $w_{n+1}$ by integrating out the latent $\sigma$ (i.e., summing over all possible orders):
$$p(w_{n+1} \mid C, B) = \sum_{\sigma} p(\sigma \mid C, B)\, p(w_{n+1} \mid \sigma, C, B). \tag{3}$$

Top. The same as Latent, but using only the top $\sigma$, i.e., $p(w_{n+1} \mid \arg\max_{\sigma} p(\sigma \mid C, B), C, B)$.

Random. In this case, we assume a uniform distribution over orders of the bag and predict
$$\frac{1}{n!} \sum_{\sigma} p(w_{n+1} \mid \sigma, C, B).$$
This expression differs from (3) in that the likelihood under the base LM of the order of recent context is not taken into account: the order of the bag is assumed to be randomly sampled.

Table 6: Standard metrics of finetuned BERT models (mean of 32 random seeds) on the GLUE benchmark tasks, evaluated on raw validation data, data with randomly ordered words, and data with word order inferred by IBIS.
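The three prediction schemes can be compared on a toy example. The sketch below enumerates all orders of a small bag and uses a hypothetical smoothed bigram model (`next_probs`) in place of GPT-2; the vocabulary, the weights, and the function names are assumptions made for illustration.

```python
from itertools import permutations

VOCAB = ["the", "cat", "sat", "down"]
BIGRAMS = {("the", "cat"), ("cat", "sat"), ("sat", "down")}


def next_probs(context):
    """Toy LM: next-word distribution that depends only on the last context word."""
    prev = context[-1] if context else "<s>"
    raw = {w: (10.0 if (prev, w) in BIGRAMS else 1.0) for w in VOCAB}
    z = sum(raw.values())
    return {w: v / z for w, v in raw.items()}


def seq_prob(context, seq):
    """Probability of the word sequence `seq` following the ordered `context`."""
    p, ctx = 1.0, list(context)
    for w in seq:
        p *= next_probs(ctx)[w]
        ctx.append(w)
    return p


def predict(context, bag, scheme):
    """Next-word distribution under the 'latent', 'top', or 'random' scheme."""
    orders = sorted(set(permutations(bag)))
    if scheme == "random":
        weights = [1.0] * len(orders)          # uniform over orders
    else:
        weights = [seq_prob(context, o) for o in orders]  # unnormalized posterior
        if scheme == "top":
            best = max(weights)
            weights = [1.0 if w == best else 0.0 for w in weights]
    z = sum(weights)
    out = {w: 0.0 for w in VOCAB}
    for order, weight in zip(orders, weights):
        for w, p in next_probs(list(context) + list(order)).items():
            out[w] += weight / z * p
    return out


bag = ["cat", "the"]
latent = predict([], bag, "latent")
top = predict([], bag, "top")
rand = predict([], bag, "random")
```

In this toy case the order "the cat" receives posterior weight 10/11, so the latent prediction for "sat" (101/143) stays close to the top-order prediction (10/13), while the uniform average over orders drops to 11/26.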
Results and discussion. The perplexity and token accuracy of GPT-2 Small under the Latent, Top, and Random schemes are shown in Table 5.
Remarkably, even for 7 tokens of unordered context, integrating over a latent order reduces accuracy only about 2% from the model that has access to fully ordered context, and 83% of the time, the true order of the bag of 7 preceding tokens has the highest likelihood out of 7! = 5040 possible orders.
In light of the latter, it is unsurprising that the Top method has only slightly worse metrics than Latent. However, the model rapidly degenerates when we randomly sample the order of the previous tokens. Indeed, the dependence of a word on a context word appearing $d$ positions earlier sharply decreases with $d$. When the number of shuffled tokens is large, the recent words, which are most predictive of the target word, are often moved far back in the context (see Appendix C for an example).
GPT-2 was trained with the objective of predicting a word given ordered context. We have shown that GPT-2 is able to infer the order of the context itself, then use it for prediction, while losing little in accuracy and perplexity. This is as much a result about language as it is a result about language models: the bag of tokens carries almost as much information as the ordered sequence.

Word order and GLUE
We evaluate the dependence of the GLUE benchmark (Wang et al., 2018) on word order in the input data. Because the inputs are often long, a full search over orders is infeasible, so we use IBIS.
Specifically, for each of the 9 tasks, we finetuned the BERT-Base model (Devlin et al., 2019), a standard baseline, on (ordered) training data using typical settings. We then ran the IBIS algorithm with GPT-2 Small as the base LM to infer an order of the bag of words in each validation set sentence (in tasks with two input sentences per example, the sentences were ordered independently). The finetuned models were then evaluated on this IBIS-ordered validation data. For comparison, we evaluated the same models on validation data with words ordered randomly, as well as on the original orders.

Table 7 examples:
Fairy in techniques for coronavirus vaccine has been confirmed by research.
The mouse was hungry. It started feeding itself by taking UndergroundMISC's engineered growth cheese assay. Now the mouse is still hungry.
The cat was hungry. Someone picked up the mouse and chased the cat away from it. Now the cat is still hungry.
The cat liked mice. His appetite for sweet treats was a little more intense. Now the cat hates mice.
The Dragon King has reached out to the court at the request of a judge with the Magic Kingdom. The Minister of Magic declined to comment.
Table 7: Constrained generation using IBIS variants (Appendix B): sentences were forced to begin or end with the underlined spans and to contain the bold words in any order; all other words were generated by the model.
Results and discussion. The standard evaluation metrics for these models are shown in Table 6. At least 95% of the prediction accuracy on tasks related to textual entailment (MNLI, QNLI, RTE), 97% on tasks evaluating similarity detection (MRPC, QQP) [9], and 94% on the sentiment classification task (SST-2) is explained by bags of words alone. That is, such high scores can be achieved by a model that consumes only bags of words as input. Our model is a composition of a combinatorial search (IBIS) with a feedforward model (BERT), but these results place a lower bound on what models that are not given word order can achieve.
The last two tasks, CoLA and WNLI, do not follow this pattern. The anomaly of WNLI, which tests resolution of ambiguous anaphora, seems to be due to the tiny size of the data (71 validation examples); the baseline models, on average, perform worse than random guessing. On the other hand, CoLA tests grammaticality judgments, which clearly depend on word order; many examples have no grammatical order in the first place. The large drop in Matthews correlation is unsurprising.
The finetuned models perform substantially worse on evaluation data with randomly ordered words on all tasks (except WNLI), though still much better than chance (except on CoLA). We conclude that the trained models need word order to perform, but that the word order itself carries little information, as we can infer word orders that result in near-baseline evaluation scores.

Conclusion and future work
We have shown that word order in an English sentence encodes surprisingly little information in addition to that contained in the bag of words. NLP models such as BERT and GPT-2 depend on order when creating representations of text, because they were trained on ordered words, but at the same time do not strictly need it, since their understanding of syntax, and the compressed world knowledge that they hold, are sufficient to infer word order.
[9] The metrics for STS-B are correlations, not accuracy.
It would be interesting to use techniques such as IBIS to investigate humans' capacities for syntax, both productive and receptive. Are sentences with unlikely word order, as measured by a language model, more likely to lead to confusion (as in the first and last rows of Table 8)?
A bolder conjecture states that many aspects of English syntax can be explained by optimality for language modeling. If, for a corpus of unordered sentences, we jointly infer a most likely word order for each example and a language model that fits these orders, do the inferred orders recover true English syntax, or at least a syntax satisfying known cross-lingual universals? If so, we would be led to vastly generalize the main claims of Levy and Jaeger (2006) and Hahn et al. (2020). An iterative ordering algorithm like IBIS is an essential step towards answering such questions.
Our work can guide and motivate research into combining long-range dependencies in the evolution of content (vocabulary constraints such as sentiment, global story arcs, rules of rhyme and meter, etc.) with models like GPT-2 that are capable of generating and scoring text. As we discuss in Appendix B, variants of IBIS can be used for a wide variety of such constrained generation tasks by making some of the words in the bag latent and sampling them in concert with $k$-opt moves: generating text from keywords, constraining text to a fixed length, composing poetry, and others where beam search is inefficient (Table 7).
Thus, IBIS is an attractive, flexible alternative to beam search in generative language models. It may find applications well beyond word ordering.

Ethics statement
We use this section not only to promote discussion of possible societal impacts, but also to help researchers keep certain things in mind when they look to use our method and results.
Annotation Process. All three human annotators (§4) have English as their native or first language and hold at least a college degree. They were all compensated at the rate of US$15 per hour. They were made aware that the first of the two sentences they are shown is from English Wikipedia while the second sentence is a reordering that need not be grammatical. We share the full set of instructions given to help them do the rating task in Fig. 8.
Use of Large Language Models. The usage of large language models has significant environmental and financial impacts. However, the majority of the cost is incurred in training a new large language model, rather than in using an already trained, fixed language model as we do in our algorithm. We posit that relying on computationally feasible inference with existing large language models and resources, instead of training even bigger models or using methods that require months of compute, makes our work more usable and accessible.
Large pretrained language models are also known to carry significant social biases, and this might affect the optimal ordering of a bag of words that our approach may find (since our search space stems from the probabilities learned by large language models). In fact, it will be interesting to conduct a study specifically focused on language model preferences for word order in cases where the subject and object of a text operate in an imbalanced power hierarchy: we expect the training data of language models to have an impact on the recovered word order. On a broader note, our findings of how sentence structure may often be redundant in English text could be a premise for further work in sociolinguistics about variation in norms of word order.
Language. It is important to keep in mind that our experiments pertain to the English language and that our findings and implications should not be transferred to other languages without further experimentation. For example, we may expect measures of permutability to differ significantly in synthetic languages that mark grammatical roles by suffixation and have a freer word order: a Russian or Warlpiri sentence is more likely to have a grammatical reordering than an English or Mandarin one, but this reordering may have nearly the same meaning as the original sentence, with the context of surrounding sentences playing a large role in conditioning topicalization. In other highly agglutinative languages, entire complex sentences can be expressed in a single (orthographic) word, and the very notion of 'word' as separate from 'morpheme' is difficult to define, which poses a challenge for NLP models.

A More on IBIS

A.1 Search parameters and code
We describe the search strategy of IBIS. As noted in the main text, a complete enumeration of k-opt moves to rank in the batch proposal step is not feasible. Thus we do the following:
(1) At each step, we randomly sample k ∈ {3, 4, 5} and one of the (k − 1)! permutations of the spans resulting from cutting the candidate sentence at k points. We search only for k-opt moves that permute the spans according to this permutation.
(2) For long sentences it is impractical or infeasible, due to memory constraints, to compute the improvement in tour weight under every k-opt move: for a sentence of length n, the number of such moves is O(n^k). Thus we sample a smaller set of candidate cut positions and score only k-opt moves that cut the sentence at positions in this set, alternating two strategies: (a) sampling 20 (k = 5) or 40 (k = 4) random candidate cut positions and (b) taking between 7 and 14 consecutive cut candidates at a random position in the text.
(3) We rank all k-opt moves, with the given k and permutation of spans, that cut at the candidate positions by how much they improve the current tour of the auxiliary graph with its current weights. A random subset of the top-ranked moves is proposed as the candidate batch: we draw 128 moves, the batch size, from the top 512.
Our experiments were run on a mixture of Nvidia Tesla K80 and P40 GPUs. The latter are able to run GPT-2 Large with batch size 128 on texts the length of the longest sentence in the PTB dataset. Figure 6 shows the distribution of search steps and accepted k-opt moves by sentence length.
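The batch proposal procedure in steps (1) to (3) can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the code from the associated repository: the function names are hypothetical, `gain` is a caller-supplied stand-in for the improvement in tour weight on the auxiliary graph, the first and last spans are kept in place, and the number of random cut candidates for k = 3 (not stated in the text) is assumed to be 40.

```python
import itertools
import random

def candidate_cuts(n_tokens, k, step, rng):
    """Candidate cut positions; a cut at i falls between tokens i-1 and i."""
    cuts = list(range(1, n_tokens))
    if step % 2 == 0:
        # Strategy (a): random positions. 20 for k = 5 and 40 for k = 4 follow
        # the text; using 40 for k = 3 as well is an assumption of this sketch.
        size = 20 if k == 5 else 40
        return sorted(rng.sample(cuts, min(size, len(cuts))))
    # Strategy (b): between 7 and 14 consecutive positions at a random offset.
    width = min(rng.randint(7, 14), len(cuts))
    start = rng.randrange(len(cuts) - width + 1)
    return cuts[start:start + width]

def apply_move(tokens, cuts, perm):
    """Cut at the k sorted positions in `cuts`; reorder the k-1 inner spans."""
    bounds = [0, *cuts, len(tokens)]
    spans = [tokens[a:b] for a, b in zip(bounds, bounds[1:])]
    inner = [spans[1 + i] for i in perm]
    return spans[0] + [t for span in inner for t in span] + spans[-1]

def propose_batch(tokens, gain, step, rng, top=512, batch=128):
    """Sample k and a span permutation, rank moves, draw a random batch."""
    k = rng.choice([3, 4, 5])
    perm = list(range(k - 1))
    while perm == sorted(perm):  # resample until the permutation is not the identity
        rng.shuffle(perm)
    cands = candidate_cuts(len(tokens), k, step, rng)
    moves = [apply_move(tokens, c, perm)
             for c in itertools.combinations(cands, k)]
    moves.sort(key=gain, reverse=True)  # `gain` is a hypothetical scoring function
    best = moves[:top]
    return rng.sample(best, min(batch, len(best)))

rng = random.Random(0)
sentence = "the quick brown fox jumps over the lazy dog again and again today".split()
proposals = propose_batch(sentence, gain=lambda toks: 0.0, step=0, rng=rng)
```

Each proposal is a reordering of the same bag of words; in a full search, the proposals would then be scored by the language model and the best one accepted if it improves the likelihood.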

A.2 Visualizations
The associated repository includes three videos showing the evolution of texts with a widely known order, and weight matrices of the auxiliary graphs, as IBIS shuffles them into their optimal orders. Table 8 shows more examples of original sentences and the word orders restored by IBIS, meant to illustrate its various success and failure modes. Of note, the first example is commonly misinterpreted by humans as having the opposite meaning. It would be interesting to study whether difficulty in communication between humans arises when spoken sentences can be permuted into text that is more likely and has a very different meaning.

A.3 Permutation examples
Original: Let's make a bet: winner owes loser 50 dollars.
IBIS: Let's make a bet: loser owes winner 50 dollars.

Original: The mouse chases the cat.
IBIS: The cat chases the mouse.

Original: Thoughts without content are empty, intuitions without concepts are blind.
IBIS: Thoughts without content are empty, intuitions without concepts are blind.

Original: Experience without theory is blind, but theory without experience is mere intellectual play.
IBIS: Experience without theory is blind, but experience without theory is mere intellectual play.

Original: Heat 12 oz. light beer, 1/2 tsp. Dijon mustard on low; whisk in 4 c. shredded sharp cheddar cheese until melted and smooth.
IBIS: Heat 1/2 tsp. Dijon mustard, melted in 4 oz. beer; whisk 12 c. shredded cheddar cheese on low light until smooth and sharp.

Original: To be, or not to be, that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles, and by opposing end them: to die, to sleep; no more; and by a sleep, to say we end the heart-ache, and the thousand natural shocks that flesh is heir to?
IBIS: To say we suffer is to take a thousand shocks, whether in the flesh, or by the nobler heart, or by the more outrageous sea-ache and slings of arrows, and to be against them: and to end troubles, to end arms, to be of the opposing mind; and to die: no question, that 'tis not a sleep; that sleep is the natural heir to fortune?

Original: Remarkably, even for entire paragraphs, the heuristic search is able to find mostly grammatical and somewhat coherent orderings.
IBIS: Remarkably, heuristic search is mostly grammatical, somewhat coherent and even able to find the orderings for entire paragraphs.

Original: It certainly was cold, he concluded, as he rubbed his numbed nose and cheek-bones with his mittened hand.
IBIS: It was certainly cold, he concluded, as he numbed his hand and rubbed his nose with mitten-ed cheekbones.

Original: This much-needed paper fills a gap in the literature.
IBIS: This paper fills a much-needed gap in the literature.

B IBIS beyond linearization
In this section, we explore a few additional applications of IBIS and its variants. The generated examples presented here were chosen out of multiple runs for each prompt and not thoroughly evaluated by automatic metrics, but are rather intended to suggest possible uses and advantages of order-free generation.

B.1 Latent bags of words and lexical constraints
Suppose that we aim to generate a text with lexical constraints; e.g., the number of instances of particular words or the total length of the text may be fixed. Under an autoregressive LM, sampling from the set of sequences where the constraints are satisfied is in general intractable; for instance, it is even intractable to sample from the distribution over sequences of length 10 ending with a period. The IBIS algorithm can be modified to search for the most likely sequences of tokens satisfying such constraints. For example, suppose we are generating a text with six tokens that is required to contain the words 'cat' and 'mouse'. We initialize a search with a sentence with 'cat', 'mouse', and four random words. To the IBIS search step of permuting spans of text, we add a step of replacing any word other than 'cat' and 'mouse' with any other word in the vocabulary. A batch proposal heuristic is possible here as well: we sample candidate replacement words at a position from the (perhaps softened) distribution over words at this position under the base LM given the current context. Table 9 shows examples of sentences inferred by such a search, constrained to begin with certain words and to contain certain other words.
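The word replacement step can be illustrated with a toy Python sketch. Everything named here is hypothetical: `replacement_step`, `toy_score`, and `toy_propose` are illustrative stand-ins, the bigram-counting score replaces the likelihood under the base LM, and the fixed candidate list replaces sampling from the LM's distribution at the chosen position. The sketch shows only the replacement step; in the search described above it would be interleaved with span-permuting steps.

```python
import random

def replacement_step(tokens, required, propose, score, rng):
    """Try replacing one free (non-required) word; keep the best-scoring text."""
    free = [i for i, tok in enumerate(tokens) if tok not in required]
    if not free:
        return tokens
    i = rng.choice(free)
    best, best_score = tokens, score(tokens)
    for word in propose(tokens, i):
        cand = tokens[:i] + [word] + tokens[i + 1:]
        cand_score = score(cand)
        if cand_score > best_score:  # greedy: accept only strict improvements
            best, best_score = cand, cand_score
    return best

# Toy stand-ins (assumptions of this sketch, not part of IBIS itself).
VOCAB = ["the", "a", "chased", "saw", "hungry", "quickly"]
GOOD_BIGRAMS = {("the", "cat"), ("cat", "chased"), ("chased", "the"),
                ("the", "mouse"), ("the", "hungry"), ("hungry", "cat")}

def toy_score(tokens):
    """Count plausible adjacent word pairs; a real search would use LM likelihood."""
    return sum(pair in GOOD_BIGRAMS for pair in zip(tokens, tokens[1:]))

def toy_propose(tokens, i):
    """Return candidate words; a real search would sample them from the LM."""
    return VOCAB

rng = random.Random(0)
required = {"cat", "mouse"}
tokens = ["quickly", "cat", "a", "saw", "mouse", "a"]  # six tokens, two required
for _ in range(50):
    tokens = replacement_step(tokens, required, toy_propose, toy_score, rng)
print(" ".join(tokens))
```

Because positions holding required words are never selected for replacement, the constraint stays satisfied throughout the search, and the text length is fixed by construction.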

B.2 Reverse generation
IBIS is readily modified to fix the last few words of the generated text by simply restricting the set of candidate cut positions for k-opt moves. Thus we can generate text constrained to end with given words. IBIS with a word replacement step enables a faster and more robust reverse generation using only a forward LM (Table 10). Notice that this search is able to find strings relevant to future context: when the sentence is forced to end with a span about cats, replacing a word in the middle of the sentence with 'cat' increases the likelihood of the entire text. At some point in the search, 'cat' gets sampled as a replacement, and the search enters an NLL sink: the word 'cat' is now likely to remain.
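To make the restriction on cut positions concrete, here is a minimal Python sketch. The helper names `allowed_cuts` and `reorder` are hypothetical, and the sketch assumes, as one possible convention, that the first and last spans of a move stay in place, so that any move cutting only at the allowed positions leaves the fixed ending untouched.

```python
import itertools

def allowed_cuts(n_tokens, n_fixed):
    """Cut positions that keep the last n_fixed tokens inside the final span."""
    return list(range(1, n_tokens - n_fixed + 1))

def reorder(tokens, cuts, perm):
    """Cut at the sorted positions in `cuts`; permute the inner spans by `perm`."""
    bounds = [0, *cuts, len(tokens)]
    spans = [tokens[a:b] for a, b in zip(bounds, bounds[1:])]
    inner = [spans[1 + i] for i in perm]
    return spans[0] + [t for span in inner for t in span] + spans[-1]

tokens = ["it", "was", "fed", "someone", ".",
          "Now", "the", "cat", "is", "still", "hungry", "."]
suffix = tokens[5:]  # the ending that must be preserved
cuts = allowed_cuts(len(tokens), len(suffix))

# Every 3-cut move that swaps the two inner spans keeps the suffix in place.
candidates = [reorder(tokens, c, [1, 0])
              for c in itertools.combinations(cuts, 3)]
```

The same restriction composes with a word replacement step: replacements are simply never proposed at positions inside the fixed ending.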

B.3 Rhyming constraints
To give a taste of what further applications are possible, we use a modified IBIS to generate short verses. Rhyme and meter are lexical constraints that can be incorporated into word replacement search steps: for example, the words sampled for replacement at certain positions may be required to lie in the set of words that rhyme with an already generated line.
Let us rewrite Shelley's famous lines:
Rise like Lions after slumber
In unvanquishable number-
Shake your chains to earth like dew
Which in sleep had fallen on you-
Ye are many-they are few.
with the help of Distil-GPT-2, forced to keep the two underlined lines (the second and the last) and generate two new rhyming lines of appropriate length:
If you have a ton of lumber
In unvanquishable number-
Then enjoy your lumber stew-
Ye are many-they are few.
Similarly, the following haiku verses, the result of a human-machine collaboration between the authors and GPT-2, were constrained to use the bold words and satisfy metric constraints.
Fuji. Simple frog, humble, short feet, do you know the distance to home?
See the early moon
Night watcher's little lantern
But I see nothing

C On random and latent next-word prediction
The goal of this short section is to illustrate how randomly sampling an order of recent context degrades the performance of GPT-2 in predicting the next token, while inferring it as a latent order does not. Consider the input: "Our Father, who art in Heaven, hall|owed be thy" (| indicates subword division). GPT-2 recognizes this standard text and would predict the next word as 'name' if given ordered context. Integrating over a latent order of the last 2, ..., 7 tokens, or taking a random order of the last 2 or 3 tokens, the most likely next token is still 'name'. However, when the last 4 tokens are randomly ordered, the most likely next token is '|owed': the distribution over next tokens is quite flat, and the most significant pattern is that 'hall' is the last token in 1/4 of orders, which is strong evidence for '|owed' to follow.
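The fraction of orders ending in 'hall' can be checked by direct enumeration. The sketch below assumes that the last four tokens of the input are 'hall', 'owed', ' be', and ' thy', as suggested by the subword division shown above; the helper name `share_ending_with` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def share_ending_with(tokens, target):
    """Fraction of all orderings of `tokens` whose last token is `target`."""
    orders = list(permutations(tokens))
    hits = sum(order[-1] == target for order in orders)
    return Fraction(hits, len(orders))

# Assumed last four tokens of the example input.
context = ["hall", "owed", " be", " thy"]
share = share_ending_with(context, "hall")
print(share)  # 1/4
```

With four distinct tokens there are 4! = 24 orders, of which 3! = 6 end in 'hall', so a uniformly random shuffle places 'hall' immediately before the predicted token a quarter of the time.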

D Human evaluation on punctuationless sentences
All pairs of punctuationless sentences (original and IBIS-inferred orderings) were independently rated by 3 annotators, where each annotator followed the instruction set laid out in Fig. 8. The distribution of users' ratings for punctuationless sentences in each of the length buckets is shown in Fig. 7. For sentences in the most common length bucket (10-19), almost half of the inferred orderings are either identical to the original text or have the same or a similar meaning.

Fairy in techniques for coronavirus vaccine has been confirmed by research.
In the aftermath of the storm, humans and hippos, penguins, dolphins, and the dinosaurs were forced into the streets by the hive mind of the Internet.
See the full video for a long list of physicists and Mars' history as well as insights to the origins of bubbles.
Drones, drugs, climate change: the search for answers to nothing.
Table 9: Examples of constrained generation using IBIS endowed with a word replacement step. Underlined words are a fixed initial context, while the bold words are required to appear anywhere in the text, in some order. The base LM was Distil-GPT-2; a slight relaxation of greedy ascent is employed.
The cat was hungry. Someone picked up the mouse and chased the cat away from it. Now the cat is still hungry.
The mouse was hungry. It started feeding itself by taking UndergroundMISC's engineered growth cheese assay. Now the mouse is still hungry.
The cat liked mice. His appetite for sweet treats was a little more intense. Now the cat hates mice.
The world just had a cat and a dog fight. Now the cat isn't hungry anymore.
The cat and the mouse were both hungry. Although they oversee the animal kingdom, their predators eat more. Now only one of them is hungry.
I thought I could find my future and fix it, but everyone was running in panic.
The announcement of the student strike is expected to be welcomed by many of us. Students across Canada are rejoicing.
The Dragon King has reached out to the court at the request of a judge with the Magic Kingdom. The Minister of Magic declined to comment.
Table 9, the underlined tokens are fixed and the bold tokens are required to appear.