Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text

While Large Language Models (LLMs) have achieved remarkable performance in many tasks, much about their inner workings remains unclear. In this study, we present novel experimental insights into the resilience of LLMs, particularly GPT-4, when subjected to extensive character-level permutations. To investigate this, we first propose the Scrambled Bench, a suite designed to measure the capacity of LLMs to handle scrambled input, in terms of both recovering scrambled sentences and answering questions given scrambled context. The experimental results indicate that the most powerful LLMs demonstrate a capability akin to typoglycemia, a phenomenon in which humans can understand the meaning of words even when the letters within those words are scrambled, as long as the first and last letters remain in place. More surprisingly, we find that only GPT-4 nearly flawlessly processes inputs with unnatural errors, even under extreme conditions, a task that poses significant challenges for other LLMs and often even for humans. Specifically, GPT-4 can almost perfectly reconstruct the original sentences from scrambled ones, reducing the edit distance by 95%, even when all letters within each word are entirely scrambled. It is counter-intuitive that LLMs can exhibit such resilience despite the severe disruption to input tokenization caused by scrambled text.


Introduction
Large language models (LLMs) demonstrate impressive proficiency across a range of tasks, with certain capabilities emerging as the models scale up in size, a phenomenon commonly known as emergent abilities (Wei et al., 2022a). As LLMs become more "intelligent", many new benchmarks have been proposed (Liang et al., 2022; Qin et al., 2023) to investigate their abilities. Nonetheless, the internal mechanisms underlying the capabilities of LLMs remain enigmatic.

[Figure 1 example] Scrambled sentence: oJn amRh wno het 2023 Meatsrs ermtnoTuna no duySan ta atgsuAu ntaaNloi Gflo bClu, gnclcinhi ish ifsrt nereg ecatkj nad ncedos raecer jroam.
Recovered sentence: Jon Rahm won the 2023 Masters Tournament on Sunday at Augusta National Golf Club, clinching his first green jacket and second career major.
Several studies investigate the behavior of LLMs under input perturbations. For example, Sinha et al. (2021a,b) and Abdou et al. (2022) investigate the influence of word-level permutations and show that, in some downstream tasks (e.g., natural language inference), models are insensitive to permutations of word order that corrupt the original syntax. These results are particularly interesting because they challenge a common assumption about the inner workings of LLMs, namely that LLMs understand human-like syntax to some extent and use it to understand sentences.
Going beyond the effect of word-level permutations, we investigate the ability of LLMs under character-level permutations. LLMs rely on tokenizers to turn natural language into a form that LLMs can perceive. It would be counter-intuitive if LLMs could effectively handle text containing unnatural permutations that significantly alter tokenization. In other words, we pose the following research question: Deos the oredr of ltteers in wrods mttaer for LLMs? Note that the above sentence contains scrambled words, yet humans can somehow recognize and understand such a sentence, as several cognitive studies have explored (Rawlinson, 2007; Mason, 1982; Johnson and Eisler, 2012). Analyzing the robustness of LLMs against such character-level permutations can shed light on their word comprehension capabilities and reveal differences between various LLMs and human understanding.
To this end, this paper first constructs Scrambled Bench, which converts existing benchmarks into a test suite to measure the ability of LLMs to handle scrambled text. We design two types of tasks: (1) Scrambled Sentence Recovery, which tests the capability of LLMs to reconstruct the original sentences from scrambled ones, and (2) Scrambled Question Answering, which measures how well LLMs can answer questions when some context is scrambled. Note that since a slight change in letter order within a word drastically changes the tokenized output (see Figure 1), it is questionable whether LLMs can recognize the scrambled words in a sentence. Counter-intuitively, we show that the most powerful LLMs are able to handle scrambled sentences to varying degrees when we scramble words while keeping the first and last letters unchanged. More surprisingly, we find that only GPT-4 can almost flawlessly process inputs with unnatural errors, even under extreme conditions. That is, even when we scramble all letters in words, GPT-4 manages to handle such input, a significantly challenging task for other models and even humans. For instance, GPT-4 can reconstruct the original sentences with a near-perfect recovery rate in the extreme scenario, as in Figure 1.

Related Works
The most related works are the studies investigating the effects of word- or sub-word-level perturbations (Sinha et al., 2021a,b; Pham et al., 2021; Abdou et al., 2022) and the studies evaluating the robustness of LLMs (Wang et al., 2023; Zhu et al., 2023). To the best of our knowledge, no existing studies have investigated LLMs' ability to handle character-level permutations, particularly those of an extremely high level that drastically change tokenization. Our study aims to fill this gap. Table 2 in Appendix B categorizes the prior studies and demonstrates the position of our study. Besides, we directly evaluate the ability to recover the scrambled text along with task accomplishment given scrambled context. This differs from typographical error correction (Shah and de Melo, 2020; Sun et al., 2022), as (i) we do not train models to correct errors, i.e., we measure the inherent ability of LLMs, and (ii) we add much more severe noise than natural typographical errors. The word unscrambling task in BigBench (Srivastava et al., 2023) is similar to our recovery task. However, it is more akin to a wordplay puzzle than a task for comprehending scrambled text, since it includes only single common words and no context is given.

Scrambled Bench
We propose two tasks to evaluate the capability of LLMs to handle scrambled text.
(1) Scrambled Sentence Recovery (ScrRec). In this task, we provide a sentence containing scrambled words to LLMs and ask them to recover the original sentence from it. This task directly measures the capability of LLMs to recognize and reconstruct the scrambled words in a sentence.
(2) Scrambled Question Answering (ScrQA). While ScrRec can directly measure the capability to comprehend and process scrambled text, it is an "unusual" task for LLMs. If a model does not perform well on the recovery task, there are two possible reasons: (i) it has difficulty following the instructions, or (ii) it is not able to recover sentences. To distinguish them, we measure the ability to accomplish a standard task (i.e., QA) given scrambled context. Specifically, we scramble the content that contains the essential information for answering questions and then assess the models based on the variations in their performance.
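The two task formats can be sketched as simple prompt builders. The ScrRec wording below follows the example prompts shown in Appendix A; the ScrQA template, in contrast, is a hypothetical multiple-choice layout (the paper's exact ScrQA wording is not reproduced here, so its structure is an assumption).

```python
def scrrec_prompt(scrambled: str) -> str:
    # Wording taken from the ScrRec examples in Appendix A.
    return (
        "The following sentence contains words with scrambled letters. "
        "Please recover the original sentence from it.\n"
        f"Scrambled sentence: {scrambled}\n"
        "Recovered sentence:"
    )

def scrqa_prompt(context: str, question: str, options: list) -> str:
    # Hypothetical ScrQA format: multiple-choice QA over scrambled evidence.
    letters = "ABCDE"
    opts = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(options))
    return f"Context: {context}\nQuestion: {question}\n{opts}\nAnswer:"
```

A usage sketch: `scrrec_prompt("eTh rwa needd")` yields a prompt ending in "Recovered sentence:", which the model is expected to complete.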

Dataset Creation
In this study, we primarily utilize a scrambled version of RealtimeQA (Kasai et al., 2022) for evaluation. A common issue in evaluating LLMs is data contamination, which occurs when the test data of downstream tasks is present in the training data. It particularly affects our experiments because the analysis would be meaningless if some models had memorized the original contents. RealtimeQA is a dynamic question answering dataset that weekly announces questions about recent news, which are unlikely to be memorized by current LLMs. Specifically, we collect the most recent questions from RealtimeQA.
Besides, we also introduce two additional datasets: DREAM (Sun et al., 2019) and AQuA-RAT (Ling et al., 2017). DREAM is a dialogue-based multiple-choice reading comprehension dataset. AQuA-RAT is a dataset of math word problems necessitating multi-step reasoning for their resolution. For the DREAM dataset, we construct our version by selecting 1,025 samples with annotated categories from the development and test sets and then scrambling the dialogue part of each question. For the AQuA-RAT dataset, we adopt the few-shot Chain-of-Thought (CoT) setting as in Wei et al. (2022b) and evaluate LLMs with scrambled questions in both samples and demonstrations.
For each dataset, we generate scrambled text with various scramble types and rates.
(1) Randomly Scramble (RS). For each sentence, we randomly select a certain percentage (20%, 50%, or 100% in our case; in the remainder of the paper, RS denotes 100% random scrambling unless a different percentage is specified) of words and randomly shuffle the positions of letters in each selected word (Arabic numerals are kept invariant).
(2) Keep First (KF). We keep the first letter in each word unchanged and randomly shuffle the letters in other positions.
(3) Keep First and Last (KFL). We keep the first and last letters in each word unchanged and randomly shuffle the letters in other positions. We use KF and KFL to investigate the effects of different scramble types on word recognition for LLMs.
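The three scramble types can be sketched as follows. This is a minimal illustration of the scheme described above; how the paper handles punctuation attached to words is not specified, so here it is simply shuffled along with the letters, and any word containing a digit is left untouched.

```python
import random

def scramble_word(word: str, mode: str, rng: random.Random) -> str:
    """Shuffle the letters of one word according to the scramble type.
    Words containing Arabic numerals are kept invariant, per the setup above."""
    if len(word) < 2 or any(ch.isdigit() for ch in word):
        return word
    if mode == "RS":          # shuffle every letter
        lo, hi = 0, len(word)
    elif mode == "KF":        # keep the first letter in place
        lo, hi = 1, len(word)
    elif mode == "KFL":       # keep the first and last letters in place
        lo, hi = 1, len(word) - 1
    else:
        raise ValueError(f"unknown scramble type: {mode}")
    middle = list(word[lo:hi])
    rng.shuffle(middle)
    return word[:lo] + "".join(middle) + word[hi:]

def scramble_sentence(sentence: str, mode: str = "RS",
                      rate: float = 1.0, seed: int = 0) -> str:
    """Scramble a randomly selected fraction (`rate`) of the words."""
    rng = random.Random(seed)
    words = sentence.split()
    chosen = set(rng.sample(range(len(words)), round(rate * len(words))))
    return " ".join(scramble_word(w, mode, rng) if i in chosen else w
                    for i, w in enumerate(words))
```

For example, `scramble_sentence("Jon Rahm won", mode="KFL")` shuffles only interior letters, so every word keeps its first and last characters.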

Metrics
The average Edit Distance (ED) (Levenshtein, 1966) between the original sentences and the recovered sentences is a natural metric to quantify performance on ScrRec. In addition, we define the Recovery Rate (RR) to measure the proportion of ED reduced in the recovered sentences, which makes performance comparison across settings more straightforward:

RR = [Σ_i ED(scr_i, ori_i) − Σ_i ED(rec_i, ori_i)] / Σ_i ED(scr_i, ori_i) × 100%,

where ori_i, scr_i, and rec_i denote the original sentence, the scrambled sentence, and the recovered sentence of sample i, respectively. If a model is capable of recovering scrambled text to some extent, its RR should range between 0 and 100%.
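The two ScrRec metrics can be computed as below. This is a sketch assuming the aggregate form of RR (edit distances are summed over all samples before taking the ratio); the Levenshtein distance uses the standard dynamic-programming recurrence.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def recovery_rate(samples) -> float:
    """Recovery Rate in percent over a list of (ori, scr, rec) triples."""
    scr_ed = rec_ed = 0
    for ori, scr, rec in samples:
        scr_ed += edit_distance(scr, ori)   # damage introduced by scrambling
        rec_ed += edit_distance(rec, ori)   # damage remaining after recovery
    return 100.0 * (scr_ed - rec_ed) / scr_ed
```

A perfect recovery (rec_i = ori_i for all i) gives RR = 100%, while echoing the scrambled input unchanged gives RR = 0%.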
For ScrQA, accuracy is a natural metric to measure performance. However, the varying capabilities of models on the original questions make it hard to compare performance across models. We therefore define Relative Performance Gain (RPG) to mitigate these differences and focus the evaluation on how well models extract information from scrambled text relative to original text:

RPG = (Acc_scr − Acc_sub) / (Acc_ori − Acc_sub) × 100%,

where Acc_ori, Acc_scr, and Acc_sub denote accuracy when using original evidence, scrambled evidence, and substituted evidence (in which all letters of words that can be scrambled are randomly substituted), respectively. Acc_ori and Acc_sub serve as the upper and lower bounds of accuracy, respectively. Normally, RPG ranges between 0 and 100%.
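As a concrete sketch of the definition above (the function name is ours):

```python
def relative_performance_gain(acc_ori: float, acc_scr: float,
                              acc_sub: float) -> float:
    """RPG in percent: the fraction of the original-vs-substituted accuracy
    gap that a model retains when reading scrambled rather than original
    evidence. acc_ori and acc_sub act as upper and lower bounds."""
    return 100.0 * (acc_scr - acc_sub) / (acc_ori - acc_sub)
```

For example, a model scoring 75% on scrambled evidence, bounded by 100% on original and 50% on substituted evidence, retains half of the gap: RPG = 50%.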
Results 1: Effect of different scramble types.
Figure 2 shows the results on zero-shot ScrRec, few-shot ScrRec, and ScrQA with three scramble types: randomly scramble (RS), keep first (KF), and keep first and last (KFL). The performance gaps among models are not large in the KFL setup. However, except for GPT-4, performance decreases significantly as the difficulty of the scramble type increases (KFL, KF, and RS, in order). In contrast, the performance of GPT-4 remains consistently high regardless of scramble type. For ScrRec, the RR of GPT-4 is consistently above 95% across all setups. For ScrQA, GPT-4 also consistently performs best, with a very limited accuracy drop as the difficulty of the scramble type increases.
Results 2: Effect of different scramble rates.
Figure 3 illustrates the relationship between the scramble rate (i.e., the percentage of randomly scrambled words in the text) and performance on ScrRec with scrambled RealtimeQA. As the scramble rate increases, RR decreases for text-davinci-003, Falcon-180b, and Llama-2-70b, while the RR of GPT-3.5-turbo and GPT-4 does not change significantly. GPT-4 outperforms the other models by a wide margin, with RR above 95% in most setups (except the 20% scramble rate). Similarly, Figure 4 plots RPG against the scramble rate for different models on ScrQA with scrambled RealtimeQA. The RPG of all models drops as the scramble rate increases, but GPT-4 maintains 87.8% of its original performance even with 100% scrambled evidence. The gap between models widens as the scramble rate increases. Table 3, Table 4, and Table 5 in Appendix C show full results on the scrambled RealtimeQA dataset, with additional evaluation metrics (ED for ScrRec and Acc for ScrQA).
Results 3: Results on additional datasets. Finally, we test the generality of our findings with two additional datasets for ScrQA. For the scrambled DREAM dataset, we evaluate performance not only overall but also on different categories of questions, using the annotations. The performance disparities between GPT-4 and the other models are more pronounced than those observed on RealtimeQA, possibly because DREAM requires higher-level comprehension of longer texts. Performance on arithmetic questions tends to be more susceptible to scrambled text than other categories, even for GPT-4. Table 1 shows experimental results with a 4-shot CoT setting on the scrambled AQuA-RAT dataset (we only test the performance of three closed-source models here because even the original AQuA-RAT questions are too challenging for most open-source models).
We evaluate models in several different scenarios by varying the scramble rates of both the questions in exemplars and the main question. The results show that the influence of scrambled exemplars is relatively small. However, the performance of GPT-3.5-turbo and text-davinci-003 drops significantly when the scramble rate of the main question increases to 100%, while GPT-4 retains most of its original performance. Table 6, Table 7, and Table 8 in Appendix C show full results on the scrambled DREAM dataset.

Conclusion
In this study, we propose Scrambled Bench, a test suite to measure the ability of LLMs to handle scrambled text, comprising two tasks (scrambled sentence recovery and scrambled question answering), and construct scrambled datasets based on RealtimeQA, DREAM, and AQuA-RAT. Although scrambled text drastically changes tokenization, we demonstrate that advanced LLMs can process scrambled text to varying degrees. However, most LLMs have difficulty handling text that is scrambled to an extreme degree (i.e., 100% random scrambling). Surprisingly, for both tasks, GPT-4 shows strong results and outperforms other models by a large margin. For the scrambled sentence recovery task, GPT-4 reduces the edit distance by 95% even in the 100% random scrambling setting. For the scrambled question answering task, GPT-4 maintains a very high proportion of its original accuracy given scrambled context.

Limitations
For LLMs, there are various ways to disrupt the tokenization of words (e.g., inserting letters, substituting letters).In this study, we only investigate the influence of scrambling the letter-order in words.
Investigating the performance of LLMs to handle other situations would be an interesting topic.
In addition, we conduct our experiments using only three datasets: RealtimeQA, DREAM, and AQuA-RAT. Experiments on more diverse datasets could be future work. Note that the two tasks are applicable to diverse datasets, and it is easy to extend the analysis.
We investigate the capability of different LLMs to handle scrambled text in different settings. However, it is difficult to determine why (some) LLMs are capable of these tasks. In particular, why GPT-4 performs almost perfectly is an interesting topic worth further investigation. We cannot access the closed-source models directly and know little about them (not even the exact model size of GPT-4), which makes investigating the reason difficult. One hypothesis is that this capability is related to training methods, such as incorporating denoising-like tasks in the training objectives, or training on a vast amount of text data containing various errors. Another hypothesis is that this capability emerges as LLMs scale. Validating these hypotheses could enhance our understanding of the inner workings of LLMs, thereby enabling us to reverse-engineer and recreate more sophisticated models like GPT-4.

A Examples of different tasks (by GPT-4)
A.1 Zero-shot ScrRec on RealtimeQA

Input:
The following sentence contains words with scrambled letters. Please recover the original sentence from it.
Scrambled sentence: A reGebr byba ulfaorm wsa titbudiserd ot soetsr epdstie a lclera eovr bslpioes ionmanantitco, grnoadicc ot eth ADF. heT pyomacn si noniacrggue rptsean ot ckhec yna poducsrt yhte evah ta mhoe nda cdisadr sehot taht aym eb ecaeftdf.
Recovered sentence:
Output:
A Gerber baby formula was distributed to stores despite a recall over possible contamination, according to the FDA. The company is encouraging parents to check any products they have at home and discard those that may be affected.

A.2 Few-shot ScrRec on RealtimeQA
Input:
Scrambled Sentence: eTh camp continued to fnctinuo this ayw ilntu the rwa needd.
Recovered Sentence: The camp continued to function this way until the war ended.
Scrambled Sentence: It swa first developed ni the 1980s yb oAcrn Computers tdL ot erowp their pstodke nmecisah and subsequently supn off sa a separate paocnmy, now ARM Holdings.Recovered Sentence: It was first developed in the 1980s by Acorn Computers Ltd to power their desktop machines and subsequently spun off as a separate company, now ARM Holdings.
Scrambled Sentence: According to the CIA kcb-Fotoa, the United States is one fo eethr iusecnort (het etrhos nebgi Liberia nda mBuar/Myanmar) that sha not adopted eth International System fo Utins (SI) rmtcei symset as iethr ffliicao system fo gswheit dna measures.
Recovered Sentence: According to the CIA Factbook, the United States is one of three countries (the others being Liberia and Burma/Myanmar) that has not adopted the International System of Units (SI) metric system as their official system of weights and measures.

Scrambled sentence: A reGebr byba ulfaorm wsa titbudiserd ot soetsr epdstie a lclera eovr bslpioes ionmanantitco, grnoadicc ot eth ADF. heT pyomacn si noniacrggue rptsean ot ckhec yna poducsrt yhte evah ta mhoe nda cdisadr sehot taht aym eb ecaeftdf.
Recovered sentence:
Output:
A Gerber baby formula was distributed to stores despite a recall over possible contamination, according to the FDA. The company is encouraging parents to check any products they have at home and discard those that may be affected.

A.3 Few-shot CoT on AQuA-RAT (output excerpt)

Output:
First, let's rewrite the question with the words unscrambled: "Machine A puts out a yo-yo every 6 minutes. Machine B puts out a yo-yo every 9 minutes. After how many minutes will they have produced 10 yo-yos?" To solve this problem, we can find the least common multiple (LCM) of 6 and 9, which is 18. This means that every 18 minutes, both machines will have produced a yo-yo at the same time. In 18 minutes, Machine A will have produced 3 yo-yos (18/6) and Machine B will have produced 2 yo-yos (18/9), for a total of 5 yo-yos. Since we need 10 yo-yos, we can multiply 18 minutes by 2 to get 36 minutes. The answer is (C) 36 minutes.

C Full experimental results
We conduct experiments using the most powerful closed-source LLMs, including text-davinci-003, GPT-3.5-turbo, and GPT-4, as well as various open-source LLMs, including models from the Falcon and Llama 2 series. For GPT-4, version GPT-4-0314 is used. For GPT-3.5-turbo, version GPT-3.5-turbo-0301 is used. For Falcon-180b and Falcon-180b-chat, a quantization method (Dettmers et al., 2023) is used to load the models and run the experiments, which probably affects their performance to some extent.

C.1 Full experimental results on scrambled RealtimeQA dataset
Table 3 and Table 4 illustrate the full experimental results on the scrambled RealtimeQA dataset for few-shot ScrRec and zero-shot ScrRec, respectively. Table 5 illustrates the full experimental results on the scrambled RealtimeQA dataset for zero-shot ScrQA.

C.2 Full experimental results on scrambled DREAM dataset
Table 6 illustrates the full experimental results on the scrambled DREAM dataset for zero-shot ScrQA. Table 7 and Table 8 illustrate the experimental results on different question types for the top five models (as in Figure 5), with accuracy and RPG as the metrics, respectively.

Figure 1 :
Figure 1: GPT-4 can recover the original sentence from the scrambled sentence, even if the tokenization drastically changes. (The colors indicate the division of sub-words during tokenization.)

Figure 3:
Figure 3: The effect of scramble rates on the performance on ScrRec-RealtimeQA. The missing points represent values less than zero.

Figure 4:
Figure 4: The effect of scramble rates on the performance on ScrQA-RealtimeQA.

Figure 5 :
Figure 5: Experimental results of ScrQA on the scrambled DREAM dataset: overall performance and performance on different categories of questions. The missing bars represent values less than or equal to zero.

Table 1 :
Experimental results of ScrQA on the scrambled AQuA-RAT dataset with a few-shot CoT setting. LLMs are assessed in various scenarios by adjusting the scramble rates of the questions in exemplars and the main question.

Table 2 :
Categorizing the prior works and demonstrating the position of this work.

Table 3 :
The experimental results on the few-shot ScrRec using RealtimeQA dataset (upper: ED; lower: RR).

Table 4 :
The experimental results on the zero-shot ScrRec using RealtimeQA dataset (upper: ED; lower: RR).