BRAINTEASER: Lateral Thinking Puzzles for Large Language Models

The success of language models has inspired the NLP community to attend to tasks that require implicit and complex reasoning, relying on human-like commonsense mechanisms. While such vertical thinking tasks have been relatively popular, lateral thinking puzzles have received little attention. To bridge this gap, we devise BRAINTEASER: a multiple-choice Question Answering task designed to test the model's ability to exhibit lateral thinking and defy default commonsense associations. We design a three-step procedure for creating the first lateral thinking benchmark, consisting of data collection, distractor generation, and generation of adversarial examples, leading to 1,100 puzzles with high-quality annotations. To assess the consistency of lateral reasoning by models, we enrich BRAINTEASER based on a semantic and contextual reconstruction of its questions. Our experiments with state-of-the-art instruction- and commonsense language models reveal a significant gap between human and model performance, which is further widened when consistency across adversarial formats is considered. We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.


Introduction
Human reasoning processes comprise two types of thinking: vertical and lateral (Waks, 1997). Vertical thinking, also known as linear, convergent, or logical thinking, is a sequential analytical process that is based on rationality, logic, and rules, and is typically associated with the left-brain hemisphere. Vertical thinking, as illustrated in Figure 1 (top), is needed to create a reasoning path from flooding a room to filling it with water for physical reasoning, and from inanimate objects with five fingers to gloves in riddles. Meanwhile, lateral thinking (or "thinking outside the box") is a divergent and creative process that involves looking at a problem from a new perspective and defying preconceptions, and is associated with the right-brain hemisphere (De Bono, 1970; Waks, 1997). Lateral thinking is required to solve the puzzle in Figure 1 (bottom), by overwriting the commonsense association of man shaves to he shaves himself, and instead regarding the man as somebody who shaves others all day (e.g., a barber).

[Figure 1: Vertical vs. lateral thinking. Sentence Puzzle example: "A man shaves every day, yet keeps his beard long." The figure compares vertical thinking tasks (e.g., Bisk et al., 2020, and RiddleSense (Lin et al., 2021)) to our novel lateral thinking task, BRAINTEASER. While prior tasks require commonsense to be injected, BRAINTEASER's lateral thinking puzzles require default commonsense thinking to be deprecated.]
The development of natural language processing (NLP) models and their evaluation has achieved much progress in vertical thinking. In particular, large language models (LLMs) (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020b) have achieved strong performance across a variety of complex reasoning tasks (Talmor et al., 2019; Bisk et al., 2020; Sap et al., 2019b), even with the complete absence (zero-shot) (Sanh et al., 2022) or limited provision (few-shot) of training-time exemplars (Chung et al., 2022). To perform well on tasks such as reasoning over physical interactions (Bisk et al., 2020) and social implications (Sap et al., 2019b), LLMs exhibit vertical thinking capabilities, including commonsense association (Wei et al., 2022) and inference ability (Bosselut et al., 2019). While the extent to which these models possess common sense is heavily discussed (Marcus, 2022; Bubeck et al., 2023; Wei et al., 2023), we note that prior work has not considered the lateral thinking ability of LLMs. Creative thinking problems in benchmarks and knowledge bases are often filtered out as noise during preprocessing (Vajjala and Meurers, 2012; Speer et al., 2017; Sap et al., 2019a), and only kept if their resolution can be supported by commonsense associations, as in the case of riddles (Figure 1) (Lin et al., 2021; Gao et al., 2018). As many situations are novel, we expect that lateral thinking puzzles like those in Figure 1 (bottom) will be hindered by default commonsense associations and cannot be easily solved by further adaptation and scaling of the existing LLM methods.
To bridge this gap, we propose to study the ability of state-of-the-art LLMs to reason on lateral thinking puzzles. We formulate lateral thinking puzzles as multiple-choice Question Answering (QA) tasks, making them intuitive for humans to answer and easy to evaluate automatically. Following our task definition, we create a novel BRAINTEASER benchmark with two tasks of different granularity: Sentence Puzzles and Word Puzzles (cf. Figure 1). To construct the dataset, we design a data collection procedure, which crawls relevant puzzles from several publicly available websites, performs semi-automatic filtering of irrelevant question categories (e.g., puns, dad jokes), and ensures high data quality. To ensure fair and informative questions, we construct distractors semi-automatically by manual annotation of the explicit and implicit (commonsense) premises that arise from each puzzle. To address concerns of possible LLM memorization (Carlini et al., 2022) and their lack of consistency (Goldberg, 2023), we enrich BRAINTEASER with two reconstruction strategies: semantic reconstruction and context reconstruction, which create variants of each puzzle without changing its original way of defying default commonsense associations. This systematic procedure results in a novel BRAINTEASER benchmark with 1.1K high-quality data points and validation rates of nearly 100% in human evaluation. Using BRAINTEASER as the benchmark, we conduct comprehensive experiments involving different model structures, model sizes, and prompting strategies. The results reveal a huge gap between human performance and current LLMs, indicating the great need to improve lateral thinking in LLMs.
We summarize our contributions as follows: 1) We introduce lateral thinking puzzles, a multiple-choice QA task designed to test the model's ability to exhibit lateral thinking and defy default commonsense associations. 2) We design a three-step procedure for creating the first lateral thinking benchmark, BRAINTEASER, consisting of data collection, distractor generation, and generation of reconstruction examples, leading to 1,100 high-quality puzzles. 3) We conduct comprehensive experiments with state-of-the-art LLMs. We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.

Related work
We review prior work on computational creativity, commonsense reasoning, and model robustness.
Computational Creativity Computational creativity work includes a broader set of tasks, some of which have been relatively popular, including pun (Zou and Lu, 2019) and humor (Meaney et al., 2021) detection. A particular class of creative challenges, called brain teasers (Draper, 2009; Highhouse et al., 2019), is designed to evaluate a wide range of human intelligence skills, including strategy development, planning, visual-spatial thinking, creativity, and memory (Altun et al., 2016). Most similar to our task, Lin et al. (2021) collect riddles from public websites to challenge current models. While in principle computational creativity puzzles and brain teasers combine vertical and lateral thinking, prior work has focused on the former category. Our BRAINTEASER task complements these works with word- and sentence-level lateral thinking puzzles. BRAINTEASER can serve as a formal platform to evaluate the creative skills of LLMs, which have been partially explored in recent work (Franceschelli and Musolesi, 2023; Bubeck et al., 2023; Wang et al., 2023a).

Commonsense Reasoning
The task of commonsense reasoning has been popular in recent years (Rajani et al., 2019; Ma et al., 2019; Lourie et al., 2021; Maharana and Bansal, 2022), accompanied by the introduction of numerous challenging benchmarks (Talmor et al., 2019; Sap et al., 2019b; Sakaguchi et al., 2019) and the availability of large-scale commonsense resources (Speer et al., 2017; Hwang et al., 2021). While each of the existing datasets focuses on different dimensions of commonsense knowledge (Ilievski et al., 2021a), most of them are constructed in the multiple-choice format, due to the ease of evaluation. Some prior works have focused on generative commonsense reasoning (Lin et al., 2020; Boratko et al., 2020). However, due to the vast plausible answer space, evaluation has been challenging, and a large number of answer annotations must be collected in order to ensure fairness (Boratko et al., 2020). Curiously, while possession of common sense has been a central goal of AI, its role in our BRAINTEASER task is as a distractor. Namely, successful solutions of the lateral thinking puzzles in BRAINTEASER require the models to defy commonsense associations and linear inference chains.
Robustness Studies As a novel benchmark, BRAINTEASER relates to other works that evaluate the performance of LLMs. Since these models are surpassing human performance on some existing benchmarks (Xu et al., 2022), the NLP community has shifted the focus towards robustness evaluation, i.e., whether the model can retain a similar performance on semantically perturbed or adversarially constructed questions (Abdou et al., 2020; Nie et al., 2020). Some recent works have adopted model-adversarial approaches to generate datasets that are challenging for models to solve (Zellers et al., 2019; Sakaguchi et al., 2019), while others combine multiple tasks to evaluate the model's behavioral consistency across semantic, logical, and factual categories (Jang et al., 2022). Besides dataset construction, analysis studies have also shown that models easily learn shortcuts to solve the datasets (Branco et al., 2021; Elazar et al., 2021) and that their performance heavily depends on the overlap of tokens between training and test data (Ma et al., 2021b). Different from prior works, where associative resources are used to finetune the model to improve robustness, we expect that the lateral thinking puzzles in BRAINTEASER require unique associations and creative reasoning paths. In this way, BRAINTEASER is designed to minimize the impact of confounding factors like memorization in LLMs (Bang et al., 2023; Guo et al., 2023; Goldberg, 2023).

Construction of BRAINTEASER
In this section, we first provide a definition of lateral thinking puzzles at various granularities. We then present a three-stage pipeline for constructing the multiple-choice puzzles in the BRAINTEASER dataset, consisting of data collection, distractor sampling, and reconstruction sample generation. Finally, we present key data statistics and quality validation results.

Task Definition
While lateral thinking puzzles are often presented to humans in an open-ended fashion, such puzzles are difficult to evaluate automatically and difficult for humans to solve. An additional complication is that there may be multiple independent, yet correct, puzzle explanations. To alleviate these challenges, we pose lateral thinking puzzles as a multiple-choice QA task, a format frequently employed for reasoning tasks. We expect this approach to be both intuitive for humans and amenable to automated evaluation. In general, each puzzle contains a question Q stating the context, and a lateral explanation e from an explanation space E that serves as the correct answer. Q can be decomposed into an atomic premise set P, which includes both explicitly stated clauses and implicit clauses derived through default commonsense inferences or associations. For example, in the following puzzle: "How could a cowboy ride into town on Friday, stay two days, and ride out on Wednesday?", the set P includes the following premises:
• p1: Cowboy rides into town on Friday.
• p2: Cowboy stays in town for two days.
• p3: Cowboy rides out on Wednesday.
• p4: Wednesday is the third day of the week.
• p5: Sunday is two days after Friday.
(Footnote: Our small-scale user study shows that both humans and LLMs are unable to perform this open-ended task well, scoring 2.64 and 2.62 on a 5-point scale, respectively; see Appendix A.5 for details.)
The premises p1, p2, and p3 are explicitly provided by the context, while the premises p4 and p5 are implicitly obtained through default commonsense association. The goal of a puzzle is to find an explanation e ∈ E that does not contradict the premise set P, i.e., one that is consistent with every premise, as the premises are the targets to explain and support. With vertical thinking, the question appears impossible to answer because P contains statements that conflict with each other. The premises p3 and p4 are inconsistent with the other premises, leading to an obstacle in explaining the puzzle. The default commonsense inference thus becomes a logic stumper (Bar-Hillel et al., 2018), preventing one from creatively exploring additional explanations in E.
Lateral thinking leads to a correct solution to this puzzle: "His horse is named Wednesday." This creative solution defies the commonsense association of Wednesday as the third day of the week (p4). Thus, the key point of a lateral thinking puzzle is that some implicit premises, generated through default commonsense association, incorrectly create an arbitrary "box" that wrongly excludes the possible solution from the explanation space (Bar-Hillel et al., 2018).
Upon careful exploration, we devise two granularity variants of lateral thinking puzzles following our definition (Figure 1): sentence-based, where the puzzle is centered on sentence premises (e.g., Wednesday is the third day of the week), and word-based, where the answer violates the default meaning of the word and focuses on the letter composition of the target question (e.g., cheese made backwards → edam).

Data Collection
We collect over ten thousand lateral thinking puzzles with answers from public websites such as riddles.com and rd.com using web crawlers. We merge the data from different sources and remove (near-)duplicates based on sentence similarity (Reimers and Gurevych, 2019). We conduct a semi-automatic process that corrects typos using an automatic library, autocorrect (github.com/phatpiglet/autocorrect), followed by human verification to ensure that the puzzles preserve their original meaning. We filter the remaining data manually to preserve QA pairs that fit the definition of the sentence- or word-based lateral thinking puzzles. This process yields 373 unique lateral puzzles, formatted as QA pairs.
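The (near-)duplicate filtering step can be sketched as a greedy similarity-threshold pass over the crawled puzzles. The paper uses Sentence-BERT embeddings (Reimers and Gurevych, 2019) for similarity; in this minimal, self-contained sketch a bag-of-words cosine similarity stands in for the embedding model, and the 0.9 threshold is an assumed value, not one reported in the paper.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedup(puzzles, threshold=0.9):
    """Greedily keep a puzzle only if its similarity to every
    previously kept puzzle is below the threshold."""
    kept, vecs = [], []
    for text in puzzles:
        vec = Counter(t.strip(".,?!") for t in text.lower().split())
        if all(cosine(vec, v) < threshold for v in vecs):
            kept.append(text)
            vecs.append(vec)
    return kept

puzzles = [
    "A man shaves every day, yet keeps his beard long.",
    "A man shaves every day yet he keeps his beard long.",  # near-duplicate
    "How do you spell COW in thirteen letters?",
]
print(dedup(puzzles))  # the near-duplicate second puzzle is dropped
```

In the actual pipeline, replacing the bag-of-words vectors with sentence embeddings would let the same greedy loop catch paraphrased duplicates with little word overlap.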

Distractor Sampling
We convert each puzzle and its explanation into a multiple-choice QA format to ensure a straightforward evaluation process. A key challenge in creating fair and informative multiple-choice questions is sampling distractors that are simultaneously incorrect and challenging (Ma et al., 2021a). We propose a systematic approach for distractor sampling that directly benefits from our premise-based definition of lateral thinking puzzles.
For sentence puzzles, we list the possible premises P = {p1, p2, p3, ...} from the question context manually, as the commonsense associations in the data are obvious and straightforward, especially when the answers are provided, like the example in Section 3.1. We know the correct answer is an unconventional overwriting of the wrong premise (the logic stumper) pw generated by default commonsense association. We generate the distractors by overwriting other premises in P − {pw}. This procedure guarantees that the distractors are incorrect, because the misleading premise pw still remains in the premise set and prevents one from reaching the correct explanation. We first use COMET (Hwang et al., 2021) to generate the possible premise-overwriting candidates, treating the question as a head combined with inference relations (e.g., happens after, hindered by, cause). Then we pick the COMET-generated tails that are consistent with the question context as distractors and revise them by manual annotation. Table 1 shows example distractors for our running example puzzle from Section 3.1.
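The premise-overwriting constraint can be illustrated with the running cowboy puzzle. The premise texts follow Section 3.1, but the overwrite candidates below are hand-written stand-ins for the manually revised COMET tails; this is an illustrative sketch of the selection rule, not the authors' pipeline.

```python
# Sketch of premise-based distractor construction for a sentence puzzle.
premises = {
    "p1": "Cowboy rides into town on Friday.",
    "p2": "Cowboy stays in town for two days.",
    "p3": "Cowboy rides out on Wednesday.",
    "p4": "Wednesday is the third day of the week.",  # the logic stumper p_w
    "p5": "Sunday is two days after Friday.",
}
stumper = "p4"  # the misleading commonsense premise that the answer overwrites

# Hand-written overwrite candidates (stand-ins for revised COMET tails).
# The stumper's candidate must NOT be used: only the correct answer may
# overwrite p_w, which keeps every distractor incorrect.
overwrites = {
    "p2": "The cowboy stayed in town for a month before leaving.",
    "p4": "Wednesday is actually the fourth day of the week.",
    "p5": "Friday is the name of the town the cowboy entered.",
}

correct_answer = "His horse is named Wednesday."
distractors = [text for pid, text in overwrites.items() if pid != stumper]
choices = sorted(distractors + [correct_answer, "None of the above."])
print(choices)  # four candidates, as in the final BRAINTEASER format
```

Keeping `pw` intact in every distractor is what makes the distractors hard: a solver reasoning only with default commonsense associations cannot rule them out.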
For word puzzles, as we focus on the literal rather than the semantic meaning, distractors can share semantic meaning with the correct answers and still exhibit similar commonsense associations. We pick distractors from the correct answer's synonyms in WordNet (e.g., mozzarella for edam in Figure 1) and from Wikipedia entries that belong to the same category (e.g., both edam and cheddar belong to the semi-hard cheese category).
Since it is generally possible that none of the creative solutions will be sensible for some of the questions, we also include the option None of the above in all questions' candidate sets. This answer candidate simulates the situation where humans cannot overwrite their commonsense inference and give up on explaining the lateral thinking puzzle. To create puzzles where lateral thinking fails (i.e., with the answer None of the above), we replace the correct answer with a distractor in 6% of the questions. After this procedure, each question in BRAINTEASER has four answer candidates.

Generating Reconstruction Examples
Since the latest LLMs are pretrained on massive web snapshots, it is possible that the data sources for BRAINTEASER are also included in their training data. Consequently, it is possible for LLMs to memorize the correct answer without performing any reasoning. To ensure that our task evaluates lateral thinking ability rather than memorization, we construct reconstruction versions of the original data in two parallel ways (Table 2): (1) Semantic Reconstruction rephrases the original question without changing its answer, distractors, or any premises in P. To do so, we use an open-source rephrasing tool (https://quillbot.com), after which human annotators refine and validate that all premises remain the same. (2) Context Reconstruction keeps the misleading commonsense premise intact and changes both the question and the answer to a new situational context. For this purpose, we prompt GPT-4 for initial reconstructions, which are then manually refined by human annotators. The new distractors are generated following the same process as in Section 3.3; the premise set and the corresponding distractors also get translated to the new context. Intuitively, a model that learns to reason should be able to solve these two reconstruction variants of the questions easily, whereas a model that memorizes the answer would stumble.

Data Analysis and Validation
Key Statistics BRAINTEASER includes 1,119 data samples, including its reconstruction variants. Table 3 reports key statistics of each subtask of BRAINTEASER. The questions in the Sentence Puzzle category are much longer because they are in a narrative story format rather than simple short questions, like those in the Word Puzzle category.
The difference in the standard deviation of the number of choice tokens between Sentence Puzzle and Word Puzzle can be ascribed to the different strategies for generating distractors, i.e., overwriting various premises with new statements versus generating similar words from the synonym set. We use ChatGPT prompting to extract the context topic from each question and to analyze the major topics in each subtask. The topic distribution shows that both subtasks involve a large range of (more than 80) areas. Sentence Puzzle is dominated by math, physics, and nature topics, while Word Puzzle is dominated particularly by the language topic. For both tasks, there is a long tail of less common topics. The details of topic extraction and its obtained statistics are given in Appendix A.1. The data statistics and the topic analysis suggest that, despite its limited size, BRAINTEASER can function as a comprehensive benchmark for assessing model performance across diverse topics and varying lengths of context.

Human Validation
To ensure the quality of our dataset, we invited three expert annotators to verify the validity of the QA pairs and their reconstruction variants. We sampled 102 examples from BRAINTEASER randomly and asked the annotators the following two questions: 1) Do the original puzzle and correct answer make sense? 2) Are the reconstruction variants still consistent with the original questions in terms of the required reasoning process? On average, the human annotators rated 99% of the original question-answering pairs as valid. 100% of the semantic reconstructions and 97% of the context reconstructions were marked as consistent with the original question-answer pair. The overall Fleiss (1971) kappa inter-annotator agreement is 0.948, indicating almost perfect agreement.
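The agreement statistic used here, Fleiss' (1971) kappa, can be computed with a short dependency-free function; the ratings matrix below is a toy example, not the actual annotation data.

```python
def fleiss_kappa(matrix):
    """Fleiss' (1971) kappa for a ratings matrix where matrix[i][j]
    is the number of raters assigning item i to category j."""
    n_items = len(matrix)
    n_raters = sum(matrix[0])
    # Per-item observed agreement P_i
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in matrix
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions
    totals = [sum(col) for col in zip(*matrix)]
    p_cat = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)

# Three raters, two categories (valid / invalid), four toy items:
ratings = [[3, 0], [3, 0], [0, 3], [2, 1]]
print(round(fleiss_kappa(ratings), 3))  # → 0.625
```

With three raters over the sampled 102 examples, the same function applied to the real ratings matrix would yield the reported 0.948.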

Experimental Setup
We describe the models selected for our experiments and the metrics used to evaluate the reasoning accuracy and consistency of these models.

Model selection
Instruction-Based Models We evaluate the following instruction-finetuned LLMs in zero- and few-shot settings: 1) ChatGPT, a publicly available state-of-the-art LLM from the GPT (Brown et al., 2020a) series. 2) T0 (Sanh et al., 2022), an LLM trained with multitask instruction tuning that has strong zero-shot generalization ability. 3) FlanT5 (Chung et al., 2022), an enhanced version of T5 (Raffel et al., 2020) which is instruction-finetuned (Wei et al., 2021), in both zero-shot and few-shot settings. For a fair comparison with humans, while running zero-shot prompting on ChatGPT, we add a description indicating that the question is a brain teaser puzzle that needs creative thinking to solve. For the rest of the models, we use the same instruction templates as found in their training sets (for full details, please refer to Appendix A.2).
Commonsense Models To understand the effect of commonsense knowledge on our task, we evaluate the following models that are enhanced with common sense: 1) RoBERTa-L (CSKG) (Ma et al., 2021a), a model finetuned on synthetic QA pairs generated from a diverse set of commonsense knowledge graphs (CSKG) (Ilievski et al., 2021b). 2) CAR (Wang et al., 2023b), a model finetuned with a similar pipeline as Ma et al. (2021a) but with an enhanced negative sampling strategy and reportedly superior performance. For reference, we also include the vanilla RoBERTa model (Liu et al., 2019) to understand the impact of commonsense knowledge. We evaluate all of the models in a zero-shot fashion, following the scoring method defined in Ma et al. (2021a). We select RoBERTa because of its widespread usage on commonsense tasks and its impressive zero-shot performance. RoBERTa-L (CSKG) achieves SOTA zero-shot results on multiple commonsense tasks, while CAR even outperforms ChatGPT on commonsense tasks.
Human Evaluation To assess the upper bound of performance on BRAINTEASER, we randomly sample 102 questions from it and invite three expert annotators to solve the test. On average, it takes one hour for an annotator to complete the task.
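The zero-shot scoring method of Ma et al. (2021a) ranks answer candidates by length-normalized language-model loss. The sketch below shows only the selection rule, given per-token log-probabilities that in the real pipeline would come from the model's (masked) language-modeling head; the numbers here are purely illustrative.

```python
def choose_answer(choice_token_logprobs):
    """Pick the answer whose tokens receive the highest average
    log-probability, i.e., the lowest length-normalized LM loss."""
    scores = [sum(lp) / len(lp) for lp in choice_token_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)

# Illustrative per-token log-probs for four candidates (normally produced
# by scoring each answer's tokens with the language model):
logprobs = [
    [-2.1, -3.5, -2.8],        # distractor
    [-0.9, -1.2, -1.0, -1.1],  # candidate with consistently likely tokens
    [-2.5, -2.4],              # distractor
    [-4.0, -3.1, -3.3],        # "None of the above."
]
print(choose_answer(logprobs))  # → 1
```

The per-token normalization matters because the answer candidates differ in length; without it, longer candidates would be systematically penalized.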

Evaluation Metrics
As accuracy is a fair evaluation metric for the MCQA format and has been adopted by many popular commonsense reasoning tasks (Mihaylov et al., 2018; Talmor et al., 2019; Bisk et al., 2020), we evaluate model performance using two accuracy metrics. Instance-based Accuracy considers each (original or reconstruction) question separately. We report instance-based accuracy on the original puzzles and on their semantic and context reconstructions. Group-based Accuracy considers each original puzzle and its variants as a group. The model scores 1 only when it successfully solves all three puzzles in the group; otherwise, its score is 0.
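The two metrics can be made concrete with a short sketch: each puzzle group contributes three instance-level predictions (original, semantic reconstruction, context reconstruction), and a group counts as solved only if all three are correct. The prediction values below are illustrative.

```python
def instance_and_group_accuracy(results):
    """results: list of (orig, sem, ctx) correctness booleans per puzzle group.
    Returns instance-based accuracy over all questions and group-based
    accuracy, where a group scores 1 only if all three variants are solved."""
    flat = [v for group in results for v in group]
    instance_acc = sum(flat) / len(flat)
    group_acc = sum(all(group) for group in results) / len(results)
    return instance_acc, group_acc

# Toy predictions for four puzzle groups:
results = [(True, True, True), (True, False, True),
           (True, True, False), (False, False, False)]
inst, grp = instance_and_group_accuracy(results)
print(round(inst, 3), grp)  # → 0.583 0.25
```

The gap between the two numbers is exactly the consistency signal discussed in the Results section: a model can score well per instance while rarely solving a whole group.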

Results
Our experiments target five questions: 1) Can LLMs reason on lateral thinking puzzles similarly to humans? 2) How do LLMs perform on reconstruction variants? 3) Are model predictions consistent across partitions? 4) Does tuning on commonsense knowledge help to answer BRAINTEASER puzzles better? 5) Can LLMs do better in the few-shot setting with more demonstrations?
Overall Performance The main results are shown in Table 4. For both word and sentence BRAINTEASER puzzles, the performance of the best model, ChatGPT, is halfway between random (25%) and human performance (92%). In general, neither type of model is able to perform consistently well across the two subtasks: instruction-based models perform better on word puzzles, whereas commonsense models perform slightly better on sentence puzzles. The performance of the models is often close to random, with around a third of the models performing equal to or worse than random guessing. As can be expected, we see that scaling up instruction-finetuned models leads to improved performance on both subtasks. Yet, the large gap between human and model performance clearly shows that even the most powerful LLMs are unable to exhibit lateral thinking in multiple-choice puzzles, and confirms the challenging nature of our BRAINTEASER dataset.

Original vs Reconstruction Partitions
In most cases, all models and humans perform best on the context reconstruction partition. We hypothesize that this is because original lateral thinking puzzles are designed to mislead humans to a wrong choice based on commonsense associations, often involving rare words and unconventional sentence structures. Meanwhile, we note that our contextual reconstruction mechanism yields puzzles that are more familiar or easier to solve than the original puzzle, possibly because some of the commonsense associations are relatively weaker. An exception to this trend is ChatGPT's performance on word puzzles, where ChatGPT performs best on the original examples. We believe that this is due to a combination of two factors. First, the word puzzle reconstructions only have a limited impact on the vocabulary domain and sentence structure, because of the much shorter questions. Second, ChatGPT may have memorized some of the word puzzles, e.g., given the question "How do you spell COW in thirteen letters?", its answer begins with "The question seems to be a brain teaser ..." We provide representative examples of the prevalent lateral thinking errors of memorization and commonsense associations in Table 5.

Consistency of Model Predictions
We further compare the performance on instance- and group-based metrics to understand whether the models can solve lateral thinking puzzles by following a consistent reasoning path. A model understanding rather than memorizing the reasoning path of the original brain teaser should be able to answer its adversarial reconstructions with ease. Notably, human performance only has a minimal drop on group-based metrics, whereas all models suffer significant drops. Further analysis (see Appendix A.6) reveals that ChatGPT and RoBERTa-L fail to answer many (45% and 61%, respectively) of the original or semantically changed puzzles even when the contextually translated puzzles are solved correctly. These observations suggest that the ability of the models to perform consistent lateral thinking is far from human ability.

[Table 5: representative examples of memorization errors (e.g., "The river was frozen.") and commonsense association errors (e.g., "A caterpillar." vs. "An eagle.", "It was daytime.", "The book is in Braille."), discussed in the Qualitative Error Analysis section.]

Impact of Commonsense Knowledge
We observe that commonsense knowledge has a salient negative impact on the models' performance on sentence puzzles. The best-performing model in the commonsense category is the vanilla RoBERTa model, whose adaptation with commonsense knowledge leads to a significant drop in results, especially with the CAR method. This trend confirms our initial hypothesis that learning commonsense associations is generally detrimental to complex lateral thinking tasks. Commonsense knowledge has a limited positive impact on the word-puzzle task, possibly because many of the commonsense associations learned by these models hold between words, including synonyms. Finally, given the apparent similarity of riddles and lateral thinking puzzles, we finetuned a RoBERTa model on the RiddleSense dataset and evaluated it on our task. Again, we observe that the model struggles to solve the puzzles, despite achieving better results compared to the vanilla RoBERTa model (see Appendix A.7).
Impact of Few-Shot Demonstrations As LLMs are good few-shot learners (Brown et al., 2020b), we are interested to see if in-context learning can help them better solve our task. We experiment with our two most powerful models, ChatGPT and FlanT5 (11B) (see Appendix A.4 for the detailed few-shot results).

Qualitative Error Analysis
We analyze two prevalent lateral thinking errors in the ChatGPT and FlanT5 (11b) LLMs: memorization and commonsense associations, both of which become more apparent with scaling up (Carlini et al., 2022). We show examples in Table 5.
Memorization We find that memorization happens in both subtasks. Given the sentence puzzle "The man calls his dog on the other side of the river, crosses the river without getting wet and using any tools.", the LLMs picked the correct answer "The river was frozen." for both the original and its semantic reconstruction. However, when the question in a new context becomes "The man had to cross the rivers. He can't swim or use any tools like the bridge. How does the man succeed in the end?", all LLMs failed to answer. Memorization is more frequent in word puzzles. A semantic reconstruction will cause confusion in the model, as is also apparent from the gap between the original accuracy and the ori&sem accuracy in Table 4.
Commonsense association Similarly, we also find that commonsense association often confuses LLMs. For example, for "What animal has no wings, but yet will fly?", the models associate the words "wings" and "fly" with birds and pick the wrong answer "An eagle.", despite the contradiction between "eagle" and "no wings". Meanwhile, the correct lateral thinking answer "A caterpillar." is not picked by the models. Interestingly, commonsense associations that mislead models in some examples can be the needed hint in others. For example, in one puzzle, "There is no light on the road and the car's headlight is broken. How can the driver see the black dog?", the answer "It was daytime." is hindered by the commonsense association between mentioning no light and night. However, in another example, "How can Jenny read in a totally no light house at night?", the same commonsense association leads the model to the correct answer: "The book is in Braille." In the second example, models are instead misled by another commonsense association, related to reading.

Conclusions and Outlook
We defined the task of lateral thinking for LLMs, formulated as multiple-choice QA with sentence- and word-level puzzles. We developed BRAINTEASER, a 1.1K lateral thinking benchmark that combines original puzzles and their reconstruction variants. Our experiments showed that ChatGPT's performance on this task is halfway between random and human performance, whereas other models often perform close to random. While scaling up model size improved performance, enriching models with common sense or providing few-shot demonstrations yielded limited benefits. Meanwhile, all models tend to solve the variants of the same puzzle inconsistently. Our error analysis showed that the models' lateral thinking is often hindered by memorization and misleading commonsense associations. In the future, we intend to develop lateral thinking models, create additional lateral thinking evaluation tasks (e.g., relating to alteration (De Bono, 1970)), and investigate flexible ways to combine lateral and vertical thinking.

Limitations
While our work focuses on both Sentence Puzzles and Word Puzzles, we intend to develop a comprehensive lateral thinking benchmark according to de Bono's four skills: awareness, random stimulation, alternatives, and alteration (De Bono, 1970). Moreover, while our paper tries to provide a clear distinction between lateral and vertical thinking, it remains an open question to which extent other brain teaser categories, e.g., puns and visual puzzles, require lateral or vertical thinking. As these tasks are not the focus of our present paper, we leave it to future work to comprehensively evaluate models' ability to think outside the box on such tasks and to characterize the complementary and opposing aspects of vertical and lateral thinking.

Ethical Considerations
As our lateral thinking puzzles are "folk knowledge" published on a wide range of websites, it is hard to check their original licenses comprehensively. Yet, the website owners declare permission to print and download material for noncommercial use without modification in the material's copyright statements. Therefore, we provide the corresponding copyright statements and website URLs for each original lateral thinking puzzle and its reconstruction version. In addition, we will create a form asking future dataset users to sign a document stating that the data will be used for research purposes only before providing them with the data. We note that, despite our best efforts, the task data may still contain bias in terms of gender or politics. We will indicate that future research should use the task data with caution.

A.3 Word puzzle example
Table 7 presents a word puzzle with its reconstruction examples.

A.4 Few-shot prompting result
Table 8 shows the few-shot results of ChatGPT and FlanT5 (11B) on the two BRAINTEASER subtasks.

A.5 Annotation Details
Human evaluation We give the following instruction to human evaluation participants: "Hi, welcome to the brain teaser test. Each brain teaser has only one possible solution (none of the above is possible!). Please select the choice in the answer column. Try to Think out of Box :)"

Human validation We give the following instruction: "Congratulations on passing the brain teaser test. You should notice that some brain teasers are similar to each other :)! Actually, the brain teasers can be divided in groups like the following: In each brain teaser group, we have an original question, semantic reconstruction questions, and context reconstruction questions. A semantic reconstruction question rephrases the original question without changing the correct answer and the distractors. A context reconstruction question keeps the original reasoning path but changes both the question and the answer to describe a new situational context.
Please help with the following three tasks: 1) Whether the original question and its answer make sense. 2) Whether the semantic reconstruction question rephrases the original question. 3) Whether the context reconstruction question keeps the original reasoning path."

Open-ended Human Performance We give the following instruction: "Please write down the answer of each brain teaser. Anything that makes sense is welcome!! Also, no answer is acceptable!" We let both humans and ChatGPT write down the most plausible answer to 30 context reconstruction questions based on their understanding. Three experts score the answers on a 0-5 scale, based on the following rubric:
• score 0: Fail to answer.
• score 1: Try to answer the question, but the answer doesn't make sense.
• score 2: The answer is wrong but related to the golden label.
• score 3: The answer is wrong, but the reasoning strategy is similar to the golden answer and may lack some keywords.
• score 4: The answer is wrong but lacks only minor information, or the answer makes sense but is not the same as the golden answer.
• score 5: The answer is correct.
Neither humans nor LLMs perform this task well, scoring 2.64 and 2.62, respectively, on the 5-point scale. Humans give up more often (18%) rather than generating meaningless text like ChatGPT, making the comparison harder when the task is in an open-ended format.
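As a minimal sketch of how such rubric scores can be aggregated into a single number per system (our own illustration: the 0-5 rubric and the three-expert setup come from the text above, but the function name and the example scores are hypothetical):

```python
def aggregate_rubric_scores(scores_per_answer):
    """Average the expert scores (0-5 rubric) for each answer,
    then average those per-answer means over all answers to get
    one overall score for a system."""
    per_answer_means = [sum(s) / len(s) for s in scores_per_answer]
    return sum(per_answer_means) / len(per_answer_means)

# Hypothetical ratings: three answers, each scored by three experts.
scores = [[5, 4, 5], [2, 2, 3], [0, 1, 1]]
print(round(aggregate_rubric_scores(scores), 2))  # → 2.56
```

Averaging per answer first keeps each question equally weighted even if, in a different setup, the number of experts per answer were to vary.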
A.6 Evidence of stronger distractors in the original puzzle
In the barber example in Figure 1, "shaves everyday" and "keeps his beard long" trigger a commonsense association that the man shaves himself every day.
The contextually reconstructed puzzle of the barber example is "How can a man go to football team every day but doesn't play football at all?". This new question still aims to guide the model to think in the default commonsense way, namely that "He is a football player.", but the correct answer, "He is a coach.", is also highly probable, resulting in an inherent decrease in difficulty.

A.7 Fine-tuned on Riddle Sense
We fine-tuned RoBERTa-L on RiddleSense (Lin et al., 2021) to analyze whether being aware of linguistic creativity can enhance the model's performance on BRAINTEASER. We train RoBERTa-L (RS) on the training data of RiddleSense for 3 epochs with a learning rate of 1e-6 and a batch size of 4. RoBERTa-L (RS) reaches 59.95 on the RiddleSense validation set, on par with the original paper (60.72). We then use RoBERTa-L (RS) for zero-shot evaluation on BRAINTEASER. The results are shown in Table 9.
Even though RoBERTa-L (RS) has already gained insight into creative thinking, it still struggles on BRAINTEASER. The better results suggest that enhancing creative thinking during training may help.

Figure 1 :
Figure 1: Contrasting existing Vertical Thinking tasks (PIQA (Bisk et al., 2020) and RiddleSense (Lin et al., 2021)) to our novel lateral thinking task called BRAINTEASER. While prior tasks require commonsense to be injected, BRAINTEASER's lateral thinking puzzles require default commonsense thinking to be deprecated.

Table 1 :
Example of generated distractors.

Table 2 :
A sentence-based lateral thinking puzzle and its reconstruction variations. We present an analogous word-level puzzle in Appendix A.3.
Question: How could a cowboy ride into town on Friday, stay two days, and ride out on Wednesday?
Answer: His horse is named Wednesday.
Distractors: While in town, he stays in bed for two days. / Friday and Saturday are holidays.

Table 3 :
Key statistics of the BRAINTEASER dataset. Choices combine the correct answer with all the distractors. The standard deviation is computed without the "None of the above" choice, as its token length is fixed and unrelated to the question context.

Table 4 :
Main zero-shot results over the two BRAINTEASER subtasks across all models and metrics: Ori = Original, Sem = Semantic, Con = Context. The best performance among all models is in bold, and the best performance among commonsense-augmented models is underlined. The human evaluation (*) is computed over 102 randomly sampled instances. The random baseline is averaged over three different seeds.

Table 5 :
Error analysis on memorization and commonsense association.