ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind

Theory of Mind (ToM), the capacity to comprehend the mental states of distinct individuals, is essential for numerous practical applications. With the development of large language models (LLMs), there is a heated debate about whether they are able to perform ToM tasks. Previous studies have used different tasks and prompts to test ToM in LLMs, and the results are inconsistent: some studies assert that these models are capable of exhibiting ToM, while others suggest the opposite. In this study, we present ToMChallenges, a dataset for comprehensively evaluating Theory of Mind based on the Sally-Anne and Smarties tests with a diverse set of tasks. In addition, we propose an auto-grader to streamline the answer evaluation process. We tested three models: davinci, turbo, and gpt-4. Our evaluation results and error analyses show that LLMs behave inconsistently across prompts and tasks. Performing ToM tasks robustly remains a challenge for LLMs. We also aim to raise awareness about evaluating ToM in LLMs and to invite more discussion on how to design prompts and tasks that can better assess the LLMs' ability.


Introduction
As large language models (LLMs) become increasingly prevalent in applications in natural language understanding and dialogue generation (Devlin et al., 2019; Brown et al., 2020; Raffel et al., 2020), the demand for models to develop Theory of Mind (ToM) has grown rapidly. Theory of Mind refers to the ability to impute mental states to different individuals, e.g., beliefs, emotions, and intentions (Wimmer and Perner, 1983; Gallese and Sinigaglia, 2011). ToM is commonly measured through false belief tasks in psychology studies (Dennett, 1978), as these tasks unambiguously show whether children can distinguish between their own belief (true belief) and other people's belief (false belief). For example, in the Smarties test, a classic false belief task, the child is shown a Smarties candy box and asked what they believe is in the box. Naturally, the child would answer 'Smarties.' The experimenter opens the box to show the child that it was filled with something else, like crayons. Then, the child is asked what they think another person, who hasn't seen what's inside the box, would believe is inside the Smarties box. Children younger than 4 years old would answer 'crayons' as they assume that other people know what they know, whereas older children would answer 'Smarties' as they are able to reason that other people see the label on the box and assume that there are Smarties inside (Gopnik and Astington, 1988). Typically children are able to pass false belief tasks around age 4 or 5 (Wellman et al., 2001). The development of ToM is closely intertwined with language development, as both abilities develop around the same age and are highly correlated, whereas other cognitive abilities do not correlate with ToM as highly as language does (Milligan et al., 2007). Since mental states cannot be observed through behavior, language is indispensable in understanding and reasoning about mental states. Although the exact nature of the relationship between language and ToM is still under study, some studies propose that the relation can be causal (De Villiers and Pyers, 2002; Moore et al., 1990).
Theoretically, LLMs could develop ToM given their powerful natural language understanding capacity. Testing ToM in LLMs could bring more insight into the relationship between language and ToM development. In addition, ToM is also important for improving the applications of LLMs, as we want the models to generate appropriate and context-aware responses. For example, when requesting a model to continue generating a story, we anticipate that it will recognize distinct beliefs held by different characters. Likewise, we expect a chatbot to provide more tailored and empathetic responses to various users.
There is an ongoing debate on whether ToM has already emerged in the current models: some studies assert that the models exhibit ToM (Kosinski, 2023; Wu et al., 2023), some suggest the opposite (Le et al., 2019; Nematzadeh et al., 2018; Sap et al., 2022; Ullman, 2023a), and others remain cautious and raise questions (Sileo and Lernould, 2023; Aru et al., 2023). The inconsistency of the findings may be largely attributed to the variance of ToM evaluation methods adopted in these studies.
ToM theories in the field of child development (Quesque and Rossetti, 2020) suggest that we should ensure the measure focuses on mental states rather than irrelevant confounding processes (Mentalizing; e.g., focus on emotion rather than facial expression categorization), as well as maintain the distinction between the present and the imagined mental states (Nonmerging). Tasks that fail to satisfy the two criteria shouldn't be regarded as valid assessments. In the study by Ullman (2023b), variations such as transparent access, uninformative label, and others were used to examine the robustness of models. However, the variations primarily incorporate pragmatic knowledge and inferential bias, which deviate from the criterion of Mentalizing and do not effectively maintain the Nonmerging requirement. Similarly, testing a few examples on a single format, as done by Kosinski (2023), also deviates from Mentalizing. This is because corner-cutting heuristics may occur when the language of the task itself contains regularities and correlations (Le et al., 2019). Moreover, LLMs are shown to be sensitive to the choice of prompts (Jiang et al., 2020; Zhao et al., 2021; Elazar et al., 2021; Schick and Schütze, 2022). To our knowledge, none of these works consider the impact of prompts: Sap et al. (2022) framed the task as question answering, while Kosinski (2023) and Ullman (2023b) framed the task as story completion.
To improve the validity of ToM tests, one solution is to increase the open-endedness of the tasks, as proposed by Aru et al. (2023), while still adhering to the requirements of Mentalizing and Nonmerging as outlined by Quesque and Rossetti (2020). Open-ended tasks increase the diversity of task formats, making it harder for LLMs to use shortcuts to pass the tests. At the same time, following the requirements of Mentalizing and Nonmerging ensures a rigorous theoretical focus and meaningful results.
In this paper, we create a dataset based on two widely used false-belief tasks in human studies: the Sally-Anne test (Wimmer and Perner, 1983; Baron-Cohen et al., 1985) and the Smarties task (also known as the Crayon Box test, or the Unexpected Contents Test) (Gopnik and Astington, 1988). According to Quesque and Rossetti (2020), the false-belief tasks meet the Mentalizing and Nonmerging criteria. To enhance the open-endedness of the tests, we adapt our data for various tasks by creating unique prompts tailored for each task: fully-constrained (Fill-in-the-Blank, Multiple Choice, True/False), semi-constrained (Chain-of-Thought True/False, Question Answering), and open-ended generation (Text Completion). Next, we evaluate the performance of two versions of GPT-3.5 (text-davinci-003 and gpt-3.5-turbo-0301) on our dataset. Our results demonstrate that the models cannot reliably perform the ToM tasks. The Text Completion task leads to the best results, followed by the Fill-in-the-Blank task. In addition, the models show different accuracy patterns on the Sally-Anne and Smarties tests.
Related Work
Nematzadeh et al. (2018) were the first to propose using Theory of Mind (ToM) tasks from developmental psychology to evaluate different question-answer models. Their findings indicated that all the tested models were unsuccessful in completing their tasks, suggesting that these models lack the ability to keep track of inconsistent beliefs or states of the world. Le et al. (2019) showed that the QA benchmarks at that time suffered from data biases, such that corner-cutting heuristics could be exploited due to a strict event sequence template for each task type. To address this issue, they proposed new evaluation methods as well as a new dataset. Sap et al. (2022) evaluated GPT-3 (Brown et al., 2020) on this dataset and concluded that the models struggle with the task, with an accuracy of 55-60% on questions regarding mental states, even for GPT-3-davinci after few-shot finetuning.
These studies evaluated ToM on different datasets and tasks. Most of the studies develop the datasets based on the false-belief tasks used in psychology studies, namely the Smarties test and the Sally-Anne test (for a detailed description, see Section 3.1). For example, Le et al. (2019) proposed the ToMi dataset, which is based on the Sally-Anne test and the bAbi dataset (Weston et al., 2015). The findings of the study showed that the models do not reliably exhibit ToM. Kosinski (2023) used a differently crafted dataset based on both the Sally-Anne test and the Smarties test and showed that the text-davinci-003 model is able to perform ToM tasks.
Different from these works, we focus on the validity of ToM tests, considering the Mentalizing and Nonmerging criteria. We also consider the impact of prompts on model performance, propose to adapt the data for various tasks by creating different prompts, and construct a principle-guided dataset and diverse evaluation tasks for exploring ToM.

TOMCHALLENGES and Tasks
We aimed to build a corpus based on two types of tests, the Sally-Anne Test and the Smarties Test, following the Mentalizing and Nonmerging criteria proposed by Quesque and Rossetti (2020). Below we describe how we construct the TOMCHALLENGES data and how we design the diverse evaluation tasks.

Dataset Construction
While Le et al. (2019) proposed the inclusion of distractors to prevent models from adopting corner-cutting heuristics, it is important to note that distractors are more relevant for fine-tuning than for zero-shot probing. Given the ongoing discussions surrounding the zero-shot performance of models in recent studies (Kosinski, 2023; Ullman, 2023b) and the fact that finetuning is not available yet for GPT-3.5, we introduce a distractor-less dataset as below to maintain focus, with examples displayed in Tables 1 and 2. We created 30 variations of each test (e.g., changing the person's name, location, and items), and the details of the tests and variables are described as follows.

Text Completion: Complete the following paragraph: N After Juanita came back to the attic, Neila would think Juanita would look for the towel in Answer:
Table 3: An illustrative example for different task templates of the Sally-Anne Test, ignoring line breaks in templates for space saving.
The two containers, C1 and C2, represent the object's initial and updated positions, respectively. We extract the agent names from CMU Name Corpus 2 and manually write the options for L, O, C1 and C2 following the rules below:
• The location L should be spacious enough for two people to spend time together.
• The object O should be reasonably movable by hand.
• Both containers (C1 and C2) should be capable of accommodating the object.
For the questions, REALITY focuses on the updated/current position of O, and BELIEF focuses on the initial/previous position. The first-order belief (1STA and 1STB) questions ask about the agents' mental states, and the second-order belief (2NDA and 2NDB) questions ask about one agent's belief regarding another agent's mental state.
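The template variables above lend themselves to a small data structure. The following is a minimal sketch, not the authors' generation script: the class name `SallyAnneStory`, its field names, and the `gold_answers` mapping are hypothetical. The sentence pattern follows the paper's example narrative, and the gold answers are an assumption derived from the usual false-belief logic (agent B did not see the move), not labels quoted from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SallyAnneStory:
    """One filled-in Sally-Anne template (all names here are illustrative)."""
    location: str    # L: where the event takes place
    agent_a: str     # A: stays and moves the object
    agent_b: str     # B: leaves before the object is moved
    obj: str         # O: the object that is moved
    container1: str  # C1: initial position of O
    container2: str  # C2: updated position of O

    def narrative(self) -> str:
        # Sentence pattern taken from the example narrative; the article "a"
        # is hard-coded, so vowel-initial nouns would need extra handling.
        return (
            f"{self.agent_a} and {self.agent_b} were hanging out in the {self.location}. "
            f"They saw a {self.container1} and a {self.container2}. "
            f"They found a {self.obj} in the {self.container1}. "
            f"{self.agent_b} left the {self.location}. "
            f"{self.agent_a} moved the {self.obj} to the {self.container2}."
        )

    def gold_answers(self) -> dict[str, str]:
        # Assumed labels: REALITY and 1STA point to the updated container,
        # BELIEF asks for the initial position, and the questions that depend
        # on B's knowledge (1STB, 2NDA, 2NDB) point to the initial container.
        return {
            "REALITY": self.container2,
            "BELIEF": self.container1,
            "1STA": self.container2,
            "1STB": self.container1,
            "2NDA": self.container1,
            "2NDB": self.container1,
        }


story = SallyAnneStory("attic", "Neila", "Juanita", "towel", "closet", "cabinet")
print(story.narrative())
```

Keeping the variables in one immutable record makes it straightforward to produce the 30 variations by swapping names, locations, and items while the sentence pattern stays fixed.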

Smarties Test
The Smarties Test is related to another person's false belief about the object in a specific container, as that container is labeled as holding a different object. Although it also involves one location and two agents, there are two differences: (1) there is only one container C that contains the object, and (2) two objects (O1 and O2) are mentioned in the narrative, with O1 being labeled and O2 actually occupying C. We choose the location with the same rule described above, and follow the rules below for the container and objects:
• The container C is likely to obscure the object inside it.
• The container C should be capable of accommodating both objects.
2 https://www.cs.cmu.edu/Groups/AI/util/locations/nlp/corpora/names/
The questions of Smarties Test are similar in nature to those for the Sally-Anne Test, but the REALITY question focuses on the real object in the container, and there's no BELIEF question for this test.
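As a rough illustration of the Smarties template, the sketch below fills the narrative and question wording of the example in Table 2 from the variables L, A, B, C, O1, and O2. The function names `smarties_narrative` and `smarties_questions` and their parameter names are hypothetical, and the code is only a guess at how such a template could be filled, not the authors' implementation.

```python
def smarties_narrative(location, agent_a, agent_b, container, labeled, actual):
    """Fill the Smarties narrative; `labeled` is O1 and `actual` is O2."""
    # The article "a" is hard-coded, so vowel-initial nouns would need care.
    return (
        f"{agent_a} found a {container} in the {location}. "
        f"The label on the {container} says {labeled}. "
        f"{agent_a} couldn't see what was inside the {container}. "
        f"{agent_a} opened the {container} and found a {actual}. "
        f"There is no {labeled} in the {container}. "
        f"{agent_a} closed the {container} and put it back. "
        f"{agent_b} entered the {location} and saw the {container}."
    )


def smarties_questions(agent_a, agent_b, container):
    """Return the question wording keyed by question type."""
    # Initial prompt shared by the first- and second-order belief questions.
    prefix = f"When the {container} was opened, "
    return {
        "REALITY": f"What was in the {container}?",
        "BELIEF": f"What was supposed to be in the {container}?",
        "1STA": prefix + f"what would {agent_a} expect to find in the {container}?",
        "1STB": prefix + f"what would {agent_b} expect to find in the {container}?",
        "2NDA": prefix + f"what would {agent_a} think {agent_b} would expect to find in the {container}?",
        "2NDB": prefix + f"what would {agent_b} think {agent_a} would expect to find in the {container}?",
    }


print(smarties_narrative("attic", "Neila", "Juanita", "bag", "plate", "vest"))
```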

Task Formulation
While we allow the model to generate freely, we restructure the two tests with diverse prompts, effectively transforming them into distinct task formats. Given the different levels of generation freedom, we categorize tasks into three groups: fully-constrained, semi-constrained, and open-ended generation.
Fully-Constrained Fully-constrained generation limits the model's output to specific predefined structures or responses. In this group, we design 3 tasks, i.e., Fill-in-the-Blank, Multiple Choice, and True or False questions.
Semi-Constrained Semi-constrained generation involves partial guidance by specific rules or structures, while still permitting some flexibility in the model's responses. This group encompasses 2 tasks, i.e., Chain-of-Thought True/False and Question Answering.
Open-Ended Open-ended generation enables the model to generate responses without being restricted by predefined rules or structures, leading to more diverse and varied outputs. An example of this group is Text Completion.
We demonstrate how to reframe a Sally-Anne Test example into these different tasks in Table 3. We only present the question 2NDA except for the question answering task, where we include all questions in the same prompt when we conduct experiments. For the Smarties Test, our templates are similar for task descriptions, but different in the phrase regarding questions.
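To make the reframing concrete, the sketch below wraps one narrative and the 2NDA question into three of the task formats, using the instruction wording of the Table 3 templates. The function name `build_prompts`, its parameters, and the placement of line breaks are assumptions (the table omits line breaks), so this should be read as an illustration rather than the exact prompts used in the experiments.

```python
def build_prompts(narrative, location, agent_a, agent_b, obj, c1, c2):
    """Wrap one Sally-Anne story and its 2NDA question into several task formats."""
    lead = f"After {agent_b} came back to the {location},"
    return {
        # Fully-constrained: the model fills a single slot.
        "Fill-in-the-Blank": (
            f"Fill in the blank (<>): {narrative} {lead} {agent_a} would think "
            f"{agent_b} would look for the {obj} in the < >.\nAnswer:"
        ),
        # Fully-constrained: the model picks one of two options.
        "Multiple Choice": (
            "Choose the correct answer from A or B for the following question:\n"
            f"Question: {narrative} {lead} where would {agent_a} think "
            f"{agent_b} would look for the {obj}?\nA. {c2}\nB. {c1}"
        ),
        # Open-ended: the model continues the paragraph freely.
        "Text Completion": (
            f"Complete the following paragraph: {narrative} {lead} {agent_a} "
            f"would think {agent_b} would look for the {obj} in"
        ),
    }


narrative = (
    "Neila and Juanita were hanging out in the attic. They saw a closet and a cabinet. "
    "They found a towel in the closet. Juanita left the attic. "
    "Neila moved the towel to the cabinet."
)
prompts = build_prompts(narrative, "attic", "Neila", "Juanita", "towel", "closet", "cabinet")
print(prompts["Text Completion"])
```

Because every format shares the same narrative and question content, differences in accuracy across the resulting prompts can be attributed to the task framing rather than to the story itself.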

Experimental Setup
We evaluate the zero-shot performance of two versions of GPT-3.5 models: text-davinci-003 and gpt-3.5-turbo-0301 (OpenAI, 2022). For the hyperparameters of both models, we set the temperature to 0, top_p to 1, and both the frequency penalty and the presence penalty to 0. Due to the different natures of our task designs, we choose a different maximum token limit for each prompt as follows: Fill-in-the-Blank at 10 tokens, Multiple Choice at 2 tokens, True or False at 20 tokens, CoT True or False at 100 tokens, and both Question Answering and Text Completion at 50 tokens.
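The decoding settings above can be collected in a small configuration, sketched below. The dictionary layout, the helper name `request_params`, and the `prompt` key are hypothetical, and the other key names simply mirror the hyperparameter names reported in the text. How these values are passed to a particular client library is deliberately left out.

```python
# Maximum number of generated tokens per task, as reported in the setup.
MAX_TOKENS = {
    "Fill-in-the-Blank": 10,
    "Multiple Choice": 2,
    "True/False": 20,
    "CoT True/False": 100,
    "Question Answering": 50,
    "Text Completion": 50,
}

# Decoding settings shared by both models.
SHARED_PARAMS = {
    "temperature": 0,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
}


def request_params(model, task, prompt):
    """Assemble the settings for one zero-shot query (illustrative helper)."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": MAX_TOKENS[task],
        **SHARED_PARAMS,
    }


print(request_params("text-davinci-003", "Multiple Choice", "Choose the correct answer ..."))
```

The tight limit for Multiple Choice leaves room for little more than the option letter, while the larger limit for CoT True/False leaves room for the step-by-step reasoning.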

Results and Analysis
In this section, we present the results of our evaluation for both models on two tests (Sally-Anne and Smarties tests), six tasks/prompts as shown in Table 3, and six question types (REALITY, BELIEF, 1STA, 1STB, 2NDA, and 2NDB) as shown in Tables 1 and 2. As we created 30 stories for each test, an idealized model that is capable of solving Theory of Mind tasks should be able to achieve high accuracy on all question types and in most of the stories.

Accuracy by Question Type
The accuracy of each question type is calculated by averaging the accuracy over all stories (e.g., an accuracy of 50% means that the model answered correctly for 15 out of the 30 stories). Figures 2 and 3 show the average accuracy of the 6 types of questions for different prompts.
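The per-question accuracy described above amounts to a simple average over stories. A minimal sketch, assuming the grading step has already produced a correct/incorrect judgment per story and question type (the nested-dictionary layout and the function name `accuracy_by_question` are hypothetical):

```python
from collections import defaultdict


def accuracy_by_question(graded):
    """graded maps story id -> {question type -> bool (answer judged correct)}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for per_question in graded.values():
        for qtype, is_correct in per_question.items():
            totals[qtype] += 1
            correct[qtype] += int(is_correct)
    # Fraction of stories answered correctly, per question type.
    return {qtype: correct[qtype] / totals[qtype] for qtype in totals}


# Toy data: 30 stories, REALITY always correct, 2NDA correct in 15 of them.
graded = {i: {"REALITY": True, "2NDA": i < 15} for i in range(30)}
print(accuracy_by_question(graded))
```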
For the Sally-Anne test, both the text-davinci-003 and turbo-0301 models are able to achieve near-perfect accuracy on REALITY, BELIEF, and 1STA questions for all prompts, indicating that the models can reason based on facts. For the 1STB question, which requires reasoning about the beliefs of both A and B, the turbo-0301 model achieved better accuracy than the text-davinci-003 model. For 2NDA and 2NDB questions, both models struggled to understand one person's belief about another person's belief. In addition, the models achieved the best overall performance with the Text Completion prompt, followed by the Fill-in-the-Blank prompt. Also, the introduction of Chain-of-Thought did not improve the models' performance on the True/False task.
The Smarties test showed a different accuracy pattern from the Sally-Anne test. Both models had difficulties answering the BELIEF and 1STA questions correctly. However, for 2NDA and 2NDB questions, the text-davinci-003 model achieved better performance in the Smarties test than in the Sally-Anne test. We observe that the completion prompt works best for the text-davinci-003 model, and the multiple-choice prompt works best for the turbo-0301 model.
By comparing the different tests, prompts, and questions, it is clear that the models cannot reliably perform ToM tasks. The models are sensitive to the prompts, and framing the stories as a Text Completion task works better than other tasks.

Accuracy by Stories
The accuracy of each story is calculated as the average accuracy over six question types. Although the stories are generated through the same template, the models produced different answers. Tables 4 and 5 show the average accuracy of the Sally-Anne and Smarties tests. For the Sally-Anne test, text-davinci-003 produced more stable results across different prompts since all stories achieved 50% accuracy for Multiple Choice, True/False, CoT True/False, and Question Answering prompts. The turbo-0301 model performs better since the average accuracy is higher across all prompts.

Sally-Anne    text-davinci-003        turbo-0301
N = 6         mean    range           mean    range
FB            0.61    0.5-0.83        0.93    0.67-1
MC            0.5     0.5-0.5         0.82    0.5-1
TF            0.5     0.5-0.5         0.65    0.5-0.83
CoT-TF        0.5     0.5-0.5         0.57    0.5-1
QA            0.5     0.5-0.5         0.68    0.5-1
Comp          0.72    0.5-1           0.92    0.67-1
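Since each story is scored over six questions, a story's accuracy can only take the values k/6, which is why the reported ranges are bounded by numbers such as 0.5 (3/6), 0.67 (4/6), 0.83 (5/6), and 1. The sketch below computes the per-story accuracy and the mean and range across stories. The data layout and the function names `story_accuracy` and `summarize` are hypothetical.

```python
from statistics import mean


def story_accuracy(per_question):
    """Fraction of one story's question types answered correctly."""
    return sum(per_question.values()) / len(per_question)


def summarize(graded):
    """Mean and range of per-story accuracy for one model and prompt."""
    scores = [story_accuracy(per_question) for per_question in graded.values()]
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}


QUESTION_TYPES = ["REALITY", "BELIEF", "1STA", "1STB", "2NDA", "2NDB"]

# Toy data: story 0 fails the last three question types, story 1 fails one.
graded = {
    0: {q: q in ("REALITY", "BELIEF", "1STA") for q in QUESTION_TYPES},
    1: {q: q != "2NDB" for q in QUESTION_TYPES},
}
print(summarize(graded))
print(round(4 / 6, 2), round(5 / 6, 2))
```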
For the Smarties test, the turbo-0301 model has better and more stable performance than the text-davinci-003 model, as the average accuracy is higher and the range is smaller.

Error Analysis
We further looked into the errors the models made, especially for the questions on which the models had low accuracy. For the Sally-Anne task, the text-davinci-003 model made errors on 1STB, 2NDA and 2NDB questions by assuming person B knew the new location of the item. In the context, person B does not know the new location of the item, since person A moved it after B left the room. However, the model could not reason about this aspect and assumed that person B knew that person A moved the item. The turbo-0301 model could answer most of the 1STB questions correctly, but failed on the 2NDA and 2NDB questions. These results indicate that the 2nd order belief task is still very difficult for the models.
For the Smarties test, the text-davinci-003 model struggles most on the 1STB question, and not so much on 2NDA and 2NDB questions. The common error for the 1STB question is that the model assumed person B knew what was inside the container, even though B did not open the container and did not know that the item inside is not the item indicated on the label.
In addition, we also found that the models cannot reliably infer the mental state of agents in the story. For example, the turbo-0301 model has zero accuracy on the 1STA question for the Smarties test with the question-answering prompt. The model actually refused to answer the 1STA question (e.g., After B opened the container, what would A expect to find in the bag?) by producing answers like: 'The context does not provide information on what A would expect to find in the backpack after B opened it.' This type of error indicates that the model does not have a robust understanding of mental states.

Conclusions
In this study, we proposed TOMCHALLENGES to comprehensively test ToM in LLMs. The dataset is constructed based on the Sally-Anne and Smarties tests. For each test, we created a template to generate variations of the test. In addition, we incorporated 6 types of questions to examine the model's understanding of reality, belief, 1st order belief and 2nd order belief. We also included 6 tasks with different prompts for evaluation, considering the impact of prompts on model performance. This evaluation method serves a dual purpose: it not only measures whether the model has ToM capacity, but also measures the robustness of the model in performing the ToM tasks.
Using 30 variations of the Sally-Anne and Smarties tests, we found that the GPT-3.5 models cannot reliably perform the ToM tasks. Overall, the models performed better on the Sally-Anne test than on the Smarties test. The types of prompts greatly affect the models' performance. The models achieved the best accuracy on the Text Completion task, followed by the Fill-in-the-Blank task. The models struggled on 1STB, 2NDA, and 2NDB questions for both the Sally-Anne and Smarties tests. If a model has a robust representation of ToM, it should have good performance across tests, questions and prompts. However, our evaluation shows that the models are sensitive to the test template, task/prompt, and question type, and that they cannot reliably perform well on the ToM tasks.
Further studies could investigate how and why different prompt types affect the models' performance. We hope our study will invite more discussion on the evaluation and improvement of ToM in LLMs.

Figure 1: An example of Smarties test, and Mentalizing and Nonmerging criteria.
The Sally-Anne Test is related to another person's false belief about the container of an object because the person is unaware of the container change while absent. The narrative involves several components: (1) a location L, where the event takes place, (2) two agents A and B, who maintain distinct mental states, (3) an object O, which is moved from one container to another during the narrative, and (4) two containers C1 and C2, representing the object's initial and updated positions, respectively.

Narrative N: Neila and Juanita were hanging out in the attic. They saw a closet and a cabinet. They found a towel in the closet. Juanita left the attic. Neila moved the towel to the cabinet.
Fill-in-the-Blank: Fill in the blank (<>): N After Juanita came back to the attic, Neila would think Juanita would look for the towel in the < >. Answer:
Multiple Choice: Choose the correct answer from A or B for the following question: Question: N After Juanita came back to the attic, where would Neila think Juanita would look for the towel? A. cabinet B. closet
True/False: Given the context, judge True or False of the given statements A and B respectively: N Statements: A. Juanita would look for the towel in the cabinet. B. Juanita would look for the towel in the closet.
CoT True/False: Given the context, reason through statements A and B step by step and provide a True or False judgment based on the reasoning: N Statements: A. Juanita would look for the towel in the cabinet. B. Juanita would look for the towel in the closet.
Q&A: Answer the following questions based on the context: Context: N Questions: 1. Where is the towel? 2. Where was the towel? 3. After Juanita came back to the attic, where would Juanita look for the towel? 4. After Juanita came back to the attic, where would Neila look for the towel? 5. After Juanita came back to the attic, where would Neila think Juanita would look for the towel? 6. After Juanita came back to the attic, where would Juanita think Neila would look for the towel?

Figure 2: The average accuracy for questions in Sally-Anne test for different prompts.

Figure 3: The average accuracy of questions in Smarties test for different prompts.

Table 1: An example for Sally-Anne Test.
Narrative N: Neila and Juanita were hanging out in the attic. They saw a closet and a cabinet. They found a towel in the closet. Juanita left the attic. Neila moved the towel to the cabinet.
REALITY: Where is the towel currently?
Table 2: An example for Smarties Test.
Variables: L: attic, A: Neila, B: Juanita, C: bag, O1: plate, O2: vest
Narrative N: Neila found a bag in the attic. The label on the bag says plate. Neila couldn't see what was inside the bag. Neila opened the bag and found a vest. There is no plate in the bag. Neila closed the bag and put it back. Juanita entered the attic and saw the bag.
REALITY: What was in the bag?
BELIEF: What was supposed to be in the bag?
Initial prompt †: When the bag was opened,
1STA: what would Neila expect to find in the bag?
1STB: what would Juanita expect to find in the bag?
2NDA: what would Neila think Juanita would expect to find in the bag?
2NDB: what would Juanita think Neila would expect to find in the bag?
The initial prompt with † is applied to 1STA, 1STB, 2NDA, and 2NDB.

Table 4: The average accuracy for stories in the Sally-Anne test for different prompts. The terms Fill-in-the-Blank, Multiple Choice, True/False, CoT True/False, Question Answering, and Text Completion are abbreviated as FB, MC, TF, CoT-TF, QA, and Comp, respectively.

Table 5: The average accuracy for stories in the Smarties test for different prompts.