FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions

Theory of mind (ToM) evaluations currently focus on testing models using passive narratives that inherently lack interactivity. We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering. Our benchmark draws upon important theoretical requisites from psychology and necessary empirical considerations when evaluating large language models (LLMs). In particular, we formulate multiple types of questions that demand the same underlying reasoning to identify an illusory or false sense of ToM capabilities in LLMs. We show that FANToM is challenging for state-of-the-art LLMs, which perform significantly worse than humans even with chain-of-thought reasoning or fine-tuning.


Introduction
Existing evaluations of language models' theory of mind (ToM), i.e., the ability to understand the mental states (e.g., thoughts, beliefs, and intentions) of others (Premack and Woodruff, 1978), focus primarily on situation descriptions (i.e., narratives) as the target domain (Nematzadeh et al., 2018; Le et al., 2019; Sap et al., 2022; Shapira et al., 2023a). However, ToM capabilities play an even more important role in understanding dynamic social interactions, as they form a crucial component of effective communication (Frith, 1994; Schober). We therefore ground FANTOM in interactions, i.e., conversations. As conversations present interactions in their raw form, they are much less susceptible to reporting biases and are more aligned with real-world scenarios requiring ToM reasoning. FANTOM consists of 10K questions covering 256 multiparty conversations, each around a certain topic, in which characters enter and leave the discussion, leading to distinct mental states between characters due to information asymmetry.
The goal of FANTOM is to effectively measure how well models can track the beliefs of multiple characters in conversations where some information may be inaccessible to some participants. For example, in Figure 1, Kailey briefly steps away from the conversation to get a cup of coffee while the others continue discussing Linda's new dog. The information exchanged during Kailey's absence remains unknown to Kailey; only the information shared after Kailey's return is accessible. We convert factual question-answer pairs into multiple challenging questions about characters' beliefs concerning the inaccessible information. Our aim is to design questions at different levels that evaluate a model's capability for a coherent understanding of others' mental states. In doing so, we are particularly interested in identifying instances of illusory ToM, which we define as situations where a model answers some questions correctly but fails to answer others that require the same type of ToM reasoning.
The analysis of evaluation results on FANTOM reveals several interesting findings (§4): (1) Existing neural models score significantly lower than humans on individual questions and on the full set of questions, by more than 70% on average. (2) While chain-of-thought (CoT) reasoning does improve performance in most models, it does not substantially bridge the gap with human performance. (3) Although our benchmark is not meant for training, we observe that fine-tuning can help models achieve scores higher than human performance on individual question types. However, on metrics that require coherent responses across multiple question types, the fine-tuned model still significantly underperforms compared to humans. (4) Additionally, we find that models exhibit different error types depending on the format of the questions, despite all questions requiring the same underlying reasoning. (5) Moreover, our results indicate that CoT has a selective impact on performance, showing improvement only in specific scenarios.
To the best of our knowledge, FANTOM is the first benchmark to introduce conversation-based ToM evaluation for language-based models. Our benchmark design and experiment results yield important insights into the debate around ToM (Whang, 2023) and the development of artificial general intelligence (Metz, 2023) in LLMs. We release our benchmark to spark further discussions on evaluating the ToM capabilities of LLMs.

Design Considerations for FANTOM
We go over the important design choices that we made when constructing FANTOM. Our goal is to incorporate (1) social interactions that necessitate natural theory of mind (ToM) reasoning (§2.1), (2) essential theoretical prerequisites for validating ToM from psychology (§2.2), and (3) empirical findings that must be taken into account when evaluating large language models (§2.3).

Grounding in Social Interactions
To capture the interactive aspect of ToM, we ground our task in natural social interactions, i.e., conversations. By doing so, we gain two key benefits: (1) minimizing reporting bias (Gordon and Van Durme, 2013) and (2) aligning with real-world scenarios.
Since narratives are condensed descriptions of interactions, the process of deciding what to include or exclude can introduce reporting bias, resulting in artifacts that models exploit. For instance, including "Carlos did not see this, so he does not know currently where the apple is." in a narrative for ToM evaluation provides a significant clue about the other's mental state. However, such explicit hints are rarely present in real-world interactions.
Conversations, on the other hand, present interactions in their raw form, without such explicit hints about others' mental states. During conversations, we reason through the intermediate steps from scratch; grounding the benchmark in conversations therefore enables a more realistic and unbiased assessment of ToM.

Meeting Theoretic Requirements
We follow the two important criteria outlined by Quesque and Rossetti (2020) that must be met when designing a task to validate ToM: "non-merging" and "mentalizing".
(1) "Non-merging": Evaluation should require the respondent to maintain a distinction between the others' mental state and its own.For example, suppose someone is asked about the other's belief regarding the location of the TV remote controller, and both are believing it to be on the sofa.If the respondent answers that the other believes it is on the sofa, it becomes unclear whether the response is based on the respondent's own belief or the other's (i.e., merging mental states).Such merging scenario is unsuitable for validating ToM.
Since machines lack emotions or intentions (Gros et al., 2022), we exploit information asymmetry when constructing our benchmark to simulate non-merging mental state scenarios. We design multiparty conversations where specific information is inaccessible to certain characters. While machines do not possess their own point of view, they act as omniscient observers during our evaluation since we provide the entire conversation as input. As a result, the mental states of the model and the character can be regarded as distinct with respect to that information.
(2) "Mentalizing": Lower-level processes should not be accounted for successful performance of ToM tasks.If a simpler process can explain a phenomenon, it should always be preferred over a more complex one when interpreting the results.For instance, recognizing joy by observing laughter is more of a visual discrimination than reasoning mental representations.
If the correct answer to a ToM task has a high degree of word correlation with a salient part of the given input, it becomes difficult to determine whether the model is accurately ascribing the other's mental state or simply following shortcut pattern matching (i.e., the lower-level process). Therefore, such cases should be discouraged when evaluating ToM in neural language models. In FANTOM, we create false answers that have high word correlation with the input to verify whether the models can overcome shortcut pattern matching when reasoning about mental states.

Seeking Comprehensive Evaluation
Since the performance of LLMs varies significantly based on the given prompts (Webson and Pavlick, 2022), we adopt a series of reiterative questions at various levels for the same input context, including free-form response questions, multiple-choice questions, and straightforward yes-or-no questions. The inclusion of free-form response questions is important as it aligns with the common usage of LLMs, in contrast to the multiple-choice questions that are prevalent in existing benchmarks (Sakaguchi et al., 2021; Hendrycks et al., 2021). Although their formats differ, all questions in FANTOM fundamentally aim to ascertain the same underlying reasoning: "who is aware of the information?" As a result, FANTOM enables us to identify illusory ToM instances wherein models deliver accurate responses for one format but struggle to do so for another.

FANTOM Overview
Following the success of previous works (Kim et al., 2022; Chen et al., 2023), we automatically construct full conversations using the large language model (LLM) InstructGPT davinci-003 (Ouyang et al., 2022). We also generate theory of mind (ToM) question-answer pairs related to the conversation participants' beliefs using a specially designed pipeline, since in preliminary explorations we find that off-the-shelf LLMs struggle with directly generating ToM question-answer pairs for a given conversation. Our pipeline consists of three steps: (1) generate conversations with information asymmetry (§3.1), (2) generate fact question-answer (QA) pairs (§3.2), and (3) construct ToM (e.g., belief) QA pairs from the fact QA pairs (§3.3). We use different evaluation methods for each question type (§3.4) and validate the final dataset (§3.5).

Information-Asymmetric Conversations
FANTOM consists of small talk conversations involving multiple characters, with each conversation centered around a topic (e.g., pets, risk-taking, personal growth). Each topic has several subtopics; e.g., the topic "pets" may include the subtopics "breed" and "special moves". Initially, the conversation begins with two or three characters. As the conversation progresses, characters join and leave the discussion and the conversation's subtopic changes over time.
Conversations include explicit indications of leaving and joining, such as the utterances "Hey guys, I'll go grab a coffee." or "Hey, I'm back, what are you guys discussing now?" shown in Figure 1.
During the absence of a character, the conversation continues and information is shared among the remaining participants, creating a natural information asymmetry that reflects real-life interactions. After a series of utterances, the character who was absent (re)joins the conversation, unaware of the information that was previously shared with the other participants. More details are in Appendix A.1.

Many existing ToM tasks involve some form of asymmetry between characters (Braüner et al., 2020). For example, in the Sally-Anne task, Sally does not know that Anne relocated the object, while the observer is aware of the action. In the Smarties task, the character in the story does not know the label changed, whereas the observer is fully aware of this. This inherent asymmetry ensures that two distinct mental states (i.e., the non-merging criterion; §2.2) are present during the experiments.

Factual Question-Answer (QA) Pairs
The conversations in FANTOM include factual question-answer pairs (FACTQ) about the inaccessible information, i.e., the information that a specific character is unaware of. An example question would be "What is the breed of Linda's dog?" in Figure 1. More details are in Appendix A.2.
There are two distinct types of answers for each FACTQ: (1) FULL FACT A and (2) LIMITED FACT A. The FULL FACT A incorporates the full information in the preceding conversation where the character PersonX was absent. In contrast, the LIMITED FACT A relies only on the conversation in which PersonX participated. The former answer is based on information that PersonX does not have access to, while the latter answer only takes into account the information accessible to PersonX. For cases where no information was shared regarding the FACTQ, the LIMITED FACT A indicates that no information has been provided. Examples of these answers are in Figure 1: "Linda has a golden retriever." is based on the preceding conversation where Kailey was absent, while "There is no information on the breed of Linda's dog." is based on the conversation Kailey participated in.

ToM QAs based on Fact QAs
For each FACTQ, we build six types of ToM QAs. Construction details can be found in Appendix A.3.
(1) BELIEFQ [DIST.] and (2) BELIEFQ [CHOICE]: These questions are created by rephrasing the FACTQ to ask about the beliefs of characters in the conversation. We are particularly interested in PersonX's belief about the inaccessible information from the previous conversation, in which PersonX did not participate. For example, the FACTQ "What is the breed of Linda's dog?" in Figure 1 is converted to "What breed would Kailey think Linda's dog is?" The BELIEFQ [DIST.] requires a free-form response, while BELIEFQ [CHOICE] provides multiple-choice options for the same question.
The options for BELIEFQ are created by rephrasing the FULL FACT A and the LIMITED FACT A. For example, "Linda has a golden retriever." in Figure 1 is converted to "Kailey believes Linda has a golden retriever." Since the FULL FACT A reflects information that is not accessible to PersonX and the LIMITED FACT A incorporates only the information accessible to PersonX, we label the converted FULL FACT A and LIMITED FACT A as "OMNISCIENT-VIEW BELIEF A" and "PERSONX-CENTRIC BELIEF A", respectively.
(3) ANSWERABILITY Q [LIST]: Given the FACTQ, we ask models to "List all the characters who know the correct answer to this question". In essence, we are interested in whether the model can identify who among the participants can correctly answer the FACTQ. This is a meta-question that necessitates two-step reasoning: first determining the answer itself, and second, identifying the characters who have access to this knowledge.
(4) INFOACCESS Q [LIST]: Here, we provide the FULL FACT A together with the FACTQ and ask the model to "List all the characters who know this information". Essentially, this question aims to identify the individuals who have knowledge of or access to this information. Since the information is explicitly provided to the model, only the second reasoning step of the ANSWERABILITY Q [LIST] is required.
(5) ANSWERABILITY Q [Y/N] and (6) INFOACCESS Q [Y/N]: We ask models to determine, through a simple binary response (yes or no), whether each character is capable of answering the question or knows the information. For example, we ask models "Does David know the correct answer to this question?" and "Does Sally know about this information?" (Figure 1).
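Concretely, the six questions derived from a single FACTQ can be pictured as one structured record. The sketch below illustrates such a set using the Figure 1 example; the field names and the exact phrasing of the PersonX-centric option are our own illustration, not the dataset's actual schema.

```python
# Illustrative structure of one FANTOM question set (field names are hypothetical,
# not the dataset's actual schema); the content follows the Figure 1 example.
question_set = {
    "fact_q": "What is the breed of Linda's dog?",
    "full_fact_a": "Linda has a golden retriever.",  # uses the information Kailey missed
    "limited_fact_a": "There is no information on the breed of Linda's dog.",
    "belief_q": "What breed would Kailey think Linda's dog is?",
    "belief_choices": {
        "omniscient_view": "Kailey believes Linda has a golden retriever.",
        # Plausible rephrasing of the LIMITED FACT A; the dataset's actual wording may differ.
        "personx_centric": "Kailey has no information about the breed of Linda's dog.",
    },
    "answerability_q_list": "List all the characters who know the correct answer to this question.",
    "infoaccess_q_list": "List all the characters who know this information.",
    "answerability_q_binary": "Does David know the correct answer to this question?",
    "infoaccess_q_binary": "Does Sally know about this information?",
}
```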

Evaluation
Each question is provided to the model along with the conversation as input. This makes the model an omniscient observer with access to all information shared in the conversation. PersonX, on the other hand, was absent for part of the conversation, so an information asymmetry naturally arises between the model and PersonX. Responses that include information inaccessible to PersonX indicate a lack of ToM in the model.

Input context types
FANTOM comprises two types of input conversations: short and full. In the case of the short input, the model is provided with only the part of the conversation where the specific speaker left and (re)joined, excluding the earlier and later parts of the conversation. A full conversation, on the other hand, encompasses the entire discussion on the main topic, including all subtopics. As a result, it is significantly longer than the short input.

BELIEFQ [DIST.]
When given a belief question regarding PersonX, the model should generate a response that incorporates only the information accessible to PersonX. We use cosine similarity to measure the distance between SentenceBERT (Reimers and Gurevych, 2019) embeddings of each option and the response. A correct response should always be closer to the PERSONX-CENTRIC BELIEF A than to the OMNISCIENT-VIEW BELIEF A.
To assess responses more accurately, we also calculate the token F1 score for responses that are considered correct under the distance metric, following the convention of various QA tasks (Rajpurkar et al., 2016, 2018). When comparing distances in the embedding space, nonsensical responses (e.g., repetition of character names) can be deceptively closer to the PERSONX-CENTRIC BELIEF A, resulting in misleading accuracy. Therefore, models must score high on both the distance and F1 metrics for the BELIEFQ [DIST.].
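A minimal sketch of this two-part check, assuming SentenceBERT embeddings from the sentence-transformers package (the checkpoint name below is only illustrative) and a standard SQuAD-style token F1:

```python
# Sketch of BELIEFQ[DIST.] scoring: a distance check with SentenceBERT embeddings
# plus a SQuAD-style token F1. The model checkpoint below is illustrative.
from collections import Counter
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def distance_correct(response, personx_centric_a, omniscient_view_a):
    # Correct iff the response embedding is closer to the PersonX-centric answer.
    emb = encoder.encode([response, personx_centric_a, omniscient_view_a])
    sim_personx = util.cos_sim(emb[0], emb[1]).item()
    sim_omniscient = util.cos_sim(emb[0], emb[2]).item()
    return sim_personx > sim_omniscient

def token_f1(prediction, gold):
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Both numbers matter under this scheme: the distance-based accuracy, and the token F1 computed over responses that pass the distance check.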

BELIEFQ [CHOICE]
The model should choose between the OMNISCIENT-VIEW BELIEF A and the PERSONX-CENTRIC BELIEF A. The correct answer is the PERSONX-CENTRIC BELIEF A.

ANSWERABILITY Q [LIST] and INFOACCESS Q [LIST]
A correct response must include all characters who have access to the answer or information while excluding all characters who do not. No partial marks are assigned.
ANSWERABILITY Q [Y/N] and INFOACCESS Q [Y/N]
The model should respond with "yes" or "true" for all characters who have access to the answer or information, and with "no" or "false" for all characters who do not. More details are in Appendix A.4.
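The scoring for the list-type and binary formats reduces to an exact set match and a per-character yes/no check. Below is a minimal sketch under that reading; the parsing heuristics and helper names are ours and simplified relative to the released evaluation code.

```python
# Sketch of ANSWERABILITY/INFOACCESS scoring. List-type questions are scored with an
# exact set match (no partial credit); binary questions check each character's
# yes/no answer. The response parsing here is deliberately simplified.
import re

def score_list_question(response, aware_characters, unaware_characters):
    mentioned = {c for c in aware_characters + unaware_characters
                 if re.search(rf"\b{re.escape(c)}\b", response, re.IGNORECASE)}
    return mentioned == set(aware_characters)

def score_binary_question(response, character_is_aware):
    answer = response.strip().lower()
    if answer.startswith(("yes", "true")):
        return character_is_aware
    if answer.startswith(("no", "false")):
        return not character_is_aware
    return False  # irrelevant response (neither yes nor no)
```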

Dataset Validation & Statistics
Validation To ensure the quality of our benchmark, we go through a manual validation process for all conversations and question-answer pairs using Amazon Mechanical Turk (MTurk). We validate all conversations in our dataset using 32 annotators who passed a qualification test for assessing conversation coherence. We ask workers to flag conversations that are incoherent or unsafe (e.g., unethical, biased, harmful, dangerous, or offensive). Each conversation is validated by three workers. While 10 conversations received votes for incoherence, none received a majority vote indicating incoherence; we refine all 10 conversations nonetheless. As for safety, no conversations were voted as unsafe. We also ask workers to verify the answers provided for the BELIEFQ [CHOICE]s. We remove all question sets that were marked as erroneous by the workers (∼8.6%).

Statistics
FANTOM is composed of 256 conversations with 1,415 BELIEFQ [DIST.]s and BELIEFQ [CHOICE]s, 703 FACTQs, ANSWERABILITY Q [LIST]s, and INFOACCESS Q [LIST]s, respectively. Additionally, there are 2,689 ANSWERABILITY Q [Y/N]s and INFOACCESS Q [Y/N]s. Given that the ANSWERABILITY Q [Y/N]s and INFOACCESS Q [Y/N]s iterate over all characters present in the conversations, they have the highest count among all question types. The average number of turns in the input context is 13.8 (short conversation), and the average number of words per turn is 21.9. For reference, the corresponding statistics for ToMi (Le et al., 2019) are 4.9 and 4.2, respectively. More statistics can be found in Appendix A.5.
Although our benchmark is not meant for training, we also fine-tune Flan-T5-XL (Chung et al., 2022) by randomly splitting FANTOM according to the conversations' main topics. We then test the model on unseen conversation topics. More details can be found in Appendix B.
Human Performance We also measure human performance by asking graduate students in computer science. We ask BELIEFQ [CHOICE], ANSWERABILITY Q [LIST], and INFOACCESS Q [LIST], given a conversation. As it is redundant to ask human testees binary questions when they have already been asked ANSWERABILITY Q [LIST] and INFOACCESS Q [LIST], we do not ask ANSWERABILITY Q [Y/N] and INFOACCESS Q [Y/N].

Metrics We report accuracy for BELIEFQ [DIST.], BELIEFQ [CHOICE], ANSWERABILITY Q [LIST], and INFOACCESS Q [LIST]. The weighted F1 scores are reported for ANSWERABILITY Q [Y/N] and INFOACCESS Q [Y/N]. We additionally report the "All" score for ANSWERABILITY Q and INFOACCESS Q, which requires models to be correct on both the list-type and binary-type questions. For BELIEFQ [DIST.] and FACTQ, we also report the token F1 scores to measure the word overlap between the answer and the model's free-form response. Moreover, we report the ALL* score, which requires the models to answer all six ToM question types (§3.3) in a set correctly for the same information piece in the conversation. This metric aims to measure how well the models show consistent understanding across different types of questions. To compare with human performance, we also report the ALL score, which only excludes the BELIEFQ [DIST.] from the ALL* score.
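A minimal sketch of how the set-level ALL* and ALL scores can be aggregated from per-question correctness flags; the dictionary keys are our own shorthand for the six question types, not official field names.

```python
# Sketch of the set-level metrics: ALL* requires all six ToM question types in a
# set to be answered correctly; ALL drops BELIEFQ[DIST.] for comparison with humans.
SIX_TYPES = ["belief_dist", "belief_choice", "answerability_list",
             "infoaccess_list", "answerability_binary", "infoaccess_binary"]

def all_star_score(question_sets):
    # question_sets: list of dicts mapping each question type to True/False correctness.
    return sum(all(qs[t] for t in SIX_TYPES) for qs in question_sets) / len(question_sets)

def all_score(question_sets):
    types = [t for t in SIX_TYPES if t != "belief_dist"]
    return sum(all(qs[t] for t in types) for qs in question_sets) / len(question_sets)
```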

Results
All the models exhibit scores that are significantly worse than human performance. Table 9 shows the full results of state-of-the-art large language models (LLMs) on FANTOM. We break down the table and highlight each discussion point below.
Illusory Theory of Mind Figure 2 shows the results of a few selected models. We find models perform significantly better on BELIEFQ [CHOICE] compared to ANSWERABILITY Q [LIST] and INFOACCESS Q [LIST]. Despite the ANSWERABILITY Q [LIST] and INFOACCESS Q [LIST] being prerequisites for solving BELIEFQ [CHOICE], they are much more challenging for models. Moreover, models' performance sharply drops when evaluated for coherent reasoning across multiple question types with the same underlying theory of mind (ToM) reasoning (i.e., All Question Types). These findings suggest that some instances of successful LLM ToM reasoning in FANTOM should be interpreted as illusory.
Chain-of-thought and Fine-tuning Table 1 summarizes the results when we apply zero-shot chain-of-thought (CoT) reasoning or fine-tuning to models. For CoT, we follow Kojima et al. (2022) and use the prompt "let's think step by step". We observe an improvement in scores with CoT applied. However, there are still significant score gaps compared to human performance.
We also find the fine-tuned Flan-T5 XL still falls short of human performance on metrics that demand consistent accuracy across multiple questions, i.e., the All scores. Although our benchmark is not intended for training purposes, developing models with coherent ToM reasoning remains challenging, even with explicit training on the data.

Comprehending Facts vs. Distinct Beliefs Figure 3 shows the token F1 scores for FACTQ and the accuracy for BELIEFQ [DIST.]. The token F1 scores for FACTQ can be seen as a measure of a model's basic comprehension capability for interactions: scoring high on FACTQ indicates the model is good at identifying the information piece most relevant to answering the question. Despite its small size, Mistral Instruct 7B shows the strongest performance among the open-source models. On the other hand, BELIEFQ [DIST.] aims to measure a model's understanding of individual characters' perspective on a particular piece of information, i.e., belief. To meet the mentalizing criterion (see §2.2), we deliberately design the incorrect answers in BELIEFQ [DIST.] to have greater word overlap with the context than the correct answers. Also, BELIEFQ [DIST.] are rephrased questions inquiring about PersonX's belief about the facts in FACTQ, so the two question types share significant word overlap. However, the same information that was used to answer FACTQ should not be included in the response to BELIEFQ [DIST.] about PersonX, as it is from the conversation that PersonX missed. As a result, certain models with higher token F1 scores for FACTQ have lower scores for BELIEFQ [DIST.] compared to models that perform worse on FACTQ (e.g., InstructGPT davinci-003 vs. Llama-2 Chat and Mistral Instruct). This suggests the models lack the ability to comprehend the distinct perspectives of individual characters, leading them to reproduce responses similar to those for FACTQ when answering BELIEFQ [DIST.].
Free-Response vs. Choice We observe a pattern where models score significantly worse on free-response questions than on choice questions (BELIEFQ [DIST.] vs. BELIEFQ [CHOICE]; Figures 3 and 2). However, many of them still achieve scores either below or around 50, which is the random baseline for those binary choice questions. In contrast, we do not observe a clear performance pattern between the list-type and binary questions (ANSWERABILITY Q and INFOACCESS Q; Figure 2). This may be because models significantly struggle with ANSWERABILITY Q [LIST] and INFOACCESS Q [LIST], potentially resulting in the absence of meaningful performance patterns.
Short vs. Full Conversations When a model is provided with the full conversation (Table 9, bottom), its performance noticeably decreases compared to when it is given only the relevant parts of the conversation (Table 9, top). The decrease can be attributed to the model's need to identify the relevant information within the full conversation, whereas it does not have to do so for the short conversations. This indicates theory of mind reasoning becomes even more challenging for models when it needs to be combined with different types of reasoning (e.g., search).

In-depth Analysis
What types of errors do models make? Figures 4 and 5 summarize the error types of ANSWERABILITY Q and INFOACCESS Q for each model with and without chain-of-thought (CoT) reasoning. For list-type questions, models make more errors by including characters who are unaware of the information in their responses than by excluding characters who are aware. Interestingly, when CoT is applied, the error of including unaware characters decreases, whereas the error of excluding aware characters increases for most models. In the case of binary questions, false positives and false negatives correspond, respectively, to including characters who are unaware and excluding characters who are aware in the response for list-type questions. If the model fails to generate a yes or no response, we mark it as irrelevant. Models tend to exhibit false negative responses more frequently for binary questions than for list-type questions. Similarly, CoT primarily helps the model reduce false positive error rates, but the reduction in false negative error rates is not consistent across models. This suggests that CoT selectively improves reasoning specifically for determining which characters are unaware of the information, rather than which characters are aware.
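The error categories above can be read off directly from the gold set of aware characters and the characters a model names; a minimal sketch (the function and key names are ours):

```python
# Sketch of the error analysis for list-type questions: responses can wrongly
# include characters who are unaware of the information, or wrongly exclude
# characters who are aware. Binary false positives/negatives are analogous.
def list_error_types(predicted_characters, aware_characters):
    predicted, aware = set(predicted_characters), set(aware_characters)
    return {
        "included_unaware": sorted(predicted - aware),  # analogous to false positives
        "excluded_aware": sorted(aware - predicted),    # analogous to false negatives
    }
```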
How accurate and consistent are models' answers for a given character? For accuracy, we report the ALL FOR EACH CHARACTER score, which is determined by whether the models are able to answer all six types of ToM questions correctly regarding the specific character. For consistency, we measure the ratio of consistent model responses across ANSWERABILITY Q and INFOACCESS Q for each character. Table 3 shows the accuracy and consistency of the models' responses for each character within the given conversation context. Overall, we observe a pattern where models that score low in accuracy also show low consistency.
While CoT generally improves model performance (see Table 9), we find that it does not always lead to improved accuracy and consistency. The decrease in the ALL FOR EACH CHARACTER score when CoT is applied suggests that CoT has a selective impact on different question types.
Are there differences in performance with respect to the order of ToM beliefs? Table 4 presents the results of BELIEFQ with respect to different orders of ToM beliefs. Similar to Le et al. (2019), models perform better on second-order belief questions than on those with first-order beliefs. To further investigate the performance on second-order belief questions, we analyze the results based on the cyclic and acyclic patterns in them. Cyclic second-order belief questions ask about Character 1's belief regarding Character 2's belief about Character 1 (e.g., What does Linda think about Kailey's belief on the breed of Linda's dog?), while acyclic second-order questions focus on Character 1's belief about Character 2's belief regarding Character 3 (e.g., What does David think about Kailey's belief on the breed of Linda's dog?). Models show better performance on cyclic questions than on acyclic ones, which include more characters to track. However, when CoT is applied, the increase in score for acyclic questions is greater than that for cyclic ones, suggesting CoT helps with multi-tracking.

Related Work
Existing Theory of Mind Benchmarks Many theory of mind (ToM) benchmarks, inspired by the false belief test from psychology (Wimmer and Perner, 1983), evaluate models on reasoning about beliefs regarding object locations with narratives (Grant et al., 2017; Nematzadeh et al., 2018; Le et al., 2019). Other works, such as Shapira et al. (2023b), build benchmarks based on the Faux Pas Test (Baron-Cohen et al., 1999). Other ToM-related benchmarks focus on reasoning about emotions and mental states in narratives (Rashkin et al., 2018; Sap et al., 2019).
Theory of Mind in Large Language Models Although qualitative assessments might imply a degree of ToM in large language models (LLMs; Whang, 2023), more comprehensive quantitative investigations reveal that they have yet to achieve human-level ToM across various benchmarks (Sap et al., 2022; Shapira et al., 2023a). LLMs struggle to perform ToM reasoning robustly (Ullman, 2023), though their performance can be improved through few-shot samples and chain-of-thought prompting (Sap et al., 2022; Moghaddam and Honey, 2023) as well as specific inference methods (Sclar et al., 2023).

Conclusion & Discussion
We introduced FANTOM, a new benchmark for stress-testing the theory of mind (ToM) capabilities of neural language models in conversations via question answering. Our benchmark is built upon essential theoretical requisites and empirical considerations required for validating ToM in large language models (LLMs). The conversations in our benchmark involve information asymmetry, with characters joining and leaving the discussion while it continues, to simulate distinct mental states. To identify illusory ToM, we crafted multiple types of challenging belief questions regarding the conversation participants' mental states by converting factual questions. Our evaluation results show that coherent ToM reasoning is challenging for current LLMs, which perform significantly worse than humans even when using chain-of-thought reasoning or fine-tuning.
Although there have been recent debates around whether current LLMs possess ToM capabilities (Whang, 2023), our results indicate that this capacity has not yet emerged in any manner. Previous instances of success on well-known psychology ToM tests may be attributed to exposure during the pretraining phase (Ullman, 2023). Our work highlights the need for novel interaction-oriented benchmarks that introduce scenarios not encountered during training and that align more closely with real-world use cases, as LLMs are increasingly being deployed in interactive settings.
Our results also shed light on a broader issue in neural models: the lack of internal consistency (Elazar et al., 2021). We find they often fail to provide consistent answers to questions requiring the same underlying ToM reasoning. To address this concern, future works can explore various directions, such as grounding reasoning in pragmatics (Kim et al., 2020), visual information (Bisk et al., 2020), or belief graphs (Sclar et al., 2023).
Another issue that our work touches upon is the reporting biases inherent in language models. We observed that models often exhibit biases in their responses, showing a tendency to overly rely on the information they are conditioned on, such as preferring answers that have high overlap with the context (Sugawara et al., 2018). However, to achieve successful ToM reasoning, it is crucial to distinguish between accessible and inaccessible information for a particular agent, rather than blindly using all information available to the model. One potential approach to mitigate this is to combine pretraining with interactive learning (Sap et al., 2022).
In the spirit of encouraging future research in this direction, we make our benchmark publicly available at https://hyunw.kim/fantom.

Limitations
Although FANTOM is the first benchmark, to the best of our knowledge, to cover theory of mind (ToM) reasoning in conversational interactions, it is currently limited to small talk on specific topics. Additionally, our benchmark considers only a single type of relationship between conversation participants, where they do not have prior knowledge of each other. However, social reasoning can become much more dynamic when variables such as relationships (e.g., family, friends, co-workers) are introduced. ToM is present in all conversational interactions, hence we strongly encourage future works to evaluate ToM in a wider range of diverse conversation scenarios.
Our evaluation solely focuses on language-based models. However, it is important to note that ToM extends beyond a single modality (Piaget, 1956; Wu and Keysar, 2007). For instance, the well-known Sally-Anne test (Wimmer and Perner, 1983; Baron-Cohen et al., 1985) is typically conducted as a face-to-face experiment, where visual cues affect the performance of the participants. Therefore, interesting future work will involve examining the capabilities of multimodal models in relation to ToM reasoning.
Lastly, as we generate full conversations with large language models, conversations may contain offensive content (Weidinger et al., 2021). However, we specifically select casual topics for small talk (e.g., pets, personal growth, traveling) to minimize the likelihood of offensive content generation. Also, we manually validate all conversations in our benchmark with crowdworkers from Amazon Mechanical Turk.

Societal and Ethical Considerations
We acknowledge that the term "theory of mind" (ToM) may evoke anthropomorphic connotations regarding AI models. However, we emphasize that the purpose of our work is not to promote the anthropomorphism of AI models. Rather, our focus lies in exploring the limitations of existing language models in social reasoning. While the concept of ToM attempts to capture the ability to attribute mental states to oneself and others (Premack and Woodruff, 1978), it is important to clarify that AI models do not possess subjective consciousness or a true understanding of intentions, beliefs, or desires. Our experiment results also demonstrate that current large language models do not exhibit coherent ToM reasoning; instead, they primarily rely on word correlations.

A FANTOM Construction
Full examples of question sets in FANTOM can be found in Table 5 and Table 6.

A.1 Generating Conversations with Information Asymmetry
Information-asymmetric conversations To create the conversations in our benchmark, we use a predefined set of subtopics for each main topic and employ templates to generate scripts. For example, for the topic "pets", subtopics may include "breed", "special moves", and "favorite food". Following Kim et al. (2022), we use specific speaker prefixes with English names sampled from the Top-1K names in the US SSN database for more natural conversations. We append each utterance with speaker prefixes. We randomly shuffle the subtopics for each topic and generate conversations for each subtopic. We generate the first conversation with the following prompt: "{Character 1}, {Character 2}, ... {Character n} met for the first time at this social event. They are having a conversation on their {topic}. They now discuss {subtopic}.\n{Character1}:" The initial conversation starts with two or three characters, and there can be up to five characters participating in the conversation at the same time.
Then, for each subtopic, we randomly select characters to join or leave the conversation. We use the following prompt when a character is selected to leave: "Now, {leaving character} leaves the conversation because of the reason '{leaving reason}'. They now discuss {subtopic}. Remember to indicate that {leaving character} is leaving the conversation.\n{Conversation history}\n{leaving character}:". We use a predefined list of 64 reasons for leaving the conversation; Table 7 shows all reasons for leaving. We append the previous conversation history to the input prompt to make the conversation continue from the previous one.
We use the following prompt when a character is selected to join: "Now {joining character} comes back after leaving the conversation because of the reason {leaving reason}. They now discuss {subtopic}. Remember to indicate that {joining character} is joining the conversation. Do not mention the details in the previous conversations."
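For illustration, the templates quoted above can be instantiated with simple string formatting; the sketch below is a paraphrase of that step (the placeholder names are ours, and the call to InstructGPT davinci-003 is omitted):

```python
# Illustrative assembly of the conversation-generation prompts quoted above
# (Appendix A.1). Sending the prompt to InstructGPT davinci-003 is not shown.
START_TEMPLATE = (
    "{characters} met for the first time at this social event. "
    "They are having a conversation on their {topic}. "
    "They now discuss {subtopic}.\n{first_speaker}:"
)
LEAVE_TEMPLATE = (
    "Now, {leaving_character} leaves the conversation because of the reason "
    "'{leaving_reason}'. They now discuss {subtopic}. Remember to indicate that "
    "{leaving_character} is leaving the conversation.\n{conversation_history}\n{leaving_character}:"
)

prompt = LEAVE_TEMPLATE.format(
    leaving_character="Kailey",
    leaving_reason="want to go grab a coffee",  # one of the 64 predefined reasons (Table 7)
    subtopic="breed",
    conversation_history="<previous utterances>",
)
```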
Extracting the inaccessible information for PersonX Whenever a character (re)joins the conversation, we extract the inaccessible information by asking GPT-4 what information was shared in the preceding conversation in which the character PersonX did not participate. We provide the previous conversation and the current one as input to GPT-4, with the prompt "What information was shared before PersonX joined, but was not mentioned after PersonX joined?" appended to it. To ease the task, the joining of the character is explicitly denoted by inserting a script between the conversations, as follows: "Previous conversation\n[PersonX joined the conversation]\nCurrent conversation". We observe quality improvements in the output generated by GPT-4 with the inclusion of the hint script. The returned result can be viewed as a conversation summary explicitly covering the previous context.
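The extraction prompt, including the hint script, can be assembled the same way; the sketch below only builds the prompt string and leaves the actual GPT-4 call to whichever client is used:

```python
# Illustrative construction of the GPT-4 prompt used to extract the information
# PersonX missed (Appendix A.1). The GPT-4 call itself is not shown here.
def build_extraction_prompt(previous_conversation, current_conversation, person_x):
    hint_script = f"[{person_x} joined the conversation]"
    return (
        f"{previous_conversation}\n{hint_script}\n{current_conversation}\n\n"
        f"What information was shared before {person_x} joined, "
        f"but was not mentioned after {person_x} joined?"
    )
```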

A.2 Generating Factual QA Pairs
We construct factual question-answer (QA) pairs related to the inaccessible information. First, we generate three non-yes-or-no questions, denoted "FACTQs", by prompting GPT-4 with the inaccessible information text and the following instruction: "{inaccessible information}\n\nBased on this, formulate three non-yes-or-no questions that can be answered by this conversation summary." Next, we generate two distinct types of answers for each FACTQ with GPT-4: (1) the FULL FACT A, based on the inaccessible information from the preceding conversation, and (2) the LIMITED FACT A, based only on the conversation PersonX participated in (§3.2).

Table 9: Zero-shot results from humans and large language models on FANTOM with the same instructions. CoT denotes chain-of-thought reasoning and FT denotes fine-tuning.

Figure 1: An example question set in FANTOM.

Table 1: Results of models with zero-shot chain-of-thought (CoT) and fine-tuning (FT) for the short conversation context. Full results with all models, input types, and metrics are in Table 9.
Figure 3: Results of FACTQ and BELIEFQ [DIST.] for models given the short conversation context. Full results with all models, input types, and metrics are in Table 9.

Table 7: Predefined reasons for characters leaving the conversation.
bathroom break; coffee break; forgot something important; forgot to print some documents; forgot to receive a package; forgot to return a package; forgot to run errands; forgot to submit documents; have a meeting starting soon that I need to prepare for; have a previous engagement that I need to attend to quickly; have a work-related emergency that requires my immediate attention; have an unexpected visitor at my door; have errands to run; have to attend to someone who just walked in; have to check on something; have to go to the restroom; have to pick up a prescription; have to pick up dry cleaning; have to print or scan documents; have to receive a delivery; have to recharge laptop; have to return a borrowed item; have to take care of a family matter; have to take care of an unexpected task; have unexpected visitor; his/her pet needs attention; his/her family is calling; incoming delivery; must respond to a phone call; need to check on a friend or family member who needs assistance; need to finish a task that's time-sensitive; need to get a phone call; need to get some coffee; need to go to the toilet; need to grab a snack or a drink; need to have a quick chat with someone else; need to make a phone call; need to make a quick trip to the drug store; need to make a quick trip to the grocery store; need to pick up a package; need to receive a parcel; need to recharge cellphone; need to register for an event; need to schedule a haircut or salon appointment; need to schedule another appointment; need to step away for a moment to stretch and clear my mind; need to step out for a moment; need to submit some papers; need to take care of some paperwork or documents; need to take care of some personal matters; need to take care of something related to my health; need to take care of something urgent; need to troubleshoot something; parking meter expiring; remembered something that needs to be taken care of; remembered to receive a package; remembered to submit some papers; remembered to take care of some paperwork or documents; remembered to take care of some personal matters; remembered to take care of something urgent; want to go grab a drink; want to go grab a coffee; want to go take some fresh air; want to go to the bathroom