MindCraft: Theory of Mind Modeling for Situated Dialogue in Collaborative Tasks

An ideal integration of autonomous agents in a human world implies that they are able to collaborate on human terms. In particular, theory of mind plays an important role in maintaining common ground during human collaboration and communication. To enable theory of mind modeling in situated interactions, we introduce a fine-grained dataset of collaborative tasks performed by pairs of human subjects in the 3D virtual blocks world of Minecraft. It provides information that captures partners’ beliefs of the world and of each other as an interaction unfolds, bringing abundant opportunities to study human collaborative behaviors in situated language communication. As a first step towards our goal of developing embodied AI agents able to infer belief states of collaborative partners in situ, we build and present results on computational models for several theory of mind tasks.


Introduction
Creating embodied, situated agents able to move in, communicate naturally about, and collaborate on human terms in the physical world has been a persisting goal in artificial intelligence (Winograd, 1972). During communication in such a setting, agents not only need to ground entities in language to that of the physical world; efficient and accurate human-agent collaboration further requires agents to reason about the progress of the task at hand and to plan and execute a series of collaborative steps, whilst maintaining common ground (Clark, 1996) with collaboration partners, in order to achieve a certain goal.
Despite recent advances, we are still far away from fully enabling these desired agent behaviors.
One key challenge is in an agent's ability to establish and maintain common ground in tandem with human partners, especially in a setting where beliefs about the world and of each other may change on the fly (Popat and Palmer, 2005; Powers et al., 2005). It is important to understand how changes in a dynamic, physical world affect agents' beliefs of each other, i.e., theory of mind (Premack and Woodruff, 1978), and how such beliefs influence teamwork and communication in collaborative tasks. As a first step to address this question, this paper explores theory of mind modeling (Chandrasekaran et al., 2017; Rabinowitz et al., 2018; Jara-Ettinger, 2019) in situated language communication (Iwahashi et al., 2009; McGuire et al., 2002) for collaborative tasks within the 3D virtual blocks world of Minecraft. Through a novel experimental setup, we collect a situated dialogue dataset that demonstrates how collaborative partners with a set of asymmetric knowledge and skills are able to collaborate to achieve joint goals, and how, in particular, their beliefs of each other evolve and converge over time. Based on this dataset, we further build several baseline computational models to explicitly predict key elements of a collaboration partner's mental state from the viewpoint of an agent as a task unfolds. Our empirical results demonstrate that while language is certainly important in this inference, the shared physical environment and the perceived activities play a greater role in shaping partners' understanding of each other in order to come to a common ground.
The contributions of this work are threefold. First, we introduce MINDCRAFT, a task in which pairs of users collaboratively work to create novel materials by combining blocks in the 3D virtual world of Minecraft, with the ultimate objective of creating a final, goal material. Unlike prior work in situated collaborative tasks (Liu et al., 2013; Bisk et al., 2018; Suhr et al., 2019), a key focus of our work is to facilitate theory of mind modeling (the ability to attribute mental states, both one's own and those of others), an important but not yet well-studied topic in situated collaborative interactions. Within designed collaborative tasks, we have users record their beliefs about the state of the game, and of each other, at periodic intervals. Our data captures an evolution of the states of mind of our participants that are true representations of their beliefs, not simply proxies for the true sequence of events in a collaborative session. This explicit modeling of theory of mind sheds light on how partners strive to align their mental models in order to achieve common ground during collaboration.
Second, departing from previous Leader-Follower setups (where one partner explicitly leads and gives instructions to the other, the follower, who tries to execute said instructions) (Suhr et al., 2019), we focus on a setting where partners each have asymmetric knowledge (Bortolaso et al., 2019) and skill-sets towards completing a joint goal. In order to effectively complete the given tasks, partners need to negotiate their own plans of action by taking into account what they currently know and don't know about their partner, and of their common understanding of the task at hand. Our novel, more relaxed setup provides support for greater diversity in modes of collaboration that are more representative of that in the real world.
Third, we introduce a set of baseline computational models to infer fellow player mental states in situ, as a collaborative agent would, and highlight some further challenges present in moving towards building fully realistic agents able to reason about human mental states in situated environments.
Our platform, data, and models are made available and will facilitate future work on physical agents that can effectively collaborate with humans through situated dialogue.

Related Work
Our work builds upon existing efforts within collaborative dialogue, both in understanding the nuances of human collaboration and in building computational agents that can engage in language communication and collaborative tasks with humans in a physical environment.
Situated and task-oriented natural language interactions (Iwahashi et al., 2009; Zarrieß et al., 2016) have been studied in a variety of environments, including in custom 2D worlds (Liu et al., 2012; Aizawa, 2020, 2019), in the physical world with human-robot interactions (McGuire et al., 2002; Chai et al., 2014, 2018), and in various 3D virtual worlds (Bisk et al., 2018; Suhr et al., 2019). Most closely, our environment builds upon recent work by Narayan-Chen et al. (2019) and Jayannavar et al. (2020), whereby computational models of user dialogue prediction and user next-action prediction are investigated in the setting of a collaborative dialogue task within the 3D virtual blocks world of Minecraft. However, to our knowledge, none of these previous works explicitly model theory of mind for dialogue agents.
Theory of mind as a subject, especially in computation (Laird et al., 2017), has gained increased attention in areas including agent-agent reinforcement learning (Rabinowitz et al., 2018), dialogue systems (Qiu et al., 2021), human-computer interaction , agent-agent collaborative dialogue (Roman et al., 2020), and explainable AI (Akula et al., 2021). Worthy of note is the type of mental state recording we employ: specifically, we ask players to record their own mental states during interaction. Unlike prior work that has largely utilized external annotators for post-hoc mental state attribution , we expand on Eicher et al. (2017) and  by specifically bringing user self-reported mental states from that of only the linguistic domain to multimodal situated dialogue. Specifically, the novelty in our work exists in studying and bringing explicit theory of mind modeling to 3D situated collaborative interactions.

Experimental System and Data Collection
We consider a scenario whereby two agents, situated in the same environment and able to perform actions simultaneously, collaborate to complete a shared goal. Here, unlike traditional Leader-Follower setups, both agents have asymmetric information on the steps needed to complete the target task. In addition, in certain iterations, agents have asymmetric skill-sets as well: certain steps may only be completed with specific skills, and an agent may not be able to complete the target task by themselves, even with complete knowledge. Agents are provided a text channel to communicate in natural language, where they are able to share knowledge and negotiate on the actions to be performed by each agent in the process of completing the target task. Agents can also directly perform actions in the environment based on their own, albeit partial plan, and on their current understanding of the game state. We study this scenario in a modified blocks world environment of Minecraft with our custom game, MINDCRAFT. Figure 1 gives an overview of our experimental system and setup.

Figure 1: Diagram of a sample interaction in MINDCRAFT. Two players are tasked to complete a common goal within the game environment of Minecraft; players communicate using in-game chat, are provided partial views of the plan needed to create the goal material, and are periodically asked paired questions to probe into their mental states. Additionally, we record first-person viewpoint videos of the two players' points of view (POV) as well as a third-person POV from the shared game environment.

MINDCRAFT
The goal of MINDCRAFT is to create a specific goal material that is randomly generated for each game. A set of material blocks is spawned in the environment when agents enter, serving as the starting set of materials of the game. [...] and creating a single block of a new type in their place. Note that macro-actions themselves are composed of many fine-grained, atomic actions that players may perform in-game, such as moving around, breaking blocks, chatting, jumping, and more, which utilize the full capability of the Minecraft game environment.

Modeling Agent Knowledge and Skills
In the real world, agents that engage in collaborative tasks may each have partial knowledge and incomplete skill sets. We are particularly interested in how these agents collaborate and negotiate with each other to come to a shared plan in order to achieve a joint goal. To this end, we explicitly model agents' knowledge and skills in MINDCRAFT tasks.
Knowledge. Each player is given a knowledge graph-the recipe-as shown in Figure 1. Recipes given to players specify a joint goal and a partial set of macro-actions needed to take place toward completing the goal. For example, Player A, from the initial recipe, knows how to create Yellow Wool, but they do not know how it would contribute to making the goal material, Emerald Block. On the other hand, Player B, while they do not initially know how to create Yellow Wool, is given the knowledge that doing so would lead to creating Cobblestone, then used to make the goal material.
Skill-Sets. In order to stack blocks together, agents must be able to physically move blocks around in the virtual environment. This is achieved by hitting blocks with specific tools; randomly generated constraints exist in each game that specify which tools are able to interact with which blocks. Given randomly to agents at the start of each game, these tools effectively set constraints on which agents possess the necessary skills to interact with certain block types.
Combined, this asymmetry in both knowledge (recipes) and skill-sets (tools) motivates communication, as an individual agent neither (1) knows how to create the goal material nor (2) has the skill-set to do so on their own. Furthermore, as both agents are situated within the environment, both have only partial observability of the game state, limited by their first-person field of view in-game. Players need to collaborate and communicate with each other to achieve the joint goal.

Belief Modeling and Common Ground
We facilitate theory of mind studies by asking players to record their beliefs about the progress of the current game, and of each other, at periodic intervals. As shown in Figure 1, each player is asked three types of questions:
• Completed Task Status. This asks if a specific material has been created, by themselves or by the other player, since the start of the game, probing into the player's beliefs about the current state of the game as influenced by either themselves or their collaboration partner. For example, as shown in Figure 1, Player B is prompted with the question "Has the other player made Blue Wool until now?"
• Player Knowledge. This asks if the player knows how to create a specific material, or if they believe that their partner possesses the knowledge to create it. This probes into a player's current knowledge of their own and of their partner's current knowledge, as influenced by the initial knowledge they were provided and that which has been gained, via communication with their partner, since the start of the game. In this example, Player B is given the question "Do you know how to make Yellow Wool?"
• Player Current Task. This asks players what they believe they themselves are making, or believe their partner to be making, at current time. For example, Player A is given the question "What do you think the other player is making right now?"

Question Pairing. The three questions received by players are paired by type; i.e., if one player is asked a question of their own beliefs, the other player is asked the same question on what they believe their partner's beliefs to be. In the example given, when Player B is prompted with the question "Has the other player made Blue Wool until now?", at the same time, Player A is prompted with the question "Have you created Blue Wool?" The game is paused when players record their answers to the set of questions and resumed when both players have completed their answers.
By explicitly soliciting players' states of mind during collaboration, we are able to define a quantitative measure of common ground: specifically, we consider common ground to be instances of answer agreement among pairs of players to a given question.
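As an illustrative sketch of this measure, agreement can be computed as the fraction of paired questions on which both players gave the same answer. The record layout and the name `agreement_rate` are hypothetical and are not taken from the released platform.

```python
from typing import List, Tuple

# Hypothetical layout: (answer of the player asked about their own state,
#                       answer of the partner asked about that player).
AnswerPair = Tuple[str, str]


def agreement_rate(pairs: List[AnswerPair]) -> float:
    """Fraction of question pairs whose two answers match."""
    if not pairs:
        raise ValueError("need at least one question pair")
    matches = sum(1 for own, partner in pairs if own == partner)
    return matches / len(pairs)


# Toy data, invented for illustration only.
pairs = [("Yes", "Yes"), ("No", "Maybe"), ("Yes", "Yes"), ("No", "Yes")]
rate = agreement_rate(pairs)  # 2 of 4 pairs match
```

Exact string equality is the simplest reading of "answer agreement"; whether an answer such as Maybe should count as partial agreement is a design choice the text does not specify.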

Data Collection
With the experimental setup described above, we collected a dataset totaling 100 games. Pairs of players participated in the experiments through a remote video conference, where they were instructed to access a custom Minecraft server using a game client, as well as a web page interface that we provided, used to display recipe information and to collect player beliefs with periodic popups, once every 75 seconds. Pop-ups ask three questions at once-one of each type-the content of each being paired to the corresponding question asked to the other player. During games, players are only able to communicate with each other using in-game chat, and each pair of players played at most 5 games.
From these games, we log their timestamped dialogue utterances via in-game chat, their questions and answers to the periodic popups for belief recordings, an internal game log that stores the entire game state, and three sets of video recordings, representing each player's first-person point of view and a third-person point of view at a high vantage point with a clear view of the entire game.
In our dataset, there is an average of 20.5 dialogue exchanges per game, for a total of 2091 exchanges. Games last between 1 minute 22 seconds and 27 minutes 26 seconds, with the average game lasting 7 minutes and 23 seconds. A total of 12 hours, 18 minutes, and 33 seconds of in-game interaction was recorded. On average, 4 popup question pairs appear each game. Between 5 and 10 objects are used in a game, and between 7 and 11 macro steps are necessary in each game to achieve the goal.

Findings and Observations
We further perform an analysis of our dataset to gain an understanding of collaborative behaviors between players both in their reasoning and in their alignment of mental models.

The Role of Asymmetry in Knowledge and Skill-Sets
To quantitatively understand how a disparity in skill-sets and knowledge affects player behavior in situated collaborative tasks, we perform an initial pilot study based on four different configurations that vary on whether players share the same, complete plan (i.e. knowledge) and/or the same tools (i.e. skills) necessary to complete the task. In disparate configurations, both players possess disparate, partial plans and/or tools, with partial overlap between players. This pilot consists of 32 games to measure key statistics in areas of communication, interaction length, and mutual mental state agreement, with 8 games per configuration. Games were played in sets of four between pairs of players in a round-robin fashion across configurations, mitigating for external factors among pairs such as player game and mutual familiarity. As shown in Table 1, within our expectation, a disparity in both skill-sets and knowledge causes players to disagree and communicate the most to a statistically significant degree, and a disparity in either produces significantly more dialogue utterances than when both are shared. Players in fully disparate games have the lowest agreement in mutual knowledge and task completion. Despite this, in fully disparate games, a higher level of agreement is present for beliefs of the current tasks being performed by either player (e.g., compared to the shared skills configuration), which we attribute to players needing to ask for help more, thus communicating more and being more aware of each other, in such situations.

Evolution of Belief States
To understand how the interaction discourse shapes partners' beliefs of the tasks and of each other, we take a closer look at three types of beliefs (as reflected by our three types of questions) and examine how they evolve as collaboration and communication unfold. Segmenting individual games into 10% sections across each game's duration, we examine player agreement and disagreement as games progress. Figure 2 shows the aggregated results from all games for our three types of beliefs.
On average, player agreement on completed task status remains high and relatively constant throughout a game's progression, averaging around 80 percent, as shown in Figure 2a. However, as each game progresses, there is a noticeable increase in the agreement among two players in terms of what they believe about the other player's knowledge (Figure 2b). Similarly, beliefs about what the other player's current task is also increase notably in agreement as each game progresses, averaging around 12 percent at the start, gradually reaching over 60 percent by the end of the game (Figure 2c).
These results demonstrate that the longer the two players collaborate with each other in a game, the more aligned they become in their beliefs about each other. Furthermore, player understanding of completed tasks can be acquired by direct observations from the environment itself, and it's easier to reach an agreement (i.e., joint understanding or common ground) here than an understanding of a partner's mental states.
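The 10% segmentation used for this analysis can be sketched as follows. This is a minimal illustration under assumed inputs: each record is a hypothetical tuple of question time, game duration, and whether the paired answers agreed, and the function name is ours.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record: (question_time_s, game_duration_s, answers_agree)
Record = Tuple[float, float, bool]


def agreement_by_decile(records: List[Record]) -> Dict[int, float]:
    """Agreement ratio per 10% segment of relative game time (bins 0..9)."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [agreements, total]
    for t, duration, agree in records:
        if duration <= 0:
            raise ValueError("game duration must be positive")
        # Relative position in the game; clamp so t == duration falls in bin 9.
        b = min(int(10 * t / duration), 9)
        counts[b][0] += int(agree)
        counts[b][1] += 1
    return {b: a / n for b, (a, n) in sorted(counts.items())}


# Toy data, invented for illustration only.
records = [
    (75, 300, False),
    (150, 300, True),
    (225, 300, True),
    (80, 400, True),
    (400, 400, True),
]
by_decile = agreement_by_decile(records)
```

Normalizing by each game's own duration lets games of very different lengths be pooled into the same ten bins, which is what makes the aggregated curves comparable across games.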

Dialogue Behavior
To better understand how agreement or disagreement in players' mutual beliefs affect dialogue behavior, we conduct a further analysis by examining dialogue utterances in a fixed time window of 75 seconds before and after a question is asked for each question type, separating instances of agreement and disagreement. Figure 3 shows the average number of dialogue exchanges across all games in this stratification.
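The windowed count described above can be sketched as below. The 75-second window comes from the text; the function name, the timestamp representation, and the treatment of utterances that fall exactly on a boundary are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

WINDOW_S = 75.0  # window length stated in the text


def utterances_around(
    utterance_times: List[float],
    question_time: float,
    window: float = WINDOW_S,
) -> Tuple[int, int]:
    """Count utterances in [q - window, q) and in (q, q + window]."""
    times = sorted(utterance_times)
    before = bisect_left(times, question_time) - bisect_left(
        times, question_time - window
    )
    after = bisect_right(times, question_time + window) - bisect_right(
        times, question_time
    )
    return before, after


# Toy timestamps in seconds, invented for illustration only.
times = [10, 40, 70, 80, 100, 160, 200]
before, after = utterances_around(times, question_time=75)
```

Averaging these two counts separately over questions that ended in agreement and questions that ended in disagreement gives the stratification the paragraph refers to.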
For beliefs about the status of a completed task (Figure 3a), we observe no difference in dialogue exchanges before the question is posed between instances of agreement and disagreement in beliefs; interestingly, however, immediately following a given question, a significant difference becomes apparent in the number of dialogue exchanges. When there is agreement between players about the state of tasks, we observe that they, on average, tend to communicate more to continue on the course to further elaborate on their plan.

Table 1: Statistics on games with varying skill and knowledge configurations; minimums (min), averages (avg), maximums (max), and standard deviations (std) for the number of dialogue exchanges and durations of each game configuration are shown, as are player agreements for all three question types.
On the other hand, for beliefs of partner knowledge, we do not observe a change in behavior before or after a question is asked (Figure 3b). For beliefs that involve a partner's current task (Figure 3c), it is interesting to note that the average number of dialogue exchanges preceding disagreement was significantly lower than that preceding agreement. This highlights a potential reason why disagreement occurred: less communication. We observe that communication is especially important for players to infer what tasks their partner is currently working on, as it is difficult to know the current goal of a partner by only observing their partial actions without communicating about it, due to their own incomplete plan.

Computational Models for Inferring Belief States
Based on our dataset, a variety of computational problems can be formulated, developed, and evaluated. In this section, we focus on one key problem: predicting player belief states of the task and of a collaborative partner in situ. As a first step, we implement a straightforward model that, from a player's perspective, predicts the state of the task as well as the mental states of a partner at any given time based on historical observations of a rich discourse of dialogue and perceived actions in the shared environment. Figure 4 shows the overall architecture of our model. Our dataset is comprised of two time-series-based modalities: (1) a video stream coming from either player's first-person POV, and (2) dialogue exchanges. We implement a forward sequence-to-sequence model, such that inferences at any given time are only able to process inputs that have occurred before it.
Plan Processing. Recall that each player is provided a partial view of the complete plan. Here, each plan is stored as a list of tuples, representing each material present in the plan, associated with the materials needed to make it and the tool needed to interact with it. Represented naturally as a graph, the list of nodes is given as input to a GRU (Chung et al., 2014) for encoding. In tasks that involve predicting a partner's mental state in situ, only the partial plan associated with the player (not the partner) is used.
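As a rough sketch of how a plan graph might be flattened into such a list before encoding, the following walks the graph breadth-first from the goal, as the appendix describes, and skips nodes that were already added. The dictionary layout, the tool names, and the function name are hypothetical, and the one-hot encoding and GRU steps are omitted.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: material -> (children needed to make it, tool to break it).
# An empty children tuple marks a mined (base) material.
Plan = Dict[str, Tuple[Tuple[str, ...], str]]
Node = Tuple[str, Optional[str], Tuple[str, ...], str]


def serialize_plan(plan: Plan, goal: str) -> List[Node]:
    """List (material, parent, children, tool) tuples breadth-first from the goal."""
    order: List[Node] = []
    seen = {goal}
    queue = deque([(goal, None)])
    while queue:
        material, parent = queue.popleft()
        children, tool = plan[material]
        order.append((material, parent, children, tool))
        for child in children:
            if child not in seen:  # a node is listed once, at its first visit
                seen.add(child)
                queue.append((child, material))
    return order


# Toy plan, invented for illustration only.
plan: Plan = {
    "EMERALD_BLOCK": (("COBBLESTONE", "BLUE_WOOL"), "IRON_AXE"),
    "COBBLESTONE": (("YELLOW_WOOL", "BLUE_WOOL"), "IRON_SHOVEL"),
    "BLUE_WOOL": ((), "IRON_AXE"),
    "YELLOW_WOOL": ((), "IRON_PICKAXE"),
}
nodes = serialize_plan(plan, "EMERALD_BLOCK")
```

A partial plan could be derived from the same structure by replacing the children of selected nodes with an empty tuple before serialization.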

Visual and Dialogue Processing. We encode video frames with a Convolutional Neural Network and encode dialogue utterances with bert-large-uncased (Devlin et al., 2019). As dialogue exchanges are a sparse input, a zero vector is used when there is no associated dialogue utterance at a particular time point.
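The zero-vector convention for sparse dialogue can be sketched as below. This is a minimal illustration: the mapping from time step to utterance embedding and the function name are hypothetical, and a toy dimension of 2 stands in for the real embedding size.

```python
from typing import Dict, List


def align_dialogue(
    num_steps: int,
    utterance_embeddings: Dict[int, List[float]],
    dim: int,
) -> List[List[float]]:
    """Return one embedding per time step, zero-filled where nothing was said."""
    zero = [0.0] * dim
    aligned = []
    for t in range(num_steps):
        emb = utterance_embeddings.get(t, zero)
        if len(emb) != dim:
            raise ValueError(f"embedding at step {t} has wrong size")
        aligned.append(list(emb))  # copy so rows never share storage
    return aligned


# Toy example: 4 time steps, one utterance at step 1.
aligned = align_dialogue(4, {1: [0.5, -0.5]}, dim=2)
```

Padding in this way keeps the dialogue stream the same length as the video stream, so the two can be combined step by step.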
Time Series Processing. We use either an LSTM (Hochreiter and Schmidhuber, 1997) or a Transformer network (Vaswani et al., 2017), masked such that it only attends to the past, and feed it a sequence of visual frame, plan, and dialogue embeddings, as described above, to produce a latent representation of game interactions for every step.
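To make the masking concrete, the sketch below builds a lower-triangular mask in plain Python and uses it to restrict a simple average to visible steps. It is an illustration only: whether the current step counts as visible is an assumption here, and the averaging function is a stand-in for attention, not the model's actual mechanism.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when step i may use step j, i.e. when j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_mean(values: List[float], mask_row: List[bool]) -> float:
    """Average only over visible positions (a stand-in for attention)."""
    visible = [v for v, ok in zip(values, mask_row) if ok]
    return sum(visible) / len(visible)


mask = causal_mask(3)
# Step 1 sees steps 0 and 1 only, so the value at step 2 is ignored.
summary = masked_mean([1.0, 3.0, 5.0], mask[1])
```

An LSTM satisfies the same constraint by construction, since its state at a given step depends only on earlier inputs; the explicit mask is needed only for the Transformer variant.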
Learning and Inference. Questions, together with each question's associated game embeddings (i.e., dialogue utterance embeddings, visual frame embeddings, and the agent's own partial plan embedding) at corresponding time steps pass through a Feed-Forward Network to make predictions of their answers. Ground truth answers to questions and cross-entropy loss are used for model training. The same overall architecture is used for all question types; the only difference between them is the space of their output predictions.

Evaluation of Belief State Inference
We randomly partition our dataset into 60%/20%/20% training, validation, and testing splits with the condition that all three partitions have a similar distribution of game lengths. To achieve testing in situ, we replace one of the two players with our model. At every point where a question is prompted, our model is used to provide an answer about the other player's belief state through inference, using the self-reported belief state of the other player as ground truth for evaluation. We present the multi-class average F1 score weighted by the number of instances in each class (accounting for class imbalances) in our results. We perform our experiments by varying the following configurations:
• Neural architecture: LSTM or Transformer, with the rationale that they have different abilities in capturing long-distance dialogue history.
• Input: dialogue exchanges only (D), first-person POV video stream only (V), and both (V+D), with the intent to understand the role of both language communication and visually perceived activities in the environment towards the task of mental state inference.

Figure 5: Model F1 scores on predicting player belief states. Human performance and random chance performance are marked by the blue and orange horizontal lines, respectively. Detailed results are given in Appendix Table 2.
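The class-weighted F1 score used for evaluation can be sketched in plain Python as follows. This is a generic implementation of support-weighted F1, not code from the released models, and the labels in the example are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class F1, averaged with weights equal to each class's true count."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and equally long")
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for label, n_true in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += n_true * f1
    return total / len(y_true)


# Toy labels, invented for illustration only.
y_true = ["Yes", "Yes", "No", "Maybe"]
y_pred = ["Yes", "No", "No", "No"]
score = weighted_f1(y_true, y_pred)  # F1: Yes 2/3, No 1/2, Maybe 0
```

Weighting by support means that frequent answers dominate the score, which is why a comparison against a random-chance baseline remains informative under class imbalance.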

Performance in Situ
Inferring Player Beliefs of Completed Tasks. Here, we predict player beliefs on the subject of task completion: whether a designated sub-task has been completed by their partner, specifically in response to questions such as "Has the other player made an Emerald Block until now?". Participant answers can be one of Yes, Maybe, or No. This experiment aims to gauge an agent's ability to keep track of the two players' progress towards their goal based on its own knowledge (i.e., the partial plan available to the agent) and the shared interaction history. As shown in Figure 5a, we find that the best performing configuration is the Transformer-based model that uses only the video modality. This result seems to suggest that seeing is believing: in situated communication, as partners are co-present in a shared environment, they can observe each other's activities and the resulting world state after participant actions to reason about completed tasks, so collaborators do not need to use language to communicate about what has already been accomplished. Furthermore, as a large time period may exist between sub-tasks that have been completed and their associated belief question prompts, the Transformer-based model with video inputs only is able to significantly outperform LSTM-based models, which may be unable to capture such a time dependency.
Inferring Player Beliefs of Partner Knowledge. Here, we predict player beliefs of the knowledge possessed by their partners to achieve designated sub-goals, specifically in response to questions such as "Does the other player know how to make an Emerald Block?". Participant answers can be one of Yes, Maybe, or No. Our results in Figure 5b show that different model configurations result in similar performance, as players are able to explicitly ask questions about each other's knowledge in dialogue exchanges (Figure 1) in addition to making their own observations from the environment and inferring directly from the plan they were given.

Inferring Player Beliefs of Partner Current Task. Here, we predict player beliefs of their partner's immediate task, specifically in response to questions such as "What do you think the other player is making right now?". For this question, participant answers can be one of 21 choices, the number of total possible material types participants may create in a game. Compared to the predictions aforementioned, this experiment is more constrained in time to the vicinity of the question prompt. Our results in Figure 5c show that LSTM-based models seem to outperform Transformer-based models, though only marginally in the video-only setting, demonstrating that local context seems to play a more important role in this prediction.

Analysis of the Evolution of Inferred Belief States
As we are interested in the evolution of an agent's belief as the game progresses, we further plot prediction matches of model-predicted belief states over every 10% interval of the game, similar to that of player-belief matches shown prior in Figure 2. Figure 6 shows the breakdowns from the best performing configuration for each experiment. For predicting the status of a completed task (Figure 6a), we observe that, similarly to human performance, the percentage of matched answers remains relatively stable, though we do notice a slight decrease for predictions later in the game. On the experiment of predicting the other player's knowledge, we see a similar increase in the percentage of matched answers as the game progresses (Figure 6b) as in Figure 2b. For the experiment of predicting the other player's current task (Figure 6c), our model does not match the observations on human performance: the percentage of matched answers stays low and relatively constant. This result demonstrates that it is difficult to predict what the other player is doing given only interaction discourse and visual perception. This prediction requires a better understanding of the progress of the task, which the agent, as construed here, is lacking. This also points to the utility of actively engaging in dialogue (for example, explicitly asking what a partner is doing) to have a better understanding of their current goal.

Figure 6: Histograms of test-set model-predicted answer matches (agreement, in blue) and mismatches (disagreement, in orange) to question pairs on (a) completed task status, (b) partner knowledge, and (c) current task status asked at different relative intervals during games. Red crosses show the ratio of matched answers out of the total questions, and the red line shows the third-order polynomial fit to the crosses.

Conclusion and Future Work
In a real-world collaborative scenario with physical agents, humans and agents will inevitably have disparities in their abilities, knowledge, and understanding of the shared world. This work specifically simulates these disparities in a virtual environment, introducing a new dataset and experimental framework that supports in-depth studies of theory of mind modeling for situated dialogue in collaborative tasks. Through a novel implementation of self-reported belief states during collaborative interactions, our dataset keeps track of partners' beliefs about the task at hand and of each other step-by-step and captures how their states of mind (and, indeed, their common ground) evolve as communication and interaction unfold. To our knowledge, this is the first dataset in the context of situated dialogue that provides this fine-grained information for mental modeling.
Our initial analysis of this dataset generates several interesting findings that will inform the development of computational models for various problems, for instance, in tracking mental models and managing dialogue behaviors in collaborative agents. Our baseline results demonstrate the importance of interaction discourse and visual experience in a shared environment on predicting mutual belief states of the task at hand, and of a collaborative partner, to ensure common ground. While we have built baseline computational models to better help in understanding human collaborative behaviors and several theory of mind tasks, we hope our work further facilitates improvements in areas like agent planning and decision-making, computational reasoning, and multimodal dialogue generation, and helps move towards fully autonomous agents that are able to engage with humans in collaborative activities, on human terms, both effectively and efficiently in a human world.

[...] exempt. A total of 29 subjects took part in the data collection; no personally identifiable information was stored throughout our experimental sittings, and participants were provided with anonymous Minecraft accounts to access the game servers such that they did not use their own. We do not additionally control for any ethnicity or cultural aspects aside from the condition that the participant is to be an English speaker and has some experience with Minecraft.

A.1 Detailed Prediction Results
A detailed comparison of the F1 scores on the testing and validation sets may be seen in Table 2. Each experiment was run 10 times; training each model for all settings lasts roughly 35 minutes on average.

A.2.1 Convolutional Neural Network
The parameters for the convolutional network used in visual processing were primarily constrained by GPU memory limitations; image frames of size 96 × 96 were input into a CNN consisting of four convolutional layers with kernel sizes of 3 × 3, 5 × 5, 5 × 5, and 3 × 3, whereby the sizes of the successive layer inputs and outputs were 3, 8, 32, 128, and 512, respectively. These parameters were chosen under the consideration that the blocks-world Minecraft video frames are not as rich in content as those of a real-world photo setting. A dropout rate of 0.2 was further applied between layers, chosen after a parameter sweep over the range [0, 0.5]; the sweep also varied kernel sizes between 3 and 5 and the number of layers between 3 and 6, taking into account the aforementioned image input size.
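To make the layer specification above concrete, the following is a minimal sketch that works out the spatial size and parameter count of each convolutional layer. It reads the five sizes (3, 8, 32, 128, 512) as channel counts, and it assumes stride 1, no padding, no pooling, and a bias term per output channel; none of these choices are stated in the text, so the numbers are illustrative rather than a description of the actual model.

```python
# Layer specification taken from the text: four conv layers on 96 x 96 frames.
KERNELS = [3, 5, 5, 3]
CHANNELS = [3, 8, 32, 128, 512]  # ASSUMPTION: the five sizes are channel counts
INPUT_SIZE = 96

# ASSUMPTIONS (not given in the text): stride 1, zero padding, no pooling.
STRIDE = 1
PADDING = 0


def conv_out(size, kernel, stride=STRIDE, padding=PADDING):
    """Spatial output size of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def describe(input_size=INPUT_SIZE):
    """Return one summary dict per convolutional layer."""
    rows = []
    size = input_size
    for index, kernel in enumerate(KERNELS):
        c_in, c_out = CHANNELS[index], CHANNELS[index + 1]
        size = conv_out(size, kernel)
        # Weights (kernel * kernel * c_in * c_out) plus one bias per output channel.
        params = kernel * kernel * c_in * c_out + c_out
        rows.append({"layer": index + 1, "kernel": kernel, "in": c_in,
                     "out": c_out, "spatial": size, "params": params})
    return rows


if __name__ == "__main__":
    layers = describe()
    for row in layers:
        print(row)
    print("total parameters:", sum(row["params"] for row in layers))
```

Under these assumptions, most of the parameters sit in the last layer (128 to 512 channels), which is consistent with GPU memory being the main constraint on the design.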

A.2.2 Plan Processing
Recall that each player is provided a partial view of the complete plan. For processing, each plan is stored as a list of tuples, representing all materials present in the plan and their links with (1) the materials needed to create them and (2) the tools needed to break them; that is, nodes (materials) are linked with their children (composite materials and tools) as in the graph representation. The goal material, the root node, is always first in the list. All subsequent nodes are added in a breadth-first fashion, except in cases where a node has already been added to the list (as cycles are allowed). Each material and tool is given a one-hot encoding; mined materials have their children represented as zero vectors, as no other material is needed to make them. Partial plans and their representations are generated from the complete plan by hiding the children of randomly selected nodes (excluding the goal and mines) to depict a lack of knowledge. For encoding, each tuple has the one-hot encodings of (1) the material itself, (2) its parent node, (3) its children nodes, and (4) its associated tool concatenated; the list of tuples is then input to a GRU (Chung et al., 2014), which takes in an input vector of size 81 and has a hidden state size of 32. In the tasks that involve predicting a player's mental state from the perspective of the other player, only the partial plan associated with the other player's point of view is used.

Figure 7a shows a relatively verbose exchange of dialogue. Note that only a portion of the entire game's dialogue (which has 40 exchanges in total) is shown. Here, we observe that there is a clear self-assignment of leader and follower roles between the players: the leader explicitly states every step they think their partner needs to make, almost to the point of micro-managing. We also see an example of slight backtracking, where Player 2 realizes that they are further along in the plan than they initially thought.
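The plan encoding described in A.2.2 can be sketched roughly as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the toy plan, the material and tool names, the multi-hot encoding of children, and the choice to blank out both the children and the tool of a hidden node are all hypothetical. With this toy vocabulary each row has width 17, not the 81 reported in the text.

```python
from collections import deque

# HYPOTHETICAL toy plan: material -> (materials needed to craft it, tool to break it).
# Mined materials have no children. The names are placeholders, not dataset entries.
PLAN = {
    "goal": (["mat_a", "mat_b"], "tool_1"),
    "mat_a": (["mine_x", "mat_b"], "tool_2"),
    "mat_b": (["mine_x", "mine_y"], "tool_1"),
    "mine_x": ([], "tool_2"),
    "mine_y": ([], "tool_1"),
}
GOAL = "goal"
MATERIALS = sorted(PLAN)  # material vocabulary
TOOLS = ["tool_1", "tool_2"]  # tool vocabulary


def one_hot(item, vocab):
    """One-hot vector over vocab; all zeros when item is None."""
    return [1 if entry == item else 0 for entry in vocab]


def multi_hot(items, vocab):
    """ASSUMPTION: children are encoded as a single multi-hot vector."""
    return [1 if entry in items else 0 for entry in vocab]


def linearize(plan, goal, hidden=frozenset()):
    """List (material, parent, children, tool) tuples breadth-first from the goal.

    Nodes already in the list are not added again, so cycles terminate.
    Nodes in `hidden` keep their place but lose their children and tool,
    which mimics a partial plan.
    """
    order = []
    seen = {goal}
    queue = deque([(goal, None)])
    while queue:
        node, parent = queue.popleft()
        children, tool = plan[node]
        if node in hidden:
            order.append((node, parent, [], None))
        else:
            order.append((node, parent, list(children), tool))
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append((child, node))
    return order


def encode(order):
    """Concatenate the encodings of material, parent, children, and tool."""
    return [
        one_hot(node, MATERIALS) + one_hot(parent, MATERIALS)
        + multi_hot(children, MATERIALS) + one_hot(tool, TOOLS)
        for node, parent, children, tool in order
    ]


if __name__ == "__main__":
    for row in encode(linearize(PLAN, GOAL, hidden={"mat_a"})):
        print(row)
```

Each resulting row would then be fed, in list order, to a recurrent encoder such as the GRU mentioned above; the recurrent step itself is omitted here.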

A.3 Example Player Interaction
In Figure 7b, we see an example of a fairly straightforward exchange of dialogue. Player 1 notices that they do not know the recipe for Soul Sand, which is needed to create the goal material, Emerald Block. They ask their partner about it, who in turn states that they do not know how to make Black Wool, which is necessary for creating Soul Sand. Once the information is exchanged, the intermediate material is created promptly and both players proceed to create their goal material.
Consider the dialogue exchange in Figure 7c. The two players are one step away from creating their goal material, Orange Wool. Player 1 points out that they require a block of Cyan Wool. Player 1 points this out to Player 2, even though they cannot be sure Player 2 shares this knowledge, because interacting with the necessary materials requires an Iron Shovel, which Player 1 does not possess. From Player 2's perspective, while they are also aware that a block of Cyan Wool is required, they do not know how to make one, as the arrows in their plan view are missing. As such, they ask their collaboration partner for the recipe. Player 1 then updates Player 2 on how to make Cyan Wool and also points out that one of the necessary materials has already been created. This sample extract of their overall interaction is an example of grounding to the visual modality of their dialogue: our dataset provides much longer sequences of such interactions that are also causally dependent on one another. It is important to note here that the players are not assigned leader or follower roles; in this situation, the two participants coordinated entirely on their own and reached a consensus on who provides information and who executes the tasks. These roles switch throughout the game as their disparities in skills and knowledge change. These select dialogue exchanges showcase a small part of the diversity in possible interactions that happen in our experimental setup, whereby players are able to negotiate, decide, and execute their plans of action in a collaborative setting with relaxed constraints on player roles.

Figure 7: Example dialogue exchanges, with the two players' partial plans also shown as context.