Learning to Execute Actions or Ask Clarification Questions

Collaborative tasks are ubiquitous activities in which some form of communication is required to reach a joint goal. Collaborative building is one such task. We wish to develop an intelligent builder agent in a simulated building environment (Minecraft) that can build whatever users wish simply by talking to the agent. To achieve this goal, such agents need to be able to take the initiative by asking clarification questions when further information is needed. Existing work on the Minecraft Corpus Dataset only learns to execute instructions, neglecting the importance of asking for clarifications. In this paper, we extend the Minecraft Corpus Dataset by annotating all builder utterances into eight types, including clarification questions, and propose a new builder agent model capable of determining when to ask questions and when to execute instructions. Experimental results show that our model achieves state-of-the-art performance on the collaborative building task with a substantial improvement. We also define two new tasks: the learning to ask task and the joint learning task. The latter consists of solving both the collaborative building and learning to ask tasks jointly.


Introduction
Following instructions in natural language to achieve a shared goal with instructors in a pre-defined environment is a ubiquitous task for intelligent agents in many scenarios, e.g., finding a target object in an environment (Nguyen and Daumé III, 2019; Roman et al., 2020), drawing a picture (Lachmy et al., 2021), or building a target structure (Narayan-Chen et al., 2019). A number of machine learning (ML) research projects on instruction following tasks have been initiated using the video game Minecraft (Johnson et al., 2016; Shu et al., 2018; Narayan-Chen et al., 2019; Guss et al., 2019; Jayannavar et al., 2020). Building such agents requires progress in grounded natural language understanding (understanding complex instructions, for example those involving spatial relations, in natural language), self-improvement (studying how to flexibly learn from human interactions), and synergies of ML components (exploring the integration of several ML and non-ML components to make them work together) (Szlam et al., 2019). The recently introduced Minecraft Corpus dataset (Narayan-Chen et al., 2019) proposes a collaborative building task, in which an architect and a builder communicate via a textual chat. Architects are provided with a target structure they want to have built, while builders are the only ones who can control the Minecraft avatar in the virtual environment. The task consists of collaboratively building 3D structures in a block-world scenario, as shown in Figure 1. Earlier work on the Minecraft collaborative building task (Jayannavar et al., 2020) attempted to build an automated builder agent with a large action space but failed to allow the builder to take the initiative in the conversation. However, an intelligent agent should not only understand and execute the instructor's requests but also be able to take the initiative, e.g., by asking clarification questions when the instructions are ambiguous.
In the task defined by this dataset, builders may encounter ambiguous situations that are hard to interpret by just relying on the world state information and instructions. For example, in Figure 1, we provide a simple case where the architect fails to provide sufficient information to the builder, such as the color of the blocks. In this situation, it is clearly difficult for the builder to know exactly which action should be taken. If, however, the builder is able to clarify the situation with the architect, this ambiguity can be resolved. Therefore, builders, besides following architects' instructions, should take the initiative in the conversation and ask questions when necessary.
To this end, in this paper we annotate all builder utterances in the Minecraft Corpus dataset, categorizing them into eight dialogue utterance types as shown in Table 2 and thereby allowing intelligent agents to learn when and what to ask given the world state and dialogue context. In particular, a builder may ask task-level questions or instruction-level questions for further clarification. Experimental results in Sec. 5.2 show that determining when to ask clarification questions remains a challenging task. It is worth noting that the clarification questions in the Minecraft Corpus dataset are more complex and diverse than those in navigation tasks (Roman et al., 2020; Nguyen and Daumé III, 2019), whose questions are relatively simple and mainly about where to go.
Also, we propose a new automated builder agent that learns to map instructions to actions and to decide when to ask questions. Our model utilizes three dialogue slots: the action type slot, the location slot, and the color slot. This solution has the benefit of making learning easier compared to models that operate over a large action space (Jayannavar et al., 2020). To solve the collaborative building task, both the dialogue context and the world state need to be considered. Therefore, to endow our model with the ability to better learn joint representations of the world state and language, our model implements a cross-modality module based on the cross-attention mechanism. Experimental results on our extended Minecraft Corpus dataset show that our model achieves state-of-the-art performance with a substantial improvement on the collaborative building task. We also provide new baselines for the learning to ask task and for the joint learning of these two tasks.
Related Work and Background

Dialogue Tasks. As virtual personal assistants have now penetrated the consumer market, with products such as Siri and Alexa, the research community has produced several works on task-oriented dialogue tasks such as hotel booking, restaurant booking, and movie recommendation (Budzianowski et al., 2018; Wei et al., 2018; Wu et al., 2019; Feng et al., 2021, 2022; Kim and Lipani, 2022). These task-oriented dialogues have been modelled as slot filling tasks, which consist of correctly identifying and extracting the information (slots) useful to solve the task. However, most of these slot filling tasks (Coope et al., 2020; Ni et al., 2020) are treated as semantic tagging or parsing of natural language and do not normally consider visual information. Moreover, these tasks focus only on two of the many components needed by conversational systems: Natural Language Understanding (NLU) and Dialogue State Tracking (DST) (Budzianowski et al., 2018; Williams et al., 2014). Besides these task-oriented dialogue tasks, the research community has also focused on instruction following dialogue tasks, such as target completion tasks (de Vries et al., 2017), object finding tasks (Roman et al., 2020), and navigation tasks. Narayan-Chen et al. (2019) proposed the Minecraft Corpus dataset, whose task is a cooperative asymmetric one in which an architect and a builder have to build a target structure collaboratively. Jayannavar et al. (2020) then built a builder model to follow the sequential instructions from the architect.
Multi-Modal. Almost all instruction following dialogue tasks need to consider contextual information and actions as well as the state of the world (Suhr and Artzi, 2018; Lachmy et al., 2021), which remains a key challenge for such tasks. In particular, the Vision-and-Dialog Navigation (VDN) task (Roman et al., 2020; Zhu et al., 2021), where question-answering dialogue and visual contexts are leveraged to facilitate navigation, has attracted increasing research attention. Other tasks, such as moving blocks tasks (Misra et al., 2017) and object finding tasks (Janner et al., 2018), also require modelling both contextual information in natural language and the world state representation in order to be solved.
Spatial Reasoning. Many instruction following dialogue tasks contain texts with spatial-temporal concepts (Yang et al., 2020). For instance, the Minecraft Corpus dataset contains utterances with spatial relations, e.g., "go to the middle and place an orange block two spaces to the left". Although pre-trained language models have been used successfully in a wide array of downstream tasks, interpreting and grounding abstractions stated in natural language, such as spatial relations, have not been systematically studied and remain challenging. Therefore, another challenge for an agent is to follow instructions that require learning and understanding spatiotemporal linguistic concepts in natural language. To train models able to understand and reason about spatial references in natural language, Shi et al. (2022) proposed a benchmark for robust multi-hop spatial reasoning over texts.
Learning by Asking Questions. Determining whether to ask clarification questions and what to ask is critical for instruction followers to complete their tasks. Several recent studies have focused on learning a dialogue agent with the ability to interact with users both by responding to questions and by asking questions to accomplish its task interactively (Li et al., 2017; de Vries et al., 2017; Misra et al., 2018; Roman et al., 2020). For instance, de Vries et al. (2017) introduced a game in which an unknown object is located by asking questions about the objects in a given image. A decision-maker is introduced to learn when to ask questions by implicitly reasoning about the uncertainty of the agent. Differently from earlier works (Kitaev and Klein, 2017), recent works on VDN tasks propose agents that learn to ask a question when the certainty of the next action is low (Roman et al., 2020; Chi et al., 2020). Roman et al. (2020) proposed a two-model agent comprising a navigator model and a questioner model: the former is responsible for moving towards the goal object, while the latter is used to ask questions. Zhu et al. (2021) proposed an agent that learns to adaptively decide whether and what to communicate with users in order to acquire instructive information that helps navigation.

The Minecraft Dialogue Corpus
The Minecraft Dialogue Corpus (Narayan-Chen et al., 2019) is built upon a simulated block-world environment with dialogues between an architect and a builder. It consists of 509 human-human dialogues (15,926 utterances, 113,116 tokens), in which participants play the roles of architect and builder, together with game logs for 150 target structures of varying complexity (min. 6 blocks, max. 68 blocks, avg. 23.5 blocks). For each target structure at least three dialogues are collected, where each dialogue contains on average 30.7 utterances (22.5 architect utterances and 8.2 builder utterances) and 49.5 builder block movements.
The architect instructs the builder via dialogue to build a target structure. Although the architect observes the builder operating in the world, only the builder can move blocks. The builder has access to an inventory of 120 blocks in six given colors, which they can place or remove. The collaborative building task restricts the structures to a build region of size 11 × 9 × 11, and the dataset contains 3,709, 1,331, and 1,616 samples for the training, validation, and test sets, respectively.

Builder Dialogue Annotation
Builders need to be able to decide their actions at any point in time, rather than only executing actions when told to do so. Thus, we annotate all builder utterances in the Minecraft Corpus dataset (Narayan-Chen et al., 2019) and categorize all 4,904 builder utterances into 8 utterance types. These types are partially inspired by the work of Lambert et al. (2019). Each utterance falls into exactly one category. The categories are defined as follows: (1) Instruction-level Questions: used to request that the architect clarify previous instructions; (2) Task-level Questions: used to request that the architect describe the whole picture of the building task, e.g., asking for the next instruction or asking what the target structure should look like; (3) Verification Questions: used to request that the architect give feedback on previous actions; (4) Greetings: used to greet each other; (5) Suggestions: used to provide suggestions about building; (6) Display Understanding: used to express whether previous instructions have been understood; (7) Status Update: used to describe the current status, e.g., telling the architect where they are, their current block stock status, or whether they have finished a given instruction; (8) Others: not relevant to the task itself (e.g., chit-chat, expressing gratitude, or apologizing).
Among these 8 utterance types, the instruction-level questions and the task-level questions are sub-types of clarification questions, used to further clarify instructions or the task itself when the information from the architect is unclear or ambiguous. Based on these annotations, we extend the original dataset (the first row in Table 2) with two other dialogue acts, 'Ask for clarifications' and 'Others', as shown in the second and third rows of Table 2. Each 'Ask for clarifications' sample includes a world state and a dialogue context followed by a builder utterance labelled as an instruction-level or task-level question; each sample tagged as 'Others' includes a world state and a dialogue context followed by any other builder utterance type.
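As a sketch, the annotation scheme and its mapping to the two added dialogue acts can be encoded as follows (the identifier names are our own shorthand, not labels from the released data):

```python
# The eight builder utterance types and their collapse into the coarse
# dialogue acts used to extend the dataset.
UTTERANCE_TYPES = [
    'instruction_level_question', 'task_level_question',
    'verification_question', 'greeting', 'suggestion',
    'display_understanding', 'status_update', 'other',
]

# Only the first two types are clarification questions.
CLARIFICATION_TYPES = {'instruction_level_question', 'task_level_question'}

def dialogue_act(utterance_type):
    """Collapse an utterance type into the coarse dialogue act."""
    if utterance_type in CLARIFICATION_TYPES:
        return 'ask_for_clarifications'
    return 'others'
```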

Task Definition
Let H be the set of all dialogue contexts, W the set of all world states, and A the set of actions, including building actions (placing a block, removing a block, or a special Stop action that terminates the task) and utterance actions (asking clarification questions or producing other utterance categories). Given a dialogue context h ∈ H and an initial grid-based world state w 0 ∈ W, the goal is to predict the next action type: Execution, Ask for clarifications, or Others. When the predicted action type is Execution, a sequence of actions {a i } n i=1 , where a i ∈ A and a n is the Stop action, needs to be generated to reach the final target structure w n . The action type Execution updates the world state via a deterministic transition function T such that w i = T (w i−1 , a i ) for i = 1, . . . , n.
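The deterministic transition function T can be sketched as follows — a minimal illustration assuming simple place/remove semantics over the 11 × 9 × 11 grid; the function and variable names are ours, not the paper's:

```python
import numpy as np

X, Y, Z = 11, 9, 11  # build region dimensions from the dataset

def transition(world, action):
    """Apply one builder action to the grid world state.

    world: (11, 9, 11) int array; 0 = empty, 1..6 = block color.
    action: (action_type, location, color) where action_type is
    'place' or 'remove', location is an (x, y, z) voxel index, and
    color is an integer in 1..6 (ignored for removals).
    """
    action_type, (x, y, z), color = action
    new_world = world.copy()          # T returns a new state, w_i = T(w_{i-1}, a_i)
    if action_type == 'place':
        new_world[x, y, z] = color
    elif action_type == 'remove':
        new_world[x, y, z] = 0
    return new_world

w0 = np.zeros((X, Y, Z), dtype=np.int64)
w1 = transition(w0, ('place', (5, 0, 5), 3))
w2 = transition(w1, ('remove', (5, 0, 5), 0))
```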

Method
In this section we introduce the proposed builder model, shown in Figure 2. The model comprises four major components: the utterance encoder, the world state encoder, the fusion module, and the slot decoder. The utterance encoder (Sec. 4.1) and world state encoder (Sec. 4.2) learn to represent the dialogue context and the world state. These encoded representations are then fed into the fusion module (Sec. 4.3), which learns contextualized embeddings for the grid world and textual tokens through the single- and cross-modality modules. Finally, the learned world and text representations are mapped to the pre-defined slot values in the slot decoder (Sec. 4.4).

Figure 2: The model architecture. The ⊕ sign represents the concatenation operation. This illustration uses plate notation. There are a total of N T + 1 text single-modality modules, N G + 1 grid single-modality modules, N T text cross-modality modules, and N T grid cross-modality modules. Arrows indicate the flow of information.

Dialogue Context Encoder
We add "architect" and "builder" annotations before each architect utterance A t and each builder utterance B t respectively. The dialogue utterances at turn t are then represented as D t = "architect" A t ⊕ "builder" B t , where ⊕ is the sequence concatenation operation, and the entire dialogue context is defined as H = D 1 ⊕ D 2 ⊕ . . . ⊕ D t . Given the dialogue context H, we truncate the tokens from the end of the dialogue context or pad them to a fixed length as inputs, and then use the dialogue context encoder to encode the utterance history into U ∈ R s×dw , where d w is the dimension of the word embedding and s is the maximum number of tokens in a dialogue context. The dialogue context encoder can use word embeddings such as GloVe (Pennington et al., 2014) or contextual word embeddings (Devlin et al., 2019), both widely used in the literature (Ni et al., 2021a,b; Wang et al., 2020).
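A minimal sketch of this preprocessing step. We assume truncation keeps the most recent tokens and that `<pad>` is the padding symbol — both are our illustrative choices, not details from the paper:

```python
def build_context(turns, max_len, pad='<pad>'):
    """turns: list of (speaker, utterance) pairs,
    speaker in {'architect', 'builder'}.
    Returns a token list of exactly max_len tokens."""
    tokens = []
    for speaker, utterance in turns:
        tokens.append(speaker)            # speaker annotation before each utterance
        tokens.extend(utterance.split())
    if len(tokens) > max_len:
        tokens = tokens[-max_len:]        # keep the most recent tokens
    else:
        tokens = tokens + [pad] * (max_len - len(tokens))
    return tokens

turns = [('architect', 'place a red block'), ('builder', 'where')]
padded = build_context(turns, max_len=8)
truncated = build_context(turns, max_len=4)
```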

Grid World State Encoder
The world state is represented by a voxel-based grid. We first represent each grid cell as a 7-dimensional one-hot vector that encodes either emptiness or a block of one of six colors, yielding a 7×11×9×11 world state representation. Additionally, we truncate the action history to the last five actions, assign each an integer weight in {1, . . . , 5}, and include these weights as a separate input feature in each grid cell, resulting in a raw world state input W 0 ∈ R 8×11×9×11 . We also represent the last action as an 11-dimensional vector a, where the first two dimensions represent the placement or removal actions, the next six dimensions represent the color, and the last three dimensions represent the location of the last action. The structure of the world state encoder is similar to that of Jayannavar et al. (2020), i.e., consisting of k 3D-convolutional layers (f 1 ) with kernel size 3, stride 1, and padding 1, each followed by a ReLU activation function. Between every successive pair of these layers there is a 1×1×1 3D-convolutional layer (f 2 ) with stride 1 and no padding, also followed by ReLU: W i = ReLU(f 2 (ReLU(f 1 (W i−1 )))), where i = 1, 2, . . . , k − 1, and the final layer yields W k ∈ R dc×11×9×11 , the learned grid-based world representation, where d c is the dimension of each grid representation. Then we concatenate the last action representation a ∈ R 11 to each grid vector in W k and reshape the result into a sequence of 1089 grid vectors of dimension d c + 11.
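The encoder's shape bookkeeping can be illustrated with a naive numpy 3-D convolution. The layer count, channel widths, and random weights below are illustrative choices, not the paper's hyperparameters:

```python
import numpy as np

def conv3d(x, w, pad):
    """Naive 3-D convolution with stride 1.
    x: (C_in, X, Y, Z); w: (C_out, C_in, k, k, k)."""
    k = w.shape[2]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad), (pad, pad)))
    dims = [d - k + 1 for d in xp.shape[1:]]
    out = np.zeros((w.shape[0], *dims))
    for i in range(dims[0]):
        for j in range(dims[1]):
            for l in range(dims[2]):
                # contract the (C_in, k, k, k) patch against every filter
                out[:, i, j, l] = np.tensordot(
                    w, xp[:, i:i + k, j:j + k, l:l + k], axes=4)
    return out

relu = lambda t: np.maximum(t, 0.0)

rng = np.random.default_rng(0)
state = rng.random((8, 11, 9, 11))        # one-hot colors + action-history weights
c_in, c_hid, k_layers = 8, 4, 2           # toy sizes, not the paper's
h = state
for i in range(k_layers):
    f1 = rng.standard_normal((c_hid, c_in, 3, 3, 3)) * 0.1
    h = relu(conv3d(h, f1, pad=1))        # kernel 3, stride 1, padding 1
    c_in = c_hid
    if i < k_layers - 1:                  # 1x1x1 conv between successive f1 layers
        f2 = rng.standard_normal((c_hid, c_hid, 1, 1, 1)) * 0.1
        h = relu(conv3d(h, f2, pad=0))
grid_vectors = h.reshape(c_hid, -1).T     # (1089, d_c) sequence of grid vectors
```

Note that both convolution settings preserve the 11 × 9 × 11 spatial shape, so the final reshape always yields 1089 grid vectors.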

Fusion Module
The fusion module comprises four major components: two single-modality modules and two cross-modality modules. The former are based on self-attention layers and the latter on cross-attention layers. These take as input the world state representation and the dialogue history representation. Between every successive pair of grid single-modality modules or text single-modality modules there is a cross-modality module. We use N G and N T layers for the grid cross-modality module and the text cross-modality module, respectively. We first revisit the definition of and notation for the attention mechanism (Bahdanau et al., 2015) and then introduce how it is integrated into our single-modality and cross-modality modules.
Attention Mechanism. Given a query vector x and a sequence of context vectors {y j } K j=1 , the attention mechanism first computes a matching score s j between the query vector x and each context vector y j . The attention weights are then calculated by normalizing the matching scores: a j = exp(s j ) / Σ j ′ exp(s j ′ ). The output of an attention layer is the attention-weighted sum of the context vectors: Attention(x, {y j }) = Σ j a j · y j . In particular, the attention mechanism is called self-attention when the query vector itself is among the context vectors {y j }. We use multi-head attention following Devlin et al. (2019) and Tan and Bansal (2019).
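A single-head version of this computation can be written in a few lines of numpy, with names mirroring the text (query x, contexts y_j, dot-product matching scores assumed):

```python
import numpy as np

def attention(x, Y):
    """x: query vector (d,); Y: context vectors (K, d).
    Returns the attention-weighted sum of the rows of Y."""
    scores = Y @ x                            # matching score s_j = <x, y_j>
    weights = np.exp(scores - scores.max())   # softmax-normalized weights a_j
    weights = weights / weights.sum()
    return weights @ Y                        # sum_j a_j * y_j
```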
Single-Modality Module. Each layer in a single-modality module contains a self-attention sub-layer and a feed-forward sub-layer, where the feed-forward sub-layer is further composed of a linear transformation layer, a dropout layer, and a normalization layer. We take N G + 1 and N T + 1 layers for the grid single-modality modules and the text single-modality modules respectively, interspersed with cross-modality modules as shown in Figure 2. Since new blocks can only feasibly be placed if one of their faces touches the ground or another block in the Minecraft world, we add masks to all infeasible grid cells in the grid single-modality modules. For a set of text vectors {u n i } s i=1 and a set of grid vectors {w m j } 1089 j=1 , the inputs of the n-th text single-modality layer and the m-th grid single-modality layer, where n ∈ {1, . . . , N T + 1} and m ∈ {1, . . . , N G + 1}, we first feed them into two self-attention sub-layers: u n i = SelfAttn n u (u n i , {u n i }) and w m j = SelfAttn m w (w m j , {w m j }, mask). Lastly, the outputs of the self-attention sub-layers, u n i and w m j , are passed through feed-forward sub-layers to obtain û n i and ŵ m j .
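The feasibility mask can be sketched as follows, under our reading of the rule: a voxel is feasible if it is empty and either at ground level or face-adjacent to an existing block (this interpretation is an assumption, not a detail from the paper):

```python
import numpy as np

def feasible_mask(world):
    """world: (11, 9, 11) int grid, 0 = empty; axis 1 is height.
    Returns True where a new block could feasibly be placed."""
    X, Y, Z = world.shape
    mask = np.zeros(world.shape, dtype=bool)
    faces = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for x in range(X):
        for y in range(Y):
            for z in range(Z):
                if world[x, y, z] != 0:
                    continue                  # occupied voxels are infeasible
                if y == 0:                    # ground level is always feasible
                    mask[x, y, z] = True
                    continue
                for dx, dy, dz in faces:
                    nx, ny, nz = x + dx, y + dy, z + dz
                    if (0 <= nx < X and 0 <= ny < Y and 0 <= nz < Z
                            and world[nx, ny, nz] != 0):
                        mask[x, y, z] = True  # face touches an existing block
                        break
    return mask
```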
Cross-Modality Module. Each layer in the cross-modality module consists of one cross-attention sub-layer and one feed-forward sub-layer, where the feed-forward sub-layers follow the same setting as in the single-modality modules. Given the outputs of the n-th text single-modality layer, {û n i } s i=1 , and the m-th grid single-modality layer, {ŵ m j } 1089 j=1 , as the query and context vectors, we pass them through cross-attention sub-layers, respectively: u n+1 i = CrossAttn u (û n i , {ŵ m j }) and w m+1 j = CrossAttn w (ŵ m j , {û n i }). The cross-attention sub-layer is used to exchange information and align entities between the two modalities in order to learn joint cross-modality representations. Then the output of each cross-attention sub-layer is processed by one feed-forward sub-layer to obtain {u n+1 i } s i=1 and {w m+1 j } 1089 j=1 , which are passed to the following single-modality modules.
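A minimal numpy sketch of this cross-modal exchange: text vectors query the grid vectors and vice versa. The dimensions are toy values, and the dot-product scoring is our simplification of the multi-head version:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(Q, C):
    """Q: (n_q, d) queries from one modality; C: (n_c, d) contexts from the other."""
    A = softmax(Q @ C.T)      # (n_q, n_c) cross-modal attention weights
    return A @ C              # each query becomes a mixture of the other modality

rng = np.random.default_rng(0)
text = rng.random((4, 8))     # toy text vectors (s = 4 tokens, d = 8)
grid = rng.random((10, 8))    # toy grid vectors (10 voxels instead of 1089)
text_out = cross_attention(text, grid)   # text attends to grid
grid_out = cross_attention(grid, text)   # grid attends to text
```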
Finally, we obtain a set of word vectors, {û N T +1 i } s i=1 , and a set of grid vectors, {ŵ N G +1 j } 1089 j=1 , that is, U N T and W N G . Since the values of N G and N T may differ, the modality with more layers keeps using the last single-modality module's output of the other modality as the input to its cross-modality modules, as shown in Figure 2.

Slot Decoder
The Slot Decoder contains three linear projection layers with trainable parameters M L ∈ R d ′ c , M C ∈ R 6×dw , and M T ∈ R da×dw , where d a is the number of action types to predict. We compute the average of U N T ∈ R s×dw along the s dimension to obtain u ∈ R dw . Then we compute the location, color, and action type logits: l̂ = softmax(W N G M L ), ĉ = softmax(M C u), and t̂ = softmax(M T u), where the softmax functions map the extracted information into l̂ ∈ R 1089 , ĉ ∈ R 6 , and t̂ ∈ R da .
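The three projection heads can be sketched with toy dimensions. The weights and sizes are illustrative, and we assume d ′ c = d c + 11 (the grid vectors with the last action appended); that assumption is ours:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_w, d_c, d_a = 16, 12, 3                        # toy dimensions
W_grid = rng.standard_normal((1089, d_c + 11))   # fused grid vectors
U_text = rng.standard_normal((40, d_w))          # fused text vectors (s = 40)

M_L = rng.standard_normal(d_c + 11)              # location head: one logit per voxel
M_C = rng.standard_normal((6, d_w))              # color head
M_T = rng.standard_normal((d_a, d_w))            # action-type head

u = U_text.mean(axis=0)                          # average over the s text tokens
loc = softmax(W_grid @ M_L)                      # distribution over 1089 voxels
col = softmax(M_C @ u)                           # distribution over 6 colors
act = softmax(M_T @ u)                           # distribution over d_a action types
```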

Experiments, Results and Discussion
In this section, we first compare our model against the baseline for the collaborative building task, where models only need to learn the instruction following task (Sec. 5.1). Then, we train our model to learn when to ask and evaluate it on our extended Minecraft Dialogue Corpus (Sec. 5.2). Finally, we evaluate our model on the combination of the two above-mentioned tasks (Sec. 5.3). All training details are reported in the Appendix. The software and data are available at: https://github.com/ZhengxiangShi/LearnToAsk.

Collaborative Building Task

Settings. We first compare our model against the only existing baseline, BAP (Jayannavar et al., 2020), using only the "Execution (Original)" dataset from Table 2. Then, we conduct experiments using the augmented data from Jayannavar et al. (2020): the models are trained and evaluated with 5,563 (indicated as 2x), 9,272 (4x), and 12,981 (6x) training samples. We provide the ground-truth previous actions and world state for next-action prediction. For the sake of fairness, we retrain the BAP model under the same setting. For our models, we present the performance of two different dialogue context encoders: pre-trained GloVe word embeddings with 300 dimensions (Pennington et al., 2014) as the initial word embeddings followed by a GRU (Chung et al., 2014), and contextual word embeddings from the pre-trained BERT base model (Devlin et al., 2019). For the action type slot, we pre-define three potential values: Placement, Removal, and Stop. The value of the location slot can be one of 1,089 candidate voxels, and the value of the color slot can be one of six candidate colors. During training we minimize the sum of the cross-entropy losses of the location slot, the color slot, and the action type slot. The F1-score on the test set is used to evaluate model performance by comparing the model predictions against the action sequence performed by the human builder.
Results. In Table 3 we present the results of our model and the baselines for the collaborative building task on the Minecraft Corpus Dataset. Experimental results show that our model outperforms the baseline model by a large margin. Results on the augmented dataset show that the advantage of data augmentation is not obvious. The performance using contextualized word embeddings is poorer, which could be due to the size of the builder model with the BERT encoder, making it more difficult to train.

Learning to Ask Task
Settings. In this task, all models are trained only to predict one of three action types, Execution, Ask for clarifications, or Others, without needing to generate a sequence of actions. All datasets in Table 2 are used. In our model, for the action type slot, we define three potential slot values: Execution, Ask, and Others. In this experiment, the location and color slots are not used. We use the pre-trained GloVe embeddings in the dialogue context encoder. During training, the cross-entropy loss of the action type is minimized. Results. In Table 4, we present the results of our model. Although our model achieves around 80% overall test accuracy, the correct answers mainly come from the Execution type, while the model struggles with the Ask and Others types, which have a joint test accuracy of only 38.6%. These experimental results demonstrate the difficulty of the learning to ask task and show that there is still large room for improvement.

Figure 3: Case study of the collaborative building task in Sec. 5.1: A and B represent the architect and the builder.

Joint Learning Task
Settings. In this task, the models are trained not only to predict one of the action types Execution, Ask for clarifications, and Others, but also to generate a sequence of actions. All datasets in Table 2 are used. In our model, for the action type slot, we pre-define five potential values: Placement, Removal, Stop, Ask, and Others. The value of the location slot can be one of 1,089 candidate voxels, and the value of the color slot is one of six candidate colors. We again use pre-trained GloVe embeddings in the dialogue context encoder. During training we minimize the sum of the cross-entropy losses of the location slot, the color slot, and the action type slot, with weights of 0.1, 0.1, and 0.8, respectively. We provide the ground-truth previous actions and world state for next-action prediction. Results. In Table 5, we present our model's test accuracy for each action type. The model achieves a 72.3% overall test accuracy. However, if the execution of building actions is excluded, the joint test accuracy of the Ask and Others action types is about 40.5%, indicating that deciding when to take the initiative remains challenging. In Table 6, we also report the results for the building task. Not surprisingly, the performance of our model drops slightly compared to Table 3, reflecting the difficulty of joint learning.
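The weighted objective can be sketched with toy logits; the gold indices and logit values below are illustrative, not drawn from the dataset:

```python
import numpy as np

def cross_entropy(logits, gold):
    """Negative log-probability of the gold index under softmax(logits)."""
    z = logits - logits.max()                     # stabilized log-sum-exp
    return float(np.log(np.exp(z).sum()) - z[gold])

# Toy logits for each slot (location over 3 of the 1089 voxels, 2 colors
# shown instead of 6, 3 action types).
loc_logits = np.array([2.0, 0.1, -1.0])
col_logits = np.array([0.5, 0.5])
act_logits = np.array([1.0, -1.0, 0.0])

# Weighted sum with the 0.1 / 0.1 / 0.8 weights from the text.
loss = (0.1 * cross_entropy(loc_logits, 0)
        + 0.1 * cross_entropy(col_logits, 1)
        + 0.8 * cross_entropy(act_logits, 0))
```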

Case Study
Although our model can predict actions more accurately than the baseline, for example usually predicting the color of the blocks correctly with a test accuracy of about 60%, it is still non-trivial for our model to predict a whole action sequence correctly. In Figure 3, the architect instructed the builder to build a 3x3 square, and our model generated only parts of the structure successfully.
Dataset noise makes the learning process more challenging: the builder action sequences are noisy due to, for example, the builder misclicking during the construction process (Narayan-Chen, 2020). Also, builder action sequences are often fragmented between utterances due to the frequent interruptions of the architect. To address these issues, a good model should be capable of learning better representations for higher-level abstractions in natural language, such as spatial relation concepts, and be more robust to noisy actions (Shi et al., 2022). However, existing models, including pre-trained ones (Devlin et al., 2019), fail to learn such representations for spatial reasoning, which translates into poor performance on these instruction following tasks.

Conclusion
In this paper, we extend the Minecraft Corpus dataset by labelling builder utterances into eight types, two of which are types of clarification questions. This allows builder models to learn to take the initiative in instruction following tasks. We have also proposed a new model that achieves state-of-the-art performance on the Minecraft collaborative building task with a large improvement. Besides these contributions, we introduce a new learning to ask task for clarification questions and a joint learning task combining it with the collaborative building task. We leave the generation of clarification questions to future work.