Hierarchical Task Learning from Language Instructions with Unified Transformers and Self-Monitoring

Despite recent progress, learning new tasks through language instructions remains an extremely challenging problem. On the ALFRED benchmark for task learning, the published state-of-the-art system achieves a task success rate of less than 10% in unseen environments, compared to human performance of over 90%. To address this issue, this paper takes a closer look at task learning. In a departure from the widely applied end-to-end architecture, we decomposed task learning into three sub-problems: sub-goal planning, scene navigation, and object manipulation, and developed a model called HiTUT (Hierarchical Tasks via Unified Transformers) that addresses each sub-problem in a unified manner to learn a hierarchical task structure. On the ALFRED benchmark, HiTUT achieves the best performance to date with remarkably higher generalization ability. In unseen environments, HiTUT achieves over 160% performance gain in success rate compared to the previous state of the art. The explicit representation of task structures also enables an in-depth understanding of the nature of the problem and the ability of the agent, which provides insight for future benchmark development and evaluation.


Introduction
As physical agents (e.g., robots) start to emerge as our assistants and partners, it has become increasingly important to empower these agents with an ability to learn new tasks by following human language instructions. Many benchmarks have been developed to study the agent's ability to follow natural language instructions in various domains, including navigation (Anderson et al., 2018; Chen et al., 2019), object manipulation (Misra et al., 2017), and embodied reasoning (Das et al., 2018a; Gordon et al., 2018).

[Figure 1: A high-level goal directive ("Place a clean mug in the coffee machine.") and its hierarchical decomposition into sub-goals and primitive actions.]

Despite recent progress, learning new tasks through language instructions remains an extremely challenging problem, as it touches upon almost every aspect of AI, from perception and reasoning to planning and action. For example, on the ALFRED benchmark for task learning (Shridhar et al., 2020), the state-of-the-art system only achieves less than 10% task success rate in an unseen environment (Singh et al., 2020), compared to human performance of over 90%. Most previous works apply an end-to-end neural architecture (Singh et al., 2020; Storks et al., 2021) which attempts to map language instructions and visual inputs directly to actions. While striving to top the leaderboard for end-task performance, these models are opaque, making it difficult to understand the nature of the problem and the ability of the agent.
To address this issue, this paper takes a closer look at task learning using the ALFRED benchmark. In a departure from an end-to-end architecture, we have developed an approach to learn the hierarchical structure of task compositions from language instructions. As shown in Figure 1, a high-level goal directive ("place a clean mug in the coffee machine") can be decomposed into a sequence of sub-goals. Some sub-goals involve navigation in space (e.g., Goto(Mug), Goto(Sink)) and others require manipulation of objects (e.g., Pickup(Mug), Clean(Mug)). These sub-goals can be further decomposed into navigation actions such as RotateLeft and MoveAhead, and manipulation actions such as Put(Mug, Sink) and TurnOn(Faucet). Such a hierarchical structure is similar to the Hierarchical Task Networks (HTN) widely used in AI planning (Erol et al., 1994). While this hierarchical structure is explicit and has several advantages in planning and in making models transparent, how to effectively learn such a structure remains a key challenge.
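To make the hierarchy concrete, the decomposition above can be sketched as a simple nested data structure. This is purely illustrative; the class names (Task, SubGoal, Action) are our own and not part of HiTUT:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    type: str            # e.g. "MoveAhead", "Put", "TurnOn"
    arg: str = ""        # e.g. "Mug" (empty for navigation actions)

@dataclass
class SubGoal:
    type: str            # e.g. "Goto", "Pickup", "Clean"
    arg: str             # e.g. "Mug", "Sink"
    actions: List[Action] = field(default_factory=list)

@dataclass
class Task:
    goal: str            # high-level goal directive
    sub_goals: List[SubGoal] = field(default_factory=list)

# The running example from Figure 1, partially filled in
task = Task(
    goal="place a clean mug in the coffee machine",
    sub_goals=[
        SubGoal("Goto", "Mug", [Action("RotateLeft"), Action("MoveAhead")]),
        SubGoal("Pickup", "Mug", [Action("Pickup", "Mug")]),
        SubGoal("Goto", "Sink"),
        SubGoal("Clean", "Mug", [Action("Put", "Sink"), Action("TurnOn", "Faucet")]),
    ],
)
```

Each level of the structure (goal, sub-goal, primitive action) corresponds to one of the prediction problems described in Section 3.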
Motivated by recent work in multi-task learning (Liu et al., 2019a), we decomposed task learning in ALFRED into three sub-problems: sub-goal planning, scene navigation, and object manipulation, and developed a model called HiTUT (Hierarchical Tasks via Unified Transformers) that addresses each sub-problem in a unified manner to learn a hierarchical task structure. On the ALFRED benchmark, HiTUT has achieved the best performance with remarkably higher generalization ability. In unseen environments, HiTUT achieves over 160% performance gain in success rate compared to the previous state of the art.
The contributions of this work lie in the following two aspects.
An explainable model achieving new state-of-the-art performance. By explicitly modeling a hierarchical structure, our model offers explainability and allows the agent to monitor its own behaviors during task execution (e.g., what sub-goals are completed and what to accomplish next). When a failed attempt occurs, the agent can backtrack to previous sub-goals for alternative plans to execute. This ability of self-monitoring and backtracking offers the flexibility to dynamically update sub-goal planning at inference time to cope with exceptions and new situations. It has led to significantly higher generalization ability in unseen environments.
A decomposable platform to support more in-depth evaluation and analysis. The decomposition of task learning into sub-problems not only makes tasks easier for an agent to learn, but also provides a tool for an in-depth analysis of task complexity and the agent's ability. For example, one of our observations from the ALFRED benchmark is that the agent's inability to navigate is a major bottleneck in task completion: navigation actions are harder to learn than sub-goal planning and manipulation actions. For manipulation actions, the agent can learn action types and action arguments predominantly from sub-goals and the history of actions, while language instructions do not contribute significantly to learning. The success of manipulation actions also largely depends on the agent's ability to detect and ground action arguments to corresponding objects in the environment. These findings allow a better understanding of the nature of the tasks in ALFRED and provide insight into future opportunities and challenges in task learning.

Related Work
Recent years have seen an increasing amount of work at the intersection of language, vision, and robotics. One line of work particularly focuses on teaching robots new tasks through demonstration and instruction (Rybski et al., 2007; Mohseni-Kabir et al., 2018). Originating in the robotics community, learning from demonstration (LfD) (Thomaz and Cakmak, 2009; Argall et al., 2009) enables robots to learn a mapping from world states to manipulations based on human demonstrations of desired robot behaviors. More recent work has also explored the use of natural language and dialogue together with demonstration to teach robots new actions (Mohan and Laird, 2014; Scheutz et al., 2017; Liu et al., 2016; She and Chai, 2017; Chai et al., 2018; Gluck and Laird, 2018).
To facilitate task learning from natural language instructions, several benchmarks using simulated physical environments have been made available (Anderson et al., 2018; Misra et al., 2018; Blukis et al., 2019). In particular, the vision-and-language navigation (VLN) benchmark (Anderson et al., 2018) has received a lot of attention. Many models have been developed, such as the Speaker-Follower model (Fried et al., 2018), the Self-Monitoring Navigation Agent (Ma et al., 2019a; Ke et al., 2019), the Regretful Agent (Ma et al., 2019b), and the environment drop-out model (Tan et al., 2019). The VLN benchmark has been further extended to study the fidelity of instruction following (Jain et al., 2019) and examined to understand the bias of the benchmark (Zhang et al., 2020). Beyond navigation, there are also benchmarks that additionally incorporate object manipulation to broaden research on vision and language reasoning, such as embodied question answering (Das et al., 2018a; Gordon et al., 2018). The work closest to ours is Neural Modular Control (NMC) (Das et al., 2018b), which also decomposes high-level tasks into sub-tasks and addresses each sub-task accordingly. However, self-monitoring and backtracking between sub-tasks are not explored in NMC.
The ALFRED benchmark consists of high-level goal directives such as "place a clean mug in the coffee machine" and low-level language instructions such as "rinse the mug in the sink" and "turn right and walk to the coffee machine" for accomplishing these goals. In addition to language instructions, it also comes with expert demonstrations of task execution in an interactive visual environment. We choose this dataset because its unique challenges are closer to the real world: the agent must not only learn to ground language to visual perception but also learn to plan and execute actions for both navigation and object manipulation.

Hierarchical Tasks via Unified Transformers
As discussed in Section 1, task structures are inherently hierarchical, composed of goals and sub-goals. Different sub-goals involve tasks of a different nature. For example, navigation focuses on path planning and movement trajectories, while manipulation is more concerned with interactions with concrete objects. Instead of mapping end-to-end from language instructions to primitive actions (Singh et al., 2020; Storks et al., 2021), we decomposed task learning into three separate but connected sub-problems: sub-goal planning, scene navigation, and object manipulation, and developed a model called HiTUT (Hierarchical Tasks via Unified Transformers) to tie these sub-problems together into a hierarchical task structure.

Task Decomposition
We first introduce some notation to describe the task and the model. There are three types of information:
-Language (L). We use G to denote a high-level goal directive, e.g., "place a clean mug in the coffee machine", and I_i to refer to a specific low-level language instruction.
-Vision (V). It captures the visual representation of the environment.
-Predicates (P). Symbolic representations are defined to capture three types of predicates: sub-goals (sg), navigation actions (a^n), and manipulation actions (a^m). Each sg has two parts (sg_type, sg_arg), where sg_type is the type (e.g., Goto) and sg_arg is the argument (e.g., Knife). Each a^n specifies an action type (a^n_type) from {RotateLeft, RotateRight, MoveAhead, LookUp, LookDown}. Each a^m also has two parts (a^m_type, a^m_arg), where a^m_type is the action type (e.g., TurnOn) and a^m_arg is the action argument (e.g., Faucet).
Sub-Goal Planning. Sub-goal planning acquires a sequence of sub-goals sg_1, ..., sg_n to accomplish the high-level goal G. We predict the type sg^type_i and argument sg^arg_i separately to avoid a combinatorial expansion of the output space. Previous work (Jansen, 2020) models sub-goal planning merely from high-level goal directives without visual grounding. These plans are fixed and thus not robust to potential failures during execution or variations in the visual environment. To overcome these drawbacks, our sub-goal planning is done on the fly, after the previous sub-goal has been executed in the environment. More specifically, the objective is to learn a model M_sg that takes the visual observation at the current step (v_t), the high-level goal directive (G), and the complete sub-goal history prior to the current step (sg_<i) to predict the current sub-goal:

sg_i = M_sg(v_t, G, sg_<i)

The predicted sub-goals serve as a bridge between the high-level goal and the low-level predictions of navigation and/or manipulation actions.
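The on-the-fly planning loop described above can be sketched in a few lines. This is a simplified illustration, not HiTUT's actual control code; `plan_sub_goal`, `execute`, and `observe` are hypothetical stand-ins for the learned planner M_sg, the lower-level policies, and the environment interface:

```python
def run_task(goal, plan_sub_goal, execute, observe, max_steps=25):
    """Re-query the sub-goal planner after each completed sub-goal,
    so the history sg_<i always reflects what actually happened."""
    history = []                          # sg_<i, grows as sub-goals complete
    for _ in range(max_steps):
        v_t = observe()                   # current visual observation
        sg_type, sg_arg = plan_sub_goal(v_t, goal, history)
        if sg_type == "End":              # planner signals task completion
            return history
        execute(sg_type, sg_arg)          # navigation or manipulation policy
        history.append((sg_type, sg_arg))
    return history
```

Because the planner sees the current observation and the executed history at every step, it can adapt when execution deviates from the original plan, which is what enables the backtracking behavior described in Section 3.3.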
Scene Navigation. Navigation sub-goals only require predicting the types of navigation actions. The objective is to learn a navigation model M_n that takes the current visual observation (v_t), the current sub-goal (sg_i), the language instruction (I_i), and the navigation action history up to the current step (a^n_<j) to predict the next navigation action:

a^n_j = M_n(v_t, I_i, sg_i, a^n_<j)

Object Manipulation. For a manipulation sub-goal, in addition to the type and argument of the action, the model M_m also needs to generate a segmentation mask (m_j) on the current visual observation to indicate which object to interact with (i.e., which object the argument is grounded to):

(a^m_j, m_j) = M_m(v_t, I_i, sg_i, a^m_<j)

The mask prediction is crucial because the action will not be successfully executed under an incorrect grounding, even if a^m_j is correctly predicted. As described above, although the context of the three sub-problems varies, each model has similar input components from the space of {V, L, P}. This similarity inspires us to design a unified model that solves the three sub-problems simultaneously.

Unified Transformers
We leverage the self-attention-based transformer architecture (Vaswani et al., 2017) to capture the correspondence between different input sources, as shown in Figure 2. We first project the input from different modalities into the language embedding space, and adopt a transformer encoder to integrate the information. Multiple prediction heads are constructed on top of the transformer encoder to make predictions for the sub-goal type and argument, the action type and argument, and object masks, respectively. As the three sub-problems share a similar input form, we solve them all with a single unified model based on multi-task learning (Liu et al., 2019a).
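The "one shared encoder, multiple prediction heads" idea can be sketched with a toy example. All sizes and weights below are arbitrary stand-ins, not the actual HiTUT architecture; the point is only that every head reads off the same encoder output:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_sg_types, n_args, n_act_types, k_objects = 16, 8, 30, 12, 10

# One linear head per output: sub-goal type/argument, action
# type/argument, and object-mask selection over K detected objects.
heads = {
    "sg_type":  rng.standard_normal((d_model, n_sg_types)),
    "sg_arg":   rng.standard_normal((d_model, n_args)),
    "act_type": rng.standard_normal((d_model, n_act_types)),
    "act_arg":  rng.standard_normal((d_model, n_args)),
    "mask":     rng.standard_normal((d_model, k_objects)),
}

def predict(h):
    """Read every prediction head off the same encoder output h."""
    return {name: int(np.argmax(h @ w)) for name, w in heads.items()}

h = rng.standard_normal(d_model)   # stand-in for the pooled encoding
preds = predict(h)
```

In the multi-task setup, only the heads relevant to the sampled sub-problem contribute to the loss, while the shared encoder is updated by all of them.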
Our model differs from previous works (Singh et al., 2020) in the following aspects. First, we do not apply recurrent state transitions, but instead feed the prediction history as input to each subsequent prediction, which may better capture correlations between predicates and other modalities. Second, we do not use dense visual features from the scene, but rather object detection results. By doing this, we map all modalities into the word embedding space before feeding them into the transformer encoder, thus taking advantage of pre-trained language models. Third, we use a predicate embedding to share linguistic knowledge between predicate symbols and word embeddings.
Predicate Embedding. We use the term predicates to refer to symbolic representations including sub-goal types, action types, and their arguments. We map each symbol to its corresponding natural language phrase (e.g., AppleSliced is mapped to "a sliced apple"). We then tokenize the phrase, embed the tokens using word embeddings, and take the sum of the embeddings to obtain the representation of the predicate.
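The scheme can be illustrated as follows. The phrase table and the toy embedding function are our own illustrative stand-ins; HiTUT would use its learned (RoBERTa-initialized) word embeddings:

```python
import numpy as np

# Symbol -> natural-language phrase (illustrative entries only)
PHRASES = {
    "AppleSliced": "a sliced apple",
    "Pickup": "pick up",
    "CoffeeMachine": "a coffee machine",
}

rng = np.random.default_rng(0)
DIM = 8
_vocab = {}          # token -> row of a toy embedding table

def word_vec(token):
    """Lazily assign a random vector per token (stand-in for learned embeddings)."""
    if token not in _vocab:
        _vocab[token] = rng.standard_normal(DIM)
    return _vocab[token]

def predicate_embedding(symbol):
    """Sum the word embeddings of the symbol's phrase."""
    phrase = PHRASES.get(symbol, symbol.lower())
    return sum(word_vec(tok) for tok in phrase.split())
```

Summing token embeddings keeps predicate symbols in the same vector space as ordinary words, which is what lets the unified transformer treat them as just another part of the input sequence.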
Vision Encoding. We use a pre-trained object detector (Mask R-CNN (He et al., 2017)) to encode visual information. Instead of dense features, we simply use the detection results (class labels, bounding box coordinates, and confidence scores) as visual features. Specifically, we use the top K detected objects with a confidence score higher than 0.4 to form the visual features. The object class labels share the same space with object arguments and can thus be embedded into the same space. The position information of an object is encoded by a 7-dimensional vector consisting of its bounding box coordinates, width, and height, along with its confidence score. This vector is first mapped to the same dimension as the word embeddings by a linear transformation, then added to the class embedding to form the final object representation.
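A minimal sketch of this detection-based encoding, assuming detections arrive as (label, box, confidence) triples; the embedding table and projection matrix are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K, CONF_THRESH = 16, 5, 0.4
class_emb = {"Mug": rng.standard_normal(DIM), "Sink": rng.standard_normal(DIM)}
W_pos = rng.standard_normal((7, DIM))     # linear map: 7-d position -> DIM

def encode_detections(detections):
    """detections: list of (label, (x1, y1, x2, y2), confidence)."""
    kept = [d for d in detections if d[2] > CONF_THRESH]
    kept = sorted(kept, key=lambda d: -d[2])[:K]   # top-K by confidence
    feats = []
    for label, (x1, y1, x2, y2), conf in kept:
        # 7-d vector: box coordinates, width, height, confidence
        pos = np.array([x1, y1, x2, y2, x2 - x1, y2 - y1, conf])
        feats.append(class_emb[label] + pos @ W_pos)
    return np.stack(feats) if feats else np.zeros((0, DIM))
```

Each row of the result is one object token that can be concatenated with the language and predicate tokens before the transformer encoder.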
Object Grounding. HiTUT does not generate masks by itself. Instead, it chooses an object from the K input objects and uses the corresponding mask generated by the object detector. This method makes use of the strong prior learned from object detection pre-training, so the model can focus on learning the grounding task. A drawback is that the object detector cannot be improved during training, and the performance of the detector determines the upper bound of our model's grounding ability. We leave the exploration of more robust grounding methods for future work.
Posture Feature. We use an additional posture feature to assist scene navigation, which includes the agent's rotation (N, S, E, W) and its horizon angle (discretized in 15-degree increments). These are embedded and summed to form the posture feature representation. The agent maintains its posture as a relative change from its initial posture instead of an absolute posture in the environment, thus avoiding the need for additional sensory data.
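The relative posture bookkeeping might look like the following sketch. The 90-degree rotation step is an assumption based on the discrete action set (the 15-degree horizon step is stated above); the environment's actual step sizes may differ:

```python
class Posture:
    """Track rotation and horizon relative to the starting pose,
    so no absolute localization or extra sensors are needed."""

    def __init__(self):
        self.rotation = 0      # degrees, mod 360, relative to start
        self.horizon = 0       # camera pitch, relative to start

    def update(self, action):
        if action == "RotateLeft":
            self.rotation = (self.rotation - 90) % 360
        elif action == "RotateRight":
            self.rotation = (self.rotation + 90) % 360
        elif action == "LookUp":
            self.horizon -= 15
        elif action == "LookDown":
            self.horizon += 15
        # MoveAhead changes position, not orientation
```

The two values are then embedded (like any other discrete token) and summed into the posture feature described above.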

Self-Monitoring and Backtracking
The unified transformers trained for the sub-problems are integrated as shown in Figure 3. One important advantage of intermediate sub-goal representations is to facilitate self-monitoring and backtracking, which allows the agent to dynamically adjust the plan to cope with failures during execution. As shown in Section 4, this feature brings the most remarkable performance gain compared to the state of the art.
Self-Monitoring. The world is full of uncertainties, and mistakes are inevitable. Based on the learned model, the agent should be able to monitor its own behaviors and dynamically update its plan when unexpected situations arise. Our explicit representation of sub-goals allows the agent to self-check whether sub-goals are accomplished.
Particularly for manipulation sub-goals, it is feasible for the agent to detect failures by simply monitoring whether all the manipulation actions are successfully executed.

Backtracking. In classical AI, backtracking is the technique of going back and trying an alternative path that can potentially lead to the goal. As shown in Figure 4, when Pickup(Mug) fails, the agent backtracks to Goto(Mug) and tries a different sequence of primitive actions to accomplish this sub-goal.
In ALFRED, based only on visual information without other sensory input (e.g., observing a mug without knowing how far away it is), it is difficult to check whether a navigation sub-goal has been achieved (e.g., whether a Mug is reachable). So every time after trying a different path for Goto(Mug), the agent checks whether the subsequent manipulation action Pickup(Mug) succeeds. If it does, the agent moves on to the next sub-goal; otherwise the agent continues to backtrack until a limit on the maximum number of attempts is reached. Our explicit representation of sub-goals makes this backtracking possible and has led to a significant performance gain in unseen environments.
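The backtracking strategy just described can be summarized in a few lines. `navigate` and `manipulate` are hypothetical stand-ins for the learned policies: `navigate` attempts a (possibly different) path on each attempt, and `manipulate` returns whether the follow-up manipulation succeeded, which serves as the indirect success check for the navigation sub-goal:

```python
def goto_then_manipulate(target, navigate, manipulate, max_attempts=4):
    """Re-try the navigation sub-goal until the subsequent
    manipulation succeeds or the attempt limit is reached."""
    for attempt in range(max_attempts):
        navigate(target, attempt)        # try a (possibly new) path
        if manipulate(target):           # e.g. did Pickup(target) succeed?
            return True                  # move on to the next sub-goal
    return False                         # give up after the limit
```

The attempt limit corresponds to the backtracking limit varied in the experiments of Section 4.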

Setting and Implementation
Dataset. We follow the train/validation/test data partition proposed in ALFRED, where the validation and test sets are further split into seen and unseen based on whether the scene is seen by the model during training. Each sub-goal planning step or primitive prediction step forms a data instance for the corresponding sub-problem. The numbers of data instances are shown in Table 1.
Pre-training. We employ the pre-training-then-fine-tuning paradigm for both the object detector and the main model. For the object detector, we use a Mask R-CNN (He et al., 2017) model pre-trained on MSCOCO (Lin et al., 2014), and fine-tune it on 50K images collected by replaying the expert trajectories in the ALFRED train split. As we observe that a single model struggles to detect small objects together with large receptacles, we train two networks to detect movable objects and large receptacles separately. We use the pre-trained RoBERTa (Liu et al., 2019b) model to initialize the transformer encoder.
Training. We perform imitation learning (supervised learning) on the expert demonstrations. The ground-truth labels of sub-goals and primitive actions are obtained from the metadata. Inputs and output labels are organized for each sub-problem respectively, as described in Section 3. We use the mask proposal that overlaps most with the ground-truth mask as the mask selection label if the intersection-over-union (IoU) is above 50%. If there is no valid mask proposal, the label is assigned 0 as an indicator of non-valid grounding. We optimize the cross-entropy loss between model predictions and the ground truth. We follow the multi-task training schema of Liu et al. (2019a), where in each iteration a batch is randomly sampled among all the sub-problems and the model is updated according to the corresponding objective. More details are in the Appendix.
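The mask-labeling rule can be sketched as follows, using axis-aligned boxes as a stand-in for the actual segmentation masks (the real labels are computed on masks, and the IoU helper here is a generic illustration):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def mask_label(proposals, gt, thresh=0.5):
    """Label = index of the best-overlapping proposal (1-based),
    or 0 when no proposal exceeds the IoU threshold."""
    best = max(range(len(proposals)),
               key=lambda i: iou(proposals[i], gt), default=None)
    if best is None or iou(proposals[best], gt) <= thresh:
        return 0                      # indicator of non-valid grounding
    return best + 1                   # index of the chosen proposal
```

Reserving label 0 for "no valid grounding" lets the same classification head handle both selection and rejection.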
Evaluation Metrics. ALFRED leverages an interactive evaluation in the AI2-THOR environment. A task is considered successful if all the goal conditions (e.g., the target object is placed on a correct receptacle and in a requested state such as heated or cleaned) are met. Three measures are used: (1) success rate (the ratio of successfully completed tasks), (2) goal-condition rate (the ratio of completed goal conditions), and (3) path-length-weighted versions of these two rates, which take into account the difference in length between the predicted action sequence and the expert demonstration.
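To our reading, the path-length weighting follows the usual ALFRED convention of scaling a score by the ratio of expert path length to the (no shorter) actual path length; treat this as a sketch rather than the benchmark's exact implementation:

```python
def path_length_weighted(score, expert_len, agent_len):
    """Scale a success/goal-condition score by L* / max(L*, L),
    where L* is the expert path length and L the agent's, so that
    longer-than-expert executions earn proportionally less credit."""
    return score * expert_len / max(expert_len, agent_len)
```

This is why the gap between weighted and unweighted rates widens when backtracking is allowed (Section 4): backtracking succeeds more often but takes more steps than the expert demonstration.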

Overall Performance of HiTUT
We first evaluate the overall performance of the proposed framework, as shown in Table 2. On the test data reported on the leaderboard, HiTUT achieves performance comparable to MOCA in seen environments. However, in unseen environments, HiTUT outperforms MOCA by over 160% in success rate. This demonstrates that our hierarchical task modeling approach has higher generalization ability than end-to-end models: self-monitoring and backtracking enabled by the hierarchical task structure allow the agent to better handle new situations. Remarkably, based only on high-level goal directives (i.e., HiTUT (G Only)) without using any sub-goal instructions, HiTUT obtains a success rate of 11% in unseen environments, a 110% performance gain over MOCA. This result indicates that HiTUT learns prior task knowledge through the hierarchical modeling process and can apply it directly in new environments with some success. Nevertheless, our results are far from human performance and there is still huge room for improvement.
To gain a better understanding of the problem, we also conduct evaluations on sub-goals. The agent is positioned at the starting point of each sub-goal by following the expert demonstration, and the success rate of accomplishing the sub-goal is measured. HiTUT first predicts a symbolic sub-goal representation and then the action sequence to complete the sub-goal. As shown in Table 3, HiTUT outperforms previous models on almost all of the manipulation sub-goals by a large margin. The performance gain is particularly significant in unseen environments, which demonstrates the advantage of our explicit hierarchical task modeling in low-level action planning.

The Role of Backtracking
We conduct experiments to better understand the role of self-monitoring and backtracking. We repeat the task-solving evaluation with different limits on the maximum number of backtracks allowed. The agent stops only when the model predicts to stop (i.e., predicts End) or the backtracking limit is reached. As shown in Table 4, as the limit increases, the task/goal-condition success rates increase accordingly. Notably, the gap between the weighted and unweighted success rates becomes larger when more backtracking attempts are allowed. This is expected, because backtracking deviates from instruction-following navigation to goal-oriented exploration, which usually takes more steps than the expert demonstration. Since backtracking particularly targets the navigation sub-goal Goto (see Section 3.3), we further examine the role of the number of re-tries (i.e., backtracks) in completing this sub-goal. As shown in Table 5, HiTUT reaches more targets when given more opportunities to backtrack. Backtracking is most beneficial in unseen environments.

Complexity of Tasks
Task decomposition provides a tool for better understanding task complexity and the agent's ability. To do so, we replace different parts of the model's predictions with the corresponding oracle sub-goals, actions, or masks, as shown in Table 6.
Using oracle sub-goals improves the success rate by 2%-6% (line SG), showing that sub-goal planning is a relatively easy problem on which the agent can perform reasonably well. After using the oracle navigation actions, the seen and unseen success rates are boosted by absolute gains of 50% and 46% respectively (line N), indicating that navigating to reach target objects is a particularly hard problem on which the agent performs poorly. When oracle sub-goals, navigation actions, and manipulation actions (only symbolic representations) are given (line SG+N+M), task success is bounded by the performance of the pre-trained object mask generator (i.e., visual grounding of objects). When oracle object masks are given together with oracle sub-goals and navigation actions (line SG+N+GR), so that the agent only needs to predict the symbolic representation of manipulation actions, the performance is near perfect. These last two lines indicate that predicting the type and argument of a manipulation action is a rather simple problem in the ALFRED benchmark, while grounding action arguments to the visual environment remains a challenging task.

[Table 6: Success rates of HiTUT with different parts of predictions replaced by oracle operations from expert demonstrations. N, M, SG, and GR denote oracle navigation actions, manipulation actions, sub-goals, and object grounding (i.e., mask generation), respectively.]
We further examine the complexity of learning to solve the sub-problems by evaluating next-step prediction accuracy given the gold history under different conditions, as shown in Figure 5. The models are trained and evaluated with different combinations of inputs and different amounts of training data. We observe that excluding the visual input does not hurt performance for sub-goal prediction and manipulation action prediction (shown by a, b, d, e). This indicates that in ALFRED, pure symbolic planning is often independent of visual understanding, which is consistent with previous findings. However, this could be an oversimplification caused by bias in the dataset rather than a true reflection of the physical world. For example, the next action can often be predicted by memorizing correlations among predicates instead of reasoning over vision and language, due to the lack of diversity in the task environments. Removing language instructions causes a minimal performance drop of 1%-2% on action prediction, which raises the question of how useful language instructions are in this benchmark. Furthermore, prediction accuracy is above 90% and 98% with only 5% of the training data for sub-goal and manipulation planning respectively, while navigation accuracy is only 82% given all the data. This again supports the finding that planning and performing navigation actions is a much harder problem than sub-goal planning and manipulation in ALFRED.

Discussion and Conclusion
This paper presents a hierarchical task learning approach that achieves new state-of-the-art performance on the ALFRED benchmark. The task decomposition and explicit representation of sub-goals enable a better understanding of the problem space as well as of current strengths and limitations. Our empirical results and analysis suggest several directions to pursue in the future. First, we need to develop more advanced component technologies integral to task learning, e.g., more advanced navigation modules through either more effective structures (Hong et al., 2020) or richer perception (Shen et al., 2019) to solve the navigation bottleneck. We need to develop better representations and more robust and adaptive learning algorithms to support self-monitoring and backtracking. We also need to seek ways to improve visual grounding, which is crucial to both navigation and manipulation.
Second, we should also take a closer look at the construction and objectives of existing benchmarks. How a benchmark is created and how truthfully it reflects the complexity of the physical world affect the scalability and reliability of an approach in the real world. As for the objective, there is a distinction between learning to perform tasks and learning to follow language instructions. If the objective is the former, the agent should be measured by its ability to accomplish high-level goal directives without being given specific language instructions at inference time. If the objective is the latter, then the agent should be measured by how faithfully it follows human instructions, aside from achieving the goals, similar to Jain et al. (2019). We need to be clear about the objectives and develop evaluation metrics accordingly.
Finally, when humans perform poorly on a complex task, we have the ability to diagnose the problem and put more energy into learning the difficult parts. Physical agents should have similar abilities. In task learning, on the one hand, the agent should be able to master simple sub-tasks from a few data instances, e.g., through a few turns of interaction with humans (Karamcheti et al., 2020). On the other hand, it should be aware of the bottlenecks in its learning progress and proactively request help when problems are encountered, either during learning or during deployment (She and Chai, 2017). How to effectively design interactive and active learning algorithms for an agent to learn complex and compositional tasks remains an important open research question.

C Additional Results
A detailed per-task performance comparison of HiTUT and MOCA is shown in Table 7. As this comparison might be unfair because HiTUT benefits from model pre-training, we also conduct an ablation study to show the effectiveness of pre-training. In Table 8, we compare the fine-tuned RoBERTa model to a transformer of the same size trained from scratch. RoBERTa pre-training consistently improves performance over training from scratch, both with and without backtracking, with an absolute gain of between 0.4% and 5% in task success rate. Notably, the from-scratch model with 4 or 8 backtracks still outperforms MOCA by a large margin in unseen success rate.