Modular Networks for Compositional Instruction Following

Standard architectures used in instruction following often struggle on novel compositions of subgoals (e.g., navigating to landmarks or picking up objects) whose individual components were observed during training. We propose a modular architecture for following natural language instructions that describe sequences of diverse subgoals. In our approach, subgoal modules each carry out natural language instructions for a specific subgoal type. A sequence of modules to execute is chosen by learning to segment the instructions and predicting a subgoal type for each segment. When compared to standard, non-modular sequence-to-sequence approaches on ALFRED, a challenging instruction-following benchmark, we find that modularization improves generalization to novel subgoal compositions, as well as to environments unseen in training.


Introduction
Work on grounded instruction following (MacMahon et al., 2006; Vogel and Jurafsky, 2010; Tellex et al., 2011; Chen and Mooney, 2011; Artzi and Zettlemoyer, 2013) has recently been driven by sequence-to-sequence models (Mei et al., 2016; Hermann et al., 2017), which allow end-to-end grounding of linguistically-rich instructions into equally-rich visual contexts (Misra et al., 2018; Anderson et al., 2018; Chen et al., 2019). These sequence-to-sequence models are monolithic: they consist of a single network structure which is applied identically to every example in the dataset.
Monolithic instruction following models typically perform well when evaluated on test data from the same distribution seen during training. However, they often struggle in compositional generalization: composing atomic parts, such as actions or goals, where the parts are seen in training but their compositions are not (Lake and Baroni, 2018; Ruis et al., 2020; Hill et al., 2020). In this work, we improve compositional generalization in instruction following with modular networks, which have been successful in non-embodied language grounding tasks (Andreas et al., 2016; Hu et al., 2017; Cirik et al., 2018; Yu et al., 2018; Mao et al., 2019; Han et al., 2019) and in following synthetic instructions or symbolic policy descriptions (Andreas et al., 2017; Oh et al., 2017; Das et al., 2018). Modular networks split the decision making process into a set of neural modules. Modules are each specialized for some function, composed into a structure specific to each example, and trained jointly to complete the task.
Prior work has found that modular networks often perform well in compositional generalization because of their composable structure (Devin et al., 2017; Andreas et al., 2017; Bahdanau et al., 2019; Purushwalkam et al., 2019), and that they can generalize to new environments or domains through module specialization (Hu et al., 2019; Blukis et al., 2020). However, all of this work has either focused on grounding tasks without a temporal component or used a network structure which is not predicted from language.
We propose a modular architecture for embodied vision-and-language instruction following, and find that this architecture improves generalization on unseen compositions of subgoals (such as navigation, picking up objects, cleaning them, etc.). We define separate sequence-to-sequence modules per type of subgoal. These modules are strung together to execute complex high-level tasks. We train a controller to predict a sequence of subgoal types from language instructions, which determines the order in which to execute the modules.
We evaluate models on the ALFRED dataset (Shridhar et al., 2020), an instruction-following benchmark containing a diverse set of household tasks. We focus on compositional generalization: carrying out instructions describing novel high-level tasks, containing novel compositions of actions (see Figure 1 for an example). We find that our modular model improves performance on average across subgoal types when compared to a standard, monolithic sequence-to-sequence architecture. Additionally, we find improved generalization to environments not seen in training.

Modular Instruction Following Networks
We focus on following instructions in embodied tasks involving navigation and complex object interactions, as shown in Figure 2.
In training, each set of full instructions (e.g. "Turn right and cross the room ... Place the vase on the coffee table to the left of the computer.") is paired with a demonstration of image observations and actions. We further assume that each full instruction is segmented into subgoal instructions, and that each subgoal instruction is labeled with one of a small number (in our work, 8) of subgoal types, e.g. ["Walk to the coffee maker.": GOTO], ["Pick up the dirty mug...": PICKUP], ..., and paired with the corresponding segment of the demonstration.
During evaluation, the agent is given only full instructions (which are unsegmented and unlabeled), and must predict a sequence of actions to carry out the instructions, conditioning on the image observations it receives.
Our modular architecture for compositional instruction following consists of a high-level controller (Figure 2, left), and modules for each subgoal type (Figure 2, right). The high-level controller chooses modules to execute in sequence based on the natural language instructions, and each chosen module executes until it outputs a STOP action. The modules all share the same sequence-to-sequence architecture, which is the same as the monolithic architecture. We initialize each module's parameters with parameters from the monolithic model, and then fine-tune the parameters of each module to specialize for its subgoal.
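As a minimal sketch of this control flow, the loop below chains stub modules in the order a controller predicts, running each until it emits STOP. All class and function names here are illustrative stand-ins, not our implementation; the stub modules follow a fixed action script so that only the control flow is visible.

```python
# Sketch of the modular execution loop: the controller's predicted
# (subgoal_type, sub_instruction) pairs select modules, each of which
# acts until it emits STOP. Hypothetical stub classes for illustration.

STOP = "<<stop>>"

class StubModule:
    """Stands in for a learned sequence-to-sequence subgoal module."""
    def __init__(self, actions):
        self.actions = list(actions)  # fixed action script for illustration

    def step(self, sub_instruction, observation, hidden):
        # A real module would run one LSTM decoder step here, conditioning
        # on the instruction, visual features, and recurrent state.
        if not self.actions:
            return STOP, hidden
        return self.actions.pop(0), hidden

def run_episode(subgoals, modules, execute):
    """Run each predicted subgoal's module in order; `hidden` mimics the
    recurrent state handed from module to module."""
    hidden = None
    for subgoal_type, sub_instruction in subgoals:
        module = modules[subgoal_type]
        while True:
            action, hidden = module.step(sub_instruction, None, hidden)
            if action == STOP:
                break
            execute(action)

# Example: a GOTO module followed by a PICKUP module.
trace = []
modules = {"GOTO": StubModule(["MoveAhead", "RotateRight"]),
           "PICKUP": StubModule(["PickupObject"])}
run_episode([("GOTO", "walk to the tv stand"),
             ("PICKUP", "pick up the vase")],
            modules, trace.append)
print(trace)  # -> ['MoveAhead', 'RotateRight', 'PickupObject']
```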

Instruction-Based Controller
Our instruction-based controller is trained to segment a full instruction into sub-instructions and predict the subgoal type for each sub-instruction. We use a linear chain CRF (Lafferty et al., 2001) that conditions on a bidirectional-LSTM encoding of the full instruction and predicts tags for each word, which determine the segmentation and sequence of subgoal types. This model is based on standard neural segmentation and labelling models (Huang et al., 2015; Lample et al., 2016).
We train the controller on the ground-truth instruction segmentations and subgoal sequence labels, and in evaluation use the model to predict segmentations and their associated subgoal sequences (Figure 2, top left). This predicted sequence of subgoals determines the order in which to execute the modules (Figure 2, right). We use a BIO chunking scheme to jointly segment the instruction and predict a subgoal label for each segment.
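For illustration, decoding a per-word BIO tag sequence into labeled sub-instructions can be sketched as follows. This is a hypothetical helper, not our code, and it assumes every word receives a B- or I- tag:

```python
# Convert a BIO tag sequence over words into (subgoal_type, segment) pairs.
# Illustrative sketch; assumes each tag is "B-LABEL" or "I-LABEL".

def bio_to_segments(words, tags):
    segments = []
    for word, tag in zip(words, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or not segments or segments[-1][0] != label:
            segments.append((label, [word]))  # start a new segment
        else:
            segments[-1][1].append(word)      # continue the current segment
    return [(label, " ".join(ws)) for label, ws in segments]

words = "walk to the coffee maker pick up the dirty mug".split()
tags = ["B-GOTO"] + ["I-GOTO"] * 4 + ["B-PICKUP"] + ["I-PICKUP"] * 4
print(bio_to_segments(words, tags))
# -> [('GOTO', 'walk to the coffee maker'), ('PICKUP', 'pick up the dirty mug')]
```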
Formally, for a full instruction of length N, the controller defines a distribution over subgoal tags s_{1:N} given the instruction x_{1:N} as

p(s_{1:N} | x_{1:N}) ∝ exp( Σ_{n=1}^{N} U_{s_n} + Σ_{n=2}^{N} B_{s_{n-1}, s_n} )

The subgoal tag scores U_{s_n} for word n are given by a linear projection of bidirectional LSTM features for the word at position n. The tag transition scores B_{s_{n-1}, s_n} are learned scalar parameters.
In training, we supervise s_{1:N} using the segmentation of the instruction x_{1:N} into K subgoal instructions and the subgoal label for each instruction. To predict subgoals for a full instruction in evaluation, we obtain arg max_{s_{1:N}} p(s_{1:N} | x_{1:N}) using Viterbi decoding, which provides a segmentation into sub-instructions and a subgoal label for each sub-instruction.
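Viterbi decoding for a linear-chain CRF of this form can be sketched as below. The per-word tag scores U and transition scores B here are illustrative values, not learned parameters:

```python
# Viterbi decoding for a linear-chain CRF: find the tag sequence maximizing
# sum_n U[n][s_n] + sum_n B[s_{n-1}][s_n]. Illustrative sketch with toy scores.

def viterbi(U, B, tags):
    # delta[s]: best score of any tag prefix ending in tag s
    delta = {s: U[0][s] for s in tags}
    backptr = []
    for n in range(1, len(U)):
        prev, delta, ptr = delta, {}, {}
        for s in tags:
            best = max(tags, key=lambda r: prev[r] + B[r][s])
            delta[s] = prev[best] + B[best][s] + U[n][s]
            ptr[s] = best
        backptr.append(ptr)
    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy example with two tags ("B" begins a segment, "I" continues it).
U = [{"B": 2.0, "I": 0.0}, {"B": 0.0, "I": 1.0}, {"B": 0.0, "I": 1.0}]
B = {"B": {"B": -1.0, "I": 1.0}, "I": {"B": 0.0, "I": 1.0}}
print(viterbi(U, B, ["B", "I"]))  # -> ['B', 'I', 'I']
```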
The controller obtains 96% exact match accuracy on subgoal sequences on validation data.

Module Architecture
Our modularized architecture is shown in Figure 2, right. The architecture consists of 8 independent modules, one for each of the 8 subgoals in the domain (e.g. GOTO, PICKUP). For each module, we use the same architecture as Shridhar et al. (2020)'s monolithic model. This is a sequence-to-sequence model composed of an LSTM decoder taking as input an attended embedding of the natural language instruction, pretrained ResNet-18 (He et al., 2016) features of the image observations, and the previous action's embedding. Hidden states are passed between the modules' LSTM decoders at subgoal transitions (Figure 2, right).
At each time step, each module M^i computes its hidden state based on the last time step's action a_{t-1}, the current time step's observed image features o_t, an attended language embedding x̂^i_t, and the previous hidden state h^i_{t-1}:

h^i_t = LSTM^i([a_{t-1}; o_t; x̂^i_t], h^i_{t-1})

Each module's attended language embedding x̂^i_t is produced using its own attention mechanism over embeddings X = x_{1:N} of the language instruction, which are produced by a bidirectional LSTM encoder:

x̂^i_t = Attend^i(X, h^i_{t-1})

Finally, the action a_t and object interaction mask m_t are predicted from h^i_t and e^i_t with a linear layer and a deconvolution network, respectively. More details about this architecture can be found in Shridhar et al. (2020). Both the action and mask decoders, as well as the language encoder, are shared across modules. Our use of subgoal modules is similar to the hierarchical policy approaches of Andreas et al. (2017), Oh et al. (2017), and Das et al. (2018). However, in those approaches, the input to each module is symbolic (e.g. FIND[KITCHEN]). In contrast, all modules in our work condition directly on natural language.
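The attention step can be illustrated with a simple dot-product scoring form; this is a simplifying assumption for the sketch (the exact parameterization follows Shridhar et al. (2020)):

```python
# Soft attention over instruction word encodings: score each word against
# the decoder state, softmax, and return the weighted sum of encodings.
# Dot-product scoring is an illustrative simplification.
import numpy as np

def attend(X, h):
    """X: (N, d) instruction word encodings; h: (d,) decoder hidden state.
    Returns the attended language embedding x_hat of shape (d,)."""
    scores = X @ h                       # (N,) one score per word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over words
    return weights @ X                   # convex combination of encodings

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # 6 instruction words, 4-dim encodings
h = rng.normal(size=4)
x_hat = attend(X, h)
print(x_hat.shape)  # -> (4,)
```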

Training
We first pre-train the monolithic model by maximizing the likelihood of the ground-truth trajectories in the training data (Shridhar et al., 2020). We train for up to 20 epochs using the Adam optimizer (Kingma and Ba, 2014) with early stopping on validation data (see Appendix A.1 for hyperparameters). We use this monolithic model to initialize the parameters of each of the modules, which have architectures identical to the monolithic model, and fine-tune them using the same training and early stopping procedure on the same validation data, allowing the monolithic model's parameters to specialize for each module. Each module predicts only the actions for its segment of each trajectory; however, modules are jointly fine-tuned, passing hidden states (and gradients) from module to module.

Generalization Evaluation
We evaluate models on out-of-domain generalization in two conditions (see below) using the ALFRED benchmark (Shridhar et al., 2020), comparing our modular approach to their non-modular sequence-to-sequence model. ALFRED is implemented in AI2-THOR 2.0 (Kolve et al., 2017), which contains a set of simulated environments with realistic indoor scene renderings and object interactions.
The dataset contains approximately 25K expert instruction-trajectory pairs, comprising 3 instructions for each of 8K unique trajectories. The instructions include both a high-level instruction and a sequence of low-level instructions. In our experiments, we do not use the high-level instructions, which Shridhar et al. (2020) found to produce comparable results with these architectures when evaluating generalization to unseen environments.
Figure 1 shows two example trajectories and their associated instructions. Trajectories are composed (see Sec. 2) of sequences of eight different types of subgoals: navigation (GOTO) and a variety of object interactions (e.g. PICKUP, CLEAN, HEAT). Each subgoal's subtrajectory is composed of a sequence of low-level discrete actions which specify commands for navigation or object interactions (the latter accompanied by image segmentations to choose the object to interact with).

Generalization Conditions
The ALFRED dataset was constructed to test generalization to novel instructions and unseen environments.However, all evaluation trajectories in the dataset correspond to sequences of subgoals that are seen during training.For example, some training and evaluation instances might both correspond to the underlying subgoal sequence GOTO, PICKUP, GOTO, PUT, but differ in their low-level actions, their language descriptions, and possibly also the environments they are carried out in.
Novel Tasks. We evaluate models' ability to generalize to different high-level tasks (compositions of subgoals) than those seen in training. The dataset contains seven different task types, such as Pick & Place, as described in Appendix B.1. We hold out two task types and evaluate models on their ability to generalize to them: Pick Two & Place and Stack & Place. These tasks are chosen because they contain subgoal types that are all individually seen in training, but typically in different sequences.
We create generalization splits pick-2-seen and pick-2-unseen by filtering the seen and unseen splits below to contain only Pick Two & Place tasks, and remove all Pick Two & Place tasks from the training data.We create splits stack-seen and stack-unseen for Stack & Place similarly.
Novel Instructions and Environments. This is the standard condition defined in the original ALFRED dataset. There are two held-out validation sets: seen, which tests generalization to novel instructions and trajectories but through environments seen during training, and unseen, which tests generalization to novel environments: rooms with new layouts, object appearances, and furnishings.

Results
We compare our modular architecture with the monolithic baseline, averaging performance over models trained from 3 random seeds. For each generalization condition, we measure success rates over full trajectories as well as over each subgoal type independently. Due to the challenging nature of the domain, subgoal evaluation provides finer-grained comparisons than full trajectories.
We use the same evaluation methods and metrics as in Shridhar et al. (2020).Success rates are weighted by path lengths to penalize successful trajectories which are longer than the ground-truth demonstration trajectory.To evaluate full trajectories, we measure path completion: the portion of subgoals completed within the full trajectories.To evaluate the subgoals independently, we advance the model along the expert trajectory up until the point where a given subgoal begins (to maintain a history of actions and observations), then require the model to carry out the subgoal from that point.
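Assuming the path-length weighting takes the standard SPL-style form (Anderson et al., 2018) used by ALFRED, a successful trajectory's credit is discounted by the ratio of the expert path length to the agent's path length; a minimal sketch:

```python
# Path-length-weighted success: a success scores 1.0 only if the agent's
# trajectory is no longer than the expert demonstration; longer trajectories
# are discounted proportionally. Illustrative sketch of the metric.

def path_weighted_success(success, expert_len, agent_len):
    """success in {0, 1}; lengths are numbers of actions."""
    return success * expert_len / max(expert_len, agent_len)

print(path_weighted_success(1, 10, 10))  # -> 1.0 (matched expert length)
print(path_weighted_success(1, 10, 20))  # -> 0.5 (twice as long as expert)
print(path_weighted_success(0, 10, 5))   # -> 0.0 (failure scores zero)
```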
We also report results from Shridhar et al. (2020) and Singh et al. (2020). We note that the approach of Singh et al. (2020) obtains higher performance on full trajectories than the system of Shridhar et al. (2020) (which we base our approach on) primarily by introducing a modular object interaction architecture (shared across all subgoals) and a pre-trained object segmentation model. These techniques could also be incorporated into our approach, which uses modular components for individual subgoal types.
Novel Tasks. Table 1 shows, for each split, the success rates on subgoals appearing in at least 50 validation examples. The modular model outperforms the monolithic model on both seen and unseen splits (Tables 1b and 1c). Full trajectory results for novel task generalization are shown in Table 2. In the double generalization condition (unseen environments for the held-out pick-2 and stack tasks), neither model completes subgoals successfully on full trajectories. Overall, we find that modularity helps across most generalization conditions.
Generalization to novel environments. We also compare models on generalization to unseen environments. In the independent subgoal evaluation, the monolithic and modular models perform equally on average in the standard-seen split, while the modular model improves on average in the standard-unseen split (Table 1).

Conclusions
We introduced a novel modular architecture for grounded instruction following in which each module is a sequence-to-sequence model conditioned on natural language instructions. With the ALFRED dataset as a testbed, we showed that our modular model achieves better out-of-domain generalization, generalizing better at the subgoal level to novel task compositions and unseen environments than the monolithic model used in prior work. All of the module types in our model currently use separate parameterizations but identical architectures; future work might further leverage the modularity of our approach by using specialized architectures, training procedures, or loss functions for each subgoal type. Furthermore, unsupervised methods for jointly segmenting instructions and trajectories, without requiring subgoal labels and alignments, would be a valuable addition to our framework.

B.1 Task Types
The dataset contains demonstrations for 7 different kinds of tasks.
Pick & Place. The agent must pick up a specified object, bring it to a destination, and place it. For example, "Pick up a vase, place it on the coffee table."

Examine in Light. The agent must pick up an object and bring it to a light source. For example, "Examine the remote control under the light of the floor lamp."

Heat & Place. The agent must pick up an object, put it in the microwave, toggle the microwave, take the object out of the microwave, and finally place the heated object at a specified location. For example, "Put a heated apple next to the lettuce on the middle shelf in the refrigerator."

Cool & Place. This is the same as above, but with a refrigerator instead of a microwave. For example, "Drop a cold potato slice in the sink."

Clean & Place. The agent must put an object into the sink and turn on the water to clean the object. Then, it must be placed at a specified location. For example, "Put a washed piece of lettuce on the counter by the sink."

Stack & Place. The agent must pick up an object, place it into a receptacle, and then bring the stacked objects to a specified location and place them. For example, "Move the pan on the stove with a slice of tomato in it to the table."

Figure 1: At evaluation time, an instruction following agent may need to generalize both to novel chains of subgoals whose components were encountered during training, as well as to completely new environments. In the generalization condition above, the agent must generalize to multiple pickup actions (in green) at test time, whereas only single ones were seen in training, as well as to a new house. We propose a modular architecture to handle these cases.

Figure 2: Our modular approach first uses a controller (left) trained with supervised learning to segment a given instruction and label segments with subgoal types (e.g. GOTO, PICKUP) to execute. These subgoal types are used to chain together modules (right) to carry out instructions in the environment. Each module is a separately-parameterized sequence-to-sequence model that conditions on an attended representation of the instruction sequence, the visual observations, and the action taken at the previous timestep. Modules pass recurrent hidden states to each other.

Table 1: Success percentages, by subgoal type, on the various generalization splits, and averaged across subgoal types (Avg.). We compare the performance of the monolithic (Mono.) model to our modular model (Mod.). The modular model generalizes better on average to unseen environments (standard-unseen) and to both seen and unseen environments for two held-out task types: Pick-2 and Stack. Bolded numbers show the best model between Mono. and Mod., with * and ** denoting differences that are statistically significant at the p < 0.15 and p < 0.05 levels, respectively, by a one-tailed t-test. S+ gives results from Shridhar et al. (2020) and MOCA from Singh et al. (2020).

Pick Two & Place. The agent must pick up an object, place it somewhere, then pick up another instance of that object and put it in the same place. For example, "Place two CDs in top drawer of black cabinet." These last two task types, Stack & Place and Pick Two & Place, are the ones held out in the Novel Tasks generalization experiments.