Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments

In the Vision-and-Language Navigation (VLN) task, an embodied agent navigates a 3D environment, following natural language instructions. A challenge in this task is how to handle ‘off the path’ scenarios, where an agent veers from a reference path. Prior work supervises the agent with actions based on the shortest path from the agent’s location to the goal, but such goal-oriented supervision is often not in alignment with the instruction. Furthermore, the evaluation metrics employed by prior work do not measure how much of a language instruction the agent is able to follow. In this work, we propose a simple and effective language-aligned supervision scheme, and a new metric that measures the number of sub-instructions the agent has completed during navigation.


Introduction
Training agents to navigate in realistic environments based on natural language instructions is a step towards building robots that understand humans and can assist them in their daily chores. Anderson et al. (2018b) introduced the Vision-and-Language Navigation (VLN) task, where an agent navigates a 3D environment to follow natural language instructions. Much of the prior work on VLN assumes a discrete navigation graph (nav-graph), where the agent teleports between graph nodes, both in indoor (Anderson et al., 2018b) and outdoor (Chen et al., 2019; Mehta et al., 2020) settings. Krantz et al. (2020) reformulated the VLN task to a continuous environment (VLN-CE) by lifting the discrete paths to continuous trajectories, bringing the task closer to real-world scenarios. Krantz et al. (2020) supervised agent training with actions based on the shortest path from the agent's location to the goal, following prior work in VLN (Fried et al., 2018; Tan et al., 2019; Hu et al., 2019; Anderson et al., 2019). However, as Jain et al. (2019) observed, such supervision is goal-oriented and does not always correspond to following the natural language instruction.

Figure 1: A language-aligned path (blue) in an instruction following task may differ from the shortest path (red) to the goal. Language-aligned supervision (blue arrows) encourages the agent at any given location (dark circles) to move towards the nearest waypoint on the language-aligned path, and can hence be a better supervisory signal for instruction following than goal-oriented supervision (red arrows).
Our key idea is that language-aligned supervision is better than goal-oriented supervision, as the path matching the instructions (language-aligned) may differ from the shortest path to the goal (goal-oriented). This is especially true in 'off the path' scenarios (where the agent veers off the reference path prescribed by the instructions). Language-aligned supervision encourages the agent to move towards the nearest waypoint on the language-aligned path at every step and hence supervises the agent to better follow instructions (see Figure 1). In the discrete nav-graph setting, Jain et al. (2019) interleave behavioral cloning and policy gradient training, with a sparse 'fidelity-oriented reward' based on how well each node is covered on the reference path. In contrast, we tackle the VLN-CE setting and propose a simple and effective approach that provides a denser supervisory signal leading the agent to the reference path. A dense supervisory signal is especially important for VLN-CE, where episodes have a longer average length of 55.88 steps vs. 4-6 nodes in (discrete) VLN. To this end, we conduct experiments investigating the effect of the density of waypoint supervision on task performance.
To assess task performance, we complement the commonly employed normalized Dynamic Time Warping (nDTW) metric (Ilharco et al., 2019) with an intuitive Waypoint Accuracy metric. Finally, to provide qualitative insights into the degree to which instructions are followed, we combine language-aligned waypoints with information about sub-instructions. Our experiments show that our language-aligned supervision trains agents to follow instructions more closely than goal-oriented supervision.

Related Work
Vision-and-Language Navigation. Since the introduction of the VLN task by Anderson et al. (2018b), there has been a line of work exploring improved models and datasets. The original Room-to-Room (R2R) dataset by Anderson et al. (2018b) provided instructions on a discrete navigation graph (nav-graph), with nodes corresponding to positions of panoramic cameras. Much work focuses on this discrete nav-graph setting, including cross-modal grounding between language instructions and visual observations (Wang et al., 2019), addition of auxiliary progress monitoring (Ma et al., 2019), augmenting training data by re-generating language instructions from trajectories (Fried et al., 2018), and environmental dropout (Tan et al., 2019).
However, these methods fail to achieve similar performance on the more challenging VLN-CE task, where the agent navigates in a continuous 3D simulation environment. Chen et al. (2021) propose a modular approach using topological environment maps for VLN-CE and achieve better results. In concurrent work, Krantz et al. (2021) propose a modular approach that predicts waypoints on a panoramic observation space and uses a low-level control module to navigate. However, both of these works focus on improving the agent's ability to reach the goal. In this work, we focus on the VLN-CE task and on accurately following the path specified by the instruction.

Instruction Following in VLN. Work in the discrete nav-graph VLN setting has also focused on improving the agent's adherence to given instructions. Anderson et al. (2019) adopt Bayesian state tracking to model what a hypothetical human demonstrator would do when given the instruction, whereas Qi et al. (2020) attend to specific objects and actions mentioned in the instruction. Zhu et al. (2020) train the agent to follow shorter instructions and later generalize to longer instructions through a curriculum-based reinforcement learning approach. Hong et al. (2020) divide language instructions into shorter sub-instructions and enforce a sequential traversal through those sub-instructions. They additionally enrich the Room-to-Room (R2R) dataset (Anderson et al., 2018b) with the sub-instruction-to-sub-path mapping and introduce the Fine-Grained R2R (FG-R2R) dataset.
More closely related to our work is Jain et al. (2019), which introduced a new metric, Coverage weighted by Length Score (CLS), measuring the coverage of the reference path by the agent, and used it as a sparse fidelity-oriented reward for training. However, our work differs from theirs in a number of ways. First, in LAW we explicitly supervise the agent to navigate back to the reference path by dynamically calculating the closest waypoint (on the reference path) for any agent state. In contrast to calculating waypoints, Jain et al. (2019) optimize accumulated rewards based on the CLS metric. Moreover, we provide dense supervision (at every time step) for the agent to follow the reference path by applying a cross-entropy loss at all steps of the episode, in contrast to the single reward at the end of the episode during stage two of their training. Finally, LAW is an online imitation learning approach, which is simpler to implement and easier to optimize compared to their policy gradient formulation, especially with sparse rewards. Similar to Jain et al.

Approach
Our approach is evaluated on the VLN-CE dataset (Krantz et al., 2020), which is generated by adapting R2R to the Habitat simulator (Savva et al., 2019). It consists of navigation episodes with language instructions and reference paths. The reference paths are constructed by taking the discrete nav-graph nodes corresponding to positions of panoramic cameras (we call these pano waypoints, shown as gray circles in Figure 2 top), and taking the shortest geodesic path between them to create a ground-truth reference path consisting of dense waypoints (step waypoints, see dashed path in Figure 2) corresponding to an agent step size of 0.25m. We take waypoints from these paths as language-aligned waypoints (LAW) to supervise our agent, in contrast to the goal-oriented supervision in prior work. We interpret our model performance qualitatively, and examine episodes for which the ground-truth language-aligned path (LAW step) does not match the goal-oriented shortest path (shortest).

Figure 2: The model predicts an action at each step; we optimize it using language-aligned supervision, which brings the agent back on the path toward the next waypoint.

Task. The agent is given a natural language instruction, and at each time step t, the agent observes the environment through an RGBD image I_t with a 90° field of view, and takes one of four actions from A: {Forward, Left, Right, Stop}. Left and Right turn the agent by 15° and Forward moves it forward by 0.25m. The Stop action indicates that the agent has reached within a threshold distance of the goal.

Model. We adapt the Cross-Modal Attention (CMA) model (see Figure 2), which has been shown to perform well on VLN-CE. It consists of two recurrent networks, one encoding a history of the agent state, and another predicting actions based on the attended visual and instruction features (see supplement for details).

Training. We follow the training regime of VLN-CE. It involves two stages: behavior cloning (with teacher forcing) on the larger augmented dataset to train an initial policy, and then fine-tuning with DAgger (Ross et al., 2011). DAgger trains the model on an aggregated set of all past trajectories, sampling actions from the agent policy. Rather than supervising with the conventional goal-oriented sensor, we supervise with a language-aligned sensor in both the teacher-forcing phase and the DAgger phase. The language-aligned sensor helps bring the agent back on the path to the next waypoint if it wanders off the path (see Figure 2 top).
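The DAgger phase can be sketched as a rollout that mixes teacher and policy actions but always records the language-aligned teacher label for later behavior cloning. The `episode`, `policy`, and `teacher` interfaces below are illustrative stand-ins, not the VLN-CE codebase API:

```python
import random

def dagger_rollout(episode, policy, teacher, dataset, beta, max_steps=500):
    """One DAgger episode: act with the teacher with probability beta,
    otherwise with the current policy, but always label the visited state
    with the language-aligned teacher action and aggregate the pair into
    the replay dataset for behavior cloning. Names are illustrative."""
    obs, done, steps = episode.reset(), False, 0
    while not done and steps < max_steps:
        a_star = teacher(obs)                       # language-aligned label
        a = a_star if random.random() < beta else policy(obs)
        dataset.append((obs, a_star))               # store the teacher label, not `a`
        obs, done = episode.step(a)
        steps += 1
    return dataset
```

With beta annealed toward 0 over rounds, the aggregated dataset increasingly covers states visited by the learned policy, including 'off the path' states, which is exactly where the language-aligned labels differ from goal-oriented ones.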
The training dataset D = {(S^(i), W^(i))} consists of instructions S^(i) and reference paths W^(i). For each episode (S, W) ∼ D with agent starting state x_0, we use a cross-entropy loss to maximize the log-likelihood of the ground-truth action a* at each time step t:

L = − Σ_t e_{a*} · log p_t

Here, p_t is the predicted action distribution at step t, x_t is the 3D position of the agent at time t, and e_{a*} is the one-hot vector for the ground-truth action a*, which is defined as a* = g(x_t, φ(x_t, W)). The set of language-aligned waypoints is W = {w_1, ..., w_m}. The waypoint in W nearest to a 3D position x_t is given by φ(x_t, W), and the best action based on the shortest path from x_t to a waypoint w is denoted g(x_t, w).
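The supervision target a* = g(x_t, φ(x_t, W)) can be sketched directly. The `follower` object below stands in for a geodesic path follower (Habitat provides one); it and the other names are illustrative, not the authors' implementation:

```python
import numpy as np

def nearest_waypoint(x_t, waypoints):
    """phi(x_t, W): the reference-path waypoint closest to the agent."""
    W = np.asarray(waypoints, dtype=float)
    dists = np.linalg.norm(W - np.asarray(x_t, dtype=float), axis=1)
    return W[np.argmin(dists)]

def teacher_action(x_t, waypoints, follower):
    """a* = g(x_t, phi(x_t, W)): best shortest-path action toward the
    nearest waypoint. `follower` is any object mapping a target position
    to a discrete action in {FORWARD, LEFT, RIGHT, STOP}."""
    return follower.get_next_action(nearest_waypoint(x_t, waypoints))
```

Because φ is recomputed at every step, an agent that wanders off the path is always pulled back toward the closest point on the language-aligned path rather than toward the goal.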

Experiments
Dataset. We base our work on the VLN-CE dataset (Krantz et al., 2020). The dataset contains 4475 trajectories from Matterport3D (Chang et al., 2017). Each trajectory is described by multiple natural language instructions. The dataset also contains ∼150k augmented trajectories generated by Tan et al. (2019) adapted to VLN-CE.
To qualitatively analyze our model behavior, we use the Fine-Grained R2R (FG-R2R) dataset from Hong et al. (2020). It segments instructions into sub-instructions and maps each sub-instruction to a corresponding sub-path.

Evaluation Metrics. We adopt standard metrics used by prior work (Anderson et al., 2018b,a; Krantz et al., 2020). In the main paper, we report Success Rate (SR), Success weighted by inverse Path Length (SPL), normalized Dynamic Time Warping (nDTW), and Success weighted by nDTW (SDTW). Trajectory Length (TL), Navigation Error (NE), and Oracle Success Rate (OS) are reported in the supplement. Since none of the existing metrics directly measure how effectively waypoints are visited by the agent, we introduce the Waypoint Accuracy (WA) metric. It measures the fraction of waypoints the agent is able to visit correctly (specifically, within 0.5m of the waypoint). This allows the community to intuitively analyze the agent trajectory, as we illustrate in Figure 4.

Figure 3: Agent performance binned by nDTW value of reference path to shortest path (95% CI error bars) shows that LAW pano performs better than goal, especially on lower-range nDTW episodes. This indicates that language-aligned supervision is better suited for the instruction following task.

Implementation Details. We implement our agents using PyTorch (Paszke et al., 2019) and the Habitat simulator (Savva et al., 2019). We build our code on top of the VLN-CE codebase and use the same set of hyper-parameters as the VLN-CE paper. The first phase of training with teacher forcing on the 150k augmented trajectories took ∼60 hours, while the second phase of training with DAgger on the original 4475 trajectories took ∼36 hours on two NVIDIA V100 GPUs.

Ablations. We study the effect of varying the density of language-aligned waypoints on model performance. For all the ablations we use the CMA model described in Section 3, and use LAW# to distinguish among the ablations. On one end of the density spectrum is the base model supervised with only the goal (LAW#1, or goal). On the other end is LAW step, which refers to the pre-computed dense path from the VLN-CE dataset and can be thought of as the densest supervision available to the agent. In the middle of the spectrum is LAW pano, which uses the navigational nodes (an average of 6 nodes) from the R2R dataset. We also sample equidistant points on the language-aligned path to obtain LAW#2, LAW#4, and LAW#15, containing two, four, and fifteen waypoints, respectively. The intuition is that LAW pano takes the agent back to the language-aligned path some distance ahead of its position, while LAW step brings it directly back to the path.

Figure 4: The agent learns to follow instructions better when supervised with the language-aligned path (right) than with the goal-oriented path (left). This is reflected in higher nDTW and Waypoint Accuracy (WA) metrics. Note that WA can be intuitively visualized and interpreted. We also show the mapping of sub-instructions to waypoints utilizing FG-R2R for this episode.

Quantitative Results. In Table 1, the model supervised with the language-aligned
path performs better than the base model supervised with the goal-oriented path, across all metrics in both validation seen and unseen environments. We observe the same trend in the Waypoint Accuracy (WA) metric that we introduce. Table 2 shows that agents perform similarly even when we vary the number of waypoints for language-aligned path supervision, since all of them essentially follow the same path. This could be due to the relatively short trajectory length in the R2R dataset (an average of 10m) making LAW pano denser than needed for the instructions. To check this, we analyze the sub-instruction data and find that one sub-instruction (e.g., 'Climb up the stairs') often maps to several pano waypoints, suggesting that fewer waypoints are sufficient to specify the language-aligned path. For such paths, we find that LAW#4 is better than LAW pano (see supplement for details).

Figure 3 further analyzes model performance by grouping episodes based on the similarity between the goal-oriented shortest path and the language-aligned path in the ground-truth trajectories (measured by nDTW). We find that the LAW model performs better than the goal-oriented model, especially on episodes with dissimilar paths (lower nDTW), across both the nDTW and Waypoint Accuracy metrics.

Table 2: Varying density of language-aligned supervision from very sparse (#2) to dense (step). With varying density of the language-aligned waypoint supervision, the agent performs similarly, since all variants essentially follow the same path.

Qualitative Analysis. To interpret model performance concretely with respect to path alignment, we use the FG-R2R data, which contains a mapping between sub-instructions and waypoints. Figure 4 contrasts the agent trajectories of the LAW pano and goal-oriented agents on an unseen scene. We observe that the path taken by the LAW agent conforms more closely to the instructions (also indicated by higher nDTW). We present more examples in the supplement.

Additional Experiments. We additionally experiment with mixing goal-oriented and language-oriented losses during training, but observe that the mixtures fail to outperform the LAW pano model. The best-performing mixture model achieves 53% nDTW in unseen environments, compared to 54% nDTW for LAW pano (see supplement). Moreover, we perform a set of experiments on the recently introduced VLN-CE RxR dataset and observe that language-aligned supervision is better than goal-oriented supervision on this dataset as well, with LAW step showing a 6% increase in WA and a 2% increase in nDTW over goal on the unseen environment. We defer the implementation details and results to the supplement.

Conclusion
We show that instruction following in the VLN task can be improved using language-aligned supervision instead of the goal-oriented supervision commonly employed in prior work. Our quantitative and qualitative results demonstrate the benefit of LAW supervision. The Waypoint Accuracy metric we introduce also makes it easier to interpret how agent navigation corresponds to following sub-instructions in the input natural language. We believe our results show that LAW is a simple but useful strategy for improving VLN-CE.

A.1 Glossary
Some commonly used terms in this work are described here:
• LAW refers to language-aligned waypoints, chosen so that the navigation path aligns with the language instruction.
• nav-graph refers to the discrete navigation graph of a scene.
• pano refers to the reference paths constructed by taking the discrete nav-graph nodes corresponding to positions of panoramic cameras in the R2R dataset.
• step refers to the reference paths constructed by taking the shortest geodesic path between consecutive pano waypoints to create dense waypoints corresponding to an agent step size of 0.25m.
• 'shortest' refers to the goal-oriented path, i.e., the shortest path to the goal.
• goal refers to the model supervised with only the goal.
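The step paths above can be illustrated by densifying pano waypoints at the agent step size. This sketch uses straight-line interpolation for clarity, whereas VLN-CE follows geodesic (navigable) paths through the scene:

```python
import numpy as np

def densify(pano_waypoints, step=0.25):
    """Interpolate step waypoints between consecutive pano waypoints.
    Straight-line interpolation here; VLN-CE uses geodesic paths."""
    pts = [np.asarray(pano_waypoints[0], dtype=float)]
    for a, b in zip(pano_waypoints[:-1], pano_waypoints[1:]):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        d = np.linalg.norm(b - a)
        n = max(int(np.ceil(d / step)), 1)  # segments of length <= step
        for i in range(1, n + 1):
            pts.append(a + (b - a) * i / n)
    return np.stack(pts)
```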

A.2 Analysis of VLN-CE R2R path
We analyze the similarity of the VLN-CE R2R reference paths to the shortest paths using nDTW. We find that ∼6% of episodes (including training and validation splits) have nDTW(shortest, LAW step) < 0.8. Figure 5 shows the distribution of nDTW of the ground-truth trajectories (LAW step) against the shortest path (goal-oriented action sensor) and LAW pano (language-aligned action sensor). It shows that the two distributions are different and that the language-aligned sensor stays much closer to the ground-truth trajectories. Figure 6 shows the percentage of unseen episodes binned by nDTW value of reference path to shortest path, which helps us analyze our model performance as shown in Figure 3 (main paper). Additionally, we visualize a few such paths to see how dissimilar they are in Figure 7.
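The nDTW comparison above can be sketched as a direct O(nm) dynamic program; the normalization follows Ilharco et al. (2019), with the success threshold d_th = 3m:

```python
import numpy as np

def ndtw(query, reference, d_th=3.0):
    """Normalized Dynamic Time Warping between two paths: DTW cost with
    Euclidean point distances, normalized by len(reference) * d_th and
    mapped through exp(-x) into [0, 1] (1 = identical paths)."""
    q, r = np.asarray(query, dtype=float), np.asarray(reference, dtype=float)
    n, m = len(q), len(r)
    dtw = np.full((n + 1, m + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(q[i - 1] - r[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return float(np.exp(-dtw[n, m] / (m * d_th)))
```

For example, the ∼6% figure above corresponds to episodes where `ndtw(shortest_path, law_step_path) < 0.8`.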

A.3 CMA Model
The Cross-Modal Attention (CMA) model takes the input RGB and depth observations and encodes them using a ResNet50 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) and a modified ResNet50 trained on point-goal navigation (Wijmans et al., 2020), respectively. It also takes as input the GloVe (Pennington et al., 2014) embeddings for the tokenized words in the language instruction and passes them through a bi-directional LSTM to obtain their feature representations. The CMA model consists of two recurrent (GRU) networks. The first GRU encodes a history of the agent state, which is then used to generate attended instruction features. These attended instruction features are in turn used to generate visual attention. The second GRU takes in all the features generated thus far to predict an action. The attention used here is scaled dot-product attention (Vaswani et al., 2017).
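The wiring described above can be sketched schematically in PyTorch. All dimensions, layer names, and the mean-pooling of visual features are illustrative simplifications, not the actual VLN-CE implementation:

```python
import torch
import torch.nn as nn

class CMASketch(nn.Module):
    """Schematic Cross-Modal Attention model: state GRU -> attended
    instruction features -> attended visual features -> action GRU."""
    def __init__(self, vis_dim=256, instr_dim=256, hid=512, n_actions=4):
        super().__init__()
        self.state_gru = nn.GRUCell(vis_dim, hid)
        self.q_instr = nn.Linear(hid, instr_dim)   # query over instruction
        self.q_vis = nn.Linear(instr_dim, vis_dim)  # query over vision
        self.action_gru = nn.GRUCell(instr_dim + vis_dim + hid, hid)
        self.head = nn.Linear(hid, n_actions)

    def attend(self, q, kv):
        # scaled dot-product attention: q (B, D), kv (B, T, D) -> (B, D)
        w = torch.softmax((kv @ q.unsqueeze(-1)).squeeze(-1) / kv.size(-1) ** 0.5, dim=-1)
        return (w.unsqueeze(-1) * kv).sum(1)

    def forward(self, vis_feats, instr_feats, h1, h2):
        # vis_feats: (B, N, Dv) spatial grid; instr_feats: (B, T, Di)
        h1 = self.state_gru(vis_feats.mean(1), h1)           # agent-state history
        instr = self.attend(self.q_instr(h1), instr_feats)   # attended instruction
        vis = self.attend(self.q_vis(instr), vis_feats)      # attended vision
        h2 = self.action_gru(torch.cat([instr, vis, h1], dim=-1), h2)
        return self.head(h2), h1, h2                         # action logits
```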

A.4.1 Metrics
We report the full evaluation of the models here on the standard metrics for VLN:
• Trajectory Length (TL): agent trajectory length.
• Navigation Error (NE): distance from agent to goal at episode termination.
• Success Rate (SR): rate of the agent stopping within a threshold distance (around 3 meters) of the goal.
• Oracle Success Rate (OS): rate of the agent reaching within a threshold distance (around 3 meters) of the goal at any point during navigation.
• Success weighted by inverse Path Length (SPL): success weighted by trajectory length relative to the shortest-path trajectory between start and goal.
• Normalized dynamic-time warping (nDTW): evaluates how well the agent trajectory matches the ground-truth trajectory.
• Success weighted by nDTW (SDTW): nDTW, but calculated only for successful episodes.
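The Waypoint Accuracy metric introduced in the main paper admits a direct sketch (function and argument names are illustrative):

```python
import numpy as np

def waypoint_accuracy(trajectory, waypoints, threshold=0.5):
    """Fraction of reference waypoints the agent passed within `threshold`
    meters of at some point along its trajectory (WA@0.5m by default;
    WA@1.0m uses threshold=1.0)."""
    traj = np.asarray(trajectory, dtype=float)
    hits = 0
    for w in np.asarray(waypoints, dtype=float):
        if np.min(np.linalg.norm(traj - w, axis=1)) <= threshold:
            hits += 1
    return hits / len(waypoints)
```

Unlike nDTW, the result reads directly as "the agent visited this fraction of the waypoints," which is what makes it easy to line up with sub-instructions.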

A.4.2 Quantitative Results
We observe that the models in the ablation (LAW #2 to LAW step in Table 3) perform similarly, which could be due to the fact that the average trajectory length in the R2R dataset is around 10m, so LAW pano is denser than the agent needs to follow instructions. We analyze this using the sub-instruction data and find that one sub-instruction often maps to several pano waypoints, so the language-aligned path can be explained via fewer waypoints. We show some such examples from the dataset in Figure 8. We also report the results on the R2R test split in Table 4, which shows that LAW pano performs better on OS, while performing similarly to goal on the SR and SPL metrics.

Figure 7: There are many episodes for which the goal-oriented shortest path, as generated by the goal-oriented action sensor (top), does not match the language-aligned path. We mitigate this problem by using the language-aligned action sensor (bottom).

Table 3: The LAW pano model supervised with language-aligned waypoints performs better than the same model supervised with the goal-oriented path, i.e., the shortest path to the goal. All models supervised with the language-aligned path, but with varying density, perform similarly.

Table 5 shows a qualitative interpretation of some R2R unseen episodes for the two models, goal and LAW pano, along with the sub-instruction data from the FG-R2R dataset. We see that LAW pano gets more waypoints (and hence sub-instructions) correct than the goal model. We report the Waypoint Accuracy metric at threshold distances of 0.5m and 1.0m. This also shows that Waypoint Accuracy is more intuitive than nDTW for interpreting what fraction of waypoints the agent is able to predict correctly.
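The LAW#k variants in the ablation can be generated by sampling equidistant points along the dense step path. A sketch using arc-length spacing (names illustrative):

```python
import numpy as np

def sample_equidistant(path, k):
    """Pick k supervision waypoints spaced evenly by arc length along the
    dense step-waypoint path (e.g. k = 2, 4, 15 for LAW#2/#4/#15).
    The path start is excluded; the final waypoint is the goal."""
    pts = np.asarray(path, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # segment lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])        # arc length at each point
    targets = np.linspace(0.0, cum[-1], k + 1)[1:]       # k evenly spaced distances
    idx = np.minimum(np.searchsorted(cum, targets), len(pts) - 1)
    return pts[idx]
```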
A.5 Mixing goal-oriented and language-oriented losses

We experiment with mixing the goal-oriented loss (G) and the language-oriented loss (L) during training to further understand the contribution of language-oriented supervision. We pre-trained with G using teacher forcing and then fine-tuned with (a) only L, (b) L+G, and (c) randomly chosen L or G, using DAgger. The results reported in Table 6 show that none of these models outperform LAW pano, indicating that training with mixed losses fails to perform as well as training with only the language-oriented loss.

Table 5: CMA + LAW pano (right) correctly predicts more sub-instructions compared to CMA + goal (left). The mapping between sub-instructions and waypoints is indicated by start and end waypoint indices. Green and red indicate correct and incorrect predictions, respectively. WA@0.5m and WA@1.0m indicate Waypoint Accuracy measured at threshold distances of 0.5m and 1.0m from the waypoint.

A.6 Evaluation on VLN-CE RxR dataset
Dataset. Beyond the R2R dataset, there exist VLN datasets where the aim is for the language-aligned path to not be the shortest path. One example is the Room-Across-Room (RxR) dataset, which consists of new trajectories designed to not match the shortest path between start and goal. Importantly, these trajectories do not have a bias on the path length itself. The RxR dataset is 10x larger than R2R and consists of longer trajectories, with instructions in three languages: English, Hindi, and Telugu. Both the R4R and RxR datasets are in the discrete nav-graph setting. However, the RxR dataset has recently been ported to the continuous state space of VLN-CE for the RxR-Habitat challenge at CVPR 2021. We experiment on VLN-CE RxR to further investigate whether language-aligned supervision is better than goal-oriented supervision on a dataset other than R2R.
Model. We build our experiments on the model architecture provided for the VLN-CE RxR Habitat challenge. The only difference in the CMA architecture for VLN-CE RxR from the one used in VLN-CE is that it uses pre-computed BERT features for the language instructions instead of GloVe embeddings.

Table 6: Models trained with a mixture of goal-oriented (G) and language-oriented (L) supervision underperform the model trained with only our language-oriented loss.

Table 7: Experiments on the recently released RxR-Habitat benchmark (English language split) show that LAW methods outperform goal, with LAW step showing a 6% increase in WA and a 2% increase in nDTW over goal on the unseen environment. This indicates that our idea of language-aligned supervision is useful beyond R2R.
Figure 8: Example sub-instructions ('Go straight past the pool. Walk between the bar and chairs. Stop when you get to the corner of the bar. That's where you will wait.') that map to many R2R pano waypoints but are explainable with fewer waypoints.