On the Evaluation of Vision-and-Language Navigation Instructions

Vision-and-Language Navigation wayfinding agents can be enhanced by exploiting automatically generated navigation instructions. However, existing instruction generators have not been comprehensively evaluated, and the automatic evaluation metrics used to develop them have not been validated. Using human wayfinders, we show that these generators perform on par with or only slightly better than a template-based generator and far worse than human instructors. Furthermore, we discover that BLEU, ROUGE, METEOR and CIDEr are ineffective for evaluating grounded navigation instructions. To improve instruction evaluation, we propose an instruction-trajectory compatibility model that operates without reference instructions. Our model shows the highest correlation with human wayfinding outcomes when scoring individual instructions. For ranking instruction generation systems, if reference instructions are available we recommend using SPICE.


Introduction
Generating route instructions is a long-studied problem with clear practical applications (Richter and Klippel, 2005). Whereas earlier work sought to create instructions for human wayfinders, recent work has focused on using instruction-generation models to improve the performance of agents that follow instructions given by people. In the context of Vision-and-Language Navigation (VLN) datasets such as Room-to-Room (R2R) (Anderson et al., 2018b), models for generating navigation instructions have improved agents' wayfinding performance in at least two ways: (1) by synthesizing new instructions for data augmentation (Fried et al., 2018; Tan et al., 2019), and (2) by fulfilling the role of a probabilistic speaker in a pragmatic reasoning setting (Fried et al., 2018). Such data augmentation is so effective that it is nearly ubiquitous in the best performing agents (Li et al., 2019).

Figure 1: Proposed dual encoder instruction-trajectory compatibility model. Navigation instructions and trajectories (sequences of panoramic images and view angles) are projected into a shared latent space. The independence between the encoders facilitates learning using both contrastive and classification losses.
To make further advances in the generation of visually-grounded navigation instructions, accurate evaluation of the generated text is essential. However, the performance of existing instruction generators has not yet been evaluated using human wayfinders, and the efficacy of the automated evaluation metrics used to develop them has not been established. This paper addresses both gaps.
To establish benchmarks for navigation instruction generation, we evaluate existing English models (Fried et al., 2018; Tan et al., 2019) using human wayfinders. These models are effective for data augmentation, but in human trials they perform on par with or only slightly better than a template-based system, and they are far worse than human instructors. This leaves much headroom for better instruction generation, which may in turn improve agents' wayfinding abilities.
Next, we consider the evaluation of navigation instructions without human wayfinders, a necessary step for future improvements in both grounded instruction generation (itself a challenging and important language generation problem) and agent wayfinding. We propose a model-based approach (Fig. 1) to measure the compatibility of an instruction-trajectory pair without needing reference instructions for evaluation. In training this model, we find that adding contrastive losses in addition to pairwise classification losses improves AUC by 9-10%, that round-trip back-translation improves performance when used to paraphrase positive examples, and that both trajectory and instruction perturbations are useful as hard negatives.
Finally, we compare our compatibility model to common textual evaluation metrics to assess which metric best correlates with the outcomes of human wayfinding attempts. We discover that BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015) are ineffective for evaluating grounded navigation instructions. For system-level evaluations with reference instructions, we recommend SPICE (Anderson et al., 2016). When averaged over many instructions, SPICE correlates with both human wayfinding performance and subjective human judgments of instruction quality. When scoring individual instructions, our compatibility model most closely reflects human wayfinding performance, outperforming BERTScore (Zhang et al., 2020) and VLN agent-based scores. Our results are a timely reminder that textual evaluation metrics should always be validated against human judgments when applied to new domains. We plan to release our trained compatibility model and the instructions and human evaluation data we collected.

Related Work
Navigation Instruction Generation Until recently, most methods for generating navigation instructions were focused on settings in which a system has access to a map representation of the environment, including the locations of objects and named items (e.g., streets and buildings) (Richter and Klippel, 2005). Some generate route instructions interactively given the current position and goal location (Dräger and Koller, 2012), while others provide in-advance instructions that must be more robust to possible misinterpretation (Roth and Frank, 2010; Mast and Wolter, 2013).
Recent work has focused on instruction generation to improve the performance of wayfinding agents. Two instruction generators, Speaker-Follower (Fried et al., 2018) and EnvDrop (Tan et al., 2019), have been widely used for R2R data augmentation. They provide ∼170k new instruction-trajectory pairs sampled from training environments. Both are seq-to-seq models with attention. They take as input a sequence of panoramas grounded in a 3D trajectory, and output a textual instruction intended to describe it.
Vision-and-Language Navigation For VLN, embodied agents in 3D environments must follow natural language instructions to reach prescribed goals. Most recent efforts (e.g., Fu et al., 2019; Wang et al., 2019) have used the Room-to-Room (R2R) dataset (Anderson et al., 2018b), which contains 4675 unique paths in the train split, 340 in the val-seen split (same environments, new paths), and an additional 783 paths in the val-unseen split (new environments, new paths). However, our findings are also relevant for similar datasets such as Touchdown (Chen et al., 2019; Mehta et al., 2020), CVDN (Thomason et al., 2019), REVERIE (Qi et al., 2020), and the multilingual Room-across-Room (RxR) dataset (Ku et al., 2020).
Text Generation Metrics There are many automated metrics that assess textual similarity; we focus on five that are extensively used in the context of image captioning: BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016). More recently, model- and semi-model-based metrics have been proposed. BERTScore (Zhang et al., 2020) takes a semi-model-based approach to compute token-wise similarity using contextual embeddings learned with BERT (Devlin et al., 2019). BLEURT (Sellam et al., 2020) is a fully model-based approach combining large-scale synthetic pretraining and domain-specific finetuning. However, all of the aforementioned metrics are reference-based, and none is specifically designed for assessing navigation instructions associated with 3D trajectories for an embodied agent, which requires not only language-to-vision grounding but also correct sequencing.

Instruction-Trajectory Compatibility Models
Our model builds on that of Huang et al. (2019), but differs in loss (using focal and contrastive losses), input features (adding action and geometry representations), and negative mining strategies (adding instruction perturbations in addition to trajectory perturbations). Compared to the trajectory re-ranking compatibility model proposed by Majumdar et al. (2020), we use a dual encoder architecture rather than dense cross-attention. This facilitates the efficient computation of contrastive losses, which are calculated over all pairs in a minibatch and improve AUC by 10% in our model. We also avoid training on the outputs of the instruction generators (to prevent overfitting to the models we evaluate), and we leave transfer learning (the focus of Majumdar et al. (2020)) to future work.

Human Wayfinding Evaluations
To benchmark the current state-of-the-art for navigation instruction generation, we evaluate the outputs of the Speaker-Follower and EnvDrop models by asking people to follow them. We use instructions for the 340 and 783 trajectories in the R2R val-seen and val-unseen splits, respectively. Both models are trained on the R2R train split and the generated instructions were provided by the respective authors. To contextualize the results, we additionally evaluate instructions from a template-based generator (using ground-truth object annotations), a new set of instructions written by human annotators, and three adversarial perturbations of these human instructions. New navigation instructions and wayfinding evaluations are collected using a lightly modified version of PanGEA (https://github.com/google-research/pangea), an open-source annotation toolkit for panoramic graph environments.
Crafty Crafty is a template-based navigation instruction generator. It observes the trajectory's geometry and nearby ground-truth object annotations, identifies salient objects, and creates English instructions using templates describing movement with respect to the trajectory and objects. See the Appendix for details. Note that Crafty has an advantage over the learned models which rely on panoramic images to identify visual references and do not exploit object annotations.

Human Instructions
We collect 340 new English instructions for the trajectories in the R2R val-seen split using the PanGEA Guide task.

Instruction Perturbations
To quantify the impact of common instruction generator failure modes on instruction following performance, we include three adversarial perturbations of human instructions capturing incorrect direction words, hallucinated objects/landmarks, and repeated or skipped steps. We use Google Cloud NLP to identify named entities and parse dependency trees, and then generate perturbations as follows:

• Direction Swap: Random swapping of directional phrases with alternatives from the same set, with sets as follows: around/left/right, bottom/middle/top, up/down, front/back, above/under, enter/exit, backward/forward, away from/towards, into/out of, inside/outside. Example: "Take a right (left) and wait by the couch outside (inside) the bedroom."

• Entity Swap: Random swapping of entities in an instruction. All noun phrases, excluding a stop list containing any, first, end, front, etc., are considered to be entities. If two entities have the same lemma (e.g., stairs/staircase/stairway) they are considered to be synonyms and are not swapped. Example: "Exit the bedroom (bathroom), turn right, then enter the bathroom (bedroom)."

• Phrase Swap: A random operation on the dependency tree: either remove one sub-sentence tree, duplicate one sub-sentence tree, or shuffle the order of all sentences except the last. Example: "Exit the room using the door on the left. Turn slightly left and go past the round table and chairs. Wait there." - where the first and second sentences are swapped.
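The Direction Swap perturbation can be sketched as follows. This is a simplified illustration, not the paper's implementation: it tokenizes by whitespace and only matches single-word directions (multi-word phrases such as "out of" would need phrase-level matching), whereas the actual pipeline relies on Google Cloud NLP.

```python
import random

# Direction sets mirror those listed in the text; multi-word entries are
# kept for completeness but will not match single whitespace tokens.
DIRECTION_SETS = [
    ["around", "left", "right"],
    ["bottom", "middle", "top"],
    ["up", "down"],
    ["front", "back"],
    ["above", "under"],
    ["enter", "exit"],
    ["backward", "forward"],
    ["into", "out of"],
    ["inside", "outside"],
]

def direction_swap(instruction, rng=random):
    """Replace each directional word with a random alternative from its set."""
    out = []
    for tok in instruction.split():
        swapped = tok
        for dset in DIRECTION_SETS:
            if tok.lower() in dset:
                alternatives = [d for d in dset if d != tok.lower()]
                swapped = rng.choice(alternatives)
                break
        out.append(swapped)
    return " ".join(out)

rng = random.Random(0)
print(direction_swap("Take a right and wait by the couch outside the bedroom", rng))
```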
Wayfinding Task Using the PanGEA Follower task, annotators are presented with a textual navigation instruction and the first-person camera view from the starting pose. They are instructed to attempt to follow the instruction to reach the goal location. Camera controls allow for continuous heading and elevation changes, as well as movement between Matterport3D panoramas based on a navigation graph. Each instruction is evaluated by three different human wayfinders.

People are resourceful and may succeed in following poor quality instructions by expending additional effort. Therefore, we report additional metrics to capture these costs. Quality ↑ is a self-reported measure of instruction quality based on a 1-5 Likert scale. At the end of each task annotators respond to the prompt: Do you think there are mistakes in the instruction? Responses range from Way too many mistakes to follow (1) to No mistakes, very easy to follow (5). Visual Search cost ↓ measures the percentage of the available panoramic visual field that the annotator observes at each viewpoint, based on the pose traces provided by PanGEA and first proposed in the RxR dataset (Ku et al., 2020). Higher values indicate greater effort spent looking for the correct path. We report this separately for the start viewpoint and other viewpoints, since wayfinders typically look around to orient themselves at the start. Time ↓ represents the average time taken in seconds.
Results Table 1 summarizes the results of 11,886 wayfinding attempts using 37 English-speaking annotators. The performance of annotators stays consistent over time and does not show any sign of adaptation. See Appendix for detailed analysis.
As expected, human instructions perform best in human wayfinding evaluations on all path evaluation metrics and on subjective assessments of instruction quality, and they also incur the lowest visual search costs. The only metric not dominated by human instructions is the time taken -which correlates with instruction length, and may be affected by wayfinders giving up when faced with poor quality instructions. Overall, the Speaker-Follower and EnvDrop models are surprisingly weak and noticeably worse than even adversarially perturbed human instructions. Compared to the template-based approach (Crafty), the Speaker-Follower model performs on par and EnvDrop is only slightly better. As a first step to improving existing navigation instruction generators, we focus on developing and evaluating automated metrics that can approximate these human wayfinding evaluations.

Compatibility Model
As an alternative to human evaluations, we train an instruction-trajectory compatibility model to assess both the grounding between textual and visual inputs and the alignment of the two sequences.

Model Structure
Our model is a dual encoder that encodes instructions and trajectories into a shared latent space (Figure 1). The instruction representation h^w is the concatenation of the final output states of a bi-directional LSTM (Schuster and Paliwal, 1997) encoding the instruction tokens W = {w_1, w_2, ..., w_n}. We use contextualized token embeddings from BERT (Devlin et al., 2019) as input to the LSTM. The visual encoder is a two-layer LSTM that processes visual features extracted from a sequence of viewpoints V = {(I_1, p_1), (I_2, p_2), ..., (I_T, p_T)} comprised of panoramic images I_t captured at positions p_t along a 3D trajectory. The vector h^v_t representing the viewpoint at step t is given by:

h^v_t = LSTM(f([ê_pano,t ; e_prev,t ; e_next,t]), h^v_{t-1})

where ê_pano,t is an attention-pooled summary of e_pano,t, the set of 36 visual features representing the panoramic image I_t (discretized into 36 viewing angles by elevation θ and heading φ); e_prev,t and e_next,t are the visual features in the directions of the previous and next viewpoints (v_{t-1} − v_t and v_{t+1} − v_t respectively); and f is a projection layer. Each visual feature is a concatenation of a pre-trained CNN image feature (Juan et al., 2020) with orientation vectors encoding both sine and cosine functions of the absolute and relative angles {θ_abs, φ_abs, θ_rel, φ_rel}. We use standard dot-product attention (Luong et al., 2015) to pool the panoramic features, and define h^v = h^v_T, the final viewpoint embedding in the trajectory. The output of the model is the compatibility score S between an instruction and a trajectory, defined as the cosine similarity between h^v and h^w.
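The dual-encoder scoring step can be sketched numerically. This is a toy illustration, not the paper's architecture: mean pooling stands in for the BiLSTM text encoder and the two-layer trajectory LSTM, and random features stand in for BERT and CNN embeddings. Only the final step — cosine similarity between the two pooled embeddings in the shared space — matches the description above.

```python
import numpy as np

def encode(features):
    """Pool a (steps, dim) feature sequence into a single embedding.
    Mean pooling is a stand-in for the learned encoders."""
    return features.mean(axis=0)

def compatibility_score(instr_feats, traj_feats):
    h_w = encode(instr_feats)   # instruction embedding
    h_v = encode(traj_feats)    # trajectory embedding
    # Compatibility score S: cosine similarity in the shared latent space.
    return float(h_w @ h_v / (np.linalg.norm(h_w) * np.linalg.norm(h_v)))

rng = np.random.default_rng(0)
instr = rng.normal(size=(12, 8))   # 12 tokens, 8-dim embeddings
traj = rng.normal(size=(5, 8))     # 5 viewpoints
s = compatibility_score(instr, traj)
print(round(s, 3))
```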

Hard Negative Mining
To avoid overfitting, our compatibility model is not trained on the outputs of any of the instruction generators that we evaluate. Instead, we use only the relatively small set of positive instruction-trajectory examples from R2R. We use round-trip back-translation to expand the set of positive examples. Unmatched instruction-trajectory pairs from R2R are considered to be negative examples, and we also construct hard negative examples from positive examples by adversarially perturbing both trajectories and instructions.

Instruction Perturbations
We use the same instruction perturbations described in Section 3: Direction Swap, Entity Swap, and Phrase Swap. These perturbations are inspired by typical failure modes in instruction generators and are designed to be hard to recognize without grounding on images and actions along the trajectory. Previous work by Huang et al. (2019) considered only trajectory perturbations. While this encourages the model to recognize incorrect trajectories for a given ground truth instruction, it may not encourage the model to identify a trajectory matched with a poor quality instruction. Our results suggest that instruction perturbations are equally important.
Trajectory Perturbations To perturb trajectories we use the navigation graphs defining connected viewpoints in R2R. Inspired by Huang et al. (2019), we consider Random Walk, Path Reversal, and Viewpoint Swap perturbations:

• Random Walk: The first or last two viewpoints are fixed and the remainder of the trajectory is re-sampled using random edge traversals, subject to the path length remaining within ±1 step of the original. To make the task harder, we avoid revisiting a viewpoint and require the re-sampled trajectory to have at least two overlapping viewpoints with the original.
• Path Reversal: The entire trajectory is reversed while keeping the same viewpoints.
• Viewpoint Swap: A new method we introduce that randomly samples and swaps a viewpoint in a trajectory with a new viewpoint sampled from the neighbors of the adjacent viewpoints in the original trajectory.
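The Viewpoint Swap perturbation can be sketched on a toy navigation graph (an adjacency dict of viewpoint ids). This is an illustrative sketch under assumptions: it swaps one interior viewpoint for a common neighbor of its adjacent viewpoints, requires the replacement to be new to the path, and returns None when no valid swap exists; the paper's exact sampling details may differ.

```python
import random

def viewpoint_swap(path, graph, rng=random):
    """Swap a random interior viewpoint for a neighbor of its adjacent
    viewpoints, keeping the trajectory connected in the graph."""
    i = rng.randrange(1, len(path) - 1)            # pick an interior viewpoint
    candidates = set(graph[path[i - 1]]) & set(graph[path[i + 1]])
    candidates -= set(path)                        # must be a new viewpoint
    if not candidates:
        return None                                # no valid swap at this index
    new_path = list(path)
    new_path[i] = rng.choice(sorted(candidates))   # sorted for determinism
    return new_path

# Toy navigation graph and trajectory.
graph = {
    "a": ["b", "x"], "b": ["a", "c", "x"], "c": ["b", "d", "x"],
    "d": ["c"], "x": ["a", "b", "c"],
}
path = ["a", "b", "c", "d"]
print(viewpoint_swap(path, graph, random.Random(0)))
```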
Paraphrases To expand the 14k positive examples from the R2R train set and balance the positive-to-negative ratio, we paraphrase instructions via round-trip back-translation. We use Google Translate with the following ten intermediate languages: ar, es, de, fr, hi, it, pt, ru, tr, and zh. To exclude low quality or nearly duplicate instructions, we filter out paraphrased instructions outside the BLEU score range of [0.25, 0.7] compared to the original. Overall we have a total of 110,601 positive instruction-trajectory pairs in the training set, which contains 4675 unique trajectories.
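The band-pass filtering step — rejecting paraphrases that are either too different or nearly identical — can be sketched as follows. A simple unigram-F1 overlap score stands in for BLEU here (an assumption of this sketch; the paper filters on BLEU against the original instruction), but the keep/reject logic is the same.

```python
from collections import Counter

def overlap_score(candidate, reference):
    """Unigram F1 overlap, used here as a stand-in for sentence BLEU."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    common = sum((c & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(c.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def keep_paraphrase(candidate, reference, lo=0.25, hi=0.7):
    """Keep paraphrases that are similar but not near-duplicates."""
    return lo <= overlap_score(candidate, reference) <= hi

orig = "walk past the couch and stop at the stairs"
print(keep_paraphrase("walk past the sofa and halt at the staircase", orig))
print(keep_paraphrase(orig, orig))  # exact duplicate is filtered out
```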

Loss Functions
During training, each minibatch is constructed with N matching instruction-trajectory pairs, which may be perturbed. We define M ∈ {0, 1}^N as the vector indicating unperturbed pairs. A compatibility matrix S ∈ R^{N×N} is defined such that S_i,j is the cosine similarity score between instruction i and trajectory j determined by our model. We use both binary classification loss functions, defined on the diagonal elements of S, and a contrastive loss defined on S's rows and columns. Contrastive losses are commonly used for retrieval and representation learning, and in our case exploit all random instruction-trajectory pairs in a minibatch. Each loss requires a separate normalization. For the classification loss we compute the probability of a match p_i,j, such that p_i,j = σ(aS_i,j + b), where a and b are learned scalars and σ is the sigmoid function. For the classification loss L_cls we consider both binary cross entropy loss L_CE, and focal loss (Lin et al., 2017) given by L_FL = (1 − p_i,j)^γ L_CE, where we set γ = 2.

Table 2: The best models are selected based on the validation set (column 3), and we report the final test performance in column 4. To understand the performance of individual perturbation methods, we also report the best AUCs for each of the six perturbations in columns 5-10.
For the contrastive loss we compute logits by scaling S with a learned scalar temperature τ. The contrastive loss L_C(S), calculated over the rows and columns of S, is given by:

L_C(S) = −(1/N) Σ_i M_i [ log( exp(S_i,i/τ) / Σ_j exp(S_i,j/τ) ) + log( exp(S_i,i/τ) / Σ_j exp(S_j,i/τ) ) ]

If M_i = 0, the diagonal element is a perturbed pair and is not considered to be a match; otherwise, the diagonal element S_i,i is treated as the positive class for its row and column. The final loss is the combination:

L = L_cls + β L_C(S)

where L_cls is the classification loss, either L_CE or L_FL, and we set β = 1.
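A numerical sketch of the combined loss under the notation above, with the learned scalars a, b and the temperature τ fixed to illustrative values (an assumption of this sketch; in the model they are trained):

```python
import numpy as np

def log_softmax(x, axis):
    z = x - x.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def combined_loss(S, M, a=5.0, b=0.0, tau=0.1, gamma=2.0, beta=1.0):
    # Focal classification loss on the diagonal: label 1 iff M_i = 1.
    p = 1.0 / (1.0 + np.exp(-(a * np.diag(S) + b)))
    p_true = np.where(M == 1, p, 1.0 - p)
    focal = -((1.0 - p_true) ** gamma) * np.log(p_true)
    # Contrastive loss over rows and columns of S; perturbed diagonal
    # entries (M_i = 0) are excluded from the positive terms.
    rows = log_softmax(S / tau, axis=1)
    cols = log_softmax(S / tau, axis=0)
    contrastive = -(M * (np.diag(rows) + np.diag(cols))).sum() / max(M.sum(), 1)
    return focal.mean() + beta * contrastive

S = np.array([[0.9, 0.1], [0.2, 0.8]])   # well-separated similarity matrix
M = np.array([1.0, 1.0])                 # both pairs unperturbed
print(round(combined_loss(S, M), 4))
```

Well-separated similarity matrices (large diagonal, small off-diagonal) yield a lower loss than confusable ones, which is the behavior the contrastive term is meant to enforce.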

Sampling hyperparameters
We sample positive and negative examples equally with a mix ratio of 2:1:1 for ground truth, instruction perturbations, and trajectory perturbations, respectively. For each perturbation type, we sample the three methods with equal probability.
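The sampling scheme above can be sketched as follows; the source and method names are labels invented for this illustration, not identifiers from the paper's code.

```python
import random

SOURCES = ["ground_truth", "instruction_perturbation", "trajectory_perturbation"]
WEIGHTS = [2, 1, 1]   # the 2:1:1 mix ratio from the text
INSTR_METHODS = ["direction_swap", "entity_swap", "phrase_swap"]
TRAJ_METHODS = ["random_walk", "path_reversal", "viewpoint_swap"]

def sample_example(rng=random):
    """Draw a training example source, and a method within each
    perturbation type with equal probability."""
    source = rng.choices(SOURCES, weights=WEIGHTS, k=1)[0]
    if source == "instruction_perturbation":
        return source, rng.choice(INSTR_METHODS)
    if source == "trajectory_perturbation":
        return source, rng.choice(TRAJ_METHODS)
    return source, None

rng = random.Random(0)
counts = {s: 0 for s in SOURCES}
for _ in range(10000):
    counts[sample_example(rng)[0]] += 1
print(counts)
```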

Experiments
We evaluate our compatibility model against alternative model-based evaluations and standard textual similarity metrics. We report instruction classification results in Section 5.1, improved data augmentation for VLN agents in Section 5.2, and correlation with human wayfinder outcomes in Section 5.3.

Instruction Classification
Evaluation In this setting we use the instruction-trajectory compatibility model to classify high and low quality instructions for trajectories from the R2R val-unseen and val-seen sets. The instruction pool includes 3 high-quality instructions per trajectory from R2R, plus 2 instructions per trajectory from the Speaker-Follower and EnvDrop models. These are considered to be high quality if 2 out of 3 human wayfinders reached the goal (see Section 3), and low quality otherwise. We assess model performance using Area Under the ROC Curve (AUC). We use the val-unseen split (3915 instructions, 75% high quality) for model validation and the val-seen split (1700 instructions, 78% high quality) as test.

Benchmark We compare to the compatibility model proposed by Huang et al. (2019), which computes elementwise similarities between instruction words and trajectory panoramas before pooling, and does not include action embeddings (e_prev,t and e_next,t) or position encodings p_t. In contrast, our model calculates the similarity between instructions and trajectories after pooling each sequence, and includes both action and position encodings.
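AUC has a simple rank-based interpretation that is easy to sketch: the probability that a randomly chosen high-quality instruction receives a higher compatibility score than a randomly chosen low-quality one, counting ties as one half. The toy scores and labels below are illustrative only.

```python
def auc(scores, labels):
    """Rank-based AUC over binary labels (1 = high quality)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]   # compatibility scores
labels = [1, 1, 0, 1, 0]             # human-derived quality labels
print(auc(scores, labels))
```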
Results Table 2 reports classification AUC including comprehensive ablations of loss functions, approaches to hard negative mining, and modeling choices. With regard to the loss function, we find that the combination of contrastive and focal loss (row 6) performs best overall, and that adding contrastive loss provides a very significant 9-10% increase in AUC compared to using just cross-entropy (CE) or focal loss (rows 2 and 3), due to the effective use of in-batch negatives. Adding paraphrased positive instructions and pretrained BERT token embeddings also leads to significant performance gains (rows 7 and 8 vs. row 6). The best performing model on both the validation and test sets uses Contrastive + Focal loss with paraphrased instructions and BERT embeddings, as well as trajectory and instruction perturbations (row 8). This model consistently outperforms the benchmark from prior work (row 1) by a large margin and achieves a test set AUC of 73.7%. In rows 9-17 we ablate the six perturbation methods that we use for hard negative mining. Ablations using only instruction perturbations (row 9), only path perturbations (row 13), or no perturbations at all (row 17) perform considerably worse than our best model (row 8). We also show that no individual perturbation approach is effective on its own. In addition to scores for the validation and test sets, we report AUC for each perturbation method on the val-seen set to investigate their individual performance. Overall, trajectory perturbations get higher scores than instruction perturbations, showing they are easier tasks. Phrase Swap proves the hardest task, while Random Walk is the easiest.

Data Augmentation for VLN
Data augmentation using instructions from the Speaker-Follower and EnvDrop models is pervasive in the training of VLN agents (Li et al., 2019). In this section we evaluate whether our compatibility model can be used to filter out low quality instructions from the augmented training set to improve VLN performance. We score all 170k augmented instruction-trajectory pairs from the Speaker-Follower model and rank them in descending order. We then use different fractions of the ranked data to train VLN agents, and compare with agents trained using random samples of the same size. We use a VLN agent model implemented in VALAN (Lansing et al., 2019), which achieves a success rate (SR) of 45% on the R2R val-unseen split when trained on the R2R train split and all of the Speaker-Follower augmented instructions. Figure 2 indicates that instruction-trajectory pairs selected by our compatibility model consistently outperform random training pairs in terms of the performance of the trained VLN agent. This demonstrates the efficacy of our compatibility model for improving VLN data augmentation by identifying high quality instructions.
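The score-rank-filter procedure can be sketched in a few lines; the pair names and scores below are placeholders for illustration, not real model outputs.

```python
def top_fraction(pairs_with_scores, fraction):
    """Keep the top fraction of (pair, score) tuples by compatibility score."""
    ranked = sorted(pairs_with_scores, key=lambda x: x[1], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return [pair for pair, _ in ranked[:k]]

# Toy augmented pairs with hypothetical compatibility scores.
data = [("pair_a", 0.91), ("pair_b", 0.12), ("pair_c", 0.55), ("pair_d", 0.78)]
print(top_fraction(data, 0.5))
```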

Correlation with Human Wayfinders
In this section we evaluate the correlation between the scores given by our instruction-trajectory compatibility model and the outcomes from the human wayfinding attempts described in Section 3. Using Kendall's τ to assess rank correlation, we report both system-level and instance-level correlation. The instance-level evaluations assess whether the metric can identify the best instruction from two candidates, while the system-level evaluations assess whether a metric can identify the best model from two candidates (after averaging over many instruction scores for each model). The results in Table 3 are reported separately over all 3.9k instructions (9 systems comprising the rows of Table 1), and over model-generated instructions only (4 systems comprising the 2.2k instructions generated by the Speaker-Follower and EnvDrop models on R2R val-seen and val-unseen).
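Kendall's τ measures the normalized difference between concordant and discordant pairs of rankings. The sketch below implements the tie-free τ-a variant (an assumption of this sketch; with tied metric scores a tie-corrected variant such as τ-b would be used in practice).

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # perfectly concordant
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # perfectly discordant
```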
Automatic Metrics For comparison we include standard textual evaluation metrics (BLEU, CIDEr, METEOR, ROUGE and SPICE) and two model-based metrics: BERTScore (Zhang et al., 2020), and scores based on the performance of a trained VLN agent attempting to follow the candidate instruction (Agarwal et al., 2019). Note that only the compatibility model and the VLN agent-based scores use the candidate trajectory; the other metrics are calculated by comparing each candidate instruction to the three reference instructions from R2R (and are thus reliant on reference instructions).
To calculate the standard metrics we use the official evaluation code provided with the COCO captions dataset (Chen et al., 2015). For BERTScore, we use a publicly available uncased BERT model (tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1) with 12 layers and hidden dimension 768, and compute the mean F1 score over the three references. For the VLN agent score, we train three VLN agents from different random initializations using the R2R train set. We then employ the trained agents for the wayfinding task and report performance as either the SPL or SDTW similarity between the path taken by the agent and the reference path, using either a single agent or the average score from three agents.
Results Table 3 compares system-level and instance-level correlations for all metrics, both standard and model-based. At the system level, we see no correlation between standard text metrics such as BLEU, ROUGE, METEOR and CIDEr and human wayfinder performance. The exception is SPICE, which shows the desired negative correlation with NE, and positive correlation with SR, SPL (see Figure 3) and Quality. At the system level, the model-based approaches (BERTScore, agent SPL/SDTW and compatibility) also lack the desired correlation and exhibit wide confidence intervals. Here, it is important to point out that the 9 systems under evaluation include a variety of styles (e.g., Crafty's template-based instructions, different annotator pools, adversarial perturbations) which are dissimilar to the R2R data used to train the VLN agents and the compatibility model. Accordingly, the model-based approaches are unable to reliably rank these out-of-domain systems.

Figure 3: Standard evaluation metrics vs. human wayfinding outcomes (SPL) for 9 navigation instruction generation systems. SPICE is most consistent with human wayfinding outcomes, although no metrics score the Crafty template-based instructions highly.
At the instance-level (when scoring individual instructions) we observe different outcomes. SPICE scores for individual instructions have high variance, and so SPICE does not correlate with wayfinder performance at the instruction level. In contrast, the model-based approaches exhibit the desired correlation, particularly when restricted to the model-generated instructions (Table 3 bottom panel). Our compatibility score shows the strongest correlation among all metrics, performing similarly to an ensemble of three VLN agents.

Conclusion
Generating grounded navigation instructions is one of the most promising directions for improving the performance of VLN wayfinding agents, and a challenging and important language generation task in its own right. In this paper, we show that efforts to improve navigation instruction generators have been hindered by a lack of suitable automatic evaluation metrics. With the exception of SPICE, all the standard textual evaluation metrics we evaluated (BLEU, CIDEr, METEOR and ROUGE) are ineffective, and -perhaps as a result -existing instruction generators have substantial headroom for improvement.
To address this problem, we develop an instruction-trajectory compatibility model that outperforms all existing automatic evaluation metrics on instance-level evaluation without needing any reference instructions -making it suitable for use as a reward function in a reinforcement learning setting, as a discriminator in a Generative Adversarial Network (GAN) (Dai et al., 2017), or for filtering instructions in a data augmentation setting.
Progress in natural language generation (NLG) is increasing the demand for evaluation metrics that can accurately evaluate generated text in a variety of domains. Our findings are a timely reminder that textual evaluation metrics should not be trusted in new domains unless they have been comprehensively validated against human judgments. In the case of grounded navigation instructions, for model selection in the presence of reference instructions we recommend using the SPICE metric. In all other scenarios (e.g., selecting individual instructions, or model selection without reference instructions) we recommend using a learned instruction-trajectory compatibility model.

A Automated metric scores for all instructions
We provide more details about automated metric scores for all instructions in this section. Table 4 gives automated metrics for each model we consider. Generated instructions from EnvDrop and Speaker-Follower are scored the highest, whereas human instructions are scored poorly and on par with perturbed instructions, and Crafty is scored the lowest. These results diverge significantly from human wayfinding performance in Section 3, and highlight the inefficacy of these automated text metrics.

B Crafty Details
We use the data in Matterport3D to build Crafty, a template-based navigation instruction generator that uses a Hidden Markov Model (HMM) to select objects as reference landmarks for wayfinding. Crafty's four main components (Appraiser, Walker, Observer and Talker) are described below.

B.1 Appraiser
The Appraiser scores the interestingness of objects based on the Matterport3D scans in the training set. It treats each panorama as a document and the categories corresponding to objects visible from the panorama as words, and then computes a per-category inverse document frequency (IDF) score.
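The panoramas-as-documents IDF computation can be sketched as follows; the toy panorama contents are invented for illustration. Rare categories (visible from few panoramas) get high scores and are thus preferred as landmarks.

```python
import math

def category_idf(panorama_objects):
    """Per-category IDF where each panorama is a 'document' and each
    visible object category is a 'word'."""
    n = len(panorama_objects)
    idf = {}
    all_categories = set().union(*panorama_objects)
    for cat in all_categories:
        df = sum(1 for objs in panorama_objects if cat in objs)
        idf[cat] = math.log(n / df)
    return idf

# Toy scan: three panoramas and the object categories visible from each.
panos = [{"wall", "door", "couch"}, {"wall", "door"}, {"wall", "piano"}]
idf = category_idf(panos)
print(sorted(idf, key=idf.get, reverse=True)[:2])  # rarest categories rank first
```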

B.2 Walker
The Walker converts a panorama sequence into a motion sequence. Given a path (sequence of connected panoramas) and an initial heading, it calculates the entry heading into each panorama and the exit heading required to transition to the next panorama. For each panorama, all annotated objects that are visible from the location are retrieved.
For each object, we obtain properties such as their category and center, which allows the distance and heading from the panorama center to be computed. From these, the Walker creates a sequence of motion tuples, each of which captures the context of the source panorama and the goal panorama, along with the heading to move from source to goal.

B.3 Observer
The Observer selects an object sequence by generating objects from an HMM that is specially constructed for each environment, characterized by:
• Emissions: how panoramas relate to objects. This is a probability distribution over panoramas for each object, based on the distance between the object and the panoramas.
• Transitions: how looking at one object might shift to another, based on their relative location, the motion at play, and the Appraiser's assessment of their prominence.
The intuition for using an HMM is that we tend to fixate on a given salient object over several steps as we move (high self-transitions); these tend to be nearby (high emission probability for objects near a panorama's center) and connected to the next salient object (biased object-object transitions). To explain a particular observed panorama sequence (path), we can then infer the optimal object sequence using the Viterbi algorithm.
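The Viterbi decoding step can be sketched as follows, assuming log-space emission and transition tables as described above (the argument layout is ours, not from the paper):

```python
def viterbi(path, objects, log_emit, log_trans, log_init):
    """Infer the most likely object fixation sequence for a panorama path.

    path: list of panorama ids (the observations).
    objects: list of candidate object ids (the hidden states).
    log_emit[o][p]: log P(panorama p | fixating object o).
    log_trans[a][b]: log P(next object b | current object a).
    log_init[o]: log prior over the first fixated object.
    """
    # score[o] = best log-probability of any state sequence ending in o
    score = {o: log_init[o] + log_emit[o][path[0]] for o in objects}
    back = []
    for p in path[1:]:
        prev = score
        score, ptr = {}, {}
        for o in objects:
            best = max(objects, key=lambda a: prev[a] + log_trans[a][o])
            score[o] = prev[best] + log_trans[best][o] + log_emit[o][p]
            ptr[o] = best
        back.append(ptr)
    # Trace the best final state back through the pointers.
    last = max(objects, key=lambda o: score[o])
    seq = [last]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))
```

With high self-transition probabilities, the decoded sequence tends to fixate on one object for several steps before switching, matching the intuition above.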

B.4 Talker
Given a motion sequence from the Walker and corresponding object observations from the Observer, the Talker uses a small set of templates to create English instructions for each step. We decompose these into low-level and high-level templates.

B.4.1 Low-level templates
For single-step actions, there are three main things to mention: the movement, the fixated object, and the object's relationship to the agent's position.
MOVE. For movement, we simply generate a set of possible commands for each direction type, where the direction types are defined as in the orientation wheel shown in Figure 4. There are additional direction types for UP and DOWN based on relative pitch (e.g. when the goal panorama is higher or lower than the source). Given one of these heading types, we generate a set of matching phrases appropriate to each.
OBJ. An object's description is its category (e.g. couch, tv, window).
ORIENT. We use the same direction types shown in Figure 4.
When an object is STRAIGHT and BEHIND, we use the phrases ahead of you or in front of you and behind you or in back of you, respectively. For objects to the LEFT or RIGHT, we use two templates DIRECTION PRE DIRECTION and DIRECTION DIRECTION POST, where DIRECTION PRE is selected from [to your, to the, on your, on the] and DIRECTION POST is the phrase of you. This produces to your left, on the right, right of you, and so on. For SLIGHT LEFT and SLIGHT RIGHT, one of [a bit, slightly, a little, just] is added in front (e.g. a bit to your left).
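The orientation templates above can be sketched as a small phrase generator; the function name and the choice of a seedable random source are ours:

```python
import random

# Phrase inventories as described above.
DIRECTION_PRE = ["to your", "to the", "on your", "on the"]
SLIGHT_MODIFIERS = ["a bit", "slightly", "a little", "just"]

def orient_phrase(direction, rng=random):
    """Render an orientation phrase for a direction type such as
    STRAIGHT, BEHIND, LEFT, RIGHT, SLIGHT_LEFT, or SLIGHT_RIGHT."""
    if direction == "STRAIGHT":
        return rng.choice(["ahead of you", "in front of you"])
    if direction == "BEHIND":
        return rng.choice(["behind you", "in back of you"])
    side = "left" if "LEFT" in direction else "right"
    # Either DIRECTION_PRE DIRECTION ("to your left") or
    # DIRECTION DIRECTION_POST ("left of you").
    if rng.random() < 0.5:
        phrase = f"{rng.choice(DIRECTION_PRE)} {side}"
    else:
        phrase = f"{side} of you"
    if direction.startswith("SLIGHT"):
        phrase = f"{rng.choice(SLIGHT_MODIFIERS)} {phrase}"
    return phrase
```

Sampling over these inventories yields the variety of phrasings quoted above (to your left, on the right, a bit to your left, and so on).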

B.4.2 High-level templates
Crafty pieces these low-level textual building blocks together to describe actions. In what follows, MOVE, OBJ, and ORIENT indicate the move command, object phrase and orientation phrase, respectively, discussed above.
Single action. We use templates for three situations: the start of a path, a heading change within a panorama (intra), and moving between panoramas (inter).
• Start of path: There are several templates that simply help a wayfinder verify their current position. Ex: you are near a OBJ, ORIENT.
• Intra: These templates include the movement command followed by a verification of the orientation to an object once the movement is complete. Ex: MOVE. a OBJ is ORIENT.
• Inter: These templates capture walking from one panorama to another and provide additional object verification. Ex: MOVE, going along to the OBJ ORIENT.
Multi-step actions. We attempt to reduce verbosity by collapsing actions that involve fixation on the same object. The collapsed actions produce a composite move command, e.g. proceed forward and make a right and go straight.
• Describing the object: To orient with respect to the fixated-upon object, we switch on the direction type between the agent and the object at the last action. Ex: for STRAIGHT, we use heading toward the OBJ and for SLIGHT LEFT/RIGHT, we use approaching the OBJ ORIENT.
The final output is the concatenation of the combined move command and the object orientation phrase.
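A minimal sketch of this multi-step composition, with the function name and direction handling as our own assumptions about the template logic:

```python
def multi_step_phrase(move_cmds, direction, obj):
    """Render a multi-step action: a composite MOVE command joined with
    'and', followed by a phrase orienting toward the fixated object."""
    move = " and ".join(move_cmds)
    if direction == "STRAIGHT":
        tail = f"heading toward the {obj}"
    elif direction in ("SLIGHT_LEFT", "SLIGHT_RIGHT"):
        side = "left" if direction == "SLIGHT_LEFT" else "right"
        tail = f"approaching the {obj} slightly to your {side}"
    else:
        tail = f"passing the {obj}"
    return f"{move}, {tail}"
```

For example, three collapsed forward moves fixated on a curtain straight ahead would render as "proceed forward and make a right and go straight, heading toward the curtain".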
End-of-path instruction templates. The final action is a special situation in that it needs to describe stopping near a salient object. For this, we extract MOVE and OBJ phrases from the last action and use templates such as MOVE and stop by the OBJ.
Full example. Putting it all together, Crafty creates full path instructions such as the following, with the relevant high-level templates indicated:
• (START) there is a lamp when you look a bit to the left. pivot right, so it is in back of you.
• (INTER) walk forward, going along to the curtain in front of you.
• (INTRA) curve left. you should see a tv ahead of you.
• (MULTI-ACTION) go forward and go slightly left and walk straight, passing the curtain to your right.
• (END-OF-PATH) continue forward and stop by the couch.
Crafty's instructions are more verbose than human instructions, but are often easy to follow, provided there are good, visually salient landmarks in the environment to use for orientation.

C Human Rater Performance Over Time
Human raters are excellent at learning and adapting to new problems over time. To understand whether our 37 human raters learn to self-correct the perturbed instructions over time, and whether that affects the quality of our human wayfinding results, we investigate rater performance as a function of time using the sequence of examples each rater evaluates. Figure 5 shows the average human rater performance for all 9 datasets included in Table 1 of Section 3. Due to the binary nature of SR, we average each rater's performance over 50-example bins, and then average the results across all raters for each bin. Figure 5 shows that the average rater performance stays flat within the uncertainties and exhibits no systematic drift over time, indicating no overall self-correction that affects the wayfinding results. For a more granular look at the individual perturbation methods, Figure 6 plots the average human rater performance over time for the three methods: Direction Swap, Entity Swap, and Phrase Swap. Despite greater uncertainties due to the much smaller number of data points per average, the overall human performance for each method still does not drift systematically. These results indicate that our human wayfinding performance results are reliable and robust over time, which we attribute to the shuffling of examples and to the fact that raters cannot tell which instructions are perturbed.
Figure 5 (caption): Human rater performance over time, aggregated over all datasets. We normalize each rater's scores by their mean value over time to remove per-rater performance bias, and average SDTW over 50-example bins due to the discrete nature of success. Left: the mean performance of all raters for each bin; error bars represent the standard deviation of the mean. Right: individual rater performance over time, one line per rater. Despite a few outliers, the overall human rater performance is flat and consistent over time, indicating no self-correction or adaptation to the datasets by human raters.
Figure 6 (caption): Same as Figure 5, but for the instruction perturbations; we use a 15-example bin per rater and aggregate over all raters to get the mean and its uncertainty. The overall human rater performance stays flat and does not drift significantly over time.
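The normalize-then-bin aggregation described above can be sketched as follows (the function name and list-based layout are ours; the paper does not provide this code):

```python
def binned_rater_means(scores_per_rater, bin_size=50):
    """Normalize each rater's scores by that rater's mean (removing
    per-rater performance bias), average over fixed-size bins in
    evaluation order, then average the bins across raters.

    scores_per_rater: one list of scores per rater, in the order the
    rater evaluated examples (e.g. binary success values).
    """
    per_rater = []
    for scores in scores_per_rater:
        mean = sum(scores) / len(scores)  # assumes a nonzero mean
        norm = [s / mean for s in scores]
        bins = [
            sum(norm[i : i + bin_size]) / bin_size
            for i in range(0, len(norm) - bin_size + 1, bin_size)
        ]
        per_rater.append(bins)
    # Align raters on the shortest bin sequence, then average per bin.
    n = min(len(b) for b in per_rater)
    return [sum(b[i] for b in per_rater) / len(per_rater) for i in range(n)]
```

A flat output sequence (no trend across bins) is what indicates the absence of rater self-correction over time.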