Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation

One of the most challenging topics in Natural Language Processing (NLP) is visually-grounded language understanding and reasoning. Outdoor vision-and-language navigation (VLN) is such a task, where an agent follows natural language instructions and navigates in real-life urban environments. Given the lack of human-annotated instructions that illustrate intricate urban scenes, outdoor VLN remains a challenging task to solve. In this paper, we introduce a Multimodal Text Style Transfer (MTST) learning approach and leverage external multimodal resources to mitigate data scarcity in outdoor navigation tasks. We first enrich the navigation data by transferring the style of the instructions generated by the Google Maps API, then pre-train the navigator with the augmented external outdoor navigation dataset. Experimental results show that our MTST learning approach is model-agnostic and that it significantly outperforms the baseline models on the outdoor VLN task, improving the task completion rate on the test set by a relative 8.7%.


Introduction
A key challenge for Artificial Intelligence research is to go beyond static observational data and consider more challenging settings that involve dynamic actions and incremental decision-making processes (Fenton et al., 2020). Outdoor vision-and-language navigation (VLN) is such a task, where an agent navigates in an urban environment by grounding natural language instructions in visual scenes, as illustrated in Fig. 1. To generate a series of correct actions, the navigation agent must comprehend the instructions and reason through the visual environment.

Figure 1: An outdoor VLN example, comparing the instruction generated by the Speaker model with the style-modified instruction generated by our MTST model for the same trajectory.

Different from indoor navigation (Fried et al., 2018; Wang et al., 2019; Ma et al., 2019a; Ma et al., 2019b; Ke et al., 2019), the outdoor navigation task takes place in urban environments that contain diverse street views (Mirowski et al., 2018; Chen et al., 2019; Mehta et al., 2020). The vast urban area leads to a much larger space for an agent to explore and usually contains longer trajectories and a wider range of objects for visual grounding. This requires more informative instructions to address the complex navigation environment. However, it is expensive to collect human-annotated instructions that depict the complicated visual scenes to train a navigation agent. The issue of data scarcity limits the navigator's performance in the outdoor VLN task.
To deal with the data scarcity issue, Fried et al. (2018) proposes a Speaker model to generate additional training pairs. However, synthesizing instructions purely from visual signals is hard, especially for outdoor environments, due to visual complexity.
On the other hand, template-based navigation instructions on the street view can be easily obtained via the Google Maps API, which may serve as additional learning signals to boost outdoor navigation tasks. However, instructions generated by the Google Maps API mainly consist of street names and directions, while human-annotated instructions in the outdoor navigation task frequently refer to street-view objects in the panorama. The distinct instruction style hinders the full utilization of external resources.
Therefore, we present a novel Multimodal Text Style Transfer (MTST) learning approach to narrow the gap between the template-based instructions in the external resources and the human-annotated instructions for the outdoor navigation task. It can infer style-modified instructions for trajectories in the external resources and thus mitigate the data scarcity issue. Our approach injects more visual objects from the navigation environment into the instructions (Fig. 1) while preserving direction guidance. The enriched object-related information can help the navigation agent learn the grounding between the visual environment and the instruction.
Moreover, different from previous LSTM-based navigation agents, we propose a new VLN Transformer to predict outdoor navigation actions. Experimental results show that utilizing external resources provided by the Google Maps API during the pre-training process improves the navigation agent's performance on Touchdown, a dataset for outdoor VLN. In addition, pre-training with the style-modified instructions generated by our multimodal text style transfer model can further improve navigation performance and make the pre-training process more robust. In summary, the contribution of our work is four-fold:

• We present a new Multimodal Text Style Transfer learning approach to generate style-modified instructions for external resources and tackle the data scarcity issue in the outdoor VLN task.
• We provide the Manh-50 dataset with style-modified instructions as an auxiliary dataset for outdoor VLN training.
• We propose a novel VLN Transformer model as the navigation agent for outdoor VLN and validate its effectiveness.
• We improve the task completion rate by 8.7% relatively on the test set for the outdoor VLN task with the VLN Transformer model pre-trained on the external resources processed by our MTST approach.

Related Work
Vision-and-Language Navigation (VLN) is a task that requires an agent to achieve the final goal based on the given instructions in a 3D environment. Besides the generalizability problem studied by previous works (Wang et al., 2019), the data scarcity problem is another critical issue for the VLN task, especially in the outdoor environment (Mehta et al., 2020; Xiang et al., 2020). Fried et al. (2018) obtains a broad set of augmented training data for VLN by sampling trajectories in the navigation environment and using the Speaker model to back-translate their instructions. However, the Speaker model might cause an error propagation issue since it is not trained on large corpora to optimize generalization. While most existing works select navigation actions dynamically along the way in the unseen environment during testing, Majumdar et al. (2020) proposes to test in previously explored environments and convert the VLN task to a classification task over the possible paths. This approach performs well in the indoor setting, but is not suitable for outdoor VLN, where the environment graph is different.

Multimodal Pre-training has attracted much attention for improving performance on multimodal tasks. The models usually adopt the Transformer structure to encode the visual features and the textual features (Chen et al., 2020; Sun et al., 2019; Huang et al., 2020b; Luo et al., 2020; Zheng et al., 2020; Wei et al., 2020; Tsai et al., 2019). During pre-training, these models use tasks such as masked language modeling, masked region modeling, and image-text matching to learn a cross-modal encoding ability, which later benefits the multimodal downstream tasks. Majumdar et al. (2020) proposes to use image-text pairs from the web to pre-train VLN-BERT, a visiolinguistic transformer-based model. A concurrent work proposes to use a Transformer for indoor VLN. Our VLN Transformer differs from their model in several key aspects: (1) The pre-training objectives are different: the concurrent work pre-trains the model on the same dataset used for training, while we create an augmented, stylized dataset for outdoor VLN using the proposed MTST method. (2) Benefiting from the effective external resource, a simple navigation loss is employed in our VLN Transformer, while they adopt masked language modeling to better train their model. (3) Model-wise, instead of encoding the whole instruction into one feature, we use sentence-level encoding, since Touchdown instructions are much longer than R2R instructions. (4) We encode the trajectory history, while their model encodes the panorama for the current step.

Unsupervised Text Style Transfer is an approach to mitigate the lack of parallel data for supervised training. One line of work encodes the text into a latent vector and manipulates the text representation in the latent space to transfer the style.

Task Definition
In the vision-and-language navigation task, the reasoning navigator is asked to find the correct path to reach the target location following the instructions (a set of sentences) X = {s_1, s_2, ..., s_m}. The navigation procedure can be viewed as a series of decision-making processes. At each time step t, the navigation environment presents an image view v_t. With reference to the instruction X and the visual view v_t, the navigator is expected to choose an action a_t ∈ A. The action set A for urban environment navigation usually contains four actions, namely turn left, turn right, go forward, and stop.
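To make the decision process concrete, the following minimal loop sketches one navigation episode under this formulation; the `env` and `agent` interfaces and the action names are illustrative placeholders rather than the Touchdown API.

```python
from typing import List

ACTIONS = ("turn_left", "turn_right", "go_forward", "stop")

def run_episode(agent, env, instruction_sentences: List[str], max_steps: int = 100) -> list:
    """At each step t, observe the view v_t, choose a_t from ACTIONS, and act until 'stop'."""
    trajectory = []
    for t in range(max_steps):
        v_t = env.observe()                                      # current image view v_t
        a_t = agent.act(instruction_sentences, trajectory, v_t)  # decision grounded in X and v_t
        trajectory.append((v_t, a_t))
        if a_t == "stop":
            break
        env.step(a_t)                                            # turn or move to the next panorama
    return trajectory
```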

Overview
Our Multimodal Text Style Transfer (MTST) learning approach mainly consists of two modules, namely the multimodal text style transfer model and the VLN Transformer. Fig. 2 provides an overview of our MTST approach. We use the multimodal text style transfer model to narrow the gap between the human-annotated instructions for the outdoor navigation task and the machine-generated instructions in the external resources. The multimodal text style transfer model is trained on the dataset for outdoor navigation, and it learns to infer style-modified instructions for trajectories in the external resources. The VLN Transformer is the navigation agent that generates actions for the outdoor VLN task. It is trained with a two-stage training pipeline: we first pre-train the VLN Transformer on the external resources with the style-modified instructions and then fine-tune it on the outdoor navigation dataset.
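The two-stage pipeline can be summarized with the following sketch; every argument is a hypothetical callable or dataset object standing in for the components described above, not part of a released interface.

```python
def run_mtst_pipeline(touchdown, manh50, train_style_model, pretrain, finetune, navigator):
    """MTST pipeline sketch; all arguments are hypothetical placeholders.

    touchdown / manh50: iterables of (trajectory, instruction) pairs.
    train_style_model:  trains the multimodal text style transfer model on Touchdown.
    pretrain / finetune: training loops for the navigation agent.
    navigator:          the VLN Transformer (or any agent; the approach is model-agnostic).
    """
    # Stage 0: learn to rewrite instructions on the human-annotated outdoor VLN data.
    style_model = train_style_model(touchdown)

    # Stage 1: transfer the style of the template-based Google Maps instructions
    # attached to the external Manh-50 trajectories.
    stylized = [(traj, style_model.infer(traj, instr)) for traj, instr in manh50]

    # Stage 2: pre-train the navigator on the augmented external data, then fine-tune it.
    pretrain(navigator, stylized)
    finetune(navigator, touchdown)
    return navigator
```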

Multimodal Text Style Transfer Model
Instruction Style The navigation instructions vary across different outdoor VLN datasets. As shown in Table 1, the instructions generated by the Google Maps API are template-based and mainly consist of street names and directions. In contrast, human-annotated instructions for the outdoor VLN task emphasize the visual environment's attributes as navigation targets and frequently refer to objects in the panorama, such as traffic lights, cars, and awnings. The goal of conducting multimodal text style transfer is to inject more object-related information from the surrounding navigation environment into the machine-generated instructions while keeping the correct guiding signals.

Masking-and-Recovering Scheme
The multimodal text style transfer model is trained with a "masking-and-recovering" scheme (Zhu et al., 2019; Donahue et al., 2020; Huang et al., 2020a) to inject objects that appear in the panorama into the instructions. We mask out certain portions of the instructions and recover the missing content with the help of the remaining instruction skeleton and the paired trajectory. Specifically, we use NLTK (Bird et al., 2009) to mask out the object-related tokens in the human-annotated instructions, and the street names in the machine-generated instructions. Multiple consecutive masked-out tokens are replaced by a single [MASK] token. We aim to maintain the correct guiding signals for navigation after the style transfer process: tokens that provide guiding signals, such as "turn left" or "take a right", are not masked out. Fig. 3 provides an example of the "masking-and-recovering" process during training and inference.

Figure 3: An example of the training and inference process of the multimodal text style transfer model. During training, we mask out the objects in the human-annotated instructions to obtain the instruction template. The model takes both the trajectory and the instruction skeleton as input, and the training objective is to recover the instructions with objects. When inferring new instructions for external trajectories, we mask the street names in the original instructions and prompt the model to generate new object-grounded instructions.
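A minimal sketch of the masking step follows, assuming NLTK part-of-speech tags are used to identify object-related (noun) tokens; the exact selection rules are an implementation detail that this sketch simplifies.

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

# Guiding words are never masked so that the navigation signals survive the transfer.
GUIDING = {"left", "right", "forward", "straight", "turn", "stop", "go"}

def mask_objects(instruction: str) -> str:
    """Replace runs of object-related (noun) tokens with a single [MASK] token."""
    tokens = nltk.word_tokenize(instruction)
    out, prev_masked = [], False
    for tok, tag in nltk.pos_tag(tokens):
        is_object = tag.startswith("NN") and tok.lower() not in GUIDING
        if is_object:
            if not prev_masked:          # collapse consecutive masked tokens into one [MASK]
                out.append("[MASK]")
            prev_masked = True
        else:
            out.append(tok)
            prev_masked = False
    return " ".join(out)

# e.g. mask_objects("Turn right when you reach the red awning .")
# -> "Turn right when you reach the red [MASK] ."
```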
Model Structure Fig. 3 illustrates the input and expected output of our multimodal text style transfer model. We build the multimodal text style transfer model upon the Speaker model proposed by Fried et al. (2018). On top of the visual-attention-based LSTM (Hochreiter and Schmidhuber, 1997) structure in the Speaker model, we inject textual attention over the masked instruction skeleton into the encoder, which allows the model to attend to the original guiding signals. The encoder takes both visual and textual inputs, which encode the trajectory and the masked instruction skeleton. Specifically, each visual view in the trajectory is represented as a feature vector v = [v_v; v_α], the concatenation of the visual encoding v_v ∈ R^512 and the orientation encoding v_α ∈ R^64. The visual encoding v_v is the output of the second-to-last layer of ResNet18 (He et al., 2016) for the current view. The orientation encoding v_α encodes the current heading α by repeating the vector [sin α, cos α] 32 times, following Fried et al. (2018). As described in Section 3.4, the feature matrix of a panorama is the concatenation of eight projected visual views.
In the multimodal style transfer encoder, we use a soft-attention module (Vaswani et al., 2017) to calculate the grounded visual feature v̂_t for the current view at step t, where h_{t-1} is the hidden context of the previous step, W_v refers to the learnable parameters, and attn^v_{t,i} is the attention weight over the i-th slice v_i of the current panorama. We use full-stop punctuation to split the input text into multiple sentences; the rationale is to enable alignment between the street views and the semantic guidance in sub-instructions. For each sentence in the input text, the textual encoding s is the average of the word embeddings of all tokens in that sentence. We also use a soft-attention module to calculate the grounded textual feature ŝ_t at the current step t, where W_s refers to the learnable parameters, attn^s_{t,j} is the attention weight over the j-th sentence encoding s_j at step t, and M denotes the maximum number of sentences in the input text. The input text for the multimodal style transfer encoder is the masked instruction template. Based on the grounded visual feature v̂_t, the grounded textual feature ŝ_t, and the visual view feature v_t at the current timestamp t, the encoder updates the hidden context h_t.

Training Objectives We train the multimodal text style transfer model in the teacher-forcing manner (Williams and Zipser, 1989). The decoder generates tokens auto-regressively, conditioning on the masked instruction template and the trajectory. The training objective is to minimize a cross-entropy loss over the tokens x_1, x_2, ..., x_n of the original instruction X, where n is the total number of tokens in X and N denotes the maximum number of views in the trajectory.
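The attention computations and the training objective described above can be written out as follows. This is a hedged reconstruction that follows the standard Speaker formulation of Fried et al. (2018): the bilinear scoring with W_v and W_s, the LSTM update, and the symbol \tilde{X} for the masked instruction template are assumptions rather than quotations from the paper.

```latex
% Reconstructed soft-attention, hidden-state update, and cross-entropy objective.
\begin{align}
  \mathrm{attn}^{v}_{t,i} &= \operatorname{softmax}_{i}\big(h_{t-1}^{\top} W_{v}\, v_{i}\big), \qquad
  \hat{v}_{t} = \sum_{i=1}^{8} \mathrm{attn}^{v}_{t,i}\, v_{i} \\
  \mathrm{attn}^{s}_{t,j} &= \operatorname{softmax}_{j}\big(h_{t-1}^{\top} W_{s}\, s_{j}\big), \qquad
  \hat{s}_{t} = \sum_{j=1}^{M} \mathrm{attn}^{s}_{t,j}\, s_{j} \\
  h_{t} &= \mathrm{LSTM}\big([\hat{v}_{t};\, \hat{s}_{t};\, v_{t}],\; h_{t-1}\big) \\
  \mathcal{L}_{\mathrm{MTST}} &= -\sum_{k=1}^{n} \log p\big(x_{k} \,\big|\, x_{<k},\, \tilde{X},\, v_{1:N}\big)
\end{align}
```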

VLN Transformer
The VLN Transformer is the navigation agent that generates actions in the outdoor VLN task. As illustrated in Fig. 4, our VLN Transformer is composed of an instruction encoder, a trajectory encoder, a cross-modal encoder that fuses the instruction encodings and the trajectory encodings, and an action predictor.

Figure 4: Overview of the VLN Transformer. In this example, the VLN Transformer predicts a left turn for the visual scene at t = 3.

Instruction Encoder
The instruction encoder is a pre-trained uncased BERT-base model (Devlin et al., 2019). Each piece of navigation instruction is split into multiple sentences by the full-stop punctuation. For each sentence {x_{i,1}, x_{i,2}, ..., x_{i,l_i}} that contains l_i tokens, its sentence embedding h^s_i is computed from the word embeddings w_{i,j} that BERT generates for its tokens x_{i,j}, followed by a fully-connected layer FC.

View Encoder We use the view encoder to retrieve embeddings for the visual views at each time step. Following Chen et al. (2019), we embed each panorama I_t by slicing it into eight images and projecting each image from an equirectangular projection to a perspective projection. Each projected image of size 800 × 460 is passed through a ResNet18 (He et al., 2016) pre-trained on ImageNet (Russakovsky et al., 2015). We use the output of size 128 × 100 × 58 from the fourth-to-last layer before classification as the feature for each slice. The feature map for each panorama is the concatenation of the eight image slices, which is a single tensor of size 128 × 100 × 464. We center the feature map according to the agent's heading α_t at timestamp t, crop a 128 × 100 × 100 feature map from the center, and calculate the mean value along the channel dimension. The resulting 100 × 100 feature map is regarded as the current panorama feature Î_t for each state. Following Mirowski et al. (2018), we then apply a three-layer convolutional neural network on Î_t to extract the view features h^v_t ∈ R^256 at timestamp t.
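A rough PyTorch sketch of this view-encoding pipeline follows, keeping the tensor shapes stated above; the ResNet18 slice features are assumed to be pre-extracted, and the CNN kernel sizes and strides are assumptions (the text only specifies a three-layer CNN following Mirowski et al. (2018)).

```python
import math
import torch
import torch.nn as nn

class ViewEncoder(nn.Module):
    """Turns a pre-extracted panorama feature map (128 x 100 x 464) into h^v_t in R^256."""

    def __init__(self):
        super().__init__()
        # Three-layer CNN over the 100 x 100 crop; kernel/stride choices are assumptions.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.LazyLinear(256)  # project the flattened CNN output to R^256

    def forward(self, pano_feat: torch.Tensor, heading: float) -> torch.Tensor:
        # pano_feat: (128, 100, 464), eight ResNet18 slice features concatenated along width.
        width = pano_feat.shape[-1]
        center = int((heading % (2 * math.pi)) / (2 * math.pi) * width)
        cols = [(center - 50 + i) % width for i in range(100)]   # 100-column window, wraps at the seam
        crop = pano_feat[:, :, cols]                             # (128, 100, 100), centered on the heading
        feat = crop.mean(dim=0)[None, None]                      # (1, 1, 100, 100), mean over channels
        return self.fc(self.conv(feat).flatten(1)).squeeze(0)    # h^v_t in R^256
```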
Cross-Modal Encoder In order to navigate through complicated real-world environments, the agent needs a proper joint understanding of the natural language instructions and the visual views to choose proper actions at each state. Since the instructions and the trajectory lie in different modalities and are encoded separately, we introduce the cross-modal encoder to fuse the features from the different modalities and jointly encode the instructions and the trajectory. The cross-modal encoder is an 8-layer Transformer encoder (Vaswani et al., 2017) with a mask. We use eight self-attention heads and a hidden size of 256.

In the teacher-forcing training process, we add a mask when calculating the multi-head self-attention across the different modalities. By masking out all the future views in the ground-truth trajectory, the current view v_t is only allowed to refer to the full instruction (all M sentences, where M denotes the maximum sentence number) and to all the previous views that the agent has passed.

Since the Transformer architecture is based solely on attention mechanisms and thus contains no recurrence or convolution, we need to inject additional information about the relative or absolute position of the features in the input sequence. We add a learned segment embedding to every input feature vector specifying whether it belongs to the sentence encodings or the view encodings. We also add a learned position embedding to indicate the relative position of the sentences in the instruction sequence or of the views in the trajectory sequence.
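A minimal PyTorch sketch of the cross-modal encoder follows, including the attention mask, the segment and position embeddings, and (for completeness) the fully-connected action head described in the next subsection; the sentences-then-views tensor layout, the maximum position length, and classifying each step's output directly are simplifying assumptions.

```python
import torch
import torch.nn as nn

def cross_modal_mask(num_sents: int, num_views: int, device=None) -> torch.Tensor:
    """Attention mask (True = blocked) for the sequence [s_1..s_M, v_1..v_T]:
    sentences see everything; view t sees all sentences and only views up to t."""
    size = num_sents + num_views
    mask = torch.zeros(size, size, dtype=torch.bool, device=device)
    future = torch.triu(torch.ones(num_views, num_views, dtype=torch.bool, device=device), diagonal=1)
    mask[num_sents:, num_sents:] = future
    return mask

class CrossModalEncoder(nn.Module):
    def __init__(self, hidden: int = 256, layers: int = 8, heads: int = 8, num_actions: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.segment = nn.Embedding(2, hidden)      # 0 = sentence token, 1 = view token
        self.position = nn.Embedding(512, hidden)   # learned positions; max length is an assumption
        self.action_head = nn.Linear(hidden, num_actions)  # FC action predictor (see next subsection)

    def forward(self, sent_emb: torch.Tensor, view_emb: torch.Tensor) -> torch.Tensor:
        # sent_emb: (B, M, hidden) sentence encodings; view_emb: (B, T, hidden) view encodings.
        M, T = sent_emb.shape[1], view_emb.shape[1]
        device = sent_emb.device
        seg = torch.cat([torch.zeros(M, dtype=torch.long, device=device),
                         torch.ones(T, dtype=torch.long, device=device)])
        pos = torch.cat([torch.arange(M, device=device), torch.arange(T, device=device)])
        x = torch.cat([sent_emb, view_emb], dim=1) + self.segment(seg) + self.position(pos)
        out = self.encoder(x, mask=cross_modal_mask(M, T, device))
        # Simplification: the paper feeds the concatenated outputs up to step t into the
        # action predictor; here each step's output is classified directly.
        return self.action_head(out[:, M:, :])      # (B, T, 4) logits over {left, right, forward, stop}
```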

Action Predictor
The action predictor is a fully-connected layer. It takes the concatenation of the cross-modal encoder's outputs T(·) up to the current timestamp t as input and applies a fully-connected layer FC to predict the action a_t for view v_t. During training, we use the cross-entropy loss for optimization.

Datasets While the StreetLearn dataset's trajectories contain more panoramas along the way on average, the paired instructions are shorter than those in the Touchdown dataset. For the convenience of conducting experiments, we extract a sub-dataset, Manh-50, from the original large-scale StreetLearn dataset. Manh-50 consists of navigation samples in the Manhattan area whose trajectories contain no more than 50 panoramas, yielding 31k training samples. We generate style-transferred instructions for the Manh-50 dataset, which serves as an auxiliary dataset and is used to pre-train the navigation models. More details can be found in the appendix.

Evaluation Metrics
We use the following metrics to evaluate VLN performance: (1) Task Completion (TC): the rate at which the agent completes the navigation task correctly. Following Chen et al. (2019), the navigation result is considered correct if the agent reaches the specified goal or one of the adjacent nodes in the environment graph.
(2) Shortest-Path Distance (SPD): the mean distance between the agent's final position and the goal position in the environment graph.
(3) Success weighted by Edit Distance (SED): the normalized Levenshtein edit distance between the path predicted by the agent and the reference path, computed only over successful navigations.
(4) Coverage weighted by Length Score (CLS): a measurement of the fidelity of the agent's path with respect to the reference path.
(5) Normalized Dynamic Time Warping (nDTW): the minimized cumulative distance between the predicted path and the reference path, divided by the square root of the reference path length and rescaled by taking the negative exponential of the normalized value.
(6) Success weighted Dynamic Time Warping (SDTW): the nDTW value where the summation is restricted to successful navigations.
TC, SPD, and SED are defined by Chen et al. (2019), and CLS follows its original definition. nDTW and SDTW are originally defined by Ilharco et al. (2019), in which nDTW is normalized by the length of the reference path. We adjust the normalizing factor to be the reciprocal of the square root of the reference path length for length invariance (Mueen and Keogh, 2016). Since the reference trajectory lengths vary considerably, this modification makes the nDTW and SDTW scores invariant to the reference length.
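For reference, the normalization change can be written as follows; keeping the success-threshold constant d_th from the original nDTW definition is an assumption of this sketch.

```latex
% Original normalization (Ilharco et al., 2019) vs. the length-invariant variant used here.
\begin{align}
  \mathrm{nDTW}_{\mathrm{orig}}(P, R) &= \exp\!\left(-\frac{\mathrm{DTW}(P, R)}{|R|\, d_{th}}\right) \\
  \mathrm{nDTW}_{\mathrm{ours}}(P, R) &= \exp\!\left(-\frac{\mathrm{DTW}(P, R)}{\sqrt{|R|}\, d_{th}}\right)
\end{align}
```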

Results and Analysis
In this section, we report the outdoor VLN performance and the quality of the generated instructions to validate the effectiveness of our MTST learning approach. We compare our VLN Transformer with the baseline models and discuss the influence of pre-training on external resources with and without instruction style transfer.
Outdoor VLN Performance We compare our VLN Transformer with RCONCAT (Mirowski et al., 2018) and GA (Chaplot et al., 2018) as baseline models. Both baseline models encode the trajectory and the instruction in an LSTM-based manner and use supervised training with Hogwild! (Recht et al., 2011). Table 2 presents the navigation results on the Touchdown validation and test sets, where the VLN Transformer performs better than RCONCAT and GA on most metrics, with the exception of SPD and CLS.
Pre-training the navigation models on Manh-50 with template-based instructions can partially improve navigation performance. For all three agent models, the scores related to successful cases, such as TC, SED, and SDTW, witness a boost after pre-training on vanilla Manh-50. However, the difference in instruction style between Manh-50 and Touchdown might misguide the agent during pre-training, resulting in a performance drop on SPD for our VLN Transformer model.
In contrast, our MTST learning approach can better utilize external resources and further improve navigation performance. Pre-training on Manh-50 with style-modified instructions stably improves navigation performance on all the metrics for both the RCONCAT model and the VLN Transformer, which also indicates that our MTST learning approach is model-agnostic. Table 4 compares the SPD values on successful and failed navigation cases. In the success cases, the VLN Transformer has better SPD scores, which is aligned with the best SED results in Table 2. Our model's inferior overall SPD results are caused by taking longer paths in failure cases, which also harms the fidelity of the generated path and lowers the CLS scores. Nevertheless, exploring more areas when getting lost is not necessarily undesirable behavior for a navigation agent; we leave this to future study.

Multimodal Text Style Transfer in VLN
We attempt to reveal each component's effect in the multimodal text style transfer model. We pre-train the VLN Transformer with external trajectories and instructions generated by different models, then fine-tune it on the Touchdown dataset.
According to the navigation results in Table 3, the instructions generated by the Speaker model misguide the navigation agent, indicating that relying solely on the Speaker model cannot close the gap between the different instruction styles. Adding textual attention to the Speaker slightly improves the navigation results, but the resulting instructions still hinder the agent from navigating correctly. The style-modified instructions improve the agent's performance on all the navigation metrics, suggesting that our Multimodal Text Style Transfer learning approach can assist the outdoor VLN task.
Quality of the Generated Instructions We evaluate the quality of the instructions generated by the Speaker and the MTST model with five automatic metrics for natural language generation: BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Elliott and Keller, 2013), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016). In addition, we calculate the guiding signal match rate (MR) by comparing the appearance of "turn left" and "turn right": a generated instruction counts as a match if its guiding signals agree with those in the reference instruction (a sketch of this computation is given at the end of this subsection).

Table 2: Navigation results on the outdoor VLN task. +M-50 denotes pre-training with vanilla Manh-50, which contains machine-generated instructions; in the +style setting, the model is pre-trained with Manh-50 trajectories and style-modified instructions generated by our MTST model.

Table 3: Ablation study of the multimodal text style transfer model on the outdoor VLN task. In the +speaker setting, the instructions used in pre-training are generated by the Speaker (Fried et al., 2018), which only attends to the visual input; +text_attn denotes that we add a textual attention module to the Speaker to attend to both the visual input and the machine-generated instructions provided by the Google Maps API.

We report the quantitative results on the validation set in Table 5. After adding textual attention to the Speaker, the evaluation performance on all seven metrics improves. Our MTST model scores the highest on all seven metrics, which indicates that the "masking-and-recovering" scheme is beneficial for the multimodal text style transfer process. The results validate that the MTST model can generate higher-quality instructions, which refer to more visual objects and provide more matched guiding signals.
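A sketch of how the guiding signal match rate can be computed is shown below; treating an instruction as a match only when its ordered sequence of "turn left"/"turn right" signals equals that of the reference is an assumption of this sketch.

```python
import re
from typing import List

def guiding_signals(instruction: str) -> List[str]:
    """Extract the ordered 'turn left' / 'turn right' guiding signals."""
    return re.findall(r"turn (left|right)", instruction.lower())

def match_rate(generated: List[str], references: List[str]) -> float:
    """Fraction of generated instructions whose guiding signals match the reference."""
    matches = sum(guiding_signals(g) == guiding_signals(r)
                  for g, r in zip(generated, references))
    return matches / max(len(references), 1)
```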

Human Evaluation
We invite human judges on Amazon Mechanical Turk to evaluate the quality of the instructions generated by different models. We conduct a pairwise comparison covering 170 pairs of instructions generated by the Speaker, the Speaker with textual attention, and our MTST model. The instruction pairs are sampled from the Touchdown validation set. Each pair of instructions, together with the ground-truth instruction and a GIF illustrating the navigation street view, is presented to 5 annotators. The annotators are asked to make decisions regarding guiding signal correctness and instruction content alignment. Results in Table 6 show that annotators judge the instructions generated by our MTST model to better describe the street view and to align better with the ground-truth instructions.
Case Study We demonstrate case study results to illustrate the performance of our Multimodal Text Style Transfer learning approach. Fig. 5 provides two showcases of the instruction generation results. As shown, the instructions generated by the vanilla Speaker model perform poorly at preserving the guiding signals of the ground-truth instructions and suffer from hallucinations, i.e., they mention objects that do not appear in the trajectory. The Speaker with textual attention can provide direction guidance; however, the instructions generated in this manner do not utilize the rich visual information in the trajectory. In contrast, the instructions generated by our multimodal text style transfer model inject more object-related information from the surrounding navigation environment ("the light", "scaffolding") into the StreetLearn instructions while keeping the correct guiding signals.

Conclusion
In this paper, we proposed the Multimodal Text Style Transfer learning approach for outdoor VLN. This learning framework allows us to utilize out-of-domain navigation samples in outdoor environments and enrich the original navigation reasoning training process. Experimental results show that our MTST approach is model-agnostic and outperforms the baseline models on the outdoor VLN task. We believe our study provides a possible solution to mitigate the data scarcity issue in the outdoor VLN task. In future work, we would like to explore the possibility of constructing an end-to-end framework. We will also further improve the quality of style-modified instructions and quantitatively evaluate the alignment between the trajectory and the style-transferred instructions.

Figure 5: Two showcases of the instruction generation results.

Showcase 1: StreetLearn
Turn right onto W 36th St. Turn right onto Dyer Ave.

Original Speaker
Go to the next intersection and turn left again. There will be a building with a red awning on your right. Go straight through the next intersection.

Speaker with Textual Attention
Turn right at the next intersection. Stop just before the next intersection.

Multimodal Text Style Transfer
Turn right again at the next intersection. On your right will be scaffolding on your right. Turn right.

Showcase 2: StreetLearn
Head northwest on W 35th St toward Hudson Blvd E. Turn right at the 1st cross street onto Hudson Blvd E.

Original Speaker
Turn so the red construction is on your left and the red brick building is on your right. Go forward to the intersection and turn right. You'll have a red brick building with a red awning on your right.

Speaker with Textual Attention
Head in the direction of traffic. Turn right at the first intersection.

Multimodal Text Style Transfer
Move forward with traffic on the right turn right at the light. Continue straight.

A.1 Datasets

Table 7 lists the statistical information of the datasets used in pre-training and fine-tuning. Even though the Touchdown dataset and the StreetLearn dataset are both built upon Google Street View and both contain urban environments in New York City, pre-training the model with the VLN task on the StreetLearn dataset does not pose a threat of test data leakage, for several reasons. First, the instructions in the two datasets have distinct styles: the instructions in the StreetLearn dataset are generated by the Google Maps API, which is template-based and focuses on street names, whereas the instructions in the Touchdown dataset are written by human annotators and emphasize the visual environment's attributes as navigational cues. Moreover, as reported by Mehta et al. (2020), the panoramas in the two datasets have little overlap. In addition, Touchdown instructions constantly refer to transient objects such as cars and bikes, which might not appear in a panorama captured at a different time. The different granularity of the panorama spacing also leads to distinct panorama distributions in the two datasets.

A.2 Training Details
We use the Adam optimizer (Kingma and Ba, 2015) to optimize all the parameters. During pre-training on the StreetLearn dataset, the learning rate for the RCONCAT model, the GA model, and the VLN Transformer is 2.5 × 10^-4. We fine-tune BERT separately with a learning rate of 1 × 10^-5. We pre-train RCONCAT and GA for 15 epochs and pre-train the VLN Transformer for 25 epochs.
When training or fine-tuning on the Touchdown dataset, the learning rate for RCONCAT and GA is 2.5 × 10^-4. For the VLN Transformer, the learning rate for fine-tuning BERT is initially set to 1 × 10^-5, while the learning rate for the other parameters in the model is initialized to 2.5 × 10^-4; the learning rate for the VLN Transformer decays during training. The batch size for RCONCAT and GA is 64, while the VLN Transformer uses a batch size of 30 during training.

Table 8: Ablation results of the VLN Transformer's instruction split on the Touchdown dev set. In the split setting, the instruction is split into multiple sentences before being encoded by the instruction encoder, while the no split setting encodes the whole instruction without splitting.
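A minimal sketch of the optimizer setup above, assuming a PyTorch implementation; the attribute name `model.bert` is a placeholder, and the learning-rate decay schedule is omitted.

```python
import torch

def build_optimizers(model):
    """Separate Adam optimizers for BERT and for the rest of the VLN Transformer,
    using the learning rates reported above (attribute names are placeholders)."""
    bert_params = list(model.bert.parameters())
    bert_ids = {id(p) for p in bert_params}
    other_params = [p for p in model.parameters() if id(p) not in bert_ids]
    bert_opt = torch.optim.Adam(bert_params, lr=1e-5)     # fine-tune BERT separately
    main_opt = torch.optim.Adam(other_params, lr=2.5e-4)  # all other parameters
    return main_opt, bert_opt
```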

A.3 Split Instructions vs. No Split
We compare the VLN Transformer's performance with and without splitting the instructions into sentences during encoding. Results in Table 8 show that breaking the instructions into multiple sentences allows the visual views and the guiding signals in sub-instructions to fully attend to each other during cross-modal encoding. Such cross-modal alignment leads to better navigation performance.

A.4 Amazon Mechanical Turk
We use AMT for human evaluation of the quality of the instructions generated by different models. The survey form for the head-to-head comparisons is shown in Figure 6.