Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and receives increasing attention from natural language processing, computer vision, robotics, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation metrics, methods, etc. Through structured analysis of current progress and challenges, we also highlight the limitations of current VLN and opportunities for future work. This paper serves as a thorough reference for the VLN research community.


Introduction
Humans communicate with each other using natural language to issue tasks and request help. An agent that can understand human language and navigate intelligently would significantly benefit human society, both personally and professionally. Such an agent can be spoken to in natural language, and would autonomously execute tasks such as household chores indoors, repetitive delivery work outdoors, or work in hazardous conditions (e.g., bridge inspection or fire-fighting) following human commands. Scientifically, developing such an agent explores how an artificial agent interprets natural language from humans, perceives its visual environment, and utilizes that information to navigate to complete a task successfully.
Vision-and-Language Navigation (VLN) (Anderson et al., 2018b; Chen et al., 2019; Thomason et al., 2019b) is an emerging research field that aims to build such an embodied agent that can communicate with humans in natural language and navigate in real 3D environments. (We also release a GitHub repo to keep track of advances in VLN at https://github.com/eric-ai-lab/awesome-vision-language-navigation) VLN extends visual navigation in both simulated (Zhu et al., 2017; Mirowski, 2019) and real environments (Mirowski et al., 2018) with natural language communication. As illustrated in Figure 1, VLN is a task that involves the oracle (frequently a human), the agent, and the environment. The agent and the oracle communicate in natural language. The agent may ask for guidance and the oracle could respond. The agent navigates and interacts with the environment to complete the task according to the instructions received and the environment observed. Meanwhile, the oracle observes the environment and agent status, and may interact with the environment to help the agent. Since the development and release of works such as Room-to-Room (R2R) (Anderson et al., 2018b), many VLN datasets have been introduced. Regarding the degree of communication, benchmarks range from those where the agent is required to passively understand one instruction before navigation to those where agents converse with the oracle in free-form dialog. Regarding the task objective, the requirements for the agent range from strictly following the route described in the initial instruction to actively exploring the environment and interacting with objects. In a slight abuse of terminology, we refer to benchmarks that involve object interaction together with substantial sub-problems of navigation and localization, such as ALFRED (Shridhar et al., 2020), as VLN benchmarks.
Many challenges exist in VLN tasks. First, VLN faces a complex environment and requires effective understanding and alignment of information from different modalities. Second, VLN agents require a reasoning strategy for the navigation process. Data scarcity is also an obstacle. Lastly, the generalization of a model trained in seen environments to unseen environments is also essential. We categorize the solutions according to the respective challenges.
(1) Representation learning methods help understand information from different modalities.
(2) Action strategy learning aims to make reasonable decisions based on gathered information.
(3) Data-centric learning methods effectively utilize the data and address data challenges such as data scarcity.
(4) Prior exploration helps the model familiarize itself with the test environment, improving its ability to generalize.
We make three primary contributions.
(1) We systematically categorize current VLN benchmarks from communication complexity and task objective perspectives, with each category focusing on a different type of VLN task.
(2) We hierarchically classify current solutions and the papers within the scope.
(3) We discuss potential opportunities and identify future directions.

Tasks and Datasets
The ability of an agent to interpret natural language instructions (and in some instances, request feedback during navigation) is what distinguishes VLN from visual navigation (Bonin-Font et al., 2008). In Table 2, we mainly categorize current datasets on two axes, Communication Complexity and Task Objective.
Communication Complexity defines the level at which the agent may converse with the oracle, and we differentiate three levels: In the first level, the agent is only required to understand an Initial Instruction before navigation starts. In the second level, the agent sends a signal for help whenever it is unsure, utilizing the Guidance from the oracle. In the third level, the agent with Dialogue ability asks questions in the form of natural language during the navigation and understands further oracle guidance.
Task Objective defines how the agent attains its goal based on the initial instructions from the oracle. In the first objective type, Fine-grained Navigation, the agent can find the target according to a detailed step-by-step route description. In the second type, Coarse-grained Navigation, the agent is required to find a distant target goal with a coarse navigation description, requiring the agent to reason a path in a navigable environment and possibly elicit additional oracle help. Tasks in the previous two types only require the agent to navigate to complete the mission. In the third type, Navigation and Object Interaction, besides reasoning a path, the agent also needs to interact with objects in the environment to achieve the goal, since the object might be hidden or might need to change its physical state. As with coarse-grained navigation, some object interaction tasks can require additional supervision via dialogue with the oracle.

Initial Instruction
In many VLN benchmarks, the agent is given a natural language instruction for the whole navigation process, such as "Go upstairs and pass the table in the living room. Turn left and go through the door in the middle."

Fine-grained Navigation
An agent needs to strictly follow the natural language instruction to reach the target goal. Anderson et al. (2018b) create the R2R dataset based on the Matterport3D simulator. An embodied agent in R2R moves through a house in the simulator by traversing edges on a navigation graph, jumping to adjacent nodes containing panoramic views. R2R has been extended to create other VLN benchmarks. Room-for-Room joins paths in R2R into longer trajectories (Jain et al., 2019).
[Table 1, Initial Instruction row. Fine-grained Navigation: the dataset of Yan et al. (2020), Landmark-RxR (He et al., 2021), VLNCE (Krantz et al., 2020), TOUCHDOWN (Chen et al., 2019), StreetLearn (Mirowski et al., 2019), StreetNav (Hermann et al., 2020), Talk2Nav (Vasudevan et al., 2021), LANI (Misra et al., 2018). Coarse-grained Navigation: RoomNav, EmbodiedQA (Das et al., 2018), REVERIE (Qi et al., 2020b), SOON (Zhu et al., 2021a). Navigation and Object Interaction: IQA (Gordon et al., 2018), CHAI (Misra et al., 2018), ALFRED (Shridhar et al., 2020).]
Outdoor environments are usually more complex and contain more objects than indoor environments. In TOUCHDOWN (Chen et al., 2019), an agent follows instructions to navigate a street-view rendered simulation of New York City to find a hidden object. Most photo-realistic outdoor VLN datasets, including TOUCHDOWN (Chen et al., 2019), StreetLearn (Mirowski et al., 2019; Mehta et al., 2020), StreetNav (Hermann et al., 2020), and Talk2Nav (Vasudevan et al., 2021), are based on Google Street View.
Some work uses natural language to guide drones. LANI (Misra et al., 2018) is a 3D synthetic navigation environment, where an agent navigates between landmarks following natural language instructions. Current datasets on drone navigation are usually built in synthetic environments such as Unity3D (Blukis et al., 2018).

Coarse-grained Navigation
In real life, detailed information about the route may not be available since it may be unknown to the human instructor (oracle). Usually, instructions are more concise and contain only information about the target goal.
RoomNav requires the agent to navigate according to the instruction "go to X", where X is a predefined room or object.
In Embodied QA (Das et al., 2018), the agent navigates through the environment to find the answer to a given question. The instructions in REVERIE (Qi et al., 2020b) are annotated by humans, and are thus more complicated and diverse. The agent navigates through the rooms and differentiates the target object from multiple competing candidates. In SOON (Zhu et al., 2021a), an agent receives a long, complex coarse-to-fine instruction which gradually narrows down the search scope.

Navigation+Object Interaction
For some tasks, the target object might be hidden (e.g., the spoon in a drawer), or need to change status (e.g., a sliced apple is requested but only a whole apple is available). In these scenarios, it is necessary to interact with the objects to accomplish the task (e.g., opening the drawer or cutting the apple). Interactive Question Answering (IQA) requires the agent to navigate and sometimes to interact with objects to answer a given question. Based on indoor scenes in AI2-THOR (Kolve et al., 2017), Shridhar et al. (2020) propose the ALFRED dataset, where agents are provided with both coarse-grained and fine-grained instructions to complete household tasks in an interactive visual environment. CHAI (Misra et al., 2018) requires the agent to navigate and perform simple interactions with the environment.

Oracle Guidance
Agents in Guidance VLN tasks may receive further natural language guidance from the oracle during navigation. For example, if the agent is unsure of the next step (e.g., entering the kitchen), it can send a [help] signal, and the oracle would assist by responding "go left" (Nguyen et al., 2019).

Fine-grained Navigation
The initial fine-grained navigation instruction may still be ambiguous in a complex environment. Guidance from the oracle could clarify possible confusion. Chi et al. (2020) introduce Just Ask, a task where an agent can ask the oracle for help during navigation.

Coarse-grained Navigation
With only a coarse-grained instruction given at the beginning, the agent tends to be more confused and spends more time exploring. Further guidance resolves this ambiguity. VNLA (Nguyen et al., 2019) and HANNA (Nguyen and Daumé III, 2019) both train an agent to navigate indoors to find objects. The agent can request help from the oracle, which responds by providing a subtask that helps the agent make progress. While the oracle in VNLA uses a predefined script to respond, the oracle in HANNA uses a neural network to generate natural language responses. CEREALBAR (Suhr et al., 2019) is a collaborative task between a leader and a follower. Both agents move in a virtual game environment to collect valid sets of cards.

Navigation+Object Interaction
Because VLN is still in its youth, there are not yet any VLN datasets that support both Guidance and Object Interaction.

Human Dialogue
It is human-friendly to use natural language to request help (Banerjee et al., 2020; Thomason et al., 2019b). For example, when the agent is not sure about what fruit the human wants, it could ask "What fruit do you want, the banana in the refrigerator or the apple on the table?", and the human response would provide clear navigation direction.

Fine-grained Navigation
No datasets are in the scope of this category. Currently, a route-detailed instruction with possible guidance can help the agent achieve relatively good performance in most simulated environments. We expect datasets to be developed for this category for very long-horizon navigation tasks in complex environments, especially those with rich dynamics, where dialog is necessary to resolve confusion.

Coarse-grained Navigation
CVDN (Thomason et al., 2019b) is a dataset of human-human dialogues. Besides interpreting a natural language instruction and deciding on the following action, the VLN agent also needs to ask questions in natural language for guidance. The oracle, with knowledge of the best next steps, needs to understand and correctly answer said questions.
Dialogue is important in complex outdoor environments. de Vries et al. (2018) introduce the Talk the Walk dataset, where the guide has knowledge from a map and guides the tourist to a destination but does not know the tourist's location, while the tourist navigates a 2D grid via discrete actions.

Navigation+Object Interaction
Minecraft Collaborative Building (Narayan-Chen et al., 2019) studies how an agent places blocks into a building by communicating with the oracle. TEACh (Padmakumar et al., 2021) is a dataset that studies object interaction and navigation with free-form dialog. The follower converses with the commander and interacts with the environment to complete various household tasks such as making coffee. DialFRED (Gao et al., 2022) extends the ALFRED (Shridhar et al., 2020) dataset by allowing the agent to actively ask questions.

Evaluation
Goal-oriented Metrics mainly consider the agent's proximity to the goal. The most intuitive is Success Rate (SR), which measures how frequently an agent completes the task within a certain distance of the goal. Goal Progress (Thomason et al., 2019b) measures the reduction in remaining distance to the target goal. Path Length (PL) measures the total length of the navigation path. Shortest-Path Distance (SPD) measures the mean distance between the agent's final location and the goal. Since a longer path length is undesirable (it increases duration and wear-and-tear on actual robots), Success weighted by Path Length (SPL) (Anderson et al., 2018a) balances both Success Rate and Path Length. Similarly, Success weighted by Edit Distance (SED) (Chen et al., 2019) compares the expert's actions/trajectory to the agent's actions/trajectory, also balancing SR and PL. Oracle Navigation Error (ONE) takes the shortest distance to the goal from any node in the path rather than just the last node, and Oracle Success Rate (OSR) measures whether any node in the path is within a threshold distance of the target location.
Path-fidelity Metrics evaluate to what extent an agent follows the desired path. Some tasks require the agent not only to find the goal location but also to follow a specific path. Fidelity measures the matches between the action sequence in the expert demonstration and the action sequence in the agent trajectory.
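As a minimal sketch of the goal-oriented metrics above, the following Python snippet computes SR, OSR, and SPL from per-episode records. The `Episode` fields, the function names, and the default threshold of 3.0 are illustrative assumptions rather than any benchmark's official evaluation code; distances are assumed to be precomputed by the simulator. SPL averages, over episodes, the success indicator times the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    # Hypothetical record: distance to the goal at every visited node,
    # the length of the path the agent took, and the shortest-path length.
    dists_to_goal: List[float]
    path_length: float
    shortest_path_length: float


def success_rate(episodes: List[Episode], threshold: float = 3.0) -> float:
    """Fraction of episodes whose final node is within `threshold` of the goal."""
    hits = sum(ep.dists_to_goal[-1] <= threshold for ep in episodes)
    return hits / len(episodes)


def oracle_success_rate(episodes: List[Episode], threshold: float = 3.0) -> float:
    """Fraction of episodes where any visited node is within `threshold`."""
    hits = sum(min(ep.dists_to_goal) <= threshold for ep in episodes)
    return hits / len(episodes)


def spl(episodes: List[Episode], threshold: float = 3.0) -> float:
    """Success weighted by (normalized inverse) Path Length."""
    total = 0.0
    for ep in episodes:
        if ep.dists_to_goal[-1] > threshold:
            continue  # failed episodes contribute zero
        denom = max(ep.path_length, ep.shortest_path_length)
        # If start and goal coincide and the agent did not move, count as 1.
        total += ep.shortest_path_length / denom if denom > 0 else 1.0
    return total / len(episodes)
```

An agent that passes near the goal but stops elsewhere raises OSR without raising SR or SPL, which is why these metrics are usually reported together.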

VLN Methods
As shown in Figure 2, we categorize existing methods into Representation Learning, Action Strategy Learning, Data-centric Learning, and Prior Exploration. Since VLN involves multiple modalities, including vision, language, and action, representation learning methods help the agent understand the relations between them. Moreover, VLN is a complex reasoning task where mission results depend on the accumulated steps, and better action strategies help the decision-making process. Additionally, VLN tasks face challenges within their training data. One severe problem is scarcity. Collecting training data for VLN is expensive and time-consuming, and the existing VLN datasets are relatively small with respect to the complexity of VLN tasks. Therefore, data-centric methods help to utilize the existing data and create more training data. Prior exploration helps adapt agents to previously unseen environments, improving their ability to generalize and decreasing the performance gap between seen and unseen environments.

Representation Learning
Representation learning helps the agent understand how the words in the instruction relate to the perceived features in the environment.

Vision and Language
Vision-and-language pretrained models provide good joint representations for text and vision. A common practice is to initialize a VLN agent (Kim et al., 2021) with a pretrained model such as ViLBERT (Lu et al., 2019). The agent may be further trained with VLN-specific features such as objects and rooms (Qi et al., 2021). Downstream VLN tasks benefit from being closely related to the pretraining task. Researchers have also explored pretraining on the VLN domain directly. VLN-BERT (Majumdar et al., 2020) pretrains navigation models to measure the compatibility between paths and instructions, which formulates VLN as a path selection problem. PREVALENT (Hao et al., 2020) is trained from scratch on image-text-action triplets to learn textual representations in VLN tasks. The output embedding from the [CLS] token in BERT-based pretraining models can be leveraged in a recurrent fashion to represent the history state (Hong et al., 2021; Moudgil et al., 2021). Airbert (Guhur et al., 2021) achieves good performance in the few-shot setting after pretraining on a large-scale in-domain dataset.

Semantic Understanding
Semantic understanding of VLN tasks incorporates knowledge about important features in VLN. In addition to the raw features, high-level semantic representations also improve performance in unseen environments.

Intra-Modality
Visual or textual modalities can be decomposed into many features, which matter differently in VLN. The overall visual features extracted by a neural model may actually hurt performance in some cases (Thomason et al., 2019a; Hu et al., 2019; Zhang et al., 2020b). Therefore, it is important to find the feature(s) that best improve performance. High-level features such as visual appearance, route structure, and detected objects outperform the low-level visual features extracted by a CNN (Hu et al., 2019). Different types of tokens within the instruction also function differently (Zhu et al., 2021c). Extracting these tokens and encoding the object tokens and direction tokens is crucial (Qi et al., 2020a; Zhu et al., 2021c).

Inter-Modality
Semantic connections between different modalities (actions, scenes, observed objects, direction clues, and objects mentioned in instructions) can be extracted and then softly aligned with an attention mechanism (Qi et al., 2020a; Gao et al., 2021). The soft alignment also highlights the parts of the instruction that are relevant to the current step (Landi et al., 2019; Zhang et al., 2020a).

Graph Representation
Building a graph to incorporate structured information from the instruction and environment observations provides explicit semantic relations to guide navigation. A graph neural network may encode the relation between text and vision to better interpret the context information (Hong et al., 2020a; Deng et al., 2020). The graph can record location information during navigation, which can be used to predict the most likely trajectory (Anderson et al., 2019a) or a probability distribution over the action space (Deng et al., 2020). When connected with prior exploration, an overview graph of the navigable environment (Chen et al., 2021a) can be built to improve navigation interpretation.

Memory-augmented Model
Information accumulates as the agent navigates, and it is inefficient to use all of it directly. A memory structure helps the agent effectively leverage the navigation history. Some solutions leverage memory modules such as LSTMs or recurrently utilize informative states (Hong et al., 2021), which can be implemented relatively easily but may struggle to remember features from the beginning of the path as path length increases. Another solution is to build a separate memory model to store the relevant information (Zhu et al., 2020c; Lin et al., 2021; Nguyen and Daumé III, 2019). Notably, by hierarchically encoding a single view, a panorama, and then all panoramas in history, HAMT (Chen et al., 2021b) successfully utilizes the full navigation history for decision-making.

Auxiliary Tasks
Auxiliary tasks help the agent better understand the environment and its own status without extra labels. From the machine learning perspective, an auxiliary task is usually achieved in the form of an additional loss function. The auxiliary task could, for example, explain its previous actions, or predict information about future decisions (Zhu et al., 2020a). Auxiliary tasks could also involve the current mission such as current task accomplishment, and vision & instruction alignment (Ma et al., 2019a;Zhu et al., 2020a). Notably, auxiliary tasks are effective when adapting pretrained representations for VLN (Huang et al., 2019b).
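To make the "additional loss function" view concrete, here is a minimal sketch that adds one auxiliary term to a navigation loss. The choice of a progress-estimation auxiliary task, the squared-error form, the function names, and the weight `aux_weight` are all assumptions made for illustration and do not reproduce any cited method. The progress target is assumed to be derived from the trajectory itself (for example, fraction of the path completed), so no extra annotation is needed.

```python
import math
from typing import Sequence


def cross_entropy(action_probs: Sequence[float], expert_action: int) -> float:
    """Navigation loss: negative log-likelihood of the expert action."""
    return -math.log(max(action_probs[expert_action], 1e-12))


def progress_loss(predicted: float, actual: float) -> float:
    """Hypothetical auxiliary loss: squared error on estimated task progress."""
    return (predicted - actual) ** 2


def total_loss(
    action_probs: Sequence[float],
    expert_action: int,
    predicted_progress: float,
    actual_progress: float,
    aux_weight: float = 0.5,  # assumed trade-off weight
) -> float:
    """Main navigation loss plus a weighted auxiliary term."""
    nav = cross_entropy(action_probs, expert_action)
    aux = progress_loss(predicted_progress, actual_progress)
    return nav + aux_weight * aux
```

Setting `aux_weight` to zero recovers plain action supervision, which makes it easy to ablate the auxiliary task.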

Action Strategy Learning
With many possible action choices and a complicated environment, action strategy learning provides a variety of methods to help the agent decide on the best actions.

Reinforcement Learning
VLN is a sequential decision-making problem and can naturally be modeled as a Markov decision process. Reinforcement Learning (RL) methods have therefore been proposed to learn better policies for VLN tasks. A critical challenge for RL methods is that VLN agents only receive the success signal at the end of the episode, so it is difficult to know which actions to attribute success to, and which to penalize. To address the ill-posed feedback issue, Wang et al. (2019, 2020c) propose the RCM model to enforce cross-modal grounding both locally and globally, with a goal-oriented extrinsic reward and an instruction-fidelity intrinsic reward. He et al. (2021) propose to utilize the local alignment between the instruction and critical landmarks as the reward. Evaluation metrics such as CLS (Jain et al., 2019) or nDTW (Ilharco et al., 2019) can also provide an informative reward signal (Landi et al., 2020), and natural language may also provide suggestions for reward (Fu et al., 2019).
To model the dynamics in the environment, Wang et al. (2018) leverage model-based reinforcement learning to predict the next state and improve generalization in unseen environments. Zhang et al. (2020a) find that recursively alternating the learning schemes of imitation and reinforcement learning improves performance.
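As an illustration of how a path-fidelity metric could be turned into a dense reward, the sketch below computes nDTW between a reference path and an agent path. It assumes the commonly stated form nDTW = exp(-DTW(R, Q) / (|R| * d_th)), where d_th is the success threshold; it also assumes points in Euclidean space, whereas VLN evaluations typically use shortest-path distances on the navigation graph. Function names and the default threshold are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def dtw(reference: List[Point], query: List[Point]) -> float:
    """Dynamic time warping cost between two paths (Euclidean distance)."""
    n, m = len(reference), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])
            # Extend the cheapest of the three admissible alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def ndtw(reference: List[Point], query: List[Point], threshold: float = 3.0) -> float:
    """Normalized DTW in (0, 1]; 1.0 means the paths align perfectly."""
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))
```

One possible use as a reward, again only a sketch, is the change in `ndtw` between consecutive steps of the agent's partial path.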

Exploration during Navigation
Exploring and gathering environmental information while navigating provides a better understanding of the state space. Student-forcing is a frequently used strategy, where the agent keeps navigating based on sampled actions and is supervised by the shortest-path action (Anderson et al., 2018b).
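The student-forcing idea can be sketched on a toy navigation graph: the agent moves according to actions sampled from its own policy, while the supervision at every visited node is the next node on a shortest path to the goal. The graph format, the `uniform_policy` stand-in, and the convention that choosing the current node means "stop" are assumptions for illustration only; the graph is assumed to be undirected.

```python
import random
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

Graph = Dict[str, List[str]]


def shortest_path_next(graph: Graph, node: str, goal: str) -> str:
    """Next node on a shortest path from `node` to `goal` (itself at the goal)."""
    if node == goal:
        return node  # the supervised action at the goal is to stop
    parent = {goal: goal}
    queue = deque([goal])
    while queue:  # BFS outward from the goal
        cur = queue.popleft()
        if cur == node:
            return parent[cur]  # neighbor one step closer to the goal
        for nb in graph[cur]:
            if nb not in parent:
                parent[nb] = cur
                queue.append(nb)
    raise ValueError("goal is unreachable from node")


def uniform_policy(node: str, candidates: List[str]) -> List[float]:
    """Hypothetical stand-in for a learned policy: uniform action weights."""
    return [1.0] * len(candidates)


def student_forcing_rollout(
    graph: Graph,
    start: str,
    goal: str,
    policy: Callable[[str, List[str]], List[float]],
    max_steps: int = 10,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, str]]:
    """Return (visited node, shortest-path target) pairs for training."""
    rng = rng or random.Random(0)
    node, supervision = start, []
    for _ in range(max_steps):
        candidates = graph[node] + [node]  # moving to `node` itself means stop
        supervision.append((node, shortest_path_next(graph, node, goal)))
        action = rng.choices(candidates, weights=policy(node, candidates), k=1)[0]
        if action == node:
            break
        node = action  # follow the sampled action, not the expert action
    return supervision
```

Because the agent follows its own samples, it visits off-route states and still receives a corrective label there, which is the property that distinguishes this scheme from always following the expert.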
There is a tradeoff between exploration and exploitation: with more exploration, the agent achieves better performance at the cost of a longer path and longer duration, so the model needs to determine when and how deeply to explore (Wang et al., 2020a). After gathering the local information, the agent needs to decide which step to choose, or whether to backtrack (Ke et al., 2019). Notably, Koh et al. (2021) design Pathdreamer, a visual world model that synthesizes visual observations at future viewpoints without actually looking ahead.

Navigation Planning
Planning future navigation steps leads to a better action strategy. From the visual side, predicting the waypoints (Krantz et al., 2021) or the next state and reward (Wang et al., 2018), generating future observations (Koh et al., 2021), and incorporating neighbor views (An et al., 2021) have proven effective. Recognizing and stopping at the correct location also reduces navigation costs (Xiang et al., 2020). The natural language instruction also contains landmarks and direction clues that help plan detailed steps. Anderson et al. (2019b) predict the forthcoming events based on the instruction, which are used to predict actions with a semantic spatial map. Kurita and Cho (2020) formulate VLN as a generative approach where a language model is used to compute the distribution over all possible instructions. The instruction may also be used to tag navigation and interaction milestones which the agent needs to complete step by step (Raychaudhuri et al., 2021; Song et al., 2022).

Asking for Help
An intelligent agent asks for help when uncertain about the next action (Nguyen et al., 2021b). Action probabilities or a separately trained model (Chi et al., 2020; Zhu et al., 2021e; Nguyen et al., 2021a) can be leveraged to decide whether to ask for help. Using natural language to converse with the oracle covers a wider problem scope than sending a signal. Both rule-based methods (Padmakumar et al., 2021) and neural-based methods (Roman et al., 2020; Nguyen et al., 2021a) have been developed to build navigation agents with dialog ability. Meanwhile, for tasks (Thomason et al., 2019b; Padmakumar et al., 2021) that do not provide an oracle agent to answer questions in natural language, researchers also need to build such an oracle, for example a rule-based one (Padmakumar et al., 2021).
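One simple way to use action probabilities for this decision is to threshold the entropy of the policy's action distribution. The sketch below is an assumption-laden illustration: the normalized-entropy criterion, the function names, and the 0.9 threshold are not taken from the cited works, which may instead train a separate model.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_ask_for_help(action_probs: Sequence[float], threshold: float = 0.9) -> bool:
    """Ask when the policy is close to uniform over its candidate actions.

    Entropy is divided by log(number of actions) so that the assumed
    threshold is comparable across viewpoints with different numbers
    of navigable neighbors.
    """
    max_entropy = math.log(len(action_probs))
    if max_entropy == 0:
        return False  # a single candidate action leaves nothing to ask about
    return entropy(action_probs) / max_entropy > threshold
```

A confident distribution such as `[0.97, 0.01, 0.01, 0.01]` does not trigger a request, while a uniform one does.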

Data-centric Learning
Compared with previously discussed works that focus on building a better VLN agent structure, data-centric methods aim to utilize the existing data more effectively or to create synthetic data.

Trajectory-Instruction Augmentation
Augmented path-instruction pairs can be used in VLN directly. Currently the common practice is to train a speaker module to generate instructions given a navigation path (Fried et al., 2018). The generated data vary in quality (Zhao et al., 2021; Huang et al., 2019a). Therefore, an alignment scorer (Huang et al., 2019b) or adversarial discriminator (Fu et al., 2020) can select high-quality pairs for augmentation. A style transfer module may also improve instruction quality by adapting instructions from the source domain (Zhu et al., 2021d).

Environment Augmentation
Generating more environment data not only helps generate more trajectories, but also alleviates the problem of overfitting to seen environments. Randomly masking the same visual feature across different viewpoints (Tan et al., 2019) or simply splitting the house scenes and re-mixing them (Liu et al., 2021) can create new environments, which can further be used to generate more trajectory-instruction pairs (Fried et al., 2018). Training data may also be augmented by replacing some visual features with counterfactual ones (Parvaneh et al., 2020).
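The idea of masking the same visual feature across viewpoints can be sketched as follows: one mask over feature dimensions is sampled per environment and applied to every viewpoint, so the masked environment stays internally consistent. The plain-list feature format, the function name, and the drop probability are assumptions; rescaling of the kept features, as in standard dropout, is omitted for simplicity.

```python
import random
from typing import List, Optional


def environmental_dropout(
    view_features: List[List[float]],
    drop_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Zero out the same feature dimensions in every viewpoint.

    `view_features` holds one feature vector per viewpoint of a single
    environment; all vectors are assumed to have the same length.
    """
    rng = rng or random.Random()
    dim = len(view_features[0])
    # Sample the mask once, then share it across all viewpoints.
    keep = [rng.random() >= drop_prob for _ in range(dim)]
    return [
        [x if k else 0.0 for x, k in zip(features, keep)]
        for features in view_features
    ]
```

Sampling a fresh mask per viewpoint would instead act like ordinary feature dropout and would not simulate a different, consistent environment.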

Curriculum Learning
Curriculum learning (Bengio et al., 2009) gradually increases the task's difficulty during the training process. The instruction length can serve as a metric for task difficulty. BabyWalk (Zhu et al., 2020b) keeps increasing the instruction length of training samples during the training process. Attributes of the trajectory may also be used to rank task difficulty. Zhang et al. (2021) rearrange the R2R dataset using the number of rooms each path traverses. They find that curriculum learning helps smooth the loss landscape and find a better local optimum.
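A curriculum based on instruction length can be sketched as a sequence of growing training pools, ordered from short to long instructions. The sample format (a dict with an `"instruction"` string), whitespace tokenization, the cumulative-pool design, and the number of stages are assumptions for illustration and are not the schedule used by any cited method.

```python
from typing import Dict, List


def curriculum_stages(samples: List[Dict[str, str]], num_stages: int = 3) -> List[List[Dict[str, str]]]:
    """Return cumulative training pools of increasing difficulty.

    Difficulty is approximated by the number of whitespace-separated
    words in the instruction; stage k contains the easiest k/num_stages
    fraction of the data (rounded up).
    """
    ordered = sorted(samples, key=lambda s: len(s["instruction"].split()))
    stages = []
    for stage in range(1, num_stages + 1):
        # Integer ceiling of len(ordered) * stage / num_stages.
        cutoff = (len(ordered) * stage + num_stages - 1) // num_stages
        stages.append(ordered[:cutoff])
    return stages
```

Swapping the sort key for a trajectory attribute, such as the number of rooms traversed, yields the other difficulty ranking mentioned above without changing the rest of the function.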

Multitask Learning
Different VLN tasks can benefit from each other by cross-task knowledge transfer. Wang et al. (2020d) propose an environment-agnostic multitask navigation model for both VLN and Navigation from Dialog History tasks (Thomason et al., 2019b). Chaplot et al. (2020) propose an attention module to train a multitask navigation agent to follow instructions and answer questions (Wijmans et al., 2019a).

Instruction Interpretation
A trajectory instruction interpreted multiple times in different ways may help the agent better understand its objective. LEO (Xia et al., 2020) leverages and encodes all the instructions with a shared set of parameters to enhance textual understanding. LWIT (Nguyen et al., 2021c) interprets the instructions to make clear which class of objects to interact with. Shorter, more concise instructions provide clearer guidance for the agent than longer, semantically entangled instructions; thus Hong et al. (2020b) break long instructions into shorter ones, allowing the agent to track progress and focus on each atomic instruction individually.

Prior Exploration
Good performance in seen environments often does not generalize to unseen environments (Hu et al., 2019; Parvaneh et al., 2020; Tan et al., 2019). Prior exploration methods allow the agent to observe and adapt to unseen environments, bridging the performance gap between seen and unseen environments. Wang et al. (2019) introduce a self-supervised imitation learning method to learn from the agent's own past good behaviors. The navigation path that a matching critic judges to best align with the instruction is used to update the agent. Tan et al. (2019) leverage the testing environments to sample and augment paths for adaptation. Fu et al. (2020) propose environment-based prior exploration, where the agent can only explore the particular environment in which it is deployed. When a graph is used, prior exploration may construct a map or overview of the unseen environment to provide explicit guidance for navigation (Chen et al., 2021a; Zhou et al., 2021).

Related Visual-and-Language Tasks
This paper focuses on Vision-and-Language Navigation tasks with an emphasis on photo-realistic environments. A 2D map may also be a useful virtual environment for navigation tasks (Vogel and Jurafsky, 2010; Chen and Mooney, 2011; Paz-Argaman and Tsarfaty, 2019). Synthetic environments may also be a substitute for realistic environments (MacMahon et al., 2006; Blukis et al., 2020). Tellex et al. (2011) propose to instantiate a probabilistic graphical model for natural language commands in robotic navigation and mobile manipulation.
In VLN, an agent needs to follow the given instruction and may even ask for assistance in human language. An agent in Visual Navigation tasks is usually not required to understand information from the textual modality. Visual Navigation (Zhu et al., 2021b) is the problem of navigating an agent from its current location to find the goal target. Researchers have achieved success in both simulated environments (Zhu et al., 2017; Mirowski, 2019) and real environments (Mirowski et al., 2018).

Conclusion and Future Directions
In this paper, we discuss the importance of VLN agents as a part of society, how their tasks vary as a function of communication level versus task objective, and how different agents may be evaluated. We broadly review VLN methodologies and categorize them. This paper only discusses these issues broadly at an introductory level. In reviewing these papers, we can see the immense progress that has already been made, as well as directions in which this research topic can be expanded.
Current methods usually do not explicitly utilize external knowledge such as objects and general house descriptions in Wikipedia. Incorporating knowledge also improves the interpretability and trust of embodied AI. Moreover, several current navigation agents learn which direction to move and what to interact with, but there is a last-mile problem of VLN: how to interact with objects. Anderson et al. (2018b) asked whether a robot could learn to "Bring me a spoon"; new research may ask how a robot can learn to "Pick up a spoon". The environments also lack diversity: most indoor VLN data consists of American houses, but never warehouses or hospitals, the places where these agents may be of most use.
Below we detail additional future directions.

Collaborative VLN
Current VLN benchmarks and methods predominantly focus on tasks where only one agent navigates, yet complicated real-world scenarios may require several robots collaborating. Multi-agent VLN tasks require development in swarm intelligence, information communication, and performance evaluation. MeetUp! (Ilinykh et al., 2019) is a two-player coordination game where players move in a visual environment to find each other. VLN studies the relationship between the human and the environment in Figure 1, yet here humans are oracles simply observing (but not acting on) the environment. Collaboration between humans and robots is crucial for them to work together as teams (e.g., as personal assistants or helping in construction). Future work may target collaborative VLN between multiple agents or between humans and agents.

Simulation to Reality
There is a performance loss when agents are transferred to real-life robot navigation (Anderson et al., 2020). Real robots function in continuous space, but most simulators only allow agents to "hop" through a pre-defined navigation graph, which is unrealistic for three reasons (Krantz et al., 2020). Navigation graphs assume: (1) perfect localization, whereas in the real world it is a noisy estimate; (2) oracle navigation, whereas real robots cannot "teleport" to a new node; (3) known topology, whereas in reality an agent may not have access to a preset list of navigable nodes. Continuous implementations of realistic environments may contain patches of the images, be blurred, or have parallax errors, making them unrealistic. A simulation that is based on both a 3D model and realistic imagery could improve the match between virtual sensors (in simulation) and real sensors. Lastly, most simulators assume a static environment only changed by the agent. This does not account for other dynamics such as people walking or objects moving, nor does it account for changing lighting conditions through the day. VLN environments with probabilistic transition functions may also narrow the gap between simulation and reality.

Ethics & Privacy
During both training and inference, VLN agents may observe and store sensitive information that can be leaked or misused. Effective navigation with privacy protection is crucially important. Relevant areas such as federated learning (Konečný et al., 2016) or differential privacy (Dwork et al., 2006) could also be studied in the VLN domain to preserve the privacy of training and inference environments.

Multicultural VLN
VLN lacks diversity in 3D environments: most outdoor VLN datasets use Google Street View recorded in major American cities, but lack data from developing countries. Agents trained on American data face potential generalization problems in other city or housing layouts. Future work should explore more diverse environments across multiple cultures and regions. Multilingual VLN datasets (Yan et al., 2020; Ku et al., 2020) could be good resources to study multicultural differences from the linguistic perspective.

A Dataset Details
Table 2 provides more information about the datasets. Compared with the number of datasets, the number of simulators is limited. More specifically, most indoor datasets are based on Matterport3D and most outdoor datasets are based on Google Street View. Also, more datasets cover indoor environments than outdoor environments. Outdoor environments are usually more complex and contain more objects than indoor environments.

B Simulator
The virtual features of a dataset are closely tied to the simulator in which it is built. Here we summarize the simulators most frequently used in creating VLN datasets.
House3D is a realistic virtual 3D environment built on the SUNCG (Song et al., 2017) dataset. An agent in the environment has access to first-person view RGB images, together with semantic/instance masks and depth information.
The Matterport3D simulator (Anderson et al., 2018b) is a large-scale visual reinforcement learning environment for research on embodied AI, based on the Matterport3D dataset. Matterport3D contains various indoor scenes, including houses, apartments, hotels, offices, and churches. An agent can navigate between viewpoints along a pre-defined graph. Most indoor VLN datasets, such as R2R and its variants, are based on the Matterport3D simulator.
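To make the notion of navigating "between viewpoints along a pre-defined graph" concrete, the following is a minimal sketch under stated assumptions. The viewpoint names, the `NAV_GRAPH` dictionary, and the `step`/`follow` helpers are hypothetical illustrations and are not part of the Matterport3D simulator API. The optional `slip_prob` parameter is one possible reading of the probabilistic transition functions mentioned under Simulation to Reality: with that probability, the agent ends up at a random neighbouring viewpoint instead of the intended one.

```python
import random

# Hypothetical toy navigation graph: viewpoint id -> neighbouring viewpoint ids.
NAV_GRAPH = {
    "hall": ["kitchen", "stairs"],
    "kitchen": ["hall"],
    "stairs": ["hall", "bedroom"],
    "bedroom": ["stairs"],
}


def step(graph, current, target, slip_prob=0.0, rng=random):
    """Hop from `current` to the neighbouring viewpoint `target`.

    With slip_prob == 0 the hop is deterministic, as in graph-based
    simulators. Otherwise, with probability slip_prob, the agent lands on
    a uniformly random neighbour of `current` (assumed noise model).
    """
    if target not in graph[current]:
        raise ValueError(f"{target!r} is not adjacent to {current!r}")
    if rng.random() < slip_prob:
        return rng.choice(graph[current])
    return target


def follow(graph, start, targets, slip_prob=0.0, seed=None):
    """Execute a planned sequence of hops and return the visited path.

    Execution stops early if a slip makes the next planned target
    unreachable from the viewpoint the agent actually occupies.
    """
    rng = random.Random(seed)
    node, path = start, [start]
    for target in targets:
        if target not in graph[node]:
            break
        node = step(graph, node, target, slip_prob, rng)
        path.append(node)
    return path


if __name__ == "__main__":
    print(follow(NAV_GRAPH, "hall", ["stairs", "bedroom"]))
    print(follow(NAV_GRAPH, "hall", ["stairs", "bedroom"], slip_prob=0.3, seed=0))
```

With `slip_prob=0.0` the agent always reaches the planned viewpoints, which mirrors the perfect-localization and oracle-navigation assumptions of graph-based simulators; a non-zero value lets the same plan fail partway through.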
Habitat (Savva et al., 2019; Szot et al., 2021) is a 3D simulation platform for training embodied AI in 3D physics-enabled scenarios. Compared with other simulation environments, Habitat 2.0 (Szot et al., 2021) stands out for its system response speed. Habitat has the following datasets built in: Matterport3D (Chang et al., 2017), Gibson, and Replica (Straub et al., 2019).
AI2-THOR (Kolve et al., 2017) is a near photo-realistic 3D indoor simulation environment in which agents can navigate and interact with objects. Its object interaction functionality supports building datasets that require object interaction, such as ALFRED (Shridhar et al., 2020).
Gibson is an interactive environment for real-world perception with complex semantics. Each viewpoint has a set of RGB panoramas with global camera poses and reconstructed 3D meshes. The Matterport3D dataset is also integrated into the Gibson simulator.
House3D converts SUNCG's static environment into a virtual environment, where the agent can navigate with physical constraints (e.g., it cannot pass through walls or objects).
LANI (Misra et al., 2018) is a 3D simulator built on the Unity3D platform. The environment in LANI is a fenced, square, grass field containing randomly placed landmarks. An agent needs to navigate between landmarks following a natural language instruction. Drone navigation tasks (Blukis et al., 2018, 2019) are also built on LANI.
Currently, most datasets and simulators focus on indoor navigable scenes, partly because the increased complexity of outdoor scenes makes it difficult to build a photo-realistic outdoor 3D simulator.
Google Street View, an online API that is integrated with Google Maps, is composed of billions of realistic street-level panoramas. It has been frequently used to create outdoor VLN tasks since the development of TOUCHDOWN (Chen et al., 2019).

C Room-to-Room Leaderboard
Room-to-Room (R2R) (Anderson et al., 2018b) is the benchmark used most frequently for evaluating different methods. Here we collect all the performance metrics reported in the corresponding papers and on the official R2R leaderboard. Since beam search explores more routes, and since prior exploration provides additional observations of the test environment, results obtained in these settings cannot be directly compared with those of other methods.
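As a rough illustration of how leaderboard-style numbers can be computed, the sketch below implements two commonly reported navigation metrics, Success Rate (SR) and Success weighted by Path Length (SPL), under their usual definitions. The episode field names (`final_dist`, `shortest_len`, `path_len`) and the example values are hypothetical, and the 3 m success threshold is an assumption based on common R2R practice rather than something stated in this appendix.

```python
def success_rate(episodes, threshold=3.0):
    """Fraction of episodes that stop within `threshold` metres of the goal."""
    return sum(e["final_dist"] <= threshold for e in episodes) / len(episodes)


def spl(episodes, threshold=3.0):
    """Success weighted by Path Length.

    Each successful episode contributes shortest_len / max(path_len,
    shortest_len); failed episodes contribute 0.
    """
    total = 0.0
    for e in episodes:
        if e["final_dist"] <= threshold:
            total += e["shortest_len"] / max(e["path_len"], e["shortest_len"])
    return total / len(episodes)


if __name__ == "__main__":
    # Hypothetical episodes, not taken from any leaderboard entry.
    episodes = [
        {"final_dist": 1.0, "shortest_len": 10.0, "path_len": 10.0},  # direct success
        {"final_dist": 2.0, "shortest_len": 10.0, "path_len": 40.0},  # success after a long detour
        {"final_dist": 5.0, "shortest_len": 10.0, "path_len": 10.0},  # failure
    ]
    print(f"SR  = {success_rate(episodes):.3f}")
    print(f"SPL = {spl(episodes):.3f}")
```

The second episode counts fully toward SR but only a quarter toward SPL, which shows how a method that traverses many more routes can keep a high success rate while its path-length-weighted score drops.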