Aerial Vision-and-Dialog Navigation

The ability to converse with humans and follow natural language commands is crucial for intelligent unmanned aerial vehicles (a.k.a. drones). It can relieve people's burden of holding a controller all the time, allow multitasking, and make drone control more accessible for people with disabilities or with their hands occupied. To this end, we introduce Aerial Vision-and-Dialog Navigation (AVDN), the task of navigating a drone via natural language conversation. We build a drone simulator with a continuous photorealistic environment and collect a new AVDN dataset of over 3k recorded navigation trajectories with asynchronous human-human dialogs between commanders and followers. The commander provides the initial navigation instruction and further guidance on request, while the follower navigates the drone in the simulator and asks questions when needed. During data collection, followers' attention on the drone's visual observation is also recorded. Based on the AVDN dataset, we study the tasks of aerial navigation from (full) dialog history and propose an effective Human Attention Aided Transformer model (HAA-Transformer), which learns to predict both navigation waypoints and human attention.


Introduction
Drones have been widely adopted for many applications in our daily life, from personal entertainment to professional use. Compared with ground robots, they have the advantages of high mobility and the ability to observe large areas. However, controlling an aerial robot is more complex because an extra degree of freedom, altitude, is involved. To control a drone, people often need to hold a controller all the time, so it is essential to create a hands-free control experience for drone users and develop an intelligent drone that can complete tasks simply by talking with humans. Such a drone can lower the barrier of drone control for users with disabilities or with their hands occupied by activities such as taking photos, writing, etc. Therefore, this work introduces Aerial Vision-and-Dialog Navigation (AVDN), aiming to develop an intelligent drone that can converse with its user to fly to the expected destination. As shown in Figure 1, the user (commander) provides instructions, and the aerial agent (follower) follows the instructions and asks questions when needed. The past visual trajectory is also provided along with each question, which frees the commander from monitoring the drone all the time and minimizes the burden of drone control. In this free-form dialog, potential ambiguities in the instruction can be gradually resolved through the further instructions provided by the commander upon request.
To implement and evaluate the AVDN task, we build a photorealistic simulator with a continuous state space to simulate a drone flying with its onboard camera pointing straight downward. We then collect an AVDN dataset of 3,064 aerial navigation trajectories with human-human dialogs, where crowdsourcing workers play the commander role and drone experts play the follower role, as illustrated in Figure 1. Moreover, we also collect the attention of human followers over the aerial scenes for a better understanding of where humans ground navigation instructions.
Based on our AVDN dataset, we introduce two challenging navigation tasks, Aerial Navigation from Dialog History (ANDH) and Aerial Navigation from Full Dialog History (ANDH-Full). Both tasks focus on predicting navigation actions that can lead the agent to the destination area; the difference is that ANDH-Full presents the agent with the full dialog and requires it to reach the final destination (Kim et al., 2021), while ANDH evaluates the agent's completion of the sub-trajectory within a dialog round given the previous dialog information (Thomason et al., 2020).
The proposed tasks open new challenges of sequential action prediction in a large continuous space and natural language grounding on photorealistic aerial scenes. We propose a sequence-to-sequence Human Attention Aided Transformer model (HAA-Transformer) for both tasks. The HAA-Transformer model predicts waypoints to reduce the complexity of the search space and learns to stop at the desired location. More importantly, it is jointly trained to predict human attention from the input dialog and visual observations, and thus learns where to look during inference. Experiments on our AVDN dataset show that multitask learning is beneficial and human attention prediction improves navigation performance. The main contributions are summarized as follows: • We create a new dataset and simulator for aerial vision-and-dialog navigation. The dataset includes over 3K aerial navigation trajectories with human-human dialogs.
• We introduce ANDH and ANDH-Full tasks to evaluate the agent's ability to understand natural language dialog, reason about aerial scenes, and navigate to the target location in a continuous photorealistic aerial environment.
• We propose an HAA-Transformer model as the baseline for ANDH and ANDH-Full. Besides predicting the waypoint navigation actions, HAA-Transformer also learns to predict the attention of the human follower along the navigation trajectory. Experiments on our AVDN dataset validate the effectiveness of the HAA-Transformer model.

Related work
Vision-and-Language Navigation Vision-and-Language Navigation (VLN) is an emerging multimodal task that studies the problem of using both language instructions and visual observations to predict navigation actions. We compare related datasets with our AVDN dataset; many of them are set in indoor environments where the agent needs to follow language instructions or dialogs to finish household tasks. Besides the indoor environment, some VLN datasets work on the more complex outdoor environment, such as the Touchdown dataset (Chen et al., 2019) and the modified LANI dataset (Misra et al., 2018). The work of Blukis et al. (2019) is similar to ours in that both use drones. However, its synthetic environment has a gap from realistic scenes, and it ignores the control of the drone's altitude; such navigation is oversimplified and remains far from real-world navigation in terms of both language and vision. Our work absorbs the advantages of previous works: we have continuous environments and dialog instructions to better approximate the real-world scenario.
Aerial Navigation Aerial navigation is already an active topic in the field. Some inspiring works (Loquercio et al., 2018; Giusti et al., 2015; Smolyanskiy et al., 2017; Fan et al., 2020; Bozcan and Kayacan, 2020; Majdik et al., 2017; Kang et al., 2019) used pre-collected real-world drone data to tackle aerial vision navigation problems. Due to the difficulty of collecting data and the risk of crashes, other works applied simulation for aerial navigation (Chen et al., 2018; Shah et al., 2017; Chen et al., 2020), where rich ground truths are provided without the need for annotation. However, the modality of language is missing in these prior works, and as a result, their navigation tasks only contain simple goals. In the aerial vision-and-language navigation task of this work, navigation is instead guided by natural dialog.
This enables more diverse and complex navigation and also allows ambiguities to be resolved as the navigation unfolds.

Dataset
The AVDN dataset includes dialogs, navigation trajectories, and the drone's visual observations with human attention; an example is shown in Figure 2. With the help of a newly proposed simulator, we record the AVDN trajectories created by two groups of humans interacting with each other, playing either the commander role or the follower role. To the best of our knowledge, our AVDN dataset is the first aerial navigation dataset based on dialogs.


Simulator
We build a simulator to simulate a drone with a top-down view area. Our simulation environment is a continuous space, so the simulated drone can move continuously to any point within the environment. The drone's visual observations are square images generated by cropping the drone's view area from high-resolution satellite images in the xView dataset (Lam et al., 2018), an open-source large-scale satellite image object detection dataset. In this way, our simulator is capable of providing continuous frames with rich visual features. We also design an interface for our simulator, where the simulated drone can be controlled with a keyboard and the drone's visual observation is displayed in real time with a digital compass. During the control, users can also provide their attention over the displayed images on the interface by clicking the regions they attend to. Last but not least, our simulator is capable of generating trajectory overviews, i.e., the commander's view, showing the starting position, destination area, current view area, and past trajectory (if any), as in Figure 2.
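The cropping step can be sketched in a few lines of NumPy. This is an illustrative, axis-aligned version only: the real simulator also rotates the patch to match the drone's heading, and the function and argument names here are assumptions, not the simulator's API.

```python
import numpy as np

def crop_view_area(sat_img, center_xy, width_px):
    """Crop a square patch (the drone's view area) from a satellite image.

    sat_img: (H, W, C) image array; center_xy: view-area center in pixels;
    width_px: side length of the square view area in pixels.
    """
    cx, cy = center_xy
    half = width_px // 2
    x0, x1 = cx - half, cx + half
    y0, y1 = cy - half, cy + half
    h, w = sat_img.shape[:2]
    # The simulator invalidates actions that would move the view area
    # out of the satellite image boundary; mirror that check here.
    assert 0 <= x0 and x1 <= w and 0 <= y0 and y1 <= h, "view area out of bounds"
    return sat_img[y0:y1, x0:x1]

sat = np.zeros((1000, 1000, 3), dtype=np.uint8)
patch = crop_view_area(sat, (500, 500), 224)  # one drone observation frame
```

A higher drone altitude simply maps to a larger `width_px` before the patch is resized to the fixed observation resolution.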

Dataset Structure
In our AVDN dataset, each navigation trajectory includes time steps T = 0, 1, ..., M, where M ≥ 1. At T = 0, an initial instruction is provided by the commander. Between adjacent time steps, there is a corresponding navigation sub-trajectory. At every time step 0 < T < M, there are questions from the follower and the corresponding answers from the commander. At T = M, the navigation trajectory ends because the destination area Des is reached and claimed by the follower. For details about when a trajectory ends, please refer to Section 3.3 Success Condition.
There are M follower's view area sequences <u^T_0, u^T_1, ..., u^T_{N_T}>, where N_T is the length of the T-th sequence and the view area's center coordinate c^T_i always falls on the trajectory. Therefore, based on each view area, we can retrieve not only the simulated drone's location c_i but also its direction d_i and altitude h_i. Last but not least, for each view area u, there is a corresponding binary human attention mask of the same size. The area in u that corresponds to the white area on the mask is where the follower attended.
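One way to picture this structure is as nested records, sketched below with Python dataclasses. The field names are illustrative assumptions for exposition, not the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ViewArea:
    center: Tuple[float, float]            # c_i: always falls on the trajectory
    direction: float                       # d_i: drone heading
    altitude: float                        # h_i
    attention_mask: Optional[list] = None  # binary mask, same size as the view image

@dataclass
class Trajectory:
    dialog: List[str] = field(default_factory=list)  # one instruction/Q/A per round
    sub_trajectories: List[List[ViewArea]] = field(default_factory=list)

# A trajectory with one dialog round and a one-step sub-trajectory
# (the instruction text is a made-up example).
traj = Trajectory(
    dialog=["[INS] Head toward the gray warehouse near the road."],
    sub_trajectories=[[ViewArea((120.0, 80.0), 90.0, 50.0)]],
)
```

Each sub-trajectory `sub_trajectories[T]` corresponds to the view area sequence between time steps T and T+1.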

Dataset Collection
We collect our dataset with the help of Amazon Mechanical Turk (AMT) workers and drone experts, where AMT workers play the commander role to provide instructions and drone experts play the follower role to control a simulated drone and carry out the instructions. We pay the workers wages of no less than $15/h, and the data collection lasts for 90 days. We adopt an asynchronous data collection method, where the followers and commanders work in turns rather than simultaneously. This not only lowers the cost of data collection but also simulates how aerial vision-and-dialog navigation would work in practice, where the commander does not monitor the follower's actions all the time.
Pipeline Before the start of data collection, we first sample objects in the xView dataset (Lam et al., 2018) as the destination areas and pair them with randomly selected initial follower's view areas within a 1.5km distance. Then, using our simulator, we generate the trajectory overview at time step T = 0, as shown in Figure 2, which becomes the initial commander's view.
During data collection, the initial commander's view is presented to AMT workers for creating the initial instructions. We instruct the AMT workers to write instructions as if they were talking to a drone pilot based on the marked satellite images. Next, we let human drone experts play the follower role, i.e., controlling the simulated drone through our simulator interface, following the instructions, and asking questions if they cannot find the destination area. When the experts stop the current navigation, they can either enter questions into a chatbox, claim the destination with a template sentence, or reject the instruction for bad quality. If the destination is falsely claimed, the simulator generates an auto-hint to let the follower ask some questions. For the questions asked, AMT workers provide further instructions accordingly based on the given navigation information and dialog history. Then, the same drone experts continue playing the follower role, again asking questions when necessary. We iterate the process until the destination is successfully reached and claimed by the follower.

Success Condition
The navigation trajectory is successful only when the destination has been reached at the time the follower claims it. We determine whether the destination is reached in view area u_j by checking the center c_j and computing the Intersection over Union (IoU) between u_j and Des. If c_j is inside Des and the IoU of u_j and Des is larger than 0.4, the destination is regarded as reached in u_j.
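The success condition above can be sketched as follows, assuming axis-aligned boxes given as (x0, y0, x1, y1); the actual simulator may represent rotated view areas differently, so treat this as a simplified check rather than the reference implementation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def destination_reached(view_area, dest, iou_thresh=0.4):
    """Success test: view-area center c_j inside Des AND IoU(u_j, Des) > 0.4."""
    cx = (view_area[0] + view_area[2]) / 2
    cy = (view_area[1] + view_area[3]) / 2
    center_inside = dest[0] <= cx <= dest[2] and dest[1] <= cy <= dest[3]
    return center_inside and iou(view_area, dest) > iou_thresh
```

For example, a 10×10 view area centered on a 8×8 destination passes (IoU 0.64), while the same view area over a 6×6 destination fails (IoU 0.36) even though the center is inside.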

Data Analysis
Our AVDN dataset includes 3,064 aerial navigation trajectories, each with a multi-round natural language dialog. There are two rounds of dialog on average per trajectory, where the number of dialog rounds in a trajectory equals the maximum time step M. The most frequent words are shown in Figure 3a. The recorded AVDN trajectory path length has an average of 287m, and its distribution is shown in Figure 3b. The trajectories and dialogs can be further separated into 6,269 sub-trajectories corresponding to the dialog rounds.
We split our dataset into training, seen-validation, unseen-validation, and unseen-testing sets, where the seen and unseen sets are pre-separated by making sure the area locations of the visual scenes are over 100km apart from each other. We show some statistical analysis across the dataset splits in Table 2. The visual scenes in our dataset come from the xView dataset (Lam et al., 2018), which covers both urban and rural scenes. The average covered area of the satellite images is 1.2km². Rather than providing a target hint in the beginning as in Thomason et al. (2020), the destination must be inferred from the human instructions given by the commander. For example, the commander may give a detailed description of the destination initially, or write a rough instruction first and describe the destination later in the dialog. We also find that there are two ways of describing directions for navigation: egocentric direction descriptions, such as "turn right", and allocentric direction descriptions, such as "turn south". By filtering and categorizing words related to directions, we find that 82% of the dialog rounds use egocentric direction descriptions and 30% of the dialog rounds include allocentric direction descriptions; 17% of the dialog rounds mix both styles, making the instructions complex. This opens a new challenge for developing a language understanding module that can ground both egocentric and allocentric descriptions to navigation actions.

Table 2: Dataset statistics. #dialogs is the number of dialogs, and #words per dialog is the average number of words in each dialog. #areas refers to the number of non-overlapping satellite images used. Destination area-dim is the average dimension of the sampled destination areas. #sub-paths is the number of sub-trajectories, where each sub-trajectory corresponds to one round of dialog. Sub-path length is the average sub-trajectory length.
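The word filtering and categorization step can be sketched as below. The direction lexicons here are small illustrative examples, not the exact lists behind the reported percentages.

```python
# Illustrative direction lexicons (assumed, not the paper's exact word lists).
EGOCENTRIC = {"left", "right", "forward", "backward", "ahead", "behind"}
ALLOCENTRIC = {"north", "south", "east", "west",
               "northeast", "northwest", "southeast", "southwest"}

def direction_styles(dialog_round):
    """Report which direction-description styles appear in one dialog round."""
    words = {w.strip(".,!?").lower() for w in dialog_round.split()}
    return {
        "egocentric": bool(words & EGOCENTRIC),
        "allocentric": bool(words & ALLOCENTRIC),
    }

styles = direction_styles("Turn right, then head south toward the parking lot.")
# both styles present -> this would count as a "mixed" dialog round
```

Running this over all dialog rounds and tallying the three flags yields the kind of egocentric/allocentric/mixed breakdown reported above.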

Task
Following indoor dialog navigation (Thomason et al., 2020; Kim et al., 2021), we introduce an Aerial Navigation from Dialog History (ANDH) task and an Aerial Navigation from Full Dialog History (ANDH-Full) task based on our AVDN dataset and simulator.

Aerial Navigation from Dialog History
The goal of the task is to let the agent predict aerial navigation actions that lead to goal areas G, following the instructions in the dialog history. Specifically, to predict one action â_j of an action sequence between navigation time steps T_i and T_{i+1}, the inputs are the dialogs from navigation time step 0 to T_i and the images from a sequence of view areas <û_0, û_1, ..., û_{j−1}>. A new view area û_j will be generated after â_j takes place. The goal area G depends on the current navigation time step. The predicted view area sequence will be recorded for evaluation against the ground truth view area sequence <u^{T_i}_0, ..., u^{T_i}_{N_{T_i}}>.

Aerial Navigation from Full Dialog History
Compared with the ANDH task, the major difference of the ANDH-Full task is that it adopts the complete dialog history from navigation time steps T = 0, 1, ..., M as input. With the full dialog and visual observations, the agent needs to predict the full navigation trajectory from the starting view area u^0_0 to the destination area Des. ANDH-Full provides complete supervision for agents on a navigation trajectory with a more precise destination description, and includes longer utterances and more complex vision grounding challenges.

Evaluation
Since the agent in both tasks, ANDH and ANDH-Full, needs to generate predicted view area sequences, the evaluation metrics for both tasks are the same. In the evaluation, the center points of all view areas are connected to form the navigation trajectory, and the last view area is used to determine whether the predicted navigation successfully leads to the destination area. The predicted navigation is successful if the IoU between the predicted final view area and the destination area is greater than 0.4. We apply several metrics for evaluation. Success Rate (SR): the number of predicted trajectories regarded as successful, i.e., whose final view area satisfies the IoU requirement, over the total number of predicted trajectories. Success weighted by inverse Path Length (SPL) (Anderson et al., 2018): the Success Rate weighted by the total length of the navigation trajectory. Goal Progress (GP) (Thomason et al., 2020): the distance of the progress made towards the destination area, computed as the Euclidean distance of the trajectory minus the remaining distance from the center of the predicted final view area ĉ_N to the center of the goal area G.
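A minimal sketch of SR and SPL over a batch of episodes, using the standard SPL form from Anderson et al. (2018); the per-episode fields (`success`, `pred_len`, `shortest_len`) are assumed names for exposition.

```python
def success_rate(episodes):
    """SR: fraction of episodes whose final view area meets the IoU > 0.4 test."""
    return sum(ep["success"] for ep in episodes) / len(episodes)

def spl(episodes):
    """SPL: success weighted by shortest-path length over actual path length."""
    total = 0.0
    for ep in episodes:
        if ep["success"]:
            total += ep["shortest_len"] / max(ep["pred_len"], ep["shortest_len"])
    return total / len(episodes)

episodes = [
    {"success": True,  "pred_len": 300.0, "shortest_len": 240.0},
    {"success": False, "pred_len": 500.0, "shortest_len": 260.0},
]
# SR = 0.5; SPL = 0.5 * (240 / 300) = 0.4
```

SPL thus penalizes a successful episode whose flown path is much longer than the shortest feasible one, which matters in AVDN's large continuous action space.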

Model
We propose a Human Attention Aided Transformer (HAA-Transformer) model for the ANDH and ANDH-Full tasks, as shown in Figure 4; it takes multimodal information as input and generates multimodal predictions, including human attention prediction and navigation prediction. Multimodal Encoding The input has three modalities: the drone's direction, images from the drone's visual observation, and the dialog history. At the start of a prediction series, our model uses a BERT encoder (Devlin et al., 2018) to get the language embeddings of the input dialog history, h^l_{1:L}, where special language tokens such as [INS] and [QUE] are added in front of each instruction and question in the dialog. Then, at every time step, all previous drone directions and images from the drone's visual observation are input to the model. A fully connected direction encoder is used to generate direction embeddings h^x_{1:t}, and an xView-pretrained Darknet-53 (Redmon and Farhadi, 2018) with an attention module is used to extract and flatten the visual features into visual embeddings h^v_{1:t}. Finally, similar to the Episodic Transformer (Pashevich et al., 2021), all embeddings from the language, images, and directions are concatenated and input into a multimodal transformer (F_MT) to produce output multimodal embeddings as in Equation 2:

{z^l_{1:L}, z^v_{1:t}, z^x_{1:t}} = F_MT(h^l_{1:L}, h^v_{1:t}, h^x_{1:t})   (2)

Navigation Prediction and Waypoint Control
The navigation outputs of our model come from a fully connected navigation decoder (F_ND) that takes as input the transformer's output embeddings {z^l_{1:L}, z^v_{1:t}, z^x_{1:t}} and generates the predicted waypoint action ŵ and the predicted navigation progress ĝ as in Equation 3:

(ŵ, ĝ) = F_ND(z^l_{1:L}, z^v_{1:t}, z^x_{1:t})   (3)

The predicted waypoint action ŵ is a 3-D coordinate (x̂, ŷ, ĥ), where (x̂, ŷ) corresponds to a position in the current view area u and ĥ corresponds to an altitude. The predicted waypoint also controls the drone's direction, which is kept pointing towards the direction of movement. Therefore, ŵ controls the drone's movement, and as a result, the center, width, and rotation of the next view area are determined by ŵ. As for the navigation progress prediction ĝ, it generates a one-dimensional navigation progress indicator for deciding when to stop (Xiang et al., 2019). If the predicted navigation progress is larger than a threshold, the drone navigation is ended without executing the predicted waypoint action.
Human Attention Prediction A human attention decoder is proposed to predict the human attention mask using the output embeddings z^v_{1:t} from the multi-layer transformer that correspond to the visual inputs. We build the decoder based on He et al. (2019), where the input to the decoder is decoded into an 8×8 representation through a fully connected layer and then linearly interpolated to a mask with the same shape as the input image. The greater the values in the mask, the more likely the human follower attends to the corresponding pixels.
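The decode-then-interpolate step can be sketched in plain NumPy. The fully connected layer is stood in for by a random weight matrix, and the sizes (768-d embedding, 224×224 image) are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def upsample_bilinear(grid, out_h, out_w):
    """Linearly interpolate a small 2-D grid (e.g. 8x8) to (out_h, out_w)."""
    in_h, in_w = grid.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = grid[np.ix_(y0, x0)] * (1 - wx) + grid[np.ix_(y0, x1)] * wx
    bot = grid[np.ix_(y1, x0)] * (1 - wx) + grid[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

rng = np.random.default_rng(0)
z_v = rng.normal(size=768)                 # one visual output embedding z^v_t
w_fc = rng.normal(size=(768, 64)) * 0.01   # stand-in for the fully connected layer
attn_8x8 = (z_v @ w_fc).reshape(8, 8)      # coarse 8x8 attention representation
attn_mask = upsample_bilinear(attn_8x8, 224, 224)  # same shape as the input image
```

The coarse 8×8 bottleneck keeps the decoder small while still producing a full-resolution mask for visualization and the attention loss.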
Training We first train our HAA-Transformer model on the ANDH task and then fine-tune it on the ANDH-Full task, because the ANDH task is relatively easier with a shorter path length. For each task, we conduct the training alternately in teacher-forcing (Williams and Zipser, 1989) and student-forcing modes, where the main difference is whether the model interacts with the simulator using the ground truth actions or the predicted actions. Our model is trained with a sum of losses from both navigation prediction and human attention prediction. First, the predicted waypoint action ŵ and predicted navigation progress ĝ are trained with Mean Squared Error (MSE) loss, supervised by the ground truth w and g computed from the recorded trajectories in our dataset. The navigation prediction loss (L_nav) is shown in Equation 4, where Rot(·) computes the rotation change resulting from the waypoint action.
Second, for human attention prediction training, we apply the modified Normalized Scanpath Saliency (NSS) loss (He et al., 2019). Given a predicted human attention mask P and a ground-truth binary human attention mask Q, the loss is

L_att = −(1/N) Σ_i P̄_i Q_i,  where N = Σ_i Q_i and P̄ = (P − µ(P)) / σ(P)   (5)

Since human attention may not exist in certain view areas, the human attention loss is only computed for view areas with recorded human attention.
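A NumPy sketch of this loss (standardize the prediction, then average it over the ground-truth attended pixels); the skip-when-unannotated behavior is modeled by returning None, which is a presentation choice, not the paper's code.

```python
import numpy as np

def nss_loss(pred, gt_mask):
    """Modified NSS loss: negative mean of the standardized prediction
    over ground-truth attended pixels (gt_mask is binary)."""
    n = gt_mask.sum()                        # N = sum_i Q_i
    if n == 0:
        return None  # no recorded human attention in this view area: skip
    p = (pred - pred.mean()) / pred.std()    # P_bar, zero mean / unit std
    return -(p * gt_mask).sum() / n

pred = np.array([[0.1, 0.9], [0.2, 0.8]])
gt = np.array([[0, 1], [0, 1]])
loss = nss_loss(pred, gt)  # negative, since high predictions align with attention
```

Minimizing the loss pushes the standardized prediction to be high exactly where the follower looked.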

Results
We conduct experiments to study our AVDN dataset and our HAA-Transformer model on the ANDH and ANDH-Full tasks.
Results on the ANDH task and ANDH-Full task As shown in Table 3, we evaluate our HAA-Transformer model along with multiple baseline models on both ANDH and ANDH-Full tasks.
We first create a multimodal Episodic Transformer (E.T.) model (Pashevich et al., 2021) by removing the human attention decoder from our HAA-Transformer, and then build vision-only and language-only uni-modal models by ablating the multimodal E.T. model. For the uni-modal models, direction inputs are maintained while either the vision input or the language input is discarded. A multimodal LSTM-based model is also included as a sequence-to-sequence baseline, which has the same input and output as the multimodal E.T. model. All models, including our HAA-Transformer model, are trained with random initialization. The batch size is 4 for the ANDH task and 2 for the ANDH-Full task. Based on the results, our HAA-Transformer model outperforms the baseline models in both tasks by a large margin. Also, compared with the uni-modal baseline models and a random model outputting random waypoint actions, the multimodal E.T. model achieves overall higher performance, which indicates the importance of learning multimodal information in order to succeed in the ANDH task. Last but not least, we find that the language-only uni-modal model achieves much better performance than the vision-only uni-modal model, showing that the language instructions play a more important role in guiding the navigation in our AVDN dataset.

Impact of Human Attention Prediction Training
We then evaluate the impact of human attention prediction training. Besides improving task performance, human attention prediction also benefits the interpretability of the model by generating visualizable attention predictions paired with navigation predictions. We evaluate the human attention prediction result using the Normalized Scanpath Saliency (NSS) score, which measures the normalized saliency prediction at the ground truth human attention. Our HAA-Transformer model receives NSS scores of 0.84, 0.62, and 0.68, respectively, on the seen validation, unseen validation, and test sets, indicating that the human attention prediction is effective.

Comparison for Different Input Dialog Length
Compared with the ANDH task, the ANDH-Full task requires the model to predict actions that correspond to longer dialogs with more dialog rounds. As a result, more challenges are involved and longer training time is needed than for the ANDH task. During training, we add a prompt of the drone's direction corresponding to the dialog, e.g., "when facing east", to clarify instructions in dialogs that happened at different time steps, especially when egocentric direction descriptions exist. In Table 4, we show our HAA-Transformer model's performance on trajectories with different dialog lengths, i.e., different numbers of dialog rounds. We find that the model's SR and SPL diminish for trajectories whose number of dialog rounds is below or above average, i.e., trajectories containing either too little or too much information. This shows that there is large room for improvement in understanding dialogs of various lengths.

Conclusion
In this work, we introduce a dataset and a simulator for Aerial Vision-and-Dialog Navigation (AVDN). Challenging tasks focusing on navigation are proposed based on our dataset. A Human Attention Aided Transformer (HAA-Transformer) model is designed for both tasks. Our work opens possibilities for further studies to develop stronger models on AVDN that focus not only on navigation prediction but also on question generation. Furthermore, based on our results, future works may investigate using human attention prediction training to help solve VLN problems.

Limitation
This work proposed a dataset, a simulator, tasks, and models for Aerial Vision-and-Dialog Navigation. Since satellite images are needed to simulate the drone's observation, risks of privacy leaking may exist. By using the open-source satellite dataset xView (Lam et al., 2018), we mitigate these risks while still being able to develop a simulator for training our model. Additionally, using satellite images to simulate the top-down visual observation of the drone has the shortcoming of providing only 2D static scenes, while adopting the strength of satellite images, which include rich labels and visual features.

Broader Impact
We recognize the potential ethical problems during dataset collection, where human annotators are involved. The data collection of this project is classified as exempt by the Human Subject Committee via IRB protocols. We utilized the Amazon Mechanical Turk (AMT) website to find workers willing to participate in the project. With AMT, our data collection is constrained by legal terms, and the data collection protocol is under AMT's approval. The agreement signed by both requesters and workers on AMT also ensures a transparent and fair data annotation process and that privacy is well protected.

A HAA-Transformer Model Details
There are around 120M parameters in our HAA-Transformer model. Our model uses a BERT-base encoder (Devlin et al., 2018) with pretrained weights open-sourced on Hugging Face (Wolf et al., 2020) to extract language features of the input dialog history. For the ANDH task, we extract two sets of language embeddings: one where the input is all the previous and current dialog rounds, and one with only the current dialog round for the target sub-trajectory. The language embeddings that include all previous dialog are used to attend to the image features extracted by DarkNet-53, flattening the features to only 768 dimensions per frame. The embeddings with only the current dialog are passed to the multimodal encoder. In the ANDH-Full task, since the agent starts at an initial position with no previous dialog, only one set of language embeddings is extracted and used.
The attention modules used in our HAA-Transformer model and the HAA-LSTM model have the same structure. They generate soft attention based on a dot-product attention mechanism. The inputs are context features and attention features. There is a fully connected layer before the output of the attention module: the context features attended by the attention features are concatenated with the attention features to form the input of the fully connected layer, and the output of this layer is the attention module's output, which has the same shape as the attention features.
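The description above can be sketched in NumPy as follows; the dimensions, the softmax placement, and the stand-in weight matrix are assumptions for illustration, not the trained module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_module(context, query, w_out):
    """Dot-product soft attention followed by a fully connected layer.

    context: (n, d) context features; query: (d,) attention features;
    w_out: (2d, d) stand-in weights for the final fully connected layer.
    """
    scores = softmax(context @ query)          # (n,) soft attention weights
    attended = scores @ context                # (d,) attended context features
    fused = np.concatenate([attended, query])  # (2d,) concat with attention features
    return fused @ w_out                       # (d,) same shape as the query

rng = np.random.default_rng(0)
d = 8
out = attention_module(rng.normal(size=(5, d)), rng.normal(size=d),
                       rng.normal(size=(2 * d, d)))
assert out.shape == (d,)  # output matches the attention-feature shape
```

Keeping the output shape equal to the attention features lets the module drop into either backbone (Transformer or LSTM) without changing the surrounding dimensions.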

A.1 Navigation Progress Prediction
For the navigation progress prediction, we adopt the idea of L2Stop (Xiang et al., 2019) and create a navigation progress predictor to help decide when to stop, which overcomes the problem that the model would otherwise fail to stop at the desired position. The navigation progress predictor is trained with the supervision of the IoU score between the current view area û_{i,j,k} and the destination area. When the IoU is larger than 0, it indicates that the destination area is seen in û_{i,j,k}, and the larger the IoU, the closer û_{i,j,k} is to the destination des_{i,j}. During inference, the predicted navigation stops when the generated navigation progress indicator is less than 0.5.

B HAA-LSTM Model
We also design a Human Attention Aided Multimodal LSTM (HAA-LSTM) model for the experiments in Section 6, as shown in Figure 6, where it takes the same input and output as our HAA-Transformer model. We add the same human attention decoder as in our HAA-Transformer model for human attention prediction training. The language embeddings, visual observations, and direction embeddings are also extracted in the same way.

C Training Details
We train all models on one Nvidia RTX A6000 graphics card. We train all baseline models as well as both the HAA-Transformer model and the HAA-LSTM model for approximately 150k iterations on the ANDH task with a batch size of 4 and a learning rate of 1e-5. For the ANDH-Full task, since it uses the full dialog history as input and thus needs more GPU RAM, we use a batch size of 2 and a learning rate of 5e-6 and train the model for 200k iterations, which takes about 48 hours.

D Simulator Details
We design a simulator to simulate a drone flying with its onboard camera facing straight downward, as in Figure 7a. The simulator uses satellite images from the xView dataset (Lam et al., 2018) for the drone's visual observation, where the observation is a square image patch cropped from the satellite image based on the drone's view area, as in Figure 7b. We argue that by using satellite images, our simulator is capable of providing visual features as rich as those in the real world; some examples are shown in Figure 7c. Additionally, since the satellite images have boundaries that are not adjacent to each other, we prevent the drone's view area from moving out of bounds by automatically invalidating any action that would lead to an out-of-boundary view area. Furthermore, for simplicity, we assume perfect control of the drone's movement; therefore, the drone's current view area is determined by the drone's previous position and navigation action.
During dataset collection, the follower controls the simulated drone through the simulator interface with the keyboard. We define 8 keys for the control with a total of four degrees of freedom (DoFs): 2 DoFs for horizontal movement, 1 DoF for altitude control, and 1 DoF for rotation control. Although our simulator environment is continuous, the control through the interface is discrete for an easier control experience. Every time a key is pressed, the simulated drone moves along the corresponding DoF for a fixed distance; the higher the simulated drone flies, the farther it moves with one key press. Before the follower presses the ESC key to stop the control, he/she can also generate human attention data by using the mouse to left-click on the attended image region shown on the interface. After every left-click, a circle with a radius of 1/10 of the current view area width becomes the attended region and is displayed on the interface. A right-click on a circle removes that region from the attention record.
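The click-to-circle annotation can be sketched as an array-based mask update; the function and argument names are illustrative, not the simulator's code.

```python
import numpy as np

def add_attention_circle(mask, click_xy, view_width):
    """Mark a circular attended region around a left-click.

    mask: binary (H, W) attention mask; click_xy: (x, y) pixel of the click;
    the radius is 1/10 of the current view area width, as in the interface.
    """
    h, w = mask.shape
    radius = view_width / 10
    ys, xs = np.ogrid[:h, :w]
    circle = (xs - click_xy[0]) ** 2 + (ys - click_xy[1]) ** 2 <= radius ** 2
    mask[circle] = 1  # a right-click could clear the same region with mask[circle] = 0
    return mask

mask = np.zeros((100, 100), dtype=np.uint8)
mask = add_attention_circle(mask, (50, 50), view_width=100)  # 10-pixel radius
```

Repeated clicks simply OR more circles into the same mask, matching the interface behavior of accumulating attended regions.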

E Dataset Details and Examples
We provide some details about our dataset with related examples. Each example includes a dialog, sample drone visual observations with human attention, and navigation overviews.

E.1 Human Attention
We record the follower's attention through our simulator interface while the follower is controlling the simulated drone. In each collected navigation trajectory, the attended regions are stored in a list whose order is ignored: regions recorded earlier or later during the navigation are retrieved together when the human attention data is used. This makes the human attention data more accurate, since an area the follower failed to attend to in the current view area is likely to be included at a future time step. Also, because previously attended areas are kept in later view areas, less effort is needed to annotate the attended areas. We find that, on average, 1/7 of the area is attended to in the recorded view areas u_{i,j}.
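Combining this order-agnostic storage with the click-to-circle rule from the interface, the attention annotation for a view area can be rendered as a binary mask. A minimal sketch (the function name and mask representation are assumptions, not the dataset's actual format):

```python
import numpy as np

def attention_mask(view_width, clicks, view_height=None):
    """Build a binary attention mask from recorded left-clicks.

    Each click (cx, cy) creates a circular attended region whose radius
    is 1/10 of the view-area width. Because the click list is
    order-agnostic, the mask is simply the union over all clicks.
    """
    view_height = view_height or view_width
    radius = view_width / 10.0
    ys, xs = np.mgrid[0:view_height, 0:view_width]
    mask = np.zeros((view_height, view_width), dtype=bool)
    for cx, cy in clicks:
        mask |= (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    return mask
```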

E.2 Dialog Structure
The dialogs contained in our AVDN dataset have varying numbers of rounds. Since dialog rounds are split based on the data collection rounds, each dialog round contains only one instruction written by the commander. Figure 8 shows an example of a simple dialog with only one dialog round. However, when the follower cannot follow the initial instruction to find the destination area, questions are raised, and therefore more dialog

[Figure 10 panels: dialog; trajectory overview at the end of the first sub-trajectory; at the end of the second sub-trajectory; at the end of the full trajectory.]
Figure 10: Example of a trajectory with three dialog rounds. There is an incorrect instruction in the second dialog round, where the destination should be described as the second-nearest brown building rather than the nearest one.
In this case, since the instruction is clear and can be followed by the follower, we treat it as an inevitable and acceptable type of mistaken instruction and keep it in our dataset.
rounds will be introduced. Every dialog round starts with an instruction from the human commander and may include one or more utterances from the follower, depending on whether auto-instructions exist.
We provide details about auto-instructions in the next sub-section. Also, when followers write questions, we let them define shortcut keys for frequently used general questions such as "could you further explain it?" and "where should I go?". To avoid templated dialogs, followers are not allowed to use only a shortcut as their question; they must also incorporate their own language.

E.3 Auto-instructions
When the follower claims that the destination is reached, our simulator automatically checks the navigation result using the success condition described in Section 3.3. Auto-instructions are then generated based on whether the destination area is reached successfully. Specifically, when the success condition is met, an auto-instruction of "Yes, you have found it!!!" is appended to the dialog as its end; if the destination is in the center of the view area but the view area is either too large or too small, failing the success condition, the simulator instead provides auto-instructions asking the follower to adjust the drone's altitude and then verifies again whether the success condition is met, as shown in Figure 9.
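The auto-instruction rule above can be sketched as a small decision function. This is a hypothetical helper: the success condition itself lives in Section 3.3, and the altitude-adjustment wordings below are illustrative, not the simulator's exact strings (only "Yes, you have found it!!!" is quoted from the text):

```python
def auto_instruction(dest_centered, success, view_too_large, view_too_small):
    """Return the simulator's auto-instruction, or None if the follower
    must instead ask the commander for further guidance."""
    if success:
        return "Yes, you have found it!!!"  # exact string from the dataset
    if dest_centered and view_too_large:
        return "Please fly lower and check again."   # illustrative wording
    if dest_centered and view_too_small:
        return "Please fly higher and check again."  # illustrative wording
    return None
```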

E.4 Dialog Quality
To ensure that the dialogs in our dataset have good quality, we make efforts during the data collection process and conduct extra examination of the dialog data afterward. During data collection, online workers from Amazon Mechanical Turk (AMT) act as commanders and provide the instructions in the dialogs; compared with the followers, whom we hired to work on-site under our in-person supervision, they have a higher chance of producing low-quality or incorrect language instructions. We developed several strategies to deal with such undesired instructions. First, if the follower, guided by an instruction, navigates the drone in a direction that deviates more than 90 degrees from the ground-truth direction of the destination area, our simulator automatically labels the instruction as incorrect. Labeled instructions are discarded and collected again. Second, since the follower needs to read and understand the instructions, they can report instructions as low-quality or incomprehensible and skip them. Finally, among the remaining instructions not flagged as low-quality or incorrect, it is still possible that some are inaccurate or incorrect due to human mistakes by the AMT workers, as in Figure 10. By manually checking the dialogs and navigation trajectories in randomly selected subsets of our AVDN dataset, we spot only 5 instructions with potential mistakes in 50 dialogs. In those cases, because the follower successfully followed the instruction, we keep the instructions unchanged even if they did not help guide the follower to the destination area. In the real world, an AVDN user could also make mistakes, so this mistake-tolerance strategy brings our dataset closer to real scenarios.
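The 90-degree check in the first strategy reduces to a sign test on a dot product. A minimal sketch under the assumption that directions are represented as 2D vectors (the function name is hypothetical):

```python
def instruction_incorrect(move_vec, gt_vec):
    """Flag an instruction as incorrect when the follower's resulting
    movement deviates more than 90 degrees from the ground-truth
    direction toward the destination area. A negative dot product
    between the two direction vectors means the angle exceeds 90 degrees."""
    dot = move_vec[0] * gt_vec[0] + move_vec[1] * gt_vec[1]
    return dot < 0
```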
We further examine the dialog quality after data collection by analyzing the dialogs. The average number of utterances (human-written instructions and questions) per dialog is 3.1, with a minimum of 1 and a maximum of 7, since each dialog includes at least one instruction written by a human. The average numbers of words written by the commander and the follower are 45 and 19, respectively, and there are about 15 words from auto-instructions. Also, in Figure 11, we show the distribution of the top 30 most frequent words in the commander's and follower's utterances. The distribution varies smoothly across nouns, verbs, adjectives, and prepositions, indicating that our dataset's utterances have rich content and good variety. Last but not least, we manually checked the dialogs in all validation and test sets by visualizing the corresponding navigation trajectories and dialogs, and we observed no major issues.

F Interface for workers in dataset collection
We use help from Amazon Mechanical Turk (AMT) workers and human drone experts during the collection of our Aerial Vision-and-Dialog Navigation (AVDN) dataset, where the AMT workers play the commander role, providing instructions, and the drone experts play the follower role, asking questions and controlling the drone. In this section, we demonstrate the interfaces for both groups of workers, with all the information they receive during the data collection procedure.

F.1 Interfaces for commanders
There are two interfaces for commanders (AMT workers), depending on the data collection round. Each interface includes one trajectory at a time and contains all the information the commander needs to create the instruction. Detailed, step-by-step directions for what needs to be done as a commander are given at the beginning of the interface. The AMT workers write sentences in the Answer field according to the provided information.
In the first round of data collection, the commander writes the initial instruction based on an overview of the AVDN trajectory. As shown in Fig. 12, the satellite image shows the trajectory overview marked with a predefined starting position (the red point, with an arrow showing the drone's direction at the starting position) and a destination area (purple bounding box).
In data collection rounds after the first, the commander is required to give follow-up instructions, i.e., answers, to the questions from the follower. The user interface for the second and subsequent rounds is shown in Fig. 13. Besides all the information shown to the commander in the first round, the commander is also provided with the previous dialog, past trajectories (broken purple line), and the view area at the most recent time step (the current view area, marked with a white bounding box).

F.2 Interface for followers
The follower uses an interface to interact with our simulator, receiving instructions from the commander and controlling the simulated drone. The keyboard simulates the drone controller, with eight keys representing four channels: keys w and s control forward and backward movement, keys a and d control left and right movement, keys q and e control clockwise and anti-clockwise rotation, and keys 1 and 2 control altitude change. After finishing the control, the follower can either claim that the destination is reached or ask questions for more instruction. As in Fig. 14, the interface consists of an image window showing the simulated drone's visual observation and a text window for displaying the previous dialog and inputting the follower's questions. A compass at the top left of the image window shows the orientation of the simulated drone. The red cross in the image window marks the center of the view, helping the follower steer the drone to right above the destination area, and the red corners mark the square that has an IoU of 0.4 with the view area. The follower is instructed to make the destination area appear larger than the area indicated by the red corners in order to finish navigation successfully.
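The IoU quantity behind the red-corner indicator is the standard intersection-over-union of two axis-aligned boxes; a minimal sketch (box format `(x1, y1, x2, y2)` is an assumption):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2),
    the quantity the interface's red corners visualize at the 0.4 level."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```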
[Figure 1 dialog:
Commander: "Hey drone. Go directly east across one road then pass some parked cars and empty land. Destination is a building with a yellow roof."
Follower: "…multiple yellow buildings. Can I see the destination? How to get here?" (agent continues navigation)
Commander: "Yes, the destination is next adjacent building at your position. Please go a little at 4 o'clock."
Follower: "Now I think I have arrived. Am I right?"
Commander: "Yes. You have arrived!" (agent stops navigation)]

Figure 1 :
Figure 1: An example of Aerial Vision-and-Dialog Navigation (AVDN). The user instructs the agent to fly to a destination. During navigation, the agent can ask questions while showing images of its past visual observations and relative trajectory. The user replies at a convenient time to provide further guidance, without having to monitor the agent all the time.

[Figure 2 dialog:
Commander: You are close to your destination, head north and you will reach the complex gray warehouse office.
Follower: Yes, I think I have arrived. Correct?
Auto-hint: Nope, you haven't got there. Ask some more questions.
Follower: I see some warehouses in my view. Am I near the destination? How to go to destination?
Follower: Yes, I think now I have arrived the destination. Am I right?
Auto-hint: Yes you are there!]

Figure 2 :
Figure 2: Example of a trajectory in our AVDN dataset. On the left, the commander's turns and the follower's turns alternate in chronological order. In each turn, dialog utterances are shown, and the follower's turn also shows the navigation process spanning from time step T to T+1, including the follower's observation and attention. On the right are trajectory overviews at different time steps. More examples can be found in the Appendix.
Figure 3: (a) displays the frequent words that appear in the dialogs and (b) shows the path length distribution of our AVDN dataset.

Figure 4 :
Figure 4: Our Human Attention Aided (HAA) model. The output of the model interacts with our simulator to generate the input for the next time step.

Table 3 :
Main results on both ANDH and ANDH-Full tasks, including ablation results on human attention prediction training. Based on the performance comparison, both the Human Attention Aided Multi-modal LSTM (HAA-LSTM) model and our HAA-Transformer model benefit from human attention prediction training. Sub-trajectories for the ANDH task are split into four subsets based on the ground-truth length. In Figure 5, we compare the number of successful sub-trajectories in each subset between models with and without human attention prediction training. Both our HAA-Transformer model and the HAA-LSTM model achieve significant performance improvements on the subsets of longer trajectories. This leads to the conclusion that human attention prediction training benefits navigation prediction, especially for long trajectories, for both the LSTM-based and the Transformer-based model.

Figure 5 :
Figure 5: The impact of human attention prediction training on the success of trajectories of different lengths. Human attention prediction significantly improves navigation performance for longer trajectories.
[Figure 6 diagram labels: DarkNet-53, Attention Position Encoder, Human Attention Decoder, Navigation Decoder, Simulator; outputs: predicted human attention mask, predicted waypoint, predicted navigation …]

Figure 6 :
Figure 6: Human Attention Aided Multi-modal LSTM (HAA-LSTM) model which uses the same input and output as our HAA-Transformer.

Figure 7 :
Figure 7: (a) shows how the simulated drone's visual observation is generated from satellite images in our simulator. We compare the simulated drone's visual observation from satellite images (b) with images from a drone's onboard camera at about 200 m above ground (c).
[Figure 8 dialog:
Commander: Destination is a short rectangular building parallel with the highway at your three o'clock.
Follower: Yes, I think this is the destination. Am I right?
Commander: Yes you have found it!!!]

Figure 8:
Figure 9:
Figure 8: Example of a trajectory with one dialog round.
[Figure 10 dialog:
Commander: Head straight ahead, your destination is in a segregated area.
Follower: Where should I go after I go straight? How does the destination look like?
Commander: Proceed forward direction. The nearest by brown colour building is in your destination.
Follower: Yes, I think this is the destination. Is it correct?
Commander: Nope, you haven't got there. Ask some more questions.
Follower: I have already seen the brown building, how to go to destination?
Commander: Go south to next building.
Follower: Yes, I think this is the destination. Am I right?
Commander: Yes you have found it!!!]

Figure 11 :
Figure 11: Counts of top 50 most frequently used words in commander and follower utterances.

Figure 12 :
Figure 12: Interface for AMT workers (commanders) in the first round of data collection.

Figure 13 :
Figure 13: Interface for AMT workers (commanders) in subsequent rounds of data collection.

Figure 14 :
Figure 14: Interface for human drone experts (followers). The upper window shows the simulated drone's visual observation, and the lower window shows the previous dialog.

Table 1 .
Early VLN datasets such as Anderson et al. (2018); Ku et al. (2020) start with indoor house environments in the Matterport3D simulator (Chang et al., 2017), where the visual scenes are connected on a navigation graph. To simulate continuous state change as in the real world, Krantz et al. (2020) built a 3D continuous environment by reconstructing the scenes based on topological connections, where the agent uses continuous actions during navigation. Some other VLN studies focus on language instructions. Nguyen et al. (2019); Nguyen and Daumé III (2019); Thomason et al. (2020) created datasets where the agent can interact with the user by sending fixed signals or having dialogs. There are also works on synthetic indoor environments, such as Shridhar et al. (2020b); Padmakumar et al. (2021), which use an interactive simulation environment with synthetic views named ALFRED,

Table 4 :
Results of our HAA-Transformer on the ANDH-Full task with respect to dialog length. The more rounds a dialog has, the longer the trajectory and the more challenging the task.