Towards Navigation by Reasoning over Spatial Configurations

We deal with the navigation problem where the agent follows natural language instructions while observing the environment. Focusing on language understanding, we show the importance of spatial semantics in grounding navigation instructions into visual perceptions. We propose a neural agent that uses the elements of spatial configurations and investigate their influence on the navigation agent’s reasoning ability. Moreover, we model the sequential execution order and align visual objects with spatial configurations in the instruction. Our neural agent improves strong baselines on the seen environments and shows competitive performance on the unseen environments. Additionally, the experimental results demonstrate that explicit modeling of spatial semantic elements in the instructions can improve the grounding and spatial reasoning of the model.


Introduction
The ability to understand and follow natural language instructions is critical for intelligent agents to interact with humans and the physical world. One of the recently designed tasks in this direction is Vision-and-Language Navigation (VLN) (Anderson et al., 2018), which requires an agent to carry out a sequence of actions in a photo-realistic simulated environment in response to a sequence of natural language instructions. To accomplish this task, the agent should have three abilities: understanding linguistic semantics, perceiving the visual environment, and reasoning over both modalities (Zhu et al., 2020;Wang et al., 2019). While understanding vision and language are difficult problems by themselves, learning the connection between them without direct supervision makes this task even more challenging (Hong et al., 2020).
To address this challenge, some neural agents establish the connection using an attention mechanism to relate the tokens of a given instruction to the images in a panoramic photo (Anderson et al., 2018; Fried et al., 2018; Ma et al., 2019; Yu et al., 2018). Surprisingly, although those models improve performance, Hu et al. (2019) found that they ignore the visual information. There is no clear evidence that such agents can map the components of the visual environment to the instructions (Hong et al., 2020). Based on these results, recent research has started to improve the agent's reasoning ability by explicitly considering the structure of language and image. On the language side, Hong et al. (2020) annotated fine-grained sub-instructions and their corresponding trajectories and used the co-grounded features of a part of the instruction and the image to predict the next action. On the image side, Hu et al. (2019) induced a high-level object-based visual representation to ground the language into the visual context.
In the same direction, we propose a neural agent, namely Spatial-Configuration-Based Navigation (SpC-NAV), that considers the structure of both modalities, that is, the spatial semantics of the instructions and the objects in the images. We use the notion of Spatial Configuration (Dan et al., 2020) to model the instructions and design a state attention to ensure the execution order of spatial configurations. Then, we utilize the spatial semantic elements, namely the motion indicator, spatial indicator, and landmark, in each spatial configuration to establish the connection with the visual environment. Specifically, we use the similarity score between the landmark representations in the spatial configurations and the object representations in the panoramic images to control the transitions between configurations. Also, we align object representations with the configuration representations, enriched with motion indicator, spatial indicator, and landmark representations, to finally select the navigable image.
Figure 1: Spatial Configuration example. (a) Spatial Configuration Scheme; (b) Spatial Configuration Annotation. The instruction "Move to the table with chair, and stop." can be split into two spatial configurations: "move to the table with chair" and "stop". In configuration1, "move" is the motion indicator, "to" is the spatial indicator, and "table" is the landmark. "table with chair" is a nested spatial configuration of configuration1, in which "table" is the trajector, "with" is the spatial indicator, and "chair" is the landmark. In configuration2, "stop" is the motion indicator.

A spatial configuration is the smallest linguistic unit that describes the location/trans-location of an object with respect to a reference or a path that can be perceived in the environment. It contains fine-grained spatial roles such as motion indicator, landmark, spatial indicator, and trajector. Essentially, each spatial configuration forms a sub-instruction in our setting. Figure 1 shows an example of splitting an instruction into its corresponding spatial configurations and the extracted spatial roles. Previous research argues that representing the semantic structure of the language can improve the reasoning capabilities of deep learning models (Dan et al., 2020). There is relevant work modeling the meaning of spatial semantics in probabilistic models (Kollar et al., 2010; Tellex et al., 2011) and neural models (Regier, 1996; Ghanimifard and Dobnik, 2019). However, its impact on deep learning models for navigation remains an open research problem.

The contributions of this paper are as follows: 1. We consider the spatial semantic structure of the instructions explicitly in terms of spatial configurations and their spatial semantic elements, i.e., spatial/motion indicators and landmarks, to enrich the configuration representations. 2. We introduce a state attention to guarantee that configurations are executed sequentially. Also, we utilize the grounding between the extracted spatial elements and the object representations to help control the transitions between configurations. 3. Our experimental results show that the explicit representation of the semantic elements of spatial configurations improves strong baselines significantly in the seen environments and yields competitive results in the unseen environments.

Related Work
Older studies on navigation, before the deep learning era, mostly used symbolic grounding methods based on parsing the semantics of the instruction and learning probabilistic models. MacMahon et al. (2006) used a parser to associate the linguistic elements in free-form instructions with their corresponding action, location, and object in the environment. Tellex et al. (2011) represented spatial language as a hierarchy of Spatial Description Clauses (SDC) and proposed a discriminative probabilistic graphical model to find the most probable path given the extracted SDCs and the detected visual landmarks. Mei et al. (2016) provided a good overview of the classical work on navigation. However, one of the biggest limitations of those methods is that they required prior linguistic structure and manual annotations.
In recent years, given the new capabilities created by deep learning architectures, the navigation task is extended to the photo-realistic simulated environments (Anderson et al., 2018;Thomason et al., 2019;Chen et al., 2019). Based on this, a Sequence-to-Sequence (Seq2seq) baseline model was proposed by Anderson et al. (2018) to encode the instructions and decode the embeddings to identify the corresponding output action sequence with the observed images. Fried et al. (2018) proposed to train a speaker model to augment the instructions for the follower model. Ma et al. (2019) introduced a visual and textual co-attention mechanism and a progress monitor loss to track the execution progress. Although those agents achieved better performance, the semantic structures on both language and vision sides were ignored.
We aim to exploit both symbolic grounding and neural models in the spatial domain. Regier (1996) designed neurons to learn the meaning of spatial prepositions. Ghanimifard and Dobnik (2019) explored the effects of spatial knowledge in a generative neural language model for image description. We mainly work on incorporating spatial semantics into a neural navigation agent. Hong et al. (2020) recently provided a method to segment a long instruction into sub-instructions. They used a shifting attention module to infer whether the current sub-instruction has been completed. Sub-instruction differs from our work in that they manually aligned the instructions and viewpoints to learn the alignments, while we model spatial semantics to guide the alignment automatically. Moreover, their proposed shifting attention module uses hard attention, with a threshold deciding whether the agent should execute the next sub-instruction, whereas we utilize the grounding between the landmarks and the objects to control the transitions between sub-instructions.

Problem Formulation
In this task, the agent follows an instruction to navigate from a start viewpoint to a goal viewpoint in a photo-realistic environment. Formally, the agent is given a natural language instruction S, which is a sequence of tokens, and {s_1, s_2, · · · } is its corresponding sequence of token embeddings. The agent observes a 360-degree panoramic view of its surrounding scene at the current viewpoint. Here, we follow Ma et al. (2019) to map the n navigable viewpoints to discrete images from the current panoramic view (12 headings and 3 elevations with 30-degree intervals). We obtain n images corresponding to the navigable viewpoints, I = {I_1, I_2, · · · , I_n}. The task is to select the next viewpoint among the navigable viewpoints or the current viewpoint (indicating a stop), and finally to generate the trajectory that takes the agent close to the intended goal location.

Sequence-to-Sequence
We model the agent with an LSTM-based sequence-to-sequence architecture (Sutskever et al., 2014) to control the flow of information, as illustrated in Fig 2. The encoder computes a contextual embedding s̄_j of each token embedding s_j in S by s̄_j = LSTM_encode(s_j). At each step t of navigation, the decoder receives the grounded instruction representation C*_t and the aligned image representation I*_t to update its context h_t by h_t = LSTM_decode([C*_t; I*_t], h_{t−1}). Finally, we predict the probability distribution of the next navigable viewpoint p_t from h_t. We introduce the method to obtain C*_t and I*_t in Section 3.5 and Section 3.6, and the next viewpoint prediction in Section 3.7.

Spatial Configurations Representation
To obtain the configurations in a navigation instruction, we first split the instruction into sentences. Then we design a parser with rules applied on top of an off-the-shelf dependency parser to extract all the verb phrases and noun phrases in each sentence. In general, each configuration contains at most one motion indicator. Since we aim to process instructions and look for motions, we split the sentences at the extracted verb phrases, treated as motion indicators, to obtain spatial configurations. We do not separate nested configurations that have no motion indicator and keep them attached to the dynamic configurations (i.e., the ones with a motion indicator). As shown in Figure 1, "table with chair" is the nested spatial configuration of "move to the table with chair". Here, we only consider the prepositions that are attached to verbs, and merge such spatial indicators with the motion indicators (e.g., "move to"), using them together as the motion indicator. After that, we insert a pseudo delimiter token after each configuration and identify the contained noun phrases as landmarks. Each navigation instruction S is split into m configurations. We re-organize the contextual token embeddings [s̄_1, s̄_2, · · · ] generated by the encoder into an array of spatial configuration representations [C_1, C_2, · · · , C_m], where m is the number of configurations in the instruction. In the i-th configuration representation C_i = [c^i_1, c^i_2, · · · , c^i_P], the j-th element c^i_j is the contextual embedding of the corresponding k-th token in the instruction: c^i_j = s̄_k. The last token of each configuration is always the pseudo delimiter, indexed by P, which contains the most comprehensive context information about the preceding words. Soft attention is widely used to merge a collection of representations V into one by a weighted sum, based on the relevance indicated by their associated key representations K and a query Q, calculated by Eq. 1.
SoftAttn(Q, K, V) = Σ_i softmax(Q W K^T / √d_k)_i · V_i   (1)

where W is a trainable linear mapping and d_k is the dimension of each representation in K. We apply a soft attention to each configuration representation, with the pseudo delimiter representation c^i_P as the query, to obtain the merged configuration representation, calculated by Eq. 2:

C̄_i = SoftAttn(c^i_P, C_i, C_i)   (2)
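As a concrete illustration, the soft attention of Eq. 1 can be sketched in a few lines of NumPy. This is a minimal sketch, not the actual implementation: the dimensions are toy-sized and the trainable mapping W is a random matrix here.

```python
import numpy as np

def soft_attn(query, keys, values, w):
    """Soft attention (Eq. 1): a weighted sum of `values`, with weights
    from the scaled dot product between the projected query and the keys."""
    d_k = keys.shape[-1]
    scores = (query @ w) @ keys.T / np.sqrt(d_k)   # one score per key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over keys
    return weights @ values                        # weighted sum of values

# Toy example: merge a 4-token configuration into one vector, using the
# pseudo-delimiter token (the last one) as the query, as in Eq. 2.
rng = np.random.default_rng(0)
config = rng.normal(size=(4, 8))   # 4 contextual token embeddings, dim 8
W = rng.normal(size=(8, 8))        # trainable mapping (random in this sketch)
merged = soft_attn(config[-1], config, config, W)
print(merged.shape)                # (8,)
```

The same `soft_attn` pattern is reused wherever the paper writes "a soft attention is applied", only with different queries, keys, and values.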
After obtaining configuration representations, the agent needs to identify which configuration to follow at each step. To achieve this, we incorporate intra-configuration and inter-configuration knowledge. Concretely, intra-configuration knowledge covers the motion indicator that guides the agent's movement and the landmarks that can be grounded into the objects in the visual images; inter-configuration knowledge is that configurations should be processed one after another.

Figure 2: Model Architecture. The input to the encoder is the instruction text. The inputs to the decoder are the grounded language C*_t calculated by state attention and the aligned visual representation I*_t obtained from the navigable images at each step t. The decoder predicts the distribution of the next viewpoint p_t with the updated context h_t. The high-level view at the top-left shows the information flow in the model, aligned with the circled numbers.
As mentioned above, we identify the verbs and noun chunks in configurations as motion indicators and landmarks, respectively. Each configuration can contain only one motion indicator but multiple landmarks. Formally, for the i-th configuration C_i, the motion indicator representation is denoted as c^i_M and the landmark representations are denoted as [c^i_L,1, · · · , c^i_L,p], where p is the number of landmarks. If there is no landmark in the configuration, the landmark representation is set to zeros. To enhance the motion indicator and landmark information, we concatenate their word embeddings with the configuration representation. In case there are multiple noun chunks in a configuration, for simplicity we select the noun closest to the root of the parse tree as the main landmark, denoted as c^i_L,p̂. Then the enriched configuration representation is denoted as Ĉ_i = [C̄_i; c^i_M; c^i_L,p̂].

Visual Representation
To execute a series of configurations, the agent needs to keep track of the sequence of images observed along the navigation trajectory.
We first transform the low-level ResNet image features of the n navigable images I = {I_1, I_2, . . . , I_n} into I′ = [I′_1, I′_2, · · · , I′_n] by a fully-connected layer, I′_j = FC_img(I_j). Then, a soft attention is applied to I′ with the previous context h_{t−1} as the query, as shown in Eq. 3:

Ī_t = SoftAttn(h_{t−1}, I′, I′)   (3)
Furthermore, we equip the agent with object-based representations. Specifically, we get the top-K object representations from each image with an object detection model. In this paper, we consider two kinds of object representation: object label representation and object visual representation. Specifically, the label representation uses the GloVe embedding (Pennington et al., 2014) of the object's type, and the visual representation uses the region-of-interest (ROI) pooling of the object detection model. We compare the two representations and a hybrid of them in Appendix A.1. Formally, the object representations are denoted as O = {o_{j,k}}, where o_{j,k} is the k-th object representation in the j-th image.

Spatial Configuration Grounding
To guarantee the sequential execution, we design a state attention mechanism over the configurations.
We consider the attention weights at each step as a state that measures navigation progress and is updated by a controller. Formally, the attention weight of the i-th configuration at step t is denoted as α_{t,i}. At the first step, the attention weights are initialized to be focused on the first configuration, α_0 = [1, 0, · · · ]. At each of the following steps, the attention weights are updated by a controller γ_t with a discrete convolution. γ_t is a two-dimensional probability distribution indicating to what extent the agent should execute the current configuration or move to the next one. The updating process is formally defined in Eq. 4:

α_{t,i} = γ_{t,0} · α_{t−1,i} + γ_{t,1} · α_{t−1,i−1}   (4)
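The state-attention update can be viewed as shifting probability mass over configurations: γ's first component keeps mass on the current configuration, its second moves mass to the next one. The following sketch is our own simplification (in the model, γ_t is predicted by a network, not set by hand):

```python
import numpy as np

def update_state(alpha_prev, gamma):
    """Discrete-convolution state update over configurations (Eq. 4).
    gamma[0]: probability of staying on the current configuration;
    gamma[1]: probability of advancing to the next one."""
    stay = gamma[0] * alpha_prev
    move = gamma[1] * np.roll(alpha_prev, 1)  # shift mass one slot forward
    move[0] = 0.0                             # no mass wraps around
    return stay + move

alpha = np.array([1.0, 0.0, 0.0])  # start focused on configuration 1
gamma = np.array([0.3, 0.7])       # the agent is likely done with it
alpha = update_state(alpha, gamma)
# alpha is now [0.3, 0.7, 0.0]: most attention moved to configuration 2
```

Repeated updates make the attention drift monotonically from the first configuration toward the last, which is exactly the sequential-execution behavior the state attention is designed to enforce.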
Using a set of rules to determine the value of the controller γ is not practical. For example, for the instruction "move to the table" or "move past the table", it is hard for an agent to decide whether to execute the current configuration or to move to the next one based only on observing or not observing the "table". To address this issue, we let the agent learn the value of γ from three sources of information. The first is the previous hidden state h_{t−1}; the second is the attended image representation Ī_t at the current step; the third is the similarity score S_t between the landmark representations and the object representations. Eq. 5 shows how to obtain the similarity score S_t, where α_{t−1} are the attention weights at the previous step:

S_t = Σ_i α_{t−1,i} · max_{j,k} sim(c^i_L, o_{j,k})   (5)
Then, we use a fully-connected layer to predict the distribution γ_t = FC_γ([h_{t−1}; Ī_t; S_t]). Finally, we apply the state attention to the enriched configuration representations to get the grounded instruction representation Ĉ = Σ_i α_{t,i} · Ĉ_i, which is used as the language input to the decoder: C*_t = Ĉ.
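A sketch of the landmark-object similarity signal fed to the controller is below. The cosine similarity and the max-pooling over detected objects are our reconstruction of Eq. 5; the real model weighs the score by the previous state attention so that only the currently grounded configuration's landmark matters.

```python
import numpy as np

def landmark_object_score(alpha_prev, landmark_embs, object_embs):
    """Sketch of the similarity score S_t: for each configuration, take the
    best match between its landmark embedding and any detected-object
    embedding, then weight by the previous state-attention distribution."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    best = np.array([max(cos(lm, o) for o in object_embs)
                     for lm in landmark_embs])
    return float(alpha_prev @ best)

rng = np.random.default_rng(1)
landmarks = rng.normal(size=(2, 4))  # one landmark embedding per configuration
objects = rng.normal(size=(5, 4))    # 5 detected-object embeddings
alpha_prev = np.array([0.8, 0.2])    # mostly on configuration 1
S = landmark_object_score(alpha_prev, landmarks, objects)
```

A high S suggests the current configuration's landmark is visible, which is one cue the controller can use to decide whether to advance to the next configuration.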

Visual Representation Alignment
The intuition behind leveraging the object representations is to select navigable images by aligning the object representations with the configuration representation. We use two levels of soft attention: first over the objects in each image, with the configuration representation Ĉ as the query, and second over all images, guided by the previous context h_{t−1}.
Ô_j = SoftAttn(Ĉ, O_j, O_j),   Î = SoftAttn(h_{t−1}, Ô, Ô),

where Ô = [Ô_1, Ô_2, · · · , Ô_n]. We use the image representation Î, which aligns the objects with the configurations, as the visual input to the decoder: I*_t = Î.
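The two-level alignment above can be sketched as follows. For brevity, this sketch uses plain dot-product attention without the trainable projection, and all dimensions are illustrative:

```python
import numpy as np

def attend(query, values):
    """Plain dot-product soft attention (trainable projection omitted)."""
    scores = values @ query / np.sqrt(values.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values

rng = np.random.default_rng(2)
n_imgs, k_objs, dim = 3, 5, 8
objects = rng.normal(size=(n_imgs, k_objs, dim))  # top-K objects per image
config = rng.normal(size=(dim,))                  # configuration representation
h_prev = rng.normal(size=(dim,))                  # previous decoder context

# Level 1: attend over the objects of each image with the configuration.
obj_hat = np.stack([attend(config, objects[j]) for j in range(n_imgs)])
# Level 2: attend over the per-image summaries with the previous context.
img_hat = attend(h_prev, obj_hat)
print(img_hat.shape)                              # (8,)
```

The first level picks out the objects relevant to the current configuration within each image; the second summarizes across images given the navigation context.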

Navigable Viewpoint Selection
We obtain a new decoder context h_t, as described in Section 3.2, from the configuration input C*_t and the visual input I*_t, where t is the current step. The next step is to predict the viewpoint whose image has the highest correlation with the current context and configuration, calculated by z_{t,j} = I′_j · FC_pred([h_t; Ĉ]), where FC_pred(·) is a fully-connected layer. We sum the scores of the three elevations for each navigable viewpoint k as ζ_{t,k} = Σ_{j∈κ_k} z_{t,j}, where κ_k is the set of the three elevations' image indexes. The predicted navigable viewpoint distribution p_t is then calculated as p_t = softmax(ζ_t).
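The elevation aggregation can be sketched as follows, assuming (our assumption, for illustration) that the 36 panoramic images are ordered heading-major, so every consecutive triple shares a heading:

```python
import numpy as np

# 12 headings x 3 elevations = 36 images. Sum the scores of the three
# elevations belonging to each heading before the softmax (the ζ_t step).
rng = np.random.default_rng(3)
z = rng.normal(size=36)                      # one correlation score per image
zeta = z.reshape(12, 3).sum(axis=1)          # aggregate per navigable heading
p = np.exp(zeta - zeta.max())
p /= p.sum()                                 # softmax over the 12 viewpoints
print(p.argmax())                            # index of the selected viewpoint
```

Summing before the softmax means a heading is favored when its images score well at any elevation, which matches the intuition that a landmark may be visible above or below the horizontal view.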

Training and Inference
We train our model with two state-of-the-art training strategies for this task. (1) T1: We follow Self-Monitor (Ma et al., 2019), optimizing the model with a cross-entropy loss to maximize the likelihood of the ground-truth navigable viewpoint and a mean squared error loss to minimize the normalized distance, in units of length, from the current viewpoint to the goal destination. At each step, the next viewpoint is selected by sampling from the predicted probability of each navigable viewpoint. (2) T2: We follow Tan et al. (2019), training the model with a mixture of Imitation Learning and Reinforcement Learning, where Imitation Learning minimizes the cross-entropy loss of the prediction and always samples the ground-truth navigable viewpoint at each time step, and Reinforcement Learning uses the policy gradient to update the parameters of the model. During inference, we conduct a greedy search, following the next viewpoint with the highest probability to generate the trajectory. Note that beam search with a beam size greater than one is not practical because the agent would need to move forward and backward in the physical world, resulting in a long trajectory before making a decision.
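The T1 objective combines the two loss terms described above. A minimal sketch, assuming a single step and a hypothetical mixing weight `lam` (the actual weighting is a training hyperparameter not stated here):

```python
import numpy as np

def t1_loss(pred_logits, gold_idx, progress_pred, progress_gold, lam=0.5):
    """Sketch of the T1 objective: cross-entropy on the ground-truth next
    viewpoint plus an MSE progress-monitor term (lam is illustrative)."""
    p = np.exp(pred_logits - pred_logits.max())
    p /= p.sum()                               # softmax over viewpoints
    ce = -np.log(p[gold_idx] + 1e-12)          # cross-entropy term
    mse = (progress_pred - progress_gold) ** 2 # progress-monitor term
    return ce + lam * mse

# One toy step: 3 candidate viewpoints, ground truth is index 0, and the
# agent's predicted progress (0.4) slightly lags the true progress (0.5).
loss = t1_loss(np.array([2.0, 0.5, -1.0]), 0, 0.4, 0.5)
```

In training, these per-step losses are summed over the trajectory and backpropagated through the decoder.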

Experimental Setup
Dataset We evaluate our model on the Room-to-Room (R2R) dataset (Anderson et al., 2018), which is built upon the Matterport3D dataset (Chang et al., 2017). We report the standard evaluation metrics of this task (Anderson et al., 2018). SPL is recommended as the primary metric because it considers both the effectiveness and the efficiency of the navigation performance.

Baseline Models
We mainly compare SpC-NAV with the following baseline models. Seq2Seq (Anderson et al., 2018) trained an encoder-decoder model with two learning strategies, random and student-forcing. Self-Monitor (Ma et al., 2019) introduced a visual-textual co-attention mechanism and a progress monitor loss. Environment Dropout (Tan et al., 2019) trained the agent with a mixture of Imitation Learning and Reinforcement Learning on augmented data. Sub-instruction (Hong et al., 2020) segmented the instruction into sub-instructions and designed a shifting attention module to ensure the sequential execution order between sub-instructions. The differences between Sub-instruction and our model have been discussed in Section 2.

Implementation Details
We implement SpC-NAV using PyTorch. We use frozen 768-d BERT-base (Devlin et al., 2018) embeddings for the raw instruction and obtain 512-d contextual embeddings with the LSTM encoder. We encode the representations of the motion indicator and the landmark in each configuration with 300-d GloVe embeddings, and concatenate them with the 512-d configuration representation to obtain the enriched configuration representation (1112-d).
We use the 300-d GloVe embedding of the object label representation to calculate the similarity score S with the configuration representation. We trained an autoencoder to map the 2048-d object visual representations from Faster R-CNN to 152-d, and use them to obtain the attended object representation Ô. We optimize using ADAM with a learning rate of 1e−4 in batches of 64. We use a rule-based parser to obtain the spatial configurations and spatial semantic elements, which yields some noisy extractions. Appendix A.2 details the accuracy of the parser based on our manual annotations of a subset of instructions.

Results

Table 1 shows the main performance metrics of our proposed SpC-NAV, compared with the baseline models on the seen/unseen validation sets and the unseen test set. To achieve the best result, SpC-NAV is trained with the training strategy T2 (see Section 3.8) and the data augmentation proposed in Tan et al. (2019). Our model improves the performance in the seen environment and obtains competitive results in the unseen environment. Since we use BERT as the input to the encoder while the baseline models use basic word embeddings, we replace the word representations in Environment Dropout with BERT for a fair comparison. Although the richer language representations help its performance, our model still achieves better results, especially in the seen environments. This indicates that the spatial configurations and spatial elements indeed improve the agent's reasoning ability.

Training strategies are orthogonal to our work, and our model is compatible with the strategies widely used in the literature (T1/T2; see Section 3.8). We evaluate SpC-NAV with both T1 and T2 and compare the results with their respective baseline models as well as Sub-instruction. We do not apply data augmentation in this setting. As shown in Table 2, SpC-NAV achieves consistent improvements in the seen environment compared with all the baselines.
In the unseen environment, training with T1, SpC-NAV outperforms Self-Monitor (and is even comparable to it with data augmentation) and performs similarly to Sub-instruction. However, training with T2, our model does not outperform Environment Dropout and Sub-instruction in unseen environments. We analyze the errors in Section 5.2.

Table 3 shows how the various spatial semantic elements influence the performance of the model. The model is trained with the training strategy T1. Row #1 is our model without spatial elements. From row #2 to row #3, we incorporate the representations of the motion indicator and the landmark into the spatial configuration representation incrementally. In row #4, we use the similarity score between the landmark representations in the configuration and the object label representations in the image to control the transitions between spatial configurations. The motion indicator, the landmark, and the similarity score all improve the performance. The large gain after applying the similarity score indicates that the connection between landmarks and objects is important in language grounding.

Seen Environment
We analyze some qualitative examples to find out how the spatial semantics improve the model. For the semantics of motion, we find that our model improves on cases whose motions contain "up" and "down" after adding the representation of the motion indicator. Figure 3 (a) shows an example of such a scenario. The spatial configuration is "walk up the stairs", and the agent can find the right viewpoints after we incorporate the representation of the motion indicator "walk up". However, the model makes more mistakes in cases where the motion indicators are highly related to the objects, such as "walk through", "walk past", and "walk towards", which need the landmark information. In these latter cases, the model should consider both motions and landmarks together. In another experiment, we added the landmark representation. Figure 3 (b) shows an example where the spatial configuration is "walk past the dining room table". The agent can select the correct viewpoints when we incorporate the representation of the landmark "dining room table". We also analyze the influence of the similarity score, and find that when the information in the current configuration is not sufficient to make a decision, the similarity score assists in choosing the next configuration. For example, in Figure 3 (c), the spatial configurations are "turn right" and "walk past the couch". Without using the similarity score in controlling the transitions between configurations, the agent tends to select a viewpoint in the "right" direction. But with the similarity score, the agent considers both "turn right" and "walk past the couch", and selects the correct viewpoint in which the "couch" can be seen.

Table 3: Ablation study with different spatial semantics. The subscript letters mean the model took that information into account; M: motion indicator; L: landmark; S: similarity score.

Unseen Environment

Table 1 and Table 2 show that our model does not outperform Environment Dropout in the unseen environments.
We notice that the main error is that some objects cannot be detected in the image by the object detection model. This is more problematic for our model because we explicitly align the landmark phrases with the detected objects. For example, in Fig 4 (a), the agent selects the correct viewpoint when the configuration is "Walk to the glass door" because the connection between the landmark "glass door" and the object "door" has been learned in the training set. In Fig 4 (b), the agent is wrong when the configuration is "Go to the pottery." because the "pottery" is not detected at the initial perspective and the word "pottery" never appears in the training set. However, the agent selects a viewpoint in which a bounding box contains a pottery. The gap between seen and unseen becomes larger after data augmentation, since our model is able to capture the structure of the language by observing more examples. It can deal with variations in the instructions and improve the performance in the seen environment, but it fails to deal with novel objects and visual variations in the unseen environments. This is an orthogonal issue addressed in zero-shot learning (Blukis et al., 2020).

State Attention Visualization
We visualize the state attention and the soft attention weights over configurations. As shown in Fig 5a and Fig 5c, our designed state attention demonstrates that the grounded configuration shifts gradually from the first configuration to the last in both seen and unseen environments. We also apply the soft attention used in Self-Monitor to the spatial configurations; as shown in Fig 5b and Fig 5d, it cannot preserve the sequential execution order. We further show the soft attention weights of the grounded instruction in Self-Monitor by splitting the instructions at the boundaries of our configurations. As shown in Fig 5e and Fig 5f, although these attention weights show a gradual shift, many configurations are skipped.

Conclusion
We propose a neural agent that incorporates the semantic elements of spatial language for vision-and-language navigation. We use the notion of spatial configurations as the main linguistic unit of the instructions and enhance the spatial configuration representation with the representations of the motion indicator and landmark. We design a state attention to guarantee the sequential execution order of configurations and use the similarity score between the representations of landmarks and objects to control the transitions between configurations. Based on our results, incorporating the spatial semantics improves reasoning ability over navigation. Future work could investigate more fine-grained spatial semantics and the geometry of spatial relations. We will also deal with novel objects in a zero-shot setting to improve the results in unseen environments.

A.1 Visual Representation Analysis
In this section, we experiment with the three types of object representation introduced in Section 3.6: the object label representation, the object visual representation, and the combination of the two. As shown in Table 4, the object visual representation performs better in unseen environments, and we use it to obtain the attended object representation Ô in our best model. This experiment does not consider the similarity score between the representations of landmarks and objects.

A.2 Parsing Analysis
The performance of our rule-based parser influences the result of navigation. To evaluate it, we manually annotated 845 spatial configurations for 200 instructions. We annotated motion indicators, spatial indicators and landmarks in those configurations. Our parser achieves an accuracy of 85% in extracting the spatial configurations. For the extraction of spatial elements, the accuracy is 73% for motion/spatial indicators, and 77% for landmarks.
In the following, we analyze two types of error in obtaining spatial configurations (Split Error and Order Error), as well as other errors generated in the extraction of the motion indicator, spatial indicator, and landmark.

Split Error
The split configuration may only convey the spatial position of objects rather than executable navigation information. For example, in the instruction, "Turn left. There is a rocking chair in it," two configurations are generated based on our split method: "Turn left" and "There is a rocking chair in it." However, the second configuration is not an independent spatial configuration because it indicates no motion, and it is attached to the previous configuration.

Order Error
We order the configurations based on their occurrence in the sentence. However, there are cases that the configurations have an inverted order. For instance, "Stop once you pass the counter on the right" is split as "stop" and "you pass the counter on the right." However, the implied sequence is inverted because of "once".

Motion Indicator and Spatial Indicator
We build a vocabulary based on the training data to collect the commonly used verb phrases; the vocabulary size is 241. Table 5 shows some examples. If a motion indicator or spatial indicator does not appear in the vocabulary, we treat the verbs as motion indicators and the prepositions as spatial indicators in configurations. With this method, we reach 73% accuracy, since there are expressions that never appear in the training dataset, and it is hard to extract complete verb phrases based only on POS tags.
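A toy version of the vocabulary-based splitting described above is sketched below. The phrase list and the whitespace tokenization are deliberately simplified; the real parser works on dependency parses and a 241-phrase vocabulary.

```python
# Toy sketch: split an instruction into configurations by matching
# motion-indicator phrases from a small vocabulary.
VOCAB = ["walk through", "walk past", "turn left", "turn right",
         "walk up", "move to", "stop"]

def split_configurations(instruction):
    """Start a new configuration whenever a vocabulary phrase begins."""
    tokens = [t for t in
              instruction.lower().replace(",", "").rstrip(".").split()
              if t != "and"]
    configs, current = [], []
    for i, tok in enumerate(tokens):
        bigram = " ".join(tokens[i:i + 2])
        # A new motion indicator starts here: close the open configuration.
        if (bigram in VOCAB or tok in VOCAB) and current:
            configs.append(" ".join(current))
            current = []
        current.append(tok)
    if current:
        configs.append(" ".join(current))
    return configs

print(split_configurations("Move to the table with chair, and stop."))
# ['move to the table with chair', 'stop']
```

Even this toy version reproduces the Figure 1 example; the hard cases (multi-word verb phrases absent from the vocabulary, inverted orders like "stop once you pass ...") are exactly the error types discussed in this appendix.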

Landmark
We extract the noun phrases of each configuration as landmarks and reach 77% accuracy. However, there are some special cases; for example, "a left" in "make a left" is extracted as a noun chunk, but it cannot be treated as a landmark. Also, in the expression "middle of the doorway", "the middle" and "the doorway" are both noun chunks, but the whole phrase, rather than the separate chunks, is the landmark.

Table 5: Examples of verb phrases in the vocabulary: head straight, walk through, walk down, walk into, walk inside, turn around, turn left, make a left turn, jump over, move forward, turn slightly right.