CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations

Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions. In this paper, we aim to solve two key challenges in this task: utilizing multilingual instructions for improved instruction-path grounding, and navigating through new environments that are unseen during training. To address these challenges, we propose CLEAR: Cross-Lingual and Environment-Agnostic Representations. First, our agent learns a shared and visually-aligned cross-lingual language representation for the three languages (English, Hindi, and Telugu) in the Room-Across-Room dataset. Our language representation learning is guided by text pairs that are aligned by visual information. Second, our agent learns an environment-agnostic visual representation by maximizing the similarity between semantically-aligned image pairs (with constraints on object-matching) from different environments. Our environment-agnostic visual representation can mitigate the environment bias induced by low-level visual information. Empirically, on the Room-Across-Room dataset, we show that our multilingual agent gets large improvements in all metrics over the strong baseline model when generalizing to unseen environments with the cross-lingual language representation and the environment-agnostic visual representation. Furthermore, we show that our learned language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialog Navigation tasks, and present detailed qualitative and quantitative generalization and grounding analysis. Our code is available at https://github.com/jialuli-luka/CLEAR


Introduction
The Vision-and-Language Navigation task requires an agent to navigate through the environment based on language instructions. This task has two unsolved challenges. First, directly introducing pretrained linguistic and visual representations into these agents suffers from domain shift (i.e., the pretrained linguistic and visual representations might not generalize to the VLN task) (Huang et al., 2019b). Learning the instruction representation while also learning how to navigate based on the instruction is even more challenging for a multilingual agent, since more language variance is injected via multilingual instructions. At the same time, this poses the important question of whether we can utilize multilingual instructions to learn a better cross-lingual representation and improve instruction-path grounding and referencing. Second, previous works (Fried et al., 2018; Wang et al., 2019a; Landi et al., 2021; Wang et al., 2020a; Huang et al., 2019a; Ma et al., 2019a; Majumdar et al., 2020; Qi et al., 2020a) on vision-and-language navigation have observed that agents tend to perform substantially worse in environments that are unseen during training, indicating the lack of generalizability of the navigation agent. In this paper, we propose to address these two challenges via cross-lingual and environment-agnostic representations.

Figure 1: The English instruction ("You are facing towards the closed door right now, turn back, you can find a washroom in front of you, enter into the washroom, stand in between the bath tub and the wash basin, that would be your end point.") and the Telugu and Hindi instructions on the left all correspond to the same path, Path A. The words in red correspond to the same visual object, "wash basin". Path A and Path B are similar paths (i.e., the instructions for these two paths are semantically similar) in different environments.
Although some initial progress (Huang et al., 2019b; Majumdar et al., 2020; Hong et al., 2021) has been made towards introducing pre-trained linguistic representations into vision-language navigation agents, how to understand and utilize paired multilingual instructions to transfer the pre-trained linguistic representation to a multilingual navigation agent still remains unexplored. We argue that for a multilingual agent, the linguistic representation can capture more visual concepts from learning the similarity between paired multilingual instructions. As shown in Figure 1, though the three instructions shown there are in different languages and vary in length and level of detail, all of them correspond to the same path, Path A. Hence, by learning the similarity between these paired instructions, the cross-lingual language representations of the same visual concept mentioned in these paired instructions (e.g., the red words correspond to the same visual object "wash basin") will be close to each other, making it easier for the agent to comprehend. Furthermore, the cross-lingual language representation will benefit from the complementary information in instructions in different languages, since they elicit more references to visible entities. For example, in Figure 1, the target room environment "washroom" is only mentioned in the English instruction. The Hindi and Telugu instructions could benefit from learning the connection between "washroom" and "wash basin" through learning from the English instruction.
Moreover, many methods have been proposed to encourage agent generalization to environments unseen during training (Tan et al., 2019; Wang et al., 2020c; Fu et al., 2020; Zhang et al., 2020). Zhang et al. (2020) has shown that it is the low-level appearance information that causes the environment bias. To mitigate this bias, previous works only consider one single environment when learning the visual representation for a given path. We instead learn an environment-agnostic visual representation by exploring the connections between multiple environments. For the example shown in Figure 1, Path A and Path B are two semantically aligned paths in different environments. In both cases, the agent needs to head into the washroom and stop beside the wash basin. Learning the relationship between these paired paths helps the agent comprehend concepts like "bath tub", and not be distracted by the low-level appearance of objects in unseen environments.

Footnote: We translate the Telugu and Hindi instructions in Figure 1 into English with Google Translate for reference (the translated instructions are not used in representation learning or navigation learning). Telugu: "Return to the left from where you are standing, enter the door on the opposite side, and go to the side of the wash basin on the left and wait." Hindi: "Turn back and go inside the door directly, come to the right side of the sink and stop."
Overall, in this paper, we propose 'CLEAR: Cross-Lingual and Environment-Agnostic Representations' to address the two challenges above. First, we define a visually-aligned instruction pair as two instructions that correspond to the same navigation path. Given the instruction pairs, we transfer the pre-trained multilingual BERT (Devlin et al., 2019) to the Vision-Language Navigation task by encouraging these paired instructions to be embedded close to each other. Second, we identify semantically-aligned path pairs based on the similarity between instructions. Intuitively, if the similarity between two instructions is high, then their corresponding navigation paths will be semantically similar (i.e., mentioning the same objects like "wash basin"). We further filter out image pairs (a pair of paths contains multiple image pairs) that do not contain the same objects, for higher path pair similarity. Then, we train an environment-agnostic visual representation that learns the connection between these semantically-aligned path pairs.
We conduct experiments on the Room-Across-Room (RxR) dataset (Ku et al., 2020), which contains instructions in three languages (English, Hindi, and Telugu). Empirical results show that our proposed representations significantly improve the performance over the mono-lingual model (Shen et al., 2022) by 2.59% in nDTW score on the RxR test leaderboard. We further show that our CLEAR approach outperforms our baseline that utilizes ResNet (He et al., 2016) to extract image features by 5.3% in success rate and 4.3% in nDTW score (and it also outperforms a stronger baseline that utilizes the recent CLIP (Radford et al., 2021) method to extract image features). Moreover, our CLEAR approach shows better generalizability when transferred to the Room-to-Room (R2R) dataset (Anderson et al., 2018b) and the Cooperative Vision-and-Dialog Navigation dataset (Thomason et al., 2019), and when adapted to other state-of-the-art VLN agents. We also demonstrate the advantage of optimizing similarity between all three languages in the RxR dataset for language representation learning, and the effectiveness of the way we generate positive path pairs for visual representation learning. Lastly, we demonstrate that our cross-lingual language representation captures visual semantics underlying the instructions, and that our environment-agnostic visual representation generalizes better to unseen environments, with both qualitative and quantitative analysis.

Related Work
Vision-and-language navigation. Vision-and-Language Navigation (VLN) requires an agent to find the route to a desired target based on instructions (Jain et al., 2019; Thomason et al., 2020; Nguyen and Daumé III, 2019; Chen et al., 2019; Krantz et al., 2020). Specifically, there are two key challenges in VLN: grounding the natural language instruction to visual environments, and generalizing to unseen environments. To address the first challenge, one line of research in VLN utilizes carefully designed cross-modal attention modules (Wang et al., 2019a; Tan et al., 2019; Landi et al., 2021; Xia et al., 2020; Wang et al., 2020b,a; Zhu et al., 2020; Zhu et al., 2021; Kim et al., 2021), progress monitor modules (Ma et al., 2019b,a; Ke et al., 2019), and object-action aware modules (Qi et al., 2020a). Another line of research improves vision and language co-grounding by improving vision and language representations with pre-training techniques (Huang et al., 2019b; Hao et al., 2020; Majumdar et al., 2020; Hong et al., 2021). Li et al. (2019) directly adopts pre-trained BERT for encoding instructions, Hao et al. (2020) and Hong et al. (2021) learn from a large amount of image-text-action triplets, Majumdar et al. (2020) learns from a large amount of text-image pairs from the web, and Huang et al. (2019b) transfers language and visual representations to in-domain representations with auxiliary tasks. Different from them, we utilize the visually-aligned multilingual instructions to learn a cross-lingual language representation that inherently captures visual semantics underlying the instruction.
Multiple methods have been proposed to encourage generalization to unseen environments during training (Zhang et al., 2020; Tan et al., 2019; Wang et al., 2020c; Fu et al., 2020). Zhang et al. (2020) demonstrates that it is the low-level appearance information that causes the large performance gap between seen and unseen environments. Tan et al. (2019) proposes to use environment dropout on visual features to create new environments, and Fu et al. (2020) utilizes adversarial path sampling to encourage generalization. However, both of these methods rely on a speaker module to generate synthetic training data and can be considered data augmentation methods, which are complementary to our proposed environment-agnostic visual representation. The closest work to ours is Wang et al. (2020c), which proposes to pair an environment classifier with a gradient reversal layer to learn an environment-agnostic representation. However, they only consider one single environment when learning the visual representation for a given path (i.e., given one path, predict its environment). In our environment-agnostic representation learning, we explore the connections between multiple environments (i.e., maximize the similarity between paths from different environments).

Vision-and-language with multilinguality. There has been growing interest in combining vision and language for tasks such as visual-guided machine translation (Sigurdsson et al., 2020; Surís et al., 2022; Huang et al., 2020), multi-lingual visual question answering (Gao et al., 2015; Gupta et al., 2020; Shimizu et al., 2018), multi-lingual image captioning (Gu et al., 2018; Lan et al., 2017), multi-lingual video captioning (Wang et al., 2019b), and multi-lingual image-sentence retrieval (Burns et al., 2020). In this paper, we work on multi-lingual vision-and-language navigation.
We use vision (i.e., the navigation path) as a bridge between multi-lingual instructions and learn a cross-lingual representation that captures visual concepts. Moreover, our method also uses language as a bridge between different visual environments to learn an environment-agnostic visual representation.

Method
In this section, we present our CLEAR method that learns cross-lingual language representations and environment-agnostic visual representations. Given these learned language and visual representations, we then train the agent on the vision-and-language navigation task with imitation learning and reinforcement learning. The overall representation learning and navigation agent training processes are illustrated in Figure 2. We next describe our representation learning methods in Sec. 3.1 and Sec. 3.2. The navigation model (Tan et al., 2019) and training process are detailed in the Appendix.

Figure 2: Left (Stage 1, representation learning): the agent learns a cross-lingual language representation and an environment-agnostic visual representation via maximizing the similarity between positive pairs (connected with a blue line) and minimizing the similarity between negative pairs (connected with a red dashed line). For simplicity, we use 3 as the batch size when illustrating the positive and negative pairs. Right (Stage 2, training the navigation agent): the agent is then trained on the vision-and-language navigation task based on these learned representations.

Language Representation Learning
The goal of our language representation learning approach is to learn a cross-lingual language representation that can mitigate the natural ambiguity and variance in multilingual instructions and improve the path-instruction alignment by capturing the shared and salient visual concepts underlying the instructions. We define visually-aligned instruction pairs as instructions that correspond to the same navigation path. Since these instruction pairs refer to the same navigation path, the visual concepts underlying these instructions (e.g., visual objects mentioned in the instruction) are shared. Thus, we could train the language representation to emphasize these visual concepts by learning the connection between these visually-aligned instruction pairs. For each navigation path, the Room-Across-Room (RxR) dataset (Ku et al., 2020) provides 9 corresponding language instructions in 3 languages (English, Hindi, and Telugu). During training, for each navigation path, we randomly sample two instructions out of the nine corresponding instructions as the visually-aligned instruction pairs. The two instructions can be in different languages, which helps the agent learn a cross-lingual language representation. Exclusively learning connections between instructions in the same language will lose crucial information across languages, and we quantitatively illustrate this result in Sec. 6.1.
Given an instruction $\{w_i\}_{i=0}^{m}$ with $m$ words, we use the feature of the [CLS] token (i.e., $w_0$) in the pre-trained multilingual BERT (Devlin et al., 2019) output as the sentence representation $\mathbf{w}$:

$\mathbf{w} = \mathrm{BERT}_{multi}(w_0, w_1, \ldots, w_m)[0] \quad (2)$

In a batch of size $N$, we have $N$ positive pairs of instructions with representations $(\mathbf{w}_j, \mathbf{u}_j)_{j=1}^{N}$ from Eqn. 2. Each positive pair is matched with $2(N-1)$ negatives in the batch (i.e., $\{\mathbf{w}_k\}_{k \neq j}$ and $\{\mathbf{u}_k\}_{k \neq j}$). Our goal is to learn a representation that maps instructions for the same path closer to each other in the representation space, regardless of the language and the natural variance in human-generated instructions. We learn the representation by optimizing a contrastive loss:

$\mathcal{L}_{lang} = -\sum_{j=1}^{N} \log \frac{\exp(\alpha_{j,j}/\tau)}{\sum_{k=1}^{N} \exp(\alpha_{j,k}/\tau) + \sum_{k \neq j} \exp(\mathrm{sim}(\mathbf{w}_j, \mathbf{w}_k)/\tau)} \quad (3)$

where $\alpha_{i,j}$ is the similarity between the instructions $\mathbf{w}_i$ and $\mathbf{u}_j$, and $\tau$ is the temperature hyperparameter.
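As a concrete (unofficial) sketch of this objective, the NumPy snippet below computes a batch contrastive loss in which each positive instruction pair is contrasted against its 2(N-1) in-batch negatives. The cosine-similarity measure and the temperature value 0.1 are illustrative assumptions, not the paper's reported settings:

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity matrix between two sets of vectors."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def language_contrastive_loss(w, u, tau=0.1):
    """Contrastive loss over N visually-aligned instruction pairs.

    w, u: (N, d) sentence representations of the paired instructions.
    Each positive (w_j, u_j) is contrasted against 2(N-1) negatives:
    the other u_k in the batch and the other w_k in the batch.
    """
    n = w.shape[0]
    cross = np.exp(cosine_sim(w, u) / tau)  # alpha_{j,k}: w_j vs u_k
    intra = np.exp(cosine_sim(w, w) / tau)  # w_j vs w_k negatives
    loss = 0.0
    for j in range(n):
        pos = cross[j, j]
        denom = cross[j].sum() + intra[j].sum() - intra[j, j]
        loss += -np.log(pos / denom)
    return loss / n
```

A loss computed over correctly aligned pairs should be lower than one computed over shuffled (misaligned) pairs, which provides a quick sanity check on the implementation.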

Visual Representation Learning
Our goal in visual representation learning is to learn an environment-agnostic visual representation that can mitigate the environment bias caused by objects' low-level appearance, such that it generalizes better to unseen environments. Intuitively, the agent would learn the general concept of objects instead of their low-level appearance if it can identify the same objects in two images from different environments. Thus, we train the agent to learn the connected visual semantics between semantically-aligned navigation paths (i.e., paths that mention the same objects or similar actions in different environments).
Identifying semantically-aligned path pairs: Although the appearance of a path varies a lot across environments, the instructions that describe similar paths are more consistent across environments. Based on this intuition, we use language as the bridge between paths in multiple visual environments. Specifically, we propose to use instruction similarity as a direct measurement of how semantically similar two paths are. For each instruction-path pair $(I, P)$ given in the Room-Across-Room (RxR) dataset, we first represent the instruction $I$ as in Eqn. 2. Then, we compute the cosine similarity between the representation of instruction $I$ and all the other instructions in the training set. We pick the instruction $I'$ that is most similar to $I$, under the constraint that $I'$'s corresponding path $P'$ has the same path length as $P$. Thus, we group $P$ and $P'$ as a semantically-aligned path pair.
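A simplified sketch of this retrieval step, assuming precomputed instruction embeddings. The real pipeline also excludes instructions describing the same path; here, for brevity, we only exclude the query instruction itself:

```python
import numpy as np

def pair_paths(instr_emb, path_len):
    """For each instruction, retrieve the index of the most similar *other*
    instruction whose corresponding path has the same length.

    instr_emb: (M, d) instruction representations (e.g., from Eqn. 2).
    path_len:  (M,) path length associated with each instruction.
    """
    e = instr_emb / np.linalg.norm(instr_emb, axis=1, keepdims=True)
    sim = e @ e.T  # pairwise cosine similarity
    pairs = []
    for i in range(len(path_len)):
        # only consider candidates with the same path length
        cand = np.where(path_len == path_len[i], sim[i], -np.inf)
        cand[i] = -np.inf  # exclude the query instruction itself
        pairs.append(int(np.argmax(cand)))
    return pairs
```

The length constraint keeps the retrieved path pair step-aligned, which matters for the image-pair filtering described next.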
Constraint on object-matching: In a batch of size $N$, we have $N$ positive semantically-aligned path pairs $(P_k, Q_k)_{k=1}^{N}$. We represent the positive path pair $(P_k, Q_k)$ as sequences of panoramic views $\{p_{k,t}\}$ and $\{q_{k,t}\}$. Since paths might not be fully aligned (i.e., the correspondence between image pairs $\{p_{k,t}\}$ and $\{q_{k,t}\}$ might not hold), we use object-matching to filter out image pairs that do not contain the same objects. Specifically, we use a Mask-RCNN (He et al., 2017) model trained on the LVIS dataset (Gupta et al., 2019) in detectron2 (Wu et al., 2019) to detect objects in the 36 discretized views of the panoramic view. We filter out object classes that appear less than 1% of the time in all panoramic views, leaving 27 object classes, including objects like 'cabinet', 'chair', and 'sofa'. All object classes can be found in the Appendix. During training, we randomly sample 10 of the 27 object classes in each iteration and filter out image pairs that do not contain the same objects among the sampled 10 classes. Our object-matching constraint ensures that the corresponding image pairs $\{p_{k,t}\}$ and $\{q_{k,t}\}$ also have a high semantic similarity.
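The filtering step above can be sketched as a simple set intersection over detected object classes. This is an illustrative stand-in for the actual detector output, which in the paper comes from Mask-RCNN:

```python
def filter_image_pairs(objects_p, objects_q, sampled_classes):
    """Return the time steps whose two panoramic views share at least one
    detected object among the sampled object classes.

    objects_p, objects_q: per-time-step collections of detected class names
    for the two paths in a pair; sampled_classes: the classes sampled for
    this iteration (10 out of 27 in the paper).
    """
    sampled = set(sampled_classes)
    keep = []
    for t, (a, b) in enumerate(zip(objects_p, objects_q)):
        # keep the pair only if both views contain a common sampled object
        if set(a) & set(b) & sampled:
            keep.append(t)
    return keep
```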
Visual encoder: The panoramic view at time step $t$ is discretized into 36 single views $\{o_{t,i}\}_{i=1}^{36}$. We encode the visual representation for each view as follows: we first encode images with a pre-trained vision model; the encoded view features are then passed through two fully-connected layers with ReLU as the activation function; layer normalization and a residual connection are applied on top of the fully-connected layers.
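A minimal sketch of this encoder in NumPy, assuming the pre-trained feature and hidden dimensions are equal so the residual connection type-checks; the exact ordering of the residual connection and layer normalization is our assumption:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encode_views(feats, w1, b1, w2, b2):
    """Encode the 36 single-view features of one panorama.

    feats: (36, d) pre-trained image features; w1, w2: (d, d) weights.
    Two fully-connected layers with ReLU, then a residual connection
    and layer normalization.
    """
    h = np.maximum(0.0, feats @ w1 + b1)  # first FC layer + ReLU
    h = h @ w2 + b2                       # second FC layer
    return layer_norm(feats + h)          # residual + layer norm
```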
Learning visual representation: Given the $N$ positive semantically-aligned path pairs $(P_k, Q_k)_{k=1}^{N}$, at each time step $t$ we have $N_p$ panoramic views (computed as the average of 36 single views as in Eqn. 10) that have a positive pair (i.e., the paired views contain at least one same object). For each view $p_{k,t}$ that has a positive pair, the visual encoder is trained to predict which of the $N$ possible panoramic views $\{q_{k,t}\}_{k=1}^{N}$ contains similar semantic information. Specifically, we train the visual encoder to maximize the cosine similarity of the $N_p$ positive image pairs in the batch while minimizing the cosine similarity of the $N \cdot N_p - N_p$ negative image pairs (i.e., each view has $N-1$ negatives). We optimize the contrastive loss:

$\mathcal{L}_{visual} = -\sum_{k,t} \log \frac{\exp(\beta_{k,t}/\tau)}{\sum_{k'=1}^{N} \exp(\mathrm{sim}(p_{k,t}, q_{k',t})/\tau)} \quad (8)$

where $\beta_{k,t}$ is the similarity between the positive panoramic view pair $p_{k,t}$ and $q_{k,t}$, and $\tau$ is the temperature hyperparameter. We compute the panoramic view representation as the average of the 36 single views:

$p_{k,t} = \frac{1}{36} \sum_{i=1}^{36} v_{p,k,t,i} \quad (10)$

where $v_{p,k,t,i}$ is the output representation from the visual encoder; $q_{k,t}$ is computed similarly.
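The panoramic averaging and the masked in-batch contrast can be sketched as follows; the positive mask stands in for the object-matching filter, and the cosine/temperature choices mirror the language loss sketch rather than the paper's exact configuration:

```python
import numpy as np

def visual_contrastive_loss(view_p, view_q, has_pos, tau=0.1):
    """Contrastive loss over panoramic views at one time step.

    view_p, view_q: (N, 36, d) encoded single views for the N path pairs;
    has_pos: (N,) boolean mask of views that survived object matching.
    Panoramic representations are the mean of the 36 views (Eqn. 10);
    each positive p_k is contrasted against the N-1 other q_k'.
    """
    p = view_p.mean(axis=1)  # (N, d) panoramic representation, Eqn. 10
    q = view_q.mean(axis=1)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    sim = np.exp(p @ q.T / tau)  # exponentiated cosine similarities
    # loss is only computed for views that have a filtered positive pair
    losses = [-np.log(sim[k, k] / sim[k].sum())
              for k in np.where(has_pos)[0]]
    return float(np.mean(losses))
```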

Learning
Our CLEAR agent has two stages of learning: representation learning and navigation learning.
In the representation learning stage, we train the multilingual encoder and visual encoder by optimizing the contrastive losses $\mathcal{L}_{lang}$ in Eqn. 3 and $\mathcal{L}_{visual}$ in Eqn. 8, respectively. The representation learning process transfers the language representation to a domain-specific language representation and adapts the visual representation to learn the correlation underlying the navigation environments.

Table 1: 'RxR' is the mono-lingual baseline in Ku et al. (2020); 'CLIP' is the mono-lingual agent in Shen et al. (2022).
In the navigation learning stage, we use a mixture of imitation learning and reinforcement learning to train the agent on the navigation task, as in Tan et al. (2019). Details can be found in the Appendix.
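A generic sketch of such a mixed objective: a cross-entropy imitation term on teacher actions plus a policy-gradient term on the agent's sampled actions. The weighting coefficient and the exact advantage estimator are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def mixed_loss(logits, expert_actions, sampled_actions, advantages,
               il_weight=0.2):
    """One-batch sketch of mixing imitation and reinforcement learning.

    logits: (T, A) action logits along the trajectory;
    expert_actions: (T,) teacher actions (IL supervision);
    sampled_actions: (T,) actions sampled by the agent (RL rollout);
    advantages: (T,) reward-to-go minus a baseline for sampled actions;
    il_weight: hypothetical mixing coefficient.
    """
    # log-softmax over the action dimension
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    t = np.arange(len(logits))
    il = -log_probs[t, expert_actions].mean()                  # cross-entropy
    rl = -(log_probs[t, sampled_actions] * advantages).mean()  # policy gradient
    return rl + il_weight * il
```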

Dataset
We evaluate our agent on the Room-Across-Room (RxR) dataset (Ku et al., 2020). The dataset is split into a training set, seen and unseen validation sets, and a test set. In the unseen validation set and the test set, the environments do not appear in the training set; thus, performance on these two sets shows the model's generalizability to new environments. More details can be found in the Appendix.

Evaluation Metrics
To evaluate the performance of our model, we follow the metrics used in the Room-Across-Room paper (Ku et al., 2020).

Implementation Details
In our experiments, we learn the shared cross-lingual representation based on cased multilingual BERT-base. For the pre-trained vision model, we compare performance between image features extracted from an ImageNet-pre-trained (Russakovsky et al., 2015) ResNet-152 (He et al., 2016) and a CLIP-pre-trained (Radford et al., 2021) vision transformer (ViT-B/32) (Dosovitskiy et al., 2021) (abbreviated as 'CLIP features' later). More details about representation learning and navigation training can be found in the Appendix.

Test Set Results
We compare our final agent model with results on the Room-Across-Room (RxR) leaderboard. Our agent is a multilingual model that learns the three languages in the same model. Compared with mono-lingual agents that learn instructions in the three languages separately, a multilingual agent typically performs worse due to degradation on high-resource languages (Ku et al., 2020; Aharoni et al., 2019; Pratap et al., 2020). Our agent is tested under the single-run setup, where the agent navigates only once and does not pre-explore the test environment. As shown in Table 1, our CLEAR model with CLIP features is 16.88% higher in nDTW score than the baseline mono-lingual model (Ku et al., 2020) ('RxR'), which utilizes ResNet features and a different base navigation model. Furthermore, our model is 2.59% higher in nDTW score than the mono-lingual model (Shen et al., 2022) ('CLIP') that utilizes CLIP features and the same base navigation model as ours.

Ablation Results
We demonstrate the effectiveness of our learned visual and language representations with ablation studies. The baseline model (annotated as 'ResNet' in Table 2) uses multilingual BERT and pre-trained ResNet to encode instructions and images without the representation learning stage. Our CLEAR-ResNet ('ResNet+both' in Table 2) outperforms the baseline model in all evaluation metrics on average. Specifically, it improves over the baseline model by 5.3% in success rate (SR) and 4.3% in nDTW score on average over the three languages. These results demonstrate that our CLEAR agent is not only more capable of reaching the target, but also follows the ground-truth path better.
We then show that both the cross-lingual language representation and the environment-agnostic visual representation contribute to the overall improvement. When the cross-lingual language representation is added ('+text'), we see consistent improvement on the averaged metrics and observe that Hindi benefits most from the cross-lingual language representation. When adding the environment-agnostic visual representation ('+visual'), the nDTW score improves by 2.6%. These improvements validate the effectiveness of our learned language and visual representations. Moreover, we show that our CLEAR approach generalizes to other pre-trained visual features. We implement another model (annotated as 'CLIP' in Table 2) that uses CLIP to encode images, which is a stronger baseline than the ResNet baseline ('ResNet' in Table 2). Our CLEAR-CLIP model ('CLIP+both' in Table 2) also shows a 2.7% improvement in success rate (SR) and a 1.2% improvement in nDTW score on average over the three languages. This demonstrates the effectiveness of our CLEAR approach across different pre-trained visual features.

Effectiveness of Cross-Lingual Representations
In this section, we show the effectiveness of our language representation learning method described in Sec. 3.1. We first show the effectiveness of using paired multilingual instructions instead of monolingual instructions in the language representation learning stage. Then, we show that our learned cross-lingual language representation captures the visual concepts behind the instruction better than the original multilingual BERT representation.

Multilingual vs. monolingual. To show that multilingual instruction pairs are crucial for our cross-lingual language representation learning, we experiment with fine-tuning multilingual BERT with instruction pairs in the same language only ('Mono' in Table 3). We observe that, compared with the agent with cross-lingual representation ('Multi'), the success rate decreases by 3.1% and the sDTW score decreases by 2.5%. Furthermore, compared with the baseline model that uses the original multilingual BERT ('m-BERT'), the success rate drops by 2.2% and the sDTW score drops by 2.1%. This result indicates that instruction representations in one language cannot benefit from representations in other languages if the multilingual representation is only supervised by a contrastive loss between mono-lingual instruction pairs.
Capturing visual concepts. Our cross-lingual language representation can ground to the visual environment more easily by capturing the visual concepts in the instruction. We demonstrate that shared visual concepts in different paths are captured by our language representation. We first encode the instruction as in Eqn. 2 with the cross-lingual representation and the original multilingual BERT separately. For every instruction, we retrieve another instruction with the highest cosine similarity, under the constraints that the two instructions do not correspond to the same path and have equal path length. As shown in Figure 3, the second row is the query instruction ("Right now you're facing towards a chair. Turn behind and exit the room. Now slightly turn left and move forward by passing through a large black table on the right side and a large portrait on the left side. Right now you can see a white teapoy and a sofa set in front of you. Move towards that teapoy. Now there is a couch on the right side. Move forward and stand in between the couch and the window and that is the end point.") and the first row is its corresponding path. The following four rows correspond to the instruction-path pairs picked with the cross-lingual representation and the multilingual-BERT representation. First, we observe in Figure 3 that our cross-lingual representation retrieves a Hindi instruction while multilingual BERT picks an English instruction. This indicates that our cross-lingual representation learns to encode instructions with similar semantics in different languages closer to each other. Besides, we observe that in all three paths the agent passes tables and chairs, but only in the query path and the cross-lingual paired path does the agent stop at places similar to "bar stools". This demonstrates that the visual objects in the cross-lingual picked path are more similar to the objects in the query path.

Figure 3: Comparison of the most similar instruction picked with the cross-lingual representation and with multilingual BERT (rows: instruction; path; cross-lingual paired instruction; cross-lingual paired path; multilingual-BERT paired instruction; multilingual-BERT paired path). Our cross-lingual picked instruction mentions more of the visual objects in the query instruction. Besides, the path corresponding to the cross-lingual picked instruction contains more accurate visual objects matching the query path.

Effectiveness of Optimizing Similarity between Three Languages
In this section, we further show that only optimizing the similarity between a subset of languages (i.e., two out of three languages) hurts performance. Specifically, we train language representations that optimize similarity between only English and Hindi ('en+hi'), only English and Telugu ('en+te'), only Hindi and Telugu ('hi+te'), and only a single language ('mono'). Given paired language instructions in English and Hindi in the unseen set, the average distance is 0.61 for our language representation (i.e., optimizing similarity between all three languages), 0.43 for en+te, 1.67 for hi+te, and 1.55 for the same language only, indicating that explicitly optimizing the similarity between en and hi helps reduce the distance between en and hi the most. Adding te to the optimization makes en and hi slightly farther from each other, but still much closer than only optimizing hi+te, and also brings the distances among all three languages closer together. We further show the performance of training the navigation agent with these language representations in Table 4. We observe that both the success rate and the nDTW score drop significantly when training on only a subset of languages. This result shows that it is crucial to train the language representation with instruction pairs in all three languages.

Decreasing Gap between Seen and Unseen Environments
Most previous navigation models (Wang et al., 2019a; Ma et al., 2019a; Majumdar et al., 2020) suffer from a large performance drop when moving from seen validation to unseen validation because the visual encoder overfits the low-level appearance features (Zhang et al., 2020). Our environment-agnostic visual representation can decrease the performance gap between validation seen and unseen environments. As shown in Table 5, the nDTW gap is decreased from 1.6 to 1.0 compared with the baseline model. It is also lower than the gap of the multi-lingual agent in Ku et al. (2020).

Comparison with Other Contrastive Learning Approaches
In this section, we compare with SimCSE (Gao et al., 2021), an effective contrastive learning approach for text representation learning. We apply SimCSE to our visual representation learning, using dropout to form positives in contrastive learning. Training the visual representation with SimCSE gets 34.8/53.0 (SR/nDTW), which is lower than our visual representation (35.6/53.7). Furthermore, we experiment with using both dropout positives and our identified path pairs as positives. The performance decreases in nDTW score (52.4) compared with only using our identified path pairs as positives (53.7).

Table 6: Results on R2R validation unseen environments. 'CLEAR' (based on ResNet) transfers the language and visual representations from the RxR dataset, and 'ResNet' is the baseline model that uses multilingual BERT and pre-trained ResNet. 'ResNet-zero' and 'CLEAR-zero' are the zero-shot performance of the baseline and of our approach on the R2R dataset.

Generalization to Other VLN Tasks
We further evaluate our CLEAR approach's generalizability on the Room-to-Room (R2R) dataset (Anderson et al., 2018b) and the Cooperative Vision-and-Dialog Navigation (CVDN) dataset (Thomason et al., 2019), where we directly transfer our CLEAR representations and train on the navigation task on R2R and CVDN. R2R and CVDN follow the same training, validation seen, and validation unseen environment splits as the Room-Across-Room dataset. The main difference is that the language instructions in R2R and CVDN are monolingual (i.e., English). Besides, instructions in CVDN are multi-round dialogues between the navigator and the oracle. Our baseline model uses multilingual BERT to encode instructions and an ImageNet-pre-trained ResNet to extract image features. The cross-lingual language representation and the environment-agnostic visual representation are trained on the RxR dataset (as in Sec. 3.1 and Sec. 3.2). We then train the navigation agent on the R2R and CVDN datasets with the language and visual encoders initialized from our CLEAR representations. As shown in Table 6, on the R2R dataset, our learned representation outperforms the baseline by 1.4% in success rate and 1.8% in nDTW. Furthermore, we show that the zero-shot performance of our approach improves over the baseline by 4.5% in success rate and 2.2% in SPL on the R2R dataset. On the CVDN dataset, our learned representation outperforms the baseline by 0.74 in Goal Progress (4.05 vs. 3.31) after training on CVDN, and by 0.42 in Goal Progress (0.92 vs. 0.50) in the zero-shot setting. Goal Progress measures the progress made towards the target location and is the main evaluation metric in CVDN. This result demonstrates that our learned cross-lingual and environment-agnostic representations generalize to other tasks.

Generalization to Other VLN Agents
We further evaluate our CLEAR approach's generalizability to other VLN agents. Specifically, we adapt CLEAR to the state-of-the-art VLN agent HAMT. Starting from the pre-trained weights released with HAMT, we further learn the text and visual representations with our approach. Adapting CLEAR to HAMT achieves a 57.2% success rate and a 65.6% nDTW score on the RxR validation unseen set, which is 0.7% higher than HAMT in success rate and 2.5% higher in nDTW score, demonstrating the effectiveness of our proposed approach on top of state-of-the-art VLN models.

Conclusion
In this paper, we presented CLEAR, a method that learns cross-lingual and environment-agnostic representations. We demonstrated that our cross-lingual language representation captures more visual semantics and that our environment-agnostic visual representation generalizes better to unseen environments. Our experiments on the Room-Across-Room dataset show that CLEAR improves performance in all evaluation metrics over a strong baseline. Furthermore, we qualitatively and quantitatively analyzed the effectiveness of every component of our CLEAR approach and its generalizability to other tasks and base VLN agents.

Ethics Statement
In this paper, we presented a method to learn cross-lingual and environment-agnostic representations for Vision-and-Language Navigation. The Vision-and-Language Navigation task can be used in many real-world applications; for example, a home service robot can bring things to its owner based on natural language instructions, making people's lives easier. Our learned representations enable the agent to understand multilingual instructions and improve the agent's generalizability to unseen environments. However, we currently learn our cross-lingual representation from only three languages (i.e., English, Hindi, and Telugu) due to dataset availability, which might limit its generalization to other languages. Besides, similar to other instruction-following agents, our agent might fail to reach the target given some instructions, which requires further human assistance.

B Navigation Model
Our navigation agent follows the decoder structure of Tan et al. (2019). At each time step $t$, the agent perceives the panoramic observation as single views $o_{t,m}$, each encoded by concatenating its image feature with its orientation features:
$$f_{t,m} = [\mathrm{ResNet}(o_{t,m});\, (\cos\theta_{t,m}, \sin\theta_{t,m}, \cos\phi_{t,m}, \sin\phi_{t,m})]$$
where $\theta_{t,m}$ and $\phi_{t,m}$ are the heading and elevation of view $o_{t,m}$. As a reaction to the input, the agent needs to select one of the $K$ navigable locations as an action. The action is represented as the orientation features (heading and elevation) between the current viewpoint and the chosen navigable viewpoint. The navigation decoder takes the attended visual feature $f_t$ of the current viewpoint and the previous action embedding $a_{t-1}$ as input, and updates its environment-aware context vector $h_t$:
$$h_t = \mathrm{LSTM}([f_t; a_{t-1}],\, \tilde{h}_{t-1})$$
where $a_{t-1}$ is represented as the orientation features $(\cos\theta_{t-1,k^\star}, \sin\theta_{t-1,k^\star}, \cos\phi_{t-1,k^\star}, \sin\phi_{t-1,k^\star})$ of the chosen navigable viewpoint $k^\star$ at time step $t-1$, and $\tilde{h}_{t-1}$ is the instruction-aware context vector that incorporates the attended instruction information. The navigator calculates the probability of moving to the $k$-th navigable location based on the alignment between the visual feature $g_{t,k}$ of that navigable location and the instruction-aware context vector $\tilde{h}_t$:
$$p_t(a_t = k) = \mathrm{softmax}_k\big(g_{t,k}^{\top} W\, \tilde{h}_t\big)$$
where $g_{t,k}$ is constructed similarly to $f_{t,i}$ in Eqn. 11, and $\tilde{h}_t$ is obtained by attending the environment-aware context over the language representations $w_j$.
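As an illustration, the action-selection step above can be sketched in plain Python; the attention and LSTM internals are simplified away, and the vector sizes and values are toy stand-ins rather than the paper's implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def action_probabilities(h_tilde, candidate_feats):
    """p_t(k): softmax over the alignment between each navigable
    candidate's feature g_{t,k} and the instruction-aware context h~_t
    (the projection matrix W is folded into the features here)."""
    logits = [dot(g, h_tilde) for g in candidate_feats]
    return softmax(logits)

def orientation_features(heading, elevation):
    """Orientation features of a chosen viewpoint, used as the
    previous-action embedding a_{t-1}."""
    return [math.cos(heading), math.sin(heading),
            math.cos(elevation), math.sin(elevation)]
```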

C Learning
Our CLEAR agent has two stages of learning: representation learning and navigation learning. In the representation learning stage, given a pair of instructions that correspond to the same navigation path, we train the shared multilingual encoder to generate representations of paired instructions close to each other by optimizing a contrastive loss $\mathcal{L}_{lang}$. Furthermore, we train the visual encoder to learn the connections between paths with similar instructions by optimizing the contrastive loss $\mathcal{L}_{visual}$. The representation learning process transfers the language representation to a domain-specific language representation and adapts the visual representation to learn the correlations underlying the navigation environments.
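As an illustration, an InfoNCE-style contrastive objective over paired representations can be sketched as follows; the temperature value and the use of in-batch negatives are our assumptions, not the paper's exact loss:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u)) *
           math.sqrt(sum(b * b for b in v)))
    return num / den

def contrastive_loss(anchors, positives, temperature=0.1):
    """For each anchor i, pull its paired positive i close and push the
    other positives in the batch away:
    -log( exp(sim(a_i, p_i)/tau) / sum_j exp(sim(a_i, p_j)/tau) )."""
    total = 0.0
    for i, a in enumerate(anchors):
        sims = [cosine(a, p) / temperature for p in positives]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        total += -(sims[i] - log_denom)
    return total / len(anchors)
```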
In the navigation learning stage, we use a mixture of imitation learning and reinforcement learning to train the agent on the navigation task as in Tan et al. (2019).
In imitation learning, we use teacher-forcing to determine the next navigable viewpoint. Different from previous methods (Hong et al., 2021; Tan et al., 2019; Huang et al., 2019b) that take the shortest path as the teacher action, our teacher action $a^\star_t$ at each time step $t$ is picked based on the given ground-truth path between the start point and the target point. The agent tries to imitate the teacher action $a^\star_t$ by minimizing the negative log probability:
$$\mathcal{L}_{IL} = -\sum_t \log p_t(a^\star_t)$$
We combine reinforcement learning with imitation learning to learn a more generalizable agent. At each time step $t$, the agent samples an action $a_t$ from the predicted distribution $p_t(a_t)$. We follow Hong et al. (2021) for reward shaping. The immediate reward at each time step $t$ consists of three parts. First, if the agent moves closer to the target viewpoint, a positive reward of +1 is given; otherwise, the agent receives a negative reward of -1. Second, to encourage instruction following, we include the normalized Dynamic Time Warping (nDTW) score in the reward: the agent gets a positive reward if the nDTW score for the navigated path increases. Lastly, the agent receives a negative reward if it misses the target: when the agent predicts the "STOP" action, it receives a +3/-3 reward based on whether it is within 3 meters of the target viewpoint. We use Advantage Actor-Critic (Mnih et al., 2016) to train the agent.
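A minimal sketch of this shaped reward follows, assuming the simulator provides distances to the target and an nDTW evaluator. The ±1/±3 magnitudes and the 3-meter success radius follow the text; using the nDTW increment as the size of the instruction-following bonus is our assumption:

```python
def step_reward(dist_prev, dist_now, ndtw_prev, ndtw_now, stopped,
                success_radius=3.0):
    """Shaped immediate reward at one time step.

    dist_*  : distance to the target viewpoint before/after the step
    ndtw_*  : nDTW score of the navigated path before/after the step
    stopped : True if the agent predicted the "STOP" action
    """
    if stopped:
        # +3 / -3 depending on whether the agent stops near the target.
        return 3.0 if dist_now < success_radius else -3.0
    # +1 for moving closer to the target, -1 otherwise.
    reward = 1.0 if dist_now < dist_prev else -1.0
    # Instruction-following bonus when the path's nDTW score increases.
    if ndtw_now > ndtw_prev:
        reward += ndtw_now - ndtw_prev
    return reward
```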
The navigation loss $\mathcal{L}_{nav}$ is a weighted combination of the imitation learning loss and the reinforcement learning loss.

D Dataset
We evaluate our agent on the Room-Across-Room (RxR) dataset (Ku et al., 2020). The dataset is built on the Matterport3D simulator (Anderson et al., 2018b). It contains 126,069 human-annotated instructions with an average instruction length of 78.

Table 7: Comparison between the visual representation trained with object constraints ('+visual'), without the sampling strategy ('-sample'), and without object constraints ('-object') on validation unseen sets. nDTW is the main metric for the Room-Across-Room (RxR) dataset.
The dataset is split into a training set, a seen validation set, an unseen validation set, and a test set. In the unseen validation set and the test set, the environments do not appear in the training set; thus, performance on these two sets shows the model's generalizability to new environments. There are 16,522 paths in total, and each path is annotated in 3 languages (with 3 instructions per language on average). The training set contains 11,089 paths, the seen validation set 1,232 paths, the unseen validation set 1,517 paths, and the test set 2,684 paths.
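As a quick sanity check, the four split sizes sum to the quoted total of 16,522 paths:

```python
# Split sizes of the RxR dataset as quoted above.
splits = {"train": 11089, "val_seen": 1232, "val_unseen": 1517, "test": 2684}

# The four splits partition the full set of paths.
total_paths = sum(splits.values())
```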

E Evaluation Metrics
To evaluate the performance of our model, we follow the metrics used in the Room-Across-Room paper (Ku et al., 2020). The metrics include: (1) Success Rate (SR): a navigation is considered successful if the agent stops less than 3m from the target location.
(2) Success rate weighted by Path Length (SPL) (Anderson et al., 2018a): this metric penalizes navigation with long paths (i.e., when two navigations both reach the target, the one with the shorter path length receives the higher SPL score).
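A minimal sketch of these two metrics, assuming each episode supplies the agent's final distance to the goal, the agent's path length, and the shortest-path length from start to goal:

```python
def success(dist_to_goal, threshold=3.0):
    """SR: a navigation succeeds if it stops within `threshold` meters."""
    return dist_to_goal < threshold

def spl(episodes):
    """SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is the
    success indicator, l_i the shortest-path length, and p_i the agent's
    actual path length (Anderson et al., 2018a)."""
    total = 0.0
    for dist_to_goal, agent_len, shortest_len in episodes:
        if success(dist_to_goal):
            total += shortest_len / max(agent_len, shortest_len)
    return total / len(episodes)
```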

F Implementation Details
In our experiments, we learn the shared multilingual representation based on the cased multilingual BERT-Base model. The instruction is truncated from the end with a maximum sequence length of 160. For the pre-trained vision model, we compare the performance of image features extracted from an ImageNet-pre-trained (Russakovsky et al., 2015) ResNet-152 (He et al., 2016) and a CLIP-pre-trained (Radford et al., 2021) model.

G Analysis: Effectiveness of Object-Matching Constraints
Our visual representation learning optimizes the similarity between panoramic views at each step of semantically-aligned path pairs. Since paths are not fully aligned, we use object matching as a constraint to filter out panoramic view pairs that do not contain the same objects. As shown in Table 7, the visual representation trained with fixed object classes as constraints ('-sample') improves the nDTW score (the main metric for the RxR dataset) by 0.6% compared with the visual representation trained without object-matching constraints ('-object'), suggesting that using object matching as a constraint helps learn a better visual representation. Besides, the sampling strategy (i.e., randomly sampling 10 object classes from the 27 object classes during each iteration) also helps the visual representation learning ('+visual'), further improving the nDTW score by 0.7% compared with the visual representation learned with the fixed 27 object classes ('-sample'). In total, our object-matching constraints and sampling strategy ('+visual') improve performance by 1.3% in nDTW score compared with learning without object constraints ('-object').
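The sampling strategy and object-matching constraint above can be sketched as follows; the class indices and the at-least-one-shared-object rule are illustrative assumptions:

```python
import random

NUM_CLASSES = 27  # total object classes, per the text

def sample_classes(k=10, rng=random):
    """Randomly sample k of the 27 object classes for this iteration."""
    return set(rng.sample(range(NUM_CLASSES), k))

def keep_pair(objects_a, objects_b, sampled):
    """Object-matching constraint: keep a panoramic-view pair only if
    both views contain a common object class from the sampled subset."""
    return bool(set(objects_a) & set(objects_b) & sampled)
```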

H Analysis: Performance Variance Reduction among Different Environments
We demonstrate that our CLEAR approach decreases the performance variance (i.e., the standard deviation of performance) across different environments. Intuitively, we want the agent to perform equally well in different environments rather than achieving high performance by learning to navigate only a few easy environments. We show results for the 11 environments in the validation unseen set in Table 8. Our CLEAR approach ('+both' in Table 2 in the main paper) outperforms the baseline model ('ResNet' in Table 2 in the main paper) in most environments. Moreover, the weighted standard deviation (weighted by # Data in Table 8) of our CLEAR approach is lower than that of the baseline model. Specifically, the standard deviation of the nDTW score is 9.24 for our CLEAR approach and 10.01 for the baseline model, suggesting that our CLEAR approach decreases the performance variance across environments.
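The weighted standard deviation above can be computed as in the sketch below, with per-environment scores and data counts taken from a table like Table 8:

```python
import math

def weighted_std(scores, weights):
    """Standard deviation of per-environment scores, weighted by the
    number of data points in each environment."""
    w_total = sum(weights)
    mean = sum(s * w for s, w in zip(scores, weights)) / w_total
    var = sum(w * (s - mean) ** 2 for s, w in zip(scores, weights)) / w_total
    return math.sqrt(var)
```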

I Analysis: Word Representation from Cross-Lingual Representation
The visual semantics are injected during cross-lingual language representation learning by maximizing the similarity between full instruction sentences (the representation of the 'CLS' token). However, it is unclear whether the word-level representations also learn such visual information. In this section, we investigate whether the learning encodes spatially close words/objects closer to each other. As shown in Table 9, we check the top-5 closest words to 'kitchen' and 'fire' from a vocabulary of 2,754 English tokens. We see that our cross-lingual representation places words that appear spatially near each other close together (e.g., 'kitchen' and 'island'/'dining', 'fire' and 'chair'/'fireplace'), while the m-BERT representation fails to (e.g., 'kitchen' and 'room'/'house', 'fire' and 'family'/'study').
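The nearest-word probe can be sketched as follows, assuming unit-normalized word embeddings so that the dot product equals cosine similarity; the toy vectors in the test are stand-ins, not the learned embeddings:

```python
def top_k_neighbors(query_vec, vocab_embeddings, k=5):
    """Rank vocabulary words by similarity of their embeddings to the
    query word's embedding; vocab_embeddings maps word -> unit-norm
    vector. Returns the k closest words."""
    def score(word):
        vec = vocab_embeddings[word]
        return sum(a * b for a, b in zip(query_vec, vec))
    ranked = sorted(vocab_embeddings, key=score, reverse=True)
    return ranked[:k]
```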

J Analysis: Alignment between Instructions and Environments
The Room-Across-Room dataset provides ground-truth alignment between instructions and navigation paths. To demonstrate that our CLEAR approach learns a good alignment between instructions and paths, we compare our CLEAR approach not only with the baseline approach but also with the ground-truth alignment provided in the RxR dataset. The attention weights for the grounded instruction for CLEAR, the baseline, and the ground truth are shown in Figure 4. We observe that our CLEAR model successfully attends to the sub-instructions "turn right", "move towards the open door to your right and exit the room through the door", "slightly turn left", and "move towards and stand in front of the sofa" sequentially. Although the baseline model also successfully executes the first two sub-instructions "turn right" and "move towards the open door", it gets lost later in the navigation. Furthermore, the alignment learned by our CLEAR approach matches the ground-truth alignment provided in the RxR dataset better.

K Analysis: Filtering out Low Quality Path Pairs
Table 9: Top-5 closest words to 'kitchen' and 'fire' under our cross-lingual representation and m-BERT.

We investigate whether filtering out low-quality path pairs during visual representation learning could further improve performance. Since our identified path pairs are retrieved based on the similarity between instructions, we hypothesize that a path pair is better aligned if it has a higher instruction similarity score. Thus, we experiment with filtering out instruction pairs that have a cosine similarity score of less than 0.90, 0.95, 0.98, and 0.99, and then train the visual representation with the filtered data and object-matching constraints. The proportion of filtered-out data is 1%, 6%, 28%, and 58%, respectively. We also experiment with filtering out 0% and 100% of the data. Filtering out 0% of the data is the same as our proposed environment-agnostic visual representation ('+visual' in Table 2), and filtering out 100% of the data is analogous to randomly initializing the visual encoder (note that filtering out 100% of the data is not the same as the baseline model ('ResNet' in Table 2), since the baseline model does not have the visual encoder we introduced in Sec. 3.2). We then train our environment-agnostic representation (in Sec. 3.2) based on the remaining data and show its performance on the validation unseen environments. As shown in Table 10, though the success rate improves when filtering out some path pairs with lower quality, not filtering out any path pairs achieves the highest nDTW score. This demonstrates that using object-matching constraints without filtering out path pairs with low instruction similarity is enough for learning a good visual representation. Furthermore, we see a significant performance drop when not fine-tuning the visual representation on any data, which indicates that training the visual encoder with semantically-aligned path pairs is important for agent performance.
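The filtering setup can be sketched as follows, assuming instruction-pair cosine similarities have been precomputed:

```python
def filter_pairs(path_pairs, similarities, threshold):
    """Keep only path pairs whose instruction-pair cosine similarity
    meets the threshold."""
    return [p for p, s in zip(path_pairs, similarities) if s >= threshold]

def filtered_fraction(similarities, threshold):
    """Proportion of path pairs dropped at a given threshold."""
    dropped = sum(1 for s in similarities if s < threshold)
    return dropped / len(similarities)
```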

L Analysis: Correspondence between Instruction Similarity and Path Pair Alignment
In this section, we show that instruction pairs with high similarity have BLEU and ROUGE scores similar to those of instruction pairs that correspond to the same path. Specifically, the BLEU-1 and ROUGE-L scores for instruction pairs with high similarity are 0.42 and 0.320, while the BLEU-1 and ROUGE-L scores for instruction pairs that correspond to the same path are 0.41 and 0.323. Randomly picked pairs get a 0.37 BLEU-1 score and a 0.295 ROUGE-L score. These results indicate that high-similarity instruction pairs may be of quality comparable to instruction pairs that correspond to the same path, and can be used to pick the semantically-aligned path pairs.
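For reference, BLEU-1 reduces to clipped unigram precision; a minimal sketch follows (omitting the brevity penalty and smoothing, which is our simplification of the full BLEU definition):

```python
from collections import Counter

def bleu1(candidate_tokens, reference_tokens):
    """Clipped unigram precision between a candidate instruction and a
    reference instruction: each candidate token counts at most as many
    times as it appears in the reference."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(len(candidate_tokens), 1)
```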