Language Embeddings for Typology and Cross-lingual Transfer Learning

Cross-lingual language tasks typically require a substantial amount of annotated data or parallel translation data. We explore whether language representations that capture relationships among languages can be learned and subsequently leveraged in cross-lingual tasks without the use of parallel data. We generate dense embeddings for 29 languages using a denoising autoencoder, and evaluate the embeddings using the World Atlas of Language Structures (WALS) and two extrinsic tasks in a zero-shot setting: cross-lingual dependency parsing and cross-lingual natural language inference.


Introduction
Recent efforts to leverage multilingual datasets in language modeling (Conneau and Lample, 2019; Devlin et al., 2019) and machine translation (Johnson et al., 2017; Lu et al., 2018) highlight the potential of multilingual models that can perform well across various languages, including ones for which training sets are scarce. Most current multilingual research focuses on learning invariant representations or removing language-specific features after training (Libovický et al., 2020; Bjerva and Augenstein, 2021). Despite recent advances, there are still limitations. Previous work has shown that similar languages can benefit from sharing parameters, but less similar languages do not help (Zoph et al., 2016; Pires et al., 2019). However, in spite of some interest in typology (Ponti et al., 2019), identifying similar languages is nontrivial, especially for less studied ones. In addition, as Zhao et al. (2019) suggest, learning invariant representations can actually harm model performance. Therefore, in order to leverage language-agnostic and language-specific information effectively, we propose to generate language representations and examine the interactions among different language representations.
* Equal contribution. Our learned language embeddings and code are available at https://github.com/DianDYu/language_embeddings
One way to represent language identity within a multilingual model is the use of language codes, or dense vectors representing language embeddings. If languages are represented with vectors that capture cross-lingual similarities and differences across different dimensions, this information can guide a multilingual model regarding what information in the model should be shared among specific languages, and how much. Much of the previous research focused on generating language embeddings using prior knowledge such as word order (Ammar et al., 2016; Littell et al., 2017), using a parallel corpus (Bjerva et al., 2019b; Östling and Tiedemann, 2017), or using language codes as an indicator to distinguish input and output words in a shared vocabulary into different languages (Johnson et al., 2017; Conneau and Lample, 2019). In contrast, our work focuses on generating and using language embeddings more effectively as soft-sharing (de Lhoneux et al., 2018) of parameters among various languages in a single model. Furthermore, we are motivated by a more difficult setting where the properties of each language are not known in advance and no parallel data is available.
We investigate whether we can generate language embeddings that represent typological information derived solely from corpora in each language, without the use of any parallel text, translation models, or prior knowledge. Inspired by the finding that structural similarity, especially word ordering, is crucial in large pretrained multilingual language models (K et al., 2020), we propose an unsupervised method leveraging denoising autoencoders (Vincent et al., 2008) to generate language embeddings. We show that our approach captures typological information by comparing the information in our language embeddings to language-specific information in the World Atlas of Language Structures (WALS, Dryer and Haspelmath, 2013). In addition, to address the question of whether the learned language embeddings can help in downstream language tasks, we plug the language embeddings into cross-lingual dependency parsing and natural language inference (XNLI, Conneau et al., 2018) in a zero-shot learning setting, obtaining performance improvements.

Related Work
Previous related research approached language representations by using prior knowledge, dense language embeddings with multilingual parallel data, or no prior knowledge about languages but having language embeddings primarily as a signal to identify different languages.

Feature-based language representations
An intuitive method to represent language information is through explicit information such as known word order patterns (Ammar et al., 2016; Littell et al., 2017), part-of-speech tag sequences (Wang and Eisner, 2017), and syntactic dependencies (Östling, 2015). Littell et al. (2017) propose sparse vectors using pre-defined language features such as known typological and geographical information for a given language. However, linguistic features may not be available for less studied languages. Our proposed approach assumes no prior knowledge about each language, deriving typological information from plain text alone. Once a vector for a target language is created, it contains many typological features of the target language and can be used for transfer learning in downstream tasks.

Dense representation with parallel data
Other previous work has also explored dense continuous representations of languages. One method is to append a language token to the beginning of a source sentence and train the language embeddings with a many-to-one neural machine translation model (Malaviya et al., 2017; Tan et al., 2019). Another method is to concatenate language embedding vectors to a character-level language model (Östling and Tiedemann, 2017; Bjerva and Augenstein, 2018; Bjerva et al., 2019a). These two methods require parallel translation data such as the Bible and TED Talks. Rabinovich et al. (2017) derive typological information in the form of phylogenetic trees from translations of different languages into English and French using the European Parliament speech corpus (Koehn, 2005), based on the assumption that unique language properties are present in translations (Baker et al., 1993; Toury, 1995). Bjerva et al. (2019b) abstract the sentences translated from other languages to English with part-of-speech tags, function words, dependency relation tags, and constituent tags, and train the embedding vectors by concatenating a language representation with a symbol representation. In comparison, we generate our language embeddings using no parallel corpora or linguistic annotation, which is suitable for a wider variety of languages, including situations where no parallel data or prior knowledge is available.

Language vectors without parallel data
The approach that is closest to ours is XLM (Conneau and Lample, 2019), which adds language embeddings to each byte pair embedding using Wikipedia data in various languages with a masked language modeling objective. However, similar to Johnson et al. (2017), the trained language embeddings only serve as an indicator to the encoder and decoder to identify input and output words in the vocabulary as belonging to different languages. In fact, in a follow-up paper, XLM-R (Conneau et al., 2020), language embeddings are removed from the model for better code-switching, which suggests that the learned language embeddings may not be optimal for cross-lingual tasks. In this paper, following the finding that structural similarity is critical in multilingual language models (K et al., 2020), we generate language embeddings from a denoising autoencoder objective and demonstrate that they can be effectively used in cross-lingual zero-shot learning.

Generating Language Embeddings
We first present the data used to generate language embeddings, then introduce our approach inspired by denoising autoencoders (Vincent et al., 2008).

Data and preprocessing
To train our multilingual model, we use the CommonCrawl dataset from the CoNLL 2017 shared task (Ginter et al., 2017) to obtain monolingual plain text in various languages. To represent words across different languages in a shared space, we use the unsupervised pretrained aligned word embeddings from MUSE (Lample et al., 2018; https://github.com/facebookresearch/MUSE). We choose the 29 languages from the CoNLL 2017 monolingual text dataset for which MUSE pretrained embeddings are available. A subset of 200K sentences is selected randomly for each language. The languages we use are: English, French, Romanian, Arabic, German, Russian, Bulgarian, Greek, Slovak, Catalan, Hebrew, Slovene, Croatian, Hungarian, Spanish, Czech, Indonesian, Swedish, Danish, Italian, Turkish, Dutch, Norwegian Bokmål, Ukrainian, Estonian, Polish, Vietnamese, Finnish, and Portuguese, which cover ten language genera.
We experiment with two types of word representations in training language embeddings. The most straightforward way is to use the pretrained MUSE embedding for each specific language (we refer to this setting as Spe.). We also experiment with mapping word embeddings from different languages into one language (English in our experiments, because it is used as the pivot language in MUSE embeddings; Eng.), for three reasons. First, because MUSE is trained mainly with an orthogonal rotation matrix, the distances among words within each language are maintained after mapping, so language identities can potentially be revealed; the learned language embeddings would then reflect features incorporated by the unsupervised word mapping methods instead of intrinsic language features. Second, we hypothesize that mapping to a single language space requires the model to encode more information in the language embeddings as language identities, instead of relying on the revealed ones. Finally, using shared word embeddings reduces the vocabulary size, and therefore memory usage, by effectively reducing both the lookup table size and the output softmax dimension.
For Eng. word embedding mapping, we align words from different languages to English embeddings using cross-domain similarity local scaling (CSLS, Lample et al., 2018). The vocabulary of our model is restricted to the words in the English MUSE embeddings, and all unknown words are replaced with a special unknown token. Although imperfect mapping from each language to English tokens may introduce noise (see scores in Appendix D) and result in a coarse approximation of the original sentences, crucial syntactic and semantic information should still be present.
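As a concrete reference, CSLS scores a translation pair by its cosine similarity minus the mean similarity of each word to its nearest neighbors, which penalizes "hub" words. The NumPy sketch below is an illustrative reimplementation of the Lample et al. (2018) formulation, not the MUSE code itself, and assumes both embedding matrices are already length-normalized:

```python
import numpy as np

def csls_scores(src, tgt, k=10):
    """Cross-domain similarity local scaling (Lample et al., 2018).

    src: (n, d) source word vectors; tgt: (m, d) target word vectors,
    both assumed length-normalized so dot products are cosines.
    Returns an (n, m) matrix of CSLS scores; the argmax over each row
    gives the translation of each source word.
    """
    cos = src @ tgt.T                                   # (n, m) cosines
    # r_T(x): mean cosine of each source word to its k nearest target words
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    # r_S(y): mean cosine of each target word to its k nearest source words
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * cos - r_src - r_tgt
```

Translating each source word then amounts to taking the row-wise argmax of this matrix over the English vocabulary.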
In our experiments, a language code is appended to each token according to the original language of the sentence. For instance, the German sentence "Er hat den roten Hund nicht gesehen" would be represented in our Spe. condition as "Er_de hat_de den_de roten_de Hund_de nicht_de gesehen_de", and in the Eng. condition as "he_de has_de the_de red_de dog_de not_de seen_de". Intuitively, the idea is to have the words themselves be the same across languages (either through the aligned MUSE embeddings or by direct mapping to English words), and to let the additional language code provide the model with the information that would explain the structural differences observed across languages in the training data.
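The tagging scheme above can be made concrete with a small helper; `tag_with_language_code` is a hypothetical name, and tokenization is simplified to whitespace splitting:

```python
def tag_with_language_code(sentence, lang):
    """Append a language code to every token, so the same aligned word
    vectors can be shared across languages while the suffix tells the
    model which language's word order it should model."""
    return [f"{tok}_{lang}" for tok in sentence.split()]

# The German example from the text, in the Spe. condition:
# tag_with_language_code("Er hat den roten Hund nicht gesehen", "de")
```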

Denoising autoencoder
Given a multilingual plain text corpus with sentences in each language (and no parallel text), we first perturb each sentence to create a noisy version in which its words are randomly shuffled. The training objective is to recover the original sentences, which requires the model to learn how to order words in each language. We hypothesize that, compared to language modeling, this will encourage the language embeddings to learn more structural information, instead of relying on topics or word co-occurrence to generate meaningful training sentences. We implement our multilingual denoising autoencoder with an LSTM-based (Hochreiter and Schmidhuber, 1997) sequence-to-sequence model (Sutskever et al., 2014). The input strings are perturbed sentences and the output strings are the original sentences. See Appendix A.1 for implementation details.
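The perturbation step can be sketched as follows, assuming a full random shuffle of the tokens (the exact noise schedule is an assumption here):

```python
import random

def make_denoising_pair(sentence, rng=random):
    """Create one (noisy input, target) training pair: the input is the
    sentence with its tokens randomly shuffled, and the target is the
    original sentence, so the model must learn each language's word
    order to reconstruct it."""
    tokens = sentence.split()
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return " ".join(shuffled), sentence
```

Each corpus sentence yields one such pair, and the sequence-to-sequence model is trained to map the shuffled string back to the original.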
After preprocessing the data, we concatenate a language embedding vector, initialized from a normal distribution, as a language identity feature (the language code mentioned in Section 3.1) to each of the pretrained word embeddings. Since certain languages are more similar to, or more different from, each other, the model will learn how to reorder a sequence of words depending on the specific language. For example, reordering an Italian sentence should be more similar to reordering a Spanish sentence than it is to reordering a German sentence. Because the decoder captures the actual word order of the sentences in each target language, whereas the language codes in the encoder are meant to capture only language identity and no word order information, we use the language embeddings extracted from the decoder in our experiments. Each word is represented with a pretrained 300-dimensional vector, and each language embedding is represented with a 50-dimensional vector. The input token is thus a 350-dimensional vector resulting from the concatenation.
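A minimal sketch of the input construction (the NumPy vectors here are toy values; in the model, word embeddings come frozen from MUSE and language embeddings are learned during training):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretrained 300-d word vectors (frozen) and a learned 50-d language
# embedding per language, initialized from a normal distribution.
word_emb = {"Hund": rng.normal(size=300)}
lang_emb = {"de": rng.normal(scale=0.1, size=50)}

def input_vector(word, lang):
    """Each input token is the concatenation of its word embedding and
    its language embedding: 300 + 50 = 350 dimensions."""
    return np.concatenate([word_emb[word], lang_emb[lang]])
```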

Experiments
To examine the quality of the typological information captured by the language embeddings, we perform intrinsic and extrinsic evaluations. Our intrinsic evaluation consists of predicting linguistic typology and language features from the World Atlas of Language Structures (WALS, Dryer and Haspelmath, 2013). Our extrinsic evaluations are based on cross-lingual dependency parsing and cross-lingual natural language inference (XNLI, Conneau et al., 2018) in a zero-shot learning setting, where a trained model makes predictions on a language not seen during training, but for which a language embedding has been learned from plain monolingual text. In contrast with previous research, which applies learned typology to cluster similar languages and trains machine translation models per cluster (Tan et al., 2019), we explore whether we can apply the learned embeddings directly in downstream tasks. We compare three sets of embeddings based on our approach with three sets of embeddings from previous work:
• Spe. lang_emb: language embeddings from our proposed denoising autoencoder, trained with language-specific MUSE embeddings on CommonCrawl text.
• Eng. lang_emb: language embeddings trained with English MUSE embeddings after mapping words from different languages to English, using CommonCrawl text.
• Wiki lang_emb: language embeddings trained with English MUSE embeddings using Wikipedia, with the same data selection and preprocessing as in Section 3.1. These embeddings show the impact of the training data and allow a direct comparison with XLM embeddings trained on Wikipedia.
• Malaviya: language embeddings from Malaviya et al. (2017), trained with a many-to-one machine translation model on Bible parallel data. They cover 26 of our 29 languages, all except English, Hebrew, and Norwegian, and represent previous methods that learn language representations from parallel data.
• XLM mono: language embeddings trained with the XLM model on the same monolingual data as Wiki lang_emb, covering 29 languages.
• XLM parallel: language embeddings trained with XLM on monolingual and parallel data from the 15 XNLI languages, extracted from the publicly available model.

Linguistic typology prediction
We first inspect the language embeddings qualitatively through principal component analysis (PCA) visualization. We also use spectral clustering to recover the language genus (language family subgroup) information from the embeddings. To compare the quality of the clusterings quantitatively, we calculate the adjusted Rand index (Hubert and Arabie, 1985) between the generated clusters and the actual language genera.
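All three steps map onto standard scikit-learn components. The sketch below uses hypothetical variable names and default clustering settings, which may differ from the paper's exact configuration:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

def inspect_typology(lang_vecs, genus_labels, n_clusters=4, seed=0):
    """Qualitative and quantitative typology checks: 2-D PCA coordinates
    for plotting, plus the adjusted Rand index between spectral clusters
    and gold genus labels (1.0 = perfect match, ~0.0 = chance level)."""
    coords = PCA(n_components=2).fit_transform(lang_vecs)
    clusters = SpectralClustering(n_clusters=n_clusters,
                                  random_state=seed).fit_predict(lang_vecs)
    return coords, adjusted_rand_score(genus_labels, clusters)
```

Here `lang_vecs` would be the learned 50-dimensional language embeddings stacked into a matrix and `genus_labels` the WALS genus of each language.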

WALS feature prediction
We evaluate the language embeddings on predicting language features in WALS. Each WALS feature describes a characteristic of languages, such as the order of subject, object, and verb. We consider the features for which information is available for more than 50% of the languages we use and cast each feature prediction as a multi-class classification task. We then classify the features into the following categories (see details in Appendix B).
• Lexicon: features mainly concerning lexical semantics, e.g. whether the language has separate words for "hand" and "arm";
• Syntax: mostly related to the relative orders between various types of constituents, including the order of subject, object, and verb, and of adpositions and noun phrases, and also features related to syntactic constructions;
• Partially Morphological (Part. Morph.): features that mainly concern syntax or semantics but either usually relate to morphology (such as inflectional morphemes), or have morphological information coded in the values of the features, e.g. gender systems, order of negative morphemes and verbs;
• Non-learnable: features that mainly concern morphology, phonology, or phonotactics, and are not learnable from reordering plain text.
The categories make it easier to evaluate what the language embeddings capture. We train linear classifiers to predict WALS values. For each feature, we hold out one language, train a classifier on the language embeddings of the remaining languages, and predict the corresponding feature value from the held-out language embedding, in a leave-one-out cross-validation scheme. We then average the accuracy of the features within each category to report the results. In addition to comparing different language embeddings, we also compare to two baselines: a Random baseline and a Majority baseline (which predicts the most common value for each feature). We repeat this procedure 100 times while randomly permuting the order of the input vectors to the classifiers, to eliminate possible effects of initial states, and report average scores and statistical significance.
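The leave-one-out protocol for a single WALS feature can be sketched as follows; the choice of logistic regression as the linear classifier is an assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def loo_feature_accuracy(lang_vecs, feature_values):
    """Leave-one-out evaluation for one WALS feature: hold out each
    language in turn, fit a linear classifier on the remaining language
    embeddings, and predict the held-out feature value."""
    X, y = np.asarray(lang_vecs), np.asarray(feature_values)
    hits = 0
    for i in range(len(X)):
        train = np.arange(len(X)) != i          # boolean mask: all but i
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        hits += int(clf.predict(X[i:i + 1])[0] == y[i])
    return hits / len(X)
```

Per-category results would then average this accuracy over all features assigned to the category.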
A recent shared task (Bjerva et al., 2020) provides some features of a language (e.g. language family and various WALS features), optionally together with pre-computed language embeddings, as input for predicting other features. In contrast, we investigate whether trained language embeddings alone can be used to predict WALS features. In addition, unlike Bjerva et al. (2020), we show that our language embeddings outperform a frequency baseline, among other baselines (see Section 5.2).

Cross-lingual dependency parsing
Since our language embeddings are trained using a word ordering task, we hypothesize that they capture syntactic information. To verify that meaningful syntactic information is captured in the language embeddings, we use a dependency parsing task where sentences for each target language are parsed with a model trained on treebanks from other languages, with no training data for the target language. This can be seen as a form of cross-lingual parsing or zero-shot parsing, where multiple source languages are used to train a model for a new target language. Without annotated training data for parsing a target language, the model is expected to leverage treebanks from other languages through language embeddings.
We use 16 languages from Universal Dependencies v2.6 (Zeman et al., 2020), representing five distinct language genera (Table 2). We modified Yu Zhang's implementation of the biaffine dependency parser (Dozat and Manning, 2017). Specifically, we freeze word embeddings, concatenate a 50-dimensional embedding (either the corresponding Eng. language embedding or a random embedding) to the embedding of each token, and do not use part-of-speech information (since we assume no annotated data is available for the target language). The goal of this evaluation is not to obtain state-of-the-art attachment scores, but to find whether a model that uses our language embeddings produces higher attachment scores than a model that instead uses random embeddings of the same size. While our embeddings should capture syntactic typology, random embeddings would simply indicate to the model the language of each sentence, with no information about how languages are related.

XNLI
Natural language inference (NLI) is a language understanding task where the goal is to predict textual entailment between a premise and a hypothesis as a three-way classification: neutral, contradiction, and entailment. The XNLI dataset (Conneau et al., 2018) translates English NLI validation and test data into 14 other languages. We evaluate on the ten XNLI languages for which we trained language embeddings.
State-of-the-art models on XNLI are Transformer-based: XLM (Conneau and Lample, 2019) and XLM-R (Conneau et al., 2020). XLM adds language embeddings together with each word embedding and position embedding as the input embedding when training with masked language modeling (MLM, with monolingual data) and/or translation language modeling (TLM, with translation parallel data). In comparison, XLM-R removes language embeddings and is pretrained with MLM on much more data. We train our model on the English MultiNLI dataset (Williams et al., 2018), and directly evaluate the trained model on the other languages without language-specific fine-tuning, in a zero-shot cross-lingual setting. To select the best checkpoint for test set evaluation, we follow Conneau et al. (2020) by evaluating on the development sets of all languages. In addition, we also experiment with a fully zero-shot transfer setting where we select the best checkpoint by evaluating on the English development set only. We run the selected checkpoint on the test set of each language and report accuracy scores. We use the publicly available XLM model pretrained on 15 XNLI languages with MLM and TLM objectives, and XLM-R pretrained on 100 languages. In order to add our learned language embeddings to the XLM and XLM-R models, we normalize our embeddings to have the same variance as the XLM language embeddings, and we learn a simple linear projection layer to map our 50-dimensional embeddings (which are frozen during training) to the hidden dimension of the corresponding model. We report all results averaged over three random seeds. See Appendix A.2 for implementation details.
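The normalization and projection step can be sketched as follows; the function name, the exact normalization, and the parameter shapes are illustrative rather than the released code:

```python
import numpy as np

def project_language_embedding(lang_vec, target_std, W, b):
    """Rescale a frozen 50-d language embedding to match the standard
    deviation of the pretrained model's own language embeddings, then
    map it to the model's hidden size with a learned linear layer
    W (hidden x 50) and bias b (hidden)."""
    v = (lang_vec - lang_vec.mean()) / (lang_vec.std() + 1e-8)
    return W @ (v * target_std) + b
```

The projected vector would then be summed with the word and position embeddings at the input layer, as XLM does with its own language embeddings.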

Results and Analysis
We show results of our proposed language embeddings in comparison to the baselines and language vectors generated in previous work on linguistic typology, WALS, cross-lingual parsing, and XNLI. We report results with Eng. language embeddings. Detailed comparisons to other language embeddings on each task can be found in Appendix C.

Table 1: WALS prediction and linguistic typology clustering results on 26 in-common languages across 10 language genera. * indicates statistical significance (p < 0.01) over the Majority baseline.

Figure 1 shows a two-dimensional PCA projection of the learned language embeddings. Due to space limitations, we only show the projection of the language embeddings using words mapped to English embeddings; using language-specific embeddings produces similar results. We can clearly see the clustering of Slavic languages on the lower left, Romance on the right, and Germanic on the upper left. Our dataset also contains two Finnic languages, which appear right above the Slavic languages, and two Semitic languages, which appear on the lower right. The other languages, Hungarian, Vietnamese, Indonesian, Turkish, and Greek, are from language groups underrepresented in our dataset, and appear either mixed with the Germanic languages (in the case of Hungarian, Turkish, and Greek) or far in the lower right corner (Vietnamese and Indonesian). Romanian, a Romance language, appears miscategorized by our language embeddings: while it is close to the cluster of Romance languages, it appears closer to the singleton languages in the dataset and to the two Semitic languages.

Linguistic Typology
In addition to actual language relationships, represented by color, we also present the result of spectral clustering with four categories through different shapes. The results illustrate that our language embeddings can capture similarities and dissimilarities among language families. In comparison, language embeddings generated by Malaviya et al. (2017) do not capture clearly visible language relationships (see Appendix C.3). Quantitatively, clusters from our learned language embeddings (Eng.) achieve a much higher Rand score (0.58) than previous language embeddings, as shown in Table 1 (last column). This indicates that our clusters closely align with true language families.

Table 1 shows the prediction accuracy for WALS features, averaged within each category. Unlike the language representations generated by Bjerva et al. (2019b), which do not outperform the majority baseline without fine-tuning, our derived language embeddings perform significantly better than the baselines and previous methods in the lexicon, syntax, and partially morphological categories. Note that even though the training objective of the denoising autoencoder is to recover a language-specific word order, the model does not use linguistic features such as grammatical relation labels or subject-verb-object order information. Instead, it derives typological information from text alone through the word reordering task. The language embeddings generated with words mapped to English embeddings (Wiki and Eng.) generally produce more accurate predictions, with the models trained on Wikipedia producing slightly better results, likely due to cleaner training data. We show a WALS results comparison on 29 languages and a comparison to XLM parallel in Appendix C.1. Results from the different settings show that we do not need clean data (e.g. Wiki) to generate language embeddings.

Table 2: Cross-lingual dependency parsing results (Language, Baseline, Language Emb.). In the Baseline column, results were obtained using a random embedding instead. Boldface indicates a statistically significant difference (p < 0.05).

Cross-lingual dependency parsing
The cross-lingual dependency parsing results in Table 2 indicate that our language embeddings are in fact effective in allowing a parsing model to leverage information from different languages to parse a new language. Substantial accuracy improvements were observed for 13 of the 16 languages used in the experiment, while accuracy degradation was observed for two languages. Notably, there were large improvements for each of the four Romance languages used (ranging from 7.32 to 10.62 absolute points), and a steep drop in accuracy for Hebrew (-8.21). Although a sizeable improvement was observed for the only other language from the same genus in our experiment (Arabic, with a 4.07 improvement), accuracy for the two Semitic languages was far lower than the accuracy for the other genera. This is likely due to the over-representation of Indo-European languages in our dataset, and the lower quality of the MUSE word alignments for these languages (Appendix D).
While our accuracy results are well below current results obtained with supervised methods (i.e. using training data for each target language), the average accuracy improvement of 3.4 over the baseline, which uses the exact same parsing setup but without language embeddings, shows that our language embeddings encode actionable syntactic information, corroborating our results using WALS.

XNLI prediction
The XNLI results in Table 3 indicate that our language embeddings, which capture relationships between each test language and the training language (English), are also effective in tasks involving higher-level semantic information. We observe consistent performance gains over very strong baselines in all settings and models for each language. Specifically, in the fully zero-shot setting, where we select the best model based on the English development data, adding our learned language embeddings increases accuracy by 1.1 absolute points on average for XLM. The same trend holds for the XLM-R results, not shown due to space limits. On the other hand, if we select the best model on the averaged development sets following Conneau et al. (2020), we observe average performance gains of 0.9, 0.5, and 0.6 absolute points for XLM, XLM-R Base, and XLM-R Large, respectively. We conjecture that the smaller improvement on XLM-R models compared to XLM arises because XLM-R was pretrained without language embeddings: when we add our language embeddings to the original word and positional embeddings, the distribution of the overall input embedding (e.g., its variance) changes, so the language embeddings may initially act as noise, making the additional information hard to learn and incorporate. However, the improvement is consistent over all strong baselines, suggesting that our language embeddings, which are not optimized towards any specific task, can be leveraged off-the-shelf in large pretrained models to achieve better zero-shot transfer in downstream tasks.

Discussion
Our results in each of the intrinsic and extrinsic evaluation settings demonstrate that our denoising autoencoder objective, which has been shown to be effective in various language model pre-training tasks (Lewis et al., 2020;Raffel et al., 2020), is effective for learning language embeddings that capture typological information and can be used to improve cross-lingual inference. Even though reconstructing the original sentence from a randomly ordered string is the direct training objective, our evaluation of the resulting embeddings is not based simply on word order.
The grammar of a language is of course an important factor in determining the order of words in a sentence in that language, although it is not the only factor. The syntax-area features in our WALS evaluation, which are largely related to relative orders of constituents and to syntactic constructions, and are therefore clearly relevant to our training objective, confirm that part of what our embeddings capture is in fact related to word ordering. However, our results in the lexicon and morphology areas indicate that the language-specific information captured in our embeddings goes beyond ordering information. Although it may seem that the model only has access to information about word ordering during training, text in the various languages also provides information about word usage, co-occurrence, and, through the word embeddings, to some extent even inflection. As a result, language embeddings trained with our approach capture interpretable and useful typological information beyond word order. Because the language embeddings are the only signal indicating to the model what each of the languages mixed within the training data looks like, we conjecture that our denoising autoencoder objective encourages the embeddings to encode the language-specific information necessary to distinguish each language from the others.

Conclusion
Language embeddings have the potential to contribute to our understanding of language and linguistic typology, and to improve the performance of downstream multilingual NLP applications. Our proposed method to generate dense vectors to capture language features is relatively simple, based on the idea of denoising autoencoders. The model does not require any labeled or parallel data, which makes it promising for cross-lingual learning in situations where no task-specific training datasets are available.
We showed that the trained language embeddings represent typological information and can also benefit downstream tasks in a zero-shot learning setting. This is an encouraging result, indicating that task-specific annotated data for various languages can be leveraged more effectively for improved task performance in situations where language-specific resources are scarce. At the same time, our results indicate that the effectiveness of our approach is sensitive to the set of languages used, highlighting the importance of using a more balanced variety of languages than is current practice, our work included. As future work, we will pursue an investigation of the impact of language selection in multilingual and cross-lingual models, to deepen our understanding of these methods and their broader applicability.

Acknowledgments
We thank the anonymous reviewers for their constructive suggestions. This work was supported by the National Science Foundation under Grant No. 1840191. Any opinions, findings, and conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of the NSF.

Ethical Consideration
Our motivation to learn language embeddings without parallel data is to understand how language relationships and typology can be derived without any human annotation. We also explore how our learned language embeddings can be applied to downstream tasks. We hope that our proposed method can inspire future research on generating and utilizing typology in cross-lingual settings, because we may not have a large amount of translation data for each language, which has been widely used in past research on data-driven modeling of linguistic typology. Since our proposed method can be easily adapted to different architectures and pre-trained models with minimal cost (in terms of both data annotation and computation), it can reduce the resources needed when applying language embeddings to zero-shot cross-lingual downstream tasks. We ran all our experiments on two TITAN RTX GPUs and two RTX 2080Ti GPUs. We compare our language embeddings to baselines in the standard settings in the literature.

A.2 XNLI
For XNLI experiments with both XLM and XLM-R, we follow the hyper-parameter tuning suggestions in the code base and author response. We tune the hyper-parameters on the English development set to match the scores reported in the corresponding papers, and use the same hyper-parameters for all runs.
Specifically, for XLM, we fine-tuned the mlm_tlm_xnli15_1024 model with the implementation from the XLM code base (Conneau and Lample, 2019). We use a learning rate of 5e-6 (from a suggested range of [5e-6, 2.5e-5, 1.25e-4]) and a batch size of 8 (from a suggested range of [4, 8]).

Table 8: Precision at k = 1 word translation to English for the most frequent 50,000 words in each language using CSLS for the generated dictionary.