Cross-lingual Sentence Embedding using Multi-Task Learning

Multilingual sentence embeddings capture rich semantic information, useful not only for measuring similarity between texts but also for a broad range of downstream cross-lingual NLP tasks. State-of-the-art multilingual sentence embedding models require large parallel corpora to learn effectively, which confines the scope of these models. In this paper, we propose a novel sentence embedding framework based on an unsupervised loss function for generating effective multilingual sentence embeddings, eliminating the need for parallel corpora. We capture semantic similarity and relatedness between sentences using a multi-task loss function to train a dual encoder model that maps different languages onto the same vector space. We demonstrate the efficacy of an unsupervised as well as a weakly supervised variant of our framework on the STS, BUCC and Tatoeba benchmark tasks. The proposed unsupervised sentence embedding framework outperforms even supervised state-of-the-art methods for certain under-resourced languages on the Tatoeba dataset and on a monolingual benchmark. Further, we show enhanced zero-shot learning capabilities for more than 30 languages, with the model being trained on only 13 languages. Our model can be extended to a wide range of languages from any language family, as it overcomes the requirement of parallel corpora for training.


Introduction
Sentence embeddings provide an efficient way to encode the semantic information of text by mapping texts onto a shared vector space, such that sentences with similar meanings receive similar representations. With the abundance of data in diverse languages, cross-lingual sentence embeddings enable the mapping of multilingual texts into a single unified vector space for a wide range of Natural Language Processing (NLP) tasks. Current sentence embedding methods are predominantly monolingual systems, geared mainly towards English (Conneau et al., 2017; Yin et al., 2020). While there exist multilingual sentence embedding frameworks, they are mostly supervised methods requiring a large parallel corpus for training. For under-resourced languages, there is not sufficient training data to effectively learn a model; we show that our unsupervised approach can better exploit the available non-parallel data, and thus produce better results for under-resourced languages. This is achieved by using a dual-encoder architecture that scores word-level semantic similarity (via Word Mover's Distance) and learns to embed each sentence into a single vector.

* Work started during internship at Huawei Research.
Supervised sentence embedding approaches map parallel sentences from source and target languages into the same vector space by either maximising their cosine similarity or minimising the distance between the generated embeddings (Artetxe and Schwenk, 2019; Reimers and Gurevych, 2020). For example, recent supervised methods using parallel corpora rely on a teacher-student model to minimize cross-lingual embedding distance (Reimers and Gurevych, 2020), or on an additive margin softmax based dual sentence encoder to maximally separate sentences that are true translations from similar overlapping sentences (Yang et al., 2019). Although such methods produce good results, the use of these loss functions in unsupervised settings fails to efficiently capture cross-lingual semantic similarities across sentences. The state-of-the-art unsupervised approach relies on automated machine translation to generate a "pseudo parallel corpus" (Kvapilíková et al., 2020) for training. This method is affected by the presence of translation errors and fails to generalize to low-resource languages for which translations are not available.
To alleviate the above challenges, in this paper we propose DuEAM, a cross-lingual sentence embedding framework based on a novel dual encoder architecture with an unsupervised joint loss function using an anchor-learner approach, a variant of the teacher-student model. We also report the performance of a weakly-supervised variant of our unsupervised DuEAM architecture (obtained by simply changing the training dataset). The weakly-supervised framework for learning semantic relationships between cross-lingual sentences is motivated by the existence of the multilingual natural language inference dataset XNLI (Conneau et al., 2018), and by the possible creation of such a dataset from existing comparable corpora. The unsupervised DuEAM framework learns from randomly chosen sentence pairs from the XNLI dataset (see Sec. 5). Thus, we overcome the need for parallel sentences for multilingual sentence embedding generation. To gauge the degree of similarity between monolingual and multilingual sentence pairs during training, the anchor module uses the Word Mover's Distance (WMD) (Kusner et al., 2015) (treated as a scalar value during backpropagation), while the learner module is trained to generate sentence embeddings (refer to Fig. 1). Thus, we learn a low-dimensional embedding of the sentences from the more complex encoding induced by WMD. We show that our joint loss formulation effectively captures cross-lingual semantic similarity between sentences by preserving distances between points across languages.
Extensive experiments (in Section 6) on multilingual sentence similarity and parallel sentence mining tasks showcase the efficacy of our sentence embedding framework. For example, on the cross-lingual STS benchmark (Reimers and Gurevych, 2020), our unsupervised approach achieves a state-of-the-art unsupervised average Spearman rank correlation score of 62.1, comparable to the supervised sentence embedding approach LASER (Artetxe and Schwenk, 2019) with an average of 65.8. In fact, for certain languages, our models even outperform LASER (e.g., for EN-DE our unsupervised model achieves a Spearman rank correlation score of 64.6 and our weakly-supervised model 69.4, compared to 64.2 for LASER). On the BUCC bitext mining task (Zweigenbaum et al., 2017) our model achieves a better F1 score than the existing unsupervised model of Kvapilíková et al. (2020). Interestingly, for certain under-resourced languages, we outperform both LASER and multilingual S-BERT (Reimers and Gurevych, 2020) by an average of 10% on the Tatoeba benchmark. We also show better or comparable performance to LASER even on monolingual classification benchmark tasks. Thus, our model is robust across diverse language families for multilingual sentence embeddings.
In a nutshell, our contributions are: (i) DuEAM, a novel dual encoder based on an anchor-learner architecture for unsupervised and weakly-supervised multilingual sentence embedding generation; (ii) a joint loss function coupling Word Mover's Distance and cosine similarity to capture the degree of textual similarity and relatedness between sentence pairs; (iii) experimental evaluations on monolingual as well as several cross-lingual benchmark tasks, showing that our model efficiently captures semantic similarity across languages and provides state-of-the-art unsupervised performance, comparable with supervised models; and (iv) robustness in zero-shot transfer learning for low-resource languages across language families, outperforming state-of-the-art supervised approaches on sentence matching tasks in certain scenarios.

Related Work
Paragraph vectors were first proposed as sentence embeddings for computing document similarity (Le and Mikolov, 2014). The majority of current multilingual sentence embedding methods are supervised approaches. There exist some unsupervised sentence embedding frameworks (Zhang et al., 2020; Pagliardini et al., 2018), but they mostly target English. Initial methods generated sentence embeddings based on neural machine translation systems with a shared encoder (Schwenk, 2018; España-Bonet et al., 2017; Schwenk and Douze, 2017). The use of cosine similarities between source and target language parallel sentences was studied by Guo et al. (2018) using a bidirectional dual encoder architecture. Chidambaram et al. (2019) proposed the Multilingual Universal Sentence Encoder (mUSE), a dual-encoder model supporting 16 languages, trained on large web-mined parallel translation corpora along with data from Reddit, Wikipedia, and the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) to learn more context. A translation ranking task was used to identify the correct translation pair, and the architecture assumes 5 hard negative pairs for each sample during training. Subsequently, the LASER framework (Artetxe and Schwenk, 2019; Schwenk et al., 2019) considered a sequence-to-sequence architecture using LSTM networks, trained on parallel corpora designed for neural machine translation across 93 languages. Expanding beyond translation-based approaches, the multilingual sentence encoder model of Yang et al. (2020a) was trained for semantic retrieval on three different tasks: multi-feature question-answer prediction, translation ranking, and natural language inference (NLI). Recently, Yang et al. (2020b) proposed Conditional Masked Language Modeling (CMLM) to generate sentence embeddings by co-training the system with bi-text retrieval and NLI tasks.
To generate sentence embeddings beyond the naïve CLS token and simple pooling strategies of language models, sentence transformer architectures were proposed (Reimers and Gurevych, 2019). For multilingual S-BERT models, Reimers and Gurevych (2020) utilized a teacher-student model where the student is tuned with a parallel corpus covering 50 languages, based on knowledge transfer from a fine-tuned teacher model developed by Reimers and Gurevych (2019). LaBSE (Feng et al., 2020) was designed on the BERT architecture and trained on 6 billion sentence pairs using an additive margin softmax loss with in-batch negative sampling.
Another thread of research for sentence embeddings involves improving the alignment of contextual embeddings into a shared vector space using iterative self-supervised learning or tuning with synthetic parallel corpora. Hirota et al. (2020) introduced the Enhancing Multilingual Sentence Embeddings (EMU) framework, which semantically enhances pre-trained multilingual sentence embeddings. That is, instead of building sentence embeddings from scratch, EMU fine-tunes pre-trained multilingual sentence embeddings with two major components: enhancement of semantic similarity, and multilinguality of sentence embeddings via multilingual adversarial training. Further, Cao et al. (2020) used a parallel corpus as an anchor to align representations in a multilingual language model, whereas Wang et al. (2019) used iterative self-learning to perform the task.
Recently, Kvapilíková et al. (2020) proposed an unsupervised method for improving pre-trained cross-lingual context vectors using synthetic parallel sentences, extracting sentence embeddings via mean pooling. However, the use of machine translation to generate synthetic parallel data for such methods fails to generalize to low-resourced languages, for which translations might be erroneous.

DuEAM Model
In this section, we describe the components and working of the proposed Dual Encoder with Anchor Model (DuEAM) architecture for multilingual sentence embeddings, trained using an unsupervised multi-task joint loss function. Figure 1 depicts the dual-encoder based anchor module, where the same sentence pair is fed as input into both the anchor and the learner components. Note that these sentence pairs are not assumed to be parallel translations and can be from either the same language or different languages. Such architectures are well suited to capture semantically and contextually similar sentences and map them close to each other in a shared vector space.

Dual Encoder with Anchor-Learner
We use word-level semantic knowledge from pre-trained multilingual language models in the anchor module to gauge the semantic similarity between the source (s_i) and target (t_i) sentences. Subsequently, embeddings for s_i and t_i are generated by the learner such that their vector space distances reflect their degree of semantic relatedness. Inspired by the MoverScore measure (Zhao et al., 2019), the anchor module computes the semantic similarity between source and target sentences from the pre-trained multilingual word embeddings via the Word Mover's Distance (WMD) (Kusner et al., 2015).
The learner module is then trained to generate source and target sentence embeddings such that their Euclidean distance closely approximates the WMD obtained from the anchor. We thereby force the system to incorporate the anchor's word-level knowledge of the semantic relationships between the sentences, which also helps to stabilize the training process, as the pre-trained anchor model is fixed and the WMD score enters the loss as a constant scalar. Thus, during training, we generate embeddings that minimize the semantic loss L_A:

L_A = (1/N) Σ_i ( d_euc(ŝ_i, t̂_i) − d_wmd(s_i, t_i) )²    (1)

where d_wmd(s_i, t_i) = WMD(s_i, t_i) is the Word Mover's Distance between the input source and target sentences s_i and t_i, and d_euc(ŝ_i, t̂_i) is the Euclidean distance between the embeddings ŝ_i and t̂_i generated by the learner for the source and target sentences respectively. The use of our WMD based semantic loss factor enables DuEAM to learn a more compact representation between the sentence embeddings, better capturing cross-lingual semantic relationships in the shared vector space.
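As a minimal plain-Python sketch of the semantic loss term L_A (function names and the squared-error reduction are illustrative; the actual implementation operates on batched tensors, with the WMD precomputed by the frozen anchor and treated as a constant scalar):

```python
import math

def euclidean(u, v):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def semantic_loss(pairs):
    # pairs: list of (src_emb, tgt_emb, wmd) triples, where wmd is the
    # scalar Word Mover's Distance supplied by the frozen anchor module.
    # The learner is pushed to make the Euclidean distance between its
    # sentence embeddings match the anchor's word-level WMD score.
    return sum((euclidean(s, t) - wmd) ** 2 for s, t, wmd in pairs) / len(pairs)
```

The loss vanishes exactly when the learner's embedding distance reproduces the anchor's WMD score for every pair.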

Dual Encoder with Translation Mining
While mapping semantically similar sentences close to each other in the shared vector space, an effective embedding framework should also address the translation ranking problem, to efficiently map correct translations of source-target sentence pairs within a compact zone of the vector space. Yang et al. (2019) addressed this problem by introducing hard negative sentence pairs along with parallel data during the training process. Since DuEAM is an unsupervised approach, we instead introduce a translation mining based loss L_B, using the cosine similarity score between source and target sentences:

L_B = (1/N) Σ_i cos(ŝ_i, t̂_i)    (2)

where cos(ŝ_i, t̂_i) is the cosine similarity between the learner's source and target sentence embeddings.
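A hedged sketch of the translation mining term, assuming a mean over the batch (function names are illustrative):

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def translation_loss(pairs):
    # Average cosine similarity between source and target sentence
    # embeddings; training *maximises* this term (see Eq. 2).
    return sum(cosine(s, t) for s, t in pairs) / len(pairs)
```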

Multi-Task Dual Encoder Learning
To bring both loss functions under the same umbrella, we construct a multi-task learning setup where we minimize Eq. 1 while maximizing Eq. 2. Hence, to efficiently generate sentence embeddings, the final multi-task loss function for training DuEAM is given by

L = L_A − λ · L_B    (3)

where λ is the weight parameter.
This multi-task joint learning enables DuEAM to effectively capture both cross-lingual semantic similarity and text translation ranking relationship.
Overall, our unsupervised loss function aims to learn sentence embeddings such that the Euclidean distance between them is proportional to the semantic distance obtained from WMD, thereby providing a low-dimensional embedding of the more complex word-level similarity space (using Eq. (1)). Additionally, Eq. (2) enables our framework to align sentence embeddings in the cosine space, for translation understanding.
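Putting the two terms together, the joint objective can be sketched as follows (a plain-Python illustration; the batch reduction and helper names are assumptions, not the paper's exact implementation):

```python
import math

def _euc(u, v):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def _cos(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def multi_task_loss(pairs, lam=0.003):
    # pairs: (src_emb, tgt_emb, wmd) triples. Minimising this expression
    # minimises the semantic term (Eq. 1) while maximising the cosine
    # translation-mining term (Eq. 2), weighted by lambda.
    l_a = sum((_euc(s, t) - w) ** 2 for s, t, w in pairs) / len(pairs)
    l_b = sum(_cos(s, t) for s, t, _ in pairs) / len(pairs)
    return l_a - lam * l_b
```

The default lam=0.003 mirrors the weight parameter value reported in the appendix.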

Intuitions behind Loss Function
In DuEAM, WMD is computed between the contextual token embeddings (of the pair of input sentences) obtained from the anchor encoders, without any stopword removal (as is standard when using language models (Conneau et al., 2020)). While multilingual language models capture different languages in a common space, WMD captures the contextualized semantic distance between the multilingual input sentences. Our use of WMD is motivated by MoverScore (Zhao et al., 2019). Specifically, the use of WMD and cosine similarity in the loss function of DuEAM is based on the following intuitions: • The learner module is trained to generate sentence embeddings such that the Euclidean distance between the learnt sentence embeddings closely approximates the WMD (calculated by the anchor using token embeddings) between sentence pairs. This tends to preserve the "relative semantic distance" between the input sentences, enabling DuEAM to capture the "semantic relation at word level" within the sentence embeddings obtained.
• Existing methods using parallel sentences for training effectively teach the architecture to learn similar embeddings for similar context; however, the distance in the embedding space between dissimilar sentences is not considered. By using WMD, DuEAM generates closer representations for similar sentences, while at the same time forcing dissimilar sentences to have embeddings that are far apart in the embedding space. This provides better semantic understanding for improved performance in downstream tasks, as observed in our experiments. For example, the WMD between the German sentence "Sie ist keine Lehrerin" ('She is not a teacher') and the English sentence "She is a teacher" is greater than that between the German sentence "Sie ist eine Lehrerin" and its direct English translation, "She is a teacher". Thus, the embeddings differ, and DuEAM is able to capture negation and other semantic information for better performance.
• The cosine loss enables DuEAM to align the learnt sentence embeddings in the cosine space, based on the cosine similarity between the source and target train sentence pairs, to address the translation ranking problem. The weight parameter λ in the final multi-task loss function further controls its effect.
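Computing the exact WMD requires solving an optimal-transport problem; as a hedged illustration of the word-level distance the anchor provides, the sketch below uses the nearest-neighbour relaxation of WMD (the lower bound from Kusner et al., 2015) over toy token vectors:

```python
import math

def _dist(u, v):
    # Euclidean distance between two token embeddings.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(tokens_a, tokens_b):
    # Lower bound on the Word Mover's Distance: every token in one
    # sentence travels to its nearest token in the other sentence;
    # the max over both directions tightens the bound (Kusner et al., 2015).
    def one_way(src, tgt):
        return sum(min(_dist(u, v) for v in tgt) for u in src) / len(src)
    return max(one_way(tokens_a, tokens_b), one_way(tokens_b, tokens_a))
```

With contextual token embeddings, a sentence and its negated counterpart yield distinct token clouds, so this distance separates them even when most words overlap.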

Training Dataset
Following Chidambaram et al. (2019), we train our DuEAM architecture on natural language inference data, using only the XNLI dataset (Conneau et al., 2018) covering 13 languages, without any parallel corpora¹. We do not consider the entailment/contradiction labels in XNLI during training. Unsupervised Data. To create the training dataset for unsupervised training of DuEAM, we randomly pick sentences from the premises and hypotheses of the XNLI dataset and form the input sentence pairs. Hence, this random shuffling, along with the absence of sentence pair labels, provides no supervision during the training procedure.
Weakly Supervised Data. This training data contains both monolingual and cross-lingual sentence pairs, where the monolingual sentence pairs are the same as those of XNLI (without annotated labels). To create cross-lingual sentence pairs, we keep the premises from the source language and replace the hypotheses with target language hypothesis sentences, and vice-versa (example shown in Table 9). Note that this dataset does not contain any parallel cross-lingual sentences, but contains semantically related monolingual and cross-lingual sentences, providing weak supervision.
Validation (using the accuracy of finding parallel sentences) is done on a held-out set of 1K parallel sentences (across language pairs) from the TED2020 corpus. The DuEAM models trained on the above unsupervised and weakly supervised datasets are henceforth denoted as DuEAM_unsupv and DuEAM_wklysupv respectively. More details on the dataset and training are given in the appendix.
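A toy sketch of how such training pairs might be assembled (function names and the aligned-list input format are assumptions for illustration, not the paper's actual data pipeline):

```python
import random

def unsupervised_pairs(premises, hypotheses, n, seed=0):
    # Randomly pair premises with hypotheses; labels are discarded, so
    # the resulting pairs give no supervision signal.
    rng = random.Random(seed)
    return [(rng.choice(premises), rng.choice(hypotheses)) for _ in range(n)]

def weakly_supervised_pairs(src, tgt):
    # src/tgt: lists of (premise, hypothesis) pairs in two languages,
    # aligned by XNLI example id. Keep the source premise but swap in
    # the target-language hypothesis (and vice-versa): the pairs are
    # semantically related, not parallel translations.
    pairs = []
    for (p_s, h_s), (p_t, h_t) in zip(src, tgt):
        pairs.append((p_s, h_t))
        pairs.append((p_t, h_s))
    return pairs
```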

Experimental Evaluation
We evaluate the performance of our proposed DuEAM framework on the following 3 benchmark tasks: (a) STS: monolingual and cross-lingual semantic textual similarity; (b) BUCC: bitext mining to extract parallel sentences; and (c) Tatoeba: cross-lingual parallel sentence matching.

Multilingual Semantic Textual Similarity
Understanding semantic textual similarity between monolingual and cross-lingual datasets is one of the major tasks for a sentence embedding model. We evaluate our model against the STS benchmark dataset (Cer et al., 2017), containing sentence pairs with scores indicating how semantically similar the sentences are. The SemEval dataset consists of annotated sentences for the EN-EN, AR-AR, ES-ES, EN-AR, EN-ES, and EN-TR language pairs. Further, we also use the EN-DE, EN-IT and EN-NL test sets from multilingual SBERT (Reimers and Gurevych, 2020). For evaluation, we compute the cosine similarity between sentence pair embeddings and obtain the Spearman rank correlation ρ between the computed similarities and the gold scores.
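The evaluation protocol can be sketched as follows (a tie-free Spearman implementation for illustration; in practice a library routine such as scipy.stats.spearmanr would be used):

```python
import math

def _ranks(xs):
    # Rank positions of the values (assumes no ties, for simplicity).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for rank, i in enumerate(order):
        ranks[i] = float(rank)
    return ranks

def spearman(a, b):
    # Spearman rank correlation: Pearson correlation of the ranks.
    ra, rb = _ranks(a), _ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = math.sqrt(sum((x - ma) ** 2 for x in ra))
    vb = math.sqrt(sum((y - mb) ** 2 for y in rb))
    return cov / (va * vb)

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def sts_eval(emb_pairs, gold):
    # Cosine similarity per sentence pair, correlated against gold scores.
    sims = [cosine(u, v) for u, v in emb_pairs]
    return spearman(sims, gold)
```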
As shown in Table 1, the unsupervised baselines based on the language models (mBERT and XLM-R) perform quite poorly, suggesting that the obtained cross-lingual sentence embeddings are not well aligned in the vector space. When trained with multi-task learning, DuEAM achieves a significant improvement on both the monolingual and cross-lingual datasets. In cross-lingual settings, the two DuEAM models significantly outperform the unsupervised models, with average improvements of 41.9 and 37.2 points respectively in Spearman rank correlation (ρ). Similarly, on the monolingual datasets (EN-EN and ES-ES), our models achieve improvements of 26.9 and 21.4 points (on average) in rank correlation.
It is interesting to note that DuEAM achieves better results, compared to the supervised LASER and LaBSE approaches, for both the monolingual datasets (EN-EN and ES-ES). In cross-lingual settings, in certain cases, both DuEAM unsupv and DuEAM wklysupv are seen to outperform LASER (e.g., EN-DE and EN-ES language pairs).

Zero Shot Testing on STS benchmark
An important property of embedding techniques is "zero-shot learning", i.e., robustly generalizing to languages that the model has not been trained on, through inherent knowledge transfer from the other languages. To study the efficacy of DuEAM in zero-shot scenarios, in this setting we train the weakly-supervised model only on the EN-DE (English-German) training dataset (as in Section 5) and test on the other language pairs, including monolingual sentence similarity for ES-ES and AR-AR.
From Table 2, we can see that even for zero-shot learning DuEAM outperforms the unsupervised baseline models across all the monolingual and cross-lingual STS datasets, with improvements in Spearman rank correlation of around 20 points. Thus, our dual encoder architecture with multi-task learning provides better cross-lingual sentence embeddings, making it robust across diverse languages with improved performance.

Bitext Mining Task
Efficient multilingual sentence embeddings should have a good understanding of sentence parallelism and should be able to retrieve good translation pairs across corpora in different languages. Intuitively, sentence translation pairs should be equivalent in terms of semantic similarity, and hence their crosslingual embeddings should be very similar.
To evaluate the performance of our method, we conduct experiments on the BUCC benchmark mining task: parallel sentence extraction from two different monolingual corpora. We use the data available from the 2018 shared task, consisting of corpora for four language pairs (FR-EN, DE-EN, RU-EN, and ZH-EN), with a subset of parallel sentences demarcated as the gold mapping for each language pair. The data is split into train and test sets, and the training data is used to find a threshold for the scoring function, such that sentence pairs above the threshold are returned as parallel sentences. Performance is measured using F1 score. Similar to Reimers and Gurevych (2020), in this setting we use the margin-based scoring function score(x, y) = margin(cos(x, y), cos*(x, y)), with

cos*(x, y) = Σ_{z ∈ NN_k(x)} cos(x, z) / (2k) + Σ_{z ∈ NN_k(y)} cos(y, z) / (2k)

where x, y are the sentence embeddings, NN_k(x) is the set of k nearest neighbours of x in the other language (excluding duplicates), and margin(a, b) = a/b. Table 3 shows the performance of the approaches on the BUCC task. For the unsupervised setting, we observe that for all language pairs DuEAM performs significantly better than the XLM-R mean and mBERT mean methods, with an improvement of nearly 35.6 F1 points over mBERT. Additionally, we compare our model with the recent approach by Kvapilíková et al. (2020), specifically trained for the bitext mining task with synthetic parallel data. We see that DuEAM also outperforms these unsupervised models across all language pairs. Further, our weakly-supervised model achieves competitive results compared to the supervised model XLM-R ← SBERT-nli-stsb, trained with large parallel datasets. Observe that LASER and LaBSE achieve high accuracy, as they are specifically designed and trained to identify translations between languages. On the other hand, although DuEAM is not trained with any parallel data, we are able to effectively extract parallel sentences (bitext mining) owing to our multi-task learning.
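A plain-Python sketch of this margin-based scoring, assuming the ratio margin margin(a, b) = a/b and brute-force nearest-neighbour search (a real implementation would use approximate k-NN over the full corpora):

```python
import math

def _cos(u, v):
    # Cosine similarity between two sentence embeddings.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def margin_score(x, y, xs, ys, k=4):
    # Ratio-margin score: the raw cosine similarity of (x, y) is
    # normalised by the average similarity of x and y to their k
    # nearest neighbours in the other corpus, penalising "hub"
    # sentences that are close to everything.
    def avg_knn(q, cands):
        sims = sorted((_cos(q, c) for c in cands), reverse=True)[:k]
        return sum(sims) / (2 * k)
    denom = avg_knn(x, ys) + avg_knn(y, xs)
    return _cos(x, y) / denom
```

Pairs scoring above a threshold tuned on the training split are returned as parallel sentences.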

Cross-lingual Parallel Sentence Matching
In this section, we compare the performance of the approaches in extracting parallel sentences using the Tatoeba benchmark of Artetxe and Schwenk (2019). From Table 4, on well-resourced languages, we observe DuEAM to perform significantly better than the unsupervised approach of Kvapilíková et al. (2020), with results comparable to the supervised SBERT and LASER methods; as above, it efficiently extracts parallel sentences.

Tatoeba Under-Resourced Languages
We now evaluate the robustness of the approaches in extracting parallel sentences for under-resourced languages on the Tatoeba benchmark. In this setting, we consider two scenarios: (i) zero-shot learning and (ii) training on small-scale non-parallel datasets. We compare the results with the supervised models of SBERT (Reimers and Gurevych, 2020) and LASER (Artetxe and Schwenk, 2019) respectively, considering for LASER the baseline languages on which it has been pre-trained and for which results have been published. Zero-shot Transfer. We perform zero-shot transfer on different under-resourced languages: (i) Telugu (TE, Dravidian family), (ii) Tagalog (TL, Malayo-Polynesian family), (iii) Irish (Gaelic) (GA, Celtic family), and (iv) Afrikaans (AF, Germanic family). Results for these and further languages of the Tatoeba benchmark, for both trained and zero-shot scenarios, can be found in the appendix.

Monolingual Classification Performance
A multilingual sentence embedding framework is also expected to produce efficient results in monolingual settings. To evaluate the performance on monolingual classification tasks, we now study the performance of DuEAM using the SentEval toolkit. In Table 6, we see quite satisfactory results produced by the DuEAM framework. On every task, the DuEAM_unsupv model surpasses the performance of the supervised LASER model, whereas on the TREC task the results are better than even the supervised multilingual S-BERT model.
Overall, the monolingual and multilingual experimental results show that DuEAM effectively captures cross-lingual semantic understanding (without parallel training data), generating efficient sentence embeddings by aligning multiple languages in the same vector space. Observe that DuEAM is trained on only 1GB of data, while other supervised techniques are trained on around 10x or more data.

Ablation Study
We now study the effects of different components of DuEAM on the quality of the generated embeddings. Necessity of Multi-task Learning. One of the important features of DuEAM is multi-task joint learning via the dual-encoder based anchor-learner architecture. To explore the necessity of the different factors in our loss function, we use the Tatoeba dataset for the DE-EN, FR-EN, and HI-EN language pairs. We train the weakly-supervised DuEAM_wklysupv in three variants: (i) with only the anchor module loss term L_A (Eq. 1), which considers semantic similarity between sentences, (ii) with only the translation mining loss term L_B (Eq. 2), which identifies the best translation pairs, and (iii) with the full multi-task learning objective L. From Figure 2, we observe that the learning objective factors L_A and L_B individually perform quite poorly. However, the proposed multi-task training performs well, providing high accuracy in the range of 92% to 96%. Similar results are observed across the language pairs considered. Training Dataset Size. Table 5(b) depicts that DuEAM outperforms LASER on under-resourced languages with minimal training. To understand the impact of the size of the training dataset, we evaluated the performance of DuEAM on Tatoeba data for Georgian (KA) with varying training sizes.
In Table 7 we see a healthy performance improvement when our weakly-supervised model is trained with the language-specific dataset, with around 4.5% improvement over zero-shot learning given 20K training sentences. Thereafter, the improvement is incremental. Thus, although DuEAM demonstrates zero-shot learning capabilities, a small amount of language-specific data further boosts the performance. Training Dataset Type. We explore the performance of the unsupervised loss function in DuEAM under various training data scenarios. We consider the EN, DE, FR, and ES languages under 3 training settings: (i) 25K sentences from XNLI (weakly supervised), (ii) 25K parallel sentences from TED2020 (supervised), and (iii) 12.5K sentences from each of the datasets. Table 8 depicts similar performance when trained on XNLI or on a parallel corpus alone, while the combination outperforms the others, showcasing the stability of our proposed unsupervised joint loss function with respect to the training data and removing the dependency on parallel training datasets. Weight Parameter Value Selection. Selecting the weight parameter λ is important when training the multi-task model. We experimented over a range of values and chose the one for which the model performed best during training. Figure 3 gives a snapshot of model performance over different weight parameter values, reporting the average Spearman rank correlation (ρ) on the STS datasets for six language pairs. From the figure, the best performance is achieved with a weight parameter of 0.003; higher values decrease performance, underlining the importance of the weight parameter when training the model in multi-task settings.

Discussion: Semantic Similarity
In general, applications use the cosine similarity between sentence embeddings to gauge semantic textual similarity. We provide a performance analysis of DuEAM based on the raw cosine similarity score on the Tatoeba DE-EN data. For example, the German sentence "das ist der Geburtstag von Muiriel!" ("That is the birthday of Muiriel") has the highest cosine similarity with its English translation "it is Muiriel's birthday!", although the sentence "Happy birthday, Muiriel!" is very similar (refer to Table 10 in the appendix). This shows that DuEAM can capture fine-grained semantic differences among similar sentences. On the other hand, for the German sentence "Das Wesen der Freiheit liegt in der Mathematik." we obtain a higher cosine similarity score for the English sentence "The essence of mathematics is liberty.". In fact, the true translation "The essence of freedom lies in mathematics" (achieving the highest cosine similarity but absent in the Tatoeba dataset) is closer to "The essence of liberty is mathematics.". Although the similarity scores are almost equal, our model is unable to identify the correct word ordering in highly overlapping sentences, as the WMD measure is inherently word-order agnostic. Multilingual SBERT also fails in this scenario, but with a larger difference in cosine similarity between the sentences: 0.01, compared to 0.001 in DuEAM (using the translation mining loss factor L_B). The use of Wikipedia dumps for training such sentence embedding models forms an interesting future study.

Conclusion
This paper proposed DuEAM, a multilingual sentence embedding framework based on a dual encoder with an anchor-learner model, trained with an unsupervised loss function via multi-task learning. Experiments on monolingual and cross-lingual benchmarks showcase the efficacy of our sentence embeddings in capturing semantic similarities across languages. We demonstrate that DuEAM significantly outperforms existing unsupervised models for textual similarity understanding. We also demonstrate robustness in zero-shot learning and under limited training, catering to under-resourced languages, achieving results better than or comparable to existing supervised methods in certain cases.

A Training setup
For training, in both our anchor module and our encoder model, we use the XLM-RoBERTa (XLM-R) language model with its publicly available 250K shared vocabulary. DuEAM is trained for 5 epochs with a batch size of 64, 10K warm-up steps, and a learning rate of 2e-5. We set λ (the parameter in our joint loss function) to 0.003; training was performed on a 24GB Titan RTX GPU for about 12 days. Finally, we apply MEAN pooling over the final layer of the encoder to obtain the sentence embeddings. To test the model on the SentEval toolkit, we set k-fold to 10, the number of epochs to 10, the batch size to 64, and the dropout rate to 0.2.
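The MEAN pooling step can be sketched as below: average the final-layer token embeddings while masking out padding positions. This is a generic sketch (the function name and the NumPy formulation are ours), not the framework's actual implementation.

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings over real tokens only.

    token_embeddings: (seq_len, hidden) final-layer encoder outputs.
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, None].astype(float)       # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)     # (hidden,)
    count = max(float(mask.sum()), 1e-9)               # avoid division by zero
    return summed / count
```

The resulting vector is the fixed-size sentence embedding used for all downstream similarity computations.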

B Baseline Models
We compare the performance of DuEAM on the benchmark datasets against multiple supervised and unsupervised baseline models, as follows: • mBERT / XLM-R mean: We use the publicly available mBERT and XLM-RoBERTa (XLM-R) language models trained on large datasets (no parallel sentences are used during the pretraining phase). We take the mean pooling of the output layer as the sentence embedding.
• mUSE: Multilingual Universal Sentence Encoder uses a dual-encoder transformer architecture to generate sentence embeddings trained using parallel corpora for 16 languages.
• LASER: Language-Agnostic Sentence Representation is designed around an encoder-decoder architecture using LSTM networks. The model is trained on large parallel datasets and applies max-pooling over the stacked network to obtain the sentence embedding. It supports 93 languages.
• LaBSE: Language-agnostic BERT Sentence Embedding (LaBSE) is a dual-encoder model based on BERT. The model was trained on 6 billion parallel sentence pairs over 109 languages.
• XLM-R ← SBERT-nli-stsb / XLM-R ← SBERT-paraphrases: These sentence transformer models generate sentence embeddings using a teacher-student architecture, where the XLM-R student model is trained with parallel sentences across languages via knowledge transfer from a fine-tuned English SBERT-nli-stsb or SBERT-paraphrases teacher model.
Descriptions and details of the above models are publicly available via the links presented in Table 11.

C Training Data
To create the weakly-supervised training dataset, we keep the monolingual sentence pairs the same as those of the XNLI dataset. To create cross-lingual sentence pairs, we keep the premise from the source language and replace the hypothesis with the target-language hypothesis sentence, and vice versa. In the example in Table 9, for the language pair EN-DE, the premise is taken from English while the hypothesis is from German, and vice versa. We do not use any labels to train our model.
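The pair-construction procedure above can be sketched as follows, assuming index-aligned (premise, hypothesis) lists for the two languages, as in XNLI. This is an illustrative sketch with hypothetical function and variable names, not the paper's data-preparation code.

```python
def make_cross_lingual_pairs(src_pairs, tgt_pairs):
    """Build cross-lingual training pairs from aligned NLI data.

    src_pairs / tgt_pairs: aligned lists of (premise, hypothesis) tuples
    in the source and target languages. NLI labels are discarded.
    """
    pairs = []
    for (p_src, _), (_, h_tgt) in zip(src_pairs, tgt_pairs):
        pairs.append((p_src, h_tgt))   # source premise + target hypothesis
    for (p_tgt, _), (_, h_src) in zip(tgt_pairs, src_pairs):
        pairs.append((p_tgt, h_src))   # target premise + source hypothesis
    return pairs
```

For EN-DE this yields English-premise/German-hypothesis pairs and German-premise/English-hypothesis pairs, mirroring the "and vice versa" construction described above.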

D Language Codes
We empirically evaluate the proposed DuEAM architecture against existing sentence embedding techniques on several languages across diverse language families, including low-resourced languages. We report results for 8 language pairs on the STS benchmark and for 4 language pairs on the BUCC benchmark. On the Tatoeba dataset, we conduct experiments on the full set of 58 languages under different use-case scenarios. Tables 12 and 13 list the languages along with their codes, as provided in the benchmark datasets and as used in the main body of our paper.

E Tatoeba Results
We performed all our experiments on the 58 languages of the Tatoeba test datasets. Evaluation on the parallel sentence matching task is done by finding the most similar sentence between the two languages based on cosine similarity. We calculate accuracy in both directions (English to target language and vice versa) and report the average of the two. We report the performance of the different approaches in three settings: (i) model performance on languages it has been trained on, (ii) zero-shot performance on untrained languages, compared against supervised trained models, and (iii) performance on under-resourced languages for which the model is untrained.
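The bidirectional accuracy metric described above can be sketched as below, assuming the test set gives index-aligned embedding matrices for the two languages (the gold translation of sentence i is at index i). Function names are illustrative, not the benchmark's official scorer.

```python
import numpy as np

def retrieval_accuracy(src_embs, tgt_embs):
    """Fraction of source sentences whose nearest target sentence
    (by cosine similarity) is the gold-aligned translation."""
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)       # cosine = dot of unit vectors
    return float((nearest == np.arange(len(src))).mean())

def tatoeba_accuracy(en_embs, xx_embs):
    """Average of EN→XX and XX→EN retrieval accuracies."""
    return 0.5 * (retrieval_accuracy(en_embs, xx_embs)
                  + retrieval_accuracy(xx_embs, en_embs))
```

Averaging the two directions penalizes models whose embedding spaces are asymmetric between English and the target language.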
In the main body of the paper, we reported snapshots of the results obtained across a few of the languages (taken across varied language families). Here we report the full evaluation results across all the 58 languages. Baseline supervised models are taken from (Reimers and Gurevych, 2020).
Performance on Trained Languages: Table 14 reports the performance of the models across 12 languages. We compare our model with both supervised and unsupervised baseline models. Across all 12 languages, our model achieves higher accuracy than the unsupervised baselines. Our model also achieves results comparable to the supervised models XLM-R ← SBERT-nli-stsb and LASER for some languages.
Zero-shot Transfer: We evaluate our model on 30 untrained languages and compare it with the baseline models. Table 15 shows that for all 30 languages our model achieves state-of-the-art unsupervised results. While the supervised models are trained on parallel datasets for these languages, for some languages our model achieves comparable results even in the zero-shot setting.
Zero-shot Transfer on Under-Resourced Languages: While our model achieves high accuracy on a wide range of languages, we further compare it with a supervised baseline on 16 under-resourced languages from different language families under zero-shot transfer. From Table 16 we observe that across all languages weakly-supervised DuEAM achieves higher accuracy than the supervised XLM-R ← SBERT-paraphrases baseline. In fact, our unsupervised DuEAM performs better than XLM-R ← SBERT-paraphrases for most of the languages.
Overall, our unsupervised and weakly-supervised DuEAM perform significantly better than the existing unsupervised approach, and are comparable with the supervised models across diverse languages. Our model also efficiently supports zero-shot transfer learning and is robust for under-resourced languages.