Cross-Lingual Word Embedding Refinement by ℓ1 Norm Optimisation

Cross-Lingual Word Embeddings (CLWEs) encode words from two or more languages in a shared high-dimensional space in which vectors representing words with similar meaning (regardless of language) are closely located. Existing methods for building high-quality CLWEs learn mappings that minimise the ℓ2 norm loss function. However, this optimisation objective has been demonstrated to be sensitive to outliers. Based on the more robust Manhattan norm (aka. ℓ1 norm) goodness-of-fit criterion, this paper proposes a simple post-processing step to improve CLWEs. An advantage of this approach is that it is fully agnostic to the training process of the original CLWEs and can therefore be applied widely. Extensive experiments are performed involving ten diverse languages and embeddings trained on different corpora. Evaluation results based on bilingual lexicon induction and cross-lingual transfer for natural language inference tasks show that the ℓ1 refinement substantially outperforms four state-of-the-art baselines in both supervised and unsupervised settings. It is therefore recommended that this strategy be adopted as a standard for CLWE methods.


Introduction
Cross-Lingual Word Embedding (CLWE) techniques have recently received significant attention as an effective means to support Natural Language Processing applications for low-resource languages, e.g., machine translation (Artetxe et al., 2018b) and transfer learning (Peng et al., 2021).
The most successful CLWE models are the so-called projection-based methods, which learn mappings between monolingual word vectors with very little, or even zero, cross-lingual supervision (Artetxe et al., 2018a; Glavaš et al., 2019). Mainstream projection-based CLWE models typically identify orthogonal mappings by minimising the topological dissimilarity between source and target embeddings based on ℓ2 loss (aka. Frobenius loss or squared error) (Glavaš et al., 2019). This learning strategy has two advantages. First, adding the orthogonality constraint to the mapping function has been demonstrated to significantly enhance the quality of CLWEs (Xing et al., 2015). Second, the existence of a closed-form solution to the ℓ2 optima (Schönemann, 1966) greatly simplifies the computation required (Artetxe et al., 2016; Smith et al., 2017).
* Chenghua Lin is the corresponding author.
Despite its popularity, work in various application domains has noted that ℓ2 loss is not robust to noise and outliers. It is widely known in computer vision that ℓ2-loss-based solutions can severely exaggerate noise, leading to inaccurate estimates (Aanaes et al., 2002; De La Torre and Black, 2003). In data mining, Principal Component Analysis (PCA) using ℓ2 loss has been shown to be sensitive to the presence of outliers in the input data, degrading the quality of the feature space produced (Kwak, 2008). Previous studies have demonstrated that the processes used to construct monolingual and cross-lingual embeddings may introduce noise, e.g., via reconstruction error (Allen and Hospedales, 2019) and structural variance, making the presence of outliers more likely. Empirical analysis of CLWEs also demonstrates that more distant word pairs (which are more likely to be outliers) have more influence on the behaviour of ℓ2 loss than closer pairs. This raises questions about the appropriateness of ℓ2 loss functions for CLWEs.
Compared to the conventional ℓ2 loss, ℓ1 loss (aka. Manhattan distance) has been mathematically demonstrated to be less affected by outliers (Rousseeuw and Leroy, 1987) and empirically proven useful in computer vision and data mining (Aanaes et al., 2002; De La Torre and Black, 2003; Kwak, 2008). Motivated by this insight, our paper proposes a simple yet effective post-processing technique to improve the quality of CLWEs: adjust the alignment of any cross-lingual vector space to minimise the ℓ1 loss without violating the orthogonality constraint. Specifically, given existing CLWEs, we bidirectionally retrieve bilingual vectors and optimise their Manhattan distance using a numerical solver. The approach can be applied to any CLWEs, making the post-hoc refinement technique generic and applicable to a wide range of scenarios. We believe this to be the first application of ℓ1 loss to the CLWE problem.
To demonstrate the effectiveness of our method, we select four state-of-the-art baselines and conduct comprehensive evaluations in both supervised and unsupervised settings. Our experiments involve ten languages from diverse branches/families and embeddings trained on corpora of different domains. In addition to the standard Bilingual Lexicon Induction (BLI) benchmark, we also investigate a downstream task, namely cross-lingual transfer for Natural Language Inference (NLI). In all setups tested, our algorithm significantly improves the performance of strong baselines. Finally, we provide an intuitive visualisation illustrating why ℓ1 loss is more robust than its ℓ2 counterpart when refining CLWEs (see Fig. 1). Our code is available at https://github.com/Pzoom522/L1-Refinement.
Our contribution is three-fold: (1) we propose a robust refinement technique based on the ℓ1 norm training objective, which can effectively enhance CLWEs; (2) our approach is generic and can be directly coupled with both supervised and unsupervised CLWE models; (3) our ℓ1 refinement algorithm achieves state-of-the-art performance for both BLI and cross-lingual transfer for NLI tasks.
Related Work

CLWE methods. One approach to generating CLWEs is to train shared semantic representations using multilingual texts aligned at the sentence or document level (Vulić and Korhonen, 2016; Upadhyay et al., 2016). Although this research direction has been well studied, the requirement of parallel data for model training is expensive, and hence impractical for low-resource languages.
Recent years have seen an increase in interest in projection-based methods, which train CLWEs by finding mappings between pretrained word vectors of different languages (Mikolov et al., 2013; Peng et al., 2020). Since the input embeddings can be generated independently using monolingual corpora only, projection-based methods reduce the supervision required for training and offer a viable solution for low-resource scenarios. Xing et al. (2015) showed that the precision of the learned CLWEs can be improved by constraining the mapping function to be orthogonal, which is formalised as the so-called ℓ2 Orthogonal Procrustes Analysis (OPA):

M* = argmin_{M ∈ O} ‖AM − B‖_2,   (1)

where M is the CLWE mapping, O denotes the orthogonal manifold (aka. the Stiefel manifold (Chu and Trendafilov, 2001)), and A and B are matrices composed using vectors from the source and target embedding spaces. While Xing et al. (2015) exploited an approximate and relatively slow gradient-based solver, more recent approaches such as Artetxe et al. (2016) and Smith et al. (2017) introduced an exact closed-form solution for Eq. (1). Originally proposed by Schönemann (1966), it utilises Singular Value Decomposition (SVD):

M* = UV^⊤, where UΣV^⊤ = SVD(A^⊤B),   (2)

where M* denotes the ℓ2-optimal mapping matrix. The efficiency and effectiveness of Eq. (2) have led to its application within many other approaches, e.g., Ruder et al. (2018) and Glavaš et al. (2019). In particular, PROC-B (Glavaš et al., 2019), a supervised CLWE framework that simply applies multiple iterations of ℓ2 OPA, has been demonstrated to produce very competitive performance on various benchmark tasks including BLI as well as cross-lingual transfer for NLI and information retrieval.
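The closed-form ℓ2 OPA solution of Eq. (2) takes only a few lines of linear algebra. A minimal NumPy sketch (function and variable names are ours, not from any released implementation):

```python
import numpy as np

def l2_procrustes(A, B):
    """Closed-form orthogonal solution of the l2 OPA problem
    (Schoenemann, 1966): M* = U V^T, where U S V^T = SVD(A^T B).
    A, B are (n, d) matrices of row-aligned source/target word vectors."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Sanity check: recover a known orthogonal map from noiseless data.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # ground-truth orthogonal map
A = rng.normal(size=(100, 5))
B = A @ Q
M = l2_procrustes(A, B)
```

When A has full column rank and B = AQ exactly, the recovered M coincides with Q, and M is orthogonal by construction.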
While the aforementioned approaches still require some weak supervision (i.e., seed dictionaries), there have also been some successful attempts to train CLWEs in a completely unsupervised fashion. For instance, the MUSE system bootstraps CLWEs without any bilingual signal through adversarial learning. VECMAP (Artetxe et al., 2018a) applies a self-learning strategy to iteratively compute the optimal mapping and then retrieve a bilingual dictionary. Comparing MUSE and VECMAP, the latter tends to be more robust, as its similarity-matrix-based heuristic initialisation is more stable in most cases (Glavaš et al., 2019). Very recently, some studies have bootstrapped unsupervised CLWEs by jointly training word embeddings on concatenated corpora of different languages and achieved good performance (Wang et al., 2020).
The ℓ2 refinement algorithm. CLWE models often apply ℓ2 refinement, a post-processing step shown to improve the quality of the initial alignment. Given existing CLWEs {X_{LA}, X_{LB}} for languages L_A and L_B, one can bidirectionally retrieve two bilingual dictionaries D_{LA→LB} and D_{LB→LA} using approaches such as the classic nearest-neighbour algorithm, the inverted softmax (Smith et al., 2017) or cross-domain similarity local scaling (CSLS). Note that word pairs in D_{LA→LB} ∩ D_{LB→LA} are highly reliable, as they form "mutual translations". Next, one can compose bilingual embedding matrices A and B by aligning word vectors (rows) using these word pairs. Finally, a new orthogonal mapping is learned to fit A and B based on least-squares regression, i.e., by performing the ℓ2 OPA described in Eq. (1).
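One iteration of this ℓ2 refinement can be sketched as follows. This is a simplification: it uses plain mutual-nearest-neighbour retrieval where the methods above may use the inverted softmax or CSLS, and all names are ours:

```python
import numpy as np

def mutual_nn_dictionary(X_a, X_b):
    """Bidirectionally retrieve translation pairs by cosine nearest
    neighbour and keep only the mutual translations."""
    Xa = X_a / np.linalg.norm(X_a, axis=1, keepdims=True)
    Xb = X_b / np.linalg.norm(X_b, axis=1, keepdims=True)
    sim = Xa @ Xb.T
    a2b = sim.argmax(axis=1)  # D_{LA -> LB}
    b2a = sim.argmax(axis=0)  # D_{LB -> LA}
    return [(i, a2b[i]) for i in range(len(Xa)) if b2a[a2b[i]] == i]

def l2_refine(X_a, X_b):
    """One l2 refinement step: compose A, B from mutual translations,
    then fit a new orthogonal map via l2 OPA (Eq. (1))."""
    pairs = mutual_nn_dictionary(X_a, X_b)
    A = np.stack([X_a[i] for i, _ in pairs])
    B = np.stack([X_b[j] for _, j in pairs])
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt
```

Because the Procrustes optimum is taken over all orthogonal maps (which include the identity), the fitted mapping can never increase the total squared error on the retrieved pairs.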
Early applications of ℓ2 refinement used a single iteration, e.g. (Vulić and Korhonen, 2016). Due to the wide adoption of the closed-form ℓ2 OPA solution (cf. Eq. (2)), recent methods perform multiple iterations. The iterative ℓ2 refinement strategy is an important component of approaches that bootstrap from small or null training lexicons (Artetxe et al., 2018a). However, a single step of refinement is often sufficient to create suitable CLWEs (Glavaš et al., 2019).

Methodology
A common characteristic of CLWE methods that apply the orthogonality constraint is that they optimise using ℓ2 loss (see § 2). However, outliers have a disproportionate influence under ℓ2 since the penalty increases quadratically; this can be particularly problematic with noisy data, since the solution can "shift" towards the outliers (Rousseeuw and Leroy, 1987). The noise and outliers present in real-world word embeddings may therefore affect the performance of ℓ2-loss-based CLWEs.
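A toy illustration of this sensitivity (not the paper's algorithm): in one dimension, the ℓ2 optimum of a set of points is their mean while the ℓ1 optimum is their median, and a single outlier drags the mean far more than the median:

```python
import numpy as np

# One large outlier among otherwise tightly clustered points.
points = np.array([1.0, 1.1, 0.9, 1.0, 100.0])
l2_fit = points.mean()      # argmin_c sum (x - c)^2  -> pulled to ~20.8
l1_fit = np.median(points)  # argmin_c sum |x - c|    -> stays at 1.0
```

The same mechanism operates in the matrix setting: distant word pairs dominate a quadratic objective, whereas their influence on an ℓ1 objective is linear.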
The ℓ1 norm cost function is more robust than ℓ2 loss as it is less affected by outliers (Rousseeuw and Leroy, 1987). Therefore, we propose a refinement algorithm for improving the quality of CLWEs based on ℓ1 loss. This novel method, which we refer to as ℓ1 refinement, is generic and can be applied post-hoc to improve the output of existing CLWE models. To our knowledge, the use of alternatives to ℓ2-loss-based optimisation has never been explored by the CLWE community.
To begin with, analogous to ℓ2 OPA (cf. Eq. (1)), ℓ1 OPA can be formally defined and rewritten as

M* = argmin_{M ∈ O} ‖AM − B‖_1 = argmin_{M ∈ O} tr(sgn(AM − B)^⊤ (AM − B)),   (3)

where tr(·) returns the matrix trace, sgn(·) is the signum function, and M ∈ O denotes that M is subject to the orthogonality constraint. Compared to ℓ2 OPA, which has a closed-form solution, solving Eq. (3) is much more challenging due to the discontinuity of sgn(·). This issue can be addressed by replacing sgn(·) with tanh(α(·)), a smoothing function parameterised by α, such that

M* ≈ argmin_{M ∈ O} tr(tanh(α(AM − B))^⊤ (AM − B)).   (4)

Larger values of α lead to closer approximations of sgn(·) but reduce the smoothing effect. This approach has been used in many applications, such as the activation function of long short-term memory networks (Hochreiter and Schmidhuber, 1997). However, in practice, we find that Eq. (4) remains unsolvable in our case with standard gradient-based frameworks, for two reasons. First, α has to be sufficiently large in order to achieve a good approximation of sgn(·); otherwise, relatively small residuals will be down-weighted during fitting and the objective will become biased towards outliers, similar to ℓ2 loss. However, satisfying this requirement (i.e., a large α) causes the activation function tanh(α(·)) to saturate easily, so the optimisation process becomes trapped during the early stages. In other words, the optimisation can only reach an unsatisfactory local optimum. Second, the orthogonality constraint (i.e., M ∈ O) also makes the optimisation more problematic for these methods.
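The smoothed objective of Eq. (4) is straightforward to express numerically. A minimal sketch (names are ours): with a large α, tanh(αr)·r is numerically indistinguishable from |r| for any residual that is not vanishingly small, so the surrogate tracks the true ℓ1 loss closely:

```python
import numpy as np

def smoothed_l1_loss(A, M, B, alpha=1e8):
    """Smooth surrogate of ||A M - B||_1 from Eq. (4):
    tr(tanh(alpha * R)^T R) with R = A M - B, i.e. sgn(.) ~ tanh(alpha .).
    The trace of that product is just the element-wise sum tanh(z) * r."""
    R = A @ M - B
    return float(np.sum(np.tanh(alpha * R) * R))
```

For small α the surrogate instead behaves quadratically near zero, which is exactly the down-weighting of small residuals described above.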
We address these challenges by adopting the approach proposed by Trendafilov (2003). This method explicitly encourages the solver to only explore the desired manifold O, thereby reducing the ℓ1 solver's search space and the difficulty of the optimisation problem. We begin by calculating the gradient ∇ w.r.t. the objective in Eq. (4) through matrix differentiation:

∇ = A^⊤ (tanh(Z) + Z ∘ (1 − tanh²(Z))),   (5)

where Z = α(AM − B) and ∘ is the Hadamard product. Next, to find the steepest descent direction while ensuring that any M produced is orthogonal, −∇ is projected onto the tangent space of the orthogonal manifold at M:

dM/dt = −( M (M^⊤∇ − ∇^⊤M)/2 + (I − MM^⊤)∇ ).   (6)

Here I is an identity matrix with the shape of M. With Eq. (6) defining the optimisation flow, our ℓ1 loss minimisation problem reduces to an integration problem:

M(T) = M_0 + ∫_0^T (dM/dt) dt,   (7)

where M_0 is a proper initial solution of Eq. (3) (e.g., the ℓ2-optimal mapping obtained via Eq. (2)). Empirically, unlike the aforementioned standard gradient-based methods, by following the flow of Eq. (6) the optimisation process of Eq. (7) neither violates the orthogonality restriction nor becomes trapped during the early stages. However, this ℓ1 OPA solver requires an extremely small step size to generate reliable solutions (Trendafilov, 2003), making it computationally expensive. Therefore, it is impractical to perform ℓ1 refinement in an iterative fashion like ℓ2 refinement without significant computational resources.
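The two ingredients above can be sketched in NumPy. This is our reading of Eqs. (5)–(6), not the authors' released code, and the modest α is ours (chosen so the toy check below is well-conditioned):

```python
import numpy as np

def l1_grad(A, M, B, alpha=10.0):
    """Gradient of the smoothed objective (Eq. (5)):
    grad = A^T (tanh(Z) + Z * (1 - tanh(Z)^2)), with Z = alpha (A M - B)
    and * the Hadamard (element-wise) product."""
    Z = alpha * (A @ M - B)
    T = np.tanh(Z)
    return A.T @ (T + Z * (1.0 - T ** 2))

def orthogonal_flow(M, G):
    """Steepest-descent direction projected onto the tangent space of the
    orthogonal manifold at M (cf. Eq. (6)):
    dM/dt = -( M (M^T G - G^T M)/2 + (I - M M^T) G )."""
    I = np.eye(M.shape[0])
    K = M.T @ G
    return -(M @ (K - K.T) / 2.0 + (I - M @ M.T) @ G)
```

For an orthogonal M, M^⊤(dM/dt) is skew-symmetric, which is exactly the condition for the flow to preserve orthogonality to first order; a tiny Euler step along the direction also decreases the smoothed loss.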
Previous work has demonstrated that starting ℓ1-loss-based algorithms from a good initial state can speed up the optimisation. For instance, Kwak (2008) found that feature spaces created by ℓ2 PCA were severely affected by noise. Replacing the cost function with ℓ1 loss significantly reduced this problem, but required expensive linear programming. To reduce the convergence time, Brooks and Jot (2013) exploited the first principal component from the ℓ2 solution as an initial guess. Similarly, when reconstructing corrupted pixel matrices, ℓ2-loss-based results are far from satisfactory; ℓ1 norm estimators can improve the quality, but are too slow to handle large-scale datasets (Aanaes et al., 2002). However, taking the ℓ2 optima as the starting point allowed less biased reconstructions to be learned in an acceptable time (De La Torre and Black, 2003).
Inspired by these works, we make use of ℓ1 refinement to carry out post-hoc enhancement of existing CLWEs. Our full pipeline is described in Algorithm 1 (see § 4.3 for implemented configurations). In common with ℓ2 refinement (cf. § 2), steps 1–4 bootstrap a synthetic dictionary D and compose bilingual word vector matrices A and B which have reliable row-wise correspondence. Taking them as the starting state, step 5 performs integration to solve Eq. (7) for M, where an identity matrix naturally serves as the initial solution M_0. During the execution of Eq. (7), we record the ℓ1 loss per iteration and check whether either of the following two stopping criteria has been satisfied: (1) the updated ℓ1 loss exceeds that of the previous iteration; (2) the on-the-fly M has departed non-negligibly from the orthogonal manifold, as indicated by the maximum value of the disparity matrix:

max |M^⊤M − I| > ε,   (8)

where ε is a sufficiently small threshold. The resulting M can be used to adjust the word vectors of L_A and output refined CLWEs. A significant advantage of our algorithm is its generality: it is fully independent of the method used for creating the original CLWEs and can therefore be used to enhance a wide range of models, in both supervised and unsupervised settings.
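The whole pipeline can be sketched compactly under simplifying assumptions of ours: plain mutual-nearest-neighbour retrieval instead of CSLS, fixed-step Euler integration instead of a VODE solver, and a moderate α so the toy integration stays stable:

```python
import numpy as np

def l1_refine(X_a, X_b, alpha=10.0, dt=1e-4, max_steps=500, eps=1e-3):
    """Sketch of Algorithm 1 (simplified; see lead-in for assumptions).
    Steps 1-4: bootstrap a synthetic dictionary of mutual translations and
    compose row-aligned matrices A, B.  Step 5: integrate the l1 descent
    flow from M0 = I, stopping when the l1 loss rises or when M departs
    from the orthogonal manifold by more than eps (cf. Eq. (8))."""
    Xa = X_a / np.linalg.norm(X_a, axis=1, keepdims=True)
    Xb = X_b / np.linalg.norm(X_b, axis=1, keepdims=True)
    sim = Xa @ Xb.T
    a2b, b2a = sim.argmax(axis=1), sim.argmax(axis=0)
    pairs = [(i, a2b[i]) for i in range(len(Xa)) if b2a[a2b[i]] == i]
    A = np.stack([X_a[i] for i, _ in pairs])
    B = np.stack([X_b[j] for _, j in pairs])

    d = A.shape[1]
    M, I = np.eye(d), np.eye(d)
    best_M, best_loss = M.copy(), np.abs(A @ M - B).sum()
    for _ in range(max_steps):
        Z = alpha * (A @ M - B)
        T = np.tanh(Z)
        G = A.T @ (T + Z * (1.0 - T ** 2))               # Eq. (5)
        K = M.T @ G
        dM = -(M @ (K - K.T) / 2.0 + (I - M @ M.T) @ G)  # Eq. (6)
        M = M + dt * dM                                  # Euler step, Eq. (7)
        loss = np.abs(A @ M - B).sum()
        if loss > best_loss:                     # stopping criterion (1)
            break
        if np.max(np.abs(M.T @ M - I)) > eps:    # stopping criterion (2)
            break
        best_M, best_loss = M.copy(), loss
    return best_M
```

By construction the returned mapping never violates the orthogonality threshold and never increases the ℓ1 loss on the bootstrapped dictionary relative to the initial M_0 = I.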

Datasets
In order to demonstrate the generality of our proposed method, we conduct experiments using two groups of monolingual word embeddings trained on very different corpora. Wiki-Embs: embeddings developed using Wikipedia dumps for ten diverse languages: two Germanic (English|EN, German|DE), two Slavic (Croatian|HR, Russian|RU), three Romance (French|FR, Italian|IT, Spanish|ES) and three non-Indo-European (Finnish|FI from the Uralic family, Turkish|TR from the Turkic family and Chinese|ZH from the Sino-Tibetan family). News-Embs: embeddings trained on news corpora. News-Embs are considered more challenging for building good-quality CLWEs due to the heterogeneous nature of the data, while a considerable portion of the multilingual training corpora for Wiki-Embs are roughly parallel. Following previous studies (Artetxe et al., 2018a; Zhou et al., 2019; Glavaš et al., 2019), only the first 200K vocabulary entries are preserved. In the original implementations, MUSE, PROC-B and JA were only trained on Wiki-Embs, while VECMAP additionally used News-Embs. Although all baselines reported performance for BLI, they used various versions of evaluation sets, hence previous results are not directly comparable with the ones reported here. More concretely, the test sets for MUSE/JA and VECMAP are two different batches of EN-centric dictionaries, while the test set for PROC-B also supports non-EN translations.

Implementation Details of Algorithm 1
The CSLS scheme with a neighbourhood size of 10 (CSLS-10) is adopted to build synthetic dictionaries via the input CLWEs. A variable-coefficient ordinary differential equation (VODE) solver was implemented for the system described in Eq. (7). As suggested by Trendafilov (2003), we set the maximum order at 15, the smoothness coefficient α in Eq. (5) at 1e8, the threshold ε in Eq. (8) at 1e-5, and performed the integration with a fixed time interval of 1e-6. An early-stopping design was adopted to ensure computation completed in a reasonable time: in addition to the two default stopping criteria in § 3, integration is terminated if dt reaches 5e-3 (dt being the differentiation term in Eq. (7)).
In terms of the tolerance of the VODE solver, we set the absolute tolerance at 1e-7 and the relative tolerance at 1e-5, following the established approach of Kulikov (2013). These tolerance settings show good generality empirically and were used for all tested language pairs, datasets, and models in our experiments.
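SciPy exposes the classic VODE integrator through `scipy.integrate.ode`, which is one way such a setup can be configured. A sketch with the tolerances above (assumptions of ours: the real system would flatten M into the state vector y and evaluate Eq. (6) as the right-hand side; a toy linear ODE dy/dt = −y stands in for it here, and SciPy's VODE caps the Adams order at 12):

```python
import numpy as np
from scipy.integrate import ode

def rhs(t, y):
    # Stand-in right-hand side; dy/dt = -y has the known solution e^{-t}.
    return -y

solver = ode(rhs)
solver.set_integrator("vode", method="adams", order=12,
                      atol=1e-7, rtol=1e-5)  # tolerances as in Sec. 4.3
solver.set_initial_value(np.array([1.0]), 0.0)
solver.integrate(1.0)  # VODE chooses its own internal steps up to t = 1
y_final = float(solver.y[0])
```

In the real pipeline one would instead call `solver.integrate` in small fixed increments, checking the stopping criteria of § 3 after each call.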

Results
We evaluate the effectiveness of the proposed ℓ1 refinement technique on two benchmarks: Bilingual Lexicon Induction (BLI), the de facto standard for measuring the quality of CLWEs, and a downstream natural language inference task based on cross-lingual transfer. In addition to comparisons against state-of-the-art CLWE models, we also report the performance of the single-iteration ℓ2 refinement method, which follows steps 1–4 of Algorithm 1 and then minimises ℓ2 loss in the final step.
To reduce randomness, we executed each model in each setup three times and report the average accuracy (ACC, aka. precision at rank 1). Following Glavaš et al. (2019), when comparing scores achieved before and after ℓ1 refinement, statistical significance is indicated via the p-value of two-tailed t-tests with Bonferroni correction (Dror et al., 2018) (note that p-values are not recorded for Tab. 2b given the small number of runs).
To put these improvements in context, Heyman et al. (2019) reported an improvement of 0.4% for VECMAP on the same dataset and language pairs.
Our method tends to work better on the more distant language pairs. For instance, for the distant pairs EN-{RU, ZH}, the gains achieved by MUSE-ℓ1 are 1.6% and 1.3%, respectively, whereas for the close pairs EN-{DE, ES, FR} the average gain is at most 0.9%. A similar trend can be observed for JA-MUSE-ℓ1 and VECMAP-ℓ1. (As the VECMAP algorithm always collapses for EN-ZH, no result is reported for this language pair.)
Another set of experiments was conducted to evaluate the robustness of our algorithm, following the main setup of Artetxe et al. (2018a), who tested four language pairs based on News-Embs. Tab. 1b shows that JA-MUSE-ℓ1 and VECMAP-ℓ1 consistently improve their base models, with average gains of 1.2% and 1.0% (p<0.01). Obtaining such substantial improvements over the state of the art is nontrivial: for example, even a very recent weakly supervised method is inferior to VECMAP by 1.0% average ACC. On the other hand, MUSE fails to produce any analysable result, as it always collapses on the more challenging News-Embs. Improvement with ℓ1 refinement is also larger when language pairs are more distant, e.g., for VECMAP-ℓ1 the ACC gain on EN-FI is 1.8%, more than double the gain (0.7%) on the close pairs EN-{DE, IT} (cf. Tab. 1a and above).
We also conduct an ablation study by reporting the performance of the ℓ2 refinement scheme ({MUSE, JA-MUSE, VECMAP}-ℓ2), which yields at best marginal gains over the base models. This observation is in accordance with previous reports that, after performing ℓ2 refinement in the first loop, applying further iterations produces only marginal precision gains, if any.
Overall, ℓ1 refinement consistently and significantly improves the CLWEs produced by the base algorithms, regardless of the embeddings and setups used, demonstrating the effectiveness and robustness of the proposed algorithm.

Refining supervised baselines.
To test the generalisability of our method, we also applied it to state-of-the-art supervised CLWE models: PROC-B (Glavaš et al., 2019) and JA-RCSLS (Wang et al., 2020). Following the setup of Glavaš et al. (2019), we learn mappings using Wiki-Embs and the 1K training splits of their dataset.
Their evaluation code retrieves bilingual word pairs using the classic nearest-neighbour algorithm and outputs the Mean Reciprocal Rank (MRR). As shown in Tab. 2a, both JA-RCSLS-ℓ1 and PROC-B-ℓ1 outperform the baseline algorithms for all language pairs (with the exception of EN-IT, where the score of PROC-B is unchanged), with average improvements of 0.9% and 0.5%, respectively (p<0.01). JA-RCSLS-ℓ1 and PROC-B-ℓ1 were also tested using News-Embs, with results shown in Tab. 2b.

Table 3: MRR (%) of BLI for non-EN language pairs. Rows marked with ␃ are from the supplementary material of Glavaš et al. (2019). MUSE yielded one unsuccessful run for TR-IT, and we only record the average of the two successful scores, marked with *.
ℓ1 refinement achieves an impressive improvement for both close (EN-{DE, IT}) and distant (EN-FI) language pairs: average gains of 1.9% and 3.9% respectively, and over 5% for EN-DE (PROC-B-ℓ1) in particular. The ℓ2 refinement does not benefit the supervised baselines, similar to the lack of improvement observed in the unsupervised setups.
Comparison of unsupervised and supervised settings. This part provides a comparison of the effectiveness of ℓ1 refinement in unsupervised and supervised scenarios. Unlike previous experiments, where only alignments involving English were investigated, these tests focus on non-EN setups. Results shown in Tab. 3 demonstrate that the main baselines (MUSE, JA-MUSE, VECMAP, JA-RCSLS, and PROC-B) outperform the other models by a large margin. For all these main baselines, applying ℓ1 refinement improves the mapping quality for all language pairs (p < 0.01), with average improvements of 1.7%, 1.4%, 1.8%, 1.1%, and 0.8%, respectively. Consistent with findings in the previous experiments, ℓ2 refinement does not enhance performance. Improvement with ℓ1 refinement is higher when language pairs are more distant, e.g., for all inter-language-family pairs such as FI-HR and TR-IT, even the minimum improvement of MUSE-ℓ1 over MUSE is 2.3%.
Comparing unsupervised and supervised approaches, it can be observed that MUSE, JA-MUSE and VECMAP achieve higher overall gains with ℓ1 refinement than JA-RCSLS and PROC-B, where JA-MUSE-ℓ1 and VECMAP-ℓ1 give the best overall performance. One possible explanation for this phenomenon is that there is only a single source of possible noise in unsupervised models (i.e., the embedding topology), whereas for supervised methods noise can also be introduced via the seed lexicons. Consequently, unsupervised approaches derive more benefit from ℓ1 refinement, which reduces the influence of topological outliers in CLWEs.
Topological behaviours of ℓ1 and ℓ2 refinement. To validate our assumption that ℓ2 refinement is more sensitive to outliers while its ℓ1 counterpart is more robust, we analyse how each refinement strategy changes the distance between bilingual word vector pairs in the synthetic dictionary D (cf. Algorithm 1) constructed from trained CLWE models. Specifically, for each word vector pair we subtract its pre-refinement distance (i.e., the distance without applying the additional ℓ1 or ℓ2 refinement step) from its post-refinement distance. Fig. 1 shows visualisation examples for three algorithms and language pairs, where each bar represents one word pair. It can be observed that ℓ1 refinement effectively reduces the distance for most word pairs, regardless of their original distance (indicated by bars with negative values in the figures). The conventional ℓ2 refinement strategy, in contrast, exhibits very different behaviour and tends to be overly influenced by word pairs with large distances (i.e., by outliers). The reason is that the ℓ2-norm penalty increases quadratically, causing the solution to put much more weight on optimising distant word pairs (word pairs at the right end of the X-axis show sharp distance decrements). This observation is in line with Rousseeuw and Leroy (1987) and explains why ℓ1 loss performs substantially better than ℓ2 loss in the refinement.
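The per-pair quantity plotted as each bar in Fig. 1 can be computed as follows (a sketch; names are ours):

```python
import numpy as np

def pairwise_distance_change(A, B, M):
    """Per-pair change in Euclidean distance after applying mapping M to
    the source side of row-aligned dictionary matrices A, B; negative
    entries mean the refinement moved that translation pair closer."""
    before = np.linalg.norm(A - B, axis=1)
    after = np.linalg.norm(A @ M - B, axis=1)
    return after - before
```

For an ℓ2-optimal mapping the sum of squared post-mapping distances can never exceed the pre-mapping one, yet individual pairs (typically the distant, outlier-like ones) dominate where that reduction is spent.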
Case study. After aligning EN-RU embeddings with unsupervised MUSE, we measured the distance between vectors corresponding to the ground-truth dictionary (cf. Fig. 1a). We then detected large outliers by finding vector pairs whose distance falls above Q3 + 1.5 · (Q3 − Q1), where Q1 and Q3 respectively denote the lower and upper quartiles, following the popular Inter-Quartile Range rule (Hoaglin et al., 1986). We found that many of the outliers correspond to polysemous entries, such as {state (2× noun meanings and 1× verb meaning), состояние (only means status)}, {type (2× nominal meanings and 1× verb meaning), тип (only means kind)}, and {film (5× noun meanings), фильм (only means movie)}. We then re-performed the ℓ2-based mapping after removing these vector pairs, observing that the accuracy jumps to 45.9% (cf. 43.8% for the original ℓ2-norm alignment and 45.6% after ℓ1 refinement, cf. Tab. 1). This indicates that although all baselines already make use of preprocessing steps including vector normalisation, outlier issues still exist and harm the ℓ2-norm CLWEs. However, they can be alleviated by the proposed ℓ1 refinement technique.
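The IQR outlier rule used in the case study is a one-liner over the per-pair distances (a sketch; names are ours):

```python
import numpy as np

def iqr_outliers(distances):
    """Flag large outliers via the Inter-Quartile Range rule: values
    above Q3 + 1.5 * (Q3 - Q1), with Q1/Q3 the lower/upper quartiles."""
    q1, q3 = np.percentile(distances, [25, 75])
    return distances > q3 + 1.5 * (q3 - q1)
```

The boolean mask it returns can be used directly to drop the flagged dictionary pairs before re-fitting the mapping.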

Natural Language Inference
Finally, we experimented with a downstream NLI task in which the aim is to determine whether a "hypothesis" is true (entailment), false (contradiction) or undetermined (neutral), given a "premise". Higher ACC indicates better encoding of semantics in the tested embeddings. The CLWEs used are those trained with Wiki-Embs for BLI. For MUSE, JA-MUSE and VECMAP, we also obtain CLWEs for the EN-TR pair with the same configuration. Following Glavaš et al. (2019), we first train the Enhanced Sequential Inference Model (Chen et al., 2017) on the large-scale English MultiNLI corpus using vectors of language L_A (EN) from an aligned bilingual embedding space (e.g., EN-DE). Next, we replace the L_A vectors with the vectors of language L_B (e.g., DE), and directly test the trained model on the language L_B portion of the XNLI corpus.
Results in Tab. 4 show that the CLWEs refined by our algorithm yield the highest ACC for all language pairs in both supervised and unsupervised settings. The ℓ2 refinement, on the contrary, is not beneficial overall. Improvements in cross-lingual transfer for NLI exhibit similar trends to those in the BLI experiments, i.e., greater performance gains for unsupervised methods and more distant language pairs, consistent with previous observations (Glavaš et al., 2019). For instance, MUSE-ℓ1, JA-MUSE-ℓ1 and VECMAP-ℓ1 outperform their baselines by at least 2% ACC on average (p < 0.01), whereas the improvements of JA-RCSLS-ℓ1 and PROC-B-ℓ1 over their corresponding base methods are 2% and 2.1%, respectively (p < 0.01). For both unsupervised and supervised methods, ℓ1 refinement demonstrates a stronger effect for more distant language pairs, e.g., MUSE-ℓ1 surpasses MUSE by 1.2% for EN-FR, whereas a more impressive 2.7% gain is achieved for EN-TR.
In summary, in addition to improving BLI performance, our ℓ1 refinement method also produces a significant improvement on a downstream task (NLI), demonstrating its effectiveness in improving CLWE quality.

Conclusion and Future Work
This paper proposes a generic post-processing technique to enhance CLWE performance based on optimising ℓ1 loss. The algorithm is motivated by successful applications in other research fields (e.g., computer vision and data mining) which exploit the ℓ1 norm cost function, as it has been shown to be more robust to noisy data than the commonly adopted ℓ2 loss. The approach was evaluated using ten diverse languages and word embeddings from different domains on the popular BLI benchmark, as well as a downstream task of cross-lingual transfer for NLI. Results demonstrated that our algorithm can significantly improve the quality of CLWEs in both supervised and unsupervised setups. It is therefore recommended that this straightforward technique be applied to improve the performance of CLWEs.
The convergence speed of the optimiser prevented us from performing ℓ1 loss optimisation over multiple iterations. Future work will focus on improving the efficiency of our ℓ1 OPA solver, as well as exploring the application of other robust loss functions within CLWE training strategies.

Ethics Statement
This work provides an effective post-hoc method to improve CLWEs, advancing the state of the art in both supervised and unsupervised settings. Our comprehensive empirical studies demonstrate that the proposed algorithm can facilitate research in machine translation, cross-lingual transfer learning, etc., which can have a deep societal impact by bridging cultural gaps across the world.
In addition, this paper introduces and solves an optimisation problem based on an under-explored robust cost function, namely ℓ1 loss. We believe it could be of interest to the wider community, as outliers are a long-standing issue in many artificial intelligence applications.
One caveat with our method, as is the case for all word-embedding-based systems, is that various biases may exist in the vector spaces. We suggest that this problem should always be examined critically. In addition, our implemented solver can be computationally expensive, leading to increased electricity consumption and the associated negative environmental repercussions.