IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces

The ability to extract high-quality translation dictionaries from monolingual word embedding spaces depends critically on the geometric similarity of the spaces—their degree of “isomorphism.” We address the root cause of faulty cross-lingual mapping: that word embedding training resulted in the underlying spaces being non-isomorphic. We incorporate global measures of isomorphism directly into the skipgram loss function, successfully increasing the relative isomorphism of trained word embedding spaces and improving their ability to be mapped to a shared cross-lingual space. The result is improved bilingual lexicon induction in general data conditions, under domain mismatch, and with training algorithm dissimilarities. We release IsoVec at https://github.com/kellymarchisio/isovec.


Introduction
Extracting a translation dictionary from word embedding spaces, called "bilingual lexicon induction" (BLI), is a common task in the natural language processing literature. Bilingual dictionaries are useful in their own right as linguistic resources, and automatically generated dictionaries may be particularly helpful for low-resource languages for which human-curated dictionaries are unavailable. BLI is also used as an extrinsic evaluation task to assess the quality of cross-lingual spaces. If a high-quality translation dictionary can be automatically extracted from a shared embedding space, intuition says that the space is high-quality and useful for downstream tasks.
"Mapping-based" methods are one way to create cross-lingual embedding spaces.Separatelytrained monolingual embeddings are mapped to a shared space by applying a linear transformation to one or both spaces, after which a bilingual lexicon can be extracted via nearest-neighbor search (e.g., Mikolov et al., 2013b;Lample et al., 2018;Artetxe et al., 2018b;Joulin et al., 2018;Patra et al., 2019).
Mapping methods are effective for closely related languages with embedding spaces trained on high-quality, domain-matched data, even without supervision, but they rely critically on the "approximate isomorphism assumption": that monolingual embedding spaces are geometrically similar. Problematically, researchers have observed that the isomorphism assumption weakens substantially as languages and domains become dissimilar, leading to failure precisely where unsupervised methods might be helpful (e.g., Søgaard et al., 2018; Ormazabal et al., 2019; Glavaš et al., 2019; Vulić et al., 2019; Patra et al., 2019; Marchisio et al., 2020).
Existing work attributes non-isomorphism to linguistic, algorithmic, data size, or domain differences in the training data for source and target languages. From Søgaard et al. (2018): "the performance of unsupervised BDI [BLI] depends heavily on... language pair, the comparability of the monolingual corpora, and the parameters of the word embedding algorithms." Several authors found that unsupervised machine translation methods suffer under similar data shifts (Marchisio et al., 2020; Kim et al., 2020; Marie and Fujita, 2020).
While such factors do result in low isomorphism of spaces trained with traditional methods, we need not resign ourselves to the geometry that a training methodology naturally produces. Whereas multiple prior works post-process embeddings or map non-linearly, we control similarity explicitly during embedding training by incorporating five global measures of isomorphism into the Skip-gram loss function. Our three supervised and two unsupervised losses gain some control over the relative isomorphism of word embedding spaces, compensating for data mismatch and creating spaces that are linearly mappable where previous methods failed.

Related Work
Cross-Lingual Word Embeddings There is a broad literature on creating cross-lingual word embedding spaces. Two major paradigms are "mapping-based" methods, which find a linear transformation to map monolingual embedding spaces to a shared space (e.g., Artetxe et al., 2016, 2017; Alvarez-Melis and Jaakkola, 2018; Doval et al., 2018; Jawanpuria et al., 2019), and "joint training" methods which, as stated in the enlightening survey by Ruder et al. (2019), "minimize the source and target language monolingual losses jointly with the cross-lingual regularization term" (e.g., Luong et al., 2015; see Ruder et al., 2019 for a review). Gouws et al. (2015) train Skip-gram for source and target languages simultaneously, enforcing an L2 loss for known translations. Wang et al. (2020) compare and combine joint and mapping approaches.
Handling Non-Isomorphism Miceli Barone (2016) explores whether comparable corpora induce embedding spaces that are approximately isomorphic. Ormazabal et al. (2019) compare cross-lingual word embeddings induced via mapping methods with jointly-trained embeddings from Luong et al. (2015), finding that the latter are better in measures of isomorphism and BLI precision. Nakashole and Flauger (2018) argue that word embedding spaces are not globally linearly mappable. Others use non-linear mappings (e.g., Mohiuddin et al., 2020; Glavaš and Vulić, 2020) or post-process embeddings after training to improve quality (e.g., Peng et al., 2021; Faruqui et al., 2015; Mu and Viswanath, 2018). Eder et al. (2021) initialize a target embedding space with vectors from a higher-resource source space, then train the low-resource target. Zhang et al. (2017) minimize earth mover's distance over 50-dimensional pretrained word2vec embeddings. Ormazabal et al. (2021) learn source embeddings in reference to fixed target embeddings, given known translation pairs or pairs hypothesized via self-learning.

Examining & Exploiting Embedding Geometry
Emerging literature examines geometric properties of embedding spaces. In addition to isomorphism, some examine isotropy (e.g., Mimno and Thompson, 2017; Mu and Viswanath, 2018; Ethayarajh, 2019; Rajaee and Pilehvar, 2022; Rudman et al., 2022). Li et al. (2020) transform the semantic space of masked language models from a non-smooth anisotropic space into an isotropic Gaussian distribution. Su et al. (2021) apply whitening and dimensionality reduction to improve isotropy. Zhang et al. (2022) inject isotropy into a variational autoencoder, and Ethayarajh and Jurafsky (2021) recommend "adding an anisotropy penalty to the language modelling objective" as future work.

Background
We discuss the mathematical background used in our methods. Throughout, X ∈ ℝ^{n×d} and Y ∈ ℝ^{m×d} are the source and target word embedding spaces of d-dimensional word vectors, respectively. Where applicable, we assume seed translation pairs {(x_0, y_0), (x_1, y_1), ..., (x_s, y_s)} are given.

The Orthogonal Procrustes Problem
Schönemann (1966) derived the solution to the orthogonal Procrustes problem, whose goal is to find the orthogonal linear transformation W that solves:

$$W^{*} = \operatorname*{arg\,min}_{W \in O(d)} \lVert XW - Y \rVert_F$$

The solution is W = VU^T, where UΣV^T is the singular value decomposition of Y^T X. If X is a matrix of vectors corresponding to seed words x_i in {(x_0, y_0), (x_1, y_1), ..., (x_s, y_s)} and Y is a matrix of the corresponding y_i, then W is the linear transformation that minimizes the difference between the vector representations of known pairs.
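To make this concrete, below is a minimal NumPy sketch of the Schönemann solution. The function and variable names are ours; this is an illustration rather than our released implementation.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Find orthogonal W minimizing ||XW - Y||_F (Schonemann, 1966).

    X, Y: (s, d) matrices whose i-th rows are the seed-pair vectors x_i, y_i.
    """
    # U @ Sigma @ Vt is the SVD of Y^T X; the minimizer is W = V U^T.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return Vt.T @ U.T

# Sanity check: if Y is an exact rotation of X, Procrustes recovers it.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
W = orthogonal_procrustes(X, X @ Q)
assert np.allclose(X @ W, X @ Q)
```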

Embedding Space Mapping with VecMap
We use the popular VecMap toolkit for embedding space mapping, which can be run in supervised, semi-supervised, and unsupervised modes. At the time of their writing, Glavaš et al. (2019) deemed VecMap the most robust unsupervised method.
First, source and target word embeddings are unit-normed, mean-centered, and unit-normed again (Zhang et al., 2019). The bilingual lexicon is induced by whitening each space and then solving a variant of the orthogonal Procrustes problem. Spaces are then reweighted, dewhitened, and dimensionality-reduced, and translation pairs are extracted via nearest-neighbor search over the mapped embedding spaces. See the original works and implementation for details (Artetxe et al., 2018a).
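For reference, a short sketch of this normalization chain follows; it is a minimal re-implementation under our reading of the pipeline, not VecMap's own code.

```python
import numpy as np

def vecmap_normalize(E):
    """Unit-norm, mean-center, then unit-norm again (Zhang et al., 2019)."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # length-normalize rows
    E = E - E.mean(axis=0, keepdims=True)             # center each dimension
    return E / np.linalg.norm(E, axis=1, keepdims=True)
```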
Unsupervised and semi-supervised modes use the same framework as supervised mode, but with an iterative self-learning procedure that repeatedly solves the orthogonal Procrustes problem over hypothesized translations. On each iteration, new hypotheses are extracted. The modes differ only in how they induce the initial hypothesis seed pairs. In semi-supervised mode, this is a given input seed dictionary. In unsupervised mode, similarity matrices M_x = XX^T and M_z = ZZ^T are created over the first n vocabulary words. Word z_j is the assumed translation of x_i if the row of M_z for z_j is most similar to the row of M_x for x_i, compared to all other rows of M_z. See Artetxe et al. (2018b) for details.
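A rough sketch of this initialization is below, assuming sorted similarity rows are matched by nearest-neighbor search; see Artetxe et al. (2018b) for the exact procedure, which includes additional normalization steps.

```python
import numpy as np

def unsupervised_init(X, Z, n=4000):
    """Hypothesize seed translations by matching words whose sorted
    monolingual similarity distributions are most alike."""
    def sorted_sim_rows(E):
        E = E[:n] / np.linalg.norm(E[:n], axis=1, keepdims=True)
        return np.sort(E @ E.T, axis=1)  # sorting removes vocabulary order

    Mx, Mz = sorted_sim_rows(X), sorted_sim_rows(Z)
    # z_j is the hypothesized translation of x_i if row Mz[j] is the
    # nearest neighbor of row Mx[i].
    return np.argmax(Mx @ Mz.T, axis=1)
```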

Isomorphism Metrics
In NLP, relative isomorphism is often measured by Relational Similarity, Eigenvector Similarity, and Gromov-Hausdorff Distance. We describe these metrics in detail in this section.
Relational Similarity Given seed translation pairs, calculate the pairwise cosine similarities cos(x_i, x_j) for all seed words in X and cos(y_i, y_j) for the corresponding words in Y. The Pearson correlation between the two lists of cosine similarities is known as Relational Similarity (Vulić et al., 2020; Zhang et al., 2019).
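A compact sketch of the metric (helper names are ours; the upper triangle keeps each unordered pair once and drops self-similarities):

```python
import numpy as np
from scipy.stats import pearsonr

def relational_similarity(X_seed, Y_seed):
    """RSIM: Pearson correlation between the pairwise cosine-similarity
    lists of the seed words in each space. Rows of X_seed and Y_seed
    are aligned translation pairs."""
    def cos_list(E):
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        iu = np.triu_indices(len(E), k=1)  # each unordered pair once
        return (E @ E.T)[iu]

    r, _ = pearsonr(cos_list(X_seed), cos_list(Y_seed))
    return r
```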
Eigenvector Similarity (Søgaard et al., 2018) measures isomorphism between two spaces based on the Laplacian spectra of their k-nearest neighbor (k-NN) graphs. For seeds {x_0, x_1, ..., x_s} and {y_0, y_1, ..., y_s}, we compute unweighted k-NN graphs G_X and G_Y, then compute the graph Laplacians L_{G_X} and L_{G_Y} (the degree matrix minus the adjacency matrix). Let l_X be the maximum l such that the first l eigenvalues of L_{G_X} sum to less than 90% of the total sum of the eigenvalues (likewise for l_Y), and let l = min(l_X, l_Y). EVS is the sum of squared differences between the partial spectra:

$$\mathrm{EVS} = \sum_{i=1}^{l} \big(\lambda_i(L_{G_X}) - \lambda_i(L_{G_Y})\big)^2$$

Gromov-Hausdorff Distance is a "worst-case" metric that optimally maps embedding spaces with an isometry and then calculates the distance between nearest neighbors in the shared space:
• For each x in the source embeddings, find its nearest neighbor y in the target embeddings, and measure the distance.
• For each y in the target embeddings, find its nearest neighbor x in the source embeddings, and measure the distance.
• The Hausdorff distance is the worst (maximum) of the above nearest-neighbor distances.
• The Gromov-Hausdorff distance is the Hausdorff distance after an optimal isometric transformation that minimizes distances. As in previous work, since we apply mean-centering to source and target embeddings, we search only over the space of orthogonal transformations (Patra et al., 2019; Vulić et al., 2020). See Figure 2.
We follow Chazal et al. (2009) and approximate the Gromov-Hausdorff distance with the Bottleneck distance between the source and target embeddings.
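The sketch below illustrates the Hausdorff computation; for the orthogonal map one could plug in the Procrustes solution from Section 3.1, though the true GH distance optimizes over all such maps (and is approximated in practice via the Bottleneck distance, as above).

```python
import numpy as np

def hausdorff(X, Y):
    """Worst-case nearest-neighbor distance, taken in both directions."""
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # (n, m)
    fwd = D.min(axis=1).max()  # worst nearest-neighbor distance, X -> Y
    bwd = D.min(axis=0).max()  # worst nearest-neighbor distance, Y -> X
    return max(fwd, bwd)

def gh_like(X, Y, W):
    """Hausdorff distance after mapping X with a given orthogonal W."""
    return hausdorff(X @ W, Y)
```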

Method
We implement Skip-gram with negative sampling on GPU using PyTorch and use it to train monolingual embedding spaces for Bengali (bn), Ukrainian (uk), Tamil (ta), and English (en). Our implementation closely mirrors the official word2vec release (Mikolov et al., 2013a). We create comparison embedding spaces using the official word2vec release with default hyperparameters and map the resulting spaces from both algorithms with VecMap for BLI. We report precision@1 (P@1), a standard evaluation metric for BLI, on the development set in Table 1. Our implementation slightly outperforms word2vec except for ta in unsupervised mode.
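For illustration, here is a minimal PyTorch sketch of the Skip-gram negative-sampling loss at the heart of our implementation. Names and shapes are ours; the released code differs in its details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramSGNS(nn.Module):
    """Minimal Skip-gram with negative sampling (Mikolov et al., 2013a)."""

    def __init__(self, vocab_size, dim=300):
        super().__init__()
        self.in_emb = nn.Embedding(vocab_size, dim)   # word ("input") vectors
        self.out_emb = nn.Embedding(vocab_size, dim)  # context ("output") vectors

    def forward(self, center, context, negatives):
        # center, context: (B,) word ids; negatives: (B, K) sampled ids
        v = self.in_emb(center)                                     # (B, d)
        pos = (v * self.out_emb(context)).sum(-1)                   # (B,)
        neg = torch.bmm(self.out_emb(negatives),
                        v.unsqueeze(-1)).squeeze(-1)                # (B, K)
        # -log s(u_pos . v) - sum_k log s(-u_neg_k . v), s = sigmoid
        return (-F.logsigmoid(pos) - F.logsigmoid(-neg).sum(-1)).mean()
```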

Data
For the main experiments, we train word embeddings on the first 1 million lines of newscrawl2020 for en, bn, and ta (Barrault et al., 2020). For uk, we use the entirety of newscrawl2020 (∼427,000 lines). We normalize punctuation, lowercase, remove non-printing characters, and tokenize using standard Moses scripts. Domain-mismatch experiments in Section 5.2 use approximately 33.8 million lines of web-crawl from the English Common Crawl. Larger-data experiments in the same section use 93 million lines of English newscrawl2018-2020. The size of the training data in tokens is shown in Table 2.
We use the publicly available train and test dictionaries from MUSE (Lample et al., 2018), available at https://github.com/facebookresearch/MUSE#ground-truth-bilingual-dictionaries. For the development set, we use source words 6501-8000 from the "full" set. Train, development, and test sets are non-overlapping. We use all possible training-set seed words for our supervised losses, which is 6000-7000 word pairs per language (∼90% of train-set pairs are present in the trained embedding spaces; bn-en: 6859, uk-en: 6476, ta-en: 6019). We use the test set for evaluating downstream BLI.

Integrating Isomorphism Losses
To train the embedding space X such that it 1) captures distributional information via Skip-gram with negative sampling and 2) is geometrically similar to the reference word embedding space Y, we propose the objective below:

$$\mathcal{L} = (1 - \beta)\,\mathcal{L}_{SG} + \beta\,\mathcal{L}_{ISO}$$

L_SG is the familiar Skip-gram with negative sampling loss function and L_ISO is the isomorphism metric loss; β weights the two terms (Section 4.6). Each L_ISO requires a reference embedding space Y, trained separately using our base implementation.
We use English as the reference language because we generally assume its data quality is higher than that of the low-resource languages used on the source side. Y is normalized, mean-centered, and normalized again before use. On each calculation of L_ISO, we perform the same operations on a copy of the current model's word embeddings.
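Putting the pieces together, one training step might look like the sketch below, assuming the weighted objective above and the SGNS sketch from the previous section; `iso_loss_fn` stands in for any of the losses defined next.

```python
import torch
import torch.nn.functional as F

def normalize_copy(E):
    """Unit-norm, mean-center, unit-norm, applied to a differentiable
    copy of the current embeddings, mirroring the treatment of Y."""
    E = F.normalize(E, dim=1)
    E = E - E.mean(dim=0, keepdim=True)
    return F.normalize(E, dim=1)

def train_step(model, batch, iso_loss_fn, Y_ref, optimizer, beta):
    """One IsoVec-style update: weighted SGNS + isomorphism loss."""
    optimizer.zero_grad()
    l_sg = model(batch["center"], batch["context"], batch["negatives"])
    X = normalize_copy(model.in_emb.weight)  # gradients flow back to weights
    l_iso = iso_loss_fn(X, Y_ref)            # Y_ref is fixed (no gradient)
    loss = (1 - beta) * l_sg + beta * l_iso
    loss.backward()
    optimizer.step()
    return loss.item()
```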
L2 We implement L2 distance between seed translation pairs, normalized over samples. Intuitively, this coaxes translation pairs to have similar vector representations, with the hope that other words in X and Y will be tugged closer to their translations. L2 is easy to implement and understand, and computes quickly.
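As a sketch, assuming aligned seed matrices and mean normalization over the s samples:

```python
import torch

def l2_loss(X_seed, Y_seed):
    """Mean squared L2 distance between aligned seed translation pairs."""
    return ((X_seed - Y_seed) ** 2).sum(dim=1).mean()
```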
Proc-L2 We find the W that solves the orthogonal Procrustes problem as in Section 3.1, then minimize L2 distance over the mapped space: L_ISO = (1/s) Σ_i ||x_i W − y_i||².

Proc-L2+Init Same as Proc-L2, except that we initialize source seed embeddings with the reference translation vectors, so that the spaces begin with the same representations for known translations.
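A sketch of Proc-L2 follows. Whether gradients should flow through the SVD itself is a design choice; treating the per-step mapping W as a constant is our assumption here, not a detail stated above.

```python
import torch

def proc_l2_loss(X_seed, Y_seed):
    """Solve orthogonal Procrustes over the seeds, then apply L2 in the
    mapped space."""
    U, _, Vt = torch.linalg.svd(Y_seed.T @ X_seed)
    W = (Vt.T @ U.T).detach()  # treat the rotation as fixed this step
    return ((X_seed @ W - Y_seed) ** 2).sum(dim=1).mean()
```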
RSIM We implement relational similarity over seeds. Higher is better, so we minimize L_ISO = 1 − ρ, where ρ is the Pearson correlation between the lists of pairwise cosine similarities. Like Proc-L2+Init, we can also initialize the source space with reference seed embeddings; we call this RSIM+Init.

Unsupervised Losses
We use two unsupervised metrics to increase isomorphism when no seed translations are available.
RSIM-U In this unsupervised variant of RSIM, we calculate pairwise cosine similarities over the first k words in X and Y, sort the two lists, then calculate the Pearson correlation between them. As above, L_ISO = 1 − ρ. We use k = 2000 for efficiency.
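A differentiable sketch of RSIM-U (sorting is piecewise-constant in the permutation, so gradients flow through the sorted values):

```python
import torch
import torch.nn.functional as F

def rsim_u_loss(X, Y, k=2000):
    """1 - Pearson correlation between the sorted pairwise
    cosine-similarity lists of the first k words of each space."""
    n = min(k, X.size(0), Y.size(0))

    def sorted_cos(E):
        E = F.normalize(E[:n], dim=1)
        iu = torch.triu_indices(n, n, offset=1)      # each unordered pair once
        return (E @ E.T)[iu[0], iu[1]].sort().values

    x, y = sorted_cos(X), sorted_cos(Y)
    x, y = x - x.mean(), y - y.mean()
    return 1 - (x @ y) / (x.norm() * y.norm())
```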
EVS-U We calculate eigenvector similarity over the first 2000 words in X and Y.

On Differentiability
Each metric must be differentiable with respect to X, a matrix of the model's current word embeddings, to allow isomorphism-based losses to inform parameter updates in X.
L2 is straightforwardly differentiable, as it is the Frobenius norm of X − Y over seed rows. The same applies to the variants Proc-L2 and Proc-L2+Init. RSIM is naturally differentiable, as seen in the formulation below: for mean-centered cosine-similarity vectors x_sim and y_sim, the Pearson correlation coefficient is

$$\rho = \frac{x_{\text{sim}}^{T} y_{\text{sim}}}{\lVert x_{\text{sim}} \rVert \, \lVert y_{\text{sim}} \rVert}$$

EVS is not immediately differentiable due to the non-differentiable k-NN graph computation. Instead, we modify the graph-computation step to use a fully-connected weighted graph in which the edge weight is the dot product between node vectors. With this amended formulation, computing the gradients of the Laplacian eigenvalues is possible.
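A sketch of the differentiable EVS variant follows. Fixing the number of compared eigenvalues (ell) in place of the 90%-of-spectrum rule is our simplification, not the exact procedure.

```python
import torch

def evs_u_loss(X, Y, k=2000, ell=10):
    """Squared differences between the leading Laplacian eigenvalues of
    fully connected graphs with dot-product edge weights."""
    n = min(k, X.size(0), Y.size(0))

    def spectrum(E):
        A = E[:n] @ E[:n].T                  # weighted adjacency matrix
        A = A - torch.diag(torch.diag(A))    # drop self-loops
        L = torch.diag(A.sum(dim=1)) - A     # Laplacian: degree - adjacency
        return torch.linalg.eigvalsh(L)      # ascending order; differentiable

    ex, ey = spectrum(X), spectrum(Y)
    # Compare the ell largest eigenvalues of each spectrum.
    return ((ex[-ell:] - ey[-ell:]) ** 2).sum()
```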

β and Linear Mapping for BLI
Each isomorphism loss may be considered a different method, as each loss may cause the overall framework to behave differently. Accordingly, we set β for each loss function based on performance on the development set (we try β ∈ {0.5, 0.333, 0.2, 0.1, 0.01}; for RSIM* and EVS-U we also try 0.001, and an early L2 run used 0.05 and 0.0001). After selecting β, we evaluate and present results only on the test set. The β for each method is given in Table 3.

VecMap in supervised mode consistently scores higher than semi-supervised mode in all baseline experiments on the development set. For IsoVec, semi-supervised mapping often works best. We thus use VecMap in supervised mode for baselines and in semi-supervised mode for supervised IsoVec runs. This sometimes underestimates IsoVec's strength when supervised mapping would have performed better. For unsupervised experiments and baselines, we map in unsupervised mode. Each experiment is run five times and results are averaged. IsoVec and VecMap use one NVIDIA GeForce GTX 1080Ti GPU.

Experiments & Results
We pretrain English embeddings to use as reference space Y. IsoVec trains source space X.

Main Experiments
For baselines, we train source and target spaces separately for each run using our base implementation. In the experimental conditions, we train the source space with IsoVec using each isomorphism loss from Sections 4.3 and 4.4. In Table 4, we see that IsoVec consistently outperforms the baseline for bn-en and uk-en. For ta-en, it outperforms with Proc-L2+Init and with both unsupervised methods. In terms of training efficiency, L2-based methods perform comparably to the baseline (<10% time increase) and RSIM-based methods see a slight time increase (∼10-16% over baseline). EVS-based methods require an expensive eigendecomposition step, which causes a ∼2.5x time increase over the baseline.

Table 4: Main Experiments. Average P@1 (µ) and standard deviation (σ) over 5 runs of IsoVec with isomorphism losses for bn-en, uk-en, and ta-en.

Algorithm, Domain, & Data Mismatch
Søgaard et al. (2018) show that mapping methods fail for embeddings trained with different algorithms, and BLI performance deteriorates when source and target domains do not match (Marchisio et al., 2020). We test IsoVec under algorithm and domain mismatch using the best losses from the main experiments: Proc-L2+Init and RSIM-U. We use β as-is from the previous section.
The IsoVec base model is intended to mirror word2vec closely, but there are likely output differences due to implementation. We map the baseline source embeddings trained in the main experiments to varying en target spaces trained with the official word2vec release, so that algorithms do not match between source and target embedding spaces. We run experiments using the following training data:

• Algorithm Mismatch: 1 million lines of en newscrawl2020 (same as the main experiments). Shows the effect of algorithm mismatch only.
• +More Target-Side Data: 93 million lines of en newscrawl2018-20. Shows the effect of a target trained with ample in-domain data.
• +Domain Mismatch: 33.8 million lines of en Common Crawl (web-crawl). Shows the effect of different domains in source vs. target.

Table 5 contains baselines for our mismatch experiments and shows the drop in performance compared to the Table 4 baselines, where both source and target embedding spaces were trained with the IsoVec base model. This occurs across languages: moderately for supervised baselines, and severely for unsupervised ones. The large performance drop given more high-quality data of the same domain in unsupervised mode (+More Target-Side Data) is surprising, given that this target space is stronger than the one from Algorithm Mismatch alone. Perhaps its geometry has changed so considerably because of its additional data and different algorithm that it is too different from the lower-resource source space to be mapped with unsupervised methods. This should be investigated in future work.
We run Proc-L2+Init and RSIM-U in the Algorithm Mismatch, +More Target-Side Data, and +Domain Mismatch conditions as described above. Results are in Table 6. In supervised mode, IsoVec recovers from algorithm mismatch by 2.7-4.9 points, from domain mismatch by 2.5-7.3 points, and still improves when the target space is trained on ∼100x more data. Whereas the +Domain Mismatch and +More Target-Side Data baselines fail to extract any correct translation pairs in unsupervised mode, the RSIM-U method completely recovers in all conditions, equalling or outperforming the main unsupervised baseline from Table 4, which matched on algorithm, domain, and data size. IsoVec is thus useful for many types of distributional shift: algorithmic, domain, and amount of data available.

Effect on Isomorphism
Table 7 (left) shows the effect of IsoVec on global isomorphism measures. We measure relational similarity, eigenvector similarity, and Gromov-Hausdorff distance of the trained embedding spaces (before mapping) for all main experiments of Section 5.1 using scripts from Vulić et al. (2020), averaging over experiments. To avoid confusion with the IsoVec loss functions, we call the metrics "RelSim", "EigSim", and "GH". The script calculates EigSim (k = 2) over the first 10,000 embeddings in each space and GH over the first 5000. RelSim is calculated over the first 1000 seed translation pairs.
All supervised methods improve RelSim (↑ better). Perhaps surprisingly, initializing the source space with target embeddings (+Init) worsens isomorphism. RSIM is best, directly optimizing for this metric in a supervised manner. RelSim stays roughly consistent in the unsupervised experiments.
All uk-en and ta-en experiments improve GH (↓ better; Patra et al., 2019). GH worsens for bn-en despite improved BLI (Table 4). EigSim (↓ better) improves across all experiments except the uk-en supervised methods, despite improved BLI (notably, the initial EigSim for uk was low). EVS-U strongly improves EigSim, optimizing it directly. Table 7 (left) measures the unperturbed geometry of the spaces after training and shows that IsoVec improves isomorphism in a majority of settings. The same calculation over embeddings after mapping with semi-supervised VecMap is in Table A.1. It is interesting that baseline experiments performed better when mapped in supervised mode while spaces trained with IsoVec tended to map better in semi-supervised mode (as mentioned in Section 4.6); this may further indicate that the IsoVec spaces have become more geometrically similar.

The Promise of Geometric Losses
We have seen that IsoVec improves relative isomorphism and downstream BLI from word embedding spaces. The success of the unsupervised methods is particularly encouraging for the use of global isomorphism measures to improve embedding spaces. Notably, we use only the first 2000 words per space to calculate the unsupervised IsoVec losses; i.e., we coax these frequent words to have similar representations, regardless of identity. While there are likely some true translation pairs in the mix, there are almost certainly words in this subset of X whose translation is not in the first 2000 words of Y (and vice-versa), particularly when source and target corpora are from different domains. Regardless, the IsoVec unsupervised methods work.

Need for a Sensitive Isomorphism Metric
Previous authors found that EigSim and GH correlate well with BLI performance (Søgaard et al., 2018; Patra et al., 2019); our results, however, reveal a more nuanced story. In Table 7 (right), we correlate EigSim, RelSim, and GH with BLI P@1 over all runs of the main supervised IsoVec experiments (L2, Proc-L2, Proc-L2+Init, RSIM, RSIM+Init; 25 data points per calculation).
Recall that lower is better for GH and EigSim, so RelSim should correlate negatively with GH and EigSim, and GH positively with EigSim. Within language in Table 7 (right), however, only P@1 vs. GH on uk-en aligns with intuition. Many correlations are weak (gray; magnitude ≤ 0.05) or the opposite of expected; for instance, P@1 should increase with RelSim, but we see the opposite within language pair. Over all languages combined, the relationship is weakly positive.

Samples for Pearson's correlation should be drawn from the same population, and in Table 7 (right) we assume that population is our IsoVec embedding spaces. Perhaps the assumption is unfair: different IsoVec losses might induce different monolingual spaces in which specific metrics are indeed predictive of downstream BLI performance, but this may not be visible in the aggregate. An ideal metric, however, would predict downstream BLI performance regardless of how the monolingual spaces were trained, so that we might assess the potential of spaces to align well without having to map them and measure their performance with development or test dictionaries. In that light, the discrepancies in Table 7 (right) highlight the need for a more sensitive metric that works within language and with small differences in BLI performance.

We should thus be cautious drawing between- vs. within-language conclusions about isomorphism metrics and downstream BLI. When isomorphism metrics differ considerably, BLI performance may also differ similarly, as seen in previous work; if isomorphism scores are poor or too similar, however, the metrics may not be sensitive enough to be predictive. Future work should investigate these hypotheses and develop isomorphism metrics that are more sensitive. The spectral measures of Dubossarsky et al. (2020) might be examined in these lower-resource contexts, as the authors claim they correlate better with downstream BLI. All in all, though, our main results show that coaxing towards improved isomorphism as measured by the three popular metrics can improve BLI performance even if the scores are not strongly predictive of raw P@1.

Conclusion & Future Work
We present IsoVec, a new method for training word embeddings that directly injects global measures of embedding-space isomorphism into the Skip-gram loss function. Our three supervised and two unsupervised isomorphism loss functions successfully improve the mappability of monolingual word embedding spaces, leading to an improved ability to induce bilingual lexicons. IsoVec also shows promise under algorithm mismatch, domain mismatch, and data-size mismatch between source and target training corpora. Future work could extend our work to even greater algorithmic mismatches, and to massively multilingual contextualized models. We release IsoVec at https://github.com/kellymarchisio/isovec.

Limitations
As with most methods based on static word embeddings, our work is limited by polysemy. By using word2vec as a basis, we inherit many of its limitations, many of which are addressed in recent contextualized representation learning work. Future work might apply our methods to contextualized models. We also experiment with only English as a target language, limiting our method's universal applicability. Future work could extend our results to non-English pairs, and also evaluate monolingual embedding quality.

A.2 VecMap's Mapping Objective

The objective can be expressed in terms of the Frobenius norm/inner product: given the cyclic property of the trace, the objective is equivalent to maximizing Trace(XW_x W_z^T Z^T), as stated by Artetxe et al. (2018a).
A.3 RSIM vs. P@1: Pearson's Correlation

Figure A.1 shows relational similarity score vs. P@1 over all language pairs described in Section 5.3. We observe how it is possible to have within-language negative correlations but a positive overall correlation.

A.4 Effect on Isomorphism, After Mapping
Comparing with Table 7 in the main body of this paper, Table A.1 shows average isomorphism scores of source vs. reference embedding spaces after mapping with VecMap in semi-supervised mode. (Though we map the baseline in supervised mode and the unsupervised methods in unsupervised mode in the main body of the paper, we map everything in semi-supervised mode here for comparability.) RSIM best improves relational similarity and eigenvector similarity. GH distance improves for the supervised methods and RSIM-U.

Table 7 measures the unperturbed geometry of the space after applying IsoVec. Importantly, RelSim, GH, and EigSim do not require mapping for measurement, as they are invariant to transformation (RelSim and EigSim measure nearest-neighbor graphs, and GH measures nearest neighbors after an optimal isometric transform). Table A.1 measures a geometry that may have been perturbed by VecMap operations such as whitening/dewhitening and dimensionality reduction. While isomorphism scores over the mapped spaces appear more consistent in terms of internal patterns, Table 7 measures isomorphism induced directly by IsoVec, whereas Table A.1 may be influenced by VecMap as well.

Table A.2 shows correlations after mapping over only the supervised IsoVec methods. Compared with Table 7 (right), the relationships between the isomorphism measures are now more aligned with our expectations, but still no isomorphism measure is consistently predictive of P@1 across languages. Table A.3 shows the same calculations over all IsoVec experiments mapped in semi-supervised mode, though unsupervised training with semi-supervised mapping would probably not be used in practice as it is here for RSIM-U and EVS-U.

Figure 1: Proposed Method. Loss is a weighted combination of Skip-gram with negative sampling loss (left, with a reproduction of the familiar image from Mikolov et al. (2013a) for reader recognizability) and an isomorphism loss (right, ours) calculated in relation to a fixed reference space. Gray boxes are two possibilities explored in this work: Proc-L2 (supervised), where L_ISO is calculated over given seed translations, and RSIM-U (unsupervised).

Figure 2: Calculation of Gromov-Hausdorff (GH) distance: the worst-case distance between nearest neighbors in a shared embedding space after optimal orthogonal mapping. The right-most red dots have been orthogonally rotated to turn the Hausdorff distance into the GH distance.


Table 2: Size of training data (millions of tokens).

Table 3: β parameter for isomorphism losses. Each loss function should be considered a separate method, so β is set for each loss based on development-set performance. Once β is chosen, we evaluate on the test set.


Table 5: Effect of algorithm and data mismatch in source vs. target embedding spaces. Average P@1 over 5 runs (µ) with ∆ vs. baseline and standard deviation (σ). Isomorphism losses are not used here. Source-side embeddings are trained with our base implementation, target-side with word2vec (algorithm mismatch). Main Baseline is from the main experiments (Table 4). +More Target-Side Data (+More Trg Data) uses nearly 100x more data on the target side than previous experiments. +Domain Mismatch uses target embeddings trained on ∼34M lines of web crawl.