Bilingual Lexicon Induction for Low-Resource Languages using Graph Matching via Optimal Transport

Bilingual lexicons form a critical component of various natural language processing applications, including unsupervised and semi-supervised machine translation and cross-lingual information retrieval. In this work, we improve bilingual lexicon induction performance across 40 language pairs with a graph-matching method based on optimal transport. The method is especially strong with low amounts of supervision.


Introduction
Bilingual lexicon induction (BLI) from word embedding spaces is a popular task with a large body of existing literature (e.g. Mikolov et al., 2013; Artetxe et al., 2018; Conneau et al., 2018; Patra et al., 2019; Shi et al., 2021). The goal is to extract a dictionary of translation pairs given separate language-specific embedding spaces, which can then be used to bootstrap downstream tasks such as cross-lingual information retrieval and unsupervised/semi-supervised machine translation.
A great challenge across NLP is maintaining performance in low-resource scenarios. A common criticism of the BLI and low-resource MT literature is that while claims are made about diverse and under-resourced languages, research is often performed on down-sampled corpora of high-resource, highly-related languages on similar domains (Artetxe et al., 2020). Such corpora are not good proxies for true low-resource languages owing to data challenges such as dissimilar scripts, domain shift, noise, and lack of sufficient bitext (Marchisio et al., 2020). These differences can lead to dissimilarity between the embedding spaces (decreasing isometry), causing BLI to fail (Søgaard et al., 2018; Nakashole and Flauger, 2018; Ormazabal et al., 2019; Glavaš et al., 2019; Vulić et al., 2019; Patra et al., 2019; Marchisio et al., 2020).
There are two axes by which a language dataset is considered "low-resource". First, the language itself may be a low-resource language: one for which little bitext and/or monolingual text exists. Even for high-resource languages, the long tail of words may have poorly trained word embeddings due to rarity in the dataset (Gong et al., 2018; Czarnowska et al., 2019). In the data-poor setting of true low-resource languages, a great majority of words have little representation in the corpus, resulting in poorly-trained embeddings for a large proportion of them. The second axis is low supervision. Here, there are few ground-truth examples from which to learn. For BLI from word embedding spaces, low supervision means there are few seeds from which to induce a relationship between spaces, regardless of the quality of the spaces themselves.
We bring a new algorithm for graph-matching based on optimal transport (OT) to the NLP and BLI literature. We evaluate using 40 language pairs under varying amounts of supervision. The method works strikingly well across language pairs, especially in low-supervision contexts. As low supervision on low-resource languages reflects the real-world use case for BLI, this is an encouraging development for realistic scenarios.

Background
The typical baseline approach for BLI from word embedding spaces assumes that spaces can be mapped via linear transformation. Such methods typically involve solutions to the Procrustes problem (see Gower et al. (2004) for a review). Alternatively, a graph-based view considers words as nodes in undirected weighted graphs, where edges are the distance between words. Methods taking this view do not assume a linear mapping of the spaces exists, allowing for more flexible matching.
BLI from word embedding spaces Assume separately-trained monolingual word embedding spaces X ∈ R^{n×d}, Y ∈ R^{m×d}, where n/m are the source/target language vocabulary sizes and d is the embedding dimension. We build the matrices X̄ and Ȳ of seeds from X and Y, respectively, such that given s seed pairs (x_1, y_1), (x_2, y_2), ..., (x_s, y_s), the first row of X̄ is x_1, the second row is x_2, etc. We build Ȳ analogously from the y-component of each seed pair. The goal is to recover matches for the X \ X̄ and/or Y \ Ȳ non-seed words.
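As a concrete illustration, the seed matrices can be assembled by stacking the embedding rows of each seed pair. This is a minimal sketch; representing seed pairs as (source_row, target_row) vocabulary indices is an assumption for illustration, not the paper's data format:

```python
import numpy as np

def build_seed_matrices(X, Y, seed_pairs):
    """Stack seed embeddings: row i of X_bar is the i-th source seed's
    embedding, and row i of Y_bar is its translation's embedding.
    seed_pairs: list of (source_row, target_row) vocabulary indices."""
    src = [s for s, _ in seed_pairs]
    tgt = [t for _, t in seed_pairs]
    return X[src], Y[tgt]
```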
Procrustes Many BLI methods use solutions to the Procrustes problem (e.g. Artetxe et al., 2019b; Conneau et al., 2018; Patra et al., 2019). These compute the optimal transform W to map seeds:

W* = argmin_W ||X̄W − Ȳ||_F    (1)

Once solved for W, XW and Y live in a shared space and translation pairs can be extracted via nearest-neighbor search. Constrained to the space of orthogonal matrices, Equation 1 has a simple closed-form solution (Schönemann, 1966): W* = UV^T, where UΣV^T is the singular value decomposition of X̄^T Ȳ.

Graph View Here, words are nodes in monolingual graphs G_x ∈ R^{n×n}, G_y ∈ R^{m×m}, and cells in G_x, G_y are edge weights representing distance between words. As is common in NLP, we use cosine similarity. The objective function is Equation 2, where Π is the set of permutation matrices:

P* = argmin_{P ∈ Π} ||G_x − PG_yP^T||_F^2    (2)

Intuitively, PG_yP^T finds the optimal relabeling of G_y to align with G_x. This "minimizes edge-disagreements" between G_x and G_y. This graph-matching objective is NP-hard. Equation 3 is equivalent:

P* = argmax_{P ∈ Π} trace(G_x^T PG_yP^T)    (3)
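The closed-form orthogonal solution to Equation 1 can be sketched as follows. This is a minimal sketch of the standard Schönemann SVD construction, not the authors' implementation:

```python
import numpy as np

def orthogonal_procrustes(X_seed, Y_seed):
    """W = argmin_{W orthogonal} ||X_seed @ W - Y_seed||_F.
    Closed form (Schönemann, 1966): W = U V^T, where
    U S V^T is the SVD of X_seed^T Y_seed."""
    U, _, Vt = np.linalg.svd(X_seed.T @ Y_seed)
    return U @ Vt
```

After mapping, the rotated source space and the target space live in a shared space, and nearest-neighbor (or CSLS) retrieval extracts translation pairs.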
Ex. Take source words x_1, x_2. We wish to recover their valid translations y_{x_1}, y_{x_2}. If distance(x_1, x_2) = distance(y_{x_1}, y_{x_2}), a solution P can have an edge-disagreement of 0 here. We then extract y_{x_1}, y_{x_2} as translations of x_1, x_2. In reality, though, it is unlikely that distance(x_1, x_2) = distance(y_{x_1}, y_{x_2}) exactly.
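The edge-disagreement objective for this two-word example can be checked numerically (the distance values below are hypothetical, chosen only for illustration):

```python
import numpy as np

def edge_disagreement(Gx, Gy, P):
    # ||Gx - P Gy P^T||_F^2, the quantity Equation 2 minimizes.
    return np.linalg.norm(Gx - P @ Gy @ P.T, "fro") ** 2

# Two-node graphs with equal edge weights: zero disagreement.
Gx = np.array([[0.0, 0.3], [0.3, 0.0]])
Gy = np.array([[0.0, 0.3], [0.3, 0.0]])
P = np.eye(2)                          # match x1 -> y_x1, x2 -> y_x2
zero = edge_disagreement(Gx, Gy, P)    # exactly 0.0

# Slightly different edge weights: small but non-zero disagreement.
Gy2 = np.array([[0.0, 0.5], [0.5, 0.0]])
small = edge_disagreement(Gx, Gy2, P)  # ~ 2 * 0.2**2
```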
Because Equation 2 finds the ideal P to minimize edge disagreements over the entire graphs, we hope that nodes paired by P are valid translations. If G_x and G_y are isomorphic and there is a unique solution, then P correctly recovers all translations. (A permutation matrix represents a one-to-one mapping: there is a single 1 in each row and column, and 0 elsewhere.) Graph-matching is an active research field and is computationally prohibitive on large graphs, but approximation algorithms exist. BLI involves matching large, non-isomorphic graphs, among the greatest challenges for graph-matching.

Vogelstein et al. (2015)'s Fast Approximate Quadratic Assignment Problem algorithm (FAQ) uses gradient ascent to approximate a solution to Equation 2. Motivated by "connectomics" in neuroscience (the study of brain graphs with biological [groups of] neurons as nodes and neuronal connections as edges), FAQ was designed to perform accurately and efficiently on large graphs.
FAQ relaxes the search space of Equation 3 to allow any doubly-stochastic matrix (the set D). Each cell in a doubly-stochastic matrix is a non-negative real number, and each row/column sums to 1. The set D thus contains Π but is much larger. Relaxing the search space makes it easier to optimize Equation 3 via gradient ascent/descent. FAQ solves the objective with the Frank-Wolfe method (Frank and Wolfe, 1956), then projects back to a permutation matrix.
Algorithm 1 is FAQ; it maximizes f(P) = trace(G_x^T PG_yP^T). The graphs may be built as G_x = XX^T and G_y = YY^T. G_x and G_y need not have the same dimensionality.
Step 2 finds a permutation matrix approximation Q^{i} to P^{i} in the direction of the gradient. Finding such a matrix requires approximation when P is high-dimensional; here, it is solved via the Hungarian Algorithm (Kuhn, 1955; Jonker and Volgenant, 1987), whose solution is a permutation matrix. Step 4 then updates P^{i+1} := αP^{i} + (1 − α)Q^{i}. After the final iteration, P^{n} is projected back onto the space of permutation matrices via the Hungarian Algorithm. Seeded Graph Matching (SGM; Fishkind et al., 2019) is a variant of FAQ allowing for supervision, and was recently shown to be effective for BLI by Marchisio et al. (2021). The interested reader may find details in Vogelstein et al. (2015) and Fishkind et al. (2019). Languages differ in synonyms/antonyms and idiosyncratic concepts; it is more natural to assume that an exact matching between word spaces does not exist, and that multiple matchings may be equally valid. This is an inexact graph-matching problem. FAQ generally performs poorly finding non-seeded inexact matchings (Saad-Eldin et al., 2021).
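As a runnable illustration of FAQ (not the authors' code), SciPy ships this algorithm as `scipy.optimize.quadratic_assignment` with `method="faq"`; with `maximize=True` it approximately solves Equation 3:

```python
import numpy as np
from scipy.optimize import quadratic_assignment

rng = np.random.default_rng(0)

# Toy embeddings; Y is a permuted copy of X, so Gx and Gy are isomorphic.
X = rng.normal(size=(8, 4))
perm = rng.permutation(8)
Y = X[perm]

Gx, Gy = X @ X.T, Y @ Y.T   # word graphs, as in the text

# FAQ: Frank-Wolfe over doubly-stochastic matrices, then projection
# back to a permutation matrix via the Hungarian Algorithm.
res = quadratic_assignment(Gx, Gy, method="faq",
                           options={"maximize": True})
matching = res.col_ind      # node i of Gx is matched to node matching[i] of Gy
```

On small isomorphic graphs like these, FAQ typically recovers the planted permutation; on the large, non-isomorphic graphs of real BLI, it needs the seeding of SGM or the OT step of GOAT.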

GOAT
Graph Matching via OptimAl Transport (GOAT) (Saad-Eldin et al., 2021) is a new graph-matching method which uses advances in OT. Similar to SGM, GOAT amends FAQ and can use seeds. GOAT has been successful for the inexact graph-matching problem on non-isomorphic graphs: whereas FAQ rapidly fails on non-isomorphic graphs, GOAT maintains strong performance.
Optimal Transport OT is an optimization problem concerned with the most efficient way to transfer probability mass from distribution μ to distribution ν. Discrete OT minimizes the inner product of a transportation "plan" matrix P with a cost matrix C, as in Equation 4:

P* = argmin_{P ∈ U(r,c)} ⟨P, C⟩    (4)

⟨·, ·⟩ is the Frobenius inner product.
P is an element of the "transportation polytope" U(r, c): the set of matrices whose rows sum to r and columns sum to c. The Hungarian Algorithm approximately solves OT, but the search space is restricted to permutation matrices.
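The permutation-restricted case can be seen with a toy cost matrix (a minimal illustration with made-up costs, not tied to the paper's data): with uniform marginals and a one-to-one plan, OT becomes the assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: moving one unit of mass i -> j costs C[i, j].
C = np.array([[0.0, 2.0, 2.0],
              [2.0, 0.0, 2.0],
              [2.0, 2.0, 0.0]])

# Restricting the plan to permutation matrices turns OT into the
# assignment problem, which the Hungarian Algorithm solves.
row, col = linear_sum_assignment(C)
cost = C[row, col].sum()   # total cost under the best one-to-one plan
```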
Sinkhorn: Lightspeed OT Cuturi (2013) introduces Sinkhorn distance, an approximation of OT distance that can be solved quickly and accurately by adding an entropy penalty h to Equation 4.
Adding h makes the objective easier and more efficient to compute, and encourages "intermediary" solutions similar to those seen in the Intuition subsection.
Equation 5 adds the entropy penalty:

P^λ = argmin_{P ∈ U(r,c)} ⟨P, C⟩ − (1/λ)h(P)    (5)

As λ → ∞, P^λ approaches the ideal transportation matrix P*. Cuturi (2013) shows that Equation 5 can be computed using Sinkhorn's algorithm (Sinkhorn, 1967). The interested reader can see details of the algorithm in Cuturi (2013); Peyré and Cuturi (2019). Unlike the Hungarian Algorithm, Sinkhorn has no restriction to a permutation-matrix solution and can be solved over any U(r, c).
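A minimal sketch of Sinkhorn's algorithm follows: alternating row/column rescaling of K = exp(−λC). This is a simplified version of Cuturi (2013) without the numerical-stability refinements (e.g. log-domain updates) a production implementation would use:

```python
import numpy as np

def sinkhorn_plan(C, r, c, lam=50.0, n_iter=500):
    """Approximate P_lambda = argmin_{P in U(r,c)} <P, C> - (1/lam) h(P)
    by alternately rescaling the rows and columns of K = exp(-lam * C)."""
    K = np.exp(-lam * C)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)   # scale columns toward marginal c
        u = r / (K @ v)     # scale rows toward marginal r
    return u[:, None] * K * v[None, :]
```

Unlike a Hungarian solution, the returned plan is dense: it lies in U(r, c) rather than in the set of permutation matrices.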
Intuition The critical difference between SGM/FAQ and GOAT is how each calculates step direction based on the gradient. Under the hood, each algorithm maximizes trace(Q^T ∇f(P^{i})) to compute Q^{i} (the step direction) in Step 2 of their respective algorithms. See Saad-Eldin et al. (2021) or Fishkind et al. (2019) for a derivation. FAQ uses the Hungarian Algorithm, and GOAT uses LOT (Sinkhorn).
For a given ∇f(P^{i}), there may be two valid permutation matrices Q_1 and Q_2 that maximize the trace. When multiple solutions exist, the Hungarian Algorithm chooses one arbitrarily. Thus, updates of P in FAQ are constrained to be permutation matrices. Saad-Eldin et al. (2021) find that seed order influences the solution in a popular implementation of the Hungarian Algorithm. Since BLI is a high-dimensional many-to-many task, arbitrary choices could meaningfully affect the result.
GOAT, on the other hand, can step in the direction of a doubly-stochastic matrix. Saad-Eldin et al. (2021) prove that given multiple permutation matrices that equally approximate the gradient at P^{i}, any convex linear combination is a doubly-stochastic matrix that equally approximates the gradient: P^λ is a weighted combination of many valid solutions, obviating the need to arbitrarily select one for the gradient-based update. LOT's output of a doubly-stochastic matrix in Step 2 is similar to finding such a P^λ in that it needn't discretize to a single permutation matrix. In this way, GOAT can be thought of as taking a step that incorporates many possible permutation solutions. Thus, whereas FAQ takes non-deterministic "choppy" update steps, GOAT optimizes smoothly and deterministically. Figure 1 is an illustration.
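The difference can be made concrete with a gradient whose assignment problem has a tie (a hypothetical 2×2 example constructed for illustration, not taken from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# A gradient for which two permutations maximize trace(Q^T G).
G = np.ones((2, 2))

Q1 = np.eye(2)                           # identity permutation
Q2 = np.array([[0.0, 1.0], [1.0, 0.0]])  # swap permutation
tie = np.trace(Q1.T @ G) == np.trace(Q2.T @ G)  # both achieve trace 2

# FAQ's Hungarian step must commit to one of them arbitrarily:
row, col = linear_sum_assignment(G, maximize=True)

# GOAT can instead step toward a convex combination, which achieves
# the same trace while committing to neither permutation:
P_blend = 0.5 * Q1 + 0.5 * Q2
```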

Experimental Setup
We run Procrustes, SGM, and GOAT on 40 language pairs. We also run system combination experiments similar to Marchisio et al. (2021). We evaluate with the standard precision@1 (P@1). We induce lexicons using (1) the closed-form solution to the orthogonal Procrustes problem of Equation 1, extracting nearest neighbors using CSLS (Conneau et al., 2018), (2) SGM, solving the seeded version of Equation 2, and (3) GOAT. Word graphs are built as G_x = XX^T and G_y = YY^T.

System Combination We perform system combination experiments analogous to those of Marchisio et al. (2021), incorporating GOAT. Figure 2 shows the system, which is made of two components: GOAT run in forward and reverse directions, and "Iterative Procrustes with Stochastic-Add" from Marchisio et al. (2021). This iterative version of Procrustes runs Procrustes in source→target and target→source directions and feeds H random hypotheses from the intersection of both directions into another run of Procrustes with the gold seeds. The process repeats for I iterations, adding H more random hypotheses each time until all are chosen. We set H = 100 and I = 5, as in the original work.
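The combination loop can be sketched as follows. This is a hedged sketch: the `induce` callback is an assumed placeholder standing in for one directional run of Procrustes with CSLS retrieval, and only H and I follow the paper:

```python
import random

def iterative_stochastic_add(induce, X, Y, gold_seeds, H=100, I=5):
    """Sketch of 'Iterative Procrustes with Stochastic-Add'
    (Marchisio et al., 2021). induce(src, tgt, seeds) is assumed to
    return hypothesis (src_word, tgt_word) pairs for one direction."""
    chosen = []
    for _ in range(I):
        seeds = gold_seeds + chosen
        fwd = set(induce(X, Y, seeds))
        # Reverse direction, then flip pairs back for the intersection.
        rev = induce(Y, X, [(t, s) for s, t in seeds])
        bwd = {(s, t) for t, s in rev}
        # Keep hypotheses both directions agree on, not yet chosen.
        pool = [p for p in sorted(fwd & bwd) if p not in set(chosen)]
        random.shuffle(pool)
        chosen += pool[:H]   # stochastically add H more hypotheses
    return gold_seeds + chosen
```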

Data & Software
We use publicly-available fastText word embeddings (Bojanowski et al., 2017), which we normalize, mean-center, and renormalize (Artetxe et al., 2018; Zhang et al., 2019), and bilingual dictionaries from MUSE filtered to be one-to-one. For languages with 200,000+ embeddings, we use the first 200,000. Dictionary and embedding space sizes are in Appendix Table A1. Each language pair has ∼4100-4900 translation pairs post-filtering. We choose 0-4000 pairs in frequency order as seeds for experiments, leaving the rest as the test set. For SGM and GOAT, we use the publicly-available implementations from the GOAT repository with default hyperparameters (barycenter initialization). We set reg=500 for GOAT. For system combination experiments, we amend the code from Marchisio et al. (2021) to incorporate GOAT.

Languages
The languages chosen reflect various language families and writing systems. The language families represented include:

Balto-Slavic: Macedonian, Bosnian, Russian
Germanic: English, German
Indo-Iranian: Bengali, Persian
Romance: French, Spanish, Portuguese, Italian

The chosen languages use varying writing systems, including those using or derived from Latin, Cyrillic, Arabic, Tamil, and character-based scripts.

Results
Results of Procrustes vs. SGM vs. GOAT are in Table 1, visualized in Figure 3.
Procrustes vs. SGM Marchisio et al. (2021) conclude that SGM strongly outperforms Procrustes for English→German and Russian→English with 100+ seeds. We find that the trend holds across language pairs, with the effect even stronger with less supervision. SGM performs reasonably with only 50 seeds for nearly all languages, and with only 25 seeds in many. Chinese↔English and Japanese↔English perform relatively worse, and highly-related languages perform best: French, Spanish, Italian, and Portuguese. German↔English performance is low relative to some less-related languages, which have surprisingly strong performance from SGM: Indonesian↔English and Macedonian↔English score P@1 ≈ 50-60, even with low supervision. Except for the aforementioned highly-related language pairs, Procrustes does not perform above ∼10 for any language pair with ≤ 100 seeds, whereas SGM exceeds P@1 = 10 with only 25 seeds for 33 of 40 pairs.

SGM vs. GOAT GOAT improves considerably over SGM for nearly all language pairs, and the effect is particularly strong with very low amounts of seeds and less-related languages. GOAT improves upon SGM by +19.0, +8.5, and +7.9 on English→Bengali with 25, 50, and 75 seeds, respectively. As the major use case of low-resource BLI and MT is dissimilar languages with low supervision, this is an encouraging result. It generally takes 200+ seeds for SGM to achieve similar scores to GOAT with just 25 seeds.

Isomorphism of Embedding Spaces
Eigenvector similarity (EVS; Søgaard et al., 2018) measures isomorphism of embedding spaces based on the difference of Laplacian eigenvalues. Gromov-Hausdorff distance (GH) measures distance based on nearest neighbors after an optimal orthogonal transformation (Patra et al., 2019). EVS and GH are symmetric, and lower values mean more isometric spaces. Refer to the original papers for mathematical descriptions. We compute the metrics over the word embeddings using scripts from Vulić et al. (2020), and find moderate correlation between EVS and GH (Spearman's ρ = 0.434, Pearson's r = 0.44).

Table 1: P@1 of Procrustes (P), SGM (S), or GOAT (G). ∆ is gain/loss of GOAT vs. SGM. Full results in Appendix.

Figure 3: Visualization of Table 1 results for select languages. Procrustes (dashed) vs. SGM (dotted) vs. GOAT (solid). X-axis: # of seeds (log scale). Y-axis: Precision@1 (↑ is better). GOAT is typically best.

Figure 4 shows the relationship between relative isomorphism of each language vs. English, and performance of Procrustes/GOAT at 200 seeds. Trends indicate that higher isomorphism varies with higher precision from Procrustes and GOAT. GH shows a moderate to strong negative Pearson's correlation with performance from Procrustes and GOAT: r = −0.47 and r = −0.53, respectively, for *-to-en, and −0.55 and −0.61 for en-to-*. EVS correlates weakly negatively with performance from Procrustes (*-to-en: −0.06, en-to-*: −0.28) and strongly negatively with GOAT (*-to-en: −0.67, en-to-*: −0.75). As higher GH/EVS indicates less isomorphism, negative correlations imply that lower degrees of isomorphism correlate with lower scores from Procrustes/GOAT.
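The eigenvalue comparison underlying EVS can be sketched as follows. This is a simplified, hedged sketch: the full metric of Søgaard et al. (2018) builds nearest-neighbor subgraphs and chooses the number of eigenvalues by spectral decay, and the adjacency construction here is an assumption for illustration:

```python
import numpy as np

def laplacian_eigs(G, k):
    """Top-k eigenvalues of the graph Laplacian of a similarity matrix."""
    A = np.abs(G).astype(float)
    np.fill_diagonal(A, 0.0)          # no self-loops
    L = np.diag(A.sum(axis=1)) - A    # unnormalized Laplacian
    return np.sort(np.linalg.eigvalsh(L))[::-1][:k]

def eigenvector_similarity(Gx, Gy, k=10):
    # Sum of squared differences of top-k Laplacian eigenvalues;
    # lower means the two graphs are closer to isometric.
    d = laplacian_eigs(Gx, k) - laplacian_eigs(Gy, k)
    return float(np.sum(d ** 2))
```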

System Combination
System combination results are in Table 3. Similar to Marchisio et al. (2021)'s findings for their combined Procrustes/SGM system, we find that (1) our combined Procrustes/GOAT system outperforms Procrustes and GOAT alone, (2) ending with Iterative Procrustes is best for moderate amounts of seeds, and (3) ending with GOAT is best for very low or very high numbers of seeds.
Whether we end with Iterative Procrustes vs. GOAT is critically important for the lowest seed sizes: -EndGOAT (-EG) usually fails with 25 seeds; all language pairs except German↔English and Russian↔English score P@1 < 15.0, and most score P@1 < 2.0. Simply switching the order of processing in the combination system, however, boosts performance dramatically: e.g., from 0.6 for StartProc-EndGOAT to 61.5 for StartGOAT-EndProc for Bosnian→English with 25 seeds.
There are some language pairs such as English→Persian and Russian↔English where a previous experiment with no seeds had reasonable performance, but the combined system failed.It is worth investigating where this discrepancy arises.

Discussion
We have seen GOAT's strength in low-resource scenarios and in non-isomorphic embedding spaces. As the major use case of low-resource BLI and MT is dissimilar languages with low supervision, GOAT's strong performance is an encouraging result for real-world applications. Furthermore, GOAT outperforms SGM. As the graph-matching objective is NP-hard, all algorithms are approximate; GOAT does a better job by making a better calculation of step direction. Chinese↔English and Japanese↔English are outliers, which is worthy of future investigation. Notably, these languages have very poor isomorphism scores in relation to English.

Why might graph-based methods work?
The goal for Procrustes is to find the ideal linear transformation W_ideal ∈ R^{d×d} to map the spaces, where d is the word embedding dimension. Seeds in Procrustes solve Equation 1 to find an approximation W to W_ideal. Accordingly, the seeds can be thought of as samples from which one deduces the optimal linear transformation. This is a supervised learning problem, so when there are few seeds/samples, it is difficult to estimate W_ideal. Furthermore, the entire space X is mapped by W to a shared space with Y, meaning that every point in X is subject to a potentially inaccurate mapping W: the mapping extrapolates to the entire space. As graph-based methods, GOAT and SGM do not suffer this issue and can induce non-linear relationships. Graph methods can be thought of as a semi-supervised learning problem: even words that don't serve as seeds are incorporated in the matching process. The graph manifold provides additional information that can be exploited.
Secondly, the dimension of the relationship between words in GOAT/SGM is much lower than for Procrustes. For the former, the relationship is one-dimensional: distance. As words for the Procrustes method are embedded in d-dimensional Euclidean space, their relationships have a magnitude and a direction: they are (d + 1)-dimensional. It is possible that the lower dimension in GOAT/SGM makes them robust to noise, explaining why the graph-based methods outperform Procrustes in low-resource settings. This hypothesis should be investigated in follow-up studies.
Related Work

BLI Recent years have seen a proliferation of the BLI literature (e.g. Ruder et al., 2018; Aldarmaki et al., 2018; Joulin et al., 2018; Doval et al., 2018; Artetxe et al., 2019a; Huang et al., 2019; Patra et al., 2019; Zhang et al., 2020; Biesialska and Ruiz Costa-Jussà, 2020). Many use Procrustes-based solutions, which assume that embedding spaces are roughly isomorphic. Wang et al. (2021) argue that the mapping can only be piece-wise linear, and induce multiple mappings. Ganesan et al. (2021) learn an "invertible neural network" as a non-linear mapping of spaces, and Cao and Zhao (2018) align spaces using point set registration. Many approaches address only high-resource languages. The tendency to evaluate on similar languages with high-quality data from similar domains hinders advancement in the field (Artetxe et al., 2020).

Other work applies the Hungarian Algorithm for BLI from text. Lian et al. and Alaux et al. (2018) align all languages to a common space for multilingual BLI. The latter use Sinkhorn to approximate a permutation matrix in their formulation. Zhao et al. (2020a) incorporate OT for semi-supervised BLI.

Conclusion
We perform bilingual lexicon induction from word embedding spaces of 40 language pairs, utilizing the newly-developed GOAT algorithm for graph-matching. Performance is strong across all pairs, especially on dissimilar languages with low supervision. As the major use case of low-resource BLI and MT is dissimilar languages with low supervision, the strong performance of GOAT is an encouraging result for real-world applications.

Limitations
Although we evaluate GOAT on 40 language pairs, this does not capture the full linguistic diversity of world languages. Languages of Eurasia are overrepresented, particularly the Indo-European family. Each pair has at least one high-resource Indo-European language which uses the Latin script. Future work should examine GOAT's performance when both languages are low-resource, and on an even broader diversity of languages from around the world. Furthermore, graph-matching methods are considerably slower than calculating the solution to the orthogonal Procrustes problem on GPU, potentially limiting the former's usefulness when one must match large sets of words. Future work might examine the speed/accuracy trade-off between methods as embedding space size scales.

Unsupervised Performance For some highly-related languages, GOAT performs well even with no seeds (unsupervised). GOAT scores 48.8 on English→German, 34.5 on German→English, 62.4 on English→Spanish, and 19.6 on Spanish→English with no supervision. Particularly striking is the unsupervised performance on highly-related languages: >87 on Italian↔French and >90 for Spanish↔Portuguese. We suspect that the word embedding spaces are highly isomorphic for these language pairs, allowing GOAT (and sometimes SGM) to easily recover the translations.
Iterative IterSGM/IterGOAT perform similarly across conditions, with a few exceptions where either performs very strongly with no supervision: IterGOAT scores 49.2, 45.2, 34.4, 58.2, and 55.9 for En-De, En-Fa, De-En, Id-En, and Ru-En, respectively, and IterSGM scores 57.3 for En-Ru. On Chinese↔English, IterGOAT underperforms IterSGM, similar to GOAT's underperformance of SGM in the single run.
Similar to Marchisio et al. (2021), we find that IterProc compensates for an initially poor first run and outperforms IterSGM with a moderate amount of seeds (100+). Extending to the very lowest seed sizes (0-75), however, IterSGM/IterGOAT are superior. With 25 seeds, IterProc fails for all language pairs except En↔De and En↔Ru, scoring P@1 < 5. IterSGM and IterGOAT, however, perform reasonably well for most language pairs with 25 seeds, suggesting that the graph-based framing is the better approach for low-seed levels. At the highest supervision level (2000+ seeds), IterSGM/IterGOAT again tend to be superior.

Figure 1: Optimization step of FAQ vs. GOAT. FAQ arbitrarily chooses the direction of a permutation matrix. GOAT averages permutation matrices to take a smoother path.

Table 3: P@1 of Combination Exps. -EP starts with GOAT, ends with IterProc. -EG starts with IterProc, ends with GOAT. Prev is previous best of prior experiments. Some seed sizes omitted for brevity (see Appendix).
Results of Iterative Procrustes (IterProc), Iterative SGM (IterSGM), and Iterative GOAT (IterGOAT) are in Table A5. We run the Iterative Procrustes and Iterative SGM procedures of Marchisio et al. (2021) with stochastic-add. Here, Procrustes [or SGM] is run in source↔target directions, hypotheses are intersected, and H random hypotheses are added to the gold seeds and fed into subsequent runs of Procrustes [SGM]. The next iteration adds 2H hypotheses, repeating until all hypotheses are chosen. We set H = 100 and create an analogous iterative algorithm for GOAT, which we call Iterative GOAT.