A Structure-Aware Generative Adversarial Network for Bilingual Lexicon Induction

Introduction
Bilingual lexicon induction (BLI) has emerged as a crucial task in natural language processing (NLP), focusing on the discovery of corresponding words between two languages using monolingual corpora. Due to its ability to facilitate the transfer of semantic knowledge between languages, BLI has been successfully applied in various NLP applications, including machine translation (Artetxe et al., 2018c; Ren et al., 2020), cross-lingual sentiment analysis (Singh and Lefever, 2020), and text classification (Dong and de Melo, 2019).
Most BLI methods aim to learn a mapping function that aligns the word embeddings of two languages into a shared embedding space. This allows independently trained monolingual embeddings to be leveraged, with the learned mapping then used to generate bilingual lexicons (Mikolov et al., 2013; Glavaš et al., 2019). Among these, Mikolov et al. (2013) first observed that a linear orthogonal mapping is empirically effective in transforming the source embedding space into the target language's space. The mapping was learned by minimizing the squared Euclidean distance between translation pairs in a given parallel vocabulary. They attributed the success of their method to the isomorphic assumption, i.e., that the two embedding spaces exhibit similar geometric structures, since they found that the linear projection outperformed its non-linear counterpart based on multilayer neural networks. Building upon this work, various BLI methods have been proposed to improve induction performance by enforcing an orthogonality constraint (Lample et al., 2018), normalizing the embeddings (Artetxe et al., 2018a), relaxing the isomorphic assumption (Patra et al., 2019), leveraging clique-level information (Ren et al., 2020), refining with the Coherent Point Drift algorithm (Cao and Zhao, 2018; Oprea et al., 2022), distinguishing relative orders (Tian et al., 2022), etc. These works show that reliable mapping functions can be learned even with weak supervision.
Furthermore, recent advancements have introduced several unsupervised models that use adversarial training to learn mapping functions without the need for parallel data (Lample et al., 2018; Bai et al., 2019; Mohiuddin and Joty, 2019; Xiong and Tao, 2021), offering a data-driven, scalable, and language-independent approach to inducing cross-lingual representations for low-resource languages. However, existing adversarial methods focus on word-level alignment and treat the words in the embedding space as isolated entities, ignoring the underlying topological structures among words. As a result, the relationships between words are not preserved, and the topological structure of the embedding spaces is not well exploited during training, leading to poor performance compared with non-adversarial methods (Artetxe et al., 2018b; Ren et al., 2020).
In addition, conventional BLI methods typically assume that the embedding spaces of different languages are nearly isomorphic, and based on this assumption they learn a global linear mapping function shared by all words. However, recent studies (Søgaard et al., 2018; Patra et al., 2019) have found that the isomorphic assumption may not hold strictly, due to deviations in the distributions of word embeddings across languages. Consequently, the performance of BLI methods can degrade, especially for language pairs far from isometry; in such cases, a globally aligned mapping function may not be optimal. Some approaches attempt to alleviate this problem by learning personalized mapping functions for different words or by employing supervised non-linear mappings in a latent space (Glavaš and Vulić, 2020; Tian et al., 2022; Mohiuddin et al., 2020). However, supervision is indispensable to these proposals, so they cannot be applied in a fully unsupervised setting without any labeled data.
To address these challenges, we propose a novel unsupervised model called the structure-aware generative adversarial network (SA-GAN) to explicitly capture multiple kinds of topological structure information for accurate BLI. Specifically, given a source language and a target language, SA-GAN first views the embedding space of each language as a graph and utilizes two lightweight graph convolutional networks (GCNs) to encode the two sets of embeddings, exploring the intra-space topological structures. With the extracted structural information, we formulate the learning of a mapping function as an adversarial game. SA-GAN employs a GAN to learn a linear mapping matrix, which globally maps the extracted source embeddings into the target embedding space. Unlike previous adversarial methods, which usually enforce an orthogonality constraint on the mapping function, SA-GAN removes this constraint during adversarial training, since the isomorphic assumption may not hold in practice. The learned mapping matrix is then used to construct a seed dictionary. To further refine these coarse-grained structures and enhance the seed dictionary, SA-GAN introduces a pairwise local mapping (PLM) algorithm, which learns word-specific transformations for different words based on their nearest neighbors within the seed dictionary. By doing so, our method reduces reliance on isometry and achieves improved BLI performance in a fully unsupervised manner. To verify the effectiveness of SA-GAN, we conduct extensive experiments on sixteen language pairs, comprising both etymologically distant and close languages, to thoroughly test our model under varying degrees of isomorphism between monolingual spaces. Experimental results show that our model achieves performance comparable to state-of-the-art unsupervised methods in most cases and even surpasses previous supervised ones. Our main contributions can be summarized as follows:
• We develop a novel adversarial framework, SA-GAN, that explores both intra-space and inter-space topological information for unsupervised BLI. It integrates two GCNs and a GAN to learn a linear mapping function through adversarial training without imposing an orthogonality constraint, providing greater flexibility in aligning languages for which the isomorphic assumption may not hold.
• We propose a pairwise local mapping (PLM) algorithm, which enables the learning of word-specific transformations. PLM utilizes topological information from the nearest neighbors in the seed dictionary to refine the alignments and alleviate the reliance on isometry.
• We conduct extensive experiments on popular benchmarks, and the results demonstrate that our model outperforms existing unsupervised methods and even surpasses state-of-the-art supervised ones.

Methodology
In this paper, we denote the source and target language word embeddings as X ∈ R^{d×n} and Y ∈ R^{d×m}, where n and m are the numbers of words in X and Y, respectively, and d is the embedding size. Our proposed SA-GAN method consists of three major components: structure extraction, adversarial training, and pairwise local mapping, as shown in Figure 1. Each module plays its own role while targeting a different goal; by splitting the pipeline this way, each module can focus on its own task, improving overall performance while reducing complexity. Specifically, given two monolingual embedding matrices X and Y, we first capture the topological information of each language via two lightweight GCN modules. After that, a global mapping matrix is learned via adversarial training, which transforms the source word embeddings into the target embedding space. Finally, SA-GAN applies a novel PLM algorithm to learn word-specific transformations, which alleviates the reliance on isometry. We next introduce the model formally.

Structure Extraction
Recently, graph neural networks (GNNs) have been widely utilized in various fields due to their powerful ability to extract spatial information from graphs. Inspired by this, we propose to incorporate a GNN module prior to adversarial training to exploit the topological correlations in the embedding spaces, viewing the entire embedding space as a graph. In this graph, each word is represented as a node, and edges connect it to its k-nearest neighbors.
The graph can be denoted as G = (V, E, A), where V = {v_1, ..., v_n} is the set of n nodes and n is the total number of vocabulary words in one language; E = {e_{i,j}}_{i,j=1}^{n} is the set of edges, where each edge e_{i,j} is associated with a weight A_{i,j} in the adjacency matrix A that describes the similarity between the words v_i and v_j in the graph. We measure this similarity with the cosine of the corresponding embeddings:

A_{i,j} = cos(x_i, x_j) = (x_i^T x_j) / (‖x_i‖ ‖x_j‖),

where x_i and x_j are the word embeddings of nodes v_i and v_j, respectively. The basic idea of a GNN is to learn node representations in a graph by incorporating information from neighboring nodes through iterative aggregation and transformation. During aggregation, the features of neighboring nodes are combined into a single representation for each node. During transformation, the combined representations are refined by neural network layers to capture more complex topological relationships. A well-known traditional GNN is the Graph Convolutional Network (GCN) (Kipf and Welling, 2017), which applies convolutional layers on the graph structure to perform neighborhood aggregation and transformation as follows:

X^{(l)} = σ(Â X^{(l−1)} W^{(l)}),  Â = D^{−1/2} (A + I) D^{−1/2},   (2)

where X^{(l)} and X^{(l−1)} denote the node representations after l and (l − 1) propagation layers, with X^{(0)} = X; Â is the normalized and regularized adjacency matrix; I is an identity matrix added to A to include self-connections; D is the diagonal node-degree matrix; W^{(l)} is the feature transformation matrix at the l-th layer; and σ(·) is an activation function. However, GCNs are typically trained with full gradient descent, which suffers from high computational complexity on large-scale datasets. This makes it difficult to combine with the subsequent adversarial training, where mini-batch stochastic gradient descent (SGD) is used for each update. Mini-batch SGD variants of GCN have been proposed to alleviate this problem (Hamilton et al., 2017), but their overheads remain large. Motivated by He et al.
(2020), we propose a simplified GCN that removes the activation function σ(·) and the feature transformation matrices {W^{(l)}}_{l=1}^{L}:

X^{(l)} = Â X^{(l−1)}.

Furthermore, to reduce computation time, we construct a K_g-nearest-neighbor graph that preserves only the edges to the top K_g nearest neighbors of each node, keeping the adjacency matrix A sparse. Lastly, we combine the embeddings obtained at each layer to produce the final embedding matrix:

X̂ = Σ_{l=0}^{L} α_l X^{(l)},

where {α_l}_{l=0}^{L} are tradeoff coefficients. It is worth noting that this GCN module has no trainable parameters. In other words, rather than training the propagation process at each iteration, the final embedding matrix only needs to be precomputed once and can be stored as a constant, which greatly decreases the computational cost and memory requirements.
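The parameter-free propagation above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' released code: `knn_adjacency` and `structure_embeddings` are hypothetical helper names, edge weights are cosine similarities (clipped at zero for numerical safety), and uniform tradeoff coefficients α_l are assumed unless others are supplied.

```python
import numpy as np

def knn_adjacency(X, k):
    """Symmetric K_g-nearest-neighbor adjacency with cosine-similarity weights.

    X: (n, d) matrix of word embeddings. Negative similarities are clipped
    to zero so node degrees stay positive. Dense here for clarity; in
    practice A would be kept sparse.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)                # self-loops are added later via I
    A = np.zeros_like(S)
    idx = np.argsort(-S, axis=1)[:, :k]         # top-k neighbors per node
    rows = np.repeat(np.arange(S.shape[0]), k)
    A[rows, idx.ravel()] = np.maximum(S[rows, idx.ravel()], 0.0)
    return np.maximum(A, A.T)                   # symmetrize

def structure_embeddings(X, k=3, L=2, alphas=None):
    """Parameter-free GCN: X_hat = sum_l alpha_l * A_hat^l X (no weights, no sigma)."""
    n = X.shape[0]
    A = knn_adjacency(X, k) + np.eye(n)         # A + I: include self-connections
    deg = A.sum(axis=1)
    A_hat = A / np.sqrt(deg)[:, None] / np.sqrt(deg)[None, :]  # D^-1/2 (A+I) D^-1/2
    if alphas is None:
        alphas = np.full(L + 1, 1.0 / (L + 1))  # uniform tradeoff coefficients
    out, H = alphas[0] * X, X
    for l in range(1, L + 1):
        H = A_hat @ H                           # one propagation layer
        out = out + alphas[l] * H
    return out
```

Because nothing here is trained, the result can be precomputed once per language and cached, exactly as the text describes.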
Two GCN modules are applied to the source language embeddings X and the target language embeddings Y, respectively, to form the new representations X̂ with n nodes and Ŷ with m nodes, which contain the topological structure information of the source and target embedding spaces.

Adversarial Training
With the extracted embedding representations, our goal is to match them to induce a seed dictionary. Recent studies have demonstrated the effectiveness of adversarial training in aligning two distributions (Lample et al., 2018; Xiong and Tao, 2021). Building upon this concept, we employ adversarial training through a GAN to learn a mapping function in a fully unsupervised manner. Specifically, we train a generator G to learn a linear mapping matrix W that deceives a discriminator D. The generator maps word embeddings from the source language to the target language through G(x̂_i) = W x̂_i, and can be trained with the following loss:

L_G = −(1/n) Σ_{i=1}^{n} log P_D(target | W x̂_i).   (7)

The discriminator D is trained to distinguish between the mapped source embeddings W X̂ = {W x̂_1, ..., W x̂_n} and the target embeddings Ŷ = {ŷ_1, ..., ŷ_m} using the cross-entropy loss:

L_D = −(1/n) Σ_{i=1}^{n} log P_D(source | W x̂_i) − (1/m) Σ_{j=1}^{m} log P_D(target | ŷ_j).   (8)

At each iteration, we alternately optimize the generator loss (Equation (7)) and the discriminator loss (Equation (8)) with stochastic gradient updates. Through adversarial training, we obtain an initial solution for W. Following other GAN-based methods (Lample et al., 2018; Bai et al., 2019; Xiong and Tao, 2021), we further refine the learned mapping W via the self-learning strategy of Artetxe et al. (2018b), iteratively solving the Procrustes problem and applying a dictionary induction step; in our self-learning, we run five iterations of this process.
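The Procrustes step used in the self-learning refinement has a simple closed form. The sketch below (a hypothetical `procrustes` helper, not the paper's code) solves min_W ‖W X − Y‖_F over orthogonal W for the current seed pairs with one SVD:

```python
import numpy as np

def procrustes(X_seed, Y_seed):
    """Orthogonal Procrustes solution W = U V^T, where U S V^T = SVD(Y X^T).

    X_seed, Y_seed: (d, k) matrices whose columns are the embeddings of the
    k seed translation pairs. Returns the orthogonal (d, d) mapping W that
    minimizes ||W X_seed - Y_seed||_F.
    """
    U, _, Vt = np.linalg.svd(Y_seed @ X_seed.T)
    return U @ Vt
```

Alternating this solve with a dictionary induction step gives the five self-learning iterations described above.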
Although the embedding representations X̂ and Ŷ contain structure information extracted by the GCNs, they also introduce a challenge known as over-smoothing (Li et al., 2018): words become indistinguishable from each other, especially those lying in dense areas, leading to poorer performance when inducing the bilingual lexicon. To address this concern, we use X̂ and Ŷ only to find the initial solution W. Subsequently, we discard X̂ and Ŷ, and the remaining processes, including self-learning and the PLM algorithm (Section 2.3), are executed on the original embeddings X and Y. This decision mitigates the over-smoothing problem and ensures that subsequent steps operate on unaltered embeddings, thereby improving the performance of bilingual lexicon induction.

Pairwise Local Mapping Algorithm
With structure extraction and adversarial training, we capture valuable structural information and learn a mapping function that is shared globally by all words under the isomorphic assumption. However, several studies (Ruder et al., 2019; Patra et al., 2019) have found that this assumption does not strictly hold, which can lead to poor BLI performance, particularly for language pairs that deviate significantly from isometry. In this situation, a globally shared mapping function may not be optimal. To further refine the alignments, we introduce a novel PLM algorithm that recomputes and upgrades the embedding representations of individual words based on the learned seed dictionary, improving BLI performance.
Our PLM algorithm consists of two steps: generating a seed dictionary D(Z_D, Y_D), and then utilizing the word pairs in this synthetic dictionary to perform a local mapping for each word. Firstly, we induce a seed dictionary by using the learned mapping matrix W to map the source word embeddings into the target embedding space:

Z = W X,

where Z holds the mapped source word representations.
With Z and Y, we retrieve translation pairs and build the seed dictionary D(Z_D, Y_D) according to the cross-domain similarity local scaling (CSLS) measure (Lample et al., 2018). Specifically, given a mapped source word z, we treat the word in the target embedding space with the highest CSLS score as its translation:

CSLS(z, y) = 2 cos(z, y) − r_T(z) − r_S(y),

where r_T(z) is the average cosine similarity between z and its k-nearest neighbors in Y, and r_S(y) is the average cosine similarity between y and its k-nearest neighbors in Z. To refine the quality of the dictionary, we filter out word pairs that are not among the K_m most frequent words in each language, as such pairs are usually of low quality, and we induce word pairs from both directions in the seed dictionary D(Z_D, Y_D).
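As a concrete illustration, CSLS retrieval can be written directly from the formula above. This is a small dense NumPy sketch (`csls_translate` is a hypothetical name; real implementations batch the similarity matrix rather than materializing it at once):

```python
import numpy as np

def csls_translate(Z, Y, k=10):
    """For each mapped source word z, return the index of the target word y
    maximizing CSLS(z, y) = 2*cos(z, y) - r_T(z) - r_S(y)."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = Zn @ Yn.T                                   # (n, m) cosine similarities
    kt, ks = min(k, cos.shape[1]), min(k, cos.shape[0])
    r_T = np.sort(cos, axis=1)[:, -kt:].mean(axis=1)  # mean sim of z to its k-NN in Y
    r_S = np.sort(cos, axis=0)[-ks:, :].mean(axis=0)  # mean sim of y to its k-NN in Z
    return (2 * cos - r_T[:, None] - r_S[None, :]).argmax(axis=1)
```

The penalty terms r_T and r_S discount words that are near everything ("hubs"), which is why CSLS is preferred over plain nearest-neighbor retrieval for dictionary induction.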
Secondly, we use the word pairs in this synthetic dictionary to improve the mapped embeddings and obtain a pairwise local mapping for each word. Given a mapped source word z_i, we first obtain its top-K_a nearest neighbor words z^D_1, ..., z^D_{K_a} from Z_D as anchors, denoted as N_i, with an importance coefficient for each anchor point:

c_{i,j} = cos(z_i, z^D_j),

which indicates the importance of anchor word z^D_j in N_i to the given source word z_i. The closer an anchor is to z_i, the larger its importance coefficient. However, since the cosine similarity here lies between 0 and 1, we observe that even anchors too far away to give useful guidance still receive a high coefficient, e.g., 0.4, which introduces potential noise into the pairwise mapping. To avoid this influence, we scale the importance coefficients using the softmax function with temperature τ:

c̃_{i,j} = exp(c_{i,j} / τ) / Σ_{j'∈N_i} exp(c_{i,j'} / τ),   (12)

which further increases the influence of the nearest anchors and decreases that of distant ones. We then compute the new embedding representation of z_i under the guidance of the generated dictionary:

z'_i = (1 − p) z_i + p Σ_{j∈N_i} c̃_{i,j} y^D_j,

where p is the rate for updating the word embeddings.
The above steps can be applied iteratively in both directions; at each iteration, we regenerate the dictionary D with the embedding representations updated in the previous iteration, further improving the quality of the synthetic dictionary.
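A single PLM update for one word can be sketched as follows. Since the update equations are only partially reproduced in the text, treat this as an assumption-laden sketch: `plm_update` is a hypothetical name, the importance coefficients are cosine similarities passed through a temperature-τ softmax, and the new embedding interpolates toward the anchors' target-side vectors at rate p.

```python
import numpy as np

def plm_update(z_i, anchors_z, anchors_y, tau=0.1, p=0.02):
    """One pairwise-local-mapping step for a mapped source word z_i.

    anchors_z, anchors_y: (Ka, d) source/target sides of the K_a seed pairs
    nearest to z_i. Returns z_i nudged toward an importance-weighted average
    of the anchors' target embeddings.
    """
    sims = anchors_z @ z_i / (
        np.linalg.norm(anchors_z, axis=1) * np.linalg.norm(z_i) + 1e-12)
    w = np.exp((sims - sims.max()) / tau)   # softmax(sims / tau), shifted for stability
    w /= w.sum()
    return (1 - p) * z_i + p * (w @ anchors_y)
```

With a small τ (0.1 in the experiments), distant anchors receive near-zero weight, which is exactly the noise-suppression effect the temperature is introduced for.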

Training Paradigm
In summary, the proposed approach first extracts structure information using GCNs and learns a global mapping function in an adversarial manner to map the embeddings of the two languages into the same space. To alleviate the reliance on isometry, we further apply the PLM algorithm to learn pairwise mapping functions for different words based on the learned seed dictionary. The whole training process is unsupervised and is described in Algorithm 1.
For each baseline model, we report the results from the original papers and conduct experiments with the publicly available code where necessary.
Implementation details Following previous work, the vocabulary of each language is trimmed to the 200k most frequent words for evaluation, and the same cutoff is used for graph generation in Section 2.1. The adversarial model feeds the 75k most frequent words of each language to the discriminator.
The original word embeddings are normalized following Artetxe et al. (2018b): length normalization, center normalization, and length normalization again, ensuring that the word embeddings have unit length. The generator G is a single linear layer. The discriminator is a multilayer perceptron with two hidden layers of size 2048 and Leaky-ReLU activations. We train our models using stochastic gradient descent (SGD) with a batch size of 32 and a learning rate of 0.1. A smoothing coefficient s = 0.1 is applied to the discriminator predictions. We train the discriminator more frequently (5 times per generator update). For the PLM algorithm, the temperature τ is set to 0.1, the updating rate p to 0.02, the vocabulary frequency cutoff K_m to 20,000 for the synthetic dictionary, and the number of anchor neighbor words K_a to 150; the number of iterations is 10.
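The three-step normalization above (length-normalize, center, length-normalize again) is easy to get subtly wrong in order. A minimal NumPy sketch, with a hypothetical `normalize_embeddings` name:

```python
import numpy as np

def normalize_embeddings(X):
    """Artetxe et al. (2018b)-style normalization: unit length, then mean
    centering, then unit length again, so every row ends with norm 1.

    X: (n, d) word embedding matrix; returns a new normalized matrix.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)     # length normalization
    X = X - X.mean(axis=0, keepdims=True)                # center normalization
    return X / np.linalg.norm(X, axis=1, keepdims=True)  # length again
```

The final length normalization matters: centering alone would leave vectors with unequal norms, breaking the unit-length property the cosine-based retrieval relies on.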

Experimental Results
We report the BLI performance on four etymologically close language pairs (en-es, en-fr, en-it, and en-de) and four etymologically distant pairs (en-ru, en-da, en-hu, en-zh) from the MUSE dataset.
The results are presented in Table 1. For our approach, we map the embedding representations of the source language (English) into the target embedding space (the other language) and evaluate our model in both directions on the corresponding test sets. Note that all results reported in this paper are averaged over 5 runs. 'NA' indicates that the authors did not report the number or their code is not publicly available, and '*' indicates that the method fails to converge.
Table 1 also shows the Gromov-Hausdorff (GH) distance of the selected language pairs. From these measurements, we can see that etymologically close language pairs have lower GH distances than etymologically distant ones. We compare SA-GAN with both existing unsupervised and semi-supervised/supervised approaches. From Table 1, one can clearly see that our proposed method significantly outperforms previous unsupervised methods on most language pairs, and obtains comparable performance on the rest. Compared with state-of-the-art unsupervised methods, SA-GAN performs better on 14 of 16 language pairs, especially on en-it and en-de, with absolute improvements of 2% to 2.3%, and on etymologically distant language pairs such as en-hu and en-da, with absolute improvements of 4.1% to 4.6% over the best baseline. Furthermore, compared with supervised methods, SA-GAN still achieves competitive results and even outperforms existing state-of-the-art supervised methods on some language pairs: our approach reaches 81.4% on en-it, compared to 80.2% for the best supervised method, and on en-hu SA-GAN obtains 60.5%, which is 3% better than the supervised method. Such performance gains demonstrate the superiority of SA-GAN. From Table 1, we also find that the unsupervised pairwise local mapping (PLM) contributes to bilingual lexicon induction, with an average gain of 0.7% on etymologically close language pairs and 1.2% on distant ones, which is remarkable.
From the results, we note that SA-GAN achieves larger improvements on etymologically distant languages, where other unsupervised baselines perform poorly or even fail to converge. This is reasonable, as we capture much richer semantics by extracting the structure of the embedding space with the GNN module, which helps learn a better mapping function than other methods. Moreover, since the distributions of different languages deviate and the isomorphic assumption may not hold strictly (Patra et al., 2019; Søgaard et al., 2018), a globally shared mapping is not the optimal solution (Tian et al., 2022). In this situation, the unsupervised PLM algorithm is applied to every word to obtain personalized mappings, which further improves performance.

Results of Morphologically Rich Languages
To better explore our model's robustness, we further evaluate our method on "difficult" morphologically rich languages, for which unsupervised bilingual dictionary induction performs much worse (Søgaard et al., 2018). Following Oprea et al. (2022), we evaluate English (En) from/to 3 morphologically rich languages, Finnish (Fi), Hebrew (He), and Romanian (Ro), a mixture of isolating and exclusively concatenating languages from a morphological point of view (Søgaard et al., 2018); the results are shown in Table 2. From these measurements, we can see that our approach outperforms existing methods on 5 of 6 tasks on morphologically rich language pairs, with a gain of up to 2.8% on en-fi and 0.8% on average over all languages, which further shows the robustness and effectiveness of our framework.

Ablation Study
To further analyze our approach, we perform ablation studies and measure the contribution of each novel component proposed in this work. We conduct extensive ablations on 8 translation tasks from 4 MUSE language pairs, consisting of 2 etymologically close and 2 etymologically distant pairs.

Structure extraction and adversarial training
Here we study the impact of the structure extraction (GNN) module and orthogonality constraint.
To avoid the influence of PLM, these ablation studies are conducted without the PLM module, as shown in Table 3. One can see that model performance consistently drops for all language pairs, and training even fails to converge for distant language pairs, if we remove the GNN module. After enforcing an orthogonality constraint, performance drops (e.g., en-ru) or training fails to converge (e.g., en-zh) on the distant language pairs that are far from isometry. We draw the following conclusions: 1) the GNN module captures much richer semantics by extracting the structure information of the embedding space, which contributes to learning a better mapping function and stabilizes BLI performance; 2) a strict orthogonality constraint limits the performance on language pairs that are etymologically distant and far from isometry.
Pairwise local mapping Here we study the importance of the designed PLM algorithm and the influence of its components: the updating rate p, coefficient scaling, the dictionary frequency cutoff, and bidirectional forwarding. The results are presented in Table 4; the baseline in the table is a variant of our approach without PLM. From the table, we find that performance declines on all tasks after removing PLM, revealing the importance of personalized local mappings. As for the individual components, we observe that coefficient scaling is necessary to avoid the potential noise introduced by anchor words. The dictionary frequency cutoff also has a positive influence, with a 1.2% gain on en-it and a 1.3% gain on en-ru. The updating rate likewise plays a critical role: without it (p = 1), performance declines sharply due to overly drastic embedding updates. Bidirectional forwarding is also beneficial, as it maps the source and target languages together into a latent space rather than fixing one of them. In summary, every component of PLM is indispensable for achieving better performance.

Parameter Sensitivity Analysis
We further analyze the sensitivity of PLM to two core hyper-parameters: (1) the vocabulary frequency cutoff K_m for the synthetic dictionary, and (2) the scaling temperature τ in Formula 12. The analysis is conducted on the en→it language pair from the MUSE dataset.
Frequency-based vocabulary cutoff The hyper-parameter K_m denotes the number of most frequent words in each language considered when inducing the synthetic dictionary. As shown in Figure 2(a), when K_m is too small, the synthetic dictionary does not contain enough information to guide the local mapping; when K_m is too large, much noise is introduced, which reduces the quality of the dictionary and degrades performance.
Temperature of scaling Figure 2(b) illustrates how performance varies with the scaling temperature τ. We find that a small τ increases the influence of the nearest anchors in the dictionary and decreases that of distant ones, scaling the importance coefficients to provide useful guidance while reducing the potential noise in the dictionary.

Conclusion
In this paper, we proposed a novel unsupervised framework, SA-GAN, for bilingual lexicon induction. Different from previous works that generally treat words in the embedding space as isolated entities, SA-GAN considers each embedding space as a graph and utilizes a GCN module to learn the topological information between words. Additionally, SA-GAN employs a GAN to learn a linear mapping matrix without imposing an orthogonality constraint, thereby transforming both languages into the same embedding space. To further improve performance, especially for language pairs where the isomorphic assumption may not hold exactly, we propose a pairwise local mapping algorithm that learns word-specific transformations instead of applying only a shared global mapping to all words. Extensive experiments on the MUSE dataset demonstrate the superior performance of our model: SA-GAN outperforms existing unsupervised alternatives and even surpasses state-of-the-art supervised methods, especially for etymologically distant language pairs.

Limitations
Although our approach can achieve impressive performance, there are still some limitations to be resolved in the future.
• SA-GAN requires tuning more hyperparameters compared to previous methods, which is time-consuming.
• SA-GAN matches the source and target languages by mapping the source embeddings into the target embedding space, rather than mapping both into a common latent space. Since performance then depends on the target word embedding space, the mapping function might be sub-optimal.
• Additionally, SA-GAN focuses on aligning single-word embeddings, making it unsuitable for direct application to the alignment of multi-word expressions that encompass intricate semantic concepts.

A Related Work
The basic idea of bilingual lexicon induction (BLI) is to learn cross-lingual mappings that transform the word embeddings of different languages into the same embedding space, and then to induce bilingual lexicons from the learned cross-lingual embeddings. Based on the availability of a seed dictionary, we divide related work into two categories: supervised/semi-supervised methods and unsupervised methods.
A.1 Supervised/Semi-supervised Methods
Mikolov et al. (2013) first observed that the word embedding space of one language can be transformed into that of another using a linear mapping, based on the isomorphic assumption that monolingual word embeddings exhibit similar geometric properties across languages. Artetxe et al. (2018a) propose a multi-step framework that generalizes a substantial body of previous work; its core steps include normalization, whitening, orthogonal mapping, reweighting, de-whitening, and dimensionality reduction. Joulin et al. (2018) use a supervised method, RCSLS, which optimizes the CSLS distance in an end-to-end manner over the supervised matching pairs. Jawanpuria et al. (2019) propose to map both the source and target word embeddings into a common latent space via two orthogonal transformations.
The methods above learn globally shared linear transformations based on the isomorphic assumption. However, several researchers have found that this assumption may not always hold, especially for distant languages (Søgaard et al., 2018). Patra et al. (2019) observe that language pairs with a high Gromov-Hausdorff (GH) distance cannot be aligned well using an orthogonal transformation, and propose a semi-supervised framework that relaxes the isomorphic assumption by jointly optimizing a weak orthogonality constraint in the form of a back-translation loss. Mohiuddin et al. (2020) design a semi-supervised model that uses a non-linear mapping in latent space to learn cross-lingual word embeddings, which is likewise independent of the isomorphic assumption. Glavaš and Vulić (2020) propose supervised word-specific transformations applied after learning a single global rotation matrix, so that the final mapping function is globally non-linear and performs well on distant language pairs. The PLM algorithm in this paper is inspired by this line of work but differs in that we propose a different transformation framework that can be applied in the unsupervised setting without any labeled data. Sachidananda et al. (2021) align embeddings to isomorphic vector spaces using pairwise inner products. Li et al. (2022) improve word translation via two-stage contrastive learning. Tian et al. (2022) propose a ranking-based bilingual lexicon induction model that provides sufficient discriminative capacity to rank candidates.
Nevertheless, all these methods still require supervised signals and cannot be applied to the unsupervised learning setting without any labeled dictionary.

A.2 Unsupervised Methods
Recently, fully unsupervised methods have been proposed to induce a bilingual dictionary by aligning monolingual word embedding spaces. A typical line of research is based on adversarial training. Miceli Barone (2016) proposes an adversarial autoencoder framework to map the source language word embeddings to the target language, where an encoder aims to make the transformed embeddings not only indistinguishable to the discriminator but also recoverable after a reversed mapping by the decoder; although promising, the reported performance is not satisfying. Lample et al. (2018) are the first to show very impressive results for unsupervised word translation: a rough rotation matrix is first learned with an adversarial framework and further refined with a self-learning process. Building on this work, Chen and Cardie (2018) propose an adversarial training framework for the multilingual setting, which considers not just one pair of languages at a time but explicitly exploits the relations between all language pairs. Mohiuddin and Joty (2019) revisit the adversarial autoencoder for unsupervised word translation and include cycle-consistency and input-reconstruction constraints to guide the mapping. Xiong and Tao (2021) propose an unsupervised approach via bidirectional feature mappings based on CycleGAN and hybrid training. In contrast to frameworks that focus on direct or bidirectional mappings between the source and target languages, Bai et al. (2019) train two autoencoders jointly to transform the source and target monolingual word embeddings into a shared embedding space, capturing the cross-lingual features of the word embeddings. Li et al. (2021) observe that low-frequency words tend to be densely clustered in the embedding space; to overcome this issue, they introduce a noise function to disperse dense word embeddings and a Wasserstein critic network to preserve the semantics of the source word embeddings.
On the other hand, non-adversarial approaches have also been proposed for unsupervised cross-lingual word alignment. Hoshen and Wolf (2018) use the principal components of monolingual word embeddings to build an initial alignment and then iteratively refine it using a variant of the Iterative Closest Point (ICP) method from computer vision. Artetxe et al. (2018b) exploit the similarity structure of the embeddings to learn an initial dictionary in an unsupervised way and improve it with a robust self-learning approach. Alvarez-Melis and Jaakkola (2018) cast the problem as an optimal transport problem and measure the similarity between pairs of words across languages using the Gromov-Wasserstein distance. Cao and Zhao (2018) propose to use the Coherent Point Drift (CPD) algorithm to map the whole source embedding space to the target embedding space. Inspired by Cao and Zhao (2018), Oprea et al. (2022) employ the CPD algorithm to perform an iterative two-step refinement of the initial global mapping trained by CycleGAN. However, both of these methods focus on a global mapping under the isomorphic assumption. Ren et al. (2020) leverage the Bron-Kerbosch (BK) algorithm to extract clique-level information, which is not only semantically richer than what a single word provides but also reduces the adverse effect of noise in the pre-trained embeddings.
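The self-learning loop used by several of these non-adversarial methods alternates between fitting an orthogonal map on the current dictionary and re-inducing the dictionary by nearest-neighbour search. A minimal sketch in the spirit of Artetxe et al. (2018b), with the robustness tricks (stochastic dictionary induction, frequency cutoffs, CSLS) omitted for brevity:

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing ||X W - Y||_F, via the SVD solution."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def self_learning(X, Y, seed_pairs, iters=5):
    """Alternate Procrustes fitting and nearest-neighbour dictionary induction.

    X: (n, d) source embeddings, rows L2-normalized
    Y: (m, d) target embeddings, rows L2-normalized
    seed_pairs: list of (src_idx, tgt_idx) forming the initial dictionary
    Returns the final map W and the induced dictionary.
    """
    pairs = list(seed_pairs)
    for _ in range(iters):
        src = np.array([i for i, _ in pairs])
        tgt = np.array([j for _, j in pairs])
        # step 1: fit an orthogonal map on the current dictionary
        W = procrustes(X[src], Y[tgt])
        # step 2: re-induce the dictionary, mapping each source word
        # to its most similar target word under the current map
        sims = (X @ W) @ Y.T
        pairs = [(i, int(sims[i].argmax())) for i in range(len(X))]
    return W, pairs
```

The key property exploited here is that each step cannot decrease the alignment objective, so even a weak initial dictionary (e.g. from similarity distributions or an adversarially learned rotation) can be iteratively improved.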

B.1 Case Study
To better demonstrate the effectiveness of our model on bilingual lexicon induction, we give some examples of the dictionary inferred with our method, compared with those inferred by two adversarial methods (Mohiuddin and Joty, 2019; Bai et al., 2019), denoted as Adv-M and Adv-B respectively. We choose the language pair English-Danish as an example, as shown in Table 5. In the first example, all approaches find the correct translation. In the following four examples, our approach SA-GAN successfully induces the correct translations with similar meanings, while Adv-M and Adv-B fail to find all correct translations for the given queries; their induced words even have significantly different meanings from the gold translations. From these examples, we find that our method produces bilingual lexicons of higher quality. This is because our approach can effectively utilize the topological structure of the embedding spaces, and a pair-wise mapping is learned for every word to alleviate the reliance on isometry, which further improves BLI performance.

B.2 Downstream Tasks
To further test our model's robustness and effectiveness, we include more downstream tasks, i.e., the Semantic Word Similarity and Sentence Translation Retrieval tasks, as in Lample et al. (2018) and Oprea et al. (2022).
Semantic word similarity We evaluate the quality of cross-lingual embeddings on the task of Semantic Word Similarity, which measures how well the cosine similarity between words of different languages correlates with human-annotated word similarity scores. As shown in Table 6(a), our proposed SA-GAN obtains a better Pearson correlation with human-annotated scores on the en-de and de-en language pairs and achieves comparable performance on en-es and es-en, indicating that our model provides good alignment across languages.
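The evaluation itself is straightforward and can be sketched as follows; this is a generic illustration of the protocol, not our evaluation code, and the dictionary-style inputs are an assumed interface:

```python
import numpy as np

def word_similarity_eval(src_emb, tgt_emb, W, pairs, gold_scores):
    """Pearson correlation between model cosine similarities and
    human-annotated similarity scores for cross-lingual word pairs.

    src_emb, tgt_emb: dict word -> vector (assumed interface)
    W: learned source-to-target mapping matrix
    pairs: list of (src_word, tgt_word) from the benchmark
    gold_scores: the human ratings for those pairs
    """
    model_scores = []
    for s, t in pairs:
        u = src_emb[s] @ W          # map the source word into target space
        v = tgt_emb[t]
        cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        model_scores.append(cos)
    # Pearson correlation between model and human scores
    return np.corrcoef(model_scores, gold_scores)[0, 1]
```

A higher correlation indicates that distances in the shared space track human judgments of cross-lingual relatedness.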
Sentence translation retrieval This task moves from the word level to the sentence level and studies sentence translation retrieval. Following Lample et al. (2018), each sentence is represented as a bag of words, and the IDF-weighted average of the word embeddings of the sentence is used as its sentence embedding. The closest sentence from the target language is returned as the translation of a given source sentence. Table 6(b) shows sentence translation retrieval results on the Europarl corpus. On the en-fr language pair, our model obtains the best score, with up to a 3.5% improvement. Besides, our proposed method performs best on averaged accuracy, which shows that SA-GAN provides better performance on sentence translation retrieval tasks.
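The IDF-weighted retrieval procedure described above can be sketched as below. This is an illustrative implementation of the generic protocol (assuming all query words are in vocabulary and using brute-force cosine search), not the exact evaluation code of Lample et al. (2018):

```python
import numpy as np
from collections import Counter

def idf_weights(corpus):
    """IDF over a corpus given as a list of tokenized sentences."""
    n = len(corpus)
    df = Counter(w for sent in corpus for w in set(sent))
    return {w: np.log(n / df[w]) for w in df}

def sentence_embedding(sent, emb, idf):
    """IDF-weighted average of the word embeddings of a sentence."""
    vecs = [idf[w] * emb[w] for w in sent if w in emb and w in idf]
    return np.mean(vecs, axis=0)

def retrieve(src_sent, src_emb, W, tgt_corpus, tgt_emb, tgt_idf, src_idf):
    """Index of the target sentence closest (cosine) to the mapped query."""
    q = sentence_embedding(src_sent, src_emb, src_idf) @ W
    q /= np.linalg.norm(q)
    best, best_sim = -1, -np.inf
    for i, sent in enumerate(tgt_corpus):
        v = sentence_embedding(sent, tgt_emb, tgt_idf)
        v /= np.linalg.norm(v)
        sim = q @ v
        if sim > best_sim:
            best, best_sim = i, sim
    return best
```

Retrieval accuracy is then the fraction of source sentences whose returned index matches the reference translation.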

Figure 1 :
Figure 1: An overview of our proposed SA-GAN framework.
Dataset To demonstrate the effectiveness of our SA-GAN model, we leverage the widely used MUSE dataset.

Algorithm 1: Training procedure of the model
Data: Normalized monolingual word embeddings X for the source language and Y for the target language
1 Build the adjacency matrix A according to Eq. 1;
2 Extract structural information following Eq. 4 and obtain the new embedding representations X̂ and Ŷ;
…
17 Build a synthetic dictionary between Z and Y;
18 Calculate a new embedding representation for each word in Z according to Eq. 13;
19 Build a synthetic dictionary between Z and Y;
20 Calculate a new embedding representation for each word in Y according to Eq. 13;

Table 1 :
Word translation accuracy (Precision@1) on the MUSE dataset. For each metric, underline marks the highest accuracy among all approaches; bold marks the best performance across all unsupervised methods; 'NA' indicates that the authors did not report the number or their code is not available; '*' indicates that the method fails to converge.

Table 2 :
Word translation accuracy (Precision@1) of morphologically rich languages on the MUSE dataset. Bold marks the best performance across all methods.

Table 3 :
Ablation study on adversarial training.

Table 5 :
Word translation examples for English-Danish.