Learning a Reversible Embedding Mapping using Bi-Directional Manifold Alignment

We propose Bi-Directional Manifold Alignment (BDMA), a method that learns a non-linear mapping between two manifolds by explicitly training it to be bijective. We demonstrate BDMA by training a single model for a pair of languages rather than separate, directed source-target combinations, reducing the number of models by 50%. We show that models trained with BDMA in the "forward" (source to target) direction can successfully map words in the "reverse" (target to source) direction, yielding performance equivalent (or better) to standard unidirectional translation models where the source and target languages are flipped. We also show how BDMA reduces the overall size of the model.


Introduction
Learning continuous vector representations of words is an expensive exercise, as it requires a large quantity of free text to train stable representations (Sahin et al., 2017). Learning word embeddings for English is relatively easy, since a model can make use of free text online from sources like Wikipedia, but it is challenging to learn embeddings for languages where free text is limited (low-resource languages). Resource-constrained languages suffer from the dual problems of reduced embedding quality and small vocabularies. Cross-lingual word embedding (CLWE) models alleviate this problem, but they are often linear mapping functions that align the source and target language manifolds, since non-linear mapping functions such as neural networks are unidirectional and known to perform poorly compared to their linear counterparts (Ruder et al., 2019).
In this paper, we propose Bi-Directional Manifold Alignment (BDMA), which learns a reversible, non-linear mapping function between two manifolds.* Inspired by CycleGAN (Zhu et al., 2017), we use a cycle consistency loss to optimize BDMA. We study BDMA in the context of cross-lingual lexicon induction and show that it offers solutions to two known problems: (1) non-linear models are known to perform poorly in comparison to their linear counterparts (Ruder et al., 2019), and (2) most approaches perform unidirectional mapping only (from a source to a target language), leading to an ever-increasing set of translation models. We show that BDMA is a generic training method that can use different distance metrics (or losses), such as MSE, cosine, or RCSLS (Joulin et al., 2018), while training models cyclically.

* This research was completed prior to joining Amazon.

Figure 1: Mapping vector spaces with Bi-Directional Manifold Alignment (BDMA). $f$ is the feedforward network; $f_a$ and $f_b$ represent the forward and backward directions of flow through the network. In a shared BDMA network, the blue components represent fully connected layers, orange components are activation layers during forward flow, and purple components are activation layers during reverse flow. During reverse flow from output to input, the weight matrix is the transpose of the weights used during forward flow.

Bi-Directional Manifold Alignment
Consider two manifolds $M_s \in \mathbb{R}^{n \times d}$ (source domain) and $M_t \in \mathbb{R}^{m \times d}$ (target domain) that are vector space representations of words. The monolingual word embeddings are pretrained from a large corpus and may be created using different methods. Let $V_s$ and $V_t$ be the respective vocabularies of the two languages: $V_s = \{w^s_1, \dots, w^s_n\}$ and $V_t = \{w^t_1, \dots, w^t_m\}$ are the words in each vocabulary, of sizes $n$ and $m$. The distributed representations of words in each manifold are $M_s = \{m^s_1, \dots, m^s_n\}$ and $M_t = \{m^t_1, \dots, m^t_m\}$. We assume there is $V_p = \{w^p_1, \dots, w^p_c\}$, an available dictionary or parallel corpus of words for the given source/target pair.

Bi-Directional Loss Mechanism
We achieve bi-directional alignment by learning a mapping function that is optimized with a cycle-consistency loss (CCL). In Figure 1, the mapping function $f_a : M_s \to M_t$ aligns the manifold $M_s$ to $M_t$. We also use a backward mapping function $f_b : M_t \to M_s$ to align the manifold $M_t$ to $M_s$. We refer to the parameters of both $f_a$ and $f_b$ jointly as $\theta_f$.
Our method is based on jointly minimizing the distance $D$ between pairs of embeddings and their mapped counterparts from each manifold. We define our cycle consistency loss for a single training sample $(m^s_i, m^t_i)$ based on this distance function $D$ as

$$\mathcal{L}_{ccl}(m^s_i, m^t_i) = D(f_a(m^s_i), m^t_i) + D(f_b(m^t_i), m^s_i) \quad (1)$$

Following previous work (Xing et al., 2015), we include an orthogonal loss in the objective; we extend this loss function to a neural network by applying the orthogonal loss layerwise. For our full objective, we sum over all training instances and minimize over $\theta_f$:

$$\min_{\theta_f} \sum_i \mathcal{L}_{ccl}(m^s_i, m^t_i) + \beta \sum_{w_j \in \theta_f} \left\| w_j w_j^T - I \right\| \quad (2)$$

where $w_j$ are the weights of layer $j$ in the network. While Euclidean distance (mean squared error: $D = \mathrm{MSE}$) is a common way of computing distance in a manifold (Ruder et al., 2019; Artetxe et al., 2016), the cosine and relaxed cross-domain similarity local scaling (RCSLS) (Joulin et al., 2018) distance functions have been shown to be effective for word and embedding alignment tasks. Our formulation works with these other computable distance functions. For example, applying $D = \mathrm{MSE}$ and, for ease, omitting the orthogonal loss term $\sum_{w_j \in \theta_f} \| w_j w_j^T - I \|$, the loss is

$$\mathcal{L} = \sum_i \left\| f_a(m^s_i) - m^t_i \right\|^2 + \left\| f_b(m^t_i) - m^s_i \right\|^2 \quad (3)$$

See Appendix D for similar formulations for $D = \mathrm{cosine}$, $D = \mathrm{RCSLS}$, and a combined distance function $D = \mathrm{cosine} + \mathrm{RCSLS}$ (used in §3).
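As a concrete illustration, the per-sample loss with D = MSE can be sketched as follows (a minimal NumPy sketch; `f_a` and `f_b` here stand for any forward/backward mapping, and the function name is our own):

```python
import numpy as np

def ccl_mse(f_a, f_b, m_s, m_t):
    """Cycle-consistency loss for one embedding pair with D = MSE:
    the forward term D(f_a(m_s), m_t) plus the backward term
    D(f_b(m_t), m_s). f_a / f_b are the forward / backward mappings."""
    forward = np.mean((f_a(m_s) - m_t) ** 2)
    backward = np.mean((f_b(m_t) - m_s) ** 2)
    return forward + backward
```

With a perfect bidirectional mapping both terms vanish, which is the behavior the objective pushes toward.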

Forward-Reverse Network Flow
As described in §2.1, $f_a$ and $f_b$ represent the forward and reverse network flows. We represent the forward and reverse mappings with two networks that have either shared or independent parameters. When the parameters are independent, two separate networks are trained simultaneously and optimized jointly to learn the mapping between the two languages. In Figure 1, the network parameters are shared: the forward flow is shown in orange, while the reverse flow is depicted in purple. Although the two networks share parameters, they cannot do so directly, as the required shapes of each layer differ. To perform backward translation, reverse flow is enabled in the network by explicitly taking the transpose of each layer (we use fully connected layers without bias vectors), making the network bi-directional, or invertible. With our cycle consistency loss formulation, the model learns layers such that the transpose of each layer inverts the network.
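The shared-parameter variant can be sketched as follows (a minimal NumPy sketch under assumed shapes; `d` is the embedding dimension and `h` the hidden size, and the exact ordering of activations relative to Figure 1 is our simplification):

```python
import numpy as np

class SharedBDMA:
    """Shared-parameter bi-directional network: the reverse flow reuses
    the transposed weight matrices of the forward flow (fully connected
    layers, no bias vectors)."""
    def __init__(self, d, h, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(h, d))  # input -> hidden
        self.W2 = rng.normal(scale=0.1, size=(d, h))  # hidden -> output

    def f_a(self, x):
        # Forward flow: source -> target
        return self.W2 @ np.tanh(self.W1 @ x)

    def f_b(self, y):
        # Reverse flow: target -> source, via transposed weights
        return self.W1.T @ np.tanh(self.W2.T @ y)
```

Note that the transpose only inverts the network exactly when the learned layers are (approximately) orthogonal, which is what the layerwise orthogonal loss encourages.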

Experiments & Analysis
We experiment with the MUSE dataset (Conneau et al., 2017). It consists of 110 bilingual dictionaries with separate training and test datasets for each language pair. The pairs contain polysemous words; when training BDMA, polysemous words can provide additional context to the model while handicapping other baseline models, so we filter out training pairs containing polysemous words (source or target). The models are trained with 5000 unique pairs. We show two sets of experiments: (a) with a filtered evaluation set that contains 1500 unique pairs and (b) with the original evaluation dataset. We measure the performance of BDMA on two sets of languages: the low-resource languages Russian (Ru) and Japanese (Ja), and the high-resource languages Spanish (Es), French (Fr), German (De) and Italian (It).
In each table, s is the source language and t is the target language; → indicates the direction of mapping and of the training language pairs used from MUSE. For reverse translation, the model is trained with the t → s dataset and evaluated on the s → t test dataset; for example, the model trained on En→Ru is evaluated on Ru→En. P@1 measurements highlighted in blue show the forward (training) direction in which the model is trained, and the adjacent non-colored measurement uses the same model to perform reverse translation.

Embeddings & Baselines. We use normalized and mean-centered FastText embeddings (Joulin et al., 2016), learned from language-specific Wikipedia. We train two types of translation models: (a) a linear mapping with a weight matrix $W \in \mathbb{R}^{d \times d}$ for a $d$-dimensional embedding, and (b) a 1-hidden-layer feedforward network. For baseline comparisons, we retrain VECMAP (Artetxe et al., 2016, 2018), GeoMM (Jawanpuria et al., 2019) and RCSLS (Joulin et al., 2018). When possible, we compare with BLISS (R) (Patra et al., 2019), Joint Align (Wang et al., 2019), Cross-lingual Anchoring (Ormazabal et al., 2020) and LNMAP (Mohiuddin et al., 2020) using results previously reported for high-resource languages. We train BDMA with a combination of cosine (C) and RCSLS (R) losses, and separate baseline methods for each language and translation direction pair.
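The embedding preprocessing mentioned above (normalization and mean-centering) can be sketched as follows (a NumPy sketch; the exact order of operations, and whether normalization is repeated after centering as in Artetxe et al. (2016), is an assumption):

```python
import numpy as np

def preprocess(E):
    """Length-normalize each embedding row, then mean-center the matrix
    (rows = words, columns = embedding dimensions)."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E - E.mean(axis=0)
```

After this step every dimension has zero mean across the vocabulary, which is the form the mapping models consume.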

Impact of Polysemy
In Table 1, we observe BDMA's performance when translating words in high-resource languages. BDMA's performance is better than or equivalent to other methods. Additionally, we note that the translation model is trained with 5000 unique pairs, while Joint Align (Wang et al., 2019) and Cross-lingual Anchoring (Ormazabal et al., 2020) are trained with the full MUSE training dataset for any given language pair, which is greater than 5K. Similarly, Table 2 shows the performance of different models on low-resource languages compared to BDMA. BDMA with a 1-H FFN performs better than a linear mapping, with an overall increase as high as 2.82% when translating Japanese to English. The exception is Hindi, where the performance drops by 3.8% (Hi → En). We see that the model benefits from bidirectional training when there are polysemous words in the evaluation corpus, improving the network's ability to generalize.

Impact of Unique Vocabulary
Similar to the previous experiment, we analyze the impact of BDMA with an evaluation dataset of unique pairs for both high-resource and low-resource languages. Table 4 details experiments for the same setting under low-resource language conditions. Although BDMA performs better for En → Ru and En → Ja, Hi → En continues to perform poorly. In contrast, its performance is comparable for Portuguese, where the reduction is only 1.13% (En → Pt). Therefore, the two main benefits of BDMA are: (a) it creates a single bidirectional word translation model while keeping the performance of the model comparable to the baselines, and (b) the 1-H FFN is a single network in comparison to LNMAP (which has three), while linear BDMA has the same number of parameters as all other methods in Tables 1 and 2.

Importance of Training Direction
If the filtered training pairs do not contain polysemous words, why is the training direction important? When the model is trained for a number of epochs, its optimal savepoint is chosen based on the forward translation performance for the given language pair direction. As seen in Tables 2 and 4, the direction chosen to start model training can have an impact on forward and reverse translation performance. For example, the model trained with Ru → En performs better than the one trained with En → Ru.

Ablation Study. In Table 5, we assess the impact of using (combinations of) the MSE, cosine and RCSLS distance functions $D$. A combined cosine and RCSLS loss ([C + R]) performs best and provides consistent forward (s → t) and reverse (t → s) translation performance (within 0.5%).

Related Work
Over the years, many supervised methods have been proposed. Irvine and Callison-Burch (2013) learn a binary classifier for a language pair that predicts whether a given word pair is a translation of each other. Artetxe et al. (2016) implement Procrustes alignment while normalizing and mean-centering word embeddings. Xing et al. (2015) add an orthogonal loss while aligning manifolds. Artetxe et al. (2018) provide additional pre- and post-processing steps. Conneau et al. (2017) propose a new retrieval method called cross-domain similarity local scaling (CSLS) in order to reduce the "hubness" problem. Joulin et al. (2018) convert CSLS into a loss objective in order to optimize the translation matrix. An important challenge with linear mapping is that it assumes the source and target languages have a similar manifold structure; Søgaard et al. (2018) show this assumption does not hold for many language pairs. Nakashole and Flauger (2018) show that transformations need to be non-linear and depend on a word's local neighborhood. Instead of learning a mapping between languages separately, Wang et al. (2019) jointly learn the monolingual and cross-lingual embeddings for the given language pair. Ormazabal et al. (2020) extend skip-gram to project source embeddings into a fixed target space, using them as anchors to iteratively learn the mapping.

Cyclic Loss for Reverse Translation. Xu et al. (2018) perform unsupervised word alignment using a cycle consistency loss while computing the Sinkhorn distance between a forward and a reverse translation network. Mohiuddin and Joty (2019) train a dual autoencoder-discriminator architecture and use a cyclic loss to train a bi-directional model. LNMAP (Mohiuddin et al., 2020) extends the autoencoder architecture with a 2-layer mapping to learn a non-isomorphic mapping between languages.
Our work differs as we reduce the number of parameters in the model (as it contains the mapping only) while training an invertible network that can perform both forward and back translation.

Conclusion
We show how a non-linear mapping (an invertible neural network) can be trained with a cyclic consistency loss, showing that the common isomorphism assumption is not strictly necessary (Søgaard et al., 2018). The trained network has fewer parameters in comparison to Mohiuddin et al. (2020) while providing equivalent or improved performance on the low-resource word translation task.
In the following sections, we provide hyperparameter values for each network architecture, statistics about the dataset, and results from additional experiments. The experiments are conducted on an NVIDIA K20 GPU with ≈4GB of RAM and NVIDIA V100 GPUs with 16GB of RAM. Each model is trained on a single GPU. Linear models can be trained on K20s, and the larger 1-H FFN models are optimized on V100s.

A Hyperparameters
The following hyper-parameters are used in our experiments:

Table A1: Hyper-parameters.

    Hyper-parameter   Value
    batch size        128
    lr_decay          0.98
    lr_shrink         0.5
    map_beta          0.001
    max_vocab         200000

As seen in Table A1, the maximum vocabulary (max_vocab) size is 200K; the vocabulary is selected by taking the 200K most frequent words. map_beta is the parameter that controls the contribution of the orthogonal loss to the overall loss function. The network is trained with the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0005. The word embeddings are preprocessed, i.e., they are normalized and mean-centered. The 1-hidden-layer feedforward network used to perform alignment has a hidden layer size of 4096, with tanh as the hidden layer's activation function.

B CCL Correlation with Linear Mapping
As observed in §2.1, a linear relationship between source and target language embeddings can be learned by minimizing the squared loss between them. In practice, an additional orthogonal constraint $\mathcal{L}_{ortho} = \| W W^T - I \|$ is added (Xing et al., 2015), as shown in the equation below:

$$\min_W \sum_i \left\| W m^s_i - m^t_i \right\|^2 + \beta \left\| W W^T - I \right\|$$

Minimizing $\mathcal{L}_{ortho}$ makes the linear mapping implicitly bidirectional, able to map words from the target to the source language. In comparison, $\mathcal{L}_{ccl}$ in equation 1 trains a non-linear neural network or a linear mapping to be explicitly bidirectional. Thus $\mathcal{L}_{ccl}$ can be considered an extension of $\mathcal{L}_{ortho}$.
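A minimal sketch of the orthogonality penalty on a single weight matrix (using the squared Frobenius norm; the paper does not specify the exact norm, so this is an assumption):

```python
import numpy as np

def ortho_loss(W):
    """|| W W^T - I ||_F^2 : penalizes deviation of W from orthogonality,
    which is what makes a linear mapping (approximately) invertible
    via its transpose W^T."""
    I = np.eye(W.shape[0])
    return np.sum((W @ W.T - I) ** 2)
```

For a perfectly orthogonal matrix the penalty is zero, and its transpose is then an exact inverse.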

(Tables A2 and A3, listing per-target-language train/test dictionary sizes, appear here; e.g., French has 10872 training pairs.)

C Dataset
MUSE (Conneau et al., 2017). As described in §3, the dataset has 110 bilingual dictionaries and contains pairs with English as either the source or the target language. Additionally, non-English language pairs are available for European languages, including German, Spanish, French, Italian and Portuguese. Each bilingual dataset has a vocabulary of 5000 unique source language words to train the translation model and 1500 unique words to evaluate it. Because the pairs are not unique and contain polysemous source words (the target word is always unique), the overall sizes of the training and test dictionaries are greater than 5000 and 1500.
Tables A2 and A3 show the dataset sizes from the original MUSE dataset. The tables show that the samples for different language pairs contain polysemous words that expand the dataset size by 36.8% to 123.7% in comparison to BDMA (in Tables 1, 2 and 4), which is trained with 5000 unique pairs only.

D Additional Loss
In §2.1, we showed how MSE is adapted for $\mathcal{L}_{ccl}$. Similarly, the cosine and relaxed CSLS losses can be modified for BDMA. In the adapted version of the cosine loss, we minimize the following:

$$\mathcal{L}_{cos} = \sum_i \left(1 - \cos(f_a(m^s_i), m^t_i)\right) + \left(1 - \cos(f_b(m^t_i), m^s_i)\right) \quad (4)$$

To modify RCSLS (Joulin et al., 2018), we first look at the CSLS (Conneau et al., 2017) retrieval criterion:

$$\mathrm{CSLS}(W m^s_i, m^t_i) = 2\cos(W m^s_i, m^t_i) - \frac{1}{k} \sum_{m^t \in \mathcal{N}_t(W m^s_i)} \cos(W m^s_i, m^t) - \frac{1}{k} \sum_{m^s \in \mathcal{N}_s(m^t_i)} \cos(W m^s, m^t_i)$$

where $\mathcal{N}_s(x)$ is the neighborhood of $x$ in the source manifold and $\mathcal{N}_t(y)$ is the same in the target manifold, $k$ is the number of nearest neighbors, and $W$ is assumed to be orthogonal. Joulin et al. (2018) relax the cosine criterion in RCSLS, i.e., $\cos(W m^s_i, m^t_i) = {m^s_i}^T W^T m^t_i$ for unit-norm embeddings. Hence RCSLS becomes:

$$\mathcal{L}_{rcsls}(W) = \frac{1}{n} \sum_i \left( -2\,{m^s_i}^T W^T m^t_i + \frac{1}{k} \sum_{m^t \in \mathcal{N}_t(W m^s_i)} {m^s_i}^T W^T m^t + \frac{1}{k} \sum_{m^s \in \mathcal{N}_s(m^t_i)} {m^s}^T W^T m^t_i \right) \quad (5)$$

In BDMA, we replace the orthogonal matrix $W$ with a mapping that is either linear or non-linear (a neural network). RCSLS changes to:

$$\mathcal{L}_{rcsls} = \sum_i \mathrm{RCSLS}(f_a(m^s_i), m^t_i) + \mathrm{RCSLS}(f_b(m^t_i), m^s_i) \quad (6)$$

In equation 6, $f_a$ and $f_b$ are the forward and reverse flow projections of $m^s_i$ and $m^t_i$, respectively.
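For illustration, the cosine-adapted cycle loss can be sketched as follows (a NumPy sketch; minimizing $1 - \cos$ in each direction is one common convention, assumed here):

```python
import numpy as np

def cosine_ccl(f_a, f_b, m_s, m_t):
    """Cosine-distance cycle loss for one embedding pair: (1 - cos) in
    both the forward and the backward direction."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return (1 - cos(f_a(m_s), m_t)) + (1 - cos(f_b(m_t), m_s))
```

The RCSLS variant additionally subtracts mean similarities over each point's k-nearest-neighbor sets, which requires batch-level computation and is omitted from this per-pair sketch.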