Data Augmentation with Unsupervised Machine Translation Improves the Structural Similarity of Cross-lingual Word Embeddings

Unsupervised cross-lingual word embedding (CLWE) methods learn a linear transformation matrix that maps between two monolingual embedding spaces trained separately on monolingual corpora. These methods rely on the assumption that the two embedding spaces are structurally similar, which does not necessarily hold in general. In this paper, we argue that using a pseudo-parallel corpus generated by an unsupervised machine translation model improves the structural similarity of the two embedding spaces and the quality of CLWEs in the unsupervised mapping method. We show that our approach outperforms alternative approaches given the same amount of data, and, through detailed analysis, we show that data augmentation with pseudo data from unsupervised machine translation is especially effective for mapping-based CLWEs because (1) the pseudo data makes the source and target corpora (partially) parallel; and (2) the pseudo data contains information on the original language that helps to learn similar embedding spaces for the source and target languages.


Introduction
Cross-lingual word embedding (CLWE) methods aim to learn a shared meaning space between two languages (the source and target languages), which is potentially useful for cross-lingual transfer learning or machine translation (Yuan et al., 2020; Artetxe et al., 2018b; Lample et al., 2018a). Although early methods for learning CLWEs often utilize multilingual resources such as parallel corpora (Gouws et al., 2015; Luong et al., 2015) and word dictionaries (Mikolov et al., 2013), recent studies have focused on fully unsupervised methods that do not require any cross-lingual supervision (Lample et al., 2018b; Artetxe et al., 2018a; Patra et al., 2019). Most unsupervised methods fall into the category of mapping-based methods, which generally consist of the following procedure: train monolingual word embeddings independently in two languages; then, find a linear mapping that aligns the two embedding spaces. The mapping-based method rests on the strong assumption that the two independently trained embedding spaces have similar structures that can be aligned by a linear transformation, which is unlikely to hold when the two corpora are from different domains or the two languages are typologically very different (Søgaard et al., 2018). To address this problem, several studies have focused on improving the structural similarity of monolingual spaces before learning the mapping (Zhang et al., 2019; Vulić et al., 2020), but few studies have focused on how to leverage the text data itself.
In this paper, we show that the pseudo sentences generated from an unsupervised machine translation (UMT) system (Lample et al., 2018c) improve the structural similarity without any additional cross-lingual resources. In the proposed method, the training data of the source and/or target language are augmented with the pseudo sentences (Figure 1).
We argue that this method facilitates the structural similarity between the source and target embeddings for the following two reasons. Firstly, the source and target embeddings are usually trained on monolingual corpora. The difference in the content of the two corpora may accentuate the structural difference between the two resulting embedding spaces, and thus we can mitigate that effect by making the source and target corpora parallel with automatically generated pseudo data. Secondly, in the mapping-based method, the source and target embeddings are trained independently without taking into account the other language. Thus, the embedding structures may not be optimal for CLWEs. We argue that pseudo sentences generated by a UMT system contain some trace of the original language, and using them when training monolingual embeddings can facilitate the structural correspondence of the two sets of embeddings.
Figure 1: Our framework for training CLWEs using unsupervised machine translation (UMT). We first train UMT models using monolingual corpora for each language. We then translate all the training corpora and concatenate the outputs with the original corpora, and train monolingual word embeddings independently. Finally, we map these word embeddings onto a shared embedding space.
In experiments using Wikipedia dumps in English, French, German, and Japanese, we observe substantial improvements from our method in bilingual lexicon induction and in downstream tasks, without hurting the quality of the monolingual embeddings. Moreover, we carefully analyze why our method improves the performance. The results confirm that making the source and target corpora parallel does contribute to the improvement, and also suggest that the generated translation data contain information about the original language.

Background and Related Work
Cross-lingual Word Embeddings CLWE methods aim to learn a semantic space shared between two languages. Most of the current approaches fall into two types of methods: joint-training approaches and mapping-based approaches.
Mapping-based approaches utilize monolingual embeddings that are already obtained from monolingual corpora. They assume structural similarity between monolingual embeddings of different languages and attempt to obtain a shared embedding space by finding a transformation matrix W that maps source word embeddings to the target embedding space (Mikolov et al., 2013). The transformation matrix W is usually obtained by minimizing the sum of squared Euclidean distances between the mapped source embeddings and target embeddings:
$$W^{*} = \operatorname*{arg\,min}_{W} \sum_{(x_i, y_i) \in D} \lVert W x_i - y_i \rVert^{2} \tag{1}$$
where D is a bilingual word dictionary that contains word pairs (x_i, y_i), and x_i and y_i represent the corresponding word embeddings. Although finding the transformation matrix W is straightforward when a word dictionary is available, a recent trend is to reduce the amount of cross-lingual supervision or to find W in a completely unsupervised manner (Lample et al., 2018b; Artetxe et al., 2018a). The general framework of unsupervised mapping methods is based on heuristic initialization of a seed dictionary D and iterative refinement of the transformation matrix W and the dictionary D, as described in Algorithm 1. In our experiments, we use the unsupervised mapping-based method proposed by Artetxe et al. (2018a). Their method is characterized by a seed dictionary initialized with nearest neighbors based on the similarity distributions of words in each language.
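As a minimal sketch of the least-squares objective described above, the following NumPy snippet recovers W from a set of dictionary pairs. The function name `fit_linear_map` and the row-vector layout are illustrative assumptions rather than part of any cited implementation, and many unsupervised mapping methods additionally constrain W to be orthogonal, which this sketch does not do.

```python
import numpy as np


def fit_linear_map(src_vecs, tgt_vecs):
    """Return W minimizing sum_i ||W x_i - y_i||^2 over dictionary pairs.

    src_vecs, tgt_vecs: (n_pairs, dim) arrays whose i-th rows are x_i and y_i.
    """
    # Embeddings are stored as rows, so we solve src_vecs @ W.T ~= tgt_vecs.
    w_transposed, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return w_transposed.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = rng.normal(size=(4, 4))      # synthetic ground-truth mapping
    x = rng.normal(size=(50, 4))          # toy "source" embeddings
    y = x @ true_w.T                      # toy "target" embeddings
    print(np.allclose(fit_linear_map(x, y), true_w))
```

With noise-free synthetic pairs the fitted matrix matches the generating matrix; on real embeddings the fit is only approximate, which is where the structural-similarity assumption matters.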
These mapping-based methods, however, are based on the strong assumption that the two independently trained embedding spaces have similar structures that can be aligned by a linear transformation. Although several studies have tackled improving the structural similarity of monolingual spaces before learning mapping (Zhang et al., 2019;Vulić et al., 2020), not much attention has been paid to how to leverage the text data itself.
Algorithm 1: The general workflow of unsupervised mapping methods
Input: The source embeddings X, the target embeddings Y
Output: The transformation matrix W
Heuristically induce an initial seed word dictionary D
while not converged do
    Compute W given the word dictionary D from Equation (1)
    Update the word dictionary D by retrieving cross-lingual nearest neighbors in a shared embedding space obtained by W
end
return W
In this paper, we argue that we can facilitate the structural correspondence of two embedding spaces by augmenting the source and/or target corpora with the output from an unsupervised machine translation system (Lample et al., 2018c).
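The loop in Algorithm 1 can be sketched as follows. This is a simplified illustration, not the method of Artetxe et al. (2018a): the seed dictionary is passed in by the caller instead of being induced heuristically, W is fitted by plain least squares, and the names `self_learning` and `seed_pairs` are hypothetical.

```python
import numpy as np


def normalize(matrix):
    """Scale every row to unit length so dot products are cosine similarities."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def self_learning(src, tgt, seed_pairs, max_iters=10):
    """Alternate between fitting W and re-inducing the dictionary.

    src, tgt:   (n_words, dim) embedding matrices
    seed_pairs: list of (source_index, target_index) tuples
    """
    src, tgt = normalize(src), normalize(tgt)
    pairs = list(seed_pairs)
    w = None
    for _ in range(max_iters):
        s_idx = [s for s, _ in pairs]
        t_idx = [t for _, t in pairs]
        # Step 1: compute W from the current dictionary (least squares).
        w_t, *_ = np.linalg.lstsq(src[s_idx], tgt[t_idx], rcond=None)
        w = w_t.T
        # Step 2: update the dictionary with cross-lingual nearest neighbors.
        mapped = normalize(src @ w.T)
        nearest = (mapped @ tgt.T).argmax(axis=1)
        new_pairs = [(i, int(j)) for i, j in enumerate(nearest)]
        if new_pairs == pairs:  # dictionary no longer changes: converged
            break
        pairs = new_pairs
    return w, pairs
```

On toy data where the target space is an exact rotation of the source space, a small correct seed dictionary is enough for the loop to recover the full dictionary; real embedding spaces are only approximately isomorphic, so convergence to a good solution is not guaranteed.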

Unsupervised Machine Translation
Unsupervised machine translation (UMT) is the task of building a translation system without any parallel corpora (Artetxe et al., 2018b; Lample et al., 2018a,c; Artetxe et al., 2019b). UMT is accomplished by three components: (1) a word-by-word translation model learned using unsupervised CLWEs; (2) a language model trained on the source and target monolingual corpora; and (3) a back-translation model, in which the model uses an input and its own translated output as parallel sentences and learns to translate them in both directions.
More specifically, the initial source-to-target translation model P^0_{s→t} is created from the word-by-word translation model and the language model of the target language. Then, the target-to-source model P^1_{t→s} is learned in a supervised setting using the original source monolingual corpus paired with the synthetic target-language sentences generated by P^0_{s→t}. Next, another source-to-target translation model P^1_{s→t} is trained with the original target monolingual corpus and the outputs of P^1_{t→s}, and the quality of the translation models is improved by iterating this process.
In our experiments, we adopt an unsupervised phrase-based statistical machine translation (SMT) method to generate a pseudo corpus because it produces better translations than unsupervised neural machine translation on low-resource languages (Lample et al., 2018c). The unsupervised SMT (USMT) model differs from its supervised counterpart in that the initial phrase table is derived from the cosine similarity of unsupervised CLWEs, and the translation model is iteratively improved with pseudo-parallel corpora.
Our proposed method utilizes the output of a USMT system to augment the training corpus for CLWEs.

Exploiting UMT for Cross-lingual Applications
There is some previous work on how to use UMT to induce bilingual word dictionaries or improve CLWEs. Artetxe et al. (2019a) explored an effective way of utilizing a phrase table from a UMT system to induce bilingual dictionaries. Marie and Fujita (2019) generate a synthetic parallel corpus from a UMT system, and jointly train CLWEs along with the word alignment information (Luong et al., 2015). In our work, we use the synthetic parallel corpus generated from a UMT system not for joint-training but for data augmentation to train monolingual word embeddings for each language, which are subsequently aligned through unsupervised mapping. In the following sections, we empirically show that our approach leads to the creation of improved CLWEs and analyze why these results are achieved.

Experimental Design
In this section, we describe how we obtain mapping-based CLWEs using a pseudo-parallel corpus generated from UMT. We first train UMT models using the source and target training corpora, and then translate the corpora with these models to obtain machine-translated corpora. Having done that, we simply concatenate each machine-translated corpus with the original training corpus of the same language, and learn monolingual word embeddings independently for each language. Finally, we map these embeddings to a shared CLWE space.

Corpora
We apply our method to two similar language pairs, English-French (en-fr) and English-German (en-de), and one distant language pair, English-Japanese (en-ja). We use plain texts from Wikipedia dumps and randomly extract 10M sentences for each language. The English, French, and German texts are tokenized with the Moses tokenizer (Koehn et al., 2007) and lowercased. For Japanese texts, we use kytea to tokenize and normalize them.

Training mapping-based CLWEs
Given tokenized texts, we train monolingual word embeddings using fastText with 512 dimensions, a context window of size 5, and 5 negative examples. We then map these word embeddings onto a shared embedding space using the open-source implementation VecMap with the unsupervised mapping algorithm (Artetxe et al., 2018a).
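For concreteness, the two steps above can be written as command lines for the fastText and VecMap tools. The sketch below only assembles the argument lists; the file names are hypothetical, and the choice of the `skipgram` model is an assumption, since the text does not state whether skip-gram or CBOW was used.

```python
def fasttext_command(corpus_path, output_prefix, model="skipgram"):
    """Build a fastText CLI call with the hyperparameters stated in the text."""
    return [
        "fasttext", model,          # model type is an assumption (see lead-in)
        "-input", corpus_path,
        "-output", output_prefix,   # fastText writes <prefix>.vec and <prefix>.bin
        "-dim", "512",              # embedding dimensionality
        "-ws", "5",                 # context window size
        "-neg", "5",                # number of negative examples
    ]


def vecmap_command(src_emb, tgt_emb, src_out, tgt_out):
    """Build the VecMap call for fully unsupervised mapping."""
    return [
        "python3", "map_embeddings.py", "--unsupervised",
        src_emb, tgt_emb, src_out, tgt_out,
    ]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(fasttext_command("en.tok.txt", "en")))
    print(" ".join(vecmap_command("en.vec", "fr.vec",
                                  "en.mapped.vec", "fr.mapped.vec")))
```

The lists can be passed to `subprocess.run` once both tools are installed and the paths point to real files.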

Training UMT models
To implement UMT, we first build a phrase table by selecting the 300,000 most frequent source phrases and taking their 200 nearest neighbors in the CLWE space, following the setting of Lample et al. (2018c). We then train a 5-gram language model for each language with KenLM (Heafield et al., 2013) and combine it with the phrase table, which results in an unsupervised phrase-based SMT model. Then, we refine the UMT model through three iterative back-translation steps. At each step, we translate 100k sentences randomly sampled from the monolingual dataset. We use a phrase table containing phrases up to a length of 4, except for initialization. The quality of our UMT models is indicated by the BLEU scores (Papineni et al., 2002) in Table 1. We use newstest2014 from WMT14 to evaluate En-Fr and En-De translation accuracy and the Tanaka corpus for En-Ja evaluation.
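The nearest-neighbor step of the phrase table construction can be sketched as below. This is an illustration under stated assumptions: the function name `candidate_translations` is hypothetical, phrase embeddings are assumed to be given as matrices already mapped into the shared CLWE space, and the conversion of similarities into phrase-table scores used by Lample et al. (2018c) is not reproduced.

```python
import numpy as np


def candidate_translations(src_vecs, tgt_vecs, k=200):
    """For each source phrase, return its k nearest target phrases by cosine.

    Returns (indices, scores), both of shape (n_source, k), ordered from the
    most to the least similar candidate.
    """
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T                       # cosine similarity matrix
    k = min(k, tgt.shape[0])                 # cannot return more than exist
    top = np.argsort(-sims, axis=1)[:, :k]   # highest similarity first
    scores = np.take_along_axis(sims, top, axis=1)
    return top, scores
```

For 300,000 source phrases the full similarity matrix would not fit in memory at once, so a practical version would process the source phrases in batches.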

Training CLWEs with pseudo corpora
We then translate all the training corpora with the UMT system and obtain machine-translated corpora, which we call pseudo corpora. We concatenate the pseudo corpora with the original corpora, and learn monolingual word embeddings for each language. Finally, we map these word embeddings to a shared CLWE space with the unsupervised mapping algorithm.
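The augmentation step itself is a simple concatenation, sketched below. The `translate` argument is a hypothetical stand-in for the trained UMT system, and the toy word-for-word translator in the usage example is purely illustrative.

```python
def augment_with_pseudo(original, other_side, translate):
    """Concatenate a corpus with translations of the other language's corpus.

    original:   sentences in language A (kept unchanged)
    other_side: sentences in language B
    translate:  callable mapping a language-B sentence into language A
    """
    pseudo = [translate(sentence) for sentence in other_side]
    return list(original) + pseudo


if __name__ == "__main__":
    toy_fr_to_en = {"chat": "cat", "noir": "black"}

    def toy_translate(sentence):
        # Word-for-word lookup; unknown words are copied through.
        return " ".join(toy_fr_to_en.get(tok, tok) for tok in sentence.split())

    english = ["the cat sleeps"]
    french = ["chat noir"]
    print(augment_with_pseudo(english, french, toy_translate))
```

Augmenting the source side, the target side, or both corresponds to calling this function once per language with the roles of the two corpora swapped.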

Models
We compare our method with a baseline with no data augmentation as well as with existing related methods: dictionary induction from a phrase table (Artetxe et al., 2019a) and the unsupervised joint-training method (Marie and Fujita, 2019). These two methods both exploit word alignments in the pseudo-parallel corpus, and to obtain them we use Fast Align (Dyer et al., 2013) with the default hyperparameters. For the joint-training method, we adopt bivec to train CLWEs with the parameters used in Upadhyay et al. (2016), using the pseudo-parallel corpus and the word alignments. To ensure a fair comparison, we implement all of these methods with the same UMT system.

Evaluation of Cross-lingual Mapping
In this section, we conduct a series of experiments to evaluate our method. We first evaluate the performance of cross-lingual mapping in our method (§ 4.1) and investigate the effect of UMT quality (§ 4.2). Then, we analyze why our method improves the bilingual lexicon induction (BLI) performance. Through carefully controlled experiments, we argue that it is not simply because of data augmentation but because: (1) the generated data makes the source and target corpora (partially) parallel (§ 4.3); (2) the generated data reflects the co-occurrence statistics of the original language (§ 4.4).

Bilingual Lexicon Induction
First, we evaluate the mapping accuracy of word embeddings using BLI. BLI is the task of identifying word translation pairs between two languages. We use XLing-Eval as test sets for En-Fr and En-De. For En-Ja, we create the word dictionaries automatically using Google Translate, following Ri and Tsuruoka (2020). Except for BLI from a phrase table, we train three sets of embeddings with different random seeds and report the average of the results.
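A common way to score BLI is precision at 1 with nearest-neighbor retrieval in the shared space. The sketch below assumes this setup, which the text does not spell out; the function name `bli_precision_at_1` and the index-based gold dictionary format are hypothetical.

```python
import numpy as np


def bli_precision_at_1(src_vecs, tgt_vecs, gold):
    """Fraction of source words whose nearest target word is a gold translation.

    src_vecs, tgt_vecs: (n_words, dim) embeddings already in the shared space
    gold:               dict mapping a source index to a set of acceptable
                        target indices
    """
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    queries = sorted(gold)
    # Retrieve the most cosine-similar target word for every query word.
    predictions = (src[queries] @ tgt.T).argmax(axis=1)
    hits = sum(int(p) in gold[q] for q, p in zip(queries, predictions))
    return hits / len(queries)
```

Other retrieval criteria can be substituted for the plain cosine nearest neighbor without changing the rest of the evaluation.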
We compare the proposed method with alternative approaches on BLI, as shown in Table 2. In all the language pairs, the mapping method with pseudo data augmentation achieves better performance than the other methods. One might expect that a greater amount of data leads to better performance, and thus that augmenting both the source and target corpora performs best. However, the results show that this is not necessarily the case: for our mapping method, augmenting only the source or only the target, not both, achieves the best performance in many language pairs. This is probably due to the presence of two pseudo corpora with different natures.
As for the two methods using word alignments (BLI from phrase table; joint training), we observe some cases where these models underperform the mapping methods, especially in English and Japanese pairs. We attribute this to our relatively low-resource setting, where the quality of the synthetic parallel data is not sufficient to perform these methods, which require word alignment between parallel sentences.

Effect of UMT quality
To investigate the effect of UMT quality on our method, we compare the accuracy of BLI on the CLWEs using pseudo data generated from UMT models of different qualities. As lower-quality translators, we prepare models that perform fewer iterations of back-translation (BT). Note that we compare the results on the source-side (English) extension, where the quality of the translation is notably different. As shown in Table 3, we find that the better the quality of the generated data, the better the BLI performance.

Effect of sharing content
In the mapping method, word embeddings are independently trained on monolingual corpora that do not necessarily have the same content. As a result, the difference in the corpus contents can hurt the structural similarity of the two resulting embedding spaces. We hypothesize that using synthetic parallel data with common content for learning word embeddings leads to better structural correspondence, which improves cross-lingual mapping.
To verify the effect of sharing content through parallel data, we compare extensions with a parallel corpus and with a non-parallel corpus. More concretely, we first split the original training data of the source and target languages evenly (denoted as Split A and Split B). As the baseline, we train CLWEs with Split A. We use the translation of Split A of the target language data for the parallel extension of the source data, and the translation of Split B for the non-parallel extension. We also compare them with the extension with non-pseudo data, which simply increases the amount of source language data with raw text.
Along with the BLI score, we report eigenvector similarity, a spectral metric that quantifies the structural similarity of word embedding spaces (Søgaard et al., 2018). To compute eigenvector similarity, we normalize the embeddings and construct the nearest-neighbor graphs of the 10,000 most frequent words in each language. We then calculate their Laplacian matrices L1 and L2 from those graphs and find the smallest k such that the sum of the k largest eigenvalues of each Laplacian matrix is < 90% of the sum of all its eigenvalues. Finally, we sum the squared differences between the k largest eigenvalues of L1 and L2 to obtain the eigenvector similarity. Note that smaller eigenvector similarity values mean higher degrees of structural similarity.
Table 4 shows the BLI scores and eigenvector similarity in each extension setting. The parallel extension shows slightly better BLI performance than the non-parallel extension. This supports our hypothesis that parallel pseudo data make the word embedding spaces more suitable for bilingual mapping because of the shared content. In eigenvector similarity, there is no significant difference between the parallel and non-parallel corpora. This is probably due to large fluctuations in eigenvector similarity values. Surprisingly, the results show that augmentation with pseudo data is much more effective than extension with the same amount of original training data. This result suggests that using pseudo data as training data is useful, especially for learning bilingual models.
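The eigenvector similarity computation can be sketched as follows. Several details are assumptions of this sketch rather than statements from the text: the number of neighbors per node (`n_neighbors`), the use of an unweighted symmetric graph, and the reading of the 90% criterion as the smallest k whose k largest eigenvalues account for at least 90% of the eigenvalue sum, with the smaller k of the two graphs used for the comparison.

```python
import numpy as np


def laplacian_spectrum(vecs, n_neighbors):
    """Eigenvalues (descending) of the Laplacian of a symmetric k-NN graph."""
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)                 # exclude self-matches
    nearest = np.argsort(-sims, axis=1)[:, :n_neighbors]
    adjacency = np.zeros_like(sims)
    rows = np.repeat(np.arange(len(vecs)), n_neighbors)
    adjacency[rows, nearest.ravel()] = 1.0
    adjacency = np.maximum(adjacency, adjacency.T)  # make the graph undirected
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    return np.sort(np.linalg.eigvalsh(laplacian))[::-1]


def energy_rank(eigenvalues, threshold=0.9):
    """Smallest k whose k largest eigenvalues reach `threshold` of the total."""
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)


def eigenvector_similarity(src_vecs, tgt_vecs, n_neighbors=10):
    """Sum of squared differences between the top-k Laplacian eigenvalues."""
    e1 = laplacian_spectrum(src_vecs, n_neighbors)
    e2 = laplacian_spectrum(tgt_vecs, n_neighbors)
    k = min(energy_rank(e1), energy_rank(e2))
    return float(np.sum((e1[:k] - e2[:k]) ** 2))
```

Because the metric depends only on the neighborhood graphs, it is zero for two spaces that differ by a rotation, and it grows as the graph structures diverge. In the setting described above, the inputs would be the embeddings of the 10,000 most frequent words of each language.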

Effect of reflecting the co-occurrence statistics of the language
We hypothesize that the translated sentences reflect the co-occurrence statistics of the original language, which makes the co-occurrence information on training data similar, improving the structural similarity of the two monolingual embeddings.
To verify this hypothesis, we experiment with augmenting the source language with sentences translated from a non-target language. To examine only the effect of the co-occurrence statistics of language and avoid the effects of sharing content, we use the extensions with the non-parallel corpus. Table 5 shows that BLI performance and eigenvector similarity improve with the extension from the same target language, but that is not the case if the pseudo corpus is generated from a non-target language. These results indicate that our method can leverage learning signals on the other language in the pseudo data.

Downstream Tasks
Although CLWEs were evaluated almost exclusively on the BLI task in the past, it has recently been shown that CLWEs that perform well on BLI do not always perform well on other cross-lingual tasks. Therefore, we evaluate our embeddings on four downstream tasks: topic classification (TC), sentiment analysis (SA), dependency parsing (DP), and natural language inference (NLI).
Topic Classification This task is to classify news articles by topic. We use the MLDoc 12 corpus compiled by Schwenk and Li (2018). It includes four topics: CCAT (Corporate / Industrial), ECAT (Economics), GCAT (Government / Social), and MCAT (Markets). As the classifier, we implement a simple lightweight convolutional neural network (CNN)-based classifier.

Sentiment Analysis
In this task, a model classifies sentences as expressing either a positive or a negative opinion. We use the Webis-CLS-10 corpus 13 . This dataset consists of review texts for Amazon products and their ratings from 1 to 5. We cast the problem as binary classification and define rating values 1-2 as "negative" and 4-5 as "positive", and exclude the rating 3. Again, we use the CNN-based classifier for this task.
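The label construction described above amounts to the following mapping. The function names and the `(text, rating)` input format are illustrative assumptions; the actual file format of the Webis-CLS-10 corpus is not handled here.

```python
def rating_to_label(rating):
    """Map a 1-5 star rating to a binary sentiment label, or None to skip."""
    if rating <= 2:
        return "negative"
    if rating >= 4:
        return "positive"
    return None  # rating 3 is excluded from the binary task


def binarize_reviews(reviews):
    """Keep only reviews with a clear polarity.

    reviews: iterable of (text, rating) pairs
    returns: list of (text, label) pairs
    """
    labeled = ((text, rating_to_label(rating)) for text, rating in reviews)
    return [(text, label) for text, label in labeled if label is not None]


if __name__ == "__main__":
    sample = [("great product", 5), ("it is fine", 3), ("broke quickly", 1)]
    print(binarize_reviews(sample))
```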
Dependency Parsing We train the deep biaffine parser (Dozat and Manning, 2017) with the UD English EWT dataset 14 (Silveira et al., 2014). We use the PUD treebanks 15 as test data.
Natural Language Inference We use the English MultiNLI corpus for training and the multilingual XNLI corpus for evaluation (Conneau et al., 2018). Among the target languages in our experiments, XNLI covers only French and German. We train the LSTM-based classifier (Bowman et al., 2015), which encodes the two sentences, concatenates their representations, and then feeds them to a multi-layer perceptron.
12 https://github.com/facebookresearch/MLDoc
13 https://webis.de/data/webis-cls-10.html
14 https://universaldependencies.org/treebanks/en_ewt/index.html
15 https://universaldependencies.org/conll17/
In each task, we train the model using English training data with the embedding parameters fixed. We then evaluate the model on the test data in the other target languages. Table 6 shows the test set accuracy on the downstream tasks. For topic classification, our method obtains the best results in all language pairs. In En-Fr and En-Ja in particular, the difference is significant under Student's t-test. For sentiment analysis, we observe a significant improvement in En-De, but cannot observe consistent trends in the other languages. For dependency parsing and natural language inference, we observe a similar trend in which our method outperforms the other methods, although no significant difference is observed in the t-test. The lower performance of joint training compared with the mapping method is presumably due to the poor quality of the synthetic parallel data, as described in § 4.1. In summary, given the same amount of data, the CLWEs obtained from our method tend to show higher performance than the alternative methods not only in BLI but also in downstream tasks, although there is some variation.

Analysis
Monolingual Word Similarity Our method uses a noisy pseudo corpus to learn monolingual word embeddings, which might hurt their quality. To investigate this point, we evaluate the monolingual embeddings on the word similarity task. This task evaluates the quality of monolingual word embeddings by measuring the correlation between the cosine similarity in the vector space and manually created word-pair similarity scores. We use simverb-3500 (Gerz et al.) and the dataset of Bruni et al. (2014), consisting of 3000 frequent words extracted from web text. Table 8 shows the results of word similarity. The scores of monolingual word embeddings using a French or German pseudo corpus are maintained or improved, while they decrease for Japanese. This suggests that the quality of monolingual word embeddings could be hurt by the low quality of the pseudo corpus or by differences in linguistic nature. Nevertheless, the proposed method improves the performance of the En-Ja CLWEs, which suggests that the monolingual word embeddings created with a pseudo corpus have a structure optimized for cross-lingual mapping.
Application to UMT UMT is one of the important applications of CLWEs. Appropriate initialization with CLWEs is crucial to the success of UMT (Lample et al., 2018c). To investigate how CLWEs obtained from our method affect the performance of UMT, we compare the BLEU scores of UMT models initialized with CLWEs trained with and without a pseudo corpus at each iterative step. As shown in Table 9, initialization with CLWEs using the pseudo data results in a higher BLEU score in the first step but does not improve the score at further steps compared to the CLWEs without the pseudo data. Marie and Fujita (2019) also report the same tendency for CLWEs obtained with joint training.
To investigate this point, we compare the lexical densities of the training corpus and the pseudo corpus used in the above experiments (§ 4, 5) using the type-token ratio. The values for the pseudo corpora are considerably lower than those for the original corpora, which suggests that the pseudo corpus is standardized to some extent, as reported in Vanmassenhove et al. (2019). As a result, specific words might be easily mapped in CLWEs using a pseudo corpus, and the translation model then more readily translates phrases in specific patterns. Hence, the model cannot generate diverse data during back-translation, and the accuracy does not improve because the learning problem becomes too easy.
Type-token ratio of the original and pseudo corpora:
          en-fr                         en-de                         en-ja
corpus    en            fr            en            de            en            ja
origin    1.60 × 10^-3  1.63 × 10^-3  1.51 × 10^-3  3.78 × 10^-3  1.52 × 10^-3  1.03 × 10^-3
pseudo    0.57 × 10^-3  0.57 × 10^-3  0.66 × 10^-3  0.59 × 10^-3  0.19 × 10^-3  0.17 × 10^-3
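The type-token ratio is the number of distinct word types divided by the total number of tokens. A minimal sketch is given below; it assumes whitespace-separated, pre-tokenized sentences, and the function name `type_token_ratio` is illustrative.

```python
def type_token_ratio(sentences):
    """Number of distinct word types divided by the number of tokens."""
    tokens = [token for sentence in sentences for token in sentence.split()]
    if not tokens:
        raise ValueError("cannot compute a type-token ratio on an empty corpus")
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog sat"]
    print(type_token_ratio(corpus))  # 4 types over 6 tokens
```

Since the ratio decreases as a corpus grows, it is only meaningful to compare corpora of similar size, as is the case for an original corpus and its translation.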

Conclusion and Future Work
In this paper, we show that training cross-lingual word embeddings with pseudo data augmentation improves performance in BLI and downstream tasks. We analyze the reason for this improvement and find that the pseudo corpus reflects the co-occurrence statistics and content of the other language, and that this property makes the structure of the embeddings suitable for cross-lingual word mapping.
Recently, it has been shown that fully unsupervised CLWE methods fail in many language pairs, and it has been argued that researchers should not focus too much on the fully unsupervised setting. Still, our findings on improving the structural similarity of word embeddings in the fully unsupervised setting could be useful in semi-supervised settings, and we would like to investigate this direction in the future.