Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages

Introduction
Cross-lingual word representations are shared embedding spaces for two (Bilingual Word Embeddings, BWEs) or more languages (Multilingual Word Embeddings, MWEs). They have been shown to be effective for multiple tasks including machine translation (Lample et al., 2018c) and cross-lingual transfer learning (Schuster et al., 2019). They can be created by jointly learning shared embedding spaces (Lample et al., 2018a; Conneau et al., 2020) or via mapping approaches (Artetxe et al., 2018; Schuster et al., 2019). However, their quality degrades when low-resource languages are involved, since these methods require an adequate amount of monolingual data (Adams et al., 2017), which is especially problematic for languages with just a few million tokens (Eder et al., 2021).
Recent work showed that building embeddings jointly, representing common vocabulary items of the source and target languages with a single embedding, can improve representations (Wang et al., 2019; Woller et al., 2021). However, these approaches require the source and target to be related, which in practice means high vocabulary overlap. Since for many distant language pairs this requirement is not satisfied, in this paper we propose to leverage a chain of intermediate languages to overcome the large language gap. We build MWEs step-by-step, starting from the source language and moving towards the target, incorporating at each step a language that is related to the languages already in the multilingual space. Intermediate languages are selected based on their linguistic proximity to the source and target languages, as well as the availability of sufficiently large datasets.
Since our main targets are languages having just a few million tokens worth of monolingual data, we take static word embeddings (Mikolov et al., 2013a) instead of contextualized representations (Devlin et al., 2019) as the basis of our method, due to the generally larger data requirements of the latter. Additionally, the widely used mapping-based approaches (Mikolov et al., 2013b), including multilingual methods (Kementchedjhieva et al., 2018; Jawanpuria et al., 2019; Chen and Cardie, 2018), require good quality monolingual word embeddings. Thus, to incorporate a single language into the multilingual space at each step we rely on the anchor-based approach of Eder et al. (2021), which we refer to as ANCHORBWES. It builds the target embeddings and aligns them to the source space in one step using anchor points, thus producing not only cross-lingual representations but also a better quality target language space. We extend this bilingual approach to multiple languages. Instead of aligning the target language to the source in one step, we maintain a multilingual space (initialized with the source language), adding each intermediate language and finally the target language to it sequentially. This way we make sure that the language gap between the two spaces stays minimal at each step.
We evaluate our approach (CHAINMWES) on the Bilingual Lexicon Induction (BLI) task for 4 language families, including 4 very (≤ 5 million tokens) and 4 moderately low-resource (≤ 50 million) languages, and show improved performance compared to both bilingual and multilingual mapping-based baselines, as well as to the bilingual ANCHORBWES. Additionally, we analyze the importance of intermediate language quality, as well as the role of the number of anchor points during training. In summary, our contributions are the following:
• we propose to strengthen word embeddings of low-resource languages by employing a chain of intermediate related languages in order to reduce the language gap at each alignment step,
• we extend ANCHORBWES of Eder et al. (2021), which does not take the distance between the source and target languages into consideration, to multilingual word representations,
• we test our approach on multiple low-resource languages and show improved performance,
• we make our code available for public use.1

Related Work
Bilingual lexicon induction is the task of inducing word translations from monolingual corpora in two languages (Irvine and Callison-Burch, 2017), and it has become the de facto task to evaluate the quality of cross-lingual word embeddings. There are two main approaches to obtain MWEs: mapping and joint learning. Mapping approaches aim at computing a transformation matrix to map the embedding space of one language onto the embedding space of the others (Ravi and Knight, 2011; Artetxe et al., 2017; Lample et al., 2018b; Artetxe et al., 2018; Lample et al., 2018a; Artetxe et al., 2019, inter alia). Alternatively, joint learning approaches aim at learning a shared embedding space for two or more languages simultaneously (Devlin et al., 2019; Conneau et al., 2020). However, large LMs require more training data than static word embeddings, thus we focus on the latter in our work. Ruder et al. (2019) provided a survey of cross-lingual word embedding models and identified three sub-categories within static word-level alignment models: mapping-based approaches, pseudo-multilingual corpus-based approaches and joint methods, highlighting their advantages and disadvantages. To combine the advantages of mapping and joint approaches, Wang et al. (2019) proposed to first apply joint training followed by a mapping step on overshared words, such as false friends. Similarly, a hybrid approach was introduced in (Woller et al., 2021) for 3 languages, which first applies joint training on two related languages, the result of which is then mapped to the distant third language. A semi-joint approach was introduced in (Ormazabal et al., 2021) and (Eder et al., 2021), which, using a fixed pre-trained monolingual space of the source language, trains the target space from scratch by aligning embeddings close to given source anchor points. We build on (Eder et al., 2021) in our work, since it is evaluated on very low-resource languages, which are the main interest of our work.
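As a concrete illustration of the mapping family, many of the cited systems use the closed-form orthogonal (Procrustes) solution on a seed dictionary. The sketch below is our own minimal illustration of that idea, not code from any of the referenced implementations; the function name and toy data are assumptions:

```python
import numpy as np

def procrustes_map(X, Y):
    """Orthogonal matrix W minimizing ||XW - Y||_F, where the rows of X and Y
    are the source/target vectors of seed dictionary pairs.
    Closed-form solution: W = U V^T with U S V^T = SVD(X^T Y)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# toy check: if Y is an exact rotation of X, the mapping recovers that rotation
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
theta = np.pi / 3
R = np.eye(4)
R[0, 0], R[0, 1] = np.cos(theta), -np.sin(theta)
R[1, 0], R[1, 1] = np.sin(theta), np.cos(theta)
Y = X @ R
W = procrustes_map(X, Y)
assert np.allclose(X @ W, Y, atol=1e-6)
```

In practice such a map is only as good as the monolingual spaces it connects, which is exactly the weakness in low-resource settings discussed above.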
Most work on cross-lingual word embeddings is English-centric. Anastasopoulos and Neubig (2019) found that the choice of the hub language to which others are aligned can significantly affect the final performance. Other methods leveraged multiple languages to build MWEs (Kementchedjhieva et al., 2018; Chen and Cardie, 2018; Jawanpuria et al., 2019), showing that some languages can help each other to achieve improved performance compared to bilingual systems. However, these approaches rely on pre-trained monolingual embeddings, which could be difficult to train in limited resource scenarios. In our work we also leverage multiple languages, but mitigate the issue of poor quality monolingual embeddings. Søgaard et al. (2018) showed that embedding spaces do not tend to be isomorphic in the case of distant or low-resource language pairs, making the task of aligning monolingual word embeddings harder than previously assumed. Similarly, Patra et al. (2019) empirically showed that etymologically distant language pairs are hard to align using mapping approaches. A non-linear transformation was proposed in (Mohiuddin et al., 2020), which does not assume isomorphism between language pairs, and improved performance on moderately low-resource languages. However, Michel et al. (2020) showed that for a very low-resource language such as Hiligaynon, which has around 300K tokens worth of available data, good quality monolingual word embeddings cannot be trained, meaning that they cannot be aligned with other languages either. Eder et al. (2021) found that mapping approaches on languages under 10M tokens achieve under 10% P@1 when BLI is performed. In our work, we focus on such low-resource languages and propose to combine the advantages of related languages in multilingual spaces and hybrid alignment approaches.

Method
The goal of our approach is to reduce the distance between the two languages that are being aligned at any given step. Thus, instead of directly aligning the source and target languages, we incorporate a chain of intermediate related languages to reduce this distance. Our approach starts from the source language as the initial multilingual space and iteratively adds the languages in the chain until it reaches the target language. We build upon the bilingual ANCHORBWES algorithm presented in (Eder et al., 2021) by extending it to the multilingual setting. First, we discuss the ANCHORBWES approach, followed by our proposed intermediate language-based CHAINMWES method.

ANCHORBWES
The anchor-based method assumes that the source language is high-resource, thus it starts by training source monolingual word embeddings with a traditional static word embedding approach, more precisely word2vec (Mikolov et al., 2013a). Using this vector space, it trains an embedding space for the low-resource target language while aligning the two at the same time; this way the properties of the good quality source space, such as similar embeddings for words with similar meaning, are transferred to the target space. Given a seed dictionary defining word translation pairs, the source side of the pairs are defined as the anchor points. Instead of randomly initializing all target language words at the beginning of the training process, the method initializes target words in the seed dictionary using their related anchor points. The rest of the training process follows the unchanged algorithm of either CBOW or Skip-gram on the target language corpus. This approach significantly outperforms previous methods in low-resource bilingual settings, as demonstrated by strong results on both simulated low-resource language pairs (English-German) and true low-resource language pairs (English-Hiligaynon). Additionally, Eder et al. (2021) showed that not only is the cross-lingual performance improved, but the monolingual space is also of better quality compared to when the target space is trained independently of the source language.
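The initialization step described above can be illustrated with a short sketch. This is our own simplification with hypothetical names, not the authors' implementation; the subsequent CBOW/Skip-gram training pass over the target corpus, which is unchanged in ANCHORBWES, is omitted:

```python
import numpy as np

def init_target_space(tgt_vocab, dim, src_vecs, seed_dict, seed=0):
    """Initialize the target embedding matrix: small random vectors
    everywhere, except seed-dictionary words, which start at the vector
    of their source-side anchor point."""
    rng = np.random.default_rng(seed)
    E = rng.uniform(-0.5 / dim, 0.5 / dim, size=(len(tgt_vocab), dim))
    idx = {w: i for i, w in enumerate(tgt_vocab)}
    for src_word, tgt_word in seed_dict:
        if tgt_word in idx and src_word in src_vecs:
            E[idx[tgt_word]] = src_vecs[src_word]  # anchor initialization
    return E

# toy usage: German "haus" starts at the vector of its anchor "house"
src = {"house": np.array([1.0, 0.0, 0.0])}
E = init_target_space(["haus", "katze"], 3, src, [("house", "haus")])
assert np.allclose(E[0], src["house"])
```

Because training then proceeds from this aligned starting point, words that co-occur with anchored words settle near their cross-lingual counterparts as well.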

CHAINMWES
We extend ANCHORBWES by first defining a chain of languages C = [c_1, c_2, ..., c_n], starting from the high-resource source language (c_1) and ending at the low-resource target language (c_n), including intermediate languages that are related to the preceding and following nodes. As described in Section 4, we define chains in which the lower-resource languages are of the same language family. The intuition is to interleave the source and target with languages that are similar in terms of linguistic properties. After selecting the intermediate languages, our method comprises five steps as depicted in Figure 1:
1. As the first step (i = 1), we construct the initial monolingual embedding space (E_1) for the source language (c_1) using its monolingual corpus (D_1), by training a Word2Vec (Mikolov et al., 2013a) model. We consider this space the initial multilingual space (M_1 := E_1), which we extend in the following steps.
2. In the next step (i = i + 1), we collect the seed lexicon (L_i) for training embeddings for the next language in the chain (c_i) by concatenating the seed lexicons pairing c_i with all the languages before it in the chain. More precisely: L_i = ∪_{k=1}^{i-1} l_{k,i}, where l_{k,i} is the seed lexicon between languages k and i. Since Eder et al. (2021) showed that ANCHORBWES performs better as the number of available anchor points increases, our goal is to take all available anchor points already in M_{i-1}.
3. We apply ANCHORBWES using M_{i-1} as the source embedding space, D_i as the training corpus and L_i as the anchors to build embeddings (E_i) for c_i.
4. Since ANCHORBWES builds embeddings for c_i that are aligned with the maintained multilingual space, we simply concatenate them to form M_i.
5. We go to step 2 until the target language is reached.
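The five steps above can be sketched as a loop. This is a minimal sketch with our own hypothetical names: `train_anchor_bwe` stands in for a full ANCHORBWES training run, and in practice words would need a language-code prefix to avoid cross-lingual string collisions:

```python
import numpy as np

def chain_mwe(src_vecs, chain, corpora, lexicons, train_anchor_bwe):
    """Build a multilingual space by walking the language chain.
    lexicons[(a, b)]: list of (a-word, b-word) translation pairs.
    train_anchor_bwe(M, corpus, anchors): stand-in for ANCHORBWES,
    returning a word->vector dict for the new language aligned with M."""
    M = dict(src_vecs)  # step 1: start from the source space (M_1 := E_1)
    for i, lang in enumerate(chain[1:], start=1):
        # step 2: accumulate anchors from every language already in M
        anchors = {}
        for prev in chain[:i]:
            for w_prev, w_new in lexicons.get((prev, lang), []):
                if w_prev in M:
                    anchors[w_new] = M[w_prev]
        # step 3: train the new language against the multilingual space
        E = train_anchor_bwe(M, corpora.get(lang), anchors)
        # step 4: concatenate the new embeddings into the space
        M.update(E)
    return M  # step 5 is the loop: repeat until the target is added
```

The anchor accumulation in step 2 is exactly why each newly added language can draw on all previously added languages, not just its direct predecessor.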
By strategically integrating intermediate languages, we improve the quality of the multilingual space by making sure that the distance between the two languages at any alignment step is minimal. Our experiments show that without the intermediate languages the quality of the embeddings built by ANCHORBWES is negatively affected by the large gap between the source and target.

Experimental Setup
In this section, we describe the experimental setup, including the selection of languages, datasets, and model parameters used in our study.

Data
We select four language families from different geographic locations for evaluation. Figure 2 depicts the language similarities in 2D using lang2vec language embeddings based on their syntactic features (Malaviya et al., 2017). We discuss their relevance to the final results in Section 5. Although we selected low-resource target and intermediate languages based on language families, we stepped over their boundaries in order to have intermediate languages related to the source language as well, by considering the influence some languages had on others, e.g., during the colonial era. Our source language is English in each setup, and we sort the intermediate languages based on their monolingual corpora sizes. We present the exact chains of these languages in Section 5.
Austronesian We select two languages spoken in the Philippines: Tagalog as a moderately and Hiligaynon as a very low-resource target language, with Indonesian and Spanish as the intermediates. Spanish, being an Indo-European language, is related to English. Additionally, due to colonization, it influenced the selected Austronesian languages to varying degrees. Furthermore, Indonesian, Tagalog and Hiligaynon show similarities, especially the two languages of the Philippines, due to their close geographic proximity.
Turkic We select Turkic languages using the Cyrillic script: Kazakh as a moderately, and Chuvash and Yakut as very low-resource languages. Since they use the Cyrillic alphabet and are mostly spoken in Russia, we use Russian as the intermediate language. Due to Russian being high-resource, it can be well aligned with English.
Scandinavian We select Icelandic and Faroese as two very low-resource languages, with Norwegian and Swedish as the intermediates, which are related to both of them and to English.
Atlantic-Congo Finally, we select Swahili as a moderately low-resource language, which has a high number of loanwords from Portuguese and German, which we take as the intermediate languages. We note that we experimented with the very low-resource Zulu and Xhosa languages as well; however, due to difficulties acquiring good quality lexicons for training and evaluation, we achieved near zero performance, thus we do not present them in this paper.
The embeddings were trained on Wikipedia dumps for all languages except Hiligaynon, which was trained on the corpus used in (Michel et al., 2020) for comparability. Hiligaynon is extremely low-resource, having 345K tokens in its monolingual corpus. Corpus sizes for each language are presented in Table 1. Bilingual dictionaries for training and testing are taken from the Wiktionary-based resource released in (Izbicki, 2022). As mentioned in the previous section, at each iteration of our approach we take training dictionaries between the current language and all languages which are already in the multilingual vector space. Since Izbicki (2022) only releases resources for English paired with various target languages, we build dictionaries for the other language pairs through pivoting, more precisely: l_{k,i} = {(trg_{e,k}, trg_{e,i}) | (src_{e,k}, trg_{e,k}, src_{e,i}, trg_{e,i}) ∈ l_{e,k} × l_{e,i}, src_{e,i} = src_{e,k}}, where l_{e,x} is a dictionary between English (e) and an arbitrary language (x), while src_{x,y} and trg_{x,y} are the source (x) and target (y) sides of a translation pair. The number of dictionary entries for each language pair is presented in Table 2.
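The pivoting construction can be sketched in a few lines (the function name and toy dictionary entries below are our own illustrations, not entries from the actual resource):

```python
def pivot_dictionary(l_ek, l_ei):
    """Build a k->i seed lexicon by pivoting through English: pair the
    foreign words of entries that share the same English source word.
    l_ek, l_ei: lists of (english_word, foreign_word) pairs."""
    by_src = {}
    for src, tgt in l_ek:
        by_src.setdefault(src, set()).add(tgt)
    out = set()
    for src, tgt_i in l_ei:
        for tgt_k in by_src.get(src, ()):
            out.add((tgt_k, tgt_i))
    return sorted(out)

pairs = pivot_dictionary(
    [("dog", "perro"), ("cat", "gato")],   # toy English-Spanish entries
    [("dog", "aso"), ("water", "tubig")],  # toy English-Tagalog entries
)
# only "dog" is shared, so the pivoted Spanish-Tagalog lexicon is one pair
assert pairs == [("perro", "aso")]
```

Note that pivoting through a polysemous English word can produce noisy pairs, which is an inherent limitation of this construction.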

Baselines and Model Parameters
We compare our approach to the mapping-based bilingual VecMap (Artetxe et al., 2018) and multilingual UMWE (Chen and Cardie, 2018) approaches. Additionally, we run ANCHORBWES (Eder et al., 2021) as our joint alignment baseline. We trained word2vec embeddings (Mikolov et al., 2013a) with a maximum vocabulary size of 200 000 in every setup, i.e., for the mapping-based baselines as well as in ANCHORBWES and CHAINMWES. The training was performed using standard hyperparameters included in the Gensim Word2Vec package (Řehůřek and Sojka, 2010): a context window of 5, dimensionality of 300 and 5 epochs, with the exception that we used a minimum word frequency of 3 due to the small corpora for the target languages. Additionally, since Eder et al. (2021) showed that CBOW outperforms SG in ANCHORBWES, we used the former in our experiments. We use the MUSE evaluation tool (Lample et al., 2018b) to report precision at 1, 5, and 10 using nearest neighbor search. For the mapping-based approaches we leverage the CSLS similarity score, as it was shown to perform better by handling the hubness problem (Lample et al., 2018b). However, similarly to (Woller et al., 2021), we found that jointly trained embeddings do not benefit from the CSLS method, thus we use simple cosine similarity (NN) based search for both ANCHORBWES and CHAINMWES.
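The cosine nearest-neighbor retrieval used for ANCHORBWES and CHAINMWES can be sketched as follows. This is a simplified stand-in for the MUSE evaluation with our own names, not its actual code; the CSLS variant used for the mapping baselines would additionally discount each similarity by the mean similarity of the candidates to their own nearest neighbors:

```python
import numpy as np

def precision_at_k(src_vecs, tgt_vecs, gold, k=1):
    """P@k by cosine nearest-neighbour retrieval over the target matrix.
    src_vecs/tgt_vecs: dicts word->vector; gold: list of (src, tgt) pairs."""
    tgt_words = list(tgt_vecs)
    T = np.stack([tgt_vecs[w] for w in tgt_words])
    T = T / np.linalg.norm(T, axis=1, keepdims=True)  # unit-normalize rows
    hits = 0
    for s, t in gold:
        q = src_vecs[s] / np.linalg.norm(src_vecs[s])
        top = np.argsort(-(T @ q))[:k]  # indices of the k nearest targets
        hits += t in {tgt_words[i] for i in top}
    return hits / len(gold)
```

A source word counts as a hit if its gold translation appears anywhere in the top-k retrieved candidates, which is why P@5 and P@10 probe the quality of a word's wider neighborhood rather than only its single nearest neighbor.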

Results
We present our results in Table 3, split into the moderately and very low-resource language groups and sorted based on the size of available monolingual data for each target language (Table 1). Overall, the results show the difficulty of building cross-lingual word embeddings for the selected target languages, since the performance is much lower than for high-resource languages in general, which for example is around 50% P@1 for English-German on the Wiktionary evaluation set (Izbicki, 2022). Comparing the multilingual UMWE approach to the bilingual VecMap, the results support the use of related languages, since they improve the performance on most source-target language pairs. However, this is most apparent on the moderately low-resource languages. The results on the very low-resource languages are very poor for the mapping-based approaches, which as discussed depend on the quality of pre-trained monolingual embeddings. In contrast, the semi-joint anchor-based approaches can significantly improve the embedding quality, showing their superiority in the very low-resource setups.
Our proposed CHAINMWES method outperforms mapping-based approaches on 7 out of 8 target languages, and ANCHORBWES on 6 target languages, which is most apparent when retrieving more than one translation candidate (P@5 and P@10). Interestingly, when looking at P@1 the systems are close to each other, indicating that our method improves the general neighborhood relations of the embedding space instead of just improving the embeddings of a few individual words. This is further supported in the case of Kazakh and Icelandic, where UMWE outperforms CHAINMWES in terms of P@1, however it performs worse when a larger neighborhood is leveraged for the translation. This property is caused by the combination of the semi-joint anchor-based training, instead of relying on independently trained monolingual spaces, and the smaller distances between aligned languages.
When comparing moderately and very low-resource languages, we found similar trends in the two groups. In both cases CHAINMWES outperforms ANCHORBWES on 3 out of 4 languages; however, in the case of Hiligaynon, which has less than 1 million tokens, the results are mixed, i.e., ANCHORBWES tends to perform better when the smaller neighborhood of P@5 is considered, but it is the opposite when P@10 is measured.

Furthermore, UMWE tends to be more competitive with ANCHORBWES on the moderately low-resource languages, e.g., it performs better in the case of Kazakh, while it does not improve over CHAINMWES. Overall, however, we found no strong correlation between the available monolingual resources for a given language and on which target language CHAINMWES achieved the best results, since the two cases where it did not improve over the baselines are the 3rd (Yakut) and 5th (Swahili) lowest resource languages. Looking at the visualization of language embeddings in Figure 2, the negative results on Swahili can be explained by the relatively large distance between its two intermediate pairs. Although Swahili has a large number of German and Portuguese loanwords, the syntactic properties of the languages seem to be too different. Similarly, Yakut (sah) is the furthest away from Russian, which could explain our negative results.

Table 4: Experiments on adding related moderately low-resource languages to the language chains of very low-resource languages.

Adding Moderate Resource Languages
Since some moderately low-resource languages are related to the very low-resource ones (Kazakh to Yakut 2, Icelandic to Faroese and Tagalog to Hiligaynon), we add them to the language chains in the experiments presented in Table 4. The results show that although these languages are closely related, they do not contribute positively to the quality of the resulting MWEs. These results indicate that the languages involved in the language chains as intermediate steps should have good quality embeddings (the BLI P@5 performance for Russian, Swedish, Norwegian and Spanish ranges between 45% and 65%), thus embedding quality is more important than language closeness. Additionally, Figure 2 shows that Tagalog is less similar to Indonesian and Spanish than to Hiligaynon, and Icelandic is less similar to Faroese than to Norwegian or Swedish.

Ablation Study
An advantage of the sequential nature of our approach is that as we add more languages to the multilingual space step-by-step, the number of potential anchor points for aligning the next language in line increases. We exploit this by accumulating all word translation pairs from the dictionaries between all languages already in the multilingual space and the currently trained language (Step 2). Although this requires dictionaries between all language pairs, we mitigated this requirement by pivoting through English. In Table 5 we present an ablation study, where we turn dictionary accumulation off by using dictionaries only between the trained language and its preceding neighbor. The results show that this has a sizable impact on the performance. Although there are a few cases where P@1 is marginally improved (Icelandic, Swahili, Chuvash and Yakut), both P@5 and P@10 are decreased in most cases, even where P@1 is improved, except for Chuvash. The least impacted by the accumulated dictionaries are the Turkic languages, which indicates their strong relation to Russian and distance from English, which could stem from their different scripts. Overall, these findings align with the results of (Eder et al., 2021), who showed that the embedding quality improves as more dictionary entries are available.

2 Kazakh is also related to Chuvash, which we omitted in these experiments due to low results on Chuvash in general.

Table 5: Results of the ablation experiments, where we turn training dictionary accumulation off in CHAINMWES*, by using only the dictionary between a given language and its preceding neighbor.

Conclusion
In this paper we proposed CHAINMWES, a novel method for enhancing multilingual embeddings of low-resource languages by incorporating intermediate languages to bridge the gap between distant source and target languages. Our approach extends ANCHORBWES, the bilingual approach of Eder et al. (2021), to MWEs by employing chains of related languages. We evaluated CHAINMWES on 4 language families involving 4 moderately and 4 very low-resource languages using bilingual lexicon induction. Our results demonstrate the effectiveness of our method, showing improvements on 6 out of 8 target languages compared to both the bilingual and multilingual mapping-based baselines and the ANCHORBWES baseline. Additionally, we show the importance of involving only those intermediate languages for which building good quality embeddings is possible.

Limitations
One limitation of our work is the manual selection of intermediate languages. Although the selection and ordering of languages in the chains was straightforward based on language family information, such as Glottolog (Nordhoff and Hammarström, 2011), and available data size, it is possible that other languages which we did not consider in our experiments are also helpful in improving the quality of MWEs. Additionally, we did not consider all possible orderings of intermediate languages, such as English→Norwegian→Swedish→Faroese instead of English→Swedish→Norwegian→Faroese, in order to save resources. Thus, a wider range of chains could uncover further improvements.

Figure 1: Visual depiction of our CHAINMWES method. The resulting embedding (M_n in green) is multilingual, involving all languages in the chain.

Figure 2: Visualization of language embeddings using lang2vec syntax features. Colors indicate different language families: Austronesian in turquoise, Turkic in green, Scandinavian in yellow and Atlantic-Congo in blue.

Table 1: Selected intermediate as well as moderately and very low-resource languages. Monolingual corpora sizes are shown in millions.

Table 2: Number of unique words in the train and test dictionaries of the used language pairs.

Table 3: Precision at k ∈ {1, 5, 10} values for the target languages paired with English as the source in each case. The Intermediate column shows the languages between the source and target (e.g., line 2 shows the chain English→Russian→Kazakh).