Evaluating a Joint Training Approach for Learning Cross-lingual Embeddings with Sub-word Information without Parallel Corpora on Lower-resource Languages

Cross-lingual word embeddings provide a way for information to be transferred between languages. In this paper we evaluate an extension of a joint training approach to learning cross-lingual embeddings that incorporates sub-word information during training. This method could be particularly well-suited to lower-resource and morphologically-rich languages because it can be trained on modest-size monolingual corpora, and is able to represent out-of-vocabulary words (OOVs). We consider bilingual lexicon induction, including an evaluation focused on OOVs. We find that this method achieves improvements over previous approaches, particularly for OOVs.


Introduction
Word embeddings are an essential component in systems for many natural language processing tasks such as part-of-speech tagging (Al-Rfou' et al., 2013), dependency parsing (Chen and Manning, 2014) and named entity recognition (Pennington et al., 2014). Cross-lingual word representations provide a shared space for word embeddings of two languages, and make it possible to transfer information between languages (Ruder et al., 2019).

A common approach to learning cross-lingual embeddings is to learn a matrix that maps the embeddings of one language to another using supervised (e.g., Mikolov et al., 2013b), semi-supervised (Artetxe et al., 2017), or unsupervised (e.g., Lample et al., 2018) methods. These methods rely on the assumption that the geometric arrangement of embeddings in different languages is the same. However, it has been shown that this assumption does not always hold, and that methods which instead jointly train embeddings for two languages produce embeddings that are more isomorphic and achieve stronger results for bilingual lexicon induction (BLI; Ormazabal et al., 2019), a well-known intrinsic evaluation for cross-lingual word representations (Ruder et al., 2019; Anastasopoulos and Neubig, 2020). The approach of Ormazabal et al. uses a parallel corpus as a cross-lingual signal. Parallel corpora are, however, unavailable for many language pairs, particularly low-resource languages.

Duong et al. (2016) introduce a joint training approach that extends CBOW (Mikolov et al., 2013a) to learn cross-lingual word embeddings from modest-size monolingual corpora, using a bilingual dictionary as the cross-lingual signal. Bilingual dictionaries are available for many language pairs; e.g., Panlex (Baldwin et al., 2010) provides translations for roughly 5700 languages. These training resource requirements suggest this method could be well-suited to lower-resource languages. However, this word-level approach is unable to form representations for out-of-vocabulary (OOV) words, which could be particularly common in the case of low-resource, and morphologically-rich, languages.
Hakimi Parizi and Cook (2020b) propose an extension of Duong et al. (2016) that incorporates sub-word information during training and can therefore generate representations for OOVs in the shared cross-lingual space. This method also does not require parallel corpora for training, and could therefore be particularly well-suited to lower-resource, and morphologically-rich, languages. However, Hakimi Parizi and Cook only evaluate on synthetic low-resource languages. We refer to the methods of Duong et al. and Hakimi Parizi and Cook as DUONG2016 and HAKIMI2020, respectively.
Most prior work on BLI focuses on in-vocabulary (IV) words and well-resourced languages (e.g., Artetxe et al., 2017; Ormazabal et al., 2019; Zhang et al., 2020), although there has been some work on OOVs (Hakimi Parizi and Cook, 2020a) and low-resource languages (Anastasopoulos and Neubig, 2020). In this paper, we evaluate HAKIMI2020 on BLI for twelve lower-resource languages, and also consider an evaluation focused on OOVs. Our results indicate that HAKIMI2020 gives improvements over DUONG2016 and several strong baselines, particularly for OOVs.

Joint Training Incorporating Sub-word Information

Equation 1 shows the cost function for DUONG2016, which jointly learns embeddings for a word $w_i$ and its translation $\bar{w}_i$, where $h_i$ is a vector encoding the context, $\alpha$ is a weight parameter, and $D_s$ and $D_t$ are the source and target language vocabularies, respectively:

$$\sum_{i=1}^{|C|} \Big[ \log P(w_i \mid h_i) + \alpha \log P(\bar{w}_i \mid h_i) \Big], \qquad P(w \mid h_i) = \frac{\exp(u_w^{\top} h_i)}{\sum_{w' \in D_s \cup D_t} \exp(u_{w'}^{\top} h_i)} \qquad (1)$$

where $u_w$ is the embedding of word $w$ and $C$ is the training corpus.
Following Bojanowski et al. (2017), HAKIMI2020 modifies Equation 1 by including sub-word information during the joint training process as follows:

$$u_w = \frac{1}{|G_w|} \sum_{g \in G_w} z_g \qquad (2)$$

where $G_w$ is the set of sub-words appearing in $w$ and $z_g$ is the sub-word embedding for $g$. $h$ is calculated by averaging the representations of the words appearing in the context, where each word is itself represented by the average of its sub-word embeddings.

HAKIMI2020 uses character n-grams as sub-words. Specifically, each word is augmented with special beginning- and end-of-word markers, and then represented as a bag of character n-grams, using n-grams of length 3-6 characters. The entire word itself (with beginning and end of word markers) is also included among the sub-words.
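To make the sub-word scheme concrete, the following sketch (ours, not the authors' released code) extracts the bag of character n-grams for a word and composes word and context representations by averaging, as described above; the `subword_emb` lookup table and the `dim` parameter are assumptions for illustration.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Bag of character n-grams for a word, as in Bojanowski et al.
    (2017): the word is wrapped in boundary markers, and the whole
    marked word is itself included as a sub-word."""
    marked = "<" + word + ">"
    grams = {marked}
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

def word_vector(word, subword_emb, dim):
    """A word's representation: the average of the embeddings of its
    sub-words (a zero vector if none of its sub-words are known)."""
    vecs = [subword_emb[g] for g in char_ngrams(word) if g in subword_emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def context_vector(context_words, subword_emb, dim):
    """h: the average of the representations of the context words."""
    return np.mean([word_vector(w, subword_emb, dim) for w in context_words], axis=0)
```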

Experimental Setup
We consider BLI from twelve lower-resource source languages to English. The languages (shown in Table 1) were selected to cover a variety of language families, while having small to medium size Wikipedias and BLI evaluation datasets available. We compare HAKIMI2020 with DUONG2016, VECMAP (Artetxe et al., 2018), and MEEMI (Doval et al., 2018). In each case, we use cosine similarity to find the closest target language translations for a source language word. We evaluate using precision@N (Ruder et al., 2019) for N = 1, 5, 10.
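As a concrete illustration of this evaluation, the sketch below retrieves translations by cosine similarity and computes precision@N; the data structures (a source-side vector lookup, a target embedding matrix with an aligned word list, and gold translation sets) are our own assumptions, not a prescribed format.

```python
import numpy as np

def precision_at_n(src_vectors, tgt_matrix, tgt_words, gold, n_values=(1, 5, 10)):
    """BLI by cosine nearest neighbours in the shared space.

    src_vectors: {source word: vector in the shared space}
    tgt_matrix:  target-language vectors, rows aligned with tgt_words
    gold:        {source word: set of acceptable translations}
    """
    tgt_norm = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    hits = {n: 0 for n in n_values}
    for word, translations in gold.items():
        v = src_vectors[word]
        sims = tgt_norm @ (v / np.linalg.norm(v))  # cosine similarities
        ranked = np.argsort(-sims)
        for n in n_values:
            # A hit if any gold translation appears in the top-N candidates.
            if {tgt_words[i] for i in ranked[:n]} & translations:
                hits[n] += 1
    return {n: hits[n] / len(gold) for n in n_values}
```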

Training Corpora and Dictionaries
The corpus for each language is a Wikipedia dump from 27 July 2020, cleaned using tools from Bojanowski et al. (2017), and tokenized using EuroparlExtract (Ustaszewski, 2019), except for Bengali and Hindi, which are tokenized using NLTK (Bird et al., 2009). Because DUONG2016 and HAKIMI2020 can learn high quality cross-lingual embeddings from monolingual corpora of only 5M sentences each, we down-sample the English corpus for these two methods to 5M sentences. DUONG2016 benefits from a relatively large training dictionary (Duong et al., 2016); therefore, for DUONG2016 and HAKIMI2020 we follow Duong et al. in creating large training dictionaries by extracting translation pairs from Panlex. Details of the training corpora and Panlex dictionaries are shown in Table 1.
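The paper does not specify how the English corpus is down-sampled; a minimal sketch, assuming uniform sampling of sentences from a one-sentence-per-line file via reservoir sampling:

```python
import random

def downsample(corpus_path, out_path, n_sentences=5_000_000, seed=0):
    """Reservoir-sample n_sentences lines from a one-sentence-per-line corpus."""
    random.seed(seed)
    reservoir = []
    with open(corpus_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < n_sentences:
                reservoir.append(line)
            else:
                # Replace an existing line with decreasing probability,
                # so every line is kept with equal probability overall.
                j = random.randrange(i + 1)
                if j < n_sentences:
                    reservoir[j] = line
    with open(out_path, "w", encoding="utf-8") as out:
        out.writelines(reservoir)
```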

Baselines
We compare against two baselines: VECMAP (Artetxe et al., 2018), a supervised mapping-based method, and MEEMI (Doval et al., 2018), a post-processing method. We consider various training corpora and dictionaries to create strong baselines.
Supervised mapping-based approaches tend to see a reduction in performance with seed lexicons larger than roughly 5k pairs (Vulić and Korhonen, 2016). We therefore use training translation pairs from MUSE (Lample et al., 2018), except for Azerbaijani, which is not included in MUSE; for this language we use training pairs from Anastasopoulos and Neubig (2020). We first train VECMAP using these MUSE pairs, and embeddings learned from the full English corpus, to give this baseline access to as much training data as is available. We then consider this approach using the down-sampled English corpus instead. We found that the smaller English corpus gave higher precision@N (for N = 1, 5, and 10) for both the IV and OOV evaluations in Section 4; this could be due to the smaller corpus having a smaller vocabulary. We then also consider VECMAP trained using Panlex training pairs and embeddings learned from the down-sampled English corpus.
We next consider MEEMI applied to each of the three sets of cross-lingual embeddings obtained from VECMAP. In each case we train MEEMI using the same training pairs (MUSE or Panlex) that were used to train VECMAP. In Section 4 we report results for the baseline that performs best.

Hyper-Parameter Settings
Hakimi Parizi and Cook (2020b) show that DUONG2016 performs best using its default parameters, i.e., an embedding size of 200 and window size of 48, but that HAKIMI2020 performs better using an embedding size of 300 and window size of 20. We use these parameter settings here.
fastText is used to train monolingual embeddings for VECMAP and MEEMI. We use skip-gram with its default settings, except that the dimensionality of the embeddings is set to 300 (Bojanowski et al., 2017).
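For reference, this configuration corresponds to the following call to the fastText Python bindings (the corpus path and output name are placeholders; all other skip-gram hyper-parameters are left at their defaults):

```python
import fasttext

# dim=300 overrides the default dimensionality of 100;
# "skipgram" selects the skip-gram model with sub-word n-grams.
model = fasttext.train_unsupervised("source_corpus.txt", model="skipgram", dim=300)
model.save_model("source.bin")
```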

Experimental Results
In this section, we present results for BLI for IV words, and then for OOV source language words.

BLI for In-Vocabulary Words
For these experiments we use MUSE test data for all languages except Azerbaijani, for which we use test data from Anastasopoulos and Neubig (2020). In each case, we restrict the test set to translation pairs for which the source and target words are in the embedding matrices learned from our corpora. We compare HAKIMI2020 with DUONG2016 and MEEMI trained using the down-sampled English corpus and MUSE training pairs, which performed best of the baselines considered for each evaluation measure. Results are shown in Table 2. HAKIMI2020 improves over DUONG2016, indicating that DUONG2016 can indeed be improved by incorporating sub-word information during training. Comparing HAKIMI2020 and MEEMI, the results are more mixed: in terms of precision@1, MEEMI substantially outperforms HAKIMI2020, although for precision@10 HAKIMI2020 outperforms MEEMI.

BLI for OOVs
Following Hakimi Parizi and Cook (2020a) we use Panlex to construct a test dataset of translation pairs in which the source language words are OOV and the target language words are IV. However, Hakimi Parizi and Cook observe that some translations in Panlex are noisy. To avoid noisy translations, we use all translation pairs for which the source language word is OOV with respect to the embedding matrix (i.e., the embedding models have no direct knowledge of the word) but attested in the source language corpus (i.e., there is evidence that it is indeed a word in the source language). The resulting test datasets range from 806 translation pairs for Azerbaijani to roughly 11k pairs for Hungarian.
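A sketch of this filtering step, with hypothetical container names (the Panlex pair list and the vocabularies of the embedding matrices and source corpus are assumed inputs):

```python
def build_oov_test_set(panlex_pairs, src_matrix_vocab, tgt_matrix_vocab, src_corpus_vocab):
    """Keep Panlex pairs whose source word is OOV for the embedding
    matrix but attested in the source corpus, and whose target word
    is in-vocabulary."""
    test_pairs = []
    for src_word, tgt_word in panlex_pairs:
        if (src_word not in src_matrix_vocab        # model has no direct knowledge of the word
                and src_word in src_corpus_vocab    # but it does occur in the corpus
                and tgt_word in tgt_matrix_vocab):  # target side is IV
            test_pairs.append((src_word, tgt_word))
    return test_pairs
```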
Here we compare against the VECMAP baseline using the down-sampled English corpus and Panlex training pairs, which performed best of the baselines considered for each evaluation measure. For VECMAP, we follow Hakimi Parizi and Cook (2020a) by forming a representation for the OOV source language word from its sub-word embeddings, and then mapping it into the shared space. We cannot, however, compare directly against DUONG2016 because it is a word-level approach that cannot represent OOVs. We therefore instead compare against a baseline in which the OOV source language word is copied into the target language. This approach, referred to as COPY, could work well in the case of borrowings and named entities.

Table 3 shows the results. HAKIMI2020 outperforms VECMAP for all languages and evaluation measures. This finding suggests that sub-word information can be more effectively transferred in a cross-lingual setting when sub-words are incorporated into the training process (as is the case for HAKIMI2020) than when they are not (as for VECMAP here). Comparing HAKIMI2020 to COPY, although there are several languages for which COPY outperforms HAKIMI2020, on average HAKIMI2020 performs better. In the cases that COPY outperforms HAKIMI2020, this appears to be largely related to the presence of English abbreviations in the source language Wikipedia dump.
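A sketch of these two OOV strategies, assuming a trained fastText model and a learned VECMAP mapping matrix `W` applied as `v @ W` (the orientation of `W` is our assumption):

```python
import numpy as np

def vecmap_oov_vector(word, ft_model, W):
    """Build a vector for an OOV source word from its sub-word
    embeddings (fastText composes this from character n-grams),
    then map it into the shared space with the learned matrix W."""
    v = ft_model.get_word_vector(word)
    return v @ W

def copy_baseline(src_word):
    """COPY: propose the source word itself as its translation,
    which can succeed for borrowings and named entities."""
    return [src_word]
```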
Because of the relatively strong performance of COPY on several languages, we propose an approach that combines COPY and HAKIMI2020, referred to as HAKIMI2020+COPY. Given a source language word, we first check whether it is in the target language embedding matrix. If so, we assume it is a word that does not require translation (e.g., a named entity) and copy it into the target language. If the source language word is not in the target language embedding matrix, we apply HAKIMI2020 to find the target language translation under this model. This approach improves over both COPY and HAKIMI2020 for all languages except Bengali, and gives substantial improvements on average. Although COPY is a very simple approach, it is complementary to HAKIMI2020, and the two approaches can be effectively combined to improve BLI for OOVs.

Conclusions

In this paper we evaluated a joint training approach to learning cross-lingual word embeddings that incorporates sub-word information, does not require parallel corpora, can be trained on modest amounts of monolingual data and can represent OOVs. In two BLI tasks for twelve lower-resource languages focused on IV words and OOVs, we found that this method improved over previous approaches, particularly for OOVs. Evaluation data and code for learning the cross-lingual embeddings is available.

In future work we plan to explore the impact of the target language on the quality of the cross-lingual embeddings, and in particular to consider source and target languages from the same family. We further intend to evaluate these cross-lingual embeddings in down-stream tasks for low-resource languages, such as language modelling (Adams et al., 2017) and part-of-speech tagging (Fang and Cohn, 2017), and to compare against approaches based on contextualized language models.