Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora

We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. By sharing model parameters among different languages, our model jointly trains the word embeddings in a common cross-lingual space. We also propose to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo, and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an encoder-decoder translation model is beneficial for learning cross-lingual representations even in extremely low-resource conditions. Furthermore, our model also works well in high-resource conditions, achieving state-of-the-art performance on a German-English word-alignment task.


Introduction
Cross-lingual word embedding learning has the goal of learning representations for words of different languages in a common space (Mikolov et al., 2013b; Conneau et al., 2018; Levy et al., 2017). Cross-lingual representations are beneficial for finding correspondences between languages, and are utilised in many downstream tasks such as machine translation (Lample et al., 2018; Artetxe et al., 2018b) and cross-lingual named entity recognition (Xie et al., 2018).
The recent trend in cross-lingual embedding models is to leverage an enormous amount of monolingual data for each of the target languages, e.g. by training word embeddings monolingually and mapping them into a common space. Another approach is to jointly train cross-lingual word embeddings in the same space. Recently, cross-lingual masked language models such as mBERT (Devlin et al., 2019) have succeeded in learning cross-lingual representations using large-scale monolingual data for multiple languages.
However, when dealing with endangered languages, there is generally no such large-scale corpus of monolingual data, and when trained under low-resource settings, modern pretraining methods do not perform well (Hu et al., 2020). In this paper, we propose a joint-training method that learns contextualised cross-lingual word embeddings using a small parallel corpus, of the scale and form constructed by field linguists for language documentation purposes. Compared to previous models based on parallel corpora, our model has two strengths: (1) while previous models extend bag-of-words models such as Skip-Gram (Mikolov et al., 2013a) and capture only rudimentary word order information, our model encodes sentences with LSTMs (Hochreiter and Schmidhuber, 1997) and generates contextualised word embeddings; and (2) our model trains subword-aware word embeddings and captures orthographic similarities among the languages.
We perform evaluation over bilingual lexicon induction and word alignment. Both tasks are extremely important for facilitating language documentation, revitalisation, and education of endangered languages. We run experiments targeting three endangered languages (Yongning Na, Shipibo-Konibo, and Griko; Section 3.1) as well as four high-resource language pairs, and show that our model substantially outperforms strong baselines for most language pairs.

Model Architecture
Our proposed model is based on an LSTM2 encoder-decoder model with attention (Luong et al., 2015b), trained with translation and reconstruction objectives (Figure 1). Suppose our model encodes a sentence $\langle x^s_1, \dots, x^s_N \rangle$ in the source language $s$ and decodes a sentence $\langle y^t_1, \dots, y^t_M \rangle$ in the target language $t$. The encoder employs bi-directional LSTMs $f$, which are shared among all languages:

$r^s_i = E^s x^s_i, \qquad (u^s_1, \dots, u^s_N) = f(r^s_1, \dots, r^s_N) \qquad (1)$

where $x^s_i$ denotes a one-hot vector. In cross-lingual tasks, we employ $r^s_i$ and $u^s_i$ as the static and contextualised word embeddings of $x^s_i$. Given the encoder states $u^s$, the decoders $\overrightarrow{g}^t$ and $\overleftarrow{g}^t$ translate (when $s \neq t$) or reconstruct (when $s = t$) the input sentence left-to-right and right-to-left. We train separate decoders for each language and direction to allow for differences in word order.3 Similar to ELMo (Peters et al., 2018), decoding is performed independently in the two directions (Eqns. (2) and (3)). The output layer and attention mechanism are shared across the two directions (Eqns. (4)-(7)), where $N$ is the number of words in the source sentence $x^s$. In Eqn. (4), we use the word embedding parameters $E^t$ for the output layer (weight tying: Inan et al. (2017); Press and Wolf (2017)). This technique substantially reduces the number of language-specific parameters, encouraging the model to use the same space across languages. When calculating the attention weights in Eqn. (7), the model uses dot products of the encoder and decoder hidden states to encourage them to lie in the same embedding space. In Eqn. (6), our model attends to the word embeddings as well as the hidden states to capture more direct relations between source and target word embeddings, i.e. $r^{t\top}_i W \bar{r}^s_i$ directly contributes to the probability $p(y^t_i)$.4 Furthermore, we employ very aggressive dropout (Srivastava et al., 2014), applied to all the input and output word embeddings $E^\ell$ in Eqns. (1), (3), and (4), as well as to $\bar{u}^s_i + \bar{r}^s_i + h_i$ in Eqn. (5) before the linear transformation, with the dropout rate set to 0.5 throughout. We show that this strong regularisation leads to better cross-lingual representations.
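To make the shared-encoder design concrete, the following is a minimal PyTorch sketch, not the authors' exact implementation: per-language embedding tables $E^\ell$ feed a single bi-directional LSTM $f$ shared across all languages, with the aggressive dropout of 0.5 applied to the embeddings. Class names, dimensions, and vocabulary handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Sketch of the shared encoder: language-specific embeddings E^l,
    one bi-directional LSTM f shared among all languages, and aggressive
    dropout (p=0.5) on the input embeddings."""

    def __init__(self, vocab_sizes: dict, emb_dim: int = 300, hid_dim: int = 300):
        super().__init__()
        # One embedding table per language; under weight tying these same
        # tables also serve as the decoder output layers.
        self.emb = nn.ModuleDict(
            {lang: nn.Embedding(v, emb_dim) for lang, v in vocab_sizes.items()})
        # A single bi-LSTM whose parameters are shared across languages.
        self.f = nn.LSTM(emb_dim, hid_dim // 2, bidirectional=True, batch_first=True)
        self.drop = nn.Dropout(0.5)  # "very aggressive" dropout

    def forward(self, x, lang: str):
        r = self.emb[lang](x)        # static word embeddings r^s_i
        u, _ = self.f(self.drop(r))  # contextualised word embeddings u^s_i
        return r, u

# Usage: encode a 5-token sentence for a language tagged "en".
enc = SharedEncoder({"en": 1000, "nru": 800})
r, u = enc(torch.randint(0, 1000, (1, 5)), lang="en")
print(r.shape, u.shape)  # torch.Size([1, 5, 300]) torch.Size([1, 5, 300])
```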

Shared Subword Embeddings
To incorporate orthographic information into word embeddings, we propose a simple yet effective method to combine word and subword embeddings, inspired by FastText (Bojanowski et al., 2017). For each word $w^\ell_i$, we calculate its subword-aware word embedding $\tilde{E}^\ell_{w_i}$ as follows:

$\tilde{E}^\ell_{w_i} = E^\ell_{w_i} + F(\{Z_k \mid k \in Q(w_i)\}) \qquad (8)$

where $F(\cdot)$ denotes the subword encoding function, $Z_k$ denotes the $k$-th subword embedding, and $Q(w_i)$ denotes the indices of the subwords included in $w_i$. The subword embeddings $Z$ are shared among all languages, capturing orthographic similarities across languages. For the encoding function $F(\cdot)$, we experiment with two methods:5 (1) average pooling ("SW_ave"); and (2) applying a convolutional neural network (CNN), shared among all languages, followed by average pooling ("SW_cnn"). For instance, the embedding of the English word puts is represented by its language-specific word embedding $E^{en}_{puts}$ and the shared subword embeddings $Z_{@put}$ and $Z_s$, where @ in @put denotes the beginning of a word. To segment words into subwords, we apply SentencePiece (Kudo and Richardson, 2018).6 Distinct from a standard NMT model, we use the subword-aware embeddings $\tilde{E}^\ell$ not only in the input layers in Eqns. (1) and (3), but also in the output layer in Eqn. (4) (Figure 1). In this way, we encourage the model to learn subword correspondences between the source and target languages through attention, i.e. Eqns. (4) and (5). In monolingual language modelling, Assylbekov and Takhanov (2018) previously showed the effectiveness of sharing morpheme-aware embeddings between the input and output layers.
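As a minimal sketch of the SW_ave variant, the snippet below sums the language-specific word embedding with the average of the shared subword embeddings; the summation and the zero-padded subword lookup are our assumptions about how Eqn. (8) combines the two terms.

```python
import torch
import torch.nn as nn

class SubwordAwareEmbedding(nn.Module):
    """Sketch of SW_ave: E~ = E (language-specific) + mean of the shared
    subword embeddings Z over the subword indices Q(w). Subword ids are
    zero-padded, and padding is excluded from the average."""

    def __init__(self, word_vocab: int, subword_vocab: int, dim: int = 300):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, dim)                   # E^l
        self.sub_emb = nn.Embedding(subword_vocab, dim, padding_idx=0)  # Z, shared

    def forward(self, word_ids, subword_ids):
        # word_ids: (batch, seq); subword_ids: (batch, seq, max_subwords)
        z = self.sub_emb(subword_ids)                              # (B, S, M, D)
        n = (subword_ids != 0).sum(-1, keepdim=True).clamp(min=1)  # subwords per word
        pooled = z.sum(dim=2) / n                                  # average pooling F(.)
        return self.word_emb(word_ids) + pooled

# Usage: 4 words, each segmented into up to 3 subwords.
emb = SubwordAwareEmbedding(word_vocab=1000, subword_vocab=200)
out = emb(torch.randint(0, 1000, (1, 4)), torch.randint(0, 200, (1, 4, 3)))
print(out.shape)  # torch.Size([1, 4, 300])
```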

Training
Given a parallel corpus aligned between languages $s$ and $t$, our model is trained to minimise the loss $J = J_{s,t} + J_{t,s}$, where

$J_{s,t} = \sum_{j=1}^{C_{s,t}} \Delta(x_j, p(x^s_j \mid x^s_j)) + \Delta(y_j, p(y^t_j \mid x^s_j))$

Here, $C_{s,t}$ is the number of aligned sentences between languages $s$ and $t$, and $\Delta$ denotes the cross-entropy loss. The first and second terms represent the reconstruction and translation losses, respectively. Our model can also take multiple parallel corpora as input and generate multilingual word embeddings. In this case, we sum the loss calculated on each parallel corpus. For instance, we train our multilingual model to minimise the loss $J = J_{nru,en} + J_{en,nru} + J_{nru,fr} + J_{fr,nru} + J_{nru,zh} + J_{zh,nru}$ given three parallel corpora of nru-en, nru-fr, and nru-zh,7 where some sentences are aligned between more than two languages.
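The objective can be sketched as follows, assuming the decoders have already produced per-token logits for the reconstruction and translation directions; the padding handling is an illustrative detail of ours.

```python
import torch
import torch.nn.functional as F

def joint_loss(rec_logits, src_ids, tra_logits, tgt_ids, pad_id=0):
    """Sketch of J_{s,t}: cross entropy for reconstructing the source
    sentence x^s plus cross entropy for translating it into y^t.
    Shapes: logits (B, L, V); token ids (B, L). The full objective is
    J = J_{s,t} + J_{t,s}, summed over all corpora when multilingual."""
    rec = F.cross_entropy(rec_logits.flatten(0, 1), src_ids.flatten(),
                          ignore_index=pad_id)
    tra = F.cross_entropy(tra_logits.flatten(0, 1), tgt_ids.flatten(),
                          ignore_index=pad_id)
    return rec + tra

# Toy usage with random logits over a 10-word vocabulary.
B, L, V = 2, 4, 10
loss = joint_loss(torch.randn(B, L, V), torch.randint(1, V, (B, L)),
                  torch.randn(B, L, V), torch.randint(1, V, (B, L)))
print(loss.item())
```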

Experiments

Data
We conduct experiments on real-world data sets for three endangered languages: Yongning Na (nru), Shipibo-Konibo (shp), and Griko (grk) (Table 1).9 The URLs of the data are given in Appendix B.

Yongning Na
Yongning Na is an unwritten Sino-Tibetan language with fewer than 50,000 speakers (Do et al., 2014). Due to the lack of a writing system, textual data must be professionally transcribed from speech by linguists, which precludes the use of the latest pretraining methods such as BERT (Devlin et al., 2019). As a cross-lingual resource for Na, there exists a phonemically transcribed corpus that has been translated into French, Chinese, and English, which is part of the Pangloss Collection (Michailovsky et al., 2014). However, there are two challenges in learning cross-lingual representations from this data. First, the syntax and orthography of the languages are very different: Na is an SOV language with rich tonal morphology (Michaud, 2017), while the others are SVO languages. Second, there is a lot of noise in the parallel corpora. For instance, some words or phrases in the translations are written in brackets, indicating alternative translations, subsidiary information, or words that are implicit in the original Na sentences (Table 2). To clean the data, we use the pre-processing code of Adams et al. (2017) with minor modifications.10

Shipibo-Konibo
Shipibo-Konibo is an indigenous language spoken by around 35,000 native speakers in the Amazon region of Peru (Vasquez et al., 2018), and is "definitely endangered" according to UNESCO's Atlas of the World's Languages in Danger (Moseley, 2010). There is no large monolingual corpus for the language,11 but as cross-lingual resources there are two parallel corpora aligned with Spanish, extracted from the Bible and educational books (Galarreta et al., 2017). Like Na, Shipibo-Konibo is an SOV language with very rich morphology (Valenzuela, 1997; Vasquez et al., 2018), whereas Spanish is an SVO language.

Griko
Griko is a Greek dialect spoken in southern Italy, and is "severely endangered" according to UNESCO.
There is no large-scale monolingual corpus of Griko, but there are two Griko-Italian parallel corpora (Zanon Boito et al., 2018; Anastasopoulos et al., 2018), with the smaller one including gold word alignment annotations. However, Griko has never had a consistent orthography, and hence its tokenisation and word segmentation differ across these corpora: the smaller data set is based on orthographic conventions from Italian, while the larger one follows the concept of a phonological word (Anastasopoulos et al., 2018). Unlike the Na and Shipibo-Konibo data sets, Griko and Italian are very similar in many ways: they both use the Latin script and have similar syntax. Therefore, the main challenge comes from data paucity and inconsistent orthography in Griko, both of which are common problems for endangered languages.

Baselines
We compare our model against various cross-lingual models that are trained on a parallel corpus. First, we compare our model against a recently proposed word-alignment model based on mBERT (Dou and Neubig, 2021).12 It fine-tunes mBERT on parallel corpora using various cross-lingual objectives, and achieves state-of-the-art performance on word alignment tasks across many language pairs. We also include Levy et al. (2017), Luong et al. (2015a), and Sabet et al. (2020) as recent word embedding baselines, which we denote as SENTID, BIVEC and BIS2V, respectively. All of these baselines are very similar in terms of methodology: SENTID trains a Skip-Gram model that predicts a sentence ID (which is assigned to each set of parallel sentences) from the component words; BIVEC trains a Skip-Gram model that predicts the context cross-lingually based on word-alignment information;13 and BIS2V trains a Continuous Bag-of-Words (CBOW) model that predicts a target word from the rest of the sentence and its parallel sentence. Sabet et al. (2020) and Marie and Fujita (2019) show that these joint learning models perform better than mapping-based methods, which align monolingual word embeddings cross-lingually.14 Regarding the vocabulary size and word embedding dimension, we always use the same values for all the baselines and our model, to ensure fairness.15 In addition to these neural baselines, we also compare our model against statistical word alignment methods, namely GIZA++ (Och and Ney, 2003) and Fast Align (Dyer et al., 2013). These are pre-neural methods based on the IBM models (Brown et al., 1993), and still serve as de facto standard models for generating word alignments (Cao et al., 2020; Aldarmaki and Diab, 2019). For all the baselines, we use the authors' implementations.16

Experimental Settings and Evaluation
In our experiments, we train cross-lingual embeddings for five low-resource language pairs: Griko-Italian, Shipibo-Konibo-Spanish and Na-{French, Chinese, English}. For the Griko-Italian pair, we evaluate models on a cross-lingual word alignment task and report alignment accuracy (1−AER). We use the gold alignments manually annotated over the 330 Griko-Italian sentences. To produce alignments using GIZA++ and Fast Align, we train them on the 330 sentences with or without an additional 10k sentences from a second corpus,17 and combine forward and backward alignments using the grow-diag-final-and heuristic. For the word embedding-based methods, we train them on the same data, and align each word in a sentence to the closest word in its translation using static or contextualised word embeddings.18 To calculate word similarity, we use cross-domain similarity local scaling (CSLS; Conneau et al., 2018):

$\mathrm{CSLS}(x, y) = 2\cos(x, y) - \frac{1}{K}\sum_{y' \in N_T(x)} \cos(x, y') - \frac{1}{K}\sum_{x' \in N_S(y)} \cos(x', y)$

where $\cos(x, y)$ denotes the cosine similarity between $x$ and $y$, and $N_T(x)$ and $N_S(y)$ denote the $K$ nearest words to $x$ or $y$ in a target or source sentence; we set $K$ to 3 in the word alignment task. For the mBERT baseline, we follow the authors in using the softmax function.19

For the Shipibo-Konibo-Spanish and Na-{French, Chinese, English} pairs, we perform bilingual lexicon induction (BLI). That is, for each source word in a bilingual dictionary, we extract the k nearest words from the whole target vocabulary and check whether they are listed as translations in the dictionary. We set k to 1 or 5, and report P@1 and P@5. For evaluation, we use a Shipibo-Konibo-Spanish dictionary20 (Maguiño-Valencia et al., 2018) and Na-French-Chinese-English dictionaries (Michaud, 2018). By extracting words that are present in the parallel corpora, we identified 79, 262, 215 and 87 word pairs for Shipibo-Konibo-Spanish, Na-French, Na-Chinese, and Na-English, respectively.21 To perform BLI with GIZA++ and Fast Align, we use their source-to-target probability tables. We also try using the result of bidirectional word alignments, aligning each word to the words most frequently aligned to it.22 For the neural baselines and our model, we use static word embeddings and employ CSLS to measure word embedding similarities. To obtain static word embeddings from mBERT, we calculate the contextualised representation of each word (the average of its subword embeddings), and take the average over all word occurrences.23 In BLI, $N_T(x)$ and $N_S(y)$ denote the $K$ closest words extracted from the whole vocabulary, with $K = 10$, following Conneau et al. (2018).
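For reference, here is a batched sketch of CSLS over pre-normalised embedding matrices; Conneau et al. (2018) define the score per word pair, and the matrix formulation and variable names here are ours.

```python
import torch

def csls(src_emb, tgt_emb, k=10):
    """Sketch of CSLS: 2*cos(x, y) minus the mean cosine similarity of x
    to its K nearest target words and of y to its K nearest source
    words. Rows are assumed L2-normalised, so dot products are cosines."""
    sims = src_emb @ tgt_emb.T                       # cos(x, y) for all pairs
    r_src = sims.topk(k, dim=1).values.mean(dim=1)   # mean sim over N_T(x)
    r_tgt = sims.topk(k, dim=0).values.mean(dim=0)   # mean sim over N_S(y)
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

# Toy retrieval: nearest target word for each of 5 source words.
s = torch.nn.functional.normalize(torch.randn(5, 300), dim=1)
t = torch.nn.functional.normalize(torch.randn(8, 300), dim=1)
print(csls(s, t, k=3).argmax(dim=1))
```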

Model Selection
Since our evaluation data (i.e. bilingual dictionaries and gold word alignments) is extremely limited, we do not have access to validation data to perform model selection over. Therefore, for all methods except ours, we run the models with different configurations and report the best scores of the baselines on the test data to show their upper-bound performance, which clearly gives a significant advantage to the baselines.24 For Fast Align and GIZA++, we train the models for 5 (default), 10, 15, or 20 iterations independently, and report the best score.25 For the neural baselines, we evaluate each model checkpoint and report the best score; we fine-tune the mBERT baseline for 40,000 steps26 with 20 checkpoints, and train SENTID and BIS2V for 1,000 epochs with 100 checkpoints to ensure convergence. For BIVEC, we increase the training corpus size 20 times by duplicating the sentences and train the model for 50 epochs with 50 checkpoints.27

For our model, on the other hand, we use a simple early-stopping criterion that does not require external data. First, we build a pseudo bilingual dictionary from the training data. To retrieve pseudo bilingual word pairs, we compute the Dice Coefficient (Dice, 1945; Smadja et al., 1996) and extract pairs of words that appear ≥ 3 times in each language and whose Dice Coefficient across the two languages is ≥ 0.8. We perform model selection based on the BLI performance on this pseudo dictionary.

20 In this dictionary, sets of synonyms are aligned cross-lingually and we regard each member of them as translations.

21 Since the Shipibo-Konibo-Spanish parallel corpora contain pairs of words as well as sentences, we include them in the evaluation data and remove them from the training data.

22 We use the probability tables as a backup when there are fewer than k aligned words.

23 We also tried taking the average of the static subword embeddings of mBERT, but observed much worse results.

24 In addition, we tuned the hyper-parameters of BIS2V and BIVEC based on P@1 on the na-en test data, based on the observation that they were very sensitive to hyper-parameters in low-resource conditions (e.g. BIVEC ranged from 5.4 to 33.8 P@1 in the na-en BLI task). Refer to Appendix C for the hyper-parameters of the baselines used in our experiments.
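A small sketch of the pseudo-dictionary construction described above follows; counting (co-)occurrences over sentence-level word sets is our reading of how the Dice Coefficient is computed here.

```python
from collections import Counter

def pseudo_dictionary(src_sents, tgt_sents, min_count=3, min_dice=0.8):
    """Sketch of the early-stopping dictionary: for words w_s, w_t with
    sentence counts C(w_s), C(w_t) and co-occurrence count C(w_s, w_t),
    Dice = 2*C(w_s, w_t) / (C(w_s) + C(w_t)). Keep pairs whose words
    appear >= min_count times and whose Dice is >= min_dice."""
    c_src, c_tgt, c_pair = Counter(), Counter(), Counter()
    for s_sent, t_sent in zip(src_sents, tgt_sents):
        s_set, t_set = set(s_sent), set(t_sent)
        c_src.update(s_set)
        c_tgt.update(t_set)
        c_pair.update((ws, wt) for ws in s_set for wt in t_set)
    pairs = []
    for (ws, wt), c in c_pair.items():
        if c_src[ws] >= min_count and c_tgt[wt] >= min_count:
            dice = 2 * c / (c_src[ws] + c_tgt[wt])
            if dice >= min_dice:
                pairs.append((ws, wt, dice))
    return pairs

# Toy usage on three aligned sentence pairs.
src = [["le", "chat", "dort"], ["le", "chien", "dort"], ["le", "chat", "mange"]]
tgt = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"], ["the", "cat", "eats"]]
print(pseudo_dictionary(src, tgt, min_count=2, min_dice=0.9))
```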

Results
Table 3 shows the results for BLI.28 We run the neural baselines and our model three times with different seeds and report the average score, since neural models can be unstable with little data. The table clearly shows that our model outperforms all the baseline models by a large margin for every language pair. It also shows that utilising shared subword embeddings (+SW_ave and +SW_cnn) further improves our model. Compared to the neural baselines, our model performs better even without subword information, demonstrating its effectiveness. The mBERT baseline performs very poorly, likely because of its sub-optimal tokenisation for endangered languages.

Table 4: Our model's performance (P@1) on BLI when the model is trained on two and four languages ("bi" vs. "multi"). All scores are averaged over three runs.

Table 4 compares our bilingual and multilingual models. The multilingual model is trained jointly on the three parallel corpora, sharing parameters among the four languages. The table shows that the multilingual model achieves better performance overall, especially for Na-English and Na-Chinese, where the number of aligned sentences is much smaller than for Na-French. This result demonstrates that our model is not only able to embed multiple languages in the same space, but also benefits from extra sentences aligned with additional languages.
Table 5 shows the results for the Griko-Italian word alignment task. Our model performs best of all the models when trained on the 330 sentences only. This is particularly surprising given that encoder-decoder models are usually not effective when trained on small-scale data on the order of hundreds of sentence pairs. The results also show that our model produces much better static word embeddings than SENTID, BIVEC and BIS2V, demonstrating the importance of considering word order information.
When we use the additional 10k sentences, the performance of the baselines drops substantially except for mBERT,29 likely because of the differences in domains and tokenisation schemes, with the smaller Griko corpus closely following Italian norms. Our model, on the other hand, achieves good results under both conditions, indicating its robustness to noisy real-world data.

Results on High-Resource Languages
To investigate how our model performs in high-resource conditions, we conduct additional word-alignment experiments on four high-resource language pairs: Japanese-English (ja-en), English-Inuktitut (en-iu), German-English (de-en), and English-French (en-fr). For Inuktitut, there is no large-scale monolingual data, making it a salient test case for our model. We use benchmark word-alignment data sets for each language pair,30 where the de-en and en-fr data sets contain about 2M and 1M parallel sentences, and the ja-en and en-iu ones about 0.3M. We apply SentencePiece to each corpus31 and use the resulting segmentations to train all the models except mBERT, for which we use its pretrained tokeniser. To perform word alignment, we first align subwords, and then align two words if any of their subwords are aligned.32 We use the same model selection criteria (Section 3.4) to report the upper bound of the baselines.33

29 We conjecture this is because the additional data helped mBERT (esp. its positional embeddings) to learn that these two languages have very similar syntax.
31 For the en-iu corpus, we segmented the Inuktitut sentences only, as there is a significant gap between the English and Inuktitut vocabulary sizes, i.e. 22k vs. 400k.
32 GIZA++ and Fast Align also benefit from this method.

33 We train the word embedding baselines for 100 epochs and the mBERT baseline for 40,000 steps with 20 checkpoints, and Fast Align and GIZA++ for 5, 10, 15 or 20 epochs, using 50 word classes (Moses default) for GIZA++.

Table 6: Precision ("P"), Recall ("R") and 1−AER ("1−A") of word alignment in high-resource conditions. P and R are calculated based on possible and sure alignments, respectively. "+null" denotes the result with null alignments.
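To make the subword-to-word alignment step above concrete, here is a minimal sketch; the maps from subword positions to word indices are hypothetical helpers that would be derived from the SentencePiece segmentation.

```python
def merge_subword_alignments(sub_align, src_word_of, tgt_word_of):
    """Sketch of the merging step: two words are aligned if any pair of
    their subwords is aligned. src_word_of[i] gives the word index of
    source subword position i (likewise tgt_word_of for the target)."""
    return sorted({(src_word_of[i], tgt_word_of[j]) for i, j in sub_align})

# Toy usage: source subwords 0-1 form word 0 and subword 2 forms word 1;
# the target has one subword per word.
print(merge_subword_alignments({(0, 0), (1, 0), (2, 1)},
                               src_word_of=[0, 0, 1], tgt_word_of=[0, 1]))
# -> [(0, 0), (1, 1)]
```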
Table 6 shows the results of the ja-en and en-iu word alignment experiments. Our model (OURS) significantly outperforms the other static word-embedding baselines. It also outperforms mBERT on en-iu and even on ja-en, which is very surprising given that mBERT is pre-trained on large-scale monolingual data for Japanese and English.34 We also tried training our subword-aware model (OURS+SW_ave) by segmenting subwords into smaller word pieces and learning "sub-subword" embeddings. This approach improves our model for ja-en but not en-iu, probably because some Japanese characters (e.g. kanji) carry substantial semantic information. Compared to the word alignment tools, our model outperforms Fast Align and is comparable to GIZA++, achieving lower precision and higher recall. This is because, unlike GIZA++, our simple alignment algorithm based on CSLS cannot handle NULL alignments (untranslatable words) and generates more alignments than necessary. To handle those words, we apply the following heuristic: discard the alignment between x and y if CSLS(x, y) ≤ 0 or cos(x, y) ≤ min(cos(x, BOS), cos(BOS, y)). This improves our model substantially ("+null" in Table 6), and it then outperforms all the baselines.35

Lastly, Table 7 shows the results of the de-en and en-fr experiments. We cite the baseline scores from Dou and Neubig (2021), and report AER instead of 1−AER following the original table. "OURS +null" performs comparably to mBERT for en-fr (4.5 vs. 4.1), and outperforms it for de-en (14.0 vs. 15.0), establishing a new state of the art with much less data and fewer parameters. Our method is also much simpler than the other NMT-based models (Zenkel et al., 2020; Chen et al., 2020), which pretrain an NMT model and then train an alignment model on top of it. Another important difference is that our model can produce cross-lingual representations, while the NMT-based baselines can generate word alignments only.

35 For grk-it, the performance (1−AER) slightly dropped, e.g. "+SW_cnn+null" achieved 92.3/93.2 with/without the 10k sentences, likely because there are very few NULL alignments.
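A sketch of the null-alignment filter above is given below; we assume L2-normalised contextualised vectors, and take BOS to be the sentence-initial token's vector on each side, which is our reading of the heuristic.

```python
import torch

def filter_null_alignments(align, src_ctx, tgt_ctx, csls_scores, bos_src, bos_tgt):
    """Sketch of the "+null" heuristic: drop (i, j) if CSLS(x_i, y_j) <= 0
    or cos(x_i, y_j) <= min(cos(x_i, BOS), cos(BOS, y_j)). Embedding rows
    and BOS vectors are assumed L2-normalised."""
    kept = []
    for i, j in align:
        cos_xy = float(src_ctx[i] @ tgt_ctx[j])
        cos_x_bos = float(src_ctx[i] @ bos_tgt)  # cos(x, BOS)
        cos_bos_y = float(bos_src @ tgt_ctx[j])  # cos(BOS, y)
        if csls_scores[i, j] > 0 and cos_xy > min(cos_x_bos, cos_bos_y):
            kept.append((i, j))
    return kept

# Toy usage with random normalised vectors and cosine scores standing in
# for CSLS.
s = torch.nn.functional.normalize(torch.randn(3, 8), dim=1)
t = torch.nn.functional.normalize(torch.randn(4, 8), dim=1)
print(filter_null_alignments([(0, 1), (2, 3)], s, t, s @ t.T, s[0], t[0]))
```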

Alignment with Pre-trained Embeddings
To employ large-scale monolingual data, we try initialising the word embeddings of a high-resource language with pre-trained word embeddings, and training the word embeddings of a low-resource language in the same embedding space. During training, we freeze the pre-trained embeddings E^ℓ_pre and apply the element-wise operation a ⊗ E^ℓ_pre + b, where a and b are trainable vectors shared among all the words in E^ℓ_pre. For the other words, we train subword-aware embeddings (SW_ave) from scratch. We conduct an experiment on nru-en, where we pre-train English word embeddings using FastText on 10M sentences sampled from web-crawled data, OSCAR (Ortiz Suárez et al., 2020). The model achieves 30.6/52.3 on P@1/5, underperforming OURS+SW_ave without pre-training (32.0/56.3), possibly because of the domain difference between the parallel and monolingual data. However, a closer look at the matched words reveals that pre-training can improve retrieval performance in several cases, as shown in Table 8 (more examples are in Appendix D). Even though all the models successfully match the correct words, our models retrieve more words relevant to the target word, especially when trained with pre-trained embeddings. This suggests that pre-training may benefit the model on other semantic tasks. Pre-training also makes it possible to measure similarities between Na words and English words that are out-of-vocabulary in the parallel corpus.

Table 9: Ablation results for our model (P@1 (nru) or 1−AER (ja-en)). "multi" indicates the average P@1 of our multilingual model over the three language pairs.
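A minimal sketch of this frozen-embedding variant follows; the class and variable names are ours, with a and b implementing the element-wise a ⊗ E_pre + b operation described above.

```python
import torch
import torch.nn as nn

class RescaledFrozenEmbedding(nn.Module):
    """Sketch of the pre-trained variant: the FastText matrix E_pre is
    frozen, and only an element-wise scale a and shift b, shared across
    all words, are trained."""

    def __init__(self, pretrained: torch.Tensor):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=True)
        dim = pretrained.size(1)
        self.a = nn.Parameter(torch.ones(dim))   # element-wise scale
        self.b = nn.Parameter(torch.zeros(dim))  # element-wise shift

    def forward(self, ids):
        return self.a * self.emb(ids) + self.b

# Usage: wrap a 1000 x 300 pre-trained matrix; only a and b get gradients.
emb = RescaledFrozenEmbedding(torch.randn(1000, 300))
print(emb(torch.tensor([1, 2, 3])).shape)  # torch.Size([3, 300])
```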

Ablation Studies
To investigate the effectiveness of our model, we perform ablation studies targeting: shared subword embeddings, word-specific embeddings $E^\ell_{w_i}$ in Eqn. (8), backward decoding, dropout, weight tying, and the reconstruction objective. Table 9 shows the results. Without the reconstruction loss (i.e. as a standard NMT model), the model performs very poorly in the bilingual settings. In the multilingual setting, however, reconstruction is not essential, because the model is trained to translate multiple languages into Na, which forces the source languages to be encoded in the same space. The table also shows that weight tying is very effective. We find this result intriguing, as weight tying usually does not affect translation quality very much (Press and Wolf, 2017). In our model, however, it is crucial in two ways: first, it reduces the number of parameters and prevents the model from learning word embeddings in different spaces; and second, it introduces more direct connections between source and target (sub)word embeddings through attention. Aggressive dropout is also effective in all conditions, preventing the model from learning language-specific embeddings in different spaces. Word-specific embeddings also improve model performance except for nru-zh.36 Lastly, backward decoding also improves performance, incorporating right-to-left contexts into the word embeddings.

Related Work
There are two main approaches to learning cross-lingual word embeddings. One is to learn a matrix that aligns pretrained monolingual embeddings.
Most such methods exploit bilingual dictionaries to learn the mapping matrix (Mikolov et al., 2013b; Xing et al., 2015; Joulin et al., 2018), but recently a number of methods have succeeded in learning the matrix without supervision (Zhang et al., 2017; Conneau et al., 2018; Artetxe et al., 2018a). However, Wada et al. (2019) show that this approach does not work well in low-resource conditions. The second approach is to jointly train cross-lingual embeddings in a common space. Most existing methods extend bag-of-words models (e.g. Skip-Gram) to incorporate cross-lingual information provided by parallel or comparable corpora (Hermann and Blunsom, 2014; Vulic and Moens, 2016; Levy et al., 2017; Dufter et al., 2018a,b; Luong et al., 2015a; Sabet et al., 2020; Sarioglu Kayi et al., 2020), or bilingual dictionaries (Gouws and Søgaard, 2015; Duong et al., 2016). Recently, masked language models such as XLM (Conneau and Lample, 2019) and mBERT (Devlin et al., 2019) have been shown to generate cross-lingual representations without parallel data, but they require an enormous amount of monolingual data, which is not available for endangered languages. Similar to our work, some papers use multilingual neural machine translation models to obtain cross-lingual representations (Eriguchi et al., 2018; Schwenk and Douze, 2017; Artetxe and Schwenk, 2019; Schwenk, 2018). However, they employ extremely large and/or multilingual data aligned among more than two languages (e.g. Europarl, United Nations). Another important difference is that their methods focus on learning cross-lingual sentence representations only (i.e. not at the word level), and are evaluated on cross-lingual sentence retrieval or sentence classification tasks.

Conclusion
We propose a new approach for learning contextualised cross-lingual word embeddings that can be trained with a tiny parallel corpus. We evaluate models on real-world data for three endangered languages, and also on benchmark data sets for four high-resource language pairs, and show that our model outperforms existing methods at bilingual lexicon induction and word alignment.

A Details of Our Model
We report the details of our model to ensure reproducibility. Our model was trained using PyTorch (Paszke et al., 2019) on a single GPU.

A.1 Hyper-Parameters
Table 10 shows the hyper-parameters of our model. We tuned the hyper-parameters in the low- and medium-resource conditions using a subset of the de-en or en-fr data sets. We used the same embedding size in both low- and medium-resource conditions for simplicity. For very high-resource languages (i.e. de-en and en-fr), we simply increased the number of encoder and decoder layers by one and set the embedding size to the same as that of mBERT for a fair comparison. We use Adam (Kingma and Ba, 2015) as the optimiser with the default learning rate. In low-resource experiments, we train our model for 200 epochs in the Na and Griko bilingual experiments, and for 100 epochs for the other languages. To learn our subword-aware models, we applied SentencePiece separately to each language with the vocabulary size set to 1,000 (in low-resource experiments) or 1,000 plus the number of character types (for ja-en and en-iu).

A.2 Number of Parameters

In the low-resource experiments, most of the model parameters are located in the LSTMs (9.5 million), because the vocabulary sizes are very small. While our model itself is clearly more complex than the neural baselines (except for mBERT), the dimension of the word embeddings, which our model is trained for, is set to the same value for all the models to ensure fairness.

A.3 Run-time
Table 12 shows the run-time of our model in seconds. Although computationally more expensive than the baseline models (except mBERT), it scales well and can be fully trained on a GPU in less than a minute for nru-zh.

B Language Resources
Here we provide the URLs (in footnotes) from which we obtained the language resources used in our experiments: Griko-Italian,37 Na-{English, French, Chinese},38 Shipibo-Konibo-Spanish,39

C Hyper-Parameters of the Baselines
In this section, we describe the hyper-parameters of the baselines used in our experiments. For SENTID, we used the default settings, since they are already optimised for low-resource data (25k sentences from the Bible) and work well on both small and large data. For BIS2V, we set the number of negative samples to 5, the max length of word n-grams to 2, and the number of n-gram dropouts to 1. In high-resource experiments, we changed the number of n-gram dropouts to 4. For BIVEC, we set the subsampling rate to 0.001, bi-weight to 2, and the number of negative samples to 5. In high-resource experiments, we set the sampling value to 0.01 and the negative sampling value to 10. We observed that BIVEC is very sensitive to the subsampling rate and bi-weight: the performance ranged from 5.4 to 33.8 (P@1) in the nru-en BLI task. We tuned these hyper-parameters of BIVEC and BIS2V based on the last-checkpoint model performance on the test data of nru-en and en-iu. For the mBERT baseline, we used its default settings.
For GIZA++ and Fast Align, we used their default hyper-parameters unless mentioned otherwise. In BLI, we set the count-increment cutoff and probability cutoff thresholds to 0 in GIZA++, and removed the probability cutoff in Fast Align, to avoid pruning low-probability words.

D Examples of Retrieved Words on BLI
Table 13 shows some examples of retrieved words in the Na-English BLI task. The source words were chosen by sorting the Na words in the dictionary by frequency in the Na-English corpus and selecting the seven most frequent ones.
The table shows that, although P@5 and P@1 are nearly the same for all the methods, our model matches more semantically and/or grammatically related words to the target word than BIVEC, the best-performing baseline.44 For instance, given the source word tʰi˩˥, OURS (+SW_ave) retrieved its translations "then" and "so", as well as the other conjunctions "and" and "but", while BIVEC retrieved only "then", its other retrieved words being irrelevant, such as "calculates" and "counts". For the source word ʈʂʰɯ˧-qo˧, all the models successfully retrieved its translation "here". However, while our models also retrieved related words such as "there" and "where", BIVEC retrieved completely irrelevant words such as "beam". These results suggest that our models encode more semantic and syntactic information into the word embeddings by taking word order information into account.

44 Note that since the Na-English parallel corpus is extremely small, the size of the English vocabulary from which the words are retrieved is also very small (i.e. 942 word types). Besides, as shown in Table 2, the corpus is very noisy and contains some ungrammatical sentences. Therefore, it is inevitable to some extent that some of the retrieved words are irrelevant to the source word.

Table 3: Performance on bilingual lexicon induction (BLI). "+Align" indicates that bidirectional alignments are used, with the probability table (Ptable) as backup. The scores of the neural models are averaged over three runs.

Table 5: Performance (1−AER) on the Griko-Italian word alignment task with or without the additional 10k parallel sentences. "Word (static)" is the result when our model uses static word embeddings instead of contextualised ones. The scores of the neural models are averaged over three runs.

Table 7: AER scores among various word alignment models. All scores except ours are cited from Dou and Neubig (2021). "+null" denotes the result with null alignments.

Table 8: Examples of retrieved words on nru-en BLI. "+Pre" denotes the use of pre-trained embeddings.

Table 10: Hyper-parameters of our model in low- and high-resource experiments. The first column ("low") denotes the hyper-parameters for low-resource languages, the second ("medium") for ja-en and en-iu, and the last ("high") for de-en and en-fr. CNN is used in low-resource experiments only, due to its high computational cost.

Table 12: Run-time (in seconds) of our model per epoch on a single GPU.