Can Cognate Prediction Be Modelled as a Low-Resource Machine Translation Task?

Cognate prediction is the task of generating, in a given language, the likely cognates of words in a related language, where cognates are words in related languages that have evolved from a common ancestor word. It is a task for which little data exists and which can aid linguists in the discovery of previously undiscovered relations. Previous work has applied machine translation (MT) techniques to this task, based on the tasks' similarities, without, however, studying their numerous differences or optimising architectural choices and hyper-parameters. In this paper, we investigate whether cognate prediction can benefit from insights from low-resource MT. We first compare statistical MT (SMT) and neural MT (NMT) architectures in a bilingual setup. We then study the impact of employing data augmentation techniques commonly seen to give gains in low-resource MT: monolingual pretraining, backtranslation and multilinguality. Our experiments on several Romance languages show that cognate prediction behaves only to a certain extent like a standard low-resource MT task. In particular, MT architectures, both statistical and neural, can be successfully used for the task, but using supplementary monolingual data is not always as beneficial as using additional language data, contrarily to what is observed for MT.


Introduction
The Neogrammarians (Osthoff and Brugmann, 1878) formalised one of the main hypotheses of the then recent field of comparative linguistics, the regularity of sound changes: if a phone in a word, at a given moment in the history of a given language, evolves into another phone, then all occurrences of the same phone in the same phonetic context in the same language evolve in the same way.
Sound changes are usually identified by looking at the attested (or hypothesised) phonetic form of specific sets of words, called cognates, whose definition varies in the literature depending on the field. We use an extension of the customary definition used in historical linguistics, as described for instance in (Hauer and Kondrak, 2011; List et al., 2017), which is the following: given two languages with a common ancestor, two words are said to be cognates if they are an evolution of the same word from said ancestor, having undergone the sound changes characteristic of their respective languages' evolution. We extend it by allowing the ancestor word (from the parent language) to be considered a cognate as well. For example, Latin bonus 'good' gave Italian buono 'id.', Spanish bueno 'id.' and Spanish bono 'id.' by inheritance, and they are all cognates, whereas Spanish abonar 'to fertilise', obtained by derivation, is related but not a cognate (Figure 1). Cognate identification (finding cognate pairs in a multilingual word set) and prediction (producing likely cognates in related languages) are two of the fundamental tasks of historical linguistics. Over the last three decades, automatic cognate identification has benefited from advances in computational techniques, first using dictionary-based methods (Dinu and Ciobanu, 2014) and purely statistical methods (Mitkov et al., 2007; McCoy and Frank, 2018), then statistical methods combined with clustering algorithms (Hall and Klein, 2010, 2011; List et al., 2017; St Arnaud et al., 2017), statistical methods combined with neural classifiers (Inkpen et al., 2005; Frunza and Inkpen, 2006, 2009; Hauer and Kondrak, 2011; Dinu and Ciobanu, 2014) and neural networks only (Ciobanu and Dinu, 2014; Rama, 2016; Kumar et al., 2017; Soisalon-Soininen and Granroth-Wilding, 2019).
Automatic cognate prediction is less studied despite its interesting applications, such as predicting plausible new cognates to help field linguists (Bodt et al., 2018) and inducing translation lexicons (Mann and Yarowsky, 2001). In the last few years, it has been approached as an MT task, as it can be seen as modelling sequence-to-sequence correspondences. Using neural networks has been promising (Beinborn et al., 2013;Wu and Yarowsky, 2018;Dekker, 2018;Hämäläinen and Rueter, 2019;Fourrier and Sagot, 2020a), although in most works the hyper-parameters of the neural models were not optimised. Moreover, the differences between MT and cognate prediction have not been studied.
In this paper, we choose to study the application of MT approaches to the cognate prediction task. Our aim is to investigate whether the task can benefit from techniques commonly seen to improve standard low-resource MT. We first highlight the specific characteristics of cognate prediction, and (to our knowledge) provide the first detailed analysis of the expected differences with standard MT. We then compare MT architectures (bilingual SMT vs. bilingual and multilingual NMT) when applied to cognate prediction. We study how to leverage extra data in our NMT models, either monolingual (via backtranslation or pretraining) or multilingual (introducing new languages). We use Latin and its Romance descendants Spanish and Italian in all our experiments, adding French and Portuguese in a data augmentation setting. We find that cognate prediction is only similar to standard MT to a certain extent: the task can be modelled well using standard MT architectures (adjusted for a low-resource setting), and extending neural architectures to a multilingual setting significantly improves the results. In such multilingual settings, further improvements can be obtained by leveraging data from extra languages. However, using extra monolingual data via backtranslation or pretraining is not always as beneficial as it is in standard MT settings.


Related Work

Cognate Prediction
Cognate prediction is the task that aims to produce, from words in a source language, plausible cognates in a target language (according to the aforementioned definition of cognates). It is a lexical task that models regular, word-internal sound changes that transform words over time. It has been approached with phylogenetic trees combined with stochastic sound change models (Bouchard et al., 2007; Bouchard-Côté et al., 2009, 2013), purely statistical methods (Bodt et al., 2018), neural networks (Mulloni, 2007), language models (Hauer et al., 2019) and character-level MT techniques (Beinborn et al., 2013; Wu and Yarowsky, 2018; Dekker, 2018; Hämäläinen and Rueter, 2019; Fourrier and Sagot, 2020a; Meloni et al., 2021), because of its similarity to a translation task (modelling sequence-to-sequence cross-lingual correspondences between words).

Low-resource MT
Since data is scarce, we postulate that cognate prediction could benefit from techniques and architectural choices developed for low-resource MT settings.

Architecture Comparison
Several papers comparing SMT with NMT (recurrent neural networks (RNNs) with attention) in low-resource settings conclude that SMT performs better, being more accurate and less prone to overfitting (Skadiņa and Pinnis, 2017; Dowling et al., 2018; Singh and Hujon, 2020). However, as Dowling et al. (2018) themselves note, they did not optimise hyper-parameters for NMT. Sennrich and Zhang (2019) analysed and reproduced previous comparisons, concluding that SMT can actually be outperformed by NMT when architectures and hyper-parameters are carefully chosen, but only above a certain quantity of data.

Leveraging Extra Data
Several techniques are commonly used in low-resource MT to mitigate the lack of parallel data: monolingual pretraining, backtranslation and using data from additional languages.
Monolingual pretraining (unsupervised) has, as in other NLP tasks, been highly beneficial to MT (Song et al., 2019;Conneau and Lample, 2019;Devlin et al., 2019;Liu et al., 2020). Before training on a translation task, model parameters are first pretrained using a language modelling objective, which enables the exploitation of monolingual data, more freely available than bilingual data.
Backtranslation originated in SMT (Bertoldi and Federico, 2009;Bojar and Tamchyna, 2011), and has been standard in NMT for several years (Sennrich et al., 2016;Edunov et al., 2018). Its goal is to artificially create larger quantities of parallel data from monolingual datasets, which are often more readily available. Target-side monolingual data is provided to a bilingual model trained in the opposite direction (target-to-source), which produces synthetic source-side data. The data is then filtered to keep the highest quality sentences. The newly generated dataset, made of synthetic source-side data parallel to real target-side data is then combined with the original bilingual set to train a new model.
Training multilingual NMT models has been shown to help low-resource scenarios by providing data in other languages and constraining the hidden representations to a shared, language-independent space. The amount of sharing between languages varies according to the approach, from multi-encoder, multi-decoder architectures (Luong et al., 2016), optionally sharing attention mechanisms (Firat et al., 2016a), to approaches with a single shared encoder and decoder (Ha et al., 2016; Johnson et al., 2017).

Differences between Cognate Prediction and MT
Cognate prediction and MT both focus on learning sequence-to-sequence correspondences. However, amongst the works using MT techniques for cognate prediction, little attention has been paid to their differences; the underlying linguistic assumptions and aims are quite distinct, which could impact the transferability of choices and techniques from MT.
Representation Units MT processes sentences split into individual (graphemic) units that can be of diverse granularity levels (characters, subwords or words). Cognate prediction, on the other hand, involves predicting sound correspondences from one cognate word to another, and so is best modelled using sequences of phones (like character-level MT).

Reordering and Alignment
In MT, the correspondence between source and target sentences can involve long-distance reorderings, whereas the reorderings sometimes found in the correspondence between cognates are almost always local (e.g. metatheses). We therefore expect SMT, which is somewhat limited with respect to the modelling of long-distance context, to be less penalised in the cognate prediction setting than it usually is in a standard MT setting.

Sample Length
The input sequence to MT is the sentence, whereas for cognate prediction it is the word. Even with different segmentation granularities for MT, the average sequence length is generally much shorter for cognate prediction than for MT. Again, this could mean that SMT is less penalised than it is in the standard MT setup.
Modelled Relations MT involves symmetrical relations between sentences, whereas cognate prediction, as defined above, is inherently ambiguous in a counter-intuitive way (especially because it is structurally different from the usual MT ambiguity, where many valid translations exist for the same input). The cognate task models both symmetrical and asymmetrical relationships between cognates: parent-to-child (e.g. LA→ES), i.e. modelling sequences of regular sound changes, is non-ambiguous, whereas child-to-parent (e.g. ES→LA) and, as a result, child-to-child (e.g. IT↔ES) are intrinsically ambiguous, as two distinct sounds in the parent language can result in the same outcome in the child language. When two distinct sounds in the child language are the outcome of the same sound in the parent language, it is always because their (word-internal) phonetic contexts were different in the parent language. In other words, the parent-to-child direction is (virtually) non-ambiguous, but might require taking the phonetic context into account. The child-to-parent direction, however, is intrinsically ambiguous, as a sound in the child language can be the regular outcome of more than one sound in the parent language: for instance, Spanish /b/ comes from Latin /p/ in abría (from Latin aperīre) but from Latin /b/ in habría (from Latin habeō).
Ambiguity Management When using cognate prediction as a tool to aid linguists, as in Bodt and List (2019), our aim is not to predict the single correct answer, but to provide a list of plausible candidates. In MT, however, while many translations can be produced by the model (some better than others, including poor ones), it is possible to simply use the best-ranked translation. In cognate prediction, as a consequence of the inherent ambiguity of the task discussed above, at most one prediction is correct, some others are plausible (i.e. compatible with the phonetic laws involved), while the rest are incorrect. A linguist would be interested in all correct or plausible predictions, not just the best-ranked one; there is therefore a need for n-best prediction.
Relevance of Leveraging Extra Data Whereas MT models could theoretically be trained on any pair of sentences that are translations of each other, cognate prediction is far more limited in terms of which data can be used: cognacy relations only link a limited number of words in specific language pairs, limiting not only the available parallel data but also the potential for synthetic data (e.g. via backtranslation). Using generic translation lexicons may help, but, as they do not only contain cognate pairs, all non-cognate pairs they contain (parallel borrowings from a third language and etymologically unrelated translations) are effectively noise for our task (Fourrier, 2020).

Experimental setup
Bearing in mind these differences, we seek to determine whether MT architectures and techniques are well suited to tackling the task of cognate prediction, taking care to avoid the pitfalls raised by Sennrich and Zhang (2019) by carefully selecting architecture sizes and other hyper-parameters. For our baselines, we train several character-level MT models (SMT vs. RNNs and Transformers) in a bilingual setup, training a single model for each language pair.
We then assess the impact of techniques commonly used to improve MT in low-resource scenarios. We first investigate the impact of using monolingual data for all three architecture types, via pretraining and backtranslation, then take advantage of the ability of NMT to accommodate multilingual architectures to experiment with a multi-encoder, multi-decoder architecture (Firat et al., 2016b) involving all language directions.
Finally, we test whether there can be any benefit from combining multilinguality with either pretraining or backtranslation.

Data
Our datasets (detailed below) are bilingual cognate lexicons for all our experiments, extended with monolingual lexicons for backtranslation and pretraining (see Table 1). As we focus on sound correspondences, we phonetise our datasets: each word is phonetised into IPA using espeak (Duddington, 2007-2015), then cleaned to remove diacritics and homogenise double consonant representations. We run all experiments on three different train/dev/test splits in order to obtain confidence scores. For the bilingual (baseline) and multilingual setups, each split is obtained by randomly sampling 80%/10%/10% of the word pairs.
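As a rough illustration of this preprocessing (the phonetisation itself is done externally by espeak), the sketch below implements a hypothetical cleaning step and the random 80%/10%/10% split. The stress-mark removal, combining-diacritic stripping and length-mark normalisation are illustrative stand-ins for the actual cleaning rules, which depend on the espeak output:

```python
import random
import unicodedata

def clean_ipa(word):
    """Hypothetical cleaning of an espeak IPA string: drop stress marks
    and combining diacritics, and normalise the length mark (standing in
    for the double-consonant homogenisation described above)."""
    word = word.replace("\u02c8", "").replace("\u02cc", "")   # primary/secondary stress
    word = unicodedata.normalize("NFD", word)
    word = "".join(c for c in word if not unicodedata.combining(c))
    return word.replace("\u02d0", ":")                        # length mark -> ':'

def split_data(pairs, seed=0):
    """One random 80%/10%/10% train/dev/test split (one split per seed)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_dev],
            pairs[n_train + n_dev:])
```

Running the experiments with three different seeds then yields the three splits used for confidence scores.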
Monolingual Lexicons Monolingual datasets are used for the monolingual pretraining and backtranslation experiments. They were extracted from a multilingual translation graph, YaMTG (Hanoka and Sagot, 2014), by keeping all unique words for each language of interest. To remove noise, words containing non-alphabetic characters were discarded (punctuation marks, parentheses, etc.). The final datasets (cleaned and phonetised) contain between 18,639 and 99,949 unique words (the LA set is more than 4 times smaller than the others).

SMT
We train a separate SMT model for each language direction using the MOSES toolkit (Koehn et al., 2007). Our bilingual training data is aligned with GIZA++ (Och and Ney, 2003). The target data for the pair is used to train a 3-gram language model using KenLM (Heafield, 2011). We tune our models using MERT based on BLEU on the dev set.
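For intuition about the target-side language model, a toy add-one-smoothed character 3-gram model can be sketched as follows. This is only a stand-in for KenLM (which uses modified Kneser-Ney smoothing and an efficient binary format), operating here on phonetised words as character sequences:

```python
import math
from collections import Counter

class CharTrigramLM:
    """Toy character-level 3-gram LM with add-one smoothing."""
    def __init__(self, words):
        self.tri, self.bi = Counter(), Counter()
        self.vocab = set()
        for w in words:
            chars = ["<s>", "<s>"] + list(w) + ["</s>"]
            self.vocab.update(chars)
            for i in range(2, len(chars)):
                self.tri[tuple(chars[i-2:i+1])] += 1
                self.bi[tuple(chars[i-2:i])] += 1

    def logprob(self, word):
        """Add-one-smoothed log-probability of a word."""
        chars = ["<s>", "<s>"] + list(word) + ["</s>"]
        V = len(self.vocab)
        lp = 0.0
        for i in range(2, len(chars)):
            num = self.tri[tuple(chars[i-2:i+1])] + 1
            den = self.bi[tuple(chars[i-2:i])] + V
            lp += math.log(num / den)
        return lp
```

In the actual pipeline, the analogous KenLM model scores candidate target words during Moses decoding.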

NMT
We compare two encoder-decoder NMT models: the RNN (bi-GRU) with attention (Bahdanau et al., 2015; Luong et al., 2015) and the Transformer (Vaswani et al., 2017). We use the multilingual Transformer implementation of fairseq (Ott et al., 2019), and extend the library with an implementation of the multilingual RNN with attention (following the many-to-many setting of Firat et al. (2016a), but with separate attention mechanisms for each decoder). Each model is composed of one encoder per input language, and one decoder (and its own attention) per output language. We train each model for 20 epochs (which is systematically after convergence), using the Adam optimiser (Kingma and Ba, 2015), the cross-entropy loss, and dev BLEU as selection criterion.

Hyper-parameter Selection
We ran optimisation experiments for all possible bilingual and multilingual architectures, using three different data splits for each parameter combination studied, and choosing the models performing best across seeds. (These implementations are used in all setups, bilingual, with one language as source and one as target, as well as multilingual. In a multilingual setup, encoders, decoders and attention mechanisms can either be shared between languages or be language-specific; in preliminary experiments, using independent components proved to be the most effective. We also observe that a coherent phonetic embedding space is learned during training, described in Appendix A.2.) Our initial parameters were selected from preliminary experiments (in bold in Table 2). Table 2 contains the successive parameter exploration steps: at the end of a step, we automatically selected (according to average dev BLEU) the step-best value, used as input parameter for the next exploration step. The final best parameters are given in Appendix A.1. Smaller learning rates (0.005 and 0.001) are better, while there is no observable pattern to the best batch sizes or numbers of layers. Interestingly, however, for the RNNs, the best results are obtained with the highest hidden dimension irrespective of the embedding size (72 vs. 20 or 24), whereas, for the Transformers, best results are obtained with the largest embedding size irrespective of the hidden dimension (24 vs. 54 or 72). Increasing the number of layers or using more than 1 head almost always increases performance.
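The step-by-step exploration described above amounts to a greedy coordinate-wise search, which could be sketched as follows. Here `evaluate` is a placeholder for training a model with a given configuration and returning its average dev BLEU across splits:

```python
def greedy_search(grid, evaluate, init):
    """Sequential hyper-parameter exploration: tune each parameter in
    turn while the others stay fixed at their current best values, as in
    the step-by-step exploration of Table 2."""
    best = dict(init)
    for param, values in grid.items():
        scores = {}
        for v in values:
            candidate = dict(best, **{param: v})
            scores[v] = evaluate(candidate)       # e.g. average dev BLEU
        best[param] = max(scores, key=scores.get)  # fix the step-best value
    return best
```

Unlike a full grid search, this explores only the sum, not the product, of the value counts, at the risk of missing interactions between parameters.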

Evaluation
For our task, we use the most common MT evaluation metric, BLEU (Papineni et al., 2002), in its sacreBLEU implementation (Post, 2018). It is based on the proportion of 1- to 4-grams in the prediction that match the reference.
In standard MT, BLEU can under-score the many valid translations that do not match the reference. For cognate prediction, however, we expect a single correct prediction in most cases (there are a few exceptions, such as variants due to gender distinctions specific to the target language). This makes BLEU better suited to the cognate prediction task than it is to standard MT.
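For intuition, a minimal character-level BLEU for a single word pair can be sketched as below. This is a simplification of what sacreBLEU computes: the real metric additionally handles tokenisation, smoothing schemes and corpus-level aggregation, so the numbers are illustrative only:

```python
import math
from collections import Counter

def char_bleu(hyp, ref, max_n=4):
    """Geometric mean of clipped 1- to 4-gram precisions over characters,
    times a brevity penalty (a simplified, single-pair BLEU sketch)."""
    precisions = []
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        total = max(sum(h.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)      # crude smoothing
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect prediction scores 1.0, and any mismatched phone lowers all the n-gram precisions it participates in.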

Leveraging Extra Data
Monolingual pretraining For NMT, one way to take advantage of additional monolingual data is to teach the model to "map" each language to itself, by using an identity function objective on the monolingual data for the model's target language. Using monolingual target data during pretraining gives each target decoder access to more target data (which limits overfitting), and we expect it to benefit the encoders too, since source and target languages in cognate prediction, being closely related, tend to share common sound patterns. In practice, we pretrain the model for 5 epochs using the identity function objective together with the initial cognate prediction objective (on the original bilingual data), and then fine-tune it on the cognate task as before for 20 epochs.
For SMT, model parameters cannot be pretrained as in NMT, so we use the nearest equivalent: target-side monolingual data is used to train an extra language model.
For each language pair, the monolingual dataset we use is composed of 90% of the target monolingual data. The bilingual data is the same as before.
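The data preparation for this pretraining step can be sketched as follows (a hypothetical helper; the identity objective itself is simply the usual sequence-to-sequence loss applied to word-to-itself pairs):

```python
import random

def pretraining_mixture(mono_words, cognate_pairs, frac=0.9, seed=0):
    """Pretraining data for one language pair: a fraction (here 90%) of the
    target-side monolingual lexicon as identity pairs (each word mapped to
    itself), mixed with the original bilingual cognate pairs."""
    words = list(mono_words)
    random.Random(seed).shuffle(words)
    kept = words[:int(frac * len(words))]
    return [(w, w) for w in kept] + list(cognate_pairs)
```

The resulting mixture is what the model sees for the 5 pretraining epochs, before fine-tuning on the cognate pairs alone.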
Backtranslation For each architecture type, we use the previously chosen models to predict 10-best results for each seed from the monolingual target-side data, and construct synthetic cognate pairs from monolingual lexicons and source-side predictions. For each word, we keep the first of the 10 predictions that also appears in the relevant monolingual source-language lexicon as our new source, and the initial input word as target (this is akin to filtering back-translated data, e.g. to in-domain data, a standard practice in MT). We discard pairs with no prediction match.
This large back-translated bilingual dataset is extended with our original training set. For NMT, it is used to train a new model for 10 epochs, which is then fine-tuned for 20 epochs with the original bilingual training set. For SMT, it is used (instead of the original bilingual data) to train a new phrase table.
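The filtering step can be sketched as follows (a hypothetical helper mirroring the description above; `nbest` maps each target-side monolingual word to its ranked predictions in the source language):

```python
def build_backtranslated_pairs(nbest, source_lexicon):
    """Keep, for each target-side word, the first of its n-best predicted
    'sources' that also appears in the monolingual source-language lexicon;
    words with no match are discarded."""
    pairs = []
    for target_word, predictions in nbest.items():
        match = next((p for p in predictions if p in source_lexicon), None)
        if match is not None:
            pairs.append((match, target_word))  # synthetic source, real target
    return pairs
```

The surviving synthetic pairs are then concatenated with the original bilingual training set.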
Multilingual NMT We exploit the fact that NMT can readily be made multilingual by training a single model on all language directions at once: every encoder sees data from all languages, whereas each decoder only sees data from its specific language. In all setups, the numbers of epochs used are systematically big enough to reach convergence.

Baseline: Bilingual setup
1-best Results At first glance (Figure 2, "S" columns), SMT and RNN results appear relatively similar, varying between 58.1 and 76.9 BLEU depending on the language pair and outperforming the Transformer by 5 to 15 points on average. However, SMT performs better for IT↔ES (the pair with the least data), and RNNs for the other pairs. This confirms results from the literature indicating that SMT outperforms NMT when data is too scarce, and seems to indicate that the data threshold at which NMT outperforms SMT (for our Romance cognates) is around 3,000 word pairs for RNNs, and has not been reached for Transformers.

n-best Results
The BLEU scores for NMT and SMT increase by about the same amount for each new n (n ≤ 10), reaching between 79.3 and 91.9 BLEU at n = 10 for RNNs and SMT. The Transformer, however, does not catch up.

Pretraining, backtranslation
Both pretraining the models and using backtranslation (Figure 2, "P" and "B" columns) increase the results of the Transformer models by 1 to 9 points, though they remain below the RNN baseline. It is likely that the added monolingual data mitigates the effect of overly scarce bilingual sets. The impact on RNN performance is negligible for most language pairs, apart from the lowest-resourced one (ES-IT), for which backtranslation increases results. Lastly, these methods mostly seem to decrease SMT performance, due to noisy data diluting the original (correct) bilingual data (cf. Section 3); this is less of a problem for NMT models, because they are then fine-tuned on the cognate task specifically.

Multilinguality
Data augmentation through a multilingual setup (Figure 2, "M" columns) seems to be the most successful data augmentation method for RNNs (increasing performance almost every time), and allows them to finally outperform bilingual SMT for the least-resourced pair as well (ES↔IT). The Transformers benefit less from this technique than from adding extra monolingual data, apart from ES↔IT, most likely for the same reason as earlier: this dataset being the smallest, adding words in ES and IT from other language pairs helps learn the correspondences and stabilises learning. This technique is not applicable to SMT.

Impact of the Translation Direction
There are three relation types present in our experiments, each with their level of ambiguity (most for child-to-parent or child-to-child, least for parent-to-child, see Section 3). We observe that the SMT models, though bilingual, outperform multilingual NMT when going from ES or IT to LA (child-to-parent), and that multilingual NMT outperforms SMT in all other translation directions (ES↔IT, LA→ES, LA→IT: child-to-child and parent-to-child).

Combining data augmentation methods
We choose to combine the best performing data augmentation technique overall, multilinguality, with pretraining and backtranslation (Figure 3) for our NMT models.
Multilinguality + pretraining Combining multilinguality with pretraining has virtually no significant impact on the RNNs' results with respect to multilinguality only. For the Transformers, however, it increases the results by 2 to 3 BLEU on average.

Multilinguality + backtranslation Combining multilinguality with backtranslation provides the best results overall for Transformers (both being the best performing methods for these models). For the RNNs, however, the performance increase is smaller for most languages, and we even observe a decrease in performance when translating from ES (which was not the case with bilingual models).

Discussion
We discuss the results of the best performing models for the best seed across all architectures (SMT, multilingual + pretraining RNN and multilingual + backtranslation Transformer) from ES→IT. More than a third of the predicted words score above 90 BLEU (resp. 35.4/46.4/38.1% for SMT/RNN/Transformer); for error analysis, we study the words below this threshold. The observations generalise to other language pairs.

Predictions
Close Analysis of wrong results Wrongly predicted cognates correspond to four cases, as defined in Table 3. We carried out a manual error analysis, and observed that their distribution was similar across models (resp. SMT/RNN/Transformer): (c) 0.9/0.9/0.9% corresponded to data errors, such as suspirar 'to sigh', phonetised as [suspiRaR], which was predicted as [sospira:re] sospirare 'to sigh', its actual cognate, instead of its erroneous counterpart in our database ([skwil:an] squillan '(it) rings').

Usefulness of n-best results
The average position at which the best prediction (according to dev BLEU) occurs in the 10-best predictions is between 1 and 3 (Table 7 in Appendix A.3). The lowest indices occur for Spanish (between 1 and 1.7) and Italian (between 1.6 and 2.2). The highest indices occur when going from IT→LA or ES→LA (between 2 and 3). This illustrates the importance of n-best prediction when predicting cognates from child to parent languages, due to ambiguity. Standard deviations are between 2 and 3: for these languages, when studying cognate prediction, it is worth checking at least the 5-best results.
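The statistic reported here can be sketched as follows (a hypothetical helper; `nbest_scores` holds, for each test word, the sentence-level BLEU of its ranked 10-best predictions):

```python
def average_best_position(nbest_scores):
    """Return the average 1-based rank of the highest-scoring prediction
    in each n-best list, and the standard deviation of those ranks."""
    positions = [scores.index(max(scores)) + 1 for scores in nbest_scores]
    mean = sum(positions) / len(positions)
    var = sum((p - mean) ** 2 for p in positions) / len(positions)
    return mean, var ** 0.5
```

An average close to 1 means the best candidate is usually ranked first, while larger values and standard deviations motivate inspecting deeper into the n-best list.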

Language choice in a multilingual setup
To study the impact of the language pairs used in the multilingual setup, we train additional multilingual neural models on only 1,000 pairs of ES-IT data (single set), complemented by either nothing (to act as baseline), an extra 600 pairs of ES-IT, or 600 pairs of ES-L and IT-L (L being either Latin, a parent language; French, a related language; or Portuguese, more closely related to Spanish than to Italian). The rest of the data is detailed in Table 4. As we saw in Section 5.1, the Transformers' scores are far more affected by low-resource settings than the RNNs'. We therefore study the impact of adding extra languages with RNNs only. Results on our new low-resourced baseline are around 10 points lower than our previous baselines (Table 5), which is expected, since we use less data for training.
Adding 600 pairs of ES-IT words has more effect on ES-IT performance than adding any other pair of related languages, which indicates that, unsurprisingly, the best possible extra data to provide is in the language pair of interest. When adding a related extra language, the results are better than with the initial data only. From Spanish, the performance is best when adding Portuguese, its most closely related language, then French, then Latin. From Italian, we observe the opposite trend. Adding an extra language seems to help most to translate from, and not to, the language it is most closely related to. For very low-resource settings, where extra pairs of the languages of interest might not be available, it will probably be interesting to explore using extra languages related to the source language.

Conclusion
We examined the differences between cognate prediction and MT, in terms of data as well as underlying linguistic assumptions and aims. We then observed that, above a certain training data size, SMT and multilingual RNNs provide the best BLEU scores for the task, SMT remaining unrivalled on smaller datasets (which coincides with previous work comparing SMT and NMT in low-resource settings).
When studying how to increase the amount of training data seen by our models, we found that exploiting the multilinguality of NMT architectures consistently provided better results than adding monolingual lexicons (through pretraining or backtranslation), which contain noise for our task; combining the methods provided a significant improvement for Transformers only. Adding multilingual data by training with extra languages also proved interesting: the best possible extra data to add in a multilingual setting is, first, data from the languages at hand, followed by pairs between them and a parent language, and finally data from additional languages as close as possible to the source language.
We conclude that cognate prediction can benefit from certain conclusions drawn in standard low-resource MT, but that its specificities (intrinsic ambiguity, which requires n-best prediction, and reliance on cognate data only) must be systematically taken into account. Computational cognate prediction using MT techniques is a field in its infancy, and the work in this paper can be extended along several axes: working on less studied language families, or using the method in collaboration with linguists to better understand the etymology and history of languages.


A.1 Best Hyper-parameters
Luong-dot and Luong-general refer respectively to the dot and general attentions in (Luong et al., 2015), while Bahdanau-dot refers to our own implementation of the attention from (Bahdanau et al., 2015), simplified using the dot product to compute attention weights introduced in (Luong et al., 2015). See the code implementation with this paper for more detail.

A.2 Learning Embeddings
Our learned embeddings seem to contain relevant phonetic information: their respective principal component analyses (PCA), when coloured according to place or manner of articulation for the consonants, and backness or height for the vowels, are coherently divided. The following examples are provided for an ES-IT RNN model, but similar results have been observed for our other languages and architectures. Figure 4(a) shows the PCA of the learned source phonetic embeddings of one RNN model, for IT consonant phones, coloured according to place of articulation. It is radially organised, with a smooth transition between labio-dentals from the centre [b] to the bottom [p:], and from centre alveolar to left post-alveolar. Figure 4(b) shows a similar PCA, this time for learned source embeddings of ES consonant phones, coloured according to manner of articulation. It seems coherently divided, with a transition from nasal sounds on the bottom right to lateral affricates and fricatives on the top left.

A.3 Average position of the best result among the 10-best results
We present here the position at which the best prediction (according to sentence-level BLEU, from sacreBLEU) occurs amongst the 10-best predictions (for example, when going from Spanish terroso 'muddy', phonetised [tEroso], to Italian terroso 'id.', phonetised [terRo:zo]). For all multilingual models, we computed the sentence BLEU score for each of the 10-best predictions and saved the position of the highest scoring prediction. We averaged these positions over all words in the test set and calculated the standard deviation.

Table 7: Average position of the closest prediction to the reference amongst the 10-best predictions.