Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages

We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K’iche’, a Mayan language. We compare our multilingual model to a monolingual (from-scratch) baseline, as well as a model pre-trained on Quechua only. We show that the multilingual pre-trained approach yields consistent segmentation quality across target dataset sizes, exceeding the monolingual baseline in 6/10 experimental settings. Our model yields especially strong results at small target sizes, including a zero-shot performance of 20.6 F1. These results have promising implications for low-resource NLP pipelines involving human-like linguistic units, such as the sparse transcription framework proposed by Bird (2020).


Introduction
Unsupervised sequence segmentation (at the word, morpheme, and phone level) has long been an area of interest for languages without whitespace-delimited orthography (e.g. Chinese; Uchiumi et al., 2015; Sun and Deng, 2018), for morphologically complex languages without rule-based morphological analyzers (Creutz and Lagus, 2002), and for automatically phone-transcribed speech data (Goldwater et al., 2009; Lane et al., 2021), respectively. It has been particularly important for lower-resource languages in which there is little or no gold-standard data on which to train supervised models.

* Equal contribution from starred authors, sorted by last name. Sincere thanks to Gina-Anne Levow, Shane Steinert-Threlkeld, and Sara Ng for helpful comments and discussion; to Francis Tyers for access to the K'iche' data; and to Manuel Mager for access to the morphologically-segmented validation data.
In modern neural end-to-end systems, partially unsupervised segmentation is usually performed via information-theoretic algorithms such as BPE (Sennrich et al., 2016) and SentencePiece (Kudo and Richardson, 2018). However, the segmentations they produce are mostly nonsensical to humans. The motivating tasks listed above instead require unsupervised approaches that correlate more closely with human judgements of the boundaries of linguistic units. For example, in a human-in-the-loop framework such as the sparse transcription proposed by Bird (2020), candidate lexical items are automatically proposed to native speakers for confirmation, and it is important that these candidates be (close to) well-formed pieces of language that the speaker would recognize.
In this paper, we investigate the utility of recent models that have been developed to conduct unsupervised segmentation jointly with, or as a byproduct of, a language modeling objective (e.g. Kawakami et al., 2019; Downey et al., 2021; see Section 2). The key idea is that recent breakthroughs in crosslingual language modeling and transfer learning (Conneau and Lample, 2019; Artetxe et al., 2020, inter alia) can be leveraged to transfer unsupervised segmentation performance to a new target language when using these types of language models. Specifically, we investigate the effectiveness of multilingual pre-training in a Masked Segmental Language Model (Downey et al., 2021) when applied to a low-resource target. We pre-train our model on the ten Indigenous languages of the 2021 AmericasNLP shared task dataset (Mager et al., 2021), and apply it to another low-resource, Indigenous, and morphologically complex language of Central America: K'iche' (quc), which at least phylogenetically is unrelated to the pre-training languages (Campbell et al., 1986).
We hypothesize that multilingual pre-training on similar, possibly contact-related languages will outperform a monolingual baseline trained from scratch on the same data. Specifically, we expect the multilingual model to perform increasingly better than the monolingual baseline as the target corpus gets smaller.
Indeed, our experiments show that a pre-trained multilingual model provides stable performance across all dataset sizes and almost always outperforms the monolingual baseline. We additionally show that the multilingual model achieves a zero-shot segmentation performance of 20.6 F1 on the K'iche' data, whereas the monolingual baseline yields a score of zero. These results suggest that transfer from a multilingual model can greatly assist unsupervised segmentation in very low-resource languages, even those that are morphologically rich. They also support the idea that transfer learning via multilingual pre-training may be possible at a more moderate scale (in terms of data and parameters) than is typical for recent crosslingual models.
In the following section, we overview important work relating to unsupervised segmentation, crosslingual pre-training, and transfer-learning (Section 2). We then introduce the multilingual data used in our experiments, as well as the additional pre-processing we performed to prepare the data for multilingual pre-training (Section 3). Next we provide a brief overview of the type of Segmental Language Model used for our experiments here, as well as our multilingual pre-training process (Section 4). After this, we provide details of our experimental process applying the pre-trained and from-scratch models to varying sizes of target data (Section 5). Finally, we discuss the results of our experiments and their significance for low-resource pipelines, both in the framework of unsupervised segmentation and for other NLP tasks more generally (Sections 6 and 7).

Related Work
Work related to the present study has largely fallen either into the field of (unsupervised) word segmentation, or into the field(s) of crosslingual language modeling and transfer learning. To our knowledge, we are the first to propose a crosslingual model for unsupervised word/morpheme-segmentation.
Unsupervised Segmentation Current state-of-the-art unsupervised segmentation performance has largely been achieved with Bayesian models such as Hierarchical Dirichlet Processes (Teh et al., 2006; Goldwater et al., 2009) and Nested Pitman-Yor (Mochihashi et al., 2009; Uchiumi et al., 2015). Adaptor Grammars (Johnson and Goldwater, 2009) have been successful as well. Models such as Morfessor (Creutz and Lagus, 2002), which are based on Minimum Description Length (Rissanen, 1989), are also widely used for unsupervised morphological segmentation.
As Kawakami et al. (2019) note, most of these models are weak in terms of their actual language modeling ability, since they are unable to take into account much beyond the immediate local context of the sequence. Another line of work has focused on models that are both strong language models and good sequence segmenters. Many are in some way based on Connectionist Temporal Classification (Graves et al., 2006), and include Sleep-WAke Networks (Wang et al., 2017), Segmental RNNs (Kong et al., 2016), and Segmental Language Models (Sun and Deng, 2018; Kawakami et al., 2019; Wang et al., 2021; Downey et al., 2021). In this work, we conduct experiments using the Masked Segmental Language Model of Downey et al. (2021), due to its good performance and its scalability, the latter usually regarded as an obligatory feature of crosslingual models (Conneau et al., 2020a; Xue et al., 2021, inter alia).
Crosslingual and Transfer Learning Crosslingual modeling and training has been an especially active area of research following the introduction of language-general encoder-decoders in Neural Machine Translation that offered the possibility of zero-shot translation (i.e. translation for language pairs not seen during training; Ha et al., 2016;Johnson et al., 2017).
The arrival of crosslingual language model pre-training (XLM; Conneau and Lample, 2019) further invigorated the subfield by demonstrating that large models pre-trained on multiple languages yield state-of-the-art performance across an abundance of multilingual tasks, including zero-shot text classification (e.g. XNLI; Conneau et al., 2018), and that pre-trained transformer encoders provide strong initializations for MT systems and language models in very low-resource languages.
Since XLM, numerous studies have attempted to single out exactly which components of the crosslingual training process contribute to the ability to transfer performance from one language to another (e.g. Conneau et al., 2020b). Others have questioned the importance of multilingual training, and have instead proposed that even monolingual pre-training can provide effective transfer to new languages (Artetxe et al., 2020). And though some like Lin et al. (2019) have tried to systematically study which aspects of pre-training languages/corpora enable effective transfer, in practice the choice is often driven by availability of data and other ad-hoc factors.
Currently, large crosslingual successors to XLM such as XLM-R (Conneau et al., 2020a), MASS (Song et al., 2019), mBART (Liu et al., 2020), and mT5 (Xue et al., 2021) have achieved major success, and are the starting point for a large portion of multilingual NLP systems. These models all rely on enormous numbers of parameters and amounts of pre-training data, the bulk of which comes from very high-resource languages. In contrast, in this paper we assess whether multilingual pre-training on a suite of very low-resource languages, which combine to yield a moderate amount of unlabeled data, can provide a good starting point for similar languages which are also very low-resource, within the framework of the unsupervised segmentation task.

Data and Pre-processing
We draw data from three main datasets. The AmericasNLP 2021 open task dataset (Mager et al., 2021) contains text from ten Indigenous languages of Central and South America, which we use to pre-train our multilingual model. The multilingual dataset from Kann et al. (2018) consists of morphologically segmented sentences in several Indigenous languages, two of which overlap with the AmericasNLP set, and serves as segmentation validation data for our pre-training process in these languages. Finally, the K'iche' data collected for Tyers and Henderson (2021) and Richardson and Tyers (2021) contains both raw and morphologically-segmented sentences. We use the former as the training data for our experiments transferring to K'iche', and the latter as the validation and test data for these experiments.

AmericasNLP 2021
The AmericasNLP data consists of train and validation files for ten low-resource Indigenous languages: Asháninka (cni), Aymara (aym), Bribri (bzd), Guaraní (gug), Hñähñu (oto), Nahuatl (nah), Quechua (quy), Rarámuri (tar), Shipibo-Konibo (shp), and Wixarika (hch). For each language, AmericasNLP also includes parallel Spanish sets, which we do not use. The data was originally curated for the AmericasNLP 2021 shared task on Machine Translation for low-resource languages (Mager et al., 2021). We augment the Asháninka and Shipibo-Konibo training sets with additional available monolingual data from Bustamante et al. (2020), which is linked in the official AmericasNLP repository. We add both the training and validation data from this corpus to the training set of our splits.
To prepare the AmericasNLP data for a multilingual language modeling setting, we first remove lines that contain urls or copyright boilerplate, or that contain no alphabetic characters. We also split lines that are longer than 2000 characters into sentences/clauses where evident. Because we use the Nahuatl and Wixarika data from Kann et al. (2018) as validation data, we remove any overlapping lines from the AmericasNLP set. We create a combined train file as the concatenation of the training data from each of the ten languages, and likewise a combined validation file.
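The following is a minimal sketch of the line-level filters described above; the regular expressions and the clause-splitting heuristic are illustrative assumptions rather than our exact pre-processing rules.

```python
import re

URL_RE = re.compile(r"https?://|www\.")
COPYRIGHT_RE = re.compile(r"copyright|\(c\)|©", re.IGNORECASE)

def keep_line(line: str) -> bool:
    """Return True if a line should stay in the multilingual training set."""
    if URL_RE.search(line) or COPYRIGHT_RE.search(line):
        return False
    # Drop lines with no alphabetic characters at all (numbers, markup, etc.).
    if not any(ch.isalpha() for ch in line):
        return False
    return True

def split_long_line(line: str, max_chars: int = 2000) -> list[str]:
    """Split very long lines into clause-like chunks at sentence punctuation."""
    if len(line) <= max_chars:
        return [line]
    chunks = re.split(r"(?<=[.!?;])\s+", line)
    return [c for c in chunks if c.strip()]

def clean_corpus(lines: list[str]) -> list[str]:
    cleaned = []
    for line in lines:
        line = line.strip()
        if line and keep_line(line):
            cleaned.extend(split_long_line(line))
    return cleaned
```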
Because the original ratio of Quechua training data is so high compared to all other languages (Figure 1), we randomly downsample this data to 2^15 examples, the closest order of magnitude to the next-largest training set. A plot of the balanced (final) composition of our AmericasNLP train and validation sets can be seen in Figure 2. A table with the detailed composition of this data is available in Appendix A.

Kann et al. (2018) data This dataset consists of morphologically-segmented sentences in Purepecha (pua/tsz), Wixarika (hch), Yorem Nokki (mfy), Mexicanero (azd/azn), and Nahuatl (nah). This data was originally curated for a supervised neural morphological segmentation task for polysynthetic minimal-resource languages. We clean this data in the same manner as the AmericasNLP sets. Because Nahuatl and Wixarika are two of the languages in our multilingual pre-training set, we use these examples as validation data for segmentation quality during the pre-training process.
K'iche' data All of the K'iche' data used in our study was curated for Tyers and Henderson (2021). The raw (non-gold-segmented) data used as training data in our transfer experiments comes from a section of this data web-scraped by the Crúbadán project (Scannell, 2007). This data is relatively noisy, so we clean it by removing lines that contain urls or in which more than half of the characters are non-alphabetic. The cleaned data consists of 62,695 examples and is used as our full-size training set for K'iche'. Our experiments involve testing transfer at different resource levels, so we also create smaller training sets by downsampling the original to lower orders of magnitude.
For evaluating segmentation performance on K'iche', we use the segmented sentences from Richardson and Tyers (2021), which were created for a shared task on morphological segmentation. These segmentations were created by a hand-crafted FST and then manually disambiguated. The sentences originally came in a train/validation/test split, but because gold-segmented sentences are so rare, we concatenate these sets and then split them in half into final validation and test sets.

Model and Pre-training

MSLMs An MSLM is a variant of a Segmental Language Model (SLM) (Sun and Deng, 2018; Kawakami et al., 2019; Wang et al., 2021), which takes as input a sequence of characters x and outputs a probability distribution over a sequence of segments y such that the concatenation of the segments of y is equivalent to x: π(y) = x. An MSLM is composed of a Segmental Transformer Encoder and an LSTM-based Segment Decoder (Downey et al., 2021). See Figure 3.

The training objective for an MSLM is based on the prediction of masked-out spans. During a forward pass, the encoder generates an encoding for every position in x, for a segment up to k symbols long; the encoding for position i − 1 corresponds to every possible segment that starts at position i. Therefore, the encoding h_{i−1} approximates the probability of a segment y_i beginning at position i, conditioned only on the context outside that segment: p(y_i | x_{<i}, x_{≥i+k}). To ensure that the encodings are generated based only on the portions of x that are outside of the predicted span, the encoder uses a Segmental Attention Mask (Downey et al., 2021) to mask out tokens inside the segment. Figure 3 shows an example of such a mask with k = 2.
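To make the masking pattern concrete, the sketch below builds one plausible version of such a mask, assuming the convention that the query at position q is blocked from attending to the k positions that follow it; consult Downey et al. (2021) for the exact construction used in the MSLM implementation.

```python
import torch

def segmental_attention_mask(seq_len: int, k: int) -> torch.Tensor:
    """Boolean mask where mask[q, j] = True means query q may attend to key j.

    Position q encodes the context for any segment starting at q + 1, so it is
    blocked from attending to the next k positions (the span being predicted).
    Illustrative reconstruction only; see Downey et al. (2021) for details.
    """
    q = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    inside_span = (j > q) & (j <= q + k)     # the k positions after q
    return ~inside_span

mask = segmental_attention_mask(seq_len=6, k=2)
# After inversion/conversion to the expected convention, `mask` can be passed
# as the attention mask of a standard transformer encoder layer.
```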
Finally, the Segment Decoder of an SLM determines the probability of the j-th character of the segment of y that begins at index i, y_i^j, using the encoded context: p(y_i^j | y_i^{<j}, h_{i−1}). The output of the decoder is therefore based entirely on the context of the sequence, and not on the determination of other segment boundaries. The probability of y is modeled as the marginal probability over all possible segmentations of x. Because directly marginalizing would be computationally intractable, the marginal is computed using dynamic programming over a forward-pass lattice.
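The lattice computation, and the Viterbi decode described next, can be sketched with a few lines of dynamic programming. In the sketch below, seg_logp[i][l] is an assumed name for the decoder's log-probability of the length-(l+1) segment starting at character i; the actual implementation operates on batched tensors, but the recurrences are the same in spirit.

```python
import math

def marginal_log_prob(seg_logp, T, k):
    """Forward lattice: log p(x) marginalized over all segmentations.

    seg_logp[i][l] is assumed to hold the decoder's log-probability of the
    segment x[i : i + l + 1] (length l + 1, at most k), given its masked context.
    """
    alpha = [float("-inf")] * (T + 1)
    alpha[0] = 0.0
    for end in range(1, T + 1):
        terms = [alpha[start] + seg_logp[start][end - start - 1]
                 for start in range(max(0, end - k), end)]
        m = max(terms)
        alpha[end] = m + math.log(sum(math.exp(t - m) for t in terms))
    return alpha[T]

def viterbi_segmentation(seg_logp, T, k):
    """Same lattice with max instead of sum: the best-scoring segmentation."""
    best = [float("-inf")] * (T + 1)
    back = [0] * (T + 1)
    best[0] = 0.0
    for end in range(1, T + 1):
        for start in range(max(0, end - k), end):
            score = best[start] + seg_logp[start][end - start - 1]
            if score > best[end]:
                best[end], back[end] = score, start
    # Recover (start, end) segment spans by walking the backpointers.
    spans, end = [], T
    while end > 0:
        spans.append((back[end], end))
        end = back[end]
    return list(reversed(spans))
```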
The maximum-probability segmentation is determined using Viterbi decoding. The MSLM training objective maximizes language modeling performance, which is measured in Bits per Character (bpc) over each sentence.

Multilingual Pre-training In our experiments, we test the transfer capabilities of a multilingual pre-trained MSLM. We train this model on the AmericasNLP 2021 data, pre-processed as described in Section 3. Since Segmental Language Models operate on plain text, we can train the model directly on the multilingual concatenation of this data, and evaluate it by its language modeling performance on the concatenated validation data, which is relatively language-balanced in comparison to the training set (see Figure 2).
We train an MSLM with four encoder layers for 16,768 steps, using the Adam optimizer (Kingma and Ba, 2015). We apply a linear warmup for 1024 steps, and a linear decay afterward. The transformer layers have hidden size 256, feedforward size 512, and 4 attention heads. The LSTM-based segment decoder has a hidden size of 256. Character embeddings are initialized using Word2Vec (Mikolov et al., 2013) over the training data. The maximum possible segment size is set to 10. We sweep eight learning rates on a grid of the interval [0.0005, 0.0009], and the best model is chosen as the one that minimizes the Bits Per Character (bpc) language-modeling loss on the validation set. For further details of the pre-training procedure, see Appendix B.
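As an illustration of the schedule described above, the following sketch builds the linear warmup and decay with PyTorch's LambdaLR; the placeholder model and the assumption that the rate decays to zero at the final step are illustrative, not details taken from our training code.

```python
import torch

def linear_warmup_decay(warmup_steps: int, total_steps: int):
    """LR multiplier: linear ramp from 0 to 1, then linear decay toward 0."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return lr_lambda

model = torch.nn.Linear(8, 8)  # stand-in for the MSLM parameters
optimizer = torch.optim.Adam(model.parameters(), lr=7.5e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=linear_warmup_decay(warmup_steps=1024, total_steps=16768)
)
# During training: call scheduler.step() once after each optimizer.step().
```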
To evaluate the effect of pre-training on segmentation quality for languages within the pre-training set, we also log the Matthews Correlation Coefficient (MCC; see Section 5) between the model output and the gold-segmented secondary validation sets available for Nahuatl and Wixarika (Kann et al., 2018; see Section 3). As Figure 4 shows, the unsupervised segmentation quality for Nahuatl and Wixarika increases almost monotonically during pre-training.

Experiments
We seek to evaluate whether crosslingual pre-training facilitates effective low-resource transfer learning for segmentation. To do this, we pre-train a Segmental Language Model on the AmericasNLP 2021 dataset (Mager et al., 2021) and transfer it to a new target language: K'iche' (Tyers and Henderson, 2021). As a baseline, we train a monolingual K'iche' model from scratch. We evaluate model performance with respect to the size of the target training set, simulating varying degrees of low-resource settings. To manipulate this variable, we randomly downsample the K'iche' training set to 8 smaller sizes, for 9 total: {256, 512, ..., 2^15, ~2^16 (full)} (sketched below). For each size, we both train a monolingual model and fine-tune the pre-trained multilingual model described in Section 4.

Architecture and Modelling Both the pre-trained crosslingual model and the baseline monolingual model are Masked Segmental Language Models (MSLMs) with the architecture described in Section 4. The only difference is that the baseline monolingual model is initialized with a character vocabulary covering only the particular K'iche' training set (size-specific). The character vocabulary of the K'iche' data is a subset of the AmericasNLP vocabulary, so in the multilingual case we are able to transfer without changing our embedding and output layers. The character embeddings for the monolingual model are initialized using Word2Vec (Mikolov et al., 2013) on the training set (again, size-specific).
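A minimal sketch of how the nine target-size conditions could be constructed by random downsampling follows; the seed and function names are illustrative, not taken from our scripts. Each resulting subset is used both to train a monolingual baseline from scratch and to fine-tune the pre-trained multilingual model.

```python
import random

def make_target_sizes(full_train: list[str], seed: int = 0) -> dict[str, list[str]]:
    """Create the nine K'iche' training conditions: 256, 512, ..., 2**15, full."""
    rng = random.Random(seed)
    sizes = [2 ** p for p in range(8, 16)]            # 256 ... 32768
    subsets = {str(n): rng.sample(full_train, n) for n in sizes}
    subsets["full"] = list(full_train)                # ~2**16 examples
    return subsets
```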
Evaluation Metrics Segmental Language Models can be trained in either a fully unsupervised or "lightly" supervised manner (Downey et al., 2021). In the former case, only the language modeling objective (Bits Per Character, bpc) is considered in picking parameters and checkpoints. In the latter, the segmentation quality over gold-segmented validation data can be considered. Though our validation set is gold-segmented, we pick the best parameters and checkpoints based only on the bpc performance, thus simulating the unsupervised case. However, in order to monitor the change in segmentation quality during training, we also use Matthews Correlation Coefficient (MCC). This measure frames segmentation as a character-wise binary classification task (i.e. boundary vs no boundary), and measures correlation with the gold segmentation.
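As an illustration of the MCC computation, the sketch below converts segmentations into character-wise boundary labels and scores them with scikit-learn (an assumed dependency; the example strings are arbitrary English, not K'iche' gold data). Framing segmentation as boundary classification makes MCC robust to the heavy class imbalance between boundary and non-boundary positions.

```python
from sklearn.metrics import matthews_corrcoef

def boundary_labels(segments: list[str]) -> list[int]:
    """Label each within-sentence character position: 1 if a boundary follows it."""
    labels = []
    for seg in segments:
        labels.extend([0] * (len(seg) - 1) + [1])
    return labels[:-1]  # no boundary decision after the final character

gold = boundary_labels(["seg", "ment", "ation"])
pred = boundary_labels(["segment", "ation"])
mcc = matthews_corrcoef(gold, pred)
```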
To make our results comparable with the wider word-segmentation literature, we use the scoring script from the SIGHAN Segmentation Bakeoff (Emerson, 2005) to obtain our final segmentation F1 score. For each model and dataset size, we choose the best checkpoint (by bpc), apply the model to the combined validation and test set, and use the SIGHAN script to score the output segmentation quality.
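Our reported scores come from the official SIGHAN script; purely as an illustration of what that script measures, the following sketch computes word-level F1 by treating a predicted word as correct only if both of its boundaries match the gold segmentation.

```python
def word_spans(segments: list[str]) -> set[tuple[int, int]]:
    """Character-offset spans (start, end) of each word in a segmentation."""
    spans, start = set(), 0
    for seg in segments:
        spans.add((start, start + len(seg)))
        start += len(seg)
    return spans

def word_f1(gold: list[list[str]], pred: list[list[str]]) -> float:
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = word_spans(g), word_spans(p)
        tp += len(g_spans & p_spans)
        fp += len(p_spans - g_spans)
        fn += len(g_spans - p_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```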
Parameters and Trials For our training procedure (both training the baseline from scratch and fine-tuning the multilingual model) we tune hyperparameters on three of the nine dataset sizes (256, 2048, and full) and choose the optimal parameters as those that obtain the lowest bpc. For each of the other sizes, we directly apply the chosen parameters from the tuned dataset of the closest size (on a log scale). We tune over five learning rates and three encoder dropout values.
Models are trained using the Adam optimizer (Kingma and Ba, 2015) for 8192 steps on all but the two smallest sizes, which are trained for 4096 steps. A linear warmup is used for the first 1024 steps (512 for the smallest sets), followed by linear decay. We set the maximum segment length to 10. For more details on our training procedure, see Appendix B.

Results
The results of our K'iche' transfer experiments at various target sizes can be found in Table 1. In general, the pre-trained multilingual model demonstrates good performance across dataset sizes, with the lowest segmentation quality (20.6 F1) occurring in the zero-shot case and the highest (42.0) achieved when trained on 2^14 examples. The best segmentation quality of the monolingual model is very close to that of the multilingual one (41.9, at size 4096), but this performance is not consistent across dataset sizes. Further, there does not seem to be a noticeable trend across dataset sizes for the monolingual model, except that performance increases from approximately 0 F1 in the zero-shot case up to 4096 examples.
Interpretation The above results show that the multilingual pre-trained MSLM provides consistent segmentation performance across dataset sizes as small as 512 examples. Even at size 256, there is only a 15% (relative) drop in segmentation quality from the next-largest size. Further, the pre-trained model yields an impressive zero-shot performance of 20.6 F1, where the baseline achieves approximately 0 F1.
On the other hand, the monolingual model can achieve good segmentation quality on the target language, but the pattern of success across target corpus sizes is not clear (note that the quality at size 2^15 is almost halved compared to the two neighboring sizes).
This variation in the monolingual baseline may be partially explainable by sensitivity to hyperparameters. Table 2 shows that across the best four hyperparameters, the segmentation quality of the monolingual model varies considerably. This is especially noticeable at smaller sizes: at size 2048, the F1 standard deviation is 27.4% of the mean, and at size 256 it is 34.1% of the mean.
A related explanation could be that the hyperparameters tuned at specific target sizes don't transfer well to other sizes. However, it should be noted that even at the sizes for which hyperparameters were tuned (256, 2048, and full), the monolingual performance lags behind the multilingual. Further, the best segmentation quality achieved by the monolingual model is at size 4096, at which the hyperparameters tuned for size 2048 were applied directly. In sum, the pre-trained multilingual model yields far more stable performance across target dataset sizes, and almost always outperforms its monolingual from-scratch counterpart.

Analysis and Discussion
Standing of Hypotheses Within the framework of unsupervised segmentation via language modeling, the results of these experiments provide strong evidence that relevant linguistic patterns can be learned over a collection of low-resource languages, and then transferred to a new language without much (or any) training data. Further, it is shown that the target language need not be (phylogenetically) related to any of the pre-training languages, even though the details of morphological structure are ultimately language-specific.
The hypothesis that multilingual pre-training would yield an increasing advantage over a from-scratch baseline at smaller target sizes is also strongly supported. This result is consistent with related work showing this to be a key advantage of the multilingual approach (Wu and Dredze, 2020). Perhaps more interestingly, the monolingual model does not come to outperform the multilingual one at the largest dataset sizes, which also tends to be the case in related studies (e.g. Wu and Dredze, 2020; Conneau et al., 2020a). However, it is useful to bear in mind that segmentation quality is an unsupervised objective here, and as such it will not necessarily follow trends observed for supervised objectives.
Significance The above results, especially the non-trivial zero-shot transferability of segmentation performance, suggest that the type of language model used here learns some abstract linguistic pattern(s) that are generalizable between languages (even ones on which the model has not been trained). It is possible that these generalizations could take the form of abstract stem/affix or word-order patterns, corresponding roughly to the lengths and order of morphosyntactic units. Because MSLMs operate on the character level (and in these languages orthographic characters mostly correspond to phones), it is also possible the model could recognize syllable structure in the data (the ordering of consonants and vowels in human languages is relatively constrained), and learn to segment on syllable boundaries.
It is also helpful to remember that we select the training suite and target language to have some characteristics in common that may help to facilitate transfer. The AmericasNLP training languages are almost all morphologically rich, with many being considered polysynthetic (Mager et al., 2021), a feature that K'iche' shares (Suárez, 1983). Further, all of the languages, including K'iche', are spoken in countries where either Spanish or Portuguese are the official language, and are very likely to have had close contact with these Iberian languages and have borrowed lexical items. Finally, the target language family (Mayan) has also been shown to be in close historical contact with the families of several of the AmericasNLP set (Nahuatl, Rarámuri, Wixarika, Hñähñu), forming a Linguistic Area or Sprachbund (Campbell et al., 1986).
It is possible that one or several of these shared characteristics facilitates the strong transfer shown in our experiments. However, our current study does not conclusively show this to be the case. Lin et al. (2019) show that factors like linguistic similarity and geographic contact are often not as important for transfer success as non-linguistic features such as the raw size of the source dataset. Furthermore, Artetxe et al. (2020) show that even monolingually-trained models can be rapidly adapted to a new language by simply training a new embedding layer and adding lightweight adapter layers.
Future Work There are some future studies that we believe would shed light on the nuances of segmentation transfer-learning. First, pre-training monolingually on a language that is typologically or geographically close to the target could help disentangle the benefit given by multilingual training from that achieved by pre-training on a similar language in general (though the source language would need to be sufficiently high-resource to enable this comparison). Second, pre-training either multilingually or monolingually on languages that are not linguistically similar to the target language could help isolate the advantage given by pre-training on any language data.
In this way, we hope future experiments will refine our understanding of the dynamics that facilitate effective transfer into low-resource languages, both in the framework of unsupervised segmentation and in other tasks in which language model pre-training has enabled transfer learning.

Conclusion
This study has shown that unsupervised sequence segmentation performance can be transferred via multilingual pre-training to a novel target language with little or no target data. The target language also does not need to be from the same family as a pre-training language for this transfer to be successful. While training a monolingual model from scratch on larger amounts of target data can result in good segmentation quality, our experiments show that success in this approach is much more sensitive to hyperparameters, and the multilingual model outperforms the monolingual one in 9/10 of our experimental settings.
One finding that may have broader implications is that pre-training can be conducted over a set of low-resource languages that may have some typological or geographic similarity to the target, rather than over a crosslingual suite centered around high-resource languages like English and other European languages. As mentioned in Section 2, most modern crosslingual models have huge numbers of parameters (XLM has 570 million, mT5 has up to 13 billion; Xue et al., 2021), and are trained on enormous amounts of data, usually bolstered by hundreds of gigabytes of data in the highest-resource languages (Conneau et al., 2020a).
In contrast, our results suggest that effective transfer learning may be possible at smaller scales, by combining the data of low-resource languages and training moderately-sized, more targeted pre-trained multilingual models (our model has 3.15 million parameters). Of course, the present study can only support this possibility within the unsupervised segmentation task, and so future work will be needed to investigate whether crosslingual transfer to and from low-resource languages can be extended to other tasks.

A AmericasNLP Dataset Details

Citations A more detailed description of the sources and citations for the AmericasNLP set can be found in the original shared task paper (Mager et al., 2021). Here, we attempt to give a brief listing of the proper citations. All of the validation data originates from AmericasNLI (Ebrahimi et al., 2021), which is a translation of the Spanish XNLI set (Conneau et al., 2018) into the 10 languages of the AmericasNLP 2021 open task.
The training data for each of the languages comes from a variety of different sources. The Asháninka training data is sourced from Ortega et al. (2020), Cushimariano Romano and Sebastián Q. (2008), and Mihas (2011), and consists of stories, educational texts, and environmental laws. The Aymara training data consists mainly of news text from the GlobalVoices corpus (Prokopidis et al., 2016) as available through OPUS (Tiedemann, 2012). The Bribri training data is from six sources (Feldman and Coto-Solano, 2020; Margery, 2005; Jara Murillo, 2018a; Constenla et al., 2004; Jara Murillo and Segura, 2013; Jara Murillo, 2018b; Flores Solórzano, 2017), ranging from dictionaries and textbooks to story books. The Guaraní training data consists of blogs and web news sources collected by Chiruzzo et al. (2020). The Nahuatl training data comes from the Axolotl parallel corpus (Gutierrez-Vasques et al., 2016). The Quechua training data was created from the JW300 corpus (Agić and Vulić, 2019), including Jehovah's Witnesses text and dictionary entries collected by Huarcaya Taquiri (2020). The Rarámuri training data consists of phrases from the Rarámuri dictionary (Brambila, 1976). The Shipibo-Konibo training data consists of translations of a subset of the Tatoeba dataset (Montoya et al., 2019), translations from bilingual education books (Galarreta et al., 2017), and dictionary entries (Loriot et al., 1993). The Wixarika training data consists of translated Hans Christian Andersen fairy tales.
No formal citation was given for the source of the Hñähñu training data (see Mager et al., 2021).

B Hyperparameter Details
Pre-training The character embeddings for our multilingual model are initialized by training CBOW (Mikolov et al., 2013) on the AmericasNLP training set for 32 epochs, with a window size of 5. Special tokens like <bos> that do not appear in the training corpus are randomly initialized. These pre-trained embeddings are not frozen during training. During pre-training, a dropout rate of 12.5% is applied within the (transformer) encoder layers. A dropout rate of 6.25% is applied both to the embeddings before they are passed to the encoder, and to the hidden-state and start-symbol encodings input to the decoder (see Downey et al., 2021). Checkpoints are taken every 128 steps. The optimal learning rate was 7.5e-4.
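A minimal sketch of this initialization is given below, assuming the gensim implementation of Word2Vec/CBOW; the toolkit, the scale of the random fallback initialization, and the helper names are assumptions for illustration.

```python
import numpy as np
from gensim.models import Word2Vec

def init_char_embeddings(train_lines: list[str], vocab: list[str], dim: int = 256):
    """CBOW character embeddings (window 5, 32 epochs); special tokens stay random."""
    char_sentences = [list(line) for line in train_lines]
    w2v = Word2Vec(
        sentences=char_sentences,
        vector_size=dim,
        window=5,
        min_count=1,
        sg=0,          # sg=0 selects CBOW
        epochs=32,
    )
    rng = np.random.default_rng(0)
    emb = rng.normal(scale=0.02, size=(len(vocab), dim)).astype(np.float32)
    for idx, ch in enumerate(vocab):
        if ch in w2v.wv:   # special tokens like <bos> keep their random init
            emb[idx] = w2v.wv[ch]
    return emb
```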
K'iche' Transfer Experiments As for the pre-trained model, character embeddings are initialized using CBOW on the given training set for 32 epochs with a window size of 5, and these embeddings are not frozen during training. As in pre-training, a dropout rate of 6.25% is applied to the input embeddings, as well as to the hidden-state encodings h and the start symbol input to the decoder. Checkpoints are taken every 64 steps for sizes 256 and 512, and every 128 steps for every other size.