AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages

Reproducible benchmarks are crucial in driving progress of machine translation research. However, existing machine translation benchmarks have been mostly limited to high-resource or well-represented languages. Despite an increasing interest in low-resource machine translation, there are no standardized reproducible benchmarks for many African languages, many of which are used by millions of speakers but have less digitized textual data. To tackle these challenges, we propose AfroMT, a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages. We also develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages. Furthermore, we explore the newly considered case of low-resource focused pretraining and develop two novel data augmentation-based strategies, leveraging word-level alignment information and pseudo-monolingual data for pretraining multilingual sequence-to-sequence models. We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines. We also show gains of up to 12 BLEU points over cross-lingual transfer baselines in data-constrained scenarios. All code and pretrained models will be released as further steps towards larger reproducible benchmarks for African languages.


Introduction
Accuracy of machine translation systems in many languages has improved greatly over the past several years due to the introduction of neural machine translation (NMT) techniques (Bahdanau et al., 2015;Sutskever et al., 2014;Vaswani et al., 2017), as well as scaling to larger models (Ott et al., 2018). However, many of these advances have been demonstrated in settings where very large parallel datasets are available (Meng et al., 2019;Arivazhagan et al., 2019), and NMT systems often underperform in low-resource settings when given small amounts of parallel corpora (Koehn and Knowles, 2017;Guzmán et al., 2019). One solution to this has been leveraging multilingual pretraining on large sets of monolingual data (Conneau and Lample, 2019;Song et al., 2019;Liu et al., 2020), leading to improvements even with smaller parallel corpora. However, this thread of work has focused on scenarios with the following two properties: (1) pretraining on a plurality of European languages and (2) cases in which the monolingual pretraining data greatly exceeds the parallel data used for finetuning (often by over 100 times) (Guzmán et al., 2019;Liu et al., 2020).
However, in the case of many languages in the world, the above two properties are often not satisfied. In particular, taking the example of African languages (the focus of our work), existing (small) parallel corpora for English-to-African language pairs often comprise the majority of available monolingual data in the corresponding African languages. In addition, African languages are often morphologically rich and from completely different language families, being quite distant from European languages. Moreover, despite the importance of reproducible benchmarks to measuring progress on various tasks in an empirical setting, there exists no standardized machine translation benchmark for the majority of African languages.
In this work, we introduce (1) a new machine translation benchmark for African languages, (2) pretraining techniques to deal with the previously unexplored case where the size of monolingual data resources for pretraining is similar or equal to the size of parallel data resources for finetuning, and (3) evaluation tools designed to measure how well machine translation systems handle the unique grammar of these languages.
Our proposed benchmark, AFROMT, consists of translation tasks between English and 8 African languages (Afrikaans, Xhosa, Zulu, Rundi, Sesotho, Swahili, Bemba, and Lingala), four of which are not included in commercial translation systems such as Google Translate (as of Feb. 2021). In §2, we describe the detailed design of our benchmark, including the language selection criteria and the methodology to collect, clean and normalize the data for training and evaluation purposes. In §3, we provide a set of strong baselines for our benchmark, including denoising sequence-to-sequence pretraining (Lewis et al., 2020;Liu et al., 2020), transfer learning with similar languages (Zoph et al., 2016;Neubig and Hu, 2018), and our proposed data augmentation methods for pretraining on low-resource languages. Our first method leverages bilingual dictionaries to augment data in high-resource languages (HRL), and our second method iteratively creates pseudo-monolingual data in low-resource languages (LRL) for pretraining. Extensive experiments in §4 show that our proposed methods outperform our baselines by up to ∼2 BLEU points over all language pairs and up to ∼15 BLEU points in data-constrained scenarios.

AFROMT benchmark
In this section, we detail the construction of our new benchmark, AFROMT. We first introduce our criteria for selecting the languages ( §2.1), and then describe the steps to prepare the dataset ( §2.2, 2.3).

Language Selection Criteria
Given AFROMT's goal of providing a reproducible evaluation of African language translation, we select languages based on the following criteria:
Coverage of Speakers & Language Representation We select languages largely based on the coverage of speakers as well as how represented they are in commercial translation systems. In total, the AFROMT benchmark covers 225 million L1 and L2 speakers combined, covering a large number of speakers within Sub-Saharan Africa.
Linguistic Characteristics With the exception of English and Afrikaans, which belong to the Indo-European language family, all of the considered languages belong to the Niger-Congo family, which is Africa's largest language family in terms of geographical area and speaking population (see Appendix). Similar to English, the Niger-Congo family generally follows the SVO word order. One particular characteristic feature of these languages is their morphosyntax, especially their system of noun classification, with the number of noun classes often exceeding 10 and with markers denoting male, female, animate, inanimate, and more. These noun classes can be likened in some sense to the male/female designation found in Romance languages. However, in contrast with these languages, noun markers in Niger-Congo languages are often integrated within the word, usually as a prefix (Bendor-Samuel and Hartell, 1989). For example, in Zulu, isiZulu refers to the Zulu language, whereas amaZulu refers to the Zulu people. Additionally, these languages also use "verb extensions", verb suffixes used to modify the meaning of the verb. These qualities contribute to the morphological richness of these languages, a stark contrast with European languages.

Data Sources
For our benchmark, we leverage existing parallel data for each of our language pairs. This data is derived from two main sources: (1) the open-source repository of parallel corpora OPUS (Tiedemann, 2012) and (2) ParaCrawl (Esplà et al., 2019). From OPUS, we use the JW300 corpus (Agić and Vulić, 2019), OpenSubtitles (Lison and Tiedemann, 2016), XhosaNavy, Memat, and QED (Abdelali et al., 2014). Despite the existence of this parallel data, these text datasets were often collected from large, relatively unclean multilingual corpora, e.g., JW300, which was extracted from Jehovah's Witnesses text, or QED, which was extracted from transcribed educational videos. This leads to many sentences with high lexical overlap, inconsistent tokenization, and other undesirable properties for a clean, reproducible benchmark.

Data Preparation
Training machine translation systems with small and noisy corpora for low-resource languages is challenging, and often leads to inaccurate translations. These noisy examples include sentences which contain only symbols and numbers, sentences which consist of only one token, sentences which are the same on both the source and target sides, etc. Furthermore, in these noisy extractions from large multilingual corpora such as JW300, there is a key issue of large text overlap between sentences. Given the risk of data leakage, this prevents one from naively splitting the corpus into random train/validation/test splits. To mitigate these issues, when preparing our data, we use a combination of automatic filtering techniques and manual human verification at each step to produce clean parallel data for the construction of our benchmark. For consistency across language pairs, we perform cleaning mainly based on the English side of the noisy parallel corpora. We list the automatic filtering techniques below:
Removal of extremely short sentences Since we focus on sentence-level machine translation, 4 we remove sentences containing fewer than three whitespace-tokenized tokens, excluding numerical symbols and punctuation. Additionally, we remove pairs in which the source or target sentence is empty.
Removal of non-sentences We remove sentences containing no letters, i.e., pairs that contain only numbers and symbols.
Tokenization normalization We perform detokenization on all corpora using the detokenization script provided in the Moses (Koehn et al., 2007) toolkit 5 . Given that we collect data from various sources, this step is important to allow for consistent tokenization across corpora.

Removal of sentences with high text overlap
To prevent data leakage, we remove sentences with high text overlap. To do this, we use Levenshtein-based fuzzy string matching and remove sentences that have a similarity score of over 60. Given that the cost of measuring this score against all sentences in a corpus grows quadratically with respect to corpus length, we use the following two heuristics to remove sentences with high overlap in an efficient manner: (1) scoring the similarity of each sentence against the 50 sentences preceding it in alphabetically sorted order, and (2) extracting the top 100K four-grams and computing the similarity score within each group of sentences containing at least one instance of a given four-gram.
4 While document-level translation is undoubtedly important, accuracy on the languages in AFROMT is still at the level where sentence-level translation is sufficiently challenging.
5 https://github.com/moses-smt/mosesdecoder/
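The automatic filters in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used to build the benchmark: the function names (`keep_pair`, `similarity`, `drop_near_duplicates`) are hypothetical, the 0-100 similarity score is assumed to be a length-normalized Levenshtein distance (the exact normalization of the fuzzy-matching library is not given in the text), and only the first overlap heuristic (a window of 50 alphabetically sorted neighbors) is shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 100]; 100 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - levenshtein(a, b) / longest)


def keep_pair(src: str, tgt: str, min_tokens: int = 3) -> bool:
    """Short-sentence and non-sentence filters, applied to the English side."""
    if not src.strip() or not tgt.strip():
        return False  # empty source or target
    # One interpretation of "excluding numerical symbols and punctuation":
    # only count whitespace tokens that contain at least one letter.
    words = [tok for tok in src.split() if any(ch.isalpha() for ch in tok)]
    return len(words) >= min_tokens


def drop_near_duplicates(sentences, threshold=60.0, window=50):
    """Sort alphabetically, then compare each sentence with the last
    `window` sentences that were kept; drop it if any score exceeds
    the threshold."""
    kept = []
    for sent in sorted(set(sentences)):
        if all(similarity(sent, other) <= threshold
               for other in kept[-window:]):
            kept.append(sent)
    return kept
```

Sorting first places near-identical sentences next to each other, which is why a small fixed window can catch most of them without comparing every pair in the corpus.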

Data Split
The resulting benchmark is constructed using the data that passes our automatic filtering checks, and we further split the data into train, validation, and test for each language pair. We select 3,000 sentences with the least four-gram overlap (with the corpus) for both validation and testing while leaving the rest of the corpus to be used for training. Validation and test sentences are all further verified for quality. The resulting dataset statistics for each language pair can be seen in Table 1.

Impact of Cleaning Process
Given the non-trivial cleaning process and standardization of key components, such as tokenization, splits, and data leakage, this cleaning provides a more representative corpus for the languages considered. We demonstrate this with an experiment comparing randomly initialized English-Zulu models trained on (a) the original noisy data (including some test data leakage), (b) noisy data without data leakage, similar to the cleaning process used by Nekoto et al. (2020), and (c) the AFROMT data. Scores for each setting are measured in BLEU on the clean test set: (a) 38.6, (b) 27.6, (c) 34.8. Comparing (a) and (c), we find that not filtering the data for leakage leads to misleading results, making the evaluation of models on these LRLs unreliable. Additionally, as shown by (b) vs (c), not filtering for other artifacts hinders performance, leading to unrealistically weak results. Additional quantification of data leakage can be found in the Appendix.

AfroBART
Given that we aim to provide strong baselines for our benchmark, we resort to multilingual sequence-to-sequence training. However, existing pretraining techniques have often been focused on the situation where monolingual data can be found in a larger quantity than parallel data. In this section we describe our proposed multilingual sequence-to-sequence pretraining techniques developed for the novel scenario where even monolingual data is scarce.

Existing Methods
The most widely used methods for multilingual sequence-to-sequence pretraining (Song et al., 2019;Xue et al., 2020;Liu et al., 2020) make a core assumption that the amount of monolingual data in all languages exceeds the amount of parallel data. However, in the case of many African languages, digitized textual data is not widely available, leading this approach to be less effective in these scenarios as shown in Table 2. To mitigate this issue, we build on existing denoising pretraining techniques, particularly BART (Lewis et al., 2020;Liu et al., 2020) and propose two data augmentation methods using dictionaries to augment high-resource monolingual data ( §3.2), and leveraging pseudo monolingual data in low-resource languages ( §3.3). Finally, we iterate the data augmentation with the model training ( §3.4) as shown in Figure 2.

Dictionary Augmentation
Given that existing monolingual corpora in low-resource languages are small, we aim to increase the usage of words from the low-resource language in diverse contexts. To do so, we propose to take sentences from a high-resource language, and replace the words by their corresponding translations that are available in a dictionary extracted from our parallel corpora.
Figure 1: Transforming monolingual high-resource data to augmented code-switched data using an English-Swahili bilingual dictionary
Dictionary Extraction As our data augmentation technique requires a dictionary, we propose to extract the dictionary from parallel corpora using a statistical word aligner, eflomal (Östling and Tiedemann, 2016). Once we produce word alignments between tokens in our parallel corpora, we simply keep the word pairs whose alignment appears over 20 times to produce our bilingual dictionary.
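As a rough sketch of the extraction step, the snippet below counts aligned word pairs and keeps those seen more than 20 times. It assumes the aligner output is available as one line of `i-j` index pairs per sentence pair (source index, then target index); the function name `extract_dictionary` and its arguments are hypothetical, and running the aligner itself is outside the scope of the example.

```python
from collections import Counter, defaultdict


def extract_dictionary(src_sents, tgt_sents, alignments, min_count=20):
    """Build {source word: [target words]} from word alignments.

    `alignments` holds one string per sentence pair, e.g. "0-0 1-2",
    where each "i-j" links source token i to target token j.
    """
    counts = Counter()
    for src, tgt, align in zip(src_sents, tgt_sents, alignments):
        src_tokens, tgt_tokens = src.split(), tgt.split()
        for link in align.split():
            i, j = map(int, link.split("-"))
            counts[(src_tokens[i], tgt_tokens[j])] += 1

    dictionary = defaultdict(list)
    for (src_word, tgt_word), count in counts.items():
        if count > min_count:  # "appear over 20 times"
            dictionary[src_word].append(tgt_word)
    return dict(dictionary)
```

A source word may map to several target words, which is what the random choice in the augmentation step relies on.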
Monolingual Data Augmentation We assume access to three sources of data: a high-resource corpus $H = \{H_i\}$, a low-resource corpus $L = \{L_i\}$, and a bilingual dictionary $D$ mapping each high-resource term $D^h_i$ to a low-resource term $D^l_i$. Given this, for every high-resource sentence $H_i$ we replace 30% of the tokens that match the high-resource terms contained in $D$ with their respective low-resource terms. In the case that there exists more than one low-resource term in $D^l_i$, we randomly select one to replace the high-resource term. Notably, with the assumption that high-resource monolingual data is more diverse in its content given its greater size, this augmentation technique is an effective method to increase the coverage of words from the low-resource lexicon in diverse settings.
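The replacement step can be illustrated as follows. This is a sketch under assumptions: the function name `augment_sentence` is hypothetical, tokens are split on whitespace, and "30% of the matching tokens" is read as a fixed fraction of the dictionary matches in each sentence (rounded to the nearest integer) rather than an independent coin flip per token.

```python
import random


def augment_sentence(sentence, dictionary, rate=0.3, rng=None):
    """Swap a fraction of dictionary-matched tokens for low-resource terms."""
    if rng is None:
        rng = random.Random()
    tokens = sentence.split()
    # Positions whose token has at least one dictionary translation.
    candidates = [i for i, tok in enumerate(tokens) if tok in dictionary]
    n_replace = round(rate * len(candidates))
    for i in rng.sample(candidates, n_replace):
        # If several translations exist, pick one uniformly at random.
        tokens[i] = rng.choice(dictionary[tokens[i]])
    return " ".join(tokens)


# Toy dictionary using the Swahili examples that appear in this paper.
toy_dictionary = {"book": ["kitabu"], "teacher": ["mwalimu"]}
print(augment_sentence("the teacher has a book", toy_dictionary, rate=1.0))
```

With `rate=1.0` every matched token is replaced, which makes the code-switching effect easy to inspect; the 30% setting leaves most of the high-resource context intact.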
Figure 2: Iterative approach to pretraining using pseudo monolingual data and dictionaries. (Diagram components: monolingual corpora, dictionary augmentation, BART pretraining, MT finetuning, translation of HRL monolingual corpora.)

Leveraging Pseudo-Monolingual Data
Although leveraging dictionaries to produce code-switched monolingual data is a useful technique to introduce low-resource words in a wider variety of contexts, the code-switched sentences still lack the fluency and consistency of pure monolingual data. To further mitigate these fluency and data scarcity issues in the LRL, we propose to create fluent pseudo-monolingual data by translating the HRL monolingual data to the low-resource language using a pretrained machine translation model.
Specifically, given a pretrained sequence-to-sequence model $M$, we finetune $M$ for translation from the HRL to the LRL on a parallel corpus of HRL-LRL sentence pairs and obtain a machine translation model $M_{ft}$. With the translation model $M_{ft}$, we then translate the sentences of the high-resource corpus $H$ into our low-resource language $l$ to produce the pseudo LRL monolingual corpus $\hat{L}$:
$$\hat{L} = \{ M_{ft}(H_i) \mid H_i \in H \}$$
Following this, we concatenate the existing low-resource corpus $L$ with $\hat{L}$ and continue training our pretrained sequence-to-sequence model on this new pseudo-monolingual corpus.

Iterative Multilingual Denoising Pretraining
Given the pseudo-monolingual data synthesis step detailed in §3.3, we can simply transform this into an iterative pretraining procedure (Tran et al., 2020). That is, we can use the synthesis step to form a cycle in which a pretrained model initializes an MT model, the MT model synthesizes pseudo monolingual data, and the produced data is used to further train the pretrained model (depicted in Figure 2).
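The cycle can be written down as a small control loop. The sketch below is only a skeleton: `pretrain`, `finetune`, and `translate` are hypothetical callables standing in for the actual training and decoding routines, and the loop tracks only the low-resource corpus, leaving out the high-resource and dictionary-augmented data that also take part in pretraining.

```python
def iterative_pretraining(model, lrl_corpus, hrl_corpus, parallel_corpus,
                          pretrain, finetune, translate, rounds=1):
    """Alternate between denoising pretraining and pseudo-data synthesis.

    pretrain(model, corpus)   -> further pretrained model
    finetune(model, parallel) -> HRL-to-LRL translation model
    translate(mt_model, sent) -> one translated sentence
    """
    corpus = list(lrl_corpus)
    model = pretrain(model, corpus)
    for _ in range(rounds):
        # 1. Initialize an MT model from the current pretrained model.
        mt_model = finetune(model, parallel_corpus)
        # 2. Translate HRL monolingual text into the LRL.
        pseudo = [translate(mt_model, sent) for sent in hrl_corpus]
        # 3. Continue pretraining on real plus pseudo LRL data.
        corpus = list(lrl_corpus) + pseudo
        model = pretrain(model, corpus)
    return model, corpus
```

Setting `rounds=1` gives a single pass through the cycle; each additional round re-synthesizes the pseudo data with a stronger translation model.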

Experimental Setup
In this section, we describe our experimental setup for both pretraining and finetuning strong baselines for our benchmark. Furthermore, we look to evaluate the efficacy of our proposed pretraining techniques and see whether they provide an impact on downstream performance on AFROMT.

Pretraining
Dataset We pretrain AfroBART on 11 languages: Afrikaans, English, French, Dutch 9 , Bemba, Xhosa, Zulu, Rundi, Sesotho, Swahili, and Lingala. To construct the original monolingual corpora, we use a combination of the training sets in AFROMT and data derived from CC100 10 . We only perform dictionary augmentation on our English monolingual data. We list monolingual and pseudo-monolingual corpora statistics in Table 1.
Balancing data across languages As we are training on different languages with widely varying amounts of text, we use the exponential sampling technique used in Conneau and Lample (2019); Liu et al. (2020), where the text is re-sampled according to smoothing parameter $\alpha$ as shown below:
$$q_k = \frac{p_k^{\alpha}}{\sum_{j=1}^{N} p_j^{\alpha}}$$
where $q_k$ refers to the re-sample probability for language $k$, given multinomial distribution $\{q_k\}_{k=1 \ldots N}$ with original sampling probability $p_k$ 11. As we work with many extremely low-resource languages, we choose smoothing parameter $\alpha = 0.25$ (compared with the $\alpha = 0.7$ used in mBART) to alleviate model bias towards an overwhelmingly higher proportion of data in the higher-resource languages.
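To make the effect of the smoothing parameter concrete, the following sketch computes the re-sampling distribution from per-language corpus sizes. The function name `resample_probs` and the corpus sizes are made up for illustration; the formula in the comment is the exponential smoothing of Conneau and Lample (2019).

```python
def resample_probs(sizes, alpha=0.25):
    """Exponentially smoothed sampling probabilities.

    q_k = p_k**alpha / sum_j p_j**alpha, with p_k proportional to the
    amount of data available for language k.
    """
    total = sum(sizes.values())
    p = {lang: n / total for lang, n in sizes.items()}
    normalizer = sum(value ** alpha for value in p.values())
    return {lang: (value ** alpha) / normalizer for lang, value in p.items()}


# Hypothetical corpus sizes: a 16:1 imbalance between two languages.
toy_sizes = {"high": 16_000_000, "low": 1_000_000}
print(resample_probs(toy_sizes, alpha=0.25))
print(resample_probs(toy_sizes, alpha=0.7))
```

With a 16:1 imbalance, $\alpha = 0.25$ brings the sampling ratio down to 2:1 (since $16^{0.25} = 2$), whereas $\alpha = 0.7$ still leaves the larger language sampled roughly seven times as often.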
Hyperparameters We use the following setup to train our AfroBART models, utilizing the mBART implementation in the fairseq 12 library. We tokenize data using SentencePiece (Kudo and Richardson, 2018), using an 80K subword vocabulary. We use the Transformer-base architecture with a hidden dimension of 512, feed-forward size of 2048, and 6 layers for both the encoder and decoder. We set the maximum sequence length to 512 and train with a batch size of 1024 for 100K iterations on 32 NVIDIA V100 GPUs for one day. When we continue training using pseudo-monolingual data, we use a learning rate of $7 \times 10^{-5}$, warm up over 5K iterations, and train for 35K iterations.
9 We select English and French due to their commonplace usage on the continent, as well as Dutch due to its similarity with Afrikaans.
10 http://data.statmt.org/cc-100/
11 $p_k$ is proportional to the amount of data for the language; in the case that we use dictionary augmented data, we keep $p_k$ proportional to the original data for the language.
12 https://github.com/pytorch/fairseq

Finetuning
Baselines We use the following baselines for our benchmark:
• AfroBART Baseline We pretrain a model using only the original monolingual corpora in a similar fashion to Liu et al. (2020).
• AfroBART-Dictionary We pretrain a model using the original data in addition to dictionary-augmented English monolingual corpora for Afrikaans, Bemba, Sesotho, Xhosa, Zulu, Lingala, and Swahili.
• AfroBART We continue training the dictionary-augmented AfroBART model, using pseudo monolingual data produced by its finetuned counterparts. Due to computational constraints we only perform one iteration of our iterative approach. Statistics for the pseudo-monolingual data can be seen in Table 1.
• Cross-Lingual Transfer (CLT) When experimenting on the effect of pretraining with various amounts of finetuning data, we use strong cross-lingual transfer models, involving training from scratch on a combination of both our low-resource data and a similar relatively high-resource language following Neubig and Hu (2018).
• Multilingual Neural Machine Translation (mNMT) We also experiment with a vanilla multilingual machine translation system (Dabre et al., 2020) trained on all En-XX directions.
• Random As an additional baseline, we also compare with randomly initialized Transformer-base (Vaswani et al., 2017) models for each translation pair.
Evaluation We evaluate our system outputs using two automatic evaluation metrics: detokenized BLEU (Papineni et al., 2002;Post, 2018) and chrF (Popović, 2015). Although BLEU is a standard metric for machine translation, being cognizant of the morphological richness of the languages in the AFROMT benchmark, we use chrF to measure performance at a character level. Both metrics are measured using the SacreBLEU library (Post, 2018).
Table 2 shows the results on En-XX translation on the AFROMT benchmark comparing random initialization with various pretrained AfroBART configurations. We find that initializing with pretrained AfroBART weights results in performance gains of ∼1 BLEU across all language pairs. Furthermore, we observe that augmenting our pretraining data with a dictionary results in performance gains across all pairs in terms of chrF and 6/8 pairs in terms of BLEU. The gain is especially clear on languages with smaller amounts of monolingual data, such as Rundi and Bemba, demonstrating the effectiveness of our data augmentation techniques on low-resource translation. Moreover, we see further improvements when augmenting with pseudo monolingual data, especially on pairs with less data, which validates the usage of this technique.

Performance vs Amount of Parallel Data
We perform experiments to demonstrate the effect of pretraining with various amounts of parallel data (10k, 50k, and 100k pairs) on two related language pairs: English-Xhosa and English-Zulu. We compare AfroBART (with both dictionary augmentation and pseudo monolingual data) with randomly initialized models, and cross-lingual transfer models (Neubig and Hu, 2018) jointly trained with a larger amount of parallel data (full AFROMT data) in a related language. In Figure 3, a pretrained AfroBART model finetuned on 10K pairs can almost double the performance of other models (with a significant performance increase over random initialization of 15+ BLEU on English-Zulu), outperforming both cross-lingual transfer and randomly initialized models trained on 5x the data. Furthermore, we notice that CLT performs worse than Random on English-Xhosa as the data size increases. Although we do not have an exact explanation for this, we believe this has to do with the other language data adding noise rather than additional supervision as the data size increases. We detail these results in Table 3 of the Appendix.
Comparison on convergence speed In contrast to the cross-lingual transfer baseline, which involves the usage of more data, and the random initialization baseline, which needs to learn from scratch, AfroBART is able to leverage the knowledge gained during training for fast adaptation even with small amounts of data. For example, AfroBART converged within 1,000 iterations when finetuning on 10K pairs on English-Zulu, whereas the random initialization and cross-lingual transfer baselines converged within 2.5K and 12K iterations respectively. This is promising as it indicates that we can leverage these models quickly for other tasks where there is much less parallel data.

Fine-grained Language Analysis
We further provide a suite of fine-grained analysis tools to compare the baseline systems. In particular, we are interested in evaluating the translation accuracy of noun classes in the considered African languages in the Niger-Congo family, as these languages are morphologically rich and often have more than 10 classes based on the prefix of the word. For example, kitabu and vitabu in Swahili refer to book and books in English, respectively. Based on this language characteristic, our fine-grained analysis tool calculates the translation accuracy of the nouns with the top 10 most frequent prefixes in the test data. To do so, one of the challenges is to identify nouns in a sentence written in the target African language. However, there is no available part-of-speech (POS) tagger for these languages. To tackle this challenge, we propose to use a label projection method based on word alignment. Specifically, we first leverage an existing English POS tagger in the spaCy library to annotate the English source sentences. We then use the fast_align tool (Dyer et al., 2013) to train a word alignment model on the training data for the En-XX language pair, and use the alignment model to obtain the word-level alignment for the test data. We assign the POS tags of the source words in English to their aligned target words in the African language. We then measure the translation accuracy of the nouns in the African language by checking whether the correct nouns are included in the sentences translated by the systems being compared. Notably, our analysis tool can also measure the translation accuracy of words with other POS tags (e.g., verbs, adjectives), which are often inflected to agree with different noun classes. Figure 4 compares the AfroBART and Random baseline in terms of translation accuracy of nouns in Swahili.
First, we find that both systems perform worse on translating nouns with the prefix "ku-" which usually represent the infinitive form of verbs, e.g., kula for eating. Secondly, we find that AfroBART significantly improves translation accuracy for nouns with prefixes "ki-" (describing man-made tools/languages, e.g., kitabu for book) and "mw-" (describing a person, e.g., mwalimu for teacher). Finally, AfroBART improves the translation accuracy on average over the ten noun classes by 1.08% over the Random baseline.
We also perform this analysis on our data-constrained scenario for English-Xhosa, shown in Figure 7. It can be seen that leveraging cross-lingual transfer models (trained on both Xhosa and Zulu) improved noun class accuracy on classes such as uku (infinitive noun class), izi (plural for objects), and ama (plural for body parts), which are shared between the languages. This can be contrasted with iin (plural for animals), which is only used in Xhosa and where CLT decreases performance. These analyses, which require knowledge of the unique grammar found in these languages, can be used for diagnosing cross-lingual transfer for these languages. Also, we note that AfroBART almost doubles the accuracy (improvement of 16.33%) of the cross-lingual transfer baseline on these noun classes.
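The label projection and per-prefix accuracy described in this section can be sketched as follows. The sketch makes several assumptions that are not taken from the released tool: English POS tags are passed in as a list with nouns labeled `NOUN`, alignments are given as `i-j` index pairs (English index, then target index), the set of prefixes is supplied by the caller, and a noun counts as correct when its exact surface form appears in the system output. The names `project_nouns` and `prefix_accuracy` are hypothetical.

```python
from collections import defaultdict


def project_nouns(src_tags, tgt_tokens, alignment):
    """Return the target tokens aligned to an English token tagged NOUN."""
    nouns = []
    for link in alignment.split():
        i, j = map(int, link.split("-"))
        if src_tags[i] == "NOUN":
            nouns.append(tgt_tokens[j])
    return nouns


def prefix_accuracy(examples, prefixes):
    """Per-prefix accuracy of reference nouns found in system output.

    Each example is (src_tags, reference, hypothesis, alignment), where
    the alignment links English tokens to reference tokens.
    """
    # Try longer prefixes first so a short prefix does not shadow a long one.
    ordered = sorted(prefixes, key=len, reverse=True)
    hits, totals = defaultdict(int), defaultdict(int)
    for src_tags, reference, hypothesis, alignment in examples:
        hyp_tokens = set(hypothesis.split())
        for noun in project_nouns(src_tags, reference.split(), alignment):
            prefix = next((p for p in ordered if noun.startswith(p)), None)
            if prefix is None:
                continue
            totals[prefix] += 1
            hits[prefix] += int(noun in hyp_tokens)
    return {prefix: hits[prefix] / totals[prefix] for prefix in totals}
```

Exact surface matching is a strict criterion for morphologically rich languages: a system that outputs vitabu where the reference has kitabu is counted as wrong for the "ki-" class, which is the kind of noun-class error the analysis is meant to surface.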

Shortcomings of AFROMT
Although we believe AFROMT to be an important step in the right direction, we acknowledge it is far from being the end-all-be-all. Specifically, we note the following: (1) the lack of domain diversity among many languages (being largely from religious oriented corpora) and (2) the corpora may still contain some more fine-grained forms of noise in terms of translation given its origin. Given this, in the future we look to include more diverse data sources and more languages and encourage the community to do so as well.

Related Work
Machine Translation Benchmarks Previous work in benchmarking includes the commonly used WMT (Bojar et al., 2017) and IWSLT (Federico et al., 2020) shared tasks. Recent work on MT benchmarks for low-resource languages, such as that of Guzmán et al. (2019), has been used for the purpose of studying current NMT techniques for low-resource languages.
Multilingual Pretraining Multilingual encoder pretraining (Devlin et al., 2019;Conneau and Lample, 2019) has been demonstrated to be an effective technique for cross-lingual transfer on a variety of classification tasks (Hu et al., 2020;Artetxe et al., 2020). More recently, sequence-to-sequence pretraining has emerged as a prevalent method for achieving better performance (Lewis et al., 2020;Song et al., 2019) on generation tasks. Liu et al. (2020) proposed a multilingual approach to BART (Lewis et al., 2020) and demonstrated increased performance on MT.
Building on these works, we extend this to an LRL-focused setting, developing two new techniques for improved performance given monolingual data scarcity. In concurrent work, Liu et al. (2021); Reid and Artetxe (2021) also look at using code-switched corpora for sequence-to-sequence pretraining.
NLP for African Languages Benchmarking machine translation for African languages was first done by Abbott and Martinus (2019) for southern African languages and Abate et al. (2018) for Ethiopian languages. Recent work in NLP for African languages has largely revolved around the grassroots translation initiative Masakhane (Orife et al., 2020;Nekoto et al., 2020). This bottom-up approach to dataset creation (Nekoto et al., 2020), while very valuable, has tended to result in datasets with somewhat disparate data splits and quality standards. In contrast, AFROMT provides a cleaner corpus for the 8 supported languages. We plan to open-source the entire benchmark (splits included) to promote reproducible results in the community.

Conclusion
In this work we proposed a standardized, clean, and reproducible benchmark for 8 African languages, AFROMT, as well as novel pretraining strategies in the previously unexplored low-resource focused setting. Our benchmark and evaluation suite are a step towards larger, reproducible benchmarks in these languages, helping to provide insights on how current MT techniques work for these underexplored languages. We will release this benchmark, our pretrained AfroBART models, dictionaries, and pseudo monolingual data to the community to facilitate further work in this area.
In future work we look to use similar methodology to advance in both of these directions. We look to increase the number of language pairs in AFROMT to be more representative of the African continent. Additionally, we look to scale up our pretraining approaches for increased performance.

A AFROMT
We provide extra information (script, language family, L1 and L2 speakers, location, and word order) in Table 4. We upload AFROMT as well as the data generated using the pseudo monolingual data synthesis 16.

B Pretraining
Data In addition to the monolingual data for the languages in AFROMT (shown in Table 1 of the main paper), we use 14 GB of English data, and 7 GB each of French and Dutch data.
Additional Hyperparameters We optimize the model using Adam (Kingma and Ba, 2015) with hyperparameters $\beta = (0.9, 0.98)$ and $\epsilon = 10^{-6}$. We warm up the learning rate to a peak of $3 \times 10^{-4}$ over 10K iterations and then decay the learning rate using the polynomial schedule for 90K iterations. For regularization, we use a dropout value of 0.1 and weight decay of 0.01.

C Finetuning Hyperparameters
Training from scratch When training using random initialization (or CLT), we use a batch size of 32K (or 64K in the case of CLT) tokens and warm up the learning rate to $5 \times 10^{-4}$ over 10K iterations and decay with the inverse square root schedule. We use a dropout value of 0.3, a weight decay value of 0.01, and a label smoothing value of 0.1.
Finetuning from AfroBART We train using a batch size of 32K tokens, and use a smaller learning rate of $3 \times 10^{-4}$. We use a polynomial learning rate schedule, with the learning rate peaking at 5,000 iterations and training finishing after 50K iterations. We perform early stopping, halting training if the best validation loss does not improve for over 10 epochs. We use a label smoothing value of 0.2, a dropout value of 0.3, and weight decay of 0.01.

D Training Infrastructure
For finetuning models on AFROMT we use between 1 and 8 NVIDIA V100 16GB GPUs on a DGX-1 machine running Ubuntu 16.04 on a Dual 20-Core Intel Xeon E5-2698 v4 2.2 GHz. For pretraining we make use of a compute cluster using 8 nodes with 4 NVIDIA V100 16GB GPUs per node.
16 Note that we do not generate pseudo monolingual data for Afrikaans due to its high similarity with Dutch, a high-resource language.

E Quantification of Potential Data Leakage
In low-resource machine translation, data leakage is a key concern, since it can produce misleading results. We quantify the target-side train-test data leakage of our benchmark using the 4-gram overlap between the training and test sets. We take the most frequent 100k 4-grams from the training set and compare them with all 4-grams in the test set, obtaining an average 4-gram overlap of 5.01±2.56% (measured against all test-set 4-grams). To put this value in context, we ran the same measurement on other widely used low-resource datasets from IWSLT (En-Vi, Ja-En, Ar-En) and obtained 9.50%, 5.49%, and 5.53% respectively. We believe this to be reasonable evidence of the lack of train-test data leakage. Furthermore, we also show improvements on source-target leakage as follows: computing BLEU between the source and target over all training sets, we obtain an average of 4.5±1.3 before cleaning (indicating heavy overlap in certain source-target pairs in the corpus) and 0.7±0.2 after cleaning, indicating a significant decrease in such overlap.
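The train-test overlap measurement can be reproduced in spirit with a short script. The sketch below is an assumption-laden illustration rather than the exact evaluation code: the function names are hypothetical, sentences are whitespace-tokenized, and overlap is reported as the percentage of test-set 4-gram occurrences that appear among the most frequent training 4-grams.

```python
from collections import Counter


def ngrams(tokens, n=4):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def four_gram_overlap(train_sents, test_sents, top_k=100_000, n=4):
    """Percentage of test n-grams found among the top-k training n-grams."""
    counts = Counter(gram
                     for sent in train_sents
                     for gram in ngrams(sent.split(), n))
    frequent = {gram for gram, _ in counts.most_common(top_k)}
    test_grams = [gram
                  for sent in test_sents
                  for gram in ngrams(sent.split(), n)]
    if not test_grams:
        return 0.0
    matched = sum(gram in frequent for gram in test_grams)
    return 100.0 * matched / len(test_grams)
```

Restricting the comparison to the most frequent training 4-grams keeps memory bounded on large corpora while still capturing the repeated phrases that are the main source of leakage.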

F Parameter Count
We keep the parameter count of 85M consistent throughout our experiments as we use the same model architecture. We ran experiments on scaling up randomly initialized models with a hidden size of 768 and feed-forward dimension of 3072 with 6 layers in both the encoder and decoder on three language pairs. The results of these experiments can be seen in Table 3.
We perform our fine-grained morphological analysis (described in §5.3 of the main paper) on the data-constrained scenario (described in §5.2 of the main paper). We perform the analysis on English-Xhosa and English-Zulu (10k parallel sentence pairs) side by side and visualize them in Figure 6 and Figure 7. It can be seen that cross-lingual transfer improves accuracy in this data-constrained scenario over a random baseline, which is in turn improved upon by AfroBART. Additionally, we report the BLEU and chrF scores of the data-constrained experiments (shown in Figure 3 of the main paper) in Table 5.