Machine Translation into Low-resource Language Varieties

State-of-the-art machine translation (MT) systems are typically trained to generate “standard” target language; however, many languages have multiple varieties (regional varieties, dialects, sociolects, non-native varieties) that are different from the standard language. Such varieties are often low-resource, and hence do not benefit from contemporary NLP solutions, MT included. We propose a general framework to rapidly adapt MT systems to generate language varieties that are close to, but different from, the standard target language, using no parallel (source–variety) data. This also includes adaptation of MT systems to low-resource typologically-related target languages. We experiment with adapting an English–Russian MT system to generate Ukrainian and Belarusian, an English–Norwegian Bokmål system to generate Nynorsk, and an English–Arabic system to generate four Arabic dialects, obtaining significant improvements over competitive baselines.


Introduction
Despite tremendous progress in machine translation (Bahdanau et al., 2015; Vaswani et al., 2017) and language generation in general, current state-of-the-art systems often work under the assumption that a language is homogeneously spoken and understood by its speakers: they generate a "standard" form of the target language, typically based on the availability of parallel data. But language use varies with region, socio-economic background, ethnicity, and fluency, and many widely spoken languages consist of dozens of varieties or dialects, with differing lexical, morphological, and syntactic patterns, for which no translation data are typically available. As a result, models trained to translate from a source language (SRC) to a standard language variety (STD) lead to a sub-par experience for speakers of other varieties.
Motivated by these issues, we focus on the task of adapting a trained SRC→STD translation model to generate text in a different target variety (TGT), having access only to limited monolingual corpora in TGT and no SRC-TGT parallel data. TGT may be a dialect of STD, a variety of it, or a typologically related language. We present an effective transfer-learning framework for translation into low-resource language varieties. Our method reuses SRC→STD MT models and finetunes them on synthesized (pseudo-parallel) SRC-TGT texts. This allows for rapid adaptation of MT models to new varieties without having to train everything from scratch. Using word-embedding adaptation techniques, we show that MT models which predict continuous word vectors (Kumar and Tsvetkov, 2019) rather than softmax probabilities lead to superior performance, since they allow additional knowledge to be injected into the models through transfer between word embeddings learned from high-resource (STD) and low-resource (TGT) monolingual corpora.
We evaluate our framework on three translation tasks: English to Ukrainian and Belarusian, assuming parallel data are only available for English→Russian; English to Nynorsk, with only English to Norwegian Bokmål parallel data; and English to four Arabic dialects, with only English→Modern Standard Arabic (MSA) parallel data. Our approach outperforms competitive baselines based on unsupervised MT, and methods based on finetuning softmax-based models.

A Transfer-learning Architecture
We first formalize the task setup. We are given a parallel SRC→STD corpus, which allows us to train a translation model f(·; θ) that takes an input sentence x in SRC and generates its translation in the standard variety STD, ŷ_STD = f(x; θ). Here, θ are the learnable parameters of the model. We are also given monolingual corpora in both the standard variety STD and the target variety TGT. Our goal is to modify f to generate translations ŷ_TGT in the target variety TGT. At training time, we assume no SRC-TGT or STD-TGT parallel data are available.

Figure 1: An overview of our approach. (a) Using the available STD monolingual corpora, we first train word vectors using fasttext; (b) we then train a SRC→STD translation model on the parallel corpora to predict the pretrained word vectors; (c) next, we train a STD→SRC model and use it to translate the TGT monolingual corpora into SRC; (d) we then finetune the STD subword embeddings to learn TGT word embeddings; and finally (e) we finetune the SRC→STD model to generate the TGT pretrained embeddings using the back-translated SRC-TGT data.
Our solution (Figure 1) is based on a transformer-based encoder-decoder architecture (Vaswani et al., 2017), which we modify to predict word vectors. Following Kumar and Tsvetkov (2019), instead of treating each token in the vocabulary as a discrete unit, we represent it using a unit-normalized d-dimensional pre-trained vector. These vectors are learned from a STD monolingual corpus using fasttext (Bojanowski et al., 2017). A word's representation is computed as the average of the vectors of its character n-grams, allowing surface-level linguistic information to be shared among words. At each step in the decoder, we feed this pretrained vector as input and, instead of predicting a probability distribution over the vocabulary with a softmax layer, we predict a d-dimensional continuous-valued vector. We train this model by minimizing the von Mises-Fisher (vMF) loss, a probabilistic variant of cosine distance, between the predicted vector and the pre-trained vector. The pre-trained vectors (at both the input and output of the decoder) are not updated with the model. To decode from this model, at each step the output word is generated by finding the nearest neighbor (in terms of cosine similarity) of the predicted output vector in the pre-trained embedding table.
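As a concrete illustration of the continuous-output decoding step, the sketch below implements nearest-neighbor decoding over a toy two-dimensional embedding table in pure Python. All vectors and tokens are illustrative; the actual models use d-dimensional fasttext vectors and are trained with the vMF loss.

```python
import math

def normalize(v):
    # unit-normalize a vector (the model predicts and consumes unit vectors)
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    # cosine similarity of two unit vectors is just their dot product
    return sum(a * b for a, b in zip(u, v))

def nearest_token(predicted, embedding_table):
    # decode one step: pick the vocabulary item whose pre-trained vector
    # is closest (by cosine similarity) to the predicted output vector
    return max(embedding_table, key=lambda tok: cosine(predicted, embedding_table[tok]))

# toy 2-dimensional embedding table (illustrative only)
table = {
    "кіт": normalize([0.9, 0.1]),
    "пес": normalize([0.1, 0.9]),
}
pred = normalize([0.8, 0.3])
decoded = nearest_token(pred, table)  # closest entry in the table
```

Greedy decoding with this model is simply one such lookup per time step, which is why no softmax over the vocabulary is ever computed.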
We train f in this fashion using the SRC-STD parallel data. As shown below, training a softmax-based SRC→STD model to later finetune with TGT suffers from vocabulary mismatch between STD and TGT and is thus detrimental to downstream performance. By replacing the decoder input and output with pretrained vectors, we separate the vocabulary from the MT model, making adaptation easier. Now, to finetune this model to generate TGT, we need TGT embeddings. Since the TGT monolingual corpus is small, training fasttext vectors on this corpus from scratch leads (as we show) to low-quality embeddings. Leveraging the relatedness of STD and TGT and their vocabulary overlap, we use STD embeddings to transfer knowledge to TGT embeddings: for each character n-gram in the TGT corpus, we initialize its embedding with the corresponding STD embedding, if available. We then continue training fasttext on the TGT monolingual corpus (Chaudhary et al., 2018). Last, we use a supervised embedding alignment method (Lample et al., 2018a) to project the learned TGT embeddings into the same space as STD. STD and TGT are expected to have large lexical overlap, so we use tokens identical in both varieties as supervision for this alignment. Thanks to transfer learning from STD, the obtained embeddings inject additional knowledge into the model.
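The character n-gram transfer described above can be sketched as follows. This is a minimal illustration of the initialization logic only (the real pipeline continues fasttext training on the TGT corpus afterwards); the dimensions and words are toy values.

```python
import random

def char_ngrams(word, n_min=3, n_max=6):
    # fasttext-style character n-grams with boundary markers
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def init_tgt_ngram_vectors(tgt_words, std_ngram_vecs, dim=4, seed=0):
    # initialize each TGT character n-gram from the STD embedding when available,
    # otherwise from small random values; fasttext training then continues on TGT data
    rng = random.Random(seed)
    tgt_vecs = {}
    for word in tgt_words:
        for ng in char_ngrams(word):
            if ng not in tgt_vecs:
                tgt_vecs[ng] = std_ngram_vecs.get(ng) or \
                    [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return tgt_vecs

def word_vector(word, ngram_vecs):
    # a word's vector is the average of its character n-gram vectors
    # (assumes at least one of the word's n-grams is in the table)
    vecs = [ngram_vecs[ng] for ng in char_ngrams(word) if ng in ngram_vecs]
    return [sum(c) / len(vecs) for c in zip(*vecs)]
```

Because shared n-grams start from the STD vectors, TGT words with STD-like subwords begin training already close to their STD counterparts.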
Finally, to obtain a SRC→TGT model, we finetune f on pseudo-parallel SRC-TGT data. Using a STD→SRC MT model (a back-translation model trained on the large STD-SRC parallel data with standard settings), we back-translate the TGT data into SRC. Naturally, these synthetic parallel data are noisy despite the similarity between STD and TGT, but we show that they improve the overall performance. We discuss the implications of this noise in §4.
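The construction of the pseudo-parallel finetuning data can be sketched as below, where `translate_std_to_src` is a stand-in for the trained STD→SRC back-translation model.

```python
def build_pseudo_parallel(tgt_monolingual, translate_std_to_src):
    # Back-translate each TGT sentence into SRC with the STD→SRC model
    # (TGT is close enough to STD for the model to produce usable, if noisy, SRC).
    # The resulting (SRC, TGT) pairs are then used to finetune the SRC→STD model.
    pairs = []
    for tgt_sentence in tgt_monolingual:
        src_sentence = translate_std_to_src(tgt_sentence)
        pairs.append((src_sentence, tgt_sentence))
    return pairs

# stand-in for a real STD→SRC MT model (illustrative only)
mock_translate = lambda s: f"EN({s})"
data = build_pseudo_parallel(["речення один", "речення два"], mock_translate)
```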

Experimental Setup
Datasets We experiment with two setups. In the first (synthetic) setup, we use English (EN) as SRC, Russian (RU) as STD, and Ukrainian (UK) and Belarusian (BE) as TGTs. We sample 10M EN-RU sentences from the WMT'19 shared task (Ma et al., 2019) and 80M RU sentences from the CoNLL'17 shared task to train embeddings. To simulate low-resource scenarios, we sample 10K, 100K, and 1M UK sentences from the CoNLL'17 shared task and BE sentences from the OSCAR corpus (Ortiz Suárez et al., 2020). We use TED dev/test sets for both language pairs (Cettolo et al., 2012).
The second (real-world) setup covers two language settings. The first defines English as SRC, with Modern Standard Arabic (MSA) as STD and four Arabic varieties spoken in Doha, Beirut, Rabat, and Tunis as TGTs. We sample 10M EN-MSA sentences from the UNPC corpus (Ziemski et al., 2016) and 80M MSA sentences from the CoNLL'17 shared task. For the Arabic varieties, we use the MADAR corpus (Bouamor et al., 2018), which consists of 12K 6-way parallel sentences between English, MSA, and the four considered varieties. We ignore the English sentences, sample dev/test sets of 1K sentences each, and use 10K monolingual sentences for each TGT variety. The second setting also has English as SRC, with Norwegian Bokmål (NO) as STD and its written variety Nynorsk (NN) as TGT. We use 630K EN-NO sentences from WikiMatrix (Schwenk et al., 2021), and 26M NO sentences from ParaCrawl (Esplà et al., 2019) combined with the WikiMatrix NO sentences to train embeddings. We use 310K NN sentences from WikiMatrix, and TED dev/test sets for both varieties (Reimers and Gurevych, 2020).
Preprocessing We preprocess raw text using Byte Pair Encoding (BPE; Sennrich et al., 2016) with 24K merge operations per SRC-STD corpus, trained separately on SRC and STD. We use the same BPE model to tokenize the monolingual STD data and learn fasttext embeddings (we consider character n-grams of length 3 to 6). Splitting the TGT words with the same STD BPE model would result in heavy over-segmentation, especially when TGT contains characters not present in STD. To counter this, we train a joint BPE model with 24K operations on the concatenation of the STD and TGT corpora and use it to tokenize the TGT corpus, following Chronopoulou et al. (2020). This technique increases the number of shared tokens between STD and TGT, thus enabling better cross-variety transfer both while learning embeddings and while finetuning. We follow Chaudhary et al. (2018) to train embeddings on the generated TGT vocabulary, where we initialize the character n-gram representations for TGT words with STD's fasttext model wherever available and finetune them on the TGT corpus.
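To illustrate why joint BPE increases token sharing, here is a deliberately tiny BPE learner in pure Python. Training it on the concatenation of STD and TGT text yields merge operations common to both varieties; the actual experiments use standard BPE implementations with 24K merges, and the corpus below is a toy placeholder.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    # minimal BPE: repeatedly merge the most frequent adjacent symbol pair
    words = Counter(tuple(w) + ("</w>",) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

# training on concatenated STD+TGT text yields merges shared by both
# varieties, increasing token overlap for cross-variety transfer
joint_merges = learn_bpe_merges(["мова мова", "мовы мовы"], 3)
```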

Implementation and Evaluation
We modify the standard OpenNMT-py seq2seq models of PyTorch (Klein et al., 2017) to train our model with the vMF loss (Kumar and Tsvetkov, 2019). Additional hyperparameter details are outlined in Appendix B. We evaluate our methods using the BLEU score (Papineni et al., 2002) based on the SacreBLEU implementation (Post, 2018). For the Arabic varieties, we also report a macro-average. In addition, to measure the expected impact on actual systems' users, we follow Faisal et al. (2021) in computing a population-weighted macro-average (avg_pop) based on language community populations provided by Ethnologue (Eberhard et al., 2019).
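The two aggregate scores can be sketched as below; the BLEU values and population weights are made-up placeholders, not the Ethnologue figures used in the paper.

```python
def macro_average(bleu_by_dialect):
    # plain macro-average: every dialect counts equally
    return sum(bleu_by_dialect.values()) / len(bleu_by_dialect)

def population_weighted_average(bleu_by_dialect, population_by_dialect):
    # avg_pop: weight each dialect's BLEU by the share of speakers it represents,
    # approximating the expected benefit for a randomly chosen user
    total = sum(population_by_dialect.values())
    return sum(bleu * population_by_dialect[d] / total
               for d, bleu in bleu_by_dialect.items())

# illustrative numbers only
bleu = {"afb": 10.0, "apc": 8.0}
pop = {"afb": 3.0, "apc": 1.0}
```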

Experiments
Our proposed framework, LANGVARMT, consists of three main components: (1) A supervised SRC→STD model is trained to predict continuous STD word embeddings rather than discrete softmax probabilities. (2) Output STD embeddings are replaced with TGT embeddings. The TGT embeddings are trained by finetuning STD embeddings on monolingual TGT data and aligning the two embedding spaces. (3) The resulting model is finetuned with pseudo-parallel SRC→TGT data.
We compare LANGVARMT with the following competitive baselines. SUP(SRC→STD): train a standard (softmax-based) supervised SRC→STD model, and consider its output as the TGT translation.

Synthetic Setup Considering STD and TGT as the same language is sub-optimal, as is evident from the poor performance of the non-adapted SUP(SRC→STD) model. Clearly, special attention ought to be paid to language varieties. Direct unsupervised translation from SRC to TGT also performs poorly, confirming previously reported results on the ineffectiveness of such methods for unrelated languages (Guzmán et al., 2019). Additional ablation results are listed in Appendix C.
Translating SRC to TGT by pivoting through STD achieves much better performance, owing to strong UNSUP(STD→TGT) models that leverage the similarities between STD and TGT. However, when resources are scarce (e.g., with 10K monolingual sentences as opposed to 1M), this performance gain diminishes considerably. We attribute this drop to overfitting during the pre-training phase on the small TGT monolingual data. Ablation results (Appendix C) also show that in such low-resource settings the learned embeddings are of low quality.
Finally, LANGVARMT consistently outperforms all baselines. Using 1M UK sentences, it achieves similar performance (for EN→UK) to the softmax ablation of our method, SOFTMAX, and small gains over unsupervised methods. However, in lower-resource settings our approach clearly outperforms the strongest baselines, by over 4 BLEU points for UK (10K) and 3.9 points for BE (100K).
To identify potential sources of error in our proposed method, we lemmatize the generated translations and the test sets (Qi et al., 2020) and evaluate BLEU. Across all data sizes, both UK and BE show a substantial increase in BLEU (up to +6 BLEU; see Appendix D for details) compared to that obtained on raw text, indicating morphological errors in the translations. In future work, we will investigate whether we can alleviate this issue by considering TGT embeddings based on morphological features of tokens (Chaudhary et al., 2018).

Real-world Setup
The effectiveness of LANGVARMT is pronounced in this setup, with a dramatic improvement of more than 18 BLEU points over unsupervised baselines when translating into Doha Arabic. We hypothesize that during the pretraining phase of unsupervised methods, the extreme difference between the size of the MSA monolingual corpus (10M) and the varieties' corpora (10K) leads to overfitting. Additionally, compared to the synthetic setup, the Arabic varieties we consider are quite close to MSA, allowing for easy and effective adaptation of both the word embeddings and the EN→MSA models. LANGVARMT also improves on all other Arabic varieties, although naturally some varieties remain challenging. For example, the Rabat and particularly the Tunis varieties are more likely to include French loanwords (Bouamor et al., 2018), which are not adequately handled as they are not part of our vocabulary. In future work, we will investigate whether we can alleviate this issue by adding French corpora (transliterated into Arabic) to our TGT language corpora. On average, our approach improves by 2.3 BLEU points over the softmax-based baseline (cf. 7.7 and 10.0 in Table 2 under avg_L) across the four Arabic dialects. For the population-weighted average (avg_pop), we associate the Doha variety with Gulf Arabic (ISO code: afb), the Beirut one with North Levantine Arabic (apc), the Rabat one with Moroccan Arabic (ary), and the Tunis variety with Tunisian Arabic (aeb). As before, LANGVARMT outperforms the baselines. The absolute BLEU scores in this highly challenging setup are admittedly low, but as we discuss in Appendix D, the translations generated by LANGVARMT are often fluent and input-preserving, especially compared to the baselines.
Finally, due to the high similarity between NO and NN, the SUP(EN→NO) model also performs well on NN, reaching 11.3 BLEU, but our method yields further gains of more than 4 BLEU points over the baselines.

Discussion
Fairness The goal of this work is to develop more equitable technologies, usable by speakers of diverse language varieties. Here, we evaluate our systems along principles of fairness; specifically, we evaluate the fairness of our Arabic multi-dialect system's utility in proportion to the populations speaking those dialects. In particular, we seek to measure how much average benefit speakers of different dialects receive when their respective translation performance improves. A simple proxy for fairness is the standard deviation (or, even simpler, the max − min performance) of the BLEU scores across dialects; a higher value implies more unfairness across the dialects. Beyond that, we measure a system's unfairness with respect to the different dialect subgroups using an adaptation of the generalized entropy index (Speicher et al., 2018), which considers inequities within and between subgroups in evaluating the overall unfairness of an algorithm on a population (Faisal et al., 2021); see Appendix F for details and additional discussion.

Negative Results Our proposed method relies on two components: (1) the quality of the TGT word embeddings, which depends on the shared (subword) vocabulary of STD and TGT, and (2) the quality of the pseudo-parallel data obtained via back-translation.

Conclusion
We presented a transfer-learning framework for rapid and effective adaptation of MT models to different varieties of the target language, without access to any source-to-variety parallel data. We demonstrated significant gains in BLEU across several language pairs, especially in highly resource-scarce scenarios. The improvements are mainly due to the benefits of continuous-output models over softmax-based generation. Our analysis highlights the importance of addressing morphological differences between language varieties, which will be the focus of our future work.

A Related Work
Early work addressing translation involving language varieties includes rule-based transformations (Altintas and Cicekli, 2002; Marujo et al., 2011; Tan et al., 2012), which rely on language-specific information and expert knowledge that can be expensive and difficult to scale. More recent work focuses only on cases where parallel data do exist. It includes combinations of word-level and character-level MT between related languages (Vilar et al., 2007; Tiedemann, 2009; Nakov and Tiedemann, 2012) and multilingual models trained to translate to/from English into different varieties of a language (e.g., Lakew et al. (2018) work on Brazilian-European Portuguese and European-Canadian French). Such parallel data, however, are typically unavailable for most language varieties. Unsupervised translation models, which require only monolingual data, can address this limitation (Artetxe et al., 2018; Lample et al., 2018a; Garcia et al., 2020, 2021). However, when even monolingual corpora are limited, unsupervised models are challenging to train and are quite ineffective for translating between unrelated languages (Marchisio et al., 2020). Considering varieties of a language as writing styles, unsupervised style transfer (Yang et al., 2018; He et al., 2020) and deciphering methods (Pourdamghani and Knight, 2017) have also been explored for translating between varieties, but have not been shown to perform well, often reporting only BLEU-1 scores since their BLEU-4 scores are close to 0. Additionally, all of these approaches require simultaneous access to data in all varieties during training and must be retrained from scratch when a new variety is added. In contrast, our method allows for easy adaptation of SRC→STD models to any new variety as it arrives.
Considering a new target variety as a new domain of STD, unsupervised domain adaptation methods can be employed, such as finetuning SRC→STD models on pseudo-parallel corpora generated from monolingual corpora in the target varieties (Hu et al., 2019; Currey et al., 2017). Our proposed method is most related to this approach; however, while these methods have the potential to adapt the decoder language model, effective transfer requires STD and TGT to share a vocabulary, which is not the case for most language varieties due to lexical, morphological, and at times orthographic differences. In contrast, our method makes use of cross-variety word embeddings. While our examples only involve same-script varieties, augmenting our approach to work across scripts through a transliteration component is straightforward.

B Implementation Details
We modify the standard OpenNMT-py seq2seq models of PyTorch (Klein et al., 2017) to train our model with vMF loss (Kumar and Tsvetkov, 2019). We use the transformer-BASE model (Vaswani et al., 2017), with 6 layers in both encoder and decoder and with 8 attention heads, as our underlying architecture. We modify this model to predict pretrained fasttext vectors. We also initialize the decoder input embedding table with the pretrained vectors and do not update them during model training. All models are optimized using Rectified Adam (Liu et al., 2020) with a batch size of 4K tokens and dropout of 0.1. We train SRC→STD models for 350K steps with an initial learning rate of 0.0007 with linear decay. For finetuning, we reduce the learning rate to 0.0001 and train for up to 100K steps. We use early stopping in all models based on validation loss computed every 2K steps. We decode all the softmax-based models with a beam size of 5 and all the vMF-based models greedily.
We evaluate our methods using BLEU score (Papineni et al., 2002) based on the SacreBLEU implementation (Post, 2018). While we recognize the limitations of BLEU (Mathur et al., 2020), more sophisticated embedding-based metrics for MT evaluation (Zhang et al., 2020;Sellam et al., 2020) are simply not available for language varieties.

C Additional English-Ukrainian Experiments
On our resource-richest setup of EN→UK translation using 1M UK sentences and RU as STD, we compare our method with the following additional baselines; Table 3 shows the results. PIVOT:DICTREPLACE(STD→TGT): we first translate SRC to STD using SUP(SRC→STD), and then modify the STD output to obtain a TGT sentence as follows. We create a STD-TGT dictionary using the embedding map suggested by Lample et al. (2018b). This dictionary is created on words tokenized with the Moses tokenizer (Hoang and Koehn, 2008) rather than BPE tokens. We replace each token in the generated STD sentence that is not in the TGT vocabulary with its dictionary translation (if available). We consider this baseline to measure lexical vs. syntactic/phrase-level differences between Russian and Ukrainian.
In addition to baseline comparison, we report the following ablation experiments.
(1) To measure transfer from STD to TGT embeddings, we finetune the SUP(SRC→STD) model using TGT embeddings trained from scratch (as opposed to initialized with STD embeddings).
(2) To measure the impact of initialization during model finetuning, we compare with a randomly initialized model trained in a supervised fashion on the pseudo-parallel SRC-TGT data.

Baselines PIVOT:DICTREPLACE(STD→TGT) gains some improvement over considering the output of SUP(SRC→STD) as TGT, probably due to syntactic similarities between Russian and Ukrainian. This result could potentially be further improved with a human-curated RU-UK dictionary, but such resources are typically not available for the low-resource settings we consider in this paper.
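The replacement step of the PIVOT:DICTREPLACE(STD→TGT) baseline can be sketched as follows; the toy RU→UK mapping shown is illustrative, not the induced embedding-based dictionary.

```python
def dict_replace(std_tokens, tgt_vocab, std_to_tgt):
    # keep tokens already in the TGT vocabulary, and map the rest
    # through an induced STD→TGT dictionary when an entry exists
    out = []
    for tok in std_tokens:
        if tok in tgt_vocab:
            out.append(tok)
        else:
            out.append(std_to_tgt.get(tok, tok))  # fall back to the STD token
    return out

# illustrative RU→UK word mapping (not from the actual induced dictionary)
mapping = {"собака": "пес"}
result = dict_replace(["це", "собака"], {"це"}, mapping)
```

Because only tokens are swapped, this baseline leaves STD word order and morphology untouched, which is exactly why it isolates lexical from syntactic differences.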
Ablations As shown in Table 3, training the SRC→TGT model on a randomly initialized model (LANGVAR-RANDOM) results in a performance drop, confirming that transfer learning from a SRC→STD model is beneficial. Similarly, using TGT embeddings trained from scratch (LANGVARMT w/ poor embeddings) results in a drastic performance drop, providing evidence for essential transfer from STD embeddings.

D Analysis
To better understand the performance of our models, we perform additional analyses.
Lemmatized BLEU For UK and BE, we lemmatize each word in the test sets and the translations and evaluate BLEU scores. The results, shown in Table 4, indicate that our framework often generates correct lemmas but may fail to produce the correct inflectional form of the target words. This highlights the importance of considering morphological differences between language varieties. The high BLEU scores also suggest that the resulting translations are likely understandable, albeit not always grammatical.

Translation of Rare Words On the outputs of the EN→UK model trained with 100K UK sentences, we compute the translation accuracy of words based on their frequency in the TGT monolingual corpus for LANGVARMT, our best baseline SUP(SRC→STD)+UNSUP(SRC→TGT), and the best-performing ablation SOFTMAX. These results, shown in Table 5, reveal that LANGVARMT is more accurate at translating rare words (with frequency less than 10) than the baselines.
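One simple way to compute such frequency-bucketed translation accuracy is sketched below. It uses a bag-of-words match against the system output as a proxy for word-level translation accuracy, and the bucket edges are illustrative; the paper's exact scoring procedure may differ.

```python
from collections import Counter

def accuracy_by_frequency(references, hypotheses, monolingual, buckets=(10, 100)):
    # bucket reference words by their frequency in the TGT monolingual corpus and
    # measure, per bucket, how often they also appear in the system output
    freq = Counter(w for line in monolingual for w in line.split())
    edges = (0,) + tuple(buckets) + (float("inf"),)
    hit, tot = Counter(), Counter()
    for ref, hyp in zip(references, hypotheses):
        hyp_words = set(hyp.split())
        for w in ref.split():
            b = next(i for i in range(len(edges) - 1) if edges[i] <= freq[w] < edges[i + 1])
            tot[b] += 1
            hit[b] += w in hyp_words
    return {f"[{edges[i]},{edges[i+1]})": hit[i] / tot[i] for i in tot}
```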
Examples We provide examples of EN-UK and EN-Beirut Arabic translations generated by the three models in Tables 6 and 7. As evaluated by native speakers of Beirut Arabic, despite a BLEU score of only 8, in a majority of cases our model generates fluent translations that preserve most of the content of the input, whereas the baseline model ignores many of the content words. We also observe that in some cases, despite predicting in the right semantic region of the pretrained embedding space, the model fails to predict the right token, resulting in surface-form errors (e.g., predicting adjectival forms of verbs). This phenomenon is known and is studied in more detail by Kumar and Tsvetkov (2019).

E Negative Results
We present results for the following experiments:

F Measuring Unfairness
When evaluating multilingual and multi-dialect systems, it is crucial that the evaluation take into account principles of fairness, as outlined in economics and social choice theory (Choudhury and Deshpande, 2021). We follow the least difference principle proposed by Rawls (1999), whose egalitarian approach aims to narrow the gap between unequal accuracies.

A simple proxy for unfairness is the standard deviation (or, even simpler, the max − min performance) of the scores across languages. Beyond that, we measure a system's unfairness with respect to the different subgroups using the adaptation of the generalized entropy index described by Speicher et al. (2018), which considers inequities within and between subgroups in evaluating the overall unfairness of an algorithm on a population. The generalized entropy index for a population of n individuals receiving benefits b_1, b_2, ..., b_n with mean benefit µ is

E_\alpha(b) = \frac{1}{n\alpha(\alpha-1)} \sum_{i=1}^{n} \left[ \left( \frac{b_i}{\mu} \right)^{\alpha} - 1 \right].

Using α = 2 following Speicher et al. (2018), the generalized entropy index corresponds to half the squared coefficient of variation. If the underlying population can be split into |G| disjoint subgroups across some attribute (e.g., gender, age, or language variety), we can decompose the total unfairness into individual- and group-level unfairness. Each subgroup g ∈ G corresponds to n_g individuals with benefit vector b^g = (b^g_1, b^g_2, ..., b^g_{n_g}) and mean benefit µ_g. The total generalized entropy can then be re-written as

E_\alpha(b) = \sum_{g=1}^{|G|} \frac{n_g}{n} \left( \frac{\mu_g}{\mu} \right)^{\alpha} E_\alpha(b^g) + E^\beta_\alpha(b).

The first (summation) term corresponds to the weighted unfairness observed within each subgroup, while the second term E^\beta_\alpha(b) corresponds to the unfairness across the different subgroups.
In this measure of unfairness, we define the benefit as directly proportional to the system's accuracy. For a machine translation system, each user receives an average benefit equal to the BLEU score the MT system achieves on the user's dialect. Conceptually, if the system produces a perfect translation (BLEU = 1), the user receives the highest benefit of 1; if the system fails to produce a meaningful translation (BLEU → 0), the user receives no benefit (b = 0) from the interaction with the system.
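The unfairness computation described in this appendix can be sketched as follows, with per-user benefits given as BLEU-derived values in [0, 1]; the numbers are illustrative.

```python
def generalized_entropy(benefits, alpha=2.0):
    # E_alpha(b) = 1/(n*alpha*(alpha-1)) * sum_i ((b_i/mu)^alpha - 1)
    n = len(benefits)
    mu = sum(benefits) / n
    return sum((b / mu) ** alpha - 1 for b in benefits) / (n * alpha * (alpha - 1))

def decomposed_unfairness(groups, alpha=2.0):
    # split total unfairness into within-group and between-group terms:
    # E(b) = sum_g (n_g/n) * (mu_g/mu)^alpha * E(b^g)  +  E_beta(b)
    all_b = [b for g in groups for b in g]
    n, mu = len(all_b), sum(all_b) / len(all_b)
    within = sum((len(g) / n) * ((sum(g) / len(g)) / mu) ** alpha
                 * generalized_entropy(g, alpha)
                 for g in groups)
    # between-group term: every member of group g replaced by the group mean
    between = generalized_entropy([sum(g) / len(g) for g in groups for _ in g], alpha)
    return within, between

# per-user benefits for two dialect subgroups (illustrative values)
groups = [[0.10, 0.12], [0.30, 0.28]]
within, between = decomposed_unfairness(groups)
total = generalized_entropy([b for g in groups for b in g])
# within + between equals the total unfairness (up to float error)
```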