SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages

In recent years, multilingual machine translation models have achieved promising performance on low-resource language pairs by sharing information between similar languages, thus enabling zero-shot translation. To overcome the "curse of multilinguality", these models often opt for scaling up the number of parameters, which makes their use in resource-constrained environments challenging. We introduce SMaLL-100, a distilled version of the M2M-100 (12B) model, a massively multilingual machine translation model covering 100 languages. We train SMaLL-100 with uniform sampling across all language pairs and thus focus on preserving the performance of low-resource languages. We evaluate SMaLL-100 on different low-resource benchmarks (FLORES-101, Tatoeba, and TICO-19) and demonstrate that it outperforms previous massively multilingual models of comparable size (200-600M) while improving inference latency and memory usage. Additionally, our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference.


Introduction
Neural Machine Translation (NMT) systems are usually trained on datasets consisting of millions of parallel sentences and thus still perform poorly on low-resource languages, i.e., languages without a large amount of training data. Over the past few years, previous work has proposed several approaches to improve translation quality for low-resource languages, e.g., Multilingual Neural Machine Translation (MNMT) models (Johnson et al., 2017; Fan et al., 2020; Tang et al., 2021; Goyal et al., 2021), back-translation (Sennrich et al., 2016; Edunov et al., 2018), and unsupervised machine translation (Garcia et al., 2021; Ko et al., 2021). Massively MNMT models are particularly interesting for low-resource languages, as these benefit the most from knowledge transfer from related languages (Arivazhagan et al., 2019). However, the curse of multilinguality hurts the performance of high-resource languages, so previous work has increased the model size to maintain translation performance in both high- and low-resource languages. This makes the use of these massively MNMT models challenging in real-world resource-constrained environments.
To overcome this problem, we propose SMaLL-100, a Shallow Multilingual Machine Translation Model for Low-Resource Languages covering 100 languages, which is a distilled alternative to M2M-100 (12B) (Fan et al., 2020), the most recent and largest available multilingual NMT model. In this paper, we focus on very-low and low-resource language pairs, as there is no reasonably sized universal model that achieves acceptable performance over a large number of low-resource languages. We do so by training SMaLL-100 on a perfectly balanced dataset. While this leads to lower performance on high-resource languages, we claim that this loss is easily recoverable through further fine-tuning. We evaluate SMaLL-100 on different low-resource benchmarks, i.e., FLORES-101 (Goyal et al., 2021), Tatoeba (Tiedemann, 2020), and TICO-19 (Anastasopoulos et al., 2020). To summarize, our contributions are as follows:
• We propose SMaLL-100, a shallow multilingual NMT model focusing on low-resource language pairs.
• We evaluate SMaLL-100 on several low-resource NMT benchmarks.
• We show that our model significantly outperforms previous multilingual models of comparable size while being faster at inference. Additionally, it achieves comparable results to the M2M-100 (1.2B) model, with 4.3× faster inference and a 3.6× smaller size.
• While SMaLL-100 reaches 87.2% of the performance of the 12B teacher model, we show that this gap can be closed with a few fine-tuning steps for both low- and high-resource languages.

Model and Training
2.1 SMaLL-100 Architecture
Kasai et al. (2021) have shown that deep-encoder/shallow-decoder architectures can achieve good translation quality while being significantly faster at inference. Berard et al. (2021) confirmed that this also holds for multilingual NMT. Here, we use a 12-layer Transformer encoder (Vaswani et al., 2017) and a 3-layer decoder. Table 8 in Appendix B reports further details of the SMaLL-100 architecture. Different from the M2M-100 model, we provide the language codes on the encoder side, as this has been shown to perform better with shallow-decoder architectures (Berard et al., 2021).
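For illustration, the deep-encoder/shallow-decoder layout can be sketched with plain PyTorch modules. This is only a structural sketch, not the fairseq model definition used in the paper, and the hidden sizes below are placeholders rather than the exact values of Table 8.

```python
import torch.nn as nn

# Structural sketch of a deep-encoder / shallow-decoder Transformer
# (12 encoder layers, 3 decoder layers, as in SMaLL-100).
# d_model, nhead and dim_feedforward are placeholder values, not those of Table 8.
model = nn.Transformer(
    d_model=1024,
    nhead=16,
    num_encoder_layers=12,   # deep encoder
    num_decoder_layers=3,    # shallow decoder: cheaper autoregressive inference
    dim_feedforward=4096,
    batch_first=True,
)
```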

Training Strategy
SMaLL-100 is trained with a combination of two loss functions: a standard Cross-Entropy loss (CE) and a Knowledge Distillation loss (KD). Given a source sequence $X$ and a gold target translation $Y = (y_0, \dots, y_m)$, the CE loss is calculated as:

$$\mathcal{L}_{ce} = -\sum_{j=1}^{m} \sum_{z=1}^{|K|} \mathbb{1}\{y_j = z\}\, \log p(y_j = z \mid y_{<j}, X, \theta_S) \qquad (1)$$

where $|K|$ is the target vocabulary size, $\mathbb{1}$ is the indicator function, $\theta_S$ denotes the parameters of the student model, and $p(\cdot)$ is the conditional probability of the student model.
We additionally use a word-level distillation loss, which is the Kullback-Leibler divergence between the output distributions of the student and teacher models (Hu et al., 2018). Specifically, it is calculated as:

$$\mathcal{L}_{kd} = -\sum_{j=1}^{m} \sum_{z=1}^{|K|} q(y_j = z \mid y_{<j}, X, \theta_T)\, \log p(y_j = z \mid y_{<j}, X, \theta_S) \qquad (2)$$

where $\theta_T$ denotes the parameters of the teacher model and $q(\cdot)$ is the conditional probability of the teacher model. The total loss is a weighted combination of the two terms, $\mathcal{L} = \alpha\, \mathcal{L}_{ce} + (1 - \alpha)\, \mathcal{L}_{kd}$, where $\alpha$ is a trainable parameter.
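As an illustration of this objective, the sketch below combines the two losses in PyTorch. It is a minimal sketch, not the released training code: the tensor layout, the masking of padding tokens, and the exact way `alpha` weights the two terms are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def small100_loss(student_logits, teacher_logits, targets, alpha, pad_id=1):
    """Cross-entropy on the gold translation plus word-level KL distillation.
    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    targets: (batch, seq_len) gold token ids; pad_id positions are ignored.
    alpha: learnable scalar tensor; its use as an interpolation weight is an assumption.
    """
    vocab_size = student_logits.size(-1)
    log_p = F.log_softmax(student_logits, dim=-1)      # student log-probabilities
    q = F.softmax(teacher_logits, dim=-1)              # teacher probabilities

    # Eq. (1): standard cross-entropy against the gold target tokens.
    ce = F.nll_loss(log_p.view(-1, vocab_size), targets.view(-1),
                    ignore_index=pad_id, reduction="mean")

    # Eq. (2): word-level distillation, KL divergence between teacher and student.
    kd = F.kl_div(log_p, q, reduction="none").sum(-1)  # (batch, seq_len)
    mask = targets.ne(pad_id).float()
    kd = (kd * mask).sum() / mask.sum()

    # Total loss: interpolation with a trainable alpha (assumed form of the combination).
    a = torch.sigmoid(alpha)                           # keep the weight in (0, 1)
    return a * ce + (1.0 - a) * kd
```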

Training Data
Our training data includes parallel sentences from the CCMatrix (Schwenk et al., 2019) and CCAligned (El-Kishky et al., 2020) datasets, which are part of the training data used by Fan et al. (2020) to train the M2M-100 models. As our goal is to maintain the performance of low-resource languages, we balance the training data across all language pairs; specifically, 100K sentence pairs are sampled for each language pair. As a result, our training data contains nearly 456M parallel sentences, which is less than 6% of the original data on which M2M-100 (Fan et al., 2020) was trained. We use the same languages as M2M-100.
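The balancing step can be illustrated with a short Python sketch; the data structures and helper name here are hypothetical, and the actual pipeline is built on the M2M-100 data preparation rather than this code.

```python
import random

def sample_balanced(corpus_by_pair, n_per_pair=100_000, seed=0):
    """Uniformly sample (at most) n_per_pair parallel sentences per language pair,
    so every pair contributes the same amount of data to the training set.
    corpus_by_pair: dict mapping a pair id such as "en-sw" to a list of (src, tgt) tuples.
    """
    rng = random.Random(seed)
    balanced = {}
    for pair, examples in corpus_by_pair.items():
        if len(examples) > n_per_pair:
            balanced[pair] = rng.sample(examples, n_per_pair)
        else:
            balanced[pair] = list(examples)  # keep everything for smaller pairs
    return balanced
```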
3 Experimental Setup

Evaluation Benchmarks
FLORES-101 is a multilingual NMT benchmark containing 3,001 sentences from different domains, derived from English Wikipedia and translated into 101 languages by human translators (Goyal et al., 2021). It mostly includes low- and medium-resource languages. We use the devtest subset for evaluation.
Tatoeba is a crowd-sourced collection of user-provided translations in different languages (Tiedemann, 2020). We choose a subset of languages from the test set of the Tatoeba Challenge that are covered by M2M-100.
TICO-19 was created during the COVID-19 pandemic (Anastasopoulos et al., 2020). It contains sentences from 36 languages in the medical domain, including 26 low-resource languages.

Table 2: Average spBLEU performance on FLORES-101, Tatoeba, and TICO-19 for different language pair categories, defined in Appendix A. FLORES-101 results are computed on language pairs for which M2M-100 (12B) obtains spBLEU scores higher than 3, to avoid polluting the analysis with meaningless scores. The first and second columns give the model size and speed-up ratio relative to M2M-100 (12B). The last column is the average spBLEU over all mentioned language directions. The best scores are shown in bold, and the second-best results are underlined.
We evaluate on the languages covered by M2M-100 (Fan et al., 2020). Inspired by Goyal et al. (2021), we split the languages into four categories based on the amount of available training data aligned with English: Very-Low (VL), Low (L), Medium (M), and High-resource (H). As the effective amount of training data depends on both the quality and the quantity of parallel sentences, Goyal et al. (2021) suggested estimating it by counting the bitext data aligned with English, computed from the statistics of the OPUS corpora (Tiedemann, 2012). Table 1 shows the criteria used to assign each language to a category. More details about the distribution of language pair categories in each benchmark are provided in Appendix A.
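For illustration, the category assignment can be written as a small helper; the threshold values below are placeholders standing in for the actual cut-offs of Table 1, which is not reproduced here.

```python
def resource_category(bitext_with_english,
                      thresholds=(100_000, 1_000_000, 100_000_000)):
    """Map the amount of bitext aligned with English (from OPUS statistics)
    to a resource category. The three thresholds are placeholder values,
    not the exact criteria of Table 1.
    """
    very_low_max, low_max, medium_max = thresholds
    if bitext_with_english < very_low_max:
        return "Very-Low (VL)"
    if bitext_with_english < low_max:
        return "Low (L)"
    if bitext_with_english < medium_max:
        return "Medium (M)"
    return "High-resource (H)"
```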

Baselines
FLORES-124 is an extension of M2M-100 covering 24 additional languages. The training data for the additional languages is derived from OPUS (Tiedemann, 2012). Goyal et al. (2021) provide two such models, with 175M and 615M parameters; we use both as baselines.
FineTuned-100 uses the same architecture as defined in Section 2, but the KD loss ($\mathcal{L}_{kd}$) is not used for training. For a fair comparison, it is trained for the same number of steps as SMaLL-100.

Implementation Details
SMaLL-100 contains nearly 330M parameters, with 12 encoder and 3 decoder Transformer layers. It is trained for 30 days on 16 TESLA V100-32GB GPUs, with a batch size of 1K tokens and gradients accumulated over 9 batches. We implement our model with the fairseq library. We use the last checkpoint of M2M-100 (12B), available at https://github.com/facebookresearch/fairseq/tree/main/examples/m2m_100, as the teacher model. For decoding, beam search with a beam size of 5 is applied. All hyper-parameters regarding the architecture and optimization strategy are provided in Appendix B.
For faster convergence, we first fine-tune SMaLL-100 for 150K steps without the distillation loss ($\mathcal{L}_{kd}$). It is then trained with both losses for 756K steps (nearly 1 epoch). For evaluation, we use SentencePiece BLEU (spBLEU), which applies a SentencePiece tokenizer with 256K tokens (https://github.com/facebookresearch/flores), as it has been shown to be a fair metric in multilingual settings (Goyal et al., 2021). We use the same tokenizer and dictionary as M2M-100.
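For reference, spBLEU can be computed roughly as follows with a recent sacrebleu release. This is only an illustration under the assumption that the installed version ships the FLORES SentencePiece tokenizer (the tokenizer name varies across versions, e.g. "spm" or "flores101"); the paper's scores are computed with the FLORES-101 evaluation tools linked above.

```python
from sacrebleu.metrics import BLEU

hypotheses = ["the model translation of the first sentence ."]
references = [["the reference translation of the first sentence ."]]  # one reference stream

# spBLEU: BLEU computed over SentencePiece pieces rather than word tokens.
# "flores101" is assumed to be an available tokenizer name in the installed sacrebleu;
# older releases may expose it as "spm" instead.
bleu = BLEU(tokenize="flores101")
print(bleu.corpus_score(hypotheses, references))
```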

Low-Resource NMT Benchmarks
Table 2 shows the average spBLEU performance on the FLORES-101, Tatoeba, and TICO-19 test sets for different categories of language directions; complete spBLEU scores for all language pairs on the tested NMT benchmarks are provided in Appendix C, and speed is measured on 2 TESLA V100-32GB GPUs with a batch size of 1 sentence over a subset of the FLORES-101 devtest containing nearly 10K sentences from all language pairs. SMaLL-100 outperforms all models of comparable size while being smaller and faster at inference. Specifically, it outperforms M2M-100 (418M) both in terms of performance (+3.1 spBLEU) and inference speed (2.5× faster). We believe that FineTuned-100 outperforms M2M-100 (418M) for low-resource languages thanks to fine-tuning on the balanced dataset. The higher performance of SMaLL-100 compared to FineTuned-100 across all benchmarks shows the benefit of the KD loss, which allows distilling knowledge from the teacher model. Additionally, SMaLL-100 achieves competitive results with M2M-100 (1.2B), while being 3.6× smaller and 4.3× faster at inference. Compared to the biggest M2M-100 model (12B), SMaLL-100 loses nearly 1.7 spBLEU but is 36× smaller and 7.8× faster. Regarding medium and high-resource language pairs (shown in Appendix C), SMaLL-100 achieves better or similar performance compared to M2M-100 (418M) and FLORES-124 (615M), while containing fewer parameters and being faster at evaluation time. It under-performs the teacher model (M2M-100 12B) on some medium and high-resource language pairs, which can be easily recovered, as we describe in the next section.

Recovering Teacher Model Performance
To go further, we demonstrate that SMaLL-100 can easily recover the performance of the teacher model with just a few fine-tuning steps, both for low- and high-resource language pairs. For comparison, we fine-tune the M2M-100 (418M) model for the same number of steps. Table 3 reports the spBLEU performance for several language pairs, alongside the number of fine-tuning steps required by SMaLL-100 to reach M2M-100 (12B) performance. We see that SMaLL-100 achieves better performance than M2M-100 (12B) after a few training steps on low-resource language pairs. For high-resource language pairs, SMaLL-100 is fine-tuned for 20K steps to reach the performance of the M2M-100 (12B) model. Additionally, fine-tuned SMaLL-100 significantly outperforms the fine-tuned M2M-100 (418M) model on low- and medium-resource languages. This confirms that SMaLL-100 can serve as a powerful and lightweight initialization for training on different language pairs.
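A minimal sketch of such language-pair-specific fine-tuning is given below. Here `model` is assumed to be a loaded SMaLL-100-style seq2seq model whose forward pass returns a loss, `dataloader` is assumed to yield ready-made batches for one language pair, and the optimizer choice and learning rate are illustrative placeholders, not the paper's exact optimization settings.

```python
import torch

def finetune(model, dataloader, max_steps=20_000, lr=1e-4):
    """Fine-tune a (SMaLL-100-like) seq2seq model on a single language pair.
    Assumes model(**batch) returns an object with a .loss attribute, as in common
    seq2seq training loops; optimizer and learning rate are illustrative choices.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for step, batch in enumerate(dataloader):
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step + 1 >= max_steps:   # e.g. 20K steps sufficed for high-resource pairs
            break
    return model
```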

Related Work
Compression and Distillation. Over the past few years, pre-trained models have led to significant improvements by increasing the number of parameters (Raffel et al., 2019; Fan et al., 2020; Zhang et al., 2022), which makes it challenging to use them in resource-constrained environments. Previous work has used several compression techniques, e.g., knowledge distillation (Kim and Rush, 2016; Li et al., 2021), pruning (Behnke and Heafield, 2020; Zhang et al., 2021; Mohammadshahi et al., 2022), and quantization (Tao et al., 2022; Yao et al., 2022), to obtain reasonably sized models while preserving performance.

Multilingual NMT. MNMT provides a single model to translate between any pair of languages, which significantly improves performance on low-resource languages thanks to knowledge transfer (Haddow et al., 2021). Several works (Dong et al., 2015; Firat et al., 2016; Platanios et al., 2018; Fan et al., 2020; Berard et al., 2021) propose to include both language-specific and language-independent parameters in MNMT models. Recently, massively MNMT models (Neubig and Hu, 2018; Arivazhagan et al., 2019; Aharoni et al., 2019; Fan et al., 2020; Zhang et al., 2020) have been proposed to translate between more than 100 languages. However, these models usually contain a huge number of parameters to maintain performance in both high- and low-resource languages. Different from previous work, we introduce SMaLL-100, which outperforms previous models of comparable size on low-resource language directions, while being smaller and faster.

Conclusion
We presented SMaLL-100, a shallow multilingual NMT model focusing on low-resource languages, and evaluated it on different NMT benchmarks. SMaLL-100 significantly outperforms multilingual models of comparable size on all of the tested benchmarks (FLORES-101, Tatoeba, TICO-19) and is much faster at inference. It also achieves competitive results with M2M-100 1.2B (Fan et al., 2020), while being 4.3× faster at inference and 3.6× smaller. Compared to M2M-100 (12B), the biggest available MNMT model, SMaLL-100 loses nearly 1.7 spBLEU on average but is significantly faster (7.8×) and smaller (36×), which makes it a good fit for resource-constrained settings. Additionally, we show that SMaLL-100 can achieve similar performance to M2M-100 (12B) with just a few steps of fine-tuning on specific language pairs.

Limitations
As mentioned in Section 4, the SMaLL-100 model under-performs for some medium and high-resource languages, which can be resolved by further fine-tuning. Due to our computation constraints, we train SMaLL-100 on nearly 6% of the data used for the original M2M-100 model. We therefore encourage future research to increase the size of the training data (especially for low-resource languages) to achieve better performance. Future research could also apply different distillation strategies (Wu et al., 2020; Wang et al., 2021), as we only used a word-level knowledge distillation loss (Hu et al., 2018).

A.B FLORES-101
We use the devtest subset of FLORES-101 for evaluation. To better compare different models, we exclude language pairs for which the spBLEU performance of the M2M-100 12B (Fan et al., 2020) model is below 3. This leaves 5,934 language directions for the comparison.

Table 5: Distribution of resource categories for different language directions on FLORES-101 (Goyal et al., 2021).

A.C Tatoeba Challenge
We use the test subset provided by Tiedemann (2020) to evaluate all models. We choose the subset of the dataset that includes the languages covered by the M2M-100 (Fan et al., 2020) model. This gives 1,844 language pairs for evaluation. The distribution of the different language pair categories is shown in Table 6.

Table 1: The criteria used to split languages into different resource categories. |K| is the amount of training data to/from English.

Table 4: ISO-639 codes and resource types of the languages used in the evaluated NMT benchmarks.

Table 6: Distribution of resource categories for different language directions on the Tatoeba Challenge.

Table 7: Number of language pairs per category of language directions on TICO-19.

Table 8: List of hyper-parameters used for the architecture and optimization.

Table 10: spBLEU performance of the last checkpoint of SMaLL-100 on language pairs of FLORES-101.

Table 11: spBLEU performance of the last checkpoint of SMaLL-100 on language pairs of Tatoeba.