Distilling Efficient Language-Specific Models for Cross-Lingual Transfer

Massively multilingual Transformers (MMTs), such as mBERT and XLM-R, are widely used for cross-lingual transfer learning. While these are pretrained to represent hundreds of languages, end users of NLP systems are often interested only in individual languages. For such purposes, the MMTs' language coverage makes them unnecessarily expensive to deploy in terms of model size, inference time, energy, and hardware cost. We thus propose to extract compressed, language-specific models from MMTs which retain the capacity of the original MMTs for cross-lingual transfer. This is achieved by distilling the MMT bilingually, i.e., using data from only the source and target language of interest. Specifically, we use a two-phase distillation approach, termed BiStil: (i) the first phase distils a general bilingual model from the MMT, while (ii) the second, task-specific phase sparsely fine-tunes the bilingual "student" model using a task-tuned variant of the original MMT as its "teacher". We evaluate this distillation technique in zero-shot cross-lingual transfer across a number of standard cross-lingual benchmarks. The key results indicate that the distilled models exhibit minimal degradation in target language performance relative to the base MMT despite being significantly smaller and faster. Furthermore, we find that they outperform multilingually distilled models such as DistilmBERT and MiniLMv2 while having a very modest training budget in comparison, even on a per-language basis. We also show that bilingual models distilled from MMTs greatly outperform bilingual models trained from scratch. Our code and models are available at https://github.com/AlanAnsell/bistil.


Introduction
Massively multilingual Transformers (MMTs), pretrained on unlabelled data from hundreds of languages, are a highly effective tool for cross-lingual transfer (Devlin et al., 2019; Conneau et al., 2020; Chung et al., 2020; He et al., 2021). However, they suffer from several limitations as a result of their ample language coverage. Firstly, representing many languages within a fixed parameter budget while reconciling training signals from different languages can result in negative interference. This is known as the "curse of multilinguality" (Conneau et al., 2020), which impairs the MMT's transfer capabilities (Pfeiffer et al., 2022). Secondly, in practice, people are often interested in using or researching NLP systems in just a single language. This makes MMTs unnecessarily costly in terms of storage, memory, and compute, and thus hard to deploy. This especially impacts communities which speak low-resource languages, which are more likely to have limited access to computational resources (Alabi et al., 2022).
In this work, we address the following question: can we increase the time and space efficiency of MMTs while retaining their performance in cross-lingual transfer? Knowledge distillation (Hinton et al., 2015) is a family of general methods for achieving the first goal by producing smaller, faster models (Sanh et al., 2019; Jiao et al., 2020, inter alia), and has also been applied to MMTs specifically. However, when the distilled MMT is required to cover the same number of languages as the original model, whose capacity is already thinly stretched over hundreds of languages, the "curse of multilinguality" asserts itself, resulting in a significant loss in performance (Sanh et al., 2019).
As a consequence, to achieve the best possible performance with reduced capacity, we depart from the practice of retaining all the languages from the original MMT in the distilled model. Instead, we argue, we should cover only two languages, namely the source language and the target language of interest. In fact, distilling just one language would fall short of the second goal stated above, namely facilitating cross-lingual transfer, as a monolingually distilled model would be unable to learn from a distinct source language during task-specific fine-tuning. Maintaining cross-lingual transfer capabilities, however, is crucial due to the paucity of labelled task data for most tasks in many of the world's languages (Ponti et al., 2019; Joshi et al., 2020).
In particular, we propose a method for bilingual distillation of MMTs, termed BISTILLATION, inspired by the two-phase recipe of Jiao et al. (2020). We start from a "student" model, initialized by discarding a subset of layers of the original "teacher" MMT, as well as the irrelevant part of its vocabulary. In the first, "general" phase of distillation, unlabelled data is used to align the hidden representations and attention distributions of the student with those of the teacher. In the second, task-specific phase, the student is fine-tuned for the task of interest through guidance from a task-adapted variant of the teacher. Rather than fully fine-tuning the student during this second phase, we instead use the parameter-efficient Lottery-Ticket Sparse Fine-Tuning (LT-SFT) method of Ansell et al. (2022). Parameter-efficient task fine-tuning enables a system to support multiple tasks with the same distilled compact model, without unnecessarily creating a full model copy per task.
We evaluate our efficient "bistilled" models on a range of downstream tasks from several benchmarks for multilingual NLP, including dependency parsing from Universal Dependencies (UD; Zeman et al., 2020), named entity recognition from MasakhaNER (Adelani et al., 2021), natural language inference from AmericasNLI (Ebrahimi et al., 2022), and QA from XQuAD (Artetxe et al., 2020). We evaluate model performance as well as space efficiency (measured in terms of parameter count) and time efficiency (measured in terms of FLOPs and inference time). We compare against highly relevant baselines: bilingual models pretrained from scratch and two existing multilingually distilled models, DistilmBERT (Sanh et al., 2019) and MiniLMv2 (Wang et al., 2021a).
We find that while our bilingually distilled models are two to three times smaller and faster than the original MMT, their performance is only slightly degraded, as illustrated in Figure 1. Our method outperforms the baselines by sizable margins, showing the advantages of (i) bilingual as opposed to multilingual distillation, and (ii) distilling models from MMTs rather than training them from scratch. We hope that our endeavour will benefit end-users of multilingual models, and potential users under-served by currently available technologies, by making NLP systems more accessible. The code and models are publicly available at https://github.com/AlanAnsell/bistil.

Cross-Lingual Transfer with MMTs
Prominent examples of MMTs include mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020) and mDeBERTa (He et al., 2021), among others. Pires et al. (2019) and Wu and Dredze (2019) showed that mBERT is surprisingly effective at zero-shot cross-lingual transfer. Zero-shot cross-lingual transfer is a useful paradigm when there is little or no training data available for the task of interest in the target language, but there is training data available in some other source language. In the simplest form of zero-shot cross-lingual transfer, the model is trained on source language data and is then used without modification for inference on target language data. While this generally works quite well for high-resource languages, transfer performance degrades for low-resource languages, especially those under-represented or fully unseen by the MMT during its pretraining (Lauscher et al., 2020; Pfeiffer et al., 2020; Ansell et al., 2021; Adelani et al., 2021; Ebrahimi et al., 2022).

Modular Adaptation of MMTs
Because MMTs divide their capacity among many languages, they may often perform sub-optimally with respect to a single source or target language. Furthermore, we are sometimes interested in a target language not covered by the MMT. A naive solution to these problems is to prepare the MMT with continued pretraining on the target language before proceeding to task fine-tuning. While this can improve performance, Pfeiffer et al. (2020) show that a more effective approach is to perform this continued pretraining in a parameter-efficient manner, specifically with the use of adapters (Rebuffi et al., 2017; Houlsby et al., 2019). The resulting language-specific adapter is known as a language adapter. When the task fine-tuning is also learned in the form of an adapter (a task adapter), Pfeiffer et al. demonstrate that zero-shot transfer can be achieved by composing arbitrary language and task adapter pairs. Ansell et al. (2022) extend this idea to a different parameter-efficient fine-tuning method, sparse fine-tuning (SFT). In an SFT, only a sparse subset of the model's pretrained parameters is fine-tuned; i.e., an SFT of a pretrained model F with parameters θ can be written as F(· ; θ + ϕ), where the difference vector ϕ is sparse (Sung et al., 2021). Language and task SFTs with difference vectors ϕ_L and ϕ_T, respectively, are composed through addition, yielding F(· ; θ + ϕ_L + ϕ_T). SFTs are learned through a procedure called Lottery Ticket Sparse Fine-Tuning (LT-SFT), based on the Lottery Ticket algorithm of Frankle and Carbin (2019): the k% of parameters which undergo the greatest absolute change during an initial full fine-tuning phase are selected as the tunable parameters of a second, "sparse" phase, which yields the final SFT.
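The LT-SFT selection rule and the additive composition of SFTs described above can be sketched as follows. This is a minimal illustration over flat parameter tensors, not the authors' implementation; the 4% density value follows the appendix, and all function names are our own.

```python
import torch

def select_sft_mask(theta_pretrained, theta_dense_ft, density=0.04):
    # Lottery-Ticket selection: the k% of parameters with the greatest
    # absolute change during the initial dense fine-tuning phase become
    # the tunable parameters of the sparse phase that yields the SFT.
    deltas = (theta_dense_ft - theta_pretrained).abs()
    k = max(1, int(density * deltas.numel()))
    threshold = torch.topk(deltas.flatten(), k).values.min()
    return deltas >= threshold  # boolean mask of tunable parameters

def compose_sfts(theta, phi_lang, phi_task):
    # Language and task SFTs are sparse difference vectors composed by
    # simple addition: F(. ; theta + phi_L + phi_T).
    return theta + phi_lang + phi_task
```

For zero-shot transfer, `phi_lang` for the target language is swapped in at inference time while `phi_task` stays fixed.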
As SFT composition exhibited somewhat better zero-shot cross-lingual transfer performance across a range of tasks than adapter composition, and SFTs avoid the slow-down that adapters incur at inference time, we adopt this parameter-efficient approach throughout this work. However, we note that other modular and parameter-efficient architectures could also be tried in future work (Pfeiffer et al., 2023).
Multi-Source Training. Ansell et al. (2021) show that multi-source task adapter training, where a task adapter is trained using data from several source languages simultaneously, yields large gains in cross-lingual transfer performance as a result of the task adapter learning more language-agnostic representations. Ansell et al. (2022) find similarly large gains from multi-source training of task SFTs. An important aspect of cross-lingual transfer with SFTs is that the source language SFT is applied during task SFT training. This requires each batch during multi-source training to consist of examples from a single source language, for which the relevant language SFT is applied during the corresponding training step.

Distilling Pretrained Language Models
Knowledge distillation (Buciluȃ et al., 2006; Hinton et al., 2015) is a technique for compressing a large pretrained "teacher" model into a smaller "student" model by training the student to copy the behavior of the teacher. Whereas during standard pretraining the model receives a single "hard" label per training example, during distillation the student benefits from the enriched signal provided by the full label distribution predicted by the teacher model. Sanh et al. (2019) use this technique to produce DistilBERT, a distilled version of BERT_base (Devlin et al., 2019) with 6 instead of the original 12 layers, and DistilmBERT, a corresponding distilled version of multilingual BERT. There has been extensive subsequent work on distillation of pretrained language models, but with less focus on distilling MMTs in particular.

BISTILLATION: Methodology
Overview. We are interested in providing NLP capabilities with limited computational resources in a specific target language T which lacks training data in the tasks of interest. A common paradigm in previous work (Pfeiffer et al., 2020; Ansell et al., 2022) is to use cross-lingual transfer with an MMT in conjunction with parameter-efficient task and language adaptation to support multiple tasks without adding a large number of additional parameters per task; see §2.2. Our goal in this work is to replace the highly general MMT, plus optional language adaptation, with a target language-specific model which maintains the benefits of cross-lingual transfer.
An obvious first attempt would be to simply distil the MMT into a smaller model using only text in the target language. However, this monolingual distillation approach is insufficient: during task fine-tuning, the monolingually distilled student model no longer "understands" the source language. Indeed, our preliminary experiments confirmed the intuition that this approach is inadequate. This problem can be overcome through bilingual distillation, where text from both the source and target language is used to train the student model.¹ Therefore, our aim is to devise a method for deriving from an MMT M a smaller model M′_{S,T,τ} to perform a given task τ in the target language T given only training data in the source language S. Our approach is inspired by the two-stage distillation paradigm of Jiao et al. (2020). In the first, "general" phase, a bilingual student model M′_{S,T} is distilled from M using the same unsupervised task (e.g., masked language modeling) that was used for M's pretraining. In the second, "task-specific" phase, M′_{S,T,τ} is produced by fine-tuning M′_{S,T} using M_τ as its teacher, where M_τ is derived from M by fine-tuning it for task τ. The following sections explain the details of these phases.

Distillation Method
Let L_T be the number of Transformer layers in the teacher model, indexed from 1 to L_T. The number of student model layers L_S is required to evenly divide L_T. We define the downscaling stride as s = L_T / L_S. Following Jiao et al. (2020), the loss functions of the two distillation phases make use of three components: (i) attention-based, (ii) hidden state-based, and (iii) prediction-based. The attention-based loss is defined as follows:

L_attn = (1 / L_S) Σ_{i=1}^{L_S} MSE(A^S_i, A^T_{s·i})

Here, A^S_i and A^T_i ∈ R^{l×l} refer to the attention distribution² of Transformer layer i of the student and teacher model, respectively; l refers to the input sequence length; MSE(·) denotes mean squared error loss.
The hidden state-based loss is defined as follows:

L_hidden = (1 / (L_S + 1)) Σ_{i=0}^{L_S} MSE(H^S_i, H^T_{s·i})

¹ This is similar to the idea of bilingual language adapters proposed by Parović et al. (2022), which obtain superior transfer performance by adapting the MMT to both the source and target language simultaneously, removing the need to use different and possibly incompatible language adapters during training and inference.
² Here, for ease of implementation within the HuggingFace Transformers library (Wolf et al., 2020), we differ from Jiao et al. (2020), who use raw attention scores.
where H^S_i and H^T_i ∈ R^{l×d} refer to the hidden representations output by Transformer layer i of the student and teacher model, respectively, or the output of the embedding layer when i = 0. Note that we assume that the student and teacher share the same hidden dimensionality d.
Finally, the prediction-based loss is defined as

L_pred = CE(z^T, z^S)

where z^S and z^T are the label distributions predicted by the student and teacher model, respectively, and CE denotes cross-entropy loss.
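The three loss components above can be sketched as follows. The per-layer tensor layout, the layer mapping (student layer i matched to teacher layer s·i), and the soft cross-entropy form of L_pred are our reading of the formulas; this is an illustrative sketch rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def distillation_losses(student, teacher, stride):
    # `student`/`teacher` are dicts with lists of per-layer attention
    # distributions "attn" (l x l), hidden states "hidden" (l x d, where
    # index 0 holds the embedding output), and MLM "logits" (l x V).
    # Student layer i is aligned with teacher layer stride * i.
    n = len(student["attn"])
    attn_loss = sum(
        F.mse_loss(student["attn"][i], teacher["attn"][stride * (i + 1) - 1])
        for i in range(n)) / n
    hidden_loss = sum(
        F.mse_loss(student["hidden"][i], teacher["hidden"][stride * i])
        for i in range(n + 1)) / (n + 1)
    # Soft cross-entropy between the teacher's and student's predicted
    # label distributions.
    pred_loss = -(F.softmax(teacher["logits"], dim=-1)
                  * F.log_softmax(student["logits"], dim=-1)).sum(-1).mean()
    return attn_loss, hidden_loss, pred_loss
```

Stage 1 minimises attn_loss + hidden_loss only; Stage 2 uses the sum of all three.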
The intuition behind using the attention-based and hidden state-based losses for our purposes is as follows. We (i) require good monolingual performance in the source and target language, but we also (ii) must preserve the existing alignment between these languages in the MMT, which is what facilitates transfer between them. Encouraging the student's intermediate representations to match those of the teacher should help to preserve this alignment.
We next describe how these loss components are employed in each phase of BISTILLATION.

Stage 1: General Bilingual Distillation
Initialization.We initialize all parameters of the student model by copying those of the teacher model, but retaining only the Transformer layers whose indices are multiples of s.
Vocabulary Reduction. Our distilled models can dispose of the many irrelevant tokens in the base MMT's vocabulary, i.e. those which are not frequently used in either the source or target language of interest, an idea previously proposed by Abdaoui et al. (2020). During initialization, the vocabulary of the student model is selected by retaining only the tokens of the teacher's vocabulary whose unigram probability in either the source or target language corpus is ≥ 10⁻⁶.
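Student initialization as described above (keeping every s-th teacher layer, and pruning the vocabulary by unigram probability) can be sketched as follows; the function names are ours, and the frequency counts stand in for whatever corpus statistics are actually used.

```python
def retained_layers(n_teacher, n_student):
    # Keep the teacher layers whose 1-based indices are multiples of the
    # downscaling stride s = L_T / L_S (e.g. 12 -> 6 keeps 2, 4, ..., 12).
    assert n_teacher % n_student == 0
    s = n_teacher // n_student
    return [i for i in range(1, n_teacher + 1) if i % s == 0]

def reduce_vocab(vocab, src_counts, tgt_counts, threshold=1e-6):
    # Retain tokens whose unigram probability in either the source or
    # target language corpus is at least `threshold`.
    src_total = sum(src_counts.values()) or 1
    tgt_total = sum(tgt_counts.values()) or 1
    return [t for t in vocab
            if src_counts.get(t, 0) / src_total >= threshold
            or tgt_counts.get(t, 0) / tgt_total >= threshold]
```

The corresponding rows of the teacher's embedding matrix would then be copied for the retained tokens only.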
Teacher Language Adaptation. As we wish to be able to produce distilled models for languages not covered in the base MMT, and to obtain the best possible performance for languages which are covered, we employ language adaptation of the teacher MMT with language-specific SFTs (Ansell et al., 2022) applied on top of the original MMT during distillation.³ Since it draws examples from two languages, each with its own language SFT, bilingual distillation becomes a special case of multi-source training as described in §2.2. At each training step, either the source or target language is selected at random with equal probability; the batch is composed of sequences drawn from the training corpus of the chosen language, and a pretrained SFT for that language is applied to the teacher MMT.
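The per-step sampling procedure just described can be sketched as follows. All names here are hypothetical, and the teacher's parameters are represented as a single flat tensor for brevity.

```python
import random
import torch

def bilingual_step(teacher_params, lang_sfts, corpora, sample_batch):
    # Multi-source training specialised to two languages: choose source
    # or target uniformly at random, draw the batch from that language's
    # corpus, and apply the matching language SFT (a sparse difference
    # vector) to the teacher before the distillation forward pass.
    lang = random.choice(["source", "target"])
    batch = sample_batch(corpora[lang])
    adapted_params = teacher_params + lang_sfts[lang]
    return lang, batch, adapted_params
```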
Objective. The overall loss function for this phase is given by the sum of the attention-based and hidden state-based losses. Omitting the prediction-based loss here has the advantage of avoiding the need to evaluate the distribution of tokens predicted by the MLM head, which is costly because of the considerable size of MMTs' embedding matrices.

Stage 2: Task-Specific Distillation
After a general bilingual model has been distilled from the teacher MMT in Stage 1, it can be fine-tuned for a specific task. We first obtain the teacher for task-specific distillation by applying task-specific LT-SFT to fine-tune the base MMT (i.e., the teacher of the general distillation phase) for the task in question. This teacher's outputs and representations are then used to fine-tune the bilingual student model, again using task LT-SFT on the student's end. The use of parameter-efficient task adaptation here avoids adding a large number of parameters to the system for each task. The objective during this task-specific fine-tuning consists of the sum of all three losses from §3.1: L_attn, L_hidden, and L_pred.

Experimental Setup
We largely adopt the evaluation framework of Ansell et al. (2022) for direct comparability with their LT-SFT method, which they apply to undistilled MMTs and which we apply for task-specific fine-tuning of bilingually distilled MMTs. Specifically, we evaluate zero-shot cross-lingual transfer performance on four representative tasks: dependency parsing, named entity recognition, natural language inference, and QA. While the prior work focused only on low-resource languages, our method is also highly relevant to high-resource languages: the XQuAD QA task (Artetxe et al., 2020) provides additional insight into high-resource target language performance. Table 1 summarizes the experimental setup, including the datasets and languages considered in our experiments. In total, we cover a set of 44 typologically and geographically diverse languages, which makes them representative of cross-lingual variation (Ponti et al., 2020).

Baselines and Model Variants
We refer to our main method as BISTIL. We compare it with several relevant approaches. First, the LTSFT method (Ansell et al., 2022), a state-of-the-art cross-lingual transfer approach, uses LT-SFT with language adaptation on the base MMT. LTSFT can be seen as an upper bound for BISTIL, allowing us to measure how much performance suffers as a result of replacing the MMT with its bilingually distilled variant.
For each task except NLI,⁴ we also compare against a multilingually distilled MMT, i.e. one with all pretraining languages used during distillation. For DP and NER, where mBERT is the base MMT, the distilled MMT is DISTILMBERT (Sanh et al., 2019), which is likewise based on mBERT. For QA, where BISTIL uses mDeBERTa as the base MMT, no directly comparable multilingually distilled MMT is available, so we opt for a loose comparison with MINILMV2 (Wang et al., 2021a), distilled from XLM-R_large, which has achieved strong results on cross-lingual transfer in high-resource languages. We perform task-specific fine-tuning with LT-SFT on DistilmBERT and MiniLMv2 in the same way as for the undistilled MMTs in the LTSFT setting. For DP and NER we also perform language adaptation of DistilmBERT.⁵

We also consider SCRATCH, a setting where we train bilingual models from scratch instead of distilling them from a pretrained MMT. We then apply the same LT-SFT task fine-tuning method as for the other baselines. This comparison allows us to evaluate the benefit of distilling efficient bilingual models from the MMT rather than pretraining same-sized bilingual models from scratch.
We refer to our main method, with the task-specific distillation stage as described in §3.3, as BISTIL-TF (TF = teacher forcing). We also carry out an ablation focused on the second phase of BISTILLATION: here, we consider performing task-specific fine-tuning without the assistance of a teacher, i.e. in the same manner as LTSFT. We refer to this variant as BISTIL-ST (ST = self-taught).
Table 2 provides details of the model sizes, before and after distillation using the above methods, demonstrating the benefits of BISTILLATION with respect to model compactness.

Distillation/Adaptation Training Setup
We always perform language adaptation of the teacher model during both phases of BISTILLATION and during LTSFT, except for mDeBERTa and MiniLMv2.⁶ For language adaptation of MMTs we use the pretrained language SFTs of Ansell et al. (2022), and we train our own for DistilmBERT. Similarly, for the LTSFT baseline, and for task adaptation of the teacher in the BISTIL-TF configuration, we use their pretrained single-source task SFTs, or train our own when necessary. When training/distilling our own models or SFTs, we generally choose hyperparameters which match those used to train the SFTs in the original work. See Appendix A for full training details and hyperparameters of all models in our comparison, and Appendix B for details of the training corpora.
We experiment with two layer reduction factors (LRF) for BISTILLATION: 2 (a reduction from 12 to 6 layers) and 3 (12 to 4 layers). Whereas the BISTIL setting initializes the model from the teacher (see §3.2), the SCRATCH setting initializes it randomly.

Results and Discussion
The results in terms of task performance are summarized in Tables 3-6. As expected, LTSFT on the undistilled MMTs performs best across all tasks. However, BISTIL-TF with reduction factor 2 is not much worse, with a degradation in performance not exceeding 1.3 points relative to LTSFT on DP, NER, and NLI. The larger gap of 3.4 EM points on QA is likely a result of the fact that the base MMT is much more thoroughly pretrained on the high-resource languages found in XQuAD than on the lower-resource languages found in the datasets for the other tasks. It is therefore harder for BISTIL to achieve the base MMT's depth of knowledge of the target language during its relatively short distillation training time. BISTIL-TF, LRF = 2 nevertheless outperforms MiniLMv2 on QA by 1.7 EM points, despite MiniLMv2 receiving 320 times more training than each BISTIL model, or roughly 6 times more per language.⁷ Furthermore, BISTIL-TF, LRF = 2 significantly outperforms DISTILMBERT, with a 6.1 LAS gap on DP and a 2.9 F1 gap on NER. BISTIL, LRF = 2 produces models roughly half the size of DISTILMBERT which, once again, are trained for vastly less time.⁸
Training bilingual models from SCRATCH performs poorly, lagging behind the other methods by more than 20 points on DP.⁹ One crucial weakness of SCRATCH, besides its reduced monolingual performance, is a lack of alignment between its representations of the source and target languages, severely impairing cross-lingual transfer. This highlights the advantage of distilling a bilingual model from an MMT within which cross-lingual alignment is already present.

⁷ MiniLMv2 is trained for 1M steps with a batch size of 256 and a max sequence length of 512; BISTIL for 200K steps with a batch size of 8 and a max sequence length of 256.

⁸ Sanh et al. (2019) note that their monolingual DistilBERT model was trained on 8 16GB V100 GPUs for approximately 90 hours. Our BISTIL models are trained on a single 10GB RTX 3080 GPU for approximately 9 hours.
Interestingly, when we evaluate the SCRATCH models on their English DP performance, we obtain an average UAS/LAS score of 81.8/77.1, which is much more competitive in relative terms with the BISTIL-TF, LRF = 2 English DP score of 91.0/88.2 than the corresponding comparison in average target language DP scores of 29.9/11.0 to 55.5/36.5. This suggests that an even larger factor in SCRATCH's weakness than its poor monolingual performance is the lack of alignment between its representations of the source and target languages.

Table 7: Relative inference speed and FLOP cost. Values are given relative to LTSFT without distillation, i.e. a speed reading of "2.00x" means the distilled model can on average process twice as many inference examples per second as the undistilled MMT. Likewise, a FLOPs reading of "0.50x" would mean that the distilled model on average requires half as many FLOPs to process an inference example as the undistilled MMT does.

As expected, the performance of BISTIL is somewhat weaker with a larger layer reduction factor of 3, though this is heavily task-dependent. With an LRF of 3, BISTIL-TF still comfortably outperforms DISTILMBERT on DP and NER, and does not fall much behind LRF = 2 for NLI. However, we observe a considerable degradation in performance for LRF = 3 on QA; this may indicate that a 4-layer Transformer struggles to adapt to this particular task, or that for this architecture the modest training time is not sufficient to approach the base MMT's understanding of the source and target languages.
Table 7 presents an analysis of inference time efficiency. We measure inference speed both on CPU with batch size 1 and on GPU with the same batch size as during task-specific training. We also calculate the number of floating-point operations (FLOPs) per example using fvcore, measured during an inference pass over the test set of the first language in each task.
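The relative-speed convention used in Table 7 can be illustrated with a simple timing harness like the one below. This is our own hypothetical sketch (the paper's FLOP counts come from fvcore, which is not used here), and wall-clock ratios on small models are inherently noisy.

```python
import time
import torch

def relative_speed(model, baseline, example, n_runs=20):
    # Ratio of average per-example inference times: a value of 2.0 means
    # `model` processes roughly twice as many examples per second as
    # `baseline` (cf. the "2.00x" readings in Table 7).
    def avg_time(m):
        m.eval()
        with torch.no_grad():
            m(example)  # warm-up pass, excluded from timing
            start = time.perf_counter()
            for _ in range(n_runs):
                m(example)
            return (time.perf_counter() - start) / n_runs
    return avg_time(baseline) / avg_time(model)
```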
For NER, NLI and QA, the efficiency results conform quite closely to the intuitive expectation that a model's inference time should scale linearly with its number of layers; that is, BISTIL with LRF = 2 is generally around twice as fast as the base MMT. For DP, we observe a seemingly sub-linear scaling, which is caused by the very large biaffine parsing head, consisting of ~23M parameters. The significant cost of applying the model head contributes equally to all models regardless of their degree of distillation. Despite having a moderate LRF of 2, MINILMV2 exhibits impressive speed as a result of the fact that it additionally has a smaller hidden dimension than its teacher (see Table 2), a technique which we do not consider for BISTIL, but which may be a promising avenue for future work.
We argue that BISTIL accomplishes its aim by achieving two- to three-fold reductions in inference time and model size without sacrificing much in the way of raw performance. Its superior performance relative to multilingually distilled models, despite its comparatively very modest training budget, supports the assertion that specializing multilingual models for a specific transfer pair during distillation helps to avoid the performance degradation resulting from the curse of multilinguality.

Related Work
One strand of prior work focuses on parameter-efficient adaptation of pretrained MMTs, i.e. adaptation by adding/modifying a small subset of parameters. Adapters (Rebuffi et al., 2017; Houlsby et al., 2019) have been used extensively for this purpose (Üstün et al., 2020), with the MAD-X framework of Pfeiffer et al. (2020) becoming a starting point for several further developments (Vidoni et al., 2020; Wang et al., 2021b; Parović et al., 2022), where a notable theme is adapting MMTs to unseen languages (Ansell et al., 2021; Pfeiffer et al., 2021). Ansell et al. (2022) propose composable sparse fine-tunings as an alternative to adapters. Pfeiffer et al. (2022) create a modular MMT from scratch, where some parameters are shared among all languages and others are language-specific. This allows the model to dedicate considerable capacity to every language without each language-specific model becoming overly large; thus it is quite similar in its aims to this work.
A variety of approaches have been proposed for general distillation of pretrained language models. The simplest form uses only the soft target probabilities predicted by the teacher model as the training signal for the student (Sanh et al., 2019). Other approaches try to align the hidden states and self-attention distributions of the student and teacher (Sun et al., 2020; Jiao et al., 2020) and/or finer-grained aspects of the self-attention mechanism (Wang et al., 2020, 2021a). Mukherjee et al. (2021) initialize the student's embedding matrix with a factorization of the teacher's for better performance when their hidden dimensions differ. Of these, Sanh et al. (2019), Wang et al. (2020, 2021a), and Mukherjee et al. (2021) apply their methods to produce distilled versions of MMTs. Parović et al. (2022) adapt pretrained MMTs to specific transfer pairs with adapters; this approach is similar to ours in spirit, but it is aimed at improving performance rather than efficiency. Minixhofer et al. (2022) learn to transfer full monolingual models across languages. The only prior work we are aware of which creates purely bilingual models for cross-lingual transfer is that of Tran (2020). This approach starts with a monolingual pretrained source language model, initializes target language embeddings via an alignment procedure, and then continues training the model with the added target embeddings on both languages.

Conclusions
While MMTs are an effective tool for cross-lingual transfer, their broad language coverage makes them unnecessarily costly to deploy in the frequently encountered situation where capability is required in only a single, often low-resource, language. We have proposed BISTILLATION, a method for training more efficient models suited to this scenario, which works by distilling an MMT using only the source-target language pair of interest. We show that this approach produces models that offer an excellent trade-off between target language performance, efficiency, and model compactness. The "bistilled" models exhibit only a slight decrease in performance relative to their base MMTs whilst achieving a considerable reduction in both model size and inference time. Their results also compare favorably to those of multilingually distilled MMTs despite receiving substantially less training, even on a per-language basis.

Limitations
While the results of our experiments seem sufficient to validate the concept and our general approach to bilingual distillation, we have not carried out a detailed systematic analysis of alternative implementations of the various aspects of our methods, such as different student model initializations, distillation objectives, and hyperparameter settings. Furthermore, our BISTIL models are likely undertrained due to limited computational resources. Consequently, we do not claim our specific implementation of bilingual distillation to be optimal or even close to optimal. Areas that warrant further investigation toward realizing the full potential of this approach include the use of hidden dimension reduction, which yielded impressive speed gains for MiniLMv2 in our experiments, and other innovations in distillation such as progressive knowledge transfer (Mukherjee et al., 2021).
Apart from their improved efficiency, our BISTIL models inherit the limitations of the MMTs from which they are distilled; notably, there is a discrepancy between performance on high- and low-resource languages resulting from the distribution of data used during MMT pretraining.
In this work, we have considered only English as the source language; some target languages may benefit from other transfer sources. Future work may also consider the use of multi-source transfer, which would entail distilling with more than two languages. Here the challenge would be optimizing the balance of model capacity allocated to the source languages versus the target language.

A Training Details and Hyperparameters
As we evaluate over many languages and tasks, we carry out a single run per (task, language, configuration) triple.

A.1 Language Distillation/Adaptation
The following are constant across all language distillation/SFT training: we use a batch size of 8 and a maximum sequence length of 256; model checkpoints are evaluated every 1,000 steps (5,000 for high-resource languages) on a held-out set of 5% of the corpus (1% for high-resource languages), and the one with the smallest loss is selected at the end of training; we use the AdamW optimizer (Loshchilov and Hutter, 2019) with linear decay without any warm-up.
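The linear decay schedule without warm-up described above can be sketched as a simple function of the step count (a hypothetical helper for illustration, not the actual training code):

```python
def linear_decay_lr(step, total_steps, base_lr):
    """Learning rate after `step` optimizer steps under linear decay
    from `base_lr` down to zero, with no warm-up phase."""
    return base_lr * max(0.0, 1.0 - step / total_steps)
```

This corresponds to the common Hugging Face-style linear schedule with the number of warm-up steps set to zero.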
During LT-SFT training of DistilmBERT's language SFTs, the dense and sparse fine-tuning phases each last for the lesser of 100,000 steps or 200 epochs, but for at least 30,000 steps when 200 epochs amounts to less than that. The initial learning rate is 5 × 10⁻⁵. The SFT density is set to 4%. When distilling bilingual models or learning them from scratch, training lasts 200,000 steps (to equal the total length of the two phases of LT-SFT training). The initial learning rate is 10⁻⁴. The model architecture and hyperparameters are identical to the teacher MMT's, other than a reduction in the number of layers and the use of vocabulary reduction as described in §3.2.
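The parameter selection underlying the sparse phase of LT-SFT can be sketched as follows. This is an illustrative pure-Python rendering of Lottery Ticket-style selection over a flat parameter vector, not the authors' actual implementation: after the dense phase, the fraction of parameters given by the density (4% here) that moved furthest from their pretrained values is selected, and only those parameters are updated in the sparse phase.

```python
def select_sft_mask(pretrained, dense_tuned, density=0.04):
    """Return a boolean mask over a flat parameter vector marking the
    `density` fraction of parameters that changed most (in absolute
    value) between the pretrained and densely fine-tuned models."""
    deltas = [abs(d - p) for p, d in zip(pretrained, dense_tuned)]
    k = max(1, int(len(deltas) * density))
    # Indices of the k largest absolute changes.
    top = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)[:k]
    mask = [False] * len(deltas)
    for i in top:
        mask[i] = True
    return mask
```

During the sparse phase, all parameters outside the mask are reset to (and frozen at) their pretrained values, so the resulting SFT is the sparse difference vector between the final and pretrained parameters.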

A.2 Task Distillation/Adaptation
For DP and NER, we train task SFTs for 3 epochs in the dense phase of LT-SFT and 10 epochs in the sparse phase, evaluating the model checkpoint on the validation set at the end of each epoch, and taking the best checkpoint at the end of training. The selection metric is labeled attachment score for DP and F1-score for NER. The initial learning rate is 5 × 10⁻⁵ with linear decay. For NER, we use the standard token-level single-layer multi-class model head. For DP, we use the shallow variant (Glavaš and Vulić, 2021) of the biaffine dependency parser of Dozat and Manning (2017). For NLI, we train for 5 epochs with batch size 32, with checkpoint evaluation on the validation set every 625 steps, and an initial learning rate of 2 × 10⁻⁵. We apply a two-layer multi-class classification head atop the model output corresponding to the [CLS] token. For QA, we train for 5 epochs with a batch size of 12, with checkpoint evaluation every 2,000 steps and an initial learning rate of 3 × 10⁻⁵. The single-layer model head independently predicts the start and end positions of the answer span, and at inference time the span whose endpoints have the largest sum of logits is selected.
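The QA span selection at inference time can be sketched as follows. This is an illustrative helper; the maximum span length cap is a common additional constraint assumed here, not stated in the text.

```python
def best_answer_span(start_logits, end_logits, max_span_len=30):
    """Return the (start, end) token indices whose start and end logits
    have the largest sum, subject to start <= end and a span length cap."""
    best, best_score = (0, 0), float("-inf")
    for s, start_logit in enumerate(start_logits):
        # End positions from s up to the span length limit.
        for e in range(s, min(s + max_span_len, len(end_logits))):
            score = start_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

Because the head predicts start and end positions independently, this joint search over valid (start, end) pairs is what enforces that the predicted span is well-formed.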
We set the density of our task SFTs to 8%, which Ansell et al. (2022) found to offer the best task performance in all their experiments.

Figure 1: Trade-off between parameter count, inference FLOPs, and averaged performance for our BISTIL models for cross-lingual transfer and several baselines.

Table 1: , Chinese†, German†, Greek†, Hindi†, Romanian†, Russian†, Spanish†, Thai†, Turkish†, Vietnamese†. Details of the tasks, datasets, MMTs, and languages involved in our zero-shot cross-lingual transfer evaluation. * denotes low-resource languages and † high-resource languages seen during MMT pretraining; all other languages are low-resource and unseen. The source language is always English. We 'bistil' the MMT listed for each task and target language. Further details of all the languages and data sources used are provided in Appendix B.

Table 2: Dimensions of models before and after distillation. LRF = Layer Reduction Factor; DRF = hidden Dimension Reduction Factor; #L = number of Transformer layers; D = hidden dimension; #V = vocabulary size; #P = total number of model parameters; D'MBERT = DISTILMBERT. *: because of its vocabulary reduction procedure, BISTIL can produce models of slightly different sizes for different languages; the vocabulary sizes and numbers of parameters shown are averages over all BISTIL models trained.