MergeDistill: Merging Pre-trained Language Models using Distillation

Pre-trained multilingual language models (LMs) have achieved state-of-the-art results in cross-lingual transfer, but they often lead to an inequitable representation of languages due to limited capacity, skewed pre-training data, and sub-optimal vocabularies. This has prompted the creation of an ever-growing pre-trained model universe, where each model is trained on large amounts of language- or domain-specific data with a carefully curated, linguistically informed vocabulary. However, doing so brings us back full circle and prevents one from leveraging the benefits of multilinguality. To address the gaps at both ends of the spectrum, we propose MergeDistill, a framework to merge pre-trained LMs in a way that can best leverage their assets with minimal dependencies, using task-agnostic knowledge distillation. We demonstrate the applicability of our framework in a practical setting by leveraging pre-existing teacher LMs and training student LMs that perform competitively with or even outperform teacher LMs trained on several orders of magnitude more data, with a fixed model capacity. We also highlight the importance of teacher selection and its impact on student model performance.


Introduction
Figure 1: Previous works (left) typically focus on combining fine-tuned models derived from a single pre-trained model using distillation. We propose MERGEDISTILL to combine pre-trained teacher LMs from multiple monolingual/multilingual LMs into a single multilingual, task-agnostic student LM.

While current state-of-the-art multilingual language models (LMs) (Devlin et al., 2019; Conneau et al., 2020) aim to represent 100+ languages in a single model, efforts towards building monolingual (Martin et al., 2019; Kuratov and Arkhipov, 2019) or language-family based (Khanuja et al., 2021) models are only increasing with time (Rust et al., 2020). A single model is often incapable of effectively representing a diverse set of languages, evidence of which has been provided by works highlighting the importance of vocabulary curation and size (Chung et al., 2020; Artetxe et al., 2020),
pre-training data volume (Liu et al., 2019a; Conneau et al., 2020), and the curse of multilinguality (Conneau et al., 2020). Language-specific models alleviate these issues with a custom vocabulary which captures language subtleties 1 and large amounts of pre-training data scraped from several domains (Virtanen et al., 2019; Antoun et al., 2020). However, building language-specific LMs brings us back to where we started, preventing us from leveraging the benefits of multilinguality, such as zero-shot task transfer (Hu et al., 2020), positive transfer between related languages (Pires et al., 2019; Lauscher et al., 2020), and an ability to handle code-mixed text (Pires et al., 2019; Tsai et al., 2019). We need an approach that encompasses the best of both worlds, i.e., one that leverages the capabilities of the powerful language-specific LMs while still being multilingual and enabling positive language transfer.

1 For example, in Arabic, Antoun et al. (2020) argue that while the definite article "Al", which is equivalent to "the" in English, is always prefixed to other words, it is not an intrinsic part of those words. With a BERT-compatible tokenization, such tokens will appear twice, once with "Al-" and once without it; AraBERT instead first segments the words using Farasa (Abdelali et al., 2016) and then learns the vocabulary, thereby alleviating the problem.

Figure 2: Overview of MERGEDISTILL: The input to MERGEDISTILL is a set of pre-trained teacher LMs and pre-training transfer corpora for all the languages we wish to train our student LM on. Here, we combine four teacher LMs comprising three monolingual LMs (trained on English, Spanish, and Korean, respectively) and one multilingual LM (trained on English and Hindi). The student LM is trained on English, Spanish, Hindi, and Korean. The pre-training transfer corpus for each language is tokenized and masked using its respective teacher LM's vocabulary. We then obtain predictions for each masked word in each language by evaluating all of the respective teacher LMs; for example, we evaluate English masked examples on both the monolingual and the multilingual LM, as shown. The student's vocabulary is the union of all teacher vocabularies. Hence, the input, prediction, and label indices obtained from teacher evaluation are mapped to the student vocabulary and input to the student LM for training. Please refer to Section 3.1 for details.
In this paper, we use knowledge distillation (KD) (Hinton et al., 2015) to achieve this. In the context of language modeling, KD methods can be broadly classified into two categories: task-specific and task-agnostic. In task-specific distillation, the teacher LM is first fine-tuned for a specific task and is then distilled into a student model that can solve that task. Task-agnostic methods perform distillation on the pre-training objective, such as masked language modeling (MLM), in order to obtain a task-agnostic student model. Prior work has either used task-agnostic distillation to compress single-language teachers (Sanh et al., 2019) or used task-specific distillation to combine multiple fine-tuned teachers into a multi-task student (Liu et al., 2019b; Clark et al., 2019). The former prevents positive language transfer, while the latter restricts the student's capabilities to the tasks and languages of the fine-tuned teacher LMs (as shown in Figure 1).
We focus on the problem of merging multiple pre-trained LMs into a single multilingual student LM in the task-agnostic setting. To the best of our knowledge, this is the first effort of its kind, and it makes the following contributions:

• We propose MERGEDISTILL, a task-agnostic distillation approach to merge multiple teacher LMs at the pre-training stage into a strong multilingual student LM that can then be fine-tuned for any task on all languages in the student LM. Our approach is more maintainable (fewer models), compute-efficient, and teacher-architecture agnostic (since we obtain offline predictions).
• We use MERGEDISTILL to i) combine monolingual teacher LMs into a single multilingual student LM that is competitive with or outperforms individual teachers, ii) combine multilingual teacher LMs, such that the overlapping languages can learn from multiple teachers.
• Through extensive experiments and analysis, we study the importance of typological similarity in building multilingual models, and the impact of strong teacher LM vocabularies and predictions in our framework.
Related Work

Language Model pre-training has evolved from learning pre-trained word embeddings (Mikolov et al., 2013) to contextualized word representations (McCann et al., 2017; Peters et al., 2018; Eriguchi et al., 2018) and to the most recent Transformer-based (Vaswani et al., 2017) LMs (Devlin et al., 2019; Liu et al., 2019a) with state-of-the-art results on various downstream NLP tasks. Most commonly, these LMs are pre-trained with the MLM objective (Taylor, 1953) on large unsupervised corpora and then fine-tuned on labeled data for the task at hand. Concurrently, multilingual LMs (Lample and Conneau, 2019; Siddhant et al., 2020; Conneau et al., 2020; Chung et al., 2021), trained on massive amounts of multilingual data, have surpassed cross-lingual word embedding spaces (Glavaš et al., 2019) to achieve state-of-the-art results in cross-lingual transfer. While Pires et al. (2019) and Wu and Dredze (2019) highlight their cross-lingual ability, several limitations have been studied. Conneau et al. (2020) highlight the curse of multilinguality. Hu et al. (2020) show that even the best multilingual models do not yield satisfactory transfer performance on the XTREME benchmark covering 9 tasks and 40 languages. Importantly, Dredze (2020) and Lauscher et al. (2020) observe that these models significantly under-perform for low-resource languages, as the representation of these languages in the vocabulary and pre-training corpora is severely limited.
Language-specific LMs are becoming increasingly popular as issues with multilingual language models persist. As language identification systems are extended to 1000+ languages (Caswell et al., 2020), increasing the capacity of a single model to uniformly represent all languages is prohibitive. Often, practitioners prefer a model that performs well on the subset of languages their application calls for. To address this, the community continues its efforts in building strong multi-domain language models using linguistic expertise.
A few examples of these are AraBERT (Antoun et al., 2020), CamemBERT (Martin et al., 2020), and FinBERT (Virtanen et al., 2019). 2

2 Nozza et al. (2020) maintain an ever-growing list of BERT models here.

Knowledge Distillation in pre-trained LMs has most commonly been used for task-specific model compression of a teacher into a single-task student (Tang et al., 2019; Kaliamoorthi et al., 2021). This has been extended to perform task-specific distillation of multiple single-task teachers into one multi-task student (Clark et al., 2019; Turc et al., 2019). In the task-agnostic scenario, prior work has focused on distilling a single large teacher model into a student model leveraging teacher predictions (Sanh et al., 2019) or internal teacher representations (Sun et al., 2019), with the goal of model compression. To the best of our knowledge, this is the first attempt to perform task-agnostic distillation from multiple teachers into a single task-agnostic student. In the context of neural machine translation, Tan et al. (2019) come closest to our work: they combine multiple single language-pair teacher models to train a multilingual student. However, our work differs from theirs in three key aspects: 1) our students are task-agnostic while theirs are task-specific, 2) we can leverage pre-existing teachers while they cannot, and 3) we support teachers with overlapping sets of languages while they only consider single language-pair teachers.

MERGEDISTILL
Notation: Let K denote the set of languages we train our student LM on, and let T denote the set of teacher LMs input to MERGEDISTILL 3 . Consequently, T k denotes the set of teacher LMs trained on language k, where |T k | ≥ 1 ∀ k ∈ K.
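The notation can be made concrete with the running example from Figure 2. A minimal sketch, assuming hypothetical teacher labels (M_en, M_es, M_ko, M_en_hi) that merely illustrate the K / T / T_k relationship:

```python
# Sketch of the Section 3 notation, using the Figure 2 example.
# Teacher names are illustrative labels, not real model identifiers.
K = {"en", "es", "hi", "ko"}                      # languages of the student LM
teachers = {                                       # T: each teacher LM -> its languages
    "M_en": {"en"}, "M_es": {"es"}, "M_ko": {"ko"}, "M_en_hi": {"en", "hi"},
}

# T_k: the set of teachers trained on language k (|T_k| >= 1 for every k in K)
T = {k: {name for name, langs in teachers.items() if k in langs} for k in K}

assert all(len(T[k]) >= 1 for k in K)
print(sorted(T["en"]))   # English has two teachers: ['M_en', 'M_en_hi']
```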

Workflow
An overview of MERGEDISTILL is presented in Figure 2. Here we detail each step involved in training the student LM from multiple teacher LMs.
Step 1: Input The input to MERGEDISTILL is a set of pre-trained teacher LMs and pre-training transfer corpora for all the languages we wish to train our student LM on. With reference to Figure 2, the student LM is trained on K = {English (en), Spanish (es), Hindi (hi), Korean (ko)}. We combine four teacher LMs comprising three monolingual and one multilingual LM. The monolingual LMs are trained on English (M en ), Spanish (M es ), and Korean (M ko ), while the multilingual LM is trained on English and Hindi (M en,hi ). Therefore, for each language, the corresponding set of teacher LMs (T k ) is: [T en = {M en , M en,hi }, T es = {M es }, T hi = {M en,hi }, T ko = {M ko }]. First, the pre-training transfer corpus for each language is tokenized and masked using the respective teacher LM's tokenizer. For the language with two teachers, English, we tokenize each example using both teacher LMs.
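Step 1 can be sketched in miniature. This is not the paper's pipeline code: the whitespace "tokenizer" stands in for the teachers' real WordPiece tokenizers, and the vocabulary, sentence, and masking probability are invented for illustration:

```python
import random

# Toy sketch of Step 1: each language's corpus is tokenized with its own
# teacher's vocabulary, then tokens are masked for the MLM objective.
def tokenize(text, vocab):
    """Stand-in tokenizer: lowercase whitespace split, unknowns -> [UNK]."""
    return [tok if tok in vocab else "[UNK]" for tok in text.lower().split()]

def mask_example(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens; keep the gold label for each masked position."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)        # gold label for the masked position
        else:
            masked.append(tok)
            labels.append(None)       # unmasked position -> no label
    return masked, labels

en_vocab = {"the", "cat", "sat", "[MASK]", "[UNK]"}
tokens = tokenize("The cat sat", en_vocab)
masked, labels = mask_example(tokens, mask_prob=0.5, seed=3)
```

For a language with two teachers (English in Figure 2), this step would simply be run twice, once per teacher tokenizer.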
Step 2: Offline Teacher LM Evaluation We now obtain predictions and logits for each masked, tokenized example in each language by evaluating its respective teacher LMs. For English, we obtain predictions from both M en and M en,hi on their respective copies of each training example. Ideally, multiple strong teachers can present a multi-view generalisation to the student, as each teacher learns different features during training. Let x denote a sequence of tokens, where x m = {x 1 , x 2 , x 3 , ..., x n } denotes the masked tokens and x −m denotes the non-masked tokens. Let v be the vocabulary of the student LM θ s . In the conventional case of learning from gold labels, we minimize the cross-entropy of the student's output distribution for a masked word x m i with the one-hot label v j , given by:

L(x m i ) = − log P(x m i = v j | x −m ; θ s )    (1)

With the teacher evaluations, we obtain predictions (and corresponding logits) of the teacher for the masked tokens. Let us denote the teacher output probability distribution (softmax over logits) for token x m i by Q(x m i | x −m ; θ t ). Therefore, in addition to the loss from gold labels, we minimize the cross-entropy between the student distribution and the teacher distribution, given by:

L t (x m i ) = − Σ j=1..|v| Q(x m i = v j | x −m ; θ t ) log P(x m i = v j | x −m ; θ s )    (2)

It is extremely burdensome, in both memory and time, to load multiple teacher LMs and obtain predictions during training. Hence, we first store the top-k logits for each masked word offline, loading and normalizing them during student LM training, similar to Tan et al. (2019). Additionally, obtaining offline predictions gives one the freedom to use expensive teacher LMs without increasing student model training costs and makes our framework teacher-architecture agnostic.
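The offline top-k storage scheme can be sketched as follows. This is a minimal illustration under stated assumptions: the teacher is faked by a fixed logit vector, the JSON-lines storage format is our invention (the paper does not specify one), and k=8 follows the paper's setting:

```python
import json
import math

# Sketch of Step 2: teacher logits for each masked position are truncated to
# the top-k entries and stored offline; at student-training time they are
# re-loaded and re-normalized with a softmax over just those k logits.
def top_k_logits(logits, k=8):
    """Keep the k largest (vocab_index, logit) pairs."""
    return sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:k]

def renormalize(pairs):
    """Softmax over the stored top-k logits only (numerically stabilized)."""
    m = max(l for _, l in pairs)
    exps = [(i, math.exp(l - m)) for i, l in pairs]
    z = sum(e for _, e in exps)
    return {i: e / z for i, e in exps}

fake_teacher_logits = [0.1, 2.5, -1.0, 3.2, 0.0, 1.7, -0.3, 0.9, 2.1, -2.2]
stored = top_k_logits(fake_teacher_logits, k=8)
record = json.dumps({"indices": [i for i, _ in stored],    # one record per
                     "logits": [l for _, l in stored]})    # masked token
probs = renormalize(stored)
assert abs(sum(probs.values()) - 1.0) < 1e-9
```

Storing only k=8 entries per masked token is what keeps the offline evaluation tractable while preserving most of the teacher distribution's mass.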
Step 3: Vocab Mapping A deterrent in attempting to distill from multiple pre-trained teacher LMs is that each LM has its own vocabulary. This makes it non-trivial to uniformly process an input example for consumption by both the teacher and student LMs. Our student model's vocabulary is the union of all teacher LM vocabularies. In the vocab mapping step, the input indices, prediction indices, and gold label indices obtained after evaluation from each teacher LM are processed using a teacher→student vocab map. This converts each teacher token index to its corresponding student token index, ready for consumption by the student model. For simplicity, each teacher and student LM uses WordPiece tokenization (Schuster and Nakajima, 2012; Wu et al., 2016) in all our experiments.
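The vocab-union and index-remapping logic can be sketched with toy vocabularies (the token lists below are invented and far smaller than real WordPiece vocabularies):

```python
# Sketch of Step 3: the student vocabulary is the union of all teacher
# vocabularies, and each teacher gets a teacher_index -> student_index map.
teacher_vocabs = {
    "M_en":    ["[PAD]", "[MASK]", "the", "cat"],
    "M_en_hi": ["[PAD]", "[MASK]", "the", "bhasha"],
}

# Union of all teacher vocabularies (deduplicated, deterministic order).
student_vocab = sorted(set().union(*[set(v) for v in teacher_vocabs.values()]))
student_id = {tok: i for i, tok in enumerate(student_vocab)}

# One lookup table per teacher: position t holds the student id of teacher token t.
vocab_map = {name: [student_id[tok] for tok in vocab]
             for name, vocab in teacher_vocabs.items()}

def to_student_ids(teacher_name, teacher_ids):
    """Remap a teacher-side index sequence (inputs, predictions, or labels)."""
    return [vocab_map[teacher_name][i] for i in teacher_ids]
```

Because the map is a plain lookup table, inputs, predictions, and gold labels from any teacher can all be remapped with the same cheap operation before student training.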
Step 4: Student LM Training The processed input indices, prediction indices, and gold label indices can now be used to train the multilingual student LM. In training, examples from different languages are shuffled together, even within a batch. We train the student LM with the MLM objective. Let L MLM denote the MLM loss from gold labels; with reference to Equation 1:

L MLM = − Σ x m i ∈ x m log P(x m i = v j | x −m ; θ s )

In addition to learning from gold labels, we use teacher predictions as soft labels and minimize the cross entropy between the student and teacher distributions. Let L KD denote the KD loss from a single teacher LM; with reference to Equation 2:

L KD = − Σ x m i ∈ x m Σ j=1..|v| Q(x m i = v j | x −m ; θ t ) log P(x m i = v j | x −m ; θ s )

The total loss, minimized across all languages with a weight λ on the teacher loss, is:

L = Σ k ∈ K [ (1 − λ) L MLM + λ L KD ]

In the case of multiple teacher LMs, we have n tokenized instances for a given example (where n denotes the number of teachers for a particular language). Here, each example in English has two copies: one tokenized using M en and another using M en,hi . Thus, we explore two possibilities of training in this multi-teacher scenario:

• Include all copies in training. Here the model is exposed to n different teacher LM predictions, each presenting a multi-view generalisation to the student LM.
• Include the best copy in training. The best copy is the one having minimum teacher LM loss for a given example. Here the model is only exposed to the best teacher LM predictions for each example.
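The per-position loss of Step 4 can be sketched numerically. This is a toy calculation, not training code: the student logits, teacher distribution, and λ value are invented, and the real loss sums over all masked positions and languages:

```python
import math

# Sketch of the Step 4 loss at one masked position: cross-entropy against the
# one-hot gold label (L_MLM) plus cross-entropy against the teacher's
# re-normalized top-k distribution (L_KD), mixed with a weight lambda.
def log_softmax(logits):
    m = max(logits)
    z = math.log(sum(math.exp(l - m) for l in logits)) + m
    return [l - z for l in logits]

def mlm_loss(student_logits, gold_index):
    """Negative log-likelihood of the gold token (one-hot cross-entropy)."""
    return -log_softmax(student_logits)[gold_index]

def kd_loss(student_logits, teacher_probs):
    """teacher_probs: {student_vocab_index: probability} from stored top-k."""
    logp = log_softmax(student_logits)
    return -sum(q * logp[i] for i, q in teacher_probs.items())

student_logits = [0.2, 1.1, -0.4, 2.0]
gold = 3
teacher_probs = {3: 0.7, 1: 0.3}          # teacher mostly agrees with gold
lam = 0.5                                  # annealed over training in the paper
total = (1 - lam) * mlm_loss(student_logits, gold) \
        + lam * kd_loss(student_logits, teacher_probs)
```

Note the sanity property: a one-hot "teacher" that puts all its mass on the gold token makes the KD loss collapse to the ordinary MLM loss.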

Experiments
In this section, we describe our experimental setup and results for both monolingual and multilingual teacher LMs.

Distillation Parameters: We have two hyperparameter choices. 1) k in the top-k logits: as k increases, we observe that while performance remains similar, storing k > 8 predictions for each masked word offline significantly increases resource requirements 4 . Hence, we set k = 8 in all our experiments. 2) The value of λ in the loss function, which decides the proportion of teacher loss, is annealed through training, similar to Clark et al. (2019).
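The λ schedule can be sketched as follows. The linear ramp is our assumption for illustration: the paper only states that λ is annealed "similar to Clark et al. (2019)" (who move weight from the teacher loss to the gold-label loss over training), without giving the exact schedule:

```python
# Sketch of a lambda annealing schedule: the teacher-loss weight starts at 1
# (student relies on teacher predictions) and decays to 0 (student relies on
# gold labels). The linear form is an assumption, not the paper's exact curve.
def annealed_lambda(step, total_steps):
    return max(0.0, 1.0 - step / total_steps)

total_steps = 500_000   # training length of the multilingual student LMs
assert annealed_lambda(0, total_steps) == 1.0
assert annealed_lambda(250_000, total_steps) == 0.5
assert annealed_lambda(500_000, total_steps) == 0.0
```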
Evaluation Metrics: We report F1 scores for structured prediction tasks (NER, POS), accuracy (Acc.) scores for sentence classification tasks (XNLI, PAWS-X), and F1/Exact Match (F1/EM) scores for question answering tasks (XQuAD, MLQA, TyDiQA). We also report a task-specific relative deviation from teachers (RDT) (in %), averaged across all n languages of a task. For each task, RDT with respect to teacher i is calculated as:

RDT i = (100 / n) Σ l=1..n (P S − P T i ) / P T i    (3)

where P T i and P S are the performances of the i-th teacher and the student LM on language l, respectively.
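The RDT metric amounts to an average of per-language relative differences. A minimal sketch with invented scores:

```python
# Sketch of the RDT metric (Equation 3): average relative deviation (in %) of
# the student from teacher i across the n languages of a task.
def rdt(teacher_scores, student_scores):
    assert len(teacher_scores) == len(student_scores)
    n = len(teacher_scores)
    return 100.0 * sum((s - t) / t for t, s in zip(teacher_scores, student_scores)) / n

teacher = [80.0, 90.0]    # P_Ti per language (invented)
student = [84.0, 90.0]    # P_S per language (invented)
print(round(rdt(teacher, student), 2))   # +5% and 0% average to 2.5
```

A positive RDT means the student beats that teacher on average; a negative value within −5% is what Table 4 marks as competitive.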

Monolingual Teacher LMs
Pre-training: In this experiment, we use pre-existing monolingual teacher LMs, as shown in Table 1, to train a multilingual student LM on the union of all teacher languages. In this setup, |T k | = 1 ∀ k ∈ K, i.e., each language can learn from its respective monolingual teacher LM only. Our teacher selection follows from two criteria: we select languages that have pre-trained monolingual LMs available, along with evaluation sets across a number of downstream tasks. The resulting teacher LMs are shown in Table 1.

Fine-tuning: We evaluate both the teacher and student LMs on three downstream tasks with in-language fine-tuning for each task 5 : i) Named Entity Recognition (NER), ii) Part-of-Speech tagging (POS), and iii) Question Answering (QA).

Results: We report results of our teacher and student LMs in Table 2. Overall, we find that Student similar outperforms individual teacher models on NER (+0.6%) and QA (+2.8/3.7%) while performing competitively on UDPOS (-0.1%). Student dissimilar is competitive with the teacher LMs, with only small differences of up to 1.3/1.4% (QA), as shown in Table 2. For each language, we find Student similar is either competitive with or outperforms its respective teacher LM. Our results provide evidence for positive transfer across languages in two ways. First, we observe that Student similar outperforms Student dissimilar for the common language, English.

Table 4: Results for multilingual teacher and student LMs on the XTREME benchmark. We compare the performance of three student LM variants, described in Section 4.3, to the two teachers, mBERT and MuRIL. Relative deviations of 5% or less from a teacher (i.e., RDT ≥ −5%) are marked in bold. Overall, we find that Student MuRIL performs the best among all student variants, and we report its RDT (in %) (Equation 3) from the two teachers. Please refer to Section 4.3 for a detailed analysis.

7 Student dissimilar is relatively low-resourced (a sum total of 349M unique tokens) in comparison to Student similar (a sum total of 1,992M unique tokens), as shown in Table 3.

Given that the English teacher (BERT) and the pre-training transfer corpora 7 are common to both student LMs,
we can attribute this gain to the fact that English is trained alongside linguistically and typologically similar languages in Student similar . Second, Student similar outperforms its teacher LMs, while Student dissimilar is competitive for all languages. Together, these results point towards Student similar benefiting from positive transfer across similar languages. In Table 3, we observe that Student similar is trained on 9.9% of the total unique tokens seen by its respective teacher LMs, and Student dissimilar lies close at 13.6%. Despite this huge disparity in pre-training corpora, the student LMs are competitive with their teachers. This encouraging result shows that even with very limited data, MERGEDISTILL enables one to combine strong monolingual teacher LMs to train competitive student LMs that can leverage the benefits of multilinguality.

Multilingual Teacher LMs
Pre-training: In this experiment, we make use of pre-existing multilingual models: mBERT and MuRIL. mBERT is trained on 104 languages, and MuRIL covers 12 of these (11 Indian languages + English): Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Nepali (ne), Punjabi (pa), Tamil (ta), Telugu (te), and Urdu (ur), with higher performance for these languages on the XTREME benchmark. We train the student model on all 104 languages. In this case, the MuRIL Languages (MuL) have two teachers (mBERT and MuRIL), and the Non-MuRIL Languages (Non-MuL) can learn from mBERT only. Therefore, while we only use mBERT as the teacher LM for Non-MuL across all experiments, we consider three possibilities for MuL:

i) Student MuRIL : We only use MuRIL as the teacher LM, and each input training example is tokenized using MuRIL.

ii) Student mBERT : We only use mBERT as the teacher LM, and each input training example is tokenized using mBERT.

iii) Student Both : As highlighted in Section 3, we consider two ways to incorporate both teacher LMs' predictions in training:

• Student Both all : Tokenize each input example using mBERT and MuRIL separately and include both copies in training.
• Student Both best : Tokenize each input example using mBERT and MuRIL separately and include only the best copy in training. The best copy is the one having minimum teacher LM loss for the example.
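The best-copy rule can be sketched as a minimum over per-teacher losses. The probabilities below are invented stand-ins for real teacher outputs; in practice the per-teacher loss would come from the stored top-k distributions:

```python
import math

# Sketch of the Student_Both_best selection rule: when an example has one
# tokenized copy per teacher, keep only the copy whose teacher assigns the
# gold tokens the lowest loss (i.e., the most confident teacher).
def teacher_loss(probs_per_position, gold_indices):
    """Mean negative log-likelihood of the gold tokens under one teacher."""
    nlls = [-math.log(probs[g]) for probs, g in zip(probs_per_position, gold_indices)]
    return sum(nlls) / len(nlls)

def best_copy(copies):
    """copies: {teacher_name: (probs_per_position, gold_indices)}."""
    return min(copies, key=lambda t: teacher_loss(*copies[t]))

copies = {
    "mBERT": ([{7: 0.2, 9: 0.8}], [7]),   # gives gold token id 7 prob 0.2
    "MuRIL": ([{7: 0.9, 9: 0.1}], [7]),   # gives gold token id 7 prob 0.9
}
print(best_copy(copies))   # MuRIL is the more confident teacher here
```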
Note that it is non-trivial to tokenize each example in a way that is compatible with all teacher LMs; one would have to resort to tokenizing with the intersection of the vocabularies, which is sub-optimal.
All the student LMs use a BERT-base architecture and have a vocabulary size of 288,973. We reduce our embedding dimension to 256, as opposed to 768, to bring the model size down to around 160M parameters, comparable to mBERT (178M). We keep a batch size of 4096 and train for 500,000 steps with a maximum sequence length of 512.

Table 5: Importance of teacher vocabulary and predictions in MERGEDISTILL. We observe the largest performance gains when changing the vocabulary from mBERT's in SM1 to the (mBERT ∪ MuRIL) vocabulary in SM2. Here, SM3 is the standard Student MuRIL . We also observe that SM3 100k, trained for 20% of the total training steps, is competitive with SM3 and significantly outperforms SM2 100k, highlighting the importance of teacher LM predictions in a limited-data scenario. Please see Section 4.4 for details.
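The ~160M figure can be checked with back-of-the-envelope arithmetic. The breakdown below assumes a factorized embedding (vocab × 256, projected up to the 768-dimensional hidden size) and a standard 12-layer BERT-base body; position/segment embeddings and a few small terms are ignored, so this is an estimate, not the paper's exact accounting:

```python
# Rough parameter count for the multilingual student LM.
V, E, H, L, FF = 288_973, 256, 768, 12, 3072

# Token embeddings at reduced dim E, plus a projection up to hidden size H
# (our assumption for how a 256-d embedding feeds a 768-d transformer).
embedding = V * E + E * H

# Standard per-layer BERT arithmetic: self-attention (Q, K, V, output
# projections + biases), feed-forward (two matrices + biases), two LayerNorms.
attention  = 4 * H * H + 4 * H
ffn        = 2 * H * FF + FF + H
layernorms = 4 * H
body = L * (attention + ffn + layernorms)

total = embedding + body
print(f"{total / 1e6:.0f}M parameters")   # lands near the quoted ~160M
```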
Fine-tuning: We report zero-shot performance for all languages in the XTREME (Hu et al., 2020) benchmark 8 .

Results:
We report results of our teacher and student LMs in Table 4. Overall, we find that Student MuRIL performs the best among all student variants. For Non-MuL, Student MuRIL beats the teacher (mBERT) by an average relative score of 3.8%. For MuL, Student MuRIL beats one teacher (mBERT) by 8.8% but underperforms the other teacher (MuRIL) by 3.8%. There can be two factors at play here. MuRIL is trained on monolingual and parallel data 9 , while the student LMs only see ∼22% of the unique tokens in comparison. MuRIL also uses a different language sampling strategy (α = 0.3 as opposed to 0.7 in our setting, where a lower α value upsamples more aggressively from the tail languages), and such strategies have a significant role to play in multilingual model performance (Conneau et al., 2020). We also observe a significant drop in Student mBERT 's performance for MuL when compared to the other student LM variants. This might be because the input is tokenized using the mBERT tokenizer, which prevents learning from MuRIL tokens in the student vocabulary. For Student Both , we do not observe much of a difference between Student Both all and Student Both best . This observation may differ with one's choice of teacher LMs, depending on how well each performs for a particular language; in our case, we do not observe much of a difference from incorporating mBERT predictions for MuL.

8 More details in Appendix A.3.
9 More details in Appendix A.2.

Further Analysis
The importance of vocabulary and teacher LM predictions: In Table 4, we see that Student MuRIL significantly outperforms mBERT for MuL, despite both being trained on Wikipedia corpora and having comparable model sizes. With regard to MuL, Student MuRIL differs from mBERT in two main aspects: i) Student MuRIL 's vocabulary is the union of the mBERT and MuRIL vocabularies, and ii) Student MuRIL is trained with additional MuRIL predictions as soft labels. To disentangle the role these two factors play in Student MuRIL 's improved performance, we train two models: i) SM1 is trained exactly like Student MuRIL , but with the mBERT vocabulary and on gold labels. ii) SM2 is trained using Student MuRIL 's vocabulary (mBERT ∪ MuRIL) but on gold labels only, without teacher predictions.
The results are summarized in Table 5. Note that we refer to Student MuRIL as SM3. Overall, we observe a ∼4.2% gain in average performance for SM2 over SM1. This clearly highlights that, given fixed data and model capacity, LM training significantly benefits from incorporating a strong teacher's vocabulary.
Furthermore, we observe that SM2 and SM3 achieve competitive performance despite SM3 being additionally trained on teacher LM labels. Motivating the need for teacher predictions, Hinton et al. (2015) argue that when soft targets have high entropy, they provide much more information per training case than hard targets, and the student can be trained on much less data than the original cumbersome model. In our case, we hypothesize that training for 500,000 steps exposes the model to sufficient data for it to generalize well enough and mask the benefits of teacher LM predictions. To validate this, we evaluate the performance of SM2 and SM3 20% of the way into training (i.e., at 100,000 of 500,000 steps), as shown in Table 5. We observe a ∼2.9% gain in average performance for SM3 over SM2, clearly highlighting the importance of teacher LM predictions in a limited-data scenario. This is especially important when one has access to very limited monolingual data but a strong teacher LM for a particular language.
Pre-trained zero-shot transfer: Interestingly, Student MuRIL performs the best on almost all tasks for Non-MuL. This hints at positive transfer from strong teachers to languages that the teacher does not cover at all, due to the shared multilingual representations. 10 Learning from strong teachers can thus improve the student model's performance in a zero-shot manner on related languages not covered by the teacher, which would make MERGEDISTILL highly beneficial for low-resource languages that lack a strong teacher or have limited gold data. We leave this exploration to future work.

Conclusion
In this paper, we address the problem of merging multiple pre-trained teacher LMs into a single multilingual student LM by proposing MERGEDISTILL, a task-agnostic distillation method. To the best of our knowledge, this is the first attempt of its kind. The student LM learned by MERGEDISTILL may be further fine-tuned for any task across all of the languages covered by the teacher LMs. Our approach results in better maintainability (fewer models) and is compute-efficient (due to offline predictions). We use MERGEDISTILL to i) combine monolingual teacher LMs into one multilingual student LM that is competitive with the teachers, thereby demonstrating positive cross-lingual transfer, and ii) combine multilingual LMs to train student LMs that learn from multiple teachers. Through experiments on multiple benchmark datasets, we show that student LMs learned by MERGEDISTILL perform competitively with or even outperform teacher LMs trained on orders of magnitude more data. We disentangle the positive impact of incorporating strong teacher LM vocabularies and of learning from teacher LM predictions, highlighting the importance of the latter in a limited-data scenario. We also find that MERGEDISTILL enables positive transfer from strong teachers to languages not covered by them (i.e., zero-shot transfer). Our work bridges the gap between the universe of language-specific models and massively multilingual LMs, incorporating the benefits of both into one framework.
In a distillation setup, the student is trained to not only match the one-hot labels for masked words, but also the probability output distribution of the teacher t. Let us denote the teacher output probability distribution for token x m i by Q(x m i |x −m ; θ t ).
The cross entropy between the teacher and student distributions then serves as the distillation loss:

L KD = − Σ x m i ∈ x m Σ j=1..|v| Q(x m i = v j | x −m ; θ t ) log P(x m i = v j | x −m ; θ s )

The total loss is then defined as:

L = (1 − λ) L MLM + λ L KD

With the addition of the teacher, the target distribution is no longer a single one-hot label but a smoother distribution in which multiple words have non-zero probability, which yields a smaller variance in gradients (Hinton et al., 2015). Intuitively, a single masked word can have several valid predictions that appropriately fit the context.
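The information argument can be illustrated with a tiny entropy calculation. The two distributions below are invented, serving only to show that a soft teacher target carries strictly more information per training case than a one-hot label:

```python
import math

# Tiny numerical illustration of the appendix argument: a soft teacher target
# spreads probability over several contextually valid words and therefore has
# higher entropy (more information per training case) than a one-hot label.
def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

one_hot = [1.0, 0.0, 0.0, 0.0]        # gold label only
soft    = [0.55, 0.25, 0.15, 0.05]    # teacher: several plausible words
assert entropy(one_hot) == 0.0
assert entropy(soft) > entropy(one_hot)
```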

A.2.1 Monolingual Teacher LMs
We pre-train our student models using the BERT-base architecture. Student similar has a vocabulary size of 99,112 and a model size of 162M parameters. Student dissimilar has a vocabulary size of 180,996 and a model size of 225M parameters. We keep a batch size of 4096 and train for 250k steps with a maximum sequence length of 512. We use TPUs, and it takes around 1.5 days to pre-train each student LM.

A.2.2 Multilingual Teacher LMs
We pre-train our student models using the BERT-base architecture. All student LMs have a vocabulary size of 288,973. Hence, we reduce our embedding dimension to 256, as opposed to 768, to bring the model size down to around 160M parameters, comparable to mBERT (178M). We keep a batch size of 4096 and train for 500k steps with a maximum sequence length of 512. We use TPUs, and it takes around 3 days to pre-train each student LM.
We present pre-training data statistics for MuRIL and the student LMs in Table 6. Here we only include the monolingual data statistics, but MuRIL is additionally trained on parallel translated and transliterated data.

ii) Part-of-Speech tagging (POS): We use the Universal Dependencies v2.6 (Zeman et al., 2020) dataset for all languages. Detailed statistics for each language can be found in Table 9. Specifically, we use the Hugging Face re-packaged implementation of the dataset 12 .
iii) Question Answering (QA): We use the TyDiQA dataset (Clark et al., 2020) for ar and fi, SQuAD v1.1 (Rajpurkar et al., 2016) for en, translated SQuAD for it (Croce et al., 2018) and es (Carrino et al., 2020), DRCD (Shao et al., 2018) for zh, and TQuAD 13 for tr. Detailed statistics for each language can be found in Table 10. Note that we use the dev sets as our test sets, since most datasets only have a train/dev split. We use 10% of randomly shuffled training examples as our dev sets.

A.3.2 Multilingual Teacher LMs
Data Statistics: We evaluate all the teacher (mBERT and MuRIL) and student (Student MuRIL , Student mBERT , and Student Both ) LMs on the XTREME (Hu et al., 2020) benchmark. We fine-tune the pre-trained models on English training data for each task, except TyDiQA, where we use additional SQuAD v1.1 English training data, similar to Fang et al. (2020). All results are computed in a zero-shot setting.
Hyperparameter Details We use the same hyperparameters for fine-tuning all teacher and student LMs, as shown in Table 11. We report results on the best-performing checkpoint for the

A.4 Different top-k values
We present results for Student MuRIL trained with different top-k values from teacher predictions in Table 8. We observe that while performances remain similar for higher values of k, storage becomes increasingly expensive. Hence, we stick to a value of k=8 in all our experiments.