Mini-Model Adaptation: Efficiently Extending Pretrained Models to New Languages via Aligned Shallow Training

Prior work shows that it is possible to expand pretrained Masked Language Models (MLMs) to new languages by learning a new set of embeddings, while keeping the transformer body frozen. Despite learning a small subset of parameters, this approach is not compute-efficient, as training the new embeddings requires a full forward and backward pass over the entire model. We propose mini-model adaptation, a compute-efficient alternative that builds a shallow mini-model from a fraction of a large model's parameters. New language-specific embeddings can then be efficiently trained over the mini-model and plugged into the aligned large model for rapid cross-lingual transfer. We explore two approaches to learn mini-models: MiniJoint, which jointly pretrains the primary model and the mini-model using a single transformer with a secondary MLM head at a middle layer; and MiniPost, where we start from a regular pretrained model, build a mini-model by extracting and freezing a few layers, and learn a small number of parameters on top. Experiments on XNLI, MLQA and PAWS-X show that mini-model adaptation matches the performance of the standard approach using 2.3x less compute on average.


Introduction
Recent work on multilingual NLP has focused on pretraining (masked) language models on unlabeled corpora in multiple languages (Pires et al., 2019; Conneau et al., 2020; Xue et al., 2021). The resulting models can then be finetuned using labeled downstream data in a single language (typically English), and zero-shot transferred to the rest of the languages. While effective, existing models rarely cover more than a few dozen languages, and pretraining new models from scratch to support additional languages can be prohibitively expensive.
Motivated by this, a recent line of work has explored pretraining an initial model in a few languages, and expanding it to new languages post-hoc in a continual learning fashion (M'hamdi et al., 2022). More concretely, Artetxe et al. (2020) showed that it is possible to expand an English masked language model (MLM) to new languages by freezing the transformer body and learning a new embedding layer using the original MLM objective. Recent work has reported improved results by using a better initialization scheme (Pfeiffer et al., 2021), or learning additional language-specific parameters through adapters (Pfeiffer et al., 2022). All these approaches are parameter-efficient, as they only learn a small number of parameters for each language, while the rest remain frozen. However, learning such parameters is not compute-efficient, as it requires a full forward and backward pass over the entire model, including the frozen transformer body.
We introduce mini-model adaptation, a new approach to extend MLMs to new languages that is both parameter- and compute-efficient. Mini-models are shallow models that are aligned with a larger parent model. Thanks to this, one can efficiently train a new embedding layer for a new language over the mini-model, and plug it directly into the parent for strong cross-lingual performance.
As shown in Figure 2, we explore two approaches to learn mini-models, depending on whether we start from an existing primary model and learn a mini-model post-hoc (MINIPOST), or we jointly learn the primary model and the mini-model from scratch (MINIJOINT). In MINIPOST, we extract the bottom layers from the existing MLM, and learn a small number of parameters on top to make it a usable small MLM itself. In the MINIJOINT variant, we pretrain an MLM from scratch including a secondary head at a middle layer. Both heads are optimized jointly, creating a complete, well-aligned MLM contained within a larger MLM.
We evaluate our approach on natural language inference (XNLI), question answering (MLQA) and paraphrase identification (PAWS-X). As shown in Figure 1, mini-model adaptation can match the performance of the standard method from Artetxe et al. (2020) using 1.6x and 2.3x less compute for MINIPOST and MINIJOINT, respectively (averaged over tasks), and retains >98% of performance when trained to completion.
All in all, our work shows that it is possible to adapt language models to new tasks (in this case, new languages) using smaller aligned models for training. While we focus on the problem of cross-lingual lifelong learning to validate this idea, we believe that this new paradigm opens exciting opportunities to make finetuning large language models more affordable.

The standard approach of Artetxe et al. (2020) performs cross-lingual transfer from a monolingual model in four steps, visualized in Figure 2 (top). First, one trains a monolingual MLM in the source language (L src, usually English). Second, the transformer body is frozen, the embeddings are re-initialized, and the model is trained with MLM in the target language (L trg). The trainable embeddings are tied with the output projection layer in the MLM head. Third, the L src embeddings are swapped back into the model and frozen, and the transformer body is finetuned on the downstream data in L src. Finally, the L trg embeddings are swapped back into the finetuned model for zero-shot transfer into L trg. We build two baselines based on this framework: a standard 12-layer model (BL_BASE), and a smaller 4-layer version (BL_SMALL).
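The four-step recipe above can be sketched schematically. The following toy Python sketch (our own illustration, not the paper's code) tracks only which data each parameter group was last trained on, with no real training:

```python
# A "model" is a dict mapping parameter groups to the data they were
# last trained on. Real training is elided; only the recipe is shown.

def step1_pretrain_src():
    # Step 1: monolingual MLM pretraining in L_src; everything trainable
    return {"emb": "src-mlm", "body": "src-mlm"}

def step2_learn_trg_embeddings(model):
    # Step 2: freeze the body, re-initialize the (tied) embeddings,
    # and train them with MLM in L_trg
    return {"emb": "trg-mlm", "body": model["body"]}  # body unchanged

def step3_finetune_src(model):
    # Step 3: swap the L_src embeddings back in (frozen), finetune the body
    return {"emb": model["emb"], "body": "src-task"}

def step4_zero_shot(finetuned, trg_emb):
    # Step 4: swap in the L_trg embeddings for zero-shot inference
    return {"emb": trg_emb, "body": finetuned["body"]}

mlm = step1_pretrain_src()
trg = step2_learn_trg_embeddings(mlm)
tuned = step3_finetune_src(mlm)
final = step4_zero_shot(tuned, trg["emb"])
# final combines a task-finetuned body with MLM-trained L_trg embeddings
```

The key property the sketch makes explicit is that the body and the embeddings are never trained at the same time after Step 1, which is what makes the swaps in Steps 3 and 4 valid.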

Mini-Model Adaptation
Our proposed approach follows a similar four-step training paradigm. However, we learn two aligned models in Step 1: the primary model and a shallow mini-model. In Step 2, the L trg embeddings are learned over the mini-model, saving compute with respect to standard adaptation. Steps 3 and 4 are run as usual over the primary model, resulting in a full-size L trg model. For Step 1, we explore the following two alternatives, depending on whether we start from an existing L src model or we are training one from scratch:

MINIJOINT. In this variant, we pretrain a dual-head 12-layer L src transformer from scratch, attaching a secondary head to an intermediary Nth layer (Figure 2, center). The model is trained to minimize the average MLM loss over the two heads. As such, the whole model receives gradient updates from the primary head, and the bottom layers also get updates from the secondary head. Having done that, we extract the bottom N layers and the secondary head to create the mini-model for Step 2. Unless otherwise indicated, we use N = 4.
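The gradient flow can be made explicit with a minimal schematic (ours, not the paper's code): with the secondary head attached after layer N, the bottom N layers receive updates from both heads, which is why they form a self-contained MLM once extracted:

```python
def heads_updating_layer(layer: int, attach_at: int = 4) -> set:
    """Return which MLM heads backpropagate into a given layer (1-indexed),
    assuming the secondary head is attached after layer `attach_at`."""
    heads = {"primary"}            # the primary head reaches every layer
    if layer <= attach_at:
        heads.add("secondary")     # the secondary head only reaches the bottom N
    return heads

# With N = 4: layers 1-4 are trained by both heads and become the mini-model;
# layers 5-12 are trained by the primary head only.
```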
MINIPOST. Here, we start with a regular 12-layer MLM in L src (same as BL_BASE), and build an aligned mini-model in Step 1b (Figure 2, bottom). To that end, we first copy the bottom N layers into a new, shallower model, along with the embeddings and the MLM head. However, this does not work out of the box, as we must bridge the gap between the output of the N bottom layers and the input of the MLM head, which goes through 12−N additional layers in the original model. To that end, we add 2 randomly-initialized layers between the N bottom layers and the MLM head, and train them with the MLM objective in L src while keeping the rest of the parameters frozen. Because the new layers are unfrozen, they update to "complete" the MLM, bridging representations from the bottom layers' output to the MLM head's input, and resulting in a mini-model with N + 2 layers that is fully functional and aligned with the primary model.
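The construction can be sketched as follows (layer objects are placeholder strings and the names are ours; a sketch of the structure, not an implementation):

```python
def build_mini_model(parent_layers, n_bottom=4, n_bridge=2):
    """Copy the bottom N layers of the parent (kept frozen) and stack
    `n_bridge` freshly initialized, trainable bridge layers on top."""
    bottom = parent_layers[:n_bottom]                  # frozen copies of the parent
    bridge = [f"bridge_{i}" for i in range(n_bridge)]  # trained in Step 1b
    return bottom + bridge                             # N + 2 layers total

parent = [f"layer_{i}" for i in range(12)]  # a 12-layer pretrained MLM
mini = build_mini_model(parent)             # 6-layer mini-model aligned with it
```

Because the bottom layers are shared verbatim with the parent, embeddings trained against the mini-model remain compatible with the full 12-layer model.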

Experimental Settings
Languages and Data. Following common practice, we use English as the source language (L src), and experiment with 14 other languages as the target (L trg). We use CC-100 (Conneau et al., 2020) as our training corpus, which is a filtered version of CommonCrawl. We report the full list of languages along with their corpus size and linguistic details in Table 1. Each language is preprocessed individually using SentencePiece (Kudo and Richardson, 2018) with a vocabulary size of 50,000.
Models. We compare our two approaches, MINIJOINT (which pretrains a dual-head model from scratch and extracts a 4-layer mini-model, as N = 4) and MINIPOST (which starts from a regular 12-layer model and constructs a 6-layer mini-model post-hoc), against two baselines. BL_BASE is a performance upper-bound, as it is the original 12-layer model that is used for adaptation. BL_SMALL is a lower-bound, demonstrating the performance of the standard approach using an adaptation model of similar size as ours. Models are trained for 125,000 steps with a global batch size of 2048, sequence length of 512, and learning rate of 7e-4 with 10,000 warmup updates and linear decay, both for the original pretraining (Step 1) and the cross-lingual extension into each language (Step 2). As such, models see 131.1 billion training tokens per language. Step 1b in MINIPOST uses the same training hyperparameters.
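The token count follows directly from these hyperparameters:

```python
steps, batch, seq_len = 125_000, 2048, 512
tokens = steps * batch * seq_len  # tokens seen per language during training
print(tokens / 1e9)               # ≈ 131.1 billion
```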
Evaluation. We evaluate on 3 tasks: natural language inference in XNLI (Conneau et al., 2018), question answering in MLQA (Lewis et al., 2020), and adversarial paraphrase identification in PAWS-X (Yang et al., 2019). We also report XQuAD (Artetxe et al., 2020) results in §A.2. In all cases, the model is finetuned using the corresponding training data in English (Step 3), and zero-shot transferred into the rest of the languages (Step 4). We perform 5 independent finetuning runs with different random seeds, and report average results. During finetuning, we use a peak learning rate of 1e-5 for XNLI and PAWS-X, and 3e-5 for MLQA and XQuAD. Each uses a warmup ratio of 0.06 and linear decay, and is finetuned for 3 epochs.
Estimating FLOPs. We compare the training efficiency of different approaches using floating point operations (FLOPs). To calculate FLOPs, we estimate analytically using an adaptation of the formula from Narayanan et al. (2021), detailed in §A.1. When doing so, we exclusively consider the cost of expanding the model to a new language (Step 2), which is the most significant in the cross-lingual lifelong learning setup that our work addresses. We also report NVIDIA V100 GPU training days as a more interpretable number, which we estimate analytically assuming a throughput of 30 TFLOP/s, or 1 V100 day = 2.592 EFLOPs.
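The conversion is a single constant; a minimal sketch, assuming the paper's 30 TFLOP/s sustained throughput:

```python
V100_FLOPS = 30e12                            # assumed sustained throughput (FLOP/s)
EFLOPS_PER_DAY = V100_FLOPS * 86_400 / 1e18   # = 2.592 EFLOPs per V100 day

def eflops_to_v100_days(eflops: float) -> float:
    """Convert an EFLOP budget into the equivalent number of V100 GPU-days."""
    return eflops / EFLOPS_PER_DAY

# e.g. the ~78.8 EFLOPs quoted for 12-layer pretraining come to ~30.4 V100 days
```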
In some of our experiments, we are interested in estimating the training FLOPs required to achieve a certain performance. However, this cannot be computed precisely, as we only have a limited number of intermediate checkpoints. For that reason, we identify the checkpoints immediately before and after which the model first scores the desired performance, and use linear interpolation to estimate the step at which the exact score would have been hit. For instance, if MINIPOST obtains an accuracy of 48% at the 5,000-update checkpoint (∼1.17 EFLOPs) and 52% at the 10,000-update checkpoint (∼2.34 EFLOPs), we estimate that an accuracy of 50% was achieved at 7,500 steps (∼1.76 EFLOPs).

Note that, while Step 1 can also be expensive, it is amortized over time: the initial model is trained only once, but extended to new languages many times. The cost of Step 1 is similar for BL_BASE and MINIJOINT, as the overhead of the second head is small (∼30.4 vs. ∼32.2 V100 days for a 12-layer model). MINIPOST incurs extra cost from Step 1b, but this is relatively small compared to the cost of pretraining (see §A.1).
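The interpolation itself is simple; a sketch reproducing the worked example above:

```python
def estimate_crossing(x_lo, y_lo, x_hi, y_hi, target):
    """Linearly interpolate where `target` falls between two checkpoints,
    given (cost, score) pairs bracketing it. `x` may be steps or EFLOPs."""
    frac = (target - y_lo) / (y_hi - y_lo)
    return x_lo + frac * (x_hi - x_lo)

steps = estimate_crossing(5_000, 48.0, 10_000, 52.0, 50.0)  # 7,500 steps
eflops = estimate_crossing(1.17, 48.0, 2.34, 52.0, 50.0)    # ~1.76 EFLOPs
```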

Performance at Training Completion
Table 2 reports performance at training completion (i.e., after 125,000 updates in Step 2). As expected, BL_BASE obtains the best results, but its training cost is also the highest. In contrast, MINIJOINT requires nearly one third of the compute, while obtaining similar results. More concretely, it is marginally better on PAWS-X, while moderately (1-2 points) worse on MLQA and XNLI. Averaged over tasks, MINIJOINT retains 98.7% of BL_BASE's performance at 39% of its cost. This validates the core hypothesis of our work: learning target language embeddings over the mini-model is almost as effective as learning them over the original model, while being significantly cheaper.
MINIPOST follows a similar trend, retaining 99.3% of BL_BASE's performance at nearly half of its cost. This shows that mini-models do not need to be trained from scratch: one can take any existing English model and build its corresponding mini-model post-hoc.
BL_SMALL performs substantially worse than our proposed approach. BL_SMALL has the same training cost as MINIJOINT, but is 4.0 points worse on XNLI, 4.8 points worse on MLQA, and 9.0 points worse on PAWS-X. This shows that our idea of having two aligned models, a shallow one for efficient adaptation and a deep one for best performance at test time, is critical, as using a shallow model both for adaptation and inference performs considerably worse.

GPU days to Near-Maximal Performance
While we previously compared approaches at training completion, one can also apply early stopping, sacrificing some performance to gain efficiency. This also allows comparing different approaches head-to-head according to the compute they require to achieve a given score, assuming we stop training as soon as the desired performance is hit. To that end, we fix our target score as 95% of the performance obtained by BL_BASE at the end of training, which we call near-maximal performance. Results are in Table 3, and the average speedup of our approach over standard adaptation is in Figure 1. Overall, MINIJOINT does best: when per-language speedup is averaged across languages, we see that it requires about half to one-third the compute of BL_BASE to achieve the same performance in all tasks. MINIPOST has more modest speedups, but is still substantially faster than standard adaptation to hit the desired performance. This shows that, if possible, it is preferable to pretrain mini-models jointly with the primary model, but our approach can also bring substantial speedups when starting with an existing pretrained model.
It is also remarkable that there is considerable variance across tasks. In particular, all approaches require substantially less compute to achieve the target performance in PAWS-X when compared to XNLI and MLQA. The relative speedup of mini-model adaptation is also considerably higher on PAWS-X. We also observe a high variance across languages, which we analyze in more detail in §5.4.

Training Curves
We visualize the training curves of the different approaches in Figure 3. Consistent with our previous findings, we observe that MINIJOINT is usually the leftmost curve, signifying the most rapid adaptation, at the cost of a slightly lower final score. In contrast, BL_BASE is by far the slowest system approaching its peak performance, while BL_SMALL gets stuck at a poor performance compared to the other approaches. Finally, we find that all methods adapt rapidly in PAWS-X, which suggests that this task might be easier than the others.
Figure 4 shows the XNLI training curve averaged over 3 languages. We see more rapid adaptation with shallower attachment of the second head, at a cost to final performance. §A.3 shows curves for PAWS-X, MLQA, and XQuAD. For PAWS-X, high performance was rapidly achieved by all models. End-of-training results are in Table A3.
Table 4 reports estimated V100 days to achieve near-maximal performance as defined in §4.2, with upper and lower estimates in §A.3. We find that the optimal depth of the mini-model is largely language-dependent. Specifically, Arabic and Turkish never hit the target performance with 2 layers, whereas German does so quickly. For Arabic, 4 layers provides the most rapid adaptation, while Turkish requires at least 6. This suggests that it is critical to have some minimum number of layers to achieve good performance, which varies from language to language. But, as long as this minimum is met, shallower mini-models are generally more efficient. Note that attaching the secondary head after layer 12 means that both heads are at the final layer, making it virtually equivalent to BL_BASE.

English Performance
While all of our results so far correspond to the target languages, we next look into the source language performance. As described in §2.2, MINIPOST uses BL_BASE as the primary model, so their English performance is exactly the same. However, MINIJOINT jointly pretrains the primary model and its aligned mini-model, which could in principle hurt monolingual quality. As shown in Table 5, dual-head training does not damage performance: the full MINIJOINT model performs on par with the 12-layer baseline, and the 4-layer extracted mini-model performs on par with the 4-layer baseline.

Variance Across Languages
While we obtain strong results across the board, there are 3 languages that prove challenging: Hindi, Turkish and Urdu. As shown in Table 3, MINIJOINT takes more than 5 V100 days to achieve near-maximal performance on XNLI for these languages, whereas the rest of the languages require at most 1 day. As seen in §5.2, this can be mitigated by using a deeper mini-model in the case of Turkish. However, we observe that even BL_BASE struggles with Urdu and, to a lesser extent, Hindi. This suggests that there is something making these languages particularly challenging for cross-lingual adaptation, affecting not only our method but also the standard approach from Artetxe et al. (2020).
One hypothesis is that this is due to the high linguistic distance between these languages and English. In Table 1, these are the languages that are the most syntactically distant from English according to lang2vec (Littell et al., 2017), and the only ones with a pure SOV word order. This is also consistent with German, Spanish and French, the 3 languages that are closest to English, generally obtaining the fastest adaptation times. In the future, we would like to explore starting with a multilingual model covering a few diverse languages akin to Pfeiffer et al. (2022), which could facilitate adapting to languages that are distant from English but might share features with some of the other languages.
Another potential factor is that Hindi, Turkish and Urdu, along with Swahili, have the smallest training corpora. However, despite having the smallest training corpus with only 1.7GB (∼1/3 the size of Urdu's and ∼1/12 of Hindi's and Turkish's), Swahili exceeds the aforementioned three in both adaptation speed and raw performance on XNLI. Exploring the impact of corpus size was outside the scope of this work, but we believe that this is an interesting question to address in future work.

Related Work
Multilinguality in NLP. One way to create an LM for a particular language is to collect enough data and train from scratch (e.g. Martin et al., 2020; de Vries et al., 2019; Chan et al., 2020). For the majority of languages, however, not enough data exists to train a high-quality model from scratch. Alternatively, one may pretrain a multilingual model on unlabeled data from many languages, which can then be finetuned on labeled data for zero-shot cross-lingual transfer (e.g. Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020). Multilingual LMs are not without challenges: they are large and expensive to train, they suffer from the curse of multilinguality, low-resource language performance can lag due to underrepresentation in the training corpus, and they cannot benefit from language-specific tokenization (Conneau and Lample, 2019; Wu and Dredze, 2020; Rust et al., 2021; see Doddapaneni et al., 2021, for a survey). Furthermore, not all languages are created alike in multilingual models: Muller et al. (2021) find that some "easy" languages perform well in mBERT out-of-the-box and others are successful after finetuning with monolingual data, while some "hard" languages perform poorly in mBERT even after tuning. Alternatively, one may adapt a pretrained model by finetuning, adding language- or domain-specific adapters (e.g. Rebuffi et al., 2017; Houlsby et al., 2019; Pfeiffer et al., 2022), retraining the lexical embedding layer (Tran, 2020; Artetxe et al., 2020; de Vries and Nissim, 2021), or translating the train, finetuning, or test set (e.g. Wang et al., 2022).

Efficient Adaptation of Language Models.
Adapters are a parameter-efficient way to extend LMs by training a small number of parameters that can be swapped in for on-the-fly adaptation at test time, as opposed to needing to store full separate models per task or language. Pfeiffer et al. (2020) train small stackable language- and task-specific adapters with respect to a frozen transformer body that is shared between all languages and tasks, allowing simple and quick cross-lingual transfer at test time. Bapna and Firat (2019) inject adapter layers into a neural machine translation (NMT) model for domain adaptation to obviate the need for full-model finetuning, and use language-specific adapters for high-resource languages to recover from catastrophic forgetting during multilingual NMT training. Alabi et al. (2022) argue that their finetuned mBERT for 17 African languages is parameter-efficient because they maintain high performance with a single model rather than requiring separate models per language. Like Abdaoui et al. (2020), they reduce model size by removing vocabulary tokens not needed for the target languages. LoRA adds small trainable matrices corresponding to low-rank decompositions of weight updates within transformer attention, allowing rapid updates during finetuning (Hu et al., 2022). Prefix-tuning methods are also parameter-efficient (Li and Liang, 2021; Liu et al., 2021).
Compute-efficient methods aim to reduce the computation (FLOPs or wall-time) required to train a model. Several authors developed vocabulary adaptation methods which reduce the need to extensively finetune a model or train from scratch (e.g. Chronopoulou et al., 2020; Sachidananda et al., 2021). Though Wang et al. (2020) continued-train mBERT with an extended vocabulary for a new language, convergence is faster than with a bilingual BERT model trained from scratch. Kocmi and Bojar (2020)'s vocabulary adaptation method improves the time-to-convergence of an NMT system adapted to a new language. While de Vries and Nissim (2021) learn a new lexical embedding layer on top of GPT-2, which is computationally expensive, they employ engineering strategies to decrease training time, such as 16-bit mixed precision training, a reduced window size, and a maximum batch size with gradient accumulation. Though they must backpropagate through the entire model during embedding layer relearning, training stabilizes quickly. They adapt larger models by initializing the embedding layer using transformations of embeddings developed on smaller models, noting that the better initialization speeds training.
Variance across languages. Prior work observes similar variation between languages in LM adaptation. When adapting BERT, Tran (2020) sees that Hindi showed the slowest growth and the lowest final XNLI score of six assessed languages, acknowledging word-order differences. Several authors see performance lags on NLP benchmarks for SOV languages when probing large multilingual models (see Doddapaneni et al., 2021, for a review). Pires et al. (2019) find that zero-shot part-of-speech tagging is best when the model has been finetuned on a language that shares word order with the target language. Limisiewicz et al. (2020) attribute the disparity to underrepresentation of SOV languages in the training corpus.

Conclusion and Future Work
Our work shows that it is possible to extend pretrained models to new languages using only a fraction of their parameters. We achieve this by learning a new embedding layer over a shallow mini-model aligned with the primary model. We explore two approaches to learn mini-models: MINIJOINT augments a transformer with a second MLM head during pretraining, adapting with an average 2.3x speedup over the standard method from Artetxe et al. (2020), and MINIPOST builds a mini-model by extracting a small number of layers from a pretrained model, providing an average 1.6x speedup.
Our analysis reveals that shallower mini-models converge faster but plateau at lower performance. As such, one might explore combining multiple mini-models of different depths, using the shallowest at the beginning of cross-lingual adaptation, and then deeper ones as training progresses. One could add multiple MLM heads to a MINIJOINT model and train all of them simultaneously to facilitate this.
We would also like to explore applications of mini-model adaptation beyond the multilingual scenario. In particular, by adapting rapidly on models significantly smaller than the base model used for inference, MINIJOINT/MINIPOST might be used to finetune large LMs on modest hardware. This could allow for a new paradigm whereby one shares a small model for adaptation while keeping a large aligned model private behind an API. Clients could then learn parameters for their task on the small model, which are later plugged into the large model for better performance. Shortly after us, Xiao et al. (2023) proposed Offsite-Tuning, an adaptation method similar to ours but motivated by privacy.

Limitations
Our study is limited to the adaptation of MLMs to new languages. While we believe that our proposed approach could also be applied more broadly (e.g., autoregressive models instead of MLMs, or adapting to new downstream tasks instead of new languages), further experiments are necessary to empirically verify this. In addition, we observe a considerable variance across languages (§5.4), the reasons for which are not entirely clear. Ideally, we would have a broader set of languages to better study this, as our language set is limited and skewed towards the Indo-European family. Finally, we average results over 5 finetuning runs, but computational restrictions prevented us from also averaging over multiple pretraining runs. As discussed in §A.5, we observed a non-negligible variance over pretraining runs in a preliminary experiment, but a more systematic exploration is necessary to better understand its impact.

A Appendix
A.1 Floating Point Operations (FLOPs)

We estimate total FLOPs for training using the formula from Narayanan et al. (2021), amended for RoBERTa without activation recomputation. Like the authors, we omit calculations over biases, activation functions, softmax, and other minor costs. Assuming hidden size h, vocabulary size V, number of layers l, token mask probability p, sequence length s, batch size B, and total training updates U, the total FLOPs during training are:

FLOP_pretrain = 3BU [ l(24sh² + 4hs²) + p(2sh² + 2shV) ]    (A1)

Derivation. Recall that multiplying A ∈ R^{m×n} by B ∈ R^{n×p} requires 2mnp FLOPs. Each transformer layer consists of a multi-head self-attention block and a linear projection. The attention block has four weight matrices W_q, W_k, W_v, W_o ∈ R^{h×h}. (We demonstrate the calculation over one head, as using more heads results in the same FLOPs calculation.) The input x ∈ R^{s×h} is projected with W_q, W_k and W_v, requiring 2sh² FLOPs each:

Q = xW_q,  K = xW_k,  V = xW_v

Self-attention followed by the output projection is:

Attention(x) = softmax(QKᵀ/√h) V W_o

Multiplying QKᵀ and multiplying the result by V both require 2hs² FLOPs. Multiplying with W_o costs 2sh² FLOPs. In sum, there are 8sh² + 4hs² FLOPs to compute the forward pass of the attention block. The output of the attention block (x ∈ R^{s×h}) is then passed through two linear layers, F_0 ∈ R^{h×4h} and F_1 ∈ R^{4h×h}. These multiplications cost 8sh² FLOPs each, so the total FLOPs per layer is:

FLOP_layer = 24sh² + 4hs²

The output x ∈ R^{s×h} passes through the MLM head: a dense layer of size R^{h×h} for 2sh² FLOPs, and an output projection of size R^{h×V} that costs 2shV FLOPs. Only masked tokens are passed through the MLM head, so the total FLOPs in the LM head is:

FLOP_lm = p(2sh² + 2shV)
In sum, the total estimated FLOPs for a forward pass of RoBERTa with a batch size of 1 is:

FLOP_fwd = l(24sh² + 4hs²) + p(2sh² + 2shV)    (A2)

To account for the backward pass, one typically triples the forward pass FLOPs. This is because (1) to backpropagate the error, one calculates the partial derivatives of the loss with respect to the input activations, ∂δ/∂a, and (2) to make a weight update, one first must calculate the partial derivatives with respect to the weights, ∂δ/∂w. Calculating each partial derivative requires the same number of FLOPs as the forward pass, meaning that the backward pass is doubly as expensive. Tripling Equation A2 to account for the backward pass, multiplying by batch size and total updates, and reducing gives Equation A1 for full pretraining.
Adaptation requires an amended equation for the backward pass because layers are frozen (Step 2: L trg embedding training). The trainable embeddings are tied to the output projection layer in the MLM head: thus, the trainable input embeddings are passed through frozen layers, and the output passes through an MLM head consisting of a frozen dense layer and a trainable output projection. To backpropagate the error to the embeddings, we must (1) calculate ∂δ/∂a for the entire model, requiring the same number of FLOPs as the forward pass, and (2) because the MLM head's output projection layer is also trainable, calculate ∂δ/∂w there on the backward pass, costing an additional 2pshV FLOPs. In total, this gives the below equation for Step 2, after multiplying by batch size and total updates:

FLOP_adapt = BU [ 2( l(24sh² + 4hs²) + p(2sh² + 2shV) ) + 2pshV ]

Thus, adaptation with 4 layers requires ∼21.1 EFLOPs versus ∼29.3 EFLOPs during pretraining. For 12 layers, adaptation requires ∼54.1 EFLOPs versus ∼78.8 in pretraining.
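These formulas can be checked numerically with the hyperparameters used in this work (h = 768, V = 50,000, s = 512, p = 0.15, B = 2048, U = 125,000); the following sketch reproduces the EFLOP figures quoted above:

```python
def layer_flops(s, h):
    # attention block (8sh^2 + 4hs^2) plus the two FFN projections (16sh^2)
    return 24 * s * h**2 + 4 * h * s**2

def head_flops(s, h, V, p):
    # only the masked fraction p of tokens goes through the MLM head
    return p * (2 * s * h**2 + 2 * s * h * V)

def forward_flops(l, s, h, V, p):
    return l * layer_flops(s, h) + head_flops(s, h, V, p)

def pretrain_eflops(l, s=512, h=768, V=50_000, p=0.15, B=2048, U=125_000):
    # forward pass plus a backward pass that is twice as expensive (Eq. A1)
    return 3 * B * U * forward_flops(l, s, h, V, p) / 1e18

def adapt_eflops(l, s=512, h=768, V=50_000, p=0.15, B=2048, U=125_000):
    # forward pass, dL/da through the whole model, and dL/dw only for
    # the tied output projection (Step 2 equation)
    return B * U * (2 * forward_flops(l, s, h, V, p) + 2 * p * s * h * V) / 1e18

print(round(pretrain_eflops(12), 1))  # ~78.8
print(round(adapt_eflops(12), 1))     # ~54.1
print(round(adapt_eflops(4), 1))      # ~21.1
print(round(pretrain_eflops(4), 1))   # ~29.3
```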

MINIPOST FLOPs in Step 1b
Step 1b of MINIPOST builds the small mini-model with the embeddings and the first l_f layers frozen. These frozen layers do not require the backward pass. Furthermore, the frozen LM head does not require calculating ∂δ/∂w, only ∂δ/∂a. Each of the trainable layers requires both ∂δ/∂a and ∂δ/∂w, except the first trainable layer, which only needs ∂δ/∂w (because it does not pass back the error). Given l_t trainable layers, the total cost for creating the mini-model in MINIPOST is:

FLOP_step1b = BU [ (l_f + l_t)(24sh² + 4hs²) + p(2sh² + 2shV) + (2l_t − 1)(24sh² + 4hs²) + p(2sh² + 2shV) ]

Concretely, the cost of training a 6-layer mini-model in this work is ∼21.6 EFLOPs. In comparison, pretraining the vanilla 12-layer RoBERTa base model requires ∼78.8 EFLOPs.
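This cost can also be checked numerically; a sketch under our reading of the derivation above, using the same hyperparameters as in Step 2 (l_f = 4 frozen layers, l_t = 2 trainable bridge layers):

```python
def step1b_eflops(l_f, l_t, s=512, h=768, V=50_000, p=0.15, B=2048, U=125_000):
    """Estimated cost of building a MINIPOST mini-model with l_f frozen
    bottom layers and l_t new trainable layers on top."""
    layer = 24 * s * h**2 + 4 * h * s**2
    head = p * (2 * s * h**2 + 2 * s * h * V)
    forward = (l_f + l_t) * layer + head
    # backward: dL/da + dL/dw for each trainable layer except the first
    # (dL/dw only), plus dL/da through the frozen MLM head
    backward = (2 * l_t - 1) * layer + head
    return B * U * (forward + backward) / 1e18

print(round(step1b_eflops(l_f=4, l_t=2), 1))  # ~21.6
```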

A.2 XQuAD
The Cross-lingual Question Answering Dataset (XQuAD; Artetxe et al., 2020) covers a more extensive set of languages than MLQA. We evaluate the same models tuned for QA in the main body of the paper on XQuAD. Final F1 and V100 days to achieve near-maximal performance are in Tables A1 and A2. We also show the growth curve for F1 through the first V100-week in Figure A1.

Table A1: Estimated V100 training days to achieve near-maximal performance (see §4.2) on XQuAD. ∞: never hit target performance. BL_SMALL never achieves near-maximal performance. *: excludes Turkish, which never hit near-maximal performance.

XQuAD      ar   de   el   es   hi   ru   th   tr   vi   zh   avg
BL_BASE    2.3  1.1  1.7  0.8  3.3  1.9  2.1  2.2  1.2  1.5  1.8
MINIPOST   1.4  0.8  1.0  0.4  1.3  1.3  1.4  1.2  0.8  1.0  1.1
MINIJOINT  0.8  0.4  0.6  0.5  3.1  0.6  0.6  ∞    0.5  0.6  0.9*

A.3 Mini-Model Depth: MLQA, PAWS-X, and XQuAD

We extend the results of §5.2 to MLQA, PAWS-X, and XQuAD, shown in Figure A2. Figure A3 shows training curves for the particularly challenging language of Turkish on XNLI and XQuAD. Table A3 shows performance at training completion.

Figure 1 :
Figure 1: Average speedup of mini-model adaptation over Artetxe et al. (2020). A speedup of x means that our approach needs x times less compute to achieve the same performance. See §4.2 for more details.

Figure 2 :
Figure 2: Standard and mini-model adaptation. Trainable parameters are green, frozen parameters are gray. L src embeddings are small rectangles, L trg embeddings are triangles. All approaches use a four-step process for cross-lingual transfer: (1) pretrain an MLM in L src, (2) learn a new embedding layer in L trg via MLM with the transformer body frozen, (3) finetune the model in L src with embeddings frozen, (4) zero-shot transfer to L trg by swapping the embeddings. Standard adaptation (top) uses the same transformer body for all steps, while our approach learns two aligned models in Step 1 (the primary model and a shallower mini-model) and uses the mini-model to learn L trg embeddings efficiently in Step 2. We explore two approaches to learn mini-models: MINIJOINT (center) jointly pretrains the primary model and mini-model using a secondary MLM head attached at a middle layer; MINIPOST (bottom) starts from an existing model and builds a mini-model in Step 1b by extracting/freezing a few layers and learning a small number of parameters on top.

Figure 3 :
Figure 3: Training curves through the first GPU-week. We report XNLI and PAWS-X accuracy and MLQA F1.

Figure 4 :
Figure 4: XNLI training curve for MINIJOINT with the secondary head attached at varying layers. Results are averaged over Arabic, German and Turkish. Final performance is in Table A3.

Table 2 :
Performance at training completion. Both variants of our approach nearly match the performance of BL_BASE at a substantially lower cost, while BL_SMALL significantly lags behind. days: V100 GPU days.
To understand the effect of the joint pretraining on the monolingual quality of the model, we compare the full MINIJOINT model and its corresponding mini-model with BL_BASE and BL_SMALL. As shown in Table 5, we find that dual-head training does not damage performance.

Table A3 :
Performance at the end of training for MINIJOINT. Results correspond to Figures 4, A2, and A3.

Table A2 :
XQuAD performance at training completion. Both variants of our approach nearly match the performance of BL_BASE at a substantially lower cost, while BL_SMALL significantly lags behind.