One For All & All For One: Bypassing Hyperparameter Tuning with Model Averaging For Cross-Lingual Transfer

Multilingual language models enable zero-shot cross-lingual transfer (ZS-XLT): fine-tuned on sizable source-language task data, they perform the task in target languages without labeled instances. The effectiveness of ZS-XLT hinges on the linguistic proximity between languages and the amount of pretraining data for a language. Because of this, model selection based on source-language validation is unreliable: it picks model snapshots with suboptimal target-language performance. As a remedy, some work optimizes ZS-XLT by extensively tuning hyperparameters: follow-up work then routinely struggles to replicate the original results. Other work searches over narrower hyperparameter grids, reporting substantially lower performance. In this work, we therefore propose an unsupervised evaluation protocol for ZS-XLT that decouples performance maximization from hyperparameter tuning. As a robust and more transparent alternative to extensive hyperparameter tuning, we propose to accumulatively average snapshots from different runs into a single model. We run broad ZS-XLT experiments on both higher-level semantic tasks (NLI, extractive QA) and a lower-level token classification task (NER) and find that conventional model selection based on source-language validation quickly plateaus to suboptimal ZS-XLT performance. On the other hand, our accumulative run-by-run averaging of models trained with different hyperparameters boosts ZS-XLT performance and closely correlates with "oracle" ZS-XLT, i.e., model selection based on target-language validation data.


Introduction and Motivation
Massively multilingual transformers (MMTs) like XLM-{R,V} (Conneau et al., 2020; Liang et al., 2023) or mT5 (Xue et al., 2021) are pretrained via language modeling on vast corpora encompassing 100+ languages. An MMT fine-tuned on labeled task data in a source language can transfer cross-lingually zero-shot, i.e., without further annotations, to target languages (Hu et al., 2020; Lauscher et al., 2020). However, pretraining corpora size and the linguistic distance between the source and target language dictate the quality of XLT (Lauscher et al., 2020). This is why model selection based on source-language validation data correlates unreliably with ZS-XLT and selects checkpoints that yield suboptimal target-language performance (Keung et al., 2020). Worse yet, there is no "best practice" for replicating ZS-XLT results of prior work. Some works, as our results suggest (cf. §4), (1) exhaust extraordinarily large hyperparameter grids and (2) monitor target-language performance for the best transfer (i.e., violating "true" ZS-XLT) to outperform baselines (Conneau et al., 2020; Wei et al., 2021). Other works rerun baselines with little to no hyperparameter tuning (Hu et al., 2020; Wu and Dredze, 2020): the re-evaluation then often trails original results by non-negligible margins.1 As a remedy, Keung et al. (2020) propose to evaluate ZS-XLT on the snapshot that generalizes best to validation data in the target language ("oracle" ZS-XLT): as such, oracle ZS-XLT stabilizes evaluation and denotes ideal transfer performance. Nonetheless, oracle ZS-XLT overstates the performance of true ZS-XLT, for which no target-language instances are available (Schmidt et al., 2023). If they are, target-language annotations are always better leveraged for training than for validation (Schmidt et al., 2022).
This calls for an evaluation protocol that (1) maximizes "true" ZS-XLT results and (2) makes them easily reproducible, regardless of the extent of hyperparameter tuning. In this work, we find that model averaging fulfills both criteria. Weight averaging has proven effective in, e.g., MT (Vaswani et al., 2017) and recently NLU (Wang et al., 2022; Schmidt et al., 2023). Schmidt et al. (2023) enable model averaging for sizable gains in XLT. They first fine-tune an MMT on labeled source-language data and then re-train models (i.e., more runs) by copying and freezing the task head of the initially fine-tuned model: this aligns snapshots and enables weight averaging across runs.2

Contributions. In this work, we propose an evaluation protocol that decouples maximizing ZS-XLT performance from hyperparameter tuning. The key idea is to accumulatively average snapshots of runs with different hyperparameters: this improves performance over model selection based on source-language validation performance. We run exhaustive experiments on higher-level (NLI, extractive QA) and lower-level (NER) NLU tasks on a broad grid of hyperparameters and show, examining the cross-section of all runs, that model selection based on source-language validation almost exclusively picks snapshots suboptimal for ZS-XLT. We also confirm that conventional hyperparameter tuning on source-language validation prematurely settles for models that maximize source-language performance at the expense of ZS-XLT. Crucially, we show that accumulative model averaging performs on par with or better than the best snapshot picked by source-language validation already from the second (i.e., first averaged-in) run and then consistently improves ZS-XLT with more runs. We additionally show that this accumulative model averaging closely correlates with oracle ZS-XLT without requiring any source- or target-language validation data to maximize transfer performance.
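As a minimal illustration of the cross-run alignment via a copied, frozen task head described above (a sketch of our reading of that step, not the authors' released code; model paths, the number of labels, and the seed values are placeholder assumptions):

```python
# Sketch: start further runs from the pretrained MMT but reuse the frozen task
# head of an initially fine-tuned model, so that snapshots from different runs
# stay compatible for weight averaging (paths and num_labels are illustrative).
import copy
from transformers import AutoModelForSequenceClassification

# Initial run: an MMT already fine-tuned on source-language (English) task data.
initial = AutoModelForSequenceClassification.from_pretrained("path/to/initial-run")

for seed in (0, 1, 2):  # subsequent runs with different seeds / hyperparameters
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-large", num_labels=3)
    model.classifier = copy.deepcopy(initial.classifier)  # copy the task head
    for param in model.classifier.parameters():
        param.requires_grad = False                        # freeze the task head
    # ... fine-tune `model` as usual; its snapshots can now be averaged with
    #     snapshots from other runs that share the same frozen head.
```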

Accumulative Run Averaging
Prior work conducts model selection for ZS-XLT by extensive hyperparameter tuning using either source- or target-language validation data. Whereas the latter violates true ZS-XLT (Schmidt et al., 2022), the former overfits to source-language performance (Keung et al., 2020). The recent success of snapshot averaging in XLT (Schmidt et al., 2023) motivates our research question: can (accumulative) averaging of models trained during hyperparameter search outperform, with fewer overall training runs, the ZS-XLT performance of the "optimal" model selected based on source-language validation performance?
We benchmark model selection based on source-language validation against accumulative model averaging as follows. We iteratively sample models (i.e., runs) {θ_1, ..., θ_i} from all runs of the hyperparameter search and compare, at each step i, (a) the sampled run with the best source-language validation performance against (b) the accumulative average of all i runs sampled so far, i.e., a single model with weights (1/i) Σ_{j=1..i} θ_j.
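The following sketch makes this protocol concrete (a simplified illustration; `iter_runs`, `eval_src_dev`, and `eval_zs_xlt` are hypothetical helpers, not part of our codebase):

```python
# Sketch: run-by-run comparison of (a) best-run selection on source-language
# validation vs. (b) accumulative averaging of all runs sampled so far.
import torch

def average_state_dicts(state_dicts):
    """Uniform average of the parameters of equally-shaped model state dicts.
    (Integer buffers would need special handling in practice.)"""
    return {key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
            for key in state_dicts[0]}

sampled = []                                   # runs sampled so far
for state_dict in iter_runs():                 # hypothetical iterator over runs
    sampled.append(state_dict)

    # (a) Conventional selection: keep the run with the best source-language dev score.
    best_src_dev = max(sampled, key=eval_src_dev)        # hypothetical scorer

    # (b) Accumulative averaging: a single model averaged over all sampled runs.
    accumulated = average_state_dicts(sampled)

    # Both candidates are then evaluated zero-shot on the target languages.
    print(eval_zs_xlt(best_src_dev), eval_zs_xlt(accumulated))
```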

Experimental Setup
Tasks and Languages. We select for our evaluation two higher-level semantic tasks (NLI and extractive QA) and one lower-level structured prediction task (NER). For each task, we fine-tune the MMT on the provided English training splits.3

Natural Language Inference (NLI). We evaluate NLI on XNLI (Conneau et al., 2018) and IndicXNLI (Aggarwal et al., 2022), which together cover 25 typologically diverse languages.
Extractive QA (TyDiQA-GoldP). TyDiQA-GoldP comprises questions that are answered by a span of text in the provided gold passage and covers 9 diverse languages (Clark et al., 2020).
Training Details.
We train XLM-R large (Conneau et al., 2020) for 10 epochs with AdamW (Loshchilov and Hutter, 2019), weight decay of 0.01, gradient norm clipping to 1.0, and an LR schedule of 10% linear warm-up followed by linear decay.4 We save 10 snapshots per model, one at every 10% of total training steps. The maximum sequence length is 128 tokens for NLI and NER and 384 with a stride of 128 for TyDiQA-GoldP.
Hyperparameter Grids. We simulate conventional hyperparameter grid search over a broad set of 21 configurations, pairing seven learning rates l ∈ {0.1, 0.5, 1, 1.5, 2, 2.5, 3}e-5 with three batch sizes b ∈ {16, 32, 64}. The grid is deliberately kept wide and the same for all tasks so as not to reflect any prior knowledge of task-specific "good values".5 We retrain the MMT for each pair (l, b) three times with different random seeds to account for variance across individual runs.
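For concreteness, the grid could be enumerated as follows (a sketch; the seed values are illustrative):

```python
# Sketch: 7 learning rates x 3 batch sizes = 21 configurations, each run with
# 3 random seeds, i.e., 63 fine-tuning runs per task.
from itertools import product

learning_rates = [0.1e-5, 0.5e-5, 1e-5, 1.5e-5, 2e-5, 2.5e-5, 3e-5]
batch_sizes = [16, 32, 64]
seeds = [0, 1, 2]                 # illustrative seed values

grid = list(product(learning_rates, batch_sizes, seeds))
assert len(grid) == 63

for lr, batch_size, seed in grid:
    # One fine-tuning run per (lr, batch_size, seed); 10 snapshots are saved,
    # one at every 10% of the total training steps (cf. Training Details).
    ...
```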
Model Variants. We evaluate four model variants: v ∈ {LAST, SRC-DEV, CA, TRG-DEV}. LAST is simply the final snapshot of a training run. SRC-DEV is the snapshot that maximizes source-language validation performance (Hu et al., 2020). CA averages all snapshots of a run into a single model and, according to Schmidt et al. (2023), outperforms LAST and SRC-DEV. TRG-DEV breaches "true" ZS-XLT and picks the snapshot that performs best on the target-language validation data (Keung et al., 2020): as such, it generally represents an upper bound of single-run ZS-XLT performance.6
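A minimal sketch of how these variants could be derived from one run's saved snapshots (shown for a single target language; the score lists are placeholders and `average_state_dicts` is the illustrative helper from the sketch in §2):

```python
# Sketch: derive LAST, SRC-DEV, CA, and TRG-DEV from the 10 snapshots of a run.
def variants_for_run(snapshots, src_dev_scores, trg_dev_scores):
    """snapshots: list of state dicts; *_scores: matching validation scores."""
    best = lambda scores: max(range(len(snapshots)), key=lambda i: scores[i])
    return {
        "LAST": snapshots[-1],                       # final snapshot of the run
        "SRC-DEV": snapshots[best(src_dev_scores)],  # best on source-language dev
        "CA": average_state_dicts(snapshots),        # within-run snapshot average
        "TRG-DEV": snapshots[best(trg_dev_scores)],  # oracle: best on target-language dev
    }
```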

Results and Discussion
Single-Run Performance. The full ZS-XLT results by hyperparameters are presented in Appendix §A.2 (cf. Table 3). We observe that optimal ZS-XLT of single runs depends on all axes of analysis: task, hyperparameters, and model variant. While LAST and SRC-DEV generally perform well, their ZS-XLT performance fluctuates substantially across hyperparameter configurations, in line with Keung et al. (2020) and Schmidt et al. (2023). CA is a strong and robust baseline that often outperforms LAST and SRC-DEV by notable margins on TyDiQA and NER. In the context of a single run, CA performs especially well with suboptimal hyperparameters, sometimes even outperforming TRG-DEV. We also confirm that CA remedies variation in ZS-XLT both within and across hyperparameter configurations (Schmidt et al., 2023). Accumulatively averaging within-run snapshots (CA) outperforms LAST and SRC-DEV slightly on NLI and materially on NER. For NER, ZS-XLT from WikiANN to MasakhaNER (2.0) also represents a domain transfer (from Wikipedia to news), in which CA yields tremendous gains. In-domain (i.e., testing on WikiANN), CA generally performs on par with LAST and SRC-DEV. The same is not true for QA, where CA performs slightly worse: we ascribe this to the averaging of "unconverged" snapshots, owing to the small TyDiQA training set (merely 3,696 instances), especially from runs with smaller learning rates and larger batches (cf. Table 3).
Further Analyses. Table 2 extends the run-by-run analysis to TRG-DEV and "model soups" (SOUP) to illustrate why accumulative model averaging outperforms model selection based on source-language validation. Rather than selecting a single snapshot, SOUP averages the five snapshots (among all available runs) with the best source-language validation performance (Wortsman et al., 2022).
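A sketch of this SOUP baseline (our illustration of the selection rule above; the pooled score list is a placeholder and `average_state_dicts` is the illustrative helper from the sketch in §2):

```python
# Sketch: average the k snapshots, pooled over all runs, that have the best
# source-language validation scores (top-k "model soup").
def make_soup(all_snapshots, src_dev_scores, k=5):
    """all_snapshots: state dicts pooled over runs; src_dev_scores: matching scores."""
    ranked = sorted(range(len(all_snapshots)),
                    key=lambda i: src_dev_scores[i], reverse=True)
    return average_state_dicts([all_snapshots[i] for i in ranked[:k]])
```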
Compared to (oracle) TRG-DEV, accumulatively averaging runs performs on par on NLI, slightly better on TyDiQA, and somewhat worse on NER. TRG-DEV selects language-specific snapshots, thereby tailoring ZS-XLT to each target language and remedying the varying performance of Max. SRC-DEV in ZS-XLT to many target languages. Such variation has been shown to be particularly pronounced in ZS-XLT on token-level tasks like NER or POS (Schmidt et al., 2023). On TyDiQA, we believe that accumulative averaging (slightly) better stabilizes the transfer from a small training set (3.7K instances). SOUPs, however, perform notably worse than both TRG-DEV and accumulative averaging on NLI and NER. SOUPs lack the beneficial diversity of different runs, as the best snapshots often come from the same "good" run; extending SOUP to average the top-10 best snapshots does not improve performance. Anecdotal evidence further exemplifies why source-language validation is inapt for ZS-XLT. One of the 63 SRC-DEV models replicates the XNLI results of Conneau et al. (2020), vastly exceeding all other runs (ca. ∆+1.0). This "miraculous" run, though, merely ranks 3rd according to source-language validation performance.

Table 2: "Max. TRG-DEV" selects the run that maximizes target-language validation performance, where T is the set of target languages. SOUP averages the five checkpoints (from all available runs) that maximize SRC-DEV ("Max. SRC-DEV"). For other details, see Table 1.

The above suggests that even more sophisticated hyperparameter tuning strategies (e.g., Bayesian optimization) are unlikely to improve ZS-XLT without target-language validation. On the other hand, accumulative averaging improves ZS-XLT threefold: (1) unlike model selection, it does not plateau on suboptimal single runs that maximize source-language performance; (2) TRG-DEV showcases that accumulative averaging ingests further runs whose snapshots perform well in ZS-XLT; (3) model averaging irons out the idiosyncratic noise of individual runs, leading to better performance. This renders accumulative averaging a robust (i.e., replicable results) and fair (i.e., true zero-shot) evaluation protocol for ZS-XLT.

Conclusion
Inconsistent hyperparameter tuning and model selection protocols hamper the replication of previous ZS-XLT results. In this focused study, we devise a ZS-XLT evaluation protocol that addresses previous shortcomings and feeds two birds with one scone. We show that accumulatively averaging snapshots, rather than selecting models based on source-language validation performance, both improves and stabilizes ZS-XLT. Conventional model selection strategies prematurely settle for models that maximize source-language validation performance and discard runs that generalize better in ZS-XLT. Accumulative model averaging both incorporates snapshots that transfer well and irons out models that perform badly. We find that model averaging correlates closely with "oracle" ZS-XLT, which assumes model selection on target-language validation instances. We hope future work adopts model averaging to promote fair and reproducible ZS-XLT that puts models on equal footing.

Limitations
Additional factors must be taken into consideration, even though we aspire to evaluate ZS-XLT at all levels of transparency (i.e., variants and strategies) across a varied set of downstream tasks and on broad hyperparameter grids. Neither model selection on source-language validation data nor accumulative averaging may benefit ZS-XLT on certain tasks: Schmidt et al. (2023), e.g., do not find that any variant other than TRG-DEV yields gains over LAST on part-of-speech tagging. The underlying cause remains unclear. For instance, the gains in ZS-XLT stemming from model selection or accumulative averaging likely depend on the type of distributional shift between the source-language training data and the target-language instances to transfer to (cf. §4; e.g., the dynamics of variants in ZS-XLT for NER).
Accumulative averaging nevertheless remains a robust evaluation protocol, as ZS-XLT performance is not expected to deteriorate vis-à-vis other "fair" strategies (e.g., Max. SRC-DEV). In addition, there may exist a subset of learning rate and batch size pairs that jointly maximize source- and target-language performance. However, as our results suggest (§4), runs with such hyperparameters are likely indistinguishable, on source-language validation alone, from runs that perform just as well on the source language but transfer worse.

Table 3: … ZS-XLT and reduces performance variance vis-à-vis Max. SRC-DEV counterparts.

Table 4: Validation performance by task, model variant, and hyperparameters (cf. §3). LAST, SRC-DEV, and CA validate on source-language validation splits; TRG-DEV denotes performance averaged over the individual snapshots of a run that perform best on the target-language validation set. For each column, the best validation performance is in bold. Metrics: accuracy for NLI, span-F1 for TyDiQA, and token-level F1 for NER. Subscripts denote standard deviation.