Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking



Introduction
In the field of Natural Language Processing (NLP), models for learning unsupervised representations from unlabeled text based on Transformer architectures (Vaswani et al., 2017) are the state-of-the-art on a variety of tasks (Kalyan et al., 2021).
Transformer-based language models (TLMs) like BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and their lineage of advanced models (Amatriain, 2023) rely on the combination of an unsupervised pre-training of the model and a subsequent task-specific fine-tuning procedure.
TLMs are pre-trained over large unlabeled text data using self-supervision, to learn the relationships between different sentences or words of the input. Once the TLM is pre-trained over large volumes of data, it can be used in various downstream tasks, by fine-tuning task-specific model layers.
With pre-training, TLMs learn language representations that are useful across downstream tasks, minimizing the need to retrain the entire model from scratch for each task. Extensive pre-training can lead to downstream performance improvements, i.e., it is worth learning complex TLMs on huge natural language corpora before fine-tuning them for particular tasks.
Though conceptually simple and empirically powerful, pre-training is challenging and expensive. Beyond the significant resources needed to pre-train the original BERT model of Devlin et al. (2018), the improvements of RoBERTa (Liu et al., 2019) relied on orders of magnitude more computational resources (Kaplan et al., 2020).
The relationship between TLM architecture, training corpus, pre-training hyperparameters, and evaluation metrics is complex and obscure. Therefore, previously overlooked pre-training design choices, e.g., pre-training hyperparameter selection, can result in significant performance differences.
With this work, we aim to improve the pre-training procedure of TLMs by sequentially selecting hyperparameters that result in more efficient and superior pre-training performance.
We hypothesize that an interactive selection of pre-training hyperparameters can accelerate and improve pre-training, i.e., that we can achieve a better metric value in fewer epochs. It is critical not only to achieve superior performance, but to reduce the computational cost, steering clear of time- and resource-expensive procedures. Increased efficiency in TLM pre-training is paramount amidst concerns pertaining to the carbon footprint of large language models (Patterson et al., 2021); and specifically, the significant impact of hyperparameter selection on resource utilization and power consumption (Puvis de Chavannes et al., 2021).
Our TLM pre-training use case is random dynamic masking of Masked Language Models (MLMs), in contrast to rule- or task-based MLM dynamic masking solutions proposed in the literature (Joshi et al., 2020; Sun et al., 2020). Even though Liu et al. (2019) showed the benefits of random dynamic masking, the search for optimal masking hyperparameters is often carried out with heuristic techniques and grid-based search.
In machine learning (ML), hyperparameter selection is commonly addressed as a black-box optimization problem, which can be solved using evolutionary algorithms (Yu and Gen, 2010), entropy search methods (Hennig and Schuler, 2012; Hernández-Lobato et al., 2014), and Bayesian optimization (BO) (Frazier, 2018). In particular, BO can tackle the problem of optimizing an unknown objective function with possibly noisy evaluations (Snoek et al., 2012), and of speeding up resource allocation to promising hyperparameter configurations (Li et al., 2018). Aligned with the recent successes of Turner et al. (2021) in hyperparameter selection via BO, we propose a BO framework for sequential tuning of MLM pre-training hyperparameters. Our framework is different from BO techniques that speed up hyperparameter set evaluations, such as Hyperband (Li et al., 2018), a pure-exploration adaptive resource allocation algorithm for allocating resources among configurations in the non-stochastic setting.
We here cast the TLM pre-training procedure as a sequential decision process, in which at each interaction a reinforcement learning agent selects an action (e.g., pre-training hyperparameters) to maximize cumulative rewards (e.g., the pre-training metric of interest). To accommodate the black-box nature of the pre-training objective function, we fit a probabilistic surrogate model to the empirical evaluations of the pre-training metric, and propose a bandit-based technique for its sequential optimization. In the MLM dynamic masking use case, the bandit actions are the dynamic masking probabilities; and the MLM performance is the unknown function the bandit is trying to maximize, based on estimates computed in the validation set.
Contrary to dynamic masking techniques that decide which subsets of tokens to mask via combinatorial optimization and dynamic programming (Vu et al., 2020), we target online, sequential selection of masking hyperparameters for accelerated and improved pre-training. In contrast to proposals that adapt the language model's masking policy to a particular task of interest (Kang et al., 2020), we devise a generic online optimization framework that, by sequential selection of MLM design choices, provides fast and superior TLM pre-training performance, when pre-training (from scratch and continually) across diverse corpora.

The contributions of this work are:
• To present a bandit-based framework for efficient online optimization of TLM pre-training. We formulate a Gaussian Process based Thompson sampling (GP-TS) algorithm for sequential MLM loss minimization. The novelty lies in modeling TLM pre-training validation losses with a Gaussian process reward model, and in formulating a Thompson sampling policy that minimizes them.
• To showcase empirically how GP-TS pre-trains TLMs better and faster, both when pre-training from scratch and continually, across a variety of corpora; and to show that GP-TS pre-trained TLMs provide top fine-tuned performance across diverse in-domain tasks, in fewer interactions.
• To demonstrate that GP-TS's sequential selection of how many tokens of the input to mask, and how to mask them, results in improved and accelerated dynamic MLM pre-training, enabling significant resource utilization savings.
To the best of our knowledge, this work is the first to address online optimization of TLM pretraining with bandit-based BO, and to showcase its performance and resource efficiency benefits.
The manuscript is organized as follows: Section 2 provides the background on Bayesian optimization, multi-armed bandits, and TLM pre-training; Section 3 describes the proposed GP-TS method for TLM pre-training optimization, with its empirical performance evaluated in Section 4. Concluding remarks are provided in Section 5.

Background
2.1 Bayesian optimization and bandits

Bayesian optimization (BO) is a framework to address hyperparameter optimization in ML (Snoek et al., 2012; Klein et al., 2017; Turner et al., 2021), and many closely related applications (Negoescu et al., 2011; Calandra et al., 2016; Frazier and Wang, 2016; Hernández-Lobato et al., 2017; Candelieri et al., 2018). BO relies on a probabilistic surrogate model of the objective function to tackle the problem of simultaneously fitting and optimizing a high-dimensional, non-convex function with unknown smoothness, and possibly noisy evaluations (Shahriari et al., 2015; Frazier, 2018). Due to the black-box nature of BO, the surrogate model must provide a measure of uncertainty, for which generative models, Bayesian neural networks, and Gaussian processes are used (Maddox et al., 2021). Using this surrogate model, an acquisition function determines the next promising candidate to evaluate. To address the challenge of learning about the environment (i.e., exploration) while simultaneously maximizing the observed outcomes (i.e., exploitation), the multi-armed bandit provides a useful framework (Lai and Robbins, 1985).
The multi-armed bandit (MAB) is an abstraction for problems that require learning while simultaneously maximizing attained rewards, i.e., balancing the exploration-exploitation tradeoff (Lattimore and Szepesvári, 2020). A MAB is a sequential decision process that requires decision-making under uncertainty (Slivkins, 2019). At each interaction $t = 1, \dots, T$, a bandit agent chooses an action $a_t \in \mathcal{A}$ from a (not necessarily finite) set of actions $\mathcal{A}$, and observes a stochastic reward $r_t$ drawn from an unknown distribution of the selected arm $a_t$, often characterized parametrically, $r_t \sim p(\cdot \mid a_t, \theta)$.
The MAB agent's goal is to maximize (expected) cumulative rewards, $R_T = \sum_{t=1}^{T} \mu_{a_t}$, with each arm's expected reward denoted as $\mu_a = \mathbb{E}_p\{r \mid a, \theta\}$. The challenge lies in the lack of knowledge about the reward-generating mechanism, whose properties (e.g., its parameters) the agent must learn as it interacts with the environment.
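To make the exploration-exploitation tradeoff concrete, the following minimal sketch runs Thompson sampling on a toy Bernoulli bandit; the arm means and the Beta-Bernoulli reward model are illustrative assumptions, unrelated to the GP-based method proposed later.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])  # unknown to the agent (toy assumption)
alpha, beta = np.ones(3), np.ones(3)    # Beta(1, 1) prior per arm

for t in range(1000):
    theta = rng.beta(alpha, beta)            # draw one plausible mean per arm
    a = int(np.argmax(theta))                # play the arm that looks best under the draw
    r = float(rng.random() < true_means[a])  # observe a Bernoulli reward
    alpha[a] += r                            # posterior update on successes...
    beta[a] += 1.0 - r                       # ...and on failures

print("posterior means:", alpha / (alpha + beta))  # concentrates on the best arm
```

Sampling from the posterior (rather than acting on the posterior mean) is what lets the agent keep exploring uncertain arms while increasingly exploiting the best one.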
In the context of BO in general, and MABs in particular, reward uncertainty quantification is critical. Gaussian processes (Rasmussen and Williams, 2005) provide not only adequate Bayesian uncertainty estimates, but also a flexible solution for surrogate models that encode smoothness assumptions of the payoff function (Krause and Ong, 2011; Bogunovic et al., 2016; Nguyen et al., 2020). We resort to a Gaussian process reward model in the proposed bandit-based BO framework for TLM pre-training.

2.2 Language model pre-training and the Masked Language Model
Pre-training enables learning representations that generalize across tasks, i.e., it allows a language model to be better initialized for quick fine-tuning (while avoiding overfitting) to downstream tasks. TLMs learn language representations in pre-training based on one (or more) self-supervised tasks. Two popular pre-training objectives are Masked Language Model (MLM) and Next Sentence Prediction (NSP) (Devlin et al., 2018). We focus on MLM pre-training as in (Devlin et al., 2018; Liu et al., 2019), where, for an input sequence of words or tokens, a random sample of the tokens is replaced with the [MASK] token, and the goal is to predict them. For an input sequence $d$ of $N$ tokens, with special tokens delimiting them, MLMs select a random sample of the tokens $q_i$, $i \in \{1, \dots, N\}$, replace them with the mask, and learn to predict these masked tokens. For pre-training the original BERT model (Devlin et al., 2018), a random but static subset of the input sequence tokens was replaced with the mask. Liu et al. (2019) proposed a dynamic masking procedure, which generates a new masking pattern (given a fixed probability of masking) for every input sequence, and demonstrated that this dynamic approach is beneficial when pre-training for more steps or with larger datasets.
Dynamic masking relies on several hyperparameters: (i) the probability ρ of replacing an input token with the mask, (ii) the probability γ that a masked token is left unmasked, and (iii) the probability λ of replacing a token with a random token, instead of with the mask. Online optimization of these hyperparameters ψ = (ρ, γ, λ) is the use case for our experiments in Section 4.

MLM pre-training aims at minimizing the MLM loss: a function of the original ($D$) and masked ($\bar{D}$) datasets, the TLM architecture with its parameters $w \in W$, and the pre-training hyperparameters $\psi \in \Psi$. The MLM objective is the cross-entropy loss of predicting the masked tokens in the masked sequence $\bar{d} \in \bar{D}$, where we denote with $m_i \in \{0, 1\}$ whether token $q_i$, $i \in \{1, \dots, N\}$, of the original input sequence $d \in D$ has been masked in $\bar{d}$,

$$l(d, \bar{d}; w, \psi) = -\log p(d \mid \bar{d}; w, \psi) = -\sum_{i=1}^{N} m_i \log \frac{\exp\left(\chi(\bar{q}_i; w, \psi)^\top \xi(q_i)\right)}{\sum_{q^\prime} \exp\left(\chi(\bar{q}_i; w, \psi)^\top \xi(q^\prime)\right)} , \quad (2)$$

where $\chi(\bar{q}_i; w, \psi)$ denotes the TLM's representation of the masked token $\bar{q}_i$, and $\xi(q_i)$ is its original embedding. The pre-training objective is to find the TLM that minimizes the MLM loss between the original dataset $D$ and its masked version $\bar{D}$. In practice, this minimization is executed via stochastic gradient descent, run for $e = 1, \dots, E$ epochs with random mini-batches $D_e \in D$ per epoch $e$,

$$w_e = \mathop{\mathrm{argmin}}_{w \in W} \; l(D_e, \bar{D}_e; w, \psi) .$$

The analytical form of the MLM loss, a function of the selected hyperparameters ψ and the data where it is evaluated, is in general complex and unknown. However, estimates of the MLM loss are available at every pre-training epoch $e$: namely, an empirical estimate of the MLM loss can be computed in the validation set. For fair comparisons under different training setups (e.g., mini-batch sizes and hyperparameters), per-epoch averaged empirical MLM losses are computed in the validation dataset $D_{val}$,

$$\hat{l}(D_{val}; \psi) = \frac{1}{|D_{val}|} \sum_{d \in D_{val}} l(d, \bar{d}; w, \psi) , \quad (5)$$

where we drop the dependency with respect to TLM parameters $w$ and the masked validation dataset $\bar{D}_{val}$ to avoid notation clutter.
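To make the role of ψ = (ρ, γ, λ) concrete, below is a minimal sketch of per-sequence dynamic masking; the token ids, vocabulary layout, and the exact precedence of γ and λ are illustrative assumptions, not the RoBERTa implementation used in our experiments.

```python
import numpy as np

MASK_ID = 4  # assumed integer id of the [MASK] token

def dynamic_mask(tokens, rho, gamma, lam, vocab_size, rng):
    """Mask a token sequence: each token is selected with probability rho; a
    selected token is left unchanged w.p. gamma, replaced by a random token
    w.p. lam, and replaced by [MASK] otherwise."""
    tokens = np.asarray(tokens)
    masked = tokens.copy()
    selected = rng.random(tokens.shape) < rho                 # positions to predict (m_i)
    u = rng.random(tokens.shape)
    to_mask = selected & (u >= gamma + lam)                   # -> [MASK]
    to_random = selected & (u >= gamma) & (u < gamma + lam)   # -> random token
    masked[to_mask] = MASK_ID
    masked[to_random] = rng.integers(5, vocab_size, size=int(to_random.sum()))
    return masked, selected

rng = np.random.default_rng(1)
seq = rng.integers(5, 100, size=16)                           # toy token ids
masked_seq, m = dynamic_mask(seq, rho=0.15, gamma=0.1, lam=0.1,
                             vocab_size=100, rng=rng)
```

Because the pattern is redrawn for every sequence (and every epoch), the same sentence is masked differently across pre-training, which is the dynamic behavior of Liu et al. (2019).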

Proposed bandit-based framework
We cast TLM pre-training as a sequential decision process to be solved by a multi-armed bandit agent that interactively optimizes the analytically unknown pre-training loss, based on its sequentially observed empirical evaluations. We define pre-training steps, i.e., a fixed number of stochastic gradient updates $u$ in the training set, as bandit interactions $t = 1, \dots, T$. The goal is to minimize the TLM pre-training objective $l(\cdot \mid \psi)$ given tunable hyperparameters ψ, with (stochastic) evaluations of the loss function in the validation set.
Pre-training hyperparameters at interaction $t$, $\psi_t$, are the bandit's arms, i.e., $a_t = \psi_t$. For MLM pre-training with dynamic masking, at each bandit interaction, the agent selects hyperparameters ψ (the proportion of tokens to mask and their masking probabilities), pre-trains the TLM for a certain number of stochastic updates to minimize the MLM loss, and evaluates its performance in the validation subset, as per Equation (5). Due to the black-box nature of the pre-training objective, for which only stochastic evaluations are available, we formulate a surrogate reward function (leveraging empirical MLM validation loss estimates) for the bandit to maximize, as it sequentially selects which arm to play.

3.1 From MLM pre-training to Gaussian process-based regret minimization
We transform the empirical pre-training validation loss at each MAB interaction into a reward quantity for its sequential optimization by the bandit agent. Specifically, we compute bandit rewards as the normalized difference in averaged empirical MLM losses between bandit interactions, i.e.,

$$r_t(\psi_t) = \frac{\hat{l}_{t-1} - \hat{l}_t(D_{val}; \psi_t)}{\hat{l}_{t-1}} . \quad (6)$$

By normalizing reward differences per interaction, we mitigate the potential non-stationary effect sequentially selected hyperparameters might have on TLM pre-training. With rewards as (normalized) empirical MLM loss differences, we capture how much (relative) improvement each action provides.
Rewards in Equation (6) are based on stochastic draws from an analytically unknown objective function, i.e., only empirical estimates $\hat{l}_t(\cdot)$ of the MLM objective are available. To accommodate these noisy observations of the unknown loss function $l(\cdot \mid \psi)$, which we aim to optimize with respect to its hyperparameters ψ, we model the bandit reward function via a Gaussian process (GP) model $f(\cdot; \theta)$ of the pre-training objective, with observed rewards independent and identically distributed (i.i.d.) as

$$r_t = f(\psi_t; \theta) + \epsilon_t , \quad (7)$$

where $\epsilon_t$ denotes the stochastic nature of each of the observed rewards, based on the empirical estimates computed in Equation (6). Hence, we overcome the black-box nature of the pre-training objective (e.g., the MLM loss) by modeling observed rewards as realizations of a noisy surrogate GP model (Rasmussen and Williams, 2005). The mean $\mu(\cdot)$ and kernel $k(\cdot, \cdot)$ functions of a GP $f(\cdot) \sim GP(\mu(\cdot), k(\cdot, \cdot))$ determine the reward function class, i.e., the regularity and smoothness of the pre-training loss. These are parameterized prior functions $\mu(\cdot \mid \theta_\mu)$ and $k(\cdot, \cdot \mid \theta_k)$, which can be fitted to the observed data $r_{1:T}$; for instance, via Type-II maximum likelihood estimation (MLE) of the GP parameters $\theta = (\theta_\mu, \theta_k)$,

$$\hat{\theta} = \mathop{\mathrm{argmax}}_{\theta} \; \log p\left(r_{1:T} \mid f(\psi_{1:T}; \theta)\right) ,$$

where the data likelihood $p(r \mid f(\cdot; \theta))$ is a function of the observation noise probability distribution.
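A minimal sketch of Equations (6)-(7) in Python: the reward is the normalized drop in averaged validation MLM loss, and the GP surrogate is fitted to the (ψ, r) history. Using scikit-learn's GaussianProcessRegressor is an illustrative choice (its fit maximizes the log-marginal likelihood over kernel hyperparameters, i.e., Type-II MLE), not our exact implementation; the history values below are made up.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def reward(prev_loss, curr_loss):
    """Equation (6): normalized improvement in averaged validation MLM loss."""
    return (prev_loss - curr_loss) / prev_loss

# Observed history: arms psi_t (here a 1-d masking probability rho) and rewards r_t.
psi_history = np.array([[0.10], [0.15], [0.20], [0.30]])
r_history = np.array([0.02, 0.05, 0.04, -0.01])

# Equation (7): noisy GP surrogate; the WhiteKernel models the observation
# noise eps_t, and fit() performs Type-II MLE of the kernel hyperparameters.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5) + WhiteKernel(),
                              normalize_y=True)
gp.fit(psi_history, r_history)
```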

3.2 GP-Thompson sampling for TLM pre-training
Leveraging the GP reward model in Equation (7), we devise a bandit-based interactive method that executes a Thompson sampling (TS) policy (Russo et al., 2018) for TLM pre-training optimization. The proposed Gaussian process-based Thompson sampling (GP-TS), with pseudo-code provided in Algorithm 1, views the TLM pre-training objective as an unknown black-box function with inputs $a_t = \psi_t$ and outputs $r_t(\psi_t)$ as in Equation (6). GP-TS makes decisions on what bandit arm $a_t = \psi_t$ to play at each TLM pre-training interaction $t = 1, \dots, T$, informed by its GP reward model of Equation (7), to maximize its observed cumulative rewards $R_T = \sum_{t=1}^{T} r_t(\psi_t)$.

Algorithm 1 GP-TS for TLM pre-training
1: Input: TLM and pre-training corpus
2: Input: Pre-training hyperparameter space Ψ
3: Input: Number of pre-training interactions T, number of updates per interaction u
4: Input: GP prior functions $\mu(\cdot)$ and $k(\cdot, \cdot)$, with initial hyperparameters $\theta_0$
5: Initialize the GP reward model, $\hat{\theta}_1 = \theta_0$
6: for $t = 1, \dots, T$ do
7:   Draw posterior sample from the GP reward model
8:   Select arm $a_t$ based on drawn posterior sample
9:   Run TLM pre-training for u steps, with hyperparameters $\psi_t = a_t$
10:  Compute pre-trained TLM validation loss, $\hat{l}_t(D_{val}; \psi_t)$ as in Equation (5)
11:  Observe bandit reward, $r_t(\psi_t)$ as in Equation (6)
12:  Update bandit history, $H_{1:t} = H_{1:t-1} \cup \{(a_t, r_t(\psi_t))\}$
13:  Fit GP model with $H_{1:t}$, $\hat{\theta}_{t+1} = \mathop{\mathrm{argmax}}_\theta \log p(r_{1:t} \mid f(\psi_{1:t}; \theta))$
14: end for

GP-TS accommodates continuous arms $a_t = \psi_t$, with dimensionality determined by the pre-training hyperparameter space $\psi \in \Psi$. Any TLM can be used within the proposed framework, as long as the hyperparameter space $\psi \in \Psi$ is identified, and rewards as in Equation (6) are computed for a pre-training objective $l(\cdot \mid \psi)$ of interest.
GP-TS draws predictive function samples for the next TLM pre-training interaction from its GP reward model posterior, updated at every bandit interaction as indicated in Step 7 of Algorithm 1. As in other TS methods, these samples are used to determine (in Step 8 of Algorithm 1) the arms (hyperparameters $\psi_t$) to be used in the next bandit interaction. After $u$ pre-training steps, the model's MLM validation loss is computed to evaluate the observed bandit rewards $r_t(\psi_t)$ of Equation (6). After each interaction $t$, new evidence is collected in Step 12 to re-fit the GP model to the observed input (action)-output (rewards) history $H_{1:t}$; for instance, via Type-II MLE as in Step 13 of Algorithm 1, although other GP parameter optimization procedures might be used (see Appendix A for details on GP models and posterior inference).
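Putting Algorithm 1 together, a minimal sketch of the GP-TS loop over a discretized candidate grid of masking hyperparameters follows; `pretrain_and_eval` is a placeholder (here a toy loss surface) standing in for u pre-training steps plus the averaged validation MLM loss of Equation (5), and the grid, kernel, and seeding choices are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def pretrain_and_eval(psi):
    """Placeholder: run u pre-training steps with hyperparameters psi and
    return the averaged validation MLM loss (Equation (5))."""
    rho, gamma, lam = psi
    return 2.0 + (rho - 0.2) ** 2 + 0.1 * np.random.rand()  # toy loss surface

rng = np.random.default_rng(0)
# Candidate arms: a discretization of the hypercube Psi (an assumption here;
# the method itself supports continuous arms).
grid = np.stack(np.meshgrid(*(np.linspace(0.05, 0.5, 8),) * 3), -1).reshape(-1, 3)

H_psi, H_r = [], []
losses = [pretrain_and_eval(grid[0])]  # baseline loss l_0 from an initial run
for t in range(20):
    if H_psi:  # Steps 7-8: sample the GP posterior, pick the best-looking arm
        gp = GaussianProcessRegressor(Matern(nu=2.5) + WhiteKernel(),
                                      normalize_y=True).fit(H_psi, H_r)
        sample = gp.sample_y(grid, random_state=int(rng.integers(1 << 31))).ravel()
        psi_t = grid[int(np.argmax(sample))]
    else:      # first interaction: no history yet, pick an arm at random
        psi_t = grid[rng.integers(len(grid))]
    loss_t = pretrain_and_eval(psi_t)             # Steps 9-10
    r_t = (losses[-1] - loss_t) / losses[-1]      # Step 11, Equation (6)
    H_psi.append(psi_t); H_r.append(r_t); losses.append(loss_t)  # Step 12; fit in next pass
```

Re-fitting the GP on the full history at each interaction (Step 13) is cheap relative to the u pre-training gradient updates, which is why the bandit overhead is negligible in practice.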

Evaluation set-up
We probe the ability of the proposed GP-TS to, given a dataset, a TLM architecture, and a computational budget, efficiently pre-train well-performing language models. For our experiments, we incorporate RoBERTa (Liu et al., 2019), as implemented by Ott et al. (2019), in our Python implementation of GP-TS as in Algorithm 1; Appendix B.1 provides implementation and configuration details.
We compare the pre-training performance of RoBERTa models based on a grid-search over masking hyperparameters (as executed by Liu et al. (2019)) to RoBERTa models pre-trained by GP-TS. We focus our evaluation on MLM validation loss and downstream per-task accuracy metrics, and report the negligible computational overhead of pre-training with GP-TS in Appendix B.3.
We study two variants of GP-TS, depending on the masking hyperparameters it optimizes: (i) GP-TS ρ, where the bandit arm is the masking probability ρ of replacing an input token with the mask token (other hyperparameters are fixed to the default γ = 0.1 and λ = 0.1 values suggested by Liu et al. (2019)); and (ii) GP-TS ψ = (ρ, γ, λ), where GP-TS optimizes over all MLM dynamic masking hyperparameters: the bandit search space is a three-dimensional hypercube Ψ with no previous expert guidance on hyperparameter selection.
Pre-training datasets. We gather three distinct datasets, two based on publicly available corpora, and one based on private data from eBay:

• wiki-c4: We pre-process and encode the publicly available Wikitext-103 (Merity et al., 2016) and Google's c4 RealNews (Zellers et al., 2019) datasets for pre-training each TLM from scratch. This corpus is similar to those originally used by Devlin et al. (2018) and Liu et al. (2019).
• mimic: We pre-process and encode free-text clinical notes available in the public MIMIC-III Clinical database (Pollard and Johnson, 2016), which contains deidentified nursing and physician notes, ECG and imaging reports, and discharge summaries for patients who stayed in intensive care units at Beth Israel Deaconess Medical Center.
• e-commerce: We pre-process and encode a random subset of eBay marketplace inventories, which contains different product titles and descriptions provided by marketplace users, as well as category tags associated with each item and product reviews.
Each dataset contains text of very different linguistic characteristics and sizes (see summary statistics in Appendix B.4), which we leverage to investigate TLM pre-training across a variety of settings. We evaluate candidate TLMs (i) when pre-training from scratch, i.e., from a randomly initialized architecture; and (ii) with continual pre-training, i.e., when continuing to pre-train a TLM architecture previously trained on other NLP corpora (Kalyan et al., 2021). The continual pre-training results we present are for the RoBERTa-base architecture as pre-trained by Facebook Research (2022), which we continue to pre-train on our domain-specific datasets, i.e., mimic and e-commerce.

Fine-tuning in downstream tasks. Pre-trained language models are most useful when applied to downstream tasks, as there is no need to retrain the entire model again. We evaluate pre-trained TLMs in the following in-domain tasks:

• e-commerce title classification: A binary classification task to decide whether a pair of item titles belong to the same marketplace product. Item titles are instances of a product sold by a specific seller, which can have different attributes (like condition) or exist as a special version (e.g., a signed book), yet refer to the same product.
• e-commerce title similarity: A task using the same title-pair data as above, but formulated as a similarity task. Namely, we learn a distance metric between item titles to help discriminate whether or not they belong to the same product.
• e-commerce title quality: A classification task that predicts whether a title fulfills the marketplace requirements for it to be a product title. Titles must contain the product's main relevant information (the brand, the product name and/or type, and all distinguishable attributes, i.e., its key features) but should not contain conditions, marketing terms, or any other non-product related information.
• medical NLI: A natural language inference task annotated by doctors (Shivade, 2019), which is grounded in the medical history of patients collected in MIMIC-III (Pollard and Johnson, 2016). It contains sentence pairs (the premise and the hypothesis statements) with a corresponding label indicating their inferential relationship (e.g., entailment, contradiction, or neutral).
Summary statistics for each in-domain per-task dataset are provided in Appendix B.6.
To elucidate how the pre-trained TLMs' quality evolves over pre-training interactions, we fine-tune (for ten epochs) the pre-trained RoBERTa models at each pre-training interaction $t$. We report the best classification accuracy of each fine-tuned model across pre-training interactions and fine-tuning epochs.

GP-TS pre-training of RoBERTa models
We compare from-scratch pre-training performance of all RoBERTa models (pre-trained with fixed hyperparameters or by GP-TS) in Figure 1, where we illustrate MLM validation losses of each model over pre-training interactions: GP-TS attains the lowest MLM loss values in fewer interactions.
Recall that when pre-training TLMs, validation performance varies across training epochs; hence, practitioners are interested in identifying the best pre-trained model (as per the lowest validation metric) instead of selecting the pre-trained TLM available at the last training epoch.
Results for continual pre-training are provided in Figure 2 below, where we observe that GP-TS continually pre-trains the best performing RoBERTa models, the fastest, for both in-domain datasets. MLM validation losses for models pre-trained with GP-TS fluctuate across interactions, depending on the stochastic action (hyperparameter value) selected by the GP-TS agent. Practitioners are interested in using the model with the lowest validation MLM loss, which GP-TS consistently finds across all studied datasets and pre-training approaches, in fewer pre-training interactions. We evaluate the influence of different realizations of GP-TS (with different random seeds) in Table 1, where we observe that GP-TS always pre-trains models with the lowest MLM loss, and in fewer interactions (indicated within parentheses). GP-TS not only circumvents the need for costly grid searches, but enables improved performance: it attains reduced MLM loss at earlier interactions than grid-search baselines. Note how GP-TS ψ outperforms all the alternatives in Table 1, as it pre-trains models with the lowest MLM loss, the fastest, even when no good initial guesses for the masking hyperparameters ψ = (ρ, γ, λ) are available.
In summary, the benefits of interactive GP-TS pre-training pertain not only to the attained MLM loss values, but to an accelerated, efficient procedure. We emphasize the computational efficiency of GP-TS: it adds little to no overhead (details on the computational cost of GP-TS are provided in Appendix B.3) while providing clear benefits for language model pre-training. It attains the best MLM pre-training performance in fewer interactions, avoiding computationally expensive hyperparameter searches.
To the best of our knowledge, these experiments provide novel evidence that, instead of MLM pre-training with fixed masking hyperparameters, sequentially deciding which masking values to use is beneficial. GP-TS finds sequences of dynamic masking hyperparameters (when optimizing over ρ or the three-dimensional hyperparameter space ψ ∈ Ψ) that minimize MLM loss across datasets, when pre-training from scratch and continually.
4.3 GP-TS pre-trained RoBERTa models for downstream fine-tuned tasks

We scrutinize how performant in-domain GP-TS pre-trained RoBERTa models are, when compared to grid-search based models, after in-domain per-task fine-tuning. The fine-tuned accuracy of the continually pre-trained models of Figure 2 is presented in Table 2: we showcase, per task, the best test-set accuracy of each fine-tuned model, and the pre-training interaction at which that value was attained. Results are computed on each per-task test set, i.e., a subset of each task's dataset (see details in Table 11) that has not been used for fine-tuning nor hyperparameter optimization.

These results exhibit how GP-TS pre-trains performant language models (with top accuracy) often at earlier interactions than when pre-training with static hyperparameters: e.g., the continually pre-trained GP-TS ψ model (see last row of Table 2) provides the best downstream accuracy for two e-commerce tasks and competitive accuracy in the others, in just a few pre-training interactions. This efficiency is of practical importance, due to the significant resource savings it affords. A pre-training hyperparameter grid-search does not provide significant downstream performance improvements, yet it demands high computational resources: the computational complexity of a grid-search over hyperparameters ψ = (ρ, γ, λ) with n candidates per hyperparameter is $O(n^3)$.
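For concreteness, assuming (as an illustrative choice) n = 5 candidate values per hyperparameter, a full grid over ψ = (ρ, γ, λ) requires $5^3 = 125$ complete pre-training runs, whereas GP-TS explores the same three-dimensional space within a single pre-training run of T sequential interactions (e.g., the 20 to 30 interactions considered in Table 1).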
On the contrary, by letting GP-TS pre-train TLMs, the best pre-training MLM performance is achieved, with well-performing fine-tuned model accuracy across downstream tasks, in fewer pre-training interactions.

Conclusion
We present a multi-armed bandit-based Bayesian optimization framework for the sequential selection of pre-training hyperparameters towards optimized Transformer-based language model performance.
We develop and evaluate an interactive, Gaussian process-based Thompson sampling (GP-TS) framework for accelerated language model pre-training. We model noisy evaluations of the pre-training objective (e.g., the MLM loss) as drawn from a surrogate Gaussian process, which the bandit agent leverages to sequentially minimize that objective.
We provide empirical evidence of how GP-TS, when applied to MLM dynamic masking, attains superior and accelerated (both from-scratch and continual) pre-training performance, along with excellent in-domain downstream metric values.
While Liu et al. (2019) randomly select, with fixed probability, which input tokens to mask, we show that sequentially adapting the masking hyperparameters with GP-TS results in enhanced and efficient pre-training. Notably, GP-TS interactively selects hyperparameters that result in top performing models faster, enabling significant resource efficiency, of critical importance in practice.
Building upon our formulation and the provided evidence, we envision follow-up work investigating the proposed method's ability to successfully pre-train large-scale models in general purpose corpora, as well as for optimizing domain-specific models.

Limitations
There are several limitations to account for in the presented work. First, execution and replication of the presented experiments demand large GPU resources. Second, we lack empirical results beyond English-based text, and morphologically and syntactically more complex corpora may affect the presented evidence. Third, our evaluation compares GP-TS performance to the common hyperparameter grid-search alternative, yet we acknowledge that other Bayesian optimization techniques used in the machine learning community may provide suitable and competitive alternatives to explore. In addition, we have not run any hyperparameter tuning beyond MLM dynamic masking, which might improve all studied algorithms' performance. Finally, our conclusions are limited to RoBERTa models pre-trained via MLM dynamic masking, and therefore, investigation of how GP-TS generalizes to other TLM pre-training approaches and architectures is lacking.

Ethics Statement
This work raises ethical and societal considerations associated with the use and biases of pre-collected natural language data, the energetic and environmental impact of extensive GPU resource usage, and the downstream applications of language models. We acknowledge the potential implicit biases within the publicly available datasets used; e.g., mimic reports are limited to the population attended at Beth Israel Deaconess Medical Center, and may contain implicit biases of the health practitioners there. We have carefully sampled data for the e-commerce dataset to avoid biases over specific products, users, and sellers. We are also aware of the rising concerns pertaining to the carbon footprint of large language models (Patterson et al., 2021), and the significant impact hyperparameter selection techniques have on resource utilization and power consumption (Puvis de Chavannes et al., 2021). Finally, we acknowledge the wide range of established and anticipated risks that language models pose to society (Weidinger et al., 2021).

B.5 RoBERTa fine-tuning
The specific RoBERTa hyperparameters used for the in-domain fine-tuning downstream tasks are described in Tables 7-10.

Figure 1: MLM validation loss comparison (lower is better) of grid-search and GP-TS based from-scratch pre-trained RoBERTa models, over interactions.

Figure 2: MLM validation loss comparison (lower is better) of grid-search and GP-TS based continually pre-trained RoBERTa models, over interactions.

Table 1: Best MLM loss attained before interactions 20 and 30, when pre-training RoBERTa models continually in the medical domain corpora.

Table 6: Summary statistics of the pre-training datasets.

Table 7: RoBERTa fine-tuning hyperparameters for the e-commerce title classification downstream task.

Table 8: RoBERTa fine-tuning hyperparameters for the e-commerce title similarity downstream task.

We split each per-task fine-tuning dataset into training, development, and test sets for our experiments, with summary statistics of each set provided in Table 11.

Table 11: Summary statistics of the fine-tuning task datasets.