On the Influence of Masking Policies in Intermediate Pre-training

Current NLP models are predominantly trained through a two-stage “pre-train then fine-tune” pipeline. Prior work has shown that inserting an intermediate pre-training stage, using heuristic masking policies for masked language modeling (MLM), can significantly improve final performance. However, it is still unclear (1) in what cases such intermediate pre-training is helpful, (2) whether hand-crafted heuristic objectives are optimal for a given task, and (3) whether a masking policy designed for one task is generalizable beyond that task. In this paper, we perform a large-scale empirical study to investigate the effect of various masking policies in intermediate pre-training with nine selected tasks across three categories. Crucially, we introduce methods to automate the discovery of optimal masking policies via direct supervision or meta-learning. We conclude that the success of intermediate pre-training is dependent on appropriate pre-train corpus, selection of output format (i.e., masked spans or full sentence), and clear understanding of the role that MLM plays for the downstream task. In addition, we find our learned masking policies outperform the heuristic of masking named entities on TriviaQA, and policies learned from one task can positively transfer to other tasks in certain cases, inviting future research in this direction.


Introduction
Large neural language models (LMs) pre-trained with masked language modeling have achieved impressive results over a variety of NLP tasks. Studies show that an additional intermediate pre-training stage between general pre-training and task-specific fine-tuning further improves downstream performance (Fig. 1). In particular, intermediate pre-training by masking and recovering named entities or dates, known as salient span masking (SSM, Guu et al. 2020), significantly improves a model's performance at answering factoid questions in a closed-book setting. However, there is a lack of systematic study on how intermediate pre-training works, whether heuristic masking policies like SSM are near-optimal, and whether they generalize to different NLP tasks. Additionally, for tasks other than closed-book QA, it is unclear whether intermediate pre-training is helpful or what masking strategy should be adopted.
Figure 1: Analysis Setup. We investigate the influence brought by different masking policies during intermediate pre-training, a stage between general pre-training and task-specific fine-tuning. We apply three types of policies (heuristic, supervised, meta-learned) on three categories of tasks (closed-book QA, knowledge-intensive language tasks, multiple-choice QA). Example input to the pre-trained model: Source: "In Newtonian physics, free fall is any motion of a body where <mask> is the only force acting upon it." Target: "gravity".
In this paper, we offer a large-scale, systematic study on the effects and transferability of masking strategies during intermediate pre-training, while carefully controlling all other aspects (§3). We begin our analysis with a focus on three heuristic masking policies (§4.1). We fine-tune the models resulting from intermediate pre-training on nine selected tasks covering three categories (closed-book QA, knowledge-intensive language tasks, and multiple-choice QA). Our results suggest that successful intermediate pre-training depends on the selection of an appropriate corpus. Moreover, heuristic-based approaches are effective only when we have a precise understanding of the role masked language modeling (MLM) plays in the downstream task. For example, MLM serves as a sort of memorization step (Petroni et al., 2019), whereby learning to unmask spans in context is analogous to memorizing facts about the span. In the absence of such understanding, heuristic policies may be sub-optimal.
This motivates us to explore whether automating the discovery of optimal masking policies is possible. We design methods to learn a masking policy with supervised learning (§4.2) or meta-learning (§4.3), and compare downstream task performance using the same protocol as in our previous analysis. Notably, we observe that masking policies learned with supervised learning and meta-learning outperform the SSM policy for TriviaQA, and that these policies learned from TriviaQA also help improve performance on Web Questions. We also discuss the pros and cons of learned masking policies, such as downstream task learning efficiency, risks of over-fitting, and learning instability.
Finally, in the hope of better understanding the heuristic and learned masking policies, we provide quantitative analysis on the masks produced by these policies. We visualize the distribution of part-of-speech tags among masked tokens, and their relation to token frequency in the corpus (§5.3). We find that the masking policies learned from TriviaQA tend to mask more proper nouns and less frequent words when compared to SSM.
Overall, our empirical analysis provides useful suggestions for NLP researchers who aim to improve downstream task performance using intermediate pre-training and heuristic masking strategies. In addition, our experiments reveal that infusing task-specific knowledge into LMs with learned masking policies is a promising way to improve downstream task performance, and invite future research in this direction.

Preliminary: Masked Language Modeling
In this section, we revisit the MLM objective with the notation that we will use throughout the paper. MLM is a predominant pre-training objective for large-scale transformers in NLP. MLM and its variants can be characterized by two key components: a masking policy $g(\cdot; \phi)$, parameterized by $\phi$, which decides the collection of tokens to be masked, and a language model $f(\cdot; \theta)$, parameterized by $\theta$. Formally, given a sequence of tokens $x = [x_1, x_2, \ldots, x_m]$, $g(x; \phi)$ generates a sequence of binary decisions $d = [d_1, d_2, \ldots, d_m]$, where $d_i = 1$ indicates the token $x_i$ will be masked. The source sequence for pre-training, $x^{(src)}$, is formulated by replacing the selected tokens with a special <mask> token, i.e., $x^{(src)} = [x^{(src)}_1, \ldots, x^{(src)}_m]$, where $x^{(src)}_i = x_i$ if $d_i = 0$ and $x^{(src)}_i$ is <mask> if $d_i = 1$. We denote this operation as $x^{(src)} = x \oplus d$. The target sequence $x^{(tar)}$ can be either the full original sequence $x$ (BART, Lewis et al. 2020) or the sequence of masked tokens (T5).
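To make the notation concrete, the following is a minimal Python sketch of the masking operation x ⊕ d and of the two target formats (full sequence versus masked tokens only). The function names `apply_mask` and `make_example` are hypothetical and chosen for illustration; real implementations operate on subword ids rather than whitespace-split words.

```python
from typing import List, Tuple

MASK = "<mask>"  # special mask token, as written in the text


def apply_mask(x: List[str], d: List[int]) -> List[str]:
    """Compute x (+) d: replace x[i] with <mask> wherever d[i] == 1."""
    if len(x) != len(d):
        raise ValueError("x and d must have the same length")
    return [MASK if di == 1 else xi for xi, di in zip(x, d)]


def make_example(x: List[str], d: List[int], full_target: bool) -> Tuple[List[str], List[str]]:
    """Build a (source, target) pair for one pre-training sequence.

    full_target=True  -> target is the whole original sequence (BART-style).
    full_target=False -> target is only the masked tokens (T5-style).
    """
    src = apply_mask(x, d)
    if full_target:
        tar = list(x)
    else:
        tar = [xi for xi, di in zip(x, d) if di == 1]
    return src, tar


# Illustrative input only: a whitespace-tokenized toy sentence.
x = "free fall is motion where gravity is the only force".split()
d = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]  # mask the token "gravity"

src, tar_full = make_example(x, d, full_target=True)
_, tar_span = make_example(x, d, full_target=False)
```

Here `src` has <mask> in place of "gravity", `tar_full` is the original sentence, and `tar_span` contains only the masked token.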

Analysis Setup
In this section we introduce the analysis pipeline ( §3.1) and downstream datasets we use ( §3.2). We defer the details of learned masking policies to §4.

Experiment Procedure
Our goal is to analyze the influence on downstream task performance brought by different masking policies g(.; φ) during intermediate pre-training. Towards this goal, we ensure that the only variable is the masking policy, while all other aspects are controlled, so that the downstream performance reveals the influence we aim to study. We first initialize with a BART-base model (Lewis et al., 2020); then, for each masking policy, we conduct experiments following a two-stage pipeline:
Stage 1. Intermediate Pre-training. We perform intermediate pre-training with a given masking policy g(.; φ). All intermediate pre-training is done with an input sequence length of 128, a batch size of 2048, a learning rate of 0.0001, and up to a total of 100,000 updates, using the Wikipedia snapshot from December 20, 2018.
Stage 2. Task-specific Fine-tuning. We finetune each resulting checkpoint from Stage 1 on related downstream tasks, and evaluate their performance. We follow the same routine of hyperparameter search for each checkpoint. We then run the fine-tuning experiments with the best hyperparameter setting and three different random seeds. See Appendix B for details.

Downstream Tasks and Datasets
We focus our study on nine downstream tasks across three categories. We introduce their details and explain the rationale behind our selection in the following.
Closed-book QA. Closed-book QA is a task that requires a language model to directly answer questions without access to external knowledge. This paradigm assumes that the model memorizes large amounts of knowledge from its pre-training data, which gets "packed" into its parameters, and can subsequently be "retrieved" to answer questions. Notably, prior work reported a 9%+ improvement in exact match on TriviaQA when intermediate pre-training with salient span masking (i.e., masking and recovering named entities or dates) is performed on a T5-11B model. This observation inspired our work. Our study considers three datasets for closed-book QA: Natural Questions (NQ), WebQuestions (WQ) and TriviaQA (TQA).
Knowledge-Intensive Tasks from KILT. Extending from closed-book QA, we select three tasks from the KILT benchmark (Petroni et al., 2020) that also aim to test a model's implicit knowledge capacity, while having different task formats and goals. Aidayago2 (AY2) is an entity linking task that requires the model to assign a Wikipedia page to an entity mention in the text. The output is the unique name of the Wikipedia page in text format. Zero-shot relation extraction (ZSRE) is a slot filling task that aims to predict the object when given the subject and the relation. The relations in the train/dev/test splits are non-overlapping. Wizard of Wikipedia (WoW) is a dataset of dialogue histories relevant to knowledge in Wikipedia. The model is required to act like a chatbot and generate the response given the previous dialogue history.
Knowledge-Intensive Multiple-choice QA. We select three multiple-choice QA datasets in which the questions can be answered with commonsense/background knowledge without any context, but the dataset provides additional context paragraphs to explicitly state the background knowledge used. We use WIQA, which focuses on procedural text; QuaRTz, which focuses on qualitative relationships; and ROPES, which focuses on causes and effects. We reformat these tasks into sequence-to-sequence format, following UnifiedQA (Khashabi et al., 2020).
To summarize, all tasks above can be treated as sequence-to-sequence tasks, where each example is a source-target pair (s, t), accompanied with a context paragraph c provided by the dataset. Details for dataset splitting are in Appendix D.1.

Compared Masking Policies
We experiment with three categories of masking policies: heuristic policies, where g is a fixed heuristic function ( §4.1); supervised policies, where g is a model whose weights are learned from direct supervision on downstream tasks ( §4.2); and meta-learned policies, where g is a model whose weights are learned through meta-learning on downstream tasks ( §4.3).

Supervised Policy
When students prepare for closed-book exams, they are likely to review and memorize what they perceive as most important in the textbook. Such perception is learned from their prior experience of taking closed-book exams. Following this intuition, Ye et al. (2020) proposed to learn a masking policy for closed-book QA tasks to help the model focus on likely answers during intermediate pre-training. The masking policy is trained with (answer, context) examples, and the policy is an extractive model that extracts the answer span from the context. For example, if the context x is [Charles, Schulz, was, the, creator, of, Snoopy] and the answer is "Charles Schulz", the label for the answer start index will be [1,0,0,0,0,0,0]; for the end index it will be [0,1,0,0,0,0,0]. In the following, we briefly recap the method with our notation.
Model. Given context paragraph tokens $x = [x_1, x_2, \ldots, x_m]$, we first use an embedding matrix $E$ to embed each token: $[e_1, e_2, \ldots, e_m]$. Then, we use a 2-layer bi-directional LSTM model to compute the hidden representation $h_j$ at each position $j$ (see footnote 3). Finally, we use two sets of learned parameters, $(w_{st}, b_{st})$ and $(w_{ed}, b_{ed})$, to compute the logits for each position being the start or end position of the potential answer/target span. For example, the logits of position $j$ being a start/end position are computed as $y_{j,st} = w_{st}^\top h_j + b_{st}$ and $y_{j,ed} = w_{ed}^\top h_j + b_{ed}$.
Policy Inference. When deploying the policy to intermediate pre-training, we select the potential answer spans by ranking the sum of start and end logits of each potential span, in accordance with the inference step in machine reading comprehension models. That is, we rank the spans $(i, j)$ according to $y_{i,st} + y_{j,ed}$. We consider two variants when deploying the policy: (a) masking the top 1 span or (b) sampling 1 span from the top 5 spans.
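The span selection step can be sketched as follows. This is an illustrative Python sketch rather than the authors' implementation: the function names, the `max_len` cap on span length, and the toy logits are all assumptions introduced here; the text only specifies ranking spans by the sum of start and end logits and then taking the top 1 span or sampling from the top 5.

```python
import random
from typing import List, Optional, Tuple


def rank_spans(start_logits: List[float], end_logits: List[float],
               max_len: int = 5) -> List[Tuple[float, int, int]]:
    """Score every span (i, j) with i <= j by start_logits[i] + end_logits[j].

    max_len is a hypothetical cap on span length, not taken from the paper.
    Returns (score, i, j) tuples sorted from highest to lowest score.
    """
    n = len(start_logits)
    spans = []
    for i in range(n):
        for j in range(i, min(n, i + max_len)):
            spans.append((start_logits[i] + end_logits[j], i, j))
    spans.sort(key=lambda s: s[0], reverse=True)
    return spans


def select_span(start_logits: List[float], end_logits: List[float], top_k: int = 1,
                rng: Optional[random.Random] = None) -> Tuple[int, int]:
    """Variant (a): top_k=1 returns the best span.
    Variant (b): top_k=5 samples one span uniformly from the five best."""
    ranked = rank_spans(start_logits, end_logits)
    if top_k == 1:
        return ranked[0][1:]
    rng = rng or random.Random()
    return rng.choice(ranked[:top_k])[1:]


# Toy logits for the context [Charles, Schulz, was, the, creator, of, Snoopy].
start = [4.0, 0.5, -1.0, -1.0, -0.5, -1.0, 1.0]
end = [0.2, 3.5, -1.0, -1.0, -0.8, -1.0, 1.5]

best = select_span(start, end)                       # variant (a)
sampled = select_span(start, end, top_k=5,
                      rng=random.Random(0))          # variant (b)
```

With these toy logits, `best` covers token positions 0 to 1 (0-indexed), i.e., "Charles Schulz"; `sampled` is one of the five highest-scoring spans.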
Applicability and Limitation. The supervised policy is designed for closed-book QA, and one limitation of this method is that the target span t must appear verbatim in the context paragraph c. Among the other knowledge-intensive tasks, only ZSRE satisfies this constraint. To sum up, we apply the supervised policy method to TQA, NQ, and ZSRE.

Meta-learned Policy
Conceptually, what the learned masking policy captures is closely related to the concept of "learning to learn" (Schmidhuber, 1987;Thrun and Pratt, 1998). At a high level, the masking policy should provide the model with the desired initialization for the downstream task, such that the model can better learn the downstream task in only a few fine-tuning updates. Therefore, we construct a meta-learning approach, which we describe below.
Overview. We formulate each (c, s, t) example as a small "task". For each task, the goal is to improve the performance of generating target sequence t given input s, immediately after learning from the context c. This is similar to taking quizzes, where a student first learns from a passage c and then is immediately tested on it by trying to answer t given s. Studying from c strategically with an optimal masking policy will result in better performance (i.e., smaller loss in generating t).
Footnote 3: Though the masking policy can theoretically take any form, we opt for a lightweight architecture (2-layer Bi-LSTM) as we need to apply it to millions of pre-training instances.
Figure 2 (panel labels: Pre-train, Target, Fine-tune; example context: "Charles Schulz (November 26, 1922-February 12, 2000) was an American cartoonist").
Following work in gradient-based meta-learning (Finn et al., 2017;Grefenstette et al., 2019), we set up an inner and outer loop. We briefly sketch the procedure in Fig. 2. In the inner loop, we focus on the current (c, s, t) examples by applying the current masking policy g(.; φ) and performing pre-train/fine-tune updates to f (.; θ). In the outer loop, we update the policy g(.; φ) with the signal at the end of inner loop training. We denote φ (p) as the masking policy parameters after p outer loop optimization steps, and θ (p,q) as the LM parameters after p outer loop optimization steps and q inner loop optimization steps.
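Inner Loop. In one inner-loop curriculum, we first take the context as a pre-training sentence, i.e., $x = c$, and use the current masking policy $g(\cdot; \phi^{(p)})$ to determine the masks $d$ and the implied perturbed input $x^{(src)}$:
$$d = g(x; \phi^{(p)}), \qquad x^{(src)} = x \oplus d.$$
We pre-train $\theta^{(p,0)}$ for one step to recover $x$ from $x^{(src)}$:
$$\theta^{(p,1)} = \theta^{(p,0)} - \alpha_0 \nabla_{\theta^{(p,0)}} L(f(x^{(src)}; \theta^{(p,0)}), x), \qquad (3)$$
where $\alpha_0$ is the learning rate, and $L(\cdot, \cdot)$ is the cross-entropy loss for recovering $x$ using the perturbed input $x^{(src)}$ and model parameters $\theta^{(p,0)}$. Next, we take $\theta^{(p,1)}$ as initialization and fine-tune it for one step on the downstream objective of predicting $t$ given $s$:
$$\theta^{(p,2)} = \theta^{(p,1)} - \alpha_1 \nabla_{\theta^{(p,1)}} L(f(s; \theta^{(p,1)}), t), \qquad (4)$$
where $\alpha_1$ is the learning rate of the fine-tuning step.
Outer Loop. In the outer loop, we update the masking policy $g(\cdot; \phi^{(p)})$. We aim to answer query $s$ correctly after the inner-loop curriculum. We define the meta-loss $L'$ as the decrease in losses after the one fine-tuning update, i.e.,
$$L' = L(f(s; \theta^{(p,2)}), t) - L(f(s; \theta^{(p,1)}), t). \qquad (5)$$
$L'$ characterizes how fast the model has adapted itself to answer $(s, t)$ within one step of optimization. Since all computations in Eq. (3-5) are continuous, we optimize $\phi$ by directly taking gradients from $L'$,
$$\phi^{(p+1)} = \phi^{(p)} - \alpha_2 \nabla_{\phi^{(p)}} L', \qquad (6)$$
where $\alpha_2$ is the outer-loop learning rate.
Controlling Masking Budget. Higher-order optimization is known to be unstable (Antoniou et al., 2019). In early stages of the study, we found the policy to be flipping between masking none or all of the tokens. To stabilize training, we add a softened L2 loss to control the portion of mask/not-mask decisions output by $g(\cdot; \phi)$. Denoting $l(x)$ as the input sequence length and $l(d)$ as the number of mask decisions, we define a budget $\gamma$ and a tolerance factor $\epsilon$, and compute the regularization term $L_{reg}$ (Eq. 7), which is zero when the masked portion $l(d)/l(x)$ lies within $\gamma \pm \epsilon$ and is an L2-style penalty otherwise. For example, when $\gamma = 15\%$, $\epsilon = 5\%$ and the input sequence $x$ contains 100 tokens, the policy will not be penalized if it masks $15 \pm 5$ of the 100 tokens. We modify the optimization step in Eq. (6) as follows,
$$\phi^{(p+1)} = \phi^{(p)} - \alpha_2 \nabla_{\phi^{(p)}} (L' + \beta L_{reg}), \qquad (8)$$
where $\beta$ is a coefficient balancing the regularization intensity.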
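To illustrate the masking budget described above (a "softened L2" penalty with budget γ and tolerance ǫ), here is a small Python sketch. The exact functional form is an assumption made for illustration: the penalty below is zero inside the tolerance band and grows with the square of the excess outside it, which is one natural reading of the description, not necessarily the form used in the paper. The function name `budget_penalty` is hypothetical.

```python
def budget_penalty(num_masked: float, seq_len: int,
                   gamma: float = 0.15, eps: float = 0.05) -> float:
    """Assumed 'softened L2' budget regularizer.

    ratio  = l(d) / l(x), the portion of tokens that are masked.
    Returns 0 when |ratio - gamma| <= eps, otherwise the squared amount
    by which the deviation exceeds the tolerance.
    """
    ratio = num_masked / seq_len
    excess = max(0.0, abs(ratio - gamma) - eps)
    return excess ** 2


# With gamma = 15% and eps = 5% on a 100-token sequence:
inside_band = budget_penalty(18, 100)   # 18% is within 15% +/- 5%
mask_all = budget_penalty(100, 100)     # degenerate "mask everything" policy
mask_none = budget_penalty(0, 100)      # degenerate "mask nothing" policy
```

During meta-learning the count l(d) would be computed from relaxed (differentiable) decisions so that gradients reach φ; the plain-float version here only illustrates the shape of the penalty, which is zero in the band and positive for both degenerate policies.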
Post-processing. When we deploy a learned policy to pre-training, we are no longer constrained by differentiability. Based on useful techniques in previous work, we apply post-processing to predicted masking decisions d.
(1) Whole-word masking and text infilling (Liu et al., 2019;Lewis et al., 2020): whenever one subword x i within a whole word is masked (d i = 1), we expand the mask and always mask the whole word. When consecutive tokens are masked, we replace the sequence of <mask> in the input sequence with exactly one <mask> token.
(2) Additional budget control: Even with our budget regularization loss (Eq. 7), we find some input sequences get too many masks (> 50%). This creates extremely challenging pre-train examples that may prevent the model from learning useful information. For these sentences we randomly "unmask" tokens to keep the portion of masks below 30%.
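The two post-processing steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `word_ids` (a word index for each subword) is a hypothetical input standing in for whatever the tokenizer provides, the function names are invented here, and the order in which the steps are applied (whole-word expansion, then budget capping, then infilling) is an assumption rather than something the text specifies.

```python
import random
from typing import List, Optional

MASK = "<mask>"


def expand_to_whole_words(d: List[int], word_ids: List[int]) -> List[int]:
    """If any subword of a word is masked, mask every subword of that word."""
    masked_words = {w for di, w in zip(d, word_ids) if di == 1}
    return [1 if w in masked_words else 0 for w in word_ids]


def cap_budget(d: List[int], max_ratio: float = 0.5, target_ratio: float = 0.3,
               rng: Optional[random.Random] = None) -> List[int]:
    """If more than max_ratio of tokens are masked, randomly unmask tokens
    until the masked portion is below target_ratio."""
    d = list(d)
    n = len(d)
    if sum(d) / n <= max_ratio:
        return d
    rng = rng or random.Random()
    masked_positions = [i for i, di in enumerate(d) if di == 1]
    rng.shuffle(masked_positions)
    while masked_positions and sum(d) / n >= target_ratio:
        d[masked_positions.pop()] = 0
    return d


def infill(tokens: List[str], d: List[int]) -> List[str]:
    """Text infilling: each run of consecutive masked tokens becomes one <mask>."""
    out, prev_masked = [], False
    for tok, di in zip(tokens, d):
        if di == 1:
            if not prev_masked:
                out.append(MASK)
            prev_masked = True
        else:
            out.append(tok)
            prev_masked = False
    return out


# Toy subword sequence; word_ids groups subwords into whole words.
tokens = ["Char", "les", "Schul", "z", "was", "a", "cartoon", "ist"]
word_ids = [0, 0, 1, 1, 2, 3, 4, 4]
d = [1, 0, 0, 1, 0, 0, 0, 0]

expanded = expand_to_whole_words(d, word_ids)
capped = cap_budget(expanded, rng=random.Random(0))
source = infill(tokens, capped)
```

In this toy case the two partially masked words are fully masked after expansion, the 50% mask ratio does not exceed the cap, and the four consecutive masked subwords collapse into a single <mask> token.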
For space concerns, we leave pseudo-code and other implementation details in Appendix A.2.

Results and Discussion
Following our analysis setup (§3), we present the results for closed-book QA in Table 1, knowledge-intensive language tasks (KILT) in Table 2 and multiple-choice QA in Table 3. In the following, we aim to understand the influence brought by different masking policies through these results. We also introduce several ad-hoc experiments to verify the hypotheses raised in our analysis.

Comparison of Heuristic Policies
Continued pre-training with the original objective is helpful in general. Prior work has shown that intermediate pre-training of encoder models (i.e., RoBERTa, Liu et al. 2019) with in-domain corpora helps to improve downstream classification performance (Gururangan et al., 2020). Our experiments examine whether a similar conclusion holds for text-to-text models and tasks beyond classification. We find that intermediate pre-training with Wikipedia and BART's original objective (+Orig) improves performance on two closed-book QA tasks (TQA and WQ), one entity linking task (AY2), and two multiple-choice QA tasks (WIQA and QuaRTz); maintains performance on NQ and ZSRE; and leads to worse performance on ROPES. Overall, intermediate pre-training leads to improved performance; this may be due to the common observation that language models tend to improve with further pre-training even after validation perplexity has plateaued, or because Wikipedia, as a general knowledge-intensive corpus, is more closely related to our downstream tasks than the corpus used for general pre-training.
Table 3: Performance of Multiple-choice QA Tasks. We report accuracy for each task.
The discussion above limits the focus to the +Orig objective. Now we further add +Rand and +SSM into the comparison. From the results in Table 1, we first confirm that salient span masking (SSM) is indeed very beneficial for closed-book QA. In addition, SSM helps improve performance for two entity-centric knowledge-intensive tasks (AY2 and ZSRE, see Table 2) and two multiple-choice QA tasks (ROPES and QuaRTz, see Table 3). Note that ROPES focuses on causal relationships between entities and QuaRTz focuses on qualitative relations (involving numbers); both can be considered entity-centric. We conclude that using heuristic masking policies that resemble the downstream tasks, or masking information known to be important for the downstream task, tends to improve downstream performance.
When it is difficult to design a heuristic that satisfies these needs, using random masking may be helpful. In this case, we recommend deciding whether to generate the full sequence (+Orig) or only the masked tokens (+Rand) based on the task output length. If the downstream task requires generating long sentences, generating the full sequence is more helpful. This is supported by the observation that +Orig is better than +Rand for WoW. On the other hand, if the target sequences in the downstream dataset are shorter, generating masked tokens is more helpful, as shown by experiments on NQ, AY2, ZSRE and ROPES.

How Do Learned Policies Perform?
We have introduced two ways to automate the discovery of better masking policies, with supervised learning (§4.2) and meta-learning (§4.3). We now extend our analysis to these learned policies.
Successful Cases. We observe that learned policies are most successful on TriviaQA, with both the supervised policy and the meta-learned policy outperforming SSM. We attribute their success to the following reasons: (1) (context, source, target) examples are abundant, so the masking policy has sufficient supervision. The TriviaQA dataset is accompanied by large-scale context paragraphs created with distant supervision, so the scale of (c, s, t) examples is larger than in other datasets.
(2) The heuristic masking policy does not "perfectly" resemble the downstream task, and it still has room for improvement. SSM masks one random named entity in the context. However, the answers to trivia questions are not necessarily named entities, and one named entity may be more important than another. Therefore the learned policies can better capture the characteristics of TriviaQA than SSM. Apart from TriviaQA, meta-learned policies outperform +Orig on NQ, ZSRE and ROPES, demonstrating the effectiveness of the method. This also opens up a promising direction for downstream tasks whose heuristic masking policy is not intuitive (e.g., dialogue response generation, multiple-choice QA).
Table 4: Low-resource fine-tuning on TriviaQA with 0.1% and 1% of the training data.
Training Data Used            0.1%     1%
(a) BART-Base                 3.69%    5.54%
(b) +SSM                      5.56%    7.31%
(c) +Supervised-TQA(Top1)     6.49%    8.40%
(d) +Meta-learned-TQA         4.50%    6.44%
Improved learning efficiency. We additionally consider a low-resource setting for TriviaQA, where we use 0.1% and 1% of its training set for fine-tuning. We present the results in Table 4. We observe that the supervised policy has better sample efficiency than SSM. We also observe that intermediate pre-training by generating the full sequence (a/d) is worse than generating spans (b/c), supporting our previous conclusion that the choice of target sequences should be based on the downstream task output format (span or sentence).
Overfitting on ZSRE. The ZSRE dataset has a unique setting: it is a slot filling task similar to closed-book QA; however, it adds an additional challenge as the relations in the train/dev/test splits are non-overlapping. We hypothesize that this train/test discrepancy leads to unsatisfactory behavior of learned ZSRE policies, and we conduct a set of controlled experiments to validate this hypothesis. Concretely, we use 90% of its original train set as the new train set, use the remaining 10% of training examples as a "matched" dev set, and use the original dev set as a "mismatched" dev set. In our experiments, SSM achieves 20.02% ±.16% EM on match-dev, and 3.21% ±.15% on mismatch-dev. Supervised-ZSRE(Top5) achieves 20.37% ±.04% on match-dev (outperforming SSM, p<0.05), and 2.94% ±.11% on mismatch-dev. These experiments show that our supervised policy is learning useful information, but has overfitted to the training data and becomes less robust to distribution shift during inference. In comparison, SSM is agnostic to train-test discrepancy and thus achieves the strongest performance.
Generalization of learned policies. We observe several cases where a policy learned from one dataset positively transfers to other downstream tasks. These include Supervised-TQA(Top5) bringing improvement to WQ, +Supervised-NQ(Top5) bringing improvement to TQA, and +Supervised-ZSRE(Top5) bringing improvement to AY2, compared to random masking baselines. This is reasonable since all these tasks are entity-centric and similar in nature. For tasks with significantly different formats and goals, e.g., ZSRE and WoW, policies learned on one do not benefit the other.
Here we only exhibit the evidence supporting that learned masking policies can positively transfer, and we leave the question of "when and why does it work" as future work.
Remarks. Supervised/meta-learned masking policies are our initial attempt towards the idea of "learning to mask". While they are successful and show evidence of positive transfer in certain cases, we recognize the potential risks of overfitting and of high instability in meta-learning. We hope future work can investigate these issues and design novel methods to learn better masking policies.

Quantitative Analysis: What Is Masked? How Are Policies Different?
In this section we aim to understand how masking policies are different from each other in terms of their masking decisions. We analyze the relation between masking decisions, part-of-speech tags and token frequency. Specifically, we take 1% of the pre-train corpus and compare the masking decisions made by each policy to facilitate our analysis.
Relation to Part-of-speech Tags. In Fig. 3, we plot the stacked bar chart of part-of-speech tags to visualize their distribution. Each bar represents the portion of masks having the part-of-speech tag, amongst all masks produced by this policy.
Notably, most supervised policies learn to focus more on proper nouns, and less on common nouns. This is consistent with the goal of the entity-centric downstream tasks. Comparing Supervised-TQA and SSM, Supervised-TQA focuses less on nouns, numbers and adjectives, and it focuses even more on proper nouns. This suggests that Supervised-TQA better characterizes the properties of TQA, and thus outperforms SSM by learning to mask task-specific information. Due to the differences in learning procedures, the meta-learned policies have distributions different from those of supervised policies. Still, meta-learned policies for NQ and TQA mask more proper nouns compared to random masking, similar to their supervised counterparts.
Relation to Token Frequency. In Fig. 4, we plot the relation between mask frequency and token frequency for masking policies learned from TQA, along with random masking and SSM for reference. Mask frequency is computed as the number of times a token was masked divided by the number of all masked tokens. For random masking, the datapoints approximate a Zipfian distribution (Zipf, 1999), with some noise due to random sampling of words. For SSM, most datapoints fall on a curve above the random masking line, while a small portion of tokens are less likely to be masked, forming line segments in the bottom area. These observations indicate that SSM tends to mask less frequent tokens, but its behavior is not fully explained by token frequency. The two learned policies, Supervised-TQA (Fig. 4(a)) and Meta-TQA (Fig. 4(b)), are in general similar to SSM, while the curve for Supervised-TQA is more scattered, indicating a weaker preference for Zipfian behavior.
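The mask frequency statistic defined above can be computed with a few lines of Python. The sketch below is illustrative: the function names are hypothetical, and `token_frequency` assumes that "token frequency" means a token's relative frequency in the sampled corpus, which the text does not spell out.

```python
from collections import Counter
from typing import Dict, List


def mask_frequency(sequences: List[List[str]],
                   decisions: List[List[int]]) -> Dict[str, float]:
    """For each token: (# times it was masked) / (# of all masked tokens)."""
    masked = Counter()
    for x, d in zip(sequences, decisions):
        for tok, di in zip(x, d):
            if di == 1:
                masked[tok] += 1
    total = sum(masked.values())
    return {tok: c / total for tok, c in masked.items()} if total else {}


def token_frequency(sequences: List[List[str]]) -> Dict[str, float]:
    """Assumed definition: relative frequency of each token in the corpus."""
    counts = Counter(tok for x in sequences for tok in x)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}


# Toy corpus and masking decisions (1 = masked).
corpus = [["a", "b", "a"], ["b", "c"]]
masks = [[1, 0, 1], [0, 1]]

mf = mask_frequency(corpus, masks)
tf = token_frequency(corpus)
```

Plotting `mf[tok]` against `tf[tok]` for every token, one point per token, gives the kind of scatter described for Fig. 4; tokens that are never masked are simply absent from `mf`.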

Related Work
Implicit Knowledge in Pre-trained Language Models. Petroni et al. (2019) discovered that pre-trained language models can implicitly store relational knowledge in their parameters, and such knowledge can be accessed with cloze-style queries. The task of closed-book QA, introduced in prior work, breaks the convention of the retriever-reader strategy for open-domain QA and requires the model to directly generate answers with its implicit knowledge. Closed-book QA performance is boosted significantly when salient span masking (Guu et al., 2020) is used. Guu et al. (2020) maintained that SSM helps the model to "focus on problems that require world knowledge".
Self-supervised Pre-training. Pre-trained language models have shown their capability on a wide variety of NLP tasks. Current self-supervised objectives are mostly heuristic, including masked language modeling, span boundary representation learning (Joshi et al., 2020), corrupted sentence reconstruction (Lewis et al., 2020), etc. Raffel et al. (2020) systematically studied the self-supervised objectives used in previous literature. Related to our goal of exploring pre-training objectives, ELECTRA proposes a replaced token prediction task which improves pre-training efficiency. Other work proposes to reduce the variance of gradients in SGD and expedite model pre-training. Levine et al. (2020) propose to mask n-grams according to Pointwise Mutual Information (PMI). These works typically consider the efficiency of an objective when pre-training from scratch and without a preconceived focus on a given problem, while we focus on encoding knowledge or adapting the model during intermediate pre-training with a given task in mind.
Domain/Task-specific Pre-training. Gururangan et al. (2020) experiment on four domains (biomedical, computer science, news, reviews) and eight different datasets, where they discover that pre-training with an in-domain corpus leads to better downstream performance. Kang et al. (2020) propose to learn a mask generator via reinforcement learning. Closely related to our work, Gu et al. (2020) propose task-guided pre-training by learning to predict an importance score for each token in the pre-training corpus. Vu et al. (2020) and Pruksachatkun et al. (2020) study knowledge transfer from intermediate-task fine-tuning, while we focus on a different problem setting of intermediate pre-training with a generic corpus (e.g., Wikipedia). We believe both settings have practical utility in real-world applications.

Conclusion
In this paper, we study the influence brought by different masking policies used during intermediate pre-training, and offer two methods as our initial attempts towards automating the discovery of optimal masking policies. From extensive experiments with heuristic and learned masking policies across three categories of tasks, we have identified several successful cases of intermediate pre-training, offered in-depth analysis and insights for the masking policies we used, discussed the risks of learned masking policies, and summarized several suggestions for researchers who wish to adopt intermediate pre-training in their applications. We also acknowledge that, despite our additional efforts and experiments, several observations still cannot be fully explained. We invite future research into this challenging and under-explored problem, to expand on our methods, and to search the space of pre-training objectives beyond masked language modeling. Furthermore, we hope our work encourages researchers to consider the type of downstream applications they wish to deploy their LMs in before investing resources into large-scale pre-training.

A Additional Training Details
A.1 Supervised Policy
Training Details. The embedding matrix E is initialized with the weights of the BART-base model. We optimize the cross-entropy loss between the logits output by the model and the gold annotations. For each source of supervision stated above, we train the policy for 30 epochs with a learning rate of 1e-5 and a batch size of 512, and select the best checkpoint according to validation loss.

A.2 Meta-learned Policy
Design choices. We use a 1D convolution layer with two additional linear layers as our policy network g(.; φ). The linear layers output two logits for each token in the input sequence x. The two logits for each token go through Gumbel Softmax (Jang et al., 2017) to decide whether it should be masked (d i = 1) or not (d i = 0). We have also experimented with a Bi-LSTM as the encoder, but found meta-learning with LSTMs to be extremely unstable.
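As a rough illustration of how two logits per token can be turned into a mask decision with Gumbel Softmax, consider the following pure-Python sketch. It is not the authors' implementation: the function names, the temperature `tau`, the convention that index 1 means "mask", and taking the argmax of the relaxed sample as the hard decision are all assumptions made here. In practice an autodiff framework's own routine (for example PyTorch's `torch.nn.functional.gumbel_softmax`) would be used so that gradients can flow back to φ.

```python
import math
import random
from typing import List, Optional


def gumbel_softmax(logits: List[float], tau: float = 1.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)       # avoid log(0)
        g = -math.log(-math.log(u))        # Gumbel(0, 1) noise
        noisy.append((logit + g) / tau)
    m = max(noisy)                         # subtract max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]


def sample_mask_decisions(token_logits: List[List[float]], tau: float = 1.0,
                          rng: Optional[random.Random] = None) -> List[int]:
    """token_logits[i] = [logit_not_mask, logit_mask] for token i (assumed order).
    Returns d with d[i] = 1 when the 'mask' class wins the relaxed sample."""
    rng = rng or random.Random()
    decisions = []
    for logits in token_logits:
        probs = gumbel_softmax(logits, tau, rng)
        decisions.append(1 if probs[1] > probs[0] else 0)
    return decisions


# Two confident tokens (one clearly "mask", one clearly "keep") and one uncertain.
logits = [[-50.0, 50.0], [50.0, -50.0], [0.0, 0.0]]
d = sample_mask_decisions(logits, rng=random.Random(0))
```

The first two decisions are determined by the large logit gaps, while the third depends on the sampled noise; lowering `tau` pushes the relaxed probabilities closer to one-hot.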
Intuitive Example. The PTLM f is given a piece of context c, "Charles Schulz (November 26, 1922-February 12, 2000) was an American cartoonist", and is expected to take an upcoming "closed-book exam" based on this piece of context. In the pre-train step, the current policy g predicts masks (e.g., Charles Schulz (<mask> -February 12, 2000) was an American cartoonist) and the PTLM takes one step of optimization, implicitly encoding this piece of knowledge into its parameters. After this, the PTLM "transitions to closed-book exam mode" by fine-tuning on (s, t) for one step. Finally, the language model "takes the closed-book exam", and the loss for generating t given s as input can be interpreted as the supervision for the masking decisions (i.e., whether masking "November 26, 1922" is helpful).
Pseudo-code. We provide pseudo-code for our method in Algorithm 1.

C Discussion on NQ
In Table 1 we observe that performances on NQ are close for all BART-base models; therefore it is hard to rank all compared methods. We argue that multiple factors lead to this phenomenon, including dataset characteristics and the evaluation protocol. Specifically, NQ may not be an ideal testbed for our study for three reasons. Firstly, intermediate pre-training in general might not be as beneficial for this particular task. For instance, prior work reports only a 2% EM gain on NQ using T5-11B. In our experiments, we use significantly smaller pre-trained models (BART-base/large), so the effect brought by intermediate pre-training will be even smaller. In our case we believe the effect is hidden in the variance brought by random seeds.
Secondly, performance on NQ may not represent the real implicit knowledge capacity of an LM. For reference, we observe a 20% dev set EM when fine-tuning a randomly initialized BART-base model on NQ. The general pre-training stage brings merely a 4-5% EM improvement, and therefore the improvement brought by intermediate pre-training can be marginal.
Finally, evaluation based on exact match may substantially underestimate the model's capability, as suggested in prior work.

D Reproducibility
D.1 Dataset Details
We obtain closed-book QA datasets from https://github.com/facebookresearch/DPR/blob/master/data/download_data.py, and knowledge-intensive language tasks from https://github.com/facebookresearch/KILT/blob/master/scripts/donwload_all_kilt_data.py. We obtain ROPES, WIQA and QuaRTz from huggingface datasets (https://huggingface.co/datasets). For more details, see Table 6. KILT hosts the test set evaluation on its leaderboard and the test set annotations are not publicly available; therefore we report performance on the dev set in Table 2. The test set annotations for ROPES are not publicly available, so we take 50% of the original dev set as the new dev set, and the other 50% as the new test set.

D.2 Training Details
Implementation. All our experiments are implemented with fairseq (Ott et al., 2019). For higher-order optimization in the meta-learning approach, we use the higher library (Grefenstette et al., 2019). Our code will be released upon acceptance.