G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks

General pre-trained language models (PLMs), such as BERT, have achieved remarkable performance on various NLP tasks. Recently, domain-specific PLMs have been proposed to boost task performance in specific domains (e.g., biomedical and computer science) by continuing to pre-train general PLMs on domain-specific corpora. However, this domain-adaptive pre-training (DAPT; Gururangan et al., 2020) tends to forget the general knowledge previously acquired by general PLMs, which leads to catastrophic forgetting and sub-optimal performance. To alleviate this problem, we propose a new framework, the General Memory-Augmented Pre-trained Language Model (G-MAP), which augments the domain-specific PLM with a memory built from the frozen general PLM, without losing the general knowledge. Specifically, we propose a new memory-augmented layer and, based on it, explore different augmentation strategies to build the memory and fuse it into the domain-specific PLM. We demonstrate the effectiveness of G-MAP on different domains (biomedical and computer science publications, news, and reviews) and different kinds of tasks (text classification, QA, NER), and the extensive results show that the proposed G-MAP achieves state-of-the-art results on these tasks.


Introduction
Pre-trained language models (PLMs), such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019b), have achieved promising performance on NLP tasks. Typically, these general models are first pre-trained on a large unlabeled corpus and then directly fine-tuned on downstream tasks. However, there is an inherent gap in text distribution between the unlabeled pre-training corpus and the labeled task corpus, which leads to the distribution shift problem (Gururangan et al., 2020) and makes PLMs perform poorly on some domain tasks (Beltagy et al., 2019; Lee et al., 2020b). To address this shift problem, domain-adaptive pre-training (DAPT) has been proposed (Huang et al., 2019; Beltagy et al., 2019; Gururangan et al., 2020; Lee et al., 2020b) to further pre-train general PLMs on large-scale domain corpora, achieving better performance than general PLMs.

Figure 1: Masked LM (MLM) loss of RoBERTa on 50K randomly sampled documents from each domain before and after DAPT. Figures A and B denote the inference loss of general RoBERTa-base and domain-specific PLMs on the samples of biomedical (BM) and computer science (CS). Figure C shows the loss of these models on samples from the pre-training (PT) corpus of RoBERTa. We report the results of (Gururangan et al., 2020); a lower MLM loss is better.
Although DAPT can effectively learn the domain distribution of the target task, its continual pre-training process updates the parameters of general PLMs, which inevitably leads to partial general knowledge being forgotten. This catastrophic forgetting (Goodfellow et al., 2014; Li and Hoiem, 2016; Thompson et al., 2019) phenomenon is verified in Figure 1, where we observe that the domain-specific PLMs show better results than general PLMs on domain corpora, but perform worse on the general corpus. We argue that this forgotten knowledge is beneficial for domain-specific PLMs and should be used to improve their generalization ability on domain tasks.
To alleviate the catastrophic forgetting, we propose a simple yet effective memory-augmented framework named the General Memory-Augmented Pre-trained model (G-MAP). In addition to the backbone domain-specific PLM, G-MAP introduces a new memory-augmented layer. It explicitly incorporates the representation built from a frozen general PLM as the memory, giving the backbone model access to the complete general knowledge. A newly proposed memory-attention within the memory-augmented layer then enables the domain-specific PLM to adaptively combine the memory representation and the domain-specific representation. Using the memory built from the frozen general PLM has two advantages: (1) the frozen PLM never suffers from forgetting since its parameters remain unchanged (Levine et al., 2022); (2) it requires no additional training of the general PLM during fine-tuning. However, building and fusing the memory into a backbone model is essentially a many-to-many scenario: we need to choose which layer output of the general PLM serves as the memory representation, and into which layer of the domain-specific PLM it should be fused. Thus, we propose several memory-augmented strategies for building the memory representation and then combining it into the domain-specific PLM.
We evaluate our G-MAP on text classification, question answering (QA), and named entity recognition (NER) tasks covering four domains: biomedical science, computer science, news, and reviews. Experimental results demonstrate that G-MAP outperforms existing baselines on all tasks. We compare different memory-augmented strategies, and the results show that the proposed chunk-based gated memory transfer strategy achieves the best results. In addition, for building the memory representation, we empirically find that freezing the general PLM is better than not freezing it, and also more training-efficient. Furthermore, we apply the proposed framework in a small-scale domain pre-training setting and find that G-MAP is also effective at achieving a lower MLM loss. Our contributions are summarized below:
• We empirically find that general knowledge forgotten due to catastrophic forgetting can benefit domain-specific downstream tasks, since it improves PLMs' generalization ability.
• We propose G-MAP, a memory-augmented framework that incorporates the knowledge of a frozen general PLM into the domain-specific PLM through a new memory-augmented layer.
• We conduct extensive experiments on various domain-specific tasks, including text classification, QA, and NER; the results demonstrate that our G-MAP outperforms existing baselines.

The Method of G-MAP
In this section, we first give an overview of the G-MAP framework. Then we detail a new memory-augmented layer that fuses general knowledge into domain-specific PLMs. Finally, we propose different memory-augmented strategies, including single-layer memory transfer, multiple-layer memory transfer, gated memory transfer, and chunk-based gated memory transfer.

Overview
Our G-MAP framework aims to tackle the catastrophic forgetting of domain-specific PLMs by using a memory cache built from the frozen general PLM, as illustrated in Figure 2. Given a sequence $x = [x_1, x_2, \ldots, x_t, \ldots, x_n]$ with $x_t$ denoting the $t$-th token, the general PLM outputs the contextual representations of the input tokens as a memory cache, which is fed into the domain-specific PLM to build the final representation for domain tasks:

$$\mathcal{M} = \mathrm{PLM}_g(x;\, \theta_g), \qquad H = \mathrm{PLM}_d(x, \mathcal{M};\, \theta_d),$$

where $\theta_g$ and $\theta_d$ are the parameters of the general and domain-specific PLMs, respectively. We only update $\theta_d$; $\theta_g$ is frozen during fine-tuning. The general PLM can be a BERT or RoBERTa model, which contains $l$ layers of Transformer (Vaswani et al., 2017) encoder blocks and outputs a set of hidden states denoted as the memory cache $\mathcal{M} = \{M_1, M_2, \ldots, M_l\}$. In the G-MAP framework, the domain-specific PLM uses a new memory-augmented layer to adaptively incorporate the memory representation built from the memory cache $\mathcal{M}$ and thereby enhance its generalization ability. Specifically, if the $i$-th Transformer layer is a memory-augmented one, it takes the domain-specific representation $H_{i-1}$ from the previous layer and the memory representation $M_f$ as input and fuses them as follows:

$$H_i = \mathrm{Concat}(\mathrm{head}_{i,1}, \ldots, \mathrm{head}_{i,k})\, W^o,$$

where $M_f$ is the memory representation, either directly extracted from the memory cache $\mathcal{M}$ or constructed by an adaptive aggregation strategy, and has the same shape as the intermediate hidden state $H_i$ of the domain-specific PLM; $k$ is the number of heads and $W^o$ is a trainable parameter matrix. $M_f$ is linearly transformed into new pairs of (keys, values), which are appended after the domain-specific ones:

$$Q_{i,j} = H_{i-1} W^{q}_{i,j}, \quad K_{i,j} = [\,H_{i-1} W^{k}_{i,j};\; M_f W^{k}_{i,j}\,], \quad V_{i,j} = [\,H_{i-1} W^{v}_{i,j};\; M_f W^{v}_{i,j}\,],$$

where $W^{q}_{i,j}$, $W^{k}_{i,j}$, and $W^{v}_{i,j}$ are trainable parameters that generate the queries, keys, and values, respectively, and $j$ refers to the $j$-th attention head. Self-attention is then performed on the queries and the merged pairs of (keys, values):

$$\mathrm{head}_{i,j} = \mathrm{softmax}\!\left(\frac{Q_{i,j} K_{i,j}^{\top}}{\sqrt{d_k}}\right) V_{i,j},$$

where $d_k$ is the head dimension acting as a scaling factor. First, a unified attention matrix is computed by the standard scaled dot-product of each query against the keys of the general memory and the domain-specific keys. Then a softmax operation produces the normalized scores that weigh and sum the concatenated values. Without additional parameter updates for the general PLM, the domain-specific PLM can dynamically capture useful general knowledge and ignore noisy information through the memory-augmented layer.
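The memory-attention above can be sketched for a single head in NumPy. This is an illustrative sketch, not the authors' implementation: the function and parameter names (`memory_attention_head`, `W_q`, `W_k`, `W_v`) and the toy shapes are assumptions, and multi-head concatenation plus the output projection $W^o$ are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_attention_head(H_prev, M_f, W_q, W_k, W_v):
    """One head of the memory-attention: queries come only from the
    domain-specific hidden state H_prev, while the keys/values derived
    from the general-PLM memory M_f are appended after the
    domain-specific keys/values before the scaled dot-product."""
    Q = H_prev @ W_q                                        # (n, d_k)
    K = np.concatenate([H_prev @ W_k, M_f @ W_k], axis=0)   # (2n, d_k)
    V = np.concatenate([H_prev @ W_v, M_f @ W_v], axis=0)   # (2n, d_k)
    d_k = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))                # (n, 2n)
    return scores @ V                                       # (n, d_k)

# Toy shapes: sequence length 4, hidden size 8, head dim 8.
rng = np.random.default_rng(0)
H_prev = rng.normal(size=(4, 8))
M_f = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = memory_attention_head(H_prev, M_f, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

Note how each query attends over a single normalized distribution spanning both the domain-specific and the general-memory keys, so useful general knowledge can be weighted up and noisy memory entries weighted down in one softmax.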

Memory-Augmented Strategies
The remaining problem is how to build the memory representation $M_f$ from the memory cache $\mathcal{M}$, and which layer of the domain-specific PLM should be the memory-augmented layer that fuses $M_f$. Essentially, it is a many-to-many layer assignment problem between the general PLM and the domain-specific PLM. To study the effect of layer assignment, we propose and compare different strategies, as shown in Figure 3.

Single-Layer Memory Transfer
We first consider a single-layer memory transfer approach, where the last hidden state of the memory cache $\mathcal{M}$ is extracted as $M_f$ and then fused into one layer of the domain-specific PLM with memory-attention. We choose a layer near the top of the domain-specific PLM as the memory-augmented layer, which performs best in our experiments. This strategy does not require additional parameters.

Multiple-Layer Memory Transfer
The single-layer memory transfer may ignore the knowledge learned in the shallow layers of the general PLM. To enable layer-wise interaction between the general PLM and the domain-specific PLM, we propose a multiple-layer transfer strategy. This strategy leverages all hidden states from the memory cache $\mathcal{M}$ as memory representations and fuses them into the corresponding layers of the domain-specific PLM; it also does not introduce any new parameters.

Gated Memory Transfer
Multiple-layer memory transfer uses the hidden states of all layers of the frozen general PLM as memory representations, which inevitably introduces homogeneous and noisy information. To avoid this problem, we further propose the gated memory transfer strategy, which first exploits a token-level gate mechanism to adaptively weigh and sum the representations of different layers into a single memory representation, which is then fused into one layer of the domain-specific PLM. We again choose a layer near the top of the domain-specific PLM as the memory-augmented one, which achieves the best performance in our experiments. The gate fusion mechanism is formulated as:

$$\alpha_t^l = \underset{l}{\mathrm{softmax}}\big(g(m_t^l)\big), \qquad m_t^f = \sum_{l} \alpha_t^l\, m_t^l, \qquad t = 1, \ldots, n,$$

where $m_t^l$ is the representation of the $t$-th token at layer $l$, $n$ is the number of tokens, and $g$ is a linear layer. The softmax over the layer dimension calculates the importance of each layer for a given token, so the output token representation $m_t^f$ is obtained by weighing the $t$-th token representations from different layers with their corresponding importance $\alpha_t^l$. Finally, the built memory representation $M_f$ is fused into the memory-augmented layer.
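The token-level gate fusion can be sketched as follows (an illustrative NumPy sketch under assumed shapes; the gate's parameters `g_w`, `g_b` stand in for the linear layer $g$ and are not from the paper):

```python
import numpy as np

def gated_layer_fusion(memory_cache, g_w, g_b):
    """Token-level gated fusion of a layer-wise memory cache.
    memory_cache: (L, n, d) hidden states for L layers, n tokens, dim d.
    g_w: (d, 1), g_b: scalar -- a linear gate scoring each token per layer.
    Returns M_f: (n, d), a per-token weighted sum over the L layers."""
    scores = (memory_cache @ g_w).squeeze(-1) + g_b       # (L, n): g(m_t^l)
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    alpha = e / e.sum(axis=0, keepdims=True)              # softmax over layers
    return (alpha[..., None] * memory_cache).sum(axis=0)  # (n, d)

rng = np.random.default_rng(0)
cache = rng.normal(size=(12, 4, 8))   # e.g. 12 layers, 4 tokens, hidden 8
M_f = gated_layer_fusion(cache, rng.normal(size=(8, 1)), 0.0)
print(M_f.shape)  # (4, 8)
```

With a zero gate the softmax is uniform and the fusion degenerates to a plain mean over layers, which makes the role of the learned gate easy to see.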

Chunk-based Gated Memory Transfer
Previous work (Liu et al., 2019a; Phang et al., 2021) has observed that the representations from the upper and lower layers of pre-trained language models differ significantly. Motivated by this observation, we further propose a chunk-based variant of the gated memory transfer, which separates the layers of the general PLM into a high-level chunk and a low-level chunk, and then applies the gate fusion strategy to obtain upper- and lower-layer memory representations $M_f^h$ and $M_f^l$, respectively. Finally, we fuse them into two memory-augmented layers in the domain-specific PLM; the details of different layer selections for this strategy are presented in Section 4.3.

Experiments
In this section, we first introduce the evaluation tasks and metrics. Then we describe the baseline methods and implementation settings. Finally, we conduct the experimental analysis of G-MAP.

Datasets and Metrics
Datasets We evaluate our model on three tasks: text classification, QA, and NER. For text classification, we conduct experiments on eight tasks covering four domains: CHEMPROT (Kringelum et al., 2016) and RCT (Dernoncourt and Lee, 2017) in the biomedical domain, ACL-ARC (Jurgens et al., 2018) and SCI-ERC (Luan et al., 2018) in the computer science domain, HYPERPARTISAN (Kiesel et al., 2019) and AGNEWS (Zhang et al., 2015) in the news domain, and HELPFULNESS (McAuley et al., 2015) and IMDB (Maas et al., 2011) in the reviews domain. Following (Gururangan et al., 2020), we use micro-F1 as the metric for CHEMPROT and RCT, and macro-F1 for the other datasets. For NER, we use two datasets: NCBI-Disease (Dogan et al., 2014) in the biomedical domain and CoNLL-2003 (Sang and Meulder, 2003) in the news domain. We use the F1 score as the evaluation metric.
For QA, we utilize two datasets: Medication (Pampari et al., 2018) in the biomedical domain and NewsQA (Trischler et al., 2017) in the news domain. We use Exact-Match (EM) and the F1 score as the evaluation metrics. The detailed description and statistics of each task are given in Appendix A.

Baselines
In our experiment, all the baselines are built on RoBERTa-base. The baselines for the text classification tasks are as follows:
• Fine-Tuning: directly fine-tuning the general PLM on downstream tasks.
• DAPT (Gururangan et al., 2020): pre-training the general PLM on large-scale unlabeled domain corpora to obtain the domain-specific PLM, then fine-tuning it.
• Logits Fusion: a straightforward method that combines the frozen general PLM and the domain-specific PLM by adding their logits. It is optimized end-to-end and does not include any memory-augmented strategies.
• Ensemble LMs: an ensemble method that adds the predicted probabilities of the fine-tuned general and domain-specific PLMs for the final prediction.
• TAPT (Gururangan et al., 2020): task-adaptive pre-training continues to pre-train the PLM on the training dataset, after which we fine-tune it on the downstream tasks.
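The Logits Fusion and Ensemble LMs baselines differ only in where the two models' outputs are combined, before or after the softmax. A toy sketch (the logit values are made up for illustration, not from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative 3-class logits from the two models.
logits_general = np.array([2.0, 0.5, -1.0])   # frozen general PLM
logits_domain  = np.array([0.5, 2.5, -0.5])   # domain-specific PLM

# Logits Fusion: add logits before the softmax (trained end-to-end).
fusion_pred = int(np.argmax(logits_general + logits_domain))

# Ensemble LMs: add the predicted probabilities of the two fine-tuned models.
ensemble_pred = int(np.argmax(softmax(logits_general) + softmax(logits_domain)))

print(fusion_pred, ensemble_pred)  # 1 1
```

The two combinations can disagree in general, since summing logits weights confident models multiplicatively while summing probabilities weights them additively.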
For the NER and QA tasks, in addition to the above baselines, we also include KALA (Kang et al., 2022). Since KALA is only evaluated on QA and NER, we use it as a baseline for these two tasks.
• KALA: constructs an entity memory and knowledge graph for the task-specific domain and augments the PLM with this additional knowledge.

Implementation
We implement the G-MAP framework based on RoBERTa-base. For the domain-specific PLM, we use the released pre-trained weights of DAPT. More details on fine-tuning for the downstream tasks are given in Appendix B.

Results and Analysis
Our experiment results on the domain-specific classification tasks are shown in Tables 1 and 2; the results of the QA and NER tasks are shown in Table 3.

Performance on Domain Classification Tasks
From Table 1, we observe that G-MAP with the proposed chunk-based gated memory transfer achieves better results than all the baselines, which proves that incorporating memory from the frozen general PLM is beneficial for the domain-specific PLM. Specifically, the chunk-based gated memory transfer strategy outperforms the other strategies; we conjecture that it adaptively selects token-level information across different layers and adequately utilizes the general knowledge from both the high-level and low-level chunks. However, we also observe that multiple-layer memory transfer yields little improvement over the baselines, because it incorporates excessive redundant and noisy information from the general PLM without the proposed gated fusion. Besides, single-layer memory transfer is a simple yet effective strategy, outperforming both the baselines and the non-gated multiple-layer memory transfer. Since the chunk-based gated memory transfer strategy achieves the best performance, we use it as the default memory-augmented strategy within G-MAP in the following experiments.
Comparison with Further TAPT Further task-adaptive pre-training (TAPT) has been shown to improve domain-adaptive pre-training (DAPT) (Gururangan et al., 2020). To demonstrate the effectiveness of G-MAP on TAPT, we build a G-MAP framework that replaces the domain-specific pre-trained PLM with the task-adaptive pre-trained PLM. From Table 2, we find that our G-MAP also outperforms DAPT+TAPT on all datasets, indicating that the proposed framework generalizes across backbone models, including both domain-adaptive and task-adaptive PLMs.
Effectiveness for QA and NER We also evaluate G-MAP on the QA and NER tasks, with the results shown in Table 3. Our method achieves better results than the baselines on all datasets, in particular KALA (Kang et al., 2022), which spends considerable effort constructing an entity memory and knowledge graph from the contexts. These results further demonstrate the effectiveness of G-MAP.

Further Discussion
In the following sections, we conduct a detailed analysis of G-MAP to demonstrate the effectiveness of the frozen general PLM and the memory-attention. Moreover, we apply the proposed framework in the pre-training stage and study the effect of layer selection on performance.

Effectiveness of Frozen Memory
We compare the frozen and unfrozen ways of building the general memory representation, with the results shown in Table 4. We observe that the frozen method is better than the unfrozen one on all datasets, and both are better than the baseline DAPT. We argue that using the frozen memory has two advantages: (1) it is more efficient in training, since the parameters of the general PLM are not updated; (2) it keeps the general knowledge of the PLM unchanged during fine-tuning, so it does not suffer from the forgetting problem.

Effectiveness of Memory-Attention
To study the effectiveness of our proposed memory-attention module, we compare it with other attention-based variants within a memory-augmented layer. Specifically, cross-attention is an attention module widely used in multi-modal learning (Li et al., 2021; Zeng et al., 2021); we apply it to adaptively fuse the memory representation $M_f$ and the output representation of the self-attention module. We also include gate-attention (Wu et al., 2022) as a fusion baseline, which utilizes a gate mechanism to weigh and sum the local and external memory for long-sequence modeling. As shown in Table 5, our memory-attention module outperforms the other variants without additional trainable parameters.

Effect of Layer Selection

Except for the multiple-layer memory transfer strategy, all strategies require a layer selection. For the single-layer and gated memory transfer strategies, we fuse the memory representation $M_f$ into different layers {3, 6, 9, 12} of the domain-specific PLM and find that using layer 9 as the memory-augmented layer achieves the best performance for both strategies. We present more detailed results in Appendix C. For the chunk-based gated memory transfer strategy, we experiment with transferring the memory representation of the high-level chunk into layers 7 to 12 and that of the low-level chunk into layers 1 to 6, keeping the same layer interval between the two memory-augmented layers in the domain-specific PLM. The experimental results are shown in Figure 4. They exhibit an increasing tendency as the memory-augmented layers are placed closer to the top of the domain-specific PLM. Finally, we choose layers 6 and 12 as the memory-augmented layers for the chunk-based gated memory transfer strategy.

Apply G-MAP in the Pre-training Stage
In the previous experiments, we incorporated the domain-specific PLM into the G-MAP framework in the fine-tuning stage. In this section, we further study whether G-MAP is beneficial for the pre-training stage. To this end, we randomly sample 50K documents each from the general, biomedical, and computer science domains (Lo et al., 2020). We randomly split 70% of the data from each domain as pre-training samples and use the rest as test samples. More details on pre-processing the domain samples are given in Appendix B. We then pre-train the models on the pre-training samples and calculate the masked LM loss on the test samples. From the experiment results shown in Figure 5, we observe that, compared with the baseline DAPT, utilizing G-MAP reduces the masked LM loss on the biomedical, CS, and general domains. These results demonstrate that the proposed G-MAP also mitigates catastrophic forgetting during adaptive pre-training.

Related work
Domain Adaptation for PLMs Recently, the domain shift problem of PLMs has attracted increasing research attention (Beltagy et al., 2019; Huang et al., 2019; Lee et al., 2020b; Gururangan et al., 2020), since the domain discrepancies between the pre-training corpora and the downstream tasks can lead to a significant performance drop. To bridge the domain gap, SciBERT (Beltagy et al., 2019) and BioBERT (Lee et al., 2020b) further pre-train BERT with 1.14M scientific papers from the Semantic Scholar corpus and with biomedical documents, respectively, improving performance on domain-specific NLU tasks compared with general BERT. Gururangan et al. (2020) proposed domain-adaptive pre-training (DAPT) and task-adaptive pre-training (TAPT). Concretely, DAPT continues to pre-train the PLM on domain-specific corpora, while TAPT directly pre-trains the PLM on the task dataset. Moreover, BT-TAPT (Lee et al., 2021) inherits the crucial step of TAPT and leverages back-translation to augment the task data and improve PLM performance. TAPTER (Nishida et al., 2021) equips TAPT with domain-specific word-embedding regularization to improve fine-tuning performance. However, the above approaches suffer from catastrophic forgetting of general-domain knowledge after adaptive pre-training, which leads to sub-optimal performance on downstream tasks.
Catastrophic Forgetting Catastrophic forgetting is a common phenomenon in continual learning: a training model forgets previously learned knowledge and over-fits to new tasks (Mccloskey and Cohen, 1989). Typically, regularization-based methods (Goodfellow et al., 2014; Kirkpatrick et al., 2016; Serrà et al.; Deng et al., 2021) exploit regularization to constrain parameter updates and alleviate forgetting, while memory-based methods (Guo et al., 2020; Saha et al., 2021) mitigate forgetting by storing important samples from past tasks in an external memory and rehearsing them via gradient transformation strategies. In addition, plenty of work has addressed catastrophic forgetting in NLP tasks. Dakwale and Monz (2017) minimized the KL-divergence of prediction losses between the fine-tuned and general-domain models as a regularization term. Lee et al. (2020a) introduced a new regularization technique that mixes the PLM parameters with the vanilla parameters instead of stochastic dropout. Chen et al. (2020) adopted multi-task learning to jointly learn pre-training and downstream tasks with less forgetting during fine-tuning. Xie et al. (2021) preserved the model neurons of general and language-specific parts during fine-tuning. Our method is orthogonal to the above approaches: we incorporate the domain-specific PLM with the memory representation built from the frozen general PLM to address the forgetting issue, without adding regularization terms to the model or using an external memory to preserve samples from past tasks.

Knowledge-Enhanced PLMs
Knowledge-enhanced methods have shown effectiveness for PLMs via introducing internal or external knowledge. To improve fine-tuning performance, REINA (Wang et al., 2022) retrieves the labeled training instances most similar to the input data and concatenates them before feeding them into PLMs. RETRO (Borgeaud et al., 2021) enhances an auto-regressive language model by leveraging a pre-trained frozen BERT model to retrieve related texts and then using a chunked cross-attention module to incorporate them. Memorizing Transformer (Wu et al., 2022) leverages a learned gate to combine the attention results of the local context and the external context retrieved from previously seen sub-sequences. KALA (Kang et al., 2022) is the approach most relevant to our work: it incorporates intermediate hidden representations with domain-specific entities and their relational facts during task-specific fine-tuning for domain tasks. In contrast, our method requires neither retrieving similar texts nor constructing additional knowledge graphs. We propose several memory-augmented strategies to build the memory representation and then transfer it into the domain-specific PLM to mitigate the forgetting of general knowledge.

Conclusion
In this work, we propose G-MAP, a novel framework that utilizes a memory-augmented layer to fuse the memory representation built from the frozen general PLM, mitigating the catastrophic forgetting of general knowledge caused by domain-adaptive pre-training. We explore different memory-augmented strategies to construct the memory representation and empirically find that chunk-based gated memory transfer achieves the best performance. We validate G-MAP on classification, QA, and NER tasks across various domains. The results show that our method consistently outperforms existing baselines on all datasets, implying that explicitly leveraging forgotten general knowledge is beneficial for domain-specific downstream tasks.

Limitations
Our G-MAP framework has been validated on domain-specific tasks and a small-scale domain pre-training experiment in Section 4.4. Due to the lack of large GPU resources, we have not validated our G-MAP framework in large-scale pre-training, a more challenging setting that we leave as future work. We also consider automatic layer selection an under-studied problem and believe that AutoML techniques (Pham et al., 2018; Tan and Le, 2019), such as evolutionary search (Deb et al., 2002; Chen et al., 2021), will be promising methods. Finally, the proposed framework is built on an encoder-only model, RoBERTa-base. In the future, we will apply our framework to other types of architectures, such as the decoder-only GPT (Radford et al., 2018) and the encoder-decoder BART (Lewis et al., 2020).

A Dataset Descriptions and Statistics
This section describes the details and statistics of three tasks: domain classification, domain extractive question answering (QA), and named entity recognition (NER).
For the text classification tasks, we leverage the following datasets covering four domains: biomedical science, computer science, news, and reviews. In the biomedical domain, CHEMPROT (Kringelum et al., 2016) is a relation classification dataset based on chemical-protein interactions; RCT (Dernoncourt and Lee, 2017) is a role classification task constructed from the abstracts of biomedical articles. In the computer science domain, ACL-ARC (Jurgens et al., 2018) is a task of annotating citations with their functions; SCIERC (Luan et al., 2018) is constructed from scientific abstracts annotated with relations. In the news domain, HYPERPARTISAN (Kiesel et al., 2019) is a news text classification task for determining partisan leanings; AGNEWS (Zhang et al., 2015) is a topic classification task for news. In the reviews domain, AMAZON (McAuley et al., 2015) is a binary classification task consisting of feedback on products; IMDB (Maas et al., 2011) consists of movie reviews and is a binary sentiment classification dataset.
For the NER tasks, we use two datasets covering the news and biomedical domains. Concretely, CoNLL-2003 (Sang and Meulder, 2003) consists of news stories from the Reuters Corpus, and NCBI-Disease (Dogan et al., 2014) is annotated with disease mentions.
For the QA tasks, we utilize two domain-specific datasets. Specifically, NewsQA (Trischler et al., 2017) is a machine comprehension dataset consisting of news articles, and Medication (Pampari et al., 2018) is constructed from electronic medical records in clinical text.
The detailed statistics of the text classification and NER tasks are shown in Table 7, and those of the QA tasks in Table 8.

B Implementation Details
We use the huggingface (Wolf et al., 2020) library to implement our G-MAP framework, which contains various transformer-based pre-trained language models (PLMs) and their saved checkpoints. We implement the DAPT, TAPT, and DAPT+TAPT models of the biomedical, CS, news, and reviews domains from the library released by Gururangan et al. (2020). All experiments are run on Nvidia V100 GPUs with 32GB memory. We select the best checkpoint on the validation set during training for inference on the test set.
Configurations for Classification Tasks In this section, we explain the fine-tuning settings for the domain-specific classification tasks. We fine-tune the domain-specific PLM with our G-MAP framework for 5 to 15 epochs depending on the dataset, with the same learning rate of 4e-5 and a dropout rate of 0.5. The default number of classification layers is 1, except 2 for the IMDB dataset, and the default maximum sequence length is 256, except 512 for IMDB. We use the Adam optimizer to schedule the learning rate, with an Adam epsilon of 1e-8, Adam beta-1 of 0.9, and Adam beta-2 of 0.999. We apply grid search to find the optimal batch size and number of GPUs for all classification datasets. The detailed settings are shown in Table 9.
Configurations for QA and NER For the extractive QA tasks, we fine-tune the domain-specific PLM with our G-MAP framework for 3 epochs, which suffices to converge to optimal performance. We train the model with a maximum sequence length of 384, a learning rate of 3e-5, a weight decay rate of 1e-2, and a warm-up rate of 6e-2. For the NER tasks, we fine-tune on NCBI-Disease for 20 epochs and on CoNLL-2003 for 15 epochs, with the same maximum sequence length of 128, a learning rate of 5e-5, and the weight decay and warm-up rates both set to 0. Unlike for the domain classification tasks, we utilize AdamW as the learning rate optimizer instead of Adam. We again adopt grid search to find the optimal batch size and number of GPUs for all tasks. The detailed settings are shown in Table 10.

Configurations for small-scale Pre-training
This part describes the experimental settings of adaptive pre-training with the G-MAP framework. Our small-scale pre-training experiment requires external corpora for two domains: biomedical and computer science. Following (Gururangan et al., 2020), we adopt SCISPACY (Neumann et al.) to pre-process the domain corpora (Lo et al., 2020). After pre-processing the corpora, we randomly sample 50K documents for each domain and split 70% of them as pre-training sets and 30% as test sets. For a general corpus similar to RoBERTa's pre-training corpus, we also randomly sample 50K documents from BOOKCORPUS and split them using the method mentioned above. The detailed hyper-parameter settings for this cross-domain adaptive pre-training are shown in Table 6.

Figure 6: Performance of different layer-selection indexes of memory-attention for the single-layer and gated memory transfer strategies.

C Layer Selection Experiment
For the single-layer and gated memory transfer strategies, we experiment with adding the memory-attention to layers 3, 6, 9, and 12 of the 12-layer RoBERTa-base model, with the results shown in Figure 6. We empirically find that adding the memory-attention to the 9-th layer of the domain-specific model as the memory-augmented layer obtains the best results for both strategies, while adding it to either a much higher or lower layer yields smaller gains. Therefore, we adopt the memory-attention on the 9-th layer as the default choice for these two strategies in the main experiments shown in Table 1.

Figure 2 :
Figure 2: The G-MAP framework with a CS-domain task input. PLM-G denotes the frozen general PLM; PLM-D denotes the domain-specific PLM.

Figure 3 :
Figure 3: Memory-augmented strategies of the G-MAP framework. We take a 6-layer model as an example.

Figure 4 :
Figure 4: Performance of different layer selections in the chunk-based gated memory transfer strategy.

Figure 5 :
Figure 5: Masked LM loss for the pre-training stage (a lower value is better). PT denotes samples similar to RoBERTa's pre-training corpus. DAPT(BM) denotes the domain-specific PLM for the biomedical domain; G-MAP(BM) denotes the G-MAP framework with the biomedical-domain backbone. For instance, panel A represents further pre-training of the models on the biomedical pre-training samples and then inferring their MLM loss on the test samples.

Table 2 :
The experimental results of G-MAP compared with TAPT and DAPT+TAPT on domain classification tasks.
* denotes the G-MAP framework built with a backbone PLM pre-trained with DAPT and then TAPT.

Table 3 :
The experimental results of the extractive QA and NER tasks in the biomedical and news domains. We use Exact Match and the F1 score as metrics for the QA tasks (Medication and NewsQA), and the F1 score for the NER tasks (NCBI-Disease and CoNLL-2003).

Table 4 :
Results of utilizing the frozen and unfrozen general PLMs.

Table 7 :
Statistics of the classification and NER tasks across four domains: Biomedical, Computer Science, News, and Reviews. † indicates high-resource settings.

Table 8 :
Statistics of QA tasks, including News and Biomedical domains.We report the number of contexts and questions of the two datasets.

Table 9 :
Hyperparameters for fine-tuning on the eight classification tasks of four domains; we use these hyperparameters to report the performance of our proposed G-MAP framework in the main paper.

Table 10 :
Hyperparameters for fine-tuning on the QA and NER tasks of the biomedical and news domains; we use them to report the performance of our proposed G-MAP framework in the main paper.