Continual Machine Reading Comprehension via Uncertainty-aware Fixed Memory and Adversarial Domain Adaptation

Continual Machine Reading Comprehension (MRC) aims to learn incrementally from a continuous data stream without access to previously seen data, which is crucial for the development of real-world MRC systems. However, it is a great challenge to learn a new domain incrementally without catastrophically forgetting previous knowledge. In this paper, we propose MA-MRC, a continual MRC model with uncertainty-aware fixed Memory and Adversarial domain adaptation. In MA-MRC, a fixed-size memory stores a small number of samples from previous domain data and is updated with an uncertainty-aware strategy when new domain data arrives. For incremental learning, MA-MRC not only maintains a stable understanding by learning from both the memory and the new domain data, but also makes full use of the domain adaptation relationship between them via an adversarial learning strategy. Experimental results show that MA-MRC is superior to strong baselines and exhibits substantial incremental learning ability without catastrophic forgetting under two different continual MRC settings.


Introduction
Recently, Machine Reading Comprehension (MRC) has attracted wide attention and achieved remarkable success in solving specific tasks in stationary environments, such as answering factual questions over Wikipedia articles or answering narrative questions over web search logs (Seo et al., 2017; Seonwoo et al., 2020; Zhang et al., 2021; Wu and Xu, 2020). However, the answering scenario changes over time in real-world applications. For example, a dialog system should continuously adapt to new user requirements (Abujabal et al., 2018; Madotto et al., 2021). In this paper, we focus on one of the most typical scenario changes for MRC tasks: the domain data shift. Existing stationary-trained MRC systems are usually trained on in-domain data but applied to new domain data (Fisch et al., 2019). Therefore, it is necessary to build a non-stationary MRC system that learns continually from incremental domain data. We formulate this challenging problem as the Continual MRC task, which requires incrementally learning over sequential domain data and performing well on all seen domains. Figure 1 illustrates the incremental learning and testing process. In this scenario, the MRC system can only be trained on the latest incoming domain data, without access to previously seen data. If we directly fine-tune the MRC model on each new incoming domain, the performance on earlier domains drops significantly (Su et al., 2020). Another naive approach is to retrain the whole MRC model from scratch, but this is costly and time-consuming. Hence, learning incrementally without largely forgetting previously acquired knowledge is a great challenge.

* Hua Xu is the corresponding author.
Existing studies for continual MRC can mainly be divided into three categories. The first is model expansion, which designs a domain-individual classifier for each incoming domain (Su et al., 2020); however, this is expensive and impractical in the real world. The second borrows the regularization idea, utilizing an additional loss term to aid knowledge consolidation when learning new domains. For example, Su et al. (2020) added a penalty that restricts the change of important parameters to prevent forgetting previous knowledge. The third is episodic-memory-based methods. For example, de Masson d'Autume et al. (2019) introduced a key-value memory module that stores previously seen examples for sparse experience replay and gradient-based local adaptation. Abujabal et al. (2018) proposed a template-based never-ending KB-QA system that learns new templates by capturing new syntactic structures with a semantic similarity function between questions and user feedback. However, the above methods still have two main limitations. On the one hand, to prevent catastrophic forgetting, they only impose consistency constraints on model outputs or gradients between previous and new domains, ignoring the domain adaptation relationship between them, even though transfer learning can help the MRC model generalize to other domains. On the other hand, the memory update strategies for continual MRC are limited. Some previous work stores a fixed number of examples for each incoming domain, which greatly grows the number of samples kept in memory and leads to expensive costs. Other methods limit the maximum memory across all seen domains but update the memory by random sampling, ignoring how much different samples have been forgotten. In fact, a continual model should pay more attention to samples that are more likely to be forgotten.
To address these limitations, this paper proposes MA-MRC, an incremental model that solves the continual MRC task via an uncertainty-aware fixed Memory and Adversarial domain adaptation. Concretely, MA-MRC 1) introduces a fixed-size memory to store a small number of samples from previous domains, which are periodically replayed when learning a new domain; 2) updates the memory with an uncertainty-aware strategy that takes the forgotten degree of previous data into account; and 3) leverages an adversarial learning strategy with a domain discriminator to make full use of the domain adaptation relationship between different domains, which helps generalization and avoids overfitting to the very small set of memorized examples. The intuition is to mimic the human learning process of replaying memories while adapting to new domains.
The key contributions of this work are: (1) This paper proposes a continual MRC model, MA-MRC, which learns new domain data incrementally. (2) Applying uncertainty-aware memory and adversarial learning to the MRC model contributes to strong incremental learning ability. (3) Experimental results on two different continual MRC settings indicate that MA-MRC obtains good incremental learning ability without largely forgetting, and significantly outperforms strong baselines.

Continual Learning
Continual Learning (CL) mainly aims to overcome the catastrophic forgetting problem when learning sequential new tasks incrementally (French, 1999). Existing work follows three directions: architectural, regularization, and memory-based approaches. The architectural methods change the network's architecture and add task-specific parameters, e.g., Dynamically Expandable Network (Yoon et al., 2018) and Reinforced Continual Learning (Xu and Zhu, 2018). The regularization-based techniques aid knowledge consolidation when learning new tasks. For instance, EWC and Online EWC (Kirkpatrick et al., 2017; Schwarz et al., 2018) slow down the learning of parameters important for previous tasks. The third class saves previous samples and learns a new task with a forgetting loss defined on them, such as GEM (Lopez-Paz and Ranzato, 2017), A-GEM (Chaudhry et al., 2019), DER (Buzzega et al., 2020), and MER (Riemer et al., 2019).

CL in MRC.
Few previous studies apply continual learning to MRC. Su et al. (2020) and Abujabal et al. (2018) designed episodic-memory-based methods that store training samples from previously seen data, which are later rehearsed when learning new domains. In this paper, we solve the continual MRC problem of incrementally learning over sequential domains, and build our continual model on the above memory-based and penalty-regularization paradigms.

Domain Adaptation
Domain adaptation aims to learn discriminative features that generalize to new domains, and is usually achieved by learning domain-invariant features (Ben-David et al., 2010; Ganin et al., 2016). For MRC under domain shift, previous methods (Wang et al., 2019; Cao et al., 2020; Lee et al., 2019) reduce domain discrepancy with a discriminator network trained to distinguish target-domain features from source-domain features. However, such work usually transfers the model from a source to a target domain once; in this paper, we explore domain adaptation under the continual setting.

Problem Statement
For the task of continual MRC, we assume that the learning framework has access to a stream of MRC data from T different domains, denoted by Data = {D_1, D_2, ..., D_T}. Each domain's data, e.g., D_t = {(P_n, Q_n, A_n)}_{n=1}^{|D_t|}, consists of a series of <passage, question, answer> triples, where |D_t| is the number of samples in D_t. The MRC model is required to learn continually over each incoming domain. More specifically, at step t, the model observes only D_t, and is required to perform well on both the t-th domain and the previous t − 1 domains; hence, after training on D_t, the model is evaluated on all t seen domains. To make the MRC model perform well on previous domains, during the continual learning process a memory M stores a small number of samples from previously seen domain data, with a fixed size |M| ≪ |D_old|. This paper focuses on span-extractive MRC. Inspired by Su et al. (2020), we study two different continual domain settings, CDA-C and CDA-Q, which define domains according to paragraph type and question type, respectively.

At each step t, MA-MRC incrementally learns θ_t by updating θ_{t−1} and the domain discriminator θ_D with L_final on D_t and M_{t−1}; it then calculates an importance weight, based on uncertainty, for each sample in M_{t−1}. In addition, MA-MRC utilizes knowledge distillation to encourage the model to mimic its original responses for past domain samples. After finishing training at the t-th step, we update the memory with an uncertainty-aware sampling strategy that focuses on remembering what the model most needs. More details about the training process are shown in Algorithm 1.
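The overall training loop described above can be sketched as follows. All helper functions are trivial stand-ins for the real training and sampling code (hypothetical names, not the paper's implementation); the sketch only shows the control flow of initial training, replay-based incremental training, and memory updating.

```python
# Sketch of the continual training loop; helpers are toy stand-ins.
import random

def train_on(theta, data, memory=()):
    # Stand-in for one incremental training phase minimizing L_final
    # on the current domain plus replayed memory samples.
    return theta + [("trained_on", len(data), len(memory))]

def update_memory(memory, new_data, mem_size):
    # Stand-in for the uncertainty-aware update: keep some old samples,
    # fill the remaining fixed-size slots with current-domain samples.
    kept = random.sample(memory, mem_size // 2) if memory else []
    new = random.sample(new_data, mem_size - len(kept))
    return kept + new

def continual_train(domains, mem_size=4):
    theta = train_on([], domains[0])              # initial training on D_1
    memory = random.sample(domains[0], mem_size)  # fixed-size memory M_1
    for D_t in domains[1:]:
        theta = train_on(theta, D_t, memory)      # replay memory + new domain
        memory = update_memory(memory, D_t, mem_size)
    return theta, memory

random.seed(0)
domains = [list(range(i * 10, i * 10 + 10)) for i in range(3)]
theta, memory = continual_train(domains)
```

Note that the memory stays at a fixed size throughout, while the model parameters are updated once per incoming domain.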

Initial Training
Note that there is only one domain at the beginning, so the initial training on D_1 can be seen as a standard extractive MRC task. Therefore, we build a normal backbone MRC model θ_1 (a standard BERT-MRC model with a Transformer encoder and an answer prediction module) on the first domain data D_1. We also initialize a fixed-size memory M_1 to keep previous training samples, which are periodically replayed while learning new domains.

Prepare Backbone Model

Transformer Encoder Module: First, a pre-trained Transformer encoder of L blocks is used to convert the input sequence S = [[CLS], Q, [SEP], P, [SEP]] into contextual representations. Then, the output of the last block, H^L = BERT(S) ∈ R^{l×h}, is taken as the contextual representation, where h is the hidden dimension of BERT and l is the sequence length.
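The input packing can be illustrated with a minimal sketch. The whitespace tokenizer below is a hypothetical stand-in (BERT actually uses WordPiece subwords); the point is only the [CLS]/[SEP] layout and the length cap of 384 used in the experiments.

```python
# Sketch of BERT-style input construction for <question, passage> pairs.
def build_input(question, passage, max_len=384):
    """Concatenate question and passage into one [CLS] Q [SEP] P [SEP] sequence."""
    tokens = (["[CLS]"] + question.split() + ["[SEP]"]
              + passage.split() + ["[SEP]"])
    return tokens[:max_len]  # truncate to the maximum sequence length

seq = build_input("Who wrote Hamlet ?",
                  "Hamlet was written by Shakespeare .")
```

The encoder then maps this token sequence to contextual vectors H^L of shape (l, h).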

Answer Prediction Module:
A linear layer is applied to the contextual representation H^L to calculate the probability distributions of the start and end positions of the candidate answer:

p^start = softmax(H^L W_s),    p^end = softmax(H^L W_e),

where W_s, W_e ∈ R^h are learnable parameters.
Objective Function: The loss function of the backbone MRC is the cross-entropy:

L_span = −(log p^start_{y_s} + log p^end_{y_e}),

where y_s and y_e are the ground-truth start and end indices of the corresponding sample, respectively.
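The answer prediction module and span loss can be sketched numerically. The tensors below are random toy stand-ins for the encoder output and the learnable projection vectors; only the computation pattern (two linear scorings, two softmaxes, negative log-likelihood of the gold indices) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
l, h = 6, 8                        # toy sequence length and hidden size
H = rng.normal(size=(l, h))        # stand-in for contextual representations H^L
W_s = rng.normal(size=h)           # learnable start-position weights
W_e = rng.normal(size=h)           # learnable end-position weights

def softmax(z):
    z = z - z.max()                # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

p_start = softmax(H @ W_s)         # P(answer starts at position i)
p_end = softmax(H @ W_e)           # P(answer ends at position j)

y_s, y_e = 2, 4                    # toy ground-truth span indices
L_span = -(np.log(p_start[y_s]) + np.log(p_end[y_e]))
```

At inference time, the predicted span is the (i, j) pair maximizing p_start[i] + p_end[j] subject to i ≤ j.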

Memory Initialization
For the first domain, to preserve the data distribution of the current domain as much as possible, we randomly select M training samples from D 1 as the initial memory M 1 .
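A minimal sketch of this initialization, with toy records standing in for the <passage, question, answer> triples:

```python
# Initialize the fixed-size memory M_1 by uniform random sampling from D_1,
# which preserves the first domain's data distribution in expectation.
import random

random.seed(0)
D1 = [{"id": i} for i in range(10_000)]  # toy stand-in for domain-1 triples
M_SIZE = 400                             # fixed memory budget |M|

memory = random.sample(D1, M_SIZE)       # sample without replacement
```

Later steps replace this uniform choice with the uncertainty-aware update described below, but the memory size never grows beyond M_SIZE.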

Learning for New Domain
When the t-th domain data arrives, we can access the current data D_t and the memory M_{t−1}. If we simply fine-tune the backbone model, it is hard for the model to achieve good transfer ability, and it may overfit the few memorized samples. Hence MA-MRC utilizes an adversarial learning strategy to make full use of the domain adaptation relationship between the previous and current domains. After the t-th training step, we dynamically update the memory M with an uncertainty-aware strategy to store the training samples that the model most needs to replay.

Memory-based Adversarial Training
To make full use of the domain adaptation relationship, we adversarially learn domain-invariant, transferable representations of the memory and current domain data. Inspired by domain adaptation theory, MA-MRC introduces a domain discriminator and builds a two-player min-max game. The first player, a domain discriminator D, distinguishes memory data from current new-domain data. Here D is a basic binary discriminator with three linear layers followed by a sigmoid activation function. The second player, the Transformer encoder T, aims to learn features that confuse D. We utilize the empirical Maximum Mean Discrepancy as a distance measure to reduce the difference between the marginal representation distributions and make them similar:

MMD(M, D_t) = || (1/M) Σ_{x∈M} T(x) − (1/|D_t|) Σ_{x'∈D_t} T(x') ||.

Finally, this learning procedure can be described by the following minimax game:

min_T max_D  (1/M) Σ_{x∈M} log D(T(x)) + (1/|D_t|) Σ_{x'∈D_t} log(1 − D(T(x'))),

where M and |D_t| are the numbers of training samples from the memory and the current domain, respectively.
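The two players' losses can be sketched with toy feature batches. The features and the one-layer discriminator below are illustrative stand-ins (the paper's D has three linear layers, and the features come from the Transformer encoder); the sketch shows the opposed objectives and an empirical linear-kernel MMD.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy pooled features T(x) for a memory batch and a current-domain batch.
f_mem = rng.normal(loc=+0.5, size=(8, 4))
f_cur = rng.normal(loc=-0.5, size=(8, 4))

w = rng.normal(size=4)        # one-layer stand-in for the discriminator D
d_mem = sigmoid(f_mem @ w)    # D's probability that a sample is from memory
d_cur = sigmoid(f_cur @ w)

# Player 1: D maximizes log-likelihood of the true domain labels
# (equivalently minimizes this binary cross-entropy L_D).
L_D = -(np.log(d_mem).mean() + np.log(1.0 - d_cur).mean())

# Player 2: the encoder T pushes in the opposite direction, trying to
# make memory and current-domain features indistinguishable to D.
L_T = -L_D

# Empirical MMD between the two batches (linear kernel: distance of means).
mmd = np.linalg.norm(f_mem.mean(axis=0) - f_cur.mean(axis=0))
```

In practice the two objectives are optimized alternately (or via a gradient-reversal layer), so the encoder gradually reduces both the discriminator's accuracy and the MMD gap.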

Knowledge Distillation
It is obvious that a good continual model should learn new domain data well while approximating the behavior observed on old ones. We therefore leverage a knowledge distillation constraint to encourage MA-MRC to mimic its original responses for past samples. Specifically, we seek to minimize the distance between the corresponding pre-softmax responses to preserve the knowledge about previous memory data:

L_KL = KL( softmax(logits_old) || softmax(logits_new) ),

where the logits are the model outputs before softmax and KL is the KL divergence.
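A small numerical sketch of this distillation term, with toy logit vectors standing in for the saved (old-model) and current (new-model) responses on a memory sample:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL divergence KL(p || q) for dense probability vectors.
    return float(np.sum(p * np.log(p / q)))

old_logits = np.array([2.0, 0.5, -1.0, 0.0])  # saved responses (old model)
new_logits = np.array([1.8, 0.7, -0.9, 0.1])  # current responses (new model)

L_kd = kl(softmax(old_logits), softmax(new_logits))
```

The loss is zero exactly when the new model reproduces the old distribution, so minimizing it anchors the model's behavior on replayed memory samples.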

Uncertainty-aware Memory Updating
Unlike approaches that store a fixed number of examples per domain, we use a fixed memory for all domains to avoid memory growth. Therefore, when a new domain arrives, some old samples must be removed from memory to make room for new data. We design a dynamic sampling strategy that focuses on the samples the model most needs to replay. Based on the trained MRC model θ_t, we first compute the uncertainty for each sample in memory with an unsupervised entropy-based strategy:

u_entropy = log p^start_{y_s} + log p^end_{y_e}.

Then we calculate the gap between this uncertainty and 1) the sample's previous best uncertainty, or 2) the average uncertainty over all memory data, and normalize these differences into a probability distribution. For ease of exposition, we denote these two normalizations as norm_1 and norm_2. Note that we sample M/t data for each previously seen domain in memory with the above distribution separately; thus we keep M(t−1)/t samples from memory and randomly sample M/t samples from the current domain. In this way, the more a memory sample has been forgotten, the more likely it is to be retained.
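The sampling step can be sketched as follows. The log-probabilities are toy values (in practice they come from p^start and p^end on each memory sample's gold span), and the gap-to-average normalization shown is one plausible reading of the strategy; samples the model scores well below average are weighted more heavily.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy log-probabilities the current model assigns to each memory
# sample's gold start/end positions.
log_p_start = np.log(rng.uniform(0.05, 0.9, size=20))
log_p_end = np.log(rng.uniform(0.05, 0.9, size=20))
u = log_p_start + log_p_end        # per-sample score; low = more forgotten

# Gap to the memory-wide average, normalized into a sampling
# distribution (an illustrative form of the norm_2 variant).
gap = np.maximum(u.mean() - u, 0.0) + 1e-8
probs = gap / gap.sum()

keep = 8                           # e.g. roughly M/t slots for this domain
kept = rng.choice(20, size=keep, replace=False, p=probs)
```

Sampling without replacement under this distribution keeps more of the samples the model has drifted away from, matching the goal of replaying what is most at risk of being forgotten.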

Objective Function
When incrementally learning a new domain, the span loss of the answer prediction module, L_span, considers all current domain data D_t and the memory data M, while the KL loss only takes the memory data M into account. Finally, the overall loss function at the t-th step is formulated as:

L_final = L_span + λ_adv L_adv + λ_KL L_KL,

where L_adv is the adversarial loss and λ_adv, λ_KL are trade-off weights.
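The combination of the three objectives is a simple weighted sum; the λ values below are hypothetical hyper-parameters for illustration, not the paper's settings.

```python
# Combine the span, adversarial, and distillation losses at step t.
def final_loss(l_span, l_adv, l_kd, lam_adv=0.1, lam_kd=1.0):
    """L_final = L_span + lam_adv * L_adv + lam_kd * L_KL."""
    return l_span + lam_adv * l_adv + lam_kd * l_kd

total = final_loss(l_span=2.0, l_adv=0.5, l_kd=0.3)
```

Only the span term sees both D_t and M; the distillation term is computed on memory samples alone.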

Continual MRC Datasets
Inspired by Su et al. (2020), this paper deals with two continual MRC settings under domain adaptation: CDA-C and CDA-Q. For the CDA-C setting, we regard MRC datasets with different passage corpora (e.g., Wikipedia, news, and web snippets) as different domains and choose five datasets: SQuAD 1.1 (Rajpurkar et al., 2016), HotpotQA (Yang et al., 2018), Natural Questions (Kwiatkowski et al., 2019), NewsQA (Trischler et al., 2017), and TriviaQA (Joshi et al., 2017). Due to computational limits, we use the curated versions provided by Fisch et al. (2019). For each dataset, we randomly sample 10,000 <question, context, answer> triples from the original training set for continual training, and use the original dev sets for testing. For the CDA-Q setting, we make use of the original SQuAD data, with domains defined by question type.

Baselines

• EWC (Kirkpatrick et al., 2017) restricts the change of model parameters for previous domains via elastic weight consolidation, a special L2 regularization; it slows down the learning of parameters important for all previous domains.
• Online EWC (Schwarz et al., 2018), the extension of EWC, which only considers the restriction with respect to the latest model parameters.
• DER++ (Buzzega et al., 2020), the extension of DER, uses an additional term on memory.
• AGEM (Chaudhry et al., 2019), a memory-based approach, uses a constraint that makes the projected gradient decrease the average loss on previously seen domains.

Evaluation Metrics
Exact Match (EM) and word-level F1 score are used to evaluate the performance of an MRC model on a single domain. For the continual domain adaptation setting, two common evaluation protocols from continual learning theory are adopted, the average and the whole performance:

F1_avg = (1/T) Σ_{t=1}^{T} F1(D_t^test),    F1_all = F1(D_{1:T}^test).

The former is the average F1 score over the test sets of all seen domains, and the latter is the F1 score over all test sets pooled together.
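The difference between the two metrics can be seen with toy numbers. Since the per-question F1 is macro-averaged, the pooled score F1_all is a test-set-size-weighted mean of the per-domain scores, so larger domains weigh more:

```python
# Toy per-domain F1 scores and test-set sizes after training on T=3 domains.
per_domain_f1 = [80.0, 70.0, 60.0]
test_sizes = [100, 200, 700]

# Average performance: unweighted mean over seen domains.
f1_avg = sum(per_domain_f1) / len(per_domain_f1)

# Whole performance: pooled over all test samples (size-weighted mean).
f1_all = (sum(f * n for f, n in zip(per_domain_f1, test_sizes))
          / sum(test_sizes))
```

Here f1_avg is 70.0 while f1_all is 64.0, because the weakest domain is also the largest.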

Implementation Details
We initialize the Transformer encoder with the pre-trained BERT-BASE model officially released by Google. The maximum sequence length is 384, and the batch size is 20. We set the memory size M = 400 by default, which means the memory stores up to 400 training samples for previously seen domains. When incrementally learning the new incoming domain data at step t, we first reinitialize the parameters of the domain discriminator θ_D, then use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3e-5 and train the MA-MRC model for 3 epochs.

Main Results
For the task of continual MRC, the overall results under the CDA-C and CDA-Q settings are shown in Tables 2 and 3. For example, OnlineEWC performs well under CDA-C but poorly under CDA-Q, and all memory-based methods outperform consolidation-based methods under CDA-Q but not under CDA-C. The reason is that the characteristics of the domains and the continual learning difficulty differ: the domain data in CDA-C are clearly more heterogeneous and more difficult than those in CDA-Q. (3) There is a big gap between all models and the upper bound. We further report the evaluation results of the proposed MA-MRC and the upper-bound method on each domain at each continual learning step t in Table 6. The results indicate that MA-MRC still forgets some knowledge learned on previously seen domains, and there remains much to be explored for continual MRC.
Besides, we plot the average F1 performance of the models during the whole continual learning process in Figure 3 to investigate how performance changes. We observe that the performance of all models decreases to some degree as the number of domains increases, under both the CDA-C and CDA-Q settings. However, the proposed MA-MRC method outperforms the other baselines and achieves better performance on the whole domain data.

Ablation Study
To better understand the proposed model, we conduct ablation studies to examine the effectiveness of each model component. The results in Tables 4 and 5 compare the full MA-MRC model with its ablated variants, and show that adversarial learning indeed helps in remembering previous knowledge.

Effect of Memory
Memory Size M. Table 8 shows the performance with three different memory sizes M: 200, 300, and 400. In the low-memory scenario, i.e., M = 200, the proposed method performs poorly, because adversarial domain adaptation has difficulty transferring well with too small a memory. As the number of memory samples increases, transfer to a new domain becomes easier and overall performance improves. We believe an appropriate memory size leads to better performance.

Uncertainty-aware Sampling. We replace the uncertainty-aware memory updating strategy with two other strategies. First, we use another uncertainty measurement that takes the heuristic maximum softmax probability over spans as the uncertainty: u_prob = max_{i,j}(p^start_i + p^end_j). The second strategy is random sampling. The experimental results in Table 7 indicate that uncertainty-aware sampling (with either entropy-based or probability-based uncertainty) is better than random sampling.

Domain Order. Table 9 shows the results for different domain orders. Order 1 is a descending order based on the number of training samples in each domain, order 2 is an ascending order, and order 3 is a random order. The performance of AGEM and DER++ degrades severely under order 1, whereas the proposed MA-MRC method is superior to the baselines and remains stable and robust under different orders.

Efficiency Analysis
We compare the parameter counts and training speed of methods with the same memory size in Table 10. MA-MRC has additional parameters for the domain discriminator; nevertheless, considering the much larger number of Transformer parameters, all methods have almost the same number of parameters. As for training time per epoch, MA-MRC is slower than DER++ (by 4%/5% under CDA-C/CDA-Q), which is acceptable given the significant improvement MA-MRC brings.

Conclusion
In this paper, an incremental learning MRC model with uncertainty-aware fixed memory and adversarial domain adaptation, MA-MRC, is proposed for continual MRC to alleviate catastrophic forgetting. Inspired by the human learning process, MA-MRC has two main ideas: a memory that stores a small number of samples from previously seen domain data and always focuses on what the model most needs to replay; and adversarially learning domain adaptation in a two-player game to learn better transfer representations between previous and current domain data. Experimental results show that the proposed MA-MRC achieves good continual learning performance without catastrophic forgetting under the CDA-C and CDA-Q settings.
In the future, we would like to explore more effective sampling strategies, domain adaptation strategies, and balanced training strategies for the multiple objectives, to further enhance the continual MRC model.