Total Recall: a Customized Continual Learning Method for Neural Semantic Parsers

This paper investigates continual learning for semantic parsing. In this setting, a neural semantic parser learns tasks sequentially without accessing full training data from previous tasks. Directly applying SOTA continual learning algorithms to this problem fails to achieve performance comparable to re-training models on all seen tasks, because those algorithms have not considered the special properties of the structured outputs yielded by semantic parsers. Therefore, we propose TotalRecall, a continual learning method designed for neural semantic parsers from two aspects: i) a sampling method for memory replay that diversifies logical form templates and balances distributions of parse actions in a memory; ii) a two-stage training method that significantly improves the generalization capability of the parsers across tasks. We conduct extensive experiments to study the research problems involved in continual semantic parsing and demonstrate that a neural semantic parser trained with TotalRecall achieves superior performance over one trained directly with SOTA continual learning algorithms, and achieves a 3-6 times speedup compared to re-training from scratch.


Introduction
In a recent market research report published by MarketsandMarkets (INC., 2020), the smart speaker market is estimated to grow from USD 7.1 billion in 2020 to USD 15.6 billion by 2025. Commercial smart speakers, such as Alexa and Google Assistant, often need to translate users' commands and questions into actions. Therefore, semantic parsers are widely adopted in dialogue systems to map natural language (NL) utterances to executable programs or logical forms (LFs) (Damonte et al., 2019; Rongali et al., 2020). Due to the increasing popularity of such speakers, software developers have implemented a large volume of skills for them, and the number of new skills grows quickly every year. For example, as of 2020, the number of Alexa skills exceeds 100,000, with 24 new skills introduced per day in 2020 (KINSELLA, 2020). Although machine learning-based semantic parsers achieve state-of-the-art performance, they face the following challenges due to the fast-growing number of tasks.
Given new tasks, one common practice is to re-train the parser from scratch on the training data of all seen tasks. However, repeated re-training is both economically and computationally expensive given a fast-growing number of new tasks (Lialin et al., 2020). To achieve comparable performance, training a deep model on all 8 tasks of NLMap (Lawrence and Riezler, 2018) takes approximately 6 times longer than training the same model on one of those tasks. In practice, the cost of repeated re-training for a commercial smart speaker is much higher; e.g., Alexa needs to cope with a number of tasks that is over 10,000 times larger than that in NLMap 1 . In contrast, continual learning provides an alternative, cost-effective training paradigm that learns tasks sequentially without accessing full training data from previous tasks, such that computational resources are utilized only for the new tasks.
Privacy leakage has gradually become a major concern in many Artificial Intelligence (AI) applications. As most computing environments are not 100% safe, it is not desirable to always keep a copy of training data that includes identifiable personal information. Thus, it is often infeasible to assume that the complete training data of all known tasks is always available for re-training a semantic parser (Irfan et al., 2021). For the semantic parser of a privacy-sensitive AI system, e.g. a personalized social robot, continual learning provides a solution to maintain the knowledge of all learned tasks when the complete training data of those tasks is no longer available due to security reasons.
A major challenge of continual learning lies in catastrophic forgetting: (deep) models easily forget the knowledge learned in previous tasks when they learn new tasks (French, 1991; Mi et al., 2020). Another challenge is to learn what kind of knowledge the tasks share in common and to support fast adaptation of models to new tasks. Methods have been developed to mitigate catastrophic forgetting (Lopez-Paz and Ranzato, 2017; Han et al., 2020) and to facilitate forward knowledge transfer (Li and Hoiem, 2017). Instead of directly measuring training speedup, those methods assume that a small fixed-size memory is available for storing training examples or parameters from previous tasks. The memory limits the size of the training data and thus proportionally reduces training time. However, we empirically found that direct application of those methods to neural semantic parsers leads to a significant drop of test performance on benchmark datasets, in comparison to re-training them with all available tasks each time.
In this work, we investigate the applicability of existing continual learning methods to semantic parsing in depth, and find that most methods have not considered the special properties of structured outputs, which distinguish semantic parsing from multi-class classification. Therefore, we propose TOTAL RECALL (TR), a continual learning method especially designed to address semantic-parsing-specific problems from two perspectives. First, we customize the sampling algorithm for memory replay, which stores a small sample of examples from each previous task when continually learning new tasks. The corresponding sampling algorithm, called Diversified Logical Form Selection (DLFS), diversifies LF templates and maximizes the entropy of the parse action distribution in a memory. Second, motivated by findings from cognitive neuroscience (Goyal and Bengio, 2020), we facilitate knowledge transfer between tasks with a two-stage training procedure, called Fast Slow Continual Learning (FSCL). It updates only unseen action embeddings in the fast-learning stage and updates all model parameters in the follow-up stage. As a result, it significantly improves the generalization capability of parsing models.
Our key contributions are as follows:
• We conduct the first in-depth empirical study of the problems encountered by neural semantic parsers learning a sequence of tasks continually in various settings. The most related work (Lialin et al., 2020) only investigated incremental learning between two semantic parsing tasks.
• We propose DLFS, a sampling algorithm for memory replay that is customized for semantic parsing. As a result, it improves the best sampling methods of memory replay by 2-11% on Overnight (Wang et al., 2015a).
• We propose a two-stage training algorithm, coined FSCL, that improves the test performance of parsers across tasks by 5-13% in comparison with using only Adam (Kingma and Ba, 2014).
• In our extensive experiments, we investigate applicability of the SOTA continual learning methods to semantic parsing with three different task definitions, and show that TR outperforms the competitive baselines by 4-9% and achieves a speedup by 3-6 times compared to training from scratch.

Related Work
Semantic Parsing The recent surveys (Kamath and Das, 2018; Zhu et al., 2019; Li et al., 2020) cover ample work in semantic parsing. Most current work employs a sequence-to-sequence architecture (Sutskever et al., 2014) to map an utterance into a structured meaning representation, such as LFs, SQL, and abstract meaning representation (Banarescu et al., 2013). The output sequences are either linearized LFs (Dong and Lapata, 2016; Cao et al., 2019) or sequences of parse actions (Chen et al., 2018; Cheng et al., 2019; Lin et al., 2019; Zhang et al., 2019; Yin and Neubig, 2018; Guo et al., 2019; Wang et al., 2020a; Li et al., 2021). There is also work (Guo et al., 2019; Wang et al., 2020a; Li et al., 2021) exploring semantic parsing with unseen database schemas or actions. Feedback semantic parsing interactively collects data from user feedback as continuous data streams, but does not address the problem of catastrophic forgetting or improve forward transfer (Iyer et al., 2017; Yao et al., 2019; Labutov et al., 2018).

Continual Learning
Continual learning methods can be coarsely categorized into: i) regularization-based methods (Kirkpatrick et al., 2017; Zenke et al., 2017; Ritter et al., 2018; Li and Hoiem, 2017; Zhao et al., 2020; Schwarz et al., 2018), which either apply knowledge distillation (Hinton et al., 2015) to penalize loss updates or regularize parameters that are crucial to old tasks; ii) dynamic architecture methods (Mallya and Lazebnik, 2018; Serra et al., 2018; Maltoni and Lomonaco, 2019; Houlsby et al., 2019; Wang et al., 2020b; Pfeiffer et al., 2021; Rusu et al., 2016), which dynamically alter the structures of models to reduce catastrophic forgetting; iii) memory-based methods (Lopez-Paz and Ranzato, 2017; Wang et al., 2019; Han et al., 2020; Aljundi et al., 2019; Chrysakis and Moens, 2020; Kim et al., 2020), which store historical instances and continually learn from them along with instances from new tasks. There are also hybrid methods (Mi et al., 2020; Liu et al., 2020; Rebuffi et al., 2017) that integrate more than one of these approaches. In natural language processing (NLP), continual learning has been applied to tasks such as relation extraction (Wang et al., 2019; Han et al., 2020), natural language generation (Mi et al., 2020), language modelling, and adapting pretrained language models to multiple NLP tasks (Wang et al., 2020b; Pfeiffer et al., 2021). To the best of our knowledge, (Lialin et al., 2020) is the only work studying catastrophic forgetting for semantic parsing. However, they consider learning between only two tasks, do not propose new methods, and do not evaluate recently proposed continual learning methods. In contrast, we propose two novel continual learning methods customized for semantic parsing and compare them with strong, recently proposed continual learning methods that have not been applied to semantic parsing before.

Base Parser
A semantic parser learns a mapping π θ : X → Y to convert a natural language (NL) utterance x ∈ X into its corresponding logical form (LF) y ∈ Y. Most SOTA neural semantic parsers formulate this task as translating a word sequence into an output sequence, whereby the output sequence is either a sequence of LF tokens or a sequence of parse actions that construct an LF. For a fair comparison between different continual learning algorithms, we adopt the same base model for all of them, as commonly done in prior works (Lopez-Paz and Ranzato, 2017; Wang et al., 2019; Han et al., 2020).
Similar to (Shin et al., 2019; Iyer et al., 2019), the base parser converts the utterance x into a sequence of actions a = {a_1, ..., a_T}. As an LF can be equivalently parsed into an abstract syntax tree (AST), the actions in a construct an AST deterministically in depth-first order, wherein each action a_t at time step t either i) expands an intermediate node according to the production rules of a grammar, or ii) generates a leaf node. As in (Shin et al., 2019), idioms (frequently occurring AST fragments) are collapsed into single units. The AST is then mapped back to the target LF.
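As an illustration, the depth-first action replay described above can be sketched as follows. This is a toy implementation under assumed interfaces, not the authors' code; the grammar is reduced to a map from non-terminal to arity, and idiom collapsing is omitted.

```python
def build_ast(actions, arity):
    """Replay a depth-first action sequence into a nested tree.

    actions: list of ("EXPAND", nonterminal) or ("GEN", token) pairs;
    arity:   hypothetical grammar, mapping each non-terminal to the
             number of children its production rule expects.
    Trees are nested lists: [label, child_1, ..., child_n].
    """
    kind, sym = actions[0]
    assert kind == "EXPAND", "an action sequence starts by expanding the root"
    root = [sym]
    frontier = [(root, arity[sym])]  # open nodes, depth-first order
    for kind, sym in actions[1:]:
        node, remaining = frontier[-1]
        frontier[-1] = (node, remaining - 1)
        if kind == "EXPAND":          # intermediate node: apply a production rule
            child = [sym]
            node.append(child)
            frontier.append((child, arity[sym]))
        else:                         # GEN: emit a leaf token
            node.append(sym)
        while frontier and frontier[-1][1] == 0:
            frontier.pop()            # node fully expanded, backtrack
    return root
```

For instance, with arity = {"call": 2, "entity": 1}, the sequence EXPAND call, GEN max, EXPAND entity, GEN size yields the tree ["call", "max", ["entity", "size"]].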
The parser employs the attention-based sequence-to-sequence (SEQ2SEQ) architecture (Luong et al., 2015) for estimating action probabilities.
Encoder. The encoder in SEQ2SEQ is a standard bidirectional Long Short-term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997), which encodes an utterance x into a sequence of contextual word representations.
Decoder. The decoder applies an LSTM to generate action sequences. At time t, the decoder produces an action representation s_t, yielded by concatenating the hidden representation h_t produced by the LSTM and the context vector o_t produced by soft attention (Luong et al., 2015). We maintain an embedding for each action in an embedding table. The probability of an action a_t is estimated by:

P(a_t | a_<t, x) = exp(c_{a_t}^T s_t) / Σ_{a'∈A_t} exp(c_{a'}^T s_t)

where A_t is the set of applicable actions at time t, and c_{a_t} is the embedding of the action a_t, which is referred to as an action embedding in the following.
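The probability above is a softmax over the applicable actions only, not over the full action inventory. A minimal numeric sketch (assumed generic vector types, not the authors' implementation):

```python
import numpy as np

def action_probs(s_t, action_embeddings, applicable):
    """P(a_t | a_<t, x) ∝ exp(c_a · s_t), normalized over the set A_t
    of actions applicable at step t."""
    scores = np.array([action_embeddings[a] @ s_t for a in applicable])
    scores -= scores.max()              # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return dict(zip(applicable, exp_scores / exp_scores.sum()))
```

Restricting the normalization to A_t is what makes grammar-constrained decoding possible: actions inapplicable under the grammar receive zero probability by construction.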
We assume that, during both training and testing, we know which task an example belongs to. As the definition of tasks is application-specific and parallel data for semantic parsing is often created by domain experts, it is easy to identify the task of an example in practice. We further assume that there is a fixed-size memory M^(k) associated with each task T^(k), e.g. for storing a small number of replay instances, as adopted in (Rebuffi et al., 2017; Wang et al., 2019). This setting is practical for personalized conversational agents because it is difficult for them to re-collect past information except by reusing what is stored in the memories.

Challenges
We demonstrate catastrophic forgetting in continual semantic parsing by training the base parser sequentially on each task from the OVERNIGHT corpus (Wang et al., 2015a) and reporting the test accuracy of exactly matched LFs on all seen tasks combined (more evaluation details are in Sec. 5). We train the parser with BERT (with and without fine-tuning BERT parameters) and GLOVE respectively, using the standard cross-entropy loss. The accuracy on the combined test set drops dramatically after learning the second task. The training on the initial task appears to be crucial for mitigating catastrophic forgetting. The BERT-based parser, with or without fine-tuning, obtains no improvement over the one using GLOVE; the forgetting with BERT is even more serious than with GLOVE. The same phenomenon is observed in (Arora et al., 2019): models with pre-trained language models obtain inferior performance to LSTMs or CNNs when fine-tuned incrementally on each task. They conjecture that it is more difficult for models with large capacity to mitigate catastrophic forgetting.

Figure 2: The average conditional probabilities P(a_t | a_<t, x) of representative cross-task (solid) and task-specific (dash) actions over the seen tasks on OVERNIGHT after learning each task sequentially. The boxes at the i-th task indicate that the actions from the initial task also exist in the i-th task.
We further investigate which parse actions are easy to forget. To measure the degree of forgetting of an action, after training the parser on the first task, we average the probabilities P(a_t | a_<t, x) produced by the parser on the training set of the first task. We recompute the same quantity after learning each subsequent task and plot the measures. Fig. 2 depicts the two actions that are easiest to forget and the two that are most resistant to forgetting on average. Both of the top two actions appear only in the first task; thus it is difficult for the parser to remember them after learning new tasks. In contrast, cross-task actions, such as GEN ( string < ), may even obtain improved performance after learning the last task. This indicates the importance of differentiating between task-specific and cross-task actions when designing novel continual learning algorithms.

TOTAL RECALL
To save training time for each new task, we cannot use all training data from previous tasks; thus we introduce a designated sampling method in the sequel to fill memories with the examples most likely to mitigate catastrophic forgetting. We also present the two-stage training algorithm FSCL to facilitate knowledge transfer between tasks.
Sampling Method. DLFS improves Episodic Memory Replay (EMR) (Wang et al., 2019; Chaudhry et al., 2019) with a designated sampling method for continual semantic parsing. EMR utilizes a memory module M^(i) to store a subset of instances from each previous task T^(i), where M is the size of each memory. The training loss of EMR takes the following form:

L = L_{D^(k)_train} + Σ_{i=1}^{k-1} L_{M^(i)}

where L_{D^(k)_train} and L_{M^(i)} denote the loss on the training data of the current task T^(k) and on the memory of each previous task, {M^(i)}_{i=1}^{k-1}, respectively. Training methods for memory replay often adopt a subroutine called replay training to train models on the instances in the memory. Furthermore, prior works (Aljundi et al., 2019; Wang et al., 2019; Han et al., 2020; Mi et al., 2020; Chrysakis and Moens, 2020; Kim et al., 2020) discovered that storing a small number of diversified and long-tailed examples helps tackle catastrophic forgetting in memory-based methods.
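The replay objective can be sketched in a few lines. This is a schematic, with `loss_fn` standing in for the parser's cross-entropy loss over a batch:

```python
def emr_loss(loss_fn, model, current_batch, memories):
    """EMR objective: loss on the current task's batch plus the replay
    loss on each previous task's small memory, summed."""
    total = loss_fn(model, current_batch)
    for memory in memories:        # one fixed-size memory per previous task
        total = total + loss_fn(model, memory)
    return total
```

Because each memory is small and fixed-size, the added cost per task stays constant as the number of previous tasks grows, which is the source of the training speedup over full re-training.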
Semantic parsing is a structured prediction problem. We observe that semantic parsing datasets are highly imbalanced w.r.t. LF structures. Some instances with similar LF structures are likely to occupy a large fraction of the training set. Therefore, we presume storing the diversified instances in terms of the corresponding LF structures would alleviate the problem of catastrophic forgetting in continual semantic parsing.
To sample instances with diversified LF structures, DLFS partitions the LFs in D^(k)_train into M clusters, then selects representative instances from each cluster to maximize the entropy of actions in a memory. To characterize differences in structure, we first compute similarities between LFs by sim(y_i, y_j) = (Smatch(y_i, y_j) + Smatch(y_j, y_i))/2, where Smatch (Cai and Knight, 2013) is an asymmetric similarity score between two LFs, computed as the overlapping percentage of their triples. Then we run a flat clustering algorithm using the distance function 1 − sim(y_i, y_j), with the number of clusters equal to the size of a memory. We choose K-medoids (Park and Jun, 2009) in this work for easy interpretation of clustering results.
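A minimal sketch of this clustering step, with a toy triple-overlap stand-in for Smatch (the real Smatch score involves variable alignment, which is omitted here) and a plain K-medoids loop:

```python
import random

def sim(y_i, y_j):
    """Symmetrized similarity between two LFs given as sets of triples."""
    def overlap(a, b):   # fraction of a's triples found in b (asymmetric)
        return len(a & b) / len(a) if a else 0.0
    return (overlap(y_i, y_j) + overlap(y_j, y_i)) / 2

def k_medoids(items, k, dist, iters=20, seed=0):
    """Plain K-medoids: alternate assignment and medoid re-selection."""
    medoids = random.Random(seed).sample(range(len(items)), k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i in range(len(items)):
            nearest = min(range(k), key=lambda m: dist(items[i], items[medoids[m]]))
            clusters[nearest].append(i)
        new = [min(c, key=lambda i: sum(dist(items[i], items[j]) for j in c))
               if c else medoids[ci] for ci, c in enumerate(clusters)]
        if new == medoids:
            break
        medoids = new
    return medoids, clusters
```

Running it with dist = 1 − sim over a handful of LFs separates structurally distinct groups into different clusters; each cluster then contributes one memory entry.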
We formulate the problem of balancing the action distribution and diversifying LF structures as a constrained optimization problem. In particular, it i) aims to balance the actions of stored instances in the memory module M by maximizing the entropy of the action distribution, and ii) requires that each instance m in M belong to a different cluster c_j. Let the function c(m) return the cluster id of an instance in a memory M and m_i denote its i-th entry; we have:

max_M − Σ_{a_i∈A} p_i log p_i,  s.t. c(m_i) ≠ c(m_j), ∀ i ≠ j

where p_i = n_i / Σ_{a_j∈A} n_j, with n_i being the frequency of action a_i in M and A being the action set of the training set D^(k)_train. On some occasions, the action set A is extremely large (e.g. 1000+ actions per task), so it may be infeasible to include all actions in the limited memory M. We thus sample a subset of h actions, A' ⊆ A, where each action a_i is sampled with probability p_i = n_i / Σ_{a_j∈A} n_j, with n_i being the frequency of a_i in D^(k)_train. In that case, our method solves the optimization problem over the actions in A'. We solve the above problem with an iterative updating algorithm, detailed in Appendix B. The closest works (Chrysakis and Moens, 2020; Kim et al., 2020) maintain only a balanced label distribution in the memory, while our work maintains a memory balanced w.r.t. both the LF and action distributions.
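The selection can be approximated greedily: pick one instance per cluster, each time taking the candidate whose actions most increase the entropy of the memory's action distribution. This is a simplification of the iterative algorithm in Appendix B, and `actions_of` is an assumed accessor from an instance to its parse actions:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy of an action-frequency table."""
    total = sum(counts.values())
    return -sum(n / total * math.log(n / total) for n in counts.values() if n)

def select_memory(clusters, actions_of):
    """One instance per LF cluster, greedily maximizing action entropy."""
    memory, counts = [], Counter()
    for cluster in clusters:
        best = max(cluster,
                   key=lambda inst: entropy(counts + Counter(actions_of(inst))))
        memory.append(best)
        counts += Counter(actions_of(best))
    return memory
```

The cluster constraint keeps LF structures diverse, while the entropy criterion keeps the action distribution in the memory as flat as possible.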
Fast-Slow Continual Learning. Continual learning methods are expected to learn what the tasks have in common and in what the tasks differ. If there are shared structures between tasks, it is possible to transfer knowledge from one task to another. Inspired by findings from cognitive neuroscience (Goyal and Bengio, 2020), learning should be divided into slow learning of stationary aspects shared between tasks and fast learning of task-specific aspects. This is an inductive bias that can be leveraged to obtain cross-task generalization in the space of all functions.
We implement this inductive bias with a two-stage training algorithm. In the base model, action embeddings c_a (Eq. (7)) are task-specific, while the remaining parts of the model, which build representations of utterances and action histories, are shared to capture common knowledge between tasks. Thus, in the fast-learning stage, we update only the embeddings of unseen actions with the cross-entropy loss; in the slow-learning stage, we update all model parameters.
In the fast-learning stage, the unseen actions A^(k)_u of the k-th task are obtained by excluding all historical actions from the action set of the current task T^(k), namely A^(k)_u = A^(k) \ ∪_{i<k} A^(i), where A^(k) denotes the action set of the k-th task. All actions are unseen in the first task, thus we update all action embeddings by setting A^(0)_u = A^(0). In the slow-learning stage, we distinguish updating parameters w.r.t. the current task from updating parameters w.r.t. the memories of previous tasks. For the former, the parameters θ_g shared across tasks are trained on all the data, while the task-specific parameters θ^(i)_s are trained only on the data from task T^(i). For the latter, the task-specific parameters learned from previous tasks are frozen to ensure they do not forget what was learned from those tasks. More details can be found in Algo. 1.
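Algo. 1 can be condensed into the following toy sketch, where parameters live in a flat dict and `step` stands in for a real optimizer pass. These interfaces are assumed for illustration, not the authors' code:

```python
def sgd_like_step(params, trainable):
    """Stand-in for one optimization pass: bump only trainable parameters."""
    for name in trainable:
        params[name] += 1.0

def fscl_train(params, unseen_embeddings, shared, current_specific, step):
    """Fast stage: only embeddings of actions unseen so far are updated.
    Slow stage: shared parameters and the current task's task-specific
    parameters are updated; previous tasks' specifics stay frozen."""
    step(params, trainable=unseen_embeddings)          # fast learning
    step(params, trainable=shared | current_specific)  # slow learning
    return params
```

The key property is that embeddings of actions belonging only to previous tasks are never touched in either stage, so what was learned for them cannot be overwritten.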
This training algorithm is closely related to Invariant Risk Minimization (Arjovsky et al., 2019), which learns invariant structures across different training environments. However, in their work, they assume the same label space across environments and have access to all training environments at the same time.
Loss. During training, we augment the EMR loss with the Elastic Weight Consolidation (EWC) regularizer (Kirkpatrick et al., 2017) to obtain the training loss:

L = L_EMR + λ Σ_{j=1}^{N} F_j (θ_j − θ_{k−1,j})²

where N is the number of model parameters, θ_{k−1,j} is the j-th model parameter learned up to T^(k−1), and F_j = ∇²L(θ_{k−1,j}) w.r.t. the instances stored in M. EWC slows down the updates of parameters that are crucial to previous tasks according to the importance measure F_j.
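Per parameter, the regularizer looks like this (a schematic with dict-valued parameters; `lam` is the EWC weight λ):

```python
def ewc_regularized_loss(emr_loss, params, prev_params, fisher, lam):
    """L = L_EMR + λ Σ_j F_j (θ_j − θ_{k−1,j})²: parameters with a large
    importance F_j are anchored near their values after task T^(k−1)."""
    penalty = sum(fisher[j] * (params[j] - prev_params[j]) ** 2 for j in params)
    return emr_loss + lam * penalty
```

A parameter with F_j = 0 can drift freely, while one with large F_j is effectively pinned, which is how EWC trades plasticity against stability.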

Experiments
Datasets and Task Definitions. In this work, we consider three different scenarios: i) different tasks are in different domains and there are task-specific predicates and entities in LFs; ii) there are task-specific predicates in LF templates; iii) there are a significant number of task-specific entities in LFs. All tasks in the latter two scenarios are in the same domain. We select Overnight (Wang et al., 2015b) and NLMapV2 (Lawrence and Riezler, 2018) to simulate the three continual learning scenarios, coined OVERNIGHT, NLMAP(QT) and NLMAP(CITY), respectively.
Overnight includes around 18,000 queries involving eight domains. The data in each domain includes 80% training instances and 20% test instances. Each domain is defined as a task.
NLMapV2 includes 28,609 queries involving 79 cities, and categorizes each query into one of 4 different question types and their sub-types. In the NLMAP(QT) setting, we split NLMapV2 into 4 tasks with queries of different types. In the NLMAP(CITY) setting, NLMapV2 is split into 8 tasks with queries about 9 or 10 distinct cities in each task. Each city includes a unique set of point-of-interest regions. In both NLMAP(CITY) and NLMAP(QT), each task is divided into 70%/10%/20% training/validation/test sets, respectively.
We attribute the different distribution discrepancies between tasks to the different definitions of tasks. Overall, the distribution discrepancy between tasks on OVERNIGHT is the largest, while the tasks in the other two settings share relatively smaller distribution discrepancies because the tasks of NLMAP(QT) and NLMAP(CITY) are all in the same domain.

Baselines. EMAR (Han et al., 2020) is an extension of EMR that uses memory instances to construct prototypes of relation labels, to prevent the model from overfitting on the memory instances. ARPER (Mi et al., 2020) adds an adaptive EWC regularization to the EMR loss, where the memory instances are sampled with a unique sampling method called PRIOR. ProtoParser (Li et al., 2021) utilizes prototypical networks (Snell et al., 2017) to improve the generalization ability of semantic parsers to unseen actions in the new task. We customize it by training PROTOPARSER on the instances of the current task as well as the memory instances. The ORACLE (All Tasks) setting trains the model on the data of all tasks combined, and is considered an upper bound for continual learning.
Evaluation. To evaluate the performance of continual semantic parsing, we report the accuracy of exactly matched LFs as in (Dong and Lapata, 2018). We further adopt two common evaluation settings in continual learning: one measures performance by averaging the accuracies of the parser on the test sets of all seen tasks (ACC_avg), and the other measures accuracy on the combined test set of all seen tasks (ACC_whole) (Wang et al., 2019; Han et al., 2020). For reproducibility, we include implementation details in Appendix A.

Table 1: LF Exact Match Accuracy (%) on two datasets with three settings after the model has learned all tasks. "W" stands for the Whole performance ACC_whole, and "A" stands for the Average performance ACC_avg. All the results are statistically significant (p<0.005) compared with TR (+EWC) according to the Wilcoxon signed-rank test (Woolson, 2007). All experiments are run 10 times with different task orders and seeds.
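The two evaluation settings reduce to the following (a trivial sketch for concreteness):

```python
def acc_avg(per_task_acc):
    """Average of per-task exact-match accuracies over all seen tasks."""
    return sum(per_task_acc) / len(per_task_acc)

def acc_whole(correct_per_task, total_per_task):
    """Exact-match accuracy on the combined test set of all seen tasks."""
    return sum(correct_per_task) / sum(total_per_task)
```

The two differ whenever the per-task test sets have unequal sizes: a large task dominates ACC_whole but contributes equally with the others to ACC_avg.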

Results and Discussion
As shown in Table 1, the base parser trained with our best setting, TR (+EWC), significantly outperforms all the other baselines (p<0.005) in terms of both ACC_avg and ACC_whole. The performance of TR (+EWC) is, on average, only 3% lower than the ORACLE setting. Without EWC, TR still performs significantly better than all baselines, except that it is only marginally better than ARPER and PROTOPARSER in the NLMAP(QT) setting. From Fig. 4 we can see that our approaches are more stable than the other methods and demonstrate less and slower forgetting than the baselines.
The dynamic architecture method, HAT, performs worst on OVERNIGHT while achieving much better performance on NLMAP(QT) and NLMAP(CITY). Though the performance of the regularization method, EWC, is steady across different settings, it ranks higher among the baselines on NLMAP(CITY) and NLMAP(QT) than on OVERNIGHT. In contrast, the memory-based methods, GEM and EMR, rank better on OVERNIGHT than on NLMAP(QT) and NLMAP(CITY).
We conjecture that the overall performance of continual learning approaches varies significantly across settings due to the different distribution discrepancies introduced in Datasets. The general memory-based methods are better at handling catastrophic forgetting than the regularization-based and dynamic architecture methods when the distribution discrepancies are large, but are less effective when the distribution discrepancies across tasks are small. Another weakness of memory-based methods is demonstrated by EMAR, which achieves only 14.25% ACC_whole on NLMAP(QT), despite being the SOTA method for continual relation extraction. A close inspection shows that the instances in the memory are usually insufficient to cover all actions when the number of actions is extremely large (i.e., more than 1000 actions per task in NLMAP(QT)), while EMAR relies on instances in memory to construct prototypes for each label. Furthermore, training memory-based methods for many epochs usually leads to severe catastrophic forgetting on the previous tasks, while the regularization methods largely alleviate this effect.
ARPER and PROTOPARSER are the two best baselines. Similar to TR, ARPER is a hybrid method combining EMR and EWC; the joint benefits lead to consistently superior performance over the other baselines except PROTOPARSER in all three settings. The capability to generalize to unseen actions in new tasks also seems critical in continual semantic parsing. Merely combining PROTOPARSER and EMR yields a new baseline that performs surprisingly better than most existing continual learning baselines. From that perspective, the parser with FSCL performs well in continual learning also because of its strength in generalizing to unseen actions.

Influence of Sampling Strategies. Table 2 reports the performance of TR with different sampling strategies and memory sizes. Overall, our sampling method consistently outperforms all other baselines on both OVERNIGHT and NLMAP(CITY). On OVERNIGHT with memory size 50, the gap between DLFS and GSS is up to 11%, and 2% between DLFS and FSS, the best baseline. However, on NLMAP(CITY), the performance differences across sampling methods are smaller than those on OVERNIGHT. A similar observation applies to the influence of different sample sizes. We conclude that a smaller distribution discrepancy reduces the differences between sampling methods, as well as between sample sizes, in the memory-based methods.
RANDOM performs steadily across different settings, though its performance is usually mediocre. FSS, GSS, and PRIOR are model-dependent sampling methods. The gradients and model confidence scores are not stable features for the sample selection algorithms: we observe that the instances selected by GSS differ significantly even when the model parameters are only slightly disturbed. For PRIOR, the semantic parsing model is usually confident on instances with similar LF templates, and diversifying entities does not necessarily lead to diversity of LF templates, since LFs with different entities may share similar templates. Therefore, GSS and PRIOR each perform well in only one setting. In contrast, the utterance encoding features are much more reliable, and FSS achieves the second-best performance among all methods. Either balancing the action distribution (BALANCE) or selecting centroid LFs from LF clusters (LFS) alone performs worse than DLFS, showing that it is advantageous to select an instance in a cluster that balances the memory's action distribution rather than directly using the centroid.

Ablation Study of FSCL Training. Table 3 shows the ablation study of FSCL training, obtained by removing (-) or replacing (-/+) the corresponding component/step.
The fast-learning stage on action embeddings is the most critical step in FSCL training; removing it causes up to a 13% performance drop. To study this step in depth, we also replace our fast-learning stage with fine-tuning all task-specific parameters except in the first task, as done in LwF (Li and Hoiem, 2017), or with fine-tuning all parameters, as done in EMAR (Han et al., 2020). The corresponding performance is no better than removing the step in most cases. We also plot the training and test errors with and without this step in Fig. 3. The step clearly leads to a dramatic improvement in both generalization and optimization.
Another benefit of the fast-learning step concerns the first task. We observe that good optimization on the first task is crucial for model learning on the following tasks. Our preliminary study shows that by applying fast-learning only to the first task, the model can still keep close-to-optimal performance. As shown in Fig. 4, our method with this fast-learning step is better optimized and generalizes better on the initial tasks than all the other baselines, and largely alleviates the forgetting caused by learning the second task.

Influence of Pre-trained Language Models.
We study the impact of pre-trained language models on semantic parsing in supervised learning and continual learning, respectively. In both settings, we evaluate the base parser using BERT (Devlin et al., 2019) as its embedding layer in two configurations: fine-tuning the parameters of BERT (BERT-finetune) and freezing BERT's parameters (BERT-fix). As shown in Tab. 4, BERT slightly improves the overall performance of the base parser in supervised training (the ORACLE setting) on OVERNIGHT. In contrast, in the continual learning setting, base parsers with BERT embeddings perform much worse than those with GLOVE embeddings. On NLMAP(QT), the accuracy of FINE-TUNE with GLOVE embeddings is 30% and 20% higher than with BERT's parameters updated and fixed, respectively. We conjecture that deeper neural models suffer more from catastrophic forgetting. Moreover, the average training speeds of parsers with BERT-fix and BERT-finetune are 5-10 times and 20-40 times slower, respectively, than those with GLOVE on each task. Overall, our method still consistently outperforms other SOTA continual learning methods, such as EWC and EMR, with either BERT-finetune or BERT-fix. In contrast, the performances of the baselines EWC and PROTOPARSER are highly unstable on NLMap when using BERT.

Table 4: LF Exact Match Accuracy (%) of parsers using BERT, fine-tuning (top) and fixing (bottom) BERT's parameters.

Conclusion
We conducted the first in-depth empirical study to investigate continual learning for semantic parsing. To cope with catastrophic forgetting and facilitate knowledge transfer between tasks, we propose TOTAL RECALL, consisting of a sampling method specifically designed for semantic parsing and a two-stage training method implementing an inductive bias for continual learning. The resulting parser achieves superior performance over the existing baselines on three benchmark settings. The ablation studies also demonstrate why it is effective.

A Reproducibility Checklist
The hyper-parameters are cross-validated on the training set of OVERNIGHT and validated on the validation sets of NLMAP(QT) and NLMAP(CITY). We train the semantic parser on each task with learning rate 0.0025 and batch size 64 for 10 epochs. The number of fast-learning epochs is 5. We use the 200-dimensional GLOVE embeddings (Pennington et al., 2014) to initialize the word embeddings for utterances. As different task orders influence the performance of continual semantic parsing, all experiments are run on 10 different task orders with a different seed for each run. We report the average ACC_avg and ACC_whole over the 10 runs. In addition, we use one Nvidia V100 GPU to run all our experiments.
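The seeded multi-order protocol above can be sketched as follows; the helper name `make_task_orders` and the seeding scheme are illustrative assumptions (the paper specifies only that 10 orders are run, each with its own seed):

```python
import random

# Hyper-parameters reported in the checklist
HPARAMS = {"lr": 0.0025, "batch_size": 64, "epochs": 10,
           "fast_epochs": 5, "glove_dim": 200}

def make_task_orders(tasks, n_runs=10, base_seed=0):
    """One shuffled task order per run, each driven by its own seed,
    so every run sees the tasks in a different sequence."""
    orders = []
    for run in range(n_runs):
        rng = random.Random(base_seed + run)
        order = list(tasks)
        rng.shuffle(order)
        orders.append(order)
    return orders

orders = make_task_orders(["t1", "t2", "t3", "t4"], n_runs=10)
```

ACC_avg and ACC_whole would then be averaged over the 10 resulting runs.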

B DLFS Algorithm
We provide the detailed DLFS algorithm in Algo. 2. Table 6 shows the performance of TR with different sampling strategies and different memory sizes on NLMAP(QT).

D Dynamic Action Representation
To differentiate the learning of cross-task and task-specific aspects, we integrate a designated dynamic architecture into the base parser, along with DLFS and FSCL, for continual semantic parsing, coined Dynamic Action Representation (DAR). This method also significantly mitigates catastrophic forgetting and improves forward transfer in continual semantic parsing. Due to limited space, we did not include it in the main paper. The details and analysis of this method are given below.
Decoder of Base Parser. The decoder of the base parser applies an LSTM to generate action sequences. At time $t$, the LSTM produces a hidden state $\mathbf{h}_t = \mathrm{LSTM}(\mathbf{c}_{a_{t-1}}, \mathbf{h}_{t-1})$, where $\mathbf{c}_{a_{t-1}}$ is the embedding of the previous action $a_{t-1}$. We maintain an embedding for each action in the embedding table. As defined in Luong et al. (2015), we concatenate $\mathbf{h}_t$ with a context vector $\mathbf{o}_t$ to yield $\mathbf{s}_t = \tanh(\mathbf{W}_c[\mathbf{h}_t; \mathbf{o}_t])$, where $\mathbf{W}_c$ is a weight matrix and the context vector $\mathbf{o}_t$ is generated by soft attention (Luong et al., 2015). The probability of an action $a_t$ is estimated by $P(a_t \mid a_{<t}, x) = \exp(\mathbf{c}_{a_t}^{\top}\mathbf{s}_t) / \sum_{a \in A_t} \exp(\mathbf{c}_{a}^{\top}\mathbf{s}_t)$, where $A_t$ is the set of applicable actions at time $t$.
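The restricted softmax over the applicable action set can be sketched as a small numeric example (the action names and embedding values are made up for illustration):

```python
import math

def action_distribution(s_t, action_embeddings, applicable):
    """Softmax over the applicable action set A_t, scoring each action by
    the dot product between its embedding c_a and the state s_t."""
    scores = [sum(c * s for c, s in zip(action_embeddings[a], s_t))
              for a in applicable]
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in scores]
    z = sum(exps)
    return {a: e / z for a, e in zip(applicable, exps)}

emb = {"SELECT": [1.0, 0.0], "FILTER": [0.0, 1.0], "CITY": [0.5, 0.5]}
# "CITY" is not applicable at this step, so it receives no probability mass
probs = action_distribution([2.0, 0.0], emb, ["SELECT", "FILTER"])
```

Restricting the normalization to $A_t$ means inapplicable actions never compete for probability mass at that decoding step.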
In the following, the dense vectors $\mathbf{c}_a$ are referred to as action embeddings. We distinguish task-specific actions $A_s$, which generate task-specific predicates or entities, from cross-task actions, which are the remaining actions $A_g$ associated with predicates appearing in more than one task. We model different actions using different action embeddings (Eq. (7)). But the key challenge lies in switching between task-specific and cross-task hidden representations.
To address this problem, given an output hidden state of the LSTM, $\mathbf{h}_t = \mathrm{LSTM}(\mathbf{c}_{a_{t-1}}, \mathbf{h}_{t-1})$, we apply a task-specific adapter module to transform the hidden state $\mathbf{h}_t \in \mathbb{R}^d$.
Here $\phi_i(\cdot): \mathbb{R}^d \to \mathbb{R}^d$ is an adapter network and $g_i(\cdot): \mathbb{R}^d \to \mathbb{R}^d$ is a gating function for task $T^{(i)}$, with task-specific parameters $\mathbf{W}^i_{\phi} \in \mathbb{R}^{d \times d}$ and $\mathbf{W}^i_{g} \in \mathbb{R}^{2d \times d}$. The number of parameters introduced per task is merely $O(3d^2)$, which is parameter-efficient. Therefore, the context vector of the attention and the state used to infer the action probability in Eqs. 5 and 6 are computed from the adapted hidden state instead of $\mathbf{h}_t$.

Ablation Study of DAR. As shown in Tab. 7, removing the task-specific representations (-specific) generally degrades the model performance by 1.5-3.5%, except on NLMAP(QT). Our further inspection shows that the proportion of task-specific actions in NLMAP(QT) is only 1/20, while the ratios are 1/4 and 2/5 in OVERNIGHT and NLMAP(CITY), respectively. Using either task-specific representations (-specific) or cross-task representations (-cross) alone cannot achieve the optimal performance. Fig. 5 depicts the performance curve of semantic parsers on the seen tasks of NLMAP(CITY) and NLMAP(QT) after learning on each task sequentially.
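One possible realization of the adapter and gate, consistent with the stated parameter shapes but otherwise an assumption (the paper's exact equations did not survive extraction): $\phi_i$ is a $d \times d$ transform and the gate reads the $2d$-dimensional concatenation $[\mathbf{h}_t; \phi_i(\mathbf{h}_t)]$, interpolating between the adapted and original hidden state.

```python
import math

def matvec(rows, v):
    # rows of the matrix dotted with the vector v
    return [sum(w * x for w, x in zip(row, v)) for row in rows]

def adapt_hidden(h, W_phi, W_g):
    """Hypothetical adapter/gate form matching the shapes W_phi: d x d and
    a gate over the 2d concatenation [h; phi(h)]; the functional form is an
    assumption, not the paper's equation."""
    phi = [math.tanh(x) for x in matvec(W_phi, h)]            # adapter phi_i(h)
    gate_in = h + phi                                          # [h; phi(h)], length 2d
    g = [1.0 / (1.0 + math.exp(-x)) for x in matvec(W_g, gate_in)]
    # gated interpolation between adapted and original hidden state
    return [gi * pi + (1.0 - gi) * hi for gi, pi, hi in zip(g, phi, h)]

h = [0.5, -0.3]
W_phi = [[1.0, 0.0], [0.0, 1.0]]
W_g = [[0.1, 0.2, 0.3, 0.4],
       [0.4, 0.3, 0.2, 0.1]]   # d rows of length 2d (transpose of R^{2d x d})
h_tilde = adapt_hidden(h, W_phi, W_g)
```

With per-task $\mathbf{W}^i_{\phi}$ and $\mathbf{W}^i_{g}$, only these small matrices grow with the number of tasks, which is what makes the scheme parameter-efficient.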

E Accuracy Curve
The base parsers are the same for all training methods in comparison. However, the training methods are not exactly the same. For example, PROTOPARSER and EMAR use meta-learning methods to train the parser. HAT manipulates parameter gradients during training and uses adapter layers to modify the weights of model parameters on different tasks. ARPER and EWC use regularization during continual training. Different training methods cause the baselines to obtain different results on the initial and subsequent tasks.

Figure 5: ACC_whole on the seen tasks of NLMAP(CITY) (Up) and NLMAP(QT) (Down) after learning on each task sequentially (best seen in colours).
In the first task, Fast-Slow Continual Learning (FSCL) differs from traditional supervised training by updating all action embeddings first, followed by updating all model parameters. From Fig. 4 and Fig. 5, we can tell that FSCL leads to a significant performance gain over the baselines on the first task. In this way, our parser trained with FSCL lays a better foundation than the baselines for learning future tasks in terms of both forward and backward transfer. For each new task, the first step of FSCL focuses on minimal changes of model parameters for task-specific patterns, thus significantly reducing the risk of forgetting prior knowledge. In contrast, the baselines modify the majority of model parameters for each new task and hence easily incur catastrophic forgetting. As a result, our model with FSCL achieves better performance than all baselines, both on all tasks and on the initial task alone, as in Fig. 4 and Fig. 5.

Figure 6: The training and test error points of semantic parsing models with/without fast-learning on NLMAP(CITY) (Up) and NLMAP(QT) (Down).
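The two-stage schedule can be sketched with a toy parser; the class names and the `trainable` flag are illustrative assumptions, and the epoch counts follow the reproducibility checklist (5 fast, 10 slow):

```python
class Param:
    def __init__(self, is_action_embedding):
        self.is_action_embedding = is_action_embedding
        self.trainable = True
        self.updates = 0

class ToyParser:
    def __init__(self):
        # one action-embedding parameter, one "other" model parameter
        self.params = [Param(True), Param(False)]
    def all_parameters(self):
        return self.params
    def train_epoch(self, data):
        for p in self.params:
            if p.trainable:
                p.updates += 1

def fscl_train_task(parser, task_data, fast_epochs=5, slow_epochs=10):
    """Stage 1 (fast): only action embeddings are trainable, so new tasks
    change as little of the shared model as possible.
    Stage 2 (slow): all parameters are trainable."""
    for p in parser.all_parameters():
        p.trainable = p.is_action_embedding
    for _ in range(fast_epochs):
        parser.train_epoch(task_data)
    for p in parser.all_parameters():
        p.trainable = True
    for _ in range(slow_epochs):
        parser.train_epoch(task_data)

parser = ToyParser()
fscl_train_task(parser, task_data=None)
# action embeddings receive fast + slow updates; other params only slow ones
```

The asymmetry in update counts is the point: task-specific knowledge lands in the action embeddings first, sparing the shared parameters.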

F Training Time Analysis
The average training times of different continual learning models on each task of OVERNIGHT, NLMAP(CITY), and NLMAP(QT) are reported in Tab. 8. On average, the training of FINE-TUNE is 13, 5, and 14 times faster than training the parser from scratch on the tasks of OVERNIGHT, NLMAP(CITY), and NLMAP(QT), respectively. In general, the training times of memory-based methods are longer than those of regularization and dynamic architecture methods due to replay training. Since our method, TOTAL RECALL, is a memory-based method, its training time is comparable to that of the other memory-based methods such as GEM, EMR, and EMAR. In addition, EWC slows the convergence of the parser on NLMAP(CITY) and NLMAP(QT), thus increasing the training time needed on each task to reach optimal performance. Therefore, the hybrid method ARPER, which combines EMR and EWC, takes the longest training time among all continual learning methods. However, our FSCL speeds up the convergence of the base parser even with EWC; thus, the training time of TOTAL RECALL (+EWC) is much less than that of ARPER. Fig. 6 provides the training and test error points of semantic parsers on NLMAP(CITY) and NLMAP(QT), respectively. As on OVERNIGHT, the base parser with the fast-learning step is better optimized than without it on NLMAP(CITY) and NLMAP(QT).

Figure 7: The conditional probabilities P(a_t | a_<t, x) of representative cross-task actions (Up) and task-specific actions (Down), evaluated on the initial task after the parser is trained on each task of OVERNIGHT sequentially.

H Forgetting Analysis on Actions
Following Section 4.1, Fig. 7 depicts the conditional probabilities, P(a_t | a_<t, x), of cross-task and task-specific actions, respectively, predicted by the base parser fine-tuned sequentially on each task. Overall, task-specific actions are more likely to be forgotten than cross-task actions while the parser learns new tasks. Due to the rehearsal training of cross-task actions in future tasks, the prediction performance on cross-task actions fluctuates across tasks.
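This per-action forgetting measurement can be sketched as follows; `gold_action_probs` is an assumed helper that teacher-forces the gold action sequence and returns the model's per-step probabilities, and the action names are made up:

```python
def track_action_probabilities(parser, probe_set, tracked):
    """Average P(a_t | a_<t, x) over gold occurrences of each tracked action
    on a fixed probe set (e.g. held-out examples from the initial task),
    re-evaluated after training on each new task."""
    totals = {a: [0.0, 0] for a in tracked}
    for utterance, gold_actions in probe_set:
        probs = parser.gold_action_probs(utterance, gold_actions)
        for a, p in zip(gold_actions, probs):
            if a in totals:
                totals[a][0] += p
                totals[a][1] += 1
    return {a: (s / n if n else None) for a, (s, n) in totals.items()}

class ToyParser:
    def gold_action_probs(self, utterance, gold_actions):
        return [0.8 for _ in gold_actions]   # stand-in confidences

probe = [("q1", ["GEN_ENTITY", "APPLY_PRED"]), ("q2", ["APPLY_PRED"])]
avg = track_action_probabilities(ToyParser(), probe,
                                 ["APPLY_PRED", "GEN_ENTITY"])
```

Plotting these averages after each task, separately for cross-task and task-specific actions, yields curves of the kind shown in Fig. 7.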