Controllable Semantic Parsing via Retrieval Augmentation

In practical applications of semantic parsing, we often want to rapidly change the behavior of the parser, such as enabling it to handle queries in a new domain or changing its predictions on certain targeted queries. While we can introduce new training examples exhibiting the target behavior, a mechanism for enacting such behavior changes without expensive model re-training would be preferable. To this end, we propose ControllAble Semantic Parser via Exemplar Retrieval (CASPER). Given an input query, the parser retrieves related exemplars from a retrieval index, augments the query with them, and then applies a generative seq2seq model to produce an output parse. The exemplars act as a control mechanism over the generic generative model: by manipulating the retrieval index or how the augmented query is constructed, we can manipulate the behavior of the parser. On the MTOP dataset, in addition to achieving state-of-the-art accuracy on the standard setup, we show that CASPER can parse queries in a new domain, adapt its predictions toward specified patterns, and adapt to new semantic schemas without having to re-train the model.


Introduction
Semantic parsing is the task of mapping input queries to their meaning representations. In practical applications of semantic parsing, such as conversational agents, we often want to rapidly control the behavior of the parser. We particularly focus on three scenarios: (1) Domain bootstrapping: making the parser handle queries in a new domain (Su and Yan, 2017; Hou et al., 2020a; Li et al., 2021b). This requires predicting new semantic labels (e.g., intents and slots) unseen during training, and assigning correct values to such labels. (2) Parse guiding: influencing the prediction toward a specific parse pattern. This can be used as an override for sensitive queries or queries that the model struggles on. (3) Schema refactoring: adapting the parser to changes in the semantic schema, such as semantic label renaming (Gaddy et al., 2020).

Figure 1: Overview of CASPER. (1) Given a query x, we retrieve exemplars (x_i, y_i) from the retrieval index. (2) We construct an augmented query x+ based on x and the retrieved exemplars. (3) We apply a generative model on x+ to produce an output parse y. The retrieval index and augmentation procedure can be modified to change the parser's behavior without re-training.
A common way to control the parser's behavior is to construct training examples exhibiting the new behavior (e.g., examples from the new domain) and tune the model on them. However, model training requires computational resources, which can become unwieldy if we need to make multiple rapid changes. Ideally, we want to control the behavior of the semantic parser without additional training. Such an ability would enable many novel use cases. For example, developers could update the semantic parser's behavior and observe the results immediately, thus speeding up the development cycle. This can be used to quickly update the parser in time-critical scenarios while waiting for a fully retrained model. Another use case is deploying a single model to service multiple clients. Each client can manipulate the parser to fit its use without interfering with the central model or other clients, thus saving resources and preserving privacy.
To this end, we propose ControllAble Semantic Parser via Exemplar Retrieval (CASPER). As illustrated in Figure 1, the parser first retrieves exemplars relevant to the input query (e.g., training examples resembling the input query) from a modifiable retrieval index. These retrieved exemplars are then appended to the query. Finally, a seq2seq generator model takes the augmented query as input and generates a meaning representation. The model takes inspiration from recent works that use modifiable retrieval indices (Khandelwal et al., 2020, 2021) and exemplar-augmented inputs (Brown et al., 2020; Liu et al., 2021).
The retrieval and augmentation processes grant us several ways to control the behavior of the parser. For instance, in domain bootstrapping, we can add examples from the new domain to the retrieval index. When these added examples are retrieved, the generator can condition on them while generating the output. This allows the generator to, for instance, follow the semantic template of the exemplars and produce new semantic labels unseen during training.
We evaluate our approach on the English portion of the MTOP dataset (Li et al., 2021a). In our experiments, we show that CASPER preserves the generality and increases the performance of a seq2seq-based semantic parser, while also enabling new capabilities that are simply not possible with standard seq2seq parsers. Our main results are:

• Standard setup: CASPER achieves a new state of the art on MTOP, outperforming the previous best published result as well as a seq2seq baseline without exemplar augmentation.

• Domain bootstrapping: By adding a handful of examples from a new domain to the retrieval index, CASPER can, without re-training, parse queries in the new domain, including producing semantic labels unseen during training.

• Parse guiding: We can train CASPER to follow the semantic template of the manually provided exemplars when asked to do so, while maintaining accuracy on the standard setup.
• Schema refactoring: By editing the retrieval index, CASPER can, without re-training, adapt to a new semantic schema where some semantic labels are split into unseen labels.
Approach

Given an input query x, CASPER operates in three steps:

1. Exemplar retrieval: Retrieve a list E of k exemplars (x_i, y_i), where x_i is a query and y_i is the MR of x_i, that are related to the input query x.
2. Augmentation: From x and E, construct a retrieval-augmented query x + .
3. Generation: Use a generative seq2seq model to map x + into an output MR y.
We will elaborate on each step in the following subsections. We can view exemplar augmentation as a way to provide extra information to any seq2seq-based semantic parser, while still preserving its ability to generate complex outputs. The generator can learn to consider or ignore the provided exemplars, so CASPER can perform at least as well as the underlying generator in the standard setup (Section 3). Additionally, we will later show that by manipulating the retrieval index and how the augmented queries are constructed, we can control the behavior of CASPER for domain bootstrapping (Section 4), parse guiding (Section 5), and schema refactoring (Section 6).
Retrieval The retrieval index consists of input-output pairs (x', y'), and is initially constructed from the training examples. We use a retriever that scores candidates by query embedding cosine similarity. Concretely, each index entry (x', y') is indexed with the embedding e(x') of its query, computed using a sentence embedder. Given an input x, we rank all index entries (x', y') by the cosine similarity between e(x) and e(x'), and let the exemplar list E be the top-k entries.
For our experiments, we use the large version of the pre-trained Universal Sentence Encoder (USE-large) (Cer et al., 2018) to embed the queries. We did not fine-tune the embedder. As the retrieval index is small (≈16k entries), we simply rank all index entries and choose the top k = 5 entries as the exemplars.
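To make the retrieval step concrete, below is a minimal Python sketch of embedding-based top-k retrieval, assuming the publicly available USE-large module on TensorFlow Hub and a toy in-memory index; the variable and function names are ours and not part of any released implementation.

import numpy as np
import tensorflow_hub as hub

# Load the large Universal Sentence Encoder; the embedder is kept frozen.
embedder = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

def embed(queries):
    # L2-normalize so that a dot product equals cosine similarity.
    vectors = np.asarray(embedder(queries))
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Toy retrieval index: exemplar queries x', their MRs y', and query embeddings e(x').
index_queries = ["call my mom", "text my brother hello"]
index_mrs = ["[in: create_call [sl: contact my mom]]",
             "[in: send_message [sl: recipient my brother] [sl: content hello]]"]
index_embeddings = embed(index_queries)

def retrieve(query, k=5):
    # Rank all index entries by cosine similarity to the input query x.
    scores = index_embeddings @ embed([query])[0]
    top = np.argsort(-scores)[:k]
    return [(index_queries[i], index_mrs[i], float(scores[i])) for i in top]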
Augmentation From the input query x and the list E = [(x_1, y_1), . . . , (x_k, y_k)] of retrieved exemplars, we construct an augmented query x+. Similar to previous works that use exemplar-augmented inputs (Guu et al., 2018; Lewis et al., 2020; Brown et al., 2020; Liu et al., 2021), we simply concatenate each exemplar to the query:

x+ = x @@ x_1 ## ỹ_1 @@ x_2 ## ỹ_2 @@ . . . @@ x_k ## ỹ_k

where @@ and ## are separator strings, and ỹ_i is the MR y_i simply treated as a string.
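As a minimal sketch (with hypothetical helper names), the augmentation step is a simple string concatenation:

def augment(query, exemplars, sep="@@", mr_sep="##"):
    # exemplars: list of (x_i, y_i) pairs, where y_i is the MR serialized as a string.
    parts = [query]
    for x_i, y_i in exemplars:
        parts.append(f"{sep} {x_i} {mr_sep} {y_i}")
    return " ".join(parts)

# augment("call my mom", [("call dad", "[in: create_call [sl: contact dad]]")])
# -> "call my mom @@ call dad ## [in: create_call [sl: contact dad]]"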
Generation We fine-tune a pre-trained seq2seq model to map the augmented query x+ to the string representation of y. For our experiments, we fine-tune T5-base (Raffel et al., 2020). While T5 was pre-trained on text data, our experiments show that it can effectively generate MRs after fine-tuning.
Training We keep the retriever fixed and only train the generator model. When constructing (x+, y) pairs for training the generator, we want to diversify the list of exemplars E. This encourages the generator to learn when to use or ignore the exemplars based on their quality and relevance to the input x. To this end, instead of using the top-k retrieval results as exemplars, at training time we construct a sampled-k exemplar list E as follows. From the input x, we first create a ranked pool of all index entries, excluding ones whose query is exactly x. At each step i ∈ {1, . . . , k}, we choose the j-th entry from the pool with probability proportional to p(1 − p)^(j−1), where p is a hyperparameter (set to 0.5 in the experiments). This geometric distribution makes higher-ranking entries get sampled more frequently. We then remove the sampled entry from the pool and add it to E.
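A sketch of the sampled-k procedure (our own naming; p = 0.5 as in the experiments):

import random

def sampled_k_exemplars(ranked_pool, k=5, p=0.5):
    # ranked_pool: index entries sorted by retrieval score, with entries whose
    # query is exactly the input x already removed.
    # At each step, the entry currently at (0-indexed) rank j is chosen with
    # probability proportional to p * (1 - p) ** j, then removed from the pool.
    pool = list(ranked_pool)
    exemplars = []
    for _ in range(min(k, len(pool))):
        weights = [p * (1 - p) ** j for j in range(len(pool))]
        j = random.choices(range(len(pool)), weights=weights, k=1)[0]
        exemplars.append(pool.pop(j))
    return exemplars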

Faithfulness toward exemplars
Although the generation of y is conditioned on the exemplars in E, the generator could implicitly ignore the exemplars entirely. This is desirable, as it allows the model to generate reasonable outputs even when the exemplars are of low quality. However, if the parser always ignores the exemplars, we will not be able to control the parser via exemplar manipulation, and the parser might struggle on out-of-distribution examples at test time (e.g., in the domain bootstrapping setup).
We want the parser to be more faithful toward the exemplars, but still want the generator to make a judgment call and ignore the exemplars when appropriate. Additionally, we want an on-demand mechanism for adjusting the degree of faithfulness on specific queries. We thus propose the following techniques.

Anonymization Most seq2seq models, including T5, can learn to copy tokens from the input string. However, with regular supervision, the model may still end up not learning to copy semantic labels from the exemplars if (1) such labels appear so frequently that the model memorizes their usage and generates them without copying, or (2) the retrieval is imperfect and copying from exemplars hurts the model during training.
To explicitly teach the generator to copy labels from the exemplars, we create additional anonymized training data in which each unique semantic label in the y_i and y is turned into a random numerical label, as illustrated in Figure 2. Since the labels are anonymized differently in each example, the generator can no longer memorize their usage, and must learn to identify and copy the correct anonymized labels. We train the generator on an equal mix of original and anonymized data.
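The anonymization can be implemented as a string transformation over the serialized MRs of one training example; the regular expression below assumes the lowercased, word-split label format from Section 3.1 and is only a sketch.

import random
import re

# Matches a label such as "in: create_call" or "sl: contact" in a serialized MR.
LABEL_RE = re.compile(r"(?:in|sl): \w+")

def anonymize(mr_strings):
    # mr_strings: the exemplar MRs y_1..y_k plus the target MR y for one example.
    # Each unique label is mapped to a random numerical label, e.g. "in: 42".
    labels = sorted({m.group(0) for mr in mr_strings for m in LABEL_RE.finditer(mr)})
    ids = random.sample(range(100), len(labels))
    mapping = {label: f"{label.split(':')[0]}: {i}" for label, i in zip(labels, ids)}
    return [LABEL_RE.sub(lambda m: mapping[m.group(0)], mr) for mr in mr_strings]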
Guiding tag In some scenarios, we want to manually instruct the model to be more faithful toward the exemplars than usual. To do so, we utilize a special token ("PLATINUM" in our experiments) as a guiding tag T. We simply insert T before each exemplar when constructing x+:

x+ = x @@ T x_1 ## ỹ_1 @@ T x_2 ## ỹ_2 @@ . . .
To establish the behavior of the guiding tag, we create additional training examples (x+, y) where x+ contains the guiding tag and the output y is highly faithful to the augmented exemplars in x+. One instantiation, oracle training data, can be constructed by constraining the retrieved exemplars (x_i, y_i) so that y_i and y share the semantic template (or any other suitable notion of semantic similarity; here, the MR's labels and hierarchical structure). The generator is trained on the combination of this oracle training data and the normal data.
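Reusing the hypothetical augment helper above, the guiding-tag augmentation and one possible way of building oracle training data look roughly as follows; template() is a stand-in for whatever notion of semantic similarity is used.

GUIDING_TAG = "PLATINUM"

def augment_with_tag(query, exemplars, tag=GUIDING_TAG):
    # Same as augment(), but the guiding tag is inserted before each exemplar.
    return augment(query, [(f"{tag} {x_i}", y_i) for x_i, y_i in exemplars])

def make_oracle_example(query, gold_mr, retrieved, template, k=5):
    # Keep only retrieved exemplars whose MR shares the gold MR's semantic template,
    # so that a prediction faithful to the exemplars is also correct.
    faithful = [(x_i, y_i) for x_i, y_i in retrieved if template(y_i) == template(gold_mr)]
    return augment_with_tag(query, faithful[:k]), gold_mr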

Standard setup experiments
We start by evaluating CASPER on the standard train-test setup of the English portion of the MTOP dataset (Li et al., 2021a). We show that the retrieved exemplars can aid the seq2seq generator even on in-distribution queries.

Setup
Data The MTOP dataset uses the decoupled TOP representation (Gupta et al., 2018; Li et al., 2021a), as exemplified in Figures 1, 2, and 3. A TOP representation is a tree, with each node labeled with either an intent (e.g., IN:CREATE_CALL) or a slot (e.g., SL:CONTACT). Each node also corresponds to a single token span of the query. The topmost node is always an intent node. For our models, we simply treat the TOP representation as a string. We start with the string serialization given in the dataset, and then lowercase and word-split the labels to simplify tokenization (e.g., IN:CREATE_CALL is lowercased and word-split).

The English portion of the dataset contains 15667 training, 2235 development, and 4386 test queries. Each query also belongs to one of 11 domains, which will be important in the domain bootstrapping setup (Section 4).
We define the template of a TOP tree to be its tree structure and node labels, with the query tokens discarded.

Methods The main CASPER model is trained on a mixture of original and anonymized training data. We also consider two variants: CASPER orig, trained only on original data, and CASPER anon, trained only on anonymized data. Since CASPER anon does not know the actual labels, its test data is also anonymized. None of the models in this section use oracle training data with the guiding tag.

Baselines We compare against mBART+MT, the best published result from Li et al. (2021a). We also consider fine-tuning T5 on the original training data without exemplar augmentation.

Results and analysis
Table 1 shows the experimental results on the test set of MTOP, averaged over 3 runs. The base T5 model already outperforms the previous state of the art by 0.7%. With retrieval augmentation, CASPER further improves upon T5, for a total gain of 1.24% in exact match accuracy.

The CASPER orig variant, which is trained only on non-anonymized data, achieves an even higher gain of 2.1%. With in-distribution test data, leaning toward memorization rather than following noisy exemplars is likely the best strategy. However, we will show in later sections that CASPER trained on mixed data is more robust to out-of-distribution queries and changes in the retrieval index.
Our error analysis shows that, while augmented exemplars improve performance over the baseline overall, they also cause some losses. Figure 3a shows a winning example where CASPER orig is better at predicting the slot that appears in the augmented exemplars. Figure 3b shows a loss caused by the exemplars not being analogous to the input query, while Figure 3c shows a case where annotation inconsistency between the exemplars and the gold output causes a loss.
Note that while T5 was pre-trained on text data, CASPER effectively learns to generate syntactically valid MRs, with only 0.04% of test outputs being syntactically invalid. Post-hoc filtering or constrained decoding (Yin and Neubig, 2017; Krishnamurthy et al., 2017) could be used if one needs an absolute guarantee of syntactic correctness.

Ablation
Retriever We train CASPER orig with different choices of the retriever's embedder: BERT-base (embedding of the [CLS] token) (Devlin et al., 2019), BERT-large, and USE-large. We also consider an oracle retriever that only returns examples with the same template as the correct output. Table 2 reports intrinsic metrics of the retrievers, which include template recall@5 (whether one of the top 5 retrievals has the same template as the gold MR) and label coverage@5 (whether all labels in the gold MR appear among the top 5 retrievals). Meanwhile, Table 3 reports the end-to-end results on the development set of the trained baseline T5 and CASPER orig models.
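Below is a sketch of how these intrinsic metrics could be computed for one example; the template() and labels() helpers are our own approximations of the definitions in Section 3.1, operating directly on the serialized MR strings.

import re

LABEL_RE = re.compile(r"(?:in|sl): \w+")

def template(mr):
    # Tree structure and node labels with the query tokens discarded:
    # keep only brackets and labels from the serialized MR.
    return " ".join(re.findall(r"\[|\]|(?:in|sl): \w+", mr))

def labels(mr):
    return set(LABEL_RE.findall(mr))

def intrinsic_metrics(gold_mr, retrieved_mrs):
    # retrieved_mrs: MRs of the top-5 retrieved exemplars for this example.
    template_recall = any(template(y) == template(gold_mr) for y in retrieved_mrs)
    covered = set().union(*[labels(y) for y in retrieved_mrs]) if retrieved_mrs else set()
    label_coverage = labels(gold_mr) <= covered
    return template_recall, label_coverage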
We observe that USE-large, being pre-trained on sentence-level tasks, performs better than BERT on both intrinsic and end-to-end evaluation. On the other hand, the oracle performs much better than USE-large, showing that an improved retriever (e.g., USE-large fine-tuned on the training data) could potentially improve the CASPER model.

Exemplar selection
We compare different ways to select exemplars when constructing the training data (x+, y). The choices include using a fixed top-k list (less diverse) instead of sampled-k, and using different numbers of exemplars k. Note that we always use the top-k list at test time, with the same k as at training time. Table 3 compares the results. We see that using sampled exemplars during training and a higher k give additive improvements to the model. However, note that larger k can make the augmented query exceed the model's maximum input length.

Domain bootstrapping experiments
In addition to improving the parser on the standard setup, we will show that by manipulating the retrieval index, we can influence the parser's behavior. We first consider domain bootstrapping, where a new domain is being added to a previously trained parser, and we want to quickly update the model using a handful of examples in the new domain.

Setup
We designate one MTOP domain as the new domain and the remaining domains as old domains. The parser is trained on the old-domain training data O train. At test time, a support set N sup of examples from the new domain (100 random examples from the new-domain training data N train) is added to the retrieval index. We consider two settings: seen-bootstrap, where N sup is also included in the training data, and unseen-bootstrap, where N sup is only available at test time. We evaluate on the new-domain development set N dev and the old-domain development set O dev.

Note that when evaluating on N dev, exemplars can come from both O train and N sup in the retrieval index. If the retriever does its job well, we would still mostly get exemplars from N sup.
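In terms of the earlier hypothetical retrieval sketch, bootstrapping amounts to appending the support set to the index; no generator parameters change (the index_* variables and embed() refer to that sketch and are our own names):

import numpy as np

def add_to_index(support_set):
    # support_set: list of (query, mr) pairs from the new domain (N sup).
    global index_queries, index_mrs, index_embeddings
    queries, mrs = zip(*support_set)
    index_queries += list(queries)
    index_mrs += list(mrs)
    index_embeddings = np.concatenate([index_embeddings, embed(list(queries))], axis=0)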
Baselines While there are previous works on domain bootstrapping for semantic parsing without fine-tuning (Hou et al., 2020a; Zhu et al., 2020; Henderson and Vulić, 2021), most of them rely on token-level matching and sequence tagging, which are not directly applicable to the hierarchical MRs from MTOP. We thus compare CASPER with T5, which represents generic seq2seq parsers.
For the unseen-bootstrap setting, we additionally try fine-tuning T5 on either N sup or O train + N sup for a small number of steps. These fast update experiments demonstrate the trade-off of spending additional resources for fine-tuning at test time.

Results
In Table 4, we report results of the models trained in the standard setup (full training data) and the two domain bootstrapping setups (with N sup = 100 random examples from N train ). The results are averaged over 5 bootstrapped domains: alarm, calling, event, messaging, and music.
We observe that CASPER shows larger improvements upon T5 in the domain bootstrapping settings than in the standard setting, ranging from +2% when N sup is seen during training (seen-bootstrap) to +38% when N sup is only available at test time (unseen-bootstrap). The results show that by modifying the retrieval index, we can change the behavior of the parser without re-training.

In the unseen-bootstrap setup, the model has to rely solely on the exemplars for unseen semantic labels and parse patterns. The anonymized training data proves to be crucial for making the model more faithful toward the exemplars, as evidenced by CASPER trained on mixed data improving upon CASPER orig, and CASPER anon winning over CASPER orig by a large margin.
Comparison with fast update The line plots in Figure 4 track the accuracy on N dev and O dev (bootstrapped domain = alarm) when T5 trained on O train is fine-tuned on the support set for a few iterations at test time. If only the support set is used for the fast update (blue), the model eventually suffers from catastrophic forgetting and degrades on O dev. Mixing in O train during the fast update (red) solves this issue. T5 eventually surpasses the unseen-bootstrap CASPER (green) on N dev after processing 512 examples, at which point it has already consumed far more computational resources than CASPER.

Parse guiding experiments
In this section, we demonstrate CASPER's ability to guide the prediction toward the patterns specified in the exemplars. This parse guiding ability can be useful for correcting the parser's output on a set of problematic queries (e.g., sensitive queries, or queries that the model struggles on). In industrial semantic parsers, one common way to handle problematic queries is to add explicit "hotfix" filters and treat such queries as special cases. Parse guiding enables us to also handle queries that are sufficiently similar to known problematic queries. Concretely, we can use the similarity score from CASPER's retriever to identify whether the input is similar to any problematic examples, and apply parse guiding toward them if it is.
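One possible way to wire this up is sketched below; the similarity threshold, the generate_fn interface, and the helper names are our assumptions, building on the earlier hypothetical sketches.

import numpy as np

SIMILARITY_THRESHOLD = 0.8  # hypothetical value; would need tuning

def guided_parse(query, problematic_examples, generate_fn, k=5):
    # problematic_examples: (query, mr) pairs whose parses we want to guide toward.
    # generate_fn: the fine-tuned seq2seq generator, mapping x+ to an MR string.
    if problematic_examples:
        scores = embed([px for px, _ in problematic_examples]) @ embed([query])[0]
        best = int(np.argmax(scores))
        if scores[best] >= SIMILARITY_THRESHOLD:
            # Close enough to a known problematic query: guide toward its parse.
            return generate_fn(augment_with_tag(query, [problematic_examples[best]]))
    # Otherwise parse normally with retrieved exemplars.
    exemplars = [(x, y) for x, y, _ in retrieve(query, k)]
    return generate_fn(augment(query, exemplars))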

Setup
We focus on the usage of guiding tag (Section 2.1) for parse guiding. Trained correctly, the parser should become more faithful toward the exemplars when the guiding tag is present in the augmented query, and should parse normally otherwise.
To evaluate this parse guiding ability, we define an oracle evaluation set consisting of examples (x, E, y) with a predefined list of exemplars E. The MRs y_i in E are restricted to having the same semantic template as y. On this evaluation set, we expect the model's accuracy to rise when the guiding tag is present. The template accuracy, which here is equivalent to the rate at which the prediction follows the template of the y_i, should also increase.
Methods We compare CASPER that was taught the behavior of the guiding tag against the models without such knowledge. We report results on the standard and oracle evaluation sets, with and without the guiding tag added at test time.

Results
Table 5 shows the experimental results. On the standard evaluation set, the CASPER model with knowledge of the guiding tag has a slightly smaller gain over T5. But when the guiding tag is present, the model becomes much more faithful to the given exemplars, as evidenced by the increased template and exact match accuracy on the oracle set.

Note that this gain is due to the guiding tag and not just the increased amount of training data: if we add the oracle training data without the guiding tag when training CASPER, the accuracy on the oracle set (90.74) is not as high as when we use the guiding tag (93.02). We also note that the guiding tag should only be used when the correct parse is expected to closely follow the exemplars; using the tag on the standard set hurts accuracy.

When the guiding tag is present, the model needs to balance being faithful to the exemplars against generating a sensible parse. As an analysis, we try supplying exemplars with a drastically different template from the gold parse. The first example in Figure 5 shows how the model attempts to fit the two names from the query as two slot values. The second example shows how the model refuses to predict a SL:DATE_TIME slot, despite the guiding tag being present, since the query does not contain a suitable value for such a slot.

Schema refactoring experiments
In this section, we show how CASPER can adapt to changes in the semantic schema. Although the solution involves modifying the retrieval index as in domain bootstrapping (Section 4), schema refactoring presents a new challenge: the parser now needs to produce a different output for in-domain queries, and must resist the urge to produce semantic labels it has learned during training.

Table 6: Schema refactoring: Both mixing in anonymized training data and using guiding tags help CASPER achieve the best post-refactor accuracy without hurting the pre-refactor accuracy.

Setup
We consider a label splitting scenario where some semantic labels split into two labels each at test time. Following Gaddy et al. (2020), we simulate the scenario backward by using the original dataset as post-refactoring data, and merge 10 pairs of similar labels (listed in Appendix C) to form the pre-refactoring data. About 35% of development examples contain at least one label involved in label splitting, and in about half of those the MR is altered after refactoring.
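As an illustration of the backward simulation (the label pair shown is made up for illustration; the actual 10 pairs are listed in Appendix C):

# Each (kept_label, merged_label) pair collapses to kept_label in the
# pre-refactoring data; the original dataset serves as post-refactoring data.
MERGE_PAIRS = [("sl: music_track_title", "sl: music_album_title")]  # illustrative only

def to_pre_refactoring(mr):
    for kept, merged in MERGE_PAIRS:
        mr = mr.replace(merged, kept)
    return mr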
Methods At test time, we replace the retrieval index with post-refactoring training data. For models with knowledge of the guiding tag, we add the guiding tag whenever a retrieved exemplar contains a label involved in label splitting.

Results

As summarized in Table 6, both mixing in anonymized training data and using guiding tags help CASPER achieve the best post-refactor accuracy without hurting the pre-refactor accuracy.


Related work

When few-shot examples of the new task are available at training time, data augmentation (Kumar et al., 2019; Andreas, 2020; Lee et al., 2021) can be used to amplify the data for the new task. When the few-shot examples are only available at test time, the task is more difficult, and common approaches in the literature include metric learning, fast update, and exemplar augmentation.
Metric learning The main idea of metric learning (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017) is to learn a representation of objects (either inputs or labels) such that objects in the same class are closer together. Test inputs are then matched to the representations of the few-shot labels or their exemplars.
Metric learning was first applied to classification tasks (Fritzler et al., 2019; Sun et al., 2019). Subsequent studies extended metric learning to sequence labeling and semantic parsing by either matching tokens (Hou et al., 2020a,b; Zhu et al., 2020) or matching spans (Henderson and Vulić, 2021; Yu et al., 2021). However, such rigid notions of substructure matching do not lend themselves to complex hierarchical outputs. In CASPER, the retriever performs query-level matching to retrieve exemplars. While the exemplars may not be exactly in the same class as the query, the generator can implicitly reason with them when making predictions. This allows us to generate complex outputs while still gaining benefits from metric learning.
Fast update Given the few-shot examples, one could spend a small amount of resources to fine-tune on them for a few training steps. This creates a trade-off between the amount of resources spent and the performance on the new task. A common way to improve this trade-off is meta learning (Finn et al., 2017; Ravi and Larochelle, 2017; Li et al., 2017). The main idea is to simulate fast update scenarios during training, and to update the model's parameters so that the model performs fast updates more efficiently. Fast update with meta learning has been applied to NLP models for generalizing to unseen tasks or domains (Gu et al., 2018a; Dou et al., 2019; Bansal et al., 2020; Athiwaratkun et al., 2020; Wang et al., 2021).
Since fast update explicitly minimizes the loss on the few-shot examples, the updated model is more likely to be faithful toward them, whereas CASPER requires additional techniques to increase faithfulness toward exemplars (Section 2.1). Nevertheless, CASPER has several advantages over fast update. For instance, while fast update needs to store the information about new labels in the parameters and recall it when parsing test queries, CASPER can directly access the new labels in the exemplars when parsing test queries. Compared to meta learning, training CASPER is also much simpler, requiring only off-the-shelf seq2seq fine-tuning. Finally, while fast update requires the new data to be input-output pairs to fine-tune on, CASPER's exemplars can technically be any information (e.g., a new semantic schema) that can be augmented to the query.
Exemplar augmentation Our work is not the first to use exemplar augmentation for few-shot tasks (Radford et al., 2019; Zhao et al., 2021). The most prominent previous work is GPT-3 (Brown et al., 2020), which can perform new tasks by augmenting the query with exemplars or a task description, even without further fine-tuning the model to specifically handle such augmented queries.
The approach most similar to ours is Liu et al. (2021), which also retrieves exemplars from a retrieval index. While Liu et al. (2021) focus on improving generative models on standard evaluations, our work shows how to use retrieval augmentation to control the behavior of the generator, which leads to novel use cases (domain bootstrapping, parse guiding, schema refactoring) on top of achieving state-of-the-art results on the standard evaluation.

Discussion
Issues in domain bootstrapping The most straightforward way to adapt a neural model to new domains is to fine-tune it on new training examples. However, this approach not only has a high computational cost, but also suffers from two critical issues. One is catastrophic forgetting: the inability to preserve previous knowledge (McCloskey and Cohen, 1989; Goodfellow et al., 2014). The other is model churn: instability of model predictions on individual examples after fine-tuning. Existing work commonly tackles catastrophic forgetting via incremental training, such as imposing constraints on the distance between the new and old models (Sarwar et al., 2019; Rosenfeld and Tsotsos, 2018) or jointly learning a generator to replay past examples for training (Hu et al., 2018). Another existing approach is to identify conflicting data to improve the robustness of model updates (Gaddy et al., 2020). In CASPER, the retrieval index that stores training examples mitigates catastrophic forgetting by design, and since the model can be controlled without fine-tuning, model churn is reduced.
Retrieval-augmented generation Recent studies have shown the effectiveness of retrieval augmentation in many generative NLP tasks. How the model actually uses the retrieved information differs among the methods. Some methods, like CASPER, encode the retrievals alongside the query and let the model decide how to use them (Guu et al., 2018; Hashimoto et al., 2018; He et al., 2020; Weston et al., 2018; Pandey et al., 2018; Lewis et al., 2020). Some utilize alignments between the retrieved examples and the input (Sumita and Iida, 1991; Gu et al., 2018b; Lu et al., 2019). And some use the retrievals to explicitly manipulate the token scores at each decoding step (Zhang et al., 2018; Hayati et al., 2018; Peng et al., 2019; Khandelwal et al., 2020, 2021).

Controllable generation Several works on controllable generation make use of conditional VAEs, where a latent variable conditioned on the input serves as the indicator for controlling the output (Hu et al., 2017; Shen et al., 2018; Song et al., 2019; Shu et al., 2020). Other types of control indicators include special input tokens (Keskar et al., 2019; Dathathri et al., 2020) or another neural model used as a style discriminator during decoding (Krause et al., 2020). Our work uses exemplars as the indicator for controlling the prediction.

Conclusion
We have presented CASPER, a retrieval-augmented semantic parser that uses the retrieved exemplars to influence the predictions. By manipulating the retrieval index and how the exemplars are augmented, we can control the parser's behavior, which is helpful for domain bootstrapping, parse guiding, and schema refactoring.
Future work includes fine-tuning the retriever, possibly jointly with the generator, which has the potential to improve the model (see Section 3.3); introducing more fine-grained control over the faithfulness toward exemplars than the presence or absence of the guiding tag; and pre-training the model on external resources to increase generalization.

Ethical considerations
This paper proposes a retrieval-augmented semantic parser, the predictive behavior of which can be changed by editing the retrieval index or how the retrieval-augmented query is constructed. These modifications can only be carried out by the developer of the parser, and not by the users who issue the queries. The intended use cases of our work include: (1) adding support for new query domains to the parser; (2) overriding predictions of a subset of queries, such as sensitive queries or queries that the parser struggles on; and (3) adapting the parser to an updated semantic schema.
Our method reduces the computational resources needed to retrain the model when enacting the new behavior. That said, the parser needs to be initially trained to recognize retrieval-augmented queries (which can be expensive), and retraining would be required for drastic changes in behavior (e.g., renaming multiple high-frequency semantic labels at once).
While the experiments were done on the English portion of the MTOP dataset, the method is generic to the language of the queries and meaning representations. Note that the model performance would depend on whether the underlying pre-trained retriever and generator models support the target languages well.


Experimental details

Data preprocessing As described in Section 3.1, we lowercase and word-split the semantic labels in the serialized MRs. We also removed spaces before the "]" tokens.
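A rough sketch of this preprocessing (the exact word-splitting behavior is our assumption and is approximated here by lowercasing only):

import re

def preprocess_mr(mr):
    # Lowercase the intent/slot labels and insert a space after the colon,
    # e.g. "[IN:CREATE_CALL" -> "[in: create_call".
    mr = re.sub(r"\[(IN|SL):(\w+)",
                lambda m: f"[{m.group(1).lower()}: {m.group(2).lower()}", mr)
    # Remove spaces before the "]" tokens.
    return re.sub(r"\s+\]", "]", mr)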
Retriever To embed the queries, we used the pre-trained Universal Sentence Encoder (Cer et al., 2018). Specifically, we used the large version of the encoder (https://tfhub.dev/google/universal-sentence-encoder-large/5). The embedder was kept fixed. We computed the embeddings for all queries on CPU.
For each query, we cached the 100 exemplars with the smallest cosine embedding distance from the query. The selection of top-k and sampled-k exemplars was then done only over these cached exemplars.
In actual deployment, brute-force retrieval might be too slow, and fast nearest neighbor methods (Johnson et al., 2021; Guo et al., 2020) could be used to speed up the retriever.
Training For each original example (x, y), we generated 20 lists E of sampled-5 exemplars and saved them to dataset files for fine-tuning the T5 (Raffel et al., 2020) generator model. We used the base version of the model (220M parameters). We selected reasonable hyperparameter values and performed only minimal hyperparameter tuning. Specifically, we used a batch size of 4096 and a learning rate of 0.001. Training was done for 2000 steps, with early stopping based on the exact match accuracy on the development data. We fine-tuned T5 on 32 Cloud TPU v3 cores; training takes approximately 2.5 hours.
We ran the experiments on 3 random seeds. One exception is the domain bootstrapping experiments, where we ran on 1 seed for each of the 5 domains and averaged the results.
Fast update For the fast update experiments in Section 4, we start from the T5 model fine-tuned on O train, and then continue to fine-tune it on either N sup or an equal mix of O train and N sup (i.e., a 50% chance of picking an example from O train and a 50% chance of picking one from N sup). We use a batch size of 128 here instead of 4096. Since |N sup| = 100, each iteration goes over the support set approximately once when fine-tuning on N sup.

Detailed results Tables 7, 8, and 9 show detailed results of the standard, parse guiding, and schema refactoring experiments. Table 10 shows per-domain results for the domain bootstrapping experiments. We note that T5 got a non-trivial accuracy in the no-fine-tuning setting of the calling domain. This is because many training queries in the reminder domain have IN:CREATE_CALL, a main intent of calling, nested inside (e.g., "Delete reminder to call husband").