Detecting Compositionally Out-of-Distribution Examples in Semantic Parsing

While neural networks are ubiquitous in state-of-the-art semantic parsers, it has been shown that most standard models suffer from dramatic performance losses when faced with compositionally out-of-distribution (OOD) data. Recently, several methods have been proposed to improve compositional generalization in semantic parsing. In this work we instead focus on the problem of detecting compositionally OOD examples with neural semantic parsers, which, to the best of our knowledge, has not been investigated before. We investigate several strong yet simple methods for OOD detection based on predictive uncertainty. The experimental results demonstrate that these techniques perform well on the standard SCAN and CFQ datasets. Moreover, we show that OOD detection can be further improved by using a heterogeneous ensemble.


Introduction
Neural network (NN) based models are ubiquitous in natural language processing (NLP). In particular, sequence-to-sequence models have found adoption in neural machine translation (NMT; Luong et al., 2015), neural semantic parsing (NSP; Dong and Lapata, 2016), and beyond. While basic sequence-to-sequence models have shown impressive results on these tasks, recent work (Lake and Baroni, 2018; Keysers et al., 2019; Kim and Linzen, 2020) has presented the disconcerting finding that these models fail to generalize to novel combinations of elements observed in the training set (see Section 2). Therefore, several models and methods with improved compositional generalization have recently been proposed (Liu et al., 2020; Li et al., 2019; Russin et al., 2020; Guo et al., 2020a; Herzig and Berant, 2020; Furrer et al., 2020; Andreas, 2020; Guo et al., 2020b; Herzig et al., 2021).

In this work, we consider the task of detecting compositionally out-of-distribution (OOD) examples, which, to the best of our knowledge, has not been investigated before. The ability to detect OOD inputs is important, as it helps us to decide whether the model's prediction on the input can be trusted, which is crucial for safe deployment of the model and could be useful for building more efficient systems.
To this end, we analyse the OOD detection performance of recurrent neural network (RNN) and transformer-based models using methods relying on predictive uncertainty. In addition, we propose to use a heterogeneous ensemble of transformer and RNN-based models that combines the strengths of both to improve the detection of compositionally OOD examples.

Background
Several recent works have investigated the generalization properties of commonly used sequence-to-sequence models, in particular their ability to learn to process and produce novel combinations of elements observed during training (Lake and Baroni, 2018; Keysers et al., 2019; Kim and Linzen, 2020). Lake and Baroni (2018) propose the SCAN dataset, which consists of natural language utterances (input) and action sequences (output), and perform an analysis of the generalization performance of sequence-to-sequence models on different splits of the dataset. The different splits are aimed at testing the ability of networks to (1) generalize to novel combinations of tokens observed only in isolation during training (the JUMP and TURN_LEFT settings) and (2) generalize to longer sequence lengths (the LENGTH setting). For the JUMP setting, the training set consists of the basic example "jump" → [JUMP] as well as all other simple and composed examples (e.g. "run twice"), while the test set contains all composed examples with "jump" (e.g. "jump twice"). They observe that standard sequence-to-sequence models fail on the JUMP and LENGTH splits (accuracy below 10%) while they perform well (near 100%) on a random test split. Keysers et al. (2019) performed their analysis on the CFQ dataset, which provides tens of thousands of automatically generated question/SPARQL-query pairs along with maximum compound divergence (MCD) splits. The MCD splits are generated such that the distributions over compounds (phrases combining atomic elements) are maximally different between the train and test sets while the distributions over the atomic elements (entities, relations, question patterns) are kept similar. Keysers et al. (2019) also provide MCD splits for the SCAN dataset.
Experiments using standard neural sequence-to-sequence models (transformers and RNN+attention) reveal that while the random splits result in near-perfect accuracy, the MCD splits suffer dramatic losses in performance (< 20% accuracy for CFQ's MCD splits and < 10% for SCAN's MCD splits).

Detecting OOD examples
In this work, we focus on OOD detection methods that build on the predictive distributions of discriminative task-specific models (extending the work of Hendrycks and Gimpel (2017)). These methods have the advantage that they are easy to use in existing models and do not require additional models or additional training (unlike for example generative modeling (Nalisnick et al., 2018;Ren et al., 2019)). Previous work has shown that neural network models can produce incorrect predictions with high confidence on OOD inputs (Nguyen et al., 2015), which can be detrimental for detecting such inputs. We investigate whether this is the case for compositionally OOD examples in semantic parsing models as well.
We compare the following measures quantifying the uncertainty of the prediction based on the output distributions of a trained model: (1) the average negative log-likelihood (NLL) for the generated sequence, (2) the sum of the NLLs, and (3) the average entropy of the output distributions. More specifically, our approach for measuring uncertainty proceeds as follows: First, the input $x$ is encoded and an output sequence $\hat{y}$ is generated by the decoder. The model's output probability distributions $p(\hat{y}_i \mid \hat{y}_{<i}, x)$ for every decoding step $i$ are then used to compute the sum of NLLs as

$$\mathrm{SumNLL} = -\sum_{i} \log p(\hat{y}_i \mid \hat{y}_{<i}, x). \quad (1)$$

The average entropy is given by

$$\mathrm{AvgENT} = \frac{1}{|\hat{y}|} \sum_{i} \Big( -\sum_{v \in V} p(v \mid \hat{y}_{<i}, x) \log p(v \mid \hat{y}_{<i}, x) \Big), \quad (2)$$

where $V$ is the set of all output tokens.
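The three measures can be computed directly from the per-step output distributions. The following minimal NumPy sketch (the function name and array layout are ours, not from the paper) takes the T×|V| matrix of decoder distributions and the generated token ids, and returns the sum of NLLs, the average NLL, and the average entropy.

```python
import numpy as np

def uncertainty_scores(step_probs, token_ids):
    """Sequence-level uncertainty scores from per-step output distributions.

    step_probs: array of shape (T, V) -- the model's output distribution
        p(y_i | y_<i, x) at each of the T decoding steps.
    token_ids: length-T sequence of the generated token ids.

    Returns (sum_nll, avg_nll, avg_entropy); higher values indicate higher
    uncertainty, i.e. an input more likely to be OOD.
    """
    step_probs = np.asarray(step_probs, dtype=np.float64)
    T = len(token_ids)
    # Negative log-likelihood of each generated token.
    nll = -np.log(step_probs[np.arange(T), token_ids])
    # Entropy of the full output distribution at each step (over vocab V);
    # the small constant guards against log(0).
    ent = -np.sum(step_probs * np.log(step_probs + 1e-12), axis=1)
    return nll.sum(), nll.mean(), ent.mean()
```

All three scores are then compared against a threshold to flag OOD inputs.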

MC Dropout
To take model uncertainty into account, Bayesian approaches can be used (Louizos and Welling, 2017; Maddox et al., 2019; Malinin and Gales, 2018). A simple method for approximating the predictive uncertainty under a Bayesian posterior distribution over model parameters is MC dropout (Gal and Ghahramani, 2016).
In our work, we use MC dropout as follows: First, we encode the input $x$ and run the decoder to generate an output sequence $\hat{y}$. Then, we obtain $K$ output probability distributions $p_k(y_i \mid \hat{y}_{<i}, x)$, $k = 1, \ldots, K$, for each decoding step by feeding $x$ and the generated $\hat{y}$ through the model $K$ times while randomly dropping neurons with the same probability as during training. Finally, the posterior predictive distribution is approximated by

$$p(y_i \mid \hat{y}_{<i}, x) \approx \frac{1}{K} \sum_{k=1}^{K} p_k(y_i \mid \hat{y}_{<i}, x),$$

and is used with the metrics described previously.
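A framework-agnostic sketch of this averaging (names are ours): assuming `stochastic_forward()` performs one forward pass over the fixed $(x, \hat{y})$ pair with dropout still active (in PyTorch this would mean keeping the dropout layers in training mode) and returns the (T, V) matrix of per-step output probabilities, the MC dropout approximation is simply an average over K such samples.

```python
import numpy as np

def mc_dropout_predictive(stochastic_forward, K=10):
    """Approximate the posterior predictive distribution with MC dropout:
    run the model K times on the same input and generated sequence, each
    time with a fresh dropout mask, and average the per-step distributions."""
    samples = np.stack([stochastic_forward() for _ in range(K)])  # (K, T, V)
    return samples.mean(axis=0)                                   # (T, V)
```

The averaged distribution is then scored with the same NLL/entropy measures as before.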

Homogeneous ensemble
Another method often used for uncertainty quantification is deep ensembles (Lakshminarayanan et al., 2017), where K models with the same architecture and hyperparameters are trained in parallel starting from different initializations. The final prediction is the average over the individual predictions. For our sequence models, we average the predictive distributions of the ensembled models at every decoding step.
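In code, the per-step averaging is a one-liner; a hypothetical sketch where `member_probs` stacks the (T, V) output distributions of the K ensemble members:

```python
import numpy as np

def ensemble_step_probs(member_probs):
    """Average the per-step output distributions of a deep ensemble.
    member_probs: array of shape (K, T, V), one slice per ensemble member.
    Returns the (T, V) averaged predictive distribution."""
    return np.mean(np.asarray(member_probs, dtype=float), axis=0)
```

The averaged distribution is then scored with the same uncertainty measures as a single model.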

Heterogeneous ensemble
In our experiments, we found that different underlying architectures are better at detecting different types of OOD examples. To further improve detection performance, we propose to use a heterogeneous ensemble of different models for compositional OOD detection in semantic parsing. Concretely, given M different architectures (in our case M = 2), we first train an ensemble (in our case of size 3) of each architecture, and during prediction, analogously to the regular ensemble, we average the predictive distributions of all the models at every time step.

Table 1: OOD detection performance on SCAN's splits for the transformer (TM), the GRU-based sequence-to-sequence model with attention (GRU), and heterogeneous ensembles (HE). The results for MCD correspond to the average over the three MCD splits. If "MC Drop" is "No", MC dropout is not used during prediction; otherwise, the value of "MC Drop" specifies the number of samples (K in Section 3.1). If "Ensemble" is "No", homogeneous ensembling is not used; otherwise, its value specifies the number of models in the ensemble. "3+3" specifies that we use ensembles of 3 transformer models and 3 GRU models in the heterogeneous ensemble. The best result is shown in bold; results close to the best are underlined.
We also combine heterogeneous ensembles and MC dropout, the approach for which is described in Appendix C.

Experiments
Datasets: We experiment with the SCAN (Lake and Baroni, 2018) and CFQ (Keysers et al., 2019) datasets mentioned in Section 2. Table 4 in Appendix A provides some statistics on the number of examples in each split.

Models:
We consider both a transformer-based (Vaswani et al., 2017) and a GRU+attention based sequence-to-sequence model in our experiments. Both are randomly initialized. For the transformer, we use six layers with six heads, learned position embeddings, dimension 384, and dropout rate 0.25. For the GRU-based model, we use two layers, hidden dimension 384, and dropout probability 0.1 (0.25 for the homogeneous ensemble). The models are trained using Adam (Kingma and Ba, 2015) with an initial learning rate of 5 × 10⁻⁴. For more details, see Appendix B. Code is available at https://github.com/lukovnikov/parseq/tree/emnlpcr/. We ran all experiments with three different seeds and report the average.

Evaluation: To evaluate the ability of the techniques presented in Section 3 to detect OOD examples, the following metrics are computed: (1) AUROC↑ (area under the ROC curve), (2) AUPRC↑ (area under the precision-recall curve), and (3) FPR90↓ (false positive rate at 90% true positive rate), where ↑ indicates that higher is better and ↓ that lower is better. These metrics are commonly used to measure performance in OOD detection as well as for binary classifiers in general. The results in Table 2 verify that the query accuracy is similar to previously reported numbers. They show that the standard sequence-to-sequence models fail on all compositional generalization scenarios except the TURN_LEFT split from SCAN. In contrast, the ID test accuracy was near 100% for both datasets.
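These three metrics can be computed from the ID and OOD uncertainty scores alone. A self-contained NumPy sketch (names are ours; it treats OOD as the positive class, assumes higher scores for OOD inputs, and assumes no tied scores):

```python
import numpy as np

def ood_detection_metrics(scores_id, scores_ood, tpr_level=0.9):
    """Evaluate an uncertainty score as an OOD detector.
    Returns (auroc, auprc, fpr_at_tpr_level)."""
    n_id, n_ood = len(scores_id), len(scores_ood)
    s = np.concatenate([scores_id, scores_ood]).astype(float)
    y = np.concatenate([np.zeros(n_id), np.ones(n_ood)])  # 1 = OOD

    # AUROC via the Mann-Whitney U statistic (valid when scores are untied).
    ranks = np.argsort(np.argsort(s)) + 1
    auroc = (ranks[y == 1].sum() - n_ood * (n_ood + 1) / 2) / (n_id * n_ood)

    # Walk the decision thresholds from the highest score downwards.
    order = np.argsort(-s)
    y_sorted = y[order]
    tp = np.cumsum(y_sorted)        # true positives at each cut-off
    fp = np.cumsum(1 - y_sorted)    # false positives at each cut-off
    precision = tp / (tp + fp)
    tpr, fpr = tp / n_ood, fp / n_id

    # AUPRC as average precision: precision at each newly recovered OOD example.
    auprc = np.sum(precision * y_sorted) / n_ood
    # FPR90: FPR at the first threshold whose TPR reaches tpr_level.
    fpr_at_tpr = fpr[np.searchsorted(tpr, tpr_level)]
    return auroc, auprc, fpr_at_tpr
```

A perfectly separating score yields AUROC = AUPRC = 1 and FPR90 = 0; a random score yields AUROC ≈ 0.5.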

Results
The OOD detection performance for the different splits of the SCAN dataset is reported in Table 1, and for the CFQ dataset in Table 3. SCAN's random split obtains 50% AUROC, which is expected since it does not contain OOD data.

The effect of MC dropout: The method described in Section 3.1 leads to improvements across different settings and architectures, with the exception of SCAN's length-based split.
The effect of architecture: Different architectures appear to produce markedly different results for different types of splits on SCAN. The transformer performs better than the GRU-based model on the primitive generalization splits (SCAN's JUMP and TURN_LEFT splits), slightly underperforms on the MCD splits of both SCAN and CFQ, and is worse on the length-based SCAN split.

How difficult are the different splits? Some of the splits are more challenging to detect than others. The JUMP split appears the easiest to detect (see Table 1). The TURN_LEFT split is more challenging. The high query accuracy on this test set in Table 2 might indicate that it is closer to the training distribution than the others. Nevertheless, several methods are able to achieve high detection performance for TURN_LEFT. The transformer fails to produce any correct output on the length-based split of SCAN and is also bad at detecting when it encounters such examples.

The effect of homogeneous ensemble: The regular (homogeneous) deep ensemble (Lakshminarayanan et al., 2017) leads to significant improvements in OOD detection ability across all tested architectures and datasets. However, using an ensemble is not always sufficient to close the performance gap to the best-performing architecture on a certain split (e.g. the GRU on the length-based split). Note that a disadvantage of using ensembles is the increased computational requirements, which can be especially prohibitive for large transformer-based models.
The effect of heterogeneous ensemble: Using the heterogeneous ensemble of a transformer and a GRU-based sequence-to-sequence model to detect OOD examples yields the best overall results. The heterogeneous ensemble leads to an overall improvement both in combination with MC dropout and with regular ensemble. Most notable are the gains on SCAN's MCD splits, reaching an FPR90 of less than 5% with MC Dropout and below 1% with regular ensemble. It also appears to improve results on the TURN_LEFT split and beats the detection performance of the ensembled GRU-based model on the length-based SCAN split.

Analysis and Discussion
In the results obtained in Table 1, two things stand out: (1) the gap in OOD detection performance between the transformer and the GRU-based model on the length-based split and (2) the extremely high OOD detection performance of the transformer on the JUMP split. In this section, we perform a further analysis to try to better understand these findings.

Length-based split:
To analyse what may have caused the poor performance of the transformer on the length-based split, we investigate the lengths of the generated outputs (see Figure 1). We found that the transformer with absolute position encodings (PE) that produced the results in Tables 1 and 3 is more biased towards generating shorter sequences than a transformer with relative PE or the GRU.

Figure 1: Histograms of lengths for outputs generated by two different transformer models for SCAN's length-based split. Blue is on ID inputs, red is on OOD inputs. Note that here we also count the tokens added at the beginning and end of the sequences.

However, the longer sequences produced by the GRU could favour its SumNLL scores: SumNLL sums over the entire sequence, and simply producing longer sequences, even with similar per-timestep entropies to ID data, would lead to better distinguishable examples. For AvgENT, however, which is averaged over time steps and therefore not influenced by the length, the GRU-based model still performs better than the transformer.

Thus, we believe that while the length of the generated sequences can be an important signal for detection, and may give a slight benefit to the GRU-based model, it is not the only reason for the high performance of the GRU-based model.

Transformer on JUMP split: To ensure that the high performance of the transformer on JUMP is not just due to the exploitation of trivial input features, we experimented with additional JUMP examples that put the word "jump" in all other positions to avoid correlation with the position vectors. This indeed resulted in slightly worse OOD detection ability. However, with an FPR90 of 4.6, the transformer was still better than the GRU-based model.

Conclusion
In this work, we investigate how easy it is for neural semantic parsers to detect out-of-distribution examples in the context of compositional generalization. While some recent works (Fomicheva et al., 2020; Malinin and Gales, 2021) investigate similar methods for structured prediction (for NMT and automated speech recognition), to the best of our knowledge, our work is the first to investigate compositional OOD detection for NSP. Our analysis shows that relatively simple uncertainty-based methods perform well for RNN as well as transformer-based models in most settings. Ensembles provide the best results, while MC dropout leads to improvements at no extra training cost. OOD detection can be further improved by using an ensemble of RNN and transformer-based models.

A Dataset statistics
See Table 4 for statistics of the datasets and splits used. Note that the datasets come with three predefined MCD-based splits, which are referred to as MCD-{X} for X ∈ {1, 2, 3} in the table.

Preprocessing: The CFQ dataset is preprocessed using a simple reversible transformation of SPARQL queries into LISP-style s-expressions. This includes converting sets of triple patterns of the form "?x :rel ?y . ?a :r ?b" to s-expressions of the form "(AND (COND ?x :rel ?y) (COND ?a :r ?b))", where the order of the arguments of "AND" does not matter during evaluation.
Accuracy: The logical form accuracy considers an example correct if the predicted logical form is equivalent to the target logical form, and is invariant to the effects of linearization order. In the case of CFQ, which uses SPARQL, this means that the order of conditions does not affect the accuracy of the obtained results and is therefore ignored. In the case of SCAN, whose outputs are action sequences, this simply becomes sequence-level accuracy.
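For CFQ, this order-invariant comparison can be sketched by comparing the sets of conditions (a hypothetical helper, not the evaluation code used in the paper; it assumes triple patterns are separated by " . " as in the preprocessing example above):

```python
def logical_form_match(pred, gold):
    """Order-invariant match for SPARQL-style conjunctions of triple
    patterns: two queries are considered equivalent if they contain the
    same set of conditions, regardless of the order they are written in."""
    def conds(query):
        # Split on the ' . ' triple-pattern separator and normalize whitespace.
        return frozenset(c.strip() for c in query.split(" . ") if c.strip())
    return conds(pred) == conds(gold)
```

For SCAN this reduces to exact sequence equality, since action sequences have a fixed linearization.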

C MC dropout with heterogeneous ensemble
When we apply MC dropout to the heterogeneous ensemble, we train only one model for each of the two different architectures. These models are used to independently predict a sequence ŷ_m with m ∈ {1, 2} given x, with dropout disabled. Next, we feed x and ŷ_m through both models K times with dropout enabled, leading to two averaged outputs ȳ_m, each over 2K distributions. Finally, we perform max-pooling over the NLL-based metrics computed for each ȳ_m such that the most pessimistic score is retained.
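A sketch of this procedure (function and argument names are ours): each architecture's decoded sequence is scored under the 2K dropout-sampled distributions averaged over both models, and the two NLL-based scores are max-pooled.

```python
import numpy as np

def hetero_mc_dropout_score(stochastic_models, decoded_seqs, K=10):
    """MC dropout with a heterogeneous ensemble of two architectures.

    stochastic_models: list of 2 callables; f(seq) runs one stochastic
        forward pass (dropout enabled) and returns (T, V) probabilities
        for the given token sequence.
    decoded_seqs: list of 2 token-id sequences, one decoded (greedily,
        dropout disabled) by each architecture.
    Returns the max-pooled (most pessimistic) sum-of-NLL score.
    """
    scores = []
    for seq in decoded_seqs:
        # 2*K sampled distributions for this sequence, drawn from BOTH models.
        samples = [f(seq) for f in stochastic_models for _ in range(K)]
        p_bar = np.mean(samples, axis=0)                  # averaged (T, V)
        nll = -np.log(p_bar[np.arange(len(seq)), seq])    # per-step NLL
        scores.append(nll.sum())
    return max(scores)  # keep the most pessimistic score
```

With sum-of-NLL as the metric, a higher pooled score flags the input as more likely OOD; the same pooling applies to the other NLL-based metrics.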