Efficient Out-of-Domain Detection for Sequence to Sequence Models



Introduction
Sequence-to-sequence (seq2seq) models achieve state-of-the-art performance in various NLP tasks, such as neural machine translation (NMT; Vaswani et al. (2017); Song et al. (2019); Zhu et al. (2020); Liu et al. (2020)), abstractive text summarization (ATS; Zhang et al. (2020); Lewis et al. (2020)), question answering (QA; Raffel et al. (2020)), and others. Such models may encounter a wide variety of user inputs when exposed to the general public. In many cases, it is preferable to detect and specially handle what are known as out-of-domain (OOD) inputs. OOD instances differ significantly from the data used during training, and as a result, model predictions on such inputs might be unreliable. OOD detection can be performed in supervised and unsupervised ways. In a supervised approach, one trains a discriminator between in-domain (ID) and OOD instances on a labeled dataset of such instances, which is manually annotated (Hendrycks et al., 2019) or synthetically generated (Liang et al., 2018). The drawback of such an approach is that the discriminator is itself limited in what instances it can correctly process. Therefore, in many practical cases, it might be better to use an unsupervised approach, where OOD instances are detected using uncertainty estimation (UE) methods.
Related work. UE for text generation models is still an area of ongoing research with only a limited number of works. Malinin and Gales (2020) propose various ensemble-based UE methods for seq2seq models and evaluate them on two tasks: NMT and automatic speech recognition. Ensemble-based methods in conjunction with Monte Carlo (MC) dropout (Gal and Ghahramani, 2016) are also investigated in (Lukovnikov et al., 2021). The authors find that ensemble-based UE methods lead to the best results for OOD detection in the neural semantic parsing task. Xiao et al. (2020) introduce a novel UE method, BLEUVar, which is also based on MC dropout. The uncertainty score is calculated as a sum of the squared complements of BLEU scores over all pairs of texts generated with different dropout masks. The method shows improvements over the baselines in NMT. Lyu et al. (2020) further explore this method for OOD detection in question answering. Gidiotis and Tsoumakas (2022) show that BLEUVar can also be applied for UE in summarization. The aforementioned methods entail performing multiple model inferences for each individual input, resulting in high computational overhead. Recently, Kuhn et al. (2022) proposed a method, called semantic entropy, that does not leverage MC dropout but samples multiple predictions without additional inferences. It is based on the idea that different samples can have the same meaning and calculates the entropy of the probability distribution over meanings instead of their surface realizations. Semantic entropy outperforms the standard predictive entropy-based methods proposed in (Malinin and Gales, 2020) on the free-form question answering task.
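The pairwise BLEUVar computation can be made concrete with a minimal sketch. The `unigram_overlap` similarity below is a toy stand-in for BLEU (a real implementation would use a proper BLEU library), and the helper names are our own illustration, not the authors' code.

```python
from itertools import permutations

def pairwise_disagreement(samples, sim):
    # BLEUVar-style uncertainty: sum of squared complements of a pairwise
    # similarity over all ordered pairs of sampled outputs.
    return sum((1.0 - sim(a, b)) ** 2 for a, b in permutations(samples, 2))

def unigram_overlap(a, b):
    # Toy stand-in for BLEU: fraction of tokens of `a` that appear in `b`.
    ta, tb = a.split(), b.split()
    return sum(t in tb for t in ta) / len(ta) if ta else 0.0

# Identical samples -> zero disagreement; diverse samples -> positive score.
print(pairwise_disagreement(["the cat sat", "the cat sat"], unigram_overlap))  # 0.0
print(pairwise_disagreement(["the cat sat", "a dog ran"], unigram_overlap))
```

Samples generated under different dropout masks that say the same thing yield a low score, while divergent samples drive the score up.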
Contributions. In this work, we show that there is significant room for improvement for existing OOD detection methods in seq2seq tasks. We find that in some configurations, they work even worse than random choice. Moreover, most of them are computationally intensive, which hinders their successful application in real-world settings.
To address these issues, we adopt methods based on fitting the probability density of latent instance representations obtained from a trained neural network (Lee et al., 2018; Yoo et al., 2022). While these methods have been shown to be effective for text classification tasks, their application in text generation tasks has received limited research attention. We fill this gap by conducting an empirical investigation of these methods for OOD detection in NMT, ATS, and QA and show their superiority over the baselines from previous work. The main contributions of our paper are as follows.
• We perform a large-scale empirical study of UE methods on three different sequence generation tasks: NMT, ATS, and QA, with various types of out-of-domain inputs: permutations of tokens from the original input, texts from a new domain, and texts from another language.
• We show that the density-based approaches are both more effective and computationally more efficient than previously explored state-of-the-art ensemble-based or MC dropout-based methods. The improvement is consistently observed in all considered tasks.
Out-of-domain Detection Methods

OOD detection using uncertainty estimation is a binary classification task, where an uncertainty score U(x) of a given input x is a predictor of x coming from an unknown domain. In practice, a threshold δ is specified so that all x : U(x) > δ are considered to be OOD. The task of text generation involves complex autoregressive probabilistic models and usually requires making not one but multiple predictions (one per output token). These two factors make UE of predictions in text generation tasks much more complicated than in standard text classification tasks. Below, we provide a short overview of the approaches to uncertainty estimation of autoregressive model predictions investigated in our work. More comprehensive details can be found in Appendix A. All methods described below can be applied to the majority of modern Transformer-based pre-trained seq2seq models.
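The thresholding step above can be sketched in a few lines. `threshold_for_fpr` is our own illustrative helper: it calibrates δ on held-out ID scores so that roughly a chosen fraction of ID inputs would be flagged.

```python
def threshold_for_fpr(id_scores, fpr=0.05):
    # Pick delta so that roughly a fraction `fpr` of in-domain inputs
    # would be flagged as OOD (a common way to calibrate the threshold).
    s = sorted(id_scores)
    k = max(0, round((1 - fpr) * len(s)) - 1)
    return s[k]

def flag_ood(scores, delta):
    # Inputs with uncertainty U(x) above the threshold are treated as OOD.
    return [u > delta for u in scores]

id_scores = [i / 100 for i in range(1, 101)]
delta = threshold_for_fpr(id_scores, fpr=0.05)
print(delta, sum(flag_ood(id_scores, delta)))  # 0.95 5
```

In the experiments below, the threshold-free AU-ROC metric is used instead, which integrates over all possible choices of δ.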

Information-based Uncertainty Estimation
For each input x, seq2seq models can usually generate multiple candidate sequences y via beam search, where the resulting set of sequences B(x) = {y^(b)}_{b=1}^{B} is called a "beam". To get the uncertainty score associated with a prediction on x, we can aggregate individual uncertainties for the input-output pairs (x, y^(b)) over the whole beam.
The simplest aggregation method is to take the probability of the sequence y* that has the maximum confidence and is usually selected as the final model output. We refer to this method as Maximum Sequence Probability (MSP). An alternative approach is to consider the hypotheses in the beam y^(b) as samples from a distribution over possible sequences. In this case, we can compute the expected probabilities over the beam, yielding a method called Normalized Sequence Probability (NSP). Another option is to compute the average entropy of the predictive token distributions over the beam.
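The two aggregation schemes can be sketched as follows, assuming each hypothesis is given as a list of token log-probabilities and sequence scores are length-normalised as in Appendix A; the function names are ours.

```python
import math

def seq_score(token_logprobs):
    # Length-normalised sequence log-probability (average token log-prob).
    return sum(token_logprobs) / len(token_logprobs)

def msp(beam):
    # Maximum Sequence Probability: uncertainty of the single most
    # confident hypothesis in the beam.
    return 1.0 - math.exp(max(seq_score(h) for h in beam))

def nsp(beam):
    # Normalized Sequence Probability: probability-weighted expectation
    # of per-hypothesis uncertainty over the beam.
    probs = [math.exp(seq_score(h)) for h in beam]
    z = sum(probs)
    return 1.0 - sum(p * p / z for p in probs)

beam = [[-0.1, -0.1], [-2.0, -2.0]]  # two hypotheses, token log-probs
print(msp(beam))  # ~0.095: the best hypothesis is quite confident
print(nsp(beam))  # larger, since the beam also holds a weak hypothesis
```

MSP ignores everything but the top hypothesis, while NSP reflects the spread of probability mass over the beam.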

Ensembling
One can train several models for a single task and benefit from their variability to estimate the uncertainty. In this section, we mostly follow Malinin and Gales (2020), who give a comprehensive overview of the information-based UE techniques for ensembles and Bayesian methods in general.
First of all, note that the hypothesis sequences that form the beam B(x) = {y^(b)}_{b=1}^{B} can, in the case of ensembling, be generated naturally by producing tokens sequentially according to the average of the probabilities of the ensemble members. Such an ensembling approach is usually referred to as a Product of Expectations (PE) ensemble. We consider two types of ensemble-based UE methods: sequence-level and token-level.
Sequence-level methods obtain uncertainty scores for the whole sequence at once. We consider Total Uncertainty (TU), measured via entropy, and Reverse Mutual Information (RMI). We refer to these scores as PE-S-TU and PE-S-RMI in our experiments.
One can also consider an alternative way of ensembling models that is usually called the Expectation of Products (EP) ensemble. It averages the probabilities of whole sequences computed by different models. This approach gives us two more variants of TU and RMI: EP-S-TU and EP-S-RMI.
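The difference between PE and EP averaging can be sketched numerically, assuming each ensemble member's predictions are given as token or whole-sequence log-probabilities; the helper names are illustrative.

```python
import math

def pe_token_logprob(step_member_logprobs):
    # Product of Expectations: at each step, average the member token
    # *probabilities*, then take the log of the averaged distribution.
    return [math.log(sum(math.exp(lp) for lp in step) / len(step))
            for step in step_member_logprobs]

def ep_seq_prob(member_seq_logprobs):
    # Expectation of Products: average the *whole-sequence* probabilities
    # computed by the individual members.
    return sum(math.exp(lp) for lp in member_seq_logprobs) / len(member_seq_logprobs)

# Two members assign per-token probabilities 0.8 and 0.4 at both steps.
steps = [[math.log(0.8), math.log(0.4)], [math.log(0.8), math.log(0.4)]]
print(math.exp(sum(pe_token_logprob(steps))))         # ~0.36 (0.6 * 0.6)
print(ep_seq_prob([math.log(0.64), math.log(0.16)]))  # ~0.40 ((0.64 + 0.16) / 2)
```

The two averaging orders generally give different sequence probabilities, which is why PE and EP yield distinct families of uncertainty scores.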
In token-level UE methods, we compute some uncertainty measure for each token first and then average these scores over all tokens in a sequence.

Density-based Methods
Recently, density-based methods have exhibited outstanding performance in UE of deep neural network predictions (Lee et al., 2018; van Amersfoort et al., 2020; Kotelevskii et al., 2022; Yoo et al., 2022). Yet, none of them has been applied to seq2seq models.
The basic idea behind density-based UE methods is to leverage the latent space of the model and fit the probability density of the training input representations within it. A lower density value is then considered an indicator of higher uncertainty, since the prediction is made in a region with scarce training data.
We adopt two state-of-the-art methods of this type for seq2seq models: Mahalanobis Distance (MD; Lee et al. (2018)) and Robust Density Estimation (RDE; Yoo et al. (2022)). Let h(x) be a hidden representation of an instance x. The MD method fits a Gaussian centered at the training data centroid µ with an empirical covariance matrix Σ. The uncertainty score is the Mahalanobis distance between h(x) and µ:

U_MD(x) = (h(x) − µ)^T Σ^{−1} (h(x) − µ).

We suggest using the last hidden state of the encoder averaged over non-padding tokens or the last hidden state of the decoder averaged over all generated tokens as h(x). An ablation study of various embedding extraction and reduction methods is provided in Appendix D.
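A minimal NumPy sketch of the MD score, assuming the representations h(x) have already been extracted as row vectors (e.g., averaged encoder hidden states); the regularisation term and function names are our own.

```python
import numpy as np

def fit_md(train_feats, reg=1e-6):
    # Fit the centroid and (regularised) empirical covariance of the
    # training representations h(x).
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False) + reg * np.eye(train_feats.shape[1])
    return mu, np.linalg.inv(cov)

def md_score(h, mu, cov_inv):
    # Squared Mahalanobis distance of h(x) from the training centroid.
    d = h - mu
    return float(d @ cov_inv @ d)

# A small 2-D "training set" of representations around (0.5, 0.5).
train = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [.5, .2], [.5, .8]])
mu, cov_inv = fit_md(train)
print(md_score(np.array([0.5, 0.5]), mu, cov_inv))  # near 0: close to centroid
print(md_score(np.array([5.0, 5.0]), mu, cov_inv))  # large: likely OOD
```

Fitting only requires a single pass over the training representations, which is what makes the method cheap at inference time.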
The RDE method improves over MD by reducing the dimensionality of h(x) via PCA decomposition. It also computes the covariance matrix in a robust way using the Minimum Covariance Determinant estimate (Rousseeuw, 1984). The uncertainty score U_RDE(x) is again the Mahalanobis distance, but in the space of reduced dimensionality.
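A simplified RDE-style sketch: PCA via SVD followed by a Mahalanobis score in the reduced space. Note that the actual method uses the robust Minimum Covariance Determinant estimator; we substitute a plain empirical covariance here to keep the sketch dependency-free, so this is an approximation of RDE, not a faithful implementation.

```python
import numpy as np

def fit_rde(train_feats, n_components=2):
    # RDE-style sketch: PCA via SVD, then a Gaussian in the reduced space.
    # The actual method replaces the empirical covariance below with the
    # robust Minimum Covariance Determinant estimate.
    mu = train_feats.mean(axis=0)
    _, _, vt = np.linalg.svd(train_feats - mu, full_matrices=False)
    comps = vt[:n_components]            # top principal directions
    z = (train_feats - mu) @ comps.T     # reduced training representations
    cov = np.cov(z, rowvar=False) + 1e-6 * np.eye(n_components)
    return mu, comps, np.linalg.inv(cov)

def rde_score(h, mu, comps, cov_inv):
    # Mahalanobis distance in the PCA-reduced space.
    d = (h - mu) @ comps.T
    return float(d @ cov_inv @ d)

train = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.],
                  [1., 1., 0.], [.5, .5, .1], [.2, .8, -.1]])
params = fit_rde(train)
print(rde_score(np.array([.5, .5, 0.]), *params))    # small: in-domain
print(rde_score(np.array([10., 10., 0.]), *params))  # large: OOD
```

The dimensionality reduction both denoises the representations and makes the covariance estimate better conditioned.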

Experiments
Following (Malinin and Gales, 2020), we use two approaches to generating OOD data for a given "in-domain" (ID) dataset. In the first approach, we simply take texts from another dataset, which is distinct from the training set of the model in terms of domain and/or structure. In the second approach, we corrupt the dataset by randomly permuting the source tokens (PRM). The details of OOD data creation are provided in Appendix B.
Following the previous works on OOD detection (Hendrycks and Gimpel, 2017; Malinin and Gales, 2020), we report the AU-ROC scores of detecting OOD instances mixed into the test set. To ensure stability, we run each experiment with 5 different random seeds and report the standard deviation. For brevity, in the main part, we report the results of only the two best-performing methods from each method group. The hardware configuration for the experiments is provided in Appendix B.

Machine Translation

Experimental setup. As ID data, we use the WMT'17 En-De and WMT'20 En-Ru datasets. The OOD datasets were selected according to the benchmark of Malinin and Gales (2020). Since in real-life settings OOD data come from various sources, we want to cover as many data domains as possible with these datasets. For OOD data generation, we use texts from WMT'14 (Bojar et al., 2014) in French, the LibriSpeech test-clean (LTC) reference texts (Panayotov et al., 2015), and English comments from Reddit from the Shifts dataset (Malinin et al., 2022). The predictions are made by the multilingual mBART model (Liu et al., 2020). The details of the datasets and the model are provided in Appendix B.
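The AU-ROC of OOD detection reduces to the probability that a randomly chosen OOD input receives a higher uncertainty score than a randomly chosen ID input. A dependency-free sketch of this rank formulation (our own helper; in practice one would use a standard metrics library):

```python
def auroc(id_scores, ood_scores):
    # AU-ROC equals the probability that a random OOD input receives a
    # higher uncertainty score than a random ID input (ties count 0.5).
    pairs = [(u, v) for u in id_scores for v in ood_scores]
    wins = sum(1.0 if v > u else 0.5 if v == u else 0.0 for u, v in pairs)
    return wins / len(pairs)

print(auroc([0.1, 0.2], [0.8, 0.9]))  # 1.0: perfect separation
print(auroc([0.5], [0.5]))            # 0.5: chance level
```

A score of 0.5 corresponds to the random choice baseline mentioned in the results, and scores below 0.5 indicate that a method systematically assigns lower uncertainty to OOD inputs than to ID ones.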
Results. The performance of the selected methods is presented in Figure 1 and in Figure 4 in Appendix H. For both ID datasets, with LTC and PRM as the OOD data, MD separates ID and OOD instances very clearly. It achieves an AU-ROC score very close to the optimal one, outperforming all the ensemble-based methods.
When WMT'14 is used as OOD data for the model trained on WMT'17, most of the ensemble-based methods notably fall behind even random choice, which means that the model is overconfident on OOD instances. In contrast, MD and RDE yield adequate results. MD based on encoder-derived embeddings shows the best quality in this setting. In the hardest setting, where Reddit is used as the OOD dataset, MSP and ensembles detect OOD instances poorly, while the density-based methods outperform all other techniques by a large margin. The only case where density-based methods show slightly lower performance is when WMT'14 and Reddit are considered OOD for the model trained on WMT'20.
Overall, we can see that in most of the considered settings, MD substantially outperforms all other methods, and it is consistently better than the random choice baseline, while other methods are sometimes worse than random choice. The compute time of the selected methods is presented in Table 13 in Appendix E. We see that the efficient density-based methods introduce only a small computational overhead compared to ensemble-based approaches. The complete results of all the considered methods are presented in Table 15 in Appendix H.
Finally, a qualitative analysis of model performance and examples of ID/OOD predictions are presented in Tables 4 and 5 in Appendix C.

Abstractive Text Summarization
Experimental setup. We experiment with four widely used datasets for ATS, with each serving as ID and OOD: XSum (Narayan et al., 2018), AESLC (Zhang and Tetreault, 2019), Movie Reviews (MR; Wang and Ling (2016)), and Debate (Wang and Ling, 2016). Predictions are made by the standard BART model (Lewis et al., 2020). The details of the datasets and the model are provided in Appendix B.
Results. For brevity, in the main part of the paper, we only keep the results with XSum as the OOD dataset. The results for the other settings are presented in Appendix G. Figure 2 and Figure 5, as well as Tables 16 and 17 in Appendix G, illustrate the results of OOD detection in different corruption scenarios.
First, we can clearly see that the density-based methods relying on both encoder and decoder features provide a large improvement over both information-based and ensemble-based methods. In each corruption scenario, at least one of the MD versions yields the highest AU-ROC scores.
Second, we can observe that some OOD configurations where density-based methods achieve near-optimal quality (e.g., MR-XSum, MR-Debate) turn out to be challenging for both information-based and ensemble-based methods; these methods perform worse than the random choice baseline.
Third, when XSum is the ID dataset, RDE based on encoder features fails to perform well. MD, however, achieves the best results in these cases.
Finally, the ensemble-based methods struggle to perform stably across different settings. We can see that both PE-S-TU and PE-T-MI are even inferior to information-based methods in some ID-OOD dataset configurations (e.g., AESLC-XSum, Debate-XSum). MD, on the contrary, shows robust results without performance gaps.

Question Answering
Experimental setup. For the QA task, we select several widely used KGQA datasets: Simple Questions (Bordes et al., 2015), Mintaka (Sen et al., 2022), and RuBQ 2.0 (Rybin et al., 2021). For predictions, we use the T5 model pre-trained for the QA task (Roberts et al., 2020). The details of the datasets and the model are given in Appendix B. The T5 model is used in a zero-shot manner; since no sampling technique is applied, there is no prediction diversity for the single-model-based and density-based methods. Thus, we apply the bootstrap technique to estimate the confidence of the obtained results, reporting the standard deviation around the mean.
Results. Experiments on the QA task demonstrate similar behavior of the UE methods. From Figure 3 and Table 18 in Appendix H, we can see that the density-based estimates obtained from encoder-derived embeddings outperform all the other uncertainty methods by a large margin.
They achieve high-quality results even in cases where the ensemble-based methods completely miss the target (e.g., RuBQ2-RuBQ2ru). This confusion can be explained by the fact that when the model receives input data significantly different from what it was trained on (for example, pre-training was mostly in English while the question is in Russian), the network falls back to a default generation mode driven by token frequencies. An example of such a generation mode is illustrated in Table 7 in Appendix H.
For the RuBQ2-Mintaka and RuBQ2-PRM settings, we do not observe such a significant outlier as in the previous example. MD is the clear leader, followed by RDE with a significant gap. Additional qualitative analysis in Table 7 in Appendix H shows that for a particular OOD example, the uncertainty scores based on a single model or an MC ensemble are often not so different from their ID counterparts, which explains their poor performance.

Conclusion
We adopted density-based UE methods for seq2seq models and demonstrated that they provide the best results in OOD detection across three sequence generation tasks: NMT, ATS, and QA. They appear to be superior to the ensemble-based methods in terms of both performance and compute time, which makes them a good choice for practical applications.
In future work, we are going to extend the application of density-based methods to seq2seq models in other UE tasks such as selective classification.

Limitations
In our experiments, we presented results for three diverse sequence-to-sequence tasks, namely machine translation, text summarization, and knowledge graph question answering. While for these three tasks we managed to observe common trends (i.e., some methods consistently outperformed others), a larger-scale study of various sequence-to-sequence tasks is needed to further confirm this observation and the robustness of the best-performing method identified in this work.

Ethics Statement
Uncertainty estimation methods are useful for building safer and more robust machine learning models. However, the extent to which they may interfere with other model tailoring methods, such as debiasing or model compression, has not yet been studied. In principle, we do not see large ethical implications or risks in our research.

A.1 Base Probabilistic Uncertainty Measures
The task of sequence generation involves relatively complex autoregressive probabilistic models, and there exist several variants of defining uncertainties for them. Let us consider the input sequence x and the output sequence y ∈ Y of length L, where Y is the set of all possible output sequences. The standard autoregressive model parametrized by θ is then given by:

P(y | x, θ) = ∏_{l=1}^{L} P(y_l | y_{<l}, x, θ), (1)

where the distribution of each y_l is conditioned on all the previous tokens in the sequence, y_{<l} = {y_1, . . ., y_{l−1}}.
The probability P(y | x, θ) immediately gives the so-called Unnormalized Sequence Probability (USP) uncertainty measure: USP(y | x, θ) = 1 − P(y | x, θ). However, this measure tends to increase with the sequence length L, which is usually undesirable in practice. That is why several alternatives have been proposed.
The Normalized Sequence Probability (NSP; Ueffing and Ney (2007)) measure deals with the variable length directly via an appropriate normalization that corresponds to the average token log-probability:

NSP(y | x, θ) = 1 − exp{ (1/L) ∑_{l=1}^{L} log P(y_l | y_{<l}, x, θ) }. (2)

The average token-wise entropy (Malinin and Gales, 2020) generalizes the notion of standard entropy-based uncertainty measures to the case of autoregressive models:

H(y | x, θ) = (1/L) ∑_{l=1}^{L} H(y_l | y_{<l}, x, θ), (3)

where H(y_l | y_{<l}, x, θ) is the entropy of the token distribution P(y_l | y_{<l}, x, θ).
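The length effect that motivates the normalization can be checked numerically (illustrative helpers; token log-probabilities as plain lists):

```python
import math

def usp(token_logprobs):
    # Unnormalized Sequence Probability: 1 - P(y | x); creeps towards 1
    # as the sequence gets longer, even at fixed per-token confidence.
    return 1.0 - math.exp(sum(token_logprobs))

def nsp(token_logprobs):
    # Normalized Sequence Probability: 1 - exp(average token log-prob);
    # insensitive to sequence length.
    return 1.0 - math.exp(sum(token_logprobs) / len(token_logprobs))

short, long_seq = [-0.1] * 5, [-0.1] * 50
print(usp(short), usp(long_seq))  # USP grows with length
print(nsp(short), nsp(long_seq))  # NSP stays essentially the same
```

At a fixed per-token log-probability of −0.1, USP grows from roughly 0.39 to roughly 0.99 as the length goes from 5 to 50 tokens, while NSP is unchanged.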

A.2 Aggregation of Uncertainties over Beam
In practice, for each input x, seq2seq models usually generate several candidate sequences via the beam-search procedure. The resulting set B(x) = {y^(b)}_{b=1}^{B} is usually called a beam. Thus, to solve the OOD detection problem, one needs to aggregate the uncertainties of the particular pairs (x, y^(b)) into one uncertainty measure associated with the input x.
The simplest method to measure the uncertainty for a beam of sequences is to take the sequence with the maximum confidence, as exactly this sequence is usually selected as the resulting output of the model. In this work, we consider the particular instantiation of this approach based on the NSP measure (2), which we call Maximum Sequence Probability (MSP):

U_MSP(x) = NSP(y* | x, θ), where y* = argmax_{y ∈ B(x)} P(y | x, θ). (4)

The alternative approach is to consider the hypothesis sequences y^(b) as samples from the distribution over sequences P(y | x, θ). Each sequence is seen only once, and to correctly compute the expectation of some uncertainty measure U(y, x; θ) over this distribution, one needs to perform a correction. The natural choice is importance weighting, which leads to the following uncertainty estimate:

U(x) = ∑_{b=1}^{B} w_b U(y^(b), x; θ), where w_b = P(y^(b) | x, θ) / ∑_{b'=1}^{B} P(y^(b') | x, θ). (5)
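The importance-weighted aggregation described above can be sketched as follows (illustrative helper; hypotheses are given as lists of token log-probabilities):

```python
import math

def aggregate_over_beam(beam_logprobs, measure):
    # Importance-weighted expectation of a per-hypothesis uncertainty
    # `measure` over the beam; the weights are the renormalised
    # hypothesis probabilities.
    probs = [math.exp(sum(lp)) for lp in beam_logprobs]
    z = sum(probs)
    return sum(p / z * measure(lp) for p, lp in zip(probs, beam_logprobs))

# NSP-style per-hypothesis measure: 1 - exp(average token log-prob).
nsp_measure = lambda lp: 1.0 - math.exp(sum(lp) / len(lp))

confident = [[-0.1, -0.1]]
mixed = [[-0.1, -0.1], [-3.0, -3.0]]
print(aggregate_over_beam(confident, nsp_measure))  # low uncertainty
print(aggregate_over_beam(mixed, nsp_measure))      # pulled up by the weak hypothesis
```

Because the weights are proportional to the hypothesis probabilities, low-probability hypotheses contribute little, so the estimate degrades gracefully as the beam widens.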

A.3.3 Token Level Ensembling
In the previous section, all the uncertainty computations were performed at the level of full sequences. However, there are multiple opportunities to perform them at the level of individual tokens and then aggregate the resulting token uncertainties over the whole sequence. Below, we discuss this in detail.
We start from a total uncertainty estimate via entropy:

U_TU(x, y) = (1/L) ∑_{l=1}^{L} H(y_l | y_{<l}, x), (9)

where H(y_l | y_{<l}, x) is the entropy of the token distribution P(y_l | y_{<l}, x) given in (6). Additionally, for an ensemble of models {θ_i}_{i=1}^{M}, one can compute a variety of other token-level uncertainty measures, including Mutual Information (MI):

MI(y_l | y_{<l}, x) = H(y_l | y_{<l}, x) − (1/M) ∑_{i=1}^{M} H(y_l | y_{<l}, x, θ_i), (10)

and Expected Pairwise KL Divergence (EPKL):

EPKL(y_l | y_{<l}, x) = (1 / (M(M−1))) ∑_{i≠j} KL( P(y_l | y_{<l}, x, θ_i) ∥ P(y_l | y_{<l}, x, θ_j) ), (11)

where KL(P ∥ Q) refers to the KL divergence between distributions P and Q.
Finally, Reverse Mutual Information (RMI) can also be computed at the token level via the simple relation

RMI(y_l | y_{<l}, x) = EPKL(y_l | y_{<l}, x) − MI(y_l | y_{<l}, x). (12)

The resulting token-level uncertainties computed via MI (10), EPKL (11), and RMI (12) can be plugged into equation (9) in place of the entropy, leading to the corresponding sequence-level uncertainty estimates.

We refer to the resulting methods as PE-T-TU, PE-T-MI, PE-T-EPKL and PE-T-RMI.
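For a single token position, these ensemble measures can be computed directly from the member distributions. A small sketch (our own helper names; distributions are plain probability lists over a toy two-word vocabulary):

```python
import math

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def token_measures(members):
    # TU, MI, EPKL and RMI for one token position, given each ensemble
    # member's predictive distribution over the vocabulary.
    # Note: `a is not b` assumes the members are distinct list objects.
    m = len(members)
    avg = [sum(d[k] for d in members) / m for k in range(len(members[0]))]
    tu = entropy(avg)                                  # total uncertainty
    mi = tu - sum(entropy(d) for d in members) / m     # mutual information
    epkl = sum(kl(a, b) for a in members for b in members
               if a is not b) / (m * (m - 1))          # expected pairwise KL
    return tu, mi, epkl, epkl - mi                     # RMI = EPKL - MI

agree = [[0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.1, 0.9]]
print(token_measures(agree)[1])     # MI = 0: members agree
print(token_measures(disagree)[1])  # MI > 0: disagreement
```

When the members agree, MI and EPKL vanish and only the total uncertainty remains; disagreement between members is what these knowledge-uncertainty measures pick up.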
Additionally, instead of considering the distributions P(y_l | y_{<l}, x, θ_i), one might consider expectation-of-products averaging, leading to the conditional distribution:

P_EP(y_l | y_{<l}, x) = E_θ[P(y_{≤l} | x, θ)] / E_θ[P(y_{<l} | x, θ)].

B Experimental Details B.1 OOD Dataset Creation
In both corruption scenarios, we use the test samples of the ID and OOD datasets. From the ID dataset, all observations are used. If the number of texts in the test sample of the OOD dataset is less than that of the ID dataset, we add observations from the training and validation sets until the number of OOD instances equals the number of ID ones. Note that we do not clip the ID dataset if the OOD dataset still contains fewer observations.
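The balancing procedure above can be sketched as follows (illustrative helper; datasets are treated as plain Python sequences):

```python
def build_ood_split(id_test, ood_test, ood_extra):
    # Top up the OOD sample from extra (train/validation) pools until it
    # matches the ID test size; the ID side is never clipped.
    ood = list(ood_test)
    pool = iter(ood_extra)
    while len(ood) < len(id_test):
        try:
            ood.append(next(pool))
        except StopIteration:
            break  # OOD may stay smaller if the extra pools run out
    return list(id_test), ood

id_part, ood_part = build_ood_split(range(5), ["a", "b"], ["c"] * 10)
print(len(id_part), len(ood_part))  # 5 5
```

Balancing the two sides keeps the AU-ROC estimate from being dominated by whichever class happens to be larger.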

B.2.1 Machine Translation
We select the WMT'14 dataset (Bojar et al., 2014), LTC (Panayotov et al., 2015), and comments from Reddit (Malinin et al., 2022) for the following reasons. WMT'14 differs from the source datasets (WMT'17 En-De and WMT'20 En-Ru) in terms of the source language. The scenario where OOD data comes from a different language is practical because one usually does not control the input data given by users, while the model output given input in a different language might be unpredictable and cause reputational risks. In the next two settings, OOD texts differ from ID texts only in their formality level. LTC represents a new domain for the model with a completely different structure of texts, as spoken language. Comments from Reddit also represent spoken language, embodying a structural shift in the data.

B.2.2 Abstractive Summarization
We select the following datasets since they all represent different domains. XSum (Narayan et al., 2018) consists of BBC news articles with their one-sentence introductions as summaries. AESLC (Zhang and Tetreault, 2019) contains emails with their headlines as summaries. The Movie Reviews (MR) dataset (Wang and Ling, 2016) is a collection of critics' opinions on movies and their consensus. Finally, the Debate dataset (Wang and Ling, 2016) contains argument-topic pairs, with the former standing for documents and the latter embodying summaries.

B.2.3 Question Answering
Mintaka (Sen et al., 2022), as stated in the original article, is a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. The advantage of this dataset is that it is large enough while having decent data quality at the same time; the trade-off between size and quality is a known problem of such datasets, as mentioned in (Sen et al., 2022). Besides, it provides professional translations of the questions into 8 languages and Wikidata knowledge graph IDs to cope with disambiguation.
The second dataset that we use is RuBQ 2.0 (Rybin et al., 2021). It contains Russian questions coupled with English machine translations, SPARQL queries, answers with Russian labels, and a subset of Wikidata knowledge graph identifiers. The varied complexity of the questions allows us to work with data that is not shifted towards simple or complex questions.
We also conduct experiments on the oldest and most popular KGQA dataset, Simple Questions (Bordes et al., 2015), which contains various questions. We select only the answerable ones.
Thus, we work on the task of answering questions over datasets with links to the Wikidata Knowledge Graph.

B.2.4 Dataset Statistics
We give the summary statistics about the considered datasets in Table 1.

B.3.1 Machine Translation
We use the "large-CC25" version of mBART. We train an ensemble of 5 models with different random seeds for the En-De and En-Ru tasks. As for the training settings, we follow the original setup and hyperparameters from (Liu et al., 2020) and train the models for 100K update steps.

B.3.2 Abstractive Summarization
In this experiment, we use the "bart-base" version of BART. For each dataset, we construct 5 ensembles, each consisting of 5 models, for a total of 25 trained models. We leverage the hyperparameters and training setup proposed in the original paper (Lewis et al., 2020).

B.3.3 Question Answering
We use the "t5-small-ssm-nq" checkpoint of the T5 model (Raffel et al., 2020). It is considered a state-of-the-art model for the QA task even in the closed-book setting. The NMT and KGQA experiments were performed using the following hardware: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, 36-core CPU; NVIDIA Tesla V100 GPU, 16 GB of VRAM. The ATS experiments were performed using the following hardware: 2x Intel Xeon Platinum 8168 @ 2.7 GHz, 24-core CPU; NVIDIA Tesla V100 GPU, 32 GB of VRAM.
We provide the information about the resources employed for each experiment in Table 2. Tables 4 and 5 present textual examples for the models trained on the WMT17 En-De and WMT20 En-Ru tasks for the ID and OOD datasets. We can see that for PRM and WMT14 Fr as OOD, a model trained on WMT17 En-De copies the input to the output with high probability. Therefore, the MSP uncertainty is quite low for these examples. However, MD-Encoder is able to correctly spot these instances with high uncertainty.
We can see that for instances from the LTC dataset, both models produce poor translations, and MD-Encoder precisely detects these instances with high uncertainty. The Reddit dataset consists of challenging texts, and the model trained on WMT20 En-Ru generates translations with a low BLEU score. However, MD-Encoder produces higher uncertainty than MSP for these examples, and we are able to correctly detect these erroneous instances.

C.3 Question Answering
For the KGQA task, we also analyze the behaviour of the uncertainty measures to further illustrate the effectiveness of density-based approaches on particular examples. Table 7 depicts this analysis. It is evident that the MD-Encoder estimates show a clear difference between ID and OOD inputs. We can also clearly see that for most of the considered OOD inputs, the output of the model is either factually incorrect or simply incomprehensible.
We also report model quality on the ID/OOD datasets, further justifying this choice of datasets. The results are presented in Table 8. For this analysis, we have chosen a larger version of the same model, t5-large-ssm-nq. It is clear that the model performs significantly better on both ID datasets, which motivates the need to detect OOD inputs with a lower expected quality of the output. We also present Table 14 with the time cost results for the KBQA task. The table displays the mean values and the corresponding standard deviations for the evaluated uncertainty methods in a specific experiment. The dataset used in this experiment is RuBQ 2.0 English questions, and the out-of-domain (OOD) questions are questions with permuted tokens from the same dataset. Despite the high variability observed in this problem, as indicated by the large standard deviations, we can assert with confidence that the density-based methods are significantly faster than both the ensemble-based and single-model-based methods.

Table 14: The computational time for the KBQA task with RuBQ 2.0 questions in English as the ID dataset and RuBQ 2.0 permuted questions in English as the OOD dataset. Inference time corresponds to the time needed for model generation for each UE method. UE time corresponds to the time needed for computing uncertainty estimates after the inference stage.

Figure 1: Average ROC curves in various configurations on the NMT task for the selected UE methods. The first dataset in the title represents the ID dataset, the second represents the OOD dataset.
We consider Total Uncertainty (TU) measured via entropy, Mutual Information (MI), Expected Pairwise KL Divergence (EPKL), and Reverse Mutual Information (RMI). The resulting token-level uncertainties can be averaged via the PE approach, leading to the PE-T-TU, PE-T-MI, PE-T-EPKL, and PE-T-RMI methods. The alternative is to use EP averaging, which gives us another four metrics to consider: EP-T-TU, EP-T-MI, EP-T-EPKL, and EP-T-RMI.


Figure 4: Mean ROC curves over 5 seeds for the models trained on the WMT'20 En-Ru task for the selected methods. The second dataset in the title represents the OOD dataset.

Figure 5: Average ROC curves in various configurations on the ATS task for the selected UE methods. The first dataset in the title represents the ID dataset, the second stands for the OOD dataset.

Table 1: Dataset statistics. We provide the number of instances in the training / validation / test sets, the average lengths of texts and targets (answers / translations / summaries) in tokens, and the source / target languages.

Table 2: Models used for the experiments with their parameter counts and approximate GPU hours used for inference and training.

Table 3 presents the BLEU scores for the NMT task on the ID and OOD datasets. We can see a significant decrease in model performance on the OOD datasets. These results demonstrate the necessity of detecting OOD instances for maintaining the high quality of the model performance.

Table 3: Model performance for various ID/OOD settings on the NMT task. The first row shows the BLEU↑ score on the ID test dataset for the considered models. The second row shows the BLEU↑ score on the OOD test dataset presented in the table header.

Table 4: Textual examples with the input and output of the model trained on the WMT17 En-De task. We demonstrate uncertainty estimates from MSP and MD-Encoder and BLEU scores for the NMT task. For LTC, we do not show the BLEU score since the ground-truth translation is not present in the dataset. Uncertainty for each method is presented in the range [0-1]. A less saturated color indicates lower uncertainty.

Table 5: Textual examples with the input and output of the model trained on the WMT20 En-Ru task. We demonstrate uncertainty estimates from MSP and MD-Encoder and BLEU scores for the NMT task. For LTC, we do not show the BLEU score since the ground-truth translation is not present in the dataset. Uncertainty for each method is presented in the range [0-1]. A less saturated color indicates lower uncertainty.

Table 6: Model performance for various ID/OOD settings on the ATS task. The first row shows the ROUGE-2↑ score on the ID test dataset for the considered models. The second row shows the ROUGE-2↑ score on the OOD test dataset presented in the table header.

Table 6 illustrates the ROUGE-2 scores for the ATS task on the ID and OOD datasets. Similar to NMT, the model performs much worse on OOD data. Therefore, detection of OOD instances is crucial for maintaining the high quality of the model performance.

Table 7: Textual examples with the input and output of the T5 model (t5-small-ssm-nq) used zero-shot. We demonstrate uncertainty estimates for several illustrative examples for MD and RDE calculated on encoder embeddings, NSP, Entropy, PE-S-TU, and PE-T-MI. The results presented in the table are standardized to the interval from 0 to 1 for the comparison of values. A less saturated color indicates lower uncertainty.

Table 8: Model performance for various ID/OOD settings on the KGQA task. The first row shows the Top-1 accuracy on the ID test dataset for the considered models. The second row shows the Top-1 accuracy on the OOD dataset presented in the table header.
Table 13 presents the computational time for all considered methods for the NMT task with WMT17 as the ID dataset and PRM as the OOD dataset. These results show approximately 1100% computational time overhead for the ensemble-based methods in comparison with the inference of a single model. In contrast, the density-based methods are computationally efficient and outperform the other methods by ROC-AUC with only 18-20% additional overhead compared to the inference of a single model and only 1.5% compared to the ensemble-based methods.

Table 13: The computational time for the NMT task with WMT17 as the ID dataset and PRM as the OOD dataset. Inference time corresponds to the time needed for model generation for each UE method. UE time corresponds to the time needed for computing uncertainty estimates after the inference stage.

Table 15: Full results with all the considered methods. This table shows that for most of the considered configurations, the density-based methods outperform the best ensemble method by a large margin.

Table 16: Full results (AU-ROC↑) of OOD detection in ATS when XSum / Movie Reviews serve as the ID dataset. The dataset in the second line of the header represents the OOD dataset. We mark in bold the best results w.r.t. the standard deviation.