Rainproof: An Umbrella To Shield Text Generators From Out-Of-Distribution Data

Implementing effective control mechanisms to ensure the proper functioning and security of deployed NLP models, from translation to chatbots, is essential. A key ingredient to ensure safe system behaviour is Out-Of-Distribution (OOD) detection, which aims to detect whether an input sample is statistically far from the training distribution. Although OOD detection is a widely covered topic in classification tasks, most methods rely on hidden features output by the encoder. In this work, we focus on leveraging soft probabilities in a black-box framework, i.e., we can access the soft predictions but not the internal states of the model. Our contributions include: (i) RAINPROOF, a Relative informAItioN Projection OOD detection framework; and (ii) a more operational evaluation setting for OOD detection. Surprisingly, we find that OOD detection is not necessarily aligned with task-specific measures: an OOD detector may filter out samples well processed by the model and keep samples that are not, leading to weaker performance. Our results show that RAINPROOF provides OOD detection methods more aligned with task-specific performance metrics than traditional OOD detectors.


Introduction
Significant progress has been made in Natural Language Generation (NLG) in recent years with the development of powerful generic text generators (e.g., GPT (Radford et al., 2018; Brown et al., 2020; Bahrini et al., 2023), LLAMA (Touvron et al., 2023) and its variants) and task-specific ones (e.g., Grover (Zellers et al., 2019), Pegasus (Zhang et al., 2020) and DialoGPT (Zhang et al., 2019b)). They power machine translation (MT) systems and chatbots that are exposed to the public, and their reliability is a prerequisite for adoption. Text generators are trained in the context of a so-called closed world (Fei and Liu, 2016), where training and test data are assumed to be drawn i.i.d. from a single distribution, known as the in-distribution. However, when deployed, these models operate in an open world (Parmar et al., 2021; Zhou, 2022) where the i.i.d. assumption is often violated. Changes in data distribution are detrimental and induce a drop in performance. It is necessary to develop tools to protect models from harmful distribution shifts, as this remains a clearly unresolved practical problem (Arora et al., 2021). For example, a trained translation model is not expected to be reliable when presented with another language (e.g., a Spanish model exposed to Catalan, or a Dutch model exposed to Afrikaans) or unexpected technical language (e.g., a colloquial translation model exposed to rare technical terms from the medical field). Moreover, such models tend to be released behind APIs (OpenAI, 2023), ruling out many usual feature-based OOD detection methods.
Most work on Out-Of-Distribution (OOD) detection focuses on classification, leaving OOD detection in (conditional) text generation settings mainly unexplored, even though it is among the most exposed applications. Existing solutions fall into two categories. The first, called training-aware methods (Zhu et al., 2022; Vernekar et al., 2019a,b), modifies the classifier training by exposing the neural network to OOD samples during training. The second, called plug-in methods, aims to distinguish regular samples from the in-distribution (IN) from OOD samples based on the model's behaviour on a new input. Plug-in methods include Maximum Softmax Probabilities (MSP) (Hendrycks and Gimpel, 2016), Energy (Liu et al., 2020), or feature-based anomaly detectors that compute a per-class anomaly score (Ming et al., 2022; Ryu et al., 2017; Huang et al., 2020; Ren et al., 2021a). Although plug-in methods from classification settings seem attractive, their adaptation to text generation tasks is more involved. While text generation can be seen as a sequence of classification problems, i.e., choosing the next token at each step, the number of possible tokens is two orders of magnitude higher than in usual classification setups.
In this work, we aim to develop new tools to build more reliable text generators which can be used in practical systems. To do so, we work under 4 constraints: (i) We do not assume we can access OOD samples; (ii) We suppose we are in a black-box scenario: we do not assume access to the internal states of the model but only to the soft probability distributions it outputs; (iii) The detectors should be easy enough to use on top of any existing model to ensure adaptability; (iv) Not only should OOD detectors be able to filter OOD samples, but they are also expected to improve the average performance on the end task the model has to perform.
Our contributions. Our main contributions can be summarized as follows: 1. A more operational benchmark for text generation OOD detection. We present LOFTER, the Language Out oF disTribution pErformance benchmaRk. Existing works on OOD detection for language modelling (Arora et al., 2021) focus on (i) the English language only, (ii) the GLUE benchmark, and (iii) measuring performance solely in terms of OOD detection. LOFTER introduces more realistic data shifts in the generative setting that go beyond English: language shifts induced by closely related language pairs (e.g., Spanish and Catalan, or Dutch and Afrikaans (Xiao et al., 2020)) and domain changes (e.g., medical vs news data, or vs dialogs). In addition, LOFTER comes with an updated evaluation setting: detectors' performance is jointly evaluated w.r.t. the overall system's performance on the end task.

2. A novel detector inspired by information projection. We present RAINPROOF: a Relative informAItioN Projection Out OF distribution detector. RAINPROOF is fully unsupervised. It is flexible and can be applied both when no reference samples (IN) are available (corresponding to scenario s_0) and when they are (corresponding to scenario s_1). RAINPROOF tackles s_0 by computing the negentropy (Brillouin, 1953) of the model's predictions and using it as a measure of normality. For s_1, it relies upon its natural extension: the Information Projection.

Notations. Let Ω denote the vocabulary and P(Ω) = {p ∈ R^|Ω| : p_i ⩾ 0, Σ_{i=1}^{|Ω|} p_i = 1} the set of probability distributions defined over Ω. Let D_train be the training set, composed of N ⩾ 1 i.i.d. samples {(x_i, y_i)}_{i=1}^{N} ∈ (X × Y)^N with probability law p_XY. We denote p_X and p_Y the associated marginal laws of p_XY. Each x_i is a sequence of tokens, and x_i^j ∈ Ω is the jth token of the ith sequence.
For a sequence x, x_⩽t = (x_1, …, x_t) ∈ Ω* denotes the prefix of length t. The same notations hold for y. Conditional textual generation. In conditional textual generation, the goal is to model a probability distribution p⋆(x, y) over variable-length text sequences (x, y) by finding p_θ ≈ p⋆(x, y) for any (x, y). In this work, we assume access to a pretrained conditional language model f_θ : X × Y → R^|Ω|, whose output is the (unnormalized) logit scores. f_θ parameterizes p_θ, i.e., for any (x, y), p_θ(x, y) = softmax(f_θ(x, y)/T), where T ∈ R denotes the temperature. Given an input sequence x, the pretrained language model f_θ can recursively generate an output sequence ŷ by sampling y_{t+1} ∼ p_θ^T(·|x, ŷ_⩽t) for t ∈ [1, |y|]. Note that ŷ_0 is the start-of-sentence (<SOS>) token. We denote by S(x) the set of normalized logit scores generated by the model when the initial input is x, i.e., S(x) = {softmax(f_θ(x, ŷ_⩽t))}_{t=1}^{|ŷ|}.
Note that elements of S(x) are discrete probability distributions over Ω.
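To make the recursive generation and the construction of S(x) concrete, here is a minimal NumPy sketch. The helper names (`softmax`, `generate`) and the toy logit function standing in for f_θ are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def generate(f_theta, x, eos_id, T=1.0, max_len=10, rng=None):
    """Sample y_{t+1} ~ p_theta(.|x, y_<=t) step by step and collect the
    per-step distributions S(x), reused later for OOD scoring."""
    if rng is None:
        rng = np.random.default_rng(0)
    y_hat, S = [], []
    for _ in range(max_len):
        p = softmax(f_theta(x, y_hat), T)    # p_theta(. | x, y_<=t)
        S.append(p)
        tok = int(rng.choice(len(p), p=p))
        y_hat.append(tok)
        if tok == eos_id:
            break
    return y_hat, S

# Toy stand-in for f_theta: 5-token vocabulary, EOS = token 0, whose logit
# grows with the sequence length so that generation eventually stops.
toy_model = lambda x, y: np.array([0.5 * len(y), 3.0, 1.0, 0.5, 0.2])
y_hat, S = generate(toy_model, x="hola", eos_id=0)
```

Each element of S is one normalized next-token distribution, so |S(x)| = |ŷ| by construction.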

Problem statement
In OOD detection, the goal is to find an anomaly score a : X → R+ that quantifies how far a sample is from the IN distribution. x is classified as IN or OUT according to the score a(x): we fix a threshold γ and classify the test sample as IN if a(x) ⩽ γ or as OOD if a(x) > γ. Formally, denoting g(·, γ) the decision function, we take g(x, γ) = 1 (OOD) if a(x) > γ and g(x, γ) = 0 (IN) otherwise. In our setting, OOD examples are not available. Tuning γ is a complex task, and it is usually calibrated using OOD samples. In our work, we decided not to rely on OOD samples but on the available training set to fix γ in a realistic setting. Indeed, even well-tailored datasets might contain significant shares of outliers (Meister et al., 2023). Therefore, we fix γ so that at least 80% of the IN data pass the filtering procedure. See Sec. G.3 for more details.
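The thresholding procedure above, which fixes γ from IN data only so that at least 80% of IN samples pass, can be sketched as follows. `fit_threshold` and `g` are hypothetical helper names, not from the paper:

```python
import numpy as np

def fit_threshold(in_scores, pass_rate=0.80):
    """Pick gamma from IN anomaly scores only, so that at least `pass_rate`
    of the IN data satisfies a(x) <= gamma; no OOD samples are required."""
    return float(np.quantile(np.asarray(in_scores, dtype=float), pass_rate))

def g(score, gamma):
    """Decision function: 1 = flagged OOD (a(x) > gamma), 0 = kept as IN."""
    return int(score > gamma)

# Anomaly scores computed on (a held-out part of) the training set.
rng = np.random.default_rng(0)
in_scores = rng.normal(loc=1.0, scale=0.3, size=1000)
gamma = fit_threshold(in_scores, pass_rate=0.80)
kept = float(np.mean(in_scores <= gamma))
```

The 80% quantile of the IN scores is exactly the smallest γ that lets the desired share of IN data through.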

Review of existing OOD detectors
OOD detection for classification. Most works on OOD detection have focused on detectors for classifiers and rely either on internal representations (feature-based detectors) or on the final soft probabilities produced by the classifier (softmax-based detectors). Feature-based detectors. They leverage latent representations to derive anomaly scores. The most well-known is the Mahalanobis distance (Lee et al., 2018a; Ren et al., 2021b), but there are other methods employing Gram matrices (Sastry and Oore, 2020), the Fisher-Rao distance (Gomes et al., 2022) or other statistical tests (Haroush et al., 2021). These methods require access to the latent representations of the models, which does not fit the black-box scenario. In addition, it is well known in classification that performing per-class OOD detection is key to achieving good performance (Lee et al., 2018b). This per-class approach is a priori impossible in text generation, since it would have to be done per token or via some other unknown type of class. We argue that it is necessary to find non-class-dependent solutions, especially for the Mahalanobis distance, which relies upon the hypothesis that the data are unimodal; we study the validity of this hypothesis and show that it does not hold in a generative setting in Ap. A.
Softmax-based detectors. These detectors rely on the soft probabilities produced by the model. The MSP (Hendrycks and Gimpel, 2017; Hein et al., 2019; Liang et al., 2018; Hsu et al., 2020) uses the probability of the mode, while others take into account the entire logit distribution (e.g., energy-based scores (Liu et al., 2020)). Due to the large vocabulary size, it is unclear how these methods generalize to sequence generation tasks. OOD detection for text generation. Little work has been done on OOD detection for text generation. Therefore, we will follow (Arora et al., 2021; Podolskiy et al., 2021) and rely on their baselines. We also generalize common OOD scores such as MSP or Energy by computing the average score along the sequence at each step of the text generation. We refer the reader to Sec. B.7 for more details. Quality estimation as an OOD detection metric. Quality estimation metrics are not designed to detect OOD samples but to assess the overall quality of generated samples. However, they are interesting baselines to consider, as OOD samples should lead to low-quality outputs. We will use COMET QE (Stewart et al., 2020) as a baseline to filter out low-quality results induced by OOD samples. Remark 2. Note that feature-based detectors assume white-box access to internal representations, while softmax-based detectors rely solely on the final output. Our work operates in a black-box framework but also includes a comparison to the Mahalanobis distance for completeness.

RAINPROOF OOD detector
3.1 Background

An information measure I : P(Ω) × P(Ω) → R quantifies the similarity between any pair of discrete distributions p, q ∈ P(Ω). Since Ω is a finite set, we write p = (p_1, …, p_|Ω|) for any p ∈ P(Ω). While there exist information distances, it is, in general, difficult to build metrics that satisfy all the properties of a distance; thus we often rely on divergences, which drop the symmetry property and the triangular inequality.
In what follows, we motivate the information measures we will use in this work.
First, we rely on the Rényi divergences (Csiszár, 1967). Rényi divergences belong to the f-divergence family and are parametrized by a parameter α ∈ R+ \ {1}. They are flexible and include well-known divergences such as the Kullback-Leibler divergence (KL) (Kullback, 1959) (when α → 1) or the Hellinger distance (Hellinger, 1909) (when α = 0.5). The Rényi divergence between p and q is defined as follows:

D_α(p∥q) = (1/(α−1)) log Σ_{i=1}^{|Ω|} p_i^α q_i^{1−α}.   (1)

The Rényi divergence is popular because α allows weighting the relative influence of the distributions' tails.
Second, we investigate the Fisher-Rao distance (FR). FR is a distance on the Riemannian space formed by parametric distributions, using the Fisher information matrix as its metric. It computes the geodesic distance between two discrete distributions (Rao, 1992) and is defined as follows:

FR(p, q) = 2 arccos( Σ_{i=1}^{|Ω|} √(p_i q_i) ).   (2)

It has recently found many applications (Picot et al., 2022; Colombo et al., 2022b).
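Both information measures can be implemented directly from their definitions. The following NumPy sketch (function names are ours) also checks the α → 1 limit of the Rényi divergence against the KL divergence:

```python
import numpy as np

def renyi_div(p, q, alpha):
    """Rényi divergence: log(sum_i p_i^alpha * q_i^(1-alpha)) / (alpha - 1)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    assert alpha > 0 and alpha != 1
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

def fisher_rao(p, q):
    """Fisher-Rao geodesic distance 2*arccos(sum_i sqrt(p_i q_i)) on the simplex."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(2.0 * np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0)))

p = np.array([0.7, 0.2, 0.1])
u = np.array([1.0, 1.0, 1.0]) / 3.0          # uniform reference
kl_pu = float(np.sum(p * np.log(p / u)))     # KL(p||u), the alpha -> 1 limit
```

The `np.clip` in `fisher_rao` guards against floating-point sums marginally above 1 before taking the arccos.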
3.2 RAINPROOF for the no-reference scenario (s_0)

At inference time, the no-reference scenario (s_0) does not assume the existence of a reference set of IN samples to decide whether a new input sample is OOD. This scenario includes, for example, softmax-based detectors such as MSP, Energy or the sequence log-likelihood. Under these assumptions, our OOD detector RAINPROOF comprises three steps. For a given input x with generated sentence ŷ:

1. We first use f_θ to extract the step-by-step sequence of soft distributions S(x).
2. We then compute an anomaly score a_I(x) by averaging a step-by-step score provided by I. This step-by-step score is obtained by measuring the similarity between a reference distribution u ∈ P(Ω) and one element of S(x). Formally,

a_I(x) = (1/|S(x)|) Σ_{p ∈ S(x)} I(p ∥ u),   (3)

where |S(x)| = |ŷ|.
[5] The detector based on the log-likelihood of the sequence is defined as a_L(x) = −(1/|ŷ|) Σ_{t=1}^{|ŷ|} log p_θ(ŷ_t | x, ŷ_⩽t−1).

3. The last step consists of thresholding the previous anomaly score a_I(x). If a_I(x) is over a given threshold γ, we classify x as an OOD example.
Interpretation of Eq. 3. a_I(x) measures the average dissimilarity of the probability distribution of the next token from normality (as defined by u). a_I(x) also corresponds to the average token-level uncertainty of the model f_θ when generating ŷ from input x. The intuition behind Eq. 3 is that the distributions produced by f_θ, when exposed to an OOD sample, should be far from normality and thus receive a high score.
Choice of u and I. The uncertainty definition of Eq. 3 depends on the choice of both the reference distribution u and the information measure I. A natural choice for u, which we adopt in this work, is the uniform distribution u = [1/|Ω|, …, 1/|Ω|]. It is worth pointing out that I(·∥u) then yields the negentropy of a distribution (Brillouin, 1953). Other possible choices for u include one-hot or tf-idf distributions (Colombo et al., 2022b). For I, we rely on the Rényi divergence to obtain a_Dα and the Fisher-Rao distance to obtain a_FR.
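The three steps of the no-reference detector can be sketched as follows, with KL as one concrete instance of I (so that the per-step score is exactly the negentropy relative to the uniform u). Helper names are our own; the direction of the final thresholding follows the decision rule of the problem statement:

```python
import numpy as np

def neg_entropy(p, u):
    """KL(p || u); with u uniform this equals log|Omega| - H(p), the negentropy."""
    return float(np.sum(p * np.log(np.maximum(p, 1e-12) / u)))

def a_I(S, I):
    """No-reference score (Eq. 3): average dissimilarity between each
    next-token distribution in S(x) and the uniform reference u."""
    S = [np.asarray(p, dtype=float) for p in S]
    u = np.full_like(S[0], 1.0 / len(S[0]))  # uniform reference distribution
    return float(np.mean([I(p, u) for p in S]))

peaked = [np.array([0.97, 0.01, 0.01, 0.01])] * 3  # confident next-token steps
flat   = [np.array([0.25, 0.25, 0.25, 0.25])] * 3  # maximally uncertain steps
score_peaked = a_I(peaked, neg_entropy)
score_flat = a_I(flat, neg_entropy)
```

By construction, peaked (low-entropy) distributions are far from u and get a high score, while near-uniform distributions score close to zero.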

RAINPROOF for the reference scenario (s 1 )
In the reference scenario (s_1), we assume that one has access to a reference set of IN samples R = {x_i}_{i=1}^{|R|}, where |R| is the size of the reference set. For example, the Mahalanobis distance works under this assumption. One of the weaknesses of Eq. 3 is that it imposes an ad-hoc choice of u (the uniform distribution).
In s_1, we can leverage R to obtain a data-driven notion of normality.
Under s_1, our OOD detector RAINPROOF follows these four steps:

1. (Offline) For each x_i ∈ R, we generate ŷ_i and the associated sequence of probability distributions S(x_i). Overall, we thus generate Σ_{x_i ∈ R} |ŷ_i| probability distributions, a number which could explode for long sequences. To overcome this limitation, we rely on the bag of distributions of each sequence (Colombo et al., 2022b), i.e., we summarize S(x_i) by its token-averaged distribution p̄(x_i). We form the set of these bags of distributions:

S⋆ = {p̄(x_i) : x_i ∈ R}.   (4)

2. (Online) For a given input x with generated sentence ŷ, we compute its bag-of-distributions representation:

p̄(x) = (1/|ŷ|) Σ_{p ∈ S(x)} p.   (5)

3. (Online) For x, we then compute an anomaly score a⋆_I(x) by projecting p̄(x) on the set S⋆. Formally, a⋆_I(x) is defined as:

a⋆_I(x) = min_{p ∈ S⋆} I(p ∥ p̄(x)).   (6)

We denote p⋆(x) = arg min_{p ∈ S⋆} I(p ∥ p̄(x)).
4. The last step consists of thresholding the previous anomaly score a⋆_I(x). If a⋆_I(x) is over a given threshold γ, we classify x as an OOD example.
Interpretation of Eq. 6. a⋆_I(x) relies on a Generalized Information Projection (Kullback, 1954; Csiszár, 1975, 1984), which measures the similarity between p̄(x) and the set S⋆. Note that the closest element of S⋆ in the sense of I can give insights into the decision of the detector. It allows interpreting the decision of the detector, as we will see in Tab. 6.
Choice of I. Similarly to Sec. 3.2, we rely on the Rényi divergence to define a⋆_Dα(x) and the Fisher-Rao distance to define a⋆_FR(x).
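The reference-scenario pipeline can be sketched as follows, with KL standing in for I and toy 3-token distributions standing in for real model outputs. `bag` and `a_star` are hypothetical helper names:

```python
import numpy as np

def bag(S):
    """Bag-of-distributions: token average of the per-step distributions (Eq. 5)."""
    return np.mean(np.asarray(S, dtype=float), axis=0)

def kl(p, q):
    return float(np.sum(p * np.log(np.maximum(p, 1e-12) / np.maximum(q, 1e-12))))

def a_star(S_x, S_ref, I=kl):
    """Information-projection score (Eq. 6): distance from bag(x) to the
    closest reference bag; also return the index of the projection p*(x)."""
    p_bar = bag(S_x)
    scores = [I(p, p_bar) for p in S_ref]
    j = int(np.argmin(scores))
    return scores[j], j

# Reference bags S* built offline from IN samples (toy 3-token vocabulary).
S_ref = [np.array([0.6, 0.3, 0.1]), np.array([0.1, 0.3, 0.6])]
score_in, j = a_star([np.array([0.58, 0.32, 0.10])], S_ref)   # IN-like input
score_out, _ = a_star([np.array([0.98, 0.01, 0.01])], S_ref)  # atypical input
```

The index j identifies the projection p⋆(x), which is what makes the decision inspectable: one can display the reference sample whose bag is closest to the input's.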

Results on LOFTER
4.1 LOFTER: Language Out oF disTribution pErformance benchmaRk

LOFTER for NMT. We consider a realistic setting involving both topic and language shifts. Language shifts correspond to exposing a model trained for a given language to another one which is either linguistically close (e.g., Afrikaans for a system trained on Dutch) or missing from the training data (as is the case for German in BLOOM (Scao et al., 2022)).
It is an interesting setting because the differences between languages might not be obvious but still cause a significant drop in performance. For linguistically close languages, we selected closely related language pairs such as Catalan-Spanish, Portuguese-Spanish and Afrikaans-Dutch from the Tatoeba dataset (Tiedemann, 2012) (see Tab. 8). Domain shifts can involve technical or rare terms or specific sentence constructions, which can affect the model's performance. We simulated such shifts from Tatoeba MT using news, law (EuroParl dataset), and medical texts (EMEA). LOFTER for dialogs. For conversational agents, we focused on a scenario where a goal-oriented agent, designed to handle a specific type of conversation (e.g., customer conversations, daily dialogue), is exposed to an unexpected conversation. In this case, it is crucial to interrupt the agent so it does not damage the user's trust with misplaced responses. We rely on the MultiWOZ dataset (Zang et al., 2020), a human-to-human dataset collected in the Wizard-of-Oz set-up (Kelley, 1984), for IN-distribution data and its associated fine-tuned model. We simulated shifts using dialogue datasets from various sources, which are part of the SILICONE benchmark (Chapuis et al., 2020). Specifically, we use a goal-oriented dataset (the Switchboard Dialog Act Corpus, SwDA (Stolcke et al., 2000)), multi-party meeting datasets (MRDA (Shriberg et al., 2004) and the Multimodal EmotionLines Dataset, MELD (Poria et al., 2018)), daily communication dialogs (DailyDialog, DyDA (Li et al., 2017)), and scripted scenarios (IEMOCAP (Tripathi et al., 2018)). We refer the curious reader to Sec. B.5 for more details on each dataset. Model choices. We evaluated our methods on open-source and freely available bilingual language models (the Helsinki suite (Tiedemann and Thottingal, 2020)) and on a BLOOM-based instruction model, BLOOMZ (Muennighoff et al., 2022) (for which German is OOD). For dialogue tasks, we relied on the
DialoGPT (Zhang et al., 2019b) model fine-tuned on MultiWOZ, which acts as the IN distribution. We consider the Helsinki models as they are used in production for lightweight applications. Additionally, they are specialized for a specific language pair and released with their associated training set, making them ideal candidates to study the impact of OOD in a controlled setting. Metrics. To evaluate performance on the OOD task, we report the Area Under the Receiver Operating Characteristic curve (AUROC) and the false positive rate (FPR ↓). These metrics have been widely employed in previous research on out-of-distribution (OOD) detection. An exhaustive description of the metrics can be found in Sec. B.1.

Experiments in MT
Results on language shifts (Tab. 1). We find that our no-reference methods (a_Dα and a_FR) not only achieve better performance than common no-reference baselines but also outperform the reference-based baseline. In particular, a_Dα, with an AUROC of 0.95 and an FPR ↓ of 0.25, outperforms all considered methods. Moreover, while no-reference baselines only capture up to 45% of the OOD samples on average, ours detect up to 55%. In addition, COMET QE, a quality estimation tool, performs poorly in pure OOD detection, suggesting that while OOD detection and quality estimation can be related, they are still different problems.
Results on domain shifts (Tab. 1). We evaluate the OOD detection performance of RAINPROOF on domain shifts in Spanish (SPA) and German (DEU) with technical, medical and parliamentary data.
For s_0, we observe that a_Dα and a_FR outperform the strongest baselines (i.e., Energy, MSP and sequence likelihood) by several AUROC points. Interestingly, even our no-reference detectors outperform the reference-based baseline (i.e., a_M; a deeper study of this phenomenon is presented in Ap. A). While a_Dα achieves similar AUROC performance to its information projection counterpart a⋆_Dα, the latter achieves a better FPR ↓. Once again, the COMET QE metric does not yield competitive performance for OOD detection.

Experiments in dialogue generation
Results on dialogue shifts (Tab. 1). Dialogue shifts are understandably more difficult to detect, as shown in our experiments, since they are smaller than language shifts. Our no-reference detectors do not outperform the Mahalanobis baseline and achieve only 0.79 in AUROC. The best baseline, the Mahalanobis distance, achieves better performance on dialogue tasks than on NMT domain shifts, reaching an AUROC of 0.84. However, our reference-based detector based on the Rényi information projection secures a better AUROC (0.86) and a better FPR ↓ (0.52). Even if our detectors achieve decent results on this task, it is clear that dialogue shifts will require further work and investigation (see Ap. F), especially in the wake of LLMs.

Ablations Study
Fig. 1 shows that RAINPROOF offers crucial flexibility by utilizing the Rényi divergence with the adjustable parameter α. RAINPROOF's detectors show improvement when considering the tail of the distributions. Notably, lower values of α (close to 0) yield better results with the Rényi information projection a⋆_Dα. This finding suggests that the tail of the distributions used in text generation contains contextual information and insights about the processed texts. These results are consistent with recent research in automatic text generation evaluation (Colombo et al., 2022b). Interestingly, increasing the size of the reference set beyond 1.2k has minimal influence. We provide an additional study of the impact of the temperature and the parameter α on the different OOD scores in Ap. E.

A More Practical Evaluation
Following previous work, we measure the performance of the detectors on the OOD detection task with AUROC and FPR ↓. However, this evaluation framework neglects the impact of the detector on the overall system's performance and on the downstream task it performs. We identify three main evaluation criteria that are important in practice: (i) execution time, (ii) overall system performance in terms of the quality of the generated answers, and (iii) interpretability of the decision. Our study is conducted on NMT due to the existence of relevant and widely adopted metrics for assessing the quality of a generated sentence (i.e., BLEU (Papineni et al., 2002), BERT-S (Zhang et al., 2019a) and COMET (Stewart et al., 2020)).

Execution time
Runtime and memory costs. We report in Tab. 4 the runtime of all methods. Detectors for s_0 are faster than those for s_1. Unlike detectors using references, no-reference detectors do not require additional memory. They can be set up easily in a plug-and-play manner at virtually no cost.

Effects of Filtering on Translation Quality
In this experiment, we investigate the impact of OOD filtering from the perspectives of quality estimation and selective generation. Global performance. In Tab. 3 and Tab. 5, we report the global performance of the system (f_θ) with and without OOD detectors on IN samples, OOD samples, and all samples (ALL). In most cases, adding detectors increases the average quality of the returned answers on all three subsets, but with varying efficacy. a_MSP is a notable exception, and we provide a specific correlation analysis later. While the reference-based detectors tend to remove more OOD samples, the no-reference detectors demonstrate better performance in terms of the average BLEU of the remaining sentences. Thus, OOD detector evaluation should consider the final task performance. Overall, it is worth noting that directly adapting classical OOD detection methods (e.g., MSP or Energy) to the sequence generation problem leads to poor results in terms of performance gains (i.e., as measured by BLEU or BERT-S). a_Dα removes up to 62% of OOD samples (whereas the likelihood only removes 45%) and maintains or improves the average performance of the system on the end task. In other words, a_Dα provides the best combination of OOD detection performance and system performance improvements. Threshold-free analysis. In Tab.
2, we report the correlations between OOD scores and quality metrics on each data subset (IN and OUT distribution, and ALL combined). For the OOD detector to improve or maintain performance on the end task, its score must correlate with performance metrics similarly on each subset. We notice that this is not the case for the likelihood or a_MSP. The highest likelihood on IN data corresponds to higher-quality answers, but the opposite holds for OOD samples, meaning that using the likelihood to remove OOD samples tends to remove OOD samples that are well handled by the model. By contrast, RAINPROOF scores correlate well and in the same way on both IN and OUT, allowing them to remove OOD samples while improving performance.
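The per-subset correlation analysis described above can be reproduced on synthetic scores as follows. Helper names are ours, and Pearson correlation is used for illustration; the paper's exact correlation statistic may differ:

```python
import numpy as np

def subset_correlations(scores, quality, is_ood):
    """Correlate OOD scores with a quality metric separately on the IN, OUT
    and ALL subsets: a usable filter should correlate the same way on each."""
    is_ood = np.asarray(is_ood, dtype=bool)
    pearson = lambda a, b: float(np.corrcoef(a, b)[0, 1])
    return {
        "IN":  pearson(scores[~is_ood], quality[~is_ood]),
        "OUT": pearson(scores[is_ood],  quality[is_ood]),
        "ALL": pearson(scores, quality),
    }

# Synthetic well-behaved case: quality (e.g., BLEU) degrades as the anomaly
# score rises, consistently on both subsets.
rng = np.random.default_rng(0)
scores = rng.normal(size=200)
quality = -0.8 * scores + rng.normal(scale=0.3, size=200)
is_ood = rng.random(200) < 0.5
corr = subset_correlations(scores, quality, is_ood)
```

A pathological detector (like the likelihood in the analysis above) would show opposite correlation signs on IN and OUT, which this per-subset breakdown makes visible at a glance.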

Towards an interpretable decision
An important dimension of fostering adoption is the ability to verify the decision taken by the automatic system. RAINPROOF offers a step in this direction when used with references: for each input sample, RAINPROOF finds the closest sample (in the sense of the Information Projection) in the reference set to take its decision. Tab. 6 presents examples of OOD samples along with their translation scores, projection scores, and projection on the reference set. Qualitative analysis shows that, in general, sentences close to the reference set and whose projection has a close meaning are better handled by f_θ. Therefore, one can visually interpret the prediction of RAINPROOF and validate it.

Table 5: Detailed impacts on NMT performance per task (domain or language shift) of the different detectors. We present results on the different parts of the data: IN data, OOD data and the combination of both, ALL. For each, we report the absolute average BLEU (Abs.), the average gains in BLEU (G.s.) compared to a setting without OOD filtering (f_θ only) and the share of the subset removed by the detector (R.Sh.). We provide more detailed results on each dataset in Ap. G. In addition, we performed this study using different thresholds (see Sec. ).

RAINPROOF on LLM for NMT
As an alternative to NMT models, we can study the performance of instruction-finetuned LLMs on translation tasks. However, it is important to note that while LLMs are trained on enormous amounts of data, they still miss many languages. Typically, they are trained on around 100 languages (Conneau et al., 2019), which falls far short of the roughly 7000 existing languages. In our test-bed experiments, we decided to rely on BLOOM models, which have not been specifically trained on German (DEU) data. Therefore, we can use German samples to simulate OOD detection in an instruction-following translation setting, specifically relying on BLOOMZ (Muennighoff et al., 2022). We prompt the model to translate.

Conclusion
This work introduces a detection framework called RAINPROOF and a new benchmark called LOFTER for black-box OOD detection on text generators. We adopt an operational perspective by considering not only OOD performance but also task-specific metrics: despite the good results obtained in pure OOD detection, OOD filtering can harm the performance of the final system, as is the case for a_MSP or a_M. We found that RAINPROOF succeeds in removing OOD samples while inducing significant gains in translation performance, both on OOD samples and in general. In conclusion, this work paves the way for developing text-generation OOD detectors and calls for a global evaluation when benchmarking future OOD detectors.

Limitations
While this work does not bear significant ethical or impact hazards, it is worth pointing out that it is not a perfect, absolutely safe solution against OOD samples. Preventing the processing of OOD samples is an important part of ensuring the safety and robustness of ML algorithms, but it cannot guarantee total safety nor avoid all OOD samples. In this work, we approach the problem of OOD detection from a performance standpoint: we argue that OOD detectors should increase performance metrics since they should remove risky samples. However, no one can give such guarantees, and the outputs of ML models should always be taken with caution, whatever safety measures or filters are in place. Additionally, we showed that our methods work in a specific setting of language or topic shifts, mainly on translation tasks. While our methods performed well for small language shifts (shifts induced by linguistically close languages) and showed promising results on detecting topic shifts, the latter task remains particularly hard. Further work should explore different types of distribution shifts in newer settings, such as different types of instructions or problems given to instruction-finetuned models.

A On the Mahalanobis distance

The main drawback of the Mahalanobis distance is the assumption of a single-mode distribution. In text classification, this is mitigated by fitting one Mahalanobis scorer per class. However, in text generation, this assumption is flawed, as there are multiple modes, as illustrated in Fig. 2. The PCA of Fig. 2 illustrates a failure case of the Mahalanobis distance for OOD detection.

B Experimental setting
In this section, we dive into the details and definitions of our experimental setting. First, we present our OOD detection performance metrics (Sec. B.1), then we provide a couple of samples for one of the small language shifts (Sec. B.4). We also discuss the choice of pre-trained models (Sec. B.6) and how we adapted common OOD detectors to the text generation case (Sec. B.7). In order to evaluate the performance of our methods, we focus on and report mainly the AUROC and the FPR ↓; we provide more detailed metrics and experiments in Sec. B.1.

B.1 Additional details on metrics
Area Under the Receiver Operating Characteristic curve (AUROC). The Receiver Operating Characteristic curve is obtained by plotting the true positive rate against the false positive rate, i.e., it is the curve γ ↦ (Pr(a(x) > γ | Z = 0), Pr(a(x) > γ | Z = 1)), where Z = 1 indicates an OOD sample. The area under this curve is the probability that an OOD sample x_out has an anomaly score higher than an in-distribution example x_in: AUROC = Pr(a(x_out) > a(x_in)).
False Positive Rate at 95% True Positive Rate (FPR ↓). We fix a target true positive rate r (here 95%), corresponding to a required level of safety, and ask what share of negative (IN) samples are wrongly flagged under this constraint. This leads to selecting a threshold γ_r such that the corresponding TPR equals r. At this threshold, one then computes FPR(γ_r) = Pr(a(x) > γ_r | Z = 0). r is chosen depending on the difficulty of the task at hand and the required level of safety.
For the sake of brevity, we present only the AUROC and FPR ↓ metrics in our aggregated results, but we also used the detection error and the Area Under the Precision-Recall curve; those are presented in our full results section (Ap. F).
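Both metrics can be computed directly from the two sets of anomaly scores. The following is a NumPy sketch with synthetic Gaussian scores; the helper names are ours:

```python
import numpy as np

def auroc(scores_in, scores_out):
    """AUROC = Pr(a(x_out) > a(x_in)), estimated over all IN/OUT pairs
    (ties counted as one half)."""
    s_in = np.asarray(scores_in, dtype=float)
    s_out = np.asarray(scores_out, dtype=float)
    greater = s_out[:, None] > s_in[None, :]
    ties = s_out[:, None] == s_in[None, :]
    return float((greater + 0.5 * ties).mean())

def fpr_at_tpr(scores_in, scores_out, r=0.95):
    """Pick gamma_r so that TPR = Pr(a(x_out) > gamma_r) ~= r, then report
    the share of IN samples wrongly flagged: FPR = Pr(a(x_in) > gamma_r)."""
    gamma_r = np.quantile(scores_out, 1.0 - r)
    return float(np.mean(np.asarray(scores_in) > gamma_r))

# Synthetic anomaly scores: OOD scores shifted upward by two standard deviations.
rng = np.random.default_rng(0)
s_in = rng.normal(0.0, 1.0, size=2000)
s_out = rng.normal(2.0, 1.0, size=2000)
```

The pairwise estimator is quadratic in the number of samples; for large score sets a rank-based (Mann-Whitney) computation is the usual alternative.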

B.6 Choices of models
To perform our experiments, we needed models that were already well established and deployed and that would also support OOD settings. For translation tasks, we needed specialized models so that a notion of OOD could be easily defined. It would indeed be more hazardous to define a notion of OOD language when working with a multilingual model. The same is true for conversational models.
Neural Machine Translation model. We benchmark our OOD method on translation models provided by Helsinki-NLP (Tiedemann and Thottingal, 2020) on several pairs of languages with large and small shifts, and we extended the experiment to detect domain shifts. These models are specialized in each language pair and are widely recognised in the neural machine translation field. For our experiments, we used the test sets provided with these models, so we can consider that the models were fine-tuned on the same distribution.

Conversational model.
We used a DialogGPT (Zhang et al., 2019b) model fine-tuned on the MultiWOZ dataset as chatbot model. The fine-tuning on daily dialogue-type tasks ensures that the model is specialized, which allows a good definition of samples outside its range of expertise. Moreover, the choice of the architecture, DialogGPT, guarantees that our results hold for a very common architecture.
[Table 10: Examples of the behaviour of a translation model trained on Spanish inputs when presented with Catalan inputs. Each example pairs a Catalan source (e.g., "Jo soc qui te la clau.", "En Tom surt a treballar cada matí a dos quarts de set.", "Aquest és el lloc on va nèixer el meu pare.") with an English reference (e.g., "I'm the one who has the key.", "Tom leaves for work at 6:30 every morning.", "He told me that his house was haunted.", "This is the place where my father was born.") and the code-mixed model output (e.g., "In Tom surt to pull each matí to two quarts of set.", "Ell m'ha dit that the seva house was haunted.", "Aquest is the lloc on va nèixer el meu pare."), together with an anomaly score (5.69, 27.78, 8.30).]

B.7 Generalization of existing OOD detectors to Sequence Generation
In this section, we extend classical OOD detection scores to the conditional text generation setting. Common OOD detectors were built for classification tasks, and we need to adapt them to conditional text generation. Our task can be viewed as a sequence of classification problems with a very large number of classes (the size of the vocabulary). We chose the most naive approach, which consists of averaging the OOD scores over the sequence. We experimented with other aggregations, such as the min/max or the standard deviation, without obtaining interesting results.
Likelihood Score. The most naive approach to build an OOD score is to rely solely on the log-likelihood of the sequence. For a conditioning x, we define the log-likelihood score as a_L(x) = − Σ_{t=0}^{|ŷ|−1} log p_θ(ŷ_{t+1} | x, ŷ_{⩽t}). Up to a normalization by the sequence length, this score is equivalent to the (log-)perplexity of the generated sequence.
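The likelihood score can be sketched directly from the per-step decoder logits. A minimal NumPy version (shapes and names are our own assumptions: logits of shape (T, |vocab|) and the T generated token ids):

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def likelihood_score(logits, generated_ids):
    """a_L(x) = - sum_t log p_theta(yhat_{t+1} | x, yhat_{<=t}).
    logits: (T, |vocab|) decoder logits; generated_ids: (T,) token ids."""
    lp = log_softmax(np.asarray(logits, dtype=float))
    picked = lp[np.arange(len(generated_ids)), generated_ids]
    return float(-picked.sum())
```

Dividing the result by the sequence length yields the average negative log-likelihood, i.e. the log-perplexity.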

Average Maximum Softmax Probability score
The maximum softmax probability (Hendrycks and Gimpel, 2017) takes the probability of the mode of the categorical distribution as OOD score. We extend this definition to a sequence of probability distributions by averaging this score along the sequence. For a given conditioning x, we define the average MSP score as a_MSP(x) = (1/|ŷ|) Σ_t max_{i∈Ω} p_θ(i | x, ŷ_{⩽t}). While it is closely linked to uncertainty measures, it discards most of the information contained in the probability distribution: only the mode is kept. We claim that much more information can be retrieved by studying the whole distribution.

Average Energy score. We extend the definition of the energy score described in (Liu et al., 2020) to a sequence of probability distributions by averaging the score along the sequence. For a given conditioning x and a temperature T, we define the average energy of the sequence as a_E(x) = (1/|ŷ|) Σ_t −T log Σ_{i∈Ω} e^{f_θ(x, ŷ_{⩽t})_i / T}. It corresponds to the normalization term of the softmax function applied to the logits. While it takes into account the whole distribution, it only measures the amount of unnormalized mass before normalization, without attention to how this mass is distributed along the features.

Mahalanobis distance. Following (Lee et al., 2018a; Colombo et al., 2022a), we compute the Mahalanobis distance based on the samples of a given reference set R. Since we are using encoder-decoder models, we use the output of the last hidden layer of the encoder as embedding. Let us denote ϕ(x) this embedding for a conditioning x, and let µ and Σ be, respectively, the mean and the covariance of these embeddings on the reference set. We define the score as a_M(x) = (ϕ(x) − µ)^T Σ^{−1} (ϕ(x) − µ).
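Assuming per-step logits of shape (T, |vocab|), the three scores can be sketched as follows. This is a rough sketch under our own naming; the exact sign convention of the energy score may differ from the authors' implementation:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def avg_msp_score(logits):
    """Average MSP: mean over steps of the mode probability."""
    return float(softmax(np.asarray(logits, dtype=float)).max(axis=-1).mean())

def avg_energy_score(logits, T=1.0):
    """Average energy (Liu et al., 2020): mean_t -T log sum_i exp(f_i / T),
    i.e. the log of the softmax normalization term, averaged over steps."""
    z = np.asarray(logits, dtype=float) / T
    m = z.max(axis=-1)
    return float(np.mean(-T * (m + np.log(np.exp(z - m[:, None]).sum(axis=-1)))))

def mahalanobis_score(phi_x, mu, sigma_inv):
    """Mahalanobis distance of the encoder embedding phi(x) to the
    reference-set mean mu, with a precomputed inverse covariance."""
    d = np.asarray(phi_x, dtype=float) - mu
    return float(d @ sigma_inv @ d)
```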

B.8 Computational budget
We had a budget of 20,000 hours on NVIDIA V100 GPUs. While this is a substantial amount, it was used to compute the benchmarks over many pairs and languages. In practice, our OOD detectors do not require much additional computational overhead since they only rely on the probability distributions already output by the models.

B.9 Towards an interpretable decision
An important dimension of fostering adoption is the ability to verify the decision taken by the automatic system. RAINPROOF offers a step in this direction when used with references: for each input sample, RAINPROOF finds the closest sample (in the sense of the Information Projection) in the reference set to take its decision. We present in Tab. 11 some OOD samples along with their translation scores, projection scores, and their projection on the reference set.

C Scaling to larger models
In order to validate our results, we perform experiments on larger, general-purpose models such as BloomZ (Muennighoff et al., 2022) and NLLB (Team et al., 2022).

Table 13: OOD detection on BloomZ using German as an OOD language for the instruction model.

C.1 Negative results on NLLB
By the very definition of the No-Language-Left-Behind model, it should be particularly hard to find an OOD language to benchmark on. The model still requires special tokens to be set in the sequence to define the source and target languages. We tried to apply our OOD detection methods to situations where the presented language does not correspond to the source language set by the special token. We found that, in this scenario, the likelihood was by far the best discriminator of OOD samples. This can be explained by the fact that our inputs are not actually OOD; they are merely inconsistent with the source-language token, and the model remains well calibrated overall on these inputs.

D Additional OOD features-based baselines
To further support the point that features-based detectors have important flaws when it comes to text generation, we compare our best-performing OOD score to SOTA OOD detectors for text, such as the Data Depth (a_D) (Colombo et al., 2022a) and the Maximum Cosine Projection (a_C) (Zhou et al., 2021).

E Parameters tuning
Detectors depend on their anomaly score to make decisions, and these scores can be parametric. First, soft probability-based scores depend on the soft probability distribution and its scaling. The temperature is therefore a crucial parameter to tune to get the best performance: a small temperature makes the distribution more peaked, while a higher value spreads the probability mass across the classes. Moreover, the Rényi divergence depends on a factor α. We provide here further results and analysis of the impact of these parameters on our results.
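The two knobs discussed above can be illustrated in a few lines. This is a toy sketch of temperature scaling and the Rényi divergence of order α, not the RAINPROOF implementation:

```python
import numpy as np

def temperature_softmax(logits, T):
    """Temperature scaling: T < 1 sharpens the distribution,
    T > 1 spreads the probability mass across the classes."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(p || q); alpha -> 1 recovers the KL
    divergence, and smaller alpha puts more weight on the tails."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.isclose(alpha, 1.0):
        return float(np.sum(p * np.log(p / q)))
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))
```

Raising the temperature visibly lowers the mode probability, which is why the temperature directly affects MSP- and energy-style scores.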
In Fig. 3, we analyse the impact of the temperature and of the α parameter on our Rényi-Negentropy score. Consistently with the results for the information projection, we find that the tail of the distribution is important to ensure good detection of OOD samples for all language shifts. A temperature higher than 2 and lower values of α yield the best results; we recommend using α = 0.5 with a temperature of 2.
We found that our a_Dα score, the Rényi negentropy, is more stable with respect to the temperature and the considered datasets and shifts than the energy-based OOD score and the MSP score. Indeed, in Fig. 4, we show that the baselines do not behave consistently across datasets when the temperature changes. This is a problem when deploying these scores in production: we cannot fit a temperature for each possible type of shift or OOD sample. By contrast, there exist sets of parameters (temperature and α) for which our negentropy-based scores perform consistently across different shifts. Our results show that, when it comes to domain shifts (domain shifts in translation or dialog shifts), reference-based detectors are required to obtain good results. They also show that the more these detectors take into account the tail of the distributions, the better they are, as displayed in Fig. 5. We find that low values of α (near 0) yield better results with the Rényi Information projection a_D*α. This suggests that the tail of the distributions used during text generation carries context information and insights on the processed texts. Such results are consistent with the findings of recent works on the automatic evaluation of text generation (Colombo et al., 2022b).

F.2 Summary of our results
In Fig. 6, we present the performance levels of all the detectors we studied. Our detectors outperform the baselines in every task; for the dialog shift, while the Mahalanobis distance clearly outperforms our detectors for s0, our detectors still outperform the baselines by far in their scenario.

F.3 Detailed results of OOD detection performances
In this section, we present the performances of our OOD detectors on each detailed task, i.e. for each pair of IN and OOD data, with all the considered metrics. Our detectors outperform the other OOD detection baselines in almost all scenarios.

Table 16: Detailed results of the performances of our OOD detectors on different language shifts. The first language of the pair is the reference language of the model and the second one is the studied shift.

F.4.1 Language shifts
In Fig. 7 and Fig. 8, we present the ROC-AUC curves of our different detectors for language shifts in translation.

F.4.2 Domain shifts
In Fig. 9 and Fig. 10 we present the ROC-AUC curves of our different detectors for topic shifts in translation.

F.4.3 Dialog shifts
In Fig. 11 and Fig. 12 we present the ROC-AUC curves of our different detectors for topic shifts in a dialog setting.

G NTM performance
Surprisingly, we show that common OOD detectors tend to exclude samples that the model handles well and to keep some that it does not, leading to decreased overall performance in terms of translation metrics. Moreover, this phenomenon seems more pronounced for reference-based detectors. We show that our uncertainty-based detectors mostly avoid that pitfall and provide both good OOD detection and improved translation performance.

G.1 Absolute performances
It is clear, and somewhat expected, that NMT models do not perform as well on OOD data, as shown in Tab. 19b. However, we find that our OOD detectors are able to remove most of the worst-case samples while keeping enough well-translated samples, so that with correct filtering our method actually allows the model to achieve acceptable BLEU scores.

G.2 Gains
In Tab. 20, we give the detailed gains in translation performance based on the BLEU score.

G.3 Choice of threshold
We believe that the choice of the threshold for OOD detection should not require OOD samples, because we do not want to assume access to all kinds of OOD samples in advance.
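A threshold chosen from IN data only can be sketched as a simple quantile rule (an illustrative sketch with names of our own, not the paper's procedure):

```python
import numpy as np

def threshold_from_in_scores(scores_in, keep=0.99):
    """Choose gamma from IN scores alone: flag as OOD any sample whose
    anomaly score exceeds the `keep`-quantile of the IN scores, so that
    roughly a fraction `keep` of IN data is retained."""
    return float(np.quantile(scores_in, keep))

def filter_ood(samples, scores, gamma):
    """Keep only the samples whose anomaly score is at most gamma."""
    return [s for s, a in zip(samples, scores) if a <= gamma]
```

Setting keep=0.8 corresponds to a γ that removes 20% of the IN dataset, while keep=0.99 corresponds to retaining 99% of the IN data.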

Figure 1: Ablation study on RAINPROOF for α and reference set size (|R|) for dialogue shift detection.Smaller α emphasizes the tail of the distribution, while α = 0 counts common non-zero elements.

A Examining the Limitations of the Mahalanobis-Based OOD Detector for Text Generation
(a) deu data on fra model. (b) spa data on fra model.

Figure 2: PCA reduction of encoder's hidden features for IN and OUT distribution samples, with Mahalanobis distance mean (green cross).The plot reveals the multimodal nature of the distributions.

OOD Detection is usually an unbalanced binary classification problem where the class of interest is OUT. Let us denote by Z the random variable indicating that a sample is actually out of distribution. We can assess the performance of our OOD detectors by focusing on the False alarm rate and on the True detection rate. The False alarm rate, or False positive rate (FPR), is the proportion of IN samples misclassified as OUT; for a score threshold γ, FPR = Pr(a(x) > γ | Z = 0). The True detection rate, or True positive rate (TPR), is the proportion of OOD samples that are detected by the method; it is given by TPR = Pr(a(x) > γ | Z = 1).
Detection error. It is simply the probability of misclassification for a given True positive rate.
Area Under the Precision-Recall curve (AUPR-IN/AUPR-OUT). The Precision-Recall curve plots the recall (true detection rate) against the precision (the actual proportion of OOD samples amongst the predicted OOD). The area under this curve, γ → (Pr(Z = 1 | a(x) > γ), Pr(a(x) > γ | Z = 1)), captures the trade-off between precision and recall made by the detector. A high value represents both high precision and high recall, i.e. the detector captures most of the positive samples while raising few false alarms.
[Flattened table residue listing the studied shifts: language shifts (e.g., SPA-ENG model: Tatoeba SPA vs. News FR, Tatoeba CAT, Tatoeba POR; NLD-ENG: Tatoeba NLD vs. AFR) and domain shifts (DEU-ENG: Tatoeba DEU vs. EMEA DEU and Europarl DEU).]
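The AUPR with OOD as the positive class can be approximated by the average precision over descending thresholds; a minimal sketch under our own naming:

```python
import numpy as np

def aupr_out(scores_in, scores_out):
    """Average precision with OOD as the positive class: precision is
    accumulated each time a true OOD sample is encountered while sweeping
    the threshold from high to low anomaly scores."""
    scores = np.concatenate([scores_in, scores_out])
    labels = np.concatenate([np.zeros(len(scores_in)), np.ones(len(scores_out))])
    order = np.argsort(-scores, kind="stable")  # descending scores
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    # AP = sum of precision at each positive hit, weighted by delta-recall
    return float((precision * labels).sum() / labels.sum())
```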

Figure 5: Impact of α on the performance of the Rényi information projection for dialog shift detection. A smaller α increases the weight of the tail of the distribution; an α of 0 would amount to counting the number of common non-zero elements.

Figure 3: Effect of the temperature and α parameter for a Dα on the performance on OOD detection in terms of AUROC.

Figure 4: Impact of the temperature used to compute the energy (a E ) and MSP (a MSP ) OOD scores in terms of AUROC.

Table 1 :
Summary of the OOD detection performance of our detectors (Ours) compared to commonly used strong baselines (Bas.). We report the best detector for each scenario in bold and underline the best overall. The ↓ indicates that for this score, the lower, the better; otherwise, the higher, the better.

Table 2 :
Correlation between OOD scores and translation metrics BLEU, BERT-S and COMET.

Table 3 :
Impact of OOD detectors on BLEU for IN data only, OOD data, and the combination of both (ALL). We report average BLEU (Abs.), BLEU gains (G.s) compared to f θ only, and the removed subset share (R.Sh.). γ is set to remove 20% of the IN dataset.

Table 7 :
OOD detection on BLOOMZ using German as an OOD language for the LLM.
We translate Tatoeba dataset samples into English, focusing on languages known to be within the distribution for BLOOMZ, while attempting to separate the German samples from them. From Tab. 7, we observe that our no-reference methods perform comparably to the a_MSP baseline but are outperformed by the Mahalanobis distance in this scenario. However, the information projection methods demonstrate substantial improvements over all the baselines.

Table 8 :
Summary of models and studied shifts.

Table 9 :
Number of samples in each (test) dataset.

B.4 Samples
In Tab. 10, we provide examples of small shifts in translation between Spanish and Catalan and their impact on a Spanish-to-English translation model.

Table 11 :
OOD inputs, their translations and projections onto the reference set.The first 2 are far from the reference set and not well translated whereas the next 2 are very close to the reference set and well translated.We can, for that matter, notice that the projection is quite close to the input sentence grammatically speaking.
We notice that, in general, sentences that are close to the reference set, and whose projection has a close meaning, are better handled by f θ. Therefore, one can visually interpret the prediction of RAINPROOF and validate it. This observation further validates our method.

Table 14 :
Performance in OOD detection for NLLB.

Table 15 :
Comparison of our best detector a Dα against SOTA features based-ood detectors on close language shifts.

Table 17 :
Detailed results of the performances of our OOD detectors on different domain shifts. For Spanish (spa) and German (de), we present two domain shifts: technical medical (EMEA) data and legal parliamentary texts (parl) against common language embodied by the Tatoeba dataset (tat).

Table 18 :
Detailed performance results of our OOD detectors on dialog shift against the MultiWOZ dataset as reference set.

Figure 7: ROC-AUC curves for our uncertainty-based metrics compared to common baselines for language shift detection. Baselines are represented in dashed lines.
Figure 9: ROC-AUC curves for our uncertainty-based metrics compared to common baselines for domain shift detection. Baselines are represented in dashed lines.
Figure 11: ROC-AUC curves for our uncertainty-based metrics compared to common baselines for dialog shift detection. Baselines are represented in dashed lines.
Figure 12: ROC-AUC curves for our reference-based metrics compared to common baselines for dialog shift detection. Baselines are represented in dashed lines.
Figure 10: ROC-AUC curves for our reference-based metrics compared to common baselines for domain shift detection. Baselines are represented in dashed lines.

Table 19 :
Absolute translation performances in terms of BLEU on the different subsets (IN, OOD, ALL) of each dataset of our translation OOD performance benchmark.

Table 20 :
Detailed impact of the OOD filtering on the different subset for each task.

Table 23 :
Detailed impacts on NMT performance per task (domain or language shifts) of the different OOD detectors, with a threshold defined to keep 99% of the IN data. We present results on the different parts of the data: IN data, OOD data, and the combination of both, ALL. For each, we report the absolute average BLEU score (Abs.), the average gains in BLEU (G.s.) compared to a setting without OOD filtering (f θ only), and the share of the subset removed by the detector (R.Sh.).