HalOmi: A Manually Annotated Benchmark for Multilingual Hallucination and Omission Detection in Machine Translation

Hallucinations in machine translation are translations that contain information completely unrelated to the input. Omissions are translations that do not include some of the input information. While both cases tend to be catastrophic errors undermining user trust, annotated data with these types of pathologies is extremely scarce and is limited to a few high-resource languages. In this work, we release an annotated dataset for the hallucination and omission phenomena covering 18 translation directions with varying resource levels and scripts. Our annotation covers different levels of partial and full hallucinations as well as omissions both at the sentence and at the word level. Additionally, we revisit previous methods for hallucination and omission detection, show that conclusions made based on a single language pair largely do not hold for a large-scale evaluation, and establish new solid baselines.


Introduction
With neural machine translation systems reaching an overall satisfactory quality, alleviating those rare but severe translation pathologies that undermine user trust becomes very important. These pathologies include hallucinations (translations containing information completely unrelated to the input) and omissions (translations that do not include some of the information of the input). While understanding hallucinations is receiving increasing attention (Raunak et al., 2021; Müller and Sennrich, 2021; Zhou et al., 2021; Guerreiro et al., 2023; Dale et al., 2023; Guerreiro et al., 2022), progress in this direction is hampered by the lack of annotated data. To the best of our knowledge, previous datasets are limited to German-English data with sentence-level annotations of hallucinations and omissions (Guerreiro et al., 2023) and Chinese-English data with token-level hallucination labels (Zhou et al., 2021). Previously available general-purpose quality assessments, such as direct assessment (DA; Graham et al., 2013), MQM (Lommel et al., 2014), or XSTS (Licht et al., 2022), do not seem suitable since they do not distinguish between hallucinations, omissions, and other translation errors. In this work, we aim to address this limitation.
Ideally, an evaluation dataset for hallucination/omission detection should satisfy several conditions: (i) the data has to cover a broad range of languages with varying resource levels and scripts, (ii) the translations should be generated naturally, (iii) the models that produced the translations have to be available, and (iv) the considered modeling paradigms have to cover several approaches (i.e., encoder-decoder vs decoder-only, single language pair vs multilingual, etc.). The first point is important because the best-performing detectors differ between high- and low-resource settings, and general conclusions cannot be made based on a single language pair (Section 4). Secondly, translations have to be generated naturally, as opposed to using specifically developed perturbations of models and/or data, because conclusions made when detecting pathologies under perturbation might not transfer to the detection of natural hallucinations (Section 6). Thirdly, the corresponding model should be released along with the translations to allow evaluating "internal" detection methods that rely on characteristics internal to a model. Finally, in the most ideal setting, various models are needed to test whether the detection methods transfer between modeling approaches.
While satisfying all the desiderata is very challenging, we can satisfy all but the last by focusing on the state-of-the-art multilingual NLLB-200 model (NLLB et al., 2022). In addition to covering a broad range of languages and being publicly available along with its training data, NLLB is widely recognized and is likely to stay the state-of-the-art multilingual translation model for the foreseeable future. For this model, we choose 18 language pairs that include high- and low-resource languages, as well as a zero-shot pair (Figure 1). We develop rigorous annotation guidelines for identifying full and partial hallucinations and omissions and use these guidelines for manual annotation of translations in all 18 directions. The resulting dataset contains fine-grained sentence-level and token-level annotations.
We highlight the importance of our dataset by making several valuable observations that would not be possible otherwise. For example, we find that for low-resource directions, internal methods perform much better than external methods, which fail substantially. When analyzing the performance of a range of recently introduced pathology detection methods, we see that some of the previous results do not transfer across languages. As another example, we show that relying on attention to make conclusions about translation quality is very fragile. Finally, we introduce some detection tasks (e.g., token-level omission detection) that became possible only with our data. We believe our work opens the door for reliable and accessible research on detecting and analyzing translation pathologies as well as understanding their causes.
Overall, we:
• release a dataset with fine-grained professional annotations of hallucinations and omissions for 18 language pairs;
• analyze previous sentence-level detectors and find, among other observations, that (i) for low-resource settings, model-internal characteristics work best, and (ii) attention is very fragile when used to judge translation quality;
• introduce word-level pathology detection tasks along with baselines.

Only 4 months after the launch of NLLB-200, Wikimedia reported that it was the third most used machine translation engine, accounting for 3.8% of all published translations. Beyond Wikimedia, its scientific impact is also prominent: the NLLB-200 model has been used as a standard point of comparison for other MT paradigms such as prompting with large language models (Zhu et al., 2023). This impact is possible because the model is open-sourced and available on various platforms, e.g., GitHub and HuggingFace. This broad usability backs up our motivation to annotate the outputs of this model.

Dataset Creation
The steps to create the dataset include (i) choosing the language pairs, (ii) gathering data for annotation, (iii) developing annotation guidelines and qualification sets, (iv) manual annotation, and (v) post-processing. In this section, we explain these steps in more detail.

Selection of Languages
We optimized the language selection in order to cover (i) different resource levels and (ii) a variety of language families and scripts. Among the languages available in NLLB-200, we include 5 high-resource language pairs (Arabic, Mandarin Chinese, German, Russian, and Spanish paired with English), 3 low-resource language pairs (Kashmiri, Bengali, and Yoruba paired with English), and a zero-shot pair (Spanish-Yoruba). We consider all language pairs in both directions, which gives us 18 translation directions summarized in Figure 1.

Gathering Data for Annotation
Since strong NLLB models rarely generate hallucinations and omissions, getting translations that are likely to contain these types of errors is not straightforward.To gather these translations, we developed a multi-step procedure where we first choose data to generate translations and then choose a subset of the generated translations for annotation.
Choosing data for translation. In addition to clean in-domain data, we use noisier out-of-domain sources. The latter choice stems from our expectation that the NLLB model will not hallucinate much when handling high-resource languages. Overall, the data we use to generate translations is as follows:
• in-domain: FLORES-200 development set (NLLB et al., 2022);
• out-of-domain: Jigsaw toxicity detection competition corpora (Jigsaw, 2020) for English, Russian, and Spanish; comments from Wikipedia discussion pages for Chinese, Arabic, and German.
The Jigsaw corpora were extracted from Wikipedia talk pages, so the distributions of these texts are rather similar.
We translated these texts with the 600M distilled NLLB model following the standard setting: beam size 5, forbidden generation of the <UNK> token, forbidden repetition of 4-grams, and a translation length limit of 3·len(source)+5 tokens.
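A minimal sketch of this decoding configuration. The function and the `generation_kwargs` dictionary are illustrative (e.g., in the spirit of a HuggingFace-style `generate` call), not the authors' actual script:

```python
def max_target_length(num_source_tokens: int) -> int:
    """Translation length limit: 3 * len(source) + 5 tokens."""
    return 3 * num_source_tokens + 5

# Beam-search settings matching the description above.
generation_kwargs = {
    "num_beams": 5,             # beam size 5
    "no_repeat_ngram_size": 4,  # forbid repeating 4-grams
    # banning the <UNK> token would also go here (e.g., via a
    # bad-words list with the tokenizer's <UNK> id)
}

# For a 10-token source sentence, generation is capped at 35 tokens.
print(max_target_length(10))
```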
Choosing translations for annotation. To find potentially pathological translations, we scored the resulting sentence pairs by multiple metrics that were used as hallucination detectors in previous works. Specifically, we used some methods from Guerreiro et al. (2023): ChrF++ (Popović, 2017), reference-based COMET and referenceless COMET-QE (Rei et al., 2020), and sequence log-probability (their best detector). We also used some methods introduced in Dale et al. (2023): cosine similarity coming from LASER3 (Heffernan et al., 2022) and LaBSE (Feng et al., 2022), a bidirectional XNLI score, and ALTI+ source contributions (Ferrando et al., 2022).
We applied the reference-based metrics only to the FLORES data. For each translation direction and data source, we selected sentence pairs with three strategies:
• sample uniformly, to preserve data diversity and non-hallucinated examples;
• sample favoring potentially pathological translations (with probabilities proportional to the quantiles of the detectors);
• pick the worst according to the detectors, to increase the chance of hallucinations.
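The three selection strategies can be sketched as follows. The function name and the rank-based weighting are our illustrative assumptions, with higher detector scores taken to mean "more likely pathological":

```python
import random

def select_for_annotation(scores, k_uniform, k_weighted, k_worst, seed=0):
    """Pick sentence-pair indices using the three strategies above."""
    rng = random.Random(seed)
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # ascending scores
    # Quantile-based weights: the worse the detector score, the higher
    # the probability of being sampled.
    rank = {idx: r for r, idx in enumerate(order)}
    weights = [(rank[i] + 1) / n for i in range(n)]

    uniform = rng.sample(range(n), k_uniform)                        # strategy 1
    weighted = rng.choices(range(n), weights=weights, k=k_weighted)  # strategy 2
    worst = order[-k_worst:]                                         # strategy 3
    return uniform, weighted, worst
```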

Guidelines and Qualification Tests
To ensure annotation quality, guidelines and qualification tests were prepared by professional linguists.
Annotation guidelines. These guidelines define:
• the notion of hallucinations and omissions;
• the hallucination vs mistranslation distinction;
• hallucination/omission severity levels.
Figure 2 summarizes the resulting guidelines. Note that distinguishing hallucinations from other translation errors is one of the known difficulties when dealing with hallucinations (Raunak et al., 2021; Guerreiro et al., 2023). In our guidelines, a token is referred to as hallucinated if there is no corresponding token in the source (Figure 2). For all pathologies, linguists provide positive and negative examples in diverse languages. Additionally, we ask the annotators to mark whether a translation is incomprehensible, i.e., whether the text is garbled or in another language. These translations are then discarded.

Qualification tests. The annotators recruited for this project were translators and reviewers who participated in the FLORES translation effort (NLLB et al., 2022) or have other professional translation experience. They are allowed to annotate our data only after passing a specifically developed qualification test. An annotator can fail the test no more than once, in which case they are given an opportunity to receive detailed feedback and redo the test. If they do not achieve a passing score of 96% at the second attempt, the vendor is required to find a replacement.
Our qualification tests were developed for each of the language directions and contained 15 items to annotate: 3 full hallucinations, 4 partial hallucinations, 2 word-level hallucinations, 5 mistranslations, and 1 incomprehensible sentence. The tests were found effective in identifying annotation quality issues before annotators work on real data.
Post-processing. From the data, we discard the translations marked as incomprehensible, along with data that has formal issues (e.g., unbalanced brackets in the word-level annotations of hallucinations or omissions, or word-level annotations that differ significantly from the initial input/output texts). After this filtering, we were left with 144 to 197 annotated sentence pairs per direction.

Dataset Description
Annotation format. The resulting data contains the source text and its translation, along with the word-level and sentence-level annotations of omissions and hallucinations. Figure 3 shows examples of annotated translations from our dataset.
Overall statistics. Figure 1 shows the proportions of hallucinations and omissions in the data (translations with both hallucinations and omissions are counted as hallucinations). Overall, all directions have at least 3% of translations with hallucinations (1% full) and 17% with omissions (5% full). Most of the full hallucinations are also labelled as full omissions, and vice versa.
Differences between resource levels. From Figure 1 we see that, not surprisingly, high-resource language pairs hallucinate less than low-resource ones. A less anticipated difference between high- and low-resource settings appears when looking within each language pair. In high-resource settings, translating to English leads to more hallucinations than translating from English. In contrast, for low-resource pairs, translation from English has higher hallucination rates than translation to English for the same language pair. This might inspire future work analyzing the role of English data in the multilingual NLLB model. Finally, while for the zero-shot pair one might expect more pathologies, this is not what we see: results for the zero-shot pair are comparable to those for low-resource languages.

Sentence-Level Detection
Detecting pathologies at the sentence level is the task of flagging a whole translation as pathological or not. This is the standard definition of, e.g., the hallucination detection task (Lee et al., 2019; Müller et al., 2020; Raunak et al., 2021; Guerreiro et al., 2023; Dale et al., 2023; Guerreiro et al., 2022; Xu et al., 2023). Such sentence-level pathology detection (instead of flagging individual erroneous tokens) is an integral part of hybrid pipelines where a machine-generated translation is first passed to a quality estimation system and then, if needed, corrected by human translators.
Detection tasks. For our dataset, we define three sentence-level detection tasks:
• hallucination detection: same as in the previous work mentioned above;
• omission detection: detecting translations with omissions on a hallucination-free subset (this restriction disentangles omissions from the more severe hallucination pathology);
• pathology detection: detecting translations that are either hallucinations or omissions.
Evaluation methodology. We evaluate the ability of a detector to rank more severe pathologies higher (e.g., full hallucinations higher than partial ones, any hallucinations higher than non-hallucinations, etc.). For this, we use an adaptation of the binary ROC AUC score to several classes. Formally, we subtract the proportion of incorrectly ranked pairs from the perfect score of 1. For two classes, this metric is equivalent to the ROC AUC score. We compute the metrics for each translation direction separately.
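A minimal sketch of this multi-class ranking metric. For two classes it reduces to binary ROC AUC; counting score ties as half-errors is our implementation assumption:

```python
def ranking_auc(labels, scores):
    """1 minus the fraction of incorrectly ranked pairs, over all pairs
    whose severity labels differ. `labels` encode severity (higher = more
    severe), `scores` are detector outputs (higher = more pathological)."""
    correct, total = 0.0, 0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] < labels[j]:  # j should be ranked above i
                total += 1
                if scores[j] > scores[i]:
                    correct += 1
                elif scores[j] == scores[i]:
                    correct += 0.5     # ties count as half an error
    return correct / total
```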

Detection Methods
Detection metrics can be either internal, i.e., relying only on information from the model that generated the inspected translation, or external, i.e., using external models. We use the best detectors from several recent works, along with some modifications of them that we propose in this work. The metrics are summarized in Figure 4.
Internal methods. For internal methods, we use the best method from Guerreiro et al. (2023), sequence log-probability, and the best internal method from Dale et al. (2023), ALTI. ALTI (Ferrando et al., 2022) is an attribution method that evaluates token contributions to generated translations. For hallucination detection, Dale et al. (2023) evaluate how much, on average, the prediction of each target token is based on the source. Here, mostly to detect omissions, we propose a different variant, ALTI^T, that computes how much, on average, each source token was used to generate the translation. Intuitively, if many source tokens are not used during generation, the translation is likely to miss some information. The difference between the two versions of ALTI is illustrated in Figure 4.
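To make the difference concrete, here is a toy sketch over a (target × source) matrix of token contributions. The matrix and the exact aggregations (mean of row sums for ALTI; mean of per-source-token maxima for ALTI^T) are our illustrative assumptions, not the formulas of Ferrando et al. (2022):

```python
def alti_score(contrib):
    """Mean over target tokens of their total source contribution.
    Low values suggest target tokens weakly grounded in the source,
    i.e., possible hallucination."""
    return sum(sum(row) for row in contrib) / len(contrib)

def alti_t_score(contrib):
    """Mean over source tokens of their strongest contribution to any
    target token. Low values suggest some source tokens were never
    used, i.e., possible omission."""
    src_len = len(contrib[0])
    peak = [max(row[s] for row in contrib) for s in range(src_len)]
    return sum(peak) / src_len

# Toy matrix: 2 target tokens x 2 source tokens. The second source
# token is barely used anywhere, so ALTI^T is low while ALTI is not.
contrib = [[0.5, 0.0],
           [0.4, 0.1]]
```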
External methods. For external methods, we use the state-of-the-art quality estimation system COMET-QE (Rei et al., 2020) and the sentence similarity measures proposed in Dale et al. (2023). The latter are cosine similarities coming from LASER3 (Heffernan et al., 2022) and LaBSE (Feng et al., 2022), and a bidirectional XNLI score.
Gray-area method. Finally, we also use a recent optimal-transport-based measure evaluating the abnormality of the attention distribution compared to those of good translations (Guerreiro et al., 2022). While this method uses internal characteristics, it requires external data, i.e., a large collection of attention maps for "good" translations, which can be hard to obtain for low-resource settings.

Experimental Results
The detection scores are shown in Figure 5.
High-resource: much easier to handle. We can see that it is much easier to detect pathologies in high-resource settings: the gap between the best scores for high- and low-resource directions is rather big (e.g., for hallucinations, 0.89 vs 0.72). Note also that for high-resource language pairs, both internal and external methods perform quite well (e.g., Seq-Logprob and LaBSE for hallucinations; XNLI, LaBSE, and ALTI^T for omissions).
Low-resource: internal methods take the lead. In low-resource settings, external methods degrade substantially, with some of them losing all discriminative ability. For example, the otherwise high-performing XNLI drops close to chance for both hallucination and omission detection. Overall, external models are unlikely to be competitive for low-resource directions, as they do not observe enough relevant data during training. While previous work already expressed this concern and advocated focusing on internal methods (Dale et al., 2023), without our dataset, verifying this was not possible. Now we show that, indeed, for low-resource languages, internal methods become noticeably better.
Hallucinations: Seq-Logprob remains the best. After it turned out that the standard sequence log-probability is more informative for hallucination detection than the heuristics introduced earlier (Guerreiro et al., 2023), a few recent works reported improvements over Seq-Logprob: ALTI and LaBSE in Dale et al. (2023) and Attn-OT in Guerreiro et al. (2022). We see, however, that on average, Seq-Logprob is still the best. This discrepancy comes from the fact that those works drew conclusions based on a single language pair, which highlights the importance of our dataset enabling large-scale evaluation over several language pairs. For example, while the hallucination detection results for English-German agree with Dale et al. (2023), for other languages this is not the case.
Attention-based method: close to chance. For hallucinations, Attn-OT, which detects attention anomalies, is an outlier and performs close to chance. While previous work already showed that relying on attention patterns to make conclusions about translation quality can be fragile (Guerreiro et al., 2023), results with our dataset highlight this even further. This points to a larger debate on the distinction between attention and attribution and the consequences of mistaking one for the other (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Bastings and Filippova, 2020). While Attn-OT was introduced as a way to evaluate detachment from the source sequence (Guerreiro et al., 2022), we see that implementing this intuition with attention instead of attribution (as in, e.g., ALTI) leads to varying results: from high performance in Guerreiro et al. (2022) to near-random performance in our experiments.
Omissions: internal ALTI^T performs best. Finally, for detecting omissions among non-hallucinations, the detection quality is generally worse than for hallucinations. This agrees with the view that an omission is a less severe pathology than a hallucination and is hence harder to detect. The best overall method is ALTI^T, which confirms our intuition that if, according to token contributions, some source words are not used for translation, the translation is likely to omit relevant information. LaBSE and XNLI also perform well for high-resource languages but, similarly to hallucination detection, are worse than internal methods for low-resource ones. Finally, while Attn-OT does not seem to identify hallucinations, it is more useful for detecting omissions.

Word-Level Detection
In contrast to sentence-level detection, the task of detecting pathologies at the word level has received much less attention. In terms of both available data and detection methods, previous attempts were rather limited (Zhou et al., 2021; Vamvas and Sennrich, 2022). In our work, we want to facilitate future research in this direction.
Detection tasks. We define two detection tasks:
• hallucination detection: for each translation word, predict whether it is hallucinated;
• omission detection: for each source word, predict whether it is omitted from the translation.
Segmentation. We segment texts using the SacreBLEU tokenizer (Post, 2018). For Chinese, it treats each Chinese character as an individual word. For other languages, it applies regex-based tokenization based on spaces and punctuation.
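A rough sketch of this segmentation behavior; the regex is illustrative, not the actual SacreBLEU tokenizer:

```python
import re

def segment(text):
    """Split text into 'words': each CJK character individually,
    otherwise runs of word characters, with punctuation separated."""
    return re.findall(r"[\u4e00-\u9fff]|\w+|[^\w\s]", text)
```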
Evaluation methodology. For these binary classification tasks, we evaluate detection quality with the ROC AUC score. Since models operate at the token level, we make predictions for tokens and not words. If a word is segmented into several tokens, we assign to this word the worst score among its tokens (i.e., if one of a word's tokens is hallucinated, the entire word is considered hallucinated).
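The token-to-word aggregation can be sketched as follows, assuming higher scores mean "more likely pathological" so that the worst score is the maximum:

```python
def word_scores(token_scores, word_ids):
    """Assign each word the worst (maximum) pathology score among its
    tokens; `word_ids[i]` is the index of the word that token i belongs
    to. Returns one score per word, in word order."""
    scores = {}
    for s, w in zip(token_scores, word_ids):
        scores[w] = max(scores.get(w, float("-inf")), s)
    return [scores[w] for w in sorted(scores)]
```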

Detection Methods
To the best of our knowledge, there are no publicly available models for token-level detection of hallucinations or omissions that could easily be adapted to handle the language pairs in our dataset. Applying the previous approach of Zhou et al. (2021), i.e., training a specialized model (or several language-pair-specific models) on synthetic data, would require substantial engineering and research effort to work well across diverse resource levels. Therefore, we suggest starting with internal methods and their combinations.
Internal methods. For internal methods, we rely on the same methods that were used previously: model log-probability and ALTI (Figure 6). We use two types of log-probability: the standard one and its difference with the unconditioned log-probability for the same token (i.e., when conditioning on an empty source sentence). Intuitively, the latter is one more way of measuring whether the model uses the source or relies more on its language model. For ALTI, we use both the total source contribution and the maximum contribution among the source tokens. The latter is high when the model is "focused" on a specific source token, which might be indicative of prediction quality. For omission detection, we use the same methods with swapped source and target sentences (i.e., ALTI turns into ALTI^T).
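The contrastive log-probability reduces to a per-token difference between two model runs. A minimal sketch (the two input arrays would come from scoring the translation with the real source and with an empty source, which we do not reproduce here):

```python
def contrastive_logprob(cond_logprobs, uncond_logprobs):
    """Per-token difference between the source-conditioned
    log-probability and the one obtained with an empty source.
    Values near zero suggest the model is ignoring the source and
    falling back on its internal language model."""
    return [c - u for c, u in zip(cond_logprobs, uncond_logprobs)]
```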
Combination of methods. Apart from the individual methods, we also consider their linear combinations. We use a logistic regression trained using 3-fold group-wise cross-validation (with sentence ids as groups). We train the same feature combination for all languages by maximizing the detection score on the pooled data.
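The group-wise split can be sketched in pure Python; the round-robin assignment of groups to folds is an illustrative stand-in for, e.g., scikit-learn's GroupKFold:

```python
def group_kfold(groups, k=3):
    """Split example indices into k folds such that all examples sharing
    a group id (here: a sentence id) land in the same fold, so tokens of
    one sentence never appear in both train and test."""
    unique = sorted(set(groups))
    fold_of_group = {g: i % k for i, g in enumerate(unique)}
    folds = [[] for _ in range(k)]
    for idx, g in enumerate(groups):
        folds[fold_of_group[g]].append(idx)
    return folds
```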

Experimental Results
The results are shown in Figure 7. Overall, the methods we proposed are reasonable and perform much better than the random baseline.
Token contributions perform best. We see that for both hallucinations and omissions, token contributions coming from ALTI (or ALTI^T for omissions) perform better than the log-probability coming from the model. Note that for hallucinations, this is different from the sentence-level results, where Seq-Logprob outperformed ALTI.

Contrastive vs standard log-probability. Another interesting observation is that for hallucination detection in the high-resource setting, the contrastive log-probability gives a better signal than the standard log-probability. This indicates that explicitly comparing the changes when dropping some information can be useful. This is not surprising: in a broad sense, our contrastive log-probability is a variant of erasure-based interpretation approaches (Zeiler and Fergus, 2014; Li et al., 2017; Kádár et al., 2017; Poerner et al., 2018; Li et al., 2019). In our case, the erased part is rather large, i.e., the whole source sentence. For such large segments, a similar idea previously appeared in context-aware machine translation when dropping the entire context sentence (Fernandes et al., 2021).
The detectors are complementary. Finally, we see that log-probability and token contributions are complementary: for both hallucinations and omissions, combining the features leads to a noticeable improvement in detection quality.

Natural vs Artificially Induced Pathologies
One of the difficulties when dealing with hallucinations (and, to a lesser extent, omissions) is that they are a rare phenomenon. Therefore, previous work often resorted to artificially amplifying the problem by applying various perturbations (Lee et al. (2019); Raunak et al. (2021), among others). However, it is not clear whether conclusions made based on this synthetic data transfer to pathologies generated by a model in a natural setting. This is especially important when evaluating detection methods: for example, using the internal workings of a model to detect its pathological behavior is not necessarily helpful for pathologies the model did not generate "voluntarily". In this section, we compare the performance of detection methods on two datasets: (i) our dataset with natural translations and (ii) translations generated under perturbation (annotated using the same protocol).
Model perturbation. To encourage the model to hallucinate or omit source information while still generating fluent text, we decrease the output activations of all the encoder-decoder attention layers by a constant multiplier α. Intuitively, this should imitate detachment from the source and increase the hallucination rate. Indeed, translations generated this way are overall more pathological (see Figure 11 in Appendix B). We use α = 0.3 to match the average Seq-Logprob of reference translations.
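The perturbation itself is just a constant rescaling of the cross-attention outputs. A dependency-free sketch with a nested list standing in for the layer's output tensor (in practice this would be applied inside the model, e.g., via a forward hook on each encoder-decoder attention module):

```python
def perturb_cross_attention(attn_output, alpha=0.3):
    """Scale the output activations of an encoder-decoder attention
    layer by a constant multiplier alpha, imitating detachment from
    the source."""
    return [[alpha * x for x in row] for row in attn_output]
```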
"Artificial" data might not be informative.
Figure 8 shows sentence-level detection scores for different methods of hallucination and omission detection (averaged over all translation directions). We can see that perturbing the translation model introduces biases into the evaluation of detection methods. For example, for hallucination detection, Seq-Logprob outperforms ALTI on the natural dataset and loses on the artificial one. For omission detection, ALTI^T is the best on the natural dataset, while XNLI is better on the artificial one. Overall, we see that data with translations generated under perturbation has to be used with caution, especially when evaluating pathology detection methods: the conclusions are likely to not transfer to the natural setting.

Discussion
We showed that some of the existing internal and external methods can detect hallucinations and omissions with quality that is far better than chance, yet far from perfect. But in which cases do the methods perform well, and what can be improved?

Sentence-Level Detection
Figure 9 shows some manually selected examples of false and true positive and negative detection results for the English-Spanish translation direction.
Flagging correct translations. Examples 1-3 are correct translations, but some of them are flagged as pathological. Example 2 is flagged as a hallucination and an omission, probably because the input ":::Jeez." is slang and has a wide range of potential meanings. Example 3 is flagged by Seq-Logprob: for the model, this translation may be "unlikely" because it is short and consists of a rare word.
Difficult-to-detect pathologies. Examples 4-6 show partial hallucinations and omissions that are difficult to detect, either because (in some sense) they resemble a correct translation, or because the translation indeed remains within the range of possible translations despite having these pathologies. This raises a question: what does it really mean for a translation to be hallucinated? Is writing "grains" instead of "grands" a hallucination or just a typo? Are the phrases "U.S. Speaker of the House" and "U.S. Speaker of the House of Representatives" equivalent? While the sentence-level labels in our dataset are fine-grained, the severity of a pathology is defined based on the number of omitted/hallucinated words rather than on the degree of semantic inadequacy of the pathology (similarly to, e.g., Guerreiro et al. (2023)). This agrees with previous work noting that defining the severity of translation errors is not straightforward (Graham et al., 2013; Licht et al., 2022; Guerreiro et al., 2023).
Correctly detected pathologies. Examples 7-11 present more severe hallucinations and omissions; these are detected correctly by at least a few of the considered methods. Many of these pathologies are produced by out-of-distribution inputs: conjunctions of several sentences, typos, non-sentence texts (such as dates), and very short or incomplete sentences. For the latter, note that short sentences are generally non-typical for NLLB training data. As we see, for such inputs the model is often not confident even when the translation is correct (see, e.g., examples 2 and 3: correct but short translations are flagged as pathological). This suggests that such critical pathologies might be alleviated by augmenting training data with similar (very short, multi-sentence, etc.) samples.

Word-Level Detection
Figure 10 shows examples of word-level detection.
Logprob focuses on the beginnings. We notice that log-probability focuses more on the beginning of a word or a sentence. This makes sense: the model's uncertainty in prediction is generally higher at the beginning of generation.
ALTI focuses on the endings. In contrast, token contributions focus on word endings. This is again expected: when generating a token that completes a word, the source contribution is likely to be lower than for the other tokens. However, the predictions are still very reasonable: for the last three examples, ALTI detects omissions and hallucinations more confidently than log-probability.
Finally, we see that feature-based combination of the methods leads to more refined results.

Additional Related Work
Apart from the work mentioned above, previous work on hallucinations in machine translation largely avoided human annotation. To judge whether a translation is hallucinated, it relied on various heuristics or string-based automatic evaluation metrics (Lee et al. (2019); Berard et al. (2019); Müller and Sennrich (2021); Raunak et al. (2021), among others). These methods, however, were shown not to be indicative of hallucinations (Guerreiro et al., 2023), which highlights the importance of our human-annotated data. For omissions, previous work mostly focused on cases of an empty translation (Stahlberg and Byrne, 2019; Vijayakumar et al., 2016), with some work using artificially created undertranslations (Vamvas and Sennrich, 2022). As we saw in Section 6, the latter are unlikely to be helpful when evaluating detection methods.

Conclusions
In this work, we present the first dataset with human-annotated hallucinations and omissions that satisfies several conditions: (i) it covers a broad range of languages with varying resource levels and scripts, (ii) the translations are generated naturally (i.e., without artificial perturbations), and (iii) the model producing the translations is publicly available. Through our extensive experiments, we illustrate why each of these conditions is important.
Additionally, we make several observations of individual importance: for example, that for low-resource directions, internal pathology detection methods perform better than external ones, and that attention is very fragile when used to judge translation quality. Overall, we believe our work opens the door for reliable and accessible research on detecting and analyzing translation pathologies.

Limitations
Our experiments are reproducible, and our dataset together with the NLLB model can be widely used to evaluate progress on hallucination and omission detection and mitigation. However, all the annotated translations were generated with a single model, and generalization to other models is yet to be verified.
Moreover, the dataset is rather small and may not cover all possible cases of hallucinations and omissions.

Ethical considerations
The annotations were provided by professionals, all of whom were paid a fair rate.

A Attention-based anomaly detection
Reproducibility. The Attn-OT sentence-level detection method that we use in Section 4 is our reproduction of the Wass-Combo method from Guerreiro et al. (2022). Since their paper did not provide code or training data, our implementation is not exact. For Wass-to-Unif, we obtained the same ROC AUC scores as Guerreiro et al. (2022) on their test set; for Wass-to-Data and Wass-Combo, however, our AUC scores are 2% lower than in the original paper, probably due to differences in selecting the reference data.
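The ROC AUC metric used for these comparisons can be computed without any library via its Mann-Whitney U interpretation: the probability that a randomly chosen pathological example receives a higher detection score than a randomly chosen clean one. A minimal sketch (function name is ours, not from the paper):

```python
def detection_auc(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic.

    scores: per-sentence detection scores (higher = more pathological).
    labels: 1 for pathological translations, 0 for clean ones.
    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For a detector that ranks every pathological sentence above every clean one, this returns 1.0; a detector no better than chance hovers around 0.5.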
Reference data for 18 directions. To apply the Attn-OT method to our data, we created reference translations for each of the 18 translation directions with the following steps:
• Sample 1M sentences for each language from the NLLB mined training data (NLLB et al., 2022);
• Translate them with the same settings as in Section 2.2;
• For each translation direction, drop the resulting sentence pairs that fall into the worst 20% by any of the following criteria: shortest-to-longest length ratio between source and translation, Seq-LogProb, and LASER3 cosine similarity between source and translation.
After this filtering, about 600K sentence pairs per direction remain as reference translations.
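The filtering step above can be sketched as follows; the column names are hypothetical, and all three scores are assumed to be "higher is better":

```python
import numpy as np
import pandas as pd

def filter_reference_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only sentence pairs that are NOT in the worst 20% by any criterion.

    Assumed (hypothetical) columns:
      len_ratio    -- shortest-to-longest length ratio of source and translation
      seq_logprob  -- model's sequence log-probability of the translation
      laser_sim    -- LASER3 cosine similarity between source and translation
    """
    keep = np.ones(len(df), dtype=bool)
    for col in ["len_ratio", "seq_logprob", "laser_sim"]:
        # the worst 20% are the values below the 20th percentile
        threshold = df[col].quantile(0.20)
        keep &= df[col].values >= threshold
    return df[keep]
```

Because a pair is dropped if it is in the worst 20% by *any* criterion, the surviving fraction is at most 80%, which is consistent with roughly 600K of 1M pairs remaining per direction.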
Computing scores. To compute the attention distribution, we average the encoder-decoder attention maps of the last decoder layer over heads and over target tokens, just like Guerreiro et al. (2022). Our Attn-OT score is then computed with the same formula as Wass-Combo in Guerreiro et al. (2022), with one slight difference: the Wass-to-Unif score s_wtu is scaled by matching its 1% and 99% quantiles to those of the Wass-to-Data score s_wtd, instead of min-max scaling, to improve numerical stability. Along with this score, we also evaluate the Wass-to-Unif and Wass-to-Data scores from Guerreiro et al. (2022), as well as their weighted average with weights inversely proportional to their standard deviations, which we call Wass-Mean.
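The two ingredients described above can be sketched as follows, assuming an attention tensor of shape [n_heads, tgt_len, src_len] (the layout is an assumption; adapt it to your model). For source tokens on a unit-spaced grid, the 1-D Wasserstein distance reduces to the L1 distance between CDFs:

```python
import numpy as np

def wass_to_unif(attn: np.ndarray) -> float:
    """Wasserstein distance between the averaged source attention mass
    and the uniform distribution over source positions.

    attn: [n_heads, tgt_len, src_len] encoder-decoder attention of the
    last decoder layer (assumed layout).
    """
    # average over heads and target tokens -> a distribution over source tokens
    src_mass = attn.mean(axis=(0, 1))
    src_mass = src_mass / src_mass.sum()
    uniform = np.full_like(src_mass, 1.0 / len(src_mass))
    # 1-D Wasserstein on unit-spaced positions = L1 distance between CDFs
    return float(np.abs(np.cumsum(src_mass) - np.cumsum(uniform)).sum())

def quantile_scale(x: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Scale x so that its 1% and 99% quantiles match those of ref --
    the stabilised alternative to min-max scaling mentioned above."""
    x_lo, x_hi = np.quantile(x, [0.01, 0.99])
    r_lo, r_hi = np.quantile(ref, [0.01, 0.99])
    return (x - x_lo) / (x_hi - x_lo) * (r_hi - r_lo) + r_lo
```

A perfectly uniform attention map yields a Wass-to-Unif score of zero; concentrating mass on a few source tokens (as hallucinations tend to do) pushes the score up.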
Dropping the EOS. We observed that in the setting above, Wass-to-Unif has nearly zero rank correlation with hallucination severity. After inspecting the attention maps, we found that on most heads, about 75% of the attention weight is assigned to the end-of-sentence (EOS) token. This probably compensates for the fact that the norm of its encoder hidden state is an order of magnitude smaller than for other tokens, which aligns with the observations of Kobayashi et al. (2020). It makes a standard attention map highly non-uniform and may obscure the more informative differences between the attention maps of different translations. To mitigate this, we computed a second version of all scores, dropping the attention to the EOS token and renormalizing so that the attention to the remaining tokens sums to 1 again. Removing the EOS token improves the Wass-to-Unif and Wass-Mean scores, but slightly hurts the Wass-to-Data and Wass-Combo scores. Whichever method we use, its performance for hallucination detection is not much better than chance.
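The EOS-dropping step is a small transformation of the attention map; a sketch, assuming the EOS token sits at the last source position:

```python
import numpy as np

def drop_eos_and_renormalize(attn: np.ndarray) -> np.ndarray:
    """Remove the attention assigned to the source EOS token and renormalize
    so each target position's weights sum to 1 again.

    attn: [tgt_len, src_len] attention map averaged over heads;
    the EOS token is assumed to be the final source position.
    """
    no_eos = attn[:, :-1]
    return no_eos / no_eos.sum(axis=-1, keepdims=True)
```

After this transformation, a map that looked highly non-uniform only because of the EOS spike becomes much closer to uniform, which is exactly what restores Wass-to-Unif's discriminative signal.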

B Natural vs Artificially Induced Pathologies
Figure 11 shows the annotation results for the two types of data: translations generated naturally and under perturbation. We see that under perturbation, the model generates more pathological translations.
Figure 12 shows similar results, but with the data downsampled so that the natural and perturbed subsets have an equal number of observations for each combination of pathology type, source dataset, and translation direction. This ensures that differences in detection performance do not come merely from different distributions of pathology types. We can see that even for these curated subsets, the conclusions do not transfer from perturbed to natural pathologies.
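The balanced downsampling described above can be sketched with a grouped subsample; the column names are hypothetical:

```python
import pandas as pd

def balanced_subsets(natural: pd.DataFrame, perturbed: pd.DataFrame,
                     keys=("pathology", "dataset", "direction")):
    """Downsample both subsets so they contain the same number of rows for
    each (pathology type, source dataset, translation direction) combination.

    Column names in `keys` are assumptions, not the paper's actual schema.
    """
    out_nat, out_pert = [], []
    nat_groups = natural.groupby(list(keys))
    pert_groups = perturbed.groupby(list(keys))
    # only combinations present in both subsets can be balanced
    for key in set(nat_groups.groups) & set(pert_groups.groups):
        n = min(len(nat_groups.get_group(key)), len(pert_groups.get_group(key)))
        out_nat.append(nat_groups.get_group(key).sample(n, random_state=0))
        out_pert.append(pert_groups.get_group(key).sample(n, random_state=0))
    return pd.concat(out_nat), pd.concat(out_pert)
```

Taking the minimum group size on each side guarantees identical pathology-type composition in both subsets, so any remaining gap in detection performance must come from the natural-vs-perturbed distinction itself.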

Figure 2 :
Figure 2: Overview of some parts of the annotation guidelines.

Figure 4 :
Figure 4: Summary of the sentence-level detection methods.

Figure 5 :
Figure 5: Results for sentence-level detection of hallucinations (left) and omissions (right).

Figure 6 :
Figure 6: Token-level detection methods for hallucinations. For omissions, we swap the source and the target.

Figure 8 :
Figure 8: Sentence-level detection scores for natural and artificial (generated under perturbation) pathologies.

Figure 9 :
Figure 9: Examples of successful and failed sentence-level detection for translations from English to Spanish.

Table 1 :
Table 1: ROC AUC scores for detection of hallucinations and omissions with OT-based methods, averaged across translation directions. The asterisk (*) denotes scores based on attention maps with the EOS token excluded.