Detecting and Mitigating Hallucinations in Machine Translation: Model Internal Workings Alone Do Well, Sentence Similarity Even Better

While the problem of hallucinations in neural machine translation has long been recognized, progress on alleviating it has so far been limited. Indeed, it recently turned out that, without artificially encouraging models to hallucinate, previously existing methods fall short, and even the standard sequence log-probability is more informative. This means that the internal characteristics of the model can give much more information than we expect, and before using external models and measures, we first need to ask: how far can we go if we use nothing but the translation model itself? We propose to use a method that evaluates the percentage of the source contribution to a generated translation. Intuitively, hallucinations are translations "detached" from the source, hence they can be identified by low source contribution. This method improves detection accuracy for the most severe hallucinations by a factor of 2 and is able to alleviate hallucinations at test time on par with the previous best approach that relies on external models. Next, moving beyond internal model characteristics and allowing external tools, we show that using sentence similarity from cross-lingual embeddings further improves these results. We release the code of our experiments.


Introduction
Hallucinations in machine translation (MT) are cases when the model generates output that is partially or fully unrelated to the source sentence. While this phenomenon is generally not frequent and has a relatively low impact on corpus-level automatic metrics, its impact on user experience can be rather dramatic. For example, if a machine translation system generates The staff were very friendly and helpful in response to an input sentence about, e.g., a marvelous view from the window, a user is unlikely to trust this system in the future.
While the problem of hallucinations is known and important, addressing it is very challenging.
Firstly, hallucinations are very rare. This is why previous work mostly resorted to settings where models are encouraged to hallucinate, e.g. by artificially perturbing the source sentence (Lee et al., 2019; Raunak et al., 2021), adding specific types of noise to the training data (Raunak et al., 2021), or working under domain shift (Wang and Sennrich, 2020; Müller et al., 2020), among others (Zhou et al., 2021). Secondly, hallucinations are hard to identify with automatic metrics. For the most part, hallucinations were defined as translations with low quality according to some metric, such as adjusted BLEU or chrF (Lee et al., 2019; Raunak et al., 2021; Müller and Sennrich, 2021), or as translations satisfying some heuristic condition (Berard et al., 2019; Raunak et al., 2021). Overall, it was not clear whether the proposed methods indeed detect hallucinations and, if so, whether they transfer to more natural settings.
Recently, when revisiting previous work in a relatively clean setting, Guerreiro et al. (2022) found that existing methods fall short and that the standard sequence log-probability is the most informative. To show this, the authors gathered a large dataset with professional annotations of translations that, according to 10 previously proposed methods, are likely to be hallucinations. This data (hallucinations along with the model that generated them) made it possible, first, to evaluate the performance of various detection methods and, second, to work on alleviating hallucinations at test time. For the latter, the idea is "detect-then-rewrite": after flagging a translation as likely to be pathological, generate several alternative hypotheses and pick the best one relying on some measure. So far, the best realization of this general framework uses sequence log-probability (Seq-Logprob) for detection, Monte Carlo dropout (Gal and Ghahramani, 2016) to generate several alternative translation hypotheses, and COMET-QE to pick the final candidate (see Guerreiro et al. (2022) for more details). In this work, we use the same test bed and substantially improve the previous results.
Regarding hallucination detection, we interpret the observation that Seq-Logprob outperforms previous methods (specifically targeted at hallucinations) as follows: internal model characteristics may contain much more information than we expect. Therefore, before developing or using external models and measures, we ask: how far can we go if we use nothing but the translation model itself? We propose to use a method that evaluates the percentage of the source contribution to a generated translation. Intuitively, since hallucinations are translations that are "detached" from the source (by definition), low source contribution should be able to identify them. Although understanding hallucinations was one of the motivations behind the first method evaluating relative source and target contributions (Voita et al., 2021), both existing methods only looked at highly artificial hallucinations (Voita et al., 2021; Ferrando et al., 2022). We propose to use ALTI+ by Ferrando et al. (2022), a method that aggregates layer-wise token attributions, for both hallucination detection and reranking in the "detect-then-rewrite" framework. For detecting the most severe hallucinations, it is twice as accurate as sequence log-probability. For reranking, it performs on par with the previously best COMET-QE. All in all, we show that we can improve the overall pipeline results by relying solely on internal model characteristics.
When allowing external tools, previous work mostly focused on different ways to automatically evaluate the quality of a translation, either with string-based methods or with neural quality estimation systems. This idea (the better we estimate translation quality, the better we are at detecting hallucinations) is natural: hallucinations are low-quality translations in the first place. However, implementing this idea in practice is challenging: even a state-of-the-art quality estimation system fails substantially (Guerreiro et al., 2022). We hypothesize that instead of targeting quality evaluation, it might be beneficial to use models trained with rather different objectives. Indeed, as we show, the similarity between the source and a translation estimated via cross-lingual sentence embeddings outperforms the best internal method. Apart from cross-lingual sentence similarity (which is expected to be sensitive to highly incorrect translations), we find that cross-lingual natural language inference models (less anticipated in the context of machine translation) also perform quite well. To the best of our knowledge, we are the first to apply these models to hallucination detection.
Overall, we show that by using only the model's inner workings, we can:
• detect the most severe type of hallucinations with twice better precision;
• overwrite hallucinations at test time with results on par with the best previous method that relies on an external model.
Furthermore, models focused on the semantic similarity of sentences can detect all types of hallucinations with 80% better precision than previous methods.

Background and Setting
In this section, we describe the framework and data we use for evaluation of hallucination detection and mitigation methods. This framework was proposed by Guerreiro et al. (2022) and consists of a large dataset of annotated translations along with the model that produced them. To the best of our knowledge, this is the only released data that can be used to analyze hallucinations in a "clean" setting.

Model
The model is Transformer base (Vaswani et al., 2017) from fairseq (Ott et al., 2019) with the standard hyperparameter settings. It was trained on the WMT'18 German-English news translation data excluding Paracrawl (Bojar et al., 2018), totalling 5.8M sentence pairs. Since Guerreiro et al. (2022) used a randomly chosen 1/3 of the dataset as a held-out set for analysis, the model was trained on the remaining 2/3 of the dataset. We use the model released by Guerreiro et al. (2022): this is the model that generated the hallucinations we analyze.

Hallucination Dataset
The hallucination dataset released by Guerreiro et al. (2022) contains fine-grained manual annotations of 3415 German-to-English translations generated by the model above. These translations are chosen from a set of 1.8M translations of held-out data as the ones that are likely to be pathological. The criteria used to flag the translations include 10 methods ranging from previously proposed heuristics (Lee et al., 2019; Berard et al., 2019; Raunak et al., 2021) to quality estimation models (Rei et al., 2020b) and uncertainty detectors (Fomicheva et al., 2020; Zerva et al., 2021; Guerreiro et al., 2022). The taxonomy of translation pathologies in the dataset is shown in Figure 1. Here, hallucinations are defined as severe translation errors that are detached from the source. These can be either oscillatory (i.e. containing erroneous repetitions of words and phrases) or largely fluent. The latter are further split by the severity of the error into fully detached (the whole content is not supported by the source) and strongly, but not fully, detached (a significant proportion of the output is not supported by the source). Besides hallucinations, the annotated data also contains translation errors that are deemed not detached from the source (see Figure 1). Overall, 323 examples are judged to be hallucinations, 1044 to be less severe translation errors, and the rest to be correct translations.
Note that so far, there is no "canonical" hallucination taxonomy, and previous work used various, mostly overlapping, definitions (Lee et al., 2019; Raunak et al., 2021; Zhou et al., 2021; Ji et al., 2022; Raunak et al., 2022; Guerreiro et al., 2022). We follow the taxonomy of Guerreiro et al. (2022) for two reasons. Firstly, for consistency with the dataset and the evaluation framework we use. Secondly, this taxonomy is rather general, considers some hallucination types overlooked in previous work (e.g. strongly detached hallucinations), and was shown to be reasonable: under these definitions, the properties of hallucinations differ from those of translation errors (Guerreiro et al., 2022).

Hallucination Detection Methods
Generally, methods for handling hallucinations can be either internal, i.e. using only information coming from the translation model itself, or external, i.e. using auxiliary models. In addition to these, we also consider "oracles" relying on the reference translation. Note that these cannot be used in preventive settings where references are not available; here, we use them only for analysis.

Reference-Based Oracles
Following previous work (Müller and Sennrich, 2021; Guerreiro et al., 2022), we use:
• chrF: character n-gram F-score of the translation with respect to the reference. We use the CHRF++ version that also takes into account word unigrams and bigrams (Popović, 2017);
• COMET: a neural reference-based metric by Rei et al. (2020a), which was shown to be the state of the art among reference-based methods (Kocmi et al., 2021).

Internal Measures
Baseline: Seq-Logprob. This is the standard length-normalized sequence log-probability. Compared to previously introduced methods specifically targeting hallucinations, this simple metric performs the best (Guerreiro et al., 2022).
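As a minimal illustration (toy numbers, not the paper's implementation), the length-normalized sequence log-probability is just the mean of the per-token log-probabilities assigned by the model:

```python
import math

def seq_logprob(token_logprobs):
    """Length-normalized sequence log-probability:
    the mean of per-token log-probabilities under the model."""
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token log-probabilities for two translations:
confident = [math.log(0.9)] * 5      # model is confident
suspicious = [math.log(0.2)] * 5     # low-probability, hallucination-like
assert seq_logprob(confident) > seq_logprob(suspicious)
```

Lower scores flag translations the model itself was uncertain about, which is why this simple baseline works as a detector.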
We use ALTI: percentage of source contribution. As mentioned above, we hypothesize that the percentage of the source impact on a generated translation may be a strong signal for identifying hallucinations. To evaluate this relative source contribution, we use the recently introduced ALTI+ (Ferrando et al., 2022). At a high level, it decomposes each Transformer block into a sum of functions of individual tokens and views an output representation as a summation of transformed input vectors; it then evaluates the contribution of these vectors to the resulting sum. Among other observations, ALTI+ (as well as an earlier LRP-based method by Voita et al. (2021)) was used to show that for artificially created hallucinations, the source influence is much lower than for "healthy" correct translations. Our work is the first to test this intuition in a real setting where hallucinations are generated naturally. Formally, for a model and its generated translation, we compute the total source contribution as the sum of the contributions of all source tokens. We do this for each target token individually and then average across target tokens. The scores are computed by the same model that produced the translations (Section 2.1).
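The aggregation step described above can be sketched as follows; the per-token contribution matrix is assumed to be given (computing it requires the actual ALTI+ decomposition, which we do not reproduce here):

```python
def total_source_contribution(contributions):
    """Aggregate ALTI-style token attributions into one detection score.

    `contributions[t][s]` is the (hypothetical, precomputed) contribution
    of source token s to target token t; rows exclude the target-prefix
    contributions, so the source part of each row can lie anywhere in [0, 1].
    Returns the per-target-token source contribution averaged over target tokens.
    """
    per_token = [sum(row) for row in contributions]  # sum over source tokens
    return sum(per_token) / len(per_token)           # average over target tokens

# A "healthy" translation: source tokens dominate each prediction step.
healthy = [[0.4, 0.3], [0.35, 0.35]]
# A hallucination-like case: the model mostly ignores the source.
detached = [[0.05, 0.05], [0.1, 0.02]]
assert total_source_contribution(healthy) > total_source_contribution(detached)
```

Because the score is averaged over all generated tokens, it is most sensitive to translations where most of the output is detached from the source.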

External models
Baseline: COMET-QE. As a reference-free model, we use the state-of-the-art COMET-QE (Rei et al., 2020b), given its superior performance compared to other quality estimators (Mathur et al., 2020; Freitag et al., 2021; Kocmi et al., 2021).
We use: sentence similarity. Overall, we consider three measures based on pretrained models that evaluate the semantic similarity of two sentences:
• LASER: cosine similarity of the source and translation sentence embeddings from LASER2 (Heffernan et al., 2022). LASER2 improves the encoder-decoder LASER (Artetxe and Schwenk, 2019) by replacing the LSTM encoder with a Transformer and using teacher-student training;
• LaBSE: cosine similarity of the source and translation sentence embeddings from LaBSE (Feng et al., 2022). LaBSE is a dual-encoder approach based on pretrained Transformers, fine-tuned for translation ranking with an additive margin softmax loss;
• XNLI: the product of the entailment probabilities of the source to the translation and of the translation to the source. We compute entailment scores with XLM-RoBERTa (Conneau et al., 2020) fine-tuned on a combination of NLI data in 15 languages (Conneau et al., 2018).
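As a rough sketch of how these detectors score a translation (toy vectors and probabilities; real scores would come from LASER2/LaBSE embeddings and an NLI model):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sentence embeddings
    (e.g. from LASER2 or LaBSE; the vectors here are toy values)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def xnli_score(p_src_to_mt, p_mt_to_src):
    """XNLI-based detector: the product of the entailment probabilities
    in both directions, as produced by an NLI model."""
    return p_src_to_mt * p_mt_to_src

src_emb, good_mt_emb, halluc_emb = [1.0, 0.1], [0.9, 0.2], [0.0, 1.0]
assert cosine_similarity(src_emb, good_mt_emb) > cosine_similarity(src_emb, halluc_emb)
# One-directional entailment alone is not enough: both directions must hold.
assert xnli_score(0.95, 0.9) > xnli_score(0.95, 0.05)
```

The bidirectional product in `xnli_score` penalizes both hallucinated content (translation not entailed by the source) and omitted content (source not entailed by the translation).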

Main results
Overall results are shown in Table 1. We report ROC AUC and precision at 90% recall (PR@R90). This differs from Guerreiro et al. (2022), who compare recall at thresholds cutting off a specific percentage of the dataset; instead, we rely on two metrics: ROC AUC, which does not depend on a specific threshold, and PR@R90, which covers a specific percentage of the hallucinations (in this case, 90%) and then reports the resulting precision. We prefer ALTI+ by Ferrando et al. (2022) over the LRP-based method by Voita et al. (2021) because the latter is more computationally expensive; for XNLI, we use the model available at https://huggingface.co/joeddav/xlm-roberta-large-xnli.

We show metrics for all hallucinations and for fully detached hallucinations separately, because the latter are the most disastrous translation mistakes and are potentially the easiest to detect. First, let us look at internal methods. We see that while ALTI performs comparably to Seq-Logprob for all hallucinations, for fully detached hallucinations it has twice the precision of Seq-Logprob. A possible root of this discrepancy is that ALTI averages the source contribution over all generated tokens; therefore, it is most effective at detecting translations where all or most of the generated tokens are hallucinated. Note also that for fully detached hallucinations, the internal ALTI performs almost on par with the best external methods.
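The precision-at-fixed-recall metric can be computed as in the following sketch (a toy illustration, not the paper's evaluation code, assuming lower detector scores indicate hallucinations):

```python
def precision_at_recall(scores, labels, recall=0.9):
    """Precision at a fixed recall (PR@R90 for recall=0.9).

    `scores`: detector scores where LOWER means more hallucination-like
    (e.g. Seq-Logprob or total source contribution).
    `labels`: 1 for annotated hallucinations, 0 otherwise.
    We flag translations starting from the lowest scores until the
    required fraction of hallucinations is covered, then report precision.
    """
    total_pos = sum(labels)
    needed = recall * total_pos
    ranked = sorted(zip(scores, labels))  # ascending: most suspicious first
    covered = 0
    for flagged, (_, label) in enumerate(ranked, start=1):
        covered += label
        if covered >= needed:
            return covered / flagged
    return covered / len(ranked)

scores = [0.1, 0.2, 0.3, 0.4, 0.5]   # toy detector scores
labels = [1,   1,   0,   1,   0]     # toy hallucination annotations
assert precision_at_recall(scores, labels, recall=0.9) == 0.75
```

In the toy example, covering 90% of the 3 hallucinations requires flagging the 4 lowest-scored translations, 3 of which are hallucinations, giving a precision of 0.75.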
Among external methods, LaBSE and XNLI substantially outperform the previous best detector: for both all and fully detached hallucinations, their precision at 90% recall is roughly twice that of Seq-Logprob. While such good performance might be expected for LaBSE, which evaluates cross-lingual sentence similarity (in a way, this can be seen as a measure of translation quality), the results for XNLI are rather surprising: to the best of our knowledge, models optimized for XNLI have not been used in the context of machine translation before. This suggests that looking at a broader class of models and training objectives might be beneficial.
Note also the large difference between LaBSE and LASER: while the former shows big improvements compared to Seq-Logprob, the latter noticeably lags behind. This is not surprising given the training objectives of the underlying models. In LASER2, the cross-lingual part of the objective uses cosine similarity between sentence encodings. Differently, LaBSE is trained on a translation ranking task and thus more explicitly encourages ordering translations by the severity of their errors.
To further understand differences between detectors, we look at the distributions of the detection scores in Section 4.2 and the detected pathology types in Section 4.3.

Analysing Distributions of the Scores
For each of the methods, Figure 2 shows distributions of the scores for hallucinations, less severe translation errors and correct translations.
Internal methods: errors are bimodal. ALTI and Seq-Logprob show similar behavior: the scores for errors have a bimodal distribution. At a high level, to the model, some errors "look" more like hallucinations, and some more like correct translations. This observation can motivate future work: it would be interesting to understand which types of translation errors behave one way or the other.
COMETs: blind to error severity. We see that COMET and COMET-QE scores do not separate hallucinations from less severe errors. This agrees with previous work noting that since quality estimation models are mostly trained on data that lacks negative examples, COMETs may be inadequate at evaluating poor translations in general and hallucinations in particular (Guerreiro et al., 2022). Also as expected, compared to the reference-free COMET-QE, the overlap between the scores for correct and incorrect translations is much lower for the reference-based COMET. ChrF exhibits behavior similar to COMET.
LaBSE: ranks error severity best. LaBSE is the only detector with a clear order between hallucinations, errors, and correct translations. As mentioned before, this is expected because only LaBSE is trained to rank translations. Interestingly, for LASER, the modes of the three distributions are also ordered; unfortunately, the distributions themselves overlap significantly, which makes LASER a poor hallucination detector.
XNLI: no middle ground. Finally, we see that the XNLI distributions are very peaky and concentrated around 0 and 1. This is expected: XNLI's decision is essentially binary. While this provides good separation between hallucinations and correct translations, it makes it hard to estimate the severity of an error.

Detected Pathology Types
Now we come to more fine-grained categories and look at the detected pathology types. For each method, we say that a translation is "detected" if it is contained in the fraction (e.g. 10%) of the hallucination dataset corresponding to the lowest scores. Then we look at:
• the pathology types among the flagged translations (Figure 3);
• recall for different translation types with respect to the whole dataset (Figure 4).
Note that in the original dataset, pathology types were annotated in a multilabel manner (e.g. the same translation could be annotated both as an oscillatory hallucination and as a named entity error). For Figure 3, we assign a single label to each translation by choosing the most severe pathology type (with severity increasing from "Correct translations" to "Fully detached").
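The severity-based label collapsing can be sketched as follows (the exact ordering of the intermediate categories is our assumption for illustration):

```python
# Severity order used to collapse multilabel annotations into one label,
# from least to most severe (the intermediate ordering is hypothetical).
SEVERITY = [
    "Correct translations",
    "Other errors",
    "Undertranslation",
    "Oscillatory hallucination",
    "Strongly detached",
    "Fully detached",
]

def most_severe(labels):
    """Pick the single most severe pathology among multilabel annotations."""
    return max(labels, key=SEVERITY.index)

assert most_severe(["Oscillatory hallucination", "Other errors"]) == "Oscillatory hallucination"
```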
The three best methods are similar. Figure 3 shows that ALTI, LaBSE, and XNLI select similar pathology types. For these methods, the flagged examples consist mostly of fully detached and strongly detached hallucinations, along with other errors.
LASER is an outlier. LASER behaves differently: instead of focusing on pathological translations, it more often flags correct ones. This explains its poor detection performance mentioned before.
XNLI flags undertranslations. Figure 4 shows that XNLI (and, to a lesser extent, LaBSE) flags a large proportion of undertranslations. This makes sense: these criteria are symmetric, and if we swap the source and the undertranslated output, the longer source can be seen as a hallucination.
Fully detached are the easiest to detect. As expected, fully detached hallucinations are the easiest to detect: all methods detect them entirely when taking 20% of the hallucination dataset (Figure 4), and they are the most frequent pathology type among the examples flagged by the best performing methods (Figure 3). Overall, our findings confirm the conclusions of Guerreiro et al. (2022) that oscillatory and strongly detached hallucinations are more difficult to detect; the improvements from our methods mostly come from these types of hallucinations. (In Figure 4, the types are presented in a multilabel manner, i.e. one translation may contribute to multiple axes.)

Mitigating Hallucinations at Test Time
Finally, let us come to the second part of the "detect-then-rewrite" pipeline: after flagging a translation as likely to be hallucinated, we want to generate several alternative translations and select one of them based on some criterion, i.e. by reranking the hypotheses (Guerreiro et al., 2022). This general framework has two degrees of freedom: (i) the generation of hypotheses and (ii) the reranking approach. We show that:
• for generating hypotheses, simply applying MC dropout (as done in Guerreiro et al. (2022)) outperforms more involved methods such as diverse beam search (Section 5.2);
• for reranking, we can match COMET-QE with the internal ALTI and further reduce the hallucination rate by using LaBSE (Section 5.3).
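The overall pipeline can be summarized in the following sketch, where all the components (detector, hypothesis generator, reranker) are hypothetical stand-ins for the real models:

```python
def detect_then_rewrite(source, translation, detector, threshold,
                        generate_hypotheses, reranker):
    """Sketch of the "detect-then-rewrite" pipeline.

    All callables are placeholders: `detector` scores a translation
    (lower = more hallucination-like), `generate_hypotheses` would run
    e.g. beam search with MC dropout, and `reranker` scores candidates
    (e.g. LaBSE similarity, ALTI source contribution, or COMET-QE).
    """
    if detector(source, translation) >= threshold:
        return translation  # not flagged: keep the original translation
    hypotheses = generate_hypotheses(source)
    return max(hypotheses, key=lambda hyp: reranker(source, hyp))

# Toy stand-ins for the real components:
detector = lambda src, mt: 0.1 if mt == "bad" else 0.9
reranker = lambda src, hyp: len(set(src.split()) & set(hyp.split()))
gen = lambda src: ["nice view window", "staff friendly"]
assert detect_then_rewrite("a nice view", "bad", detector, 0.5, gen, reranker) == "nice view window"
```

Note that rewriting is only triggered for flagged translations, so the extra generation cost is paid only on the (rare) suspicious inputs.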

Evaluation methodology
In this section, we explain the setup for the experiments with automatic evaluation in Sections 5.2 and 5.3. The setup for manual annotation is explained later in Section 5.3.2.
Metrics. In our experiments, we use several metrics. First, we use quality evaluation metrics commonly used by the community, i.e. COMET (Rei et al., 2020a) and BLEU. Additionally, we use the two best metrics for hallucination detection: LaBSE and XNLI. We show some of the metrics in the main text and the rest in the appendix.
Data. First, we analyze the impact of our method on translations of different quality levels. For this, we randomly sample 150 sentences from each of the following groups of the hallucination dataset (Section 2.2): fully detached hallucinations, strongly detached hallucinations, all other translation pathologies, and correct translations. We apply all versions of the hallucination mitigation algorithm to these 600 sentences.
Note that in a practical application, we would apply the mitigation techniques only to the translations labeled by a detection algorithm as potential hallucinations. We simulate this later in Section 5.3.2 when performing manual annotation.

Generation Strategies
To generate alternative hypotheses, Guerreiro et al. (2022) use Monte Carlo dropout (Gal and Ghahramani, 2016). This means they leave the standard beam search inference intact and achieve variability in translations by activating model dropout at inference time. A natural question is whether other generation strategies can give better results. For example, if we use a beam search specifically designed to produce diverse translations, can we get better hypotheses?
To test this, we use the following methods:
• DEFAULT: standard decoding without reranking, i.e. beam search with size 5, where we pick only the top 1 candidate;
• BEAM SEARCH: beam search with size n;
• sampling from the predicted distribution: SAMPLING (from the whole distribution) and SAMPLING P=80, i.e. nucleus sampling from the top p = 80% of the distribution (Holtzman et al., 2020);
• diverse beam search;
• MC dropout variants of the above, e.g. MC BEAM: beam search with MC dropout.
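For instance, nucleus (top-p) sampling restricts sampling to the smallest set of tokens whose cumulative probability reaches p. A minimal sketch (not the fairseq implementation):

```python
import random

def nucleus_sample(probs, p=0.8, rng=random):
    """Nucleus (top-p) sampling sketch: sample from the smallest set of
    tokens whose cumulative probability reaches p, renormalized."""
    items = sorted(probs.items(), key=lambda kv: -kv[1])
    nucleus, total = [], 0.0
    for token, prob in items:
        nucleus.append((token, prob))
        total += prob
        if total >= p:
            break
    # Sample proportionally to probability within the nucleus.
    r = rng.random() * total
    for token, prob in nucleus:
        r -= prob
        if r <= 0:
            return token
    return nucleus[-1][0]

# With a dominant token and p=0.8, the nucleus contains only "the":
probs = {"the": 0.85, "a": 0.1, "cat": 0.05}
assert all(nucleus_sample(probs, p=0.8) == "the" for _ in range(20))
```

Truncating the low-probability tail trades off diversity against the risk of sampling degenerate continuations.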

The Impact of Generation Strategy
The results are shown in Figure 5. To disentangle the effect of generation strategy from the subsequent reranker performance, we show the results for all combinations. As rerankers, we considered COMET-QE used in Guerreiro et al. (2022) and the methods proposed in Section 3.
We see that the MC BEAM method (beam search with MC dropout) clearly outperforms all the others. This is interesting for two reasons. First, MC dropout is easy to use: one simply runs standard inference with dropout enabled, without any other changes to the implementation. Second, unlike with modified decoding strategies, here the variability in hypotheses comes from the model's predictive uncertainty (Gal and Ghahramani, 2016; Zerva et al., 2021; Guerreiro et al., 2022). This is further evidence that understanding a model's inner characteristics can be beneficial in various settings.
Based on these results, in what follows we generate hypotheses with beam search with MC dropout.

The Impact of Number of Hypotheses
We also check whether generating more than 10 hypotheses can improve the overall results. Figure 6 shows the final COMET scores depending on the number of hypotheses. We see that the scores increase with more hypotheses and do not saturate at 10. This implies that in cases when the quality of a translation is much more important than its computational cost, one can potentially improve the quality by generating more candidate hypotheses.

Reranking Approaches
Apart from detecting hallucinations, the methods we propose can be applied as rerankers in the "detect-then-rewrite" pipeline. Figure 5 shows that, regardless of the generation method, LaBSE is the best reranker, performing notably better than the strong COMET-QE baseline. Apart from the average results, Table 2 also shows COMET scores for each pathology type. We can see that reranking with any method is better than no reranking for all groups of original translations. Compared to the COMET-QE baseline, LaBSE improves the scores for hallucinations and correct translations, but reduces quality for other pathologies.

Automatic Evaluation
ALTI, the only internal method, performs better than COMET-QE for fully detached hallucinations, but is inferior for other translation types. This means that ALTI is very sensitive to the most severe pathology, but is not capable of ranking relatively good translations.

Manual annotation
Data. To confirm the results of the automatic evaluation, we perform manual annotation. For each of the methods, we annotate their translations of the same 200 source sentences. These sentences were randomly sampled from the hallucination dataset with the distribution of pathologies roughly mimicking the outputs of the best detectors (Figure 3), with 20% sampled as correct translations. We compare the original translations and three reranking methods: the baseline COMET-QE used in Guerreiro et al. (2022), the best overall reranker LaBSE, and the only internal method, ALTI.
Annotation. For each source sentence, the four translations were deduplicated and shuffled to avoid annotator bias. The resulting sentence pairs were given to 3 annotators, who labeled them with three categories: Correct, Error, and Hallucination. The labels were aggregated by majority vote; in case of ties (20 out of the 602 sentence pairs left after deduplication), we pessimistically assumed a hallucination. Further details on the annotation guidelines and inter-annotator agreement are reported in Appendix B. We evaluate the statistical significance of the pairwise differences in the proportions of correct and hallucinated translations using a two-sided Student's t-test for two related samples at the 5% significance level.

Results. Annotation results are shown in Figure 7. All reranking methods reduce the hallucination rate by a factor of 2.5-3. Interestingly, when looking at hallucinations, the internal ALTI performs on par with COMET-QE: the differences between these two methods are not statistically significant. COMET-QE, however, produces fewer errors; this is expected, as it was trained to distinguish correct translations from errors. Coming to LaBSE, we find that it produces slightly fewer hallucinations than the other reranking methods and more correct translations than ALTI; these differences are significant at the 5% level. (For hallucinations, all pairwise differences are significant except ALTI vs. COMET-QE; for correct translations, the difference between LaBSE and ALTI is statistically significant.) Overall, by using sentence similarity from LaBSE, we improve both hallucination detection and hallucination mitigation at test time.
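The label aggregation described above (majority vote with pessimistic tie-breaking) can be sketched as:

```python
from collections import Counter

def aggregate(labels):
    """Majority vote over annotator labels; ties are resolved
    pessimistically as 'Hallucination'."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "Hallucination"  # tie: assume the worst
    return counts[0][0]

assert aggregate(["Correct", "Correct", "Error"]) == "Correct"
assert aggregate(["Correct", "Error", "Hallucination"]) == "Hallucination"  # three-way tie
```

With three annotators and three categories, the only possible tie is a three-way split, which under this rule always counts as a hallucination.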
Note that since COMET-QE is the state-of-the-art quality estimator, it is a very strong baseline for the reranking stage, where the goal is to find a better translation. The fact that we can match COMET-QE's reduction of the hallucination rate by analyzing the model's internal workings is valuable from different perspectives. For research, it can motivate future work on model understanding; for practitioners, it means that hallucination mitigation methods are not limited to language pairs for which external models such as COMET-QE exist: model understanding might be enough.

Conclusions
We started this work by asking how far we can go in detecting and mitigating hallucinations if we use nothing but the translation model itself. It turns out that we can improve the results of the overall "detect-then-rewrite" pipeline by evaluating the percentage of the source contribution to a generated translation: translations with low source contribution are likely to be "detached" from the source, i.e. hallucinations. For detecting the most severe type of hallucinations, this method doubles the precision of previous results; for mitigating hallucinations at test time, it matches the hallucination reduction rate of the previous method based on COMET-QE. We believe this can motivate future work on model understanding and help practitioners: using nothing but the model's inner workings makes it possible to mitigate hallucinations for language pairs where models such as COMET-QE are not available.
When allowing external models, we propose to expand the methods for handling hallucinations from models specialized for quality estimation to a broader set of objectives, e.g. sentence similarity from cross-lingual embeddings. Apart from showing that LaBSE improves previous results significantly, we also find that models so far overlooked in the context of machine translation (such as natural language inference models) can be beneficial. We hope future work will build on this idea.

B Manual Evaluation
In this appendix, we describe the manual evaluation. First, we detail the simple guidelines that were presented to the annotators. Second, we report the number of annotators and the inter-annotator agreement. Third, we report the results of statistical significance tests comparing all the methods.
Guidelines. Annotators were provided with the guidelines shown in Table 4. For reporting purposes, "Partial hallucination" was grouped together with "Full hallucination", and "Undertranslation" with "Other".
Inter-annotator agreement. We evaluated inter-annotator agreement with Fleiss' kappa. For the three annotators and the three aggregated labels, it equals 0.57 on the 602 labeled sentence pairs (with the 5 original labels, it is 0.55). This may be interpreted as moderate agreement.
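For reference, Fleiss' kappa can be computed from a per-item category-count matrix as follows (a generic implementation, not tied to our annotation data):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-annotator agreement.

    `ratings[i][j]` = number of annotators who assigned item i to
    category j; every item must be rated by the same number of annotators.
    """
    n = len(ratings)         # number of items
    r = sum(ratings[0])      # raters per item
    k = len(ratings[0])      # number of categories
    # Mean per-item agreement P_bar and chance agreement P_e.
    p_bar = sum(
        (sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings
    ) / n
    p_j = [sum(row[j] for row in ratings) / (n * r) for j in range(k)]
    p_e = sum(x * x for x in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement on two items in different categories gives kappa = 1.
assert abs(fleiss_kappa([[3, 0, 0], [0, 3, 0]]) - 1.0) < 1e-9
```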

The differences. Tables 5 and 6 compare the proportions of correct and hallucinated translations for each of the manually evaluated methods. The p-values are computed with a paired two-sided Student's t-test (scipy.stats.ttest_rel).
Table 4: Annotation guidelines presented to the annotators.

Each row of the data consists of the German source sentence, its reference English translation (which is not always accurate!), and 1 to 4 machine translation outputs. The machine translation outputs are presented in a random order, to exclude the possibility of bias toward any specific method. For each of the machine translations, you need to assign one of the following labels:
• OK: an acceptable translation; it conveys the main meaning correctly and does not introduce extra meaning. Some details may still differ, and minor errors are acceptable.
• Partial hallucination: a part of the translation is unrelated to the source, or is related very indirectly, such as via a common topic.
• Full hallucination: most or all of the translation is unrelated to the source, or is related very indirectly.
• Undertranslation: there are no hallucinations, but a significant part of the source is not translated at all.
• Other: there are no hallucinations or undertranslations, but there are other translation errors that make the translation unacceptable.

Table 6: Comparison between manually annotated rates of hallucinated translations.