Detecting Hallucinated Content in Conditional Neural Sequence Generation

Neural sequence models can generate highly fluent sentences, but recent studies have shown that they are also prone to hallucinate additional content not supported by the input. This variety of fluent but wrong output is particularly problematic, as users will not be able to tell that they are being presented with incorrect content. To detect these errors, we propose a task to predict whether each token in the output sequence is hallucinated (not contained in the input) and collect new manually annotated evaluation sets for this task. We also introduce a method for learning to detect hallucinations using pretrained language models fine-tuned on synthetic data that includes automatically inserted hallucinations. Experiments on machine translation (MT) and abstractive summarization demonstrate that our proposed approach consistently outperforms strong baselines on all benchmark datasets. We further demonstrate how to use the token-level hallucination labels to define a fine-grained loss over the target sequence in low-resource MT and achieve significant improvements over strong baseline methods. We also apply our method to word-level quality estimation for MT and show its effectiveness in both supervised and unsupervised settings. Code and data are available at https://github.com/violet-zct/fairseq-detect-hallucination.


Introduction
Neural sequence models for tasks such as data-to-text generation (Puduppully et al., 2019), machine translation (MT; Vaswani et al., 2017; Wu et al., 2016) and text summarization (Rothe et al., 2020) can often generate fluent text that is sometimes preferred to human-written content (Läubli et al., 2018; Brown et al., 2020). However, they also often generate texts that lack global logical consistency (Marcus and Davis, 2020), are dull and repetitive (Welleck et al., 2019), or contain hallucinated content that is not entailed by the input (Maynez et al., 2020; Martindale et al., 2019). In this paper, we focus on the last of these problems, aiming to automatically identify and quantify content in the output that is not faithful to the input text.
* Most work was done during an internship at FAIR.
The risk of generating unfaithful content impedes the safe deployment of neural sequence generation models. The first step to building models that do not suffer from these failures is the assessment and identification of such hallucinated outputs. Prior work has shown that standard metrics used for text evaluation, such as BLEU scores (Papineni et al., 2002; Post, 2018), ROUGE (Lin and Hovy, 2004) and BERTScore (Zhang et al., 2019), do not correlate well with the faithfulness of model outputs (Maynez et al., 2020; Wang and Sennrich, 2020; Tian et al., 2019). They also require reference output text, limiting their applicability in a deployed system at run-time. Very recent efforts have started to develop automatic metrics to measure the faithfulness of output sequences using external semantic models, e.g. question-generation and question-answering systems (Wang et al., 2020a; Durmus et al., 2020) or textual entailment inference models (Maynez et al., 2020), to score faithfulness tailored for abstractive text summarization. However, these scores do not directly identify hallucinated tokens and only correlate weakly with human judgements.
We propose a new task for faithfulness assessment: hallucination detection at the token level, which aims to predict if each token in the machine output is hallucinated or faithful to the source input. This task does not use the reference output to assess faithfulness, which offers us the ability to also apply it at run-time. Similar in spirit to our proposed task, word-level quality estimation (Specia et al., 2018; Fonseca et al., 2019) in the MT community predicts if tokens are correctly translated based on human post-editing. However, these methods generally do not distinguish errors in terms of fluency and adequacy (Specia et al., 2011), with the exception of a subset of the WMT 2020 shared task on quality estimation (Specia et al., 2020), where different types and levels of severity of word-level errors are defined.
Our proposed task specifically focuses on hallucination errors, and we define these errors in a simpler way with only binary labels, which we argue makes them simpler to use and more conducive to labeling at large scale. The proposed hallucination detection method (described below) is also applicable to the word-level quality estimation task as demonstrated in §5.4.
We measure hallucination for two conditional sequence generation tasks: abstractive summarization and MT. For the former, we produce a benchmark dataset from recently released annotations (Maynez et al., 2020). For MT, we carefully design human assessment guidelines and create high-quality annotations, which will be released to aid future research. To learn token-level hallucination prediction for general conditional sequence generation tasks, we propose a novel method that creates synthetic "hallucinated" data and finetunes a pretrained language model (Liu et al., 2019; Conneau et al., 2020) on it. Without any human-annotated supervised training data, we achieve an average F1 of around 0.6 across all the benchmark datasets, setting initial performance levels for this new task.
Predicting hallucination labels at the token level provides a tool for diagnosing and interpreting model outputs, which allows us to flag potential risks when the model is applied to previously unseen inputs. Additionally, we show how to use these token-level hallucination labels in two case studies to improve self-training (Scudder, 1965) and learning from noisy mined bitext  in low-resource MT. In both cases, there can be noise in the target text, either produced by the self-training teacher or errors in the mining process. However, most outputs are only partially erroneous (see examples in Appendix E.3) and the rest of the output is still useful for training, as we show by introducing different token-level loss truncation schemes that use our proposed hallucination detection methods. Our best methods outperform strong baselines by a large margin, and reduce the number of hallucinations.

Token-level Hallucination Prediction
For source sequence $S$ and generated output sequence $G$, following Maynez et al. (2020) we define any span $g_i, \dots, g_{i+j}$ ($j \geq 0$) in $G$ as being "hallucinated" if it is not supported by the source input $S$. More specifically, we consider two types of hallucination, which are not mutually exclusive:
Extrinsic hallucinations: a span $g_i, \dots, g_{i+j}$ in $G$ consists of additional content without clear grounding in the input. In Fig. 1, the word "happily" in the machine translation belongs to this case, as there is no word in the input sentence that clearly corresponds to "happy".
Intrinsic hallucinations: a span of word(s) in $G$ contains incorrect information due to synthesizing content using information present in $S$. In Fig. 1, "Jerry" in the MT is a hallucinated word and should be replaced by "Mike". Note that multiword phrases can also be marked as intrinsic hallucinations, such as "this is a book" being hallucinated from "this is not a book", where "this is" is a minimal span corresponding to the hallucination.
The above definitions are for illustrative purposes; we do not explicitly label whether a hallucination is intrinsic or extrinsic, only whether one exists at all. Given these definitions, we aim to identify all the span(s) satisfying the above conditions in the machine generation $G$.
Human Assessment of Hallucinations. To facilitate the assessment of hallucinations in MT, we conduct human annotations on outputs of MT models in the patent and COVID-19 domains. Three bilingual annotators were presented with the source sentence, the reference sentence and the MT output, and they were asked to label each sentence with one of three labels: incomprehensible, faithful, or contains hallucinations. If the translation contains hallucinations, we asked the annotators to tag all the tokens that were not faithful to the source. The final benchmark datasets were created by taking majority labels among the three annotators. We present more details regarding annotation guidelines and pipelines in Appendix A.
We compute the Fleiss' kappa (Fleiss, 1971) (FK) scores of our annotations for MT and of the processed annotations from Maynez et al. (2020) on abstractive summarization (Tab. 5 in Appendix A). We achieved moderate agreement (FK≈0.56) on the token-level hallucination annotations and substantial agreement (FK≈0.67) on the sentence-level annotations, while Maynez et al. (2020) achieved substantial or almost perfect agreement (FK≈0.8) on the XSUM dataset. For MT, we conjecture that it is relatively hard to achieve consistent agreement among annotators for several reasons. First, although we wrote detailed annotation guidelines following the definition of hallucination in §2, it could still be difficult for annotators to distinguish between ungrammatical translations and hallucinations. Second, it was sometimes difficult for annotators to understand the specialized text in the patent domain.
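For readers unfamiliar with the agreement statistic, the following is a minimal sketch of the standard Fleiss' kappa computation. It assumes every item (here, a token or a sentence) is rated by the same number of annotators; the function name and the count-matrix input format are illustrative choices, not taken from the released code.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a count matrix.

    ratings[i][k] is the number of annotators who assigned item i to
    category k. Every row must sum to the same number of annotators.
    Undefined (division by zero) if all ratings fall in one category.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[k] for row in ratings) / (n_items * n_raters)
        for k in range(n_cats)
    ]
    # Observed agreement among annotator pairs, per item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Four tokens, three annotators, categories = (faithful, hallucinated).
counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(counts), 3))  # 0.333
```

For token-level agreement, each token is one item with two categories; for sentence-level agreement, each sentence is one item with three categories.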

Token-level Hallucination Detection
We propose a general-purpose method for token-level hallucination detection for conditional sequence generation tasks. Given the source input $S$, we first formulate the task of token-level hallucination detection as a sequence labeling problem where a binary label is predicted at each position $G_t$ of the machine generation $G$. One straightforward way of learning this task is to train a model with supervised data in the form of $((S, G), L_G)$, where $L_G$ are the labels at every position of $G$ that indicate whether each word is hallucinated or not. However, because such labeled training data is not readily available, we propose an approach to automatically create synthetic training data.

Synthetic Data Creation
We use bi-text from the training data to create synthetic examples by automatically inserting new, hallucinated target-side tokens. More specifically, we take a target sequence $T$ and create a hallucinated version of it, denoted $T'$, with associated hallucination labels for each token in $T'$. Then we can train a supervised model on this synthetic labeled dataset of $((S, T'), L_{T'})$. The key challenge is that $T'$ should be a fluent sentence that does not differ too much from $T$.
Figure 2: Generation of synthetic data with hallucination labels. A hallucinated version of $T$ is generated by feeding the noised sentence to the encoder-decoder model BART. Hallucination labels are assigned to each token by computing the edit distance between $T'$ and $T$. Labels of 1 refer to hallucinated words.
Generation of hallucinated sentences. To control this synthetic hallucination process, we build on a pre-trained denoising autoencoder, which maps a corrupted sentence back to the original text it was derived from, learning to reconstruct missing words that have been arbitrarily masked out. Specifically, we use the BART model, without providing it any access to the source sentence, thereby encouraging it to insert new content as needed to ensure fluency. As shown in Fig. 2, we first apply a noising function that removes words from the original target sentence $T$ and then use a pretrained BART to generate $T'$ conditioned on the noised $T$ with beam search.
[Figure 3: Example of label assignment with edit distance, using the sentences "Jerry goes to the bookstore his friend." and "Mike goes to the bookstore on Thursday."]
Label assignments. After obtaining the hallucinated sentence $T'$ with BART, we need to assign appropriate labels to each token in $T'$ to mark which words are hallucinated. We compute the edit distance between $T'$ and $T$, and back-trace the deletion and substitution operations with dynamic programming. All the positions in $T'$ involving these two operations are labeled as hallucinations and everything else is considered faithful to $T$. Fig. 3 shows an example of label assignment with edit distance, where words in red are replaced and words in blue are deleted to convert $T'$ to $T$. Assigning labels with edit distance cannot always guarantee correct labels, but we find that this simple approach provides sufficiently high-quality training data for effective hallucination detection in practice.
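The label assignment step can be sketched as follows. This is a minimal illustration that assumes whitespace-level tokens and unit edit costs; the function name and tie-breaking order in the back-trace are our own choices and may differ from the released implementation.

```python
def hallucination_labels(hyp, ref):
    """Label each token of `hyp` (the hallucinated sentence T') with 1 if
    converting `hyp` into `ref` (the original T) deletes or substitutes
    it, and 0 otherwise."""
    n, m = len(hyp), len(ref)
    # dist[i][j] = edit distance between hyp[:i] and ref[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            delete = dist[i - 1][j] + 1   # drop hyp[i-1]
            insert = dist[i][j - 1] + 1   # add ref[j-1]
            dist[i][j] = min(substitute, delete, insert)

    # Back-trace one optimal edit path from the bottom-right corner.
    labels = [0] * n
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and hyp[i - 1] == ref[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1            # match: token is faithful
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            labels[i - 1] = 1              # substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            labels[i - 1] = 1              # deletion from hyp
            i -= 1
        else:
            j -= 1                         # insertion: no hyp token involved
    return labels


hyp = "Jerry happily goes to the bookstore".split()
ref = "Mike goes to the bookstore".split()
print(hallucination_labels(hyp, ref))  # [1, 1, 0, 0, 0, 0]
```

Tokens of $T$ that are missing from $T'$ (insertions) produce no label, since labels are only defined over positions of $T'$.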

Finetuning on Synthetic Data
Hallucination prediction loss. We follow the common practice in natural language understanding (NLU) tasks and finetune a pretrained language model (LM) on our synthetic data. We finetune a cross-lingual LM (Conneau et al., 2020) for MT and a monolingual LM (Liu et al., 2019) for summarization. In both cases, we concatenate the input, true target and hallucinated target, denoted $(S, T, T')$, as a single input sequence to the model. Then we minimize the standard classification loss $\mathcal{L}_{\text{pred}}$ over the pseudo hallucination labels $L_{T'}$ on top of the final hidden vectors of each token in $T'$, as shown in Fig. 4.
Although using only the source text and hallucinated target $(S, T')$ as the input should be sufficient to learn to predict hallucinations, we can also easily measure the extent to which including the true target $T$ in the input could help the model. At test time, when evaluating the faithfulness of the machine outputs $G$, we do not use the true target $T$ and, perhaps surprisingly, find that our model can generalize well without references, even when they were present during training.
To prevent the model from overly relying on the true target $T$ and learning spurious correlations (e.g. the edit distance), we explored two techniques: (1) dropout: randomly drop out tokens in $T$ to force the dependence on the source input; (2) paraphrase: recall that at synthetic data generation time, we generate $T'$ from BART conditioned on the noised $T$. Instead, we can apply noise functions to a paraphrase of $T$. We create paraphrased targets via knowledge distillation (Kim and Rush, 2016), where we use the output from a pretrained Seq2Seq model conditioned on the source sentence in the bi-text corpus as the paraphrased target. Let $D$ denote the paraphrased sentence of $T$ and $D'$ denote the generation from BART conditioned on the noised $D$. Then we create pseudo labels of $D'$, denoted $L_{D'}$, by computing the edit distance between $D'$ and $D$, and use $((S, T, D'), L_{D'})$ as the training data for finetuning. Since the pseudo labels are created based on $D$, this prevents the model from easily learning the edit distance between $T$ and $D'$. We provide ablation studies in Appendix D.
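As a rough illustration of the dropout technique, the sketch below builds one training input from a source, a reference and a hallucinated target. It assumes that "dropping" a token means removing it from the sequence and that segments are joined with a separator symbol; both choices, along with the names `drop_reference_tokens`, `build_input` and the `</s>` separator, are assumptions made for this example rather than details confirmed by the paper.

```python
import random


def drop_reference_tokens(ref_tokens, rate, rng=None):
    """Remove each reference token independently with probability `rate`."""
    rng = rng or random.Random()
    return [tok for tok in ref_tokens if rng.random() >= rate]


def build_input(src, ref, hyp, rate, sep="</s>", rng=None):
    """Concatenate source, (partially dropped) reference and hallucinated
    target. Returns the token list and the index where `hyp` starts, so
    that the prediction loss can be restricted to the positions of `hyp`."""
    ref_kept = drop_reference_tokens(ref, rate, rng)
    tokens = src + [sep] + ref_kept + [sep] + hyp
    hyp_start = len(tokens) - len(hyp)
    return tokens, hyp_start


tokens, start = build_input(
    src=["s1", "s2"], ref=["Mike", "goes"], hyp=["Jerry", "goes"],
    rate=0.3, rng=random.Random(0),
)
print(tokens[start:])  # ['Jerry', 'goes']
```

At test time the reference segment would simply be left empty, matching the reference-free setting described above.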
Masked LM loss. We also add the masked language model (MLM) loss $\mathcal{L}_{\text{mlm}}$, following Devlin et al. (2019). To learn this loss, we create a different batch from the above by concatenating only the source $S$ and target $T$ as the input, since the hallucinated target $T'$ could provide erroneous information for predicting masked words in $T$. We find that this multi-task learning objective helps learn better representations of the input and further improves performance on predicting hallucination labels. The final loss is $\mathcal{L} = \mathcal{L}_{\text{pred}} + \alpha \cdot \mathcal{L}_{\text{mlm}}$, where $\alpha$ is a hyperparameter.

Evaluation Tasks and Data
We examine hallucination in abstractive text summarization and machine translation (MT) tasks, using the models and datasets described below.
Maynez et al. (2020) asked human annotators to label the spans in the machine-generated summaries if they were unfaithful to the article. We post-processed their human annotations by majority voting and created test datasets for each of the summarization systems.

MT
Previous work (Wang and Sennrich, 2020;Müller et al., 2019;Koehn and Knowles, 2017) has shown that translation models are particularly prone to hallucination when tested out of domain. We similarly focus on this regime and additionally consider the low resource case where a modest amount of out of domain data is available at training time.
Data. We use a multi-domain Chinese-English (Zh-En) translation dataset (Wang et al., 2020b) which consists of four balanced domains: law, news, patent and subtitles. We create a new training set $D_{\text{train}}$ from the law (1.46M sentences), news (1.54M) and subtitles (1.77M) training data, plus 870 parallel sentences randomly sampled from the patent training data. We train two NMT models, described below.
Figure 4: Finetuning XLM-Roberta (for cross-lingual generation tasks, e.g. MT) or Roberta (for monolingual generation tasks, e.g. text summarization) on the synthetic training data.
Models. Our data is generated from two models on which we will measure hallucination (see Appendix B for more details):
(1) TranS2S (Vaswani et al., 2017) is the standard Transformer Seq2Seq model with 6 encoder layers and 6 decoder layers.
(2) MBART (Liu et al., 2020) is a Seq2Seq denoising auto-encoder pretrained on large-scale monolingual corpora in many languages. We finetune the 12-layer model on $D_{\text{train}}$.

Experimental setup
Synthetic Data Generation. We use a pretrained 12-layer BART model in the fairseq toolkit for synthetic labeled data generation. For each sentence, we uniformly sample the percentage of tokens to mask, $p_m$, from $[0, h_m]$. We also uniformly sample the probability of replacing a token with a random token, $p_r$, from $[0, h_r]$. $p_m$ and $p_r$ are two important factors that affect the noise level when generating the synthetic data. For MT, we set $h_m$ and $h_r$ to 0.6 and 0.3 respectively. For abstractive summarization, we use 0.4 and 0.2. We use beam search for decoding from BART with a beam size of 4 and a length penalty of 3. For MT, we first create paraphrased target sentences $D$ through knowledge distillation (Kim and Rush, 2016) by using the outputs from the same trained TranS2S model on the source inputs.
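The sampling of noise levels can be sketched as below. This is only an illustration under stated assumptions: we mask a fraction $p_m$ of positions chosen uniformly at random and replace each remaining token with probability $p_r$; the mask symbol, the toy vocabulary and the function name `noise_sentence` are hypothetical, and the exact noising function used in the paper may differ.

```python
import random

MASK = "<mask>"  # assumed mask symbol


def noise_sentence(tokens, vocab, h_m=0.6, h_r=0.3, rng=None):
    """Return a noised copy of `tokens`.

    p_m ~ U[0, h_m] is the fraction of positions that are masked;
    p_r ~ U[0, h_r] is the per-token probability of random replacement.
    """
    rng = rng or random.Random()
    p_m = rng.uniform(0.0, h_m)
    p_r = rng.uniform(0.0, h_r)

    n_mask = round(p_m * len(tokens))
    masked = set(rng.sample(range(len(tokens)), n_mask))

    noised = []
    for i, tok in enumerate(tokens):
        if i in masked:
            noised.append(MASK)
        elif rng.random() < p_r:
            noised.append(rng.choice(vocab))  # random replacement
        else:
            noised.append(tok)
    return noised


sentence = "Mike goes to the bookstore on Thursday .".split()
toy_vocab = ["book", "friend", "happily", "Jerry"]
print(noise_sentence(sentence, toy_vocab, rng=random.Random(0)))
```

The noised sentence would then be passed to BART, which fills in the masked positions to produce $T'$.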
Hallucination Predictor. For MT, we finetune XLM-R (Conneau et al., 2020) on the synthetic dataset with a batch size of 128, and we annotated 50 examples (different from those in $D_{\text{eval}}$) from the patent test data as the validation dataset. For summarization, we finetune RoBERTa (Liu et al., 2019) with a batch size of 96 and stop training early at 10K update steps. In addition, we drop out tokens from the reference $T$ in the input with a rate of 0.5 and 0.3 for summarization and MT, respectively, to learn $\mathcal{L}_{\text{pred}}$. We set $\alpha$ to 0.6 for MT and 0.5 for summarization based on the scales of $\mathcal{L}_{\text{pred}}$ and $\mathcal{L}_{\text{mlm}}$. For both tasks, we set the mask probability used for $\mathcal{L}_{\text{mlm}}$ to 0.5, and the initial learning rate to $2 \times 10^{-5}$ with polynomial decay. We describe other hyperparameters, including the training of the MT models, in Appendices B and C.

Evaluation of hallucination prediction
In Tab. 1, we present the F1 of token-level hallucination labels across six benchmark datasets for MT and abstractive summarization (full results of precision, recall and F1 are presented in Tabs. 7 and 9 in the appendix). We compare with three baseline methods that we proposed for this new task: (1) The alignment-based method uses a word alignment model for hallucination assessment. We employ SimAlign (Sabet et al., 2020), an unsupervised aligner that extracts word alignments from similarity matrices induced from pretrained word embeddings. SimAlign is primarily designed for cross-lingual tasks, and we adapt it to summarization by using embeddings from the pretrained BERT-large (Devlin et al., 2019). We predict a target token as being hallucinated if it is not aligned to any source token.
(2) The overlap-based method is a heuristic that predicts a target token as being hallucinated if it does not appear in the source. Since it is not feasible to perform string matching between two languages for MT, we use a bilingual lexicon induction method (Zhou et al., 2019) to first translate each English word into a Chinese word and then check whether it appears in the source text.
(3) We go further by exploiting synonyms to assess hallucination in the summarization task, where we use WordNet (Miller, 1998) to find synonyms of nouns, verbs, adjectives and adverbs in the target summary and the source article; we predict a target token as being hallucinated if its synonyms cannot be found in the set of source synonyms.
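The overlap-based heuristic for the monolingual (summarization) case is simple enough to sketch directly. Lowercasing before matching is an assumption made for this example, and the function name is illustrative.

```python
def overlap_baseline(source_tokens, target_tokens):
    """Predict 1 (hallucinated) for every target token that does not
    appear anywhere in the source, and 0 otherwise."""
    source_vocab = {tok.lower() for tok in source_tokens}
    return [0 if tok.lower() in source_vocab else 1 for tok in target_tokens]


article = "Mike went to the bookstore on Thursday".split()
summary = "Jerry went to the bookstore".split()
print(overlap_baseline(article, summary))  # [1, 0, 0, 0, 0]
```

The synonym-based variant would extend `source_vocab` with synonyms of the source words before the lookup, and the MT variant would first map each target word through a bilingual lexicon.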
From Tab. 1, we note: (1) The proposed method achieves decent performance on this task and outperforms all baseline methods. However, the task is still far from being solved and is worthy of further study.
(2) Even though our model learns hallucination prediction with the reference $T$ during training (Sec. 3.2), by applying token dropout to $T$, our model generalizes well without being fed the reference at test time. As a contrast, we report the results of predicting with the reference at test time and observe that the model can achieve significantly higher recall but worse precision (Tab. 9 in the appendix).
(3) The two non-neural baselines we proposed work surprisingly well on the summarization datasets, especially the synonym-based system. We conjecture that this is because the information in the summaries should come from the source article and a majority of hallucinated words are nouns (§5.3), which can be easily detected by string matching or synonym matching. Our neural system performs better than these baseline methods but not significantly, and we hypothesize that this is because the RoBERTa model we finetune only allows a maximum input length of 512, which results in an average cutoff of 158 subwords from the source article and hence a loss of source information. By taking the union of the predictions from the synonym-based system and our model, we can obtain further improvements on the summarization datasets. We believe that advances in long sequence modeling (Beltagy et al., 2020; Kitaev et al., 2020) could help here, and are important to study in future work.
(4) At the same time, the baseline methods cannot obtain reasonable performance for MT, since cross-lingual semantic matching is more challenging, and our model shows significant improvements.
In Tab. 2, we show the percentage of annotated and model predicted hallucinated tokens across the six benchmark sets. We can see that model predictions correlate well with human assessment and have a Pearson correlation coefficient of 0.986.
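The two quantities involved here, the percentage of hallucinated tokens and the Pearson correlation between two lists of percentages, can be computed as in the following sketch. The function names are illustrative, and the numbers in the usage example are made-up placeholders, not the values from Tab. 2.

```python
import math


def hallucination_rate(labels):
    """Percentage of tokens labeled 1 (hallucinated)."""
    return 100.0 * sum(labels) / len(labels)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Placeholder percentages for three hypothetical benchmark sets.
annotated = [10.0, 20.0, 40.0]
predicted = [12.0, 19.0, 43.0]
print(pearson(annotated, predicted))
```

In the setting described above, each list would hold one percentage per benchmark set, i.e. six values each.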

Analysis
Analysis on Pretrained Models for Conditional Sequence Generation. Recent work (Maynez et al., 2020) has shown that pretrained models are better at generating faithful summaries as evaluated by humans. In Tab. 2, summaries generated from BERTS2S contain significantly fewer hallucinations than other model outputs. We also confirmed this trend in MT: translations from MBART contain less hallucinated content than those from TranS2S.
In Fig. 5, we present the percentage of hallucinated tokens categorized by their part-of-speech tags predicted by a POS tagger (Toutanova et al., 2003). First, we see that for both MT and summarization datasets, nouns are the most hallucinated words. In abstractive summarization, verbs also account for a certain number of hallucinations. Second, the hallucinated words predicted by our model match the gold annotations well in their distribution over POS tags. We also compare the percentage of hallucinations within each POS tag in Appendix E.2. In addition, we provide more ablation studies in Appendix D.

Evaluation on Word-level Quality Estimation
As noted in §1, our model is also readily applicable to word-level quality estimation (QE) for MT (Fonseca et al., 2019; Specia et al., 2020), which aims to detect word-level errors in MT output. In the WMT shared task on word-level QE, each token of the target sentence is labeled as OK/BAD based on the post-edited target sentences. We evaluate our model on the WMT18 en-de word-level QE shared task (Specia et al., 2018) in both the unsupervised and supervised settings. There are 13,442 labeled parallel sentences where the tagged target sentences are from an NMT model. In our supervised setting, we finetune the XLM-R model on these parallel sentences with the objective $\mathcal{L}_{\text{pred}} + 0.5 \cdot \mathcal{L}_{\text{mlm}}$.
In the unsupervised setting, we first create the synthetic data ( §3.1) using the post-edited target sentences from the labeled parallel set (13,442) and an additional 50K target sentences from the provided unlabeled parallel set. Then we finetune XLM-R on the created synthetic labeled data. For both settings, we set the weights of the cross-entropy loss for the bad-token labels to be 2.0 because the labels are imbalanced with fewer bad-token labels.
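The class weighting can be illustrated with a small sketch of a weighted binary cross-entropy. It assumes the loss is averaged over tokens (deep learning toolkits sometimes normalize by the sum of weights instead), and the function name and input format are chosen for this example only.

```python
import math


def weighted_cross_entropy(probs_bad, labels, bad_weight=2.0):
    """Token-averaged cross-entropy where BAD tokens (label 1) are
    up-weighted by `bad_weight`.

    probs_bad[t] is the predicted probability that token t is BAD.
    """
    total = 0.0
    for p, y in zip(probs_bad, labels):
        if y == 1:
            total += -bad_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(labels)


# A missed BAD token costs twice as much as a missed OK token.
print(weighted_cross_entropy([0.5, 0.5], [1, 0]))
```

Up-weighting the rarer BAD label in this way counteracts the tendency of the classifier to predict OK everywhere.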

Results
We present results in Tab. 3, where F1-Mult is the product of the F1 scores for the OK and BAD labels. Note that all the baseline models are in the supervised setup and the best baseline, OpenKiwi (Kepler et al., 2019), is a strong ensembled system using predictions from multiple models. In contrast, our supervised model only leverages the parallel labeled data without using other resources. Among all the supervised settings, our model outperforms the best system by 2 points in F1-Mult. To make it clear how our unsupervised model performs, we also show the best-performing systems in the WMT18 shared task. We observe that our unsupervised setting achieves decent performance and even outperforms the 3rd-ranked system. These results demonstrate that both the full version and the finetuning part of our method provide strong results for word-level QE.
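To make the metric concrete, the following sketch computes F1-Mult from gold and predicted tag sequences. The helper names are illustrative; the definition (product of the per-label F1 scores) follows the description above.

```python
def f1_for_label(gold, pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_mult(gold, pred, ok="OK", bad="BAD"):
    """Product of the F1 scores of the OK and BAD labels."""
    return f1_for_label(gold, pred, ok) * f1_for_label(gold, pred, bad)


gold = ["BAD", "OK", "OK", "BAD"]
pred = ["BAD", "OK", "BAD", "BAD"]
print(round(f1_mult(gold, pred), 3))  # 0.533
```

Because the two F1 scores are multiplied, a system that ignores the minority BAD label scores zero regardless of how well it does on OK tokens.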

Case Study I: Improving Self Training in Machine Translation
Predicting hallucination labels at the token level not only allows us to flag potential risks in generation models, but also opens up the possibility of providing fine-grained signals which can be used to define new learning objectives. In this section and the following one, we demonstrate how to leverage the hallucination labels to reduce adverse effects of noisy training instances. Specifically, we show that the fine-grained hallucination signals allow for improved semi-supervised learning (§6) and training with noisy parallel data (§7).

Rectified Self-Training for Neural MT
Self-training (Scudder, 1965) is an important semi-supervised approach that utilizes unlabeled source data to improve system performance. In a conditional sequence generation task, a teacher model is first trained with bitext $D_l = \{s_i, t_i\}_{i=1}^{N}$ and used to make predictions on each sequence in an unlabeled dataset $D_u = \{s_j\}_{j=N+1}^{N+M}$ to create pseudo parallel data $D_p = \{s_j, t_j\}_{j=N+1}^{N+M}$. The model is then trained on $D_l \cup D_p$. Prior work finds that with self-training the student model can benefit from such pseudo-parallel data. However, such results require a relatively high-quality teacher, and performance suffers in low-resource settings where no such teacher is available.
We propose to use our token-level hallucination predictions to define a fine-grained loss during training in MT, by penalizing errors less on tokens that are more likely to be hallucinated. This is in contrast to previous data filtering methods for MT, which remove entire sentence pairs (Junczys-Dowmunt, 2018; Kang and Hashimoto, 2020).
First, we predict the token-level hallucination labels on the target side of the pseudo parallel data D p . Then we propose two simple methods of using these labels in self-training: (1) We discard the losses of tokens that are predicted as hallucinations and compute the loss on the remaining tokens for each target sequence (token loss truncation).
(2) Instead of adjusting losses, we mask the decoder hidden states of those hallucinated positions after the target-to-source cross attention in each decoder layer (decoder HS masking).
Table 4: BLEU(↑), BLEURT(↑) and hallucinated tokens (Hal, ↓) on the CWMT2017 test set. We compare with noised self-training and sequence-level loss truncation in the second and third blocks respectively.
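Token loss truncation, the first of the two methods above, can be sketched as follows. The sketch assumes that per-token negative log-likelihoods have already been computed by the translation model and that the loss is averaged over the kept tokens; the normalization and the function name are assumptions made for illustration.

```python
def truncated_token_loss(token_nll, hallucination_labels):
    """Average the per-token losses over tokens predicted as faithful
    (label 0), discarding the losses of tokens predicted as
    hallucinated (label 1)."""
    kept = [
        nll for nll, label in zip(token_nll, hallucination_labels)
        if label == 0
    ]
    if not kept:
        # Every token was flagged: the sequence contributes no loss.
        return 0.0
    return sum(kept) / len(kept)


nll = [2.0, 0.5, 1.5, 4.0]   # per-token negative log-likelihoods
labels = [0, 0, 0, 1]        # last token predicted as hallucinated
print(truncated_token_loss(nll, labels))
```

Unlike sentence-level filtering, the faithful part of a partially hallucinated pseudo-target still contributes a training signal.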

Experimental Setup and Results
Experimental Setup. To train a teacher model (baseline in Tab. 4), we use the same training data described in §4.2, using patent (870) as the low-resource domain. We evaluate on the full patent test set (1,500) from CWMT2017 (Wang et al., 2020b). For the unlabeled data, we use the withheld Chinese patent training data (2.9M).
Baselines. We compare with a state-of-the-art noised self-training (ST) method, which injects two types of noise into the input sentences: (1) paraphrase noise created by round-trip translations, and (2) random noise from dropping, masking and shuffling input tokens. We also compare with the recently proposed loss truncation method (Kang and Hashimoto, 2020) that adaptively removes entire examples with high log loss, which was shown to reduce hallucinations.
Footnote 5: We also tried removing hallucinated target words before training. This underperformed, likely because it produces too many ungrammatical target sentences.

Results and Analysis
We present the tokenized BLEU score (Papineni et al., 2002), BLEURT score (Sellam et al., 2020) and the percentage of hallucinated tokens predicted by our system in Tab. 4. We can see that ST improves over the baseline by around 3 BLEU and our best result further improves ST by 1.7 BLEU. Compared with strong baseline methods, our method not only achieves the best translation quality measured by BLEU and BLEURT but also the largest hallucination reduction. We also observe that: (1) Our method with ST alone can outperform the other baseline methods; when combined with perturbed ST (noise), using fine-grained control over the target tokens can further improve the results.
(2) ST with paraphrase noise (by round-trip translation) does not perform as well as the random noise, which further confirms that the noisy outputs from a teacher model may hurt the student model. (3) The sequence-level loss truncation approach can improve over the vanilla ST and reduce the level of hallucinations as measured by our system. However, the performance drops when combined with the noised ST.

Case Study II: Improving Corpus Filtering for Low-Resource MT
High-quality parallel data is critical for training effective neural MT systems, but acquiring it can be expensive and time-consuming. Many systems instead use mined and filtered parallel data to train NMT models (Junczys-Dowmunt, 2018; Zhang et al., 2020). Nonetheless, the selected parallel data can still be noisy, containing misaligned segments. In this section, we demonstrate that token-level hallucination labels allow us to make better use of noisy data and improve the overall translation quality. We apply the token loss truncation method proposed in §6 to the filtered parallel data and evaluate it on the WMT2019 low-resource parallel corpus filtering shared task.
Figure 6: The BLEU scores of the best submission (FB system) in the WMT19 shared task on parallel noisy corpus filtering and our method (w/ loss trunc) on the Ne-En and Si-En flores test sets.
from the web. Participants were asked to score each sentence pair in the noisy parallel set. Scores were used to subsample sentence pairs amounting to 1 million and 5 million English words, which were used to train an MT system that was evaluated on the test set using SacreBLEU (Post, 2018). In addition, the shared task also provides additional clean parallel data for Nepali-English (564K), Sinhala-English (646K) and Hindi-English (1.6M), but they can not be used for training the final NMT system. First, we train a token-level hallucination prediction system with the combined parallel data from all the three language pairs (as Hindi is related to Nepali). Second, we use the scores (Chaudhary et al., 2019) that achieve the best overall performance for both language pairs among all the submissions to select the top-scored 1M, 2M, 3M, 4M, 5M, and 10M data (in English tokens) and predict the token-level hallucination labels on the target side. We follow the same setup and use the script provided by the shared task to train the NMT model with the selected subsets. During training, we discard losses of tokens that are predicted as hallucinations and only compute the losses for the remaining tokens. We use the validation and test data from the flores dataset  during training and evaluation. Fig. 6, we present the BLEU of the best submission (FB system) and our method on the Ne-En and Si-En test sets of the flores dataset. First, with token-level loss truncation, our model achieves the new best results on the flores test set in this shared task for both Ne-En (7.4) and Si-En (8.11). Second, for both language pairs our method further improves the state-of-theart system when varying the training data sizes. Notably, in the extreme case of 10M training data, which is very noisy, the baseline can not obtain decent BLEU scores for Si-En while our method still achieves reasonable performance (0.14 vs. 5.18). 
However, for Ne-En, increasing the data size beyond 2M causes the performance of both the baseline and our method to drop significantly, possibly because the dataset contains many pairs of misaligned sentences (the source is not Nepali and the target is not English).
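The training-time loss truncation used in this section can be illustrated with a minimal sketch in plain Python. It is not the fairseq implementation: the function name `truncated_token_loss`, the example log-probabilities, and the choice to average over the kept tokens are assumptions made only for illustration.

```python
from typing import Sequence


def truncated_token_loss(
    token_log_probs: Sequence[float],
    hallucinated: Sequence[int],
) -> float:
    """Average negative log-likelihood over tokens not flagged as hallucinated.

    token_log_probs: log p(y_t | y_<t, x) for each target token.
    hallucinated: 1 if the detector flags the token, 0 otherwise.
    Normalizing by the number of kept tokens is an assumption of this sketch.
    """
    if len(token_log_probs) != len(hallucinated):
        raise ValueError("one label is required per target token")
    # Keep only the losses of tokens predicted to be faithful.
    kept = [-lp for lp, h in zip(token_log_probs, hallucinated) if h == 0]
    if not kept:  # every token was flagged: the sentence contributes no loss
        return 0.0
    return sum(kept) / len(kept)


# Hypothetical example: the third target token is predicted to be hallucinated,
# so its large loss does not contribute to the training signal.
log_probs = [-0.5, -1.0, -4.0, -0.5]
labels = [0, 0, 1, 0]
loss = truncated_token_loss(log_probs, labels)
```

In a real training loop the same masking would be applied to the per-token loss tensor before reduction.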

Conclusions
This work proposed a new task of token-level hallucination detection, created human-annotated benchmark datasets, proposed a method for unsupervised learning of hallucination detectors, and showed that the models can be used to define fine-grained losses that improve MT training. We demonstrated the remarkable performance of the proposed hallucination detection method in several downstream tasks, including word-level quality estimation and noisy neural machine translation. In the future, we hope to create a large-scale pretrained hallucination detector for any dataset or model, and to extend our method to data-to-text generation scenarios. We are also interested in investigating how to leverage our detection methods to mitigate hallucination problems in conditional sequence generation.
Annotation Guidelines and Process We conducted a pilot study and practice sessions with annotators before annotating the final blind test set D_eval. The pilot study was performed on a different evaluation set, and we analyzed its results. We then held a training session with the evaluators to make sure that they fully understood and could follow the guidelines. We find that it is important to define a clear workflow for annotators to execute. In the final evaluation, we ask each annotator to read the tokens in the sentence carefully and check whether they are supported by the source sentence, in the following order: (1) If there are tokens (or the entire sentence) that cannot be supported by the source, label all such span(s) with color and mark the sentence as hallucinated; (2) If the annotator cannot understand the entire translation, mark the sentence as incomprehensible; (3) If all the tokens in the translation are entailed by the source, mark the sentence as faithful.
We shuffled the order of sentences so that annotators did not know which translation model was used (TranS2S or MBART). In addition, we drew up the following guidelines to help annotators identify hallucinated spans and distinguish bad translations from hallucinated ones: (1) If a machine generation contains hallucinations, we ask annotators to mark minimal spans of words as hallucinations, such that deleting these spans or replacing them with other words removes the hallucination (i.e., makes the generation faithful to the source input). For example, if T = "John likes Mary, but Mary does not like John." and G = "John likes Mary, and Mary likes John.", then "and" and "likes" in the latter part of G should be marked as hallucinations.
(2) We ask annotators not to consider the domain of sentences when marking hallucinations. For example, if S = "今天我的胸部非常痛。" (Chinese), T = "My chest hurts badly today." and G = "My breast hurt badly today.", both the reference T and the MT output G are valid translations of the source sentence, because the word "胸部" in the source is polysemous. Without considering the domain that the sentences come from, the generation is faithful. (3) We ask annotators not to be "harsh": for example, if a capitalized word in the reference is lowercased in the translation, we ask them not to mark it as a hallucination, under the rule that hallucinations should be judged only by the meaning of words and whether they are faithful to the source, not by their surface form.
Note that annotations are performed on the raw sentences, i.e. punctuation marks can also be labeled as hallucinations along with the span and we did not apply special treatments to them. At test time, the model outputs are compared against the raw form of sentences, and model predictions on subwords are converted to labels on the raw sentences. Besides, based on our guidelines, the annotated span of hallucination words may also contain prepositions and other stop words.
Post-processing: We dropped all the translations that were labeled as incomprehensible (15 for TranS2S and 3 for MBART). To aggregate annotations from the three annotators, we assign a label to each token by majority voting, i.e. the label that two or more annotators agree on. We also aggregate the evaluation data from Maynez et al. (2020) in the same manner to produce our own test set for abstractive text summarization.
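The majority-voting aggregation described above can be sketched as follows. The function name `majority_vote` and the example annotations are hypothetical; with three annotators and binary labels a strict majority always exists, so no tie-breaking rule is needed.

```python
from collections import Counter
from typing import List


def majority_vote(annotations: List[List[int]]) -> List[int]:
    """Aggregate per-token binary labels from several annotators.

    annotations[a][t] is annotator a's label for token t
    (1 = hallucinated, 0 = faithful).
    """
    if not annotations:
        return []
    length = len(annotations[0])
    if any(len(a) != length for a in annotations):
        raise ValueError("all annotators must label the same tokens")
    aggregated = []
    for token_labels in zip(*annotations):
        # Pick the label chosen by the largest number of annotators.
        label, _count = Counter(token_labels).most_common(1)[0]
        aggregated.append(label)
    return aggregated


# Hypothetical labels from three annotators for a four-token output.
votes = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
token_labels = majority_vote(votes)
```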

B Training of NMT models
Tokenization For TranS2S, we first segment the Chinese corpus with a Chinese word segmentation tool (Luo et al., 2019), then learn separate BPE vocabularies with 32k merge operations (Sennrich et al., 2016) over the source (Zh) and the tokenized target (En) corpora. For MBART, we directly apply the SentencePiece dictionary contained in the finetuned model to the raw Chinese and English corpora.
Model We use the Transformer implementation from fairseq. Following the notation used in fairseq, we use a base Transformer model for TranS2S and a large Transformer model for MBART.
Training and Decoding For TranS2S, we apply the standard hyperparameters reported in the fairseq example. We use the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98, ε = 1e-8. The learning rate follows an inverse square root schedule with a maximum learning rate of 0.0005 and 4,000 warmup steps. We set label smoothing to 0.1, apply dropout of 0.1, and select the best model by validation BLEU score. We train the model on 8 GPUs for 300,000 updates with an effective batch size of around 64,000 tokens. When finetuning MBART, we use a learning rate of 3e-5 with polynomial decay scheduling and 3,000 warmup updates. The effective batch size is 16,384. Dropout is set to 0.3, the attention dropout rate to 0.1, and label smoothing to 0.2. We finetune MBART for 60,000 updates. We decode outputs with beam search and a beam size of 5.
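The two learning rate schedules mentioned above can be written out as a short sketch. The peak learning rates, warmup lengths, and the 60,000 total updates are taken from the text; the initial warmup learning rate of 0, the final learning rate of 0, and the decay power of 1.0 are assumptions of this sketch and are not stated in the paper.

```python
def inverse_sqrt_lr(
    step: int,
    peak_lr: float = 5e-4,
    warmup: int = 4000,
    init_lr: float = 0.0,  # assumed starting value for the warmup
) -> float:
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup:
        return init_lr + (peak_lr - init_lr) * step / warmup
    return peak_lr * (warmup / step) ** 0.5


def polynomial_decay_lr(
    step: int,
    peak_lr: float = 3e-5,
    warmup: int = 3000,
    total_steps: int = 60000,
    end_lr: float = 0.0,  # assumed final learning rate
    power: float = 1.0,   # assumed: power 1.0 gives linear decay
) -> float:
    """Linear warmup to peak_lr, then polynomial decay to end_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    if step >= total_steps:
        return end_lr
    remaining = 1.0 - (step - warmup) / (total_steps - warmup)
    return (peak_lr - end_lr) * remaining ** power + end_lr


# Learning rates at a few illustrative update counts.
trans2s_lrs = [inverse_sqrt_lr(s) for s in (2000, 4000, 16000)]
mbart_lrs = [polynomial_decay_lr(s) for s in (1500, 3000, 31500, 60000)]
```

Under these assumptions, the TranS2S schedule peaks at update 4,000 and halves by update 16,000, while the MBART schedule decays linearly to zero at update 60,000.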

C Experimental Details for Token-level Hallucination Prediction
Subword Tokenization Depending on the pretrained model (Roberta / XLM-Roberta) that we finetune, we apply the corresponding subword segmentation to the synthetic dataset (S, T, T′) and calculate the edit distance between T and T′ at the subword level. At evaluation time, the model predicts a hallucination label for each subword in the sentence; we then predict a word to be hallucinated if any of its subwords is predicted to be hallucinated.
Table 6: Performance on the TranS2S benchmark from MT and summarization by using different data as the input to the noised function N(·). "raw" refers to the original targets in the training data.
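The rule for converting subword predictions to word labels, as described in the Subword Tokenization paragraph, can be sketched in a tokenizer-agnostic way. The function name `word_labels_from_subwords` and the `word_ids` alignment (the index of the word each subword belongs to) are assumptions of this sketch; how that alignment is obtained depends on the tokenizer.

```python
from typing import List


def word_labels_from_subwords(
    subword_labels: List[int],
    word_ids: List[int],
) -> List[int]:
    """A word is hallucinated if any of its subwords is predicted hallucinated.

    subword_labels[i] is the predicted label of subword i (1 = hallucinated).
    word_ids[i] is the index of the word that subword i belongs to.
    """
    if len(subword_labels) != len(word_ids):
        raise ValueError("one word index is required per subword")
    n_words = max(word_ids) + 1 if word_ids else 0
    word_labels = [0] * n_words
    for label, wid in zip(subword_labels, word_ids):
        if label == 1:
            word_labels[wid] = 1
    return word_labels


# Hypothetical segmentation: "John likes Mary" -> ["John", "lik", "es", "Mary"].
# Only the subword "es" is flagged, so the whole word "likes" is flagged.
predicted = [0, 0, 1, 0]
alignment = [0, 1, 1, 2]
word_level = word_labels_from_subwords(predicted, alignment)
```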
Synthetic data generation The noise functions in the BART implementation have several hyperparameters. The main noise functions are (1) random masking, (2) random replacement, and (3) random insertion of masks. We found that random masking and random replacement are the two key factors affecting the generated sentences, and we provide their settings in the main paper. We apply a random mask-insertion rate of 0.2 in all settings. In addition, the noise functions are applied to words instead of spans in our setting.
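A word-level version of the three noise functions can be sketched as below. This is an illustrative approximation, not the BART or fairseq code: the function name `add_noise`, the `<mask>` symbol, the example vocabulary, the masking and replacement probabilities, and the interpretation of the 0.2 insertion rate as an independent per-position probability are all assumptions.

```python
import random
from typing import List, Sequence

MASK = "<mask>"  # assumed mask symbol


def add_noise(
    words: Sequence[str],
    vocab: Sequence[str],
    mask_prob: float,
    replace_prob: float,
    insert_prob: float = 0.2,
    seed: int = 0,
) -> List[str]:
    """Apply random masking, random replacement and mask insertion to words."""
    rng = random.Random(seed)
    noised: List[str] = []
    for word in words:
        r = rng.random()
        if r < mask_prob:
            noised.append(MASK)               # (1) random masking
        elif r < mask_prob + replace_prob:
            noised.append(rng.choice(vocab))  # (2) random replacement
        else:
            noised.append(word)
        if rng.random() < insert_prob:
            noised.append(MASK)               # (3) random insertion of a mask
    return noised


# Hypothetical settings; the paper reports its own masking and replacement rates.
sentence = "John likes Mary but Mary does not like John".split()
toy_vocab = ["cat", "runs", "blue", "quickly"]
noised_sentence = add_noise(sentence, toy_vocab, mask_prob=0.3, replace_prob=0.1)
```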
Finetuning For MT, we finetune the large XLM-Roberta model (Conneau et al., 2020) released in fairseq. For summarization, we finetune the large Roberta model on the synthetic data, truncating articles that exceed 512 tokens (the maximum allowed by Roberta) to 512 tokens. For both models, we use the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98, ε = 1e-6 and weight decay of 0.1. We set the masking probability to 0.35 for the L_mlm loss. The dropout and attention dropout rates are set to 0.1. We adopt polynomial decay for learning rate scheduling with a learning rate of 2e-5.

D Ablation Studies
Effects of including reference at training time Recall that we concatenate the source, reference and machine generation together as the input when learning hallucination predictions (Sec. 3.2). In Fig. 7, we vary the dropout rate of tokens in the reference at training time and evaluate the models on the outputs from the TranS2S model for both tasks, where a dropout rate of 1.0 indicates that we do not include the reference at all. First, different dropout rates do not significantly affect performance for MT. This is likely because we use the paraphrased target, instead of the reference sentences, when creating the synthetic data. Thus, the "hallucinated" sentences D′ from BART do not resemble the reference T as closely as T′, and the model will not learn spurious correlations between T and D′.
Second, for summarization, applying word dropout is crucial, since we use the reference more directly for generating synthetic data. On the other hand, if the reference is removed at training time (dropout = 1.0), the resulting model performs poorly, which shows that including the reference at training time also has positive effects.
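The reference dropout studied in this ablation can be sketched as follows. The function name `build_detector_input`, the `</s>` separator, and the choice to delete dropped reference tokens (rather than replace them with a mask symbol) are assumptions of this sketch; the text only states that reference tokens are dropped at a given rate and that a rate of 1.0 removes the reference entirely.

```python
import random
from typing import List


def build_detector_input(
    source: List[str],
    reference: List[str],
    generation: List[str],
    ref_dropout: float,
    rng: random.Random,
    sep: str = "</s>",  # assumed separator token
) -> List[str]:
    """Concatenate source, (partially dropped) reference and generation."""
    # Each reference token is kept with probability 1 - ref_dropout.
    kept_ref = [tok for tok in reference if rng.random() >= ref_dropout]
    return source + [sep] + kept_ref + [sep] + generation


rng = random.Random(0)
src = ["今天", "我", "的", "胸部", "非常", "痛"]
ref = ["My", "chest", "hurts", "badly", "today"]
gen = ["My", "breast", "hurt", "badly", "today"]

# Dropout of 1.0 removes the reference; dropout of 0.0 keeps it in full.
no_reference = build_detector_input(src, ref, gen, 1.0, rng)
full_reference = build_detector_input(src, ref, gen, 0.0, rng)
```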
Effects of paraphrased data We investigate the effects of using paraphrased data in Tab. 6, where we apply the noise functions to different forms of targets when generating synthetic data. For MT, we create paraphrased targets via knowledge distillation (Kim and Rush, 2016), using the output of TranS2S conditioned on the source sentence in the bi-text corpus as the paraphrased target. We can see that using distillation data for synthetic data generation yields better results than using the references. However, note that a proper word dropout rate must be chosen when using the reference-based synthetic data, as discussed above. For abstractive summarization, we create paraphrased data from an abstractive and an extractive summarization system, respectively. We finetune BART on the bi-text of XSUM and create distillation data from this finetuned abstractive model. For the extractive system, we use the recently proposed MatchSum (Zhong et al., 2020) as the distillation model. We see a significant drop in performance for both variants. This is likely because (1) abstractive summarization systems have been shown to be prone to hallucinating content themselves (Maynez et al., 2020), so we cannot create reliable pseudo labels based on the generated summaries, and (2) the extractive system generates summaries from the input article that diverge from the actual abstractive summaries we evaluate on, and the model cannot generalize well under such a data shift.
Table 9: Triplets represent (Precision, Recall, F1 (x100)) of hallucination labels on the abstractive summarization task (XSUM dataset). The first block contains baseline methods and the second block contains our results. We highlight the best results without using the reference.

E.1 Full Results of Token-level Hallucination Predictions
We found that the synonym-based and string-matching-based methods are strong and effective baselines for the monolingual (summarization) token-level hallucination prediction task, and a viable alternative to neural methods. However, previous work on hallucination assessment (Maynez et al., 2020; Wang et al., 2020a; Durmus et al., 2020) did not study synonym-based non-neural baselines when measuring the faithfulness of summarization model outputs.
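A string-matching baseline and the token-level precision, recall and F1 computation can be sketched as below. This is a minimal sketch, assuming the baseline flags any output token that does not occur in the source after lowercasing; the function names and the toy sentences are hypothetical, and the synonym-based variant is not shown.

```python
from typing import List, Tuple


def string_match_labels(source_tokens: List[str], output_tokens: List[str]) -> List[int]:
    """Label an output token 1 (hallucinated) if it never appears in the source."""
    source_vocab = {tok.lower() for tok in source_tokens}
    return [0 if tok.lower() in source_vocab else 1 for tok in output_tokens]


def precision_recall_f1(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 (x100) of the hallucination label (class 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 100 * precision, 100 * recall, 100 * f1


# Toy example with hypothetical gold labels.
source = "the cat sat on the mat".split()
output = "the black cat sat".split()
predicted_labels = string_match_labels(source, output)
gold_labels = [0, 1, 0, 1]
scores = precision_recall_f1(gold_labels, predicted_labels)
```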

E.2 Analysis of Part-of-Speech Tags and Within-Group Hallucination Percentage
We showed the macro part-of-speech (POS) tag distribution of hallucinated tokens in §5.3. In this section, we analyze the micro-percentage of hallucination labels within each POS tag. We show the gold annotations as well as our model predictions of hallucinated words within each POS tag.
For summarization, we also show the results from the string-matching baseline. From Fig. 8, we can see that for MT, nouns are the most likely to be hallucinated, while for summarization, cardinal numbers (e.g. one, two) are the most likely to be hallucinated.
Our model predictions also align well with the gold annotations in terms of the percentage of hallucinated words within each POS tag.
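The within-group percentage analyzed in this section can be computed as in the following sketch, which counts, for each POS tag, the share of tokens with that tag that are labeled as hallucinated. The function name and the toy tags and labels are hypothetical; the same function can be applied to gold annotations or to model predictions.

```python
from collections import defaultdict
from typing import Dict, List


def hallucination_rate_by_pos(pos_tags: List[str], labels: List[int]) -> Dict[str, float]:
    """Percentage of tokens labeled hallucinated within each POS tag."""
    if len(pos_tags) != len(labels):
        raise ValueError("one label is required per tagged token")
    total: Dict[str, int] = defaultdict(int)
    hallucinated: Dict[str, int] = defaultdict(int)
    for tag, label in zip(pos_tags, labels):
        total[tag] += 1
        hallucinated[tag] += label
    return {tag: 100.0 * hallucinated[tag] / total[tag] for tag in total}


# Toy example: two nouns (one hallucinated), one cardinal number, one verb.
tags = ["NN", "NN", "CD", "VB"]
token_labels = [1, 0, 1, 0]
rates = hallucination_rate_by_pos(tags, token_labels)
```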

E.3 Examples of Partially Hallucinated Outputs from Teacher MT Model
In Tab. 8, we randomly select some examples for which we present the source sentences from the patent monolingual Chinese dataset, the corresponding reference English sentences and the generations from a teacher model trained on the training data described in §4.2 where patent is a low-resource domain. We can see that in these examples, only parts of the model outputs are hallucinated and the rest of the outputs are good translations that are faithful to the source. Through our approach in §6, we can still make use of these good parts of translation during training.

E.4 Examples of Hallucination Predictions on the MT test set
As shown in Tab. 10, our model performs well in general but can be inaccurate when the translations contain spelling errors. We also find some cases where the annotation is wrong but our model predicts correctly.