Detecting and Mitigating Hallucinations in Multilingual Summarisation

Hallucinations pose a significant challenge to the reliability of neural models for abstractive summarisation. While automatically generated summaries may be fluent, they often lack faithfulness to the original document. This issue becomes even more pronounced in low-resource settings, such as cross-lingual transfer. With existing faithfulness metrics focusing on English, even measuring the extent of this phenomenon in cross-lingual settings is difficult. To address this, we first develop a novel metric, mFACT, which evaluates the faithfulness of non-English summaries by leveraging translation-based transfer from multiple English faithfulness metrics. We then propose a simple but effective method to reduce hallucinations in cross-lingual transfer, which weighs the loss of each training example by its faithfulness score. Through extensive experiments in multiple languages, we demonstrate that mFACT is the metric best suited to detect hallucinations. Moreover, we find that our proposed loss-weighting method drastically increases both performance and faithfulness according to both automatic and human evaluation when compared to strong baselines for cross-lingual transfer such as MAD-X. Our code and dataset are available at https://github.com/yfqiu-nlp/mfact-summ.

In addition, current summarisation models, open-source or proprietary, struggle in low-resource settings (Parida and Motlicek, 2019; Hasan et al., 2021; Bai et al., 2021; Urlana et al., 2023), where the target language is under-represented (e.g., Vietnamese and Urdu). Fortunately, cross-lingual transfer methods (Pfeiffer et al., 2020b; Xue et al., 2021; Hu et al., 2020) leverage task-specific knowledge learned from a resource-rich source language to summarise documents in many resource-poor target languages, either in a zero-shot fashion or with only a few annotated examples. Nevertheless, it remains unclear to what extent cross-lingual summarisation suffers from the problem of hallucination, compared to monolingual systems where English is the only language.
The main challenge in addressing this question is that most faithfulness evaluation metrics are available only for English and do not support low-resource languages. Hence, our first contribution (Section 2) is a model-based metric (mFACT) that measures the factual consistency of multilingual conditional generation, obtained from four diverse English faithfulness metrics (Goyal and Durrett, 2021; Fabbri et al., 2022; Cao et al., 2022a) via 'translate train' knowledge transfer (Artetxe et al., 2020). As illustrated in Figure 1, we use existing faithfulness metrics to label the English document-summary pairs as positive (i.e., faithful) or negative (i.e., hallucinated) and translate them into each target language. We then train a classifier in each target language to predict the faithfulness scores of the translated document-summary pairs. We verify the reliability of mFACT on the translated test set and, most importantly, with human evaluation. Both confirm the effectiveness of mFACT in capturing hallucinations in target languages.
Equipped with this new metric, we conduct extensive cross-lingual transfer experiments on XL-Sum (Hasan et al., 2021) for abstractive summarisation in six typologically diverse languages: Chinese, Spanish, French, Hindi, Turkish and Vietnamese. We find that state-of-the-art cross-lingual transfer methods increase summarisation performance in the target languages, but also introduce more hallucinations compared to English monolingual models in comparable experimental settings, thus further exacerbating this tendency (Section 6).
We also employ the mFACT metric to assess the faithfulness of some recently released multilingual large language models (LLMs), including Phoenix, BLOOMZ, and Vicuna (Chen et al., 2023; Muennighoff et al., 2022; Chiang et al., 2023; Le Scao et al., 2022). We show that LLMs that use multilingual data for pre-training or conversational fine-tuning fail to ensure faithful summarisation in various languages, producing more hallucinations in low-resource ones.
To overcome this limitation and promote faithful summarisation in multiple languages, we adapt a series of existing methods for reducing hallucinations originally devised for monolingual summarisation (Section 3.2). In addition, we introduce a novel, simple but effective method (Section 3.3): we weigh the loss of each training example according to its faithfulness score. We evaluate our loss-weighting method with automated metrics and human judgements and observe significant gains in both summarisation performance and faithfulness over a series of strong baselines (Section 8). In a nutshell, our main contributions are the following:
• We propose mFACT, a multilingual faithfulness metric developed from four English faithfulness metrics. This enables detecting hallucinated summaries in languages other than English.
• To the best of our knowledge, we are the first to study hallucination in a cross-lingual transfer setting. We show that state-of-the-art methods like MAD-X (Pfeiffer et al., 2020b) can improve the performance of low-resource summarisation, but also amplify hallucinations.
• We apply mFACT to study the faithfulness of recent multilingual large language models in summarisation. We observe that despite their scale, these models still struggle to reduce hallucinations in languages other than English.
• We propose a novel method to enhance faithfulness and performance in cross-lingual transfer for summarisation, which consists of weighting each training sample's loss by its faithfulness score. Both automatic and human evaluations validate the superiority of our method over existing baselines.

mFACT: A Multilingual Metric for Faithfulness
In this section, we describe how to transfer existing English faithfulness metrics into any target language, given the availability of a machine translation model.

Translation-based Transfer for Faithfulness Metrics
One straightforward way to obtain a faithfulness metric in a target language is to implement it from scratch, following the design of monolingual English metrics. However, these often rely on data annotated with auxiliary language-specific tools. For instance, Dependency Arc Entailment (DAE; Goyal and Durrett 2021) requires an external dependency parser to label fine-grained hallucinated segments. This is impractical due to the lack of annotated data and auxiliary tools in most languages. Another strategy relies on "translate test" knowledge transfer (Artetxe et al., 2020), where test documents and their corresponding generated summaries are translated from the target language into English, so that English metrics can measure their faithfulness. However, this introduces noise from translation and is costly at inference time, which makes it unsuitable for model development. For instance, model selection is commonly based on early stopping according to validation faithfulness scores (Choubey et al., 2021; Aharoni et al., 2022), which would necessitate translating all generated summaries at each validation step.
Our solution instead is to formulate faithfulness evaluation as a binary classification problem, i.e., to predict whether a given document-summary pair is faithful or hallucinated. In other terms, our approach aims to distil knowledge from multiple teacher models, i.e., existing English model-based metrics, into a target-language classifier as a student model. Specifically, we use multiple English faithfulness metrics to assign pseudo-labels of "faithful" or "hallucinated" to English document-summary pairs, then translate them to create a faithfulness binary classification dataset in the target languages. We then train the target-language classifier on the resulting silver dataset. Formally, we aim to obtain a faithfulness scoring model $g(\cdot)$ in a target language tgt that predicts the faithfulness of a given document-summary pair $(x, y)$, i.e., $g^{(\mathrm{tgt})}(x^{(\mathrm{tgt})}, y^{(\mathrm{tgt})}) \triangleq p(z = 1 \mid x^{(\mathrm{tgt})}, y^{(\mathrm{tgt})})$, where $z = 1$ and $z = 0$ indicate that the pair is faithful or hallucinated, respectively.
The pipeline for creating mFACT is presented in Figure 1. We start with four diverse English faithfulness metrics and use them to score the training samples from the English XSum summarisation dataset (Narayan et al., 2018). Following Maynez et al. (2020), we select the metrics based on two categories of model-generated hallucinations: 1) intrinsic hallucinations, where the summary distorts information present in the document; and 2) extrinsic hallucinations, where the model adds information that cannot be directly supported by the document. We select two model-based metrics capturing intrinsic hallucinations:
• DAE (Goyal and Durrett, 2021), an entailment classifier trained with annotations at a fine-grained dependency level;
• QAFactEval (Fabbri et al., 2022), which generates questions whose answers are spans of the summary and attempts to answer them based on the document alone;
and two metrics for extrinsic hallucinations:
• ENFS%, a simple rule-based measurement presented by Cao et al. (2022a), which counts the proportion of entities that appear in a summary but not in its corresponding document;
• EntFA (Cao et al., 2022a), which estimates the posterior and prior probabilities of generated entities with language models conditioned (or not conditioned, respectively) on the source document. Using these probabilities as features, a KNN classifier detects token-level hallucinations.
We chose XSum as the source English dataset because 1) all our selected faithfulness metrics are trained on XSum, which maximises their reliability, and 2) XSum has been shown to include abundant and diverse hallucinations (Maynez et al., 2020; Pagnoni et al., 2021), which allows our metric to capture as many types of hallucinations as possible.
We then normalise the scores of the four above-mentioned metrics to [0, 1] and average them for each training sample. We rank the samples from the most faithful to the most hallucinated according to the resulting faithfulness scores. The k top-ranked and k bottom-ranked document-summary pairs are then treated as positive and negative examples, respectively. We translate these into a series of target languages with the Google Translation API and create our silver faithfulness dataset, splitting its examples into training/validation/test sets with a proportion of 95/2.5/2.5.
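As a reference, the following is a minimal sketch of this silver-data construction step, assuming each English XSum pair already carries four raw metric scores; the field and function names are illustrative, not the authors' exact code.

```python
# Normalise the four metric scores, average them, rank the samples,
# and keep the k most faithful and k most hallucinated pairs.
from typing import Dict, List


def min_max_normalise(values: List[float]) -> List[float]:
    """Rescale a list of raw metric scores to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def build_silver_labels(samples: List[Dict], k: int):
    """Average the normalised metric scores, rank the samples, and return the
    k most faithful (positives) and k most hallucinated (negatives) pairs."""
    metric_names = ["dae", "qafe", "enfs", "entfa"]
    per_metric = {m: min_max_normalise([s[m] for s in samples]) for m in metric_names}
    for i, s in enumerate(samples):
        s["faithfulness"] = sum(per_metric[m][i] for m in metric_names) / len(metric_names)
    ranked = sorted(samples, key=lambda s: s["faithfulness"], reverse=True)
    positives, negatives = ranked[:k], ranked[-k:]
    return positives, negatives  # translated afterwards into each target language
```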
Finally, a multilingual BERT-based classifier is fine-tuned on our dataset. We follow the sentence-pair classification setting of Devlin et al. (2019) and concatenate each document-summary pair as the input. A classification head receives the last-layer representation of the [CLS] special token and returns a score between 0 (hallucinated) and 1 (faithful).
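A minimal inference sketch of such a classifier is shown below; the checkpoint path is a placeholder for a fine-tuned mFACT classifier in some target language, not an officially released model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "path/to/mfact-classifier-zh"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.eval()


def mfact_score(document: str, summary: str) -> float:
    """Return a faithfulness score in [0, 1]: 1 = faithful, 0 = hallucinated."""
    # Sentence-pair encoding: [CLS] document [SEP] summary [SEP]
    inputs = tokenizer(document, summary, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # probability of "faithful"
```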

Reducing Hallucination in Cross-lingual Transfer
We first provide some background on cross-lingual transfer. Then, we show how to adapt several methods promoting faithfulness in monolingual summarisation to cross-lingual transfer settings. Finally, we describe a new approach based on loss weighting.

Cross-lingual Transfer with MAD-X
We adopt the Multiple ADapters framework (MAD-X; Pfeiffer et al. 2020b), a state-of-the-art method for cross-lingual transfer. MAD-X learns independent language and task adapters (i.e., parameter-efficient model fine-tunings) and then combines them. Specifically, to transfer the ability to summarise documents from a source language to a target language, we follow these steps: 1) we train two separate language adapters on Wikipedia corpora for the source and target languages; 2) we stack the (frozen) source language adapter with a randomly initialised task adapter and train the latter with annotated data in the source language; 3) we stack the trained task adapter with the target language adapter and perform zero-shot inference in the target language.
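A schematic sketch of these three steps with the adapter-transformers package (used in Appendix A.2) follows; the adapter names and checkpoint paths are illustrative, and the invertible adapters that MAD-X language adapters additionally use are omitted for brevity.

```python
import transformers.adapters.composition as ac
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

# 1) Language adapters, trained separately with masked-language modelling on
#    source/target Wikipedia (assumed here to be already trained and saved).
model.load_adapter("adapters/en", load_as="en")
model.load_adapter("adapters/vi", load_as="vi")

# 2) Stack a fresh task adapter on the frozen source-language adapter and
#    train only the task adapter on the source-language summarisation data.
model.add_adapter("sum")
model.train_adapter("sum")                        # freeze everything except "sum"
model.set_active_adapters(ac.Stack("en", "sum"))
# ... run the usual seq2seq fine-tuning loop on the English data here ...

# 3) Zero-shot inference in the target language: swap in the target adapter.
model.set_active_adapters(ac.Stack("vi", "sum"))
```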

Expert and Anti-Expert Approaches
The majority of strategies to reduce hallucinations in monolingual settings rely on creating experts or anti-experts that steer the model towards positive behaviour or away from negative behaviour. As a by-product of the pipeline used to create our metric, mFACT (Section 2), we obtained two separate subsets of faithful and hallucinated samples in both source and target languages. These subsets can serve as training data for experts and anti-experts in multiple languages, which makes them suitable for cross-lingual transfer. We explore three methods in this family. In all instances, we first train a base adapter on the source summarisation dataset. Then, we further tune it on the faithful (hallucinated) subset to obtain an expert (anti-expert) adapter.
Task Vector Negation (TVN; Ilharco et al. 2022). Task vector negation mitigates hallucinated generation by subtracting the task vector of the anti-expert model from the fine-tuned model. Formally, given a fine-tuned model with parameters $\theta_0$ and an anti-expert model $\theta^-$, the interpolated model parameters $\theta^\star$ are obtained as
$$\theta^\star = \theta_0 - \lambda\,(\theta^- - \theta_0),$$
where $\lambda$ is an importance hyperparameter that controls the degree of fusion between the fine-tuned model and the anti-expert.
Contrastive Parameter Ensembling (CAPE; Choubey et al. 2021). To compensate for the potential loss of summarisation ability caused by only subtracting the anti-expert task vector from the base model, CAPE also adds the expert parameters. Formally, the interpolated model parameters $\theta^\star$ are obtained as
$$\theta^\star = \theta_0 + \lambda\,(\theta^+ - \theta^-),$$
where $\lambda$ is again the importance hyperparameter.
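Both TVN and CAPE reduce to simple arithmetic over model (or adapter) parameters. A minimal sketch over PyTorch state dicts, matching the two equations above, is given below; the λ value and dictionary layout are illustrative assumptions.

```python
def task_vector_negation(theta_0, theta_minus, lam=1.0):
    """TVN: subtract the anti-expert task vector (theta_minus - theta_0) from the base."""
    return {k: theta_0[k] - lam * (theta_minus[k] - theta_0[k]) for k in theta_0}


def cape(theta_0, theta_plus, theta_minus, lam=1.0):
    """CAPE: add the contrast between expert and anti-expert parameters to the base."""
    return {k: theta_0[k] + lam * (theta_plus[k] - theta_minus[k]) for k in theta_0}


# Usage sketch: merged = cape(base_sd, expert_sd, anti_sd, lam=0.5)
#               model.load_state_dict(merged)
```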
DExpert Decoding (Liu et al., 2021). Contrary to Task Vector Negation and CAPE, which directly manipulate the model parameters, DExpert uses expert and anti-expert models to modify the predicted logits at each decoding step. Given the base model $f_\theta$ and a pair of expert $f_{\theta^+}$ and anti-expert $f_{\theta^-}$ models, the scores for the next token at each decoding step $t$ are
$$\tilde{z}_t = z_t + \lambda\,(z^+_t - z^-_t),$$
where $z_t$, $z^+_t$, $z^-_t$ are the outputs of $f_\theta$, $f_{\theta^+}$, $f_{\theta^-}$ at time step $t$, respectively. Again, an importance hyperparameter $\lambda$ controls the degree of fusion during decoding.
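A minimal sketch of this logit combination for a single decoding step is given below, written in decoder-only notation for brevity (a seq2seq model would also receive the encoder inputs); the model handles and λ are illustrative.

```python
import torch


def dexpert_next_token_logits(f_theta, f_plus, f_minus, input_ids, lam=0.5):
    """Combine base, expert, and anti-expert scores for the next token."""
    z = f_theta(input_ids).logits[:, -1, :]        # base model scores z_t
    z_plus = f_plus(input_ids).logits[:, -1, :]    # expert scores z_t^+
    z_minus = f_minus(input_ids).logits[:, -1, :]  # anti-expert scores z_t^-
    return torch.log_softmax(z + lam * (z_plus - z_minus), dim=-1)
```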

Weighted Loss Approach
We also introduce a simple but effective approach to reduce hallucination during cross-lingual transfer. Previous works have shown that controlling the quality of the training samples can improve the model's faithfulness (Kang and Hashimoto, 2020; Aharoni et al., 2022). However, simply filtering out hallucinated training data may sacrifice summarisation performance (Dziri et al., 2022).
We thus propose a "soft" data-filtering approach where we weigh the training loss according to each sample's faithfulness score. More formally, we rely on a faithfulness metric for the source language, which outputs a score $z^{(i)}$ for the faithfulness of the $i$-th document-summary pair. The update rule of the trainable parameters for each batch then becomes
$$\theta \leftarrow \theta - \alpha \, \frac{1}{m} \sum_{i=1}^{m} z^{(i)} \, \nabla_\theta J\big(x^{(i)}, y^{(i)}; \theta\big),$$
where $\theta$ is the vector of trainable model parameters, $\alpha$ is the learning rate, $m$ is the batch size, $J(\cdot\,;\theta)$ is the loss function for a single training example $(x^{(i)}, y^{(i)})$, and $\nabla_\theta J(\cdot)$ is the gradient of the loss function w.r.t. the model parameters.
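A minimal sketch of this update for one batch with a seq2seq model in PyTorch follows; weighting each example's loss before averaging is equivalent to scaling its gradient by $z^{(i)}$. The variable names are illustrative, the faithfulness scores are assumed to be precomputed as a tensor, and labels are assumed to be padded with -100 as in standard transformers data collators.

```python
import torch


def weighted_loss_step(model, optimizer, batch, faithfulness_scores):
    """Scale each example's loss by its faithfulness score before back-propagating."""
    optimizer.zero_grad()
    outputs = model(**batch)                       # seq2seq forward pass
    logits, labels = outputs.logits, batch["labels"]
    # Per-example (unreduced) cross-entropy, averaged over non-padded target tokens.
    loss_fct = torch.nn.CrossEntropyLoss(reduction="none", ignore_index=-100)
    token_loss = loss_fct(logits.transpose(1, 2), labels)           # (batch, seq_len)
    mask = (labels != -100).float()
    per_example = (token_loss * mask).sum(dim=1) / mask.sum(dim=1)  # J(x_i, y_i; theta)
    loss = (faithfulness_scores * per_example).mean()               # z_i-weighted loss
    loss.backward()
    optimizer.step()
```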

Experimental Setup
Evaluation Metrics. We use ROUGE-1/2/L scores (Lin and Och, 2004) to evaluate abstractive summarisation performance. We use the four metrics described in Section 2 to evaluate the faithfulness of English summaries, and our mFACT metric for summaries in other languages.

Dataset. We conduct our experiments on XL-Sum, a large-scale multilingual summarisation dataset (Hasan et al., 2021). XL-Sum provides a large collection of annotated document-summary pairs in 45 languages in addition to English. We test our approach on six target languages: Chinese, Spanish, French, Hindi, Turkish and Vietnamese. Table 7 shows the dataset statistics.

Classification Results
Firstly, we verify the reliability of mFACT by using our translated test sets in multiple languages to benchmark mFACT and several baselines for faithfulness classification.

Baselines. Previous works (Maynez et al., 2020; Kryscinski et al., 2020) showed that models trained for natural language inference (NLI), a related task for which more annotated data is readily available, can also be used to assess faithfulness in English summarisation. We thus include a baseline, XNLI, which consists in fine-tuning multilingual BERT on the corresponding language split of the XNLI dataset (Conneau et al., 2018). As an alternative, we further fine-tune the XNLI baseline on our translated data (XNLI-mFACT), thus verifying whether combining the supervision signal from both sources boosts performance. Finally, we include an ablation that uses zero-shot multilingual transfer instead of "translate train" (Artetxe et al., 2020): in mFACT-Transfer, we train a multilingual encoder on our English faithfulness classification dataset without translating it, and then deploy it directly on examples in other languages.
Results and Discussion. We report the classification performance in Table 1. We find that NLI classifiers do not achieve a level of performance on par with classifiers trained on our faithfulness classification dataset. This demonstrates that evaluating faithfulness is indeed distinct from NLI, which is consistent with previous findings on assessing faithfulness in English (Kryscinski et al., 2020; Maynez et al., 2020). Comparing mFACT and mFACT-Transfer, we also observe the positive effect of translation-based transfer, which achieves a much higher recall than zero-shot cross-lingual transfer. Hence, mFACT is more likely to identify faithful document-summary pairs as such.

External Evaluation by Inverse Transfer
Finally, we conduct an evaluation based on inverse cross-lingual transfer (i.e., from other languages to English) as a downstream task, using our newly introduced approach (Section 3.3). This setting allows us to compare the impact of using different multilingual faithfulness metrics, among those listed in Section 5.1, to weigh the training samples in target languages. The rationale behind this experiment is that if the scorer captures the model's faithfulness in target languages, the English summaries generated by the corresponding model should be more faithful according to the four English metrics from Section 2.1.
The results are shown in Table 2. Unsurprisingly, we observe that, in general, weighting the training samples in target languages with faithfulness metrics achieves considerable improvements over the MAD-X baseline in English faithfulness scores. This suggests that these metrics are well aligned with the actual faithfulness of generated summaries. Specifically, comparing mFACT and mFACT-Transfer with XNLI and XNLI-mFACT, we find that our constructed dataset is much more effective in improving faithfulness than the NLI signal, which again supports our previous conclusion that faithfulness classification and NLI are only loosely related. Finally, mFACT-Transfer performs worse than mFACT in ROUGE, which may be caused by the much lower recall of mFACT-Transfer in faithfulness classification (see Table 1).

Cross-lingual Transfer Introduces Additional Hallucinations
The second analysis of this paper aims to corroborate our observation that cross-lingual transfer can introduce additional hallucinations over monolingual fine-tuning, though it improves the task performance for summarisation in the target language.
Transfer Setup. We compare two data scenarios and two styles of fine-tuning. First, we investigate the impact of initial training on source data followed by few-shot fine-tuning on target data (cross-lingual transfer), as opposed to training on target data alone. We attribute the difference in faithfulness scores to the additional hallucinations introduced by the training phase in the source language. Taking Chinese as an example of few-shot cross-lingual transfer, we train the summarisation model first on XSum (English) and then on 1K randomly sampled XL-Sum (Chinese) examples. Second, we compare fine-tuning the full model, where all parameters are updated, with parameter-efficient fine-tuning, where only the adapters are updated. This allows us to study the effect of different transfer methods on faithfulness.

Results and Discussion
In Table 3, we observe that cross-lingual transfer improves ROUGE scores for both full-model fine-tuning and MAD-X, outperforming monolingual fine-tuning. This underscores its effectiveness in transferring task-specific knowledge from source to target languages in low-resource scenarios. However, leveraging source-language data also increases hallucination in both cases.

Hallucinations in Multilingual Large Language Models
We also assess the summarisation performance of recent multilingual large language models (LLMs) on XL-Sum (Table 5). Phoenix (Chen et al., 2023), for instance, further fine-tunes BLOOMZ with an additional 267K and 189K instances of multilingual instructions and conversation rounds.

Table 4: Automatic evaluation of zero-shot cross-lingual transfer performance from English to other languages when selecting the checkpoint with the best validation mFACT. Numbers are the average of three runs with different random seeds. mF stands for mFACT and mF-T for mFACT-Transfer; bi% and tr% denote the percentages of novel bigrams and trigrams.
We select three languages, aside from English, that are present in the pre-training data for BLOOMZ and in the conversational tuning data for Vicuna and Phoenix. We also report the percentage of examples in each of these languages that these models have been exposed to during their multilingual training. Table 5 demonstrates that current LLMs display notable faithfulness limitations in cross-lingual transfer contexts for languages beyond English, including well-resourced languages like French and Spanish. Furthermore, a noticeable trend emerges: LLM faithfulness in a given language tends to correlate strongly with the number of samples from that language observed during training. These observations align with recent findings (Lai et al., 2023; Laskar et al., 2023) which highlight the challenges of maintaining faithfulness while generating content in low-resource languages.

Reducing Hallucinations
In this section, we test different methods for cross-lingual transfer of summarisation to multiple languages and for promoting faithfulness. We compare our new loss-weighting method based on mFACT with MAD-X, as well as with a series of approaches for reducing hallucinations (Section 4). We evaluate these methods with automated metrics for performance, faithfulness, and abstractiveness (i.e., the ability to rephrase the document instead of copy-pasting spans of text). We also conduct human evaluations to corroborate these results.

Automatic Evaluation. We report performance (ROUGE), faithfulness (mFACT), and abstractiveness (novel bigrams and trigrams in the summary) on the test set of each target language in Table 4. We first observe that the expert/anti-expert methods adapted from monolingual summarisation are partly effective for improving ROUGE and mFACT scores over MAD-X in cross-lingual transfer; however, no clear winner emerges among them, as their gains are marginal or inconsistent. For example, TVN produces the most faithful summaries for Hindi and Vietnamese, CAPE for Turkish, and DExpert for French. All three models, however, display a similar tendency to sacrifice ROUGE scores to improve faithfulness. Instead, as Table 4 demonstrates, our proposed weighted-loss approach (WL) improves performance across the board while achieving an mFACT score comparable to the most faithful expert models. In particular, WL achieves the best faithfulness in Chinese and Spanish and the best ROUGE scores for all languages except Hindi. These results suggest that our weighted-loss method strikes the best balance between summarisation ability and faithfulness.
Abstractiveness. We also measure the level of abstractiveness of the different methods, which is known to be inversely correlated with faithfulness (Ladhak et al., 2022; Daheim et al., 2023). In fact, reducing hallucinations has the side effect of encouraging the model to copy-paste spans of the document (i.e., to acquire an extractive behaviour). Following Cao et al. (2022a) and See et al. (2017), we use the percentage of novel n-grams in the summary compared with the document as a measure of abstractiveness. Figure 3 illustrates the distributions of abstractiveness and faithfulness for all models on the six XL-Sum datasets. Both positive and negative predictions of mFACT are scattered across different levels of abstractiveness. We also observe that summaries generated by the weighted-loss method generally have a higher level of abstractiveness than similarly faithful summaries from the other baselines. Table 4 shows that most expert/anti-expert models sacrifice abstractiveness to improve the faithfulness score. In contrast, the weighted-loss approach produces more novel n-grams. These findings show that our method does not improve faithfulness by simply favouring extractive summaries.

Human Evaluation. Finally, we recruited human annotators from the Prolific platform for a blind comparison between MAD-X and our weighted-loss model. We randomly sampled nine documents for each language and paired them with the summaries generated by the two models. We asked the participants to evaluate the summaries via A/B testing in two aspects. Informativeness: an informative summary should cover as much information from the document as possible and convey its main idea. Faithfulness: a faithful summary should only contain information already present in the document and should not contain information contradicting the document. Participants first read the document, then select the better summary (or both, if they are similar) in terms of informativeness and faithfulness (see Appendix A.5). We require participants to be native speakers of the language they evaluate and to hold at least a bachelor's degree. Each document and its paired summaries are evaluated by three participants. These settings allow us to achieve a fair inter-rater agreement of 0.28 in terms of Fleiss' κ (Landis and Koch, 1977).
The results in Figure 2 indicate that human evaluators prefer the summaries generated by our weighted-loss method over those of MAD-X, demonstrating that our approach improves faithfulness and informativeness for all six languages.
Finally, we study the correlation between the human preferences from Figure 2 and the various faithfulness metrics presented in Section 5.1. Table 6 shows that mFACT achieves the strongest correlation with human judgements (0.45 Pearson ρ and 0.34 Spearman ρ), which is statistically significant. In comparison with XNLI and XNLI-mF, we reconfirm that metrics designed for faithfulness classification, rather than natural language inference, align more closely with human preferences.
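For reference, a small helper for the abstractiveness measure used in this section (percentage of novel summary n-grams) is sketched below; whitespace tokenisation is an assumption, and the paper's exact tokenisation, especially for Chinese, may differ.

```python
def novel_ngram_ratio(document: str, summary: str, n: int = 2) -> float:
    """Percentage of summary n-grams that never occur in the source document."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    doc_ngrams = ngrams(document.split(), n)
    sum_ngrams = ngrams(summary.split(), n)
    if not sum_ngrams:
        return 0.0
    return 100.0 * len(sum_ngrams - doc_ngrams) / len(sum_ngrams)
```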

Conclusion
We investigate how to measure and mitigate hallucinations of summarisation models in cross-lingual transfer scenarios. We first propose a multilingual metric, mFACT, to facilitate the evaluation of faithfulness in low-resource languages. By virtue of this new metric, we find empirical evidence that while common cross-lingual transfer methods benefit summarisation performance, they amplify hallucinations compared to monolingual counterparts. We also point out that faithfulness in summarisation for languages other than English is still challenging for multilingual large language models. Finally, with the aim of reducing these hallucinations, we adapt several monolingual methods to cross-lingual transfer and propose a new method based on weighting the loss according to the mFACT score of each training example. Based on both automated metrics and human evaluation, we demonstrate that mFACT is the most reliable metric for detecting hallucinations in multiple languages. Moreover, compared to a series of state-of-the-art baselines, we find that summaries produced by loss weighting achieve higher performance and abstractiveness, competitive faithfulness, and a higher alignment with human preferences. We hope that this work will attract more attention from the community to the phenomenon of hallucination in languages other than English and facilitate future research by establishing evaluation metrics and baselines.

Limitations
We use machine translation to construct the faithfulness classification dataset used to train the faithfulness metrics in target languages. The required resources may constrain the feasibility of extending mFACT to other languages. The quality of the learned metrics may also be limited by error propagation during translation, especially for languages with poor translation performance. Additionally, although the weighted-loss approach is effective in a diverse sample of languages, its gains in faithfulness are not consistent across all languages, as discussed in Section 8. Finding a method that is equally effective in reducing hallucinations across all languages remains an open research question for future work.

Ethical Consideration
All human workers participating in our evaluation are informed of the intended use of the provided assessments of summary quality and comply with the terms and conditions of the experiment, as specified by Prolific. With regard to payment, workers from all regions are paid on the same scale, with an hourly wage of £13.5. This work (and specifically, the human evaluation) has also passed an ethical review by the ethics panel of our institution.

A.1 Dataset Statistics
We show the dataset statistics for all six XL-Sum subsets used in our experiments in Table 7.
mFACT Classifiers. We implement mFACT with the transformers package (Wolf et al., 2020). We train the multilingual BERT model for two epochs, with a batch size of 32 and a learning rate of 5e-5. We set the maximum input length to 512 and truncate the input article if necessary. The same hyperparameter settings are applied to all the languages we test.

Weighted-Loss Summarisation Models. We implement our weighted-loss model for cross-lingual transfer with the adapter-transformers package (Pfeiffer et al., 2020a). We use the officially released mBART-50 checkpoint as the base model for equipping language and task adapters.
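As a hedged sketch, the mFACT classifier fine-tuning configuration above (two epochs, batch size 32, learning rate 5e-5) can be expressed with the standard transformers Trainer; the output directory is illustrative, and the dataset objects are assumed to hold tokenised document-summary pairs from the translated silver data (maximum input length 512, truncated).

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)
args = TrainingArguments(
    output_dir="mfact-classifier",       # illustrative output path
    num_train_epochs=2,                  # two epochs
    per_device_train_batch_size=32,      # batch size of 32
    learning_rate=5e-5,                  # learning rate of 5e-5
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
```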
To train the language adapters, we follow the same adapter architecture and training settings as Pfeiffer et al. (2020b). We use a batch size of 64 and a learning rate of 1e-4, and train each adapter for 48K update steps.

Task Adapters. To train the task adapters for summarisation, we set the batch size to 32, the learning rate to 1e-4, and the label-smoothing factor to 0.1. We use a polynomial scheduler to adjust the learning rate during training, with a weight decay of 0.01 and a maximum gradient norm of 0.1. The model is trained for ten epochs, with the first 500 update steps as the warm-up stage. We select the best checkpoint according to either the best validation ROUGE or the best validation mFACT score, respectively.

During decoding for zero-shot cross-lingual transfer, we follow most settings of Hasan et al. (2021). We apply beam search with a beam size of 6, and the minimum/maximum decoding steps are set to 30/84, respectively. A length penalty of 0.6 is applied, and we block all repeated tri-grams.
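A hedged sketch of this decoding configuration with the standard transformers generate API follows; `model` and `tokenizer` are assumed to be the fine-tuned mBART-50 summariser and its tokenizer, and the target-language code used for the forced BOS token (required by mBART-50) is illustrative.

```python
inputs = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")
summary_ids = model.generate(
    **inputs,
    num_beams=6,                # beam search with a beam size of 6
    min_length=30,              # minimum decoding steps
    max_length=84,              # maximum decoding steps
    length_penalty=0.6,
    no_repeat_ngram_size=3,     # block all repeated tri-grams
    forced_bos_token_id=tokenizer.lang_code_to_id["vi_VN"],  # illustrative target language
)
summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
```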

A.3 Sanity Check for English Faithfulness Metrics
We perform a sanity-check experiment, reported in Table 9, to verify the reliability of these model-based hallucination metrics. We randomly shuffle the alignments between documents and the summaries predicted by the mBART model (and the references). We then feed these misaligned document-summary pairs to the evaluation models and test their behaviour. We observe that all hallucination metrics drop considerably, showing that these metrics are indeed sensitive to random summaries and thus reliable to some extent.
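A minimal sketch of this misalignment check is shown below; `score_fn` stands for any of the English faithfulness metrics and is an illustrative placeholder.

```python
import random


def misaligned_scores(documents, summaries, score_fn, seed=0):
    """Pair each document with a randomly shuffled summary and re-score the pairs."""
    shuffled = summaries[:]
    random.Random(seed).shuffle(shuffled)
    return [score_fn(doc, summ) for doc, summ in zip(documents, shuffled)]
```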

A.4 Translation Quality Check
Our first experiment confirms the effectiveness of mFACT in capturing hallucinations in target languages. To support our method, we conduct a quality check of the translation outputs, a comparison of different metrics on our translated faithfulness classification dataset, and an external evaluation on downstream tasks.
Machine translation (MT)-based transfer can arguably suffer from error propagation, where MT tools introduce hallucinations into their outputs. This issue is even more serious in our setting, where translating faithful samples is necessary to create the mFACT metric, as training with false positives might significantly degrade its quality. To verify the feasibility of our pipeline for developing mFACT, we first check the translation quality manually. We randomly pick 100 samples from the Chinese positive set and label their faithfulness. Through this sanity check, we found 13 hallucinated samples; however, only 4 of them are caused by poor translation, while the other 9 are due to an incorrect ranking based on the four English metrics. This shows that MT-based transfer is mostly reliable: only a small amount of noise is introduced by MT.

A.5 Extended Results for Faithfulness Classification
To gain a deeper understanding of the averaged faithfulness classification results presented in Table 1, we analyse the individual language-specific outcomes (Table 10). Across the six language experiments, we consistently observe a significant performance gap between the models trained on the NLI task and those trained on the faithfulness classification task.
The following is the guide for annotators to indicate whether a summary is informative and faithful.

A.6 Full-model transfer vs. MAD-X transfer
We conduct a comparative study on summarisation performance and faithfulness for two cross-lingual transfer approaches: MAD-X-style and full-model transfer.
For both MAD-X-style and full-model cross-lingual transfer, we observe that cross-lingual transfer improves ROUGE scores but reduces faithfulness compared to monolingual fine-tuning, in both the zero-shot and few-shot settings (Figure 4).

A.8 Prompts Used for Multilingual LLM Summarisation
We show the prompt templates used for all languages in our LLM summarisation experiments in Figure 6.
A.9 Assembling Metrics for mFACT Does Better than a Single Metric
We conducted an additional experiment to support the design of assembling multiple metrics for mFACT. Rather than averaging four metrics, we individually apply a single English metric (DAE, QAFE, ENFS, or EntFA) to rank the XSum dataset and train a multilingual classifier, similarly to mFACT-Transfer (i.e., without translation); we denote these variants as DAE-T, QAFE-T, ENFS-T, and EntFA-T.
To compare mFACT with the metrics derived from each single metric, we extend the human evaluation results in Table 6. We compare these four metrics with mFACT-Transfer and again measure the Pearson and Spearman correlations with the human annotations.
In Table 11, we find that mFACT consistently achieves the highest correlation with humans compared to the other four metrics. This observation underscores mFACT's better alignment with human evaluations. The reason could be that relying on a single metric introduces a biased preference in models and a lack of diversity in the captured hallucinations. In general, multiple teacher models lead to a more robust, unbiased process (Wu et al., 2021; Ilichev et al., 2021). Using diverse metrics in mFACT's training helps the classifier detect various hallucination types; our inverse transfer experiments (Table 2) also show mFACT's promising correlations with both intrinsic and extrinsic hallucination metrics.

A.10 Strategy for Selecting Best Model Checkpoint
Table 12 compares summarisation model performance when we select the checkpoint with the best ROUGE-1 or the best mFACT score. We find that under both strategies, the weighted-loss model achieves better ROUGE and faithfulness scores in most languages. However, similarly to other works (Choubey et al., 2021; Aharoni et al., 2022), we find that selecting the checkpoint with the best validation faithfulness score contributes more positively to the model's faithfulness.

A.11 Distributions of Faithfulness and Abstractiveness for All Languages
We show the distributions for the percentage of novel 2-grams and mFACT scores for all six languages in Figure 7.

Figure 1: Pipeline of mFACT for transferring English faithfulness metrics to target languages via machine translation. We average the scores of four English metrics to rank the training samples in XSum. We then translate the most faithful and most hallucinated samples into each target language and train a classifier to distinguish them.

Figure 3: Distributions of novel 2-gram% and mFACT scores for all five hallucination-reduction methods in cross-lingual transfer on the datasets of six languages in XL-Sum.

Figure 5: Validation mFACT score curves over each model's training dynamics. Weighted loss consistently outperforms MAD-X in terms of faithfulness throughout training.

Table 3: Performance and faithfulness scores for few-shot cross-lingual transfer (CLTF) and monolingual fine-tuning (MFT) on abstractive summarisation. CLTF generally improves the model's performance but decreases its faithfulness. ↑ and ↓ indicate whether higher or lower values are better, respectively.

Table 6: Correlation between several faithfulness metrics and human preferences. mF and mF-T stand for mFACT and mFACT-Transfer, respectively. We calculate both Pearson and Spearman statistics on document-summary pairs from all six languages to ensure that the sample size is significant.

Table 10: Classification performance on our translated faithfulness dataset for all target languages.

Figure 4: Comparison of full-model and MAD-X cross-lingual transfer in ROUGE and faithfulness. The left column is the zero-shot performance, and the right column is the few-shot performance. We provide the average scores over all six languages.

Table 11: Correlation with human preferences for mFACT and the four transferred metrics developed from a single metric. We again calculate both Pearson and Spearman statistics on document-summary pairs from all six languages to ensure that the sample size is significant.