Multilingual Summarization with Factual Consistency Evaluation

Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsistent summaries, reducing their utility for real-world applications. Several recent efforts attempt to address this by devising models that automatically detect factual inconsistencies in machine-generated summaries. However, they focus exclusively on English, a language with abundant resources. In this work, we leverage factual consistency evaluation models to improve multilingual summarization. We explore two intuitive approaches to mitigate hallucinations based on the signal provided by a multilingual NLI model, namely data filtering and controlled generation. Experimental results on the 45 languages from the XLSum dataset show gains over strong baselines in both automatic and human evaluation. We release models and human judgements of summaries to foster progress towards more factually consistent multilingual summarization.


Introduction
The past few years have witnessed a huge leap forward in abstractive summarization thanks to large-scale pretraining (Devlin et al., 2019; Lewis et al., 2020) and the availability of benchmark datasets. A well-known issue limiting the wider adoption of abstractive summarization models is their tendency to generate factually inconsistent summaries, a.k.a. "hallucinations" (Maynez et al., 2020; Zhao et al., 2020, inter alia). A recently popular line of work explores how to best detect hallucinations in machine-generated text, thereby enabling the automatic identification of factually inconsistent summaries (Eyal et al., 2019; Falke et al., 2019; Kryscinski et al., 2020; Wang et al., 2020; Goyal and Durrett, 2021; Scialom et al., 2021; Honovich et al., 2022; Tang et al., 2022, inter alia).
While such approaches may prove useful for automatic evaluation, it remains unclear how to best leverage them for improving summarization models in multiple languages. Focusing exclusively on English, previous work suggests different techniques to this effect, such as discarding "noisy" training examples (Gehrmann et al., 2021), contrastive learning paradigms (Nan et al., 2021b), controlled generation and planning (Narayan et al., 2021; Rashkin et al., 2021b), or reinforcement-learning approaches that use the evaluation model score as a reward function (Gunasekara et al., 2021). Despite promising results, no method has emerged as a clear winner in English, let alone across languages with varying amounts of data and resources.
In this work, we leverage factual consistency evaluation models to improve summarization systems in multiple languages. Specifically, we employ Textual Entailment models (a.k.a. Natural Language Inference; Dagan et al. 2005; Bowman et al. 2015) to determine whether a summary is factually consistent (Maynez et al., 2020; Laban et al., 2021). We opportunistically opt for NLI given the availability of multilingual benchmarks for model training (Conneau et al., 2018; Nie et al., 2020). Approaches based on question generation and answering have also been shown to work well for factuality evaluation (e.g., Scialom et al., 2021; Honovich et al., 2021; Deutsch et al., 2021); however, they are not easily portable due to the scarcity of the respective resources in languages other than English.
We first analyze the quality of the training data for summarization models using a strong multilingual NLI model (as evaluated on the XNLI dataset; Conneau et al., 2018). In particular, we train our multilingual NLI model following the guidelines from the TRUE survey (Honovich et al., 2022) for the assessment of factual consistency. Focusing on the XLSum multilingual summarization dataset (Hasan et al., 2021), we find that for some languages up to 70% of training examples are not factually consistent according to the NLI model, even though such examples are commonly used for training. We use the NLI signal to improve the quality of the generated summaries in two ways: (1) data filtering, where we only train on examples whose summaries are predicted to be entailed by the input, and (2) controlled generation, where we also leverage "negative" training examples by conditioning the summarization model on the NLI signal. We evaluate the proposed approaches using both automatic and human evaluation in 45 languages, and observe significant gains in the faithfulness of the generated summaries over strong baselines. Finally, we show that the human judgments we collected in all languages are useful for training automatic metrics to assess the quality, factual consistency and informativeness of generated summaries.
To summarize, the contributions of this work are three-fold: (1) we analyze the quality of the XLSum dataset (Hasan et al., 2021) using strong multilingual NLI models and reveal severe issues with faithfulness in the training data across languages; (2) we explore methods for improving downstream summarization models trained on this data using a multilingual NLI signal, and show large gains in both automatic and human evaluation; and (3) using the data from our large-scale human evaluation study, we learn metrics for automatically evaluating summaries in multiple languages along the dimensions of Quality, Factual Consistency, and Informativeness. To the best of our knowledge, our work is the first to examine the faithfulness of summarization systems in multilingual settings, and we hope it will encourage the development of better metrics and models in multilingual text generation.
Related Work

A plethora of approaches have been proposed for the automatic detection of factual inconsistencies in machine-generated text (see Honovich et al. 2022 and Tang et al. 2022 for overviews) with varying degrees of success. There is growing consensus that techniques based on textual entailment (Maynez et al., 2020; Goyal et al., 2021; Goyal and Durrett, 2021) and question generation and answering models (Durmus et al., 2020; Wang et al., 2020; Deutsch et al., 2021; Fabbri et al., 2021; Scialom et al., 2021; Honovich et al., 2021) achieve strong performance across tasks and datasets (Laban et al., 2021; Honovich et al., 2022). Another line of work uses synthetically generated data to train models for evaluating factual consistency (Kryscinski et al., 2020; Zhao et al., 2020; Goyal and Durrett, 2020). Aside from assessing system output, several studies have proposed novel model architectures which enforce factuality during training or inference. These include extracting facts from the source and incorporating them as additional input to the model (Cao et al., 2018; Aralikatte et al., 2021; Zhu et al., 2021), planning with entity chains and avoiding entities that are not in the input (Narayan et al., 2021, 2022), using reinforcement learning to optimize model training with factual correctness as a reward (Zhang et al., 2020; Arumae and Liu, 2019; Pasunuru and Bansal, 2018; Nan et al., 2021b), reranking candidate summaries within a beam using entailment predictions (Falke et al., 2019) or quantity verification scores (Zhao et al., 2020), using contrastive learning (Cao and Wang, 2021; Wan and Bansal, 2022), modifying the training objective to only maximize the likelihood of factual words (Goyal and Durrett, 2021), incorporating factuality into the pretraining objective of models tailored to text summarization tasks (Wan and Bansal, 2022), and adaptively removing examples with high log loss (Kang and Hashimoto, 2020). Other work simply removes noisy training samples (Nan et al., 2021a; Goyal and Durrett, 2021) in the hope that factuality will improve by training on better examples.
Despite promising results, it is unclear whether previous techniques transfer to languages beyond English. Our own work aims to improve the factuality of abstractive summarization across languages. Leveraging recent progress on multilingual pretrained models (Xue et al., 2021), we show that entailment-based metrics can be trained to detect factually inconsistent summaries in multiple languages, and that this signal can be leveraged to improve summarization systems in those languages.

Multilingual Factual Consistency Evaluation
We cast factual consistency evaluation as a Natural Language Inference (NLI) task. The input forms the premise, the summary forms the hypothesis (Maynez et al., 2020; Laban et al., 2021; Honovich et al., 2022), and the NLI model is used to predict whether the summary is entailed by the input. More formally, given input document d and summary s, we define an NLI model M as a binary classifier, where M(d, s) ≈ p(s is entailed by d).
Recent studies (Honovich et al., 2022) on evaluating factual consistency in summarization and other related tasks in English have obtained promising results when finetuning large pretrained models on NLI datasets. Specifically, they finetune T5 pretrained encoder-decoder models (Raffel et al., 2020) for binary classification, where the entailment relation translates to a positive label and the contradiction/neutral relations are merged into a negative label. Their model encodes the concatenation of the premise (document) and hypothesis (summary) and decodes a single token that represents the class label (entailment or no entailment). Since we are interested in evaluating factual consistency in multiple languages, we extend the modeling approach of Honovich et al. (2022) to a multilingual setting. As our pretrained model, we use mT5-XXL (Xue et al., 2021), which was trained on mC4, a dataset drawn from the public Common Crawl covering 101 languages. We finetuned mT5-XXL on the ANLI (Nie et al., 2020) and XNLI (Conneau et al., 2018) datasets. ANLI contains 162K English-only examples, while XNLI has 37K examples in 15 languages. As in the English case, the multilingual model is trained to generate a binary label when given the concatenation of a premise and hypothesis, where the positive label corresponds to an entailment relation, and the negative label stands for a neutral/contradiction relation. During inference, we score a premise-hypothesis pair by measuring the output probability when force-decoding the positive label, resulting in a score between 0 (no entailment) and 1 (entailment).
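The force-decoding step above can be sketched in a few lines. Assuming an encoder-decoder model that exposes logits over its vocabulary for the first decoded token, the entailment score is the softmax probability assigned to the positive label token, restricted to the two label tokens. The `logits_fn` callable, the label token ids, and the `toy_logits` stand-in below are hypothetical illustrations, not part of any released implementation:

```python
import math

POS_LABEL_ID = 0  # hypothetical vocabulary id of the "entailment" token
NEG_LABEL_ID = 1  # hypothetical id of the "no entailment" token

def nli_score(premise: str, hypothesis: str, logits_fn) -> float:
    """Score p(hypothesis is entailed by premise) by force-decoding the
    positive label: a two-way softmax over the label logits of the
    first decoded token."""
    logits = logits_fn(premise, hypothesis)  # dict: token id -> logit
    pos, neg = logits[POS_LABEL_ID], logits[NEG_LABEL_ID]
    m = max(pos, neg)  # numerically stable softmax
    return math.exp(pos - m) / (math.exp(pos - m) + math.exp(neg - m))

# Toy stand-in for the finetuned model's first-token logits: score rises
# with lexical overlap between premise and hypothesis.
def toy_logits(premise, hypothesis):
    overlap = len(set(hypothesis.split()) & set(premise.split()))
    return {POS_LABEL_ID: float(overlap), NEG_LABEL_ID: 1.0}

score = nli_score("the cat sat on the mat", "the cat sat", toy_logits)
assert 0.0 <= score <= 1.0
```

In the paper's setup the logits would come from the finetuned mT5-XXL checkpoint; here any scoring function with the same interface can be plugged in.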
We measured the quality of our multilingual NLI model by evaluating it on the XNLI (Conneau et al., 2018) test set and the TRUE benchmark (Honovich et al., 2022). The latter is a standardized collection of datasets representing various tasks (summarization, dialogue generation, paraphrasing and fact-checking) with manual annotations for factual consistency. On XNLI, our model yields an average accuracy of 90.0 over 15 languages, in comparison to 87.8 reported in Xue et al. (2021). We present results for individual languages in Appendix A. On the (English-only) TRUE benchmark, our model's average ROC AUC is 82.4, in comparison to 83.4 reported in Honovich et al. (2022) for their best performing English-only T5-11B model (Raffel et al., 2020) trained on ANLI. While our model is trained on both ANLI (English) and XNLI (15 languages, detailed in Table 1), we assume it can generalize to additional languages (for which NLI data is not available) due to the nature of the pretrained model (mT5, trained on 101 languages).

Summarization Models
We next describe two summarization approaches which exploit the factual consistency evaluation signal provided by the multilingual NLI model.

Data Filtering
An intuitive approach to improving the factuality of machine-generated summaries is to enhance the quality of the training data, simply by filtering noisy training samples (Nan et al., 2021a; Goyal and Durrett, 2021). More formally, given a training corpus D of input document-summary pairs, we find D+ ⊂ D such that for each document-summary pair (d, s) ∈ D+, p(s is entailed by d) > 0.5. We used our multilingual NLI model (see Section 3) to annotate the training data in XLSum (Hasan et al., 2021) for all 45 languages. Table 1 shows the total number of training examples and the proportion where the summary was predicted to be entailed by the input (using a threshold of 0.5 on the NLI model score). The proportion of entailed summaries ranges from 68.96% (for Japanese) to 28.29% (for Punjabi). For all but three languages (English, Japanese, Nepali), the NLI model predicted less than half of the training summaries as being entailed by the input. We find these numbers strikingly low; this may be due to the nature of the dataset, since the relationship between news headlines and their corresponding articles can be somewhat loose (e.g., headlines may include "clickbait" and additional details that are not mentioned in the article). Another reason might be errors of the NLI model; while it was shown to work well on TRUE/XNLI, XLSum may represent a different distribution.
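The filtering rule itself is a one-line predicate over NLI scores. A minimal sketch, where `nli_score` is a hypothetical stand-in for the multilingual NLI model rather than the actual checkpoint:

```python
def filter_training_data(pairs, nli_score, threshold=0.5):
    """Keep only document-summary pairs whose summary is predicted to be
    entailed by the document (NLI score above the threshold)."""
    return [(d, s) for d, s in pairs if nli_score(d, s) > threshold]

# Toy stand-in: pretend the NLI scores were precomputed and attached.
scores = {("doc1", "sum1"): 0.9, ("doc2", "sum2"): 0.2}
kept = filter_training_data(list(scores), lambda d, s: scores[(d, s)])
assert kept == [("doc1", "sum1")]  # only the entailed pair survives
```

In practice the scores would be computed once over the whole corpus (an expensive pass with mT5-XXL) and cached, so the filter itself is cheap.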
Overall, the results in Table 1 indicate that filtering the training data based on the NLI signal can have a large impact on the resulting summarization model. Training on the entailed portion of the data may result in more factual summaries, however, at the expense of summary quality, as the model unavoidably sees fewer examples (e.g., there are only 557 instances for Scottish Gaelic after filtering, while the original training set has 1,313).

Controlled Generation
Another way to leverage the NLI signal for improving the summarization model is via controlled generation (Keskar et al., 2019; Rashkin et al., 2021b). In this approach, special tokens are prepended to the model's input to indicate/control whether the output should be entailed or not.
Let D denote a training corpus of document-summary pairs (d, s). We annotate each (d, s) ∈ D as (d', s), where d' is d prepended with an "<entailed>" token if p(s is entailed by d) > 0.5, and otherwise d' is d prepended with "<not-entailed>". The model trained on D enhanced with these annotations is expected to learn the correlation between entailment and the special token value, and as a result can be "controlled" to produce more faithful summaries by prepending the token that corresponds to faithful (i.e., entailed) summaries at inference time. This method implicitly teaches the model to learn from the entailment signal while taking advantage of all available training data. It may, however, be more sensitive to wrong predictions by the entailment model, as noisy examples are not discarded.
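The annotation scheme can be sketched as follows; the control-token strings and the `nli_score` callable are placeholders (the paper maps the control tokens to spare entries in the mT5 vocabulary):

```python
ENTAILED = "<entailed>"          # control token for entailed examples
NOT_ENTAILED = "<not-entailed>"  # control token for the rest

def annotate(pairs, nli_score, threshold=0.5):
    """Prepend the control token matching the NLI prediction to each
    input document; unlike filtering, all examples are kept."""
    out = []
    for d, s in pairs:
        tok = ENTAILED if nli_score(d, s) > threshold else NOT_ENTAILED
        out.append((f"{tok} {d}", s))
    return out

def prepare_inference_input(d):
    # At inference time we always request an entailed summary.
    return f"{ENTAILED} {d}"

scores = {("doc1", "sum1"): 0.9, ("doc2", "sum2"): 0.2}
annotated = annotate(list(scores), lambda d, s: scores[(d, s)])
assert annotated[0][0].startswith(ENTAILED)
assert annotated[1][0].startswith(NOT_ENTAILED)
```

The design choice here is that "negative" examples still contribute gradient signal for fluency and content selection, while the control token isolates the entailment dimension.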

Model Details
We finetuned three models based on mT5-XXL (Xue et al., 2021; 13B parameters). The first is a "Vanilla" model trained on the XLSum data as-is. As previous work has shown that multilingual training improves performance for low-resource languages (Aharoni et al., 2019; Hasan et al., 2021), we follow this setting and finetune a single massively multilingual model for all 45 languages in XLSum. The second model ("Filtered") is finetuned only on the portion of the data that passed the multilingual NLI filter. The third model ("Controlled") is trained on all the data, using the controlled generation approach described above. Specifically, for the control tokens "<entailed>" and "<not-entailed>", we used two spare tokens from the mT5 vocabulary and prepended them to the input (Keskar et al., 2019; Rashkin et al., 2021b). During inference, we always prepend the input with "<entailed>" and report results on the whole development and test sets.
Ideally, we would like to evaluate a single model checkpoint for all languages; in the literature, the best checkpoint is often selected using ROUGE. However, we also employ NLI scores to quantify improvements in faithfulness. For each model, we select the two checkpoints that are best according to ROUGE and NLI (on the development set), when averaged across all languages. Table 2 summarizes the number of finetuning steps that led to the best checkpoints for each model according to ROUGE and NLI. For all models, we observe that the best NLI checkpoints occur earlier in training than the ROUGE-based ones.
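Checkpoint selection as described reduces to an argmax over per-checkpoint metric averages across languages. A minimal sketch with toy step numbers and scores (not the values in Table 2):

```python
def best_checkpoint(metric_by_ckpt):
    """Pick the checkpoint whose metric, averaged across languages, is
    highest. `metric_by_ckpt` maps checkpoint step -> {language: score}."""
    def avg(scores):
        return sum(scores.values()) / len(scores)
    return max(metric_by_ckpt, key=lambda c: avg(metric_by_ckpt[c]))

# Illustrative pattern from the paper: ROUGE keeps improving with more
# finetuning steps, while NLI (faithfulness) peaks earlier.
rouge = {1000: {"en": 30.0, "fr": 28.0}, 2000: {"en": 33.0, "fr": 31.0}}
nli = {1000: {"en": 0.75, "fr": 0.70}, 2000: {"en": 0.66, "fr": 0.60}}
assert best_checkpoint(rouge) == 2000
assert best_checkpoint(nli) == 1000
```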

System Comparisons
We compare the above approaches to three additional baselines. Firstly, we record the number of examples that pass the NLI filter, per language, and select the same number at random ("Random"). We then finetune a model on this randomly selected data, analogously to the "Filtered" model above. Secondly, we introduce a "Self-ROUGE" baseline which selects examples where the ROUGE of the summary with respect to the input document is highest.

Figure 2: Instructions used in our human evaluation study.

Quality: Is the summary comprehensible?
Incomprehensible: The summary is difficult to understand. It has serious grammatical errors, low fluency, and/or repeated information.
Somewhat Comprehensible: The summary makes sense but suffers from grammatical errors, low fluency, and/or repeated information.
Comprehensible: The summary is understandable. It does not exhibit any grammatical errors, disfluencies, or repeated information.

Attribution: Is all the information in the summary fully attributable to the article?
Yes, it is attributable: Select this option if it is accurate to say "The provided news article says. . ." or "According to the news article. . ." with the summary following this phrase.
No, not fully attributable: Select this option if only some of the information is supported in the news article, but other parts of the information are missing from the news article or are not an accurate representation.

Informativeness: Is the summary a good summary of the article?
Bad summary: The summary does not capture the important information in the article, or the captured information is not accurate with respect to the article. It can also exhibit grammatical issues, low fluency, and/or repeated information.
Good summary: The summary captures the important information in the article and presents it accurately and concisely. It does not exhibit any grammatical errors, disfluencies, or repeated information.

Automatic Evaluation
We report ROUGE (Lin, 2004), which is commonly used to measure the informativeness and fluency of model summaries against gold-standard references. We also quantify faithfulness with the reference-free NLI score (Maynez et al., 2020; Honovich et al., 2022, inter alia). Since there are no tokenizers available for many of the languages in XLSum, we report ROUGE-L computed using the SentencePiece tokenization of mT5. For NLI, we compute for each summary whether it is entailed by the input, and report the average over all examples in a partition (test or development set).
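As a reference point for the ROUGE-L computation, a minimal stdlib sketch: the longest common subsequence between prediction and reference determines precision, recall, and their F-measure. Whitespace tokens stand in here for mT5's SentencePiece pieces:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(pred_tokens, ref_tokens):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(pred_tokens), lcs / len(ref_tokens)
    return 2 * p * r / (p + r)

# P = 3/3, R = 3/4, F1 = 2PR/(P+R) = 6/7
score = rouge_l("the cat sat".split(), "the cat sat down".split())
```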

Human Evaluation
In addition to automatic evaluation, we conducted a large-scale human elicitation study assessing different dimensions of the output in all 45 languages. Firstly, we asked participants to read a system summary and assess its Quality (Is the summary comprehensible?) without looking at the source article, using a 1-3 rating scale where 3 means fully understandable and 1 indicates that the summary has serious fluency errors. After the first assessment, participants were shown the source article and were asked to rate the summary according to Attribution (Is all the information in the summary fully attributable to the article?), using the attribution definition provided in Rashkin et al. (2021a), and Informativeness (Is the summary a good summary of the article?). Both assessments used binary judgements (we report the percentage of times each system was rated positively). Figure 2 presents our instructions. In order to evaluate summarization output in such a diverse multilingual setting, we took several measures to scale our study to 45 languages while maintaining high inter-annotator agreement. We used the same instructions in English for all languages and invited bilingual participants (native speakers of the target language who are also proficient in English) to take part in our study. Each participant had to pass a screener test consisting of 25 questions with an accuracy of 85% before they could take part in the study. Finally, we conducted two pilot studies before the final evaluation to give participants feedback and improve agreement. Our final elicitation study was conducted on 100 instances per language, each randomly sampled from the test set. We collected ratings from three different annotators for each data point.

Results
Filtered Model is Best on NLI-based Evaluation Table 3 presents our results on the test set averaged across all 45 languages, for our three model variants (Vanilla, Filtered, and Controlled) and three baselines. For the Filtered and Controlled models we report results for both the best-ROUGE and best-NLI checkpoints, while for the others we only use ROUGE for checkpoint selection, as no NLI model is involved in their training. Per-language results on the validation and test sets are in Appendix C, Tables 9 and 10. We see that the Filtered model outperforms all other models across languages, achieving an average score of 76.49 for the Best-NLI checkpoint (it obtains the best NLI scores in 43 out of 45 languages). This suggests that data filtering is a viable approach for improving the factual consistency of summarization systems. The next best models in terms of NLI are the Filtered and Controlled variants (with Best-ROUGE checkpoints), achieving an average score of 72.17. The Controlled and Vanilla models perform mostly worse than the Filtered variant in terms of NLI with either Best-NLI or Best-ROUGE checkpoints. Note the significant NLI score gap between the Vanilla model and the Best-NLI Filtered model (12.19 points on average). This primarily points to the quality of the unfiltered data, since both models are based on mT5-XXL. The Best-NLI checkpoint outperforms the Best-ROUGE checkpoint for the Filtered model (average NLI scores of 76.49 vs 72.17). However, we observe a degradation of 0.79 NLI points when comparing the Best-NLI and Best-ROUGE checkpoints for the Controlled model.

Table 4 caption (cluster definitions): languages are grouped by training-set size into High, Medium, and Low (less than 6K examples). For language families, the Indo-European cluster represents Bengali, Gujarati, Hindi, Russian, Serbian (Cyrillic and Latin), and Sinhala; the Romance cluster comprises French, Portuguese, and Spanish; the Turkic cluster contains Azerbaijani, Kyrgyz, Turkish, and Uzbek; Semitic languages are Amharic, Arabic, and Tigrinya; the Afro-Asiatic cluster groups together Hausa, Oromo, and Somali; finally, the Indo-Iranian cluster represents Pashto, Persian, and Punjabi; we omit clusters with two members and singletons.

Effect of Entailment
Table 4 caption (excerpt): we also create two subsets depending on whether a language appears in the XNLI dataset used to train our multilingual NLI model (Available; Section 3) or not (Unavailable). Highest ROUGE-L and NLI numbers are in bold.

All models achieve similar ROUGE scores, ranging between 32.98 and 34.00, while the range of NLI scores is much larger (from 64.30 to 76.49).
Comparison against Baseline Approaches Table 3 also compares our models to previous work (mT5-Base; Hasan et al. 2021) and the Self-ROUGE and Random selection baselines. We did not employ any NLI preprocessing in building the baseline models, neither in filtering nor in checkpoint selection. We observe that all model variants (Vanilla, Filtered, and Controlled) are superior to mT5-Base in terms of ROUGE, which is not surprising given the different model capacities (XXL vs Base). We also see that any filtering improves NLI scores (compare Vanilla against Self-ROUGE and Random), incurring a slight decrease in ROUGE, while targeted filtering using NLI yields the best results. We report results on English on its own, as it is the language with the largest number of examples (370k).

ROUGE and NLI across Different Language Groups
Table 4 reports ROUGE-L and NLI scores aggregated over different language groups. Again, we observe that the Filtered model is in most cases superior, including on English. Vanilla scores are better on ROUGE for Low-resource and Afro-Asiatic languages, although the difference against other models is less than 1 ROUGE point. The Controlled model is not better than Filtered or Vanilla in any configuration, irrespective of how languages are grouped into clusters. In conclusion, we find that the Filtered model dramatically improves faithfulness, while maintaining ROUGE performance similar to other models. We present examples of model output in Appendix G.
Human Assessment for Quality, Attribution, and Informativeness Table 5 presents our human evaluation results for Quality, Attribution and Informativeness (it also includes automatic evaluation results for a side-by-side comparison). We provide a per-language analysis in Appendix E (see Tables 15-17) and aggregate statistics using the same groups as in Table 4 (see Tables 18-20).
Unsurprisingly, human reference summaries were more understandable than Vanilla, Filtered, or Controlled summaries, with the fewest fluency issues. Differences between the gold-standard summaries and those generated by the Filtered Best-NLI and Controlled Best-NLI models are, however, not statistically significant (using a one-way ANOVA with post-hoc Tukey HSD tests; p < 0.01). Summaries generated by our Filtered Best-NLI model were the most attributable (or faithful) and informative with respect to their input documents. Differences between the Filtered Best-NLI model and all other comparison systems are statistically significant (using a one-way ANOVA with post-hoc Tukey HSD tests; p < 0.01). In conclusion, human evaluation confirms that the Filtered model is best at generating faithful and informative summaries.
Effect on Summary Length One may argue that we are improving faithfulness by favoring shorter summaries.To study this, we also report in Table 5 the ratio of predicted to target summary length averaged across all test examples, for different models.
As we can see, Best-NLI checkpoints do yield a reduction in predicted length across the different models compared to their Best-ROUGE checkpoints; the length ratios drop from 0.93 to 0.88 for Vanilla, from 0.89 to 0.87 for Filtered, and from 1.00 to 0.81 for Controlled. However, shorter summaries are not necessarily more faithful: the lowest length ratio (0.81) belongs to the Controlled Best-NLI model, which performs worse on NLI, Attribution, and Informativeness than the Filtered Best-NLI model with a higher length ratio (0.87). The Filtered Best-NLI model yields only a marginal reduction in summary length compared to the Vanilla Best-ROUGE summaries (length ratio: 0.87 vs 0.93), but improves on NLI scores (76.50 vs 64.31), Quality (0.86 vs 0.85), Attribution (0.52 vs 0.44), and Informativeness (0.45 vs 0.37).

Metric Learning for Multilingual Summary Evaluation
Our large-scale judgment elicitation study (across multiple languages and system outputs) delivered valuable annotations of document-summary quality (31,499 pairs x 3 quality dimensions x 3 raters).
We next explore whether it is possible to learn metrics for evaluating Quality, Attribution, and Informativeness automatically. Existing metrics (e.g., BLEURT; Sellam et al. 2020) have not targeted summarization specifically, nor considered attribution or multiple languages. Let x = (x_1, . . . , x_r) denote a summary of length r, where each x_i is a token, and let d = (d_1, . . . , d_p) be its corresponding input document of length p. Let {(d_i, x_i, y_i)}_{i=1}^{N} be a training dataset of size N, where y_i ∈ R is the human rating that indicates how good x_i is as a summary of d_i along a specific dimension. Our goal is to learn a function f : (d, x) → y that predicts the human rating.
We finetuned three models based on mT5-XXL (Xue et al., 2021), one per dimension (details in Appendix F). The input was the concatenation of a document and its summary, and the output the human rating. 10% of the elicited ratings (across languages) were reserved for testing, while the remainder was used for training and validation. Table 6 reports correlation coefficients (Pearson's r) between model predictions and (mean) human ratings. mT5-Q, mT5-A and mT5-I denote the learned metrics corresponding to Q(uality), A(ttribution), and I(nformativeness), respectively. In addition, we report correlation coefficients for ROUGE and NLI.
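Pearson's r between metric predictions and mean human ratings can be computed with a short stdlib helper; the toy scores below are illustrative, not the paper's data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

predictions = [0.9, 0.2, 0.7, 0.4]     # hypothetical metric scores
mean_ratings = [1.0, 0.0, 0.66, 0.33]  # mean of three raters' judgments
r = pearson_r(predictions, mean_ratings)
assert r > 0.9  # strongly correlated in this toy example
```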
Overall, we observe that the learned metrics correlate best with human ratings (across dimensions). ROUGE correlates weakly with human judgments and cannot distinguish any dimension in particular, whereas NLI scores reliably correlate with attribution. Our results underscore the need for better and more fine-grained evaluation of summary quality, and also corroborate well-known issues (Gehrmann et al., 2022) with widely adopted lexical overlap-based metrics such as ROUGE.

Conclusion
In this paper, we leveraged factual consistency evaluation to improve summarization models in multiple languages. Extensive experiments on the XLSum dataset showed large gains when training summarization models on a subset of the data selected using the NLI signal. Through a large-scale human evaluation study, we obtained ratings which not only helped us distinguish the best performing systems, but were further used to learn metrics for assessing multilingual summaries along the dimensions of Quality, Attribution, and Informativeness. These metrics could be further used to inspect the quality of summarization datasets. Our annotators found that summaries are (on average) fully faithful to their documents only 52% of the time, and this number is much worse for some languages (e.g., Hausa, Yoruba; see Table 16 in Appendix E).
An interesting avenue for future work is to directly optimize summarization models towards the different quality objectives, e.g., via Reinforcement Learning (Narayan et al., 2018b) or Calibrating Sequence Likelihood (Zhao et al., 2022).

Limitations
While our work covers a large number of languages, it focuses on a specific source and style of summaries. Our experiments focus exclusively on the XLSum dataset (Hasan et al., 2021), which is based on BBC articles where the opening sentence serves as a summary. It would be interesting to explore our methods on additional datasets and text generation tasks, e.g., where the summaries are longer or there are multiple input documents.

Ethics Statement
An ethical consideration that concerns our work is the problem of misinformation. While we take a step towards improving the factual consistency of text generation systems, which in turn should alleviate issues of misinformation, it is important to note that current systems are still far from perfect in this respect and thus should be used with caution.

A Intrinsic NLI Model Evaluation
In this section we present more detailed evaluation results for our multilingual NLI model. We additionally evaluate our model on the TRUE factual consistency benchmark (Honovich et al., 2022). TRUE consists of 11 diverse datasets (including the output of grounded text generation systems), annotated with binary factual consistency labels. Although TRUE only includes English examples, we use it for our evaluation due to its relevance to factual consistency in summarization. Table 8 shows the area under the ROC curve (ROC AUC) results for all datasets in TRUE, where we compare our multilingual model to the T5-11B model trained on ANLI (Nie et al., 2020) reported by Honovich et al. (2022). Results show that our multilingual model performs on par with theirs, while finetuning our model on non-English data causes a slight decrease in performance.
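ROC AUC over binary consistency labels can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A stdlib-only sketch over toy scores (not the benchmark data):

```python
def roc_auc(scores, labels):
    """ROC AUC via the rank interpretation: the fraction of
    positive/negative pairs where the positive is scored higher
    (ties count as half a win)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    total = len(pos) * len(neg)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / total

# A perfect ranker separates consistent (1) from inconsistent (0) summaries.
assert roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]) == 1.0
# A ranker at chance level scores 0.5.
assert roc_auc([0.9, 0.2, 0.8, 0.1], [1, 0, 0, 1]) == 0.5
```

This quadratic-time version is fine for illustration; production implementations sort once and use rank sums instead.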

B Technical Modeling Details
We used the t5x (Roberts et al., 2022) framework for all training and inference tasks. We ran all experiments on TPU accelerators. Table 12 shows details of how we grouped languages into different clusters (i.e., by family and by the availability of NLI training data). Table 13 shows our results on the XLSum development set clustered by (1) the number of training examples per language, where we group languages into three clusters: High (10k-70k examples), Medium (9k-16k examples), and Low (less than 6k examples); (2) language family (Indo-European, Romance, Turkic, Semitic, Afro-Asiatic, and Indo-Iranian); and (3) whether XNLI training data is available, clustering languages into two subsets: those that appear in the XNLI dataset used to train our multilingual NLI model, and those that do not (see Table 1). We report results on English on its own, as it is the language with the largest number of examples (370k).

D Human Evaluation Setup and Annotator Qualifications
Figure 3 presents a snapshot of the interface seen by our participants together with the instructions used in our human evaluation studies.
To recruit our participants, we screened their language skills to determine whether they are native speakers, as well as their education level, country of residence, and country of origin. For some languages we could not recruit native speakers in their country of birth due to various restrictions and sourcing difficulties, so we hired native speakers residing in other countries. In addition, we created a screener test to determine the raters' suitability for the task. In total, we recruited 388 raters across all 45 locales: 2.58% of them hold a doctorate, 31.96% hold a master's degree, 57.73% hold a bachelor's degree, and 7.73% hold a high school degree or equivalent. Table 14 presents the demographics of our participants. All our annotators are paid adequately by our suppliers, adhering to the supplier code of conduct.

E Detailed Human Evaluation Results
Table 15 presents human evaluation results for Summary Quality for individual languages on the XLSum test set. Table 16 shows mean judgments for Attribution, again per language, and finally Table 17 summarizes our results for Informativeness. We also group human judgments according to the number of training examples, language family, and whether XNLI training data is available. Tables 18-20 show these different types of clustering for the judgments pertaining to Summary Quality, Attribution, and Informativeness.

F Metric Training Details
The metrics were trained by finetuning mT5-XXL to predict a binarized version of the human judgments (a summary receives a score of 1 if the mean human rating is greater than 0.5). Each metric was trained for 20,000 steps with a batch size of 32 and a learning rate.
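The labeling rule for the training targets can be sketched as follows; this is a minimal illustration of the binarization step, not the actual finetuning code:

```python
from statistics import mean

def binarize_judgments(ratings):
    """Map the per-summary human ratings to a binary training label:
    1 if the mean rating exceeds 0.5, else 0."""
    return 1 if mean(ratings) > 0.5 else 0

print(binarize_judgments([1, 1, 0]))  # → 1
print(binarize_judgments([1, 0]))     # → 0 (a mean of exactly 0.5 is not > 0.5)
```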

G Example Output
We showcase summaries generated by our models in Tables 21, 22, and 23. The article in Table 21 discusses a cholera outbreak in Algeria, with two deaths, 46 confirmed cases, and 88 suspected cases. The reference summary additionally mentions that there have been 139 hospitalizations since August 2018; however, the number of hospitalizations is not given in the input document. The Vanilla summary manages to hallucinate two facts: the deaths were not "several", there were only two, and the number of suspected cases is 88, not 100. The Controlled summary is factual, although perhaps sparse on details: it only mentions the cholera deaths but not the cases. The Filtered summary, on the other hand, correctly mentions the number of confirmed and suspected cases but does not mention the deaths. The article in Table 22 discusses trials of an Ebola vaccine in Oxford. The trial involves 72 volunteers, and preliminary tests on monkeys have shown that the vaccine confers immunity against Ebola. Similar small-scale trials are underway in the United States and in three African countries spared from the epidemic. The Reference summary is factually correct. The Vanilla summary gives the false impression that the Ebola trial is large-scale; by omitting the adjective "large-scale", the Controlled summary is factual, and likewise the Filtered summary does not include any hallucinations.
The article in Table 23 discusses Greenpeace activists arrested in Russia on piracy charges for protesting against an oil rig in the Arctic Sea. Among the 30 arrested, two are Argentinian and one is Brazilian. Five of them were accused of climbing the oil rig, and their detention was extended by two months. The Reference summary is factually correct. The Vanilla summary has slight fluency issues: it hallucinates the location of the oil rig to be in the Black Sea and misrepresents the protest as being about the oil rig's closure. The Controlled summary is factual but focuses on the oil rig climbers, and likewise the Filtered summary does not contain any hallucinations but focuses only on the Argentine activists.

For training resources we consider three groups: High (10K-70K examples), Medium (6K-10K), and Low (less than 6K). For language families, the Indo-European cluster represents Bengali, Gujarati, Hindi, Russian, Serbian (Cyrillic and Latin), and Sinhala; the Romance cluster comprises French, Portuguese, and Spanish; the Turkic cluster contains Azerbaijani, Kyrgyz, Turkish, and Uzbek; the Semitic languages are Amharic, Arabic, and Tigrinya; the Afro-Asiatic cluster groups together Hausa, Oromo, and Somali; finally, the Indo-Iranian cluster represents Pashto, Persian, and Punjabi. We omit clusters with two members and singletons. We also create two subsets depending on whether languages appear in the XNLI dataset used to train our multilingual NLI model (Available) or not (Unavailable). Highest scores are in bold.
Q1: Is the summary comprehensible?
Incomprehensible: The summary is difficult to understand. It can have serious grammatical errors, low fluency, and/or repeated information.
Somewhat Comprehensible: The summary generally makes sense but suffers from grammatical errors, low fluency, and/or repeated information.
Comprehensible: The summary is understandable. It does not exhibit any grammatical errors, disfluencies, and/or repeated information.

Q2: Is all the information in the summary fully attributable to the article?
Yes, it is attributable: Select this option if it is accurate to say, "The provided news article says. . ." or "According to the news article. . ." with the summary following this phrase.
No, not fully attributable: Select this option if only some of the information is supported in the news article, but other parts of the information are missing from the news article or are not an accurate representation.

Q3: Is the summary a good summary of the article?
Bad summary: The summary does not capture the important information in the article, or the captured information is not accurate with respect to the article. It can also exhibit grammatical issues, low fluency, and/or repeated information.
Good summary: The summary captures the important information in the article and presents it accurately and concisely. It does not exhibit any grammatical errors, disfluencies, and/or repeated information.
Figure 3: A snapshot of the interface and instructions used in our human evaluation studies.

Document
Los ecologistas están acusados de piratería. En la imagen, el portavoz de Greenpeace Internacional, Roman Dolgov.
The environmentalists are accused of piracy. In the image, the spokesman of Greenpeace International, Roman Dolgov.
Todos los detenidos esperan ser llamados por el tribunal de la ciudad de Murmansk, en el norte de Rusia, para que cada uno de sus casos sea escuchado.
All the detainees are waiting to be called by the court of the city of Murmansk, in northern Russia, so that each of their cases can be heard.

Reference Summary
Tres latinoamericanos se encuentran en la lista de los 30 activistas de Greenpeace arrestados en Rusia por protestar la semana pasada contra la exploración de petróleo en aguas árticas.
Three Latin Americans are on the list of 30 Greenpeace activists arrested in Russia for protesting last week against oil exploration in Arctic waters.

Vanilla
Activistas argentinos, brasileños y argentinos, que fueron arrestados en Rusia por protestar en el mar Negro en protesta por el cierre de una plataforma petrolera en el norte de Rusia, están acusados de piratería.
Argentine, Brazilian and Argentinian activists, who were arrested in Russia for protesting in the Black Sea to protest the closure of an oil rig in northern Russia, are accused of piracy.

Controlled
Un tribunal ruso ordenó la detención de cinco miembros del grupo ecologista Greenpeace, acusados de intentar escalar una plataforma petrolera en el Ártico.
A Russian court ordered the detention of five members of the environmental group Greenpeace, accused of trying to scale an oil platform in the Arctic.

Filtered
Los dos argentinos que fueron arrestados en Rusia por protestar contra el desarrollo de una plataforma petrolera en el Ártico están en riesgo de ser encarcelados por piratería.
The two Argentines who were arrested in Russia for protesting against the development of an oil rig in the Arctic are at risk of being jailed for piracy.

Figure 1: NLI and ROUGE scores for different models on the Arabic development set of XLSum during finetuning. Using a multilingual entailment model during training (via data filtering or controlled generation) improves summary quality over a baseline model trained without using the entailment signal.

Figure 2: Instructions used in our human evaluation.

highest. Again, we choose the same number of examples as those which passed the NLI filter, and finetune a model on this data. Finally, we compare against model output from Hasan et al. (2021), who finetuned an mT5-Base pretrained model.
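The baseline described in the fragment above selects a fixed-size training subset by ranking examples with a score (e.g., self-ROUGE), so that it uses the same number of examples as the NLI-filtered subset. A hypothetical sketch of this selection step, with illustrative names:

```python
def select_top_k(examples, scores, k):
    """Keep the k highest-scoring training examples, so that a ranking
    baseline (e.g., self-ROUGE) trains on the same number of examples
    as the NLI-filtered subset."""
    ranked = sorted(zip(scores, examples), key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in ranked[:k]]

print(select_top_k(["a", "b", "c"], [0.1, 0.9, 0.5], 2))  # → ['b', 'c']
```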

Table 1: Statistics on XLSum training data: total number of examples per language and the proportion of examples where the summary was entailed by the input (% Ent). Languages in both XLSum and XNLI are underlined; for other languages NLI classification is zero-shot. Chinese (S/T) refers to simplified/traditional; Serbian (C/L) is a shorthand for Cyrillic and Latin, respectively; and Scottish (G) abbreviates Gaelic.

Table 2: Number of finetuning steps for the best checkpoint for each model according to NLI and ROUGE on the XLSum development set.

Table 4: ROUGE-L and NLI scores on the XLSum test set for the best checkpoints, averaged across language groups. For training resources we consider three groups with varying numbers of training examples: High (10K-70K), Medium (6K-10K), and Low (less than 6K).

Signal on ROUGE As shown above, NLI scores improve in all languages when training uses the signal from the NLI model, either by filtering data or by using controlled generation. But what is the effect on ROUGE? Looking at the average ROUGE scores across languages in Table 3, we again see that the best ROUGE is obtained by the Filtered model with the Best-ROUGE checkpoint. Interestingly, this model is trained on far fewer examples, but obtains better results than the Vanilla and Controlled variants that use all training examples in XLSum. This model obtains NLI scores (72.17 and 76.49 for Best-ROUGE and Best-NLI, respectively) that are higher than or comparable to those of the other models, suggesting that it is more accurate with respect to the reference summaries and more faithful with respect to the input. In general, the Vanilla, Filtered, and Controlled models obtain very similar results.

Table 4 (continued): We also cluster languages into two subsets, those that appear in the XNLI dataset used to train our NLI model, and those that do not (see Table 1).

Table 5: Mean human judgments on the XLSum test set averaged across languages. We also include ROUGE-L and NLI scores for a side-by-side comparison. Length Ratio is the ratio of predicted length to target length averaged across all test examples. Best results in each row are in bold.

Table 6: Correlation of metrics with human summary ratings for the dimensions of Quality (Q), Attribution (A), and Informativeness (I) on the test set. All correlations are statistically significant at p < 0.01.

Table 9 (development set) and Table 10 (test set) report ROUGE-L and NLI scores on XLSum broken down by individual language. Table 11 compares our Filtered model against previous work (mT5-Base, Hasan et al. 2021) and the Self-ROUGE and Random selection baselines. Results are presented per language and on average over the XLSum test set.

Table 7: Accuracy results on the XNLI test set.

Table 9: ROUGE-L and NLI scores per language on the XLSum development set for the Best-ROUGE and Best-NLI checkpoints (chosen by averaging across all languages). Highest scores in each row are in bold.

Table 11: ROUGE-L and NLI scores per language on the XLSum test set for our Filtered model vs. comparison systems. For simplicity, all models are compared using their Best-ROUGE checkpoints. XLSum mT5-Base predictions are taken from the original XLSum paper (Hasan et al., 2021); however, we report recomputed ROUGE-L using the SentencePiece tokenization of mT5 to make it comparable with the others. See Section 5.2 for more details on the Self-ROUGE and Random baselines.

Table 12: Classification of XLSum languages into families and their membership in XNLI.

Table 14: Geographic characteristics of our participants.

Table 15: Mean human judgments for Summary Quality per language on the XLSum test set for the Best-ROUGE and Best-NLI checkpoints. We also include judgments for Reference summaries.

Table 16: Mean human judgments for Attribution per language on the XLSum test set for the Best-ROUGE and Best-NLI checkpoints. We also include judgments for Reference summaries.

Table 17: Mean human judgments for Informativeness per language on the XLSum test set for the Best-ROUGE and Best-NLI checkpoints. We also include judgments for Reference summaries.

Table 18: Mean human judgments on Summary Quality for the best checkpoints averaged across language groups with (1) varying numbers of training resources, (2) language families, and (3) whether XNLI data is available. See Table 4 for more details about the different groups. Best results in each row are in bold.

Table 19: Human evaluation results for Attribution for the best checkpoints averaged across language groups with (1) varying numbers of training resources, (2) language families, and (3) whether XNLI data is available. See Table 4 for more details about the different groups. Best results in each row are in bold.

Table 20: Human evaluation results for Informativeness for the best checkpoints averaged across language groups with (1) varying numbers of training resources, (2) language families, and (3) whether XNLI data is available. See Table 4 for more details about the different groups. Best results in each row are in bold.

Table 23: Input XLSum document in Spanish, accompanied by the reference summary and summaries generated by the Vanilla, Controlled, and Filtered models, respectively. English translations of the summaries are shown in italics.