SimCSum: Joint Learning of Simplification and Cross-lingual Summarization for Cross-lingual Science Journalism

Cross-lingual science journalism is a recently introduced task that generates popular science summaries of scientific articles in a language different from the source for non-expert readers. A popular science summary must contain the salient content of the input document while remaining coherent and comprehensible. At the same time, generating a cross-lingual summary of scientific texts in the local language of the target audience is challenging. Existing research on cross-lingual science journalism investigates the task with a pipeline model that combines text simplification and cross-lingual summarization. We extend this research by introducing a novel multi-task learning architecture that combines the aforementioned NLP tasks: our approach is to jointly train the two high-level tasks in SimCSum for generating cross-lingual popular science summaries. We investigate the performance of SimCSum against the pipeline model and several other strong baselines with several evaluation metrics and human evaluation. Overall, SimCSum demonstrates statistically significant improvements over the state of the art on two non-synthetic cross-lingual scientific datasets. Furthermore, we conduct an in-depth investigation into the linguistic properties of the generated summaries and an error analysis.

1 Introduction
A real-world example of cross-lingual science journalism is Spektrum der Wissenschaft. It is the German version of Scientific American and an acclaimed bridge between local readers and the latest scientific research in Germany. Spektrum's journalists read English scientific articles and summarize them into popular science stories in German that are comprehensible to local non-expert readers. Spektrum der Wissenschaft approached us to automate their journalists' work. We define cross-lingual science journalism as the fusion of two high-level NLP tasks: text simplification and cross-lingual scientific summarization.
Cross-lingual science journalism aims to generate science summaries in a target language from scientific documents in a source language while emphasizing simplification. The readers of science magazines, usually non-experts in scientific fields, can then grasp complex scientific concepts expressed in easy-to-understand language. Moreover, the task is extendable to different readability levels by age and education (adults, teens and kids) and to various local languages. However, we limit our study to English to German, targeting adult non-expert readers, to automate the process for Spektrum der Wissenschaft.
As no prior work exists to the best of our knowledge, we consider the two closest tasks for finding recent advancements: monolingual science journalism and cross-lingual summarization. In monolingual science journalism, we observe a trend of treating it as a downstream task of abstractive summarization (Dangovski et al., 2021; Zaman et al., 2020) with customized datasets (Zaman et al., 2020; Goldsack et al., 2022). However, these datasets are not suitable for cross-lingual science journalism. Cross-lingual summarization studies can be divided into pipeline (Ouyang et al., 2019; Zhu et al., 2019, 2020) and Multi-Task Learning (MTL) (Cao et al., 2020; Bai et al., 2021, 2022) models with synthetic datasets, and direct cross-lingual summarization with non-synthetic datasets (Ladhak et al., 2020; Fatima and Strube, 2021). Fatima and Strube (2021) collected their datasets for cross-lingual scientific summarization, so we use them to explore the task.
To investigate cross-lingual science journalism, we propose an MTL-based model, SIMCSUM, that jointly trains for simplification and cross-lingual summarization to improve the quality of cross-lingual popular science summaries. SIMCSUM consists of one shared encoder and two independent decoders, one for each task, based on a transformer architecture, where we consider cross-lingual summarization as our main task and simplification as our auxiliary task.

Contributions
We summarize the contributions as follows:
1. We introduce SIMCSUM, which jointly learns simplification and cross-lingual summarization to improve the quality of cross-lingual science summaries for non-expert readers. We also introduce a strong baseline, Simplify-Then-Summarize, to compare the performance of our proposed model.
2. We empirically evaluate the performance of SIMCSUM against several existing cross-lingual summarization models on two cross-lingual scientific datasets. We also conduct a human evaluation to assess the linguistic qualities of generated summaries.
3. We further analyze the outputs for various lexical, readability and syntactic linguistic features. We also perform an error analysis to assess the quality of the outputs.
2 Related Work

Scientific Summarization
This section focuses on the datasets for scientific summarization. Most science summarization datasets are collected from English scientific papers paired with abstracts: ARXIV (Kim et al., 2016; Cohan et al., 2018), PUBMED (Cohan et al., 2018; Nikolov et al., 2018), MEDLINE (Nikolov et al., 2018) and science blogs (Vadapalli et al., 2018b,a). Some work has been conducted on extreme summarization with a monolingual dataset (Cachola et al., 2020), later extended to cross-lingual extreme summarization (Takeshita et al., 2022). The extreme summarization task generates a one- or two-line summary from a scientific abstract/paper, which makes it different from science journalism. Cross-lingual scientific summarization is an understudied area due to its challenging nature. We find two studies: a synthetic dataset from English to Somali, Swahili and Tagalog created with round-trip translation (Ouyang et al., 2019), and two real cross-lingual datasets from the Wikipedia Science Portal and Spektrum der Wissenschaft for English-German (Fatima and Strube, 2021).

Cross-lingual Summarization
This section focuses on MTL-based cross-lingual summarization. Zhu et al. (2019) develop an MTL model for English-Chinese cross-lingual summarization. They develop two variations of the transformer model (Vaswani et al., 2017), where the encoder is shared and the two decoders are independent. Cao et al. (2020) present an MTL model for cross-lingual summarization by jointly learning alignment and summarization. Their model consists of two encoders and two decoders, each dedicated to one task while sharing contextual representations.
The authors evaluate their model on synthetic cross-lingual datasets for the English-Chinese language pair. Takase and Okazaki (2020) introduce an MTL framework for cross-lingual abstractive summarization by augmenting (monolingual) training data with translations for three pairs: Chinese-English, Arabic-English and English-Japanese. The model is a transformer encoder-decoder with prompt-based learning, in which each training instance is prefixed with a special prompt to signal the example type. Bai et al. (2021) develop a variation of multi-lingual BERT for English-Chinese cross-lingual abstractive summarization. The model is trained with a few shots of monolingual and cross-lingual examples. Bai et al. (2022) extend their work by introducing an MTL model that improves cross-lingual summaries by combining cross-lingual summarization with compression rates. They add a compression scoring method at the encoder and decoder of their model and augment their datasets for different compression levels of summaries. One variation consists of cross-lingual and monolingual summarization decoders, while the other consists of cross-lingual and translation decoders.
Most of these studies focus on English-Chinese synthetic datasets emphasizing summarization and translation. Architecturally, SIMCSUM is similar to the Zhu et al. (2019) model, as it also consists of one shared encoder and two task-specific decoders.

Monolingual Science Journalism
This section focuses on science journalism models. Zaman et al. (2020) develop an extension of the PGN (See et al., 2017) by modifying the loss function so that the model is trained for joint simplification and summarization. It is not an MTL model but a summarization model with an added loss for simplification. Moreover, the model is trained on a customized dataset that contains simplified summaries from the EurekAlert science news website. Dangovski et al. (2021) introduce science journalism as a downstream task of abstractive summarization and story generation. They apply BERT-based models with a prompting method for data augmentation on a monolingual dataset collected from ScienceDaily press releases and scientific papers. They use three existing models for their work: SciBERT, a CNN-based sequence-to-sequence model and a story generation model.
We find little similarity between these studies and our work beyond the overlap with abstractive summarization; moreover, we focus on cross-lingual summarization.

Proposed Model
Our model jointly trains for Simplification and Cross-lingual Summarization (SIMCSUM). We first define MTL and our tasks, and then discuss the architecture of our proposed model.

Multi-Task Learning
MTL is an approach in deep learning that improves generalization by learning the different noise patterns of data from related tasks. We define our MTL-based model as trained on two tasks: simplification and cross-lingual summarization. We adopt hard parameter sharing as it improves positive transfer and reduces the risk of overfitting (Ruder, 2017).

Summarization
We define single-document abstractive summarization as follows. Given a text X = (x_1, ..., x_n), an abstractive summarizer generates a summary Y = (y_1, ..., y_m). The decoder learns the conditional probability distribution over the given input and all previously generated words,

P(Y | X; θ) = ∏_{t=1}^{m} p(y_t | y_{<t}, X; θ),    (Eq. 1)

where t denotes the time step.
Cross-lingual summarization adds another dimension of language for simultaneous translation and summarization. Given a text X = (x_1, ..., x_n) in a language j, a cross-lingual summarizer generates a summary Y = (y_1, ..., y_m) in a language k that contains the salient information in X, where m < n and Y consists of words from the vocabulary of language k. The conditional probability is the same as in Eq. 1, the only difference being that the language on the decoder side is different from the encoder side.

Simplification
We define the document-level (lexical and syntactic) simplification task as follows. Given a text X = (x_1, ..., x_n), a simplification model generates a text Y = (y_1, ..., y_m) that retains the primary meaning of X yet is more comprehensible than X, where m ≈ n and Y consists of words from a simpler vocabulary of the same language. The conditional probability is also the same as in Eq. 1.

SimCSum
We illustrate the framework of SIMCSUM: the shared encoder feeds the two task-specific decoders through the cross-attention layer, and the loss is combined to update the parameters (θ). We opt for two decoders because each task's output language and length differ. The training method is described in Algorithm 1.
Here we discuss further details of SIMCSUM. For all mathematical definitions, T ∈ {sim, sum} denotes a task.

Architecture
Considering the excellent text generation performance of multi-lingual BART (mBART) (Liu et al., 2020), we implement the SIMCSUM model based on it and modify it with two decoding sides, one for each task. Each encoder and decoder stack consists of 12 layers.
Self-Attention. Each layer of the encoder/decoder has its own self-attention, consisting of keys, values and queries generated from the same sequence:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where Q is a query, K^T is the transposed K (key) and V is the value. All parallel attention heads are concatenated to generate multi-head attention scaled with a weight matrix W.
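As a sanity check of the formula above, scaled dot-product attention can be sketched in plain Python over lists. This is a didactic single-head sketch, not the actual mBART implementation:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V, d_k):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

In the multi-head case, several such attentions run in parallel on projected Q, K and V, and their outputs are concatenated and scaled with the weight matrix W.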
Cross-attention. The cross-attention connects the encoder and the decoders and provides each decoder with a weight distribution at each step, indicating the importance of each input token in the current context. We concatenate the cross-attention of both decoders:

CrossAttention_T = softmax(D_T E^T / √d_k) E

where E is the encoder representation, D_T is the task-specific decoder contextual representation, and d_k is the model size.

Training Objective
We train our model end-to-end to maximize the conditional probability of the target sequence given a source sequence. We define the task-specific loss as follows.
L_T(θ) = −(1/N) Σ_{i=1}^{N} Σ_t log p(y_t^(i) | y_{<t}^(i), x^(i); θ)

where x represents the input, y is the target, N is the mini-batch size, t is the time step and θ denotes the learnable parameters. We define the total loss of our model as the weighted sum of the task-specific losses,

L(θ) = Σ_T λ_T L_T(θ),

where λ_T is the weight assigned to each task.
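As an illustration, the weighted combination of the two task losses can be sketched as below, with λ_sum = 0.75 as used later in the experiments; the complementary weight λ_sim = 1 − λ_sum is our assumption, since the simplification weight is not stated explicitly:

```python
def total_loss(loss_sim, loss_sum, lambda_sum=0.75, lambda_sim=None):
    """Combine task-specific losses: L(θ) = Σ_T λ_T L_T(θ).

    lambda_sim defaults to 1 - lambda_sum; this complementary
    weighting is an assumption, not stated in the paper.
    """
    if lambda_sim is None:
        lambda_sim = 1.0 - lambda_sum
    return lambda_sum * loss_sum + lambda_sim * loss_sim
```

During training, this scalar would be back-propagated through both decoders and the shared encoder in a single update step.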

Datasets
We use two non-synthetic cross-lingual scientific summarization datasets (Fatima and Strube, 2021). Spektrum der Wissenschaft is a famous science magazine (the German edition of Scientific American). It covers various topics in diverse scientific fields: astronomy, biology, chemistry, archaeology, mathematics, physics, etc. The SPEKTRUM dataset contains 1510 English articles (avg. 2337 words) and German summaries (avg. 361 words).

Simplification
We construct a synthetic WIKIPEDIA dataset for the simplification task by applying Keep-It-Simple (KIS) (Laban et al., 2021). To create the simplified WIKIPEDIA, we fine-tune KIS on WIKIPEDIA English articles, as KIS is an unsupervised model and does not require parallel data. The simplified WIKIPEDIA consists of the original English articles paired with simplified English articles.
We perform English text simplification because most simplification work has been done for English (Al-Thanyyan and Azmi, 2021), while very few studies cover German (Aumiller and Gertz, 2022; Weiss and Meurers, 2018; Hancke et al., 2012), and those target children and dyslexic readers (not suitable for scientific simplification). Moreover, most of the work focuses on the lexical or sentence level (Sun et al., 2021). To the best of our knowledge, KIS is the only SOTA paragraph-level unsupervised simplification model.

Split and Usage
We use WIKIPEDIA for training, validation and testing (80/10/10), while we use SPEKTRUM for zero-shot adaptability as a case study.
All PLM baselines are trained on WIKIPEDIA, where each instance I in the training set consists of <x, y>, where x is the input English text and y is the target German summary.
SIMCSUM is trained on WIKIPEDIA, where each instance I in the training set contains <x, y_sim, y_sum>, where x denotes the input English article, y_sim refers to the simplified English article and y_sum is the target German summary.

Models
Baselines. Almost all cross-lingual MTL models in §2 are based on translation and summarization, and none of them applies simplification. We therefore select several state-of-the-art (SOTA) PLMs that accept long input texts as baselines.
In addition, we define a baseline, Simplify-Then-Summarize, based on KIS and mBART models as a pipeline.We report it as KIS-mBART in our experiments.
SimCSum. We set λ_sum = 0.75 for SIMCSUM based on the best results on the WIKIPEDIA validation set.

Training and Inference
The libraries, hardware and training time details are presented in Appendix A. Here, we discuss hyper-parameters.
Baselines. We fine-tune all models for a maximum of 25 epochs and average the results of 5 runs for each model. We use a batch size of 4-16, depending on the model size. We use a learning rate (LR) of 5e-5 and 100 warm-up steps to avoid over-fitting of the fine-tuned models. We use the Adam optimizer with a linearly decaying LR scheduler. The encoder language is set to English, and the decoder language to German.
SimCSum. We adopt settings similar to the baselines, except for the batch size, which is fixed to 4. At inference time, we only generate tokens from the summarization decoder. We use beam search of size 5 and a tri-gram block during decoding to avoid repetition.
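The tri-gram block can be sketched as a simple check applied to each beam candidate. This is a sketch of the standard no-repeat-n-gram heuristic, not the exact implementation used here:

```python
def repeats_trigram(prefix, candidate):
    """Return True if appending `candidate` to the generated `prefix`
    would produce a tri-gram that already occurs in the sequence."""
    seq = list(prefix) + [candidate]
    if len(seq) < 4:
        return False  # too short for a repeated tri-gram
    new_tri = tuple(seq[-3:])
    seen = {tuple(seq[i:i + 3]) for i in range(len(seq) - 3)}
    return new_tri in seen
```

During beam search, candidates for which this check returns True are pruned (or assigned a log-probability of negative infinity) before the beams are re-ranked.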

Evaluation
Automatic. We evaluate all models with three metrics. ROUGE (Lin, 2004) is a standard metric for summarization. BERT-score (Zhang et al., 2020b) (BS) is a recent metric for summarization and simplification that applies contextual embeddings as an alternative to n-gram-based metrics. For English text simplification, SARI and the Flesch-Kincaid Grade Level (FKGL) are the most widely used metrics (Kariuk and Karamshuk, 2020; Omelianchuk et al., 2021; Laban et al., 2021). As our output language is German, we use a variation of the Flesch-Kincaid score for the German language, i.e., Flesch Kincaid Reading Ease (FRE) (Kincaid et al., 1975).
Human. We conduct a human evaluation to compare the outputs of SIMCSUM with mBART (baseline) for the same linguistic properties. Our annotators are two university students from the Computational Linguistics department with fluent German and English skills. It is worth mentioning that human evaluation of long cross-lingual scientific text is challenging and costly because it requires bi-lingual annotators with a scientific background.
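For orientation, the German Flesch Reading Ease variant is usually computed with Amstad's recalibrated constants from average sentence length and average syllables per word. We assume that variant in this sketch, as the section does not spell the constants out:

```python
def fre_german(avg_sentence_length, avg_syllables_per_word):
    """Amstad's German Flesch Reading Ease (assumed variant):
    FRE_de = 180 - ASL - 58.5 * ASW.

    Higher scores mean easier text; 30-50 corresponds to text
    best understood by college graduates.
    """
    return 180.0 - avg_sentence_length - 58.5 * avg_syllables_per_word
```

Note that the constants differ from the English formula because German words are on average longer; reusing the English coefficients would systematically underestimate German readability.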

WIKIPEDIA
We report the F-score of ROUGE and BERT-score and the FRE of all models in Table 1. The first block includes the fine-tuned PLMs, the second block presents the pipeline baseline, and the last block includes SIMCSUM. From Table 1, we find that SIMCSUM outperforms all baselines on every metric. We compute the statistical significance of the results with the Mann-Whitney two-tailed test (p < .001). Interestingly, WIKIPEDIA summaries are not simplified compared to SPEKTRUM summaries; still, SIMCSUM performs better on WIKIPEDIA than the baselines. We interpret this as the simplification auxiliary task helping SIMCSUM to learn better contextual representations and produce more relevant German words. We infer from the results that joint learning of simplification and cross-lingual summarization improves the quality of summaries. Among the baselines, almost all models demonstrate comparable performance except LONG-ED. For R, KIS-mBART performs better than the other models; however, mBART and XLSUM perform similarly. PEGASUS takes the lead for R, and mBART shows higher performance for RL. KIS-mBART and mBART take the lead for BS among the baselines. For FRE, a score between 30 and 50 represents the readability level best understood by college graduates; the WIKIPEDIA summaries fall in this range. KIS-mBART performs slightly better than mBART for R and FRE, equal for BS and slightly lower for R and RL. We infer that this is due to the impact of the simplification module in KIS-mBART.

Case Study: SPEKTRUM
Table 2 presents the results of all models on SPEKTRUM. We find a similar pattern: SIMCSUM outperforms all baselines. We also compute the statistical significance of these results with the same procedure. The SPEKTRUM results are lower than the WIKIPEDIA results due to zero-shot adaptability, especially for R. We infer that this is due to the computation method of the ROUGE score, as it is an n-gram-based metric (Ng and Abrecht, 2015). The SPEKTRUM summaries have higher FRE scores than WIKIPEDIA. Interestingly, we find that all baselines score lower than the GOLD summaries; however, the SIMCSUM score is similar to the GOLD summaries. Comparing mBART and KIS-mBART, KIS-mBART performs slightly lower than mBART for all scores except R, because only WIKIPEDIA is used for fine-tuning both models in KIS-mBART.
Human Evaluation. We compare the SIMCSUM and mBART outputs for analyzing linguistic qualities because SIMCSUM's architecture is based on mBART. We provide 30 × 2 (for each model) random summaries with their original texts. We ask two annotators to evaluate each document for three linguistic properties on a Likert scale from 1-5. The first five samples are used to calibrate the annotations, and then each annotator provides independent judgments on the rest of the samples.
Table 3 shows the human evaluation results. The samples used for calibration are not used for computing the scores (guidelines in Appendix C). We compute inter-rater reliability using Krippendorff's α. We find that SIMCSUM improves the fluency, relevance and readability of the outputs. We present a few comparative examples of SIMCSUM and mBART in Appendix E.
6 Analysis: SPEKTRUM
We explore three further dimensions along with extended readability for in-depth analysis: lexical diversity, syntactic properties and error types, to determine the quality of generated summaries. These types of analysis are well-known in NLP for textual analysis (Aluisio et al., 2010; Hancke et al., 2012; Vajjala and Lučić, 2018; Mosquera, 2022; Weiss and Meurers, 2022). The lexical diversity and readability scores are computed over all SPEKTRUM reference summaries (Gold) and the outputs of mBART and SIMCSUM. The gold summaries' score serves as a guideline for how similar the models' outputs are to the gold summaries.

Lexical Diversity
Lexical diversity estimates how language is distributed overall and how much cohesion is present in the text through synonyms. It is a good indicator of the readability of a text. We calculate Shannon Entropy Estimation (SEE) (Shannon, 1948) and the Measure of Textual Lexical Diversity (MTLD) (McCarthy, 2005) to find lexical diversity (see Appendix B §B.1 for formulas). SEE reflects a text's "informational value" and language diversity. It is a language-dependent feature, and its value varies across languages. Higher SEE scores suggest higher lexical diversity. We aim to get SEE similar to the Gold summaries. Table 4 shows that the SEE scores of mBART and SIMCSUM are similar to the Gold summaries, suggesting similar informational value across all summaries.
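Shannon entropy over the token distribution of a text can be computed as follows; the exact tokenization used in the paper is an assumption here:

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """SEE: -sum over words w of p(w) * log2 p(w),
    where p(w) is the relative frequency of w in the text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```

A text that repeats one word has entropy 0, while a uniform distribution over n distinct words yields log2(n) bits, the maximum for that vocabulary size.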
MTLD is considered a robust version of the type-token ratio (TTR) and calculates lexical diversity without being affected by text length. Higher MTLD represents greater vocabulary richness. Table 4 presents the MTLD scores of mBART and SIMCSUM.
The gold summaries have the highest score, SIMCSUM the second highest, and mBART the lowest. These scores suggest that the lexical richness of the three groups is not similar, in contrast to the SEE results. However, the SIMCSUM outputs are more lexically diverse than the mBART outputs. We infer from the improved SIMCSUM scores that joint learning of simplification and cross-lingual summarization impacts word generation. These results also suggest that MTLD provides a better estimation of lexical diversity for our summaries.
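MTLD walks through the token sequence and counts a "factor" each time the running type-token ratio falls to a threshold (0.72 in McCarthy's formulation); the text length divided by the factor count gives the score. A forward-pass sketch follows; the full measure averages a forward and a backward pass, which we omit here:

```python
def mtld_forward(tokens, threshold=0.72):
    """Forward-pass MTLD: number of tokens per TTR factor."""
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1              # full factor completed; reset window
            types, count = set(), 0
    if count > 0:                     # partial factor for the tail
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors else float(len(tokens))
```

A highly repetitive text completes factors quickly and scores low, while a text whose TTR never drops (all tokens distinct) scores at its own length, which is why MTLD rewards vocabulary richness independently of text length.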

Readability Scores
Readability scores measure the comprehension level of a text. One syllable-based readability score was already presented in §5. Coleman and Liau (1975) suggest that word length in letters is a better predictor of readability than syllables. We calculate the Coleman-Liau Index (CLI) (Coleman and Liau, 1975) and the Automated Readability Index (ARI) (Senter and Smith, 1967), as these do not rely on syllables (see Appendix B §B.1 for formulas). CLI computes scores from word lengths; ARI computes scores from characters, words and sentences. For both CLI and ARI, a lower score is better, as it indicates ease of reading and understanding. We see from Table 4 that the Gold summaries have the lowest score, SIMCSUM the second lowest, and mBART the highest. We infer from the improved SIMCSUM scores that joint learning of simplification and cross-lingual summarization has an impact at both the word and the sentence level, because CLI only considers words, while ARI also includes sentences.

Syntactic Analysis
Syntactic analysis elaborates how words and phrases are related in a sentence structure. We infer from the average sentence length (ASL) that SIMCSUM produces shorter sentences than mBART and the gold summaries, which exhibits syntactic simplicity. A small average dependency distance (ADD) shows that words in a dependency relation are close together, making the text easier to understand. Table 5 shows that SIMCSUM summaries have a smaller average dependency distance than mBART, much closer to the gold summaries. Fewer dependents per word (ADW) make a text less ambiguous and thus easier to follow. Table 5 shows the SIMCSUM outputs have fewer dependents than the mBART outputs and are similar to the gold summaries. The average tree height (ATH) reflects the syntactic structural complexity of a sentence. Table 5 shows that SIMCSUM outputs are less structurally complex than mBART outputs; however, the gold summaries have the lowest average tree height. We infer from the syntactic analysis that joint learning of simplification and cross-lingual summarization positively impacts the syntactic properties of summaries.

Error Analysis
To further explore the challenges of improving cross-lingual science summaries, we randomly select 25 × 2 (for each model) summaries from the SIMCSUM and mBART outputs. We find three main categories of errors in the manual inspection. Table 6 presents the occurrences of these errors for each model. Appendix D presents examples from the error analysis and its guidelines.
Non-German Words. This is the error type where the models produce non-existent German words or words that mix English and German (or another language). We find that mBART is more prone to such errors. We infer that this is due to the imbalance between pre-training and fine-tuning dataset sizes: these models are pre-trained on many languages and usually fine-tuned on comparatively small data. SIMCSUM tends to produce fewer errors due to the data augmentation (simplification data) during training.
Wrong Named Entities. This is the error type where the models produce wrong named entities, such as city or country names and persons' first and last names. We find that both models tend to produce such errors; however, the percentage of these errors is quite low. We infer that the models overestimate or underestimate the probability of word sequences present in the data.
Unfaithful Information. This is the error type where we find some (new) information in generated summaries that is not faithful to the source documents. We infer that this error is caused by long inputs, where the model tends to hallucinate and generates content that cannot be verified from the source. We find that SIMCSUM makes errors similar to mBART for this error type.

Conclusions
In this paper, we explore the task of cross-lingual science journalism. We propose a novel multi-task model, SIMCSUM, that combines two high-level NLP tasks: simplification and cross-lingual summarization. SIMCSUM jointly trains for reducing linguistic complexity and for cross-lingual abstractive summarization. We also introduce a strong pipeline-based baseline for cross-lingual science journalism. Our empirical investigation shows the significantly superior performance of SIMCSUM over the SOTA baselines on two non-synthetic cross-lingual scientific datasets, also indicated by the human evaluation. Furthermore, our in-depth linguistic analysis shows how multi-task learning in SIMCSUM affects the lexical and syntactic properties of the generated summaries. Finally, we perform an error analysis to identify the kinds of errors produced by the model. In the future, we plan to add modules for linguistically informed simplification.
We proposed SIMCSUM for the cross-lingual science journalism task and verified its performance on the WIKIPEDIA and SPEKTRUM datasets for the English-German language pair. We believe that SIMCSUM is adaptable to other domains and languages. However, we have not verified this experimentally and limit our experiments to English-German scientific texts.
Our model jointly trains on two high-level NLP tasks, which takes slightly more time than its base model, mBART, as it has more parameters to learn during training. Moreover, our model is trained on synthetic simplification data, which may create a dependency on the simplification model, KIS. Therefore, we plan to add linguistically informed simplification modules to our model in future work. We also find during the error analysis that both mBART and SIMCSUM have problems (repetition or unfaithful information) with long inputs, which need further investigation.
CLI is computed as follows:

CLI = 0.0588 × (100 × L / W) − 0.296 × (100 × S / W) − 15.8

where L is the total number of characters (including numbers and punctuation), W is the total number of words, and S is the total number of sentences in a given text.

ARI is computed as follows:

ARI = 4.71 × (L / W) + 0.5 × (W / S) − 21.43

where L, W and S are defined as above.
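The two indices can be computed directly from these counts. This is a sketch: the sentence splitting and the exact character classes counted are assumptions, not the paper's implementation:

```python
import re

def cli_ari(text):
    """Coleman-Liau Index and Automated Readability Index from raw text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    chars = sum(ch.isalnum() for w in words for ch in w)  # letters and digits
    L, W, S = chars, len(words), len(sentences)
    cli = 0.0588 * (100.0 * L / W) - 0.296 * (100.0 * S / W) - 15.8
    ari = 4.71 * (L / W) + 0.5 * (W / S) - 21.43
    return cli, ari
```

Both indices approximate a US school grade level, so longer words (more characters per word) and longer sentences (more words per sentence) both push the scores up.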

B.3 Syntactic Analysis
We use Stanza to extract dependency relations and the Stanford Parser to extract constituency trees for each summary. Before tree generation, we replace all German umlauts (ä, ö, ü) and ß in the summaries with their transliterations (ae, oe, ue and ss) due to encoding issues of the Stanford Parser.
Average Sentence Length. It is the number of tokens in the sentences averaged over the number of sentences in a summary.
Average Dependency Distance.It is the averaged dependency distance over the sentences, which means the distance between the dependency heads and their dependents.
Average Dependents per Word.It computes the average number of dependents for each word.
Average Tree Height.For computing the average tree height of a summary, we calculate the height of every tree and average it over the sentences.
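Given the dependency heads of each sentence (1-based head indices with 0 marking the root, the usual CoNLL-U convention), the four measures above can be sketched as follows; this is independent of Stanza's actual API:

```python
from collections import Counter

def syntactic_measures(sent_heads):
    """ASL, ADD, ADW and ATH for a summary.

    sent_heads: one list per sentence; entry i is the 1-based index
    of the head of token i+1, with 0 marking the root.
    """
    n_tokens = sum(len(h) for h in sent_heads)
    asl = n_tokens / len(sent_heads)  # average sentence length

    dists, dep_counts, heights = [], [], []
    for heads in sent_heads:
        # dependency distance: |position of dependent - position of head|
        dists += [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]
        # dependents per word: how many tokens point at each position
        c = Counter(h for h in heads if h != 0)
        dep_counts += [c.get(i + 1, 0) for i in range(len(heads))]

        # tree height: longest chain from any token up to the root
        def depth(i):
            d = 1
            while heads[i] != 0:
                i, d = heads[i] - 1, d + 1
            return d
        heights.append(max(depth(i) for i in range(len(heads))))

    add = sum(dists) / len(dists)
    adw = sum(dep_counts) / len(dep_counts)
    ath = sum(heights) / len(heights)
    return asl, add, adw, ath
```

For the three-token sentence "the cat sleeps" with heads [2, 3, 0] (determiner attached to the noun, noun to the verb, verb as root), the chain the → cat → sleeps gives a tree height of 3 and every dependency arc spans one position.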

C.1 Task
We provided annotators with 30 examples of documents paired with a reference summary and two system-generated summaries. The models' identities were not revealed. The annotators had to rate each model summary for the following linguistic properties after reading the English document and the German summaries. We asked annotators to use the first 5 examples to resolve conflicts and find a common consensus for rating the linguistic aspects. The rest of the examples were annotated independently.

C.2 Linguistic Properties
We asked annotators to annotate each summary for the following linguistic properties.
Relevance.A summary delivers adequate information about the original text.Relevance determines the content relevancy of the summary.
Fluency.The words and phrases fit together within a sentence, and so do the sentences.Fluency determines the structural and grammatical properties of a summary.
Simplicity.Lexical (word) and syntactic (syntax) simplicity of sentences.A simple summary should have minimal use of complex words/phrases and sentence structure.

C.3 Scale
We use a Likert scale from 1 to 5 to score each property (1:worst | 2:bad | 3:ok | 4:good | 5:best).These scores should be assigned by comparing the outputs of both models.

D.1 Guidelines
We define our informal guidelines for the error analysis as follows. To find the errors in the mBART and SIMCSUM outputs, we compare them to each other, to the SPEKTRUM German gold summary and to the original English text.
Non-German Words. To find these, it is sufficient to read through the model outputs and look up any unknown words. If an unknown word turns out to be a non-German word, we mark it in red.
Wrong Named Entities. We find wrong named entities by comparing the names in both system outputs to the reference summary. If the names differ, we verify with the original text whether they refer to the same entity and thus represent a mistake by the model, and we mark them in blue.
Unfaithful Information. We find new/unfaithful information by looking up every piece of information in the model outputs in the reference summary. We then search for this information in the original text, and if it is not present there, the model produced new information that is not faithful to the source text. We mark this information in orange.

D.2 Examples
Target: im freigehege aufzuwachsen und sich dort im schlamm zu suhlen, stärkt offenbar das immunsystem von schweinen: verglichen mit artgenossen, die in einem stall gehalten wurden, hatten freilandschweine eine höhere anzahl von gesundheitsfördernden darmbakterien. zu diesem ergebnis kommen jetzt wissenschaftler um denise kelly von der university of aberdeen. auch gene, die auf entzündungsreaktionen hindeuten, waren bei tieren, die im außenbereich lebten, wesentlich weniger aktiv. [...]

Translation: Growing up in an outdoor enclosure and wallowing in the mud there apparently strengthens the immune system of pigs: compared to conspecifics that were kept in a pen, free-range pigs had a higher number of health-promoting intestinal bacteria. scientists led by denise kelly from the university of aberdeen are now coming to this conclusion. genes that indicate inflammatory responses were also significantly less active in animals that lived outdoors. [...]

mBART: als fruehmikrobielle kolonisation wird die abdeckung der brustdruese von schweinen bei der fruehen entwicklung des erwachsenen menschen bezeichnet. sie wird in der medizin, der mikrobiologie und der pathologie erforscht. early microbial colonization of the gut reduced the incidence of infectious, entzuendlichen und autoimmunischen diseasen reduced the incidence of infectious, inflammatory and autoimmune diseases.

Translation: the covering of the mammary gland of pigs during the early development of the adult human is referred to as early microbial colonization. it is researched in medicine, microbiology and pathology. early microbial colonization of the gut reduced the incidence of infectious, inflammatory and autoimmune diseases reduced the incidence of infectious, inflammatory and autoimmune diseases.

SIMCSUM: die fruehzeitshygiene ist ein begriff aus der entwicklungsbiologie und bezeichnet das phaenomen, dass die fruehzeitliche besiedlung des darmes durch krankheitserreger verhindert wird. die fruehzeitshygiene unterscheidet sich von anderen entwicklungsbiologischen forschungsgebieten wie der entwicklungsphysiologie, der haematologie und der palaeontologie dadurch, dass in ihrer gesamtheit zur fruehen entwicklungsphase die mikrobielle vielfalt des darmes zaehlt.

Translation: early hygiene is a term from developmental biology and describes the phenomenon that prevents early colonization of the intestines by pathogens. early hygiene differs from other developmental biological research areas such as developmental physiology, haematology and palaeontology in that the microbial diversity of the intestine counts in its entirety for the early development phase.
Table D.1: An example of SPEKTRUM output, where mBART produces non-German words (marked in red) and SIMCSUM generates a summary with wrong attention on hygiene. The summaries are translated via Google Translate.
Target: nicht nur stress und genetische faktoren tragen zu essstörungen bei, sondern auch vorübergehende hungerphasen. nun stießen forscher vom europäischen zentrum für hirnforschung in rom auf einen möglichen grund dafür - und zwar in den gehirnen von labormäuse, die sich auch durch schmerzen nicht vom naschen abbringen ließen. die psychologin rossella ventura und ihre kollegen trainierten mäuse darauf, in einer von zwei kammern nach einer belohnung in form von schokolade zu suchen. dann versetzten sie den tieren ausgerechnet in dem raum, wo die leckerei lockte, schmerzhafte elektroschocks. die mäuse mieden fortan die schokolade und hielten sich vorwiegend in der anderen kammer auf - außer jene, die zuvor einige zeit lang kaum futter bekommen hatten. [...]
Translation: Not only stress and genetic factors contribute to eating disorders, but also periods of starvation. now researchers from the european center for brain research in rome have discovered a possible reason for this - namely in the brains of laboratory mice, which were not deterred from snacking even by pain. psychologist rossella ventura and her colleagues trained mice to look for a chocolate reward in one of two chambers. then they gave the animals painful electric shocks in the room where the treat lured them. From then on, the mice avoided the chocolate and mainly stayed in the other chamber - except for those that had hardly received any food for a while. [...]
mBART: eating disorders (engl. "ess-stoerungen") nennt man eine vielzahl von ursachen und mechanismen, die zur auspraegung von essstoerungen fuehren koennen.
Translation: eating disorders are a number of causes and mechanisms that can lead to the development of eating disorders.
SIMCSUM: als essverhalten oder magersucht werden die verhaltensweisen von menschen und tieren bezeichnet, die excessive (vergoren) oder chronisch (z.b. nach unfaellen oder erkrankungen) zu es verlangen oder dieses verhalten unterdruecken. essverhalten kann zu verhaltensstoerungen fuehren, die ursache fuer die entstehung von essstoerungen sind innere oder aeussere ursachen des essverhaltens.
Translation: eating habits or anorexia are the behaviors of people and animals that demand excessive (fermented) or chronic (e.g. after accidents or illnesses) or suppress this behavior. eating behavior can lead to behavioral disorders, the cause of eating disorders are internal or external causes of eating behavior.
Table D.2: An example of SPEKTRUM output, where mBART produces non-German words (marked in red) and a quite short summary, while SIMCSUM also generates a non-German word. The summaries are translated via Google Translate.
Target: der tod kam auf dem höhepunkt der macht: cangrande i. della scala, der berühmteste spross der scaliger aus verona, hatte in zahlreichen kriegerischen auseinandersetzungen die herrschaft über ganz venetien an sich gerissen. doch kurz nach seinem triumphalen einzug in die norditalienische stadt treviso am 18. juli 1329 begannen mit durchfall und erbrechen die ersten symptome einer tödlichen erkrankung. [...]
Translation: death came at the height of power: cangrande i. della scala, the most famous offspring of the scaliger family from verona, seized control of all of veneto in numerous military conflicts. but shortly after his triumphant entry into the northern italian city of treviso on july 18, 1329, the first symptoms of a fatal illness began with diarrhea and vomiting. [...]
mBART: alberto cangrande ( * 1291 oder 1329 in verona ; † 29. oktober 1329 ebenda ) war ein italienischer dichter, schriftsteller und politiker der flämischen republik. in der zeit der italienischen republik bestieg er von seinen brudern die dynastie der scaligeri. bei der eroberung der elfenbeinküste 1311 wurde er zum maßgebenden gründer der faction in rom. ab 1314 war er herrscher im königreich venedig. cangrande gilt als vater der italienischen
Translation: alberto cangrande ( b. 1291 or 1329 in verona ; † october 29, 1329 ibid ) was an italian poet, writer and politician of the flemish republic. in the time of the italian republic he ascended from his brothers the dynasty of the scaligeri. when he conquered the ivory coast in 1311, he became the main founder of the faction in rome. from 1314 he was ruler of the kingdom of venice. cangrande is considered the father of the italian
SIMCSUM: cangrande della scala oder cangrande cangrande ( * 9. märz 1291 in verona ; † 22. juli 1329 ebenda ) war ein begründer der von den scaligeri beeinflussten dynastie, der unter den herrschern von verona von 1308 bis 1387 regierte. der sohn des veronaer stadtbegründers und stadtmeisters alberto della scala ( genannt " albert " oder " adelsmann " ), der nur als unterstellt war, wurde nach seinem tod 1301 zum bruder von alberto verbannte.
Translation: cangrande della scala or cangrande cangrande ( born march 9, 1291 in verona ; † july 22, 1329 ibid ) was a founder of the dynasty influenced by the scaligeri, who reigned under the rulers of verona from 1308 to 1387. the son of verona's founder and mayor alberto della scala ( called " albert " or " nobleman " ), who was only subordinate, was banished to alberto's brother after his death in 1301.

Target: inoffizieller weltrekordhalter im nicht-frieren trotz eisiger kälte ist der niederländer wim hof: er lief bereits ziemlich nackt halbmarathons nördlich des polarkreises und nahm das längste eisbad seit beginn der aufzeichnungen - ohne sich ernsthaft zu beklagen. wissenschaftliche untersuchungen des eismannes hatten bisher keine körperlichen auffälligkeiten gefunden, die erklären könnten, warum hof mit kälte offensichtlich besser umgeht als der durchschnittsmensch. [...]
Translation: the unofficial world record holder for not freezing despite the icy cold is the dutchman wim hof: he ran half marathons north of the arctic circle quite naked and took the longest ice bath since records began - without seriously complaining. scientific examinations of the ice cream man had not found any physical abnormalities that could explain why hof obviously deals with the cold better than the average person. [...]
mBART: wim hof ( * 20. april 1959 ) ist ein niederländischer sportler, der sich unter anderem mit eisschwimmern beschäftigt.
Translation: wim hof ( born april 20, 1959 ) is a dutch athlete who is involved with ice swimmers, among other things.

E SIMCSUM Examples and Results
Here we present some examples showing the difference between SIMCSUM and mBART.
Target: für menschen ist der einbeinige stand immer eine wackelpartie, vor allem bei geschlossenen augen. um nicht umzukippen, müssen die muskeln permanent die leichten schwankungen ausgleichen. nicht so bei flamingos: sie kostet es weniger kraft, auf einem bein zu stehen als auf zweien. deswegen können sie auch beruhigt auf einem bein ein schläfchen machen, ohne dabei umzukippen. [...]
Translation: standing on one leg is always a shaky game for humans, especially with closed eyes. in order not to tip over, the muscles have to constantly compensate for the slight fluctuations. Not so with flamingos: it takes less strength for them to stand on one leg than on two. that's why you can take a nap on one leg without tipping over. [...]
mBART: die biogerontologie ( von griech. bíos " leben " und lat. protes " befestigt " ) beschäftigt sich mit dem verhältnis von körpergewicht und körperhaltung. so untersucht die biogerontologie das verhältnis von körpergewicht und körperhaltung: welche gewichtszustände und welche muskeln notwendig sind, um einen pilz, der einer krankheit ausgesetzt ist, zu tragen? wie sehr wichtig es ist, den jeweiligen körpergewichtszustand zu messen. dies wird in der
Translation: biogerontology (from greek bíos "life" and lat. protes "fixed") deals with the relationship between body weight and posture. this is how biogerontology examines the relationship between body weight and posture: which weight conditions and which muscles are necessary to carry a fungus that is exposed to a disease? how very important it is to measure the respective body weight condition. this will be in the
SIMCSUM: flamingos ( phoenicopteridae ) oft sitzen auf einem bein, um eine muskelkontraktion zu erzeugen. sie haben die fähigkeit, das körpergewicht aufzunehmen und zu tragen, ohne dabei eine muskelaktive wirkung auszuüben.
Translation: Flamingos (phoenicopteridae) often perch on one leg to produce muscle contraction. they have the ability to absorb and carry body weight without exerting a muscle-active effect.
Table E.1: An example of SPEKTRUM output, where SIMCSUM generates a better summary than mBART. In this case, the mBART summary misses the article's point by focusing on biogerontology, which has no relevance here, while SIMCSUM generates a summary that better captures the idea and main focus of the original text. Marked in orange is information unfaithful to the original text. The summaries are translated via Google Translate.
Target: eine besondere strategie, seinen fressfeinden zu entkommen, hat der bombardierkäfer pheropsophus jessoensis entwickelt. er wehrt sich sogar noch, nachdem er eigentlich schon verloren hat, und das sogar recht erfolgreich, wie jetzt shinji sugiura und takuya sato von der universität kobe in den biology letters berichten. gerade verschluckt, schießt der käfer im magen der kröte einen strahl heißes, ätzendes sekret aus der afterdrüse - eine strategie, die den bombardierkäfern ihren namen eingebracht hat. normalerweise allerdings dient die waffe zur abschreckung. in diesem fall jedoch ätzen sich die käfer mit ihrer hilfe quasi zurück in die freiheit: das sekret lässt die kröten erbrechen ; und der eigentlich schon verspeiste käfer entkommt. [...]
Translation: the bombardier beetle pheropsophus jessoensis has developed a special strategy for escaping its predators. he even defends himself after he has actually already lost, and even quite successfully, as shinji sugiura and takuya sato from the university of kobe now report in the biology letters. just swallowed, the beetle in the toad's stomach shoots a jet of hot, caustic secretion out of the anal gland - a strategy that has given the bombardier beetle its name. normally, however, the weapon serves as a deterrent. in this case, however, the beetles etch their way back to freedom with your help: the toads vomit the secretion ; and the actually already eaten beetle escapes. [...]
mBART: unter zoologischer kompatibilitaet versteht man in der biologie die faehigkeit von beutetieremischlingen, nach einem laubabschweif von artgenossen in den verdauungstrakt von artgenossen zu entweichen. die oekologische kompatibilitaet ist ein wesentlicher bestandteil der erfolgreichen entstehung von beutegreifern. beim laubabschweifexperiment mit hilfe des insekten pheropsophus jessoensis aus zwei kaeferarten, dem bufo japonicus und dem b. torrenticola, konnte nachgewiesen werden, dass der kaefer pheropsophus jessoensis unter laboratorischen bedingungen eine in den magensaftgastropfen der jagdsekten einschleudern kann.
Translation: in biology, zoological compatibility means the ability of hybrid prey animals to escape into the digestive tract of conspecifics after the leaves have drifted away from conspecifics. ecological compatibility is an essential part of the successful emergence of predators. in the leaf-dropping experiment with the help of the insect pheropsophus jessoensis from two beetle species, the bufo japonicus and the b. torrenticola, it could be proven that the beetle pheropsophus jessoensis can inject one into the gastric juice droplets of the hunting sects under laboratory conditions.
SIMCSUM: als wehrverhalten oder verteidigungsverhalten bezeichnet man in der verhaltensbiologie verschiedene verhaltensweisen von tieren, die wehren gegen die fressfeinde oder andere fresstiere bilden. nicht zu verwechseln ist das wehrverhalten mit der phylogenetik, die sich aus der evolutionsgeschichte entwickelt hat.
Translation: in behavioral biology, defensive behavior is the term used to describe various behaviors of animals that defend themselves against predators or other predators. the defensive behavior should not be confused with the phylogenetics that have developed from the history of evolution.
Table E.2: An example of SPEKTRUM output where both SIMCSUM and mBART produce reasonable summaries. The SIMCSUM summary is preferred because it explains the basic idea of the beetle defending itself against predators. The mBART summary is filled with complex words that do not make sense or are unnecessary in this context; these words are marked in orange. The summaries are translated via Google Translate.
Target: manche farben bleiben besser im gedächtnis haften als andere, doch auf die erinnerung an den entsprechend gefärbten gegenstand hat das keinen einfluss. das zeigt nun eine studie von psychologen um christof kuhbandner von der universität regensburg. die ergebnisse machen deutlich, dass der gegenstand selbst und dessen eigenschaften vom gehirn nicht immer als einheit betrachtet werden. die wissenschaftler testeten für ihre studie das gedächtnis ihrer versuchspersonen. dabei variierten sie sowohl, was zu merken war, als auch, wie es dargeboten wurde. insbesondere präsentierten sie wörter und gegenstände in unterschiedlichen farben. [...]
Translation: some colors stick in the memory better than others, but this has no influence on the memory of the correspondingly colored object. this is now shown by a study by psychologists led by christof kuhbandner from the university of regensburg. the results make it clear that the object itself and its properties are not always viewed as a unit by the brain. for their study, the scientists tested the memory of their subjects. they varied both what was to be remembered and how it was presented. in particular, they presented words and objects in different colors. [...]
mBART: die farbpsychologie ist ein teilgebiet der psychologie und beschäftigt sich mit der wirkung von farben auf die wahrnehmung und das gedächtnis.
Translation: color psychology is a branch of psychology and deals with the effect of colors on perception and memory.
SIMCSUM: die farbe ist ein farbmuster, das uns wahrnehmungen und die sinneswahrnehmung durch das sehen vermittelt. die farbwahrnehmung gehört zu den kognitionswissenschaften. wir sehen sie nicht nur als eindimensionales spektrum möglicher farben, sondern auch als eindimensionales, oftmals auch als " visuelles system ", das die sinneswahrnehmung mitanpasst. [...]
Translation: color is a color pattern that conveys perceptions and sensory perception to us through seeing. Color perception is one of the cognitive sciences. we see them not only as a one-dimensional spectrum of possible colors, but also as a one-dimensional, often also as a " visual system " that also adapts the sensory perception. [...]
Table E.3: An example of SPEKTRUM output, where mBART performs better than SIMCSUM. mBART generates a summary that is too short but better recapitulates the main idea. The orange-marked words in the SIMCSUM summary are incoherent and not faithful to the original text. The summaries are translated via Google Translate.

Figure 1: SIMCSUM consists of one shared encoder with two decoders, one for simplification and one for cross-lingual summarization.
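The shared-encoder, two-decoder design implies a joint training objective combining the two task losses. The sketch below illustrates this with a hypothetical mixing weight `lambda_weight`; the symbol and the equal-weight default are our assumptions for illustration, not the paper's exact weighting scheme.

```python
# Minimal sketch of a multi-task objective for a shared-encoder model with
# two decoding sides. lambda_weight is a hypothetical hyperparameter
# balancing simplification against cross-lingual summarization.

def joint_loss(simplification_loss: float,
               summarization_loss: float,
               lambda_weight: float = 0.5) -> float:
    """Weighted sum of the two per-task losses; gradients from both tasks
    would flow into the shared encoder."""
    if not 0.0 <= lambda_weight <= 1.0:
        raise ValueError("lambda_weight must be in [0, 1]")
    return (lambda_weight * simplification_loss
            + (1.0 - lambda_weight) * summarization_loss)

# Example: equal weighting of both decoding sides.
total = joint_loss(2.0, 4.0, lambda_weight=0.5)  # 3.0
```

With `lambda_weight = 1.0` the model degenerates to pure simplification training, with `0.0` to pure summarization; intermediate values realize the joint learning the figure depicts.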

Table 1: The WIKIPEDIA results for all baselines and SIMCSUM. GOLD denotes the reference summaries. Underline refers to the best baseline results, and bold with † denotes the best overall results with significant improvements (p < .001).

Table 2: The SPEKTRUM results for all baselines and SIMCSUM. GOLD denotes the reference summaries. Underline refers to the best baseline results, and bold with † denotes the best overall results with significant improvements (p < .001).

Table 3: The SPEKTRUM human evaluation for mBART and SIMCSUM. The average scores (Krippendorff's α) for each linguistic feature are presented here.
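Krippendorff's α measures inter-annotator agreement as 1 − D_o/D_e (observed over expected disagreement). As a reference for how the nominal-level variant is computed — this is a generic sketch, not the paper's evaluation code — the function below handles any number of raters per item:

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    units: list of lists; each inner list holds the ratings one item received.
    Items with fewer than two ratings are skipped (they are not pairable).
    """
    pairable = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in pairable)          # total pairable ratings
    totals = Counter(v for u in pairable for v in u)
    # Observed disagreement: within-item rating pairs with differing values.
    d_o = 0.0
    for u in pairable:
        m = len(u)
        agree_pairs = sum(c * (c - 1) for c in Counter(u).values())
        d_o += (m * (m - 1) - agree_pairs) / (m - 1)
    d_o /= n
    # Expected disagreement from the pooled value distribution.
    d_e = (n * n - sum(c * c for c in totals.values())) / (n * (n - 1))
    return 1.0 - d_o / d_e if d_e else 1.0

# Two raters agreeing on every item yields perfect agreement (alpha = 1.0).
alpha = krippendorff_alpha_nominal([["a", "a"], ["b", "b"], ["c", "c"]])
```

Systematic disagreement drives α below zero, and α near 0 indicates agreement at chance level, which is why α is reported alongside the average human scores.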

Table 4: Lexical diversity and readability features' average scores (standard deviation).

Table 5: Syntactic features' average scores (standard deviation).
[...] perform it with constituency trees on 25 × 2 (for each model) random summaries from mBART, SIMCSUM, and the gold summaries. The total number of sentences is 70 for mBART, 80 for SIMCSUM, and 131 for gold. Table 5 presents four syntactic features (see Appendix B §B.2 for definitions).

Table 6: Error occurrences for mBART and SIMCSUM summaries; a single summary may contain multiple errors.

Table D.3: An example of SPEKTRUM output, where mBART generates a wrong named entity. SIMCSUM, on the other hand, gets it right but generates a wrong alias for this person. The summaries are translated via Google Translate.

Table D.4: An example of SPEKTRUM output, where both mBART and SIMCSUM produce unfaithful information. Marked in orange is information unfaithful to the original text. The summaries are translated via Google Translate.