Towards Unifying Multi-Lingual and Cross-Lingual Summarization

To adapt text summarization to the multi-lingual world, previous work proposes multi-lingual summarization (MLS) and cross-lingual summarization (CLS). However, these two tasks have been studied separately due to their different definitions, which limits compatible and systematic research on both of them. In this paper, we aim to unify MLS and CLS into a more general setting, i.e., many-to-many summarization (M2MS), where a single model can process documents in any language and generate their summaries in any language. As a first step towards M2MS, we conduct preliminary studies showing that M2MS can better transfer task knowledge across different languages than MLS and CLS. Furthermore, we propose PISCES, a pre-trained M2MS model that learns language modeling, cross-lingual ability and summarization ability via three-stage pre-training. Experimental results indicate that PISCES significantly outperforms the state-of-the-art baselines, especially in the zero-shot directions, where there is no training data from the source-language documents to the target-language summaries.


Introduction
The world we live in is multi-lingual. With globalization, text resources in various languages flood the Internet, where global users can easily access their desired information. Against this background, the text summarization community has proposed multi-lingual summarization (MLS) and cross-lingual summarization (CLS). As shown in Figure 1, MLS aims at building a unified model to process documents in multiple languages and generate summaries in the corresponding language (Giannakopoulos et al., 2015; Cao et al., 2020b; Hasan et al., 2021b; Wang et al., 2021; Varab and Schluter, 2021), while CLS generates a summary in the target language from a given document in a different source language (Leuski et al., 2003a; Wan et al., 2010; Wan, 2011; Yao et al., 2015; Zhu et al., 2019; Ladhak et al., 2020; Perez-Beltrachini and Lapata, 2021; Wang et al., 2022b,c,d, 2023). Despite the close relationship between MLS and CLS (e.g., both tasks involve more than one language and require models to distill the key information from documents), previous work studies each task separately, hindering systematic exploration of both.
In this paper, we aim to unify MLS and CLS into a more general setting named many-to-many summarization (M2MS). As its name implies, the goal of M2MS is to build a single summarization model that processes a document in any source language and generates the corresponding summary in any given target language. In this manner, one M2MS model covers more directions than MLS and CLS, thus reducing the number of parameters used. For example, one M2MS model involving n languages could replace one MLS model and n×(n − 1) CLS models. To provide a deeper understanding of M2MS, we also conduct preliminary studies to systematically compare M2MS with MLS and CLS. In detail, following recent CLS work (Ladhak et al., 2020; Perez-Beltrachini and Lapata, 2021), we use mBART-50 (Tang et al., 2021) as the summarization model, and train it in the MLS, CLS and M2MS settings, respectively. After comparing the model performances, we find that the model trained in the M2MS setting can better transfer task knowledge across different languages and combines the advantages of those trained in the MLS and CLS settings. Therefore, we argue that it is promising to unify MLS and CLS into M2MS.
Furthermore, we propose PISCES, a pre-trained M2MS model that learns language modeling, cross-lingual ability and summarization ability via three pre-training stages: (1) meta pre-training learns general language modeling from multi-lingual unlabeled corpora; (2) cross-lingual pre-training makes the model aware of the transformation between different languages based on parallel corpora; (3) task-specific pre-training utilizes the M2MS objective to simultaneously improve the cross-lingual and summarization abilities of the model. Considering that high-quality M2MS samples are non-trivial to collect, we leverage a simple strategy to construct pseudo M2MS samples from multi-lingual unlabeled corpora. During the three-stage pre-training, PISCES gradually shifts from learning language modeling to the abilities required by M2MS. Among them, the learned cross-lingual ability plays a key role in transferring knowledge of the downstream task (i.e., summarization) from high-resource languages to low/zero-resource languages. Lastly, the pre-trained PISCES can simply be fine-tuned on M2MS with input source-language documents and output target-language summaries.
We evaluate PISCES on the WikiLingua (Ladhak et al., 2020) and CrossSum (Hasan et al., 2021a) datasets. Experimental results show that PISCES achieves promising results compared with the state-of-the-art baselines (i.e., mBART-50 and mT5), especially in the zero-shot directions. Moreover, we find that PISCES is even able to generate summaries for documents whose language never occurs in the fine-tuning stage.
Our contributions are summarized as follows:
• To our knowledge, we are the first to unify MLS and CLS into a more general setting (M2MS). We also conduct preliminary studies to provide deeper analyses among MLS, CLS and M2MS.
• We propose PISCES, a pre-trained M2MS model that learns language modeling, cross-lingual ability and summarization ability through a carefully designed three-stage pre-training.
• We conduct extensive experiments and show that PISCES achieves new state-of-the-art performance on large-scale benchmark datasets. Besides, the effectiveness of PISCES in low/zero-resource languages is also demonstrated.

Related Work

Considering the close relation between MLS and CLS, Cao et al. (2020b) and Feng et al. (2022) also evaluate their MLS models on CLS to show their zero-shot CLS ability.
Cross-Lingual Summarization. Given documents in one language, cross-lingual summarization (CLS) generates summaries in another language.
Among these pre-trained summarization models, PEGASUS and PRIMERA only focus on monolingual summarization. Though mDIALBART aims at CLS, the model is merely built for a single cross-lingual direction (i.e., English ⇒ German/Chinese) and a specific scenario (i.e., dialogue). Our PISCES is the first multi-lingual pre-trained model for general summarization.

Does Unifying All Directions in a Single Model Help Each Other?
As discussed previously, M2MS unifies all summarization directions in a single model. Therefore, we wonder whether such a setting helps the model better transfer task knowledge across different languages compared with the MLS and CLS settings. To answer this question, we conduct preliminary studies to investigate the influence of different settings.

Setup
Data. The preliminary studies are conducted on WikiLingua (Ladhak et al., 2020), one of the largest CLS datasets. We focus on six languages, i.e., English (En), French (Fr), Hindi (Hi), Chinese (Zh), Thai (Th) and Turkish (Tr). Among them, Tr serves as a zero-resource language, whose documents and summaries only appear in the validation and test sets. More details are given in Section 5.1.
Summarization Model. Following recent CLS literature (Ladhak et al., 2020; Perez-Beltrachini and Lapata, 2021), we use mBART-50 (Tang et al., 2021) as the summarization model, and train it in the following settings:
• mBART (ONE): We separately train several models, each of which is built and evaluated in one single direction. When the direction is cross-lingual (or monolingual), the corresponding model is a CLS (or monolingual summarization) model.

Table 2: Correct language rate (%) of the summaries generated by mBART (MLS) and mBART (M2MS).
To give a deeper understanding of why mBART (MLS) performs poorly in cross-lingual directions, we analyze its generated summaries and find that most of them are not in the language we expected. Table 2 shows the rate of generated summaries that are in the correct language. The languages of the generated summaries are detected with the fastlangid toolkit. Compared with mBART (M2MS), mBART (MLS) struggles to generate summaries in the target language. We conjecture this is because mBART (MLS) is only trained with monolingual data from multiple languages without any cross-lingual signals, resulting in limited cross-lingual ability.
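As a concrete illustration, the correct-language rate in Table 2 can be computed by running a language identifier over each generated summary and comparing it against the intended target language. The sketch below uses fastText's lid.176 model as the detector; the paper reports using the fastlangid toolkit, so the exact API call here is an assumption rather than the authors' script.

```python
# Minimal sketch: correct-language rate of generated summaries.
# Assumes fastText's pre-trained language-ID model (lid.176.bin) is available;
# the paper uses the fastlangid toolkit, so this detector is a stand-in.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")

def detect_language(text: str) -> str:
    # fastText returns labels like "__label__en"; strip the prefix.
    labels, _ = lid_model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")

def correct_language_rate(summaries, target_lang: str) -> float:
    hits = sum(detect_language(s) == target_lang for s in summaries)
    return 100.0 * hits / max(len(summaries), 1)

# Example: rate of Chinese summaries actually generated in Chinese.
# print(correct_language_rate(generated_summaries, "zh"))
```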
Based on the above analyses, we argue that the summarization signals from cross-lingual directions help mBART (M2MS) perform CLS and transfer the task knowledge to zero-shot directions, while mBART (MLS) lacks such abilities.

mBART (M2MS) vs. mBART (U-CLS). The only difference between mBART (M2MS) and mBART (U-CLS) is that the training data of mBART (M2MS) contains all monolingual samples, while that of mBART (U-CLS) does not. We find that the performance gap between mBART (M2MS) and mBART (U-CLS) is much smaller than that between mBART (M2MS) and mBART (CLS) / mBART (MLS). In detail, mBART (M2MS) outperforms mBART (U-CLS) in most directions when both the source and the target languages have been seen during the fine-tuning stage, i.e., when they are from {En, Fr, Hi, Zh, Th}. However, when the source or target language is unseen (i.e., Tr), the performance of mBART (M2MS) is slightly worse than that of mBART (U-CLS). This is because the monolingual training data used in mBART (M2MS) makes the word embeddings of the unseen language drift away from those of other languages (see details in Appendix A). Additionally, the cross-lingual signal between the unseen language and other languages never occurs in the fine-tuning stage, making it difficult to summarize from or to the unseen language.

Preliminary Conclusion
The preliminary studies comparing mBART trained in different settings indicate that: (1) the multi-lingual model trained in the M2MS setting can better transfer task knowledge across different languages than those trained in the MLS, CLS and unified CLS settings; (2) compared with unified CLS, M2MS helps the model achieve better transferability across seen languages, but sacrifices transferability to unseen languages.
Based on the above analyses, we argue that it is valuable to unify previous MLS and CLS into M2MS. Meanwhile, how to improve the transferability to unseen languages becomes a key point in M2MS.

PISCES
In this section, we propose PISCES, a pre-trained multi-lingual model for M2MS built on the Transformer backbone (Vaswani et al., 2017).
Figure 2 shows the overview of PISCES, which contains three pre-training stages. Specifically, meta pre-training (§ 4.1) lets the model learn general language modeling via a monolingual denoising objective in multiple languages. Then, to improve the transferability across different languages, cross-lingual pre-training (§ 4.2) adds noise to source-language sentences and encourages the model to translate them into parallel sentences in the target language. Note that the parallel sentences used in this stage might involve languages that are not seen in downstream tasks, and this is the key to improving the transferability to these languages. Finally, to narrow the gap between the pre-training and fine-tuning stages, task-specific pre-training (§ 4.3) trains the model with pseudo M2MS samples, which are constructed from multi-lingual unlabeled corpora via gap sentence selection and machine translation. During the three-stage pre-training process, the model gradually learns language modeling ability, then cross-lingual ability, and finally adaptation to the specific task.

Meta Pre-Training
The goal of meta pre-training is to provide good initialization for the subsequent pre-training stages.
mBART-50 is a multi-lingual BART (Lewis et al., 2020) with the Transformer encoder-decoder architecture. The model is pre-trained on large-scale multi-lingual unlabeled corpora to learn multi-lingual language modeling. Specifically, following BART, the denoising task is used as the pre-training objective, and there are two types of noise: (1) text infilling randomly masks text spans in text sequences, and (2) sentence permutation randomly shuffles sentences in documents. The model is required to comprehend the noisy text sequences and recover them. To indicate the input and output languages, language tags (e.g., <En> and <Zh>) are appended to the inputs of the encoder and decoder, respectively.
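To make the two noise types concrete, here is a minimal, self-contained sketch of how a denoising training pair could be built from a monolingual document. The span-length distribution, mask token and masking ratio are illustrative assumptions, not the exact mBART-50 recipe.

```python
# Minimal sketch of the denoising objective: apply text infilling and
# sentence permutation to a document, then train the model to recover it.
# Hyperparameters (mask ratio, span lengths) are illustrative assumptions.
import random

MASK = "<mask>"

def text_infilling(tokens, mask_ratio=0.35, max_span=5):
    """Replace random token spans with a single <mask> token each."""
    tokens = list(tokens)
    n_to_mask = int(len(tokens) * mask_ratio)
    masked = 0
    while masked < n_to_mask and len(tokens) > 1:
        span = random.randint(1, max_span)
        start = random.randrange(0, max(len(tokens) - span, 1))
        tokens[start:start + span] = [MASK]
        masked += span
    return tokens

def sentence_permutation(sentences):
    """Randomly shuffle the order of sentences in a document."""
    sentences = list(sentences)
    random.shuffle(sentences)
    return sentences

def build_denoising_pair(sentences, lang_tag="<En>"):
    noisy = sentence_permutation(sentences)
    noisy = [" ".join(text_infilling(s.split())) for s in noisy]
    src = f"{lang_tag} " + " ".join(noisy)       # noisy encoder input
    tgt = f"{lang_tag} " + " ".join(sentences)   # original document to recover
    return src, tgt
```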

Cross-Lingual Pre-Training
Despite the effectiveness of mBART-50, the input and output sequences in its pre-training stage are always in the same language, leaving its cross-lingual ability under-explored. However, such ability is indispensable for M2MS. Therefore, cross-lingual pre-training is designed to improve cross-lingual transferability.
In detail, we propose a simple yet effective pre-training task, i.e., cross-lingual denoising, which lets the model generate sentences in the target language based on their noisy parallel sentences in a different source language. The noise used in this stage is text infilling. In this way, the pre-trained model is required not only to understand the text in the source language but also to learn the transformation between different languages.
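Below is a minimal sketch of how a cross-lingual denoising pair could be assembled from one parallel sentence pair. It reuses the text_infilling helper sketched above; the language-tag format and the masking ratio are assumptions for illustration.

```python
# Minimal sketch of cross-lingual denoising: the encoder reads a noised
# source-language sentence, the decoder must produce its target-language
# parallel sentence. Reuses text_infilling() from the previous sketch.
def build_cross_lingual_pair(src_sent, tgt_sent, src_tag="<En>", tgt_tag="<Zh>"):
    noisy_src = " ".join(text_infilling(src_sent.split(), mask_ratio=0.15))
    encoder_input = f"{src_tag} {noisy_src}"   # noisy source-language sentence
    decoder_target = f"{tgt_tag} {tgt_sent}"   # clean target-language translation
    return encoder_input, decoder_target

# Example with a parallel pair from OPUS-style data:
# build_cross_lingual_pair("The cat sat on the mat.", "猫坐在垫子上。")
```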

Task-Specific Pre-Training
Task-specific pre-training aims to narrow the gap between the pre-training and fine-tuning stages. We directly adopt M2MS as its pre-training task. Given that high-quality M2MS samples are difficult to collect, we construct pseudo samples from multi-lingual unlabeled corpora.
In detail, for a source-language document D = {s^src_i}_{i=1}^{|D|}, where s^src_i denotes the i-th sentence in D, we follow previous monolingual pre-trained summarization methods (Zhang et al., 2020a; Xiao et al., 2022) and select gap sentences from D, then translate them into the target language via Google Translation. In this manner, the source-language document D paired with the source-/target-language gap sentences S^src_* / S^trg_* constitutes a pseudo pre-training sample.

Quality Controlling. Since machine translation results might contain flaws, we further employ a round-trip translation strategy as suggested by Zhu et al. (2019) and Feng et al. (2022). For each gap sentence s^src_{g_i} in D, its translated counterpart s^trg_{g_i} is translated back to the source language, which we denote as ŝ^src_{g_i}. If the ROUGE-1 score between s^src_{g_i} and ŝ^src_{g_i} is less than a pre-defined threshold λ, the corresponding pseudo sample is discarded.
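A minimal sketch of the round-trip quality filter is given below. The translate() helper is a hypothetical placeholder for the machine translation service (the paper uses Google Translation), and the unigram-overlap ROUGE-1 here is a simplified stand-in for the multilingual ROUGE toolkit; λ = 0.7 follows the value reported in Appendix B.

```python
# Minimal sketch of round-trip translation filtering for pseudo M2MS samples.
# translate() is a hypothetical placeholder for an MT system (e.g., the
# Google Translation API used in the paper); ROUGE-1 F1 is simplified to
# whitespace-tokenized unigram overlap for illustration.
from collections import Counter

def translate(text: str, src_lang: str, tgt_lang: str) -> str:
    raise NotImplementedError("placeholder for the MT system")

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def keep_pseudo_sample(gap_sentences, src_lang="en", tgt_lang="zh", lam=0.7):
    """Return the translated gap sentences, or None if the sample fails the filter."""
    translated = []
    for sent in gap_sentences:
        tgt_sent = translate(sent, src_lang, tgt_lang)
        back = translate(tgt_sent, tgt_lang, src_lang)
        if rouge1_f1(back, sent) < lam:
            return None          # discard the whole pseudo sample
        translated.append(tgt_sent)
    return translated
```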
Input Format. To help the model trade off between (1) generating new sentences instead of translating part of the input sentences, and (2) learning the translation pattern (Zhu et al., 2020), half of the source-language gap sentences in D are randomly masked with a special token <mask-sent>.
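The following sketch shows one way the final pre-training example could be assembled under this scheme, with half of the gap sentences replaced by <mask-sent>; the exact tag placement is an assumption based on the mBART-50 format described above.

```python
# Minimal sketch of the task-specific input format: half of the selected
# source-language gap sentences are replaced by <mask-sent> in the document,
# and the decoder target is the concatenated target-language gap sentences.
import random

def build_pseudo_m2ms_sample(doc_sents, gap_indices, tgt_gap_sents,
                             src_tag="<En>", tgt_tag="<Zh>"):
    to_mask = set(random.sample(gap_indices, k=len(gap_indices) // 2))
    input_sents = ["<mask-sent>" if i in to_mask else s
                   for i, s in enumerate(doc_sents)]
    encoder_input = f"{src_tag} " + " ".join(input_sents)
    decoder_target = f"{tgt_tag} " + " ".join(tgt_gap_sents)
    return encoder_input, decoder_target
```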

Experiments

Benchmark Datasets
In order to evaluate M2MS models, two requirements should be met by a dataset: (1) it involves multiple languages and summarization directions, and (2) it has abundant samples in each direction. Thus, we choose WikiLingua (Ladhak et al., 2020) and CrossSum (Hasan et al., 2021a).
The original WikiLingua dataset, which involves 18 languages, is designed for the CLS task. The 18 languages constitute 306 (18×17) cross-lingual directions, each of which contains about 18k CLS samples on average. For each document, WikiLingua also contains its summary in the original language. Therefore, the dataset can be used to evaluate M2MS models. However, the original split is designed for CLS. Thus, we re-split WikiLingua with special consideration for M2MS: for each document in the test (or validation) set of one direction, the document and its parallel documents are not allowed to appear in the training and validation (or test) sets of other directions. This rule reduces the likelihood of learning shortcuts. We also intentionally create several zero-shot directions.
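A minimal sketch of this re-splitting rule is shown below: all parallel versions of the same underlying article are grouped and assigned to a single split, so no document (or its translation) can leak across training, validation and test sets. The article_id field and the 80/10/10 ratio are illustrative assumptions rather than the authors' exact script.

```python
# Minimal sketch of leakage-free re-splitting: every group of parallel
# documents (the same article in different languages) is assigned to exactly
# one split. 'article_id' and the split ratios are illustrative assumptions.
import random
from collections import defaultdict

def split_by_article(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    groups = defaultdict(list)
    for sample in samples:                      # sample: dict with 'article_id'
        groups[sample["article_id"]].append(sample)

    article_ids = sorted(groups)
    random.Random(seed).shuffle(article_ids)

    n = len(article_ids)
    cut1, cut2 = int(n * ratios[0]), int(n * (ratios[0] + ratios[1]))
    split_of = {}
    for idx, aid in enumerate(article_ids):
        split_of[aid] = "train" if idx < cut1 else "valid" if idx < cut2 else "test"

    splits = {"train": [], "valid": [], "test": []}
    for aid, group in groups.items():
        splits[split_of[aid]].extend(group)     # whole group goes to one split
    return splits
```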
We focus on six languages in this work: English (En), Chinese (Zh), French (Fr), Hindi (Hi), Turkish (Tr) and Thai (Th). After re-splitting, the statistics are shown in Table 3. There are 9 high-resource directions, each of which contains more than 10k training samples. The other 8 directions, with fewer than 10k training samples, are considered low-resource directions. The remaining 19 zero-shot directions have no training samples. According to whether both the source and target languages appear in the whole training set, we further divide them into 11 non-trivial and 8 conventional zero-shot directions. Note that Tr never appears in the training set of any direction; thus, the non-trivial zero-shot directions involve Tr while the conventional ones do not. We call Tr an unseen language. Though there is no training data for a conventional zero-shot direction, both its source and target languages might have training data with a pivot language, making it less challenging than the non-trivial ones. Taking the conventional zero-shot direction Hi⇒Zh as an example, the training data in Hi⇒En and En⇒Zh could bridge the gap between Hi and Zh. For statistics of the CrossSum dataset used in our experiments, please refer to Appendix C.1.

Experimental Setup
Baselines. We use mBART-50 (Tang et al., 2021) and mT5 (Xue et al., 2021) as baselines, which have achieved state-of-the-art performances on many CLS/MLS datasets (Perez-Beltrachini and Lapata, 2021; Hasan et al., 2021a; Feng et al., 2022).

Metrics. We adopt ROUGE-1/2/L (Lin, 2004) and BERTSCORE (Zhang et al., 2020b) in our experiments. The ROUGE scores measure the lexical overlap between the generated summaries and the corresponding references, while BERTSCORE measures their semantic similarity. These metrics are calculated with the multi-lingual rouge and bert-score toolkits, respectively. BERTSCORE is based on the bert-base-multilingual-cased model. The statistical significance test (Koehn, 2004) is also performed.

Table 6: Human evaluation results. "IF", "CC" and "GM" denote informativeness, conciseness and grammaticality, respectively.

Results in Unseen Languages. We discover that the results in the Any⇒Tr directions are significantly worse than the Tr⇒Others counterparts. This finding suggests that generating summaries in unseen languages is more difficult than understanding documents in unseen languages. This is because the encoder can partly understand the unseen languages through the shared vocabulary and their similar syntactic constituents with other languages. But for the decoder, we only change its language tag to expect it to generate summaries in unseen languages. This requires the decoder to simultaneously (1) capture the relationships between the unseen language tag and the unseen-language tokens and (2) summarize documents. However, the pre-trained model only meets requirement (1) in the pre-training stage (though PISCES has been pre-trained with pseudo M2MS samples, there is still a large gap between the pseudo samples and downstream samples, e.g., in text style and domain), while requirement (2) is only met in the fine-tuning stage, making it hard to satisfy both simultaneously; consequently, the model cannot generate summaries in unseen languages. We reserve this challenge for future work.
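To make the evaluation protocol described under Metrics above concrete, the sketch below scores one batch of generated summaries with ROUGE and BERTSCORE. It uses the rouge-score and bert-score Python packages as stand-ins; the paper's multi-lingual ROUGE toolkit additionally applies language-specific tokenization, which this simplified version omits.

```python
# Minimal sketch of the automatic evaluation: ROUGE-1/2/L plus BERTScore with
# bert-base-multilingual-cased. The rouge-score package here is a simplified
# stand-in for the multilingual ROUGE toolkit used in the paper.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate(candidates, references):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=False)
    rouge = {k: 0.0 for k in ["rouge1", "rouge2", "rougeL"]}
    for cand, ref in zip(candidates, references):
        result = scorer.score(ref, cand)
        for k in rouge:
            rouge[k] += result[k].fmeasure / len(candidates)

    # Semantic similarity with multilingual BERTScore.
    _, _, f1 = bert_score(candidates, references,
                          model_type="bert-base-multilingual-cased")
    return rouge, f1.mean().item()
```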
Ablations. We conduct ablation studies to investigate the effect of the cross-lingual and task-specific pre-training stages. We run the following ablations: (1) PISCES w/o CL removes the cross-lingual pre-training stage; (2) PISCES w/o TS removes the task-specific pre-training stage; (3) PISCES w/o TS & CL removes both the cross-lingual and task-specific pre-training stages, which is the same as mBART-50.

mBART
Examine the iphone's keyboard. Click the "screen" button to view the photos. Click the "screen" button to view the list of available photos.

PISCES
Connect your iphone to computer. Unlock your iphone. Click the "photos" app. Select the photos you wish to download. Click the "choose photos" option. Select the photos you wish to download. Click the "download" button.

Ground Truth
Connect your iphone to your mac. Unlock your iphone. Open the photos app. Select your iphone. Select the photos you'd like to download. Click import selected. Click imports.
As shown in Table 5, we conduct ablation studies in several conventional zero-shot directions (results in more directions are provided in Appendix E). In each case, the ROUGE and BERTSCORE results are lower than those of the vanilla PISCES. In addition, both PISCES w/o TS and PISCES w/o CL outperform PISCES w/o TS & CL. Thus, the effectiveness of both stages is verified.

Qualitative Results
Human Evaluation. Following Zhu et al. (2020) and Liang et al. (2022b), we conduct a human evaluation on 50 random samples extracted from WikiLingua (En⇒Zh, Zh⇒En and En⇒En, respectively). Three graduate students are invited to assess the generated summaries from three aspects: informativeness (IF), conciseness (CC) and grammaticality (GM). The scoring adopts a 5-point scale from 1 (worst) to 5 (best). Table 6 shows the average results. The IF, CC and GM scores of PISCES are significantly better than those of mT5 and mBART-50, demonstrating the effectiveness of our model.

Case Study. Table 7 shows an example Turkish document, the generated summaries and the ground-truth summary. Though the summary generated by PISCES contains a repeated sentence, it overlaps well with the ground truth. For mBART-50, however, the generated summary is not relevant to the core idea of the document. This observation indicates that, through the cross-lingual and task-specific pre-training, PISCES can better transfer task knowledge from high-resource directions to zero-shot ones, and is even able to generate summaries for documents whose language does not occur in the fine-tuning stage.
Error Analysis.To further study how future research could advance M2MS, we take a closer look at the generation errors of PISCES and analyze them in Appendix F.

Conclusion
In this paper, we unify MLS and CLS into M2MS. Through carefully designed preliminary studies, we show that unifying MLS and CLS into M2MS is valuable. In addition, we propose PISCES, the first pre-trained M2MS model, which contains three pre-training stages to enable the model to learn multi-lingual language modeling, cross-lingual ability and summarization ability. Extensive experiments show its superiority over the state-of-the-art baselines (mBART-50 and mT5). The case study further demonstrates that our model can even generate summaries for documents whose language does not occur in the fine-tuning stage.

Ethical Considerations
In this section, we consider potential ethical issues of our model. In this paper, we propose PISCES, which uses mBART-50 (Tang et al., 2021) as the meta pre-trained model and is further trained via the cross-lingual and task-specific pre-training stages. The pre-training samples are constructed from the OPUS (Tiedemann and Thottingal, 2020) and mC4 (Xue et al., 2021) corpora. To construct the pseudo M2MS samples in the task-specific pre-training stage, Google Translation is also adopted to translate gap sentences. Therefore, PISCES might inherit the biases and toxic behaviors exhibited by language models, pre-training corpora and Google Translation.

Limitations
While we show that PISCES outperforms mBART-50 on WikiLingua (Ladhak et al., 2020), there are some limitations worth considering in future work: (1) PISCES still struggles to generate summaries in unseen languages (Section 5.3); (2) In this work, we focus on six languages in total, and future work could extend our method to more languages.

A Word Embeddings of the Unseen Language and Other Languages
To verify that the word embeddings of the unseen language drift away from those of other languages after adding the monolingual training data, we choose the 1,000 most frequent English words from the MUSE dictionary, together with the words of the same meaning in the other five languages (i.e., Fr, Hi, Zh, Th and Tr).
Then, we calculate the embeddings of these words with mBART (M2MS) and mBART (U-CLS), respectively. For a word that consists of multiple tokens, the word embedding is the average of the embeddings of those tokens. As shown in Figure 3, we use Principal Component Analysis (PCA) to visualize the word embeddings from mBART (M2MS) and mBART (U-CLS). In the PCA space, we further calculate the central point of each language by averaging the word embeddings in that language. We find that the average distance between the central point of Tr and those of the other languages is 0.426 for mBART (M2MS) and 0.407 for mBART (U-CLS). This distance in vanilla mBART-50 (Tang et al., 2021) is 0.398. Therefore, the monolingual training data used in mBART (M2MS) makes the word embeddings of the unseen language drift away from those of other languages.
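A minimal sketch of this analysis is given below: project the word embeddings with PCA, compute a per-language centroid, and average the Euclidean distances between the Tr centroid and the centroids of the other languages. The 2-D projection and the way embeddings are gathered are illustrative assumptions.

```python
# Minimal sketch of the embedding-drift analysis: PCA projection, per-language
# centroids, and the average distance between Tr and the other languages.
# Assumes `embeddings[lang]` is an (n_words, hidden_size) numpy array of
# (multi-token averaged) word embeddings extracted from a given checkpoint.
import numpy as np
from sklearn.decomposition import PCA

def centroid_distance_to_unseen(embeddings, unseen="tr", n_components=2):
    langs = sorted(embeddings)
    stacked = np.concatenate([embeddings[l] for l in langs], axis=0)
    projected = PCA(n_components=n_components).fit_transform(stacked)

    centroids, offset = {}, 0
    for lang in langs:
        n = len(embeddings[lang])
        centroids[lang] = projected[offset:offset + n].mean(axis=0)
        offset += n

    dists = [np.linalg.norm(centroids[unseen] - centroids[l])
             for l in langs if l != unseen]
    return float(np.mean(dists))
```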

B Implementation Details
Pre-Training Details. We use mBART-50 (Tang et al., 2021) as the meta pre-trained model, and further pre-train it via the cross-lingual and task-specific pre-training stages. The implementation of mBART-50 is based on the Transformers (Wolf et al., 2020) library with default settings (12 encoder layers, 12 decoder layers and 1024 hidden states). In cross-lingual pre-training, we dynamically mask 0-15% of the tokens in the source-language sentences, and construct 20.6M samples from the OPUS parallel corpora (Tiedemann and Thottingal, 2020). In task-specific pre-training, we construct 3.1M training samples from the mC4 corpus (Xue et al., 2021). We set the total length of the gap sentences to k% of the document length, where k is dynamically selected from {5, 10, 15}. The pre-defined λ in the round-trip translation is 0.7. All experimental results listed in this paper are the average of 3 runs. Table 8 and Table 9 show the statistics of the constructed samples in the cross-lingual and task-specific pre-training stages, respectively. The cross-lingual and task-specific pre-training stages are conducted on 8 NVIDIA Tesla V100 GPUs with 32GB memory. In the cross-lingual pre-training stage, we pre-train the model for 150K steps with early stopping, a batch size of 32, a learning rate of 3e-5 following Xiao et al. (2022), and 10K warmup steps. In the task-specific pre-training stage, we pre-train the model for 100K steps with early stopping, a batch size of 4, a learning rate of 3e-5 and 10K warmup steps.

Fine-Tuning and Testing Details. In the fine-tuning stage, we fine-tune the PISCES model on 8 NVIDIA Tesla V100 GPUs (32GB) with a batch size of 4, 10 epochs, 2K warmup steps, a learning rate of 3e-5, and a maximum input length of 1024 tokens. To balance the high-resource and low-resource language data, following Xue et al. (2021), we sample the training examples according to the probability p(D) ∝ |D|^α, where p(D) is the probability of sampling training examples from a given direction during fine-tuning and |D| is the number of original examples in that direction. We set the hyperparameter α to 0.5. To fine-tune the mT5 baseline on M2MS, the language tags (e.g., <En> and <Zh>) are appended to the inputs of both the encoder and decoder. During testing, we set the beam size to 5 and the maximum decoding length to 128.
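A minimal sketch of this temperature-based sampling is shown below: given per-direction example counts, it returns the normalized sampling probabilities with α = 0.5. This is a generic illustration of the formula, not the exact data-loader code.

```python
# Minimal sketch of direction sampling during fine-tuning: p(D) ∝ |D|^alpha,
# which up-weights low-resource directions relative to their raw share.
def direction_sampling_probs(direction_sizes, alpha=0.5):
    """direction_sizes: dict mapping direction name -> number of examples."""
    weights = {d: n ** alpha for d, n in direction_sizes.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}

# Example: a 100k-sample direction vs. a 1k-sample direction.
# With alpha=0.5 the ratio shrinks from 100:1 to 10:1.
# print(direction_sampling_probs({"En->Zh": 100_000, "Hi->Th": 1_000}))
```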

C.1 Data Statistics

Table 10 lists the data statistics of the CrossSum dataset (Hasan et al., 2021a) used in our experiments. The data splitting mainly inherits from the original CrossSum except for the zero-shot and monolingual directions: (1) if the number of samples in a direction (e.g., Fr⇒Hi) is less than 1k, we regard the direction as a zero-shot direction and evenly split its samples into validation and test sets; (2) considering that the number of samples in cross-lingual directions is at the hundred or thousand level, we truncate the number of samples in each monolingual direction (e.g., En⇒En) to 10k for balance, and split them with an 8:1:1 ratio. If the number of samples in a monolingual direction (e.g., Th⇒Th) is less than 10k, its splitting follows the original CrossSum.

C.2 Experimental Results
Table 11 shows the experimental results on CrossSum. Our PISCES outperforms mBART-50 by 2.3 ROUGE-1, 2.0 ROUGE-2, 2.0 ROUGE-L and 1.3 BERTSCORE points on average over all directions, which verifies the effectiveness of PISCES. Averaged over all zero-shot directions, mBART-50 achieves 33.8, 15.7, 28.1 and 67.1 in terms of ROUGE-1/2/L and BERTSCORE; the counterparts of PISCES are 37.9, 19.6, 31.8 and 69.3, showing its superiority in the zero-shot directions.

D Full Results on WikiLingua
Table 12 shows the experimental results in terms of ROUGE-1, ROUGE-2 and ROUGE-L, respectively.

E Ablations in Conventional Zero-Shot Directions
Table 13 shows the ablation results in all conventional zero-shot directions.

F Error Analysis
We first randomly select 100 summaries generated by PISCES on WikiLingua (En⇒Zh). After manually examining the generated summaries, we find the following major error types:
• Missing Information: part of the information in the ground-truth summary is not mentioned in the generated summary. This is the most frequent error type and accounts for 39% of the generated summaries.
• Faithfulness: the generated summary involves information that is inconsistent with (or not presented in) the source document. We find that 32% of the summaries have this error.
• Redundancy: the generated summary contains additional information beyond the ground-truth summary. 17% of the generated summaries contain this error.
• Foreign Words: the generated summary involves words in another language. 9% of the generated Chinese summaries involve some (typically one or two) words in another language.
The issue of foreign words is related to the code-switching phenomenon (Pfaff, 1979). Note that the generated foreign words are not limited to the source language. In several cases, the generated Chinese summaries of English documents even involve Thai words. We also find that the semantics of these foreign words are typically coherent with their context. This error type might be caused by the cross-lingual pre-training in PISCES, which bridges the representation gap between parallel words in different languages.

Figure 1: Illustration of (a) multi-lingual summarization, (b) cross-lingual summarization and (c) many-to-many summarization. X_i and Y_i denote the input document and output summary in language i, respectively. En: English; De: German; Zh: Chinese.
Figure 2: Overview of the three-stage pre-training in PISCES. Specifically, (a) meta pre-training requires the model to generate the original sentences based on their noisy counterparts; (b) cross-lingual pre-training generates sentences in the target language based on the noisy parallel sentences in the source language; (c) task-specific pre-training utilizes pseudo M2MS samples to pre-train the model.

Table 3: Statistics of the re-split WikiLingua. # Samples denotes the number of samples in the training / validation / test sets. # Avg. Tokens represents the average number of tokens in the documents and summaries, respectively. Green, light green and gray indicate the high-resource, low-resource and zero-shot directions, respectively.

Table 4: Experimental results on WikiLingua. Avg. indicates the average score for each cluster of directions. PISCES is significantly better than mBART with t-test p < 0.01 in all directions.

Table 5: Results of ablation studies.

Table 7: An example of Tr⇒En summarization.

Table 8: Statistics of the constructed cross-lingual pre-training samples. Each entry shows the number of samples for each language pair in the corresponding corpus.

Table 9: Statistics of the constructed task-specific pre-training samples.

Table 13: Results of ablation studies.