ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization

We present ClidSum, a benchmark dataset for building cross-lingual summarization systems on dialogue documents. It consists of 67k+ dialogue documents and 112k+ annotated summaries in different target languages. Based on the proposed ClidSum, we introduce two benchmark settings for supervised and semi-supervised scenarios, respectively. We then build various baseline systems under different paradigms (pipeline and end-to-end) and conduct extensive experiments on ClidSum to provide deeper analyses. Furthermore, we propose mDialBART, which extends mBART via further pre-training, where multiple objectives help the pre-trained model capture the structural characteristics and key content of dialogues as well as the transformation from the source to the target language. Experimental results show that mDialBART, as an end-to-end model, outperforms strong pipeline models on ClidSum. Finally, we discuss the specific challenges that current approaches face in this task and suggest multiple promising directions for future research. We have released the dataset and code at https://github.com/krystalan/ClidSum.

Different from single-participant documents, a dialogue is a discourse produced by more than one person (Haviland, 1990). Multi-participant dialogue documents, which record communication between humans or between humans and machines, have attracted wide research attention due to their key role in daily interpersonal interaction (Zhang et al., 2020c). Meanwhile, globalization has prompted conversations among interlocutors with different native languages in many scenarios, e.g., international academic conferences and business meetings. Thus, it is valuable to provide speakers with a summary in their familiar language to help them efficiently grasp the gist of a foreign-language dialogue. Nevertheless, dialogue-oriented XLS is still under-explored due to the lack of corresponding datasets.
To this end, we introduce the Cross-Lingual Dialogue Summarization (XLDS) task, which aims to summarize a dialogue in the source language into a different language. To promote XLDS research, we construct CLIDSUM (Cross-LIngual Dialogue SUMmarization), the first large-scale XLDS benchmark dataset, with three features: (1) The proposed CLIDSUM is based on two existing monolingual dialogue summarization datasets, i.e., SAMSum (Gliwa et al., 2019) and MediaSum (Zhu et al., 2021). We choose these two datasets in consideration of their quality and diversity. (2) To make these datasets suitable for XLDS, we employ professional translators to translate the original English summaries of SAMSum and MediaSum into German and Chinese. Eventually, the translated corpora constitute CLIDSUM, which contains ~56.4k En⇒De and ~56.4k En⇒Zh XLDS samples in total (En: English; De: German; Zh: Chinese). (3) Besides the supervised benchmark setting discussed by most previous XLS work, we argue that it is necessary to utilize large-scale monolingual dialogue summarization pairs for XLDS due to the dearth of cross-lingual samples. Thus, we design a semi-supervised setting, where a large number of monolingual pairs together with a relatively small number of cross-lingual pairs are used to build XLDS systems.
Based on CLIDSUM, we build and evaluate various baseline systems, covering the summarize-then-translate, translate-then-summarize and end-to-end paradigms: (1) For summarize-then-translate baselines, we use a monolingual dialogue summarizer to generate summaries and then translate them into the target language through a machine translation (MT) model/service. (2) For translate-then-summarize baselines, we utilize an MT model/service to translate the dialogue documents from the source language into the target language, and then obtain summaries via a monolingual dialogue summarizer. (3) For the end-to-end paradigm, we adopt mBART-50 (Tang et al., 2021), a multi-lingual sequence-to-sequence (Seq2Seq) model pre-trained on large-scale corpora across 50 languages using denoising objectives. Though such a general model achieves surprising results on many downstream tasks, its performance naturally degrades when there is a relatively large gap between the pre-training and fine-tuning stages (Lai et al., 2021; Zhong et al., 2022). Therefore, to narrow this gap and fully use the large-scale monolingual dialogue summarization pairs provided in the semi-supervised setting, we propose mDIALBART, which extends mBART-50 through a second stage of pre-training with four objectives: action infilling, utterance permutation, monolingual dialogue summarization and machine translation. Specifically, action infilling and utterance permutation encourage the model to effectively capture the structural characteristics of dialogues. Monolingual dialogue summarization gives the model the ability to summarize dialogues. Machine translation enables the model to learn the transformation from the source language to the target language. The pre-training corpora are those provided in the semi-supervised setting, which means we do not use additional data beyond CLIDSUM. The experimental results show that mDIALBART outperforms strong pipeline baselines on CLIDSUM. We also conduct human studies to compare generated summaries from different methods and discuss the specific challenges that current approaches face in XLDS. We hope that our work can promote the development of XLDS.
Our main contributions are summarized as follows:

Related Work
Cross-Lingual Summarization. We divide existing XLS datasets into synthetic datasets and multi-lingual website datasets according to their construction methods. (1) Synthetic datasets are constructed by translating the summaries of existing text summarization datasets into a target language, such as En2ZhSum, Zh2EnSum (Zhu et al., 2019) and En2DeSum (Bai et al., 2021). These datasets are further equipped with the round-trip translation strategy (Zhu et al., 2019; Zhang et al., 2021; Lai et al., 2022) to filter out low-quality samples. (2) Multi-lingual website datasets are collected from websites which provide multi-lingual versions of their articles. For instance, WikiLingua (Ladhak et al., 2020) collects articles from the WikiHow website, where many English articles are translated into non-English versions by human writers. Each article also links to parallel articles in other languages, if available, so it is handy to collect different language versions of one article. Next, the summary of each language-specific article is extracted through a heuristic strategy. In this way, an article in one language and its summary in a different language constitute an XLS sample. Similarly, Global Voices (Nguyen and Daumé III, 2019) and XWikis (Perez-Beltrachini and Lapata, 2021) collect multi-lingual articles from the Global Voices and Wikipedia websites, respectively. Early XLS methods (Wan et al., 2010; Zhang et al., 2016; Ayana et al., 2018; Ouyang et al., 2019) are based on pipeline paradigms due to the scarcity of parallel corpora. Recently, Zhu et al. (2019) proposed the first large-scale XLS dataset and further explored multi-task learning for end-to-end (e2e) XLS systems, which achieve great improvements over pipeline methods. Subsequently, many efforts have contributed to e2e XLS systems. Among them, Zhu et al. (2020) exploit translation patterns in XLS. Cao et al. (2020) propose a framework that jointly learns to align and summarize for XLS. Xu et al. (2020) explore a pre-training strategy for XLS. Liang et al. (2022) adopt a conditional variational auto-encoder (CVAE) (Sohn et al., 2015) to deal with XLS. Different from existing XLS work, we shift research attention from single-participant documents to multi-participant dialogues.
Dialogue Summarization. Dialogue summarization aims to condense a dialogue document into a shorter passage. AMI (Carletta et al., 2005) and ICSI (Janin et al., 2003) are two early meeting corpora that each contain hundreds of samples. Recently, the SAMSum corpus (Gliwa et al., 2019) was introduced, which contains over 16k chat conversations with human-labeled summaries. MediaSum (Zhu et al., 2021) collects about 464k interview transcripts and uses their overview descriptions as summaries. GupShup (Mehnaz et al., 2021) develops the first code-switched dialogue summarization dataset, whose dialogues are in Hindi-English.
Based on these datasets, a large body of work (Chen and Yang, 2020; Wu et al., 2021; Xiachong et al., 2021; Chen and Yang, 2021; Feng et al., 2021) models conversation characteristics and achieves strong performance. All these efforts target monolingual or code-switched scenarios, while we focus on the cross-lingual scenario in this paper.

CLIDSUM
In this section, we first discuss how we choose existing monolingual dialogue summarization datasets for the construction of CLIDSUM (§ 3.1). Then, we introduce the details of the annotation process that translates original summaries into the target languages (§ 3.2). We next give a statistical analysis of CLIDSUM (§ 3.3). Finally, we formulate the XLDS task (§ 3.4) and give the details of the benchmark settings (§ 3.5).

Data Selection
As discussed in Section 2, there are two types of XLS datasets: synthetic datasets and multi-lingual website datasets. To our knowledge, there is no public website that provides multi-lingual dialogue-summary pairs. Therefore, we decide to construct an XLDS dataset by translating the original summaries of existing dialogue summarization datasets.
After carefully comparing existing datasets, we finally choose the SAMSum (Gliwa et al., 2019) and MediaSum (Zhu et al., 2021) datasets for the following reasons: (1) both datasets are of high quality, consisting of real-world or human-labeled monolingual dialogue-summary pairs; (2) together they cover a wide range of everyday domains.

Data Annotation
Since the size of MediaSum (~464k) is much larger than any other dialogue summarization dataset (typically less than 30k), it would be costly to translate all summaries from MediaSum into the target languages via crowd-sourcing. Hence, we randomly select 20k samples from its training set, which, together with all samples from the validation and test sets, form the MediaSum40k subset (40k samples in total). The remaining monolingual data is denoted as MediaSum424k. We then translate the original English summaries from SAMSum and MediaSum40k into German and Chinese through data annotation.
In total, 55 professional translators, 3 data reviewers and 1 data expert participate in the annotation process. All En⇒Zh translators have passed the TEM-8, while all En⇒De translators have passed both the TEM-8 and the PGH. The data reviewers and the expert are proficient in English, German and Chinese, and have rich experience in checking translation quality. During the annotation process, 20% of the summaries translated by each translator are checked by a reviewer.
If the accuracy is lower than 95%, the translator must revise all his/her translated summaries under the guidance of the reviewer. To guarantee overall quality, after the above process for one translation direction of SAMSum or MediaSum40k, 2% of the translated summaries are randomly sampled and checked by the data expert. If the accuracy is lower than 95%, the corresponding translators and reviewers need to revise their translated summaries again. This quality control loop is executed 7, 1, 1 and 2 times for SAMSum (En⇒Zh), SAMSum (En⇒De), MediaSum40k (En⇒Zh) and MediaSum40k (En⇒De), respectively.
In the end, we collect 56,369 Chinese and 56,369 German summaries. We denote the summary-translated SAMSum and MediaSum40k as XSAMSum and XMediaSum40k, respectively, which also inherit the data splits of the original datasets. XSAMSum, XMediaSum40k and MediaSum424k together constitute the CLIDSUM benchmark dataset.

Data Analysis
We compare CLIDSUM with previous synthetic datasets in Table 1. CLIDSUM is the first large-scale XLS dataset for dialogues, and it is also the first manually translated XLS dataset. While one may argue that the scale of previous synthetic datasets is much larger than XSAMSum and XMediaSum40k, it is worth noting that 1) all dialogue summarization datasets except MediaSum contain fewer than 30k samples, and 2) the quality of XSAMSum and XMediaSum40k is much higher than that of automatically constructed ones. In addition, as shown in Table 2, the dialogues from (X)MediaSum40k contain more turns (~30 vs. ~11) and more speakers (~9.3 vs. ~2.4) than (X)SAMSum, leading to more challenges for XLDS research.

Task Overview
Given a dialogue D = {u_1, u_2, ..., u_{|D|}} in the source language, where u_i denotes the i-th utterance in D, the XLDS task aims to generate the corresponding summary Y_t = (y_1, y_2, ..., y_{|Y_t|}) in a different target language, where y_j denotes the j-th token in Y_t.
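All baselines considered in this paper are Seq2Seq models, so the task above is typically modeled with the standard autoregressive factorization (this formulation is our addition, written to be consistent with the notation above):

```latex
% Autoregressive factorization of the XLDS objective
P(Y_t \mid D) = \prod_{j=1}^{|Y_t|} P\left(y_j \mid y_{<j}, D\right)
% Training minimizes the negative log-likelihood over cross-lingual pairs
\mathcal{L} = -\sum_{j=1}^{|Y_t|} \log P\left(y_j \mid y_{<j}, D\right)
```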

Benchmark Settings
We design two benchmark settings for supervised and semi-supervised scenarios, respectively. In the Supervised Setting, an XLDS system is trained on the training set of XSAMSum or XMediaSum40k and evaluated on the corresponding test set. In the Semi-Supervised Setting, the training set of XMediaSum40k and the whole of MediaSum424k are used to build XLDS systems, which are evaluated on the test set of XMediaSum40k.

Pipeline Method
The main idea of the pipeline method is to decompose the XLDS task into dialogue summarization and machine translation sub-tasks. It can be further divided into the summarize-then-translate and translate-then-summarize paradigms.
Summarize-then-translate. In this paradigm, a monolingual dialogue summarizer generates a summary in the same language as the given dialogue, and the target-language summary is then obtained through machine translation. The following models are adopted as the dialogue summarizers:
• PEGASUS (Zhang et al., 2020a) is a pre-trained abstractive summarization model.
• T5 (Raffel et al., 2020) is a jointly pre-trained Seq2Seq model for many downstream NLP tasks.
• BART (Lewis et al., 2020) is another pre-trained Seq2Seq model that uses denoising objectives during the pre-training stage.
• mBART-50 (Tang et al., 2021) is a multi-lingual version of BART, which can also be employed on monolingual tasks, though it then does not take advantage of its multi-lingual ability.
• MV-BART (Chen and Yang, 2020) is a BART-based multi-view dialogue summarization model which utilizes conversation information from different views, e.g., the topic view and the stage view.
• BART(D_ALL) (Feng et al., 2021) encodes additional dialogue characteristics into BART. The characteristics are extracted by a DialoGPT (Zhang et al., 2020c) annotator.
To translate the generated summaries from the source into the target language, previous XLS work (Ladhak et al., 2020; Perez-Beltrachini and Lapata, 2021) adopts only sophisticated translation services, e.g., Google Translation and AWS Translation. Though they achieve great performance, the black-box APIs provided by translation services are constantly updated, which leads to poor reproducibility. Thus, we decide to adopt both translation services and models. Specifically, we consider three translation methods: (1) Google Translation; (2) OPUS-MT (Tiedemann and Thottingal, 2020), which releases many transformer-based MT models trained on the OPUS corpus for different translation directions; (3) Trans-WMT20: we train two transformer-based Seq2Seq models (En⇒De and En⇒Zh) on the WMT20 parallel corpus from scratch.
Translate-then-summarize. This paradigm first translates the English dialogues from CLIDSUM into German and Chinese through the same translation methods as summarize-then-translate, and then generates summaries via mBART-50.
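The two pipeline paradigms differ only in the order in which the two stages are composed. This can be sketched as follows, where `summarize` and `translate` are hypothetical stand-ins for a fine-tuned dialogue summarizer and an MT model/service (neither signature comes from the paper):

```python
def summarize(text: str) -> str:
    # Placeholder: a real system would call a fine-tuned summarizer
    # (e.g., BART or mBART-50) here. This stub keeps the first sentence.
    return text.split(".")[0] + "."

def translate(text: str, src: str, tgt: str) -> str:
    # Placeholder: a real system would call an MT model/service here.
    return f"[{src}->{tgt}] {text}"

def summarize_then_translate(dialogue: str, tgt: str = "de") -> str:
    # Stage 1: summarize in the source language; Stage 2: translate.
    return translate(summarize(dialogue), src="en", tgt=tgt)

def translate_then_summarize(dialogue: str, tgt: str = "de") -> str:
    # Stage 1: translate the whole dialogue; Stage 2: summarize in `tgt`.
    return summarize(translate(dialogue, src="en", tgt=tgt))
```

Note that translate-then-summarize requires a summarizer in the target language, which is why mBART-50 is used for that stage in our experiments.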

End-to-End Method
The end-to-end method needs to learn dialogue summarization and machine translation simultaneously. We fine-tune mBART-50 with input dialogues in the source language and summaries in the target language. Note that mBART-50 is used in both the pipeline and end-to-end paradigms; the languages of the input and output sequences are the same in the pipeline paradigm but different in the end-to-end paradigm. To indicate the input and output languages, mBART-50 appends language identifiers (e.g., En, De and Zh) on both the encoder and the decoder sides.
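Schematically, the language identifiers are the only thing that distinguishes the monolingual (pipeline) use of mBART-50 from its cross-lingual (end-to-end) use. The placement below is illustrative; in practice the mBART-50 tokenizer handles the exact special tokens:

```python
def build_io(dialogue_tokens, summary_tokens, src_lang, tgt_lang):
    # Attach a source-language identifier to the encoder input and a
    # target-language identifier to the decoder output (mBART-style).
    encoder_input = dialogue_tokens + [f"<{src_lang}>"]
    decoder_output = [f"<{tgt_lang}>"] + summary_tokens
    return encoder_input, decoder_output

# Pipeline use: input and output languages match.
enc, dec = build_io(["Hello", "!"], ["Greeting", "."], "en", "en")
# End-to-end use: only the decoder-side identifier changes.
enc_x, dec_x = build_io(["Hello", "!"], ["Gruß", "."], "en", "de")
```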

mDIALBART
As suggested by previous work (Zhu et al., 2020; Ladhak et al., 2020), building an end-to-end model is preferable to a pipeline one because: 1) pipeline models suffer from error propagation; 2) the translation systems in the pipeline paradigm require either a large parallel corpus to train MT models or the monetary cost of paid MT services; 3) pipeline models incur recurring latency during inference (Ladhak et al., 2020). Thus, it is valuable and urgent to explore end-to-end models for XLDS. To this end, we propose mDIALBART, which extends mBART-50 via a second stage of pre-training, where the objectives help the pre-trained model better adapt to the XLDS task (cf. Figure 2).

Pre-Training Tasks and Corpora
Action Infilling (AcI). Action triples (i.e., "who-doing-what") help the pre-trained model be explicitly aware of the actions within utterances so as to generate more factual summaries (Chen and Yang, 2021). Following Chen and Yang (2021), we extract action triples through OpenIE systems (Angeli et al., 2015). Then we randomly sample text spans from the action triples and replace each span with a [MASK] token. About 15% of the tokens in an utterance are masked. The action infilling task requires the pre-trained model to reconstruct the original dialogues from the masked dialogues.
Utterance Permutation (UP). We shuffle all utterances in a dialogue and encourage the model to restore their original order.
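The two structural objectives can be sketched as simple data transforms. For simplicity this sketch masks span tokens one by one rather than collapsing each span into a single [MASK] as the paper does, and the OpenIE extraction step is assumed to have already produced the span indices:

```python
import random

MASK = "[MASK]"

def action_infill(tokens, action_spans, rate=0.15, rng=None):
    """Mask spans sampled from action triples until roughly `rate` of
    the tokens are covered. `action_spans` are (start, end) token-index
    pairs taken from OpenIE-extracted action triples."""
    rng = rng or random.Random(0)
    out = list(tokens)
    budget = int(rate * len(out))
    masked = 0
    for start, end in rng.sample(action_spans, len(action_spans)):
        if masked >= budget:
            break
        for i in range(start, end):
            out[i] = MASK
        masked += end - start
    return out

def utterance_permutation(utterances, rng=None):
    """Shuffle the utterances; the pre-training target is the dialogue
    in its original order."""
    rng = rng or random.Random(0)
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    return shuffled
```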

Monolingual Dialogue Summarization (MDS).
We adopt MDS as a pre-training task to enable the model to learn the summarization ability.

Machine Translation (MT).
Since the XLDS task needs translation capability, we employ machine translation as one of the pre-training tasks.
We pre-train mDIALBART on the XMediaSum40k and MediaSum424k corpora, which are provided by the semi-supervised setting of CLIDSUM. The dialogues as well as the monolingual dialogue-summary pairs used in AcI, UP and MDS come from MediaSum424k. For MT, we use the original English summaries paired with the translated summaries from the training set of XMediaSum40k as pre-training samples.

Multi-Task Pre-Training
In each iteration of the second pre-training stage, training samples from the four tasks are randomly selected and used to compute the summed loss and update the parameters. Thus, it is necessary to enable the pre-trained model to distinguish the different tasks, to avoid the confusion brought by joint training. Specifically, AcI, UP and MT can all be regarded as reconstruction tasks. To distinguish MDS from the others, a special token [SUM] is prepended to the input sequences of MDS.
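The batching scheme described above can be sketched as follows. The uniform sampling over tasks is an assumption; the paper does not state the mixing ratio between the four task pools:

```python
import random

def prepare_example(task, src_seq, tgt_seq):
    # Only MDS receives the special [SUM] token; AcI, UP and MT are
    # left untagged, as in the paper.
    if task == "MDS":
        src_seq = ["[SUM]"] + src_seq
    return src_seq, tgt_seq

def sample_batch(pools, batch_size, rng=None):
    """Randomly mix examples from the task pools into one batch; the
    losses over the batch are then summed for the parameter update."""
    rng = rng or random.Random(0)
    batch = []
    for _ in range(batch_size):
        task = rng.choice(sorted(pools))  # assumed uniform over tasks
        src, tgt = rng.choice(pools[task])
        batch.append(prepare_example(task, src, tgt))
    return batch
```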

Fine-tuning on XLDS
Figure 3 shows the fine-tuning process on the XLDS task. The input sequences of XLDS are prepended with the [SUM] token to leverage the dialogue summarization ability obtained in the second pre-training stage described in the previous section. Besides, the language identifiers on the encoder and decoder sides differ, so as to utilize the cross-lingual ability learned from the MT pre-training task.

Experiments
We evaluate mDIALBART and various baselines on CLIDSUM. Four automatic metrics are adopted in our experiments: ROUGE-1 (R1) and ROUGE-2 (R2) (Lin, 2004) evaluate the unigram and bigram overlap between the generated summaries and the corresponding references, respectively. ROUGE-L (R-L) (Lin, 2004) measures the longest common subsequence. BERTScore (B-S) (Zhang et al., 2020b) evaluates the semantic similarity of generated sentences against the references. For the evaluation toolkits and the implementation details of all models, please refer to Appendix A.
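For readers unfamiliar with the ROUGE family, a minimal sketch of the clipped n-gram F1 at its core is given below. The actual scores in this paper come from the multilingual ROUGE toolkit (with stemming and language-specific segmentation), not from this sketch:

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """Minimal ROUGE-N F1: clipped n-gram overlap between a candidate
    and a single reference, both given as token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped match counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```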

Main Results
Table 3 shows the experimental results. We first analyze the performance of the pipeline baselines and then compare them with the end-to-end baseline. Second, we compare mDIALBART with all baselines to demonstrate its superiority. Lastly, we introduce a simple data augmentation strategy to further improve the performance of end-to-end methods. For generated cases, please refer to Appendix B.
Pipeline Baselines. Comparing translate-then-summarize (Trans-Sum) with summarize-then-translate (Sum-Trans), we find that Trans-Sum outperforms Sum-Trans with the same summarizer (rows 19-21 vs. rows 10-12) due to the limited amount of monolingual dialogue-summary pairs in the source language. Besides, the English-centric dialogue summarization work (Chen and Yang, 2020; Feng et al., 2021) reduces summarization errors and helps Sum-Trans overtake Trans-Sum (rows 15/18 vs. row 21). Moreover, as shown in Table 4, we test the performance of the adopted MT methods on WMT20-newstest2020. Google Trans performs best in both the En⇒De and En⇒Zh directions, while OPUS-MT outperforms Trans-WMT in En⇒De translation but is worse in En⇒Zh translation. With the same dialogue summarizer, the XLDS performance of the pipeline methods is consistent with the performance of the corresponding MT method.
End-to-End vs. Pipeline Baselines. Comparing the end-to-end and pipeline baselines, mBART-50 achieves competitive results against the strong Trans-Sum baselines (row 22 vs. row 21), but performs worse than the strong Sum-Trans baselines (row 22 vs. rows 15/18). This is because: (1) the strong pipeline baselines adopt a sophisticated translation service which leverages a large amount of parallel data; (2) end-to-end models need both the ability to translate and the ability to summarize, which requires a large amount of cross-lingual training data. Nevertheless, existing datasets cannot fully meet this requirement.
mDIALBART vs. All Baselines. To further explore the end-to-end paradigm on XLDS, we propose mDIALBART, which extends mBART-50 through a second pre-training stage. Experimental results show that mDIALBART outperforms all baselines on all automatic metrics (row 24). Note that the data used in the second pre-training stage is provided by the semi-supervised setting of CLIDSUM. That is, mDIALBART is a semi-supervised XLDS system, and thus we only evaluate mDIALBART on the test set of XMediaSum40k.
Data Augmentation. Moreover, we construct a large number of pseudo-XLDS samples by translating the original English summaries from MediaSum424k into German and Chinese via Google Translation. The pseudo-XLDS samples together with the real XLDS samples from XMediaSum40k jointly train the end-to-end models via a simple curriculum learning (Bengio et al., 2009) strategy, i.e., the training data gradually shifts from pseudo to real samples over the epochs. With the help of the pseudo-XLDS samples, mBART-50 and mDIALBART significantly improve their performance on XMediaSum40k (rows 23/25 vs. rows 22/24). We also find that mDIALBART outperforms mBART+DA (row 24 vs. row 23), indicating the effectiveness of the second-stage pre-training: through carefully designed pre-training objectives, the pre-trained model is more effective than simply using the data augmentation strategy.
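One possible reading of this curriculum, presenting the noisier pseudo-XLDS samples before the real XMediaSum40k samples within each epoch, can be sketched as follows. The exact schedule is not fully specified in the text, so this ordering is an assumption:

```python
def curriculum_schedule(pseudo, real, epochs):
    """Yield, per epoch, the training data with pseudo-XLDS samples
    presented before the real XLDS samples (assumed ordering)."""
    for epoch in range(epochs):
        yield epoch, list(pseudo) + list(real)
```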

Ablation Study
As shown in Table 5, we conduct ablation studies on XMediaSum40k to evaluate the contribution of each pre-training task. All four tasks contribute to the second pre-training stage. The most important one is MDS, which is the pre-training task most relevant to XLDS: both MDS and XLDS require the pre-trained model to understand the main content of dialogues and then summarize them. MT and UP bring fewer benefits than the others due to the fewer pre-training samples and the lower required capability, respectively. Specifically, the number of MT pre-training samples is 20k, which is significantly fewer than for the other tasks. UP only requires the pre-trained model to restore the order of utterances in noisy dialogues rather than predicting new words; such an NLU-style task makes it easier for models to learn shortcuts rather than semantic understanding and reasoning (Du et al., 2021). For ablation studies combining two or three tasks, please refer to Appendix C.

Human Study
We conduct human studies to further evaluate the performance of the strong baselines under each paradigm and our pre-trained model, i.e., BART(D_ALL) + Google Trans, Google Trans + mBART, mBART and mDIALBART. We randomly select 50 samples from the test set of XMediaSum40k. Seven crowd workers with high levels of fluency in English and Chinese are asked to assess the generated Chinese summaries from three aspects: grammaticality (Gram.), informativeness (Info.) and conciseness (Conci.). Following the Best-Worst Scaling method (Kiritchenko and Mohammad, 2017), crowd workers are asked to select the best and the worst generated summaries for each criterion. The resulting scores are calculated as the percentage of times each model is selected as best minus the percentage of times it is selected as worst, so the final scores range from -1 (worst) to 1 (best). Table 6 shows the results of the human studies.
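The Best-Worst Scaling score described above reduces to a simple count. A minimal sketch, where each annotation names the best and worst system for one sample:

```python
from collections import Counter

def best_worst_scores(annotations):
    """Score each system as (#times best - #times worst) / #annotations;
    `annotations` is a list of (best_system, worst_system) pairs.
    Scores fall in [-1, 1]."""
    best = Counter(b for b, _ in annotations)
    worst = Counter(w for _, w in annotations)
    n = len(annotations)
    systems = set(best) | set(worst)
    return {s: (best[s] - worst[s]) / n for s in systems}
```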

XLDS Difficulties
To further study the specific challenges of end-to-end XLDS and suggest multiple promising directions for future research, we take a closer look at CLIDSUM and the models' generation errors. We conclude with the following difficulties worthy of research attention:
Multiple Topics. The dialogues in MediaSum40k are interview transcripts from NPR and CNN (Zhu et al., 2021), and each transcript usually covers multiple topics from different events. Previous dialogue summarization work (Chen and Yang, 2020; Feng et al., 2021) has proved the effectiveness of topic information and proposed topic-aware summarizers. Thus, it is also necessary to explore topic-aware end-to-end XLDS models.
Low Resource. End-to-end XLDS models learn both dialogue summarization and machine translation simultaneously, which requires a large amount of training data. Intuitively, as a more difficult task, XLDS needs more training samples than MT when building an end-to-end model. Following previous MT work (Lin et al., 2020), a parallel corpus is considered extremely low-resource when it contains fewer than 100k samples. Therefore, it is hard to learn XLDS well when only utilizing the supervised data provided by CLIDSUM. We believe low-resource XLDS is also worth studying.
Domain Adaptation. There are various dialogue domains in daily life (e.g., chats, interviews and debates), and it is impractical to construct XLDS datasets for each domain. Therefore, it is valuable to utilize XLDS data from one domain to improve model performance in others. Moreover, we find that mDIALBART performs slightly worse than the mBART baseline when fine-tuned on XSAMSum, indicating its limited domain adaptation capability.

Conclusion
In this paper, we introduce the XLDS task and present CLIDSUM, the first large-scale XLDS benchmark dataset. We also propose mDIALBART, an XLDS-oriented pre-trained model that extends mBART-50 via a second pre-training stage. The carefully designed pre-training tasks help mDIALBART better adapt to the XLDS task. Experiments on CLIDSUM demonstrate that mDIALBART outperforms strong baseline models.

Ethical Considerations
We discuss the main ethical considerations of the CLIDSUM benchmark dataset as follows: (1) Licenses. CLIDSUM is derived from SAMSum (Gliwa et al., 2019) and MediaSum (Zhu et al., 2021), both of which are well-constructed, published datasets. We will follow the CC BY-NC-ND 4.0 license of SAMSum to make XSAMSum public. Following the requirements of MediaSum, we restrict our usage to research purposes only, and will release the translated MediaSum under the CC BY-NC-SA 4.0 license. (2) Compensation. During the translation annotation, the payment for annotating each summary is determined by the average annotation time and local labor compensation standards.
(3) Data characteristics. We refer readers to the original datasets (Gliwa et al., 2019; Zhu et al., 2021) for more detailed characteristics. (4) Potential problems. While principled measures are taken to ensure the quality of the dataset, there might still be quality problems, which may lead to incorrect translations in applications.
We also consider potential ethical issues of our mDIALBART pre-trained model: mDIALBART inherits mBART-50 (Tang et al., 2021) and is further trained on the XMediaSum40k and MediaSum424k corpora. Therefore, mDIALBART could reflect the same biases and toxic behaviors exhibited by language models, such as biases about race and gender (Sheng et al., 2020).

Limitations
While we show that mDIALBART performs best on XMediaSum40k, there are limitations that provide avenues for future work.
(1) As discussed in Section 6.4, the domain adaptation capability of mDIALBART is limited. Future work can focus on designing general or unified XLDS pre-trained models which can be applied to multiple dialogue domains.
(2) Not all parameters of mDIALBART are useful, e.g., the token embeddings of irrelevant languages, which reduces the inference speed.

A Implementation Details

A.1 Automatic Evaluation
To calculate ROUGE scores, we employ the multilingual ROUGE toolkit, which provides segmentation and stemming algorithms for various languages. To calculate BERTScore, we use the bert-score toolkit.

A.3 Pre-Trained Language Models
The implementations of the pre-trained models used in our baselines are provided by Huggingface Transformers (Wolf et al., 2020), i.e., PEGASUS-large, T5-large, BART-large and mBART-50. When fine-tuning the pipeline baselines, we set the batch size of PEGASUS, T5, BART and mBART to 16 for XSAMSum and 24 for XMediaSum40k, with learning rates of 4e-5 and 2e-5, respectively. All these models are fine-tuned for 20 epochs. The hyperparameters of MV-BART (Chen and Yang, 2020) and BART(D_ALL) (Feng et al., 2021) are the same as in the original papers. When fine-tuning the end-to-end baseline, mBART-50 is trained with a batch size of 4, a learning rate of 5e-6 and 20 epochs.

A.4 mDIALBART
Our mDIALBART first inherits from mBART-50 (1024 hidden size, 16 attention heads, 12 encoder layers and 12 decoder layers) and then undergoes the second stage of pre-training, which is conducted on 8 NVIDIA Tesla V100 GPUs with 32GB memory. mDIALBART is trained using the pytorch-lightning framework with a learning rate of 5e-6 and a batch size of 8. We set the warmup steps and total steps to 5,000 and 1,000,000, respectively. The pre-trained mDIALBART is then fine-tuned on the XLDS task using a single GPU with a batch size of 4 and a learning rate of 5e-6. The maximum number of tokens for input sequences is 1024.
At test time, the beam size is 5 and the maximum decoding length is 150.

B Case Study
We show the summaries generated by the strong baselines and mDIALBART in Figure 4. The dialogue in the left example discusses only one event, i.e., Hillary Clinton declaring her candidacy for the U.S. Senate seat from New York. All the models generate a suitable summary that matches this event.

Figure 1:
Figure 1: An example of cross-lingual dialogue summarization. The blue and orange sentences are the German and Chinese summaries of the given English dialogue, respectively.

Table 1:
Statistics of CLIDSUM and previous synthetic XLS datasets. Trans. indicates the translation method (automatic or manual) used to construct each dataset. Src Lang. and Tgt Lang. denote the source and target languages (En: English, Zh: Chinese, De: German) of each dataset. Documents represents the size of each dataset. Doc. Length, Src Summ Length and Tgt Summ Length show the average length of documents, source summaries and target summaries (word-level for English and German, character-level for Chinese) for each dataset, respectively.

Table 2:
Statistics of the dialogue characteristics in (X)SAMSum and (X)MediaSum40k. Dial. indicates the number of dialogue documents in each subset of these datasets. Turns and Speakers represent the average number of utterances and speakers in the dialogues.
Figure 2: The second stage of pre-training in mDIALBART. <En> and <Zh> are two language identifiers indicating the input and output languages. [SUM] is a special token indicating the summarization task.

Table 3:
Experimental results on CLIDSUM. Bold and underline denote the best and second-best scores, respectively. † indicates that the evaluated models utilize the monolingual dialogue-summary pairs provided by the semi-supervised setting. Sum-Trans: summarize-then-translate; Trans-Sum: translate-then-summarize.

Table 4:
Translation results on WMT20 (En⇒De and En⇒Zh newstest2020). ↑ indicates higher is better; ↓ indicates lower is better.