Models and Datasets for Cross-Lingual Summarisation

We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language. The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German, and the methodology for its creation can be applied to several other languages. We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language-aligned Wikipedia titles. We analyse the proposed cross-lingual summarisation task with automatic metrics and validate it with a human study. To illustrate the utility of our dataset, we report experiments with multi-lingual pre-trained models in supervised, zero- and few-shot, and out-of-domain scenarios.


Introduction
Given a document in a source language (e.g., French), cross-lingual summarisation aims to produce a summary in a different target language (e.g., English). The practical benefits of this task are twofold: it not only provides rapid access to salient content, but also enables the dissemination of relevant content across speakers of other languages. For instance, providing summaries of articles from French or German newspapers to non-French or non-German speakers; or enabling access to summary descriptions of goods, services, or knowledge available online in foreign languages. Figure 1 shows an example of an input document in French (left) and its summary in English and other languages (right).
Recent years have witnessed increased interest in abstractive summarisation (Rush et al., 2015; Zhang et al., 2020) thanks to the popularity of neural network models and the availability of datasets (Sandhaus, 2008; Hermann et al., 2015; Grusky et al., 2018) containing hundreds of thousands of document-summary pairs. Although initial efforts have overwhelmingly focused on English, more recently, with the advent of cross-lingual representations (Ruder et al., 2019) and large pre-trained models (Devlin et al., 2019; Liu et al., 2020), research on multi-lingual summarisation (i.e., building monolingual summarisation systems for different languages) has been gaining momentum (Chi et al., 2020b; Scialom et al., 2020).
While creating large-scale multi-lingual summarisation datasets has proven feasible (Straka et al., 2018; Scialom et al., 2020), at least for the news domain, cross-lingual datasets are more difficult to obtain. In contrast to monolingual summarisation, naturally occurring documents in a source language paired with summaries in different target languages are rare. For this reason, existing approaches either create large-scale synthetic data using back-translation (Zhu et al., 2019; Cao et al., 2020), translate the input documents (Ouyang et al., 2019), or build document-summary pairs from social media annotations and crowd-sourcing (Nguyen and Daumé III, 2019). Recent efforts (Ladhak et al., 2020) have been directed at the creation of a large-scale cross-lingual dataset in the domain of how-to guides. Despite being a valuable resource, how-to guides are by nature relatively short documents (391 tokens on average) and their summaries limited to brief instructional sentences (mostly commands).
To further push research on cross-lingual summarisation, we propose a large dataset with document-summary pairs in four languages: Czech, English, French, and German. 1 Inspired by past research on monolingual descriptive summarisation (Sauper and Barzilay, 2009; Zopf, 2018; Liu et al., 2018; Liu and Lapata, 2019a; Perez-Beltrachini et al., 2019; Hayashi et al., 2021), we derive cross-lingual datasets from Wikipedia 2 , which we collectively refer to as XWikis. We exploit Wikipedia's Interlanguage links and assume that given any two related Wikipedia titles, e.g., Huile d'Olive (French) and Olive Oil (English), we can pair the lead paragraph from one title with the body of the other. We assume that the lead paragraph can stand as the summary of the article (see Figure 1). Our dataset covers different language pairs and enables different summarisation scenarios with respect to: degree of supervision (supervised, zero- and few-shot), combination of languages (cross-lingual and multi-lingual), and language resources (high- and low-resource).
To illustrate the utility of our dataset, we report experiments on supervised, zero-shot, few-shot, and out-of-domain cross-lingual summarisation. For the out-of-domain setting, we introduce Voxeurop, a cross-lingual news dataset. 3 In experiments, following recent work (Ladhak et al., 2020), we focus on All-to-English summarisation. In addition to assessing supervised and zero-shot performance of multi-lingual pre-trained models (Liu et al., 2020; Tang et al., 2020), we also provide a training mechanism for few-shot cross-lingual summarisation. 4

The XWikis Corpus
Wikipedia articles are organised into two main parts, a lead section and a body. For a given Wikipedia title, the lead section provides an overview conveying salient information, while the body provides detailed information. Indeed, the body is a long multi-paragraph text generally structured into sections discussing different aspects of the Wikipedia title. We can thus consider the body and lead paragraph as a document-summary pair. Furthermore, a Wikipedia title can be associated with Wikipedia articles in various languages, also composed of a lead section and body. Based on this insight, we propose the cross-lingual abstractive document summarisation task of generating an overview summary in a target language Y from a long structured input document in a source language X. Figure 1 illustrates this with an example. For the Wikipedia title Huile d'Olive (Olive Oil), it shows the French document on the left and overview summaries in German, French, Czech, and English on the right.
Below, we describe how our dataset was created, analyse its main features (Section 2.1), and present a human validation study (Section 2.2).
Cross-Lingual Summarisation Pairs From a set of Wikipedia titles with articles (i.e., lead paragraph and body) in N languages, we can create N!/(N−2)! cross-lingual summarisation sets D_{X→Y}, considering all possible language pairs and directions. Data points (Doc_X, Sum_Y) in D_{X→Y} are created, as discussed in the previous section, by combining the body of articles for titles t_X in language X with the lead paragraph of articles for corresponding titles t_Y in language Y. In this work, we focus on four languages, namely English (en), German (de), French (fr), and Czech (cs). To create the summarisation sets D_{X→Y}, we first use Wikipedia Interlanguage Links to align titles across languages, i.e., align title t_X ∈ X with t_Y ∈ Y. 5 Then, from the aligned titles t_X − t_Y, we retain those whose articles permit creating a data point (Doc_X, Sum_Y). In other words, t_X's article body and t_Y's lead section should obey the following length restrictions: a) the body should be between 250 and 5,000 tokens long and b) the lead between 20 and 400 tokens. Table 1 shows the number of instances in each set D_{X→Y} created following this procedure.
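As an illustration, the pairing-and-filtering step can be sketched as follows (a simplified sketch, not the released extraction code; function and variable names are our own):

```python
# Hypothetical sketch of building D_{X->Y}: pair the body of the
# article for t_X with the lead of the aligned article for t_Y,
# keeping only pairs within the paper's length restrictions
# (body: 250-5,000 tokens; lead: 20-400 tokens).

def make_pairs(aligned_titles, articles_x, articles_y):
    """aligned_titles: list of (t_x, t_y) title pairs.
    articles_x / articles_y: dicts mapping a title to a
    (lead_tokens, body_tokens) tuple."""
    pairs = []
    for t_x, t_y in aligned_titles:
        if t_x not in articles_x or t_y not in articles_y:
            continue  # article missing in one of the languages
        _, body_x = articles_x[t_x]
        lead_y, _ = articles_y[t_y]
        if 250 <= len(body_x) <= 5000 and 20 <= len(lead_y) <= 400:
            pairs.append((body_x, lead_y))  # (Doc_X, Sum_Y)
    return pairs
```

Running the same function with X = Y recovers the monolingual sets described below.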
Wikipedia titles differ across language subsets, and thus the sets D_{X→Y} will include different sets of titles. For better comparison in the evaluation of our models, we would like to have exactly the same set of titles. To achieve this, we take 7,000 titles in the intersection of all language sets. We call this subset XWikis-parallel and the sets with the remaining instances XWikis-comparable.
For further details about the data collection process, see Appendix A.
Monolingual Summarisation Data A by-product of our data extraction process is the creation of multi-lingual summarisation data. Each D_{X→Y} set has its corresponding monolingual D_{X→X} version. Data points (Doc_X, Sum_X) in D_{X→X} are created by combining the body of articles for titles t_X in language X with the lead paragraph of articles in the same language X.

Features of XWikis Dataset
Comparison with Existing Datasets Our dataset departs from existing datasets in terms of size, summarisation task, and potential for extension to additional languages. We compare statistics for our XWikis corpus with those of existing datasets. Our dataset is larger in terms of the number of document-summary pairs. WikiLingua (Ladhak et al., 2020) is also larger than previous datasets in terms of number of instances; however, its summarisation task is different. In XWikis, the input documents are more than twice as long (average number of tokens). As for the number of languages, although in this work we focus on four European ones, the proposed data creation approach allows us to extend the dataset to a large number of languages, including more distant pairs (e.g., English-Chinese) as well as low-resource and understudied languages (e.g., Gujarati and Quechua).

Summarisation Task
We carry out a detailed analysis of our XWikis corpus to characterise the summarisation task it represents and assess the validity of the created summarisation data points (Doc_X, Sum_Y). In the first instance, we do this through automatic metrics. Since metrics based on token overlap (Grusky et al., 2018; Narayan et al., 2018) cannot be directly applied to our cross-lingual data, we carry out the automatic analysis on the monolingual version of the corpus instead, i.e., on (Doc_X, Sum_X) instances. We first validate the assumption that the lead paragraph can serve as a summary for the article body. Table 3 provides statistics per language pair for XWikis-comparable, 6 and averaged over all language pairs for XWikis-parallel.
Size. The top part of Table 3 provides an overview of the summarisation task in terms of size. The documents are long, with an overall average of 952 tokens, 40 sentences (average sentence length is thus ~23 tokens), and 6 sections. We analyse the average number of sections per document as a proxy for the complexity of the content selection sub-task: a summariser will need to learn which aspects are summary-worthy and extract content from different sections of the input document. Summaries are also long, with 60 tokens and 3 sentences on average.

Content Diversity.
To assess the diversity of content in the corpus, we report the number of distinct top level section titles as an approximation (without doing any normalisation) of aspects discussed (Hayashi et al., 2021). These high numbers, together with the average number of sections per document, confirm that our dataset represents multi-topic content.
Level of Abstraction. To characterise the summarisation task in terms of level of abstraction, we analyse the content overlap of document-summary pairs using automatic metrics (Grusky et al., 2018; Narayan et al., 2018) and then evaluate the performance of two extractive summarisation approaches. 7 When the summarisation task is extractive in nature (i.e., the summaries copy text spans from the input document), extractive methods ought to perform well.
The set of automatic metrics proposed by Grusky et al. (2018) indicates the extent to which a summary is composed of textual fragments from the input document, i.e., extractive fragments. Coverage measures the average number of tokens in the summary that are part of an extractive fragment; Density indicates the average length of the set of extractive fragments. As shown in Table 3, Coverage is high, especially for the de and fr sets, while Density is quite low. This indicates that the summaries overlap in content with the input documents but do not use the same phrases. Although summaries are not short, the compression ratio is high given the size of the input documents. This highlights the rather extreme content selection and aggregation imposed by the summarisation task. The second set of metrics, proposed by Narayan et al. (2018), measures the percentage of new n-grams appearing in the summary (i.e., not seen in the input document), and shows a similar trend: the percentage of novel unigrams is low but increases sharply for higher-order n-grams.
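For concreteness, these fragment statistics can be computed roughly as below (a simplified re-implementation of the Grusky et al. (2018) definitions, not the original released code):

```python
# Greedy extractive-fragment matching: for each summary position,
# find the longest document span starting at that position.

def extractive_fragments(doc, summ):
    """doc, summ: lists of tokens. Returns fragment lengths."""
    frags, i = [], 0
    while i < len(summ):
        best, j = 0, 0
        while j < len(doc):
            if doc[j] == summ[i]:
                k = 0
                while (i + k < len(summ) and j + k < len(doc)
                       and doc[j + k] == summ[i + k]):
                    k += 1
                best = max(best, k)
                j += max(k, 1)  # skip past the matched span
            else:
                j += 1
        if best > 0:
            frags.append(best)
            i += best  # fragment tokens are consumed
        else:
            i += 1     # unmatched (novel) summary token
    return frags

def coverage(doc, summ):
    # fraction of summary tokens inside some extractive fragment
    return sum(extractive_fragments(doc, summ)) / len(summ)

def density(doc, summ):
    # squared fragment lengths reward long verbatim copies
    return sum(f * f for f in extractive_fragments(doc, summ)) / len(summ)
```

High Coverage with low Density, as observed in Table 3, corresponds to many short matched fragments rather than long copied spans.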
The last two rows in Table 3 report ROUGE-L for two extractive methods. LEAD creates a summary by copying the first K tokens of the input document, where K is the size of the reference; it performs well when the summarisation task is biased towards content appearing in the first sentences of the document. EXT-ORACLE selects the subset of sentences that maximises ROUGE-2 (Lin, 2004) with respect to the reference summaries (Nallapati et al., 2017; Narayan et al., 2018) and performs well when the summarisation task is mostly extractive. As we can see, LEAD is well below EXT-ORACLE (~4 ROUGE-L points on average), indicating no lead bias (i.e., summary-worthy content is not concentrated at the beginning of the document). EXT-ORACLE performs better; however, considering the high levels of Coverage, it does not seem to cover all salient content. This indicates that important content is scattered across the document in different sentences (not all of which are selected by EXT-ORACLE) and that the phrasing differs (see the jump from the percentage of novel unigrams to bigrams). The French subset has the highest Coverage (conversely, the lowest percentage of novel unigrams), and is thus more amenable to extractive methods.

Validation through Human Evaluation
To further complement the automatic evaluation, we carried out a human evaluation study to assess the quality of cross-lingual data instances (Doc_X, Sum_Y). In other words, we validate the assumption that given a pair of aligned titles t_X − t_Y, the lead paragraph in language Y is a valid overview summary of the document body in language X. As this evaluation requires bilingual judges, we selected three language pairs, namely D_{de→en}, D_{fr→en}, and D_{cs→en}, and recruited three judges per pair, i.e., bilingual in German-English, French-English, and Czech-English. We selected 20 data instances from each set and asked participants to give an overall judgement of summary adequacy. Specifically, they were asked to provide a yes/no answer to the question Does the summary provide a general overview of the Wikipedia title?. In addition, we elicited more fine-grained judgments by asking participants to ascertain for each sentence in the summary whether it was supported by the document. We elicited yes/no answers to the question Does the sentence contain facts that are supported by the document?. We expect judges to answer no when the content of a sentence is not discussed in the document and yes otherwise. Table 4 shows the proportion of yes answers given by our judges for the three language pairs. Overall, judges view the summary as an acceptable overview of the Wikipedia title and its document. The same picture emerges when considering the more fine-grained sentence-based judgments: 66.2% of summary sentences are supported by the document for the German-English pair, 77.4% for French-English, and 60.5% for Czech-English. We also used Fleiss's Kappa to establish inter-annotator agreement between our judges; this was 0.48 for German-English, 0.55 for French-English, and 0.59 for Czech-English.

Task
Following previous work (Ladhak et al., 2020), the specific cross-lingual task that we address is generating English summaries from input documents in different (source) languages. In the context of cross-lingual summarisation, we assume that a) we have enough data to train a monolingual summariser in a source language; b) we want to port this summariser to a different target language with no additional data (zero-shot) or only a handful of training examples (few-shot); and c) the representations learnt by the monolingual summariser to carry out the task (i.e., select relevant content and organise it into a short coherent text) should transfer or adapt to the cross-lingual summarisation task. The main challenges in this setting are understanding input documents in a new language, which may exhibit new relevance clues, and translating their content into the target language.
Specifically, we assume we have access to monolingual English data (Doc_en, Sum_en) to learn an English summariser, and we study the zero- and few-shot cross-lingual scenarios where the input to this model is in a language other than English (i.e., German, French, and Czech). We further exploit the fact that our XWikis corpus allows us to learn cross-lingual summarisation models in a fully supervised setting, and establish comparisons against models with weaker supervision signals. Our fully supervised models follow state-of-the-art approaches based on Transformers and pre-training (Liu and Lapata, 2019b; Lewis et al., 2020). We simulate zero- and few-shot scenarios by considering subsets of the available data instances.

Approach
We formalise cross-lingual abstractive summarisation as follows. Given an input document Doc_X in language X represented as a sequence of tokens x = (x_1, ..., x_|x|), our task is to generate Sum_Y in language Y. The target summary is also represented as a sequence of tokens y = (y_1, ..., y_|y|) and is generated token-by-token, conditioning on x, by a summarisation model p_θ as ∏_{t=1}^{|y|} p_θ(y_t | y_{1..t−1}, x). Our summarisation model is based on mBART50 (Tang et al., 2020), a pre-trained multi-lingual sequence-to-sequence model. mBART50 is the result of fine-tuning mBART (Liu et al., 2020) with a multi-lingual machine translation objective (i.e., fine-tuning with several language pairs at the same time); this fine-tuning process extends the number of languages from 25 to 50. mBART (Liu et al., 2020) follows a Transformer encoder-decoder architecture (Vaswani et al., 2017). It was trained on a collection of monolingual documents in 25 different languages to reconstruct noised input sequences, obtained by replacing spans of text with a mask token or permuting the order of sentences in the input.
Although pre-trained models like mBART50 provide multi-lingual representations for language understanding and generation, they require adjustments in order to be useful for abstractive summarisation. Given a training dataset D with document-summary instances {x_n, y_n}_{n=1}^{|D|}, and starting from a model with parameters θ given by mBART50, we fine-tune to minimise the negative log-likelihood on the training dataset. In the supervised setting, we directly fine-tune on the target cross-lingual task. However, in our zero- and few-shot settings we only have monolingual summarisation data available; we therefore take D to be an English monolingual set (i.e., D_{en→en}).
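Written out, the generation probability and the fine-tuning objective are (same notation as above):

```latex
% Token-by-token factorisation of summary generation
p_\theta(y \mid x) = \prod_{t=1}^{|y|} p_\theta\left(y_t \mid y_{1..t-1},\, x\right)

% Fine-tuning: minimise the negative log-likelihood over D
\mathcal{L}(\theta) = -\sum_{n=1}^{|D|} \sum_{t=1}^{|y_n|} \log p_\theta\left(y_{n,t} \mid y_{n,1..t-1},\, x_n\right)
```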
In the zero-shot scenario, a monolingual English summariser is used for cross-lingual summarisation, and we assume that the parameters of the English model will be shared to a certain extent across languages (Chi et al., 2020a). In the few-shot scenario, we assume that in addition to monolingual summarisation data, we also have access to a small dataset S_{X→en} with cross-lingual summarisation examples. Although it might be possible to curate cross-lingual summaries for a small number of examples, using these in practice for additional model adaptation can be challenging. In this work we propose an approach reminiscent of the few-shot Model Agnostic Meta-Learning (MAML) algorithm (Finn et al., 2017).
MAML is an optimisation-based learning-to-learn algorithm which involves meta-training and meta-testing phases. Meta-training encourages learning representations which are useful across a set of different tasks and can be easily adapted, i.e., with a few data instances and a few parameter updates, to an unseen task during meta-testing. More concretely, meta-training consists of nested optimisation iterations: inner iterations take the (meta) model parameters θ_meta and compute for each task T_i a new set of parameters θ_i. In the outer iteration, the (meta) model parameters are updated according to the sum of each task T_i's loss on the task-specific parameters θ_i. 8 At test time, the (meta) model parameters can be adapted to a new task with one learning step using the small dataset associated with the new task.
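The nested optimisation can be illustrated on a toy problem. The sketch below is a first-order MAML loop on scalar tasks (purely illustrative; it is not the paper's training code, and the task family, loss, and learning rates are our own assumptions):

```python
# Toy first-order MAML: each task i is "move theta towards a
# target a_i", with task loss (theta - a_i)^2. Inner iterations
# compute task-specific parameters theta_i; the outer iteration
# updates the meta-parameters with the gradient evaluated at
# theta_i (first-order: the dependence of theta_i on theta is
# ignored, so no second derivatives are needed).

def fomaml(targets, theta=0.0, alpha=0.1, beta=0.05, steps=200):
    for _ in range(steps):
        outer_grad = 0.0
        for a in targets:                    # meta-training tasks T_i
            grad = 2.0 * (theta - a)         # inner-loop gradient
            theta_i = theta - alpha * grad   # task-specific parameters
            outer_grad += 2.0 * (theta_i - a)  # task loss gradient at theta_i
        theta -= beta * outer_grad / len(targets)  # outer update
    return theta
```

On this toy family the meta-parameters converge to the point from which every task target is reachable in one cheap gradient step (here, the mean of the targets).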
We assume that the multi-lingual and MT pre-training of mBART50 (and mBART) is a form of meta-training involving several language tasks which learn representations shared across different languages. We then adapt the English monolingual summariser to the cross-lingual task T_{X→en} with a small set of instances S_{X→en}. We perform a single outer-loop iteration, and instead of taking a copy of the (meta) parameters and updating them after the inner loop, we combine the support set with a monolingual sample of similar size. We call this method light-weight First Order MAML (LF-MAML).
We also observe that in a real-world scenario, in addition to the small set of cross-lingual examples S_{X→en}, there may exist documents Doc_X in the source language without corresponding summaries in English. To further train the model with additional unlabelled data, we apply a Cross-View Training technique (CVT; Clark et al. 2018). We exploit the fact that our fine-tuning does not start from scratch but rather from a pre-trained model which already generates output sequences of at least minimal quality. We augment the set of document-summary pairs (x, y) in S_{X→en} with instances (x̂, ŷ), where ŷ is generated by the current model and x̂ is a different view of x. We cheaply create different views of input x by taking different layers from the encoder.

Experimental Setup
Datasets and Splits We work with the D_{de→en}, D_{fr→en}, and D_{cs→en} directions of our XWikis corpus (i.e., first column in Table 1) and evaluate model performance on the XWikis-parallel set. We split XWikis-comparable into training (95%) and validation (5%) sets.
To train an English monolingual summariser, we created a monolingual dataset D_{en→en} following the procedure described in Section 2 (lead paragraph and body of Wikipedia articles). We selected a set of Wikipedia titles disjoint from those in our XWikis corpus. This dataset has 300,000 instances with 90/5/5 percent of instances in training/validation/test subsets. It has similar characteristics to the data in our XWikis corpus, with an average document and summary length of 884 and 70 tokens, respectively.

Table 5: ROUGE-L recall of the source document against the reference monolingual summary, computed over all input tokens (All), the first 800 tokens, and the 600 tokens extracted with paragraph-based LEXRANK.
Paragraph Extraction To deal with very long documents, we carry out an initial extractive step (Liu et al., 2018; Liu and Lapata, 2019a). Specifically, we rank document paragraphs (represented as vectors of their tf-idf values) using LEXRANK (Erkan and Radev, 2004) and then select the top-ranked paragraphs up to a budget of 600 tokens. Table 5 reports ROUGE-L recall of the input against the reference summary (note that to measure this we take the monolingual summary associated with the document rather than the cross-lingual one). As can be seen, the extractive step reduces the document to a manageable size without sacrificing too much content. Note that after ranking, selected paragraphs are kept in their original position to avoid creating a bias towards important information appearing at the beginning of the input sequence.
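A simplified version of this extractive step might look as follows. The sketch uses tf-idf cosine centrality as a light-weight stand-in for full LEXRANK's power iteration; all names are ours, and the scoring is an assumption, not the paper's implementation:

```python
import math
from collections import Counter

def tfidf_vectors(paragraphs):
    """paragraphs: list of token lists -> sparse tf-idf dicts."""
    df = Counter()
    for p in paragraphs:
        df.update(set(p))
    n = len(paragraphs)
    vecs = []
    for p in paragraphs:
        tf = Counter(p)
        vecs.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vecs

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_paragraphs(paragraphs, budget=600):
    """Rank paragraphs by centrality, keep top ones up to the
    token budget, and restore their original document order."""
    vecs = tfidf_vectors(paragraphs)
    scores = [sum(cosine(vecs[i], vecs[j])
                  for j in range(len(vecs)) if j != i)
              for i in range(len(vecs))]
    ranked = sorted(range(len(paragraphs)), key=lambda i: -scores[i])
    chosen, used = [], 0
    for i in ranked:
        if used + len(paragraphs[i]) <= budget:
            chosen.append(i)
            used += len(paragraphs[i])
    return [paragraphs[i] for i in sorted(chosen)]  # original order
```

Restoring the original order in the last line mirrors the paper's design choice of avoiding a position bias in the extracted input.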

Out of Domain Data
To evaluate the robustness of cross-lingual models on non-Wikipedia text, we created an out-of-domain dataset from the European news site Voxeurop. This site contains news articles composed of a summary section (with multi-sentence summaries) and a body, written and translated into several languages by professional journalists and translators. We extracted from this site 2,666 summary-article pairs in German, French, Czech, and English. The average document length is 842 tokens and the average summary length 42 tokens. We used 2,000 instances for evaluation and reserved the rest for model adaptation.

Models
We evaluated a range of extractive and abstractive summarisation models, detailed below. In cases where translation is required, we used the Google API. 9

Extractive We applied extractive approaches to the source documents. Extracted sentences were then translated into English to create a summary in the target language.
1. EXT-ORACLE This extractive approach builds summaries by greedily selecting sentences from the input that together maximise ROUGE-2 against the reference summary. We implemented this upper bound following Nallapati et al. (2017)'s procedure. For datasets D_{X→en}, we take the monolingual summary associated with the input document as a proxy for ROUGE-based selection.
2. LEAD The first K tokens from the input document are selected where K is the length of the reference summary.
3. LEXRANK This approach uses tf-idf graph-based sentence ranking (Erkan and Radev, 2004) to select sentences from the input and then takes the first K tokens (where K is the length of the reference summary).
Supervised We fine-tuned three separate models based on mBART (Liu et al., 2020) and mBART50 (Tang et al., 2020) in a supervised fashion on the three cross-lingual datasets (D_{de→en}, D_{fr→en}, and D_{cs→en}). This provides an upper bound on achievable performance. Additionally, we trained an English summariser on the separate English dataset D_{en→en} (described in the previous section) for our zero- and few-shot scenarios.
Translated This is a translate-then-summarise pipeline approach: we first translate the input documents Doc_de, Doc_fr, and Doc_cs into English and then apply a monolingual English summariser.
Zero-Shot A monolingual English summariser is directly applied to summarise Doc_de, Doc_fr, and Doc_cs documents into English. We fine-tune the entire network except the embedding layer. We report experiments with mBART50 (and mBART).

Few-Shot These models are based on fine-tuned monolingual English summarisers subsequently adapted to the cross-lingual task with a small set of examples S_{X→en}. We present experiments with mBART and mBART50 pre-trained models. We evaluate three few-shot variants (see Section 3.2): LF-MAML is the light-weight First Order MAML version, FT is a fine-tuned version where only cross-attention and layer-normalisation layers are fine-tuned, and CVT incorporates additional unlabelled instances into the adaptation step. We also consider two settings, with |S_{X→en}| being 300 and 1,000 instances. Note that in each case we take 1/3 for validation and the rest for training. For CVT, we generate two views, x̂_m and x̂_u, for each input document x in S_{X→en}: one by taking a middle encoder representation (x̂_m, the hidden states at layer 6) and another by taking an upper encoder representation (x̂_u, the hidden states at layer 11). Intuitively, these provide different levels of abstraction of the input document.

Results and Analysis
In this section we discuss our cross-lingual summarisation results (Table 6 and Table 7). Cross-lingual summarisation proves harder than its monolingual counterpart, with cross-lingual models falling behind monolingual ones (Table 6). This gap is highest when summarising from Czech to English.
Can we Beat Machine Translation? In agreement with previous work (Ladhak et al., 2020), we find that Supervised models are better than Translated ones. Zero-shot versions with mBART50 perform slightly below Translated, except for German-to-English (this is more surprising for mBART, which has not seen any cross-language links during pre-training). Interestingly, Few-shot with mBART50 and 300 training instances achieves comparable performance, which indicates that the summariser can improve on the new cross-lingual task by seeing only a few examples. We observe a similar trend for mBART, even though it never sees any cross-lingual examples during pre-training.
Which Few-Shot Model is Better? LF-MAML performs well across languages in both the 300 and 1K settings. Indeed, in the latter configuration it beats Translated and gets closer to Supervised using a relatively small training set (~600 instances; the rest is used for validation). The performance of the FT and CVT variants varies depending on the language. FT (which only fine-tunes cross-attention) helps when summarising from French, whereas CVT helps when summarising from Czech. The latter model benefits from potentially noisier unlabelled instances.
Is Out-of-Domain Summarisation Feasible? Table 7 shows the performance of a monolingual English summariser trained on XWikis and tested on the Voxeurop dataset. There is indeed a penalty for domain shift of approximately 10 ROUGE points (compare row Zero in Table 7 with rows Supervised/Zero in Table 6). Overall, Few-shot manages to improve upon Zero-shot, even though the few training instances come from a more distant distribution than the one used to pre-train the monolingual summariser (i.e., different genres).
Which Pre-trained Model? Our experiments identify mBART as the weakest pre-trained model, reporting lower ROUGE scores across languages, domains, and training settings (e.g., supervised, zero-and few-shot). mBART50 benefits from finetuning on machine translation and this knowledge is useful to our summarisation task.
Are there Differences between Languages? In the XWikis corpus (and mostly with mBART), Czech-to-English has the lowest performance. However, this gap disappears when applying the Few-shot variants to the summarisation task. In Voxeurop, there are no discernible differences amongst language pairs; this is probably due to the fact that document-summary pairs are translations of each other across languages.

How Hard is Cross-lingual Summarisation?
The task is very challenging! XWikis documents are long, and summarisation models must be able to represent multi-paragraph text adequately and isolate important content which is interspersed throughout the document. This difficulty is further compounded by the translation of content across languages and the need for models to abstract, rephrase, and aggregate information. Our results in Tables 6 and 7 show that there is plenty of room for improvement.

Conclusion
We presented a new summarisation dataset in four languages (German, French, Czech, and English) which we hope will be a valuable resource for cross-lingual and monolingual summarisation. We evaluated a wide range of models on the cross-lingual summarisation task, including zero- and few-shot variants, some of which show promising results.
Future work directions are many and varied. We would like to further investigate MAML variants for few-shot summarisation, and expand on document views for CVT (e.g., by looking at semantic roles and discourse relations).

A The XWikis Corpus
Dataset Creation Our corpus was created with English, German, French and Czech Wikipedia dumps from June 2020. 10 We adapted Wikiextractor (Attardi, 2015) to obtain the lead section and body of Wikipedia articles. We preserved the structure of the input document, and section mark-ups were kept (e.g., <h2>). We used a dump of the same date for the table containing the Wikipedia Interlanguage Links. 11 We performed text normalisation (a variant of NFKC normalization) with sentence-piece (Kudo and Richardson, 2018).

B Experiments
All our models were built on top of the fairseq library (Ott et al., 2019) code base.
Text Processing For sentence splitting and tokenisation in German, French and English, we used the Stanza Python NLP Package (Qi et al., 2020). For Czech, we used the MorphoDiTa package (Straka and Straková, 2016).
Training Details For mBART50 (Tang et al., 2020), we used the checkpoint provided as mBART50 finetuned many-to-many, and for mBART the mBART.cc25 checkpoint, both available in the fairseq library (Ott et al., 2019). We reused mBART's 250K sentencepiece (Kudo and Richardson, 2018) model, which was trained using monolingual data for 100 languages. However, to reduce the size of the model to fit our GPU availability, we carried out the following modifications. We trimmed the vocabulary to 135K: we first applied the sentencepiece encoder to the language sets in our XWikis corpus (Table 1) and the English data (used to train the monolingual summariser on D_{en→en}) to generate a reduced dictionary; then, we trimmed the dictionary and the models' embeddings (taking care to map indices from the original dictionary to the reduced one). We further slimmed down the position embeddings layer from 1,024 to 600. Supervised fine-tuning of mBART and mBART50 was carried out for 20K updates with a batch size of 80 instances, following previous work (Lewis et al., 2020; Liu et al., 2020). We used the Adam optimizer (ε=1e-6 and β2=0.98) with linear learning rate decay scheduling. We set the dropout rate to 0.3 and attention dropout to 0.1. We used half precision (fp16), set weight decay to 0.01, and clipped the gradient norm to 0.1. We fine-tuned with label smoothing (α=0.2). When fine-tuning on English monolingual summarisation, we froze the embedding layer for mBART50, as this showed better zero-shot results (but not for mBART, where zero-shot results did not improve). We used 4 GPUs with 12GB of memory; fine-tuning took 2 days.
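The dictionary-trimming step can be sketched as below (a hypothetical illustration with plain Python lists standing in for the sentencepiece dictionary and the embedding matrix; this is not the code used in the paper):

```python
# Keep only subwords actually observed when encoding the task
# corpora (plus special tokens), and remap embedding rows from
# old indices to the new, smaller index space.

def trim_vocab(vocab, embeddings, used_ids, special_ids=(0, 1, 2, 3)):
    """vocab: list of subword strings (index = old id).
    embeddings: rows aligned with vocab.
    used_ids: old ids observed in the task corpora."""
    keep = sorted(set(special_ids) | set(used_ids))
    new_vocab = [vocab[i] for i in keep]
    new_embeddings = [embeddings[i] for i in keep]
    old_to_new = {old: new for new, old in enumerate(keep)}
    return new_vocab, new_embeddings, old_to_new
```

The `old_to_new` map is the crucial piece: it re-indexes the training data so that every token id points at the same embedding row as before trimming.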
For the few-shot adaptation, we kept similar hyperparameters, except that we used a much smaller batch size of 8 instances and ran 1K updates (300 few-shot) or 5K updates (1K few-shot). We monitored validation perplexity and selected the checkpoint with the best perplexity. All few-shot variants used 1 GPU with 12GB of memory (and needed less than 10 hours of training). For the Few approaches, we sampled a subset of English S en→en instances similar in size to the support set S X→en of the adaptation task T X→en , and doubled its size when additionally applying CVT. The sample of unlabelled CVT instances was also similar in size to the task support set; adding more unlabelled data for CVT hurt performance. We combined data from the three tasks: English monolingual, Few cross-lingual instances (the task support set), and unlabelled cross-lingual instances. We computed a weighted loss with weights 0.5, 1, and 0.1, respectively (variants without CVT have 0 as the third weight).
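The three-way loss combination above can be sketched as follows; this is a minimal sketch in which the per-task scalar losses stand in for the actual sequence-to-sequence losses:

```python
def combined_loss(loss_mono, loss_few, loss_cvt=0.0, use_cvt=True):
    """Weighted sum over the three task streams: English monolingual
    (weight 0.5), few-shot cross-lingual support set (weight 1), and
    CVT on unlabelled instances (weight 0.1, or 0 without CVT)."""
    w_cvt = 0.1 if use_cvt else 0.0
    return 0.5 * loss_mono + 1.0 * loss_few + w_cvt * loss_cvt
```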
We followed the same instance formatting as Liu et al. (2020): we use special language ID tokens <LID>, append </S> to every sentence, and prepend <S> at the beginning of each sequence.

Full Set of Metrics and Results
Tables 8 and 9 are extended versions of Tables 6 and 7 in the paper; here, we report ROUGE-1/2/L F1 metrics.
Example Outputs Tables 10, 11, and 12 show example outputs by mBART50 model variants for the three language pairs German-English, French-English, and Czech-English, respectively. Table 13 shows example outputs for the different mBART50 model variants on the Voxeurop dataset.
De: Erst der Atomausstieg Deutschlands, dann die Ablehnung einer Rückkehr zur Atomenergie in Italien: Dieser Sinneswandel zweier EU-Gründungsmitglieder könnte die übrigen Mitgliedsstaaten dazu bewegen, sich endgültig von der Kernkraft zu verabschieden und künftig auf erneuerbare Energien zu setzen.
En: Germany is phasing out nuclear power and Italy has rejected its reintroduction. This about-face by two founding members of the European Union could encourage other member states to turn the nuclear page and to develop renewable energies.
Cs: Německo a Itálie rozhodly vzdát se jaderné energie - radikální obrat v pozicích dvou zakládajících členů EU by mohl přimět další členské státy k odklonu od jádra a zaměřit se na obnovitelné zdroje.

Reference: "Never Smile at a Crocodile" is a comic song with music by Frank Churchill and lyrics by Jack Lawrence. The music, without the lyrics, was first heard in the Walt Disney animated film "Peter Pan". Following the film's release in 1953, the sung version with Lawrence's lyrics went on to become a children's song classic.

Supervised: "Never Smile at a Crocodile" is a song by American singer-songwriter Charles Churchill. It was first released on the Disney soundtrack to the 1937 Disney film "Blanche-Neige et les Sept Nains" in 1939, when "Peter Pan" was already in the planning stages. However, Disney decided to discontinue the recording until 1949.

Translated: "Never Smile at a Crocodile" is a song from the 1937 Disney film "Snow White and the Seven Dwarfs". It was composed by Charles Churchill, who had composed most of the soundtrack for the 1937 film, "Peter Pan". However, the original lyrics were later recorded by several singers, including Jerry Lewis and Rolf Harris. The song became one of the top ten best-selling children's songs that year.

Zero: "Never Smile at a Crocodile" is a song by American singer-songwriter John Churchill. It was first released in 1953 as the soundtrack to the 1937 Disney film, "Blanche-Neige and the Seven Nains". The song was later re-released as a CD in 1997.

FT: "Never Smile at a Crocodile" is a song by American singer-songwriter John Churchill. It was first released in 1939 as the soundtrack to the 1937 Disney film "Blanche-Neige and the Seven Nains". The song was later re-released as the lead single from the 1953 film "Peter Pan".

Table 11: Example with mBART50 based models outputs from the validation set for French-to-English.

Gold: One in every five young Europeans is out of a job, and even one in two in some countries. Numbers like these were enough to have the young generation rebel against governments in the Arab world, remarks a Polish columnist. What will happen if our social model deprives young people of all hope?

ORACLE: For many international education experts, a university education - bachelor or master's degree, doctorate - is the measure of all things. And it is true that the time-frame may not be ideal, as the German system is strongly dependent on the economy.

LEAD: More than 5.5m young Europeans are without jobs. In the crisis countries in southern Europe, a generation is coming of age with few prospects: one in two Spaniards and Greeks under 25 are unemployed, and it's one in three in Italy and Portugal. To them, Germany must

LEXRANK: As do young southern Europeans who are leaving home to come to Germany to find a job or receive vocational training. They not only lack companies willing to create apprenticeship positions, and patient "masters" happy to pass on their know-how to "their" apprentices, but also the institutions, and

Zero: Youth unemployment in Europe has risen to 52% in Spain and Greece. In countries such as the United Kingdom, the jobs that are on offer are invariably short-term contracts. Precarious work is now the only option for a generation threatened by employment and poverty. However, in Europe, we may not have dictators to depose, but Monti's remarks are an indirect admission of the capitulation of democracy in response to the crisis.

de-en Zero: This article is a list of the events that have taken place in Greece, Italy, Spain, and the United Kingdom in the last decades of the twentieth century. The events that took place in Italy, Greece, Spain and Italy in the first decade of the twenty-first century have been described as "the most important events in the history of the European Union".

Few: In Europe, youth unemployment is on the rise. In Spain and Greece, it is rising to 52 per cent. But what will happen if the governments of Greece, Spain and Italy stop cutting their pensions?

fr-en Zero: This is a list of events that have taken place in the last decades of the twentieth century in Europe. The most recent events in the history of the European Union have been the events in Greece, Spain, Spain and the United Kingdom.

Few: A message of hope for young people in Europe has been delivered by Italian Prime Minister Mario Monti, who has deplored the fact that the unemployment rate of 20 years old is now a lost generation. But what will happen when they are no longer in the world or when the governments of Greece, Spain and Italy reduce the level of pensions?

cs-en Zero: This is a list of events that have taken place in the European Union in the past two decades. This list includes the events that occurred in the last decade of the twentieth century, including the events of the Arab revolutions, the collapse of the European social model, and the fall in the living standards of young people

Few: Whatever leaders do this week, they are not going to bridge the gap between unemployment in Europe and poverty in the Middle East. Instead, young people should take to the streets in Brussels to express their support for Europe, argues Mario Monti.