Echoes from Alexandria: A Large Resource for Multilingual Book Summarization



Introduction
Recent research in Automatic Text Summarization, the task of shortening a text while preserving its meaning, has mainly focused on news stories. News texts are usually short documents; for example, 99.3% and 98.6% of the articles in XSum (Narayan et al., 2018) and CNN/DailyMail (Nallapati et al., 2016), respectively, are shorter than 2048 tokens. Additionally, news stories are characterized by strong layout features, such as the "lead bias", in which the first sentences usually contain the most relevant information for a summary. Accordingly, the Lead-3 baseline, which uses the first three sentences of a news item as its summary, performs competitively on news summarization benchmarks (Gehrmann et al., 2018; Zhu et al., 2019). Although recent approaches have achieved high performance, it is still unclear how they behave on longer documents and whether they can generalize across domains and genres. For this reason, the research community has been shifting toward more challenging settings, which include interviews (Zhu et al., 2021) and scientific articles (Gupta et al., 2021; Cohan et al., 2018).
One setting that has been attracting growing attention is full-book summarization (Kryscinski et al., 2021), i.e., the task of producing the plot of a book from its full text. Summarizing a book is hard not only because of its average text length, which currently cannot be processed in a single forward pass even by architectures for long-form text processing (Beltagy et al., 2020; Guo et al., 2022), but also due to other critical aspects, such as the presence of dialogues, rich discourse structures, parallel and non-linear lines of plot, and long-distance dependencies between entities, among others. Therefore, we deem book summarization a complex testbed to challenge current approaches and investigate their capabilities and limitations.
Although the first small-scale datasets for the task were introduced several years ago (Mihalcea and Ceylan, 2007), the area has recently regained traction thanks to larger-scale resources, such as BookSum (Kryscinski et al., 2021) and NarrativeQA (Kočiský et al., 2017). However, despite this recent progress, current resources for book summarization are still i) limited in size, making them difficult to use for proper training and evaluation, and ii) monolingual (usually English-only).
To overcome these issues, we introduce "Echoes from Alexandria" (Echoes), the largest resource to date for book summarization and the first one providing books and summaries in multiple languages. We use Echoes to investigate how current summarization approaches perform on a large-scale multilingual summarization dataset, concluding that current purely-abstractive approaches still struggle in our setting. We additionally devise a new baseline, showing that the extractive-then-abstractive paradigm represents a promising direction for future research.
The main contributions of our work are the following:
• We introduce Echoes, the first multilingual resource for book summarization, with thousands of texts and plots in 5 languages, for a total of 25 language pairs. Echoes is also the largest resource among current English datasets for full-book summarization.
• We release the three datasets of Echoes: i) Echo-Wiki, for multilingual abstractive summarization, ii) Echo-XSum, for extremely-compressive multilingual book summarization, and iii) Echo-FairySum, an English dataset for evaluating extractive book summarization.
• We leverage BookSum and Echoes to evaluate state-of-the-art systems, both in zero-shot and fine-tuning settings, bringing to light their inadequate generalization capabilities in book summarization.
• Our experiments demonstrate that an extractive-then-abstractive baseline outperforms the purely-abstractive counterpart on our datasets while achieving state-of-the-art results on BookSum.
• We provide a comprehensive manual evaluation of the automatically generated summaries and release the dataset with our human judgments.
We hope our work will foster research in multilingual long document understanding and summarization. We release Echoes and our software for research purposes at https://github.com/Babelscape/echoes-from-alexandria.

Related Work
Resources for summarization. Research efforts to create summarization resources have steadily increased in number over recent years. For the news domain, XSum (Narayan et al., 2018) and CNN/DailyMail (Nallapati et al., 2016) are the de facto standard datasets for training and evaluating summarization systems. XSum comprises 226k news articles accompanied by a one-sentence abstractive summary. In CNN/DailyMail, the authors retrieved 93k articles from CNN and 220k articles from DailyMail newspapers. Both publishers supplement their articles with a list of bullet points containing the main information of the news text. More recently, summarization resources have been shifting towards more challenging scenarios, i.e., where the documents of interest are longer and belong to different domains. Notably, Cohan et al. (2018) released two large-scale datasets of long and structured scientific papers obtained from arXiv and PubMed. In these datasets, paper abstracts are used as ground truth summaries. Another relevant example is MediaSum (Zhu et al., 2021), a collection of interview transcriptions from National Public Radio (NPR) and CNN, where overview and topic descriptions are employed as summaries.
In long-form text summarization research, a task that is attracting growing attention is book summarization. Although this task was originally introduced several years ago by Mihalcea and Ceylan (2007), who released the first small-scale evaluation resource, book summarization regained traction thanks to a few notable endeavors. The most important example is BookSum (Kryscinski et al., 2021), which provides a collection of resources for book summarization at three levels of granularity: paragraph, chapter, and full book. Book texts are collected from Project Gutenberg, while summaries are obtained from the Web Archive. BookSum features 222 unique book titles with a total of 6,987 book chapters and 142,753 paragraphs. Relatedly, NarrativeQA (Kočiský et al., 2017) is a collection of 1,572 stories retrieved from Project Gutenberg (783 books and 789 movie scripts) associated with summaries from Wikipedia. The annotators were required to generate questions and answers based on the summaries. Even if NarrativeQA is primarily intended for Question Answering, it can also be used for book summarization. Due to their limited size, however, BookSum (in the full-book setting) and NarrativeQA can be more useful for evaluating models on the task rather than for training purposes. It is also worth noting that these resources are monolingual, i.e., English-only, limiting their usefulness for researchers seeking to evaluate multilingual summarization models. Despite the great work carried out so far, we argue that there is still ample room to improve book summarization resources.
Approaches to book summarization. Kryscinski et al. (2021) conducted experiments on full-book summarization using a generate&rank strategy. This approach involves training a system to generate paragraph-level summaries, which are then sorted by perplexity and concatenated to form a full-book summary. More recently, Wu et al. (2021) proposed an approach where passages are recursively summarized and concatenated to form a full summary. However, generated summaries are affected by the errors accumulated from previous stages (Wu et al., 2021). Recursively generating a summary is a paradigm that has also been used by other works for long-document summarization (Zhang et al., 2021; Gidiotis and Tsoumakas, 2020). Another family is that of extractive-then-abstractive approaches, which first extract key sentences from the input document and then use such sentences as input to an abstractive model, which is tasked with generating a summary that captures the main ideas and themes of the source. While it was successfully employed in previous works for short (Li et al., 2021) and long-form summarization (Chen and Bansal, 2018), this paradigm has never been explored for summarizing books. In this paper, we aim to fill this gap by presenting a new, simple extractive-then-abstractive model and showing its effectiveness for book summarization.

Echoes
Echoes is the first collection of resources for book summarization in 5 languages: English, French, German, Italian, and Spanish. With Echoes, we introduce the following three novel datasets:
• Echo-Wiki, in which we pair book texts with plots retrieved from a hand-curated list of Wikipedia page sections.
• Echo-XSum, in which we pair book texts with extremely-compressive summaries, manually created starting from the lead section of Wikipedia pages.
• Echo-FairySum, an evaluation dataset for extractive summarization of short stories and fairy tales, composed of 197 English manually-annotated extractive summaries.
We provide an overview of the main differences between Echoes and existing resources in Table 1.

Text collection
We collect the book texts that comprise Echoes from two main sources: Project Gutenberg and Wikisource. Project Gutenberg is a digital library that provides free access to public-domain books and features over 60k texts. We collect all the available books from Project Gutenberg by following their robot-access policies. While often considered one of the most reliable sources of copyright-free books, Project Gutenberg provides only very limited coverage of non-English books and non-English translations of English books. This is one of the reasons why we also rely on Wikisource. Part of the Wikimedia Foundation, Wikisource contains a huge number of texts from a wide range of domains, e.g., books, and legal and historical documents, in various languages. Therefore, for Echoes, we rely on Wikisource in English, French, German, Spanish, and Italian to retrieve other book texts and expand the coverage of books already available from Project Gutenberg. We call this set of full-text books B. We note that Wikisource can also be used to expand Echoes to other languages. Given the limited amount of work in multilingual summarization, we focus on the five above high-resource languages. We defer the expansion of Echoes to future work. While Project Gutenberg has already been used as a source of books in previous resources, such as BookSum and NarrativeQA, the use of Wikisource is what enables Echoes to become the largest resource for book summarization in English and the first resource for multilingual book summarization.
Table 1 (caption, partially recovered): [...] micro-average ratio between the lengths of the source and the summary.

Pairing books with Wikipedia summaries
Book summaries from Wikipedia follow a standard set of guidelines 9 and are often of remarkable quality, as they are continuously refined over time by the Wikipedia community. Therefore, once we have collected our set of full-book texts (see Section 3.1), we iterate over the Wikipedia dumps 10 in English, French, German, Italian, and Spanish. Given our set B of full-book texts, and W, the set of Wikipedia pages, our objective is to uniquely associate a book b ∈ B with a page w ∈ W, such that w is the Wikipedia page of book b. We obtain a set of potential matches by finding Wikipedia pages whose contents contain a hyperlink to a book in B. To improve the accuracy of our mapping, we first apply a string distance metric 11 to compare the titles of the books and their associated Wikipedia pages. We then check if the lead section of the Wikipedia page in question mentions the surname of the author of the associated book. This additional step helps us further refine and ensure the validity of our associations.
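The two filters described above (title similarity and author-surname check) can be sketched as follows. This is a minimal illustration, not the released implementation: the function names are hypothetical, and we assume that the edit distance is normalized by the length of the longer title, that titles are compared case-insensitively, and that a pair is kept when the normalized distance does not exceed the 0.2 threshold mentioned in footnote 11.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    a, b = a.casefold().strip(), b.casefold().strip()
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def is_match(book_title: str, author_surname: str,
             page_title: str, page_lead: str,
             threshold: float = 0.2) -> bool:
    """Keep a (book, page) candidate only if both filters pass."""
    titles_close = normalized_distance(book_title, page_title) <= threshold
    author_mentioned = author_surname.casefold() in page_lead.casefold()
    return titles_close and author_mentioned
```

Under these assumptions, a page titled "Dracula (1931 film)" is rejected for the book "Dracula" by the title filter alone, which is consistent with the observation that surplus candidates often refer to adaptations.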
After our matching process, we manually inspect the cases in which books are associated with multiple Wikipedia pages. We discover that the pages in excess refer to adaptations of the book in other mediums, such as movies and theatrical plays. To resolve this ambiguity, we utilize the mapping between Wikipedia pages and Wikidata nodes to obtain metadata about the medium, e.g., book, movie, play, and retain only the Wikipedia page that corresponds to the book.
9 https://en.wikipedia.org/wiki/Wikipedia:How_to_write_a_plot_summary
10 Wikipedia dumps are freely available to download at https://dumps.wikimedia.org/<l>wiki/, where <l> ∈ {EN, FR, DE, ES, IT}. Last accessed: July 1, 2022.
11 We used the Edit distance to retain only those pairs whose titles were highly similar, by setting a stringent threshold (0.2).
At this point, given the Wikipedia page content, our goal is to extract only the book summary and discard other information, such as the biography of the author, historical background, prizes and accolades, and critical reception, among others.To achieve this, we employ native speakers to manually identify a list of section names that, in the different languages, only contain plot information, aiming for high precision rather than coverage.We use the content of these identified sections as summaries and provide our list of section names in Appendix A for reference.We name the resulting set of (Wikipedia summary, full-text book) pairs Echo-Wiki.
We note that the Wikipedia pages we select for the Echo-Wiki dataset have, on average, a large number of unique editors (220.6) and revisions (421.4) and an early year of creation (2008): this indicates that their book summaries have been curated over time and suggests that they are of high quality. Table 1 shows how Echo-Wiki compares against BookSum, which is, to the best of our knowledge, the largest previously existing dataset for book summarization. Besides being multilingual, Echo-Wiki is about 12 times larger than BookSum (5,001 vs. 405 books) while still featuring a similar compression ratio (103.7 vs. 126.2).

Enabling extreme summarization of books
Inspired by the work of Narayan et al. (2018) on the news domain with XSum, which showcases the capabilities of highly-abstractive summarization, we introduce Echo-XSum, a new dataset for training and evaluating systems for extreme summarization of books. In Echo-XSum, we pair full-text books with very short summaries. These summaries contain the minimum number of sentences required to provide an overview of the main contents of a book, typically one to three sentences. The main challenge posed by Echo-XSum is dealing with the great disparity between the size of the input and the size of the output. Indeed, as we can observe in Table 1, the compression ratio of Echo-XSum (1624.0) is unprecedented in the field of summarization, being an order of magnitude greater than those of Echo-Wiki (103.7) and BookSum (126.2).
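To make the notion of compression ratio concrete, the sketch below computes it as a micro-average, following the wording of the Table 1 caption. It rests on two assumptions that the text does not confirm: that "micro-average" means dividing the total source length by the total summary length over the whole dataset, and that length is measured in whitespace-separated tokens. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def micro_compression_ratio(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total source length divided by total summary length (assumed definition).

    Each pair is (source_text, summary_text); length is counted in
    whitespace-separated tokens, which is a simplification.
    """
    total_source = 0
    total_summary = 0
    for source, summary in pairs:
        total_source += len(source.split())
        total_summary += len(summary.split())
    if total_summary == 0:
        raise ValueError("summaries are empty; the ratio is undefined")
    return total_source / total_summary
```

With a micro-average, long books weigh more than short ones; a macro-average (the mean of per-book ratios) would instead give each book the same weight.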
The extreme summaries in Echo-XSum are the result of a manual annotation process, which involved an expert linguist who is a fluent speaker in all 5 languages of Echoes. The annotator was explicitly contracted for this task. Given a book and its previously-identified Wikipedia page (see Section 3.2), the annotator was tasked with extracting portions of text from the introduction that described the essential plot of a book. An excerpt of a book text with the corresponding multilingual summaries from Echo-XSum can be found in Appendix B. Notice that the portions of text extracted by the annotator are not necessarily contiguous, as long as the extracted text can be read independently of its original text. As a rule of thumb for the annotation process, the linguist followed the definitions of Consistency, Relevance, Fluency, and Coherence of a summary (Fabbri et al., 2021). The annotator spent an average of 5 minutes per sample. We provide an example of the annotations produced in Appendix C. At the end of the manual creation of our extreme summaries, the resulting Echo-XSum is still about 8 times larger than BookSum (3,383 vs. 405 books).

Classifying books into genres
Differently from existing resources, such as BookSum, which is limited by its relatively small size, the thousands of books in Echoes give us the opportunity to investigate book summarization more in-depth. Indeed, books in Echoes cover a wide range of genres, including novels, theatrical plays, and poems, among others. We argue that developing a strategy to automatically identify book genres provides valuable insights into the dataset and enables a fine-grained evaluation of current and future summarization approaches. An analysis by genre can help us determine which genres are the most challenging to summarize.
Similarly to what was described in Section 3.2, we rely on a graph-based heuristic on the knowledge graph of Wikidata to identify genres. More specifically, given a Wikipedia article of a book, we retrieve its corresponding Wikidata node, and analyze its relations (e.g., genre and form_of_creative_work) with its neighboring nodes. This process is able to distinguish between 7 main genres: novels, plays, poems, epic poems, short stories, fairy tales, and essays. Note that our heuristic may assign more than one genre to a single book. Figure 1 illustrates the distribution of the genres in the English partition of Echo-Wiki, showing that novels are the most represented genre, followed by short stories and plays.

Digging up extractive summarization
Over the past few years, the attention of the research community has gradually shifted from extractive to abstractive summarization, especially thanks to the advent of flexible sequence-to-sequence models, which have proven effective for summarizing short documents. Thanks to genre classification (see Section 3.4), we are able to perform a small-scale investigation of extractive book summarization on two genres in Echoes. More specifically, we construct Echo-FairySum, the first evaluation dataset for extractive summarization of fairy tales and short stories.
To create extractive summaries for Echo-FairySum, we set up the following manual annotation process: given the text of a book, and its abstractive summary from Wikipedia (Section 3.2), annotators are required to extract relevant sentences from the book text. A sentence is relevant if it provides a piece of information that is also contained in the abstractive summary. The annotators were asked to adhere as closely as possible to the concepts of Consistency, Relevance, and Coherence defined by Fabbri et al. (2021). The annotators were drawn from a pool of fifty-eight Master-level students from the 'Narrative Understanding and Storytelling' minicourse held at the Sapienza University of Rome by the last co-author, as part of the AI and Robotics degree. The selected students carried out the task as part of their course assignments. On average, each student annotated 3 texts, resulting in multiple annotations for each text. The annotation agreement was measured using Cohen's Kappa coefficient, which indicated substantial agreement (0.71). A subset of annotations was further validated by our contracted annotator to ensure that the students were adhering to the guidelines. Overall, Echo-FairySum provides extractive summaries for 197 documents, about 4 times the size of the test set of BookSum.
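As an illustration of how agreement on extractive annotations can be quantified, the sketch below computes Cohen's Kappa for two annotators. It assumes that each annotation is encoded as one binary label per sentence (1 if the sentence was extracted, 0 otherwise) and that agreement is measured pairwise; the text does not specify how the reported value was aggregated over more than two annotators, so this is only one plausible setup.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("at least one labeled item is required")
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)
```

For example, two annotators who agree on 7 of 8 sentences, with chance agreement 0.5, obtain a Kappa of 0.75.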

Aggregating books across versions and languages
A book can be published in various editions after its original publication. Perhaps most importantly, the same version of a book can also be translated into multiple languages. Given the potentially large variety of versions and translations of a book, we argue that it is important to aggregate those versions. Indeed, aggregating books across versions and translations can allow Echoes to also be employed for machine translation, cross-lingual sentence alignment, and cross-lingual summarization.
To achieve this objective, we leverage two characteristics of Wikipedia. First, we aggregate all those book texts aligned to the same Wikipedia page (see Section 3.2). We increase the accuracy of this step by taking into account the information found on some Wikisource pages, which list the editions available for some books. Second, we navigate the Wikipedia interlanguage links, which connect pages that refer to the same concept/entity in different languages, to aggregate different translations and summaries (in different languages) of the same book. Figure 2 presents the number of book-summary and version-summary pairs for all the language pairs in Echo-Wiki obtained after our aggregation process.
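The two aggregation steps can be viewed as computing connected components: texts are attached to their Wikipedia page, and pages are merged when an interlanguage link connects them. The sketch below illustrates this view with a small union-find structure. The identifiers (e.g., "gutenberg:5200") and the input format are hypothetical, and the Wikisource edition lists mentioned above are not modeled.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_books(text_to_page: Dict[str, str],
                    interlanguage_links: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group book texts whose Wikipedia pages are the same or interlinked."""
    parent: Dict[str, str] = {}

    def find(page: str) -> str:
        parent.setdefault(page, page)
        while parent[page] != page:
            parent[page] = parent[parent[page]]  # path halving
            page = parent[page]
        return page

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Step 2 of the text: merge pages connected by interlanguage links.
    for page_a, page_b in interlanguage_links:
        union(page_a, page_b)

    # Step 1 of the text: texts aligned to the same (merged) page go together.
    groups: Dict[str, List[str]] = defaultdict(list)
    for text_id, page in text_to_page.items():
        groups[find(page)].append(text_id)
    return sorted(sorted(group) for group in groups.values())
```

Each returned group then contains all versions and translations of one book, which is the unit needed for cross-lingual pairing.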

Experiments and Results
In recent years, two promising paradigms have emerged from previous work on long-document summarization: recursive-abstractive and extractive-then-abstractive. In this section, we evaluate and analyze their effectiveness on Echoes.

Recursive-abstractive approaches
Recursive-abstractive approaches consist in dividing the source document into smaller segments, referred to as chunks, and then using an abstractive summarization model to summarize each segment. If the concatenated output summaries are still larger than a single chunk, the recursive-abstractive approach repeats the process by treating the concatenation as a new source document and summarizing it in the same way. The recursive process continues until the concatenated output summaries are short enough to be considered as the final summary, i.e., until their size is shorter than the maximum size of a single chunk.
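The procedure above can be sketched as a simple loop. This is an illustrative sketch under stated assumptions rather than the code used in our experiments: chunks are formed from whitespace-separated words instead of model tokens, `summarize_chunk` is a placeholder for any abstractive model, and the `max_rounds` limit and the check for lack of progress are safeguards we add so that the loop always terminates.

```python
from typing import Callable, List


def split_into_chunks(words: List[str], chunk_size: int) -> List[List[str]]:
    """Split a list of words into consecutive chunks of at most chunk_size words."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def recursive_summarize(text: str,
                        summarize_chunk: Callable[[str], str],
                        chunk_size: int,
                        max_rounds: int = 20) -> str:
    """Summarize each chunk, concatenate, and repeat until one chunk suffices."""
    words = text.split()
    for _ in range(max_rounds):
        chunks = split_into_chunks(words, chunk_size)
        summaries = [summarize_chunk(" ".join(chunk)) for chunk in chunks]
        new_words = " ".join(summaries).split()
        stalled = len(new_words) >= len(words)  # the model did not compress
        words = new_words
        if len(words) <= chunk_size or stalled:
            break
    return " ".join(words)
```

With a toy summarizer that keeps the first two words of each chunk, a 100-word input and a chunk size of 10 shrink to 20 words after the first round and to 4 words after the second, at which point the loop stops.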
Experimental setting. In its simplest form, a recursive-abstractive approach requires a model trained on a standard summarization dataset; this model is then employed recursively, as described above. For our experiments, we consider three sequence-to-sequence Transformer-based models, namely BART-large (Lewis et al., 2020), LED-base (Beltagy et al., 2020), and LongT5-base (Guo et al., 2022), and train them on XSum (short documents, news) and MediaSum (long documents, interviews). Then, we evaluate our trained models on the test set of Echo-XSum, whose summaries feature an average length similar to that of the summaries in XSum and MediaSum but belong to a different genre (books). For the evaluation, we adopt standard summarization metrics, such as ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore (Zhang et al., 2019).
Results. Table 2 (top) provides an overview of the results obtained by our recursive-abstractive baseline using different language models and trained on different summarization datasets. Overall, we can observe that, independently of the language model and training dataset employed, the baseline does not achieve good results on Echo-XSum. Indeed, the best configuration (LED$_{\text{XSum}}$) obtains only 14.83 points in ROUGE-L on Echo-XSum. By comparison, the same configuration achieves 30.24 points on XSum. Therefore, i) Echo-XSum is empirically more challenging than XSum, ii) a simple recursive-abstractive approach is not sufficient to obtain acceptable results on Echo-XSum, and iii) different pretrained language models and different summarization datasets (from different genres/domains) do not significantly affect the results of a recursive-abstractive approach on our book summarization dataset.

Extractive-then-abstractive approaches
Since recursive-abstractive approaches yield unsatisfying results on Echo-XSum (see Table 2), we propose a simple, novel baseline based on the extractive-then-abstractive paradigm. Our model is composed of two submodules: the extractor extracts key sentences from the input text, while the abstractor uses the concatenation of these key sentences to generate an abstractive plot of the book.
Given an input text $T = (s_1, s_2, \ldots, s_{|T|})$ where each $s_i$ is a sentence, the extractor produces a score in $[0.0, 1.0]$ for each $s_i$, quantifying its degree of importance for the target summary. More formally, the extractor computes $\mathrm{Score}(s_i) \in [0.0, 1.0]$ from $e_{s_i}$, where $e_{s_i} = \mathrm{SentenceEncoder}(s_i)$ is the sentence representation of $s_i$ from a SENTENCEENCODER. Then, the abstractor takes the subset $T^*$ composed of the $k$ sentences with the highest scores according to the extractor, and uses $T^*$ to generate the final summary. To make the abstractor aware of the relative importance of each sentence, we multiply the embedding of each token by the score of its sentence, as follows: $$e_{t_{i,j}} = \mathrm{Score}(s_i) \cdot \mathrm{Embedding}(t_{i,j})$$ where $e_{t_{i,j}}$ is the encoding of the $j$-th token of the $i$-th sentence, for each sentence in $T^*$.
The model is trained in an end-to-end fashion, i.e., the extractor and abstractor are trained jointly, by minimizing the cross-entropy loss between the reference summary and the generated summary.
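The data flow of the extractor and abstractor inputs can be sketched in plain Python as follows. This is a simplified illustration, not the trained model: we assume, for the sake of the example, that the score is a sigmoid applied to a linear function of the sentence representation, and that the selected sentences are kept in document order; neither choice is confirmed by the text above. All function names are hypothetical, and the vectors are ordinary lists rather than tensors.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def score_sentences(sentence_vectors: List[Vector], weights: Vector, bias: float) -> List[float]:
    """Map each sentence representation to a score in [0, 1] (assumed form)."""
    return [sigmoid(sum(w * v for w, v in zip(weights, vec)) + bias)
            for vec in sentence_vectors]


def select_top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring sentences, returned in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(ranked)


def weight_token_embeddings(token_embeddings: List[List[Vector]],
                            scores: List[float],
                            selected: List[int]) -> List[List[Vector]]:
    """Scale every token embedding of a selected sentence by that sentence's score."""
    return [[[scores[i] * x for x in token] for token in token_embeddings[i]]
            for i in selected]
```

Because the scores multiply the token embeddings that the abstractor consumes, the summarization loss can propagate gradients back to the extractor, which is what makes joint end-to-end training possible in a differentiable implementation.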
Experimental setting. We follow the experimental setting we used for our recursive-abstractive approach. We train and evaluate 3 models (BART-large, LED-base, and LongT5-base) on Echo-XSum. Since pretraining on XSum results in slightly improved performance for the recursive-abstractive approach, we also evaluate how pretraining on XSum affects the performance of our extractive-then-abstractive approach. Finally, we also train and evaluate our approach on Echo-Wiki and on BookSum (the latter to directly compare performance with the current state of the art).
Results. Table 2 (bottom) provides an overview of the results obtained by our extractive-then-abstractive approach on Echo-XSum. We can immediately notice that each configuration outperforms the recursive-abstractive baselines by a large margin. For example, the best extractive-then-abstractive model (BART$_{\text{XSum}}$) improves over the best recursive-abstractive model (LED$_{\text{XSum}}$) by 11.90 points in ROUGE-L (26.73 vs. 14.83), and a similar improvement holds for all the metrics we consider (ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore). It is interesting to note that, while there is little difference in the results on Echo-XSum of different model configurations, there is a significant difference between BART, LED, and LongT5 when evaluated on Echo-Wiki, as shown in Table 3. We hypothesize that such a variance in performance is due to several factors, but the inadequacy of current non-semantic metrics plays a large role, as supported by our human evaluation (see Section 5). Finally, we further assess the effectiveness of our extractive-then-abstractive approach on the standard test set of BookSum (Table 6). In particular, our approach outperforms the system of Kryscinski et al. (2021) using 33% of its parameters, and is competitive with the system of Wu et al. (2021) using only 0.1% of its parameters.

Analysis and Discussion
Human evaluation. Following common practice in the field of summarization, we set up a human evaluation process to assess the quality of the system-generated summaries. The annotation task, performed by an expert English speaker, consists of reading the source text and rating the summaries using a Likert scale for Consistency, Relevance, Fluency, and Coherence, as outlined in Fabbri et al. (2021). To make this experiment feasible in terms of time and resources, we focus our evaluation on fairy tales and short stories, which can be read by a human in a short time. Interestingly, but not surprisingly (Fabbri et al., 2021), the results of our human evaluation experiment tell a story that is different from ROUGE, as shown in Tables 4 and 5. Nevertheless, the evaluation still highlights the effectiveness of our extractive-then-abstractive model compared to the recursive-abstractive baseline. At the same time, it is clear that future work should focus in particular on improving the Consistency and Relevance of the summaries generated.
Challenges. Echoes opens the door to several other analyses and experiments that were not possible with previous datasets. For example, we can leverage Echo-FairySum to perform an analysis of the behavior of the extractor submodule of our extractive-then-abstractive approach, as we show in Appendix D. In Section 3.4, we examined the different book genres in Echoes; the performance of the LongT5 model on each genre is detailed in Figure 3. We notice that epic poems are the hardest to summarize in this setting, while our model performs reasonably well on fairy tales.
Cross-lingual book summarization. Additionally, Echoes can be employed as a multilingual and cross-lingual summarization benchmark, thanks to its coverage of 5 languages and 25 language pairs. In particular, we argue that cross-lingual book summarization is a very interesting challenge, as it requires a model to compress vast amounts of information while transferring knowledge across languages. Moreover, enabling cross-lingual book summarization is fundamental for all those cases in which we do not have the source text available in the language of interest, e.g., because its translation is still under copyright or does not exist at all. As a first step in this direction, we propose a summarize-then-translate approach, a simple baseline for cross-lingual book summarization on Echo-XSum. As the name implies, our approach works by employing a monolingual model to produce a summary in the same language as the source text, and then it translates the summary from the source language to the desired target language. We report the results of this baseline in Table 7. While this is a strong baseline, it is still affected by two main issues: i) it requires two systems, a summarizer and a translator; ii) machine translation usually fails to translate language-specific items, e.g., character names may not be exact translations.
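The summarize-then-translate baseline is a composition of two independent systems, which the following sketch makes explicit. The `summarize` and `translate` callables are placeholders for any monolingual summarizer and machine translation system; their signatures are assumptions made for this illustration and do not correspond to a specific library.

```python
from typing import Callable


def summarize_then_translate(book_text: str,
                             source_lang: str,
                             target_lang: str,
                             summarize: Callable[[str, str], str],
                             translate: Callable[[str, str, str], str]) -> str:
    """Summarize in the source language, then translate only the short summary."""
    summary = summarize(book_text, source_lang)
    if source_lang == target_lang:
        return summary  # monolingual case: no translation step is needed
    return translate(summary, source_lang, target_lang)
```

Translating the summary rather than the book keeps the translation input short, but it also means that errors made by the summarizer are carried over unchanged into the target language.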

Conclusion
In this paper, we introduced Echoes, the first multilingual resource for book summarization and the largest among the English datasets. Echoes features three novel datasets, namely, Echo-Wiki, Echo-XSum, and Echo-FairySum, which address several limitations of existing book summarization resources, such as BookSum. Indeed, previous datasets for full-text book summarization are i) limited in size, and ii) monolingual, i.e., usually covering English only.
In addition, we leveraged Echoes to bring to light the unsatisfying capabilities of current approaches to generalize to book summarization. Finally, to mitigate this issue, we proposed a new extractive-then-abstractive baseline for book summarization, which outperforms its purely-abstractive counterpart on Echo-Wiki and Echo-XSum. On the standard BookSum test set, it achieves results that are comparable with the current state of the art while using only 0.1% of the parameters of the best-performing method.
We believe that Echoes will foster future work on long-document summarization, especially in the multilingual and cross-lingual setting.

Limitations
Despite the multilinguality of our resource, there is still a strong bias towards the English language, as the majority of books are in English and many translations are from English.This may result in the values of English literature being reflected, and these may differ from those of other cultures; summarizing literature from different cultures and regions may not be fully accurate, as every region has had its own historical development.
Language models used in the experiments can inherit biases from the training data and the tools, such as the ones used for preprocessing, and have limitations that have not been fully evaluated and could impact the results of this study.
This study includes the use of Web data, which, while marked as public domain, may be subject to copyright laws. The data used in this study was collected for research purposes and was not intended for any other use. Additionally, it is worth noting that the majority of books used in our resource are copyright-free, and therefore, old. While this allowed us to include a large number of texts in our dataset, it also means that our resource may not fully capture contemporary literature and may not be representative of current linguistic trends and cultural values.

A Wikipedia summary sections
In Table 8 we provide the list of Wikipedia section titles whose contents are used as summaries in Echo-Wiki.

B Echo-XSum example
In Figure 5 we report an excerpt of the book text of the English version of "The Metamorphosis" by Franz Kafka, along with the multilingual extreme summaries from Echo-XSum.

C Echo-XSum annotation task
In Figure 6 we provide an example of a manually annotated summary in Echo-XSum. The annotator was tasked to highlight portions of text containing information related to the plot from the Wikipedia introduction.

D Extractor analysis
We analyze the positions of the sentences selected by the extractor. This analysis is required to investigate the presence of any positional bias, e.g., the lead bias, which is known to affect systems trained on news stories. Figure 4 depicts the distribution of the relative positions of the extracted sentences on texts from Echo-FairySum, i.e., fairy tales and short stories. We deduce that the extractions are not affected by any positional bias. Thanks to the Echo-FairySum extractive annotations, we are also able to evaluate the performance of the extractor component of the extractive-then-abstractive approaches. We aggregate multiple extractive annotations in Echo-FairySum by retaining the intersecting sentences; we refer to these sentences as the gold sentences.
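As a rough illustration of this analysis, the following Python sketch shows one way to aggregate extractive annotations by intersection, bin extracted sentences by their relative position, and score a ranking with Precision@K against the gold sentences. The function names, the 0-based sentence indices, and the default of ten bins are assumptions made for this example; they are not taken from the paper's implementation.

```python
from typing import List, Sequence, Set


def gold_sentences(annotations: Sequence[Set[int]]) -> Set[int]:
    """Keep only the sentence indices selected by every annotator."""
    if not annotations:
        return set()
    return set.intersection(*(set(a) for a in annotations))


def precision_at_k(ranked: Sequence[int], gold: Set[int], k: int) -> float:
    """Fraction of the top-k ranked sentence indices that are gold.

    Assumption: if fewer than k sentences are ranked, the missing
    slots count as misses (the denominator stays k).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for idx in ranked[:k] if idx in gold)
    return hits / k


def relative_position_histogram(
    extracted: Sequence[int], num_sentences: int, num_bins: int = 10
) -> List[int]:
    """Count extracted sentences per relative-position interval."""
    if num_sentences <= 0 or num_bins <= 0:
        raise ValueError("num_sentences and num_bins must be positive")
    counts = [0] * num_bins
    for idx in extracted:
        rel = idx / num_sentences  # relative position in [0, 1)
        # Clamp so that an index at the very end falls in the last bin.
        counts[min(int(rel * num_bins), num_bins - 1)] += 1
    return counts
```

For instance, calling `gold_sentences` on the annotations of three annotators returns only the sentences all three selected, and `precision_at_k(ranked, gold, k)` can then be evaluated for several values of K on the extractor's ranking. A roughly flat output from `relative_position_histogram` would be consistent with the absence of a lead bias.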
We measure the Extractor performance by computing the overlap between the sentences extracted by the model and the gold ones. We compute the Precision@K by comparing the top-K ranked sentences with the references. We report the Extractor performance in Table 9. We observe relatively low scores, meaning that the extractor is only partially able to discriminate relevant sentences from irrelevant ones. This aspect confirms that there is still large room for improving the Extractor and, consequently, the relevance of the summaries.

Figure 1: Distribution of the genres (novels, short stories, plays, poems, essays, fairy tales, and epic poems) in the English partition of Echo-Wiki.

Figure 2: Number of book-summary (left) and version-summary pairs (right) for all language pairs in Echo-Wiki. Best seen in color.

Figure 3: Genre-specific evaluation of LongT5 base model fine-tuned on Echo-XSum. Best seen in color.

Figure 4: Number of extracted sentences for each relative position interval.

Table 1: Comparison of Echoes (Echo-Wiki, Echo-XSum, and Echo-FairySum) with existing resources for summarization. Coverage and density: measures of the "extractiveness" of a summary. Compression Ratio:

Table 3: Automatic evaluation of extractive-then-abstractive approaches on Echo-Wiki.

Table 5: Human evaluation of extractive-then-abstractive approaches on Echo-Wiki.

Table 6: Results of our approach compared to the state of the art on the BookSum test set.

Table 7: Summarize-then-translate experiment. We translate the summaries generated by LongT5 base model, fine-tuned on Echo-XSum, and compare them against gold standard references.

Table 8: Table of Wikipedia section titles utilized in the Echo-Wiki parsing process in multiple languages.

Table 9: Extractor evaluation: Precision@K

C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?
We do not perform hyperparameter tuning.
C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run?
Experiments are computationally expensive, so we were able to afford just one run per configuration.
C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)?
ROUGE:4
D Did you use human annotators (e.g., crowdworkers) or research with human participants?
D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.?
We provide a short description of the guidelines and pointers to existing guidelines.
D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)?
We report information about the students in Section 3. The expert annotators prefer not to disclose their information.
D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used?
Not applicable. Left blank.
D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?
Our research group does not have an ethics review board.
D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data?
The annotators prefer not to disclose their information.