SummerTime: Text Summarization Toolkit for Non-experts

Recent advances in summarization provide models that can generate summaries of higher quality. Such models now exist for a number of summarization tasks, including query-based summarization, dialogue summarization, and multi-document summarization. While such models and tasks are rapidly growing in the research field, it has also become challenging for non-experts to keep track of them. To make summarization methods more accessible to a wider audience, we develop SummerTime by rethinking the summarization task from the perspective of an NLP non-expert. SummerTime is a complete toolkit for text summarization, including various models, datasets, and evaluation metrics, for a full spectrum of summarization-related tasks. SummerTime integrates with libraries designed for NLP researchers and provides users with easy-to-use APIs. With SummerTime, users can locate pipeline solutions, search for the best model with their own data, and visualize the differences among the results, all with a few lines of code. We also provide explanations of the models and evaluation metrics to help users understand model behaviors and select the models that best suit their needs. Our library, along with a notebook demo, is available at https://github.com/Yale-LILY/SummerTime.


Introduction
The goal of text summarization is to generate short, fluent summaries from longer textual sources while preserving their most salient information. Benefiting from recent advances in deep neural networks, in particular sequence-to-sequence models with or without attention (Sutskever et al., 2014; Bahdanau et al., 2014; Vaswani et al., 2017), current state-of-the-art summarization models produce high-quality summaries that are useful in practical settings (Zhang et al., 2020a; Lewis et al., 2020). Moreover, neural summarization has broadened its scope with the introduction of more summarization tasks, such as query-based summarization (Dang, 2005; Zhong et al., 2021), long-document summarization (Cohan et al., 2018), multi-document summarization (Ganesan et al., 2010; Fabbri et al., 2019), and dialogue summarization (Gliwa et al., 2019; Zhong et al., 2021). These tasks can also come from different domains (Hermann et al., 2015; Cohan et al., 2018). However, as the field rapidly grows, it is often hard for NLP non-experts to follow all the relevant new models, datasets, and evaluation metrics. Moreover, these models and datasets often come from different sources, making it a non-trivial effort for users to directly compare their performance side-by-side, which in turn makes it hard to decide which models to use. The development of libraries such as Transformers (Wolf et al., 2020) alleviates such problems to some extent, but they cover only a narrow range of summarization models and tasks and assume a certain proficiency in NLP from their users; the target audience is thus still largely the research community.
To address these challenges for non-expert users and make state-of-the-art summarizers more accessible as a tool, we introduce SummerTime, a text summarization toolkit intended for users with no NLP background. We build the library from this perspective and integrate different summarization models, datasets, and evaluation metrics, all in one place. Users can view a side-by-side comparison of all the classic and state-of-the-art summarization models we support, on their own data and combined into pipelines that fit their own task. SummerTime also provides automatic model selection: it first constructs pipelines for the specific task and then iteratively evaluates them to find the best-working solutions. Assuming no background in NLP, we list "pros and cons" for each model and provide simple explanations for all the evaluation metrics we support. Moreover, we go beyond pure numbers and visualize the performance and output of different models, to help users decide which models or pipelines to finally adopt.
The purpose of SummerTime is not to replace any previous work; on the contrary, we integrate existing libraries and place them in the same framework. We provide wrappers around such libraries, which are intended for expert users, while maintaining user-friendly and easy-to-use APIs.

Existing Systems for Summarization
Transformers (Wolf et al., 2020) includes a large number of transformer-based models in its Model Hub, including BART (Lewis et al., 2020) and Pegasus (Zhang et al., 2020a), two strong neural summarizers we also use in SummerTime. It also hosts datasets for various NLP tasks in its Datasets library. Despite its wide coverage of transformer-based models, Transformers does not natively support models or pipelines that can handle the aforementioned subcategories of summarization tasks. Moreover, it assumes a certain NLP proficiency in its users and is thus harder for non-experts to use. We integrate with Transformers and Datasets and import the state-of-the-art models, as well as summarization datasets, into SummerTime under the same easy-to-use framework.
Another library that we integrate with is SummEval (Fabbri et al., 2020), a collection of evaluation metrics for text summarization. SummerTime adopts the subset of metrics in SummEval that are more popular and easier to understand. SummerTime also works well with SummVis (Vig et al., 2021), which provides an interactive way of analyzing summarization results at the token level. We also allow SummerTime to store output in a format that can be directly used by SummVis and its UI.
Other systems also exist for text summarization. MEAD is a platform for multi-lingual summarization. Sumy can produce extractive summaries from HTML pages or plain text, using several traditional summarization methods, including those of Mihalcea and Tarau (2004) and Erkan and Radev (2004).
OpenNMT is mostly intended for machine translation, but it also hosts several summarization models, such as that of Gehrmann et al. (2018).

SummerTime
The main purpose of SummerTime is to help non-expert users navigate the various summarization models, datasets, and evaluation metrics, and to provide simple yet comprehensive information for selecting the models that best suit their needs. Fig. 1 shows how SummerTime is split into different modules to help users achieve this goal.
We describe each component of SummerTime in detail in the following sections. In § 3.1, we introduce the models we support across all subcategories of summarization; in § 3.2, we list the existing datasets we support and describe how users can create their own evaluation sets. Finally, in § 3.3, we explain the evaluation metrics included with SummerTime and how they can help users find the most suitable model for their task.

Summarization Models
Here we introduce the summarization tasks SummerTime covers and the models we include to support them. We first introduce the single-document summarization models (i.e., "base models") included in SummerTime, and then show how those models can be used in pipelines with other methods to complete more complex tasks such as query-based summarization and multi-document summarization.

Single-document Summarization
The following base summarization models are used in SummerTime; each takes a single document and generates a short summary. A usage sketch follows the model descriptions below.

TextRank (Mihalcea and Tarau, 2004) is a graph-based ranking model that can be used to perform extractive summarization.

LexRank (Erkan and Radev, 2004) is also a graph-based extractive summarization model. It was originally developed for multi-document summarization but can also be applied to a single document; it uses centrality in a graph representation of sentences to measure their relative importance.

BART (Lewis et al., 2020) is an autoencoder model trained with denoising objectives. This seq2seq model consists of a bidirectional transformer encoder and a left-to-right transformer decoder, and can be fine-tuned to perform abstractive summarization.

Pegasus (Zhang et al., 2020a) proposes a new self-supervised pretraining objective for abstractive summarization: reconstructing target sentences from the remaining sentences in the document. It also shows strong results in low-resource settings.

Longformer (Beltagy et al., 2020) addresses the memory requirements of self-attention models by using a combination of sliding-window attention and global attention to approximate standard self-attention. It supports input lengths of up to 16K tokens, a large improvement over previous transformer-based models.
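As a quick illustration of how a base model can be invoked, consider the following minimal sketch. The import path, class name, and summarize() signature are assumptions for illustration; the repository documents the exact API.

```python
# A minimal sketch of single-document summarization with SummerTime.
# The import path, class name, and summarize() signature are assumptions
# for illustration, not the library's confirmed API.
from summertime.model import BartModel

documents = [
    "PG&E scheduled the blackouts in response to forecasts for high "
    "winds amid dry conditions, aiming to reduce the risk of wildfires."
]

model = BartModel()
summaries = model.summarize(documents)  # one summary string per document
print(summaries[0])
```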

Multi-document Summarization
For multi-document summarization, we adopt two pipeline strategies built on popular single-document summarizers, as this has been shown to be effective in previous work (Fabbri et al., 2019). Both strategies are sketched in code below.

Combine-then-summarize is a pipeline method for handling multiple source documents: the documents are concatenated, and a single-document summarizer is then used to produce the summary. Note that the length of the combined documents may exceed the input length limit of typical transformer-based models.

Summarize-then-combine first summarizes each source document independently, then merges the resulting summaries. Compared to the combine-then-summarize method, it is not affected by overlong inputs. However, since each document is summarized separately, the final summary may contain redundant information (Carbonell and Goldstein, 1998).
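Both strategies reduce to a few lines around any single-document summarizer. The sketch below assumes a hypothetical `model.summarize()` API that maps a list of documents to a list of summaries.

```python
# Sketches of the two multi-document strategies; `model` is assumed to be
# any SummerTime single-document summarizer with a summarize() method
# mapping a list of documents to a list of summaries (an assumed API).

def combine_then_summarize(model, documents):
    # Concatenate all sources into one pseudo-document. The combined text
    # may exceed the input limit of transformer-based models such as BART.
    combined = " ".join(documents)
    return model.summarize([combined])[0]

def summarize_then_combine(model, documents):
    # Summarize each source independently, then merge the summaries.
    # Robust to overlong inputs, but may retain redundant information.
    partial_summaries = model.summarize(documents)
    return " ".join(partial_summaries)
```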

Query-based Summarization
For summarization tasks based on queries, we adopt a pipeline method: we first use retrieval to identify salient sentences or utterances in the original document or dialogue, then generate summaries with a single-document summarization model (see the sketch below).

TF-IDF retrieval is used in a pipeline to first retrieve the sentences most similar to the query, with TF-IDF as the similarity metric.

BM25 retrieval follows the same pipeline, but uses BM25 as the similarity metric for retrieving the top-k relevant sentences.
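The retrieval step can be illustrated with scikit-learn's TF-IDF vectorizer; this is a sketch of the general idea, not SummerTime's internal code.

```python
# A sketch of the query-based pipeline's retrieval step using TF-IDF
# (via scikit-learn); SummerTime's own implementation may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_top_k(sentences, query, k=5):
    vectorizer = TfidfVectorizer()
    sentence_vectors = vectorizer.fit_transform(sentences)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, sentence_vectors)[0]
    top = scores.argsort()[::-1][:k]
    # Restore original document order among the retrieved sentences
    # before handing them to a single-document summarizer.
    return [sentences[i] for i in sorted(top)]
```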

Dialogue Summarization
Dialogue summarization extracts salient information from a dialogue. SummerTime includes two methods for this task.
Flatten-then-summarize first flattens the dialogue data while preserving the speaker information, then a summarizer is used to generate the summary. Zhong et al. (2021) found that this presents a strong baseline for dialogue summarization.
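Flattening amounts to serializing the turns while keeping the speaker names, as in this sketch (the helper name is ours for illustration, not part of SummerTime's API):

```python
# A sketch of the flattening step in flatten-then-summarize; the helper
# name is ours for illustration and not part of SummerTime's API.

def flatten_dialogue(turns):
    """`turns` is a list of (speaker, utterance) pairs."""
    return " ".join(f"{speaker}: {utterance}" for speaker, utterance in turns)

dialogue = [
    ("Amanda", "I baked cookies. Do you want some?"),
    ("Jerry", "Sure! Be there in 10."),
]
flattened = flatten_dialogue(dialogue)
# `flattened` can now be passed to any single-document summarizer.
```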
HMNet (Zhu et al., 2020) exploits the semantic structure of dialogues, developing a hierarchical architecture to model long dialogue scripts and using role vectors for better speaker modeling.
Since we assume no NLP background in our target users, we provide a short description of every model, illustrating its strengths and weaknesses. These manually written descriptions are displayed when calling the static get_description() method on the model class; a sample description is shown in Fig. 2, and a usage sketch follows.
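For example (the class name and import path here are assumptions for illustration; get_description() itself is named in the text above):

```python
# Calling the static description method named in the paper; the class
# name and import path are assumptions for illustration.
from summertime.model import HMNetModel

print(HMNetModel.get_description())
# Prints a short, manually written note on the model's strengths
# and weaknesses, as in Fig. 2.
```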

Datasets
With SummerTime, users can easily create or convert their own summarization datasets and evaluate all the supported models within the framework. In case no such datasets are available, SummerTime also provides access to a list of existing summarization datasets; this way, users can select models that perform best on one or more datasets similar to their task. A loading sketch follows the dataset descriptions below.

Multi-News (Fabbri et al., 2019) is a large-scale multi-document summarization dataset containing news articles from the site newser.com with corresponding human-written summaries. Over 1,500 sites (i.e., news sources) appear as source documents, more than in other common news datasets.

SAMSum (Gliwa et al., 2019) is a corpus of chat dialogues with human-annotated abstractive summaries. Each dialogue in the SAMSum corpus is written by one person; after collecting the dialogues, experts wrote a single summary for each.

XSum (Narayan et al., 2018) is a news summarization dataset for generating a one-sentence summary that answers the question "What is the article about?". It consists of real-world articles and corresponding one-sentence summaries from the British Broadcasting Corporation (BBC).

ScisummNet (Yasunaga et al., 2019) is a human-annotated dataset for citation-aware scientific paper summarization (Scisumm). It contains over 1,000 papers from the ACL Anthology Network, along with their citation networks and manually labeled summaries.

QMSum (Zhong et al., 2021) is designed for query-based multi-domain meeting summarization. It collects meetings from the AMI and ICSI datasets, as well as committee meetings of the Welsh Parliament and the Parliament of Canada; experts manually wrote summaries for each meeting.

ArXiv (Cohan et al., 2018) is a dataset extracted from research papers for abstractive summarization of single, longer-form documents. For each research paper from arxiv.org, its abstract is used as the ground-truth summary.

Tab. 1 summarizes all datasets included in SummerTime. Note that the fields in this table (i.e., domain, query-based, multi-doc, etc.) are also incorporated in each dataset class (e.g., SAMSumDataset) as class variables, so that these labels can later be used to identify applicable models. As with the model classes, we include a short description for each dataset. Note that the datasets, whether existing or user-created, are mainly for evaluation purposes; we leave the important task of fine-tuning models on these datasets for future work.
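The following sketch loads a built-in dataset. The class name SAMSumDataset appears in the paper, but the import path, the description method, and the attribute names are assumptions for illustration.

```python
# A sketch of loading a built-in dataset. `SAMSumDataset` is named in the
# paper; the import path, method, and attribute names are assumptions.
from summertime.dataset import SAMSumDataset

dataset = SAMSumDataset()
print(SAMSumDataset.get_description())   # short dataset description (assumed name)
print(SAMSumDataset.is_dialogue_based)   # hypothetical Tab. 1 flag as a class variable
```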

Evaluation Metrics
To evaluate the performance of each supported model on a given dataset, SummerTime integrates with SummEval (Fabbri et al., 2020) and provides the following evaluation metrics to help users understand model performance (a usage sketch follows).

ROUGE (Lin, 2004) is a recall-oriented method based on overlapping n-grams, word sequences, and word pairs between the generated output and the gold summary.

BLEU (Papineni et al., 2002) measures n-gram precision and employs a brevity penalty; BLEU is often used as an evaluation metric for machine translation.

ROUGE-WE (Ng and Abrecht, 2015) aims to go beyond surface lexical similarity: it uses pretrained word embeddings to measure the similarity between different words, and shows better correlation with human judgments.

METEOR (Lavie and Agarwal, 2007) is based on word-to-word matches between generated and reference summaries; it considers two words "aligned" based on a Porter stemmer (Porter, 2001) or synonyms in WordNet (Miller, 1995).

BERTScore (Zhang et al., 2020b) computes token-level similarity between sentences using the contextualized embedding of each token.

Since we assume no NLP background in our target users, SummerTime provides a short explanation of each evaluation metric, as well as a clarification of whether higher or lower scores are better, to help non-expert users understand what the metrics mean and use them to make decisions.
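A minimal scoring sketch might look as follows; the class name `Rouge`, the import path, and the evaluate() signature are assumptions for illustration.

```python
# A sketch of scoring generated summaries against references. The class
# name and evaluate() signature are assumptions; SummerTime wraps the
# metrics it adopts from SummEval.
from summertime.evaluation import Rouge

metric = Rouge()
scores = metric.evaluate(
    predictions=["the cat sat on the mat"],
    references=["a cat was sitting on the mat"],
)
print(scores)  # for ROUGE, higher is better
```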

Model Selection
In this section, we describe in detail the workflow of SummerTime and how it helps non-expert users find the best models for their use cases, one of the main functionalities that makes SummerTime stand out from similar libraries.
Create/select datasets The user first either loads a dataset with the APIs we provide or chooses one of the datasets already included in SummerTime. When creating a dataset, the user also needs to specify the Boolean attributes from Tab. 1 to facilitate the next steps (see the sketch below).
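The sketch below wraps user data as a custom dataset. The base class and keyword names are assumptions for illustration; the Boolean flags mirror the fields of Tab. 1 and drive pipeline construction in the next step.

```python
# A sketch of wrapping user data as a custom dataset; the base class and
# keyword names are assumptions for illustration.
from summertime.dataset import CustomDataset

my_dataset = CustomDataset(
    source_documents=[["first source text ..."], ["second source text ..."]],
    reference_summaries=["first reference summary", "second reference summary"],
    is_query_based=False,      # hypothetical Tab. 1 flags
    is_dialogue_based=False,
    is_multi_document=True,
)
```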
Construct pipelines After identifying the potential pipeline modules (e.g., query-based module, dialogue-based module) that are applicable to the task, combinations of specific methods from these modules are put in a pool for further evaluation. An example of this process is shown in Fig. 3: SummerTime automatically constructs solutions for a specific dataset by combining the pipelines and summarization models described in § 3.1.
Search for the best models As shown in Fig. 3, there can be a large pool of solutions to be evaluated. To save time and resources when searching for the best models, SummerTime adopts the idea of successive halving (Li et al., 2017; Jamieson and Talwalkar, 2016). More specifically, SummerTime first uses a small number of examples from the dataset to evaluate all the candidates and eliminates any model that is surpassed by at least one other model on every evaluation metric; it then repeats this step, gradually increasing the evaluation set size to reduce variance. As shown in Algorithm 1, the final output is a set of competing models M, each of which is better than every other on at least one metric (the ">" symbol in line 9 of the algorithm is conceptual and should be read as "better than"). A sketch of the search loop follows.
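The following is a sketch of the successive-halving search in the spirit of Algorithm 1, not the library's exact code. The caller-supplied `evaluate(candidate, sample, metrics)` function is assumed to return a dict mapping each metric to a score, with higher taken as better.

```python
# A sketch of successive halving over candidate pipelines: score everyone
# on a small sample, drop Pareto-dominated candidates, then repeat with a
# larger sample to reduce variance.
import random

def dominates(a, b):
    # True if scores `a` are at least as good as `b` on every metric and
    # strictly better on at least one (Pareto domination).
    return (all(a[m] >= b[m] for m in b)
            and any(a[m] > b[m] for m in b))

def successive_halving(candidates, dataset, metrics, evaluate,
                       start_size=8, rounds=3):
    survivors = list(candidates)
    sample_size = start_size
    for _ in range(rounds):
        sample = random.sample(dataset, min(sample_size, len(dataset)))
        scores = {c: evaluate(c, sample, metrics) for c in survivors}
        # Drop every candidate that some other survivor dominates.
        survivors = [
            c for c in survivors
            if not any(dominates(scores[o], scores[c])
                       for o in survivors if o is not c)
        ]
        sample_size *= 2  # a larger sample reduces evaluation variance
    return survivors  # each survivor beats every other on some metric
```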
Visualization In addition to showing numerical results as tables, SummerTime allows users to visualize the differences between models with various charts and with SummVis (Vig et al., 2021). Fig. 4 shows some examples of the visualization methods SummerTime provides: a scatter plot can help users understand the distribution of a model's performance over individual examples, while the radar chart is an intuitive way of comparing models across metrics. SummerTime can also output the generated summaries in file formats that are directly compatible with SummVis, so that users can easily visualize per-instance output differences at the token level.

Future Work
An important piece of future work for SummerTime is to include more summarization models, enlarging the number of choices for users, and more datasets, increasing the chance that users without data of their own find a similar task or domain for evaluation. Moreover, we would like to enable fine-tuning for a subset of the smaller models we support, for better performance on domains or tasks for which no pretrained models are available. We also plan to add more visualization methods to help users better understand the differences between the outputs of various models and the behavior of each individual model.

Conclusion
We introduce SummerTime, a text summarization toolkit designed for non-expert users. SummerTime includes various summarization datasets, models, and evaluation metrics, and covers a wide range of summarization tasks. It can also automatically identify the best models or pipelines for a specific dataset and task, and visualize the differences between model outputs and performances. SummerTime is open source and available online.