WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation

Recent work has made significant advances on summarization tasks, facilitated by summarization datasets. Several existing datasets take the form of coherent-paragraph summaries. However, these datasets were curated from academic documents written for experts, which makes the essential step of assessing summarization output through human evaluation very demanding. To overcome these limitations, we present a dataset based on article summaries appearing on the WikiHow website, composed of how-to articles and coherent-paragraph summaries written in plain language. We compare our dataset's attributes to those of existing datasets, including readability and world knowledge, showing that our dataset makes human evaluation significantly easier and thus more effective. A human evaluation conducted on PubMed and the proposed dataset reinforces our findings.


Introduction
Summarization is the task of preserving the key information in a text while reducing its length. Recently, many summarization datasets have been published and have helped push the boundaries of new summarization systems. These datasets differ in several properties, including the domain (e.g., academic or news) and the summary form. PubMed, arXiv, and BigPatent (Cohan et al., 2018; Sharma et al., 2019) provide a summary in the form of coherent paragraphs (i.e., each sentence flows smoothly into the next). In contrast, other summarization datasets (Hermann et al., 2015; Grusky et al., 2018; Koupaee and Wang, 2018; Ladhak et al., 2020) offer a summary in the form of a key points list (i.e., highlights). In this paper, we focus on coherent-paragraph summarization datasets. Automatic evaluation of summarization systems, e.g., by using the ROUGE metric, is challenging (Lloret et al., 2018) and is often inconsistent with human evaluation (Liu and Liu, 2008; Cohan and Goharian, 2016; Tay et al., 2019; Huang et al., 2020). To understand, and later improve, the quality of summarization systems, it is necessary to conduct a human evaluation. The quality of a human evaluation depends on how easy the measured text is to read and understand: a simple text does not require annotators with unique expertise, can be evaluated faster, and is easier to annotate correctly. However, existing coherent-paragraph summarization datasets consist of academic papers and cannot be considered easy to read. Evaluating such summarization samples requires unique expertise, takes time, and comes at a high cost.
In this work, we present WikiSum, a new summarization dataset from the WikiHow knowledge base 2 . The WikiSum documents are written in simple English, and the summaries provide "non-obvious tips that mimic the advice a knowledgeable, empathetic friend might give." 3 Unlike previous WikiHow summarization (Koupaee and Wang, 2018; Ladhak et al., 2020) Figure 1). Moreover, in contrast to other coherent-paragraph summarization datasets from the academic domain, WikiSum is written in simple English. This critical property can help with the challenging task of evaluating summarization systems and can provide insights that go unnoticed when using automatic evaluation methods.
The key attributes of WikiSum are: (1) Summaries written as a single, coherent passage. (2) Articles and summaries that are easy to read. (3) Articles and summaries that require less world knowledge to understand. We evaluate the dataset's readability and estimate the required world knowledge in Section 3. Moreover, we reinforce our results by conducting a human evaluation of a summarization task in Section 4. Finally, to establish a baseline on the proposed dataset, we benchmark WikiSum using recent summarization systems and report their performance in Section 5.

Related Work
The summarization landscape can be roughly divided into three primary summary forms: (1) Single sentence (Napoles et al., 2012; Grusky et al., 2018; Narayan et al., 2018; Kim et al., 2019): the document is summarized in a single sentence; (2) Highlights (Hermann et al., 2015; Koupaee and Wang, 2018; Ladhak et al., 2020): a summary in the form of bullets listing the key points in the text; (3) Coherent summary (Sharma et al., 2019; Cohan et al., 2018): short coherent paragraphs describing the salient information. The summarization datasets from the news domain, which are commonly used for human evaluation, include summaries in the form of highlights or single-sentence summaries. However, summarization datasets written in a coherent format come from the academic domain, making them extremely difficult to annotate manually. Our proposed WikiSum is the only dataset written in a coherent format that is nonetheless easy to evaluate manually. We do not claim that coherent-paragraph summaries are better, but rather different; each format has its use cases, and human evaluation should be done on each of the different formats separately.
The existing WikiHow datasets (Koupaee and Wang, 2018; Ladhak et al., 2020) can be considered the closest to WikiSum, as they originate from the same knowledge base. However, while the existing WikiHow datasets split the article to generate the document and summary, WikiSum uses the entire article as the document and a summary specifically written by the article's author (called the Article Quick Summary). The former use the concatenation of the first line of each step, called the step header, as the list of highlights, and the concatenation of the remaining step text, called "wraptext," as the document 4 . In addition to the different summary form of the highlight-based WikiHow and WikiSum, the content of the summaries is significantly different, as illustrated by the low BLEU-4 (0.06 5 ) between the two. BigPatent (Sharma et al., 2019), arXiv, and PubMed (Cohan et al., 2018) are recent summarization datasets with coherent-paragraph summaries. These datasets focus on the academic domain and are written for experts. Like these datasets, WikiSum is composed of long documents and coherent-paragraph summaries. Nonetheless, it uses common everyday language and ranges over many domains (see Figure 2). Finally, Table 1 compares WikiSum to common existing datasets. Additional details on WikiSum are available in the appendix.

Measuring Text Difficulty
This section focuses on two crucial attributes: ease of readability and external knowledge required, shown (in Section 4) to be important for easy and effective human evaluation. For brevity, we focus on summarization datasets with coherent-paragraph summaries.

Readability
Readability metrics attempt to indicate how difficult an English passage is to read. We used classical readability measures: FKGL (Farr et al., 1951), GFI (Robert, 1968), SMOG (McLaughlin, 1969), ARI (Senter and Smith, 1967), and CLI (Coleman and Liau, 1975). All these metrics are based on lexical features of the text, e.g., the number of words in a sentence or the mean number of syllables per word. They produce a score that is interpreted as the number of years of formal education required (for a native English speaker) to understand a piece of text 6 . For each sample, we measured readability scores 7 for the document and the ground-truth summary. The document is longer than the summary, so its readability is of higher importance. We report the average readability score over all the samples in the dataset.
Readability scores for the documents are presented at the top of Table 2. The table shows that WikiSum documents are significantly easier to read than documents from the other coherent-summary datasets (arXiv, PubMed, BigPatent). Similar results hold for the readability scores of the summaries (bottom of Table 2). To conclude, WikiSum is measured as drastically simpler to read than other coherent-summary datasets.
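As an illustration of how such a lexical readability score is computed, the following minimal sketch implements the standard ARI formula in Python. The tokenization rules (regular expressions for words and sentences) and the function name are assumptions made for this example; the readability package used for the reported scores may count characters, words, and sentences differently, so absolute values need not match Table 2.

```python
import re


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43.

    Assumption: words are runs of letters or digits, and sentences end at
    '.', '!' or '?'. Real tools may tokenize differently.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(len(word) for word in words)
    return (
        4.71 * characters / len(words)
        + 0.5 * len(words) / len(sentences)
        - 21.43
    )


easy = "The cat sat. The dog ran."
hard = "Antipsychotic medications occasionally precipitate irreversible dystonia."
# Longer words and longer sentences push the score (school grade) upward.
print(automated_readability_index(easy) < automated_readability_index(hard))
```

ARI is shown here because it needs only character, word, and sentence counts; FKGL, GFI, and SMOG additionally require syllable counting.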

External Knowledge
Existing datasets are composed of academic documents that are written for experts. Often, fully understanding academic texts requires domain knowledge, which makes the annotator pool smaller, and thus, in most cases, more expensive. Word frequency is a strong indicator of how familiar a word is (Paetzold and Specia, 2016), where rare words tend to be less familiar. To obtain word frequencies, we used OpenSubtitles (Lison and Tiedemann, 2016), a text corpus compiled from an extensive database of movie and TV subtitles. We hypothesize that movie and TV subtitles can roughly represent common knowledge among many people. In Figure 3, we show the percentage of non-frequent words in a document (i.e., words that cannot be found among the top-k words in OpenSubtitles) as a function of k, averaged over a random sample of 10,000 documents from each dataset. This figure clearly shows that WikiSum contains significantly fewer words that are uncommon in TV shows and movies, and therefore requires less specialized external knowledge.
6 Other readability metrics, such as FRE (Flesch, 1948), LIX, and RIX (Björnsson, 1968), show a similar trend to the reported metrics but require a translation to years of education, which is omitted from this paper for brevity.
7 https://github.com/mmautner/readability
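The quantity plotted against k can be sketched as follows. This is a minimal illustration under stated assumptions: `ranked_vocab` is a hypothetical list of lowercase words sorted from most to least frequent (for instance, one derived from OpenSubtitles counts), and tokenization and case handling are simplifications not specified in the text.

```python
def non_frequent_ratio(tokens, ranked_vocab, k):
    """Fraction of tokens that are not among the k most frequent words.

    Assumption: `ranked_vocab` is sorted by descending frequency and is
    lowercase; tokens are lowercased before lookup.
    """
    if not tokens:
        return 0.0
    top_k = set(ranked_vocab[:k])
    rare = sum(1 for token in tokens if token.lower() not in top_k)
    return rare / len(tokens)


# Toy data, for illustration only.
ranked_vocab = ["the", "you", "to", "a", "cat"]
tokens = ["The", "cat", "ate", "a", "scone"]
for k in (2, 5):
    print(k, non_frequent_ratio(tokens, ranked_vocab, k))
```

Averaging this ratio over the sampled documents for each value of k gives one curve per dataset.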

Human Evaluation
We conducted a standard human evaluation on a summarization task, in addition to the automatic readability and external knowledge metrics. We gathered a pool of 6 annotators without any prior knowledge of the project, all with a graduate degree (M.Sc. or Ph.D.) and a proficient English reading level. We asked them to evaluate summaries generated by Pegasus (Zhang et al., 2020). The annotation task followed Huang et al. (2020) and consisted of relevance, consistency, fluency, and coherency. Due to resource limitations (and the difficulty of annotating articles from the academic domain), we had to pick one coherent-paragraph dataset for comparison with WikiSum. To avoid annotators' domain bias, we selected articles from PubMed, which contains articles outside the area of expertise of any annotator, in addition to WikiSum. We sampled random articles with 950-1050 words to avoid length bias, ensuring that article length is similar in both datasets. Each annotator allocated 1 hour, which amounted to 42 annotations, 21 for each dataset.
Table 3: Evaluation time per sample, evaluation difficulty/exhaustion rating, perceived qualification, and the ratio of unknown words in the document. ± denotes the 95% confidence interval according to Student's t distribution (df=20). Difficulty, qualification, and tiredness were rated on a 1-5 scale.
During the annotation task, we measured the evaluation time and asked the annotators to mark unfamiliar words. In addition, we asked the annotators to rate the following aspects on a 1-5 scale: (a) How difficult was the task? (b) How tiring was it? (c) How qualified are you for this task? After each pair of PubMed and WikiSum samples was completed, the annotators selected which dataset they preferred to evaluate.
In Table 3 we show the annotators' assessment of the tasks. Compared to PubMed, a WikiSum annotation takes significantly less time, is less difficult, and is less tiring. Moreover, the annotators reported that they felt much more qualified to assess the WikiSum summaries. Finally, in 90% of the cases (19 out of 21), the annotators reported that they preferred the WikiSum annotation task. This reinforces our findings that WikiSum is significantly easier to annotate than PubMed.
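The ± values in Table 3 are 95% confidence intervals under Student's t distribution with df=20, i.e., 21 annotations per dataset. A minimal sketch of that computation is given below; the function name is hypothetical, and the constant 2.086 is the standard two-sided 95% critical value for 20 degrees of freedom taken from a t table, not a number stated in the text.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for df = 20 (standard t table).
T_CRITICAL_DF20 = 2.086


def t_confidence_interval(values, t_critical):
    """Return (mean, half_width) so that the interval is mean ± half_width."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean = statistics.mean(values)
    # Standard error uses the sample standard deviation (n - 1 denominator).
    half_width = t_critical * statistics.stdev(values) / math.sqrt(n)
    return mean, half_width


# Toy ratings on a 1-5 scale (21 values, so df = 20); not data from the study.
ratings = [1, 2, 2, 1, 3, 2, 1, 2, 2, 3, 1, 2, 2, 1, 2, 3, 2, 1, 2, 2, 1]
mean, half_width = t_confidence_interval(ratings, T_CRITICAL_DF20)
print(f"{mean:.2f} ± {half_width:.2f}")
```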
In the annotation task, we also asked the annotators to mark unfamiliar words in the article. We found a strong correlation between the count of unfamiliar words and the task difficulty, evaluation time, and perceived required qualification (Pearson correlation of 0.57, 0.36, and −0.48 8 , respectively, p < 0.05). A strong correlation was also found between the ARI readability metric (Section 3.1) and the above-mentioned annotation metrics (Pearson correlation of 0.69, 0.49, and −0.76, p < 0.05). This demonstrates the effect of readability on the difficulty of an annotation task. Finally, we found that unfamiliar words correspond to low-frequency OpenSubtitles words (Section 3.2). The unfamiliar words in WikiSum and PubMed appear in the top 91,550 and 230,596 words on average, respectively, while familiar words appear in the top 16,935 and 59,244 words on average, respectively. This further validates the hypothesis of Paetzold and Specia (2016) about the strong correlation between word frequency and complexity.
8 Many unfamiliar words implied that annotators perceived themselves as less qualified.
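For reference, the Pearson correlation used above can be computed as in the following minimal sketch. The function name and the toy numbers are illustrative assumptions; they are not the annotation measurements from this study.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson r: covariance of xs and ys divided by the product of their spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


# Toy example: unfamiliar-word counts against difficulty ratings (1-5 scale).
unfamiliar_counts = [0, 1, 3, 6, 9]
difficulty = [1, 2, 2, 4, 5]
print(round(pearson_correlation(unfamiliar_counts, difficulty), 3))
```

A negative value, as reported for perceived qualification, means the two quantities move in opposite directions.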

Model Results and Discussion
To provide both abstractive and extractive baselines for WikiSum, we evaluate PEGASUS LARGE (Zhang et al., 2020), TextRank (Mihalcea and Tarau, 2004), and the common LEAD-3 baseline, which selects the first three sentences of the document as the summary. We compare the results on WikiSum to the results on the arXiv, PubMed, and BigPatent datasets. Table 4 reports the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L for all the models. The results show that the models' performance on WikiSum is not drastically different from their performance on the other datasets, making it an interesting dataset for benchmarking summarization systems. The detailed evaluation setup can be found in the supplementary materials.
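The LEAD-3 baseline and a unigram-overlap F1 score can be sketched as below. This is a simplified illustration, not the evaluation setup used for Table 4: the regex sentence splitter and the `rouge_1_f1` function are assumptions for this example, and official ROUGE implementations add steps such as stemming that are omitted here.

```python
import re
from collections import Counter


def lead_3(document: str) -> str:
    """Return the first three sentences of the document as the summary."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(sentences[:3])


def rouge_1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


document = "Read the test first. Answer easy questions. Skip hard ones. Guess last."
summary = lead_3(document)
print(summary)
print(round(rouge_1_f1(summary, "Read the test, then skip hard questions."), 3))
```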
To conclude, this paper presents the WikiSum dataset, which is drastically simpler for human evaluation than existing summarization datasets where the summary appears as a coherent paragraph. We showed WikiSum's simplicity via various readability metrics and demonstrated that the text requires less external knowledge to be understood. Finally, we validated our finding via a human evaluation task on WikiSum and PubMed.

A Data Description
A.1 Gathering the Data
We use the Scrapy scraper 9 to download articles and summaries from the wikihow.com website. We removed HTML tags using BeautifulSoup 10 . Finally, we removed any sample in which the summary is a list of bullet points; around 7k samples were excluded in this manner.

A.2 Authors Instructions for Writing Quick Summaries
The wikihow.com website provides the following guidelines for authors writing a quick summary. 11 The goal of the "Quick Summary" section on wikiHow is to provide a short summary of non-obvious tips that mimic the advice a knowledgeable, empathetic friend might give you if you asked them for help on the given topic. Among other uses, Quick Summaries help smart devices like Google Homes and Amazon Echos deliver wikiHow advice to listeners in need of how-to guidance.
We remark that the quick summaries are indeed used by commercial voice assistants to answer how-to questions. As voice assistants gain popularity, so does the importance of such coherent-paragraph summaries.

A.3 Data Layout
Raw data is available in the supplementary material, in JSON format. Each line consists of a single sample, with the following fields:
1. Link to the original article
2. Article title
3. Article text
4. Quick summary
5. Split fold (train, dev, or test)
Finally, it also includes step headers: the first line in each step. This is part of the article but might be considered more important, and therefore, it might find further uses by system designers.
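A file in this one-sample-per-line layout can be read as in the following minimal sketch. The key names (`url`, `title`, `article`, `summary`, `fold`, `step_headers`) are hypothetical placeholders chosen for this example; the actual keys in the released file are not given in the text and should be checked against the data. The two records below are toy values, not samples from the dataset.

```python
import io
import json


def read_samples(lines):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def by_fold(samples, fold):
    """Keep only the samples assigned to the given split fold."""
    return [sample for sample in samples if sample.get("fold") == fold]


# In-memory stand-in for the raw data file, with hypothetical key names.
toy_file = io.StringIO(
    '{"url": "https://example.org/a", "title": "Toy A", "article": "...", '
    '"summary": "...", "step_headers": ["..."], "fold": "train"}\n'
    '{"url": "https://example.org/b", "title": "Toy B", "article": "...", '
    '"summary": "...", "step_headers": ["..."], "fold": "test"}\n'
)
samples = list(read_samples(toy_file))
print(len(samples), [s["title"] for s in by_fold(samples, "train")])
```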

A.4 Dataset Statistics
Most dataset statistics appear in Table 1 in the article's main body and are repeated here for completeness. The total number of samples in the WikiSum dataset is 39,775. On average, each summary consists of 101.2 words, while each article consists of 1,334.2 words. The average compression ratio is 13.9.
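The sketch below shows two ways an average compression ratio can be computed, assuming the ratio is article length divided by summary length in words. The ratio of the reported averages is 1,334.2 / 101.2 ≈ 13.2, so the reported 13.9 is plausibly the mean of per-sample ratios; this reading is an assumption and is not stated in the text. Function names and the toy lengths are illustrative.

```python
def mean_of_ratios(length_pairs):
    """Average of per-sample ratios: article words / summary words."""
    ratios = [article / summary for article, summary in length_pairs]
    return sum(ratios) / len(ratios)


def ratio_of_means(length_pairs):
    """Ratio of the average article length to the average summary length."""
    total_article = sum(article for article, _ in length_pairs)
    total_summary = sum(summary for _, summary in length_pairs)
    return total_article / total_summary


# Toy (article_words, summary_words) pairs; not taken from the dataset.
pairs = [(100, 10), (300, 20)]
print(mean_of_ratios(pairs))   # 12.5
print(ratio_of_means(pairs))   # about 13.33
```

The two definitions generally differ, which is why a reported average ratio need not equal the quotient of the reported average lengths.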
GPU, using max input and output sequence lengths of 1024 and 256, respectively.

B Example Summaries
In this appendix, we provide an example summary from each of WikiSum, arXiv, PubMed, and BigPatent. Note that the article can be quite long (for arXiv and PubMed, it is a full academic paper), so it is not presented in this appendix. Instead, we provide a link to the online version of the full article.

B.1 WikiSum
The WikiSum example summary is provided below: "To ace a test, even if you're not prepared, start by glancing over the test before you get started to get an idea of how long it is so you can manage your time better. Then, read through each question twice and try to answer it. If you can't answer a question, skip it and come back to it later if you can, which will save you from wasting all of your time on one question. If your test is multiple choice and you don't know the answer, eliminate two answers, so you're left with just two options. Then, guess if necessary since you'll have a 50-percent chance of being right." The article is available at https://www.wikihow.com/Ace-a-Test.

B.2 WikiHow
For the sake of comparison between the WikiHow and WikiSum datasets, we provide the WikiHow summary originating from the same raw material (i.e., the same wikihow.com how-to article) as the WikiSum example in Appendix B.1. We remark that the article to be summarized is not exactly the same, as the WikiHow example does not contain the step headers from the article's text. The WikiHow summary is provided below. "Study well before the test. Get a study friend. Take breaks. Relax. Pay attention in class. Do all available practice questions. Get some sleep the night before. Have proper meals before the test day. Have your test-taking materials assembled and ready. Listen to music you like. Go into the test in a positive manner. Take deep breaths to try to keep calm. Read the questions carefully. Do the easy questions first. Go with your first answer. Use logic if you're stuck on a multiple choice question. Review your answers thoroughly when you are done." It can easily be seen that the WikiSum summary is a coherent, fluent paragraph, while the WikiHow summary is a set of bullet points. The content of the two summaries is also quite different.

B.3 arXiv
"the effect of a random phase diffuser on fluctuations of laser light ( scintillations ) is studied. not only spatial but also temporal phase variations introduced by the phase diffuser are analyzed. the explicit dependence of the scintillation index on finite -time phase variations is obtained for long propagation paths. it is shown that for large amplitudes of phase fluctuations , a finite -time effect decreases the ability of phase diffuser to suppress the scintillations." The article is available at https://arxiv.org/pdf/0903.5449.pdf.

B.4 PubMed
"tardive dystonia ( td ) is a serious side effect of antipsychotic medications, more with typical antipsychotics, that is potentially irreversible in affected patients. studies show that newer atypical antipsychotics have a lower risk of td. as a result, many clinicians may have developed a false sense of security when prescribing these medications. we report a case of 20-year -old male with hyperthymic temperament and borderline intellectual functioning, who developed severe td after low dose short duration exposure to atypical antipsychotic risperidone and then olanzapine. the goal of this paper is to alert the reader to be judicious and cautious before using casual low dose second generation antipsychotics in patient with no core psychotic features, hyperthymic temperament, or borderline intellectual functioning suggestive of organic brain damage, who are more prone to develop adverse effects such as td and monitor the onset of td in patients taking atypical antipsychotics." The article is available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5330001/.

B.5 BigPatent
"this invention relates to novel calcium phosphatecoated implantable medical devices and processes of making same. the calcium -phosphate coatings are designed to minimize the immune response to the implant and can be used to store and release a medicinally active agent in a controlled manner.
such coatings can be applied to any implantable medical devices and are useful for a number of medical procedures including balloon angioplasty in cardiovascular stenting, ureteral stenting and catheterisation. the calcium phosphate coatings can be applied to a substrate as one or more coatings by a sol -gel deposition process, an aerosol -gel deposition process, a biomimetic deposition process, a calcium phosphate cement deposition process, an electro -phoretic deposition process or an electrochemical deposition process. the coating can contain and elude a drug in an engineered manner." The article is available at https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2007147234.