TWAG: A Topic-Guided Wikipedia Abstract Generator

Wikipedia abstract generation aims to distill a Wikipedia abstract from web sources and has met significant success by adopting multi-document summarization techniques. However, previous works generally view the abstract as plain text, ignoring the fact that it is a description of a certain entity and can be decomposed into different topics. In this paper, we propose a two-stage model TWAG that guides the abstract generation with topical information. First, we detect the topic of each input paragraph with a classifier trained on existing Wikipedia articles to divide input documents into different topics. Then, we predict the topic distribution of each abstract sentence, and decode the sentence from topic-aware representations with a Pointer-Generator network. We evaluate our model on the WikiCatSum dataset, and the results show that TWAG outperforms various existing baselines and is capable of generating comprehensive abstracts.


Introduction
Wikipedia, one of the most popular crowd-sourced online knowledge bases, has been widely used as a valuable resource in natural language processing tasks such as knowledge acquisition (Lehmann et al., 2015) and question answering (Hewlett et al., 2016; Rajpurkar et al., 2016) due to its high quality and wide coverage. Within a Wikipedia article, the abstract is the overview of the whole content, and thus becomes the most frequently used part in various tasks. However, the abstract is often contributed by experts, which is labor-intensive, and the result is prone to being incomplete.
In this paper, we aim to automatically generate Wikipedia abstracts based on related documents collected from referred websites or search engines, which is essentially a multi-document summarization problem. This problem has been studied in both extractive and abstractive manners.
The extractive models attempt to select relevant textual units from the input documents and combine them into a summary. Graph-based representations are widely exploited to capture the most salient textual units and enhance the quality of the final summary (Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Wan, 2008). Recently, there have also emerged neural extractive models (Yasunaga et al., 2017; Yin et al., 2019) utilizing the graph convolutional network (Kipf and Welling, 2017) to better capture inter-document relations. However, these models are not suitable for Wikipedia abstract generation, because the input documents collected from various sources are often noisy and lack intrinsic relations (Sauper and Barzilay, 2009), which makes the relation graph hard to build.
The abstractive models aim to distill an informative and coherent summary via sentence fusion and paraphrasing (Filippova and Strube, 2008; Banerjee et al., 2015; Bing et al., 2015), but achieve little success due to the limited scale of datasets. Liu et al. (2018) proposes an extractive-then-abstractive model and contributes WikiSum, a large-scale dataset for Wikipedia abstract generation, inspiring a branch of further studies (Perez-Beltrachini et al., 2019; Li et al., 2020).
The above models generally view the abstract as plain text, ignoring the fact that Wikipedia abstracts describe certain entities, and that the structure of Wikipedia articles could help generate comprehensive abstracts. We observe that, when writing Wikipedia abstracts, humans tend to describe entities in a certain domain from several topics. As illustrated in Figure 1, the abstract of the Arctic fox covers its adaptation, biological taxonomy and geographical distribution, which is consistent with the content table. Therefore, given an entity in a specific domain, generating abstracts from corresponding topics would reduce redundancy and produce a more complete summary.

(Figure 1: The abstract and content table of the Wikipedia article on the Arctic fox (Vulpes lagopus), a small fox native to the Arctic regions of the Northern Hemisphere. The abstract describes its adaptation to cold environments, its body shape and fur, and its distribution, mirroring the sections of the content table.)
In this paper, we try to utilize the topical information of entities within their domains (Wikipedia categories) to improve the quality of the generated abstract. We propose a novel two-stage Topic-guided Wikipedia Abstract Generation model (TWAG). TWAG first divides the input documents by paragraph and assigns a topic to each paragraph with a classifier-based topic detector. Then, it generates the abstract in a sentence-wise manner, i.e., it predicts the topic distribution of each abstract sentence to determine its topic-aware representation, and decodes the sentence with a Pointer-Generator network (See et al., 2017).
We evaluate TWAG on the WikiCatSum dataset (Perez-Beltrachini et al., 2019), a subset of WikiSum covering three distinct domains. Experimental results show that it significantly improves abstract quality over several strong baselines.
In conclusion, the contributions of our work are as follows:
• We propose TWAG, a two-stage neural abstractive model for Wikipedia abstract generation that utilizes the topic information in Wikipedia and is capable of generating comprehensive abstracts.
• We simulate the way humans recognize entities, using a classifier to divide input documents into topics, and then perform topic-aware abstract generation upon the predicted topic distribution of each abstract sentence.
• Our experiments against 4 distinct baselines demonstrate the effectiveness of TWAG.
Related Work

Multi-document Summarization
Multi-document summarization is a classic and challenging problem in natural language processing, which aims to distill an informative and coherent summary from a set of input documents. Compared with single-document summarization, the input documents may contain redundant or even contradictory information (Radev, 2000).
Early high-quality multi-document summarization datasets are annotated by humans, e.g., the datasets of the Document Understanding Conference (DUC) and the Text Analysis Conference (TAC). These datasets are too small to build neural models on, and most of the early works take an extractive approach, attempting to build graphs with inter-paragraph relations and choose the most salient textual units. The graph can be built with various kinds of information, e.g., TF-IDF similarity (Erkan and Radev, 2004), discourse relations (Mihalcea and Tarau, 2004), document-sentence two-layer relations (Wan, 2008), multi-modal information (Wan and Xiao, 2009) and query information (Cai and Li, 2012). Recently, there have emerged attempts to incorporate neural models, e.g., Yasunaga et al. (2017) builds a discourse graph and represents textual units with the graph convolutional network (GCN) (Kipf and Welling, 2017), and Yin et al. (2019) adopts the entity linking technique to capture global dependencies between sentences and ranks the sentences with a neural graph-based model.
In contrast, early abstractive models using sentence fusion and paraphrasing (Filippova and Strube, 2008; Banerjee et al., 2015; Bing et al., 2015) achieve less success. Inspired by the recent success of single-document abstractive models (See et al., 2017; Paulus et al., 2018; Gehrmann et al., 2018; Huang et al., 2020), some works (Liu et al., 2018; Zhang et al., 2018) try to transfer single-document models to multi-document settings to alleviate the limitation of small-scale datasets. Specifically, Liu et al. (2018) defines the Wikipedia generation problem and contributes the large-scale WikiSum dataset. Fabbri et al. (2019) constructs a middle-scale dataset named Multi-News and proposes an extractive-then-abstractive model by appending a sequence-to-sequence model after the extractive step. Li et al. (2020) models inter-document relations with explicit graph representations, and incorporates pre-trained language models to better handle long input documents.

Wikipedia-related Text Generation
Sauper and Barzilay (2009) is the first work focusing on Wikipedia generation, using Integer Linear Programming (ILP) to select useful sentences for Wikipedia abstracts. Banerjee and Mitra (2016) further evaluates the coherence of the selected sentences to improve linguistic quality. Liu et al. (2018) proposes a two-stage extractive-then-abstractive model, which first picks paragraphs from web sources according to TF-IDF weights, then generates the summary with a Transformer model by viewing the input as one long flat sequence. Inspired by this work, Perez-Beltrachini et al. (2019) uses a convolutional encoder and a hierarchical decoder, and utilizes the Latent Dirichlet Allocation (LDA) model to render the decoder topic-aware. HierSumm (Liu and Lapata, 2019) adopts a learning-based model for the extractive stage, and computes attention between paragraphs to model dependencies across multiple paragraphs. However, these works view Wikipedia abstracts as plain text and do not explore the underlying topical information in Wikipedia articles.
There are also works focusing on generating other kinds of Wikipedia text. Biadsy et al. (2008) utilizes the key-value pairs in Wikipedia infoboxes to generate high-quality biographies. Hayashi et al. (2021) investigates the structure of Wikipedia and builds an aspect-based summarization dataset by manually labeling aspects and identifying the aspect of input paragraphs with a fine-tuned RoBERTa model. Our model also utilizes the structure of Wikipedia, but we generate a compact abstract rather than individual aspects, which requires fusing aspects and poses a greater challenge of understanding the connections and differences among topics.

Problem Definition
Definition 1 Wikipedia abstract generation takes a set of paragraphs D = {d_1, d_2, ..., d_n} of size n as input, and outputs a Wikipedia abstract S = (s_1, s_2, ..., s_m) with m sentences. The goal is to find an optimal abstract S* that best summarizes the input, i.e.,

S* = arg max_S P(S | D).

Previous works generally view S as plain text, ignoring the semantics in Wikipedia articles. Before introducing our idea, let us review how Wikipedia organizes its articles.
Wikipedia employs a hierarchical open category system to organize millions of articles, and we call the top-level category the domain. Within a Wikipedia article, we are concerned with three parts: the abstract, the content table, and the textual contents. Note that the content table is composed of several section labels {l}, each paired with corresponding textual contents {p}. As illustrated in Figure 1, the content table indicates different aspects (which we call topics) of the article, and the abstract semantically corresponds to these topics, suggesting that topics could benefit abstract generation.
However, general domains like Person or Animal consist of millions of articles with diverse content tables, making it infeasible to simply treat section labels as topics. Considering that articles in a specific domain often share several salient topics, we manually merge similar section labels to convert the section titles into a set of topics. Formally, the topic set of a domain is denoted as T = {T_1, T_2, ..., T_{n_t}}, where each topic T_i = {l_1, l_2, ..., l_{m_i}} is a group of merged section labels. Now, our task can be expressed with a topical objective, i.e.,

Definition 2 Given the input paragraphs D, we introduce the latent topics Z = {z_1, z_2, ..., z_n}, where z_i ∈ T is the topic of the i-th input paragraph d_i, and the objective of Wikipedia abstract generation is re-written as

S* = arg max_S P(S | D, Z*), where Z* = arg max_Z P(Z | D).

Therefore, abstract generation can be completed with two sub-tasks, i.e., topic detection to optimize arg max_Z P(Z | D) and topic-aware abstract generation to optimize arg max_S P(S | D, Z).
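As an illustration, the merge of section labels into topics amounts to a simple lookup. The label names and groupings below are invented for the example; the actual merges per domain are listed in Appendix A:

```python
# Hypothetical label-to-topic merges for an Animal-like domain.
# Real merges are domain-specific and listed in the paper's Appendix A.
LABEL_TO_TOPIC = {
    "description": "Description", "appearance": "Description",
    "distribution": "Distribution", "habitat": "Distribution",
    "taxonomy": "Taxonomy", "subspecies": "Taxonomy",
}

def topic_of_label(section_label: str) -> str:
    """Map a raw section label to its merged topic; unknown labels fall into
    the NOISE topic introduced in Section 4.3.1."""
    return LABEL_TO_TOPIC.get(section_label.strip().lower(), "NOISE")
```

Each topic T_i is thus the set of labels mapping to it, and the lookup realizes the membership test l ∈ T_i.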

The Proposed Method
As shown in Figure 2, our proposed TWAG adopts a two-stage structure. First, we train a topic detector based on existing Wikipedia articles to predict the topic of each input paragraph. Second, we group the input paragraphs by detected topics to encode them separately, and generate the abstract in a sentence-wise manner. In each step, we predict the topic distribution of the current sentence, fuse it with the global hidden state to get the topic-aware representation, and generate the sentence with a copy-based decoder. Next, we detail each module.

(Figure 2: The TWAG framework, illustrated with an example domain with 3 topics. The left half is the topic detector, which finds a topic for each input paragraph; the right half is the topic-aware abstract generator, which generates the abstract sentence by sentence based on the input paragraphs and their predicted topics.)

Topic Detection
The topic detector aims to annotate input paragraphs with their optimal corresponding topics. Formally, given the input paragraphs D, the detector Det returns their corresponding topics Z = {z_1, z_2, ..., z_n}, i.e.,

Z = Det(D).

We view topic detection as a classification problem. For each paragraph d ∈ D, we encode it with ALBERT (Lan et al., 2019) and then predict its topic z with a fully-connected layer, i.e.,

d = ALBERT(d),   z = arg max softmax(W_d d + b_d),

where d is the vector representation of d and W_d, b_d are trainable parameters; the ALBERT model is fine-tuned from a pretrained checkpoint.
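A minimal PyTorch sketch of this classifier, with the encoder abstracted as any module that maps a batch of paragraphs to fixed-size vectors (in our model this is a fine-tuned ALBERT; the class and parameter names below are ours, not from the paper's code):

```python
import torch
import torch.nn as nn

class TopicDetector(nn.Module):
    """Classifier-based topic detector: a paragraph encoder followed by a
    fully-connected layer over the domain's topic set."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, num_topics: int):
        super().__init__()
        self.encoder = encoder                 # e.g. ALBERT's pooled output
        self.fc = nn.Linear(hidden_dim, num_topics)

    def forward(self, paragraphs: torch.Tensor) -> torch.Tensor:
        d = self.encoder(paragraphs)           # (batch, hidden_dim) vectors
        logits = self.fc(d)                    # (batch, num_topics)
        return torch.softmax(logits, dim=-1)   # per-paragraph topic distribution
```

The predicted topic of a paragraph is the arg max of the returned distribution; training uses the negative log-likelihood loss described in Section 4.3.1.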

Topic-aware Abstract Generation
The topic-aware abstract generator utilizes the input paragraphs D and the detected topics Z to generate the abstract. Specifically, it contains three modules: a topic encoder that encodes the input paragraphs into topical representations, a topic predictor that predicts the topic distribution of each abstract sentence and produces its topic-aware representation, and a sentence decoder that generates abstract sentences from the topic-aware representations.

Topic Encoder
Given the input paragraphs D and the detected topics Z, we concatenate all paragraphs belonging to the same topic T_k to form a topic-specific text group (TTG) G_k, which contains the salient information about a certain topic of an entity:

G_k = concat({d_i | z_i = T_k}).

To further capture hidden semantics, we use a bidirectional GRU to encode the TTGs:

U_k, g_k = BiGRU(G_k),

where g_k is the final hidden state of G_k, and U_k = (u_1, u_2, ..., u_{n_{G_k}}) contains the hidden state of each token in G_k, with n_{G_k} denoting the number of tokens in G_k.
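The grouping step that builds TTGs can be sketched in plain Python (names illustrative; the BiGRU encoding itself is omitted):

```python
from collections import defaultdict

def build_ttgs(paragraphs, topics):
    """Concatenate all paragraphs sharing a detected topic into one
    topic-specific text group (TTG): G_k = concat({d_i | z_i = T_k})."""
    groups = defaultdict(list)
    for para, topic in zip(paragraphs, topics):
        groups[topic].append(para)
    return {t: " ".join(ps) for t, ps in groups.items()}
```

Each TTG is then fed to the BiGRU encoder as a single token sequence.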

Topic Predictor
After encoding the topics into hidden states, TWAG tackles the decoding process in a sentence-wise manner: to generate the abstract S, we first predict the topic distribution of every sentence s_i with a GRU decoder. At each time step t, the topic predictor produces a global hidden state h_t and then estimates the probability distribution q_t over topics:

h_t = GRU(h_{t-1}, e_{t-1}),   q_t = softmax(W_q h_t + b_q),

where e_{t-1} denotes the topical information of the last step. e_0 is initialized as an all-zero vector, and e_t can be derived from q_t in two ways. The first, named hard topic, directly selects the topic with the highest probability and takes its corresponding representation, i.e.,

e_t = g_k,   k = arg max_k q_t(k).

The second, named soft topic, views every sentence as a mixture of different topics and takes the weighted sum over topic representations, i.e.,

e_t = G^T q_t,

where G = (g_1, g_2, ..., g_{n_t}) is the matrix of topic representations. Observing that Wikipedia abstract sentences normally contain mixed topics, we choose the soft topic mechanism for our model (see Section 5.3 for details). Finally, we compute the topic-aware hidden state r_t by adding h_t and e_t; it serves as the initial hidden state of the sentence decoder:

r_t = h_t + e_t.

Additionally, a stop confirmation is executed at each time step:

p_stop = σ(W_stop h_t + b_stop),

where σ denotes the sigmoid function and W_stop, b_stop are trainable parameters. If p_stop > 0.5, TWAG terminates the decoding process and no more abstract sentences are generated.

Sentence Decoder
Our sentence decoder adopts the Pointer-Generator network (See et al., 2017), which picks tokens both from input paragraphs and vocabulary.
To copy a token from the input paragraphs, the decoder requires the token-wise hidden states U = (u_1, u_2, ..., u_{n_u}) of all n_u input tokens, obtained by concatenating the token-wise hidden states of all TTGs:

U = concat(U_1, U_2, ..., U_{n_t}).

At the k-th decoding step, the decoder computes an attention distribution a_k over tokens in the input paragraphs, where each element a_k^i can be viewed as the probability of the i-th token being selected:

a_k^i = softmax_i(v^T tanh(W_u u_i + W_s s_k + b_a)),

where s_k denotes the decoder hidden state, with s_0 = r_t to incorporate the topic-aware representation, and v, W_u, W_s, b_a are trainable parameters.

To generate a token from the vocabulary, we first use the attention distribution to calculate the weighted sum of encoder hidden states, known as the context vector c_k^*, which is further fed into a two-layer network to obtain the probability distribution over the vocabulary:

c_k^* = Σ_i a_k^i u_i,   P_vocab = softmax(V'(V [s_k; c_k^*] + b) + b').

To switch between these two mechanisms, p_gen is computed from the context vector c_k^*, the decoder hidden state s_k and the decoder input x_k:

p_gen = σ(W_c^T c_k^* + W_s^T s_k + W_x^T x_k + b_p),

where σ denotes the sigmoid function and W_c, W_s, W_x and b_p are trainable parameters. The final probability distribution over words is

P(w) = p_gen P_vocab(w) + (1 − p_gen) Σ_{i: w_i = w} a_k^i.

Training
The modules for topic detection and abstract generation are trained separately.

Topic Detector Training
Since there are no public benchmarks for assigning Wikipedia topics to input paragraphs, we construct a dataset from existing Wikipedia articles. In each domain, we collect all label-content pairs {(l, p)} (defined in Section 3) and split each content into paragraphs p = (d_1, d_2, ..., d_{n_p}) to form a set of label-paragraph pairs {(l, d)}. Afterwards, we choose all pairs (l, d) whose section label l belongs to some topic T ∈ T to complete the dataset construction, i.e., the topic-paragraph set {(T, d)}. Besides, a NOISE topic is set up in each domain for meaningless text like scripts and advertisements; the corresponding paragraphs are obtained by matching obvious noisy texts with regular expressions. The details are reported in Appendix A.
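A sketch of this construction (the noise patterns and label-to-topic mapping below are invented examples; the actual regular expressions and merges are given in Appendix A):

```python
import re

# Illustrative stand-ins for the real noise regular expressions.
NOISE_PATTERNS = [re.compile(r"function\s*\("), re.compile(r"click here", re.I)]

def build_topic_paragraph_set(label_paragraph_pairs, label_to_topic):
    """Turn (section label, paragraph) pairs into (topic, paragraph) training
    examples. Paragraphs matching a noise pattern get the NOISE topic;
    paragraphs whose label maps to no topic are discarded."""
    dataset = []
    for label, para in label_paragraph_pairs:
        if any(p.search(para) for p in NOISE_PATTERNS):
            dataset.append(("NOISE", para))
        elif label in label_to_topic:
            dataset.append((label_to_topic[label], para))
    return dataset
```

The resulting (topic, paragraph) pairs are exactly the supervision used to train the topic detector.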
Note that the dataset for abstract generation is collected from non-Wikipedia websites (refer to Section 5 for details). These two datasets are independent of each other, which prevents potential data leakage.
In the training step, we use the negative log-likelihood loss to optimize the topic detector.

Abstract Generator Training
The loss of the topic-aware abstract generation step consists of two parts: the average loss of the sentence decoder over abstract sentences, L_sent, and the cross-entropy loss of the stop confirmation, L_stop.

Following See et al. (2017), we compute the loss of an abstract sentence by averaging the negative log-likelihood of every target word in that sentence, and obtain L_sent by averaging over all m sentences:

L_sent = (1/m) Σ_{t=1}^{m} (1/n_{s_t}) Σ_{j=1}^{n_{s_t}} − log P(w_j^t),

where n_{s_t} is the length of the t-th abstract sentence and w_j^t is its j-th target word. As for L_stop, we adopt the cross-entropy loss, i.e.,

L_stop = − Σ_t [ y_s log p_stop^t + (1 − y_s) log(1 − p_stop^t) ],

where y_s = 1 when t > m and y_s = 0 otherwise.
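The two loss terms can be sketched numerically as follows (an illustrative reading; here time steps are 0-based, so the stop target y_s = 1 corresponds to t ≥ m):

```python
import math

def sentence_loss(target_word_probs):
    """Average negative log-likelihood of one sentence's target words."""
    return sum(-math.log(p) for p in target_word_probs) / len(target_word_probs)

def generator_loss(per_sentence_probs, stop_probs, m):
    """L = L_sent + L_stop: average sentence NLL plus the cross-entropy of
    the stop confirmation (target 0 before step m, 1 from step m on)."""
    l_sent = sum(sentence_loss(s) for s in per_sentence_probs) / len(per_sentence_probs)
    l_stop = -sum(
        math.log(p) if t >= m else math.log(1.0 - p)
        for t, p in enumerate(stop_probs)
    ) / len(stop_probs)
    return l_sent + l_stop
```

Perfectly confident word predictions drive L_sent to zero, while L_stop penalizes both stopping early and failing to stop.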

Experimental Settings
Dataset. To evaluate the overall performance of our model, we use the WikiCatSum dataset proposed by Perez-Beltrachini et al. (2019), which covers three distinct Wikipedia domains (Company, Film and Animal). Each domain is split into train (90%), validation (5%) and test (5%) sets. We build the dataset for training and evaluating the topic detector from the 2019-07-01 English Wikipedia full dump. For each record in the WikiCatSum dataset, we find the article with the same title in the Wikipedia dump and pick all section label-content pairs {(l, p)} in that article. We remove all hyperlinks and graphics from the contents, split the contents into paragraphs with the spaCy library, and follow the steps in Section 4.3.1 to complete the dataset construction. Finally, we conduct an 8:1:1 split for train, validation and test. Table 1 presents the detailed statistics of the datasets used.
Evaluation Metrics. We evaluate the performance of our model with ROUGE scores (Lin, 2004), a standard metric for comparing generated summaries against reference summaries. Since we do not constrain the length of generated abstracts, we report the ROUGE F1 score, which combines precision and recall and thus avoids favoring overly long or short results.
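For intuition, ROUGE-1 F1 reduces to unigram-overlap precision and recall; the simplified sketch below illustrates the computation (the reported scores use a standard ROUGE implementation with its full preprocessing):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Minimal ROUGE-1 F1: clipped unigram overlap between candidate and
    reference, combined into an F1 score."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())          # clipped matching unigram count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

The F1 form makes the metric length-neutral: a very short candidate gains precision but loses recall, and vice versa.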

Implementation Details.
We use the open-source PyTorch and transformers libraries to implement our model. All models are trained on an NVIDIA GeForce RTX 2080 GPU.
In topic detection, we choose the top 20 most frequent section labels in each domain and manually group them into topics (refer to Appendix A for details). For training, we use the pretrained albert-base-v2 model from the transformers library, keep its default parameters, and train the module for 4 epochs with a learning rate of 3e-5.
For abstract generation, we use a single-layer BiGRU network to encode the TTGs into hidden states of 512 dimensions. The first 400 tokens of the input paragraphs are retained and transformed into 300-dimensional GloVe (Pennington et al., 2014) embeddings. The vocabulary size is 50,000, and each out-of-vocabulary token is represented by the average embedding of its 10 adjacent tokens. This module is trained for 10 epochs with a learning rate of 1e-4 for the first epoch and 1e-5 for the rest.
Before evaluation, we remove sentences that have an overlap of over 50% with other sentences to reduce redundancy.
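A simple token-overlap reading of this redundancy rule can be sketched as follows (the exact overlap measure used in our pipeline may differ):

```python
def remove_redundant(sentences, threshold=0.5):
    """Keep sentences greedily in order; drop a sentence if more than
    `threshold` of its unique tokens already appear in some kept sentence."""
    kept = []
    for sent in sentences:
        tokens = set(sent.lower().split())
        overlaps = any(
            len(tokens & set(k.lower().split())) / max(len(tokens), 1) > threshold
            for k in kept
        )
        if not overlaps:
            kept.append(sent)
    return kept
```

This greedy pass removes near-duplicate sentences while preserving the original sentence order.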
Baselines. We compare our proposed TWAG with the following strong baselines:
• TF-S2S (Liu et al., 2018) uses a Transformer decoder and compresses the key-value pairs in self-attention with a convolutional layer.
• CV-S2D+T (Perez-Beltrachini et al., 2019) uses a convolutional encoder and a two-layer hierarchical decoder, and introduces LDA to model topical information.
• HierSumm (Liu and Lapata, 2019) utilizes the attention mechanism to model inter-paragraph relations and then enhances the document representation with graphs.
• BART (Lewis et al., 2020) is a pretrained sequence-to-sequence model that achieved success on various sequence prediction tasks.
We fine-tune the pretrained BART-base model on our dataset, and set the beam size to 5 for all models using beam search at test time. The parameters used for training and evaluation are identical to those in the corresponding papers.

Results and Analysis

Table 2 shows the ROUGE F1 scores of the different models. TWAG outperforms the other baselines in all three domains. Our model surpasses the others on ROUGE-1 by a margin of about 10%, while still retaining an advantage on ROUGE-2 and ROUGE-L. In domain Company, our model boosts the ROUGE-L F1 score by about 30%; since ROUGE-L is computed upon the longest common subsequence, the highest ROUGE-L score indicates that abstracts generated by TWAG have the highest holistic quality.

While CV-S2D+T and BART attain reasonable scores, TF-S2S and HierSumm do not reach the scores claimed in their papers. Since the WikiCatSum dataset is a subset of WikiSum, on which these two models were originally trained, we infer that TF-S2S and HierSumm require more training data to converge and suffer from under-fitting at this dataset scale. This phenomenon also suggests that TWAG is data-efficient.

Ablation Study
Learning Rate of Topic Detector. We tried two learning rates when training the topic detector module: a learning rate of 1e-7 yields a precision of 0.922 in evaluation, while a learning rate of 3e-5 yields a precision of 0.778. However, choosing the former causes a drop of about 10% in all ROUGE scores, which is why we use the latter in our full model.
We infer that human authors occasionally make mistakes, placing paragraphs under section labels that belong to other topics. A topic detector trained with the lower learning rate overfits these mistakes, harming the overall performance of our model.
Soft or Hard Topic. To further investigate the effectiveness of TWAG's soft topic mechanism, we compare the results of the soft and hard topic variants in Table 4, from which we can see that the hard topic variant performs quite poorly on this task. A possible reason is that some sentences in the standard abstract express more than one topic. Assigning a single topic to each sentence thus causes semantic loss and harms the quality of the generated abstract, while the soft topic mechanism better simulates the human writing style.

Example outputs for a Film-domain entity:

Gold Abstract: manjina there (kannada: unk unk) is a 1980 indian kannada film, directed by bangalore nagesh and produced by k. r. narayana murthy. the film stars srinath, manjula, thoogudeepa srinivas and dinesh in lead roles. the film had musical score by upendra kumar.

TF-S2S: chalo ishq larain is a pakistani film directed by sajjad gul and was released across pakistan in april 2002 starring meera, ali haider and zara sheikh.

CV-S2D+T: unk unk (kannada: unk unk) is a 1967 indian kannada film, directed by b. mallesh and produced by b. s. ranga. the film stars rajkumar, udaykumar, narasimharaju and k. s. ashwath in lead roles. the film had musical score by chellapilla satyam. the film was a remake of hindi film unk.

HierSumm: (kannada: unk unk) is a 1980 indian kannada film, directed by bangalore nagesh. the film stars srinath, manjulla, thoogudeepa shreenivaas. the film stars srinath, manjula, manjula, thoogudeepa shreenivaas in lead roles. the film had musical score by upendra kumar. the film is a remake of telugu movie aakashagiri rao. the movie was remade in telugu as manjina in 1986. the movie was remade in telugu as manjina there . . .

BART: manjina there is a 1980 kannada family drama film directed by bangalore nagesh starring srinath and manjula in the lead roles. it was released on 14 january 1980.

TWAG: manjina there is a 1980 kannada drama film directed by bangalore nagesh. the film stars srinath, vajramuni, manjula and thoogudeepa srinivas in lead roles. the film had musical score by upendra kumar and the film opened to positive reviews in 1980. the film was a remake of tamil film unk.
Number of Section Labels. The number of section labels n_t plays a key role in our model: a small n_t would not be informative enough to build topics, while a large one would introduce noise. As Figure 3 shows, the frequency of section labels is long-tailed, so retaining only a small portion of the labels captures the major part of the information.

(Figure 3: The frequency of section labels in the three domains. Ignoring section labels with extremely high or low frequency, the remaining labels' frequency and rank form a roughly straight line in log scale, matching Zipf's law for long-tail distributions.)

Table 5 records the experiments conducted on domain Company: n_t = 20 reaches a peak on ROUGE-1, ROUGE-2 and ROUGE-L scores, indicating that 20 is a reasonable number of section labels.

Case Study. From the example outputs above, we can see that the gold abstract contains information about three topics: basic information (region, director, and producer), actors, and music. Among the models, TF-S2S produces an abstract with a proper pattern but wrong information, and BART misses the music topic. CV-S2D+T, HierSumm, and our TWAG model all cover the three topics of the gold abstract; however, CV-S2D+T makes several factual errors, such as the release date and the actors, and HierSumm suffers from redundancy. TWAG covers all three topics and discovers extra facts, proving itself competent at generating comprehensive abstracts.

Human Evaluation
We follow the experimental setup of Perez-Beltrachini et al. (2019) and conduct a human evaluation consisting of two parts. A total of 45 examples (15 from each domain) are randomly selected from the test set for evaluation.
The first part is a question-answering (QA) scheme proposed by Clarke and Lapata (2010) to examine factoid information in summaries. We create 2-5 questions based on each gold summary, covering the topics that appear in it, and invite 3 participants to answer the questions using the automatically-generated summaries as background information. The more questions a summary can answer, the better it is. To quantify the results, we assign a score of 1 / 0.5 / 0.1 / 0 to a correct answer, a partially correct answer, a wrong answer, and an unanswerable question, respectively, and report the average score over all questions. Note that we give a score of 0.1 even when participants answer a question incorrectly, because a wrong answer still indicates that the summary covers a certain topic, which is superior to missing the information. Results in Table 6 show that 1) summaries generated by TWAG enable participants to answer more questions correctly, and 2) TF-S2S and HierSumm perform poorly in domains Film and Animal, possibly a consequence of under-fitting on small datasets.
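The QA scoring scheme above can be sketched as:

```python
def qa_score(answers):
    """Average QA score, with 1 / 0.5 / 0.1 / 0 for a correct answer, a
    partially correct answer, a wrong answer, and an unanswerable question."""
    weights = {"correct": 1.0, "partial": 0.5, "wrong": 0.1, "missing": 0.0}
    return sum(weights[a] for a in answers) / len(answers)
```

The 0.1 floor for wrong answers rewards a summary for at least covering the relevant topic.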
The second part is an evaluation of linguistic quality. We ask the participants to read the generated summaries and score them from 3 perspectives on a scale of 1-5 (larger scores indicate higher quality): Completeness (does the summary contain sufficient information?), Fluency (is the summary fluent and grammatical?), and Succinctness (does the summary avoid redundant sentences?). Specifically, 3 participants are assigned to evaluate each model, and the average scores are taken as the final results. Table 7 presents the comparison, from which we can see that TWAG outperforms the other baseline models in linguistic quality, validating its effectiveness.

Conclusion
In this paper, we propose TWAG, a novel topic-guided abstractive summarization model for generating Wikipedia abstracts. It exploits the section labels of Wikipedia articles, dividing the input documents into different topics to improve the quality of the generated abstract. This approach simulates the way humans recognize entities, and experimental results show that our model clearly outperforms existing state-of-the-art models that view Wikipedia abstracts as plain text, while also demonstrating high data efficiency. In the future, we will try to incorporate pretrained language models into the topic-aware abstract generator, and apply the topic-aware model to other texts rich in topical information, such as sports match reports.

Ethical Considerations
TWAG could be applied to applications like automatically writing new Wikipedia abstracts or other texts rich in topical information. It can also help human writers to examine whether they have missed information about certain important topics.
The benefits of using our model include saving human writers' labor and making abstracts more comprehensive. There are also important considerations when using our model. Input texts may violate copyrights when inadequately collected, and misleading source texts may lead to factual mistakes in the generated abstracts. To mitigate these risks, research on how to avoid copyright issues when collecting documents from the Internet would help.
We infer that BART-large may overfit on the training data, and that BART-base is a more competent baseline. Table 9 shows an example of a gold summary, its corresponding question set, and the system outputs. The full dataset used for human evaluation can be found in our code repository.