AREDSUM: Adaptive Redundancy-Aware Iterative Sentence Ranking for Extractive Document Summarization

Redundancy-aware extractive summarization systems score the redundancy of the sentences to be included in a summary either jointly with their salience information or separately as an additional sentence scoring step. Previous work shows the efficacy of jointly scoring and selecting sentences with neural sequence generation models. It is, however, not well-understood if the gain is due to better encoding techniques or better redundancy reduction approaches. Similarly, the contribution of salience versus diversity components on the created summary is not studied well. Building on the state-of-the-art encoding methods for summarization, we present two adaptive learning models: AREDSUM-SEQ that jointly considers salience and novelty during sentence selection; and a two-step AREDSUM-CTX that scores salience first, then learns to balance salience and redundancy, enabling the measurement of the impact of each aspect. Empirical results on CNN/DailyMail and NYT50 datasets show that by modeling diversity explicitly in a separate step, AREDSUM-CTX achieves significantly better performance than AREDSUM-SEQ as well as state-of-the-art extractive summarization baselines.


Introduction
Extractive summarization is the task of creating a summary by identifying and concatenating the most important sentences in a document (Liu and Lapata, 2019; Zhang et al., 2019; Zhou et al., 2018). Given a partial summary, the decision to include another sentence in the summary depends on two aspects: salience, which represents how much information the sentence carries; and redundancy, which represents how much information in the sentence is already included in the previously selected sentences.
Although redundancy was studied in early work on summarization, most recent research on extractive summarization focuses on salience alone, usually modeling sentence salience as a sequence labeling task (Kedzie et al., 2018; Cheng and Lapata, 2016) or a classification task (Zhang et al., 2019) without any redundancy removal. Previous methods that consider redundancy usually handle it in a separate step after salience scoring, denoted sentence selection (Carbonell and Goldstein, 1998; McDonald, 2007; Lin and Bilmes, 2011). Sentence selection often follows a greedy iterative ranking process that outputs one sentence at a time, taking into account the redundancy of candidate sentences with respect to previously selected sentences.
Several approaches for modeling redundancy in sentence selection have been explored: heuristic-based methods such as Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) and Trigram Blocking (TRIBLK) (Liu and Lapata, 2019), and model-based approaches (Ren et al., 2016). Heuristic-based methods are not adaptive, since they apply the same rule to all documents. Model-based approaches depend heavily on feature engineering and learn to score sentences via regression with a pointwise loss, which has been shown to be inferior to pairwise and listwise losses in ranking problems (Liu et al., 2009).
Redundancy has also been handled jointly with salience during the scoring process using neural sequence models (Zhou et al., 2018). NEUSUM (Zhou et al., 2018) scores sentences considering their salience as well as previous sentences in the output sequence and learns to predict the sentence with maximum relative gain given the partial output summary. Despite the improved efficacy, it is not well-understood if the gain is due to better encoding or better redundancy-aware iterative ranking approaches (i.e., the sequence generation).
In this work, we study different types of redundancy-aware iterative ranking techniques for extractive summarization that handle redundancy either separately from or jointly with salience. Extending BERTSUMEXT (Liu and Lapata, 2019), a state-of-the-art extractive summarization model that uses the heuristic-based Trigram Blocking (TRIBLK) for redundancy elimination, we propose two supervised redundancy-aware iterative sentence ranking methods for summary prediction. Our first model, AREDSUM-SEQ, introduces a transformer-based conditional sentence order generator network to score and select sentences by jointly considering their salience and their diversity with respect to the selected summary sentences. Our second model, AREDSUM-CTX, uses an additional sentence selection model that learns to balance the salience and redundancy of constructed summaries; it incorporates surface features (such as n-gram overlap ratios and semantic match scores) to capture the diversity aspect. We compare our redundancy-aware sentence ranking methods with trigram blocking (Liu and Lapata, 2019), as well as with summarization baselines that do or do not consider redundancy, on two commonly used datasets, CNN/DailyMail and New York Times (NYT50). Experimental results show that AREDSUM-CTX reduces redundancy effectively and outperforms all baselines on both datasets. The model's advantage can be attributed to its adaptiveness to scenarios in which redundancy removal has different potential gains.
In summary, our contributions are: 1) we propose two redundancy-aware iterative ranking methods for extractive summarization extending BERTSUMEXT; 2) we conduct comparative studies between our redundancy-aware models as well as the heuristic-based method that BERTSUMEXT uses; 3) our proposed AREDSUM-CTX significantly outperforms BERTSUMEXT and other competitive baselines on CNN/DailyMail and NYT50.

Related Work
Extractive summarization methods are usually decomposed into two subtasks, i.e., sentence scoring and sentence selection, which deal with salience and redundancy, respectively.
Salience Scoring. Graph-based models are widely used to score sentence salience in summarization (Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Wan and Yang, 2006), with extensions such as clustering (Wan and Yang, 2008) and graph neural networks (Wang et al., 2020). Classical supervised extractive summarization uses classification or sequence labeling methods such as Naive Bayes (Kupiec et al., 1999), maximum entropy (Osborne, 2002), conditional random fields (Galley, 2006) and hidden Markov models (Conroy et al., 2004). These methods rely heavily on human-engineered features such as word frequency and sentence length (Nenkova et al., 2006).
In recent years, neural models have replaced these older approaches for scoring sentence salience: hierarchical LSTMs and CNNs replace manually engineered features, and LSTM decoders are employed for sequence labeling (Cheng and Lapata, 2016; Nallapati et al., 2017; Kedzie et al., 2018). These architectures are widely used and have also been extended with reinforcement learning (Narayan et al., 2018; Dong et al., 2018). More recently, summarization methods based on BERT (Devlin et al., 2018) have achieved state-of-the-art salience performance for extractive summarization (Liu and Lapata, 2019; Zhang et al., 2019; Zhong et al., 2019; Zhou et al., 2020).
Sentence Selection. Relatively fewer methods study sentence selection to avoid redundancy. Integer Linear Programming based methods (McDonald, 2007) formulate sentence selection as an optimization problem under a summary length constraint. Lin and Bilmes (2011) propose to find the optimal subset of sentences with submodular functions. Greedy strategies such as Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) iteratively select the sentence that has the maximal salience score and is minimally redundant. Trigram blocking (Liu and Lapata, 2019) follows the intuition of MMR and filters out sentences that have trigram overlap with previously extracted sentences. Ren et al. (2016) leverage two groups of handcrafted features to capture informativeness and redundancy, respectively, during sentence selection. In contrast to learning a separate model for sentence selection, Zhou et al. (2018) propose to jointly learn to score and select sentences with a sequence generation model. However, their model is not compared with other redundancy-aware techniques, so it is unclear whether its improvement comes from the sequence generation method or the encoding technique.
In this paper, we compare the efficacy of different sentence selection techniques grounded in the same BERT-based encoder. We propose two models that conduct redundancy removal either with a separate model or jointly with salience scoring, and compare them with a heuristic-based method. To the best of our knowledge, our work is the first to conduct comparative studies on different types of redundancy-aware extractive summarization methods.

Iterative Sentence Ranking
We formulate single-document extractive summarization as a task of iterative sentence ranking. Given a document D = {s_1, s_2, · · · , s_L} of L sentences, the goal is to extract a subset of sentences from D that best summarizes it. With a limit of l selected sentences, the extraction process can be considered an l-step iterative ranking problem: the extracted summary after k steps is Ŝ_k = {ŝ_j | 1 ≤ j ≤ k, ŝ_j ∈ D}. At each k-th step (1 ≤ k ≤ l), given the current summary Ŝ_{k−1}, a new sentence ŝ_k is selected from the remaining sentences D \ Ŝ_{k−1} and added to the summary. A function M(Ŝ_k; S*) measures the similarity between the extracted summary Ŝ_k and the ground-truth summary S*. The objective is to learn a scoring function f(·) such that the sentence ŝ_k selected according to f(·) maximizes the gain of the output summary: ŝ_k needs to be both salient in the document and novel with respect to the current context Ŝ_{k−1}. Note that Ŝ_0 = ∅ at the beginning. Since the ground-truth summaries S* of existing summarization corpora are usually abstractive summaries written by experts, previous studies on extractive summarization typically extract a set of pseudo ground-truth sentences Ŝ* from D based on their similarities to S* for training; sentences in Ŝ* are labeled 1 and the other sentences in D are labeled 0. In this case, M(Ŝ_l; Ŝ*) is used to guide training instead of M(Ŝ_l; S*).

Recent extractive summarization models build on strong neural encoders (Liu and Lapata, 2019). Among them, BERTSUMEXT (Liu and Lapata, 2019) is a state-of-the-art model with trigram blocking (TRIBLK) that reduces redundancy by filtering out sentences that have trigram overlap with previously selected ones at each time step. As we empirically show later in § 6, such a heuristic can be effective on some datasets yet harmful on others, since it applies the same rule to all documents.
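The l-step iterative ranking process above can be sketched as a greedy loop over candidate sentences. This is an illustrative sketch, not the paper's implementation: `score_fn` is a hypothetical stand-in for the learned scoring function f(·), which conditions on the partial summary.

```python
def iterative_rank(sentences, score_fn, l):
    """Greedy l-step iterative sentence ranking: at each step, pick the
    remaining sentence that maximizes score_fn given the partial summary."""
    summary = []                      # S_hat, the partial summary (starts empty)
    remaining = list(sentences)
    for _ in range(min(l, len(sentences))):
        # score every remaining candidate conditioned on the current summary
        best = max(remaining, key=lambda s: score_fn(s, summary))
        summary.append(best)
        remaining.remove(best)
    return summary
```

Any redundancy-aware scorer fits this interface; for example, an MMR-like `score_fn` would return salience minus a penalty for overlap with the sentences already in `summary`.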
In contrast, we present an adaptive learning approach for redundancy-aware extractive summarization, AREDSUM, and introduce two methods, AREDSUM-SEQ and AREDSUM-CTX, which extend BERTSUMEXT by considering redundancy either jointly with salience during sentence scoring or separately with an additional selection model.

Document Encoder
First, we introduce the sentence and document encoders shared by both AREDSUM variants, shown in Figure 1. For sentence-level encoding, a [SEP] token is appended to each sentence to mark sentence boundaries, and a [CLS] token is inserted before each sentence to aggregate that sentence's information. In addition to token and positional embeddings, as in BERTSUMEXT (Liu and Lapata, 2019), we use interval segment embeddings E_A and E_B to distinguish sentences at odd and even positions in the document. After multiple transformer encoder layers, we represent each sentence s_i by the output representation of the [CLS] symbol preceding s_i. These symbols capture the features of the tokens that follow them in the sentence while attending over all other tokens in the document through the transformer layers. We then conduct document-level encoding on the sentence-level representations from the [CLS] tokens, denoted E_s_i, together with their positional embeddings E_i, using another stack of transformer layers. We prepend a document embedding E_D to the sequence of sentence embeddings to represent the whole document. The final representations of the document D and of each sentence s_i are obtained from the output of these transformer layers, denoted h_D and h_s_i respectively.
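The input construction described above can be sketched as follows. This is a simplified illustration using whitespace tokens rather than WordPiece ids; the function name and interface are my own, not from the BERTSUM codebase.

```python
def build_bertsum_input(sentences):
    """Build the flat token sequence, interval segment ids, and [CLS]
    positions for a BERTSUMEXT-style encoder: [CLS] before each sentence,
    [SEP] after it, and segment id A (0) / B (1) alternating by parity."""
    tokens, segments, cls_positions = [], [], []
    for i, sent in enumerate(sentences):
        seg = 0 if i % 2 == 0 else 1          # interval segment embedding E_A / E_B
        cls_positions.append(len(tokens))     # where this sentence's [CLS] lands
        piece = ["[CLS]"] + sent.split() + ["[SEP]"]
        tokens.extend(piece)
        segments.extend([seg] * len(piece))
    return tokens, segments, cls_positions
```

The output representations at `cls_positions` are the sentence vectors passed on to the document-level transformer layers.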

AREDSUM-SEQ: Sequence Generation
Our first model, AREDSUM-SEQ, strictly considers the order of the target selected sentences while jointly modeling the redundancy and salience of the next sentence. It uses a transformer decoder module (Vaswani et al., 2017) to learn to select and order a sequence of sentences from the document as a summary. Our model differs from standard auto-regressive decoders, whose blocks take a sequence of tokens (word units) as input and generate the next token from a pre-defined vocabulary. Instead, our decoder is a conditional model that takes a sequence of sentence representations as input and selects, from the rest of the document's sentences, the sentence with the maximum gain to be included in the summary.
Following a standard transformer encoder-decoder architecture (Vaswani et al., 2017), at each decoding step k, a current hidden state is obtained with a stack of transformer decoder layers:

h_ŝ_{k−1} = TransformerDecoder([E_ŝ_1, · · · , E_ŝ_{k−1}]; [h_s_1, · · · , h_s_L]),

where [h_s_1, h_s_2, · · · , h_s_L] are the output sentence representations after the document-level encoding in Figure 1 and [E_ŝ_1, · · · , E_ŝ_{k−1}] are the embeddings of the sentences selected so far. Ŝ_{k−1} = ∅ and E_ŝ_0 = 0 when k = 1. Note that the sentence embeddings fed to the document-level transformer encoders, i.e., E_s_1, · · · , E_s_L, are used to represent the sentences in the target decoding space.

A two-layer MLP then scores a candidate sentence s_i given the hidden state h_ŝ_{k−1}:

o_l(s_i) = W_2s tanh(W_1s [h_ŝ_{k−1}; h_s_i]),

where W_2s and W_1s are the weights of the MLP (bias parameters omitted for simplicity), and [;] denotes vector concatenation. In case the salience of s_i is not sufficiently captured in o_l(s_i), we compute a matching score o_g(s_i) between s_i and the global context D, regardless of which sentences were selected previously:

o_g(s_i) = h_D W_ds h_s_i, (4)

where W_ds is the matrix for bilinear matching; h_D and h_s_i are the embeddings of the document D and sentence s_i output by the document-level encoder. The final score is a linear combination of o_l and o_g using the weight W_o:

o(s_i) = W_o [o_l(s_i); o_g(s_i)].

The probability of any sentence s_i being selected at the k-th step is the softmax of o(s_i) over the remaining candidate sentences s_j in D:

P(s_i) = exp(o(s_i)) / Σ_{s_j ∈ D \ Ŝ_{k−1}} exp(o(s_j)). (6)

Following NEUSUM (Zhou et al., 2018), we train AREDSUM-SEQ to optimize the relative ROUGE-F1 gain of each sentence with respect to the sentences selected so far, Ŝ_{k−1}:

g(s_i) = M({s_i} ∪ Ŝ_{k−1}; S*) − M(Ŝ_{k−1}; S*),
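The combination of a context-dependent score and a global salience score, followed by a softmax over the not-yet-selected sentences, can be sketched numerically. This is a toy illustration with scalar scores and hand-set weights; in the model these quantities come from the MLP and the bilinear matching.

```python
import math

def select_next(local_scores, global_scores, w_o, selected):
    """Combine a local (context-aware) score and a global salience score per
    candidate, then softmax over the sentences not yet selected.
    local_scores/global_scores: dicts from sentence index to scalar score."""
    combined = {i: w_o[0] * local_scores[i] + w_o[1] * global_scores[i]
                for i in local_scores if i not in selected}
    z = sum(math.exp(v) for v in combined.values())
    probs = {i: math.exp(v) / z for i, v in combined.items()}
    return max(probs, key=probs.get), probs
```

Note that already-selected sentences are excluded before the softmax, so probability mass is distributed only over valid next choices.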
where M({s_i} ∪ Ŝ_{k−1}; S*) and M(Ŝ_{k−1}; S*) measure the ROUGE-F1 between the gold summary S* and the selected sentences Ŝ_{k−1} with and without the candidate s_i, respectively. We rescale the gain g(s_i) to [0, 1] with a min-max normalization, in case of negative values, and obtain g̃. Then we use a softmax function with a temperature τ on the rescaled gain to produce a target distribution:

Q(s_i) = exp(τ g̃(s_i)) / Σ_{s_j} exp(τ g̃(s_j)). (8)

The final training objective is to minimize the KL divergence between the probability distribution of sentence scores (Eq. 6) and the distribution of their relative ROUGE gains (Eq. 8), i.e., KL(P(·) || Q(·)). This objective can be considered a listwise ranking loss (Ai et al., 2018) that maximizes the probability of the target sentence while pushing down the probabilities of the other sentences. In this way, AREDSUM-SEQ combines sentence scoring and selection in the same decoder framework, and redundancy is implicitly captured by optimizing the ROUGE gain.
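The construction of the training target from raw ROUGE gains can be sketched as follows. This assumes one common parameterization of a temperature-scaled softmax (multiplying the rescaled gain by τ, consistent with the large τ values searched later); the function name is my own.

```python
import math

def target_distribution(gains, tau):
    """Min-max rescale per-sentence ROUGE gains to [0, 1], then apply a
    temperature-scaled softmax to obtain the training target Q (Eq. 8)."""
    lo, hi = min(gains), max(gains)
    span = (hi - lo) or 1.0               # avoid division by zero if all equal
    g = [(x - lo) / span for x in gains]  # rescaled gains g_tilde in [0, 1]
    z = sum(math.exp(tau * x) for x in g)
    return [math.exp(tau * x) / z for x in g]
```

A large τ sharpens Q toward the sentence with the maximal gain, so the KL objective behaves like a soft version of picking the single best next sentence.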

AREDSUM-CTX: Context-aware Sentence Ranker
We introduce a second model, AREDSUM-CTX, a context-aware ranker that scores salience first and then selects a sentence according to both its salience and its redundancy with respect to the previously extracted sentences as context, as shown in Figure 1. AREDSUM-CTX learns to construct a summary with a two-step process: in the salience ranking step, it focuses on learning the salience of the sentences; in the sentence selection step, it represents redundancy explicitly via surface features and uses a ranker that promotes or demotes sentences based on the joint degree of their salience and redundancy.
Salience Ranking. Assuming that sentence salience is independent of the previously selected sentences, we design the salience ranking of AREDSUM-CTX as a single-step process rather than an iterative one. We measure the probability of a sentence being included in Ŝ* with a scoring function F_sal based on a bilinear matching between h_D and h_s_i, the transformer outputs after the document-level encoding, in the same form as Eq. 4:

F_sal(s_i) = σ(h_D W_sal h_s_i), (9)

where σ is the sigmoid function and W_sal denotes the bilinear matching matrix of this step.
The learning objective is to maximize the log likelihood of the summary sentences in the training data:

L_sal = Σ_{s_i ∈ Ŝ*} log F_sal(s_i) + Σ_{s_i ∈ D \ Ŝ*} log(1 − F_sal(s_i)).

Redundancy Features. In the selection step, we represent redundancy explicitly to let the model focus on learning how to balance salience and redundancy. We extract ngram-matching and semantic-matching features at each k-th step to indicate the redundancy of a candidate sentence s_i given the selected sentences Ŝ_{k−1}. The ngram-matching feature f_n-gram is computed as:

f_n-gram(s_i) = |n-gram(s_i) ∩ n-gram(Ŝ_{k−1})| / |n-gram(s_i)|,

where n-gram(x) is the set of n contiguous words in x. We collect f_n-gram for n = {1, 2, 3}. We also compute the semantic-matching feature f_sem as the maximum cosine similarity between h_s_i and the embeddings of the selected sentences:

f_sem(s_i) = max_{ŝ_j ∈ Ŝ_{k−1}} cos(h_s_i, h_ŝ_j).

Since most cosine values between output embeddings from the transformer layers fall in a small range close to 1, we apply a min-max normalization on f_sem to enlarge the value differences and obtain an updated feature f̃_sem.
The impact of the redundancy features on the final scores is not linear: sentences with high redundancy values should be penalized more. To capture the effect of the redundancy features at different value ranges, we equally divide the range [0, 1] into m bins and discretize each feature into the corresponding bin according to its value, as shown in Figure 1. In this way, we convert each feature into a one-hot vector of length m and concatenate them to obtain an overall redundancy feature vector F_red(s_i) = [f̄_1-gram; f̄_2-gram; f̄_3-gram; f̄_sem], where f̄ denotes the one-hot vector after binning f.
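The n-gram overlap feature and the binning step can be sketched concretely. This illustrates one plausible definition of the overlap ratio (candidate n-grams already covered by the selected sentences); the function names are mine, not the paper's.

```python
def ngram_overlap(candidate, selected, n):
    """Fraction of the candidate's n-grams that already appear in the
    selected sentences (a sketch of the f_ngram redundancy feature)."""
    grams = lambda ws: {tuple(ws[i:i + n]) for i in range(len(ws) - n + 1)}
    cand = grams(candidate.split())
    if not cand:
        return 0.0
    seen = set()
    for s in selected:
        seen |= grams(s.split())
    return len(cand & seen) / len(cand)

def bin_one_hot(value, m):
    """Discretize a feature in [0, 1] into m equal bins as a one-hot vector."""
    idx = min(int(value * m), m - 1)      # clamp value == 1.0 into the last bin
    return [1 if i == idx else 0 for i in range(m)]
```

Binning lets the downstream ranker learn a separate weight per value range, so heavily redundant sentences can be penalized non-linearly.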
Ranker for Sentence Selection. In the sentence selection step, AREDSUM-CTX only needs to learn how to score a sentence based on its redundancy features F_red(s_i) and its salience score F_sal(s_i) from Eq. 9. Note that the first selected sentence is simply the one with the largest salience score. We use a three-dimensional tensor W_F for a bilinear matching between the redundancy features and the salience score, which yields a matching vector of dimension d, and apply a single-layer MLP on top to output the final score:

f(s_i) = MLP(F_red(s_i) W_F F_sal(s_i)). (13)

During training, we randomly select 1, 2, · · · , l−1 sentences from the extracted ground-truth set Ŝ* as the context and let the model learn to find the next sentence that is both salient and novel, where l is the maximum number of sentences in the predicted summary. The training objective is the same as in § 4.2, except that o(s_i) in Eq. 6 is replaced with f(s_i) in Eq. 13. In contrast to AREDSUM-SEQ, where the target output is an ordered sequence, the loss of AREDSUM-CTX is not order-sensitive, since the goal is always to predict the next best sentence given an unordered set of selected sentences as context.
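The random-context sampling used to train the selection ranker can be sketched as follows. This is an illustrative interpretation of the sampling described above; the function name and return format are assumptions.

```python
import random

def sample_selection_example(oracle, l, rng=random):
    """Build one training instance for the selection ranker: a random subset
    of the oracle sentences serves as the context, and the remaining oracle
    sentences are the positive next-sentence candidates."""
    k = rng.randint(1, min(l - 1, len(oracle) - 1))   # context size in 1..l-1
    context = rng.sample(oracle, k)
    candidates = [s for s in oracle if s not in context]
    return context, candidates
```

Because the context is a sampled unordered subset, the ranker sees many different partial summaries of the same document, which is what makes the resulting loss order-insensitive.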
Datasets

CNN/DailyMail contains news articles, each associated with a few bullet points as the article's highlights. We use the standard splits of Hermann et al. (2015) and truncate articles to 512 tokens. To collect sentence labels for extractive summarization, we use a greedy strategy similar to Nallapati et al. (2017) and Zhang et al. (2019): the subset of sentences that maximizes the ROUGE score against the human-written summary is labeled 1 (sentences to be included in the summary), and the remaining sentences are labeled 0.
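The greedy labeling strategy above can be sketched as follows. As a stand-in for ROUGE (which would require an external package), this sketch uses a unigram-F1 proxy; the stopping rule and greedy structure mirror the described strategy, but the scoring function is a simplification.

```python
def greedy_oracle(doc_sentences, reference, max_sents=3):
    """Greedily label sentences that maximize a unigram-F1 proxy for ROUGE
    against the reference summary; stop when no sentence adds positive gain."""
    def f1(selected):
        hyp = {w for s in selected for w in s.split()}
        ref = set(reference.split())
        overlap = len(hyp & ref)
        if not hyp or not overlap:
            return 0.0
        p, r = overlap / len(hyp), overlap / len(ref)
        return 2 * p * r / (p + r)
    selected, labels = [], [0] * len(doc_sentences)
    while len(selected) < max_sents:
        gains = [(f1(selected + [s]) - f1(selected), i, s)
                 for i, s in enumerate(doc_sentences) if labels[i] == 0]
        if not gains:
            break
        best_gain, i, s = max(gains)
        if best_gain <= 0:
            break                       # no remaining sentence improves the score
        selected.append(s)
        labels[i] = 1
    return labels
```

Stopping at zero gain is what keeps the pseudo ground-truth set small: adding a redundant sentence dilutes precision without adding recall.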
NYT50 is an annotated corpus of New York Times articles. Following Paulus et al. (2017) and Durrett et al. (2016), we discard marks and words such as "(s)" and "photo" at the end of each abstract and filter out articles whose summaries are shorter than 50 words. We sort the articles chronologically and split the data into training/validation/test sets with a ratio of 0.8/0.1/0.1, yielding 133,602/16,700/16,700 documents, respectively. We follow the same preprocessing and extractive label collection steps as for CNN/DailyMail.

Implementation Details
Our implementation is based on PyTorch and BERTSUM (Liu and Lapata, 2019). We use the "bert-base-uncased" version of BERT for sentence-level encoding and fine-tune our models with the objective functions in § 4. We set the number of document-level transformer layers to 2 and the dropout rate in all layers to 0.1. We search for the best value of τ in Eq. 8 over {10, 20, 40, 60}. We train our models using the Adam optimizer (β_1 = 0.9, β_2 = 0.999) for 2 epochs and schedule the learning rate according to Vaswani et al. (2017) with initial value 2e-3 and 10,000 warm-up steps.
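The Vaswani-style schedule with these settings can be written in one line. This follows the form used in BERTSUM (linear warm-up then inverse square-root decay); the function name is mine.

```python
def learning_rate(step, base_lr=2e-3, warmup=10_000):
    """Vaswani et al. (2017)-style schedule: linear warm-up for `warmup`
    steps, then inverse square-root decay, scaled by base_lr."""
    step = max(step, 1)                  # avoid step == 0
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)
```

The peak rate is reached exactly at the warm-up boundary (here 2e-3 · 10,000^-0.5 = 2e-5) and decays slowly afterwards.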
We use teacher forcing to train AREDSUM-SEQ. To learn the k-th sentence in the target sequence, we replace each of the first k − 1 input sentences with another random sentence from the document with probability 0.2. We hypothesize that if the previously selected sentences are not always the gold sentences, the model becomes more robust during training. We use two transformer layers in the decoder.
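The prefix-corruption trick above can be sketched as follows. This is an illustrative sketch of the sampling scheme, not the actual training code; the function name is an assumption.

```python
import random

def corrupt_prefix(gold_prefix, doc_sentences, p=0.2, rng=random):
    """With probability p, replace each already-selected gold sentence with a
    random other sentence from the document before feeding the prefix back to
    the decoder (a robustness trick for teacher forcing)."""
    out = []
    for sent in gold_prefix:
        if rng.random() < p:
            pool = [s for s in doc_sentences if s != sent]
            out.append(rng.choice(pool) if pool else sent)
        else:
            out.append(sent)
    return out
```

Exposing the decoder to imperfect prefixes at training time reduces the mismatch with inference, where earlier selections may themselves be wrong.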
For AREDSUM-CTX, we train the salience ranker with the same settings as Liu and Lapata (2019) and fix all its parameters when training the ranker for selection. This ensures that the salience score of each sentence stays the same during sentence selection. We select the optimal number of bins m for the discretized redundancy features by sweeping {10, 20, 30}, and the output dimension d of W_F in Eq. 13 over {5, 10, 20, 30}.

Baselines
We compare our models to the state-of-the-art extractive summarization model BERTSUMEXT (Liu and Lapata, 2019), which uses Trigram Blocking (TRIBLK) (Paulus et al., 2017) to filter out sentences that have trigram overlap with previously extracted sentences. We report the performance of BERTSUMEXT with and without TRIBLK separately to show the impact of this heuristic.
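The TRIBLK heuristic is simple enough to sketch in a few lines. This is a simplified illustration on whitespace tokens; the actual BERTSUM implementation operates on its own tokenization.

```python
def trigram_blocking(ranked_sentences, max_sents=3):
    """Select sentences in salience order, skipping any candidate that shares
    a trigram with the already selected ones (the TRIBLK heuristic)."""
    def trigrams(sent):
        ws = sent.split()
        return {tuple(ws[i:i + 3]) for i in range(len(ws) - 2)}
    selected, seen = [], set()
    for sent in ranked_sentences:
        tg = trigrams(sent)
        if tg & seen:
            continue                     # blocked: overlapping trigram
        selected.append(sent)
        seen |= tg
        if len(selected) == max_sents:
            break
    return selected
```

Because the same hard rule is applied to every document, TRIBLK cannot adapt to documents where some trigram repetition is harmless, which is exactly the limitation the learned AREDSUM rankers target.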
We also compare against other baselines, including LEAD3, NN-SE (Cheng and Lapata, 2016), SUMMARUNNER (Nallapati et al., 2017), SEQ2SEQ (Kedzie et al., 2018), NEUSUM (Zhou et al., 2018), and HIBERT (Zhang et al., 2019), along with ORACLE as an upper bound on performance. LEAD3 is a commonly used, effective baseline that extracts the first 3 sentences of the document. NN-SE (Cheng and Lapata, 2016) and SUMMARUNNER (Nallapati et al., 2017) both formulate extractive summarization as a sequence labeling task. NN-SE uses a unidirectional GRU for both encoding and decoding. SUMMARUNNER encodes sentences with a BiGRU and considers salience, redundancy, and the absolute and relative positions of sentences during scoring. SEQ2SEQ (Kedzie et al., 2018) conducts binary classification by encoding the sentences with a bidirectional GRU (BiGRU) and using a separate decoder BiGRU to transform each sentence into a query vector that attends to the encoder output. NEUSUM (Zhou et al., 2018) learns to jointly score and select sentences with a sequence-to-sequence model that optimizes the marginal ROUGE gain and reduces redundancy implicitly. HIBERT (Zhang et al., 2019) pre-trains a hierarchical BERT for extractive summarization without dealing with redundancy. Among these methods, NEUSUM implicitly reduces redundancy by jointly scoring and selecting sentences with a sequence generation model, and SUMMARUNNER considers redundancy during sequence labeling; the other baselines do not conduct redundancy removal.
6 Results and Discussion

Automatic Evaluation Results
Following earlier work (Zhou et al., 2018; Liu and Lapata, 2019), we select 3 sentences as the summary for each system for a fair comparison. We evaluate the extracted summaries with full-length ROUGE-F1 (Lin, 2004), using the official Perl script for both CNN/DailyMail and NYT50, and report ROUGE-1, ROUGE-2 and ROUGE-L, which measure unigram overlap, bigram overlap and the longest common subsequence against human-edited summaries. The results of NEUSUM and HIBERT are taken from their original papers; we obtained the remaining results by re-running the models. Since previous work (Liu and Lapata, 2019; Zhang et al., 2019; Paulus et al., 2017) has no consistent way of preprocessing the NYT dataset for extractive summarization, we only report evaluation results from the models we re-trained on this dataset in Table 2.

CNN/DailyMail. The results in Table 1 are all comparable, as we use the same non-anonymized version of CNN/DailyMail. For BERTSUMEXT-based methods, we observe that redundancy removal helps improve the ROUGE score compared to plain BERTSUMEXT, with TRIBLK performing considerably better. AREDSUM-CTX improves over TRIBLK where their selections differ; in the other cases, the two agree with each other. This shows that by adaptively balancing salience and diversity, AREDSUM-CTX is superior to TRIBLK when redundancy removal is promising. We also find that the sequence generation models, i.e., NEUSUM and our AREDSUM-SEQ, have no clear advantage over other models regardless of their encoder network structure (i.e., BERT or other neural architectures). For instance, SUMMARUNNER and SEQ2SEQ have the best performance among the methods not based on BERT. Our AREDSUM-SEQ performs similarly to BERTSUMEXT, and it is inferior to AREDSUM-CTX due to its order-sensitive optimization objective.
While AREDSUM-CTX learns to optimize towards all possible orderings of the ground-truth sentence set, AREDSUM-SEQ is optimized towards only one sequence of them; another ordering of the same set is penalized even though it has the same ROUGE score. Its significantly worse P@1 (shown in Section 6.2) also confirms this point.

NYT50. In contrast to CNN/DailyMail, we observe that TRIBLK harms the performance of BERTSUMEXT on NYT50. In fact, as shown in Table 2, applying TRIBLK to ORACLE also reduces ROUGE-1, 2 and L by 1.91, 1.59 and 1.92 absolute points respectively, much larger drops than on CNN/DailyMail (0.94, 0.74 and 0.98). This indicates that TRIBLK filters out more sentences with high ROUGE gain on NYT50 than on CNN/DailyMail, causing a larger ROUGE drop. It also shows that sentences in oracle summaries have more trigram overlap on NYT50 than on CNN/DailyMail, which implies that redundancy removal on NYT50 may have limited gains and that a simple unified rule (TRIBLK) applied to all documents can harm performance.
We also observe that AREDSUM-SEQ performs better than BERTSUMEXT+TRIBLK but worse than BERTSUMEXT. In contrast, AREDSUM-CTX achieves higher performance than both BERTSUMEXT+TRIBLK and AREDSUM-SEQ by representing redundancy explicitly and controlling its effect dynamically. Since redundancy removal has limited potential gain on NYT50, the predictions of AREDSUM-CTX differ from those of BERTSUMEXT on only 10.1% of the test set. However, these differences still lead to significant overall improvements.
Note that the gain of AREDSUM-CTX comes only from redundancy removal, which takes effect from the second selection step onward. The improvements are larger when redundancy removal has higher potential (e.g., CNN/DailyMail) and smaller on datasets with lower potential (e.g., NYT50). In either case, unlike the other methods, it does not harm performance, which shows that it is adaptive and robust.

Model Analysis
Precision at Each Step. Since the content of the sentences with positive labels, i.e., Ŝ*, can differ from the original human-written abstractive summaries S*, models with higher precision with respect to Ŝ* do not necessarily yield better ROUGE scores against S*. Because BERTSUMEXT is optimized towards Ŝ* while AREDSUM-CTX and AREDSUM-SEQ learn to select sentences with the best ROUGE gain against S*, they behave differently in terms of ROUGE and precision. We therefore analyze how each model's selection at each k-th step affects ROUGE performance. We only present the precision on CNN/DailyMail in Figure 2, since similar trends are observed on NYT50. Note that all models except AREDSUM-SEQ have the same P@1, because the first selection is based only on salience; filtering and demoting selected sentences takes effect only from the second step. As shown in the figure, BERTSUMEXT has the best P@1, P@2 and P@3 overall, which is reasonable since Ŝ* is the target it is optimized to learn. When TRIBLK is applied, P@2 and P@3 drop considerably while the ROUGE scores go up (as in Table 1), indicating that TRIBLK filters out some informative but redundant sentences during selection, which harms precision but improves ROUGE. In contrast, the P@2 and P@3 of AREDSUM-CTX lie between those of BERTSUMEXT with and without TRIBLK. By learning towards the ROUGE gain given the previously extracted sentences, AREDSUM-CTX achieves the best ROUGE scores with less harm to precision, meaning it better balances salience and redundancy. AREDSUM-SEQ has a significantly lower P@1 than the others, since its objective at the first step is to find the sentence with the maximal ROUGE gain, which is a single specific sentence in Ŝ*. At steps 2 and 3, the disadvantage of AREDSUM-SEQ shrinks, and its P@3 is similar to that of BERTSUMEXT+TRIBLK.
The generated sequence of sentences covers a decent portion of Ŝ*, but it is still worse than the methods that do not use order-sensitive optimization objectives.
Position of Selected Sentences. Figure 3 shows the positions of sentences extracted by different models and by ORACLE on CNN/DailyMail. A large portion of oracle sentences are among the first 5 sentences, and all models tend to extract these leading sentences in their predicted summaries. The output of AREDSUM-SEQ concentrates more on the first 3 sentences, differing from ORACLE more than the other models do. With TRIBLK, BERTSUMEXT selects more sentences from later positions. The position distribution of AREDSUM-CTX lies between those of BERTSUMEXT with and without TRIBLK, similar to their precision distributions in Figure 2. This indicates that AREDSUM-CTX finds a smoother way to filter out sentences that are salient but redundant, and such sentences tend to appear at earlier positions.

Human Evaluation
We also conduct human evaluations to compare our best model against the best baseline. On both datasets, we randomly sample 20 summaries constructed by the best baseline and by our best model from the cases where their ROUGE-2 score difference is more than 0.05 points. Following Zhou et al. (2018), we asked two graduate student volunteers to rank the summaries extracted by the different models from best to worst in terms of informativeness, redundancy and overall quality, allowing ties. Average ranks of the systems are shown in Table 3.

Table 3: Average ranks of our best method and the best baseline on CNN/DailyMail and NYT50 in terms of informativeness (Info), redundancy (Rdnd) and overall quality as judged by human participants (lower is better). * and † indicate significant improvements with p < 0.03 and p < 0.0001.

On CNN/DailyMail, AREDSUM-CTX ranks higher than BERTSUMEXT+TRIBLK on every aspect. On NYT50, AREDSUM-CTX performs more compellingly on redundancy than on informativeness, which is consistent with the fact that BERTSUMEXT only learns salience and does not deal with redundancy during sentence selection. From both the automatic and human evaluations of our best model and the best baseline, we can see that removing redundancy with our models is better than redundancy removal with heuristics or no redundancy removal at all.

Conclusions
Extending a state-of-the-art extractive summarization model, we propose AREDSUM-SEQ, which jointly scores and selects sentences with a sequence generation model, and AREDSUM-CTX, which learns to balance salience and redundancy with a separate model. Experimental results show that AREDSUM-CTX outperforms AREDSUM-SEQ and all other strong baselines, which suggests that redundancy reduction helps improve summary quality and that it is better to model the effect of redundancy explicitly than jointly with salience during sentence scoring.