Salience Allocation as Guidance for Abstractive Summarization

Abstractive summarization models typically learn to capture the salient information from scratch implicitly. Recent literature adds extractive summaries as guidance for abstractive summarization models to provide hints of salient content, and achieves better performance. However, extractive summaries as guidance can be overly strict, leading to information loss or noisy signals. Furthermore, they cannot easily adapt to documents with varying abstractiveness. As the number and allocation of salient content pieces varies, it is hard to find a fixed threshold deciding which content should be included in the guidance. In this paper, we propose a novel summarization approach with flexible and reliable salience guidance, namely SEASON (SaliencE Allocation as Guidance for Abstractive SummarizatiON). SEASON utilizes the allocation of salience expectation to guide abstractive summarization and adapts well to articles with different abstractiveness. Automatic and human evaluations on two benchmark datasets show that the proposed method is effective and reliable. Empirical results on more than one million news articles demonstrate a natural fifteen-fifty salience split for news article sentences, providing a useful insight for composing news articles.


Introduction
Abstractive summarization seeks to generate concise descriptions about synoptic information of longer documents (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017). Tackling this task can provide users with improved dissemination and acquisition of more readable content in long documents. More concretely, it allows for enhanced selection, compression and retrieval of Web-scale textual information that benefits other NLP tasks such as machine reading comprehension (Inoue et al., 2021), mention linking (Cheng et al., 2015), claim verification (Yin et al., 2021), and information extraction (Lu et al., 2022).
Abstractive summarization models are typically trained end-to-end using large collections of paired corpora of raw documents and human-written summaries to directly perform sequence-to-sequence generation. In terms of deciding what to include in the generated summaries, these models implicitly learn to capture the salient information from scratch. Accordingly, recent literature has attempted to add auxiliary extractive salience guidance for abstractive summarization models to give them a higher-level understanding of input documents, among which extractive summaries appear to provide the most effective guidance (Li et al., 2020; Jin et al., 2020; Dou et al., 2021). Methods following this strategy learn to first perform extractive summarization, then perform abstraction on top of the extractive summaries (Hsu et al., 2018; Pilault et al., 2020; Dou et al., 2021).
However, incorporating extractive summaries as a form of guidance is evidently imperfect, even though it improves the overall performance of abstractive summarization in some cases (Dou et al., 2021): 1) Extractive summaries are not reliable guidance. When there are too many summary-worthy sentences in the document, selecting only some of them may be prone to information loss. When there are too few or no summary-worthy sentences, the selected extractive summaries could be noisy and confusing to the model. 2) Extractive summaries are not flexible enough to adapt to different cases. The number and allocation of salient content pieces can vary by document. Rather than extracting a fixed number of sentences, a flexible guidance should select salient content based on document properties. An imperfect selection process may also lead to further model biases, such as positional biases or length biases (Zhong et al., 2019). As the summarization process can differ for distinct documents (Grusky et al., 2018; Koupaee and Wang, 2018), a reliable guidance should allow flexible content selection, and be adaptive to documents with different abstractiveness.
In this paper, we propose a novel summarization approach with a flexible and reliable salience guidance, namely SEASON (SaliencE Allocation as Guidance for Abstractive SummarizatiON). Salience is the degree to which a sentence contributes to the central idea of a document, and its allocation means how salience is distributed among all sentences in a document. To estimate the salience allocation, a linear classifier is trained on top of the encoder. This estimation is incorporated into the decoder with Salience-Aware Cross-Attention (SACA). It provides the flexibility to decide how much signal to accept from the salience guidance to supervise the abstractive summarization. The ground-truth salience label is assigned to each sentence based on its similarity with the ground-truth summary. Meanwhile, the number of salience degrees and their cut-off thresholds are decided based on the corpus to balance informativeness and prediction accuracy. To further improve the robustness of the summarization model, we apply label smoothing between adjacent salience degrees during training, and use the expectation of salience as a more robust salience estimation.
The technical contributions of this work are three-fold. First, we develop a new method for abstractive summarization on a Transformer-based encoder-decoder architecture with the allocation of salience expectation as flexible guidance (§3). Our method provides reliable guidance that adapts well to articles with different abstractiveness (§5.1). Second, we show the effectiveness and reliability of our proposed method compared to existing methods in both automatic (§4.2) and human evaluation (§5.3). Third, empirical results on more than one million news articles show a natural fifteen-fifty salience split for news article sentences (§4.3), providing a useful insight for composing news articles.

Related Work
Joint extractive and abstractive summarization. Extractive summarization and abstractive summarization are two general paradigms of text summarization (See et al., 2017; Grusky et al., 2018). Extractive summarization ensures the faithfulness of the generated summary but is not able to properly summarize documents when rephrasing is needed (Liu and Liu, 2009). Abstractive summarization, comparatively, is more flexible but may suffer from hallucination (Maynez et al., 2020).
A series of studies attempt to benefit from the advantages of both paradigms by combining them. Hsu et al. (2018) encourage the word-level attention of an abstractive summarization model and the relative sentence-level extraction probability from an extractive summarization model to be consistent. More recent studies show that conducting abstractive summarization with extractive summaries as a part of the input leads to better performance (Saito et al., 2020; Pilault et al., 2020; Dou et al., 2021). Extractive summarization can also work as an effective content selector for abstractive summarization when summarizing long documents (Manakul and Gales, 2021). Some studies (Gehrmann et al., 2018; Li et al., 2020; Saito et al., 2020) also consider extracting key words or phrases instead of summary-worthy sentences as guidance, but their performance is not as good as that of methods using sentences (Dou et al., 2021).
Our work extends the strict extractive summary guidance to a soft guidance of salience allocation. The proposed guidance is more flexible, reliable and adaptive, leading to better performance.
Selective attention. Selective attention is a psychological concept referring to the differential processing of simultaneous sources of information (Johnston and Dark, 1986). Incorporating prior knowledge through selective attention is widely explored in natural language processing, especially in recent NLP models with attention mechanisms (Lin et al., 2016; Sukhbaatar et al., 2019; Pruthi et al., 2020; Beltagy et al., 2020; Wang et al., 2022). To modify the summarization process with selective attention, previous studies either adjust the attention scores based on content selection probabilities directly (Hsu et al., 2018; Saito et al., 2020; Li et al., 2021), or append selected content to the input (Saito et al., 2020; Dou et al., 2021). Recent studies show that the latter method with sentence-level content selection performs better (Dou et al., 2021).
Different from prior studies, SEASON maps salience degrees to distinct embeddings and adds them to the encoder outputs as key vectors for cross-attention. This gives our model the flexibility to decide how much signal to accept from the salience guidance for supervising the abstractive summarization process. This strategy achieves better performance in comparison with previous salience-guided selective attention methods.

SEASON
In this work, we employ a Transformer-based encoder-decoder model for abstractive summarization. As shown in Fig. 2, our model SEASON encapsulates salience prediction and text summarization in a single network. We perform multi-task end-to-end training, and inference via one forward pass. During training, the model jointly learns to predict the degree of salience for each sentence, and is guided with ROUGE-based ground-truth salience allocation to generate the abstractive summary. During inference, SEASON predicts the expected salience allocation as an intermediate step from the encoder outputs, and uses this predicted information to guide the decoder to generate the summary.

Problem Formulation
Our assumption comes from the intuition that knowing the content salience allocation helps the model pay attention to important content and generate more informative summaries. Although the content salience allocation is a built-in attribute of the source document, it is hard for the model to leverage this attribute without direct supervision (Li et al., 2020; Saito et al., 2020; Dou et al., 2021).
Let x be the sequence of input tokens in the source document, and y be the sequence of summary tokens, where every token x_i or y_i is in the vocabulary V. We use z_j, where j ∈ {1, . . . , N}, to represent the salience degree of the j-th sentence in the input document. We define o_i as the sentence index for the i-th token, where o_i ∈ {1, . . . , N}. The salience allocation is defined as z = (z_1, . . . , z_N). The problem can be formulated as follows:

P(y | x, z) = ∏_{i=1}^{|y|} P(y_i | y_{<i}, x, z). (1)

In Eq. 1, each token prediction is conditioned on the previously decoded summary tokens, the input tokens in the source document, and the allocation of salience of the source document.

Salience Allocation Prediction
To predict salience degrees of input sentences, we slightly modify the encoder input sequence by adding a special token at the beginning of each sentence, and take the last-layer hidden states at these special tokens as sentence representations:

[h^sent_1, . . . , h^sent_N] = Encoder(x), (2)

where h^sent_j, j ∈ {1, . . . , N}, is the contextualized embedding of the j-th sentence, and x is the modified input sequence. Then, sentence representations are fed into a single-layer classification head:

P(z_j = l | x) = exp((w_l · h^sent_j + b_l) / τ) / Σ_{l'=1}^{L} exp((w_{l'} · h^sent_j + b_{l'}) / τ), (3)

where τ is a sharpening coefficient for the salience degree distribution, l ∈ {1, . . . , L} is the index of a salience degree, L is the number of salience degrees, and w_l and b_l are trainable parameters. We provide discussions on L and τ in §4.3 and §5.4, respectively. The design above allows the model to predict the salience allocation with minimal modifications to the architecture.
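The temperature-sharpened classification head can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the sentence representation, weight rows, and dimensions below are toy values, and a real model would operate on batched tensors.

```python
import math

def salience_distribution(h_sent, W, b, tau=1.0):
    """Temperature-sharpened softmax over L salience degrees for one
    sentence representation h_sent (a plain list of floats).
    W has one weight row per degree; b has one bias per degree."""
    logits = [sum(w_i * h_i for w_i, h_i in zip(w, h_sent)) + b_l
              for w, b_l in zip(W, b)]
    scaled = [z / tau for z in logits]          # tau < 1 sharpens the distribution
    m = max(scaled)                             # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]
```

A lower τ makes the predicted distribution more peaked, which matters later when the distribution is used for the soft salience estimation.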

Salience-Aware Cross-Attention
To explicitly incorporate the salience allocation into the model, we develop a salience-aware cross-attention (SACA) module. SACA first maps the salience degrees to trainable salience embeddings:

ζ(x_i) = Emb(z_{o_i}), (4)

where o_i is the index of the sentence containing the i-th token. This operation is intuitive when using ground-truth salience degrees. For predicted salience degrees, SACA needs to estimate the salience embedding from the inferred salience distribution. A simple hard estimation can be achieved by directly taking the embedding of the degree l that maximizes the probability:

ζ(x_i) = Emb(argmax_l P(z_{o_i} = l | x)). (5)

However, this direct estimation does not take the uncertainty of the prediction into consideration, so we propose a soft estimation that calculates the expectation of the salience embedding:

ζ(x_i) = Σ_{l=1}^{L} Emb(z_{o_i} = l) P(z_{o_i} = l | x). (6)
We compare these two estimation methods comprehensively in §5.4. Next, SACA incorporates the salience allocation in the cross-attention layer to guide summary generation on the decoder side. SACA adds the sentence salience embedding to the encoder hidden state of each token belonging to the sentence, forming the key state for cross-attention. The cross-attention is formulated as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V, (7)

where the attention query Q = h_decoder corresponds to the hidden state of the decoder, the attention key K = h_encoder + ζ(x) is the sum of the encoder hidden state and the salience embedding, and the value V = h_encoder is composed of the original encoder hidden state. In comparison with adding salience scores to cross-attention scores directly, SACA allows the model to learn how much signal to take from the salience guidance.
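The soft estimation and the modified cross-attention can be illustrated together. The following is a single-query, single-head sketch in plain Python with hypothetical toy dimensions; a real implementation would use batched matrix operations inside the Transformer decoder.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def expected_salience_emb(p, emb_table):
    """Soft estimation: expectation of the salience embeddings under the
    predicted distribution p over L degrees (one embedding row per degree)."""
    dim = len(emb_table[0])
    return [sum(p[l] * emb_table[l][d] for l in range(len(p)))
            for d in range(dim)]

def saca(q, h_enc, sal_emb, scale):
    """One decoder query attending over encoder states. Keys are the
    encoder states plus the salience embedding of each token's sentence;
    values are the unmodified encoder states."""
    keys = [[h + z for h, z in zip(h_i, z_i)]
            for h_i, z_i in zip(h_enc, sal_emb)]
    scores = [sum(qd * kd for qd, kd in zip(q, k)) / scale for k in keys]
    attn = softmax(scores)
    dim = len(h_enc[0])
    return [sum(attn[i] * h_enc[i][d] for i in range(len(h_enc)))
            for d in range(dim)]
```

Because the salience embedding only enters the key, the model can learn embeddings close to zero if the guidance is unhelpful, which is one way to read the paper's claim that SACA decides how much signal to take from the guidance.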

Learning Objectives
In training, SEASON learns to predict the salience allocation and generate the summary simultaneously. For salience prediction, we use the cross-entropy loss averaged over sentences:

L_sal = -(1/N) Σ_{j=1}^{N} log P(z_j | x). (8)

In addition, we apply label smoothing (Diaz and Marathe, 2019) to the salience degrees for denoising. Specifically, a probability β is evenly assigned to the salience degrees adjacent to the ground-truth degree. Analysis in §5.4 shows its effectiveness compared with common label smoothing. For summary generation, we use the ground-truth salience allocation as input, and apply the cross-entropy loss averaged over predicted tokens:

L_sum = -(1/|y|) Σ_{i=1}^{|y|} log P(y_i | y_{<i}, x, z).

We further combine the two loss functions with a coefficient α that balances them:

L = L_sum + α L_sal. (9)

§5.4 shows that SEASON is not sensitive to α.
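The adjacent label smoothing and the combined objective can be sketched as follows. This is a toy illustration with made-up numbers; the function names are ours, not the paper's, and a real implementation would work on tensors.

```python
import math

def adjacent_smoothed_target(gold, num_degrees, beta=0.1):
    """Distribute probability beta evenly over the degrees adjacent to the
    gold one; the gold degree keeps 1 - beta."""
    target = [0.0] * num_degrees
    neighbors = [l for l in (gold - 1, gold + 1) if 0 <= l < num_degrees]
    for l in neighbors:
        target[l] = beta / len(neighbors)
    target[gold] = 1.0 - beta
    return target

def cross_entropy(target, probs):
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def season_loss(sum_nll, salience_targets, salience_probs, alpha=0.5):
    """Combined objective: token-level summary NLL plus alpha times the
    mean salience cross-entropy over sentences."""
    sal = sum(cross_entropy(t, p)
              for t, p in zip(salience_targets, salience_probs))
    sal /= len(salience_targets)
    return sum_nll + alpha * sal
```

Note that a gold degree at either end of the scale has only one neighbor, which then receives the full β.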

Experiment
In this section, we first describe our experimental setting, including datasets, baselines, evaluation metrics and implementation details (§4.1). Then, we show the model performance on two summarization datasets (§4.2), and provide an insight on salience threshold selection (§4.3).

Experimental Setup
Datasets. We evaluate our method on two news summarization datasets: CNNDM (See et al., 2017) and Newsroom (Grusky et al., 2018). For both datasets, we use the original news article as input and the human-written summary as the ground-truth output.

Metrics. We report the widely used ROUGE metrics (Lin, 2004), including ROUGE-1 (R-1), ROUGE-2 (R-2), and sentence-level ROUGE-L (R-L) F1 scores computed with the rouge-score python package.

Baselines. We compare our system with three types of strong baselines: independent extractive methods, independent abstractive methods, and joint extractive and abstractive methods.

Implementation details. For inference, we use the predicted soft estimation for the allocation of expected salience. The predicted probability of salience degrees is sharpened with a temperature τ = 0.5. We use beam search with a beam size of 5, a length penalty of 1.5, and 3-gram blocking.
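For reference, the 3-gram blocking heuristic mentioned above simply forbids any candidate token that would recreate a trigram already present in the decoded prefix. A minimal sketch (our own helper, not the actual decoding code):

```python
def violates_ngram_block(tokens, next_token, n=3):
    """Return True if appending next_token would repeat an n-gram that
    already occurs in the decoded prefix (the '3-gram blocking' heuristic)."""
    cand = tokens + [next_token]
    if len(cand) < n:
        return False
    new_ngram = tuple(cand[-n:])
    seen = {tuple(cand[i:i + n]) for i in range(len(cand) - n)}
    return new_ngram in seen
```

During beam search, candidates for which this returns True are typically assigned a score of negative infinity so they are never expanded.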
According to their ROUGE-L F1 scores against the ground-truth summary, we split sentences into L = 3 categories of salience degrees: 1) the most important top 15% of sentences, 2) the bottom 50% least important ones, and 3) everything in between. More discussion regarding this setup is in §4.3.
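The labeling step can be sketched end-to-end: score each sentence against the reference summary with an LCS-based ROUGE-L F1, then bucket sentences by rank. This is an illustrative reimplementation (the paper uses the rouge-score package, and exact tokenization details are omitted; whitespace splitting here is a simplification):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l_f1(sentence, summary):
    s, r = sentence.split(), summary.split()
    lcs = lcs_len(s, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(s), lcs / len(r)
    return 2 * p * rec / (p + rec)

def assign_degrees(scores, top=0.15, bottom=0.50):
    """Rank sentences by salience score; top 15% -> degree 2 (high),
    bottom 50% -> degree 0 (low), the rest -> degree 1 (medium)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(scores)
    n_top, n_bottom = round(top * n), round(bottom * n)
    degrees = [1] * n
    for rank, i in enumerate(order):
        if rank < n_top:
            degrees[i] = 2
        elif rank >= n - n_bottom:
            degrees[i] = 0
    return degrees
```

How ties and rounding at the percentile boundaries are resolved is our own choice here, not specified by the paper.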

Main Results
Tab. 1 shows the results on the two summarization datasets. For baselines on CNNDM, joint extractive and abstractive summarization methods (i.e., CIT+SE and GSum) perform better than independent extractive (i.e., LEAD-3, MatchSum and HAHSum) and abstractive summarization methods (i.e., Point-Generator, BART and PEGASUS) when using the same backbone models. Among the joint summarization baselines, using extractive summaries as guidance (i.e., GSum) performs better than using key words as guidance (i.e., CIT+SE), which agrees with the observation by Dou et al. (2021). Our method promisingly improves the original BART by 2.06/1.41/1.91 points in terms of ROUGE-1/2/L F1 scores, indicating the multi-degree salience expectation guidance is effective.

The Fifteen-Fifty Phenomenon
The number of salience degrees and the thresholds that delimit them are important hyper-parameters for discretizing the proposed guidance. We apply a greedy search algorithm to find the best thresholds. First, we compute the salience scores of all sentences in the corpus. In this work, we use the ROUGE-L F1 between each document sentence and the corresponding reference summary to represent salience, and find the best threshold for two salience degrees.
Then we gradually add one more salience degree and search for the additional threshold. The results on CNNDM are shown in Tab. 2. Splitting all sentences into three salience degrees by the top 15% and bottom 50% salience scores leads to the best ROUGE-L F1. As the number of salience degrees L increases, the model performance first increases and then decreases. We attribute this phenomenon to the trade-off between the informativeness and the prediction accuracy of the guidance. Although a more fine-grained salience guidance is more informative, our model generates summaries based on predicted guidance during inference, where error propagation exists. Increasing the number of salience degrees to predict also increases the risk of misclassification. Furthermore, we find the best number and thresholds of salience degrees for summarization are consistent on Newsroom, indicating that the salience split at the top fifteen and bottom fifty percentiles is a natural property of news articles. This phenomenon may provide a useful insight for journalists composing news articles.
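The greedy search described above can be sketched as follows. The `evaluate` callback is a stand-in for the expensive step of relabeling the corpus, retraining, and scoring with ROUGE-L, so the usage in the note below drives it with a hypothetical score table rather than real training runs.

```python
def greedy_threshold_search(candidates, evaluate, max_degrees=4):
    """Start with the best single cut-off (two degrees), then greedily add
    one cut-off at a time while the downstream score improves.
    `evaluate` maps a sorted tuple of percentile cut-offs to a validation
    score; `candidates` is the pool of percentile cut-offs to try."""
    best_cuts, best_score = (), evaluate(())
    while len(best_cuts) < max_degrees - 1:
        gains = [(evaluate(tuple(sorted(best_cuts + (c,)))),
                  tuple(sorted(best_cuts + (c,))))
                 for c in candidates if c not in best_cuts]
        score, cuts = max(gains)
        if score <= best_score:
            break  # adding another degree no longer helps
        best_cuts, best_score = cuts, score
    return best_cuts, best_score
```

With a score table that peaks at the cut-offs (0.15, 0.50), the search recovers exactly the fifteen-fifty split reported in the paper; the numbers in such a table are, of course, illustrative.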

Analysis
To gain further insights on the proposed method, we perform additional analyses on CNNDM to comprehensively investigate performance by abstractiveness (§5.1), summary length (§5.2), human evaluation (§5.3), and the impact of different model components (§5.4). A case study is also presented in §5.5.

Performance by Abstractiveness
To understand how adaptive our method is to documents with different abstractiveness, we split all documents into three subsets of equal size based on their density scores, following Grusky et al. (2018). Results are shown in Fig. 3. SEASON performs better than the baselines on all subsets, indicating that our method is adaptive to documents with different abstractiveness. The improvements on the abstractive and mixed subsets are slightly higher than on the extractive subset, indicating that abstractive documents benefit more than extractive ones from a flexible salience guidance.
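The density score from Grusky et al. (2018) measures how much a summary copies long fragments from the article: greedily match summary tokens to the longest overlapping article spans, then average the squared fragment lengths. A compact sketch (whitespace tokenization is a simplification of the original implementation):

```python
def fragment_lengths(article_toks, summary_toks):
    """Greedily match each summary position to the longest span that also
    occurs in the article; return the lengths of the matched fragments."""
    frags, i = [], 0
    while i < len(summary_toks):
        best = 0
        for j in range(len(article_toks)):
            k = 0
            while (i + k < len(summary_toks) and j + k < len(article_toks)
                   and summary_toks[i + k] == article_toks[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            frags.append(best)
            i += best
        else:
            i += 1  # unmatched summary token
    return frags

def density(article, summary):
    """Extractive fragment density: mean squared fragment length per
    summary token. Higher means more verbatim copying."""
    a, s = article.split(), summary.split()
    f = fragment_lengths(a, s)
    return sum(l * l for l in f) / max(len(s), 1)
```

Documents are then sorted by this score and cut into three equal-sized extractive, mixed, and abstractive subsets.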

Summary Length
As SEASON achieves better performance under different abstractiveness with a more flexible salience guidance, a follow-up research question is: does the flexible salience guidance help predict summary length more accurately? To answer this question, we compute the average lengths of the ground-truth summaries and the summaries generated by SEASON and the baseline systems. The average summary lengths of Ground-Truth, SEASON, BART, and GSum are respectively 54.8, 59.0, 60.7, and 72.0. Among these methods, SEASON gives the closest average summary length to Ground-Truth. Moreover, while both SEASON and GSum introduce sentence-level salience guidance to BART, they change the summary length in opposite directions.

Human Evaluation
We further evaluate the system outputs of SEASON, BART, GSum and the Ground-Truth with human subjective evaluation. We randomly pick 100 instances from the CNNDM test set. For each raw document in those instances, we provide the summary generated by each system and the ground-truth summary. We hire human evaluators on Amazon Mechanical Turk to answer three Yes/No questions for the four summaries and rank them. Each instance is assigned to 3 different human evaluators to answer the following three questions (Song et al., 2021). a) Informativeness: Does the summary include the major information from the news? b) Faithfulness: Does the summary give any additional information not covered by the news? c) Fluency: Is the summary grammatical and well-formed? Tab. 3 reports the average percentage for each method to get a positive answer on the corresponding question. Among the three systems, SEASON performs the best on informativeness and fluency, while GSum performs the best on faithfulness. This indicates that our flexible guidance helps the model to identify salient content accurately and rephrase it properly. Not surprisingly, systems with guidance (i.e., SEASON and GSum) are more faithful to the original content than a system without any guidance (i.e., BART).

Tab. 4 shows the ranking results. SEASON has the highest percentage of the highest-ranked summaries and the lowest percentage of the lowest-ranked summaries. It also has the best average rank. These results further demonstrate that summaries generated by SEASON are of high quality. Interestingly, we find that the ground-truth is not always the best choice in human evaluation. This observation aligns with the findings in prior studies (Maynez et al., 2020; Song et al., 2020; Fabbri et al., 2021). It could happen since both human composition of summaries and human judgment of their quality can be subjective. Thus, ground-truth news summaries written by editors may not always be the first choice of readers. It also indicates that human evaluators could not distinguish between the real human writer and our automatic summarizer, and actually preferred our system outputs over the ground-truth summaries.

Ablation Study
For all the experiments in this section, we use the default setting introduced in §4.1, unless discussed otherwise with different hyper-parameter values.

BART
New York-based writer Danielle Page set out to ask every cabbie she came across to dispense their best piece of relationship advice. The drivers, many of whom are married themselves, revealed their personal tips, life lessons and cultural anecdotes all in the name of love.

GSum
New York-based writer Danielle Page set out to ask every cabbie she came across to dispense their best piece of relationship advice in hopes of unlocking the key to a successful union. The drivers, many of whom are married themselves, revealed their personal tips, life lessons and cultural anecdotes all in the name of love. A 60-year-old named Michael revealed that his trick to marital bliss is putting his wife's happiness above his own, but insists that what really makes a relationship work is finding a partner who will do the same for you.

SEASON

New York-based writer Danielle Page set out to ask every cabbie she came across to dispense their best piece of relationship advice.

Gold
New York-based writer Danielle Page set out to ask every cabbie she came across to share their tips on finding -and keeping -a partner.

BART
Researchers from Texas A&M School of Public Health found that hospitalizations from car crashes dropped 7 percent between 2003 and 2010 in the 45 states with texting bans. Arizona, Texas, Montana, Missouri, and Oklahoma are the only five states in America that do not have texting at the wheel bans for all drivers.

GSum
Researchers from Texas A&M School of Public Health found that hospitalizations from car crashes dropped 7 percent between 2003 and 2010 in the 45 states with texting bans when compared to states with no restrictions. Drivers between the ages of 25 and 40 are the most likely group of people to get in an accident related to texting and driving.

SEASON

Researchers from Texas A&M School of Public Health found that hospitalizations from car crashes dropped 7 percent between 2003 and 2010 in the 45 states with texting bans. Arizona, Texas, Montana, Missouri, and Oklahoma are the only five states in America that do not have texting at the wheel bans for all drivers. The study found that older drivers were more likely to make a texting and driving mistake than a younger driver.

Gold
Study found that hospitalizations from car crashes dropped 7 percent between 2003 and 2010 in the 45 states with texting bans. Arizona, Texas, Montana, Missouri, and Oklahoma are the only five states in America that do not have texting at the wheel bans for all drivers. The study also found that older drivers were more likely to make a texting and driving mistake than a younger driver.

Multi-Task Learning. We first investigate the effectiveness of salience prediction as an auxiliary task by removing the salience-aware cross-attention. In this setting, the model jointly predicts the salience allocation and the abstractive summary, but does not feed the gold or predicted salience allocation to the decoder. That means the salience allocation is only used as a supervision signal, not as an (intermediate) input feature. As shown in Tab. 5, with MTL alone, the model achieves improvements of 0.36/0.32/0.32 points in terms of R-1/2/L. This indicates that the salience prediction task can not only provide effective guidance for abstractive summarization, but also act as supervision for learning more robust representations.
Salience-Aware Cross-Attention. We examine the effectiveness of the proposed salience-aware cross-attention module from two perspectives in Tab. 5. First, we provide the gold salience labels instead of predicted ones to explore its upper bound. The performance increases by 8.58/8.72/9.06 points with a perfect salience predictor. This result indicates that a better estimation of salience can further improve abstractive summarization performance. Second, we compare it with the original cross-attention module while keeping the auxiliary task, and observe a performance drop of 1.70/1.09/1.59 points in terms of R-1/2/L. This indicates that salience-aware cross-attention is essential for selecting important content accurately.
Coefficient of Multi-Task Learning. We further examine the influence of the coefficient α of multi-task learning in Tab. 6. We test three different α values and observe that the largest difference in R-1/2/L is within 0.02/0.07/0.03. According to the results, SEASON is not sensitive to α, indicating our model architecture is robust.
Adjacent Label Smoothing. We compare different label smoothing strategies in Tab. 7. In general, label smoothing improves model generalization and calibration (Müller et al., 2019), and therefore benefits the overall performance. Given the same smoothing probability β, adding label smoothing to adjacent salience degrees performs better than adding label smoothing over all other salience degrees.
Salience Estimation. We compare the effectiveness of the soft (Eq. 6) and hard (Eq. 5) strategies for salience estimation in Tab. 8. Computing the expectation with raw probabilities (i.e., τ = 1.0) brings a 0.11-point improvement on ROUGE-1. By adjusting the sharpness of the probability distribution with the sharpening coefficient τ, the ROUGE-1/2/L improvements become 0.50/0.39/0.34 points, respectively. As defined in Eq. 3, τ represents the confidence in predictions, and a lower τ leads to a sharper probability distribution. In our experiments, τ = 0.5 performs best.

Case Study
We present a case study in Tab. 9 with two representative examples to illustrate the advantages of SEASON. In the first case, BART tends to generate extra details without the help of proper guidance, when only one sentence is enough to summarize the document. GSum is guided by an extractive summary consisting of three sentences, so not surprisingly it provides even more details. In the second case, BART infers without any salience guidance and ignores an important finding of the research. GSum selects exactly three sentences as guidance, thus it misses key information when multiple sentences are similarly important but some of them are not included in the guidance. SEASON performs well in both cases, indicating it is adaptive to documents with different properties.

Conclusion
In this paper, we propose SEASON, an abstractive summarization approach guided by salience allocation expectation. In SEASON, the salience guidance is adaptive to documents with different abstractiveness, and the salience-aware cross-attention module is flexible in deciding how much signal to accept from the salience guidance. Automatic and human evaluation further demonstrate the effectiveness and reliability of our proposed method. Compared to the strong baseline model (i.e., BART), our method achieves a 2.06/1.41/1.91 ROUGE-1/2/L performance gain on CNNDM, and a 0.33/0.32/0.60 performance gain on Newsroom. Finally, the empirical results on more than one million news articles demonstrate a natural fifteen-fifty salience split for news article sentences, providing a useful insight for composing news articles.
Limitations

We only experiment with two representative English news summarization datasets. Future work on other languages may need to search for the best salience degrees and thresholds for each language. Although we use BART as our base model and maximum likelihood estimation (MLE) as the learning objective in this study, the proposed method can also be applied to other backbones and learning objectives (Zhang et al., 2020; Liu and Liu, 2021; Liu et al., 2022). While we have limited the proposed technique to abstractive summarization of news articles, future research can extend SEASON to other domains, such as scientific publications (Cohan et al., 2018) and podcasts (Song et al., 2022). In terms of evaluation, we focus on the supervised and in-domain setting. Future work may also consider extending our method to zero-shot, few-shot, cross-domain, or cross-dataset settings. In addition to abstractive summarization, future research can also extend SEASON to other NLP tasks requiring salience-awareness, such as fact verification (Wang et al., 2021), information retrieval (Xiong et al., 2018) and distantly supervised relation extraction (Lin et al., 2016).

Ethical Consideration
A general issue of automatic text summarization is the intellectual property problem caused by copying content from the raw document to the generated summary. This work seeks to improve abstractive summarization models with salience allocation as guidance. As the proposed guidance is more flexible than extractive summaries, it is likely to reduce copied content. Although we create the salience guidance based on ground-truth summaries, the documents and ground-truth summaries remain the same as in the original dataset, ensuring no further social bias is introduced.

Figure 1: Illustration of different guidance. Extractive summary is a strict guidance consisting of extracted sentences labeled with a check-mark. Salience allocation is a flexible guidance mapping sentences to different salience degrees, shown as a bar chart.

Figure 2: Model architecture of SEASON. The proposed modules are highlighted with bold lines. SEASON adds a salience predictor on top of the encoder, maps (the expectation of) salience degrees to corresponding embeddings, and adds these salience embeddings to the key vectors of cross-attention.

Newsroom (Grusky et al., 2018) contains news articles and summaries written by authors and editors from 38 major newsrooms, published between 1998 and 2017. The dataset includes 995,041/108,837/108,862 training/validation/test pairs. On average, each article has 659 words and each summary has 27 words.
CNNDM (See et al., 2017) consists of news articles and their human-written abstracts from the CNN and Daily Mail websites, including 287,226/13,368/11,490 training/validation/test pairs. On average, each article has 781 words and each abstract contains 56 words.

Table 1: Results on CNNDM and Newsroom test sets. Best scores are in bold. Scores significantly better than the best baseline model are underlined (p < 0.001). Results with * are reproduced by us. Other numbers are from prior papers.

Table 4: Percentage of rankings and the average rank by human evaluation.

Table 6: Results on CNNDM dev set with different loss weights α for the salience prediction loss.

Table 7: Results on CNNDM dev set with different label smoothing strategies.

Table 8: Results on CNNDM dev set with different salience estimation methods. τ is the sharpening coefficient in the softmax.

Table 9: Case study. Continuous word spans overlapping with the gold summary by more than 3 words are in blue. Continuous word spans in the gold summary not covered by any prediction are in red. Baselines may suffer from extra details or information loss due to missing or imperfect salience guidance.