Dissecting Generation Modes for Abstractive Summarization Models via Ablation and Attribution

Despite the prominence of neural abstractive summarization models, we know little about how they actually form summaries and how to understand where their decisions come from. We propose a two-step method to interpret summarization model decisions. We first analyze the model’s behavior by ablating the full model to categorize each decoder decision into one of several generation modes: roughly, is the model behaving like a language model, is it relying heavily on the input, or is it somewhere in between? After isolating decisions that do depend on the input, we explore interpreting these decisions using several different attribution methods. We compare these techniques based on their ability to select content and reconstruct the model’s predicted token from perturbations of the input, thus revealing whether highlighted attributions are truly important for the generation of the next token. While this machinery can be broadly useful even beyond summarization, we specifically demonstrate its capability to identify phrases the summarization model has memorized and determine where in the training pipeline this memorization happened, as well as study complex generation phenomena like sentence fusion on a per-instance basis.


Introduction
Transformer-based neural summarization models (Liu and Lapata, 2019; Stiennon et al., 2020; Xu et al., 2020b; Desai et al., 2020), especially pretrained abstractive models like BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020), have made great strides in recent years. These models demonstrate exciting new capabilities in terms of abstraction, but little is known about how these models work. In particular, do token generation decisions leverage the source text, and if so, which parts? Or do these decisions arise based primarily on knowledge from the language model (Jiang et al., 2020; Carlini et al., 2020), learned during pre-training or fine-tuning? Having tools to analyze these models is crucial to identifying and forestalling problems in generation, such as toxicity (Gehman et al., 2020) or factual errors (Kryscinski et al., 2020; Goyal and Durrett, 2020, 2021).
Although interpreting classification models for NLP has been widely studied from perspectives like feature attribution (Ribeiro et al., 2016;Sundararajan et al., 2017) and influence functions (Koh and Liang, 2017;Han et al., 2020), summarization specifically introduces some additional elements that make these techniques hard to apply directly. First, summarization models make sequential decisions from a very large state space. Second, encoder-decoder models have a special structure, featuring a complex interaction of decoder-side and encoder-side computation to select the next word. Third, pre-trained LMs blur the distinction between relying on implicit prior knowledge or explicit instance-dependent input.
This paper aims to more fully interpret the stepwise prediction decisions of neural abstractive summarization models. 1 First, we roughly bucket generation decisions into one of several modes of generation. After confirming that the models we use are robust to seeing partial inputs, we can probe the model by predicting next words with various model ablations: a basic language model with no input (LM ∅ ), a summarization model with no input (S ∅ ), with part of the document as input (S part ), and with the full document as input (S full ). These ablations tell us when the decision is context-independent (generated in an LM-like way), when it is heavily context-dependent (generated from the context), and more. We map these regions in Figure 2 and can use these maps to coarsely analyze model behavior. For example, 17.6% of the decisions on XSum are in the lower-left corner (LM-like), which means they do not rely much on the input context. Second, we focus on more fine-grained attribution of decisions that arise when the model does rely heavily on the source document. We carefully examine interpretations based on several prior techniques, including occlusion (Zeiler and Fergus, 2014), attention, integrated gradients (Sundararajan et al., 2017), and input gradients (Hechtlinger, 2016). In order to evaluate and compare these methods, we propose a comprehensive evaluation based on presenting counterfactual, partial inputs to quantitatively assess these models' performance with different subsets of the input data.
Our two-stage analysis framework allows us to (1) understand how each individual decision depends on context and prior knowledge (Sec 3), (2) find suspicious cases of memorization and bias (Sec 4), and (3) locate the source evidence for context-dependent generation (Sec 5). The framework can also be used to understand more complex decisions like sentence fusion (Sec 6).

Background & Setup
A seq2seq neural abstractive model first encodes an input document with m sentences (s 1 , · · · , s m ) and n tokens (w 1 , w 2 , · · · , w n ), then generates a sequence of tokens (y 1 , · · · , y T ) as the summary. At each time step t in the generation phase, the model encodes the input document and the decoded summary prefix and predicts the distribution over tokens as p(y t | w 1 , w 2 , . . . , w n , y <t ).
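The stepwise generation loop described above can be sketched as follows. This is a minimal illustration with a hypothetical stub in place of the real encoder-decoder: `next_token_distribution` stands in for p(y_t | w_1..w_n, y_<t), and the toy vocabulary and copy behavior are our own invention, not the paper's model.

```python
VOCAB = ["<eos>", "the", "mayor", "resigned"]

def next_token_distribution(source_tokens, prefix):
    """Stand-in for p(y_t | w_1..w_n, y_<t); a real model would run the
    encoder over the source and the decoder over the prefix."""
    # Toy behavior: copy source tokens in order, then emit <eos>.
    t = len(prefix)
    target = source_tokens[t] if t < len(source_tokens) else "<eos>"
    return {tok: (1.0 if tok == target else 0.0) for tok in VOCAB}

def greedy_decode(source_tokens, max_len=10):
    """Generate a summary one token at a time, taking the arg max of the
    predicted distribution at each step."""
    prefix = []
    for _ in range(max_len):
        dist = next_token_distribution(source_tokens, prefix)
        y_t = max(dist, key=dist.get)
        if y_t == "<eos>":
            break
        prefix.append(y_t)
    return prefix
```

The analyses in this paper intervene on exactly this loop: they hold the decoded prefix fixed and vary what the model is allowed to condition on.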

Target Models & Datasets
We investigate the English-language CNN/DM (Hermann et al., 2015) and XSum (Narayan et al., 2018) datasets, which are commonly used to fine-tune pre-trained language models like BART, PEGASUS, and T5. As shown in past work (Narayan et al., 2018; Chen et al., 2020b; Xu et al., 2020a), XSum has significantly different properties from CNN/DM, so these datasets will show a range of model behaviors. We will primarily use the development sets for our analysis.
We focus on BART (Lewis et al., 2020), a state-of-the-art pre-trained model for language modeling and text summarization. Specifically, we adopt 'bart-large' as the language model M LM , 'bart-large-xsum' as the summarization model M SUM for XSum, and 'bart-large-cnn' for CNN/DM, made available by Wolf et al. (2019). BART thus provides separate LM and summarization models sharing the same subword tokenization method. 2

Our approach focuses on teasing apart these different modes of decisions. We first run the full model to get the predicted summary (y 1 , · · · , y T ). We then analyze the distribution placed by the full model S full to figure out what contributes towards the generation of the next token. In the ablation stage (Figure 1), we compare the predictions of different model and input configurations. The goal of this stage is to coarsely determine the mode of generation. Here, for and Khan are generated in an LM-like way: the model already has a strong prior that Sadiq should be Sadiq Khan, and the source article has little impact on this decision. Cameron, by contrast, does require the source in order to be generated. And mayoral is a complex case, where the model is not strictly copying this word from anywhere in the source, but instead using a nebulous combination of information to generate it. In the attribution stage, we interpret decisions which require more context using a more fine-grained approach. Given the predicted prefix (like David), the target prediction (like Cameron), and the model, we use attribution techniques like integrated gradients (Sundararajan et al., 2017) or LIME (Ribeiro et al., 2016) to track the input which contributes to this prediction.

Ablation Models and Assumptions
The configurations we use are listed in Table 1 and defined as follows: LM ∅ is a pre-trained language model only taking the decoded summary prefix as input. We use this model to estimate what a pure language model will predict given the prefix. We denote the prediction distribution as P LM ∅ = P (y t | y <t ; M LM ).
S ∅ is the same BART summarization model as S full , but without the input document as the input. That is, it uses the same parameters as the full model, but with no input document fed in. We use the prediction of this model to estimate how strong an effect the in-domain training data has, but still treating the model as a decoder-only language model. It is denoted as P ∅ = P (y t | y <t ; M SUM ). Figure 1 shows how this can effectively identify cases like Khan that surprisingly do not rely on the input document.
S part is a further step closer to the full model: this is the BART summarization model conditioned on the decoder prefix and part of the input document, denoted as P part = P (y t | y <t , {w i }; M SUM ), where {w i } is a subset of tokens of the input document. The selected content could be a continuous span, a sentence, or a concatenation of several spans or sentences.
Although M SUM is designed and trained to condition on an input document, we find that the model also works well with no input, little input, or incomplete sentences. As we will show later, there are many cases that this scheme successfully explains; we formalize our assumption as follows: Assumption 1 If the model executed on partial input nearly reproduces the next word distribution of the full model, then we view that partial context as a sufficient (but perhaps not necessary) input to explain the model's behavior.
Here we define partial input as either just the decoded summary so far, or the summary plus partial context. In practice, we see two things. First, when considering just the decoder context (i.e., behaving as an LM), the partial model may reproduce the full model's behavior (e.g., Khan in Figure 1). We do not focus on explaining these cases in further detail. While conceivably the actual conditional model might internally be doing something different (a risk noted by Rudin (2019)), this proves the existence of a decoder-only proxy model that reproduces the full model's results, which is a criterion used in past work. Second, when considering partial inputs, the model frequently requires one or two specific sentences to reproduce the full model's behavior, suggesting that the given contexts are both necessary and sufficient.
Because these analyses involve using the model on data significantly different than that which it is trained on, we want another way to quantify the importance of a word, span, or sentence. This brings us to our second assumption: Assumption 2 In order to say that a span of the input or decoder context is important to the model's prediction, it should be the case that this span is demonstrated to be important in counterfactual settings. That is, modified inputs to the model that include this span should yield closer predictions than those that don't.
This criterion depends on the set of counterfactuals that we use. Rather than just word removal (Ribeiro et al., 2016), we will use a more compre-hensive set of counterfactuals (Miller, 2019;Jacovi and Goldberg, 2020) to quantify the importance of input tokens. We describe this more in Section 5.

Distance Metric
Throughout this work, we rely on measuring the distance between distributions over tokens. Although KL divergence is a popular choice, we found it to be very unstable given the large vocabulary size, and two distributions that are completely different would have very large values of KL. We instead use the L 1 distance between the two distributions: D(P, Q) = Σ_i |p i − q i |. This is similar to using the Earth Mover's Distance (Rubner et al., 1998) over these two discrete distributions, with an identity transportation flow, since the distributions are defined over the same set of tokens.
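The distance above is simple to compute; a minimal sketch (representing each distribution as a token-to-probability dict, which is our own choice of data structure):

```python
def l1_distance(p, q):
    """L1 distance between two distributions over the same vocabulary:
    D(P, Q) = sum_i |p_i - q_i|. Ranges from 0 (identical distributions)
    to 2 (disjoint support)."""
    vocab = set(p) | set(q)
    return sum(abs(p.get(tok, 0.0) - q.get(tok, 0.0)) for tok in vocab)
```

The bounds [0, 2] are what give the map axes in the next section their fixed range.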

Ablation: Mapping Model Behavior
Based on Assumption 1, we can take a first step towards understanding these models based on the partial models described in Section 2.3. Previous work (See et al., 2017; Song et al., 2020) has studied model behavior based on externally-visible properties of the model's generation, such as identifying novel words, differentiating copy and generation, and prediction confidence, which provides some insight about the model's behavior (Xu et al., 2020a). However, these focus more on shallow comparisons of the input document, the generated summary, and the reference summary, and do not focus as strongly on the model itself.
We propose a new way of mapping the prediction space, with maps 3 for XSum and CNN/DM shown in Figure 2. Each point in the map is a single subword token being generated by the decoder on the development set at inference time; that is, each point corresponds to a single invocation of the model. This analysis does not depend on the reference summary at all.
The x-axis of the map shows the distance between LM ∅ and S full , using the metric defined in Section 2.4, which ranges from 0 to 2. The y-axis shows the distance between S ∅ and S full . Other choices of partial models for the axes are possible (or more axes), but we believe these show two important factors. The x-axis captures how much the generic pre-trained language model agrees with the full model's predictions. The y-axis captures how much the decoder-only summarization model agrees with the full model's predictions. The histograms on the sides of the map show counts along each vertical or horizontal slice.

3 While our axes are very different here, our mapping concept loosely follows that of Swayamdipta et al. (2020).

Figure 2: The x-axis and y-axis show the distance between LM ∅ and S full and the distance between S ∅ and S full . The regions characterize different generation modes, defined in Section 3.

Modes of decisions
We break these maps into a few coarse regions based on the axis values. We list the coordinates of the bottom left corner and the upper right corner. These values were chosen by inspection and the precise boundaries have little effect on our analysis, as many of the decisions fall into the corners or along sides.
LM ([0, 0], [0.5, 0.5]) contains the cases where LM ∅ and S ∅ both agree with S full . These decisions are easily made using only decoder information, even without training or knowledge of the input document. These are cases that follow from the constraints of language models, including function words, common entities, or idioms.
CTX ([0.5, 0.5], [2, 2]) contains the cases where the input is needed to make the prediction: neither decoder-only model can model these decisions.
FT ([1.5, 0], [2, 0.5]) captures cases where the fine-tuned decoder-only model is a close match but the pre-trained model is not. This happens more often on XSum and reflects memorization of training summaries, as we discuss later.
PT ([0, 1.5], [0.5, 2]) is the least intuitive case, where LM ∅ agrees with S full but S ∅ does not; that is, fine-tuning a decoder-only model causes it to work less well. This happens more often on CNN/DM and reflects memorization of data in the pre-training corpus.
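The four region definitions above can be sketched as a simple bucketing function. Here x = D(LM ∅ , S full ) and y = D(S ∅ , S full ), both in [0, 2]; the `region_of` helper and its tie-breaking at boundaries are our own illustration (the paper notes the precise boundaries have little effect):

```python
REGIONS = {                     # (x_min, y_min), (x_max, y_max)
    "LM":  ((0.0, 0.0), (0.5, 0.5)),   # both decoder-only models agree
    "CTX": ((0.5, 0.5), (2.0, 2.0)),   # input is needed for the prediction
    "FT":  ((1.5, 0.0), (2.0, 0.5)),   # only the fine-tuned LM agrees
    "PT":  ((0.0, 1.5), (0.5, 2.0)),   # only the pre-trained LM agrees
}

def region_of(x, y):
    """Map one decoding step's coordinates to a coarse generation mode."""
    for name, ((x0, y0), (x1, y1)) in REGIONS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return "OTHER"   # gaps between the four named regions
```

Running this over every decoding step of the development set yields the region counts reported later (e.g., the CTX and LM percentages on XSum).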

Coloring the Map with Context Probing
While the map highlights some useful trends, there are many examples that do rely heavily on the context that we would like to further analyze. Some examples depend on the context in a sophisticated way, but other tokens, like parts of named entities or noun phrases, are simply copied from the source article. Highlighting this contrast, we additionally subdivide the cases by how they depend on the context. We conduct a sentence-level presence probing experiment to further characterize the generation decisions. For a document with m sentences, we run the S part model conditioned on each of the sentences in isolation. We obtain a sequence of scalars P sent = (P part (s i ); i ∈ [1, m]). We define CTX-Hd ("context-hard") cases as ones where max(P sent ) is low; that is, where no single sentence can yield the token, as in the case of sentence fusion. These also reflect cases of high entropy for S full , where any perturbation to the input may cause a big distribution shift. The first, second, and third quartiles of max(P sent ) are [0.69, 0.96, 1.0] on XSum and [0.95, 1.0, 1.0] on CNN/DM.
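The sentence-level probe can be sketched as follows; `prob_given_sentence` is a hypothetical stand-in for running S part conditioned on a single sentence and reading off the target token's probability:

```python
def probe_sentences(sentences, prob_given_sentence, threshold=0.5):
    """Run the S_part model on each sentence in isolation and flag
    'context-hard' (CTX-Hd) steps where no single sentence suffices
    to yield the target token."""
    p_sent = [prob_given_sentence(s) for s in sentences]
    return max(p_sent), max(p_sent) < threshold
```

The threshold of 0.5 here is an assumption for illustration; the paper characterizes CTX-Hd via the quartiles of max(P sent ) and uses max(P sent ) < 0.5 later in the fusion case study.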

Region Count & POS Tags
To roughly characterize the words generated in different regions of the map, in Table 2, we show the percentage of decisions falling into each region and the top 3 POS tags for each region on the XSum map. From the frequency of these categories, we can tell that more than two-thirds of the decisions belong to the CTX category. 17.6% of cases are in LM, the second-largest category. In the LM region, ADP and DET account for nearly half of the data points, confirming that these are largely function words. Nouns are still prevalent, accounting for 13.5% of the category. After observing the data, we found that these points represent commonsense knowledge or common nouns or entities, like "Nations" following "United" or "Obama" following "Barack", where the model generates these without relying on the input. Around 8% of cases fall into gaps between these categories. Only 2.5% and 2.1% of the generations fall into the PT and FT regions, respectively. These are small but significant cases, as they clearly show the biases from the pre-training corpus and the fine-tuning corpus. We now describe the effects we observe here.

Bias from Training Data
One benefit of mapping the predictions is to detect predictions that are suspiciously likely given one language model but not the other, specifically those in the PT and FT regions. CNN/DM has more cases falling into PT than XSum, so we focus on CNN/DM for PT and XSum for FT.
PT: Bias from the Pretraining Corpus The data points falling into the PT area are those where the LM ∅ prediction is similar to the S full prediction but the S ∅ prediction is very different from S full . We present a set of representative examples from the PT region of the CNN/DM map in Table 3. For the first example, match is assigned high probability by LM ∅ and S full , but not by the no-input summarization model. The cases in this table exhibit a suspiciously high probability assigned to the correct answer in the base LM: its confidence about Kylie Jenner vs. Kylie Min(ogue) is uncalibrated with what the "true" probabilities of these seem likely to be to our human eyes. One explanation which we investigate is whether the validation and test sets of these benchmark datasets overlap with the pre-training corpus. 4

4 Our matching criterion is more than three 7-gram word overlaps between the pre-training document and reference summaries from the dataset; upon inspection, over 90% of the cases flagged by this criterion contained large chunks of the reference summary.
Our conclusion is that the pre-trained language model has likely memorized certain articles and their summaries. Other factors could be at play: other types of knowledge in the language model (Petroni et al., 2019;Shin et al., 2020;Talmor et al., 2020) such as key entity cooccurrences, could be contributing to these cases as well and simply be "forgotten" during fine-tuning. However, as an analysis tool, ablation suggested a hypothesis  about data overlap which we were able to partially confirm, which supports its utility for understanding summarization models.
FT: Bias from Fine-tuning Data We now examine the data points falling in the bottom right corner of the map, where the fine-tuned LM matches the full model more closely than the pre-trained LM.
In Table 4, we present some model-generated bigrams found in the FT region of XSum and compare the frequency of these patterns in the XSum and CNN/DM training data. Not every generation instance of these bigrams falls into the FT region, but many do. Table 4 shows the relative probabilities of these counts in XSum and CNN/DM, showing that these cases are all very common in XSum training summaries. The aggregate over all decisions in this region (the last line) shows this pattern as well. These can suggest larger patterns: the first three come from the common phrase in our series of letters from African journalists (which starts 0.5% of summaries in XSum). Other stylistic markers, such as ways of writing currency, are memorized too.

Table 5: Examples of DISPTOK and RMTOK. We show the change of the prediction probability of the target token when displaying or masking the token w attr , which is the highest ranked token from the occlusion method. Significant changes are marked in bold.

Attribution
As shown in Table 2, more than two-thirds of generation steps rely heavily on the context. Here, we use attribution methods to identify which aspects of the input are important in the cases where the input does heavily influence the decision.
Each of the methods we explore scores each word w i in the input document with a score α i . The score can be a normalized distribution, or a probability value ranging from 0 to 1. For each method, we rank the tokens in descending order by score. To confirm that the tokens highlighted are meaningfully used by the model when making its predictions, we propose an evaluation protocol based on a range of counterfactual modifications of the input document, taking care to make these compatible with the nature of subword tokenization.

Evaluation by Adding and Removing
Our evaluation focuses on the following question: given a budget of tokens or sentences, how well does the model reconstruct the target token y t when shown the important content selected by the attribution method? Our metric is the cross entropy loss of predicting the model-generated next token given different subsets of the input. 5

Methods based on adding or removing single tokens have been used for evaluation before (Nguyen, 2018). However, for summarization, showing the model partial or ungrammatical inputs may significantly alter the model's behavior. To address this, we use four methods to evaluate under a range of conditions, where in each case the model has a specific budget. Our conditions are:
1. DISPTOK selects n tokens as the input.
2. RMTOK shows the document with n tokens masked instead of deleted. 6
3. DISPSENT selects n sentences as the input, based on cumulative attribution over the sentence.
4. RMSENT removes n sentences from the document as the input.
Table 5 shows examples of these methods applied to the examples from Figure 1. These highlight the impact of key tokens in certain generation cases, but not all.

5 The full model is not a strict bound on this; restricting the model to only see salient content could actually increase the probability of what was generated. However, because we have limited ourselves to CTX examples and are aggregating across a large corpus, we do not observe this in our metrics.
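The two token-level conditions can be sketched as input-construction helpers; `disp_tok`, `rm_tok`, and the `<mask>` symbol are illustrative stand-ins (the real setting also extends selections by a context window, as described in Appendix C):

```python
MASK = "<mask>"

def disp_tok(doc_tokens, selected_idx):
    """DISPTOK: show only the n selected tokens, in document order."""
    return [doc_tokens[i] for i in sorted(selected_idx)]

def rm_tok(doc_tokens, selected_idx):
    """RMTOK: show the full document with the n selected tokens masked
    rather than deleted, preserving positions."""
    sel = set(selected_idx)
    return [MASK if i in sel else tok for i, tok in enumerate(doc_tokens)]
```

The sentence-level conditions (DISPSENT/RMSENT) work analogously but select or drop whole sentences, guaranteeing grammatical input.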
We describe the details of how we feed or mask the tokens in the TOK settings in Appendix C. The sentence-level methods are guaranteed to return grammatical input. Token-based evaluation is more precise, which helps locate the exact feature tokens, but the trade-off is that the input is not fully natural.

Methods
We use two baseline methods: Random, which randomly selects tokens or sentences to display or remove, and Lead, which selects tokens or sentences according to document position, along with several attribution methods from prior work. Occlusion (Zeiler and Fergus, 2014) involves iteratively masking each token or removing each sentence in the document and measuring how the prediction probability of the target token changes. Although attention has been questioned (Jain and Wallace, 2019), it still has some value as an explanation technique (Wiegreffe and Pinter, 2019; Serrano and Smith, 2019). We pool the attention heads from the last layer of the Transformer inside our models, ignoring special tokens like SOS.
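Token-level occlusion as described above can be sketched as follows, with a stub `target_prob` in place of the real summarization model's P(y_t | input):

```python
def occlusion_scores(doc_tokens, target_prob, mask="<mask>"):
    """Occlusion attribution: mask each token in turn and score it by the
    drop in the target token's probability. `target_prob` stands in for a
    call to the summarizer with the (possibly occluded) document."""
    base = target_prob(doc_tokens)
    scores = []
    for i in range(len(doc_tokens)):
        occluded = doc_tokens[:i] + [mask] + doc_tokens[i + 1:]
        scores.append(base - target_prob(occluded))
    return scores
```

Note that this requires one forward pass per token, which is what makes occlusion expensive relative to the gradient-based methods below it in the ranking.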
Finally, we use two gradient-based techniques (Bastings and Filippova, 2020). Input Gradient is a saliency-based approach taking the gradient of the target token with respect to the input and multiplying by the input feature values. Integrated Gradients (Sundararajan et al., 2017) computes gradients of the model input at a number of points interpolated between a reference "baseline" (typically an all-MASK input) and the actual input; this approximates a path integral of the gradient.

Figure 3: Four-way evaluation for our content attribution methods. The reported value is the NLL loss with respect to the predicted token. Lower is better for display methods and higher is better for removal methods (we "break" the model more quickly). n = 0 is the baseline where no token or sentence is displayed in DISP or removed or masked in RM.
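The integrated gradients computation can be sketched on a single input vector. This is a generic sketch, not the paper's implementation: `grad_fn` stands in for a backward pass through the model, and a Riemann sum approximates the path integral (its completeness property — attributions summing to f(x) − f(baseline) — makes it easy to sanity-check):

```python
def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Approximate IG attributions for one input vector: average the
    gradient along the straight path from `baseline` to `x`, then scale
    elementwise by (x - baseline). `grad_fn(v)` returns the gradient of
    the model output at point v."""
    n = len(x)
    total = [0.0] * n
    for k in range(1, steps + 1):
        point = [b + (k / steps) * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        total = [t + gi for t, gi in zip(total, g)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]
```

In the summarization setting, x would be the source token embeddings and the baseline an all-MASK input, with attributions summed over embedding dimensions per token.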

Attribution Aggregation for Sentence-level Evaluation
We have described the six methods we use for token-level evaluation. To evaluate these methods on the sentence-level benchmark, we aggregate the attributions within each sentence: attr(s i ) = (1/d) Σ_j attr(w j ), where the sum runs over the d tokens w j of sentence s i . Hence we can obtain a ranking of sentences by their aggregated attribution scores.
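The aggregation step can be sketched as follows; the span representation and function name are our own illustration:

```python
def rank_sentences(token_scores, sentence_spans):
    """Average token-level attributions attr(w_j) over the d tokens of each
    sentence, then return sentence indices ranked by aggregate score."""
    agg = []
    for sent_id, (start, end) in enumerate(sentence_spans):
        scores = token_scores[start:end]
        agg.append((sum(scores) / len(scores), sent_id))
    return [sent_id for _, sent_id in sorted(agg, reverse=True)]
```

This ranking is what feeds the DISPSENT and RMSENT conditions with a sentence budget.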

Results
In Figure 3, we show the token-level and sentence-level comparison of the attribution methods on the CTX examples in XSum. IntGrad is the best technique overall, with InpGrad achieving similar performance. Interestingly, occlusion underperforms other techniques when more tokens are removed, despite our evaluation being based on occlusion; this indicates that single-token occlusion is not necessarily the strongest attribution method. We also found that all of these give similar results, regardless of whether they present the model with a realistic input (sentence removal) or potentially ungrammatical or unrealistic input (isolated tokens added/removed).
Our evaluation protocol shows better performance from gradient-based techniques. The combination of four settings tests a range of counterfactual inputs to the model and increases our confidence in these conclusions.

Case Study: Sentence Fusion
We now present a case study of the sort of analysis that can be undertaken using our two-stage interpretation method. We conduct an analysis driven by sentence fusion, a particular class of CTX-Hd cases. Sentence fusion is an exciting capability of abstractive models that has been studied previously (Barzilay and McKeown, 2005; Thadani and McKeown, 2013; Lebanoff et al., 2019, 2020). We broadly identify cases of cross-sentence information fusion by first finding cases in CTX-Hd where max(P sent ) < 0.5, but two sentences combined enable the model to predict the word. We search over all (m choose 2) combinations of sentences (where m is the total number of sentences) and run the S part model on each pair of sentences. We identify 16.7% and 6.0% of cases in CNN/DM and XSum, respectively, where conditioning on a pair of sentences increases the probability of the model's generation by at least 0.5 over any sentence in isolation.
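The pairwise search can be sketched as follows; `prob_given` is a hypothetical stand-in for running S part on a subset of sentences and reading off the generated token's probability:

```python
from itertools import combinations

def find_fusion_pairs(sentences, prob_given, gain=0.5):
    """Search all (m choose 2) sentence pairs; keep those where
    conditioning on the pair beats the best single sentence by >= `gain`,
    indicating cross-sentence information fusion."""
    best_single = max(prob_given((s,)) for s in sentences)
    pairs = []
    for a, b in combinations(sentences, 2):
        if prob_given((a, b)) - best_single >= gain:
            pairs.append((a, b))
    return pairs
```

The quadratic number of pairs is tolerable at document scale (m sentences means m(m−1)/2 extra model calls per decoding step analyzed).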
In Table 6, we show two examples of sentence fusion on XSum in this category, additionally analyzed using the DISPSENT attribution method. In the first example, typical in XSum, the model has to predict the event name UCI without actually seeing it. The model's reasoning appears distributed over the document: it consults entity and event descriptions like world champion and France, perhaps to determine this is an international event. In the second example, we see the model again connects several pieces of information. The generated text is factually incorrect: the horse is retiring, and not Dujardin. Nevertheless, this process tells us some things that are going wrong (the model disregards the horse in the generation process), and could potentially be useful for fine-grained factuality evaluation using recent techniques (Tian et al., 2019;Kryscinski et al., 2020;Goyal and Durrett, 2020;Maynez et al., 2020). The majority of the "fusion" cases we investigated actually reflect content selection at the beginning of the generation. Other cases we observe fall more cleanly into classic sentence fusion or draw on coreference resolution.

Related Work
Model interpretability for NLP has been intensively studied in the past few years (Ribeiro et al., 2016; Alvarez-Melis and Jaakkola, 2018; Jacovi et al., 2018; Chen et al., 2020a; Jacovi and Goldberg, 2020; DeYoung et al., 2020; Pruthi et al., 2020; Ye et al., 2021). However, many of these techniques are tailored to classification tasks like sentiment analysis. For post-hoc interpretation of generation, most work has studied machine translation (Ma et al.; Voita et al., 2020). Ma et al. focus on evaluating explanations by finding surrogate models that are similar to the base MT model; this is similar to our evaluation approach in Section 5, but involves an extra distillation step. Compared to Voita et al. (2020), we are more interested in highlighting how and why changes in the source article will change the summary (counterfactual explanations).
To analyze summarization more broadly, Xu et al. (2020a) provide a descriptive analysis of models via uncertainty. Previous work (Kedzie et al., 2018; Kryscinski et al., 2019; Zhong et al., 2019) has conducted comprehensive examinations of the limitations of summarization models. Filippova (2020) ablates model input to control the degree of hallucination. Miao et al. (2021) improve the training of MT models by comparing the predictions of an LM and an MT model.
Finally, this work has focused chiefly on abstractive summarization models. We believe interpreting extractive (Liu and Lapata, 2019) or compressive (Xu and Durrett, 2019;Xu et al., 2020b;Desai et al., 2020) models would be worthwhile to explore and could leverage similar attribution techniques, although ablation does not apply as discussed here.

Recommendations & Conclusion
We recommend a few methodological takeaways that can generalize to other conditional generation problems as well.
First, use ablation to analyze generation models. While removing the source forms inputs not strictly on the data manifold, ablation was remarkably easy, robust, and informative in our analysis. Constructing our maps only requires querying three models with no retraining required.
Second, to understand an individual decision, use feature attribution methods on the source only. Including the target context often muddies the interpretation since recent words are always relevant, but looking at attributions over the source and target together doesn't accurately convey the model's decision-making process.
Finally, to probe attributions more deeply, consider adding or removing various sets of tokens. The choice of counterfactuals to explain is an illposed problem, but we view the set used here as realistic for this setting (Ye et al., 2021).
Taken together, our two-step framework allows us to identify generation modes and attribute generation decisions to the input document. Our techniques shed light on possible sources of bias and can be used to explore phenomena such as sentence fusion. We believe these pave the way for future studies of targeted phenomena, including fusion, robustness, and bias in text generation, through the lens of these interpretation techniques.

A Validity of Decoder-Only Model in S ∅ Setting
We use an off-the-shelf BART summarization model as the decoder-only model for the ablation study. To verify the validity of using the off-the-shelf model in this way, we also fine-tuned a BART language model where the encoder input is empty and the decoding target is the reference summary. We compare this model's output with the S ∅ output in the paper. For 55% of cases, the top-1 predictions of these two models agree with each other. This is quite high, and suggests that S ∅ is at least behaving reasonably. Note that fine-tuning will probably give rise to different behavior on the 70% of CTX cases, since S ∅ will hallucinate differently than the newly fine-tuned model (which further suggests why our analysis should focus on S ∅ ).

B Examples of PT
We present more examples of bias from the pretrained language model on CNN/DM in Table 9. In Table 3 we have shown the cases where the memorized phrases are proper nouns or nouns. Here we provide examples of other types like function words. The memorization of function words like with or and can be challenging to spot using other means due to their ubiquity.

C Implementation Details for TOK
We rank attribution scores over subword tokens rather than words. However, to provide necessary context for DISPTOK and to avoid information leakage in RMTOK, we extend the selection by a context window that collects neighboring word pieces. We illustrate how the token budget is filled with an example.
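The budget-filling procedure can be sketched as follows. This is a minimal illustration, not the exact implementation: we assume the attribution method has already ranked subword positions, and we greedily expand each selected position by its context window until the budget n is exhausted.

```python
def expand_with_context(ranked_indices, budget, window, seq_len):
    """Take top-ranked subword positions in order, expanding each by
    `window` neighbors on both sides, until `budget` positions are selected."""
    selected = []
    for idx in ranked_indices:
        for j in range(max(0, idx - window), min(seq_len, idx + window + 1)):
            if j not in selected:
                selected.append(j)
            if len(selected) == budget:
                return sorted(selected)
    return sorted(selected)

# Toy sequence of 7 subword pieces; positions 0 and 4 rank highest.
# With window = 1 and budget n = 4, each pick pulls in its neighbors.
print(expand_with_context([0, 4], budget=4, window=1, seq_len=7))  # [0, 1, 3, 4]
```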

Examples from XSum
Hundreds of people have attended a memorial service in Liverpool.
Two code violations for Nicolas Almagro and Pablo Cuevas at the Australian Open were described as disgraceful.
In our series of letters from African journalists film maker and columnist Farai Sevenzo looks at the challenges facing Nigeria's President Muhammadu Buhari.
Four people have been arrested after a BBC Panorama investigation uncovered shocking abuse at a private hospital.
West Indies Shabnim Ishaq has been ruled out of the rest of the Women's World Cup.

Examples from CNN/DM
In the worst cases, doctors have reported patients showing up because they were hungover, their false nails were hurting or they had paint in their hair. More than four million visits a year are unnecessary and cost the NHS £290million annually.
Elski Felson of Los Angeles, California, decided to apply for a Community Support Specialist role at Snapchat via the social media app. In just over three minutes, the tech enthusiast created a video resume.
Chelsea supporters have been involved in the highest number of reported racist incidents as they travelled to and from matches on trains. The information, gathered from 24 police forces across the country, shows there have been over 350 incidents since 2012.
Kris-Deann Sharpley was on maternity leave and had just given birth to her first child. Her body was found in the bathroom of her father's home.

(Continuing the example from Appendix C:) In this example, "Bur" receives the highest attribution score and "new" the second-highest. We use a context window of size 1 and a budget of n = 4 tokens. In DISPTOK, the input will be "〈sos〉Burberry, on new〈eos〉"; in RMTOK, the input will be "〈sos〉## bets## branding〈eos〉", where # stands for the MASK token. If n = 5, branding will additionally be added or masked.

D Efficient Two-Stage Selection Model
For long documents in summarization, attribution methods can be computationally expensive. Occlusion requires one inference pass for each token in the input document, and gradient-based methods store gradients, requiring substantial GPU memory when the document is long. These techniques spend time and memory on words that have little impact on the generation. To improve their efficiency, we propose an alternative where we first run sentence-level presence probing on the full document, and then run attribution methods locally on the top-k sentences. We call the proposed model S+[method], where method can be any attribution method, including occlusion, attention, InpGrad, and IntGrad.
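The two-stage pipeline can be sketched as below. This is a simplified illustration under stated assumptions: `score_sentence` and `score_token` are hypothetical stand-ins for running the summarization model with a sentence (or token) occluded and measuring the drop in the predicted token's probability; in the real model these would each require a forward pass.

```python
def two_stage_attribution(sentences, score_sentence, score_token, k=2):
    """Stage 1: score each sentence by occlusion and keep the top-k.
    Stage 2: run token-level attribution only inside those sentences."""
    sent_scores = [(i, score_sentence(i)) for i in range(len(sentences))]
    top_k = sorted(sent_scores, key=lambda x: -x[1])[:k]
    token_scores = {}
    for sent_idx, _ in top_k:
        for tok_idx, _tok in enumerate(sentences[sent_idx]):
            token_scores[(sent_idx, tok_idx)] = score_token(sent_idx, tok_idx)
    return token_scores

# Toy document: three "sentences" of subword tokens.
doc = [["a", "b"], ["c"], ["d", "e", "f"]]
scores = two_stage_attribution(
    doc,
    score_sentence=lambda i: [0.1, 0.9, 0.5][i],  # dummy: sentence 1 most influential
    score_token=lambda si, ti: si + 0.1 * ti,     # dummy token-level scorer
    k=2,
)
# Only tokens inside the top-2 sentences (indices 1 and 2) receive scores.
```

Because stage 2 touches only k sentences rather than the whole document, the n² self-attention cost is replaced by the much smaller per-sentence d² cost.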
We define our notation as follows: s, n, and d are the number of sentences, the number of tokens in the document, and the number of tokens per sentence, respectively. For the occlusion method, we run inference s times to pre-select important sentences, each pass costing O(d²) due to self-attention. Token-level attribution is then applied to only one or a few sentences, so its cost is O(k × d² × d), where k is the number of top sentences used for attribution. In our experiments, we set k = 2 and n ≤ 500. Compared to the complexity of the regular model, O(n³), the complexity of the two-stage model is only O(s × d² + k × d² × d).
In Table 7 we compare the complexity as well as the actual run time and memory usage. We batch the occlusion operation with a batch size of 100. We observe a huge reduction in running time and a significant drop in memory usage.
Takeaway A two-stage selection model is much more efficient, yielding a 97% running time reduction on the occlusion method. The downside of this method is that it only produces single-sentence attributions, and so isn't appropriate in cases involving sentence fusion.
Following Vaswani et al. (2017), we compare the complexity of all methods in Table 12. Here n is the number of tokens in the document, d the number of tokens per sentence, s the number of sentences in the document, and r the number of steps in the integral approximation of Integrated Gradients; bp denotes the cost of one back-propagation pass for gradient-based methods. We list the complexity of the original methods in the middle column and the sentence-based pre-selection variants in the right column. The base cost of the sentence pre-selection model is running the sentence selection model s times, i.e., O(s × d²). The n² and d² terms originate from the quadratic self-attention operation in Transformer models. We ignore the number of layers and other model-specific hyperparameters, since all of the methods share the same model.

E Four Way Evaluation
Due to space limitations, we only show the plot of the four-way evaluation in Figure 3. To enable future comparisons on the proposed evaluation protocol, we also include the detailed results in Table 10 and Table 11 for TOK and SENT evaluation. ∆ measures how much the average performance deviates from the original baseline. We abstract each evaluation method as a function eval whose input is the text and a budget n and whose output is the predicted loss:

∆ = Avg_i(eval(i)) − eval(0)

For the TOK series, i ∈ {1, 2, 4, 8, 16}; for the SENT series, i ∈ {1, 2, 3, 4}, because a sentence carries much more information than a token. IntGrad performs the best across all of the evaluation methods.
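The ∆ statistic is straightforward to compute given per-budget evaluation results. The sketch below uses a toy eval function purely for illustration; in practice eval(i) is the predicted loss at budget i from Tables 10 and 11.

```python
def delta(eval_fn, budgets):
    """Delta = average predicted loss over the given budgets, minus the loss at budget 0."""
    return sum(eval_fn(i) for i in budgets) / len(budgets) - eval_fn(0)

tok_budgets = [1, 2, 4, 8, 16]   # budgets used for the TOK series
sent_budgets = [1, 2, 3, 4]      # budgets used for the SENT series

# Toy eval function (hypothetical): loss shrinks as the budget grows,
# so a good attribution method yields a negative delta.
toy_eval = lambda n: 1.0 / (1 + n)
print(delta(toy_eval, tok_budgets) < 0)   # True
```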