Focus Attention: Promoting Faithfulness and Diversity in Summarization

Professional summaries are written with document-level information, such as the theme of the document, in mind. This is in contrast with most seq2seq decoders which simultaneously learn to focus on salient content, while deciding what to generate, at each decoding step. With the motivation to narrow this gap, we introduce Focus Attention Mechanism, a simple yet effective method to encourage decoders to proactively generate tokens that are similar or topical to the input document. Further, we propose a Focus Sampling method to enable generation of diverse summaries, an area currently understudied in summarization. When evaluated on the BBC extreme summarization task, two state-of-the-art models augmented with Focus Attention generate summaries that are closer to the target and more faithful to their input documents, outperforming their vanilla counterparts on ROUGE and multiple faithfulness measures. We also empirically demonstrate that Focus Sampling is more effective in generating diverse and faithful summaries than top-k or nucleus sampling-based decoding methods.


Introduction
Document summarization, producing a shorter version of a document while preserving salient information (Mani, 2001; Nenkova and McKeown, 2011), is challenging even for humans. Today, systems can generate summaries with a high level of fluency and coherence. This is due to recent advances such as sequence-to-sequence architectures (seq2seq) with attention and copy mechanism (Hochreiter and Schmidhuber, 1997; Bahdanau et al., 2015; Gu et al., 2016), fully attention-based Transformer architectures (Vaswani et al., 2017), and large pretrained language models (Devlin et al., 2019; Radford et al., 2018; Yang et al., 2019; Dong et al., 2019a; Song et al., 2019; Lewis et al., 2019; Rothe et al., 2020; Raffel et al., 2019; Zhang et al., 2019).

A
GOLD: Australia has expelled an Israeli diplomat saying Israel was behind the forging of Australian passports linked to the murder of a Hamas operative in Dubai.
PEGASUS: Australia has expelled an Israeli diplomat after concluding that forged Australian passports used in the killing of a Hamas militant in Dubai were issued by Israel.
Our PEGFAME model: The Australian government has expelled an Israeli diplomat over the use of forged Australian passports in the killing of a Hamas militant in Dubai.

B
PEGASUS with Top-k Sampling
Israel has summoned the Australian ambassador to complain after the Australian government said forged passports used in the killing of a Hamas operative in Dubai belonged to Netanyahu's foreign ministry.
The Australian government has ordered Israel to withdraw an officer over the use of forged Australian passports used by the 2013 murder of a Lebanese opposition figure in Dubai.

PEGASUS with Nucleus Sampling
Israel hasracuse withdrawn an envoy after the Australian government said it concluded that Israeli agents used forged passports used to kill a Dubai Bendigo businessman.
The Australian government has recalled an Israeli diplomat over accusation that fake Australian passports used 436 kilometres (300 miles) from Canberra in the death of a Hamas militant were stolen by Israeli agents.

C
Our PEGFAME model with novel Focus Sampling
Australia has expelled an Israeli diplomatic staff after accusing the country's security agency, the Israeli military's intelligence agency, of being responsible for the use of Australian visas used in the killing of a Palestinian.
The Australian government has expelled an Israeli diplomatic staff after it said the country was responsible for the use of Australian visas used in the killing of a Palestinian in the Middle East.

Figure 1: Block A shows the best predictions from PEGASUS and our PEGFAME (PEGASUS with FAME) model, along with the GOLD summary for an XSUM article. Block B presents diverse summaries generated from PEGASUS using top-k and nucleus sampling. Block C shows diverse summaries generated using our PEGFAME model with Focus sampling. The text in orange is not supported by the input article.

However, in terms of summary quality, many challenges remain. For example, generating summaries that are faithful to the input is an unsolved problem (Kryscinski et al., 2020; Maynez et al., 2020; Gabriel et al., 2020). Furthermore, there can be multiple equally good summaries per source document. Neural generation models fail to account for this and tend to generate outputs with low diversity due to standard likelihood training, approximate decoding objectives, and lack of high-quality multi-reference datasets (Fan et al., 2018; Kulikov et al., 2019; Freitag et al., 2020; Choi et al., 2020). Not much attention has been given to the generation of diverse yet faithful summaries, two goals that are often challenging to achieve simultaneously (Hashimoto et al., 2019); a model can produce diverse outputs through sampling (Fan et al., 2018; Holtzman et al., 2020), but at the cost of quality.
In this paper we introduce a Focus Attention MEchanism (or FAME) to transformer-based seq2seq architectures. FAME is inspired by how humans write summaries. Specifically, FAME aims to perform source-side planning to focus the summary on supported and topical content. FAME achieves this through a novel technique which augments standard contextual representations with a dynamic source-conditioned vocabulary biasing layer. We present the following experimental findings:
FAME promotes summaries faithful to the source. When evaluated on the BBC extreme summarization task (XSUM; Narayan et al., 2018), experiments with two state-of-the-art summarizers, ROBERTAS2S (Rothe et al., 2020) and PEGASUS (Zhang et al., 2019), show that both models generate summaries that are more faithful to their input documents when augmented with FAME, in comparison with their vanilla counterparts. Faithfulness is measured through a variety of previously proposed metrics. In addition, we leverage the manually annotated document-summary pairs for faithfulness from Maynez et al. (2020) and train a scorer which serves as an efficient proxy for expensive human evaluations. We call this metric BERTFaithful.
FAME enables diverse summaries. FAME, by design, supports Focus Sampling, a technique that is more effective than other sampling methods (Fan et al., 2018; Holtzman et al., 2020) at sampling topically relevant tokens to generate diverse, yet topically consistent and faithful outputs. Figure 1 illustrates how focus sampling generates better summaries than other sampling methods. We demonstrate the effectiveness of our new Focus Sampling technique using a variety of existing diversity and faithfulness measures. Empirically, we find that optimizing for high diversity often comes at the cost of faithfulness. Thus FAME provides a mechanism for trading off high faithfulness with better diversity in summarization.

Related Work
Task-Specific Architectural Priors. Several works enhance seq2seq architectures with task-specific priors. Pointer-generator style models (See et al., 2017; Xu et al., 2020) can accurately generate mostly extractive summaries by copying words from the source text via pointing. Text editing models (Malmi et al., 2019; Dong et al., 2019b; Mallinson et al., 2020) cast text generation as a sequence tagging problem with carefully selected edit operations required for the task. Others focus on improving content selection to better constrain the model to likely input phrases (Gehrmann et al., 2018) or on improving the representation of relevant input tokens. Instead of directly modeling such priors, FAME learns the theme of the document through dynamic vocabulary biasing. Thus, FAME can be seen as a generalization of pointer-generator or text-editing models via soft vocabulary learning. In fact, our FAME models achieve state-of-the-art results on text-editing tasks (Appendix C).
Topic-Aware Generation Models. The idea of capturing document-level semantic information has been widely explored in the summarization community. Barzilay and Elhadad (1997) use WordNet (Fellbaum, 1998) to model a text's content relative to a topic based on lexical chains. Lin and Hovy (2000) propose to learn topic signatures for summarizing documents. Recently, document-level topic information has been used for improving neural language models (Mikolov and Zweig, 2012; Ghosh et al., 2016; Dieng et al., 2017; Karmaker Santu et al., 2019), neural response generators (Xing et al., 2017; Dziri et al., 2019), and, not surprisingly, neural summarizers (Narayan et al., 2018; Ailem et al., 2019; Wang et al., 2020c). Both Narayan et al. (2018) and Ailem et al. (2019) use a pretrained Latent Dirichlet Allocation (LDA; Blei et al., 2003) model, whereas Wang et al. (2020c) use Poisson factor analysis (Zhou et al., 2012), to synthesize topic vectors for the input. Instead, we dynamically learn a target-induced topic distribution for the input under the assumption that the human-written summary is a good proxy for the input document.
Faithful Generation Models. Cao et al. (2017) force faithful generation by conditioning on both source text and extracted fact descriptions from the source text. Song et al. (2020) propose to jointly generate a sentence and its syntactic dependency parse to induce grammaticality and faithfulness. Tian et al. (2019) learn a confidence score to ensure that the model attends to the source whenever necessary. Wang et al. (2020d) introduce new input-output matching and embedding similarity losses to alleviate hallucination issues. Yet, the task of generating text that is consistent with the input remains an open problem (Gabriel et al., 2020).
Diverse Generation Models There has been a surge of interest in making language models generate more diverse and human-like outputs. Vijayakumar et al. (2018) and Kulikov et al. (2019) diversify beam search, using a task-specific scoring function, or constrain beam hypotheses to be sufficiently different. Others avoid text degeneration by truncating the unreliable tail of the probability distribution at each decoding step, either by sampling from the top-k tokens (Top-k Sampling;Fan et al., 2018) or by sampling from a dynamic nucleus of tokens with the bulk of the probability mass (Nucleus Sampling;Holtzman et al., 2020). Others modify the training objective to make the distribution sparse (Martins et al., 2020) or assign lower probability to unlikely generations (Welleck et al., 2019a).
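As an illustration of the two truncation strategies described above, the following is a minimal sketch in plain Python. It is not taken from any of the cited implementations: the function names, the toy logits, and the choice to renormalize the kept probability mass are assumptions made for this example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their mass."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


def nucleus_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


# Toy next-token logits over a six-token vocabulary (hypothetical values).
probs = softmax([4.0, 3.0, 2.0, 0.5, 0.1, -1.0])
rng = random.Random(0)
top2 = top_k_filter(probs, k=2)          # fixed-size truncation
nucleus = nucleus_filter(probs, p=0.9)   # size adapts to the distribution
token = sample(nucleus, rng)
```

Both filters are applied anew at every decoding step, which is the property the comparison with focus sampling later in the paper relies on.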
For conditional text generation, most work focuses on generating diverse questions (Narayan et al., 2016; Dong et al., 2017; Sultan et al., 2020; Wang et al., 2020b) or paraphrases (Li et al., 2016b; Dai et al., 2017; Xu et al., 2018; Cao and Wan, 2020). Following Gehrmann et al. (2018), Cho et al. (2019) use a mixture of experts to sample different binary masks on the source sequence for diverse content selection for summarization. Our focus sampling is similar to top-k and nucleus sampling methods in that it truncates the tail of the probability distribution. However, instead of truncating it at each decoding step, it biases the decoder proactively to generate output from a set of tokens which are topically relevant to the input.

Summarization with Focus Attention
Given an input document $X_{1:n}$, we aim to generate its summary $Y_{1:m}$, where $n$ and $m$ are input and output sequence lengths. We address this problem using seq2seq architectures with Transformer encoder and decoder, augmented with FAME, as depicted in Figure 2. FAME learns a distribution $t_{x_i}$ for each input token $x_i$ over the vocabulary, measuring similarity of $x_i$ (in context) to the tokens in the vocabulary. The vocabulary distributions, $t_{x_i}$, for all $x_i$ are combined to form a dynamic vocabulary bias that is added to the decoder logits. This mechanism enhances the conditioning on the input source and encourages the decoder to generate tokens that are topically similar to the input.
Figure 2: A Transformer-based encoder-decoder architecture with FAME.
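Transformer-based seq2seq Model. The encoder uses BERT Transformer layers with multi-headed self-attention to encode $X$ to a vector sequence $\mathbf{X} = \mathbf{x}_1, \ldots, \mathbf{x}_n$, with $\mathbf{x}_i \in \mathbb{R}^h$, where $h$ is the size of the hidden representation. The decoder uses an identical architecture, except that at decoding step $t$, layer $l$ adds a conditional representation $\mathbf{y}^l_t \in \mathbb{R}^h$ for the token $y_t$ by attending to the output representation $\mathbf{Y}^{l-1}_{1:t-1} = \mathbf{y}^{l-1}_1, \ldots, \mathbf{y}^{l-1}_{t-1}$ generated so far through self-attention and by attending to the input contextual representation $\mathbf{X}$ through encoder-decoder attention. The probability of predicting the next token $y_t$ from a vocabulary $V$ is:
$$p(y_t \mid Y_{1:t-1}, X; \theta) = \mathrm{softmax}(\mathbf{y}^L_t E), \tag{1}$$
where $\mathbf{y}^L_t$ is the representation from the final decoder layer $L$ and $E \in \mathbb{R}^{h \times |V|}$ is the embedding matrix. The model parameters $\theta$ are trained by minimizing cross-entropy at each decoding step:
$$L_{\mathrm{MLE}}(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \log p(\hat{y}_i \mid \hat{Y}_{1:i-1}, X; \theta),$$
where $\hat{Y}_{1:m}$ is the human-written summary.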
Focus Attention MEchanism (FAME). It is challenging for a decoder to obtain all relevant information from the conditional representation $\mathbf{y}^L_t$ to learn the vocabulary output logits such that predictions $y_t$ are consistent with the input. Other modeling factors, specifically the decoder language model, can overwhelm model predictions. FAME (Figure 2) addresses this by introducing a short-circuit from the source to the vocabulary output logits via a source-conditioned bias on vocabulary items.
We take the encoder representation $\mathbf{X} = \mathbf{x}_1, \ldots, \mathbf{x}_n$ and learn a Token-level Vocabulary Distribution $t_{x_i} = \mathrm{gelu}(\mathbf{x}_i W_1) W_2 E \in \mathbb{R}^{|V|}$ for each token $x_i$ in the input sequence $X$. $t_{x_i}$ measures the contextual similarity of the input token $x_i$ to the tokens in the vocabulary; $W_1 \in \mathbb{R}^{h \times h'}$ and $W_2 \in \mathbb{R}^{h' \times h}$ are parameters of newly introduced dense layers, and $h'$ is the intermediate filter size. We define a Source-conditioned Vocabulary Distribution $t_X = \frac{1}{n} \sum_{i=1}^{n} t_{x_i} \in \mathbb{R}^{|V|}$ as the average of the token-level vocabulary distributions for tokens present in the input sequence $X$, capturing the similarity of $X$ to the tokens in the vocabulary.
Let $a^L_t \in \mathbb{R}^n$ be the encoder-decoder attention distribution over the source tokens for the output token $y_t$ at the final decoder layer $L$. We use $a^L_t$ to produce a weighted sum of the token-level vocabulary distributions to compute a dynamic vocabulary bias, or Focus Bias, $f_t = \sum_{i=1}^{n} a^L_{t,i}\, t_{x_i} \in \mathbb{R}^{|V|}$ at decoding step $t$. We modify the probability of predicting the next token $y_t$ from a vocabulary $V$ as:
$$p(y_t \mid Y_{1:t-1}, X; \theta) = \mathrm{softmax}(\mathbf{y}^L_t E + f_t). \tag{2}$$
We call this the Focused Probability Distribution; it modifies the output logits dynamically to put more focus on those tokens in the vocabulary which are similar to the attended tokens in $X$. The focus bias introduces a human-inspired control to the model where we do not generate the output in a fully abstractive manner (as in Eq. (1)), but we proactively generate output tokens that are similar to the input tokens (as in Eq. (2)).
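The computation described in this section can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the implementation used for the experiments: the matrices are tiny hypothetical values, the helper names are invented for this example, GELU uses the tanh approximation, and the focus bias is added to the logits before the softmax, as the text describes.

```python
import math


def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def vecmat(v, M):
    """Row vector v (length a) times matrix M (a x b)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def token_vocab_scores(x_i, W1, W2, E):
    """t_{x_i} = gelu(x_i W1) W2 E: one score per vocabulary item."""
    hidden = [gelu(u) for u in vecmat(x_i, W1)]   # h -> h'
    return vecmat(vecmat(hidden, W2), E)          # h' -> h -> |V|


def focus_bias(attention, T):
    """f_t = sum_i a_{t,i} * t_{x_i}, with attention weights over source tokens."""
    return [sum(a * row[j] for a, row in zip(attention, T)) for j in range(len(T[0]))]


def focused_distribution(y_t, E, f_t):
    """softmax(y_t E + f_t): decoder logits shifted by the focus bias."""
    logits = vecmat(y_t, E)
    return softmax([l + b for l, b in zip(logits, f_t)])


# Toy sizes: h = 2, h' = 3, |V| = 4, n = 2 (all values hypothetical).
X = [[1.0, 0.0], [0.0, 1.0]]                        # encoder outputs
W1 = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]            # h x h'
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]           # h' x h
E = [[2.0, 0.0, 1.0, -1.0], [0.0, 2.0, 1.0, -1.0]]  # h x |V|

T = [token_vocab_scores(x_i, W1, W2, E) for x_i in X]
t_X = [sum(col) / len(T) for col in zip(*T)]        # source-conditioned average
attention = [0.9, 0.1]                              # a_t at the final decoder layer
f_t = focus_bias(attention, T)
y_t = [0.1, 0.1]                                    # final-layer decoder state
probs = focused_distribution(y_t, E, f_t)
```

In this toy setting the decoder state alone leaves the first three vocabulary items tied, and the bias from the strongly attended first source token breaks the tie in favour of the vocabulary item most similar to it.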

Summary-induced Topic Focused Distribution
We aim to guide our focus bias $f_t$ to be a better representative of the topical content relevant for the task. We achieve this by using the human-written summary $\hat{Y}$ as a proxy for the topical content of the input and imposing a prior on the source-conditioned vocabulary distribution $t_X$ (Eq. (3)). We further refine Eq. (3) by replacing $\hat{Y}$ with $\hat{Y}_c = \hat{Y} - F$, where $F$ is a set of the $|F|$ most frequent tokens in the vocabulary, to improve focus on content words. Our final loss function (Eq. (4)) provides joint supervision, where $\lambda$ is a hyperparameter. By enforcing $t_X$ to be a topic distribution for the input $X$, we encourage the focus bias $f_t$ to promote topically relevant tokens, and subsequently generate topically consistent outputs. Importantly, our focus bias with target-induced topic distribution is task-agnostic and less vulnerable to reference divergence issues (Dhingra et al., 2019; Maynez et al., 2020), and can learn any property embodied in the target relevant for the task. For example, depending on the task, $f_t$ can learn to favour input tokens (e.g., for mostly extractive summaries) or new tokens (e.g., for mostly abstractive summaries). This is in sharp contrast to models that introduce task-specific priors, e.g., the pointer-generator network (See et al., 2017), which can copy words from the source text but does not do well on extreme summarization, which is highly abstractive in nature (Narayan et al., 2018).
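To make the idea of a summary-induced prior concrete, the sketch below shows one plausible way to supervise $t_X$ with the content tokens of the reference summary. It is an assumption-laden illustration rather than the paper's exact objective: the binary cross-entropy form, the sigmoid on each score, the convex combination weighted by `lam`, and all function names are hypothetical choices made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def topic_loss(t_X, summary_ids, frequent_ids=frozenset()):
    """Assumed form: binary cross-entropy between sigmoid(t_X) and a bag-of-tokens target.

    Vocabulary items in the summary (minus the frequent tokens F) are positives;
    every other vocabulary item is a negative.
    """
    targets = set(summary_ids) - set(frequent_ids)   # content tokens only
    eps = 1e-12                                      # guards against log(0)
    total = 0.0
    for v, score in enumerate(t_X):
        p = sigmoid(score)
        total += -math.log((p if v in targets else 1.0 - p) + eps)
    return total / len(t_X)


def joint_loss(mle, topic, lam=0.5):
    """Assumed convex combination of the generation loss and the topic loss."""
    return lam * mle + (1.0 - lam) * topic


# Toy vocabulary of four items; the summary uses items 0, 1 and 3,
# and item 0 is treated as a frequent (function) token.
summary_ids = [0, 1, 3]
F = {0}
good = topic_loss([-4.0, 5.0, -3.0, 4.0], summary_ids, F)   # high scores on content tokens
bad = topic_loss([4.0, -5.0, 3.0, -4.0], summary_ids, F)    # scores inverted
combined = joint_loss(mle=2.0, topic=4.0, lam=0.5)
```

Under this sketch, a topic distribution that scores the summary's content tokens highly incurs a much smaller loss than one that scores them low, which is the behaviour the prior is meant to encourage.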
Focus Sampling: Promoting Diversity in Faithful Generation. We introduce Focus Sampling with FAME to construct a subset $V_k \subseteq V$ by sampling $k$ tokens from the topic distribution $t_X$ (Focus$_{\text{sample},k}$). We then modify Eq. (2) so that the decoder generates only from tokens in $V_k$. For document summarization, the subset $V_k$ will capture topically salient tokens necessary to generate a summary; $F$ is always added to $V_k$ to ensure that the model has access to function words. By tuning the parameters of sampling, we can steer the model toward more faithful or more diverse outputs. Focus sampling has similarities to top-k (Div$_{\text{top},k}$; Fan et al., 2018) and nucleus sampling (Div$_{\text{nucleus}}$; Holtzman et al., 2020) in that they all aim to promote diversity. At each decoding step, top-k sampling diversifies the generation process by sampling a token from the top $k$ tokens in the final output distribution. Similarly, nucleus sampling samples from a dynamic nucleus of tokens containing the vast majority (with a cumulative probability $p$) of the probability distribution. Both top-k and nucleus sampling shorten the tail of the output distribution at each decoding step, whereas focus sampling constrains the decoder to use a fixed and topically relevant vocabulary $V_k$. Unlike the other two techniques, Focus$_{\text{sample},k}$ can also benefit from standard beam search decoding, leading to superior generation that is not only diverse, but also consistent with the input document.
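The following sketch illustrates the focus sampling procedure on a toy vocabulary. It rests on assumptions that the text does not pin down: tokens are drawn without replacement from a softmax over $t_X$, and the restricted distribution is renormalized over the allowed tokens. The function names and all numeric values are hypothetical.

```python
import math
import random


def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def sample_focus_vocab(t_X, k, frequent_ids, rng):
    """Sample k distinct token ids from softmax(t_X), then add the frequent tokens F."""
    probs = softmax(t_X)
    remaining = list(range(len(t_X)))
    chosen = set()
    for _ in range(min(k, len(remaining))):
        pick = rng.choices(remaining, weights=[probs[i] for i in remaining], k=1)[0]
        chosen.add(pick)
        remaining.remove(pick)       # without replacement (an assumption)
    return chosen | set(frequent_ids)


def restricted_distribution(scores, allowed):
    """Softmax over `scores` where tokens outside `allowed` get probability 0."""
    m = max(scores[i] for i in allowed)
    exps = [math.exp(s - m) if i in allowed else 0.0 for i, s in enumerate(scores)]
    total = sum(exps)
    return [e / total for e in exps]


# Toy topic scores over six vocabulary items; item 5 plays the role of F.
t_X = [3.0, 2.5, -1.0, 0.5, -2.0, 1.0]
F = {5}
rng = random.Random(0)
V_k = sample_focus_vocab(t_X, k=3, frequent_ids=F, rng=rng)

# `scores` stands in for the focused logits y_t E + f_t at one decoding step.
scores = [1.0, 0.2, 2.0, 0.1, 0.3, 0.5]
probs = restricted_distribution(scores, V_k)
```

Because $V_k$ is fixed once per sample rather than recomputed at each step, the restricted distribution can be plugged into ordinary beam search; drawing a new $V_k$ for each of several runs is what yields different summaries.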

Experimental Setup
In this section we present our experimental setup to assess the ability of our FAME models to generate faithful summaries and to demonstrate that focus sampling is more effective in generating diverse and faithful summaries than other sampling-based decoding methods.

Extreme Summarization
We evaluate FAME models on extreme document summarization (XSUM; Narayan et al., 2018). The XSUM summaries are extreme in that the documents are summarized into single-sentence summaries. These summaries demonstrate a high level of abstractiveness, and generating them automatically requires document-level inference, abstraction, and paraphrasing. Due to their extreme nature, XSUM summaries are ideal to evaluate FAME models' ability to capture the theme of the document.⁴ We use the original cased version consisting of 204,045/11,332/11,334 training/validation/test document-summary pairs. During training, the input documents are truncated to 512 tokens. The length of the summaries is limited to 64.
⁴ We further experiment with long-form story highlight generation (CNN/DM; Hermann et al., 2015) and two text editing tasks: Sentence Fusion (Geva et al., 2019) and Sentence Splitting (Botha et al., 2018). Their results can be found in Appendix B and C. Our FAME models achieve SOTA on both text-editing tasks.

Pretrained Models with FAME
We introduce FAME to two popular seq2seq architectures: RoBERTa-initialized seq2seq (ROBERTAS2S; Rothe et al., 2020) and PEGASUS (Zhang et al., 2019). We refer to ROBERTAS2S models with FAME as ROBFAME and to PEGASUS with FAME as PEGFAME.
We experiment with ROBERTAS2S-Large with shared encoder and decoder; it has 24 layers, a hidden size of 1024, filter size of 4096, 16 attention heads, and a vocabulary with 50K sentence pieces (Kudo and Richardson, 2018). ROBERTAS2S has around 455M parameters and ROBFAME has an additional 8M parameters.
The best-performing PEGASUS model from Zhang et al. (2019) is not directly comparable with ROBERTAS2S. It does not share the encoder and decoder; it has only 16 layers, a hidden size of 1024, filter size of 4096, and 16 attention heads, with a total of 568M parameters; and it uses a much larger vocabulary with 91K sentence pieces. Hence, we trained our own PEGASUS model. We use the same architecture as ROBERTAS2S and pretrain it on a mixture of C4 (Raffel et al., 2019) and HugeNews (Zhang et al., 2019) datasets with the original objective of generating salient GAP-sentences.
Our experiments focus on this newly trained PEGASUS model, which has the same number of parameters and the same vocabulary as ROBERTAS2S. But in contrast to ROBERTAS2S, the encoder-decoder attention in PEGASUS is pretrained. This allows us to analyse how focus attention affects pretrained (PEGASUS) vs randomly initialized (ROBERTAS2S) encoder-decoder attentions.

Evaluation Metrics
Lexical Overlap. We report ROUGE F1 scores (Lin and Hovy, 2003) against reference summaries; in particular, we report on ROUGE-1 and ROUGE-2 for informativeness and ROUGE-L for fluency.
Semantic Similarity. We report BERTScore (Zhang et al., 2020), which computes the contextual similarity between a candidate and its reference summary.
Faithfulness. ROUGE and BERTScore do not correlate well with faithfulness of the generated summaries (Maynez et al., 2020). Human evaluation is traditionally considered the gold standard for measuring faithfulness. But recent research has shown that even human evaluation has shortcomings (Schoch et al., 2020). Moreover, it is prohibitively expensive. This has led to the proposal of meta-evaluation metrics for various generation tasks (Durmus et al., 2020; Kryscinski et al., 2019; Sellam et al., 2020; Rei et al., 2020). We evaluate FAME models on semantic inference metrics such as textual entailment (Pasunuru and Bansal, 2018; Welleck et al., 2019b; Falke et al., 2019; Kryscinski et al., 2019) and question answering (Arumae and Liu, 2019; Wang et al., 2020a). In particular, we report the probability of a summary entailing (ent.) its input document (Maynez et al., 2020) and QA-based Feqa scores (Durmus et al., 2020). For ent. scores, we train an entailment classifier by fine-tuning a BERT-Large pretrained model (Devlin et al., 2019) on the Multi-NLI dataset (Williams et al., 2018). For Feqa, we use a fine-tuned BART (Lewis et al., 2019) language model for question generation to generate questions from the summaries, and a BERT-Base model fine-tuned on SQuAD (Rajpurkar et al., 2018) to answer the generated questions with the input document as context.⁷ In addition to ent. and Feqa, we train a scorer leveraging manually annotated document-summary pairs for faithfulness, as a surrogate for human evaluation, and call this metric BERTFaithful.⁸ In particular, we fine-tune a BERT-Base classifier on 500 manually annotated document and gold summary pairs for the XSum dataset from Maynez et al. (2020) to predict whether a summary is faithful to the input document or not.⁹ We report the percentage of summaries that were faithful ($\frac{1}{N} \sum_i \mathbb{1}[p_i(\text{faithful}) > 0.5]$) and the model's confidence to generate faithful summaries ($\frac{1}{N} \sum_i p_i(\text{faithful})$); $N$ is the total number of examples in the test set.
⁷ We used the Feqa code available here: https://github.com/esdurmus/feqa/.
⁸ A very similar scorer was used in the GEM benchmark (Gehrmann et al., 2021) to identify and extract the subset with faithful reference summaries from the XSum dataset (Narayan et al., 2018).
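The two BERTFaithful aggregates defined above reduce to a thresholded count and a mean. A minimal sketch, assuming the classifier's per-summary probabilities are already available as a list (the function name and the example probabilities are hypothetical):

```python
def faithfulness_summary(p_faithful, threshold=0.5):
    """Return (percentage of summaries judged faithful, mean faithfulness probability)."""
    n = len(p_faithful)
    pct_faithful = 100.0 * sum(1 for p in p_faithful if p > threshold) / n
    mean_confidence = sum(p_faithful) / n
    return pct_faithful, mean_confidence


# Hypothetical classifier outputs for four generated summaries.
pct, conf = faithfulness_summary([0.9, 0.2, 0.6, 0.5])
```

Note that the comparison is strict, so a probability of exactly 0.5 does not count as faithful.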
Diversity We report the number of times (out of n), a model is able to generate a completely new summary (Unique), and Distinct-N (Li et al., 2016a), measuring the lexical diversity in the generated summaries. Distinct-N is estimated as the number of distinct n-grams of order n divided by the total number of n-grams of the same order, in all generated summaries.
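The diversity measures above can be sketched as follows. Whitespace tokenization and the function names are assumptions made for this illustration; the description does not specify the tokenizer used to count n-grams.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(summaries, n):
    """Distinct n-grams divided by total n-grams, pooled over all summaries."""
    pooled = [g for s in summaries for g in ngrams(s.split(), n)]
    return len(set(pooled)) / len(pooled) if pooled else 0.0


def unique_count(samples):
    """Number of different summaries among the samples drawn for one document."""
    return len(set(samples))


# Three hypothetical samples for one input document.
samples = ["a b c", "a b c", "a b d"]
unique = unique_count(samples)
dist1 = distinct_n(samples, 1)
dist2 = distinct_n(samples, 2)
```

Here two of the three samples are identical, so only two are unique, while Distinct-1 and Distinct-2 measure how much lexical variety the pooled samples contain.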
Finally, we also report the average length of summaries (Len.), repetition errors (Rep., estimated as the percentage of summaries with at least one repetition of rare or content words), and ROUGE-1 precision against the input document (R1, P%), to better understand their quality.

Results
FAME Summaries are More Fluent, Informative and Faithful. Table 1 presents results comparing our FAME models, ROBFAME and PEGFAME, against their counterparts ROBERTAS2S and PEGASUS, respectively. Both FAME models clearly outperform their vanilla counterparts in terms of generating summaries that are more fluent (see RL and Rep.), more informative (see R1, R2 and BERTSc.) and more faithful (see ent., Feqa and BERTFaithful). Among all four models, PEGFAME summaries are most fluent, informative and faithful. We further did pairwise comparisons for all measures in Table 1 and found that all differences are statistically significant except for BERTScore and faithfulness measures between PEGASUS and PEGFAME.¹⁰ These assessments demonstrate that FAME models aid both ROBERTAS2S and PEGASUS in generating fluent, faithful and relevant summaries, but are more effective in ROBERTAS2S than in PEGASUS for extreme summarization.
⁹ Out of 500, 90% of the document-summary pairs were used for training and the remaining 50 document-summary pairs were used for validation. We used the validation set to estimate Spearman's correlation coefficients of different metrics with the human assessment for faithfulness. We found that both entailment scores (ent.) and BERTFaithful are moderately correlated with faithfulness, with correlation coefficients of 0.4387 and 0.3889, respectively. As such, we believe that BERTFaithful works as an efficient proxy for expensive human evaluation for faithfulness for XSum summaries. More work is needed to understand if BERTFaithful generalizes to other datasets.
Generating Diverse and Faithful Summaries with Focus Sampling. Table 2 presents results assessing focus sampling (Focus$_{\text{sample},k}$), top-k sampling (Div$_{\text{top},k}$) and nucleus sampling (Div$_{\text{nucleus}}$) for their abilities to generate diverse and faithful summaries. For Focus$_{\text{sample},k}$, we choose k = 10,000. We follow Holtzman et al. (2020) and choose k = 640 and the nucleus probability p = 0.95 for Div$_{\text{top},k}$ and Div$_{\text{nucleus}}$, respectively. For Focus$_{\text{sample},k}$, we decode with a beam size of 4. We also report Focus$_{\text{sample},k}$ with Div$_{\text{top},k}$ and Div$_{\text{nucleus}}$ to assess if they can benefit one another. In each setting we sample 10 summaries for each input document. For all metrics, we report the average over all 10 samples. Both Div$_{\text{top},k}$ and Div$_{\text{nucleus}}$ almost always generate a new summary. In comparison, Focus$_{\text{sample},k}$ generates 1.61 and 2.77 unique summaries using ROBFAME and PEGFAME models, respectively. Div$_{\text{nucleus}}$ tends to generate the most distinct unigrams, bigrams, and trigrams. Interestingly, Focus$_{\text{sample},k}$ summaries have a more diverse collection of unigrams than Div$_{\text{top},k}$ summaries (3.5% vs 2.3% for ROBFAME and 2.4% vs 1.9% for PEGFAME).
¹⁰ All significance tests in this work are pairwise comparisons (one-way ANOVA with posthoc Tukey HSD tests; p < 0.01).
The high diversity in Div$_{\text{top},k}$ and Div$_{\text{nucleus}}$ comes at the cost of faithfulness; summaries generated with these sampling techniques have poor entailment scores. Focus$_{\text{sample},k}$, on the other hand, generates summaries which entail documents the most. It also has the highest ROUGE scores across the board. Some of the generated examples can be seen in Figure 1. More predictions from other models can be found in Appendix E. Augmenting Div$_{\text{top},k}$ and Div$_{\text{nucleus}}$ with Focus$_{\text{sample},k}$ is not desirable because, though it increases diversity in terms of uniqueness and Distinct-3 scores, faithfulness suffers again.
Comparing results in Table 2 to the results in Table 1, it is clear that diversity comes at the cost of quality (e.g., RL/ent. scores for ROBFAME and ROBFAME-Focus$_{\text{sample},k}$ are 34.81/41.3 and 31.0/34.3, respectively). However, Focus$_{\text{sample},k}$ is superior to both Div$_{\text{top},k}$ and Div$_{\text{nucleus}}$ in generating better quality summaries.
Focus Attention and Sampling Work Differently in ROBFAME and PEGFAME. Since both encoder-decoder and focus attention parameters of ROBFAME are randomly initialized, they learn to complement each other and learn a peaky topic distribution. On the other hand, since PEGFAME's encoder-decoder attention is pretrained, there is a push-pull effect between it and focus attention. This results in a smoother topic distribution, as seen in Figure 3.¹² Although we see that both models' token sets capture the target intent well, the peaky distribution of ROBFAME enables more accurate predictions than that of PEGFAME in a controlled generation setting. A comparison is presented in Figure 4, where we show how ROUGE-1 scores vary when we use only the top-k tokens from $t_X$ for generation. We observe that ROBFAME consistently outperforms PEGFAME with the lower values of k ∈ {50, 100, 200, 500, 1000}. Further, we observe that ROBFAME generates fewer unique summaries (1.61 vs 2.77) but has higher Distinct-N scores (3.5/22.4/43.9 vs 2.4/16.5/34.2) than PEGFAME, with Focus$_{\text{sample},k}$ in Table 2. This can again be attributed to how FAME works differently in ROBFAME and PEGFAME. When $V_k$ is sampled from ROBFAME's peaky distribution, beam search decoding often tends to generate similar summaries (leading to a lower Uniqueness score), as the sampled $V_k$s do not diverge by much from each other. But when it does diverge, the decoder tends to generate completely new summaries (leading to higher Distinct-N scores).
¹² This difference in topic distributions is consistent across the whole test set. We compute the peakiness score of a topic distribution as the slope of the line connecting the logits of the top-1st token to the top-100th token. The average peakiness scores across the XSUM test set for ROBFAME and PEGFAME are 1.25 (51°) and 0.45 (24.3°), respectively.
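The peakiness score mentioned in the footnote, and the conversion from a slope to the angle reported in parentheses, can be sketched as below. The horizontal scale of the line is not stated in the text, so the `run` argument (defaulting to the raw rank distance) is an assumption of this sketch, as are the function names.

```python
import math


def peakiness(logits, top=100, run=None):
    """Slope of the line from the highest logit to the `top`-th highest logit.

    `run` is the horizontal distance between the two points; using the raw rank
    distance (top - 1) is an assumption, since the axis scaling is not specified.
    """
    ranked = sorted(logits, reverse=True)
    top = min(top, len(ranked))
    if top < 2:
        raise ValueError("need at least two logits to define a slope")
    rise = ranked[0] - ranked[top - 1]
    return rise / (run if run is not None else top - 1)


def slope_to_degrees(slope):
    """Angle of a line with the given slope, in degrees."""
    return math.degrees(math.atan(slope))


# A synthetic, linearly decaying set of 100 logits (hypothetical values).
toy_logits = [float(100 - i) for i in range(100)]
toy_slope = peakiness(toy_logits)

# Angles corresponding to the two reported average slopes.
rob_angle = slope_to_degrees(1.25)
peg_angle = slope_to_degrees(0.45)
```

Applying the arctangent to the reported slopes gives roughly 51° and 24°, in line with the angles quoted for ROBFAME and PEGFAME.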
Currently, we set k = 10,000 for our focus sampling experiments following our observations in Figure 4. Future work will focus on how to better leverage the trade-off between diversity and faithfulness by controlling the peakiness of the topic distribution $t_X$.
Ablations and SOTA Comparisons. We emphasize that FAME or focus sampling does not aim to improve on state-of-the-art results in terms of ROUGE, but to generate more faithful or diverse summaries while maintaining their quality. For completeness, we compare our ROBFAME and PEGFAME models to their ablations and other state-of-the-art models on XSUM in Table 3.
We report ROUGE scores for FAME in the ideal scenario (ORACLE) where it focuses on all the correct tokens in the input, i.e., the topic distribution $t_X$ is identical to the distribution observed in the reference summary. These models generate summaries with very high ROUGE scores when the model is given the correct tokens to focus on. The gap between the ORACLE and FAME scores suggests that there is still a lot of work to be done in this space. Focus attention without any topical supervision (models w/o Eq. (3)) is not significantly better than the baselines. But ROBFAME and PEGFAME (trained with joint supervision in Eq. (4)) significantly outperform ROBERTAS2S and PEGASUS, respectively.
Our best model PEGFAME performs better than PtGen (See et al., 2017) but worse than the original PEGASUS (Zhang et al., 2019). This can be expected as the number of parameters in PEGFAME is far less than that in the original PEGASUS.

Conclusion
We introduced FAME, a new attention mechanism which dynamically biases the decoder to proactively generate tokens that are topically similar to the input. FAME enhances the faithfulness of existing state-of-the-art abstractive summarization models while improving their overall ROUGE scores. Finally, our newly introduced focus sampling technique is a better alternative to top-k or nucleus sampling for generating a diverse set of faithful summaries.

Acknowledgements
We thank Sebastian Gehrmann, Slav Petrov, the reviewers, and the action editor for their invaluable feedback.

Ethical Considerations
The nature of text generation leads to multiple ethical considerations when applied to applications. The main failure mode is that the model can learn to mimic target properties in the training data that are not desirable.

Faithfulness and Factuality
Since models create new text, there is the danger that they may neither be faithful to the source material nor factual. This can be exacerbated when the data itself has highly abstractive targets, which require the model to generate words not seen in the source material during training. This often leads the model to generate content inconsistent with the source material (Kryscinski et al., 2020; Maynez et al., 2020; Gabriel et al., 2020).

Trustworthy Data
If the data itself is not trustworthy (comes from suspect or malicious sources) the model itself will naturally become untrustworthy as it will ultimately learn the language and topics of the training data. For instance, if the training data is about Obama birther conspiracies, and the model is asked to generate information about the early life of Obama, there is a risk that such false claims will be predicted by the model.

Bias in Data
Similarly, biases in the data around gender, race, etc., risk being propagated in the model predictions, which is common for most NLP tasks. This is especially true when the models are trained from non-contemporary data that do not represent current norms and practices (Blodgett et al., 2020).
The above considerations are non-malicious, in that the model is merely learning to behave like its underlying source material. If users of such models are not aware of these issues and do not account for them, e.g., with better data selection, evaluation, etc., then the generated text can be damaging.
Generation models can also be misused in malicious ways. These include generating fake news, spam, and other text meant to mislead large parts of the general population.

A Implementation and Reproducibility Details
Following Rothe et al. (2020), the encoder and decoder of ROBERTAS2S and ROBFAME models are initialized with public RoBERTa checkpoints. The encoder and decoder parameters are shared in both cases. Only the encoder-decoder attention parameters are initialized randomly. For ROBFAME, the focus attention parameters are also randomly initialized. We experiment with large RoBERTa checkpoints with 24 layers, a hidden size of 1024, filter size of 4096, 16 attention heads, and a vocabulary with 50K sentence pieces (Kudo and Richardson, 2018). ROBERTAS2S has around 455M parameters; ROBFAME has 463M, i.e., an additional 8M parameters. Our PEGASUS and PEGFAME implementations also have the same configuration, except for the encoder-decoder attention parameters, which are pretrained. We used Cloud TPU v3 accelerators for training. All models are fine-tuned on the target task using Adam with a learning rate of 0.05. We use a linear learning rate warm-up with 40k steps, normalized by the square root of the hidden size, and a square root decay. We do not perform any tuning on these hyperparameters. We use a global batch size of 128 document-summary pairs. We adapt the number of training steps to the training data sizes: models are trained for 400k and 200k steps for CNN/DM and XSUM respectively, saving checkpoints every 1000 steps. We choose the best model based on ROUGE-L performance on the respective validation set.
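The learning rate schedule described above can be sketched as follows. This is a minimal illustration, not the training code itself: the function name `learning_rate` and the exact way the three factors are combined (base rate times linear warm-up times inverse-square-root decay, divided by the square root of the hidden size) are assumptions about how the description is typically realized.

```python
import math


def learning_rate(step, base_lr=0.05, warmup_steps=40_000, hidden_size=1024):
    """Hypothetical schedule: linear warm-up, then inverse-square-root decay.

    The combination of factors is an assumption; only the base rate (0.05),
    the 40k warm-up steps and the hidden size (1024) come from the text.
    """
    # Ramps linearly from 0 to 1 over the warm-up period, then stays at 1.
    warmup = min(1.0, step / warmup_steps)
    # Constant during warm-up, then decays as 1 / sqrt(step).
    decay = 1.0 / math.sqrt(max(step, warmup_steps))
    # Normalization by the square root of the hidden size.
    return base_lr * warmup * decay / math.sqrt(hidden_size)


for step in (1_000, 40_000, 200_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate peaks at the end of warm-up and decreases afterwards, so the nominal value of 0.05 is a scaling constant rather than the step size actually applied.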
The vocabulary for functional tokens F is constructed by taking the most frequent sentence pieces in the training set. We tune |F| using the respective validation sets; for XSUM, we choose f = 500 frequent sentence pieces and for CNN/DM, f = 1000. For all our experiments with the FAME models, the beam size is set to 4.
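As an illustration of how such a functional vocabulary could be built, the sketch below counts sentence pieces over an already tokenized training set and keeps the most frequent ones. The helper name `build_functional_vocab` and the toy corpus are hypothetical; the real pipeline would operate on sentence pieces rather than the plain words used here as stand-ins.

```python
from collections import Counter


def build_functional_vocab(tokenized_texts, size):
    """Return the `size` most frequent pieces across all tokenized texts."""
    counts = Counter(piece for text in tokenized_texts for piece in text)
    return {piece for piece, _ in counts.most_common(size)}


# Toy stand-in for a tokenized training set (plain words, not real sentence pieces).
toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]
functional = build_functional_vocab(toy_corpus, size=3)
print(sorted(functional))
```

In the setting described above, `size` would correspond to f = 500 for XSUM and f = 1000 for CNN/DM.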
We use Cloud TPU v3 accelerators for computing entailment scores, which takes about 20 minutes for the two datasets' test sets. Question generation and answering for Feqa are run on an NVIDIA V100 GPU, and it takes between 8 and 12 hours for one setting of each test set.

B Abstractive Summarization Results on CNN/DailyMail
The CNN/DM dataset (Hermann et al., 2015) consists of 287,227/13,368/11,490 training/validation/test document-summary pairs. Tables 4 and 5 present complete results for the CNN/DM dataset. We see similar improvements to those observed in Table 1, except for ROUGE-2 for ROBFAME, which is 0.23 points worse than the ROBERTAS2S baseline. Our best model PEGFAME performs better than both copy mechanism models: LSTM-based PtGen (See et al., 2017) and Transformer-based SAGCopy (Xu et al., 2020). PEGFAME performs worse when compared with T5 (Raffel et al., 2019), the original PEGASUS (Zhang et al., 2019) and ProphetNet (Qi et al., 2020). This can be expected as the number of parameters in PEGFAME is almost half of T5 or ProphetNet, and is 100M less than that in the original PEGASUS.

ROBFAME performs worse than ROBERTAS2S on both ent. and Feqa measures for CNN/DM, similar to ROUGE-2 in Table 4. We hypothesize that this is due to the extractive nature of the CNN/DM dataset and the fact that ROBFAME is not able to copy tokens from the input to the necessary extent, as its encoder-decoder attention is not pre-trained. Moreover, Feqa scores for ROBERTAS2S and ROBFAME may not be fully comparable due to variation in their summary lengths and the number of Feqa questions generated; the ROBFAME summaries, on average, are 3 words longer and generate 1.2 more questions than those of ROBERTAS2S. Nevertheless, we don't see this kind of drop in ¬cont. scores (i.e., summary not contradicting, either entailed by or neutral to the document) and BERTScores.

C Text Editing Results
We also train the FAME models on two text editing tasks: (i) for sentence fusion, the problem of combining multiple sentences into a single coherent sentence, we used the "balanced Wikipedia" portion of the DiscoFuse dataset (Geva et al., 2019), and (ii) for split-and-rephrase, the reverse task of sentence fusion, we used the WikiSplit dataset (Botha et al., 2018), which consists of 1M examples of sentence splits extracted from the Wikipedia edit history. As the name suggests, both text editing tasks require a low degree of abstraction. For both tasks, we train the models for 300k steps with a global batch size of 256. The input and output are padded to a length of 128, which covers 100% of the training, evaluation and test data. The vocabulary for functional tokens F is constructed by taking the top 100 and 500 sentence pieces for DiscoFuse and WikiSplit respectively.
We report corpus-level BLEU (footnote 14), the exact match accuracy, and SARI scores (Xu et al., 2016) (footnote 15). The results can be seen in Table 6. The vanilla PEGASUS model already beats the current state-of-the-art on both DiscoFuse and WikiSplit. The PEGFAME

Footnote 14: We use NLTK v3.2.2 with case sensitive scoring to estimate BLEU scores.
Footnote 15: SARI is a lexical similarity metric which compares the model's output to multiple references and the input in order to assess the model's ability to add, delete, and keep an n-gram. Its implementation is available at: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py.

D Controlled Generation with focus attention using Top-k tokens
Table 7 presents results from our controlled summary generation experiments with top-k tokens from t_X using focus attention (Focus_top,k) on the XSUM test set. In Figures 3 and 4, we describe how ROBFAME consistently outperforms PEGFAME at lower values of k ∈ {50, 100, 200, 500, 1000} due to their peaky and smooth t_X, respectively. While Figure 4 only plots ROUGE-1 F1 scores, Table 7 additionally reports ROUGE-2, ROUGE-L, entailment, Feqa, and BERTScores. Figure 6 presents predictions from models using Focus_top,k for the article presented in Figures 1 and 5. The predictions from Div_top,k and Div_nucleus are omitted due to the prescribed limit on the number of pages allowed for the Appendix. Please find them on the arXiv version at https://arxiv.org/abs/2105.11921.

We experiment with limiting FAME models to different sizes of vocabulary V_k using the topic distribution t_X; in particular, we experiment with k = {50, 100, 200, 500, 1000, 10000}. We also report numbers for ROBERTAS2S, ROBFAME, PEGASUS and PEGFAME, using the whole vocabulary of size 50k. The bold results in each block are the best performing ROBERTAS2S-based and PEGASUS-based models.
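To make the idea of restricting generation to a vocabulary V_k concrete, the following sketch selects the k tokens with the highest topic probability and renormalizes a decoder step's distribution over only those tokens. It is an illustration under stated assumptions, not the FAME implementation: the helper names `top_k_vocabulary` and `restricted_softmax`, the toy five-token vocabulary, and the choice to apply the restriction by masking logits are all hypothetical.

```python
import math


def top_k_vocabulary(topic_dist, k):
    """Indices of the k tokens with the highest topic probability (k >= 1)."""
    ranked = sorted(range(len(topic_dist)), key=lambda i: topic_dist[i], reverse=True)
    return set(ranked[:k])


def restricted_softmax(logits, allowed):
    """Softmax over logits where tokens outside `allowed` get zero probability."""
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: a 5-token vocabulary with an assumed topic distribution t_X.
topic_dist = [0.05, 0.40, 0.10, 0.30, 0.15]
logits = [2.0, 0.5, 1.0, 0.1, -1.0]

allowed = top_k_vocabulary(topic_dist, k=2)
probs = restricted_softmax(logits, allowed)
print(sorted(allowed), probs)
```

In this toy case the token with the highest raw logit (index 0) is excluded because it has low topic probability, which illustrates why very small k can force the decoder toward unsuitable tokens.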

GOLD
Australia has expelled an Israeli diplomat saying Israel was behind the forging of Australian passports linked to the murder of a Hamas operative in Dubai.

Article
Australia's foreign minister said these were "not the actions of a friend". The UK took similar action in March, after concluding that Israel was responsible for the use of forged UK passports in the plot. The Israeli foreign ministry said Australia's decision was disappointing. Ministry spokesman Yigal Palmor said it was "not in line with the importance and the quality of the relationship between our countries". 'Sorrow not anger' At least four forged Australian passports were used in the killing of Mahmoud al-Mabhouh in Dubai in January. The originals belonged to Australians living in Israel. The Australian government said a police investigation had left it in no doubt that the Israeli authorities were behind "the abuse and counterfeiting of the passports". As a result Foreign Minister Stephen Smith asked Israel to withdraw a diplomat, whom he did not identify. "The decision to ask Israel to remove from Australia one of its officers at the Israeli embassy in Canberra is not something which fills the Australian government with any joy," he said. "On the contrary, the decision was made much more in sorrow than in anger." Passports from France, Ireland, Germany and Britain were used in the operation, and in March, the British government expelled an Israeli diplomat from London. The Israeli government has said there is no proof that it was behind the killing, although Dubai officials have said they are 99.9% sure that agents from Mossad were responsible.
ROBERTAS2S
Australia has asked Australia to withdraw an Israeli diplomat from its embassy in Canberra after an alleged plot to kill a Abu Dhabi militant in Dubai.

ROBFAME
Australia has asked Israel to withdraw one of its diplomats from its embassy in Canberra after it admitted it used forged passports.

PEGASUS
Australia has expelled an Israeli diplomat after concluding that forged Australian passports used in the killing of a Hamas militant in Dubai were issued by Israel.

PEGFAME
The Australian government has expelled an Israeli diplomat over the use of forged Australian passports in the killing of a Hamas militant in Dubai.

Figure 5: A 2010 BBC article from the XSUM test set, its human-written summary, and model predictions from ROBERTAS2S and PEGASUS, with and without FAME. The text in orange is not supported by the input article.

ROBFAME (Focus_top,k=50)
Australia has said it will not be expelled an ambassador from Australia following the alleged s agent for the so-called Arab Arab State.

ROBFAME (Focus_top,k=100)
Australia has said it will not be expelled an ambassador from Australia following the killing of a terror agent in the Arab world.

ROBFAME (Focus_top,k=200)
Australia has said it will not be expelled an ambassador from Australia following the killing of an Australian terror suspect in the Arab world.

ROBFAME (Focus_top,k=500)
Australia has asked Israel to end its diplomatic investigation into an alleged plot to murder an Australian terror suspect.

ROBFAME (Focus_top,k=1000)
Australia has asked Israel to strip an ambassador from its embassy following the death of an Arab man in Dubai.

ROBFAME (Focus_top,k=10000)
Australia has asked Israel to withdraw one of its diplomats from its embassy in Canberra following the death of a terror suspect.

PEGFAME (Focus_top,k=50)
The Israeli government has been expelled from the country after it was found that the country's security agency, the Israeli intelligence agency, was to be to be found to have used a number of the country's out-of-country p when it was used in the Emirates car-j best.

PEGFAME (Focus_top,k=100)
The Israeli government has been expelled from the country after it was found that the country's security agency, the Israeli intelligence agency, had used the country's visas in the Emirates terror.

PEGFAME (Focus_top,k=200)
The Australian government has expelled an Israeli diplomats after it found that the country's security agency, the Israeli intelligence agency, had used the country's visas in the Emirates terror attack.

PEGFAME (Focus_top,k=500)
The Australian government has expelled an Israeli diplomatic staff after accusing the country's security agency, the Israeli intelligence agency, of using a number of Australian visas in the Emirates terror attack.

PEGFAME (Focus_top,k=1000)
Australia has expelled an Israeli diplomatic staff after accusing the country's security agency, the Israeli military's intelligence agency, of being responsible for the use of Australian visas used in the killing of a Palestinian.

PEGFAME (Focus_top,k=10000)
Australia has expelled an Israeli diplomat over the use of forged Australian passports in the killing of a Hamas militant in Dubai.

Figure 6: Model predictions with focus sampling Focus_top,k, a controlled generation setting. The text in orange is not supported by the input article. We note that with smaller values of k, both ROBERTAS2S-based and PEGASUS-based models tend to hallucinate more often.

ROBFAME (Focus_sample,k)
Australia has asked Israel to strip one of its diplomats from its embassy following the death of an Arab man in Dubai.
Australia has asked Israel to end its diplomatic investigation into an alleged plot to murder an Australian terror suspect.
Australia has asked Israel to strip one of its diplomats from its embassy in Australia over the death of a terror suspect.

PEGFAME (Focus_sample,k)
The Australian government has expelled an Israeli diplomatic staff after accusing it of using a number of Australian visas in the killing of a Palestinian in a car bombing.
The Australian government has expelled an Israeli diplomatic staff after it said the country was responsible for the use of Australian visas used in the killing of a Palestinian in a car bombing.
Australia has expelled an Israeli diplomatic staff after accusing the country's security agency, the Israeli military's intelligence agency, of being responsible for the use of Australian visas used in the killing of a Palestinian.
Australia has expelled an Israeli diplomatic mission after accusing the country's security agency, the Israeli military's intelligence agency, of being responsible for the use of Australian visas used in the killing of a Palestinian in the Arab city of Emirates.
The Australian government has expelled an Israeli diplomatic staff after it said the country was responsible for the use of Australian visas used in the killing of a Palestinian in the Middle East.

Figure 7: FAME model predictions with Focus_sample,k (k = 10000). The text in orange is not supported by the input article.