Long Document Summarization with Top-down and Bottom-up Inference

Text summarization aims to condense long documents and retain key information. Critical to the success of a summarization model is the faithful inference of latent representations of words or tokens in the source documents. Most recent models infer the latent representations with a transformer encoder, which is purely bottom-up and thus does not capture long-distance context well. Moreover, self-attention-based models face the challenge of quadratic complexity with respect to sequence length. We propose a method to improve summarization models on these two aspects. Our method assumes a hierarchical latent structure of a document, where the top level captures long-range dependency at a coarser time scale and the bottom token level preserves the details. Critically, our method enables token representations to be updated in both a bottom-up and a top-down manner. In the bottom-up pass, token representations are inferred with local self-attention to leverage its efficiency. Top-down correction is then applied to allow tokens to capture global context. We demonstrate the effectiveness of the method on a diverse set of summarization datasets covering narrative, conversational, scientific, and news documents. Our model achieves state-of-the-art performance on a wide range of long document summarization benchmarks compared to recent efficient transformers. We also show that our model can summarize an entire book and achieve competitive performance using 0.27% of the parameters and much less training data than a recent GPT-3-based model. These results indicate the general applicability and benefits of the framework.


Introduction
An abstractive summarization system aims to generate a semantically coherent and linguistically fluent summary by conditioning on the document. The dominant approach for abstractive summarization is to use a Seq2Seq model (Sutskever et al., 2014) with an encoder-decoder architecture instantiated with either RNNs (Hochreiter and Schmidhuber, 1997) or transformers (Vaswani et al., 2017). In such a model, an encoder computes or infers latent representations of observed tokens (words or subwords) in a document, conditioning on which a decoder generates a summary. This paper studies the problem of how to compute informative latent representations, which in turn would improve summarization.
We propose a method which synergizes bottom-up computation with top-down computation while assuming a multi-scale latent structure of a document. In a multi-scale structure, higher-level variables (like those representing sentences or segments) model the document at a coarser time scale and abstract away details, and are suitable for capturing long-range dependency of the document; in contrast, lower-level variables (like those representing tokens) preserve details, and prevent the summary from losing key details (such as the name of an entity). In our method, the summary is generated by conditioning on token representations (low-level variables), similar to recent abstractive summarization models (Zaheer et al., 2020; Beltagy et al., 2020). There is however a critical difference. In our method, token representations are first inferred bottom-up and then updated top-down with high-level representations, hence rendering low-level representations aware of global context. See Figure 1 for an overview of our method.
Multi-level models have been widely studied in modeling images (Sønderby et al., 2016), speech (Mehri et al., 2016), and language (Chung et al., 2016). They are also not new in the summarization literature. Prior summarization research has explored hierarchical models (Cheng and Lapata, 2016; Nallapati et al., 2016; Zhang et al., 2019; Xu et al., 2020; Cohan et al., 2018; Ruan et al., 2022). These works focus on the bottom-up computation in a hierarchical model, computing higher-level representations (e.g., sentences, paragraphs) based on lower-level representations (e.g., words). In contrast, our method emphasizes the combination of bottom-up computation, as done in prior works, and top-down computation, where lower-level representations are updated and enriched with higher-level representations (see the middle panel of Figure 1). This design is critical for summarization, which requires global context. As shown in our ablations, removing the top-down update undermines the summarization performance.
The proposed method is agnostic to the model architecture. Due to the dominance of transformer models in NLP (Chen et al., 2018; Zhang et al., 2020; Sun et al., 2019; Martin et al., 2020), we instantiate our method with a transformer-based model. There is a bottleneck in applying transformers to long documents, because their computational and memory cost has a quadratic dependency on the sequence length. This issue is especially critical for summarization, since long documents are the cases of most interest: short ones can be quickly read through by humans. To address this issue, a large body of prior work has been devoted to developing efficient transformers with sub-quadratic complexity (Wang et al., 2020; Child et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020; Kitaev et al., 2020; Roy et al., 2021).
Our method provides a natural way to diminish this quadratic complexity issue. In the bottom-up computation, we use local self-attention where each token only attends to the tokens within a local fixed-length window, and thus the complexity does not grow quadratically with the input sequence length. The top-down correction for (local) token representations enables them to capture more global context, reducing the limitation of local attention. In prior works like Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020), local attention is also used. Our method differs from these models in how global information is injected into locally computed representations. Longformer and BigBird utilize a few global tokens which attend to and are attended by all local tokens, whereas we use top-down correction. Our approach can better capture global information than prior models, as demonstrated by clear performance improvements over these models in our experiments.
In summary, our method has two key components: (1) local attention in the bottom-up computation and (2) top-down correction of locally computed token representations by high-level representations. The first component alleviates the computational and memory cost and allows our model to process long documents, and the second component injects global information into local tokens and improves summarization performance. We call our model the top-down transformer to emphasize the importance of the top-down update. We evaluate the model on a diverse set of summarization benchmarks. They cover documents from a variety of domains, including news articles and scientific, conversational, and narrative documents, and of various lengths, ranging from hundreds of words (e.g., a news article), to several thousand or over ten thousand words (e.g., a scientific paper, a book chapter), to over a hundred thousand words (e.g., an entire book). Across all long document datasets, our models achieve competitive or state-of-the-art performance. We also show that our model is able to summarize a whole book. Compared to Wu et al. (2021), which uses GPT-3 and requires humans to extensively label data, our model achieves competitive performance on book summarization with only 0.27% of the parameters and a small amount of publicly available data. The diverse and strong empirical results support the effectiveness and wide applicability of the proposed model.
Our contributions are summarized as follows: (1) we propose a method which combines bottom-up computation and top-down update for long document summarization; (2) we conduct extensive evaluations and achieve strong performance on various long document benchmarks; and (3) we adapt our method to the challenging task of summarizing an entire book and achieve GPT-3-level performance with only 0.27% of the parameters.

Methods
Figure 1 gives a graphical overview of the top-down transformer. We introduce its details in this section. Suppose a document has $N$ tokens, $t = \{t_i\}_{i=1}^N$. In our method, token representations are computed by combining top-down and bottom-up processes. This leads to effective and efficient inference for token representations. They are then attended to by a decoder to generate a summary, as in a regular encoder-decoder transformer.

Bottom-Up Computation
In the bottom-up path, contextual embeddings of the tokens, $\{e_i \mid e_i \in \mathbb{R}^d\}_{i=1}^N$, are computed with $N_1$ layers of local self-attention. In particular, each token $t_i$ only attends to nearby tokens within a window of size $w$. The complexity is hence $O(Nw)$, in contrast to $O(N^2)$ for full self-attention models.
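To make the windowed attention concrete, below is a minimal PyTorch sketch of the bottom-up local self-attention; it is an illustration under our notation, not a released implementation. For clarity it materializes a full $N \times N$ mask, whereas practical implementations chunk the sequence so that cost and memory actually scale as $O(Nw)$; the head count and module construction are illustrative.

```python
import torch
import torch.nn as nn

def local_self_attention(x: torch.Tensor, w: int, num_heads: int = 8) -> torch.Tensor:
    """x: (N, d) token embeddings; w: window size. Each token attends only
    to tokens within w // 2 positions on either side. Assumes d % num_heads == 0."""
    N, d = x.shape
    # Illustrative module; a real model would reuse trained attention layers.
    attn = nn.MultiheadAttention(d, num_heads)
    # Boolean mask: True marks pairs that are NOT allowed to attend.
    idx = torch.arange(N)
    mask = (idx[None, :] - idx[:, None]).abs() > w // 2
    x = x.unsqueeze(1)  # (N, 1, d): the (seq, batch, dim) layout expected by default
    out, _ = attn(x, x, x, attn_mask=mask)
    return out.squeeze(1)  # (N, d): tokens contextualized within their local windows
```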

Top-Down Computation
The efficiency of local self-attention in the bottom-up path nevertheless comes with a limitation: each $e_i$ only captures the context within a local window instead of that of the whole document. To mitigate this issue, we propose a top-down update for token representations.
Consider a two-level multi-scale latent structure for a document. The lower level consists of token representations, $\{e_i\}_{i=1}^N$, computed by the bottom-up computation. The top level consists of units at a coarser granularity. It is affordable to apply full self-attention at the top level due to this coarser granularity, allowing the top-level units to capture global document context. The self-attention mechanism for the top-level representations is the original multi-head self-attention proposed in Vaswani et al. (2017). Denote the top-level representations after the self-attention update as $\{s_j \mid s_j \in \mathbb{R}^d\}_{j=1}^M$ (see Section 2.3 for details on how the top-level representations are initialized). We can then update the bottom-up-inferred token representations with the top-level representations. This is achieved with $N_3$ top-down computation layers, as illustrated by the middle panel in Figure 1. Each layer contains three transformations on $\{e_i\}$: (1) token self-attention, (2) token-segment cross-attention, and (3) feed-forward.
(1) and (3) are the same as those in the bottom-up layers, i.e., regular self-attention layers with local attention. (2), which implements the cross-attention between the top and bottom levels, is the critical operation. In particular, each $e_i$ is updated with cross-attention over the segment representations,

$$\tilde{e}_i = \sum_{j=1}^{M} \frac{\exp\!\left(f_q(e_i)^\top f_k(s_j) / \sqrt{d}\right)}{\sum_{j'=1}^{M} \exp\!\left(f_q(e_i)^\top f_k(s_{j'}) / \sqrt{d}\right)} \, f_v(s_j), \tag{1}$$

where $f_q$, $f_k$, and $f_v$ indicate query, key, and value linear mappings, respectively. For notational clarity, Equation 1 only illustrates the case with a single attention head; in practice, we use multiple heads. The cross-attention operation injects global contextual information into the bottom-up-inferred token representations, $e_i$, and yields global-context-aware token representations, $\tilde{e}_i$, conditioning on which a summary can be generated by a decoder.
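A minimal single-head sketch of this token-segment cross-attention (Equation 1) follows; `f_q`, `f_k`, and `f_v` are assumed to be `nn.Linear(d, d)` modules, and the multi-head case used in practice is omitted for clarity.

```python
import math
import torch

def top_down_cross_attention(e, s, f_q, f_k, f_v):
    """e: (N, d) bottom-up token representations; s: (M, d) segment
    representations after full self-attention; f_q, f_k, f_v: linear maps."""
    q, k, v = f_q(e), f_k(s), f_v(s)            # (N, d), (M, d), (M, d)
    scores = q @ k.T / math.sqrt(q.shape[-1])   # (N, M) token-to-segment logits
    weights = torch.softmax(scores, dim=-1)     # each token attends over all segments
    return weights @ v                          # (N, d) global-context-aware tokens
```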
To instantiate the top-down computation, we need to make two choices: (1) the number of top levels above the token level and (2) the unit representation for each top level. We choose to use one top level, since it is sufficiently coarse to permit full self-attention on the wide range of long document benchmarks we experimented on. A natural choice for top-level units is sentences, paragraphs, or chapters, depending on the number of top levels considered. Such a choice however leads to complicated implementations and reduced scalability due to the varying lengths of these units. We hence choose a simpler approach, where the top level consists of fixed-length segments of the document. While we use a single top level, multiple top levels can be realized simply with segments of increasingly coarser granularity.
In the top-down computation, segment-level self-attention has a complexity of $O(M^2)$, and token-segment cross-attention has a complexity of $O(NM)$. Thus, together with the bottom-up inference, the overall complexity is $O(Nw + M^2 + NM)$. In practice, we use a relatively small window size $w$ and number of segments $M$.
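For illustration, with $N = 8192$, $w = 1024$, and $M = 512$ (the window size and maximum segment count used in our experiments, at PubMed-scale input length), this amounts to roughly $Nw \approx 8.4\text{M}$ plus $M^2 \approx 0.26\text{M}$ plus $NM \approx 4.2\text{M}$ attention scores, about 13M in total, versus $N^2 \approx 67\text{M}$ for full self-attention.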

Pooling Methods
As aforementioned, we use a single top level, consisting of fixed-length segments. The segment representations are initialized by pooling token representations. Following the notation above, suppose a document is divided into $M$ segments. The embedding of the $j$-th segment is initialized as

$$s_j^{(0)} = \sum_{n=1}^{k} p_n \, e_{(j-1)d + n},$$

where $k$ is the kernel size, $d$ is the stride, and $p_n$ is the weight for the $n$-th token in the window. We introduce two approaches to compute the weights. The first method is average pooling (AvgPool), with $p_n = \frac{1}{k}$, which is simple and convenient. In the second approach, we leverage the reference summary to define the importance of each token and assign adaptive weights (AdaPool). In particular, we learn an importance tagger with labels constructed from the reference summaries: (1) token-level importance labels are derived from the reference summaries, (2) a tagger is trained on these labels, and (3) the tagger's predicted importance scores, normalized within each pooling window, serve as the weights $p_n$. (4) $\{s_j^{(0)}\}_{j=1}^M$ are then updated with self-attention, yielding $\{s_j\}_{j=1}^M$, which are used in the top-down inference of token representations, as discussed in Section 2.2.
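A minimal sketch of this pooling-based initialization follows; it implements AvgPool by default, and passing per-window importance weights in place of the uniform weights would give AdaPool. The function and argument names are illustrative.

```python
import torch

def init_segments(e: torch.Tensor, k: int = 32, stride: int = 24,
                  p: torch.Tensor = None) -> torch.Tensor:
    """e: (N, d) token representations -> (M, d) initial segment representations."""
    windows = e.unfold(0, k, stride)   # (M, d, k): sliding windows of k tokens
    if p is None:                      # AvgPool: uniform weights p_n = 1 / k
        p = torch.full((k,), 1.0 / k)
    # AdaPool would instead pass tagger-predicted weights normalized per window,
    # e.g. p of shape (M, 1, k).
    return (windows * p).sum(dim=-1)   # weighted sum over the k tokens in each window
```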

Overview
We thoroughly evaluate the proposed method on various summarization datasets. See Table 7 in the appendix for a summary of the datasets used in the current work. Our model is first evaluated on two standard long document summarization benchmarks, PubMed and arXiv (Cohan et al., 2018). It outperforms various efficient transformers and other approaches and achieves state-of-the-art performance. Although we focus on long document summarization, models under our framework are also applicable to shorter documents. We test our model on CNN-DailyMail (See et al., 2017), the most widely used short summarization dataset. Compared to a full self-attention model, our model achieves competitive or better performance. Recently, a more challenging benchmark, SummScreen (Chen et al., 2021), was proposed, where summarization systems need to summarize TV show scripts. These documents convey plot events often indirectly and implicitly in dialogues, in contrast to news and scientific articles where statements follow a logical order and facts are offered explicitly. Moreover, a typical episode contains multiple subplots that proceed in parallel. Solving this benchmark thus requires a system to draw information from utterances spread out through the entirety of the input and integrate them into a concise description. Our model outperforms strong baselines on this challenging benchmark by a significant margin. Another challenging dataset, BookSum (Kryściński et al., 2021), was also recently released. It covers books from the literature domain, including stories, plays, and novels. Similar to SummScreen, it requires integrating plot events from indirectly expressed descriptions. A further challenge is to process long-form texts spanning up to hundreds of pages or over 100,000 words. Our method does well on this challenge, achieving competitive or superior performance compared to a GPT-3-based model (Wu et al., 2021). While the GPT-3-based model has 175 billion parameters and requires human labelers to extensively write summaries and provide reward information, our model with 464 million parameters is 380 times smaller and merely requires training on a relatively small amount of data. These results suggest our framework is generally effective for documents of various lengths and domains.

Implementation Details
We use the same encoder-decoder architecture for all datasets. The encoder has 8 bottom-up layers and 4 top-down layers for tokens, and 2 self-attention layers for segments. The decoder has 12 layers. The encoder layers for tokens (12 layers) and the decoder layers are all initialized from BART (Lewis et al., 2020), except the parameters for token-segment cross-attention in the top-down layers, which are randomly initialized. The self-attention parameters for segments are also randomly initialized. The window size is 1024 unless otherwise specified. Our settings closely follow Longformer (Beltagy et al., 2020), which has 12 layers for the encoder and decoder, is initialized from BART, and uses a local window size of 1024. Thus, comparison with Longformer is a test of the effect of the top-down correction for token representations. The segment pooling has a kernel size of 32 and a stride of 24. The maximum number of segments is 512. The maximum document lengths for PubMed, arXiv, CNN-DM, TVMegaSite, ForeverDreaming, and BookSum are 8192, 16384, 1024, 12288, 12288, and 12288, respectively. The optimizer for all models is Adam with a learning rate of 5e-5. Model performance is evaluated with ROUGE scores (Lin, 2004). Reported performance is based on the checkpoint with the best validation R-2 score. Summary samples for each dataset generated by our models are provided in the Appendix.
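For reference, the hyperparameters stated above are collected into a single configuration sketch; the field names and the BART checkpoint identifier are illustrative assumptions, not taken from a released codebase.

```python
# Hypothetical configuration collecting the stated hyperparameters.
config = {
    "encoder_bottom_up_layers": 8,
    "encoder_top_down_layers": 4,
    "segment_self_attention_layers": 2,
    "decoder_layers": 12,
    "local_window_size": 1024,           # bottom-up local attention
    "pool_kernel_size": 32,
    "pool_stride": 24,
    "max_segments": 512,
    "init_from": "facebook/bart-large",  # assumed checkpoint; cross-attention is random-init
    "optimizer": "adam",
    "learning_rate": 5e-5,
    "max_input_tokens": {
        "pubmed": 8192, "arxiv": 16384, "cnn_dm": 1024,
        "tvmegasite": 12288, "foreverdreaming": 12288, "booksum": 12288,
    },
}
```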

Scientific Documents
We first test the effectiveness of our framework on two widely used datasets of scientific documents, PubMed and arXiv. They consist of long documents ranging in length from several thousand words to over ten thousand words.
Three variants of our model with different pooling weights are presented. AvgPool, AdaPool, and OracleAdaPool in Table 1 indicate average pooling, pooling with adaptive weights, and pooling with adaptive weights determined by the reference summaries, respectively (see Section 2.3 for more details).
The experiment results are displayed in Table 1. Pegasus (Zhang et al., 2020) is pretrained on a large-scale dataset with a pretraining objective specifically designed for summarization. It uses a full self-attention encoder and thus has to truncate the source document due to the quadratic memory complexity. The summarization-oriented large-scale pretraining makes it a strong baseline. Dancer (Gidiotis and Tsoumakas, 2020) takes a divide-and-conquer approach in which the summary is divided into sections and each section is paired with the appropriate section of the document; the model is thus trained on short sequences and has a low memory requirement. This is a straightforward approach achieving strong performance.
TLM-I+E (Pilault et al., 2020) first extracts salient sentences and then uses a GPT-style model to generate a summary by conditioning on the introduction section and the extracted sentences (instead of the whole document), thus reducing the memory requirement. SSN-DM (Cui and Hu, 2021) is an extractive model that uses a sliding encoder to process segments of a document and a memory module to capture autoregressive dependency between segments. These two models bear similarities to ours in that they use a multi-scale structure. The extracted salient sentences in TLM-I+E can be considered a representation of the document at a coarser granularity, since salient information is retained. Instead of keeping the coarser representations in the latent space, however, TLM-I+E reads them out into the observed word space. In SSN-DM, the fixed-size memory module pooling information from each segment can also be considered a high-level representation of the document. Despite these similarities, our model, synergizing bottom-up and top-down inference, clearly outperforms these prior models.
BigBird (Zaheer et al., 2020), Longformer (Beltagy et al., 2020), and LSH (Kitaev et al., 2020; Huang et al., 2021) are efficient transformers. BigBird, based on Pegasus pre-training, combines local attention, random attention tokens, and global attention tokens. LSH uses content-dependent sparse attention based on locality-sensitive hashing. Longformer is closely related to our models. It uses the same local attention as in our bottom-up computation, except that it has an extra [CLS] token which is a global attention token. Longformer is also initialized from BART. The only difference is that our models compute token representations with both top-down and bottom-up processes, in contrast to the purely bottom-up computation in Longformer. The clear performance improvement over Longformer and other efficient transformers indicates the effectiveness of the synergy of bottom-up and top-down computation.

News Articles
To demonstrate the general applicability of the proposed framework, we show its effectiveness on short document summarization and compare it to a full self-attention model. We hypothesize that although the bottom-up computation uses local self-attention, our method with the top-down correction would lead to competitive or better summarization performance.
Our model parameters are initialized from BART. Hence, BART with full self-attention forms a natural baseline, allowing for direct comparison. In the bottom-up inference, the local attention window size of our models is 256. As shown in Table 2, our models achieve slightly better performance, especially in terms of R-1 and R-L, than BART. This confirms our hypothesis that a synergy of bottom-up inference with local attention and top-down inference with global attention is effective and achieves on-par or better performance than full self-attention.

SummScreen
Scientific and news articles often present facts explicitly and order statements logically, which might allow summarization models to exploit layout and stylistic biases. We next test the proposed method on a more challenging dataset, SummScreen, which requires a model to draw and integrate information from indirect expressions across a wide span of the document. SummScreen (Chen et al., 2021) provides two datasets, TVMegaSite and ForeverDreaming, collected from two different TV show transcript websites. Each document is the transcript of a TV show episode, and the summary is an associated recap.
Table 3 summarizes the results. The extractive oracle is an extractive method that selects nearest neighbors based on ROUGE scores. Longformer is an abstractive method and takes the whole document as input. Hybrid models first select salient sentences and then input them to BART. Our models outperform these strong baselines and even achieve comparable or superior performance to prior models having access to oracle information.

BookSum
BookSum (Kryściński et al., 2021) is another challenging dataset, consisting of books from the literature domain including stories, plays and novels.
It includes examples at three levels of granularity with increasing difficulty: (1) paragraph-level, with inputs of hundreds of words; (2) chapter-level, with inputs of several thousand to over ten thousand words; and (3) book-level, with inputs spanning up to hundreds of pages and over a hundred thousand words. The chapter-level examples have lengths comparable to other popular long-form summarization datasets such as PubMed and arXiv. We first test our models at the chapter level. Book-level summarization is extremely challenging. First, the number of examples (313 books) is limited. Second, a book is too long to fit in current models. We train our model in a curriculum-based and recursive way to address these two issues.

Chapter Level
Table 4 displays the results. Kryściński et al. (2021) take a divide-and-conquer approach to summarize chapters. They finetune BART, T5, and Pegasus on the paragraph-level data, and the chapter summary is obtained by concatenating the paragraph summaries. This might miss the cross-paragraph context. Our models directly summarize the whole chapters and outperform these divide-and-conquer models. Efficient transformers, Longformer and BigBird, are also able to take in whole chapters as inputs, but these bottom-up approaches clearly underperform our models.

Book Level
We first train a top-down transformer on the chapter-level data and then fine-tune it on the book-level data. The inputs to the book-level model are (1) the concatenated chapter reference summaries in training, or (2) the concatenated chapter summaries generated by the chapter-level model in testing. The chapter-to-book curriculum training is meant to mitigate the scarcity of book-level data. The recursive summarization of chapters and then books can be considered abstractive content selection applied to book data. Table 5 summarizes the book-level results. The middle section shows the performance of the models with the divide-and-conquer approach (Kryściński et al., 2021), the same as those for the chapter-level data. Wu et al. (2021) also attempt to summarize books, using GPT-3 with reinforcement learning (RL) finetuning. Their results are shown in the third section of Table 5. Their method shares similarities with ours in that they decompose books into shorter sequences and train the model and summarize the text segments recursively. There are three differences between our approach and theirs. First, we train our model with the limited data from BookSum, while Wu et al. (2021) require human labelers to write summaries, which is highly costly. Second, our model has lower complexity, allowing it to take in longer inputs. Thus, we only need to decompose the book once (into chapters), in contrast to multiple recursive decomposition steps.
Multiple recursive summarization steps are prone to accumulating errors. Third, GPT-3 uses purely bottom-up inference to infer token representations, in contrast to the synergy of bottom-up and top-down inference in our approach. The last two differences might account for our competitive performance using a much smaller model (0.46B vs. 175B parameters) and less data.
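A minimal sketch of this two-stage procedure, assuming trained chapter-level and book-level models that expose a hypothetical summarize(text) -> str method:

```python
from typing import List

def summarize_book(chapters: List[str], chapter_model, book_model) -> str:
    """Recursive book summarization with a single decomposition step."""
    # Stage 1: summarize each chapter with the chapter-level model.
    chapter_summaries = [chapter_model.summarize(ch) for ch in chapters]
    # Stage 2: the concatenation fits within the book-level model's input length,
    # so no further recursive decomposition is needed.
    return book_model.summarize("\n".join(chapter_summaries))
```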

Ablation Studies
Our method has two key components: (1) local attention in the bottom-up computation, and (2) top-down update to inject global context. We conduct ablation studies on these two factors. All ablation experiments are performed with PubMed.
We first ablate the top-down update (TDU). The results are summarized in Table 6. The first row shows the performance of the top-down transformer with top-down update via cross-attention and window size 1024, which is our final model. The second row shows the performance of a variant of the top-down update. In this variant, to update the bottom-up-inferred token representations, we concatenate the token representations with the corresponding top-level segment representations, in contrast to the cross-attention approach used in the final model. We can see a clear performance degradation, indicating the importance of the cross-attention-based top-down update. The third row displays the results without the top-down update, where the decoder attends to the bottom-up-inferred token representations to generate summaries. Compared to our final model, the performance is also degraded, confirming the effectiveness of the top-down update.
The lower panel of Table 6 presents ablations on the window size (WS) of local attention. As the window size increases, the performance on all metrics improves. The effect is quite large when the window size is increased from 32 to 256. The effect becomes smaller after 256, but model performance can still benefit from larger window sizes.

Related Work
Summarization Models Prior works have proposed extractive models (Nallapati et al., 2017; Cui and Hu, 2021), abstractive models (Nallapati et al., 2016; Zhang et al., 2020), and hybrid models combining extractive and abstractive methods (Gehrmann et al., 2018; Pilault et al., 2020) for text summarization. Although our model mostly follows the abstractive approach, it also has connections to the hybrid models. These models usually first extract salient sentences from the source document and then summarize the extracted sentences with an abstractive model. Extracted sentences can be viewed as a high-level representation of the document, although one in the observed space rather than the latent space as in our framework. Continuous representations in the latent space facilitate end-to-end learning. Moreover, assigning importance weights with the importance tagger in our method resembles the extractive step in a hybrid model, and thus the top-down transformer with a learned importance tagger can be considered a hybrid model.
Efficient Transformers Despite the effectiveness of transformers on a variety of tasks, their quadratic complexity with respect to the sequence length has limited their application to problems with long sequences. A large body of work has attempted to address this limitation. A major line of work focuses on designing various sparse attention mechanisms. These works can be roughly categorized into two groups, depending on whether the sparsity pattern is content-dependent (Kitaev et al., 2020; Roy et al., 2021; Wang et al., 2021; Liu et al., 2021) or content-independent (Child et al., 2019; Beltagy et al., 2020; Ainslie et al., 2020; Zaheer et al., 2020). Our work is mostly related to content-independent sparse attention. A main assumption of content-independent sparse attention is that the context temporally and/or spatially proximate to the query token is more important, which is intuitively sensible and supported by empirical attention analysis (Child et al., 2019). Thus, a common sparse attention pattern is local attention, where each query token only attends to a neighborhood within a fixed temporal and/or spatial window. While this reduces the complexity to be linear, a model with only local attention cannot model long-range dependency. Prior works combine local attention with other attention patterns with wider or global receptive fields, such as dilated attention, random attention tokens, and global attention tokens (Beltagy et al., 2020; Zaheer et al., 2020). Our models also use local attention for its efficiency and leverage top-down inference to enable global-context awareness.

Conclusion
In this work, we propose a summarization method which combines bottom-up computation with top-down computation to improve token representation inference. In the bottom-up pass, token representations are inferred with local self-attention to exploit its efficiency. Top-down correction is then applied to allow tokens to capture global context. Our model achieves (1) state-of-the-art performance on a wide range of long document summarization benchmarks, and (2) competitive performance on summarizing whole books using 0.27% of the parameters and much less training data, compared to a recent GPT-3-based model. These results indicate the general applicability and benefits of the proposed framework.

Limitations
In the current work, we only explore a model with a single top-level layer. It would be a fruitful research direction to study models with multiple layers, with growing levels of abstraction. This might improve both the efficiency and performance of the current model, since long-range dependency is mostly captured by higher-level layers and the window size at the low level can be small.

PubMed Example #1: Reference
a new class of water -soluble c60 transfecting agents has been prepared using hirschbingel chemistry and assessed for their ability to act as gene -delivery vectors in vitro. in an effort to elucidate the relationship between the hydrophobicity of the fullerene core, the hydrophilicity of the water -solubilizing groups, and the overall charge state of the c60 vectors in gene delivery and expression, several different c60 derivatives were synthesized to yield either positively charged, negatively charged, or neutral chemical functionalities under physiological conditions. these fullerene derivatives were then tested for their ability to transfect cells grown in culture with dna carrying the green fluorescent protein ( gfp ) reporter gene. statistically significant expression of gfp was observed for all forms of the c60 derivatives when used as dna vectors and compared to the ability of naked dna alone to transfect cells. however, efficient in vitro transfection was only achieved with the two positively charged c60 derivatives, namely, an octa -amino derivatized c60 and a dodecaamino derivatized c60 vector. all c60 vectors showed an increase in toxicity in a dose -dependent manner. increased levels of cellular toxicity were observed for positively charged c60 vectors relative to the negatively charged and neutral vectors. structural analyses using dynamic light scattering and optical microscopy offered further insights into possible correlations between the various derivatized c60 compounds, the c60 vector / dna complexes, their physical attributes ( aggregation, charge ) and their transfection efficiencies. recently, similar gd@c60 -based compounds have demonstrated potential as advanced contrast agents for magnetic resonance imaging ( mri ). thus, the successful demonstration of intracellular dna uptake, intracellular transport, and gene expression from dna using c60 vectors suggests the possibility of developing analogous gd@c60 -based vectors to serve simultaneously as both therapeutic and diagnostic agents.
PubMed Example #1: Model Output
a new class of water -soluble c60 transfecting agents with positively charged, negatively charged, or neutral chemical functionalities under physiological conditions was prepared using hirschbingel chemistry. transfection, cytotoxicity, and biophysical structure / activity studies were performed in an effort to elucidate the relationship between the hydrophobicity of the fullerene core, hydrophilicity, and the overall charge state of the c60 vectors in gene delivery / expression. only two positively charged c60 derivatives, namely, an octaamino derivatized c60 and a dodeca -amino derived c60 vector, showed efficient in vitro dna delivery into cells. increased levels of cellular toxicity were observed for positively charged compounds relative to the negatively charged and neutral vectors, and aggregation was the major factor that negatively affected the cytotoxic profiles of these c60 -vector / dna complexes. the successful demonstration of intracellular dna uptake and gene expression from dna dna conjugates formed from c60 compounds suggests the possibility of developing analogous gd@c60 vectors to serve as both therapeutic and diagnostic agents.
PubMed Example #2: Reference
context : isolated involvement of the appendix in crohn's disease is reported to be 0.2 % to 1.8 %, and is usually associated with ileocaecal crohn's disease in 25 % of ileal and 50 % of caecal disease. while appendicitis in a patient who was previously diagnosed to have ileocaecal crohn's may be managed with appendicectomy and ileocaecal resection, appendicectomy alone when performed for appendicitis in a patient with unsuspected ileocaecal crohn's disease could lead to postoperative complications including enterocutaneous fistula. casereport : a young female patient who underwent appendicectomy elsewhere for acute appendicitis presented to us with a persistent enterocutaneous fistula of 6 weeks duration. she had complained of general ill health and occasional altered bowel habits for 6 months prior to the acute appendicitis presentation. our investigations, including a ct scan, suggested the possibility of ileocaecal crohn's disease. she underwent excision of the enterocutaneous fistula and ileocaecal resection, and histopathology of the resected specimen confirmed crohn's disease. in the postoperative period she received mesasalazine. when last seen 2 years later during her regular follow -up, she was found to be in good health. conclusion: the possibility of ileocaecal crohn's disease should be considered in patients presenting with unexplained postoperative enterocutaneous fistula following appendicectomy. a high index of clinical suspicion is required to make a prompt diagnosis and institute appropriate further treatment in form of ileocaecal resection.
PubMed Example #2: Model Output
introduction : isolated involvement of the appendix in crohn's disease is reported to be 0.2 % to 1.8 %, and is usually associated with ileocaecal crohn's disease in 25 % of ileal and 50 % of caecal disease. while appendicitis in a patient who was previously diagnosed to have ileocecal ileocecal crohn's may be managed with appendicectomies, appendicectomy alone when performed for acute appendiculectomy could lead to postoperative complications including enterocutaneous fistula. casereport : a young female patient who underwent appendiclectomy elsewhere presented to us with a persistent enterococcal fistula of 6 weeks duration. on further questioning, she informed us that she had presented to this hospital with a 2 -week history of abdominal pain, confined to the right lower abdomen. she had complained of general ill health and occasional altered bowel habits for 6 months prior to the acute appendectomy presentation. our investigations, including a computerized tomography scan, suggested the possibility of crohns disease. she underwent excision of the enterococcus fistula and resection of the caecum, and histopathology of the resected specimen confirmed the diagnosis. in the postoperative period she received mesasalazine. when last seen 2 years later during her regular follow -up, she was found to be in good health. conclusion: appendicection alone is a safe procedure in patients with suspected crohn ileocoecal involvement. while appendiceal resection will suffice in those who present subacutely and are diagnosed preoperatively by clinical signs, a high index of clinical suspicion is required to make a prompt diagnosis.

arXiv Example #1: Reference
we study experimentally the thermal fluctuations of energy input and dissipation in a harmonic oscillator driven out of equilibrium, and search for fluctuation relations. we study transient evolution from the equilibrium state, together with non equilibrium steady states. fluctuations relations are obtained experimentally for both the work and the heat, for the stationary and transient evolutions. a stationary state fluctuation theorem is verified for the two time prescriptions of the torque. but a transient fluctuation theorem is satisfied for the work given to the system but not for the heat dissipated by the system in the case of linear forcing. experimental observations on the statistical and dynamical properties of the fluctuation of the angle, we derive analytical expressions for the probability density function of the work and the heat. we obtain for the first time an analytic expression of the probability density function of the heat. agreement between experiments and our modeling is excellent.
arXiv Example #1: Model Output
we study the fluctuations of energy and dissipation in a thermostated harmonic oscillator driven out of equilibrium by an external deterministic forcing. this oscillator is very well described by a second order langevin equation. using the fluctuation dissipation theorem, we obtain fluctuation relations for the work given to the system by the external torque and for the heat dissipated by the system, by writing the first principle of thermodynamics between the two steady states. from the experimental results, we derive analytical expressions for the probability density functions of the work and the dissipated heat, and for their finite time corrections, both for a forcing linear in time and a forcing sinusoidal in time. from these expressions, we present some analytical derivations of fluctuation theorems based on experimental observations.

arXiv Example #2: Reference
the pairing hamiltonian constitutes an important approximation in many-body systems, it is exactly soluble and quantum integrable. on the other hand, the continuum single particle level density ( cspld ) contains information about the continuum energy spectrum. the question whether one can use the hamiltonian with constant pairing strength for correlations in the continuum is still unanswered. in this paper we generalize the richardson exact solution for the pairing hamiltonian including correlations in the continuum. the resonant and non -resonant continuum are included through the cspld. the resonant correlations are made explicit by using the cauchy theorem. low lying states with seniority zero and two are calculated for the even carbon isotopes. we conclude that energy levels can indeed be calculated with constant pairing in the continuum using the cspld. it is found that the nucleus @xmath0c is unbound. the real and complex energy representation of the continuum is developed and their differences are shown. the trajectory of the pair energies in the continuum for the nucleus @xmath1c is shown.
arXiv Example #2: Model Output
the exact solution of the richardson pairing hamiltonian is obtained by including the resonant and non resonant continuum through the continuum single particle level density ( cspld ). the gamow states, which appear in the complex energy representation, provide the main contribution from the continuum. the inclusion of the continuum has been used to study the unbound isotope @xmath0c and beyond. it was found that the continuum pairs ( pair energies with positive real components ) converge to the real part of the uncorrelated pair energy and they do not appear in complex conjugate partners. as a result the total energy of the system may be complex. from the exact solutions of the pairing and pairing -like hamiltonians the csmld can be used to investigate many -body correlations.

ForeverDreaming Example #1: Reference
Sydney races to find a cure for Vaughn, but in order to find the antidote, Sydney must make a deal with Sark that could endanger Sloane's life. Meanwhile, Will continues his research for Vaughn and discovers some disturbing inconsistencies involving 20-year -old standardized IQ tests. Sydney finds out that Vaughn has a girlfriend.

ForeverDreaming Example #1: Model Output
Sydney must find a way to save Vaughn's life while trying to track down Sark's base of operations in order to obtain the antidote to a deadly virus that Sloane has been infected with. Meanwhile, Vaughn is in critical condition after being exposed to Rambaldi's toxin, and Sydney is forced to make a dangerous deal with Sark in exchange for his aid in getting rid of him. Will meets with a professor to find out if any of the CIA's standardized tests were ever used to recruit American children in the 1980's, and discovers that one in ten thousand first graders could pass the test.
ForeverDreaming Example #2: Reference
Half a universe from Earth, the Doctor and Martha answer a distress call from the cargo ship SS Pentallian, which will impact a star in 42 minutes. The ship scooped out part of the star, which is alive, for cheap fuel. The star uses crew members Korwin and Ashton as host bodies, and begins taking out everyone on board. Martha and crew member Riley, while trying to reach the front of the ship, are jettisoned into space in an escape pod by Ashton. The Doctor remagnetises the pod to bring Martha and Riley back. The Doctor begins being taken over by the star, and tells Martha to vent the engines, getting rid of the " sun particles " in the fuel. This causes the engines to start working again, and frees the ship from the star's pull.
ForeverDreaming Example #2: Model Output
The Doctor and Martha are trapped on board a space station in the Torajji system, where the crew are trying to prevent the ship from colliding with the sun. The Doctor uses the sonic screwdriver on Martha's mobile phone to activate Universal Roaming Activation, which allows him to travel anywhere in space and time without interference from the ship's control centre. However, the device malfunctions and the ship begins to fall towards the sun, and the Doctor is forced to use the emergency escape pod to escape. The pod, which contains the Doctor, Martha and two other crewmembers, is destroyed by the impact, but the Doctor manages to return to the control centre to try and stop the ship hitting the sun before it does so.

BookSum Book-Level Example #1: Reference
At the opening of Act I, it is a cloudy autumn day on a Russian country estate. In the garden, the old nurse Marina stands at the samovar and offers Doctor Astrov something to eat, but he refuses. He complains about the difficulty of his job. Telegin, an impoverished local landowner, sits with them. Voynitsky, known as Vanya, comes out of the house and joins them. He is almost fifty and is weary and irritable. He complains about his brother-in-law, Serebryakov, Serebryakov's young second wife, Helen, and about how their visit has turned the place upside down. Serebryakov, Helen, and Serebryakov's daughter, Sonya, join them for a moment. After they depart, Vanya sighs about Helen's beauty and then complains about how he has toiled his whole life on this estate for the professor and it has come to naught. After Vanya's sister's death, he and Sonya worked here so the professor could continue his studies and his writings, but Vanya has come to see that work as foolish and irrelevant. When Astrov suggests that Vanya is jealous, Vanya laughs that he obviously is, especially as the old, gout-and-rheumatism-ridden man seems to attract beautiful women. Helen ventures outside and tells Astrov his services are not needed for her husband. Mrs. Voynitsky, Vanya's mother and Sonya's grandmother, tells them about a new pamphlet written by a friend in Kharkov. When Vanya sneers that all they do is read pamphlets, she becomes distressed and claims he hates her. Vanya merely says he is old, tired, and frustrated. A laborer arrives and tells Astrov he is wanted at the factory; the doctor bitterly departs, but not before they all discuss how he is very interested in forestry work. Sonya speaks up cheerfully about how Astrov is trying to save the old forest from destruction because forests make people happier. Astrov speaks of how Russians have torn down the forests and destroyed the wildlife: they no longer create, but rather destroy.
After Sonya walks Astrov out, Vanya tries to seduce Helen, but she pushes him away. She muses about how Sonya clearly seems to love the doctor but he does not love her back.
Helen sighs that she is simply bored and life is too much for her. In Act II, Serebryakov complains to Helen of how he is old and no one respects him. His querulous behavior only annoys Helen, who begs him to stop it. Serebryakov ignores her and bemoans how his life of scholarship seems to be nothing now. Sonya joins them and tells them Serebryakov must see Astrov now; she wants her father to stop behaving like a child. The elderly nurse Marina comforts Serebryakov and leads him out. Helen tells Vanya, who entered the room, that her husband wearies her. Vanya can only lament that everything is over for him and his life was wasted on trivial things. Helen is annoyed and moves to leave, but he bars her way. She accuses him of being drunk, and he admits to it. After Helen sweeps out of the room, Vanya ruminates on what a fool he was not to fall in love with her when she was younger; he once admired the professor, but now he does not. When Astrov returns, he mocks Vanya for having feelings for Helen, but Vanya will not admit it. Astrov leaves to get a drink; Sonya pulls him aside and makes him promise to stop drinking and stop getting her uncle drunk. He agrees. They continue to talk for a moment. He comments that Helen is beautiful but idle and useless. This country life makes people like that, and he despises it; he has been beaten down and sees no light at the end for himself. The peasants are all the same, and educated people are ridiculous. He only likes forests. Sonya compliments him and tries to cheer him up. As he prepares to leave, she asks how he might feel if he were to find out that a friend of hers has feelings for him, and he drolly says he cannot love anyone. After he leaves, Sonya feels a surge of happiness though she is not sure why. In Act III, Sonya confesses to Helen that she loves Astrov, and Helen suggests that she say something to see if the doctor loves Sonya too. Sonya gives her permission for Helen to do this. Astrov and Helen meet to ostensibly look at his forestry maps. He discourses volubly on the patterns of deforestation until he sees that Helen is uninterested. Helen insists she is interested but says they should talk about something else. She point-blank asks if he likes Sonya, and he says no. He then moves in to seduce Helen, but she wants none of it. As he tries to kiss her, Vanya enters the room with flowers. Helen is horrified by the situation and begs Vanya to tell her husband that they must leave today. A moment later, Serebryakov and the others enter and Serebryakov announces that he has an idea to sell the estate because he and Helen need to afford a place in the city. This announcement angers Vanya tremendously, and he begins to complain violently about how Serebryakov is a fraud, is uninspired, is thankless, and how he, Vanya, has labored for Serebryakov his whole life and for no reason. He insists this is Sonya's estate. He runs out of the room. Serebryakov is startled by Vanya's outburst. He insists he cannot stay here anymore. Sonya implores him to talk to her uncle and he agrees. He departs, and those in the room hear a gunshot, then another. Helen and Vanya struggle over a revolver as Vanya screeches that he missed the professor. Cursing, he sinks into a chair. In Act IV, Telegin and Marina discuss Serebryakov and Helen's planned departure for that day, then exit the room. Vanya and Astrov come in. Astrov mocks Vanya for his behavior and asks him to return what he stole. Vanya maintains that he is innocent of theft. Astrov laments how this parochial existence crushes people: it is stultifying and useless. When Sonya enters Astrov tells her to tell her uncle to return the bottle of morphia he stole. Sonya turns to her uncle and tearfully asks him for the bottle. He complies. She takes him to make up with her father. Helen enters to tell Astrov goodbye. He tries to seduce her again in a rather lackluster fashion; she kisses him and bids him farewell. Everyone bids goodbye to Serebryakov and Helen, who leave the estate. Sonya and Vanya return to work. Marina suggests Astrov stay for dinner; he refuses and says he must be off. Astrov leaves, and even though Sonya knows he did not love her, she is sad. Vanya, though, claims he is extremely depressed. With both in tears, Sonya comforts her uncle. She tells him that life may be difficult for them now, but the afterlife will be full of peace, love, and rest.

BookSum Book-Level Example #1: Model Output
The scene opens in a country house in the Russian countryside. Ivan, a young man, is sitting with his mother, who is knitting him a Christmas stocking. He tells her that he does not want to have any tea, but she insists that he drink it anyway. She asks him how long they have known each other, and he says that it has been ten years since he first met her. He has not had a day of rest since he has known her, and has toiled without repose or a day's freedom ever since. He had a typhoid epidemic in the third week of Lent, and when he returned home, he had to take care of a man who had died of typhoid fever. He then sat down and closed his eyes for a while and wondered if he had killed the man. He wonders if the future generations will forget him, but he is sure that they will not. Ivan tells them that the house is a mess, because the Professor has been up all night writing and doesn't have time to eat or drink. He asks them to wake him up at two in the morning, but when they do not wake him, he thinks that the Professor wants some tea. They all come in from their walk, and as they come in, they talk about how happy they are to be together, and how they are all living in peace and contentment. Ivan is struck by the sight of a beautiful woman, and says that as he looks at her, his heart swells with unbridled happiness. The work of preserving the forest is thrilling, he says, and Dr. Astroff watches over the old woods and sets out new plantations every year. He wants to show them to Helena and Sonia, and Helena asks him if he finds it interesting. Helena is sitting next to him, and tells him to look for Batushka's works in the library the next day. He is worried that he is getting angina from his old age. Helena tries to tell him to make peace with himself, and to stop being so angry with everyone else. But he says he will not listen to her, because he has spent too much time thinking about the past, and now he cannot bear to watch the success of others and to fear death. Helena gets up and walks away from him, sitting down at a distance. She tries to get him to tell her why he doesn't love anyone. He says he is not an idiot, that he can understand her, but that he cannot understand her because she is young and healthy and beautiful and longing for life and he is an old dotard, almost a dead man. She tells him that she can understand him and that he should be able to understand her. Now, she says, she is too old and sick to continue to live in the country, and she has a young daughter and a young wife to care for. She doesn't want to live on the income generated by the estate, so they have to find some way to guarantee that they can live in town on the money generated each year. They decide to sell the woods, because they don't have enough money to do so every year, and they can't afford to buy the woods every year either. They need to find a way to make sure they can afford a certain amount of money to buy a cottage in Finland. They also have to figure out what to do with the rest of the property. The estate is worth ninety-five thousand roubles, and the estate was bought from his uncle, who had a debt of twenty-five. Semyon Lakedemonoff's brother bought the estate from the Lakedemons' uncle. The family ties are what got the estate to be worth so much money. Everyone agrees that the estate is for Sonia's good, and that she should get to keep it. But now that he has gotten old and ill, the time has come for him to dispose of his property in regard to the interests of his VOITSKI, ASTROFF, SEREBRAKOFF, Vanya, and SONIA arrive at the house to say goodbye to Tommo and Marina. They are to move to Kharkov to live with the professor and his wife. They have been frightened by what they have just witnessed, and decide to go to the city to see if they can find a place to live there. They will not be staying in the village any longer. Except for Vanya and Alexander, who stay to say good-bye to his wife and son-in-law. The Professor kisses them all three times, and then goes out to see them off. He gives them one last kiss to each of them before he leaves. They say they will always remember each other with pleasure, that they are interesting and original, and original. They shall rest

Figure 1 :
Figure 1: An overview of the top-down transformer. Suppose a document with 7 tokens is the input to the model, as shown on the bottom left. The bottom-up inference is achieved with local self-attention ($N_1$ layers), as shown in the left panel. To initialize the top-level representations, we pool bottom-up-inferred token representations with either equal weights or adaptive weights (see Section 2.3 for details). Top-level representations are then updated with full self-attention ($N_2$ layers) to capture global context. They are then used to update the bottom-up-inferred token representations, which constitutes the top-down update for token representations, as shown in the middle panel. The final token representations are attended to by the decoder to generate a summary. Note that inference is used in the sense of statistical inference for latent variables and does not imply the absence of training.

Table 1 :
Results on Scientific Articles.Best performance (no oracle) is in bold, and the second best is underlined.

Table 2 :
Results on CNN-DailyMail. Best performance (no oracle) is in bold, and the second best is underlined.

Table 3 :
Results on SummScreen.Best performance (no oracle) is in bold, and second best is underlined.

Table 4 :
Results on BookSum Chapter Level.Best performance (no oracle) is in bold, and second best is underlined.

Table 5 :
Results on BookSum Book Level.Best performance (no oracle) is in bold, and second best is underlined.

Table 6 :
Ablation studies of Top-Down Transformer.TDU: top-down update.WS: window size.

Table 7 :
Summarization Datasets.It shows the total number of documents, the average number of input words, the average number of summary words, and the domain for each dataset.

Table 8 :
Summary Samples for PubMed

Table 9 :
Summary Samples for arXiv

Table 12 :
Summary Samples for ForeverDreaming

Table 13 :
Summary Samples for BookSum Book-Level