Globalizing BERT-based Transformer Architectures for Long Document Summarization

Fine-tuning a large language model on downstream tasks has become a commonly adopted process in Natural Language Processing (NLP) (CITATION). However, such a process, when associated with the current transformer-based (CITATION) architectures, shows several limitations when the target task requires reasoning over long documents. In this work, we introduce a novel hierarchical propagation layer that spreads information between multiple transformer windows. We adopt a hierarchical approach where the input is divided into multiple blocks that are independently processed by the scaled dot-product attentions and combined between successive layers. We validate the effectiveness of our approach on three extractive summarization corpora of long scientific papers and news articles. We compare our approach to standard and pre-trained language-model-based summarizers and report state-of-the-art results for long document summarization and comparable results for shorter document summarization.


Introduction
Language model pre-training has become a key component to improve performance on a majority of Natural Language Processing (NLP) tasks (Wang et al., 2019). Most of the recent competitive architectures (Lan et al., 2020; Liu et al., 2019b; Radford et al., 2018) are based on the efficient transformer layer introduced in Vaswani et al. (2017). BERT (Devlin et al., 2019) is one of these architectures and has been widely adopted for comprehension and generation tasks. It is a multi-layer transformer network, pre-trained with different self-supervised objectives. Numerous variations of transformer architectures have been proposed to improve this approach (Lan et al., 2020; Liu et al., 2019b; Radford et al., 2018). However, this type of process is only evaluated on tasks composed of relatively short input texts, such as GLUE (Wang et al., 2019), SQuAD (Rajpurkar et al., 2016), and SWAG (Zellers et al., 2018). Indeed, for tasks that require reasoning over longer documents, this approach exhibits several limitations. The memory footprint of the transformer self-attention increases quadratically with the number of input tokens, making it technically impossible to compute on document-scale sequences. In addition, these models usually require defining a fixed maximum input length, typically of 512 tokens, at the pre-training stage.
One solution is to pre-train the entire model on longer sequences. However, this still requires massive computational power and only pushes the length limitation further. Other alternatives have been proposed to extend multi-layer transformer architectures to longer sequences without modifying this maximum length limitation. The first is to limit the input sequence to its first tokens by removing the text beyond the length limit. Obviously, this is not a reasonable solution for long documents that are consistently longer than this limit. The second alternative is to apply the model on a window that slides over the whole document. It has been used in Wolf et al. (2019) to deal with SQuAD documents that are longer than the 512-token limitation and in Joshi et al. (2019) for a co-reference resolution task on long documents. This approach can only work if the tokens need to be contextualized only within their surroundings, because there is no interaction between the different windows. It seems to be a viable solution for co-reference resolution (Joshi et al., 2019), as such cases can usually be solved within a reasonably sized window. Another approach adopted to deal with long or multiple documents is to select a sub-sample of the input that is small enough for the transformer model. Most of the state-of-the-art pipelines on the multi-hop question answering dataset HotpotQA (Yang et al., 2018) use a first model to retrieve the relevant pieces of text before feeding them to a transformer-based architecture (Fang et al., 2019a; Tu et al., 2019).
We argue that these solutions are not adequate for tasks that require a global understanding of long documents. An example is extractive summarization, where the decision for each sentence should be based on information from the complete document. To address these challenges, we propose a simple adaptation of the multi-layer transformer architecture that can scale to long documents and benefit from pre-trained parameters with a relatively small length limitation. The general idea is to independently apply a transformer network on small blocks of text, instead of a long sequence, and to share information among the blocks between two successive layers. To the best of our knowledge, this is the first attempt to introduce hierarchical components directly between the layers of a pre-trained model and not only on top of it (Fang et al., 2019b; Zhang et al., 2019b; Tu et al., 2020). Between each of the transformer layers, we use a Bidirectional Gated Recurrent Unit (BiGRU) network (Cho et al., 2014) to spread global information across the blocks. Adding these propagation layers between the transformer layers preserves the original structure of the pre-trained model and makes it possible to transfer parameter weights from a large pre-trained language model with only a few additional parameters to propagate information between blocks.
The contributions of this paper can be summarized as follows: (i) we propose a novel architecture dedicated to long documents which interweaves recurrent hierarchical modules with transformer layers and which exploits pre-trained language models like BERT, and (ii) we demonstrate that this architecture is able to build informative representations in the context of extractive summarization.

Global BERT-based Transformer Architecture
In this section, we briefly recall the transformer layer from Vaswani et al. (2017) and its integration in the BERT model (Devlin et al., 2019). Then we describe our modifications of this architecture that allow the model to read longer documents.

Transformers: The transformer architecture, based on a sequence of transformer layers, was initially introduced in Vaswani et al. (2017). The key idea of this layer is to produce a contextualized representation of an input sequence of tokens. It is composed of the succession of a multi-head self-attention, a first normalization layer, a feed-forward neural network, and a second normalization layer. This model, originally introduced for machine translation, has since been adopted for most natural language comprehension tasks. Most of the successful approaches (Liu et al., 2019b; Lan et al., 2020) are composed of multiple stacked transformer layers. In the remainder, we denote by T^ℓ the transformation corresponding to the ℓ-th transformer layer, 1 ≤ ℓ ≤ L (T^ℓ is a function from R^{N×h} to R^{N×h}, where N denotes the length of the sequence and h the hidden dimension).
BERT (Devlin et al., 2019) is a multi-layer transformer encoder pre-trained on large text corpora. Two BERT architectures have been proposed: BERT-BASE, composed of 12 stacked transformer layers with a hidden dimension of 768 (L = 12, h = 768), and BERT-LARGE, composed of 24 layers with a hidden dimension of 1024 (L = 24, h = 1024). For both architectures, the input length is limited to 512 WordPiece tokens, and the pre-training includes two self-supervised tasks, namely masked language modeling and next sentence prediction. For masked language modeling, 15% of the WordPiece tokens of the input sequence are masked or corrupted, and the model is trained to predict the original tokens with a cross-entropy loss. For next sentence prediction, the model is trained as a classifier to predict whether two sentences are contiguous or not. The pre-training procedure uses the BooksCorpus (Zhu et al., 2015) and documents from English Wikipedia. It requires 4 days of optimization on 16 TPU chips for BERT-BASE and 64 TPU chips for BERT-LARGE.
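To make the masked language modeling objective concrete, the following toy sketch reproduces the 15% masking scheme in PyTorch. The 80/10/10 replacement split follows the original BERT recipe; the function name and the (simplified) handling of special tokens are our own choices, not the authors' code.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Toy sketch of BERT-style masking: 15% of positions are selected for
    prediction; 80% of those become [MASK], 10% a random token, 10% stay
    unchanged. Labels are -100 elsewhere so cross-entropy ignores them.
    Special tokens are not excluded in this simplified version."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100

    corrupted = input_ids.clone()
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[masked] = mask_token_id                                      # 80%: [MASK]
    randomized = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    corrupted[randomized] = torch.randint(vocab_size, (int(randomized.sum()),))  # 10%: random token
    return corrupted, labels                                               # remaining 10%: unchanged
```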

Stacked Propagation Layers
We propose a hierarchical structure that uses pre-trained transformers to encode local text blocks that are then used to compute document-level representations. The novel contribution of this work, depicted in Figure 1, is to incorporate recurrent hierarchical modules between the different transformer layers, and not only on top of the model as proposed in several recent works (Fang et al., 2019b; Zhang et al., 2019b; Tu et al., 2020). Because we construct and propagate document-level information between the layers, global and local information are fused at every level of the architecture. The text blocks can be sentences, paragraphs, or sections. We experiment with sentences as blocks because a sentence generally does not exceed the maximum length allowed by pre-trained models and because BERT has been shown to be well adapted to represent such sequences.

Figure 1: Our proposed modification of a multi-layer transformer architecture. The input sequence is composed of K blocks of tokens. Each transformer layer is applied within the blocks, and a bidirectional GRU network propagates information across the whole document by updating the [CLS] representation of each block.
We start by splitting the original sequence into multiple blocks. Let D be a document composed of K blocks, D = {B_1; B_2; …; B_K}, where a block B_k, 1 ≤ k ≤ K, is composed of n_k tokens. To follow the convention of BERT, special tokens [CLS] and [SEP] are respectively added at the beginning and end of each block of the document, so that B_k = ([CLS], x_{k,1}, …, x_{k,n_k}, [SEP]), where x_{k,i} is the index of the WordPiece token i of block k. In the remainder, the index 0 (resp. n_k + 1) will be used to refer to the representation of the [CLS] (resp. [SEP]) token in each block.
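As an illustration, the sketch below builds such blocks with the HuggingFace BERT tokenizer; the helper name and the per-block length cap are our own choices, not part of the paper.

```python
from transformers import BertTokenizer

def build_blocks(sentences, tokenizer, max_block_len=512):
    """Hypothetical helper: wrap each sentence into its own block with
    a [CLS]/[SEP] pair, as described above."""
    blocks = []
    for sent in sentences:
        ids = tokenizer.encode(sent, add_special_tokens=True,   # prepends [CLS], appends [SEP]
                               truncation=True, max_length=max_block_len)
        blocks.append(ids)
    return blocks  # blocks[k] = [CLS] x_{k,1} ... x_{k,n_k} [SEP]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
doc = ["The first sentence of the document.", "A second, much longer sentence ..."]
blocks = build_blocks(doc, tokenizer)
```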
Embedding Layer: Because our goal is to reuse the available pre-trained BERT parameters, token representations are kept the same as in the original BERT and are composed of a token embedding, a segment embedding, and a positional encoding that represents the position of the token in its block. We will denote by E_k (E_k ∈ R^{(n_k+2)×h}, 1 ≤ k ≤ K) the embedding representation of block k.

Propagation Layers: Our model is composed of L stacked identical hierarchical layers, called propagation layers, that comprise a transformer layer, a BiGRU to propagate information across blocks and, finally, a feed-forward network. For any layer ℓ, 1 ≤ ℓ ≤ L, let U_k^ℓ ∈ R^{(n_k+2)×h} be the representation of block k after the (ℓ−1)-th layer, the representation for the first layer being initialized with the output of the embedding layer: U_k^1 = E_k, ∀k ∈ {1, …, K}. We first apply the pre-trained transformer function T^ℓ individually on each block of the document to compute local, token-aware representations V_k^ℓ ∈ R^{(n_k+2)×h}:

V_k^ℓ = T^ℓ(U_k^ℓ), ∀k ∈ {1, …, K}.

The next step is to propagate information across all the blocks of the document in order to compute a global, block-aware representation for the document at layer ℓ, denoted by W^ℓ ∈ R^{K×h} (with rows W_k^ℓ, 1 ≤ k ≤ K).
To do so, we use a BiGRU network, fed with the representation vectors of the different blocks, and apply a feed-forward neural network to preserve the hidden dimension of the transformer. Each block k is represented by its [CLS] vector, i.e., the vector V_{k,0}^ℓ ∈ R^h at the first position in the local representation of the block. These representations are concatenated to form the input to the BiGRU. The global, block-aware representation is then computed by applying the feed-forward neural network (FFNN) to all K outputs of the BiGRU:

W_k^ℓ = FFNN(BiGRU_k([V_{1,0}^ℓ ; … ; V_{K,0}^ℓ])), for 1 ≤ k ≤ K,

where BiGRU_k denotes the k-th output of the BiGRU and [· ; ·] is the concatenation operation.
At this stage, we have computed, for a given document, local block representations V_k^ℓ (1 ≤ k ≤ K) and a global representation W^ℓ. We combine them to build the output representation of the layer:

U_{k,0}^{ℓ+1} = W_k^ℓ and U_{k,i}^{ℓ+1} = V_{k,i}^ℓ for 1 ≤ i ≤ n_k + 1.

As one can note, U_k^{ℓ+1} ∈ R^{(n_k+2)×h} is a representation of block k in which the [CLS] vector has been enriched with document-level information propagated from the other blocks. U_k^{ℓ+1} is then used as input for the next propagation layer.
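The following PyTorch sketch summarizes one propagation layer as we understand it from the description above. The class and argument names are ours, batching and padding masks are omitted for brevity, and the wrapped transformer layer is assumed to be a pre-trained BERT layer that returns a tuple whose first element is the hidden states.

```python
import torch
import torch.nn as nn

class PropagationLayer(nn.Module):
    """Sketch of one propagation layer: a pre-trained transformer layer applied
    independently to each block, a BiGRU over the [CLS] vectors of all blocks,
    and an FFNN whose output overwrites each block's [CLS] slot.
    This is our reading of the paper, not the authors' code."""

    def __init__(self, transformer_layer, hidden=768, gru_hidden=384):
        super().__init__()
        self.transformer = transformer_layer                  # T^l, e.g. one BertLayer of bert-base-uncased
        self.bigru = nn.GRU(hidden, gru_hidden,
                            batch_first=True, bidirectional=True)
        self.ffnn = nn.Linear(2 * gru_hidden, hidden)          # maps 2 x 384 back to 768

    def forward(self, blocks):
        # blocks: list of K tensors U_k^l, each of shape (1, n_k + 2, hidden)
        local = [self.transformer(u)[0] for u in blocks]       # V_k^l = T^l(U_k^l)
        cls = torch.cat([v[:, :1, :] for v in local], dim=1)   # (1, K, hidden): one [CLS] per block
        w = self.ffnn(self.bigru(cls)[0])                      # W^l: (1, K, hidden), block-aware
        # write the document-level vector back into the [CLS] position of each block
        return [torch.cat([w[:, k:k + 1, :], v[:, 1:, :]], dim=1)
                for k, v in enumerate(local)]
```

In the full model, L = 12 such layers are stacked and, as noted in the implementation details below, the BiGRU parameters are shared across layers.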

Output Layer
In this work, we validate our approach on the extractive summarization task described in Section 3. This task can be cast as a binary classification problem where each block has to be labeled as selected or not. We use a feed-forward neural network followed by a Softmax function on top of the block-level representations after the last layer L to compute Y ∈ R^{K×2}.
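A minimal sketch of this output layer, assuming the K block-level [CLS] representations have already been gathered into a single matrix; the class name is ours.

```python
import torch.nn as nn

class ExtractiveHead(nn.Module):
    """Sketch of the output layer: a feed-forward classifier followed by a
    softmax, applied to the block-level representations of the last layer."""

    def __init__(self, hidden=768):
        super().__init__()
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, block_repr):                # block_repr: (K, hidden), one row per block/sentence
        return self.classifier(block_repr).softmax(dim=-1)   # Y in R^{K x 2}: select / do not select
```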
Using a recurrent architecture to propagate information between blocks has two interesting properties. First, it allows our model to scale to long sequences of blocks without relying on an attention mechanism, which would not scale. Second, it does not require implementing any positional encoding on the block representations.

Experiments
We evaluate our approach, which we refer to as GBT-EXTSUM (for 'Global BERT-based Transformer for Extractive Summarization'), in the context of extractive summarization, the goal of which is to identify and extract from a document the pieces of text that are the most important (Kupiec et al., 1995). We view this task as a sentence-level classification problem where each sentence has to be labeled according to whether it belongs to the summary or not. To validate the effectiveness of our approach, we test it on three summarization datasets, namely arXiv, PubMed, and CNN/DailyMail. Table 1 presents some statistics on these three datasets. As one can note, for the scientific articles, the average number of tokens in the documents to summarize is far beyond the capabilities of a standard transformer pre-trained with BERT.

Evaluation Metrics
We evaluate the quality of the extracted summaries using the ROUGE metric (Lin, 2004), and more particularly ROUGE-1 (overlap of unigrams), ROUGE-2 (overlap of bigrams), ROUGE-3 (overlap of trigrams) and ROUGE-L (longest common subsequence between the produced summary and the gold-standard one).
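The paper does not state which ROUGE toolkit is used; as a hedged illustration, these metrics can be computed with the rouge-score Python package as follows (its scores may differ slightly from the official ROUGE-1.5.5 script).

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rouge3", "rougeL"], use_stemmer=True
)
reference = "the gold standard summary ."
candidate = "the extracted summary ."
scores = scorer.score(reference, candidate)   # dict of precision / recall / F1 per metric
print({name: round(score.fmeasure, 4) for name, score in scores.items()})
```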

Label Generation
In order to train extractive summarizers, one needs annotations in the form of sentence-level binary labels. To compute such annotations, we follow the work of Kedzie et al. (2018) and label sentences by greedily optimizing the ROUGE-1 score of the extracted summary against the gold-standard summary associated with each article. These labels are only used at training time; the evaluation of the extracted summaries is done against the gold-standard summaries provided in the datasets.

Table 2: Summarization results on PubMed and arXiv. Except for the BERT-based approaches, Reformer-Ext, and Longformer-Ext, which we have reimplemented, the results of the baselines are taken from their associated papers as well as from Cohan et al. (2018). Bold results correspond to the best scores of extractive summarizers.
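A sketch of this greedy labeling scheme, under the assumption that a sentence is added only while it improves the ROUGE-1 F1 of the running extract; the helper names and the exact stopping criterion are ours and may differ in detail from Kedzie et al. (2018).

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def greedy_oracle_labels(sentences, gold_summary, max_sentences=None):
    """Greedily pick the sentence that most increases ROUGE-1 F1 against the
    gold summary; stop when no remaining sentence improves the score.
    Returns one binary label per sentence."""
    selected, best = [], 0.0
    while max_sentences is None or len(selected) < max_sentences:
        gains = []
        for i, _ in enumerate(sentences):
            if i in selected:
                continue
            candidate = " ".join(sentences[j] for j in selected + [i])
            score = _scorer.score(gold_summary, candidate)["rouge1"].fmeasure
            gains.append((score, i))
        if not gains:
            break
        score, i = max(gains)
        if score <= best:          # no sentence improves ROUGE-1: stop
            break
        best = score
        selected.append(i)
    return [1 if i in selected else 0 for i in range(len(sentences))]
```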

Baseline Models
We compare our approach to several well-known published methods described below. These methods include, among others, SumBasic and the Discourse-aware summarizer (Cohan et al., 2018); the results for these models are the ones reported in Cohan et al. (2018). We also report the results of Sent-CLF and Sent-PTR, which are hierarchical sentence classifier and pointer models, and of TLM-I+E (G,M), a mixed extractive/generative transformer language model from Subramanian et al.

BERT Ranker: We used a BERT ranker, similar to Nogueira and Cho (2019), in which each sentence of the document is processed individually. We apply BERT on each sentence 1 and use a Sigmoid layer, whose input is the [CLS] representation of the sentence, to model the probability of the sentence being selected.
BERTSUMEXT has been introduced in Liu and Lapata (2019b). This model is an adaptation of BERT for extractive summarization. Because it takes as input the concatenation of all the tokens of the document, it cannot scale to the arXiv and PubMed datasets. We propose two variants: the first takes as input only the first 800 tokens of the document, as suggested in the original paper; this solution is displayed as BERTSUMEXT in Table 2. The second applies BERTSUMEXT over sliding windows on the original document and uses, as the representation of a token, its representation in the window that maximizes its surrounding context. We name this sliding window implementation BERTSUMEXT (SW) in Table 2.
Longformer-Ext and Reformer-Ext: The Longformer and Reformer models were respectively introduced by Beltagy et al. (2020) and Kitaev et al. (2020). They both propose an adaptation of the Transformer self-attention that scales to long sequences. We add the same classification head as the one used in our model on top of the contextualized representation of the first token of each sentence to label sentences as selected or not for the summary.
We also report the Oracle extractive results as an upper bound, as well as the Lead baseline (which selects the first 3, 6, and 7 sentences for the CNN/DailyMail, arXiv, and PubMed datasets, respectively). Several models are reported only on the CNN/DailyMail dataset and not on arXiv/PubMed because they do not scale to long documents.

Implementation details
We run all our experiments using the PyTorch library (Paszke et al., 2019). We built our model using the "bert-base-uncased" 2 version of BERT and its implementation in the HuggingFace library (Wolf et al., 2019). Our architecture is composed of L = 12 propagation layers with a transformer hidden dimension of h = 768. The hidden dimension of the BiGRU is set to 384 and its parameters are shared among all the propagation layers. The FFNN inside the propagation layers maps the output of the BiGRU of dimension 2 × 384 to a vector of dimension 768. The FFNN of the output layer is a binary classifier that projects the sentence representations of dimension 768 to an output of dimension 2. We fine-tuned our model with the cross-entropy loss for 5 epochs on 4 V100 GPUs, using the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 3 × 10^{-5}, β1 = 0.9, β2 = 0.999, no learning rate warmup, and a linear decay of the learning rate. We describe implementation details of the BERTSUMEXT, Longformer-Ext and Reformer-Ext baselines in the Supplementary Material, Appendix A.
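The optimization setup can be sketched as follows; the model, dataloader, and batch layout are placeholders, and the scheduler helper from the transformers library is our choice for implementing the no-warmup linear decay.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def train(model, train_loader, num_training_steps):
    """Sketch of the training setup described above: Adam, lr 3e-5, no warmup,
    linear decay, cross-entropy loss, 5 epochs. The batch layout is a
    placeholder, not the authors' actual data pipeline."""
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-5, betas=(0.9, 0.999))
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(5):
        for batch in train_loader:
            logits = model(batch["blocks"])          # (K, 2) unnormalized sentence scores
            loss = loss_fn(logits, batch["labels"])  # sentence-level binary labels
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```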
We used Trigram Blocking to avoid the repetition of trigrams in the extracted summaries as suggested in Paulus et al. (2018). Given the extracted summary so far, we only added candidate sentences that had no overlapping trigram with the current summary. We limited the summary to 3 sentences for the CNN/DailyMail dataset, 6 sentences for arXiv, and 7 for PubMed.
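A minimal sketch of Trigram Blocking, assuming sentences are visited in decreasing order of the model's selection score; the helper names are ours.

```python
def _trigrams(text):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_with_trigram_blocking(ranked_sentences, max_sentences):
    """Greedy selection with Trigram Blocking: walk over sentences in score
    order and skip any candidate sharing a trigram with the summary so far."""
    summary, seen = [], set()
    for sent in ranked_sentences:
        tri = _trigrams(sent)
        if tri & seen:                 # overlapping trigram -> skip the candidate
            continue
        summary.append(sent)
        seen |= tri
        if len(summary) == max_sentences:
            break
    return summary
```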

Results
Our main results are shown in Tables 2 and 3. On the arXiv and PubMed datasets, our model outperforms the baseline models on almost all of the reported metrics. Our approach manages to summarize long documents while preserving the informativeness (evaluated by ROUGE-1) and fluency (evaluated by ROUGE-L) of the summaries. In addition to the previously published methods, our approach also improves over the BERT-based, Longformer-Ext and Reformer-Ext baselines we have developed. Among them, BERTSUMEXT, which focuses on a truncated version of the document, is the least effective. As the documents are significantly longer than the 800-token limitation of this model, this result is not surprising. The sliding window adaptation of this model, which allows it to scale to long documents, is the one that achieves results most comparable to ours. Our approach still outperforms this adaptation, demonstrating that summarization requires propagating information beyond a single BERT window.

Table 3: Comparison of ROUGE scores on CNN/DailyMail with respect to extractive models. All results are taken from the original papers except Reformer-Ext and Longformer-Ext, which we have reimplemented.
On the CNN/DailyMail dataset, one can see that our model outperforms all the models that do not use pre-trained parameters. This includes several transformer-based and hierarchical models. However, while having comparable results, we do not achieve stronger performance than the current extractive state of the art from Zhong et al.

Lastly, we evaluate the impact of several elements of our proposed model in Table 4. We first study the influence of the underlying language model by considering both RoBERTa (Liu et al., 2019b) and PEGASUS (Zhang et al., 2019a) pre-trained models, respectively referred to as GBT-EXTSUM-RoBERTa and GBT-EXTSUM-PEGASUS. As one can see, the results show that the BERT-base architecture performs best in terms of ROUGE scores on both arXiv and PubMed. One major difference between the PEGASUS and BERT/RoBERTa pre-trained models is that BERT/RoBERTa are encoder-only models while PEGASUS is a pre-trained encoder/decoder architecture. This could explain why BERT/RoBERTa outperform PEGASUS on extractive summarization tasks. We then compare an alternative implementation of GBT-EXTSUM in which the parameters of the BiGRU are not shared among all the propagation layers (GBT-EXTSUM-NoShare) and found no clear difference with the version in which the parameters are shared. Lastly, we compare three architectures of propagation layers: an average pooling of the [CLS] representations of the sentences, a Transformer layer between the [CLS] tokens (associated with a block position embedding), and a BiGRU layer. Among these three, the average pooling layer, which introduces no additional trainable parameters, performs the worst. Furthermore, the BiGRU layer slightly outperforms the Transformer layer in terms of ROUGE scores.
Analysis. In Figure 2, we compare the R-1 score of several models with respect to the number of words in the source documents. One can see that GBT-EXTSUM consistently outperforms BERTSUMEXT (SW), Reformer-Ext and Longformer-Ext regardless of the number of words in the source documents. We present in Table 5 two example summaries of a document from the PubMed test set (Kamio et al., 2009), respectively obtained by GBT-EXTSUM and BERTSUMEXT (SW). The numbers in the margin indicate the position of the sentences in the original document, which is composed of a total of 78 sentences. As one can observe, GBT-EXTSUM extracts sentences from various parts of the document whereas BERTSUMEXT (SW) mostly focuses on its beginning. To analyse the influence of the positions of the sentences in the input document, we present in Figure 3 the histograms of the positions of the sentences of the Oracle summary, as well as those of the predicted positions of the different models, on the PubMed test set. One can see that, while most relevant sentences appear at the beginning of a document, other Oracle sentences are still relevant further down the document. GBT-EXTSUM is the model that behaves most closely to the Oracle, followed by BERTSUMEXT (SW), Reformer-Ext and Longformer-Ext. These last two models tend to over-select sentences from the beginning while focusing less on the ones appearing later in the document. Our model remains influenced by the sentence position but is still able to select sentences from all over the document and is closer to the Oracle distribution.

Related Work
Hierarchical neural architectures have been competitive on a collection of NLP tasks that require reasoning over long or multiple documents, such as aspect-based sentiment analysis (Paulus et al., 2018), document summarization (Cheng and Lapata, 2016), and document segmentation (Koshorek et al., 2018).

Table 5: Example summaries of a document from the PubMed test set (Kamio et al., 2009), extracted by GBT-EXTSUM and BERTSUMEXT (SW). The numbers indicate the positions of the extracted sentences in the original document.

GOLD
purpose : to investigate whether the glc3a locus harboring the cyp1b1 gene is associated with normal tension glaucoma ( ntg ) in japanese patients.materials and methods : one hundred forty two japanese patients with ntg and 101 japanese healthy controls were recruited . patients exhibiting a comparatively early onset were selected as this suggests that genetic factors may show stronger involvement . genotyping and assessment of allelic diversity was performed on 13 highly polymorphic microsatellite markers in and around the glc3a locus.results:there were decreased frequencies of the 444 allele of d2s0416i and the 258 allele of d2s0425i in cases compared to controls ( p = 0.022 and p = 0.034 , respectively ) . however , this statistical significance disappeared when corrected ( pc > 0.05 ) . we did not find any significant association between the remaining 11 microsatellite markers , including d2s177 , which may be associated with cyp1b1 , and ntg ( p > 0.05). conclusions : our study showed no association between the glca3 locus and ntg , suggesting that the cyp1b1 gene , which is reportedly involved in a range of glaucoma phenotypes , may not be an associated factor in the pathogenesis of ntg .

GBT-EXTSUM
1-primary open angle glaucoma ( poag ) is the most common type of glaucoma .

15-we excluded individuals who were diagnosed under 20 or over 60 years of age and who had 8.0 d or higher myopic refractive error of spherical equivalence .
17-the cases exhibiting a comparatively early onset were selected as they suggest that genetic factors may show stronger involvement . during diagnosis ,
30-the probability of association was corrected by the bonferroni inequality method , ie , by multiplying the obtained p values with the number of alleles compared .

63-only two adjacent markers , d2s0416i and d2s0425i , were significantly positive , as shown in table 2 , and the frequency of the 444 allele of d2s0416i and the 258 allele of d2s0425i were decreased in cases compared to controls ( p = 0.022 , or = 0.59 and p = 0.034 , or = 0.42 , respectively ) .

66-the purpose of this study was to investigate whether the glc3a locus is associated with ntg in japanese subjects , based on results from recent studies reporting that the cyp1b1 gene , located at the glc3a locus on chromosome 2p21 , could be a causative gene in poag as well as pcg . to this end , we genotyped 13 microsatellite markers in and around the glc3a locus . here

BERTSUMEXT (SW)
1-primary open angle glaucoma ( poag ) is the most common type of glaucoma .

2-normal tension glaucoma ( ntg ) is an important subset of poag ; while many poag patients have high iop,1 patients with ntg have statistically normal iop.24 the prevalence of ntg is higher among the japanese population than among caucasians , and recent studies reported that 92% of poag patients in japan had ntg.58 the diagnosis of glaucoma is based on a combination of factors including optic nerve damage and specific field defects for which iop is the only treatable risk factor .
7-of these subjects , 142 were diagnosed with ntg , and 101 were control subjects .
20-genomic dna was extracted using the qiaamp dna blood mini kit ( qiagen , hilden , germany ) or the guanidine method . in this association study , we selected 13 highly polymorphic microsatellite markers that are located in and around the glc3a locus as shown in figure 1 .
28-the number of microsatellite repeats was estimated automatically using the genescan 672 software ( applied biosystems ) by the local southern method with a size marker of gs500 tamra ( applied biosystems ) .
22-polymerase chain reaction ( pcr ) was performed in a reaction mixture with a total volume of 12.5 l containing pcr buffer , genomic dna , 0.2 mm dinucleotide triphosphates ( dntps ) , 0.5 m primers , and 0.35 u taq polymerase .

Other works adapt the self-attention mechanism itself to long sequences, for instance with sparse attention patterns (Child et al., 2019) or the recently proposed Longformer and BIGBIRD models (Beltagy et al., 2020; Zaheer et al., 2020). One major difference with our work is that these models compute the attention only between a limited set of randomly or a priori chosen tokens. Reformer (Kitaev et al., 2020) also tackles the problem of language modeling for long sequences, but it does so by computing the self-attention only between similar tokens, based on locality-sensitive hashing.

Conclusion
In this paper, we have introduced a novel transformer-based model for long document summarization based on propagation layers that spread information between multiple transformer windows. This model preserves the architecture of commonly used pre-trained language models, thus allowing the transfer of parameters. An evaluation, conducted on top of the BERT model in the context of an extractive summarization task, further revealed its effectiveness in dealing with long documents compared to other adaptations of BERT and previously proposed models. In the future, we plan to adapt our model to other tasks that require understanding long documents, such as question answering and document-scale machine translation.

A Baselines: Implementation Details

BERTSUMEXT: For all experiments with BERTSUMEXT, we started with the original implementation 3 and adapted the code to build the sliding windows version. This implementation leverages the bert-base-uncased pre-trained model and its associated hyperparameters. We use windows of width 800 with an overlap of 300 tokens between two consecutive windows. If a sentence appears in multiple windows, we select its [CLS] representation in the window that maximizes the number of surrounding tokens. We fine-tune the model for 5 epochs using the Adam optimizer with an initial learning rate of 1 × 10^{-5}, β1 = 0.9, β2 = 0.999.
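A sketch of the sliding-window layout and of one possible reading of the selection rule (picking, for a given token position, the window whose edges are farthest away); the exact rule used in the paper may differ, and the helper names are ours.

```python
def window_spans(doc_len, width=800, overlap=300):
    """Sliding windows over a document of `doc_len` tokens: width 800,
    300-token overlap, i.e. stride 500, as described above."""
    stride = width - overlap
    spans, start = [], 0
    while True:
        spans.append((start, min(start + width, doc_len)))
        if start + width >= doc_len:
            break
        start += stride
    return spans

def best_window(position, spans):
    """Pick the window whose edges are farthest from `position`, i.e. the one
    giving that token the most surrounding context (our interpretation)."""
    def margin(span):
        start, end = span
        if not (start <= position < end):
            return -1
        return min(position - start, end - 1 - position)
    return max(range(len(spans)), key=lambda i: margin(spans[i]))
```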

Longformer-Ext: We built the Longformer-Ext baseline from the Longformer implementation released by HuggingFace 4 . We use the official longformer-base-4096 pre-trained model trained by AllenAI 5 . This model is based on RoBERTa-base and its associated hyperparameters. To increase the maximal position embedding, we drop the pre-trained positional embedding parameters and train a novel position embedding layer to scale the Longformer-Ext input up to 12294 tokens. This model computes a sliding self-attention with a window size of 512 tokens in all of its 12 Transformer layers. We fine-tune the model for 5 epochs with only local attention because of memory constraints, using the Adam optimizer with an initial learning rate of 1 × 10^{-5}, β1 = 0.9, β2 = 0.999, no learning rate warmup, and a linear decay of the learning rate.
Reformer-Ext: We started from the HuggingFace implementation of Reformer to build the Reformer-Ext baseline. We use a Reformer configuration composed of six layers of attention. We use Locality-Sensitive Hashing Attention with 128 buckets on the input sequence and Local Self-attention on chunks of 64 tokens. We use hidden states of dimension 256, a feed-forward layer of dimension 512, and 12 attention heads in the Transformer encoders. We train this model for 5 epochs using the Adam optimizer with an initial learning rate of 1 × 10^{-5}, β1 = 0.9, β2 = 0.999, no learning rate warmup, and a linear decay of the learning rate.

3 https://github.com/nlpyang/PreSumm
4 https://github.com/huggingface/transformers
5 https://github.com/allenai/longformer

B Datasets Statistics

Figure 4 presents the distribution of the document lengths in arXiv, PubMed and CNN/DailyMail, after tokenization with the pre-trained BERT-base tokenizer. It also provides the histograms of the positions of the [CLS] tokens of the Oracle sentences in the input documents. One can see that the three datasets contain an important number of documents longer than 512 tokens, the standard length limitation of pre-trained language models. However, one can also notice that CNN/DailyMail contains a large part of its Oracle sentences within this first window of 512 tokens. As a consequence, a model that is not able to "read" beyond this limitation is not penalized. It is also a reason why the Lead baseline is quite strong on this dataset. On the contrary, on arXiv and PubMed, one can see that a large part of the Oracle sentences occur beyond this 512-token window. This explains why models capable of reading long sequences are required to achieve good results on these datasets.

C PubMed Summaries

GOLD
results . the number of adrenal enlargements and proportion of incidental adrenal enlargement increased each year . mean patient age was 50.32 years . thirty -nine cases had unilateral enlargement on the left side and 3 on the right side ; 36 had bilateral enlargement . routine medical checkup was found to have the greatest chance ( 43.59% ) of revealing clinical onsets leading to discovery . biochemical and functional evaluation revealed 54 ( 69.23% ) cases of nonfunctional lesions , 12 ( 15.38% ) of subclinical cushing syndrome , 6 ( 7.69% ) of primary hyperaldosteronism , 1 ( 1.28% ) of metastasis , and 5 ( 6.41% ) of unknown functional status . nodular adrenal enlargement ( or , 7.306 ; 95% ci , 1.72728.667 ; p = 0.006 ) was a risk factor for functional lesions . age and lesion location were not significant factors . conclusion . incidental adrenal enlargement is a frequent radiographic finding and is accompanied by diverse clinical factors that require proper evaluation and management . nodular adrenal enlargement was a risk factor .
GBT-EXTSUM 8-data retrieved included patient demographics , final functional diagnosis , adrenal imaging features , and concomitant diseases .
14-smooth enlargement was defined as enlargement of the gland with a smooth contour and no measureable or diffuse nodules . after obtaining patient history and physical examination , all patients underwent biochemical evaluation to assess their functional status .
25-as shown in table 1 , routine medical checkup was found to have the greatest chance ( 43.59% ) of revealing clinical onsets leading to the discovery of adrenal enlargement .
31-our study shows that the proportion of incidental adrenal enlargement has gradually increased by year .
46-acth -independent macronodular hyperplasia ( aimah ) and primary pigmented nodular adrenal hyperplasia often manifest as adrenal hyperplasia . the clinical features of aimah tended to be atypical .

BERTSUMEXT (SW)
4-it is a common term for a variety of adrenal disorders , but its cause must be properly assessed so that patients needing treatment , such as those with hormone hypersecretion or malignant disease , can receive appropriate care . however , there is a lack of literature on functional status and its follow -up to provide comprehensive insight to these findings .
5-patients with incidental adrenal enlargement were evaluated in a tertiary referral hospital with endocrinological departments in china .
7-this retrospective study included 578 patients with adrenal imaging features showing adrenal enlargement who were hospitalized at the department of endocrinology in pla general hospital ( beijing , china ) between january 1993 and july 2013 .
36-in addition , smooth enlargement was more common , in 53 ( 83% ) cases , and together these statistics reflect the likelihood that adrenal enlargement will be bilateral , smooth , and found in men .
37-however , our study did not show this tendency , likely because the research goals and thus , study populations , differed between the 2 studies .
38-'s study aimed to explore prevalence , while the present study aimed to evaluate functional status .
GOLD background and objective . antimicrobial resistance is now a major challenge to clinicians for treating patients . hence , this short term study was undertaken to detect the incidence of multidrug -resistant ( mdr ) , extensively drug -resistant ( xdr ) , and pandrug -resistant ( pdr ) bacterial isolates in a tertiary care hospital . material and methods . the clinical samples were cultured and bacterial strains were identified in the department of microbiology . the antibiotic susceptibility profile of different bacterial isolates was studied to detect mdr , xdr , and pdr bacteria . results . the antibiotic susceptibility profile of 1060 bacterial strains was studied . 393 ( 37.1% ) bacterial strains were mdr , 146 ( 13.8% ) strains were xdr , and no pdr was isolated . all ( 100% ) gram negative bacterial strains were sensitive to colistin whereas all ( 100% ) gram positive bacterial strains were sensitive to vancomycin . conclusion . close monitoring of mdr , xdr , or even pdr must be done by all clinical microbiology laboratories to implement effective measures to reduce the menace of antimicrobial resistance .
GBT-EXTSUM
5-multidrug resistant ( mdr ) was defined as acquired nonsusceptibility to at least one agent in three or more antimicrobial categories . extensively drug
36-no mdr or xdr strain was isolated from streptococcus sp . all ( 100% ) gram positive cocci were sensitive to vancomycin and linezolid .

67-unless and until multidrug resistant organisms are detected and their incidence is known , the strategies for their control can not be adopted properly in healthcare setup . hence , detection , prevention of transmission of mdros by following infection control practices , antimicrobial surveillance , and stewardship are need of the hour .
69-we hereby conclude that early detection and close monitoring of mdr , xdr , or even pdr bacterial strains must be started by all clinical microbiology laboratories to reduce the menace of antimicrobial resistance which is now a global problem .
BERTSUMEXT (SW) 9-this short term cross -sectional study was conducted in the department of microbiology from 15th of april to 15th of july , 2014 .
10-the bacterial strains were isolated from different clinical samples and were identified by conventional methods .
17-methicillin resistant staphylococcus aureus ( mrsa ) strains were detected by meca -mediated oxacillin resistance using cefoxitin disk ( 30 g ) on mueller hinton ( mh ) agar plate inoculated with test strains as per standard disk diffusion recommendations and incubated at 3335c for 1618 hours .
20-an increase in diameter of 5 mm with ceftazidime plus clavulanic acid as compared to ceftazidime disk alone was considered positive for esbl detection .
36-no mdr or xdr strain was isolated from streptococcus sp . all ( 100% ) gram positive cocci were sensitive to vancomycin and linezolid .

65-the limitation of this study is that this is a single center study for only three -month period in a tertiary care hospital in central india . to reflect the trend of infections caused by mdr and xdr strains of bacteria in the region , a multicenter study involving all types of healthcare setups for a minimum period of one year

GOLD
background suicide is a grave public health issue that is responsible for a high mortality rate among individuals aged 1544 years . attitudes toward suicide among medical staff members have been associated with appropriate therapeutic responses to suicidal individuals . the aim of this study was to examine the effects of parental rearing on attitudes toward suicide among japanese medical college students.methodswe examined the association between parental bonding and attitudes toward suicide in 160 medical college students in japan . the parental bonding instrument was used to assess the attitudes and behaviors of parents . the attitudes toward suicide were evaluated using the japanese version of the attitudes toward suicide questionnaire.resultsthe mean age of the subjects was 25.24.0 years old . the majority of the participants in our study agreed that anyone could commit suicide ( 88.8% ) and that suicide is preventable ( 86.3% ) . after adjusting for age and sex , multivariate regression analysis revealed that maternal care approached a statistically significant association with the right to suicide attitude . under the same conditions , maternal care was shown to be significantly associated with the common occurrence attitude . no other significant relationships were observed between parental bonding and attitudes toward suicide.conclusionthis study suggests that a higher level of maternal care ensures that children think that suicide occurs less commonly . the promotion of best practices for suicide prevention among medical students is needed . child rearing support might be associated with suicide prevention .

GBT-EXTSUM
3-previous studies have shown that difficulties with parental bonding during childhood could be a predisposing factor for the onset of many psychiatric conditions , such as anxiety , depressive states , and maladjusted behaviors.68 parental bonding and premorbid personality traits play an important role in shaping the developmental trajectory of an individual , including his / her ability to adjust to stressful events .
5-the objective of this study was to investigate whether parental bonding is associated with attitudes toward suicide among medical college students in japan .
8-the demographic data ( age and sex ) were obtained from self -questionnaires and interviews .
14-higher scores on the care and protection dimensions reveal that participants perceive their parents to be more caring and/or protective .
39-right to suicide was significantly associated with common occurrence , unjustified behavior , and preventability / readiness to help .
43-the majority of the participants in our study agreed that anyone could commit suicide ( 88.8% ) and that suicide is preventable ( 86.3% ) .
44-in addition , the multiple regression analysis revealed that participants who reported a higher level of maternal care thought that suicide was a common occurrence and tended to think that people do not have the right to commit suicide .

BERTSUMEXT (SW)
6-students in their fifth year of medical school at hirosaki university , hirosaki , japan , participated in the study .
7-the surveys were distributed to 226 medical students . of the distributed 226 surveys , 160 questionnaires ( 116 males and 44 females )
13-the overprotection dimension of the pbi reflects parental overprotection and control in contrast to the encouragement of autonomy .
14-higher scores on the care and protection dimensions reveal that participants perceive their parents to be more caring and/or protective .
15-we employed the japanese version of the attitudes toward suicide questionnaire ( atts ) to assess the attitudes toward suicide held by the study participants.12 we employed a six factor model that was previously developed in studies of japanese attitudes , including
16-common occurrence , suicidal expression as mere threat , unjustified behavior ,
17-impulsiveness.12,13 each item , with the exception of items 10 and 28 , was scored on a five point scale from 1 ( strongly agree ) to 5 ( strongly disagree ) .

D ArXiv Summaries
GOLD in vivo calcium imaging through microscopes has enabled deep brain imaging of previously inaccessible neuronal populations within the brains of freely moving subjects . however , microendoscopic data suffer from high levels of background fluorescence as well as an increased potential for overlapping neuronal signals . previous methods fail in identifying neurons and demixing their temporal activity because the cellular signals are often submerged in the large fluctuating background . here we develop an efficient method to extract cellular signals with minimal influence from the background . we model the background with two realistic components : ( 1 ) one models the constant baseline and slow trends of each pixel , and ( 2 ) the other models the fast fluctuations from out -of -focus signals and is therefore constrained to have low spatial -frequency structure . this decomposition avoids cellular signals being absorbed into the background term . after subtracting the background approximated with this model , we use constrained nonnegative matrix factorization ( cnmf , @xcite ) to better demix neural signals and get their denoised and deconvolved temporal activity . we validate our method on simulated and experimental data , where it shows fast , reliable , and high quality signal extraction under a wide variety of imaging parameters .
GBT-EXTSUM 1-. continued advances in optical imaging technology are greatly expanding the number and depth of neuronal populations that can be visualized .

2-specifically , in vivo calcium imaging through microendoscopic lenses and the development of miniaturized microscopes have enabled deep brain imaging of previously inaccessible neuronal populations of freely moving mice ( @xcite ) . while these techniques have been widely used by neuroscientists ,
20-like the proposed cnmf in @xcite , our extended cnmf for microendoscopic data ( cnmf -e ) also has the capability of identifying neurons with low signal -to -noise ratio ( snr ) and simultaneously denoising , deconvolving and demixing large -scale microendoscopic data . to accomplish this : ( 1 ) we replace the rank-1 nmf approximation of the background with a more sophisticated approximation , which can better account the complex background and avoid absorbing cellular signals , and ( 2 ) we develop an efficient initialization procedure to extract neural activities with minimal influence from the background .
71-@xmath56 is a template matching filter to detect spatial structures with similar shapes and sizes . for flat structures in the small regions , like background , filtering them with @xmath56
134-in this paper , we proposed an efficient method for extracting cellular signals from microendoscopic data ; such methods are in very high demand in the neuroscience community .
136-our method shows credible performances in recovering the real neuronal signals and outperforms the previous standard pca -ica method .
BERTSUMEXT (SW)
0-monitoring the activity of large -scale neuronal ensembles during complex behavioral states is fundamental to neuroscience research
11-our work is based on a matrix factorization approach , which can simultaneously segment cells and estimate changes in fluorescence in the temporal domain .
26-the video data we have are observations from the optical field for a total number of @xmath2 frames .

64-we estimate the temporal component of one neuron @xmath15 from spatially filtered data and then use it to extract the corresponding spatial footprint @xmath14 from the raw data . in the step of estimating @xmath14 , we re -order all frames to make nearby frames share the similar local background levels and then take the temporal differencing to remove the background signals temporally .
105-we also display @xmath98 tightly clustered neurons in the simulated data (
107-in contrast , pca -ica based detection can only detect two neurons and the calcium traces have high level of noise .