Revisiting Transformer-based Models for Long Document Classification

The recent literature in text classification is biased towards short text sequences (e.g., sentences or paragraphs). In real-world applications, multi-page multi-paragraph documents are common and they cannot be efficiently encoded by vanilla Transformer-based models. We compare different Transformer-based Long Document Classification (TrLDC) approaches that aim to mitigate the computational overhead of vanilla transformers to encode much longer text, namely sparse attention and hierarchical encoding methods. We examine several aspects of sparse attention (e.g., size of local attention window, use of global attention) and hierarchical (e.g., document splitting strategy) transformers on four document classification datasets covering different domains. We observe a clear benefit from being able to process longer text and, based on our results, we derive practical advice on applying Transformer-based models to long document classification tasks.


Introduction
Natural language processing has been revolutionised by the large-scale self-supervised pre-training of language encoders (Devlin et al., 2019; Liu et al., 2019), which are fine-tuned to solve a wide variety of downstream classification tasks. However, the recent literature in text classification mostly focuses on short sequences, such as sentences or paragraphs (Sun et al., 2019; Adhikari et al., 2019; Mosbach et al., 2021), which are sometimes misleadingly called documents. The transition from short- to long-document classification is non-trivial. One challenge is that BERT and most of its variants are pre-trained on sequences containing up to 512 tokens, which is not a long document. A common practice is to truncate actually long documents to the first 512 tokens, which allows the immediate application of these pre-trained models (Adhikari et al., 2019; Chalkidis et al., 2020). We believe this is an insufficient approach for long document classification because truncating the text may omit important information, leading to poor classification performance (Figure 1). Another challenge comes from the computational overhead of the vanilla Transformer: in the multi-head self-attention operation (Vaswani et al., 2017), each token in a sequence of n tokens attends to all other tokens. This results in O(n^2) time and memory complexity, which makes it challenging to efficiently process long documents.
In response to the second challenge, long-document Transformers have emerged to deal with long sequences (Beltagy et al., 2020; Zaheer et al., 2020). However, they experiment and report results on non-ideal long document classification datasets: documents in the IMDB dataset are not really long (fewer than 15% of examples exceed 512 tokens), while the Hyperpartisan dataset has only very few (645 in total) documents. On datasets with longer documents, such as the MIMIC-III dataset (Johnson et al., 2016) with an average length of 2,000 words, it has been shown that multiple variants of BERT perform worse than CNN- or RNN-based models (Chalkidis et al., 2020; Vu et al., 2020; Dong et al., 2021; Ji et al., 2021a; Gao et al., 2021; Pascual et al., 2021). We believe there is a need to understand the performance of Transformer-based models on classifying documents that are actually long.
In this work, we aim to transfer the success of the pre-train-fine-tune paradigm to long document classification. Our main contributions are: • We compare different long document classification approaches based on the transformer architecture, namely sparse attention and hierarchical methods. Our results show that processing more tokens can bring drastic improvements compared to processing up to 512 tokens.
• We conduct careful analyses to understand the impact of several design options on both the effectiveness and efficiency of different approaches. Our results show that some design choices (e.g., the size of the local attention window in the sparse attention method) can be adjusted to improve efficiency without sacrificing effectiveness, whereas other choices (e.g., the document splitting strategy in the hierarchical method) vastly affect effectiveness.
• Last but not least, our results show that, contrary to previous claims, Transformer-based models can outperform former state-of-the-art CNN-based models on the MIMIC-III dataset.

Problem Formulation and Datasets
We divide the document classification model into two components: (1) a document encoder, which builds a vector representation of a given document; and (2) a classifier that predicts a single label or multiple labels given the encoded vector. In this work, we mainly focus on the first component: we use Transformer-based encoders to build a document representation, and then take the encoded document representation as the input to a classifier. For the second component, we use a TANH-activated hidden layer, followed by the output layer. Output probabilities are obtained by applying a SIGMOID (multi-label) or SOFTMAX (multi-class) function to the output logits. 3 We mainly conduct our experiments on the MIMIC-III dataset (Johnson et al., 2016), where researchers still fail to transfer "the Magic of BERT" to medical code assignment tasks (Ji et al., 2021a; Pascual et al., 2021).
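As a concrete illustration, the classification head described above (a TANH-activated hidden layer followed by an output layer with SIGMOID or SOFTMAX) can be sketched in a few lines of NumPy. The dimensions, weights and function name below are our own invention for illustration and are not the paper's actual configuration:

```python
import numpy as np

def classify(doc_vec, W_h, b_h, W_o, b_o, multilabel=True):
    """Classification head: TANH-activated hidden layer, then an output layer
    with SIGMOID (multi-label) or SOFTMAX (multi-class) over the logits."""
    hidden = np.tanh(doc_vec @ W_h + b_h)
    logits = hidden @ W_o + b_o
    if multilabel:
        # independent per-label probabilities
        return 1.0 / (1.0 + np.exp(-logits))
    # probabilities over classes sum to 1
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, h, C = 8, 4, 5                                  # toy dimensions
doc_vec = rng.normal(size=(2, d))                  # a batch of 2 document vectors
W_h, b_h = rng.normal(size=(d, h)), np.zeros(h)
W_o, b_o = rng.normal(size=(h, C)), np.zeros(C)
probs_ml = classify(doc_vec, W_h, b_h, W_o, b_o, multilabel=True)
probs_mc = classify(doc_vec, W_h, b_h, W_o, b_o, multilabel=False)
```

In practice the document vector would come from the Transformer-based encoder and the head would be trained jointly with it; the sketch only shows the forward computation.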

MIMIC-III contains Intensive Care Unit (ICU) discharge summaries, each of which is annotated with multiple labels (diagnoses and procedures) using the ICD-9 (The International Classification of Diseases, Ninth Revision) hierarchy. Following Mullenbach et al. (2018), we conduct experiments using the top 50 most frequent labels. 4 To address the generalisation concern, we also use three datasets from other domains: ECtHR (Chalkidis et al., 2022), sourced from legal cases, and Hyperpartisan (Kiesel et al., 2019) and 20 News (Joachims, 1997), both from news articles.

ECtHR contains legal cases from the European Court of Human Rights' public database. The court hears allegations that a state has breached human rights provisions of the European Convention of Human Rights, and each case is mapped to one or more articles of the convention that were allegedly violated. 5

Hyperpartisan contains news articles which are manually labelled as hyperpartisan (taking an extreme left or right standpoint) or not. 6 20 News contains newsgroups posts which are categorised into 20 topics. 7

We note that documents in MIMIC-III and ECtHR are much longer than those in Hyperpartisan and 20 News (Table 5 in Appendix and Figure 2).

(Footnote 3: Long document classification datasets are usually annotated using a large number of labels. Studies that have focused on the second component investigate methods of utilising the label hierarchy (Chalkidis et al., 2020; Vu et al., 2020) and pre-training label embeddings (Dong et al., 2021), to name but a few.)

Approaches
In the era of Transformer-based models, we identify two representative approaches to processing long documents in the literature, which either act as an inexpensive drop-in replacement for vanilla self-attention (i.e., sparse attention) or build a task-specific architecture (i.e., hierarchical Transformers).

Sparse-Attention Transformers
The vanilla transformer relies on the multi-head self-attention mechanism, which scales poorly with the length of the input sequence, requiring quadratic computation time and memory to store all scores that are used to compute the gradients during back-propagation (Qiu et al., 2020). Several Transformer-based models (Kitaev et al., 2020; Tay et al., 2020; Choromanski et al., 2021) have been proposed to reduce this cost. Longformer (Beltagy et al., 2020) combines local and global attention: local attention restricts each token to attend within a window of neighbouring (consecutive) tokens, while global attention relies on the idea of global tokens that are able to attend to, and be attended by, any other token in the sequence (Figure 3). BigBird (Zaheer et al., 2020) is another sparse-attention based Transformer that uses a combination of local, global and random attention, i.e., all tokens also attend to a number of random tokens on top of those in the same neighbourhood. Both models are warm-started from the public RoBERTa checkpoint and further pre-trained on masked language modelling. They have been reported to outperform RoBERTa on a range of tasks that require modelling long sequences.
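To make the sparse attention pattern concrete, here is a small NumPy sketch (our own illustration, not the authors' implementation) that builds a boolean attention mask combining a local window with a set of global tokens, mirroring the 7-token example of Figure 3:

```python
import numpy as np

def sparse_attention_mask(n, window, global_idx):
    """mask[i, j] is True iff token i may attend to token j."""
    idx = np.arange(n)
    # local attention: each token sees a window of consecutive neighbours
    mask = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    for g in global_idx:
        # global tokens attend to, and are attended by, every token
        mask[g, :] = True
        mask[:, g] = True
    return mask

mask = sparse_attention_mask(n=7, window=2, global_idx=[0])
full_cost = 7 * 7               # dense attention computes all n^2 score pairs
sparse_cost = int(mask.sum())   # sparse attention computes only the allowed pairs
```

For realistic lengths (e.g., n = 4096, window = 512) the allowed pairs grow roughly linearly in n rather than quadratically, which is the source of the efficiency gain.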
We choose Longformer (Beltagy et al., 2020) in this study and refer readers to Xiong et al. (2021) for a systematic comparison of recently proposed efficient attention variants.

Hierarchical Transformers
Instead of modifying the multi-head self-attention mechanism to efficiently model long sequences, hierarchical Transformers build on top of the vanilla transformer architecture.
A document, D = {t_0, t_1, ..., t_|D|}, is first split into segments, each of which has fewer than 512 tokens. These segments can be independently encoded using any pre-trained Transformer-based encoder (e.g., RoBERTa in Figure 4). We sum the contextual representation of the first token of each segment with segment position embeddings to form the segment representation (i.e., n_i in Figure 4). Then a segment encoder, consisting of two transformer blocks (Zhang et al., 2019), is used to capture the interaction between segments and outputs a list of contextual segment representations (i.e., s_i in Figure 4), which are finally aggregated into a document representation. By default, the aggregator is the max-pooling operation unless otherwise specified.

Experimental Setup
Backbone Models We mainly consider two models in our experiments: Longformer-base (Beltagy et al., 2020) and RoBERTa-base (Liu et al., 2019), the latter of which is used in the hierarchical Transformers.
Evaluation metrics For the MIMIC-III (multilabel) dataset, we follow previous work (Mullenbach et al., 2018; Cao et al., 2020) and use micro-averaged AUC (Area Under the receiver operating characteristic Curve), macro-averaged AUC, micro-averaged F1, macro-averaged F1 and Precision@5 (the proportion of the ground truth labels in the top-5 predicted labels) as the metrics. We report micro- and macro-averaged F1 for the ECtHR (multilabel) dataset, and accuracy for both Hyperpartisan (binary) and 20 News (multiclass) datasets.

Experiments
We conduct a series of controlled experiments to understand the impact of design choices in different TrLDC models. Bringing these optimal choices together, we compare TrLDC against the state of the art, as well as against baselines that only process up to 512 tokens. Finally, based on our investigation, we derive practical advice on applying transformer-based models to long document classification regarding both effectiveness and efficiency.
Task-adaptive pre-training is a promising first step. Domain-adaptive pre-training (DAPT), the continued pre-training of a language model on a large corpus of domain-specific text, is known to improve downstream task performance (Gururangan et al., 2020; Kaer Jørgensen et al., 2021). However, task-adaptive pre-training (TAPT), which continues unsupervised pre-training on the task's data, is comparatively less studied, mainly because most benchmarking corpora are small and thus the benefit of TAPT seems less obvious than that of DAPT.
We believe document classification datasets, due to their relatively large size, can benefit from TAPT. On both MIMIC-III and ECtHR, we continue to pre-train Longformer and RoBERTa using the masked language modelling pre-training objective (details about pre-training can be found in Appendix 9.3). We find that task-adaptive pre-trained models substantially improve performance on MIMIC-III (Figure 5 (a) and (b)), but that the improvements on ECtHR are smaller (Figure 5 (c) and (d)). We suspect this difference arises because legal cases (i.e., ECtHR) are publicly available and have been covered in the pre-training data used for training Longformer and RoBERTa, whereas clinical notes (i.e., MIMIC-III) are not (Dodge et al., 2021). See Appendix 9.5 for a short analysis of this matter.
We also compare our TAPT-RoBERTa against a publicly available domain-specific RoBERTa, trained from scratch on biomedical articles and clinical notes. Results (Figure 8 in the Appendix) show that TAPT-RoBERTa outperforms the domain-specific base model, but underperforms the larger model.

Longformer
Small local attention windows are effective and efficient. Beltagy et al. (2020) observe that many tasks do not require reasoning over the entire context. For example, they find that the distance between any two mentions in a coreference resolution dataset (i.e., OntoNotes) is small, and that it is possible to achieve competitive performance by processing small segments containing these mentions.
Inspired by this observation, we investigate the impact of local context size on document classification, regarding both effectiveness and efficiency. We hypothesise that long document classification, which is usually paired with a large label space, can be performed by models that only attend over short sequences instead of the entire document (Gao et al., 2021). In this experiment, we vary the local attention window around each token.

Table 1: The impact of local attention window size in Longformer on the MIMIC-III development set. Speed is measured in processed samples per second; numbers in parentheses are the relative speedup.
Table 1 shows that even with a small window size, the micro F1 score on the MIMIC-III development set remains close to that of a larger window size. We observe the same pattern on ECtHR and 20 News (see Table 11 in the Appendix). A major advantage of using smaller local attention windows is faster computation for both training and evaluation.
Using a small number of tokens with global attention improves the stability of the training process. Longformer relies heavily on the [CLS] token, which is the only token with global attention, attending to all other tokens and being attended by all other tokens. We investigate whether allowing more tokens to use global attention can improve model performance and, if so, how to choose which tokens should use global attention.
Figure 6 shows that adding more tokens with global attention does not improve the F1 score, while a small number of additional global attention tokens can make training more stable.
Equally distributing global tokens across the sequence is better than content-based selection. We consider two approaches to choosing the additional tokens that use global attention: position based or content based. In the position-based approach, we distribute n additional tokens at equal distances. For example, if n = 4 and the sequence length is 4096, the tokens at positions 0, 1024, 2048 and 3072 use global attention. In the content-based approach, we identify informative tokens using TF-IDF (Term Frequency-Inverse Document Frequency) within each document, and we apply global attention to the top-K informative tokens, together with the [CLS] token. Results show that the position-based approach is more effective than the content-based one (see Table 13 in the Appendix).
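The position-based placement described above is straightforward to implement; a minimal sketch (the function name is ours) reproduces the paper's example of n = 4 global tokens in a 4096-token sequence:

```python
def global_token_positions(seq_len, n):
    """Place n additional global-attention tokens at equal distances."""
    return [i * (seq_len // n) for i in range(n)]

positions = global_token_positions(4096, 4)   # -> [0, 1024, 2048, 3072]
```

These positions would then be flagged in the model's global attention mask alongside the [CLS] token.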

Hierarchical Transformers
The optimal segment length is dataset dependent. Ji et al. (2021a) and Gao et al. (2021) reported negative results with a hierarchical Transformer using a segment length of 512 tokens on the MIMIC-III dataset. Their methods involve splitting a document into equally sized segments, which are processed using a shared BERT encoder. Instead of splitting the documents into such large segments, we investigate the impact of segment length and of preventing context fragmentation.
Figure 7 (left side of each violin plot) shows that there is no single optimal segment length across MIMIC-III and ECtHR. Small segment lengths work well on MIMIC-III, and using a segment length greater than 128 starts to decrease performance. In contrast, the ECtHR dataset benefits from larger segment lengths. The best performing segment lengths on 20 News and Hyperpartisan are 256 and 128, respectively (see Table 14 in the Appendix).
Splitting documents into overlapping segments can alleviate the context fragmentation problem. Splitting a long document into smaller segments may result in context fragmentation, where a model lacks the information it needs to make a prediction (Dai et al., 2019; Ding et al., 2021). Although the hierarchical model uses a second-order transformer to fuse and contextualise information across segments, we investigate a simple way to alleviate context fragmentation by allowing segments to overlap when we split a document. That is, except for the first segment, the first n/4 tokens of each segment are taken from the previous segment, where n is the segment length. Figure 7 (right side of each violin plot) shows that this simple strategy can easily improve the effectiveness of the model.
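The overlapping-split strategy can be sketched as follows; the function name and toy input are ours, and the stride follows the rule above that each segment after the first re-uses the last n/4 tokens of the previous one:

```python
def split_with_overlap(tokens, seg_len):
    """Split into segments of up to seg_len tokens; every segment after the
    first starts with the last seg_len // 4 tokens of the previous segment."""
    stride = seg_len - seg_len // 4
    segments = []
    start = 0
    while start < len(tokens):
        segments.append(tokens[start:start + seg_len])
        if start + seg_len >= len(tokens):
            break
        start += stride
    return segments

toks = list(range(10))
segs = split_with_overlap(toks, seg_len=4)
# -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each segment would then be encoded independently by the shared pre-trained encoder before the segment-level transformer fuses them.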
Splitting based on document structure. Chalkidis et al. (2022) argue that we should follow the structure of a document when splitting it into segments (Tang et al., 2015; Yang et al., 2016). They propose a hierarchical Transformer for the ECtHR dataset that splits a document at the paragraph level, reading up to 64 paragraphs of 128 tokens each (8192 tokens in total).
We investigate whether splitting based on document structure is better than splitting a long document into segments of the same length. Similar to their model, we consider each paragraph as a segment, and all segments are then truncated or padded to the same length. We follow Chalkidis et al. (2022) and use a segment length (l) of 128 on ECtHR, and tune l ∈ {32, 64, 128} on MIMIC-III. 8 Results show that splitting by the paragraph-level document structure does not improve performance on the ECtHR dataset. On MIMIC-III, splitting based on document structure substantially underperforms evenly splitting the document (Figure 9 in the Appendix).

Label-wise Attention Network
Recall from Section 3 that our models form a single document vector which is used for the final prediction. That is, in Longformer, we use the hidden state of the [CLS] token; in hierarchical models, we use the max-pooling operation to aggregate a list of contextual segment representations into a document vector. The Label-Wise Attention Network (LWAN) (Mullenbach et al., 2018; Xiao et al., 2019; Chalkidis et al., 2020) is an alternative that allows the model to learn distinct document representations for each label. Given a sequence of hidden representations H = [h_1, ..., h_N] (e.g., contextual token representations in Longformer or contextual segment representations in hierarchical models), LWAN allows each label ℓ to attend to different positions via α_ℓ = softmax(H u_ℓ) and v_ℓ = Σ_i α_{ℓ,i} h_i, and scores label ℓ from β_ℓᵀ v_ℓ, where u_ℓ and β_ℓ are vector parameters for label ℓ.
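A minimal NumPy sketch of label-wise attention in the style of Mullenbach et al. (2018) follows; each label attends over the positions with its own vector u_ℓ, builds a label-specific document vector, and scores it with β_ℓ. The dimensions and random parameters are illustrative only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lwan(H, U, B):
    """Label-wise attention.
    H: (N, d) hidden representations (tokens or segments).
    U, B: (L, d) per-label parameter vectors u_l and beta_l."""
    alpha = softmax(H @ U.T, axis=0)      # (N, L): attention over positions, per label
    V = alpha.T @ H                       # (L, d): label-specific document vectors
    logits = (V * B).sum(axis=1)          # (L,):  beta_l^T v_l
    return 1.0 / (1.0 + np.exp(-logits))  # per-label probabilities (multi-label)

rng = np.random.default_rng(1)
N, d, L = 6, 4, 3                         # toy sizes: 6 positions, 3 labels
H = rng.normal(size=(N, d))
U, B = rng.normal(size=(L, d)), rng.normal(size=(L, d))
probs = lwan(H, U, B)
```

Unlike a single pooled document vector, each label here can focus on different positions of the document, which is why LWAN helps most when many labels are assigned per document.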
Results show that adding an LWAN improves performance on MIMIC-III (micro F1 improves by 1.1 with Longformer and 1.8 with hierarchical models), where on average each document is assigned 6 of the 50 available labels (classes). There is a smaller improvement on ECtHR (0.4 with Longformer; 0.1 with hierarchical models), where the average number of labels per document is 1.5 out of 10 labels (classes) in total (Table 16 in the Appendix).

Comparison with State of the art
We compare TrLDC models against recently published results on MIMIC-III, as well as baseline models that process up to 512 tokens. In addition to the common practice of truncating long documents (i.e., using the first 512 tokens), we consider two alternatives that either randomly choose 512 tokens from the document as input or take as input the most informative 512 tokens, identified using TF-IDF scores.
Results in Tables 2 and 3 show that there is a clear benefit from being able to process longer text. Both Longformer and hierarchical Transformers outperform baselines that process up to 512 tokens by a large margin on MIMIC-III and ECtHR, whereas the improvements on 20 News and Hyperpartisan are relatively small. It is also worth noting that, among these baselines, there is no single best strategy for choosing which 512 tokens to process. Using the first 512 tokens works well on the MIMIC-III and Hyperpartisan datasets, but performs much worse than using 512 random tokens on ECtHR.
Finally, Longformer, which can process up to 4096 tokens, achieves competitive results with the best performing CNN-based model (Ji et al., 2021b) on MIMIC-III. By processing longer text and using the RoBERTa-Large model, the hierarchical models further improve the performance, leading to results comparable to those of RNN-based models (Vu et al., 2020; Yuan et al., 2022). We hypothesise that further improvements can be achieved when TrLDC models are enhanced with a better hierarchy-aware classifier as in Vu et al. (2020), or when code synonyms are used for training as in Yuan et al. (2022).

Comparison in terms of GPU memory consumption
GPU memory becomes a major constraint when Transformer-based models are trained on long text. Table 4 shows a comparison between Longformer and hierarchical models regarding the number of parameters and their GPU memory consumption. We use a batch size of 2 in these experiments, and measure the impact of attention window size and segment length on the memory footprint. We find that hierarchical models require less GPU memory than Longformer. Recall that small local attention windows are effective in Longformer, and that the optimal segment length in hierarchical models is dataset dependent.

Practical Advice
We compile several questions that practitioners may ask regarding long document classification and provide answers based on our results.

Q1 When should I start to consider using long document classification models?
A We suggest using TrLDC models if you work with datasets consisting of long documents (e.g., 2K tokens on average). We notice that on the 20 News dataset, the gap between baselines that process 512 tokens and long document models is negligible. 9

Q2 Which model should I choose? Longformer or hierarchical Transformers?

A We suggest Longformer as the starting point if you do not plan on extensively tuning hyperparameters. We find the default configuration of Longformer is robust, although it is possible to use a moderately sized (64-128) local attention window to improve efficiency without sacrificing effectiveness, and a small number of additional global attention tokens to make training more stable. On the other hand, hierarchical Transformers may benefit from careful hyperparameter tuning (e.g., document splitting strategy, using LWAN). We suggest splitting a document into small non-structure-derived segments (e.g., 128 tokens) which overlap as a starting point when employing hierarchical Transformers.

(Footnote 9: Although Hyperpartisan is a widely used benchmark for long document models, we do not recommend drawing practical conclusions based on our results because we observe high variance when we run experiments using different GPUs or CUDA versions. We attribute this to the small size (65) of its test set and the subjectivity of the task.)
We also note that the publicly available Longformer models can process sequences of up to 4096 tokens, whereas hierarchical Transformers can easily be extended to process much longer sequences.

Related Work
Long document classification Document length was not a point of controversy in the pre-neural era of NLP, when documents were encoded with Bag-of-Words representations, e.g., TF-IDF scores. The issue arose with the introduction of deep neural networks. Tang et al. (2015) use CNN- and BiLSTM-based hierarchical networks in a bottom-up fashion, i.e., they first encode sentences into vectors, then combine those vectors into a single document vector. Similarly, Yang et al. (2016) incorporate the attention mechanism when constructing the sentence and document representations. Hierarchical variants of BERT have also been explored for document classification (Mulyar et al., 2019; Chalkidis et al., 2022), abstractive summarization (Zhang et al., 2019) and semantic matching (Yang et al., 2020). Both Zhang et al. and Yang et al. also propose specialised pre-training tasks to explicitly capture sentence relations within a document. A very recent work by Park et al. (2022) shows that TrLDC models do not perform consistently well across datasets that consist of 700 tokens on average.

Methods of modifying the transformer architecture for long documents can be categorised into two approaches: recurrent Transformers and sparse attention Transformers. The recurrent approach processes segments moving from left to right (Dai et al., 2019). To capture bidirectional context, Ding et al. (2021) propose a retrospective mechanism in which segments from a document are fed twice as input. Sparse attention Transformers have been explored to reduce the complexity of self-attention, via dilated sliding windows (Child et al., 2019) and locality-sensitive hashing attention (Kitaev et al., 2020). Recently, combinations of local (window) and global attention were proposed by Beltagy et al. (2020) and Zaheer et al. (2020), which we have detailed in Section 3.

ICD Coding
The task of assigning the most relevant ICD codes to a document as a whole, e.g., a radiology report (Pestian et al., 2007), death certificate (Koopman et al., 2015) or discharge summary (Johnson et al., 2016), has a long history of development (Farkas and Szarvas, 2008). Most existing methods simplify this task as a text classification problem and build classifiers using CNNs (Karimi et al., 2017) or LSTMs (Xie et al., 2018). Since the number of unique ICD codes is very large, methods have been proposed to exploit relations between codes based on label co-occurrence (Dong et al., 2021), label count (Du et al., 2019), knowledge graphs (Xie et al., 2019; Cao et al., 2020; Lu et al., 2020) and codes' textual descriptions (Mullenbach et al., 2018; Rios and Kavuluru, 2018). More recently, Ji et al. (2021a) and Gao et al. (2021) investigate various methods of applying BERT to ICD coding. Different from our work, they mainly focus on comparing domain-specific BERT models that are pre-trained on various types of corpora. Ji et al. show that PubMedBERT, pre-trained from scratch on PubMed abstracts, outperforms other variants pre-trained on clinical notes or health-related posts; Gao et al. show that BlueBERT, pre-trained on PubMed and clinical notes, performs best. However, both report that Transformer-based models perform worse than CNN-based ones.

Conclusions
Transformers have previously been criticised for being incapable of long document classification. In this paper, we carefully study the role of different components of Transformer-based long document classification models. By conducting experiments on MIMIC-III and three other datasets (ECtHR, 20 News and Hyperpartisan), we observe clear improvements in performance when a model is able to process more text. Firstly, Longformer, a sparse attention model which can process up to 4096 tokens, achieves competitive results with CNN-based models on MIMIC-III; its performance is relatively robust; a moderately sized local attention window (e.g., 128) and a small number (e.g., 16) of evenly chosen tokens with global attention can improve its efficiency and stability without sacrificing its effectiveness. Secondly, hierarchical Transformers outperform all CNN-based models by a large margin; the key design choice is how to split a document into segments that can be encoded by pre-trained models; although the best performing segment length is dataset dependent, we find splitting a document into small overlapping segments (e.g., 128 tokens) is an effective strategy. Taken together, these experiments rebut the criticisms of Transformers for long document classification.

Limitations
Long document classification datasets are usually annotated using a large number of labels. For example, the complete MIMIC-III dataset contains 8,692 unique labels. As we mentioned in Section 2, we focus on building document representations and leave the challenge of learning with a large target label set for future work. Therefore, in this paper, we follow previous work (Mullenbach et al., 2018; Chalkidis et al., 2022) and consider a subset of frequent labels in MIMIC-III and ECtHR.

Preprocessing We preprocess MIMIC-III following the removal of numeric-only tokens in Mullenbach et al. (2018). We did not apply additional preprocessing to ECtHR and 20 News. We follow Beltagy et al. (2020) to preprocess the Hyperpartisan dataset. 10

Training We fine-tune the multilabel classification model using a binary cross entropy loss. That is, given a training example whose ground truth and predicted probability for the i-th label are y_i (0 or 1) and ŷ_i, we calculate its loss, over the C unique classification labels, as:

L = − Σ_{i=1}^{C} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]
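The per-example binary cross entropy over C labels can be computed directly; this NumPy sketch (with illustrative values) clips the predicted probabilities for numerical stability:

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    """Binary cross entropy summed over the C labels of one example:
    -sum_i [ y_i * log(yhat_i) + (1 - y_i) * log(1 - yhat_i) ]."""
    y = np.asarray(y, dtype=float)
    # clip to avoid log(0) when a predicted probability saturates
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# toy example with C = 3 labels
loss = bce_loss([1, 0, 1], [0.9, 0.1, 0.8])   # about 0.434
```

Each label contributes independently, which matches the SIGMOID outputs of the multi-label head.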

Dataset statistics
For the multiclass and binary classification tasks, we fine-tune using the cross entropy loss, where ŷ_g is the predicted probability for the gold label:

L = − log ŷ_g

We use the same effective batch size (16), learning rate (2e-5) and maximum number of training epochs (30) with early stopping patience (5) in all experiments. We also follow Longformer (Beltagy et al., 2020) and set the maximum sequence length to 4096 in most of the experiments unless otherwise specified. We fine-tune all classification models on Quadro RTX 6000 (24 GB GPU memory) or Tesla V100 (32 GB GPU memory) GPUs. If one batch of data is too large to fit into GPU memory, we use gradient accumulation so that the effective batch sizes (batch size per GPU × gradient accumulation steps) remain the same.
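Gradient accumulation preserves the effective batch size exactly for losses that average over examples. The toy NumPy check below (our own illustration, using a linear least-squares model rather than a Transformer) shows that accumulating appropriately scaled micro-batch gradients reproduces the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(16, 3))   # effective batch of 16 examples, 3 features
y = rng.normal(size=16)
w = np.zeros(3)

def grad(w, Xb, yb):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)^2)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# one optimizer step computed on the full batch of 16
full_grad = grad(w, X, y)

# the same step with per-GPU batch size 4 and 4 gradient accumulation steps:
# each micro-batch gradient is divided by the number of accumulation steps
accum = np.zeros(3)
for i in range(0, 16, 4):
    accum += grad(w, X[i:i+4], y[i:i+4]) / 4
```

Because the scaled micro-batch gradients sum to the full-batch gradient, the parameter update is identical to training with the larger batch, at the cost of extra forward/backward passes instead of extra memory.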
We repeat all experiments five times with different random seeds. The model which is most effective on the development set, measured using the micro F1 score (multilabel) or accuracy (multiclass and binary), is used for the final evaluation.

A comparison between clinical notes and legal cases
Although we usually use the term domain to indicate that texts talk about a narrow set of related concepts (e.g., clinical concepts or legal concepts), text can vary along different dimensions (Ramponi and Plank, 2020).
In addition to the statistical differences between MIMIC-III and ECtHR, which we show in Table 5, there is another difference worth considering: clinical notes are private as they contain protected health information. Even clinical notes after de-identification are usually not publicly available (e.g., downloadable using a web crawler). In contrast, legal cases have generally been allowed and even encouraged to be shared with the public, and thus form a large portion of crawled pre-training data (Dodge et al., 2021). Dodge et al. find that legal documents, especially U.S. case law, are a significant part of the C4 corpus, a cleansed version of CommonCrawl used to pre-train RoBERTa models. The ECtHR proceedings are also publicly available via HUDOC, the court's database.
We suspect that the finding that task-adaptive pre-training is more useful on MIMIC-III than on ECtHR (Figure 5) may relate to this difference. Therefore, we evaluate the vanilla RoBERTa model on MIMIC-III and ECtHR with regard to tokenization and language modelling. A comparison of the fragmentation ratio using the tokenizer and the perplexity using the language model can be found in Table 7.

                     MIMIC-III  ECtHR
Fragmentation ratio  1.233      1.118
Perplexity           1.351      1.079

9.6 A comparison between TAPT and publicly available RoBERTa by Lewis et al. (2020)

We compare our TAPT-RoBERTa against the publicly available domain-specific RoBERTa models (Lewis et al., 2020), which are trained from scratch on biomedical articles and clinical notes, in hierarchical models. In these experiments, we split long documents into overlapping segments of 64 tokens. Results in Figure 8 show that TAPT-RoBERTa outperforms the domain-specific base model, but underperforms the larger model.

Results on ECtHR test set
Results in Table 8 show that our results are higher than the ones reported in Chalkidis et al. (2022).

9.8 A comparison between evenly splitting and splitting based on document structure

Figure 9 shows that splitting by the paragraph-level document structure does not improve performance on the ECtHR dataset. On MIMIC-III, splitting based on document structure substantially underperforms evenly splitting the document.
9.9 Detailed results on the development sets

For the sake of brevity, we use only the micro F1 score in most of our illustrations, and we detail results for the other metrics in this section.

Figure 1: The effectiveness of Longformer, a long-document Transformer, on the MIMIC-III development set. There is a clear benefit from being able to process longer text.

Figure 2: The distribution of document lengths. A log-10 scale is used for the X axis.

Figure 3: A comparison of three types of attention operations. The example sequence contains 7 tokens; we set the local attention window size to 2, and only the first token uses global attention. Note that these connections are bidirectional: tokens can attend to each other.

Figure 5: Task-adaptive pre-training (right side of each plot) can improve the effectiveness (measured on the development sets) of pre-trained models by a large margin on MIMIC-III, but only by a small margin on ECtHR. ∆: the difference between the mean values of the compared experiments.

Figure 6: The effect of applying global attention to more tokens, which are evenly chosen based on their positions. In the baseline model (first column), only the [CLS] token uses global attention.

Figure 7: The effect of varying the segment length and of allowing segments to overlap in the hierarchical Transformers. ∆: improvement due to overlap.

Figure 8: A comparison between TAPT-RoBERTa and publicly available domain-specific RoBERTa models (Lewis et al., 2020).

Figure 9: Splitting by the paragraph-level document structure does not improve performance on the ECtHR dataset; on MIMIC-III, it substantially underperforms evenly splitting the document.

Table 2: Comparison of TrLDC against state-of-the-art on the MIMIC-III test set. R: RNN-based models; T: Transformer-based models. Models marked with an asterisk (*) are the domain-specific RoBERTa-Large (Lewis et al., 2020), whereas Longformer and the other RoBERTa models are task-adaptive pre-trained base versions.

Table 4: A comparison between Longformer and hierarchical models regarding their GPU memory consumption. The numbers of parameters are listed in the table header. Size refers to the local attention window size in Longformer and to the segment length in the hierarchical method, respectively.

Table 5 shows the descriptive statistics of the four datasets we use.

Table 5: Statistics of the datasets. The number of tokens is calculated using the RoBERTa tokenizer.

Table 6: Hyperparameters and training time (measured on the MIMIC-III dataset) for task-adaptive pre-training of Longformer and RoBERTa. Batch size = batch size per GPU × num. GPUs × gradient accumulation steps.

Table 7: Evaluating vanilla RoBERTa on MIMIC-III and ECtHR. A lower fragmentation ratio and perplexity indicate that the test data have a higher similarity to the RoBERTa pre-training data.

Table 8: Comparison of our results against the results reported in Chalkidis et al. (2022) on the ECtHR test set. Results are sorted by micro F1. Chalkidis et al. (2022) use different BERT variants including domain-specific models, whereas we use task-adaptive pre-trained models. Regarding the hierarchical method, we split a document into overlapping segments, each of which has 512 tokens. We use the default setting for Longformer as in Beltagy et al. (2020).