ERNIE-Doc: A Retrospective Long-Document Modeling Transformer

Transformers are not suited to processing long documents because of their quadratically increasing memory and time consumption. Simply truncating a long document or applying a sparse attention mechanism incurs the context fragmentation problem or leads to inferior modeling capability relative to models of comparable size. In this paper, we propose ERNIE-Doc, a document-level language pretraining model based on Recurrence Transformers. Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, give ERNIE-Doc a much longer effective context length and enable it to capture the contextual information of a complete document. We pretrain ERNIE-Doc to explicitly learn the relationships among segments with an additional document-aware segment-reordering objective. Various experiments were conducted on both English and Chinese document-level tasks. ERNIE-Doc improved the state-of-the-art language modeling result to a perplexity of 16.8 on WikiText-103. Moreover, it outperformed competitive pretraining models by a large margin on most language understanding tasks, such as text classification and question answering.


Introduction
Transformers (Vaswani et al., 2017) have achieved remarkable improvements in a wide range of natural language tasks, including language modeling, text classification, and question answering (Devlin et al., 2018). This success is largely due to the self-attention mechanism, which enables the network to capture contextual information from the entire input sequence. Nevertheless, the memory usage and computation complexity caused by the self-attention mechanism grow quadratically with the sequence length, incurring excessive cost when processing a long document on existing hardware. Currently, the most prominent pretrained models, such as BERT (Devlin et al., 2018), operate on fixed-length input segments of a maximum of 512 tokens owing to the aforementioned limitation. Thus, a long document input must be partitioned into smaller segments of manageable sizes. However, this leads to the loss of important cross-segment information, that is, the context fragmentation problem, as shown in Fig. 1(a). To mitigate the problem of insufficient interactions among the partitioned segments of long documents, Recurrence Transformers (Dai et al., 2019; Rae et al., 2019) permit the use of contextual information from previous segments in computing the hidden states for a new segment by maintaining a memory component of previous activations; this enables the modeling of long documents. In addition, Sparse Attention Transformers (Tay et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020) focus on reducing the complexity of self-attention operations to explicitly improve the modeling length, but only up to a restricted context length (4,096) due to resource limitations. (Source code and pre-trained checkpoints can be found at https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-doc.)
We argue that existing strategies are not sufficiently effective or reliable, because the contextual information of a complete document is still not available to each segment during the training phase. As depicted in Fig. 1, when training on segment $S_2$, the model is ideally optimized by maximizing $P(y \mid S_1, S_2, S_3)$, conditioned on the contextual information of the entire document $D = \{S_1, S_2, S_3\}$, in contrast to the following suboptimal solutions: $P(y \mid S_2)$ for Vanilla/Sparse Transformers and $P(y \mid S_1, S_2)$ for Recurrence Transformers.
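To make the contrast concrete, the following minimal sketch (a hypothetical helper, not code from the paper) enumerates which segments each Transformer family can condition on while processing segment $S_2$:

```python
# Sketch: which segments of a document D = [S1, S2, S3] are visible when
# the model processes a target segment during training.

def visible_context(segments, target_idx, family):
    """Return the indices of segments whose content is available while
    processing segments[target_idx]."""
    if family == "vanilla":          # vanilla/sparse: each segment in isolation
        return [target_idx]
    if family == "recurrence":       # memory of preceding segments only
        return list(range(target_idx + 1))
    if family == "ernie-doc":        # retrospective phase sees the whole document
        return list(range(len(segments)))
    raise ValueError(family)

doc = ["S1", "S2", "S3"]
print(visible_context(doc, 1, "vanilla"))     # [1]        -> P(y|S2)
print(visible_context(doc, 1, "recurrence"))  # [0, 1]     -> P(y|(S1,S2))
print(visible_context(doc, 1, "ernie-doc"))   # [0, 1, 2]  -> P(y|(S1,S2,S3))
```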
To address this limitation, we propose ERNIE-DOC (A Retrospective Long-Document Modeling Transformer) based on the Recurrence Transformer paradigm. Inspired by the human reading behavior of skimming a document first and then looking back upon it attentively, we design a retrospective feed mechanism in which segments from a document are fed twice as input. As a result, each segment in the retrospective phase could explicitly fuse the semantic information of the entire document learned in the skimming phase, which prevents context fragmentation.
However, simply incorporating the retrospective feed mechanism into Recurrence Transformers is infeasible because the maximum effective context length is limited by the number of layers, as shown in Fig. 1(b). Thus, we present an enhanced recurrence mechanism, a drop-in replacement for the recurrence in a Recurrence Transformer, that changes the shifting-one-layer-downwards recurrence to a same-layer recurrence. In this manner, the maximum effective context length can be expanded, and past higher-level representations can be exploited to enrich future lower-level representations.
Moreover, we introduce a segment-reordering objective to pretrain a document-level model. Specifically, it is a document-aware task of predicting the correct order of the permuted set of segments of a document, to model the relationships among segments directly. This allows ERNIE-DOC to build full document representations for prediction. It is analogous to the sentence-reordering task in ERNIE 2.0 (Sun et al., 2020b) but at segment-level granularity, (commonly) spanning multiple training steps.
We first evaluate ERNIE-DOC on autoregressive word-level language modeling using the enhanced recurrence mechanism, which, in theory, allows the model to process a document with infinitely many words. ERNIE-DOC achieves state-of-the-art (SOTA) results on the WikiText-103 benchmark dataset, demonstrating its effectiveness in long-document modeling. Then, to evaluate the potential of ERNIE-DOC on document-level natural language understanding (NLU) tasks, we pretrained the English ERNIE-DOC on the text corpora utilized in BigBird (Zaheer et al., 2020), starting from the released RoBERTa checkpoint, and the Chinese ERNIE-DOC on the text corpora utilized in ERNIE 2.0 (Sun et al., 2020b) from scratch. After pretraining, we fine-tuned ERNIE-DOC on a wide range of English and Chinese downstream tasks, including text classification, question answering, and keyphrase extraction. Empirically, ERNIE-DOC consistently outperformed RoBERTa on various benchmarks and showed significant improvements over other high-performance long-text pretraining models on most tasks.

Related Work
Sparse Attention Transformers have been extensively explored (Tay et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020). The key idea is to sparsify the self-attention operation, which scales quadratically with the sequence length. For instance, the Sparse Transformer (Child et al., 2019) uses a dilated sliding window that reduces the complexity to $O(L\sqrt{L})$, where $L$ is the sequence length. Reformer (Kitaev et al., 2020) further reduces the complexity to $O(L \log L)$ by using locality-sensitive hashing attention to compute the nearest neighbors. BP-Transformer (Ye et al., 2019) employs a binary partition of the input sequence. Recently, Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) have been proposed, and both achieve state-of-the-art performance on a variety of long-document tasks. They reduce the complexity of self-attention to $O(L)$ by combining random attention, window attention, and global attention. However, it has been proven in Zaheer et al. (2020) that sparse attention mechanisms cannot universally replace dense attention mechanisms; moreover, solving the simple problem of finding the furthest vector requires $\Omega(n)$ layers of a sparse attention mechanism but only $O(1)$ layers of a dense attention mechanism. In addition, the aforementioned methods require customized CUDA kernels or TVM programming to implement sparse attention, which are hard to maintain and difficult to use. In this study, we adopt a different approach, adapting Recurrence Transformers to a pretraining-then-finetuning setting to model long documents.
Recurrence Transformers (Dai et al., 2019; Rae et al., 2019) have been successfully applied to generative language modeling. They employ the Transformer decoder as a parametric model for each conditional distribution in $p(\mathbf{x}) = \prod_{t=1}^{L} p(x_t \mid \mathbf{x}_{<t})$, where $\mathbf{x}$ denotes a text sequence. To capture long dependencies, they process the text in segments from left to right based on the segment recurrence mechanism, which maintains a memory bank of past activations at each layer to preserve a history of context. Compressive Transformer (Rae et al., 2019) adds a compressive memory bank to store old activations instead of discarding them, which facilitates long-range sequence learning. However, these methods operate from left to right, which limits their capacity on discriminative language understanding tasks that require bidirectional information. XLNet (Yang et al., 2019) proposed a permutation language modeling objective to construct bidirectional information and achieved superior performance on multiple NLP tasks; however, its application to long-document modeling tasks remains largely unexplored. ERNIE-DOC builds on the ideas of Recurrence Transformers to 1) tackle their limitation in utilizing bidirectional contextual information and 2) improve the behavior of the segment recurrence mechanism to capture longer dependencies.

Hierarchical Transformers have enabled significant progress on numerous document-level tasks, such as document summarization (Zhang et al., 2019) and document ranking. Similar to Vanilla Transformers, Hierarchical Transformers split long documents into shorter segments of manageable lengths and then feed them independently to produce corresponding segment-level semantic representations. Unlike Vanilla Transformers, however, Hierarchical Transformers use separate Transformer layers to process the concatenation of these representations.
Hierarchical Transformers ignore the contextual information from the remaining segments when processing each segment of a long document, thus suffering from the context fragmentation problem.

Proposed Method
In this section, we first describe the background (Sec. 3.1) that ERNIE-DOC builds on. Then, we present the implementation of ERNIE-DOC, including the retrospective feed mechanism in Sec. 3.2, the enhanced recurrence mechanism in Sec. 3.3, and the segment-reordering objective in Sec. 3.4.

Background
Formally, a long document $D$ is sliced into $T$ sequential segments, denoted as $\{S_1, S_2, \ldots, S_T\}$, where $S_\tau = \{x_{\tau,1}, x_{\tau,2}, \ldots, x_{\tau,L}\}$ is the $\tau$-th segment with $L$ tokens; $x$ denotes a single token. Vanilla, Sparse, and Recurrence Transformers employ different strategies to produce the hidden state $h^n_\tau \in \mathbb{R}^{L\times d}$ for segment $S_\tau$ at the $n$-th layer:

$$\tilde{h}^{n-1}_{\tau+1} = \begin{cases} h^{n-1}_{\tau+1}, & \text{Vanilla/Sparse Transformers} \\ \left[\mathrm{SG}(h^{n-1}_{\tau}) \circ h^{n-1}_{\tau+1}\right], & \text{Recurrence Transformers} \end{cases}$$
$$q^{n}_{\tau+1},\; k^{n}_{\tau+1},\; v^{n}_{\tau+1} = h^{n-1}_{\tau+1} W_q,\; \tilde{h}^{n-1}_{\tau+1} W_k,\; \tilde{h}^{n-1}_{\tau+1} W_v \qquad (1)$$
$$h^{n}_{\tau+1} = \text{Transformer-Layer}(q^{n}_{\tau+1}, k^{n}_{\tau+1}, v^{n}_{\tau+1})$$

where $q \in \mathbb{R}^{L\times d}$ and $k, v \in \mathbb{R}^{(L+m)\times d}$ are the query, key, and value vectors, respectively, with hidden dimension $d$ and memory length $m$ (note that $m = 0$ for Vanilla or Sparse Transformers); $\tilde{h}^{n-1}_{\tau+1} \in \mathbb{R}^{(L+m)\times d}$ is the extended context; $W_* \in \mathbb{R}^{d\times d}$ represents learnable linear projection parameters; $\mathrm{SG}(\cdot)$ denotes the stop-gradient operation; and $[\cdot \circ \cdot]$ denotes the concatenation of two hidden states along the length dimension. In contrast to Vanilla or Sparse Transformers, where $h^n_{\tau+1}$ is produced using only the segment itself, Recurrence Transformers introduce a segment-level recurrence mechanism to promote interaction across segments: the hidden state computed for the previous segment, $h^{n-1}_\tau$, is cached as an auxiliary context to help process the current segment $h^n_\tau$. However, the concatenation in Eq. 1, i.e., $[\mathrm{SG}(h^{n-1}_{\tau}) \circ h^{n-1}_{\tau+1}]$, imposes the constraint that the current hidden state can only fuse information from previous segments. In other words, the contextual information of an entire document is not available for each segment.
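As an illustration of Eq. 1, the following pure-Python sketch (single layer, single head; all dimensions and the random initialization are illustrative assumptions, not the paper's configuration) shows how queries come only from the current segment while keys and values attend over the memory-extended context:

```python
import math
import random

# Minimal sketch of segment-level recurrence (Eq. 1): the cached memory of
# the previous segment is concatenated with the current hidden states.
L, m, d = 4, 4, 8   # segment length, memory length, hidden dim (illustrative)
random.seed(0)
rand_mat = lambda r, c: [[random.gauss(0, 1) for _ in range(c)] for _ in range(r)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

h_prev = rand_mat(m, d)      # SG(h^{n-1}_tau): cached memory, no gradient flows back
h_cur  = rand_mat(L, d)      # h^{n-1}_{tau+1}: current segment
h_ext  = h_prev + h_cur      # [SG(.) o h]: concatenation along length, (L+m) x d

Wq, Wk, Wv = rand_mat(d, d), rand_mat(d, d), rand_mat(d, d)
q = matmul(h_cur, Wq)        # queries from the current segment only: L x d
k = matmul(h_ext, Wk)        # keys/values see the extended context: (L+m) x d
v = matmul(h_ext, Wv)

# scaled dot-product attention over the extended context
scores = [[sum(qi * ki for qi, ki in zip(qr, kr)) / math.sqrt(d) for kr in k] for qr in q]
weights = [[math.exp(s - max(row)) for s in row] for row in scores]
weights = [[w / sum(row) for w in row] for row in weights]
out = matmul(weights, v)     # L x d: each current token attends to L + m positions
print(len(out), len(out[0]), len(k))   # 4 8 8
```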

Retrospective Feed Mechanism
ERNIE-DOC employs a retrospective feed mechanism to address the unavailability of the contextual information of a complete document for each segment. The segments from a long document are fed as input twice. Mimicking human reading behavior, we refer to the first and second input-taking phases as the skimming and retrospective phases, respectively. In the skimming phase, we employ a recurrence mechanism to cache the hidden states for each segment. In the retrospective phase, we reuse the cached hidden states from the skimming phase to enable a bidirectional information flow. Naively, we can rewrite Eq. 1 so that the contextual information of the entire document, gathered in the skimming phase, is utilized in the retrospective phase:

$$\hat{H} = \left[\hat{H}^{1:T}_{1} \circ \hat{H}^{1:T}_{2} \circ \cdots \circ \hat{H}^{1:T}_{N}\right], \quad \tilde{h}^{n-1}_{\tau+1} = \left[\mathrm{SG}(\hat{H} \circ h^{n-1}_{\tau}) \circ h^{n-1}_{\tau+1}\right] \qquad (2)$$

where $\hat{H} \in \mathbb{R}^{(L \cdot T \cdot N)\times d}$ denotes the hidden states cached in the skimming phase, with $T$ segments of length $L$ and $N$ layers in total, and $\hat{H}^{1:T}_{i}$ is the concatenation of the $i$-th layer's hidden states from the skimming phase. Thus, the extended context $\tilde{h}^{n-1}_{\tau+1}$ is guaranteed to capture the bidirectional contextual information of the entire document. However, directly employing $\hat{H}$ in the self-attention mechanism incurs massive memory and computation costs. Hence, the main issue is how $\hat{H}$ should be implemented in a memory- and computation-efficient manner.
By rethinking the segment-level recurrence mechanism, we observe that the largest possible context dependency length increases linearly with the number of layers $N$. For instance, at the $i$-th layer, $h^{i}_{\tau}$ has the longest dependency, reaching back to $h^{1}_{\tau-(i-1)}$. Thus, to minimize memory and computation consumption, only hidden states from the $N$-th (top) layer are included, at a stride of $N$, which is sufficient to build the contextual information of an entire document. Formally, $\hat{H}$ can be reduced to $\hat{H}_r = [h^{N}_{N} \circ h^{N}_{2N} \circ \cdots]$ (note that when $T$ is not evenly divisible by $N$, the last hidden state $h^{N}_{T}$ needs to be included as well). However, for a long document input, the extra computational and memory cost of $\hat{H}_r \in \mathbb{R}^{(\lceil T/N \rceil \cdot L)\times d}$, where $T \gg N$, is still excessive on existing hardware.
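The stride-$N$ selection described above can be sketched as follows (a hypothetical helper; segment indices are 1-based, following the paper's notation):

```python
# Sketch of reducing the full cache H to H_r: keep the top-layer (N-th layer)
# hidden states at a stride of N segments, plus the final segment's state
# when T is not evenly divisible by N.

def retro_cache_indices(T, N):
    """Return the 1-based segment indices whose top-layer states are cached."""
    idx = list(range(N, T + 1, N))
    if T % N != 0:           # T not evenly divisible by N: also keep h^N_T
        idx.append(T)
    return idx

print(retro_cache_indices(12, 4))  # [4, 8, 12]
print(retro_cache_indices(10, 4))  # [4, 8, 10]
```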

Enhanced Recurrence Mechanism
To effectively utilize the retrospective feed mechanism in practice, an ideal strategy is to ensure that the cached hidden state $h^{n-1}_{\tau}$ already contains the contextual information of an entire document, without explicitly taking $\hat{H}$ or $\hat{H}_r$ as input. Essentially, we should tackle the problem of the limited effective context length of the segment-level recurrence mechanism. Herein, we introduce the enhanced recurrence mechanism, a drop-in replacement for the segment-level recurrence mechanism, by changing the shifting-one-layer-downwards recurrence to a same-layer recurrence as follows:

$$\tilde{h}^{n-1}_{\tau+1} = \left[\mathrm{SG}(h^{n}_{\tau}) \circ h^{n-1}_{\tau+1}\right] \qquad (3)$$

where the cached hidden state $h^{n-1}_{\tau}$ in Eq. 1 and Eq. 2 is replaced with $h^{n}_{\tau}$ in Eq. 3. As shown in Fig. 2, when the retrospective feed mechanism is combined with the enhanced recurrence mechanism, every segment in the retrospective phase (shown in the box with a green dotted border) has bidirectional contextual information of the entire text input. We thus model a larger effective context length (shown in the box with an orange dotted border) than traditional Recurrence Transformers can, without extra memory and computation costs. Another benefit of the enhanced recurrence scheme is that past higher-level representations can be exploited to enrich future lower-level representations.
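A toy calculation contrasts the two recurrence schemes; the depth-limited cap follows the linear-in-layers analysis of Sec. 3.2, while the function itself and the numbers are illustrative:

```python
# Sketch: maximum effective context length (in segments, including the
# current one) reachable at the top layer when processing segment tau.

def effective_context(tau, num_layers, same_layer):
    if same_layer:                 # enhanced recurrence: h^n_tau feeds h^n_{tau+1},
        return tau                 # so context grows by one full segment per step
    return min(tau, num_layers)    # shifting-one-layer-down: capped by model depth

N = 12
print(effective_context(100, N, same_layer=False))  # 12  (limited by layer count)
print(effective_context(100, N, same_layer=True))   # 100 (grows with tau)
```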

Segment-Reordering Objective
In addition to the masked language model (MLM) objective (Devlin et al., 2018), we introduce an additional document-aware task called the segment-reordering objective for pretraining. Benefitting from the much larger effective context length provided by the enhanced recurrence mechanism, the goal of the segment-reordering objective is to predict the correct order of the permuted set of segments of a long document, to explicitly learn the relationships among segments. During the pretraining process of this task, a long text input $D$ is first randomly partitioned into 1 to $m$ chunks; then, these chunks are shuffled into a random order. As shown in Fig. 3, $D$ is partitioned into three chunks, which are then permuted, where $C_i$ denotes the $i$-th chunk. Subsequently, the permuted long context $\hat{D}$ is split into $T$ sequential segments, as in common practice, denoted as $\hat{D} = \{S_1, S_2, \ldots, S_T\}$. We let the pretrained model reorganize these permuted segments, modeled as a $K$-class classification problem, where $K = \sum_{i=1}^{m} i!$. The pretraining objective for the $\tau$-th input segment combines the MLM loss with the segment-reordering loss, the latter applied once the final segment has been processed:

$$\max_\theta \; \log p_\theta(S_\tau \mid \hat{S}_\tau) + \mathbb{1}\{\tau = T\}\, \log p_\theta(z \mid \hat{D})$$

where $\hat{S}_\tau$ is the corrupted version of $S_\tau$, obtained by randomly setting a portion of its tokens to the special [MASK] token, and $z \in \{1, \ldots, K\}$ indexes the correct segment order.
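A rough sketch of constructing one segment-reordering training instance is given below. The class-id encoding (an offset over all shorter permutations plus the lexicographic rank of this one) is an illustrative assumption; the paper only specifies that the task is $K$-class classification with $K = \sum_{i=1}^{m} i!$:

```python
from itertools import permutations
from math import factorial
import random

# Sketch: split a document into k chunks (1 <= k <= m), shuffle them, and
# label the permutation as one of K = sum_{i=1}^{m} i! classes.

def num_classes(m):
    return sum(factorial(i) for i in range(1, m + 1))

def make_instance(doc, m, rng):
    k = rng.randint(1, m)                               # number of chunks
    bounds = sorted(rng.sample(range(1, len(doc)), k - 1))
    chunks = [doc[i:j] for i, j in zip([0] + bounds, bounds + [len(doc)])]
    order = list(range(k))
    rng.shuffle(order)
    shuffled = [chunks[i] for i in order]
    # class id: offset over all shorter permutations + rank of this one
    offset = sum(factorial(i) for i in range(1, k))
    rank = sorted(permutations(range(k))).index(tuple(order))
    return shuffled, offset + rank

print(num_classes(3))  # 1! + 2! + 3! = 9
rng = random.Random(0)
shuffled, label = make_instance("abcdefgh", 3, rng)
print(shuffled, label)
```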

Autoregressive Language Modeling
Autoregressive language modeling aims to estimate the probability distribution of a token/character based on the previous tokens/characters in an input sequence. For comparison with previous work, we conducted experiments on word-level LM using WikiText-103 (Merity et al., 2016), a document-level language modeling dataset.

Experimental Setup
For autoregressive language modeling, we use a memory-enhanced Transformer-XL; that is, we employ our enhanced recurrence mechanism to replace the primitive one used in Transformer-XL. Additionally, as proposed in Segatron (Bai et al., 2020), we introduce the segment-aware mechanism into Transformer-XL. Based on this backbone, we trained a base-size model (L=16, H=410, A=10) and a large-size model (L=18, H=1,024, A=16). The models were trained for 200K/400K steps using a batch size of 64/128 for the base/large configurations. During the training phase, the sequence length and memory length were limited to 150 and 384 for the base and the large model, respectively. The remaining hyperparameters were identical to those of Transformer-XL. [Table: WikiText-103 baselines — LSTM (Merity et al., 2018): 151M params, 33.0 PPL; Transformer-XL Base: 151M params, 24.0 PPL; SegaTransformer-XL Base (Bai et al., 2020).] The detailed pretraining hyperparameters are listed in Tab. 12. Additionally, we employed relative positional embedding (Shaw et al., 2018) in our model pretraining because it is necessary for reusing hidden states without causing temporal confusion.
Finetune. In contrast to previous models, such as BERT, RoBERTa, and XLNet, the proposed model employs the retrospective feed mechanism and the enhanced recurrence mechanism during the finetuning phase to fully utilize the advantages of these two strategies.

Results on English Tasks
Results on Long-Text Classification Tasks. We consider two datasets: IMDB reviews (Maas et al., 2011) and Hyperpartisan News Detection (HYP) (Kiesel et al., 2019). The former is a widely used sentiment analysis dataset containing 50,000 movie reviews, labeled as positive or negative. The latter contains news that takes extreme left-wing or right-wing standpoints. The documents in HYP are extremely long (50% of the samples contain more than 537 tokens) and are thus suitable for testing long-text classification ability. Tab. 3 summarizes the results of the ERNIE-DOC-Base and ERNIE-DOC-Large models for long-text classification tasks, and ERNIE-DOC achieves SOTA results. On IMDB, we observed a modest performance gain.

Results on Document-Level QA Tasks. We consider two question-answering datasets (the Wikipedia setting of TriviaQA (TQA) (Joshi et al., 2017) and the distractor setting of HotpotQA (HQA) (Yang et al., 2018)) to evaluate the reasoning ability of the models over long documents. TQA and HQA are extractive QA tasks, and we follow the simple QA model of BERT (Devlin et al., 2018) to predict an answer with the maximum sum of start and end logits across multiple segments of a sample. In addition, we use a modified cross-entropy loss (Clark and Gardner, 2017) for the TQA dataset and a two-stage model (Groeneveld et al., 2020) with an ERNIE-DOC backbone for the HQA dataset. Tab. 4 shows that ERNIE-DOC outperforms RoBERTa and Longformer by a considerable margin on these two datasets and is comparable to the current SOTA long-document model, BigBird, on HQA in the large-size model setting.

Results on the Keyphrase Extraction Task. We include the OpenKP (Xiong et al., 2019) dataset to evaluate ERNIE-DOC's ability to extract keyphrases from a long document. Each document contains up to three short keyphrases, and we follow the model setting of JointKPE (Sun et al., 2020a) and ETC (Ainslie et al., 2020) by applying CNNs on BERT's output to compose n-gram embeddings for classification. We report the results of base-size models in Tab. 5 under the no-visual-features setting for an easy and fair comparison with baselines. ERNIE-DOC performs consistently better on all metrics on the OpenKP dataset.

Results on Chinese Tasks
We conducted extensive experiments on seven Chinese natural language understanding (NLU) tasks, including machine reading comprehension (CMRC2018 (Cui et al., 2018), DRCD (Shao et al., 2018), DuReader (He et al., 2017), and C3 (Sun et al., 2019a)), semantic similarity (CAIL2019-SCM (Xiao et al., 2019)), and long-text classification (IFLYTEK (Xu et al., 2020) and THUCNews (Sun et al., 2016)). The documents in all the aforementioned datasets are sufficiently long to be used to evaluate the effectiveness of ERNIE-DOC on long-context tasks (see detailed dataset statistics in Tab. 9). We report the mean results over five runs for the seven Chinese tasks in Tab. 6 and summarize the hyperparameters in Tab. 16. ERNIE-DOC outperforms previous models across these Chinese NLU tasks by a significant margin in the base-size model group.

Effect of proposed components. Tab. 7 shows the performance of ERNIE-DOC-Small on two English tasks after ablating each proposed component. All models were pretrained and fine-tuned with the same experimental setup, and we report the mean results of five runs. We observed a stable performance gain across these two tasks by incorporating each proposed component.

Ablation Studies
As depicted in Fig. 4, the enhanced recurrence mechanism plays an important role in pretraining an effective language model with lower PPL and higher accuracy under both maximum sequence input lengths of 128 and 512. The effect of the enhanced recurrence mechanism is more significant under a smaller maximum sequence length, even making ERNIE-DOC-Small (max-len: 128) comparable to ERNIE-DOC-Small w/o en. recur. (max-len: 512) with respect to accuracy. This intriguing property of the enhanced recurrence mechanism enables more efficient model training and inference by reducing the maximum sequence length while retaining comparable modeling capability.

Conclusion
In this paper, we proposed ERNIE-DOC, a document-level language pretraining model based on the Recurrence Transformers paradigm. Two well-designed mechanisms, namely the retrospective feed mechanism and the enhanced recurrence mechanism, give ERNIE-DOC the longest possible theoretical dependency and enable it to model the bidirectional contextual information of a complete document. Additionally, ERNIE-DOC is pretrained with a document-aware segment-reordering objective to explicitly learn the relationships among segments of a long context. Experiments on various downstream tasks demonstrate that ERNIE-DOC outperforms existing strong pretraining models such as RoBERTa, Longformer, and BigBird, and achieves SOTA results on several language modeling and language understanding benchmarks.
In future studies, we will evaluate ERNIE-DOC on language generation tasks, such as generative question answering and text summarization. We will also investigate its potential applicability in other areas, such as computational biology. Another possibility is to incorporate graph neural networks into ERNIE-DOC to enhance its modeling capability for tasks that require multi-hop reasoning and long-document modeling ability.

Long Text Classification. We consider two English datasets: IMDB reviews (Maas et al., 2011) and Hyperpartisan News Detection (Kiesel et al., 2019) (see Tab. 8), and two Chinese datasets: IFLYTEK (Xu et al., 2020) and THUCNews (Sun et al., 2016) (see Tab. 9). IMDB is a widely used sentiment analysis dataset containing 50,000 movie reviews labeled as positive or negative; the training and dev sets are split equally. Hyperpartisan contains news that takes an extreme left-wing or right-wing standpoint. Documents in Hyperpartisan are extremely long, which makes it a good testbed for long-text classification. We use the same split as Longformer, dividing the 654 documents into train/dev/test sets. IFLYTEK contains 17,332 app descriptions; the task is to assign each description to one of 119 categories, such as food, car rental, and education. THUCNews is generated by filtering the historical data of the Sina News RSS subscription channel from 2005 to 2011, comprising 740,000 news documents in 14 categories. In this paper, we employ the subset version instead of the full one, which contains 10 categories, each with 5,000 samples. For the above four long-text classification datasets, we prepend a [CLS] token to each segment and feed the multiple segments of a text as input sequentially. Each segment is generated by slicing the text with a sliding window of 128 tokens. We apply the binary cross-entropy loss to the [CLS] token of the last segment.
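The sliding-window segmentation described above can be sketched as follows. Only the 128-token stride is from the paper; the overall maximum segment length of 512 and the string tokens are illustrative assumptions:

```python
# Sketch: slice a tokenized document into overlapping segments with a
# sliding window (stride) of 128 tokens, prepending [CLS] to each segment.

def make_segments(tokens, stride=128, seg_len=512):
    body = seg_len - 1                  # room left after the [CLS] token
    segments, start = [], 0
    while True:
        segments.append(["[CLS]"] + tokens[start:start + body])
        if start + body >= len(tokens): # last window reaches the document end
            break
        start += stride
    return segments

tokens = [f"tok{i}" for i in range(700)]
segs = make_segments(tokens)
print(len(segs), [len(s) for s in segs])   # 3 [512, 512, 445]
```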
Long Text Semantic Similarity. Because no long-text semantic similarity dataset is available in English, we evaluate the effectiveness of ERNIE-DOC on the semantic similarity task using only the Chinese dataset CAIL2019-SCM. According to Xiao et al. (2019), CAIL2019-SCM is a sub-task of the Chinese AI and Law Challenge (CAIL) competition in 2019, containing 8,964 triplets of legal documents collected from China Judgments Online. Most documents in a triplet have more than 512 characters; therefore, the total length of a triplet is quite long. CAIL2019-SCM requires deciding which two cases in a triplet are more similar. Specifically, given a triplet (A, B, C), where A, B, and C are fact descriptions of three cases, the model needs to predict whether sim(A, B) > sim(A, C) or sim(A, C) > sim(A, B), where sim denotes the similarity between two cases. Instead of separately feeding the documents A, B, and C into the model to obtain features, we use the combinations (A, B) and (A, C) as input. We generate multiple segments for (A, B) or (A, C) with a sliding window of 128 tokens and feed them as input sequentially. The binary cross-entropy loss is applied to the difference between the [CLS] token outputs of each segment.
TriviaQA is a large-scale QA dataset containing over 650K question-answer pairs. We evaluate models on its Wikipedia setting, where documents are Wikipedia articles and answers are named entities mentioned in multiple documents. The dataset is distantly supervised, meaning that there are no golden spans; thus, we find all superficially identical answers in the provided documents. We use the following input format for each segment: "[CLS] context [q] question [/q]", where the context is generated by slicing the multi-document input with a sliding window of 128 tokens. We feed the multiple segments of a sample as input sequentially and attach a linear layer to each token in a segment to predict the answer span. We use a modified cross-entropy loss (Clark and Gardner, 2017), assuming that each segment contains at least one correct answer span. The final prediction for each question is the span with the maximum sum of start and end logits across multiple segments.
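The cross-segment span aggregation can be sketched as follows; the logit values and the maximum span length are made up for illustration:

```python
# Sketch: the final answer is the span with the maximum start-logit +
# end-logit sum over all segments of a sample.

def best_span(segment_logits, max_span_len=30):
    best = (float("-inf"), None)
    for seg_id, (start_logits, end_logits) in enumerate(segment_logits):
        for i, s in enumerate(start_logits):
            # only consider spans that end at or after their start position
            for j in range(i, min(i + max_span_len, len(end_logits))):
                score = s + end_logits[j]
                if score > best[0]:
                    best = (score, (seg_id, i, j))
    return best

logits = [
    ([0.1, 2.0, 0.3], [0.2, 0.1, 1.5]),   # segment 0
    ([0.5, 3.0, 0.2], [0.1, 2.5, 0.3]),   # segment 1
]
score, (seg, i, j) = best_span(logits)
print(seg, i, j, score)   # 1 1 1 5.5
```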
HotpotQA is a QA dataset in which golden answer spans and sentence-level supporting facts are provided. Thus, it contains two tasks, namely, answer span prediction and supporting facts prediction. In the distractor setting, each question is associated with 10 documents, of which only 2 contain supporting facts. The model must find and reason over multiple documents to locate the answers and explain the predicted answers using the predicted supporting facts. Following Groeneveld et al. (2020), we implemented a two-stage model based on ERNIE-DOC, using an input format for each segment that begins with "[CLS]" and contains special tokens representing a sentence and a paragraph, respectively. We then use the binary cross-entropy loss for binary classification. For answer span prediction, we train the model with a multi-task objective: 1) question type (yes/no/span) classification on the [CLS] token; 2) supporting evidence prediction on the [SEP] and [p] tokens; and 3) span prediction on the start and end tokens of a golden span.
CMRC2018, DRCD, and DuReader are common Chinese QA datasets with the same format, which have been evaluated with numerous popular pretraining models, such as BERT (Devlin et al., 2018), ERNIE 1.0 (Sun et al., 2019b), and ERNIE 2.0 (Sun et al., 2020b). Detailed descriptions of the three datasets can be found in Cui et al. (2018), Shao et al. (2018), and He et al. (2017). We adopt the same input format as TriviaQA for each segment, denoted as "[CLS] context [SEP] question [SEP]", where the context is generated by slicing the multi-document input with a sliding window of 128 tokens. We feed the multiple segments of a sample as input sequentially and attach a linear layer to each token in a segment to predict the answer span. Then, we apply a softmax and use the cross-entropy loss against the correct answer. The final prediction for each question is the span with the maximum sum of start and end logits across multiple segments.
The multiple-Choice Chinese machine reading Comprehension dataset (C3) (Sun et al., 2019a) is the first free-form multiple-choice Chinese dataset, where each question is associated with at most four choices and a single document.

Keyphrase Extraction. For OpenKP, we follow the model setting of JointKPE (Sun et al., 2020a) and ETC (Ainslie et al., 2020) by applying CNNs on BERT's output to compose n-gram embeddings for classification. We clean the dataset by removing nonsense tokens such as HTTP links. In detail, we apply five CNNs to BERT's output, with kernel sizes ranging from 1 to 5. Since each word is composed of several sub-tokens, we take the first sub-token's embedding as the input to the CNNs. Finally, we use the binary cross-entropy loss as the optimization objective.

A.2 Ablation Studies
Tab. 10 shows the performance of ERNIE-DOC-Small on English tasks after ablating each proposed component. All models were pretrained and fine-tuned with the same experimental setup, and we report the mean results of five runs. From the last column of Tab. 10, we see that the segment-reordering objective improves ERNIE-DOC by 0.81% on average (#1 - #0), the retrospective feed mechanism improves it by an average of 0.58% (#2 - #1), and the enhanced recurrence mechanism makes a large contribution of 2.55 percentage points on average (#3 - #2). By comparing #3 with #4, we see that segment-level recurrence is necessary for modeling long documents and produces a 4.92-percentage-point improvement on average. Considering different types of tasks, we observe that on Hyperpartisan, an extremely long text classification dataset, a substantial improvement is achieved using the segment-reordering objective (1.5 percentage points). This indicates that the [CLS] token, pretrained using the segment-reordering objective, is more adaptable to document-level text classification tasks. Moreover, we observed a stable performance gain across all tasks using the enhanced recurrence mechanism. (The OpenKP dataset can be downloaded from https://github.com/thunlp/BERT-KPE.)

A.3 Hyperparameters for Language Modeling
In Tab

A.4 Hyperparameters for Pre-Training
As shown in Tab. 12, we present the detailed hyperparameters adopted for pretraining ERNIE-DOC on English and Chinese text corpora. For a fair comparison, we follow the same optimization hyperparameters as RoBERTa-Base or RoBERTa-Large for the base-size and large-size models in the English domain. For the Chinese ERNIE-DOC, we follow the same optimization hyperparameters as ERNIE 2.0-Base.

A.5.2 Document-level Question answering tasks
The finetuning hyperparameters for TriviaQA (Joshi et al., 2017) and HotpotQA (Yang et al., 2018) are presented in Tab. 14. HQA-sent. is the model for coarse-grained evidence prediction, and we select evidence with a probability larger than a pre-defined threshold of 1e-3 and 1e-5 for the base and large models, respectively. HQA-span. is the model for span prediction.

B Attention Complexity
Given a long document of length L, Longformer and BigBird typically apply local attention with a window size of 512 tokens over the entire input, resulting in L × 512 token-to-token computations. In ERNIE-DOC, the long document is fed twice as input, and each input is sliced into segments with a sliding window of 512 tokens, resulting in 2 × (L/512) × 512 × (512 + m) token-to-token computations, where m is the memory length. Since 512 ≪ L and m ≪ L, the attention complexity of ERNIE-DOC is comparable to that of Longformer and BigBird, scaling linearly with respect to the input length L, i.e., O(L). Notably, the segments produced from a long document are fed one by one in ERNIE-DOC, leading to lower memory consumption.
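The two counts can be checked with a quick back-of-the-envelope script; the formulas follow the text above, while the memory length m = 128 is an illustrative assumption:

```python
# Back-of-the-envelope token-to-token attention counts from Appendix B.

def local_attention_ops(L, window=512):
    """Longformer/BigBird-style local attention over the whole input."""
    return L * window

def ernie_doc_ops(L, seg=512, mem=128):
    """ERNIE-DOC: two passes over L/seg segments, each of length seg,
    attending over seg + mem positions."""
    n_segments = L / seg
    return 2 * n_segments * seg * (seg + mem)

L = 1_000_000
print(local_attention_ops(L))   # 512000000
print(ernie_doc_ops(L))         # 1280000000.0  -- both scale as O(L)
```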