Skim-Attention: Learning to Focus via Document Layout

Transformer-based pre-training techniques of text and layout have proven effective in a number of document understanding tasks. Despite this success, multimodal pre-training models suffer from very high computational and memory costs. Motivated by human reading strategies, this paper presents Skim-Attention, a new attention mechanism that takes advantage of the structure of the document and its layout. Skim-Attention only attends to the 2-dimensional position of the words in a document. Our experiments show that Skim-Attention obtains a lower perplexity than prior works, while being more computationally efficient. Skim-Attention can be further combined with long-range Transformers to efficiently process long documents. We also show how Skim-Attention can be used off-the-shelf as a mask for any Pre-trained Language Model, allowing to improve their performance while restricting attention. Finally, we show the emergence of a document structure representation in Skim-Attention.


Introduction
More and more companies have started automating their document processing workflows by leveraging artificial intelligence techniques. This has led to the emergence of a dedicated research topic, Document Intelligence (DI, https://sites.google.com/view/di2019), which encompasses the techniques used to read, interpret and extract information from business documents. Such documents span multiple pages and contain rich multi-modal information that includes both text and layout. The earliest approaches to analyzing business documents relied on rule-based algorithms (Lebourgeois et al., 1992; Amin and Shiu, 2001), but the success of deep learning has put computer vision and natural language processing (NLP) models at the heart of contemporary approaches (Katti et al., 2018; Denk and Reisswig, 2019). With the massive impact of large pre-trained Transformer-based language models (Devlin et al., 2019; Radford et al., 2019), DI researchers have recently started leveraging Transformers.
At the core of the Transformer architecture is self-attention, a powerful mechanism which contextualizes tokens with respect to the whole sequence. While being the key to the success of Transformers, it is also its bottleneck: the time and memory requirements of self-attention grow quadratically with sequence length. As a consequence, only short sequences can be processed (512 tokens or 1,024 at most), making it impossible to capture long-term dependencies. This is an important issue for DI since texts can be very dense and long in business documents. To allow efficient training on very long sequences, there has been growing interest in building model architectures that reduce the memory footprint and computational requirements of Transformers (Dai et al., 2019;Kitaev et al., 2020;Beltagy et al., 2020). This plethora of long-range Transformers lie in one specific research direction: capturing long-range dependencies by reducing the cost of self-attention.
These Transformer architectures all operate on serialized texts, i.e. one-dimensional sequences of words, completely disregarding the document layout. However, layout, i.e. the physical organization of a document's contents, carries useful information about the semantics of the text and has a significant impact on readers' understanding (Wright, 1999). Thus, ignoring the document layout leads to a considerable loss of information. To address this issue, an orthogonal direction that has gained traction recently is based on integrating layout information into Transformer-based language models. Joint pre-training of text and layout has allowed models to reach state-of-the-art performance in several downstream tasks concerning layout-rich documents (Xu et al., 2020b; Pramanik et al., 2020; Xu et al., 2020a). Despite their effectiveness, these approaches all "read" the contents token by token to compute attention. We claim that one does not need to have read each word in a document page to be able to understand a specific paragraph. Thus, we argue that, to efficiently process long documents, it is a waste of effort and computation to contextualize a token with respect to the entire input sequence.
To shift towards processing long documents with awareness of their structure, we propose to take into account layout in a more intuitive and efficient way. First, we present a quick cognitive experiment wherein we show that layout plays a fundamental role in humans' comprehension of documents. In light of this experiment, we claim that one can already gather a lot of information from the layout alone. As a consequence, we propose Skim-Attention, a new self-attention mechanism that is solely based on the 2-D position of tokens in the page, independently of their semantics. To exploit this mechanism, we introduce Skimformer and SkimmingMask, two frameworks for integrating Skim-Attention into Transformer models. Skimformer is an end-to-end Transformer language model that replaces self-attention with Skim-Attention. Based on the tokens' spatial locations, Skimformer computes the Skim-Attention scores only once, before using them in each layer of a text-based Transformer encoder. Skimformer can also be adapted to long-range Transformers to model longer documents. Conversely, SkimmingMask uses Skim-Attention as a mask to sparsify attention in any Transformer language model. Each token is restricted to its k most attended tokens, as indicated by Skim-Attention, which allows for a smaller context length.
In summary, our main contributions are as follows: • We introduce Skim-Attention, a new attention mechanism that leverages layout.
• We design two frameworks for integrating Skim-Attention into Transformer models, and show that they are more time and memory efficient than LayoutLM.
• To the best of our knowledge, this is the first time layout is considered as a means for reducing the cost of self-attention.

Cognitive Background
The layout of a document, which refers to the arrangement and organization of its visual and textual elements, has a significant influence on readers' behavior and understanding (Wright, 1999;Kendeou and Van Den Broek, 2007;Olive and Barbier, 2017). It has been shown that a well-designed layout results in less cognitive effort (Britton et al., 1982;Olive and Barbier, 2017) and facilitates comprehension of the conveyed information by helping identify the document type and its constituents, as well as providing cues regarding relationships between elements (Wright, 1999). Semiotic research assumes that readers scan the document before taking a closer look at certain units (Kress et al., 1996), a claim supported by eye-tracking experiments on newspapers (Leckner, 2012). For all these reasons, layout is a critical element for document understanding, which motivates its integration into modeling. Inspired by these research findings, our work focuses on exploiting layout in a similar fashion as humans, since this can be key to a successful model coping with long and complex documents.

Long-range Transformers
In the field of natural language processing, Transformers have become the go-to component in the modern deep learning stack. In recent years, there has been a substantial growth in the number of Transformer variants (long-range Transformers) that improve computational and memory efficiency, making it possible to extend the maximum sequence length and to incorporate long-term context. Models such as Longformer (Beltagy et al., 2020), Reformer (Kitaev et al., 2020), and Performer (Choromanski et al., 2020) are able to process sequences of thousands of tokens or longer. Although these models are highly efficient in reducing time and memory requirements, they consider long documents as huge one-dimensional blocks of texts: Reformer, for instance, has to read the 4,096 elements contained in the input sequence in order to create buckets of similar elements. Hence, all information about the document structure is lost. Our approach is orthogonal to long-range Transformers; instead of focusing only on architecture optimization, we propose to leverage layout-rich information.

Multi-modal Pre-training Techniques for Document Understanding
Recently, multi-modal pre-training techniques have become increasingly popular in the document understanding area (Xu et al., 2020b; Pramanik et al., 2020; Garncarek et al., 2020; Wu et al., 2021). This research direction consists in jointly pre-training on textual and layout/visual information from a large and heterogeneous collection of unlabeled documents, learning cross-modal interactions in an end-to-end fashion. Based on the BERT architecture, Xu et al. (2020b) build LayoutLM, a multi-modal Transformer model that ties spatial information with tokens through a point-wise summation to learn pre-trained embeddings for document understanding tasks. LayoutLM, along with most approaches in prior art, is not motivated by efficiency and cognitive perspectives. The layout information is rather considered as an additional feature, and this approach requires "reading" each individual token one by one. As opposed to LayoutLM, in our proposed approach, attention is computed exclusively on spatial positions. This leads to improvements in time and memory efficiency. In addition, our approach can be plugged into any textual language model, making it more flexible than LayoutLM, which requires both text and layout to be learnt jointly in an extensive pre-training stage.

Preliminary Experiments: Human Evaluation
How much does the document layout help in comprehending long textual contents? How much faster is it for humans to find information in documents when layout is provided? To answer these questions, we conduct a simple cognitive experiment wherein we measure the amount of time needed for human annotators to retrieve information from both formatted and plain-text documents. Half of the time, they are given access to the full layout, and the other half, to plain text only (i.e., no layout nor formatting). Table 1 reports the average time needed to retrieve information from the documents. We find that it is 2.5× faster to answer questions from the formatted documents, and that the variability in the results is much lower in this case. These results support the hypothesis that less cognitive effort is spent when the document is formatted, emphasizing the importance of layout information in reading comprehension. We believe that machines could benefit from the document layout, just like humans, as a strategy to retrieve information faster while expending less effort. In particular, layout information could be of great help in reducing the cost of self-attention in Transformer models.

Proposed Approach
Using common sense, and in light of the cognitive experiment previously reported, it is clear that the layout is of utmost importance for humans to understand long documents. We propose to take into account layout by introducing Skim-Attention, a self-attention module that computes attention solely based on spatial positions. To process long and layout-rich documents, we propose different ways of integrating this mechanism into Transformer architectures.

Background on Transformers
We first provide an overview of the well-established Transformer, an encoder-decoder architecture composed by stacking a series of Transformer blocks on top of each other. Each block is characterized by a self-attention module. Given an input sequence encoded as a matrix X ∈ R^{n×d}, the operation for a single layer is defined as:

Attention(Q, K, V) = Softmax(QK^T / √d) V    (1)

where Q, K and V are the Query, Key and Value matrices obtained by a linear transformation of X. More intuitively, the attention matrix, A = QK^T, provides text-based similarity scores for all pairs of tokens in the sequence, while each row in Softmax(QK^T / √d) represents a distribution that indicates how we need to aggregate information from the input tokens (V) for the corresponding output token (Q).
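As an illustrative sketch (NumPy, not taken from any released codebase), single-head scaled dot-product self-attention can be written as:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X (n x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # n x n similarity matrix
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)              # row-wise softmax
    return A @ V                                       # contextualized outputs, n x d
```

The n × n `scores` matrix is exactly where the quadratic cost discussed below comes from.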
It is clear that the main limitation of Transformers lies in the computational and memory requirements of the attention: to obtain the attention matrix, inner products between each key and each query need to be computed, resulting in a quadratic complexity w.r.t. the input sequence length. This operation is repeated at each layer, hence processing longer sequences quickly becomes computationally challenging. Finally, the standard Transformer architecture considers documents as serialized sequences of texts, leading to a severe loss of information when it comes to layout-rich documents.

Skim-Attention Overview
Our novel attention mechanism, Skim-Attention, views documents as collections of boxes distributed over a two-dimensional space, i.e., the page. In the following, we provide details on how to encode spatial positions into layout embeddings, before describing our attention module.
Layout Embeddings Layout embeddings carry information about the spatial position of the tokens. Following LayoutLM (Xu et al., 2020b), the spatial position of a token is represented by its bounding box in the document page image, (x_0, y_0, x_1, y_1), where (x_0, y_0) and (x_1, y_1) respectively denote the coordinates of the top-left and bottom-right corners. We discretize and normalize the coordinates to integers in [0, 1000]. Four embedding tables are used to encode spatial positions: two for the coordinate axes (x and y), and the other two for the bounding box size (width and height). The final layout embedding ℓ ∈ R^d of a token located at position (x_0, y_0, x_1, y_1) is defined by:

ℓ = E_x(x_0) + E_x(x_1) + E_y(y_0) + E_y(y_1) + E_w(x_1 − x_0) + E_h(y_1 − y_0)    (2)

Skim-Attention We propose Skim-Attention, an attention mechanism that leverages document layout in a novel way. As opposed to standard self-attention, Skim-Attention does not depend on the text semantics (i.e. token representations), as it calculates the attention using only the spatial positions of the tokens, i.e. their layout embeddings.
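A minimal sketch of such layout embeddings follows (NumPy); sharing one table across x_0/x_1 and another across y_0/y_1 is our reading of the four-table description, and details may differ from the actual implementation:

```python
import numpy as np

class LayoutEmbedding:
    """Sketch of layout embeddings: four tables (x, y, width, height),
    coordinates discretized to integers in [0, 1000]. The table-sharing
    scheme across corners is an assumption."""
    def __init__(self, d, vocab=1001, seed=0):
        rng = np.random.default_rng(seed)
        self.Ex = rng.normal(size=(vocab, d))  # shared by x0 and x1
        self.Ey = rng.normal(size=(vocab, d))  # shared by y0 and y1
        self.Ew = rng.normal(size=(vocab, d))  # width
        self.Eh = rng.normal(size=(vocab, d))  # height

    def __call__(self, box):
        x0, y0, x1, y1 = box
        return (self.Ex[x0] + self.Ex[x1] + self.Ey[y0] + self.Ey[y1]
                + self.Ew[x1 - x0] + self.Eh[y1 - y0])
```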
Formally, let X_ℓ = {ℓ_0, ℓ_1, . . . , ℓ_n} be an input sequence of layout embeddings, and Q_ℓ = W_q X_ℓ, K_ℓ = W_k X_ℓ the Queries and Keys obtained by linear transformations of the layout embeddings. For a single attention head, the Skim-Attention matrix is defined by:

A = Softmax(Q_ℓ K_ℓ^T / √d)    (3)

Intuitively, A captures the correlation between two tokens based on their spatial positions: the more similar two tokens are in terms of layout embeddings, the higher their attention score.
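Concretely, skim-attention can be sketched as follows (NumPy; illustrative only) — note that it is computed once from layout embeddings alone, with no token semantics involved:

```python
import numpy as np

def skim_attention(L, Wq, Wk):
    """Skim-attention scores computed solely from layout embeddings L (n x d_l).
    Returns a row-stochastic n x n matrix; no token representations are used."""
    Q, K = L @ Wq, L @ Wk
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```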
Since attention is calculated only once, we want the layout embeddings to be as meaningful as possible. Therefore, to obtain better layout representations, we contextualize them by adding a small Transformer prior to computing Skim-Attention.
It is possible to combine Skim-Attention with any long-range Transformer, as these approaches are orthogonal. We adapt our approach by computing the corresponding long-range attention only once, based on layout instead of text semantics.

Skim-Attention in Transformers
We investigate two approaches to exploit Skim-Attention: i) Skimformer, wherein self-attention is replaced by Skim-Attention; and ii) Skimming-Mask, where an attention mask is built from Skim-Attention and fed to a Transformer language model.
Skimformer is a two-stage Transformer that replaces self-attention with Skim-Attention. Inspired by previous work in cognitive science, the intuition behind this approach is to mimic how humans process a document by i) skimming through the document to extract its structure, and ii) reading the contents informed by the previous step. Skimformer accepts as inputs a sequence of token embeddings and the corresponding sequence of layout embeddings. The model adopts a two-step approach: first, the skim-attention scores are computed once and only once using layout information alone; then, these attentions are used in every layer of a Transformer encoder. The architecture of Skimformer is depicted in Figure 1a.
For a given encoder layer k and a single head, the traditional self-attention operation becomes:

Head_k(X_t) = A V_k^t    (4)

where A is the skim-attention matrix obtained through Eq. 3, and V_k^t = W_{v,k} X_t is the Value matrix produced by projecting the textual input.

[Figure 1: Both models take as input a sequence of tokens and a sequence of token bounding box coordinates. The input of each modality is converted to an embedding sequence. Only the layout embeddings are used to compute Skim-Attention.]
More intuitively, computing skim-attention scores (Eq. 3) can be interpreted as skimming through the document. Information about the semantics (contained in V) is then routed based on these similarity scores. This is done via Eq. 4 and can be seen as reading the contents of the document, focusing on the most relevant parts informed by the skim-attention scores.
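A rough sketch of how the precomputed matrix A is reused across layers (residual connections, multiple heads, layer normalization and feed-forward sublayers are all omitted for brevity):

```python
import numpy as np

def skimformer_layers(A, Xt, Wv_per_layer):
    """Reuse the precomputed skim-attention matrix A (n x n) in every layer:
    each layer routes the current hidden states through A via its own
    Value projection (a bare-bones reading of Eq. 4)."""
    h = Xt
    for Wv in Wv_per_layer:
        h = A @ (h @ Wv)   # A is fixed; only the Value projection varies per layer
    return h
```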
We train Skimformer using Masked Visual-Language Modeling (MVLM), a pre-training task that extends Masked Language Modeling (MLM) with layout information. MVLM randomly masks some of the input tokens but preserves their layout embeddings. The model is then trained to recover the masked tokens given the contexts. Hence, MVLM helps capture nearby token features, leveraging both semantics and spatial information.
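MVLM's input corruption can be sketched as below; the 15% masking rate and the plain `[MASK]` replacement are assumptions borrowed from standard MLM practice, not details confirmed here:

```python
import random

def mvlm_mask(tokens, boxes, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly mask tokens while keeping their bounding boxes intact,
    so layout information remains available when predicting masked words.
    mask_prob and the replacement rule are assumptions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)       # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)      # not a prediction target
    return masked, boxes, labels     # boxes are returned unchanged
```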
While we experimented with a standard Transformer model, it is worth noting that any language model can be used as the backbone of Skimformer.
SkimmingMask For each token in a sequence, Skim-Attention provides a ranking of the other tokens based on their layout-based similarity. Leveraging this, SkimmingMask uses Skim-Attention as a mask to restrict the computation of self-attention to a smaller number of elements for each token. In this setting, Skim-Attention is viewed as an independent, complementary module that can be plugged into any language model. Given a sequence of layout embeddings, the corresponding skim-attention matrix is converted to an attention mask: based on the similarity scores provided in the attention matrix, each token can only attend to its k most similar tokens. The resulting mask is then given as input to a text-based Transformer language model with standard self-attention, and is used to restrict self-attention for each element in the input text sequence. This can be viewed as sparsifying the standard self-attention matrix.
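Building the mask from a skim-attention matrix can be sketched as follows (NumPy; illustrative):

```python
import numpy as np

def skimming_mask(A, k):
    """Boolean mask letting each token attend only to its k most attended
    tokens according to the skim-attention matrix A (n x n)."""
    n = A.shape[0]
    topk = np.argsort(A, axis=-1)[:, -k:]       # indices of the k largest scores per row
    mask = np.zeros((n, n), dtype=bool)
    mask[np.arange(n)[:, None], topk] = True    # allow attention only to top-k positions
    return mask
```

The resulting boolean matrix can then be passed as an attention mask to a standard Transformer encoder, which is the sense in which SkimmingMask sparsifies self-attention.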
SkimmingMask is not trainable end-to-end with the Transformer model it is plugged into, as creating an attention mask from an attention matrix is not a differentiable operation (we leave this for future work). Thus, to train this model, the weights for Skim-Attention need to be already trained, and we naturally use the Skimformer weights. The overall architecture of the model is illustrated in Figure 1b.
We note that SkimmingMask is a new way to cluster tokens: all tokens belonging to the same group have a high similarity to each other with respect to their layout positions. This makes SkimmingMask an approach concurrent to Reformer, which reduces the cost of self-attention by clustering tokens into chunks. As opposed to the latter, the concept of similarity is not based on text semantics but on the document structure. Moreover, SkimmingMask does not require the semantics of each token, but only their layout features. Because each token is viewed as a bounding box whose only characteristics are its size and position, the representation space of layout features is much smaller than that of the text, which spans a vocabulary of more than 30k sub-words. As a consequence, computing attention based on layout could require a smaller latent space dimension than for text, corresponding to less computational effort. This is also the case for humans: as demonstrated in Section 3, it is much easier to retrieve information from documents when the layout is provided.

Data
Pre-training Data To pre-train our models on a wide variety of document formats, we select three datasets with various non-trivial document layouts: DocBank (Li et al., 2020), RVL-CDIP (Harley et al., 2015) and PubLayNet (Zhong et al., 2019). We combine them by randomly selecting 25k documents from each dataset, for a total of 75k documents. We discard the provided labels and consider these data as unannotated. The resulting dataset is referred to as MIX. As a first evaluation metric, we can compare the perplexity of the different language models on MIX.
DocBank DocBank is a large-scale dataset that contains 500k English document pages from papers extracted from arXiv.com. These articles span a variety of disciplines (e.g. Physics, Mathematics, and Computer Science), which is beneficial to train more robust models. Pages are split into a training set, validation set and test set with a ratio of 8:1:1. As the authors already extracted the text and bounding boxes using PDFPlumber (https://github.com/jsvine/pdfplumber), there is no need for an OCR system or a PDF parser. To build our subset, we extract 25k document pages: 20k from the full training set, 2,500 from the validation set and 2,500 from the test set.
RVL-CDIP RVL-CDIP is a large collection of 400k scanned document images from various categories (e.g. letter, form, advertisement, invoice). The wide range of layouts, as well as the low image quality, allows training more robust models. We select 25k documents from the RVL-CDIP dataset available on Kaggle (https://www.kaggle.com/nbhativp/first-half-training), which amounts to half of the training images from the full dataset (160k images). The text and word bounding boxes are extracted using Tesseract. We split the data into 80% for training, 10% for validation and 10% for test.
PubLayNet PubLayNet comprises over 360k document images from PubMed Central™ Open Access. The medical publications contained in the collection have similar layouts, but the text density coupled with the small image size adds to the robustness of the trained models. We extract the first training split among the 7 available on IBM Data Asset eXchange and use the first 20k images as our training set. For the validation and test sets, we keep the first 2,500 images in each split. Because OCR accuracy is too low without any preprocessing, we apply a few image processing operations (i.e. rescaling, converting to grayscale, applying dilation and erosion) on each image in order to improve text extraction.
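The preprocessing pipeline could look roughly like the following; we use plain NumPy morphology for illustration (an image library would normally be used, and the binarization rule, scale factor, and 3×3 structuring element are all assumptions):

```python
import numpy as np

def preprocess(img, scale=2):
    """Sketch of an OCR preprocessing pipeline: grayscale conversion,
    rescaling, then dilation and erosion on a binarized image.
    All parameters are illustrative assumptions."""
    # grayscale: average the colour channels
    gray = img.mean(axis=-1) if img.ndim == 3 else img
    # rescale by nearest-neighbour repetition
    big = np.kron(gray, np.ones((scale, scale)))
    # binarize with a global mean threshold (assumption)
    binary = big > big.mean()

    def shifts(a):
        # stack of the image shifted over a 3x3 neighbourhood (wrap-around)
        return np.stack([np.roll(np.roll(a, dy, 0), dx, 1)
                         for dy in (-1, 0, 1) for dx in (-1, 0, 1)])

    dilated = shifts(binary).any(axis=0)   # 3x3 dilation
    eroded = shifts(dilated).all(axis=0)   # 3x3 erosion (closing overall)
    return eroded
```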

Dataset for Document Layout Analysis
In addition to perplexity, we evaluate our approach on a downstream task, document layout analysis. Document layout analysis consists in associating each token with its corresponding category: abstract, author, caption, date, equation, footer, list, paragraph, reference, section, table, title and figure. We use a subset of the full DocBank dataset, created by selecting 10k document pages (distinct from the ones used for pre-training): 8,000 from the full training set, 1,000 from the validation set and 1,000 from the test set. We refer to this dataset as DocBank-LA. Each document page is organized as a list of words with bounding boxes, colors, fonts and labels. We use the precision, recall and F1 score defined by Li et al. (2020).

Experimental Settings
For reproducibility purposes, we make the code publicly available.
Baselines We compare our models with three baselines: i) the text-only BERT, ii) the multimodal LayoutLM, and iii) the text-only Longformer for long documents. Note that the LayoutLM architecture is based on BERT, with additional layout embeddings.
Document Layout Analysis As DocBank contains fine-grained token-level annotations, we consider the document layout analysis task as a sequence labeling task. Each model pre-trained on MIX is fine-tuned on this downstream task for 10 epochs. Section B of the appendix provides a detailed description of the settings used. For the SkimmingMask models, we selected the hyperparameter k, i.e. the number of tokens that can be attended to, on validation.

Perplexity
Skimformer and LongSkimformer respectively outperform BERT and Longformer by a huge margin, while improving perplexity by more than 10 points over LayoutLM. In addition, Figure 2 demonstrates that Skimformer converges much faster than BERT, and slightly faster than LayoutLM.

Ablation Study
We further conduct an ablation study of the influence of the Skim-Attention inputs on Skimformer's performance. The results are listed in Table 3. To estimate the impact of the input type, we consider Skimformer variants in which i) Skim-Attention is based on sequential positions (1D position); ii) the bounding boxes are all set to the same fixed value, preventing the model from gathering any information about the true location (Uniform layout); iii) the bounding boxes are replaced by their centers (Degraded layout); and iv) the layout embeddings are contextualized (Contextualized Layout).
We can see that replacing spatial with sequential positions results in an increase in perplexity, indicating that layout information is crucial for the language model. It is also observed that assigning the same bounding box to every token leads to a severe drop in performance. Coupled with the perplexity obtained with a degraded layout, this shows that the model's performance is greatly impacted by the layout input quality. Lastly, contextualizing the layout inputs through a small Transformer brings slight improvements over computing Skim-Attention directly on the layout embeddings.
Finally, we benchmark Skimformer and LayoutLM on both speed and peak memory usage for training. Results provided in Figure 5 of the appendix show that Skimformer is more time and memory efficient than LayoutLM. Table 4 reports the performance on DocBank-LA, the sequence length processed, the number of times attention is computed, and the ratio of the total calculation unit (n² × Nb Skim-Attn + Seq. Len² × Nb Standard Attn, where n is the length of the initial sequence on which Skim-Attention is applied, and Seq. Len is the length obtained after applying SkimmingMask) to that of BERT/LayoutLM and Longformer. All models were pre-trained from scratch on MIX.

Document Layout Analysis
Skimformer is substantially superior to BERT, improving the F1 score by 15% while reducing the number of attention computations by a factor of four. We experimented with plugging the layout embeddings learnt by Skimformer into a BERT model. The resulting model, BERT+SkimEmbeddings, resembles LayoutLM in terms of architecture. Results show that BERT+SkimEmbeddings performs on par with LayoutLM despite simply combining separately pre-trained modalities, as opposed to the latter, which requires extensive joint training.
For the SkimmingMask models (see the last two rows in Table 4), the models attend only to their top k = 128 tokens. Compared to LayoutLM, this reduction of the quadratic factor allows obtaining the same downstream results with only 31.25% of the computational burden. Compared to BERT, it even obtains an absolute improvement of more than 6% in terms of F1 score.
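As a sanity check on the 31.25% figure, the calculation-unit ratio defined above can be evaluated with plausible counts. The specific assumption that skim-attention is computed 3 times over n = 512 tokens (e.g. once plus a small layout contextualizer), against a 12-layer full-attention baseline, happens to reproduce the reported number:

```python
def calc_ratio(n, nb_skim, seq_len, nb_std, baseline_n=512, baseline_layers=12):
    """Calculation-unit ratio (n^2 * Nb_skim + seq_len^2 * Nb_std) relative
    to a full-attention baseline. The counts used below are assumptions,
    not figures stated in the text."""
    cost = n ** 2 * nb_skim + seq_len ** 2 * nb_std
    return cost / (baseline_n ** 2 * baseline_layers)

# SkimmingMask: skim-attention 3x over 512 tokens (assumed), then 12 layers
# of standard attention restricted to k = 128 tokens:
ratio = calc_ratio(n=512, nb_skim=3, seq_len=128, nb_std=12)  # = 0.3125
```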
LongSkimformer benefits from both Skim-Attention and Longformer's gain in efficiency. It outperforms Longformer by 5% while requiring four times fewer attention operations, and the use of Longformer's linear attention allows LongSkimformer to process sequences four times larger than Skimformer can. (In BERT+SkimEmbeddings, the layout embeddings are first projected into the same dimensional space as the text embeddings; in this way, we can plug in the layout embeddings from any Skimformer model, in particular smaller ones.)

Figure 3 shows the attention maps produced by Skimformer on a sample document. Given a semantic unit (either title or abstract in our example), we select the corresponding tokens and compute their average attention over the whole document. We observe, both qualitatively and quantitatively, that tokens attend mainly to other elements in the same semantic unit, thus creating clusters of tokens that are relevant to each other. This shows that the model has grasped the concept of semantic unit with only self-supervision, enabling the emergence of a document structure representation. We argue that these structure-aware clusters could pave the way for long text encoding and unsupervised document segmentation.

Conclusion
We present Skim-Attention, a new structure-aware attention mechanism. We conduct extensive experiments to show the effectiveness of Skim-Attention, both as an end-to-end model (Skimformer) and as a mask for any language model (SkimmingMask). We hope this work will pave the way towards a new research direction for efficient attentions. For future work, we will investigate how to integrate image features, and explore tasks that require capturing longer-range dependencies.

References

Patricia Wright. 1999. The psychology of layout: Consequences of the visual structure of documents. American Association for Artificial Intelligence Technical Report FS-99-04, pages 1-9.

Skim-Attention: Learning to Focus via Document Layout - Appendix

A Preliminary Experiments: Human Evaluation
To evaluate the impact of layout on readers' understanding, we conduct an experiment in which we measure the amount of time required for human annotators to answer questions from both formatted and non-formatted documents. We hand-pick four document pages from the DocBank dataset (Li et al., 2020), and create a plain-text version out of each of these documents by flattening them. This results in eight pages: the four original document pages, and their serialized versions with no layout nor formatting. The original, formatted document pages are displayed in Figure 4. We create two basic questions for each document (answers are provided in italics):
• Document (c):
- What is proposed in this paper? A Reinforced Neural Extractive Summarization model to extract a coherent and informative summary from a single document.
- What is compared in Table 3? Human evaluation in terms of informativeness (Inf), coherence (Coh) and overall ranking.
• Document (d):
- When was this paper submitted? May 28, 2020.
- What are the keywords of this paper? Touchscreen keyboards, gesture input, model-based design, Monte Carlo simulation.
Four annotators are asked to answer these questions. Each of them alternates between fully formatted contents (i.e. the original document page) and plain text: annotators 1 and 3 have access to the document layout for documents 1 and 3, while annotators 2 and 4 have access to it for documents 2 and 4.
Given a document, the instructions are as follows:
1. Read the entire document, then the questions;
2. Start the timer;
3. Find the answer to the first question (without writing it down);
4. Stop the timer and check if the answer is correct:
• If this is the case, write down the time indicated by the timer, then reset it and answer the second question by re-iterating steps 2 to 4.
• If not, resume the timer until you find the correct answer.
Results per document and document type (i.e., formatted or non-formatted) are given in Table 5. The entirety of the results is reported in Table 6.

B Implementation Details
Pre-training For BERT, LayoutLM and Longformer, we use the PyTorch implementation from Hugging Face's Transformers library (Wolf et al., 2020).
Table 6: Time (in seconds) taken by each annotator to answer each question. The average per document is also reported. Yellow cells indicate that the document layout was provided for the corresponding documents and annotators.
Each model is trained from scratch on the MIX dataset for 10k steps with a batch size of 8, except for Longformer, which was trained with a smaller batch size of 4 due to memory limitations. We use the Adam optimizer with weight decay fix (Loshchilov and Hutter, 2017), a weight decay of 0.01 and (β1, β2) = (0.9, 0.999). The learning rate is set to 1e-4 and linearly warmed up over the first 100 steps. The maximum sequence length is set to n = 512, with the exception of Longformer and LongSkimformer, for which n = 2,048. Following BERT, we mask 15% of the text tokens in MVLM; among these, 80% are replaced by a special [MASK] token, 10% are replaced by a random token, and the remaining 10% are left unchanged.
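As a concrete illustration, the 80/10/10 masking scheme above can be sketched as follows. This is a minimal pure-Python sketch: the actual pre-training code operates on token-id tensors, and `MASK_ID` is a hypothetical id for the [MASK] token.

```python
import random

MASK_ID = 103  # hypothetical id of the [MASK] token

def mask_tokens(tokens, vocab_size, mlm_prob=0.15, rng=random):
    """Apply the 80/10/10 masking scheme to a list of token ids.

    Returns the corrupted inputs and the labels: selected positions keep
    their original id as the label; all others are set to -100 so the
    cross-entropy loss ignores them.
    """
    inputs, labels = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < mlm_prob:       # position selected for prediction
            labels.append(tok)
            r = rng.random()
            if r < 0.8:                   # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                 # 10%: replace with a random token
                inputs[i] = rng.randrange(vocab_size)
            # remaining 10%: keep the original token
        else:
            labels.append(-100)
    return inputs, labels
```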
Document Layout Analysis Each model pre-trained on MIX is fine-tuned on DocBank's document layout analysis task for 10 epochs, with a learning rate of 5e-5 and a batch size of 8 (except for Longformer, which was fine-tuned with a batch size of 4). The models are extended with a token-classification head on top, consisting of a linear layer followed by a softmax layer, and are trained using the cross-entropy loss.
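The token-classification head amounts to a linear projection of each token's final hidden state followed by a softmax over the label set, with a per-token cross-entropy loss. A minimal sketch for a single token, with illustrative weights and shapes:

```python
import math

def token_classification_head(hidden, weight, bias):
    """Linear layer followed by a softmax over the label set, applied to
    one token's final hidden state (all shapes are illustrative)."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weight, bias)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]   # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, gold):
    """Per-token training loss: negative log-probability of the gold label."""
    return -math.log(probs[gold])
```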
For SkimmingMask, we select the Skim-Attention module from the Skimformer model pretrained from scratch on MIX. We then plug it into a BERT model, also pre-trained from scratch on MIX. The resulting model is fine-tuned with the same settings as described previously.
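One way to realize SkimmingMask is to turn the skim-attention distribution into a hard attention mask, e.g. by keeping, for each query position, only the highest-scoring key positions. The sketch below assumes such a top-k rule; this is an illustration, not necessarily the exact selection criterion used in the experiments.

```python
def topk_attention_mask(skim_scores, k):
    """Turn skim-attention scores into a hard 0/1 mask: for each query
    position, keep only the k highest-scoring key positions.
    skim_scores: one row per query, each a list of key scores.
    Returns a mask of the same shape."""
    mask = []
    for row in skim_scores:
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        mask.append([1 if j in keep else 0 for j in range(len(row))])
    return mask
```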

C Benchmark
Using Hugging Face's Transformers benchmarking tools (Wolf et al., 2020), we benchmark Skimformer and LayoutLM on both speed and the memory required for pre-training. We consider the base variant of LayoutLM, and use the implementation from the Transformers library. In addition to the full Skimformer, we evaluate a variant in which the small Transformer contextualizing layout embeddings is removed (Skimformer-no-context). The batch size is fixed to 8, and memory and time performance are evaluated for the following sequence lengths: 8, 32, 128 and 512. We use Python 3.7.10, PyTorch 1.8.1+cu101 (Paszke et al., 2019), and Transformers 4.6.0.dev0. All experiments were conducted on one Tesla T4 with 15 GB of RAM. Figure 5 reports the time (figure 5a) and peak memory consumption (figure 5b) with respect to the sequence length.

Figure 6 contains the attention maps obtained by Skimformer on two documents sampled from PubLayNet (Zhong et al., 2019). For each sample, we average the attention scores of the tokens belonging to a given semantic unit, and map the result onto the document image. In the first document (figure 6a), we focus on the top table (left) and the bottom one (right). In the second sample (figure 6b), we investigate the title (left), the authors (center) and the abstract (right).
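The per-unit attention maps described above can be obtained by averaging the attention rows of the tokens that make up a semantic unit. A minimal sketch, with an illustrative function name and a plain list-of-lists representation in place of attention tensors:

```python
def unit_attention(attn, unit_token_ids):
    """Average attention rows over the tokens of one semantic unit
    (e.g. a table or the title), as in the visualizations of figure 6.
    attn: per-token attention rows (list of lists);
    unit_token_ids: indices of the tokens belonging to the unit."""
    n = len(attn[0])
    summed = [0.0] * n
    for i in unit_token_ids:
        for j, a in enumerate(attn[i]):
            summed[j] += a
    return [s / len(unit_token_ids) for s in summed]
```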