SCROLLS: Standardized CompaRison Over Long Language Sequences

NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets, and handpick ones where the text is naturally long, while prioritizing tasks that involve synthesizing information across the input. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. Initial baselines, including Longformer Encoder-Decoder, indicate that there is ample room for improvement on SCROLLS. We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.


Introduction
Standard benchmarks à la GLUE (Wang et al., 2018, 2019), WMT (Barrault et al., 2019, 2020), and SQuAD (Rajpurkar et al., 2016, 2018) have driven progress in natural language processing of short utterances. However, a large portion of natural language is produced in the context of longer discourses, such as books, articles, and meeting transcripts. To tackle the computational challenges associated with processing such long sequences, a plethora of new model architectures have recently emerged (Tay et al., 2020b; Fournier et al., 2021), without establishing a standard scheme for evaluating them on long natural language problems. Some long-context models are evaluated via language modeling perplexity, but this metric mostly captures model sensitivity to local, short-range patterns (Khandelwal et al., 2018; Sun et al., 2021). Other studies rely on Long Range Arena (Tay et al., 2021), which is limited from a natural-language perspective, since only two of its datasets involve natural language, and those are artificially elongated through byte tokenization. To enable the research community to go beyond sentences and paragraphs, we present a new benchmark, SCROLLS: Standardized CompaRison Over Long Language Sequences (https://www.scrolls-benchmark.com).
SCROLLS incorporates multiple tasks (summarization, question answering, and natural language inference) over various domains (literature, meeting transcripts, TV shows, scientific articles, and more), where each example's input typically contains thousands of words. We review the existing literature on long-text tasks and manually curate a subset of 7 datasets, prioritizing those that require contextualizing and abstracting information across multiple parts of the text. We then clean and convert the data to a unified text-to-text format to enable the evaluation of a single model over all datasets. Figure 1 shows that the texts in SCROLLS datasets are substantially longer than those in commonly used NLP benchmarks. Moreover, our analysis reveals that, in SCROLLS, critical information is spread out across longer distances within the input documents.
SCROLLS is available via the Datasets library (Lhoest et al., 2021) or direct download on its website, which hosts a live leaderboard that accepts submissions and automatically evaluates them against private test sets. By producing a single aggregate score, in addition to individual dataset scores, SCROLLS can serve as an evaluation platform for future approaches to processing long text, whether by new pretraining schemes, novel transformer architectures and alternatives, or even retrieval-based methods. We provide initial baselines for SCROLLS using two transformer models, BART (Lewis et al., 2020), and its length-efficient variant, Longformer Encoder-Decoder (Beltagy et al., 2020). Our experiments indicate that SCROLLS poses a formidable challenge for these models, leaving much room for the research community to improve upon.

Background: Contemporary Evaluation of Long-Text Models
While transformers (Vaswani et al., 2017) are the current go-to architecture for building state-of-the-art models in NLP, they present a computational challenge when it comes to long sequences due to the O(n^2) complexity of self-attention, where n is the sequence's length. To address this problem, a wide variety of efficient alternatives and approximations have been proposed over the past couple of years (Tay et al., 2020b; Fournier et al., 2021). Many of these novel architectures were developed concurrently, leading to somewhat of a "Wild West" when it comes to model evaluation, making cross-model comparison challenging. Roughly speaking, we can cluster the more prominent evaluation methodologies into three categories: language modeling, Long Range Arena, and summarization.
The language modeling community typically uses perplexity to measure how well models predict the next token, a practice that has been adopted by several works on efficient transformer architectures (Roy et al., 2021; Choromanski et al., 2020; Tay et al., 2020a; Peng et al., 2021). However, using perplexity to evaluate a model's long-range abilities is currently under scrutiny. A growing body of literature shows that predicting the next token is mostly a local task that does not require modeling long-range dependencies (Khandelwal et al., 2018; Sun et al., 2021), and that masking or down-weighting distant tokens can actually improve perplexity (Press et al., 2021a,b).
A more recent approach to standardizing long-sequence model evaluation is the Long Range Arena (LRA) (Tay et al., 2021). It incorporates 5 classification datasets: byte-level sentiment analysis (IMDB) and document relatedness (ACL Anthology); path-finding (Pathfinder) and image classification (CIFAR-10) over 1-dimensional pixel sequences; and executing a list of mathematical operations (ListOps). Of those, two involve visual reasoning, and one is a synthetic mathematical language (ListOps), leaving only two natural language datasets (sentiment analysis and document relatedness). The multi-modal nature of LRA makes it inappropriate as a testbed for pretrained language models, limiting its relevance for NLP. Moreover, LRA artificially inflates natural language sequences via byte tokenization, and truncates each example at 4,000 bytes, which is equivalent to fewer than 1,000 words. This exempts models from coping with the complex long-range dependencies that exist in naturally long texts.
The third practice uses summarization tasks to evaluate long-sequence models. The most popular datasets use abstracts of academic papers on arXiv and PubMed (Cohan et al., 2018) as summaries. Other summarization datasets, however, are less frequently used, biasing the evaluation towards academic domains. SCROLLS includes summarization as one of its main tasks, selecting datasets from several different domains to increase diversity.
Through this curation process, we handpick 7 datasets, and process them into a uniform text-to-text format. Table 1 provides an overview of the datasets included in SCROLLS. Figure 2 and Figure 3 show two examples from the SCROLLS datasets SummScreenFD and QuALITY, demonstrating how contextualizing and synthesizing information over long ranges of text is paramount to addressing the challenges in the benchmark.

Datasets
We survey the 7 datasets in SCROLLS, and elaborate how the original data was collected.
GovReport (Huang et al., 2021): A summarization dataset of reports addressing various national policy issues published by the Congressional Research Service (https://crsreports.congress.gov/) and the U.S. Government Accountability Office (https://www.gao.gov/), where each document is paired with an expert-written executive summary. The reports and their summaries are longer than their equivalents in other popular long-document summarization datasets; for example, GovReport's documents are approximately 1.5 and 2.5 times longer than the documents in arXiv and PubMed (Cohan et al., 2018), respectively.
SummScreenFD (Chen et al., 2021): A summarization dataset in the domain of TV shows (e.g. Friends, Game of Thrones). Given a transcript of a specific episode, the goal is to produce the episode's recap. The original dataset is divided into two complementary subsets, based on the source of its community-contributed transcripts. For SCROLLS, we use the ForeverDreaming (FD) subset, as it incorporates 88 different shows, making it a more diverse alternative to the TV MegaSite (TMS) subset, which has only 10 shows. Community-authored recaps for the ForeverDreaming transcripts were collected from English Wikipedia and TVMaze.
QMSum (Zhong et al., 2021): A query-based summarization dataset, consisting of 232 meeting transcripts from multiple domains and their corresponding summaries. The corpus covers academic group meetings at the International Computer Science Institute (Janin et al., 2003), industrial product meetings for designing a remote control (Carletta et al., 2005), and committee meetings of the Welsh and Canadian Parliaments, dealing with a variety of public policy issues. Annotators were tasked with writing queries about the broad contents of the meetings, as well as specific questions about certain topics or decisions, while ensuring that the relevant text for answering each query spans at least 200 words or 10 turns.
Qasper (Dasigi et al., 2021): A question answering dataset over NLP papers filtered from the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al., 2020). Questions were written by NLP practitioners after reading only the title and abstract of the papers, while another set of NLP practitioners annotated the answers given the entire document. Qasper contains abstractive, extractive, and [...]

Penny returns from visiting family in Nebraska, but mentions while picking up mail from Leonard that most of her relatives became sick. Sheldon, a germophobe according to Leonard, freaks out and becomes sick, becoming demanding on top of his already obnoxious personality. Familiar with Sheldon being sick, Leonard and the guys hide from him at a Planet of the Apes series marathon, leaving Penny to care for Sheldon. However, Leonard breaks his glasses in the cinema and has to retrieve his spare pair from the apartment, piloted by Howard and Raj using a laptop, an endoscope, and a Bluetooth helmet camera worn by the short-sighted Leonard. Penny intercepts him and abandons him to his fate with Sheldon. Leonard tries to escape, but runs into a wall and nearly knocks himself out. In the end, injured Leonard and sick Sheldon sit miserably on the couch.
NarrativeQA (Kočiský et al., 2018): An established question answering dataset over entire books from Project Gutenberg and movie scripts from different websites. Annotators were given summaries of the books and scripts obtained from Wikipedia, and asked to generate question-answer pairs, resulting in about 30 questions and answers for each of the 1,567 books and scripts. They were encouraged to use their own words rather than copying, and to avoid asking yes/no questions or ones about the cast. Each question was then answered by an additional annotator, providing each question with two reference answers (that may be identical).
QuALITY (Pang et al., 2021): A multiple-choice question answering dataset over stories and articles sourced from Project Gutenberg, the Open American National Corpus (Fillmore et al., 1998; Ide and Suderman, 2004), and more. Experienced writers wrote questions and distractors, and were incentivized to write answerable, unambiguous questions such that in order to correctly answer them, human annotators must read large portions of the given document. To measure the difficulty of their questions, Pang et al. conducted a speed validation process, where another set of annotators were asked to answer questions given only a short period of time to skim through the document. As a result, 50% of the questions in QuALITY are labeled as hard, i.e. the majority of the annotators in the speed validation setting chose the wrong answer.
ContractNLI (Koreeda and Manning, 2021): A natural language inference dataset in the legal domain. Given a non-disclosure agreement (NDA, the premise), the task is to predict whether a particular legal statement (the hypothesis) is entailed, not entailed (neutral), or cannot be entailed (contradiction) from the contract. The NDAs were manually picked after simple filtering from the Electronic Data Gathering, Analysis, and Retrieval system (EDGAR) and Google. The dataset contains a total of 607 contracts and 17 unique hypotheses, which were combined to produce the dataset's 10,319 examples.

Preprocessing
Data Cleansing: As part of the curation process, we examine each dataset and clean or filter examples to ensure high-quality data. In GovReport, we discard all examples where the report's length (in words) is less than twice the summary's length, or more than 1,000 times the summary's length, as well as examples where the summary exists verbatim in the report. This process removes 64 examples from the original dataset. In Qasper, we discard all papers that have fewer than 8,192 characters, removing a total of 176 questions over 63 papers, which appear to be of lower quality. In NarrativeQA, we locate markers indicating the start and end of the actual story, and use them to remove excess metadata such as licenses, HTML headers, etc.
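The GovReport filter described above can be sketched as follows. This is an illustrative reading of the rule, not the released preprocessing code: the function name is hypothetical, word counts use simple whitespace splitting, and dropping empty summaries is an added assumption.

```python
def keep_govreport_example(report: str, summary: str) -> bool:
    """Return True if a (report, summary) pair passes the length and verbatim filters."""
    # Assumption: "length in words" is approximated by whitespace tokenization.
    report_len = len(report.split())
    summary_len = len(summary.split())
    if summary_len == 0:
        return False  # assumption: an empty summary is never a valid example
    if report_len < 2 * summary_len:
        return False  # report is less than twice as long as its summary
    if report_len > 1000 * summary_len:
        return False  # report is more than 1,000 times as long as its summary
    if summary in report:
        return False  # summary appears verbatim inside the report
    return True
```

A pair is kept only when all three conditions hold, so the filter can be applied with a plain list comprehension over the raw examples.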
Unified Format: We reformulate every dataset in SCROLLS as a sequence-to-sequence task to allow for a simple unified input-output format. When a query is given in addition to the raw text (as in QMSum, Qasper, NarrativeQA, QuALITY, and ContractNLI), we prepend it to the text, using two newlines as a natural separator. For the multiple-choice dataset QuALITY, we also provide all four answer candidates as part of the query. For the summarization datasets, GovReport and SummScreenFD, we use only the original documents as input. Some datasets (Qasper and NarrativeQA) contain multiple target outputs for each input; we split them into separate instances for training and development. For test, we score each prediction with every valid answer independently, and then merge the scores of identical inputs by taking the maximum of those scores. Table 5 in Appendix A provides an example from each SCROLLS dataset.
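A minimal sketch of this text-to-text conversion is given below. The function names and the dictionary keys are hypothetical, and the exact layout of the answer candidates inside the query (lettered options, one per line) is an assumption; only the two-newline separator between query and document comes from the description above.

```python
import string
from typing import Dict, List, Optional


def build_input(document: str, query: Optional[str] = None,
                choices: Optional[List[str]] = None) -> str:
    """Build a single input string from a document and an optional query."""
    if query is None:
        # Summarization datasets: the document alone is the input.
        return document
    if choices:
        # Assumption: candidates are appended to the query as "(A) ...", "(B) ...".
        options = "\n".join(
            f"({string.ascii_uppercase[i]}) {choice}"
            for i, choice in enumerate(choices)
        )
        query = f"{query}\n{options}"
    # The query is prepended to the text, separated by two newlines.
    return f"{query}\n\n{document}"


def split_targets(source: str, targets: List[str]) -> List[Dict[str, str]]:
    """Turn one input with several reference outputs into separate instances."""
    return [{"input": source, "output": target} for target in targets]
```

For example, `build_input("doc", "q?")` yields `"q?\n\ndoc"`, while a call without a query returns the document unchanged.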

Evaluation
Each dataset is split into training, validation, and test sets based on the original dataset splits. In SCROLLS, test set outputs are kept private, and only the inputs are publicly available. When evaluating a model, users must submit their model's outputs for all test sets via the SCROLLS website. Once a model is submitted, we compute the average performance metric across all datasets to provide the submission with a single aggregate SCROLLS score. We employ three different evaluation metrics across SCROLLS datasets: ROUGE for summarization tasks (GovReport, SummScreenFD, and QMSum), unigram overlap (F1) for question answering (Qasper and NarrativeQA), and exact match (EM) for multiple-choice (QuALITY) and classification (ContractNLI) tasks. The official evaluation script is available online.
ROUGE: We use three flavors of ROUGE (Lin, 2004) to measure the overlap between the system-generated output and the reference: unigram overlap (ROUGE-1), bigram overlap (ROUGE-2), and the longest overlapping subsequence (ROUGE-L). Both system output and reference are normalized by lowercasing and converting all non-alphanumeric characters to whitespace, followed by whitespace tokenization. We compute the geometric mean of the three scores (ROUGE-1/2/L) to produce a single score per dataset, which is used to calculate the final SCROLLS score.
F1: Similar to ROUGE-1, the F1 metric calculates unigram overlap. The key difference is that both reference and system output strings are normalized slightly differently; in addition to lowercasing and punctuation removal, stopwords are also discarded, following the practice of SQuAD (Rajpurkar et al., 2016) and other question-answering datasets (Fisch et al., 2019). Both Qasper and NarrativeQA contain questions with more than one reference answer; for each such example, we take the maximal F1 score over all of its reference answers.
EM: Exact match normalizes the output strings using the same procedure as F1 (lowercasing, removing punctuation and stopwords, and normalizing whitespace), and then checks whether the two normalized strings are identical. For QuALITY, we calculate EM over the entire test set (EM-T), and also EM over its subset of hard questions (EM-H), as defined in the original dataset. For computing the final SCROLLS score, however, we only use the EM value calculated over the full test set (EM-T).
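The F1, EM, and score-aggregation steps can be illustrated with the sketch below. It is not the official evaluation script: all function names are hypothetical, the stopword list is assumed to be the English articles removed by SQuAD-style normalization, and ROUGE itself is taken as given (only the geometric mean over its three variants is shown).

```python
import re
import string
from collections import Counter
from typing import Callable, List

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    # Assumption: "stopwords" means the articles a/an/the, as in SQuAD-style scripts.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def max_over_references(metric: Callable[[str, str], float],
                        prediction: str, references: List[str]) -> float:
    """Score a prediction against every reference and keep the best score."""
    return max(metric(prediction, ref) for ref in references)


def rouge_geometric_mean(rouge_1: float, rouge_2: float, rouge_l: float) -> float:
    """Single per-dataset summarization score from ROUGE-1/2/L."""
    return (rouge_1 * rouge_2 * rouge_l) ** (1.0 / 3.0)


def scrolls_score(per_dataset_scores: List[float]) -> float:
    """Aggregate score: the plain average of the per-dataset scores."""
    return sum(per_dataset_scores) / len(per_dataset_scores)
```

Taking the maximum over references means a prediction is never penalized for matching only one of several valid answers.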

Quantitative Analysis
Length alone is not enough to make SCROLLS a challenging benchmark. Here, we provide a quantitative analysis that suggests that producing the correct output for a SCROLLS task typically requires fusing different parts of the input that are often hundreds and even thousands of words apart. This analysis complements the qualitative inspection of examples from SCROLLS, as shown in Figure 2 and Figure 3, and further discussed in Appendix E.
Methodology: Each example in SCROLLS consists of a textual input and output. Given a specific input-output pair, we measure the example's spread by computing the standard deviation between the locations of output bigrams in the input. Specifically, we represent the output string as a set of bigrams, and locate the first occurrence of each bigram in the input (if it exists); we then compute the standard deviation between these locations (where a bigram is represented by the position of its first word in the input). Now that we have an example-level measure of spread, we can plot an entire dataset's spread on a histogram, and compare different datasets.
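The spread measure can be sketched in a few lines. The function names are hypothetical, and several details are assumptions rather than the paper's exact procedure: tokens come from lowercased whitespace splitting, the population standard deviation is used, and examples with fewer than two located bigrams return `None`.

```python
from statistics import pstdev
from typing import List, Optional, Tuple


def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """All consecutive token pairs, in order."""
    return list(zip(tokens, tokens[1:]))


def spread(input_text: str, output_text: str) -> Optional[float]:
    """Standard deviation of the first-occurrence positions of output bigrams."""
    # Assumption: lowercased whitespace tokens stand in for "words".
    input_tokens = input_text.lower().split()
    output_tokens = output_text.lower().split()

    # Position of the first word of each bigram's first occurrence in the input.
    first_position = {}
    for index, bigram in enumerate(bigrams(input_tokens)):
        first_position.setdefault(bigram, index)

    # The output is treated as a set of bigrams; unseen bigrams are skipped.
    positions = [first_position[b] for b in set(bigrams(output_tokens))
                 if b in first_position]
    if len(positions) < 2:
        return None  # assumption: spread is undefined with fewer than two matches
    return pstdev(positions)
```

As a small worked case, for the input `a b c d e f g h` and the output `a b g h`, the located bigrams sit at positions 0 and 6, so the spread is 3.0.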
Summarization Datasets: Figure 4a compares the three summarization datasets in SCROLLS to the canonical CNN/DM summarization dataset (Hermann et al., 2015), as well as arXiv (Cohan et al., 2018), which has been used to evaluate long-sequence models. We observe that the reference bigrams are spread out across much larger distances in SCROLLS than in CNN/DM, and across distances 1.5 to 2 times larger than in arXiv on average.
Figure 4b compares the remaining four datasets in SCROLLS, which typically have shorter outputs, to the popular SQuAD (Rajpurkar et al., 2016) and Natural Questions (Kwiatkowski et al., 2019) datasets. While the answer bigrams in SQuAD and Natural Questions typically spread across distances of under 5 words, the output bigrams in SCROLLS datasets are usually separated by hundreds of words. NarrativeQA seems to contain many examples where the answer bigrams cluster close together, but also a significant subset of examples where the answer's bigrams are dispersed across huge distances.

Experiments
We conduct experiments to evaluate the ability of mainstream models to handle the various long-text challenges presented by SCROLLS. Our code is based on the Transformers library (Wolf et al., 2020), and is available online.

Baselines
We finetune two pretrained transformer variants as baselines, and add naive heuristic baselines to establish the floor performance on each task. Hyperparameters are detailed in Appendix D.
BART: As a standard transformer baseline, we use the pretrained BART-base model (Lewis et al., 2020). BART is a transformer encoder-decoder pretrained by reconstructing noised texts, which achieved state-of-the-art results on several summarization datasets when released. BART was pretrained on sequences of up to 1,024 tokens; we therefore truncate all inputs by retaining only their 1,024-token prefix. To examine the effect of available input length, we also consider truncating BART's inputs at 256 and 512 tokens.
Longformer Encoder-Decoder (LED): We experiment with LED-base, the encoder-decoder version of the efficient transformer architecture Longformer (Beltagy et al., 2020). Longformer avoids computing quadratic-complexity attention via sliding-window attention, where each token only attends to a constant number of nearby tokens, in addition to a few tokens that compute global attention over the entire input. LED is initialized with BART's parameters, without further pretraining. In our experiments, we use a sliding window of 1,024 tokens, and restrict the total input length to 16,384 tokens via truncation, following Beltagy et al. We also experiment with maximum sequence lengths of 1,024 and 4,096 tokens. While the original work on LED selects the globally-attending tokens on a per-task basis, we follow their summarization setting throughout all tasks, which enables global attention only for the first token.
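To make the attention pattern concrete, the sketch below builds a dense boolean mask for a toy sequence. It is an illustration only and not how LED is implemented (efficient implementations never materialize the full n-by-n matrix); the function names are hypothetical, and treating a window of size w as w/2 tokens on each side is an assumption.

```python
from typing import List, Sequence


def sliding_window_mask(n: int, window: int,
                        global_positions: Sequence[int] = (0,)) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j."""
    half = window // 2  # assumption: the window is split evenly around each token
    # Local attention: each token sees neighbours within `half` positions.
    mask = [[abs(i - j) <= half for j in range(n)] for i in range(n)]
    for g in global_positions:
        for k in range(n):
            mask[g][k] = True  # the global token attends to every position
            mask[k][g] = True  # every position attends to the global token
    return mask


def attended_pairs(mask: List[List[bool]]) -> int:
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)
```

With 6 tokens, a window of 2, and global attention on the first token only, 24 of the 36 possible pairs are attended; the local part grows linearly with sequence length rather than quadratically.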
Heuristic Baselines: We use simple heuristics to find the lower bound of performance on each dataset. For most datasets, we use the fixed-length prefix heuristic, akin to the LEAD baseline in the summarization literature. Specifically, we compute the average output-input length ratio ρ over the training set (in characters), and then produce the first ρ · n characters from the given input at inference time (where n is the input's length). For QuALITY, we use the majority class (which is just above one quarter). For ContractNLI, we use the per-hypothesis majority class, as the same 17 hypotheses are shared across all documents.
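The fixed-length prefix heuristic can be sketched as below. The function names are hypothetical, and two details are assumptions: ρ is computed as the mean of per-example ratios (rather than a ratio of totals), and ρ · n is rounded down to an integer number of characters.

```python
from typing import List, Tuple


def length_ratio(train_pairs: List[Tuple[str, str]]) -> float:
    """Average output/input length ratio (in characters) over training pairs."""
    # Assumption: average the per-example ratios; empty inputs are skipped.
    ratios = [len(output) / len(source)
              for source, output in train_pairs if len(source) > 0]
    return sum(ratios) / len(ratios)


def prefix_baseline(input_text: str, rho: float) -> str:
    """Return the first rho * n characters of the input as the prediction."""
    # Assumption: the prefix length is truncated to an integer.
    return input_text[: int(rho * len(input_text))]
```

For instance, training pairs with ratios 0.25 and 0.5 give ρ = 0.375, so an 8-character input yields its first 3 characters as the prediction.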

Results
Table 2 shows the baselines' performance on SCROLLS. A few trends are apparent:
More Context Improves Performance: We experiment with three context lengths for each model. As the model receives more context, its average SCROLLS score increases. For BART, increasing the input length from 256 tokens to 1,024 increases performance by 2.66 points, while LED grows by 2.1 points when enlarging its maximal sequence length from 1,024 tokens to 16,384. This trend is relatively consistent across datasets for BART, but less so for LED (e.g., QMSum and ContractNLI). Overall, our experiments highlight the importance of measuring not only whether an architecture can efficiently process long sequences, but also whether it can effectively model their semantics, which is precisely what SCROLLS is designed to do.

BART versus LED
How Far is SCROLLS from being Solved? The heuristic baselines set a lower bound average score of 19.35, which the model baselines are able to improve upon by 7 to 10 points. While it is difficult to establish an accurate human performance ceiling on SCROLLS, especially when considering the summarization datasets, we do have some indicators that it is probably much higher than the current baselines. Dasigi et al. (2021) study a subset of Qasper that has multiple annotated answers, and find their overlap to be 60.9% F1, more than double our best baseline. Likewise, human agreement on QuALITY was measured at 93.5% EM (Pang et al., 2021). We also compute the inter-annotator agreement (F1) on NarrativeQA's test set (where each question has two answers), arriving at around 58.7% F1, compared to our best baseline of 18.5% F1. Overall, it seems that contemporary off-the-shelf models struggle with these tasks, challenging future work to make progress on SCROLLS.

Conclusion
We propose a new benchmark that places the spotlight on naturally long texts and their intricacies. SCROLLS fills a current gap around evaluating efficient transformer architectures and their alternatives on natural language tasks, and at the same time provides a testing ground for new pretraining schemes that target long language sequences. We hope that SCROLLS inspires the NLP community to go beyond single sentences and paragraphs, and meet the challenges of processing and reasoning over longer discourses.

Limitations
The main limitation of SCROLLS is the evaluation of long output texts, specifically in summarization. Since ROUGE only accounts for n-gram overlaps, it might undervalue paraphrases of the reference summary that contain the same semantic content. Establishing unbiased, automated metrics for long generations that correlate well with human judgments is an emerging field of research, and we may indeed decide to replace or complement ROUGE with model-based evaluation in the future.
A second limitation is that SCROLLS is monolingual. Model evaluation over languages other than English has major significance, affecting the usage of language processing technology in applications worldwide. SCROLLS is limited in that sense, but takes an initial step in standardizing evaluation over long text in general. A natural future direction is establishing benchmarks focusing on other languages as well.

D Hyperparameters
We finetune each of the baseline models on every dataset separately using AdamW (Loshchilov and Hutter, 2019) with β = (0.9, 0.98), ε = 1e-6, mixed precision (fp16), and gradient checkpointing. We achieve an effective batch size of 131,072 (2^17) tokens by processing 16,384 tokens per GPU across 8 NVIDIA V100 (32GB) GPUs either in parallel or via gradient accumulation. The summarization datasets are trained for 10 epochs, while Qasper, QuALITY, and ContractNLI are trained for 20; NarrativeQA (the largest dataset) is trained for 2 epochs. We tune the maximum learning rate over each validation set, selecting from 6 possible values: 1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4. The learning rate is warmed up from zero during the first 10% of the learning schedule, and then linearly decays back to zero throughout the remaining 90%. We also apply 0.1 dropout throughout each network. During inference, we generate outputs using greedy decoding.
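The batch-size arithmetic and the learning-rate schedule can be checked with a short sketch. The function and constant names are hypothetical, and expressing the schedule per optimizer step (rather than per epoch or per token) is an assumption.

```python
TOKENS_PER_GPU = 16_384
NUM_GPUS = 8
# 16,384 tokens on each of 8 GPUs gives 131,072 = 2^17 tokens per update.
EFFECTIVE_BATCH_TOKENS = TOKENS_PER_GPU * NUM_GPUS


def learning_rate(step: int, total_steps: int, peak_lr: float,
                  warmup_fraction: float = 0.1) -> float:
    """Linear warmup from zero to peak_lr, then linear decay back to zero."""
    warmup_steps = int(warmup_fraction * total_steps)
    if step < warmup_steps:
        # Warmup phase: rise linearly over the first 10% of the schedule.
        return peak_lr * step / warmup_steps
    # Decay phase: fall linearly to zero over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)
```

With 100 total steps, the rate peaks at step 10 and is back at half the peak by step 55, midway through the decay phase.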

E Qualitative Analysis
We manually analyze examples from each of the datasets in the benchmark, demonstrating cases that require contextualizing and synthesizing information over long ranges of text.

... (web) to be synonymous; they are not. Rather, the web is one portion of the Internet, and a medium through which information may be accessed. In conceptualizing the web, some may view it as consisting solely of the websites accessible through a traditional search engine such as Google. However, this content-known as the "Surface Web"-is only one portion of the web. The Deep Web refers to "a class of content on the Internet that, for various technical reasons, is not indexed by search engines," and thus would not be accessible through a traditional search engine. ...[3,791 words]... the FBI has put resources into developing malware that can compromise servers in an attempt to identify certain users of Tor. Since 2002, the FBI has reportedly used a "computer and internet protocol address verifier" (CIPAV) to "identify suspects who are disguising their location using proxy servers or anonymity services, like Tor." It has been using this program to target "hackers, online sexual predators, extortionists, and others." Law enforcement has also reportedly been working with companies to develop additional technologies to investigate crimes and identify victims on the Dark Web. In addition to developing technology to infiltrate and deanonymize services such as Tor, law enforcement may rely upon more traditional crime fighting techniques; some have suggested that law enforcement can still rely upon mistakes by criminals or flaws in technology to target nefarious actors. For instance, in 2013 the FBI took down the Silk Road, then the "cyberunderworld's largest black market." Reportedly, "missteps" by the site's operator led to its demise; ...[979 words]...

Figure 5: An example from GovReport, a dataset of government reports and their expert-written summaries. This example shows the spread of the relevant information in the document, exemplified by the first and last sentences of the summary.

What did the group discuss about budget balancing?
--Answer --
The use of the LCD screen and the advanced chip cost the team half of the expenditure. Due to the budget limit, the team had to abandon some other designs such as the rubber material and the double-curved structure. The USB connection was not feasible for now as well. For the location function, a transmitter, a receiver and speakers could be incorporated on a TV instead

We compare classification and regression approaches and show that classification produces better results than regression but the quality of the results depends on the approach followed to annotate the data labels. ...[1,006 words]... The bottom section of Table TABREF26 shows the results of several variants of the neural architecture. The table includes a neural regressor (NNR) and a neural classifier (NNC).

Whose initials are on the bottom of the burnt letter to Sir Charles?
--Answer --
Laura Lyons
--Story --
...[35,871 words]... "Well, Sir Henry, your uncle had a letter that morning. He had usually a great many letters, for he was a public man and well known for his kind heart, so that everyone who was in trouble was glad to turn to him. But that morning, as it chanced, there was only this one letter, so I took the more notice of it. It was from Coombe Tracey, and it was addressed in a woman's hand." "Well?" "Well, sir, I thought no more of the matter, and never would have done had it not been for my wife. Only a few weeks ago she was cleaning out Sir Charles's study-it had never been touched since his death-and she found the ashes of a burned letter in the back of the grate. The greater part of it was charred to pieces, but one little slip, the end of a page, hung together, and the writing could still be read, though it was gray on a black ground. It seemed to us to be a postscript at the end of the letter and it said: 'Please, please, as you are a gentleman, burn this letter, and be at the gate by ten o clock. Beneath it were signed the initials L. L." ...

Figure 9: An example from ContractNLI, a natural language inference dataset over non-disclosure agreements (NDAs). Here, the challenge of finding the evidence, residing in the middle of a long document, is further amplified by the hypothesis being only implicitly contradicted.

Figure 1: The distribution of words per input in SCROLLS datasets (blue), alongside frequently-used NLP datasets (pink). Dashed vertical lines indicate the maximal sequence length (in tokens) of BERT (Devlin et al., 2019) and GPT3 (Brown et al., 2020).
...[1,032 words]... Howard: Hello. Sheldon: Howard, I'm sick. ...[40 words]... Howard: It's my own fault, I forgot the protocol we put in place after the great ear infection of '06. Leonard: You call Koothrappali, we need to find a place to lay low for the next eighteen to twenty four hours. Howard: Stand by. Ma, can my friends come over? Howard's Mother: I just had the carpets steamed. Howard: That's a negatory. But there's a Planet of the Apes marathon at the New Art today. Leonard: Five movies, two hours apiece. It's a start. ...[660 words]... Sheldon: Based on what happened next, I assume it means "would you like an enema?" Penny: Okay, sweetie, I'll take care of you, what do you need? ...[766 words]... Penny: You deliberately stuck me with Sheldon. Leonard: Well, I had to, you see what he's like. ...[142 words]...

Figure 2: An example from the SummScreenFD summarization dataset, where the task is to generate the recap (top paragraph) given the episode's script. In this example, the information required to compose the third sentence in the recap (highlighted) is scattered across several snippets throughout the transcript.
The text says "The expert frowned horribly." What makes the expert's smile so horrible?
(A) The frown indicates that he's close to detecting Korvin's true motivations.
(B) The frown indicates that he knows that Korvin switched the wires on the lie detector.
(C) The frown is a signal to the Ruler that Korvin is lying.
(D) The frown is physically horrible because the Tr'en have fifty-eight, pointed teeth.
--Story --
...[607 words]... It was a ritual, Korvin had learned. "You are of the Tr'en," he replied. The green being nodded. "I am Didyak of the Tr'en," he said. ...[257 words]... Didyak beamed at him. The sight was remarkably unpleasant, involving as it did the disclosure of the Tr'en fifty-eight teeth, mostly pointed. Korvin stared back impassively. "I have been ordered to come to you," Didyak said, "by the Ruler. The Ruler wishes to talk with you." ...[1,366 words]... "They can be treated mathematically," one of the experts, a small emerald-green being, told Korvin thinly. "Of course, you would not understand the mathematics." ...[33 words]... The expert frowned horribly, showing all of his teeth. Korvin did his best not to react. "Your plan is a failure," the expert said, "and you call this a good thing." ...[1,808 words]...

Figure 3: An example from the QuALITY dataset, where the task is to answer multiple-choice questions about a given story or document. In this example, answering the question correctly requires reasoning over four different snippets that are separated by long token sequences.

Figure 4: The spread of reference-text bigrams in the input texts, measured by the standard deviation of the position of each bigram's first occurrence in the input document. SCROLLS datasets are shown in blue, other popular datasets in pink.
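The spread statistic described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the whitespace tokenization, the lowercasing, the use of absolute token positions, the population standard deviation, and the function name `bigram_spread` are all assumptions made here.

```python
from statistics import pstdev


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def bigram_spread(input_text, reference_text):
    """Standard deviation of the first-occurrence positions (in the input)
    of the reference bigrams that also appear in the input.

    Assumptions (hypothetical, not from the paper): whitespace tokens,
    lowercased text, absolute positions, population standard deviation.
    """
    input_tokens = input_text.lower().split()
    reference_tokens = reference_text.lower().split()

    # Record where each input bigram is first seen.
    first_position = {}
    for index, bigram in enumerate(bigrams(input_tokens)):
        first_position.setdefault(bigram, index)

    # Keep only reference bigrams that occur somewhere in the input.
    positions = [
        first_position[bigram]
        for bigram in set(bigrams(reference_tokens))
        if bigram in first_position
    ]

    # With fewer than two matches there is no spread to measure.
    if len(positions) < 2:
        return 0.0
    return pstdev(positions)
```

Under these assumptions, a reference whose bigrams all come from one region of the input yields a small value, while a reference that draws on distant parts of the input yields a large one.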

Figure 6: An example from QMSum, a query-based summarization dataset over meeting transcripts. Information relevant for generating the last two sentences in the answer is spread across different locations in the transcript.

Figure 7: An example from the Qasper dataset, which includes question answering over scientific papers. The evidence for the first part of the reference answer appears in the introduction, while the indication that neural models were also experimented with appears further in the document, in a description of the results table.

[861 words]... but among the farmers or gentry there is no one whose initials are those. Wait a bit though," he added after a pause. "There is Laura Lyons-her initials are L. L.-but she lives in Coombe Tracey." ...[1,983 words]... "Did you ever write to Sir Charles asking him to meet you?" I continued. Mrs. Lyons flushed with anger again. "Really, sir, this is a very extraordinary question." "I am sorry, madam, but I must repeat it." "Then I answer, certainly not." ...[97 words]... "You do Sir Charles an injustice. He did burn the letter. But sometimes a letter may be legible even when burned. You acknowledge now that you wrote it?" "Yes, I did write it," she cried, pouring out her soul in a torrent of words. "I did write it. Why should I deny it? I have no reason to be ashamed of it. I wished him to help me. ...[19,996 words]...

Figure 8: An example from NarrativeQA, where the task is to answer questions about books and movie scripts. In this question about The Hound of the Baskervilles, the answer is first discussed in several places without certainty, and even the final reveal is preceded by an explicit distractor.

Figure 9: An example from ContractNLI, a natural language inference dataset over non-disclosure agreements (NDAs). Here, the challenge of finding the evidence, residing in the middle of a long document, is further amplified by the hypothesis being only implicitly contradicted.

Table 1: An overview of the datasets in SCROLLS and their statistics. Summ refers to summarization, QB-Summ means query-based summarization, and MC-QA abbreviates multiple-choice question answering.

Table 2: Baseline results on SCROLLS, using naive heuristics, BART, and Longformer Encoder-Decoder (LED), and various input length limits. The final SCROLLS score (Avg) is computed by averaging over each dataset's overall performance score. For QuALITY (QALT), we use the EM score calculated over the full test set (EM-T), without up-weighting the performance on the hard subset (EM-H). For datasets evaluated with ROUGE, we aggregate the different ROUGE scores via geometric mean to produce a single score per dataset, which is then used when calculating the final average SCROLLS score.
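The aggregation described in the Table 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the official evaluation script: the function names, the input layout (a dictionary mapping each dataset to either a single score or a list of ROUGE scores), and the example numbers are all hypothetical.

```python
from math import prod


def geometric_mean(scores):
    """Geometric mean of a non-empty list of non-negative scores."""
    return prod(scores) ** (1.0 / len(scores))


def scrolls_average(per_dataset):
    """Average the per-dataset overall scores.

    `per_dataset` (hypothetical layout) maps a dataset name to either a
    single score (e.g. F1 or EM) or a list of ROUGE scores, which is first
    collapsed into one number via the geometric mean.
    """
    overall = []
    for score in per_dataset.values():
        if isinstance(score, (list, tuple)):
            overall.append(geometric_mean(score))
        else:
            overall.append(float(score))
    return sum(overall) / len(overall)


# Made-up numbers purely for illustration; these are not results from the paper.
example = {"some_rouge_dataset": [2.0, 8.0], "some_em_dataset": 6.0}
print(scrolls_average(example))
```

Collapsing the ROUGE variants first means that each dataset contributes exactly one number to the final average, regardless of how many metrics it reports.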

Table 3: The number of examples in each train, validation, and test set.

-- Meeting Transcript -- ...[1,813 words]... Even then as well , um there was no criteria technically defined for a joystick so I've used what I think's appropriate . With any luck that won't mean that we've incurred more cost than we can actually afford to . It blows a lot of our really good ideas kind of slightly to one side , for example the possibility of having a U_S_B_ connection is definitely not viable now . Um . ...[656 words]... Marketing: We don't even have uh speakers here . The {disfmarker} like uh we uh {disfmarker} what about speakers and transmitters and stuff like that ? Have we factored that in ? Industrial Designer: Mm . Project Manager: Uh no , we haven't , not {disfmarker} Marketing: Transmitter , receiver , speakers . Plus the extra device itself that's gonna be on a T_V_ . ...[4,651 words]...