SCROLLS: Standardized CompaRison Over Long Language Sequences
Uri Shaham | Elad Segal | Maor Ivgi | Avia Efrat | Ori Yoran | Adi Haviv | Ankit Gupta | Wenhan Xiong | Mor Geva | Jonathan Berant | Omer Levy
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets, and handpick ones where the text is naturally long, while prioritizing tasks that involve synthesizing information across the input. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. Initial baselines, including Longformer Encoder-Decoder, indicate that there is ample room for improvement on SCROLLS. We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
Cryptonite: A Cryptic Crossword Benchmark for Extreme Ambiguity in Language
Avia Efrat | Uri Shaham | Dan Kilman | Omer Levy
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Current NLP datasets targeting ambiguity can be solved by a native speaker with relative ease. We present Cryptonite, a large-scale dataset based on cryptic crosswords, which is both linguistically complex and naturally sourced. Each example in Cryptonite is a cryptic clue, a short phrase or sentence with a misleading surface reading, whose solving requires disambiguating semantic, syntactic, and phonetic wordplays, as well as world knowledge. Cryptic clues pose a challenge even for experienced solvers, though top-tier experts can solve them with almost 100% accuracy. Cryptonite is a challenging task for current models; fine-tuning T5-Large on 470k cryptic clues achieves only 7.6% accuracy, on par with the accuracy of a rule-based clue solver (8.6%).
A Simple and Effective Model for Answering Multi-span Questions
Elad Segal | Avia Efrat | Mor Shoham | Amir Globerson | Jonathan Berant
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Models for reading comprehension (RC) commonly restrict their output space to the set of all single contiguous spans from the input, in order to alleviate the learning problem and avoid the need for a model that generates text explicitly. However, forcing an answer to be a single span can be restrictive, and some recent datasets also include multi-span questions, i.e., questions whose answer is a set of non-contiguous spans in the text. Naturally, models that return single spans cannot answer these questions. In this work, we propose a simple architecture for answering multi-span questions by casting the task as a sequence tagging problem, namely, predicting for each input token whether it should be part of the output or not. Our model substantially improves performance on span extraction questions from DROP and Quoref by 9.9 and 5.5 EM points respectively.
- Uri Shaham 2
- Omer Levy 2
- Elad Segal 2
- Jonathan Berant 2
- Dan Kilman 1
- show all...