BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization
Wojciech Kryscinski | Nazneen Rajani | Divyansh Agarwal | Caiming Xiong | Dragomir Radev
Findings of the Association for Computational Linguistics: EMNLP 2022
The majority of existing text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future text summarization systems. We address these issues by introducing BOOKSUM, a collection of datasets for long-form narrative summarization. Our dataset covers documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human-written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of our dataset pose a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures. To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset.
CREATIVESUMM: Shared Task on Automatic Summarization for Creative Writing
Divyansh Agarwal | Alexander R. Fabbri | Simeng Han | Wojciech Kryscinski | Faisal Ladhak | Bryan Li | Kathleen McKeown | Dragomir Radev | Tianyi Zhang | Sam Wiseman
Proceedings of The Workshop on Automatic Summarization for Creative Writing
This paper introduces the shared task of summarizing documents in several creative domains, namely literary texts, movie scripts, and television scripts. Summarizing these creative documents requires making complex literary interpretations, as well as understanding non-trivial temporal dependencies in texts containing varied styles of plot development and narrative structure. This poses unique challenges and remains underexplored for text summarization systems. In this shared task, we introduce four sub-tasks and their corresponding datasets, focusing on summarizing books, movie scripts, primetime television scripts, and daytime soap opera scripts. We detail the process of curating these datasets for the task, as well as the metrics used for the evaluation of the submissions. As part of the CREATIVESUMM workshop at COLING 2022, the shared task attracted 18 submissions in total. We discuss the submissions and the baselines for each sub-task in this paper, along with directions for facilitating future work.
- Wojciech Kryściński 2
- Dragomir Radev 2
- Nazneen Rajani 1
- Caiming Xiong 1
- Alexander Richard Fabbri 1