Summarization Beyond News: The Automatically Acquired Fandom Corpora

Benjamin Hättasch, Nadja Geisler, Christian M. Meyer, Carsten Binnig


Abstract
Large state-of-the-art corpora for training neural networks to create abstractive summaries are mostly limited to the news genre, as it is expensive to acquire human-written summaries for other types of text at a large scale. In this paper, we present a novel automatic corpus construction approach to tackle this issue, as well as three new large open-licensed summarization corpora based on our approach that can be used for training abstractive summarization models. Our constructed corpora contain fictional narratives, descriptive texts, and summaries about movies, television, and book series from different domains. All sources use a Creative Commons (CC) license, hence we can provide the corpora for download. In addition, we provide a ready-to-use framework that implements our automatic construction approach to create custom corpora with desired parameters, such as the length of the target summary and the number of source documents from which to create the summary. The main idea behind our automatic construction approach is to use existing large text collections (e.g., thematic wikis), automatically classify whether the texts can be used as (query-focused) multi-document summaries, and align them with potential source texts. As a final contribution, we show the usefulness of our automatic construction approach by running state-of-the-art summarizers on the corpora and through a manual evaluation with human annotators.
Anthology ID:
2020.lrec-1.827
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6700–6708
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.827
Cite (ACL):
Benjamin Hättasch, Nadja Geisler, Christian M. Meyer, and Carsten Binnig. 2020. Summarization Beyond News: The Automatically Acquired Fandom Corpora. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6700–6708, Marseille, France. European Language Resources Association.
Cite (Informal):
Summarization Beyond News: The Automatically Acquired Fandom Corpora (Hättasch et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.827.pdf