DaNewsroom: A Large-scale Danish Summarisation Dataset

Daniel Varab, Natalie Schluter


Abstract
Dataset development for automatic summarisation systems is notoriously English-oriented. In this paper we present the first large-scale non-English dataset specifically curated for automatic summarisation. The document-summary pairs are news articles and manually written summaries in Danish. No previous work has established a Danish summarisation dataset, and there is no published work on the automatic summarisation of Danish. We therefore provide the first automatic summarisation dataset for the Danish language (large-scale or otherwise). To support the comparison of future automatic summarisation systems for Danish, we report the performance on this dataset of strong, well-established unsupervised baseline systems, together with an oracle extractive summariser; this is the first account of automatic summarisation system performance for Danish. Finally, we make all code for automatically acquiring the data freely available and make explicit how this technology can easily be adapted to acquire automatic summarisation datasets for further languages.
Anthology ID:
2020.lrec-1.831
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6731–6739
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.831
Cite (ACL):
Daniel Varab and Natalie Schluter. 2020. DaNewsroom: A Large-scale Danish Summarisation Dataset. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6731–6739, Marseille, France. European Language Resources Association.
Cite (Informal):
DaNewsroom: A Large-scale Danish Summarisation Dataset (Varab & Schluter, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.831.pdf
Data
DaNewsroom