IceSum: An Icelandic Text Summarization Corpus

Jón Daðason, Hrafn Loftsson, Salome Sigurðardóttir, Þorsteinn Björnsson


Abstract
Automatic Text Summarization (ATS) is the task of generating concise and fluent summaries from one or more documents. In this paper, we present IceSum, the first Icelandic corpus annotated with human-generated summaries. IceSum consists of 1,000 online news articles and their extractive summaries. We train and evaluate several neural network-based models on this dataset, comparing them against a selection of baseline methods. We find that an encoder-decoder model with a sequence-to-sequence-based extractor obtains the best results, outperforming all baseline methods. Furthermore, we evaluate how the size of the training corpus affects the quality of the generated summaries. We release the corpus and the models under an open license.
Anthology ID:
2021.naacl-srw.2
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop
Month:
June
Year:
2021
Address:
Online
Editors:
Esin Durmus, Vivek Gupta, Nelson Liu, Nanyun Peng, Yu Su
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
9–14
URL:
https://aclanthology.org/2021.naacl-srw.2
DOI:
10.18653/v1/2021.naacl-srw.2
Cite (ACL):
Jón Daðason, Hrafn Loftsson, Salome Sigurðardóttir, and Þorsteinn Björnsson. 2021. IceSum: An Icelandic Text Summarization Corpus. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 9–14, Online. Association for Computational Linguistics.
Cite (Informal):
IceSum: An Icelandic Text Summarization Corpus (Daðason et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-srw.2.pdf
Video:
https://aclanthology.org/2021.naacl-srw.2.mp4