Þorsteinn Björnsson


2021

pdf bib
IceSum: An Icelandic Text Summarization Corpus
Jón Daðason | Hrafn Loftsson | Salome Sigurðardóttir | Þorsteinn Björnsson
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

Automatic Text Summarization (ATS) is the task of generating concise and fluent summaries from one or more documents. In this paper, we present IceSum, the first Icelandic corpus annotated with human-generated summaries. IceSum consists of 1,000 online news articles and their extractive summaries. We train and evaluate several neural network-based models on this dataset, comparing them against a selection of baseline methods. We find that an encoder-decoder model with a sequence-to-sequence based extractor obtains the best results, outperforming all baseline methods. Furthermore, we evaluate how the size of the training corpus affects the quality of the generated summaries. We release the corpus and the models with an open license.