SumTitles: a Summarization Dataset with Low Extractiveness

Valentin Malykh, Konstantin Chernis, Ekaterina Artemova, Irina Piontkovskaya


Abstract
The existing dialogue summarization corpora are significantly extractive. We introduce a methodology for dataset extractiveness evaluation and present a new low-extractive corpus of movie dialogues for abstractive text summarization along with baseline evaluation. The corpus contains 153k dialogues and consists of three parts: 1) automatically aligned subtitles, 2) automatically aligned scenes from scripts, and 3) manually aligned scenes from scripts. We also present an alignment algorithm which we use to construct the corpus.
Anthology ID: 2020.coling-main.503
Volume: Proceedings of the 28th International Conference on Computational Linguistics
Month: December
Year: 2020
Address: Barcelona, Spain (Online)
Venue: COLING
Publisher: International Committee on Computational Linguistics
Pages: 5718–5730
URL: https://aclanthology.org/2020.coling-main.503
DOI: 10.18653/v1/2020.coling-main.503
Cite (ACL): Valentin Malykh, Konstantin Chernis, Ekaterina Artemova, and Irina Piontkovskaya. 2020. SumTitles: a Summarization Dataset with Low Extractiveness. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5718–5730, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal): SumTitles: a Summarization Dataset with Low Extractiveness (Malykh et al., COLING 2020)
PDF: https://aclanthology.org/2020.coling-main.503.pdf
Data: WikiHow