SumTitles: a Summarization Dataset with Low Extractiveness
Valentin Malykh | Konstantin Chernis | Ekaterina Artemova | Irina Piontkovskaya
Proceedings of the 28th International Conference on Computational Linguistics
Existing dialogue summarization corpora are highly extractive. We introduce a methodology for evaluating dataset extractiveness and present a new low-extractive corpus of movie dialogues for abstractive text summarization, along with a baseline evaluation. The corpus contains 153k dialogues and consists of three parts: 1) automatically aligned subtitles, 2) automatically aligned scenes from scripts, and 3) manually aligned scenes from scripts. We also present the alignment algorithm that we use to construct the corpus.
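
The abstract does not specify how extractiveness is measured. As a rough illustration only, one commonly used proxy is the share of summary n-grams that never occur in the source text: a value near 0 suggests a highly extractive summary, a value near 1 a highly abstractive one. The sketch below assumes whitespace tokenization and lowercasing. The names `ngrams` and `novel_ngram_ratio` are hypothetical, and this is not necessarily the measure the authors use.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(source, summary, n=2):
    """Fraction of summary n-grams that do not appear in the source.

    Assumption: simple lowercased whitespace tokenization. A real
    pipeline would likely use a proper tokenizer.
    """
    source_ngrams = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        # Summary is shorter than n tokens: no n-grams to judge.
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return novel / len(summary_ngrams)


if __name__ == "__main__":
    dialogue = "the cat sat on the mat"
    print(novel_ngram_ratio(dialogue, "the cat sat"))          # fully copied bigrams
    print(novel_ngram_ratio(dialogue, "a dog barked loudly"))  # no shared bigrams
```

Averaging such a ratio over all dialogue and summary pairs gives one corpus-level number, which makes it possible to compare corpora on how much their summaries copy from the source.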