MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization

Chenguang Zhu, Yang Liu, Jie Mei, Michael Zeng


Abstract
This paper introduces MediaSum, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries. To create this dataset, we collect interview transcripts from NPR and CNN and employ the overview and topic descriptions as summaries. Compared with existing public corpora for dialogue summarization, our dataset is an order of magnitude larger and contains complex multi-party conversations from multiple domains. We conduct statistical analysis to demonstrate the unique positional bias exhibited in the transcripts of televised and radioed interviews. We also show that MediaSum can be used in transfer learning to improve a model’s performance on other dialogue summarization tasks.
Anthology ID: 2021.naacl-main.474
Volume: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month: June
Year: 2021
Address: Online
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 5927–5934
URL: https://aclanthology.org/2021.naacl-main.474
DOI: 10.18653/v1/2021.naacl-main.474
Cite (ACL): Chenguang Zhu, Yang Liu, Jie Mei, and Michael Zeng. 2021. MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5927–5934, Online. Association for Computational Linguistics.
Cite (Informal): MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization (Zhu et al., NAACL 2021)
PDF: https://aclanthology.org/2021.naacl-main.474.pdf
Optional supplementary material: 2021.naacl-main.474.OptionalSupplementaryMaterial.pdf
Video: https://aclanthology.org/2021.naacl-main.474.mp4
Code: zcgzcgzcg1/MediaSum
Data: CRD3, Interview, MultiWOZ, SAMSum Corpus