100,000 Podcasts: A Spoken English Document Corpus

Ann Clifton; Sravana Reddy; Yongze Yu; Aasish Pappu; Rezvaneh Rezapour; Hamed Bonab; Maria Eskevich; Gareth Jones; Jussi Karlgren; Ben Carterette; Rosie Jones

doi:10.18653/v1/2020.coling-main.519

Abstract

Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition, they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. The corpus is orders of magnitude larger than previous speech corpora used for search and summarization. Our results show that the size and variability of this corpus open up new avenues for research.

Anthology ID: 2020.coling-main.519
Volume: Proceedings of the 28th International Conference on Computational Linguistics
Month: December
Year: 2020
Address: Barcelona, Spain (Online)
Editors: Donia Scott, Nuria Bel, Chengqing Zong
Venue: COLING
Publisher: International Committee on Computational Linguistics
Pages: 5903–5917
URL: https://aclanthology.org/2020.coling-main.519/
DOI: 10.18653/v1/2020.coling-main.519
Cite (ACL): Ann Clifton, Sravana Reddy, Yongze Yu, Aasish Pappu, Rezvaneh Rezapour, Hamed Bonab, Maria Eskevich, Gareth Jones, Jussi Karlgren, Ben Carterette, and Rosie Jones. 2020. 100,000 Podcasts: A Spoken English Document Corpus. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5903–5917, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal): 100,000 Podcasts: A Spoken English Document Corpus (Clifton et al., COLING 2020)
PDF: https://aclanthology.org/2020.coling-main.519.pdf
