Detecting Extraneous Content in Podcasts

Podcast episodes often contain material extraneous to the main content, such as advertisements, interleaved within the audio and the written descriptions. We present classifiers that leverage both textual and listening patterns in order to detect such content in podcast descriptions and audio transcripts. We demonstrate that our models are effective by evaluating them on the downstream task of podcast summarization and show that we can substantively improve ROUGE scores and reduce the extraneous content generated in the summaries.


Introduction
Podcasts are a rich source of data for speech and natural language processing. We consider two types of textual information associated with a podcast episode: the short description written by the podcast creator, and the transcript of its audio, both of which may contain content that is not directly related to the main themes of the podcasts. Such content may come in the form of sponsor advertisements, promotions of other podcasts, or mentions of the speakers' websites and products.
While such content is tightly integrated into the user experience and monetization, it is a source of noise for many natural language processing and information retrieval applications that utilize podcast data. For example (Table 1), an episode of the podcast show Survival includes a promotion for an unrelated podcast, Dog Tales, about dogs; a search query for podcasts on dogs should probably not surface the Survival episode. Algorithms attempting to connect topics discussed in the podcast to those mentioned in the episode description, such as summarization models, would be confounded by the presence of supplementary material and URLs in the description. Information extraction models looking for entities may mistakenly retrieve sponsor names from advertisements.

Table 1: Examples of extraneous content. Transcript excerpt: "... sit stay and roll over with excitement for par casts endearing series dog tails. Listen to dog tails free on Spotify or wherever you get your podcast. And now back to the story. Almost immediately after setting sail on June 29th. 1871 Charles Francis Halls Arctic Expedition..." Description excerpt: "... the focus is on strengthening deficient repertoires, while systematically increasing task demands and difficulty. For more information, visit www.behaviorbabe.com. -This episode is sponsored by Anchor: The easiest way to make a podcast. https://anchor.fm/app"
In this paper, we introduce the problem of detecting extraneous content (which we sometimes shorten to EC) in episode descriptions and audio transcripts. We produce an annotated corpus by taking advantage of podcast listening data, construct models to detect extraneous content, and evaluate our models for accuracy of detection as well as on the downstream task of summarizing podcast transcripts. We also discuss some of the challenges that arise while annotating and classifying extraneous content in this domain.

Previous Work
A related, well-studied problem is boilerplate detection on web pages, mainly involving the detection of templates, navigational elements, and advertisements (Kohlschütter et al., 2010). Such models tend to rely on the specific structure of web page boilerplate markup and characteristics. There has also been work on detecting promotional content on Wikipedia (Bhosale et al., 2013).
There are primarily two lines of work in advertisement detection and discovery in multimedia. One computes acoustic features over the entire audio to discriminate between segments of content and segments of advertisements (Conejero and Anguera, 2008; Melamed and Kim, 2009; Nguyen et al., 2010). The other fuses multimodal features such as visual cues to segment ad clips from televised and online videos (Lienhart et al., 1997; Duan et al., 2006; Vedula et al., 2017). Our work is most closely related to Huang et al. (2018), who analyze consumer engagement with audio advertisements compared to topical content; like them, we utilize user engagement signals, in our case to predict extraneous content segments.

Datasets and Annotation
To create an annotated dataset, we selected a random set of podcast episodes out of the Spotify Podcast Dataset, a corpus of 105,360 episodes (Clifton et al., 2020). Each episode in the dataset has an automatically generated transcript and a short text description of the episode written by the podcast creators. We annotate both sources, creating ground-truth labels for the extraneous content detection task, using the open source software doccano (Nakayama et al., 2018). Annotators were instructed to select spans that correspond to extraneous content, which we defined as ads, social media links, promotions of other podcasts, and show notes that are not directly related to the episode. Respecting sentence boundaries was encouraged, but not required, to allow for cases where extraneous content starts or ends mid-sentence.

Podcast Episode Descriptions
Annotation of episode descriptions was relatively straightforward. We encountered a few corner cases: for example, an episode may in its entirety be a promotion or an ad, in which case a description that reflects the promotion is on-topic. In such scenarios, annotators attempted to be as consistent as possible with their judgment of the main topics of the episode. Examples of annotated episode descriptions are shown in Table 6 in the appendix.

Podcast Episode Transcripts
Each transcript contains thousands of words, of which extraneous content may make up a small portion. This necessitates a way to sample an informative subset of transcript segments for annotation. We observe that if a region is extraneous to the main content, listeners of the podcast episode may fast-forward past this region or abandon listening.
To this end, we gather listener data for a subset of our corpus from Spotify, an audio streaming platform. For each second of an episode's duration, we record the proportion of all listeners who began streaming the episode and are still listening at that point. Listening data was aggregated from the date of each episode's publication (the most recent episode was published in February 2020) through September 2020. For episodes with a substantial number of listeners, there exist distinct local minima or "dips" in the retention curves (Figure 1), which we posit may correspond to regions of extraneous content.

Figure 1: A podcast listener retention curve for a single podcast episode. The dips in the graph suggest potential EC regions. Start- and end-points for each dip are automatically estimated as described in Section 3.2, shown with green and red markers respectively.
To locate the center point of each dip, we first apply SciPy peak detection on the negative retention curve (Virtanen et al., 2020). Within ±2 min of each peak, we calculate the slopes of secant lines passing through every point on the curve. The coordinates which maximize the secant slope within this range correspond to the start/end points of the dips. Green and red secant lines are shown in Figure 1 to illustrate this process.
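To make this concrete, below is a minimal sketch of the dip-localization step under simplifying assumptions: the retention curve is sampled once per second, SciPy's find_peaks stands in for the peak detection, and the prominence threshold and the exact secant-slope criterion (steepest drop into the dip for the start, steepest rise out of it for the end) are illustrative choices rather than our exact settings.

```python
import numpy as np
from scipy.signal import find_peaks


def locate_dips(retention, window_sec=120, prominence=0.01):
    """Estimate (start, end) second offsets for each listener-retention dip."""
    retention = np.asarray(retention, dtype=float)
    # Dip centers are peaks of the negated retention curve.
    centers, _ = find_peaks(-retention, prominence=prominence)
    dips = []
    for c in centers:
        lo = max(0, c - window_sec)
        hi = min(len(retention) - 1, c + window_sec)
        before = np.arange(lo, c)          # candidate start points (left of the dip)
        after = np.arange(c + 1, hi + 1)   # candidate end points (right of the dip)
        if len(before) == 0 or len(after) == 0:
            continue
        # Slopes of secant lines through the dip center and each candidate point.
        slope_before = (retention[c] - retention[before]) / (c - before)
        slope_after = (retention[after] - retention[c]) / (after - c)
        # Steepest drop into the dip marks the start; steepest rise out marks the end.
        start = int(before[np.argmin(slope_before)])
        end = int(after[np.argmax(slope_after)])
        dips.append((start, end))
    return dips
```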
The transcript is then segmented from 60 sec before the starting point to 90 sec after the ending point of each dip, loosening the boundaries of the potential EC regions; these segments are then manually annotated. We note that the transcripts contain noisy text that is an artifact of the automatic speech recognition (ASR) system. An additional challenge stems from identifying native advertising, where podcast creators deliberately script product placement into the context of their content (Einstein, 2015; Hutton, 2015). Considering all such cases, we attempted to estimate the boundaries of the scripted content as accurately as possible.
Examples of annotated transcript regions are shown in Table 7 in the appendix. The set of manual annotations is used as a gold labeled transcript dataset for our models. Of the annotated dips, 38.4 % were found not to contain extraneous content. These largely correspond to episode beginnings, where listeners may skip over the introduction to the show, or to dynamically inserted ads that are not present in the transcript. The dip boundaries for the rest are relatively accurate against the manual annotations, with a mean absolute error of 16.0 words for the starts and 35.2 words for the ends, motivating the use of the unannotated listener dips as a silver training set (described in §4.2).
Sentence-Level Labels Our models for descriptions and transcripts use the sentence as the unit of classification. The annotated data is split into sentences using SpaCy. A sentence is labeled as extraneous if more than 50 % of it is annotated as such (Table 2).
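As a concrete illustration of this labeling rule, the sketch below assumes character-offset annotation spans and spaCy's default English pipeline; the function name and span format are ours, not part of any released tooling.

```python
import spacy

nlp = spacy.load("en_core_web_sm")


def sentence_labels(text, ec_spans):
    """Label each sentence as extraneous if >50 % of its characters fall in an EC span.

    ec_spans: list of (start_char, end_char) annotated extraneous-content spans.
    """
    labels = []
    for sent in nlp(text).sents:
        overlap = sum(
            max(0, min(sent.end_char, end) - max(sent.start_char, start))
            for start, end in ec_spans
        )
        length = max(sent.end_char - sent.start_char, 1)
        labels.append((sent.text, overlap / length > 0.5))
    return labels
```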

Sentence-Level Classification
We built separate classifiers for detecting extraneous content in descriptions and transcripts. A pretrained BERT (Devlin et al., 2019) cased language model was first fine-tuned on our entire large corpus (of podcast descriptions and transcripts respectively, excluding the test set) to capture the distinctive language use of the podcast domain, and then further fine-tuned for classification on the annotated data to predict whether a sentence is extraneous. We also trained non-neural classifiers (logistic regression and SVMs) with TF-IDF unigram and bigram features (Appendix A.2).
We experimented with single sentence classification in isolation, and with the immediately preceding sentence prepended for context.
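A minimal sketch of the classification step, assuming the Huggingface Transformers API; the generic bert-base-cased checkpoint stands in for our domain-adapted model, and the example sentences and length limit are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint; in our setup this would be the BERT model already
# fine-tuned on the podcast corpus and then on the annotated EC data.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

prev_sent = "And now back to the story."
sent = "Listen to Dog Tales free on Spotify or wherever you get your podcasts."

# The immediately preceding sentence is supplied as context via BERT's sentence-pair input.
inputs = tokenizer(prev_sent, sent, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
# probs[0] = [P(topical), P(extraneous)] once the classification head is trained.
```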

Document-Level Classification
We classify sentences independently, but extraneous content comes as contiguous groups of sentences. We therefore apply non-parametric kernel regression post hoc to smooth the sequence of individual sentence classification probabilities in the transcripts. We observe that extraneous content within episode descriptions often appears as a contiguous block at the end, prompting us to apply a change point detector (Appendix A.3) to the sentence classification probabilities in order to detect the start of the EC block.
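The smoothing step can be sketched as Nadaraya-Watson kernel regression over sentence positions; the Gaussian kernel and bandwidth below are illustrative assumptions, not our exact configuration.

```python
import numpy as np


def smooth_probs(probs, bandwidth=2.0):
    """Kernel-regression smoothing of per-sentence EC probabilities."""
    probs = np.asarray(probs, dtype=float)
    pos = np.arange(len(probs))
    # Gaussian kernel weights between every pair of sentence positions.
    weights = np.exp(-0.5 * ((pos[:, None] - pos[None, :]) / bandwidth) ** 2)
    return weights @ probs / weights.sum(axis=1)


# An isolated spike is damped while a contiguous extraneous block survives smoothing.
print(smooth_probs([0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.85, 0.1]))
```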
We also formulate the problem as sequence tagging at the sentence level, in order to allow the model to learn the label dependencies. For this, we use the BERT pooled sentence embeddings as input to a separate BiLSTM-CRF model. The BiLSTM-CRF improves over sentence-level classification but underperforms the change point detection strategy (Table 4). In future work, we would like to investigate an end-to-end BERT-BiLSTM-CRF model (Dai et al., 2019) or sequential sentence classification models (Cohan et al., 2019).

Expanded Transcript Dataset
Since the manually labeled gold set is small, we create a larger silver dataset from 6401 detected listener dips across 4930 episodes by applying the best performing classifier trained on the gold data. To encode information about the dip locations to aid the model, we prepend special tokens 'in-dip' and 'outside-dip' to each sentence depending on whether the sentence lies within a listening dip.
We strip special tokens from the resulting silver set. This data is then used to train a final classifier that can detect extraneous content regions in podcast transcripts without listener dips. We also add a negative sample of sentences that are distant (by at least 5 minutes) from dips in the same episodes, with the assumption that these are likely to be topical.
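For illustration, a sketch of how the dip-location signal is encoded before the silver set is labeled; the sentences and flags are toy data, and only the marker strings come from the description above.

```python
def mark_sentences(sentences, in_dip_flags):
    """Prepend a dip-location marker token to each transcript sentence."""
    return [
        ("in-dip " if in_dip else "outside-dip ") + sentence
        for sentence, in_dip in zip(sentences, in_dip_flags)
    ]


marked = mark_sentences(
    ["This episode is sponsored by Anchor.", "And now back to the story."],
    [True, False],
)
# -> ["in-dip This episode is sponsored by Anchor.",
#     "outside-dip And now back to the story."]
```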
On the same test set as the previous experiments, both document- and sentence-level performance increase (Table 4), showing that the model benefits from the larger, albeit noisy, training set. This model is applied to the corpus in the downstream task described below.

Application to Podcast Summarization
We address the problem of automatically generating episode descriptions from podcast transcripts, a task similar to abstractive summarization. Within this problem, we evaluate the downstream effect of removing extraneous sentences from the training and/or test data. Alternatives to removal (such as using the model's predictions as auxiliary inputs in the downstream system) are left for future work.
We experiment with two supervised abstractive summarization models, both built using BART (Lewis et al., 2020). The first experiment uses a model pretrained for summarization on the CNN/DailyMail dataset (Hermann et al., 2015). This model (which we refer to as BART-CNN) evaluates the extent to which extraneous text in the transcripts contributes to the presence of extraneous content in the generated descriptions. In the second experiment, we fine-tune BART-CNN on our corpus of podcasts, similar to the work of Zheng et al. (2020), using the episode transcripts as inputs and descriptions as targets. We refer to this model as BART-PODCASTS; with it, we can evaluate the effects of EC as realistic noise which may contaminate the training data of summarization models.
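A minimal sketch of the first experiment under these assumptions: the publicly available "facebook/bart-large-cnn" checkpoint is used as BART-CNN via the Huggingface pipeline API, and the transcript excerpt is a toy input; fine-tuning this model on transcript-description pairs (not shown) would yield the BART-PODCASTS variant.

```python
from transformers import pipeline

# BART pretrained for summarization on CNN/DailyMail, applied off the shelf.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

transcript_excerpt = (
    "Almost immediately after setting sail on June 29th, 1871, the Arctic expedition "
    "led by Charles Francis Hall ran into trouble, and the crew began to fracture..."
)

# Length limits mirror those reported in Appendix A.2.
description = summarizer(transcript_excerpt, min_length=30, max_length=250,
                         truncation=True)[0]["summary_text"]
print(description)
```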

Experimental Setup
From the original corpus of 105,360 podcast episodes, 6401 were used for training and evaluation of the two EC detection models. We filter the remaining episodes to remove those with unusually short or long descriptions. The resulting dataset contains 84,451 episodes sorted by episode publication date. This is split into 82,451 episodes for training, 1000 for validation, and 1000 for evaluation.
As a baseline, we manually remove extraneous content from the episode descriptions within the test set, comparing the ROUGE scores of the model outputs against the manually cleaned descriptions as well as against the original descriptions. Additionally, we manually check whether the generated outputs for 150 random test episodes contain extraneous content.

Table 5 shows the full ROUGE-L scores of our experiments. We evaluate quality through ROUGE as well as by manually checking the outputs for the presence of extraneous content. BART-CNN is an out-of-the-box summarization model, while BART-PODCASTS is the same model fine-tuned on our data of transcripts and descriptions. All numbers are reported on the test split of the corpus; the range of these ROUGE scores is comparable to previous podcast summarization work.

Removing extraneous content is clearly beneficial for summary quality: while the baseline models have better ROUGE scores against the original (EC-containing) descriptions, the highest-scoring models score better against the clean descriptions than against the originals. With the original data, both BART-CNN and BART-PODCASTS tend to generate descriptions that contain extraneous material. Interestingly, BART-PODCASTS, being trained on the unmodified descriptions, produces even more extraneous content (73.2 %) than the corresponding original descriptions (50.0 %), often generating ads unrelated to the actual sponsors and nonexistent URLs. While post-processing only the output summaries, with no change to the model inputs, is effective at minimizing extraneous content, it does so at the expense of summary quality, since the resulting summaries are significantly shortened. The best overall performance comes from detecting and removing extraneous content in transcripts and descriptions before model training and application.
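For concreteness, a sketch of the two-reference comparison using the rouge_score package (an assumed implementation choice; we do not name a specific ROUGE toolkit above), with toy generated and reference descriptions.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

generated = "A history of Charles Francis Hall's 1871 Arctic expedition and its mysterious end."
original = ("A history of Charles Francis Hall's Arctic expedition. This episode is sponsored "
            "by Anchor: the easiest way to make a podcast. https://anchor.fm/app")
cleaned = "A history of Charles Francis Hall's Arctic expedition."

# Scores are computed against both the original and the manually cleaned description.
print(scorer.score(original, generated)["rougeL"].fmeasure)
print(scorer.score(cleaned, generated)["rougeL"].fmeasure)
```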

Conclusion
We introduced the problem of detecting extraneous content in podcast descriptions and transcripts, presented models that leverage textual and listener data, and evaluated them on a downstream summarization task. We consider our models to be baselines for a new problem with several opportunities for future work. Although we used two separate models for the descriptions and transcripts with the view that the language patterns are different, a joint model or shared components may be able to take advantage of some of the common vocabulary. One could leverage the 'boilerplate' nature of some types of extraneous content like ads by detecting repeated sentences and phrases across the corpus. A language model that is robust to noisy speech transcripts (Lin et al., 2019;Chuang et al., 2019) may improve accuracy on podcast transcripts. Given that extraneous content may appear as pre-recorded audio, or with a different speaking pitch and cadence, acoustic features alongside textual ones may be helpful.

A.1 Annotation Process and Annotated Examples
Examples of extraneous content regions in episode descriptions are shown in Table 6, and examples of annotated listener 'dip' regions in the podcast transcripts in Table 7. The categorization into types is only for illustrative purposes and is not used in our model.

A.2 Model Training and Evaluation Details
We modified the sentence splitter in SpaCy to include ---, ... and the three-space string as delimiters for episode descriptions, based on our observations of common patterns. Speech recognition errors/disfluencies and missing punctuation contribute to a small amount of noise in the sentence segmentations of transcripts and descriptions, respectively. For the bag-of-words models, we used the scikit-learn (https://scikit-learn.org/) implementations with the default parameters and no hyper-parameter tuning.
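A minimal sketch of these bag-of-words baselines with toy data; the pipeline mirrors the TF-IDF unigram/bigram features and default scikit-learn classifier described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentences: 1 = extraneous, 0 = topical.
sentences = [
    "This episode is sponsored by Anchor: the easiest way to make a podcast.",
    "Listen to Dog Tales free on Spotify or wherever you get your podcasts.",
    "Charles Francis Hall's Arctic expedition set sail on June 29th, 1871.",
    "The crew began to fracture almost immediately after leaving port.",
]
labels = [1, 1, 0, 0]

# TF-IDF unigrams and bigrams with a default logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)
print(clf.predict(["For more information, visit www.behaviorbabe.com."]))
```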
For all Transformer models, we used the Huggingface library (Wolf et al., 2020). For summarization, we set the maximum length of the target descriptions as 250 tokens for training and generation, and the minimum length to 30 tokens. The models were trained for up to 5 epochs, with early stopping based on ROUGE-2 on the validation set. All other hyperparameters were set as the defaults specified in the Huggingface Transformers code.

A.3 Change Point Detection for Episode Descriptions
As shown in Eq. 1, we find the position $\hat{\tau}$ that maximizes the log-likelihood ratio $R_\tau$ of $H_1$ (a change point exists at position $\tau$) against $H_0$ (no change point):

$$\hat{\tau} = \arg\max_{\tau} R_\tau \tag{1}$$

We make the assumption that (1) there is only one change point, and (2) extraneous content appears at the end of the descriptions. The null hypothesis is that there is no change point, while the alternative hypothesis assumes that there is a change point at position $t = \tau$. The hypothesis test over the sentence-level observations $x = x_1, \dots, x_n$ is:

$$H_0: \; x_i \sim p(\cdot \mid \theta_0), \quad i = 1, \dots, n \tag{2}$$

$$H_1: \; x_i \sim p(\cdot \mid \theta_0) \text{ for } i \le \tau, \qquad x_j \sim p(\cdot \mid \theta_1) \text{ for } j > \tau \tag{3}$$

The likelihood under $H_0$ is the probability of observing the data $x = x_1, \dots, x_n$ conditional on $H_0$. In other words,

$$p(x \mid H_0) = \prod_{i=1}^{n} p(x_i \mid \theta_0), \tag{4}$$

and the likelihood under the alternative hypothesis is

$$p(x \mid H_1, \tau) = \prod_{i=1}^{\tau} p(x_i \mid \theta_0) \prod_{j=\tau+1}^{n} p(x_j \mid \theta_1). \tag{5}$$

The log-likelihood ratio $R_\tau$ is then

$$R_\tau = \log p(x \mid H_1, \tau) - \log p(x \mid H_0). \tag{6}$$

Table 7: Annotated examples from the podcast transcripts, corresponding to detected dips in listening. Transcript excerpts: "... Not much is known about her personal life prior to her murder if she was so average, why was she living with her aunt and uncle I'm not saying there's anything wrong with that [...]"; "... I'm sorry to say that it looks as though your husband has an advanced..."