StreamHover: Livestream Transcript Summarization and Annotation

With the explosive growth of livestream broadcasting, there is an urgent need for new summarization technology that enables us to create a preview of streamed content and tap into this wealth of knowledge. However, the problem is nontrivial due to the informal nature of spoken language. Further, there has been a shortage of annotated datasets that are necessary for transcript summarization. In this paper, we present StreamHover, a framework for annotating and summarizing livestream transcripts. With a total of over 500 hours of videos annotated with both extractive and abstractive summaries, our benchmark dataset is significantly larger than currently existing annotated corpora. We explore a neural extractive summarization model that leverages vector-quantized variational autoencoder to learn latent vector representations of spoken utterances and identify salient utterances from the transcripts to form summaries. We show that our model generalizes better and improves performance over strong baselines. The results of this study provide an avenue for future research to improve summarization solutions for efficient browsing of livestreams.


Introduction
One of the most powerful communication mediums is livestreaming. New platforms such as YouTube Live, Twitch, Instagram Live and TikTok encompass a variety of topics, ranging from video games to social media to professional sports. We are particularly interested in livestreams that are distinguished by three characteristics: Excessive length, the recordings could last from several minutes to several hours; Verbal communication, the use of natural language is the primary means of communication, in contrast to gestures or facial expressions; Informal nature, the streamers' language is mostly informal and unplanned, unlike news broadcasts. Without an effective mechanism to summarize such streamed content, livestreaming platforms may not fully meet the needs of their customers. Our goal in this work is to create a text preview of the streamed content. When a user hovers over the thumbnail or scrolls past a video, they are shown a preview of the content. We present a dataset of over 500 hours of video footage, which were streamed live on a social media platform (behance.net) created to showcase and discover creative work. Figure 1 shows an example of the streams, where the artists showcase the use of Adobe Photoshop and Illustrator in designing holiday cards and posters. It is necessary to point out that video analysis is not suitable here, as the video only mirrors the artists' screen content. As a first step towards automatic creation of a text preview, we focus on identifying salient utterances to produce an extract from the livestream transcript.
We make use of vector-quantized variational autoencoders (VQ-VAE; van den Oord et al., 2017) to identify salient utterances. The model has been applied successfully to opinion summarization that learns in-domain sentence representations (Angelidis et al., 2021), which is essential for adaptation of general-domain models. We refrain from using sequential methods for utterance selection. First, it is difficult to scale up sequential prediction to process transcripts that exceed the maximum allowed length, even with models that handle long text (Beltagy et al., 2020;Zhao et al., 2020). Second, sequential methods (Narayan et al., 2018b;Xiao and Carenini, 2019) may not give enough flexibility to select salient utterances on-the-fly when content is being streamed live, thus they are unsuitable for our case.
There has been a shortage of annotated datasets that are necessary for livestream transcript summarization. We build a browser-based user interface for summary annotation that provides to the annotators a clip of the livestream recording alongside a synchronized display of the transcript. The interface allows annotators to conveniently label summary utterances and write an abstractive summary using their own words (Figure 3). With a total of 500 hours of annotated video footage, our dataset is notably larger than existing annotated corpora for transcript summarization (Janin et al., 2003;Carletta et al., 2006). We compare our summarization approach with strong baselines on the dataset and shed light on the task of livestream transcript summarization. Our contributions are as follows.
• We create a detailed annotation interface and new benchmark dataset for automatic summarization of livestream transcripts. An informative preview of streamed content is of crucial importance to users when considering whether to hit play.
• We present StreamHover, an unsupervised model based on VQ-VAE to identify salient utterances from livestream transcripts to form preview summaries. We evaluate the method across multiple dimensions and discuss its strengths and weaknesses. Empirical results show that our method outperforms strong summarization baselines.

Related Work
Closed captions are often provided onscreen, turning streaming videos into text on an unprecedented scale (Besik, 2020). However, there are very few summarization studies that attempt to generate text previews of streaming videos to help users browse or refind information that has been watched before. Neural text summarizers have focused primarily on written text, including news articles, reviews, scientific papers and book chapters (See et al., 2017; Tan et al., 2017;Chen and Bansal, 2018;Narayan et al., 2018a;Gehrmann et al., 2018;Cohan et al., 2018;Liu and Lapata, 2019;Fabbri et al., 2019;Bražinskas et al., 2020;Ladhak et al., 2020;Song et al., 2021). Despite their success, it remains unclear as to if and how the summarizers can be extended to spoken text, whose utterances may have very low information density.
It is crucial to identify salient content from transcripts where a substantial number of utterances are devoted to informal chit-chats in an attempt to connect with the audience (Figure 2). We investigate extractive rather than abstractive approaches as the latter are prone to generate hallucinated content that does not exist in the source text (Cao et al., 2017;Kryscinski et al., 2019;Lebanoff et al., 2019;Maynez et al., 2020). The problem could be exacerbated by ungrammatical spoken utterances and transcription errors. Instead, we consider VQ-VAE, an unsupervised representation learning technique (van den Oord et al., 2017;Jin et al., 2020;Angelidis et al., 2021) for content extraction. Unsupervised training of the VQ-VAE model and its inference could potentially be performed at the same time, allowing important utterances to be extracted from a transcript segment on-the-fly during streaming, without interrupting the learning process. It is also easier to tailor the model to specific domains compared to contemporary extractive methods (Yasunaga et al., 2017;Dong et al., 2018;Xu and Durrett, 2019;Wang et al., 2020).
Our work contributes to a refined understanding of transcript summarization, which is understudied relative to its importance and potential. The transcripts may be obtained from channels such as movies and TV shows (Papalampidi et al., 2020; Chen et al., 2021), interviews (Zhu et al., 2021), multiparty meetings (Murray and Carenini, 2008; Wang and Cardie, 2013; Li et al., 2019b; Koay et al., 2020, 2021; Zhong et al., 2021), telephone speech (Kafle and Huenerfauth, 2018) and more. What distinguishes our work from others is the combination of a benchmark summarization dataset, novel summarization methods and a challenging new domain where salient content is scattered throughout the transcript and mixed with substantial chit-chat. We do not make use of video event detection or multi-modal fusion (Zhu et al., 2018; Palaskar et al., 2019; Li et al., 2020) as little information could be gleaned from videos that mirror the artists' desktop. Instead, we focus on generating short descriptions from transcripts and leave cross-modality research for future work. We describe our data annotation process in the following section.

Our Dataset
We aim to create a large and representative corpus containing transcripts and summaries of streamed videos. We explore a leading social media platform (Behance.net) supported by Adobe Creative Cloud that features livestreams of creative work by artists and designers. The website boasts over 10 million users, who watch artists and designers livestream when they create. Our data are extracted from this website, containing a large quantity of streamed videos (>5,000), the length of which ranges from minutes to several hours. The streamers' language is unplanned, instead of rehearsed as that of TED talks (Hernandez et al., 2018).
We obtain a total of 5,398 streamed videos. The metadata of a video includes its ID, duration, title, a short description and the transcript. Automatic transcription was provided by Microsoft Automatic Speech Recognition, which helps make videos accessible to a wide audience. Each transcript contains a set of segments, each corresponding to about 30 seconds of audio. Each segment contains a set of utterances. Figure 2 shows an example of the segments and utterances. The offset of the segment indicates the number of minutes since the beginning of the recording.

Figure 3: An example of our browser-based annotation interface. It includes a clip of the streamed video alongside a display of the transcript (omitted for space). The streamer talks about Digital Painting with Maddy Bellwoar to create fairytale themed images. The annotators are asked to write a concise summary of this clip using their own words (Task A) and identify summary utterances (Task B).
When a user hovers over the thumbnail or scrolls past a video, we expect a textual summary to give a glimpse of the verbal content. This view of summarization leads us to annotate salient content across the video in an equally detailed manner. It naturally avoids lead bias that is ubiquitous in news (Grenander et al., 2019). We segment a video into 5-minute clips and annotate each clip for summary-worthy content. A clip contains an average of 51 utterances and 460 words. Due to time and budget constraints, we select 370 streamed videos for summary annotation. Table 1 provides a detailed comparison of our annotated corpus with previous datasets, including Switchboard (Godfrey et al., 1992), ICSI (Janin et al., 2003) and AMI (Carletta et al., 2006) that contain both transcripts and human-annotated extractive/abstractive summaries. With a combined duration of 500 hours, our dataset is substantially larger than previously released datasets.

We recruit 12 workers from Upwork.com and validate their language skills for summary annotation. Upwork is a freelancing platform that allows us to reach out to workers directly to ensure our instructions are fully understood. Each worker is asked to write a concise summary for a given clip using their own words (Task A) and identify summary utterances (Task B) using the graphical user interface (Figure 3), which shows a clip of the streamed video alongside a synchronized display of the transcript. Additionally, our guidelines suggest a good summary of Task A should have at least 100 characters and that of Task B should have between 50 and 80 words (∼15% compression). As is the case with meeting summaries (Janin et al., 2003), a clip is annotated by a single worker owing to an expensive annotation process. The worker can also identify a clip to be chitchat, in which case it will not be annotated for summaries. Table 2 shows our dataset statistics.
On average, a human abstract contains 3 sentences (36 words) and a human annotated extract contains 5.5 utterances (80 words). Moreover, summary utterances constitute 8.9% and 8.6% of all utterances in terms of number and duration. We study inter-annotator agreement by inviting three workers to each annotate 8 hours of video that contains a total of 96 clips. Using 10-second intervals as measuring units (see footnote 4), the Fleiss' Kappa on identifying summary utterances is 0.13. We note that the score is within the range of what is normally found in annotating speech transcripts for extractive summaries (0.1∼0.3; Marge et al., 2010), as annotating spoken text is a highly challenging task. We find that annotators tend to perceive the same region as salient but they may disagree as to which utterances should be included in the summary due to verbosity of spoken text. We refer the reader to Artstein and Poesio (2008) for interpretations and improvements to IAA.

Footnote 4: We use 10-second intervals rather than utterances as measuring units as the duration of utterances varies. If annotators all selected some content, or no content at all, from a 10-second interval, they are in agreement.

Table 2: Dataset statistics.
Total number of annotated videos                          370
Total annotated duration in hours                         500
Total number of utterances                                331,928
Average number of utterances in a clip                    61.23
Average duration of utterances in seconds                 3.04
Average number of words in an utterance                   9.48
Total number of annotated clips                           6,003
Total number of chitchat clips                            582
Total number of human annotators                          12
Avg. # of (sentences) words in a human abstract           (3.0) 36
Avg. # of (utterances) words in a human extract           (5.5) 80
Percentage of duration of summary utterances              8.57%
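To make the agreement measure concrete, the following is a minimal sketch of Fleiss' Kappa computed over 10-second intervals with two categories (content selected or not). The function names and the toy selections are illustrative assumptions, not the scripts used for the study.

```python
def interval_counts(selections):
    """selections: one list of 0/1 flags per annotator, one flag per 10-second interval.
    Returns, per interval, [number of annotators who skipped it, number who selected it]."""
    n_raters = len(selections)
    rows = []
    for flags in zip(*selections):
        picked = sum(flags)
        rows.append([n_raters - picked, picked])
    return rows


def fleiss_kappa(rows):
    """rows: per-item category counts; every row must sum to the same number of raters.
    Undefined (division by zero) when all ratings fall into a single category."""
    n_items = len(rows)
    n_raters = sum(rows[0])
    n_cats = len(rows[0])
    # Observed agreement for each item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in rows
    ]
    p_bar = sum(p_items) / n_items
    # Expected agreement from the overall category proportions.
    p_cats = [sum(row[j] for row in rows) / (n_items * n_raters) for j in range(n_cats)]
    p_exp = sum(p * p for p in p_cats)
    return (p_bar - p_exp) / (1 - p_exp)


# Toy example (hypothetical): three annotators, four 10-second intervals.
toy = [[1, 0, 0, 1], [1, 0, 1, 0], [1, 0, 0, 0]]
kappa = fleiss_kappa(interval_counts(toy))
```

Intervals where all annotators select something, or all select nothing, count as full agreement, which mirrors the definition in footnote 4.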

Summarization
Let X denote a sequence of spoken utterances from a segment of the transcript. Our summarizer aims to extract a subset of utterances Y ⊂ X that convey the essential content of the input. We experiment with an unsupervised summarizer that leverages vector-quantized variational autoencoders (VQ-VAE; van den Oord et al., 2017) to learn utterance representations and identify summary utterances. The method was explored for opinion summarization (Angelidis et al., 2021) and machine translation (Prato et al., 2020). We are interested in using the method to account for domain characteristics of livestreams, which showcase new and creative work of artists and designers on their use of Photoshop, Illustrator, and other tools. VQ-VAE is a powerful framework for learning latent variable models using deep neural networks. It learns discrete vector representations for an utterance, which are then used to categorize the utterance along various dimensions. E.g., "Good morning Hi Everybody" suggests a greeting and opens up a dialogue; "I had probably 3 or 4 different customers on YouTube and ... on Facebook asked me how the heck do you burn an audio CD in Adobe Audition" engages the audience and introduces the main topic. The VQ-VAE method groups utterances based on their discrete representations and selects salient utterances to form a summary.

Figure 4: Our summarizer embeds an input utterance using BERT, transforms BERT's semantic space to a set of latent codes, then reconstructs the utterance using the code embeddings. We identify summary utterances as those associated with prominent latent codes/topics. The model is trained using a dictionary learning algorithm for code embeddings (E) and backpropagation with a straight-through estimator for model parameters θ, ϕ and φ.

We employ an embedding function Embed_θ(·) to map an input utterance x into a semantically meaningful space. The space is subsequently discretized according to a codebook. To achieve this, we prefix x with a [CLS] token and append a [SEP] token, pass it into a BERT model, then obtain the vector corresponding to [CLS] as a pooled representation of the utterance, denoted by h ∈ R^H (Eq. (1)). We use a ConvEncoder_ϕ(·) with a set of D filters to convolve the input h. The output is a sequence of feature vectors [q_1, · · · , q_H] where q_i ∈ R^D (Eq. (2)). We define a codebook E = [e_1, · · · , e_K], where K is the number of latent codes and e_k ∈ R^D is the k-th code embedding. The i-th feature q_i is assigned to the latent code z_i whose embedding e_{z_i} has the minimum Euclidean distance to it (Eq. (3)). Our method essentially discretizes the H-dimensional semantic space by producing latent codes {z_i}_{i=1}^H, one for each dimension of the semantic space.
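The nearest-code assignment described for Eq. (3) can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions: the helper names `nearest_code` and `quantize` and the toy values (K = 3, D = 2, H = 4) are hypothetical, and a real model would operate on tensors produced by the convolutional encoder.

```python
def nearest_code(q, codebook):
    """Return the index k of the code embedding e_k closest to feature q
    in squared Euclidean distance (ties go to the lowest index)."""
    best_k, best_d = 0, float("inf")
    for k, e in enumerate(codebook):
        d = sum((qi - ei) ** 2 for qi, ei in zip(q, e))
        if d < best_d:
            best_k, best_d = k, d
    return best_k


def quantize(features, codebook):
    """Map each of the H feature vectors q_i to its latent code z_i."""
    return [nearest_code(q, codebook) for q in features]


# Toy codebook with K = 3 codes of dimension D = 2 (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Toy features for one utterance with H = 4 positions.
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.9], [0.6, 0.7]]
codes = quantize(features, codebook)
```

Each utterance therefore yields H discrete codes, which is what the later scoring step counts.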
With the latent code embeddings [e_{z_1}, · · · , e_{z_H}], we seek to reconstruct the input utterance, which is achieved by generating a dense vector h̃ using a ConvDecoder_ϕ(·) (Eq. (4)). h̃ is then fed to a Transformer decoder to reconstruct the original utterance as x̃ (Eq. (5)). In this process, the code embeddings serve as "topic vectors" that group dimensions of the semantic space into clusters relevant to the application domain. Our model parameters include those used by the BERT encoder and Transformer decoder (θ and φ), the convolutional encoder and decoder that use tied parameters (ϕ), and embeddings of the codebook E.
We next describe the loss function used to learn these parameters. The loss function of our model comprises three parts. The first is a cross-entropy term between the original utterance x and its reconstruction x̃, XEnt(x, x̃), that optimizes the BERT embedder θ, Transformer generator φ, and convolutional encoder and decoder ϕ, as shown in Figure 4. The gradients will, however, bypass the latent code embeddings due to the straight-through estimator (Bengio et al., 2013). To learn code embeddings in an end-to-end manner, we use a dictionary learning algorithm (van den Oord et al., 2017) that moves code embeddings e_{z_i} towards feature vectors q_i by minimizing the squared l2-distance between the two vectors, ‖e_{z_i} − sg(q_i)‖_2^2, where sg(·) is a stop-gradient operator that constrains its operand to be a non-updated constant during backpropagation, i.e., it stops q_i from being updated. As illustrated in Eq. (6), we additionally apply a commitment loss to encourage the feature vector q_i to commit to a code embedding: ‖sg(e_{z_i}) − q_i‖_2^2 prevents q_i from deviating too much from the code embedding e_{z_i}. This loss term is associated with a coefficient β ∈ [0, 1].
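The three terms described above can be collected into a single objective. The block below is a reconstruction from the prose rather than a copy of Eq. (6): summing over the H positions without averaging, and writing the reconstruction as \tilde{x}, are assumptions.

```latex
\mathcal{L} \;=\;
  \underbrace{\mathrm{XEnt}(x, \tilde{x})}_{\text{reconstruction}}
  \;+\; \underbrace{\sum_{i=1}^{H} \bigl\lVert e_{z_i} - \mathrm{sg}(q_i) \bigr\rVert_2^2}_{\text{dictionary learning}}
  \;+\; \beta \underbrace{\sum_{i=1}^{H} \bigl\lVert \mathrm{sg}(e_{z_i}) - q_i \bigr\rVert_2^2}_{\text{commitment}},
  \qquad \beta \in [0, 1]
```

Because of the stop-gradient operator, the second term updates only the code embeddings and the third term updates only the encoder features.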
At test time, we define summary utterances as those associated with prominent latent codes/topics. Given a set of N utterances, we obtain latent codes from the n-th utterance using Eq. (3), denoted by {z_i^(n)}_{i=1}^H. This gives a total of N × H codes from which we find prominent ones. They are denoted by P, which contains a set of most frequently occurring codes. A score S(x_n) is assigned to utterance x_n that computes how often it is associated with those prominent codes P:

S(x_n) = Σ_{k∈P} Σ_{i=1}^{H} 1[z_i^(n) = k]    (7)

In Eq. (7), Σ_{i=1}^{H} 1[z_i^(n) = k] indicates the number of times the n-th utterance is assigned to code k, where k belongs to P. Finally, we extract K highest-scoring utterances to form an extractive summary of the input.
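The selection step can be illustrated with a short sketch that counts code frequencies, keeps the most frequent codes as P, and scores each utterance by how many of its codes fall in P. The function names, the tie-breaking by utterance order, and the toy codes are assumptions made for illustration only.

```python
from collections import Counter


def prominent_codes(utterance_codes, num_prominent):
    """Return the set P of the most frequently occurring codes over all N x H codes."""
    counts = Counter(z for codes in utterance_codes for z in codes)
    return {k for k, _ in counts.most_common(num_prominent)}


def score_utterances(utterance_codes, prominent):
    """S(x_n): number of positions of utterance n assigned to a code in P."""
    return [sum(1 for z in codes if z in prominent) for codes in utterance_codes]


def extract_summary(utterance_codes, num_prominent, summary_size):
    """Return indices of the highest-scoring utterances, in transcript order."""
    prominent = prominent_codes(utterance_codes, num_prominent)
    scores = score_utterances(utterance_codes, prominent)
    # sorted() is stable, so equal scores keep their transcript order.
    ranked = sorted(range(len(scores)), key=lambda n: -scores[n])
    return sorted(ranked[:summary_size])


# Toy input (hypothetical): N = 4 utterances, H = 4 codes each.
toy_codes = [[1, 1, 2, 3], [4, 5, 6, 7], [1, 2, 2, 8], [9, 9, 1, 1]]
toy_scores = score_utterances(toy_codes, prominent_codes(toy_codes, 2))
toy_summary = extract_summary(toy_codes, num_prominent=2, summary_size=2)
```

Since the procedure is a deterministic count, the same clip always produces the same extract for a fixed codebook.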
Our method draws on the convolutional encoder and decoder to transform BERT's semantic space to map each dimension to a latent code. The summary selection process is deterministic and our encoder takes full advantage of a large, pre-trained model to produce initial utterance representations. This design sets our method apart from that of Angelidis et al. (2021). Moreover, the method has the potential for modelling topic transitions between utterances to improve summarization of livestreams, which we leave for future work.

Experiments
Dataset. Finding salient content from livestream transcripts is a "needle-in-the-haystack" problem. Our summarization dataset contains a total of 370 videos split into short clips of 5 minutes each. The annotators manually annotated 5,421 clips (∼451 hours) with extractive and abstractive summaries. 582 clips (∼49 hours) are removed because they are identified to contain only chit-chats. The dataset is divided into training, validation and test splits:
• 3,884 clips (320 videos / 323 hours) in training,
• 728 clips (25 videos / 61 hours) in validation,
• 809 clips (25 videos / 67 hours) in test split.
We call our summarizer "StreamHover." When a user hovers their mouse over a video's timeline, a summary preview is shown and keeps updating. As a first attempt, StreamHover focuses on extracting salient utterances from individual clips instead of whole streams to encourage selected utterances to be mostly evenly distributed across the stream. When the content is provided live, the stream can be divided into short clips and our algorithm consumes one clip at a time to produce summaries on-the-fly. It is important to note that extracting summary utterances remains challenging even for modern neural summarizers. E.g., Kedzie et al. (2018) reveal that summarizers may not effectively identify summary content without a dependency on intentional lead bias in news writing. Our setting is challenging as not only are there few utterances deemed to be summary-worthy but such utterances can occur anywhere in a video clip.
Baselines. We compare StreamHover with state-of-the-art extractive and abstractive summarizers. The abstractive summarizers generate an abstract from the transcript of a clip without tuning. These include BART-large, BART-large-cnn (Lewis et al., 2020) and T5 (Raffel et al., 2020), which are some of the strongest performing neural abstractive summarizers that are pre-trained on language modeling and summarization tasks.
The unsupervised extractive summarizers extract salient utterances from a clip. LexRank (Erkan and Radev, 2004) and TextRank (Mihalcea and Tarau, 2004) are graph-based models that extract relevant sentences based on eigenvector centrality. SumBasic (Vanderwende et al., 2007) assigns higher scores to sentences containing frequently occurring content words. We further compare to a novel unsupervised graph-based summarization method for speech transcripts: FluCovRank (Shang et al., 2018) groups utterances into clusters, generates an abstractive sentence from each cluster, then selects the best elements from abstractive sentences under a budget constraint. Finally, we compare our approach with the Quantized Transformer (Angelidis et al., 2021), which uses a clustering interpretation of the quantized space and two-step sampling algorithm to extract summary sentences from reviews.
Settings. We use pretrained BERT-BASE as our embedder Embed_θ(·). The model has 12 layers, 12 heads per layer and a hidden size (H) of 768. A 6-layer Transformer decoder is used as the generator Generate_φ(·) to reconstruct the original utterance. The model has 8 heads per layer, a hidden size of 768, and randomly initialized parameters. The convolutional encoder and decoder use a kernel size of 3. Because our embedder is pretrained and the remaining parameters are not, we divide them into two groups E={θ} and R={φ, ϕ}, then apply separate training schedules. Following Liu and Lapata (2019) we use two Adam optimizers, where the learning rate for the embedder lr_E = 7e−4 is smaller than that of the remaining parameters lr_R = 4e−2. Its warmup period is longer: warmup_E = 3,000 for the embedder and warmup_R = 1,500 for the rest. This allows the pretrained embedder to be updated at a slower pace until other model parameters start to generate accurate gradients. All of our models are trained for 30 epochs on dual NVIDIA V100 GPUs with gradient accumulation every ten steps. We experiment with different numbers of filters, D = {64, 100, 128}, for the convolutional encoder and decoder. The number of latent codes is varied in K = {512, 1024, 2048}. The coefficient β used for commitment loss is set to 0.25 (Eq. (6)). These hyperparameters are tuned on the validation set. We keep only utterances that contain more than 5 words. The final training set contains 168,111 utterances.

[Table 3 column headers: System; P (%), R (%), F (%) and #Wrds for each of the 3-Sentence, 4-Sentence and 5-Sentence Outputs. The table body is not available in this text.]

Table 5: Results of human evaluation regarding fluency, informativeness and the overall quality of system summaries using Best-Worst Scaling.
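The two training schedules can be sketched as follows. The warmup-then-inverse-square-root formula is the one used by Liu and Lapata (2019); treating 7e−4 and 4e−2 as the base rates plugged into that formula is an assumption of this sketch, as is the function name `warmup_lr`.

```python
def warmup_lr(step, base_lr, warmup):
    """Learning rate at a given optimizer step (step >= 1):
    linear warmup for `warmup` steps, then decay proportional to step ** -0.5."""
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)


# Settings quoted in the text; the pairing with this formula is assumed.
EMBEDDER = {"base_lr": 7e-4, "warmup": 3000}   # pretrained BERT parameters (θ)
REST = {"base_lr": 4e-2, "warmup": 1500}       # randomly initialized parameters (φ, ϕ)

for step in (100, 1500, 3000, 10000):
    lr_e = warmup_lr(step, **EMBEDDER)
    lr_r = warmup_lr(step, **REST)
    print(f"step {step:>6}: lr_E = {lr_e:.2e}, lr_R = {lr_r:.2e}")
```

Under this schedule the embedder's rate peaks later and at a much smaller value, which matches the stated intent of updating the pretrained encoder more slowly.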

Results
In Table 3, we analyze the performance of extractive summarizers on identifying ground-truth summary utterances and report their precision, recall and F1-measure scores. We vary the length of their output to yield {3, 4, 5}-utterance summaries. In comparison, a ground-truth extract contains 5.5 utterances. The Lead-N baseline selects the first N utterances of a clip. It gives low scores because our data do not present strong lead bias as that of news articles. We find that StreamHover consistently outperforms other summarization systems across all lengths. Its length, when measured by number of words, is comparable to that of LexRank and TextRank. The highest F1-score of 30.47% is achieved when StreamHover generates a 5-utterance summary for each 5-minute clip. This amounts to rendering one utterance per one-minute segment when a user scrolls past the video.
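For reference, precision, recall and F1 on utterance selection for a single clip can be computed as in the sketch below. How the per-clip values are aggregated into the reported numbers is not specified here, so the sketch stops at one clip; the function name and the toy utterance IDs are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted and ground-truth utterance IDs for one clip."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy clip (hypothetical IDs): a 5-utterance system summary vs. 4 gold utterances.
p, r, f = precision_recall_f1([2, 5, 9, 14, 20], [5, 9, 11, 30])
```

With a fixed output length, longer summaries tend to trade precision for recall, which is why results are reported separately for 3-, 4- and 5-utterance outputs.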
In Table 4, we compare extractive and abstractive summarizers and report ROUGE scores (Lin, 2004) that measure content overlap between system and reference summaries. We use human abstracts as the reference. All extractive summarizers produce 5-utterance summaries and Oracle Extract contains ground-truth utterances. It places an upper bound on the performance of extractive summarizers. We observe that StreamHover yields the highest scores on both R-2 and R-L metrics.
We show example system summaries in Table 6. The abstractive summaries generated by BART-Large are fluent but their content lacks specificity, as is true of the summaries produced by LexRank and Quantized Transformer. In particular, QT does not seem to perform well on this task even though the model has been retrained on livestream transcripts (see footnote 8). We believe this is partly because words and phrases tend to repeat themselves in review documents, whereas there is little repetition in the transcripts even though spoken utterances are verbose. We observe that summary segments selected by FluCovRank are on-topic but they are ungrammatical and difficult to interpret without context. In contrast, StreamHover can identify on-topic and informative utterances related to digital painting. We provide more examples in the supplementary materials.

Footnote 8: https://github.com/stangelid/qt/blob/main/custom.md

Table 6: Example system summaries for Digital Painting Studies with Maddy Bellwoar-Clouds. The BART summary is fluent but its content lacks specificity, as is the case for LexRank. The summary segments selected by FluCovRank are ungrammatical. StreamHover identifies on-topic and informative utterances related to digital painting. Their relevant spans of text are manually underlined for readability.

FluCovRank
• top left bottom / cloud studies today / find links to their original posts / hey jennifer saw the images / love the top left and bottom / info tab and i uploaded / colors are beautiful but im partial through colorful sky scenes / pretty large about 4000 by 4000 pixels / photo studies of today / moment

LexRank
• I hope you guys are having a good day so far.
• So I'm going to be painting from these images and these beautiful photos are from various photographers.
• Those yeah well top right also is like very Contra high contrast that tends to like grab my attention when I look at the sheet but I would say top left and bottom right give me the most like happy feels.
• So yeah, if you guys want to grab the reference images, you can find them in the stream description below the individual images...

BART-Large
• Hello good morning everybody welcome to the stream. I hope you guys are having a good day so far. Is there a lot of buffering or are we doing alright? I got a little message that there was some connectivity issue. For a moment there, so I hope I hope it's OK. Yeah, I'll just keep going. So yeah, if you guys want to grab the reference images, you can find them in the stream description below the individual images...

Quantized Transformer
• Good to see you were going to be doing cloud studies today.
• The stream in the description.
• One of them is from Morguefile, One is from unsplash, well, two are from Unsplash and one is from pixels there a little bit from all over the place, but you can find the photographers below if you'd like.
• Hey Jennifer, saw the images.
• Let's see top left, bottom right...

StreamHover (Ours)
• So if anybody is interested in joining in, if you want to work on some skies for your landscapes for future landscapes, this is what we're going to be doing.
• One of them is from Morguefile, One is from unsplash, well, two are from Unsplash and one is from pixels there a little bit from all over the place, but you can find the photographers below if you'd like.
• Those yeah well top right also is like very Contra high contrast that tends to like grab my attention when I look at the sheet but I would say top left and bottom right give me the most like happy feels.
• So yeah, if you guys want to grab the reference images, you can find them in the stream description below the individual images...

Transcript excerpt from the same stream (utterances 5-15) accompanying the latent-code analysis. The 'x' marks that flagged individual utterances cannot be aligned to rows in this text and are omitted.
5 Photo studies of today.
6 So I'm going to be painting from these images and these beautiful photos are from various photographers.
7 You can find links to their original posts below.
8 The stream in the description.
9 One of them is from Morguefile, One is from unsplash, well, two are from Unsplash and one is from pixels there a little bit from all over the place, but you can find the photographers below if you'd like.
10 Hey Jennifer, saw the images.
11 I really love the top left and bottom right.
12 The colors are beautiful but I'm partial through colorful Sky scenes.
13 Yeah, I totally agree.
14 Let's see top left, bottom right.
15 Those yeah well top right also is like very Contra high contrast that tends to like grab my attention when I look at the sheet but I would say top left and bottom right give me the most like happy feels.
...

In Table 7, we study the most prominent latent codes (C1-3) and their associated utterances. We define representative utterances as those frequently assigned to these codes (Eq. (3)). We observe that C1 usually contains a skewed number of utterances that are commonly seen in the data and not representative of the input; C2 contains lengthy but not necessarily summary-worthy utterances. In our experiments, we exclude C1/C2 before performing grid search on all codes to find the set of prominent codes: we use P=50 tuned on the validation set, which is effective in helping identify summary utterances.

We conduct a human evaluation to assess how StreamHover compares to strong extractive and abstractive baselines. They are (a) LexRank (Erkan and Radev, 2004), (b) FluCovRank (Shang et al., 2018) and (c) BART-Large (Lewis et al., 2020); the latter two are abstractive systems. Each evaluator is shown a video clip with a synchronized display of the transcript followed by four system summaries, shown in random order to remove any positional bias. The evaluator is asked to select the best and worst of the summaries according to each of these criteria:
• Fluency/Coherence: is the summary well-presented, grammatically correct and easy to read?
• Informativeness: does the summary provide useful information about the video clip?
• Overall Quality: is the summary of good quality considering both content and linguistic aspects?
We randomly sample 100 clips from the test set. Each clip and its summaries are judged by five evaluators that we recruit from Amazon Mechanical Turk. Table 5 shows the performance of all systems measured by Best-Worst Scaling (Kiritchenko and Mohammad, 2016), where the score of a system is computed as the percentage of times it was selected as the best minus the percentage of times it was selected as the worst. The range of scores is [-1,1]. Figure 5 shows how frequently a system is chosen to produce the "best summary." We observe that StreamHover achieves an overall score of 0.52 and is selected as the best summary more than half of the time.
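The Best-Worst Scaling score described above reduces to a simple count. The sketch below assumes each judgment is recorded as a (best, worst) pair of system names; the function name and the toy judgments over placeholder systems A-D are hypothetical.

```python
from collections import Counter


def best_worst_scores(judgments, systems):
    """judgments: list of (best_system, worst_system) pairs, one per evaluator decision.
    Score = fraction of times chosen best minus fraction of times chosen worst."""
    best = Counter(b for b, _ in judgments)
    worst = Counter(w for _, w in judgments)
    total = len(judgments)
    return {s: (best[s] - worst[s]) / total for s in systems}


# Toy judgments (hypothetical) over four systems A-D.
toy_judgments = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
toy_bws = best_worst_scores(toy_judgments, ["A", "B", "C", "D"])
```

A system always picked as best scores 1, one always picked as worst scores -1, and the scores of all systems sum to zero when every judgment names one best and one worst.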

Conclusion
We present StreamHover, a new framework for annotating and summarizing livestream transcripts. Our dataset contains over 500 hours of videos annotated with extractive and abstractive summaries. We explored an extractive method leveraging VQ-VAE to identify salient summary utterances and obtained strong results. Future work includes boosting summarization solutions to provide users a concentrated overview of streamed content.

A Bēhance Dataset
We collect a total of 5,398 streamed videos from Behance.net. Some streamers opt out of the transcription service provided by Microsoft Automatic Speech Recognition, so transcripts are not available for these videos. We create a list of domain keywords by finding the 50 most frequently appearing words in video titles (stopwords are excluded). Examples include 'fresco', 'adobe', 'photoshop', 'illustration', 'art', 'painting', 'drawing', 'illustrator', 'character' and 'design'. The keywords are used to select videos for human annotation: 2,360 videos have transcripts available and contain at least one of our domain keywords in their titles. These videos are split into 5-minute clips. Some clips contain little or no verbal content. We thus remove clips that contain very few words (≤333 words) or utterances (≤38 utterances). These thresholds are determined using the average values of all clips. Videos with fewer than 5 valid clips are also removed from consideration. This preprocessing step gives 6,003 clips from 381 videos. During annotation, our annotators find 582 clips to contain only chit-chat, suggesting that these clips are uninformative. Eleven videos contain only chit-chat clips and are subsequently removed from the dataset, yielding a total of 5,421 clips from 370 videos that are split into train, validation and test sets.
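The clip and video filtering described above can be sketched as follows. This is an illustration under stated assumptions: words are counted by whitespace splitting, each clip is given as a `(video_id, utterances)` pair, and all names are hypothetical; only the three thresholds come from the text.

```python
from collections import defaultdict

MAX_REMOVED_WORDS = 333       # clips with <= 333 words are removed
MAX_REMOVED_UTTERANCES = 38   # clips with <= 38 utterances are removed
MIN_VALID_CLIPS = 5           # videos with fewer than 5 valid clips are removed

def is_valid_clip(utterances):
    """A clip is kept only if it exceeds both thresholds."""
    n_words = sum(len(u.split()) for u in utterances)  # whitespace tokens (assumption)
    return n_words > MAX_REMOVED_WORDS and len(utterances) > MAX_REMOVED_UTTERANCES

def filter_clips(clips):
    """clips: list of (video_id, utterances) pairs (hypothetical layout).

    Returns a dict video_id -> list of valid clips, keeping only videos
    that retain at least MIN_VALID_CLIPS valid clips.
    """
    by_video = defaultdict(list)
    for video_id, utterances in clips:
        if is_valid_clip(utterances):
            by_video[video_id].append(utterances)
    return {v: cs for v, cs in by_video.items() if len(cs) >= MIN_VALID_CLIPS}

# Toy data: a "good" clip has 39 utterances of 9 words each (351 words).
good_clip = [" ".join(["word"] * 9)] * 39
sparse_clip = ["word"] * 39                      # 39 utterances, only 39 words
few_utts_clip = [" ".join(["word"] * 10)] * 38   # 380 words, only 38 utterances
toy_clips = [("a", good_clip)] * 5 + [("b", good_clip)] * 4 + [("b", sparse_clip)]
kept = filter_clips(toy_clips)
```

In the toy data, video "a" keeps all five clips, while video "b" drops to four valid clips and is therefore removed entirely.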

B Baseline Summarizers
Our neural abstractive baselines include pre-trained BART-large (Lewis et al., 2020), BART-large-cnn, and T5-large (Raffel et al., 2020). We follow the HuggingFace implementation (Wolf et al., 2020). Utterances that are longer than 5 words are concatenated into a flat sequence, which is used as the input to each summarizer. The generation settings are as follows: the maximum and minimum summary lengths are 150 and 15 tokens, respectively; we use a beam size of 5 with early stopping; the length penalty is 1.0; and "no_repeat_ngram_size" is set to 3, such that a trigram cannot occur more than once in the summary. Our extractive baselines include LexRank (Erkan and Radev, 2004), TextRank (Mihalcea and Tarau, 2004), and SumBasic (Vanderwende et al., 2007). They are implemented using the Sumy library, where we adopt the default text parser and stemmer. Our unsupervised summarizer for speech transcript summarization (Shang et al., 2018) uses the following settings, and we report the FluCovRank scores. The number of components used in LSA is 25. The number of utterance communities is 35. The number of clusters is 6, with a scaling factor of 1.3 and lambda of 0.4. The size of the summary is set to 50 words.
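As a rough sketch of the abstractive baseline setup, the snippet below prepares the flat input sequence and collects the decoding settings listed above. The keyword names match the arguments of the HuggingFace `generate` method; the helper name `build_summarizer_input`, the whitespace word count, and the use of a single space as the joiner are assumptions for illustration.

```python
# Decoding settings taken from the paragraph above, expressed with the
# keyword names used by HuggingFace's `generate` method.
GENERATION_KWARGS = {
    "max_length": 150,          # maximum summary length in tokens
    "min_length": 15,           # minimum summary length in tokens
    "num_beams": 5,             # beam size
    "early_stopping": True,
    "length_penalty": 1.0,
    "no_repeat_ngram_size": 3,  # a trigram cannot occur more than once
}

def build_summarizer_input(utterances, min_words=5):
    """Keep utterances longer than `min_words` words and join them into
    one flat string (whitespace word count is an assumption)."""
    kept = [u.strip() for u in utterances if len(u.split()) > min_words]
    return " ".join(kept)

toy_utterances = [
    "Oh, cool.",
    "We're going to be doing some digital painting today.",
    "one two three four five",  # exactly 5 words, so it is dropped
]
flat_input = build_summarizer_input(toy_utterances)

# Assumed usage with a loaded HuggingFace model and tokenizer (not run here):
#   inputs = tokenizer(flat_input, return_tensors="pt", truncation=True)
#   summary_ids = model.generate(**inputs, **GENERATION_KWARGS)
```

Truncating the input to the model's maximum length is an assumption of this sketch; the text does not say how over-length inputs are handled.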

C Example Summaries
We show example summaries generated by different summarizers: FluCovRank (Shang et al., 2018), LexRank (Erkan and Radev, 2004), BART-large (Lewis et al., 2020) and StreamHover. We also show the top-3 most prominent latent codes and their associated utterances. For each code, we choose the 5 representative utterances that are most frequently assigned to it. We observe that C1 utterances are frequently seen in the data (chit-chat) and not representative of the input. C2 is associated with lengthy but not necessarily summary-worthy utterances. C3 utterances are both comprehensive and contain diverse information. In our experiments, we exclude C1/C2 before performing grid search on all codes to find the set of prominent codes. This allows us to effectively identify summary utterances without biasing towards the lengthy ones.
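One way to picture the selection of prominent codes and their representative utterances is the counting sketch below. It assumes that each utterance comes with a list of latent code ids assigned to it (for example, one per latent vector); this data layout and both function names are hypothetical, and the sketch is not the paper's Eq. (3).

```python
from collections import Counter

def prominent_codes(assignments, top_k=3):
    """assignments: dict utterance_id -> list of code ids assigned to it
    (hypothetical layout). Returns the top_k most frequently used codes."""
    totals = Counter(code for codes in assignments.values() for code in codes)
    return [code for code, _ in totals.most_common(top_k)]

def representative_utterances(assignments, code, n=5):
    """Return up to n utterance ids ranked by how often `code` was
    assigned to them; ties are broken by utterance id."""
    counts = {u: codes.count(code) for u, codes in assignments.items()}
    ranked = sorted((u for u in counts if counts[u] > 0),
                    key=lambda u: (-counts[u], u))
    return ranked[:n]

# Toy assignments for four utterances and three codes.
toy_assignments = {0: [1, 1, 2], 1: [1, 3, 3], 2: [2, 2, 2], 3: [1, 1, 1]}
top_codes = prominent_codes(toy_assignments, top_k=2)
reps = representative_utterances(toy_assignments, code=1, n=2)
```

Excluding codes such as C1/C2 would then amount to dropping them from the ranked list before searching over the remaining codes.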
FluCovRank
• digital painting today / 3 hours minimum / explain my painting process / starbucks florida cup / really really cheerful scene / prob hey welcome hi kumar / canvas sizes 3000 by 2000 pixels / sore throat so todays stream might be a tiny bit shorter / otherwise just sit back and relax and enjoy the painting / find

LexRank
• If you guys are interested in painting along with me or doing a little sketch or whatever, you have time for your always welcome to my reference.
• Is what I have planned to work on this if I'm feeling good, we might go on or do another painting, but I'm drinking some tea with honey and I have to kind of be careful with my voice a little bit talking so much, but I'm otherwise I'm good.
• So again, that's where you can find my references below in the stream description.
• You guys are welcome to paint along if you're interested in doing that.
• I've lived in a few different states, but where I spent most of my life was in Florida so.

BART-Large
• Good afternoon, good evening, good morning, good day, good afternoon, how are you? Good morning, hi Mohammed, hi kumar, hi muhammad, hi hi kumar, hi hello hi hi hi Mohammed. I hope you've had a good day so far. I'm doing pretty good pro to be honest. I have a little bit of a sore throat, so today's stream might be a tiny bit shorter than some of the others. It's still going to be around 3 hours minimum. Is what I have planned to work on this if I'm feeling good, we might go on or do another painting, but I'm drinking some tea with honey and I have to kind of

Quantized Transformer
• We're going to be doing some digital painting today.
• As usual, I'm going to be working from this photo as a reference.
• I hope you're all doing well.
• Image is below in the stream description and above the chat.
• It's still going to be around 3 hours minimum.
• Is what I have planned to work on this if I'm feeling good, we might go on or do another painting, but I'm drinking some tea with honey and I have to kind of be careful with my voice a little bit talking so much, but I'm otherwise I'm good.
• Um prob hey welcome Hi Kumar.
• You guys are welcome to paint along if you're interested in doing that.
• I hope it's going to be good.
• My canvas sizes 3000 by 2000 pixels.

StreamHover (Ours)
• It's like a little cafe table in Spain, with some dappled lighting coming through these trees.
• I really like the pattern of light on the building over here, and the just overall feeling the different colors.
• If you guys are interested in painting along with me or doing a little sketch or whatever, you have time for your always welcome to my reference.
• Is what I have planned to work on this if I'm feeling good, we might go on or do another painting, but I'm drinking some tea with honey and I have to kind of be careful with my voice a little bit talking so much, but I'm otherwise I'm good.
• I am using pure ref to put the little reference image up here so we can see a thumbnail view of it while we're painting.

We're going to be doing some digital painting today. 8 As usual, I'm going to be working from this photo as a reference. 9 I think it's really beautiful.
x 10 It's like a little cafe table in Spain, with some dappled lighting coming through these trees.
x x 11 I really like the pattern of light on the building over here, and the just overall feeling the different colors.
x x 12 the Blues and pinks and yellows on the wall here. 13 It's really, really cheerful scene. 14 Hi Mohammed, high pro, how are you? 15 I hope you're all doing well.
x 16 If you guys are interested in painting along with me or doing a little sketch or whatever, you have time for your always welcome to my reference.
x x 17 Image is below in the stream description and above the chat. 18 Also, if you check the info panel it should be there as well so this photo is from Unsplash.
x 19 You can find a link to the original post from the photographer below. 20 The stream. 21 I'm doing pretty good pro to be honest. 22 I have a little bit of a sore throat, so today's stream might be a tiny bit shorter than some of the others.
x 23 It's still going to be around 3 hours minimum. 24 Is what I have planned to work on this if I'm feeling good, we might go on or do another painting, but I'm drinking some tea with honey and I have to kind of be careful with my voice a little bit talking so much, but I'm otherwise I'm good. 25 Um prob hey welcome Hi Kumar.
x 26 So again, that's where you can find my references below in the stream description. 27 You guys are welcome to paint along if you're interested in doing that. 28 Otherwise, just sit back and relax and enjoy the painting. 29 I'll try to do a nice job with this one. 30 I hope it's going to be good.
x 31 I really like the reference so. x

Table 9: A transcript snippet from Digital Painting with Maddy Bellwoar. We show the most prominent latent codes and their representative utterances ('X'). Human annotated summary utterances are colored gray and ultra-short utterances are crossed out.

FluCovRank
• view is created by them stitching images / maps years was a fitness studio yeah / image stitching just has some weird that happens team / oh if you want to share a link for map crunch / leave so weird parts / hope achieve that feeling of the water is really moving / yeah its

LexRank
• So I'm going to start to work on the foreground now and this is going to be tricky, so this is one of the things that made me choose to paint this reference.
• The rushing water in the foreground kind of coming towards us and the ripples that it creates an the little bubbles and all that kind of stuff.
• Sometimes the image stitching just has some weird stuff that happens, Oh team.
• Oh, if you want to share a link for map crunch.
• Alright that should work this time.

BART-Large
• Trying to figure out those relationships near the end or like when you're trying to like if you're going to try to get the one or like if the one is like if I'm working on the one I'm going to do it's like if it's not working. I'm ready for some dog like with half his body has been overlapped by a boat. Yeah, it's always really satisfying the color combination in these kind of desert scenes, the nice blue Sky and like the Warm Reds and oranges. It's like you get the extra bonus of like some purple thrown in there and that's just perfect. You know, I've seen people like riding camels and stuff, he's taking pictures

Quantized Transformer
• So I hope I can achieve that feeling of the water is really moving there.
• You have to click the share button and then copy the link that is there if you copy the link from the URL above it's not going to send the right image.
• So what you can do is click right here where it says share and then copy this.
• OK, so these so right now, I'm just going to start with a darker shadows.
• Of where I see the ripples.
• Alright that should work this time.
• You know where some of the really, really fantastic ones come from.
• But people will be on the boat on foot.
• Maps years was a fitness studio, yeah, there's all kinds of stuff.
• I've painted with references with stuff like that.

StreamHover (Ours)
• So I'm going to start to work on the foreground now and this is going to be tricky, so this is one of the things that made me choose to paint this reference.
• The rushing water in the foreground kind of coming towards us and the ripples that it creates an the little bubbles and all that kind of stuff.
• Yeah, it's always really satisfying the color combination in these kind of desert scenes, the nice blue Sky and like the Warm Reds and oranges.
• I'm ready for some dog like with half his body has been overlapped by a boat.
• People are in a car traveling there, you get a lot of side of the road like kind of bland images, but then you also get like the most amazing pictures and people also upload images and stuff, I think that might be.

Table 10: Example system summaries from Virtual Plein Air Landscape Painting. The BART summary is fluent but its content lacks specificity, as is the case for LexRank. The summary segments selected by FluCovRank are ungrammatical. StreamHover identifies on-topic and informative summary utterances.

Utterances C1 C2 C3
0 Trying to figure out those relationships near the end or like force them if they're not working. 1 So I'm going to start to work on the foreground now and this is going to be tricky, so this is one of the things that made me choose to paint this reference.
x 2 'cause I wanted to figure out how to paint this kind of situation. 3 The rushing water in the foreground kind of coming towards us and the ripples that it creates an the little bubbles and all that kind of stuff.
x x 4 So I hope I can achieve that feeling of the water is really moving there. 5 Might give some inspiration alright let's see got an artstation link. 6 Oh, cool. 7 Good vibes good vibes. 8 Other desert desert scene. 9 Yeah, it's always really satisfying the color combination in these kind of desert scenes, the nice blue Sky and like the Warm Reds and oranges.
x x 10 I like the one I'm working on 2 'cause. 11 It's like you get the extra bonus of like some purple thrown in there and that's just perfect.
x 12 For me. 13 Yeah, that's cool thanks. 14 Team I says, I was looking for a nice location on map French and I found this alright. 15 I'm ready for some dog like with half his body has been overlapped by a boat. 16 Sometimes the image stitching just has some weird stuff that happens, Oh team. 17 Oh, if you want to share a link for map crunch. 18 You have to click the share button and then copy the link that is there if you copy the link from the URL above it's not going to send the right image.
x 19 So what you can do is click right here where it says share and then copy this. 20 And then that will be the right one, 'cause I think it sent the wrong one from you. 21 Yeah. 22 Alright cool. 23 Will get back to painting?
x 24 OK, so these so right now, I'm just going to start with a darker shadows. 25 Of where I see the ripples.
x 26 I'm not worried about putting like every ripple in the right spot. 27 But we couldn't get the overall feel this. 28 Some more Brown here. 29 Alright that should work this time.
x 30 What's wrong with that picture?
x 31 Does actually seems nice? 32 I don't see anything wrong with that landscape Timo? 33 When you when you're using map crunch you're going to get a lot of like side of the road road views because you know if you think about where the car is going.
x 34 Order fact that it's your in the Google Maps. 35 People are in a car traveling there, you get a lot of side of the road like kind of bland images, but then you also get like the most amazing pictures and people also upload images and stuff, I think that might be.
x x 36 You know where some of the really, really fantastic ones come from. 37 But people will be on the boat on foot. 38 You know, I've seen people like riding camels and stuff, he's taking pictures for Google. 39 Maps years was a fitness studio, yeah, there's all kinds of stuff. 40 Yeah, but sometimes you get something weird because the 3D or the 360 Panorama. 41 View is created by them stitching images together. 42 So, sometimes you have weird situations where stuff overlaps.

• OK, here, we go ever it is yeah, some of these.

StreamHover (Ours)
• It's gotta stay and the same letter so they was the letter C and I'm just using my hands my fingers.
• The app and it's working it's a work in progress OK is constantly being updated with the features and bugs.
• But I thought we fresco does not have good, um Palm rejection and so when you using the pen and drawing whether you're on the iPad or surface.
• I kept them in the drawing and now they are part of the look of each of these letters, which is crazy.

x 1 It's gotta stay and the same letter so they was the letter C and I'm just using my hands my fingers.
x 2 I do have my Surface Pen somewhere it's up there. 3 I'm on the Surface Pro.
x 4 7 5 That's the latest song version of the surface. 6 It just came out early in October 20th 19th. 7 I know that in a few days were gonna be in 2020, which is why I want to have. 8 This file ready an complete I'm just listening. 9 Bump on Phone so yeah, that's the wrong letter, but let's go. 10 It's just he gone. 11 So I'm doing this little dots and what happens is we on with Adobe fresco is there. 12 It doesn't do a good job.
x 13 The app and it's working it's a work in progress OK is constantly being updated with the features and bugs.
x x 14 But I thought we fresco does not have good, um Palm rejection and so when you using the pen and drawing whether you're on the iPad or surface.
x x 15 You end up with all these little artifacts that when your Palm lays on the screen. 16 And so as I was drawing diesel letters. 17 I got to the point where.
x 18 I just cannot avoid these artifacts in Astana kept them. 19 I kept them in the drawing and now they are part of the look of each of these letters, which is crazy. x 40 OK, here, we go ever it is yeah, some of these. 41 Pan's had details that were missing right now, I'm using the vector pen the vector brushes are in.
x 42 There we go. 43 I'ma try not to yell but I know I can hear. 44 Just a big bounce in my voice and Missy Volume. 45 OK, so now we're back we're back 'cause I could hear the sound an it was very loud.
x 46 I'm sorry I already have a loud voice.

FluCovRank
• learning process proceed / draw this out and post it where people / sit down an just trying to draw a picture / spent like 3 hours / alot like ill draw something like old man i want to draw / consistently draw this character and erase and restart and you cant necessarily / practice will

LexRank
• The practice will do a lot more for you than drawing a picture.
• You might get one picture that might turn out really cool and you can post it and stuff like that, but if you work on your practice and you get that practice the way that you want it, you can use that practice forever for more than one picture.
• You stop trying to make the picture and get back to practice.
• So if there's things that you need to practice and stuff like that, you stand to gain a lot more from that practice then.
• Sit down an just trying to draw a picture.

BART-Large
• I do I like to do more practice in anything. Practices gives me this feel that I've never had when I was younger. You know, of course, everybody will tell you. You might get one picture that might turn out really cool and you can post it and stuff like that, but if you work on your practice and you get that practice the way that you want it, you can use that practice forever for more than one picture. So that's generally would always tell myself when I'm drawing, and I'm like. Oh crap, here I go again. I should draw this into a picture and then I end up skipping practice to save whatever I'm doing as a picture so that I can. Just 'cause

Quantized Transformer
• Trying to make a picture, right?
• And then I'm like OK alright I gotta I gotta center myself You know I gotta always gotta check myself every now and again or back that was kind of how it wasn't back then but now I don't really like to draw pictures.
• I do I like to do more practice in anything.
• You sit up and you put on like references for the superhero be in movies or something like that.
• You pull up references on whatever social media or wherever you get your images from.
• Whereas if you were just trying to practice, you would realize that all of that stuff that you're about to throw away or disregard was actually what you were actually searching for.

StreamHover (Ours)
• Doesn't matter what styles you draw, so don't overly complicated by like Oh my God, you know now you could draw it.
• I used to do that a lot when I was younger where I would do something and be like holy crap, didn't think that would come out that way.
• And then I'm like OK alright I gotta I gotta center myself You know I gotta always gotta check myself every now and again or back that was kind of how it wasn't back then but now I don't really like to draw pictures.
• They don't see it happening until it's too late where it's like, Oh Man, I could totally practice this right now, or I could Draw Something Super Duper cool, you know, a piece of fan art or something like that.
• You're going to struggle with for a couple of days or couple minutes and then figure out you can't draw it or you have a hard time drawing it and then just giving up.

x 10 I don't want to get caught up in that thing that I always do where when I'm drawing a face or I'm doing practice. 11 I always do that thing where it's like holy crap you know this is a really good face. 12 I should draw this into a picture and then I end up skipping practice to save whatever I'm doing as a picture so that I can.
x 13 You know? 14 Posted on Instagram or something like that. 15 I used to do that a lot when I was younger where I would do something and be like holy crap, didn't think that would come out that way.
x 16 I don't want to waste this in practice.
x 17-21 . . . 22 You might get one picture that might turn out really cool and you can post it and stuff like that, but if you work on your practice and you get that practice the way that you want it, you can use that practice forever for more than one picture.
x 23 So that's generally would always tell myself when I'm drawing, and I'm like. 24 Oh crap, here I go again. 25 Trying to make a picture, right?
x 26 You stop trying to make the picture and get back to practice. 27 And then I'm like OK alright I gotta I gotta center myself You know I gotta always gotta check myself every now and again or back that was kind of how it wasn't back then but now I don't really like to draw pictures.
x x 28 I do I like to do more practice in anything. 29 Practices gives me this feel that I've never had when I was younger. 30 When I was younger, which is so eager to get people to see my work that I wasn't necessarily thinking about the overall count, their overall.
x 31 Affect that, we're going to have on. 32 My learning process proceed. 33 And it had a big big, big big like effect on it took me longer to learn stuff because when you sit down you feel like you don't want to do anything unless you're actually drawing something to show off, which is something you don't necessarily want to get caught up in, which happens a lot of people they don't.
x 35-36 . . . 37 Get way more out of practice because practice is more focused on getting you better at art where doing a picture or whatever is cool, but it's more focused on showing off what you can do or what you already know.
x 38 So if there's things that you need to practice and stuff like that, you stand to gain a lot more from that practice then. 39 You know? 40 Sit down an just trying to draw a picture. 41 You're going to struggle with for a couple of days or couple minutes and then figure out you can't draw it or you have a hard time drawing it and then just giving up. 42 Which is something that would happen alot like I'll Draw Something like old man, you know I want to draw this. 43 Superhero, I've had a problem with drawing the superhero forever, but today is going to be the day I'm going to nail it.
x

Table 15: A transcript snippet from Call me Derek: Value Study. We show the most prominent latent codes and their representative utterances ('X'). Human annotated summary utterances are colored gray and ultra-short utterances are crossed out.

FluCovRank
• people start off always drawing characters / favorite characters from sailor moon / mohammed roller show / yeah people are different or personalities / example prefer environment designed the character design and even though i like looking at character art by prefer to make environments / advice with a grain of salt / people specialized the

LexRank
• Some people like to have things mixed up all the time where they get bored and other people find that one thing they like and never tire of it.
• Or what I meant is that some people start off always drawing characters, for example like they just from.
• An anyway, some artists just like for example, love drawing characters and always draw them and never get into environments, and they go on and become a good working concepts are it could work in visual development as a character artist who knows what.
• Just because someone specializes, it doesn't necessarily mean that they like tried all the other things first.
• Some people sort of specialized the whole time just because that was their preference, you know.

BART-Large
• I really works good to see you. Yeah, people are different or personalities. What might work for them isn't necessarily right right for you for any number of reasons, like what David said just from person to person, those preferences can be really different. I for example, prefer environment designed the character design, and even though I like looking at character art by prefer to make environments, and that's not. I think that's why you have to take advice with a grain of salt like. See I do think people do though Omar, like maybe maybe you're not relating to it 'cause it's just not what your preferences are but. Definitely, I think it's important to seek out advice from professional artists and from people

Quantized Transformer
• It's about what you like doing.
• I really works good to see you.
• A specialist most of the time is someone with more experience.
• I mean I like for example, I started drawing in like when I was a kid and when I was in middle school and high school was drawn.
• Just because someone specializes, it doesn't necessarily mean that they like tried all the other things first.
• This was the first study.
• We're doing cloud studies for part of the art club were doing clouds in different scenarios, this time different.
• This was the second one.
• See I do think people do though Omar, like maybe maybe you're not relating to it 'cause it's just not what your preferences are but.
• All thanks claires you thank you.
• So I do hope that it helps.

StreamHover (Ours)
• Some people like to have things mixed up all the time where they get bored and other people find that one thing they like and never tire of it.
• An anyway, some artists just like for example, love drawing characters and always draw them and never get into environments, and they go on and become a good working concepts are it could work in visual development as a character artist who knows what.
• We're doing cloud studies for part of the art club were doing clouds in different scenarios, this time different.
• See I do think people do though Omar, like maybe maybe you're not relating to it 'cause it's just not what your preferences are but.
• I for example, prefer environment designed the character design, and even though I like looking at character art by prefer to make environments, and that's not.

x 2 Specialist role. 3 Yeah, people are different or personalities. 4 Some people like to have things mixed up all the time where they get bored and other people find that one thing they like and never tire of it.
x x 5 And it really depends. 6 It really depends. 7 I think that's why you have to take advice with a grain of salt like. 8 Definitely, I think it's important to seek out advice from professional artists and from people that have sort of gone down the path that you're trying to go down. 9 That's going to make things easier for you, but you know, take the advice with a grain of salt because. 10 What might work for them isn't necessarily right right for you for any number of reasons, like what David said just from person to person, those preferences can be really different.
x 11 Exactly Pablo is at the end of the day. 12 It's about what you like doing.
x 13 Alright, I'm gonna grab a little more textured brush again. 14 I'm kind of going back and forth with harder brushes and textured brushes. 15 I really works good to see you. 16 It's not about that. 17 A specialist most of the time is someone with more experience. 18 That's why he's a specialist.
x 19 Oh, I understand your thoughts on that. 20 I would agree for the most part, but I guess the part I would disagree on is that. 21 Or what I meant is that some people start off always drawing characters, for example like they just from. 22 I mean I like for example, I started drawing in like when I was a kid and when I was in middle school and high school was drawn.
x 23 All my favorite characters from Sailor Moon and all that kind of stuff. 24 An anyway, some artists just like for example, love drawing characters and always draw them and never get into environments, and they go on and become a good working concepts are it could work in visual development as a character artist who knows what.
x x 25 Just because someone specializes, it doesn't necessarily mean that they like tried all the other things first. 26 Maybe they did. 27 It's possible they did, but I don't think that it has to be that you have to do everything before you specialize.
x 28 Some people sort of specialized the whole time just because that was their preference, you know. 29 From the beginning. 30 If anybody is just coming in. 31 Mohammed roller, I can show you what we did so far. 32 This was the first study.
x 33 We're doing cloud studies for part of the art club were doing clouds in different scenarios, this time different. 34 Like lightings, different times of day. 35 This was the second one.
x 36 And. 37 This is what we're working on right now.
x 38 See I do think people do though Omar, like maybe maybe you're not relating to it 'cause it's just not what your preferences are but. 39 I for example, prefer environment designed the character design, and even though I like looking at character art by prefer to make environments, and that's not.
x x 40 You know, that's just my preference. 41 All thanks claires you thank you.

Human annotated summary utterances are colored gray and ultra-short utterances are crossed out.