Sangwoo Cho


2021

StreamHover: Livestream Transcript Summarization and Annotation
Sangwoo Cho | Franck Dernoncourt | Tim Ganter | Trung Bui | Nedim Lipka | Walter Chang | Hailin Jin | Jonathan Brandt | Hassan Foroosh | Fei Liu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

With the explosive growth of livestream broadcasting, there is an urgent need for new summarization technology that enables us to create a preview of streamed content and tap into this wealth of knowledge. However, the problem is nontrivial due to the informal nature of spoken language. Further, there has been a shortage of annotated datasets that are necessary for transcript summarization. In this paper, we present StreamHover, a framework for annotating and summarizing livestream transcripts. With a total of over 500 hours of videos annotated with both extractive and abstractive summaries, our benchmark dataset is significantly larger than existing annotated corpora. We explore a neural extractive summarization model that leverages a vector-quantized variational autoencoder to learn latent vector representations of spoken utterances and identify salient utterances from the transcripts to form summaries. We show that our model generalizes better and improves performance over strong baselines. The results of this study provide an avenue for future research to improve summarization solutions for efficient browsing of livestreams.

2020

Better Highlighting: Creating Sub-Sentence Summary Highlights
Sangwoo Cho | Kaiqiang Song | Chen Li | Dong Yu | Hassan Foroosh | Fei Liu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Amongst the best means to summarize is highlighting. In this paper, we aim to generate summary highlights to be overlaid on the original documents to make it easier for readers to sift through a large amount of text. The method allows summaries to be understood in context to prevent a summarizer from distorting the original meaning, of which abstractive summarizers usually fall short. In particular, we present a new method to produce self-contained highlights that are understandable on their own to avoid confusion. Our method combines determinantal point processes and deep contextualized representations to identify an optimal set of sub-sentence segments that are both important and non-redundant to form summary highlights. To demonstrate the flexibility and modeling power of our method, we conduct extensive experiments on summarization datasets. Our analysis provides evidence that highlighting is a promising avenue of research towards future summarization.

2019

Multi-Document Summarization with Determinantal Point Processes and Contextualized Representations
Sangwoo Cho | Chen Li | Dong Yu | Hassan Foroosh | Fei Liu
Proceedings of the 2nd Workshop on New Frontiers in Summarization

Having emerged as one of the best-performing techniques for extractive summarization, determinantal point processes (DPPs) select the most probable set of summary sentences according to a probabilistic measure that models sentence prominence and pairwise repulsion separately. Traditionally, both aspects are modeled using shallow and linguistically informed features, but the rise of deep contextualized representations raises an interesting question: whether, and to what extent, contextualized sentence representations could be used to improve the DPP framework. Our findings suggest that, despite the success of deep semantic representations, it remains necessary to combine them with surface indicators for effective identification of summary-worthy sentences.
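
The selection idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the common L-ensemble form of a DPP, in which the kernel entry for sentences i and j is q_i * S_ij * q_j (q for prominence, S for pairwise similarity), and it swaps in a simple greedy search for the most probable subset. The function names `dpp_kernel` and `greedy_dpp_select`, and the toy scores, are hypothetical.

```python
import numpy as np

def dpp_kernel(quality, similarity):
    """Build an L-ensemble kernel: L[i, j] = q_i * S[i, j] * q_j (assumed form)."""
    q = np.asarray(quality, dtype=float)
    S = np.asarray(similarity, dtype=float)
    return q[:, None] * S * q[None, :]

def greedy_dpp_select(L, k):
    """Greedily pick up to k items, each time adding the item that yields
    the largest determinant of the selected submatrix of L."""
    selected = []
    for _ in range(k):
        best_item, best_det = None, 0.0
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            det = np.linalg.det(L[np.ix_(idx, idx)])
            if det > best_det:
                best_item, best_det = i, det
        # Stop early if no remaining item gives a usefully positive determinant.
        if best_item is None or best_det <= 1e-12:
            break
        selected.append(best_item)
    return selected

# Toy data (made up): sentences 0 and 1 are near-duplicates,
# sentence 2 is less prominent but dissimilar to both.
quality = [1.0, 0.9, 0.5]
similarity = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
summary = greedy_dpp_select(dpp_kernel(quality, similarity), k=2)
print(summary)
```

In this toy setting the sketch picks sentence 0 first and then prefers sentence 2 over the more prominent sentence 1, because the determinant penalizes the high similarity between sentences 0 and 1. That is the repulsion effect the abstract refers to.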

Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization
Sangwoo Cho | Logan Lebanoff | Hassan Foroosh | Fei Liu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The most important obstacles facing multi-document summarization include excessive redundancy in source descriptions and the looming shortage of training data. These obstacles prevent encoder-decoder models from being used directly, but optimization-based methods such as determinantal point processes (DPPs) are known to handle them well. In this paper we seek to strengthen a DPP-based method for extractive multi-document summarization by presenting a novel similarity measure inspired by capsule networks. The approach measures redundancy between a pair of sentences based on surface form and semantic information. We show that our DPP system with an improved similarity measure performs competitively, outperforming strong summarization baselines on benchmark datasets. Our findings are particularly meaningful for summarizing documents created by multiple authors containing redundant yet lexically diverse expressions.