BehancePR: A Punctuation Restoration Dataset for Livestreaming Video Transcript

Viet Lai, Amir Pouran Ben Veyseh, Franck Dernoncourt, Thien Nguyen


Abstract
Given the increasing number of livestreaming videos, automatic speech recognition and post-processing for livestreaming video transcripts are crucial for efficient data management as well as knowledge mining. A key step in this process is punctuation restoration, which restores fundamental text structures such as phrase and sentence boundaries from the video transcripts. This work presents a new human-annotated corpus, called BehancePR, for punctuation restoration in livestreaming video transcripts. Our experiments on BehancePR demonstrate the challenges of punctuation restoration for this domain. Furthermore, we show that popular natural language processing toolkits like Stanford Stanza, spaCy, and Trankit underperform on detecting sentence boundaries on non-punctuated transcripts of livestreaming videos. The dataset is publicly accessible at http://github.com/nlp-uoregon/behancepr.
Anthology ID:
2022.findings-naacl.149
Volume:
Findings of the Association for Computational Linguistics: NAACL 2022
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1943–1951
URL:
https://aclanthology.org/2022.findings-naacl.149
DOI:
10.18653/v1/2022.findings-naacl.149
Cite (ACL):
Viet Lai, Amir Pouran Ben Veyseh, Franck Dernoncourt, and Thien Nguyen. 2022. BehancePR: A Punctuation Restoration Dataset for Livestreaming Video Transcript. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1943–1951, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
BehancePR: A Punctuation Restoration Dataset for Livestreaming Video Transcript (Lai et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-naacl.149.pdf
Video:
https://aclanthology.org/2022.findings-naacl.149.mp4
Code:
nlp-uoregon/behancepr