SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition

Hao Wang, Shuhei Kurita, Shuichiro Shimizu, Daisuke Kawahara

Abstract
Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR) that uses video as a complement to audio. In AVSR, considerable effort has been directed at datasets for facial features such as lip reading, but these datasets often fall short in evaluating image comprehension capabilities in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset built from scientific paper explanation videos. SlideAVSR provides a new benchmark in which models transcribe speech utterances with the help of texts on the slides in the presentation recordings. Because the technical terms that frequently appear in paper explanations are notoriously difficult to transcribe without reference texts, SlideAVSR spotlights a new aspect of AVSR problems. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from slides, and confirm its effectiveness on SlideAVSR.
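The core idea behind DocWhisper, conditioning the recognizer on slide text, can be sketched as prompt-conditioned decoding with Whisper. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: it assumes slide frames are OCR'd with pytesseract, keywords are ranked by simple frequency (a stand-in for the paper's keyword selection), and the result is passed through the openai-whisper package's initial_prompt parameter, which biases decoding toward the supplied vocabulary.

    # Minimal sketch of slide-conditioned ASR in the spirit of DocWhisper.
    # Assumptions (not from the paper's code): slide text is pre-extracted
    # with pytesseract, and keywords are injected via Whisper's initial_prompt.
    import collections
    import re

    import pytesseract
    import whisper
    from PIL import Image

    def slide_keywords(image_path: str, top_k: int = 50) -> str:
        """OCR a slide frame and keep the most frequent word-like tokens."""
        text = pytesseract.image_to_string(Image.open(image_path))
        tokens = re.findall(r"[A-Za-z][A-Za-z0-9-]{2,}", text)
        counts = collections.Counter(t.lower() for t in tokens)
        return ", ".join(word for word, _ in counts.most_common(top_k))

    model = whisper.load_model("large-v2")
    prompt = slide_keywords("slide_frame.png")  # hypothetical file name
    # initial_prompt conditions the decoder on the slide vocabulary, which
    # helps with technical terms that plain ASR tends to misrecognize.
    result = model.transcribe("talk_audio.wav", initial_prompt=prompt)
    print(result["text"])

In this setup, the prompt acts only as a soft bias: Whisper is free to ignore terms that do not match the audio, which is why supplying OCR noise alongside genuine terminology is usually tolerable.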
Anthology ID:
2024.alvr-1.11
Volume:
Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Jing Gu, Tsu-Jui (Ray) Fu, Drew Hudson, Asli Celikyilmaz, William Wang
Venues:
ALVR | WS
Publisher:
Association for Computational Linguistics
Pages:
129–137
URL:
https://aclanthology.org/2024.alvr-1.11
Cite (ACL):
Hao Wang, Shuhei Kurita, Shuichiro Shimizu, and Daisuke Kawahara. 2024. SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition. In Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), pages 129–137, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition (Wang et al., ALVR-WS 2024)
PDF:
https://aclanthology.org/2024.alvr-1.11.pdf