Akisato Kimura
2024
Video Discourse Parsing and Its Application to Multimodal Summarization: A Dataset and Baseline Approaches
Tsutomu Hirao
|
Naoki Kobayashi
|
Hidetaka Kamigaito
|
Manabu Okumura
|
Akisato Kimura
Findings of the Association for Computational Linguistics: EMNLP 2024
This paper tackles a new task: discourse parsing for videos, inspired by text discourse parsing based on Rhetorical Structure Theory (RST). The task aims to construct an RST tree for a video to represent its storyline and illustrate the event relationships. We first construct a benchmark dataset by identifying events with their time spans, providing corresponding captions, and constructing RST trees with events as leaves. We then evaluate baseline approaches to video RST parsing: the ‘parsing after captioning’ framework and parsing via visual features. The results show that a parser using gold captions performed the best, while parsers relying on generated captions performed the worst; a parser using visual features provided intermediate performance. However, we observed that parsing via visual features could be improved by pre-training it with video captioning designed to produce a coherent video story. Furthermore, we demonstrated that RST trees obtained from videos contribute to multimodal summarization consisting of keyframes with texts.