Assist Non-native Viewers: Multimodal Cross-Lingual Summarization for How2 Videos
Nayu Liu | Kaiwen Wei | Xian Sun | Hongfeng Yu | Fanglong Yao | Li Jin | Guo Zhi | Guangluan Xu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Multimodal summarization for videos aims to generate summaries from multi-source information (videos, audio transcripts) and has achieved promising progress. However, existing works are restricted to monolingual video scenarios, ignoring the demand of non-native viewers to understand cross-language videos in practical applications. This motivates us to propose a new task, named Multimodal Cross-Lingual Summarization for videos (MCLS), which aims to generate cross-lingual summaries from the multimodal inputs of videos. First, to make it applicable to MCLS scenarios, we construct a Video-guided Dual Fusion network (VDF) that integrates multimodal and cross-lingual information via diverse fusion strategies at both the encoder and the decoder. Moreover, to alleviate the high annotation cost and limited resources of MCLS, we propose a triple-stage training framework that assists MCLS by transferring knowledge from monolingual multimodal summarization data. The framework consists of: 1) multimodal summarization on abundant videos in a prevalent language with a VDF model; 2) knowledge distillation (KD)-guided adjustment on bilingual transcripts; 3) multimodal summarization for cross-lingual videos with the KD-induced VDF model. Experimental results on the reorganized How2 dataset show that the VDF model alone outperforms previous methods for multimodal summarization, and that the performance further improves by a large margin with the proposed triple-stage training framework.
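To make the KD-guided stage of the triple-stage framework concrete, the following is a minimal PyTorch sketch: a student summarizer producing cross-lingual summaries is trained with a mixed loss that combines cross-entropy on gold summaries with a KL term pulling its token distribution toward a frozen monolingual teacher. The `VDF` class, all dimensions, the concatenation-based encoder fusion, and the shared multilingual vocabulary (needed for the distributions to be comparable) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VDF(nn.Module):
    """Hypothetical stand-in for the paper's Video-guided Dual Fusion
    network: a transformer encoder-decoder whose encoder fuses video
    features with transcript embeddings. Fusion here is a simple
    sequence concatenation, chosen only for illustration."""
    def __init__(self, vocab_size=8000, d_model=256, video_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)  # project frame features
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), 2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, transcript_ids, video_feats, summary_ids):
        text = self.embed(transcript_ids)              # (B, Lt, d)
        video = self.video_proj(video_feats)           # (B, Lv, d)
        memory = self.encoder(torch.cat([video, text], dim=1))
        dec = self.decoder(self.embed(summary_ids), memory)
        return self.out(dec)                           # (B, Ls, vocab)

def kd_step(student, teacher, batch, optimizer, alpha=0.5, T=2.0):
    """One KD-guided update: cross-entropy on the gold cross-lingual
    summary plus temperature-scaled KL toward the frozen monolingual
    teacher. `alpha` and `T` are illustrative hyperparameters."""
    logits_s = student(batch["src_ids"], batch["video"], batch["tgt_in"])
    with torch.no_grad():  # teacher provides soft targets only
        logits_t = teacher(batch["teacher_src_ids"], batch["video"], batch["tgt_in"])
    flat_s = logits_s.flatten(0, 1)                    # (B*Ls, vocab)
    flat_t = logits_t.flatten(0, 1)
    ce = F.cross_entropy(flat_s, batch["tgt_out"].flatten())
    kd = F.kl_div(F.log_softmax(flat_s / T, dim=-1),
                  F.softmax(flat_t / T, dim=-1),
                  reduction="batchmean") * T * T
    loss = (1 - alpha) * ce + alpha * kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this reading, stage 1 would train a VDF teacher with plain cross-entropy on monolingual data, stage 2 would run `kd_step` over bilingual transcripts, and stage 3 would continue training the KD-adjusted student on the cross-lingual video data.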