Relation-aware Video Reading Comprehension for Temporal Language Grounding

Jialin Gao, Xin Sun, Mengmeng Xu, Xi Zhou, Bernard Ghanem


Abstract
Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary regression task or as a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. The framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor matches the visual and textual information simultaneously at the sentence-moment and token-moment levels, yielding a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor leverages graph convolution to capture the dependencies among video moment choices for the best choice selection. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Code will be available at https://github.com/Huntersxsx/RaNet.
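The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (that lives in the RaNet repository). Every function name below is hypothetical, and several design choices are assumptions made only for illustration: mean-pooling clips to represent a moment choice, element-wise fusion for the coarse and fine interactions, linking choices that share a start or end index, and a single linear scoring layer with untrained random weights.

```python
import numpy as np

def enumerate_choices(num_clips):
    """Predefined answer set: every (start, end) clip-index pair with start <= end."""
    return [(s, e) for s in range(num_clips) for e in range(s, num_clips)]

def choice_features(clip_feats, choices):
    """Assumption: a moment choice is represented by mean-pooling its clips."""
    return np.stack([clip_feats[s:e + 1].mean(axis=0) for s, e in choices])

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def choice_query_interaction(choice_feats, token_feats):
    """Coarse (sentence-moment) plus fine (token-moment) fusion, as a sketch."""
    d = token_feats.shape[1]
    sentence = token_feats.mean(axis=0)                  # (d,) pooled query
    coarse = choice_feats * sentence                     # sentence-moment level
    attn = softmax(choice_feats @ token_feats.T / np.sqrt(d), axis=1)  # (n, t)
    fine = choice_feats * (attn @ token_feats)           # token-moment level
    return coarse + fine

def relation_adjacency(choices):
    """Assumption: two choices are related if they share a start or an end index."""
    starts = np.array([s for s, _ in choices])
    ends = np.array([e for _, e in choices])
    adj = (starts[:, None] == starts[None, :]) | (ends[:, None] == ends[None, :])
    adj = adj.astype(float)                              # includes self-loops
    return adj / adj.sum(axis=1, keepdims=True)          # row-normalize

def graph_conv(x, adj, weight):
    """One graph-convolution step with ReLU and a residual connection."""
    return np.maximum(adj @ x @ weight, 0.0) + x

def select_moment(clip_feats, token_feats, w_gcn, w_score):
    """Score every moment choice and return the best one with all scores."""
    choices = enumerate_choices(len(clip_feats))
    x = choice_features(clip_feats, choices)
    x = choice_query_interaction(x, token_feats)
    x = graph_conv(x, relation_adjacency(choices), w_gcn)
    scores = x @ w_score
    return choices[int(np.argmax(scores))], scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clips = rng.normal(size=(4, 8))    # 4 video clips, 8-dim features
    tokens = rng.normal(size=(5, 8))   # 5 query tokens, 8-dim features
    best, scores = select_moment(
        clips, tokens, rng.normal(size=(8, 8)), rng.normal(size=8)
    )
    print("selected (start, end) clip indices:", best)
```

With random weights the selected span is meaningless. The sketch only shows how the data flows: the answer set is enumerated once, each choice is fused with the query at two granularities, and the graph step lets related choices exchange information before the best one is picked.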
Anthology ID:
2021.emnlp-main.324
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3978–3988
URL:
https://aclanthology.org/2021.emnlp-main.324
DOI:
10.18653/v1/2021.emnlp-main.324
Cite (ACL):
Jialin Gao, Xin Sun, Mengmeng Xu, Xi Zhou, and Bernard Ghanem. 2021. Relation-aware Video Reading Comprehension for Temporal Language Grounding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3978–3988, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Relation-aware Video Reading Comprehension for Temporal Language Grounding (Gao et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.324.pdf
Video:
 https://aclanthology.org/2021.emnlp-main.324.mp4
Code:
Huntersxsx/RaNet
Data:
ActivityNet Captions, Charades, Charades-STA