GHAN: Graph-Based Hierarchical Aggregation Network for Text-Video Retrieval

Yahan Yu, Bojie Hu, Yu Li


Abstract
Text-video retrieval centers on two problems: cross-modality interaction and video-language encoding. The mainstream approach trains a joint embedding space for multimodal interaction; however, structural and semantic differences between text and video make such a space ill-suited to fine-grained understanding. To address this, we propose an end-to-end graph-based hierarchical aggregation network (GHAN) for text-video retrieval that exploits the hierarchical structure of both text and video. We design a token-level weighted network to refine intra-modality representations and construct a graph-based message-passing attention network for global-local alignment across modalities. On the public benchmarks MSR-VTT-9K, MSR-VTT-7K, and MSVD, GHAN achieves Recall@1 of 73.0%, 65.6%, and 64.0%, which is 25.7%, 16.5%, and 14.2% better than the current state-of-the-art model, respectively.
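
The abstract describes the cross-modal alignment module only at a high level. As a rough illustration of the general idea, below is a minimal PyTorch sketch of one round of attention-weighted message passing over a bipartite text-video graph; the class name, dimensions, and all design details are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class MessagePassingAttention(nn.Module):
    """Illustrative sketch: one round of attention-weighted message passing
    from video-frame nodes to text-token nodes on a bipartite graph."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)  # projects text nodes to queries
        self.k = nn.Linear(d_model, d_model)  # projects video nodes to keys
        self.v = nn.Linear(d_model, d_model)  # projects video nodes to messages
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # text:  (batch, n_tokens, d_model) -- local text-token node features
        # video: (batch, n_frames, d_model) -- local video-frame node features
        # Edge weights on the bipartite graph via scaled dot-product attention.
        attn = torch.softmax(
            self.q(text) @ self.k(video).transpose(1, 2) / text.size(-1) ** 0.5,
            dim=-1,
        )
        # Aggregate video messages into each text node; residual + layer norm.
        return self.norm(text + attn @ self.v(video))

# Toy usage: align 8 text tokens with 12 video frames in a 256-d space.
layer = MessagePassingAttention(256)
out = layer(torch.randn(2, 8, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 8, 256])

In a full model, such a layer would be stacked and applied in both directions (text-to-video and video-to-text), with pooled global features attending to local nodes for the global-local alignment the abstract mentions.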
Anthology ID:
2022.emnlp-main.374
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5547–5557
URL:
https://aclanthology.org/2022.emnlp-main.374
DOI:
10.18653/v1/2022.emnlp-main.374
Cite (ACL):
Yahan Yu, Bojie Hu, and Yu Li. 2022. GHAN: Graph-Based Hierarchical Aggregation Network for Text-Video Retrieval. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5547–5557, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
GHAN: Graph-Based Hierarchical Aggregation Network for Text-Video Retrieval (Yu et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.374.pdf