ORANGE: Text-video Retrieval via Watch-time-aware Heterogeneous Graph Contrastive Learning

Yucheng Lin, Tim Chang, Yaning Chang, Jianqiang Ma, Donghui Li, Ting Peng, Zang Li, Zhiyi Zhou, Feng Wang


Abstract
With the explosive growth of short-video data on industrial video-sharing platforms such as TikTok and YouTube, text-video retrieval techniques have become increasingly important. Most existing works on text-video retrieval focus on designing informative representation learning methods and delicate matching mechanisms that leverage the content information of queries and videos themselves (i.e., the textual information of queries and the multimodal information of videos). However, real-world scenarios often involve brief, ambiguous queries and low-quality videos, making content-based retrieval less effective. To accommodate diverse search requirements and enhance user satisfaction, this study introduces a novel Text-video Retrieval method via Watch-time-aware Heterogeneous Graph Contrastive Learning (termed ORANGE). This approach aims to learn informative embeddings for queries and videos by leveraging both content information and the abundant relational information present in video-search scenarios. Specifically, we first construct a heterogeneous information graph whose nodes represent domain objects (e.g., query, video, tag) and whose edges represent rich relations among these objects. Next, a meta-path-guided heterogeneous graph attention encoder that is aware of video watch time is devised to encode the various semantic aspects of query and video nodes. To train our model, we introduce a meta-path-wise contrastive learning paradigm that facilitates capturing dependencies across multiple semantic relations, thereby enhancing the obtained embeddings. Finally, when deployed online, a BERT-based query encoder distilled from ORANGE handles new queries that are absent from the constructed graph. Offline experiments on a real-world dataset demonstrate the effectiveness of ORANGE. Moreover, it has been deployed in the matching stage of an industrial online video-search service, where it achieved statistically significant improvements over the online baseline in an A/B test.
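To make the architecture described in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch of two of its core ingredients: a meta-path neighbor aggregator whose attention scores are biased by (assumed) normalized watch time, and a meta-path-wise InfoNCE-style contrastive loss that treats a node's embeddings obtained under two different meta-paths as positives. All module names, tensor shapes, and the log-watch-time bias are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of watch-time-aware meta-path attention and
# meta-path-wise contrastive learning; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WatchTimeMetaPathAttention(nn.Module):
    """Aggregate meta-path-based neighbors, biasing attention by watch time."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, center, neighbors, watch_time):
        # center:     (N, d)    embeddings of the target nodes
        # neighbors:  (N, K, d) embeddings of K meta-path neighbors per node
        # watch_time: (N, K)    normalized watch time per neighbor edge (assumed in [0, 1])
        c = center.unsqueeze(1).expand_as(neighbors)                 # (N, K, d)
        scores = self.attn(torch.cat([c, neighbors], dim=-1))       # (N, K, 1)
        scores = scores.squeeze(-1) + torch.log(watch_time + 1e-6)  # watch-time bias (assumption)
        alpha = F.softmax(scores, dim=-1)                            # (N, K)
        return torch.einsum("nk,nkd->nd", alpha, neighbors)         # (N, d)

def meta_path_contrastive_loss(z_a, z_b, temperature: float = 0.1):
    """InfoNCE-style loss: the same node's embeddings under two meta-paths
    are positives; other nodes in the batch serve as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                 # (N, N) pairwise similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    N, K, d = 32, 8, 64
    encoder = WatchTimeMetaPathAttention(d)
    center = torch.randn(N, d)
    # Two meta-paths (e.g., query-video-query and query-tag-query), each with
    # its own neighbor set and watch-time weights.
    z_a = encoder(center, torch.randn(N, K, d), torch.rand(N, K))
    z_b = encoder(center, torch.randn(N, K, d), torch.rand(N, K))
    print(meta_path_contrastive_loss(z_a, z_b).item())

In this sketch, adding log watch time to the attention logits simply makes neighbors that users watched longer contribute more to the aggregated embedding, which is one plausible reading of "watch-time-aware"; the paper itself should be consulted for the actual weighting scheme.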
Anthology ID: 2023.emnlp-industry.27
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month: December
Year: 2023
Address: Singapore
Editors: Mingxuan Wang, Imed Zitouni
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 275–283
URL: https://aclanthology.org/2023.emnlp-industry.27
DOI: 10.18653/v1/2023.emnlp-industry.27
Cite (ACL): Yucheng Lin, Tim Chang, Yaning Chang, Jianqiang Ma, Donghui Li, Ting Peng, Zang Li, Zhiyi Zhou, and Feng Wang. 2023. ORANGE: Text-video Retrieval via Watch-time-aware Heterogeneous Graph Contrastive Learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 275–283, Singapore. Association for Computational Linguistics.
Cite (Informal): ORANGE: Text-video Retrieval via Watch-time-aware Heterogeneous Graph Contrastive Learning (Lin et al., EMNLP 2023)
PDF: https://aclanthology.org/2023.emnlp-industry.27.pdf
Video: https://aclanthology.org/2023.emnlp-industry.27.mp4