Generating Structured Pseudo Labels for Noise-resistant Zero-shot Video Sentence Localization

Minghang Zheng, Shaogang Gong, Hailin Jin, Yuxin Peng, Yang Liu


Abstract
Video sentence localization aims to locate moments in an unstructured video according to a given natural language query. Two main challenges are the high cost of annotation and annotation bias. In this work, we study video sentence localization in a zero-shot setting, which learns from video data alone, without any annotation. Existing zero-shot pipelines usually generate event proposals and then generate a pseudo query for each event proposal. However, their event proposals are obtained via visual feature clustering, which is query-independent and inaccurate, and the pseudo queries are short or less interpretable. Moreover, existing approaches ignore the risk of pseudo-label noise when leveraging pseudo labels in training. To address these problems, we propose Structure-based Pseudo Label generation (SPL), which first generates free-form, interpretable pseudo queries and then constructs query-dependent event proposals by modeling the temporal structure of events. To mitigate the effect of pseudo-label noise, we propose a noise-resistant iterative method that repeatedly re-weights the training samples based on noise estimation to train a grounding model and correct the pseudo labels. Experiments on the ActivityNet Captions and Charades-STA datasets demonstrate the advantages of our approach. Code can be found at https://github.com/minghangz/SPL.
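As a rough illustration of the re-weighting idea described in the abstract, the sketch below scores each pseudo label by its temporal IoU with the grounding model's current prediction, uses that score as a sample weight, and optionally replaces the pseudo label with the prediction. This is not the authors' implementation: the use of IoU as the noise estimate, the function names `temporal_iou` and `reweight_and_refine`, and the `refine_threshold` parameter are all assumptions made for this example.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def reweight_and_refine(pseudo_labels, predictions, refine_threshold=0.5):
    """One hypothetical round of noise-aware re-weighting.

    pseudo_labels, predictions: lists of (start, end) intervals, one per sample.
    Returns per-sample weights and the (possibly corrected) pseudo labels.
    """
    weights, refined = [], []
    for label, pred in zip(pseudo_labels, predictions):
        # Assumption: agreement with the model's prediction is a proxy
        # for how clean the pseudo label is.
        agreement = temporal_iou(label, pred)
        weights.append(agreement)
        # Assumption: adopt the prediction only when it roughly agrees
        # with the pseudo label; otherwise keep the label unchanged.
        refined.append(pred if agreement >= refine_threshold else label)
    return weights, refined
```

In an iterative setup, the returned weights would scale each sample's training loss in the next round, and the refined labels would replace the previous pseudo labels before the grounding model is trained again.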
Anthology ID:
2023.acl-long.794
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
14197–14209
URL:
https://aclanthology.org/2023.acl-long.794
DOI:
10.18653/v1/2023.acl-long.794
Cite (ACL):
Minghang Zheng, Shaogang Gong, Hailin Jin, Yuxin Peng, and Yang Liu. 2023. Generating Structured Pseudo Labels for Noise-resistant Zero-shot Video Sentence Localization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14197–14209, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Generating Structured Pseudo Labels for Noise-resistant Zero-shot Video Sentence Localization (Zheng et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.794.pdf
Video:
https://aclanthology.org/2023.acl-long.794.mp4