GRIZAL: Generative Prior-guided Zero-Shot Temporal Action Localization

Onkar Susladkar, Gayatri Deshmukh, Vandan Gorade, Sparsh Mittal


Abstract
Zero-shot temporal action localization (TAL) aims to temporally localize actions in videos without prior training examples. To address the challenges of TAL, we introduce GRIZAL, a model that leverages multimodal embeddings and dynamic motion cues to localize actions effectively. GRIZAL achieves sample diversity by using large-scale generative models: GPT-4 for textual augmentations and DALL-E for image augmentations. Our model fuses vision-language embeddings with optical flow insights, optimized through a blend of supervised and self-supervised loss functions. On the ActivityNet, THUMOS14, and Charades-STA datasets, GRIZAL substantially outperforms state-of-the-art zero-shot TAL models, demonstrating robustness and adaptability across a wide range of video content. We will open-source all models and code.
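The abstract only sketches the architecture, so the following is a rough, non-authoritative illustration of the two ideas it names: fusing vision-language frame embeddings with optical-flow features, and training with a blended supervised plus self-supervised objective. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation; every name (GrizalSketch, flow_proj, blended_loss, lambda_ssl) and every dimension is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GrizalSketch(nn.Module):
    """Hypothetical sketch: fuse vision-language embeddings with
    optical-flow motion features, then score per-frame relevance
    against a text embedding of the action label."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Projects precomputed optical-flow features (e.g., from an
        # off-the-shelf flow network) into the shared embedding space.
        # The 2*dim input size is an assumption for illustration.
        self.flow_proj = nn.Linear(2 * dim, dim)
        # Fuses appearance and motion streams into one representation.
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, frame_emb, flow_feat, text_emb):
        # frame_emb: (T, D)   CLIP-style per-frame embeddings
        # flow_feat: (T, 2*D) motion features from optical flow
        # text_emb:  (D,)     embedding of the (possibly GPT-4-augmented) label
        motion = self.flow_proj(flow_feat)                     # (T, D)
        fused = self.fuse(torch.cat([frame_emb, motion], -1))  # (T, D)
        fused = F.normalize(fused, dim=-1)
        text = F.normalize(text_emb, dim=-1)
        return fused @ text, fused                             # (T,) scores

def blended_loss(scores, labels, fused, fused_aug, lambda_ssl=0.5):
    # Supervised term: per-frame binary relevance (labels in {0, 1}).
    sup = F.binary_cross_entropy_with_logits(scores, labels)
    # Self-supervised term: agreement between two views of the clip,
    # e.g., original frames vs. generative (DALL-E-style) augmentations.
    ssl = 1.0 - F.cosine_similarity(fused, fused_aug, dim=-1).mean()
    return sup + lambda_ssl * ssl
```

A frame is then counted as part of the action when its score exceeds a threshold, and contiguous runs of such frames form the predicted temporal segments; the thresholding step here is likewise an assumption, not taken from the paper.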
Anthology ID:
2024.emnlp-main.1061
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
19046–19059
URL:
https://aclanthology.org/2024.emnlp-main.1061
Cite (ACL):
Onkar Susladkar, Gayatri Deshmukh, Vandan Gorade, and Sparsh Mittal. 2024. GRIZAL: Generative Prior-guided Zero-Shot Temporal Action Localization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19046–19059, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
GRIZAL: Generative Prior-guided Zero-Shot Temporal Action Localization (Susladkar et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1061.pdf