GRIZAL: Generative Prior-guided Zero-Shot Temporal Action Localization

Onkar Kishor Susladkar; Gayatri Sudhir Deshmukh; Vandan Gorade; Sparsh Mittal

doi:10.18653/v1/2024.emnlp-main.1061

Abstract

Zero-shot temporal action localization (TAL) aims to temporally localize actions in videos without prior training examples. To address the challenges of TAL, we propose GRIZAL, a model that uses multimodal embeddings and dynamic motion cues to localize actions effectively. GRIZAL achieves sample diversity by using large-scale generative models such as GPT-4 to generate textual augmentations and DALL-E to generate image augmentations. Our model integrates vision-language embeddings with optical flow insights and is optimized through a blend of supervised and self-supervised loss functions. On the ActivityNet, THUMOS14, and Charades-STA datasets, GRIZAL greatly outperforms state-of-the-art zero-shot TAL models, demonstrating its robustness and adaptability across a wide range of video content. We will open-source all models and code.

Anthology ID: 2024.emnlp-main.1061
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 19046–19059
URL: https://aclanthology.org/2024.emnlp-main.1061/
DOI: 10.18653/v1/2024.emnlp-main.1061
Cite (ACL): Onkar Kishor Susladkar, Gayatri Sudhir Deshmukh, Vandan Gorade, and Sparsh Mittal. 2024. GRIZAL: Generative Prior-guided Zero-Shot Temporal Action Localization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19046–19059, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): GRIZAL: Generative Prior-guided Zero-Shot Temporal Action Localization (Susladkar et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.1061.pdf
