Let’s Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought

Authors: Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Wang
Published: 2023-12
In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 204-219
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Publisher: Association for Computational Linguistics, Singapore
Type: conference publication
Citation key: himakunthala-etal-2023-lets
DOI: 10.18653/v1/2023.emnlp-main.15
URL: https://aclanthology.org/2023.emnlp-main.15/