VideoRAG: Scaling the context size and relevance for video question-answering

Shivprasad Rajendra Sagare; Prashant Ullegaddi; Nachiketh K S; Navanith R; Kinshuk Sarabhai; Rajesh Kumar S A

Abstract

Recent advancements have led to the adaptation of several multimodal large language models (LLMs) for critical video-related use cases, particularly Video Question-Answering (QA). However, most previous models sample only a limited number of frames from a video because of the context size limit of the backbone LLM. Another approach, applying temporal pooling to compress multiple frames, has also been shown to saturate and does not scale well. These limitations cause video QA on long videos to perform very poorly. To address this, we present VideoRAG, a system that uses the recently popularized Retrieval-Augmented Generation (RAG) pipeline to select the top-k frames from a video that are relevant to the user query. We observed a qualitative improvement in our experiments, indicating a promising direction to pursue. Additionally, our findings indicate that VideoRAG demonstrates superior performance on needle-in-the-haystack questions in long videos. Our extensible system allows for trying multiple strategies for indexing and ranking, and for adding QA models.
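The frame-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which embedding model, index, or ranking function VideoRAG uses, so everything below is an assumption for illustration: frames and the query are taken to be already embedded as plain float vectors, relevance is taken to be cosine similarity, and the names `cosine` and `top_k_frames` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_frames(
    query_vec: Sequence[float],
    frame_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (frame_index, score) for the k frames most similar to the query.

    Assumption: similarity is cosine over precomputed embeddings. The actual
    system may index and rank frames differently.
    """
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(frame_vecs)]
    # Highest score first; Python's sort is stable, so ties keep frame order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (made up for illustration, not taken from the paper).
frames = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]
selected = top_k_frames(query, frames, k=2)
selected_indices = [index for index, _ in selected]
```

In a setup like this, only the selected frames would be passed to the QA model, which keeps the prompt within the backbone LLM's context limit regardless of video length. Whether the selected frames are then presented in score order or temporal order is a design choice the abstract does not specify.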

Anthology ID: 2024.inlg-demos.3
Volume: Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations
Month: September
Year: 2024
Address: Tokyo, Japan
Editors: Saad Mahamood, Nguyen Le Minh, Daphne Ippolito
Venue: INLG
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 7–8
URL: https://aclanthology.org/2024.inlg-demos.3
Cite (ACL): Shivprasad Rajendra Sagare, Prashant Ullegaddi, Nachiketh K S, Navanith R, Kinshuk Sarabhai, and Rajesh Kumar S A. 2024. VideoRAG: Scaling the context size and relevance for video question-answering. In Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations, pages 7–8, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal): VideoRAG: Scaling the context size and relevance for video question-answering (Sagare et al., INLG 2024)
PDF: https://aclanthology.org/2024.inlg-demos.3.pdf
