Rajesh Kumar S A


2024

VideoRAG: Scaling the context size and relevance for video question-answering
Shivprasad Rajendra Sagare | Prashant Ullegaddi | Nachiketh K S | Navanith R | Kinshuk Sarabhai | Rajesh Kumar S A
Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations

Recent advancements have led to the adaptation of several multimodal large language models (LLMs) for critical video-related use cases, particularly Video Question-Answering (QA). However, most previous models sample only a limited number of frames from a video due to the context-size limit of the backbone LLM. Another approach, applying temporal pooling to compress multiple frames, has also been shown to saturate and does not scale well. These limitations cause video QA on long videos to perform poorly. To address this, we present VideoRAG, a system that uses the recently popularized Retrieval Augmented Generation (RAG) pipeline to select the top-k frames from a video that are most relevant to the user query. We have observed a qualitative improvement in our experiments, indicating a promising direction to pursue. Additionally, our findings indicate that VideoRAG demonstrates superior performance on needle-in-the-haystack questions in long videos. Our extensible system allows for trying multiple strategies for indexing, ranking, and adding QA models.
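The core retrieval step the abstract describes, ranking video frames by relevance to the query and keeping only the top-k, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes frames and the query have already been embedded into a shared vector space (e.g. by a vision-language encoder) and ranks frames by cosine similarity.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_frames(frame_embeddings, query_embedding, k=3):
    """Return indices of the k frames most similar to the query.

    frame_embeddings: list of per-frame embedding vectors (hypothetical
    output of a frame indexing stage); query_embedding: vector for the
    user question in the same embedding space.
    """
    ranked = sorted(
        range(len(frame_embeddings)),
        key=lambda i: cosine(frame_embeddings[i], query_embedding),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-D embeddings: frames 0 and 2 point roughly the same way as the query.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]
print(top_k_frames(frames, query, k=2))  # → [0, 2]
```

Only the selected frames are then passed to the QA model, keeping the prompt within the backbone LLM's context limit regardless of video length.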