Shivprasad Rajendra Sagare
2024
Audio-visual training for improved grounding in video-text LLMs
Shivprasad Rajendra Sagare
|
Hemachandran S
|
Kinshuk Sarabhai
|
Prashant Ullegaddi
|
Rajeshkumar Sa
Proceedings of the 17th International Natural Language Generation Conference
Recent advances in multimodal LLMs have led to several video-text models being proposed for critical video-related tasks. However, most previous works support visual input only, essentially muting the audio signal in the video. The few models that support both audio and visual input are not explicitly trained on audio data. Hence, the effect of audio on video understanding is largely unexplored. To this end, we propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. Comparisons with vision-only baselines and other audio-visual models show that training on audio data indeed leads to better grounding of responses. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset with audio-aware question-answer pairs.
VideoRAG: Scaling the context size and relevance for video question-answering
Shivprasad Rajendra Sagare
|
Prashant Ullegaddi
|
Nachiketh K S
|
Navanith R
|
Kinshuk Sarabhai
|
Rajesh Kumar S A
Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations
Recent advancements have led to the adaptation of several multimodal large language models (LLMs) for critical video-related use cases, particularly Video Question-Answering (QA). However, most previous models sample only a limited number of frames from a video because of the context size limit of the backbone LLM. An alternative approach, applying temporal pooling to compress multiple frames, has also been shown to saturate and does not scale well. These limitations cause video QA to perform very poorly on long videos. To address this, we present VideoRAG, a system that uses the recently popularized Retrieval Augmented Generation (RAG) pipeline to select the top-k frames from a video that are relevant to the user query. We have observed a qualitative improvement in our experiments, indicating a promising direction to pursue. Additionally, our findings indicate that VideoRAG demonstrates superior performance when addressing needle-in-the-haystack questions in long videos. Our extensible system allows for trying multiple strategies for indexing, ranking, and adding QA models.
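The retrieval step described in the abstract above (selecting the top-k frames relevant to a user query) can be illustrated with a minimal sketch. The abstract does not say how frames and queries are embedded or scored, so everything here is an assumption for illustration: the function names `cosine_similarity` and `select_top_k_frames` are hypothetical, the embeddings are toy vectors standing in for the output of some encoder, and cosine similarity is just one plausible ranking strategy among the several the system is said to support.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_top_k_frames(
    query_embedding: Sequence[float],
    frame_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k frames most similar to the query, in temporal order.

    Assumption: frame_embeddings[i] is the embedding of the i-th sampled frame,
    produced by some encoder that shares an embedding space with the query.
    """
    scored = [
        (cosine_similarity(query_embedding, emb), idx)
        for idx, emb in enumerate(frame_embeddings)
    ]
    # Highest similarity first; ties are broken in favour of the earlier frame.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    top_indices = [idx for _, idx in scored[:k]]
    # Restore temporal order so a downstream QA model sees frames chronologically.
    return sorted(top_indices)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real frame and query embeddings.
    frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    query = [1.0, 0.0]
    print(select_top_k_frames(query, frames, k=2))
```

Returning the selected indices in temporal order is a design choice made for this sketch, not something the abstract states; a real pipeline might instead pass frames to the QA model in rank order.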
Co-authors
- Kinshuk Sarabhai 2
- Prashant Ullegaddi 2
- Nachiketh K S 1
- Navanith R 1
- Hemachandran S 1
Venues
- INLG 2