Audio-visual training for improved grounding in video-text LLMs

Shivprasad Rajendra Sagare, Hemachandran S, Kinshuk Sarabhai, Prashant Ullegaddi, Rajeshkumar Sa


Abstract
Recent advances in multimodal LLMs have led to several video-text models being proposed for critical video-related tasks. However, most previous works support only visual input, essentially muting the audio signal in the video. The few models that support both audio and visual input are not explicitly trained on audio data. Hence, the effect of audio on video understanding is largely unexplored. To this end, we propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. Comparisons with vision-only baselines and other audio-visual models show that training on audio data indeed leads to better grounding of responses. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset with audio-aware question-answer pairs.
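The abstract does not describe the fusion mechanism itself; purely as an illustration, the sketch below shows one common way audio-visual inputs could be passed to an LLM: audio and visual encoder features are projected into the language model's embedding space and concatenated into a multimodal prefix. The module names, feature dimensions, and the late-fusion design are assumptions for this sketch, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class AudioVisualProjector(nn.Module):
    """Illustrative (hypothetical) projector: maps audio and visual features
    into a shared LLM embedding space and concatenates them as a prefix."""

    def __init__(self, audio_dim=768, visual_dim=1024, llm_dim=4096):
        super().__init__()
        # Placeholder dimensions; real encoders/LLMs would fix these values.
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.visual_proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, n_audio_tokens, audio_dim)
        # visual_feats: (batch, n_visual_tokens, visual_dim)
        audio_tokens = self.audio_proj(audio_feats)
        visual_tokens = self.visual_proj(visual_feats)
        # Concatenate along the token dimension so the LLM attends to both modalities.
        return torch.cat([visual_tokens, audio_tokens], dim=1)

# Usage with random tensors standing in for encoder outputs.
projector = AudioVisualProjector()
audio = torch.randn(2, 32, 768)    # stand-in for audio-encoder outputs
visual = torch.randn(2, 64, 1024)  # stand-in for video-frame features
prefix = projector(audio, visual)
print(prefix.shape)  # torch.Size([2, 96, 4096])
```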
Anthology ID:
2024.inlg-main.36
Volume:
Proceedings of the 17th International Natural Language Generation Conference
Month:
September
Year:
2024
Address:
Tokyo, Japan
Editors:
Saad Mahamood, Nguyen Le Minh, Daphne Ippolito
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
440–445
URL:
https://aclanthology.org/2024.inlg-main.36
Cite (ACL):
Shivprasad Rajendra Sagare, Hemachandran S, Kinshuk Sarabhai, Prashant Ullegaddi, and Rajeshkumar Sa. 2024. Audio-visual training for improved grounding in video-text LLMs. In Proceedings of the 17th International Natural Language Generation Conference, pages 440–445, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal):
Audio-visual training for improved grounding in video-text LLMs (Sagare et al., INLG 2024)
PDF:
https://aclanthology.org/2024.inlg-main.36.pdf