Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in videos. Video-LLaMA bootstraps cross-modal training from the frozen pre-trained visual and audio encoders and the frozen LLMs. Unlike previous vision-LLMs that focus on static image comprehension, such as MiniGPT-4 (Zhu et al., 2023) and LLaVA (Liu et al., 2023), Video-LLaMA mainly tackles two challenges in video understanding: (1) capturing the temporal changes in visual scenes, and (2) integrating audio-visual signals. To counter the first challenge, we propose a Video Q-Former to assemble the pre-trained image encoder into our video encoder and introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind (Girdhar et al., 2023), a universal embedding model that aligns multiple modalities, as the pre-trained audio encoder, and introduce an Audio Q-Former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the output of both the visual and audio encoders with the LLM's embedding space, we train Video-LLaMA on massive video/image-caption pairs as well as visual instruction-tuning datasets of moderate size but higher quality. We found that Video-LLaMA showcases the ability to perceive and comprehend video content, generating meaningful responses grounded in the visual and auditory information presented in the videos. This highlights the potential of Video-LLaMA as a promising prototype for audio-visual AI assistants.


Introduction
Large Language Models (LLMs) (Chowdhery et al., 2022; Bai et al., 2022; OpenAI, 2023) have demonstrated remarkable capability of understanding and following user intentions and instructions. Typically, the user requests and the corresponding responses from LLMs are all in text; however, text-only human-computer interaction is not sufficient for many application scenarios because real-world information is usually multi-modal. In order to further explore the potential of LLMs, many researchers attempt to endow LLMs with the capability of understanding multi-modal content (Huang et al., 2023a; Zhang et al., 2023b; Yin et al., 2023).
Despite their effectiveness, these approaches are dedicated to aligning the input from exactly one additional modality (i.e., image or audio) with text, which is unsatisfactory for video understanding. Concretely, empowering LLMs to understand video requires comprehensive processing of different modalities, including visual input, auditory input, and textual output, which is more challenging than image-only or audio-only understanding tasks. Although several recent works attempt to unleash the video understanding capability of LLMs (Li et al., 2023c; Maaz et al., 2023; Luo et al., 2023), their primary objective is to comprehend only the visual content of the video, leaving the auditory content unused.
In this work, to fill in the blank of audio-visual LLMs, we investigate the possibility of building multi-modal LLMs that support video input and allow users to chat with computers about the user-uploaded video, which is usually composed of multiple video frames and audio. Instead of employing external perception models to convert visual/auditory signals to textual signals (Shen et al., 2023; Li et al., 2023c), we choose to build an end-to-end model that can handle data from multiple modalities within one single framework. Specifically, we adopt the idea of BLIP-2 (Li et al., 2023b) to guarantee the efficiency of cross-modal pre-training. To explicitly capture the change of visual scenes in the video, we use a pre-trained visual encoder to separately compute frame representations. Then, we introduce a frame embedding layer to inject temporal information and a Video Q-Former to generate visual query tokens. As for the audio signals in the video, we additionally leverage a pre-trained audio encoder as well as an Audio Q-Former to learn reasonable auditory query embeddings (see the right part of Figure 1).
To align textual output with video, we devise multi-branch cross-modal pre-training to learn the vision-language correspondence and the audio-language correspondence. For vision-language correspondence, we first pre-train the vision-related components on a large-scale video caption dataset with a video-clips-to-text generation task. To enhance the understanding of static visual concepts, we also add image-caption data into this pre-training stage. Then, we further fine-tune these components on a video-based conversation dataset to perform visual instruction tuning. For audio-language correspondence, the natural choice would be to pre-train the audio-related components on an audio caption dataset with an audio-to-text generation task; however, such audio-text data is of limited availability. We therefore leverage ImageBind (Girdhar et al., 2023) as the audio encoder, which performs exceptionally well in aligning different modalities to a common embedding space, and utilize vision-text data to train the audio-related components. These components learn to align the common embedding space provided by ImageBind with the embedding space of LLMs. Despite not being explicitly trained with audio-text data, Video-LLaMA exhibits a remarkable zero-shot audio understanding capability during inference.
As shown in Table 1, our Video-LLaMA stands out from other existing multi-modal LLMs in terms of its distinctively comprehensive comprehension of audio-visual information in videos. In summary, our contributions are as follows:
• We propose Video-LLaMA, a multi-modal framework that enables LLMs to simultaneously process both the visual and auditory content of a given video and engage in conversation with humans.
• To empower LLMs with video understanding capability, we propose a multi-branch cross-modal pre-training framework to achieve both vision-language alignment and audio-language alignment.
• We open-source the entire codebase for pre-training and fine-tuning as well as the model weights of all variants of Video-LLaMA. We have also prepared demos for video-grounded conversation.

Method
Video-LLaMA aims to empower frozen LLMs with the capability of understanding both visual and auditory content in videos. As shown in Figure 1, we design two branches, namely the Vision-Language Branch and the Audio-Language Branch, to respectively transform the video frames and audio signals into query representations that are compatible with the textual inputs of LLMs. In this section, we first introduce the overall architecture and the building blocks of each branch. Then, we delineate the procedures of the proposed multi-branch cross-modal pre-training and audio-visual instruction tuning.

Architecture

Vision-Language Branch
The Vision-Language Branch is designed to enable the LLMs to understand visual inputs. As shown in the left part of Figure 1, it is composed of a frozen pre-trained image encoder to extract features from video frames, a position embedding layer to inject temporal information into video frames, a Video Q-Former to aggregate frame-level representations, and a linear layer to project the output video representations into the same dimension as the text embeddings of LLMs. Given a video consisting of $N$ frames, the image encoder first maps each frame/image into $K_f$ image embedding vectors, yielding video frame representations $V = [v_1, v_2, ..., v_N]$, where $v_i \in \mathbb{R}^{K_f \times d_f}$ denotes the representation of the $i$-th frame. Since the frame representations $v_i$ from the frozen image encoder are computed without considering any temporal information, we further apply position embeddings, as the indicator of temporal information, to the representations from different frames. Then, we feed the position-encoded frame representations to the Video Q-Former, which shares the same architecture as the Query Transformer (Q-Former) in BLIP-2 (Li et al., 2023b), to obtain $k_V$ video embedding vectors of dimension $d_v$ as the video representation $\hat{v} \in \mathbb{R}^{k_V \times d_v}$.
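To make this data flow concrete, below is a minimal PyTorch sketch of the computation described above. It uses a generic cross-attention stack as a stand-in for the BLIP-2 Q-Former; the dimensions, layer counts, and module names are illustrative assumptions rather than the actual Video-LLaMA implementation, which initializes the Video Q-Former from pre-trained BLIP-2 weights.

```python
import torch
import torch.nn as nn

class VideoQFormerSketch(nn.Module):
    """Simplified stand-in for the Video Q-Former: learnable queries
    cross-attend to position-encoded frame embeddings."""
    def __init__(self, d_f=1408, d_v=768, k_v=32, max_frames=32):
        super().__init__()
        self.frame_pos_emb = nn.Embedding(max_frames, d_f)        # temporal position embeddings
        self.video_queries = nn.Parameter(torch.randn(k_v, d_v))  # k_V learnable query vectors
        self.in_proj = nn.Linear(d_f, d_v)
        layer = nn.TransformerDecoderLayer(d_model=d_v, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=2)  # queries attend to frame tokens

    def forward(self, frame_feats):
        # frame_feats: (B, N, K_f, d_f) from the frozen image encoder
        B, N, K_f, d_f = frame_feats.shape
        pos = self.frame_pos_emb(torch.arange(N, device=frame_feats.device))
        frame_feats = frame_feats + pos[None, :, None, :]            # add per-frame position
        tokens = self.in_proj(frame_feats.reshape(B, N * K_f, d_f))  # flatten frames into one sequence
        queries = self.video_queries.unsqueeze(0).expand(B, -1, -1)
        return self.qformer(tgt=queries, memory=tokens)              # (B, k_V, d_v) video embedding vectors
```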
To adapt the video representations to the input of LLMs, we add a linear layer to transform the video embedding vectors into video query vectors. The video query vectors have the same dimension as the text embeddings of LLMs. In the forward pass, they are concatenated with the text embeddings as a video soft prompt, guiding the frozen LLMs to generate text conditioned on the video content.
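As a sketch of how the video query vectors enter the LLM, the snippet below prepends them to the text embeddings and feeds the result to a frozen Hugging Face-style causal LM through `inputs_embeds`; the function and argument names are illustrative, and the exact prompt layout used by Video-LLaMA may differ.

```python
import torch

def forward_with_video_prompt(llm, proj, video_embeds, text_ids):
    """Project video embedding vectors into the LLM space and prepend
    them to the text embeddings as a soft prompt (illustrative layout)."""
    video_queries = proj(video_embeds)                      # (B, k_V, d_llm)
    text_embeds = llm.get_input_embeddings()(text_ids)      # (B, T, d_llm)
    inputs_embeds = torch.cat([video_queries, text_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long,
                                device=inputs_embeds.device)
    # The LLM itself stays frozen; gradients flow only into `proj`
    # and the Q-Former that produced `video_embeds`.
    return llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```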
As for the implementation of the Vision-Language Branch, we utilize the pre-trained vision component of BLIP-2 (Li et al., 2023b) as the frozen visual encoder, which includes a ViT-G/14 from EVA-CLIP (Fang et al., 2022) and a pre-trained Q-Former. The remaining components, including the position embedding layer, the Video Q-Former, and the linear layer, are randomly initialized and optimized to connect the output of the frozen visual encoder to the frozen LLMs.

Audio-Language Branch
To deal with the auditory content of the given video, we introduce the Audio-Language Branch. Concretely, it consists of a pre-trained audio encoder to compute features from short segments of the original audio, a position embedding layer to inject temporal information into audio segments, an Audio Q-Former to fuse the features of different audio segments, and a linear layer to map the audio representation into the embedding space of LLMs.
In practice, we utilize the pre-trained ImageBind (Girdhar et al., 2023) as the audio encoder. We first uniformly sample $M$ segments of 2-second short audio clips from the video, then convert each 2-second audio clip into a spectrogram using 128 mel-spectrogram bins. After obtaining the spectrogram list of the input audio, the audio encoder maps each spectrogram into a dense vector. Thus the generated audio representation of the given video can be denoted as $A = [a_1, a_2, ..., a_M]$.
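A minimal sketch of this preprocessing step using torchaudio is shown below; the 2-second clips and 128 mel bins follow the description above, while the FFT size, hop length, and number of segments are illustrative assumptions and may differ from ImageBind's actual audio pipeline.

```python
import torch
import torchaudio

def audio_to_mel_segments(waveform, sample_rate, num_segments=8,
                          clip_seconds=2.0, n_mels=128):
    """Uniformly sample fixed-length clips from a waveform and convert each
    to a 128-bin mel spectrogram (FFT/hop sizes are illustrative)."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_mels=n_mels, n_fft=1024, hop_length=160)
    clip_len = int(clip_seconds * sample_rate)
    total = waveform.shape[-1]
    starts = torch.linspace(0, max(total - clip_len, 0), num_segments).long()
    segments = []
    for s in starts.tolist():
        clip = waveform[..., s:s + clip_len]
        if clip.shape[-1] < clip_len:                      # pad the last clip if needed
            clip = torch.nn.functional.pad(clip, (0, clip_len - clip.shape[-1]))
        segments.append(mel(clip))                         # (channels, n_mels, time)
    return torch.stack(segments)                           # (M, channels, n_mels, time)
```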
Similar to the Video Q-Former, the Audio Q-Former injects temporal information by adding learnable positional embeddings to the audio segments. It then generates fixed-length audio features by computing the interaction across the position-encoded audio segments. The Audio Q-Former adopts the same architecture as the Q-Former. It projects the variable-length audio representation list $A$ into a fixed-length sequence $\hat{A} \in \mathbb{R}^{K_a \times d_a}$, where $K_a$ is the number of audio embedding vectors and $d_a$ is the dimension of each vector. Finally, we employ a linear layer to map the audio features into the embedding space of the LLM.
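The sketch below mirrors the video branch above: learnable queries fuse the position-encoded segment embeddings, and a linear layer maps the result into the LLM embedding space. The dimensions (ImageBind output size, number of audio queries, LLM hidden size) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioBranchSketch(nn.Module):
    """Position-encode the M segment embeddings from the frozen audio
    encoder, fuse them with learnable queries, then project to the LLM space."""
    def __init__(self, d_audio=1024, d_a=768, k_a=8, d_llm=4096, max_segments=32):
        super().__init__()
        self.seg_pos_emb = nn.Embedding(max_segments, d_audio)
        self.audio_queries = nn.Parameter(torch.randn(k_a, d_a))
        self.in_proj = nn.Linear(d_audio, d_a)
        layer = nn.TransformerDecoderLayer(d_model=d_a, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=2)
        self.out_proj = nn.Linear(d_a, d_llm)                 # map into the LLM embedding space

    def forward(self, segment_embeds):
        # segment_embeds: (B, M, d_audio), one vector per 2-second clip, A = [a_1, ..., a_M]
        B, M, _ = segment_embeds.shape
        pos = self.seg_pos_emb(torch.arange(M, device=segment_embeds.device))
        tokens = self.in_proj(segment_embeds + pos[None])     # add segment position, project
        queries = self.audio_queries.unsqueeze(0).expand(B, -1, -1)
        fused = self.qformer(tgt=queries, memory=tokens)      # (B, K_a, d_a) fixed-length audio features
        return self.out_proj(fused)                           # (B, K_a, d_llm) audio query vectors
```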

Multi-branch Cross-Modal Training
We train the vision-language and audio-language branches separately. In the first stage, large-scale vision-caption datasets are used for training; in the second stage, high-quality instruction-following datasets are used for fine-tuning. Images are treated as one-frame videos.
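In both stages, the pre-trained encoders and the LLM remain frozen and only the newly introduced branch components are updated. A minimal sketch of this setup is below, with module names as illustrative placeholders rather than the actual Video-LLaMA parameter names.

```python
def configure_trainable_params(model):
    """Freeze the pre-trained encoders and the LLM; train only the
    branch components (name substrings here are illustrative)."""
    for p in model.parameters():
        p.requires_grad = False
    trainable_keys = ["video_qformer", "audio_qformer", "frame_pos_emb",
                      "seg_pos_emb", "video_proj", "audio_proj"]
    for name, p in model.named_parameters():
        if any(key in name for key in trainable_keys):
            p.requires_grad = True
    # Return the parameter list to hand to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]
```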

Training of Vision-Language Branch
For pre-training the vision-language branch, we utilize WebVid-2M (Bain et al., 2021), a large-scale dataset of short videos with textual descriptions sourced from stock footage sites. Moreover, we employ the image caption dataset CC595k, which is sourced from CC3M (Sharma et al., 2018) and filtered by Liu et al. (2023). We adopt a video-to-text generation task during the pre-training stage, i.e., given the representation of a video, we prompt the frozen LLM to generate the corresponding text description. We find that a significant portion of the textual descriptions are insufficient to reflect the entire content of the videos; therefore, the visual semantics in the videos are not fully aligned with the textual semantics in the video descriptions. Nevertheless, this stage aims to utilize a vast amount of data and enable video features to contain as much visual knowledge as possible. We leave vision-text alignment and instruction-following abilities for the next stage.
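A sketch of the video-to-text generation objective under this setup: a standard causal-LM loss on the caption tokens, conditioned on the video soft prompt, with the prompt positions excluded from the loss via the Hugging Face convention that label positions set to -100 are ignored. Variable names are illustrative.

```python
import torch

def video_to_text_loss(llm, video_queries, caption_ids):
    """Causal-LM loss on the caption, conditioned on the video soft prompt.
    Prompt positions are masked from the loss with the -100 label convention."""
    text_embeds = llm.get_input_embeddings()(caption_ids)            # (B, T, d_llm)
    inputs_embeds = torch.cat([video_queries, text_embeds], dim=1)   # prepend soft prompt
    prompt_mask = torch.full(video_queries.shape[:2], -100,
                             dtype=torch.long, device=caption_ids.device)
    labels = torch.cat([prompt_mask, caption_ids], dim=1)            # loss only on caption tokens
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long,
                                device=caption_ids.device)
    return llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask,
               labels=labels).loss
```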
After the pre-training stage, the model can generate content about the information in the video, but its ability to follow instructions decreases. Therefore, in the second stage, we fine-tune the model using high-quality instruction data. We integrate the image-detail-description dataset from MiniGPT-4 (Zhu et al., 2023), the image-instruction dataset from LLaVA (Liu et al., 2023), and the video-instruction dataset from Video-Chat (Li et al., 2023c). After fine-tuning, Video-LLaMA exhibits remarkable abilities in following instructions and comprehending images and videos.

Training of Audio-Language Branch
Training the audio-language branch directly on audio-text data is highly challenging due to the rarity of such data. The objective of the learnable parameters in the audio-language branch is to align the output embedding of the frozen audio encoder with the embedding space of the LLM. Given the scarcity of audio-text data, we employ a workaround strategy to achieve this objective. ImageBind, which is used as our audio encoder, has a remarkable ability to align the embeddings of different modalities into one common space, demonstrating impressive performance on cross-modal retrieval and generation tasks. In light of the scarcity of audio-text data and the abundance of visual-text data, we train the audio-language branch using visual-text data, following the same data and process as the vision branch. Thanks to the shared embedding space provided by ImageBind, Video-LLaMA exhibits the ability to comprehend audio during inference, even though the audio interface has never been trained on audio data.
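Put schematically, the workaround amounts to training the audio branch on embeddings from ImageBind's vision encoder paired with text, and then feeding ImageBind's audio embeddings through the identical, unchanged branch at inference time. In the sketch below, `imagebind_vision` and `imagebind_audio` are hypothetical stand-ins for ImageBind's encoders (not its actual API), both assumed to output vectors in the same shared space.

```python
def soft_prompt_from_shared_space(audio_branch, shared_embeds):
    """Map embeddings from the shared ImageBind space into LLM soft-prompt
    vectors. The same branch serves vision at training time and audio at
    inference time, because both modalities land in one common space."""
    return audio_branch(shared_embeds)                  # (B, K_a, d_llm)

# Training (visual-text pairs only; no audio-text supervision):
#   shared = imagebind_vision(frames)                   # hypothetical stand-in
#   prompt = soft_prompt_from_shared_space(audio_branch, shared)
#   ... caption LM loss as in the pre-training sketch above ...
#
# Zero-shot audio inference:
#   shared = imagebind_audio(mel_segments)              # hypothetical stand-in
#   prompt = soft_prompt_from_shared_space(audio_branch, shared)
```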

Related Works
Large Language Models: Large language models (LLMs) (Black et al., 2022; Scao et al., 2022; OpenAI, 2023; Tsimpoukelli et al., 2021) have demonstrated remarkable language understanding and reasoning abilities, enabling the generation of high-quality natural language text across various domains, including articles, conversations, stories, and poetry. LLMs have already sparked a technological revolution and have been widely applied in different applications. Moreover, a series of open-source large models, such as LLaMA (Touvron et al., 2023), BLOOM (Scao et al., 2022), and OPT (Zhang et al., 2022), have greatly promoted technological advancement and made outstanding contributions to the NLP community. Building upon these LLMs, researchers have further extended their capabilities and developed excellent models for various NLP tasks; examples include Vicuna (Chiang et al., 2023) and Baize (Xu et al., 2023a). Our work is based on these LLMs and provides plug-and-play plugins that empower them with the capability of comprehending both visual and auditory content in videos.
Multi-modal Large Language Models: Researchers have been actively exploring the use of LLMs for processing multi-modal inputs (Gao et al., 2023; Li et al., 2023c). Existing approaches can be categorized into two main groups. The first category involves employing LLMs as controllers and utilizing existing multi-modal models as tools. In this approach, when receiving the user's text instruction, the LLM recognizes the user's intention and makes decisions about which tools to call. It then generates comprehensive responses by incorporating the results obtained from these off-the-shelf multi-modal models. Examples include Visual ChatGPT (Wu et al., 2023a), HuggingGPT (Shen et al., 2023), and AudioGPT (Huang et al., 2023a). The second category focuses on training fundamental large-scale multi-modal models. The key idea of this line of work is to align the pre-trained foundation models for other modalities to textual LLMs. For instance, Flamingo (Alayrac et al., 2022a) utilizes a perceiver resampler and a gated cross-attention layer to connect a frozen image encoder and LLM. BLIP-2 (Li et al., 2023b) introduces a Q-Former to map learned image queries to the textual embedding space of LLMs. LLaVA (Liu et al., 2023), mPLUG-Owl (Ye et al., 2023), and MiniGPT-4 (Zhu et al., 2023) develop instruction-following image-LLMs using image-instruction-following datasets. Video-Chat (Li et al., 2023c) and Video-ChatGPT (Maaz et al., 2023) extend image encoders to video encoders and connect them with LLMs to understand visual content in videos. PandaGPT (Su et al., 2023) utilizes multi-modal encoders from ImageBind, trained exclusively on image-instruction pairs, to enable large models to understand six modalities. Our work falls into the second category, where we train fundamental models to comprehend both the visual and auditory content in videos.

Examples
In this section, we show some cases to demonstrate Video-LLaMA's multi-modal instruction-following capability in video/audio/image-grounded conversations. Figure 2 shows some of these examples, while additional cases can be found in Appendix A.
(1) Audio-visual integration perception ability. Figure 2(a) and Figure 3 show Video-LLaMA's unique ability to comprehend auditory and visual information simultaneously. The videos in both cases contain audio. In each conversation, we pose two questions related to the visual and auditory content respectively. If the model could only receive one modality, it would be unable to answer both questions. However, we observe that Video-LLaMA accurately responds to both the visual and auditory questions in both cases.
(2) The ability to capture temporal dynamics in videos. Figure 2(b) and Figure 4 illustrate the capability of Video-LLaMA to identify actions over time. It successfully describes the actions of the girl and the moving direction of the boat.

(3) The ability to perceive and understand static images. Figure 2(c) and Figure 5 show Video-LLaMA's ability to perceive and understand pictures. Figure 2(c) demonstrates Video-LLaMA's ability to understand the concept of "unusual" and specifically describe the unusual scene. In Figure 5, not only does Video-LLaMA accurately describe the main content, but it also associates it with the friendly interaction between a dog and a human.

(4) The ability of common-knowledge concept recognition. Figure 2(d) and Figure 6 demonstrate Video-LLaMA's remarkable capacity for recognizing common-knowledge concepts in visual signals. Video-LLaMA successfully recognizes famous landmarks and characters and can engage in common-sense question-answering.

Figure 2: (a) A case where Video-LLaMA answers questions based on the background sound and visual content of the video; (b) a case showing Video-LLaMA's ability to identify actions over time; (c) a case demonstrating Video-LLaMA's ability to comprehend static images; (d) a case demonstrating Video-LLaMA's ability to recognize famous landmarks.

Conclusion
In this paper, we present Video-LLaMA, a cutting-edge multi-modal framework that empowers large language models with both audio and video understanding capabilities. Our experiments demonstrated the impressive abilities of Video-LLaMA in audio- and video-grounded conversations, highlighting its potential as a promising prototype for audio-visual AI assistants. We have open-sourced the entire training code and various model weights, along with detailed instructions to assist developers in utilizing our code for further development. In addition, we provide online demo websites and offline demo deployment guides for users to experience Video-LLaMA's capabilities firsthand. We are committed to constantly maintaining and improving Video-LLaMA, and will continue to contribute to the open-source community.

Limitations
Although Video-LLaMA has demonstrated impressive abilities in understanding both visual and auditory content in videos, it is still an early-stage prototype and has some limitations, including: (1) Limited perception capacities: Video-LLaMA's performance is hindered by the quality and scale of the current training dataset. We are actively constructing a high-quality audio-video-text alignment dataset to enhance the model's perception capabilities.
(2) Limited ability to handle long videos: Long videos (such as movies and TV shows) contain a large volume of information and impose higher demands on computational resources. This challenge remains a crucial issue that the research community is actively working to address. (3) Hallucination: Video-LLaMA inherits the hallucination problem from the frozen LLMs. We will continue to address these challenges and develop more powerful versions for video understanding.

Figure 3: A case showing Video-LLaMA's ability to identify the sound of applause in a video and infer the positive response from the audience. Additionally, it infers that a man is playing the saxophone on stage based on the visual content.

Figure 4: A case where Video-LLaMA provides a detailed description of the visual content in a dynamic video.

Figure 5: A case where Video-LLaMA provides a detailed description of the static image content.