Training-free Deep Concept Injection Enables Language Models for Video Question Answering

Xudong Lin, Manling Li, Richard Zemel, Heng Ji, Shih-Fu Chang


Abstract
Recently, enabling pretrained language models (PLMs) to perform zero-shot crossmodal tasks such as video question answering has been extensively studied. A popular approach is to learn a projection network that maps visual features into the input text embedding space of a PLM, along with feed-forward adaptation layers, while keeping the weights of the PLM frozen. However, is it really necessary to learn such additional layers? In this paper, we make the first attempt to demonstrate that a PLM can perform zero-shot crossmodal tasks without any crossmodal pretraining, when the observed visual concepts are injected both as additional input text tokens and as augmentations of the intermediate features within each feed-forward network of the PLM. Specifically, inputting the observed visual concepts as text tokens injects them through the self-attention layers of the PLM; to augment the intermediate features in a way that is compatible with the PLM, we propose to construct adaptation layers from the intermediate representations of the concepts (obtained by feeding them alone into the PLM). These two complementary injection mechanisms form the proposed Deep Concept Injection, which enables the PLM to perceive instantly, without any crossmodal pretraining. Extensive empirical analysis on zero-shot video question answering, as well as visual question answering, shows that Deep Concept Injection achieves competitive or even better results in both zero-shot and fine-tuning settings, compared to state-of-the-art methods that require crossmodal pretraining.
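
The abstract describes two injection mechanisms. Below is a minimal, hypothetical sketch of how they could be realized with a HuggingFace T5 backbone: concept words detected in the video are (1) prepended to the question as plain text, and (2) encoded alone through the PLM so that their per-layer hidden states can augment each feed-forward network's output via similarity-weighted pooling. The backbone choice, the pooling scheme, and the hook placement are illustrative assumptions, not the paper's exact formulation.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.eval()

# Concepts assumed to come from an off-the-shelf visual detector (e.g., CLIP).
concepts = ["dog", "frisbee", "park"]

# Mechanism 1: inject the observed concepts as additional input text tokens.
question = "What is the dog catching?"
prompt = f"Video concepts: {', '.join(concepts)}. Question: {question}"

# Mechanism 2: obtain per-layer concept representations by feeding the
# concepts alone into the PLM, then use them to augment each FFN's output.
with torch.no_grad():
    concept_ids = tokenizer(", ".join(concepts), return_tensors="pt").input_ids
    concept_feats = model.encoder(
        concept_ids, output_hidden_states=True
    ).hidden_states  # tuple of (num_layers + 1) tensors, each [1, Lc, d]

def make_hook(layer_idx):
    """Augment the FFN output of encoder block `layer_idx` with a
    similarity-weighted pool of the concept features (an assumption)."""
    def hook(module, inputs, output):
        c = concept_feats[layer_idx + 1]                    # [1, Lc, d]
        scores = output @ c.transpose(1, 2)                 # [1, L, Lc]
        attn = torch.softmax(scores / output.shape[-1] ** 0.5, dim=-1)
        return output + attn @ c                            # [1, L, d]
    return hook

# In T5, block.layer[1] is the feed-forward sublayer of each encoder block.
for i, block in enumerate(model.encoder.block):
    block.layer[1].register_forward_hook(make_hook(i))

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    answer_ids = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```

Note that no parameter is updated anywhere in this sketch, which mirrors the training-free claim: both mechanisms rely only on forward passes of the frozen PLM.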
Anthology ID:
2024.emnlp-main.1249
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
22399–22416
URL:
https://aclanthology.org/2024.emnlp-main.1249
Cite (ACL):
Xudong Lin, Manling Li, Richard Zemel, Heng Ji, and Shih-Fu Chang. 2024. Training-free Deep Concept Injection Enables Language Models for Video Question Answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22399–22416, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Training-free Deep Concept Injection Enables Language Models for Video Question Answering (Lin et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1249.pdf