Seeing Beyond: Enhancing Visual Question Answering with Multi-Modal Retrieval

Boqi Chen, Anuj Khare, Gaurav Kumar, Arjun Akula, Pradyumna Narayana


Abstract
Multi-modal large language models (MLLMs) have made significant strides in complex content understanding and reasoning. However, they still suffer from hallucination and a lack of specific knowledge when facing challenging questions. To address these limitations, retrieval-augmented generation (RAG) has emerged as an effective solution. While incorporating knowledge has led to improvements, it also highlights the need for a more robust knowledge selection strategy. For multi-modal tasks, such as visual question answering (VQA), integrating all modalities is crucial in providing comprehensive information for accurate answers. Therefore, we propose to construct an encoder model that extracts a joint embedding from all modalities, enabling alignment between each query and its corresponding knowledge through contrastive learning. To further improve performance, we introduce an additional MLLM re-selection step, which selects the best-matching knowledge from the top-k results retrieved by our alignment model. We evaluate our method, SeBe-VQA, on the Encyclopedic VQA dataset. Our knowledge retrieval results demonstrate the benefit of our multi-modal framework. By providing the retrieved knowledge along with the question, we achieve a significant performance improvement over both the previous method and the setting in which no knowledge is provided.
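The pipeline outlined in the abstract (contrastive alignment of query and knowledge embeddings, top-k retrieval, then re-selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the InfoNCE-style loss, cosine similarity, the temperature value, and the names `info_nce_loss`, `retrieve_top_k`, `reselect`, and `judge` are all assumptions made for illustration, and `judge` is a hypothetical stand-in for the MLLM re-selection step. Embeddings are plain lists of floats rather than outputs of a real encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(queries, knowledge, temperature=0.07):
    """One-directional InfoNCE loss over a batch (assumed loss form).

    queries[i] and knowledge[i] form the positive pair; every other
    knowledge item in the batch serves as a negative for query i.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in knowledge]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(q)


def retrieve_top_k(query, knowledge, k=3):
    """Return indices of the k knowledge items most similar to the query."""
    qn = l2_normalize(query)
    scores = [dot(qn, l2_normalize(v)) for v in knowledge]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def reselect(candidates, judge):
    """Pick the candidate index that the (hypothetical) judge scores highest."""
    return max(candidates, key=judge)


if __name__ == "__main__":
    kb = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    top = retrieve_top_k([1.0, 0.1], kb, k=2)
    # Toy judge scores standing in for an MLLM's relevance judgement.
    best = reselect(top, judge=lambda j: {1: 0.2, 2: 0.9}[j])
    print(top, best)
```

In this toy setup the retriever ranks candidates by embedding similarity alone, while the re-selection step is free to overturn that ranking, which is the role the abstract assigns to the MLLM.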
Anthology ID:
2025.coling-industry.35
Volume:
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Darwish, Apoorv Agarwal
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
410–421
URL:
https://aclanthology.org/2025.coling-industry.35/
Cite (ACL):
Boqi Chen, Anuj Khare, Gaurav Kumar, Arjun Akula, and Pradyumna Narayana. 2025. Seeing Beyond: Enhancing Visual Question Answering with Multi-Modal Retrieval. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 410–421, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Seeing Beyond: Enhancing Visual Question Answering with Multi-Modal Retrieval (Chen et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-industry.35.pdf