Towards Multilingual spoken Visual Question Answering system using Cross-Attention

Amartya Roy Chowdhury, Tonmoy Rajkhowa, Sanjeev Sharma


Abstract
Visual question answering (VQA) poses a multi-modal translation challenge that requires analyzing images and questions simultaneously to generate appropriate responses. Although VQA research has mainly focused on text-based questions in English, speech-based questions in English and other languages remain largely unexplored. Incorporating speech could significantly enhance the utility of VQA systems, as speech is the primary mode of human communication. To address this gap, this work implements a speech-based VQA system and introduces the textless multilingual visual question answering (TM-VQA) dataset, featuring speech-based questions in English, German, Spanish, and French. The TM-VQA dataset contains 658,111 pairs of speech-based questions and answers based on 123,287 images. Finally, a novel cross-attention-based unified multi-modal framework is presented to evaluate the efficacy of the TM-VQA dataset. The experimental results indicate the effectiveness of the proposed unified approach over the cascaded framework for both text- and speech-based VQA systems. The dataset can be accessed at https://github.com/Synaptic-Coder/TM-VQA.
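The abstract describes cross-attention fusion of speech and image modalities but includes no code on this page. As an illustration only, the following is a minimal PyTorch sketch of what such a cross-attention fusion block might look like; the module names, dimensions, pooling strategy, and answer-classification head are all assumptions made for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative sketch: speech features attend over image features.

    NOT the authors' architecture; all sizes and structure are assumed.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8, num_answers: int = 3000):
        super().__init__()
        # Query = speech tokens, Key/Value = image region features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Hypothetical answer vocabulary size for classification-style VQA.
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, speech_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (B, T_s, dim) from some speech encoder
        # image_feats:  (B, T_i, dim) from some vision encoder
        fused, _ = self.cross_attn(query=speech_feats, key=image_feats, value=image_feats)
        fused = self.norm(fused + speech_feats)   # residual connection
        pooled = fused.mean(dim=1)                # mean-pool over speech tokens
        return self.classifier(pooled)            # logits over candidate answers

# Usage sketch with random tensors standing in for encoder outputs.
model = CrossAttentionFusion()
speech = torch.randn(2, 50, 512)    # batch of 2, 50 speech frames
image = torch.randn(2, 196, 512)    # 14x14 = 196 image patches
logits = model(speech, image)       # shape: (2, 3000)
```

In a unified (non-cascaded) setup of this kind, the speech signal is fused with visual features directly, without an intermediate speech-to-text transcription step.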
Anthology ID:
2025.coling-main.615
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
9165–9175
URL:
https://aclanthology.org/2025.coling-main.615/
Cite (ACL):
Amartya Roy Chowdhury, Tonmoy Rajkhowa, and Sanjeev Sharma. 2025. Towards Multilingual spoken Visual Question Answering system using Cross-Attention. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9165–9175, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Towards Multilingual spoken Visual Question Answering system using Cross-Attention (Chowdhury et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.615.pdf