Tonmoy Rajkhowa


2025

Towards Multilingual Spoken Visual Question Answering System Using Cross-Attention
Amartya Roy Chowdhury | Tonmoy Rajkhowa | Sanjeev Sharma
Proceedings of the 31st International Conference on Computational Linguistics

Visual question answering (VQA) poses a multi-modal translation challenge that requires analyzing both images and questions simultaneously to generate appropriate responses. Although VQA research has mainly focused on text-based questions in English, speech-based questions in English and other languages remain largely unexplored. Incorporating speech could significantly enhance the utility of VQA systems, as speech is the primary mode of human communication. To address this gap, this work implements a speech-based VQA system and introduces the textless multilingual visual question answering (TM-VQA) dataset, featuring speech-based questions in English, German, Spanish, and French. The TM-VQA dataset contains 658,111 pairs of speech-based questions and answers based on 123,287 images. Finally, a novel cross-attention-based unified multi-modal framework is presented to evaluate the efficacy of the TM-VQA dataset. The experimental results indicate the effectiveness of the proposed unified approach over the cascaded framework for both text- and speech-based VQA systems. The dataset can be accessed at https://github.com/Synaptic-Coder/TM-VQA.
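As a rough illustration of the kind of cross-attention fusion such a unified framework might use, the following is a minimal sketch, not the paper's implementation: the model dimension, number of heads, answer-vocabulary size, and mean-pooling strategy are assumptions chosen for brevity. It shows question embeddings (from a text or speech encoder) attending to image region features before answer classification.

```python
# Minimal sketch of cross-attention fusion for VQA (illustrative only).
import torch
import torch.nn as nn

class CrossAttentionVQA(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_answers=1000):
        super().__init__()
        # Question tokens act as queries over image region keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, n_answers)

    def forward(self, question_emb, image_emb):
        # question_emb: (B, Tq, d_model); image_emb: (B, Ti, d_model)
        fused, _ = self.cross_attn(query=question_emb, key=image_emb, value=image_emb)
        fused = self.norm(fused + question_emb)   # residual connection
        pooled = fused.mean(dim=1)                # mean-pool over question tokens
        return self.classifier(pooled)            # answer logits

# Toy usage with random features standing in for encoder outputs.
model = CrossAttentionVQA()
q = torch.randn(2, 20, 512)    # e.g., speech/text question encoder output
v = torch.randn(2, 36, 512)    # e.g., image region features
logits = model(q, v)           # shape: (2, 1000)
```

Using the question as the query side lets each question token gather the image evidence relevant to it, which is one common way to realize a unified (non-cascaded) multi-modal model.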

2024

Evaluating the Efficacy of Large Acoustic Model for Documenting Non-Orthographic Tribal Languages in India
Tonmoy Rajkhowa | Amartya Roy Chowdhury | Hrishikesh Ravindra Karande | S. R. Mahadeva Prasanna
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Pre-trained Large Acoustic Models (LAMs), when fine-tuned, have largely been shown to improve performance on various tasks related to spoken language technologies. However, they have mostly been evaluated on datasets containing English or other widely spoken languages, and their potential for novel under-resourced languages is not fully known. In this work, four novel under-resourced tribal languages that do not have a standard writing system were introduced, and the application of such large pre-trained models was assessed for documenting these languages using Automatic Speech Recognition and Direct Speech-to-Text Translation systems. Transcriptions for these tribal languages were generated by adapting the scripts of languages with a prominent presence in the geographical regions where the tribal languages are spoken. The results from this study suggest a viable direction for documenting these languages in the electronic domain by using Spoken Language Technologies that incorporate LAMs. Additionally, this study helped in understanding the varying performance exhibited by the Large Acoustic Model across these four languages. The study not only informs the adoption of appropriate scripts for transliterating spoken-only languages based on their language family but also aids in making informed decisions when analyzing the behavior of a particular Large Acoustic Model in different linguistic contexts.
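To make the fine-tuning setup concrete, here is a minimal sketch assuming a wav2vec 2.0-style checkpoint from Hugging Face transformers stands in for the "Large Acoustic Model"; the checkpoint name, dummy audio, and transcript are illustrative placeholders, not the paper's data, languages, or configuration. It shows a single CTC fine-tuning step on one labelled utterance whose transcription would, in the paper's setting, be written in the adapted regional script.

```python
# Minimal sketch of one CTC fine-tuning step for an ASR-style LAM (illustrative only).
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "facebook/wav2vec2-base-960h"    # placeholder pre-trained acoustic model
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)
model.train()

# One toy example: 2 seconds of 16 kHz audio and a dummy transcript standing in
# for a transcription written in the adapted (regionally dominant) script.
audio = torch.randn(32000).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
labels = processor.tokenizer("A DUMMY TRANSCRIPT", return_tensors="pt").input_ids

# Single optimization step on the CTC loss over the labelled transcription.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
outputs = model(input_values=inputs.input_values, labels=labels)
outputs.loss.backward()
optimizer.step()
print(f"CTC loss: {outputs.loss.item():.3f}")
```

In practice the tokenizer's vocabulary would be rebuilt from the characters of the adopted script before fine-tuning, which is where the choice of transliteration script directly affects the model.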