Bridging Language and Scenes through Explicit 3-D Model Construction

Tiansi Dong, Writwick Das, Rafet Sifa


Abstract
We introduce the methodology of explicit model construction to bridge linguistic descriptions and scene perception, and demonstrate it in Visual Question-Answering (VQA) with MC4VQA (Model Construction for Visual Question-Answering), a method we developed. Given a question about a scene, MC4VQA first recognizes objects using pre-trained deep learning systems. It then constructs an explicit 3-D layout by repeatedly reducing the difference between the input scene image and the image rendered from the current 3-D spatial environment. This novel “iterative rendering” process lets MC4VQA acquire spatial attributes without training data. MC4VQA outperforms NS-VQA (the previous state-of-the-art system), reaching 99.94% accuracy on the benchmark CLEVR datasets, and is more robust than NS-VQA on new test data: on newly created test data, NS-VQA’s accuracy dropped to 97.60%, while MC4VQA maintained 99.0%. This work sets a new state-of-the-art VQA performance on the benchmark CLEVR datasets and outlines a method that may help address the out-of-distribution problem.
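The abstract's "iterative rendering" loop can be illustrated with a minimal sketch: start from a candidate layout, render it, measure the image difference to the input scene, and keep layout updates that reduce that difference. The toy 2-D renderer, the pixel-wise L2 difference, and the hill-climbing update below are illustrative assumptions for exposition only; the paper's actual renderer and optimization procedure are not specified here.

```python
import numpy as np

def render(positions, size=32):
    """Hypothetical toy renderer: draws each (x, y) object position as a
    bright 3x3 square on a size x size image. A real system would use a
    full 3-D renderer in its place."""
    img = np.zeros((size, size))
    for x, y in positions:
        xi, yi = int(x) % size, int(y) % size
        img[max(yi - 1, 0):yi + 2, max(xi - 1, 0):xi + 2] = 1.0
    return img

def image_difference(a, b):
    # Pixel-wise L2 difference between the input scene image and the render.
    return np.sum((a - b) ** 2)

def iterative_rendering(target, n_objects, steps=2000, seed=0):
    """Hill-climb object positions so the rendered image matches `target`.
    A sketch of the iterative-rendering idea, not the authors' method."""
    rng = np.random.default_rng(seed)
    positions = rng.uniform(0, 32, size=(n_objects, 2))
    best = image_difference(target, render(positions))
    for _ in range(steps):
        candidate = positions + rng.normal(0.0, 1.0, size=positions.shape)
        diff = image_difference(target, render(candidate))
        if diff < best:  # keep the layout only if the render got closer
            positions, best = candidate, diff
    return positions, best

# Usage: recover two object positions from a rendered "scene".
true_positions = np.array([[8.0, 8.0], [24.0, 20.0]])
scene = render(true_positions)
estimate, residual = iterative_rendering(scene, n_objects=2)
print(estimate, residual)
```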
Anthology ID:
2025.neusymbridge-1.6
Volume:
Proceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Kang Liu, Yangqiu Song, Zhen Han, Rafet Sifa, Shizhu He, Yunfei Long
Venues:
NeusymBridge | WS
Publisher:
ELRA and ICCL
Pages:
51–60
URL:
https://aclanthology.org/2025.neusymbridge-1.6/
Cite (ACL):
Tiansi Dong, Writwick Das, and Rafet Sifa. 2025. Bridging Language and Scenes through Explicit 3-D Model Construction. In Proceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025, pages 51–60, Abu Dhabi, UAE. ELRA and ICCL.
Cite (Informal):
Bridging Language and Scenes through Explicit 3-D Model Construction (Dong et al., NeusymBridge 2025)
PDF:
https://aclanthology.org/2025.neusymbridge-1.6.pdf