Semantic Aligned Multi-modal Transformer for Vision-Language Understanding: A Preliminary Study on Visual QA

Han Ding, Li Erran Li, Zhiting Hu, Yi Xu, Dilek Hakkani-Tur, Zheng Du, Belinda Zeng


Abstract
Recent vision-language understanding approaches adopt a multi-modal transformer pre-training and fine-tuning paradigm. Prior work learns representations of text tokens and visual features with cross-attention mechanisms and captures their alignment only through indirect signals. In this work, we propose to strengthen the alignment mechanism by incorporating image scene graph structures as a bridge between the two modalities and by learning with new contrastive objectives. In a preliminary study on the challenging compositional visual question answering task, we show that the proposed approach achieves improved results, demonstrating its potential to enhance vision-language understanding.
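For readers unfamiliar with contrastive alignment objectives of the kind the abstract mentions, the following is a minimal, illustrative sketch (not the authors' code) of a symmetric InfoNCE-style loss that pulls matched text/scene-graph pairs together and pushes mismatched pairs apart; all names (text_emb, graph_emb, temperature) are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb: torch.Tensor,
                               graph_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """text_emb, graph_emb: (batch, dim) pooled embeddings of the two
    modalities; row i of each tensor is assumed to be a matched pair."""
    text_emb = F.normalize(text_emb, dim=-1)
    graph_emb = F.normalize(graph_emb, dim=-1)
    logits = text_emb @ graph_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over text-to-graph and graph-to-text directions.
    loss_t2g = F.cross_entropy(logits, targets)
    loss_g2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2g + loss_g2t)
```

The paper's actual objectives and how scene-graph structure enters them are detailed in the PDF linked below; this sketch only conveys the general contrastive formulation.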
Anthology ID:
2021.maiworkshop-1.11
Volume:
Proceedings of the Third Workshop on Multimodal Artificial Intelligence
Month:
June
Year:
2021
Address:
Mexico City, Mexico
Editors:
Amir Zadeh, Louis-Philippe Morency, Paul Pu Liang, Candace Ross, Ruslan Salakhutdinov, Soujanya Poria, Erik Cambria, Kelly Shi
Venue:
maiworkshop
Publisher:
Association for Computational Linguistics
Pages:
74–78
URL:
https://aclanthology.org/2021.maiworkshop-1.11
DOI:
10.18653/v1/2021.maiworkshop-1.11
Cite (ACL):
Han Ding, Li Erran Li, Zhiting Hu, Yi Xu, Dilek Hakkani-Tur, Zheng Du, and Belinda Zeng. 2021. Semantic Aligned Multi-modal Transformer for Vision-Language Understanding: A Preliminary Study on Visual QA. In Proceedings of the Third Workshop on Multimodal Artificial Intelligence, pages 74–78, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Semantic Aligned Multi-modal Transformer for Vision-Language Understanding: A Preliminary Study on Visual QA (Ding et al., maiworkshop 2021)
PDF:
https://aclanthology.org/2021.maiworkshop-1.11.pdf
Data:
GQA, Visual Question Answering