Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation

Yiming Xu, Lin Chen, Zhongwei Cheng, Lixin Duan, Jiebo Luo


Abstract
We study visual question answering (VQA) under supervised domain adaptation, where a large amount of labeled data is available in the source domain but only limited labeled data in the target domain, with the goal of training a good target model. A straightforward solution is to fine-tune a pre-trained source model on the limited labeled target data, but this usually performs poorly because of the considerable difference between the data distributions of the source and target domains. Moreover, the presence of multiple modalities in VQA (i.e., images, questions, and answers) poses further challenges in modeling transferability across modalities. In this paper, we address these issues by proposing a novel supervised multi-modal domain adaptation method for VQA that learns joint feature embeddings across domains and modalities. Specifically, we align the data distributions of the source and target domains by considering the modalities both jointly and separately. Extensive experiments on the benchmark VQA 2.0 and VizWiz datasets demonstrate that our method outperforms existing state-of-the-art baselines for open-ended VQA in this challenging domain adaptation setting.
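The abstract does not spell out the alignment objective, so the sketch below is only a hypothetical illustration of the stated idea: aligning source and target feature distributions both per modality ("separately") and on a fused embedding ("jointly"). It uses a linear-kernel maximum mean discrepancy (MMD) as a stand-in alignment criterion; the feature dimensions, the elementwise-product fusion, and all names (MultiModalAligner, alignment_loss) are assumptions for illustration, not the authors' implementation.

# NOTE: hypothetical sketch, not the paper's implementation.
import torch
import torch.nn as nn

def mmd(x, y):
    # Linear-kernel maximum mean discrepancy between two feature
    # batches: squared distance between the batch means.
    return (x.mean(dim=0) - y.mean(dim=0)).pow(2).sum()

class MultiModalAligner(nn.Module):
    # Projects image and question features into a shared embedding space.
    # Dimensions are assumptions (e.g., 2048-d CNN image features,
    # 1024-d encoded questions), not values from the paper.
    def __init__(self, img_dim=2048, q_dim=1024, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.q_proj = nn.Linear(q_dim, emb_dim)

    def forward(self, img, q):
        v = torch.relu(self.img_proj(img))  # image embedding
        t = torch.relu(self.q_proj(q))      # question embedding
        joint = v * t                       # elementwise fusion (an assumption)
        return v, t, joint

def alignment_loss(model, src_img, src_q, tgt_img, tgt_q,
                   w_sep=1.0, w_joint=1.0):
    # "Separately": align each modality's distribution across domains.
    # "Jointly": align the fused multi-modal embedding across domains.
    sv, st, sj = model(src_img, src_q)
    tv, tt, tj = model(tgt_img, tgt_q)
    sep = mmd(sv, tv) + mmd(st, tt)
    joint = mmd(sj, tj)
    return w_sep * sep + w_joint * joint

if __name__ == "__main__":
    model = MultiModalAligner()
    src_img, src_q = torch.randn(32, 2048), torch.randn(32, 1024)  # source batch
    tgt_img, tgt_q = torch.randn(8, 2048), torch.randn(8, 1024)    # smaller target batch
    loss = alignment_loss(model, src_img, src_q, tgt_img, tgt_q)
    loss.backward()
    print(float(loss))

In a full training loop, such an alignment term would typically be added, with tuned weights, to the supervised answer-prediction loss computed on the labeled source and target examples.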
Anthology ID: 2020.findings-emnlp.34
Volume: Findings of the Association for Computational Linguistics: EMNLP 2020
Month: November
Year: 2020
Address: Online
Editors: Trevor Cohn, Yulan He, Yang Liu
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 367–376
URL: https://aclanthology.org/2020.findings-emnlp.34
DOI: 10.18653/v1/2020.findings-emnlp.34
Cite (ACL): Yiming Xu, Lin Chen, Zhongwei Cheng, Lixin Duan, and Jiebo Luo. 2020. Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 367–376, Online. Association for Computational Linguistics.
Cite (Informal): Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation (Xu et al., Findings 2020)
PDF: https://aclanthology.org/2020.findings-emnlp.34.pdf
Data: MS COCO, Visual Question Answering, Visual Question Answering v2.0, VizWiz