Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering

Zujie Liang; Weitao Jiang; Haifeng Hu; Jiaying Zhu

doi:10.18653/v1/2020.emnlp-main.265

Abstract

In Visual Question Answering (VQA), most state-of-the-art models tend to learn spurious correlations in the training set and perform poorly on out-of-distribution test data. Several methods for generating counterfactual samples have been proposed to alleviate this problem. However, most previous methods simply add the generated counterfactual samples to the training data as augmentation, so the samples are not fully utilized. We therefore introduce a novel self-supervised contrastive learning mechanism to learn the relationship between original samples, factual samples and counterfactual samples. With the better cross-modal joint embeddings learned from this auxiliary training objective, the reasoning capability and robustness of the VQA model are boosted significantly. We demonstrate the effectiveness of our method by surpassing current state-of-the-art models on the VQA-CP dataset, a diagnostic benchmark for assessing the VQA model’s robustness.
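The contrastive idea in the abstract can be illustrated with a minimal sketch. It is not taken from the paper or its released code: the cosine similarity, the temperature parameter, the function names, and the toy embeddings are all assumptions chosen for illustration. The sketch treats the joint embedding of the original sample as the anchor, the factual sample as the positive, and the counterfactual sample as the negative, and uses a generic InfoNCE-style loss with one positive and one negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negative, temperature=1.0):
    """Generic InfoNCE-style loss with one positive and one negative.

    anchor:   joint embedding of the original sample (assumed role)
    positive: joint embedding of the factual sample (assumed role)
    negative: joint embedding of the counterfactual sample (assumed role)
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = math.exp(cosine(anchor, negative) / temperature)
    # The loss shrinks as the anchor moves toward the positive
    # and away from the negative.
    return -math.log(pos / (pos + neg))


# Toy 2-d embeddings, made up purely for illustration.
original = [1.0, 0.0]
factual = [0.9, 0.1]
counterfactual = [0.0, 1.0]

loss_good = contrastive_loss(original, factual, counterfactual)
loss_bad = contrastive_loss(original, counterfactual, factual)
```

In this toy setup `loss_good` is smaller than `loss_bad`, since the anchor is closer to the factual embedding than to the counterfactual one. In a real model such a term would presumably be added as an auxiliary objective alongside the usual VQA answer loss; the exact formulation used by the authors should be checked against the paper.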

Anthology ID: 2020.emnlp-main.265
Volume: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month: November
Year: 2020
Address: Online
Editors: Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 3285–3292
URL: https://aclanthology.org/2020.emnlp-main.265/
DOI: 10.18653/v1/2020.emnlp-main.265
Cite (ACL): Zujie Liang, Weitao Jiang, Haifeng Hu, and Jiaying Zhu. 2020. Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3285–3292, Online. Association for Computational Linguistics.
Cite (Informal): Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering (Liang et al., EMNLP 2020)
PDF: https://aclanthology.org/2020.emnlp-main.265.pdf
Video: https://slideslive.com/38938860
Code: jokieleung/CL-VQA
Data: VQA-CP, Visual Genome, Visual Question Answering v2.0
