Exploring Union and Intersection of Visual Regions for Generating Questions, Answers, and Distractors

Wenjian Ding; Yao Zhang; Jun Wang (王军); Adam Jatowt; Zhenglu Yang

doi:10.18653/v1/2024.emnlp-main.88

Abstract

Multiple-choice visual question answering (VQA) is the task of automatically choosing the correct answer from a set of choices given an image. Existing efforts have focused on generating an image-related question, a correct answer, or challenging distractors separately. In contrast, this study addresses the holistic generation and optimization of questions, answers, and distractors (QADs). This integrated generation strategy eliminates the need for human curation and guarantees information consistency. Furthermore, we are the first to propose focusing on different image regions to diversify QADs. Accordingly, we formulate a novel framework, ReBo. ReBo cyclically generates each QAD with a recurrent multimodal encoder, and each generation step focuses on a different area of the image from those already covered by the previously generated QADs. In addition to traditional VQA comparisons with state-of-the-art approaches, we also validate the capability of ReBo to generate augmented data that benefit VQA models.
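The abstract does not specify how ReBo measures which image areas earlier QADs have already covered. Purely as an illustration of the union/intersection idea in the title, the sketch below assumes that each QAD step yields a soft attention map over a fixed set of image regions. It takes the union of previously chosen maps with an element-wise max and the intersection with an element-wise min (standard fuzzy-set operations), then greedily prefers the candidate whose attention overlaps least with what is already covered. The helper names (`fuzzy_union`, `overlap_ratio`, `select_diverse_regions`) and the greedy rule are hypothetical and are not the paper's method.

```python
from typing import List, Sequence


def fuzzy_union(a: Sequence[float], b: Sequence[float]) -> List[float]:
    # Element-wise max: fuzzy-set union of two attention maps (illustrative choice).
    return [max(x, y) for x, y in zip(a, b)]


def fuzzy_intersection(a: Sequence[float], b: Sequence[float]) -> List[float]:
    # Element-wise min: fuzzy-set intersection of two attention maps.
    return [min(x, y) for x, y in zip(a, b)]


def overlap_ratio(candidate: Sequence[float], covered: Sequence[float]) -> float:
    # Fraction of the candidate's attention mass that falls on already-covered regions.
    mass = sum(candidate)
    if mass == 0:
        return 0.0
    return sum(fuzzy_intersection(candidate, covered)) / mass


def select_diverse_regions(candidates: List[List[float]], steps: int) -> List[int]:
    """Hypothetical greedy loop: at each step, pick the unused candidate map whose
    attention overlaps least with the union of the maps chosen so far."""
    covered = [0.0] * len(candidates[0])
    chosen: List[int] = []
    for _ in range(min(steps, len(candidates))):
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        # min() returns the first index among ties, so selection is deterministic.
        best = min(remaining, key=lambda i: overlap_ratio(candidates[i], covered))
        chosen.append(best)
        covered = fuzzy_union(covered, candidates[best])
    return chosen


if __name__ == "__main__":
    # Four toy attention maps over four image regions.
    maps = [
        [0.7, 0.3, 0.0, 0.0],
        [0.6, 0.4, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5],
        [0.2, 0.0, 0.0, 0.8],
    ]
    print(select_diverse_regions(maps, steps=3))  # [0, 2, 3]
```

In this toy run, the second map is never chosen because it attends almost entirely to the same regions as the first. This is the kind of redundancy that steering each step toward uncovered areas is meant to avoid.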

Anthology ID: 2024.emnlp-main.88
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 1479–1489
URL: https://aclanthology.org/2024.emnlp-main.88/
DOI: 10.18653/v1/2024.emnlp-main.88
Cite (ACL): Wenjian Ding, Yao Zhang, Jun Wang, Adam Jatowt, and Zhenglu Yang. 2024. Exploring Union and Intersection of Visual Regions for Generating Questions, Answers, and Distractors. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1479–1489, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Exploring Union and Intersection of Visual Regions for Generating Questions, Answers, and Distractors (Ding et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.88.pdf

PDF Cite Search Fix data