Visual Question Decomposition on Multimodal Large Language Models

Haowei Zhang, Jianzhe Liu, Zhen Han, Shuo Chen, Bailan He, Volker Tresp, Zhiqiang Xu, Jindong Gu


Abstract
Question decomposition has emerged as an effective strategy for prompting Large Language Models (LLMs) to answer complex questions. However, while existing methods primarily focus on unimodal language models, the question decomposition capability of Multimodal Large Language Models (MLLMs) has yet to be explored. To this end, this paper explores visual question decomposition on MLLMs. Specifically, we introduce a systematic evaluation framework including a dataset and several evaluation criteria to assess the quality of the decomposed sub-questions, revealing that existing MLLMs struggle to produce high-quality sub-questions. To address this limitation, we propose a specific finetuning dataset, DecoVQA+, for enhancing the model’s question decomposition capability. Aiming at enabling models to perform appropriate selective decomposition, we propose an efficient finetuning pipeline. The finetuning pipeline consists of our proposed dataset and a training objective for selective decomposition. Finetuned MLLMs demonstrate significant improvements in the quality of sub-questions and the policy of selective question decomposition. Additionally, the models also achieve higher accuracy with selective decomposition on VQA benchmark datasets.
Anthology ID:
2024.findings-emnlp.107
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1926–1949
Language:
URL:
https://aclanthology.org/2024.findings-emnlp.107
DOI:
Bibkey:
Cite (ACL):
Haowei Zhang, Jianzhe Liu, Zhen Han, Shuo Chen, Bailan He, Volker Tresp, Zhiqiang Xu, and Jindong Gu. 2024. Visual Question Decomposition on Multimodal Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1926–1949, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Visual Question Decomposition on Multimodal Large Language Models (Zhang et al., Findings 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.findings-emnlp.107.pdf
Software:
 2024.findings-emnlp.107.software.zip
Data:
 2024.findings-emnlp.107.data.zip