Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

Zhiyang Xu; Chao Feng; Rulin Shao; Trevor Ashby; Ying Shen; Di Jin; Yu Cheng; Qifan Wang; Lifu Huang

doi:10.18653/v1/2024.findings-acl.905

Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, Lifu Huang

Abstract

Despite vision-language models’ (VLMs) remarkable capabilities as versatile visual assistants, two substantial challenges persist within the existing VLM frameworks: (1) lacking task diversity in pretraining and visual instruction tuning, and (2) annotation error and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, and each task is accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework, in which VLMs are firstly finetuned on Vision-Flan and further tuned on GPT-4 synthesized data. We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework and achieves the state-of-the-art performance across a wide range of multi-modal evaluation benchmarks. Finally, we conduct in-depth analyses to understand visual instruction tuning and our findings reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs’ capabilities but rather modulates the model’s responses to human-preferred formats; (2) A minimal quantity (e.g., 1,000) of GPT-4 synthesized data can effectively align VLM responses with human-preference; (3) Visual instruction tuning mainly helps large-language models (LLMs) to understand visual features.

Anthology ID:: 2024.findings-acl.905
Volume:: Findings of the Association for Computational Linguistics: ACL 2024
Month:: August
Year:: 2024
Address:: Bangkok, Thailand
Editors:: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 15271–15342
Language:
URL:: https://aclanthology.org/2024.findings-acl.905/
DOI:: 10.18653/v1/2024.findings-acl.905
Bibkey:
Cite (ACL):: Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, and Lifu Huang. 2024. Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 15271–15342, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):: Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning (Xu et al., Findings 2024)
Copy Citation:
PDF:: https://aclanthology.org/2024.findings-acl.905.pdf

PDF Cite Search Fix data