MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning

Zhiyang Xu, Ying Shen, Lifu Huang


Abstract
Instruction tuning, a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions, has shown promising zero-shot performance on various natural language processing tasks. However, it has yet to be explored for vision and multimodal tasks. In this work, we introduce MultiInstruct, the first multimodal instruction tuning benchmark dataset, consisting of 62 diverse multimodal tasks in a unified sequence-to-sequence format that cover 10 broad categories. The tasks are derived from 21 existing open-source datasets, and each task is equipped with 5 expert-written instructions. We take OFA as the base pre-trained model for multimodal instruction tuning, and to further improve its zero-shot performance, we explore multiple transfer learning strategies that leverage the large-scale Natural Instructions dataset. Experimental results demonstrate strong zero-shot performance on various unseen multimodal tasks and the benefit of transfer learning from a text-only instruction dataset. We also design a new evaluation metric, Sensitivity, which measures how sensitive the model is to variations in instructions. Our results indicate that fine-tuning the model on a diverse set of tasks and instructions reduces its sensitivity to instruction variation for each task.
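The abstract describes Sensitivity only informally. As a rough illustration, the sketch below computes one plausible sensitivity-style measure: the coefficient of variation (standard deviation divided by mean) of per-instruction scores, averaged over tasks. The function name, task names, and scores are hypothetical, and this is not necessarily the paper's exact definition.

```python
# Hedged sketch of a Sensitivity-style metric (assumption: coefficient of
# variation across instructions, averaged over tasks; the paper's exact
# formulation may differ).
from statistics import mean, stdev

def sensitivity(per_task_scores: dict[str, list[float]]) -> float:
    """Average, over tasks, of std/mean of scores across instructions.

    Lower values mean performance varies less when the instruction
    wording changes, i.e. the model is less sensitive to instructions.
    """
    ratios = []
    for task, scores in per_task_scores.items():
        mu = mean(scores)
        if mu == 0:
            continue  # skip degenerate tasks to avoid division by zero
        ratios.append(stdev(scores) / mu)
    return mean(ratios)

# Example with made-up numbers: two unseen tasks, five instructions each.
example = {
    "visual_entailment": [0.41, 0.39, 0.44, 0.40, 0.42],
    "grounded_vqa":      [0.55, 0.48, 0.52, 0.50, 0.49],
}
print(f"Sensitivity: {sensitivity(example):.3f}")
```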
Anthology ID:
2023.acl-long.641
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
11445–11465
URL:
https://aclanthology.org/2023.acl-long.641
DOI:
10.18653/v1/2023.acl-long.641
Cite (ACL):
Zhiyang Xu, Ying Shen, and Lifu Huang. 2023. MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11445–11465, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning (Xu et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.641.pdf
Video:
https://aclanthology.org/2023.acl-long.641.mp4