What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Yan Zeng; Hanbo Zhang; Jiani Zheng; Jiangnan Xia; Guoqiang Wei; Yang Wei; Yuchen Zhang; Tao Kong; Ruihua Song

doi:10.18653/v1/2024.naacl-long.440

Abstract

Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in processing image inputs and following open-ended instructions. Despite these advancements, there is considerable scope for enhancing open-source multi-modal LLMs, especially in terms of multi-modal understanding accuracy and instruction-following proficiency. In this paper, we conduct a comprehensive study on training GPT4-style models. We introduce Lynx, a multi-modal LLM developed through a series of controlled experiments comparing various model variants. This process allowed us to identify and implement an optimal training strategy tailored for multi-modal LLMs. In addition to our model development, we propose a plug-and-play technique designed to augment the instruction-following capabilities of multi-modal LLMs. We have validated the performance of Lynx on multiple benchmarks. Results demonstrate that Lynx not only achieves strong image understanding accuracy but also excels in instruction-following tasks, paving the path for ongoing enhancements in multi-modal LLMs.

Anthology ID: 2024.naacl-long.440
Volume: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 7937–7964
URL: https://aclanthology.org/2024.naacl-long.440/
DOI: 10.18653/v1/2024.naacl-long.440
Cite (ACL): Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong, and Ruihua Song. 2024. What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7937–7964, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? (Zeng et al., NAACL 2024)
PDF: https://aclanthology.org/2024.naacl-long.440.pdf
Video: https://aclanthology.org/2024.naacl-long.440.mp4
