What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Yan Zeng | Hanbo Zhang | Jiani Zheng | Jiangnan Xia | Guoqiang Wei | Yang Wei | Yuchen Zhang | Tao Kong | Ruihua Song
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Recent advances, exemplified by GPT-4V, have demonstrated remarkable multi-modal capabilities in processing image inputs and following open-ended instructions. Despite these advances, there is considerable scope for enhancing open-source multi-modal LLMs, especially in terms of multi-modal understanding accuracy and instruction-following proficiency. In this paper, we conduct a comprehensive study on training GPT4-style models. We introduce Lynx, a multi-modal LLM developed through a series of controlled experiments comparing various model variants. This process allowed us to identify and implement an optimal training strategy tailored for multi-modal LLMs. In addition to our model development, we propose a plug-and-play technique designed to augment the instruction-following capabilities of multi-modal LLMs. We have validated the performance of Lynx on multiple benchmarks. Results demonstrate that Lynx not only achieves strong image understanding accuracy but also excels in instruction-following tasks, paving the way for ongoing enhancements in multi-modal LLMs.