To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models

Junyan Lin, Haoran Chen, Dawei Zhu, Xiaoyu Shen


Abstract
In recent years, multimodal large language models (MLLMs) have attracted widespread attention from both industry and academia. Based on where modality fusion occurs, MLLMs can be categorized into external- and internal-fusion architectures, with the former being predominant. However, considerable debate remains over how to construct the optimal external-fusion MLLM architecture, especially regarding how different connectors perform on tasks of varying granularity. This paper systematically investigates the impact of connectors on MLLM performance. Specifically, we classify connectors into feature-preserving and feature-compressing types. Using a unified classification standard, we categorize sub-tasks from three comprehensive benchmarks, MMBench, MME, and SEED-Bench, into three task types: coarse-grained perception, fine-grained perception, and reasoning, and evaluate connector performance from this perspective. Our findings reveal significant performance differences between connector types across tasks, offering essential guidance for MLLM architecture design and advancing the understanding of MLLM architecture optimization.
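To make the connector taxonomy concrete, below is a minimal PyTorch sketch of the two families the abstract contrasts: a feature-preserving connector that projects every visual token into the LLM embedding space (in the spirit of LLaVA-style MLP projectors), and a feature-compressing connector that shrinks the token sequence with a small set of learnable queries (in the spirit of Q-Former/resampler designs). All dimensions, layer choices, and class names are illustrative assumptions, not the authors' exact configurations.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the two connector families studied in the paper.
# Shapes and layer choices are assumptions for illustration only.

class FeaturePreservingConnector(nn.Module):
    """Projects every visual token into the LLM embedding space,
    keeping the full token sequence (LLaVA-style MLP projector)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_tokens, vision_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(visual_tokens)


class FeatureCompressingConnector(nn.Module):
    """Cross-attends a small set of learnable queries to the visual
    tokens, reducing sequence length (Q-Former/resampler-style)."""
    def __init__(self, vision_dim: int, llm_dim: int, num_queries: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        b = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # (batch, num_tokens, vision_dim) -> (batch, num_queries, llm_dim)
        compressed, _ = self.attn(q, visual_tokens, visual_tokens)
        return self.proj(compressed)


# Example: 576 ViT patch tokens are either preserved or compressed to 64.
x = torch.randn(2, 576, 1024)
print(FeaturePreservingConnector(1024, 4096)(x).shape)   # torch.Size([2, 576, 4096])
print(FeatureCompressingConnector(1024, 4096)(x).shape)  # torch.Size([2, 64, 4096])
```

The trade-off the paper evaluates follows directly from these shapes: preserving all tokens retains fine-grained detail at higher compute cost, while compressing to a fixed query budget discards detail that fine-grained perception tasks may need.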
Anthology ID:
2024.emnlp-main.325
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5666–5680
URL:
https://aclanthology.org/2024.emnlp-main.325
Cite (ACL):
Junyan Lin, Haoran Chen, Dawei Zhu, and Xiaoyu Shen. 2024. To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5666–5680, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models (Lin et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.325.pdf