Haoran Chen

2025

Multimodal large language models (MLLMs) typically extract visual features from the final layers of a pretrained Vision Transformer (ViT). This widespread deep-layer bias, however, is largely driven by empirical convention rather than principled analysis. While prior studies suggest that different ViT layers capture different types of information—shallower layers focusing on fine visual details and deeper layers aligning more closely with textual semantics, the impact of this variation on MLLM performance remains underexplored. We present the first comprehensive study of visual layer selection for MLLMs, analyzing representation similarity across ViT layers to establish shallow, middle, and deep layer groupings. Through extensive evaluation of MLLMs (1.4B–7B parameters) across 10 benchmarks encompassing 60+ tasks, we find that while deep layers excel in semantic-rich tasks like OCR, shallow and middle layers significantly outperform them on fine-grained visual tasks including counting, positioning, and object localization. Building on these insights, we propose a lightweight feature fusion method that strategically incorporates shallower layers, achieving consistent improvements over both single-layer and specialized fusion baselines. Our work offers the first principled study of visual layer selection in MLLMs, showing that MLLMs can often see better when they look shallower.

2024

pdf bib abs

Guided Knowledge Generation with Language Models for Commonsense Reasoning
Xiao Wei | Haoran Chen | Hang Yu | Hao Fei | Qian Liu
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Language Models (LLMs) have achieved notable success in commonsense reasoning tasks, benefiting from their extensive world knowledge acquired through extensive pretraining. While approaches like Chain-of-Thought (CoT) have shown promise in enhancing LLMs’ reasoning capabilities, mitigating the influence of inaccurate commonsense knowledge remains a challenge, particularly for small-scale LLMs (e.g., those with less than 10B parameters). In this work, we propose a novel method named Guided Knowledge Generation (GuideKG) to address these issues. It presents three advantages: (i) Employing LLMs to generate knowledge explanations and to automatically assign labels based on the probability of correct answers eliminates the need for costly manual annotation in subsequent training. (ii) Training a new module called the ‘Know-Filter’, which is used to evaluate knowledge, and we have introduced a new loss to enhance its performance. (iii) Evaluating the effectiveness of knowledge fragments at the sentence level and fusing them allows for precise control over the generation process of LLMs. We evaluate our GuideKG on small-scale LLMs and show that it outperforms all baselines on four widely-used commonsense reasoning benchmarks. Moreover, our experiments reveal that, with proper guidance, small-scale LLMs can exhibit exceptional performance in commonsense reasoning.

pdf bib abs

To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models
Junyan Lin | Haoran Chen | Dawei Zhu | Xiaoyu Shen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

In recent years, multimodal large language models (MLLMs) have attracted widespread attention from both industry and academia. Based on the integration position, MLLMs can be categorized into external and internal fusion architectures, with the former being more predominant. However, there remains considerable debate on how to construct the optimal external fusion MLLM architecture, especially regarding the performance of different connectors on tasks with varying granularities. This paper systematically investigates the impact of connectors on MLLM performance. Specifically, we classify connectors into feature-preserving and feature-compressing types. Utilizing a unified classification standard, we categorize sub-tasks from three comprehensive benchmarks, MMBench, MME, and SEED-Bench, into three task types: coarse-grained perception, fine-grained perception, and reasoning, and evaluate the performance from this perspective. Our findings reveal significant performance differences between different types of connectors across various tasks, offering essential guidance for MLLM architecture design and advancing the understanding of MLLM architecture optimization.

Co-authors

Hao Fei 1

Xin Jin 1

Hui Su 1

Hang Yu 1

Venues

EMNLP2
Findings1

Fix author