Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models

Yuchun Fan, Yilin Wang, Yongyu Mu, Lei Huang, Bei Li, Xiaocheng Feng, Tong Xiao, JingBo Zhu


Abstract
Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information expressed in human languages, yet these capabilities remain imbalanced across languages. In this work, we delve into the multilingual working pattern of LVLMs and identify a salient correlation between their multilingual understanding ability and language-specific neuron activations in shallow layers. Building on this insight, we introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage Specific layers fine-Tuning. PLAST first identifies the layers involved in multilingual understanding by monitoring language-specific neuron activations; these layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Empirical results on MMBench and MMMB demonstrate that PLAST effectively improves the multilingual capabilities of LVLMs while tuning only 14% of the parameters. Further analysis reveals that PLAST promotes language-specific engagement with visual information in the shallow layers.
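The abstract outlines a two-step recipe: locate the shallow layers whose neuron activations are language-specific, then fine-tune only those layers on question-translation pairs. The sketch below (PyTorch) illustrates that idea under loose assumptions; the toy model, the variance-based layer-scoring heuristic, the probe batches, and all function names are hypothetical and are not taken from the paper.

import torch
import torch.nn as nn

# Minimal sketch of the two-step PLAST recipe described in the abstract.
# The toy model, the variance-based scoring heuristic, and all hyper-parameters
# are illustrative assumptions, not the authors' implementation.

class ToyLVLM(nn.Module):
    """Stand-in for an LVLM language backbone: a stack of small MLP 'layers'."""
    def __init__(self, dim=64, num_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_layers)
        )

    def forward(self, x):
        hidden = []
        for layer in self.layers:
            x = layer(x)
            hidden.append(x)
        return hidden  # per-layer activations

def language_specific_layers(model, probes_by_lang, top_k=2):
    """Step 1: score each layer by how much its mean neuron activation varies
    across languages; high variance is taken as 'language-specific' (heuristic)."""
    with torch.no_grad():
        per_lang = []
        for batch in probes_by_lang.values():
            hidden = model(batch)
            per_lang.append(torch.stack([h.mean() for h in hidden]))
        scores = torch.stack(per_lang).var(dim=0)  # shape: [num_layers]
    return scores.topk(top_k).indices.tolist()

def freeze_all_but(model, layer_ids):
    """Step 2: keep only the selected layers trainable; freeze everything else."""
    for p in model.parameters():
        p.requires_grad = False
    for i in layer_ids:
        for p in model.layers[i].parameters():
            p.requires_grad = True

# Usage: select layers from monolingual probe batches, then fine-tune only those
# layers on question-translation pairs (the training loop itself is omitted).
model = ToyLVLM()
probes = {"en": torch.randn(16, 64), "zh": torch.randn(16, 64), "ar": torch.randn(16, 64)}
selected = language_specific_layers(model, probes)
freeze_all_but(model, selected)
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)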
Anthology ID:
2025.findings-emnlp.666
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12473–12500
URL:
https://aclanthology.org/2025.findings-emnlp.666/
Cite (ACL):
Yuchun Fan, Yilin Wang, Yongyu Mu, Lei Huang, Bei Li, Xiaocheng Feng, Tong Xiao, and JingBo Zhu. 2025. Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 12473–12500, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models (Fan et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.666.pdf
Checklist:
2025.findings-emnlp.666.checklist.pdf