Decoupled Proxy Alignment: Mitigating Language Prior Conflict for Multimodal Alignment in MLLMs

Chenkun Tan, Pengyu Wang, Shaojun Zhou, Botian Jiang, Zhaowei Li, Dong Zhang, Xinghao Wang, Yaqian Zhou, Xipeng Qiu


Abstract
Multimodal large language models (MLLMs) have gained significant attention due to their impressive ability to integrate vision and language modalities. Recent advancements in MLLMs have primarily focused on improving performance through high-quality datasets, novel architectures, and optimized training strategies. However, in this paper, we identify a previously overlooked issue, language prior conflict: a mismatch between the inherent language priors of large language models (LLMs) and the language priors of the training datasets. This conflict leads to suboptimal vision-language alignment, as MLLMs are prone to adapting to the language style of the training samples. To address this issue, we propose a novel training method called Decoupled Proxy Alignment (DPA). DPA introduces two key innovations: (1) the use of a proxy LLM during pretraining to decouple the vision-language alignment process from language prior interference, and (2) dynamic loss adjustment based on visual relevance to strengthen optimization signals for visually relevant tokens. Extensive experiments demonstrate that DPA significantly mitigates the language prior conflict, achieving superior alignment performance across diverse datasets, model families, and scales. Our method not only improves the effectiveness of MLLM training but also shows exceptional generalization capabilities, making it a robust approach for vision-language alignment.
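To make the second innovation concrete, the sketch below shows one way a visual-relevance-weighted token loss could look. It is not the authors' released implementation: the function name, the `relevance` scores, and the simple `1 + relevance` weighting scheme are all illustrative assumptions, standing in for whatever relevance estimator and loss reweighting the paper actually uses.

```python
# Minimal sketch (assumptions, not the paper's code) of a per-token loss that
# upweights visually relevant tokens, as described in the abstract.
import torch
import torch.nn.functional as F


def relevance_weighted_loss(logits, targets, relevance, ignore_index=-100):
    """
    logits:    (batch, seq_len, vocab) model outputs.
    targets:   (batch, seq_len) gold token ids; ignore_index marks prompt/padding.
    relevance: (batch, seq_len) scores in [0, 1]; higher = more visually relevant
               (e.g., attention mass over image tokens -- an assumed proxy).
    """
    vocab = logits.size(-1)
    # Per-token cross-entropy, left unreduced so it can be reweighted.
    ce = F.cross_entropy(
        logits.view(-1, vocab),
        targets.view(-1),
        ignore_index=ignore_index,
        reduction="none",
    ).view_as(targets).float()

    mask = (targets != ignore_index).float()
    # Keep a floor of 1.0 so visually irrelevant tokens still receive gradient,
    # while visually relevant tokens contribute a stronger optimization signal.
    weights = 1.0 + relevance.float()
    weighted = ce * weights * mask
    return weighted.sum() / (weights * mask).sum().clamp(min=1.0)
```

Under these assumptions, the weighted loss is a drop-in replacement for the standard language-modeling loss during the alignment stage; only the relevance scores need to be supplied per batch.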
Anthology ID:
2025.findings-emnlp.1142
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
20954–20970
URL:
https://aclanthology.org/2025.findings-emnlp.1142/
Cite (ACL):
Chenkun Tan, Pengyu Wang, Shaojun Zhou, Botian Jiang, Zhaowei Li, Dong Zhang, Xinghao Wang, Yaqian Zhou, and Xipeng Qiu. 2025. Decoupled Proxy Alignment: Mitigating Language Prior Conflict for Multimodal Alignment in MLLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20954–20970, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Decoupled Proxy Alignment: Mitigating Language Prior Conflict for Multimodal Alignment in MLLMs (Tan et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.1142.pdf
Checklist:
2025.findings-emnlp.1142.checklist.pdf