Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability

Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Ali Payani, Lu Cheng, Mengnan Du


Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in processing both visual and textual information. However, the critical challenge of alignment between visual and textual representations is not fully understood. This survey presents a comprehensive examination of alignment and misalignment in LVLMs through an explainability lens. We first examine the fundamentals of alignment, exploring its representational and behavioral aspects, training methodologies, and theoretical foundations. We then analyze misalignment phenomena across three semantic levels: object, attribute, and relational misalignment. Our investigation reveals that misalignment emerges from challenges at multiple levels: the data level, the model level, and the inference level. We provide a comprehensive review of existing mitigation strategies, categorizing them into parameter-frozen and parameter-tuning approaches. Finally, we outline promising future research directions, emphasizing the need for standardized evaluation protocols and in-depth explainability studies.
Anthology ID:
2025.findings-emnlp.90
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1713–1735
URL:
https://aclanthology.org/2025.findings-emnlp.90/
Cite (ACL):
Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Ali Payani, Lu Cheng, and Mengnan Du. 2025. Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1713–1735, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability (Shu et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.90.pdf
Checklist:
2025.findings-emnlp.90.checklist.pdf