LLMs Can Compensate for Deficiencies in Visual Representations

Sho Takishita, Jay Gala, Abdelrahman Mohamed, Kentaro Inui, Yova Kementchedjhieva


Abstract
Many vision-language models (VLMs) that prove very effective at a range of multimodal tasks build on CLIP-based vision encoders, which are known to have various limitations. We investigate the hypothesis that the strong language backbone in VLMs compensates for possibly weak visual features by contextualizing or enriching them. Using three CLIP-based VLMs, we perform controlled self-attention ablations on a carefully designed probing task. Our findings show that despite known limitations, CLIP visual representations offer ready-to-read semantic information to the language decoder. However, in scenarios of reduced contextualization in the visual representations, the language decoder can largely compensate for the deficiency and recover performance. This suggests a dynamic division of labor in VLMs and motivates future architectures that offload more visual processing to the language decoder.
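
To illustrate the kind of controlled self-attention ablation the abstract refers to, the sketch below is a minimal, hypothetical PyTorch example, not the authors' actual setup: it masks self-attention among image tokens in a single toy attention layer, so visual tokens receive no contextualization from one another. The function name, the token layout (image tokens first), and the shared toy projections are assumptions made purely for illustration.

import torch
import torch.nn.functional as F

def ablated_attention(x, num_image_tokens, num_heads=8):
    """Toy attention layer in which image tokens cannot attend to each other.

    x: (batch, seq_len, dim), assumed layout: image tokens first, then text tokens.
    This is an illustrative sketch, not the paper's implementation.
    """
    b, n, d = x.shape
    head_dim = d // num_heads
    # Toy projections; a real model would use learned query/key/value weights.
    q = k = v = x.view(b, n, num_heads, head_dim).transpose(1, 2)  # (b, h, n, hd)

    # Boolean mask: True where attention is allowed.
    allowed = torch.ones(n, n, dtype=torch.bool)
    img = torch.arange(n) < num_image_tokens
    # Block image-to-image attention, then restore each token's self-attention.
    allowed[img.unsqueeze(1) & img.unsqueeze(0)] = False
    allowed.fill_diagonal_(True)

    out = F.scaled_dot_product_attention(q, k, v, attn_mask=allowed)
    return out.transpose(1, 2).reshape(b, n, d)

# Example: 4 image tokens followed by 6 text tokens.
x = torch.randn(2, 10, 64)
y = ablated_attention(x, num_image_tokens=4)
print(y.shape)  # torch.Size([2, 10, 64])

Under such an ablation, any contextualization of the visual features must come from downstream components such as the language decoder, which is the compensation effect the paper probes.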
Anthology ID:
2025.findings-emnlp.825
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
15253–15272
URL:
https://aclanthology.org/2025.findings-emnlp.825/
Cite (ACL):
Sho Takishita, Jay Gala, Abdelrahman Mohamed, Kentaro Inui, and Yova Kementchedjhieva. 2025. LLMs Can Compensate for Deficiencies in Visual Representations. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15253–15272, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
LLMs Can Compensate for Deficiencies in Visual Representations (Takishita et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.825.pdf
Checklist:
https://aclanthology.org/2025.findings-emnlp.825.checklist.pdf