Preserving Language Capabilities in Vision-Language Models via Representation Regulation

ZiXuan Chen; Juncheng Tao; Ziqian Zeng

doi:10.18653/v1/2026.findings-acl.1210

Abstract

Vision-Language Models (VLMs) provide a unified framework for processing both text-only and vision-language tasks. However, finetuning VLMs on vision-language data degrades their language capabilities. In this paper, we show that as the training loss declines during finetuning, the visual and textual representations move closer to each other, a phenomenon we term “representation mixing.” We further show that representation mixing within the post-representation layers causes the degradation of language capabilities. Post-representation layers are the first few layers of the LLM that are involved in representation learning. To preserve language capabilities, we propose Representation Regulation for VLM Training (RRVLM), which introduces a Representation Distribution Difference (RDD) loss to keep the visual and textual representations apart and thereby counteract representation mixing. Extensive experiments on various benchmarks and VLM frameworks show that our method effectively preserves language capabilities while achieving superior vision-language performance.

Anthology ID: 2026.findings-acl.1210
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 24189–24205
URL: https://aclanthology.org/2026.findings-acl.1210/
DOI: 10.18653/v1/2026.findings-acl.1210
Cite (ACL): ZiXuan Chen, Juncheng Tao, and Ziqian Zeng. 2026. Preserving Language Capabilities in Vision-Language Models via Representation Regulation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 24189–24205, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Preserving Language Capabilities in Vision-Language Models via Representation Regulation (Chen et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1210.pdf
Checklist: 2026.findings-acl.1210.checklist.pdf
