Yuxin Xiong
2025
Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent
Junda Wu | Yuxin Xiong | Xintong Li | Yu Xia | Ruoyu Wang | Yu Wang | Tong Yu | Sungchul Kim | Ryan A. Rossi | Lina Yao | Jingbo Shang | Julian McAuley
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent multimodal large language models (MLLMs) have demonstrated strong visual understanding and reasoning after large-scale multimodal pre-training. However, instruction-tuning is typically text-driven with limited visual supervision, leading to significant visual forgetting and degradation of pre-trained visual knowledge. Existing fine-tuning and continual learning methods compress visual representations and emphasize task alignment over visual retention, and thus fail to address this challenge. We present a novel perspective that uses effective rank to quantify the loss of visual representation richness, framing visual forgetting as excessive compression under the information bottleneck principle. To address this, we propose modality-decoupled gradient descent (MDGD), which regulates gradient updates to preserve the effective rank of visual features and explicitly disentangles visual learning from task-specific alignment. We further introduce a memory-efficient fine-tuning variant that uses gradient masking for parameter-efficient adaptation. Extensive experiments show that MDGD effectively mitigates visual forgetting across downstream tasks and models, maintaining pre-trained visual knowledge while supporting strong task adaptation.
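For readers unfamiliar with the effective-rank measure referenced above, it is commonly defined as the exponential of the Shannon entropy of a feature matrix's normalized singular-value spectrum. The sketch below illustrates that general definition on a matrix of visual token features; it is only an illustration of the concept under assumed shapes (the `visual_features` tensor is hypothetical), not the paper's implementation.

import torch

def effective_rank(features: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Effective rank of a (num_tokens x dim) feature matrix, computed as
    exp(entropy) of the normalized singular-value distribution."""
    s = torch.linalg.svdvals(features)         # singular values, descending
    p = s / (s.sum() + eps)                    # normalize spectrum into a distribution
    entropy = -(p * torch.log(p + eps)).sum()  # Shannon entropy of the spectrum
    return torch.exp(entropy)                  # ranges from 1 up to min(tokens, dim)

# Hypothetical usage: compare the effective rank of visual token features
# before and after instruction-tuning to observe representation compression.
visual_features = torch.randn(256, 1024)       # e.g. 256 visual tokens, 1024-dim
print(effective_rank(visual_features).item())

A drop in this quantity over the course of text-driven fine-tuning is one concrete way to observe the representation compression the abstract describes.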
Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics
Sheldon Yu | Yuxin Xiong | Junda Wu | Xintong Li | Tong Yu | Xiang Chen | Ritwik Sinha | Jingbo Shang | Julian McAuley
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent advances in chain-of-thought (CoT) prompting have demonstrated the ability of large language models (LLMs) to perform multi-step reasoning. While prior work focuses on improving CoT generation quality or attributing token-level importance, we propose a novel framework to structurally analyze the latent dynamics of CoT trajectories for interpretability. Our method segments generated CoT into discrete reasoning steps, abstracts each step into a spectral embedding based on the eigenvalues of token-level Gram matrices, and clusters these embeddings into semantically meaningful latent states. We model the global evolution of reasoning as a first-order Markov chain over latent clusters, yielding interpretable transition structures. Through t-SNE visualizations and Monte Carlo rollouts, we uncover consistent trajectories across tasks and models, supporting the hypothesis that LLM reasoning follows globally coherent yet abstract paths.
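The pipeline sketched in the abstract (segment the CoT, embed each step via the eigenvalue spectrum of its token-level Gram matrix, cluster the embeddings into latent states, and fit a first-order Markov chain over the clusters) can be illustrated roughly as follows. This is a minimal sketch under assumed shapes and hyperparameters (the top-k spectrum size and number of latent states are placeholders), not the authors' exact recipe.

import numpy as np
from sklearn.cluster import KMeans

def spectral_step_embedding(token_embs: np.ndarray, k: int = 16) -> np.ndarray:
    """Embed one reasoning step by the top-k eigenvalues of its token-level Gram matrix."""
    gram = token_embs @ token_embs.T            # (T, T) Gram matrix of the step's tokens
    eigvals = np.linalg.eigvalsh(gram)[::-1]    # eigenvalues, descending
    vec = np.zeros(k)
    vec[: min(k, eigvals.size)] = eigvals[:k]   # zero-pad steps shorter than k tokens
    return vec / (np.linalg.norm(vec) + 1e-12)  # scale-normalize the spectrum

def cluster_steps(step_embs: np.ndarray, n_states: int = 8) -> np.ndarray:
    """Assign each step embedding to a discrete latent reasoning state."""
    return KMeans(n_clusters=n_states, n_init=10).fit_predict(step_embs)

def markov_transitions(state_seqs: list[list[int]], n_states: int) -> np.ndarray:
    """Row-normalized first-order transition matrix over latent states."""
    counts = np.zeros((n_states, n_states))
    for seq in state_seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

Rolling out the resulting transition matrix (e.g., Monte Carlo sampling of state sequences) then gives the kind of global trajectory analysis the abstract refers to.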