Weixiang Zhou
2026
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models
Qiao Liang | Yanjiang Liu | Weixiang Zhou | Ben He | Yaojie Lu | Hongyu Lin | Jia Zheng | Xianpei Han | Le Sun | Yingfei Sun
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of the vision encoder’s prior knowledge is seldom investigated. In this work, we introduce a novel metric, Rank_e, to quantify the effect of the vision encoder’s prior knowledge on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient, particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting the vision encoder’s prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.
2025
ConsistentChat: Building Skeleton-Guided Consistent Multi-Turn Dialogues for Large Language Models from Scratch
Jiawei Chen | Xinyan Guan | Qianhao Yuan | Guozhao Mo | Weixiang Zhou | Yaojie Lu | Hongyu Lin | Ben He | Le Sun | Xianpei Han
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Current instruction data synthesis methods primarily focus on single-turn instructions and often neglect cross-turn coherence, resulting in context drift and reduced task completion rates in extended conversations. To address this limitation, we propose Skeleton-Guided Multi-Turn Dialogue Generation, a framework that constrains multi-turn instruction synthesis by explicitly modeling human conversational intent. It operates in two stages: (1) Intent Modeling, which captures the global structure of human dialogues by assigning each conversation to one of nine well-defined intent trajectories, ensuring a coherent and goal-oriented information flow; and (2) Skeleton Generation, which constructs a structurally grounded sequence of user queries aligned with the modeled intent, thereby serving as a scaffold that constrains and guides the downstream instruction synthesis process. Based on this process, we construct ConsistentChat, a multi-turn instruction dataset with approximately 15,000 multi-turn conversations and 224,392 utterances. Experiments on the Light, Topdial, and MT-Eval benchmarks show that models fine-tuned on ConsistentChat achieve a 20–30% improvement in chat consistency and up to a 15% increase in task success rate, significantly outperforming models trained on existing single-turn and multi-turn instruction datasets.
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
Hao Zheng | Xinyan Guan | Hao Kong | Wenkai Zhang | Jia Zheng | Weixiang Zhou | Hongyu Lin | Yaojie Lu | Xianpei Han | Le Sun
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Automatically generating presentations from documents is a challenging task that requires balancing content quality, visual appeal, and structural coherence. Existing methods primarily focus on improving and evaluating content quality in isolation, overlooking visual appeal and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to extract slide-level functional types and content schemas, then drafts an outline and iteratively generates editing actions based on selected reference slides to create new slides. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Results demonstrate that PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.