Jia Zheng
2026
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models
Qiao Liang | Yanjiang Liu | Weixiang Zhou | Ben He | Yaojie Lu | Hongyu Lin | Jia Zheng | Xianpei Han | Le Sun | Yingfei Sun
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of the vision encoder’s prior knowledge is seldom investigated. In this work, we introduce a novel metric Ranke to quantify the effect of prior knowledge of the vision encoder on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient, particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting the vision encoder’s prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.
2025
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
Hao Zheng | Xinyan Guan | Hao Kong | Wenkai Zhang | Jia Zheng | Weixiang Zhou | Hongyu Lin | Yaojie Lu | Xianpei Han | Le Sun
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Automatically generating presentations from documents is a challenging task that requires accommodating content quality, visual appeal, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, overlooking visual appeal and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to extract slide-level functional types and content schemas, then drafts an outline and iteratively generates editing actions based on selected reference slides to create new slides. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Results demonstrate that PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.
The Linguistic Connectivities Within Large Language Models
Dan Wang | Boxi Cao | Ning Bian | Xuanang Chen | Yaojie Lu | Hongyu Lin | Jia Zheng | Le Sun | Shanshan Jiang | Bin Dong | Xianpei Han
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) have demonstrated remarkable multilingual abilities in various applications. Unfortunately, recent studies have discovered notable disparities in their performance across different languages. Understanding the underlying mechanisms behind such disparities is crucial to ensuring equitable access to LLMs for a global user base. Therefore, this paper conducts a systematic investigation into the behaviors of LLMs across 27 different languages in 3 different scenarios, and reveals a Linguistic Map that correlates with the richness of available resources and linguistic family relations. Specifically, high-resource languages within a specific language family exhibit greater knowledge consistency and mutual information dissemination, while isolated or low-resource languages tend to remain marginalized. Our research deepens the understanding of LLMs’ cross-language behavior, highlights the inherent biases in LLMs within multilingual environments, and underscores the need to address these inequities.
READoc: A Unified Benchmark for Realistic Document Structured Extraction
Zichao Li | Aizier Abulaiti | Yaojie Lu | Xuanang Chen | Jia Zheng | Hongyu Lin | Xianpei Han | Shanshan Jiang | Bin Dong | Le Sun
Findings of the Association for Computational Linguistics: ACL 2025
Document Structured Extraction (DSE) aims to extract structured content from raw documents. Despite the emergence of numerous DSE systems, their unified evaluation remains inadequate, significantly hindering the field’s advancement. This problem is largely attributed to existing benchmark paradigms, which exhibit fragmented and localized characteristics. To offer a thorough evaluation of DSE systems, we introduce a novel benchmark named READoc, which defines DSE as a realistic task of converting unstructured PDFs into semantically rich Markdown. The READoc dataset is derived from 3,576 diverse and real-world documents from arXiv, GitHub, and Zenodo. In addition, we develop a DSE Evaluation S³uite comprising Standardization, Segmentation and Scoring modules, to conduct a unified evaluation of state-of-the-art DSE approaches. By evaluating a range of pipeline tools, expert visual models, and general Vision-Language Models, we identify the gap between current work and the unified, realistic DSE objective for the first time. We aspire that READoc will catalyze future research in DSE, fostering more comprehensive and practical solutions.
2024
Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation
Ruotong Pan | Boxi Cao | Hongyu Lin | Xianpei Han | Jia Zheng | Sirui Wang | Xunliang Cai | Le Sun
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The rapid development of large language models has led to the widespread adoption of Retrieval-Augmented Generation (RAG), which integrates external knowledge to alleviate knowledge bottlenecks and mitigate hallucinations. However, the existing RAG paradigm inevitably suffers from the impact of flawed information introduced during the retrieval phase, thereby diminishing the reliability and correctness of the generated outcomes. In this paper, we propose Credibility-aware Generation (CAG), a universally applicable framework designed to mitigate the impact of flawed information in RAG. At its core, CAG aims to equip models with the ability to discern and process information based on its credibility. To this end, we propose an innovative data transformation framework that generates data based on credibility, thereby effectively endowing models with the capability of CAG. Furthermore, to accurately evaluate the models’ capabilities of CAG, we construct a comprehensive benchmark covering three critical real-world scenarios. Experimental results demonstrate that our model can effectively understand and employ credibility for generation, significantly outperform other models with retrieval augmentation, and exhibit robustness despite the increasing noise in the context.