Yiming Zhao
2025
Enhancing Large Vision-Language Models with Ultra-Detailed Image Caption Generation
Yu Zeng | Yukun Qi | Yiming Zhao | Xikun Bao | Lin Chen | Zehui Chen | Shiting Huang | Jie Zhao | Feng Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
High-quality image captions are essential for improving modality alignment and visual understanding in Large Vision-Language Models (LVLMs). However, the scarcity of ultra-detailed image caption data limits further advancements. This paper presents a systematic pipeline for generating high-quality, ultra-detailed image captions, encompassing both pre-processing and post-processing stages. In the pre-processing stage, we classify and deduplicate images, extract visual information using expert tools, and leverage GPT-4o with structured prompts to generate initial captions. To enhance comprehensiveness, we introduce an expansion strategy based on Large Language Models (LLMs), defining eight descriptive dimensions to refine and extend captions, which serve as seed data for training a proprietary captioner model. In the post-processing stage, we incorporate human error-correction annotations and an active learning-inspired approach to refine low-quality samples. Using high-quality corrected data, we apply Direct Preference Optimization (DPO) and develop a critic-rewrite pipeline, training a sentence-level critic model to mitigate hallucinations. Experimental results demonstrate that our ultra-detailed captions significantly enhance LVLMs’ perception and cognitive abilities across multiple vision-language benchmarks. The code and dataset are available at https://github.com/yuzeng0-0/UltraCaption.
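The post-processing stage described in the abstract (human error correction, DPO, and a sentence-level critic-rewrite loop) can be sketched compactly. The snippet below is a minimal illustration in plain Python, not the paper's implementation: the names PreferencePair, build_dpo_pairs, and critic_rewrite, the annotation keys, and the period-based sentence splitting are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    """One DPO training example: the corrected caption is preferred over the original."""
    image_id: str
    prompt: str
    chosen: str    # human-corrected, high-quality caption
    rejected: str  # original low-quality caption flagged during error correction


def build_dpo_pairs(corrections: List[dict]) -> List[PreferencePair]:
    """Turn human error-correction annotations into DPO preference pairs.
    The dict keys used here are a hypothetical annotation schema."""
    return [
        PreferencePair(
            image_id=c["image_id"],
            prompt="Describe this image in exhaustive detail.",
            chosen=c["corrected_caption"],
            rejected=c["original_caption"],
        )
        for c in corrections
        if c["corrected_caption"] != c["original_caption"]
    ]


def critic_rewrite(caption: str,
                   critic: Callable[[str], bool],
                   rewriter: Callable[[str], str]) -> str:
    """Sentence-level critic-rewrite loop: keep sentences the critic accepts,
    rewrite the ones it flags as hallucinated, and drop unrecoverable ones."""
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    kept = []
    for sentence in sentences:
        if critic(sentence):        # critic judges the sentence as grounded
            kept.append(sentence)
        else:                       # flagged as hallucinated -> try to rewrite
            fixed = rewriter(sentence)
            if fixed:
                kept.append(fixed)
    return ". ".join(kept) + "." if kept else ""
```

In this sketch the critic and rewriter are passed in as callables, so a trained sentence-level critic model and a caption rewriter can be plugged in without changing the loop.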
Perception Compressor: A Training-Free Prompt Compression Framework in Long Context Scenarios
Jiwei Tang | Jin Xu | Tingwei Lu | Zhicheng Zhang | Yiming Zhao | Lin Hai | Hai-Tao Zheng
Findings of the Association for Computational Linguistics: NAACL 2025
Large language models (LLMs) demonstrate exceptional capabilities in various scenarios. However, in long-context scenarios they are hindered by redundant information and are sensitive to the position of key information. To address these challenges, we present Perception Compressor, a training-free prompt compression framework. It includes a perception retriever that leverages guiding questions and the instruction to retrieve the most relevant demonstrations, a dual-slope ratio allocator that dynamically allocates compression ratios and open-book ratios, and a semi-guided iterative compression strategy that retains key information at the token level while removing tokens that distract the LLM. We conduct extensive experiments on long-context benchmarks, i.e., NaturalQuestions, LongBench, and MuSiQue. Experimental results show that Perception Compressor outperforms existing methods by a large margin, achieving state-of-the-art performance.
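The three components named in the abstract can be pictured with a small sketch. The Python below is an illustrative, simplified stand-in rather than the paper's method: the function names, the linear-decay parameters in dual_slope_ratios, the whitespace tokenization, and the single-pass (non-iterative) token compression are all assumptions; the relevance and token-scoring models are passed in as callables.

```python
from typing import Callable, List, Tuple


def perception_retrieve(demos: List[str],
                        guiding_question: str,
                        instruction: str,
                        relevance: Callable[[str, str], float],
                        top_k: int) -> List[str]:
    """Rank demonstrations by relevance to the guiding question and instruction,
    keeping only the most relevant ones."""
    query = instruction + " " + guiding_question
    ranked = sorted(demos, key=lambda d: relevance(query, d), reverse=True)
    return ranked[:top_k]


def dual_slope_ratios(num_demos: int,
                      keep_base: float = 0.6, keep_slope: float = 0.05,
                      open_base: float = 0.3, open_slope: float = 0.03
                      ) -> List[Tuple[float, float]]:
    """Assign each retrieved demonstration a (keep_ratio, open_book_ratio) pair
    that decays linearly with retrieval rank, so less relevant demonstrations
    are compressed harder. The two linear slopes are illustrative defaults."""
    return [(max(0.1, keep_base - keep_slope * r),
             max(0.0, open_base - open_slope * r))
            for r in range(num_demos)]


def token_compress(demo: str,
                   keep_ratio: float,
                   token_score: Callable[[str], float]) -> str:
    """Token-level compression: keep the highest-scoring tokens up to the budget
    implied by keep_ratio, preserving their original order (one-pass stand-in
    for the semi-guided iterative procedure)."""
    tokens = demo.split()
    budget = max(1, int(len(tokens) * keep_ratio))
    keep_idx = sorted(range(len(tokens)),
                      key=lambda i: token_score(tokens[i]),
                      reverse=True)[:budget]
    return " ".join(tokens[i] for i in sorted(keep_idx))


def compress_prompt(demos: List[str], guiding_question: str, instruction: str,
                    relevance: Callable[[str, str], float],
                    token_score: Callable[[str], float], top_k: int = 4) -> str:
    """End-to-end sketch: retrieve, allocate per-demonstration budgets, compress."""
    selected = perception_retrieve(demos, guiding_question, instruction, relevance, top_k)
    ratios = dual_slope_ratios(len(selected))
    compressed = [token_compress(d, keep, token_score)
                  for d, (keep, _open_book) in zip(selected, ratios)]
    return "\n\n".join([instruction] + compressed + [guiding_question])
```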