UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets

Pengyu Wang; Shaojun Zhou; Chenkun Tan; Xinghao Wang; Wei Huang; Zhen Ye; Zhaowei Li; Botian Jiang; Dong Zhang; Xipeng Qiu (邱锡鹏)

doi:10.18653/v1/2025.emnlp-main.1572

Abstract

Unified vision large language models (VLLMs) have recently achieved impressive advancements in both multimodal understanding and generation, powering applications such as visual question answering and text-guided image synthesis. However, progress in unified VLLMs remains constrained by the lack of datasets that fully exploit the synergistic potential between these two core abilities. Existing datasets typically address understanding and generation in isolation, thereby limiting the performance of unified VLLMs. To bridge this critical gap, we introduce a novel dataset construction framework, UnifiedVisual, and present UnifiedVisual-240K, a high-quality dataset meticulously designed to facilitate mutual enhancement between multimodal understanding and generation. UnifiedVisual-240K seamlessly integrates diverse visual and textual inputs and outputs, enabling comprehensive cross-modal reasoning and precise text-to-image alignment. Our dataset encompasses a wide spectrum of tasks and data sources, ensuring rich diversity and addressing key shortcomings of prior resources. Extensive experiments demonstrate that models trained on UnifiedVisual-240K consistently achieve strong performance across a wide range of tasks. Notably, these models exhibit significant mutual reinforcement between multimodal understanding and generation, further validating the effectiveness of our framework and dataset. We believe UnifiedVisual represents a new growth point for advancing unified VLLMs and unlocking their full potential.

Anthology ID: 2025.emnlp-main.1572
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 30867–30899
URL: https://aclanthology.org/2025.emnlp-main.1572/
DOI: 10.18653/v1/2025.emnlp-main.1572
Cite (ACL): Pengyu Wang, Shaojun Zhou, Chenkun Tan, Xinghao Wang, Wei Huang, Zhen Ye, Zhaowei Li, Botian Jiang, Dong Zhang, and Xipeng Qiu. 2025. UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30867–30899, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets (Wang et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1572.pdf
Checklist: 2025.emnlp-main.1572.checklist.pdf
