A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation

Shijie Zhou, Ruiyi Zhang, Yufan Zhou, Changyou Chen


Abstract
Large multimodal models still struggle with text-rich images because of inadequate training data. Self-Instruct provides an annotation-free way to generate instruction data, but its quality is poor, as multimodal alignment remains a hurdle even for the largest models. In this work, we propose LLaVAR-2 to enhance multimodal alignment for text-rich images through hybrid instruction generation between human annotators and large language models. Specifically, human annotators first provide detailed image captions; these annotations are then used in tailored text prompts for GPT-4o to curate a dataset. Several filtering mechanisms remove low-quality data, and the resulting dataset comprises 424k high-quality instruction pairs. Empirical results show that models fine-tuned on this dataset exhibit impressive improvements over those trained with self-instruct data.
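The following is a minimal sketch, in Python, of the hybrid instruction-generation idea the abstract describes: a human-written caption is embedded in a text prompt for GPT-4o, which returns candidate instruction pairs, and a simple filter then discards low-quality ones. The prompt wording, the filtering heuristic, and all function names here are illustrative assumptions, not the paper's actual pipeline or prompts.

# Sketch of hybrid instruction generation: human caption -> GPT-4o prompt -> filter.
# Assumes the openai Python client (>=1.0) and OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "You are given a detailed human-written caption of a text-rich image.\n"
    "Caption: {caption}\n"
    "Generate 3 question-answer pairs about the text in the image, as a JSON "
    'list of {{"question": ..., "answer": ...}} objects. Return only JSON.'
)

def generate_pairs(caption: str) -> list[dict]:
    """Ask GPT-4o for candidate instruction pairs grounded in the caption."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(caption=caption)}],
    )
    # Sketch assumption: the model returns clean JSON (no markdown fences).
    return json.loads(resp.choices[0].message.content)

def keep(pair: dict, caption: str) -> bool:
    """Toy quality filter: drop empty answers and answers not grounded in the
    caption text. The paper's actual filtering mechanisms are more elaborate."""
    answer = pair.get("answer", "").strip()
    return bool(answer) and answer.lower() in caption.lower()

caption = "A store receipt listing 'Milk 2.49', 'Bread 1.99', and total '4.48'."
pairs = [p for p in generate_pairs(caption) if keep(p, caption)]
print(pairs)

In this sketch the grounding check simply requires the answer string to appear in the caption; it stands in for the paper's filtering mechanisms, whose exact criteria are not specified in the abstract.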
Anthology ID:
2025.coling-main.674
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
10091–10110
URL:
https://aclanthology.org/2025.coling-main.674/
Cite (ACL):
Shijie Zhou, Ruiyi Zhang, Yufan Zhou, and Changyou Chen. 2025. A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10091–10110, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation (Zhou et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.674.pdf