@inproceedings{huang-etal-2025-align2llava,
title = "{A}lign$^2${LL}a{VA}: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation",
author = "Huang, Hongzhe and
Liu, Jiang and
Yu, Zhewen and
Cai, Li and
Jiao, Dian and
Zhang, Wenqiao and
Tang, Siliang and
Li, Juncheng and
Jiang, Hao and
Li, Haoyuan and
Zhuang, Yueting",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.458/",
doi = "10.18653/v1/2025.findings-acl.458",
pages = "8759--8781",
ISBN = "979-8-89176-256-5",
abstract = "Recent advances in Multi-modal Large Language Models (MLLMs), such as LLaVA-series models, are driven by massive machine-generated instruction-following data tuning. Such automatic instruction collection pipelines, however, inadvertently introduce significant variability in data quality. This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment, to compress this vast corpus of machine-generated multimodal instructions to a compact and high-quality form: (i) For human preference alignment, we have collected a machine-generated multimodal instruction dataset and established a comprehensive set of both subjective and objective criteria to guide the data quality assessment critically from human experts. By doing so, a reward model was trained on the annotated dataset to internalize the nuanced human understanding of instruction alignment. (ii) For LLM preference alignment, given the instruction selected by the reward model, we propose leveraging the inner LLM used in MLLM to align the writing style of visual instructions with that of the inner LLM itself, resulting in LLM-aligned instruction improvement. Extensive experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90{\%}. Impressively, by aggressively reducing the training instructions from 158k to 14k (9{\texttimes} smaller), our model consistently outperforms its full-size dataset counterpart across various MLLM benchmarks. Our project is available at https://github.com/DCDmllm/Align2LLaVA."
}