VIT-Pro: Visual Instruction Tuning for Product Images

Vishnu Prabhakaran; Purav Aggarwal; Vishruit Kulshreshtha; Arunita Das; Sahini Venkata Sitaram Sruti; Anoop Saladi

doi:10.18653/v1/2025.naacl-industry.57

VIT-Pro: Visual Instruction Tuning for Product Images

Vishnu Prabhakaran, Purav Aggarwal, Vishruit Kulshreshtha, Arunita Das, Sahini Venkata Sitaram Sruti, Anoop Saladi

Abstract

General vision-language models (VLMs) trained on web data struggle to understand and converse about real-world e-commerce product images. We propose a cost-efficient approach for collecting training data to train a generative VLM for e-commerce product images. The key idea is to leverage large-scale, loosely-coupled image-text pairs from e-commerce stores, use a pretrained LLM to generate multimodal instruction-following data, and fine-tune a general vision-language model using LoRA. Our instruction-finetuned model, VIT-Pro, can understand and respond to queries about product images, covering diverse concepts and tasks. VIT-Pro outperforms several general-purpose VLMs on multiple vision tasks in the e-commerce domain.

Anthology ID:: 2025.naacl-industry.57
Volume:: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
Month:: April
Year:: 2025
Address:: Albuquerque, New Mexico
Editors:: Weizhu Chen, Yi Yang, Mohammad Kachuee, Xue-Yong Fu
Venue:: NAACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 695–707
Language:
URL:: https://aclanthology.org/2025.naacl-industry.57/
DOI:: 10.18653/v1/2025.naacl-industry.57
Bibkey:
Cite (ACL):: Vishnu Prabhakaran, Purav Aggarwal, Vishruit Kulshreshtha, Arunita Das, Sahini Venkata Sitaram Sruti, and Anoop Saladi. 2025. VIT-Pro: Visual Instruction Tuning for Product Images. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 695–707, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):: VIT-Pro: Visual Instruction Tuning for Product Images (Prabhakaran et al., NAACL 2025)
Copy Citation:
PDF:: https://aclanthology.org/2025.naacl-industry.57.pdf

PDF Cite Search Fix data