Matteo Nulli
2026
Adapting Vision-Language Models for E-commerce Understanding at Scale
Matteo Nulli | Orshulevich Vladimir | Tala Bazazo | Christian Herold | Michael Kozielski | Marcin Mazur | Szymon Tuzel | Cees G. M. Snoek | Seyyed Hadi Hashemi | Omar Javed | Yannick Versley | Shahram Khadivi
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)
E-commerce product understanding by nature demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.