@inproceedings{gong-etal-2025-visual,
title = "Visual Zero-Shot {E}-Commerce Product Attribute Value Extraction",
author = "Gong, Jiaying and
Cheng, Ming and
Shen, Hongda and
Vandenbussche, Pierre-Yves and
Jenq, Janet and
Eldardiry, Hoda",
editor = "Chen, Weizhu and
Yang, Yi and
Kachuee, Mohammad and
Fu, Xue-Yong",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-industry.38/",
doi = "10.18653/v1/2025.naacl-industry.38",
pages = "460--469",
ISBN = "979-8-89176-194-0",
abstract = "Existing zero-shot product attribute value (aspect) extraction approaches in e-Commerce industry rely on uni-modal or multi-modal models, where the sellers are asked to provide detailed textual inputs (product descriptions) for the products. However, manually providing (typing) the product descriptions is time-consuming and frustrating for the sellers. Thus, we propose a cross-modal zero-shot attribute value generation framework (ViOC-AG) based on CLIP, which only requires product images as the inputs. ViOC-AG follows a text-only training process, where a task-customized text decoder is trained with the frozen CLIP text encoder to alleviate the modality gap and task disconnection. During the zero-shot inference, product aspects are generated by the frozen CLIP image encoder connected with the trained task-customized text decoder. OCR tokens and outputs from a frozen prompt-based LLM correct the decoded outputs for out-of-domain attribute values. Experiments show that ViOC-AG significantly outperforms other fine-tuned vision-language models for zero-shot attribute value extraction."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="gong-etal-2025-visual">
<titleInfo>
<title>Visual Zero-Shot E-Commerce Product Attribute Value Extraction</title>
</titleInfo>
<name type="personal">
<namePart type="given">Jiaying</namePart>
<namePart type="family">Gong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ming</namePart>
<namePart type="family">Cheng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hongda</namePart>
<namePart type="family">Shen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Pierre-Yves</namePart>
<namePart type="family">Vandenbussche</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Janet</namePart>
<namePart type="family">Jenq</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hoda</namePart>
<namePart type="family">Eldardiry</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-04</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Weizhu</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yi</namePart>
<namePart type="family">Yang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="family">Kachuee</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xue-Yong</namePart>
<namePart type="family">Fu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Albuquerque, New Mexico</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-194-0</identifier>
</relatedItem>
<abstract>Existing zero-shot product attribute value (aspect) extraction approaches in the e-Commerce industry rely on uni-modal or multi-modal models, where sellers are asked to provide detailed textual inputs (product descriptions) for the products. However, manually providing (typing) the product descriptions is time-consuming and frustrating for the sellers. Thus, we propose a cross-modal zero-shot attribute value generation framework (ViOC-AG) based on CLIP, which only requires product images as inputs. ViOC-AG follows a text-only training process, where a task-customized text decoder is trained with the frozen CLIP text encoder to alleviate the modality gap and task disconnection. During zero-shot inference, product aspects are generated by the frozen CLIP image encoder connected with the trained task-customized text decoder. OCR tokens and outputs from a frozen prompt-based LLM correct the decoded outputs for out-of-domain attribute values. Experiments show that ViOC-AG significantly outperforms other fine-tuned vision-language models for zero-shot attribute value extraction.</abstract>
<identifier type="citekey">gong-etal-2025-visual</identifier>
<identifier type="doi">10.18653/v1/2025.naacl-industry.38</identifier>
<location>
<url>https://aclanthology.org/2025.naacl-industry.38/</url>
</location>
<part>
<date>2025-04</date>
<extent unit="page">
<start>460</start>
<end>469</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Visual Zero-Shot E-Commerce Product Attribute Value Extraction
%A Gong, Jiaying
%A Cheng, Ming
%A Shen, Hongda
%A Vandenbussche, Pierre-Yves
%A Jenq, Janet
%A Eldardiry, Hoda
%Y Chen, Weizhu
%Y Yang, Yi
%Y Kachuee, Mohammad
%Y Fu, Xue-Yong
%S Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
%D 2025
%8 April
%I Association for Computational Linguistics
%C Albuquerque, New Mexico
%@ 979-8-89176-194-0
%F gong-etal-2025-visual
%X Existing zero-shot product attribute value (aspect) extraction approaches in the e-Commerce industry rely on uni-modal or multi-modal models, where sellers are asked to provide detailed textual inputs (product descriptions) for the products. However, manually providing (typing) the product descriptions is time-consuming and frustrating for the sellers. Thus, we propose a cross-modal zero-shot attribute value generation framework (ViOC-AG) based on CLIP, which only requires product images as inputs. ViOC-AG follows a text-only training process, where a task-customized text decoder is trained with the frozen CLIP text encoder to alleviate the modality gap and task disconnection. During zero-shot inference, product aspects are generated by the frozen CLIP image encoder connected with the trained task-customized text decoder. OCR tokens and outputs from a frozen prompt-based LLM correct the decoded outputs for out-of-domain attribute values. Experiments show that ViOC-AG significantly outperforms other fine-tuned vision-language models for zero-shot attribute value extraction.
%R 10.18653/v1/2025.naacl-industry.38
%U https://aclanthology.org/2025.naacl-industry.38/
%U https://doi.org/10.18653/v1/2025.naacl-industry.38
%P 460-469
Markdown (Informal)
[Visual Zero-Shot E-Commerce Product Attribute Value Extraction](https://aclanthology.org/2025.naacl-industry.38/) (Gong et al., NAACL 2025)
ACL
Jiaying Gong, Ming Cheng, Hongda Shen, Pierre-Yves Vandenbussche, Janet Jenq, and Hoda Eldardiry. 2025. Visual Zero-Shot E-Commerce Product Attribute Value Extraction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 460–469, Albuquerque, New Mexico. Association for Computational Linguistics.
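
The abstract describes a two-stage pipeline: text-only training of a task-customized decoder against the frozen CLIP text encoder, followed by zero-shot inference in which the frozen CLIP image encoder feeds that decoder, with OCR tokens and a frozen prompt-based LLM correcting out-of-domain values. The sketch below illustrates only the inference flow, assuming OpenAI's `clip` package; the `TaskTextDecoder` class, its dimensions, and `extract_attribute_values` are hypothetical placeholders, not the authors' released implementation.

```python
# Minimal, hypothetical sketch of a ViOC-AG-style zero-shot inference flow.
# Module names and shapes below are assumptions for illustration only.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen CLIP model; only the image encoder is used at inference time.
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()


class TaskTextDecoder(torch.nn.Module):
    """Placeholder for the task-customized decoder that, per the abstract,
    is trained text-only against the frozen CLIP text encoder's space."""

    def __init__(self, embed_dim: int = 512, vocab_size: int = 49408):
        super().__init__()
        self.proj = torch.nn.Linear(embed_dim, vocab_size)

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # A real decoder would generate attribute-value tokens autoregressively;
        # this stub just scores a vocabulary from the pooled embedding.
        return self.proj(clip_embedding)


decoder = TaskTextDecoder().to(device).eval()


def extract_attribute_values(image_path: str) -> torch.Tensor:
    """Zero-shot: encode the product image with frozen CLIP, then decode."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image).float()
        logits = decoder(image_features)
    # The paper additionally corrects decoded outputs with OCR tokens and a
    # frozen prompt-based LLM for out-of-domain values; omitted here.
    return logits.topk(5).indices
```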