Multimodal Item Categorization Fully Based on Transformer

Lei Chen; Houwei Chou; Yandi Xia; Hirokazu Miyake

doi:10.18653/v1/2021.ecnlp-1.13

Abstract

The Transformer has proven to be a powerful feature extraction method and has gained widespread adoption in natural language processing (NLP). In this paper we propose a multimodal item categorization (MIC) system solely based on the Transformer for both text and image processing. On a multimodal product data set collected from a Japanese e-commerce giant, we tested a new image classification model based on the Transformer and investigated different ways of fusing bi-modal information. Our experimental results on real industry data showed that the Transformer-based image classifier has performance on par with ResNet-based classifiers and is four times faster to train. Furthermore, a cross-modal attention layer was found to be critical for the MIC system to achieve performance gains over text-only and image-only models.

Anthology ID: 2021.ecnlp-1.13
Volume: Proceedings of the 4th Workshop on e-Commerce and NLP
Month: August
Year: 2021
Address: Online
Editors: Shervin Malmasi, Surya Kallumadi, Nicola Ueffing, Oleg Rokhlenko, Eugene Agichtein, Ido Guy
Venue: ECNLP
Publisher: Association for Computational Linguistics
Pages: 111–115
URL: https://aclanthology.org/2021.ecnlp-1.13
DOI: 10.18653/v1/2021.ecnlp-1.13
Cite (ACL): Lei Chen, Houwei Chou, Yandi Xia, and Hirokazu Miyake. 2021. Multimodal Item Categorization Fully Based on Transformer. In Proceedings of the 4th Workshop on e-Commerce and NLP, pages 111–115, Online. Association for Computational Linguistics.
Cite (Informal): Multimodal Item Categorization Fully Based on Transformer (Chen et al., ECNLP 2021)
PDF: https://aclanthology.org/2021.ecnlp-1.13.pdf
