FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning
Suvir Mirchandani | Licheng Yu | Mengjiao Wang | Animesh Sinha | Wenwen Jiang | Tao Xiang | Ning Zhang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Multimodal tasks in the fashion domain have significant potential for e-commerce, but involve challenging vision-and-language learning problems—e.g., retrieving a fashion item given a reference image plus text feedback from a user. Prior works on multimodal fashion tasks have either been limited by the data in individual benchmarks, or have leveraged generic vision-and-language pre-training but have not taken advantage of the characteristics of fashion data. Additionally, these works have mainly been restricted to multimodal understanding tasks. To address these gaps, we make two key contributions. First, we propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks. Second, we propose a flexible decoder-based model architecture capable of both fashion retrieval and captioning tasks. Together, our model design and pre-training approach are competitive on a diverse set of fashion tasks, including cross-modal retrieval, image retrieval with text feedback, image captioning, relative image captioning, and multimodal categorization.
Visual Attention Model for Name Tagging in Multimodal Social Media
Di Lu | Leonardo Neves | Vitor Carvalho | Ning Zhang | Heng Ji
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Everyday billions of multimodal posts containing both images and text are shared in social media sites such as Snapchat, Twitter or Instagram. This combination of image and text in a single message allows for more creative and expressive forms of communication, and has become increasingly common in such sites. This new paradigm brings new challenges for natural language understanding, as the textual component tends to be shorter, more informal, and often is only understood if combined with the visual context. In this paper, we explore the task of name tagging in multimodal social media posts. We start by creating two new multimodal datasets: the first based on Twitter posts and the second based on Snapchat captions (exclusively submitted to public and crowd-sourced stories). We then propose a novel model architecture based on Visual Attention that not only provides deeper visual understanding on the decisions of the model, but also significantly outperforms other state-of-the-art baseline methods for this task.
NITE: A Neural Inductive Teaching Framework for Domain Specific NER
Siliang Tang | Ning Zhang | Jinjiang Zhang | Fei Wu | Yueting Zhuang
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
In domain-specific NER, due to insufficient labeled training data, deep models usually fail to behave normally. In this paper, we proposed a novel Neural Inductive TEaching framework (NITE) to transfer knowledge from existing domain-specific NER models into an arbitrary deep neural network in a teacher-student training manner. NITE is a general framework that builds upon transfer learning and multiple instance learning, which collaboratively not only transfers knowledge to a deep student network but also reduces the noise from teachers. NITE can help deep learning methods to effectively utilize existing resources (i.e., models, labeled and unlabeled data) in a small domain. The experiment resulted on Disease NER proved that without using any labeled data, NITE can significantly boost the performance of a CNN-bidirectional LSTM-CRF NER neural network nearly over 30% in terms of F1-score.
- Di Lu 1
- Leonardo Neves 1
- Vitor Carvalho 1
- Heng Ji 1
- Siliang Tang 1
- show all...