UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation

Xiangyu Zhao, Yuehan Zhang, Wenlong Zhang, Xiao-Ming Wu


Abstract
The fashion domain encompasses a variety of real-world multimodal tasks, including multimodal retrieval and multimodal generation. Rapid advancements in AI-generated content, particularly large language models for text generation and diffusion models for visual generation, have sparked widespread research interest in applying these multimodal models to the fashion domain. However, embedding-based tasks, such as image-to-text or text-to-image retrieval, have been largely overlooked from this perspective due to the diverse nature of the multimodal fashion domain. Moreover, current research on multi-task single models lacks focus on image generation. In this work, we present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain, integrating image generation with retrieval and text generation tasks. UniFashion unifies embedding and generative tasks by integrating a diffusion model and an LLM, enabling controllable and high-fidelity generation. Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks and can be readily adapted to manage complex vision-language tasks. This work demonstrates the potential for learning synergy between multimodal generation and retrieval, offering a promising direction for future research in the fashion domain.
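The abstract describes the architecture only at a high level. As a rough illustration of what "unifying embedding and generative tasks" can mean in code, below is a minimal, hypothetical PyTorch sketch of a model with one shared multimodal encoder feeding three heads: retrieval embeddings, caption (text) generation, and a toy diffusion denoising step. All class names, dimensions, and the single-step denoiser are assumptions made for illustration; this is not the authors' actual UniFashion implementation (see the paper PDF for that).

```python
# Hypothetical sketch of a unified embedding + generation model.
# NOT the UniFashion architecture; names and sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedFashionModel(nn.Module):
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        # Shared multimodal encoder: maps image features and text tokens
        # into one space used by BOTH retrieval and generation.
        self.img_proj = nn.Linear(512, dim)      # stand-in image backbone output
        self.txt_emb = nn.Embedding(vocab, dim)  # stand-in text tokenizer output
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2)
        # Generation heads conditioned on the shared representation:
        self.lm_head = nn.Linear(dim, vocab)     # caption generation (LLM-like)
        self.denoiser = nn.Linear(dim * 2, dim)  # toy diffusion denoising step

    def fuse(self, img_feats, txt_ids):
        tokens = torch.cat([self.img_proj(img_feats),
                            self.txt_emb(txt_ids)], dim=1)
        return self.fusion(tokens)

    def embed(self, img_feats, txt_ids):
        """Retrieval path: one joint embedding per (image, text) input."""
        return F.normalize(self.fuse(img_feats, txt_ids).mean(dim=1), dim=-1)

    def caption_logits(self, img_feats, txt_ids):
        """Text-generation path: per-token logits conditioned on the image."""
        return self.lm_head(self.fuse(img_feats, txt_ids))

    def denoise(self, noisy_latent, img_feats, txt_ids):
        """Image-generation path: a single toy denoising step conditioned on
        the joint embedding (real diffusion models iterate many timesteps)."""
        cond = self.embed(img_feats, txt_ids)
        return self.denoiser(torch.cat([noisy_latent, cond], dim=-1))

model = UnifiedFashionModel()
img = torch.randn(2, 4, 512)            # 2 images, 4 patch features each
txt = torch.randint(0, 1000, (2, 6))    # 2 captions, 6 tokens each
emb = model.embed(img, txt)             # retrieval embeddings: (2, 256)
sim = emb @ emb.t()                     # cosine similarities for retrieval
```

The point of the sketch is the shared encoder: because retrieval embeddings and both generation heads consume the same fused representation, gradients from generative objectives can shape the embedding space used for retrieval, which is the learning synergy the abstract refers to.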
Anthology ID:
2024.emnlp-main.89
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1490–1507
URL:
https://aclanthology.org/2024.emnlp-main.89
Cite (ACL):
Xiangyu Zhao, Yuehan Zhang, Wenlong Zhang, and Xiao-Ming Wu. 2024. UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1490–1507, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation (Zhao et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.89.pdf
Software:
2024.emnlp-main.89.software.zip