MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval

Youbo Lei; Feifei He; Chen Chen; Yingbin Mo; Sijia Li; Defeng Xie; Haonan Lu

doi:10.18653/v1/2024.findings-naacl.96

Abstract

Due to the success of large-scale visual-language pretraining (VLP) models and the widespread use of image-text retrieval in industry, it is now critically necessary to reduce model size and streamline mobile-device deployment. Single- and dual-stream model structures are commonly used in image-text retrieval with the goal of closing the semantic gap between textual and visual modalities. While single-stream models use deep feature fusion to achieve more accurate cross-modal alignment, dual-stream models are better at offline indexing and fast inference. We propose a Multi-teacher Cross-modality Alignment Distillation (MCAD) technique to integrate the advantages of single- and dual-stream models. By incorporating the fused single-stream features into the image and text features of the dual-stream model, we formulate new modified teacher similarity distributions and features. Then, we conduct both distribution and feature distillation to boost the capability of the student dual-stream model, achieving high retrieval performance without increasing inference complexity. Extensive experiments demonstrate the remarkable performance and high efficiency of MCAD on image-text retrieval tasks. Furthermore, we implement a lightweight CLIP model on Snapdragon/Dimensity chips with only ~100M running memory and ~8.0ms search latency, enabling the deployment of VLP models on mobile devices.
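
The abstract names two training signals, distribution distillation and feature distillation, without giving their formulas. The following is a minimal sketch of what such losses could look like, not the authors' implementation. It assumes that the teacher similarity matrix is a weighted sum of dual-stream similarities and single-stream matching scores, that distribution distillation is a row-wise KL divergence between softmax-normalized similarities, and that feature distillation is a mean squared error. The function names and the `alpha` and `temperature` parameters are hypothetical.

```python
import math


def softmax(row, temperature=1.0):
    """Turn one row of similarity scores into a probability distribution."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def combine_teacher_scores(dual_sim, single_scores, alpha=0.5):
    """Assumed fusion rule: weighted sum of dual- and single-stream scores."""
    return [
        [alpha * d + (1.0 - alpha) * s for d, s in zip(d_row, s_row)]
        for d_row, s_row in zip(dual_sim, single_scores)
    ]


def distribution_distillation_loss(teacher_sim, student_sim, temperature=1.0):
    """Mean row-wise KL between teacher and student similarity distributions."""
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_sim, student_sim)
    ]
    return sum(losses) / len(losses)


def feature_distillation_loss(teacher_feats, student_feats):
    """Mean squared error between teacher and student feature vectors."""
    total, count = 0.0, 0
    for t_vec, s_vec in zip(teacher_feats, student_feats):
        for a, b in zip(t_vec, s_vec):
            total += (a - b) ** 2
            count += 1
    return total / count


# Toy batch of two images and two texts (illustrative numbers only).
dual_sim = [[0.9, 0.1], [0.2, 0.8]]       # dual-stream teacher similarities
single_scores = [[0.7, 0.3], [0.1, 0.9]]  # single-stream teacher matching scores
student_sim = [[0.6, 0.4], [0.5, 0.5]]    # student dual-stream similarities

teacher_sim = combine_teacher_scores(dual_sim, single_scores, alpha=0.5)
loss = distribution_distillation_loss(teacher_sim, student_sim, temperature=0.1)
```

Under these assumptions only the student's dual-stream encoders are needed at inference time, which is consistent with the abstract's claim that inference complexity does not increase.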

Anthology ID: 2024.findings-naacl.96
Volume: Findings of the Association for Computational Linguistics: NAACL 2024
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1491–1503
URL: https://aclanthology.org/2024.findings-naacl.96/
DOI: 10.18653/v1/2024.findings-naacl.96
Cite (ACL): Youbo Lei, Feifei He, Chen Chen, Yingbin Mo, Sijia Li, Defeng Xie, and Haonan Lu. 2024. MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 1491–1503, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval (Lei et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-naacl.96.pdf
