MITF:基于图像映射文本特征的跨模态图文检索方法(MITF:Cross-modal Image-text Retrieval Method with Mapping Images to Text Features)

Lou Xinyue (娄馨月), Li You (李铀), Qi Rui (齐睿), Chen Yufeng (陈钰枫), Xu Jinan (徐金安)


Abstract
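“Narrowing the semantic gap between image and text information, and promoting the alignment and fusion of cross-modal information, has long been the key to cross-modal image-text retrieval. In existing dual-stream models, however, the image encoder and the text encoder are trained separately, which makes image and text features hard to align and fuse. This paper therefore proposes the Mapping Images to Text Features (MITF) network, which maps information from different modalities (image and text) into a single modality (text), further strengthening cross-modal semantic fusion and alignment and improving image-text retrieval performance. Specifically, with the parameters of the pretrained Chinese vision-language model Chinese-CLIP frozen, an MITF network is trained to map images to pseudo-language tokens. On this basis, an automatic prompt learning mechanism is introduced to improve the model's understanding of the pseudo-language tokens. In addition, a Faiss index is built at retrieval time to increase retrieval speed. Experiments on three open-source datasets show that, compared with the original Chinese-CLIP model, the proposed method improves Mean Recall at retrieval by 3.7% on average and increases retrieval speed by about 4 times. Visualization of the image and text features further shows that the proposed method improves the alignment between image features and text features.”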
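As a rough illustration of the evaluation metric named in the abstract, the sketch below computes Mean Recall for a retrieval task. It assumes, as is common in Chinese-CLIP style evaluations but not stated in the abstract, that Mean Recall is the average of Recall@1, Recall@5, and Recall@10 over cosine-similarity rankings. The function names (`l2_normalize`, `recall_at_k`, `mean_recall`) and the random features are hypothetical stand-ins, not the authors' code or data.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def recall_at_k(query_feats, gallery_feats, gt_index, k):
    # Fraction of queries whose ground-truth gallery item is ranked in the top k.
    sims = l2_normalize(query_feats) @ l2_normalize(gallery_feats).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == gt_index[:, None]).any(axis=1)
    return float(hits.mean())


def mean_recall(query_feats, gallery_feats, gt_index, ks=(1, 5, 10)):
    # ASSUMPTION: Mean Recall = average of Recall@k over the cutoffs in ks.
    return float(np.mean([recall_at_k(query_feats, gallery_feats, gt_index, k)
                          for k in ks]))


if __name__ == "__main__":
    # Synthetic features only; real features would come from the encoders.
    rng = np.random.default_rng(0)
    gallery = rng.normal(size=(100, 16))
    queries = gallery[:20] + 0.01 * rng.normal(size=(20, 16))
    gt = np.arange(20)
    print("Mean Recall:", mean_recall(queries, gallery, gt))
```

The brute-force matrix product here is only for clarity. The abstract reports using a Faiss index for faster search but does not say which index type, so any replacement (for example an inner-product index such as `faiss.IndexFlatIP` over normalized features) would be a further assumption.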
Anthology ID:
2024.ccl-1.1
Volume:
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
Month:
July
Year:
2024
Address:
Taiyuan, China
Editors:
Maosong Sun, Jiye Liang, Xianpei Han, Zhiyuan Liu, Yulan He
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
1–14
Language:
Chinese
URL:
https://aclanthology.org/2024.ccl-1.1/
Cite (ACL):
Lou Xinyue, Li You, Qi Rui, Chen Yufeng, and Xu Jinan. 2024. MITF:基于图像映射文本特征的跨模态图文检索方法(MITF:Cross-modal Image-text Retrieval Method with Mapping Images to Text Features). In Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference), pages 1–14, Taiyuan, China. Chinese Information Processing Society of China.
Cite (Informal):
MITF:基于图像映射文本特征的跨模态图文检索方法(MITF:Cross-modal Image-text Retrieval Method with Mapping Images to Text Features) (Lou et al., CCL 2024)
PDF:
https://aclanthology.org/2024.ccl-1.1.pdf