Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective

Zijian Zhang, Chang Shu, Ya Xiao, Yuan Shen, Di Zhu, Youxin Chen, Jing Xiao, Jey Han Lau, Qian Zhang, Zheng Lu


Abstract
Visual-Semantic Embedding (VSE) aims to learn an embedding space where related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods performs no worse than these sophisticated methods; and (2) considering only the most difficult-to-distinguish negative sample leads to slow convergence and limited Recall@K improvement. To this end, we propose an adaptive pooling strategy that allows the model to learn how to aggregate features through a combination of simple pooling methods. We also introduce a strategy that dynamically selects a group of negative samples so that optimization converges faster and reaches better performance. Experimental results on Flickr30K and MS-COCO demonstrate that a standard VSE using our pooling and optimization strategies outperforms current state-of-the-art systems (by at least 1.0% on recall metrics) in image-to-text and text-to-image retrieval. Source code of our experiments is available at https://github.com/96-Zachary/vse_2ad.
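The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the softmax-weighted blend of mean and max pooling, and the fixed `k` used to pick a group of negatives are all assumptions made for illustration. In the paper the pooling weights are learned and the group of negatives is selected dynamically; here both are passed in as plain arguments.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(features):
    """Average each dimension over a list of equal-length feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def max_pool(features):
    """Take the per-dimension maximum over a list of feature vectors."""
    return [max(col) for col in zip(*features)]


def combined_pool(features, logits):
    """Blend mean and max pooling into one fixed-length vector.

    `logits` (hypothetical) holds two scores, one per pooling method.
    A trained model would learn them; here they are supplied by the caller.
    """
    w_mean, w_max = softmax(logits)
    return [
        w_mean * a + w_max * b
        for a, b in zip(mean_pool(features), max_pool(features))
    ]


def topk_triplet_loss(pos_score, neg_scores, margin=0.2, k=2):
    """Average hinge loss over the k highest-scoring negatives.

    With k=1 this reduces to the hardest-negative triplet loss. The fixed k
    is an assumption standing in for a dynamic selection rule.
    """
    hardest = sorted(neg_scores, reverse=True)[:k]
    if not hardest:
        return 0.0
    return sum(max(0.0, margin + s - pos_score) for s in hardest) / len(hardest)


if __name__ == "__main__":
    region_features = [[1.0, 4.0], [3.0, 2.0]]
    print(combined_pool(region_features, [0.0, 0.0]))
    print(topk_triplet_loss(0.8, [0.9, 0.5, 0.7], k=1))
    print(topk_triplet_loss(0.8, [0.9, 0.5, 0.7], k=2))
```

Setting `k=1` recovers the hard triplet loss that the abstract contrasts against, while larger `k` lets several negatives contribute to each update.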
Anthology ID:
2023.eacl-main.87
Volume:
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
1217–1229
URL:
https://aclanthology.org/2023.eacl-main.87
DOI:
10.18653/v1/2023.eacl-main.87
Cite (ACL):
Zijian Zhang, Chang Shu, Ya Xiao, Yuan Shen, Di Zhu, Youxin Chen, Jing Xiao, Jey Han Lau, Qian Zhang, and Zheng Lu. 2023. Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1217–1229, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective (Zhang et al., EACL 2023)
PDF:
https://aclanthology.org/2023.eacl-main.87.pdf
Software:
 2023.eacl-main.87.software.zip
Video:
 https://aclanthology.org/2023.eacl-main.87.mp4