A Strong and Robust Baseline for Text-Image Matching

Fangyu Liu, Rongtian Ye


Abstract
We review current text-image matching models and propose improvements for both training and inference. First, we empirically show the limitations of two popular losses (sum and max-margin loss) widely used in training text-image embeddings and propose a trade-off: a kNN-margin loss which 1) utilizes information from hard negatives and 2) is robust to noise because all of the K hardest samples are taken into account, tolerating pseudo-negatives and outliers. Second, we advocate the use of Inverted Softmax (IS) and Cross-modal Local Scaling (CSLS) during inference to mitigate the so-called hubness problem in high-dimensional embedding spaces, improving scores on all metrics by a large margin.
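The abstract names three techniques without giving formulas, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes a square similarity matrix `sim` whose entry `sim[i, j]` scores image `i` against caption `j`, with matched pairs on the diagonal. The function names, the default `k`, `margin`, and `beta` values, and the demo matrix are all illustrative assumptions. The CSLS and Inverted Softmax rescoring follow their commonly used definitions from the cross-lingual word-embedding literature, applied here to image-caption scores.

```python
import numpy as np


def knn_margin_loss(sim, k=3, margin=0.2):
    """Hinge loss summed over the k hardest negatives in each direction.

    Assumed reading of "kNN-margin": k=1 reduces to a max-margin loss
    (hardest negative only), k=n-1 to a sum-margin loss (all negatives).
    """
    n = sim.shape[0]
    pos = np.diag(sim)                      # scores of the matched pairs
    diag = np.eye(n, dtype=bool)
    # Hinge cost of every caption for each image (row-wise) ...
    cost_i2t = np.clip(margin + sim - pos[:, None], 0.0, None)
    # ... and of every image for each caption (column-wise).
    cost_t2i = np.clip(margin + sim - pos[None, :], 0.0, None)
    cost_i2t[diag] = 0.0                    # a positive is not its own negative
    cost_t2i[diag] = 0.0
    # The k largest hinge costs correspond to the k hardest negatives.
    top_i2t = np.sort(cost_i2t, axis=1)[:, -k:]
    top_t2i = np.sort(cost_t2i, axis=0)[-k:, :]
    return float(top_i2t.sum() + top_t2i.sum())


def csls_rescore(sim, k=2):
    """CSLS: 2*sim minus each side's mean similarity to its k nearest neighbours."""
    r_img = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # per image (row)
    r_cap = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # per caption (column)
    return 2.0 * sim - r_img[:, None] - r_cap[None, :]


def inverted_softmax(sim, beta=10.0):
    """Softmax normalised over the queries (rows) instead of the candidates.

    beta is an assumed inverse temperature; a candidate that scores highly
    for every query (a hub) gets a large normaliser and is down-weighted.
    """
    e = np.exp(beta * (sim - sim.max()))    # shift for numerical stability
    return e / e.sum(axis=0, keepdims=True)


if __name__ == "__main__":
    # Toy scores in which caption 0 acts as a hub: it is close to every image.
    demo = np.array([[0.90, 0.20, 0.10],
                     [0.75, 0.70, 0.10],
                     [0.70, 0.10, 0.60]])
    print("raw top-1 captions:", demo.argmax(axis=1))
    print("CSLS top-1 captions:", csls_rescore(demo).argmax(axis=1))
    print("IS top-1 captions:", inverted_softmax(demo).argmax(axis=1))
    print("kNN-margin loss (k=1, k=2):",
          knn_margin_loss(demo, k=1), knn_margin_loss(demo, k=2))
```

In this sketch `k` is the only knob that moves the loss between the two extremes the abstract contrasts, which is why a single function covers both. Both rescoring functions act purely on the similarity matrix at inference time and need no retraining.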
Anthology ID:
P19-2023
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Fernando Alva-Manchego, Eunsol Choi, Daniel Khashabi
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
169–176
URL:
https://aclanthology.org/P19-2023/
DOI:
10.18653/v1/P19-2023
Cite (ACL):
Fangyu Liu and Rongtian Ye. 2019. A Strong and Robust Baseline for Text-Image Matching. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 169–176, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
A Strong and Robust Baseline for Text-Image Matching (Liu & Ye, ACL 2019)
PDF:
https://aclanthology.org/P19-2023.pdf
Data:
Flickr30k, MS COCO