An Anchor-based Relative Position Embedding Method for Cross-Modal Tasks

Ya Wang, Xingwu Sun, Lian Fengzong, ZhanHui Kang, Chengzhong Xu Xu


Abstract
Position Embedding (PE) is essential for a transformer to capture the sequence ordering of input tokens. Despite its general effectiveness verified in Natural Language Processing (NLP) and Computer Vision (CV), its application in cross-modal tasks remains unexplored and faces two challenges: 1) the input text tokens and image patches are not aligned, and 2) the encoding space of each modality is different, which prevents direct feature comparison. In this paper, we propose a unified position embedding method for these problems, called AnChor-basEd Relative Position Embedding (ACE-RPE). We first introduce an anchor locating mechanism to bridge the semantic gap and locate anchors from different modalities. We then calculate the distance of each text token and image patch by computing their shortest paths from the located anchors. Finally, we embed the anchor-based distance to guide the computation of cross-attention. In this way, ACE-RPE provides a cross-modal relative position embedding for the cross-modal transformer. Benefiting from ACE-RPE, our method obtains new state-of-the-art (SOTA) results on a wide range of benchmarks, such as Image-Text Retrieval on MS-COCO and Flickr30K, Visual Entailment on SNLI-VE, Visual Reasoning on NLVR2, and Weakly-supervised Visual Grounding on RefCOCO+.
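
The three steps in the abstract (locate anchors, measure shortest-path distances from them, use the distance to guide cross-attention) can be illustrated with a small sketch. This is not the authors' implementation, and everything specific in it is an assumption made for illustration: anchors are supplied by hand rather than located by a learned mechanism, text tokens are treated as a chain and image patches as a 4-connected grid, the cross-modal distance is taken as the sum of the two within-modality distances to a shared anchor pair, and the distance is turned into an additive bias on the attention logits through a hypothetical lookup table `bias_table`.

```python
import math

def text_anchor_distance(num_tokens, anchor_idx):
    # Shortest path on a chain of tokens is the index difference (assumption).
    return [abs(i - anchor_idx) for i in range(num_tokens)]

def patch_anchor_distance(grid_h, grid_w, anchor_rc):
    # Shortest path on a 4-connected patch grid is the Manhattan distance
    # (assumption). Patches are listed in row-major order.
    ar, ac = anchor_rc
    return [abs(r - ar) + abs(c - ac)
            for r in range(grid_h) for c in range(grid_w)]

def cross_modal_distance(num_tokens, grid_h, grid_w, anchor_pairs):
    # anchor_pairs: list of (token_index, (patch_row, patch_col)) pairs that
    # are assumed to be semantically matched. They are hand-specified here.
    n_patches = grid_h * grid_w
    dist = [[math.inf] * n_patches for _ in range(num_tokens)]
    for tok_anchor, patch_anchor in anchor_pairs:
        dt = text_anchor_distance(num_tokens, tok_anchor)
        dp = patch_anchor_distance(grid_h, grid_w, patch_anchor)
        for i in range(num_tokens):
            for j in range(n_patches):
                # Token -> anchor token, then anchor patch -> patch;
                # keep the shortest route over all anchor pairs.
                dist[i][j] = min(dist[i][j], dt[i] + dp[j])
    return dist

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def biased_cross_attention(logits, dist, bias_table):
    # bias_table[d] is a hypothetical per-distance bias (learned in a real
    # model); distances beyond the table are clipped to its last entry.
    max_d = len(bias_table) - 1
    weights = []
    for row_logits, row_dist in zip(logits, dist):
        biased = [l + bias_table[min(d, max_d)]
                  for l, d in zip(row_logits, row_dist)]
        weights.append(softmax(biased))
    return weights

# Toy example: 3 text tokens, a 2x2 patch grid, one anchor pair linking
# token 1 to the patch at row 0, column 1.
dist = cross_modal_distance(3, 2, 2, [(1, (0, 1))])
logits = [[0.0] * 4 for _ in range(3)]  # uniform content scores
attn = biased_cross_attention(logits, dist, [0.0, -0.5, -1.0, -1.5])
```

With uniform content scores, each token's attention is highest on the patch closest to the anchor, which is the intended effect of a distance-aware bias. How ACE-RPE actually locates anchors and embeds the distance is specified in the paper itself, not in this sketch.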
Anthology ID:
2022.emnlp-main.362
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5401–5413
URL:
https://aclanthology.org/2022.emnlp-main.362
DOI:
10.18653/v1/2022.emnlp-main.362
Cite (ACL):
Ya Wang, Xingwu Sun, Lian Fengzong, ZhanHui Kang, and Chengzhong Xu Xu. 2022. An Anchor-based Relative Position Embedding Method for Cross-Modal Tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5401–5413, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
An Anchor-based Relative Position Embedding Method for Cross-Modal Tasks (Wang et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.362.pdf