MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension

Ting Liu; Zunnan Xu; Yue Hu (胡月); Liangtao Shi; Zhiqiang Wang (王智强); Quanjun Yin

doi:10.18653/v1/2024.emnlp-main.287

Abstract

Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning. However, fully fine-tuning the entire backbone not only breaks the rich prior knowledge embedded in the pre-training, but also incurs significant computational costs. Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task in an effective and efficient manner. Directly applying these PETL methods to the REC task is inappropriate, as they lack the domain-specific abilities for precise local visual perception and visual-language alignment. Therefore, we propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER. Specifically, MaPPER comprises Dynamic Prior Adapters guided by an aligned prior, and Local Convolution Adapters to extract precise local semantics for better visual perception. Moreover, the Prior-Guided Text module is proposed to further utilize the prior for facilitating the cross-modal alignment. Experimental results on three widely used benchmarks demonstrate that MaPPER achieves the best accuracy compared to full fine-tuning and other PETL methods with only 1.41% tunable backbone parameters.

Anthology ID: 2024.emnlp-main.287
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 4984–4994
URL: https://aclanthology.org/2024.emnlp-main.287/
DOI: 10.18653/v1/2024.emnlp-main.287
Cite (ACL): Ting Liu, Zunnan Xu, Yue Hu, Liangtao Shi, Zhiqiang Wang, and Quanjun Yin. 2024. MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4984–4994, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension (Liu et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.287.pdf
