Towards Unifying Reference Expression Generation and Comprehension

Duo Zheng, Tao Kong, Ya Jing, Jiaan Wang, Xiaojie Wang


Abstract
Reference Expression Generation (REG) and Comprehension (REC) are two highly correlated tasks. Modeling REG and REC simultaneously to exploit the relation between them is a promising way to improve both. However, their distinct inputs, together with the need to build connections between them in a single model, make the joint model challenging to design and train. To address these problems, we propose a unified model for REG and REC, named UniRef. It unifies the two tasks with a carefully designed Image-Region-Text Fusion (IRTF) layer, which fuses the image, region and text via image cross-attention and region cross-attention. Additionally, IRTF can generate pseudo input regions for the REC task, so that REC and REG share an identical representation space in a uniform way. We further propose Vision-conditioned Masked Language Modeling (VMLM) and Text-Conditioned Region Prediction (TRP) to pre-train the UniRef model on multi-granular corpora. VMLM and TRP are directly related to REG and REC, respectively, but can also help each other. We conduct extensive experiments on three benchmark datasets: RefCOCO, RefCOCO+ and RefCOCOg. Experimental results show that our model outperforms previous state-of-the-art methods on both REG and REC.
Anthology ID:
2022.emnlp-main.442
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
6598–6611
URL:
https://aclanthology.org/2022.emnlp-main.442
DOI:
10.18653/v1/2022.emnlp-main.442
Cite (ACL):
Duo Zheng, Tao Kong, Ya Jing, Jiaan Wang, and Xiaojie Wang. 2022. Towards Unifying Reference Expression Generation and Comprehension. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6598–6611, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Towards Unifying Reference Expression Generation and Comprehension (Zheng et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.442.pdf