CLEVR-Implicit: A Diagnostic Dataset for Implicit Reasoning in Referring Expression Comprehension

Jingwei Zhang, Xin Wu, Yi Cai


Abstract
Recently, pre-trained vision-language (VL) models have achieved remarkable success in various cross-modal tasks, including referring expression comprehension (REC). These models are pre-trained on large-scale image-text pairs to learn the alignment between words in textual descriptions and objects in the corresponding images, and are then fine-tuned on downstream tasks. However, the performance of VL models degrades on implicit text, which describes objects through comparisons between two or more objects rather than explicitly mentioning them, because the models struggle to align implicit text with the objects in the images. To address this challenge, we introduce CLEVR-Implicit, a dataset consisting of synthetic images and two corresponding types of implicit text for the REC task. Additionally, to enhance the performance of VL models on implicit text, we propose a method called Transforming Implicit text into Explicit text (TIE), which enables VL models to reason with implicit text. TIE consists of two modules: (1) the prompt design module builds prompts for implicit text by adding masked tokens, and (2) the cloze procedure module fine-tunes on these prompts, using masked language modeling (MLM) to predict the explicit words from the implicit prompts. Experimental results on our dataset demonstrate a significant improvement of 37.94% in the performance of VL models on implicit text after employing our TIE method.
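The two TIE modules described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the `[MASK]` token string, the "attribute is [MASK]" template, the helper names `build_prompt` and `fill_prompt`, and the example sentence are all hypothetical. The MLM itself is not included; its predictions are passed in as a plain list.

```python
from typing import List

# Assumed BERT-style mask token; the abstract does not name a specific one.
MASK = "[MASK]"


def build_prompt(implicit_text: str, attributes: List[str]) -> str:
    """Prompt design (hypothetical template): append one masked slot
    for each attribute that the implicit text leaves unstated."""
    slots = " ".join(f"{attr} is {MASK} ." for attr in attributes)
    return f"{implicit_text.rstrip('. ')} . {slots}"


def fill_prompt(prompt: str, predictions: List[str]) -> str:
    """Cloze step (stubbed): replace masked slots left to right with the
    explicit words an MLM would predict. No model is run here."""
    for word in predictions:
        prompt = prompt.replace(MASK, word, 1)
    return prompt


# Hypothetical implicit description: the color is never named explicitly.
prompt = build_prompt(
    "The object that has the same color as another object", ["color"]
)
# In a real pipeline the word "red" would come from a fine-tuned MLM.
explicit = fill_prompt(prompt, ["red"])
```

In this sketch the explicit text is simply the prompt with its masked slots filled, which a VL model could then align with objects in the image in the usual way.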
Anthology ID:
2023.emnlp-main.791
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
12820–12830
URL:
https://aclanthology.org/2023.emnlp-main.791
DOI:
10.18653/v1/2023.emnlp-main.791
Cite (ACL):
Jingwei Zhang, Xin Wu, and Yi Cai. 2023. CLEVR-Implicit: A Diagnostic Dataset for Implicit Reasoning in Referring Expression Comprehension. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12820–12830, Singapore. Association for Computational Linguistics.
Cite (Informal):
CLEVR-Implicit: A Diagnostic Dataset for Implicit Reasoning in Referring Expression Comprehension (Zhang et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.791.pdf
Video:
https://aclanthology.org/2023.emnlp-main.791.mp4