FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension

Junzhuo Liu, Xuzheng Yang, Weiwei Li, Peng Wang


Abstract
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates language understanding, image comprehension, and language-to-image grounding. Consequently, it serves as an ideal testing ground for Multi-modal Large Language Models (MLLMs). In pursuit of this goal, we have established a new REC dataset characterized by two key features. First, it is designed with controllable levels of difficulty, necessitating multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Second, it includes negative text and images created through fine-grained editing and generation based on existing data, thereby testing the model's ability to correctly reject scenarios where the target object is not visible in the image, an essential aspect often overlooked in existing datasets and approaches. Using this high-quality dataset, we conducted comprehensive evaluations of both state-of-the-art specialist models and MLLMs. Our findings indicate that a significant gap remains in achieving satisfactory grounding performance. We anticipate that our dataset will inspire new approaches to enhance visual reasoning and develop more advanced cross-modal interaction strategies, ultimately unlocking the full potential of MLLMs.
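As an illustration only (this is not the paper's evaluation code), the sketch below shows one way a REC benchmark with negative samples could be scored. It assumes the common REC convention that a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5, and it represents a negative sample as a target of None, which the model should answer with None (a rejection). The names iou, is_correct, and accuracy, the (x1, y1, x2, y2) box format, and the None convention are all assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Optional[Box], target: Optional[Box],
               thresh: float = 0.5) -> bool:
    """Score one sample.

    Assumption: target is None for a negative sample (the referred object
    is absent), and the model signals rejection by predicting None.
    """
    if target is None:
        return pred is None           # negative sample: only rejection is correct
    if pred is None:
        return False                  # positive sample wrongly rejected
    return iou(pred, target) >= thresh


def accuracy(preds: Sequence[Optional[Box]],
             targets: Sequence[Optional[Box]],
             thresh: float = 0.5) -> float:
    """Fraction of samples scored correct; 0.0 for an empty set."""
    if not targets:
        return 0.0
    hits = sum(is_correct(p, t, thresh) for p, t in zip(preds, targets))
    return hits / len(targets)


if __name__ == "__main__":
    preds = [(1, 1, 10, 10), None, (0, 0, 5, 5)]
    targets = [(0, 0, 10, 10), None, None]
    print(accuracy(preds, targets))
```

In the usage example, the first prediction overlaps its target closely, the second correctly rejects a negative sample, and the third returns a box for an object that is absent, so two of the three samples are scored correct.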
Anthology ID:
2024.emnlp-main.864
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
15440–15457
URL:
https://aclanthology.org/2024.emnlp-main.864
DOI:
10.18653/v1/2024.emnlp-main.864
Cite (ACL):
Junzhuo Liu, Xuzheng Yang, Weiwei Li, and Peng Wang. 2024. FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15440–15457, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension (Liu et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.864.pdf