VCRMNER: Visual Cue Refinement in Multimodal NER using CLIP Prompts

Yu Bai, Lianji Wang, Xiang Liu, Haifeng Chi, Guiping Zhang


Abstract
With the continuous growth of multi-modal data on social media platforms, traditional Named Entity Recognition has become insufficient for handling contemporary data formats. Consequently, researchers have proposed Multi-modal Named Entity Recognition (MNER). Existing studies focus on capturing the visual regions corresponding to entities to assist in entity recognition. However, these approaches still struggle to mitigate interference from visual regions that are irrelevant to the entities. To address this issue, we propose an innovative framework, Visual Cue Refinement in MNER (VCRMNER) using CLIP Prompts, to accurately capture the visual cues (object-level visual regions) associated with entities. We leverage prompts to represent the semantic information of entity categories, which helps us assess visual cues and minimize interference from those irrelevant to the entities. Furthermore, we design an interaction transformer that operates in two stages, first within each modality and then between modalities, to refine visual cues by learning from a frozen image encoder, thereby reducing the gap between the text and visual modalities. Comprehensive experiments were conducted on two public datasets, Twitter15 and Twitter17. The results and detailed analyses demonstrate that our method achieves robust and competitive performance.
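To make the prompt-based visual-cue assessment concrete, below is a minimal sketch (not the authors' released code) of how entity-category prompts could be scored against object-level region crops with an off-the-shelf CLIP model, keeping only regions relevant to some entity type. The prompt templates, checkpoint name, and similarity threshold are illustrative assumptions.

```python
# Sketch: filter object-level visual regions by CLIP similarity to
# entity-category prompts. Assumes HuggingFace transformers + torch.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# One prompt per MNER entity category (hypothetical templates).
category_prompts = [
    "a photo of a person",         # PER
    "a photo of a place",          # LOC
    "a photo of an organization",  # ORG
    "a photo of a thing",          # MISC
]

def filter_visual_cues(region_crops: list[Image.Image], threshold: float = 0.25):
    """Return (region, best_category, score) triples for regions whose
    best image-text similarity clears the threshold; regions below it
    are treated as entity-irrelevant and dropped."""
    inputs = processor(text=category_prompts, images=region_crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image is cosine similarity scaled by the learned
    # temperature; undo the scaling to threshold raw cosine values.
    sims = out.logits_per_image / model.logit_scale.exp()
    best_scores, best_cats = sims.max(dim=-1)
    return [(region_crops[i], category_prompts[int(best_cats[i])],
             best_scores[i].item())
            for i in range(len(region_crops))
            if best_scores[i].item() >= threshold]
```

The surviving regions would then serve as the refined visual cues consumed by the downstream interaction transformer; the two-stage (intra-modality, then inter-modality) attention and the distillation from the frozen image encoder are beyond the scope of this sketch.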
Anthology ID: 2025.neusymbridge-1.7
Volume: Proceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Kang Liu, Yangqiu Song, Zhen Han, Rafet Sifa, Shizhu He, Yunfei Long
Venues: NeusymBridge | WS
Publisher: ELRA and ICCL
Pages: 61–70
URL: https://aclanthology.org/2025.neusymbridge-1.7/
Cite (ACL): Yu Bai, Lianji Wang, Xiang Liu, Haifeng Chi, and Guiping Zhang. 2025. VCRMNER: Visual Cue Refinement in Multimodal NER using CLIP Prompts. In Proceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025, pages 61–70, Abu Dhabi, UAE. ELRA and ICCL.
Cite (Informal): VCRMNER: Visual Cue Refinement in Multimodal NER using CLIP Prompts (Bai et al., NeusymBridge 2025)
PDF: https://aclanthology.org/2025.neusymbridge-1.7.pdf