Incorporating Object-Level Visual Context for Multimodal Fine-Grained Entity Typing

Ying Zhang, Wenbo Fan, Kehui Song, Yu Zhao, Xuhui Sui, Xiaojie Yuan


Abstract
Fine-grained entity typing (FGET) aims to assign appropriate fine-grained types to entity mentions within their context, an important foundational task in natural language processing. Previous approaches to FGET rely solely on textual context. However, when the context is a short text, its semantic information is often insufficient for FGET. In many real-world scenarios, text is accompanied by images, and this visual context is valuable for FGET. To this end, we first propose a new task, multimodal fine-grained entity typing (MFGET). We then construct MFIGER, a large-scale dataset for multimodal fine-grained entity typing based on FIGER. To fully exploit both textual and visual information, we propose a novel Multimodal Object-Level Visual Context Network (MOVCNet), which captures fine-grained semantic information by detecting objects in images and effectively fuses textual and visual context. Experimental results demonstrate that our approach achieves superior classification performance compared to previous text-based approaches.
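The abstract describes object-level fusion only at a high level, so the following is a minimal, hypothetical PyTorch sketch of one plausible reading, not the authors' MOVCNet: region features from an off-the-shelf object detector are fused with contextual token embeddings via cross-attention, and the pooled mention representation feeds a multi-label type classifier. All module names, dimensions, and the fusion scheme here are illustrative assumptions.

    # Hypothetical sketch of object-level multimodal fusion for entity typing.
    # NOTE: names, dimensions, and the fusion scheme are assumptions for
    # illustration; they are not the MOVCNet architecture from the paper.
    import torch
    import torch.nn as nn

    class ObjectLevelFusionTyper(nn.Module):
        def __init__(self, text_dim=768, obj_dim=2048, num_types=113):
            # num_types=113 matches the FIGER type inventory; text_dim is
            # assumed to equal the hidden size used for attention below.
            super().__init__()
            # Project detector region features (e.g., ROI features from a
            # Faster R-CNN-style detector) into the text embedding space.
            self.obj_proj = nn.Linear(obj_dim, text_dim)
            # Cross-attention: context tokens attend to detected objects.
            self.cross_attn = nn.MultiheadAttention(
                text_dim, num_heads=8, batch_first=True)
            # Multi-label classifier over the fine-grained type inventory.
            self.classifier = nn.Linear(2 * text_dim, num_types)

        def forward(self, token_states, mention_mask, obj_feats):
            # token_states: (B, T, text_dim) contextual embeddings from a
            #               pretrained text encoder
            # mention_mask: (B, T) with 1 at mention tokens, 0 elsewhere
            # obj_feats:    (B, K, obj_dim) features of K detected objects
            objs = self.obj_proj(obj_feats)                       # (B, K, D)
            attended, _ = self.cross_attn(token_states, objs, objs)
            # Mean-pool mention tokens from the textual and the
            # visually-attended views, then classify.
            mask = mention_mask.unsqueeze(-1).float()
            denom = mask.sum(dim=1).clamp(min=1.0)
            text_mention = (token_states * mask).sum(dim=1) / denom
            visual_mention = (attended * mask).sum(dim=1) / denom
            logits = self.classifier(
                torch.cat([text_mention, visual_mention], dim=-1))
            return logits  # train with BCEWithLogitsLoss (multi-label)

Since a mention may carry several fine-grained types at once, the classifier is multi-label: under these assumptions one would train with binary cross-entropy over type logits rather than softmax cross-entropy.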
Anthology ID: 2023.findings-emnlp.1027
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 15380–15390
URL: https://aclanthology.org/2023.findings-emnlp.1027
DOI: 10.18653/v1/2023.findings-emnlp.1027
Cite (ACL): Ying Zhang, Wenbo Fan, Kehui Song, Yu Zhao, Xuhui Sui, and Xiaojie Yuan. 2023. Incorporating Object-Level Visual Context for Multimodal Fine-Grained Entity Typing. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15380–15390, Singapore. Association for Computational Linguistics.
Cite (Informal): Incorporating Object-Level Visual Context for Multimodal Fine-Grained Entity Typing (Zhang et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.1027.pdf