Coreference by Appearance: Visually Grounded Event Coreference Resolution

Liming Wang, Shengyu Feng, Xudong Lin, Manling Li, Heng Ji, Shih-Fu Chang


Abstract
Event coreference resolution is critical to understanding events in the growing volume of online news that spans multiple modalities, including text, video, and speech. However, the events and entities depicted in different modalities may not be perfectly aligned and can be difficult to annotate, which makes the task especially challenging when little supervision is available. To address these issues, we propose a supervised model based on an attention mechanism and an unsupervised model based on statistical machine translation, capable of learning the relative importance of modalities for event coreference resolution. Experiments on a video multimedia event dataset show that our multimodal models outperform text-only systems in event coreference resolution tasks. A careful analysis reveals that the performance gain of the multimodal model, especially under unsupervised settings, comes from better learning of visually salient events.
Anthology ID: 2021.crac-1.14
Volume: Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference
Month: November
Year: 2021
Address: Punta Cana, Dominican Republic
Venues: CRAC | EMNLP
Publisher: Association for Computational Linguistics
Pages: 132–140
URL: https://aclanthology.org/2021.crac-1.14
DOI: 10.18653/v1/2021.crac-1.14
PDF: https://aclanthology.org/2021.crac-1.14.pdf
Data: M2E2