Three Stream Based Multi-level Event Contrastive Learning for Text-Video Event Extraction

Jiaqi Li, Chuanyi Zhang, Miaozeng Du, Dehai Min, Yongrui Chen, Guilin Qi


Abstract
Text-video based multimodal event extraction refers to identifying event information from given text-video pairs. Existing methods predominantly utilize video appearance features (VAF) and text sequence features (TSF) as input information. Some of them employ contrastive learning to align VAF with the event types extracted from TSF. However, they disregard the motion representations in videos, and the optimization of the contrastive objective can be misguided by background noise in RGB frames. We observe that the same event triggers correspond to similar motion trajectories, which are hardly affected by background noise. Motivated by this, we propose a Three Stream Multimodal Event Extraction framework (TSEE) that simultaneously utilizes text sequence features, video appearance features, and motion representations to enhance event extraction capacity. First, we extract optical flow features (OFF) from videos as motion representations and incorporate them with VAF and TSF. Then we introduce a Multi-level Event Contrastive Learning module to align the embedding spaces between OFF and event triggers, as well as between event triggers and event types. Finally, a Dual Querying Text module is proposed to enhance the interaction between modalities. Experimental results show that TSEE outperforms state-of-the-art methods, demonstrating its superiority.
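To make the two alignment levels concrete, the following is a minimal sketch of a supervised InfoNCE-style objective under the assumption of a PyTorch implementation; the tensor names (flow_emb, trigger_emb, type_emb), the temperature value, and the batch construction are illustrative assumptions rather than the authors' released code.

```python
# Hedged sketch: multi-level contrastive alignment with a supervised
# InfoNCE loss. Level 1 aligns optical flow features (OFF) with event
# trigger embeddings; level 2 aligns trigger embeddings with event type
# embeddings. All names and shapes below are assumptions for illustration.
import torch
import torch.nn.functional as F

def supervised_info_nce(anchors, positives, labels, temperature=0.07):
    """Pull together cross-modal pairs that share the same label
    (e.g., the same event trigger) and push apart the rest."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature                              # (B, B) similarities
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # same-label pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # average log-likelihood over each anchor's positive set
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1.0)
    return loss.mean()

B, d = 8, 256
flow_emb    = torch.randn(B, d)          # OFF from a flow encoder (placeholder)
trigger_emb = torch.randn(B, d)          # trigger embeddings from the text encoder
type_emb    = torch.randn(B, d)          # event-type embeddings
trigger_ids = torch.randint(0, 4, (B,))  # samples sharing a trigger are positives
type_ids    = torch.randint(0, 3, (B,))  # samples sharing an event type are positives

loss = supervised_info_nce(flow_emb, trigger_emb, trigger_ids) \
     + supervised_info_nce(trigger_emb, type_emb, type_ids)
```

The key design point the sketch illustrates is that positives are defined by shared event triggers (and types) rather than by instance identity, so motion trajectories of the same event are drawn together regardless of differing RGB backgrounds.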
Anthology ID:
2023.emnlp-main.103
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1666–1676
URL:
https://aclanthology.org/2023.emnlp-main.103
DOI:
10.18653/v1/2023.emnlp-main.103
Cite (ACL):
Jiaqi Li, Chuanyi Zhang, Miaozeng Du, Dehai Min, Yongrui Chen, and Guilin Qi. 2023. Three Stream Based Multi-level Event Contrastive Learning for Text-Video Event Extraction. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1666–1676, Singapore. Association for Computational Linguistics.
Cite (Informal):
Three Stream Based Multi-level Event Contrastive Learning for Text-Video Event Extraction (Li et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.103.pdf
Video:
https://aclanthology.org/2023.emnlp-main.103.mp4