@inproceedings{xing-etal-2025-benchmarking,
title = "Benchmarking and Improving {LVLM}s on Event Extraction from Multimedia Documents",
author = "Xing, Fuyu and
Wang, Zimu and
Wang, Wei and
Zhang, Haiyang",
editor = "Flek, Lucie and
Narayan, Shashi and
Phương, L{\^e} Hồng and
Pei, Jiahuan",
booktitle = "Proceedings of the 18th International Natural Language Generation Conference",
month = oct,
year = "2025",
address = "Hanoi, Vietnam",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.inlg-main.42/",
pages = "734--742",
abstract = "The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M{\texttwosuperior}E{\texttwosuperior}) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M{\texttwosuperior}E{\texttwosuperior} task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M{\texttwosuperior}E{\texttwosuperior} dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M{\texttwosuperior}E{\texttwosuperior} capabilities."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="xing-etal-2025-benchmarking">
    <titleInfo>
      <title>Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Fuyu</namePart>
      <namePart type="family">Xing</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Zimu</namePart>
      <namePart type="family">Wang</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Wei</namePart>
      <namePart type="family">Wang</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Haiyang</namePart>
      <namePart type="family">Zhang</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-10</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 18th International Natural Language Generation Conference</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Lucie</namePart>
        <namePart type="family">Flek</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Shashi</namePart>
        <namePart type="family">Narayan</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Lê</namePart>
        <namePart type="given">Hồng</namePart>
        <namePart type="family">Phương</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Jiahuan</namePart>
        <namePart type="family">Pei</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Hanoi, Vietnam</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M²E²) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M²E² task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M²E² dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M²E² capabilities.</abstract>
    <identifier type="citekey">xing-etal-2025-benchmarking</identifier>
    <location>
      <url>https://aclanthology.org/2025.inlg-main.42/</url>
    </location>
    <part>
      <date>2025-10</date>
      <extent unit="page">
        <start>734</start>
        <end>742</end>
      </extent>
    </part>
  </mods>
</modsCollection>
%0 Conference Proceedings
%T Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents
%A Xing, Fuyu
%A Wang, Zimu
%A Wang, Wei
%A Zhang, Haiyang
%Y Flek, Lucie
%Y Narayan, Shashi
%Y Phương, Lê Hồng
%Y Pei, Jiahuan
%S Proceedings of the 18th International Natural Language Generation Conference
%D 2025
%8 October
%I Association for Computational Linguistics
%C Hanoi, Vietnam
%F xing-etal-2025-benchmarking
%X The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M²E²) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M²E² task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M²E² dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M²E² capabilities.
%U https://aclanthology.org/2025.inlg-main.42/
%P 734-742
Markdown (Informal)
[Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents](https://aclanthology.org/2025.inlg-main.42/) (Xing et al., INLG 2025)
ACL
Fuyu Xing, Zimu Wang, Wei Wang, and Haiyang Zhang. 2025. Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents. In Proceedings of the 18th International Natural Language Generation Conference, pages 734–742, Hanoi, Vietnam. Association for Computational Linguistics.