Multimodal Document-level Triple Extraction via Dynamic Graph Enhancement and Relation-Aware Reflection

Xiang Li, Runhai Jiao, Zhou Changyu, Shoupeng Qiao, Ruojiao Qiao, Ruifan Li


Abstract
Multimodal documents, which are among the most prevalent data formats, combine a large amount of textual and visual content. Extracting structured knowledge triples from these documents is a highly valuable task that helps users efficiently acquire key entities and their relationships. However, existing methods face limitations in simultaneously processing long textual content and multiple associated images for triple extraction. Therefore, we propose a Multimodal Document-level Triple Extraction (MDocTE) framework. Specifically, we introduce a dynamic document graph construction method that extends the model’s scope to the entire document and the external world, while adaptively optimizing the graph structure. Next, we inject the global information and external knowledge learned by the graph neural network into the large language model, generating structured triples after deep interaction. Finally, we design a multimodal relation-aware mechanism and loss function to guide the model in reflecting on the shared information between text and visuals. We release a new triple extraction dataset for multimodal documents and conduct extensive experiments. The results demonstrate that the proposed framework outperforms state-of-the-art baselines, filling a gap in multimodal document extraction. Our data is available at https://github.com/XiangLiphd/Triple-extraction-dataset-for-multimodal-documents.
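The abstract outlines a three-part pipeline: build an adaptively weighted graph over a document's text and image content, encode it with a graph neural network whose states are injected into an LLM, and train with a relation-aware loss that aligns the two modalities. The sketch below is a minimal PyTorch illustration of that flow, not the paper's actual implementation; the module names (DynamicDocGraph, GraphEncoder), the bilinear edge scorer, the single GCN-style layer, and the InfoNCE-style stand-in for the relation-aware loss are all assumptions for illustration.

```python
# Minimal sketch of a dynamic-graph-enhanced extraction pipeline, assuming
# plain PyTorch. All module names, dimensions, and the exact loss form are
# illustrative stand-ins; the paper's architecture may differ substantially.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicDocGraph(nn.Module):
    """Builds an adaptively weighted soft adjacency over text/image nodes."""
    def __init__(self, dim: int):
        super().__init__()
        self.edge_scorer = nn.Bilinear(dim, dim, 1)  # learned edge weights

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        n = nodes.size(0)
        # Score every node pair, then normalize rows into a soft adjacency.
        left = nodes.unsqueeze(1).expand(n, n, -1).reshape(n * n, -1)
        right = nodes.unsqueeze(0).expand(n, n, -1).reshape(n * n, -1)
        adj = self.edge_scorer(left, right).view(n, n)
        return torch.softmax(adj, dim=-1)

class GraphEncoder(nn.Module):
    """One round of message passing (a simple GCN-style layer)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        return F.relu(self.proj(adj @ nodes))

def relation_aware_loss(text_repr, image_repr, temperature=0.1):
    """Hypothetical alignment objective: pull matched text/image pairs
    together (an InfoNCE-style loss standing in for the paper's)."""
    t = F.normalize(text_repr, dim=-1)
    v = F.normalize(image_repr, dim=-1)
    logits = t @ v.t() / temperature
    targets = torch.arange(t.size(0))
    return F.cross_entropy(logits, targets)

# Toy usage: 4 text-chunk nodes and 4 image nodes with 64-dim embeddings.
dim = 64
nodes = torch.randn(8, dim)
graph, encoder = DynamicDocGraph(dim), GraphEncoder(dim)
enriched = encoder(nodes, graph(nodes))  # graph-enhanced node states
loss = relation_aware_loss(enriched[:4], enriched[4:])
print(loss.item())  # enriched states would then condition the LLM decoder
```

In this reading, the "dynamic" aspect is that edge weights are learned end-to-end rather than fixed from co-occurrence, and the enriched node states are what get injected into the LLM before triple generation.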
Anthology ID:
2025.findings-emnlp.171
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3212–3223
URL:
https://aclanthology.org/2025.findings-emnlp.171/
Cite (ACL):
Xiang Li, Runhai Jiao, Zhou Changyu, Shoupeng Qiao, Ruojiao Qiao, and Ruifan Li. 2025. Multimodal Document-level Triple Extraction via Dynamic Graph Enhancement and Relation-Aware Reflection. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3212–3223, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Multimodal Document-level Triple Extraction via Dynamic Graph Enhancement and Relation-Aware Reflection (Li et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.171.pdf
Checklist:
https://aclanthology.org/2025.findings-emnlp.171.checklist.pdf