FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering

Amirhossein Abaskohi; Spandana Gella; Giuseppe Carenini; Issam H. Laradji

doi:10.18653/v1/2025.findings-emnlp.383

Abstract

Multimodal multihop question answering (MMQA) requires reasoning over images and text from multiple sources, an essential task for many real-world applications. Despite advances in visual question answering, this multihop setting remains underexplored due to a lack of quality datasets. Existing methods focus on single-hop, single-modality, or short texts, limiting real-world applications like interpreting educational documents with long, multimodal content. To fill this gap, we introduce FM2DS, the first framework for creating a high-quality dataset for MMQA. Our approach consists of a 5-stage pipeline that involves acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them through rigorous criteria to ensure data quality. We evaluate our methodology by training models on our synthesized dataset and testing on two benchmarks: MultimodalQA and WebQA. Our results demonstrate that, with an equal sample size, models trained on our synthesized data outperform those trained on human-collected data by 1.9 in exact match (EM) score on average. Additionally, we introduce M2QA-Bench with 1k samples, the first benchmark for MMQA on long documents, generated using FM2DS and refined by human annotators.

Anthology ID: 2025.findings-emnlp.383
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 7256–7282
URL: https://aclanthology.org/2025.findings-emnlp.383/
DOI: 10.18653/v1/2025.findings-emnlp.383
Cite (ACL): Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, and Issam H. Laradji. 2025. FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 7256–7282, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering (Abaskohi et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.383.pdf
Checklist: 2025.findings-emnlp.383.checklist.pdf
