OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities

Anton Razzhigaev, Maxim Kurkin, Elizaveta Goncharova, Irina Abdullaeva, Anastasia Lysenko, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov


Abstract
We introduce OmniDialog, the first comprehensive trimodal benchmark grounded in a knowledge graph (Wikidata) for evaluating the generalization of Large Multimodal Models (LMMs) across the text, visual, and audio modalities. The benchmark consists of more than 4,000 dialogues, each averaging 10 turns, all annotated and cross-validated by human experts. The dialogues are designed to prevent shortcut learning by incorporating varied formats and misleading or irrelevant multimodal cues. We also evaluate both multimodal and unimodal models to gain insight into how they process the modality inputs introduced in the conversation.
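The abstract does not specify the released data format; the following is a minimal, purely illustrative Python sketch of how a Wikidata-grounded dialogue with mixed-modality turns and distractor cues might be represented. Every name here (Dialogue, DialogueTurn, Modality, wikidata_entity, is_distractor) is a hypothetical assumption, not the authors' actual schema.

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"

@dataclass
class DialogueTurn:
    speaker: str                           # e.g. "user" or "assistant"
    text: str                              # textual content or transcript of the turn
    modality: Modality = Modality.TEXT     # which modality this turn introduces
    attachment_path: Optional[str] = None  # image or audio file, if any
    is_distractor: bool = False            # misleading or irrelevant multimodal cue

@dataclass
class Dialogue:
    dialogue_id: str
    wikidata_entity: str                   # grounding entity ID, e.g. "Q90" for Paris
    turns: list[DialogueTurn] = field(default_factory=list)

# Example: a two-turn dialogue grounded in the Wikidata entity for Paris.
example = Dialogue(
    dialogue_id="omnidialog-0001",
    wikidata_entity="Q90",
    turns=[
        DialogueTurn(speaker="user", text="What city is shown in this photo?",
                     modality=Modality.IMAGE, attachment_path="paris.jpg"),
        DialogueTurn(speaker="assistant", text="The photo shows Paris."),
    ],
)

Under such a schema, an evaluation harness could iterate over example.turns and route each turn's attachment to the appropriate encoder based on its modality field.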
Anthology ID: 2024.genbench-1.12
Volume: Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Dieuwke Hupkes, Verna Dankers, Khuyagbaatar Batsuren, Amirhossein Kazemnejad, Christos Christodoulopoulos, Mario Giulianelli, Ryan Cotterell
Venue: GenBench
Publisher: Association for Computational Linguistics
Pages: 183–195
URL: https://aclanthology.org/2024.genbench-1.12
Cite (ACL): Anton Razzhigaev, Maxim Kurkin, Elizaveta Goncharova, Irina Abdullaeva, Anastasia Lysenko, Alexander Panchenko, Andrey Kuznetsov, and Denis Dimitrov. 2024. OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities. In Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP, pages 183–195, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities (Razzhigaev et al., GenBench 2024)
PDF: https://aclanthology.org/2024.genbench-1.12.pdf