Multimodal Conversation Structure Understanding

Kent K. Chang; Mackenzie Hanh Cramer; Anna Ho; Ti Ti Nguyen; Yilin Yuan; David Bamman

Abstract

While multimodal large language models (LLMs) excel at dialogue, whether they can adequately parse the structure of conversation—conversational roles and threading—remains underexplored. In this work, we introduce a suite of tasks and release TV-MMPC, a new annotated dataset, for multimodal conversation structure understanding. Our evaluation reveals that while all multimodal LLMs outperform our heuristic baseline, even the best-performing model we consider experiences a substantial drop in performance when character identities of the conversation are anonymized. Beyond evaluation, we carry out a sociolinguistic analysis of 350,842 utterances in TVQA. We find that while female characters initiate conversations at rates in proportion to their speaking time, they are 1.2 times more likely than men to be cast as an addressee or side-participant, and the presence of side-participants shifts the conversational register from personal to social.

Anthology ID: 2026.eacl-long.349
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 7437–7458
URL: https://aclanthology.org/2026.eacl-long.349/
Cite (ACL): Kent K. Chang, Mackenzie Hanh Cramer, Anna Ho, Ti Ti Nguyen, Yilin Yuan, and David Bamman. 2026. Multimodal Conversation Structure Understanding. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7437–7458, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Multimodal Conversation Structure Understanding (Chang et al., EACL 2026)
PDF: https://aclanthology.org/2026.eacl-long.349.pdf
Checklist: 2026.eacl-long.349.checklist.pdf
