Analysing Next Speaker Prediction in Multi-Party Conversation Using Multimodal Large Language Models

Taiga Mori, Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara


Abstract
This study analyses how state-of-the-art multimodal large language models (MLLMs) can predict the next speaker in multi-party conversations. Through experimental and qualitative analyses, we found that MLLMs are able to infer a plausible next speaker based solely on linguistic context and their internalized knowledge. However, even in cases where the next speaker is not uniquely determined, MLLMs exhibit a bias toward overpredicting a single participant as the next speaker. We further showed that this bias can be mitigated by explicitly providing knowledge of turn-taking rules. In addition, we observed that visual input can sometimes contribute to more accurate predictions, while in other cases it leads to erroneous judgments. Overall, however, no clear effect of visual input was observed.
Anthology ID:
2026.iwsds-1.8
Volume:
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
Month:
February
Year:
2026
Address:
Trento, Italy
Editors:
Giuseppe Riccardi, Seyed Mahed Mousavi, Maria Ines Torres, Koichiro Yoshino, Zoraida Callejas, Shammur Absar Chowdhury, Yun-Nung Chen, Frederic Bechet, Joakim Gustafson, Géraldine Damnati, Alex Papangelis, Luis Fernando D’Haro, John Mendonça, Raffaella Bernardi, Dilek Hakkani-Tur, Giuseppe "Pino" Di Fabbrizio, Tatsuya Kawahara, Firoj Alam, Gokhan Tur, Michael Johnston
Venue:
IWSDS
Publisher:
Association for Computational Linguistics
Pages:
83–94
URL:
https://aclanthology.org/2026.iwsds-1.8/
Cite (ACL):
Taiga Mori, Koji Inoue, Divesh Lala, Keiko Ochi, and Tatsuya Kawahara. 2026. Analysing Next Speaker Prediction in Multi-Party Conversation Using Multimodal Large Language Models. In Proceedings of the 16th International Workshop on Spoken Dialogue System Technology, pages 83–94, Trento, Italy. Association for Computational Linguistics.
Cite (Informal):
Analysing Next Speaker Prediction in Multi-Party Conversation Using Multimodal Large Language Models (Mori et al., IWSDS 2026)
PDF:
https://aclanthology.org/2026.iwsds-1.8.pdf