Keiko Ochi
2026
Analysing Next Speaker Prediction in Multi-Party Conversation Using Multimodal Large Language Models
Taiga Mori | Koji Inoue | Divesh Lala | Keiko Ochi | Tatsuya Kawahara
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
This study analyses how state-of-the-art multimodal large language models (MLLMs) can predict the next speaker in multi-party conversations. Through experimental and qualitative analyses, we found that MLLMs are able to infer a plausible next speaker based solely on linguistic context and their internalized knowledge. However, even in cases where the next speaker is not uniquely determined, MLLMs exhibit a bias toward overpredicting a single participant as the next speaker. We further showed that this bias can be mitigated by explicitly providing knowledge of turn-taking rules. In addition, we observed that visual input can sometimes contribute to more accurate predictions, while in other cases it leads to erroneous judgments. Overall, however, no clear effect of visual input was observed.
Multilingual and Continuous Backchannel Prediction: A Cross-lingual Study
Koji Inoue | Mikey Elmers | Yahui Fu | Zi Haur Pang | Taiga Mori | Divesh Lala | Keiko Ochi | Tatsuya Kawahara
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
We present a multilingual, continuous backchannel prediction model for Japanese, English, and Chinese, and use it to investigate cross-linguistic timing behavior. The model is Transformer-based and operates at the frame level, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations. Across all three languages, the multilingual model matches or surpasses monolingual baselines, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer with two-language training remains limited, underscoring substantive cross-lingual differences. Perturbation analyses reveal distinct cue usage: Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation; multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese. A context-length study further shows that Japanese is relatively robust to shorter contexts, while Chinese benefits markedly from longer contexts. Finally, we integrate the trained model into real-time processing software, demonstrating CPU-only inference. Together, these findings provide a unified model and empirical evidence for how backchannel timing differs across languages, informing the design of more natural, culturally aware spoken dialogue systems.
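The abstract above describes a model that predicts backchannels continuously at the frame level. As a minimal sketch of what such a frame-level prediction target can look like, the function below converts backchannel onset times into per-frame binary labels; the frame rate and prediction horizon are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch: convert backchannel onset times (in seconds) into
# frame-level binary targets, as a continuous frame-level predictor needs.
# FRAME_RATE and the 0.5 s horizon are assumptions for illustration.

FRAME_RATE = 50  # frames per second (assumed)

def frames_with_backchannel(onsets, duration_s, horizon_s=0.5):
    """Mark a frame positive if a backchannel onset occurs within the
    next `horizon_s` seconds (a simple anticipatory prediction target)."""
    n_frames = int(duration_s * FRAME_RATE)
    labels = [0] * n_frames
    for t in onsets:
        start = int(max(0.0, t - horizon_s) * FRAME_RATE)
        end = min(n_frames, int(t * FRAME_RATE) + 1)
        for i in range(start, end):
            labels[i] = 1
    return labels

# One backchannel at t=1.0 s in a 2 s clip: frames in [0.5 s, 1.0 s] are positive.
labels = frames_with_backchannel(onsets=[1.0], duration_s=2.0)
```

A frame-level target like this is what lets the model emit a prediction at every time step rather than only at utterance boundaries.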
Estimating Relationships between Participants in Multi-Party Chat Corpus
Akane Fukushige | Koji Inoue | Keiko Ochi | Tatsuya Kawahara | Sanae Yamashita | Ryuichiro Higashinaka
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
While most existing dialogue studies focus on dyadic (one-on-one) interactions, research on multi-party dialogues has gained increasing importance. One key challenge in multi-party dialogues is identifying and interpreting the relationships between participants. This study focuses on a multi-party chat corpus and aims to estimate participant pairs with specific relationships, such as family and acquaintances. We evaluated the performance of large language models (LLMs) in estimating these relationships, comparing them with a logistic regression model that uses interpretable textual features, including the number of turns and the frequency of honorific expressions. The results show that even advanced LLMs struggle with social relationship estimation, performing worse than a simple heuristic-based approach. This finding highlights the need for further improvement in enabling LLMs to naturally capture social relationships in multi-party dialogues.
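The abstract mentions interpretable textual features such as turn counts and honorific frequency. The sketch below illustrates the general idea of extracting such features for a speaker pair; the tiny honorific lexicon and the feature set are assumptions for illustration, not the paper's exact setup.

```python
# Illustrative feature extraction for a speaker pair, in the spirit of
# "number of turns" and "frequency of honorific expressions".
# The honorific list here is a tiny example lexicon, not the paper's.

HONORIFICS = ("です", "ます")  # assumed example honorific markers

def pair_features(turns, a, b):
    """Count turns and honorific usages among turns by speakers a and b.

    `turns` is a list of (speaker, text) tuples.
    """
    n_turns = 0
    n_honorifics = 0
    for speaker, text in turns:
        if speaker in (a, b):
            n_turns += 1
            n_honorifics += sum(text.count(h) for h in HONORIFICS)
    return {"turns": n_turns, "honorifics": n_honorifics}

feats = pair_features(
    [("A", "よろしくお願いします"), ("B", "こちらこそです"), ("C", "hi")],
    "A", "B",
)
```

Features like these can feed a logistic regression classifier, giving a transparent baseline against which LLM predictions can be compared.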
2025
ScriptBoard: Designing modern spoken dialogue systems through visual programming
Divesh Lala | Mikey Elmers | Koji Inoue | Zi Haur Pang | Keiko Ochi | Tatsuya Kawahara
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology
Implementation of spoken dialogue systems can be time-consuming, in particular for people who are not familiar with managing dialogue states and turn-taking in real time. A GUI-based system where the user can quickly understand the dialogue flow allows rapid prototyping of experimental and real-world systems. In this demonstration we present ScriptBoard, a tool for creating dialogue scenarios which is independent of any specific robot platform. ScriptBoard has been designed with multi-party scenarios in mind and makes use of large language models to both generate dialogue and make decisions about the dialogue flow. This program promotes both flexibility and reproducibility in spoken dialogue research and gives everyone the opportunity to design and test their own dialogue scenarios.
An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue
Koji Inoue | Divesh Lala | Mikey Elmers | Keiko Ochi | Tatsuya Kawahara
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology
Handling multi-party dialogues represents a significant step for advancing spoken dialogue systems, necessitating the development of tasks specific to multi-party interactions. To address this challenge, we are constructing a multi-modal multi-party dialogue corpus of triadic (three-participant) discussions. This paper focuses on the task of addressee recognition, identifying who is being addressed to take the next turn, a critical component unique to multi-party dialogue systems. A subset of the corpus was annotated with addressee information, revealing that explicit addressees are indicated in approximately 20% of conversational turns. To evaluate the task’s complexity, we benchmarked the performance of a large language model (GPT-4o) on addressee recognition. The results showed that GPT-4o achieved an accuracy only marginally above chance, underscoring the challenges of addressee recognition in multi-party dialogue. These findings highlight the need for further research to enhance the capabilities of large language models in understanding and navigating the intricacies of multi-party conversational dynamics.
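The abstract reports that explicit addressees appear in roughly 20% of conversational turns. The following sketch shows one simple way such a rate could be measured, using a name-mention heuristic; the heuristic and the example data are assumptions, not the paper's annotation scheme.

```python
# Minimal sketch: estimate how often a turn carries an explicit addressee,
# using a vocative name-mention heuristic. This heuristic is an assumption;
# the corpus itself was annotated manually with addressee information.

def explicit_addressee_rate(turns, participants):
    """Fraction of turns whose text mentions another participant's name.

    `turns` is a list of (speaker, text) tuples.
    """
    explicit = 0
    for speaker, text in turns:
        others = [p for p in participants if p != speaker]
        if any(name in text for name in others):
            explicit += 1
    return explicit / len(turns)

rate = explicit_addressee_rate(
    [("Alice", "Bob, what do you think?"),
     ("Bob", "I agree with that."),
     ("Carol", "Same here."),
     ("Alice", "Great."),
     ("Bob", "Carol, any objections?")],
    participants=["Alice", "Bob", "Carol"],
)
```

The low rate of explicit address is precisely what makes the task hard: in most turns the addressee must be inferred from context rather than read off the surface text.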
Prompt-Guided Turn-Taking Prediction
Koji Inoue | Mikey Elmers | Yahui Fu | Zi Haur Pang | Divesh Lala | Keiko Ochi | Tatsuya Kawahara
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as “faster” or “calmer,” adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.
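The abstract describes incorporating textual prompt embeddings into the transformer layers of a VAP model. As a minimal sketch of the conditioning idea, the function below broadcast-adds a fixed prompt embedding to every frame's features before they enter the transformer; the additive fusion and the dimensions are assumptions for illustration, not the paper's exact mechanism.

```python
# Hypothetical sketch: condition per-frame features on a prompt embedding
# by adding the embedding to every frame. Additive fusion is an assumption;
# the actual model injects prompt embeddings into its transformer layers.

def condition_on_prompt(frame_features, prompt_embedding):
    """Add a prompt embedding vector to each frame of a T x D feature matrix
    (represented as a list of length-D lists)."""
    return [
        [f + p for f, p in zip(frame, prompt_embedding)]
        for frame in frame_features
    ]

frames = [[0.0, 0.0, 0.0]] * 4       # T=4 frames, D=3 features
prompt = [1.0, 2.0, 3.0]             # e.g., an embedding of the prompt "faster"
out = condition_on_prompt(frames, prompt)
```

Because the same prompt vector is applied at every frame, changing the prompt shifts the model's behavior globally, which is what allows instructions like "faster" or "calmer" to modulate predicted turn-taking timing.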