Divesh Lala

2026

Analysing Next Speaker Prediction in Multi-Party Conversation Using Multimodal Large Language Models
Taiga Mori | Koji Inoue | Divesh Lala | Keiko Ochi | Tatsuya Kawahara
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology

This study analyses how state-of-the-art multimodal large language models (MLLMs) can predict the next speaker in multi-party conversations. Through experimental and qualitative analyses, we found that MLLMs are able to infer a plausible next speaker based solely on linguistic context and their internalized knowledge. However, even in cases where the next speaker is not uniquely determined, MLLMs exhibit a bias toward overpredicting a single participant as the next speaker. We further showed that this bias can be mitigated by explicitly providing knowledge of turn-taking rules. In addition, we observed that visual input can sometimes contribute to more accurate predictions, while in other cases it leads to erroneous judgments. Overall, however, no clear effect of visual input was observed.

pdf bib abs

We present a multilingual, continuous backchannel prediction model for Japanese, English, and Chinese, and use it to investigate cross-linguistic timing behavior. The model is Transformer-based and operates at the frame level, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations. Across all three languages, the multilingual model matches or surpasses monolingual baselines, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer with two-language training remains limited, underscoring substantive cross-lingual differences. Perturbation analyses reveal distinct cue usage: Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation; multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese. A context-length study further shows that Japanese is relatively robust to shorter contexts, while Chinese benefits markedly from longer contexts. Finally, we integrate the trained model into a real-time processing software, demonstrating CPU-only inference. Together, these findings provide a unified model and empirical evidence for how backchannel timing differs across languages, informing the design of more natural, culturally-aware spoken dialogue systems.

2025

pdf bib abs

Human-Like Embodied AI Interviewer: Employing Android ERICA in Real International Conference
Zi Haur Pang | Yahui Fu | Divesh Lala | Mikey Elmers | Koji Inoue | Tatsuya Kawahara
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

This paper introduces the human-like embodied AI interviewer which integrates android robots equipped with advanced conversational capabilities, including attentive listening, conversational repairs, and user fluency adaptation. Moreover, it can analyze and present results post-interview. We conducted a real-world case study at SIGDIAL 2024 with 42 participants, of whom 69% reported positive experiences. This study demonstrated the system’s effectiveness in conducting interviews just like a human and marked the first employment of such a system at an international conference. The demonstration video is available at https://youtu.be/jCuw9g99KuE.

pdf bib abs

Implementation of spoken dialogue systems can be time-consuming, in particular for people who are not familiar with managing dialogue states and turn-taking in real-time. A GUI-based system where the user can quickly understand the dialogue flow allows rapid prototyping of experimental and real-world systems. In this demonstration we present ScriptBoard, a tool for creating dialogue scenarios which is independent of any specific robot platform. ScriptBoard has been designed with multi-party scenarios in mind and makes use of large language models to both generate dialogue and make decisions about the dialogue flow. This program promotes both flexibility and reproducibility in spoken dialogue research and provides everyone the opportunity to design and test their own dialogue scenarios.

pdf bib abs

Why Do We Laugh? Annotation and Taxonomy Generation for Laughable Contexts in Spontaneous Text Conversation
Koji Inoue | Mikey Elmers | Divesh Lala | Tatsuya Kawahara
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology

Laughter serves as a multifaceted communicative signal in human interaction, yet its identification within dialogue presents a significant challenge for conversational AI systems. This study addresses this challenge by annotating laughable contexts in Japanese spontaneous text conversation data and developing a taxonomy to classify the underlying reasons for such contexts. Initially, multiple annotators manually labeled laughable contexts using a binary decision (laughable or non-laughable). Subsequently, an LLM was used to generate explanations for the binary annotations of laughable contexts, which were then categorized into a taxonomy comprising ten categories, including “Empathy and Affinity” and “Humor and Surprise,” highlighting the diverse range of laughter-inducing scenarios. The study also evaluated GPT-4o’s performance in recognizing the majority labels of laughable contexts, achieving an F1 score of 43.14%. These findings contribute to the advancement of conversational AI by establishing a foundation for more nuanced recognition and generation of laughter, ultimately fostering more natural and engaging human-AI interactions.

pdf bib abs

An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue
Koji Inoue | Divesh Lala | Mikey Elmers | Keiko Ochi | Tatsuya Kawahara
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology

Handling multi-party dialogues represents a significant step for advancing spoken dialogue systems, necessitating the development of tasks specific to multi-party interactions. To address this challenge, we are constructing a multi-modal multi-party dialogue corpus of triadic (three-participant) discussions. This paper focuses on the task of addressee recognition, identifying who is being addressed to take the next turn, a critical component unique to multi-party dialogue systems. A subset of the corpus was annotated with addressee information, revealing that explicit addressees are indicated in approximately 20% of conversational turns. To evaluate the task’s complexity, we benchmarked the performance of a large language model (GPT-4o) on addressee recognition. The results showed that GPT-4o achieved an accuracy only marginally above chance, underscoring the challenges of addressee recognition in multi-party dialogue. These findings highlight the need for further research to enhance the capabilities of large language models in understanding and navigating the intricacies of multi-party conversational dynamics.

pdf bib abs

Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection
Koji Inoue | Divesh Lala | Gabriel Skantze | Tatsuya Kawahara
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In human conversations, short backchannel utterances such as “yeah” and “oh” play a crucial role in facilitating smooth and engaging dialogue.These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents.This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection (VAP) model.While existing approaches have relied on turn-based or artificially balanced datasets, our approach predicts both the timing and type of backchannels in a continuous and frame-wise manner on unbalanced, real-world datasets.We first pre-train the VAP model on a general dialogue corpus to capture conversational dynamics and then fine-tune it on a specialized dataset focused on backchannel behavior.Experimental results demonstrate that our model outperforms baseline methods in both timing and type prediction tasks, achieving robust performance in real-time environments.This research offers a promising step toward more responsive and human-like dialogue systems, with implications for interactive spoken dialogue applications such as virtual assistants and robots.

pdf bib abs

Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as “faster” or “calmer,” adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.

2022

pdf bib abs

Simultaneous Job Interview System Using Multiple Semi-autonomous Agents
Haruki Kawai | Yusuke Muraki | Kenta Yamamoto | Divesh Lala | Koji Inoue | Tatsuya Kawahara
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

In recent years, spoken dialogue systems have been applied to job interviews where an applicant talks to a system that asks pre-defined questions, called on-demand and self-paced job interviews. We propose a simultaneous job interview system, where one interviewer can conduct one-on-one interviews with multiple applicants simultaneously by cooperating with the multiple autonomous job interview dialogue systems. However, it is challenging for interviewers to monitor and understand all the parallel interviews done by the autonomous system at the same time. As a solution to this issue, we implemented two automatic dialogue understanding functions: (1) response evaluation of each applicant’s responses and (2) keyword extraction as a summary of the responses. It is expected that interviewers, as needed, can intervene in one dialogue and smoothly ask a proper question that elaborates the interview. We report a pilot experiment where an interviewer conducted simultaneous job interviews with three candidates.

2021

pdf bib abs

ERICA: An Empathetic Android Companion for Covid-19 Quarantine
Etsuko Ishii | Genta Indra Winata | Samuel Cahyawijaya | Divesh Lala | Tatsuya Kawahara | Pascale Fung
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Over the past year, research in various domains, including Natural Language Processing (NLP), has been accelerated to fight against the COVID-19 pandemic, yet such research has just started on dialogue systems. In this paper, we introduce an end-to-end dialogue system which aims to ease the isolation of people under self-quarantine. We conduct a control simulation experiment to assess the effects of the user interface: a web-based virtual agent, Nora vs. the android ERICA via a video call. The experimental results show that the android can offer a more valuable user experience by giving the impression of being more empathetic and engaging in the conversation due to its nonverbal information, such as facial expressions and body gestures.

pdf bib abs

A multi-party attentive listening robot which stimulates involvement from side participants
Koji Inoue | Hiromi Sakamoto | Kenta Yamamoto | Divesh Lala | Tatsuya Kawahara
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

We demonstrate the moderating abilities of a multi-party attentive listening robot system when multiple people are speaking in turns. Our conventional one-on-one attentive listening system generates listener responses such as backchannels, repeats, elaborating questions, and assessments. In this paper, additional robot responses that stimulate a listening user (side participant) to become more involved in the dialogue are proposed. The additional responses elicit assessments and questions from the side participant, making the dialogue more empathetic and lively.

2020

pdf bib abs

Designing Precise and Robust Dialogue Response Evaluators
Tianyu Zhao | Divesh Lala | Tatsuya Kawahara
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Automatic dialogue response evaluator has been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only moderate correlation with human judgement and they are not robust. In this work, we propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained (masked) language models. Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement and generalizes robustly to diverse responses and corpora. We open-source the code and data in https://github.com/ZHAOTING/dialog-processing.

pdf bib abs

An Attentive Listening System with Android ERICA: Comparison of Autonomous and WOZ Interactions
Koji Inoue | Divesh Lala | Kenta Yamamoto | Shizuka Nakamura | Katsuya Takanashi | Tatsuya Kawahara
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue

We describe an attentive listening system for the autonomous android robot ERICA. The proposed system generates several types of listener responses: backchannels, repeats, elaborating questions, assessments, generic sentimental responses, and generic responses. In this paper, we report a subjective experiment with 20 elderly people. First, we evaluated each system utterance excluding backchannels and generic responses, in an offline manner. It was found that most of the system utterances were linguistically appropriate, and they elicited positive reactions from the subjects. Furthermore, 58.2% of the responses were acknowledged as being appropriate listener responses. We also compared the proposed system with a WOZ system where a human operator was operating the robot. From the subjective evaluation, the proposed system achieved comparable scores in basic skills of attentive listening such as encouragement to talk, focused on the talk, and actively listening. It was also found that there is still a gap between the system and the WOZ for more sophisticated skills such as dialogue understanding, showing interest, and empathy towards the user.

2017

pdf bib abs

Attentive listening system with backchanneling, response generation and flexible turn-taking
Divesh Lala | Pierrick Milhorat | Koji Inoue | Masanari Ishida | Katsuya Takanashi | Tatsuya Kawahara
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

Attentive listening systems are designed to let people, especially senior people, keep talking to maintain communication ability and mental health. This paper addresses key components of an attentive listening system which encourages users to talk smoothly. First, we introduce continuous prediction of end-of-utterances and generation of backchannels, rather than generating backchannels after end-point detection of utterances. This improves subjective evaluations of backchannels. Second, we propose an effective statement response mechanism which detects focus words and responds in the form of a question or partial repeat. This can be applied to any statement. Moreover, a flexible turn-taking mechanism is designed which uses backchannels or fillers when the turn-switch is ambiguous. These techniques are integrated into a humanoid robot to conduct attentive listening. We test the feasibility of the system in a pilot experiment and show that it can produce coherent dialogues during conversation.

2016

pdf bib

Talking with ERICA, an autonomous android
Koji Inoue | Pierrick Milhorat | Divesh Lala | Tianyu Zhao | Tatsuya Kawahara
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Co-authors

Venues

NAACL1

Fix author