International Workshop on Spoken Dialogue Systems Technology (2026)
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
Giuseppe Riccardi | Seyed Mahed Mousavi | Maria Ines Torres | Koichiro Yoshino | Zoraida Callejas | Shammur Absar Chowdhury | Yun-Nung Chen | Frederic Bechet | Joakim Gustafson | Géraldine Damnati | Alex Papangelis | Luis Fernando D’Haro | John Mendonça | Raffaella Bernardi | Dilek Hakkani-Tur | Giuseppe "Pino" Di Fabbrizio | Tatsuya Kawahara | Firoj Alam | Gokhan Tur | Michael Johnston
MAC: A Multi-Agent Framework for Interactive User Clarification in Multi-turn Conversations
Emre Can Acikgoz | Jinoh Oh | Joo Hyuk Jeon | Jie Hao | Heng Ji | Dilek Hakkani-Tur | Gokhan Tur | Xiang Li | Chengyuan Ma | Xing Fan
Conversational agents often encounter ambiguous user requests, requiring effective clarification to successfully complete tasks. While recent advancements in real-world applications favor multi-agent architectures to manage complex conversational scenarios efficiently, ambiguity resolution remains a critical and underexplored challenge, particularly because it is difficult to determine which agent should initiate a clarification and how agents should coordinate their actions when faced with uncertain or incomplete user input. The fundamental questions of when to interrupt a user and how to formulate the optimal clarification query in a multi-agent setting remain open. In this paper, we propose MAC (Multi-Agent Clarification), an interactive multi-agent framework specifically optimized to resolve user ambiguities by strategically managing clarification dialogues. We first introduce a novel taxonomy categorizing user ambiguities to systematically guide clarification strategies. We then present MAC, which autonomously coordinates multiple agents to interact synergistically with users. Empirical evaluations on MultiWOZ 2.4 demonstrate that enabling clarification at both levels increases the task success rate by 7.8 points (54.5 → 62.3) and reduces the average number of dialogue turns (6.53 → 4.86) by eliciting all required user information up front and minimizing repetition. Our findings highlight the importance of active user interaction and role-aware clarification for more reliable human–agent communication.
FlowSwitch: A State-Aware Framework for Workflow Transitions in Adaptive Dialogue Agents
Wen Yu Chang | Luning Qiu | Yi-Hung Liu | Yun-Nung Chen
To enhance large language models (LLMs) with real-world task-solving capabilities, integrating workflow knowledge into LLMs has emerged as a promising direction. However, real-world conversations are inherently dynamic: users often shift intents or request actions beyond the scope of the current workflow. Existing systems struggle to detect such transitions and to decide when to retrieve or switch to a new workflow. This paper presents FlowSwitch, a state-aware framework that learns when to search for relevant workflows and switch between them during multi-turn dialogues. A policy module determines whether to continue within the current workflow or transition to a new one based on contextual representations. When searching, a retriever identifies the most relevant workflow knowledge given the dialogue state. We conduct comprehensive experiments to explore the optimal configuration of FlowSwitch, including workflow format, retrieval input type, and retrieval method. Experimental results show that our framework, when using the agent's self-generated search queries, achieves the highest Top-1 accuracy and Mean Average Precision (MAP). Moreover, FlowSwitch reduces the number of search operations by nearly 50%, substantially lowering computational cost and response time.
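Below is a minimal, illustrative sketch (not the authors' implementation) of the "continue vs. search" control loop that the FlowSwitch abstract describes: a policy decides whether to stay in the current workflow, and only when it opts to search is a query generated and a retriever invoked. All function interfaces shown are hypothetical.

```python
# Illustrative sketch of a state-aware "continue vs. search" control loop,
# assuming the hypothetical policy, query-generation, and retrieval interfaces below.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Workflow:
    name: str
    steps: List[str]

def flowswitch_step(
    dialogue_context: str,
    current: Workflow,
    policy: Callable[[str, Workflow], str],      # returns "continue" or "search"
    make_query: Callable[[str], str],            # e.g., agent's self-generated search query
    retrieve: Callable[[str], List[Workflow]],   # ranked workflow retrieval
) -> Workflow:
    """Decide whether to stay in the current workflow or switch to a retrieved one."""
    decision = policy(dialogue_context, current)
    if decision == "continue":
        return current                      # avoid an unnecessary search operation
    query = make_query(dialogue_context)    # query type is a design choice studied in the paper
    candidates = retrieve(query)
    return candidates[0] if candidates else current
```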
Personality Expression in Spoken Dialogue Systems: From Text to Speech
Kenta Yamamoto | Kazunori Komatani
A consistent personality in a spoken dialogue system enhances the naturalness and friendliness of interactions. However, users may not accurately perceive all the personality traits that the system attempts to express. This study aims to identify which traits are most reliably perceived by users. We first analyzed third-party personality ratings of a dialogue corpus using principal component and factor analyses to uncover the underlying dimensions of user perception. We then conducted experiments under both text-only and speech-based dialogue conditions to evaluate how effectively each trait could be perceived. Crowd-sourced ratings showed that a trait concerning Extraversion and Openness can be reliably perceived through text alone, whereas accurate perception of the other traits requires speech-related features such as speech rate, backchannels, fillers, and turn-taking pause duration. These findings suggest that, rather than attempting to express all Big Five traits, focusing on a subset aligned with users’ perceptual tendencies enables more effective and expressive personality design in spoken dialogue systems.
Reproducing Proficiency-Conditioned Dialogue Features with Full-duplex Spoken Dialogue Models
Takao Obi | Sadahiro Yoshikawa | Mao Saeki | Masaki Eguchi | Yoichi Matsuyama
Real-time, human-centered conversational AI requires systems that handle spoken dialogue with overlap and rapid turn-taking. Although full-duplex models promise these capabilities, empirical work applying them to conversational AI is still nascent. To fill this gap, this study investigates whether full-duplex models can reproduce human dialogue features. We adapt a full-duplex spoken dialogue model to a large corpus of second-language (L2) learner interviews and train proficiency-conditioned models. We then conduct real-time interview sessions between these models and a spoken dialogue system designed to elicit spontaneous learner speech, and analyze reaction time, response frequency, and fluency metrics across aggregated CEFR levels (A/B/C). Our results show that proficiency-conditioned models partially reproduce levelwise trends and distributions observed in human interviews across multiple metrics. These findings suggest that full-duplex models can reproduce features of human dialogue and offer a promising foundation for conversational AI systems.
Automatic Evaluation of Open-Domain Real Conversations: Combining Encoder-Based, Dialogue-Based Features and Large Language Models Ratings
Cristina Conforto López | Marcos Estecha-Goritagoitia | Mario Rodriguez-Cantelar | Ricardo Cordoba | Luis Fernando D’Haro
Conversational AI is a central application of NLP, yet ensuring high response quality remains challenging due to the inherently subjective nature of user satisfaction. Dialogue evaluation can be performed manually, through expert or user ratings, or automatically, using methods that aim to predict quality scores consistent with human judgment. In this work, we present a reference-free automatic dialogue evaluation system that predicts user ratings from a dataset of real human–chatbot interactions collected during the Alexa Prize Socialbot Grand Challenge 5, combining multiple complementary models to enhance correlation with human scores. Experimental results indicate that the model achieving the highest Pearson correlation with users’ ratings is an XGBoost regression model that combines features such as conversation length, engineered flags capturing conversation characteristics, predictions from an Encoder-based Panel of Experts (PoE), and instruction-based outputs from a fine-tuned LLM. The overall Pearson correlation on the evaluation set is 0.404, which is competitive with prior work trained on an order of magnitude more dialogues, albeit using different datasets and system configurations.
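As a rough illustration of the feature-combination setup described above, the sketch below fits an XGBoost regressor on per-dialogue features and reports Pearson correlation with user ratings. The column names are hypothetical placeholders, not the paper's actual feature set.

```python
# Minimal sketch: combine heterogeneous per-dialogue features in an XGBoost regressor
# and measure correlation with human ratings. Feature names are hypothetical.
import pandas as pd
from scipy.stats import pearsonr
from xgboost import XGBRegressor

def train_rating_regressor(df: pd.DataFrame):
    feature_cols = ["n_turns", "flag_offtopic", "poe_score", "llm_score"]  # placeholders
    X, y = df[feature_cols].to_numpy(), df["user_rating"].to_numpy()
    model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
    model.fit(X, y)
    preds = model.predict(X)
    r, _ = pearsonr(preds, y)   # correlation with human ratings (here, on training data)
    return model, r
```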
Audio and video tokenizers are autoencoders trained to represent the content of recordings as a sequence of vectors. They are prevalently used to interface large language models with non-textual modalities. While they allow advanced applications such as video generation, the envelope of their limitations is not known in the context of multimodal conversation. This work focuses on backchannels, which listeners use to signal to the speaker that they are listening. This feedback is essential to maintain the conversation flow. We evaluate whether a representative set of audio and video tokenizers encode backchannels using linear probing. Results show that although audio tokenizers capture the phenomenon relatively well, backchannels are not linearly separable in video tokenizers' representations. However, joint representations obtained by concatenating representations from both modalities improve accuracy significantly over audio-only representations, suggesting that training multimodal tokenizers is a promising direction.
The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues
Zhi Rui Tam | Wen Yu Chang | Yun-Nung Chen
This study systematically compares end-to-end (E2E) audio language models (AudioLMs) against modular (ASR, LLM, TTS) systems for multi-phase task-oriented dialogues. We evaluate open-source models on key metrics: conversational naturalness and dialogue consistency. Our findings show that E2E configurations consistently underperform their modular counterparts, exhibiting severe degradation in dialogue quality across turns. Investigating this failure, our analysis reveals that the core issue lies in the E2E models’ dialogue modeling capabilities, specifically in context maintenance and topic tracking. This work highlights a critical gap between the purported low-latency benefit of AudioLMs and their practical ability to maintain coherence in complex, multi-turn dialogues, suggesting a need for focused architectural improvements.
Analysing Next Speaker Prediction in Multi-Party Conversation Using Multimodal Large Language Models
Taiga Mori | Koji Inoue | Divesh Lala | Keiko Ochi | Tatsuya Kawahara
This study analyses how state-of-the-art multimodal large language models (MLLMs) can predict the next speaker in multi-party conversations. Through experimental and qualitative analyses, we found that MLLMs are able to infer a plausible next speaker based solely on linguistic context and their internalized knowledge. However, even in cases where the next speaker is not uniquely determined, MLLMs exhibit a bias toward overpredicting a single participant as the next speaker. We further showed that this bias can be mitigated by explicitly providing knowledge of turn-taking rules. In addition, we observed that visual input can sometimes contribute to more accurate predictions, while in other cases it leads to erroneous judgments. Overall, however, no clear effect of visual input was observed.
Exploring Emotional Nuances in Spoken Dialogue: Dataset Construction and Prediction of Emotional Dialogue Breakdown
Hyuga Nakaguro | Koichiro Yoshino
In spoken dialogue systems, even when the utterance text is the same, speaking style or tone differences can change its nuance. To respond appropriately in such cases, systems must accurately interpret paralinguistic information. Our study evaluates such a system’s ability using the "paraling-dial" dataset, which pairs a fixed utterance text with five distinct emotional expressions and their corresponding responses. We define a task using this dataset that detects mismatches—referred to as emotional dialogue breakdowns—between the expressed emotion of an utterance and the content of its response. We propose a breakdown detection system based on the Feature-wise Linear Modulation (FiLM) model, under the hypothesis that emotion dynamically controls text interpretation. Our experimental results show that the proposed model achieves 93.8% accuracy with gold emotion labels and 91.2% with predicted labels, demonstrating both its effectiveness and practicality. We also compare different types of control signals to identify the level of information required for such a breakdown detection task: emotion labels, emotion embeddings, and acoustic features. The results suggest that the appropriate level of abstraction, rather than simply richer information, is crucial for designing effective control signals.
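For readers unfamiliar with FiLM, the sketch below shows a generic feature-wise linear modulation layer in PyTorch, where a conditioning signal (here, an emotion embedding) produces a per-feature scale and shift applied to text features. The dimensions and wiring are assumptions for illustration, not the paper's exact architecture.

```python
# Generic FiLM conditioning layer: an emotion signal modulates text features
# via a learned per-feature scale (gamma) and shift (beta).
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, text_feats: torch.Tensor, emotion: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, feat_dim); emotion: (batch, cond_dim)
        gamma, beta = self.to_gamma_beta(emotion).chunk(2, dim=-1)
        return gamma * text_feats + beta   # feature-wise affine modulation
```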
Effects of Dialogue Corpora Properties on Fine-Tuning a Moshi-Based Spoken Dialogue Model
Yuto Abe | Mao Saeki | Atsumoto Ohashi | Shinnosuke Takamichi | Shiyna Fujie | Tetsunori Kobayashi | Tetsuji Ogawa | Ryuichiro Higashinaka
This study investigates how interactional characteristics of spoken dialogue corpora influence the learning process and resulting behavior of speech language models for full-duplex dialogue systems. While previous research has mainly focused on improving acoustic and linguistic quality, an effective dialogue system must also capture and reproduce task-dependent interactional dynamics such as conversational tempo and turn-taking patterns. To analyze these properties, we evaluated multiple dialogue corpora using NISQA for speech quality, LLM-as-a-Judge for linguistic and semantic appropriateness, and four timing-based indicators: inter-pausal units, pause, gap, and overlap. A curriculum learning strategy was applied to fine-tune a Moshi-based full-duplex dialogue model by incrementally combining corpora with different interactional characteristics. Experimental results on a dialogue continuation task showed that corpus-specific interactional patterns effectively shape model behavior. Chat-style corpora facilitated natural rhythms with moderate overlaps and gaps, whereas consultation-style corpora promoted more stable and deliberate timing. Fine-tuning with high-quality audio improved speech quality, while using task-mismatched data degraded linguistic coherence.
Mixed-Initiative Dialogue Management for Human-Virtual Agents Interaction in Forum Theatre Inspired Training
Samuel Otofa | Yacine Zerenini | Frederic Bechet | Benoit Favre | Jean-Marie Pergandi | Magalie Ochs
This work presents a virtual reality (VR) training tool designed to raise awareness of social discrimination (ethnic and gender-based) and to train individuals to respond effectively when witnessing such situations. Inspired by Augusto Boal’s forum theatre, the system recreates interactive scenarios of discrimination using autonomous virtual agents. The user first observes a discriminatory scene, then analyzes it through an interaction with a virtual conversational agent, and finally replays the scene by embodying the discriminated character to explore alternative reactions. From a dialogue system perspective, the project introduces a hybrid dialogue management architecture combining state-based control with Large Language Model (LLM)-driven open dialogue. This mixed-initiative approach allows the system to manage structured training sequences while supporting flexible, context-aware interactions on sensitive topics. The demonstrator illustrates this approach through a case of ordinary sexism in a professional setting, highlighting the potential of spoken dialogue systems in VR for experiential learning and social behavior training.
Analyzing Utterance Selection for Unnoticeable Topic Induction in Target-Guided Conversation Systems
Kai Yoshida | Koichiro Yoshino
Target-guided conversation systems conduct dialogues to achieve predefined conversation targets, such as recommending target goods or talking about target topics. In such systems, it is important to transition topics naturally toward the target without letting the user notice the intention behind the topic induction. In this study, we implement a surprisal-based framework that quantifies the sense of induction, target awareness, and naturalness of system utterances by computing surprisal using an external language model. Experimental results from dialogue sessions demonstrate that utterance selection based on the proposed surprisal-based evaluation reduces the perceived induction of system utterances. Furthermore, correlation analysis reveals that the proposed metric aligns with human perception of induction. We also observe that surprisal values with respect to the target gradually decrease as the conversation progresses, indicating that the model implicitly learns to approach the target more naturally over time.
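A minimal sketch of the underlying surprisal computation with an external causal language model is shown below: the average negative log-probability of a candidate utterance given the dialogue context. The choice of model and of conditioning text (e.g., with respect to the target topic) are assumptions for illustration.

```python
# Sketch: surprisal of a candidate utterance given the dialogue context,
# computed with an off-the-shelf causal LM (model choice is an assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def surprisal(context: str, utterance: str, model_name: str = "gpt2") -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ctx_ids = tok(context, return_tensors="pt").input_ids
    utt_ids = tok(utterance, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, utt_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # log-probability of each token given everything before it
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    utt_lp = token_lp[:, ctx_ids.size(1) - 1:]   # positions predicting utterance tokens
    return float(-utt_lp.mean())                  # average surprisal in nats
```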
Development of an Evaluation System for a Fan-Engagement Chat Application Using LLM-as-a-Judge
Yuki Fujita | Yasunobu Sasaki | Ryota Arashi | Hokuto Ototake | Shinya Takahashi
To address challenges in objectivity and efficiency in evaluating the quality of generative AI chatbots, we developed an automatic evaluation framework using the "LLM-as-a-judge" approach. A User Simulator, built with In-Context Learning and LoRA tuning, was employed to generate pseudo-conversation logs of the fan-engagement application OSHIAI. These logs were then automatically evaluated by a Judge LLM across six dimensions, and the contribution of this method to quality management in real-world services was verified.
A Dialogue Agent to Let Users Experience and Gently Enhance the "Gyaru-Mind"
Momoka Ikegami | Takuya Kato | Saizo Aoyagi | Tatsunori Hirai
In Japan, the term "Gyaru-Mind" is commonly used to describe an upbeat mindset associated with gyaru culture, often linked to proactive positivity and strong self-affirmation. While it is widely regarded as beneficial, "Gyaru-Mind" lacks an academic operationalization and practical method for internalization. In this work, we define a quantitative index, "GYARU-MIDX", built from eight text-based factors, and implement a dialogue agent named GYARU-AI that uses this index in real time. During conversation, the agent estimates a user’s score and produces brief, context-appropriate replies by choosing between advice and empathy, so responses are not just positive all the time. A live "GYARU-MIDX" view provides real-time feedback for reflection and practice. The current system is Japanese-only because it is trained on Japanese "gyaru" style. We describe initial design and modeling results and outline limitations and next steps.
Towards a proactive cooking companion for the elderly
Katarina Esteve | Morgan Fredriksson | Joakim Gustafson | Dimosthenis Kontogiorgos | Timo Mashiyi-Veikkola
Aging-in-place policies leave elderly populations vulnerable to declining nutrition and social isolation. This paper presents a voice-based cooking assistant designed as a companion, addressing both nutritional and social needs through intelligent kitchen interaction. Through WoZ experiments, we validated that social dialogue serves functional purposes: "chatty" assistants transform cooking pauses into engaging interactions, while instruction-only versions create frustrating dead air, despite identical timing.
Conversational AI for Virtual Standardized Patients using a Speech-to-Speech LLM
Andrew Emerson | Keelan Evanini | Su Somay | Kevin Frome | Le An Ha | Polina Harik
To develop clinical reasoning skills, medical students are often tasked with interacting with trained standardized patients (SPs). Human SPs enable real conversations that can resemble authentic clinical scenarios. However, human SPs require extensive training and are often limited in their accessibility and continual availability to medical students or residents. Virtual SPs offer the ability for medical students to practice clinical interviews in a lower-stakes setting across a broader set of clinical cases. This paper introduces a virtual SP (VSP) that leverages Amazon’s Nova Sonic, a speech-to-speech foundation model designed for human-like conversation. We investigated the ability of Nova Sonic to portray four distinct clinical cases in virtual doctor-patient encounters with 20 third-year medical students. The system’s realism, its perceived learning value, and user experience were all assessed via a survey administered to the students. Students were also asked to compare this experience to interactions with a human SP. Survey results and conversations were analyzed to derive insights for improving the Nova Sonic-based VSP system.
Can Small-Scale LLMs Balance Content Accuracy and Speaker Faithfulness in Noisy French Dialogue Summarization?
Rim Abrougui | Guillaume Lechien | Elisabeth Savatier | Benoît Laurent
Summarizing domain-specific and multi-speaker conversations, such as political debates, remains challenging under noisy ASR conditions. In industrial contexts, large language models (LLMs) are often impractical due to resource and confidentiality constraints. This work evaluates whether smaller LLMs (up to 8B parameters) can produce reliable summaries in such settings. Experiments on French debates show that noise significantly degrades accuracy and readability, while fine-tuning on clean, domain-related data improves robustness and reduces hallucinations. We also analyze person-name mentions as indicators of speaker faithfulness, finding that fine-tuning can help identify all speakers in far more debates than chain-of-thought prompting. However, evaluations on limited industrial data show that fine-tuning still struggles to generalize to unseen speakers and topics.
ORCHESTRA: AI-Driven Microservices Architecture to Create Personalized Experiences
Jaime Bellver | Samuel Ramos-Varela | Anmol Guragain | Ricardo Córdoba | Luis Fernando D’Haro
Industry stakeholders are willing to incorporate AI systems into their pipelines, but they want agentic flexibility without losing the guarantees and auditability of fixed pipelines. This paper describes ORCHESTRA, a portable and extensible microservice architecture for orchestrating customizable multimodal AI workflows across domains. It embeds Large Language Model (LLM) agents within a deterministic control flow, combining reliability with adaptive reasoning. A Dockerized Manager routes text, speech, and image requests through specialist workers for ASR, emotion analysis, retrieval, guardrails, and TTS, ensuring that multimodal processing, safety checks, logging, and memory updates are consistently executed, while scoped agent nodes adjust prompts and retrieval strategies dynamically. The system scales via container replication and exposes per-step observability through open-source dashboards. We ground the discussion in a concrete deployment: an interactive museum guide that handles speech and image queries, personalizes narratives with emotion cues, invokes tools, and enforces policy-compliant responses. From this application, we report actionable guidance: interface contracts for services, where to place pre/post safety passes, how to structure memory for RAG, and common failure modes with mitigations. We position the approach against fully agentic and pure pipeline baselines, outline trade-offs (determinism vs. flexibility, latency budget), and sketch near-term extensions such as sharded managers, adaptive sub-flows, and streaming inference. Our goal is to provide a reusable blueprint for safely deploying agent-enhanced, multimodal assistants in production, illustrated through the museum use case.
Benchmarking Multilingual Temporal Reasoning in LLMs: The Temporal Reasoning Dataset
Vittorio Mazzia | Sandro Pollastrini | Davide Bernardi | Chiara Rubagotti | Daniele Amberti
Temporal reasoning is a make-or-break capability for Large Language Models (LLMs) aspiring to act as reliable personal and enterprise assistants. This work introduces the Temporal Reasoning Dataset (TRD), a programmatically generated multilingual benchmark designed to evaluate the operational temporal-reasoning capabilities of LLMs across ten languages, with a particular focus on basic operations relevant to conversational agents handling time-sensitive tasks. TRD uses human-curated carrier phrases to generate a dataset that is resilient to overfitting, with diverse samples and controlled difficulty across five core task categories, each at five difficulty levels. Extensive experimentation shows consistent patterns in model performance across languages, with a strong linear decline in accuracy as task difficulty rises in reasoning-based tasks, while memorization-based tasks remain stable. Furthermore, reasoning tasks remain robust across temporal shifts, whereas memorization tasks show performance degradation. Additionally, contextual modifications to prompts influence model performance differently than human cognitive patterns.
Retrospective Speech Recognition for Spoken Dialogue System: Exploiting Subsequent Utterances to Enhance ASR Performance
Ryu Takeda | Kazunori Komatani
Spoken dialogue systems would benefit from the ability to self-correct, namely, revising earlier recognition results once later utterances are available, as humans often do in dialogue. However, conventional automatic speech recognition (ASR) frameworks mainly process user utterances sequentially and rely only on the preceding context. To address this limitation, we propose Retrospective Speech Recognition (RSR), which refines past recognition results by exploiting subsequent utterances. We formulate and implement an RSR model for a dialogue-system setting in which system utterances can also be utilized. Each past user utterance is processed with an interpretable syllabogram representation, which integrates preceding and subsequent utterances within a shared domain between the signal and text levels. This intermediate representation also helps reduce orthographic inconsistencies. Experimental results using real Japanese dialogue speech showed that utilizing subsequent utterances improved the character error rate by 0.10 points, demonstrating the utility of RSR. We also investigated the impact of other factors, such as the utilization of system utterances.
From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems
Parisa Rabbani | Nimet Beyza Bozdag | Dilek Hakkani-Tur
LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM's conviction changes when a task is reframed from a direct factual query to a Conversational Judgment Task. Our evaluation framework contrasts the model's performance on direct factual queries with its assessment of a speaker's correctness when the same information is presented within a minimal dialogue, effectively shifting the query from "Is this statement correct?" to "Is this speaker correct?". Furthermore, we apply pressure in the form of a simple rebuttal ("The previous answer is incorrect.") to both conditions. This perturbation allows us to measure how firmly the model maintains its position under conversational pressure. Our findings show that while some models, such as GPT-4o-mini, reveal sycophantic tendencies under social framing, others, such as Llama-8B-Instruct, become overly critical. We observe an average performance change of 9.24% across all models, demonstrating that even minimal dialogue context can significantly alter model judgment and underscoring conversational framing as a key factor in LLM-based evaluation. The proposed framework offers a reproducible methodology for diagnosing model conviction and contributes to the development of more trustworthy dialogue systems.
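The sketch below illustrates the two query framings and the rebuttal perturbation as prompt templates. The exact wording is not taken from the paper; the templates are hypothetical examples of the framing shift.

```python
# Illustrative prompt templates for the factual vs. conversational framings
# and the rebuttal perturbation; wording is an assumption, not the paper's.
def build_prompts(statement: str, speaker: str = "Speaker A") -> dict:
    factual = f'Is this statement correct? "{statement}" Answer yes or no.'
    conversational = (
        f"{speaker}: {statement}\n"
        f"Is {speaker} correct? Answer yes or no."
    )
    rebuttal = "The previous answer is incorrect."
    return {"factual": factual, "conversational": conversational, "rebuttal": rebuttal}
```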
Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction
Galann Pennec | Zhengyuan Liu | Nicholas Asher | Philippe Muller | Nancy Chen
Vision-Language Models (VLMs) are able to process increasingly longer videos. Yet, important visual information is easily lost throughout the entire context and missed by VLMs. Also, it is important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.
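At the pipeline level, the clip-selection idea can be sketched as below: caption each short clip with a lightweight model, then let an LLM pick the K most salient clips. The captioner and the LLM ranking call are hypothetical stand-ins, not the paper's models.

```python
# Pipeline-level sketch of LLM-guided key-clip selection from lightweight clip captions.
from typing import Callable, List, Sequence

def select_key_clips(
    clips: Sequence[object],
    caption_fn: Callable[[object], str],                 # lightweight video captioner (stand-in)
    llm_rank_fn: Callable[[List[str], int], List[int]],  # LLM returns indices of chosen clips (stand-in)
    k: int = 10,
) -> List[int]:
    captions = [caption_fn(c) for c in clips]   # compact textual descriptions of each clip
    return llm_rank_fn(captions, k)             # LLM selects the K most salient clips
```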
Multilingual and Continuous Backchannel Prediction: A Cross-lingual Study
Koji Inoue | Mikey Elmers | Yahui Fu | Zi Haur Pang | Taiga Mori | Divesh Lala | Keiko Ochi | Tatsuya Kawahara
We present a multilingual, continuous backchannel prediction model for Japanese, English, and Chinese, and use it to investigate cross-linguistic timing behavior. The model is Transformer-based and operates at the frame level, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations. Across all three languages, the multilingual model matches or surpasses monolingual baselines, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer with two-language training remains limited, underscoring substantive cross-lingual differences. Perturbation analyses reveal distinct cue usage: Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation; multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese. A context-length study further shows that Japanese is relatively robust to shorter contexts, while Chinese benefits markedly from longer contexts. Finally, we integrate the trained model into real-time processing software, demonstrating CPU-only inference. Together, these findings provide a unified model and empirical evidence for how backchannel timing differs across languages, informing the design of more natural, culturally aware spoken dialogue systems.
Vanishing point of attention: A platform for adaptive driver dialogue experiments
Morgan Fredriksson | Yanis Yaici | Kevin Lam | Jurgen Konigsmann | Jens Edlund
Current in-vehicle conversational agents lack awareness of the driving situation, treating all dialogue alike regardless of cognitive demands. This paper presents a modular experimental platform that integrates the CARLA driving simulator with a real-time spatial-reasoning engine to support research on situation-aware dialogue. The system enables Wizard-of-Oz studies in which human operators control conversational agents informed by live spatial-semantic analysis of the traffic environment. As initial validation, a controlled study (n = 10) tested the platform’s sensitivity to conversational load effects, examining whether increasing conversational complexity produces a vanishing point of attention, a threshold where combined conversational and driving demands lead to a non-linear collapse in performance. Results revealed a sharp rise in collisions and missed hazard detections under high cognitive load, confirming the platform’s sensitivity to conversational strain. The platform provides a reproducible testbed for investigating how dialogue timing, content, and environmental demands interact, offering a foundation for designing adaptive, cognitively safe in-vehicle conversational systems.
When social robots see our sketches: evaluating human perception of a robot and a VLM model performance in a drawing task
Viktoria Paraskevi Daniilidou | Nikolai Ilinykh | Vladislav Maraev
We introduce a multimodal framework for interactive drawing in a robot-assisted second language learning scenario. In this scenario, humans are asked to draw objects and the spatial relations between them, while a social robot uses a vision-language model (VLM) to analyse whether the drawings are correct. The correctness decision passed to the human comes from a Wizard-of-Oz (WoZ) setup, which we therefore use to indirectly evaluate the quality of VLM predictions. We show that the task is very challenging for a VLM and that how VLM performance is evaluated matters: focusing on the correctness of predictions of certain features (objects, relations) provides a different evaluation picture from evaluating the model on predicting the content of the image as a whole. We also examine, through a questionnaire, how the appearance of the social agent and the type of feedback influence participants' perception of the agent. Comparing verbal feedback generated by large language models against simple pattern-based feedback did not show any significant effects, whereas changing the robot's appearance led to significant differences in user ratings of the agent's naturalness and social presence.
Adding Determinism to a Dialogue Agent for a Robotic Environment
Oihana Garcia Anakabe | Riccardo Cocola | Cristina Aceta
Large Language Models (LLMs) have strong capabilities in natural dialogue, but their inherent indeterminacy presents challenges in robotic environments where safety and reliability are critical. In this study, we propose a dialogue agent that has been developed to guide and support human operators during robot demonstrations, following the Learning from Demonstration (LfD) paradigm, where the robot learns tasks from the operator’s actions. The agent presented in this work extends the standard prompt-based LLM setup by integrating state graphs that explicitly encode dialogue states and transitions. This structure ensures that user interactions follow the intended path, while still allowing users to communicate in a flexible and natural manner. The state graph agent is benchmarked against a monolithic prompt baseline in challenging dialogue scenarios involving ambiguity, incomplete actions, or operator errors. Despite the LLM prompt achieving good standalone performance, the state-controlled agent shows greater contextual understanding, reasoning capability, and advisory performance, leading to more intelligent and reliable interactions.
Context-Aware Language Understanding in Human-Robot Dialogue with LLMs
Svetlana Stoyanchev | Youmna Farag | Simon Keizer | Mohan Li | Rama Sanand Doddipatla
In this work, we explore the use of large language models (LLMs) as interpreters of user utterances within a human-robot language interface. A user interacting with a robot that operates in a physical environment should be able to issue commands that interrupt the robot's actions, for example, corrections or refinements of the task. This study addresses the context-aware interpretation of user utterances, including those issued while the robot is actively engaged in task execution, exploring whether LLMs, without fine-tuning, can translate user commands into corresponding sequences of robot actions. Using an interactive multimodal interface combining text and video for a virtual robot operating in simulated home environments, we collect a dataset of user utterances that guide the robot through various household tasks, simultaneously capturing manual interpretations when the automatic one fails. Driven by practical considerations, the collected dataset is used to compare the interpretive performance of GPT models with smaller publicly available alternatives. Our findings reveal that action-interrupting utterances pose challenges for all models. While GPT consistently outperforms the smaller models, interpretation accuracy improves across the board when relevant, dynamically selected in-context learning examples are included in the prompt.
Learning Vision-Language Alignment in Unified LLMs with 24 Text Tokens per Image
Nicola Irmiger | Yixuan Xu | Raphael Kreft | Aram Davtyan | Manuel Kaufmann | Imanol Schlag
We explore how to adapt a pre-trained large language model to understand and generate both visual and textual information. We use an image tokenizer to compress images into discrete tokens, and train the model using the next-token prediction paradigm with the standard cross-entropy loss. A two-stage pre-training approach is applied, first training on image-only data and then on a small amount of image-text data. We evaluate how different image-text token mixing ratios during continual pre-training affect the model’s ability to retain language skills while learning visual representations. The resulting model shows promising signs of flexible multimodal understanding, bridging vision and language in a single pre-trained model.
Incorporating Respect into LLM-Based Academic Feedback: A BI-R Framework for Instructing Students after Q&A Sessions
Mayuko Aiba | Daisuke Saito | Nobuaki Minematsu
In academic research, post-presentation Q&A sessions are crucial for deepening understanding and shaping research directions. Supervisors’ comments are particularly valuable when they highlight perspectives that students have not yet fully considered. Such comments typically arise from careful reasoning within dialogue, yet large language models (LLMs) still struggle to reason precisely about dialogue context and communicative intentions. Building on LLMs, this study proposes a feedback generation framework based on the Belief–Desire–Intention (BDI) model, which conceptualizes Q&A sessions as cognitive interactions between presenters and questioners. We further extend this framework into BI-R by introducing Respect as an explicit dimension, ensuring that generated feedback is not only accurate but also pedagogically constructive. We evaluated the proposed framework (BDI and BI-R) through comparative experiments with master’s students and field experiments with doctoral students during pre-defense presentations. Results showed that while the BDI prompt did not outperform the baseline, the BI-R prompt was particularly effective when students did not fully grasp the broader context or background of the questions. When comparing BDI and BI-R, the inclusion of Respect improved the tone and pedagogical appropriateness of feedback. These findings highlight the potential of the proposed framework as a supportive tool for training students and early-career researchers.
The Complementary Role of Para-linguistic cues for Robust Pronunciation Assessment
Yassine El Kheir | Shammur Absar Chowdhury | Ahmed Ali
Research on pronunciation assessment systems focuses on utilizing the phonetic and phonological aspects of non-native (L2) speech, often neglecting the rich layer of information hidden within para-linguistic cues. In this study, we propose a novel pronunciation assessment framework, IntraVerbalPA. (The source code will be made publicly available upon acceptance.) The framework innovatively incorporates both fine-grained frame-level and abstract utterance-level para-linguistic cues, alongside the raw speech and phoneme representations. Additionally, we introduce the “Goodness of phonemic-duration” metric to effectively model phoneme duration distribution within the framework. Our results validate the effectiveness of the proposed IntraVerbalPA framework and its individual components, yielding performance that matches or outperforms existing work.
Evaluating LLM Style Transfer Through Readability-Based Age Assessments
Maria Di Maro | Antonio Origlia | Leonilda Bilo | Roberta Meo | Pietro Maturi | Francesca Nappo
Adaptability to the audience is an important feature for conversational systems, especially in the healthcare dissemination field, where scientific concepts have to be delivered to a potentially wide range of users. This work presents an evaluation of the capability of LLMs to support style transfer according to the target user's age group. Two complementary evaluation methods were employed: an automatic assessment based on the ARI readability index, and an evaluation by human experts focusing on appropriateness for the user's educational level as well as content accuracy. Results show that LLMs efficiently switch style when provided with information about the user's age, while managing content still requires the adoption of safety measures.
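For reference, the automatic assessment relies on the Automated Readability Index (ARI); the sketch below applies the standard formula with a deliberately naive tokenization, for illustration only.

```python
# Automated Readability Index (ARI), approximating a US grade level.
# The tokenization here is intentionally simple and only illustrative.
import re

def ari(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = max(1, len(text.split()))
    chars = sum(len(w.strip(".,!?;:")) for w in text.split())
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43
```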
SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning
Emre Can Acikgoz | Jinoh Oh | Jie Hao | Joo Hyuk Jeon | Heng Ji | Dilek Hakkani-Tur | Gokhan Tur | Xiang Li | Chengyuan Ma | Xing Fan
Effective human-agent collaboration is increasingly prevalent in real-world applications. Current trends in such collaborations are predominantly unidirectional, with users providing instructions or posing questions to agents, and agents responding directly without seeking necessary clarifications or confirmations. However, the evolving capabilities of these agents require more proactive engagement, where agents dynamically participate in conversations to clarify user intents, resolve ambiguities, and adapt to changing circumstances. Existing work under-utilizes the conversational capabilities of language models (LMs), thereby optimizing agents as better followers rather than effective speakers. In this work, we introduce SpeakRL, a reinforcement learning (RL) method that enhances agents' conversational capabilities by rewarding proactive interactions with users, such as asking the right clarification questions when necessary. To support this, we curate SpeakER, a synthetic dataset that includes diverse scenarios from task-oriented dialogues, where tasks are resolved through interactive clarification questions. We present a systematic analysis of reward design for conversational proactivity and propose a principled reward formulation for teaching agents to balance asking with acting. Empirical evaluations demonstrate that our approach achieves a 20.14% absolute improvement in task completion over base models without increasing the number of conversation turns, even surpassing much larger proprietary models, demonstrating the promise of clarification-centric user-agent interactions.
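As a purely illustrative example of the kind of reward shaping discussed above, the sketch below combines task success with a penalty for clarification questions beyond what was needed. The paper's actual reward formulation is not reproduced here, and the weighting is an arbitrary placeholder.

```python
# Illustrative reward shape for balancing asking with acting (not the paper's formulation).
def clarification_reward(task_completed: bool, n_clarifications: int,
                         clarifications_needed: int, lam: float = 0.1) -> float:
    success = 1.0 if task_completed else 0.0
    # penalize clarification questions beyond what was actually needed
    excess = max(0, n_clarifications - clarifications_needed)
    return success - lam * excess
```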
Adaptive Multimodal Sentiment Analysis with Stream-Based Active Learning for Spoken Dialogue Systems
Atsuto Ajichi | Takato Hayashi | Kazunori Komatani | Shogo Okada
In empathic dialogue systems, it is crucial to continuously monitor and adapt to the user’s emotional state. To capture user-specific mappings between multimodal behaviors and emotional states, directly asking users about their emotions during dialogue is the most straightforward and effective approach. However, frequent questioning can cause inconvenience to users and diminish the user experience, so the number of queries should be minimized. In this study, we formulate personalized multimodal sentiment analysis (MSA) as a stream-based active learning problem, where user behaviors are observed sequentially, and we assume that the system has an ability to decide at each step whether to request an emotion label from the user. Simulation experiments using a human–agent dialogue corpus demonstrate that the proposed method efficiently improves performance even under few-shot conditions. These results indicate that our approach is effective for developing dialogue systems that achieve cost-efficient personalized MSA.
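A minimal sketch of a stream-based active-learning loop is shown below: the system queries the user for an emotion label only when its current prediction is uncertain. The incremental classifier and the confidence threshold are assumptions for illustration, not the paper's model.

```python
# Stream-based active learning sketch: query a label only when confidence is low.
import numpy as np
from sklearn.linear_model import SGDClassifier

def stream_active_learning(stream, classes, threshold: float = 0.6):
    """stream yields (features, ask_user_fn) pairs; ask_user_fn() returns a label."""
    model = SGDClassifier(loss="log_loss")
    seen_any = False
    for x, ask_user in stream:
        x = np.asarray(x).reshape(1, -1)
        if seen_any:
            confidence = model.predict_proba(x).max()
            if confidence >= threshold:
                continue                      # confident enough: do not bother the user
        y = ask_user()                        # query the user for an emotion label
        model.partial_fit(x, [y], classes=classes)
        seen_any = True
    return model
```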
Predicting Turn-Taking in Child–Adult Conversations Using Voice Activity Projection
Youcef Brahimi | César Blanc | Abdellah Fourtassi
Turn-taking is a hallmark of human conversation, yet its developmental trajectory remains poorly understood. Adults typically respond within a few hundred milliseconds, suggesting reliance on predictive cues rather than simply waiting for silence. In contrast, children’s longer gaps raise the question of whether they depend on simpler, reactive strategies. This study provides the first large-scale test of competing hypotheses about children’s turn-taking, using corpora of child–adult and adult–adult dialogues. In Study 1, we compared a simple silence-based threshold model with the Voice Activity Projection (VAP) model, which predicts upcoming speech activity from acoustic features. Results showed that silence alone could not account for children’s behavior, whereas predictive acoustic models performed well, indicating that even early turn-taking relies on anticipatory mechanisms. In Study 2, we asked what cues support these predictions by comparing models based on acoustic features alone with models combining acoustic and lexical information. For adult conversations, lexical cues improved prediction, but for child–adult dialogues, acoustic information was sufficient to solve the task. Together, these findings suggest that children’s turn-taking is predictive but primarily grounded in acoustic patterns, revealing both continuity with adult mechanisms and developmental differences in how linguistic cues are integrated.
Supporting human operators during customer service interactions with agentic-RAG
Juan Barrionuevo-Valenzuela | Daniel Calderón-González | Zoraida Callejas | David Griol
This paper focuses on improving customer service in call centers, where finding accurate answers in the shortest possible time is crucial. The proposed solution is the development of a conversational AI system that acts as a "copilot" for human operators. The main goal of this copilot is to assist the operator in real time by providing conversation summaries, relevant domain information, and suggested responses that help guide the interaction toward a successful resolution. To achieve this, different approaches to Retrieval Augmented Generation (RAG) have been explored. The proposed agentic-RAG architecture integrates multiple autonomous agents for routing, retrieval validation, and response generation, achieving consistent improvements in real-time performance, grounding, and overall user experience across diverse service scenarios. Empirical results with the Action-Based Conversations Dataset (ABCD) corpus show that the use of agents proved to be effective in handling unstructured conversational data. The proposed approach showed an improvement in the quality, relevance, and accuracy of the generated responses with respect to a naïve RAG baseline. It is important to emphasize that this system is not intended to replace the operator, but rather to act as a support tool to enhance efficiency and customer satisfaction.
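The agentic-RAG flow described above can be sketched at the architecture level as a router, a retrieval-validation step, and a grounded generator. The callables below are hypothetical stand-ins, not the paper's implementation.

```python
# Architectural sketch of one copilot turn in an agentic-RAG setup (stand-in interfaces).
from typing import Callable, List

def copilot_turn(
    dialogue_summary: str,
    user_query: str,
    route: Callable[[str], str],                       # e.g., "kb_lookup" or "suggest_reply"
    retrieve: Callable[[str], List[str]],              # domain document retrieval
    validate: Callable[[str, List[str]], List[str]],   # keep only passages relevant to the query
    generate: Callable[[str, List[str]], str],         # grounded response suggestion
) -> str:
    intent = route(user_query)
    docs = retrieve(user_query) if intent == "kb_lookup" else []
    docs = validate(user_query, docs)
    return generate(f"{dialogue_summary}\n{user_query}", docs)
```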
Analysis of Child-Caregiver Interactions for Developing a Caregiver Spoken Dialogue System
Sanae Yamashita | Shota Mochizuki | Yuko Kuma | Ray Sakai | Ayaka Sasaki | Ryuichiro Higashinaka
We aim to develop a caregiver spoken dialogue system for remote childcare services. As a first step toward this goal, this study investigates how interactions occur between children and caregivers. We collected Japanese child–caregiver dialogue data through a remote childcare service in which participants engaged in activities such as introductions, quizzes, and free conversations. The collected data were analyzed and compared with existing child–caregiver dialogue data from both acoustic and linguistic perspectives. The results showed that, acoustically, child–caregiver dialogues contained fewer overlapping utterances than adult dialogues. Linguistically, the distribution and transitions of utterance intentions differed across dialogue parts, reflecting the diverse structures of each activity. These findings provide useful insights for building future caregiver spoken dialogue systems, suggesting that a turn-based interaction structure may be sufficient and that dialogue control should be adapted to each part of the dialogue.
Can code-switching improve the user experience with a dialogue system app for recording endangered languages?
Jacqueline Brixey | David Traum
This paper investigates whether a multilingual spoken dialogue system can be used to help collect and preserve endangered language data. In this work, we extend DAPEL (Dialogue APp for Endangered Languages), which is designed to help preserve any language. Our focus, for testing purposes, is on the American Indigenous language Choctaw. The system uses English as a common language, and we test whether incorporating code-switching—the act of alternating between languages—enhances the user experience and/or increases the amount of recorded language data. Our results indicate that users have a positive response to interacting in both languages with the system, that the system plays a meaningful role in language documentation, and, notably, that participants who speak Choctaw as their first language are more receptive to a code-switching system than to a monolingual English-based system.
Estimating Relationships between Participants in Multi-Party Chat Corpus
Akane Fukushige | Koji Inoue | Keiko Ochi | Tatsuya Kawahara | Sanae Yamashita | Ryuichiro Higashinaka
While most existing dialogue studies focus on dyadic (one-on-one) interactions, research on multi-party dialogues has gained increasing importance. One key challenge in multi-party dialogues is identifying and interpreting the relationships between participants. This study focuses on a multi-party chat corpus and aims to estimate participant pairs with specific relationships, such as family members and acquaintances. We evaluated the performance of large language models (LLMs) in estimating these relationships, comparing them with a logistic regression model that uses interpretable textual features, including the number of turns and the frequency of honorific expressions. The results show that even advanced LLMs struggle with social relationship estimation, performing worse than a simple heuristic-based approach. This finding highlights the need for further improvement in enabling LLMs to naturally capture social relationships in multi-party dialogues.
WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
Zachary Ellis | Jared Joselowitz | Yash Deo | Yajie Vera He | Anna Kalygina | Aisling Higham | Mana Rahimzadeh | Yan Jia | Ibrahim Habli | Ernest Lim
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen’s kappa of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
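For concreteness, the agreement metrics reported above (accuracy and Cohen's kappa between clinician risk labels and judge predictions) can be computed as in the sketch below; the label lists are placeholders, not the paper's data.

```python
# Agreement between clinician-assigned risk labels and LLM-judge labels (placeholder data).
from sklearn.metrics import accuracy_score, cohen_kappa_score

clinician = ["No", "Minimal", "Significant", "No"]       # gold risk labels (placeholder)
judge     = ["No", "Minimal", "Significant", "Minimal"]  # LLM-judge labels (placeholder)

print("accuracy:", accuracy_score(clinician, judge))
print("kappa:   ", cohen_kappa_score(clinician, judge))
```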
ReflectOR: an LLM-based Agent for Post-Operative Surgical Debriefing
Lorenzo Fumi | Marco Bombieri | Sara Allievi | Stefano Bonvini | Theodora Chaspari | Marco A. Zenati | Paolo Giorgini
Ineffective teamwork and communication can generate medical errors in the high-pressure environment of surgery, making post-operative debriefings essential for enhancing team performance and patient safety. However, these sessions are frequently rushed or incomplete due to clinicians’ limited time. This paper introduces ReflectOR, an Agentic-AI architecture designed to support surgical debriefings by processing audio recordings from the operating room. The system employs specialized sub-agents that perform tasks such as generating summaries, constructing timelines of intraoperative events, identifying potential errors and counting the materials used. A qualitative evaluation indicates that the system effectively contextualizes transcripts, demonstrating its potential as a valuable tool for surgical debriefing. The paper also outlines key considerations for applying such an architecture in real-world clinical environments.
Detecting Mental Manipulation in Speech via Synthetic Multi-Speaker Dialogue
Run Chen | Wen Liang | Ziwei Gong | Lin Ai | Julia Hirschberg
Mental manipulation, the strategic use of language to covertly influence or exploit others, is a newly emerging task in computational social reasoning. Prior work has focused exclusively on textual conversations, overlooking how manipulative tactics manifest in speech. We present the first study of mental manipulation detection in spoken dialogues, introducing a synthetic multi-speaker benchmark SPEECHMENTALMANIP that augments a text-based dataset with high-quality, voice-consistent Text-to-Speech rendered audio. Using few-shot large audio-language models and human annotation, we evaluate how modality affects detection accuracy and perception. Our results reveal that models exhibit high specificity but markedly lower recall on speech compared to text, suggesting sensitivity to missing acoustic or prosodic cues in training. Human raters show similar uncertainty in the audio setting, underscoring the inherent ambiguity of manipulative speech. Together, these findings highlight the need for modality-aware evaluation and safety alignment in multimodal dialogue systems.
CoVaPh: A Vision-Language Multi-Agent Dialogue System for Tool-Augmented Pharmacogenetic Reasoning and Personalized Guidance
Shang-Chun Luke Lu | Hsin Yang | Hui-Hsin Xue | Ping Lin Tsai | Yu Jing Weng | Shiou-Chi Li | Jen-Wei Huang | Hui Hua Chang
The post-pandemic healthcare labor crisis has intensified the demand for accessible, high-precision pharmaceutical care. To meet this challenge, we introduce CoVaPh, a multi-agent pharmacogenetic framework that integrates information retrieval with Large Language Model (LLM) and Vision-Language Model (VLM) technologies. At its core, a fine-tuned query rewriting module transforms clinical inquiries into structured search indices, ensuring precise multimodal retrieval from CPIC and PharmGKB while mitigating hallucination risks. By synthesizing structured API data with unstructured evidence from guidelines, our framework delivers highly reliable, context-aware responses, surpassing benchmarks by 10% on expert-curated datasets. This approach provides a scalable solution to alleviate clinical workloads and democratize access to specialized medical knowledge.