pdf
bib
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Frédéric Béchet
|
Fabrice Lefèvre
|
Nicholas Asher
|
Seokhwan Kim
|
Teva Merlin
pdf
bib
abs
Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues
Leandra Fichtel
|
Maximilian Spliethöver
|
Eyke Hüllermeier
|
Patricia Jimenez
|
Nils Klowait
|
Stefan Kopp
|
Axel-Cyrille Ngonga Ngomo
|
Amelie Robrecht
|
Ingrid Scharlau
|
Lutz Terfloth
|
Anna-Lisa Vollmer
|
Henning Wachsmuth
The ability to generate explanations that are understood by explainees is the quintessence of explainable artificial intelligence. Since understanding depends on the explainee’s background and needs, recent research focused on co-constructive explanation dialogues, where an explainer continuously monitors the explainee’s understanding and adapts their explanations dynamically. We investigate the ability of large language models (LLMs) to engage as explainers in co-constructive explanation dialogues. In particular, we present a user study in which explainees interact with an LLM in two settings, one of which involves the LLM being instructed to explain a topic co-constructively. We evaluate the explainees’ understanding before and after the dialogue, as well as their perception of the LLMs’ co-constructive behavior. Our results suggest that LLMs show some co-constructive behaviors, such as asking verification questions, that foster the explainees’ engagement and can improve understanding of a topic. However, their ability to effectively monitor the current understanding and scaffold the explanations accordingly remains limited.
pdf
bib
abs
Modeling Turn-Taking Speed and Speaker Characteristics
Kazuyo Onishi
|
Hien Ohnaka
|
Koichiro Yoshino
Modeling turn-taking speed while considering speaker characteristics and the relationships between speakers is essential for realizing dialogue systems capable of natural interactions. In this study, we focused on dialogue participants’ roles, relationships, and personality, analyzing and modeling turn-taking speeds observed in real conversations. The analysis confirmed that the expression of these attributes—role, relationship, and personality—is closely associated with turn-taking speed. Based on these findings, we constructed a model that predicts the distribution of turn-taking speeds according to each attribute using a gamma distribution. Evaluation results demonstrated that appropriate parameter fitting to the three-parameter gamma distribution enables effective modeling of turn-taking speeds based on participants’ roles, relationships, and characteristics.
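Illustrative sketch (not the authors' code) of fitting a three-parameter gamma distribution to turn-taking gaps per speaker-attribute group with SciPy; the attribute groups and gap values below are hypothetical placeholders.
```python
# Hypothetical sketch: fit a three-parameter gamma distribution (shape, loc, scale)
# to turn-taking gaps, in seconds, for each speaker-attribute group.
import numpy as np
from scipy import stats

gaps_by_role = {  # placeholder data grouped by a speaker attribute (e.g., role)
    "interviewer": np.array([0.12, 0.35, 0.40, 0.80, 1.10, 0.25]),
    "interviewee": np.array([0.50, 0.90, 1.30, 0.70, 1.80, 0.60]),
}

for role, gaps in gaps_by_role.items():
    shape, loc, scale = stats.gamma.fit(gaps)  # the three gamma parameters
    mean_gap = stats.gamma.mean(shape, loc=loc, scale=scale)
    print(f"{role}: shape={shape:.2f}, loc={loc:.2f}, scale={scale:.2f}, mean={mean_gap:.2f}s")
```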
pdf
bib
abs
Zero-Shot Evaluation of Conversational Language Competence in Data-Efficient LLMs Across English, Mandarin, and French
Sheng-Fu Wang
|
Ri-Sheng Huang
|
Shu-Kai Hsieh
|
Laurent Prévot
Large Language Models (LLMs) have achieved outstanding performance across various natural language processing tasks, including those from Discourse and Dialogue traditions. However, these achievements are typically obtained thanks to pretraining on huge datasets. In contrast, humans learn to speak and communicate through dialogue and spontaneous speech with only a fraction of the language exposure. This disparity has spurred interest in evaluating whether smaller, more carefully selected and curated pretraining datasets can support robust performance on specific tasks. Drawing inspiration from the BabyLM initiative, we construct small (10M-token) pretraining datasets from different sources, including conversational transcripts and Wikipedia-style text. To assess the impact of these datasets, we develop evaluation benchmarks focusing on discourse and interactional markers, extracted from high-quality spoken corpora in English, French, and Mandarin. Employing a zero-shot classification framework inspired by the BLiMP benchmark, we design tasks wherein the model must determine, between a genuine utterance extracted from a corpus and its minimally altered counterpart, which one is the authentic instance. Our findings reveal that the nature of pretraining data significantly influences model performance on discourse-related tasks. Models pretrained on conversational data exhibit a clear advantage in handling discourse and interactional markers compared to those trained on written or encyclopedic text. Furthermore, the models trained on a small amount of spontaneous speech transcripts perform comparably to standard LLMs.
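Illustrative sketch, not the authors' evaluation code, of the BLiMP-style zero-shot decision the abstract describes: score a genuine utterance and its minimally altered counterpart under a causal LM and pick the more likely one. The model name and sentence pair are placeholders.
```python
# Hypothetical sketch: choose the higher-likelihood member of a minimal pair
# under a causal language model (BLiMP-style zero-shot evaluation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def total_log_likelihood(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the loss is the mean per-token negative log-likelihood.
        loss = lm(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)             # sum over predicted tokens

genuine = "well, you know, that is kind of what I meant"    # placeholder utterance
altered = "indeed, you know, that is kind of what I meant"  # placeholder alteration
print("genuine" if total_log_likelihood(genuine) > total_log_likelihood(altered) else "altered")
```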
pdf
bib
abs
Multi-Lingual Implicit Discourse Relation Recognition with Multi-Label Hierarchical Learning
Nelson Filipe Costa
|
Leila Kosseim
This paper introduces the first multi-lingual and multi-label classification model for implicit discourse relation recognition (IDRR). Our model, HArch, is evaluated on the recently released DiscoGeM 2.0 corpus and leverages hierarchical dependencies between discourse senses to predict probability distributions across all three sense levels in the PDTB 3.0 framework. We compare several pre-trained encoder backbones and find that RoBERTa-HArch achieves the best performance in English, while XLM-RoBERTa-HArch performs best in the multi-lingual setting. In addition, we compare our fine-tuned models against GPT-4o and Llama-4-Maverick using few-shot prompting across all language configurations. Our results show that our fine-tuned models consistently outperform these LLMs, highlighting the advantages of task-specific fine-tuning over prompting in IDRR. Finally, we report SOTA results on the DiscoGeM 1.0 corpus, further validating the effectiveness of our hierarchical approach.
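Illustrative sketch of one way such hierarchical conditioning could be wired on top of an encoder; this is an assumption for illustration, not the actual HArch architecture, and the sense-inventory sizes below are placeholders rather than the exact PDTB 3.0 counts.
```python
# Hypothetical sketch: multi-label heads where each sense level is conditioned
# on the probability distribution predicted at the level above it.
import torch
import torch.nn as nn
from transformers import AutoModel

N_L1, N_L2, N_L3 = 4, 17, 28  # placeholder sense-inventory sizes per level

class HierarchicalIDRRHead(nn.Module):
    def __init__(self, encoder_name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.level1 = nn.Linear(hidden, N_L1)
        self.level2 = nn.Linear(hidden + N_L1, N_L2)  # sees the level-1 distribution
        self.level3 = nn.Linear(hidden + N_L2, N_L3)  # sees the level-2 distribution

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        p1 = torch.softmax(self.level1(cls), dim=-1)
        p2 = torch.softmax(self.level2(torch.cat([cls, p1], dim=-1)), dim=-1)
        p3 = torch.softmax(self.level3(torch.cat([cls, p2], dim=-1)), dim=-1)
        return p1, p2, p3  # distributions over the three sense levels
```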
pdf
bib
abs
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
Chalamalasetti Kranti
|
Sherzod Hakimov
|
David Schlangen
The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation, either focusing on a single user simulator or a specific system design, limiting the generalisability of insights across architectures and configurations. In this work, we propose clem:todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem:todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from literature or newly developed ones. To the best of our knowledge, clem:todd is the first evaluation framework for task-oriented dialogue systems that supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem:todd’s flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.
pdf
bib
abs
PyTOD: Programmable Task-Oriented Dialogue with Execution Feedback
Alexandru Coca
|
Bo-Hsiang Tseng
|
Peter Boothroyd
|
Jianpeng Cheng
|
Zhenxing Zhang
|
Mark Gaynor
|
Joe Stacey
|
Tristan Guigue
|
Héctor Martínez Alonso
|
Diarmuid Ó Séaghdha
|
Anders Johannsen
Programmable task-oriented dialogue (TOD) agents enable language models to follow structured dialogue policies, but their effectiveness hinges on accurate dialogue state tracking (DST). We present PyTOD, an agent that generates executable code to track dialogue state and uses policy and execution feedback for efficient error correction. To achieve this, PyTOD employs a simple constrained decoding approach, using a language model instead of grammar rules to follow API schemata. This leads to state-of-the-art DST performance on the challenging SGD benchmark. Our experiments show that PyTOD surpasses strong baselines in both accuracy and cross-turn consistency, demonstrating the effectiveness of execution-aware state tracking.
pdf
bib
abs
TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons
Emre Can Acikgoz
|
Carl Guo
|
Suvodip Dey
|
Akul Datta
|
Takyoung Kim
|
Gokhan Tur
|
Dilek Hakkani-Tur
Task-oriented dialogue (TOD) systems are experiencing a revolution driven by Large Language Models (LLMs), yet the evaluation methodologies for these systems remain insufficient for their growing sophistication. While traditional automatic metrics effectively assessed earlier modular systems, they focus solely on the dialogue level and cannot detect critical intermediate errors that can arise during user-agent interactions. In this paper, we introduce **TD-EVAL** (**T**urn and **D**ialogue-level **Eval**uation), a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. At turn-level, we assess each response along three TOD-specific dimensions: *conversation cohesion*, *backend knowledge consistency*, and *policy compliance*. Meanwhile, we design **TOD Agent Arena** that uses pairwise comparisons to provide a measure of dialogue-level quality. Through experiments on MultiWOZ 2.4 and Tau-Bench, we demonstrate that TD-EVAL effectively identifies the conversational errors that conventional metrics miss. Furthermore, TD-EVAL exhibits better alignment with human judgments than traditional and LLM-based metrics. These findings demonstrate that TD-EVAL introduces a new paradigm for TOD system evaluation, efficiently assessing both turn and system levels with an easily reproducible framework for future research.
pdf
bib
abs
Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems
Vinh Quang Nguyen
|
Nguyen Quang Chieu
|
Hoang Viet Pham
|
Khac-Hoai Nam Bui
Task-oriented dialogue (TOD) systems facilitate goal-driven interactions between users and machines. While recent advances in deep learning have improved the performance, TOD systems often struggle in low-resource scenarios with limited labeled data. To address this challenge, we propose Spec-TOD, a novel framework designed to train an end-to-end TOD system with limited data. Spec-TOD introduces two main innovations: (i) a novel specialized end-to-end TOD framework that incorporates explicit task instructions for instruction-tuned large language models (LLMs), and (ii) an efficient training strategy that leverages lightweight, specialized LLMs to achieve strong performance with minimal supervision. Experiments on the MultiWOZ dataset, a widely used TOD benchmark, demonstrate that Spec-TOD achieves competitive results while significantly reducing the need for labeled data. These findings highlight the potential of the proposed framework in advancing efficient and effective TOD systems in low-resource settings.
pdf
bib
abs
Prompt-Guided Turn-Taking Prediction
Koji Inoue
|
Mikey Elmers
|
Yahui Fu
|
Zi Haur Pang
|
Divesh Lala
|
Keiko Ochi
|
Tatsuya Kawahara
Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as “faster” or “calmer,” adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.
pdf
bib
abs
Speech-Integrated Modeling for Behavioral Coding in Counseling
Do June Min
|
Verónica Pérez-Rosas
|
Kenneth Resnicow
|
Rada Mihalcea
Computational models of psychotherapy often ignore vocal cues by relying solely on text. To address this, we propose MISQ, a framework that integrates speech features directly into language models using a speech encoder and lightweight adapter. MISQ improves behavioral analysis in counseling conversations, achieving ~5% relative gains over text-only or indirect speech methods—underscoring the value of vocal signals like tone and prosody.
pdf
bib
abs
When a Dialog becomes a Monologue: A debate on custom-made literature with generative AI
Maja Tabea Jerrentrup
|
Martin F. Villalba
This paper presents a discussion on the potential effects of AI-generated fiction on its users in contrast to traditional literature. After discussing the importance of reading fiction and introducing the technical aspects of long story generation, we look at four aspects of how AI-generated fiction can affect users and society, namely, democratic use, creativity, customization and connectedness. We close with a discussion focusing on the need for media education.
pdf
bib
abs
Analyzing Dialogue System Behavior in a Specific Situation Requiring Interpersonal Consideration
Tetsuro Takahashi
|
Hirofumi Kikuchi
|
Jie Yang
|
Hiroyuki Nishikawa
|
Masato Komuro
|
Ryosaku Makino
|
Shiki Sato
|
Yuta Sasaki
|
Shinji Iwata
|
Asahi Hentona
|
Takato Yamazaki
|
Shoji Moriya
|
Masaya Ohagi
|
Zhiyang Qi
|
Takashi Kodama
|
Akinobu Lee
|
Takashi Minato
|
Kurima Sakai
|
Tomo Funayama
|
Kotaro Funakoshi
|
Mayumi Usami
|
Michimasa Inaba
|
Ryuichiro Higashinaka
In human-human conversation, interpersonal consideration for the interlocutor is essential, and similar expectations are increasingly placed on dialogue systems. This study examines the behavior of dialogue systems in a specific interpersonal scenario where a user vents frustrations and seeks emotional support from a long-time friend represented by a dialogue system. We conducted a human evaluation and qualitative analysis of 15 dialogue systems under this setting. These systems implemented diverse strategies, such as structuring dialogue into distinct phases, modeling interpersonal relationships, and incorporating cognitive behavioral therapy techniques. Our analysis reveals that these approaches contributed to improved perceived empathy, coherence, and appropriateness, highlighting the importance of design choices in socially sensitive dialogue.
pdf
bib
abs
Synthetic Data Augmentation for Cross-domain Implicit Discourse Relation Recognition
Frances Yung
|
Varsha Suresh
|
Zaynab Reza
|
Mansoor Ahmad
|
Vera Demberg
Implicit discourse relation recognition (IDRR) – the task of identifying the implicit coherence relation between two text spans – requires deep semantic understanding. Recent studies have shown that zero-/few-shot approaches significantly lag behind supervised models. However, LLMs may be useful for synthetic data augmentation, where LLMs generate a second argument following a specified coherence relation. We applied this approach in a cross-domain setting, generating discourse continuations using unlabelled target-domain data to adapt a base model which was trained on source-domain labelled data. Evaluations conducted on a large-scale test set revealed that different variations of the approach did not result in any significant improvements. We conclude that LLMs often fail to generate useful samples for IDRR, and emphasize the importance of considering both statistical significance and comparability when evaluating IDRR models.
pdf
bib
abs
Segmenting a Large French Meeting Corpus into Elementary Discourse Units
Laurent Prévot
|
Roxane Bertrand
|
Julie Hunter
Despite growing interest in discourse-related tasks, the limited quantity and diversity of discourse-annotated data remain a major issue. Existing resources are largely based on written corpora, while spoken conversational genres are underrepresented. Although discourse segmentation into elementary discourse units (EDUs) is considered to be nearly solved for canonical written texts, conversational spontaneous speech transcripts present different challenges. In this paper, we introduce a large French corpus of segmented meeting dialogues, including 20 hours of manually transcribed and discourse-annotated conversations, and 80 hours of automatically transcribed and discourse-segmented data. We describe our annotation campaign, discuss inter-annotator agreement and segmentation guidelines, and present results from fine-tuning a model for EDU segmentation on this resource.
pdf
bib
abs
How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations
Ikumi Numaya
|
Shoji Moriya
|
Shiki Sato
|
Reina Akama
|
Jun Suzuki
Recent advancements in dialogue generation have broadened the scope of human–bot interactions, enabling not only contextually appropriate responses but also the analysis of human affect and sensitivity. While prior work has suggested that stylistic similarity between user and system may enhance user impressions, the distinction between subjective and objective similarity is often overlooked. To investigate this issue, we introduce a novel dataset that includes users’ preferences, subjective stylistic similarity based on users’ own perceptions, and objective stylistic similarity annotated by third party evaluators in open-domain dialogue settings. Analysis using the constructed dataset reveals a strong positive correlation between subjective stylistic similarity and user preference. Furthermore, our analysis suggests an important finding: users’ subjective stylistic similarity differs from third party objective similarity. This underscores the importance of distinguishing between subjective and objective evaluations and understanding the distinct aspects each captures when analyzing the relationship between stylistic similarity and user preferences. The dataset presented in this paper is available online.
pdf
bib
abs
LLMs stick to the point, humans to style: Semantic and Stylistic Alignment in Human and LLM Communication
Noé Durandard
|
Saurabh Dhawan
|
Thierry Poibeau
This study investigates differences in linguistic accommodation—changes in language use and style that individuals make to align with their dialogue partners—in human and LLM communication. Specifically, it contrasts semantic and stylistic alignment within question-answer pairs in terms of whether the answer was given by a human or an LLM. Utilizing embedding-based measures of linguistic similarity, we find that LLM-generated answers demonstrate higher semantic similarity—reflecting close conceptual alignment with the input questions—but relatively lower stylistic similarity. Human-written answers exhibit a reverse pattern, with lower semantic but higher stylistic similarity to the respective questions. These findings point to contrasting linguistic accommodation strategies evident in human and LLM communication, with implications for furthering personalization, social attunement, and engagement in human-AI dialogue.
pdf
bib
abs
A Topicality-Driven QUD Model for Discourse Processing
Yingxue Fu
|
Mark-Jan Nederhof
|
Anais Ollagnier
Question Under Discussion (QUD) is a discourse framework that has attracted growing interest in NLP in recent years. Among existing QUD models, the QUD tree approach (Riester, 2019) focuses on reconstructing QUDs and their hierarchical relationships, using a single tree to represent discourse structure. Prior implementation shows moderate inter-annotator agreement, highlighting the challenging nature of this task. In this paper, we propose a new QUD model for annotating hierarchical discourse structure. Our annotation achieves high inter-annotator agreement: 81.45% for short files and 79.53% for long files of Wall Street Journal articles. We show preliminary results on using GPT-4 for automatic annotation, which suggests that one of the best-performing LLMs still struggles with capturing hierarchical discourse structure. Moreover, we compare the annotations with RST annotations. Lastly, we present an approach for integrating hierarchical and local discourse relation annotations with the proposed model.
pdf
bib
abs
A Multi-Task and Multi-Label Classification Model for Implicit Discourse Relation Recognition
Nelson Filipe Costa
|
Leila Kosseim
We propose a novel multi-label classification approach to implicit discourse relation recognition (IDRR). Our approach features a multi-task model that jointly learns multi-label representations of implicit discourse relations across all three sense levels in the PDTB 3.0 framework. The model can also be adapted to the traditional single-label IDRR setting by selecting the sense with the highest probability in the multi-label representation. We conduct extensive experiments to identify optimal model configurations and loss functions in both settings. Our approach establishes the first benchmark for multi-label IDRR and achieves SOTA results on single-label IDRR using DiscoGeM. Finally, we evaluate our model on the PDTB 3.0 corpus in the single-label setting, presenting the first analysis of transfer learning between the DiscoGeM and PDTB 3.0 corpora for IDRR.
pdf
bib
abs
A Multi-Layered Annotation Protocol for Polyadic Conversation: Structuring Interactional Data in the GaMMA Corpus
Mark Dourado
|
Frej Spangsberg Lorenzen
|
Jesper Udesen
|
Henrik Gert Hassager
|
Stefania Serafin
Computational models of dialogue often struggle to capture the nuanced structures of spontaneous conversation, specifically in polyadic, real-world settings. We introduce a multilayered annotation protocol designed for the GaMMA corpus, a Danish dataset of four-person conversations recorded in both quiet and noisy environments. The protocol targets key interactional phenomena: Turn Construction Units, backchannels, floor transfer attempts, and repair sequences. Each annotation layer is grounded in Conversation Analysis while remaining machine-actionable, enabling alignment with multimodal data such as gaze and motion. We report inter-annotator agreement metrics across annotation tiers and discuss how the protocol supports both fine-grained interaction analysis and the training of context-aware dialogue models.
pdf
bib
abs
Early Humorous Interaction: Towards a Formal Model
Yingqin Hu
|
Jonathan Ginzburg
|
Catherine Pelachaud
Current computational models for humour recognition and laughter generation in dialogue systems face significant limitations in explainability, context consideration and adaptability. This paper approaches these challenges by investigating how humour recognition develops in its earliest forms—during the first year of life. Drawing on developmental psychology and cognitive science, we propose a formal model incorporated within the KoS dialogue framework. This model captures how infants evaluate potential humour through knowledge-based appraisal and context-dependent modulation, including safety, emotional state, and social cues. Our model formalises dynamic knowledge updates during the dyadic interaction. We believe that this formal model can serve as the basis for developing more natural humour appreciation capabilities in dialogue systems and can be implemented in a robotic platform.
pdf
bib
abs
Transition Relevance Point Detection for Spoken Dialogue Systems with Self-Attention Transformer
Kouki Miyazawa
|
Yoshinao Sato
Most conventional spoken dialogue systems determine when to respond based on the elapsed time of silence following user speech utterances. This approach often results in failures of turn-taking, disrupting smooth communications with users. This study addresses the detection of when it is acceptable for the dialogue system to start speaking. Specifically, we aim to detect transition relevance points (TRPs) rather than predict whether the dialogue participants will actually start speaking. To achieve this, we employ a self-supervised speech representation using contrastive predictive coding and a self-attention transformer. The proposed model, TRPDformer, was trained and evaluated on the corpus of everyday Japanese conversation. TRPDformer outperformed a baseline model based on the elapsed time of silence. Furthermore, third-party listeners rated the timing of system responses determined using the proposed model as superior to that of the baseline in a preference test.
pdf
bib
abs
Identification and Analysis of Identity-Centric Elements of Character-Likeness in Game Scenario
Shinji Iwata
|
Koya Ihara
|
Shiki Sato
|
Jun Baba
|
Asahi Hentona
|
Masahiro Yamazaki
|
Yuki Shiotsuka
|
Takahiro Ishizue
|
Akifumi Yoshimoto
Generating and evaluating character-like utterances automatically is essential for applications ranging from character simulation to creative-writing support. Existing approaches primarily focus on basic aspects of character-likeness, such as script-fidelity knowledge and conversational ability. However, achieving a higher level of character-likeness in utterance generation and evaluation requires consideration of the character’s identity, which deeply reflects the character’s inner self. To bridge this gap, we identified a set of identity-centric character-likeness elements. First, we listed 27 elements covering various aspects of identity, drawing on psychology and identity theory. Then, to clarify the features of each element, we collected utterances annotated with these elements from a commercial smartphone game and analyzed them based on user evaluations regarding character-likeness and charm. Our analysis reveals part of the element-wise effects on character-likeness and charm. These findings enable developers to design practical and interpretable element-feature-aware generation methods and evaluation metrics for character-like utterances.
pdf
bib
abs
Evaluating Spoken Language Features in Conversational Models: The Case of Disfluencies and Feedbacks
Oussama Silem
|
Maïwenn Fleig
|
Philippe Blache
|
Houda Oufaida
|
Leonor Becerra-Bonache
Understanding how language is processed and represented cognitively increasingly involves the use of specialized language models. Yet, because most models are predominantly trained on written text, they struggle to reflect the characteristics of language as it naturally unfolds in spoken interaction. This gap limits their capabilities in capturing features typical of spontaneous speech, such as repetitions, feedback cues, and hesitations. In this work, we introduce linguistically motivated evaluation metrics designed to target these specific spoken-language phenomena. We apply them to analyse outputs from language models fine-tuned on spoken English and French, comparing their behaviour statistically with human dialogue corpora. Our findings highlight the value of these metrics for assessing the degree to which model-generated utterances resemble authentic human conversation.
pdf
bib
abs
DIMSUM: Discourse in Mathematical Reasoning as a Supervision Module
Krish Sharma
|
Niyar R. Barman
|
Akshay Chaturvedi
|
Nicholas Asher
We look at reasoning on GSM8k, a dataset of short texts presenting primary-school math problems. We find, with Mirzadeh et al. (2024), that current LLM progress on the dataset may not be explained by better reasoning but by exposure to a broader pretraining data distribution. We then introduce a novel information source for helping models with less data or inferior training reason better: discourse structure. We show that discourse structure improves performance for models like Llama2 13b by up to 160%. Even for models that have most likely memorized the dataset, adding discourse structural information to the model still improves predictions and dramatically improves large model performance on out-of-distribution examples.
pdf
bib
abs
Improving LLMs’ Learning of Coreference Resolution
Yujian Gan
|
Yuan Liang
|
Yanni Lin
|
Juntao Yu
|
Massimo Poesio
Coreference Resolution (CR) is crucial for many NLP tasks, but existing LLMs struggle with hallucination and under-performance. In this paper, we investigate the limitations of existing LLM-based approaches to CR—specifically the Question-Answering (QA) Template and Document Template methods—and propose two novel techniques: Reversed Training with Joint Inference and Iterative Document Generation. Our experiments show that Reversed Training improves the QA Template method, while Iterative Document Generation eliminates hallucinations in the generated source text and boosts coreference resolution. Integrating these methods and techniques offers an effective and robust solution to LLM-based coreference resolution.
pdf
bib
abs
Exploring the Design of Multi-Agent LLM Dialogues for Research Ideation
Keisuke Ueda
|
Wataru Hirota
|
Kosuke Takahashi
|
Takahiro Omi
|
Kosuke Arima
|
Tatsuya Ishigaki
Large language models (LLMs) are increasingly used to support creative tasks such as research idea generation. While recent work has shown that structured dialogues between LLMs can improve the novelty and feasibility of generated ideas, the optimal design of such interactions remains unclear. In this study, we conduct a comprehensive analysis of multi-agent LLM dialogues for scientific ideation. We compare different configurations of agent roles, number of agents, and dialogue depth to understand how these factors influence the novelty and feasibility of generated ideas. Our experimental setup includes settings where one agent generates ideas and another critiques them, enabling iterative improvement. Our results show that enlarging the agent cohort, deepening the interaction depth, and broadening agent persona heterogeneity each enrich the diversity of generated ideas. Moreover, specifically increasing critic-side diversity within the ideation–critique–revision loop further boosts the feasibility of the final proposals. Our findings offer practical guidelines for building effective multi-agent LLM systems for scientific ideation.
pdf
bib
abs
EmoNews: A Spoken Dialogue System for Expressive News Conversations
Ryuki Matsuura
|
Shikhar Bharadwaj
|
Jiarui Liu
|
Dhatchinamoorthi Kunde Govindarajan
We develop a task-oriented spoken dialogue system (SDS) that regulates emotional speech based on contextual cues to enable more empathetic news conversations. Despite advancements in emotional text-to-speech (TTS) techniques, task-oriented emotional SDSs remain underexplored due to the compartmentalized nature of SDS and emotional TTS research, as well as the lack of standardized evaluation metrics for social goals. We address these challenges by developing an emotional SDS for news conversations that utilizes a large language model (LLM)-based sentiment analyzer to identify appropriate emotions and PromptTTS to synthesize context-appropriate emotional speech. We also propose a subjective evaluation scale for emotional SDSs and judge the emotion regulation performance of the proposed and baseline systems. Experiments showed that our emotional SDS outperformed a baseline system in terms of emotion regulation and engagement. These results suggest the critical role of speech emotion for more engaging conversations. All our source code is open-sourced.
pdf
bib
abs
Distilling Empathy from Large Language Models
Henry J. Xie
|
Jinghan Zhang
|
Xinhao Zhang
|
Kunpeng Liu
The distillation of knowledge from Large Language Models (LLMs) into Smaller Language Models (SLMs), preserving the capabilities and performance of LLMs while reducing model size, has played a key role in the proliferation of LLMs. Because SLMs are considerably smaller than LLMs, they are often utilized in domains where human interaction is frequent but resources are highly constrained, e.g., smart phones. Therefore, it is crucial to ensure that empathy, a fundamental aspect of positive human interactions, already instilled into LLMs, is retained by SLMs after distillation. In this paper, we develop a comprehensive approach for effective empathy distillation from LLMs into SLMs. Our approach features a two-step fine-tuning process that fully leverages datasets of empathetic dialogue responses distilled from LLMs. We explore several distillation methods beyond basic direct prompting and propose four unique sets of prompts for targeted empathy improvement to significantly enhance the empathy distillation process. Our evaluations demonstrate that SLMs fine-tuned through the two-step fine-tuning process with distillation datasets enhanced by the targeted empathy improvement prompts significantly outperform the base SLM at generating empathetic responses with a win rate of 90+%. Our targeted empathy improvement prompts substantially outperform the basic direct prompting with a 10+% improvement in win rate.
pdf
bib
abs
RaPSIL: A Preference-Guided Interview Agent for Rapport-Aware Self-Disclosure
Kenta Hama
|
Atsushi Otsuka
|
Masahiro Mizukami
|
Hiroaki Sugiyama
|
Makoto Naka
Facilitating self-disclosure without causing discomfort remains a difficult task—especially for AI systems. In real-world applications such as career counseling, wellbeing support, and onboarding interviews, eliciting personal information like concerns, goals, and personality traits is essential. However, asking such questions directly often leads to discomfort and disengagement. We address this issue with RaPSIL (Rapport-aware Preference-guided Self-disclosure Interview Learner), a two-stage LLM-based system that fosters natural, engaging conversations to promote self-disclosure. In the first stage, RaPSIL selectively imitates interviewer utterances that have been evaluated by LLMs for both strategic effectiveness and social sensitivity. It leverages LLMs as multi-perspective judges in this selection process. In the second stage, it conducts self-play simulations, using the Reflexion framework to analyze failures and expand a database with both successful and problematic utterances. This dual learning process allows RaPSIL to go beyond simple imitation, improving its ability to handle sensitive topics naturally by learning from both successful and failed utterances. In a comprehensive evaluation with real users, RaPSIL outperformed baselines in enjoyability, warmth, and willingness to re-engage, while also capturing self-descriptions more accurately. Notably, its impression scores remained stable even during prolonged interactions, demonstrating its ability to balance rapport building with effective information elicitation. These results show that RaPSIL enables socially aware AI interviewers capable of eliciting sensitive personal information while maintaining user trust and comfort—an essential capability for real-world dialogue systems.
pdf
bib
abs
Learning to Speak Like a Child: Reinforcing and Evaluating a Child-level Generative Language Model
Enoch Levandovsky
|
Anna Manaseryan
|
Casey Kennington
A language model that can generate utterances that are appraised as being within a specific age of a young child who is beginning their language learning journey can be useful in scenarios where child-level language is needed, for example in virtual avatars, interactions with individuals who have disabilities, or developmental robotics. In this paper, we focus on an age range that is not represented in prior work: emergent speakers. We use the CHILDES database to train and tune language models of different parameter sizes using a group relative policy optimization reinforcement learning regime. Our goal is to find the most coherent, yet child-like language model while keeping the number of parameters to as few as possible. We evaluate using metrics of coherency, “toddlerality,” and an evaluation using human subjects who interact with two robot platforms. Our experiments show that even small language models (under 1 billion parameters) can be used effectively to generate child-like utterances.
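For reference, group relative policy optimization scores each sampled utterance against the other samples drawn for the same prompt; in the commonly used formulation (which may differ in detail from the training setup here), the advantage of the i-th sample in a group of G rewards is:
```latex
A_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
```
The policy is then updated with a PPO-style clipped objective using these group-normalized advantages, so no separate value network is needed.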
pdf
bib
abs
Beyond Simple Personas: Evaluating LLMs and Relevance Models for Character-Consistent Dialogue
Debaditya Pal
|
David Traum
Dialogue systems often rely on overly simplistic persona representations, limiting their capacity to portray realistic, nuanced characters. In this paper, we explore how well existing persona-grounding methods capture complex personalities using two character-rich domains—Sgt Blackwell (single-character) and Twins (two-character)—described extensively through detailed narratives. We compare early fusion techniques, Retrieval-Augmented Generation (RAG), and relevance-based approaches. Evaluations across entailment, persona alignment, and hallucination metrics reveal distinct trade-offs: Knowledge Graph fusion notably reduces hallucinations and maintains relevance, Persona fusion strongly preserves relevance but has higher hallucination rates, and RAG provides fast, fluent responses. Our findings emphasize the critical role of structured persona grounding in achieving nuanced personality modeling.
pdf
bib
abs
DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models
Sunghee Jung
|
Donghun Lee
|
Shinbok Lee
|
Gaeun Seo
|
Daniel Lee
|
Byeongil Ko
|
Junrae Cho
|
Kihyun Kim
|
EungGyun Kim
|
Myeongcheol Shin
Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but face challenges in handling incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances TA-LLM’s dialogue capabilities through Direct Preference Optimization. We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized objective loss for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o’s performance (94.8% in information gathering, 91% in tool call rejection) with substantial improvements over baseline (44% and 9.6% respectively) while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
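As background, the standard DPO loss that paired correct/incorrect trajectories are typically trained with is shown below; the paper's specialized dialogue-control objective builds on preference optimization but is defined in the paper itself. Here y_w and y_l are the preferred and dispreferred trajectories for context x, π_ref is the frozen reference model, and β a temperature.
```latex
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\left(\beta\left[\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right]\right)
```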
pdf
bib
abs
EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning
Lingxiao Kong
|
Cong Yang
|
Susanne Neufang
|
Oya Deniz Beyan
|
Zeyd Boukhers
Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including competing objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after the fine-tuning to improve efficiency and flexibility. Our method is the first to aggregate the hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text classification models to score the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption (17,529 ± 1,650 data points and 6,573 ± 147.43 seconds), improved scalability and explainability, and comparable performance across multiple objectives.
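Illustrative sketch only: the abstract describes aggregating the hidden states of objective-specific models with weights found by a hierarchical grid search; the simple linear aggregation and flat grid search below are assumptions standing in for the paper's actual algorithm.
```python
# Hypothetical sketch: combine hidden states of objective-specific models with
# weights w, searching a coarse grid for the best-scoring combination.
import itertools

def aggregate(hidden_states, weights):
    # hidden_states: one tensor [batch, seq, dim] per objective-specific model.
    return sum(w * h for w, h in zip(weights, hidden_states))

def grid_search(hidden_states, score_fn, steps=5):
    # score_fn: e.g., validation reward of generations produced from the aggregated states.
    best_w, best_score = None, float("-inf")
    grid = [i / (steps - 1) for i in range(steps)]
    for w in itertools.product(grid, repeat=len(hidden_states)):
        if sum(w) == 0:
            continue
        w = tuple(x / sum(w) for x in w)  # normalize weights to sum to 1
        score = score_fn(aggregate(hidden_states, w))
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```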
pdf
bib
abs
Learning to Ask Efficiently in Dialogue: Reinforcement Learning Extensions for Stream-based Active Learning
Issei Waki
|
Ryu Takeda
|
Kazunori Komatani
One essential function of dialogue systems is the ability to ask questions and acquire necessary information from the user through dialogue. To avoid degrading user engagement through repetitive questioning, the number of such questions should be kept low. In this study, we cast knowledge acquisition through dialogue as stream-based active learning, exemplified by the segmentation of user utterances containing novel words. In stream-based active learning, data instances are presented sequentially, and the system selects an action for each instance based on an acquisition function that determines whether to request the correct answer from the oracle (in this case, the user). To improve the efficiency of training the acquisition function via reinforcement learning, we introduce two extensions: (1) a new action that performs semi-supervised learning, and (2) a state representation that takes the remaining budget into account. Our simulation-based experiments showed that these two extensions improved word segmentation performance with fewer questions for the user, compared to a baseline without these extensions.
pdf
bib
abs
Human Capital Visualization using Speech Amount during Meetings
Ekai Hashimoto
|
Kohei Nagira
|
Takeshi Mizumoto
|
Shun Shiramatsu
In recent years, many companies have recognized the importance of human resources and are investing in human capital to revitalize their organizations and enhance internal communication, thereby fostering innovation. However, conventional quantification methods have mainly focused on readily measurable indicators without addressing the fundamental role of conversations in human capital. This study focuses on routine meetings and proposes strategies to visualize human capital by analyzing speech amount during these meetings. We employ conversation visualization technology, which operates effectively, to quantify speech. We then measure differences in speech amount by attributes such as gender and job post, changes in speech amount depending on whether certain participants are present, and correlations between speech amount and continuous attributes. To verify the effectiveness of our proposed methods, we analyzed speech amounts by departmental affiliation during weekly meetings at small to medium enterprises.
pdf
bib
abs
Key Challenges in Multimodal Task-Oriented Dialogue Systems: Insights from a Large Competition-Based Dataset
Shiki Sato
|
Shinji Iwata
|
Asahi Hentona
|
Yuta Sasaki
|
Takato Yamazaki
|
Shoji Moriya
|
Masaya Ohagi
|
Hirofumi Kikuchi
|
Jie Yang
|
Zhiyang Qi
|
Takashi Kodama
|
Akinobu Lee
|
Masato Komuro
|
Hiroyuki Nishikawa
|
Ryosaku Makino
|
Takashi Minato
|
Kurima Sakai
|
Tomo Funayama
|
Kotaro Funakoshi
|
Mayumi Usami
|
Michimasa Inaba
|
Tetsuro Takahashi
|
Ryuichiro Higashinaka
Challenges in multimodal task-oriented dialogue between humans and systems, particularly those involving audio and visual interactions, have not been sufficiently explored or shared, forcing researchers to define improvement directions individually without a clearly shared roadmap. To address these challenges, we organized a competition for multimodal task-oriented dialogue systems and constructed a large competition-based dataset of 1,865 minutes of Japanese task-oriented dialogues. This dataset includes audio and visual interactions between diverse systems and human participants. After analyzing system behaviors identified as problematic by the human participants in questionnaire surveys and notable methods employed by the participating teams, we identified key challenges in multimodal task-oriented dialogue systems and discussed potential directions for overcoming these challenges.
pdf
bib
abs
Exploring Factors Influencing Hospitality in Mobile Robot Guidance: A Wizard-of-Oz Study with a Teleoperated Humanoid Robot
Ao Guo
|
Shota Mochizuki
|
Sanae Yamashita
|
Saya Nikaido
|
Tomoko Isomura
|
Ryuichiro Higashinaka
Developing mobile robots that can provide guidance with high hospitality remains challenging, as it requires the coordination of spoken interaction, physical navigation, and user engagement. To gain insights that contribute to the development of such robots, we conducted a Wizard-of-Oz (WOZ) study using Teleco, a teleoperated humanoid robot, to explore the factors influencing hospitality in mobile robot guidance. Specifically, we enrolled 30 participants as visitors and two trained operators, who teleoperated the Teleco robot to provide mobile guidance to the participants. A total of 120 dialogue sessions were collected, along with evaluations from both the participants and the operators regarding the hospitality of each interaction. To identify the factors that influence hospitality in mobile guidance, we analyzed the collected dialogues from two perspectives: linguistic usage and multimodal robot behaviors. We first clustered system utterances and analyzed the frequency of categories in high- and low-satisfaction dialogues. The results showed that short responses appeared more frequently in high-satisfaction dialogues. Moreover, we observed a general increase in participant satisfaction over successive sessions, along with shifts in linguistic usage, suggesting a mutual adaptation effect between operators and participants. We also conducted a time-series analysis of multimodal robot behaviors to explore behavioral patterns potentially linked to hospitable interactions.
pdf
bib
abs
ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents
Jakub Hoscilowicz
|
Artur Janicki
With the growing reliance on digital devices with graphical user interfaces (GUIs) like computers and smartphones, the demand for smart voice assistants has grown significantly. While multimodal large language models (MLLMs) like GPT-4V excel in many areas, they struggle with GUI interactions, limiting their effectiveness in automating everyday tasks. In this work, we introduce ClickAgent, a novel framework for building autonomous agents. ClickAgent combines MLLM-driven reasoning and action planning with a separate UI location model that identifies relevant UI elements on the screen. This approach addresses a key limitation of current MLLMs: their inability to accurately locate UI elements. Evaluations conducted using both an Android emulator and a real smartphone show that ClickAgent outperforms other autonomous agents (DigiRL, CogAgent, AppAgent) on the AITW benchmark.
pdf
bib
abs
Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks
Joyeeta Datta
|
Niclas Doll
|
Qusai Ramadan
|
Zeyd Boukhers
Large Language Models (LLMs) have shown outstanding performance across a range of NLP tasks, but their computational demands hinder deployment in real-world, resource-constrained environments. This work investigates the extent to which LLMs can be compressed using knowledge distillation (KD) while maintaining strong performance on question answering (QA) tasks. We evaluate student models distilled from the Pythia and Qwen2.5 families on two QA benchmarks, SQuAD and MLQA, under zero-shot and one-shot prompting conditions. Results show that student models retain over 90% of their teacher models’ performance while reducing parameter counts by up to 57.1%. Furthermore, one-shot prompting yields additional performance gains over zero-shot setups for both model families. These findings underscore the trade-off between model efficiency and task performance, demonstrating that KD, combined with minimal prompting, can yield compact yet capable QA systems suitable for real-world applications.
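Illustrative sketch, not the paper's training code: the widely used temperature-scaled distillation loss that pairs a soft KL term against the teacher's logits with the ordinary hard-label loss. The temperature and mixing weight are placeholders.
```python
# Hypothetical sketch: standard knowledge-distillation loss (Hinton-style).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft term: KL(teacher || student) at temperature T, rescaled by T^2.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard term: cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```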
pdf
bib
abs
On Speakers’ Identities, Autism Self-Disclosures and LLM-Powered Robots
Sviatlana Hoehn
|
Fred Philippy
|
Elisabeth Andre
Dialogue agents become more engaging through recipient design, which needs user-specific information. However, a user’s identification with marginalized communities, such as migration or disability background, can elicit biased language. This study compares LLM responses to neurodivergent user personas with disclosed vs. masked neurodivergent identities. A dataset built from public Instagram comments was used to evaluate four open-source models on story generation, dialogue generation, and retrieval-augmented question answering. Our analyses show biases in users’ identity construction across all models and tasks. Binary classifiers trained on each model can distinguish between language generated for prompts with or without self-disclosures, with stronger biases linked to more explicit disclosures. Some models’ safety mechanisms result in denial-of-service behaviors. LLMs’ recipient design for neurodivergent identities relies on stereotypes tied to neurodivergence.
pdf
bib
abs
Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations
Galo Castillo-López
|
Gael de Chalendar
|
Nasredine Semmar
Intent recognition is a fundamental component in task-oriented dialogue systems (TODS). Determining user intents and detecting whether an intent is Out-of-Scope (OOS) is crucial for TODS to provide reliable responses. However, traditional TODS require large amounts of annotated data. In this work we propose a hybrid approach that combines BERT and LLMs in zero- and few-shot scenarios to recognize intents and detect OOS utterances. Our approach leverages LLMs’ generalization power and BERT’s computational efficiency in such scenarios. We evaluate our method on multi-party conversation corpora and observe that sharing information from BERT outputs to LLMs leads to system performance improvements.
pdf
bib
abs
Retrieving Relevant Knowledge Subgraphs for Task-Oriented Dialogue
Nicholas Thomas Walker
|
Pierre Lison
|
Laetitia Hilgendorf
|
Nicolas Wagner
|
Stefan Ultes
In this paper, we present an approach for extracting knowledge graph information for retrieval augmented generation in dialogue systems. Knowledge graphs are a rich source of background information, but the inclusion of more potentially useful information in a system prompt risks decreased model performance from excess context. We investigate a method of retrieving relevant subgraphs of maximum relevance and minimum size by framing this trade-off as a Prize-collecting Steiner Tree problem. The results of our user study and analysis indicate promising efficacy of a simple subgraph retrieval approach compared with a top-K retrieval model.
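For reference, the Prize-Collecting Steiner Tree trade-off the abstract frames can be written as selecting a subtree T of the knowledge graph G that maximizes collected node prizes (relevance) minus edge costs (size); the exact prize and cost definitions are the paper's own.
```latex
\max_{T = (V_T, E_T) \subseteq G,\; T \text{ a tree}} \;\; \sum_{v \in V_T} p(v) \;-\; \lambda \sum_{e \in E_T} c(e)
```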
pdf
bib
abs
Towards conversational assistants for health applications: using ChatGPT to generate conversations about heart failure
Anuja Tayal
|
Devika Salunke
|
Barbara Di Eugenio
|
Paula Allen-Meares
|
Eulalia Puig Abril
|
Olga Garcia-Bedoya
|
Carolyn Dickens
|
Andrew Boyd
We explore the potential of ChatGPT to generate conversations focused on self-care strategies for African-American patients with heart failure, a domain with limited specialized datasets. To simulate patient-health educator dialogues, we employed four prompting strategies: aspects, African American Vernacular English, Social Determinants of Health (SDOH), and SDOH-informed reasoning. Conversations were generated across key self-care aspects—food, exercise, and fluid intake—with varying turn lengths and incorporated patient-specific SDOH attributes such as age, gender, neighborhood, and socioeconomic status. Our findings show that effective prompt design is essential. While incorporating SDOH and reasoning improves dialogue quality, ChatGPT still lacks the empathy and engagement needed for meaningful healthcare communication.
pdf
bib
abs
Dialogue Scaffolding: Producing a Realistic Corpus of Human-Computer Open-Domain Dialogues Using a Spoken Dialogue System and ChatGPT
Kevin Bowden
|
Marilyn Walker
Researchers in dialogue interaction have had a long-term interest in multi-domain human-computer conversations and how they differ from human-human conversations. Recently, research on dialogue has begun to rely more and more on corpus-based training of neural conversational models, and conversational LLMs such as ChatGPT. However, existing large open-domain dialogue corpora do not accurately capture the characteristics of social human-computer dialogue. This paper addresses this gap by synthesizing a new corpus of 4000 long social dialogues on 200 user-model based topics that we call User-Centric SocialChat (UCSC). We create UCSC with a novel method called Dialogue Scaffolding, where a real dialogue system, that competed successfully in the Alexa Prize, interacts with ChatGPT to generate conversations. The Dialogue Scaffolding method ensures that the dialogues closely resemble the social chat genre of human-computer dialogues. We evaluate UCSC to ensure quality and safety, and we measure lexical diversity and topic consistency to show that the conversations are not repetitive and stay on topic. We evaluate the utility of UCSC by fine-tuning a compact dialogue-level model, PerQy-DLM, and showing that it outperforms competitive fine-tuned models like COSMO, Vicuna, and RedPajama-Chat-3B.
pdf
bib
abs
Multi-step or Direct: A Proactive Home-Assistant System Based on Commonsense Reasoning
Konosuke Yamasaki
|
Shohei Tanaka
|
Akishige Yuguchi
|
Seiya Kawano
|
Koichiro Yoshino
There is a growing expectation for the realization of proactive home-assistant robots that can assist users in their daily lives. It is essential to develop a framework that closely observes the user’s surrounding context, selectively extracts relevant information, and infers the user’s needs to proactively propose appropriate assistance. In this study, we first extend the Do-I-Demand dataset to define expected proactive assistance actions in domestic situations, where users make ambiguous utterances. These behaviors were defined based on common patterns of support that a majority of users would expect from a robot. We subsequently constructed a framework that infers users’ expected assistance actions from ambiguous utterances through commonsense reasoning. We explored two approaches: (1) multi-step reasoning using COMET as a commonsense reasoning engine, and (2) direct reasoning using large language models. Our experimental results suggest that both the multi-step and direct reasoning methods can successfully derive necessary assistance actions even when dealing with ambiguous user utterances.
pdf
bib
abs
Role of Reasoning in LLM Enjoyment Detection: Evaluation Across Conversational Levels for Human-Robot Interaction
Lubos Marcinek
|
Bahar Irfan
|
Gabriel Skantze
|
Andre Pereira
|
Joakim Gustafsson
User enjoyment is central to developing conversational AI systems that can recover from failures and maintain interest over time. However, existing approaches often struggle to detect subtle cues that reflect user experience. Large Language Models (LLMs) with reasoning capabilities have outperformed standard models on various other tasks, suggesting potential benefits for enjoyment detection. This study investigates whether models with reasoning capabilities outperform standard models when assessing enjoyment in a human-robot dialogue corpus at both turn and interaction levels. Results indicate that reasoning capabilities have complex, model-dependent effects rather than universal benefits. While performance was nearly identical at the interaction level (0.44 vs 0.43), reasoning models substantially outperformed at the turn level (0.42 vs 0.36). Notably, LLMs correlated better with users’ self-reported enjoyment metrics than human annotators, despite achieving lower accuracy against human consensus ratings. Analysis revealed distinctive error patterns: non-reasoning models showed bias toward positive ratings at the turn level, while both model types exhibited central tendency bias at the interaction level. These findings suggest that reasoning should be applied selectively based on model architecture and assessment context, with assessment granularity significantly influencing relative effectiveness.
pdf
bib
abs
Integrating Physiological, Speech, and Textual Information Toward Real-Time Recognition of Emotional Valence in Dialogue
Jingjing Jiang
|
Ao Guo
|
Ryuichiro Higashinaka
Accurately estimating users’ emotional states in real time is crucial for enabling dialogue systems to respond adaptively. While existing approaches primarily rely on verbal information, such as text and speech, these modalities are often unavailable in non-speaking situations. In such cases, non-verbal information, particularly physiological signals, becomes essential for understanding users’ emotional states. In this study, we aimed to develop a model for real-time recognition of users’ binary emotional valence (high-valence vs. low-valence) during conversations. Specifically, we utilized an existing Japanese multimodal dialogue dataset, which includes various physiological signals, namely electrodermal activity (EDA), blood volume pulse (BVP), photoplethysmography (PPG), and pupil diameter, along with speech and textual data. We classify the emotional valence of every 15-second segment of dialogue interaction by integrating such multimodal inputs. To this end, time-series embeddings of physiological signals are extracted using a self-supervised encoder, while speech and textual features are obtained from pre-trained Japanese HuBERT and BERT models, respectively. The modality-specific embeddings are integrated using a feature fusion mechanism for emotional valence recognition. Experimental results show that while each modality individually contributes to emotion recognition, the inclusion of physiological signals leads to a notable performance improvement, particularly in non-speaking or minimally verbal situations. These findings underscore the importance of physiological information for enhancing real-time valence recognition in dialogue systems, especially when verbal information is limited.
pdf
bib
abs
Prompt-based Language Generation for Complex Conversational Coaching Tasks across Languages
Alain Vazquez Risco
|
Maria Ines Torres
We investigate the role of prompt-based demonstrators in improving natural language generation for coaching-oriented dialogue systems in different languages. These systems present significant challenges due to their need for semantically accurate, goal-driven responses across diverse dialogue act taxonomies and languages. We define three types of prompt demonstrators, i.e., pairs of meaning representations and utterances, that differ in the degree of specification of the meaning representation. We then fine-tune pretrained language models separately for four very different languages and evaluate how the specificity of these demonstrators affects the quality of the generated sentences. Our experiments show that more specific prompts lead to more coherent and accurate outputs, particularly for low-resource languages and small models. Additionally, we observe promising zero-shot performance with larger models, showing the complementary value of prompts. These results demonstrate that simple prompting strategies, combined with fine-tuning, can significantly improve output quality in complex dialogue generation tasks across languages.
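To make the idea of demonstrators with increasing specificity concrete, a hypothetical sketch is shown below; the dialogue act and slot inventory, the coaching example, and the prompt layout are assumptions rather than the paper's actual taxonomy.

```python
# Hypothetical demonstrators whose meaning representations carry increasing detail.
demonstrators = {
    "low": ("inform",
            "You have been walking regularly this week, well done."),
    "medium": ("inform(topic=physical_activity)",
               "You have been walking regularly this week, well done."),
    "high": ("inform(topic=physical_activity, polarity=positive, reference=past_week)",
             "You have been walking regularly this week, well done."),
}

def build_prompt(level: str, target_mr: str) -> str:
    """Prepend one demonstrator pair before the target meaning representation."""
    mr, utterance = demonstrators[level]
    return (f"Meaning representation: {mr}\nUtterance: {utterance}\n\n"
            f"Meaning representation: {target_mr}\nUtterance:")

print(build_prompt("high", "request(topic=sleep_quality, reference=last_night)"))
```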
pdf
bib
abs
DocCHA: Towards LLM-Augmented Interactive Online Diagnosis System
Xinyi Liu
|
Dachun Sun
|
Yi Fung
|
Dilek Hakkani-Tur
|
Tarek F. Abdelzaher
Despite the impressive capabilities of Large Language Models (LLMs), existing Conversational Health Agents (CHAs) remain static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. This hinders their real-world applicability in clinical diagnosis, where iterative and structured dialogue is essential. We propose DocCHA, a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages: (1) symptom elicitation, (2) history acquisition, and (3) causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18% higher diagnostic accuracy and over 30% improvement in symptom recall, with only a modest increase in dialogue turns. These results demonstrate DocCHA’s effectiveness in enabling structured, transparent, and efficient diagnostic conversations—paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings.
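A minimal sketch of confidence-gated clarification, loosely in the spirit of the staged design described above, follows; the threshold value, turn budget, stage names, and the `ask_question`/`estimate_confidence` helpers are assumptions, not DocCHA's actual modules.

```python
# Sketch: each stage keeps asking clarifying questions until its confidence
# estimate clears a threshold or the turn budget runs out.
STAGES = ["symptom_elicitation", "history_acquisition", "causal_graph"]
CONF_THRESHOLD = 0.7
MAX_TURNS_PER_STAGE = 3

def run_stage(stage, state, ask_question, estimate_confidence):
    """Run one diagnostic stage with adaptive questioning."""
    for _ in range(MAX_TURNS_PER_STAGE):
        if estimate_confidence(stage, state) >= CONF_THRESHOLD:
            break
        state = ask_question(stage, state)     # one more clarifying turn
    return state

def run_dialogue(state, ask_question, estimate_confidence):
    for stage in STAGES:
        state = run_stage(stage, state, ask_question, estimate_confidence)
    return state
```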
pdf
bib
abs
Language Style Matching in Large Language Models
Noé Durandard
|
Saurabh Dhawan
|
Thierry Poibeau
Language Style Matching (LSM)—the subconscious alignment of linguistic style between conversational partners—is a key indicator of social coordination in human dialogue. We present the first systematic study of LSM in Large Language Models (LLMs) focusing on two primary objectives: measuring the degree of LSM exhibited in LLM-generated responses and developing techniques to enhance it. First, in order to measure whether LLMs natively show LSM, we computed LIWC-based LSM scores across diverse interaction scenarios and found that LSM scores for text generated by LLMs were either below or near the lower range of such scores observed in human dialogue. Second, we show that LLMs’ adaptive behavior in this regard can be improved using inference-time techniques. We introduce and evaluate an inference-time sampling strategy—Logit-Constrained Generation—which can substantially enhance LSM scores in text generated by an LLM while preserving fluency. By advancing our understanding of LSM in LLMs and proposing effective enhancement strategies, this research contributes to the development of more socially attuned and communicatively adaptive AI systems.
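For readers unfamiliar with the metric, a common formulation of the LIWC-based LSM score compares two speakers' usage rates of each function-word category and averages the per-category similarities. The tiny word lists below are stand-ins for the proprietary LIWC categories, so the snippet is a sketch of the metric rather than a faithful reimplementation.

```python
# LSM per category: 1 - |p1 - p2| / (p1 + p2 + eps), averaged across categories.
FUNCTION_WORD_CATEGORIES = {
    "articles":     {"a", "an", "the"},
    "pronouns":     {"i", "you", "we", "they", "he", "she", "it"},
    "prepositions": {"in", "on", "at", "of", "to", "with"},
    "conjunctions": {"and", "but", "or", "because"},
}

def usage_rates(text: str) -> dict:
    tokens = text.lower().split()
    total = max(len(tokens), 1)
    return {cat: sum(t in words for t in tokens) / total
            for cat, words in FUNCTION_WORD_CATEGORIES.items()}

def lsm_score(text_a: str, text_b: str) -> float:
    rates_a, rates_b = usage_rates(text_a), usage_rates(text_b)
    sims = [1 - abs(rates_a[c] - rates_b[c]) / (rates_a[c] + rates_b[c] + 1e-4)
            for c in FUNCTION_WORD_CATEGORIES]
    return sum(sims) / len(sims)

print(round(lsm_score("I went to the store with a friend",
                      "We walked to the park and sat on a bench"), 3))
```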
pdf
bib
abs
rrSDS 2.0: Incremental, Modular, Distributed, Multimodal Spoken Dialogue with Robotic Platforms
Anna Manaseryan
|
Porter Rigby
|
Brooke Matthews
|
Catherine Henry
|
Josue Torres-Fonseca
|
Ryan Whetten
|
Enoch Levandovsky
|
Casey Kennington
This demo will showcase updates made to the ‘robot-ready spoken dialogue system’ built on the Retico framework. Updates include new modules, logging and real-time monitoring tools, integrations with the CoppeliaSim virtual robot platform, integrations with a benchmark, improved documentation, and PyPI environment usage.
pdf
bib
abs
Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions
Christos Vlachos
|
Nikolaos Stylianou
|
Alexandra Fiotaki
|
Spiros Methenitis
|
Elisavet Palogiannidi
|
Themos Stafylakis
|
Ion Androutsopoulos
We consider open-retrieval conversational question answering (OR-CONVQA), an extension of question answering where system responses need to be (i) aware of dialog history and (ii) grounded in documents (or document fragments) retrieved per question. Domain-specific OR-CONVQA training datasets are crucial for real-world applications, but hard to obtain. We propose a pipeline that capitalizes on the abundance of plain text documents in organizations (e.g., product documentation) to automatically produce realistic OR-CONVQA dialogs with annotations. Similarly to real-world human-annotated OR-CONVQA datasets, we generate in-dialog question-answer pairs, self-contained (decontextualized, e.g., no referring expressions) versions of user questions, and propositions (sentences expressing prominent information from the documents) the system responses are grounded in. We show how the synthetic dialogs can be used to train efficient question rewriters that decontextualize user questions, allowing existing dialog-unaware retrievers to be utilized. The retrieved information and the decontextualized question are then passed on to an LLM that generates the system’s response.
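A minimal sketch of the inference-time flow described above, where a rewriter first decontextualizes the question so a dialog-unaware retriever can be used, might look as follows; the rewrite prompt, the `call_llm` callable, and the `retriever.search` interface are all assumptions.

```python
# Sketch: decontextualize the user question, retrieve with an off-the-shelf
# dialog-unaware retriever, then answer grounded in the retrieved passages.
def decontextualize(history: list[str], question: str, call_llm) -> str:
    prompt = ("Rewrite the last question so it can be understood without the "
              "dialogue history (resolve pronouns and other references).\n"
              "History:\n" + "\n".join(history) +
              f"\nQuestion: {question}\nRewritten:")
    return call_llm(prompt).strip()

def answer(history, question, call_llm, retriever, k=5):
    standalone = decontextualize(history, question, call_llm)
    passages = retriever.search(standalone, top_k=k)   # assumed retriever interface
    grounding = "\n".join(p.text for p in passages)
    return call_llm(f"Answer using only these passages:\n{grounding}\n"
                    f"Question: {standalone}\nAnswer:")
```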
pdf
bib
abs
DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities
Jing Yang JY Lee
|
Hamed Bonab
|
Nasser Zalmout
|
Ming Zeng
|
Sanket Lokegaonkar
|
Colin Lockard
|
Binxuan Huang
|
Ritesh Sarkhel
|
Haodong Wang
Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that incorporating DocTalk during pre-training results in up to a 40% gain in context memory and understanding, without compromising base performance. DocTalk is available at https://huggingface.co/datasets/AmazonScience/DocTalk.
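A rough sketch of a document-cluster-to-dialogue transformation in the spirit of the pipeline above is given below; the per-document question and answer prompts, the `call_llm` callable, and the simplification that each related document contributes one topic are assumptions, not the DocTalk recipe.

```python
# Sketch: turn a cluster of related documents into one multi-turn, multi-topic dialogue.
def cluster_to_dialogue(documents: list[str], call_llm) -> list[dict]:
    dialogue, context_so_far = [], ""
    for doc in documents:                      # each related document = one topic
        question = call_llm(
            "Given the conversation so far:\n" + context_so_far +
            "\nAsk a natural follow-up question answerable from this text:\n" + doc)
        answer = call_llm(
            "Answer the question using only this text.\n"
            f"Text:\n{doc}\nQuestion: {question}\nAnswer:")
        dialogue += [{"role": "user", "content": question},
                     {"role": "assistant", "content": answer}]
        context_so_far += f"\nQ: {question}\nA: {answer}"
    return dialogue
```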
pdf
bib
abs
Generating Diverse Personas for User Simulators to Test Interview Dialogue Systems
Mikio Nakano
|
Kazunori Komatani
|
Hironori Takeuchi
This paper addresses the issue of the significant labor required to test interview dialogue systems. While interview dialogue systems are expected to be useful in various scenarios, testing them with human users, as with other dialogue systems, requires significant effort and cost. Therefore, testing with user simulators can be beneficial. Since most conventional user simulators have been primarily designed for training task-oriented dialogue systems, little attention has been paid to the personas of the simulated users. During development, testing interview dialogue systems requires simulating a wide range of user behaviors, but manually creating a large number of personas is labor-intensive. We propose a method that automatically generates personas for user simulators using a large language model. Furthermore, by assigning personality traits related to communication styles when generating personas, we aim to increase the diversity of communication styles in the user simulator. Experimental results show that the proposed method enables the user simulator to generate utterances with greater variation.
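An illustrative sketch of trait-conditioned persona generation is shown below; the trait inventory, the uniform sampling scheme, the prompt wording, and the `call_llm` callable are assumptions rather than the paper's setup.

```python
# Sketch: sample communication-style traits, then ask an LLM to write a matching persona.
import random

COMMUNICATION_TRAITS = {
    "talkativeness": ["terse", "average", "talkative"],
    "formality":     ["casual", "neutral", "formal"],
    "initiative":    ["passive", "balanced", "proactive"],
}

def sample_traits() -> dict:
    return {name: random.choice(levels)
            for name, levels in COMMUNICATION_TRAITS.items()}

def generate_persona(call_llm) -> str:
    traits = sample_traits()
    trait_desc = ", ".join(f"{k}: {v}" for k, v in traits.items())
    return call_llm(
        "Write a short persona for a simulated user of an interview dialogue system. "
        f"The persona's communication style should be: {trait_desc}. "
        "Include name, age, occupation, and interests.")
```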
pdf
bib
abs
Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation
Ahmed Njifenjou
|
Virgile Sucal
|
Bassam Jabaian
|
Fabrice Lefèvre
The prevailing paradigm in the field of Open-Domain Dialogue (ODD) agents predominantly focuses on a few high-resource languages such as English or Chinese. Furthermore, the financial and temporal investment required to crowd-source such datasets in multiple languages is substantial. Fortunately, advancements in Large Language Models (LLMs), specifically instruction tuning, have enabled them to execute tasks based on natural language instructions. Additionally, these models can operate in various languages within a single thread. Consequently, to generate new data samples in different languages, we propose leveraging these capabilities to replicate the data collection process. We introduce a pipeline for generating ODD data in multiple target languages using LLMs, with demonstrations provided in a single source language. By eschewing explicit Machine Translation in this approach, we better preserve language-specific nuances and cultural specificity. We apply this methodology to the PersonaChat dataset. To further improve the openness of the generated dialogues and mimic real-life scenarios, we add the notion of speech events, corresponding to the type of conversation the speakers are involved in, and that of common ground, which represents the premises of a conversation.
pdf
bib
abs
Using LLMs to Grade Clinical Reasoning for Medical Students in Virtual Patient Dialogues
Jonathan Schiött
|
William Ivegren
|
Alexander Borg
|
Ioannis Parodis
|
Gabriel Skantze
This paper presents an evaluation of the use of large language models (LLMs) for grading clinical reasoning during rheumatology medical history virtual patient (VP) simulations. The study explores the feasibility of using state-of-the-art LLMs, including general-purpose models with various prompting strategies (zero-shot, analysis-first, and chain-of-thought prompting) as well as reasoning models. The performance of these models in grading transcribed dialogues from VP simulations conducted on a Furhat robot was evaluated against human expert annotations. Human experts initially achieved a 65% inter-rater agreement, which resulted in a pooled Cohen’s Kappa of 0.71 and 82.3% correctness. The best LLM, o3-mini, achieved a pooled Kappa of 0.68 and 81.5% correctness, with response times under 30 seconds, compared to approximately 6 minutes for human grading. These results suggest that automatic assessments can approach human reliability under controlled simulation conditions while delivering time and cost efficiencies.
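For readers who want to reproduce this kind of agreement figure on their own grading data, the snippet below shows how a Cohen's Kappa and a correctness score could be computed for one rubric item; the grade labels and values are toy examples, not the study's data.

```python
# Agreement between LLM grades and a human consensus for one rubric item.
from sklearn.metrics import cohen_kappa_score, accuracy_score

human_consensus = ["correct", "partial", "correct", "missed", "correct", "partial"]
llm_grades      = ["correct", "correct", "correct", "missed", "correct", "partial"]

kappa = cohen_kappa_score(human_consensus, llm_grades)
acc = accuracy_score(human_consensus, llm_grades)
print(f"Cohen's Kappa = {kappa:.2f}, correctness = {acc:.1%}")
```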
pdf
bib
abs
Task Proficiency-Aware Dialogue Analysis in a Real-Time Cooking Game Environment
Kaito Nakae
|
Michimasa Inaba
Real-time collaborative dialogue tasks require dynamic, instantaneous decision-making and seamless coordination between participants, yet most existing studies on cooperative dialogues primarily focus on turn-based textual environments. This study addresses the critical gap in understanding human-human interaction patterns within dynamic, real-time collaborative scenarios. In this paper, we present a novel dataset collected from a real-time collaborative cooking game environment inspired by the popular game “Overcooked.” Our dataset comprises detailed annotations of participants’ task proficiency levels, game scores, game action logs, and transcribed voice dialogues annotated with dialogue act tags. Participants exhibited a broad range of gaming experience, from highly proficient players to those with minimal exposure to gaming controls. Through comprehensive analysis, we explore how individual differences in task proficiency influence dialogue patterns and collaborative outcomes. Our findings reveal key dialogue acts and adaptive communication strategies crucial for successful real-time collaboration. Furthermore, this study provides valuable insights into designing adaptive dialogue systems capable of dynamically adjusting interaction strategies based on user proficiency, paving the way for more effective human-AI collaborative systems. The dataset introduced in this study is publicly available at: https://github.com/UEC-InabaLab/OverCookedChat.
pdf
bib
abs
Collaborative Problem-Solving in an Optimization Game
Isidora Jeknic
|
Alex Duchnowski
|
Alexander Koller
Dialogue agents that support human users in solving complex tasks have received much attention recently. Many such tasks are NP-hard optimization problems that require careful collaborative exploration of the solution space. We introduce a novel dialogue game in which the agents collaboratively solve a two-player Traveling Salesman problem, along with an agent that combines LLM prompting with symbolic mechanisms for memory, state tracking and problem-solving. Our best agent solves 45% of games optimally in self-play. It also demonstrates an ability to collaborate successfully with human users and generalize to unfamiliar graphs.
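Judging whether a game was "solved optimally" requires knowing the optimal tour, which is feasible by brute force on small instances. The sketch below is a generic utility of that kind; the distance-matrix representation of the game's graphs is an assumption.

```python
# Brute-force the optimal tour on a small complete graph and check whether the
# agents' joint tour matches its length.
from itertools import permutations

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def optimal_tour_length(dist):
    n = len(dist)
    cities = range(1, n)                     # fix city 0 as the start
    return min(tour_length((0, *perm), dist) for perm in permutations(cities))

dist = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
agent_tour = (0, 1, 3, 2)
print(tour_length(agent_tour, dist) == optimal_tour_length(dist))  # True: solved optimally
```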
pdf
bib
abs
Evaluating Large Language Models for Enhancing Live Chat Therapy: A Comparative Study with Psychotherapists
Neha Pravin Deshpande
|
Stefan Hillmann
|
Sebastian Möller
Large Language Models (LLMs) hold promise for addressing the shortage of qualified therapists in mental health care. While chatbot-based Cognitive Behavioral Therapy (CBT) tools exist, their efficacy in sensitive contexts remains underexplored. This study examines the potential of LLMs to support therapy sessions aimed at reducing Child Sexual Abuse Material (CSAM) consumption. We propose a Retrieval-Augmented Generation (RAG) framework that leverages a fine-tuned BERT-based retriever to guide LLM-generated responses, better capturing the multi-turn, context-specific dynamics of therapy. Four LLMs—Qwen2-7B-Instruct, Mistral-7B-Instruct-v0.3, Orca-2-13B, and Zephyr-7B-Alpha—were evaluated in a small-scale study with 14 domain-expert psychotherapists. Our comparative analysis reveals that, in certain scenarios, responses from LLMs such as Mistral-7B-Instruct-v0.3 and Orca-2-13B were preferred over those of human therapists. While limited by sample size, these findings suggest that LLMs can perform at a level comparable to or even exceeding that of human therapists, especially in therapy focused on reducing CSAM consumption. Our code is available online: https://git.tu-berlin.de/neha.deshpande/therapy_responses/-/tree/main