Dialogue & Discourse (2025)



Volume 16

The goal of this special issue is to show the challenges faced in reliably annotating abstract semantic and pragmatic information at both the sentence and discourse levels, and how those challenges are being met. Such information is frequently not explicitly or unambiguously marked in natural language. It is usually dependent on contextual information, and annotators often have to reconstruct complex relations and situations from the context.
This article describes how participants in online discussion forums manage claimed non-understanding of word meaning, specifically when one participant displays insufficient understanding by requesting meta-linguistic clarification. Claimed non-understanding refers to cases where a participant signals a lack of understanding in a way that invites repair. By engaging in a short word meaning negotiation sequence, the participants collaboratively repair the issue of claimed non-understanding and can move on with the on-topic discussion. In some cases, however, participants behave in ways that break the normative pattern of interaction and do not enter into the anticipated sequence of repair dealing with the lack of understanding. The analysis of these deviant cases reveals the participants’ own normative orientations in repair of claimed non-understanding of word meaning, and thus provides evidence that there is an underlying organization of repair dealing with such issues in online discussion forum interaction.
We examined how the phrases I don’t know, I dunno, and idk are used in spontaneously produced speech and writing. We compared their functions to those of the related phrases totally, absolutely, sorta, and kinda. We assessed usage across modalities (face to face, instant messaging, audiovisual), goals (tasks versus casual chat), and relationships (friends versus strangers). We also assessed where the phenomena occurred in a sentence, what words co-occurred with the phenomena, and what functions the phenomena served in the conversations. Communicators use these phenomena differently depending on modality, goals, and relationships. We found that I don’t know was used more often when people could access cues beyond the voice, and that both I don’t know and I dunno can perform a variety of pragmatic functions. In instant messaging, I don’t know has been lexicalized to idk, but idk does not have as many pragmatic functions as I don’t know and I dunno.
This paper investigates proactivity, a characteristic phenomenon of collaborative human-human interaction, where a participant in the dialogue offers the addressee useful information that was not explicitly requested. More precisely, a proactive behaviour is: (i) self-prompted and not simply reactive, that is, the speaker does not act merely in response to the requests the other participant has made; (ii) somehow effective for the achievement of the dialogue goal, since the speaker has a long-term, goal-directed behaviour that predicts future states and needs. Proactivity has been poorly investigated from a theoretical point of view, and there is a general need for empirical data for both quantitative and qualitative research. The paper provides an extensive analysis of proactivity in several human-human task-oriented dialogue corpora with different characteristics, spanning chat exchanges and telephone calls, collection modalities such as natural settings and Wizard of Oz, and two languages, Italian and English. The main result is the D-Pro Corpus, a new resource manually annotated at the utterance level with proactivity and dialogue acts, which makes it possible to investigate proactivity in the context of task-oriented dialogues. There are several findings from our empirical investigation of proactivity: (i) we find that about 20% of turns in our corpus are proactive turns, showing that this is a widespread and relevant phenomenon; (ii) we confirm the non-reactive nature of proactivity, highlighting the presence of a pattern where a turn in the dialogue triggers a reaction in a following turn and a proactive utterance is then added to that turn; (iii) we show that only a limited number of dialogue acts are actually involved in expressing proactivity, and we discuss the theoretical implications of this finding; (iv) we empirically confirm that proactivity has a crucial role in recovering from goal-failure situations, contributing to the effectiveness of the whole dialogue; (v) we support the intuition of a non-uniform distribution of proactive utterances throughout the dialogue. Our empirical findings and the D-Pro Corpus provide relevant insights for deeper theoretical investigations, as well as crucial resources for improving proactivity in current task-oriented dialogue systems.
This study investigates the interaction of German modal particles like ja and doch with discourse structure. We conduct an acceptability study of modal particles in four discourse relations (CIRCUMSTANCE, CONDITION, EVIDENCE, JUSTIFY) to test predictions of (in)compatibilities derived from a corpus study by Döring and Repp (2019). As ratings for sentences representing the discourse relations CIRCUMSTANCE and CONDITION were significantly lower than for the two causal relations if presented with a modal particle, we confirm that modal particles and discourse structure interact. In a forced-choice study testing the particle ja’s effect on relation disambiguation, we show that ja supports a causal interpretation of an ambiguous context in the absence of explicit discourse markers. Our findings contribute to delineating the role of German modal particles in discourse, as we show that there is an interaction between discourse relations and modal particles, meaning that readers do not accept all modal particles in every discourse relation, and at least the modal particle ja serves as a non-connective discourse signal for causal relations.
This paper explores the semantic connection between mixed quotation and name-informing quotation, proposing a unified account for both. While mixed quotation combines direct quotation with indirect reporting, name-informing quotation highlights the linguistic shape of a concept’s conventionalized name. We argue that both types of quotation involve a naming predicate – explicit in name-informing quotation and covert in mixed quotation. A pilot questionnaire study, which presented participants with two-turn dialogue contexts, used the notion of at-issueness to probe the naming component across these types, supporting the hypothesis that both share a similar semantic structure. This unified approach contributes to a broader understanding of quotational constructions and their role in linguistic and discourse representation.
This editorial introduces the special issue on Embodied Conversational Systems in Human–Robot Interaction.
Knowledge graphs are often used to represent structured information in a flexible and efficient manner, but their use in situated dialogue remains under-explored. This paper presents a novel conversational model for human–robot interaction that rests upon a graph-based representation of the dialogue state. The knowledge graph representing the dialogue state is continuously updated with new observations from the robot sensors, including linguistic, situated and multimodal inputs, and is further enriched by other modules, in particular for spatial understanding. The neural conversational model employed to respond to user utterances relies on a simple but effective graph-to-text mechanism that traverses the dialogue state graph and converts the traversals into a natural language form. This conversion of the state graph into text is performed using a set of parameterized functions, and the values for those parameters are optimized based on a small set of Wizard-of-Oz interactions. After this conversion, the text representation of the dialogue state graph is included as part of the prompt of a large language model used to decode the agent response. The proposed approach is empirically evaluated through a user study with a humanoid robot acting as conversation partner, assessing the impact of the graph-to-text mechanism on response generation. After the robot was moved along a tour of an indoor environment, participants interacted with it using spoken dialogue and evaluated how well the robot was able to answer questions about what it had observed during the tour. User scores suggest an improvement in the perceived factuality of the robot responses when the graph-to-text approach is employed compared to a baseline using inputs structured as semantic triples.
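The graph-to-text step can be pictured with a minimal sketch, assuming a dialogue state stored as subject–relation–object triples and a per-relation verbalisation template; the relations, templates, and entity names below are illustrative placeholders rather than the paper's implementation, and the optimisation of the verbalisation parameters against Wizard-of-Oz data is omitted.

# Minimal sketch of a graph-to-text step for a dialogue-state knowledge graph.
# Relations, templates, and entities are illustrative, not the paper's actual setup.

from dataclasses import dataclass

@dataclass
class Triple:
    subject: str
    relation: str
    obj: str

# Toy dialogue-state graph built from (hypothetical) robot observations.
STATE = [
    Triple("robot", "located_in", "kitchen"),
    Triple("kitchen", "contains", "red mug"),
    Triple("user", "asked_about", "red mug"),
]

# One verbalisation template per relation; in the paper the comparable choices
# are tuned on a small set of Wizard-of-Oz interactions.
TEMPLATES = {
    "located_in": "The {s} is in the {o}.",
    "contains": "The {s} contains a {o}.",
    "asked_about": "The {s} asked about the {o}.",
}

def graph_to_text(triples, max_triples=10):
    """Traverse the state graph and render each triple as a sentence."""
    sentences = []
    for t in triples[:max_triples]:
        template = TEMPLATES.get(t.relation, "{s} {r} {o}.")
        sentences.append(template.format(s=t.subject, r=t.relation, o=t.obj))
    return " ".join(sentences)

def build_prompt(state_text, user_utterance):
    """Prepend the verbalised state to the user's question for the LLM."""
    return (
        "Dialogue state:\n" + state_text + "\n\n"
        "User: " + user_utterance + "\nRobot:"
    )

if __name__ == "__main__":
    state_text = graph_to_text(STATE)
    print(build_prompt(state_text, "Where did you see the red mug?"))

In this framing, the parameters mentioned in the abstract would govern choices such as which relations are verbalised and how many triples are included before the prompt is assembled for the language model.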
Studies on human-robot interaction as well as on embodied conversational agents have revealed that the use of laughter by agents increases their perceived naturalness and their social presence. However, laughter plays a variety of functions in human interaction, and its effects on communication go beyond those previously investigated in the aforementioned fields. Taking into account that laughter has been shown to improve task performance in human-human interaction, we investigated here whether laughter use by a virtual agent also increases task success in human-machine interaction. A real-estate scenario was considered, in which an agent presented an apartment to an interested client. Both the presence of laughter and the nature of the agent (virtual or human) were varied in the experiment. We operationalized task success as the likelihood of participants recommending the apartment, while also examining the perceived rating of the agent. The results of an observer study showed that the use of laughter by a virtual agent results in increased task success, while also confirming previous findings regarding improvements in the social perception of the agent. Our results concerning task success in the human agent condition were not in line with those of previous studies, most likely due to a reduced naturalness of the laughter used. This makes the findings pertaining to the virtual agent, where benefits were observed from the use of laughter in interaction, even more salient, suggesting that humans are less sensitive to reduced laughter naturalness in that case. We further discuss the need for better integration of laughter with speech, as well as its automatic synthesis, in order to better take advantage of these findings.
Efforts towards endowing robots with the ability to speak have benefited from recent advancements in natural language processing, in particular large language models. However, current language models are not fully incremental: their processing is inherently monotonic, and they thus lack the ability to revise their interpretations or output in light of newer observations. This monotonicity has important implications for the development of dialogue systems for human–robot interaction. In this paper, we review the literature on interactive systems that operate incrementally (i.e., at the word level or below it). We motivate the need for incremental systems and survey incremental modeling of important aspects of dialogue, such as speech recognition and language generation, with a primary focus on the part of the system that makes decisions, known as the dialogue manager. We find that there is very little research on incremental dialogue management, offer some requirements for practical incremental dialogue management, and discuss the implications of incremental dialogue for embodied, robotic platforms in the age of large language models.
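The non-monotonic, word-by-word processing the survey argues for can be illustrated with a minimal sketch in the style of incremental-unit frameworks, where a downstream consumer receives add and revoke updates and must revise earlier partial decisions; the class and method names are illustrative assumptions, not drawn from any particular surveyed system.

# Minimal sketch of incremental, revisable processing: a consumer receives ADD
# and REVOKE updates word by word and revises its partial input accordingly.
# All names are illustrative.

class IncrementalConsumer:
    def __init__(self):
        self.hypothesis = []          # current partial word sequence

    def add(self, word):
        """A new word hypothesis arrives from the recogniser."""
        self.hypothesis.append(word)
        self.decide()

    def revoke(self):
        """The recogniser retracts its most recent word hypothesis."""
        if self.hypothesis:
            self.hypothesis.pop()
        self.decide()

    def decide(self):
        """Placeholder dialogue-management decision over the partial input."""
        print("current input:", " ".join(self.hypothesis))

if __name__ == "__main__":
    dm = IncrementalConsumer()
    dm.add("take")        # partial input: "take"
    dm.add("the")         # partial input: "take the"
    dm.add("red")         # the recogniser later revises "red" ...
    dm.revoke()           # ... retracting it ...
    dm.add("blue")        # ... and replacing it with "blue"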
How can flexibility and control over the interpretation of multimodal signals by embodied agents be balanced? Flexibility means that agents respond fluently in any context, whereas control means that responses are transparent and faithful to goals and principles that are explicitly defined. This paper describes a modular platform for creating multimodal interactive agents, built around an event bus on which signals and interpretations are posted as a time-ordered sequence, while also providing control options to drive the interaction given specific intentions and goals. Different sensors and interpretation components can be integrated by defining their input and output topics on the event bus, which results in an open multimodal sequence-driven workflow for further interpretations. In addition, our platform allows us to define higher-level intents that control sequence patterns to achieve a goal. A key component is an episodic Knowledge Graph (eKG) that acts as a long-term symbolic memory to aggregate and connect these interpretations. This eKG establishes coherence and continuity across different interactions. Intents and the eKG make it possible to define different (embodied) agents and compare their behavior without having to implement complex software components for multimodal sensor data and design the control over their dependencies. In this paper, we explain the broad range of components that we developed and integrated into various interactive agents. We also explain how the interaction is recorded as multimodal data and how it results in an aggregated memory in the eKG. By analyzing the recorded interactions, we can compare agents and agent components and study their interactive behavior with people and other agents.
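A minimal sketch of the topic-based event bus idea follows, assuming components that subscribe to input topics and publish interpretations to output topics while every event is also logged as a time-ordered record; the topic names, components, and the stand-in for the eKG are illustrative placeholders, not the platform's actual interfaces.

# Minimal sketch of a topic-based event bus: components subscribe to input
# topics and post interpretations to output topics. Names are illustrative.

from collections import defaultdict
from datetime import datetime, timezone

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)
        self.log = []                       # time-ordered record of all signals

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        event = {"topic": topic, "payload": payload,
                 "time": datetime.now(timezone.utc).isoformat()}
        self.log.append(event)              # the recorded multimodal sequence
        for handler in self.subscribers[topic]:
            handler(event)

def asr_component(bus):
    # Consumes raw audio events and posts a transcript on its output topic.
    def handle(event):
        bus.publish("text.transcript", {"text": "hello robot"})
    return handle

def memory_component(event):
    # Stand-in for the episodic Knowledge Graph: aggregates interpretations.
    print("store in eKG:", event["payload"])

if __name__ == "__main__":
    bus = EventBus()
    bus.subscribe("audio.raw", asr_component(bus))
    bus.subscribe("text.transcript", memory_component)
    bus.publish("audio.raw", {"samples": b"\x00\x01"})

Under this reading, adding a new sensor or interpreter amounts to declaring its input and output topics, while intents would constrain which sequence patterns on the bus the agent acts upon.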
This study addresses the issue of what a Retrieval-Augmented Generation (RAG) chatbot should remember and what it should forget, based on findings from psychology. RAG retrieves relevant memories from past interactions to generate responses, and its effectiveness has been demonstrated. As conversations continue, however, the amount of stored memory keeps growing, which not only requires large storage capacity but also risks retaining unnecessary information, potentially reducing retrieval efficiency. To tackle this problem, we propose LUFY (Long-term Understanding and identiFYing key exchanges), a RAG chatbot that evaluates six distinct memory-related metrics derived from psychological models and real-world data. Instead of simply summing these metrics, it uses learned weights to account for the importance of each one. By using these weighted scores, the system can prioritize and retain relevant memories while gradually forgetting less important ones during both retrieval and memory management. To evaluate the effectiveness of LUFY in long-term conversations, we conducted experiments with human participants, who engaged in text-based conversations with three types of chatbots, each using different forgetting mechanisms, for at least two hours. The length of these conversations was more than 4.5 times longer than the longest conversations reported in previous studies. The results showed that prioritizing emotionally engaging memories while forgetting most of the conversation significantly enhanced user satisfaction.
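The weighted-score idea behind LUFY can be sketched as follows, assuming each stored exchange carries several memory-related metric values in [0, 1] that are combined with a learned weight vector, with low-scoring exchanges forgotten first; the metric names, weights, and retention budget below are illustrative assumptions, not the paper's actual metrics or learned values.

# Minimal sketch of weighted memory scoring and pruning for a RAG chatbot.
# Metric names, weights, and the retention budget are illustrative.

from dataclasses import dataclass, field

METRICS = ["arousal", "recency", "frequency", "surprise", "relevance", "user_interest"]

@dataclass
class Exchange:
    text: str
    metrics: dict = field(default_factory=dict)   # metric name -> value in [0, 1]

def score(exchange, weights):
    """Weighted combination of the per-exchange memory metrics."""
    return sum(weights[m] * exchange.metrics.get(m, 0.0) for m in METRICS)

def prune_memory(memory, weights, keep=2):
    """Retain the highest-scoring exchanges; forget the rest."""
    ranked = sorted(memory, key=lambda e: score(e, weights), reverse=True)
    return ranked[:keep]

if __name__ == "__main__":
    weights = {m: 1.0 / len(METRICS) for m in METRICS}   # placeholder, not learned
    memory = [
        Exchange("talked about the user's new puppy",
                 {"arousal": 0.9, "relevance": 0.8, "user_interest": 0.9}),
        Exchange("small talk about the weather",
                 {"arousal": 0.1, "relevance": 0.2}),
        Exchange("user mentioned an upcoming job interview",
                 {"arousal": 0.7, "surprise": 0.6, "user_interest": 0.8}),
    ]
    for e in prune_memory(memory, weights):
        print("kept:", e.text)

The same scores could also be used at retrieval time, so that emotionally engaging exchanges are both retained longer and ranked higher when the chatbot searches its memory.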
Elementary Discourse Units (EDUs) constitute the interface between language grammar and language use. On the one hand, they result from compositional semantic processes that combine individual word meanings into proposition-level representations. On the other hand, EDUs form the building blocks of most text, discourse, and dialogue frameworks. In written genres, where punctuation is available and reliable, segmenting EDUs is sometimes seen as a nearly solved problem, at least for high-resource languages. However, this is not the case for spontaneous speech transcripts. In this paper, we use a sizeable (8-hour) French corpus, manually segmented into EDUs, to evaluate several large language model (LLM)-based approaches for this task. We compare various fine-tuning strategies, including those relying on weakly supervised labels, in relation to the amount of "gold" manual annotations available. We also experiment with in-context learning, where example instances are provided to condition a generative model (few-shot learning), and with a purely generative approach (zero-shot). Our findings indicate that classical fine-tuning is still the most effective approach, requiring only a reasonable amount of gold-annotated data to achieve the best performance in our experiments. Beyond traditional quantitative evaluation, we conducted a systematic qualitative analysis, identifying directions for further improvement. These include integrating prosodic information and handling pauses when they co-occur with disfluencies or complex uses of discourse markers. Finally, we argue for the significance of this task and the resulting units, compared to acoustic and syntactic proxies, especially for quantitative linguistics focusing on spontaneous speech.
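A common way to set up this kind of fine-tuning is to cast EDU segmentation as token-level boundary labelling, which the following minimal sketch illustrates on a toy transcript; the label scheme (1 marks an EDU-initial token) and the example are illustrative assumptions, not the paper's exact formulation.

# Minimal sketch of EDU segmentation framed as token-level boundary labelling.
# The toy transcript and label scheme are illustrative.

def boundaries_to_labels(tokens, edu_starts):
    """Turn a set of EDU-initial token indices into per-token labels."""
    return [1 if i in edu_starts else 0 for i in range(len(tokens))]

def labels_to_edus(tokens, labels):
    """Recover EDU spans from per-token boundary labels."""
    edus, current = [], []
    for token, label in zip(tokens, labels):
        if label == 1 and current:
            edus.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        edus.append(" ".join(current))
    return edus

if __name__ == "__main__":
    # Spontaneous-speech transcript: no punctuation to rely on.
    tokens = "well I went to the market and then it started raining".split()
    gold_starts = {0, 6}        # "well I went to the market" | "and then it started raining"
    labels = boundaries_to_labels(tokens, gold_starts)
    print(labels)
    print(labels_to_edus(tokens, labels))

Under this framing, fine-tuning trains an encoder to predict the per-token boundary labels, while the few-shot and zero-shot settings instead ask a generative model to produce the segmented text directly.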