2025
Integrating Physiological, Speech, and Textual Information Toward Real-Time Recognition of Emotional Valence in Dialogue
Jingjing Jiang | Ao Guo | Ryuichiro Higashinaka
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Accurately estimating users’ emotional states in real time is crucial for enabling dialogue systems to respond adaptively. While existing approaches primarily rely on verbal information, such as text and speech, these modalities are often unavailable in non-speaking situations. In such cases, non-verbal information, particularly physiological signals, becomes essential for understanding users’ emotional states. In this study, we aimed to develop a model for real-time recognition of users’ binary emotional valence (high-valence vs. low-valence) during conversations. Specifically, we utilized an existing Japanese multimodal dialogue dataset, which includes various physiological signals, namely electrodermal activity (EDA), blood volume pulse (BVP), photoplethysmography (PPG), and pupil diameter, along with speech and textual data. We classified the emotional valence of every 15-second segment of dialogue interaction by integrating these multimodal inputs. To this end, time-series embeddings of physiological signals were extracted using a self-supervised encoder, while speech and textual features were obtained from pre-trained Japanese HuBERT and BERT models, respectively. The modality-specific embeddings were integrated using a feature fusion mechanism for emotional valence recognition. Experimental results show that while each modality individually contributes to emotion recognition, the inclusion of physiological signals leads to a notable performance improvement, particularly in non-speaking or minimally verbal situations. These findings underscore the importance of physiological information for enhancing real-time valence recognition in dialogue systems, especially when verbal information is limited.
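The fusion architecture described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes PyTorch, pre-computed segment-level embeddings per modality, and hypothetical module and dimension names (e.g., FusionValenceClassifier, physio_dim) chosen only for illustration of concatenation-based feature fusion over physiological, HuBERT, and BERT embeddings.

```python
# Minimal late-fusion sketch (illustrative only; names and dimensions are
# placeholders, not taken from the paper).
import torch
import torch.nn as nn

class FusionValenceClassifier(nn.Module):
    """Binary valence classifier over 15-second dialogue segments."""

    def __init__(self, physio_dim=128, speech_dim=768, text_dim=768, hidden_dim=256):
        super().__init__()
        # Project each modality embedding (physiological encoder, HuBERT, BERT)
        # into a shared space before fusion.
        self.physio_proj = nn.Linear(physio_dim, hidden_dim)
        self.speech_proj = nn.Linear(speech_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 3, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # high-valence vs. low-valence
        )

    def forward(self, physio_emb, speech_emb, text_emb):
        # Concatenate the projected modality embeddings and classify.
        fused = torch.cat(
            [self.physio_proj(physio_emb),
             self.speech_proj(speech_emb),
             self.text_proj(text_emb)],
            dim=-1,
        )
        return self.classifier(fused)

# Example: one batch of four pre-computed segment-level embeddings.
logits = FusionValenceClassifier()(
    torch.randn(4, 128), torch.randn(4, 768), torch.randn(4, 768)
)
print(logits.shape)  # torch.Size([4, 2])
```

Simple concatenation is only one possible fusion mechanism; the paper's actual fusion strategy and embedding dimensions may differ.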
Towards Human-Like Dialogue Systems: Integrating Multimodal Emotion Recognition and Non-Verbal Cue Generation
Jingjing Jiang
Proceedings of the 21st Workshop of Young Researchers' Roundtable on Spoken Dialogue Systems
This position paper outlines my research vision for developing human-like dialogue systems capable of both perceiving and expressing emotions through multimodal communication. My current research focuses on two main areas: multimodal emotion recognition and non-verbal cue generation. For emotion recognition, I constructed a Japanese multimodal dialogue dataset that captures natural, dyadic face-to-face interactions and developed an emotional valence recognition model that integrates textual, speech, and physiological inputs. On the generation side, my research explores non-verbal cue generation for embodied conversational agents (ECAs). Finally, the paper discusses the future of spoken dialogue systems (SDSs), emphasizing the shift from traditional turn-based architectures to full-duplex, real-time, multimodal systems.
2024
Estimating the Emotional Valence of Interlocutors Using Heterogeneous Sensors in Human-Human Dialogue
Jingjing Jiang | Ao Guo | Ryuichiro Higashinaka
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Dialogue systems need to accurately understand the user’s mental state to generate appropriate responses, but accurately discerning such states solely from text or speech can be challenging. To determine which information is necessary, we first collected human-human multimodal dialogues using heterogeneous sensors, resulting in a dataset containing various types of information including speech, video, physiological signals, gaze, and body movement. Additionally, for each time step of the data, users provided subjective evaluations of their emotional valence while reviewing the dialogue videos. Using this dataset and focusing on physiological signals, we analyzed the relationship between the signals and the subjective evaluations through Granger causality analysis. We also investigated how sensor signals differ depending on the polarity of the valence. Our findings revealed several physiological signals related to the user’s emotional valence.
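The Granger causality analysis mentioned above can be sketched as follows. This is a toy illustration, not the paper's analysis: it assumes the statsmodels implementation of the test, and the series names and synthetic data are placeholders standing in for a physiological signal and the self-reported valence ratings.

```python
# Toy Granger causality sketch (assumes statsmodels; data are synthetic
# placeholders, not the paper's dataset).
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
eda = rng.normal(size=200)                                     # stand-in EDA series
valence = np.roll(eda, 2) + rng.normal(scale=0.5, size=200)    # toy lag-2 dependence

# Column order is [effect, cause]: tests whether EDA Granger-causes valence.
data = np.column_stack([valence, eda])
results = grangercausalitytests(data, maxlag=4)

for lag, (tests, _) in results.items():
    print(f"lag={lag}, ssr F-test p-value={tests['ssr_ftest'][1]:.4f}")
```

A small p-value at some lag would indicate that past values of the physiological signal help predict the valence ratings beyond the ratings' own history, which is the sense of "causality" the test captures.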
Towards a Real-Time Multimodal Emotion Estimation Model for Dialogue Systems
Jingjing Jiang
Proceedings of the 20th Workshop of Young Researchers' Roundtable on Spoken Dialogue Systems
This position paper presents my research interest in building human-like, chat-oriented dialogue systems. To this end, my work focuses on two main areas: the construction and utilization of multimodal datasets and real-time multimodal affective computing. I discuss the limitations of current multimodal dialogue corpora and multimodal affective computing models. As a solution, I have constructed a human-human dialogue dataset containing various synchronized multimodal information and have conducted preliminary analyses on it. In future work, I will further analyze the collected data and build a real-time multimodal emotion estimation model for dialogue systems.