Prediction of User Emotion and Dialogue Success Using Audio Spectrograms and Convolutional Neural Networks

Athanasios Lykartsis, Margarita Kotti


Abstract
In this paper we aim to predict dialogue success and user satisfaction, as well as user emotion, at the turn level. To achieve this, we investigate the use of spectrogram representations, extracted from audio files, in combination with several types of convolutional neural networks. The experiments were performed on the Let’s Go V2 database, which comprises 5065 audio files labeled for subjective and objective dialogue turn success as well as for the emotional state of the user. Results show that, using only audio, it is possible to predict turn success with very high accuracy (90%) for all three labels. The best-performing input representation was the 1 s long mel-spectrogram, combined with a CNN with a bottleneck architecture. The resulting system has the potential to be used in real time. Our results significantly surpass the state of the art for dialogue success prediction based only on audio.
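
The abstract names 1 s long mel-spectrograms as the input representation but gives no extraction parameters. The following is a minimal NumPy sketch of how such inputs could be computed, not the authors' pipeline. The 8 kHz sample rate, 512-point FFT, hop of 256 samples, 40 mel bands, non-overlapping segments, and natural-log compression are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style mel formula (one of several conventions in use).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_segments(signal, sr, seg_seconds=1.0, n_fft=512, hop=256, n_mels=40):
    """Cut `signal` into non-overlapping segments of `seg_seconds` and return
    log-mel spectrograms of shape (n_segments, n_mels, n_frames).
    Trailing samples that do not fill a whole segment are dropped."""
    seg_len = int(round(seg_seconds * sr))
    n_segments = len(signal) // seg_len
    n_frames = 1 + (seg_len - n_fft) // hop
    fb = mel_filterbank(sr, n_fft, n_mels)
    window = np.hanning(n_fft)
    out = np.empty((n_segments, n_mels, n_frames))
    for s in range(n_segments):
        seg = signal[s * seg_len:(s + 1) * seg_len]
        frames = np.stack([seg[t * hop:t * hop + n_fft] * window
                           for t in range(n_frames)])
        power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
        out[s] = np.log(fb @ power.T + 1e-10)             # small offset avoids log(0)
    return out

# Usage on a synthetic 2.5 s tone (assumed 8 kHz sample rate).
sr = 8000
t = np.arange(int(2.5 * sr)) / sr
feats = log_mel_segments(np.sin(2 * np.pi * 440.0 * t), sr)
print(feats.shape)  # (2, 40, 30): two 1 s segments, 40 mel bands, 30 frames
```

Each `(n_mels, n_frames)` slice could then serve as a single-channel image input to a convolutional network. The network itself is not sketched here because the abstract does not describe its layers.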
Anthology ID:
W19-5939
Volume:
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue
Month:
September
Year:
2019
Address:
Stockholm, Sweden
Venue:
SIGDIAL
SIG:
SIGDIAL
Publisher:
Association for Computational Linguistics
Pages:
336–344
URL:
https://aclanthology.org/W19-5939
DOI:
10.18653/v1/W19-5939
Cite (ACL):
Athanasios Lykartsis and Margarita Kotti. 2019. Prediction of User Emotion and Dialogue Success Using Audio Spectrograms and Convolutional Neural Networks. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 336–344, Stockholm, Sweden. Association for Computational Linguistics.
Cite (Informal):
Prediction of User Emotion and Dialogue Success Using Audio Spectrograms and Convolutional Neural Networks (Lykartsis & Kotti, SIGDIAL 2019)
PDF:
https://aclanthology.org/W19-5939.pdf