Sven Magg


2020

pdf bib
EDA: Enriching Emotional Dialogue Acts using an Ensemble of Neural Annotators
Chandrakant Bothe | Cornelius Weber | Sven Magg | Stefan Wermter
Proceedings of the Twelfth Language Resources and Evaluation Conference

The recognition of emotion and dialogue acts enriches conversational analysis and help to build natural dialogue systems. Emotion interpretation makes us understand feelings and dialogue acts reflect the intentions and performative functions in the utterances. However, most of the textual and multi-modal conversational emotion corpora contain only emotion labels but not dialogue acts. To address this problem, we propose to use a pool of various recurrent neural models trained on a dialogue act corpus, with and without context. These neural models annotate the emotion corpora with dialogue act labels, and an ensemble annotator extracts the final dialogue act label. We annotated two accessible multi-modal emotion corpora: IEMOCAP and MELD. We analyzed the co-occurrence of emotion and dialogue act labels and discovered specific relations. For example, Accept/Agree dialogue acts often occur with the Joy emotion, Apology with Sadness, and Thanking with Joy. We make the Emotional Dialogue Acts (EDA) corpus publicly available to the research community for further study and analysis.

2018

pdf bib
A Context-based Approach for Dialogue Act Recognition using Simple Recurrent Neural Networks
Chandrakant Bothe | Cornelius Weber | Sven Magg | Stefan Wermter
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos
Egor Lakomkin | Sven Magg | Cornelius Weber | Stefan Wermter
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We describe KT-Speech-Crawler: an approach for automatic dataset construction for speech recognition by crawling YouTube videos. We outline several filtering and post-processing steps, which extract samples that can be used for training end-to-end neural speech recognition systems. In our experiments, we demonstrate that a single-core version of the crawler can obtain around 150 hours of transcribed speech within a day, containing an estimated 3.5% word error rate in the transcriptions. Automatically collected samples contain reading and spontaneous speech recorded in various conditions including background noise and music, distant microphone recordings, and a variety of accents and reverberation. When training a deep neural network on speech recognition, we observed around 40% word error rate reduction on the Wall Street Journal dataset by integrating 200 hours of the collected samples into the training set.

2017

pdf bib
Reusing Neural Speech Representations for Auditory Emotion Recognition
Egor Lakomkin | Cornelius Weber | Sven Magg | Stefan Wermter
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Acoustic emotion recognition aims to categorize the affective state of the speaker and is still a difficult task for machine learning models. The difficulties come from the scarcity of training data, general subjectivity in emotion perception resulting in low annotator agreement, and the uncertainty about which features are the most relevant and robust ones for classification. In this paper, we will tackle the latter problem. Inspired by the recent success of transfer learning methods we propose a set of architectures which utilize neural representations inferred by training on large speech databases for the acoustic emotion recognition task. Our experiments on the IEMOCAP dataset show ~10% relative improvements in the accuracy and F1-score over the baseline recurrent neural network which is trained end-to-end for emotion recognition.