Recognizing Emotions in Video Using Multimodal DNN Feature Fusion

Jennifer Williams, Steven Kleinegesse, Ramona Comanescu, Oana Radu


Abstract
We describe our system for input-level multimodal fusion of audio, video, and text to recognize emotions and their intensities, submitted to the 2018 First Grand Challenge on Computational Modeling of Human Multimodal Language. Our approach is based on input-level feature fusion with sequence learning from Bidirectional Long Short-Term Memory (BLSTM) deep neural networks (DNNs). We show that our fusion approach outperforms unimodal predictors. Our system performs 6-way simultaneous classification and regression, allowing for overlapping emotion labels in a video segment. It achieves an overall binary accuracy of 90%, an overall 4-class accuracy of 89.2%, and an overall mean absolute error (MAE) of 0.12. Our work shows that an early fusion technique can effectively predict the presence of multi-label emotions as well as their coarse-grained intensities. The presented multimodal approach provides a simple and robust baseline on this new Grand Challenge dataset. Furthermore, we provide a detailed analysis of the emotion intensity distributions output by our DNN, together with a discussion of the inherent difficulty of this task.
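The input-level (early) fusion and the two headline metrics named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The feature sizes, the toy values, the helper names (early_fusion, binary_accuracy, mean_absolute_error), and the rule that an emotion counts as present when its intensity exceeds 0 are all assumptions made for illustration. In a full system the fused sequence would be passed to a BLSTM, which is omitted here.

```python
import numpy as np

def early_fusion(audio, video, text):
    """Concatenate time-aligned per-step features along the feature axis."""
    if not (audio.shape[0] == video.shape[0] == text.shape[0]):
        raise ValueError("all modalities must have the same number of time steps")
    return np.concatenate([audio, video, text], axis=1)

def binary_accuracy(pred, gold, threshold=0.0):
    """Share of (segment, emotion) cells where predicted and gold presence agree.

    Assumption: an emotion is 'present' when its intensity exceeds `threshold`.
    """
    return float(np.mean((pred > threshold) == (gold > threshold)))

def mean_absolute_error(pred, gold):
    """Mean absolute difference between predicted and gold intensities."""
    return float(np.mean(np.abs(pred - gold)))

# Hypothetical placeholder sizes: 20 time steps, 4 audio, 3 video, 5 text features.
rng = np.random.default_rng(0)
T = 20
audio = rng.normal(size=(T, 4))
video = rng.normal(size=(T, 3))
text = rng.normal(size=(T, 5))
fused = early_fusion(audio, video, text)  # one fused vector per time step

# Toy intensities for two segments and six emotions; labels may overlap.
gold = np.array([[0.0, 1.0, 0.0, 0.0, 2.0, 0.0],
                 [0.0, 0.0, 0.0, 3.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.5, 0.0, 0.0, 2.0, 0.5],
                 [0.0, 0.0, 0.0, 2.0, 0.0, 0.0]])

print(fused.shape)
print(binary_accuracy(pred, gold))
print(mean_absolute_error(pred, gold))
```

Concatenating before any sequence model is what distinguishes early fusion from late fusion, where each modality would get its own predictor and only the outputs would be combined.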
Anthology ID:
W18-3302
Volume:
Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML)
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Amir Zadeh, Paul Pu Liang, Louis-Philippe Morency, Soujanya Poria, Erik Cambria, Stefan Scherer
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
11–19
URL:
https://aclanthology.org/W18-3302/
DOI:
10.18653/v1/W18-3302
Cite (ACL):
Jennifer Williams, Steven Kleinegesse, Ramona Comanescu, and Oana Radu. 2018. Recognizing Emotions in Video Using Multimodal DNN Feature Fusion. In Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), pages 11–19, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Recognizing Emotions in Video Using Multimodal DNN Feature Fusion (Williams et al., ACL 2018)
PDF:
https://aclanthology.org/W18-3302.pdf
Data:
CMU-MOSEI