Lightweight Models for Multimodal Sequential Data

Soumya Sourav, Jessica Ouyang


Abstract
Human language encompasses more than just text; it also conveys emotions through tone and gestures. We present a case study of three simple and efficient Transformer-based architectures for predicting sentiment and emotion in multimodal data. The Late Fusion model merges unimodal features to create a multimodal feature sequence, the Round Robin model iteratively combines bimodal features using cross-modal attention, and the Hybrid Fusion model combines trimodal and unimodal features together to form a final feature sequence for predicting sentiment. Our experiments show that our small models are effective and outperform the publicly released versions of much larger, state-of-the-art multimodal sentiment analysis systems.
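To make the fusion strategies described above concrete, below is a minimal PyTorch sketch of the late-fusion idea: each unimodal feature sequence is projected into a shared space, the sequences are concatenated, and a small Transformer encoder produces a pooled representation for sentiment prediction. All module names, dimensions, and hyperparameters here are illustrative assumptions and do not reflect the paper's actual implementation.

```python
# Minimal late-fusion sketch (illustrative only; not the paper's code).
# Feature dimensions, layer counts, and pooling choice are assumptions.
import torch
import torch.nn as nn


class LateFusionSentiment(nn.Module):
    def __init__(self, text_dim=300, audio_dim=74, video_dim=35,
                 d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        # Project each modality into a shared feature space.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)  # scalar sentiment score

    def forward(self, text, audio, video):
        # Each input: (batch, seq_len, modality_dim); sequences are assumed
        # to be aligned so they can be concatenated along the time axis.
        fused = torch.cat([self.text_proj(text),
                           self.audio_proj(audio),
                           self.video_proj(video)], dim=1)
        encoded = self.encoder(fused)
        pooled = encoded.mean(dim=1)  # average-pool over the fused sequence
        return self.head(pooled)


if __name__ == "__main__":
    model = LateFusionSentiment()
    t = torch.randn(2, 20, 300)   # text features
    a = torch.randn(2, 20, 74)    # audio features
    v = torch.randn(2, 20, 35)    # video features
    print(model(t, a, v).shape)   # -> torch.Size([2, 1])
```

The Round Robin and Hybrid Fusion variants described in the abstract would replace the simple concatenation with cross-modal attention between pairs of modalities or with a combination of trimodal and unimodal features; the sketch above shows only the simplest of the three strategies.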
Anthology ID:
2021.wassa-1.14
Volume:
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
Month:
April
Year:
2021
Address:
Online
Editors:
Orphee De Clercq, Alexandra Balahur, Joao Sedoc, Valentin Barriere, Shabnam Tafreshi, Sven Buechel, Veronique Hoste
Venue:
WASSA
Publisher:
Association for Computational Linguistics
Pages:
129–137
URL:
https://aclanthology.org/2021.wassa-1.14
Cite (ACL):
Soumya Sourav and Jessica Ouyang. 2021. Lightweight Models for Multimodal Sequential Data. In Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 129–137, Online. Association for Computational Linguistics.
Cite (Informal):
Lightweight Models for Multimodal Sequential Data (Sourav & Ouyang, WASSA 2021)
PDF:
https://aclanthology.org/2021.wassa-1.14.pdf
Optional supplementary material:
 2021.wassa-1.14.OptionalSupplementaryMaterial.zip
Data
CMU-MOSEI, IEMOCAP