Multimodal Conversational AI: A Survey of Datasets and Approaches

Anirudh Sundar, Larry Heck


Abstract
As humans, we experience the world with all our senses or modalities (sound, sight, touch, smell, and taste). We use these modalities, particularly sight and touch, to convey and interpret specific meanings. Multimodal expressions are central to conversations; a rich set of modalities amplify and often compensate for each other. A multimodal conversational AI system answers questions, fulfills tasks, and emulates human conversations by understanding and expressing itself via multiple modalities. This paper motivates, defines, and mathematically formulates the multimodal conversational research objective. We provide a taxonomy of research required to solve the objective: multimodal representation, fusion, alignment, translation, and co-learning. We survey state-of-the-art datasets and approaches for each research area and highlight their limiting assumptions. Finally, we identify multimodal co-learning as a promising direction for multimodal conversational AI research.
Anthology ID:
2022.nlp4convai-1.12
Volume:
Proceedings of the 4th Workshop on NLP for Conversational AI
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Bing Liu, Alexandros Papangelis, Stefan Ultes, Abhinav Rastogi, Yun-Nung Chen, Georgios Spithourakis, Elnaz Nouri, Weiyan Shi
Venue:
NLP4ConvAI
Publisher:
Association for Computational Linguistics
Pages:
131–147
URL:
https://aclanthology.org/2022.nlp4convai-1.12
DOI:
10.18653/v1/2022.nlp4convai-1.12
Cite (ACL):
Anirudh Sundar and Larry Heck. 2022. Multimodal Conversational AI: A Survey of Datasets and Approaches. In Proceedings of the 4th Workshop on NLP for Conversational AI, pages 131–147, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Multimodal Conversational AI: A Survey of Datasets and Approaches (Sundar & Heck, NLP4ConvAI 2022)
PDF:
https://aclanthology.org/2022.nlp4convai-1.12.pdf
Video:
https://aclanthology.org/2022.nlp4convai-1.12.mp4
Data:
AudioSet, GuessWhat?!, HMDB51, HowTo100M, Image-Chat, MOD, MovieQA, SIMMC, TEACh, TVQA, TVQA+, Talk the Walk, UCF101, VisDial, Visual Question Answering, Visual7W, YouCook2