Multimodal Conversational AI: A Survey of Datasets and Approaches

As humans, we experience the world with all our senses or modalities (sound, sight, touch, smell, and taste). We use these modalities, particularly sight and touch, to convey and interpret specific meanings. Multimodal expressions are central to conversations; modalities amplify and often compensate for one another. A multimodal conversational AI system answers questions, fulfills tasks, and emulates human conversations by understanding and expressing itself via multiple modalities. This paper motivates, defines, and mathematically formulates the multimodal conversational research objective. We provide a taxonomy of research required to solve the objective: multimodal representation, fusion, alignment, translation, and co-learning. We survey state-of-the-art datasets and approaches for each research area and highlight their limiting assumptions. Finally, we identify multimodal co-learning as a promising direction for multimodal conversational AI research.


Introduction
The proliferation of smartphones has dramatically increased the frequency of interactions that humans have with digital content. These interactions have expanded over the past decade to include conversations with smartphones and in-home smart speakers. Conversational AI systems (e.g., Alexa, Siri, Google Assistant) answer questions, fulfill specific tasks, and emulate natural human conversation (Hakkani-Tür et al., 2011; Gao et al., 2019).
Early examples of conversational AI include those based on primitive rule-based methods such as ELIZA (Weizenbaum, 1966). Later, conversational systems were driven by statistical machine translation, translating input queries to responses (Ritter et al., 2011; Hakkani-Tür et al., 2012). Orders of magnitude more data led to unprecedented advances in conversational technology in the middle of the last decade. Techniques were developed to mine conversational training data from the web search query-click stream (Hakkani-Tür et al., 2011; Heck, 2012; Hakkani-Tür et al., 2013) and web-based knowledge graphs (Heck and Hakkani-Tür, 2012; El-Kahky et al., 2014). With this increase in data, deep neural networks gained momentum in conversational systems (Mesnil et al., 2014; Heck and Huang, 2014; Sordoni et al., 2015; Vinyals and Le, 2015; Shang et al., 2015; Serban et al., 2016; Li et al., 2016a,b).
One limitation of existing agents is that they often rely exclusively on language to communicate with users. This contrasts with humans, who converse with each other through a multitude of senses. These senses or modalities complement each other, resolving ambiguities and emphasizing ideas to make conversations meaningful. Prosody, auditory expressions of emotion, and backchannel agreement supplement speech; lip-reading disambiguates unclear words; gesticulation makes spatial references; and high-fives signify celebration.
Alleviating this unimodal limitation of conversational AI systems requires developing methods to extract, combine, and understand information streams from multiple modalities and generate multimodal responses while simultaneously maintaining an intelligent conversation.
Similar to the taxonomy of multimodal machine learning research (Baltrušaitis et al., 2017), the research required to extend conversational AI systems to multiple modalities can be grouped into five areas: Representation, Fusion, Translation, Alignment, and Co-Learning. Representation and fusion involve learning mathematical constructs to mimic sensory modalities. Translation maps relationships between modalities for cross-modal reasoning. Alignment finds regions of relevance across modalities to identify correspondences between them. Co-learning exploits the synergies across modalities by leveraging resource-rich modalities to train resource-poor modalities.
Concurrently, it is necessary for the research areas outlined above to address four main challenges in multimodal conversational reasoning: disambiguation, response generation, coreference resolution, and dialogue state tracking. Multimodal disambiguation and response generation are challenges associated with fusion that determine whether the available multimodal inputs are sufficient for a direct response or whether follow-up queries are required. Multimodal coreference resolution is a challenge in both translation and alignment, where the conversational agent must resolve referential mentions in dialogue to corresponding objects in other modalities. Multimodal dialogue state tracking is a holistic challenge across research areas, typically associated with task-oriented systems. The goal is to parse multimodal signals to infer and update values for slots in user utterances.
In this paper, we discuss the taxonomy of research challenges in multimodal Conversational AI as illustrated in Figure 1. Section 2 provides a history of research in multimodal conversations. In Section 3, we mathematically formulate multimodal conversational AI as an optimization problem. Sections 4, 5, and 6 survey existing datasets and state-of-the-art approaches for multimodal representation and fusion, translation, and alignment. Section 7 highlights limitations of existing research in multimodal conversational AI and explores multimodal co-learning as a promising direction for research.

Background
Early work in multimodal conversational AI focused on the use of visual information to improve automatic speech recognition (ASR). One of the earliest papers along these lines is by Yuhas et al. (1989), followed by many others, including Meier et al. (1996), Duchnowski et al. (1994), Bregler and Konig (1994), and Ngiam et al. (2011). Advances in client-side capabilities enabled ASR systems to utilize other modalities such as tactile, voice, and text inputs. These systems supported more comprehensive interactions and facilitated a higher degree of personalization. Examples include ESPRIT's MASK (Lamel et al., 1998), Microsoft's MiPad (Huang et al., 2001), and AT&T's MATCH (Johnston et al., 2002).
Vision-driven tasks motivated research in adding visual understanding technology to conversational AI systems. Early work in reasoning over text+video includes that of Ramanathan et al. (2014), who leveraged the combined modalities to assign names of people in the cast to tracks in TV videos. Kong et al. (2014) leveraged natural language descriptions of RGB-D videos for 3D semantic parsing. Srivastava and Salakhutdinov (2014) developed a multimodal Deep Boltzmann Machine for image-text retrieval and ASR using videos. Antol et al. (2015) introduced a dataset and baselines for multimodal question-answering, a challenge combining computer vision and natural language processing. More recent work by Zhang et al. (2019b) and Selvaraju et al. (2019) leveraged conversational explanations to make vision and language models more grounded, resulting in improved visual question answering.
While the modalities most commonly considered in the conversational AI literature are text, vision, tactile, and speech, other sources of information are gaining popularity within the research community. These include eye-gaze, 3D scans, emotion, action and dialogue history, and virtual reality (e.g., Heck et al., 2013). Processing conventional and new modalities brings forth numerous challenges for multimodal conversations. To address these challenges, we first mathematically formulate the multimodal conversational AI problem and then detail the fundamental research sub-tasks required to solve it.

Mathematical Formulation
We formulate multimodal conversational AI as an optimization problem. The objective is to find the optimal response $S$ to a message $m$ given the underlying multimodal context $c$. Depending on the sufficiency of the context, the optimal response could be a statement of fact or a follow-up question to resolve ambiguities. Statistically, $S$ is estimated as the response $r$ that maximizes $P(r \mid m, c)$. The probability of an arbitrary response $r$ can be expressed as a product of the probabilities of responses $\{r_i\}_{i=1}^{T}$ over $T$ turns of conversation (Sordoni et al., 2015).
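Written out, a minimal form of this estimate, assuming the factorization implied above (the exact conditioning in the original equations may differ), is:
\[
\hat{S} = \operatorname*{arg\,max}_{r} P(r \mid m, c), \qquad
P(r \mid m, c) = \prod_{i=1}^{T} P\!\left(r_i \mid r_1, \ldots, r_{i-1}, m, c\right).
\]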
It is also possible for conversational AI to respond through multiple modalities. We represent the multimodality of output responses by a matrix $R := \{r_i^1, r_i^2, \ldots, r_i^l\}$ over $l$ permissible output modalities.
Learning from multimodal data requires manipulating information from all modalities using a function $f(\cdot)$ consisting of five sub-tasks: representation, fusion, translation, alignment, and co-learning. We include these modifications and present the final multimodal conversational objective below.
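A plausible sketch of this objective, assembled from the definitions above (the precise placement of $f(\cdot)$ and the factorization over turns are our assumptions), is:
\[
\hat{R} = \operatorname*{arg\,max}_{R} \; \prod_{i=1}^{T} P\!\left(r_i^1, \ldots, r_i^l \,\middle|\, r_{1:i-1},\, f\big(m, \{c_j\}_{j=1}^{n}\big)\right).
\]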
In the following sections, we describe each sub-task contained in $f(\cdot)$.

Multimodal Representation + Fusion
Multimodal representation learning and fusion are primary challenges in multimodal conversations. Multimodal representation is the encoding of multimodal data in a format amenable to computational processing. Multimodal fusion concerns joining features from multiple modalities to make predictions.

Multimodal Representations
Using multimodal information of varying granularity for conversations necessitates techniques to represent high-dimensional signals in a latent space. These latent multimodal representations encode human senses to improve a conversational AI's perception of the real world. Success in multimodal tasks requires that representations satisfy three desiderata (Srivastava and Salakhutdinov, 2014):
1. Similarity in the representation space implies similarity of the corresponding concepts.
2. The representation is easy to obtain in the absence of some modalities.
3. It is possible to infer missing information from observed modalities.
There exist numerous representation methods for the range of problems multimodal conversational AI addresses. Multimodal representations are broadly classified as either joint representations or coordinated representations (Baltrušaitis et al., 2017).

Joint Representations
Transformer-based models used as joint multimodal representations can be described as illustrated in the taxonomy of Figure 1. Modality-specific encoders $\{j^i(\cdot)\}_{i=1}^{n}$ embed unimodal tokens $\{c_k^i\}_{k=1}^{n}$ to create latent features $\{z_k^i\}_{k=1}^{n}$ (Equation 5). Decoder networks use the latent features to produce output symbols. A transformer $\Psi(\cdot)$ consists of stacked encoders and decoders with intra-modality attention. Attention heads compute relationships within elements of a modality, producing multimodal representations $\{h_k^i\}_{k=1}^{n}$ (Equation 6).
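To make this concrete, the following sketch (illustrative only, not a specific published model; the module names, dimensions, and the use of full self-attention over the concatenated sequence are our assumptions) shows modality-specific encoders feeding a shared transformer that produces joint representations:

```python
# Minimal sketch of a joint multimodal representation (illustrative): per-modality
# encoders j_i(.) produce latent features z, and a shared transformer Psi(.) with
# attention yields joint representations h over the concatenated token sequence.
import torch
import torch.nn as nn

class JointMultimodalEncoder(nn.Module):
    def __init__(self, text_vocab=1000, image_feat_dim=2048, d_model=256,
                 n_heads=4, n_layers=2):
        super().__init__()
        # Modality-specific encoders j_text(.) and j_image(.)
        self.j_text = nn.Embedding(text_vocab, d_model)
        self.j_image = nn.Linear(image_feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.psi = nn.TransformerEncoder(layer, n_layers)  # shared Psi(.)

    def forward(self, text_tokens, image_regions):
        # text_tokens: (batch, n_words); image_regions: (batch, n_regions, feat_dim)
        z_text = self.j_text(text_tokens)        # latent text features
        z_image = self.j_image(image_regions)    # latent image features
        z = torch.cat([z_text, z_image], dim=1)  # joint token sequence
        return self.psi(z)                       # joint multimodal representations h

# Toy usage with random inputs: 8 words and 5 image regions per example.
model = JointMultimodalEncoder()
h = model(torch.randint(0, 1000, (2, 8)), torch.randn(2, 5, 2048))
print(h.shape)  # torch.Size([2, 13, 256])
```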

Coordinated Representations
In contrast, coordinated representations model each modality separately. Constraints coordinate the representations of separate modalities by enforcing cross-modal similarity over concepts. For example, the audio representation $g_a(\cdot)$ of a dog's bark would be closer to the dog's image representation $g_i(\cdot)$ than to a car's (Equation 7). A notion of distance $d$ between modalities in the coordinated space enables cross-modal retrieval.
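For instance, the constraint in the example above can be expressed (in our notation) as a distance inequality in the coordinated space:
\[
d\big(g_a(\text{bark}),\, g_i(\text{dog})\big) \;<\; d\big(g_a(\text{bark}),\, g_i(\text{car})\big).
\]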

Multimodal Fusion
Multimodal fusion combines features from multiple modalities to make decisions, denoted by the final block before the outputs in Figure 1. Fusion approaches are broadly classified into model-agnostic and model-based methods.
Model-agnostic methods are independent of specific algorithms and are split into early, late, and hybrid fusion. Early fusion integrates features following extraction, projecting them into a shared space (Potamianos et al., 2003; Ngiam et al., 2011; Nicolaou et al., 2011; Jansen et al., 2019). In contrast, late fusion integrates decisions from unimodal predictors (Becker and Hinton, 1992; Korbar et al., 2018; Akbari et al., 2021). Early fusion is predominantly used to combine features extracted in joint representations, while late fusion combines decisions made in coordinated representations. Hybrid fusion exploits both low- and high-level modality interactions (Wu et al., 2005; Schwartz et al., 2020; Piergiovanni et al., 2020; Goyal et al., 2020).
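A minimal sketch of the distinction is shown below; the function names, feature dimensions, and the decision-averaging rule are illustrative assumptions, not tied to any cited system.

```python
# Minimal sketch contrasting model-agnostic fusion strategies (illustrative):
# early fusion concatenates unimodal features before a single predictor;
# late fusion combines the decisions of separate unimodal predictors.
import numpy as np

rng = np.random.default_rng(0)
audio_feat, video_feat = rng.normal(size=64), rng.normal(size=128)

# Stand-in "predictor": a fixed random projection followed by a sigmoid.
def predictor(x, out_dim=1, seed=1):
    w = np.random.default_rng(seed).normal(size=(x.size, out_dim))
    return 1.0 / (1.0 + np.exp(-(x @ w)))

# Early fusion: project features into a shared space, then predict once.
early_input = np.concatenate([audio_feat, video_feat])
early_decision = predictor(early_input)

# Late fusion: predict per modality, then combine decisions (here, by averaging).
late_decision = np.mean([predictor(audio_feat), predictor(video_feat)], axis=0)

print(early_decision, late_decision)
```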

State-of-the-art Representation+Fusion Models for Conversational AI
Having introduced the multimodal representation and fusion challenges, we present the state-of-the-art in these sub-tasks for conversational AI.

Factor Graph Attention
Factor Graph Attention (FGA) builds a factor graph over the input modalities, encoding textual modalities using LSTMs. Nodes in the factor graph represent attention distributions over elements of each modality, and factors capture relationships between nodes. There are two types of factors: local and joint. Local factors capture interactions between nodes of a single modality (e.g., words in the same sentence), while joint factors capture interactions between different modalities (e.g., a word in a sentence and an object in an image). Representations from all modalities are concatenated via hybrid fusion and passed through a multilayer perceptron network to retrieve the best candidate answer (Schwartz et al., 2020).

TRANSRESNET
Shuster et al. (2020) presents TRANSRESNET for image-based dialogue, the task of choosing the optimal response on a dialogue turn given an image, an agent personality, and the dialogue history. TRANSRESNET consists of separately learned sub-networks to represent the input modalities. Images are encoded using a ResNeXt 32×48d network trained on 3.5 billion Instagram images (Xie et al., 2017), personalities are embedded using a linear layer, and dialogue is encoded by a transformer pretrained on Reddit (Mazaré et al., 2018) to create a joint representation.
TRANSRESNET compares model-agnostic and model-based fusion by using either concatenation or attention networks to combine representations. As in FGA, the chosen dialogue response is the candidate closest to the fused representation.
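The retrieval step can be sketched generically as follows; this is a simplified illustration of scoring candidates against a fused context vector, not the actual TRANSRESNET implementation, and the encoders, dimensions, and projection are placeholders.

```python
# Generic sketch of retrieval-style response selection (illustrative only):
# fuse the modality representations, then pick the candidate response whose
# embedding is closest to the fused context vector.
import torch
import torch.nn.functional as F

def select_response(image_rep, personality_rep, dialogue_rep, candidate_reps):
    # Model-agnostic fusion by concatenation + projection (placeholder;
    # in practice this projection would be learned).
    fused = torch.cat([image_rep, personality_rep, dialogue_rep], dim=-1)
    fused = torch.nn.Linear(fused.shape[-1], candidate_reps.shape[-1])(fused)
    # Score candidates by cosine similarity to the fused representation.
    scores = F.cosine_similarity(fused.unsqueeze(0), candidate_reps, dim=-1)
    return int(scores.argmax())

# Toy usage with random representations: 10 candidate responses of dimension 256.
best = select_response(torch.randn(300), torch.randn(64), torch.randn(256),
                       torch.randn(10, 256))
print(best)
```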
On the first turn, TRANSRESNET uses only style and image information to produce responses. Dialogue history serves as an additional modality on subsequent turns. Ablating one or more modalities diminishes the model's ability to retrieve the correct response. Optimal performance on Image-Chat is achieved using multimodal concatenation of jointly represented modalities (Table 2).

MultiModal Versatile Networks (MMV)
Alayrac et al. (2020) presents a training strategy to learn coordinated representations using self-supervised contrastive learning from instructional videos. Videos are encoded using TSM with a ResNet50 backbone (Lin et al., 2019), audio is encoded by applying a ResNet50 to log-MEL spectrograms, and text is encoded using Google News pre-trained word2vec embeddings (Mikolov et al., 2013). Alayrac et al. (2020) defines three types of coordinated spaces: shared, disjoint, and 'fine+coarse'. The shared space enables direct comparison and navigation between modalities by assuming equal granularity. The disjoint space sidesteps navigation to solve the granularity problem by creating a space for each pair of modalities. The 'fine+coarse' space addresses both issues by learning two spaces: a fine-grained space compares audio and video, while a lower-dimensional coarse-grained space compares fine-grained embeddings with text. We further discuss the MMV model in Section 6.3.
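A rough sketch of the 'fine+coarse' idea is given below; this is our simplification, and the dimensions, projection layer, and similarity function are assumptions rather than the MMV implementation.

```python
# Rough sketch of a fine+coarse coordinated space (simplified illustration):
# audio and video embeddings live in a fine-grained space; a projection maps
# them into a lower-dimensional coarse space shared with text.
import torch
import torch.nn as nn

d_fine, d_coarse = 512, 256
video_emb = torch.randn(4, d_fine)    # fine-grained video embeddings
audio_emb = torch.randn(4, d_fine)    # fine-grained audio embeddings
text_emb = torch.randn(4, d_coarse)   # text embedded directly in the coarse space

fine_to_coarse = nn.Linear(d_fine, d_coarse)

# Audio-video comparison happens in the fine-grained space ...
av_similarity = torch.cosine_similarity(video_emb, audio_emb, dim=-1)
# ... while text is compared against projected (coarse) visual embeddings.
vt_similarity = torch.cosine_similarity(fine_to_coarse(video_emb), text_emb, dim=-1)
print(av_similarity.shape, vt_similarity.shape)  # torch.Size([4]) twice
```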

Multimodal Translation
Multimodal translation maps embeddings from one modality to signals from another for cross-modality reasoning (Figure 1). Cross-modal reasoning enables multimodal conversational AI to hold meaningful conversations and resolve references across multiple senses, specifically language and vision. To this end, we survey existing work addressing the translation of images and videos to text. We discuss multimodal question-answering and multimodal dialogue, translation tasks that extend to multimodal conversations.

Image
Antol et al. (2015) and Zhu et al. (2016) present Visual Question-Answering (VQA) and Visual7W for multimodal question answering (MQA). The MQA challenge requires responding to textual queries about an image. Both datasets collect questions and answers using crowd workers, encouraging trained models to learn natural responses. Heck and Heck (2022) presents the Visual Slot dataset, where trained models learn to answer questions grounded in user interfaces (UIs).
The objective of MQA is a simplification of Equation 4 to a single-turn, single-timestep scenario ($T = 1$), producing a response to a question $m_q$ given multimodal context $\{c_i\}_{i=1}^{n}$. Video question answering (video-QA) extends MQA to videos; besides visual reasoning, it requires temporal reasoning, a challenge addressed by multimodal alignment that we discuss in the following section.

Multimodal Alignment
While image-based dialogue revolves around objects (e.g., cats and dogs), video-based dialogue revolves around objects and associated actions (e.g., jumping cats and barking dogs), where spatial and temporal features serve as building blocks for conversations. Extracting these spatiotemporal features requires multimodal alignment: aligning sub-components of different modalities to find correspondences. We identify action recognition and action from modalities as alignment challenges relevant to multimodal conversations.

Action Recognition
Action recognition is the task of extracting natural language descriptions from videos. UCF101 (Soomro et al., 2012), HMDB51 (Kuehne et al., 2011), and Kinetics-700 (Carreira et al., 2019) involve extracting actions from short YouTube and Hollywood movie clips. HowTo100M (Miech et al., 2019), MSR-VTT, and YouCook2 (Zhou et al., 2017) are datasets containing instructional videos from the internet and require learning text-video embeddings. YouCook2 and MSR-VTT are annotated by hand, while HowTo100M uses existing video subtitles or ASR.
Mathematically, the goal is to retrieve the correct natural language description $y \in Y$ for a query video $x$ (Equation 11). Video and text representation functions $g_{\text{video}}(\cdot)$ and $g_{\text{text}}(\cdot)$ embed the modalities into a coordinated space, where they are compared using a distance measure $d$.
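Under these definitions, a natural form of the retrieval objective (our reconstruction, following the notation in the text) is:
\[
\hat{y} = \operatorname*{arg\,min}_{y \in Y} \; d\big(g_{\text{video}}(x),\, g_{\text{text}}(y)\big).
\]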

Action from Modalities
Equipping multimodal conversational agents with the ability to perform actions from multiple modalities provides them with an understanding of the real world, improving their conversational utility. Talk the Walk (de Vries et al., 2018) presents the task of navigation conditioned on partial information. A "tourist" provides descriptions of a photo-realistic environment to a "guide" who determines actions. Vision-and-Dialog Navigation (Thomason et al., 2019) contains natural dialogues grounded in a simulated environment. The task is to predict a sequence of actions to a goal state given the world scene, dialogue, and previous actions. TEACh (Padmakumar et al., 2021) extends Vision-and-Dialog Navigation to complete tasks in an AI2-THOR simulation. The challenge involves aligning information from language, video, and action and dialogue history to solve daily tasks. Ego4D (Grauman et al., 2021) contains text-annotated egocentric (first-person) videos in real-world scenarios. Ego4D includes 3D scans, multiple camera views, and eye gaze, presenting new representation, fusion, translation, and alignment challenges. It is associated with five benchmarks: video QA, object state tracking, audio-visual diarization, social cue detection, and camera trajectory forecasting.

Multimodal Versatile Networks (MMV)
In addition to a representation, Alayrac et al. (2020) presents a self-supervised task to train modality embedding graphs for multimodal alignment. A multiple instance learning (MIL) variant of noise contrastive estimation (NCE) (Miech et al., 2020) measures the loss on pairs of modalities of different granularity. MIL accounts for misalignment between audio/video and text by measuring the loss of fine-grained information against multiple temporally close narrations.
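A compact sketch of an MIL-style contrastive loss in this spirit is shown below; it is a simplified illustration rather than the exact MIL-NCE objective of Miech et al. (2020), and the temperature and masking scheme are assumptions.

```python
# Simplified MIL-style contrastive loss (illustration, not the exact MIL-NCE
# implementation): each video clip has several temporally close narration
# candidates treated jointly as positives against all other narrations.
import torch

def mil_contrastive_loss(video_emb, narration_emb, positive_mask, temperature=0.07):
    # video_emb: (B, D); narration_emb: (N, D)
    # positive_mask: (B, N) boolean, True for narrations temporally close to a clip.
    # Each row of positive_mask must contain at least one True entry.
    logits = video_emb @ narration_emb.t() / temperature          # (B, N)
    pos = torch.where(positive_mask, logits, torch.tensor(float('-inf')))
    # log(sum over positives) - log(sum over all candidates), averaged over clips.
    loss = -(torch.logsumexp(pos, dim=1) - torch.logsumexp(logits, dim=1))
    return loss.mean()

# Toy usage: 2 clips, 6 narrations, 2 temporally close positives per clip.
mask = torch.zeros(2, 6, dtype=torch.bool)
mask[0, :2] = True
mask[1, 2:4] = True
print(mil_contrastive_loss(torch.randn(2, 128), torch.randn(6, 128), mask))
```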
The network is trained on HowTo100M (Miech et al., 2019) and AudioSet (Gemmeke et al., 2017). Table 3 compares the performance of MMV on action classification, audio classification, and zero-shot text-to-video retrieval.

Discussion
The current datasets used for research in multimodal conversational AI are summarized in Table 4. While MQA and MTQA are promising starting points for multimodal natural language tasks, extending QA to conversations is not straightforward. Inherently, MQA limits itself to direct questions targeting visible content, whereas multimodal conversations require understanding information that is often implied (Mostafazadeh et al., 2017). Utterances in dialogue represent speech acts and are classified as constatives, directives, commissives, or acknowledgments (Bach and Harnish, 1979). Answers belong to a single speech-act category (constatives) and therefore cover only a subset of natural conversations.
Similarly, the work to date on action recognition is incomplete and insufficient for conversational systems. Conversational AI must represent and understand spatiotemporal interactions. However, current research in action recognition attempts to learn relationships between videos and their natural language descriptions. These descriptions are not speech acts themselves; therefore, they do not adequately represent dialogue but rather serve only as anchor points in the interaction.
In contrast, Image-Chat (Shuster et al., 2020) presents a learning challenge directly aligned with the multimodal dialogue objective in Equation 4. Image-Chat treats dialogue as an open-ended discussion grounded in the visual modality. Succeeding in the task requires jointly optimizing visual and conversational performance. The use of crowd workers who adopt personalities during data collection encourages natural dialogue and captures conversational intricacies and implicatures.
In addition, algorithmic improvements are required to advance the field of multimodal conversational AI, particularly with respect to the objective function. Current approaches such as MQA and action recognition models optimize a limited objective compared to Equation 4. We postulate that this limited objective is a major cause of the degradation these methods exhibit when applied to multimodal conversations, and that it therefore merits investigation.
Another open research problem is to improve performance on Image-Chat. The current state-of-the-art, TRANSRESNET RET, is limited. The model often hallucinates, referring to content missing from the image and previous dialogue turns. The model also struggles when answering questions and holding extended conversations. We suspect these problems reflect the limiting assumptions Image-Chat makes and the absence of multimodal co-learning to extract relationships between modalities. For further details, we refer readers to example conversations in Appendix A.
Different modalities often contain complementary information when grounded in the same concept. Multimodal co-learning exploits this cross-modality synergy to model resource-poor modalities using resource-rich modalities. An example of co-learning in the context of Figure 1 is the use of visual and audio information to generate contextualized text representations. Blum and Mitchell (1998) introduced an early approach to multimodal co-training, using information from hyperlinked pages for web-page classification. Socher and Fei-Fei (2010) and Duan et al. (2014) presented weakly supervised techniques to tag images given information from other modalities. Kiela et al. (2015) grounded natural language descriptions in olfactory data. More recently, Upadhyay et al. (2018) jointly trained bilingual models to accelerate spoken language understanding in low-resource languages, and Selvaraju et al. (2019) used human attention maps to teach QA agents "where to look". Despite the rich history of work in multimodal co-learning, extending these techniques to develop multimodal conversational AI that understands and leverages cross-modal relationships remains an open challenge.

Conclusions
We define multimodal conversational AI and outline the objective function required for its realization. Solving this objective requires multimodal representation and fusion, translation, and alignment. We survey existing datasets and state-of-the-art methods for each sub-task. We identify simplifying assumptions made by existing research that prevent the realization of multimodal conversational AI. Finally, we outline the collection of a suitable dataset and an approach that utilizes multimodal co-learning as future steps.