Multimodal Context Carryover

Multi-modality support has become an integral part of creating a seamless user experience with modern voice assistants equipped with smart displays. Users refer to images, video thumbnails, or the accompanying text descriptions on the screen through voice communication with AI-powered devices. This raises the need to either augment existing commercial voice-only dialogue systems with state-of-the-art multimodal components, or to introduce entirely new architectures, where the latter can lead to costly system revamps. To support the emerging visual navigation and visual product selection use cases, we propose to augment commercially deployed voice-only dialogue systems with additional multi-modal components. In this work, we present a novel yet pragmatic approach to expand an existing dialogue-based context carryover system (Chen et al., 2019a) in a voice assistant with state-of-the-art multi-modal components, facilitating quick delivery of visual modality support with minimal changes. We demonstrate a 35% accuracy improvement over the existing system on an in-house multi-modal visual navigation data set.


Introduction
The Context Carryover (CC) framework (Chen et al., 2019a; Naik et al., 2018; Sharaf et al., 2018; Rastogi et al., 2019) handles the identification and carryover of relevant context information in a multi-turn dialog interaction between a voice assistant and the user. The CC framework determines which tokens and intents in the most recent system-user interaction history are relevant as supporting information to fulfill the user's current request. The details of the context carryover framework are well documented in (Chen et al., 2019a). One key limitation of the framework is that it is purely text-based, and therefore it struggles to capture user interactions that involve visual components. In this work, we introduce augmentations that enable the Context Carryover framework to handle multimodal use cases. We focus on two use cases related to a user's visual navigation and selection experience: visual product selection, and visual scene and video selection.
Visual product selection, demonstrated in Fig. 1, covers the case where the user refers to a single product on the screen. In the provided example, the user is shopping for a handbag and the voice assistant is displaying a number of handbags on the screen. The user selects one of the many handbags on the screen using a referring utterance, for instance one mentioning the color of the handbag. The user is free to use any other natural language phrase that differentiates the product from the others displayed on the screen.
In visual scene and video selection, as seen in Fig. 2 and Fig. 3, the user can refer to a scene or movie that has multiple products within a more cluttered visual landscape. Here, a "scene image" is defined as an image of an individual in a landscape wearing multiple products (dress, hat, purse, sunglasses, etc.). The user then tries to select a scene using a referring expression (e.g., "The scene with the lady in the denim jacket"). In Fig. 3, a movie can be associated with multiple frames and the user can refer to the movie by referring to an action or content from a specific frame.
The main contributions of our work are:
1. We introduce a Vision Augmentation scheme that enables ingestion of visual content in a dialogue-based context carryover framework.
2. We introduce an Aligned Vision and Text Augmentation that incorporates the latest state-of-the-art developments in multi-modal contrastive learning into a dialogue-based context carryover framework.
3. The newly proposed methods result in significant accuracy improvements on an in-house dataset collected through Amazon Mechanical Turk (MTurk). We present sensitivity analyses that display the effectiveness of the various suggested model augmentations on our in-house dataset.
4. We introduce a synthetic data generation pipeline that generates synthetic visual product selection data, which helps to train the models and cuts down on manual annotation and MTurk survey costs.
Related Work

Single Encoder models appear early in the multimodal literature and pave the way for the other two families of models. In this family of models, the image and text representations usually exist in separate spaces and are combined by an ensuing fusion layer.
Dual Encoder models leverage an image-text contrastive loss (Oord et al., 2018; He et al., 2019; Chen et al., 2020; Tian et al., 2019) during training, exhibit higher image-to-text alignment, and bring the image and text representations into a common, more aligned representation space. They perform well on image-text retrieval tasks but underperform on vision-language understanding tasks requiring higher reasoning, e.g., Visual Question Answering (VQA) and Natural Language Inference (NLI).
Our current task of Multimodal Context Carryover uses the latest advances in the multimodal representation learning space and injects state-of-the-art components, with minimal changes, into a framework for dialog tracking and slot selection, resulting in a system that can handle multimodal user-system dialog interaction. Our current multimodal use cases are set up to be similar to a text-to-image retrieval task that occurs within the context of a user-system dialog interaction. Thus, we incorporate the latest developments in Dual Encoder design in our work, since the approach is well suited for the multimodal text-to-image retrieval step. The latest state-of-the-art models that incorporate multimodal representations into dialogue state tracking systems, e.g., VDST (Pang and Wang, 2019), Flamingo (Alayrac et al., 2022), and VDTN (Le et al., 2022), would require costly system revamps.
Recently, Kottur et al. (2021) released a novel multimodal conversation dataset with labeled dialogue state (e.g., entity and dialogue act), which motivated further studies (Garcia et al., 2022; Agarwal et al., 2021). These datasets contain dialog acts and products but do not have the scene and video information required for our purposes. For our current study we resort to collecting our own dataset through Amazon Mechanical Turk, catered to our commercial needs.

Problem Formulation
Each interaction between the user and the system can be formulated as a sequence of utterances H, consisting of alternating utterances between the user and the system: H = (h_0, h_1, h_2, ..., h_{|H|}), where each element h_d ∈ H is an utterance either by the user, h_d^u, or by the system, h_d^s. We refer to H as the dialog history. The subscript d denotes the utterance distance, which measures the offset from the most recent user utterance (h_0). The j-th token of an utterance with distance d is denoted as h_d[j]. Each utterance in the dialog history consists of slots. A slot s = (d, k, v) in a dialog is defined as a key-value pair that contains information about an entity. For example, in Fig. 1 the user says, "Show me the one with the brown handle"; here one of the slots would be [COLOR:Brown]. Each slot is defined by the distance d of its corresponding utterance, the slot key k, and the slot value v. We refer to S as the context slots, which comprise all the slots in the dialog history.
In addition to the context slots, which are derived from the dialogs, we also have on-screen lists which can be present in the current turn, as shown in Fig. 1. Users can reference items in these lists either through visual features or through references to the title, e.g., "Canvaslove Rose one ...". A list object l = (k, v, i) in the current turn is defined as a key-value pair along with the associated image. The key k in our case is ProductTitle, the value v is the title itself, and i refers to the image associated with the list object. We refer to L as the set of all list objects in the current turn.
Given the dialog history H, the context slots S, and the on-screen list L, we define the candidate slots as C = (S ∪ L). The task can be formulated as correctly identifying the subset of candidate slots which are relevant to the current turn. A binary decision is made jointly over each of these candidate slots by the model F, which takes the slot interdependencies into consideration, i.e., F(H, C) = C*, where C* ⊆ C. The full details of the context carryover (CC) architecture, which forms our baseline, are provided in Appendix A.1. The baseline solution is not capable of ingesting visual content and hence cannot perform selection based on visual features. In Section 3.2, we introduce the vision and vision-aligned text augmentations to the CC model, which add the capability to process visual features and are the main contributions of this paper.
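The candidate construction above can be sketched in a few lines. The class and function names below are illustrative only, not the production framework's:

```python
# Hypothetical sketch of candidate construction (Problem Formulation):
# context slots S from the dialog history and on-screen list objects L
# are pooled into one candidate set C = S ∪ L, over which the model
# later makes a joint binary carryover decision per candidate.
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass(frozen=True)
class Slot:
    distance: int   # d: offset of the source utterance from the current turn
    key: str        # k: slot key, e.g. "COLOR"
    value: str      # v: slot value, e.g. "Brown"

@dataclass(frozen=True)
class ListObject:
    key: str                 # always "ProductTitle" in our setting
    value: str               # the title text itself
    image_id: Optional[str]  # i: handle to the associated product image

def build_candidates(context_slots: List[Slot],
                     on_screen: List[ListObject]) -> List[Union[Slot, ListObject]]:
    """C = S ∪ L: every context slot and every list object is a candidate."""
    return list(context_slots) + list(on_screen)

history_slots = [Slot(distance=1, key="COLOR", value="Brown")]
screen = [ListObject("ProductTitle", "Canvaslove Rose one ...", "img_042")]
candidates = build_candidates(history_slots, screen)
```

The model then scores each element of `candidates` jointly; items that clear the carryover threshold form the selected subset C* ⊆ C.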

Vision Augmentation
Recently, CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), ALBEF (Li et al., 2021b), and ConVIRT (Zhang et al., 2020) train dense, aligned image-text embeddings using a contrastive loss. The training requires matched (image, text) pairs where the text can be free form. The bidirectional contrastive losses for the i-th image-text pair are given in equation 1 and equation 2 in Appendix A.2. The image and text are projected onto a shared embedding space as I_i ∈ R^d and T_i ∈ R^d, respectively. ⟨I_i, T_j⟩ represents the cosine similarity and τ ∈ R^+ is a temperature parameter. The losses are then added as seen in equation 3 in Appendix A.2.
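The bidirectional contrastive objective referenced here (equations 1–3 in Appendix A.2) follows the standard CLIP-style formulation; below is a minimal, dependency-free sketch, assuming a precomputed batch similarity matrix with the matched pairs on the diagonal. The function name and defaults are illustrative:

```python
import math

def clip_style_loss(sims, temperature=0.07, lam=0.5):
    """Bidirectional (image-to-text and text-to-image) contrastive loss over
    a batch: sims[i][j] = <I_i, T_j> (cosine similarities), matched pairs on
    the diagonal. lam weights the two directions as in equation 3."""
    n = len(sims)

    def softmax_nll(row, target):
        # negative log of softmax probability assigned to the matched index
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        return log_z - logits[target]

    # image -> text uses row i; text -> image uses column i
    l_i2t = sum(softmax_nll(sims[i], i) for i in range(n)) / n
    l_t2i = sum(softmax_nll([sims[j][i] for j in range(n)], i)
                for i in range(n)) / n
    return lam * l_i2t + (1 - lam) * l_t2i
```

With a perfectly aligned batch (identity similarity matrix) the loss approaches zero, while an uninformative uniform matrix yields log(batch size) per direction.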
Our on-screen image selection use case is slightly different from this generic paired (image, text) training setting. In our case, the user makes a reference to a specific product, scene, or movie shown on the screen, which is more akin to a text-to-image retrieval task. The user also focuses on differentiating the desired product from the list of products shown on the screen. This is more nuanced than a generic text description of a product, as the referring utterance is conditioned on the desired image and the other surrounding images.
In our initial solution, we obtain the CLIP vision embeddings for the product images and add them to the CC framework as shown in Fig. 4a), which we term the Vision Augmentation. In Fig. 4a) and b), the term List SeMI stands for a List of [Se]mantically [M]eaningful [I]mages. More concretely, each product shown to the customer is represented by a list object l = (k, v, i) and considered a potential carryover candidate, as mentioned in Section 3.1. As part of the Vision Augmentation, we use the CLIP visual embedding of the product image as i. These product list objects with CLIP visual embeddings are then sent to the CC Candidate Encoder. The CC Encoder-Decoder framework subsequently decides whether to carry over the product list object to the next dialogue state, which would signify a product selection. The rationale here is to simply augment the existing system with the visual modality and evaluate its effectiveness on a multimodal dataset. Even though we select CLIP as our initial vision embedding, the system is compatible with any embedding trained with a contrastive loss (e.g., we also show similar performance improvements with ALBEF (Li et al., 2021b)).
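A minimal sketch of how the Vision Augmentation attaches a frozen, precomputed CLIP image embedding to a candidate before it enters the candidate encoder. The helper names are hypothetical and the real encoder input is more elaborate:

```python
# Illustrative sketch (not the production code): the on-screen product's
# image embedding i is normalized and appended to the slot-key features,
# so the CC Candidate Encoder sees the visual modality alongside the text.
def l2_normalize(v):
    """Scale a vector to unit length (CLIP embeddings are compared by
    cosine similarity, so unit norm makes dot products comparable)."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)

def vision_augmented_candidate(key_emb, image_emb):
    """Concatenate the slot-key embedding with the normalized (frozen)
    CLIP image embedding to form the candidate encoder input."""
    return list(key_emb) + l2_normalize(image_emb)
```

Because the image embedding is kept frozen, swapping CLIP for another contrastively trained encoder (e.g., ALBEF) only changes what `image_emb` contains, not the surrounding framework.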

Aligned Vision and Text Augmentation
Using the notation from Section 3.1, at a given moment there are n product list item objects l = (k, v, i) shown to the user. Here, k is the keyword ProductTitle, the value v is the actual title itself, and i is the associated image. Given the product images {i_1, i_2, ..., i_n}, their titles {v_1, v_2, ..., v_n}, and the most recent user referring utterance h_0, which is obtained from the dialog history, our task is to find the product list object that best matches the user's request h_0. We utilize CLIP to bring the images {i_1, i_2, ..., i_n} and the textual referring expression h_0 into the same embedding space, and compute the dot product similarity between each image in {i_1, i_2, ..., i_n} and h_0, as shown in Fig. 4b). In other words, we obtain a similarity score per candidate list image with respect to the referring utterance. We term this the Aligned Vision and Text Augmentation. The pseudocode for this operation is shown in Fig. 7 in Appendix A.3. The resulting multimodal dot product tensors are shown in Fig. 8 in Appendix A.4.
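Assuming the CLIP text and image embeddings are already L2-normalized, the per-candidate scoring reduces to a dot product, which then equals cosine similarity. A sketch with illustrative function names:

```python
# Hypothetical sketch of the Aligned Vision and Text Augmentation scoring:
# score each candidate image against the user's referring utterance h_0
# in the shared CLIP embedding space.
def similarity_scores(utterance_emb, image_embs):
    """Dot products between the CLIP text embedding of h_0 and each
    candidate image embedding (unit-norm vectors assumed)."""
    return [sum(u * x for u, x in zip(utterance_emb, img))
            for img in image_embs]

def best_candidate(utterance_emb, image_embs):
    """Index of the on-screen image that best matches the utterance."""
    scores = similarity_scores(utterance_emb, image_embs)
    return max(range(len(scores)), key=scores.__getitem__)
```

In the full system these scores are not used as a hard arg max; they are passed to the CC encoder as additional features, so the carryover decision can still weigh the dialog history and titles.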

Datasets
In this section, we describe the newly gathered Amazon Mechanical Turk (MTurk) multimodal dataset and the pre-existing text-only dataset.

Multimodal dataset
The MTurk dataset is created by showing Mechanical Turkers (MTurkers) product images, scene images, and video thumbnails and asking them to pick one of the many products, scenes, or movies using a single referring utterance. We define one such referring act, comprising multiple images and a single referring utterance, as a single "instance". The newly collected MTurk dataset has Train/Dev/Test split sizes of 33,526/4,087/4,152 instances, respectively. Additional details about the dataset are included in Appendix A.5.
We also utilize an internally annotated dataset, which we refer to as the existing Context Carryover dataset; details can be found in Appendix A.6.

Synthetic Visual Data Generation Pipeline

Even though we use MTurk data for our current study, it is expensive to generate and infeasible to extend to more domains. An alternative, cheaper, and scalable approach to quickly collect carryover data in multimodal settings is to employ a data synthesizer. The synthetic data generation process is as follows: for each synthetic data sample, we first randomly sample a slot type (e.g., bag) and visual attributes (e.g., red) to create slot candidates. Note that for each slot, we only randomly select one type from a pre-defined object list, and we sample three different visual attributes under the same attribute category (e.g., color) from a pre-defined attribute list. For example, we select the slot type bag from the object list, and then we draw three visual attributes red, blue, orange from the attribute list. We then combine them to obtain three slots: red bag, blue bag, orange bag, simulating the screen shown to the user when shopping for bags. In the second step, we retrieve an image from a product image catalog for each generated slot. To do so, we employ CLIP to get text embeddings of each slot: {t_1, t_2, t_3}. For each image in the product catalog, we precompute the image embedding with the same CLIP model: {i_1, ..., i_N}. We use the inner product as the similarity metric to perform the image retrieval: i*_k = arg max_j ⟨t_k, i_j⟩, where i*_k denotes the image selected for slot k. To add more randomness, we retrieve the top-m most similar images and randomly select one image from these candidates. After getting all of the product images (and associated metadata), we simulate a user selection phrase by randomly selecting one of the three generated slots as the ground truth and filling it into a predefined template.
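The retrieval step of the synthesizer can be sketched as follows, with top-m sampling providing the randomness described above. Function and parameter names are illustrative, and the embeddings stand in for precomputed CLIP vectors:

```python
import random

def retrieve_image(slot_text_emb, catalog_embs, top_m=5, rng=random):
    """Arg-max inner-product retrieval i*_k = arg max_j <t_k, i_j>,
    softened by sampling uniformly from the top-m most similar catalog
    images so repeated runs yield varied synthetic samples."""
    ranked = sorted(
        range(len(catalog_embs)),
        key=lambda j: sum(a * b for a, b in zip(slot_text_emb, catalog_embs[j])),
        reverse=True,
    )
    return rng.choice(ranked[:top_m])
```

Setting `top_m=1` recovers the plain arg max; larger values trade retrieval precision for diversity in the generated data.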

Results
We compare various combinations of vision, text, and similarity augmentation schemes against the baseline CC model in Table 1. All models are trained on a combined CC and multimodal MTurk training dataset. Since we are interested in how our models perform on multimodal use cases, we report results only on the multimodal test dataset. Further details of the experimental setup, such as model training details, are described in Appendix A.7.
In Table 1, row 1 is the current baseline CC model, and rows 2-7 are the augmentations. Augmenting the existing CC framework with only the CLIP visual components (Table 1, row 4) gives a 23.96% accuracy improvement over the baseline on the multimodal test set. When adding CLIP vision and CLIP user current utterance text embeddings (Table 1, row 5), the accuracy gain increases to 24.49%. The highest improvement comes when the CLIP vision and text embeddings and the dot product similarity scores (Table 1, row 7) are all given to the CC framework, yielding a 34.65% accuracy improvement over the baseline. These methods keep the CLIP embeddings frozen, but in Table 1, row 6 we attempt to fine-tune the CLIP embeddings in an end-to-end fashion using the CC framework. We find that fine-tuning the CLIP embeddings in our setting does not provide further gains. This may be because the generic loss of the CC framework is non-contrastive (i.e., it is cross-entropy based) and thus further fine-tuning does not improve the effectiveness of the CLIP embeddings. A similar observation is recorded in the parallel work Flamingo (Alayrac et al., 2022) and likened to "catastrophic forgetting". In Table 1, row 3, we exclude the vision and textual embeddings altogether and provide only the dot product similarity scores. We find that providing only similarity scores (row 3) is on par with providing both visual and textual embeddings (row 5). This implies that the CC encoder can simply work with similarity scores instead of embeddings. Finally, in row 2 we give the CC framework only the CLIP text embeddings (i.e., no vision components) and find the performance to be worse than the other augmentations. We hypothesize that this is because the CC framework gets textual information from two sources in row 2.
One source is the dialogue history, which it processes in the usual fashion described in Section A.1 through the CC slot carryover framework, and the other is the CLIP text embeddings. Since there is no accompanying visual input, the CLIP text input is redundant and may add noise, which leads to the minor accuracy improvement and the weighted F1 degradation seen in row 2. We are also interested in how the CC baseline model behaves when the multimodal MTurk training data is added to its training set. More simply put, we want to check whether adding the MTurk training data will improve the performance of the CC baseline on the multimodal MTurk test set, even though the CC baseline is a purely text-based system that has no notion of the visual modality. From Table 2, row 1 we see that even when no multimodal training data is added, the CC baseline still has an absolute accuracy of 67% on the MTurk test set. This can be attributed to the CC framework using the product image titles, which are textual, to make inferences. In Table 2, row 2, when the baseline is trained only on the multimodal MTurk dataset (without the 1.28M pre-existing CC dataset), there is a 2.34% accuracy improvement relative to the baseline, mainly due to the training and testing distributions being similar. In Table 2, row 3, when the datasets are combined during training, the baseline shows a 5.31% degradation compared to Table 2, row 1, which indicates that the CC baseline is not equipped to handle a combined multimodal and non-multimodal dataset. In Table 2, row 4, we see that the best results are obtained when the visual modality related model changes are added and the model is trained on the combined multimodal and pre-existing CC dataset.
We also experiment with ALBEF (Li et al., 2021b) embeddings and show the results in Appendix A.8. We anticipate that the science community will produce ever-improving dense multimodal embeddings as time goes on, and hope that our simple yet effective augmentation enables commercial frameworks to utilize the latest state-of-the-art embeddings with minimal changes.

Conclusion
We augment the existing Context Carryover framework with Vision and Vision Aligned Text components. We collect a multimodal dataset which mimics real-world customer interactions to train and evaluate our models. We show a 35% accuracy improvement when the existing CC framework is augmented with Vision and Vision Aligned Text components.

Limitations
Our solution is limited to the English language: our training data only contain products with English titles, and all of the referring expressions are in English. Transferring the model to another language will require re-training the model and potentially architecture changes. Furthermore, the MTurkers who provided the expressions in our study may not be representative of the user demographics, and the data may not provide a well-grounded proxy of user behavior. While collecting more data can mitigate some of these limitations, curation and validation of visual expressions is a time-consuming and expensive process, which is why our dataset is limited in size.
Our models leverage visual embeddings from systems such as CLIP and ALBEF, which have their own set of limitations. For instance, CLIP is known to fail in cases that require counting objects or reasoning about relations of multiple objects in an image. Thus, visual models leveraging CLIP embeddings will have issues with referring expressions that refer to counts of objects. Further, we need to be cognizant of and optimize for inference latency, which prevents us from using large-scale language or vision models that could potentially improve upon the current solution.

Ethics Statement
Although our solution has no unethical applications or risky broader impacts, we need to consider aspects of fairness. In our setting, the images shown to the users can contain people along with the products. We need to consider how sensitive queries, e.g., ones that refer to protected attributes of the people in the image or expressions that contain hateful or derogatory speech, should be resolved.
During the data collection and model training process, we carefully consider the types of referring expressions we curate and use to train the model. Expressions that contain references to protected and/or physical attributes of people are filtered out to ensure that our model does not handle sensitive queries.
The final loss is a weighted combination of the two losses averaged over the training dataset. Here λ ∈ [0, 1] is a scalar weight.
A.3 Pseudocode for the vision aligned text dot product

The pseudocode for the vision aligned text dot product is shown in Fig. 7.
A.4 Multimodal dot product tensors

A.5 Additional Details on the multimodal dataset
Some randomly sampled instances from the multimodal MTurk dataset for visual product selection are given in Fig. 6. To better simulate a real-world customer interaction with the voice assistant, the MTurkers are free to use any phrase to refer to the product image. Some MTurkers use specific product attributes like color, size, shape, product material, or product label text, while in other instances more ambiguous terms are used (e.g., "the animal one"). To dissect the dataset further, we encode a few randomly selected product images and their associated labels in a joint CLIP embedding space.
As seen in Fig. 9a, the product images are shown on the horizontal axis and the product labels are shown on the vertical axis. The numbers in the table are CLIP similarity scores ∈ [0, 1] between the images and the product labels (higher scores mean higher similarity). Fig. 9a has a dual purpose: first, it shows the products and their labels in a matrix format where products that match multiple labels, or labels that match multiple products, can be clearly seen; second, it shows the effectiveness of CLIP embeddings in quantifying image-text similarity. Ideally, the diagonal elements of the matrix should contain the largest scores, but we can see a few high off-diagonal similarity scores, which indicates high ambiguity. In Fig. 9b we look at the alignment between the images, their labels, and the referring utterance. It can be clearly seen from Fig. 9b, row 1, that the image matching the referring utterance ("the black one") has the highest CLIP similarity score (the middle image has the highest alignment score, 0.24, with the referring utterance, compared to the other two images in row 1).

A.6 Existing Context Carryover Dataset
The existing CC dataset is created by internal annotators who were shown the dialogue history, current turn, and context slots, and were asked to select all the appropriate slots for the current turn. The dialogs originate from a commercial voice assistant, and we process the data so that users are not identifiable ("de-identified"). The dataset spans 30 domains and 500 intents, and includes both within-domain (dialogs that span a single domain) and cross-domain cases (dialogs that span multiple domains). It has an average dialog distance length of 3.94, which is roughly 2 user turns and 2 system turns. The existing CC dataset has Train/Dev/Test split sizes of 1,280,000/158,043/158,000, respectively.

A.7 Experimental Setup
We set the Context Carryover framework to settings similar to the current commercial settings and run our experiments. The results in Table 1 and Table 2 use a context carryover threshold of 0.5. The context carryover threshold determines the probability above which a slot will be labeled as a carryover instance (i.e., label of 1). We get the pretrained CLIP embeddings from the open-source CLIP (Radford et al., 2021) repo under the MIT license. For the CC model we use an embedding size of 300 for the dialog encoder and intent encoder. For the slot encoder, CLIP visual, and CLIP text we use an embedding size of 512. We use CLIP (ViT-B/32) (Dosovitskiy et al., 2020) as the vision encoder and a transformer-based (Vaswani et al., 2017) text encoder as described in (Radford et al., 2019) for the CLIP text encoder. For the decoder, we use a single-layer transformer-based decoder with 12 attention heads, whose outputs are then passed to a single-layer feed-forward network to make binary decisions over the slots. The model is trained for 100 epochs using a batch size of 160 with an Adam optimizer and a learning rate of 0.001. We train on a single p3.16xlarge instance, which consists of 8 GPUs.

A.8 ALBEF results
We also experiment with ALBEF (Li et al., 2021b) embeddings trained using a large language model training framework (FitzGerald et al., 2022) and find that they perform similarly to CLIP, as seen in Table 3.

Figure 1: Product Selection Use Case

Figure 2: Scene Selection Use Case. The images are from unsplash.com and are used here only for illustrative purposes.

Figure 3: Video Selection Use Case

Figure 4: Augmented Context Carryover Models. Vision-only augmentation (a); vision, text, and similarity score augmentation (b).

Fig. 8a) shows the dot product between the utterance text and the product visual embeddings for visual product selection, for each carryover candidate. Fig. 8b) shows the dot product between the utterance text and the scene or video and associated products' visual embeddings for visual scene and video selection. Fig. 8c) shows the dot product between the utterance text and the product metadata text associated with each candidate, where the product metadata can be available for both visual product selection and visual scene and video selection.

Figure 6: Randomly sampled instances from the MTurk dataset. Each instance is a single referring utterance and multiple associated product images. The label array signifies the ground truth associated with the referring utterance: a label of 1 marks the ground-truth item, and 0 otherwise.

Figure 7: Pseudocode for vision-text embedding dot product similarity

Figure 8: Tensors that contain the dot product between the user's referring utterance text and a) a candidate's single product visual embedding for the "Visual product selection" use case, b) a candidate's scene or video and multiple product visual embeddings for the "Visual scene and video selection" use case, c) a candidate's metadata text embeddings for both use cases. Here, the metadata are the textual information associated with commercial products, provided by sellers or marketplace annotators and at times generated by the system. The "candidates" here refer to the visual list item candidates.

Table 1: Results on the multimodal test set for various CLIP augmentation schemes and the CC baseline. The "+ <modality type>" notation indicates an augmentation.

Table 2: Performance improvements on the MTurk test data when MTurk training data is added to the CC training set.