Mix-and-Match: Scalable Dialog Response Retrieval using Gaussian Mixture Embeddings

Embedding-based approaches for dialog response retrieval embed the context-response pairs as points in the embedding space. These approaches are scalable, but fail to account for the complex, many-to-many relationships that exist between context-response pairs. On the other end of the spectrum, there are approaches that feed the context-response pairs jointly through multiple layers of neural networks. These approaches can model the complex relationships between context-response pairs, but fail to scale when the set of responses is moderately large (>100). In this paper, we combine the best of both worlds by proposing a scalable model that can learn complex relationships between context-response pairs. Specifically, the model maps the contexts as well as responses to probability distributions over the embedding space. We train the models by optimizing the Kullback-Leibler divergence between the distributions induced by context-response pairs in the training data. We show that the resultant model achieves better performance as compared to other embedding-based approaches on publicly available conversation data.


Introduction
Retrieval-based response predictors (Ji et al., 2014; Yan et al., 2016; Wu et al., 2017a; Bartl and Spanakis, 2017; Whang et al., 2021; Xu et al., 2021; Han et al., 2021; Su et al., 2021; Gu et al., 2020) retrieve the response from a predefined set of responses given the dialog context. Such methods find application in a variety of real-world dialog modeling and collaborative human-agent tasks. For instance, dialog modeling frameworks typically utilize the notion of "intents" and "dialog flows", which aim to model the "goal" of a user utterance (Aronsson et al., 2021). To make the task of building and identifying such intents easier, some tools mine conversation logs to identify responses that are often associated with dialog contexts (intents) (Dhoolia et al., 2021) and then surface these responses for review by humans. These reviewed responses are then modeled into the dialog flow for different intents. Another instance of human-agent collaboration powered by system-returned responses is in 'Agent Assist' environments, where a system makes recommendations to a customer-support or contact-center agent in real time (Fadnis et al., 2020).

Figure 1: An example of a context with multiple valid responses. Note that each response contains different information and hence must have embeddings that are far away from each other. However, embedding-based approaches for retrieval attempt to bring all such responses close to the context and hence, close to each other.
The success of a good response retrieval system lies in learning a good similarity function between the context and the response. In addition, it also needs to be scalable so that it can retrieve responses from the universe of responses efficiently. These two requirements present a trade-off between the richness of scoring and scalability, as discussed below.
Trade-off between Scoring and Scalability: Typically, in neural dialog retrieval models, the contexts and the responses in the conversation logs are embedded as points in the embedding space (Lowe et al., 2015). Approaches such as contrastive learning (Bromley et al., 1993) are then used to ensure that the context is closer to the ground-truth response than to the other responses. Figure 1 shows a dialog context followed by multiple responses. Despite the apparent diversity among the responses, all of them are valid for the dialog context. Similarly, a generic response may be a valid response for several dialog contexts. A typical embedding-based approach for retrieval would bring the embedding of the dialog context close to the embeddings of all the valid responses (Karpukhin et al., 2020; Yu et al., 2021; Xiong et al., 2021; Luan et al., 2021a). However, this has the undesirable effect of making the valid, but diverse, responses gravitate towards each other in the embedding space.
Thus, typical embedding-based approaches for retrieval fail to capture the complex, many-to-many relationships that exist in conversations. More complex matching networks, such as Sequential Matching Networks (Wu et al., 2017a) and BERT-based cross-encoders (Chen et al., 2021b), jointly feed the context-response pairs through multiple layers of neural networks to generate the similarity score. While these approaches have proven to be effective for response retrieval, they are very expensive in terms of inference time. Specifically, if $N_c$ is the total number of dialog contexts and $N_r$ is the total number of responses available for retrieval during inference, these methods have a time complexity of $O(N_c N_r)$. Hence, they can't be used in a real-world setting for retrieving from thousands of responses.

Contributions:
In this paper, we present a scalable and efficient dialog-retrieval system that maps the contexts as well as the responses to probability distributions over the embedding space (instead of points in the embedding space). To capture the complex many-to-many relationships between the context and response, we use multimodal distributions such as Gaussian mixtures to model each context and response. The resultant model is referred to as 'Mix-and-Match'.
Intuitively, if a response is a valid response for a given dialog context, we want the corresponding probability distribution to be "close" to the context distribution. We formalize this notion of closeness among distributions by using the Kullback-Leibler (KL) divergence. Specifically, we minimize the KL divergence between the context distribution and the distribution of the ground-truth response, while maximizing the divergence from the distributions of other negatively-sampled responses. We derive approximate but closed-form expressions for the KL divergence when the underlying distributions are Gaussian mixtures. This approximation significantly alleviates the computational cost of the KL divergence, thereby making it suitable for use in real-world settings. We demonstrate our work on two publicly available dialog datasets, the Ubuntu Dialog Corpus (v2) (Lowe et al., 2015) and the Twitter Customer Support dataset, as well as on an internal real-world technical support dataset. Using automated as well as human studies, we demonstrate that Mix-and-Match outperforms recent embedding-based retrieval methods. Due to space limitations, we discuss a few related works in the Appendix.

Mix-and-Match
We consider a dialog to be a sequence of utterances $(u_1, \ldots, u_n)$. At any time-step $t$, the set of utterances prior to that time-step is referred to as the context. The utterance that immediately follows the context is referred to as the response. Instead of modeling the context and response as point embeddings, we use probability distributions induced by the context and the response on the embedding space, denoted as $p_c(z)$ and $p_r(z)$ respectively, where $z$ is any point in the embedding space $\mathbb{R}^d$.

Overview
An overview of the model is shown in Figure 2. The context and response are first encoded using a pre-trained BERT model. The model consists of a Gaussian Mixture Parameter Generator, $\pi(X, K)$, which takes as input an encoded text sequence $X$ along with the number of Gaussian components, $K$, and returns the mean $\mu_k$ and variance $\sigma^2_k$ of every Gaussian mixture component $k \in \{1, \ldots, K\}$ as its output. The encoded representations of the context and response from BERT are used to generate Gaussian mixture distributions over the embedding space $\mathbb{R}^d$ using the parameter generator $\pi$. We then compute the KL divergence between the context and response distributions and use a contrastive loss to bring the context closer to the ground-truth response than to other, negatively-sampled responses.

Text Encoder
The text encoder maps the raw text to a contextualized embedding. Given a text sequence, we split it into tokens using the BERT tokenizer (Devlin et al., 2019). The BERT encoder (Devlin et al., 2019) takes the tokens as input and outputs the contextualized embedding of each token. These embeddings are denoted as $X = (x_1, \ldots, x_m)$, where $m$ is the number of tokens in the text sequence.

Parameter Generation of Gaussian Mixtures
We use the parameter generator $\pi$ with inputs $X$ and $K$ to generate the parameters $\mu_k(X), \sigma^2_k(X)$ for each component of the mixture, $k \in \{1, \ldots, K\}$. For simplicity, we assume a restricted form of Gaussian mixture that assigns equal probability to each Gaussian component. Further, we also assume that the Gaussian components are axis-aligned, that is, their covariance matrices are diagonal. Specifically, the probability distribution over the embedding space $\mathbb{R}^d$ induced by the input text embeddings $X$ is

$$p(z \mid X) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{N}\big(z;\, \mu_k(X),\, \mathrm{diag}(\sigma^2_k(X))\big).$$

Given an input sequence of text $X$ with token embedding representations $x_1, \ldots, x_m$, we initialize $K$ trainable embeddings $e_1, \ldots, e_K$ with the same dimensions as $x_i$. These trainable embeddings are used to attend over $X$ to obtain attended token representations $a_1, \ldots, a_K$. That is, $a_k = \sum_{i=1}^{m} \alpha_{ik}\, x_i$, where $\alpha_{ik}$ are normalized attention weights obtained by attending with $e_k$ over the token embeddings $x_1, \ldots, x_m$. Finally, the attended token embeddings are passed through two linear maps in parallel to generate the mean and log-variance of each Gaussian component in the mixture, that is, $\mu_k(X) = f_1(a_k)$ and $\log \sigma^2_k(X) = f_2(a_k)$, where $f_1$ and $f_2$ are linear maps.
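For concreteness, the sketch below illustrates one way the parameter generator $\pi$ could be implemented in PyTorch. The softmax attention form, the masking, and all class and variable names are our assumptions; the text only specifies that trainable embeddings attend over the token embeddings and that two linear maps produce the means and log-variances.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMParameterGenerator(nn.Module):
    """Illustrative sketch of the parameter generator pi(X, K).

    Assumptions (not given in the text): softmax attention, PyTorch,
    and the module/variable names used here.
    """
    def __init__(self, hidden_dim: int, embed_dim: int, num_components: int):
        super().__init__()
        # K trainable query embeddings, one per Gaussian component.
        self.queries = nn.Parameter(torch.randn(num_components, hidden_dim))
        self.mean_head = nn.Linear(hidden_dim, embed_dim)     # f1: mean
        self.logvar_head = nn.Linear(hidden_dim, embed_dim)   # f2: log-variance

    def forward(self, token_embs: torch.Tensor, mask: torch.Tensor):
        # token_embs: (batch, m, hidden_dim); mask: (batch, m), 1 for real tokens.
        scores = torch.einsum("kh,bmh->bkm", self.queries, token_embs)
        scores = scores.masked_fill(mask.unsqueeze(1) == 0, float("-inf"))
        alpha = F.softmax(scores, dim=-1)                      # attention weights alpha_{ik}
        attended = torch.einsum("bkm,bmh->bkh", alpha, token_embs)  # a_k
        mu = self.mean_head(attended)                          # (batch, K, d)
        logvar = self.logvar_head(attended)                    # (batch, K, d)
        return mu, logvar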

Context and Response Encodings
Given the dialog context $c$ and response $r$, we generate the Gaussian mixture representations $p_c(z)$ (for the context) and $p_r(z)$ (for the response) using $\pi$, with $K$ and $L$ components respectively. The Gaussian components of the mixtures are denoted $p_c(z; k)$ (for the context) and $p_r(z; \ell)$ (for the response) and are given by

$$p_c(z; k) = \mathcal{N}\big(z;\, \mu_k(c),\, \mathrm{diag}(\sigma^2_k(c))\big), \qquad p_r(z; \ell) = \mathcal{N}\big(z;\, \mu_\ell(r),\, \mathrm{diag}(\sigma^2_\ell(r))\big),$$

where $\mu_k(c)$ and $\sigma^2_k(c)$ are the mean and variance of the $k$-th Gaussian component of the context, and $\mu_\ell(r)$ and $\sigma^2_\ell(r)$ are the mean and variance of the $\ell$-th Gaussian component of the response. The parameters of the text encoders (BERT and the $\pi$ module) for the context and the response are not shared.

Scoring Function
We want the context distribution to be 'close' to the distribution of the ground-truth response while simultaneously being far from the distributions induced by other responses. We use the KL divergence to quantify this degree of closeness. The KL divergence between the distributions $p_r$ and $p_c$ over the embedding space $\mathbb{R}^d$ is given by

$$\mathrm{KL}(p_r \,\|\, p_c) = \int_{\mathbb{R}^d} p_r(z) \log \frac{p_r(z)}{p_c(z)}\, dz.$$

This integral has a closed-form expression if both $p_r$ and $p_c$ are Gaussian. However, for Gaussian mixtures, this integral needs to be approximated.
We derive the following approximation to the KL divergence between two GMMs.
Theorem 1. Let $p_r$ and $p_c$ be two Gaussian mixture distributions with $L$ and $K$ Gaussian components respectively, as defined in (3) and (4).
The KL divergence between the two GMMs can be approximated by a quantity built from the divergences between individual components (a reconstruction is sketched below), where $p_c(\cdot\,; k)$ and $p_r(\cdot\,; \ell)$ are the $k$-th and $\ell$-th Gaussian components of the context and response distributions as defined in (3) and (4).
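The display equation of Theorem 1 is not reproduced in this copy. A plausible reconstruction, based on the min-matching derivation sketched in the Appendix, and therefore an assumption rather than the verbatim statement (in particular the additive constant), is:

$$\mathrm{KL}(p_r \,\|\, p_c) \;\approx\; \frac{1}{L}\sum_{\ell=1}^{L} \min_{k \in \{1,\ldots,K\}} \mathrm{KL}\big(p_r(\cdot\,;\ell) \,\big\|\, p_c(\cdot\,;k)\big) \;+\; \log\frac{K}{L}.$$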
A detailed derivation of the above approximation is provided in the Appendix. Note that the theorem above holds even when the individual components of the mixture are not Gaussian.
When the components are Gaussian, the KL divergence between two components can be computed tractably in closed form. For axis-aligned Gaussians,

$$\mathrm{KL}\big(p_r(\cdot\,;\ell)\,\big\|\,p_c(\cdot\,;k)\big) = \frac{1}{2}\sum_{i=1}^{d}\left[\log\frac{\sigma^2_{k,i}(c)}{\sigma^2_{\ell,i}(r)} + \frac{\sigma^2_{\ell,i}(r) + \big(\mu_{\ell,i}(r) - \mu_{k,i}(c)\big)^2}{\sigma^2_{k,i}(c)} - 1\right],$$

where $d$ is the dimension of the embedding space.
Using equations (6) and (7), we get a closed-form approximation to the Kullback-Leibler divergence between the context and response GMMs.
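To make the computation concrete, the following NumPy sketch combines the component-wise closed form with a min-matching aggregation. The aggregation form and all function names are assumptions consistent with the approximation sketched above, not necessarily the exact equation (6).

import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ) for every pair of components.

    mu1, var1: (L, d) response components; mu2, var2: (K, d) context components.
    Returns an (L, K) matrix of component-wise KL divergences (standard formula).
    """
    mu1, var1 = mu1[:, None, :], var1[:, None, :]   # (L, 1, d)
    mu2, var2 = mu2[None, :, :], var2[None, :, :]   # (1, K, d)
    return 0.5 * np.sum(
        np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0, axis=-1
    )

def approx_kl_gmm(mu_r, var_r, mu_c, var_c):
    """Approximate KL(p_r || p_c) between two equal-weight GMMs.

    Uses a min-matching aggregation (an assumption consistent with the
    derivation in the Appendix, not necessarily the paper's equation (6)).
    """
    L, K = mu_r.shape[0], mu_c.shape[0]
    pairwise = kl_diag_gaussians(mu_r, var_r, mu_c, var_c)   # (L, K)
    return pairwise.min(axis=1).mean() + np.log(K / L)

# Toy usage with random parameters.
rng = np.random.default_rng(0)
d, K, L = 128, 2, 2
score = approx_kl_gmm(rng.normal(size=(L, d)), rng.random((L, d)) + 0.1,
                      rng.normal(size=(K, d)), rng.random((K, d)) + 0.1)
print(score)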

Loss Function
We use the N-pair contrastive loss (Sohn, 2016) for training the distributions induced by the context and response. Intuitively, given a batch $B$ of context-response pairs, we minimize the KL divergence between the context and the true response while simultaneously maximizing the KL divergence with respect to other randomly selected responses. The loss for a given context-response pair $(c, r)$ contrasts these divergences (one plausible form is sketched below). We average this loss across all the context-response pairs in the batch and minimize it during training. The BERT encoders, the randomly initialized embeddings, as well as the linear layers for computing the means and variances, are trained in an end-to-end manner.
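A plausible instantiation of this N-pair contrastive loss over KL divergences is sketched below; the exact form, including the direction of the KL and any temperature scaling, is not reproduced in this copy, so this should be read as an assumption:

$$\mathcal{L}(c, r) \;=\; -\log \frac{\exp\!\big(-\mathrm{KL}(p_r \,\|\, p_c)\big)}{\sum_{r' \in B} \exp\!\big(-\mathrm{KL}(p_{r'} \,\|\, p_c)\big)},$$

where $B$ contains the ground-truth response $r$ together with the negatively-sampled responses in the batch.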

Inference
During inference, we are provided a context and a collection of responses to select from. We map the context as well as the list of responses to their corresponding probability distributions over the embedding space. Next, we compute the KL divergence between the distribution induced by the context and that of every response in the list. Using the expression derived in (6), this can be computed efficiently and involves only standard matrix operations. We select the top-$m$ responses with the least KL divergence, where $m$ is specified during evaluation.

Related Work
Our work is broadly related to two current areas of research: response retrieval and probabilistic embeddings.

Response Retrieval Systems
Depending on how the context and responses are encoded for retrieval, response-retrieval approaches can be classified into methods that use (i) independent encodings or (ii) joint encodings.

Independent Encodings: In these methods, the contexts and the responses are encoded independently and the resultant embeddings are fed to a scoring function. A common architecture employed by neural methods for dialog retrieval is a dual encoder. Here, the context and responses are encoded using a shared architecture but in different parameter spaces. Early versions of such methods employed LSTMs (Lowe et al., 2015), but more recently, pre-trained models have been used (Karpukhin et al., 2020; Lu et al., 2020; Reimers and Gurevych, 2019; Liu et al., 2021). Models such as DPR (Karpukhin et al., 2020) and S-BERT (Reimers and Gurevych, 2019) encode contexts and responses using dual encoders based on the BERT (Devlin et al., 2018) pre-trained model, and learn a scoring function using negative samples. Models such as Poly-Encoder (Humeau et al., 2019), MEBERT (Luan et al., 2021b), and ColBERT (Khattab and Zaharia, 2020) use multiple representations for dialog contexts instead of a single representation.
Joint Encoding: In contrast to methods that independently encode context and response pairs, methods such as Sequential Matching Networks (Wu et al., 2017b) and cross-encoders based on BERT (Nogueira and Cho, 2019; Chen et al., 2021b) jointly encode the dialog context and the response. However, such models are slow during inference because all candidate responses need to be jointly encoded with the dialog context for scoring at runtime. This is in contrast to dual-encoder architectures, where response embeddings can be computed offline and cached for efficient retrieval. Models such as ConvRT (Vakili Tahami et al., 2020) and TwinBERT (Lu et al., 2020) use distillation from a cross-encoder to train a better dual-encoder model.

Probabilistic Embeddings
Probabilistic embeddings have been applied in tasks such as building better word representations (Qian et al., 2021; Athiwaratkun et al., 2018), entity comparison (Contractor et al., 2016), facial recognition (Chen et al., 2021a), pose estimation (Sun et al., 2020), and generating multimodal embeddings (Athiwaratkun and Wilson, 2017; Chun et al., 2021). The motivation in some of these tasks is similar to ours; for instance, Qian et al. (2021) use Gaussian embeddings to represent words to better capture meaning and ambiguity. However, to the best of our knowledge, the use of probabilistic embeddings in dialog modeling tasks hasn't been explored. In this work, we represent dialog contexts and responses as mixtures of Gaussians and present approximate closed-form expressions for efficiently computing KL-divergence-based distance measures, thereby making them suitable for use in real-world settings.

Experiments
We answer the following questions through our experiments: (1) How does our model compare with recent dual-encoder based retrieval systems for the task of response retrieval? (2) Are the responses retrieved by our model more relevant and diverse? (3) Do human users of our system notice a difference in the quality of responses as compared to the recent ColBERT system?
Due to space limitations, we answer the following questions in the Appendix: 1) Is the improvement in retrieval performance a consequence of the extra learnable parameters in Mix-and-Match? 2) How does the performance of Mix-and-Match depend on the number of Gaussian components in the response and context GMMs?
The model and training details are provided in the Appendix.

Datasets
We conduct our experiments on two publicly available datasets, the Ubuntu Dialogue Corpus (v2.0) (Lowe et al., 2015) and the Twitter Customer Support Dataset. The Ubuntu Dialog Corpus v2.0 contains 500K context-response pairs in the training set and 20K context-response pairs each in the validation and test sets. The conversations deal with technical support for issues faced by Ubuntu users. The Twitter Customer Support Dataset contains ∼1 million context-response pairs in the training data and ∼120K context-response pairs in the validation and test sets. The conversations deal with customer support provided by several companies on Twitter. We also conduct our experiments on an internal real-world technical support dataset with ∼127K conversations. We will refer to this dataset as the 'Tech Support dataset' in the rest of the paper. The Tech Support dataset contains conversations pertaining to an employee seeking assistance from an agent (technical support) to resolve problems such as password reset, software installation/licensing, and wireless access. In contrast to the Ubuntu dataset, which used user forums to construct the data, this dataset has two clearly distinct users: the employee and the agent. In all our experiments, we model only the agent response turns.
For each conversation in the Tech Support dataset, we sample context and response pairs. Note that multiple context-response pairs can be generated from a single conversation. We create validation pairs by selecting 5000 conversations randomly and sampling their context-response pairs. Similarly, we create test pairs from a different subset of 5000 conversations. The remaining conversations are used to create training context-response pairs.

Baselines
We compare our proposed model against two scalable baselines: SBERT (Reimers and Gurevych, 2019) and ColBERT (Khattab and Zaharia, 2020), a recent state-of-the-art retrieval model. Similar to Mix-and-Match, both baselines use independent encoders (dual-encoders) to encode the contexts and responses. Hence, these baselines can be used for large-scale retrieval at an acceptable cost.

SBERT
SBERT (Reimers and Gurevych, 2019) uses two BERT encoders for embedding the inputs (context and response). We pass the contextualized embeddings at the last layer of BERT through a ReLU non-linearity followed by a linear layer to project them to a $d$-dimensional space. The projected embeddings are average-pooled to generate fixed-size embeddings for the context and response. Since the context and response are from two different domains, we found it crucial that the context and response encoders do not share parameters. We use the inner product between the context and response embeddings as the similarity measure and train the two encoders via a contrastive loss.

ColBERT
Just like SBERT, ColBERT (Khattab and Zaharia, 2020) uses two BERT encoders to encode the inputs and passes the outputs through a linear layer to generate $d$-dimensional embeddings. However, instead of pooling the output of the linear layer, a late interaction is computed between all the contextualized token embeddings of the context and response. Unlike the original implementation of ColBERT, we do not enforce the context and response encoders to share parameters. This is essential for achieving reasonable performance on dialogs. The model is trained via a contrastive loss. Please refer to the appendix for additional details about training and hyperparameter settings.
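For reference, the sketch below shows the general late-interaction (MaxSim) idea in PyTorch; the similarity measure, normalization, and function names are assumptions and may differ from the exact ColBERT variant used here.

import torch

def late_interaction_score(ctx_embs: torch.Tensor, resp_embs: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction (a sketch of the general MaxSim idea).

    ctx_embs:  (n_ctx_tokens, d) projected context token embeddings
    resp_embs: (n_resp_tokens, d) projected response token embeddings
    For every context token, take its maximum similarity over response tokens
    and sum; normalization details may differ from the setup in the paper.
    """
    sims = ctx_embs @ resp_embs.T            # (n_ctx_tokens, n_resp_tokens)
    return sims.max(dim=1).values.sum()      # sum of per-token max similarities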

Response Retrieval
In this setting, each context is paired with 5000 randomly selected responses along with the ground-truth response for the given context. The list of 5000 responses is randomly selected from the test data for each instance; hence, the response universe associated with each dialog context may be different. The task is then to retrieve the ground-truth response given the context. For efficient computation, all the responses in the test data are encoded once and stored. Note that this is only possible for dual-encoder architectures (such as Mix-and-Match, SBERT, and ColBERT); the major performance bottleneck in cross-encoder approaches arises from this step, where the response encodings depend on the context and hence need to be recomputed for every new dialog context.
For Mix-and-Match, the response encoder outputs the means and variances of the GMM induced by each response in the embedding space. We use a batch size of 50 to encode the responses and cache the generated parameters (means and variances) of the response GMMs.
Similarly, the context is encoded by the context encoder to output the means and variances of the components of the context GMM. We compute the KL divergence between the context distribution and the distribution of each response in the associated list of 5000 responses using the expressions derived in (6) and (7). The values are sorted in ascending order and the top-k responses are selected for evaluation.
A similar setting is used for SBERT and ColBERT, with the exception that the embeddings are stored instead of means and variances. Moreover, we sort the responses based on SBERT and ColBERT similarity in descending order.

Results
We use MRR and Recall@k for evaluating the various models. For evaluating MRR, we sort the set of 5000 responses associated with each context based on KL divergence in ascending order. For Recall@k, we pick the top-k responses with the least KL divergence. The percentage of contexts for which the ground-truth response is present in the top-k responses is referred to as Recall@k. The results are shown in Table 1. For Mix-and-Match, we discovered that the optimal recall occurs when the number of Gaussian components in the GMM is small. The variation of performance with the number of Gaussian components is given in the appendix.
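For concreteness, a small sketch of how Recall@k and MRR can be computed from the per-context KL scores (lower is better) is given below; the function and variable names are ours.

import numpy as np

def recall_and_mrr(kl_scores: np.ndarray, gold_index: np.ndarray, k: int = 5):
    """kl_scores: (n_contexts, n_candidates) KL divergences (lower is better).
    gold_index: (n_contexts,) position of the ground-truth response per context.
    Returns (Recall@k, MRR) averaged over contexts.
    """
    order = np.argsort(kl_scores, axis=1)                        # ascending KL
    ranks = np.argmax(order == gold_index[:, None], axis=1) + 1  # 1-based rank of the gold response
    recall_at_k = float(np.mean(ranks <= k))
    mrr = float(np.mean(1.0 / ranks))
    return recall_at_k, mrr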
As can be observed, SBERT, which uses a single embedding to represent the entire context as well as the response, achieves the lowest recall. By using all the token embeddings to represent the context and response, ColBERT achieves better performance than SBERT. Finally, by using Gaussian mixture probability distributions to represent the context and response, Mix-and-Match achieves a substantial improvement in Recall@k and MRR on all the datasets as compared to SBERT and ColBERT. Thus, the richer the representation of the context and response, the better the recall. In the appendix, we also include performance comparisons when the embedding size for SBERT and ColBERT is doubled. Note that the relative improvement is smaller on Tech Support, as there is less diversity among the responses in the training data of Tech Support.
The agents are trained to handle calls in a specific way, which reduces the diversity.
Re-ranking with cross-encoder: Instead of using the models above (SBERT, ColBERT, Mix-and-Match) for selecting a response, one may use them to filter a subset of responses. The filtered responses can then be re-ranked using a more powerful, albeit slower, model such as a cross-encoder. In Table 2, we use a BERT-based cross-encoder (Nogueira and Cho, 2019; Chen et al., 2021b) for re-ranking the top-100 responses retrieved by each of the models on the Ubuntu dataset. As can be observed, the scores of all the models improve significantly after re-ranking with the cross-encoder. Moreover, the scores achieved by Mix-and-Match are significantly higher than those of the other baselines.

Response Recommendation
The response retrieval setting described in the previous section is unrealistic, since it assumes that the ground-truth response is present in a set of 5000 responses. In reality, when a response retrieval model such as (Fadnis et al., 2020) is deployed for response recommendation, it must retrieve from the large set of all responses present in the training data (often running into hundreds of thousands of responses).
To deal with the large set of responses present in the training data, we encode them offline using the response encoder of Mix-and-Match. As in the previous section, we use a batch size of 50 for encoding the responses. After the means and variances of all the Gaussian components of the response GMMs have been generated, we save them to a file along with the corresponding responses. To ensure faster retrieval, we use Faiss (Johnson et al., 2019) for indexing the means of the Gaussian components of the response GMMs. Faiss is a library for computing fast vector similarities and has been used for vector-based search over huge sets. We use the IVFPQ index of Faiss (Inverted File with Product Quantization), which discretizes the embedding space into a finite number of cells. This allows for faster search computations.
We flatten the tensor of means of the Gaussian components of all response GMMs into a matrix of mean vectors. This matrix of mean vectors is added to the IVFPQ index. A pointer is maintained from the mean of each Gaussian component to the corresponding response as well as to the means and variances of its Gaussian components.
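A minimal sketch of this indexing scheme with Faiss follows; the index parameters (nlist, code size, nprobe), the stand-in data, and the bookkeeping array are illustrative assumptions rather than the settings used in the paper.

import faiss
import numpy as np

d = 128                                       # embedding dimension
# Stand-in for the cached response-component means (N responses, 2 components each).
component_means = np.random.randn(100_000, d).astype("float32")
component_to_response = np.repeat(np.arange(50_000), 2)   # response id for each component mean

# IVFPQ: coarse inverted lists plus product quantization of residuals.
nlist, m_codes, nbits = 256, 16, 8            # assumed index settings
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m_codes, nbits)
index.train(component_means)
index.add(component_means)

# Query time: search with each context-component mean, then pool candidate responses.
context_means = np.random.randn(2, d).astype("float32")   # K=2 components of one context (toy)
index.nprobe = 16
_, nn_ids = index.search(context_means, 10)                # top-10 component means per query
candidate_responses = set(component_to_response[nn_ids.ravel()])
# The candidates would then be re-ranked with the approximate GMM KL divergence.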
When a new context arrives, we compute the means and variances of its Gaussian components. For each Gaussian component, we retrieve the top-10 responses by using the mean of the Gaussian component as the search query. After retrieving the top-10 responses for each Gaussian component, we load the corresponding means and variances. Finally, we compute the KL divergence between the context GMM and the GMMs of all the retrieved responses. The values are sorted in ascending order and the top-k responses are selected for evaluation.

Language Generation Quality: Since the ground-truth response may not be present verbatim in the set, metrics such as recall and MRR cannot be computed in this setting. We therefore use the BLEU metric (Papineni et al., 2002) for evaluating the quality of the responses. As can be observed from the table, the BLEU scores are quite low for the Ubuntu dataset, suggesting that most retrieved responses have very little overlap with the ground-truth response. As in the previous section, SBERT is outperformed by ColBERT in terms of BLEU-2 and BLEU-4. Finally, Mix-and-Match outperforms both models on all three datasets. This suggests that the responses retrieved by Mix-and-Match are relevant to the dialog context.

Diversity of Responses:
The primary strength of the Mix-and-Match system is its capability to associate multiple diverse responses with the same context. To capture the diversity among the top-k responses retrieved for a given context, we measure the distance between every pair of retrieved responses and average it across all pairs. Thus, if $R$ is the set of retrieved responses for a given context, the BERT distance among the responses in $R$ is the average pairwise distance between their pooled BERT embeddings $e(r)$ (one plausible form is sketched below).
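The display equation for this metric is not reproduced in this copy; a plausible form consistent with the description (with Euclidean distance as an assumed choice of distance) is:

$$\mathrm{BERTDist}(R) \;=\; \frac{1}{|R|\,(|R|-1)} \sum_{r \in R} \sum_{\substack{r' \in R \\ r' \neq r}} \big\lVert e(r) - e(r') \big\rVert_2 .$$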
The results are shown in Table 3. As can be observed from the table, SBERT has the least diversity among the retrieved responses. This is expected, since all the retrieved responses must be close to the context embedding and hence close to each other. ColBERT fares better in terms of diversity, since it uses multiple embeddings to represent contexts and responses. Finally, Mix-and-Match, which uses GMMs to represent contexts and responses, achieves the best diversity. This suggests that having multiple or probabilistic embeddings helps in improving the diversity among the retrieved responses.

Scalability: Next, we evaluate the time taken by the Mix-and-Match model to retrieve from the Faiss index as compared to the baselines. The similarity/KL-divergence computations, as well as the vector similarity searches in the Faiss index, are performed on a single A100 GPU. Unsurprisingly, SBERT achieves the lowest latency of 8.9 ms for retrieval per dialog context. ColBERT achieves a latency of 89.7 ms. The latency of Mix-and-Match ranges from 36.7 ms (for 1 Gaussian component) to 68.8 ms (for 32 Gaussian components), depending on the number of Gaussian components in the mixture. Note that, even in the worst case, the latency is less than 0.1 s, making the model suitable for practical use in the real world.
Qualitative Study: Table 6 shows a sample with a multi-turn dialog context where the user is complaining about bad cellphone coverage. As before, the responses retrieved by both ColBERT and Mix-and-Match are presented. As can be seen, Mix-and-Match returns a relevant response at the top-ranked position (highlighted in green) and related responses at other positions. In contrast, ColBERT retrieved generic or unrelated responses.
Table 6: Sample of a multi-turn dialog context. Mix-and-Match returns a relevant response at the top-ranked position and related responses at other positions; in contrast, ColBERT retrieved generic or unrelated responses.

Dialog Context
User: the worst mobile serive in 2015 2017 cellphone badservice miami florida
Agent: hey send us a dm and we'll ensure a great experience channeyt
User: tmobilehelp poor service low signal slow service it s miami

Responses Retrieved by ColBERT
(i) our apologies, we are currently experiencing a system challenge which we are working to resolve. kindly bear with us.
(ii) our sincere apologies for any inconveniences caused, we are having a technical issue, resolution is underway
(iii) it is not our intention to make you upset. please feel free to reach out to us if you have already called back and still need further assistance.

Responses Retrieved by Mix-and-Match
(i) how long has this been happening? what type of phone do you have? please send us a dm so we can fix it. thank you
(ii) that's not good at all! please dm us with your zip code and nearest streets intersection to check the coverage
(iii) does this happen in specific locations? when did you begin to experience these issues with your connection? are you having issues making calls and sending text as well?

Ground Truth Response: let's flip thing around! meet in the dms https://t.co/sbivwmm6x2

Human Evaluation: We also conducted a human study comparing the output responses of ColBERT and Mix-and-Match. We used samples from the Twitter dataset for this study, as it does not require domain expertise to assess the relevance of responses. Three users were asked to review 30 Twitter dialog contexts along with the top-4 responses returned by each system in a response recommendation setting. Users were presented the outputs from each system in random order and were blind to which system returned each response set. We asked our users the following:

1. Given the dialog context and the response sets from two different systems, label each response with a "yes" or "no" depending on whether the response is a relevant response recommendation for the dialog context. Thus, each response returned by both systems was individually labeled by three human users.
2. Given the dialog context and the response sets from two different systems, which of the response sets is more diverse? Thus, each context-recommendation set was assessed by three human users.
We count the number of votes received by the top-ranked response of each system and report the percentage of wins for each system. In addition, we also report a head-to-head comparison in which the two models were assessed for diversity (no ties). Finally, to assess whether diversity is accompanied by relevance in the response set, we define a metric called Diversified-Relevance (DR), which weighs the diversity wins by the number of relevant responses returned by each system. Specifically, $\mathrm{DR}_{\text{model}}$, the Diversified-Relevance for a model ∈ {ColBERT, Mix-and-Match}, is computed from the following quantities (one plausible form is sketched below): $M$, the number of dialogs used in the study; 4, the number of response recommendations per dialog; $\mathbb{1}\{\mathrm{win}^{\text{model}}_{i}\}$, an indicator that takes the value 1 if the model was voted as more diverse in its responses to the $i$-th dialog context; and $\mathbb{1}\{\mathrm{relevance}^{\text{model}}_{ij}\}$, an indicator that takes the value 1 if the $j$-th response recommendation by the model was voted as relevant.
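The display equation for DR is not reproduced here; a form consistent with the description above (the normalization by 4M is our assumption) is:

$$\mathrm{DR}_{\text{model}} \;=\; \frac{1}{4M} \sum_{i=1}^{M} \mathbb{1}\{\mathrm{win}^{\text{model}}_{i}\} \sum_{j=1}^{4} \mathbb{1}\{\mathrm{relevance}^{\text{model}}_{ij}\}.$$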
As can be seen in Table 4, the top-ranked response returned by Mix-and-Match received a significantly higher number of votes in its favour (40%) as compared to ColBERT. In 43% of the cases there was no clear winner. Finally, in 58% of the dialogs, Mix-and-Match was found to present a more diverse set of response recommendations.
In order to assess whether the diversity is accompanied by relevance, we also report the DR scores in Table 5. As can be seen, the DR score for Mix-and-Match is significantly higher than that of ColBERT (0.35 vs 0.25). Overall, the results from our human study indicate that Mix-and-Match returns more diverse and relevant responses.

Conclusion
By modelling contexts and responses as multimodal distributions, we allow the network to be more expressive without forcing the representations of unrelated responses to move closer together, as would have been the case with traditional dual-encoder learning objectives. We derived and presented closed-form expressions for efficiently computing KL-divergence-based distance measures and showed their suitability for real-world settings. We demonstrated the effectiveness of our retrieval system on three different datasets: Ubuntu, Twitter, and an internal, real-world Tech Support dataset. Additional experiments for response relevance, including a human study, were performed on the publicly available datasets. We found that not only is our model able to retrieve more relevant responses as compared to recent retrieval systems, it also presents more diverse results.

Limitations
Mix-and-Match relies heavily on the diversity of responses for a given input to achieve good performance. As a result, the model doesn't achieve a significant performance boost when the diversity is limited. In our experiments, we observed this trend on our internal Tech Support dataset, which has standard responses for most queries.
Moreover, it isn't straightforward to store the context and response GMMs in a nearest-neighbor index. In our experiments, we used a workaround where we store the means of all the Gaussian components in the nearest-neighbor index. To retrieve, we used the L2 distance between the means of the Gaussian components of the response and context GMMs. The retrieved results were then re-ranked using the KL-divergence approximation discussed in the paper. By ignoring the variance term, we are forced to assume that the Gaussian components in the GMMs are spherical.
Supplementary material for Mix-and-Match: Scalable Dialog Response Retrieval using Gaussian Mixture Embeddings

1 Proof of Theorem 1

Proof. The proof follows a similar line of reasoning as the proof provided in (Hershey and Olsen, 2007). The KL divergence between $p_r$ and $p_c$ can be written as

$$\mathrm{KL}(p_r \,\|\, p_c) = \int p_r(z) \log p_r(z)\, dz - \int p_r(z) \log p_c(z)\, dz.$$

The first term is the negative of the entropy of $p_r$, while the second term is the cross-entropy between $p_r$ and $p_c$. We approximate the cross-entropy by expanding the GMM in terms of its Gaussian components and applying Jensen's inequality. The first equality in this expansion follows by multiplying and dividing the terms within the log by a variational distribution $q_\ell(k)$, and the last inequality follows by applying Jensen's inequality (a sketch is given after this proof). The resulting upper bound holds for any choice of $q$, and the bound can be tightened by minimizing it with respect to $q_\ell(k)$.
We assume $q_\ell$ to be a one-hot vector, which can be non-zero for only one context component $k$. Every one-hot $q_\ell$ has an entropy of 0 and hence the corresponding term in the bound is always 0. For a one-hot $q_\ell$, the bound is minimized when $q_\ell$ assigns all its weight to the component of the context GMM with the lowest cross-entropy; using this optimal one-hot $q$, the cross-entropy bound becomes an average over the response components of the minimum component-wise cross-entropy. The entropy of $p_r$ can be derived as a special case of the same bound by replacing $p_c$ with $p_r$; thus, the entropy of a GMM can be upper-bounded analogously. Finally, the KL divergence is approximated by substituting (6) and (7) into (2). Note that the resultant quantity is neither an upper nor a lower bound, but is still a useful approximation.
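The displayed chain of equations in this proof is not reproduced in this copy; the following sketch of the Jensen bound it describes is our reconstruction under the stated equal-weight assumptions:

$$-\int p_r(z) \log p_c(z)\, dz \;=\; -\frac{1}{L}\sum_{\ell=1}^{L} \int p_r(z;\ell) \log\!\Big(\sum_{k=1}^{K} q_\ell(k)\, \frac{p_c(z;k)/K}{q_\ell(k)}\Big)\, dz \;\le\; \frac{1}{L}\sum_{\ell=1}^{L} \sum_{k=1}^{K} q_\ell(k) \Big[ H\big(p_r(\cdot\,;\ell),\, p_c(\cdot\,;k)\big) + \log K + \log q_\ell(k) \Big],$$

where $H(p, q) = -\int p(z) \log q(z)\, dz$ denotes the cross-entropy. Choosing each $q_\ell$ to be one-hot on the context component with the lowest cross-entropy tightens the bound, and applying the same bound with $p_c$ replaced by $p_r$ bounds the entropy term.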

Ablation studies
In our ablation studies, we want to answer the following questions: 1) Is the improvement in retrieval performance a consequence of the extra learnable parameters in Mix-and-Match? 2) How does the performance of Mix-and-Match depend on the number of Gaussian components in the response and context GMMs? To account for the extra linear layer in Mix-and-Match (see Section 3.1 below), we double the embedding size for SBERT and ColBERT. The resultant retrieval scores are presented in Table ??. As can be observed from the table, the retrieval performance of SBERT and ColBERT improves when the embedding size is doubled. However, Mix-and-Match with half the embedding size (the size of the mean vector) still outperforms these baselines on the Twitter and Tech Support datasets. On the Ubuntu dataset, the performance of the various models is comparable.

Variation of retrieval accuracy with the number of Gaussian components
Next, we plot the retrieval accuracy of Mix-and-Match against the number of Gaussian components in the GMM. We vary the number of Gaussian components in the context and response GMMs from 1 to 32 and compute recall@5 on the Ubuntu and Twitter datasets. As can be observed from Figure 1, the recall is high when the number of Gaussian components is small but starts decreasing as we increase the number of components. Overall, the best performance occurs when K = L = 2 or K = L = 4.

Qualitative Study
Table 2 shows a sample with a single-turn dialog context where the user is complaining about flight boarding positions. The responses retrieved by both ColBERT and Mix-and-Match are presented.
As can be seen, Mix-and-Match returns a relevant response at the top-ranked position (highlighted in green) and another related response at the second position. In contrast, ColBERT retrieved generic or unrelated responses.

Figure 2: An overview of our model, Mix-and-Match.

2 Model and training details

We ran all our experiments on a single Nvidia A100 GPU. We use the pretrained 'bert-base' model provided by Hugging Face (https://huggingface.co/bert-base-uncased). The dimension of the embedding space is fixed to 128 for all the models. The number of Gaussian components in the context and response distributions is selected by cross-validation from the set {1, 2, 4, 8, 16, 32}. We use the 'AdamW' optimizer provided by Hugging Face (the Adam optimizer with a fixed weight decay) with a learning rate of 1.5e-5 for all our experiments. A fixed batch size of 16 context-response pairs is used. To prevent overfitting, we use early stopping with the loss function defined in Section 2.6, evaluated on the validation set, as the stopping criterion. The time taken by the various algorithms to reach convergence on the Ubuntu dataset is as follows: 37.5 hours for SBERT, 38.5 hours for ColBERT, and 38.8 hours for Mix-and-Match with K = L = 2. The total number of parameters in each model (including the baselines) is approximately 219 million.

3.1 Effect of the extra parameters in Mix-and-Match

All three models used in our experiments (SBERT, ColBERT and Mix-and-Match) use the same BERT architecture for encoding. However, for SBERT and ColBERT, the BERT encodings are passed through a single linear layer to generate the final embeddings. Instead, for Mix-and-Match, two parallel linear layers are used to generate the means and log-variances of the Gaussian mixtures. These layers are shared by all the Gaussian components in the mixture.

Figure 1: Variation of recall@5 with the number of Gaussian components in the context and response GMM.

Table 1: Comparison of Mix-and-Match against baselines on retrieval tasks. Given a context, the task involves retrieving from a set of 5000 responses that also contains the ground-truth response. The number of Gaussian components in the GMM is provided in parentheses.

Table 3: Comparison of Mix-and-Match against baselines for the response recommendation task. Given a context, the task involves retrieving from the set of all responses in the training data. The computation of diversity is discussed in detail in Section 4.4.

Table 4: The top-ranked response returned by the Mix-and-Match model is found to be relevant more often (40% vs 17%) than that of ColBERT. In addition, the set of responses returned by Mix-and-Match is also more diverse (58% vs 42% for ColBERT).

Table 5: The Diversified-Relevance scores for ColBERT and Mix-and-Match in our human study.