TaskDiff: A Similarity Metric for Task-Oriented Conversations

The popularity of conversational digital assistants has resulted in the availability of large amounts of conversational data, which can be utilized for improved user experience and personalized response generation. Building these assistants using popular large language models like ChatGPT also requires additional emphasis on prompt engineering and evaluation methods. Textual similarity metrics are a key ingredient for such analysis and evaluations. While many similarity metrics have been proposed in the literature, they have not proven effective for task-oriented conversations as they do not take advantage of unique conversational features. To address this gap, we present TaskDiff, a novel conversational similarity metric that utilizes different dialogue components (utterances, intents, and slots) and their distributions to compute similarity. Extensive experimental evaluation of TaskDiff on a benchmark dataset demonstrates its superior performance and improved robustness over other related approaches.


Introduction
Task-oriented conversational assistants have become increasingly popular in multiple industries, enabling users to perform tasks such as travel reservations, banking transactions, online shopping, etc., through multi-turn conversations. The increased use of these assistants has led to the availability of valuable user-assistant conversation logs (Budzianowski et al., 2018; Andreas et al., 2020), prompting efforts to extract insights from them.
A key aspect of such conversational analytics is identifying similarities and dissimilarities between conversations. This will enable developers to improve the user experience, including personalized response generation, next-action recommendations, and information retrieval (Yaeli et al., 2022; Bag et al., 2019; Gao et al., 2020; Li et al., 2022). The popularity of large language models like ChatGPT and Llama 2 (Touvron et al., 2023) has resulted in a race to create custom task-oriented conversational assistants in enterprise domains like finance and retail (Wu et al., 2023). However, evaluating these assistants has become an important challenge and requires effective metrics that can measure their performance across similar user-assistant conversations.
Measuring semantic textual similarity has been extensively studied for textual sources like documents, social media, transcripts, etc. However, there has been limited prior work studying similarity in task-oriented conversation settings (Appel et al., 2018; Lavi et al., 2021). Most approaches leverage popular word embeddings like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), or pre-trained models like Universal Sentence Encoder (Cer et al., 2018) and Sentence-BERT (Reimers and Gurevych, 2019), to obtain vector representations of utterances, and then use distance-based approaches such as cosine and edit distance to compute the similarity between text snippets.
While such approaches can identify semantic relationships between texts, task-oriented conversations present several challenges that limit their effectiveness. First, such conversations consist of distinct components (intents, slots, and utterances) that impact the similarity and overlap between conversations. For instance, users can have different objectives (e.g., booking travel vs. product returns), or even have the same intents but provide different levels of slot information (Ruane et al., 2018). Additionally, information is typically provided over multiple conversation turns, and each turn could involve multiple user intents and slots. Finally, the same set of tasks can be expressed through numerous possible utterances, depending on the user's choice of phrasing, order of sentences, use of colloquialisms, digressions, etc. (Guichard et al., 2019). Hence, relying solely on distance-based similarity of utterance embeddings would adversely impact performance. In this work, we present TaskDiff, a novel similarity metric designed for task-oriented conversations to address the above challenges. Figure 1 shows multiple users having similar conversations about making bookings for a trip, but with re-ordered tasks or paraphrased utterances with different slot values. It also shows that prior work is not robust to such differences, which commonly occur in task-oriented conversations.
An ideal metric for conversational similarity should be able to identify that the overall goal of the conversations in Figure 1 is the same. TaskDiff represents the structure of conversations as distributions over the different task-oriented components and combines the geometry of these distributions with optimal transport to measure the similarity between conversations. Our approach is inspired by prior work in topic modelling (Kusner et al., 2015; Yurochkin et al., 2019) that has shown the effectiveness of comparing the structure of distributions, albeit in different settings. We evaluate TaskDiff on a benchmark task-oriented conversation dataset and demonstrate its effectiveness, presenting examples that illustrate its improvement over existing approaches.

Definitions
A task-oriented conversational system supports a pre-defined set of user intents I and their corresponding slots or parameters S. Each conversation C_i consists of a multi-turn sequence of utterances U_i between the user and the system or agent, a subset of active intents, and slot-value information provided by the user, i.e., I_i ⊆ I and S_i ⊆ S. Our objective is to compute the similarity between task-oriented conversations, given their components K = {U, I, S}, i.e., utterances, intents, and slot information.

Approach
TaskDiff measures similarity between task-oriented conversations as a function of the distance between their component-wise distributions. For each component k ∈ K, we represent its distribution over every conversation and compute similarity as the cumulative cost of transporting the component-wise distributions of one conversation to another.
Figure 2 shows an overview of TaskDiff. We first mask the slot values in every conversation with their corresponding '<slot name>' from the ontology, before using SBERT to generate conversational embeddings. This masking ensures that entities representing the slot values do not incorrectly bias the embeddings or make them ambiguous (Shi et al., 2018). For instance, the embedding similarity between the unrelated utterances "I want a ticket to the Big Apple" and "I want a ticket to the Apple conference" could be incorrectly influenced by the word 'Apple', but masking with the appropriate slot names (e.g., <arrival_city> and <product_name>) resolves this. We denote ∆_U^l as the distribution of utterance embeddings of a single conversation.
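The masking step can be sketched as a simple substitution over annotated slot spans. Here `mask_slots` is a hypothetical helper (not the authors' code), assuming the dataset's annotations provide the surface value of each slot:

```python
import re

def mask_slots(utterance, slot_values):
    """Replace each annotated slot value in the utterance with its
    '<slot_name>' placeholder from the ontology.

    slot_values: dict mapping slot name -> surface value in the text.
    """
    masked = utterance
    for name, value in slot_values.items():
        masked = re.sub(re.escape(value), f"<{name}>", masked)
    return masked

print(mask_slots("I want a ticket to the Big Apple",
                 {"arrival_city": "the Big Apple"}))
# -> I want a ticket to <arrival_city>
```

The masked utterances, rather than the raw ones, are what get encoded with SBERT, so entity mentions such as 'Apple' no longer leak into the embedding space.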
We then compute probability distributions ∆_I^n and ∆_S^m for each conversation over the set of intents I and slots S as

∆^n = {(p_1, ..., p_n) : p_i ≥ 0, Σ_{i=1}^n p_i = 1}

where each p_i reflects the frequency of occurrence of intents and slots over the utterances. For example, ∆_I^n for conversation C_i represents the probability of all n intents within C_i.
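These distributions are normalized frequency counts over the ontology. The sketch below shows one way to compute them; `component_distribution` is an illustrative name, not the paper's implementation:

```python
from collections import Counter

def component_distribution(observed, vocabulary):
    """Normalized frequency of each vocabulary item over a conversation.

    observed: e.g. the intent (or slot) labels active at each turn.
    vocabulary: the full ontology of intents I or slots S.
    """
    counts = Counter(observed)
    total = sum(counts.values()) or 1  # guard against empty conversations
    return [counts[v] / total for v in vocabulary]

intents = ["SearchFlight", "SearchFlight", "ReserveHotel"]
print(component_distribution(intents, ["SearchFlight", "ReserveHotel", "RentCar"]))
```

Intents that never occur in the conversation receive probability zero, so every conversation's distribution lives on the same simplex over the ontology.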
We then compute a separate cost matrix M for each component, whose entry M_{i,j} represents the cost to move between two points (i, j) in its distribution. We compute each entry using the Euclidean distance between the embeddings generated for each component. Intuitively, conversations with similar intents, slot information, and analogous language will have similar distributions, and hence a lower cost of transportation, i.e., high similarity. Any differences in their components would incur a larger cost, and hence reflect lower similarity.
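A minimal sketch of the pairwise Euclidean cost matrix, assuming each component point already has an embedding vector:

```python
import numpy as np

def cost_matrix(emb_a, emb_b):
    """Pairwise Euclidean distances between two sets of embeddings.

    emb_a: (n, d) array, emb_b: (m, d) array -> (n, m) cost matrix M,
    where M[i, j] is the distance between point i of the first
    conversation's component and point j of the second's.
    """
    diff = emb_a[:, None, :] - emb_b[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

M = cost_matrix(np.array([[0.0, 0.0], [1.0, 0.0]]),
                np.array([[0.0, 0.0], [0.0, 3.0]]))
# M[0, 0] is 0 (identical points); M[0, 1] is the distance from
# [0, 0] to [0, 3], i.e. 3.0.
print(M)
```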
Given distributions α ∈ ∆_k^a, β ∈ ∆_k^b, ∀k ∈ K and the cost matrix M, the 1-Wasserstein optimal transport distance (Vallender, 1974) between them is

W_1(α, β) = min_{T ∈ Π(α, β)} Σ_{i,j} M_{i,j} T_{i,j}

where M_{i,j} = d(i, j) denotes the cost matrix and d(·, ·) denotes the distance between points of the distributions. We then define the similarity (TaskDiff) between two task-oriented conversations C_1 and C_2 as the weighted sum of the W_1 distances between their respective components:

TaskDiff(C_1, C_2) = Σ_{k ∈ K} γ_k W_1(∆_k(C_1), ∆_k(C_2))

where C_i^⊕ = {U_i, I_i, S_i} represents the conversation's components K, i.e., utterances, intents, and slots, and γ_k is a hyperparameter reflecting the influence of each component on the similarity.
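The paper computes W_1 with the POT library; the sketch below instead solves the same transport linear program with SciPy to stay self-contained. `wasserstein1` and `taskdiff_distance` are illustrative names under that assumption, not the authors' code:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein1(alpha, beta, M):
    """Exact 1-Wasserstein distance between discrete distributions
    alpha (n,) and beta (m,) under cost matrix M (n, m), found by
    solving the transport LP: minimize <M, T> over plans T whose
    marginals are alpha and beta."""
    n, m = M.shape
    A_eq = []
    for i in range(n):                 # row sums of the plan T equal alpha
        row = np.zeros(n * m)
        row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row)
    for j in range(m):                 # column sums of the plan T equal beta
        col = np.zeros(n * m)
        col[j::m] = 1.0
        A_eq.append(col)
    res = linprog(M.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([alpha, beta]), bounds=(0, None))
    return res.fun

def taskdiff_distance(comp_1, comp_2, costs, gamma):
    """Weighted sum of W1 distances over the components K = {U, I, S}."""
    return sum(gamma[k] * wasserstein1(comp_1[k], comp_2[k], costs[k])
               for k in gamma)

# Two toy intent distributions over the same two intents: all mass
# sits on intent 0 in one conversation and intent 1 in the other.
alpha = np.array([1.0, 0.0])
beta = np.array([0.0, 1.0])
M = np.array([[0.0, 2.0],
              [2.0, 0.0]])
print(wasserstein1(alpha, beta, M))  # all mass moves at cost M[0, 1] = 2
```

Identical distributions yield a distance of zero, which is what makes the metric insensitive to the order in which tasks appear within a conversation.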

Dataset
We use SGD (Rastogi et al., 2020), a benchmark dataset of multi-turn task-oriented conversations between users and agents spanning 20 domains (e.g., travel, dining). Its 20,000 conversations are annotated with active intents and slot information.

Baselines
We compare TaskDiff to three existing approaches:

1. SBERT: A state-of-the-art approach to measure similarity between conversational embeddings using cosine similarity (Reimers and Gurevych, 2019).

2. Conversational Edit Distance (ConvED): A dialogue similarity metric that aligns utterances between conversations and computes the edit distance between their embeddings (Lavi et al., 2021).

3. Hierarchical Optimal Transport (HOTT): A document similarity metric that models topics using Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and subsequently uses the 1-Wasserstein distance on the topic and text embeddings (Yurochkin et al., 2019).
We conduct our experiments on an Intel Core i9 with 64GB of RAM. We implement TaskDiff in Python, leveraging the POT library (Flamary et al., 2021) for the 1-Wasserstein optimal transport distance. After a hyperparameter search, γ was set to 2, 1, and 1 for the intent, utterance, and slot components, respectively.

k-NN Classification
We evaluate the ability of the different approaches to accurately classify similar SGD conversations into the correct domains using k-NN. From Table 1, we observe that TaskDiff outperforms SBERT, HOTT, and ConvED, demonstrating the importance of considering conversational components beyond just utterances (i.e., intents and slots) for similarity, and the need for masking to avoid the adverse influence of entities. The utterance alignment coupled with edit distance in ConvED helps compared to SBERT, but requires annotations for alignment that may not always be available. We also see that HOTT returns the lowest accuracy, since LDA often picks topics outside the actual conversational intents due to its reliance on word frequencies; this incorrectly skews the optimal transport distributions, thereby impacting classification.
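Because TaskDiff yields pairwise distances rather than feature vectors, the k-NN evaluation works over a precomputed distance matrix. A minimal sketch, with `knn_predict` as a hypothetical helper and toy data standing in for TaskDiff distances:

```python
import numpy as np

def knn_predict(dist_row, train_labels, k=5):
    """Majority vote over the k nearest training conversations, given a
    row of precomputed distances from a test conversation to the
    training set."""
    nearest = np.argsort(dist_row)[:k]          # indices of k smallest distances
    labels, counts = np.unique(train_labels[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy distances from one test conversation to 5 training conversations.
D = np.array([0.1, 0.2, 0.9, 0.8, 0.15])
y = np.array(["travel", "travel", "dining", "dining", "travel"])
print(knn_predict(D, y, k=3))  # -> travel
```

The same precomputed-distance setup applies to each baseline, so the comparison isolates the quality of the metric itself.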

Conversational Clusters
We visualize the conversational clusters formed by the different approaches on SGD using k-means, setting k to 20 (the number of domains) and running 20 iterations. From Figure 3, we observe that TaskDiff results in the most well-formed and distinct clusters, followed by SBERT, which has some cluster overlap and lower distinction. The clusters resulting from ConvED and HOTT show a significant amount of overlap, demonstrating their inability to distinguish between similar and dissimilar conversations.

Ablation Study
We perform an ablation study using 200 randomly selected dialogues to highlight the influence of the different components that enable TaskDiff's effectiveness over approaches like SBERT. As shown in Table 2, masking the slot values within the utterances results in a 14% improvement in accuracy over SBERT, since the embedding similarity is no longer influenced by incorrect biases or ambiguity from the slot values, as described in Section 2.2. Additionally, we see that optimal transport (OT) based similarity over unmasked utterances suffers from the same drawbacks, which are resolved once masks are introduced. Finally, adding intents and slots to the optimal transport (i.e., the full TaskDiff) results in a 26% improvement in accuracy over SBERT, due to the additional information about the dialogues provided by these components, highlighting their importance when measuring task-oriented conversation similarity.

Robustness to Reordering
We evaluate the robustness of the approaches in a common setting where users provide the same tasks in a different order within the conversation. We perturb the SGD dataset such that 30% of the utterances in each conversation are reordered, and compute each perturbed conversation's distance from its original for each approach. The average distance over all perturbed conversations in Table 3 shows that TaskDiff returns an exact match on these conversations: representing conversations as distributions over their components (i.e., intents, slots, utterances) makes it agnostic and robust to such changes. The comparison approaches, however, are not as robust, with ConvED performing poorly due to its reliance on alignments between utterances.
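The order-invariance follows directly from the distributional representation: a frequency distribution over intents (or slots, or utterance embeddings) is a multiset statistic, unchanged by permuting the turns. A small sketch, mirroring the frequency representation from the approach section (`intent_distribution` is an illustrative helper):

```python
import random
from collections import Counter

def intent_distribution(intents, vocab):
    """Order-insensitive frequency distribution over the intent ontology."""
    counts = Counter(intents)
    total = sum(counts.values())
    return tuple(counts[v] / total for v in vocab)

vocab = ["SearchFlight", "ReserveHotel", "RentCar"]
original = ["SearchFlight", "ReserveHotel", "RentCar", "SearchFlight"]
shuffled = original[:]
random.shuffle(shuffled)

# The distributions coincide, so the W1 distance between the original
# and reordered conversation is zero for this component.
print(intent_distribution(original, vocab) == intent_distribution(shuffled, vocab))  # True
```

Alignment-based metrics like ConvED, by contrast, pay an editing cost for every displaced utterance, which explains their larger distances under this perturbation.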

Related Work
Efforts across many natural language tasks, including sentiment analysis (Poria et al., 2016), recommendation systems (Magara et al., 2018), and question answering (Sidorov et al., 2015), have relied on distance-based similarity measures over text embeddings (Wang and Dong, 2020). The use of optimal transport over text distributions has shown promising results in document similarity (Solomon, 2018), resulting in popular metrics like the word mover's distance (WMD) (Kusner et al., 2015) and supervised WMD (Huang et al., 2016). Recently, Yurochkin et al. (2019) used optimal transport over topic models for documents, demonstrating a significant improvement in performance over traditional distance-based measures. However, direct application of such approaches to task-oriented dialogues is challenging due to the unique structure and different components of conversations, as shown in our results.

Conclusion
In this paper, we present TaskDiff, a novel metric to measure the similarity between task-oriented conversations. It not only captures semantic similarity between utterances but also utilizes dialog-specific features like intents and slots to identify the overall objective of the conversations. We demonstrate that, unlike existing metrics, taking advantage of these unique components is critical and results in significantly improved performance. As part of future work, we will investigate the inclusion of additional dialog features on open-domain dialog datasets and the utilization of TaskDiff to improve the performance of various downstream conversational tasks.

Limitations
We demonstrate in this work that TaskDiff is a superior and more robust similarity metric compared to existing state-of-the-art approaches for task-oriented conversations. Because it uses optimal transport to compute similarity as a function of differences over the component distributions (intents, slots, and utterances), TaskDiff relies on being given an ontology of the intents and slots present across the conversations. However, this is a fair assumption for the domain of task-oriented conversations, and such ontologies are leveraged by real-world deployments such as Google Dialogflow, IBM Watson Assistant, and Amazon Lex.

Figure 1: Demonstrating robustness of TaskDiff over prior approaches for multiple conversational scenarios.

Figure 2: Overview of TaskDiff illustrating steps for masking utterances, generating distributions over different conversation components, and computing similarity using the optimal transport cost between conversations' distributions.

Figure 3: Conversations clustered using k-means and color-coded by domain names.

Table 1: Accuracy scores for k-NN classification.

Table 2: Ablation study of TaskDiff with k-NN classification accuracy.

Table 3: Impact of conversational reordering.