We’ve had this conversation before: A Novel Approach to Measuring Dialog Similarity

Dialog is a core building block of human natural language interactions. It contains multi-party utterances used to convey information from one party to another in a dynamic and evolving manner. The ability to compare dialogs is beneficial in many real-world use cases, such as conversation analytics for contact center calls and virtual agent design. We propose a novel adaptation of the edit distance metric to the scenario of dialog similarity. Our approach takes into account various conversation aspects such as utterance semantics, conversation flow, and the participants. We evaluate this new approach and compare it to existing document similarity measures on two publicly available datasets. The results demonstrate that our method outperforms the other approaches in capturing dialog flow, and is better aligned with the human perception of conversation similarity.


Introduction
Measuring semantic textual similarity lies at the heart of many natural language and text processing tasks, such as sentence classification, information retrieval, and question answering. Traditional text representation approaches, such as high dimensional and sparse feature vectors, have been boosted by the introduction of efficiently learned embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017), unleashing the full power of the dense semantic representation of words. Subsequently, new methods were developed for contextual representation of words, sentences, paragraphs, and documents, facilitating the assessment of semantic similarity between larger portions of text (Le and Mikolov, 2014; Peters et al., 2018; Devlin et al., 2019).
Conversations differ from documents composed of multiple sentences or paragraphs in a number of key ways. They are semi-structured documents constructed from a sequence of utterances, and they present unique characteristics such as having an author for each utterance and a conversation flow that can be viewed as a skeleton built of dialog acts (McTear et al., 2016). Indeed, it has been shown that models adapted specifically to analyze conversations outperform those built to analyze general documents when applied to dialog data (Wu et al., 2020; Ohsugi et al., 2019; Henderson et al., 2020; Zhang et al., 2020). A plausible assumption, therefore, is that dialog similarity assessment could benefit from an approach adjusted specifically for this domain.
Related work in this field is relatively sparse. Appel et al. (2018) derive two different similarity functions between conversations, one for content similarity based on TF-IDF, and the other for the conversation structure, using engineered features related to the dialog flow. The scores, computed independently for the two dimensions, are further combined to infer the overall conversation similarity ranking. Xu et al. (2019) learn a distance function between utterances and conversations based on expert judgments and later use this function to cluster conversations. Their approach is specifically tailored for conversations with an automatic dialog system (i.e., bot), limiting its applicability to the much wider conversational domain.
In this work we draw on the concept of edit distance, the family of metrics and algorithms (Wagner and Fischer, 1974) widely used for sequence analysis. This analysis is done mainly at the character level for strings or at the nucleotide or amino-acid level in computational biology (Navarro, 2001). The edit distance has also been applied to sequences of sentences to detect the differences between documents (Zhang and Litman, 2014; Barzilay and Elhadad, 2003).
We propose combining the power of edit distance in assessing sequence similarity with the power of distributional semantics to form a novel similarity measure between conversations. This new measure, convED, takes into account the utterance semantics, as well as the dialog flow and its unique traits. We suggest and evaluate a framework for a seamless, non-intrusive, and elegant adaptation of the widely-used edit distance metric to the scenario of conversation similarity. The suggested approach can be practically leveraged for downstream applications, such as those in the domain of dialog pattern mining.

Model
In this section, we present a brief reminder of what the edit distance metric involves (Section 2.1), followed by the unique adaptations designed for the scenario of conversation similarity (Section 2.2).

Minimum Edit Distance
Our approach is inspired by the widely-used notion of sequence similarity, edit distance: the minimal number of insertions, deletions, and substitutions required to transform one sequence into another. Sequences are typically drawn from the same finite set of distinct symbols, e.g., the alphabet letters for strings. Given sequences a and b of lengths m and n, the distance d_{i,j} between two arbitrary sequence prefixes, of lengths i and j respectively, is defined recursively by

d_{i,j} = min( d_{i-1,j} + w_del,  d_{i,j-1} + w_ins,  d_{i-1,j-1} + w_sub(a_i, b_j) )    (1)

for i ∈ [1, m], j ∈ [1, n], with base cases d_{i,0} = i·w_del and d_{0,j} = j·w_ins, where w_del, w_ins, and w_sub are the deletion, insertion, and substitution weights, respectively; these vary according to the precise application, and w_sub(a_i, b_j) = 0 when a_i = b_j. The final edit distance between the two sequences a and b, d_{m,n}, may then be computed using dynamic programming (Wagner and Fischer, 1974). The chain of steps needed to convert one sequence into another constitutes the sequence alignment, where each element in the first sequence is paired with an element or gap in the second one. As an example, one possible alignment between the words 'shine' and 'train' results in the following steps; assuming an insertion and deletion cost of 1 and a substitution cost of 2, the edit distance between these strings is 6.
s h - i n e
| |   | |
t r a i n -

(two substitutions, s→t and h→r; one insertion of 'a'; two matches, i and n; and one deletion of 'e', for a total cost of 2 + 2 + 1 + 0 + 0 + 1 = 6)

A dialog can essentially be viewed as a temporal organization of utterances and their underlying dialog acts. In this context, a dialog act refers to a certain function, such as a request or statement. The unique nature of dialogs, as opposed to strings, poses unique challenges to the alignment procedure. We next describe these challenges, as well as the solutions we applied to address them.
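The recursion above translates directly into a dynamic-programming table. A minimal Python sketch, using the weights of the string example (insertion and deletion cost 1, substitution cost 2 for mismatching symbols):

```python
def edit_distance(a, b, w_del=1, w_ins=1, w_sub=2):
    """Minimum edit distance between sequences a and b via dynamic programming."""
    m, n = len(a), len(b)
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del
    for j in range(1, n + 1):
        d[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else w_sub  # matches cost nothing
            d[i][j] = min(d[i - 1][j] + w_del,      # delete a[i-1]
                          d[i][j - 1] + w_ins,      # insert b[j-1]
                          d[i - 1][j - 1] + sub)    # substitute (or keep)
    return d[m][n]

print(edit_distance("shine", "train"))  # prints 6, matching the example above
```

Recovering the actual alignment requires backtracking through the table, omitted here for brevity.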

Conversation Edit Distance (convED)
This work focuses on multi-party conversations in the domain of customer service, where a dialog is represented by an interaction between two actors: a customer and a customer support agent. Formally, for two conversations, c_1 and c_2, of lengths m and n, we define the sequences of utterances to be u^1_1, ..., u^1_m and u^2_1, ..., u^2_n, produced by actors a^1_1, ..., a^1_m for the first and a^2_1, ..., a^2_n for the second conversation, respectively.
Utterance Substitution Cost Motivated by the intuition that the alignment of two utterances, u^1_i and u^2_j, should be driven by their semantic similarity, we define their substitution cost as a function of their distance in a semantic space. Namely, we encode u^1_i and u^2_j into distributional representations e^1_i and e^2_j using the Universal Sentence Encoder (Cer et al., 2018). We define their substitution cost as the cosine distance of the representations, scaled by a factor α (see details on α optimization in Appendix A.1):

w_sub(u^1_i, u^2_j) = α · (1 − cos(e^1_i, e^2_j))    (2)

Alignment by Actor Type Structural similarity between two dialogs inherently implies a similarity between utterances matched by their actor type, whether customer or agent. Conversations in the customer support domain are likely to comprise a sequence of requests, clarification questions, solutions, actions, and confirmations. Dialogs that agree on the assignment of such patterns to actors would naturally be considered more similar than those that do not. We impose inter-actor agreement during the alignment process by weighting the substitution of utterances produced by different actors with an infinitely high cost; consequently, the algorithm avoids such cross-actor alignments. We re-define Equation 2:

w_sub(u^1_i, u^2_j) = α · (1 − cos(e^1_i, e^2_j)) if a^1_i = a^2_j, and ∞ otherwise    (3)

A sample invocation of the framework (with w_ins and w_del weights set to 1) is presented in Table 3, showing the resulting utterance-level alignment of two dialogs in the domain of booking tickets.
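A minimal sketch of convED under these definitions, assuming utterance embeddings are already computed (plain vectors stand in for Universal Sentence Encoder outputs) and α = 1:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def conv_ed(utts1, utts2, alpha=1.0, w_ins=1.0, w_del=1.0):
    """convED between two conversations, each a list of (actor, embedding) pairs.

    Substitution cost is alpha times the cosine distance between utterance
    embeddings; substituting utterances of different actors is forbidden
    (infinite cost), so the alignment never crosses actor types.
    """
    m, n = len(utts1), len(utts2)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del
    for j in range(1, n + 1):
        d[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a1, e1 = utts1[i - 1]
            a2, e2 = utts2[j - 1]
            w_sub = alpha * cosine_distance(e1, e2) if a1 == a2 else math.inf
            d[i][j] = min(d[i - 1][j] + w_del,
                          d[i][j - 1] + w_ins,
                          d[i - 1][j - 1] + w_sub)
    return d[m][n]
```

In practice the embeddings would come from a sentence encoder; any fixed-size vectors work for the sketch.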

Evaluation and Results
We evaluated the effectiveness of our model through two distinct approaches: intrinsic evaluation, assessing the ability of the model to capture dialog flow (Section 3.3), and external human evaluation via crowd-sourced annotations (Section 3.4). We compared our model to two competitive baselines used for estimating text similarity.

Datasets
Two dialog datasets were used for evaluation: Schema-Guided Dialog (SGD) dataset (Rastogi et al., 2020) and MSDialog (Qu et al., 2018). SGD is a large corpus of task-oriented dialogs that were created by crowd-workers and follow pre-defined dialog skeletons. MSDialog is a real-world dialog dataset of question answering interactions collected from a forum for Microsoft products, where a subset of dialogs (over 2K) was labeled with metadata, including dialog acts.

Baseline Models
We selected two competitive baselines for the evaluation of document similarity assessment: (1) Universal Sentence Encoder, a common choice for generating sentence-level embeddings, where a document embedding is computed by averaging its individual sentence representations; and (2) doc2vec (Le and Mikolov, 2014), an embedding algorithm that generates a distributional representation of documents, regardless of their length. The latter has been shown to outperform other document embedding approaches (Lau and Baldwin, 2016; Zhang and Baldwin, 2019). The distance between two dialogs, avgSemDist and d2vDist, respectively, is then computed as the cosine distance between the final dialog representations. For the d2vDist measure, we trained a doc2vec implementation on over 20K SGD and 35K MSDialog dialogs, respectively. Both models were trained with default parameter values for 40 epochs. After model training, individual document (dialog) representations were inferred using the pre-trained doc2vec models.
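The avgSemDist baseline can be sketched as follows; `encode` is a caller-supplied stand-in for the sentence encoder (an assumption, since the paper uses the Universal Sentence Encoder):

```python
import math

def avg_sem_dist(doc1_sents, doc2_sents, encode):
    """avgSemDist: cosine distance between the mean sentence embeddings
    of two documents. `encode` maps a sentence to a fixed-size vector."""
    def mean_embedding(sents):
        vecs = [encode(s) for s in sents]
        # Average each dimension over all sentence vectors
        return [sum(dim) / len(vecs) for dim in zip(*vecs)]

    u, v = mean_embedding(doc1_sents), mean_embedding(doc2_sents)
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm
```

Note that averaging discards utterance order entirely, which is exactly the structural information convED is designed to retain.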

Intrinsic Evaluation
We next assess the key capability of the conversation similarity measure: the ability to capture conversation structure and its temporal flow. Note that although our intrinsic evaluation is done against datasets that include labeled dialog acts, our novel method, convED, does not rely on those for computing the similarity between two conversations. The dialog acts are used merely for evaluation purposes. Thus, the method can be applied to any conversational data, whether between two humans or between a bot and a human. It is also not restricted to specific participant roles and can handle multiple participants.

Dialog Structural Edit Distance (structED)
Both SGD and a subset of MSDialog are annotated with rich metadata, including acts and slot names (SGD), and intent type, the equivalent of acts (MSDialog). For example, the agent utterance "When would you like to check in?" in the SGD corpus is labeled with an act of type REQUEST and a slot value of type check_in_date. Consequently, a dialog structure for the flow of actions and corresponding slot values can be extracted using this metadata. While faithfully representing a dialog flow, this structural pattern does not fully reflect (albeit is not completely agnostic to) the precise semantics of the utterances underlying the acts, a setup that offers a natural test-bed for evaluating our similarity measure against other methods.
Specifically, given a dialog, we define its action flow as the temporal sequence of its dialog acts or intents, concatenated with alphabetically sorted slots where they exist.
As a concrete example, utterance #2 in conversation 1 in Table 3 would be represented as [REQUEST_location], and utterance #10 in conversation 2 would be transformed into [OFFER_location,OFFER_time].
For a conversation c_i, we denote the sequence of its dialog acts and slots by da_i. Note that within a certain domain, the possible dialog acts and slots span a fixed set. Therefore, the traditional edit distance metric can be applied to assess the distance between the dialog act flows of two conversations. Adhering to the conventional approach, we define the cost of insertion and deletion as 1, and the cost of substitution as 2. The dialog structural edit distance (structED) between conversations c_i and c_j is then computed as the edit distance between the two sequences da_i and da_j, normalized by the length of the longer sequence.
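A sketch of structED under stated assumptions: each turn is given as an (act, slots) pair extracted from the metadata, act tokens are built as described above, and the costs follow the convention just defined (insert/delete 1, substitute 2), normalized by the longer flow:

```python
def action_flow(turns):
    """Map a dialog to its action flow: one token per utterance, built from
    the labeled act and its alphabetically sorted slot names."""
    flow = []
    for act, slots in turns:
        flow.append(act + "_" + ",".join(sorted(slots)) if slots else act)
    return flow

def struct_ed(turns1, turns2):
    """structED: edit distance between two action flows (insert/delete = 1,
    substitute = 2), normalized by the longer flow's length."""
    a, b = action_flow(turns1), action_flow(turns2)
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 2
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[m][n] / max(m, n)
```

For instance, a turn labeled OFFER with slots location and time becomes the token OFFER_location,time, as in the example above.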
Correlation to convED We hypothesize that the pairwise conversation distance represented by the convED measure (Section 2.2) will exhibit higher proximity to structED than the distance computed by either of the baseline models. We tested this hypothesis by calculating the four measures on all distinct conversation pairs (c_i, c_j), i ≠ j, in a conversation set C. We then computed Pearson's correlation between each of {convED, avgSemDist, d2vDist} and structED. Since structED carries little semantic information, the highest correlation is indicative of the measure that most faithfully captures inter-dialog structural similarity. We performed our evaluation on a subset of SGD dialogs in the domain of Events, due to their diverse nature, and on the set of MSDialog conversations. Using a bootstrapping setup, we randomly sampled 100 subsets of 200 conversations and averaged over the individual sample correlations.

Table 2: Mean Pearson's correlation between structED and the pairwise dialog distance computed using each model. The best result in a column is boldfaced. Significant differences between convED and the two baselines are marked by '**' (t-test, p<.001).
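The correlation computation can be sketched as follows; `measure` and `struct_ed` are caller-supplied pairwise distance functions, and the bootstrap resampling is omitted for brevity:

```python
import itertools
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_to_structed(conversations, measure, struct_ed):
    """Pearson's r between a candidate distance measure and structED,
    computed over all distinct conversation pairs (c_i, c_j), i != j."""
    pairs = list(itertools.combinations(conversations, 2))
    xs = [measure(c1, c2) for c1, c2 in pairs]
    ys = [struct_ed(c1, c2) for c1, c2 in pairs]
    return pearson(xs, ys)
```

Since both measures are symmetric, each unordered pair is evaluated once via `itertools.combinations`.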
Ablation study We next study the effect of alignment by actor type (Section 2.2) on the evaluation results (Table 2). Relaxing the alignment constraint, and thereby using Equation 2 to compute the substitution cost, resulted in correlations of 0.538 for SGD (Events) and 0.267 for MSDialog. While a considerable drop is evident for MSDialog, the SGD results remain practically unaffected. We attribute the latter result to the schematic nature of SGD dialogs, which lack naturalistic variation: customer and agent utterances follow a predefined pattern, and differ to an extent that prevents the algorithm from aligning (semantically distant) cross-actor utterances, even when same-actor alignment is not strictly imposed.

Human Evaluation
We further evaluated the convED measure by comparing it to the human perception of dialog similarity. We hypothesized that the suggested approach would exhibit higher agreement with human judgement than the stronger baseline, avgSemDist (on the SGD data).
Rating the precise degree of similarity between two dialogs is an extremely challenging task due to the subjective nature of the relative perception of conversation similarity. Rather than directly estimating a similarity value through scale-based annotation, we cast the annotation task as a two-way comparison scenario. We presented the crowd with a conversation triplet: one anchor and two candidate conversations. We used the Appen annotation platform, targeting only the highest-quality workers, with 5 annotators providing judgements for a sample of 500 triplets. Appendix B contains the guidelines for the annotation task.
Conversation Triplet Selection Our inspection of the data reveals that for 68.4% of randomly selected triplets with an anchor and two candidate conversations (a, c_1, c_2), the two methods, convED and avgSemDist, agree on their judgement of the relative similarity of c_1 and c_2 to a. This observation limits the potential benefit of a crowd-sourcing task designed to test which approach better resembles human judgement in cases where the two methods disagree. We therefore adhere to a retrospective annotation paradigm, where triplets are selected such that the two approaches yield contrasting judgements on the relative similarity of conversations c_1 and c_2 to the anchor a. Appendix B presents an annotation triplet example.

Annotation Results
We limited the crowdsourcing evaluation to the subset of annotation examples with at least 80% (4 out of 5) inter-annotator agreement; this resulted in 229 samples out of 500. Treating these high-confidence judgements as the ground truth, we computed the ratio of triplets that agree with human intuition for each of the two methods: convED and avgSemDist. A higher ratio indicates the method that more closely resembles human judgement on dialog similarity. The evaluation yielded 73.3% and 26.7% agreement with human judgements for convED and avgSemDist, respectively, corroborating our hypothesis that our measure better captures the human perception of dialog similarity.

Runtime Considerations
In this work we used a common dynamic programming implementation that computes the edit distance between two sequences in quadratic time; this is sufficiently efficient for the relatively short sequences subject to alignment. Recall that the semantic similarity between two utterances is calculated by computing utterance embeddings and then measuring their cosine similarity, where the former is the most time-consuming part of the flow. Caching pre-computed utterance representations makes the overall computation efficient: computing convED between two conversations takes 5-6 ms on a CPU, compared to nearly 200 ms per dialog pair without caching.

Conclusions and Future Work
We presented a novel approach for measuring dialog similarity that captures both conversation semantics and structural properties. Our evaluation shows that this measure outperforms the baselines with respect to both intrinsic evaluation and agreement with human judgements. The framework is easily adaptable to different settings by manipulating the cost functions: for example, multi-party chat can be addressed by assigning a lower distance to utterances from parties that share the same role, and the cost of general chit-chat utterances can be reduced, focusing the measure on the semantic similarity of the conversations' essence. Our future work includes enhancing the proposed measure with additional dialog-related traits, applying it to downstream tasks, and adapting the framework to multiple-conversation alignment.

Ethical Considerations
We have collected crowd annotations using the Appen platform. Due to the task difficulty, the mean hourly rate was 100% higher than the US federal minimum wage and contributors were offered additional bonuses to incentivize high quality work. In a contributor satisfaction survey conducted by the platform, our pay was rated 4.3/5 and clarity of instructions was rated 4.8/5.
Contributors only provided answers to multiple-choice questions or selected text spans from the presented dialogs, and did not create any new textual content. No identifiable information about the contributors will be released with the data. Below are the annotation guidelines supplied to our annotators in the crowd-sourcing experiment.

Goal
The goal of this task is to assess the similarity between conversations in the domain of customer service. The purpose of conversations is to assist customers in obtaining information about music and sports events and making reservations. A typical conversation consists of multiple turns (interactions) between a customer and a human agent. Conversations normally follow a predefined structure with slight variations. As such, most interactions include queries about types, dates and locations of events, followed by booking tickets, and confirmation on the agent side. Figure 1 presents an example conversation from the dataset. Conversations will often introduce some deviations from the depicted flow. As an example, customers may ask for a certain date or number of tickets, and then change their mind and details of their request, e.g., asking for a different number of tickets or another date. In some cases, customers only want to get information (date and time, location) about certain events, without actually making a reservation.

Rules and Tips
In this task you will judge how similar conversations are, where similarity refers to multiple dimensions (the order of dimensions below does not necessarily imply relative importance):
• Topical similarity: how similar are the topics discussed, e.g., event type.
• Conversation flow similarity: how similar is the structure of conversations, e.g., interactions between the two actors, the final conversation outcome (e.g., reservation made or not).
Topics and structure are considered the major aspects that affect conversation similarity. Other details tend to vary between conversations but have only a minor (or no) effect on the judgement and should therefore be ignored. As such, the precise event location or the name of the sports team selected should not affect the decision on conversation similarity. Consequently, two conversations that only differ in the following aspects should be considered identical: greetings and thanks (e.g., thank you for your help), precise sports team or artist names (e.g., France Rocks Festival), precise locations and dates (e.g., tomorrow, 61 West 62nd street), and the precise number of tickets requested (e.g., 4 tickets).
Each annotation sample includes three conversations: the anchor conversation (in the middle, with a grey background) and two candidate conversations, one on the left-hand side and one on the right-hand side. Your task is to decide which of the two candidate conversations is more similar to the anchor. Note that both candidates can exhibit varying extents of similarity to the anchor, spanning the whole range from very similar to completely distinct. Even if both candidates are seemingly very different from the anchor, your task is still to decide which of the two is more similar: the left or the right one.

Annotation Steps
• Read all three conversations carefully, beginning with the anchor (the middle conversation).
• Answer the content question related to the anchor conversation. To answer the question, please select the relevant text section from the conversation and copy it to the answer text box. Some questions may entail answers that vary in their precise phrasing (e.g., "day after tomorrow", "the day after tomorrow"): select the phrasing you find most appropriate.
• Think over the various aspects of similarity discussed above, and how they apply to your case. Note that the above guidelines leave some room for your intuition and (often subjective) judgement. However, you should be able to lay out the rationale behind each of your decisions.
• Select the conversation more similar to the anchor between the two: click the radio button below -either below the left hand-side conversation, or the right-hand side conversation.
We estimate the average time for each annotation example as a couple of minutes. Try to be decisive. In rare cases where you cannot decide between the two, click the "I can't decide" radio button, located above the anchor conversation.

Examples
As an example, consider the three conversations in Figure 2, where the middle one is the anchor.
Annotation answer: the right-hand conversation is more similar to the anchor due to both higher topical similarity and structural similarity.