Task-Oriented Clustering for Dialogues

A reliable clustering algorithm for task-oriented dialogues can help developers analyze dialogues and define dialogue tasks efficiently. It is challenging to directly apply prior normal-text clustering algorithms to task-oriented dialogues, due to the inherent differences between the two, such as coreference, omission and diverse expressions. In this paper, we propose a Dialogue Task Clustering Network (DTCN) model for task-oriented clustering. The proposed model combines context-aware utterance representations and cross-dialogue utterance cluster representations for task-oriented dialogue clustering. An iterative end-to-end training strategy is utilized for dialogue clustering and representation learning jointly. Experiments on three public datasets show that our model significantly outperforms strong baselines in all metrics.


Introduction
Task-Oriented Dialogue Clustering (TODC) aims to group task-oriented dialogues into different clusters according to their underlying tasks. Since each cluster includes dialogues for one specific task, this brings convenience for task induction and definition. Especially for large collections of unlabeled human-human dialogues, TODC can be employed to help induce and define new tasks rapidly, which is important for the design of task-oriented dialogue systems.
Most prior studies focus on normal-text clustering and have made significant progress via keyword extraction (Bafna et al., 2016; Neto et al., 2000), topic models (Blei et al., 2001; Onan et al., 2017), and deep clustering (Xie et al., 2016; Guo et al., 2017; Jiang et al., 2017; Yang et al., 2017). However, inherent differences between task-oriented dialogues and normal texts make the above methods difficult to apply directly to task-oriented dialogues. The first difficulty is that coreference and information omission occur frequently in dialogues (Su et al., 2019), which makes it harder to build good utterance representations in dialogue than in normal text. The second difficulty is that task-related slot names and intents are scattered implicitly across utterances and expressed diversely. In most cases, only slot values are given in dialogues, without explicit slot names. Only by comparing utterances across different dialogues can we find task-related implicit information, as shown in Fig.1. Considering these special characteristics of task-oriented dialogues, we emphasize that TODC should utilize in-dialogue relations between utterances to build context-aware representations for each utterance, and utilize cross-dialogue similarity between utterances in different dialogues to induce implicit task-related concept information.

Figure 1: From top to bottom, the figure shows an example of the in-dialogue discourse relation between utterances. From left to right, it shows an example of the cross-dialogue similarity relation between utterances: implicit task-related concept information such as "inform-intent:search-house" can be concluded from different dialogues by grouping utterances with similar semantics.

Code: https://github.com/Ryan-Lv/DTCN
To address the above problems, we propose a Dialogue Task Clustering Network (DTCN) for TODC.
The key points of DTCN are twofold. First, we construct an in-dialogue utterance adjacency graph for each dialogue and encode it with a graph attention network (GAT) to build context-aware representations. Second, we cluster all utterances to induce implicit task-related concepts, and then learn utterance cluster representations to exploit this information. Both kinds of representations are further integrated into dialogue representations. Finally, the model is trained with a two-stage strategy consisting of pre-training and joint-training: the former pretrains a Transformer-based autoencoder with the proposed Gate-based Transformer decoder to obtain initial clustering assignments, while the latter jointly trains the whole model with a self-training strategy to optimize dialogue representations and dialogue cluster assignments iteratively.
Experimental results on three constructed public dialogue datasets (SGD-S, SGD-M and Multiwoz-T) show that our model significantly outperforms existing strong text clustering algorithms in all metrics on TODC. In particular, we achieve a 19.76% improvement in accuracy on the SGD-S dataset compared with the best baseline, which indicates that the proposed dialogue representation method can capture more task-related information.
In summary, the contributions of our paper are as follows:
• We propose an unsupervised Dialogue Task Clustering Network (DTCN). To the best of our knowledge, this is the first work on task-oriented clustering for dialogues. Our model learns dialogue representations and clusters dialogues simultaneously by fusing representations of both utterances and utterance clusters.
• We propose a context-aware utterance representation learning model, which uses a Graph Attention Network to efficiently capture in-dialogue structural information between utterances and learns the representations with the proposed Gate-based Transformer decoder.
• Experiments on three public datasets show that the proposed model significantly outperforms the existing strong baselines in all metrics on TODC.

Related work
Clustering Data representation and the clustering algorithm are the two keys to addressing clustering problems. Previous works (Hartigan, 1979; McLachlan and Basford, 1988; Blei et al., 2001) mainly treated feature transformation and clustering independently: data are mapped into a feature space and then directly fed into a clustering algorithm. In recent years, owing to the development of deep learning, more and more deep clustering methods (Caron et al., 2018; Xie et al., 2016; Guo et al., 2017; Yang et al., 2017; Jiang et al., 2017) have been proposed, which obtain feature representations and cluster assignments simultaneously.
Graph Neural Network Recently, there has been a surge of interest in Graph Neural Network (GNN) (Wu et al., 2020b) approaches for graph representation learning. Several GNN variants (Velickovic et al., 2018; Kipf and Welling, 2017) have been proposed and applied to dialogue-related tasks. Prior work proposed the Graph Attention Matching Network and the Recurrent Graph Attention Network, both based on the Graph Attention Network, to encode utterances, schema graphs and previous dialogue states. Ghosal et al. (2019) proposed the Dialogue Graph Convolutional Network, based on the Graph Convolutional Network (Kipf and Welling, 2017), to model inter- and self-party dependencies to improve context understanding.

Task formulation
Given an unlabeled dialogue dataset D = {d_j}_{j=1}^{N_dia}, where N_dia denotes the total number of dialogues in the dataset and d_j = {u_i}_{i=1}^{I} denotes one dialogue with I utterances, Task-Oriented Dialogue Clustering (TODC) aims to group D into K_dia clusters according to the underlying tasks.

The Proposed Model
The proposed Dialogue Task Clustering Network (DTCN) is composed of five modules, as shown in Fig.2, and is trained with a two-stage training strategy. In the first stage, we use an autoencoder to learn context-aware utterance representations for initial clustering assignments, in which the Utterance Encoder (UE) and the Structural Context Encoder (SCE) serve as the encoder and the Utterance Decoder (UD) module serves as the decoder. In the second stage, two new modules are introduced on top of the pretrained autoencoder: the Utterance Cluster Representation Learning (UCRL) module for learning utterance cluster representations, and the Dialogue Representation Learning (DRL) module for learning dialogue representations. An iterative training strategy is then adopted to jointly optimize dialogue clustering assignments and dialogue representations.

Utterance Encoder
The UE module encodes each utterance into an initial embedding. Specifically, for the i-th utterance u_i = {w_t}_{t=1}^{m_i}, we first calculate the word encoding e_t ∈ R^{d_mod} for each word w_t as shown in Eq.1, where d_mod is the embedding size, emb_t is the word embedding, and pos_t is the position encoding calculated by the sinusoidal encoding method (Vaswani et al., 2017). Then, {e_t}_{t=1}^{m_i} is fed into a Transformer encoder and a role embedding r_i ∈ R^{d_mod} is added to obtain the initial representation h_i ∈ R^{d_mod} of utterance u_i, as shown in Eq.2, where the mean over all word outputs of the Transformer encoder is used as the utterance output.
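The sinusoidal position encoding pos_t used in Eq.1 follows Vaswani et al. (2017); a minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def sinusoidal_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal position encodings: sin on even dims, cos on odd dims,
    with wavelengths forming a geometric progression up to 10000."""
    positions = np.arange(max_len)[:, None]       # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    enc = np.zeros((max_len, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc
```

Each row of the returned matrix is the encoding pos_t added to the word embedding emb_t before the Transformer encoder.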

Structural Context Encoder
The SCE module aims to learn context-aware utterance representations, covering utterance adjacency graph construction and graph encoding. Specifically, for each dialogue an adjacency graph G = (V, E) is constructed, where V is the node set in which node v_i corresponds to utterance u_i with initial representation h_i. The edge set E between the nodes is defined by the N-adjacency relationship as shown in Eq.3, where N is the window size; we assume that utterances within the window have a discourse relation. Then, the graph G is fed into a Graph Attention Network to obtain the context-aware utterance representation g_i with structural context information, as shown in Eq.4.
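Since Eq.3 is not reproduced here, one plausible reading of the N-adjacency relation is that two utterance nodes are connected whenever their positions differ by at most the window size N. A sketch under that assumption:

```python
def n_adjacency_edges(num_utterances: int, window: int):
    """Edge set E under an assumed N-adjacency relation: nodes v_i and
    v_j are connected when 0 < |i - j| <= window (our reading of Eq.3)."""
    return [(i, j)
            for i in range(num_utterances)
            for j in range(num_utterances)
            if i != j and abs(i - j) <= window]
```

For a 4-utterance dialogue with window 1, this yields directed edges between each utterance and its immediate neighbors, which the GAT then uses to aggregate structural context.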

Utterance Decoder
To better learn context-aware utterance representations, the Gate-based Transformer decoder is proposed, as shown in Fig.3. Compared with the standard Transformer decoder, it has an additional Gate-based Extractor sublayer, which captures more relevant information for decoding different words. Specifically, to decode word w_{t+1} in u_i, the Gate-based Extractor sublayer extracts the hidden state g_i^t from g_i through the gate mechanism (Gers, 2001) as shown in Eq.5, where [·;·] denotes concatenation, W ∈ R^{2d_mod×d_mod} is a trainable weight matrix, and a_i^t is the hidden state corresponding to word w_t output by the Masked Multi-head Self-Attention sublayer. As in the standard Transformer decoder, a linear projection and a softmax function convert the outputs into the probability distribution of the next word, p_i^{t+1} ∈ R^{N_voc}, where N_voc is the vocabulary size. The cross-entropy loss L_ud between p_i^{t+1} and the ground-truth word id y_i^{t+1} is employed as shown in Eq.6, where m_i is the number of words in u_i.
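Eq.5 is not reproduced in this text, so the following is only an assumed realization of the Gate-based Extractor: an LSTM-style sigmoid gate, computed from the concatenation of the attention state a_i^t and the utterance representation g_i, that decides how much of g_i to pass through for the current decoding step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_extract(a_t, g, W):
    """Assumed form of the Gate-based Extractor (Eq.5 is not shown):
    gate = sigmoid([a_t; g] @ W), output g_t = gate * g.
    a_t, g: (d_mod,); W: (2 * d_mod, d_mod), trainable in the model."""
    gate = sigmoid(np.concatenate([a_t, g]) @ W)
    return gate * g
```

The gated output g_i^t then replaces the fixed g_i when decoding word w_{t+1}, letting each decoding step attend to a different slice of the utterance representation.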

Utterance Cluster Representation Learning
The UCRL module first induces implicit task-related concepts by clustering utterances, and then represents this information by learning a representation for each utterance cluster. Specifically, all utterances are first grouped into K_utt clusters based on their context-aware representations using a Gaussian Mixture Model (GMM) (McLachlan and Basford, 1988).
Then, an utterance cluster representation learning method based on the transfer relationship is proposed. It uses the historical utterance cluster representations and the initial representation of the current utterance to update the utterance cluster representation corresponding to the current utterance. Specifically, given a dialogue history {u_1, ..., u_{i-1}}, let the corresponding utterance cluster representations be C_{i-1} = {c_1, ..., c_{i-1}} and the current utterance representation from the UE module be h_i; the utterance cluster representation c_i corresponding to the current utterance u_i is then calculated as shown in Eq.7. Furthermore, a loss function for cluster representation learning is adopted as shown in Eq.8, where C ∈ R^{K_utt×d_mod} is the representation matrix of all utterance clusters, a softmax function is used to obtain the cluster probability distribution, and the cross-entropy between this distribution and the utterance cluster label y_{u_i} corresponding to u_i is calculated.

Dialogue Representation Learning
The DRL module aims to learn dialogue representations by fusing the context-aware utterance representations and the corresponding utterance cluster representations; the former contain the in-dialogue discourse relation information, while the latter contain the cross-dialogue task-related concept information. Specifically, given a dialogue d_j = {u_i}_{i=1}^{n_j}, let the corresponding class label be y_{d_j}, the context-aware representations of the utterances in the dialogue be g = {g_i}_{i=1}^{n_j}, the utterance cluster representations be C = {c_i}_{i=1}^{n_j}, and the utterance position embeddings be pos_u = {pos_{u_i}}_{i=1}^{n_j}. The first step fuses g, C and pos_u into an embedding as shown in Eq.9, where W ∈ R^{2d_mod×d_mod} and pos_{u_i} ∈ R^{d_mod} are all trainable. Then, a Transformer encoder is leveraged to encode {ζ_i}_{i=1}^{n_j}, and the output o_j is the embedding at the [CLS] position, as shown in Eq.10, o_j = TransformerEncoder(ζ_1, ..., ζ_{n_j}) (10). Finally, a Linear layer and a LayerNorm layer project the output o_j into the dialogue representation z_j ∈ R^{K_dia}, as shown in Eq.11. A cross-entropy loss, shown in Eq.12, is used to supervise the learning of the dialogue representation by maximizing the distance among dialogue representations of different classes, which helps cluster dialogues.
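Eq.9 is not reproduced here, but the stated shapes (W ∈ R^{2d_mod×d_mod}, pos_{u_i} ∈ R^{d_mod}) suggest projecting the concatenation of g_i and c_i and adding the position embedding. A sketch under that assumption:

```python
import numpy as np

def fuse_representations(g_i, c_i, pos_i, W):
    """Assumed form of Eq.9: zeta_i = [g_i; c_i] @ W + pos_i.
    g_i, c_i, pos_i: (d_mod,); W: (2 * d_mod, d_mod), trainable."""
    return np.concatenate([g_i, c_i]) @ W + pos_i
```

The resulting sequence {ζ_i} is what the Transformer encoder of the DRL module consumes in Eq.10.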

Dialogue Clustering
After obtaining the dialogue representations, we group them into K_dia clusters with a Gaussian Mixture Model (GMM) and assign a label to each dialogue, which is used as a pseudo label for training the DRL module. A trained DRL module in turn generates better dialogue representations, and better dialogue representations help obtain better clustering assignments. Note that the dialogue representation used for the initial clustering is the mean of the utterance representations g from the pretrained autoencoder, while subsequent clusterings use the learned dialogue representation z.
Due to the instability of GMM, the initial clustering assignment is obtained by voting over N_clu consecutive clustering runs, as shown in Algo.1.
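Algo.1 is not reproduced in this text; a minimal sketch of such voting, under the assumption that each run's labels are first aligned to the first run with the Hungarian algorithm and the per-sample majority is then taken, might look like:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def vote_assignments(runs: np.ndarray) -> np.ndarray:
    """runs: (n_runs, n_samples) array of cluster labels from repeated
    GMM fits. Align every run to the first via the Hungarian algorithm
    on the label co-occurrence matrix, then take a per-sample majority
    vote. (An assumed realization of Algo.1, not the paper's code.)"""
    k = runs.max() + 1
    ref = runs[0]
    aligned = [ref]
    for labels in runs[1:]:
        cost = np.zeros((k, k))
        for a, b in zip(labels, ref):
            cost[a, b] -= 1          # negative counts: maximize overlap
        row, col = linear_sum_assignment(cost)
        mapping = dict(zip(row, col))
        aligned.append(np.array([mapping[l] for l in labels]))
    aligned = np.stack(aligned)
    return np.array([np.bincount(aligned[:, i], minlength=k).argmax()
                     for i in range(runs.shape[1])])
```

Alignment is necessary because cluster indices are arbitrary across GMM runs; without it, identical partitions with permuted labels would cancel each other out in the vote.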

Model Training Process
A two-stage training strategy including pre-training and joint-training is employed for model training.
In the pre-training stage, the context-aware utterance representation g_i of each utterance is learned with an autoencoder based on the encoder-decoder architecture, for initial clustering assignments, where the UE and SCE modules serve as the encoder and UD serves as the decoder. The pre-training loss is defined by Eq.13. In the joint-training stage, an iterative training strategy is adopted. In each iteration, a label reassignment strategy is employed to improve the confidence of the clustering assignments. Specifically, all utterances and dialogues are clustered after each training epoch, and the old clustering assignment is updated by finding the best mapping between the new and old clustering assignments with the Hungarian algorithm (Kuhn, 1955). However, the pseudo labels used by the UCRL and DRL modules are not reassigned immediately, but every interval epochs. Finally, training stops when the change between two consecutive dialogue clustering assignments is less than tol% or the maximum number of training epochs max_ep is reached. The last dialogue clustering assignment is used as the final clustering result. The loss is defined by Eq.14, where β is a loss coefficient.

Datasets
In order to better evaluate the performance of different algorithms on TODC, we constructed three public task-divided dialogue datasets based on the Schema-Guided Dialogue (SGD) and Multiwoz datasets. In both SGD and Multiwoz, we determine whether two dialogues belong to the same dialogue task by judging whether the two dialogues contain the same set of active intents. Finally, three datasets labeled by dialogue task are constructed: SGD-S includes the single-domain dialogues of the SGD dataset, SGD-M includes the multi-domain dialogues of the SGD dataset, and Multiwoz-T includes all dialogues of the Multiwoz dataset. Detailed division instructions and the datasets will be released. Tab.1 shows their statistics.

Implementation Details
In our experiments, the hidden size is set to 256. We use a 3-layer Transformer encoder for both the UE and DRL modules, a 2-layer GAT for the SCE module, and a 3-layer Gate-based Transformer decoder for the UD module. The window size is set to 2, 2 and 1 for SGD-S, SGD-M and Multiwoz-T respectively. In the pre-training stage, we train for 100 epochs with batch size 16 on each dataset, with the same optimizer settings as the Transformer (Vaswani et al., 2017). In the joint-training stage, K_utt is estimated by the BIC score (Schwarz et al., 1978) as 50, 50 and 60, and interval is set to 2, 2 and 1 for the SGD-S, SGD-M and Multiwoz-T datasets respectively. Besides, the coefficient of L_ucrl is set to 10 to stabilize the utterance cluster representations quickly. The learning rate at each step is calculated as shown in Eq.15, where the maximum learning rate lr_max is set to 1e-3, 1e-3 and 5e-4 for SGD-S, SGD-M and Multiwoz-T respectively, the warmup steps wm_stp = interval · N_dia_bth, and N_dia_bth is the number of batches of the corresponding dataset. This warmup strategy increases the learning rate linearly between the first and second dialogue label reassignments until lr_max, then decreases it proportionally.
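Eq.15 itself is not reproduced in this text, but the described behavior (linear increase to lr_max over wm_stp steps, then proportional decay) can be sketched as:

```python
def learning_rate(step: int, warmup_steps: int, lr_max: float) -> float:
    """Assumed shape of Eq.15: linear warmup to lr_max over
    warmup_steps, then inverse-proportional decay."""
    if step <= warmup_steps:
        return lr_max * step / warmup_steps
    return lr_max * warmup_steps / step
```

With interval = 2 and N_dia_bth batches per epoch, warmup_steps = 2 · N_dia_bth, so the peak is reached exactly at the second label reassignment, matching the description above.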

Baselines and Metrics
Three types of baselines are adopted to be compared with the proposed model on TODC performance.
Raw feature based models. We use bag-of-words (BOW) and TF-IDF features to represent dialogues, and cluster with the LDA (Blei et al., 2001), K-means and GMM algorithms respectively.
Deep clustering models. Four popular deep clustering models are adopted as strong baselines. DCN (Yang et al., 2017) used k-means clustering loss to learn clustering friendly representations. VaDE (Jiang et al., 2017) is a generative deep clustering model based on variational autoencoder. DEC (Xie et al., 2016) designed a clustering objective to guide the learning of the data representations. IDEC (Guo et al., 2017) is a modified version of DEC with a reconstruction loss to preserve local structure.
Metrics Four popular metrics are adopted to evaluate TODC performance, including Accuracy (ACC), Purity, Normalized Mutual Information (NMI) (Strehl and Ghosh, 2002), and Adjusted Rand Index (ARI) (Hubert and Arabie, 1985). For each metric, a larger value implies a better clustering performance.
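Clustering accuracy (ACC) is conventionally computed by finding the best one-to-one mapping between predicted clusters and ground-truth labels with the Hungarian algorithm; a standard implementation (our sketch, not code from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred) -> float:
    """ACC: fraction of samples correctly assigned under the best
    one-to-one cluster-to-label mapping (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k))
    for p, t in zip(y_pred, y_true):
        cost[p, t] -= 1              # negative counts: maximize matches
    row, col = linear_sum_assignment(cost)
    return -cost[row, col].sum() / len(y_true)
```

Unlike Purity, ACC forces a one-to-one mapping, so splitting one true class across many clusters is penalized rather than rewarded.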

Main Results
Tab.2 shows the clustering performance of the different methods. The proposed model significantly outperforms all three types of baselines. Compared with the best baseline, our model improves ACC by 19.76%, 16.67% and 4.87%, Purity by 12.83%, 14.80% and 7.26%, NMI by 9.45%, 8.65% and 3.45%, and ARI by 22.59%, 20.49% and 7.10% on the SGD-S, SGD-M and Multiwoz-T datasets respectively. These results show that the obtained dialogue representations capture more task-related information.
Ablation Studies of Modules As shown in Tab.3, both the SCE and UCRL modules contribute to the proposed model, and the SCE module has a greater impact on performance than the UCRL module. On the one hand, this shows that integrating the structural context of an utterance improves the quality of the utterance representation and, in turn, the dialogue representation. On the other hand, the utterance clustering assignment based on the utterance representations from the SCE module directly affects the UCRL module, and the utterance cluster representations from the UCRL module directly affect the dialogue representation.
Ablation Studies of Losses We also conduct ablation studies to evaluate the impact of the different losses in our model.
Table 4: Ablation studies for losses. "-w/o L_ucrl" refers to the performance when the utterance cluster embedding is used directly, without learning via L_ucrl.
As shown in Tab.4, ACC drops by 9.21% after removing the L_ucrl loss, which indicates that the structure information learned through the logical transfer relationship between utterance clusters helps distinguish different tasks. ACC drops by 4.51% after removing the L_ud loss, which indicates that stabilizing the utterance representations helps stabilize model training. Meanwhile, all metrics improve significantly after joint training, which indicates that introducing the implicit concept information induced by clustering utterances and adopting an iterative training strategy are both beneficial for TODC.
We analyze the impact of the window size N on all metrics on the SGD-S dataset. As shown in Fig.4, performance improves significantly as the window size increases, peaks when the window size is 2, then decreases slightly and stabilizes. This indicates that the optimal window size is 2: if the window is too small, insufficient context information is introduced, and if it is too large, too much noise is introduced, hurting performance.
We also analyze the impact of the clustering number on NMI, ARI and Purity on the SGD-S dataset. As shown in Fig.5, performance improves significantly as the clustering number increases. When the clustering number is 29, which is the ground-truth number of tasks in the SGD-S dataset, all metrics peak; Purity then stabilizes, while NMI and ARI decrease.

Clustering Number K dia Analysis
Purity measures the degree to which the samples in a cluster belong to the same true category. As the number of clusters increases, purity gradually increases and then stabilizes, as shown in Fig.5. NMI and ARI measure the degree of overlap between the clustering and the true category distribution; when the number of clusters differs greatly from the number of true categories, performance decreases significantly, as shown in Fig.5.

Case Studies
We selected some typical utterance clusters to evaluate the quality of the learned context-aware utterance representations, and analyzed whether interpretable task-related concepts can be induced from the obtained utterance clusters.
One case is shown in Fig.6: these utterances are clearly all about the concept "user wants to search one-way flight". We also found some special utterance clusters. Another case is shown in Fig.7: it contains two segments from two dialogues, and, to our surprise, all of these utterances are grouped into one cluster. This phenomenon can be explained from two aspects. First, from top to bottom, discourse relation information and the involved slot information make the representations of related utterances similar. For example, both utterances of the left segment involve the same three slots and are associated by a request-inform pair relationship, so their representations become similar after incorporating this information. Second, from left to right, cross-dialogue utterances with similar context information are grouped automatically.
These indicate that our model can fully integrate contextual information by constructing adjacency graphs, and can also induce interpretable concepts through clustering utterances.

Conclusion and Future Work
This paper proposes a Dialogue Task Clustering Network for dialogue task clustering. The model makes use of both in-dialogue discourse relation information and cross-dialogue utterance similarity information to build dialogue representations, and an end-to-end iterative strategy that jointly performs dialogue representation learning and clustering is used to train the model. Experiments on three public datasets show that the proposed model significantly outperforms existing strong clustering algorithms on dialogue task clustering. In the future, we will induce more detailed task-related concept information and explore the inner structure of each dialogue cluster for task induction and definition.