Zero-Shot Dialogue Disentanglement by Self-Supervised Entangled Response Selection

Dialogue disentanglement aims to group utterances in a long, multi-participant dialogue into threads. This is useful for discourse analysis and for downstream applications such as dialogue response selection, where it can be the first step in constructing a clean context/response set. Unfortunately, labeling all reply-to links takes quadratic effort with respect to the number of utterances: an annotator must check all preceding utterances to identify the one to which the current utterance is a reply. In this paper, we are the first to propose a zero-shot dialogue disentanglement solution. First, we train a model on an unannotated multi-participant response selection dataset harvested from the web; we then apply the trained model to perform zero-shot dialogue disentanglement. Without any labeled data, our model can achieve a cluster F1 score of 25. We also fine-tune the model using various amounts of labeled data. Experiments show that with only 10% of the data, we achieve nearly the same performance as with the full dataset.


Introduction
Multi-participant chat platforms such as Messenger and WhatsApp are common on the Internet. While they make it easy to communicate with others, messages often flood into a single channel, producing an entangled chat history that is poorly organized and difficult to structure. In contrast, Slack provides a thread-opening feature that allows users to manually organize their discussions. It would be ideal if we could design an algorithm to automatically organize an entangled conversation into its constituent threads. This is referred to as the task of dialogue disentanglement (Shen et al., 2006;Elsner and Charniak, 2008;Wang and Oard, 2009;Elsner and Charniak, 2011;Jiang et al., 2018;Kummerfeld et al., 2018;Zhu et al., 2020;Yu and Joty, 2020).
Training data for the dialogue disentanglement task is difficult to acquire due to the need for manual annotation. Typically, the data is annotated in the reply-to links format, i.e. every utterance is linked to one preceding utterance. The effort is quadratic w.r.t. the length of the dialogue, which partly explains why only one large-scale human-annotated dataset exists (Kummerfeld et al., 2018), constructed from the Ubuntu IRC forum. To circumvent the need for expensive labeled data, we aim to first train a self-supervised model and then use it to perform zero-shot dialogue disentanglement. In other words, our goal is to find a task that can learn implicit reply-to links without labeled data.
Entangled response selection (Gunasekara et al., 2020) is the task that we will focus on. It is similar to the traditional response selection task, whose goal is to pick the correct next response among candidates, with the difference that its dialogue context consists of multiple topics and participants, leading to a much longer context (avg. 55 utterances). We hypothesize that: A well-performing model of entangled response selection requires recovery of reply-to links to preceding dialogue.
This is the only way that a model can pick the correct next response given an entangled context. Two challenges are ahead of us:
Finally, we want to highlight the high practical value of our proposed method. Consider that we have access to a large and unlabeled corpus of chat (e.g. WhatsApp/Messenger) history. The only cost should be training the proposed entangled response selection model with attention supervision using unlabeled data. The trained model is immediately ready for dialogue disentanglement. In summary, the contributions of this work are:
• Show that complex pruning strategies are not necessary for entangled response selection.
• With the proposed objective, the model trained on entangled response selection can perform zero-shot dialogue disentanglement.
• By tuning with 10% of the labeled data, our model achieves comparable performance to that trained using the full dataset.

Task Description
The dataset we use is DSTC8 subtask-2 (Gunasekara et al., 2020), which was constructed by crawling the Ubuntu IRC forum. Concretely, given an entangled dialogue context, the model is expected to pick the next response among 100 candidates. The average context length is 55 and the number of speakers is 20 with multiple (possibly relevant) topics discussed concurrently. The context is too long to be encoded by transformer-based models (Devlin et al., 2018;Liu et al., 2019). Despite the existence of models capable of handling long context (Yang et al., 2019;Zaheer et al., 2020;Beltagy et al., 2020), it is difficult to reveal the implicitly learned reply-to links as done in §3.4.

Related Work
To the best of our knowledge, previous works adopt complex heuristics to prune out utterances in the long context (Wu et al., 2020;Gu et al., 2020;Bertero et al., 2020). For example, they keep only the utterances whose speaker is the same as, or referred to by, the candidates. This is problematic for two reasons. 1) The retained context is still noisy as there are multiple speakers present in the candidates. 2) We might accidentally prune out relevant utterances even though they do not share the same speakers. A better solution is to let the model decide which utterances should be retained.

Table 1: Test set performance of the entangled response selection task. Concatenate is the model with complex pruning heuristics described in Wu et al. (2020).

Model (Solid Arrows in Figure 2)
We use a hierarchical encoder as shown in the middle part of Figure 2. Suppose the input context is $\{U_i\}_{i=1}^{n}$ and the next response candidate set is $\{C_k\}_{k=1}^{m}$. For every candidate utterance $C_k$, we concatenate it with every $U_i$. For example, for $k = 1$ we form $n$ pairs $(U_i + C_1)_{i=1}^{n}$. Then we use BERT as the encoder ($\phi$) to encode the pairs and take the last-layer embedding of the [CLS] token as $V_i$:

$$V_i = \phi(U_i + C_k), \quad i = 1, \dots, n; \qquad V_{n+1} = \phi(C_k + C_k)$$

While $(C_k + C_k)$ is not necessary for response selection, it is useful later for predicting a self-link, which marks the first utterance of a thread. We will see its role in §3.4. Then we use the output embeddings of a one-layer transformer ($\psi$) with 8 heads to encode contextualized representations:

$$\{V'_i\}_{i=1}^{n+1} = \psi\big(\{V_i\}_{i=1}^{n+1}\big)$$

To determine relative importance, we use an attention module ($A$) to calculate attention scores:

$$\{\alpha_i\}_{i=1}^{n+1} = A\big(\{V'_i\}_{i=1}^{n+1}\big)$$

The final predicted score is:

$$s = \mathrm{MLP}\Big(\sum_{i=1}^{n+1} \alpha_i V'_i\Big)$$

Note that $s$ should be 1 for $C_1$ (the correct next response), and 0 otherwise (row 1 of the multi-task loss table in Figure 2). This can be optimized using the binary cross-entropy loss.

Figure 2: Solid arrows: Given an entangled context $U_{1,2,3}$ and $C_{k=1}$ as the correct next response ($C_{k=2}$ is a negative sample), each pair of the concatenated inputs is encoded separately by $\phi$ (BERT) to get $V_i$. A context-aware model $\psi$ (transformer) is applied over the $V_i$ to generate contextualized $V'_i$. An attention module $A$ is used to calculate the attention scores $\alpha_i$ and the weighted sum $s$. The model is optimized according to the target values for $s$ and $\alpha_4$ in the multi-task loss table. Dashed arrows: Given another entangled context, we know that the current utterance $C_{k=1}$ is replying to $U_2$ by taking the arg max of the attention scores $\alpha_i$ in a zero-shot manner.
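The attention-weighted scoring step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the attention module is a dot product with a learned vector followed by a softmax, and that a single linear layer with a sigmoid stands in for the final MLP; the source does not specify either form, and the names `attention_pool`, `score`, `attn_vec`, and `out_vec` are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, attn_vec):
    """Return attention scores and the attention-weighted sum.

    vectors:  n+1 contextualized embeddings (the last one is the
              candidate paired with itself).
    attn_vec: hypothetical learned scoring vector (an assumption).
    """
    logits = [sum(a * b for a, b in zip(v, attn_vec)) for v in vectors]
    alphas = softmax(logits)
    dim = len(vectors[0])
    pooled = [sum(a * v[d] for a, v in zip(alphas, vectors)) for d in range(dim)]
    return alphas, pooled

def score(pooled, out_vec, bias=0.0):
    # Linear layer plus sigmoid, standing in for the final MLP (an assumption).
    z = sum(p * o for p, o in zip(pooled, out_vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy example with n = 3 context utterances plus the self-pair.
vectors = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.5, 0.5]]
alphas, pooled = attention_pool(vectors, attn_vec=[1.0, 0.0])
s = score(pooled, out_vec=[0.5, -0.5])
```

In this toy input the third vector receives the largest attention score, so it dominates the pooled representation that is passed to the scorer.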

Results
We show the results in Table 1. The performance of our approach is comparable to previous work. Note that our model does not use any heuristics to prune out utterances. Instead, the attention scores $\alpha_i$ are decided entirely by the model. We also run an experiment using augmented data following Wu et al. (2020), which is constructed by excerpting partial context from the original context.² Finally, we want to highlight the importance of the attention module $A$: the performance drops by 10 points if it is removed.
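The augmentation by context excerpting can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `augment_session`, the dictionary layout, and the choice of 99 negatives (mirroring the 100-candidate setup of the task) are hypothetical and are not taken from Wu et al. (2020).

```python
import random

def augment_session(utterances, other_sessions, num_negatives=99, seed=0):
    """Build extra training examples from prefixes of one session.

    For each i in 1..len(utterances)-1, the first i utterances become a new
    context and utterance i+1 (index i) becomes the correct next response.
    Negatives are sampled at random from other sessions.
    """
    rng = random.Random(seed)
    pool = [u for session in other_sessions for u in session]
    examples = []
    for i in range(1, len(utterances)):
        examples.append({
            "context": utterances[:i],
            "positive": utterances[i],
            "negatives": rng.sample(pool, min(num_negatives, len(pool))),
        })
    return examples

# A session of 50 utterances yields 49 new examples.
session = [f"utt-{j}" for j in range(50)]
others = [[f"other-{s}-{j}" for j in range(60)] for s in range(3)]
examples = augment_session(session, others)
```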

Attention Analysis
The empirical success of the hierarchical encoder has an important implication: it is able to link the candidate with one or multiple relevant utterances in the context. This is evidenced by the attention distribution $\alpha_i$. Intuitively, if $C_k$ is the correct next response (i.e. $k = 1$), then the attention distribution should be sharp, which indicates an implicit reply-to link to one of the previous utterances. In contrast, if $C_k$ is incorrect (i.e. $k \neq 1$), our model is less likely to find an implicit link, and the attention distribution should be flat. Entropy is a good tool to quantify sharpness. Numerically, the entropy is 1.4 (sharp) when $C_k$ is correct and 2.1 (flat) for incorrect ones, validating our suppositions.

² For a context of length 50, we can take the first $i \in 1 \dots 49$ utterances as a new context, and the $(i+1)$-th utterance acts as the new correct next response. We can sample the negatives randomly from other sessions in the dataset.
Is it possible to reveal these implicit links? The solution is inspired by the labeled data of dialogue disentanglement as elaborated in §3.4.
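To make the sharpness measure concrete, the following sketch computes the Shannon entropy of an attention distribution. The natural logarithm is an assumption, since the text does not state the base, and the two example distributions are invented for illustration rather than taken from the experiments.

```python
import math

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Illustrative distributions (not from the paper's data).
sharp = [0.90, 0.05, 0.03, 0.02]   # attention concentrated on one utterance
flat = [0.25, 0.25, 0.25, 0.25]    # attention spread evenly

sharp_entropy = entropy(sharp)
flat_entropy = entropy(flat)
```

A distribution concentrated on one utterance has lower entropy than a uniform one, which is the contrast reported between correct and incorrect candidates.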

Task Description
The dataset used is DSTC8 subtask-4 (Kummerfeld et al., 2018). We want to find the parent utterance in an entangled context to which the current utterance is replying, and repeat this process for every utterance. After all the links are predicted, we run a connected component algorithm over them, where each connected component is one thread.
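The thread-recovery step can be sketched with a small union-find routine. This is one possible way to compute connected components over predicted reply-to links, not necessarily the implementation used in the experiments; the function name `threads_from_links` and the convention that a thread-starting utterance points to itself are assumptions made for the example.

```python
def threads_from_links(parents):
    """Group utterances into threads from predicted reply-to links.

    parents[i] is the index of the utterance that utterance i replies to;
    parents[i] == i marks a self-link (start of a new thread).
    """
    root = list(range(len(parents)))

    def find(x):
        # Follow parent pointers to the representative, halving paths as we go.
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for child, parent in enumerate(parents):
        rc, rp = find(child), find(parent)
        if rc != rp:
            # Keep the smaller index as the representative of the merged set.
            root[max(rc, rp)] = min(rc, rp)

    clusters = {}
    for i in range(len(parents)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Utterances 0 and 2 start threads; 1 and 3 join thread 0, 4 joins thread 2.
threads = threads_from_links([0, 0, 2, 1, 2])
```

For the example input, the result is two threads: utterances 0, 1, 3 and utterances 2, 4.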

Related Work
All previous work (Shen et al., 2006;Elsner and Charniak, 2008;Wang and Oard, 2009;Elsner and Charniak, 2011;Jiang et al., 2018;Kummerfeld et al., 2018;Zhu et al., 2020;Yu and Joty, 2020) treats the task as a sequence of multiple-choice problems. Each of them consists of a sliding window of n utterances. The task is to link the last utterance to one of the preceding n − 1 utterances. This model is usually trained in supervised mode using the labeled reply-to links. Our model also follows the same formulation.

Table 2: w = 0 indicates pure entangled response selection training. In the few-shot section, scratch is the disentanglement model without prior self-supervised training on entangled response selection. The evaluation metrics and labeled data used for fine-tuning are in Kummerfeld et al. (2018). Results are the average of three runs.

Model (Dashed Arrows in Figure 2)
We use the trained hierarchical model in §2.3 without the final MLP layer used for scoring. In addition, we only have one candidate now, which is the last utterance in a dialogue. We use $C_{k=1}$ to represent it for consistency. Note that we only need to calculate $i^* = \arg\max_i \alpha_i$. This indicates that $C_{k=1}$ is replying to utterance $U_{i^*}$ in the context.
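The link prediction step reduces to an arg max over the attention scores, as the following sketch shows. It assumes the scores are given as a list of n+1 values whose last entry is the self-link slot; the function name `predict_parent`, the 0-based indexing, and the returned tuple layout are hypothetical.

```python
def predict_parent(alphas):
    """Pick the reply-to target from attention scores.

    alphas has n+1 entries: one per context utterance, followed by the
    self-link slot. Returns (index, is_self_link), with 0-based indices.
    Ties resolve to the earliest index.
    """
    n = len(alphas) - 1
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return best, best == n

# Three context utterances plus the self-link slot.
link = predict_parent([0.1, 0.7, 0.1, 0.1])        # replies to the second utterance
new_thread = predict_parent([0.1, 0.1, 0.1, 0.7])  # self-link: starts a new thread
```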

Proposed Attention Supervision
We note that the labeled reply-to links act as supervision to the attention $\alpha_i$: they indicate which $\alpha_i$ should be 1. We call this extrinsic supervision.
Recall the implicit attention analysis in §2.5, from which we exploit two kinds of intrinsic supervision:
• If $C_k$ is the correct next response, then $\alpha_{n+1} = 0$ because $C_k$ should be linking to one previous utterance, not itself.
• If $C_k$ is incorrect, then it should point to itself, acting like the start utterance of a new thread. Hence, $\alpha_{n+1} = 1$.
We train this intrinsic attention using MSE (row 2 of the multi-task loss table in Figure 2) along with the original response selection loss, combined linearly with a weight $w$: $L = (1 - w) \cdot L_{res} + w \cdot L_{attn}$. Note that we do not use any labeled disentanglement data in the training process.
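The combined objective can be written out for a single candidate as a minimal sketch. It assumes the response selection loss is binary cross-entropy on the predicted score and the attention loss is the squared error on the self-link attention, as described above; the function names and the clamping constant `eps` are hypothetical, and batching or averaging details are not specified in the text.

```python
import math

def bce(s, target, eps=1e-12):
    # Binary cross-entropy for one prediction; s is clamped for stability.
    s = min(max(s, eps), 1.0 - eps)
    return -(target * math.log(s) + (1.0 - target) * math.log(1.0 - s))

def multitask_loss(s, alpha_self, is_correct, w=0.25):
    """L = (1 - w) * L_res + w * L_attn for a single candidate.

    s:          predicted response selection score in [0, 1]
    alpha_self: attention placed on the self-link slot (alpha_{n+1})
    is_correct: True if the candidate is the true next response
    """
    target_s = 1.0 if is_correct else 0.0
    # Correct candidates should not attend to themselves; incorrect ones should.
    target_alpha = 0.0 if is_correct else 1.0
    l_res = bce(s, target_s)
    l_attn = (alpha_self - target_alpha) ** 2
    return (1.0 - w) * l_res + w * l_attn

loss_pos = multitask_loss(s=0.9, alpha_self=0.1, is_correct=True)
loss_neg = multitask_loss(s=0.1, alpha_self=0.9, is_correct=False)
```

Setting `w = 0` recovers pure response selection training, which corresponds to the w = 0 setting described for Table 2.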

Results
We present the results in Table 2. In the first section, we focus on zero-shot performance, where we vary w to see its effect. As we can see, w = 0.25 gives close-to-best performance in terms of cluster and link scores. Therefore, we use it for the few-shot fine-tuning setting, under which our proposed method outperforms baselines trained from scratch by a large margin. We pick the best checkpoint based on the validation set performance and evaluate it on the test set. This procedure is repeated three times with different random seeds to get the averaged performance reported in Table 2. With 10% of the data, we can achieve 92% of the performance obtained with the full data. The performance gap becomes smaller when more data is used, as illustrated in Figure 3.

Real-World Application
Our method only requires one additional MLP layer attached to the architecture of  to train on the entangled response selection task; hence it is trivial to swap the trained model into a production environment. Suppose a dialogue disentanglement system  is already up and running:
1. Train a BERT model on the entangled response selection task (§2.1) with attention supervision loss (§3.4). This is also the multi-task loss depicted in Figure 2.
2. Copy the weights of the pretrained model into the existing architecture .
3. Perform zero-shot dialogue disentanglement (zero-shot section of Table 2) right away, or finetune the model further when more labeled data becomes available (few-shot section of Table 2).
This strategy will be useful especially when we want to bootstrap a system with limited and expensive labeled data.

Conclusion
In this paper, we first demonstrate that entangled response selection does not require complex heuristics for context pruning. This implies the model might have learned implicit reply-to links useful for dialogue disentanglement. By introducing intrinsic attention supervision to shape the distribution, our proposed method can perform zero-shot dialogue disentanglement. Finally, with only 10% of the data for tuning, our model can achieve 92% of the performance of the model trained on the full labeled data. Our method is the first attempt at zero-shot dialogue disentanglement, and it can be of high practical value for real-world applications.