Conversation Disentanglement with Bi-Level Contrastive Learning

Conversation disentanglement aims to group utterances into detached sessions, which is a fundamental task in processing multi-party conversations. Existing methods have two main drawbacks. First, they overemphasize pairwise utterance relations but pay inadequate attention to modeling the utterance-to-context relation. Second, a huge amount of human-annotated data is required for training, which is expensive to obtain in practice. To address these issues, we propose a general disentanglement model based on bi-level contrastive learning. It brings utterances in the same session closer while encouraging each utterance to be near its clustered session prototypes in the representation space. Unlike existing approaches, our model works in both the supervised setting with labeled data and the unsupervised setting when no such data is available. The proposed method achieves new state-of-the-art performance in both settings across several public datasets.


Introduction
Multi-party conversations generally involve three or more speakers in a single dialogue, in which the speaker utterances are interleaved, and multiple topics may be discussed concurrently (Aoki et al., 2006). This makes it inconvenient for dialogue participants to digest the utterances and respond to a particular topic thread. Conversation disentanglement is the task of separating these entangled utterances into detached sessions, which is a prerequisite of many important downstream tasks such as dialogue information extraction (Fei et al., 2022a,b), state tracking (Zhang et al., 2019; Wu et al., 2022), response generation (Liao et al., 2018, 2021b; Ye et al., 2022a,b), and response ranking (Elsner and Charniak, 2008; Lowe et al., 2017).
There has been substantial work on the conversation disentanglement task. Most of it emphasizes the pairwise relation between utterances in a two-step manner. The first step predicts the relationship between utterance pairs, followed by clustering utterances into sessions as the second. In the first step, early works (Elsner and Charniak, 2008, 2010) utilize handmade features and discourse cues to predict whether two utterances belong to the same session or whether there is a reply-to relation. The recent development in deep learning inspires the use of neural networks such as LSTMs or CNNs to learn abstract features of utterances during training (Mehri and Carenini, 2017; Jiang et al., 2018). More recently, a number of methods show that BERT in combination with handcrafted features or heuristics remains a strong baseline (Li et al., 2020b; Zhu et al., 2021; Ma et al., 2022). In the second step, the most popular clustering methods use a greedy approach to group utterances by adding pairs (Wang and Oard, 2009; Zhu et al., 2020). There are also variations incorporating a voting mechanism (Kummerfeld et al., 2019), bipartite graph matching (Zhu et al., 2021), or additional tracking models (Wang et al., 2020). An obvious drawback of such a two-step approach is that the pairwise relation prediction might not capture enough contextual information, as the connection between two utterances depends on the context in many cases (Liu et al., 2020). Also, focusing on pairwise relations leads to a short-sighted local view. To mitigate this, some methods introduce an additional conversation loss (Li et al., 2020b, 2022) or a session classifier (Liu et al., 2021) to group utterances in the same session together. Others leverage a relational graph convolution network (Ma et al., 2022) or a masking mechanism in Transformers (Zhu et al., 2020). More directly, end-to-end methods (Tan et al., 2019; Liu et al., 2020) capture the context information contained in detached sessions and calculate
the matching degree between a session and an utterance. However, many such methods operate in an online manner that only considers the preceding context. This may lead to biased session representations, introduce noisy utterances into sessions, and consequently accumulate errors.
Meanwhile, most of these methods rely heavily upon human-annotated session labels or reply-to relations, which are expensive to obtain in practice. Although there have been a few attempts to tackle this issue, a more general framework that can handle both supervised and unsupervised learning is yet to come. For example, Liu et al. (2021) design a deep co-training scheme with a message-pair classifier and a session classifier. However, various data augmentation procedures based on heuristics are required for good performance. Chi and Rudnicky (2021) propose a zero-shot disentanglement solution based on a related response selection task. Still, it relies on a closely related dataset that comes from the same Ubuntu IRC source inside DSTC8.
Recently, contrastive learning (Hadsell et al., 2006) has brought prosperity to a number of machine learning tasks by introducing unsupervised representation learning. Substantial performance gains have been reported in computer vision (He et al., 2020; Chen et al., 2020) and NLP (Yan et al., 2021; Gao et al., 2021). The underlying belief is that a good representation should be able to identify semantically close neighbors while distinguishing them from non-neighbors. Intuitively, in a multi-party conversation, utterances in the same session should semantically resemble each other while being far apart from utterances in other sessions. Instead of relying on handcrafted features such as speaker, mention, and time difference, contrastive learning provides another option for automatically learning discriminative representations.
In this work, we design a Bi-level Contrastive Learning scheme (Bi-CL) to learn discriminative representations of tangled multi-party dialogue utterances. It not only learns utterance-level differences across sessions, but more importantly, it encodes session-level structures discovered by clustering into the learned embedding space. Specifically, we introduce session prototypes to represent each session for capturing global dialogue structure, and encourage each utterance to be closer to its assigned prototype. Since the prototypes can be estimated by performing clustering on the utterance representations, the model also supports unsupervised conversation disentanglement under an Expectation-Maximization framework. We evaluate the proposed model under both supervised and unsupervised settings across several public datasets. It achieves new state-of-the-art results on both.
The contribution is summarized as follows:
• We design a bi-level contrastive learning scheme to learn better utterance-level and session-level representations for disentanglement.
• We delve into the conversation nature to harvest evidence which supports our model to disentangle dialogues without any supervision.
• Experiments show that the proposed Bi-CL model significantly outperforms several state-of-the-art models in both the supervised and unsupervised settings across datasets.
Related Work

Conversation Disentanglement
Previous methods on conversation disentanglement are mostly performed in a supervised fashion, which can be coarsely organized into two lines: (1) two-step methods which first obtain the pairwise relations among utterances and then disentangle them with a clustering algorithm; and (2) end-toend approaches which directly assign utterances into different sessions.
The majority of efforts follow the two-step pipeline. Great attention has been devoted to the first step. Early works rely heavily on handcrafted features to represent the utterances for pairwise relation prediction. For example, Elsner and Charniak (2008, 2010) used the speaker, time, mentions, shared word count, etc. to train a linear classifier for utterance pair coherence. More recent works utilized neural networks to train classifiers. For instance, Mehri and Carenini (2017) and Guo et al. (2018) leveraged LSTMs to predict either the same-session or reply-to probabilities between utterances, while Jiang et al. (2018) combined the output of a hierarchical CNN on utterances with other features to capture the interactions. More recently, Gu et al. (2020) and Li et al. (2020b) used BERT to learn the similarity score in a fixed-length context window. For the second step, there has also been progress in exploring an optimal clustering algorithm. Greedy decoding has been a popular choice (Elsner and Charniak, 2010; Jiang et al., 2018). There are also works that train a separate classifier to assign each utterance to a thread (Mehri and Carenini, 2017) or design advanced algorithms like bipartite graph matching (Zhu et al., 2021).
On the downside, the pairwise relations, which are typically predicted without considering enough session context, are local and may not reflect how utterances interact in reality. Hence, the clustering step may be undermined subsequently. This motivates end-to-end solutions that assign the target utterance at each time step with respect to the existing threads or preceding utterances (Liu et al., 2020). Similarly, Yu and Joty (2020) used attention to capture utterance interactions and gradually assigned each utterance to its replied-to parent with a pointer module. However, such an online manner not only limits the scope of session context but also leads to error accumulation.
There are also studies that work in an unsupervised fashion to avoid the reliance on human annotation. For example, Liu et al. (2021) designed both a message-pair classifier and a session classifier to form a co-training algorithm. Chi and Rudnicky (2021) proposed to train a closely related response selection model for zero-shot disentanglement. The former needs pseudo-labeled data to warm up the training, while the latter benefits from training data of the same source. More importantly, a general framework that can handle both supervised and unsupervised learning is yet to come. In our work, we aim to build such a flexible model.

Contrastive Learning
Contrastive learning learns effective representations by pulling semantically close neighbors together and pushing apart non-neighbors (Hadsell et al., 2006). Recent advances are largely driven by instance discrimination tasks. For example, in the field of computer vision, such methods consist of two key components: image transformation and contrastive loss. The former aims to generate multiple representations of the same image, by data augmentation (Ye et al., 2019; Chen et al., 2020), patch perturbation (Misra and Maaten, 2020), or using momentum features (He et al., 2020). The latter aims to bring closer samples from the same instance and separate samples from different instances. In the field of natural language processing, contrastive learning has also been widely applied, such as for language model pre-training (Yan et al., 2021; Gao et al., 2021).
Despite their improved performance, these instance discrimination methods share a common weakness: the representation is not encouraged to encode the global semantic structure of data (Caron et al., 2020). This is because two samples are treated as a negative pair as long as they are from different instances, regardless of their semantic similarity (Li et al., 2020a). Hence, there are methods which simultaneously conduct contrastive learning at both the instance and cluster level (Li et al., 2021; Shen et al., 2021). Likewise, we emphasize leveraging bi-level contrastive objectives to learn better utterance-level and session-level representations.

Method
This section presents the definition of the conversation disentanglement task and the details of our model. Starting from the supervised setting for a clear view, we gradually extend to the unsupervised setting.

Task Formulation
Given a multi-party conversation history with n utterances U = {u_1, u_2, ..., u_n} in chronological order, our goal is to disentangle them into detached sessions S = {s_1, s_2, ..., s_k}, where each s_i is a non-empty subset of U, and S is a partition of U. Each utterance includes the identity of a speaker and a message sent by this user.
The task has been popularly formulated as a reply-to relation identification problem to find the parent utterance for every u_i ∈ U. It has also been modeled as sequentially assigning each u_i to an already detached session in S or creating a new session for S. Here, instead of separating local pair and global cluster modeling, we opt for learning more discriminative representations for utterances to push them into different sessions.
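As a quick illustration of the formulation, the following sketch (toy utterance IDs, invented for this example) checks that a candidate disentanglement S is a valid partition of U:

```python
# Toy illustration of the task formulation: a disentangled output S is a
# partition of the utterance set U (utterance IDs here are invented).
U = ["u1", "u2", "u3", "u4", "u5"]
S = [{"u1", "u3"}, {"u2", "u4", "u5"}]  # two detached sessions

def is_valid_partition(U, S):
    """Every session is non-empty, sessions are disjoint, and they cover U."""
    if any(len(s) == 0 for s in S):
        return False
    union = set().union(*S)
    disjoint = sum(len(s) for s in S) == len(union)
    return disjoint and union == set(U)

print(is_valid_partition(U, S))  # True
```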

Utterance Encoder
The utterance encoder aims to capture the semantics of a given utterance and its connection to the surrounding context. Following Liu et al. (2020), we leverage a hierarchical Bi-LSTM structure similar to that of Serban et al. (2017), as illustrated in Figure 2.
For the utterance-level encoder, given each utterance u_i, we tokenize it into tokens {t_1, t_2, ..., t_|u_i|} and take their Glove embeddings (Pennington et al., 2014). We input these into a bidirectional LSTM and then use a linear transformation with non-linear activation to get the hidden states:

h_t = δ(W_1 [h_t^f ; h_t^b]),

where h_t^f and h_t^b denote the forward and backward LSTM states, W_1 is the weight matrix that merges the two-direction embeddings of each token, and we use ReLU as the activation δ. We omit the bias term due to space limitations. The self-attention mechanism (Lin et al., 2017) is adopted to obtain utterance vectors that represent the overall semantics.
For the context-level encoder, we leverage another bidirectional LSTM to allow utterances to interact with their surroundings and acquire contextual information. We feed in the local utterance embedding sequence ⟨u_1, ..., u_n⟩ and obtain the contextual utterance representations

⟨h′_1, ..., h′_n⟩ = Bi-LSTM(⟨u_1, ..., u_n⟩),

which naturally captures information in the utterance itself, in its surrounding utterances, and implicitly in the relative temporal order. To further utilize the speaker and mention information of each utterance, we simply concatenate each h′_i with a padded, multi-hot mention vector m_i ∈ R^50, where the j-th dimension is 1 if the speaker of u_j is the same as that of u_i or is mentioned in u_i. This gives the final utterance representations ⟨v_1, ..., v_n⟩.
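The mention vector can be sketched as follows; the utterance dicts, speaker names, and the simple substring test for mentions are illustrative assumptions, not the paper's exact preprocessing:

```python
# Sketch of the padded multi-hot mention vector m_i (dimension 50 as in the
# paper; the utterance objects below are invented for this example).
DIM = 50

def mention_vector(i, utterances):
    """m[j] = 1 if u_j shares u_i's speaker, or u_j's speaker appears in u_i."""
    u_i = utterances[i]
    m = [0.0] * DIM
    for j, u_j in enumerate(utterances[:DIM]):
        if u_j["speaker"] == u_i["speaker"] or u_j["speaker"] in u_i["text"]:
            m[j] = 1.0
    return m

utts = [
    {"speaker": "alice", "text": "hi all"},
    {"speaker": "bob", "text": "alice: which release?"},
    {"speaker": "alice", "text": "12.04"},
]
# For utterance 1 (bob), dims 0 and 2 fire because alice is mentioned,
# and dim 1 fires because bob is the speaker himself.
print(mention_vector(1, utts))
```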

Bi-Level Losses
With the encoder network at hand, the key is to introduce good objectives for back-propagating the right learning signals. When we have session labels for training data in the supervised setting, we aim to train the model so that, ideally, (a) utterances in the same ground truth session are embedded closer while utterances in different sessions are pulled away; and (b) utterances in each session are near their session center, or prototype. Correspondingly, we introduce an utterance-level contrastive loss and a session-level contrastive loss to encourage these properties during learning.

Utterance-level Contrastive Loss
Inspired by the contrastive learning scheme of Khosla et al. (2020) under the supervised setting, we contrast an utterance with other utterances in the same or different sessions to capture the local structure. Suppose the training dataset U contains |U| utterances in total and y(i) denotes the ground truth session assignment of u_i. We define the utterance-level contrastive loss as

L_u = − Σ_{i∈U} (1/|Y(i)|) Σ_{j∈Y(i)} log [ exp(v_i · v_j / τ_1) / Σ_{l∈N(i,j)} exp(v_i · v_l / τ_1) ],

where τ_1 is the temperature hyper-parameter, Y(i) contains all the positive utterances that have the same session assignment as u_i, and N(i, j) contains the set of negative utterances that have a different session assignment from u_i, combined with the current positive utterance u_j. Formally, Y(i) ≡ {j ∈ U : y(j) = y(i), j ≠ i}, and N(i, j) ≡ {l ∈ U : y(l) ≠ y(i)} ∪ {j}. Although ideally we could use all negative samples, as many papers have reported increased performance with an increasing number of negatives (He et al., 2020; Henaff, 2020), we cap the number of negatives at a relatively large value to balance computational efficiency.
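A minimal pure-Python sketch of this loss (toy 2-d embeddings and session labels; a real implementation would operate on batched tensors):

```python
import math

# Sketch of the supervised utterance-level contrastive loss: for each anchor,
# average -log of the softmax probability of each positive against the
# negatives plus that positive (toy values; tau stands in for tau_1).
def utterance_level_loss(emb, labels, tau=0.1):
    n = len(emb)
    sim = lambda a, b: sum(x * y for x, y in zip(a, b)) / tau
    total, count = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if labels[j] == labels[i] and j != i]
        for j in pos:
            # negatives: utterances from other sessions, plus the current positive
            denom_idx = [l for l in range(n) if labels[l] != labels[i]] + [j]
            denom = sum(math.exp(sim(emb[i], emb[l])) for l in denom_idx)
            total += -math.log(math.exp(sim(emb[i], emb[j])) / denom)
            count += 1
    return total / count

emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]  # two sessions whose embeddings are well separated
print(utterance_level_loss(emb, labels))
```

With these toy values the loss is near zero, since same-session embeddings already point in the same direction; shuffling the labels makes it much larger.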

Session-level Contrastive Loss
At the session level, we introduce prototypes to represent each session, and minimize the distance from each utterance to its session prototype while maximizing the distances from the utterance to other session prototypes. This incorporates global dialogue semantic structure into the resulting representations. When session labels are available in the supervised setting, suppose s_i = {u_1, u_2, ..., u_q}. We directly define the prototype p for session s_i as the mean of its member representations:

p = (1/q) Σ_{j=1}^{q} v_j.

Therefore, for each conversation U in the training set, we define the session-level contrastive loss as

L_s = − Σ_{i=1}^{n} log [ exp(v_i · p_i / τ_2) / Σ_{j=1}^{k} exp(v_i · p_j / τ_2) ],

where p_i is the ground truth session prototype for u_i and τ_2 is the temperature hyper-parameter.
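The prototype computation and session-level loss can be sketched analogously (toy values again; `tau2` stands in for τ_2):

```python
import math

# Sketch of the session-level contrastive loss: each prototype is the mean of
# its session's utterance embeddings; every utterance is pulled toward its own
# prototype and pushed away from the others.
def prototypes(emb, labels):
    protos = {}
    for lab in set(labels):
        members = [emb[i] for i in range(len(emb)) if labels[i] == lab]
        protos[lab] = [sum(col) / len(members) for col in zip(*members)]
    return protos

def session_level_loss(emb, labels, tau2=0.1):
    protos = prototypes(emb, labels)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    loss = 0.0
    for i, v in enumerate(emb):
        denom = sum(math.exp(dot(v, p) / tau2) for p in protos.values())
        loss += -math.log(math.exp(dot(v, protos[labels[i]]) / tau2) / denom)
    return loss / len(emb)

emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
print(session_level_loss(emb, labels))
```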

Disentangle Sessions
Besides guiding the learning process with the bi-level contrastive objectives, our disentanglement task naturally involves the session assignment goal. Therefore, the foremost issue is to decide how many sessions the conversation contains. With supervised data, we train a lightweight network to predict K for each conversation. We leverage a two-layer feed-forward network enriched with non-linearity. It takes as input the dialogue utterances as well as meta information such as the number of speakers n_s and the turn number n. The output logits indicate a distribution over the possible K values:

q = softmax(W_3 δ(W_2 [v̄ ; n_s ; n])),

where M is the global maximum session number, q ∈ R^M, and v̄ denotes the aggregated utterance representations. We train the network parameters, including W_2 and W_3, via the K prediction loss

L_k = − log q_k,

where k is the ground truth K for conversation U. At inference, we select the most likely value of K for the K-Means algorithm and constrain K ≤ n.
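A toy illustration of this inference step, with invented logits standing in for the predictor's output q:

```python
import math

# Sketch of picking K at inference: softmax over the predictor's logits for
# K = 1..M, then take the most likely K subject to K <= n (logits invented).
def predict_k(logits, n):
    exps = [math.exp(x) for x in logits]
    q = [e / sum(exps) for e in exps]            # distribution over K = 1..M
    candidates = [(prob, k + 1) for k, prob in enumerate(q) if k + 1 <= n]
    return max(candidates)[1]                    # most likely admissible K

# K=4 has the highest logit, but with only n=3 utterances it is capped,
# so the most likely admissible value K=2 is chosen.
print(predict_k([0.1, 2.0, 0.5, 3.0], n=3))  # 2
```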
During training, we also perform K-Means to cluster utterances to mitigate the gap between training and inference. Suppose we obtain a partition S′ = {s′_1, s′_2, ..., s′_k} for the conversation U by K-Means. We compute the cluster centroids {c′_1, c′_2, ..., c′_k} by averaging the embeddings of cluster members. We then run the Hungarian algorithm (Kuhn, 1955) to match clusters with sessions, hence aligning the calculated prototypes with these centroids, e.g., p_i to c′_i. We further introduce a centroid matching loss

L_m = Σ_{i=1}^{k} ∥p_i − c′_i∥²,

which encourages utterance embeddings to be clustered according to their ground truth sessions.
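The centroid-prototype alignment can be sketched as below. A real implementation would use a Hungarian-algorithm routine such as `scipy.optimize.linear_sum_assignment`; this dependency-free stand-in brute-forces the assignment, which is feasible for the small session counts involved here:

```python
from itertools import permutations

# Sketch of aligning K-Means centroids with ground-truth session prototypes
# by minimizing total squared distance over all one-to-one assignments.
def match_clusters(protos, centroids):
    """Return, for each prototype, the index of its matched centroid."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(centroids))):
        cost = sum(dist(protos[i], centroids[perm[i]])
                   for i in range(len(protos)))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best)

protos = [[0.0, 0.0], [5.0, 5.0]]
centroids = [[4.9, 5.1], [0.1, -0.1]]   # K-Means returns clusters in arbitrary order
print(match_clusters(protos, centroids))  # [1, 0]: prototype 0 matches centroid 1
```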
To sum up, the final objective for supervised training is

L = L_u + α L_s + β L_k + γ L_m,

where α, β, γ are hyper-parameters to adjust the contributions of the different factors.

Unsupervised Extension
In the unsupervised setting, we mainly update the bi-level losses L_u and L_s for representation learning, while omitting the L_k and L_m losses. At the session level, since we no longer know the session labels, we directly estimate the session assignment by clustering utterance embeddings, and then maximize the data log-likelihood. Inspired by Li et al. (2020a), we perform the two steps iteratively to form an Expectation-Maximization framework.
The following shows our objective under the framework.More derivation details can be found in Appendix A.
In a specific iteration, suppose we obtain the cluster centroids {c_1, ..., c_m} by running K-Means on conversation U. Maximizing the log-likelihood then corresponds to finding the utterance encoder parameters that minimize the loss

− Σ_{i=1}^{n} log [ exp(v_i · c_{z(i)} / ϕ) / Σ_{j=1}^{m} exp(v_i · c_j / ϕ) ],

where z(i) denotes the cluster assignment of u_i and ϕ denotes the concentration level of the feature distribution around a cluster centroid c. This encourages utterances to flock around the centroids.
In practice, we cluster the utterances M times with different numbers of clusters K = {k_m}_{m=1}^{M} to achieve a more robust probability estimation of the prototypes. Hence the updated session-level loss is calculated as

L′_s = − (1/M) Σ_{m=1}^{M} Σ_{i=1}^{n} log [ exp(v_i · c_{z_m(i)} / τ′_2) / Σ_{j=1}^{k_m} exp(v_i · c_j^m / τ′_2) ],

where c_j^m is the j-th centroid in the m-th clustering and z_m(i) is the corresponding cluster assignment of u_i. Since the number of utterances in conversation U is limited, we set ϕ to a small constant τ′_2. At the utterance level, we make use of heuristics to construct positive and negative samples for contrastive learning. The assumption is that one speaker mostly participates in only one session¹, and that utterances in different conversations are naturally in different sessions. Suppose the speaker of u_i is s(i) in the conversation U_i. We update the utterance-level contrastive loss L′_u analogously to L_u, with

Y′(i) ≡ {j ∈ U_i : s(j) = s(i), j ≠ i}, and N′(i, j) ≡ {l ∈ U \ U_i} ∪ {j}.

To sum up, the final objective for unsupervised training is

L′ = L′_u + η L′_s,

where η is a hyper-parameter to adjust the contribution of the different factors.
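The heuristic construction of positives and negatives can be sketched as follows (the toy conversations and speaker names are invented):

```python
# Sketch of the unsupervised positive/negative construction: same-speaker
# utterances within a conversation are positives; utterances from other
# conversations are negatives.
def build_pairs(conversations, conv_id, i):
    """Positives Y'(i) and negatives for utterance i of conversation conv_id."""
    conv = conversations[conv_id]
    speaker = conv[i]["speaker"]
    positives = [j for j, u in enumerate(conv)
                 if u["speaker"] == speaker and j != i]
    negatives = [(cid, j)
                 for cid, c in enumerate(conversations) if cid != conv_id
                 for j in range(len(c))]
    return positives, negatives

convs = [
    [{"speaker": "alice"}, {"speaker": "bob"}, {"speaker": "alice"}],
    [{"speaker": "carol"}, {"speaker": "dave"}],
]
pos, neg = build_pairs(convs, 0, 0)
print(pos, neg)  # alice's other utterance; both utterances of conversation 1
```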
After the representation learning, we may use various methods to decide the session number k for each conversation, such as the Elbow algorithm (Thorndike, 1953) or the Silhouette algorithm (Rousseeuw, 1987). Empirically, we find that the Elbow algorithm works slightly better. Based on the predicted K, we simply run K-Means clustering to obtain the session assignments.

¹Only 20% of speakers join multiple sessions in the Ubuntu IRC dataset.
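A rough sketch of Elbow-style K selection follows; the one-dimensional sorted-split "clustering" and the flattening criterion are simplifications for illustration, whereas the real pipeline runs K-Means on utterance embeddings:

```python
# Elbow-style K selection: compute within-cluster sum of squares (inertia)
# for each candidate K and pick the K where the improvement flattens out.
def inertia_for_k(points, k):
    """Inertia of a naive sorted split of 1-D points into k chunks."""
    pts = sorted(points)
    size = -(-len(pts) // k)  # ceiling division
    chunks = [pts[i:i + size] for i in range(0, len(pts), size)]
    total = 0.0
    for c in chunks:
        mean = sum(c) / len(c)
        total += sum((x - mean) ** 2 for x in c)
    return total

def elbow_k(points, k_max, eps=1e-9):
    """Pick the K whose inertia drop dwarfs the drop that follows it."""
    inertias = [inertia_for_k(points, k) for k in range(1, k_max + 1)]
    best_k, best_score = 2, float("-inf")
    for k in range(2, k_max):  # each candidate needs a neighbor on both sides
        drop_before = inertias[k - 2] - inertias[k - 1]
        drop_after = inertias[k - 1] - inertias[k]
        score = drop_before / (drop_after + eps)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]  # three clear groups
print(elbow_k(points, 5))  # 3
```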

Datasets
We train and evaluate our models on two large-scale annotated datasets. The first is the Ubuntu IRC dataset (Kummerfeld et al., 2019), which consists of 153/10/10 intermingled dialogues in the train/validation/test sets. Each dialogue is extracted from the Ubuntu IRC technical support channel and has a length of 250 or 500. Following Liu et al. (2020), we cut each dialogue into segments of length 50, reorder the ground truth session labels, and obtain 1,737/134/104 dialogues in the train/validation/test split. The maximum session number is 14 for the Ubuntu IRC dataset. The second is the Movie Dialogue dataset (Liu et al., 2020). Its dialogues are generated by extracting sessions from 869 movie scripts and manually intermingling the sessions. There are 29,669/2,036/2,010 dialogues in the train/validation/test split. The maximum session number is 6.

Training Details
We initialize the word embeddings with 300-dimensional Glove vectors (Pennington et al., 2014) and set the hidden state size of the Bi-LSTM to 300. The utterance embedding size after the co-attention layer is also 300. The maximum length of an utterance after tokenization is set to 50. In supervised training, the weighting hyper-parameters α and β are set to 0.4 empirically, while γ is set to 0.2. We adopt a batch size of 16 and use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 5e-5. We run ten epochs until convergence. In unsupervised training, the hyper-parameters remain the same and η is set to 0.4. While certain hyper-parameters such as the Glove embedding size follow the default practice of previous works, others such as the batch size and maximum sequence length are determined empirically. In particular, the weight parameters α, β, γ, and η are tuned with grid search.

Metrics
We adopt three popular metrics to evaluate the disentanglement results: Normalized Mutual Information (NMI), Adjusted Rand Index (ARI) (Hubert and Arabie, 1985), and the Shen-F score (Shen et al., 2006). Both NMI and ARI measure the similarity between the ground truth clusters and the predicted clusters for each conversation, and a higher value indicates a higher degree of matching. The difference is that ARI is based on counting pairwise links between utterances that exist in both the ground truth and the predictions, while NMI operates more at the cluster level since it uses entropy conditioned on clusters. Shen-F is an F1 score measuring how well utterances in the same ground truth cluster are grouped in the predicted clusters, and a higher value indicates higher cluster quality.
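The Shen-F score can be computed as sketched below (toy session labels; integer indices play the role of utterances):

```python
# Sketch of the Shen-F score (Shen et al., 2006): for each ground truth
# session, take the best F1 against any predicted session, weighted by the
# ground truth session's size.
def shen_f(gold, pred):
    n = len(gold)
    gold_sets = {g: {i for i, x in enumerate(gold) if x == g} for g in set(gold)}
    pred_sets = {p: {i for i, x in enumerate(pred) if x == p} for p in set(pred)}
    score = 0.0
    for g in gold_sets.values():
        best = max(2 * len(g & p) / (len(g) + len(p))
                   for p in pred_sets.values())
        score += len(g) / n * best
    return score

gold = [0, 0, 1, 1, 1]
pred = [0, 0, 0, 1, 1]  # one utterance of session 1 is misassigned
print(round(shen_f(gold, pred), 3))  # 0.8
```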

Baseline Models
We evaluate under both supervised and unsupervised settings. The baselines include both traditional two-step and end-to-end approaches. Supervised baselines: The majority of methods need supervision. Weighted SP (Shen et al., 2006) adopts single-pass greedy decoding to add and cluster utterances sequentially based on normalized TF-IDF vectors. CISIR (Jiang et al., 2018) uses a hierarchical CNN to encode utterances and compute pairwise scores. Transition (Liu et al., 2020) is an end-to-end online approach where each utterance is encoded and compared with the existing session states to determine assignments. DialBERT (Li et al., 2020b) benefits from a hierarchical pre-trained model for better performance. StructBERT (Ma et al., 2022) emphasizes structural characteristics in modeling and is the current state of the art.
Unsupervised baselines: When no labeled data is available, Co-Training (Liu et al., 2021) leverages a message-pair classifier and a session classifier to build a co-training scheme. Zeroshot (Chi and Rudnicky, 2021) learns from a closely related response selection task.

Main Results
We report the main results for all compared methods in Table 1. Generally speaking, the proposed Bi-CL method performs better than all the other baselines on both the Ubuntu IRC and Movie Dialogue datasets in most evaluation metrics. Note that while some of these baselines are built on the large-scale pre-trained language model BERT, which has shown superior performance on various NLP tasks, our model is only based on the relatively lightweight bidirectional LSTM. This, in some sense, signals the effectiveness of our bi-level contrastive learning design.
More specifically, under the supervised setting, the proposed Bi-CL method consistently outperforms other methods on the Movie Dialogue dataset across all metrics. It also performs best on the Ubuntu IRC dataset in terms of Shen-F. This demonstrates the effectiveness of our bi-level contrastive learning design for conversation disentanglement. We notice that DialBERT and StructBERT obtain better NMI results on Ubuntu IRC than our method. This is because these methods have special designs to model pairwise relations in a more fine-grained manner, utilizing additional dialogue features such as the time of each utterance in Ubuntu IRC. Under the unsupervised setting, our model again excels except for NMI on Ubuntu IRC. This might be because the Zeroshot model has access to more augmented data from the same data source. However, it still performs worse than Bi-CL in ARI and Shen-F. Note that our model outperforms the baselines by a significant margin on the Movie Dialogue dataset, which again implies our model's generalizability. Zeroshot does not have results on the Movie Dialogue dataset: it relies on same-source data to train the response selection model, but such data is not available. We also report our model's results with the Silhouette algorithm as the K predictor. There is a slight drop in performance, which can be attributed to the lower prediction accuracy presented in Table 3.
A common pattern shared across the above settings is that while the baselines' results are typically much higher on Ubuntu IRC than on Movie Dialogue, Bi-CL performs stably across the two datasets. This is consistent with our earlier observation that Bi-CL does not depend on the many Ubuntu IRC features that other methods heavily utilize but that are often unavailable for other data sources. Moreover, the performance gap between the supervised and unsupervised versions of Bi-CL is relatively small, suggesting that it also relies less on labels. These results demonstrate the potential of the model to be applied widely.

More Analysis
We further carry out ablation studies on various design components and provide more analysis on the prediction of session number K.

Ablation Study
We conduct ablation studies to investigate how each component affects model effectiveness. As shown in Table 2, we observe that in the supervised setting, removing L_k leads to the most significant performance drop, with gaps of 0.442, 0.282 and 0.158 in NMI, ARI and Shen-F on Movie Dialogue. This is because it makes predicting K degenerate into a random guess. We also observe that L_u has the second largest impact. For example, its removal yields the lowest performance on Ubuntu IRC in terms of NMI and ARI. Removing the other components has a smaller impact, and the model can still generate reasonable results.
In the unsupervised setting, removing L′_u undermines the model significantly in terms of ARI, since it removes the pairwise contrastive learning at the utterance level that helps to model local relations. Removing L′_s tends to have a milder impact, but it still degrades the results to a certain extent. The above results imply that the utterance-level loss captures local pairwise relations well, and that the session-level loss also contributes positively to learning cluster-friendly utterance representations.

Prediction of K
Predicting the session number K is crucial for our model since it directly affects the clustering results. We hence replace the predicted K with the ground truth K in training and inference, resulting in a moderate performance boost (w/ gold K) in both settings, as shown in Table 3. We also observe that the performance gaps between the model using the predicted K and that using the ground truth K are small. This shows that the model with the predicted K can still generate relatively satisfactory results, and that the K prediction performs reasonably well. We show the ACC and MAE of the predicted K in Table 3. They indicate that the supervised predictor works better, which is reasonable, and that the unsupervised methods Silhouette and Elbow perform similarly. This might be because both of them only operate on utterance features. Introducing other side information from the conversation might further boost the performance.
Another observation is that NMI on Ubuntu IRC decreases when the gold K value is adopted in the supervised setting. While counter-intuitive, this may be caused by the large number of sessions that contain only one utterance in this dataset.

Conclusion
We studied disentanglement of multi-party conversations and proposed a general model that works in both supervised and unsupervised learning settings. It is trained with a bi-level contrastive learning mechanism to bring utterances in the same session closer and to encourage utterances to flock around their session centers. At the same time, we pull utterances from different sessions further apart by contrasting each utterance with negative samples. The obtained representations naturally fit the clustering scheme for session prediction; consequently, K-Means is used during inference to predict the sessions. Our model is evaluated on the largest benchmark dataset, Ubuntu IRC, and the latest benchmark dataset, Movie Dialogue. Experimental results show new state-of-the-art performance and advancements compared to previous works. Additionally, the stability of our model across different datasets, as well as across training schemes with or without session labels, shows its potential to be applied in a general setting.

Limitations
Our work has the following limitations. Firstly, although the bidirectional LSTM is lightweight and obtains reasonable performance for our task, an easy extension is to explore how pre-trained language models such as BERT would further affect the performance (Liao et al., 2021a). Secondly, the prediction of the session number K is only based on conversation utterances. A more advanced session number estimation model could be devised to capture more side information for more accurate K prediction. An alternative approach is to adopt a different clustering algorithm, such as that of CISIR (Jiang et al., 2018), which does not require predicting the cluster number but instead uses a universal, empirically determined threshold to control the cluster size. Last but not least, our model has not been applied to dialogues longer than 50 utterances, and we have not verified its effectiveness in modeling longer dependencies. This motivates our future effort to adapt the model to a more general setting with longer conversations, more threads, and more complicated dialogue structures.

Figure 1: An example piece of conversation from the Ubuntu IRC corpus. There are distribution patterns at both the utterance level and the session level.

Figure 2: Overview of the proposed Bi-CL framework. It incorporates an utterance-level contrastive loss to discriminate utterances, and uses a session-level contrastive loss to encourage them to flock around session centers.

Table 1: Results on the Ubuntu IRC dataset and the Movie Dialogue dataset. * indicates that the statistics are taken from (Liu et al., 2020). The results of DialBERT + feature on the Movie Dialogue dataset are not available since the dataset does not provide the corresponding features.

Table 2: Ablation study on different design components of the proposed Bi-CL method under both settings.

Table 3: Accuracy (ACC) and Mean Absolute Error (MAE) of the predictions given by the K predictors.