Disentangling Online Chats with DAG-structured LSTMs

Many modern messaging systems allow fast and synchronous textual communication among many users. The resulting sequence of messages hides a more complicated structure in which independent sub-conversations are interwoven with one another. This poses a challenge for any task aiming to understand the content of the chat logs or gather information from them. The ability to disentangle these conversations is therefore essential to the success of many downstream tasks such as summarization and question answering. Structured information accompanying the text, such as user turns, user mentions, and timestamps, is used as a cue by the participants themselves, who need to follow the conversation, and has been shown to be important for disentanglement. DAG-LSTMs, a generalization of Tree-LSTMs that can handle directed acyclic dependencies, are a natural way to incorporate such information and its non-sequential nature. In this paper, we apply DAG-LSTMs to the conversation disentanglement task. We perform our experiments on the Ubuntu IRC dataset and show that the novel model we propose achieves state-of-the-art performance on the task of recovering reply-to relations and is competitive on other disentanglement metrics.


Introduction
Online chat and text messaging systems like Facebook Messenger, Slack, WeChat, and WhatsApp are common tools used by people to communicate in groups and in real time. In these venues multiple independent conversations often occur simultaneously, with their individual utterances interspersed.
It is reasonable to assume the existence of an underlying thread structure partitioning the full conversation into disjoint sets of utterances, which ideally represent independent sub-conversations.

* Equal contribution

Figure 1: Excerpt from the IRC dataset (left) and our reply-to classifier architecture (right). Blue dots represent a unidirectional DAG-LSTM unit processing the states coming from the children of the current node. Red dots represent the GRU units performing thread encoding. At this point in time, we are computing the score (log-odds) of the fifth utterance replying to the third.
The task of identifying these sub-units, known as disentanglement, is a prerequisite for further downstream tasks, including question answering, summarization, and topic modeling (Traum et al., 2004; Shen et al., 2006; Adams and Martell, 2008; Elsner and Charniak, 2010). Additional structure can generally be found in these logs, as a particular utterance can be a response to, or a continuation of, a previous one. Such reply-to relationships implicitly define threads as the connected components of the resulting graph topology, and can then be used for disentanglement (Mehri and Carenini, 2017; Dulceanu, 2016; Wang et al., 2008; Guo et al., 2018).
Modeling work on conversation disentanglement spans more than a decade. Elsner and Charniak (2008, 2010) use feature-based linear models to find pairs of utterances belonging to the same thread and heuristic global algorithms to assign posts to threads. Mehri and Carenini (2017) and Jiang et al. (2018), while adopting similar heuristics, use features extracted through neural models, LSTMs (Hochreiter and Schmidhuber, 1997) and siamese CNNs (Bromley et al., 1993) respectively. Wang et al. (2011) follow a different approach by modeling the interactions between the predicted reply-to relations as a conditional random field.
One challenge in building automatic systems that perform disentanglement is the scarcity of large annotated datasets with which to train expressive models. A remarkable effort in this direction is the work of Kummerfeld et al. (2019a) and the release of a dataset containing more than 77k utterances from the IRC #Ubuntu channel with annotated reply-to structure. In the same paper, it is shown that a set of simple handcrafted features, pooling of the utterances' GloVe embeddings (Pennington et al., 2014), and a feed-forward classifier can achieve good performance on the disentanglement task. Most of the follow-up work on the dataset relies on BERT (Devlin et al., 2019) embeddings to generate utterance representations, contextualized either by an additional transformer module or by an LSTM. Two exceptions are a model that predicts thread membership in an online fashion and discards reply-to relationships, and the recent work of Yu and Joty (2020a), which uses pointer networks (Vinyals et al., 2015).
In this short paper, we use DAG-structured LSTMs (İrsoy et al., 2019) to study disentanglement. As a generalization of Tree-LSTMs (Tai et al., 2015a), DAG-LSTMs faithfully represent the structure of a conversation, which is more properly described as a directed acyclic graph (DAG) than as a sequence. Furthermore, DAG-LSTMs allow for the systematic inclusion of structured information like user turns and mentions in the learned representation of the conversation context. We enrich the representation learned by the DAG-LSTM by concatenating to it a representation of the thread to which the utterance belongs. This thread encoding is obtained by means of a GRU unit (Cho et al., 2014) and captures thread-specific features like style, topic, or persona. Finally, we manually construct new features to improve username matching, which is crucial for detecting user mentions, one of the most important cues for disentanglement.
Our results are summarized in Table 1. The DAG-LSTM significantly outperforms the BiLSTM baseline. Ablation studies show the importance of the new features we introduce. When augmented by thread encoding and a careful handling of posts predicted to be thread starters, the DAG-LSTM architecture achieves state-of-the-art performance on reply-to relation extraction on the Ubuntu IRC dataset and is competitive on the other metrics relevant to disentanglement.

Problem Statement
A multi-party chat C is a sequence of posts (c_i), i = 1, …, |C|. For each query post c_i we look for the set of link posts R(c_i) such that c_i replies to, or links to, c_j for every c_j ∈ R(c_i). When a post c is a conversation starter we define, consistently with Kummerfeld et al. (2019a), R(c) = {c}; that is, c replies to itself, forming a self-link. This binary reply-to relation defines a DAG over C. By taking the union of the reply-to relation with its converse and computing its transitive closure, we obtain an equivalence relation on C whose equivalence classes are the threads, thus solving the disentanglement problem.
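The construction above can be sketched in code. The following is a minimal illustration (ours, not the authors' implementation) of how threads arise as connected components of the reply-to graph, using union-find in place of an explicit transitive closure:

```python
# Sketch: recovering threads as the connected components of the
# reply-to graph, via union-find. Posts are integers; `links[i] = j`
# means post i replies to post j (a self-link, links[i] = i, marks a
# conversation starter).

def threads_from_links(links):
    parent = list(range(len(links)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union each post with its link target; self-links are no-ops.
    for child, link in enumerate(links):
        ra, rb = find(child), find(link)
        if ra != rb:
            parent[ra] = rb

    clusters = {}
    for post in range(len(links)):
        clusters.setdefault(find(post), []).append(post)
    return sorted(clusters.values())

# Example: 2 replies to 0, 3 replies to 1, 4 replies to 2.
# threads_from_links([0, 1, 0, 1, 2]) -> [[0, 2, 4], [1, 3]]
```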
We frame the problem as a sequence classification task. For each query post c_i we consider the candidate set O_{c_i} ≡ {c_{i−L}, …, c_i}, formed by c_i and its L preceding posts, and predict one of them as its link. In the Ubuntu IRC dataset, predicting a single link per query post is a good approximation, holding true for more than 95% of the annotated utterances. We use L = 50 in the following. As described in Sections 2.2 and 2.3, for each query utterance c_i we construct a contextualized representation φ_i ≡ φ(c_i, C). We do the same for each candidate link c_j ∈ O_{c_i}, using a representation ψ that can in principle differ from φ.
We model the probability of c_j being the link of c_i as a softmax over the candidate set,

p(c_j | c_i) = exp(s_ij) / Σ_{c_k ∈ O_{c_i}} exp(s_ik),   (1)

where s_ij ≡ s(φ_i, ψ_j, f_ij) is a real-valued scoring function described in Section 2.4 and f_ij are additional features. The parameters of the resulting model are learned by maximizing the likelihood associated with Eq. 1. At inference time we predict ĵ = argmax_{c_j ∈ O_{c_i}} p(c_j | c_i).
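Concretely, Eq. 1 amounts to a numerically stabilized softmax over the scores of the candidate links. A minimal sketch (function names are ours):

```python
import math

def link_probabilities(scores):
    """Softmax over the candidate set O_{c_i} (Eq. 1).

    `scores[j]` is the score s_ij of candidate j; the max is
    subtracted before exponentiating for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict_link(scores):
    """Inference-time prediction: the argmax candidate."""
    probs = link_probabilities(scores)
    return max(range(len(probs)), key=probs.__getitem__)
```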

Contextual Post Representation
The construction of the φ and ψ representations closely follows İrsoy et al. (2019). Every post c_i is represented as a sequence of tokens (t_n^i). An embedding layer maps the tokens to a sequence of d_I-dimensional real vectors (ω_n^i). We use the tokenizer and the word embeddings from Kummerfeld et al. (2019a), with d_I = 50. We generate a representation χ_i of c_i by means of a single BiLSTM layer unrolled over the sequence of token embeddings. To obtain the contextualized representations φ, we use a DAG-LSTM layer. This is an N-ary Tree-LSTM (Tai et al., 2015a) in which the sum over children in the recursive definition of the memory cell is replaced with an elementwise max operation (see Appendix). This allows the existence of multiple paths between two nodes (as is the case when a node has multiple children) without the associated state explosion (İrsoy et al., 2019), which is crucial for handling long sequences such as ours.
At each time step the DAG-LSTM unit receives the utterance representation χ_i of the current post c_i as the input, together with all the hidden and cell states coming from a labeled set of children, C(c_i); see Figure 1. In our case C(c_i) contains three elements: the previous post in the conversation (c_{i−1}), the previous post by the same user as c_i, and the previous post by the user mentioned in c_i, if any. More dependencies can easily be added, making this architecture well suited to handling structured information. The DAG-LSTM is unrolled over the sequence ({χ_i, C(c_i)})_i, providing a sequence of contextualized post representations (φ_i)_i. We also consider a bidirectional DAG-LSTM, defined by a second unit processing the reversed sequence c̄_i ≡ c_{|C|−i+1}. Forward and backward DAG-LSTM representations are then concatenated to obtain φ.
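As an illustration, the cell update just described can be sketched as follows. This is a simplified rendering under our own assumptions: dimensions, initialization, the child labels, and the per-child form of the forget gate are illustrative, and the class name is ours.

```python
import numpy as np

# Illustrative sketch of a DAG-LSTM cell. Each child carries a label
# ("prev", "same_user", "mention"); the sum over children in the usual
# Tree-LSTM memory update is replaced by an elementwise max.

class DagLstmCell:
    def __init__(self, d_in, d_h, labels, rng=np.random.default_rng(0)):
        self.d_h = d_h
        init = lambda *shape: rng.normal(0, 0.1, shape)
        self.W = {g: init(d_h, d_in) for g in "ifou"}
        # one recurrent matrix per (gate, child label)
        self.U = {(g, l): init(d_h, d_h) for g in "ifou" for l in labels}

    def __call__(self, x, children):
        # children: dict label -> (h_child, c_child); absent labels
        # simply contribute nothing.
        sig = lambda z: 1.0 / (1.0 + np.exp(-z))
        hs = {l: hc[0] for l, hc in children.items()}
        agg = lambda g: sum((self.U[g, l] @ h for l, h in hs.items()),
                            np.zeros(self.d_h))
        i = sig(self.W["i"] @ x + agg("i"))
        o = sig(self.W["o"] @ x + agg("o"))
        u = np.tanh(self.W["u"] @ x + agg("u"))
        c = i * u
        if children:
            # elementwise max over children instead of a sum
            f_terms = [sig(self.W["f"] @ x + self.U["f", l] @ h)
                       * children[l][1] for l, h in hs.items()]
            c = c + np.maximum.reduce(f_terms)
        h = o * np.tanh(c)
        return h, c
```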

Thread Encoding
The link post representation ψ can coincide with the query one, ψ_j ≡ φ_j. One potential issue with this approach is that ψ does not depend on past thread assignments. Furthermore, thread-specific features such as topic and persona cannot be easily captured by the hierarchical but sequential model described in the previous section. Thus we augment the link representations by means of thread encoding. Given a query post c_i and a candidate link post c_j, we consider the thread T(c_j) = (c_{t_k}), with t_k < t_{k+1} and t_{|T(c_j)|} = j, to which c_j has been assigned. We construct a representation τ_j of this thread by means of a GRU cell, τ_j = GRU[(χ(c))_{c ∈ T(c_j)}]. ψ_j is then obtained by concatenating φ_j and τ_j. At training time we use the gold threads to generate the τ representations, while at evaluation time we use the predicted ones.
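A minimal sketch of thread encoding (the GRU parameterization and all names are ours; the actual model learns these weights jointly with the rest of the network):

```python
import numpy as np

# Sketch: a GRU is unrolled over the χ encodings of the posts already
# assigned to thread T(c_j); its final state τ_j is concatenated to
# φ_j to form the link representation ψ_j.

def gru_cell(x, h, W, U, b):
    """Standard GRU cell (Cho et al., 2014); W, U, b hold the update
    (z), reset (r), and candidate (n) parameters."""
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = sig(W["z"] @ x + U["z"] @ h + b["z"])
    r = sig(W["r"] @ x + U["r"] @ h + b["r"])
    n = np.tanh(W["n"] @ x + U["n"] @ (r * h) + b["n"])
    return (1 - z) * h + z * n

def thread_encoding(chi_seq, params, d_h):
    h = np.zeros(d_h)
    for chi in chi_seq:          # posts of T(c_j), in temporal order
        h = gru_cell(chi, h, *params)
    return h                     # τ_j

def link_representation(phi_j, tau_j):
    return np.concatenate([phi_j, tau_j])   # ψ_j = [φ_j ; τ_j]
```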

Scoring Function
Once the query and link representations are constructed, we use the scoring function in Eq. 1 to score each candidate link against the query utterance, with s a three-layer feed-forward neural network. The input of the network is the concatenation [φ_i; ψ_j; f_ij], where f_ij are the 77 features introduced by Kummerfeld et al. (2019a). We augment them with 42 additional features based on the Levenshtein distance and longest common prefix between the query's username and the words in the link utterance (and vice versa). These are introduced to improve mention detection by being more lenient toward spelling mistakes (see Section 2.5 for precise definitions).

User Features
While IRC chats allow systematically tagging other participants (a single mention per post), users can also address each other explicitly by typing usernames. This introduces abbreviations and typos, which are not efficiently captured by the set of features used by Kummerfeld et al. (2019b). To ameliorate this problem we construct additional features. Given a pair of utterances c_1 and c_2 we define the following:
• Smallest Levenshtein distance (D_L) between c_1's (c_2's) username and each of the words in c_2 (c_1), binned as D_L = i for i = 0, …, 4, or D_L > 4.
• Binary variable indicating whether c_1's (c_2's) username is a prefix of any of the words in c_2 (c_1).
These amount to a total of 42 additional features for each pair of posts.
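The features above can be sketched as follows; the exact binning and feature ordering are our reading of the description, not the released implementation:

```python
# Sketch of the username-matching features: smallest Levenshtein
# distance between a username and any word of the other post (one-hot
# binned), plus a username-is-prefix indicator.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def username_features(username, words, n_bins=5):
    d = min((levenshtein(username, w) for w in words), default=n_bins)
    bins = [0] * (n_bins + 1)
    bins[min(d, n_bins)] = 1      # one-hot: D_L = 0..4, or D_L > 4
    is_prefix = int(any(w.startswith(username) for w in words))
    return bins + [is_prefix]
```

For example, a post addressing "alice" as the misspelled "alcie" still fires the D_L = 2 bin, so the scorer can treat it as a likely mention.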

Evaluation
We conduct our experiments on the Ubuntu IRC dataset for disentanglement (Kummerfeld et al., 2019a; Kim et al., 2019). We focus on two evaluation metrics defined in Kummerfeld et al. (2019a): graph F_1, the F-score computed over the correctly predicted reply-to pairs, and cluster F_1, the F-score computed over the matching threads of length greater than 1.

Experiments
As a baseline, we use a BiLSTM model in which φ_i (= ψ_i) is obtained as the hidden states of a bidirectional LSTM unrolled over the sequence (χ_i)_i. The base DAG-LSTM model uses both username and mentions to define the children set C of an utterance. Bidirectionality is left as a hyperparameter. All our experiments use the same architecture from Section 2 to construct the utterance representation χ. We train each model by minimizing the negative log-likelihood of Eq. 1 using the Adam optimizer (Kingma and Ba, 2015). We tune the hyperparameters of each architecture through random search; we refer to the Appendix for details. Table 1 shows the test set performance of the models which achieve the best graph F_1 score

Table 2: Self-link precision (P), recall (R), and F-score (F) for each model.

over the dev set. Optimizing the graph rather than the cluster score is motivated by the following observation: the dev set cluster F_1 score displays a much larger variance than the graph F_1 score, roughly four-fold after subtracting the rolling average of the score. By picking the iteration with the best cluster F_1 score we would be more exposed to these fluctuations and to worse generalization, which we indeed observe.

Self-Links Threshold Tuning
As noted by Yu and Joty (2020b), the ability of the model to detect self-links is crucial for its final performance. In line with their findings, we also observe that all our models are skewed towards high recall for self-link detection (Table 2).
To address this, we introduce two thresholds, θ and δ, which we compare with p̂, the argmax probability of Eq. 1, and Δp, the difference between the top-2 predicted probabilities. Whenever the argmax is a self-link: if p̂ < θ, we predict the next-to-argmax link; otherwise, we predict both of the top-2 links if, in addition, Δp < δ. On the dev set, we first tune θ to maximize the self-link F_1 score and then tune δ to maximize the cluster F_1 score. Table 1 shows our main results. Our DAG-LSTM model significantly outperforms the BiLSTM baseline. Ablation studies on our best DAG-LSTM model show that, while both the user features and the mention link improve both the cluster and graph scores, only the user feature ablation results in a significant change. Self-link threshold tuning improves performance, particularly the cluster score, for both models, highlighting the importance of correctly identifying thread starters.
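The decoding rule with the two thresholds can be sketched as follows (function and argument names are ours):

```python
# Sketch of the self-link decoding rule with thresholds theta and
# delta. `probs` are the Eq. 1 probabilities over the candidate set;
# `self_idx` is the index of the self-link candidate (the query post
# itself).

def decode_links(probs, self_idx, theta, delta):
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    top, second = order[0], order[1]
    if top != self_idx:
        return [top]
    p_hat = probs[top]
    if p_hat < theta:                       # low-confidence self-link:
        return [second]                     # fall back to the runner-up
    if p_hat - probs[second] < delta:       # confident but a close call:
        return [top, second]                # keep both of the top-2 links
    return [top]
```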

Results Discussion
The DAG-LSTM model with thread encoding achieves state-of-the-art performance in predicting reply-to relations. This is particularly interesting when compared with models employing contextual embeddings. For the cluster scores, the best model is the pointer network of Yu and Joty (2020a), which is nonetheless within less than 0.5% of the best contextual model, and within 2.5% of our model. The difference mainly arises from a difference in recall and corresponds to an absolute difference of fewer than 10 true-positive clusters on the test set. Further comparisons with the existing literature are limited by code not being available at the moment.

Conclusions
In this paper we apply, for the first time, DAG-LSTMs to the disentanglement task; they provide a flexible architecture that allows incorporating into the learned neural representations the structured information that comes alongside multi-turn dialogue. We also propose thread encoding and a new set of features to aid the identification of user mentions.
Several directions remain to be explored. We modeled the reply-to relationships in a conversation under an assumption of conditional independence of the reply-to assignments. This is possibly a poor approximation, and it would be interesting to lift it. A challenge with this approach is the computational complexity resulting from the large dimension of the output space of the reply-to classifier. We note that thread encoding enables a non-greedy decoding strategy through beam search, which would be interesting to explore further.

A.1 DAG-LSTM Equations
A DAG-LSTM is a variation on the Tree-LSTM (Tai et al., 2015b) architecture that is defined over DAGs. Given a DAG G, we assume that for every vertex v of G, each edge e(v, v′) connecting a child v′ ∈ C(v) to v can be assigned a unique label ℓ_{v,v′} from a fixed set of labels.
A pair of state vectors (h_v, c_v) and an input x_v are associated with every vertex v. The DAG-LSTM equations define the states (h_v, c_v) as a function of the input x_v and the states of the children of v:

(h_v, c_v) = DAG-LSTM(x_v, {(h_{v′}, c_{v′})}_{v′ ∈ C(v)}).   (2)

The equations defining these functions are the following:

i_v = σ(W^(i) x_v + Σ_{v′ ∈ C(v)} U^(i)_{ℓ_{v,v′}} h_{v′}),   (3)

f_{v,v′} = σ(W^(f) x_v + U^(f)_{ℓ_{v,v′}} h_{v′}),   (4)

c_v = i_v ⊙ u_v + max_{v′ ∈ C(v)} (f_{v,v′} ⊙ c_{v′}),   (5)

h_v = o_v ⊙ tanh(c_v).   (6)

The equations for the o and u gates are the same as that for the i gate, replacing i → o, u everywhere (with tanh in place of σ for u). Bias vectors are left implicit in the definitions of i, f, o, and u. ⊙ represents the Hadamard product, and the max in Eq. 5 represents the elementwise max operation.
A bidirectional DAG-LSTM is simply a pair of independent DAG-LSTMs, one of which is unrolled over the time-reversed sequence of utterances. The output of a bidirectional DAG-LSTM is the concatenation of the h states of the forward and backward units for a given utterance.

A.2 Training and Hyperparameter Tuning
We use the adjudicated training, development, and test sets from Kummerfeld et al. (2019b). Each of these datasets is composed of a set of conversations (153 in the training set and 10 each in the development and test sets), each representing a chunk of contiguous posts from the IRC #Ubuntu channel. Each of these conversations contains strictly more than 1000 posts (exactly 1250 and 1500 for the dev and test sets respectively). Annotations are available for all but the first 1000 posts in every conversation. We apply some preprocessing to these conversations.
We chunk the annotated section of every training conversation into contiguous chunks of 50 posts each, starting from the first annotated post. To each of these chunks we attach a past context of 100 posts and a future context of 50 posts, resulting in chunks 200 utterances long. For each chunk we keep only those annotated links for which the response utterance lies in the central 50 posts. We do not chunk the development and test sets, but drop the first 900 posts in every conversation.
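The chunking procedure can be sketched as follows (boundary handling at the end of a conversation is our best reading of the description):

```python
# Sketch of the training-set chunking: 50-post annotated windows, each
# padded with up to 100 posts of past context and 50 of future context,
# giving 200-utterance chunks away from the conversation boundaries.

def make_chunks(posts, first_annotated, size=50, past=100, future=50):
    chunks = []
    for start in range(first_annotated, len(posts) - size + 1, size):
        lo = max(0, start - past)
        hi = min(len(posts), start + size + future)
        # (context posts, indices of the central annotated window)
        chunks.append((posts[lo:hi], range(start, start + size)))
    return chunks
```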
The architectures we consider share the same set of hyperparameters to tune. One parameter, d_h, controls the dimension of the hidden state of the LSTMs, and another, d_FF, controls the dimension of the hidden layers of the feed-forward scorer. We use word dropout, apply dropout after the max-affine layer, and apply dropout after the activation at every layer of the feed-forward scorer. We clip all gradient entries at 5. We use a single layer of LSTMs and DAG-LSTMs to build the χ and φ, ψ representations and do not apply dropout to any of their units. Similarly, we use a single-layer GRU for the thread encoder. We list all the hyperparameters in Table 3, together with the ranges and distributions used for the random search.
Hyperparameter optimization is performed by running 100 training jobs for each of the base BiLSTM architecture, the DAG-LSTM, and the DAG-LSTM with thread encoding. Our published results are from the best among these runs. The best sets of parameters we find for each of these architectures are:
• BiLSTM: d_h = 256, d_FF = 128, no word or max-affine dropout, a feed-forward dropout of 0.3, and a learning rate of 2.4 × 10^−4.
• DAG-LSTM with thread encoding: d_h = d_FF = 256, word and max-affine dropout of 0.3, a feed-forward dropout of 0.5, and a learning rate of 7.9 × 10^−4.
The user feature and mention link ablations are obtained by fixing all parameters of the best DAG-LSTM run (removing the feature under study) and running 10 jobs, changing only the random seed. Each training job is performed on a single GPU and, depending on the architecture, takes from 6 to 12 hours.

A.3 Significance Estimates
We use the McNemar test (McNemar, 1947) to evaluate the significance of performance differences between models. Given two models M_A and M_B, we define n_AB as the number of links correctly predicted by M_A but not by M_B, and n_BA conversely. Under the null hypothesis, n_AB ~ Bin(n, 1/2), where n ≡ n_AB + n_BA. We define model A to be significantly better than model B if the null hypothesis is rejected at the 95% confidence level.
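The exact binomial form of this test can be sketched as follows (the implementation is ours):

```python
import math

# Sketch of the exact (binomial) McNemar test used above: n_ab is the
# number of links model A gets right and B gets wrong, and vice versa
# for n_ba.

def mcnemar_p(n_ab, n_ba):
    """Two-sided exact p-value for H0: n_AB ~ Bin(n_AB + n_BA, 1/2)."""
    n = n_ab + n_ba
    k = min(n_ab, n_ba)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Model A is significantly better than B at the 95% confidence level
# when n_AB > n_BA and mcnemar_p(n_AB, n_BA) < 0.05.
```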