Survival text regression for time-to-event prediction in conversations

Time-to-event prediction tasks are common in conversation modelling, for applications such as predicting the length of a conversation or when a user will stop contributing to a platform. Despite the fact that it is natural to frame such predictions as regression tasks, recent work has modelled them as classification tasks, determining whether the time-to-event is greater than a pre-determined cut-off point. While this allows for the application of classification models which are well studied in NLP, it imposes a formulation that is contrived, as well as less informative. In this paper, we explore how to handle time-to-event forecasting in conversations as regression tasks. We focus on a family of regression techniques known as survival regression, which are commonly used in the context of healthcare and reliability engineering. We adapt these models to time-to-event prediction in conversations, using linguistic markers as features. On three datasets, we demonstrate that they outperform commonly considered text regression methods and comparable classification models.


Introduction
The task of predicting when an event will occur in a conversation frequently arises in NLP research. For instance, Backstrom et al. (2013) and Zhang et al. (2018b) predict when a conversation thread will terminate, and other work defines the task of forecasting when users will cease to interact on a social network based on their language use. Although these questions naturally lend themselves to regression, this presents some difficulties: datasets may be highly skewed towards shorter durations (Zhang et al., 2018b), and samples with a longer duration can contribute inordinately to error terms during training. Furthermore, classical regression models do not explicitly consider the effect of time as distinct from other features.
The abovementioned studies instead frame the time-to-event prediction as a classification task, predicting whether the current state will continue for a set number of additional timesteps. For instance, Backstrom et al. (2013) predict whether the number of responses in a thread will exceed 8, after seeing 5 utterances. This presents obvious limitations; such a setup would assign the same error for mistakenly classifying conversations of respectively 9 and 30 utterances as "short". Additionally, its predictions are less informative: predicting that a conversation will be more than 8 utterances long is less telling than predicting whether it will be 9 or 30.
In this paper, we propose that survival regression is a more appropriate modelling framework for predicting when an event will occur in a conversation. Survival regression aims to predict the probability of an event of interest at different points in time, taking into account features of a subject as seen up to the prediction time. We apply survival models to two tasks: predicting conversation length, and predicting when conversations will derail into personal attacks. We report results for the conversation length prediction task on the datasets from Danescu-Niculescu-Mizil et al. (2012) and De Kock and Vlachos (2021), and evaluate the personal attack prediction task on the dataset of Zhang et al. (2018a). Our results illustrate that linear survival models outperform their linear regression counterparts, with an improvement in MAE of 1.22 utterances on the dataset of De Kock and Vlachos (2021). Further performance gains are made using neural network-based survival models. An analysis of the coefficients of our linear models indicates that survival models infer similar relationships as previous work on conversation length prediction, but that their predictions are more accurate than conventional regression and classification models due to their explicit accounting for the effect of time. On the personal attack prediction task, the best survival model provides a 13% increase in ranking accuracy over linear regression models.
The remainder of this paper is structured as follows. In Section 2 we provide a description of key survival analysis concepts. In Section 3, we describe how we apply these concepts to conversations. Results are reported in Section 4.

Survival regression
Survival analysis is concerned with modelling time-to-event prediction, which often represents transitions between states throughout a subject's lifetime. In the general case, exactly one event of interest occurs per lifetime, after which the subject is permanently in the alternate state, often referred to as "death" in the literature. In this section, we review some key concepts of survival analysis that are relevant to our work; for a fuller treatment, we refer the interested reader to the exposition by Rodriquez (2007).

Definitions
Let T be a non-negative random variable representing the waiting time until the occurrence of an event. Given the cumulative distribution function F(t) of the event time T, the survival function is defined as the probability of surviving beyond a certain point in time t:

S(t) = P(T > t) = 1 - F(t). \quad (1)

As an illustration, we consider the task of predicting conversation length using the dataset of disagreements of De Kock and Vlachos (2021). The event of interest is the end of a conversation, with time measured in utterances. We can estimate the survival function using Kaplan-Meier estimation (Jager et al., 2008) as follows:

\hat{S}(t) = \prod_{i : t_i \le t} \left(1 - \frac{d_i}{|R_i|}\right), \quad (2)

where d_i is the number of candidates who experience the event at time t_i, and R_i represents the so-called risk set, i.e. the candidates at risk of experiencing the event just prior to t_i. In Figure 1, the base function is the estimated survival probabilities over time for the full population. Only conversations of more than 5 utterances are considered; hence the survival probability is 1 for all curves up until t = 5. If we create subsets of the population by conditioning on the response time, the subset with a longer response time has a steeper decline, indicating that conversations where participants take longer to respond are more likely to end earlier.
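The Kaplan-Meier estimator of Equation 2 can be sketched in a few lines. This is a minimal version without censoring, assuming integer event times measured in utterances and made-up data; the actual experiments use `lifelines`.

```python
from collections import Counter

def kaplan_meier(event_times):
    """Return {t: S(t)} estimated from observed (uncensored) event times."""
    deaths = Counter(event_times)       # d_i: number of events at each time t_i
    at_risk = len(event_times)          # |R_i|: subjects still at risk
    surv, s = {}, 1.0
    for t in sorted(deaths):
        s *= 1.0 - deaths[t] / at_risk  # multiply in the factor (1 - d_i / |R_i|)
        at_risk -= deaths[t]            # these subjects leave the risk set
        surv[t] = s
    return surv

# Hypothetical conversation lengths, in utterances:
surv = kaplan_meier([6, 6, 7, 9, 9, 9, 12])
# S(6) = 1 - 2/7 = 5/7; S(7) = 5/7 * (1 - 1/5) = 4/7
```

Conditioning on a feature, as in Figure 1, amounts to running the same estimator on each subset of the population.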
In survival regression, the aim is to learn such survival functions based on features of a subject, while the current time is modelled separately from them, unlike in standard regression models.
To estimate the expected event time given a survival function, one can take its expected value:

E[T] = \int_0^\infty S(t) \, dt. \quad (3)

A closely related quantity is the hazard function h(t), the instantaneous rate at which the event occurs at time t given survival up to t. The cumulative hazard is given by H(t) = \int_0^t h(u) \, du and is related to the survival function according to S(t) = e^{-H(t)}.
Parametric survival regression models (described in more detail in Sections 2.3 and 2.4) are often optimised to predict either the survival or the hazard function, given that it is always possible to convert between them. Such models can include feature representations (such as the response time in Figure 1) to obtain individualised predictions.

Censoring
A common consideration in survival studies is the presence of censoring, where a participant leaves a study before the end of the observation period, or they do not experience the event of interest within this period. Under censoring, each subject i has an associated potential censoring time C i and a potential lifetime T i . We observe Y i = min{T i , C i }, i.e. the minimum of the censoring and lifetimes, and an indicator variable δ i for whether the observation ended with death or censoring.
Consider the task of predicting personal attacks (described in more detail in Section 3). Conversations that end without a personal attack occurring can be considered analogous to patients dropping out of a study before the end of the observation period. The duration of the observation can then be taken as the censoring time.
Different survival models account for censoring in different ways. For instance, for a survival curve estimated with the Kaplan-Meier method (Equation 2), censored individuals are removed from the set of candidates at risk (R i ) at the censoring time, without having experienced the event of interest.

Proportional hazards models
In proportional hazards models, the hazard of a subject with feature vector x is assumed to factorise into a shared time-dependent baseline and a time-independent risk term:

h(t \mid x) = h_0(t) \, g(x; \theta), \quad (5)

where h_0(t) represents the baseline hazard for the population at each timestep, such as the base survival function in Figure 1. The g(x; \theta) term is often referred to as the risk function and specifies how the feature vector x of a sample is taken into account using parameters \theta. In our experiments, we consider two variations of this approach:

Linear Cox: The traditional Cox-PH model (Cox, 1972) uses a linear weighting of the feature vector to calculate the risk function as follows:

g(x; \theta) = e^{\theta^T x}. \quad (6)

DeepSurv: DeepSurv (Katzman et al., 2018) uses a neural network to compute the risk function g(x; \theta), where \theta represents the weights of the network. The advantage of this is that the neural network can learn nonlinear features from the training data, which often improves predictive accuracy.

During training, the parameters \theta are optimised with maximum likelihood estimation for both models. Given individuals i with event time T_i in dataset D, let R_i denote the risk set at T_i and \delta_i the censoring indicator. Then, the likelihood of the data is given by:

L(\theta) = \prod_{i \in D} \left( \frac{g(x_i; \theta)}{\sum_{j \in R_i} g(x_j; \theta)} \right)^{\delta_i}. \quad (7)

Intuitively, we aim to maximise the risk of i experiencing an event, over all other candidates at risk at time T_i. In the context of predicting conversation lengths, this means that at time T_i, any conversation that has not yet ended could end, but we want to maximise the probability of the candidate that had indeed ended then over the rest. This is referred to as a partial likelihood, in reference to the fact that the effect of the features can be estimated without the need to model the change of the hazard over time. The indicator term \delta expresses that only non-censored samples contribute terms that impact the likelihood (since the contribution of censored samples would be 1); however, the censored samples are included in the risk set R_i up until their respective censoring times.
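The log of the partial likelihood in Equation 7 can be sketched directly for the linear risk function. The data below are hypothetical (event time, censoring indicator, feature vector) triples; real implementations such as `lifelines` use vectorised and numerically stabler forms.

```python
import math

def cox_log_partial_likelihood(theta, data):
    """Log partial likelihood; data is a list of (event_time, delta, features)."""
    def risk(x):                         # linear Cox risk: g(x; theta) = exp(theta . x)
        return math.exp(sum(w * xi for w, xi in zip(theta, x)))
    ll = 0.0
    for t_i, delta_i, x_i in data:
        if delta_i == 0:                 # censored samples contribute no term...
            continue
        # ...but every subject still at risk at t_i appears in the denominator
        denom = sum(risk(x_j) for t_j, _, x_j in data if t_j >= t_i)
        ll += math.log(risk(x_i) / denom)
    return ll

data = [(5, 1, [0.0]), (7, 1, [0.0]), (9, 0, [0.0])]   # third sample is censored
ll = cox_log_partial_likelihood([0.0], data)
# With theta = 0 every risk is 1, so ll = log(1/3) + log(1/2) = -log(6)
```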

Survival regression as classification
A different approach to survival regression is to use classification to predict the timestep when an event will occur. DeepHit (Lee et al., 2018) is a neural network model that predicts a distribution over timesteps in this fashion. This provides more modelling flexibility compared to the Cox-PH models, where features are incorporated through the risk function and combined with a baseline hazard.
The model can incorporate multiple competing risks with distinct events of interest, and models censoring as a special type of risk. The output of the network is a vector representing the joint probability that the subject will experience each non-censoring event at every timestep t in the observation period. Censoring is assumed to take place at random and is therefore not included in the prediction. In the case of a single risk, the model therefore predicts a vector \hat{y}_i = [\hat{y}_{t=0}, \ldots, \hat{y}_{t=t_{max}}], where each output element \hat{y}_t represents the estimated probability \hat{P}(t \mid x, \theta) that a subject with feature vector x will experience the event at time t under the model parameters \theta. Instead of a survival function, DeepHit defines a risk-specific cumulative incidence function (CIF) which expresses the probability that the event occurs before a time t^*, conditioned on features x^*:

F(t^* \mid x^*) = \sum_{t \le t^*} \hat{P}(t \mid x^*, \theta).

The loss function for training DeepHit has two components: an event time likelihood and a ranking loss. The ranking loss ensures that earlier events are predicted to happen before later events based on their CIF, but does not penalise models for mispredicting the times in absolute terms. The event time likelihood maximises the probability of the event occurring at the right time (\hat{y}^{(i)}_{T^{(i)}}) or, in the case of censoring, the probability of the event not happening before the censoring time.
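For a single risk, the CIF is simply the running sum of the network's predicted event-time distribution. A minimal sketch, with a made-up predicted distribution in place of real network outputs:

```python
# Hypothetical DeepHit output: estimated P(t | x) for t = 0..5, summing to 1.
y_hat = [0.0, 0.0, 0.1, 0.3, 0.4, 0.2]

def cif(y_hat, t_star):
    """Cumulative incidence: probability the event occurs at or before t_star."""
    return sum(y_hat[: t_star + 1])

# cif(y_hat, 3) = 0.0 + 0.0 + 0.1 + 0.3 = 0.4
```

The ranking loss compares such CIF values across pairs of subjects, favouring orderings where the subject with the earlier event has the higher CIF at that event time.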

Previous applications in NLP
A small number of NLP studies have employed techniques from survival analysis for time-dependent tasks. Navaki Arefi et al. (2019) use survival regression to investigate factors that result in posts being censored on a Chinese social media platform, finding that negative sentiment is associated with shorter lifetimes. Stewart and Eisenstein (2018) use a linear Cox model to infer factors that are predictive of non-standard words falling out of use in online discourse, finding that words that appear in more linguistic contexts survive longer. Other applications include modelling fixation times in reading (Nilsson and Nivre, 2011) and evaluating dialogue systems (Deriu et al., 2020). However, none of these studies considered time-to-event prediction tasks based on conversations.

Survival regression in conversations
We evaluate survival models on two tasks, predicting conversation length and predicting when personal attacks will occur, where each conversation is a subject and time is measured in utterances.¹

¹ The task of predicting when users would cease to use a platform would also have been an interesting case for this study; however, the relevant datasets are no longer available.

Task 1: Predicting conversation length — Having seen t utterances, predict the total number of utterances in the conversation. This is a regression analogue of the thresholded classification task of Backstrom et al. (2013), mentioned in Section 1.
Task 2: Predicting personal attacks — Having seen t utterances, predict the number of utterances until a personal attack occurs. Conversations where no personal attack occurs are censored during training, and the conversation length is used as the observation time. Just less than half of the conversations contain personal attacks (1 569 out of 3 466). This is a novel task; previous work has only addressed predicting whether conversations will derail into personal attacks, without attempting to predict when in a conversation this may occur (Zhang et al., 2018a; Chang and Danescu-Niculescu-Mizil, 2019). The motivation cited in both of the abovementioned studies is to prioritise conversations at risk of derailing for preemptive moderation. Survival models can give a more informative answer that takes into account the time until the attack, and therefore which conversations pose the most immediate risk. We use the dataset of Zhang et al. (2018a) for our experiments on this task.

Characteristics of the datasets we use are shown in Table 1. We use only conversations where the event of interest occurs after the fifth utterance, and we remove conversations longer than the 95th percentile, as these are often flame wars which may have confounding impacts. Data is split into training, development and test sets with ratios 75:10:15.

Metrics
Two metrics are calculated to evaluate model performance: mean absolute error and concordance index. The mean absolute error (MAE) for a dataset of n test samples is defined as

\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|.

This metric provides an easily interpretable score, and it is commonly used in evaluating regression models, e.g. Bitvai and Cohn (2015). However, MAE is not robust to outliers; large errors on a few values can outweigh many correct predictions. MAE is also ill-defined in the presence of censoring, as there is no event time to compare against, and it cannot be used to compare model performance between different datasets. For these reasons, we also include the concordance index (Harrell Jr et al., 1996), which is concerned with ordering rather than absolute values. A pair of observations i, j is considered concordant if the prediction and the ground truth have the same inequality relation, i.e. (y_i > y_j, \hat{y}_i > \hat{y}_j) or (y_i < y_j, \hat{y}_i < \hat{y}_j). The concordance index (CI) is the fraction of concordant pairs. A random model, or a model that predicts the same value for every sample, would yield a score of 0.5; a perfect score is 1. In the presence of censoring, censored samples are only compared with uncensored samples of a smaller event time, since in that case it is known that the censored sample should be assigned a later event time.
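A pairwise sketch of the concordance index under censoring, assuming larger predicted values mean later event times (libraries such as `lifelines` provide an optimised version):

```python
def concordance_index(y_true, y_pred, censored=None):
    """Fraction of comparable pairs whose predicted ordering matches the truth.

    Censored samples are only compared against uncensored samples with a
    smaller event time, as described in the text.
    """
    censored = censored or [False] * len(y_true)
    concordant = total = 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            # Skip pairs that are not comparable under censoring or are tied:
            if censored[i] and censored[j]:
                continue
            if censored[i] and y_true[i] <= y_true[j]:
                continue
            if censored[j] and y_true[j] <= y_true[i]:
                continue
            if y_true[i] == y_true[j]:
                continue
            total += 1
            if (y_true[i] < y_true[j]) == (y_pred[i] < y_pred[j]):
                concordant += 1
    return concordant / total
```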
A disadvantage of the CI score is that it does not reflect how accurate the predictions are in absolute terms, meaning that good CI scores can be achieved with predictions in the wrong range. The two scores thus provide complementing views on model performance.

Features
The features we consider are based on previous work on conversation length prediction and predicting personal attacks. These are:

• Politeness (POL): The politeness strategies from Zhang et al. (2018a) as implemented in Convokit (Chang et al., 2020), which capture greetings, apologies, saying "please", etc.
• Arrival sequences (ARR): The order in which speakers partake in the first 5 utterances, defined by Backstrom et al. (2013).
• Hypergraph (HYP): Conversation structure features based on the reply tree, proposed by Zhang et al. (2018b) and implemented in Convokit (Chang et al., 2020). These features capture dynamics between participants, such as engagement and reciprocity.
• Sentiment (SENT): Positive and negative sentiment word counts, as per the lexicon of Liu et al. (2005), also implemented in Convokit.
• Time features (TIME): Log mean time between utterances and time between last two utterances, inspired by Backstrom et al. (2013).
• Utterance lengths (LEN): Log mean utterance length, measured in tokens.

For the POL, SENT, TIME and LEN features, we include both the mean value throughout the conversation and the gradient of a straight-line fit, to capture how the feature changes throughout it. All features are calculated up to the point of prediction, and not for the full conversation.
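The mean-plus-gradient summary for the POL, SENT, TIME and LEN features can be sketched as an ordinary least-squares straight-line fit over utterance indices; the per-utterance values below are hypothetical.

```python
def mean_and_gradient(values):
    """Summarise a per-utterance feature series by its mean and OLS slope."""
    n = len(values)
    xs = range(n)                            # utterance indices 0..n-1
    mean_x, mean_y = (n - 1) / 2, sum(values) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return mean_y, slope

# e.g. a feature rising steadily over three utterances:
m, g = mean_and_gradient([1.0, 2.0, 3.0])    # mean 2.0, gradient 1.0
```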

Experimental setup
We use partly conditional training (Zheng and Heagerty, 2005) to account for features that change over time, such as politeness, in contrast with static features like the arrival sequence. Under partly conditional training, a feature measured at time t predicts the risk of the occurrence of an event at a future time T. In our case, each individual is a conversation and features are measured after every utterance. Each measurement t of a conversation i is recorded as an individual entry in the dataset, with the number of remaining utterances as its event time. This construction is illustrated in Table 2 for the Talk dataset. There are 307 conversations that contain 12 utterances, but 0 samples of length 12 in the training data, since all conversations of this length have 0 utterances remaining and regression is therefore unnecessary. However, we include the first 11 utterances of the length-12 conversations in the training set at t = 11, since the remaining length here could be either 0 or 1. As such, there are (642+307=) 769 samples of length 11. We use a minimum value of t = 5 to ensure there is sufficient information from which to make a prediction. Details for the other datasets are in Appendix A.
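The partly conditional construction can be sketched as follows: each conversation of length T yields one training sample per observed prefix t from the minimum of 5 up to T-1, with the remaining number of utterances as the event time. Field names here are illustrative, not the paper's actual schema.

```python
def expand(conversation_lengths, t_min=5):
    """Expand conversations into per-prefix training samples."""
    samples = []
    for T in conversation_lengths:
        for t in range(t_min, T):            # prefixes t_min .. T-1
            samples.append({"prefix": t, "time_to_event": T - t})
    return samples

expand([7])
# → [{'prefix': 5, 'time_to_event': 2}, {'prefix': 6, 'time_to_event': 1}]
```

Note that a conversation of exactly t_min utterances contributes no samples, matching the exclusion of length-12 samples at t = 12 in Table 2.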
Our baseline model is a univariate Kaplan-Meier estimator (Jager et al., 2008), which predicts the same event time for all samples without taking features into account. For this model and the linear Cox-PH model, we use the implementations in lifelines. We use grid search on the validation set for each model to determine hyperparameter values, experimenting with regularisation values in [0, 0.01, 0.1, 0.5], L1 ratios in [0, 0.1, 0.5, 1] and learning rates in [0.01, 0.1, 0.5, 1]. We also compare to a linear regression model, implemented in scikit-learn and using the same features. For the linear regression model, we truncate predictions at 0, since negative times are invalid. Finally, to compare to previous work on threshold classification, we implement a logistic regression classifier, using the median of each training set as the cut-off point. For this model, the upper and lower quartiles of the training set are used as predicted event times when computing the MAE. For instance, for the Dispute dataset the threshold value is 9, and we use an event time of 5 if the model predicts the shorter class and 12 for the longer class.
For the neural models (DeepSurv and DeepHit), we use the implementations in PyCox by Kvamme et al. (2019). For both we use two hidden layers with [128, 64] nodes, dropout (p = 0.3), batch normalisation, and the Adam optimiser (Kingma and Ba, 2014) with learning rate 0.01.

Task 1: Predicting conversation length
Results for the conversation length prediction task are shown in Table 3 for the Dispute and Talk datasets (left and middle columns respectively). MAE scores should not be compared between datasets since the datasets have different length distributions, with the Dispute dataset having conversations of up to 37 utterances, compared to a maximum of 12 in the Talk dataset.
For both datasets, all survival models outperform the linear regression and threshold classification models on the MAE metric. The survival baseline uses only population-level knowledge of the event time distribution, and predicts the same event time for all samples, whereas the other baselines take into account information from the features and can therefore tailor predictions per sample. While this results in the survival baseline having the worst CI (0.5), it is still better than linear regression and threshold classification in terms of MAE, illustrating the importance of separating the effect of time from the other features; time alone can be highly predictive.
The DeepHit model performs the best on the MAE metric on both datasets, with a statistically significant difference from the Linear Cox model at the P=0.01 level using the sign test. The latter performs better than DeepHit on the CI metric for the Dispute dataset, however, this difference is not statistically significant (P=0.869, using a randomised permutation test).

Coefficient analysis
Since the linear Cox model performed well on the Dispute dataset and is more interpretable than its deep counterparts, we show its 10 largest coefficients in absolute value in Table 4. Positive weights are associated with larger risk function values, and therefore a shorter conversation. Time between utterances is the most predictive feature, with a longer time between utterances correlating with shorter conversations (also observed in Figure 1). This corroborates the findings of Backstrom et al. (2013) and Zhang et al. (2018b). Having more participants is also correlated with shorter conversations. This suggests that the conversations in the Dispute dataset are less prone to long expansionary-style threads (as defined by Backstrom et al. (2013)), where many participants each contribute one utterance. Given the dataset consists of disagreements, it is not surprising that there would rather be focused discussions between a small number of participants.
Features 4, 6, 7 and 8 are from the hypergraph feature set, which describes the structure of the reply tree. Feature 4 indicates that a thread which forms a shallow tree, with posts receiving many direct responses, is likely to terminate soon. Features 6 and 7 indicate that interactions with multiple users are likely to extend the conversation. The length of the last utterance before the prediction is made (feature 9) is negatively associated with a short time-to-event, indicating that a long last utterance suggests the conversation is still ongoing. Finally, the arrival sequence 00110 (feature 10) encodes the order in which two participants (indexes 0 and 1) contributed the first 5 utterances.
The only language-related feature in the 10 most predictive features is feature 5, the number of first person pronouns, which is positively correlated with a shorter time-to-event. Seven of the ten features on this list are from previous work on predicting conversation length (TIME, ARR and HYP), although this was for the thresholded classification version of the task. This indicates that the survival models are inferring similar relationships as the classification variant, while providing a more informative prediction and better performance on the MAE and CI metrics.

Results per timestep
To gain an understanding of how our models perform at different timesteps in the conversation, we also evaluate the predictive accuracy at every utterance index. The intuition here is illustrated in Figure 2 for t = 5, with the task being: having seen 5 utterances, predict the conversation length. We can see that the actual conversation length is 8, as there are 3 more utterances after the prediction, which are not seen by the model. The linear Cox model predicts the right value in this case. We also show the survival functions predicted by DeepHit, the baseline and the linear Cox model. As explained in Section 2 (Equation 3), the predictions are the expected values of the survival function. We depict predictions in relation to the prediction time; for instance, the linear Cox model predicts a time-to-event of 3 utterances at timestep 5, meaning that the event time will be timestep 8. Figure 3 shows the MAE and CI scores, aggregated per timestep, for the Talk dataset. Lower MAE scores are observed for later timesteps. However, this does not mean that the models are necessarily better at later timesteps; the possible range of error is smaller later in a conversation. An interesting deviation here is the linear regression model, which performs the best at the last timestep. Upon inspection, we note that this is because the model predicts small values (with median values in the range 0.02–0.09) at every timestep. This strategy is likely the result of the dataset being biased towards shorter conversations. At smaller values of t, there is a portion of long conversations which would contribute large errors to drive up the MAE, but this is not present at larger t, hence the discrepancy.
CI allows for more direct comparison of models at different t, since it measures ranking accuracy. On this metric, DeepHit performs better than the linear regression model at all but the last timestep. Both models perform slightly worse at the last two timesteps. A reason for this may be that there are fewer training samples available at larger t, as illustrated in Table 2. Similar trends are observed in the Dispute dataset.

Task 2: Predicting personal attacks
Results for the personal attack prediction task are shown in the right column of Table 3. Compared to Task 1, higher values are observed on the CI. A reason for this may be that conversation length prediction has to rely on more subtle cues that indicate a conversation has run its course (e.g. users signing off) which are not captured by our features.
We observe again that the DeepHit model performs the best, and that the survival models outperform the three baselines. Due to censoring, the MAE score here is calculated using only uncensored samples; i.e. samples where a personal attack does occur. The censored samples are accounted for in the CI metric, as explained in Section 3.1.
A key question in this task is whether the survival models manage to prioritise samples where a personal attack occurs over censored examples. This means that when comparing a pair consisting of a censored and an uncensored example, we would like the predicted time-to-event of the censored example to be higher. We can calculate how often this is true for the DeepHit model by calculating the concordance index of the predicted event times and the inverse of the censoring indicator. For instance, given a pair of samples s = [censored, uncensored], the indicator function is [0, 1] and its inverse therefore [1, 0]. The ordering of the inverse (s_0 > s_1) should be concordant with the predictions. Using the DeepHit model, we find that the model ranks samples where an attack occurs over censored examples in 57.75% of cases, compared to 51.92% with the linear regression model, and 50% for a random baseline. The best model of Zhang et al. (2018a) has a predictive accuracy of 64.9% for classifying which conversations will derail into personal attacks, but does not predict when this will occur.
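The check described above reduces to counting, over all (censored, uncensored) pairs, how often the censored sample is predicted a later event time. A sketch with hypothetical predictions:

```python
def attack_priority_rate(pred_times, censored):
    """Fraction of (censored, uncensored) pairs where the censored
    sample is predicted a later event time than the uncensored one."""
    hits = total = 0
    for i in range(len(pred_times)):
        for j in range(len(pred_times)):
            if censored[i] and not censored[j]:
                total += 1
                hits += pred_times[i] > pred_times[j]
    return hits / total

# One censored conversation predicted to outlast two attack conversations:
attack_priority_rate([10, 3, 8], [True, False, False])  # → 1.0
```

A rate of 0.5 corresponds to the random baseline mentioned in the text.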

Conclusions
In this paper, we proposed that survival analysis is a useful but hitherto ignored framework for time-to-event prediction tasks in conversations, which are frequently framed as classification tasks. We provided evidence for this by showing that survival models outperform both linear regression and logistic regression models on two tasks and three datasets. The survival regression models explored can be useful in other tasks, for instance, predicting escalation in customer service conversations.
Figure 3: MAE and CI calculated per timestep, t, for the Talk dataset on Task 1 (threshold classifier, linear regression, DeepHit). For MAE, lower values are preferred; for CI, higher.