Linguistic Indicators of Severity and Progress in Online Text-based Therapy for Depression


Introduction
Mental illnesses such as depression and anxiety have been called "the biggest causes of misery in Britain today" (Layard, 2012). The main avenue of treatment for such conditions is talking therapies, such as Cognitive Behavioural Therapy (CBT); however, demand far exceeds what can currently be met, and only 25% of sufferers in the UK receive treatment. Therapy is therefore increasingly being delivered online: this helps to improve access and reduce waiting times, and is just as effective as standard therapy. However, this new online setting provides a challenge of evaluation and optimisation (Hanley and Reynolds, 2009; Beattie et al., 2009). Online therapy is a significant departure from face-to-face therapy, and it is not yet known exactly what features or approaches are likely to lead to successful outcomes, or help identify negative outcomes such as risk to the patient or others. Current methods (e.g. controlled studies) are expensive and time-consuming; we need fast, accurate methods to ensure treatment can be made effective and efficient in this new context. [...] and prediction of outcomes in schizophrenia treatment (Howes et al., 2013).
Online therapy data provides a new challenge (language and interaction styles differ from face-to-face therapy) but also an opportunity: large amounts of text data are available without the need for automatic speech recognition or manual transcription. Here, we present an initial investigation into the application of computational linguistic techniques to online therapy for depression and anxiety. We find that important measures such as symptom severity can be predicted with accuracy comparable to that obtained on face-to-face data, and that general aspects such as discussion topic and sentiment are useful predictors; we also suggest some ways in which techniques can be adapted for improved performance in future.
* This work was partly supported by the ConCreTe project. The project ConCreTe acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET grant number 611733.

Computational analysis & mental health
Research into computer-based diagnosis in mental health goes back at least to the 1960s (see Overall and Hollister, 1964; Hirschfeld et al., 1974, amongst others), but most systems rely on doctor- or patient-reported data rather than naturally occurring language. Much recent work similarly uses self-reported clinical and socio-demographic data, e.g. to predict treatment resistance in depression (Perlis, 2013). Some recent natural language processing (NLP) research examines features of the language used by patients when discussing conditions or treatment, e.g. discovering topics and opinions from online doctor ratings (Paul et al., 2013) or social media (Paul and Dredze, 2011).
However, aspects of the communication during treatment itself are also associated with patient outcomes (Ong et al., 1995). In the mental health domain, recent work suggests that, for patients with schizophrenia, both conversation structure (how communication proceeds in therapy) and content (what is talked about) can affect outcomes (McCabe et al., 2013a; John et al., under review). NLP research has now begun to examine both. [...] model speech acts to characterise doctor-patient consultations on medication adherence; Angus et al. (2012) use unsupervised topic models to visualise shared content in clinical dialogue; Cretchley et al. (2010) use a similar approach for a qualitative analysis of topic and communication style between patients with schizophrenia and carers. DeVault et al. (2013) use features of speech, and Yu et al. (2013) multimodal features, from video-mediated dialogue to detect depression and PTSD with promising accuracies (0.66 to 0.74 depending on condition and task). In face-to-face therapy for schizophrenia, Howes et al. (2012; 2013) use a combination of supervised and unsupervised approaches to predict a range of diagnostic and outcome measures, including future adherence to treatment (accuracy 0.70); fine-grained lexical features gave reasonable accuracy, with more general topic features giving weaker prediction of some outcomes.

Topic modelling
One focus of research for mental health is therefore on methods for analysing content (what is talked about). Traditional methods, while effective, involve time-consuming hand-coding of data (Beattie et al., 2009; John et al., under review); NLP techniques can reduce this requirement. Unsupervised probabilistic models (e.g. Latent Dirichlet Allocation (LDA; Blei et al., 2003) and variants) have been widely applied to learn topics (word distributions) from the data itself, connecting words with similar meanings and even distinguishing between uses of words with multiple meanings (Steyvers and Griffiths, 2007). Such techniques have been applied successfully to structured dialogue, e.g. meetings and tutorials (Purver et al., 2006; Eisenstein and Barzilay, 2008), and more recently to dialogues in the clinical domain (Cretchley et al., 2010; Howes et al., 2013), with topics found to identify important themes within therapy conversation such as medication, symptoms, family and social issues, and to correlate with outcomes.
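To make the idea of learning topics as word distributions more concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. It is an illustration only and not the implementation used in any of the cited work: the function name `lda_gibbs`, the symmetric priors `alpha` and `beta`, and the toy documents are all assumptions made for this example.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenised documents.

    Returns (proportions, topic_word): per-document topic proportions and
    per-topic word counts. alpha and beta are symmetric Dirichlet priors
    (an assumption of this sketch).
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # n(d, k)
    topic_word = [Counter() for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                        # n(k)
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n(d,t) + alpha) * (n(t,w) + beta) / (n(t) + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic proportions per document (each row sums to 1).
    proportions = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return proportions, topic_word

docs = [["sleep", "bed", "night", "sleep"], ["work", "job", "boss", "work"],
        ["sleep", "night", "bed"], ["job", "work", "boss"]]
proportions, topic_word = lda_gibbs(docs, num_topics=2, iterations=50)
```

The per-document proportions returned here are the kind of low-dimensional representation that can later serve as features; toolkits add refinements such as hyperparameter optimisation that this sketch omits.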

Sentiment and emotion analysis
One aspect of conversation process and style is the affect or emotion present. NLP research has generally approached this via the task of sentiment detection, distinguishing positive from negative (and sometimes neutral) stance (Pang and Lee, 2008). Methods generally take either a knowledge-rich approach (relying on e.g. dictionaries of sentiment-carrying words (Pennebaker et al., 2007)), or a data-rich approach via (usually supervised) machine learning over datasets of sentiment-carrying text (e.g. Socher et al. (2013)). The former can provide deeper insights, but is less robust in the face of unexpected vocabulary and unusual or erroneous spelling; the latter is more robust but requires training on large datasets. Recent research has attempted finer-grained distinctions, e.g. detecting specific emotions such as anger, surprise, fear etc.; again, approaches can be characterised as dictionary-based or machine-learning-based (Chuang and Wu, 2004; Seol et al., 2008; Purver and Battersby, 2012; De Choudhury et al., 2012). The resulting sentiment or emotion ratings have been widely used to determine aspects of personality and mental state in various domains. In social media text, Quercia et al. (2011; 2012) found correlations between sentiment and levels of popularity, influence and general wellbeing; O'Connor et al. (2010) found correlations with measures of public opinion. Closer to our application, Liakata et al. (2012) show that these methods can be applied to analyse emotion in suicide notes.

Research questions
Here, similar to DeVault et al. (2013) and Howes et al. (2013), our primary question is whether these approaches can be usefully applied to diagnose conditions and predict outcomes, but in a new modality (online text-based therapy) which may require different and/or more robust methods. In addition, we would like to gain some insight into which features of language and interaction might be predictive, in order to help clinicians improve therapeutic methods, and to assess how general and transferable any model might be. Our main questions here are therefore:
• What features of text-based online therapy dialogue might help predict symptoms and/or outcomes? Specifically, how predictive are conversation topic and emotional content?
• Can we detect them accurately and reliably, using approaches generalisable to large datasets, across different subjects and conditions?
• Can the features provide any insights into the treatment process and/or the online modality?

Data
The data used in this study consisted of the transcripts from 882 Cognitive Behavioural Therapy (CBT) treatment dialogues between patients with depression and/or anxiety and their therapists using an online text-based chat system. The transcripts are from online CBT provided by Psychology Online, who deliver 'live' therapy from a qualified psychologist accessed via the internet (http://www.psychologyonline.co.uk). Of the 882 transcripts, 837 are between therapists and patients who were in an ongoing treatment program or had completed their treatment by the time our sample was collected. There are 167 patients in this sample (125 females and 42 males), with 35 different therapists (for 2 patients the identity of the therapist is unknown). The number of transcripts per patient ranges from 1 to 14, with a mean of 5.011 (s.d. 2.73). For all of the measures based on the transcripts, as outlined below, we included all text typed by both the therapist and the patient. In addition to the transcripts themselves, each patient normally filled out two questionnaires prior to each session with their therapist. These are described below.

Outcomes
Patient Health Questionnaire (PHQ-9) This is a self-administered diagnostic instrument for common mental disorders (Kroenke and Spitzer, 2002). The PHQ-9 is the depression module, which scores each of the 9 DSM-IV criteria as '0' (not at all) to '3' (nearly every day). A higher score indicates higher levels of depression, with scores ranging from 0-27. It has been validated for use (Martin et al., 2006).

Generalised Anxiety Disorder scale (GAD-7)
Similarly, the GAD-7 (Spitzer et al., 2006) is a brief self-report scale of generalised anxiety disorder. This is a 7-item scale which scores each of the items as '0' (not at all) to '3' (nearly every day). A higher score indicates higher levels of anxiety.
Outcome measures For the data in our sample, PHQ-9 and GAD-7 were highly correlated (r = 0.811, p < 0.001), so for the results reported below we focus on PHQ-9. As each patient filled in the PHQ-9 before each consultation, we used two different outcome measures: PHQ now, the PHQ-9 score of the patient for the questionnaire completed immediately prior to the consultation; and PHQ start-now, the difference between the PHQ-9 score prior to any treatment and PHQ now, i.e. a measure of progress (how much better or worse the patient is since the start of their treatment). Although these two measures are numerical, one of the general aims of our research is to identify patients at risk. We therefore binarised the outcome measures and treated our task as a categorisation problem to identify the group of interest. For PHQ now, these were patients with moderate to severe symptoms; for PHQ start-now, patients whose PHQ score had not improved.
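The two binarised outcomes can be sketched as below. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the cut-off of 10 is the Kroenke and Spitzer (2002) classification used in the classification experiments, and treating an unchanged score as "not improved" follows from the description of improvement as a reduction in score.

```python
MODERATE_THRESHOLD = 10   # PHQ-9 scores of 10 or over count as moderate to severe
PHQ_MAX = 27              # PHQ-9 scores range from 0 to 27

def check_score(score):
    """Reject values outside the valid PHQ-9 range."""
    if not 0 <= score <= PHQ_MAX:
        raise ValueError(f"PHQ-9 score out of range: {score}")
    return score

def phq_now_is_high(phq_now):
    """Class of interest for PHQ now: moderate to severe symptoms."""
    return check_score(phq_now) >= MODERATE_THRESHOLD

def phq_progress(phq_start, phq_now):
    """PHQ start-now: positive values mean the score has gone down (improved)."""
    return check_score(phq_start) - check_score(phq_now)

def not_improved(phq_start, phq_now):
    """Class of interest for PHQ start-now: no reduction since the start."""
    return phq_progress(phq_start, phq_now) <= 0

# Hypothetical patient: started at 18, currently at 12.
example = (phq_now_is_high(12), phq_progress(18, 12), not_improved(18, 12))
```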

Topics
The transcripts from the 882 treatment consultations were analysed using an unsupervised probabilistic topic model, using MALLET (McCallum, 2002) to apply standard Latent Dirichlet Allocation (Blei et al., 2003), with the notion of document corresponding to a single consultation session, represented as the sequence of words typed by any speaker. Stop words (common words which do not contribute to the content, e.g. 'the', 'to') were removed as usual (Salton and McGill, 1986), but the word list had to be augmented for text chat conventions and spellings (e.g. unpunctuated "ive"). Additionally, common misspellings were mapped to their correctly spelled equivalents using Microsoft Excel's in-built spellchecker. This was necessary because of the nature of text chat, in contrast to transcribed speech or formal text: the word 'questionnaire', for example, was found to have been typed in 21 different ways. Following Howes et al. (2013), we set the number of topics to 20, used the default setting of 1000 Gibbs sampling iterations, and enabled automatic hyperparameter optimisation to allow an uneven distribution of topics via an asymmetric prior over the document-topic distributions (Wallach et al., 2009). As Howes et al. (2013) did in face-to-face therapy, we found most topics were composed of coherent word lists, with many corresponding to common themes in therapy, e.g. family (Topic 12), symptoms (16), treatment process (2, 14), and issues in work and social life (19, 5); see Table 5.
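The pre-processing described above (one document per session, an augmented stop word list, and a mapping from misspellings to standard forms) could be sketched roughly as follows. The word lists and the spelling map are tiny hypothetical stand-ins: the actual stop list and the spellchecker output used in the study are not given in the text, and only "ive" and the word 'questionnaire' come from the source.

```python
import re

# Illustrative subsets only; the real lists are not reproduced in the paper.
STOP_WORDS = {"the", "to", "a", "and", "i", "of", "in", "you"}
CHAT_STOP_WORDS = {"ive", "im", "dont"}          # unpunctuated chat forms
SPELLING_MAP = {                                  # hypothetical misspellings
    "questionaire": "questionnaire",
    "questionnare": "questionnaire",
}

def preprocess_session(turns):
    """Turn all text typed in one session into a single token sequence."""
    tokens = []
    for turn in turns:
        for raw in re.findall(r"[a-z']+", turn.lower()):
            token = raw.replace("'", "")          # "i've" and "ive" now match
            token = SPELLING_MAP.get(token, token)
            if token and token not in STOP_WORDS and token not in CHAT_STOP_WORDS:
                tokens.append(token)
    return tokens

session = ["Ive filled in the questionaire", "I've sent the questionnare to you"]
document = preprocess_session(session)
```

Removing apostrophes before the stop word lookup is a design choice of this sketch: it lets punctuated and unpunctuated variants share one stop list entry.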

Sentiment and emotion analysis
Each turn in the transcripts was then annotated for strength of positive and negative sentiment, and level of anger. We compared three approaches: the dictionary-based LIWC (Pennebaker et al., 2007) and two machine learning approaches, the Stanford classifier based on deep neural nets and parse structure trained on standard text (Socher et al., 2013), and one based on distant supervision over social media text, Sentimental (Purver and Battersby, 2012). None are specifically designed for therapy dialogue data; however, given the unorthodox spelling and vocabulary used in text chat, we expect machine-learning based approaches, and training on "noisy" social media text, to provide more robustness.
We used each to provide a positive/negative/neutral sentiment value; for LIWC, we took this from the relative magnitudes of the posemo and negemo categories. Two human judges then rated the 85 utterances in one transcript independently. Inter-annotator agreement was good, with Cohen's kappa = 0.66. Agreement with LIWC was poor (0.43-0.45); with Stanford better (0.51-0.54); but best with Sentimental (0.63-0.80). For anger, LIWC gave only one utterance a non-zero rating, while Sentimental provided a range of values. We therefore used Sentimental in our experiments. Raw values per turn were scaled to [-1,+1] for sentiment (-1 representing strong negative sentiment, +1 strong positive), and [0,1] for anger; we then derived minimum, maximum, mean and standard deviation values per transcript.
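As an illustration of the two computations in this paragraph, the sketch below computes Cohen's kappa between two label sequences and derives the per-transcript summary features from per-turn scores. The function names and example values are assumptions for this sketch, and the use of the population standard deviation is also an assumption, since the text does not say which variant was used.

```python
from collections import Counter
from statistics import mean, pstdev

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0   # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

def transcript_features(turn_scores):
    """Summarise per-turn scores (already scaled) for one transcript."""
    return {
        "min": min(turn_scores),
        "max": max(turn_scores),
        "mean": mean(turn_scores),
        "sd": pstdev(turn_scores),   # population s.d. (assumed)
    }

# Hypothetical ratings of four utterances by two judges.
judge_a = ["pos", "pos", "neg", "neg"]
judge_b = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(judge_a, judge_b)

# Hypothetical per-turn sentiment values in [-1, +1].
features = transcript_features([-1.0, 0.0, 1.0])
```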

Classification experiments
We performed a series of experiments to investigate whether various features of the transcripts could enable automatic detection of patient responses to the PHQ-9. The full range of possible features was calculated for each transcript (see Table 1). As well as topic, sentiment and emotion features as detailed above, we include raw lexical features to characterise details of content, and some high-level features (amount of talk; patient demographics; and therapist identity, known to affect outcomes).
In each case, we used the Weka machine learning toolkit (Hall et al., 2009) to pre-process data, and a decision tree classifier (J48), a logistic regression model and the support vector machine implementation LibLINEAR (Chang and Lin, 2001) as classifiers. PHQ now was binarised based on the classification in Kroenke and Spitzer (2002), whereby scores of 10 or over are moderate to severe and scores of less than 10 are mild. PHQ start-now was binarised according to whether there was an improvement (reduction) in the PHQ score or not. Positive scores indicate [...]
Table 1:
Sentiment | Overall sentiment mean, standard deviation, minimum and maximum; overall anger mean, standard deviation, minimum and maximum
Word | Unigrams, for all words that appeared in at least 20 of the transcripts, regardless of speaker; the features were the normalised counts of each word
N-gram | As Word, but including unigrams, bigrams and trigrams
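The Word and N-gram feature sets can be illustrated with a short sketch: n-grams are kept only if they occur in a minimum number of transcripts, and each transcript is then represented by normalised counts. The names `extract_ngrams` and `build_features` are hypothetical, and dividing by the total number of n-grams in the transcript is an assumed normalisation, since the text does not specify one.

```python
from collections import Counter

def extract_ngrams(tokens, max_n):
    """All n-grams of length 1..max_n, joined with spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def build_features(transcripts, max_n=1, min_transcripts=20):
    """Return (vocabulary, rows) of normalised n-gram counts per transcript."""
    counts_per_doc = [Counter(extract_ngrams(t, max_n)) for t in transcripts]
    # Document frequency: in how many transcripts does each n-gram appear?
    doc_freq = Counter(g for counts in counts_per_doc for g in counts)
    vocab = sorted(g for g, df in doc_freq.items() if df >= min_transcripts)
    rows = []
    for counts in counts_per_doc:
        total = sum(counts.values()) or 1    # avoid dividing by zero
        rows.append([counts[g] / total for g in vocab])
    return vocab, rows

# Toy data with a lowered threshold; the study used at least 20 transcripts.
toy = [["feel", "good"], ["feel", "bad"], ["sleep", "bad"]]
vocab, rows = build_features(toy, max_n=1, min_transcripts=2)
vocab3, rows3 = build_features(toy, max_n=3, min_transcripts=1)
```

Setting `max_n=1` corresponds to the Word set and `max_n=3` to the N-gram set (unigrams, bigrams and trigrams).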

Correlations
First, we examined statistical associations between our four outcome measures and our available features (see Section 3). R-values are shown for all significant correlations (at the p < 0.05 level) in Tables 2-4. For the PHQ now measure, a positive correlation means a greater value of the feature is associated with a greater value of the PHQ score (i.e. a higher level of symptoms). For the PHQ start-now measures, a positive correlation means that a greater value of the feature is associated with a greater improvement in the PHQ score since the start of treatment. Correlations greater than ±0.2 are shown in bold.
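For reference, the R-values discussed here are Pearson correlation coefficients between a feature and an outcome measure across transcripts. A minimal sketch is given below; the function names and the example numbers are invented for illustration, and significance testing (the p < 0.05 filter) is deliberately left out.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

def exceeds_bold_threshold(r, threshold=0.2):
    """True if |r| is above the level shown in bold in the tables."""
    return abs(r) > threshold

# Hypothetical values: words typed per transcript against PHQ now.
words_typed = [900, 1200, 1500, 1800]
phq_now = [4, 9, 11, 17]
r = pearson_r(words_typed, phq_now)
```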
High-level For patients with a worse (higher) PHQ score (PHQ now), more words and turns are typed by both participants. Better overall progress scores are also weakly associated with the amount of talk, with fewer turns typed by both participants if patients' PHQ score has improved by a greater amount since the start of their treatment program (see Table 2).
Sentiment As shown in Table 3, more negative sentiment expressed in the transcripts (mean and minimum), a higher variability of sentiment between negative and positive (s.d.), and greater levels of anger (mean and maximum) are associated with worse PHQ scores. More positive sentiments (mean and maximum) are also associated with better progress.
Footnote 3: We partition the data into 10 equal subsamples, and use each subsample as the test data for a model trained on the remaining 90%. This is repeated for each subsample (the 10 folds), and the test predictions collated to give the overall results. This partitioning is done by transcript: different transcripts from the same patient may therefore appear in training and test data within the same fold; our use of low-dimensional topic/sentiment features should minimise over-fitting, but future work will investigate the extent of this effect.
Topic Topics 2, 6, 9, 10, 16 and 17 are negatively correlated with PHQ scores, i.e. higher levels of these topics are associated with better PHQ (see Table 4). Some of these topics involve words related to assessing the patient's progress and feedback, e.g. topic 2 includes session, goals and questionnaires, and topic 17 includes good, work and positive. Others relate to specific concerns of the patient, e.g. topic 6 (worry, worrying and problem) and topic 16 (anxiety, fear and sick). The top twenty words assigned to each topic by LDA, and the direction of significant correlations are shown in Table 5.
Conversely, topics 4, 5, 7, 8, 11 and 18 are positively correlated with PHQ scores, meaning more talk assigned to these topics is associated with worse PHQ. Several of these topics relate to specific issues, such as topic 5 (sleep, bed, night) and topic 18 (eating, food, weight). Some of these topics display overlap with the previous group (e.g. topics 2 and 4 both contain words reviewing progress such as session, week, next and last); this suggests that some topics (e.g. progress or particular issues) are discussed in importantly (and recognisably) different ways or contexts (possibly different emotional valences -see below), and these differences are being identified by the automatic topic modelling.
Similarly, greater amounts of talk in topics 2, 15 and 17 are weakly associated with better progress. These are the topics identified above as involving words related to assessing progress, and feedback. Greater amounts of talk in topic 8 (checking, OCD, anxiety, rituals) are associated with worse progress.
Cross-correlations between topic and sentiment features Previous work has hypothesised that automatically derived topics may differ from hand-coded topics in picking up additional factors of the communication such as valence (Howes et al., 2013). To explore this on a global level (i.e. at the level of the transcript, rather than at the finer-grained level of the turn) we examined cross-correlations between sentiment and topic. This initial exploration offers support for this hypothesis, as can be seen in Table 6. For example, topics 3 and 4 both contain words relating to feelings and thoughts, but topic 3 is positively correlated with sentiment, while topic 4 is negatively correlated. These correlations indicate a complex relationship between topic and sentiment which should be explored further in future research; a joint topic-sentiment model might be appropriate (e.g. Paul et al., 2013). Although some topics pattern consistently with sentiment (e.g. topic 12, with words about relatives and relationships, is associated with negative sentiments and higher levels of anger), some do not (e.g. topic 19 is associated with more positive sentiment, but greater anger). Examination suggests that this topic involves discussions about feelings of anger, but not necessarily expressing anger, and also may include talk on how to deal with such feelings (with words like assertive). These observations may indicate that in this domain, in which people explicitly talk about their feelings, fully accurate sentiment and emotion analysis may require a different approach than in domains such as social media analysis.

Classification experiments
Results of classification experiments on different feature sets are shown in Tables 7-9. For each experiment, the weighted average f-score is shown, with the f-score for the class of interest shown in brackets. For PHQ now the class of interest is patients with high (moderate to severe) PHQ-9 scores; for PHQ start-now we are concerned with patients who are not getting better. As a baseline, the proportion of the data in the class of interest in each case is shown in the first column in Table 7; note that these are not exactly 50%, but reflect the actual proportions in the data (see Section 3.5).
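The two figures reported per experiment can be computed as in the sketch below: a per-class f-score (the harmonic mean of precision and recall) and an average over classes weighted by the number of gold instances of each class. The weighting scheme is the usual definition of a weighted average f-score and is assumed here; the function name and the toy labels are hypothetical.

```python
def f_scores(gold, predicted):
    """Return (weighted average f-score, {class: f-score})."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("label sequences must be non-empty and equal in length")
    per_class = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            per_class[label] = 2 * precision * recall / (precision + recall)
        else:
            per_class[label] = 0.0
    # Weight each class by how often it occurs in the gold labels.
    weighted = sum(f * gold.count(label) for label, f in per_class.items()) / len(gold)
    return weighted, per_class

# Toy example: "high" is the class of interest for PHQ now.
gold = ["high", "high", "low", "low"]
predicted = ["high", "low", "low", "low"]
weighted, per_class = f_scores(gold, predicted)
```

In this toy case one "high" transcript is missed, so the f-score for the class of interest is lower than the weighted average, which is the pattern the bracketed figures in the tables are meant to expose.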
High-level As can be seen in Table 7, if we use a feature set consisting of high-level features and AgentID, we are able to predict PHQ now and PHQ start-now reasonably well (> 0.7). However, given the nature of the data, it is uncommon for a therapist to have many clients of the same age group and gender; these features can therefore act as a reasonable proxy for identifying individual patients, meaning that this result is somewhat spurious. Also, although identity of therapist is an important factor in therapeutic outcomes (McCabe et al., 2013a; McCabe et al., 2013b), we would like to identify aspects of the communication that explain why particular therapists are more successful than others, and generalise our findings to new therapists. AgentID was therefore removed in all subsequent experiments.
Sentiment and topic As shown in Table 8, using the proportions of derived topics by transcript as features does allow us to predict whether a patient has a high PHQ now score reasonably well; but sentiment alone performs poorly. Combining sentiment and topic features, however, allows us to predict PHQ now with scores of around 0.7 (i.e. approaching the accuracy achieved using high-level and AgentID features above). Prediction of the progress measure is less effective.
Words and n-grams For the symptom measure, using words and n-grams gives f-scores in line with those using only the reduced dimensionality of sentiment and topic. This is surprising; one might expect finer-grained lexical features (which provide more information via a much higher-dimensional feature space) to increase predictivity, as per Howes et al. (2013); on the other hand, it is also promising as it suggests that meaningful generalisations can be drawn out of this data using NLP techniques.
For the progress measure, on the other hand, n-gram features perform better than topic/sentiment (though not as well as on the symptom measures); this suggests that there are aspects of the communication that can assist in predicting patient progress, but that they are not captured by the topic and sentiment information as currently defined. This suggests that dialogue structure or style may play a role; one possibility for exploration is to look at topic and/or sentiment at a finer-grained level and examine their dynamics (e.g. are positive sentiments expressed near the start or end of a consultation linked to better progress?).
Table 5: Top 20 words per topic
Topic | Correlations | Top words
Topic 0 | -+ | good thought re well also mindfulness hw thoughts now vc maybe prob message neg just wk one self bit
Topic 1 | | people good others self evidence thought enough wrong negative esteem thinking say confidence beliefs person true someone belief situation
Topic 2 | -+ - | session send goals next week last sent read great think questionnaires also homework goal appointment set time cbt able
Topic 3 | + | thoughts thinking unhelpful helpful look thought behaviours go feelings may think anxiety negative try aware behaviour agenda start self
Topic 4 | + - | feel think like just good really week now know last session next say felt people thoughts going feeling bit
Topic 5 | + -+ | sleep bed day week work get night mood time diary see better much sleeping activity house routine done activities
Topic 6 | - | worry worrying worries bit stop train worried problem go example idea control hierarchy driving exposure home happen worst car
Topic 7 | + - | help feel gp depression thank understand therapy now feeling life today think problems able little message medication sorry make
Topic 8 | + | check checking ocd thoughts anxiety try something difficult danger brain week sense threat helpful away rituals anxious elephant images
Topic 9 | -- | think time like much way sure see though know look lot sounds well also right thing sorry sense different
Topic 10 | -+ | thought thoughts anxiety really situation situations one week next example social experience record great emotions thanks notice see make
Topic 11 | + + | things get time go need like want now just something feel know one work good day going give next
Topic 12 | + -+ | mum relationship husband life family dad parents never love feelings children said years mother much hard way told sister
Topic 13 | | really week think appointment homework however lets teeth questions great just ready start may dentist set end sure therapy
Topic 14 | + - | great right sure appointment just thank well tonight loo lol good say really cool get going sorry transcript absolutely
Topic 15 | + - | things like get bit good sounds feeling also something really great today think idea send week useful anything make
Topic 16 | -- | anxiety panic breathing get anxious feeling going go attack fear physical control try happen sick symptoms times cope distraction
Topic 17 | -+ - | good work well positive back help really time still last much weeks use thanks session better keep done things
Topic 18 | + | eating eat food weight day week meal lunch dinner pie energy good mum put table public walk believe ate
Topic 19 | + + | work job anger angry school stress thanks wife team stuff issues also boss year assertiveness assertive meeting kids times

Discussion
Standard topic, sentiment and emotion modelling can be usefully applied to online text therapy dialogue, although care is needed in choosing and applying a technique suitable for the idiosyncratic language and spelling. The resulting information allows us to predict aspects of symptom severity and patient progress with reasonable degrees of accuracy (similar to those achieved with face-to-face data (DeVault et al., 2013; Howes et al., 2012)), without requiring knowledge of therapist identity. However, some measures of patient progress are predicted better with fine-grained, high-dimensional lexical features, suggesting that insight into style and/or dialogue structure is required, beyond simple topic or sentiment analysis.
Table 9: Weighted average f-scores using raw lexical features (words/ngrams) using LibLINEAR (figures in brackets are the f-scores for the class of interest)