Applying prosodic speech features in mental health care: An exploratory study in a life-review intervention for depression

The present study aims to investigate the application of prosodic speech features in a psychological intervention based on life-review. Several studies have shown that speech features can be used as indicators of depression severity, but these studies are mainly based on controlled speech recording tasks instead of natural conversations. The present exploratory study investigated speech features as indicators of depression in conversations of a therapeutic intervention. The changes in the prosodic speech features pitch, duration of pauses, and total duration of the participant’s speaking time were studied over four sessions of a life-review intervention for three older participants. The ecological validity of the dynamics observed for prosodic speech features could not be established in the present study. The changes in speech features differed from what can be expected in an intervention that is effective in decreasing depression and were inconsistent with each other for each of the participants. We suggest that future research investigate changes within the intervention sessions, relate the changes in feature values to the topical content of the speech, and relate the speech features directly to depression scores.


Introduction
Depression is a mood disorder that is mainly characterized by a sad mood or the loss of interest and pleasure in nearly all activities in a period of at least two weeks (American Psychiatric Association, 2000). Depression disorders are the leading cause of disability and contribute largely to the burden of disease in middle- and high-income countries worldwide (Üstun et al., 2004). In 2012, more than 350 million people around the world suffered from depression symptoms (World Health Organization, 2012). To decrease the onset of depression disorders, early psychological interventions, i.e., psychological methods targeting behavioral change to reduce limitations or problems (Vingerhoets, Kop, & Soons, 2002), aimed at adults with depression symptoms or mild depression disorders are necessary. Meta-analytic findings show that psychological interventions reduce the incidence of depression disorders by 22%, indicating that prevention of new cases of depression disorders is indeed possible (Cuijpers et al., 2008).
To evaluate the effectiveness of interventions for depression and changes during the interventions, reliable and valid measures of depression severity are necessary. Depression severity is mostly measured by self-report questionnaires such as the Center for Epidemiologic Studies Depression scale (CES-D; Radloff, 1977), the Hamilton Depression Rating Scale (HAM-D; Hamilton, 1960), and the Beck Depression Inventory (Beck, Steer, & Brown, 1996). These self-report questionnaires often include items on mood and feelings. Moreover, questionnaire items may cover physical depression symptoms such as sleep disturbances, changes in weight and appetite, and loss of energy. However, in some target groups such as older adults these items can be confounded with health problems and physical diseases, which increase in old age. For these reasons, there is a need for valid and objective measures of depression severity, not only to assess depression severity before and after therapy, but also to detect the dynamics during the therapy (Elliot, 2010).

Computational linguistics, speech analysis, and mental health care
It is commonly assumed, and confirmed in several studies, that emotions and mood can influence the speaking behavior of a person and the characteristics of the sound in speech (Kuny & Stassen, 1993; Scherer, Johnstone, & Klasmeyer, 2003). Already in 1954, Moses concluded that the voice and speech patterns of psychiatric patients differed from those of people without a psychiatric diagnosis. Clinicians frequently describe the speech of depressed patients as uniform, monotonous, slow, and with a low voice (Kuny & Stassen, 1993). A review by Sobin and Sackeim (1997) showed that depressed people differ from normal and other psychiatric groups on psychomotor symptoms such as speech. The speech of depressed patients is characterized by a longer pause duration, that is, an increased amount of time between speech utterances, as well as by a reduced variability in mean vocal pitch.
More recently these insights have led to collaborative and multidisciplinary work between researchers from the fields of computational linguistics and mental health care. With the growing availability of models and algorithms for automated natural language processing that can be put to use in clinical scenarios, depression can now increasingly be measured based on the characteristics of the language used by patients, such as the frequency of verbal elements in a narrative that express a certain mood or sentiment (Pennebaker & Chung, 2011), and acoustic speech features. Because vocal acoustic features such as pause durations and pitch are biologically based, it has even been argued that they can serve as biomarkers of depression severity (Mundt et al., 2012). As a consequence, speech features such as pitch and pause durations can be used to estimate the severity of a depression.
To date, several studies have investigated the validity of speech features as indicators of depression. Indeed, the speech features pitch and speech loudness correlate significantly with global depression scores during recovery (Kuny & Stassen, 1993; Stassen, Kuny, & Hell, 1998). After recovery from depression, the speech pause time of depressed adults was no longer elongated (Hardy et al., 1984). These results indicate that prosodic speech features are valid measures of depression.
However, these studies have the limitation that the speech analyses are based on recordings of controlled speech from tasks such as counting and reading out loud. Such speech recording tasks take place under ideal voice recording conditions (Cannizzaro, Harel, Reilly, Chappell, & Snyder, 2004), while speech analysis is more difficult when conducted outside a controlled setting, because of so-called noisy channel effects (Janssen, Tacke, de Vries, van den Broek, Westerink, Haselager, & IJsselsteijn, 2013). Moreover, controlled speech tasks are cognitively less demanding than free speech tasks (Alpert et al., 2011). This raises the question whether speech features are also ecologically valid, i.e., whether they can be used as indicators of depression severity when measured during natural conversations instead of during the recording of controlled speech tasks (Bronfenbrenner, 1977).
A study on speech samples from video recordings of structured interviews revealed promising results: speaking rate and pitch variation, but not the percentage of pauses, showed a large correlation with depression rating scores (Cannizzaro, Harel, Reilly, Chappell, & Snyder, 2004). Additional studies on the ecological validity of using prosodic speech features as indicators of depression are necessary.

Speech features as mood markers in a life-review intervention
In the present study the speech of older adults will be measured in four sessions of a psychological intervention, combining knowledge in the fields of computational linguistics and psychological interventions in mental health care. Because psychological interventions for depression have been shown to be effective (e.g., Cuijpers, van Straten, & Smit, 2006) and are broadly implemented in mental health care, the measurement of speech features in psychological interventions is a promising application for the field of computational linguistics. For example, speech features can be used to provide direct feedback to both the therapist and patient on the severity and changes in severity of depression during the psychological intervention. Clinicians do not have the ability to differentiate precisely the duration of, for example, the patient's utterances and pauses (Alpert et al., 2001). There is also ample evidence that text mining techniques based on the frequency of certain terms can be applied to narratives from patients in order to monitor changes in mood (Pennebaker & Chung, 2011), and a recent study has shown that machines can recognize certain emotions better than lay people (Janssen et al., 2013), underlining once again the added value of automated speech analysis. To pave the way for future applications that would enable the use of speech features as a direct feedback mechanism, the first step is to gain more knowledge on the patterns in speech features and on how changes in these features can be considered as meaningful signals of patterns in psychological interventions.
The psychological intervention in the present study is based on life-review: the structured recollection of autobiographical memories. Depressed people have difficulties in retrieving specific, positive memories. Their autobiographical memory is characterized by more negative and general memories (e.g., Williams et al., 2007), for example memories that reflect a period or recurrent event (e.g., the period of a marriage) rather than a specific event (e.g., the ceremony on the wedding day). The present life-review course targets the recollection of specific, positive memories in older adults with depression symptoms. In four weekly sessions, the interviewer stimulates the recollection by asking questions on the depressed person's childhood, adolescence, adulthood and life in general. An advantage of life-review in comparison to other therapies such as Cognitive Behavioral Therapy is that it fits in with a natural activity of older adults: recollecting memories and telling stories about their lives (Bluck & Levine, 1998). Life-review has been shown to be an effective method to decrease depression symptoms (Korte, Bohlmeijer, Cappeliez, Smit, & Westerhof, 2012; Pinquart & Forstmeier, 2012) and is considered an evidence-based intervention for depression in older adults (Scogin, Welsh, Hanson, Stump, & Coates, 2005).
Our study is one of the first to investigate prosodic speech features during a psychological intervention. The study is exploratory and aims to gain insight into the ecological validity of prosodic speech features in a psychological life-review intervention. The life-review intervention offers the opportunity to investigate the prosodic speech features over time. Life-review is highly suitable for investigating speech features during an intervention, since the speech from the recall of autobiographical memories provides strong prosodic speech changes (Cohen, Hong, & Guevara, 2010) and the expression of emotions characterized by speech characteristics is stronger after open and meaning-questions as compared to closed and fact-questions (Truong, Westerhof, Lamers, & de Jong, under review). Our paper is a first step to gain insight into the methods that are necessary to evaluate the application of prosodic speech features in mental health care. In the present study into the role of prosodic speech features, vocal pitch and pause duration will be investigated in three participants across all four weekly sessions. Because the life-review intervention is effective in decreasing depression symptoms (Korte et al., 2011; Serrano, Latorre, Gatz, & Montanes, 2004), we expect that the prosodic features change accordingly. Therefore, we hypothesize (a) an increase in average vocal pitch, (b) an increase in the variation in vocal pitch, (c) a decrease in average pause duration, (d) a decrease in the ratio between the total pause time and total speech time (pause speech ratio), and (e) an increase in the ratio between the participant's speech and total duration of the session (speech total duration ratio) during the intervention.

Method
In this section we describe the methodology applied in the design of the psychological intervention during which the research data sets were generated, the procedure for selecting the participants and the corresponding data sets, the data preparation steps, and the analyses performed.

Intervention 'Precious memories'
The life-review intervention 'Precious memories' (Bohlmeijer, Serrano, Cuijpers, & Steunenberg, 2007) targets the recollection of specific, positive memories. The intervention is developed for older adults with depression symptoms living in a nursing home. Each of the four weekly sessions focuses on a different theme: childhood, adolescence, adulthood, and life in general. The sessions are individual and guided by a trained interviewer. The sessions take place at the participant's home and last approximately 45 minutes. Each of the sessions is structured by fourteen main questions that stimulate the participant to recollect and tell specific positive memories about his or her life. The interviewers are instructed to ask for lively details about each of the positive memories of the participants, for example the colors, smells and people that were involved in the memory. Table 1 shows an example question for each of the four sessions.

Session | Example question
1: Childhood | Can you remember an event in which your father or mother did something when you were a child that made you very happy?
2: Adolescence | Do you remember a special moment of getting your first kiss or falling in love with someone?
3: Adulthood | What has been a very important positive experience in your life between the ages of 20 and 60?
4: Life in general | What is the largest gift you ever received in your life?

Table 1. Example questions for the four sessions of the life-review intervention 'Precious memories'

Procedure and participants
Participants with depression symptoms were recruited in nursing homes in the area of Amsterdam, the Netherlands. Participation in the life-review intervention was voluntary. Three participants were selected for whom audio recordings of the four sessions were available, which resulted in a dataset of twelve life-review sessions. The three participants (labeled below as P1, P3 and P5) were females aged between 83 and 90 years. The educational background varied from low to high and the marital status from married to never married. The participants signed an informed consent form for the use of the audio-tapes for scientific purposes.

Data preparation and analysis
All acoustic features were automatically extracted with Praat (Boersma, 2001). Because the speech of both the interviewer and the participants was recorded on one mixed audio channel, some manual interventions had to be applied in order to determine the segments in which the participant is talking. First, for each session, the segments in which the participant is the main speaker were selected. These so-called 'turns' were then labeled in more detail; utterances produced by the interviewer were marked and discarded in the speech analysis. For each turn, mean pitch, standard deviation of pitch, pause duration, the ratio between total pause time and total speech time, and the ratio between total speech time and total duration of the session were extracted. Pause durations were automatically extracted by applying silence detection where the minimal silence duration was set at 500 ms. All features were normalized per speaker by transforming the raw feature values to z-scores (mean m and standard deviation sd were calculated over all 4 sessions, z = (x - m) / sd). The ratio between total speech time and total duration time was not normalized because this feature was calculated over a whole session instead of a turn. Subsequently, averages over all turns per session were taken in order to obtain one value per session.
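The normalization and averaging steps described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the record layout (dictionaries with the keys `speaker`, `session`, and a feature name) and the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which variant was applied.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscore_per_speaker(turns, feature):
    """Add '<feature>_z' to each turn: z = (x - m) / sd, with m and sd
    computed per speaker over all turns of all sessions.

    `turns` is a list of dicts with the hypothetical keys 'speaker',
    'session', and `feature`. Sample sd is an assumption.
    """
    values = defaultdict(list)
    for turn in turns:
        values[turn["speaker"]].append(turn[feature])

    # Per-speaker mean and standard deviation (needs >= 2 turns per speaker).
    stats = {spk: (mean(v), stdev(v)) for spk, v in values.items()}

    normalized = []
    for turn in turns:
        m, sd = stats[turn["speaker"]]
        # Guard against a constant feature, where sd is zero.
        z = (turn[feature] - m) / sd if sd > 0 else 0.0
        normalized.append({**turn, feature + "_z": z})
    return normalized


def session_means(turns, key):
    """Average `key` over all turns of each (speaker, session) pair."""
    grouped = defaultdict(list)
    for turn in turns:
        grouped[(turn["speaker"], turn["session"])].append(turn[key])
    return {pair: mean(vals) for pair, vals in grouped.items()}


# Illustrative values only, not data from the study.
example_turns = [
    {"speaker": "P1", "session": 1, "mean_pitch": 100.0},
    {"speaker": "P1", "session": 1, "mean_pitch": 200.0},
    {"speaker": "P1", "session": 2, "mean_pitch": 300.0},
]
example_z = zscore_per_speaker(example_turns, "mean_pitch")
example_session_values = session_means(example_z, "mean_pitch_z")
```

Normalizing per speaker before averaging makes the session values comparable across participants with different baseline pitch or pausing behavior, which is the rationale for plotting z-scores rather than raw values.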

Results
The results for the prosodic speech features over the four sessions of the life-review intervention are presented graphically, separately for each feature, in Figures 1 to 5.
We hypothesized an increase in the average pitch during the intervention. As shown in Figure 1, the patterns in average pitch during the intervention differ across the three participants. Only for Participant 3 is the pattern in line with our expectations, showing an increase in sessions 3 and 4. For both Participant 1 and Participant 5, there was a decrease in average pitch in sessions 3 and 4.
We expected the variation in pitch to increase during the intervention. Figure 2 shows the participants' patterns of the standard deviation of pitch during the intervention. The changes in standard deviation do not confirm our hypothesis. Although the speech of Participant 3 shows an increase in session 4, the standard deviation is lower in session 4 than in session 1 of the intervention. The standard deviation of Participant 5 is relatively stable during the intervention. Participant 1 mainly shows a large variation in pitch in session 2.
It was hypothesized that the average pause duration would decrease during the four sessions of the intervention. Figure 3 shows that the average pause duration was relatively stable over the first three sessions in all three participants. Only for Participant 1 did the average pause duration decrease in session 4, in line with our expectations.
In agreement with our hypothesis on average pause duration, we also expected a decrease during the intervention in the ratio between the total pause time and total speech time. Although there was a large decrease in the pause speech ratio of Participant 1 between sessions 2 and 3, the ratio in session 4 was similar to the pause speech ratio in the first session (see Figure 4). For both Participant 3 and Participant 5, the ratio was relatively stable in sessions 1 to 3, but in session 4 the pause speech ratio showed an increase for Participant 3 and a slight decrease for Participant 5.
Last, we investigated the ratio between the participant's speech and total duration of the session. We hypothesized an increase in the speech total duration ratio during the intervention. Figure 5 shows the differences between the participants in the speech total duration ratio over the four sessions. The ratio is relatively stable, and high, for Participant 5. The ratio for both Participant 1 and Participant 3 in general decreases during the intervention, with a lower speech total duration ratio in session 4 as compared to session 1.

Conclusion
The aim of the present study was to investigate the suitability of applying an analysis of prosodic speech features to the speech recordings collected in a psychological intervention based on life-review. Because several studies have shown that speech features can be used as indicators of depression severity (e.g., Kuny & Stassen, 1993; Stassen, Kuny, & Hell, 1998), the application of speech analyses in mental health care is promising. However, the measurement of speech features is often based on speech recording tasks, and the ecological validity within psychological interventions is not yet established. The study is a first exploratory step to gain insight into the ecological validity of prosodic speech features in a psychological life-review intervention. We expected to measure a change during the intervention in the prosodic speech features that could be related to depression symptoms, and hypothesized an increase in average pitch and pitch variation, a decrease in average pause duration, and an increase in the amount of speech by the participant during the intervention. However, we could not establish the ecological validity of these speech indicators in the present study. In general, the patterns of the prosodic speech indicators differ from our expectations. The dynamics in the speech indicators were different from what can be expected in an intervention that is effective in decreasing depression (Korte et al., 2011; Serrano et al., 2004). Moreover, the speech indicators were inconsistent with each other for the participants in the pool. For example, Participant 3 showed an increase in pitch during the intervention, which indicates a decrease in depression, and an increase in average pause duration and pause speech ratio, which indicates an increase in depression.
Taken together, the findings from the present study indicate that the prosodic speech features that have been validated for controlled settings are not directly applicable to the spontaneous type of conversation that is typical for a mental health care setting. More research is needed to establish the ecological validity of prosodic speech features such as pitch, pauses, and speech duration as indicators of depression severity. A few suggestions can be made. First, each of the four sessions in the life-review intervention in the present study focused on a different theme. Although we aimed to evaluate the development of the speech features during the intervention, the differences across the sessions may be the consequence of differences in session theme. Moreover, not all parts of the session consisted of life-review, and participants were talking about a variety of subjects, for example about their caregivers. The goal of the life-review intervention is to stimulate the retrieval of specific positive memories. In a next step, we aim to select the parts in which the participant is recollecting such memories and to evaluate the patterns in prosodic speech features only for these parts.
Second, the prosodic speech indicators were averaged per session to provide a clear overview of the changes over the four sessions. However, changes can also occur within the session. For example, vocal pitch may increase during the session, which would indicate a decrease in depression symptoms. Furthermore, within each session, the interaction between the interviewer and participant may play a role. For instance, when the interviewer speaks with a higher pitch and more variation in pitch, the participant may unconsciously take over some of this speaking behavior. We suggest that future studies investigate not only the session averages, but also changes during the session and the interviewer's speech features.
Third, the present research was conducted in line with the assumption that life-review is effective as an intervention for mood disorders, as is shown in several studies (Korte et al., 2011; Serrano et al., 2004). However, due to a lack of data on depression severity, we do not know whether the life-review intervention was fully effective for the participants in the present study. To validate the patterns in prosodic speech features as a reliable indicator of depression that can be used in mental health care, it is necessary to demonstrate that the dynamics in speech features can be related directly to changes in depression scores. Earlier studies found that speech features correlate significantly with global depression scores during recovery (Kuny & Stassen, 1993; Stassen, Kuny, & Hell, 1998); such correlations still need to be investigated in psychological interventions.
In sum, the study of how prosodic speech features such as pitch and pauses relate to the kind of spoken narratives that play a role in mental health care settings is a promising field. However, the ecological validity of prosodic speech features could not be established in the present study. More research based on larger data samples and the establishment of a direct relation to depression scores are necessary before the techniques from the field of computational linguistics can be applied as a basis for the collection of indicators that can be used in psychological interventions in a meaningful and effective way.
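One possible way to quantify such within-session change, given here only as a hedged sketch and not as a method used in the present study, is to fit a least-squares line to a feature's per-turn values in the order in which the turns occurred. The function name `within_session_slope` and the use of the turn index as the time axis are assumptions made for illustration.

```python
def within_session_slope(values):
    """Least-squares slope of per-turn feature values against turn index.

    `values` holds one feature value per turn, in chronological order
    within a single session (hypothetical input layout). A positive slope
    means the feature tends to increase over the course of the session.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two turns are needed to estimate a slope")

    x_mean = (n - 1) / 2          # mean of the turn indices 0 .. n-1
    y_mean = sum(values) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    return sxy / sxx


# Illustrative values only: pitch z-scores rising steadily over four turns.
example_slope = within_session_slope([-1.5, -0.5, 0.5, 1.5])
```

A slope per session could then be compared across sessions, or computed in the same way for the interviewer's turns to examine whether the two speakers' prosody moves together.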