Multi-Task Learning for Depression Detection in Dialogs

Depression is a serious mental illness that impacts the way people communicate, especially through their emotions and, allegedly, the way they interact with others. This work examines depression signals in dialogs, a less-studied setting that suffers from data sparsity. We hypothesize that depression and emotion can inform each other, and we propose to explore the influence of dialog structure through topic and dialog act prediction. We investigate a Multi-Task Learning (MTL) approach in which all of the above tasks are learned jointly with dialog-tailored hierarchical modeling. We experiment on the DAIC and DailyDialog corpora, both of which contain dialogs in English, and show substantial improvements over the state of the art on depression detection (at best 70.6% F1), which demonstrates the correlation of depression with emotion and dialog organization, and the power of MTL to leverage information from different sources.


Introduction
Depression is a serious mental disorder that affects around 5% of adults worldwide (https://www.who.int/news-room/fact-sheets/detail/depression). It has multiple causes and symptoms, leads to major disability, and is often hard to diagnose, with about half of cases going undetected by primary care physicians (Cepoiu et al., 2008). Automated detection of depression, sometimes associated with other mental health disorders, has recently been the topic of several studies, with a particular focus on social media data and online forums (Coppersmith et al., 2015; Benton et al., 2017; Guntuku et al., 2017; Yates et al., 2017; Song et al., 2018; Akhtar et al., 2019; Ríssola et al., 2021). The ultimate goal of such systems would be to complement expert assessments, but such empirical studies are also valuable for better understanding how communication is affected by health disorders. In this paper, we propose to investigate depression detection within dialogs, a scenario less studied but closer to interviews with clinicians, which allegedly involves dialog features and also allows us to examine how interaction is affected.
However, depression detection suffers from data sparsity. In fact, using social media data was one way to tackle this issue, including by considering data generated by self-diagnosed users, a method that yields potentially noisy data and raises ethical issues (Chancellor et al., 2019). We instead examine a dataset of 189 clinical interviews, DAIC-WOZ (Gratch et al., 2014), collected by experts to support the diagnosis of distress conditions. Participants are identified as depressive or not, and, if so, they receive a severity score. One line of work proposed to overcome data scarcity by leveraging varied modalities, e.g., using audio as in Al Hanai et al. (2018). Previous approaches based solely on textual information relied on hierarchical contextual attention networks over word- and sentence-level representations (Mallol-Ragolta et al., 2019), or on multi-task learning (MTL) limited to combining identification and severity prediction (Qureshi et al., 2019; Dinkel et al., 2019), possibly with emotion (Qureshi et al., 2020).
Inspired by the latter approaches, we also rely on an MTL framework to help our model leverage information from different sources. We exploit three auxiliary tasks: emotion, which is naturally tied to mental health states, but also dialog act and topic classification, in the hope that this shallow information about dialog structure can further enhance performance. Our architecture is classic, based on hard parameter sharing (Ruder, 2017); it is simpler than the shared-private architecture of Qureshi et al. (2020) but proves effective. To take dialog organization into account, we advocate a dialog-tailored hierarchical architecture in which some tasks are performed at the speech-turn level and others at the document level.
Our contributions are: (i) an empirical study on depression detection in dialogs, leveraging the power of multi-task learning to deal with data sparsity; (ii) an extension of previous work that examines the effects of depression on dialog structure via shallow markers, i.e., dialog acts and topics, as a first step; (iii) state-of-the-art results on depression detection on the DAIC test set, with 70.6% F1 at best.

Related work
Within multi-task learning (MTL), a model has to learn shared representations that generalize better to the target task. MTL improves performance over single-task learning (STL) by leveraging commonalities or correlations between tasks. Recent years have witnessed a series of successful applications across NLP tasks, as in Collobert and Weston (2008); Søgaard and Goldberg (2016); Ruder (2017); Ruder et al. (2019), which demonstrates the effectiveness of MTL in learning information from different but related sources. It also tackles the data sparsity issue and reduces the risk of overfitting (Mishra et al., 2017; Benton et al., 2017; Bingel and Søgaard, 2017). Joshi et al. (2019) demonstrated the benefit of MTL for specific pairs of closely related health prediction tasks on tweets. Benton et al. (2017) used MTL on social media data and achieved substantial improvements in predicting several mental health signals, including suicide risk, depression, and anxiety, together with gender prediction. With a focus on depression detection, the 2016 AVEC shared task (Valstar et al., 2016) brought out a series of multi-modal studies using vocal and visual features on the DAIC-WOZ dataset (Gratch et al., 2014). Qureshi et al. (2019, 2020) proposed MTL approaches that add emotion intensity and depression severity prediction (i.e., a regression problem) to the main classification task. They found, however, that the emotion-unaware model obtained the best result. They used a monologue corpus for the emotion task, a domain bias that possibly harms performance. On the contrary, we hypothesize that emotional information benefits depression detection. Mallol-Ragolta et al. (2019) used a hierarchical contextual attention network with static word embeddings in a single-task setting, combining representations at the word and sentence levels. They reported at best 63% F1. Recently, Xezonaki et al. (2020) presented even better results, 70% F1, by augmenting the attention network with a conditioning mechanism based on external affective lexicons and by incorporating the summary associated with each interview. We instead rely on MTL in this work, where incorporating external sources is more direct.
None of these previous studies investigated potential links between depression and dialog structure. We note that Cerisara et al. (2018) explored MTL with sentiment and dialog act prediction on Mastodon (a Twitter-like dataset), where both annotations are available, and found a positive correlation. To the best of our knowledge, we are the first to tackle depression detection in dialog transcriptions with an MTL approach and to explore joint learning with tasks related to dialog structure.

Model Architecture
One condition generally assumed for success with MTL, at least in NLP, is that the primary and auxiliary tasks be related (Ruder, 2017). The emotion-related task is thus a natural choice, since it is linked to mental states. We hypothesize that depressive disorder can also affect how people interact with others during conversations. We therefore take a first step toward linking dialog structure and depression by examining shallow signals: dialog acts and topics. In addition, since the information comes at different levels, we propose hierarchical modeling, from speech turns to documents.
Baseline Model: Our basic model is a two-level recurrent network similar to the one in Cerisara et al. (2018). Input words are mapped to vectors using word embeddings learned from scratch. The first level (turn level) feeds the embeddings into a bi-LSTM network to obtain one vector per turn. The second level (dialog level) feeds the sequence of turn vectors into an RNN, whose output is finally passed to a linear layer for depression prediction.
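As an illustration only, here is a minimal PyTorch sketch of such a two-level network. Class and variable names are our own, the layer sizes follow the 128-dimensional defaults reported in the implementation details, and this is not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class HierarchicalBaseline(nn.Module):
    """Two-level encoder: bi-LSTM over the words of each turn, RNN over turns."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Turn level: bidirectional LSTM producing one vector per speech turn
        self.turn_enc = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        # Dialog level: recurrent network over the sequence of turn vectors
        self.dial_enc = nn.GRU(2 * hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, n_classes)  # depression prediction

    def forward(self, dialog):
        # dialog: (n_turns, n_words) token ids for a single dialog
        words = self.emb(dialog)                    # (n_turns, n_words, emb_dim)
        _, (h, _) = self.turn_enc(words)            # h: (2, n_turns, hid_dim)
        turns = torch.cat([h[0], h[1]], dim=-1)     # (n_turns, 2 * hid_dim)
        _, h_d = self.dial_enc(turns.unsqueeze(0))  # h_d: (1, 1, hid_dim)
        return self.out(h_d.squeeze(0))             # (1, n_classes) dialog-level logits

model = HierarchicalBaseline(vocab_size=1000)
logits = model(torch.randint(0, 1000, (5, 12)))  # a dialog of 5 turns, 12 tokens each
```

The final hidden states of the bi-LSTM (one per direction) are concatenated to form each turn vector, mirroring the paper's description of one vector per turn.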

MTL Model:
The MTL architecture is composed of shared hidden layers and task-specific output layers (see Fig. 1) and corresponds to the hard parameter sharing approach (Caruana, 1993, 1997; Ruder, 2017). Since some auxiliary tasks are annotated at the speech-turn level (emotion, dialog act) while others are annotated at the document level (depression, topic), our architecture is hierarchical and arranges the task-specific output layers (MLPs) at two levels. Turn-level emotion and dialog act information can be learned in the turn-level LSTM network and transferred upwards to help depression and topic prediction; conversely, higher-level information is backpropagated to update the lower-level network. The loss is simply the sum of the per-task losses; in the MTL setting, we set equal weight for each task as the standard choice.
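A simplified sketch of this hard-parameter-sharing setup, with the two encoders shared across tasks, linear heads at the appropriate level, and an equal-weight summed loss (head names and label counts are ours, the latter taken from the DailyDialog annotations; this is an illustration, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalMTL(nn.Module):
    """Shared turn/dialog encoders with task heads at two levels."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.turn_enc = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.dial_enc = nn.GRU(2 * hid_dim, hid_dim, batch_first=True)
        # Speech-turn-level heads (share the turn encoder)
        self.emo_head = nn.Linear(2 * hid_dim, 7)   # 7 emotions
        self.da_head = nn.Linear(2 * hid_dim, 4)    # 4 dialog acts
        # Document-level heads (share the dialog encoder)
        self.depr_head = nn.Linear(hid_dim, 2)      # depressive / not
        self.topic_head = nn.Linear(hid_dim, 10)    # 10 topics

    def forward(self, dialog):
        words = self.emb(dialog)                    # (n_turns, n_words, emb_dim)
        _, (h, _) = self.turn_enc(words)
        turns = torch.cat([h[0], h[1]], dim=-1)     # (n_turns, 2 * hid_dim)
        _, h_d = self.dial_enc(turns.unsqueeze(0))
        dial = h_d.squeeze(0)                       # (1, hid_dim)
        return {"emotion": self.emo_head(turns),
                "dialog_act": self.da_head(turns),
                "depression": self.depr_head(dial),
                "topic": self.topic_head(dial)}

def mtl_loss(outputs, golds):
    # Equal-weight sum of cross-entropy losses over the tasks with gold labels
    return sum(F.cross_entropy(outputs[t], golds[t]) for t in golds)

model = HierarchicalMTL(vocab_size=1000)
outs = model(torch.randint(0, 1000, (5, 12)))
golds = {"emotion": torch.randint(0, 7, (5,)),
         "depression": torch.randint(0, 2, (1,))}
loss = mtl_loss(outs, golds)
```

Because the loss only sums over tasks present in `golds`, batches from different corpora (DAIC vs. DailyDialog) can update the shared encoders through whichever heads have labels.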
Datasets

DAIC-WOZ: This dataset is a subset of the DAIC corpus (Gratch et al., 2014). It contains 189 sessions (one session is one dialog, with on average 250 speech turns) of two-party interviews between participants and Ellie, an animated virtual interviewer controlled by two humans. We do not compare to Williamson et al. (2016), Haque et al. (2018), Al Hanai et al. (2018), Dinkel et al. (2019), or Qureshi et al. (2020), who only report results on the development set.
Evaluation Metrics: For depression classification, we follow Dinkel et al. (2019) and report accuracy, macro-F1, precision, and recall. For emotion analysis, we follow Cerisara et al. (2018) and report macro-F1.
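For concreteness, macro-F1 averages the per-class F1 scores with equal weight regardless of class frequency, which is why it rewards improvements on the minority (non-depressive) class. A small self-contained sketch:

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    labels = set(gold) | set(pred)
    f1s = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

score = macro_f1([1, 1, 0, 0], [1, 0, 0, 0])  # (2/3 + 4/5) / 2 = 11/15
```

This matches scikit-learn's `f1_score(..., average="macro")` on the same inputs.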
Implementation Details: We implement our model with the AllenNLP library (Gardner et al., 2018). We use the original train, validation, and test splits for both corpora. The model is trained for a maximum of 100 epochs with early stopping. In both the STL and MTL scenarios, we optimize the macro-F1 metric for depression classification, using cross-entropy loss. The batch size is 4 for DailyDialog and 1 for DAIC (within the limit of GPU VRAM). We use the tokenizer from the spaCy library (Honnibal et al., 2020) and construct word embeddings with a default dimension of 128. The turn level has one hidden layer and 128 output neurons. We tune the number of document RNN layers in {1, 2, 3} and the hidden size in {128, 256, 512}. Model parameters are optimized with Adam (Kingma and Ba, 2014) at a 1e-3 learning rate. The dropout rate is set to 0.1 for both the turn and document encoders. The source code is available at https://github.com/chuyuanli/MTL4Depr.

Using the multi-task architecture, we obtain improvements when adding each task separately. We see a more than +11.5% increase in F1 when adding the emotion ('+Emo') or topic ('+Top') classification task and, at best, +16.9% with dialog acts ('+Diag'). This demonstrates the relevance of each task to the primary problem of depression detection, and especially the interest of dialog acts. When adding topics, we observe a small drop in accuracy compared to STL while F1 improves, meaning that prediction for the minority (non-depressive) class improves. Interestingly, in terms of accuracy, the tasks at a different level than depression ('+Emo' and '+Diag') seem to help more. We deduce that they help build a better local representation (speech turns) before the global one.

Results and Discussion
When jointly learning all four tasks, combining depression detection with the three auxiliary tasks ('+Emo+Diag+Top'), all metrics improve. We obtain our best system with a +26.7% improvement in F1 over the STL baseline, outperforming the state of the art with a +7.6% increase over the best system of Mallol-Ragolta et al. (2019) and about +0.5% over Xezonaki et al. (2020). Depressed people tend to express specific emotions; it is thus natural to think that emotion benefits the main task. These results indicate that both emotion and dialog structure help, as they provide complementary information, paving the way for new research directions with more fine-grained modeling of dialog structure for tasks in conversational scenarios.

Analysis
Performance on Auxiliary Tasks: To better understand our model, we look at the performance of the emotion, dialog act, and topic auxiliary tasks. Directly comparing the results of our MTL approach ('+Emo+Diag+Top') with an STL architecture for each task, however, seems unfair: the optimization objective and structural complexity differ, since the former is optimized for depression detection over two levels, while the latter is tuned for the target auxiliary task over either speech turns (emotion and dialog act) or full dialogs (topic). Unsurprisingly, the results show that the MTL system underperforms the basic STL structure for dialog acts and topics, with at best 67.8 F1 (MTL) vs. 68.8 (STL) for dialog acts, and 52.0 (MTL) vs. 52.4 (STL) for topic classification.
For emotion, on the other hand, our best MTL system obtains 40.0 F1 compared to 38.3 for the STL baseline, showing the mutual benefit of both tasks. Even though this score is lower than the state of the art for emotion classification (51.0 F1 in Qin et al. (2021)), we believe that refining our model for this task could lead to further improvements in depression detection. In addition, we observe that our MTL approach is particularly beneficial for negative and rare emotion classes, with anger, disgust, and sadness gaining 5%, 6%, and 1% in F1, respectively. Finally, we conduct a manual inspection of the types of utterances (mostly questions) from Ellie and classify them into high-level dialog acts: Backchannel, Comment, Opening, Other, and Question. We find that around 13% of the utterances are emotion-related, for instance "things which make you mad / you feel guilty about, last time feel really happy", etc., and that mentions of topics related to happiness or regret appear in almost all the interviews. The dialog act distribution is shown in Table 3. We release our annotations to the community for future studies.

Conclusion
In this paper, we demonstrate the correlation between depression and emotion and show the relevance of features related to dialog structure via shallow markers: dialog acts and topics. In the near future, we intend to investigate more refined modeling of dialog structure, possibly relying on discourse parsing (Shi and Huang, 2019). We would also like to explore depression severity classification as an extension of binary classification, possibly through a cascading structure: first detect depression, then classify its severity. We further intend to report on cross-validation splits of the data to test the stability of the model, an issue all the more crucial when dealing with sparse data with possible representativeness problems. A further step will be to investigate the generalization of our model to other mental health disorders.

Ethical Considerations
The goal of such systems is not to replace human healthcare providers; they should be used only in support of human decisions. Leaving decisions to the machine would carry major risks in the health field, a mistake that in high-stakes healthcare settings could prove detrimental or even dangerous. Another issue is the representativeness of the data. It is currently very difficult to access patients in order to obtain more examples, and this institutional complexity leads researchers to systematically use the same dataset, creating a bias in the representation of the pathology, in particular for mental disorders, whose expression can take very varied forms. This also implies defining variation relative to a normative use of language, which carries a strong risk in this type of approach.
Moreover, we carefully selected the dialog corpora used in this paper to control for potential biases and personal information leakage. We only work with interview transcriptions, with no audio or visual information. In the text, all participant names have been replaced with pseudo-IDs.

Figure 1: Multi-task fully shared hierarchical structure. Light blue is for the DAIC dataset and the depression task; orange is for DailyDialog and the three auxiliary tasks.
with a score related to the Patient Health Questionnaire (PHQ-9): a patient is considered depressive if PHQ-9 ≥ 10 (Kroenke and Spitzer, 2002).

DailyDialog: This dataset (Li et al., 2017) contains 13,118 two-party dialogs (with an average of 7.9 speech turns per dialog) for English learners, covering various topics from ordinary life to finance. Three kinds of expert-annotated information are provided: 7 emotions (Ekman, 1999), 4 coarse-grained dialog acts, and 10 topics. We select this corpus for its large size, two-level annotations, and high quality. The train set contains over 87k turns annotated for emotions and dialog acts and over 11k dialogs annotated for topics. Detailed statistics are given in Appendix A.

Experimental setup

Baselines: We compare our MTL results with: (1) the majority class, where the model predicts all positive; (2) the baseline single-task model (see Sec. 3); (3) state-of-the-art results on the test set reported by Mallol-Ragolta et al. (2019) and Xezonaki et al. (2020).
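The PHQ-9 decision rule above is a simple threshold on the total score (nine items scored 0–3 each, so totals range from 0 to 27); a minimal sketch, with a function name of our own:

```python
def phq9_label(score: int) -> int:
    """Binary depression label from a PHQ-9 total score: 1 if score >= 10, else 0."""
    if not 0 <= score <= 27:  # nine items, each scored 0-3
        raise ValueError("PHQ-9 total score must be between 0 and 27")
    return int(score >= 10)

labels = [phq9_label(s) for s in (3, 9, 10, 24)]  # -> [0, 0, 1, 1]
```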

Depression Detection Results on DAIC

Results using our MTL hierarchical structure are shown in Table 2 and compared to SOTA models (at the top). Our baseline model is a single-task naive hierarchical model that obtains results (44 F1) similar to the baseline model (NHN) of Mallol-Ragolta et al. (2019) (45 F1).

Table 3: High-level dialog act distribution of Ellie in DAIC-WOZ. # and % represent the number and percentage of Ellie's utterances, respectively.

Table 4: Ablation study on the hierarchical structure.

Table 5, Table 6, and Table 7 show the number and percentage of emotions, dialog acts, and topics for each subset, respectively.

Table 5: Emotion distribution in the train, dev., and test sets.

Table 6: Dialog act distribution in the train, dev., and test sets.

Table 7: Topic distribution in the train, dev., and test sets.