Learning to Automate Follow-up Question Generation using Process Knowledge for Depression Triage on Reddit Posts

Conversational Agents (CAs) powered by deep language models (DLMs) have shown tremendous promise in the domain of mental health. Prominently, CAs have been used to provide informational or therapeutic services (e.g., cognitive behavioral therapy) to patients. However, the utility of CAs in assisting mental health triage has not been explored in existing work, as it requires controlled generation of follow-up questions (FQs), which are often initiated and guided by mental health professionals (MHPs) in clinical settings. In the context of ‘depression’, our experiments show that DLMs coupled with process knowledge in a mental health questionnaire generate 12.54% and 9.37% better FQs, based on similarity and longest common subsequence matches to questions in the PHQ-9 dataset respectively, when compared with DLMs without process knowledge support. Despite coupling with process knowledge, we find that DLMs are still prone to hallucination, i.e., generating redundant, irrelevant, and unsafe FQs. We demonstrate the challenge of using existing datasets to train a DLM for generating FQs that adhere to clinical process knowledge. To address this limitation, we prepared an extended PHQ-9 based dataset, PRIMATE, in collaboration with MHPs. PRIMATE contains annotations regarding whether a particular question in the PHQ-9 dataset has already been answered in the user’s initial description of the mental health condition. We used PRIMATE to train a DLM in a supervised setting to identify which of the PHQ-9 questions can be answered directly from the user’s post and which ones would require more information from the user. Using performance analysis based on MCC scores, we show that PRIMATE is appropriate for identifying questions in PHQ-9 that could guide generative DLMs towards controlled FQ generation (with minimal hallucination) suitable for aiding triaging.
The dataset created as a part of this research can be obtained from https://github.com/primate-mh/Primate2022


Introduction
Conversational agents (CAs) powered by DLMs are software designed to interact with human users for specific tasks. For mental health purposes, particularly depression, CAs have been studied extensively in prior work for helping patients follow generic mental health guidelines, typically by providing reminders to assist patients in adhering to the medication and therapy strategy outlined by a mental health professional (MHP) (footnotes 1 and 2). However, previous work on depression has not examined the use of CAs for triaging. For the purpose of triaging, CAs should learn to generate controlled and clinical process knowledge-guided discourse that can assist MHPs in diagnosis. Our research suggests a clinically grounded and explainable methodology to develop conversational information-seeking tools, first to learn "what symptoms the user is suffering" and "what extra information is needed for triaging." CAs are susceptible to generating irrelevant and sometimes harmful questions when producing FQs or responses for a patient suffering from depression (Miner et al., 2016). The primary reason for irrelevant and harmful questions is that CAs cannot incorporate contextual information in generating appropriate follow-up questions (FQs) (see Figure 1). Further, the sensitivity of the conversation and a controlled generation process are essential characteristics of patient-clinician interactions, which are difficult to embed in DLM-based CAs. Therefore, question generation (QG) in mental health is challenging, and research to develop CAs for automating triage has not been explored.
1 https://tinyurl.com/yfp3bhr2
2 https://woebothealth.com/
Figure 1: Reddit is a rich source for bringing crowd perspective in training DLMs over conversational data. On the left is a sample post from r/depression_help, which sees inquisitive interaction from other Reddit users. At the top-right are the FQs asked by the Reddit users in the comments. These FQs are aimed at understanding the severity of the mental health situation of the user and are hence diagnostically relevant. At the bottom-right are the questions generated by DLMs. It can be seen that these are not suitable FQs.
Procedures for generating semantically related and logically ordered questions in the mental health domain are a form of process knowledge manifested in various clinical instruments for mental health triage. For example, the severity of depression is measured using the Patient Health Questionnaire (PHQ-9). Requiring DLMs to follow process knowledge, as in PHQ-9, would make CAs generate FQs similar to those of an MHP seeking information from the patient (Karasz et al., 2012). Unfortunately, datasets that meet this criterion are currently unavailable. Though clinical diagnostic interviews exist, they are not sufficiently rich, dense, or varied to train DLMs (Manas et al., 2021; Gratch et al., 2014). Further, we require dataset(s) that include support-seeking queries and natural questions that show help-providing behavior. For this purpose, anonymized user-generated conversational data in mental health support communities on Reddit provides a rich source of fine-grained, contextual, and diverse information suitable for fine-tuning DLMs. Specific to depression, we explored posts and comments in r/depression_help.
In the current research, we emphasize the limitations of T5, a state-of-the-art DLM (footnote 3), in generating process knowledge-like FQs using the data from r/depression_help (Raffel et al., 2019). We filtered the dataset by retaining only posts with at least one comment that seeks additional information from the user seeking support. Further filtering of comments was performed using PHQ-9 to assist T5 in generating relevant FQs (see Figure 2). We found that the outcome is substantial for the single turn question answering model; however, it is not suitable for mental health triage, which is a discourse. We conducted a series of experiments keeping our focus on 'depression' and leveraged its associated process knowledge for mental health triage: the PHQ-9 (Kroenke et al., 2001). To the best of our knowledge, FQ generation relating to depression has never been studied using PHQ-9 for discourse modeling and generation.
3 Current DLMs are either variants of T5 or built from T5.
We make the following key contributions: (a) Extending PHQ-9: PHQ-9 questions are limited in scope for common NLP tasks like fine-tuning. In collaboration with MHPs, we prepared a list of 134 sub-questions for the nine PHQ-9 questions for better fine-tuning of T5. (b) We analyzed the performance of three variants of T5 using BLEURT (Sellam et al., 2020) and ROUGE-L scores, which measure the semantic relatedness and exact match similarity, respectively, of generated questions to sub-questions of PHQ-9. (c) PRIMATE Dataset: Lessons learned during our experiments suggested that T5 must be trained in a supervised setting to capture 'what the user has already mentioned about his/her depression condition in the post-text' and then generate FQs. Along with MHPs, we constructed a novel PRIMATE (PRocess knowledge Integrated Mental heAlth daTasEt) dataset that would train DLMs to capture PHQ-9-answerable information from user text. In this research, we restrict our experiments and discussion to whether PRIMATE can help capture context from the user post relevant to some PHQ-9 questions and point out which other PHQ-9 questions would form candidates to direct FQ generation. Our approach and insights have applications to Anxiety (GAD-7), Suicide (C-SSRS), and other mental health disorders as well.

Related Work
Recently, DLMs have attracted much attention for question answering, thanks to their successes in NLP applications (Thoppilan et al., 2022; Borgeaud et al., 2021). Research on question generation has focused on improving the legibility and relevance of questions. This is because DLMs continue to hallucinate while generating questions in general-purpose domains, which can lead to factually incorrect responses. This can have severe consequences in the mental health domain (Thoppilan et al., 2022). Recently, inappropriate and toxic behaviors of language models have been extensively studied and reported in the literature (Dinan et al., 2021; Weidinger et al., 2021). Solutions around fine-tuning, augmenting a neural retriever to support generation, and rules on generation quality have been defined as possible remedies (Manas et al., 2021). These have been effective for the general-purpose domain; however, the research surrounding DLMs is yet to unfold in mental health. ELIZA (Weizenbaum, 1983) could transform users' statements into questions but employs labor-intensive templates to generate safe and relevant questions. Models like RAG and REALM were developed to include external knowledge to support question generation (Lewis et al., 2020; Guu et al., 2020). However, these models are still susceptible to incoherent and irrelevant FQ generation. Further, their end-to-end learning approach is too rigid to support process-guided question generation and discourse, often followed in a clinical setting for triage (Gaur et al., 2021).
In theory, DLMs should be capable of extracting pieces of information from the user description that portray the understanding of the user and leveraging them for generating the next FQ. For such a task, supervised training of DLMs with process knowledge, coupled with information retrieval over domain-specific mental health knowledge, is a viable solution. This is because mental health knowledge sources (e.g., SCID (Structured Clinical Interviews for DSM-5)) have structured/semi-structured information on how interviews are performed (Brodey et al., 2018). Our research substantiates that DLMs (e.g., T5) generate low quality follow-up questions in the context of depression for triage, and that granting external knowledge through PHQ-9 reduces the rate at which models generate meaningless FQs (Thoppilan et al., 2022; Komeili et al., 2021). In the current research, we define an approach for supervised training of DLMs on a specific dataset that would yield a probability distribution over PHQ-9 (with support from Extended PHQ-9). These probabilities will confirm whether the DLM can identify cues from user text that can inform a set of PHQ-9 questions. The remaining PHQ-9 questions are potential FQs.
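The selection step described above (PHQ-9 questions informed by the post versus remaining questions that become potential FQs) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_by_answerability`, the question identifiers, the probability values, and the default threshold are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def split_by_answerability(
    probs: Dict[str, float], delta: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Split PHQ-9 items into answerable items and follow-up candidates.

    probs maps a PHQ-9 question id to the (hypothetical) probability that
    the user's post already answers it. Items at or above delta count as
    answerable; the rest are candidates for follow-up questions.
    """
    answerable = [q for q, p in probs.items() if p >= delta]
    candidates = [q for q, p in probs.items() if p < delta]
    return answerable, candidates


# Hypothetical classifier outputs for three of the nine PHQ-9 items.
example_probs = {
    "Q1_little_interest": 0.92,
    "Q3_sleep": 0.15,
    "Q4_fatigue": 0.81,
}
answerable, fq_candidates = split_by_answerability(example_probs, delta=0.5)
```

In this toy input, the sleep item is the only one left as a follow-up candidate, so a generator would be steered toward asking about sleep.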
Datasets: Prior datasets such as Counsel Chat (CounselChat), Counseling Conversations (Huang, 2015), Role Play (Demasi et al., 2019), Crisis Text Line (Althoff et al., 2016) and Reddit C-SSRS (Gaur et al., 2019) have been created to train CAs for mental health counseling. Trained CAs can engage in single turn question answering; however, conducting a conversation requires capturing user context and leveraging clinical instruments to guide the generation of FQs.

Question Generation (QG)
Figure 2: An illustration of our pipeline for developing Model 2 and Model 3 using T5 as the deep language model. Starting with posts (including comments) from r/depression_help, we filter out comments that are neither interrogative nor information seeking in nature to yield a posts-questions dataset for fine-tuning T5. This dataset was further filtered using extended PHQ-9 before using it to fine-tune T5 (Model 3).
Table 1: Examples of questions generated by T5 when tasked to generate FQs when the user query for the post in Figure 1 was provided as input. Model 1, which is a pre-trained T5 (Raffel et al., 2019), often generates questions which are irrelevant, unsafe, incoherent, and redundant. Model 2, which is T5 fine-tuned on r/depression_help, seems to be relatively coherent and inquisitive compared to Model 1. However, both models generate questions about the topic that the user has discussed in their query. As a result, we see that pre-trained and fine-tuned DLMs fail to generate FQs. By enforcing FQ generation using a dataset curated using extended PHQ-9, generated questions have been mostly inquisitive. This is shown by Model 3. Still, a lot of generations are around the problem the user mentioned.
Dataset for QG: Our approach to data collection involves scraping posts and comments from r/depression_help, a subreddit that is meant to provide advice and support to individuals suffering from depression. The posts on this subreddit contain flair tags such as SEEKING HELP, SEEKING ADVICE, and REQUESTING SUPPORT. We filter the data curated from this subreddit based on the flair tag attribute to retain only advice-, help-, or support-seeking posts and their comments. After filtering, our dataset had approximately 21,000 posts. Each post contains a title, description, and comments. On average, each post has 5 comments. Next, we chunked the main text of each post into smaller groups of sentences (chunks) of less than 512 tokens while making sure no sentence is segmented in between. The motivation for chunking is to ensure no context is lost from the post due to the limitation of T5 to process 512 tokens as input (DLMs in general suffer from such representation limits). We also appended the post title to each chunk to ensure that the main idea of each post was captured in its chunks. This curated dataset tests T5's capability to generate FQs similar to any of the questions in the extended PHQ-9 questionnaire.
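The chunking step described above can be sketched as follows. This is a minimal sketch under stated assumptions: it counts whitespace-separated words rather than T5 subword tokens, splits sentences with a naive regular expression, counts the title toward the token budget, and uses a hypothetical function name `chunk_post`.

```python
import re
from typing import List


def chunk_post(title: str, body: str, max_tokens: int = 512) -> List[str]:
    """Group whole sentences into chunks under max_tokens, prefixing the title.

    Assumptions: tokens are whitespace-separated words (not T5 subwords),
    and sentences end with '.', '!' or '?' followed by whitespace.
    A single sentence longer than the budget is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if s]
    title_len = len(title.split())
    chunks: List[str] = []
    current: List[str] = []
    current_len = title_len
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would reach the limit.
        if current and current_len + n >= max_tokens:
            chunks.append(" ".join([title] + current))
            current, current_len = [], title_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join([title] + current))
    return chunks
```

Because sentences are never split, every chunk carries complete sentences plus the title, which mirrors the stated goal of keeping the main idea of the post in each chunk.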
Extending PHQ-9 to support FQ generation: PHQ-9 questions are subject to different interpretations depending on patient-MHP interaction. Additionally, nine questions are limited in scope for use in tasks like fine-tuning and similarity-based performance evaluations. Therefore, to increase the strength of PHQ-9, we collaborated with MHPs to create sub-questions for each question in PHQ-9. First, we used Google SERP API (footnote 4) and Microsoft Bing Search API (footnote 5) to retrieve "People-Also-Ask" questions. For each question, we retrieved 40 questions by manually searching and assessing their relevance to PHQ-9 questions. Next, we provided the set of 360 questions to three MHPs for assessment. MHPs evaluated the questions on two grounds: (a) Would they ask such a question to a patient? (relevance) (b) If yes, when should such a question be asked? (rank). Based on their ratings, we created a final set of 134 sub-questions for the nine questions in PHQ-9 (footnote 6), resulting in a total of 143 questions.
Table 2: In this example, the generated questions from both Model 2 and Model 3 seem to be relevant FQs, but they are not assessing the severity of the mental health condition, despite Model 3 being fine-tuned on a dataset filtered by PHQ-9 questions. In comparison to the qualitative outcome in Table 1, this showcases the inability of T5 to support mental health triage.
4 https://serpapi.com/
5 https://www.microsoft.com/en-us/bing/apis/bing-web-search-api
6 Questions in extended PHQ-9: link
Table 3: Experimental results comparing different models in generating questions that match the sub-questions in PHQ-9. $Q$ is the set of generated questions in each chunk. The performance is recorded over all the generated questions ($\hat{Q}$). δ was used as the threshold on the similarity between a generated question and PHQ-9 sub-questions while calculating hit rate. BLEURT records semantic similarity, whereas Rouge-L records the longest common subsequence exact match between a generated question and PHQ-9 sub-questions. The highest performance on semantic and string similarity is bolded. Acceptable performance in Model 3 achieved using PHQ-9 motivated us to prepare PRIMATE.
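To make the longest-common-subsequence matching behind Rouge-L concrete, the following is a minimal sketch. It assumes lowercase whitespace tokenization and reports the balanced F-measure; the exact Rouge-L variant and tokenization used in the experiments are not specified here, so both choices are assumptions, as are the function names.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Rouge-L style F1 between a generated question and a reference question."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For instance, a generated question that is a word-for-word prefix of a reference has perfect precision and a recall equal to the fraction of the reference it covers.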

Analysis of Models for Question Generation:
Out of the 21k posts, the performance of Models 1, 2, and 3 was examined on the 2003 posts that had at least one interrogative comment. Each of the three models was made to generate FQs in sets of 5, 10, and 15 through nucleus sampling (Holtzman et al., 2019). For a generated question, the BLEURT score was computed with each question in Extended PHQ-9, and the maximum among those scores was taken as the score for the generated question. A clear distinction between Models 1, 2, and 3 is the nature of the questions asked. Model 1 generated closed book questions, whereas Models 2 and 3 seem to show some inquisitive nature and seem more focused on the mental health domain, which can be attributed to the effect of fine-tuning on Reddit (see Tables 1 and 2). We captured the performance of the models quantitatively using 'hit rate' as a metric. For a generated question $q$, we denote $\text{score}(q) = \max\big(\text{bleurt\_score}(q, q_1), \text{bleurt\_score}(q, q_2), \ldots, \text{bleurt\_score}(q, q_{143})\big)$, where $q_1, q_2, \ldots, q_{143}$ are the questions in Extended PHQ-9.
In the hit rate computation, δ is the threshold on the similarity between a generated question in a chunk and the sub-questions in PHQ-9, and $I[\phi]$ is the indicator function taking values 0 or 1 for a predicate $\phi$ (Table 3 has the scores).
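The hit rate computation can be sketched in code as follows. This is an illustration under stated assumptions: `jaccard` is a simple stand-in similarity and is not BLEURT, the comparison against δ is taken to be "greater than or equal", and all function names and example questions are hypothetical.

```python
from typing import Callable, Iterable, Sequence

Similarity = Callable[[str, str], float]


def max_similarity(q: str, references: Sequence[str], sim: Similarity) -> float:
    """score(q): best similarity between q and any reference question."""
    return max(sim(q, r) for r in references)


def hit_rate(
    generated: Iterable[str],
    references: Sequence[str],
    sim: Similarity,
    delta: float,
) -> float:
    """Fraction of generated questions whose score reaches the threshold delta."""
    generated = list(generated)
    if not generated:
        return 0.0
    hits = sum(1 for q in generated if max_similarity(q, references, sim) >= delta)
    return hits / len(generated)


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity (NOT BLEURT): token-set Jaccard overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0
```

Swapping `jaccard` for a learned metric changes only the `sim` argument; the thresholding and averaging stay the same.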
Inference: (1) Regardless of fine-tuning and filtering based on PHQ-9 questions, T5 inherently does not capture the meaning and usage of words in the mental health context. Moreover, T5 fails to generate legible and relevant FQs as safe as PHQ-9 questions. Therefore, we scrutinize the generated FQs by mapping them to the most similar questions in extended PHQ-9. Examples of irrelevant generations by T5 that were nevertheless matched as relevant are: (a) "Wtf?" (generated FQ) was found most similar to "Do you have hope?" (PHQ-9); (b) "What did Boyfriend suffocate me with during his break up a week after I got a diagnosis?" (generated FQ) was found most similar to "What do you think makes you a failure" (PHQ-9). The latter generated question is also redundant, as the answer to it was already present in the original post.
(2) Many generated questions contain extreme language due to the informal nature of the Reddit platform, which is a very sensitive issue, especially in the mental health domain. An example is: "Did you f***ing realize that f***ing people are f***ing too?" (generated FQ), which was found to be the most similar to "What do you think makes you a failure?". Thus, T5 and its variants need to capture "what the user knows and has already mentioned in his post" by checking which PHQ-9 questions are already answerable using the user's post before generating the next probable FQs, in order to avoid redundancy.

PRIMATE for FQ Generation
We conceptualize our approach on the duality of data and the process knowledge contained in PHQ-9 (see Figure 4). First, a BERT Answerability Evaluator identifies which questions in PHQ-9 are already answerable (using the user's initial description of his/her condition in the post) and which ones need more information to be answerable. The latter type of questions form candidates for training a generative DLM for FQ generation. We present PRIMATE, a dataset consisting of Reddit posts in which users describe their health conditions, together with whether the questions in PHQ-9 are answerable using the content in the posts. Each question is attributed with a binary "yes" or "no" label stating whether the user's description already contains the answer to that question (see Table 4). PRIMATE was created from a month-long annotation-evaluation cycle between MHPs and crowd workers. A total of five crowd workers performed this task, achieving an initial annotator agreement of 67% using Fleiss kappa. Subsequently, the MHPs assessed the quality of annotations and provided their suggestions for improvement, leading to an acceptable agreement score of 85%. A sample annotated post in PRIMATE is shown in Figure 3.
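Fleiss' kappa, the agreement statistic named above, can be computed from per-item category counts. The following is a minimal sketch, assuming every item is rated by the same number of raters; the function name and the example counts in the usage note are hypothetical and are not taken from PRIMATE.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for counts[i][j] = raters assigning item i to category j.

    Assumes every item has the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Per-item observed agreement P_i.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement P_e from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1 - p_e)
```

With five raters and binary yes/no labels, a row such as `[5, 0]` means all five raters chose "yes" for that item, while `[3, 2]` records a three-to-two split.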
BERT as Answerability Evaluator: While Model 3 shows respectable performance (Table 3), even the FQs generated by Model 3 may not yield the most efficient capture of the PHQ-9 related questions (evident from the low hit rate at a higher threshold δ). The MHPs would probably have a more streamlined, focused questioning strategy. For efficient collaboration between MHPs and AI, we propose to guide the questioning in a more systematic way by predicting if the user post already has answers to the PHQ-9 questions. This is first posed as a binary classification problem over nine PHQ-9 questions. Thereafter, the approach is to generate questions similar to the PHQ-9 questions that do not have answers in the post. Thus, we train BERT (footnote 9), a transformer-based DLM, as a classifier on the PRIMATE dataset. We plan to further use the classification outcome from the BERT model to drive the direction of further questioning with the patient in a more controlled manner. This process can lead to high efficiency and completion of the mental health triaging in as few questions as possible.
9 BERT end-to-end training performs well compared to the baselines Electra (Clark et al., 2019) and MedBERT (Gu et al., 2021).
Table 5 (notes): The MCC score for all 9 questions across different thresholds is in the range 0 to +1 (low to high positive relationships). The MCC for some configurations runs into a divide by zero error, and we replace this value with 0.0. W: model is unable to learn cues to determine answerability in a post. M: model is uncertain whether a particular PHQ-9 question is answerable or not. S: answerability can be determined by the model with high reliability. Class-Type: Classification Type when δ = 0.9.
Performance Analysis: We report the Matthews Correlation Coefficient (MCC) scores in Table 5.
MCC is a reliable metric to assess a model's classification over an imbalanced dataset, particularly useful when we are interested in all four categories of the confusion matrix: true positives (answerable questions (AQ)), true negatives (FQ candidates), and false alarms (false negatives and positives). As PRIMATE shows a disproportional distribution of AQs (yes) and FQs (no), MCC is an appropriate metric (Chicco and Jurman, 2020). We base our analysis on the consistency of the BERT classifier on varying threshold (δ) in Table 5. A score between 0.0 and 0.30 (Type W: Weak) on MCC means the model is only able to find a negligible to weak positive relationship between input and output. In our context, a score in this range for a particular PHQ-9 question means that the model is unable to effectively learn the cues needed to judge the answerability of that question in user posts. A score between 0.30 and 0.40 (Type M: Maybe) means that the model is able to learn a moderately positive relationship, interpreted as ambiguity in the model when judging whether a particular PHQ-9 question is answerable from user posts. MCC scores between 0.40 and 0.70 (Type S: Strong) for a question in PHQ-9 mean that the model can effectively judge whether that question is answerable in user posts. Any score above 0.70 makes the model's judgements even more reliable. This experiment completes steps 1 and 2 in Figure 4. Steps 3, 4 and 5 are concerned with the task of FQ generation by fine-tuning the T5 DLM as a generator over r/depression_help and other depression support communities on Reddit. The FQ generations will be controlled using the process knowledge in SCID, which is consulted for interviewing by MHPs. Further, PHQ-9 lexicons are leveraged for promoting diversity and filtering irrelevant FQ generations.
We leave this process of FQ generations to shape discourse as future work.
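The MCC-based typing used in the performance analysis can be sketched as follows. The MCC formula is the standard one and the 0.0 fallback follows the divide-by-zero handling stated for Table 5; the treatment of scores exactly at 0.30 and 0.40, and the mapping of scores above 0.70 to "S", are assumptions of this sketch. The function names are hypothetical.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Divide-by-zero configurations are replaced with 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom


def class_type(score: float) -> str:
    """Map an MCC score to Weak (W), Maybe (M), or Strong (S)."""
    if score < 0.30:
        return "W"
    if score < 0.40:
        return "M"
    return "S"
```

A classifier that never predicts "yes" for a question yields a zero denominator, which is why such configurations fall into type W.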

Conclusion
This paper demonstrated the importance of data and process knowledge in adapting DLMs to generate FQs that would assist MHPs in triaging depression. Our experiments show that without process knowledge, DLMs hallucinate by generating unsafe, incoherent, and irrelevant questions that are not helpful for MHPs in pre-screening or triaging. The challenge lies in the inability of DLMs to judge which of the generated questions is a potentially effective FQ to ask based on the user information. The improved question generation performance of DLMs fine-tuned on conversational data filtered by process knowledge encouraged us to prepare PRIMATE. PRIMATE can train DLMs to judge 'whether a user's description of their mental health condition already contains an answer to a particular question in PHQ-9', which would eventually guide coherent FQ generations. We leave our approach for FQ generation as future work, but provide sufficient details on the broader forms of knowledge needed in realizing such a pipeline.

Limitations:
We are yet to scale our understanding to other mental health disorders, such as anxiety using GAD-7 and suicidality using C-SSRS (Jiang et al., 2020). Further, we are yet to investigate whether PRIMATE, along with the knowledge in SCID, can make DLMs transferable across multiple mental health disorders, especially the ones comorbid with depression. Also, there is a need for a clinically explainable safety metric for our task.
Ethical Considerations: Mental health communities on Reddit offer a crowd perspective on various disorders wherein the FQs in the comments highlight the good intentions of Reddit users to help users with conditions such as depression. We take such interactions as a proxy for improving patient-MHP interactions. Benton et al. (2017) described that studies involving user-generated content are exempted from the IRB requirement as long as the data source is public and the user's identity is not recognizable. Apart from the data being publicly available, Reddit users are anonymous, and we further work with random user IDs. Since we make PRIMATE public for research use, we use a Data Use Agreement (Losada and Crestani, 2016) for responsible dissemination of the dataset.

• How long ago did the worsening chronic fatigue continue to cripple my life? [*User already mentions in the title of the post he has been facing the fatigue for years]
• What was the estimated dose of suicide? [*Unsafe incoherent question]
• How could body parts feel far less hot in other parts of my life? [*Incoherent question]
• What has caused the gradual cognitive decline? [*It's the physician who needs to answer this question after the conversation]
• What effect is that I'm less optimistic, and it's embarrassing ? [*Incoherent attempt to generate a question from the sentence in the post which mentions "I am still pretty optimistic , but I 'm just too out of it , it 's embarrassing"]
• What was the reason that increasing depression has a negative effect on me? [*Irrelevant and incoherent]
• A sleep study consists of what? [*Irrelevant to the diagnostic procedure]
• What did Caffeine feel in my body? [*Wrong pronoun used and irrelevant to the diagnosis]
• Have you ever tried many depressants or mental health things? [*Redundant question as user has already mentioned in the post text that he has tried antidepressants]
• Are you seeing a therapist? [*Redundant question as user had already mentioned in the post text that he has seen a therapist in the past]
• Do you even ever try to focus on negative things and/or pain, relief, and patience? [*Incoherent question]
• And what if you forgot to fucking realize that, then it will be fucking so much that you'll be fucking fucking can we ask what they want. [*Unsafe objectionable question]
• Do you mean heavy bipolar disorder? [*Irrelevant to the context of the post]
• Are you aware about the qualities of antidepressants? [*Inquisitive in nature but irrelevant from a diagnostic perspective]
• What does antidepressant work at? [*Irrelevant from a diagnostic perspective]
• Do you have a close friend you can go to who you can talk with, that way you can get out of the house? [*FQ]
• Are you on any antidepressants? [*User mentions in the post he has tried them before but no mention has been made if he is still on them]
• Do you have any positive thoughts? [*FQ]
• Are you in any danger of hurting yourself? [*Slightly unsafe]
• Is it that you aren't happy with your feelings? [*Irrelevant question]
• Have you tried some exercise? [*Redundant question as user already mentions he has tried it]
• Do you wake up frequently? [*FQ]
• How is your sleep quality? [*FQ]
• When you wake up, what do you do? [*FQ]
• Is there anything that helps you calm the symptoms for now? [*FQ]
• What are your hobbies? [*Generic FQ]
• What are your interests? [*Generic FQ]

Figure 3: A post in PRIMATE which is annotated with PHQ-9. The questions marked "YES" are answerable by DLMs using the mental health specific cues from user text. The questions marked "NO" are the questions a DLM should consider asking as FQs. Sentences within [] were taken as signals that the "YES" marked questions had already been answered in the post.
Here, $q_1, \ldots, q_{143} \in$ Extended-PHQ-9. Across all 2003 posts, we had $C = 2575$ chunks (footnote 8). Let the total number of questions generated by a model be $|\hat{Q}|$, and let $|Q|$ denote the number of questions generated by the model for a given chunk. For experimentation, we set $|Q|$ to the values $\{5, 10, 15\}$. Thus, $|\hat{Q}| = |Q| \times C$. Then the Hit Rate for a model was computed as:
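A form of the hit rate consistent with the surrounding definitions (the per-question score, the threshold δ, the indicator function, and the total number of generated questions) is sketched below. Two points are assumptions of this sketch rather than statements from the text: the symbol $\hat{Q}$ for the set of all questions generated by a model, and the use of a non-strict comparison against δ.

```latex
\text{Hit Rate} \;=\; \frac{1}{|\hat{Q}|} \sum_{q \in \hat{Q}} I\big[\,\text{score}(q) \geq \delta\,\big]
```

Under this reading, the hit rate is the fraction of generated questions whose best match in Extended-PHQ-9 reaches the similarity threshold.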

Figure 4: 1. Answerability evaluator: A BERT model is trained in a supervised setting to be an evaluator of whether a PHQ-9 question can be answered in a given user post (binary) using PRIMATE. For nine PHQ-9 questions, we require nine such evaluators. 2. Follow up questions: PHQ-9 questions that are not already answerable using the user post form candidates for follow up. 3. SCID: Corresponding to each PHQ-9 question, the SCID describes a clinician approved sub-sequence of questions to obtain the answer to the follow up question. 4. Use existing PHQ-9 and DSM-5 lexicons (Yazdavar et al., 2017) to filter the question to be generated. 5. Generate FQs using T5 fine-tuned on external domain-specific knowledge and the large-scale depression support conversation dataset created from Reddit and PRIMATE.

Table 5: We record the Matthews Correlation Coefficient (MCC) to measure the performance of the Evaluator (see Figure 4).