Towards Interpretable Mental Health Analysis with Large Language Models

The latest large language models (LLMs), such as ChatGPT, exhibit strong capabilities in automated mental health analysis. However, existing studies bear several limitations, including inadequate evaluations, a lack of prompting strategies, and little exploration of LLMs for explainability. To bridge these gaps, we comprehensively evaluate the mental health analysis and emotional reasoning abilities of LLMs on 11 datasets across 5 tasks. We explore the effects of different prompting strategies with unsupervised and distantly supervised emotional information. Based on these prompts, we explore LLMs for interpretable mental health analysis by instructing them to generate explanations for each of their decisions. We conduct strict human evaluations to assess the quality of the generated explanations, yielding a novel dataset with 163 human-assessed explanations. We benchmark existing automatic evaluation metrics on this dataset to guide future work. According to the results, ChatGPT shows strong in-context learning ability but still leaves a significant gap to advanced task-specific methods. Careful prompt engineering with emotional cues and expert-written few-shot examples can also effectively improve performance on mental health analysis. In addition, ChatGPT generates explanations that approach human performance, showing its great potential for explainable mental health analysis.


Introduction
WARNING: This paper contains examples and descriptions that are depressive in nature.
Mental health conditions such as depression and suicidal ideation seriously challenge global health care (Zhang et al., 2022a). NLP researchers have devoted much effort to automatic mental health analysis, with current mainstream methods leveraging Pre-trained Language Models (PLMs) (Yang et al., 2022; Abed-Esfahani et al., 2019). Most recently, Large Language Models (LLMs) (Brown et al., 2020; Ouyang et al., 2022), especially ChatGPT and GPT-4 (OpenAI, 2023), have exhibited strong general language processing abilities (Wei et al., 2022; Luo et al., 2023; Yuan et al., 2023). In mental health analysis, Lamichhane (2023) evaluated ChatGPT on stress, depression, and suicide detection, glimpsing its strong understanding of mental health-related texts. Amin et al. (2023) compared the zero-shot performance of ChatGPT on suicide and depression detection with previous fine-tuning-based methods.
Though previous works depict a promising future for a new LLM-based paradigm in mental health analysis, several issues remain unresolved. Firstly, mental health condition detection is a safety-critical task requiring careful evaluation and high transparency for any prediction (Zhang et al., 2022a), while these works simply tested a few binary mental health condition detection tasks and lacked explainability of the detection results. Moreover, other important mental health analysis tasks, such as the cause/factor detection of mental health conditions (Mauriello et al., 2021; Garg et al., 2022), were ignored. Secondly, previous works mostly use simple prompts to detect mental health conditions directly. These vanilla methods ignore useful information, especially emotional cues, which are widely utilized for mental health analysis in previous works (Zhang et al., 2023). We believe a comprehensive exploration and evaluation of the ability and explainability of LLMs on mental health analysis is required, including mental health detection, emotional reasoning, and cause detection
of mental health conditions. Therefore, we raise the following three research questions (RQ):
• RQ 1: How well can LLMs perform in generalized mental health analysis and emotional reasoning under zero-shot/few-shot settings?
• RQ 2: How do different prompting strategies and emotional cues impact the mental health analysis ability of ChatGPT?
• RQ 3: How well can ChatGPT generate explanations for its decisions on mental health analysis?
To respond to these research questions, we first conduct a preliminary study of how LLMs perform on mental health analysis and emotional reasoning. We evaluate four LLMs with varying model sizes, including ChatGPT, InstructGPT-3 (Ouyang et al., 2022), LLaMA-13B, and LLaMA-7B (Touvron et al., 2023), on 11 datasets across 5 tasks: binary/multi-class mental health condition detection, cause/factor detection of mental health conditions, emotion recognition in conversations, and causal emotion entailment. We then delve into the effectiveness of different prompting strategies for mental health analysis, including zero-shot prompting, Chain-of-Thought (CoT) prompting (Kojima et al., 2022), emotion-enhanced prompting, and few-shot emotion-enhanced prompting. Finally, we explore how LLMs perform on interpretable mental health analysis, instructing two representative LLMs, ChatGPT and InstructGPT-3, to generate natural language explanations for each of their results. To assess the quality of LLM-generated explanations, we perform human evaluations following a strict annotation protocol designed by domain experts, and thus create a novel dataset with 163 human-assessed explanations of posts, aimed at facilitating the investigation of explainable mental health analysis methods and automatic evaluation metrics. We benchmark numerous existing automatic evaluation metrics on the corpus to guide future research on automatically evaluating explainable mental health analysis. We conclude our findings as follows: 1) Overall Performance. ChatGPT achieves the best performance among all examined LLMs, although it still significantly underperforms advanced supervised methods, highlighting the challenges of emotion-related subjective tasks.
2) Prompting Strategies. While a simple CoT trigger sentence is ineffective for mental health analysis, ChatGPT with unsupervised emotion-enhanced CoT prompts achieves the best performance, showing the importance of prompt engineering in leveraging emotional cues for mental health analysis. Few-shot learning from expert-written examples also significantly improves model performance.
3) Explainability. ChatGPT can generate explanations approaching human quality for its classifications, indicating its potential to enhance the transparency of mental health analysis. The current best automatic evaluation metrics correlate only moderately with human evaluations, indicating the need for customized automatic evaluation methods in explainable mental health analysis. 4) Limitations. Despite its great potential, ChatGPT suffers from inaccurate reasoning and unstable predictions caused by its excessive sensitivity to minor alterations in prompts, inspiring future directions for improving ChatGPT and prompts. The unstable prediction problem can be mitigated by few-shot learning.
Our contributions can be summarized as follows: 1) We evaluate four representative LLMs on mental health analysis. 2) We investigate the effectiveness of prompting strategies, including CoT, emotion-enhanced prompts, and few-shot learning, for mental health analysis. 3) We explore LLMs for explainable mental health analysis and conduct human and automatic evaluations of LLM-generated explanations. 4) We create the first evaluation dataset with LLM-generated explanations rigorously assessed by domain experts, for examining and developing automatic evaluation metrics. 5) We analyze the potential and limitations of LLMs and different prompting strategies for mental health analysis.

Methodology
This section introduces the details of the evaluated LLMs and the different prompting strategies for improving LLMs' effectiveness and explainability in mental health analysis. Due to page limits, all evaluations, experiments, and analyses on emotional reasoning are presented in Appendix B. We also perform human evaluations on the quality of LLM-generated explanations and benchmark existing automatic evaluation metrics against the human evaluation results; an example is shown in Figure 1.

Large Language Models
We benchmark the following powerful LLMs for zero-shot mental health analysis: 1) LLaMA-7B/13B. LLaMA (Touvron et al., 2023) is a set of open-source LLMs developed by Meta AI, generatively pre-trained on entirely publicly available datasets. We test the zero-shot mental health analysis tasks on LLaMA models with 7 billion (LLaMA-7B) and 13 billion (LLaMA-13B) parameters. 2) InstructGPT-3. InstructGPT-3 (Ouyang et al., 2022) is a GPT-3-based LLM fine-tuned to follow instructions with human feedback. 3) ChatGPT. ChatGPT is an LLM built upon InstructGPT and further optimized for dialogue, which we access via the OpenAI API.

In-context Learning as Explainable Mental Health Analyzer
In-context learning (Brown et al., 2020) elicits the powerful ability of LLMs from information provided in the context, without explicit updates of model parameters. We instruct LLMs with task-specific instructions to trigger their ability as zero-shot analyzers for different mental health analysis tasks. We systematically explore three different prompting strategies for mental health analysis: straightforward zero-shot prompting with a natural language query, emotion-enhanced Chain-of-Thought (CoT) (Wei et al., 2022), and distantly supervised emotion-enhanced instructions.

Emotion-enhanced Prompts We design three emotion-enhanced prompting strategies to better instruct ChatGPT to conduct explainable mental health analysis: 1) Emotion-enhanced CoT prompting. We perform emotion infusion by designing unsupervised emotion-enhanced zero-shot CoT prompts, where the emotion-related part inspires the LLM to concentrate on the emotional clues in the post, and the CoT part guides the LLM to generate step-by-step explanations for its decision, improving the explainability of the LLM's predictions. For example, for the binary detection task, we modify the zero-shot prompt as follows: Post: "[Post]". Consider the emotions expressed from this post to answer the question: Is the poster likely to suffer from very severe [Condition]? Only return Yes or No, then explain your reasoning step by step. 2) Supervised emotion-enhanced prompting. In addition, we propose a distantly supervised emotion fusion method using sentiment and emotion lexicons. We utilize the VADER (Hutto and Gilbert, 2014) and NRC EmoLex (Mohammad and Turney, 2010, 2013) lexicons to assign a sentiment/emotion score to each post and convert the score to sentiment/emotion labels. We then design emotion-enhanced prompts by adding the sentiment/emotion labels at the proper positions of the zero-shot prompt (a code sketch of this construction follows the evaluation aspects below). 3) Few-shot emotion-enhanced prompting. We further evaluate the impact of few-shot examples on emotion-enhanced prompts. We invite domain experts (Ph.D. students majoring in quantitative psychology) to write one response example for each label class within a test set, where each response consists of a prediction and an explanation describing the rationale behind the decision. We then include these examples in the emotion-enhanced prompts to enable in-context learning. For example, for the binary detection task, we modify the original emotion-enhanced prompt to combine N expert-written explanations in a unified manner: You will be presented with a post.

Human Evaluation We assess the quality of LLM-generated explanations from four aspects: 1) Fluency: the coherence and readability of the explanation.
2) Reliability: the trustworthiness of the generated explanations to support the prediction results.
3) Completeness: how well the generated explanations cover all relevant aspects of the original post. 4) Overall: the general effectiveness of the generated explanation.
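Returning to the supervised emotion-enhanced prompting described above, the following is a minimal sketch of how a lexicon-derived label could be injected into the zero-shot prompt, assuming the vaderSentiment package; the sentiment thresholds and exact template wording are illustrative assumptions, not the paper's released configuration.

```python
# Sketch: building a distantly supervised emotion-enhanced prompt with VADER.
# Thresholds and template wording are illustrative assumptions.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def sentiment_label(post: str) -> str:
    # VADER's compound score lies in [-1, 1]; +/-0.05 are the
    # conventional cut-offs for positive/negative.
    compound = analyzer.polarity_scores(post)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

def emotion_enhanced_prompt(post: str, condition: str) -> str:
    # Inject the lexicon label into the zero-shot prompt template.
    label = sentiment_label(post)
    return (
        f'Post: "{post}". This post expresses {label} sentiment. '
        f"Consider the emotions expressed from this post to answer the "
        f"question: Is the poster likely to suffer from very severe "
        f"{condition}? Only return Yes or No, then explain your "
        f"reasoning step by step."
    )

print(emotion_enhanced_prompt("I can't sleep and nothing feels worth it.", "depression"))
```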
Each aspect is rated on four standards from 0 to 3, where higher ratings reflect more satisfactory performance and 3 denotes approaching human performance. Each LLM-generated explanation is assigned a score by 3 annotators for each aspect, followed by examination by 1 domain expert. All annotators are PhD students with high fluency in English. We evaluate 121 posts that are correctly classified by both ChatGPT (ChatGPT_true) and InstructGPT-3 to enable fair comparisons. 42 posts that are incorrectly classified by ChatGPT (ChatGPT_false) are also collected for error analysis and examination of the automatic evaluation metrics. We will release the annotated corpus to facilitate future research. Details of the criteria are described in Appendix E.
Automatic Evaluation Though human evaluations provide an accurate and comprehensive view of the quality of generated explanations, they require huge human effort, making them hard to extend to large-scale datasets. Therefore, we benchmark automatic evaluation metrics, originally developed for generation tasks such as text summarization, on our annotated corpus. We rely on the ability of the evaluation models to score the fluency, reliability, and completeness of the explanations. We select the following widely utilized metrics to automatically evaluate LLM-generated explanations: BLEU (Papineni et al., 2002), ROUGE-1, ROUGE-2, ROUGE-L (Lin, 2004), GPT3-Score (Fu et al., 2023) (davinci-003), and BART-Score (Yuan et al., 2021). We also use BERT-Score-based (Zhang et al., 2020) methods with different PLMs, including BERT and RoBERTa as well as the domain-specific PLMs MentalBERT and MentalRoBERTa (Ji et al., 2022b).
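For instance, the reference-based metrics can be computed with off-the-shelf packages; the snippet below is a sketch assuming the rouge-score and bert-score libraries, with toy strings in place of real explanations and references.

```python
# Sketch: scoring LLM-generated explanations against reference texts with
# two of the metrics benchmarked above. Package choices (rouge-score,
# bert-score) are assumptions; the paper does not specify implementations.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

explanations = ["The poster reports loss of interest and hopelessness."]
references = ["The post shows anhedonia and expressions of hopelessness."]

# ROUGE-1 / ROUGE-L F-measures.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(references[0], explanations[0])
print(rouge["rouge1"].fmeasure, rouge["rougeL"].fmeasure)

# BERT-Score; model_type could be swapped for a domain-specific PLM such
# as MentalRoBERTa, mirroring the variants compared in the paper.
P, R, F1 = bert_score(explanations, references, lang="en")
print(F1.mean().item())
```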

Experimental Settings
Mental Health Analysis Firstly, we introduce the benchmark datasets, baseline models, and automatic evaluation metrics for the classification results of mental health analysis.
Datasets. For binary mental health condition detection, we select two depression detection datasets, Depression_Reddit (DR) (Pirina and Çöltekin, 2018) and CLPsych15 (Coppersmith et al., 2015), and a stress detection dataset, Dreaddit (Turcan and McKeown, 2019). For multi-class mental health condition detection, we utilize the T-SID dataset (Ji et al., 2022a). For cause/factor detection of mental health conditions, we use a stress cause detection dataset, SAD (Mauriello et al., 2021), and a depression/suicide cause detection dataset, CAMS (Garg et al., 2022). More details of these datasets are presented in Table 8 in the appendix.
Metrics. We evaluate model performance using recall and weighted-F1 scores for all mental health datasets. Due to imbalanced classes in some datasets, such as DR, CLPsych15, and T-SID, we use weighted-F1 scores following previous methods. In addition, it is crucial to minimize false negatives, i.e., cases where the model fails to identify individuals with mental disorders; we therefore also report recall scores.
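A sketch of these metrics with scikit-learn, on toy labels (the actual evaluation runs over each dataset's test set):

```python
# Sketch: the classification metrics used for all mental health datasets,
# computed with scikit-learn on illustrative labels.
from sklearn.metrics import f1_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # gold labels (1 = condition present)
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions

# Weighted-F1 accounts for the class imbalance noted above.
print(f1_score(y_true, y_pred, average="weighted"))
# Recall on the positive class tracks false negatives, i.e. missed
# individuals with mental disorders.
print(recall_score(y_true, y_pred, pos_label=1))
```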
Evaluation for Explainability For the human evaluation results, we assess the quality of the annotations by calculating the inter-evaluator agreement with Fleiss' Kappa statistics (Fleiss et al., 2013) for each aspect. Any annotation with a majority vote is considered as reaching agreement. To compare the automatic evaluation methods, we also compute Pearson's correlation coefficients between the automatic and human evaluation results, where higher values reflect stronger linear correlation between the two sets of scores.
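Both statistics are available in standard Python packages; the following sketch, with illustrative ratings, assumes statsmodels and scipy.

```python
# Sketch: the agreement and correlation statistics used to validate the
# annotations; all data values are illustrative.
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Ratings of 5 explanations by 3 annotators on the 0-3 scale.
ratings = np.array([[3, 3, 2], [0, 1, 0], [2, 2, 2], [3, 2, 3], [1, 1, 0]])
table, _ = aggregate_raters(ratings)          # items x categories counts
print("Fleiss' kappa:", fleiss_kappa(table))

# Correlation between an automatic metric and aggregated human scores.
human = ratings.mean(axis=1)
metric = np.array([0.81, 0.22, 0.65, 0.88, 0.30])
r, p = pearsonr(metric, human)
print("Pearson r:", r)
```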

Results and Analysis
We conduct all LLaMA experiments on a single Nvidia Tesla A100 GPU with 80GB of memory. InstructGPT-3 and ChatGPT results are obtained via the OpenAI API. Each prompt is fed independently to avoid the effects of dialogue history.
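The querying setup can be sketched as follows, using the pre-1.0 openai Python client that was current at the time; the model name, temperature, and helper function are assumptions for illustration.

```python
# Sketch: querying ChatGPT so that each prompt is an independent request
# with no dialogue history, as described above.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def query_chatgpt(prompt: str) -> str:
    # A fresh messages list per call means no history leaks between posts.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

predictions = [query_chatgpt(p) for p in ["Post: ... Is the poster likely ..."]]
```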

Mental Health Analysis
The experimental results on mental health analysis are presented in Table 1. We first compare the zero-shot results of the LLMs to gain a direct view of their potential in mental health analysis, then analyze ChatGPT's performance with prompts enhanced by emotional information.
Zero-shot Prompting. Among the LLMs, ChatGPT significantly outperforms LLaMA-7B/13B and InstructGPT-3 on all datasets. LLaMA-7B_ZS displays random-guessing performance on multi-class detection (T-SID) and cause detection (SAD, CAMS), showing its inability to perform these more complex tasks. Despite an expanded model size, LLaMA-13B_ZS achieves no better performance than LLaMA-7B. Though trained with instruction tuning, InstructGPT-3_ZS still does not improve performance, possibly because the model size limits the LLM's learning ability. Compared with supervised methods, ChatGPT_ZS significantly outperforms traditional lightweight neural network methods such as CNN and GRU on binary detection and cause/factor detection, showing its potential in cause analysis for mental health-related texts. However, ChatGPT_ZS struggles to match fine-tuned methods such as MentalBERT and MentalRoBERTa. In particular, ChatGPT_ZS performs much worse than all baselines on T-SID. We notice that T-SID mostly collects short posts from Twitter with many usernames, hashtags, and slang words; the huge gap between these posts and ChatGPT's training data can make zero-shot detection difficult (Kocoń et al., 2023). Moreover, although zero-shot CoT prompting has proven effective on most NLP tasks (Zhong et al., 2023; Wei et al., 2022; Kojima et al., 2022), we surprisingly find that ChatGPT_CoT performs comparably to or even worse than ChatGPT_ZS, illustrating that a simple CoT trigger sentence is not effective for mental health analysis. Overall, ChatGPT significantly outperforms the other LLMs and exhibits some generalized ability for mental health analysis. However, it still underperforms fine-tuning-based methods, leaving a large gap for further exploring LLMs' mental health analysis ability.
Emotion-enhanced Prompting. We further test ChatGPT with emotion-enhanced prompts on all datasets. Firstly, with sentiment information from the VADER and NRC EmoLex lexicons, we notice that ChatGPT_V and ChatGPT_N_sen perform worse than ChatGPT_ZS on most datasets, showing that these prompts are not effective in enhancing model performance. A possible reason is that the coarse-grained sentiment classifications based on the two lexicons cannot describe the complex emotions expressed in the posts. Therefore, we incorporate fine-grained emotion labels from NRC EmoLex into the zero-shot prompt. The results show that ChatGPT_N_emo outperforms ChatGPT_N_sen on most datasets, especially on CAMS (a 7.89% improvement). However, ChatGPT_N_emo still underperforms ChatGPT_ZS on most datasets, possibly because lexicon-based emotion labels are still not accurate in representing the multiple emotions that co-exist in a post, especially in datasets with rich content such as CLPsych15 and DR. Therefore, we explore more flexible unsupervised emotion-enhanced prompts with CoT. As a result, ChatGPT_CoT_emo outperforms all other zero-shot methods on most datasets, which proves that emotion-enhanced CoT prompting is effective for mental health analysis. Finally, with few-shot expert-written examples, ChatGPT_CoT_emo_FS significantly outperforms all zero-shot methods on all datasets, especially on datasets with complex tasks: a 16.24% improvement on T-SID, 6.88% on SAD, and 3.7% on CAMS (approaching the state-of-the-art supervised method). These encouraging results show that in-context learning is effective in calibrating the LLM's decision boundaries for complex and subjective tasks in mental health analysis. We provide case studies in Appendix F.1.

Table 2: Fleiss' Kappa and other statistics of human evaluations on ChatGPT and InstructGPT-3 results. "Sample Num." and "Avg Token Num." denote the sample numbers and average token numbers of the posts. "Agreement" denotes the percentage of results that reached a final agreement with a majority vote across the three assignments.
Figure 2: Box plots of the aggregated human evaluation scores for each aspect. Orange lines denote median scores and green lines denote average scores.

Evaluation Results for Explainability
Human Evaluation In the above subsection, we showed that emotion-enhanced CoT prompts can enhance ChatGPT's zero-shot performance in mental health analysis. Moreover, they can prompt LLMs to provide an explanation of their step-by-step reasoning for each response. This can significantly improve the explainability of the predictions, which is a key advantage over most previous black-box methods. In this subsection, we present carefully designed human evaluations to gain a clear view of the explainability of the LLMs (ChatGPT and InstructGPT-3) on their detection results. The Fleiss' Kappa results and agreement percentages are presented in Table 2. We aggregate each score by averaging the assignments from three annotators, and the distributions are presented in Figure 2. Firstly, the three annotators reach high agreement: over 95% of ChatGPT evaluations and 89.9% of InstructGPT-3 results reach agreement. According to the widely utilized interpretation criterion (https://en.wikipedia.org/wiki/Fleiss%27_kappa), all Fleiss' Kappa statistics achieve at least fair agreement (≥0.21) and 10 out of 16 results reach at least moderate agreement (≥0.41). These outcomes confirm the quality of the human annotations.
As shown in Figure 2, ChatGPT_true almost achieves an average score of 3.0 in fluency and stably maintains outstanding performance, while InstructGPT-3 performs much worse in fluency, with a median score of 0 and an average score of less than 1.0. These results show that ChatGPT is a fluent explanation generator for mental health analysis. In reliability, ChatGPT_true achieves a median score of 3 and an average score of over 2.7, showing ChatGPT to be a trustworthy reasoner in supporting its classifications. Only a few of InstructGPT-3's explanations provide moderately reliable information, while most are unreliable. For completeness, ChatGPT_true obtains an average score of over 2.5, indicating that ChatGPT can cover most of the relevant content in the posts to explain its classifications, while InstructGPT-3 ignores key aspects, obtaining an average of less than 0.5. Overall, ChatGPT_true has an average score of over 2.5, proving that ChatGPT can generate human-level explanations for correct classifications regarding fluency, reliability, and completeness, and significantly outperforms previous LLMs such as InstructGPT-3. More cases are in Appendix F.2.

Automatic Evaluation
The automatic evaluation results on the ChatGPT explanations are presented in Table 3. On ChatGPT_true, BART-Score achieves the highest correlation scores on all aspects, showing its potential for performing human-like evaluations of explainable mental health analysis. Specifically, BART-Score outperforms all BERT-Score-based methods, which shows that generative models can be more beneficial in evaluating natural language texts. Unexpectedly, BART-Score also significantly outperforms GPT3-Score, a zero-shot evaluation method based on the powerful LLM GPT-3, in all aspects. These results show that task-specific pre-training is important to trigger a language model's ability on evaluation tasks. BART-Score is also fine-tuned on text summarization and paraphrasing tasks, which are crucial for assessing relevance and coherence, two important factors in providing satisfactory evaluations. However, on ChatGPT_false, BART-Score becomes less competitive and BERT-Score achieves the best performance.

Error Analysis
We further analyze some typical errors from our experiments to inspire future efforts to improve ChatGPT and emotion-enhanced prompts for mental health analysis.
Unstable Predictions. We notice that ChatGPT's performance on mental health analysis can vary drastically with the change of a few keywords in the prompt, especially on binary mental health condition detection. While keywords describing the tasks are easy to control, some other words, such as adjectives, are hard to optimize. For example, we replace the adjective describing the degree of the mental health condition in the zero-shot prompt for binary mental health detection: ...Is the poster likely to suffer from [Adjective of Degree] [Condition]?..., where the adjective (marked red) is replaced with one keyword from {any, some, very severe}; the results on three binary detection datasets are shown in Table 4. As shown, ChatGPT_ZS exhibits very unstable performance on all three datasets, with a high variance of 10.6 on DR, 17.62 on CLPsych15, and 89.29 on Dreaddit. There is also no globally optimal adjective, as the best adjective changes with the dataset. This sensitivity makes ChatGPT's performance very unstable even with slightly different prompts. We believe this problem stems from the subjective nature of mental health conditions: the human annotations only answer Yes/No for each post, which makes the human criteria for predictions hard for ChatGPT to learn in a zero-shot setting. To alleviate this problem, we further explore the effectiveness of few-shot prompts in these settings, where the same expert-written few-shot examples from Sec. 2.2 are included in the zero-shot prompts. As the results in Table 4 show, with few-shot prompts, ChatGPT achieves a variance of 1.34 on DR, 7.21 on CLPsych15, and 31.93 on Dreaddit, all significantly lower than those of the zero-shot prompts. These results prove that expert-written examples can stabilize ChatGPT's predictions, because they provide accurate references for the subjective mental health detection and cause detection tasks. The few-shot solution is also efficient, as it instructs the model in an in-context learning manner that does not require high-cost model fine-tuning.
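The sensitivity analysis amounts to re-running the same evaluation under each adjective variant and comparing the variance of the resulting scores; a sketch with illustrative numbers:

```python
# Sketch: quantifying prompt sensitivity by swapping the degree adjective
# and measuring the variance of weighted-F1 scores across variants.
import numpy as np

adjectives = ["any", "some", "very severe"]
template = ('Post: "{post}". Is the poster likely to suffer from '
            '{adj} {cond}? Only return Yes or No.')

# Weighted-F1 per adjective variant (illustrative numbers; in the paper
# these come from re-running the whole test set per variant).
scores = dict(zip(adjectives, [78.2, 83.5, 74.9]))

print(template.format(post="[Post]", adj="very severe", cond="depression"))
print("variance across prompt variants:", np.var(list(scores.values())))
```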
Inaccurate Reasoning. Though ChatGPT is proven capable of generating explanations for its classifications, many cases still show inaccurate reasoning leading to incorrect results. To investigate the contributing factors behind these mistakes, we further compare the human evaluation results between the correctly and incorrectly classified results, ChatGPT_true and ChatGPT_false. The results are presented in Figure 2. As shown, ChatGPT_false still achieves fluency scores comparable to ChatGPT_true but performs worse on both completeness and reliability. For completeness, the average score of ChatGPT_false drops below 2.0. We also notice that the average token number of ChatGPT_false reaches 335 (Table 2), exceeding ChatGPT_true by over 130 tokens. These results indicate that ChatGPT struggles to cover all relevant aspects of long-context posts. For reliability, more than half of the ChatGPT_false results give unreliable or inconsistent explanations (below 1.0), possibly due to a lack of mental health-related knowledge. A few ChatGPT_false samples provide mostly reliable reasoning (above 2.0) but miss key information due to a lack of completeness. Overall, ChatGPT's mistakes are mainly caused by ignoring relevant information in long posts and by an unreliable reasoning process. Therefore, future works should improve ChatGPT's long-context modeling ability and introduce more mental health-related knowledge to benefit its performance. Inaccurate reasoning also reflects a lack of alignment between LLMs and mental health analysis tasks. A possible solution is to fine-tune LLMs with mental health-related instruction-tuning datasets; we leave LLM fine-tuning as future work. More cases are provided in Appendix F.3.

Conclusion
In this work, we comprehensively studied LLMs on zero-shot/few-shot mental health analysis and the impact of different emotion-enhanced prompts. We explored the potential of LLMs in explainable mental health analysis by explaining their predictions via CoT prompting. We developed a reliable annotation protocol for human evaluations of LLM-generated explanations and benchmarked existing automatic evaluation metrics on the human annotations. Experiments demonstrated that mental health analysis is still challenging for LLMs, but emotional information with proper prompt engineering can better trigger their ability. Human evaluation results showed that ChatGPT can generate human-level explanations for its decisions, and current automatic evaluation metrics need further improvement to properly evaluate explainable mental health analysis. ChatGPT also bears limitations, including unstable predictions and inaccurate reasoning.
In future work, we will explore domain-specific fine-tuning for LLMs to alleviate inaccurate reasoning problems.We will also extend the interpretable settings with LLMs to other research domains.

Limitations
Unexpected Responses. Though ChatGPT makes predictions in most of its responses as requested by the prompts, there are a few cases where it refuses to make a classification. There are two main reasons: 1) a lack of evidence in the post to make a prediction; 2) the post contains content that violates the content policy of OpenAI. For example, ChatGPT can respond: "As an AI language model, I cannot accurately diagnose mental illnesses or predict what may have caused them in this post." In our experiments, we directly exclude these responses because they are very rare, but future efforts are needed to alleviate these problems.
Limitations of Lexicons. The motivation for using sentiment and emotion lexicons is to provide additional context with distant supervision for the prompts, which, however, has several limitations. The two lexicons we used, VADER (Hutto and Gilbert, 2014) and NRC EmoLex (Mohammad and Turney, 2010, 2013), were developed a decade ago with human annotation of social media data. They inevitably suffer from annotation bias in the sentiment/emotion scores and only reflect the language in use when they were developed, while Internet language evolves rapidly and our experiments use some recent datasets such as T-SID (Ji et al., 2022a) and CAMS (Garg et al., 2022). Besides, these lexicons have limited vocabularies, and the manual rules used to aggregate sentence-level sentiment and emotions could be underspecified. Prompt engineering with other advanced resources carrying extra emotional information can be explored in future work. We also see a limitation of the dataset: Ji (2022) showed that the sentiment distribution has no significant difference in the binary case of the T-SID dataset. Although the sentiment-enhanced prompt with VADER gains slightly better performance than other prompts on T-SID, we cannot clearly explain whether the choice of lexicon contributes to the improvement, due to the black-box nature of ChatGPT.

Ethical Considerations
Although the datasets used are posted anonymously, our study adheres to strict privacy protocols (Benton et al., 2017; Nicholas et al., 2020) and minimizes privacy impact as much as possible, as social media datasets can reveal posters' thoughts and may contain sensitive personal information. We use social posts that are manifestly public from Reddit and Twitter. The SMS-like SAD dataset (Mauriello et al., 2021) has been released publicly on GitHub by the authors. All examples presented in our paper have been paraphrased and obfuscated using the moderate disguising scheme (Bruckman, 2002) to avoid misuse. We also do not use user profiles on social media, identify the users, or interact with them. Our study aims to use social media as an early source of information to assist researchers or clinical practitioners in detecting mental health conditions for non-clinical use. The model predictions cannot replace psychiatric diagnoses. In addition, we recognize that some mental disorders are subjective (Keilp et al., 2012), and the interpretation of our analysis may differ (Puschman, 2017) because we do not understand the actual intentions of the posts.

Moreover, there have also been efforts incorporating multi-modal information, such as voice, video, visual, and text signals, to improve the performance of depression detection. Rodrigues Makiuchi et al. (2019) proposed a multi-modal method for depression detection, which incorporates speech and textual information with a gated convolutional neural network (gated CNN) and contextual features from BERT. Lin et al. (2020) proposed a visual-textual multi-modal learning method based on CNN and BERT for depression detection on social media. Toto et al. (2021) proposed the multi-modal method Audio-Assisted BERT (AudiBERT) for depression classification, which integrates pre-trained audio embeddings with text embeddings from the BERT encoder.

A.2 Large Language Models for Mental Health Analysis
Most recently, many efforts have evaluated the performance of LLMs such as ChatGPT and GPT-4 on various NLP tasks (Bang et al., 2023; Qin et al., 2023), such as machine translation (Jiao et al., 2023), text generation and evaluation (Benoit, 2023; Luo et al., 2023), and language inference (Zhong et al., 2023). They have inspired efforts to explore the ability of LLMs for mental health analysis. Lamichhane (2023) evaluated the performance of ChatGPT on three mental health classification tasks, including stress, depression, and suicidality detection, and demonstrated the good potential of ChatGPT for mental health applications. Amin et al. (2023) further evaluated the capabilities of ChatGPT on big-five personality detection, sentiment analysis, and suicide detection. They showed that ChatGPT has better performance in sentiment analysis, comparable performance in suicide detection, and worse performance in personality detection when compared with RoBERTa-based and word-embedding-based supervised methods. There are also works analyzing the emotional reasoning ability of ChatGPT (Qin et al., 2023; Zhong et al., 2023; Kocoń et al., 2023; Chen et al., 2023) on the sentiment classification task, where ChatGPT achieves comparable or worse performance than fine-tuning-based methods built on PLMs. Ye et al. (2023) compared the performance of different LLMs, including the GPT-3 series (davinci and text-davinci-001) and the GPT-3.5 series (code-davinci-002, text-davinci-002, text-davinci-003, and gpt-3.5-turbo), on aspect-based sentiment analysis, where code-davinci-002 has the best zero-shot performance. However, most of these works only cover simple binary sentiment classification tasks or a few binary mental health detection tasks, leaving a huge gap in comprehensively exploring the ability of LLMs on emotion-aware mental health analysis.

B ChatGPT for Emotional Reasoning
Tasks We evaluate the emotional reasoning ability of ChatGPT in complex scenarios on two widely studied tasks: emotion recognition in conversations (ERC) and causal emotion entailment (CEE). ERC aims to recognize the emotion of each utterance within a conversation from a fixed emotion category set, and is often modeled as a multi-class text classification task (Poria et al., 2019b). Given an utterance with a non-neutral emotion, CEE aims to identify the causal utterances for this emotion in the previous conversation history; it is usually modeled as a binary classification between each candidate utterance and the target utterance.
Prompts We perform direct guidance to explore the ability of ChatGPT on both tasks, designing zero-shot prompts that directly ask for a classification result in ChatGPT's response. Details of the designed prompts are presented in Appendix C.1.
For ERC, we select the IEMOCAP, MELD, EmoryNLP, and DailyDialog datasets. For CEE, we select the RECCON dataset (Poria et al., 2021). More information about these datasets is listed in Table 7 in the appendix.
Metrics We use weighted-F1 as the evaluation metric for the IEMOCAP, MELD, and EmoryNLP datasets. Since neutral utterances dominate DailyDialog, we use micro-F1 for this dataset and ignore the neutral label when calculating the results, as in previous works (Shen et al., 2021b; Xie et al., 2021; Yang et al., 2023). For RECCON, we report the F1 scores of both negative and positive causal pairs, and the macro-F1 score overall.
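The DailyDialog convention (micro-F1 excluding neutral) can be reproduced with scikit-learn by restricting the label set, as in this sketch with toy labels:

```python
# Sketch: micro-F1 on DailyDialog excluding the dominant neutral label,
# following the evaluation convention described above.
from sklearn.metrics import f1_score

labels = ["neutral", "joy", "anger", "sadness", "surprise", "fear", "disgust"]
y_true = ["joy", "neutral", "anger", "sadness", "joy"]
y_pred = ["joy", "joy", "anger", "neutral", "sadness"]

# Passing the non-neutral labels restricts the micro average to them.
non_neutral = [l for l in labels if l != "neutral"]
print(f1_score(y_true, y_pred, labels=non_neutral, average="micro"))
```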

ERC Results
The experimental results on the ERC task are presented in Table 5.
ChatGPT's generalizability can make up for the lack of task-specific model architectures to some extent. On the MELD dataset, ChatGPT_ZS achieves a weighted-F1 score of 61.18%, outperforming some strong supervised methods, including the fine-tuned BERT-Base model (by 4.97%) and the knowledge-infusion method KET (by 3.0%). However, the zero-shot performance of ChatGPT is still worse than advanced supervised methods on all datasets, and it struggles to achieve dominant performance on emotion-related tasks. This is because these tasks are very subjective, even for humans, pointing to few-shot prompting and knowledge infusion as promising directions to improve ChatGPT's performance on such subjective tasks.

CEE Results
The experimental results on the CEE task are presented in Table 6. We observe that RankCP achieves the highest negative F1 score but performs poorly on the positive F1 score, which is more indicative in evaluating emotion cause detection ability. ChatGPT_ZS significantly outperforms RankCP on the positive F1 score, showing that it possesses some ability to understand emotional causes. However, its performance is still much lower than advanced supervised methods such as KEC and KBCIN on all metrics, which incorporate effective information such as social commonsense knowledge. Quantitatively, ChatGPT_ZS still trails the SOTA method KBCIN by 19.86% in macro-F1 score.
In conclusion, the experiments on the ERC and CEE tasks show that ChatGPT holds emotional reasoning ability in complex contexts comparable to some traditional methods such as CNN and cLSTM, but still clearly underperforms competitive task-specific information-infusion and fine-tuning methods. This indicates the necessity of future efforts to enhance prompting strategies and leverage external knowledge to better trigger the emotional reasoning ability of ChatGPT. These results also motivate us to design emotion-enhanced prompts to aid mental health analysis.

C.1 Emotional Reasoning
The prompt for ERC is designed as follows: Context: "[Previous Dialogue]". Consider this context to assign one emotion label to this utterance "[Target]". Only from this emotion list: [Emotion List]. Only return the assigned word. The slots marked blue are the required inputs.
[Previous Dialogue] denotes the previous dialogue history of the target utterance, where each utterance is prepended with its speaker and concatenated in sequence order. [Target] denotes the target utterance, and [Emotion List] denotes the predefined emotion category set of the corresponding dataset, as listed in Table 7.
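A sketch of how this template could be assembled from a conversation, with speaker-prefixed utterances concatenated in sequence order; the helper name and example dialogue are illustrative:

```python
# Sketch: assembling the zero-shot ERC prompt described above; exact
# strings beyond the quoted template are assumptions.
def erc_prompt(history, target, emotion_list):
    # history: list of (speaker, utterance) pairs preceding the target.
    context = " ".join(f"{spk}: {utt}" for spk, utt in history)
    return (
        f'Context: "{context}". Consider this context to assign one '
        f'emotion label to this utterance "{target}". Only from this '
        f"emotion list: {', '.join(emotion_list)}. "
        f"Only return the assigned word."
    )

print(erc_prompt([("A", "I lost my job today."), ("B", "Oh no, what happened?")],
                 "They shut the whole branch down.",
                 ["neutral", "joy", "sadness", "anger"]))
```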
For the ERC task, baseline models include:
• KI-Net (Xie et al., 2021) infused both commonsense and sentiment-lexicon knowledge to enhance XLNet. A self-matching module was proposed to allow interactions between utterance and knowledge representations.
• SCCL (Yang et al., 2023) proposed supervised cluster-level contrastive learning (SCCL) to infuse Valence-Arousal-Dominance information. Pre-trained knowledge adapters are leveraged to incorporate linguistic and factual knowledge.
For the CEE task, we use the following baseline models:
• RankCP (Wei et al., 2020) ranked the clause-pair candidates in the context and utilized a neural network to perform entailment classification with context-aware utterance representations.
• RoBERTa-Base/Large (Liu et al., 2019) concatenated the conversation with the emotion label of each utterance as input to the PLM RoBERTa (both RoBERTa-Base and RoBERTa-Large are used). CEE was then modeled as a binary classification problem for each utterance pair.
• KEC (Li et al., 2022) utilized directed acyclic graph (DAG) networks incorporating social commonsense knowledge (SCK) to improve causal reasoning ability.
• KBCIN (Zhao et al., 2022) proposed the knowledge-bridged causal interaction network (KBCIN) with a conversational graph and emotional and actional interaction modules to capture the context dependencies of conversations and perform emotional cause reasoning.

D.2 Mental Health Analysis
We compare the performance of ChatGPT with the following baselines for mental health analysis:
• CNN (Kim, 2014) used a three-channel CNN with filter sizes of 2, 3, and 4 to classify the post.
• GRU (Cho et al., 2014) used a two-layer GRU to encode the post.
• BiLSTM_Att (Zhou et al., 2016) utilized a bidirectional LSTM with an attention mechanism as the context-encoding layer to capture the contextual information of posts.
• fastText (Joulin et al., 2017) used an open-source and efficient text classifier based on bag-of-n-grams features.
• BERT/RoBERTa (Devlin et al., 2019; Liu et al., 2019) utilized the PLMs BERT and RoBERTa to model the post and fine-tuned them for classification.

E Human Evaluation Criteria
Annotators are given the generated explanations from ChatGPT and InstructGPT-3, together with the original post as the reference. Annotators score and annotate the generated explanations on the following aspects:

Fluency Fluency evaluates the coherence and readability of the explanation. Annotators should assess whether the generated explanation is well-structured, easy to read, and free of grammatical or syntax errors.
• 0: Incoherent, difficult to read, and contains numerous errors
• 1: Somewhat coherent, but with poor readability and several errors
• 2: Mostly fluent, easy to read, with few minor errors
• 3: Completely fluent, coherent, and error-free

Reliability Reliability measures the trustworthiness of the generated explanations in supporting the detection results. Annotators should assess whether the explanation is based on facts and whether it contains misinformation or flawed reasoning with respect to the given post. Main symptoms to check (sorted by criticality):
• Suicide ideation expressions (gold standard).
• Long-term low passion (e.g., loss of interest in previous hobbies).
• Loss of appetite and sleep disorders.
The domain experts also consult other scales describing depressive symptoms, such as the Patient Health Questionnaire (PHQ-9). The annotation scheme is as follows:
• 0: Unreliable or inconsistent information
• 1: Somewhat reliable information with some inconsistencies
• 2: Mostly reliable information with few inconsistencies
• 3: Completely reliable information

Completeness Completeness measures how well the generated explanations cover all relevant aspects of the original post. Annotators should assess whether the explanation provides sufficient context and detail, without omitting important information such as emotional cues from the original post.
• 0: Omits significant information from the original post
• 1: Partially complete with some omissions
• 2: Mostly complete with minor omissions
• 3: Complete coverage of the original post

Overall Score Overall performance measures the general effectiveness of the generated explanation, taking into account the combined scores for fluency, reliability, and completeness.

Figure 1: The pipeline for obtaining and evaluating LLM-generated explanations for mental health analysis. In LLM responses, red, green, and blue words are marked as relevant clues for rating fluency, reliability, and completeness in human evaluations.

Table 3: Pearson's correlation coefficients between human evaluation and existing automatic evaluation results on ChatGPT explanations. Best values are highlighted in bold.

Table 5: Test results on the ERC task. ChatGPT_ZS denotes the method using the zero-shot prompt. The results of some baseline methods are referenced from (Zhong et al., 2019; Song et al., 2022).

Table 6: Test results on the CEE task. Best values are in bold. The results of baseline methods are referenced from Zhao et al. (2022).

Table 7: A summary of datasets for emotional reasoning in conversations. Conv. and Utter. denote conversation and utterance numbers. Data statistics are for the test sets.

Table 8: A summary of datasets for mental health tasks. Note that we test zero-shot performance on the test sets.