PsyQA: A Chinese Dataset for Generating Long Counseling Text for Mental Health Support

AI services that provide mental health support have attracted great research interest. However, the lack of corpora is a main obstacle to this research, particularly in the Chinese language. In this paper, we propose PsyQA, a Chinese dataset of psychological health support in the form of question-and-answer pairs. PsyQA is crawled from a Chinese mental health service platform and contains 22K questions with 56K long, well-structured answers. Based on psychological counseling theories, we annotate a portion of the answer texts with typical strategies for providing support, and further present an in-depth analysis of both the lexical features and the strategy patterns in the counseling answers. We also evaluate the performance of generating counseling answers with generative pretrained models. Results show that utilizing strategies enhances the fluency and helpfulness of generated answers, but there is still a large space for future research.


Introduction
The burden of mental disorders continues to grow, with significant impacts on human health and social development (World Health Organization, 2011; The World Health Organization, 2020). As an effective therapy for mental disorders (Reynolds Jr et al., 2013), online mental health counseling, which mostly refers to communicating anonymously, has become popular in recent years (Fu et al., 2020).
Great research efforts have been devoted to devising AI services that are able to provide mental health support (Bucci et al., 2019; Liu et al., 2021). Based on online-text psychotherapy corpora, previous works have utilized text mining techniques to detect empathy (Sharma et al., 2020; Zheng et al., 2021), the linguistic development of counselors (Zhang et al., 2019), and self-injurious thoughts and behaviors (Franz et al., 2020). However, research on text-based mental health counseling is still largely limited by the lack of relevant corpora, particularly in the Chinese language.

* Equal contribution. † Corresponding author.
To this end, we collect PsyQA in this work, a Chinese dataset of Psychological health support in the form of Question-Answer pairs. An example from PsyQA is shown in Figure 1. In each example, the question, along with a detailed description and several keyword tags, is posted by an anonymous help-seeker, where the description generally contains dense persona and emotion information about the help-seeker. The answer is usually quite long (524 words on average). Answers are written asynchronously by well-trained volunteers or professional counselors, and contain both a detailed analysis of the seeker's problem and guidance for the seeker. Moreover, a portion of the answers are additionally annotated by professional workers with typical support strategies, based on psychological counseling theories (Hill, 2009).
Our collected PsyQA has three distinct characteristics. Firstly, the corpus covers abundant mental health topics from 9 categories including emotion, relationships, and so on (refer to Appendix for topic statistics). Secondly, the answers in PsyQA are mostly provided by experienced and well-trained volunteers or professional counselors. Thirdly, we provide support strategy annotations for a portion of answers, which can greatly facilitate future research on our corpus. As will be shown later, there are not only lexical features in the texts of different support strategies (Section 4.2), but also explicit patterns of strategy organization and utilization in the answers (Section 4.3).
To validate whether existing models can generate long counseling answers to mental health questions, we conduct experiments on both strategy identification (Section 5) and answer generation (Section 6). We find that contextual information greatly benefits the performance of support strategy identification. Experimental results also demonstrate that utilizing support strategies improves the answers generated by the models in terms of language fluency, coherence, and the ability to stay on-topic and be helpful. However, there is still much room for further research compared to the answers written by well-trained volunteers or professional counselors.

Figure 1: An example from PsyQA (translated from Chinese). Question: "The more I think about some things, the more upset I feel. Why?" Description: "For example, my family asked me to go on a blind date with others. I am not an active but passive person. ... I blame myself and blame others for speaking ill of me everywhere, making me very uncomfortable and depressed. What should I do?" Keywords: emotion, expressing emotions, emotion regulation, emotional intelligence. Answer: "Hi~ The more you think about it, the more depressed you feel. This may be because you are trapped in ruminative thinking. Ruminative thinking means that ... Ruminative thinking, as a form of cognition, also has an important effect on emotion. In this case, you need to calm down first ... For example, you stayed at home and didn't go out for a blind date, and your family said that you just wanted to be single. When you look at it carefully, there is no causal relationship between the two events. ... But this logic doesn't work. Of course, in this case, you can also distract your attention to calm yourself down a bit. Take a meditation practice, or go outside to exercise." (Question, Description, Keywords) triples are posted by help-seekers while the Answer is provided by help-supporters. Different strategies in the answer are colored differently; the strategies Information, Interpretation, Restatement, and Direct Guidance are used in this answer. Note that a question may have multiple answers.
Our contributions are summarized as follows: • We collect PsyQA, a high-quality Chinese dataset of psychological health support in the form of QA pairs. The answers in PsyQA are usually long and are provided by well-trained volunteers or professional counselors.
• We annotate a portion of answer texts with a set of strategies for mental health support based on psychological counseling theories. Our analysis reveals that there are not only typical lexical features in the texts of different strategies, but also explicit patterns of strategy organization and utilization in the answers.
• We conduct experiments of both strategy identification and answer text generation on PsyQA. Results demonstrate the importance of using support strategies, meanwhile indicating a large space for future research.

Related Work
Our work primarily concerns linguistic behavior for counseling, NLP for mental health detection and therapy, and text-based mental health-related datasets.

Linguistic Behaviors in Counseling
Hill's model (Hill, 2009) consists of three stages: exploration, insight, and action, in which helpers guide clients in exploring their thoughts and feelings, discovering the origins and consequences of maladaptive thoughts and behaviors, and acting on those discoveries to create positive long-term change. We draw on Hill's model and apply it to formulate the answers in the PsyQA dataset. Some previous work explored how mental health support is sought and provided. For example, some studies measure how the language of comments in Reddit mental health communities influences future risk of suicidal ideation (De Choudhury and Kiciman, 2017), and others seek to understand how counselors' behaviors develop over time (Zhang et al., 2019). While these previous studies model implicit linguistic behaviors of counselors, we focus on linguistic strategy development in a long psychological response, treating the strategies as a skeleton for generating the whole response.

NLP for Mental Health Detection and Therapy
With the rise of social networking sites (SNS), some prior work analyzed users' posts and blogs, employing NLP techniques to detect depression (Tadesse et al., 2019; Yates et al., 2017), suicidal ideation (Zirikly et al., 2019; Cao et al., 2019), and other general mental health problems. In another line of work, researchers endeavored to construct "therapybots" (Fitzpatrick et al., 2017; Inkster et al., 2018), creating dialogue agents intended to provide therapeutic benefit and exploring the effectiveness of web-based cognitive-behavioral therapy (CBT) apps or mobile mental well-being apps. Adopting a more straightforward approach, we make the machine generate answers to a detailed question, mimicking a mental health counselor. Though the ultimate goal is to develop systems for real-world treatment, there is still a long way to go in this direction; our corpus can be a first step towards building intelligent systems for this purpose, and it offers the opportunity to study the effectiveness of using explicit strategies in such systems.

Text-based Mental Health-Related Datasets
There are some datasets for mental health detection and therapy. However, most of them are collected from general social networking sites such as Twitter, Reddit, and Weibo (Harrigian et al., 2020). General social networking sites contain irrelevant posts and unprofessional responses, which might put NLP systems trained on these corpora at huge risk. Thus, some previous work focused on the counseling part of online mental health communities (forums), such as TeenHelp (Franz et al., 2020) and TalkLife (Sharma et al., 2020). In the Chinese domain, Wang et al. (2020) collected a public counseling conversation dataset by crawling; however, most responses in this dataset are short and general, without any suggestions. Crisis Text Line (Althoff et al., 2016) is perhaps the most notable mental health counseling dataset to date, containing large-scale multi-turn counseling conversations by experienced volunteer counselors 1 . Different from Crisis Text Line, PsyQA focuses on Chinese long-text responses in single-turn asynchronous counseling conversations. From the perspective of mental health domains, most prior work focuses on a single domain such as depression, suicidal ideation, or eating disorders (Harrigian et al., 2020). Instead, PsyQA covers general mental health problems across nine topics labeled by help-seekers: self-growth, emotion, love problem, relationships, behaviors, family, treatment, marriage, and career.

Data Source
Our dataset is crawled from the Q&A column of Yixinli (xinli001.com/qa). Yixinli is a Chinese mental health service platform with about 22 million users and over six hundred professional counselors. In its Q&A column, anonymous users post questions about their daily-life worries, and well-trained volunteers or professional counselors answer them with detailed analysis and guidance in the form of organized long texts. More than 0.25 million Q&A pairs are on this platform, with abundant topics ranging from personal development and relationships to mental illnesses. To avoid potential ethical risks and ensure the quality of the data, Yixinli manually reviews and blocks unsafe content. We calculate that in our dataset, the help-supporters have answered over 250 questions on average. Besides, 8% of the answers are from help-supporters who are State-Certificated Class 2 Psychological Counselors, and 35% are from volunteers hired by Yixinli.

Data Cleaning
We removed personal information, duplicate line breaks, emojis, website links, and advertisements by rule-based filtering. Besides, to ensure higher quality, only answers with more than 100 words were retained. It is inevitable that some unrelated posts exist on the raw website. To remove such posts, we filtered out questions that are not actually seeking mental health support, based on the keywords (topics) given by the poster, such as questions that ask about the meaning of a psychological term (keyword: popular science) or discuss the latest news (keyword: hot news).

[Table 1 excerpt — only the following rows are recoverable here:]

Direct Guidance (top words): 建议/advice (9), 尝试/try (8), 学会/learn (6), 找/find (5), 沟通/communicate (5)

Approval and Reassurance: Emotional support, reassurance, encouragement and reinforcement.

Self-disclosure: Reveal something personal about the helper's non-immediate experiences or feelings. Example: 这个问题勾起了我类似的回忆。/ This question brings back to me some similar memories.

Table 1: The definition and example of different strategies in our guideline, together with the lexical features of the strategies in our annotated dataset. The rightmost column displays the top 5 words associated with each strategy. The rounded z-scored log odds ratios are in parentheses. A word may appear in multiple parts of speech. For example, "warm" in Chinese can be either an adjective or a verb.
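The rule-based filtering described above can be sketched as follows. This is a minimal illustration rather than the exact pipeline; the patterns and the `clean_answer` helper are our own simplifications (e.g., the word count is approximated by character count for Chinese):

```python
import re

def clean_answer(text, min_words=100):
    """Rule-based cleaning sketch: strip URLs and emojis, collapse
    duplicate line breaks, then drop answers shorter than min_words.
    (For Chinese text, word count is approximated by character count.)"""
    text = re.sub(r"https?://\S+", "", text)             # website links
    text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)  # common emoji range
    text = re.sub(r"\n{2,}", "\n", text)                 # duplicate line breaks
    text = text.strip()
    return text if len(text) >= min_words else None
```

In practice, keyword-based question filtering (e.g., dropping posts tagged "popular science" or "hot news") would run alongside this answer-level cleaning.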

Strategy Annotation
We analyzed multiple high-quality answers in our corpus and found that the strategies employed by the help-supporters are consistent with the Helping Skills System (HSS) (Hill, 2009). Moreover, we observed that the strategy sequence patterns are similar to some degree. Thus, we assume that a whole answer is realized through an organized strategy sequence, which may reveal the common layout of high-quality responses from mental health counselors. To facilitate further research on strategies in text-based mental health support, we present below the process by which we annotated the answers with span-level strategies.
Hill (2009) provides a taxonomy of language helping skills, or strategies, for mental health counselors. We chose a subset of strategies according to the general online counseling situation, which also corresponds to the guideline for help-supporters on the Yixinli website. Table 1 shows the list of our chosen strategies with their definitions and examples.
We randomly sampled 4,012 questions (about 17.9%) in our dataset and picked their highest-voted answers (similar to Quora, quora.com). Then we recruited and trained 9 workers to annotate the answers following our guideline. 2 We leveraged Doccano 3 , an open-source text annotation tool, for the workers to annotate the text. In each task, the workers were shown a Q&A pair and asked to label one or more consecutive sentences (a text span) with a strategy. The workers were allowed to ignore sentences that did not match the definition of any strategy, which would be automatically labeled as Others.

Annotation Quality Control
The workers were required to read the guideline and the provided annotated examples before annotation. To verify the effectiveness of training, we asked them to annotate 100 examples before formal annotation, which were revised by psychology professionals for feedback. We repeated this process until the workers were able to annotate the cases almost correctly. After annotation, to check the quality of labels, we randomly sampled 200 annotated Q&A pairs, gave them to 2 examiners (both graduate students of Clinical Psychology) to pick out incorrect labels, and calculated the consistency proportion. Results are shown in Table 2. More than 98% of the strategy labels are consistent with at least one examiner, indicating the reliability of our annotation.

Dataset Statistics

Table 3 shows the statistics of our dataset. The long answer text is a distinct feature of our dataset, and the annotated answers are even longer. There is also a wide variety of strategies in the answers (6.66 strategies and 3.65 distinct ones per answer on average), and we further analyze the patterns of strategy utilization in Section 4.3. Note that our dataset covers 9 broad topics (e.g., self-growth, emotion) and a wide range of subtopics (e.g., personality improvement, emotion regulation) 4 , from which the seekers can choose as the question keywords. Table 4 shows the number and the average length of the annotated spans of each strategy. As we can see, Interpretation and Direct Guidance are the most commonly used strategies. In contrast, Information and Self-disclosure are relatively rare, since they additionally require external knowledge and personal background. We also note that the average span lengths of Interpretation, Information, and Self-disclosure are remarkably longer than those of other strategies.

Lexical Feature Analysis

We extracted the lexical correlates of each strategy by calculating the log odds ratio with an informative Dirichlet prior (Monroe et al., 2008) for all the words of each strategy, contrasting with all other strategies.
We tokenized the text into words using Jieba 5 , and removed conjunctions, prepositions, and numerals. The top-5 words associated with each strategy are shown in Table 1. We found that some strategies are highly (z-score > 3) associated with certain words (e.g., Appro.& Reass. with 'hug', Guidance with 'advice'). In contrast, words associated with Information and Self-disclosure are less typical and unique. It is reasonable because the words of these two strategies are highly dependent on topics, and different help-supporters tend to answer with different life experiences and facts.
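The lexical-correlate computation follows Monroe et al. (2008). A minimal sketch of the z-scored log odds ratio with an informative Dirichlet prior might look like the following (the function name and the prior construction are illustrative, not the paper's exact code):

```python
from math import log, sqrt

def zscored_log_odds(counts_i, counts_j, prior):
    """Monroe et al. (2008): log-odds ratio with an informative Dirichlet
    prior, returning a z-score per word for corpus i contrasted with j.
    counts_i / counts_j / prior map words to counts."""
    n_i, n_j = sum(counts_i.values()), sum(counts_j.values())
    a0 = sum(prior.values())
    z = {}
    for w in set(counts_i) | set(counts_j):
        aw = prior.get(w, 0.01)  # small floor for words missing from the prior
        yi, yj = counts_i.get(w, 0), counts_j.get(w, 0)
        # difference of log odds, each smoothed by the prior
        delta = (log((yi + aw) / (n_i + a0 - yi - aw))
                 - log((yj + aw) / (n_j + a0 - yj - aw)))
        var = 1.0 / (yi + aw) + 1.0 / (yj + aw)  # approximate variance
        z[w] = delta / sqrt(var)
    return z
```

Words with z-scores above 3, as reported in Table 1, would be the ones most strongly associated with a strategy.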

Strategy Sequence Analysis
Cumulative Distribution of Strategies Figure 2 displays the cumulative distribution of the relative positions at which strategies occur in the answers. There is an obvious discrepancy between the relative distributions of different strategies. To better observe these distributions, we evenly divide the answer content into three stages (beginning, middle, and ending), since we observe from our data that most answers have different functions and characteristics across these three parts. For instance, Restatement appears mainly in the beginning stage of an answer, showing that help-supporters first focus on the content of the question. Direct Guidance is generally in the ending stage, and Appro.& Reass. at both ends, which is consistent with our observation that supporters usually comfort seekers at the beginning, while providing guidance or encouragement later. Information, Self-disclosure, and Interpretation are almost evenly distributed in the answer text; compared to other strategies, they form the major content of the middle stage, where help-supporters examine help-seekers' problems (e.g., inappropriate behaviors) from an overall perspective and thus tend to give analyses (Interpretation) and suggestions (Direct Guidance). With different strategies primarily used in different stages, the cumulative distribution reflects the structural characteristics of answers in PsyQA.

Strategy Transition To provide more insight into strategy utilization, we use a Sankey diagram to visualize the strategy transitions. Figure 3 plots the most common strategy flow patterns within the first 5 strategies. A number of patterns are evident in the visualization. A&R→Intpn.→Guid.→Intpn.→Guid. is the most common strategy sequence, accounting for 5.6% of all first-5 strategy sequences. This shows that most professional help-supporters follow particular strategy patterns to structure and organize their responses.
Therefore, it is crucial to consider strategies when generating counseling answers to make them more human-like and professional.
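The three-stage bucketing used in this analysis can be sketched as follows (a minimal illustration over gold strategy sequences; the function name is ours):

```python
from collections import defaultdict

def stage_distribution(strategy_seqs):
    """Bucket each strategy occurrence into the beginning/middle/ending
    third of its answer, by relative position in the strategy sequence.
    Returns {strategy: [beginning_count, middle_count, ending_count]}."""
    dist = defaultdict(lambda: [0, 0, 0])
    for seq in strategy_seqs:          # seq: ordered strategy labels of one answer
        for idx, strat in enumerate(seq):
            rel = idx / len(seq)       # relative position in [0, 1)
            stage = 0 if rel < 1 / 3 else (1 if rel < 2 / 3 else 2)
            dist[strat][stage] += 1
    return dict(dist)
```

Aggregated over the annotated answers, such counts would reproduce the stage preferences described above (e.g., Restatement concentrated in the beginning stage).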

Strategy Identification
We present a strong sentence-level strategy identification model based on RoBERTa (Liu et al., 2019) for PsyQA. This task requires assigning a strategy label to each sentence in a long answer. We compare classifier performance with and without contextual information.

Data Preparation
We choose the annotated part of PsyQA and randomly split them into train (80%), dev (10%) and test (10%) sets. We split each long answer into sentences for sentence-level training.

Model Architecture
We use a Chinese RoBERTa base-version with 12 layers 6 for our experiments. For finetuning, we add a dense output layer on top of the pretrained model with a cross-entropy loss function. For the model with contextual information, we input multiple consecutive sentences S1, S2, S3, ... to RoBERTa in the form of [CLS]S1[SEP][CLS]S2[SEP][CLS]S3... and compute the mean loss over the [CLS] token at the head of each sentence. For the baseline model without contextual information, we input one sentence into RoBERTa and predict one sentence at a time.

Experimental Results

Table 5 summarizes the performance of both models on the test set. By adding contextual information, the classifier handles the sample imbalance problem much better and achieves a significantly higher macro F1-score. We found that the overall performance is primarily limited by 2 strategies: Restatement (F1-score: 49.38%) and Information (F1-score: 54.68%); refer to Appendix B for per-strategy classification results. This is reasonable because (a) we did not add the question into the input (due to the maximum context length of RoBERTa), which could help identify Restatement, and (b) extra psychological knowledge is needed to identify Information. Based on these observations, a possible next step would be making use of the question content or extra psychological knowledge to improve classification accuracy. We conclude that contextual information captures the inherent structure of the strategy sequence, so the model recognizes strategy patterns and performs better. Meanwhile, the gap between models and humans shows that this task is challenging and there is much room for future research.
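The multi-sentence input scheme described above can be sketched as follows (a minimal illustration; in the real model the resulting tokens are converted to ids, fed to RoBERTa, and the cross-entropy loss is averaged over the recorded [CLS] positions):

```python
def build_contextual_input(sentences, max_len=512):
    """Pack consecutive sentences as [CLS]S1[SEP][CLS]S2[SEP]...,
    recording the position of each sentence's [CLS] token so the
    classifier head can be applied (and the loss averaged) there."""
    tokens, cls_positions = [], []
    for sent in sentences:  # sent: list of (sub)word tokens
        if len(tokens) + len(sent) + 2 > max_len:
            break  # stop once the next sentence would exceed the budget
        cls_positions.append(len(tokens))
        tokens += ["[CLS]"] + sent + ["[SEP]"]
    return tokens, cls_positions
```

The single-sentence baseline corresponds to calling this with one sentence at a time.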
6 Answer Generation

Task Definition
Given a triple (question S_Q, description S_D, keyword set K) as input, where S_Q and S_D are both sentences and K consists of at most 4 keywords, the task is to generate a long counseling text consisting of multiple sentences that gives helpful comfort and advice, mimicking a mental health counselor.

Model Pretraining
GPT-2 (Radford et al., 2019) has shown success on various language generation tasks. However, (a) the publicly available pretrained Chinese GPT-2 models were not trained on any corpus related to psychology or mental health support, and (b) the context length of our data exceeds 512 tokens, which existing small- or middle-size Chinese pretrained GPT-2 models cannot handle. Thus we crawled 50K articles (0.1B tokens in total) related to psychology and mental health support from Yixinli (xinli001.com/info) and trained a GPT-2 from scratch on this corpus. The maximum context length is 1,024, and the model contains 10 layers with 12 attention heads (81.9M parameters).

Implementation Details
Data Preparation We first predict the strategy of each sentence in the unannotated answers using our strategy classifier with contextual information (Section 5). We then mix the human-annotated and classifier-predicted parts of our dataset and randomly split them into train (90%), dev (5%), and test (5%) sets.

Prepending Strategy Tokens To study the effectiveness of using explicit strategies as input, we compare the performance between models trained with and without strategy labels. Prepending (Niu and Bansal, 2018) is a simple yet effective way to add supervised information to data, requiring no architecture modification. We prepend the strategies as special tokens to the beginning of each span and still adopt the cross-entropy loss as our loss function. For these two baseline models, we also conduct comparative experiments with and without strategies.
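The prepending scheme can be sketched as follows (a minimal illustration; the bracketed special-token format is ours, and in practice these tokens would be added to the model's vocabulary):

```python
def prepend_strategy_tokens(spans):
    """Serialize an annotated answer as a training target in which each
    text span is preceded by its strategy as a special token, e.g.
    [Restatement]...[Interpretation]... No architecture change is needed;
    the model simply learns to generate strategy tokens before spans."""
    return "".join(f"[{strategy}]{text}" for strategy, text in spans)
```

At inference time, the generated strategy tokens also make it possible to check whether each span realizes its intended strategy (the CTRB metric below).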

Automatic Evaluation
The automatic metrics we adopted include Perplexity (PPL.), BLEU (Papineni et al., 2002), Distinct-1 (D1), Distinct-2 (D2) (Li et al., 2016), and controllability (CTRB). To evaluate the strategy controllability of the models, we first predict the strategy of each sentence in the generated answers using the classifier from Section 5, and then compute the proportion of predictions consistent with the strategy token at the head of each text span. The results of the automatic evaluation are shown in Table 6.
The results show that by adding strategy signals, all models improve on the perplexity metric, which suggests that the prepended strategy tokens help the models better predict the next token (see Appendix C for an example of the generations). Moreover, the BLEU, Distinct-1, and Distinct-2 scores all improve with strategy signals for the GPT-2 models, while dropping slightly for the Seq2Seq model. The strategy controllability of all models is approximately 80%, which means that the models perform fairly well in realizing the specified strategies.

Table 6: Automatic evaluation results. The BLEU score is computed by averaging BLEU-1,2,3,4. We view all the answers to a certain question as multiple references when computing the BLEU score.
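The Distinct-n and CTRB metrics above can be sketched as follows (a minimal illustration; tokenization choices are ours, and CTRB here assumes the per-span predicted and prepended strategy labels have already been extracted):

```python
def distinct_n(texts, n):
    """Distinct-n (Li et al., 2016): ratio of unique n-grams to total
    n-grams across all generated texts. Character-level for Chinese."""
    total, unique = 0, set()
    for text in texts:
        tokens = list(text)
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / max(total, 1)

def controllability(prepended, predicted):
    """CTRB: fraction of spans whose classifier-predicted strategy
    matches the strategy token prepended at generation time."""
    pairs = list(zip(prepended, predicted))
    return sum(a == b for a, b in pairs) / max(len(pairs), 1)
```

A CTRB of roughly 0.8, as reported in Table 6, would mean about four in five generated spans realize their intended strategy.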

Human Evaluation
To better evaluate the quality of the generated responses, we conducted a human evaluation. We recruited 15 graduate students majoring in psychology or psychological counseling to rate the answers. These professional raters were asked to score an answer in terms of Fluency (whether the answer is fluent and grammatical), Coherence (whether the answer is logical and well organized), Relevance (whether the descriptions in the answer are relevant to the question), and Helpfulness (whether the interpretations and suggestions are suitable from the psychological counseling perspective). A detailed guideline is shown in Appendix D. The raters rated each metric independently on a 3-star scale, where three stars mean the best. We randomly sampled 100 questions from the test set. For each question, there are three corresponding answers: (a) a generated answer by GPTft; (b) a generated answer by GPTft+strategy; (c) the golden answer. We shuffled the 300 question-answer pairs and assigned three raters to each pair. Table 7 shows the results of the human evaluation. We calculated Krippendorff's α (K-α) (Krippendorff, 2011) to measure inter-rater consistency; the K-α values are 0.58, 0.60, 0.55, and 0.62 for the four metrics respectively.
We observe that all the generated answers have relatively low scores because (1) the generated answers are quite long (more than 500 words), increasing the probability of the machine making mistakes; and (2) the professional raters are sensitive and cautious about the suggestions and analyses in the answers, especially concerning ethical risks. Nevertheless, the improvement in fluency and coherence with strategy shows that explicit strategy input indeed helps the model capture the structure of answers and generate better answers. We also note that the relevance score improves slightly even though we do not specifically model relevance. Moreover, the model with strategy can generate more helpful answers. However, there is still a remarkable gap between the models and well-trained help-supporters, which indicates that PsyQA presents a challenging problem and there is still a large space for future research.

Conclusion and Future Work
We present PsyQA, a high-quality Chinese dataset of psychological health support, and annotate strategies in a portion of answers based on the Helping Skills System. We show that there are not only typical lexical features in the texts of different support strategies, but also explicit patterns of strategy organization and utilization in forming counseling answers. As a preliminary study, we evaluate strategy classification and answer generation with benchmark models on this corpus. Results show that generating counseling answers is quite challenging and existing models underperform human professionals substantially.
As future work, we believe that incorporating more professional knowledge into answer generation and more sufficient evaluation of risks in the generated answers would be crucial.

Ethical Considerations Dataset Copyright
We have signed a Data Authorization Letter with Yixinli, and the dataset will only be made available to researchers who agree to follow ethical guidelines by signing a user agreement with both Yixinli and us.

Anonymization
Social media data are often sensitive, and even more so when the data are related to mental health. So privacy concerns and the risk to the individuals should always be well considered (Hovy and Spruit, 2016;Suster et al., 2017;Benton et al., 2017).
Our data source is anonymous to a certain extent. All the help-seekers in the Q&A column of Yixinli are anonymous, and they are fully aware that their posts will be public. Our dataset contains only publicly available Yixinli posts. In the Data Authorization Letter, Yixinli also promises that they have cleaned all the personal information of posters (by manual review and modification). Nevertheless, we still spent extensive effort on filtering for both help-seekers and help-supporters: we cleaned private information by rule-based filtering, removing, for instance, nicknames, phone numbers, and any URL links.
We protect anonymity in academic research. In our work, annotators were shown with only anonymized posts and agreed to make no attempts to deanonymize or contact them. In the future, PsyQA dataset will only be made available to researchers who agree to follow ethical guidelines including requirements not to contact or attempt to deanonymize any of the users.
Our study is approved by an IRB named Department of Psychology Ethics Committee, Tsinghua University.

Ethical Risk Evaluation
We realize there will be a high risk if a model unexpectedly generates a "wrong" answer, especially in the mental health counseling domain. Thus, we explore the ethical risk of the generated answers.
We asked professional annotators to check whether the answers carry ethical risks of four categories: (1) Inappropriate Guidance, (2) Offensiveness, (3) Risk Ignorance, and (4) Serious Crisis. Risk Ignorance means the answer ignores a potential crisis that appeared in the question, while Serious Crisis means the answer may lead to a serious crisis such as suicide. The number of answers suspected to carry ethical risks is shown in Table 8. Under the rule that at least two annotators give a risky label, the results are: 0 samples for Human, 2 samples for GPTft, and 0 samples for GPTft+strategy. This means human answers and answers generated by GPTft+strategy are relatively safe; adding control over strategies also reduces the risk in the generated answers.

Ethical Implications
This work does not make any treatment recommendations or diagnostic claims. Researchers should realize that the dataset comes from an online mutual-help forum, rather than professional psychological counseling. We recognize that help-supporters on online forums are less professional than psychological counselors (though more professional than laypeople). Thus the dataset inevitably carries a few potential ethical risks, which prompted us to invite professionals to annotate ethical risks. Based on the risk annotation, we believe that current technology should be used with great care when applying a purely generative model in this domain. Besides, we recognize that the models in this work may generate fabricated and inaccurate information due to systematic biases introduced during model training on web corpora. Therefore, we urge users to cautiously examine the ethical implications of the generated output in real-world applications. Safer applications may include real-time strategy analysis and sentence recommendation for help-supporters.

A Question Keywords
Topic Statistics We present the topic statistics in Table 9. Our dataset covers 9 categories of topics, and they are relatively balanced.

Keyword Options To post a question, help-seekers should also choose some keywords that best describe their problems. Keywords are composed of one broad topic and 1-3 subtopics. The keyword options are shown in Table 10.

B Reproducibility
Computing Infrastructure Our models are built upon the PyTorch transformers-3.4.0 library by Huggingface (Wolf et al., 2020). For model training, we use a Titan Xp GPU card with 12 GB memory.

Strategy Identification For RoBERTa (Liu et al., 2019) with contextual information, we set the max length to 512. For the baseline model, we set the max length to 128, which is longer than 99.6% of the sentences in the whole dataset. All other hyperparameters are the same for the models with and without contextual information. The optimizer is AdamW as provided by Huggingface, with a weight decay of 0.01. We set the learning rate to 5e-5 and the maximum number of epochs to 5 for both models. It takes 3 hours to train the models. A more detailed per-strategy classification result for RoBERTa (Liu et al., 2019) is shown in Table 11.

Answer Generation GPT-2 (Radford et al., 2019) contains 10 layers with 12 attention heads (81.9M parameters). For a fair comparison, the Seq2Seq model has a 5-layer encoder and a 5-layer decoder (94.5M parameters) (Vaswani et al., 2017). All models use the same word dictionary and the BertTokenizer provided by Huggingface. The optimizer for training is AdamW as provided by Huggingface; we set the learning rate to 1.5e-4 and the warmup steps to 2,500 for all models. It takes 168 hours to pretrain GPT-2, 5 hours to finetune GPT-2 on PsyQA, and 5 hours to train the Seq2Seq model. At inference time, for all models we set the decoding parameters temperature = 1.0, top_p = 0.9, top_k = 50, repetition_penalty = 1.5, and max_length = 1024 for nucleus sampling (Holtzman et al., 2020). Generating the 1,118 answers of the test set takes 3 hours for each model.
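The core of the nucleus (top-p) sampling used at inference time keeps the smallest set of tokens, in descending probability, whose cumulative mass reaches top_p; a minimal sketch (our own simplification, omitting the top-k cap, repetition penalty, and renormalization step of the full decoder):

```python
def nucleus_filter(probs, top_p=0.9):
    """Return the indices kept by nucleus (top-p) sampling: the smallest
    prefix of tokens, sorted by descending probability, whose total
    probability mass reaches top_p (Holtzman et al., 2020)."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept
```

The next token is then sampled from the kept set after renormalizing its probabilities.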

C Case Study
In Table 12, we present an example of the answers generated by GPTft trained with and without strategy labels, along with the golden answer (the highest-voted answer) as a reference.

D Guideline for Human Evaluation
We carry out human evaluation studies on the generated answers and the golden answer. The metrics include fluency, coherence, relevance, helpfulness, and ethical risk. The detailed evaluation guideline is shown in Table 13.

Description:
当面临一些困难的时候心志不够坚定的人会感到慌乱，那么如何在还没有遇见困难之前让自己变得更加坚强？ When faced with difficulties, people who are not determined enough will feel panicked. How can we make ourselves stronger before we encounter difficulties?

GPTft
一下，都可以的。但也不用把所有的事情都具备一个专业的能力，都具备的。对于您的心理状态也是一样的，不一定是心理咨询师。如果有条件的话，可以适当的自学一些其他的。
In this case, you can do some meditation exercises for yourself, or do some meditation exercises appropriately, or do a psychological consultation for yourself, or do it, all these are okay. But you don't have to be professional in all fields, all have. It's the same for your mental state, not necessarily a counselor. If possible, you can learn something else by yourself.

GPTft+Strategy
要坚持下去，只有这样才可能真正成为你坚持下去的动力。因为我们的目标很多是希望自己能够变得越来越优秀，能够在这一点上给自己带来很大的动力。比如我们在遇到困难的时候，也会努力想要去克服困难，但是很多时候我们会觉得要克服困难并不容易，因为我们并不完美，我们没有办法去付出相应的努力。如果你总是很害怕面对那些困难，那么就需要去克服困难。这一次是我的经验，希望对你有所帮助。当面临困难的时候，可以先先告诉自己这件事情是我能够克服的，我可以变得更好。但是我要告诉自己我可以去克服的。在面对困难的时候我们更需要去克服那些困难。但是我们要相信我们不会是不可控的。当自己在面对困难的时候，有能力去应对困难的时候，那么就能够更加的坚持下去。加油！ Stick to it. Only in this way can it be the motivation for you to continue. Since we usually set our goals to make ourselves more and more outstanding, this can bring great motivation to ourselves. For example, when we encounter difficulties, we will try to overcome difficulties, but at times we may feel that it is not easy to overcome difficulties, because we are not perfect, and we have no idea to make corresponding efforts. If you are always afraid to face those difficulties, you need to overcome difficulties. This time is my experience, I hope it works for you. When faced with difficulties, you can first first tell yourself that this is something I can overcome, and I can become better. But I have to tell myself that I can overcome. In the face of difficulties, we need more to overcome those difficulties. But we should believe that we are not uncontrollable. When you face difficulties and are able to cope with them, you can persist even more. Cheer up!

Golden Answer
没有人永远都是高手，只有不断的经历，不断的积累，不断的总结，不断地历练，才能达到你说的坚强。
If you focus all your attention on overcoming difficulties, you don't have time to consider whether you are strong in other people's eyes, because you devote all your heart to thinking about how to overcome difficulties. So the next time you encounter difficulties, all you will consider is how to solve them. Of course, not everyone is an expert in solving difficulties. One must go through countless difficulties, making internal and external efforts, to become an expert. The world is always changing, so no one is always a master. Only through continuous experience, continuous accumulation, continuous summary, and continuous tempering can you become as strong as you said.

Table 12: A case of generated answers and the golden answer. Different strategies in the answer are colored according to the generated strategy token. The strategies Approval and Reassurance, Interpretation, and Direct Guidance are generated in this answer by GPTft with strategy labels.