Weakly-Supervised Methods for Suicide Risk Assessment: Role of Related Domains

Social media has become a valuable resource for the study of suicidal ideation and the assessment of suicide risk. Among social media platforms, Reddit has emerged as the most promising one due to its anonymity and its focus on topic-based communities (subreddits), such as r/SuicideWatch, r/Anxiety, and r/depression, that can be indicative of someone’s state of mind or interest regarding mental health disorders. A challenge for previous work on suicide risk assessment has been the small amount of labeled data. We present an empirical investigation of several classes of weakly-supervised approaches, and show that pseudo-labeling based on related mental health issues (e.g., anxiety, depression) helps improve model performance for suicide risk assessment.


Introduction
Suicide has been identified as one of the leading causes of death, and approximately 1.5% of people die by suicide every year (WHO et al., 2016; Fazel and Runeson, 2020). Despite years of clinical research on suicide, researchers have concluded that suicide cannot be predicted using the standard clinical practice of asking patients about their suicidal thoughts (McHugh et al., 2019). Recently, Coppersmith et al. (2018) and Nock et al. (2019) discussed the opportunities of using social media combined with natural language processing (NLP) techniques to complement traditional clinical records and help in suicide risk analysis and early suicide intervention.
To facilitate further research on automatic suicide risk assessment, Zirikly et al. (2019) proposed a shared task, for which they collected user data from the r/SuicideWatch subreddit and annotated it with user-level suicide risk: no-risk, low-risk, medium-risk and high-risk. By comparing the results of the participating teams in this shared task, Zirikly et al. (2019) conclude that one of the major challenges comes from insufficient data for the intermediate suicide risk levels (i.e., low risk and medium risk) rather than the extreme risk levels (i.e., no risk and high risk). Matero et al. (2019) find that a dual BERT-LSTM-Attention model that separately extracts information from SuicideWatch and Non-SuicideWatch posts, together with feature engineering that includes emotion features, personality scores, and the user's anxiety and depression scores, is important for model performance.
In this paper, instead of feature engineering or complex model architectures, we explore whether weakly-supervised methods and data augmentation techniques based on clinical psychology research can help improve model performance. We explore several weakly-supervised methods, and show that a simple approach based on insights from clinical psychology research (O'Connor and Nock, 2014) obtains the best performance. This model uses pseudo-labeling (PL) on data from the subreddits r/Anxiety and r/depression, since anxiety and depression are considered important risk factors for suicide. We also present a potential application of our model for studying suicide risk among people who use drugs, opening the door to using NLP methods to deepen our understanding of the link between opioid use disorder (OUD) and mental health. The code for this paper can be found at https://github.com/yangalan123/WM-SRA.

Methods
We focus on Task A from the CLPsych 2019 shared task "Predicting the Degree of Suicide Risk in Reddit Posts" (Zirikly et al., 2019). The goal of the task is to predict the suicide risk category of a user based on their posts in the r/SuicideWatch subreddit. Specifically, a user $u_i$ is associated with a collection of $n(i)$ posts $C_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,n(i)}\}$, where each post $x_{i,k}$ ($1 \le k \le n(i)$) has $m(i,k)$ sentences $x_{i,k} = [s_{ik,1}, s_{ik,2}, \ldots, s_{ik,m(i,k)}]$. We need to predict $y_i \in \{a, b, c, d\}$ using $C_i$, where $a$, $b$, $c$, $d$ represent no-risk, low-risk, medium-risk and high-risk, respectively. In the original dataset, there are 496 users in the training set and 125 users in the test set. We further split 100 users from the training set to create the validation set. The sizes for the train/valid/test sets are 746, 173, and 186 respectively.
Data Pre-processing Following the advice in Zirikly et al. (2019), we replace all human names and URLs in the Reddit posts with the special tokens " PERSON " and " URL ", respectively. We also lowercase the text and remove punctuation and stop words. Due to the limitation of GPU memory, we split long posts into passages of no more than 128 words (footnote 1) and make sure that the split point is not in the middle of a sentence (footnote 2). Such passages are treated as separate posts.
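The passage splitting can be sketched as a greedy accumulation of whole sentences. The snippet below is only an illustration under stated assumptions: the function name `split_into_passages`, whitespace tokenization for word counts, and the choice to flush the buffer before emitting an over-long sentence (so that passage order is preserved) are ours and are not taken from the released code.

```python
from typing import List


def split_into_passages(sentences: List[str], max_words: int = 128) -> List[str]:
    """Greedily pack whole sentences into passages of at most `max_words` words."""
    passages: List[str] = []
    buffer: List[str] = []   # sentences collected for the current passage
    buffer_len = 0           # total number of words currently in the buffer

    for sentence in sentences:
        n_words = len(sentence.split())  # assumption: whitespace word count

        if n_words > max_words:
            # Over-long sentence: emit it as its own passage.
            # Assumption: flush the buffer first to keep the original order.
            if buffer:
                passages.append(" ".join(buffer))
                buffer, buffer_len = [], 0
            passages.append(sentence)
            continue

        if buffer_len + n_words > max_words:
            # Adding this sentence would exceed the limit: close the passage.
            passages.append(" ".join(buffer))
            buffer, buffer_len = [], 0

        buffer.append(sentence)
        buffer_len += n_words

    if buffer:
        passages.append(" ".join(buffer))
    return passages
```

Because a split can only happen between sentences, no passage ends in the middle of a sentence; the only passages longer than `max_words` are single sentences that already exceed the limit.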
Model Architecture Our architecture is a BERT (Devlin et al., 2019) model. We also experimented with other state-of-the-art pre-trained language models (PLMs), including RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019), but found BERT to work best and thus use it as our baseline architecture (more details can be found in Appendix A). Each post $x_{i,k}$ is fed into BERT and we obtain the post embedding $\mathbf{e}_{i,k} = \mathrm{BERT}(x_{i,k})$. We then apply simple mean-pooling over the posts to obtain the user embedding $\mathbf{u}_i = \frac{1}{n(i)} \sum_{k=1}^{n(i)} \mathbf{e}_{i,k}$. Finally, we feed $\mathbf{u}_i$ to a fully-connected layer followed by a softmax layer to predict the risk-level probability $\hat{P}(y_i \mid C_i)$. The label with the largest probability is picked as the final prediction $\hat{y}_i$. For training, the cross-entropy loss $L_{clf}$ is used to optimize our model.
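To make the user-level prediction step concrete, the following sketch shows mean-pooling over post embeddings followed by a linear layer and softmax. It is a toy, dependency-free illustration: the post embeddings are assumed to have already been produced by the encoder, and the names `mean_pool`, `softmax`, and `predict_risk`, as well as the weights used in the usage note, are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

LABELS = ["a", "b", "c", "d"]  # no-risk, low-risk, medium-risk, high-risk


def mean_pool(post_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average the post embeddings of one user into a single user embedding."""
    n_posts = len(post_embeddings)
    dim = len(post_embeddings[0])
    return [sum(e[j] for e in post_embeddings) / n_posts for j in range(dim)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]


def predict_risk(
    post_embeddings: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],  # one row of weights per label
    bias: Sequence[float],
) -> Tuple[str, List[float]]:
    """Mean-pool, apply a fully-connected layer, and return (label, probabilities)."""
    user = mean_pool(post_embeddings)
    logits = [sum(w * x for w, x in zip(row, user)) + b for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs
```

For instance, with two 2-dimensional post embeddings `[[1.0, 0.0], [3.0, 2.0]]`, the pooled user embedding is `[2.0, 1.0]`, and the predicted label is whichever row of the (here hypothetical) weight matrix yields the largest logit.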

Weakly-supervised Methods
Task-Adaptive Pre-training Recent works (Lee et al., 2020; Gururangan et al., 2020) point out that task-adaptive pre-training (TAP) can help pretrained language models better adapt to the target domains and can bring improvement, especially in data-poor scenarios. Specifically, we continue pre-training (e.g., masked language modeling for BERT) on a task-relevant unlabeled corpus and then do normal fine-tuning on the task. Our unlabeled corpus consists of all r/SuicideWatch posts (aggregated per user) from the training sets of all the tasks (A, B, C) in the shared task (Zirikly et al., 2019). There are 621 users and 138,057 posts in this unlabeled corpus. We do continued pretraining for 2 to 3 epochs and do early stopping.
Footnote 1: The 128 maximum passage length is tuned based on the validation set for both GPU memory and better computational efficiency for large posts. We do not observe a significant performance drop without a larger passage length.
Footnote 2: We use a limited-size stack and greedily add each sentence into the stack. If adding a new sentence will make the sum of lengths of all sentences in the stack exceed 128, we pop out all sentences, concatenate them to a new passage and then add this new sentence to the stack. For sentences having more than 128 words, we treat them as individual posts.
Multi-view Learning Multi-view learning (MVL) (Xu et al., 2013) is one of the widely recognized semi-supervised methods. Clark et al. (2018) provide a successful example of utilizing MVL in sequence labeling tasks. The idea is to create perturbations by masking words in certain positions and, in addition to normal classification training, to require the model to learn similar distributions over the complete labeled examples and the corresponding masked examples. However, since ours is a user-level classification task, we cannot directly borrow the strategy of Clark et al. (2018), as it mainly targets sequence labeling. We propose to create perturbations $\tilde{C}_i$ based on four strategies. First, for each sentence, we randomly mask 10% of the tokens (Word-Mask). Second, considering that users may have posts with many words, we also propose a sentence-level masking strategy (Sent-Mask): for each post of a single user in the training set, we randomly mask 10% of the sentences. Third, we only keep the beginning and ending sentences of each passage (BegEd). Usually these sentences convey the main purpose of the posts and should preserve important semantics. Fourth, we use Bert-extractive-summarizer (Miller, 2019) to extract a summary of each passage (K-Sum). It works mainly by first encoding each sentence $s_{ik,j}$ with a PLM into a continuous-valued representation $\mathbf{s}_{ik,j}$ and then fitting K-means clustering over the $\mathbf{s}_{ik,j}$. Finally, it picks the K sentences of each passage that are closest to the cluster centers. Empirically, we set K = 5.
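Three of the perturbation strategies are simple enough to sketch directly (K-Sum relies on an external summarizer and is omitted). The snippet is a hedged illustration, not the released implementation: a post is represented as a list of tokenized sentences, the `[MASK]` placeholder and the function names are assumptions, masking is done by independent draws at the given rate rather than by selecting exactly 10% of the units, and Sent-Mask is read here as replacing whole sentences with a single mask token.

```python
import random
from typing import List, Optional

Sentence = List[str]   # a tokenized sentence
Post = List[Sentence]  # a post (or passage) as a list of sentences
MASK = "[MASK]"        # assumption: placeholder used for masked units


def word_mask(post: Post, rate: float = 0.1, rng: Optional[random.Random] = None) -> Post:
    """Word-Mask: replace each token with MASK with probability `rate`."""
    rng = rng or random.Random()
    return [[MASK if rng.random() < rate else tok for tok in sent] for sent in post]


def sent_mask(post: Post, rate: float = 0.1, rng: Optional[random.Random] = None) -> Post:
    """Sent-Mask: replace each whole sentence with a single MASK with probability `rate`."""
    rng = rng or random.Random()
    return [[MASK] if rng.random() < rate else list(sent) for sent in post]


def beg_ed(post: Post) -> Post:
    """BegEd: keep only the first and last sentence of the post."""
    if len(post) <= 2:
        return [list(sent) for sent in post]
    return [list(post[0]), list(post[-1])]
```

Passing a seeded `random.Random` instance makes the perturbations reproducible, which is convenient when comparing strategies on a fixed validation split.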
In training, we use KL-divergence to enforce the constraint that the predicted probability on perturbed examples, $\hat{P}(y_i \mid \tilde{C}_i)$, should be close to the one on complete examples, $\hat{P}(y_i \mid C_i)$. The loss incurred by KL-divergence is simply added to the classification loss and these two losses are optimized together for each training instance.
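The combined objective can be illustrated on plain probability vectors. This is a minimal sketch under assumptions: the direction of the divergence (from the full-input prediction to the perturbed-input prediction), the unweighted sum of the two terms, and the function names are our choices for illustration; the text above only states that the KL term is added to the classification loss.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], gold_index: int) -> float:
    """Classification loss: negative log-probability of the gold label."""
    return -math.log(probs[gold_index])


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mvl_loss(p_full: Sequence[float], p_perturbed: Sequence[float], gold_index: int) -> float:
    """Classification loss on the complete input plus the consistency (KL) term."""
    return cross_entropy(p_full, gold_index) + kl_divergence(p_full, p_perturbed)
```

When the prediction on the perturbed input matches the prediction on the complete input, the KL term vanishes and the objective reduces to the ordinary cross-entropy loss.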

Clinical Psychology Inspired Pseudo-labeling
According to the analysis in the shared task report (Zirikly et al., 2019), the main challenge for the 4-way classification comes from insufficient data for the intermediate classes (i.e., low-risk and medium-risk). A straightforward solution is to collect more data for these two classes. Recent clinical psychology research (O'Connor and Nock, 2014) points out that mental health issues such as depression and anxiety can be important risk factors for suicide. Inspired by this study, we collect data from r/Anxiety and r/depression on Reddit. The collected data covers the period from December 1, 2008 to September 30, 2020. We sample a small proportion of the collected data from both subreddits and, after manual verification, assign low-risk labels to all r/Anxiety users in the sample and medium-risk labels to all r/depression users in the sample. Since we do not have experts to label these posts, adding too much pseudo-labeled data might introduce too much noise. Based on preliminary experiments on the validation set, we set the amount of added pseudo-labeled data to 8% of the suicide risk assessment training data. The only difference between these preliminary experiments and the main experiments is that we train the model for only 10 epochs rather than the full 20 epochs. Table 1 shows results for different sizes of added pseudo-labeled data from r/depression on the validation set. All pseudo-labeled data follows roughly the same pattern, with the best proportion being 8%.
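The augmentation step amounts to sampling a fixed number of users from a related subreddit and attaching one label to all of them. The sketch below is illustrative only: the function name `add_pseudo_labeled`, the representation of a user as a list of post strings, measuring the 8% in users, and the fixed random seed are assumptions rather than details confirmed by the paper.

```python
import random
from typing import List, Tuple

User = List[str]             # a user represented by their posts
Example = Tuple[User, str]   # (user, risk label in {"a", "b", "c", "d"})


def add_pseudo_labeled(
    train: List[Example],
    pool: List[User],
    label: str,
    proportion: float = 0.08,
    seed: int = 0,
) -> List[Example]:
    """Append pseudo-labeled users amounting to `proportion` of the training set size."""
    n_add = round(proportion * len(train))
    rng = random.Random(seed)
    sampled = rng.sample(pool, min(n_add, len(pool)))
    # Every sampled user receives the same pseudo-label, e.g. "c" for r/depression.
    return train + [(user, label) for user in sampled]
```

For example, calling `add_pseudo_labeled(train, depression_users, "c")` with a hypothetical list `depression_users` would mirror the setting in which r/depression users are treated as medium-risk.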

Experiments and Results
We implement our BERT model based on the Hugging Face Transformers library (Wolf et al., 2020). Due to the limitation of GPU memory, we only use the base version. We split off 20% of the original training data as the validation set and fix the split for all models. Model selection is done by early stopping, and we train all models for 20 epochs with a batch size of 32. For users with too many posts and words, we sample only 100 passages. Table 2 shows our results on Macro-F1.
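For reference, Macro-F1 is the unweighted mean of the per-class F1 scores, so the two intermediate classes count as much as the extreme ones. The following is a small self-contained sketch; the function name `macro_f1` and the convention of scoring a class as 0 when its precision or recall is undefined are assumptions and may differ from the official shared-task scorer.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = ("a", "b", "c", "d")) -> float:
    """Unweighted average of per-class F1 over `labels`."""
    f1_scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: 0 if undefined
        recall = tp / (tp + fn) if tp + fn else 0.0     # assumption: 0 if undefined
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because every class contributes equally, a model that ignores the low-risk and medium-risk classes is penalized heavily even if it does well on no-risk and high-risk users.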
Task-Adaptive Pre-training After applying task-adaptive pre-training to BERT, we see a small performance gain over BERT (i.e., from 0.427 to 0.432). This might be because, even though we use the whole corpus provided by the shared task, it is still not large enough.
Multi-view Learning The Word-Mask strategy improves over the BERT baseline. Compared with the adaptive pre-training results on BERT, which also relies on word-level masking but is trained only on language modeling, MVL provides a more efficient way to utilize a small training corpus and brings a 3.1% gain in Macro-F1. However, all the other MVL approaches hurt performance compared to the BERT baseline. This might be because the proposed sentence-level perturbation strategies can seriously break the semantics of each post and thus affect the overall performance; random sampling over sentences hurts the most.
Clinical Psychology Inspired Pseudo-labeling Exp 7, 8 and 9 in Table 2 achieve the top-3 Macro-F1 scores. This indicates that although our psychology-inspired pseudo-labeling technique is simpler than the other weakly-supervised methods, adding meaningful pseudo-labeled data from relevant domains helps mitigate the problem of insufficient data in the intermediate classes (b and c). To verify this point, we show the class-wise classification results for PL-based models in Table 3, where we can see improvements on the b and c classes. Due to space constraints, we present the class-wise performance for all models in Appendix C. The investigation of the confusion matrix of the best model (shown in Section 4) further supports our hypothesis. However, when we combine different pseudo-labeled data (see Exp 9, where we add users from r/depression and r/Anxiety in the proportion 1:2 while keeping the number of added users the same), we observe a slight performance drop. The reason might be that users in these two PL datasets are at the boundary between low-risk and medium-risk, so simply mixing them together makes the model confuse these two classes (see Appendix D for all confusion matrices).
Furthermore, we wanted to test the role of the clinical psychology aspect of our pseudo-labeling approach. Does the gain come from the meaningful domains (anxiety and depression) or just from adding more data? To answer this, we use additional data provided by Task C of the shared task, which contains posts from random subreddits (e.g., sports). We run two experiments: 1) assign low-risk to all such users and 2) assign the gold labels provided by the task via crowdsourcing. We add the same amount of data as in the other pseudo-labeling experiments (8% of the training data). The results (Exp 10 & 11 in Table 2) show that the clinical psychology inspired PL outperforms these models by meaningfully addressing the problem of insufficient data for the intermediate classes.

Error Analysis
In this section, we take a closer look at the prediction results of our best model (clinical psychology inspired pseudo-labeling using r/depression as medium risk) by looking at the confusion matrix and sampled error cases. We plot the confusion matrices for the baseline model (Exp 1 in Table 2) and the best model (Exp 7 in Table 2) in Figure 1. We can see that the best model achieves its improvement mainly by fixing error cases wrongly predicted as no-risk (where the true labels are "b", "c" and "d", with greater error reduction for "d") and as low-risk (where the true labels are "c" and "d"). As O'Connor and Nock (2014) point out, depression is a serious mental health issue and has become one of the most important risk factors for suicide. Adding posts from r/depression can help the model better understand what "medium-risk" and "high-risk" mean and thus raise the alert for signals of similar or related mental health issues.
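A confusion matrix of the kind discussed above can be computed with a few lines of code. This sketch uses rows for true labels and columns for predicted labels; the function name and this orientation are assumptions and need not match the layout of Figure 1.

```python
from typing import List, Sequence


def confusion_matrix(gold: Sequence[str], pred: Sequence[str],
                     labels: Sequence[str] = ("a", "b", "c", "d")) -> List[List[int]]:
    """Count matrix with matrix[i][j] = number of users with true label i predicted as j."""
    index = {lab: i for i, lab in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for g, p in zip(gold, pred):
        matrix[index[g]][index[p]] += 1
    return matrix
```

With this orientation, the first column collects users predicted as no-risk, so errors of the kind "true label d predicted as a" appear in the last row of the first column.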
We can also see that the main problem of our best model is still the confusion between "b" (low-risk) and "c" (medium-risk). In addition, the problem of wrongly predicting examples from the intermediate classes as high-risk still exists. By manual investigation, we find that both problems require expertise in mental health to make the subtle distinctions. For example, the following text comes from a low-risk example that is wrongly predicted as high-risk by our best model: " sadness has taken me. . . i am sad , lonely , and i have no interest in living anymore. . . i didnt want to die. . . my mind is diseased , unable to take happiness. . . i have no interest in forming any more. . . . i dont think ill do it. . . " It can be seen that there are many negative or even desperate expressions (marked as red) in this example, mixed with some short signals (marked as blue) possibly indicating a person considered at low risk. The model can be fooled by the large number of negative expressions and make the wrong prediction if it is not aware of the true intent of the person. Therefore, reliable intent identification that considers user posts across time and other information would be a powerful tool to help the model prevent mistakes like this.

Application: Predicting Suicide Risk of People Who Use Drugs
In order to further verify the effectiveness of our model in real-world applications, we create a simulation scenario: we apply our best model (Exp 7) to data collected for 612 users who post on both r/opiates and r/SuicideWatch. r/opiates is a subreddit where people discuss topics around opioid usage (e.g., drug doses, withdrawal anguish, daily experiences, harm reduction). Members of this community could often be at high suicide risk (Aladag et al., 2018; Yao et al., 2020). We apply our model to their 1,176 posts on r/SuicideWatch and find that it predicts 15.52% of them as no-risk and 84.48% of them as low-risk, medium-risk or high-risk. The results on 2,863 sampled r/opiates posts are 30.56% for no-risk and 69.44% for at least some risk. The predicted outputs are highly aligned with the results reported by Yao et al. (2020) using crowdsourced annotation of suicidal versus not-suicidal posts, and show the effectiveness of our model in this simulated scenario. We hope this will open the door to using NLP methods to investigate the link between suicidal ideation and fatal overdoses among people who use drugs.

Conclusions
We investigated a series of weakly-supervised methods and found that pseudo-labeling on data related to risk factors for suicide (depression, anxiety) can help improve model performance. This provides an alternative way to build theoretically-grounded models (e.g., compared to feature engineering). We also showed a potential use case of this work for understanding suicidal ideation among people who use drugs (e.g., opiates).

Ethical Considerations
The dataset for suicide risk assessment was obtained from the organizers of the 2019 Clinical Psychology Shared Task on Suicide Risk Assessment, by filling in a participant application in which we affirmed that we would follow the shared task's rules. We have obtained IRB approval (exempt) from Columbia University to use the data, as it consists of publicly available and anonymous posts extracted from Reddit. For the application part, we also obtained Columbia IRB approval (exempt) for the publicly available and anonymous data from r/opiates. All data is kept secure and online user IDs are not associated with the posts. Our intention in developing and improving suicide risk assessment models is to help health professionals and/or social workers identify people who might be at risk of suicide. We emphasize our intention that suicide risk assessment models such as the ones developed here be used responsibly, with a human in the loop, for example a medical professional or a mental health specialist who can look at the predicted labels, offer explanations, and decide whether or not they seem sensible. We would urge any user of suicide risk assessment technology to carefully control who may use the system. Currently, the presented models may fail in two ways: they may either mislabel an at-risk user as no-risk (our current models are particularly designed to minimize this risk), or classify a no-risk user as having some level of risk. Obviously, there is potential harm to a person who is truly in need if a system based on this work fails to detect their suicidal ideation, and it is possible that a person who is not truly in need may be irritated or offended if someone reaches out to them because of a mistake. This is why this system should be used only as additional help for health professionals.
We note that because most of our data were collected from Reddit, a website with a known overall demographic skew (towards young, white, American men), our conclusions about what expressions of different suicide risk levels look like and how to detect them cannot necessarily be applied to broader groups of people. This might be particularly acute for vulnerable populations such as people with opioid use disorder (OUD). We hope that this research stimulates more work by the research community to consider and model ways in which different groups express suicidal ideation.

A Comparison of Different Pre-trained Language Models
Given that there has been significant progress on architecture designs after BERT, we experimented with different PLMs, such as RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019). From Table 4, we can see that on the test set, the Macro-F1 scores for BERT and RoBERTa are almost the same and XLNet performs worse than BERT. Therefore, we hypothesize that the architecture of the PLM does not substantially influence the results on this task, and so we chose the BERT model.

C Class-wise Decomposition of Experimental Results
Here we show the class-wise performance for all the models in Table 6.

D Additional Error Analysis
Additional confusion matrices for high-performance models (8, 9, 10 in Table 2) are in Figure 3.
Table 6: Class-wise decomposition results for models considered in this paper. The results under each class are presented following the "Precision/Recall/F1" format.