An Annotated Dataset for Explainable Interpersonal Risk Factors of Mental Disturbance in Social Media Posts

With a surge in identifying suicidal risk and its severity in social media posts, we argue that more consequential and explainable research is required for optimal impact on clinical psychology practice and personalized mental healthcare. The success of computational intelligence techniques for inferring mental illness from social media resources points to natural language processing as a lens for determining Interpersonal Risk Factors (IRF) in human writings. Motivated by the limited availability of datasets for the social NLP research community, we construct and release a new annotated dataset with human-labelled explanations and classification of IRF affecting mental disturbance on social media: (i) Thwarted Belongingness (TBe), and (ii) Perceived Burdensomeness (PBu). We establish baseline models on our dataset, facilitating future research directions to develop real-time personalized AI models that detect patterns of TBe and PBu in the emotional spectrum of a user's historical social media profile.


Introduction
The World Health Organization (WHO) emphasizes the importance of significantly accelerating suicide prevention efforts to fulfill the United Nations' Sustainable Development Goal (SDG) objective by 2030 (Saxena and Kline, 2021). Reports released in August 2021 indicate that 1.6 million people in England were on waiting lists for mental health care. An estimated 8 million people were unable to obtain assistance from a specialist, as they were not considered sick enough to qualify. As suicide remains one of the leading causes of death worldwide, this situation underscores the need for mental health interpretations from social media data, where people express themselves and their thoughts, beliefs, and emotions with ease (Wongkoblap et al., 2022). Individuals who die by suicide cannot undergo psychological assessment, making self-reported texts or personal writings a valuable asset when attempting to assess an individual's specific personality status and state of mind (Garg, 2023). Motivated to think beyond low-level analysis, Figure 1 suggests personalization through higher-level analysis of human writings. As social media platforms are frequently relied upon as open fora for honest disclosure (Resnik et al., 2021), we examine mental disturbance in Reddit posts, aiming to discover Interpersonal Risk Factors (IRF) in text.
Interpersonal relationships are the strong connections that a person forms with their closest social circle (peers, intimate partners, and family members), which can shape an individual's behavior and range of experience (Puzia et al., 2014). Disruptions to such interpersonal relationships influence the associated risk factors, resulting in mental disturbance. According to the interpersonal-psychological theory of suicidal behavior (Joiner et al., 2005), suicidal risk has been studied in prior work (Tsakalidis et al., 2022; Gaur et al., 2018) as an intrinsic classification task.
Computational approaches can harness technological advancements in psychology research, aiding the early detection, prediction, evaluation, management, and follow-up of those experiencing suicidal thoughts and behaviors. Most automated systems require available datasets for computational advancements. Past studies show that the availability of relevant datasets in the mental healthcare domain is scarce for IRF due to the sensitive nature of the data, as shown in Table 1 (Su et al., 2020; Garg, 2023). To this end, we introduce an annotated Reddit dataset for classifying TBE and PBU. The explanatory power of this dataset lies in supporting motivational interviewing and mental health triaging, where early detection of potential risk may trigger an alarm for the need of a mental health practitioner. We adhere to ethical considerations for constructing and releasing our dataset publicly on GitHub.

Dataset
2.1 Corpus Construction

Haque et al. (2021) used two subreddits, r/depression and r/SuicideWatch, to scrape the SDCNL data and to validate a label-correction methodology through manual annotation of this dataset for depression versus suicide. They addressed the then-existing ethical issues impacting dataset availability with the public release of their dataset. In addition to the 1896 posts of the SDCNL dataset, we collected 3362 additional instances from r/depression and r/SuicideWatch through the PRAW API from 02 December 2021 to 04 January 2022, at about 100 data points per day (to maintain variation in the dataset). On initial screening, we found (i) posts with no self-advocacy, and (ii) empty/irrelevant posts. We manually filtered these out to ensure self-advocacy in the texts, leaving 3155 additional samples and a total of 5051 data points (Garg et al., 2022). We then removed 694 data points depicting no assessment of mental disturbance. Moreover, people write prolonged texts when they indicate IRF, which is in line with the conventional argument that prolonged remarks get better responses from others than transient remarks (Park et al., 2015). The length of real-time Reddit posts varies from a few characters to thousands of words. We limit the maximum length of every post to 300 words, resulting in 3522 posts as the final corpus.
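The filtering step above can be sketched as follows. This is a minimal illustration, not the released code: the function names are ours, and the PRAW collection call is shown only as a comment because it requires API credentials.

```python
# Sketch of the corpus-filtering step: empty posts are dropped and posts
# longer than 300 words are excluded from the final corpus.

def within_word_limit(post_text: str, max_words: int = 300) -> bool:
    """Return True if the post is at most `max_words` words long."""
    return len(post_text.split()) <= max_words

def filter_corpus(posts: list[str], max_words: int = 300) -> list[str]:
    """Keep only non-empty posts within the word limit."""
    return [p for p in posts if p.strip() and within_word_limit(p, max_words)]

# Collection itself would use PRAW, roughly:
#   import praw
#   reddit = praw.Reddit(client_id=..., client_secret=..., user_agent=...)
#   for submission in reddit.subreddit("depression").new(limit=100):
#       posts.append(submission.selftext)
```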

Annotation Scheme
Classification of IRF, being a complex and highly subjective task, may induce errors under naive judgment. To mitigate this problem, we build a team of three experts: (i) a clinical psychologist for training annotators and validating annotations from a psychological viewpoint, (ii) a rehabilitation counselor for comprehending the human mind to understand users' IRF, and (iii) a social NLP expert suggesting text-based markings in Reddit posts. To negotiate and mitigate the trade-off between these three perspectives, our experts built annotation guidelines to mark (i) TBE and (ii) PBU. The experts annotated 40 samples of the corpus in isolation using these annotation guidelines to avoid biases and discover possible dilemmas arising from the subjective nature of the tasks. We then added perplexity guidelines to simplify the task and facilitate unbiased future annotations:

1. TBE or PBU in the Past: We check whether the condition of a person with a disconnected past still presents an alarming prospect of self-harm or suicidal risk. For instance: 'I was so upset being lonely before Christmas and today I am celebrating New Year with friends'. We frame rules to handle risk indicators about the past, because here the person attends a celebration and overcomes the preceding mental disturbance, filling the void with an external event. While the NLP expert held a neutral opinion about such double negation, our clinical psychologist argued that the perceived risk may evolve again after some time and thus marked this post with the presence of TBe.
2. Ambiguity with Social Experiences: Relationships point to the importance of being able to take a societal pulse on a regular basis, especially in these unprecedented times of pandemic-induced distancing and shut-downs. People mention major social events such as breakups, marriages, and issues with best friends in various contexts, suggesting different user perceptions. We mitigate this problem with two rules: (i) any feeling of void/missing/regret, or even mentioning such events with negative words, should be marked as the presence of TBe; consider this post: 'But I just miss her SO much. It's like she set the bar so high that all I can do is just stare at it.' (ii) Anything associated with fights/quarrels/general stories should be marked as the absence of TBe; consider this post: 'My husband and I just had a huge argument and he stormed out. I should be crying or stopping him or something. But I decided to take a handful of benzos instead.'

Annotation Task
Three postgraduate students underwent eight hours of professional training by a senior clinical psychologist leveraging the annotation and perplexity guidelines. After three successive trial sessions annotating 40 samples in each round, we ensured their alignment on interpreting the task requirements and deployed them to annotate all data points in the corpus. We obtain final annotations based on the majority-voting mechanism for the binary classification task <TBE, PBU>. We validate the three annotated files using a Fleiss' kappa inter-observer agreement study on classifying TBE and PBU, where kappa is calculated as 78.83% and 82.39%, respectively. Furthermore, we carry out an inter-annotator agreement study with group annotations for text-span extraction in positive data points. The results of this agreement study, reported in a two-fold manner with (i) 2 categories (agree, disagree) and (ii) 4 categories (strongly agree, weakly agree, weakly disagree, strongly disagree), are 82.2% and 76.4% for <TBE_EXP>, and 89.3% and 81.3% for <PBU_EXP>, respectively.
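The Fleiss' kappa computation used in the agreement study can be sketched as follows; this is a from-scratch implementation of the standard formula, not the authors' own script.

```python
# Minimal Fleiss' kappa sketch for the inter-observer agreement study.
# `ratings` holds, for each post, the count of annotators assigning each
# label (here: [absent, present]); each row sums to the number of raters.

def fleiss_kappa(ratings: list[list[int]]) -> float:
    N = len(ratings)      # number of items (posts)
    n = sum(ratings[0])   # number of raters per item
    k = len(ratings[0])   # number of categories
    # mean per-item agreement
    p_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in ratings) / N
    # chance agreement from marginal category proportions
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

With three annotators and binary labels, perfect agreement on every post yields kappa = 1.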

Dataset Statistics
On observing the statistics of our dataset in Table 2, we found 54.71% and 32.56% positive data points, comprising 255489 and 156620 words, for TBE and PBU, respectively. It is interesting to note that although the average number of sentences expressing PBU is less than for TBE, the observations differ for the average number of words. We calculate the Pearson Correlation Coefficient (PCC) for our cross-sectional study on TBE and PBU as 0.0577, which shows a slight correlation between the two. Our dataset paves the way for longitudinal studies, which are expected to witness increased PCC due to a wider emotional spectrum (Kolnogorova et al., 2021; Harrigian et al., 2020). The most frequent words for identifying (i) TBE are alone, lonely, nobody to talk, someone, isolated, lost, and (ii) PBU are die, suicide, suicidal, kill, burden, cut myself (a word cloud is given in Appendix C). Our approach for identifying TBe and PBu goes beyond a simple keyword detector; instead, we utilize a more sophisticated method that considers the context and relationships between words. For instance, consider the following sample: 'Massive party at a friend's house - one of my closest friends is there, loads of my close friends are there, i wasn't invited. wasn't told. only found out on snapchat from their stories. spending new years eve on teamspeak muting my mic every time i break down :)'. Despite the absence of trigger words, our approach flags this post as positive for TBe based on its indicators 'friend', 'teamspeak', 'friends', 'invited', 'snapchat', to name a few.
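As a quick consistency check, the reported PCC of 0.0577 between the binary TBE and PBU labels can be reproduced as the phi coefficient of their 2x2 contingency table, using the counts reported alongside Table 5:

```python
# Phi coefficient (equivalent to Pearson's r for two binary variables)
# from the TBE/PBU contingency counts reported with Table 5.
import math

n11, n10 = 675, 1252   # TBE present: PBU present / PBU absent
n01, n00 = 472, 1123   # TBE absent:  PBU present / PBU absent

num = n11 * n00 - n10 * n01
den = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
phi = num / den
print(round(phi, 4))  # 0.0577, matching the reported PCC
```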

Baselines
We perform extensive analysis to build baselines with three different conventional methods. We first apply recurrent neural networks: a given text, embedded with GloVe 840B-300, is fed to a 2-layer RNN model (LSTM, GRU) with 64 hidden neurons, and the output is forwarded to two separate fully connected heads: (i) TBE and (ii) PBU. Each fully connected block has one hidden layer with 16 neurons and a ReLU activation function, and an output layer with sigmoid activation. The loss function is binary cross-entropy and the optimizer is Adam with lr = 0.001. Next, we apply pre-trained transformer-based models. The input is tokenized using a pre-trained transformer's tokenizer to obtain a 768-dimensional vector, which is then fed to a fully connected network similar to the previous architecture with a hidden layer size of 48. We experimented with roberta-base, bert-base-uncased, distilbert-base-uncased, and mental/mental-bert-base-uncased models. Finally, we use the OpenAI embeddings API to convert the input text into 1536-dimensional embeddings through the 'text-embedding-ada-002' engine, which are used to train a classifier. We test the robustness of this approach over: (i) Logistic Regression, (ii) Random Forest, (iii) Support Vector Machine, (iv) Multi-Layer Perceptron, and (v) XGBoost. We further use two explainability methods, (i) LIME and (ii) SHAP, on one of the best-performing transformer-based models, MentalBERT (Ji et al., 2022), to obtain the top keywords (Danilevsky et al., 2020; Zirikly and Dredze, 2022). We compare them with the ground truth using ROUGE scores: Precision (P), Recall (R), and F1-score (F).
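The embeddings-plus-classifier baseline can be sketched as below. This is only an illustration of the pipeline shape: real 1536-dimensional vectors would come from the 'text-embedding-ada-002' engine, but here random vectors stand in so the code runs without API access, and the labels are placeholders.

```python
# Sketch of the OpenAI-embeddings + classifier baseline: fixed-size
# embedding vectors are fed to a scikit-learn classifier per label (TBE
# shown here). Embeddings and labels below are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(64, 1536))   # placeholder for ada-002 embeddings
y_train = rng.integers(0, 2, size=64)   # placeholder binary TBE labels
X_test = rng.normal(size=(8, 1536))

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = clf.predict(X_test)             # one binary label per post
```

Swapping `LogisticRegression` for an SVM, random forest, MLP, or XGBoost classifier changes only the estimator line.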

Experimental Settings
For consistency, we used the same experimental settings for all models and split the dataset into train, validation, and test sets. All results are reported on the test set, which makes up 30% of the whole dataset. We used grid search to optimize the parameters, empirically experimenting with learning rates lr ∈ {0.001, 0.0001, 0.00001}, optimizers O ∈ {Adam, Adamax, AdamW}, and batch sizes of 16 and 32. We used the base versions of pre-trained language models (LMs) from HuggingFace, an open-source Python library. We used the optimized parameters for each baseline to report precision, recall, F1-score, and accuracy. Posts of varying lengths are padded to 256 tokens with truncation. Each model was trained for 20 epochs, and the best-performing model based on the average accuracy score was saved. Thus, we set the hyperparameters for our experiments as: optimizer = Adam, learning rate = 1e-3, batch size = 16, and epochs = 20.
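The grid searched above (3 learning rates x 3 optimizers x 2 batch sizes = 18 configurations) can be enumerated as follows; the training-and-scoring step for each configuration is left as a placeholder comment.

```python
# Enumerate the hyperparameter grid stated in the experimental settings.
from itertools import product

grid = {
    "lr": [1e-3, 1e-4, 1e-5],
    "optimizer": ["Adam", "Adamax", "AdamW"],
    "batch_size": [16, 32],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
# For each config: train a model, evaluate on the validation set, and
# keep the configuration with the best average accuracy.
print(len(configs))  # 18
```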

Experimental Results
Table 3 shows the performance of state-of-the-art methods in terms of precision, recall, F1-score, and accuracy. The current models have moderately low performance on this task, possibly due to a limited ability to capture contextual information in the text. MentalBERT, a transformer-based language model initialized with BERT-Base and trained on mental health-related posts collected from Reddit, had the best performance among BERT-based models, with F1-scores of 76.73% and 62.77% for TBE and PBU, respectively. This is likely because it was trained on the same context as the task, namely mental health-related posts on Reddit. The combination of OpenAI embeddings and a classifier outperforms the RNN and transformer-based models: the highest F1-score of 81.23% was achieved by logistic regression for TBE, while the best-performing model for PBU was the SVM with an F1-score of 76.90%. We also analyzed the explainability of the best-performing transformer model (MentalBERT) for TBE and PBU using the LIME and SHAP methods of explainable AI for NLP. We obtain results for all positive data points in the test set and observe high recall of text-spans with reference to the ground truth, as shown in Table 4. We see scope for improvement in limiting the superfluous text-spans found in the resulting set of words. The consistency in results suggests the need for contextual/domain-specific knowledge and infused commonsense to improve explainable classifiers for this task.
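The comparison of extracted keywords against ground-truth text-spans can be illustrated with a simplified unigram-overlap version of the ROUGE-style precision/recall/F1 computation; the actual evaluation uses ROUGE, so this sketch only shows how high recall with superfluous predicted tokens depresses precision.

```python
# Unigram-overlap P/R/F between explainer keywords and a gold text-span,
# as a simplified stand-in for ROUGE-1 scoring.
def overlap_prf(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    pred, ref = set(predicted), set(gold)
    hits = len(pred & ref)
    p = hits / len(pred) if pred else 0.0   # precision over predicted tokens
    r = hits / len(ref) if ref else 0.0     # recall over gold tokens
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```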

Conclusion and Future Work
We present a new annotated dataset for discovering interpersonal risk factors through human-annotated extractive explanations, in the form of text-spans and binary labels, in 3522 English Reddit posts. In future work, we plan to enhance the dataset with more samples and develop new models tailored explicitly to TBE and PBU. The implications of this work include the potential to improve public health surveillance and other mental healthcare applications that rely on automatically identifying posts in which users describe their mental health issues. We leave the implementation of explainable AI models for multi-task text classification as an open research direction for OpenAI and other newly developed responsible AI models. We pose, as a direction for future research, longitudinal studies on users' historical social media profiles to examine interpersonal risk factors and the potential risk of self-harm or suicidal ideation. As we focus on Reddit data as a starting point for our study, exploring other forums could be an interesting research direction.

Acknowledgements

We thank our expert team, including the rehabilitation counselor, for their unwavering support throughout the project. Additionally, we extend our heartfelt appreciation to Prof. Sunghwan Sohn for his consistent guidance and support. This project was partially supported by NIH R01 AG068007 and is funded by NSERC Discovery Grant (RGPIN-2017-05377), held by Vijay Mago, Department of Computer Science, Lakehead University, Canada.

Limitations
There might be linguistic discrepancies between Reddit users and Twitter users who post about their mental disturbance on social media. Social media users may intentionally post such thoughts to gain the attention of other users, but for simplicity we assume the social media posts to be credible; thus, we assume that the posts are not misleading. We acknowledge that our work is subjective in nature and, thus, interpretations about wellness dimensions in a given post may vary from person to person.

Ethical Considerations
The dataset we use is from Reddit, a forum intended for anonymous posting, where users' IDs are anonymized. In addition, all sample posts shown throughout this work are anonymized, obfuscated, and paraphrased for user privacy and to prevent misuse; thus, this study does not require ethical approval. Due to the subjective nature of the annotation, we expect some biases in our gold-labelled data and in the distribution of labels in our dataset. To address these concerns, examples were collected from a wide range of users and groups, with clearly defined instructions. Given the high inter-annotator agreement (κ score), we are confident that the annotation instructions were correctly applied for most data points. Our work is reproducible: the dataset and the source code for reproducing the baseline results are available on GitHub.
To address concerns around potential harms, we believe that the tool should be used by professionals who are trained to handle and interpret the results. We recognize the huge impact of false negatives in practical applications such as mental health triaging, and we shall continue working towards improving accuracy and reducing the likelihood of false negatives. We further acknowledge that our work is empirical in nature, and we do not claim to provide any solution for clinical diagnosis at this stage.

Figure 1: Overview of the problem formulation depicting the need for identifying interpersonal risk factors in texts. The texts [1-4] are annotated with 0 (absence) or 1 (presence) of the interpersonal risk factors TBe and PBu.

Table 2: Statistics of the Reddit dataset for determining the presence or absence of TBE and PBU and their explanations.

Table 3: Comparison of SOTA baseline models' performance.

Table 4: Performance evaluation of the explanations of the MentalBERT model through LIME and SHAP.
On changing TBE from absence to presence, we observe a high rate of increase in positive PBU data points ((675 − 472)/472 = 43.00%) compared to negative PBU data points ((1252 − 1123)/1123 = 11.48%), suggesting a probability of high correlation between the presence of TBE and the presence of PBU; the counts are given in Table 5.

Table 6: A sample from the dataset to examine interpersonal risk factors and their explanations for mental health problems. Sample post: 'I'm having thoughts about killing myself to escape all of this. Its the most dumb thing to do but i feel like im running out of choices. We're not financially stable. I'm a student. I should have wore a condom. What should i do.'