Toward Macro-Insights for Suicide Prevention: Analyzing Fine-Grained Distress at Scale



Introduction
Suicide is among the leading causes of death for individuals 10-44 years of age in the United States (Heron and Tejada-Vera, 2009). Indeed, while mortality rates for most illnesses decreased between 2008 and 2009, the rate of suicide increased by 2.4% (Heron and Tejada-Vera, 2009). The lifetime prevalence for suicidal ideation is 5.6-14.3% in the general population, and as high as 19.8-24.0% among youth (Nock et al., 2008).
The first step toward suicide prevention is to identify, ideally in consultation with clinical experts, the risk factors associated with suicide. Due to social stigma among other sociocultural factors (Crosby et al., 2011), individuals at risk for committing suicide may not always reach out to professionals or, if they do, provide them with accurate information. They may not even realize their own level of suicide risk before it is too late. Self-reporting, then, is not an entirely reliable means of detecting and assessing suicide risk, and research on suicide prevention can benefit from also exploring other channels for assessing risk.
For instance, individuals may be more inclined to seek support from informal resources, such as social media, instead of seeking treatment (Crosby et al., 2011; Bruffaerts et al., 2011; Ryan et al., 2010). Evidence suggests that youth and emerging adults usually prefer to seek help from their friends and families; however, higher levels of suicidal ideation are associated with lower levels of help-seeking from both formal and informal resources (Deane et al., 2001).
These patterns in help-seeking behavior suggest that social media might be an important channel for discovering those at risk for, and even preventing, suicide.
Internet- and telecommunications-driven activity is revolutionizing the social sciences by providing data, much of it publicly available, on human activity in situ, at volumes and at a level of time and space granularity never before approached. Can such data improve clinical preventative study and measures by providing access to at-risk individuals who would otherwise go undetected, and by leading to better science about suicide risk behaviors?
The stress-diathesis model for suicidal behavior (Mann et al., 1999) suggests that they might. It says that (1) objective states, such as depression or life events, as well as subjective states and traits, such as substance abuse or family history of depression, suicide, or substance abuse, are among the risk factors that contribute to suicidal ideation and (2) the presence of these factors could eventually lead to either externalizing aggression (e.g., interpersonal violence) or internalizing aggression (e.g., attempting suicide).
Since the stress-diathesis model was developed using risk factors for suicidal behavior, and because it makes a connection between internalized and externalized acts, it is a suitable framework for analyzing publicly available linguistic data from social media outlets such as Twitter. Data from social media can be seen as a kind of natural experiment on depression and suicidal ideation that is unburdened by such sample biases as the willingness of individuals to take part in research and/or seek out formal sources of support. Moreover, this approach may provide information about individuals who are unlikely to engage in formal help-seeking behaviors, or may inform effective methods of natural helping. Thus, this macro-level approach to monitoring suicidal behaviors may have future implications not only for identifying individuals who have a higher prevalence of suicidal behaviors but also for eventually developing additional methods for enhancing protective factors against suicide.
In this paper, we take steps toward the automatic detection of suicide risk among individuals via social media. Suicide ideation is a complex behavior and its connection to suicide itself remains poorly understood. We focus on a particular aspect of suicidality, namely distress. While not equivalent to suicide ideation, according to Nock et al. (2010) distress is an important risk factor in suicide, and one that is observable from microblog text, though admittedly observing suicide risk behavior is a subjective and noisy venture. Lehrman et al. (2012) conducted an early study on the computational modeling of distress based on short forum texts, yet left many areas wide open for continued study. For example, analysis at scale is one such open issue. Relatedly, Pestian and colleagues (Matykiewicz et al., 2009; Pestian et al., 2008) used computational methods to understand suicide notes. However, when it comes to preventive contexts, such data are less insightful. For preventive health, access to real-time health-related data that dynamically evolve can allow us to address macro-level analysis. Social media provide an additional opportunity to model the phenomena of interest at scale.
We use methods that take advantage of lexical analysis to retrieve microblog posts (tweets) from Twitter, and we compare the performance of human annotators (one an expert, the others not) in rating the level of distress of each tweet.
Clinical expert annotation, rather than general-purpose tools for content and sentiment analysis such as LIWC (Linguistic Inquiry and Word Count) by Pennebaker et al. (2001), provides a basis for text-based statistical modeling. We show that expertise-based keyword retrieval, grounded in knowledge about contributing risk factors, results in better interannotator agreement in both novice-novice and novice-expert annotation when the keywords reflect the task at hand.

Related Work
Data on suicide traditionally comes from healthcare organizations, large-scale studies, or self-reporting (Crosby et al., 2011; Horowitz and Ballard, 2009). These sources are limited by sociocultural barriers (Crosby et al., 2011), such as stigma and shame. Moreover, data on suicide is never particularly reliable because suicide is a fundamentally subjective, complex phenomenon with a low base rate. For these reasons, many researchers tend to focus on the relationship between risk factors and suicidal behavior, without relying heavily on theoretical models (Nock et al., 2008).
Approximately one-third of all individuals who reported suicidal ideation in their lifetime made a plan to commit suicide. Nearly three-quarters of those who reported making a suicide plan actually attempted. The odds of attempting suicide increased exponentially when individuals endorsed three or more risk factors, e.g., having a mood or substance abuse disorder (Kessler et al., 1999).
Regarding the use of annotation for predictive modeling, evidence suggests that when it comes to judgments that involve clinical phenomena, experts and novices behave differently (Li et al., 2012;Womack et al., 2012). Such distinctions intuitively make sense, as the learning of medical domain knowledge requires advanced education in conjunction with substantial practical field experience.
In a task such as medical image inspection, the subtle cues that point an observer to evidence of a clinical condition, while accessible to experts whose training and perceptual expertise guide their exploration, are likely to be missed by novices who lack that background and clinical understanding. Such expertise can then be integrated into human-centered health-IT systems (Guo et al., 2014), in order to introduce novel ways to retrieve medical images and take advantage of an understanding of which information is useful. It is reasonable to assume that this knowledge gap also applies to other knowledge-intensive clinical domains such as mental health. In this study, we explore this question and examine whether novice vs. expert annotation makes a difference for identifying distress in social media texts, as well as what the impact of expert vs. novice annotation is for subsequent computational modeling with the annotated data.
Affect in language is a phenomenon that has been studied in the speech and text analysis domains, and in many others (Calvo and D'Mello, 2010). Clearly, emotion is a key element in the human experience, but it is notoriously difficult to pin down, and scholars in the affective sciences lack a single agreed-upon definition for emotion. Accordingly, different theoretical constructs have been proposed to describe affect and affect-related behaviors (Picard, 1997). In addition, research on affect in language has shown that such phenomena tend to be subjective, lack real ground truth (often resulting in moderate kappa scores), and have particularly fuzzy semantics in the gray zone where neutrality and emotion meet (Alm, 2008). These kinds of problem characteristics bring with them their own set of demanding challenges from a computational perspective (Alm, 2011). Yet, the nature of such problems makes them incredibly important to study, despite the challenges involved. Sentiment analysis has been widely studied in a number of computational settings, including on various social networking sites. A rather substantial body of work already exists on the use of Twitter to study emotion (Bollen et al., 2011b; Dodds et al., 2011; Wang et al., 2012; Pfitzner et al., 2012; Kim et al., 2012; Bollen et al., 2011a; Bollen et al., 2011c; Mohammad, 2012; Golder and Macy, 2011; De Choudhury et al., 2012a; De Choudhury et al., 2012b; Hannak et al., 2012; Thelwall et al., 2011; Pak and Paroubek, 2010). For instance, Golder and Macy study aggregate global trends in "mood," and show, among other things, that people wake up in a relatively good mood that decays as the day progresses (Golder and Macy, 2011). Bollen et al. (2011c) show that tweets from users who took a standard diagnostic instrument for mood are often tied to current events, such as elections and holidays.
Relatively little of this work has focused on suicide or related psychological conditions. Masuda et al. (2013) study suicide on mixi (a Japanese social networking service). Cheng et al. (2012) consider the ethical and political implications of online data collection for suicide prevention. Jashinsky et al. (2013) show correlations between the frequency of tweets related to suicide and actual suicide rates in the 50 United States of America. Sadilek et al. (2014) (Wray et al., 2011), but most of this work concerned offline social systems.

Methods
Our methods involve four main phases: (1) We filtered a corpus, obtained from Sadilek et al. (2012), of approximately 2.5 million tweets from 6,237 unique users in the New York City area that were sent during a 1-month period between May and June, 2010, into a set of 2,000 tweets that are relatively likely to be centered around suicide risk factors.
(2) We annotated each of these 2,000 tweets with their level of distress, and also analyzed the annotations in detail. (3) We then trained support vector machines and topic models with the annotated data, except for a held-out subset of 200 tweets. (4) Finally, we assessed the effectiveness of these methods on the held-out data.

Table 1: Summary statistics and thematic category distributions of the collected dataset. The data were collected from NYC. Geo-active users are those who geo-tag (i.e., automatically post the GPS location of) their tweets relatively frequently (more than 100 times per month).

Filtering tweets
In order to facilitate the discovery of distress-related tweets, we first (a) converted all text to lower case; (b) stripped out punctuation and special characters; and (c) mapped informal terms (such as abbreviations and netspeak) to more standard ones, based on the noslang dictionary.1 We then used two different methods to filter tweets that are relatively likely to center on suicide risk factors. We used LIWC to capture 1,370 tweets by sampling randomly from among the 2,000 tweets with the highest LIWC sad score. LIWC has been widely used to estimate emotion in online social networks, and specifically to estimate mood on Twitter. The slight randomness in this filtering step was intended to avoid selecting obvious false positives, such as the use of "sad" in nicknames.
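The preprocessing and sad-score filtering steps described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the `NOSLANG` and `SAD_LEXICON` dictionaries below are tiny toy stand-ins for the real noslang dictionary and the (proprietary) LIWC sad category, and the function names are ours.

```python
import re
import random

# Toy stand-ins for the noslang dictionary and the LIWC "sad" lexicon;
# the real resources are far larger.
NOSLANG = {"im": "i am", "u": "you", "thx": "thanks"}
SAD_LEXICON = {"sad", "cry", "miss", "alone", "hurt", "lost"}

def normalize(tweet: str) -> str:
    """Lowercase, strip punctuation/special characters, expand netspeak."""
    text = tweet.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [NOSLANG.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

def sad_score(tweet: str) -> float:
    """Fraction of tokens found in the sad lexicon (a crude LIWC proxy)."""
    tokens = normalize(tweet).split()
    if not tokens:
        return 0.0
    return sum(tok in SAD_LEXICON for tok in tokens) / len(tokens)

def filter_sad(tweets, pool_size=2000, sample_size=1370, seed=0):
    """Rank tweets by sad score, then sample randomly from the top pool
    to avoid always selecting obvious false positives."""
    ranked = sorted(tweets, key=sad_score, reverse=True)
    pool = ranked[:pool_size]
    rng = random.Random(seed)
    return rng.sample(pool, min(sample_size, len(pool)))
```

The random sampling from the high-scoring pool mirrors the paper's rationale: a deterministic top-k cut would over-select tweets where "sad" appears in a misleading position, such as a nickname.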
Next, we adopted a collection of inclusive search terms/phrases from Jashinsky et al. (2013), which was designed specifically for capturing tweets related to suicide risk factors, and applied them to our source corpus. We added to these more terms, from (Crosby et al., 2011) (see Table 2). These terms yielded 630 tweets.

Novice and Expert Tweet Annotation
We then divided the resulting set of 2,000 filtered tweets (1,370 from the LIWC sad dimension and 630 from suicide-specific search terms) into two randomized sets of 1,000 tweets each. Both sets had the same proportion of LIWC-filtered and suicide-specific-filtered tweets. A novice annotated the first set, and a counseling psychologist with experience in suicide-related research annotated the second set. A second novice annotated a subset of 250 tweets of the first set, to reveal interannotator agreement between novices, as one might expect a novice without training to be less systematic. (The annotators were among the authors.) Each tweet in each set was rated on a four-point scale (H, ND, LD, HD) according to the level of distress evident (Table 3).
Each tweet to be annotated was presented with context in the form of the three tweets the tweeter made before and the three after it, along with the timestamps of those tweets and the thematic categories to which the tweet belonged, based on the filtering process (Figure 1).
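Assembling such a context window for an annotator amounts to slicing the user's timeline around the target tweet. A minimal sketch (the function name and the relative-position encoding are our own illustration, not the paper's tooling):

```python
def context_window(user_tweets, idx, k=3):
    """Return the target tweet plus up to k tweets before and after it
    from the same user, as (relative_position, text) pairs for display
    to an annotator; position 0 marks the tweet to be annotated."""
    lo = max(0, idx - k)
    hi = min(len(user_tweets), idx + k + 1)
    return [(i - idx, user_tweets[i]) for i in range(lo, hi)]
```

Clamping at the timeline boundaries means a tweet near the start or end of a user's stream simply receives a shorter context window.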

Modeling
We then mapped each tweet to a feature space composed of the unigrams, bigrams, and trigrams in the corpus. For example, a simple tweet "I am so happy" was represented as the following feature vector: {I, am, so, happy, I am, am so, so happy, I am so, am so happy}. Each feature is associated with its tf-idf score (Manning et al., 2008).

Figure 1: Example input for annotator. The tweet to be annotated is indicated by >>>. Annotators were given context in the form of the three tweets immediately preceding, and the three tweets immediately following, the tweet to be annotated from the same tweeter, along with the relative time at which each tweet was made. Each numerical label denotes one of these context tweets. (Tweeter information has been blanked out.)

Table 3: Distress-related categories used to annotate the tweets.

Code | Distress Level
H    | happy
ND   | no distress
LD   | low distress
HD   | high distress

We performed topic modeling on our dataset. A topic is a set of lexical items that are likely to occur in the same tweet. Topic models are capable of associating words with similar meanings and distinguishing among the different meanings of a single word. We used latent Dirichlet allocation (LDA) (Blei et al., 2003) to create these topics. Before doing so, we removed stop words and words that occur only once in the dataset. We then applied the LDA algorithm to the data to discover three topics using 100 iterations.
We used support vector machines (SVMs) (Joachims, 1998), a machine learning method that is used to train a classification model that can assign class labels to previously unseen tweets, to assess the power of our annotations. SVMs treat each tweet as a point in an extremely high-dimensional space (one dimension per uni-, bi-, and trigram in the corpus). SVMs are a form of linear separator that can also distinguish between non-linearly separable classes of data by warping the feature space (though in our case we perform no such warping, or kernelization). They have proven to be an extremely effective tool for classifying text in numerous settings, including Twitter. Figure 2 shows the distribution of annotation labels for the subset of tweets that Novices 1 and 2 both annotated, and Figure 3 compares the overall annotation distributions between Novice 1 and the Expert. Interestingly, the novices are relatively conservative, compared to the expert, in assigning distressed labels, whereas the expert exhibits a higher sensitivity toward low distress than either of the novices. This suggests that it is important in this domain not to rely too much on novice judgments, as novices are not trained to pick up on subtle cues, in contrast to the clinically trained eye.
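The linear, unkernelized SVM setup over n-gram tf-idf features can be sketched as follows. The training tweets and labels are toy examples, and we use scikit-learn's `LinearSVC` as one standard linear SVM implementation; the paper cites Joachims (1998), whose SVMlight is another.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training tweets; real labels come from the annotation.
train_tweets = [
    "i am so happy today",
    "great day with friends",
    "i feel so alone and worthless",
    "i hate everything i give up",
]
train_labels = ["non-distress", "non-distress", "distress", "distress"]

# A linear (unkernelized) SVM over uni-, bi-, and trigram tf-idf
# features, mirroring the setup described above.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),
    LinearSVC(),
)
model.fit(train_tweets, train_labels)
pred = model.predict(["i feel alone"])  # classify a previously unseen tweet
```

The pipeline keeps featurization and classification together, so a previously unseen tweet is mapped into the same n-gram space before the linear separator assigns its label.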

Results
Note that there are very few happy tweets, which confirms that our filtering was effective in removing tweets of the opposite polarity. Table 4 shows the Cohen's kappa score between Novices 1 and 2 when high and low distress are grouped into a single category, as are no distress and happy, and Tables 5-7 show the confusion matrices between Novices 1 and 2. In all cases the kappa score is moderate. However, it clearly improves when annotation is restricted to just those tweets filtered using the suicide-thematic inclusion terms of Jashinsky et al. (2013). This again seems to point to the usefulness of incorporating clinical expertise into the process.
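Binarized agreement of the kind reported in Table 4 can be computed as in the following sketch; the ratings shown are hypothetical, and scikit-learn's `cohen_kappa_score` is assumed as the kappa implementation.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from two annotators on the four-point scale.
novice1 = ["ND", "LD", "HD", "ND", "H", "LD", "ND", "HD"]
novice2 = ["ND", "ND", "HD", "ND", "H", "LD", "LD", "HD"]

def binarize(labels):
    """Collapse to distress (LD/HD) vs. non-distress (ND/H)."""
    return ["D" if lab in {"LD", "HD"} else "N" for lab in labels]

# Agreement on the full four-point scale vs. the binarized grouping.
kappa_4way = cohen_kappa_score(novice1, novice2)
kappa_binary = cohen_kappa_score(binarize(novice1), binarize(novice2))
```

Binarizing before computing kappa typically raises agreement, since disagreements between adjacent levels (e.g., LD vs. HD) no longer count against the annotators.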
Due to their sensitive nature, we decided not to provide examples of high distress tweets. Here is an example of a tweet labeled as low distress by two annotators:

• @XXXX i'm still sad thoo. i feel neglected! and i miss XXXX

And here are two examples of tweets labeled as no distress by two annotators.
• i did mad push-ups tryna get that cut up look, then look at myself after a shower ... #plandidntwork; thats #whyiaintgotomiami
• my son is gonna have blues eyes and nappy hair! yes yes yes

The above examples are rather clear-cut; however, in many cases the tweets were more ambiguous, even when annotators had the preceding and succeeding three tweets from the same user to rely on for context. While context and time-offset information was useful for annotators, distress annotation is clearly a challenging task, as the confusion matrices in Tables 5-6 reveal. The lower agreement levels, and particularly the fuzzy border between 'no distress' and 'low distress', are completely in line with prior research, discussed above, on affective language phenomena.
Another filtering and annotation challenge involves tweets with mixed emotion, such as:
• as much as i hate my job some of the people i work with are amazing.
Beyond the targeted annotation categories of distress level, there were emerging themes of aggression, privilege and oppression, and daily struggles, among others. For instance, jobs were a popular source of distress:
• i friggin hate these bastards my job grimey ass bastards knew i wanted the day off and tell me some next shit
• hate my job wit a passion! hate every1 there.. they better do sumthin about it, or im out!

Personal bias may have impacted annotation decisions. For instance, numerous tweets contained irony and dark humor, which may result in annotators underestimating or overlooking actual distress. In addition, by pulling data from Twitter, any non-Twitter context behind the tweets is lost. For example, a few individuals retweeted in a sarcastic manner about what individuals should say to someone who is considering suicide:
• you wish!!! rt @XXXX: i think suicide is funny. especially once my mom does it
• rt @XXXX: what do i say to a person thats asking me for advice becuz they thinking bout committing suicide when i see there point? lmao

Without knowing the circumstances of the original message (beyond the provided context window) it is difficult to classify such tweets. Finally, a number of tweets seemed to show compassion or empathy for others experiencing stress. This suggests to us the profound role that social support plays in well-being and depression, that one's friends and associates can also provide clues into one's emotional state, and that social media can reveal such behavior.

Table 8: Topic analysis on bigrams of tweets labeled as high distress vs. randomly selected tweets from the larger, unlabeled dataset. The high distress tweets clearly convey strong negative affect.

Table 8 shows the results of a 3-category topic model on bigrams. The first column is drawn just from tweets labeled high distress by any one of the three annotators (72 tweets total).
The second column comes from a randomly chosen sample of 2,000 tweets from the 2.3 million tweet corpus. These results show that the lexical contents of the annotated tweets are recognizably different from the random sample. By our judgment, the topical groupings in the rows of the high distress column are all clearly marked by strong negative affect, and additionally they could arguably be labeled, from top to bottom, as: "failure and defeat," "loss," and "loneliness." The rows of the second column are less clear-cut, and appear to reflect a much broader scope of topics. One interesting aspect of the second, random column is that recording artist Chris Brown had released a new album during the collection period, which seems to explain why his name appeared.

Table 9: Performance of SVM-based classification when the training and testing sets are alternately from Novice 1 (N1) or the Expert (E). Because we focus on distress classification, we report precision, recall, and F-measure for the distress class, which combines LD and HD into a single class with respect to binary (distress vs. non-distress) classification. In each case, a held-out set of 100 randomly selected tweets composes the test set and the remaining 900 tweets from that annotator compose the training set. The last row shows results when the two training sets (respectively, test sets) are combined into a single set of 1,800 (respectively, 200) tweets.
For classification, because we are most interested in being able to separate distressed from non-distressed tweets, we combine low distress and high distress into a single distress class, and no distress and happy into a non-distress class. Table 9 shows the performance of the SVM-based classifier when trained and tested on the Expert and Novice 1 sets. Four themes emerge: (1) the SVM classifier is much more accurate (in terms of F-measure) when the testing and training data come from the same annotator (with test and training data disjoint), and the best performance comes from the expert-annotated data. (2) When testing and training data are from different annotators, the F-measure of the SVM is lower when the training set is from the novice rather than the expert. (3) When testing and training data are from different annotators, the SVM has lower recall and higher precision when the training set is from the novice rather than the expert. This is in part because the Expert was more sensitive to distress than Novice 1. It is premature to draw conclusions from this observation, but perhaps it shows that training with expert-labeled annotations is preferable to using novice-labeled data, especially when our goal is to discover distressful tweets for the purpose of identifying at-risk individuals and to err on the side of caution (high recall). (4) Integrating more but mixed data does not improve performance.
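The distress-class metrics reported in Table 9 can be computed as in the following sketch; the gold and predicted labels below are hypothetical, and scikit-learn's `precision_recall_fscore_support` is assumed as the metric implementation.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold and predicted labels after collapsing LD/HD into
# "distress" and ND/H into "non-distress".
gold = ["distress", "non-distress", "distress", "distress", "non-distress"]
pred = ["distress", "non-distress", "non-distress", "distress", "distress"]

# Precision, recall, and F-measure for the distress class only.
p, r, f, _ = precision_recall_fscore_support(
    gold, pred, pos_label="distress", average="binary"
)
```

Treating "distress" as the positive class means recall directly measures the fraction of genuinely distressed tweets the classifier recovers, which is the quantity to maximize when erring on the side of caution.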

Discussion
As previously mentioned, many of the risk factors for suicidal behavior may be linked to other expressions of distress, such as aggression and interpersonal violence (Mann et al., 1999). The goal of this study is to determine the feasibility of classifying distress to enable further study of expressed suicidal behaviors. Consistent with the stress-diathesis model for suicidal behavior, aggression was an emerging theme that arose from the data. Here are some examples:
• @XXXX i don't feel sad 4 him. he gets pissed n says wat he wants then sends out fony apologies
• @XXXX cuz he's n a relationship with that horseface bitch & he lied 2 me & i feel so used & worthless now

Some individuals tweeted about feeling empty, hopeless, angry, frustrated, and alone. Behaviors indicating bullying and schadenfreude were also observed. While these are all risk factors for internalizing aggression (i.e., suicidal behavior), they are also associated with externalized aggression. In addition to overt expressions of anger and violence, many of the humorous, ironic tweets also had an aggressive undertone.

Limitations
As ground truth, we rely on tweets hand-annotated by an expert and by novices for classification. However, the mental state of another individual, observed from a few lines of text often written in an informal register, is necessarily hard to discern and, even under less noisy conditions, extremely subjective; even the observers' personal understandings of such concepts as "distress" may differ drastically. This makes annotation quite a challenge, and it does not reveal in an objective fashion a tweeter's true mental state. As we have mentioned earlier, self-reporting has its own limitations, yet it is often regarded as the gold standard for ground truth about emotional state. Part of the problem in assessing the effectiveness of self-reporting is the relative rarity with which suicide occurs, and the inherent subjectivity of the act, which makes any data on suicide fuzzy. We hope to explore in future work the relationship between clinical observation in both on- and off-line settings and self-reporting, including the integration of natural language data from patients in clinical settings. We also hope to explore distress annotation from different perspectives and levels of context.
Higher levels of suicidal ideation have an inverse relationship with all types of help-seeking and a positive correlation with the decision not to seek support (Deane et al., 2001). Thus, we would expect suicidal individuals to generally be less active on social media than those who are not. Nevertheless, a number of studies have shown a positive correlation between online social network use and negative mood. Perhaps this means, in part, that individuals who are depressed are slower to disengage on- rather than off-line.

Conclusion
We studied the performance of different approaches to training systems to detect evidence of suicide risk behavior in microblog data. We showed that both the methods used to automatically collect training sets and the expertise level of the annotator greatly affect the performance of automatic systems for detecting suicide risk factors. In general, our study and its results, from filtering through data annotation to classification, confirmed the critical importance of bringing clinical expertise into the computational modeling loop.