Self-disclosure topic model for Twitter conversations

Self-disclosure, the act of revealing oneself to others, is an important social behavior that contributes positively to intimacy and social support from others. It is a natural behavior, and social scientists have carried out numerous quantitative analyses of it through manual tagging and survey questionnaires. Recently, the flood of data from online social networks (OSN) offers a practical way to observe and analyze self-disclosure behavior at an unprecedented scale. The challenge with such analysis is that OSN data come with no annotations, and it would be impossible to manually annotate the data for a quantitative analysis of self-disclosure. As a solution, we propose a semi-supervised machine learning approach, using a variant of latent Dirichlet allocation for automatically classifying self-disclosure in a massive dataset of Twitter conversations. For measuring the accuracy of our model, we manually annotate a small subset of our dataset, and we show that our model shows significantly higher accuracy and F-measure than various other methods.
With the results of our model, we uncover a positive and significant relationship between self-disclosure and online conversation frequency over time.

Introduction
Self-disclosure is an important and pervasive social behavior. People disclose personal information about themselves to improve and maintain relationships (Jourard, 1971; Joinson and Paine, 2007). For example, when two people meet for the first time, they disclose their names and interests. One positive outcome of self-disclosure is social support from others (Wills, 1985; Derlega et al., 1993), a pattern also shown in online social networks (OSN) such as Twitter. Receiving social support in turn leads users to be more active on OSN (Steinfield et al., 2008; Trepte and Reinecke, 2013). In this paper, we seek to understand this important social behavior using large-scale Twitter conversation data, automatically classifying the level of self-disclosure using machine learning and correlating the patterns with subsequent OSN usage.
Twitter conversation data, explained in more detail in section 4.1, enable a significantly larger-scale study of naturally occurring self-disclosure behavior than traditional social science studies. One challenge of such a large-scale study, though, is the lack of labeled ground-truth data on self-disclosure level.
That is, naturally-occurring Twitter conversations do not come tagged with the level of self-disclosure in each conversation. To overcome that challenge, we propose a semi-supervised machine learning approach using probabilistic topic modeling. Our self-disclosure topic model (SDTM) assumes that self-disclosure behavior can be modeled using a combination of simple linguistic features (e.g., pronouns) with automatically discovered semantic themes (i.e., topics). For instance, an utterance "I am finally through with this disastrous relationship" uses a first-person pronoun and contains a topic about personal relationships.
In comparison with various other models, SDTM shows the highest accuracy, and the resulting self-disclosure patterns of the users are correlated significantly with their future OSN usage. Our contributions to the research community include the following: • We present a topic model that explicitly includes the level of self-disclosure in a conversation using linguistic features and the latent semantic topics (Sec. 3).
• We collect a large dataset of Twitter conversations over three years and annotate a small subset with self-disclosure level (Sec. 4).
• We compare the classification accuracy of SDTM with other models and show that it performs the best (Sec. 5).
• We correlate the self-disclosure patterns of users and their subsequent OSN usage to show that there is a positive and significant relationship (Sec. 6).

Background
In this section, we review literature on the relevant aspects of self-disclosure. Self-disclosure (SD) level: To quantitatively analyze self-disclosure, researchers categorize self-disclosure language into three levels: G (general) for no disclosure, M for medium disclosure, and H for high disclosure (Vondracek and Vondracek, 1971; Barak and Gluck-Ofri, 2007). Utterances that contain general (non-sensitive) information about the self or someone close (e.g., a family member) are categorized as M. Examples are personal events, past history, or future plans; utterances about age, occupation, and hobbies are also included. Utterances that contain sensitive information about the self or someone close are categorized as H. Sensitive information includes personal characteristics, problematic behaviors, physical appearance, and wishful ideas; generally, these are thoughts and information that one would keep to oneself as secrets. All other utterances, those that contain no information about the self or someone close, are categorized as G. Examples include gossip about celebrities or factual discourse about current events.
Classifying self-disclosure level: Prior work on quantitatively analyzing self-disclosure has relied on user surveys (Trepte and Reinecke, 2013; Ledbetter et al., 2011) or human annotation (Barak and Gluck-Ofri, 2007). These methods consume much time and effort, so they are not suitable for large-scale studies. In the prior work closest to ours, it was shown that a topic model can be used to identify self-disclosure, but that work applies a two-step process in which a basic topic model is first applied to find the topics, and the topics are then post-processed for binary classification of self-disclosure. We improve upon this work by applying a single unified model of topics and self-disclosure for high accuracy in classifying the three levels of self-disclosure.
Self-disclosure and online social networks: According to social psychology, when people disclose about themselves, they receive social support from those around them (Wills, 1985; Derlega et al., 1993), and this pattern of self-disclosure and social support has been verified on Twitter conversation data. Social support is a major motivation for active usage of social network services (SNS), and there are findings that self-disclosure on SNS has a positive longitudinal effect on future SNS use (Trepte and Reinecke, 2013; Ledbetter et al., 2011). While these previous studies were small and qualitative, we conduct a large-scale, machine-learning-driven study of the relation between self-disclosure behavior and SNS use.

Self-Disclosure Topic Model
This section describes our model, the self-disclosure topic model (SDTM), for classifying self-disclosure levels and discovering the topics of each self-disclosure level.

Model
We make two important assumptions based on our observations of the data. First, first-person pronouns (I, my, me) are good indicators of the medium level of self-disclosure. For example, phrases such as 'I live' or 'My age is' occur in utterances that reveal personal information. Second, some topics occur much more frequently at a particular SD level. For instance, topics such as physical appearance and mental health occur frequently at level H, whereas topics such as birthdays and hobbies occur frequently at level M. Figure 1 illustrates the graphical model of SDTM and how these assumptions are embodied in it. The first assumption about first-person pronouns is implemented by the observed variable x_ct and the parameters λ of a maximum entropy classifier for the G vs. M/H decision. The second assumption is implemented by three separate word-topic probability vectors, one per SD level: φ^l with a Bayesian informative prior β^l, where l ∈ {G, M, H}. Table 1 lists the notation used in the model, and Figure 2 describes the generative process.

Classifying G vs M/H levels
Classifying the SD level of each tweet is done in two parts; the first part classifies the G vs. the M/H levels using first-person pronouns (I, my, me). In the graphical model, y is the latent variable that represents this classification, and ω is the distribution over y. x is the observation of a first-person pronoun in the tweet, and λ are the parameters learned by the maximum entropy classifier.
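As a minimal illustration, the observed variable x can be computed per tweet as below; the whitespace tokenization is our simplification, not the paper's preprocessing:

```python
# Hypothetical sketch: the observed first-person-pronoun indicator x
# used by SDTM's G vs. M/H gate. Tokenization is deliberately simple.
FIRST_PERSON = {"i", "my", "me"}

def first_person_indicator(tweet: str) -> int:
    """Return 1 if the tweet contains a first-person pronoun, else 0."""
    tokens = tweet.lower().split()
    return int(any(tok.strip(".,!?'\"") in FIRST_PERSON for tok in tokens))
```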
With the annotated Twitter conversation dataset (described in section 4.2), we experimented with several classifiers (decision tree, naive Bayes) and chose the maximum entropy classifier because it performed the best, similar to other joint topic models (Zhao et al., 2010; Mukherjee et al., 2013).
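As a hedged sketch (not the authors' implementation), a binary maximum entropy classifier reduces to logistic regression; below it is trained by stochastic gradient ascent on a toy pronoun feature:

```python
# Minimal binary maximum-entropy (logistic regression) classifier,
# trained by gradient ascent. Feature vectors and data are toy
# illustrations of the G vs. M/H decision, not the study's features.
import math

def train_maxent(X, y, lr=0.5, iters=500):
    """Fit weights (last entry is the bias) by maximizing log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(iters):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi):   # gradient of the log-likelihood
                w[j] += lr * (yi - p) * xj
            w[-1] += lr * (yi - p)
    return w

def predict(w, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
    return int(z > 0)

# toy data: feature = [has_first_person_pronoun]; label 1 = M/H, 0 = G
X = [[1], [1], [1], [0], [0], [0], [1], [0]]
y = [1, 1, 1, 0, 0, 0, 1, 0]
w = train_maxent(X, y)
```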

Classifying M vs H levels
The second part of the classification, between the M and H levels, is driven by informative priors with seed words and seed trigrams. Utterances at the M level include two types: 1) information about past events and future plans, and 2) general information about the self (Barak and Gluck-Ofri, 2007). For the former, we add 'I have been' and 'I will' as seed trigrams. For the latter, we use seven types of information generally accepted to be personally identifiable (McCallister, 2010), as listed in the left column of Table 2. To find appropriate trigrams for these, we take the Twitter conversation data (described in section 4.1) and look for trigrams that begin with 'I' or 'my' and occur more than 200 times, then check each one to see whether it relates to any of the seven types listed in the table. As a result, we find 57 seed trigrams for the M level. Utterances at the H level express secretive wishes or sensitive information that exposes the self or someone close (Barak and Gluck-Ofri, 2007), which one would generally keep as secrets. With this intuition, we crawled 26,523 secret posts from the Six Billion Secrets site, where users post secrets anonymously. To extract seed words that might express secretive personal information, we compute mutual information (Manning et al., 2008) between the secret posts and 24,610 randomly selected tweets, select the 1,000 words with the highest mutual information, and filter out stop words. Table 3 shows some of these words. To extract seed trigrams of secretive wishes, we again look for trigrams that start with 'I' or 'my' and occur more than 200 times, and select trigrams of wishful thinking such as 'I want to' and 'I wish I'. In total, there are 88 seed words and 8 seed trigrams for the H level.
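The mutual-information seed selection can be sketched as follows, using the feature-selection form of MI from Manning et al. (2008); the two corpora here are toy stand-ins for the secret posts and the random tweets:

```python
# Sketch of seed-word selection by mutual information between a
# "secret" corpus and a background tweet corpus. Corpora are toy data.
import math

def mutual_information(word, secret_docs, background_docs):
    """MI of the word's presence with corpus membership (base-2 log)."""
    n1, n0 = len(secret_docs), len(background_docs)
    n = n1 + n0
    n11 = sum(word in d for d in secret_docs)      # present in secrets
    n10 = sum(word in d for d in background_docs)  # present in background
    n01, n00 = n1 - n11, n0 - n10
    present, absent = n11 + n10, n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in ((n11, present, n1), (n10, present, n0),
                           (n01, absent, n1), (n00, absent, n0)):
        if n_tc > 0:
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

# toy corpora (each post as a set of tokens); the real data were
# 26,523 secret posts and 24,610 random tweets
secrets = [{"wish", "die", "common"}, {"wish", "alone"}, {"scared", "alone"}]
tweets = [{"game", "common"}, {"lunch", "good"}, {"game", "good"}]
```

Words that occur only in one corpus ("wish") score high; words spread evenly across both ("common") score near zero.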

Inference
For posterior inference in SDTM, we use collapsed Gibbs sampling, which integrates out the latent random variables ω, π, θ, and φ, so that we only need to sample y, r, and z for each tweet. We compute the full conditional distribution p(y_ct = j, r_ct = l, z_ct = k | y_-ct, r_-ct, z_-ct, w, x) for tweet ct, where y_-ct, r_-ct, z_-ct denote y, r, z without tweet ct, and m_ctk(·) is m_ctkv marginalized over words v.
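Since SDTM's sampler extends the standard count-based updates of collapsed Gibbs sampling, the sketch below shows the procedure for plain LDA; SDTM additionally samples y and r per tweet and keeps separate counts per disclosure level. This is an illustration on a toy corpus, not the authors' code:

```python
# Collapsed Gibbs sampling for plain LDA: maintain doc-topic,
# topic-word, and topic-total counts, and resample each token's topic
# from its full conditional after removing its current assignment.
import random
random.seed(0)

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200):
    ndk = [[0] * K for _ in docs]          # doc-topic counts
    nkw = [[0] * V for _ in range(K)]      # topic-word counts
    nk = [0] * K                           # topic totals
    z = []
    for d, doc in enumerate(docs):         # random initialization
        zd = []
        for w in doc:
            k = random.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional p(z = k | rest), up to a constant
                probs = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) /
                         (nk[j] + V * beta) for j in range(K)]
                r = random.random() * sum(probs)
                acc = 0.0
                for j, p in enumerate(probs):
                    acc += p
                    if r <= acc:
                        k = j
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

# toy corpus over a vocabulary of 4 word ids
docs = [[0, 0, 1, 1], [0, 1, 0, 1], [2, 3, 2, 3], [2, 2, 3, 3]]
ndk, nkw = gibbs_lda(docs, V=4, K=2)
```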

Data Collection and Annotation
To answer our research questions, we need a large longitudinal dataset of conversations such that we can analyze the relationship between selfdisclosure behavior and conversation frequency over time. We chose to crawl Twitter because it offers a practical and large source of conversations (Ritter et al., 2010). Others have also analyzed Twitter conversations for natural language and social media research (Boyd et al., 2010;Danescu-Niculescu-Mizil et al., 2011), but we collect conversations from the same set of dyads over several months for a unique longitudinal dataset.

Collecting Twitter conversations
We define a Twitter conversation as a chain of tweets where two users are consecutively replying to each other's tweets using the Twitter reply button. We identify dyads of English-tweeting users with at least twenty conversations and collect their tweets.
We use an open source tool for detecting English tweets, and to protect users' privacy, we replace Twitter user IDs, usernames and URLs in tweets with random strings. The dataset consists of 101,686 users, 61,451 dyads, 1,956,993 conversations and 17,178,638 tweets posted between August 2007 and July 2013.
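Assembling a conversation chain from reply links can be sketched as below; the tweet field names (id, user, reply_to) are hypothetical stand-ins, not the Twitter API's:

```python
# Hypothetical sketch: follow reply pointers back from the last tweet
# to recover a conversation chain, then check it is a strict dyad.
def conversation_chain(tweets, last_id):
    """Return the chain of tweets ending at last_id, oldest first."""
    by_id = {t["id"]: t for t in tweets}
    chain, tid = [], last_id
    while tid is not None and tid in by_id:
        t = by_id[tid]
        chain.append(t)
        tid = t["reply_to"]
    chain.reverse()
    return chain

def is_dyadic(chain):
    """True if exactly two users alternate replying to each other."""
    users = [t["user"] for t in chain]
    return (len(set(users)) == 2 and
            all(a != b for a, b in zip(users, users[1:])))

tweets = [
    {"id": 1, "user": "alice", "reply_to": None},
    {"id": 2, "user": "bob", "reply_to": 1},
    {"id": 3, "user": "alice", "reply_to": 2},
]
chain = conversation_chain(tweets, 3)
```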

Annotating self-disclosure level
To measure the accuracy of our model, we randomly sample 101 conversations, each with ten or fewer tweets, and ask three judges, all fluent in English, to annotate each tweet with its level of self-disclosure. The judges first read and discussed the definitions and examples of self-disclosure levels given by Barak and Gluck-Ofri (2007), then worked separately on a Web-based platform.

Classification of Self-Disclosure Level
This section describes experiments and results for SDTM and several other methods on classification of self-disclosure level. We start from the annotated dataset of section 4.2, in which each tweet is annotated with its SD level. We then aggregate the tweets of each conversation and compute the proportion of tweets at each SD level. When the proportion of tweets at the M or H level is equal to or greater than 0.2, we assign the level with the larger proportion to the conversation. When the proportions of tweets at the M and H levels are both less than 0.2, we assign G as the conversation's SD level.
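The aggregation rule above can be written directly; the tie-breaking between equal M and H proportions is our assumption, since the text does not specify it:

```python
# Conversation-level SD label from tweet-level labels, per the 0.2
# threshold rule. Ties between M and H go to H (our assumption).
def conversation_level(tweet_levels):
    n = len(tweet_levels)
    p_m = tweet_levels.count("M") / n
    p_h = tweet_levels.count("H") / n
    if p_m < 0.2 and p_h < 0.2:
        return "G"
    return "M" if p_m > p_h else "H"
```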
We compare SDTM with the following methods for classifying SD level: • LDA (Blei et al., 2003): a Bayesian topic model. Each conversation is treated as a document. Used in previous work.
• MedLDA (Zhu et al., 2012): A supervised topic model for document classification. Each conversation is treated as a document and response variable can be mapped to a SD level.
• Seed words and trigrams (SEED): Occurrence of seed words and trigrams which are described in section 3.3.
• ASUM (Jo and Oh, 2011): A joint model of sentiment and topic using seed words. Each sentiment can be mapped to an SD level. Used in previous work.
• First-person pronouns (FirstP): Occurrence of first-person pronouns which are described in section 3.2. To identify first-person pronouns, we tagged parts of speech in each tweet with the Twitter POS tagger (Owoputi et al., 2013).
SEED, LIWC, LDA and FirstP cannot be used directly for classification, so we use a maximum entropy model with the outputs of each of those models as features. We run MedLDA, ASUM and SDTM 20 times each and compute the average accuracy and F-measure for each level. We set 40 topics for LDA, MedLDA and ASUM, and 60, 40 and 40 topics for the K_G, K_M and K_H of SDTM, respectively, with α = γ = 0.1. To incorporate the seed words and trigrams into ASUM and SDTM, we initialize β_G, β_M and β_H differently: we assign a high value of 2.0 to each seed word and trigram of a given level, a low value of 10^-6 to each word that is a seed word of another level, and a default value of 0.01 to all other words. This approach is the same as in other topic model work (Jo and Oh, 2011; Kim et al., 2013).

Table 4: SD level classification accuracies and F-measures using annotated data. Acc is accuracy, and G_F1 is the F-measure for classifying the G level. Avg_F1 is the average of G_F1, M_F1 and H_F1. SDTM outperforms all other methods compared. The difference between SDTM and FirstP is statistically significant (p < 0.05 for accuracy, p < 0.0001 for Avg_F1).

As Table 4 shows, SDTM performs better than the other methods in both accuracy and F-measure. LDA and MedLDA generally show the lowest performance, which is not surprising given that these models are quite general and not tuned for this type of semi-supervised classification task. LIWC and SEED perform better than LDA, but have quite low F-measures for the G and H levels. ASUM classifies the H level better than the others, but not the G level. FirstP shows a good F-measure for the G level, but its H-level F-measure is quite low, even lower than SEED's. Finally, SDTM performs similarly to FirstP on the G and M levels, but better than all others on the H level. Classifying the H level well is important because, as we will discuss later, the H level has the strongest relationship with longitudinal OSN usage (see Section 6.2), so SDTM is overall the best model for classifying self-disclosure levels.
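The asymmetric prior initialization described above can be sketched as follows; the vocabulary and seed sets are toy examples, not the mined seed lists:

```python
# Asymmetric prior over terms for one SD level: seed terms of this
# level get a high pseudo-count, seed terms of other levels a
# near-zero one, everything else the default. Values follow the text.
HIGH, LOW, DEFAULT = 2.0, 1e-6, 0.01

def build_beta(vocab, own_seeds, other_seeds):
    beta = {}
    for term in vocab:
        if term in own_seeds:
            beta[term] = HIGH
        elif term in other_seeds:
            beta[term] = LOW
        else:
            beta[term] = DEFAULT
    return beta

# toy vocabulary mixing trigram and unigram terms
vocab = ["i_want_to", "i_have_been", "party", "game"]
beta_h = build_beta(vocab, own_seeds={"i_want_to"},
                    other_seeds={"i_have_been"})
```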

Self-Disclosure and Conversation Frequency
In this section, we investigate whether there is a relationship between self-disclosure and conversation frequency over time. Trepte and Reinecke (2013) showed, through an online survey of Facebook and StudiVZ users, that frequent or high-level self-disclosure in online social networks (OSN) contributes positively to OSN usage, and vice versa. With SDTM, we can automatically classify the self-disclosure level of a large number of conversations, so we investigate whether there is a similar relationship between self-disclosure in conversations and the subsequent frequency of conversations with the same partner on Twitter. More specifically, we ask the following two questions: 1. If a dyad displays a high SD level in their conversations at a particular time period, do they have more frequent conversations subsequently?
2. If a dyad shows high conversation frequency at a particular time period, would they display higher SD in their subsequent conversations?

Experiment Setup
We first run SDTM on all of our Twitter conversation data with 150, 120 and 120 topics for the K_G, K_M and K_H of SDTM, respectively. The hyperparameters are the same as in section 5. To handle the large dataset, we employ a distributed algorithm (Newman et al., 2009). Table 5 shows some of the topics that are prominent at each SD level by KL-divergence. As expected, the G level includes general topics such as food, celebrities, soccer and IT devices, the M level includes personal communication and birthdays, and the H level includes sickness and profanity.
To compare conversation frequencies over time, we divide the conversations of each dyad into two sets. The initial period includes conversations from the dyad's first conversation to 60 days later, and the subsequent period includes conversations during the following 30 days.
We compute the proportion of conversations at each SD level for each dyad in the initial and subsequent periods. We also define a new measurement, the SD level score of a dyad in a period: a weighted sum over the dyad's conversations with the SD levels mapped to 1, 2 and 3 for G, M and H, respectively.
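Under our reading of the definition (an assumption, since the exact weighting is not spelled out), the SD level score reduces to a proportion-weighted mean over a dyad's conversations in the period:

```python
# SD level score for a dyad in a period: map conversation-level labels
# G, M, H to 1, 2, 3 and average. The averaging is our interpretation
# of the "weighted sum" in the text.
WEIGHT = {"G": 1, "M": 2, "H": 3}

def sd_level_score(conversation_levels):
    return sum(WEIGHT[lvl] for lvl in conversation_levels) / len(conversation_levels)
```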
6.2 Does self-disclosure lead to more frequent conversations?
We investigate the effect of the level of self-disclosure on long-term use of OSN. We run linear regression with the initial SD level score as the independent variable and the rate of change in conversation frequency between the initial and subsequent periods as the dependent variable.
The regression gives a coefficient of 0.118 for the independent variable with a low p-value (p < 0.001). Figure 3 shows the scatter plot with the regression line, whose slope is positive.
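The simple one-predictor regressions in this section can be reproduced with closed-form ordinary least squares; the data below are synthetic, not the study's:

```python
# Ordinary least squares for a single predictor via the closed-form
# slope and intercept. Data here are a synthetic example.
def ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# synthetic dyads: y follows exactly 0.5 * x + 1
x = [1.0, 1.5, 2.0, 2.5, 3.0]
y = [1.5, 1.75, 2.0, 2.25, 2.5]
slope, intercept = ols(x, y)
```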
We also investigate the importance of each SD level for changes in conversation frequency. We run linear regression with the initial proportions of each SD level as the independent variables and the same dependent variable as above. As Table 6 shows, there is no significant relationship between the initial proportion of the G level and the change in conversation frequency (p > 0.1). For the M and H levels, however, the initial proportions show positive and significant relationships with the subsequent change in conversation frequency (p < 0.0001). These results show that the M and H levels are correlated with changes in the frequency of conversation.

Figure 4: Relationship between initial conversation frequency and subsequent SD level. The solid line is the linear regression line, and the coefficient is 0.0016 with p < 0.0001, which shows a significant positive relationship.
6.3 Does high frequency of conversation lead to more self-disclosure?
Now we investigate whether the initial conversation frequency is correlated with the SD level in the subsequent period. We run linear regression with the initial conversation frequency as the independent variable and the SD level in the subsequent period as the dependent variable. The regression coefficient is 0.0016 with a low p-value (p < 0.0001). Figure 4 shows the scatter plot; the slope of the regression line is positive. This result accords with previous findings in social psychology (Leung, 2002) that the frequency and session time of use of the instant-messaging program ICQ were correlated with the depth of self-disclosure in messages.

Conclusion and Future Work
In this paper, we have presented the self-disclosure topic model (SDTM) for discovering topics and classifying SD levels in Twitter conversation data. We devised a set of effective seed words and trigrams, mined from a dataset of secrets, and annotated Twitter conversations to build a ground-truth dataset of SD levels. On the annotated data, we showed that SDTM outperforms previous methods in classification accuracy and F-measure.
We also analyzed the relationship between SD level and conversation frequency over time. We found a positive correlation between initial SD level and subsequent conversation frequency, and dyads show a higher level of SD if they initially display high conversation frequency. These results reinforce previous findings in social psychology with more robust evidence from a large-scale dataset, and show the importance of studying SD behavior in OSN.
There are several future directions for this research. First, we can improve our model for higher accuracy and better interpretability; SDTM considers only first-person pronouns and topics, and naturally there are patterns that humans can identify but that pronouns and topics do not capture. Second, the number of topics for each level must currently be specified in advance, so we can explore nonparametric topic models (Teh et al., 2006), which infer the number of topics from the data. Third, we can look at the relationship between self-disclosure behavior and general online social network usage beyond conversations.