Mining Themes and Interests in the Asperger’s and Autism Community



Introduction
Online forums can offer new insights on mental disorders by leveraging the experiences of affected individuals, in their own words. Such insights can potentially help mental health professionals and caregivers. Below is an example dialogue from the Aspies Central forum, where individuals who report being on the autism spectrum (and their families and friends) exchange advice and discuss their experiences:
• User A: Do you feel paranoid at work? . . . What are some situations in which you think you have been unfairly treated?
• User B: Actually I am going through something like that now, and it is very difficult to keep it under control. . .
• User A: Yes, yes that is it. Exactly . . . I think it might be an Aspie trait to do that, I mean over think everything and take it too literally?
• User B: It probably is an Aspie trait. I've been told too that I am too hard on myself.
Aspies Central, like other related forums, has thousands of such exchanges. However, aggregating insight from this wealth of information poses obvious challenges. Manual analysis is extremely time-consuming and labor-intensive, thus limiting the scope of data that can be considered. In addition, manual coding systems raise validity questions, because they can tacitly impose the preexisting views of the experimenter on all subsequent analysis. There is therefore a need for computational tools that support large-scale exploratory textual analysis of such forums.
In this paper, we present a tool for automatically mining web forums to explore textual themes and user interests. Our system is based on Latent Dirichlet Allocation (LDA; Blei et al., 2003), but is customized for this setting in two key ways:
• By modeling sparsely-varying topics, we can easily recover key terms of interest, while retaining robustness to large vocabulary and small counts (Eisenstein et al., 2011).
• By modeling author preference by topic, we can quickly identify topics of interest for each user, and simultaneously recover topics that better distinguish the perspectives of each author.
The key technical challenge in this work lies in bringing together several disparate modalities into a single modeling framework: text, authorship, and thread structure. We present a joint Bayesian graphical model that unifies these facets, discovering both an underlying set of topical themes, and the relationship of these themes to authors. We derive a variational inference algorithm for this model, and apply the resulting software on a dataset gathered from Aspies Central.
The topics and insights produced by our system are evaluated both quantitatively and qualitatively. In a blind comparison with LDA and the author-topic model, both subject-matter experts and lay users find the topics generated by our system to be substantially more coherent and relevant. A subsequent qualitative analysis aligns these topics with existing theory about the autism spectrum, and suggests new potential insights and avenues for future investigation.

Aspies Central Forum
Aspies Central (AC) is an online forum for individuals on the autism spectrum, and has publicly accessible discussion boards. Members of the site do not necessarily have to have an official diagnosis of autism or a related condition. Neurotypical individuals (people not on the autism spectrum) are also allowed to participate in the forum. The forum includes more than 19 discussion boards with subjects ranging from general discussions about the autism spectrum to private discussions about personal concerns. As of March 2014, AC hosts 5,393 threads, 89,211 individual posts, and 3,278 members.
AC consists of fifteen public discussion boards and four private discussion boards that require membership.
We collected data only from publicly-accessible discussion boards. In addition, we excluded discussion boards that were website-specific (announcement-and-introduce-yourself), those mainly used by family and friends of individuals on the spectrum (friends-and-family) or researchers (autism-news-and-research), and one for amusement (forum-games). Thus, we focused on ten discussion boards (aspergers-syndrome-Autism-and-HFA, PDD-NOS-social-anxiety-and-others, obsessions-and-interests, friendships-and-social-skills, education-and-employment, love-relationships-and-dating, autism-spectrum-help-and-support, off-topic-discussion, entertainment-discussion, computers-technology-discussion), in which AC users discuss their everyday experiences, concerns, and challenges. Using the Python library Beautiful Soup, we collected 1,939 threads (29,947 individual posts) from the discussion board archives over a time period from June 1, 2010 to July 27, 2013. For a given post, we extracted associated metadata such as the author identifier and posting timestamps.
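As an illustration of the extraction step, the following minimal sketch pulls the author identifier, timestamp, and text from each post with Beautiful Soup. The HTML structure, class names, and the `data-author` attribute are hypothetical, since the forum's actual markup is not described here; only the Beautiful Soup calls themselves are standard.

```python
from bs4 import BeautifulSoup

# Hypothetical thread markup: the tag names, classes, and attributes below are
# assumptions made for illustration, not the forum's real page structure.
SAMPLE_HTML = """
<ol class="messages">
  <li class="message" data-author="user_17">
    <time datetime="2012-03-04T10:15:00">Mar 4, 2012</time>
    <div class="body">Do you feel paranoid at work?</div>
  </li>
  <li class="message" data-author="user_42">
    <time datetime="2012-03-04T11:02:00">Mar 4, 2012</time>
    <div class="body">Actually I am going through something like that now.</div>
  </li>
</ol>
"""

def extract_posts(html):
    """Return one dict per post with author, timestamp, and text."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for node in soup.select("li.message"):
        posts.append({
            "author": node["data-author"],
            "timestamp": node.find("time")["datetime"],
            "text": node.find("div", class_="body").get_text(strip=True),
        })
    return posts

posts = extract_posts(SAMPLE_HTML)
```

In this sketch the first post in a thread would be treated as the question post and the remaining posts as responses.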

Model Specification
Our goal is to develop a model that captures the preeminent themes and user behaviors from activity traces in online forums. The model should unite textual content with authorship and thread structure, by connecting these observed variables through a set of latent variables representing conceptual topics and user preferences. In this section, we present the statistical specification of just such a model, using the machinery of Bayesian graphical models. Specifically, the model describes a stochastic process by which the observed variables are emitted from prior probability distributions shaped by the latent variables. By performing Bayesian statistical inference in this model, we can recover a probability distribution around the latent variables of interest.
We now describe the components of the model that generate each set of observed variables. The model is shown as a plate diagram in Figure 1, and the notation is summarized in Table 1.

Generating the text
The part of the model which produces the text itself is similar to standard latent Dirichlet allocation (LDA) (Blei et al., 2003). We assume a set of K latent topics, which are distributions over each word in a finite vocabulary. These topics are shared among all D threads in the collection, but each thread has its own distribution over the topics.
We make use of the SAGE parametrization for generative models of text (Eisenstein et al., 2011). SAGE uses adaptive sparsity to induce topics that deviate from a background word distribution in only a few key words, without requiring a regularization parameter. The background distribution is written $m$, and the deviation for topic k is written $\eta_k$, so that $P(w = v \mid \eta_k, m) \propto \exp(m_v + \eta_{kv})$.
Each word token $w_{dpn}$ (the nth word in post p of thread d) is generated from the probability distribution associated with a single topic, indexed by the latent variable $z_{dpn} \in \{1, \ldots, K\}$. This latent variable is drawn from a prior $\theta_d$, which is the probability distribution over topics associated with all posts in thread d.
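The text-generation step can be sketched in a few lines of NumPy. This is a toy illustration under assumed values: the five-word vocabulary, the background distribution, and the two hand-set deviation vectors are invented for the example, and the Dirichlet draw for the thread-level proportions uses a symmetric prior of 1.

```python
import numpy as np

def sage_word_distribution(m, eta_k):
    """P(w = v | eta_k, m) proportional to exp(m_v + eta_kv), normalized over the vocabulary."""
    logits = m + eta_k
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
vocab = ["the", "work", "friend", "school", "shave"]      # toy vocabulary (assumption)
m = np.log(np.array([0.6, 0.1, 0.1, 0.1, 0.1]))          # background log-probabilities
eta = np.zeros((2, len(vocab)))                           # K = 2 sparse deviation vectors
eta[0, 1] = 2.0   # topic 0 deviates from the background only on "work"
eta[1, 2] = 2.0   # topic 1 deviates from the background only on "friend"

theta_d = rng.dirichlet(np.ones(2))                       # topic proportions for thread d
z = rng.choice(2, size=8, p=theta_d)                      # one topic indicator per token
words = [vocab[rng.choice(len(vocab), p=sage_word_distribution(m, eta[k]))] for k in z]
```

With a zero deviation vector the topic reduces to the background distribution, which is what makes sparse deviations easy to read: every non-zero entry of $\eta_k$ marks a word that the topic uses differently from the corpus as a whole.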

Generating the author
We have metadata indicating the author of each post, and we assume that users are more likely to participate in threads that relate to their topic-specific preferences. In addition, some people may be more or less likely to participate overall. We extend the LDA generative model to incorporate each of these intuitions.
For each author i, we define a latent preference vector $y_i$, where $y_{ik} \in \{0, 1\}$ indicates whether author i prefers to answer questions about topic k. We place a Bernoulli prior on each $y_{ik}$, so that $y_{ik} \sim \mathrm{Bern}(\rho)$, where $\mathrm{Bern}(y; \rho) = \rho^{y} (1 - \rho)^{(1 - y)}$. Induction of y is one of the key inference tasks for the model, since this captures topic-specific preference.
Some individuals will participate in a conversation regardless of whether they have anything useful to add. To model this general tendency, we add a "bias" variable $b_i \in \mathbb{R}$. When $b_i$ is negative, author i will be reluctant to participate even when she does have relevant interests.
Finally, various topics may require different levels of preference; some may capture only general knowledge that many individuals are able to provide, while others may be more obscure. We introduce a diagonal topic-weight matrix $\Omega$, where $\Omega_{kk} = \omega_k \geq 0$ is the importance of preference for topic k. We can easily generalize the model by including non-zero off-diagonal elements, but leave this for future work.
The generative distribution for the observed author variable is a log-linear function of y and b, given in Equation (1). This distribution is multinomial over authors; each author's probability of responding to a thread depends on the topics in the thread ($\theta_d$), the author's preference on those topics ($y_i$), the importance of preference for each topic ($\Omega$), and the bias parameter $b_i$. We exponentiate and then normalize, yielding a multinomial distribution.
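The description above can be made concrete with a small sketch. It assumes that the log-linear score for author i takes the form $\theta_d^\top \Omega y_i + b_i$; this form is inferred from the listed ingredients and is an assumption, not a transcription of Equation (1). All numeric values are toy inputs.

```python
import numpy as np

def author_response_probs(theta_d, Y, omega, b):
    """Softmax over authors of the assumed score theta_d^T Omega y_i + b_i.

    theta_d: (K,) topic proportions of thread d
    Y:       (A, K) binary author-topic preferences y_ik
    omega:   (K,) non-negative diagonal of Omega
    b:       (A,) per-author bias
    """
    scores = Y @ (omega * theta_d) + b   # one score per author
    scores = scores - scores.max()       # numerical stability before exponentiating
    p = np.exp(scores)
    return p / p.sum()                   # normalize to a distribution over authors

theta_d = np.array([0.7, 0.2, 0.1])
Y = np.array([[1, 0, 0],     # author 0 prefers topic 0
              [0, 1, 0],     # author 1 prefers topic 1
              [0, 0, 0]])    # author 2 has no topic preference
omega = np.array([2.0, 2.0, 2.0])
b = np.zeros(3)
probs = author_response_probs(theta_d, Y, omega, b)
```

Under this assumed form, an author with no matching preference can still be a likely respondent if the bias is large enough, which mirrors the role described for $b_i$.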
The authorship distribution in Equation (1) refers to the probability of user i authoring a single response post in thread d (we will handle question posts next). Let us construct a binary vector $a^{(r)}_d$, where $a^{(r)}_{di}$ is 1 if author i has authored any response posts in thread d, and zero otherwise. The probability distribution for this vector is given in Equation (2).
One of the goals of this model is to distinguish frequent responders (i.e., potential experts) from individuals who post questions in a given topic. Therefore, we make the probability of author i initiating thread d depend on the value $1 - y_{ik}$ for each topic k. We write the question-post authorship as a binary vector $a^{(q)}_d$, which is an indicator vector. Its probability is given in Equation (3).
We can put these pieces together for a complete distribution over authorship for thread d, given in Equation (4). The probability $p(a_d \mid \theta_d, y, \Omega, b)$ combines the authorship distributions for the question post and the answer posts in thread d. The identity of the original question poster does not appear in the answer vector, since further posts are taken to be refinements of the original question.
This model is similar in spirit to supervised latent Dirichlet allocation (sLDA) (Blei and McAuliffe, 2007). However, there are two key differences. First, sLDA uses point estimation to obtain a weight for each topic. In contrast, we perform Bayesian inference on the author-topic preference y. Second, sLDA generates the metadata from the dot-product of the weights and $\bar{z}$, while we use θ directly. The sLDA paper argues that using θ directly carries a risk of overfitting, where some of the topics serve only to explain the metadata and never generate any of the text. This problem does not arise in our experiments.

Formal generative story
We are now ready to formally define the generative process of our model:

Inference and estimation
The purpose of inference and estimation is to recover probability distributions and point estimates for the quantities of interest: the content of the topics, the assignment of topics to threads, author preferences for each topic, etc. While recent progress in probabilistic programming has improved capabilities for automating inference and estimation directly from the model specification, here we develop a custom algorithm, based on variational mean field (Wainwright and Jordan, 2008). Specifically, we approximate the distribution over topic proportions, topic indicators, and author-topic preference $P(\theta, z, y \mid w, a, x)$ with the mean field approximation in Equation (5), where $P_d$ is the number of posts in thread d, K is the number of topics, and $N_p$ is the number of word tokens in post p. The variational parameters of q(·) are γ, φ, ψ. We will write $\langle \cdot \rangle$ to indicate an expectation under the distribution q(θ, z, y). We employ point estimates for the variables b (author selection bias), λ (topic-time feature weights), η (topic-word log-probability deviations), and diagonal elements of Ω (topic weights). The estimation of η follows the procedure defined in SAGE (Eisenstein et al., 2011); we explain the estimation of the remaining parameters below.
Given the variational distribution in Equation (5), the inference on our topic model can be formulated as constrained optimization of the variational bound L in Equation (6).
The constraints are due to the parametric form of the variational approximation: $q(\theta_d \mid \gamma_d)$ is Dirichlet, and requires non-negative parameters; $q(z_{dpn} \mid \phi_{dpn})$ is multinomial, and requires that $\phi_{dpn}$ lie on the K − 1 simplex; $q(y_{ik} \mid \psi_{ik})$ is Bernoulli and requires that $\psi_{ik}$ be between 0 and 1. In addition, as a topic weight, $\omega_k$ should also be non-negative.
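For reference, a fully factorized form consistent with the three variational families and their constraints can be written as follows. This is a sketch inferred from the parametric forms listed above, not a transcription of Equation (5); the product over authors i is left without an explicit upper limit because the number of authors is not named in the text.

```latex
q(\theta, z, y) =
  \prod_{d=1}^{D} \Big[ q(\theta_d \mid \gamma_d)
    \prod_{p=1}^{P_d} \prod_{n=1}^{N_p} q(z_{dpn} \mid \phi_{dpn}) \Big]
  \prod_{i} \prod_{k=1}^{K} q(y_{ik} \mid \psi_{ik}),
\qquad
\gamma_{dk} \geq 0, \quad
\phi_{dpnk} \geq 0, \quad \sum_{k=1}^{K} \phi_{dpnk} = 1, \quad
0 \leq \psi_{ik} \leq 1 .
```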
Algorithm 1: One pass of the variational inference algorithm for our model.

Word-topic indicators
With the variational distribution in Equation (5), the inference on $\phi_{dpn}$ for a given token n in post p of thread d is the same as in LDA. The update for the nth token in post p of thread d depends on β, which is defined in the generative story, and on $\langle \log \theta_{dk} \rangle$, the expectation of $\log \theta_{dk}$ under the distribution $q(\theta_{dk} \mid \gamma_d)$. This expectation is computed with Ψ(·), the digamma function, which is the first derivative of the log-gamma function.
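Since this step matches LDA, the standard LDA mean-field update gives a concrete picture: $\phi_{dpnk} \propto \beta_{k, w} \exp(\Psi(\gamma_{dk}) - \Psi(\sum_{k'} \gamma_{dk'}))$. The sketch below implements that standard update; treating `beta` as a K-by-V matrix of topic-word probabilities is an assumption for the example, and the numeric values are toy inputs.

```python
import numpy as np
from scipy.special import digamma

def expected_log_theta(gamma_d):
    """E[log theta_dk] under a Dirichlet with parameters gamma_d."""
    return digamma(gamma_d) - digamma(gamma_d.sum())

def update_phi(gamma_d, beta, word_ids):
    """Standard LDA update: phi[n, k] proportional to beta[k, w_n] * exp(E[log theta_dk])."""
    log_phi = np.log(beta[:, word_ids]).T + expected_log_theta(gamma_d)  # shape (N, K)
    log_phi -= log_phi.max(axis=1, keepdims=True)                        # stability
    phi = np.exp(log_phi)
    return phi / phi.sum(axis=1, keepdims=True)                          # rows on the simplex

beta = np.array([[0.7, 0.2, 0.1],    # topic 0 over a 3-word toy vocabulary
                 [0.1, 0.2, 0.7]])   # topic 1
gamma_d = np.array([2.0, 1.0])       # variational Dirichlet parameters for thread d
word_ids = np.array([0, 2, 1])       # token word indices in one post
phi = update_phi(gamma_d, beta, word_ids)
```

Working in log space and normalizing each row keeps every $\phi_{dpn}$ on the simplex, which is the constraint required for the multinomial factor.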
For the other variational parameters γ and ψ, we cannot obtain a closed-form solution. As the constraints on these parameters are all convex with respect to each component, we employed the projected quasi-Newton algorithm proposed by Schmidt et al. (2009) to optimize L in Equation (6). One pass of the variational inference procedure is summarized in Algorithm 1. Since no step in this algorithm decreases the variational bound, the overall algorithm is guaranteed to converge.

Document-topic distribution
The inference for document-topic proportions is different from LDA, due to the generation of the author vector $a_d$, which depends on $\theta_d$. For a given thread d, the part of the bound associated with the variational parameter $\gamma_d$ is written $L_{\gamma_d}$, and the derivative of $L_{\gamma_d}$ with respect to $\gamma_{dk}$ is given in Equation (10), where Ψ′(·) is the trigamma function. The first two lines of Equation (10) are identical to LDA's variational inference, which obtains a closed-form solution by setting $\gamma_{dk} = \alpha_{dk} + \sum_{p,n} \phi_{dpnk}$. The additional term for generating the authorship vector $a_d$ eliminates this closed-form solution and forces us to turn to gradient-based optimization. The expectation of the log probability of the authorship involves the expectation of the log partition function, which we approximate using Jensen's inequality. We then derive the gradient, in which the notation $\langle a^{(r)}_{di} \mid \theta_d, y \rangle$ represents the generative probability of $a^{(r)}_{di} = 1$ under the current variational distributions $q(\theta_d)$ and $q(y_i)$. The notation $\langle a^{(q)}_{di} \mid \theta_d, y \rangle$ is analogous, but represents the question post indicator $a^{(q)}_{di}$.

Author-topic preference
The variational distribution over author-topic preference is $q(y_{ik} \mid \psi_{ik})$; as this distribution is Bernoulli, $\langle y_{ik} \rangle = \psi_{ik}$, so the parameter itself proxies for the topic-specific author preference, that is, how much author i prefers to answer posts on topic k.
The part of the variational bound that relates to the author preferences is given in Equation (12). For author i on topic k, the derivative of $\log p(a_d \mid \theta_d, y, \Omega, b)$ for document d with respect to $\psi_{ik}$ is given in Equation (13). Thus, participating as a respondent increases $\psi_{ik}$ to the extent that topic k is involved in the thread; participating as the questioner decreases $\psi_{ik}$ by a corresponding amount.

Point estimates
We make point estimates of the following parameters: author selection bias b i and topic-specific preference weights ω k . All updates are based on maximum a posteriori estimation or maximum likelihood estimation.
Selection bias. For the selection bias $b_i$ of author i given a thread d, the objective function in Equation (6) with the prior $b_i \sim N(0, \sigma^2_b)$ is optimized by a quasi-Newton algorithm using the derivative in Equation (14). The zero-mean Gaussian prior shrinks $b_i$ towards zero by subtracting $b_i / \sigma^2_b$ from this gradient. Note that the gradient in Equation (14) is non-negative whenever author i participates in thread d. Any post from this author, whether a question post or an answer post, therefore makes a positive contribution to the author's selection bias: any activity in the forum will elevate $b_i$, but will not necessarily increase the imputed preference level.
Topic weights. The topic-specific preference weight $\omega_k$ is updated by considering the derivative of the variational bound with respect to $\omega_k$, which is computed for each document d. Thus, $\omega_k$ will converge at a value where the observed posting counts match the expectations under $\log p(a_d \mid \theta_d, y, \Omega, b)$.

Quantitative Evaluation
To validate the topics identified by the model, we performed a manual evaluation, combining the opinions of both novices and subject matter experts in autism and Asperger's Syndrome. The purpose of the evaluation is to determine whether the topics induced by the proposed model are more coherent than topics from generic alternatives such as LDA and the author-topic model, which are not specifically designed for forums.

Experiment Setup
Preprocessing. Preprocessing was minimal. We tokenized texts on white space and removed punctuation at the beginning and end of each token. We removed words that appear fewer than five times, resulting in a vocabulary of the 4,903 most frequently used words.
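The preprocessing steps can be sketched with the Python standard library alone. This is a minimal illustration, not the exact script used: it strips only ASCII punctuation, it does not lowercase (the text does not say whether case was folded), and the sample posts are invented.

```python
import string
from collections import Counter

def tokenize(text):
    """Split on white space and strip punctuation from both ends of each token."""
    tokens = []
    for raw in text.split():
        tok = raw.strip(string.punctuation)  # ASCII punctuation only
        if tok:                              # drop tokens that were pure punctuation
            tokens.append(tok)
    return tokens

def build_vocabulary(posts, min_count=5):
    """Keep words that occur at least min_count times across all posts."""
    counts = Counter(tok for post in posts for tok in tokenize(post))
    return {word for word, count in counts.items() if count >= min_count}

# Toy corpus (assumption): one sentence repeated five times plus one rare post.
posts = ["I think it might be an Aspie trait."] * 5 + ["(Rare) word here!"]
vocab = build_vocabulary(posts)
```

Punctuation inside a token, such as the apostrophe in "I've", is preserved because only the token boundaries are stripped.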
Baseline Models. We considered two baseline models in the evaluation. The first baseline model is latent Dirichlet allocation (LDA), which considers only the text and ignores the metadata (Blei et al., 2003). The second baseline is the Author-Topic (AT) model, which extends LDA by associating authors with topics (Rosen-Zvi et al., 2004). Both baselines are implemented in the Matlab Topic Modeling Toolbox (Steyvers and Griffiths, 2005).
Parameter Settings. For all three models, we set K = 50. Our model includes three tunable parameters: ρ, the Bernoulli prior on topic-specific expertise; $\sigma^2_b$, the variance prior on user selection bias; and α, the prior on the document-topic distribution. In the following experiments, we chose ρ = 0.2, $\sigma^2_b$ = 1.0, and α = 1.0. LDA and AT share two parameters: α, the symmetric Dirichlet prior for the document-topic distribution, and β, the symmetric Dirichlet prior for the topic-word distribution. In both models, we set α = 3.0 and β = 0.01. All parameters were selected in advance of the experiments; further tuning of these parameters is left for future work.

Topic Coherence Evaluation
To be useful, a topic model should produce topics that human readers judge to be coherent. While some automated metrics have been shown to cohere with human coherence judgments (Newman et al., 2010), it is possible that naive raters might have different judgments from subject matter experts. For this reason, we focused on human evaluation, including both expert and novice opinions. One rater, R1, is an author of the paper (HH) and a Ph.D. student focusing on designing technology to understand and support individuals with autism spectrum disorder. The remaining three raters are not authors of the paper and are not domain experts.
In the evaluation protocol, raters were presented with batteries of fifteen topics, from which they were asked to select the three most coherent. In each of the ten batteries, there were five topics from each model, permuted at random. Thus, after completing the task, all 150 topics (50 topics from each model) were rated. The user interface of topic coherence evaluation is given in Figure 2, including the specific prompt.
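The battery construction can be sketched as follows. This is a minimal illustration of the protocol as described, with placeholder topic labels; how topics were assigned to batteries beyond "five per model, permuted at random" is an assumption, and the model label attached to each topic would be hidden from raters and used only for scoring.

```python
import random

def build_batteries(topics_by_model, per_model=5, seed=0):
    """Split each model's topics into batteries with per_model topics from each model."""
    rng = random.Random(seed)
    shuffled = {}
    for model, topics in topics_by_model.items():
        order = list(topics)
        rng.shuffle(order)                 # random assignment of topics to batteries
        shuffled[model] = order
    n_batteries = min(len(order) for order in shuffled.values()) // per_model
    batteries = []
    for i in range(n_batteries):
        battery = []
        for model, order in shuffled.items():
            chunk = order[i * per_model:(i + 1) * per_model]
            battery.extend((model, topic) for topic in chunk)
        rng.shuffle(battery)               # permute so the source model is not identifiable
        batteries.append(battery)
    return batteries

# Placeholder topic identifiers (assumption): 50 topics for each of the three models.
topics_by_model = {
    name: [f"{name}-topic-{k}" for k in range(50)] for name in ("ours", "LDA", "AT")
}
batteries = build_batteries(topics_by_model)
```

With 50 topics per model and five per model in each battery, this yields ten batteries of fifteen topics, so every topic is shown exactly once.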
We note that this evaluation differs from the "intrusion task" proposed by Chang et al. (2009), in which raters are asked to guess which word was randomly inserted into a topic. While the intrusion task protocol avoids relying on subjective judgments of the meaning of "coherence," it prevents expert raters from expressing a preference for topics that might be especially useful for analysis of autism spectrum disorder. Prior work has also shown that the variance of these tasks is high, making it difficult to distinguish between models.
Table 2 shows, for each rater, the percentage of topics that were chosen from each model as the most coherent within each battery. On average, 80% of the topics were chosen from our proposed model. If all three models were equally good at discovering coherent topics, the percentages would be roughly equal across the three models. Note that the opinion of the expert rater R1 is generally similar to the other three raters.

Analysis of Aspies Central Topics
In this section, we use our model to explore the Aspies Central forum further. We want to examine whether the autism-related topics identified by the model can help researchers gain a qualitative understanding of the needs and concerns of autism forum users. We are also interested in understanding the users' behavioral patterns on autism-related topics. The analysis task has three components: first, we describe the interesting topics from the autism domain perspective. Then we find the proportion of each topic, including autism-related topics. Finally, in order to understand the user activity patterns on these autism-related topics, we derive the topic-specific preference ranking of the users from our model.
Table 3 shows all 50 topics from our model. For each topic, we show the top five words related to this topic. We further identified fourteen topics (highlighted in blue), which are particularly relevant to understanding autism.
Among the identified topics, there are three popular topics discussed in the Aspies Central forum: topic 4, topic 19 and topic 31. From the top word list, we identified that topic 4 is composed of keywords related to the psychological (e.g., self-esteem, art) and social (e.g., volunteering, community) well-being of the Aspies Central users. Topic 19 includes discussion on mental health issues (e.g., depression) and religious activities (e.g., believe, christianity, buddhism) as coping strategies. Topic 31 addresses a specific personal hygiene issue: helping people with autism learn to shave. This might be difficult for individuals with sensory issues: for example, they may be terrified by the sound and vibration generated by the shaver. Among the other identified topics, topic 22 is about making friends and maintaining friendship; topic 12 is about educational issues ranging from seeking educational resources to improving academic skills and adjusting to college life.
In addition to identifying meaningful topics, another capability of our model is to discover users' topic preferences and expertise. Recall that, for user i and topic k, our model estimates an author-topic preference variable $\psi_{ik}$. Each $\psi_{ik}$ ranges from 0 to 1, indicating the probability that user i answers a question on topic k. As we set the prior probability of author-topic preference to be 0.2, we show topic-author pairs for which $\psi_{ik} > 0.2$ in Table 4.
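Selecting the topic-author pairs above the prior is a simple thresholding step. The sketch below uses an invented 3-by-3 matrix of $\psi$ values; the function name and the output layout are illustrative choices, not part of the system described here.

```python
import numpy as np

def preferred_pairs(psi, rho=0.2):
    """Return (topic, author, psi_ik) triples with psi_ik > rho.

    psi is an (authors x topics) array. Results are sorted by topic, then by
    descending preference, so the strongest respondents for each topic come first.
    """
    authors, topics = np.nonzero(psi > rho)
    triples = [(int(k), int(i), float(psi[i, k])) for i, k in zip(authors, topics)]
    return sorted(triples, key=lambda t: (t[0], -t[2]))

# Toy posterior preferences (assumption): rows are authors, columns are topics.
psi = np.array([[0.9, 0.1, 0.6],
                [0.3, 0.05, 0.2],
                [0.1, 0.1, 0.1]])
pairs = preferred_pairs(psi)
```

The comparison is strict, so an author whose posterior preference merely equals the prior of 0.2 is not listed.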
The dominance of USER 1 in these topics is explained by the fact that this user is the moderator of the forum. We also find other users participating in most of the interesting topics, such as USER 2 and USER 3. On the other hand, users like USER 14 and USER 15 only show up in a few topics. This observation is supported by their activities on discussion boards. Searching on the Aspies Central forum, we found that most answer posts of USER 15 are from the board "love-relationships-and-dating".

Related Work
Social media has become an important source of health information (Choudhury et al., 2014). For example, Twitter has been used both for mining public health information (Paul and Dredze, 2011) and for estimating individual health status (Sokolova et al., 2013; Teodoro and Naaman, 2013). Domain-specific online communities, such as Aspies Central, have their own advantages, targeting specific issues and featuring more close-knit and long-term relationships among members (Newton et al., 2009).
Previous studies on mining health information show that technical models and tools from computational linguistics are helpful for both understanding contents and providing informative features. Sokolova and Bobicev (2011) use sentiment analysis to analyze opinions expressed in health-related Web messages; Hong et al. (2012) focus on lexical differences to automatically distinguish schizophrenic patients from healthy individuals.
Topic models have previously been used to mine health information: Resnik et al. (2013) use LDA to improve the prediction of neuroticism and depression in college students, while Paul and Dredze (2013) customize their factorial LDA to model the joint effect of drug, aspect, and route of administration. Most relevantly for the current paper, Nguyen et al. (2013) use LDA to discover autism-related topics, using a dataset of 10,000 posts from ten different autism communities. However, their focus was on automated classification of communities as autism-related or not, rather than on analysis and on providing support for qualitative autism researchers. The applicability of the model developed in our paper to classification tasks is a potential direction for future research.
In general, topic models capture latent themes in document collections, characterizing each document in the collection as a mixture of topics (Blei et al., 2003). A natural extension of topic models is to infer the relationships between topics and metadata such as authorship or time. A relatively simple approach is to represent authors as an aggregation of the topics in all documents they have written (Wagner et al., 2012). More sophisticated topic models, such as the Author-Topic (AT) model (Rosen-Zvi et al., 2004), assume that each document is generated by a mixture of its authors' topic distributions. Our model can be viewed as a further extension of topic models that incorporates more metadata (authorship, thread structure) from online forums.

Conclusion
This paper describes how topic models can offer insights on the issues and challenges faced by individuals on the autism spectrum. In particular, we demonstrate that by unifying textual content with authorship and thread structure metadata, we can obtain more coherent topics and better understand user activity patterns. This coherence is validated by manual annotations from both experts and non-experts. Thus, we believe that our model provides a promising mechanism to capture behavioral and psychological attributes of special populations affected by cognitive disabilities, some of which may signal needs and concerns about their mental health and social well-being.
We hope that this paper encourages future applications of topic modeling to help psychologists understand the autism spectrum and other psychological disorders, and we hope to obtain further validation of our model through its utility in such qualitative research. Other directions for future work include replication of our results across multiple forums, and applications to other conditions such as depression and attention deficit hyperactivity disorder (ADHD).