Towards Intelligent Clinically-Informed Language Analyses of People with Bipolar Disorder and Schizophrenia



Introduction
Schizophrenia and bipolar disorder have been associated with observable language patterns in clinical and sociolinguistic studies (Kasanin, 1944; Elmore and Gorham, 1957; Perlini et al., 2012; Bambini et al., 2016). Some computational studies have sought to replicate these findings or derive novel clinical insights using automated analyses driven by natural language processing techniques (Ratana et al., 2019; Harvey et al., 2022). Effectively identifying features associated with these disorders or providing diagnostic aid offers substantial potential for real-world impact (Castro et al., 2015; Becker et al., 2018; Lovejoy, 2019). However, studies to date have been constrained by limitations in dataset size and availability (Elvevåg et al., 2007; Bedi et al., 2015; Mota et al., 2012; Gutiérrez et al., 2017; Corcoran and Cecchi, 2020; n ≤ 51 subjects), restricting the extent to which they can produce meaningful or generalizable conclusions.
We address this gap by introducing a new, large (n = 644 subjects) dataset of transcribed conversations between clinicians and people with bipolar disorder (BD), people with schizophrenia (SZ), and healthy control (HC) subjects. We also establish preliminary benchmarking models for automatically distinguishing between these groups using interpretable linguistic features, achieving promising proof-of-concept results ranging from 70% to 96% accuracy in one-versus-one discrimination between subject groups. Finally, we conduct preliminary analyses across a large feature set to identify potential linguistic correlates with these groups. Our key contributions are as follows:

• We introduce a new, 644-subject (1,288-transcript) dataset collected in clinically validated laboratory settings.
• Using this new dataset, we develop benchmarking models for the automated detection of bipolar and schizophrenia disorders in a one-versus-one classification setting, as a tool for facilitating analysis of language associated with members of these groups.
• Through these analyses, we identify potential linguistic correlates with diagnostic groups.
This research was jointly conducted by an interdisciplinary team of researchers from psychiatry and computer science departments to foster translational impact in both communities (Newman-Griffis et al., 2021). We hope that the data and insights provided will pave the way for new research and subsequently exciting new clinical and computational findings in this domain.
Related Work

Most social media datasets for mental health tasks are annotated along binary or linear scales and label users based on analysis of a set number of posts. Annotations may be provided by trained human annotators (Wang et al., 2016b; Coppersmith et al., 2015), annotators with clearly referenced domain expertise (e.g., Birnbaum et al. (2017)'s work employing a clinical psychiatrist and a graduate student from Northwell Health's Early Treatment Program), user disclosures of mental health conditions (Coppersmith et al., 2015; Safa et al., 2022; Zhou et al., 2021), and crowdsourcing services (Turcan and McKeown, 2019b). Annotation schema for some mental health conditions can be subjective, causing varied inter-annotator agreement. For example, Birnbaum et al. (2017) reported a Cohen's kappa score of κ=0.81, whereas Turcan and McKeown (2019b) reported a much lower agreement of κ=0.47 for their dataset of stressed and unstressed social media users. Turcan and McKeown (2019b)'s dataset also offers an example of how fuzzy label boundaries can affect annotation quality: it is well established that stress is often temporary (Dhabhar, 2018); hence, post labels do not always equate to a user's mental state. Finally, independent decision-making when selecting sources may influence annotation outcomes.
Although social media data has been leveraged for a variety of mental health tasks, data accessibility remains an enormous challenge. In their analysis of more than 100 mental health datasets, Harrigian et al. (2020) found only three to be available without any restrictions. They found that ≥ 50% of the data they analyzed was not readily available. Of those that were described in some capacity (48), 13 were removed from public records or made unavailable by other limitations. Of the 35 that remained, 12 required signed agreements or Institutional Review Board (IRB) approvals, 18 had instructions and APIs to reproduce them, 2 could be obtained directly by emailing the authors, and, as mentioned, 3 were available without restrictions. These trends have also been observed on a broader scale with other healthcare data in NLP studies (Valizadeh and Parde, 2022).
Moreover, for publicly accessible data, the inherent subjectivity of many mental health annotation tasks and the frequent reliance on user self-disclosures mean that many "gold standard" labels are imperfectly assigned. Most datasets fail to capture nuances of mental health (Arseniev-Koehler et al., 2018), and medical self-disclosures may be indirect (Valizadeh et al., 2021). For example, Birnbaum et al. (2017)'s dataset labels the following sample as YES, but provides little clarity regarding the user's diagnosis:

I have schizophrenia/depression. I am trying to become better by exercise and working I have a job xoxo I love Saturday xx
Issues related to fairness and to the balanced representation of genders and racial and ethnic groups in social media datasets have also been found (Aguirre et al., 2021). We seek to address many of these limitations by providing a publicly accessible dataset of manually transcribed interactions between trained clinicians and individuals with clinically diagnosed mental health conditions. We also provide dataset transparency regarding representational balance through validated diagnoses and descriptive statistics.

Task Selection
We collected data through a standardized performance-based test of social competence called the Social Skills Performance Assessment (Patterson et al., 2001, SSPA). The SSPA involves a prompted conversation between a confederate/examiner and a patient, wherein the patient's social abilities during the conversation are scored by a trained rater to provide an estimate of social skill. The SSPA is useful in clinical assessment because it provides a measure of social abilities that is free of biases associated with self-report or informants (Leifker et al., 2010). The SSPA has been used as an endpoint of clinical rehabilitation trials and is a predictor of social function (Miller et al., 2021).
The SSPA involves two scenarios administered by a trained rater in a laboratory setting, and the interaction is audiorecorded. The measure consists of two simulated interactions in which the rater plays the role of a conversation partner and the participant plays the role of themselves in the scene. The first scene is affiliative and involves meeting a new neighbor. The second scene is confrontational and asks the participants to complain to their landlord after a prior notification about a leak had not been addressed. These scenarios last on average four minutes each. In Appendix A we provide sample texts for both scenes from people who are clinically diagnosed with schizophrenia.

Data was collected through studies supported by the National Institute of Mental Health, which recruited outpatients with schizophrenia/schizoaffective disorder or bipolar disorder, as well as healthy controls. The inclusion criteria for these studies were the ability to provide informed written consent, a diagnosis of either bipolar disorder or schizophrenia/schizoaffective disorder according to the Diagnostic and Statistical Manual of the American Psychiatric Association, and outpatient status at the time of assessment. Informed written consent was taken from participants for audiorecording and de-identified research data sharing for each of these projects. Psychiatric diagnoses were performed under the supervision of medical researchers and practicing clinicians at the University of California San Diego, the University of Miami, and the University of Texas at Dallas. A total of 644 SSPAs were available across these studies (SZ/SC=247, BD=286, HC=110).

Descriptive Statistics
We experiment in Section 6 with a random subset of 300 subjects divided equally between the SZ (n = 100), BD (n = 100), and HC (n = 100) groups. Each participant has two audio files (for the two tasks described in §3.1) for a total of 600 audio files. Descriptive statistics for all 644 participants in the full dataset are provided in Table 1.

Data Release
We release our data freely in two ways. Extracted features (described in §4.2) can be downloaded as CSV files from GitHub without any special permission. The fully de-identified transcripts can be downloaded from the National Institute of Mental Health data archive in adherence with National Institutes of Health reporting requirements and the corresponding research grant that funded this work. Users of our data will be responsible for their own statements, analysis, interpretation, and uses. We refer readers to the Ethical Considerations (end of paper) and Appendix C for a fuller understanding of how to use this dataset.

Preprocessing
Verbatim transcriptions of the audiorecordings for all participants were made by a trusted third-party service and then manually stripped of identifiable information. These were stored in docx format by the transcription service, using one of the two formats shown in Figure 1. We preprocessed these files to prepare them for further computational work using a series of steps determined through preliminary data analysis. These steps included the automated extraction of timestamps, separation of interviewer and participant dialogue, and (described in the next subsection) computation of linguistic features inspired by and extended from previously published work on other datasets. We first converted the transcripts verbatim from docx to txt format to enable easier parsing using Python 3.7. We then applied a set of regular expressions to extract essential information:

• Timestamps were extracted by searching for strings in the format HH:MM:SS enclosed by + sign characters.
• Interviewer dialogue was extracted by searching for strings starting with Interviewer:.
• Patient dialogue was extracted by searching for strings starting with Patient:.

Algorithm 1: Utterance Speaker Labeling
Transcripts following the second format in Figure 1 were more complex to parse initially, since the continuous dialogue extending beyond the initial timestamp was not matched effectively by these patterns. To address this, we applied a speaker labeling algorithm (Algorithm 1) to these cases. This algorithm processes strings using our regular expression patterns, repeatedly iterating through lines in the transcript until the end of the document is reached. The variable t_c holds the current timestamp for the speaker utterances, l holds the current line of text (set to FALSE if no more lines exist in the document), s_p holds the previous speaker label, s_c holds the current speaker label, u_p holds patient utterances, and u_i holds interviewer utterances.

The functions GETTIME(·), GETINTERVIEWER(·), and GETPATIENT(·) hold the regular expressions necessary to extract the timestamp, interviewer label, and patient label from a string, respectively, or otherwise return FALSE. Strings matched by GETINTERVIEWER(·) or GETPATIENT(·) are appended to u_i or u_p depending on the specified speaker, and strings not matched by any of the regular expression patterns (e.g., continued dialogue) are appended to the previous speaker's utterance list. The final, preprocessed lists of interviewer and patient utterances with extracted timestamps are converted to pandas dataframes for feature extraction and further processing.
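As a concrete illustration, the labeling procedure described above can be sketched in Python roughly as follows. The regular expressions and the function name label_utterances here are hypothetical stand-ins for Algorithm 1 and its GETTIME/GETINTERVIEWER/GETPATIENT patterns, not the exact released implementation:

```python
import re

# Hypothetical patterns mirroring those described in the text.
TIME_RE = re.compile(r"\+(\d{2}:\d{2}:\d{2})\+")        # +HH:MM:SS+
INTERVIEWER_RE = re.compile(r"^Interviewer:\s*(.*)")
PATIENT_RE = re.compile(r"^Patient:\s*(.*)")

def label_utterances(lines):
    """Assign each transcript line to the interviewer or the patient.

    Lines matching neither speaker pattern (continued dialogue in the
    second transcript format) are appended to the previous speaker's
    utterance list, as in Algorithm 1.
    """
    u_i, u_p = [], []   # interviewer / patient (timestamp, text) pairs
    s_p = None          # previous speaker label
    t_c = None          # current timestamp
    for l in lines:
        m = TIME_RE.search(l)
        if m:
            t_c = m.group(1)
            l = TIME_RE.sub("", l).strip()
            if not l:
                continue
        m = INTERVIEWER_RE.match(l)
        if m:
            u_i.append((t_c, m.group(1)))
            s_p = "interviewer"
            continue
        m = PATIENT_RE.match(l)
        if m:
            u_p.append((t_c, m.group(1)))
            s_p = "patient"
            continue
        # Continuation line: attach to the previous speaker's last utterance.
        if s_p == "interviewer" and u_i:
            t, text = u_i[-1]
            u_i[-1] = (t, text + " " + l.strip())
        elif s_p == "patient" and u_p:
            t, text = u_p[-1]
            u_p[-1] = (t, text + " " + l.strip())
    return u_i, u_p
```

The resulting lists can then be loaded into dataframes for feature extraction, as the pipeline above does.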

Features Extracted
To assess the importance and utility of linguistic features in the context of this new, large dataset, we extract varied features from the patient dialogue. These features can be broadly categorized as pertaining to time, sentiment, psycholinguistic attributes, emotion, and lexical diversity.

Temporal Features
We extracted two temporal features for each patient: the maximum time taken for a dialogue, and the mean time taken per dialogue. To do so, all timestamp strings were first converted to time objects in seconds, allowing for straightforward calculation of the difference between start and end times in a given dialogue. The maximum difference is labeled max_time, and the mean of the list of differences is our other temporal feature, mean_time. Both are stored in seconds.
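A minimal sketch of these two temporal features, under the simplifying assumption that each dialogue is represented by a (start, end) pair of HH:MM:SS strings:

```python
def to_seconds(ts):
    """Convert an HH:MM:SS timestamp string to seconds."""
    h, m, s = (int(x) for x in ts.split(":"))
    return h * 3600 + m * 60 + s

def temporal_features(dialogues):
    """dialogues: list of (start, end) HH:MM:SS pairs, one per dialogue.

    Returns max_time and mean_time in seconds, as described in the text.
    """
    diffs = [to_seconds(end) - to_seconds(start) for start, end in dialogues]
    return {"max_time": max(diffs), "mean_time": sum(diffs) / len(diffs)}
```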

Sentiment Features
We extracted sentiment features based on SentiWordNet (Baccianella et al., 2010) scores. We calculated a transcript-level total_sentiment_score by concatenating all patient utterances in the transcript, tokenizing the concatenated text, and computing token-level scores that were then used to increment positive, negative, or objective features across the full transcript. We then extract the average_positive, average_negative, and average_objective scores from this information.
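The scoring procedure can be sketched as follows. The toy lexicon below is a hypothetical stand-in for SentiWordNet, whose real scores are assigned per synset, and tokenization is simplified to whitespace splitting:

```python
# Hypothetical toy lexicon: (positive, negative, objective) per token.
# SentiWordNet itself provides per-synset scores, not this flat table.
TOY_LEXICON = {
    "good": (0.75, 0.0, 0.25),
    "leak": (0.0, 0.5, 0.5),
    "the": (0.0, 0.0, 1.0),
}

def sentiment_features(utterances):
    """Concatenate utterances, tokenize, and average per-token scores."""
    tokens = " ".join(utterances).lower().split()
    pos = neg = obj = 0.0
    for tok in tokens:
        p, n, o = TOY_LEXICON.get(tok, (0.0, 0.0, 0.0))
        pos, neg, obj = pos + p, neg + n, obj + o
    n_tok = len(tokens) or 1  # guard against empty transcripts
    return {
        "average_positive": pos / n_tok,
        "average_negative": neg / n_tok,
        "average_objective": obj / n_tok,
    }
```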

Psycholinguistic Features
To compute psycholinguistic features, we used the 2022 Linguistic Inquiry and Word Count (LIWC) framework (Boyd et al., 2022), which offers key updates over earlier versions of LIWC. Specifically, the processes for computing classical LIWC features such as WC, Analytic, Clout, Authentic, and Tone are changed to reflect shifts in culture and language.

Emotion Features
We extracted emotion features based on the NRC Word-Emotion Lexicon (Mohammad and Turney, 2010, 2013). Specifically, for each transcript we compute the total number of words associated with Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise, and Trust as denoted by the NRC lexicon. We assign a score of 0 for a given emotion if the transcript contains no words corresponding to that emotion in the NRC lexicon.
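A schematic version of this counting procedure, using a tiny hypothetical stand-in for the NRC lexicon's word-to-emotion mappings:

```python
EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

# Hypothetical two-entry stand-in for the NRC Word-Emotion Lexicon,
# which maps thousands of words to subsets of the eight emotions.
TOY_NRC = {
    "happy": {"joy", "anticipation", "trust"},
    "angry": {"anger", "disgust"},
}

def emotion_features(transcript_tokens):
    """Count tokens per emotion; emotions with no matching words stay 0."""
    counts = {e: 0 for e in EMOTIONS}
    for tok in transcript_tokens:
        for e in TOY_NRC.get(tok.lower(), ()):
            counts[e] += 1
    return counts
```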

Lexical Diversity Features
Finally, to measure a transcript's linguistic variety and richness, we computed seven popular measures of lexical diversity at the transcript level. These measures are described in detail in Table 2. Lexical diversity indices have proven crucial in psychometric evaluation tasks (Kapantzoglou et al., 2019).

Feature Analyses

Since we computed features across three subject pools (SZ, BD, and HC), we analyzed feature correlations, patterns, and trends across subject groups. This investigation provides a starting ground for the more detailed follow-up studies that our new dataset is designed to enable. We make our analysis and visualization scripts publicly available to lower the barrier for others to pursue these studies. In Figures 2 and 3, we present violin plots illustrating score distributions across selected features from the major feature groups described in §4.2. We examine trust emotion features (Figures 5a and 2a), Herdan measures of lexical diversity (Figures 5b and 2b), mean time per dialogue (Figures 6a and 3a), and interpersonal conflict features from LIWC 2022 (Figures 6b and 3b). Class labels are represented using the numeric signifiers HC=0, SZ=1, and BD=2, and the colors blue, orange, and green, respectively. Due to space restrictions we present plots based on the Scene 2 transcripts here, and include plots representing the same features from Scene 1 as supplemental content in Appendix B (Figures 5 and 6).
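As one concrete example of the lexical diversity indices mentioned above, Herdan's measure (log type count over log token count, also known as log-TTR) can be computed as follows. This is the standard formulation, not code from the released scripts:

```python
import math

def herdan_c(tokens):
    """Herdan's C: log(number of types) / log(number of tokens).

    Values near 1 indicate high lexical diversity; returns 0.0 for
    degenerate inputs of one token or fewer, where the ratio is undefined.
    """
    n = len(tokens)
    v = len(set(tokens))
    if n <= 1:
        return 0.0
    return math.log(v) / math.log(n)
```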
We observe that HC subjects exhibit larger overall ranges of lexical diversity and trust language than SZ or BD subjects (Figure 2). SZ subjects exhibit lower trust scores, and BD subjects exhibit a bimodal score distribution with two large frequency centers (Figure 5a and Figure 2a). This differs from patterns associated with lexical diversity. We observe that BD subjects have a single concentrated distribution of mass slightly above a Herdan score of 0.85. SZ subjects exhibit a similar mean Herdan score, but with a wider score distribution.
When examining mean time, we observe that both HC and SZ subjects have slightly bimodal score distributions, with SZ subjects also having the widest score range (Figure 6a and 3a). BD subjects have a single frequency center and relatively consistent frequency spread from 10-30 seconds. Finally, we observe that interpersonal conflict features are concentrated near scores of 2 for all subjects, although SZ subjects show the largest score range with a relatively large share of subjects with scores of 4 or greater (Figure 3b and 6b).
In Figure 4, we present pairwise feature correlations among six selected features across our five broad feature categories: mean time, positive sentiment, LIWC analytic score, anger score, Herdan lexical diversity, and LIWC lack score. We study and compare pairwise correlations between members of different subject groups, with feature correlations for HC, BD, and SZ subjects shown in Figures 4a, 4b, and 4c, respectively.
We observe weakly positive correlations between analytic scores and positive sentiment among HC subjects, but very weakly (BD) to weakly (SZ) negative correlations for this same feature pairing among subjects in other groups, suggesting a stronger relationship between logic and optimism in control subjects compared to subjects with bipolar disorder or schizophrenia. Interestingly, we also observe stronger positive correlations between anger and mean time, as well as between lexical diversity and positive sentiment, in SZ subjects than in HC or BD subjects. HC subjects have weakly negative correlations between lexical diversity and positive sentiment.

Classification Task
To establish the learning validity of our dataset, we designed a simple task to predict subject group membership. Specifically, we conduct binary classification experiments to discriminate between two classes from the set of HC, SZ, and BD subjects. This also creates an additional avenue through which group-level language behaviors can be analyzed (e.g., through learned feature weights). We experiment with both classical (§6.1) and Transformer-based (§6.2) models.

Classical Models
We experimented with five feature-based models that have demonstrated high efficiency for a variety of language tasks: random forest (Xu et al., 2012; Bouaziz et al., 2014; Jurka et al., 2013, RF), K nearest neighbors (Yong et al., 2009; Jodha et al., 2018; Trstenjak et al., 2014; Pranckevičius and Marcinkevičius, 2017, KNN), logistic regression (Pranckevičius and Marcinkevičius, 2017; Jurka, 2012; Genkin et al., 2007; Lee and Liu, 2003, Logistic), ridge classifier (Aseervatham et al., 2011; He et al., 2014, Ridge), and support vector machine (Joachims, 2002; Yang, 2001, SVM). We randomly separated our data for each class into 75%/25% train/test splits. Since we used the 300-subject sample defined in §3.3 for these experiments, the training data for a given scene and subject group pair included 150 transcripts, and the corresponding test set for that scene/pair setting included 50 transcripts. We performed three classification experiments (BD × HC, BD × SZ, and SZ × HC) for each model, for each of the two scenes. We trained each model on the full set of features described previously (§4.2). We report our results for Scene 1 and Scene 2 in Table 3. We observe that the consistently highest-performing model across both scenes is the random forest classifier, achieving strong accuracies ranging from 0.93 (BD × HC) to 0.96 (SZ versus either) in Scene 1 and 0.70 (HC × SZ) to 0.96 (BD × HC) in Scene 2.
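The evaluation protocol (a 75%/25% split with accuracy and F1 in a one-versus-one setting) can be sketched in plain Python as follows. This is an illustrative outline of the protocol only; the actual experiments used the five classifiers listed above rather than any model shown here:

```python
import random

def train_test_split(items, labels, test_frac=0.25, seed=0):
    """Shuffle indices with a fixed seed and split 75%/25% by default."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    return ([items[i] for i in tr], [labels[i] for i in tr],
            [items[i] for i in te], [labels[i] for i in te])

def accuracy_f1(gold, pred, positive):
    """Accuracy and F1 for one-versus-one labels, treating `positive`
    as the positive class when computing precision and recall."""
    correct = sum(g == p for g, p in zip(gold, pred))
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(p == positive and g != positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return correct / len(gold), f1
```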
Greater variation among top-performing classifiers was observed when comparing F1 scores, with the random forest classifier still achieving the highest performance most of the time. Interestingly, classification appeared to be more challenging when discriminating between HC and SZ in Scene 2 transcripts. Nonetheless, the overall strong classification performance across the board for Scenes 1 and 2 using feature-based classification models suggests high learning validity for both the dataset and the features extracted.

Transformer-based Models
Applying pretrained Transformers to domain-specific tasks may produce more robust, dependable, and accurate models (Alsentzer et al., 2019). Since much recent success in NLP has been achieved using Transformer models, we also experiment with several using the same one-versus-one classification setting and data splits from our other experiments. We compare the performance of pretrained BERT base (Devlin et al., 2018), MentalBERT (Ji et al., 2022), and Mental-RoBERTa (Ji et al., 2022) models for our task. BERT base is a pretrained English model using a masked language modeling objective. It randomly masks a small percentage of words and learns to predict the masked samples. The model was trained for one million steps in batch sizes of 256 with fine-tuned hyperparameters set to: optimizer=Adam, learning rate=1e-4, β1=0.9, β2=0.999, and decay=0.01. MentalBERT and Mental-RoBERTa follow the same architecture but use dynamic masking and domain-adaptive pretraining. The pretraining corpus includes depression, stress, and suicidal ideation data from Reddit. We passed subject utterances from our transcripts directly to these models for automated encoding of implicitly learned features. We present the results for a sample of these experiments (Scene 2 classifications of HC vs. SZ subjects) in Table 4. We observe much lower performance than seen with feature-based classifiers. There may be many reasons for this, ranging from characteristics of the data used for pretraining to inefficiencies in implicitly learned features relative to features engineered based on known psycholinguistic attributes. Since we do not observe promising results using pretrained Transformer models, and these models also do not lend themselves as easily to facilitating linguistic analyses, we leave further probing of this to future work.

Conclusion
Publishing language data collected in clinical settings that is paired with validated psychiatric diagnoses is an essential first step towards realizing more realistic, medically relevant NLP applications in the mental health domain. In this work, we take that step and describe our new corpus developed in close consultation between NLP and psychiatric researchers and clinicians. The corpus includes manually transcribed interactions between clinical interviewers and healthy control subjects or those with diagnosed schizophrenia or bipolar disorder. We describe all data collection procedures, extract a wide range of promising linguistic features from the data, and conduct an extensive first set of analyses to document trends in linguistic behavior among the SZ, BD, and HC subject groups. We show that linguistic diversity manifests itself in various ways across subject populations.
We hope that our work will diversify NLP research in the mental health domain beyond social media settings, and that it will open the door for more clinically valid studies of language behavior associated with diagnosed psychiatric conditions. All features extracted for this work are freely available on GitHub and can be downloaded without any further permission. The de-identified transcripts can be downloaded from the National Institute of Mental Health data archive, in keeping with the terms of our NIH reporting requirements and the corresponding research grant that funded this work. In the future, we plan to extend our study to also investigate spoken language and acoustic properties from the collected audiorecordings.

Limitations
This work is limited by a few factors. First, although our dataset is large by psychiatric standards, its size is still limited compared to datasets used for many other modern NLP tasks. This prevents us from productively using complex models that have achieved state-of-the-art performance in other tasks, as documented in §6.2 with our experiments using fine-tuned versions of BERT, MentalBERT, and Mental-RoBERTa. We note that a disadvantage of deep learning models is that they are less interpretable than feature-based counterparts; thus, since classifier performance is not a central goal of our work, the poor performance observed with pre-trained Transformers is not a crucial shortcoming. Our primary interest in the classification experiments described in Section 6 was to establish learning validity for our dataset.
Second, although we explore a wide range of temporal, sentiment, psycholinguistic, emotion, and lexical diversity features in our experiments, our feature set does not comprehensively or conclusively cover all linguistic traits that may be of interest when analyzing the language behaviors of our target subject groups. Thus, our claims are limited by the boundaries of the conditions tested in our experiments: it may be that the most informative linguistic features are as yet undiscovered.
We hope that this is indeed the case, and that future work develops new innovations that expand upon our findings. Finally, our dataset is restricted to English conversations. The extent to which this research generalizes to other languages, including those vastly different from or substantially less-resourced than English, is unknown for now. The collection of complementary data in other languages, and especially those with different morphological typology, is a promising direction for future work.

Ethical Considerations
Several important ethical questions arise when working with data collected from human participants generally, and with data dealing with mental health concerns specifically. We consider both sets of questions here. We also point readers to our datasheet and other details regarding fair and inappropriate uses of our data in Appendix C.

Dataset Creation
In collecting this data, we followed all codes of ethics laid out by the Association for Computational Linguistics, the United States of America's National Institutes of Health, and the U.S. National Institute of Mental Health. All universities, laboratories, hospitals, and research centers involved in this project secured ethics approval from their Institutional Review Boards before working with any data. Data was collected from outpatients recruited through studies supported by the National Institute of Mental Health. Inclusion criteria were the ability to provide informed written consent, a diagnosis of either bipolar disorder or schizophrenia/schizoaffective disorder according to the Diagnostic and Statistical Manual of the American Psychiatric Association, and outpatient status at the time of assessment. Informed written consent was taken from all participants for audiorecording and de-identified research data sharing.
Audiorecordings were professionally transcribed by a trusted third-party company. Any identifiable data was manually removed from the transcripts at the time of transcription, and transcripts were verified to be de-identified by members of the study team. No data that might point toward the identity of any person(s) was used in any way in this work, including for feature creation, modeling, or analysis, nor will it be shared at any time. Collected audiorecordings are stored securely and are not part of the data release (and are also inaccessible to some members of the study team).
De-identified transcripts are shared in full compliance with all governing bodies involved, through the National Institute of Mental Health's data archive, following federally mandated grant reporting and data sharing requirements. All parties interested in accessing the data will be required to complete the NIMH Data Archive Data Use Certification, which outlines terms and conditions for data use, collaboration with shared data, compliance with human subjects and institutional research requirements, and other information. The data use certification is non-transferable, and recipients are not allowed to distribute, sell, or move data to other individuals, entities, or third-party systems unless they are authorized under a similar data use certification for the same permission group. The released transcripts include timestamps and de-identified utterances. Feature files (containing only the numeric feature vectors generated for each transcript using the procedures described in §4.2) are also available on GitHub at the link provided in this paper.

Intended Use
The intended use for this dataset is to enable discovery and analysis of the linguistic characteristics and language behaviors associated with members of three subject groups: people with schizophrenia, people with bipolar disorder, and healthy controls. Although we provide results from proof-of-concept experiments to classify transcripts into subject groups, these are intended merely to demonstrate evidence of data validity and learnability, and the experimental inferences are provided to showcase linguistic differences between groups. This in turn establishes the feasibility of the dataset as a language analysis resource for the target populations. We do not condone use of this dataset to develop models to automatically diagnose individuals with mental health conditions, especially in the absence of feedback from trained professionals and psychiatric experts.
When used as intended and when functioning correctly, we anticipate that models developed and analyses performed using this dataset may be used to facilitate discovery of novel linguistic biomarkers of schizophrenia or bipolar disorder. This information could be used to support mental health research. When used as intended but giving incorrect results, researchers may place undue importance on irrelevant linguistic biomarkers. Since this dataset is not intended for diagnostic purposes, this is unlikely to lead to real-world harm, although it may slow the progress of some psychiatric research as researchers attempt to replicate and verify results.
Potential harms from misuse of the technology include the development of models to predict mental health status, and subsequent misprediction of serious mental health conditions. We reiterate that this dataset is not intended for diagnostic use, and that individuals seeking mental health care should always consult trained professionals. The National Institute of Mental Health's data archive includes a mechanism for logging research studies associated with the shared dataset. We will monitor this log and contact researchers who attempt to use the data for purposes outside its intended use.

C.2 Intended Audience
The intended audience for this dataset includes psychiatric and computer science researchers, and others interested in understanding language patterns common in people with diagnosed mental health concerns. The intended use for this data is to enable discovery and analysis of the linguistic characteristics and language behaviors associated with people with schizophrenia, people with bipolar disorder, and healthy controls. We do not intend for this dataset to be used for automated diagnostic purposes, and we do not encourage others to attempt to replace psychological or psychiatric treatment with classification or deep learning methods.

C.3 Validity of Diagnoses
Recruited subjects with schizophrenia had a DSM-IV diagnosis of schizophrenia or schizoaffective disorder and were medicated for the same. Subjects with bipolar disorder met the conditions defined in the APA's DSM-5. Healthy controls did not have a clinical diagnosis of either condition.

Figure 1 :
Figure 1: Transcription formats prior to preprocessing.The format at right was used when patient or interviewer utterances exceeded a given timestamp and continued onward into the next dialogue block.

Figure 2 :
Figure 2: Blue represents healthy controls, orange represents schizophrenia, and green represents bipolar.Figure is best viewed in color.Figure shows violin plots with quartiles, medians, and interquartile ranges across classes Healthy, Schizophrenic, and Bipolar.

Figure 3 :
Figure 3: Blue represents healthy controls, orange represents schizophrenia, and green represents bipolar.Figure is best viewed in color.Figure shows violin plots with quartiles, medians, and interquartile ranges across classes Healthy, Schizophrenic, and Bipolar.

Figure 4 :
Figure 4: Heat maps show correlations between features in Scene 2 transcripts among different subject groups.Correlations range from weakly negative (darkest) to strongly positive (lightest).

Figures 5 Figure 5 :
Figures 5 and 6 visualize the feature distributions that complement those provided in the main paper (Figures2 and 3).The figures provided in the main paper correspond to Scene 2 from our dataset, whereas the figures from this section correspond to Scene 1.C Appendix C: Datasheet and Fair and Inappropriate UsageC.1 Data Collection and CreationData in the form of audiorecordings was collected at the University of California San Diego, the Uni-

Figure 6: Violin plots with quartiles, medians, and interquartile ranges across the Healthy, Schizophrenic, and Bipolar classes. Blue represents healthy controls, orange represents schizophrenia, and green represents bipolar. Figure is best viewed in color.
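The per-group correlation matrices visualized as heat maps in Figure 4 can be computed with plain Pearson correlation over each group's feature columns. The sketch below is illustrative only (it is not the paper's analysis code), uses the standard library so it carries no dependencies, and the feature names in the usage example are hypothetical:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """columns: dict mapping feature name -> per-subject values.

    Returns a dict keyed by (feature_a, feature_b), i.e. the matrix
    rendered as a heat map, one matrix per subject group.
    """
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b])
            for a in names for b in names}

# Hypothetical features for one subject group (e.g. healthy controls):
hc = {"utt_len": [1, 2, 3, 4], "type_token": [2, 4, 6, 8]}
hc_matrix = correlation_matrix(hc)
```

Rendering is then a matter of passing each group's matrix to any heat-map plotting routine; computing the three matrices separately is what allows the per-group comparison the figure describes.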

Table 3: Performance comparisons between classifiers on Scene 1 and Scene 2 transcripts. Results show accuracy (A) and F1 score for one-versus-one classification between BD, SZ, and HC subjects.
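For readers unfamiliar with the one-versus-one setting used in Table 3: each pair of groups (BD vs. SZ, BD vs. HC, SZ vs. HC) is scored separately, restricting evaluation to subjects from the two groups in question. The following stdlib-only sketch shows the metric computation; it is illustrative, not the paper's evaluation code, and `pairwise_scores` is a hypothetical helper name:

```python
from itertools import combinations

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive):
    """Binary F1, treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pairwise_scores(y_true, y_pred, groups=("BD", "SZ", "HC")):
    """One-versus-one scoring: restrict to each pair of groups in turn."""
    scores = {}
    for a, b in combinations(groups, 2):
        idx = [i for i, t in enumerate(y_true) if t in (a, b)]
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        scores[(a, b)] = (accuracy(yt, yp), f1(yt, yp, positive=a))
    return scores
```

Any classifier's per-subject predictions can be dropped into `pairwise_scores` to produce the accuracy/F1 pairs reported per group pairing.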

Table 4: Performance comparisons between Transformers on Scene 2 transcripts. BB refers to BERT base, MB to MentalBERT, and MR to MentalRoBERTa.
A Appendix A: Sample Transcripts

A.1 Scene 1: Introducing Yourself

Examiner: Can you tell me, are the residents in this building friendly?

Participant: I don't really know because I keep to myself. I don't really socialize with other residents to find out what they're really like. Everyone is really nice, definitely knock on their door to see what they're doing or not. Introduce yourself and find out you know what their place is like, or you know, who they live with, all that stuff - kind of what goes on in your apartment. Or do I need to be there in the apartment for them to get inside and look at the leak? Or do I need a key? Or do they need a key? Not me. Do they need me physically there in the apartment to see the leak? Or, two, do they need a key from me to get inside the apartment to do the leak, if that case I need to get on my errands by then.

Examiner: Um, so I have a list, and you're on the list. But there are other problems that are more serious.

Participant: Okay, but this leak is getting worse, and I would like for you to try and get back to me in the next possible days to let me know what's going on with the leak. Or I might have to threaten to move out because this is unright and you are not being justice with this. And, um, I think it's unfair that you're putting other people that are higher ahead and their problems ahead of mine. I think if I'm paying your rent and your deposit, and if I had a pet or whatever and I paid the deposit for that too.
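Transcripts like the excerpts above alternate Examiner and Participant turns introduced by speaker tags. A minimal stdlib sketch of splitting such text into (speaker, utterance) pairs is shown below; this is a hypothetical preprocessing helper for illustration, not the paper's released pipeline, and it assumes turns are always tagged `Examiner:` or `Participant:`:

```python
import re

# Assumed speaker tags; extend the alternation if other roles appear.
TURN = re.compile(r"(Examiner|Participant):")

def parse_turns(text):
    """Split raw transcript text into (speaker, utterance) tuples."""
    parts = TURN.split(text)
    # parts = [preamble, speaker, utterance, speaker, utterance, ...]
    turns = []
    for i in range(1, len(parts) - 1, 2):
        speaker, utterance = parts[i], parts[i + 1].strip()
        if utterance:
            turns.append((speaker, utterance))
    return turns
```

Per-speaker utterance lists extracted this way are the natural input for computing linguistic features over participant speech only, separate from the examiner's prompts.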