Affective Idiosyncratic Responses to Music

Affective responses to music are highly personal. Despite consensus that idiosyncratic factors play a key role in regulating how listeners emotionally respond to music, precisely measuring the marginal effects of these variables has proved challenging. To address this gap, we develop computational methods to measure affective responses to music from over 403M listener comments on a Chinese social music platform. Building on studies from music psychology in systematic and quasi-causal analyses, we test for musical, lyrical, contextual, demographic, and mental health effects that drive listener affective responses. Finally, motivated by the social phenomenon known as 网抑云 (wǎng-yì-yún), we identify influencing factors of platform user self-disclosures, the social support they receive, and notable differences in discloser user activity.


Introduction
Music can evoke powerful emotions in listeners (Meyer, 1956). However, our emotional reactions to it are not universal: affective responses to music are highly personal. Just as you may wonder why your friend is sobbing to a song that you only feel ambivalent about, a listener's emotional response to music varies not only with inherent audio or lyrical features (Hevner, 1935; Webster and Weir, 2005; Van der Zwaag et al., 2011), but also with other factors such as a listener's demographics, mental health conditions, and surrounding environment (Krugman, 1943; Robazza et al., 1994; Gregory and Varney, 1996; Juslin and Västfjäll, 2008; Saarikallio et al., 2013; Garrido et al., 2018). As a result of this idiosyncrasy, it has been extremely difficult to precisely measure the marginal effects of these variables on a listener's affective response (Yang et al., 2007; Beveridge and Knox, 2018). This difficulty is compounded when examining how a collection of these factors influences individual affective reactions in combination (Gómez-Cañón et al., 2021).
Music psychology has long focused on identifying the relationships between human affect and music, both for emotions that are perceived and for those that are felt. Perceiving and feeling emotions in music, while highly related, are not identical (Hunter and Schellenberg, 2010). Examining the latter has proved challenging: in addition to insufficient scale for finding significance, measuring felt emotions in participatory studies often interferes with the experience itself (Gabrielsson and Lindström, 2010). While recent computational studies have attempted citizen science approaches for annotation (Gutiérrez Páez et al., 2021), reliability remains an issue; annotator confusion persists between the concepts of perceived and induced emotions (Juslin, 2019). Our work expands this line of research by examining affective responses to music in a natural setting: an online social music platform.
We test for differences in affective responses to music by computationally measuring expressed emotions in over 403M listener comments on one of China's largest social music platforms, Netease Cloud Music. Our paper offers the following three contributions. First, we reveal several nuances in listener affective responses against a host of musical, lyrical, and contextual factors, showing evidence of emotional contagion. Second, in a multi-modal quasi-causal analysis, we show that listeners of different genders and ages vary in their reactions to musical stimuli and identify specific features driving demographic effects on affective responses. Third, motivated by the social phenomenon known as 网抑云, we systematically study self-disclosures of mental health disorders on the platform, identifying driving factors of this behavior, the social support they receive, and differences in discloser user activity.

Data
Our work draws on one of the largest music streaming services in China, Netease Cloud Music, and focuses on Chinese-language user content.
Netease Cloud Music. 网易云音乐 (wǎng-yì-yún-yīn-yuè) has over 185 million monthly active users (Dredge, 2022). Unlike mainstream music streaming services in the United States such as Spotify and Apple Music, Netease Cloud Music is a social music platform (Zhou et al., 2018; Wang and Fu, 2020). Here, among other unique features, each song, album, and playlist has a comment section that serves as a discussion board, where users can post top-level comments as well as reply to or up-vote existing ones. These platform interactions provide a natural setting for observing listener responses, where users are able to post freely in the comment sections of what they are currently listening to. Users are required to create an account to access most of the platform's features; when doing so, users optionally input personal demographic information like age, gender, and location, which they can then choose to display as public or private.
Dataset Collection. To collect a representative sample of public platform commenting activity, we adapt traditional snowball sampling (Atkinson and Flint, 2001) across multiple random seeds to build an exhaustive list of user, song, album, and playlist entity ids on the platform. We then uniformly sample from the set of entities that have at least one public comment posted. Data was collected from all public content ranging from the platform's inception in 2013 to 2022, totaling over 455K albums, 2.87M songs, 1.36M playlists, 29.9M users, and 403M comments. A detailed breakdown of our data and a view of the interaction interface of the platform are shown in Appendix Section A. The study and data collection were approved by an Institutional Review Board as exempt.

Measuring Affective Response
We measure affective responses to music as expressed in comments posted under their comment sections. Since not all comments are indicative of a user's emotional response, we sample a subset of user content and examine both the experiencer of the emotion and its expressed stimulus before conducting our analysis.
Table 1 (selected example comments):
我只想和你一个人做那些浪漫到极致的事 (Translation: I just want to do the most romantic things with you alone)
果然不该来的。混蛋老爸，气死我了！ (Translation: I shouldn't have come. Asshole dad, pisses me off!)
太棒了好听太治愈了我莫名有点想哭 (Translation: It's so good, it's so healing, I feel like crying for some reason)
Emotion Experiencer. Two annotators first manually annotated 1000 comments selected uniformly at random to identify the experiencer of the emotion expressed in top-level comments. Top-level comments were chosen to limit dyadic interaction effects and are used to measure affective responses later on. With an initial Cohen's κ of 0.80 and with disagreements resolved via discussion, similar to Mohammad et al. (2014), we find that the experiencer of the emotion expressed in the comment is almost always the commenter themselves (99.1%); we thus maintain this assumption in our later experiments. Selected examples and annotation guidelines are shown in Table 1 and Appendix Section B, respectively.
Affective Stimulus. Next, annotators were tasked with identifying what caused the emotional response in the comment itself. Annotators labeled comments containing emotions that could explicitly be said not to originate from music. Under the BRECVEMA framework of music-induced emotions (Juslin, 2013), emotions are evoked in listeners via a combination of mechanisms related to aesthetic appreciation, entrainment, visual imagination, and emotional associations with past experiences, among other factors. A listener's emotional state also has an effect on their music choice; for example, listeners often use music for mood regulation, or as a coping mechanism (Stewart et al., 2019; Schäfer et al., 2020). Here, we make no explicit causal assumptions of music choice and seek only to measure comment affective responses. With an initial Cohen's κ of 0.76 and with disagreements resolved via discussion, we find only a few instances (3.3%) where affective stimuli may be explicitly attributed elsewhere. There are a few patterns among these irrelevant comments: they primarily relate to album images, quotations, and easily identifiable spam messages, e.g., "沙发" (meaning "first comment"). Aiming for high precision, we create simple regular expressions and redundancy filtering to increase the relevance of comments with affective content, achieving a precision of 98.8% on a held-out test set of the same size. Specific annotation guidelines and filtering methods are shown in Appendix Sections B and D, respectively.
Measuring Affective Response. We characterize emotions across a 2-dimensional plane of valence and arousal following the Russell model of emotions (Russell, 1980), representing the degree of positivity and emotion intensity, respectively. Specifically, we employ a lexicon-based approach to measure valence and arousal in music comments, using one of the largest crowd-sourced datasets for the Chinese language, Chinese EmoBank (Yu et al., 2016), which contains 5512 words annotated for their valence and arousal. In the following sections, these measures of expressed emotions in comments are what we define as listener affective response.
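The inter-annotator agreement figures reported above for the experiencer and stimulus annotations are Cohen's κ values. As a minimal sketch of how such a value is computed, assuming two annotators labelled the same items, the snippet below uses the standard definition (observed agreement corrected for chance agreement). The function name and the toy labels are hypothetical and are not taken from the annotation data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy labels (hypothetical): "self" means the commenter is the experiencer.
annotator_1 = ["self", "self", "other", "self", "other", "self"]
annotator_2 = ["self", "self", "other", "other", "other", "self"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Because κ discounts agreement expected by chance, it is lower than raw percent agreement whenever one label dominates, which is the situation here given how often the commenter is the experiencer.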

Variations in Affective Response
Computing comment-level valence and arousal by averaging across word-level scores, 3 we analyze variations in listener affective responses to (1) musical and (2) lyrical features, (3) contextual factors, and (4) user demographic variables.
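A minimal sketch of this comment-level scoring, assuming comments have already been segmented into words and that the lexicon is available as a mapping from word to a (valence, arousal) pair. The segmentation step, the function name, and all scores in the toy lexicon are assumptions for illustration and are not values from Chinese EmoBank. The optional `excluded` set mirrors the robustness check described in footnote 3, where the most frequent sentiment terms are dropped before recomputing.

```python
def score_comment(tokens, lexicon, excluded=frozenset()):
    """Average word-level (valence, arousal) over tokens found in the lexicon.

    Returns None when no token is covered, so callers can skip the comment.
    """
    hits = [lexicon[t] for t in tokens if t in lexicon and t not in excluded]
    if not hits:
        return None
    valence = sum(v for v, _ in hits) / len(hits)
    arousal = sum(a for _, a in hits) / len(hits)
    return valence, arousal


# Toy lexicon with made-up scores; NOT values from Chinese EmoBank.
toy_lexicon = {"好听": (7.0, 5.0), "治愈": (7.5, 4.0), "想哭": (3.0, 6.0)}
# Hypothetical segmentation of one of the example comments.
tokens = ["太棒", "了", "好听", "太", "治愈", "了", "想哭"]

full = score_comment(tokens, toy_lexicon)
robust = score_comment(tokens, toy_lexicon, excluded={"好听", "好", "喜欢"})
```

Returning `None` for uncovered comments keeps them out of later averages rather than silently treating them as neutral, which is one reasonable design choice among several.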

Musical and Lyrical Features
In prior work, while much emphasis is placed on identifying the causes of perceived emotions in music, less emphasis has been placed on emotional responses, which are highly influenced by extramusical and contextual factors in listeners (Gómez-Cañón et al., 2021). Recent work has attempted to use physiological signals and self-reported emotions to measure emotional responses in listeners (Hu et al., 2018), though this has proved challenging partly due to a high degree of intercorrelations and confounds, causing the number of trials needed to measure such effects to be intractable relative to typical experiment scale (Eerola et al., 2013). Using our data, we test for the marginal effects of musical and lyrical features on affective responses.
3 To confirm that our findings were robust against high-volume sentiment terms, following a distribution analysis of comment valence and arousal scores, we drop the top three most frequent terms (好听, 好, and 喜欢, which roughly translate to "sounds good", "good", and "like", respectively) and recompute our experiments, obtaining similar results.
Methods. To understand the marginal contributions of these variables on affective responses, we fit separate multivariable linear regression models on response valence and arousal, including the features described below as regressors. As affective responses are highly idiosyncratic (Juslin and Västfjäll, 2008), we further control for listener demographics, namely age, gender, and location. We then test for multicollinearity by computing the variance inflation factor (VIF) for each variable and iteratively remove collinear variables in our regression that have a VIF greater than 5. In our analyses, we stratify continuous variables (e.g., tempo) into fixed-length category indicator variables (e.g., 80-90 BPM, 90-100 BPM, and so on) and measure the average marginal effects (AME) on valence and arousal of each stratum, using the first of such categories as the reference group (e.g., the AME of 90-100 BPM, and so on, relative to 80-90 BPM).
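The stratification step described above can be illustrated with a small, simplified sketch. It bins a continuous variable into fixed-width strata and reports each stratum's mean outcome relative to the first (reference) stratum. In a linear model whose only regressors are these indicator variables, each coefficient equals exactly this difference in means; the regressions described above additionally include other features and demographic controls, which this sketch omits. The function names and toy numbers are hypothetical.

```python
from collections import defaultdict


def stratum_label(value, start, width):
    """Map a continuous value to a fixed-width bin label such as '90-100'."""
    k = int((value - start) // width)
    low = start + k * width
    return f"{low}-{low + width}"


def effects_vs_reference(values, outcomes, start, width):
    """Mean outcome of each stratum minus the mean of the first stratum.

    Assumes at least one observation falls in the reference stratum
    [start, start + width).
    """
    groups = defaultdict(list)
    for x, y in zip(values, outcomes):
        groups[stratum_label(x, start, width)].append(y)
    means = {label: sum(ys) / len(ys) for label, ys in groups.items()}
    reference = f"{start}-{start + width}"
    return {
        label: mean - means[reference]
        for label, mean in means.items()
        if label != reference
    }


# Toy data (hypothetical): song tempo in BPM and comment valence.
tempos = [82, 88, 93, 97, 104, 109]
valence = [5.0, 5.2, 5.5, 5.7, 6.0, 6.4]
effects = effects_vs_reference(tempos, valence, start=80, width=10)
```

Using indicator variables rather than a single linear tempo term lets non-monotonic patterns show up stratum by stratum, at the cost of one parameter per bin.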
Musical Features. We use librosa (McFee et al., 2015), pydub (Robert et al., 2018), and timbral_models of the AudioCommons project (Font et al., 2016) to derive musical features from song files. We extract (1) tempo and (2) tempo standard deviation, both in beats per minute (BPM) (Ellis, 2007); (3) loudness, measured as the average decibels relative to full scale (dBFS) value of the entire song; (4) mode, namely, major or minor, and (5) key, e.g., C# minor (Krumhansl, 2001); as well as eight additional timbral features, i.e., perceived sound qualities of a piece of music. They are, specifically, (6) depth, related to the emphasis of low frequency content, (7) brightness, a measure that correlates both with the spectral centroid and the ratio of high frequencies to the sum of all energy of a sound, and (8) warmth, often created by low and midrange frequencies and associated with lower harmonics (Pearce et al., 2017); (9) roughness, a sound's buzzing, harsh, and raspy quality (Vassilakis and Fitz, 2007); (10) sharpness, measuring high frequency content (Zwicker and Fastl, 2013); (11) hardness, the amount of aggression (Pearce et al., 2019); (12) reverberation, a sound or echo's persistence after it is initially produced (Jan and Wang, 2012); and (13) boominess, a sound's deep resonant quality as measured by the booming index (Hatano and Hashimoto, 2000). Here, reverberation is classed as a binary variable, while all other
First, tempo exhibits a bimodal distribution relative to both valence and arousal; listeners are most intensely positive for tempos of around 110 BPM and 160 BPM, with the former eliciting greater arousal. Higher tempo variation also sees similar increases in affective responses, although tempo standard deviations of around 35-40 BPM produce the opposite effect, with arousal peaking earlier than valence. Our findings are consistent with prior work on listener self-ratings and measured physiological responses that has used coarse categorizations of tempo, e.g., "fast" tempo (Liu et al., 2018), or the presence and absence of tempo variation (Kamenetsky et al., 1997), as opposed to the continuous measures we use here.
Second, consistent with prior work (Schubert, 2004; Gomez and Danuser, 2007), loudness generally shows a strong positive correlation with more intensely positive reactions; changes in loudness also see a greater change in AME than changes in tempo. However, this trend is reversed for the loudest songs (i.e., between -5 and 0 dBFS). While unexplored in prior work within music psychology, this observation intuitively follows neural downregulation responses to excessively loud or unpleasant sounds (Hirano et al., 2006; Koelsch, 2014).
Third, consistent across all keys (Appendix Figure 14), major mode in songs has a greater valence and a lower arousal than minor mode.This observation extends prior work investigating the interaction effects between mode and affective responses (Van der Zwaag et al., 2011) in a western tonal context, suggesting that associations of sadness and happiness by way of musical mode are also consistent in Chinese listeners.
Fourth, increases in most timbral characteristics see similar increases in the intensity and positivity of reactions up until a point of extremity, whereafter the opposite effect is observed.The only exceptions are roughness and warmth, in which both valence and arousal see monotonically decreasing and increasing trends, respectively.Our results for brightness specifically provide nuance into how, when exploring an expanded set of timbral characteristics and moving beyond only varying timbre through different instruments (Hailstone et al., 2009;Eerola et al., 2012;Wallmark et al., 2018), excess of a timbral feature can produce the opposite initial effect on affective responses.
Fifth, listener affective reactions mirror the psychological states expressed in lyrics. Changes in response valence and arousal closely match the proportion of LIWC category terms for affective processes. Greater use of positive emotion terms sees greater response positivity (r=0.93, p<0.05), while the opposite is true for negative emotion terms (r=-0.92, p<0.05), and both see rises in response intensity with their increased use. Furthermore, increases in first-person pronouns also see decreases in valence (r=-0.94, p<0.05), mirroring work on the depressed psychological states reflected through their increased use (Pennebaker, 2011). Interpreted together with our findings on musical features, these observations mirror emotional contagion (Juslin, 2013), where the recognition of emotions expressed in music evokes similar emotions in listeners.
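The r values above summarize how closely response valence tracks the proportion of a lyric category. As a small sketch, assuming these are Pearson correlation coefficients computed over per-stratum averages (the text does not state the exact variant), the coefficient can be computed as follows. The function name and the toy numbers are hypothetical and are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy data (hypothetical): share of positive-emotion terms in lyrics (%)
# and the mean comment valence observed at that share.
posemo_pct = [1, 2, 3, 4, 5]
mean_valence = [5.0, 5.2, 5.1, 5.6, 5.8]
r = pearson_r(posemo_pct, mean_valence)
```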
These findings, compared to prior work, highlight the importance of using finer-grained measurements on an extended set of features and controls to provide a more nuanced analysis of emotional responses to musical and lyrical stimuli. Expanded results with the full list of figures are shown in Appendix Figures 13-18.

Contextual Factors
Extramusical factors such as listening context (e.g., listening to music when grabbing coffee vs. when exercising) also influence the emotional effects of music (Sloboda and O'neill, 2001; Greasley and Lamont, 2011; Vuoskoski and Eerola, 2015). Prior work has primarily utilized experience sampling methods (Csikszentmihalyi and LeFevre, 1989) to study musical experiences in everyday contexts, where participants are polled at random intervals during the day, though generalizations to the population at large have proved difficult with small sample sizes (Sloboda et al., 2001; Juslin et al., 2008). While we are unable to obtain information about the physical setting a user was in (e.g., that a user was exercising when listening to a song), here, using our data on playlists and treating the choices of users in listening to playlists of specific types as context, we tease out the marginal effects that these choices have on affective responses.
Choice as Context. We obtain context variables on 1.36M playlists through their tags, used by creators to label individual playlists. Tags consist of a set of physical setting (e.g., afternoon tea), emotional (e.g., nostalgic), and thematic (e.g., video game music) categories, in addition to language (e.g., Chinese) and stylistic (e.g., jazz) labels. As users primarily discover new playlists within the platform by browsing specific tags, we treat these tags as implicit signals of choice with these listening contexts, aiming to identify those that may differ in the emotional responses produced (e.g., that a user chose to listen to an exercise-tagged playlist rather than an afternoon-tea-tagged one), noting that we do not make explicit causal assumptions behind the factors that led to these user choices.
Methods. To identify the marginal effects of contextual choices on affective responses, we fit separate mixed-effect multivariable linear regression models on response valence and arousal, including tagged category indicator variables as features and controlling for listener demographics. To further control for differences between playlist songs, we include them as random effects; for computational tractability, we include only random effects for songs that are labeled with 10 or more unique tags.
Results. Cultivating affect is a driving reason behind why users create playlists (DeNora, 2000; Siles et al., 2019), and our results indicate that playlists created by users are also generally successful at cultivating these affects among the general user population. As shown in Figure 2, playlists tagged with leisurely activity categories corresponded to the highest positivity in responses, consistent with prior work on stress levels in everyday situations (Västfjäll et al., 2012)

Demographic Variations
Individuality is a driving factor in how listeners experience musically-evoked emotions (Yang et al., 2007; Juslin and Västfjäll, 2008; Gómez-Cañón et al., 2021). However, measuring how individual differences affect emotional responses to music has proved challenging, with many researchers citing the insufficiency of typical experiment scale as a primary reason (Juslin et al., 2008; Lundqvist et al., 2009; Cameron et al., 2013), especially in the presence of confounders. For listener demographics, prior work has seen conflicting observations of how demographic effects modulate affective responses against musical features. For example, some observe that age and gender modulate emotional responses against tempo, mode, volume, and pitch (Webster and Weir, 2005; Chen et al., 2020), while others report the absence of such demographic effects or even contrasting observations (Robazza et al., 1994; Cameron et al., 2013). These contrary results might be due to variable experimental setups between studies, wherein the method of measurement will often interfere with the experience itself (Gabrielsson and Lindström, 2010). This raises the importance of studying emotional reactions in a natural setting when analyzing affective responses to music in everyday situations. Here, we test for demographic differences in affective responses in relation to song features using our data.
Demographic Variables. Our analysis focuses on two main demographic variables, namely listener gender and age. We operate within the constraints of platform-provided choices in user registration for our variable categories and use only publicly displayed user data in our analysis.
Methods.To test for differences between pairs of demographic groups in their affective responses to musical and lyrical features, we formalize alternations between groups as treatments and compute average treatment effects (ATE).In order to account for covariates and reduce bias due to confounding variables, we construct a multi-modal stratified propensity score matching (PSM) model as a quasi-causal analysis of demographic effects.
Here, we formalize comments as subjects; the propensity score, defined traditionally as the likelihood of being assigned to a treatment group based on observed characteristics of the subject (Rosenbaum and Rubin, 1983), is thus a scaled estimate of the likelihood of a commenter being of a demographic group $g_i$ given a set of song features $f_i$, or $P(g_i \mid f_i)$. We estimate this probability, the propensity score, via logistic regression on a song's musical and lyrical features, and match data points within stratified deciles of this score to mitigate confounding bias (Rosenbaum and Rubin, 1984; Paul, 2017). Within these matched and stratified deciles, we fit separate linear regression models on response valence and arousal against specific song features, weighting and pooling stratum-specific estimated treatment effects to estimate the ATE (Imbens, 2004) and its variance (Lunceford and Davidian, 2004). Consistent with prior work in musical emotions (Kamenetsky et al., 1997) and in social psychology on how cultural constructions of gender may account for differences in emotional display (Bem, 1974), we observe that response valence and arousal by demographic groups differ in their distributions; for example, as shown in Appendix Section A.7, comments made by women are on average higher in both valence and arousal than those made by men. Therefore, we test specifically for standardized change in affective responses across song features within demographic groups. Finally, as in Section 4.1, we stratify continuous variables in our analyses into fixed-length categories and estimate the ATE of each stratum.
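The stratify-then-pool idea behind this estimate can be sketched in a few lines. This is a simplified illustration, not the procedure used in the paper: it assumes propensity scores have already been estimated (the paper uses logistic regression on song features), it uses a plain difference in mean outcomes inside each stratum instead of a per-stratum regression, and it omits the variance estimate. The function name, the record layout, and the toy data are hypothetical.

```python
def stratified_ate(records, n_strata=10):
    """Pool stratum-level effects over propensity-score quantile strata.

    records: iterable of (propensity, treated, outcome) triples.
    Strata that lack either group are skipped.
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    pooled, used = 0.0, 0
    for k in range(n_strata):
        # Near-equal-sized quantile bin of the sorted propensity scores.
        stratum = ordered[k * n // n_strata:(k + 1) * n // n_strata]
        treated = [y for _, t, y in stratum if t]
        control = [y for _, t, y in stratum if not t]
        if not treated or not control:
            continue
        effect = sum(treated) / len(treated) - sum(control) / len(control)
        # Weight each stratum effect by the number of records it holds.
        pooled += len(stratum) * effect
        used += len(stratum)
    if used == 0:
        raise ValueError("no stratum contains both groups")
    return pooled / used


# Toy data (hypothetical): (propensity score, group indicator, outcome).
toy_records = [
    (0.1, 0, 1.0), (0.2, 1, 2.0), (0.3, 0, 1.0), (0.4, 1, 2.0),
    (0.6, 0, 3.0), (0.7, 1, 5.0), (0.8, 0, 3.0), (0.9, 1, 5.0),
]
ate = stratified_ate(toy_records, n_strata=2)
```

Comparing groups only within strata of similar propensity is what limits confounding from the song features that enter the score; with deciles, `n_strata=10` would be used on a much larger sample.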
Results. As shown in Figure 3, we find that listener age and gender both modulated affective responses to statistically significant degrees across a series of musical and lyrical features. Compared to men, women had more intensely positive affective reactions for songs that were louder (>-12 dBFS), of lower tempo (<120 BPM), of minimal tempo standard deviation (<5), of minor mode, and that had reverb, though gender differences often diminished (e.g., tempo >160 BPM) or became statistically insignificant at feature extremities. Lyrically, women were affected more negatively with a greater proportion of negemo terms, while men were affected more positively for posemo terms, consistent with observed gendered responses in other mediums (Bradley et al., 2001; Fernández et al., 2012) (DeNora, 1999), though music choices often misalign with intended well-being outcomes (Stewart et al., 2019). We hope our work further facilitates more effective and more intentional music choices in daily consumption to achieve these desired results. Under appraisal theory, affective responses are learned and conditioned through individual lived experiences rather than innate to certain biological factors (Brody and Hall, 2010). These findings on demographic effects should then be interpreted as products of the social norms, values, and lived experiences (de Boise, 2016) of those who may self-identify under the broad demographic groups in question, with the platform, song, and comment board being part of the context in which these emotions are deployed.

Disclosures of Mental Health Disorders
In the context of social media, given the frequent benefits of anonymity (De Choudhury and De, 2014) and social connectedness (Bazarova and Choi, 2014), self-disclosures of personal details can be a method to find social support, advice, and belonging (Ernala et al., 2018; Yang et al., 2019). This phenomenon gives life to "网抑云," which refers to the outpour of emotional and personal comments on the social music platform, especially late at night and under sad songs. While known colloquially and in popular culture, the mechanisms behind self-disclosure phenomena in the context of social music platforms are not well understood.
Motivated to better understand disclosures of mental health disorders in a musically situated social environment, we frame them as affective responses (Ho et al., 2018) and test for factors driving this behavior, the social support they receive, and differences in discloser user activity. Addressing these unknowns will help us understand how users may use social music platforms for therapeutic purposes (Schriewer and Bulaj, 2016) and guide us to better support vulnerable and at-risk individuals.
Dataset Collection. In the absence of clinically aligned user data (Harrigian et al., 2020), we source disorder terms from the DSM-5-TR (American Psychiatric Association, 2022) and use regular expressions to identify self-reported statements of diagnosis (Coppersmith et al., 2014, 2015; Cohan et al., 2018) for mental health disorders in music comments. Two native Chinese speakers then manually filter for genuine statements of disclosure (i.e., excluding jokes, quotes, and clearly disingenuous statements), resulting in 1133 users with self-reported mental health disorders. We find that, among all disclosers, the most commonly disclosed disorders are depression (81.2%), anxiety (19.9%), and bipolar (18.5%) disorders; additionally, most users (60.6%) self-identify as women, consistent with constituent gender differences of affective disorders in national studies (Huang et al., 2019). Disclosers show greater platform usage (Kolmogorov-Smirnov on user levels, p<0.01), insomnia-aligned diurnal user activity consistent with disorder symptoms (Taylor et al., 2005; Harvey, 2008), increased engagement with playlists of sadder natures, e.g., as shown in Figure 4, loneliness (+302%), sadness (+158%), and night (+50.1%), and decreased engagement with playlists of more active natures, e.g., exercise (-51.7%), compared to typical users. These observations mirror affective disorder activity trends (Cooney et al., 2013) and suggest that people with affective disorders are more likely to use music reflective of negative emotions than positive emotions to manage feelings of sadness and depression (Stewart et al., 2019). A detailed breakdown of our data, comorbidities, and our specific regular expressions are described in Appendix Section F.
Affective Response. Treating the act of self-disclosure as an affective response, we test for factors driving this behavior. Statements of self- ). Responses to the first are split in function, with disclosers either expressing their diagnosis in empathy for encouragement ("...我一年前也确诊了，事情会好起来的", meaning "...I was diagnosed a year ago, things will be better"), or to commiserate ("我也确诊了，活着好难...", meaning "I was diagnosed too, living is so hard..."), showing evidence of resonance (Miller, 2015; Rosa, 2019) and high person-centered condolence (High and Dillard, 2012).
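To illustrate the regular-expression step described under Dataset Collection above, here is a deliberately small sketch of matching first-person statements of diagnosis. The disorder list and the pattern are hypothetical stand-ins; the expressions actually used are those in the paper's appendix, and they are not reproduced here. A pattern this loose will also match non-genuine statements (for example, comments about someone else), which is why a manual filtering pass like the one described above is still needed.

```python
import re

# Hypothetical disorder terms (depression, anxiety disorder, bipolar disorder).
DISORDERS = ["抑郁症", "焦虑症", "双相情感障碍"]

# Hypothetical pattern: "我" (I), up to 4 characters, "确诊" (diagnosed),
# up to 6 characters, then a disorder term, all within a single clause.
CLAUSE_CHAR = r"[^，。！？,.!?]"
DISCLOSURE = re.compile(
    "我" + CLAUSE_CHAR + "{0,4}确诊" + CLAUSE_CHAR + "{0,6}?("
    + "|".join(map(re.escape, DISORDERS)) + ")"
)


def find_disclosure(comment):
    """Return the disorder term of a candidate disclosure, or None."""
    match = DISCLOSURE.search(comment)
    return match.group(1) if match else None


# Toy comment (hypothetical): "I was diagnosed with depression last year..."
candidate = find_disclosure("我去年确诊了抑郁症，活着好难")
```

Restricting the match to a single clause (no punctuation between the pronoun and the disorder term) trades recall for precision, which fits a pipeline where candidates are reviewed by hand afterwards.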
Social Support.Characterizing audience engagement around self-disclosure comments in their content, we identify supportive comments according to the four major classes of social support around health concerns-prescriptive, informational, instrumental, and emotional support-from established literature (Turner et al., 1983;George et al., 1989;De Choudhury and De, 2014) and label the main type of support each comment falls under.
We then fit logistic regression models on the dependent variable of receipt, aiming to identify where users are more likely to receive a supportive comment in response to disclosure, including song features as independent variables and song popularity, comment length, user demographics, and comment LDA topic distributions as controls. We observe that emotional (52%, e.g., "加油，事情一定会好起来的我保证", meaning "good luck, everything will be better I promise") and prescriptive support (31%, e.g., "听一些令人振奋的歌曲吧", meaning "listen to some uplifting songs") largely exceed informational (9%, e.g., "...治疗可能会有帮助，两年治疗后我...", meaning "...therapy could help, after two years of therapy I...") and instrumental (8%, e.g., "...你可以私聊我", meaning "...you can private message me") forms in response to disclosures. Several psycholinguistic lyrical features proved statistically significant (p<0.05) in predicting whether a disclosure comment to a song would receive a supportive reply; the rate of terms in lyrics relating to social processes, specifically friend (+2.23) and ingest (+0.97), positively predicts this prosocial behavior, and negative emotion terms (-2.04) do so negatively, mirroring negative correlations between sadness and prosocial tendencies (Ye et al., 2020). For musical features, only reverberation did so positively (+0.90). While past work has studied the prosocial effects of music, most have only used a limited set of author-chosen songs (Greitemeyer, 2009; Kennedy, 2013) or crowd-sourced prosocial perceptions (Ruth, 2017); here, we specifically identify what makes for prosocial songs and situate our study in the context of social support to mental health self-disclosures. Taken together, these observations not only provide ample pointers for music therapists on musical and dyadic conversational means for more successful emotion-focused interventions (Jensen, 2001) but also guide users on how to effectively find social support on the platform when needed (De Choudhury and Kiciman, 2017).

Discussion and Future Work
In this work, we sought to examine the driving factors behind variations in emotional reactions to music, via a large-scale computational study of a Chinese social music platform.Our analyses here reveal several nuances in how idiosyncratic variables elicit emotional responses, with a degree of precision that prior studies have often lacked thus far.In a case study of mental health self-disclosures in music comments, we characterized a type of discourse in the context of a popular social phenomenon, demonstrated the importance of posting location in determining the social support disclosures would receive, and revealed several factors driving the prosociality of music in this context.We see our present work situated in the broader context of studying emotionality in music and in the design of platforms to promote healthier interactions more centered on user well-being.Here, we highlight a few limitations and directions for future work; models, code, and anonymized data are made available at https://github.com/skychwang/music-emotions.
The music we listen to has a strong effect on our moods (McCraty et al., 1998).The integration of emotional response analysis into music recommendation systems could promote healthier recommendations (Konstan and Riedl, 2012; Singh et al., 2020) more cognizant of listener well-being outcomes.No one size fits all, and more sophisticated analyses could better capture more factors that explain emotional response variations towards creating more personalized music emotion recommendation systems.
While our work measures the effects of demographic variables on emotional responses, there remains a bio-psycho-social question on identifying the causes behind why this variation exists as it relates to song features.Lived experiences condition our emotions (Brody, 1997); future work could aim, through significant theoretical and qualitative study, to better identify the relationships and causes behind these variation outcomes.
Several open questions also remain as to whether risk may be qualified in this context in relation to well-being. Specifically, it would be interesting to study how recommendation interactions may disproportionately affect those afflicted with mental health disorders, and how we may design platforms, in the context of well-being outcomes, under normative goals of equity and distributive justice (Rawls, 2001).

Ethical Considerations
Data Release. For user comments, taking user privacy considerations into account, we release the set of comment ids used in our analyses, which researchers can use in conjunction with the Netease API to obtain the original comment content, mirroring Twitter data release guidelines for academic research.
Identity Affiliation. In studying demographic effects, we examine only the aggregate behavior of users who make public their demographic self-identification choices during registration under platform constraints. In particular, we note that platform choices for gender are limited to binary options: 男 (men) and 女 (women). These choices should not be interpreted as having taken into account gender fluidity or the multidimensional spectrum of gender identities (Larson, 2017).

Limitations
Measuring Affective Response. We mirror the concerns raised by Mohammad (2020); notably, that (1) emotion lexicons are limited in coverage and do not include all possible terms in a language, and that (2) because languages, and in particular our perceptions of the words in them, change over time and vary socio-culturally, emotion scores for words are not immutable, either longitudinally or socio-culturally. While we have attempted to mitigate this limitation by (1) choosing the largest Chinese emotion lexicon annotated for words sourced from the domain of social media and (2) comparing our findings, when possible, to those of previous smaller-scale in-person studies that use varying methods to measure emotion, no "gold standard" measure of emotional response exists, physiological, behavioral, or otherwise (Mauss and Robinson, 2009), and we encourage future work to further examine these phenomena in a greater variety of contexts. Further, our study does not make explicit causal claims around factors of music choice and user predisposition, i.e. what caused users to choose to listen to a specific song, or what their states of mind were prior to making this choice. While our work shows evidence of variations in affective responses correlated with musical, lyrical, demographic, and mental health factors, as with the quasi-causal results estimating demographic effects on listener affective responses, we do not argue that these factors alone explain the entirety of the associated variations. In moving towards truly causal studies (Feder et al., 2021), we encourage direct participatory work to further examine these observations in larger, more controlled, and even cross-cultural contexts.
Censorship and Moderation. Users are able to report comments that violate platform rules, and active moderation of user content exists on the platform. As we use only public posts on the platform, it is thus important to interpret our findings in the context of internet censorship in China (Vuori and Paltemaa, 2015). In particular, as noted by previous studies on mental health postings in Chinese social media (Cui et al., 2022), comments that go against certain government objectives, such as the "stability and unity for a harmonious society" (Wang, 2012), which mental health-related postings may go against, are often censored (Paltemaa et al., 2020). While pilot tests matching regular expressions on such phrases within platform comments still yielded significant quantities, the degree of censorship that these types of comments receive remains unclear.

Statements of Diagnosis. As we study users with self-reported statements of diagnosis, our method potentially captures only a sub-population of each disorder: those who choose to disclose a diagnosis on a public platform under the option of anonymity. While we have attempted to increase the precision of identifying individuals who are diagnosed with specific disorders through significant manual annotation, in the absence of clinically-aligned user data we are nonetheless unable to verify whether genuine-appearing disclosures of mental health disorder diagnoses are ultimately truthful. However, as noted by Coppersmith et al. (2014), given the stigmas often associated with mental illnesses, it seems unlikely that users would disclose that they were diagnosed with a condition they do not have. Individuals who may be diagnosed with affective disorders undoubtedly also remain in the set of all users that we compare disclosers against and, as such, our results on platform user activity differences should only be interpreted in the context of discovering broad themes, not as ground truths of comparisons between those who are diagnosed and those who are not. Finally, we also note concerns in clinical psychology on the heterogeneity of psychiatric diagnoses, which remains contentious in current literature: notably, that standards of diagnosis all use different decision-making rules, that significant overlaps exist in symptoms between diagnoses, and that diagnoses may instead mask the complex underlying causes of human distress with potentially scientifically meaningless labels (Allsopp et al., 2019).

Overview of Appendix
We provide, as supplementary material, additional information about our dataset, annotation guidelines, preprocessing details, and expanded results across all experiments.

A Data
This section describes summary statistics of our data, as well as a view of the platform's user interface.

A.1 Platform Interface
Users are able to interact with the platform through their browsers, native OS applications, and phone apps. Screenshots of a song's interface are shown in Figure 7, as is a view of the iOS application's commenting page for a song.

A.2 Users
User age, gender, and region distributions (Figure 8) show that the majority of users are young men who hail from major metropolitan areas.

A.3 Songs
Song comment and comment token distributions are shown in Figure 9; lyric preprocessing and topic modeling details are in Appendix Section C.

A.5 Albums
Album comment, comment token, and release date distributions are shown in Figure 11. Songs with at least one comment show exponential bias towards recently released music.

A.6 Artists
A distribution of the number of albums and songs per artist is shown in Figure 12.

A.7 Demographic Baselines
Users of different demographic groups have varying comment valence and arousal means and standard deviations. These statistics, stratified by demographic groups on gender and age, are shown in Table 2.
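The stratified baselines described above can be illustrated with a short sketch. This is not the authors' code: the function name `stratified_stats`, the `(group, score)` input layout, the age bin labels in the usage note, and the use of the population standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def stratified_stats(rows):
    """Group (group_key, score) pairs and return {group_key: (mean, std)}.

    `rows` is a hypothetical layout: each item pairs a demographic key,
    e.g. (gender, age_bin), with one comment-level valence or arousal score.
    The population standard deviation is an assumption; the paper does not
    state which estimator was used.
    """
    groups = defaultdict(list)
    for group, score in rows:
        groups[group].append(score)
    return {group: (mean(scores), pstdev(scores)) for group, scores in groups.items()}
```

For example, calling `stratified_stats` on toy rows such as `[(("女", "18-24"), 0.6), (("女", "18-24"), 0.8)]` yields one mean and one standard deviation per (gender, age) group; the scores and bins here are toy values, not results from the paper.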

B Emotion Annotation Guidelines
This section describes the annotation guidelines used by annotators in our pilot studies to determine, in top-level comments, (1) the emotion experiencer, or who was the primary experiencer of the emotions expressed in comments, and (2) the affective stimulus of the emotion expressed in the comment. Two native Chinese speakers served as annotators and were asked to annotate a set of 1000 randomly selected comments on the platform. Annotators were first tasked with familiarizing themselves with the BRECVEMA framework of musically evoked emotions (Juslin, 2013) before being presented with the following questionnaires for annotation:
Question 1: The Emotion Experiencer
Comment: 真特么的带感这曲子！
Q. Who was the primary experiencer of the emotion expressed in the comment?
• The commenter themselves.
• Someone other than the commenter themselves.
• This comment possesses no emotional content.
Question 2: The Affective Stimulus
Comment: 真特么的带感这曲子！
Q. What was the primary affective stimulus of the emotion expressed in the comment?
• The song, album, or playlist.
• Something other than the song, album, or playlist.
• This comment possesses no emotional content.
As stated, annotators were asked to resolve initial annotation disagreements through discussion in order to come up with a set of annotations that both agreed on.

C Lyric Topics and Preprocessing
This section describes our lyrical preprocessing methods and 20-topic LDA model results on song lyrics.
Preprocessing. We first identify instrumental music by matching lyric data on the substring 纯音乐, used by the platform to denote songs of this category. For non-instrumental pieces, we filter out lines with song metadata (e.g. composers) by removing lines that match the following regex:
:|：|《|》|produced by|vocals by|recorded by|edited by|mixed by|mastered by| -| -
As repeated lyrics are denoted with overlapping time stamps (e.g. [1:00.00][2:00.00]雨淋湿了天空 indicates that the line 雨淋湿了天空 is repeated at minutes 1 and 2), we further unfurl and reorder lines by timestamp, duplicating lines when necessary. Further tokenization details are given in Appendix Section D (Text Preprocessing).
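The metadata filtering and timestamp unfurling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function name `unfurl_lyrics`, the `[m:ss.xx]` timestamp pattern, and the decision to apply the metadata filter only after stripping timestamps (which themselves contain a colon) are assumptions, and the metadata pattern keeps only the clearly legible alternatives of the regex quoted above.

```python
import re
from typing import List

# Assumed timestamp shape, e.g. "[1:00.00]"; real lyric files may vary.
TIMESTAMP = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\]")

# Legible alternatives of the metadata regex quoted in the text; the trailing
# dash alternatives are garbled in the source and are omitted here.
METADATA = re.compile(
    r":|：|《|》|produced by|vocals by|recorded by|edited by|mixed by|mastered by"
)


def unfurl_lyrics(raw: str) -> List[str]:
    """Drop metadata lines, then emit one copy of each line per timestamp,
    ordered by time (ties keep the original line order)."""
    timed = []
    for order, line in enumerate(raw.splitlines()):
        stamps = TIMESTAMP.findall(line)
        text = TIMESTAMP.sub("", line).strip()
        # Skip lines without timestamps or text, and lines that look like metadata.
        if not stamps or not text or METADATA.search(text):
            continue
        for minutes, seconds in stamps:
            timed.append((int(minutes) * 60 + float(seconds), order, text))
    timed.sort()
    return [text for _, _, text in timed]
```

A line carrying two timestamps therefore appears twice in the output, once at each position, which restores the order in which the lyrics are actually sung before tokenization.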
Topic Modeling. We train a 20-topic LDA model on preprocessed song lyrics and manually label each lyric with its prominent theme. While some degree of variation exists for listener affective responses across songs of each topic, these topic distributions are primarily used as lyrical content controls in our regression models. Labeled topics and their top words are shown in Table 3.

D Text Preprocessing
This section describes our text preprocessing pipeline for all text data on the platform, namely (1) lyrics and (2) listener comments.
Preprocessing. We analyze only Chinese language content, using Google's Compact Language Detector v3 (gcld3) to detect text language and keep only Chinese language texts. We then convert all traditional Chinese characters to their simplified forms using hanziconv to ensure consistency in our experiments (i.e. when calculating LIWC scores, for which we use the simplified Chinese version; Huang et al., 2012) and finally tokenize with jieba.
Filtering for Affective Content. Following annotations of listener comments, we filter out all comments that match the following regular expressions (Table 4), aiming to increase the precision of comments in our analysis that indicate an affective response. These filters generally match easily identifiable spam messages, such as "first comment" posts, album images, and quotations.
沙发, 第一, 第二, 第三, 第四, 第五, 第六, 第七, 第八, 第九, 第十, 第1, 第2, 第3, 第4, 第5, 第6, 第7, 第8, 第9, 一楼, 留名, 封面, 没人, 来晚了, 板凳, 求, 前排, 识曲, 后排, 一条, 好少, 不火, 助攻, 作者, 评论, 人呢, 来了, "*", "*", '*', '*', 《*》, <*>, ：, :, 9+
Diurnal User Activity. Stratifying user activity across hours and measuring the relative comments made per stratum, we observe that disclosers show greater platform activity in the early morning (1-5 AM) and around 11 AM-5 PM compared to the set of all users. Shown in Figure 5, these observations are consistent with insomnia-aligned diurnal user activity, prevalent in individuals diagnosed with affective disorders (Taylor et al., 2005; Harvey, 2008). Note that due to platform data limitations, while comment dates are available for all comments on the platform, only those made in the past year had times recorded; these are what we use in our analysis of diurnal user activity. Thus, it is important to interpret these results in the context of the COVID-19 pandemic, which has caused an increase in the prevalence of anxiety and depression worldwide (Bareeqa et al., 2021).
Playlist Engagement. Relative tagged playlist engagements are shown in Figure 6

Figure 2: Averaged marginal effects of contextual choices in emotion-tagged (top) and setting-tagged (bottom) playlists on listener affective responses; standard errors are shown.

Figure 5: Diurnal commenting activity between disclosers and the set of all users.
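The comparison summarized in this caption can be sketched as follows. The sketch assumes one plausible reading of "relative comments made per stratum", namely each group's share of its own comments falling in each hour of the day; the function names `hourly_share` and `relative_activity` and the integer-hour input are hypothetical, not taken from the released code.

```python
from collections import Counter
from typing import Iterable, List


def hourly_share(hours: Iterable[int]) -> List[float]:
    """Fraction of a group's comments posted in each hour of the day (0-23)."""
    counts = Counter(hours)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 24
    return [counts.get(hour, 0) / total for hour in range(24)]


def relative_activity(discloser_hours: Iterable[int], all_hours: Iterable[int]) -> List[float]:
    """Per-hour difference in comment share: disclosers minus all users.

    Positive values mark hours in which disclosers are relatively more active.
    """
    discloser = hourly_share(discloser_hours)
    baseline = hourly_share(all_hours)
    return [d - b for d, b in zip(discloser, baseline)]
```

Normalizing within each group before comparing keeps the much larger set of all users from dominating the comparison; under this reading, a positive difference at hours 1 through 5 would correspond to the early-morning pattern described in the text.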

Figure 6: Relative tagged playlist commenting activity between disclosers and the set of all users. A breakdown of engagement with the five broad tag categories is shown on the top left, while other figures show each category's relative tag engagements. Note that as each playlist may have up to three unique tags, relative tag percentages do not add up to 100%.

Figure 9: Comment (left) and comment token (right) distributions across all songs with at least one comment.

Figure 10: Comment (left) and comment token (right) distributions across all playlists with at least one comment.

Figure 11: Comment (left) and comment token (middle) distributions across all albums with at least one comment, as well as album release date distributions (right).

Figure 12: Song (left) and album (right) distributions per artist across all artists. Platform-listed artists with the largest numbers of songs and albums are generic compilations of multiple artists, e.g. "华语群星" ("Chinese stars").

Figure 18: Average marginal effects of LIWC psycholinguistic lexical category lyrical features on listener affective responses, controlling for musical features and listener demographics.With the intent to reduce noise at the extremities, x-axis limits are capped at their 95% quantile values.Arranged in alphabetical order, standard errors are shown; valence in red, arousal in blue (Part 4/4).

Figure 19: Average marginal effects of listening contexts in setting-tagged playlists on listener affective responses, controlling for songs and user demographic variables; standard errors are shown.

Figure 22: Average marginal effects of listening contexts in theme-tagged playlists on listener affective responses, controlling for songs and user demographic variables; standard errors are shown.

Figure 23: Average marginal effects of listening contexts in language-tagged playlists on listener affective responses, controlling for songs and user demographic variables; standard errors are shown.

Table 1: Example top-level comments indicating an affective response.

Table 2: Comment valence and arousal mean (m.) and standard deviations (std.) for demographic groups on gender and age.

Table 5: Mental health disorder condition name strings; these are prefixed/suffixed with the strings for "diagnosed" ("确诊.*") and "diagnosis" ("诊断.*"), e.g. "确诊抑郁" for "diagnosed with depression", to act as initial regular expression filters for users who self-disclose a diagnosis of a mental health disorder.
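One way such initial filters could be assembled is sketched below. This is an assumption-laden illustration rather than the authors' code: only the prefixed form (diagnosis string, any characters, condition name) is shown, since the exact suffixed form is not specified; of the condition names, only "抑郁" appears in the text, and the other entries are hypothetical placeholders for Table 5. The names `build_disclosure_filter` and `is_candidate` are likewise hypothetical.

```python
import re

# "抑郁" (depression) comes from the caption; "焦虑" and "双相" are
# hypothetical placeholders standing in for other Table 5 entries.
CONDITIONS = ["抑郁", "焦虑", "双相"]

# Strings for "diagnosed" and "diagnosis", as given in the caption.
DIAGNOSIS_PREFIXES = ["确诊", "诊断"]


def build_disclosure_filter(conditions, prefixes=DIAGNOSIS_PREFIXES):
    """Compile one pattern of the form (确诊|诊断).*(condition1|condition2|...)."""
    prefix_alt = "|".join(re.escape(p) for p in prefixes)
    condition_alt = "|".join(re.escape(c) for c in conditions)
    return re.compile(f"(?:{prefix_alt}).*(?:{condition_alt})")


DISCLOSURE_FILTER = build_disclosure_filter(CONDITIONS)


def is_candidate(comment: str) -> bool:
    """True if the comment matches the initial filter.

    A match only marks a candidate: as Table 6 notes, matched comments still
    include false positives and were labeled manually.
    """
    return DISCLOSURE_FILTER.search(comment) is not None
```

A filter of this kind favors recall over precision, which is consistent with the manual annotation step that follows the regular expression match.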

Table 6: Examples of positive and false positive self-disclosure statements of mental health disorders encountered in our manual labeling of comments matched with regular expressions. Partial comments are shown.

Table 7: The number of users who self-disclose a mental health disorder, stratified over broad disorder classes.
for disclosers and the set of all users; these are expanded figures as noted in Section 5 of the main paper.