Using Sociolinguistic Variables to Reveal Changing Attitudes Towards Sexuality and Gender

Individuals signal aspects of their identity and beliefs through linguistic choices. Studying these choices in aggregate allows us to examine large-scale attitude shifts within a population. Here, we develop computational methods to study word choice within a sociolinguistic lexical variable (alternate words used to express the same concept) in order to test for change in attitudes towards sexuality and gender in the United States. We examine two variables: i) referents to significant others, such as the word “partner”, and ii) referents to an indefinite person, both of which could optionally be marked with gender. The linguistic choices in each variable allow us to study increased rates of acceptance of gay marriage and gender equality, respectively. In longitudinal analyses across Twitter and Reddit over 87M messages, we demonstrate that attitudes are changing but that these changes are driven by specific demographics within the United States. Further, in a quasi-causal analysis, we show that passages of Marriage Equality Acts in different states are drivers of linguistic change.


Introduction
A person's identity and attitudes are reflected in the language they use (Norton, 1997; Huffaker and Calvert, 2005; De Fina, 2012). In particular, the linguistic choice for a concept can reveal the individual's stance or attitudes (Jaffe et al., 2009); for example, the use of "illegals" or "undocumented" in reference to immigrants reveals the speaker's attitude on immigration (Lakoff and Ferguson, 2006). These alternations in word choice are known as lexical variables in sociolinguistics. Examining the relative frequencies of a variable's words can reveal the underlying attitudes within a population that drive the linguistic choice. Here, we examine changes in attitude towards sexuality and gender in the United States through two lexical variables.
† Work performed while the author was an undergraduate research assistant at the University of Michigan.
Sociolinguistics has long focused on variation in language with respect to identity and attitudes (Labov, 1963; Eckert and Rickford, 2001; Trudgill, 2002). Recent computational studies have built upon this line of research (Nguyen et al., 2016), showing not only that this variation occurs in social media (Eisenstein et al., 2014; Hovy et al., 2015), but also that the large scale of social media enables the study of broader societal trends (Abitbol et al., 2018; Grieve et al., 2018). Our work expands this line of research by examining longitudinal changes in linguistic variation to show changing societal attitudes.
Here, we test for change in attitudes about sexuality and gender by computationally measuring variation for two lexical variables associated with these attitudes from a massive longitudinal study of 73M Twitter posts and 14M Reddit comments across nearly ten years. The first variable focuses on the use of gender when referring to romantic partners; specifically, we test how frequencies of gender-neutral referents such as partner (a term often used by LGBT+ community members) shift as acceptance of gay marriage changes. The second variable measures attitudes about gender by testing for unnecessary gender markings on indefinite references to one or more people, e.g., "some folks" versus "some guys." Our work here is drawn from theory in gender and sexuality studies on how both heterosexuality and masculinity are treated as the default or norm in English (Kitzinger, 2005a; Land and Kitzinger, 2005), where shifts away from these heterosexual constructs signal increasing acceptance of other identities.
Our paper offers the following three contributions. First, through a large-scale computational analysis that measures the language choices of different demographics, we demonstrate increasing acceptance of non-heterosexual relationships through the increasing use of non-gendered referents to significant others by heterosexual communities. While non-gendered referents are used frequently in LGBT+ communities, further demographic analysis shows this change is found across gender identities. Second, in a quasi-causal analysis, we show that passages of marriage equality acts (MEAs) in the United States drive a statistically significant increase of gendered markers in the LGBT+ community (e.g., husband instead of partner), mirroring increased acceptance and decreased social cost for explicitly indicating one's sexual orientation (Ofosu et al., 2019). Third, we find increasing gender equality through decreased use of gendered person referents, driven by multiple segments of the population. Our work not only reveals positive societal change in acceptance but also points to the potential of linguistic variation as an indicator variable for studying cultural attitudes.

Sociolinguistic Variables
Sociolinguistic variables consist of alternative expressions where each expression is associated with a specific identity or attitude (Bucholtz and Hall, 2005; Eckert, 2008). Typically, these variables have been pronunciations, e.g., the association of g-dropping with African American Vernacular English (Wolfram, 1969; Dillard, 1973), due to the need for high observational frequency within in-person studies in order to identify rigorous associations between form and identity/stance (Labov, 1972; Labov et al., 1981). The availability of massive quantities of natural text from social media has substantially increased our ability to study lexical variables, which occur less frequently than pronunciation variations (Androutsopoulos, 2006; Nguyen et al., 2016). While many studies have focused on associations between demographics and lexical signals (Jackson-Maldonado et al., 1993; O'Connor et al., 2010; Jurgens et al., 2017), we examine associations between attitudes and two variables: (1) referents to significant others and (2) indefinite referents to one or more persons. We refer to these variables respectively as SIGOTHER and PERSON and motivate them next.
Table 1 (most common variants of each variable):
SIGOTHER: boyfriend, girlfriend, husband, bae, partner, bf, gf, babe, lover
PERSON: people, girl, man, guy, person, girls, guys, dude, bro, individual
SIGOTHER. Individuals frequently refer to romantic partners in conversation. Signalling the gender of these partners also reveals the speaker's sexual orientation (Kitzinger, 2005b,a; Wilkinson, 2015). However, in some social contexts, revealing one's non-heterosexual sexuality (i.e., "outing") carries social cost and personal risk (Fuss, 2013; Cadieux and Chasteen, 2015; Carrasco and Kerne, 2018). As a result, some members of the LGBT+ community have adopted gender-neutral terms to refer to significant others (Killermann, 2011), e.g., partner, as opposed to gendered terms such as girlfriend or husband.
Use of the gender-neutral forms is partially predicated on social acceptance of non-heterosexual orientation (Land and Kitzinger, 2007); in social settings of acceptance, LGBT+ individuals will readily use and prefer gendered markers for their significant others (Heisterkamp, 2016). At the same time, the use of gender-neutral SIGOTHER terms by LGBT+ community members carries the risk of revealing orientation if the terms are exclusively used by that community. Therefore, a concerted effort has been made to adopt gender-neutral terms more broadly so as to decrease their association with sexual orientation (De Guzman et al., 2018). Given the association between social acceptance and linguistic choice within SIGOTHER (Table 1, top), we expect that changing attitudes should result in a change in linguistic behavior.
PERSON. Since the late 20th century, English has seen a shift away from using masculine forms to refer to individuals or groups of mixed or other genders (Foertsch and Gernsbacher, 1997; Earp, 2012), e.g., "you guys" to refer to a group of any gender. This shift has included increasing use of ungendered pronouns, e.g., they to refer to a single person (LaScotte, 2016), and a move away from assuming a particular pronoun (Balhorn, 2004). In certain contexts, individuals make indefinite references to people or groups, e.g., describing hypothetical examples or evoking a generic use of the term. These settings also allow for gendered and ungendered referents, such as guys versus folks. Speakers' linguistic choices in these circumstances reflect unconscious biases about a default gender, perpetuating hegemonic masculinity attitudes (Cooper, 2002). Therefore, in studying the variation in gender marking of indefinite references, we expect that decreases in explicit gendering would coincide with shifts in attitudes towards gender equality and hegemonic masculinity.
Measuring Linguistic Choice. We measure changes in linguistic behavior by fitting a bigram language model $p(w_i \mid w_{i-1})$ and comparing relative probabilities for each variable's words in restricted contexts. The words comprising each variable were identified through a review of past literature (Kiesling, 2004; Heisterkamp, 2016) and inclusion of unambiguous synonyms. Table 1 lists the most common variants, with the rest in supplemental material. Context restriction is necessary as not all uses of the words in our variables correspond to the sense intended for study, e.g., the use of partner in "business partner" does not reflect a choice within SIGOTHER. Therefore, we apply a set of syntactic heuristics to filter the data gathered from these social media platforms, so that relative rates are compared only within contexts that precisely signal use of our target variables. The syntactic constructs used to identify the variables rely only on single-word precursors (e.g., "my spouse") that precisely select the intended uses of the words, and therefore a bigram model is sufficient for our study.
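To make the measurement concrete, the following is a minimal sketch of how relative variant probabilities could be read off bigram counts. It assumes pre-tokenized text and pools all permitted preceding words into one context class; the function names and the toy sentences are hypothetical and are not taken from the released code.

```python
from collections import Counter

def bigram_counts(sentences):
    """Count (previous_word, word) pairs over tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        lowered = [t.lower() for t in tokens]
        counts.update(zip(lowered, lowered[1:]))
    return counts

def variant_probabilities(counts, contexts, variants):
    """Relative probability of each variant given that the preceding word is
    one of `contexts`, normalized over the variable's variants only."""
    totals = {v: sum(counts[(c, v)] for c in contexts) for v in variants}
    denom = sum(totals.values())
    if denom == 0:
        return {v: 0.0 for v in variants}
    return {v: n / denom for v, n in totals.items()}

# Toy data (hypothetical): "business partner" is excluded because
# "business" is not one of the permitted preceding words.
sentences = [
    "my partner and i moved".split(),
    "my wife called".split(),
    "my partner cooked".split(),
    "her husband left".split(),
    "a business partner emailed".split(),
]
counts = bigram_counts(sentences)
probs = variant_probabilities(counts, {"my", "her"}, ["partner", "wife", "husband"])
```

Normalizing over the variants alone, rather than over the whole vocabulary, is what turns the bigram counts into a choice among alternatives within one variable.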
To focus on interpersonal contexts for the SIGOTHER variable, we restrict all uses of its variants to occur only within a possessive pronoun construction, e.g., my girlfriend/his spouse, and later distinguish between first-person and third-person uses, as each carries different social risks. For the PERSON variable, our focus is on contexts where the gender of the referred persons is inherently ambiguous, i.e., an indefinite referent to a person or group; the underlying hypothesis is that gender need not be ascribed to the referent and any gender ascribed is a result of underlying attitudes and assumptions. Therefore, we filter uses of the variants to occur immediately after a subset of the determiners, focusing on indefinite articles (e.g., a dude), quantifiers (e.g., most people), distributives (e.g., many folks), difference words (e.g., other pals), as well as broader qualifiers (e.g., if, when).
Indefinite references to ambiguous persons were chosen as opposed to definite ones (e.g., you guys) as the latter often take on a specific audience, which could have a known gender composition that necessitates a gendered form over an ungendered one.
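The context restriction described above can be illustrated with a small word-list filter. This is only a sketch: the paper's pipeline uses NLTK POS tagging, whereas the version below relies on hand-written word sets, and the pronoun and determiner lists are assumed subsets chosen for illustration. The variant lists are drawn from Table 1 and the examples in the text.

```python
import re

# Assumed, illustrative word lists; the full lists are in the paper's supplement.
POSSESSIVES_1P = {"my", "our"}
POSSESSIVES_3P = {"his", "her", "their"}
INDEFINITE_CUES = {"a", "an", "some", "most", "many", "other", "if", "when"}

SIGOTHER = {"boyfriend", "girlfriend", "husband", "wife", "spouse", "partner",
            "bae", "bf", "gf", "babe", "lover"}
PERSON = {"people", "girl", "man", "guy", "person", "girls", "guys",
          "dude", "bro", "individual", "folks"}

def extract_contexts(text):
    """Return (variable, context_type, previous_word, variant) tuples for
    every variant that directly follows a permitted single-word precursor."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for prev, word in zip(tokens, tokens[1:]):
        if word in SIGOTHER and prev in POSSESSIVES_1P:
            hits.append(("SIGOTHER", "1P", prev, word))
        elif word in SIGOTHER and prev in POSSESSIVES_3P:
            hits.append(("SIGOTHER", "3P", prev, word))
        elif word in PERSON and prev in INDEFINITE_CUES:
            hits.append(("PERSON", "indefinite", prev, word))
    return hits

example = extract_contexts("My partner met a dude and his business partner.")
```

In this example the second use of partner is skipped because its precursor is "business", which mirrors the motivation for restricting contexts in the first place.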

Data
Our work is drawn from two major social media platforms, Reddit and Twitter, and is focused on English-language conversations in the US. Reddit is a major social media platform where individuals participate in communities, known as subreddits, often focused on specific interests, goals, or demographics. Reddit users are primarily English-speakers and recent market research suggests its userbase is largely comprised of American users (Clement, 2020); we thus treat content from Reddit as reflective of this region's attitudes.
On Reddit, communities have formed around particular identities associated with known sociolinguistic variation, e.g., r/GayBros. We treat participation in these communities as an implicit signal of an affiliation with that identity, allowing us to study linguistic variation with respect to these identities. Here, we identify four categories of subreddits around identity, with 15 total identities across those categories, shown here each with an example: Politics: Right-leaning (r/conservative), Left-leaning (r/voteblue); Religion: General (r/religion), Christianity (r/christianity), Islam (r/islam), Judaism (r/judaism), Non-believers (r/atheism); Sexuality: LGBT+ (r/ainbow), Heterosexual (r/relationships); and Gender: Transgender (r/transgender), Men (r/daddit), Women (r/askwomen). A full list of communities and details of the selection process are provided in supplemental material. These categories were chosen based on minimum volume and motivated by prior work showing that individuals modify their language to signal affiliations, attitudes, and beliefs (Lakoff and Ferguson, 2006; Fausey and Boroditsky, 2010). Data was collected from all content in these communities during 2013-2019, totalling 14M comments. Additionally, to test for aggregate changes on the platform, we use a random sample of 5M comments per year, stratified across years.
Twitter is a major international social media platform. Prior work has shown that location data for tweets can be used to identify lexical variation associated with identity (Gonçalves and Sánchez, 2014; Blodgett et al., 2016; Abitbol et al., 2018). In these settings, the demographics associated with the location of a tweet are treated as proxies for the identity of the individual. However, most tweets do not come with location data; to increase the sample size, we geocode all tweets from a ∼10% random sample spanning 2011-2019 using the method of Compton et al. (2014) and retain only those located in the United States.
This method is known to be the least-biased across urban and rural settings (Johnson et al., 2017), allowing us to study all parts of the US. This geocoding had a median error of 8km in our tests; furthermore, we restricted processing to tweets that were marked as English by Twitter.
To obtain demographic estimates for each tweet's author, we match the inferred location with its containing census tract and use the 5-year estimates of variables from the US Census' 2017 American Community Survey (ACS). The selected ACS variables focus on socioeconomic status (SES), and cover income, public assistance, education, unemployment, poverty, income inequality, population density, and age dependency. These demographic variables provide a complementary set of indicators for studying variation compared with Reddit.
Final Data. A total of 73M tweets from 2011-2019 and 14M Reddit comments from 2013-2019 are used. To test for aggregate changes on the platforms, we randomly sample 1M comments per month for Reddit (84M total) and Twitter (108M total). The filtering process uses NLTK for POS tagging and yields 6.7M contexts from Reddit and 30M from Twitter for the SIGOTHER variable; for the PERSON variable, it yields 7.3M contexts from Reddit and 43M from Twitter. Full details are in supplemental material §A.

Measuring Attitudes to Sexuality
Attitudes about sexuality can be signalled in the use of gender in referring to one's significant other.
In settings where heterosexual partnerships are valorized, referring to a significant other of the same sex carries a social and potentially physical risk from admitting to being different from heterosexual practices (Wilkinson, 2015; Cadieux and Chasteen, 2015). To minimize this risk, some LGBT+ individuals refer to significant others using the variant form partner, which leaves the gender status ambiguous. However, use of partner by only LGBT+ community members would result in it being a clear marker of a marginalized community. Therefore, allies of this community are known to use this word to decrease its association with a marginalized identity (Kitchener, 2019). Thus, uses of gender markers to refer to significant others reflect underlying attitudes of acceptance towards gay marriage, allowing us to study changes in these attitudes by examining relative rates among variants' uses.
Here, we ask to what degree this shift in attitude is mirrored in changes in language use, testing two hypotheses. H1: LGBT+ communities will increasingly use gendered terms when referring to relationship status. H2: Heterosexual individuals will increasingly use gender-neutral markers in SIGOTHER. H1 is motivated by Heisterkamp (2016), who in a small observational study found that LGBT+ individuals preferred using gendered markers, which suggests that use of gender-neutral markers by LGBT+ members may actually be in the minority. To test these hypotheses, we measure variant use longitudinally and across identities. To avoid larger communities outweighing smaller ones and contributing more to an observed change in progress, in all cross-community comparisons we control for community size by bootstrapping the mean probability within each category of subreddits, and show the 95% confidence intervals in the figures.
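As an illustration of the size control described above, the sketch below bootstraps the mean of per-community probabilities, so that each subreddit contributes one value regardless of how many comments it contains. The percentile interval, the number of resamples, and the toy values are assumptions made for this example; the text does not specify which bootstrap variant was used.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    `values` holds one variant probability per community, so a large
    community does not outweigh a small one."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lo, hi

# Hypothetical per-subreddit probabilities of "partner" within one category.
per_community = [0.10, 0.12, 0.08, 0.11]
mean, ci_low, ci_high = bootstrap_mean_ci(per_community)
```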

Changing Use in SIGOTHER
Lexical change mirrors the corresponding increasing acceptance of same-sex marriage in the US (Ofosu et al., 2019; Twenge and Blake, 2020), with both Reddit (partner: r=0.950, p<0.01; spouse: r=0.916, p<0.01) and Twitter (partner: r=0.901, p<0.01; spouse: r=0.943, p<0.01) having increased rates of gender-neutral markers of SIGOTHER (Figure 1). While gendered variants still account for the majority of uses, this trend signals an underlying change of attitude by reducing the focus on gender in describing a SIGOTHER. To understand the mechanisms behind this change, we examine who is likely to use these alternates and any changes in behavior.
Who uses non-gendered variants? To measure association with identity, on Reddit, we measure the relative rates within each group of subreddits associated with an identity, and on Twitter, we stratify census tracts for pertinent identities (e.g., education) into quartiles to show relative difference between high- and low-valued areas. Measurements are taken over all of the data and, here, we show the rates for partner and spouse, which are the two most common non-gendered variants. The result, shown in Figure 2, reveals four trends. First, confirming prior expectations around the association of these words with sexuality, the gender-neutral forms are used most frequently by non-heterosexual communities (Heisterkamp, 2016). Second, communities for gender identities other than male and female have substantially larger use of gender-neutral forms; this language likely reflects a concerted effort towards inclusivity within a marginalized community that also has partial overlap with LGBT+ communities. Third, use in liberal communities, which in the US are known to have more favorable attitudes towards gay marriage (Sherkat et al., 2011), exceeds that in conservative communities. Fourth, higher-SES areas, as measured through income, education, and public assistance, see greater adoption than their lower-SES counterparts, mirroring recent findings on SES and favorability towards gay marriage (Jakobsson et al., 2013; Anderson, 2014). Results for variable usage by density, income, and inequality were found to be similar; for brevity, education is shown in Figure 2. Complementary figures are in Supplemental Material section E. The urban-rural divide reflects both (i) known attitudes of rural communities that typically placed a heightened value on "traditional moral standards" (Bell and Valentine, 1995), which would disfavor the language of LGBT+ communities, and (ii) a self-selection of LGBT+ people to denser, urban areas (Gorman-Murray and Nash, 2014).
Despite known prejudices towards LGBT+ identities by some religious denominations (Besen et al., 2007; Fetner, 2008), uses of partner and spouse were higher in categories of communities containing these denominations than in non-believing communities (e.g., r/atheism), mirroring studies showing that progressive movements within non-believing communities like Atheism+ are still in the minority (Kettell, 2014).
How has gender-signalling changed? Our results demonstrate strong association of gendered and non-gendered terms with identity, in line with observational studies. We now test our two hypotheses by examining changes in SIGOTHER variant usage over time with respect to these identities by aggregating the six most common markers on Reddit, gendered (girlfriend, boyfriend, wife, husband) and ungendered (partner, spouse), controlling with respect to first-person (e.g., my wife) and third-person (e.g., her boyfriend) use. We explicitly note this distinction as first-person uses of the SIGOTHER term may carry different social penalties for different individuals in various communities. For example, a gay man referring to his partner as "my boyfriend" is a risky act of self-disclosure, while a person discussing another's relationship and referencing "his partner" (rather than "his boyfriend") is taking less of a personal risk, instead showing acceptance of gender-neutral referents in the individual's common lexicon. Figure 3 shows changing rates for the three settings related to sexuality and gender needed to test the hypotheses; plots for all other identities are shown in supplemental material. Figures 3(a) and (b) support H1: across the LGBT+ communities, the rate of gendered markers continues to exceed that of ungendered markers across time, and shows no statistically significant trend of change (1P: r=0.156, p=0.594; 3P: r=0.126, p=0.667).
Through studying person-reference practices across large-scale social communities, we validate the concentrated findings of Heisterkamp (2016) that usage of such referents in LGBT+-community contexts continues to show a resistance to, and a divergence from, heteronormative social constructs. Figures 3(a) and (b) support H2, with substantial increases in non-gendered terms among heterosexual communities for both third-person (r=0.917, p<0.01) and first-person (r=0.982, p<0.01) use. While the communities used to identify this trend (e.g., r/relationships) do contain some LGBT+ members, the lack of substantial increases in non-gendered forms within LGBT+ subreddits suggests that the shift is due to changing attitudes within the heterosexual community. As a follow-up study, we test whether these changes are driven by a particular gender. Pearson's correlation is computed alongside statistical significance testing over bootstrapped mean probabilities calculated across 3-month intervals. Figure 3(c) shows that all gender communities in our study increased their rates of non-gendered markers, with women-focused (r=0.884, p<0.01) and men-focused (r=0.731, p<0.01) communities increasing more than transgender ones (r=0.444, p=0.11), suggesting widespread normalization.
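The trend statistics reported above pair a Pearson correlation with a significance test. A minimal sketch of that computation over a series of 3-month means follows. The permutation test is swapped in here as one simple way to obtain a p-value; the text does not state which significance test was used, and the quarterly rates are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def permutation_p_value(xs, ys, n_perm=5000, seed=0):
    """Two-sided p-value: share of shuffles whose |r| reaches the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical bootstrapped mean probabilities for eight 3-month intervals.
quarters = list(range(8))
rates = [0.10, 0.11, 0.11, 0.13, 0.14, 0.14, 0.16, 0.17]
r = pearson_r(quarters, rates)
p = permutation_p_value(quarters, rates)
```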
We argue that the increased use of an LGBT+-marked term, "partner", by non-LGBT+ community members is an example of dialect merging where the dominant identity (here, heterosexual) adopts the language of the minority as a standard. This trend draws parallels with the adoption of African American Vernacular English (AAVE) by white Americans (Cutler, 1999). However, race-based linguistic markers face problematic adoption due to perceptions of who is a member of the community and can appropriately use its language (Sweetland, 2002). In contrast, the LGBT+ community includes allies, which potentially licenses this adoption; however, we note that, as a marker of in-community status, this adoption by individuals outside the community has met some resistance (Romack, 2018; Werder et al., 2017), raising the question of whether this behavior is linguistic adoption or appropriation. In addition, our results show an absence of lexical leveling (Milroy, 2002; Kerswill, 2003), where the language of a minority community is gradually replaced by that of the majority as the minority is integrated; often, individuals in a minority linguistic group assimilate to the mainstream usage (leveling) due to the perceived prestige, but here this trend is reversed.

Effects of Marriage Equality Acts
From 2004 to 2015, multiple states in the United States passed Marriage Equality Acts (MEAs) that allowed LGBT+ couples to legally marry. As a result, marriage rates for these couples rose substantially and passage of the acts was shown to increase social acceptance (Ofosu et al., 2019). These passages provide an ideal setting for a natural experiment to test whether legalization influenced linguistic choice.
Methods. To analyze the effect of passage of MEAs, we construct a difference-in-differences (diff-in-diff) model as a quasi-causal analysis of the effect on linguistic choice. In a small-scale interview-based study, DiGregorio (2019) found that passage of an MEA did not mean the traditional language of marriage would be adopted, suggesting we should observe no change in certain SIGOTHER forms. Therefore, we test specifically for changes in spousal terms (partner, spouse, wife, and husband), asking whether individuals who marry after the passage will use the gendered or gender-neutral forms. As states pass MEAs at different times, we adopt a staggered diff-in-diff formulation that controls for changes in usage across real time, while measuring the changes relative to treatment (cf. Stevenson and Wolfers, 2006; Gipper et al., 2020). In this model, $y_{ij}$ is the probability of using a particular form of the SIGOTHER variable, and $\alpha$ and $\lambda$ are variables for the state and absolute time (as month) of measurement, which account for baseline changes in the rates of words' uses over time and across states. $K$ and $T$ are pre- and post-treatment interactions of the relative month offset from a state passing an MEA and a dummy variable indicating passage of an MEA; the fitted $\pi_j$ and $\phi_j$ parameters reveal the effect of treatment on the outcome variable, i.e., the particular SIGOTHER word used. We use a twelve-month period around the passage, setting m=-12 and g=12, to assess trends.
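As a sketch under stated assumptions, a staggered specification consistent with the terms defined above can be written as follows. The index $k$ for the relative month, the superscript indicator notation, and the error term $\varepsilon_{ij}$ are assumptions introduced here for readability and are not notation confirmed by the source; the layout follows the usual event-study form of a staggered diff-in-diff.

```latex
y_{ij} \;=\; \alpha_i \;+\; \lambda_j
  \;+\; \sum_{k=m}^{-1} \pi_k \, K^{(k)}_{ij}
  \;+\; \sum_{k=0}^{g} \phi_k \, T^{(k)}_{ij}
  \;+\; \varepsilon_{ij}
```

Here $i$ indexes states and $j$ indexes absolute months; $K^{(k)}_{ij}$ is 1 when month $j$ falls $k$ months before state $i$'s passage of an MEA ($k<0$) and $T^{(k)}_{ij}$ is 1 when it falls $k$ months after ($k\ge 0$), with $m=-12$ and $g=12$ as in the text. Under this reading, the $\phi_k$ coefficients trace the post-passage change in a term's probability net of the state and month baselines, and the $\pi_k$ coefficients check for pre-existing trends.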
Data for the diff-in-diff model is selected from all tweets referring to the SIGOTHER variable in the twelve months before and after the passage of the MEA in a state. Tweets were then filtered to the 30 states passing an MEA within our dataset's timespan; a total of 6.7 million tweets are used.
Results. As shown in Figure 4, the rates of marital terms in the SIGOTHER variable substantially increased after the passage of an MEA, with the largest absolute increases in gendered markers, particularly for wife. Note that the diff-in-diff model controls for baseline changes in usage by month and state, which mitigates potential confounds from overall fluctuations in how these terms are used. Our findings run counter to the study of DiGregorio (2019): by computationally examining larger data across multiple states, we find the opposite result. After passage, LGBT+ couples are much more likely to use gendered spousal terms or use the traditional non-gendered term spouse rather than partner. While the passage of an MEA likely facilitates the use of marriage-related gendered terms, the underlying causes of this linguistic change are likely much more complex and due to changing attitudes and the efforts of LGBT social movements in securing the passage of the MEA itself. Our findings suggest a decreased social penalty for explicitly stating one's sexuality (via a gendered SIGOTHER term) from increased acceptance.

Attitudes on Gender Equality
Traditionally, Standard American English has been gendered in its referents to people, with words like "guys" referring to both male and mixed-gender groups (McLennan, 2004). Studies have argued that these marking practices reflect latent bias in gender expectations and reinforce masculinity as the normative gender (Wilke, 1994; Connell and Messerschmidt, 2005). Recent efforts have pushed for increased use of gender-neutral forms (Schweikart, 1998), where variants like "people" or "folks" are used instead (see footnote 4). Using our data, we test whether these efforts have had an effect and which groups are driving linguistic change.

Changing Uses in Gendered Markers
Is there change? As an initial test, we plot the relative rates of gendered and non-gendered variants of PERSON from random samples on both platforms, restricting Twitter to US locations. As shown in Figure 5, individuals in these settings increasingly use gender-neutral terms to refer to people. Both platforms show consistent trends suggesting that American English is indeed becoming more gender neutral: Pearson's r for non-gendered referents is r=0.771, p<0.01 for Reddit and r=0.966, p<0.01 for Twitter.
Who uses gendered markers? To test for broad association with gendered uses of PERSON, we compute the relative rates of gendered and non-gendered markers for each identity group on Reddit and use the ACS to estimate demographics on Twitter. The results, shown in Figure 2, reveal three notable trends. Minority communities around sexual and gender identities are more likely to use gendered language, with the exception of asexual communities (Supplemental §E.2). Results on political affiliation show the strongest differences. Liberal communities are less likely to use gendered language than their conservative counterparts, mirroring norms around gendered roles and expectations associated with each party (Lakoff, 2010).
Footnote 4: These reference terms are in addition to complementary adoption of gender-neutral pronouns such as "they" in English (Bodine, 1975; LaScotte, 2016) or "hen" in Swedish (Gustafsson Sendén et al., 2015), which are outside the scope of the PERSON variable but whose adoption is likely also reflective of changing attitudes.
In line with studies on attempts in educational and workplace settings (Olgiati et al., 2002; Pauwels and Winter, 2005) to actively promote gender equality and the use of gender-neutral language, higher income quartiles show a statistically significant difference in term usage compared with lower quartiles, shifted towards gender-neutral language (p < 0.01 via Kolmogorov-Smirnov). Complementary results following expectations on urban density, education, and inequality SES indicators are shown in Supplemental Material section E. All showed statistically significant differences among the quartiles, except for education.
Who drives this change? Among all categories, the sharpest overarching decrease in gendered marker use is seen in the gender identities studied, for men (r=-0.14, p<0.01), women (r=-0.38, p<0.01), and transgender (r=-0.29, p<0.01) communities, using the same correlation and significance testing calculations as in §4.1.
A divide in gendered marker usage exists between the sexuality communities. LGBT+ communities (r=0.16, p<0.01) increase in their use of gendered forms of address compared to their heterosexual (r=-0.11, p<0.01) counterparts, who gradually use fewer gendered terms. Plots of these trends are shown in Supplemental Figure 11. These results point to an overall increase in social awareness of the traditional male norm and a shift towards more inclusive gender-neutral language.

Gender Broadening of dude
The previous two studies focused specifically on the analysis of linguistic variables. However, among the terms in both variables, dude stands out as a unique address term where a prior study of just that term suggests its usage alone could also reveal changes in attitudes (Kiesling, 2004). Specifically, dude can express solidarity with the referents and is occasionally used within female-female interactions, indicating the term is not exclusively a male referent. Thus, in this third study, we test an additional theory-based term-specific hypothesis that dude could undergo semantic widening (Bloomfield, 1993; Blank, 1999) where it gradually loses its gender marking and becomes a gender-neutral term that is used to convey solidarity. Here, we build a computational model to test for gender broadening in dude by measuring its relative associations with male and female genders.
Methods. To test for a shift in the gender marking of dude, we follow recent methods for bias testing in word embeddings (Caliskan et al., 2017; Garg et al., 2018; Kozlowski et al., 2019) and compare the word vector for dude with sets of reference vectors that act as semantic poles for measuring its association with male and female genders. We use the two datasets from Caliskan et al. (2017) that consist of (i) two sets of male and female reference terms, e.g., "man" and (ii) two sets of male and female names, which were used to simulate implicit association tests in word semantics. Bias towards one pole (e.g., femininity) is shown by having a higher mean cosine similarity with one set in a pair. We further verify the lack of significant synchronic shifts of these anchor words using a Procrustes alignment between sequential year vector spaces. Full details are in supplemental material.
Separate word2vec models (Mikolov et al., 2013) are trained for each year in our Reddit dataset. Within each year, we compute the mean cosine similarity of dude with each word in a set and measure the difference between dude and the male and female sets to estimate its gender association over time. Similarities are computed over five separate runs on different splits of the aggregate data and then bootstrapped to estimate 95% confidence intervals. For Reddit, word2vec models are trained on a uniform sampling of 10% of all comments (unfiltered) posted in the first six months of every year, totalling ∼8B tokens. Here, we calculate Pearson's correlation and perform statistical significance testing over bootstrapped mean probabilities across yearly intervals.
Results. The male-gender association for dude increases over time for both terms (r=0.908, p<0.01) and names (r=0.962, p<0.01), as shown in Supplemental Figure 6. This result indicates that dude is undergoing a semantic narrowing, rather than widening, and is increasingly used only for male referents. We view this result as pointing to a general trend towards unambiguous gender markings: whereas dude may have been widened colloquially prior to the push for gender-neutral English, the increased focus on using ungendered PERSON referents has coincided with dude narrowing to be used primarily for male referents. Further, this view is supported by the observation that usage of the word dude has increased over time, which rules out a possible explanation that the term's meaning evolved for use in a small semantic niche due to a decrease in frequency of use. Together with the results of the PERSON variable, these two studies show a marked shift in the linguistic choices marking gender, suggesting broader changes in attitudes about gender.
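The association measure described in the Methods above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-dimensional vectors, the function names, and the use of a percentile bootstrap over per-run scores are all assumptions made for illustration. In practice the vectors would come from the yearly word2vec models.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity(target, references):
    """Mean cosine similarity of the target vector with a reference set."""
    return sum(cosine(target, r) for r in references) / len(references)


def gender_association(target, male_refs, female_refs):
    """Difference of mean similarities; positive means closer to the male pole."""
    return mean_similarity(target, male_refs) - mean_similarity(target, female_refs)


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean (ASSUMPTION: percentile method)."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def pearson_r(xs, ys):
    """Pearson correlation, e.g. between years and yearly association scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy vectors (hypothetical): axis 0 stands in for a "male" direction.
dude = [1.0, 0.2]
male_refs = [[1.0, 0.0], [0.9, 0.1]]
female_refs = [[0.0, 1.0], [0.1, 0.9]]
print(gender_association(dude, male_refs, female_refs))
```

Under these assumptions, one association score per run and year would be passed to `bootstrap_ci`, and the yearly means to `pearson_r` together with the years themselves.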

Conclusion
Linguistic choices reflect underlying attitudes about what is being discussed. To study attitudes about sexuality and gender equality, we identify lexical variables that reflect these attitudes and computationally study these choices using a massive demographically-labeled corpus of 87M English messages from Twitter and Reddit. Our results show that language use has indeed shifted and point to increasing acceptance of non-heterosexual norms and a shift towards inclusive, gender-neutral language. Through our demographic analysis, we point to key segments of the population driving these changes. Further, through a quasi-causal analysis, we show that passage of Marriage Equality Acts in different US states increases the use of gendered spousal references, rather than gender-neutral equivalents. While our work does not identify all the underlying causes behind these changes, the results point to where future work could look to identify the structural and social mechanisms behind change and also show how future computational studies can use sociolinguistic variables to tease out demographically-associated attitudes. Data and code are available at https://github.com/davidjurgens/sociolinguistic-attitudes.

Ethical Considerations
Identity Affiliation. In studying attitudes, our work aims to characterize attitudes for a particular segment of the population in aggregate, not at the individual level. In doing so, we specifically avoid making strong inferences around a particular actor on social media, e.g., making claims of an individual's gender or sexuality, by only examining behaviors within communities associated with identities. Further, though communities have formed on Reddit around particular identities that are associated with known sociolinguistic variation, participation in these communities does not correspond to direct self-affiliation with these identities. Our methods are instead designed to identify identity-associations, treating participation in these communities as only implicit signals of affiliation with these identities, with the aim of identifying linguistic communities of practice whose styles may differ in discourse. We also note that in a small set of communities, users can self-select "flairs" that explicitly signal an affiliation of some type, e.g., basketball team fan, but these are not widespread and are often limited only to predefined choice options. In particular, we note the challenges present in treating gender (Larson, 2017) and sexuality as variables of study, especially in attempts to characterize populations with a faithful regard to gender fluidity. While there is some risk in increasing publicity to communities associated with marginalized identity, we have focused only on larger, more well-known communities and avoid ascribing any content to a particular individual.
External Validity. In particular, we note the concerns by Olteanu et al. (2019) detailing how, among others, there exist (1) self-selecting population biases in social media platforms, (2) behavioral biases regarding user platform content and activities, and (3) content production biases that may vary across demographic groups. Recognizing these biases, we have aimed instead to identify and specifically quantify the differences present between linguistic communities of practice whose styles may differ in online discourse on the platforms we study. While work external to social media has often provided support for our observations, we highlight and point to areas of discrepancies in our findings and encourage future work to further examine these phenomena in non-social media contexts.

A Variants
This section describes the variants considered in the SIGOTHER and PERSON variables, as well as the contextual controls used to filter for them in compiling our datasets. Variants were selected through a multi-step manual process. We first selected standard terms used in the literature for each variable, e.g., "husband" and "wife." We then identified all synonymous terms using multiple thesauri. Finally, a sanity check was done to add any slang or abbreviated versions present in social media using a word-vector-based search and also checking for any common terms used in our patterns that would match. For simplicity, some rare variants were left out; these terms were typically misspellings or word elongations (e.g., "wiiiiife"). Tables 2 and 3 show the terms and their valid total counts filtered under contextual controls as they appear on both social media platforms; Table 4 shows example extracted uses matching our patterns.
SIGOTHER contextual control: All PRP$ usages as tagged by NLTK.
PERSON contextual control: a, an, some, any, both, either, neither, each, every, another, many, most, enough, other, if, when
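As a rough illustration of how the PERSON contextual control above could be applied, the sketch below matches a control word immediately followed by a variant. The control words are taken from the list above, but the variant list is a small hypothetical sample (the full lists are in Tables 2 and 3), and the regular expression, the function name, and the handling of plurals are assumptions rather than the patterns actually used to build the dataset.

```python
import re

# Contextual control words for PERSON, as listed in this appendix.
PERSON_CONTEXT = [
    "a", "an", "some", "any", "both", "either", "neither", "each",
    "every", "another", "many", "most", "enough", "other", "if", "when",
]

# HYPOTHETICAL sample of variants, for illustration only.
PERSON_VARIANTS = ["person", "man", "woman", "guy", "dude"]

# A control word, whitespace, then a variant (optionally with a plain -s plural).
# Irregular plurals such as "men" or "people" are not handled by this sketch.
PERSON_PATTERN = re.compile(
    r"\b(?P<context>" + "|".join(PERSON_CONTEXT) + r")\s+"
    r"(?P<variant>" + "|".join(PERSON_VARIANTS) + r")s?\b",
    re.IGNORECASE,
)


def extract_person_uses(text):
    """Return (context word, variant) pairs found in the text, lowercased."""
    return [
        (m.group("context").lower(), m.group("variant").lower())
        for m in PERSON_PATTERN.finditer(text)
    ]


print(extract_person_uses("Every man needs a friend, and some dude said so."))
```

The SIGOTHER control would need a part-of-speech tagger rather than a fixed word list, since it depends on PRP$ tags.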

B Reddit Communities
This section describes the specific subreddits under each of our community-group categorizations for Reddit.

B.3 Sexuality
We note one special case for the selection of Sexuality subreddits. The subreddits associated with the Heterosexual identity do contain content related to LGBT+ community members, e.g., same-sex couples will post in relationship_advice. These posts are the minority of content. However, this overlap is not likely to cause an issue with our analysis due to the direction of the error. Since LGBT+-focused communities largely feature content exclusive to that community, shifts in language of the Heterosexual-associated communities are due to either changes in the actual heterosexual population in those communities or increased participation of LGBT+ community members, both of which signal increased acceptance and normalization, which is the focus of our analysis.

C Association with Masculine References
This section describes the specific reference terms used in our study to determine the association of dude with masculine references. Reference terms noted here were drawn from the supplemental material in Caliskan et al. (2017) as names and words used to study gender. A graphical depiction of the results with shaded 95% confidence intervals is shown in Figure 6.
Male Names: john, paul, mike, kevin, steve, greg, jeff, bill
Female Names: amy, joan, lisa, sarah, diana, kate, ann, donna
Male Terms: male, man, boy, brother, he, him, his, son, father, uncle, grandfather
Female Terms: female, woman, girl, sister, she, her, hers, daughter, mother, aunt, grandmother
To validate against the possibility of synchronic shifts, we compute cosine similarities for these anchor words following a Procrustes alignment between sequential year vector spaces on our yearly word2vec models trained across samples of all posts and comments on Reddit and Twitter. As shown in Table 5, a high degree of cosine similarity was present for all anchor words, suggesting no significant synchronic shifts for these anchor words occurred.

D Word2Vec Training Details
This section lists the details and specific hyperparameters used for word2vec (Mikolov et al., 2013) model training.
We train all word embeddings in word2vec with gensim (Řehůřek and Sojka, 2010), setting word vector dimensionality to 300 with a continuous bag of words (CBOW) architecture. Training was performed until stable loss convergence, which was reached for all models at around 15 epochs. Other hyperparameters were left unchanged from library defaults. Training was performed on 50 Intel Xeon CPU cores and times ranged from 30 minutes to 2 hours, depending on dataset size.
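The configuration described above might look roughly like the following sketch. It assumes gensim 4.0 or later (where the parameters are named `vector_size` and `epochs`). Fixing the epoch count at 15 and using 50 worker threads are assumptions that approximate the described setup; they are not reported settings, and the function name is hypothetical.

```python
# Hyperparameters stated in the text; everything else stays at gensim defaults.
W2V_PARAMS = {
    "vector_size": 300,  # word vector dimensionality
    "sg": 0,             # 0 selects the CBOW architecture
    "epochs": 15,        # ASSUMPTION: fixed at the approximate convergence point
    "workers": 50,       # ASSUMPTION: one worker thread per available CPU core
}


def train_yearly_model(tokenized_comments):
    """Train one word2vec model on a list of token lists for a single year.

    `tokenized_comments` must be re-iterable (e.g. a list), because gensim
    makes one pass to build the vocabulary and further passes to train.
    """
    # Imported lazily so the configuration can be inspected without gensim.
    from gensim.models import Word2Vec  # assumes gensim >= 4.0

    return Word2Vec(sentences=tokenized_comments, **W2V_PARAMS)
```

A model returned by `train_yearly_model` exposes its vectors through `model.wv`, for example `model.wv["dude"]`, provided the word clears gensim's default minimum-count threshold.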

E Expanded Results
Expanded results for gendered PERSON use among identity-centric communities and associations with different socioeconomic status variables are shown in Figure 10. The following subsections present expanded results looking at the use of PERSON and SIGOTHER in specific communities of practice.

E.1 Religion
Discussions in religious communities are less likely to use gendered markers of PERSON relative to political, gender, and sexuality communities. This mirrors the finding that the recognition of and pushes towards more inclusive language have been prevalent in religious communities (Hardesty, 1987; Cochran, 2005), especially in Judaism (Adler, 1998), which has seen a multitude of feminist and progressive views (Raphael et al., 2003), while progressive movements within atheist communities are still in the minority (Kettell, 2014).

E.2 Community-Specific Case Study: Asexuality
Asexuality is a complex self-categorization with asexual sub-identities often referring to relationship preferences and/or an aromantic orientation (Bogaert, 2006;Prause and Graham, 2007;Mac- Figure 7, show that discussions in asexual communities are significantly more likely to use gender-neutral markers of both PERSON and SIGOTHER relative to the greater LGBT+ as well as heterosexual communities, mirroring aforementioned self-categorization survey findings.

E.3 Community-Specific Case Study: Trans-Exclusionary Radical Feminism
Trans-exclusionary radical feminists (TERFs) or "gender critical" feminists, a self-identified transphobic hate group community, propagate transphobia under the guise of feminism (Pearce et al., 2020). Following recent interest in the analysis of TERF community online behavior (Lu, 2020), here, we quantify PERSON and SIGOTHER variable use in the /r/gendercritical subreddit (by and large the most prominent TERF community on Reddit) and compare this community against variable usage in other communities of practice centered on particular gender identities. We hypothesize that the gender-binary viewpoint taken by the TERF community would lead to sharply different language use. Results, illustrated in Figure 8, show that discussions in the TERF community are more likely to use gendered markers of PERSON relative to those in other gender-related communities of practice. This high frequency of gender marking could reflect increased topical content on gender (content-driven) or the group's stronger focus on highlighting gender identity as salient (attitude-driven). However, conversations in the TERF community are also more likely to use gender-neutral markers when referring to SIGOTHER, which by construction are primarily references to one's own spouse/partner (unless used in a quote, which is rare). The rate of "partner" is statistically equivalent to the rate seen in communities of practice focused on transgender issues and identities. This behavior suggests divergent marking of gender: TERF users are more likely to mark gender when referring to others, but less likely to mark gender when referring to their own significant other. We view this unexpected result as pointing to an opportunity for future studies of the mechanisms behind this linguistic behavior, as the different practices of gender marking displayed in this community are not seen elsewhere.

F Discussion
Sociolinguistic research has consistently shown how subtle variation in language is reflective of attitudes and identity. Our work similarly finds changes in how references to significant others or to indefinite people or groups mirror broader changes in society. However, our study is built on aggregate analyses of social media, which warrants a discussion of potential limitations and caveats, as well as future work.
Confounding Variables. Compositional demographic changes in the subreddits we study may cause changes in the language use of these communities of practice. As marginalized groups gain increased social acceptance, they may more actively contribute to public forums like Twitter and Reddit. As a result, the observed linguistic changes may also reflect a diffusion of linguistic norms that is independent of attitude shifts. Nonetheless, our focus on studying language use in specific communities of practice from the perspective of potential attitude shifts shows that the observed discourse has changed, even if the underlying mechanisms behind that change (changing attitudes or changing group composition) remain to be precisely quantified.
Demographic Estimates. Our study relies on demographic estimates, particularly from using geocoding to infer census-based estimates of persons. However, the American Community Survey Census itself also possesses a degree of bias from a participatory perspective (Spielman et al., 2014). Nevertheless, the ACS remains the broadest coverage survey for linking census tracts to demographics.
Further, the user base of social media is largely younger and more male-dominated (on Reddit, in particular) than the general population (Barthel et al., 2016). Our study has focused on language use in particular online populations whose composition may not reflect Reddit as a whole. While tens of millions of American individuals use these platforms, their participation likely selects a subset of the population whose views do not necessarily generalize to the entire American populace. As a result, future work could test methods (or add additional platforms) to poststratify the analyzed segments of the population to increase representativeness.
Causality. Our study does not make explicit causal claims around factors that may have caused changes in social attitudes, i.e., claims that specific changes in attitude cause this language change. While our work shows evidence of linguistic and attitudinal changes correlated with known policy and legislation changes, like the quasi-causal results estimating the effect of a passage of a Marriage Equality Act on linguistic choices for persons within that state, we are not arguing these alone explain the change. Our work in no way seeks to diminish the active efforts of people and social movements that have for decades striven, and continue today, to advocate for the rights of, change the biased social perceptions towards, and champion the values of equality of traditionally marginalized populations and their lived experiences. One possibility for moving closer to truly causal studies is through direct participatory work or using causal inference techniques (Feder et al., 2021) to examine how attitudes influence word selection or how reading particular uses influences a person's attitude or interpretation of the passage.