Discovering Differences in the Representation of People using Contextualized Semantic Axes

A common paradigm for identifying semantic differences across social and temporal contexts is the use of static word embeddings and their distances. In particular, past work has compared embeddings against “semantic axes” that represent two opposing concepts. We extend this paradigm to BERT embeddings, and construct contextualized axes that mitigate the pitfall where antonyms have neighboring representations. We validate and demonstrate these axes on two people-centric datasets: occupations from Wikipedia, and multi-platform discussions in extremist, men’s communities over fourteen years. In both studies, contextualized semantic axes can characterize differences among instances of the same word type. In the latter study, we show that references to women and the contexts around them have become more detestable over time.


Introduction
Warning: This paper contains content that may be offensive or upsetting.
Quantifying and describing the nature of language differences is key to measuring the impact of social and cultural factors on text. Past work has compared English embeddings for people to adjectives or concepts (Garg et al., 2018; Mendelsohn et al., 2020; Charlesworth et al., 2022), or projected embeddings against axes representing contrasting attributes (Turney and Littman, 2003; An et al., 2018; Kozlowski et al., 2019; Field and Tsvetkov, 2019; Mathew et al., 2020; Kwak et al., 2021; Lucy and Bamman, 2021b; Fraser et al., 2021; Grand et al., 2022). Static representations for the same word can also be juxtaposed across corpora that reflect different time periods (Gonen et al., 2020; Hamilton et al., 2016). This paradigm of using embedding distances to uncover socially meaningful patterns has also transferred over to studies that measure biases in contextualized embeddings, such as Wolfe and Caliskan (2021)'s finding that BERT embeddings of less frequent minority names are closer to words related to unpleasantness.

Figure 1: An axis is constructed using embeddings of adjectives in selected contexts. These contexts are predictive of synonyms, but not antonyms, of the target adjective during masked language modeling. Token-level embeddings for people are then projected onto this axis.

The use of "semantic axes" is enticing in that it offers an interpretable measurement of word differences beyond a single similarity value (Turney and Littman, 2003; An et al., 2018; Kozlowski et al., 2019; Kwak et al., 2021). Words are projected onto axes where the poles represent antonymous concepts (such as beautiful-ugly), and the projected embedding's location along the axis indicates how similar it is to either concept. Semantic axes constructed using static, type-based embeddings have been used to analyze socially meaningful differences, such as words' associations with class (Kozlowski et al., 2019), or gender stereotypes in narratives (Huang et al., 2021; Lucy and Bamman, 2021b).
Our work investigates the extension and application of semantic axes to contextualized embeddings. We present a novel approach for constructing semantic axes with English BERT embeddings (Figure 1). These axes are built to encourage self-consistency, where antonymous poles are less conflated with each other. They are able to capture semantic differences across word types as well as variation in a single word across contexts. Their ability to differentiate contexts makes them suitable for studying how a word changes across domains or across individual sentences. These axes are also more self-consistent and coherent than ones created using GloVe and other baseline approaches.
We demonstrate the use of contextualized axes on two datasets: occupations from Wikipedia, and people discussed in misogynistic online communities. We use the former as a case where terms appear in definitional contexts, and characteristics of people are well-known. In the latter longitudinal, cross-platform case study, we examine lexical choices made by communities whose attitudes towards women tend to be salient and extreme. We chose this set of online communities as a substantive use case of our method, in light of recent attention in web science on analyzing online extremism and hate at scale (e.g., Ribeiro et al., 2021b,a; Aliapoulios et al., 2021). There, we analyze language change and variation along axes through a sociolinguistic lens, emphasizing that speakers use language that reflects their social identities and beliefs (CH-Wang and Jurgens, 2021; Huffaker and Calvert, 2017; Card et al., 2016; Lakoff and Ferguson, 2006).

Constructing semantic axes
Static embeddings. Several formulae for calculating the similarity of a target word to two sets of pole words have been proposed in prior work on static semantic axes. These differ in whether they take the difference between a target word's similarities to each pole (Turney and Littman, 2003), calculate a target word's similarity to the difference between pole averages (An et al., 2018; Kwak et al., 2021), or calculate a target word's similarity to the average of several word pair differences that represent the same antonymous relationship (Kozlowski et al., 2019). We build on the approach of An et al. (2018) and Kwak et al. (2021), because it does not require us to curate multiple paired antonyms for each axis, and it draws out the difference between two concepts before a target word is compared to them, rather than after. We define an axis V over antonymous sets of adjective vectors, S_l = {l_1, l_2, l_3, ..., l_n} and S_r = {r_1, r_2, r_3, ..., r_m}, as the difference between the pole averages:

V = (1/n) ∑_{i=1}^{n} l_i − (1/m) ∑_{j=1}^{m} r_j,

so that a target word's cosine similarity to V is positive when it leans toward S_l and negative when it leans toward S_r. Relying on single-word poles for axes can be unstable to the choice of each word (An et al., 2018; Antoniak and Mimno, 2021). An et al. (2018) create a pole's set of words using the nearest neighbors of a seed word, which may risk conflating unintended meanings or antonymous neighbors (Mrkšić et al., 2016; Sedoc et al., 2017). For example, one axis uses the opposite seed words green and experienced, but green's nearest neighbors include red rather than inexperienced. Instead of using this nearest neighbors approach, we construct poles using WordNet antonym relations. Each end of an axis aggregates synonymous and similar lemmas in WordNet synsets, which are expanded using the similar to relation (Miller, 1992).
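The pole-average construction above can be sketched in a few lines of numpy; the function names and toy vectors here are illustrative stand-ins, not the released implementation.

```python
import numpy as np

def build_axis(left_vecs, right_vecs):
    """Axis vector V: mean of left-pole embeddings minus mean of
    right-pole embeddings (after An et al., 2018; Kwak et al., 2021)."""
    return np.mean(left_vecs, axis=0) - np.mean(right_vecs, axis=0)

def axis_score(word_vec, axis_vec):
    """Cosine similarity of a target word to the axis; positive scores
    lean toward the left pole, negative toward the right pole."""
    return float(np.dot(word_vec, axis_vec) /
                 (np.linalg.norm(word_vec) * np.linalg.norm(axis_vec)))
```

With this sign convention, a word closer to S_l projects positively, matching the consistency check described in the internal validation section.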
Our type-based embedding baseline, GLOVE, uses 300-dimensional GloVe vectors pretrained on Wikipedia and Gigaword (Pennington et al., 2014). We only keep axes where both poles have at least three adjectives that appear in the GloVe vocabulary, and we also exclude acronyms, which are often more ambiguous in meaning. We start with 723 axes, where poles have on average 9.63 adjectives each.
Contextualized embeddings. Static embeddings, however, present a number of limitations. Such embeddings cannot easily handle polysemy or homonymy (Wiedemann et al., 2019), and even when they are trained on different social or temporal contexts, they require additional steps to be aligned (Gonen et al., 2020). Context-specific embeddings also need enough training examples of target words to create usable representations. These limitations prevent the analysis of token-based semantic variation, such as measuring how one mention of a word is more or less beautiful than another. Our main contribution of contextualized axes uses the same WordNet-based formulation as our GloVe baseline. Rather than each word in S_l or S_r being represented by a single GloVe embedding, we obtain BERT embeddings over multiple occurrences of each adjective. We use BERT-base, as this model is small enough for efficient application on large datasets and is popular in previous work on semantic change and differences (e.g., Hu et al., 2019; Lucy and Bamman, 2021a; Giulianelli et al., 2020; Zhou et al., 2022; Coll Ardanuy et al., 2020; Martinc et al., 2020). It is also used in tutorials for researchers outside of NLP, which means it has high potential use in computational social science and cultural analytics (Mimno et al., 2022).
For contextualized axes, we obtain a potential pool of contexts for adjectives sampled over all of Wikipedia from December 21, 2021, preprocessed using Attardi (2015)'s text extractor. This sample contains up to 1000 sentences, or contexts, containing each adjective, and we avoid contexts that are too short (under 10 tokens) or too long (over 150 tokens). We experiment with two methods of obtaining contextualized BERT embeddings for each adjective: a random "default" (BERT-DEFAULT) and one where contexts are picked based on word probabilities (BERT-PROB). For BERT-DEFAULT, we take a random sample of 100 contextualized embeddings across the adjectives in each pole. Since words can be nearest neighbors with their antonyms in semantic space (Mrkšić et al., 2016; Sedoc et al., 2017), our main approach, BERT-PROB, aggregates word embeddings over contexts that highlight contrasting meanings of axes' poles.
To select contexts, we mask out the target adjective in each of its 1000 sentences, and have BERT-base predict the probabilities of synonyms and antonyms for that masked token. We remove contexts where the average probability of antonyms is greater than that of synonyms, sort by average synonym probability, and take the top 100 contexts. One limitation of our approach is that predictions are restricted to adjectives that can be represented by one wordpiece token. If none of the words on a pole of an axis appear in BERT's vocabulary, we back off to BERT-DEFAULT to represent that axis.
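The filtering and ranking step can be sketched as below, assuming the masked-token probabilities of each synonym and antonym have already been collected from BERT for every candidate sentence; all names here are hypothetical.

```python
import numpy as np

def select_contexts(contexts, syn_probs, ant_probs, k=100):
    """Keep contexts where the average masked-token probability of
    synonyms exceeds that of antonyms, then return the k contexts with
    the highest average synonym probability.

    syn_probs, ant_probs: shape (n_contexts, n_words) arrays of BERT's
    masked-LM probabilities for each synonym / antonym word."""
    syn_mean = np.asarray(syn_probs).mean(axis=1)
    ant_mean = np.asarray(ant_probs).mean(axis=1)
    keep = np.where(syn_mean > ant_mean)[0]        # drop antonym-leaning contexts
    ranked = keep[np.argsort(-syn_mean[keep])]     # sort by synonym probability
    return [contexts[i] for i in ranked[:k]]
```

In practice the probability arrays would come from running BERT's masked-LM head over each sentence with the target adjective masked out.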
For each axis type, we also have versions where words' embeddings are z-scored, which has been shown to improve BERT's alignment with humans' word similarity judgements (Timkey and van Schijndel, 2021). For z-scoring, we calculate mean and standard deviation BERT embeddings from a sample of around 370k whole words from Wikipedia. As recommended by Bommasani et al. (2020), we use mean pooling over wordpieces to produce word representations when necessary, and we extend this approach to create bigram representations as well. These embeddings are a concatenation of the last four layers of BERT, as these tend to capture more context-specific information (Ethayarajh, 2019).
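A minimal sketch of the z-scoring step, assuming a background matrix of BERT embeddings (e.g., the ~370k-word Wikipedia sample) has already been computed; the dimensions and names are illustrative.

```python
import numpy as np

def zscore_embeddings(embs, background):
    """Standardize each embedding dimension using the per-dimension mean
    and standard deviation of a large background sample of embeddings
    (after Timkey and van Schijndel, 2021)."""
    mu = background.mean(axis=0)
    sigma = background.std(axis=0)
    return (embs - mu) / sigma
```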

Internal validation
We internally validate our axes for self-consistency.
For each axis, we remove one adjective's embeddings from either side, and compute its cosine similarity to the axis constructed from the remaining adjectives. For BERT approaches, we average the adjective's multiple embeddings to produce only one before computing its similarity to the axis. In a "consistent" axis, a left-out adjective should be closer to the pole it belongs to. That is, if it belongs to S_l, its similarity to the axis should be positive. We average these leave-one-out similarities for each pole, negating the score when the adjective belongs to S_r, to produce a consistency metric, C.
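The leave-one-out computation for one pole can be sketched as follows; `pole_consistency` is a hypothetical name, and calling it with the poles in both orders yields C for each side. Note that negating the score for an S_r word is equivalent to flipping the axis direction, which is what this sketch does.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pole_consistency(pole_vecs, other_pole_vecs):
    """Hold out each word of one pole, rebuild the axis pointing from the
    other pole toward this pole's remaining words, and average the
    held-out words' cosine similarities to it. An axis is 'consistent'
    when this score is >= 0 for both poles."""
    own = np.asarray(pole_vecs)
    other_mean = np.asarray(other_pole_vecs).mean(axis=0)
    scores = []
    for i in range(len(own)):
        rest_mean = np.delete(own, i, axis=0).mean(axis=0)
        scores.append(cosine(own[i], rest_mean - other_mean))
    return float(np.mean(scores))
```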
Table 1 shows C for different axis-building methods. An axis is "consistent" if both of its poles have C ≥ 0. GLOVE's most inconsistent axis poles often involve directions, such as east ↔ west, left-handed ↔ right-handed, and right ↔ left. These concepts may be difficult to learn from text without grounding. We find that the various BERT approaches' most inconsistent axes include direction-related ones as well, but they also struggle to separate concepts such as lower-class ↔ upper-class.
The best method for producing consistent axes is z-scored BERT-PROB, with a significant difference in C from z-scored BERT-DEFAULT and GLOVE (Mann-Whitney U-test, p < 0.001). It also produces the highest number of consistent axes. GLOVE presents itself as a formidable baseline, and BERT-DEFAULT struggles in comparison to it.

External validation: occupations

We perform external validation of self-consistent axes on a dataset where people appear in a variety of well-defined and known contexts: occupations from Wikipedia. We conduct two main experiments. In the first, we test whether contextualized axes can detect differences across occupation terms, and in the second, we investigate whether they can detect differences across contexts.

Data
We collect eleven categories of unigram and bigram occupations from Wikipedia lists: Writing, Entertainment, Art, Health, Agriculture, Government, Sports, Engineering, Science, Math & Statistics, and Social sciences (Appendix A). The number of occupations per category ranges from 3 in Math & Statistics to 48 in Entertainment, with an average of 27.2. We use the MediaWiki API to find Wikipedia pages for occupations in each list if they exist and follow redirects when necessary (e.g., Blogger redirects to Blog). For each occupation's singular form, we extract sentences in its page that contain it. In total, we have 3,015 sentences for 300 occupations.

Term-level experiment (occupations)
Each occupation is represented by a pre-trained GloVe embedding or a BERT embedding averaged over all occurrences on its page. If an axis uses z-scored adjective embeddings, we also z-score the occupation embeddings compared to it. We assign poles to occupations based on which side of the axis they are closer to via cosine similarity. Top poles are highly related to their target occupation category, as seen in the examples for z-scored BERT-PROB in Table 2. However, embeddings' proximity can reflect any type of semantic association, not just that a person actually has the attributes of an adjective. For example, adjectives related to unhealthy are highly associated with Health occupations, which can be explained by doctors working in environments where unhealthiness is prominent. Therefore, embedding distances only provide a foggy window into the nature of words, and this ambiguity should be considered when interpreting word similarities and their implications. This limitation applies to both static embeddings and their contextualized counterparts.
We conduct human evaluation on this task of using semantic axes to differentiate and characterize occupations. Three student annotators examined the top three poles retrieved by each axis-building approach and ranked these outputs based on semantic relatedness to occupation categories (Appendix B). These annotators had fair agreement, with an average Kendall's W of 0.629 across categories and experiments. Though GLOVE is a competitive baseline, z-scored BERT-PROB is the highest-ranked approach overall (Table 3). This suggests that more self-consistent axes also produce measurements that better reflect human judgements of occupations' general meaning.

Context-level experiment (person)
The identity of a word, and prior associations learned from BERT's training data, have the potential to overpower its in-context use (Field and Tsvetkov, 2019). Thus, we may want to discount word associations originally learned by BERT when we examine the use of a target word in a narrower context. Prior work has shown that words with higher frequency in BERT's training data tend to encode more context-specific information in their embeddings (Ethayarajh, 2019; Zhou et al., 2021; Wolfe and Caliskan, 2021). To investigate whether contextualized axes can measure context changes for people, we replace all occupation bigrams and unigrams with person, a very common word. This also makes contexts across different words comparable to each other, a property which we leverage later in Section 5.4.
Each person embedding is averaged over one occupation's contexts. The identity of person tends to overpower its similarity to axes across contexts, in that the top-ranked poles are similar across occupation categories. So, in contrast to the previous occupation experiment, additional steps are needed to draw out how the use of person in one group of contexts differs meaningfully from its typical use. To do this, we estimate the average cosine similarity to axes of n person embeddings in occupational contexts using 1000 bootstrapped samples, where n is the number of terms in an occupation category. We take the axes with the highest statistically significant (p < 0.001, one-sample t-test) difference in cosine similarity.
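The bootstrap step can be sketched as follows, assuming the per-context cosine similarities to an axis have already been computed; the one-sample t-test against the typical-use mean is omitted, and all names are hypothetical.

```python
import numpy as np

def bootstrap_mean_similarity(scores, n_boot=1000, n=None, seed=0):
    """Bootstrap the mean axis similarity: draw n_boot resamples of size
    n (with replacement) from per-context cosine similarities and return
    the resampled means, from which a CI or test statistic can be
    derived."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    n = n or len(scores)
    idx = rng.integers(0, len(scores), size=(n_boot, n))
    return scores[idx].mean(axis=1)
```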
We assume that occupations' Wikipedia pages mention them within definitional contexts, so top-ranked poles should reflect the original occupation replaced by person. These top poles are less intuitive than those output by the earlier term-level experiment (Table 2). Still, in some cases, such as for Government and Math & Statistics occupations, we uncover relative differences that distinguish one category from others. We only show three adjectives in the top two poles in Table 2 due to space considerations, but moving further down the list for z-scored BERT-PROB uncovers additional meaningful poles. For example, the pole spry, gymnastic, sporty is the third most prominent shift and highest similarity increase (+) in the person experiment for Sports occupations. In addition, human evaluators preferred BERT-PROB over other approaches (Table 3, Appendix B).

Measuring change and variation
Now that we have contextualized semantic axes that can measure differences across words and contexts, we apply them to a domain that can showcase salient and socially meaningful variation. NLP research on harmful language often employs methods that focus on the target group, such as measuring their association with other words (Zannettou et al., 2020; Garg et al., 2018; Tahmasbi et al., 2021; Field and Tsvetkov, 2019), or with biases in models (Wolfe and Caliskan, 2021; Ghosh et al., 2021). We illustrate the application of self-consistent z-scored BERT-PROB axes to the manosphere, a collection of communities with mostly male users who hold alternative beliefs around relationships and gender. We use the same axes we presented earlier, which were created using Wikipedia data, because Wikipedia provides more normative coverage of a variety of adjectives than topic-specific communities. This way, we examine how entities in the manosphere orient themselves against typical adjectival uses and meanings.
The manosphere has been linked to acts of violence in the physical world (Hoffman et al., 2020), and most members believe that men are systemically disadvantaged in society (Van Valkenburgh, 2021; Marwick and Caplan, 2018; Lin, 2017; Ging, 2019). These communities focus on heterosexual relationships and masculinity, and feature a dynamic linguistic landscape. Much prior work on the manosphere has been qualitative, such as ethnographies (Lin, 2017; Lumsden, 2019; Van Valkenburgh, 2021). There have been a few quantitative analyses of their language, usually focusing on phrase and word frequencies in a few communities (Farrell et al., 2019; Gothard et al., 2021; LaViolette and Hogan, 2019; Jaki et al., 2019). As an example involving word vectors, Farrell et al. (2020) use static embeddings to identify the meanings of incels' neologisms by inspecting words' nearest neighbors.
Our case study extends beyond prior work in its methodology and scale. We use contextualized semantic axes to tackle one question: how have references to women and the contexts around them changed over fourteen years?

Data
We use a taxonomy of subreddits and external forums described by Ribeiro et al. (2021a), who show that the manosphere began with ideologies such as pick-up artists (PUA) and Men's Rights Activists (MRA), and evolved into more extreme ones such as The Red Pill (TRP), incels (short for involuntary celibates), and Men Who Go Their Own Way (MGTOW), with users moving from older to newer ideologies. We call this dataset EXTREME_REL, because it contains extreme views of relationships.
We use Reddit posts and comments from March 2008 to December 2019 from subreddits listed in Ribeiro et al. (2021a)'s study, downloaded from Pushshift (Baumgartner et al., 2020). We slightly modify their taxonomy by separating out incel subreddits where the intended user base is women (femcels), and also include a newer set of subreddits focused on "Female Dating Strategy" (FDS), a women-led community analogous to TRP (Holden, 2020; Clark-Flory, 2021). Therefore, we have 60 subreddits in seven ideological categories: Incels, MGTOW, PUA, MRA, TRP, FDS, and Femcels (Appendix C). This Reddit subset of EXTREME_REL contains over 1.3 billion tokens.
We also include seven external forums provided by Ribeiro et al. (2021a). These public forums include A Voice for Men (AVFM), Master Pick-up Artist (MPUA) Forum, The Attraction, incels.co, MGTOW Forum, RooshV, and Red Pill Talk. This forum subset of EXTREME_REL contains over 800 million tokens spanning November 2005 to June 2019, and we remove duplicates and quoted text from posts. Some experiments use a subset of Reddit that shares a similar topical focus as EXTREME_REL, but may have more mainstream views of women and relationships.
We call this dataset GENERAL_REL, and it contains 1.2 billion tokens from September 2009 to December 2019. For Reddit data, we do not use posts and comments written by users with bot-like behavior, which we define as repeating any 10-gram more than 100 times.
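The bot filter can be sketched with a simple n-gram counter over a user's posts; whitespace tokenization here is a simplification of the actual preprocessing.

```python
from collections import Counter

def is_botlike(posts, n=10, threshold=100):
    """Flag a user as bot-like if any n-gram (here, 10-gram) across
    their posts repeats more than `threshold` times."""
    counts = Counter()
    for post in posts:
        tokens = post.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return any(c > threshold for c in counts.values())
```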

Vocabulary
We use a mix of NER, online glossaries, and manual inspection to curate a unique vocabulary of people (details in Appendix D). This vocabulary has 2,434 unigrams and 4,179 bigrams, tokenized using BERT's tokenizer without splitting words into wordpieces (Devlin et al., 2019; Wolf et al., 2020). These terms appear at least 500 times in EXTREME_REL.
Since gender is central to the manosphere, we infer gender labels based on terms' social gender in each dataset. For example, accuser is not semantically gendered like girl and woman, but its social gender, estimated using pronouns, is more feminine in EXTREME_REL than GENERAL_REL. We use two stages of gender inference to account for pronoun sparsity and noise. First, we use a list of semantically gendered nouns, and second, we use feminine and masculine pronouns linked to terms via coreference resolution (details in Appendix E). We label each vocabulary term based on its fraction of co-occurring feminine pronouns in EXTREME_REL and GENERAL_REL, separately. We are able to label 72.5% of the vocabulary in EXTREME_REL and 67.0% of it in GENERAL_REL.
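The pronoun-based labeling in the second stage reduces to a simple fraction; this sketch omits the gendered-noun list and the coreference-resolution step, and the function name is hypothetical.

```python
def gender_leaning(feminine_pronouns, masculine_pronouns):
    """Fraction of co-occurring feminine pronouns for a term, computed
    per dataset; terms with no linked pronouns stay unlabeled."""
    total = feminine_pronouns + masculine_pronouns
    if total == 0:
        return None  # unlabeled: no pronouns linked via coreference
    return feminine_pronouns / total
```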

Term-level change
Contextualized semantic axes can reveal how word and phrase types change over time. Here, our analyses focus on 1,482 feminine (gender-leaning > 0.75) terms in EXTREME_REL. To capture broad snapshots of words' use, we randomly sample up to 500 sentence-level occurrences of each term in each platform and ideology (e.g., a specific forum or Reddit category) in each year. Overall z-scored BERT embeddings for each vocabulary word are averages over this stratified sample of its contexts.
The history of the manosphere is characterized by waves of different ideological communities (Ribeiro et al., 2021a). To reflect this characterization through language, we segment our vocabulary based on when terms peak in popularity. We cluster normalized frequency time series for each term using K-Spectral Centroid clustering (KSC) (Yang and Leskovec, 2011). We use their default parameters, including K = 6. In contrast to their original approach, our distance measure d is invariant to scaling by α but not to translation of the time series, so that peaks earlier in time are not clustered with those later in time:

d(x, y) = ||x − αy|| / ||x||, where α = xᵀy / ||y||².

"Waves" of term types for people correspond to ideological change. Figure 2 shows examples of feminine terms, but the top masculine terms are often labels of ideological groups, such as mgtow and incels, which we use to estimate which clusters align with ideological upturns and downturns. Clusters A and D tend to contain terms with widespread use.
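The modified distance follows directly from the formula above; this sketch covers only the pairwise distance, not the clustering loop, and the function name is illustrative.

```python
import numpy as np

def ksc_distance(x, y):
    """Scale-invariant (but not shift-invariant) distance between two
    frequency time series: y is optimally rescaled by
    alpha = x.T y / ||y||^2 before comparison (after Yang and Leskovec,
    2011, without their translation search)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    alpha = x.dot(y) / y.dot(y)
    return float(np.linalg.norm(x - alpha * y) / np.linalg.norm(x))
```

Dropping the translation search is what keeps early peaks from being matched to late ones: two series with the same shape but shifted in time are no longer considered close.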
We examine the shifts of high-variance, substantive axes across temporal clusters. High-variance axes include those related to gender, appearance, and desirability (Table 4). For example, the lovable versus detestable pole contrasts beautiful girls with degenerate whores. As another example, the axis for clean versus dirty contrasts loyal wife with harlots. Prior studies using toxicity detection and lexicon-based approaches found that hate and misogyny rose with the arrival of later MGTOW and incel communities (Farrell et al., 2019; Ribeiro et al., 2021a). Similarly, we find that lexical choices for women are more detestable and dirty in later waves associated with MGTOW and incels (Figure 3). Often, low- and high-frequency words share similar patterns in each wave.

Context-level change
Contextualized semantic axes can reveal how the contexts around people have changed over time. Women in online communities can be referenced in a variety of ways (Figure 2). To compare overall changes around women between mainstream and extremist communities, we examine the contexts around feminine (gender-leaning > 0.75) words. We use instances of 287 unigram types, since bigrams can include modifiers that would be considered "context". As discussed earlier, word identities impact measurements of contextual changes across them (Section 4.3). We replace each target word with person or people depending on whether it is singular or plural, estimated through the Python INFLECT package. We choose replacements that respect singular/plural forms to ensure ecological validity and avoid perturbing BERT's sensitivity to grammaticality (Yin et al., 2020). We use reservoir sampling to obtain up to 1000 occurrences of person- or people-replaced feminine words in each month of EXTREME_REL and GENERAL_REL.

Figure 4: Contexts around singular (person) or plural (people) feminine words over time in EXTREME_REL and GENERAL_REL along three axes. Time series include 95% CIs, and dotted lines mark the peaks of major ideological communities (gray labels). These vertical lines are months that have the highest normalized frequencies of words used to refer to their members: puas, mras, trpers, mgtows, and incels.
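Reservoir sampling keeps a uniform sample of up to k items in one pass over a stream of unknown length, which suits month-sized slices of large corpora; this is the standard Algorithm R, not the authors' implementation.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniformly sample up to k items from an iterable of unknown length
    in a single pass (Algorithm R). Each item ends up in the sample with
    probability k / n."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # inclusive on both ends
            if j < k:
                reservoir[j] = item         # replace with prob. k/(i+1)
    return reservoir
```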
In comparison to GENERAL_REL, EXTREME_REL has more detestable, sickening, and dirty contexts for women (Figure 4). Both GENERAL_REL and EXTREME_REL discuss relationship issues, but contextualized axes reveal how contrasting and changing attitudes toward women can influence context. Negative associations especially peak during the height of the incel movement, around late 2017 to mid 2019. These persist despite Reddit's ban of r/incels in November 2017 and the quarantine of r/braincels and r/theredpill in September 2018. Thus, the widespread efficacy of community-level moderation is worthy of closer study (e.g., Copland, 2020; Ribeiro et al., 2021b). An advantage of computing scores at the token level rather than at the type level is interpretability. That is, one can see which contexts land at the extreme ends of axes (as illustrated in Table 5).
Contextualized semantic axes can also illuminate differences among lexical variables, or different linguistic forms that share the same referential meaning (Nguyen et al., 2021; Labov, 1972).
As prominent examples, men-led communities use the lexical innovations femoids and foids, which are shortenings of female humanoids, as dehumanizing words for all women (Chang, 2020; Prażmo, 2020). Two women-led communities, Femcels and FDS, use moids as an analogous way to refer to men. Prior work studying three manosphere subreddits showed that the lemmas woman and girl are constructed negatively as immoral, deceptive, incapable, and insignificant (Krendel, 2020). We hypothesize that the contexts of community-specific variants should have even more dehumanizing connotations along similar dimensions. In this experiment, we replace all terms (men, moids, foids, femoids, and women) with people.
We sample up to 100 occurrences of each variant in each platform and ideology per year, limiting time ranges to when domain-specific variants are widely used by their home community. We examine the use of variants for men by Femcels and FDS in 2018-2019, and the use of variants for women by all other communities in EXTREME_REL in 2017-2019. Unlike in the person experiment for occupations, we have substantial pools of occurrences to compare. Thus, to find axes that distinguish one variant from another, we use axis scores as features in random forest classifiers (Pedregosa et al., 2011), and perform binary classification of word identity: women versus foids or femoids, and men versus moids (Appendix G). We rank axes based on their feature importance, and select three highly ranked and relevant axes to show in Figure 5. Shifts along these axes confirm our hypothesis that community-specific variants are more dehumanized than their widely-used counterparts.
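The classification-and-ranking step can be sketched with scikit-learn, assuming a matrix of per-instance axis scores is already available; the function and axis names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_axes_by_importance(axis_scores, labels, axis_names, seed=0):
    """Fit a random forest to classify word identity (e.g., women vs.
    foids) from per-instance axis scores, then rank axes by the forest's
    feature importances."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(axis_scores, labels)
    order = np.argsort(-clf.feature_importances_)
    return [(axis_names[i], float(clf.feature_importances_[i]))
            for i in order]
```

Feature importances here only indicate which axes best separate the variants; inspecting the direction of shift along each top axis (as in Figure 5) is still a separate step.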

Conclusion
In this work, we examine the capability of contextualized embeddings for discovering differences among words and contexts. Our method uses predicted word probabilities to pinpoint which contexts to include when aggregating BERT embeddings to construct axes. This approach creates more self-consistent axes that better fit different occupation categories, in comparison to baselines. We further demonstrate the use of these axes in a longitudinal, cross-platform case study. Overall, contextualized embeddings offer more flexibility and granularity compared to static ones for the analysis of content across time and communities. That is, rather than train static word embeddings for various subsets of data, we can characterize change and variation at the token level.
Though we focus on analyzing associations between adjectives and people, our approach can generalize to other types of entities as well. Measuring and comparing the contexts of other entity types should include many of the same considerations as ours, such as reducing the conflation of antonyms, controlling for word identity by replacing target words with a shared hypernym, and experimenting with z-scoring. Future work includes understanding why some opposing concepts are conflated in large language models, and how a word embedding's identity influences its encoding of contexts.

Limitations
Aside from computing power requirements (Appendix H), we outline a few additional limitations of our methodology and its application not discussed in the main text.

Domain shift. The use of pretrained BERT on a niche set of communities makes our approaches susceptible to domain shift, such as rare words having less robust embeddings (Zhou et al., 2021, 2022), or target words carrying over learned associations from a broader corpus that are less applicable in a narrower one. Domain shift is difficult to avoid without retraining or further pretraining BERT, which is resource-intensive, may risk catastrophic forgetting, and is inaccessible to some disciplines in computational social science (Gururangan et al., 2020; Ramponi and Plank, 2020; Goodfellow et al., 2014). Also, training a large language model on text with toxic and misogynistic origins introduces additional risk of dual use (Kurenkov, 2022). We suggest some potential workarounds that lessen the severity of domain shift, such as replacing target words with common ones for context-focused analyses.
WordNet. WordNet is a popular lexical resource for NLP, but its senses for words can be overly fine-grained (Pilehvar and Camacho-Collados, 2019) and not suitable for all domains. We use WordNet version 3.0, which is included in NLTK, and this version was last updated in 2006. Since English is constantly changing, some synonym and antonym relations may be outdated.
Errors. Our method for drawing out differences in words is better than common baselines yet still imperfect, and some of the opposing concepts in embedding space that BERT struggles to separate may be important for an application domain. Therefore, domain expertise is needed to distinguish spurious patterns from real ones and fill these gaps.
In the main text we mention that embeddings offer a "foggy window" into how two concepts may be associated or related, and the exact type of relation is not always clear. For example, if contexts for women are closer to unpleasant, does it mean that the text discusses unpleasant events that affect women, or that the writers believe that women are unpleasant, or both? Some of this uncertainty could be resolved qualitatively by inspecting sentences at poles' extremes. We compare embeddings for people to axes, but it is also possible to include relation-based approaches such as dependency parsing and compare words that share specific relations with people to axes (e.g., Lucy and Bamman, 2021b). One trade-off of doing this is that informative verbs and adjectives connected to mentions of target groups can be sparse. Our method is able to find that mathematician replaced with person is highly similar to calculable in a variety of sentence structures, such as this one modified from Wikipedia: A person is someone who uses an extensive knowledge of mathematics in their work, typically to solve mathematical problems.

Ethical considerations
User privacy. Online data opens many doors for research, but its use raises concerns around user privacy. For our use case, we believe that the benefits of our work outweigh privacy-related harms. Consent is infeasible to obtain for large datasets (Buchanan, 2017), and in the manosphere, it is unlikely that users would give consent, especially if the researchers using their data believe that their ideologies are harmful and wrong. Obtaining consent would pose risks to the safety of the researcher (Conway, 2021; Doerfler et al., 2021).
All online discussions included in our work were public when downloaded by their original curators, mainly Baumgartner et al. (2020) and Ribeiro et al. (2021a). Some forums and online glossaries were later relocated, shut down, banned, or made private. A user's "right to be forgotten" confronts researchers who have interests in documenting and studying the histories of communities. We truncate the examples shown in our paper rather than quote them in full verbatim (Bruckman, 2002).
Communities may expect their posts to stay within their in-group, but the content in our work was posted on public platforms. This publicness and increased visibility play a key role in how this content impacts others, such as those who view this information and propagate it elsewhere, or those who are direct targets of hate. Common targets such as women and people of color carry a bigger burden when participating in online spaces (Hoffmann and Jonas, 2017), and our broader research agenda aims to mitigate this issue.
Social biases in models and resources. We use WordNet to group similar adjectives into semantic axes, but we observe some socially harmful associations in this resource. For example, gross and fat are listed as similar lemmas. As another example, WordNet conflates gender and sexuality when androgynous and bisexual are also listed as similar lemmas. The BERT language model, like all large pretrained models, is also susceptible to social biases in its training data (Bender et al., 2021).
Gender inference. In this paper's main case study, we perform gender inference for word and phrase types. This step was necessary to study how women are portrayed over time, a key question due to the centrality of misogyny in these communities. However, perfect prediction of each word's perceived gender in our dataset using pronouns is impossible (Cao and Daumé III, 2021). Not all mentions of people co-occur with pronouns, pronouns do not equate to gender, and coreference resolution systems can produce errors. So, we approximate the social gender of terms by aggregating coreference patterns over all instances of a term. Since it is difficult to separate noisy errors from meaningful word-level pronoun variation at scale, we used a score threshold to pinpoint which words were feminine-leaning enough to be included in our analyses.
Restricting pronouns to the traditional binary of feminine and masculine is limiting, since individuals use other pronouns as well. They/them pronouns are predominantly used to reference plural terms in this dataset, and the coreference model we use does not handle neopronouns. The manosphere, and the typical framing under which it is studied, is heavily cisheteronormative. We use a frequency cutoff to determine our vocabulary (Appendix D), so references to transgender and nonbinary people may be filtered out. Vocab terms retained for transgender people are outdated or typically offensive terms such as transsexuals and transgenders, and no vocab term includes non-binary, nb, or nonbinary.

A Wikipedia page titles
Table 6 lists the categories of occupations, the titles of Wikipedia pages that list them, and the number of terms in each category. These lists were retrieved in February 2022.

B Human evaluation for occupations
We recruited three student volunteers with familiarity with NLP coursework and tasks to rank the top poles provided by each axis-building method for our occupation and person experiments. We used Qualtrics to design and launch the survey. Since we were not asking about personal opinions but rather evaluating models, we were determined exempt from IRB review by the appropriate office at our institution. Each question pertains to a specific occupation category, and within each experiment, question order and answer option order are randomly shuffled. Each model option is presented with its top three poles, in order of most to least relevant. Figure 6 shows screenshots of the instructions, which read as follows:

Hi! Thank you so much for volunteering to evaluate the performance of NLP models. Please read these instructions carefully.

In this task, you will judge how much lists of adjectives from WordNet outputted by models are semantically related to occupational differences described in Wikipedia. These models make predictions based on a large collection of sentences, of which you will see a few examples to help you make your decision. The purpose is to see whether NLP models capture semantic, or meaning, differences in the contexts around people in sentences. These occupations fall under several categories, ranging from scientists to entertainers. You are deciding which models' outputs are typically more related to occupations, which may not reflect your personal opinions about occupations. There are two sets of questions, and 11 questions in each set.

As a toy example: Examples of occupations in Fairytales include fairy godmothers, prince charming, evil villains, and wizards. You are given the sets of adjectives below. Adjective sets include "MORE" and "LESS" labels based on how people in the category above are more or less related to them, in comparison to other people who work as artists, government workers, and scientists. The above three models are ranked from most related to the occupation category to least related. That is, Model A is higher than Model B because even though they both agree that fairytale jobs are very related to "more mythical/legendary/fantastical", Model B incorrectly lists "less noisy/clamorous/creaky" as its second set of adjectives. Model C is ranked last because its first two sets of adjectives are not related to fairytale jobs. Try to be consistent in your rankings. That is, in the example above, you should not rank Model C after A and before B because A and B agree on the first set and overall share two valid adjective sets. Model C is more of an outlier, with only one valid third adjective set.

In the toy example, the options are labeled "Model A", "Model B", and "Model C" for clarity of explanation, but in the actual task questions, options are not labeled with model letters to avoid biasing the evaluators towards a specific model. Some annotators expressed that the task was difficult, and for some occupations, different approaches output similar axes, just in a different order.

C Reddit communities
We used a list of subreddits for the manosphere provided by Ribeiro et al. (2021a) in their detailed, data-driven sketch of the manosphere. Five of the subreddits included in Ribeiro et al. (2021a)'s taxonomy of the Reddit manosphere (r/malecels, r/lonelynonviolentmen, r/1ncels, r/incelbrotherhood, r/incelspurgatory) were not in Pushshift's dump of Reddit. We curated the list of communities for our new ideological category, Female Dating Strategy (FDS), using a now-removed list of FDS's "sister communities" on the subreddit r/FemaleDatingStrategy's sidebar: r/PinkpillFeminism, r/AskFDS, r/FDSSuperFans, r/PornFreeRelationships, and r/FemaleLevelUpStrategy. The Femcels set of subreddits includes r/Trufemcels, r/TheGlowUp, and r/AskTruFemcels. Though the main user base of the manosphere is men, there are also small populations of women in other ideologies, such as r/redpillwomen. We mainly portion out FDS and Femcels due to their role in Section 5.4's lexical variant experiment as communities that use moids.
In total, we have 12 subreddits in TRP, 11 in MRA, 7 in PUA, 22 in Incels, 3 in MGTOW, 4 in Femcels, and 6 in FDS. The complete list of subreddits and their categories is also in our GitHub repo.

D Vocabulary creation
First, we extract nominal and proper persons using NER, keeping ones that are popular (occurring at least 500 times in EXTREME_REL) and unambiguous (at least 20% of a term's instances in these datasets are tagged as a person). Gathering a substantial number of labels from our domain to train an in-domain NER system from scratch is outside the scope of our work, so we experimented with three models trained on other labeled datasets: ACE, contemporary literature, and a combination of both. We evaluated these models on a small set of posts and comments labeled by one author, after retrieving 25 examples per forum or Reddit ideological category using reservoir sampling. The annotator only labeled spans for nominal and named PERSON entities. Table 7 shows the performance of each model on EXTREME_REL. Based on these evaluation results, we chose the model trained on contemporary literature.
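The popularity and unambiguity filters described above can be sketched as follows; the candidate terms and their counts are hypothetical, while the thresholds (at least 500 occurrences, at least 20% of instances tagged as a person) follow the text.

```python
def keep_term(total_count, person_count, min_count=500, min_person_frac=0.2):
    # Keep a candidate term only if it is popular enough and is
    # tagged as a PERSON entity often enough to be unambiguous.
    if total_count < min_count:
        return False
    return person_count / total_count >= min_person_frac

# Hypothetical counts: (occurrences in the corpus, occurrences tagged PERSON).
candidates = {
    "chad": (2400, 1900),   # frequent, mostly a person -> kept
    "tool": (3100, 300),    # frequent, usually a non-person sense -> dropped
    "newword": (120, 110),  # too rare -> dropped
}
kept = sorted(w for w, (n, p) in candidates.items() if keep_term(n, p))
print(kept)  # ['chad']
```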
We extract bigrams and unigrams from detected spans, excluding determiners and possessives whose heads are the root of the span. Named entities that refer to types of people rather than specific individuals were identified through their co-occurrence with the determiner a, e.g. a Chad.
Then, one author consulted community glossaries and examined in-context use of words to manually correct the list of automatically extracted terms. We include additional popular and unambiguous words not tagged often enough by NER but defined as people in prior work and online resources.
Table 8 lists the sources and glossaries for vocabulary words and the ideologies they include. Some of these sources, such as the Shedding of the Ego, are created by insiders in the community, while others, such as academic papers and news articles, are by outsiders. For each of these glossaries and lists of terms, we manually separated entries into two categories: 269 people (singular and plural forms) and 1776 non-people. Two of these sites, Shedding of the Ego and Pualingo, no longer exist, but were publicly available until at least late 2020. We include 93 terms for people that were initially filtered out in our NER pipeline in our final vocabulary, excluding ambiguous ones that also occur often as non-human entities, such as tool (a fool who is taken advantage of) and jaw (short for just another wannabe).
The resulting vocabulary contains niche language: 20.7% of unigrams are not found in WordNet, and 85.1% of those missing are also not in the Internet resource Urban Dictionary. The full list is also available in our GitHub repo.

E Gender inference
This section includes additional details around our gender inference process.
Our list of semantically gendered terms, or words gendered by definition, expands upon the one used by Hoyle et al. (2019): man, men, boy, boys, father, fathers, son, sons, brother, brothers, husband, husbands, uncle, uncles, …

Table 8: Sources for non-NER detected terminology we include in our study. Shedding of the Ego can be viewed in the Internet Archive. On the other hand, Pualingo was taken down and removed from the Internet Archive during the preparation of this paper. In some cases, the focus community is the entire manosphere, while in others, it is a subset.
To infer gender for the remaining words using pronouns, we ran coreference resolution on EXTREME_REL and extracted all pronouns that are clustered in coreference chains with terms in our vocabulary (Clark and Manning, 2016). We label the masculine-to-feminine leaning of vocab terms by calculating the proportion of feminine pronouns (she, her, hers, herself) over the sum of feminine and masculine pronouns (he, him, his, himself). We only consider a word to have a usable gender signal if it appears in at least 10 coreference clusters with feminine or masculine pronouns. Since plural words do not usually appear with he/she pronouns, we have plural words take on the gender leaning of their singular forms. We pair plural and singular forms using the Python INFLECT package. We also transfer unigrams' gender to bigrams, after examining the modifiers (the first token) in bigram terms to check that they are not semantically gendered in a different direction. Around 20.9% of our vocabulary in EXTREME_REL is gendered through pronouns alone, an additional 12.6% through plural-to-singular mapping, and an additional 9.1% through bigram-to-unigram mapping.
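A minimal sketch of the pronoun-based gender-leaning score described above; the pronoun counts are hypothetical, and for simplicity each count stands in for one coreference cluster, so the minimum-cluster threshold is applied to the total count.

```python
FEMININE = {"she", "her", "hers", "herself"}
MASCULINE = {"he", "him", "his", "himself"}

def gender_leaning(pronoun_counts, min_clusters=10):
    # Proportion of feminine pronouns over feminine + masculine ones;
    # returns None when the gendered signal is too sparse to be usable.
    fem = sum(c for p, c in pronoun_counts.items() if p in FEMININE)
    masc = sum(c for p, c in pronoun_counts.items() if p in MASCULINE)
    if fem + masc < min_clusters:
        return None
    return fem / (fem + masc)

print(gender_leaning({"she": 40, "her": 35, "he": 5}))  # 0.9375
print(gender_leaning({"she": 3, "he": 2}))              # None: too few clusters
```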

F High variance axes
Table 9 shows the top vocabulary terms that correspond to the poles of high variance axes.

G Classification of lexical variants
Our main goal here is to tease out which axes differentiate the contexts of lexical variants, rather than to find the best model for a classification task. Therefore, we use a random forest classifier for its interpretability: it outputs weights that indicate which features were most important across its decisions. We use scikit-learn's implementation, and perform randomized search with 5-fold cross validation and weighted F1 scoring to select model parameters (Table 10). Table 11 shows the most important axis features of these models. In general, the set of most important features did not change much with parameter choices, and roughly aligns with axes that showcase the largest mean differences between each pair of variants. That is, the three axes we show in the main text in Figure 5 are also among the top ten ordered by mean difference for men vs. moids and women vs. femoids.
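The setup described above can be sketched with scikit-learn as follows. The feature matrix of axis scores is synthetic (with axis 0 constructed to be predictive of the variant label), and the parameter grid is a stand-in for the one in Table 10; feature importances are read off the best estimator found by randomized search.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
# Synthetic stand-in: rows are word occurrences, columns are axis scores;
# axis 0 is built to be predictive of the lexical-variant label.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    n_iter=4,
    cv=5,
    scoring="f1_weighted",
    random_state=0,
)
search.fit(X, y)

# Feature importances indicate which axes drive the classifier's decisions.
importances = search.best_estimator_.feature_importances_
print(int(np.argmax(importances)))  # the informative axis
```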

H Runtime and infrastructure
We only use BERT-base for inference, but the overall runtime cost is high due to the size of our corpora: English Wikipedia and social media discussions. We use one Titan XP GPU with 8 CPU cores for most of the paper, and occasionally expanded to multiple machines with 1080ti and K80 GPUs in parallel when handling social media data. We use BERT for two main purposes: predicting word probabilities to select contexts for constructing axes, and obtaining word embeddings.

Figure 1: An axis is constructed using embeddings of adjectives in selected contexts. These contexts are predictive of synonyms, but not antonyms, of the target adjective during masked language modeling. Token-level embeddings for people are then projected onto this axis.

Figure 2: Top five most frequent feminine (gender leaning > 0.75) vocabulary terms in each time series cluster, with their gender-leaning score. In each cluster's figure, cluster centers µ are thick lines, with time series of all vocab terms in light gray. Cluster centers are scaled down to half the maximum height of words' timelines. All time series start in Nov 2005 and end in Dec 2019.

Figure 3: Average axis scores among temporal clusters of feminine word types introduced in Figure 2. Cluster averages include 95% CI, vertical dotted lines mark axis midpoints, and clusters are split based on overall frequency percentile. Clusters C, E, and F align with later, more hateful ideologies.
Table 5: Examples of people, when replacing words for women, in different contexts along the lovable ↔ detestable axis in EXTREME_REL. These examples have the maximum or minimum score in their month, and were included in the sample used in Figure 4.

 0.244  "use this against us men ... those evil people!"
 0.240  "... these people pollute our public ..."
 0.234  "... parasite worthless whore people."
-0.156  "... I have two little people and they are absolutely amazing ..."
-0.144  "people who are this young and attractive ..."
-0.137  "... my ideal relationship and people like this ..."

Figure 5: Average axis scores of words used by men-led communities to refer to women (squares), and words used by women-led communities to refer to men (circles). Community-specific variants have lighter colors, and bars indicate 95% CI.

Figure 6: Instructions and a toy example shown to human evaluators.

Table 2: The top two z-scored BERT-PROB axis poles, ordered from left to right, for each occupation category and experiment. Each pole is represented by three example adjectives drawn from the set used to construct that pole. Since the person experiment compares each occupation category to all others, + or − indicates the direction of the shift in axis similarity. For example, sports occupations are still closer to responsible than irresponsible, just less so (−) than other occupations.

Table 3: Average rank of each axis-building method for each experiment, across human evaluators and occupation categories. 95% CI in parentheses.

Table 4: The axes with the largest variance among feminine-leaning terms in EXTREME_REL. An extended version of this table, with more high-variance axes and examples of top words at each pole, is in Appendix F.

Table 6: Wikipedia page titles for pages containing lists of occupations.

Table 7: Model performance on a human-annotated sample of Reddit and forum data. The F1 score we used to determine our choice of model is highlighted in bold.