How-to Guides for Specific Audiences: A Corpus and Initial Findings

We collect how-to guides for different target audiences and investigate qualitative and quantitative differences.


Introduction
How-to guides provide practical instructions that help humans to achieve specific goals. In the past decades, such guides have also attracted increasing interest in NLP and AI research (Branavan et al., 2009; Chu et al., 2017; Anthonio et al., 2020). Resources such as wikiHow,¹ a collaboratively edited online platform for instructional texts, make it possible to scale research efforts to hundreds of thousands of articles. As such platforms cover an ever-increasing number of guides, including niche topics and articles for minority groups, there is also an increasing risk of perpetuating stereotypes and jeopardizing general accessibility. In fact, we notice that wikiHow already contains articles written for specific target groups as well as articles that exist in different versions for different audiences. As an example, Table 1 shows two articles with the same title, "Act Like a Kid Again", one with the indicator '(Girls)' and one with '(Boys)'.
¹ www.wikihow.com

Act Like a Kid Again (Girls): Eat well and exercise, but don't obsess about your body. Be healthy without stressing too much about it. (...) Generally, go for lots of fruits and veggies. And even though kids love sugar, don't eat too much of it!

Act Like a Kid Again (Boys): Eat your childhood favorite food. Recollect every snack, chocolates, ice cream, candy bars, cotton candy and everything that you loved as a kid or would make you feel pampered. Eat as per your capacity as too much at once may make you feel uncomfortable.

Table 1: Two versions of the same guide in wikiHow.
Among other things, we find that such articles differ dramatically in terms of details. For example, the texts highlighted in Table 1 vary in how much they focus on issues potentially related to body image. As such, the articles reflect disparate standards, which ultimately may contribute to discrimination (Prentice and Carranza, 2002). The specific example can also be linked to observations of gender differences in weight concerns from psychology (Dougherty et al., 2022), which might represent a reason for disparate treatment. On the surface, it is not always possible to say exactly why there are certain differences in articles for specific audiences. However, through qualitative and quantitative comparisons on the linguistic level, we can at least determine what types of differences are present and to what extent they can be systematically identified. In this sense, we aim to contribute to questions about biases and fairness in data and, at the same time, connect to related research in psychology and other social sciences.
There already exists a large body of research that examines biases and stereotypes in NLP data and, likewise, how-to guides from wikiHow have been used as training material for a variety of language processing tasks (§2). However, previous studies have not explicitly looked into issues related to bias in the wikiHow data. As a first step towards addressing this gap, we create our own sub-corpora of how-to guides, which let us investigate differences across articles for specific target groups (§3).
We perform two case studies and a generalization study on our collected data: In the first study, we identify a number of articles that exist in multiple variants for different target groups and examine them in terms of distinctive content and linguistic characteristics (§4). As a second case study, we explicitly examine how far topics covered for specific target groups differ from each other (§5). Finally, we investigate whether the qualitative findings from our case studies can be validated quantitatively and generalized to our whole corpus using computational modeling (§6).
In summary, we find systematic differences between articles for specific groups in terms of topic, style, and content. We conclude the paper with a discussion of these findings and point out links to existing work in the social sciences (§7).

Related Work
We summarize existing work on the three strands of research that this paper builds on: wikiHow as a data source (§2.1), subtle biases in datasets (§2.2), and the characteristics of texts that target specific audiences (§2.3).

wikiHow as a Data Source
wikiHow is a prominent data source for a variety of tasks, including summarization (Koupaee and Wang, 2018), goal-step inference (Zhang et al., 2020), and question answering (Cai et al., 2022). By exploiting the revision history of wikiHow, Anthonio et al. (2020) created wikiHowToImprove, which has been used to better understand phenomena related to the (re-)writing process of how-to guides (Roth and Anthonio, 2021; Anthonio et al., 2022). Writing, and especially revising, instructions should presumably take into account the readers' context, perspective, and knowledge about the domain and the world. The need for clarification stands out prominently as a main purpose of the refinements of wikiHow guides (Bhat et al., 2020). It has been shown that while annotators tend to agree that "revised means better", disagreements can be caused by differences in common knowledge and intuitions (Anthonio and Roth, 2020). As specific phenomena, previous work studied implicit references and lexical vagueness (Anthonio and Roth, 2021; Debnath and Roth, 2021). However, none of the aforementioned studies accounted for audience-specific differences. This work takes a first step to close this gap.

Subtle Biases in Datasets
Diagnosing the presence of biases in data is one of the crucial steps in diminishing the spread of harmful stereotypes. This work contributes to the research on subtle biases, i.e., textual patterns that implicitly reflect societal power asymmetries. Such biases are embedded in specific linguistic phenomena (e.g., masculine generics; Swim et al., 2004) or in inequalities in how people from different demographic groups are represented (e.g., emphasizing the romantic relationships in the biographies of women; Wagner et al., 2015). Moreover, they can be frequent even in domains where blatant stereotypes and openly expressed beliefs about social hierarchies are generally considered inappropriate (Cervone et al., 2021). For example, there is a long line of work analyzing subtle stereotypes in Wikipedia (Callahan and Herring, 2011; Reagle and Rhue, 2011; Konieczny and Klein, 2018; Schmahl et al., 2020, among others), where the lack of diversity represents an issue already at the level of the editors' community (Lam et al., 2011). Beyond notability for representation itself, linguistic aspects in Wikipedia show a remarkable disparity concerning biographies of men and women, both in terms of topics and polarity of abstract terminology (Wagner et al., 2016). Such inequalities do not pertain only to biographies but find systemic correspondence in all domains and across languages (Falenska and Çetinoglu, 2021).
To the best of our knowledge, the presence of subtle stereotypes in wikiHow has not yet been investigated. However, the guides from this platform are a valuable entry point for studying bias, as they are produced by a community of contributors and by experts² suggesting how to perform activities. In other words, given the different purposes of the platforms, while Wikipedia data is rather descriptive, wikiHow data features instructional texts that potentially differ depending on the audience.

Different Audiences
Readers bring a priori goals to a written text that affect how they understand it (Fum et al., 1986). However, the goals and knowledge of different (groups of) people may vary. An example of work that considers different readers' expertise concerns title generation (Senda and Shinohara, 2002). In that work, less expert readers were found to be tentatively more influenced by effective titles. Consequently, a system for revising titles that accounts for the readers' expertise has been proposed (Senda et al., 2004). As such, that contribution indicates the importance of considering the target audience for efficient communication. Additionally, different audiences understand technical terminology (Senda et al., 2006; Elhadad and Sutaria, 2007) and causation (Siddharthan and Katsos, 2010) to different extents. Previous contributions also accounted for different target groups in the controllable text generation tasks of paraphrasing (Kajiwara et al., 2013), text simplification (Scarton and Specia, 2018; Sheang and Saggion, 2021), machine translation (Agrawal and Carpuat, 2019), and dictionary example generation (He and Yiu, 2022).

Corpus Construction
As introduced in §2.1, wikiHowToImprove is a well-established data set derived from wikiHow and consisting of more than 246,000 how-to guides. In general, each guide consists of multiple revisions of an article, a fixed goal that is named in the title, and (optionally) an indicator that follows the title in parentheses (cf. Table 1). As we are interested in how-to guides for different target groups, we filter the data for indicators that specify a group of people as targets, which we also refer to as the audience. Table 2 lists the 20 most frequent indicators extracted from wikiHowToImprove.
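The indicator extraction described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used to build the corpus: the regular expression, the function names `split_title` and `count_indicators`, and the assumption that an indicator is a single trailing parenthesized phrase are all ours.

```python
import re
from collections import Counter

# Assumed format: "<goal> (<indicator>)", with the indicator at the very end.
INDICATOR_RE = re.compile(r"^(?P<goal>.+?)\s*\((?P<indicator>[^()]+)\)\s*$")


def split_title(title):
    """Return (goal, indicator); indicator is None if there are no trailing parentheses."""
    match = INDICATOR_RE.match(title)
    if match is None:
        return title.strip(), None
    return match.group("goal"), match.group("indicator")


def count_indicators(titles):
    """Count how often each indicator occurs in a collection of titles."""
    counts = Counter()
    for title in titles:
        _, indicator = split_title(title)
        if indicator is not None:
            counts[indicator] += 1
    return counts


# Toy input; the first two titles are the pair shown in Table 1.
titles = [
    "Act Like a Kid Again (Girls)",
    "Act Like a Kid Again (Boys)",
    "Make Money",
]
print(count_indicators(titles).most_common(20))
```

Deciding which of the extracted indicators actually name a group of people is a manual step in our setup and is not covered by this sketch.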
Based on a manual grouping of these indicators, we find that 15 out of 20 indicators refer to attributes of performative gender and age (the remaining five are underlined in Table 2). Apart from their high frequency, both of these attributes are of interest to studies in the social sciences, in which they are often used as independent variables (Cortina et al., 2013; Cha and Weeden, 2014; Palència et al., 2014). Following a traditional binary setup, we distinguish two audiences based on gender, women (W) and men (M), and two audiences based on age, kids (K) and teens (T).³ For each type of audience, we create a set of all indicators used and collect all corresponding guides by extracting the latest article versions from wikiHowToImprove. Statistics of our corpus with audience-specific how-to guides are provided in Table 3. We note that there is a much higher number of indicators and articles for W than for M. In comparison, the number of articles and indicators for K and T are similar. With only 2,112 how-to guides in total, the corpus seems relatively small. However, the average length of articles ranges from 352 to 682 words, which adds up to a corpus size of more than one million words. Throughout this work, we refer to this dataset as wikiHowAudiences.⁴ Next, we approach it in its entirety with two case studies.
³ Note that while the selected audiences follow discrete

We start our investigation by analyzing how often guides exist in multiple versions for different audiences in wikiHowAudiences, which topics they cover, and what differs between versions for specific target groups.

Guides Selection
First, we identify titles that occur more than once in wikiHowAudiences: 32 unique titles for W-M and 15 for K-T. Next, we group guides with the same title but different target audiences into pairs. A complete list of article titles in this subset can be found in Appendix A.1.

Guides Analysis
To understand which goals require audience-specific adaptations, we analyze the topics and articles of the filtered guides.
Topics. We start by manually investigating titles of the filtered pairs of guides. For this purpose, we assign each of them to one of three content-related categories. The categories were designed to cover all the titles while being as concrete as possible. An overview of all the categories and their examples is listed in Table 4. We find that W-M instructions cover a relatively wide range of topics, from body-related activities (BODY), over interacting with other people (INTERACT), to self-presentation (PRESENT), which is the most frequent category. In contrast, among titles in K-T, we notice one clear pattern: all topics focus on issues that require different steps depending on the age of the target. Among them, we distinguish and report in ascending order of frequency articles about learning how to do activities for grown-ups or concerning the urge to grow old (GROWN-UP), advice related to the life of young people (ADVICE), and activities about oneself or the relation of oneself to others (ACTIVITY).
Length. Next, we check whether there are significant differences in terms of how detailed the instructions are for different target groups. We quantify this by simply measuring the length per article in words and sentences. We notice a considerable difference between K and T: the median length of articles for K is only 30 sentences and 346 words, while articles for T contain 98 sentences and 1081 words. In the case of W and M, we do not find such large differences in terms of average word (785 vs. 856) and sentence counts (59 vs. 62). Overall, the numbers reflect the patterns shown in Table 3 for the whole wikiHowAudiences data.
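As a rough illustration of the length measurement, the following sketch counts words and sentences per article and reports medians per group. The whitespace tokenization and the punctuation-based sentence splitter are simplifying assumptions on our side; the function names are hypothetical, and a proper sentence segmenter may yield different counts.

```python
import re
from statistics import median


def count_words(text):
    """Number of whitespace-separated tokens."""
    return len(text.split())


def count_sentences(text):
    """Naive count: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def median_lengths(articles):
    """Return (median word count, median sentence count) for a list of article texts."""
    words = [count_words(a) for a in articles]
    sentences = [count_sentences(a) for a in articles]
    return median(words), median(sentences)


# Toy usage with two very short "articles".
print(median_lengths(["Make eye contact. Smile!", "Babysit for friends and family."]))
```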
Content. Finally, we switch our attention to the actual content of the articles. As a simple measure of how similar two guides are, we consider their word overlap in both directions using BLEU score (Papineni et al., 2002).
Table 5 presents the articles with the lowest and highest word overlap in both analyzed groups. Interestingly, in the case of W-M, both articles cover concepts related to BODY, namely clearing skin and recognizing an infection. Manual inspection of their content reveals that even in the case of the least overlapping articles, "Get Clear Skin", slight differences can be noticed: the W article includes more specific information as well as different usage of punctuation. In the case of the most overlapping articles, "Recognize Chlamydia Symptoms", the main difference comes from the vocabulary related to different body parts from body types. The high word overlap of these two versions is likely related to their introductions, which provide an interchangeable overview of the topic.
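To make the idea of overlap "in both directions" concrete, the sketch below scores each guide once as candidate and once as reference. It uses a simplified, unsmoothed, single-reference BLEU on lowercased whitespace tokens that we wrote for illustration; the exact BLEU configuration behind our reported numbers (tokenization, smoothing, n-gram order) is not specified here, so absolute values from this sketch should not be compared with Table 5.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of `candidate` against a single `reference` text."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty for candidates that are not longer than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


def bidirectional_overlap(text_a, text_b):
    """BLEU(a against b) and BLEU(b against a); the two values generally differ."""
    return bleu(text_a, text_b), bleu(text_b, text_a)


print(bidirectional_overlap("a b c d", "a b c d e f"))
```

Scoring in both directions matters because BLEU is asymmetric: a short guide fully contained in a longer one has perfect n-gram precision in one direction but not in the other.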
In the case of K-T, the least and most overlapping articles come from two different categories: ACTIVITY and GROWN-UP. The least overlapping pair, "Flirt", is a case of two instructions that treat the same goal with different levels of complexity. For example, the matter of eye contact is described with one step in K and more than ten in T. The most overlapping articles, "Make Money", can be an example of a content stalemate: for both target audiences, babysitting is the first suggested activity to achieve the profit goal. However, it is possible to notice differences in how this concept is contextualized for two groups: either in a list of activities or discussed with its implications and advantages.

K, "Make Money": There are the traditional jobs like babysitting, shoveling snow, and doing chores around the house.
T, "Flirt": Make eye contact. Body language is a big part of flirting, and a big part of that is eye contact. Eye contact conveys intimacy (...)
T, "Make Money": Babysit for friends and family. One of the best ways for teenagers to make money and help out in the community is babysitting.

Summary
We exemplified three characteristics that can distinguish guides written for different audiences. First, the instructions written for K-T significantly differed in length. Next, we saw pairs of guides that varied in style (such as punctuation) and content (e.g., vocabulary in BODY articles). Some of the presented examples suggest that considering only simple content features could be enough to distinguish articles written for different audiences. However, such an approach could be insufficient in more complex cases, such as pairs of guides with high word overlap (see "Make Money"). We discuss these articles again in our generalization study (§6).

Case Study: "How To Be" Guides
In the previous section, we looked at how-to guides that occur in different versions for specific audiences. Such guides might concern particular goals that require being addressed in distinct ways. In this section, in contrast, we broaden the scope of analysis to explore other cases of differences in audience-specific instructions.

Guides Selection
The initial example from the introduction (see Table 1) explains how to act like somebody the reader is presumably not. Inspired by this example, we investigate what other guides instruct their readers "how to be". Concretely, we filter titles starting with the word 'be', which gives us 118 guides for W, 20 for M, 32 for K, and 30 for T.

Guides Analysis
To understand which topics the "how to be" guides cover, we group them according to the first word that occurs after 'be' (henceforth the completion).⁵ Table 6 shows the most frequent completions for each target group and respective example titles.
Regarding K-T guides, we notice no clear pattern that would distinguish instructions based only on their titles. There is roughly the same number of how-to articles for K and T (32 vs. 30). Moreover, among the most frequent completions we commonly find the word 'good', followed by words such as 'comfortable', 'less', or 'safe'.
In contrast, we find substantial differences for W-M. Specifically, we note that "how to be" guides are more common for W (12% of all articles for this target group) and for both audiences we find differing frequencies of completions: While W articles focus on being 'cute' and 'popular' (9 guides), M articles put more emphasis on being 'cool' and 'more' (6 guides). Even though all the how-to guides refer to similar contexts (mostly related to school), we do not find mutual correspondence: there are no instructions for how to "be cool at school" for W and no guide for how to "be cute at school" for M.
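The title filter and the grouping by completion can be sketched as below. The helper names are hypothetical, and the sketch takes the completion to be literally the next whitespace-separated word after 'be'; any additional normalization applied in our actual grouping is not reproduced here. The example titles are toy inputs.

```python
from collections import Counter


def completion(title):
    """Return the first word after a leading 'be', or None if the title does not start with 'be'."""
    words = title.lower().split()
    if len(words) < 2 or words[0] != "be":
        return None
    return words[1]


def completion_counts(titles):
    """Frequency of completions over all titles that start with 'be'."""
    return Counter(c for c in map(completion, titles) if c is not None)


# Toy titles for illustration only.
toy_titles = ["Be Cute at School", "Be Cool at School", "Be Cute", "Act Like a Kid Again"]
print(completion_counts(toy_titles).most_common())
```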

Summary
In this section, we looked at a particular subset of wikiHowAudiences, namely guides with titles starting with the word 'be'. We found that, in the case of W-M targets, the differences in instructions occur already at the level of goals that these guides describe. In other words, we saw examples of instructions where the information for which audience they were intended could be deduced strictly from their titles.

Generalization Study: Computational Approach
Our case studies show that, depending on the audience, there exist examples of articles that differ in terms of topic, length, style, and/or vocabulary.
However, an open question is whether these are only individual cases or if such differences occur systematically. In this study, we investigate this question computationally and attempt to verify our observations on the basis of a larger dataset. For this purpose, we implement tentative characteristics in the form of features and models (§6.1), evaluate in a setting with our full sub-corpora (§6.2), discuss quantitative results (§6.3), and analyze qualitative findings (§6.4).

Models
Based on the findings from the two case studies, we define majority and length-based baselines and several simple logistic regression classifiers with different sets of features.
Baselines. We use a simple majority baseline that always assigns the most frequent class. We also implement two length-based baseline models that use the number of words in a title (or article) as the only feature for classification.
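For intuition, the two kinds of baselines can be sketched as follows. Note the substitution: to keep the example dependency-free, the length baseline here is a one-feature threshold rule chosen on the training data, which stands in for, and is not identical to, a classifier trained on the length feature. All function names are hypothetical.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _text: most_common


def fit_length_threshold(train_texts, train_labels, short_label, long_label):
    """Pick the word-count threshold that maximizes accuracy on the training data."""
    lengths = [len(t.split()) for t in train_texts]
    best_threshold, best_correct = 0, -1
    for threshold in sorted(set(lengths)):
        correct = sum(
            (long_label if n > threshold else short_label) == y
            for n, y in zip(lengths, train_labels)
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    # Texts longer than the threshold get long_label, all others short_label.
    return lambda text: long_label if len(text.split()) > best_threshold else short_label


# Toy data: shorter texts labeled K, longer ones labeled T.
texts = ["a b", "a b c", "a b c d e f", "a b c d e f g"]
labels = ["K", "K", "T", "T"]
predict = fit_length_threshold(texts, labels, short_label="K", long_label="T")
print(predict("one two three four five"), majority_baseline(["W", "W", "M"])("any text"))
```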
Content (title/article). The words and phrases used in a text can be potential indicators of its target group. Thus, we make use of the most common⁶ uni-grams and bi-grams, excluding stop words, as a feature representation for the content of a how-to guide. We evaluate two variants: features derived from the articles and from the titles.
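A minimal sketch of such a content representation is given below. The stop-word list, the tokenizer, and the choice to form bi-grams after stop-word removal are illustrative assumptions, as are the function names; `k` corresponds to the number of most common n-grams kept.

```python
from collections import Counter

# Tiny illustrative stop-word list; a realistic list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "is", "your", "you"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [t.strip(".,!?;:()\"'") for t in text.lower().split()]


def extract_ngrams(text):
    """Uni-grams and bi-grams of a text after removing stop words."""
    tokens = [t for t in tokenize(text) if t and t not in STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def build_vocabulary(train_texts, k=200):
    """The k most common n-grams over the training texts."""
    counts = Counter()
    for text in train_texts:
        counts.update(extract_ngrams(text))
    return [gram for gram, _ in counts.most_common(k)]


def vectorize(text, vocabulary):
    """Count vector of a text over a fixed n-gram vocabulary."""
    counts = Counter(extract_ngrams(text))
    return [counts[gram] for gram in vocabulary]


vocab = build_vocabulary(["Make eye contact.", "Eye contact conveys intimacy"], k=5)
print(vocab, vectorize("Make eye contact", vocab))
```

The same functions apply to titles and to full articles; only the input text changes.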
Style (article). We represent style using two sets of established features from authorship attribution (Sari et al., 2018), namely lexical style: average word length, number of short words, vocabulary richness in terms of hapax-legomena and dislegomena, % of digits, % of upper case letters; and syntactical style: occurrences of punctuation, frequencies of POS tags, and stop-word frequencies.
Combined (article). Content and style can potentially provide complementary information. We test whether a model can leverage a combination of information from different sources. For this purpose, we simply concatenate the article-level features for content, style, and length.
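The lexical part of the style features can be approximated with a few lines of code. The sketch below is an assumption-laden illustration: the cut-off for "short" words, the normalization of the vocabulary-richness counts by the number of words, and the feature names are our choices for this example. The syntactic features (POS tags, punctuation, stop words) require a tagger and are left out.

```python
from collections import Counter


def lexical_style_features(text, short_word_max_len=3):
    """Lexical style features of a text; short_word_max_len is an assumed cut-off."""
    words = text.split()
    n_words = len(words) or 1  # avoid division by zero on empty input
    n_chars = len(text) or 1
    freq = Counter(w.lower() for w in words)
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "short_words": sum(len(w) <= short_word_max_len for w in words),
        # Share of word types that occur exactly once / exactly twice.
        "hapax_legomena": sum(1 for c in freq.values() if c == 1) / n_words,
        "dis_legomena": sum(1 for c in freq.values() if c == 2) / n_words,
        "pct_digits": 100 * sum(ch.isdigit() for ch in text) / n_chars,
        "pct_upper": 100 * sum(ch.isupper() for ch in text) / n_chars,
    }


features = lexical_style_features("Eat well and eat 2 apples")
print(features)
```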

RoBERTa (article). As an alternative to manually selected features, we further test features derived from a large language model, RoBERTa (Liu et al., 2019). Specifically, we encode the article's text, truncated to the first 512 tokens, and extract the representation of the special classification token from the last hidden layer as a set of feature values.

Experimental Setup
In order to find out whether and to what extent articles for different target groups can be distinguished computationally, we define two classification tasks in which specific articles, based on their characteristics, are to be assigned to one target group each. We distinguish between articles for women and men (W-M) and between articles for kids and teenagers (K-T). For all four classes, we use the full wikiHowAudiences, which we divide into TRAIN, DEV, and TEST sets following the article-level partition of the original wikiHowToImprove corpus (Anthonio et al., 2020). Statistics for each class and set are shown in Table 7. For the style features, the texts are lemmatized with spaCy.⁷ We train each model on the TRAIN set and evaluate in terms of macro F1-score on the TEST set. We compute the F1-score per class as the harmonic mean of precision (the share of predictions for a class that are correct) and recall (the share of instances of a class that are classified correctly). As our data is imbalanced, we use macro F1 instead of a weighted/micro score to treat each class (rather than each instance) as equally important.
A number of hyperparameters are optimized on the DEV set: We try different values for the logistic regression classifiers' L1 and C terms, taken from 10 values between 1e-5 and 100. For the content features, we optimize the number k of most common n-grams (k = 200). We also made use of the DEV set to determine the best language model for our tasks, which we found to be roberta-large (results of other models are shown in Appendix A.2).⁸
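The evaluation measure can be written out in a few lines. The sketch below computes per-class F1 from true positives, false positives, and false negatives, and then averages over classes without weighting; the function names are ours. The toy example shows why macro F1 is informative on imbalanced data: a predictor that always outputs the majority class scores well on that class but receives zero for the other.

```python
def f1_per_class(gold, predicted, label):
    """F1 for one class: harmonic mean of its precision and recall."""
    tp = sum(g == label and p == label for g, p in zip(gold, predicted))
    fp = sum(g != label and p == label for g, p in zip(gold, predicted))
    fn = sum(g == label and p != label for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all classes that occur in the gold labels."""
    labels = sorted(set(gold))
    return sum(f1_per_class(gold, predicted, label) for label in labels) / len(labels)


# Toy example: three W instances, one M instance, and an always-W predictor.
gold = ["W", "W", "W", "M"]
always_w = ["W", "W", "W", "W"]
print(macro_f1(gold, always_w))
```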

Results
The results are summarized in Table 8. As conjectured based on the K-T articles from the first case study, we find that the length-based baselines indeed outperform the majority baseline⁹ in that setting. As the further results show, content and stylistic features can indeed be used to correctly assign a specified target group to many how-to guides. According to the evaluation scores, features calculated at the article level are particularly suitable for this purpose: The combined model, which uses content, style and length features on the article level, achieves the best result with macro-F1 scores of 0.71 and 0.78 for W-M and K-T, respectively. Features generated based on the roberta-large language model achieve competitive scores (0.68 and 0.74), but fall short of the combined model.
The large differences in results between the baselines and our models show that the target audience of many articles can be determined simply from the vocabulary and style of an article. Next, we take a closer look at model features and errors.

Analyses
For our analyses, we focus on the combined model because it achieves the best results and its features are easily interpretable.
Features. For each target group, we analyze what features are most important to the model. Since our model uses independent features in a binary classification task, we can simply check the highest positive and negative feature weights for this purpose. A selection from the ten most predictive features¹⁰ and example sentences are shown in Table 9. As the examples illustrate, some of the strongest features are, again, based on stereotypes (e.g., 'cute', 'makeup' for W) or reflect heteronormative assumptions ('hers' for M). Interestingly, we also see characteristics of gender-inclusive language ('theirs' for M) and direct address of the reader in terms of their group membership ('kid' for K and 'teen' for T). We further find negations (e.g., 'wasn't') as part of strong features for W, which is particularly worrying in light of sociopsychological findings that have shown negations to serve a stereotype-maintaining function across languages (Beukeboom et al., 2010, 2020).
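Reading off the most predictive features from a binary linear model amounts to sorting its weights. In the sketch below, the weight values are invented for illustration and the assignment of W to the positive class is an assumption; only the feature names are taken from the discussion above. The function name `top_features` is hypothetical.

```python
def top_features(weights, n=10):
    """Return the n strongest features for the positive and for the negative class.

    `weights` maps feature names to the weights of a binary linear model.
    """
    ranked = sorted(weights.items(), key=lambda item: item[1])
    negative = [name for name, w in ranked[:n] if w < 0]
    positive = [name for name, w in reversed(ranked[-n:]) if w > 0]
    return positive, negative


# Invented weights for a W-M model with W assumed to be the positive class.
toy_weights = {"cute": 1.2, "makeup": 0.9, "hers": -1.1, "theirs": -0.7, "water": 0.0}
for_w, for_m = top_features(toy_weights, n=2)
print(for_w, for_m)
```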

Same title articles. As examples of particularly hard cases, we return to the how-to guides from the first case study, which consisted of article versions for different audiences (§4). Following the data partition from previous work, we identify 16 such articles in the DEV and TEST splits. We find that the combined model classifies 12 of them correctly (75%). In the remaining 4 cases, the prediction errors could have been caused by superficial features that are predictive for the opposite audience. We note for each of these 16 articles that the version for the opposite audience is part of the TRAIN split. Therefore, the topics of the guides are generally not specific to one audience, and a correct classification of the majority of cases demonstrates that the model indeed captures characteristics of content and style that seem specific to the audience itself.

Discussion and Conclusion
In this paper, we assessed differences across how-to guides written for specific audiences. In the construction of sub-corpora for four target groups, we already noticed inequalities on the level of who is being instructed in wikiHow: as a target audience, women are mentioned more than four times more frequently than men, and teens receive about 50% more instructions per article than kids. In two case studies, we investigated and provided examples of target-related differences on the levels of topic, style, and content. The differences observed in our case studies inspired feature sets of shallow classifiers for predicting the target audience of a given guide. Using these classifiers, we showed that it is, in many cases, indeed possible to automatically predict for which audience an article was written. In an analysis of our results, we found that this success is not merely based on different topics covered for each target group but that the articles for each group systematically differ in terms of content and style.
Each of the aforementioned observations presents a tiny, seemingly insignificant piece of a puzzle. But taken together, these pieces reveal a surprisingly clear picture: there are noticeable differences in what topics are covered for each target group, how many articles and instructions are provided for each audience, and how these articles are written. Even though the audience-specific characteristics used in our studies are by no means exhaustive, our straightforward approach allowed us to identify, qualitatively and quantitatively, debatable differences in how wikiHow guides present particular topics to specific target groups. While there is an inevitable need for differences in vocabulary when speaking about physical features or body parts, it is at best unclear in which ways how-to guides about human interactions or self-presentation should exhibit significant differences. Some of the observed differences have already been critically discussed in the context of social science research. For example, it is well-known that labels such as 'cute' are used pejoratively as a form of social control (Talbot, 2019) and that prescriptive components of gender stereotypes in education contribute to discrimination (Kollmayer et al., 2018). However, exposing readers to cultural messages and beliefs about age, gender or other factors cannot be avoided entirely, especially on a collaboratively edited online platform. In fact, it seems to be a challenge for any pluralistic society to find a balance between communicating traditional values and empowering everyone. It is therefore all the more important for a comprehensive understanding to determine when and in what form social norms are conveyed. As such, we view the contributions of this paper, namely our data set of audience-specific guides, wikiHowAudiences, and our mixed-methods approach for identifying and verifying differences, as a valuable connecting point to raise awareness of potential issues and to foster interdisciplinary dialogue for future research.

Limitations
Our studies focus on the differences in how-to guides written for specific audiences only in one language, namely English. A major limitation is therefore that we do not consider other languages.
The perspectives provided by the data source we rely on, wikiHow, allow us to identify specific phenomena and peculiarities. Yet, contemplating only one data source lets us generalize only to a limited extent. For example, the audiences considered in this work depended on the target groups portrayed in the data. They are neither exhaustive nor representative of the diversity of humankind, especially of marginalized social groups. Therefore, a wider variety of data sources will be needed to test generalizations.
Finally, a further limitation of our studies concerns intersectionality. While it seems possible that guides can be tuned by contemplating one specific attribute of the audience at a time, this does not hold with regard to the actual attributes of the readers. Such attributes are per se coexistent, and consequently, they are not separable.

Ethics Statement
We acknowledge that the content that emerged from the data is narrow in terms of cultural perspectives, mainly addressing western cultures. Moreover, the analysis of the audiences is not exhaustive of the diversity of humankind; in particular, it does not exhaustively account for queer identities, especially trans and non-binary identities. With the present research, we do not intend to reinforce representational biases, but rather to highlight them.

Table 2: Counts of the 20 most frequent indicators.

Table 3: The distribution of the indicators and of the articles for the target audience groups. Sentences and words are indicated via their median values by article.

Table 4: Frequencies and examples of topical categories.

Table 5: Excerpts from the article pairs with the lowest (left) and highest (right) word overlap.

Table 6: The most frequent target-specific completions of "how to be" guides and examples of respective titles.

Table 7: Number of articles for each target group and data split, as well as for each task in total.

Table 8: Macro F1-scores on the test sets.

Table 9: Sample of the top-10 most predictive features and example sentences from articles of each target group.

Table 13: The confusion matrix for the dev set (left) and the confusion matrix for the test set (right).

Table 14: The confusion matrix for the dev set (left) and the confusion matrix for the test set (right).