Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

Language models increasingly rely on massive web crawls for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and news often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles—written by students from across the country—we investigate whose language is preferred by the quality filter used for GPT-3. We find that newspapers from larger schools, located in wealthier, educated, and urban zones (ZIP codes) are more likely to be classified as high quality. We also show that this quality measurement is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language ideology, and more care is needed to construct training corpora for language models, with better transparency and justification for the inclusion or exclusion of various texts.


Introduction
The language models central to modern NLP are trained on large Internet corpora, typically gathered from community resources (e.g., Wikipedia; Liu et al. 2019) or web crawls (e.g., WebText, Common Crawl; Radford et al. 2019, Brown et al. 2020). The selection of texts impacts every research or deployed NLP system that builds on these models. Yet there is rarely any clear justification given for why various texts were included.
Web dumps like Common Crawl offer the promise of more diverse text than what is available in curated resources. However, much of the web consists of frequently replicated boilerplate (e.g., privacy policies), code (e.g., HTML and JavaScript), pornography, hate speech, and more. Automated approaches, typically referred to as quality filters, are often applied in an effort to remove this undesirable content from training data. These filters include code removers (Gao et al., 2020), heuristics (Rae et al., 2021), stopwords (Raffel et al., 2020), and classifiers (Brown et al., 2020; Wenzek et al., 2020).
Although quality filtering is often treated as a relatively neutral preprocessing step, it necessarily implies a value judgment: which data is assumed to be of sufficiently high quality to be included in the training corpus? More concretely, when a quality filter is a classifier trained on instances assumed to be of high (and low) quality, the selection of those examples will impact the language model and any downstream technology that uses it. Many filters use Wikipedia, books, and newswire to represent high quality text. But what texts are excluded as a result? Because natural language varies with social and demographic variables (Rickford, 1985; Eckert, 1989; Labov, 2006; Blodgett et al., 2016; Hovy and Yang, 2021; Lucy and Bamman, 2021, inter alia), we can also ask whose language will be excluded.
We begin with a summary of the handful of data sources used to construct training corpora for many language models and assumed to be of high quality (§2). The systematic authorship biases in these datasets motivate the study that follows, in which we replicate the quality filter from Brown et al. (2020). We apply this filter to a new dataset of U.S. high school newspapers, augmented with demographic data from the U.S. Census and the National Center for Education Statistics (§3). We demonstrate that the filter has strong topical and stylistic preferences, and favors text from authors who originate from regions with better educational attainment, urban centers, larger schools, and higher valued homes.
In sociolinguistics, the term language ideology refers to common (but often unspoken) presuppositions, beliefs, or reflections about language that justify its social use and structure (Craft et al., 2020).
Our analysis helps to characterize the language ideology encoded in the quality filter used by Brown et al. (2020), a representative of a wider set of filtering methods. We also observe in §4 that the filter is unaligned with other plausible notions of quality: factuality ratings for news sources, standardized test scores, and literary awards. Of course, these institutions entail their own language ideologies. We argue that when constructing a corpus, one cannot avoid adopting some language ideology; appropriate choices will depend on the goals of the work, and one language ideology may conflict with another. In short, there is no truly general-purpose corpus.
Our code and analysis are publicly available.

Motivation: Data Sources
Across the many language models recently reported in the literature, the same small group of datasets has been routinely used as training corpora: Wikipedia, collections of books, and popular online articles (§A.1). These data are often treated as exemplars of high quality text (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020). Although these datasets include text from many sources, extensive research suggests that the voices they represent are drawn from a relatively small, biased sample of the population, over-representing authors from hegemonic social positions.
Wikipedia. Wikipedia serves as a backbone for language models because of its scale, ease of use, permissive license, and goal of providing comprehensive coverage of human knowledge. However, although anyone can edit Wikipedia content, not everyone does. In practice, there are significant biases in Wikipedia authorship, content, and perspectives. For instance, despite efforts by Wikimedia, the site has been unable to resolve a persistent gender imbalance among its editors (Huang, 2013; Meta-wiki, 2018). This imbalance is reflected in who gets written about, and how (Bamman and Smith, 2014; Graells-Garrido et al., 2015; Wagner et al., 2015). There is also a pervasive urban bias; editors are less likely to come from rural areas, and coverage of these areas in Wikipedia tends to be
Books. Language models are also frequently trained on book corpora. BERT (Devlin et al., 2019) used the Toronto BookCorpus (Zhu et al., 2015), which consists of 7,185 self-published novels, a dataset criticized for copyright violation, imbalanced representation, and lack of documentation (Bandy and Vincent, 2021). GPT-3 (Brown et al., 2020) and The Pile (Gao et al., 2020) both use much larger corpora of books (although the former do not identify the source of this data). However, the Pile's books (also called Books3) are not a random selection. Rather, they appear to be drawn from a torrent file containing hundreds of thousands of copyrighted eBooks.
Books3 deserves a more thorough investigation, but preliminary analyses reveal that the most prevalent authors in the corpus are prolific American and British writers, especially of romance, mystery, and children's books (e.g., Danielle Steel). This pattern should be considered against the backdrop of the American book publishing industry, which has been widely criticized as homogeneous (Lee & Low Books, 2020).
News and Other Popular Internet Content. Radford et al. (2019) scrape text from the websites featured in popular Reddit submissions (i.e., those that received at least three upvotes) to construct the training data for GPT-2. As the original corpus is unavailable, we analyze its open-source replica, OpenWebText (Gokaslan and Cohen, 2019).
We do not expect the corpus to represent a wide range of language variation; Reddit users are mostly male, younger, and liberal-leaning, which influences the types of content shared and upvoted on the platform (Barthel et al., 2016). Indeed, we find that 1% of the 311K unique top-level domains in OpenWebText contribute 75% of documents in the corpus (§A.2). The most common websites in OpenWebText are internationally circulating British and American news outlets (e.g., BBC, New York Times), blogging platforms (e.g., Tumblr, Blogspot), sports content (e.g., ESPN, SB Nation), and tech news (e.g., TechCrunch, Wired). As expected, these links tend to appear on the most highly trafficked subreddits (e.g., /r/politics, /r/worldnews, /r/news).
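The concentration statistic reported above (the share of documents contributed by the top 1% of domains) can be illustrated with a minimal sketch. It assumes each document is identified by its source URL and uses the URL host as a stand-in for the domain; the function name `top_share` is hypothetical and is not taken from the released analysis code.

```python
from collections import Counter
from urllib.parse import urlparse


def top_share(urls, top_fraction=0.01):
    """Share of documents that come from the most frequent `top_fraction` of hosts."""
    # Count documents per host (assumption: host approximates "domain").
    counts = Counter(urlparse(url).netloc.lower() for url in urls)
    ranked = sorted(counts.values(), reverse=True)
    # Always keep at least one host so tiny inputs still work.
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)
```

Applied to a list with one URL per document, a value near 0.75 at `top_fraction=0.01` would correspond to the pattern described for OpenWebText.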
These news sources are likely dominated by formal writing styles from a relatively homogeneous set of authors (Arana, 2018; Grieco, 2018). The adherence to slowly evolving style guides expresses specific linguistic standards (Froke et al., 2020) and even geopolitical interests (Vultee, 2012), which encourage rules on language use that can reinforce gender norms and racial hierarchies (DiNicola, 1994; Bien-Aimé, 2016). Researchers find a striking lack of diversity in newsrooms and newspaper leadership. This may be compounded by the economic hardships aspiring journalists must incur, which act as a filter for who can afford to be employed in the news industry.
Summary. These descriptive findings suggest that a disproportionate amount of text in the core data sources of existing language models is written by authors from select, relatively powerful social positions. Such text sources appear to favor privileged segments of the English-speaking population, including men; white populations; communities of higher socio-economic status; and people harboring American and Western European historical, geopolitical, and cultural perspectives. The resulting corpora tend to be less inclusive of the voices of women and members of marginalized groups. A likely implication may be that alternative perspectives, including those of people from rural areas; non-dominant gender, sexual, or racial identities; and counter-hegemonic vantage points, may be less likely to be included, and thus less likely to influence models trained on this data.
Although formal, streamlined content like news or Wikipedia articles may seem like desirable sources for high quality content, not all writing styles or substantive topics that might be relevant to language technologies and their user communities are represented in the resulting corpora. When deployed, however, many of the technologies using language models trained on these mainstream data will face language that, despite being less formal, professional, or carefully edited, is no less high quality and is essential to the communicative lives of the people who use it.

Measuring the Language Ideology of the GPT-3 Quality Filter
Empirically evaluating the full distribution of authors in the data sources from §2 is difficult, due to their size and the lack of metadata about each document's authors. We instead curate a new dataset of U.S. high school newspaper articles that varies both topically and along demographic variables that can be resolved using ZIP codes. Although we do not directly consider individual authors of these articles, this dataset is useful, in that it can be associated with extensive metadata at the level of individual newspapers. We then analyze the behavior of a (replicated) quality filter on text from this dataset and discuss its implications.

U.S. SCHOOL NEWS
Background. Many U.S. schools produce a newspaper to give students journalism experience, to report on local news, to comment on national or global events, and to publish school-related material (e.g., announcements, campus life, student interviews, sports or honor rolls; Gibson, 1961). Because a school's access to resources is shaped by local income levels (Betts et al., 2000) and tied to student achievement (Greenwald et al., 1996), we expect schools in wealthier areas (relative to poorer areas) to produce newspaper content that is more similar to the formal, professional texts that a quality filter is likely to classify as high quality.
Collection. We collect articles from English-language U.S. school newspapers that used a common WordPress template. After identifying 2483 schools that use this template, we scrape 1.95M articles from their respective newspaper sites (more details in §A.3). We retrieve article categories by extracting them from the article URL slugs. We then match each school to its population zone (ZIP code) using the Google Maps Place API. We restrict our dataset to articles from U.S. high schools.
We only consider articles from 2010-2019, remove pages under the video, photo, or multimedia categories, and remove schools that have fewer than 100 articles (which tend to contain scraping errors).
The final corpus includes 910K articles, from 1410 schools, located in 1329 ZIP codes (552 U.S. counties) dispersed across all U.S. states (and the District of Columbia).
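The inclusion rules described above can be sketched as a small filtering function. This is an illustration under stated assumptions rather than the released pipeline: the record fields (`school`, `year`, `category`) and the function name are hypothetical, and the sketch assumes the 100-article threshold is applied after the date and category rules, an ordering the text does not specify.

```python
from collections import Counter

# Categories removed from the corpus, as listed in the text.
EXCLUDED_CATEGORIES = {"video", "photo", "multimedia"}


def filter_articles(articles, min_per_school=100):
    """Keep 2010-2019 text articles from schools with enough remaining articles."""
    kept = [
        a for a in articles
        if 2010 <= a["year"] <= 2019 and a["category"] not in EXCLUDED_CATEGORIES
    ]
    # Assumption: the per-school threshold counts articles surviving the rules above.
    per_school = Counter(a["school"] for a in kept)
    return [a for a in kept if per_school[a["school"]] >= min_per_school]
```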

The GPT-3 Quality Filter
To investigate how quality correlates with various attributes of a newspaper, we re-implement the Brown et al. (2020) quality filter based on the description provided in the paper. The filter is a binary logistic regression classifier using n-gram features, trained to distinguish between reference corpora (Books3, Wikipedia, and OpenWebText) and a random sample of Common Crawl. We replicate the filter as closely as possible using scikit-learn (Pedregosa et al., 2011) and release our replication along with a demo. To create the training data for the classifier, we sample 80M whitespace-separated tokens of OpenWebText, Wikipedia, and Books3 each for the positive class, and 240M whitespace-separated tokens of a September 2019 Common Crawl snapshot for the negative class. We perform a 100-trial random hyperparameter search, fixing only the hashing vectorizer and basic whitespace tokenization, following the implementation in Brown et al. (2020). Our final classifier gets 90.4% F1 (91.7% accuracy) on a held-out test set (§A.4). We then apply the quality filter to the U.S. SCHOOL NEWS data, computing a quality score per document, which we denote P(high quality).
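To make the structure of such a filter concrete, the following is a dependency-free toy sketch of a hashed n-gram logistic regression. It is not the released replication: the bucket count, the use of unigrams and bigrams, the plain SGD training loop, and the four toy documents are all assumptions chosen for illustration. In scikit-learn, `HashingVectorizer` and `LogisticRegression` play the corresponding roles.

```python
import math
import zlib

N_BUCKETS = 2 ** 16  # hypothetical size of the hashed feature space

# Toy stand-ins for the reference corpora (positive) and Common Crawl (negative).
POSITIVE = ["the committee published its annual report on regional trade",
            "the senate approved the budget after a lengthy debate"]
NEGATIVE = ["click here to buy cheap deals now",
            "free shipping click now to subscribe"]


def featurize(text):
    """Hash whitespace-separated unigrams and bigrams into L2-normalized counts."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = {}
    for gram in grams:
        bucket = zlib.crc32(gram.encode("utf-8")) % N_BUCKETS
        counts[bucket] = counts.get(bucket, 0) + 1
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {b: v / norm for b, v in counts.items()}


def score(features, weights, bias):
    """Logistic function of the linear score."""
    z = bias + sum(weights[b] * v for b, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(docs, labels, epochs=200, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * N_BUCKETS, 0.0
    for _ in range(epochs):
        for doc, y in zip(docs, labels):
            x = featurize(doc)
            gradient = score(x, weights, bias) - y
            bias -= lr * gradient
            for b, v in x.items():
                weights[b] -= lr * gradient * v
    return weights, bias


def p_high_quality(text, model):
    """Probability that `text` resembles the positive (reference) class."""
    weights, bias = model
    return score(featurize(text), weights, bias)
```

Training on `POSITIVE + NEGATIVE` with labels `[1, 1, 0, 0]` yields a model whose `p_high_quality` plays the role of P(high quality) in this toy setting; the score reflects resemblance to the reference text, not any independent notion of quality.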

Document-Level Analysis
We first explore document-level preferences of the filter. The GPT-3 quality filter is more likely to classify high school newspaper articles as low quality, compared to general newswire (§A.5). This is unsurprising, since the training data for the GPT-3 quality filter included texts by professional journalists. §A.6 shows a random sample of text from the dataset with high and low quality scores, illustrating differences in style and formality. More notably, controlling for article category (e.g., opinion pieces), we find that the GPT-3 quality filter has apparent topical and stylistic preferences. For topical features, we train a topic model (via latent Dirichlet allocation; Blei et al. 2003) over opinion pieces with 10 topics. We also consider whether documents contain first, second, or third person pronouns, and the length of the document. We then combine these features in a regression to assess the effect of certain attributes on the document quality score, while controlling for other attributes.
Table 1: We observe that political and sports-related topics, the lack of first and second person pronouns, and longer document lengths are associated with higher quality scores. We omit topic 0 (food, restaurant, eat) to avoid a saturated model. See §A.7 for quality scores per topic. *** p < 0.001.
The results of our regression are displayed in Table 1. We find that certain topics have quite large effect sizes (see §A.7 for the distribution of quality scores per topic). For example, documents entirely about former U.S. President Trump and the 2016 presidential election have quality scores 35 percentage points higher, on average, than the omitted topic about food, whereas documents about sports are 25 percentage points higher, relative to the omitted topic. Stylistically, the presence of first or second person pronouns in a document decreases quality score by 5 percentage points, while a doubling of the number of tokens in a document increases the quality score by 9 percentage points.
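The stylistic features in this regression can be sketched as follows. The pronoun lists, the tokenization pattern, and the feature names are assumptions made for illustration; the text does not specify them. Length enters as a base-2 logarithm so that a one-unit change corresponds to a doubling of tokens, matching how the length effect is reported.

```python
import math
import re

# Hypothetical pronoun inventories; the exact lists used are not given in the text.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
SECOND_PERSON = {"you", "your", "yours"}
THIRD_PERSON = {"he", "him", "his", "she", "her", "hers", "it", "its",
                "they", "them", "their", "theirs"}


def style_features(text):
    """Pronoun indicators and log2 document length for one document."""
    tokens = re.findall(r"[a-z']+", text.lower())
    vocabulary = set(tokens)
    return {
        "first_or_second_person": bool(vocabulary & (FIRST_PERSON | SECOND_PERSON)),
        "third_person": bool(vocabulary & THIRD_PERSON),
        # One unit of this feature corresponds to a doubling of length.
        "log2_tokens": math.log2(max(len(tokens), 1)),
    }
```

Under the reported coefficients, a document that is four times longer than another (two doublings) would be expected to score about 18 percentage points higher, other features held equal.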

Demographic Analysis
We also examine whether the GPT-3 quality filter prefers language from certain demographic groups over others. We first check raw correlations between average quality scores (per newspaper) and features of interest. As in §3.3, we then combine the features in a regression model.

Demographic Features
As discussed in §3.1, we expect a priori that content from schools located in wealthier, more educated, and urban areas of the U.S. will tend to have higher quality scores, relative to poorer, less educated, rural areas. Therefore, we consider demographic features that correspond to class, rural/urban divides, and school resources.
For each school, we retrieve 2017-2018 school-level demographic data from the National Center for Education Statistics (NCES). These include the number of students, student:teacher ratio, and indicators for private schools and specialized public schools (e.g., charter or magnet schools). We also retrieve the latest ZIP code- and county-level demographic data from the 2020 U.S. Census. To measure the wealth of the corresponding ZIP code, we use median home values, and for educational attainment we use the percentage of college-educated adults. We also use Census data on the percent of rural population by county. Finally, we consider local political leanings, operationalized by county-level Republican-party vote share in the 2016 Presidential election. We display full descriptions of features in our demographic analysis in §A.8.

Correlation Analysis
To inform the variables we include in our regressions, we explore correlations between variables of interest and the average quality score of a school newspaper. Our analyses in Figure 1 suggest that our initial hypotheses hold: schools in wealthier, urban, and more educated ZIP codes, as well as those in Democrat-leaning counties, tend to have higher quality scores.
Regression Analysis. Here, we use schools as the unit of analysis, and consider average quality score assigned to the school's articles as the dependent variable. We only include those schools that could be matched to the NCES database, dropping schools which are missing school size, as well as those located in ZIP codes with $1M or greater median home value, due to a census artifact. Missing values for other features are imputed with the median value of that feature for the corresponding ZIP code, or (if necessary) county or state.
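The fallback imputation described above can be sketched as follows. This is one reading of the procedure, not the released code: it assumes the median is taken over the other schools in the dataset that share the same ZIP code, then county, then state, and the field names (`zip`, `county`, `state`) and function name are hypothetical.

```python
from statistics import median


def impute(schools, feature, levels=("zip", "county", "state")):
    """Fill missing (None) values of `feature` from the narrowest area with data."""
    filled = []
    for school in schools:
        value = school[feature]
        if value is None:
            for level in levels:
                # Observed values from schools in the same area at this level.
                peers = [s[feature] for s in schools
                         if s[level] == school[level] and s[feature] is not None]
                if peers:
                    value = median(peers)
                    break
        # Copy the record so the input list is left unchanged.
        filled.append({**school, feature: value})
    return filled
```

A value stays missing only if no school in the same state has an observed value for that feature.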
For regressions, we log-transform school size, student:teacher ratio, and home values, using raw values for other features to preserve interpretability. Our regression dataset includes 968 high schools in 926 ZIP codes across 354 counties. We release this dataset publicly. Because many of the variables identified above are correlated, we use regression to estimate the effect of certain factors while controlling for others, with results shown in Table 2. Overall, home values, parental education, school size, public school status, and urban locations all show significant positive associations with quality scores. Thus, even controlling for financial resources, parental education, and other factors, articles from urban schools are scored as significantly higher quality than those from rural schools. Nevertheless, the effects, considered individually, are relatively modest. A 14 percentage point increase in percent urban population or a 17 percentage point increase in parental education (percent of adults with college degrees) corresponds to a 1 percentage point increase in average quality score, as does a doubling of home values, or a quadrupling of school size (holding other variables constant in each case). Average quality scores associated with public schools are 1.5 percentage points higher than private schools, controlling for other factors. Coefficients for charter schools, magnet schools, and student:teacher ratio are not significant. The combined effects of all these factors account for large differences in quality scores between wealthy, urban, educated locations, and poorer, rural, and less educated parts of the United States.
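To see how these individually modest effects combine, the following sketch turns the rounded effect sizes quoted above into a predicted gap between two hypothetical schools. It assumes the effects add linearly, as in the regression, and it uses only the rounded figures from the text rather than the fitted coefficients in Table 2; the function and parameter names are hypothetical.

```python
import math


def predicted_gap_pp(urban_diff_pp, college_diff_pp, home_value_ratio, size_ratio):
    """Predicted difference in average quality score, in percentage points."""
    return (
        urban_diff_pp / 14                # 14 pp more urban population -> +1 pp
        + college_diff_pp / 17            # 17 pp more college-educated adults -> +1 pp
        + math.log2(home_value_ratio)     # each doubling of home values -> +1 pp
        + math.log2(size_ratio) / 2       # each quadrupling of school size -> +1 pp
    )
```

For example, a school in a county that is 42 points more urban, with 34 points more college-educated adults, home values four times higher, and four times as many students would be predicted to score about 3 + 2 + 2 + 1 = 8 percentage points higher under these rounded figures.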
Summary and Limitations. This analysis reveals an unintended consequence of the GPT-3 quality filter: by attempting to exclude text that is less like mainstream news and Wikipedia, the filter reinforces a language ideology that text from authors of wealthy, urban, and educated backgrounds is more valuable for inclusion in language model training data. These implicit preferences align with the attributes of authors that dominate the corpora from §2, which the filter considers to be high quality.
While most of the above findings are robust to alternate model specifications, the model ultimately only accounts for a relatively small amount of variance in quality scores. However, given that all variation is ultimately explained by features of text itself, any amount of variance accounted for by demographic features is notable.

Figure 2: There is no difference in quality scores between articles written by news sources of high and low factual reliability. [Density plot comparing High Factuality News and Low Factuality News.]
In addition, most of our features are taken from a single point in time and do not account for changing demographics over the examined time period (2010-2019). Data errors could also arise due to how datasets were aligned (based on school name and ZIP code). These findings may not generalize to other domains (e.g., social media), and inclusion of additional features could affect these findings. For additional models which include party vote share and racial demographics taken from NCES data, see §A.9.

Alignment with Other Notions of Quality
The GPT-3 quality filter purports to judge the quality of text, something that people also do, using a variety of different criteria. In this section, we consider three forms of human evaluations: factuality judgements, human-graded standardized test essays, and institutional book awards. How well does the behavior of the GPT-3 quality filter map onto these notions of quality?

Data
Factually (Un)reliable News. To analyze the correspondence between the GPT-3 quality filter and news factuality, we use the list provided by Baly et al. (2018) to identify a set of popular news sources from a broad range of factuality ratings and political leanings. Using Newspaper3k, we scrape and score 9.9K and 7.7K articles from high and low factuality news outlets, respectively.
TOEFL Essay Exams. Next, to analyze the correspondence between the GPT-3 quality filter and essay scores, we collect and score 12.1K participant essays from the Test Of English as a Foreign Language (TOEFL) exam, a widely used English language proficiency test (Blanchard et al., 2013).
The TOEFL exam responses include official scores from exam readers, as well as each essay's prompt.
Award-Winning Literature. Finally, to analyze the correspondence between the GPT-3 quality filter and literary awards, we select and score books from Books3 and the Gutenberg corpus (Brooke et al., 2015) that have won a Pulitzer Prize in various categories. We collected these data by scraping the publicly available list of award recipients.

Results
If the filter aligns with news factuality, we would expect that articles from factually reliable sources would be rated as higher quality than those from factually unreliable ones. However, we find no difference in the quality distribution between articles from high and low factuality news sources (p = 0.085, two-way Kolmogorov-Smirnov test; Figure 2). Many factually unreliable news articles are considered high quality by the filter (§A.10).
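The comparison of the two score distributions can be illustrated with a minimal sketch of the Kolmogorov-Smirnov statistic. It assumes the two-sample form of the test (an interpretation of the wording above) and computes only the statistic, the largest gap between the two empirical distribution functions; the function name is hypothetical. A p-value would come from a statistics library, for example `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The supremum of the gap is attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
```

A statistic near 0 indicates that quality scores are distributed almost identically in the two groups, while a value near 1 indicates almost no overlap.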
Turning to the TOEFL exam responses, we would expect that if the filter agrees with essay scores, higher scoring essays would receive higher quality scores. While essay scores are weakly correlated with quality scores (Pearson r = 0.12, p < 0.001), the essay's prompt is far more predictive of the essay's quality designation (§A.11). For example, essays responding to a prompt (P4) which asks participants to describe "...whether advertisements make products seem much better than they really are" are much less likely to be filtered than all other prompts, including P6, which asks participants to describe "...whether it is best to travel in a group" (see §A.11 for more details). The latter prompt tends to invoke personal experiences in the responses.
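The two quantities contrasted here, the correlation between essay scores and quality scores and the variation in quality score across prompts, can be computed with a short sketch. The function names and the record layout (parallel lists of prompt labels and scores) are assumptions for illustration only.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def mean_by_prompt(prompts, quality_scores):
    """Average quality score for each essay prompt."""
    grouped = defaultdict(list)
    for prompt, quality in zip(prompts, quality_scores):
        grouped[prompt].append(quality)
    return {prompt: sum(values) / len(values) for prompt, values in grouped.items()}
```

Large differences between the per-prompt averages, alongside a small `pearson_r` between essay scores and quality scores, would reproduce the pattern described above.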
Finally, if the filter aligns with literary awards, we would expect that most Pulitzer Prize-winning books would achieve high quality scores. On the contrary, quality scores vary heavily based on the genre (Figure 3). Poetry and drama are less favored by the filter, relative to non-fiction, fiction, and fan fiction (from BookCorpus; Zhu et al. 2015).
Summary. Our analysis demonstrates that the GPT-3 quality filter conflicts with other standards of text quality. Of course, even the alternative standards we compare here are subject to their own language ideologies. Readers are more likely to trust news as factual if its political position aligns with their own (Mitchell et al., 2018). English-language teaching pedagogies are rooted in ideologies about well-spokenness (Vanegas et al., 2016). Literary awards favor white and male authors. In general, any designation of text as high quality is subjective and influenced by sociopolitical context.

Discussion
The above sections have demonstrated that automated filtering of text to build language modeling corpora may lead to counterintuitive or undesirable exclusion of sources. Because of the variety of use cases for language models and the broad range of text that could be appropriate for certain tasks, we suggest that there is no simple, universal standard for what should be considered high quality text. Indeed, there is a long history of privileging some people's spoken language as better or more "correct" than others. Researchers and practitioners of NLP who are aware of this history have the option to be intentional in their design of systems that, however implicitly, risk excluding the language of underprivileged identities or communities.
Some amount of selection in building corpora is inevitable. It is not possible to collect a uniform random sample of all written utterances. However, our findings suggest that current selection methods are, for many purposes, flawed. Future work into alternative filtering criteria could be paired with investigations into the unintended consequences of their assumptions.
We do not believe that there is likely to be a single solution to this challenge. Indeed, the text that is best suited for training a model may depend on the application of that model. At a minimum, however, the NLP community could more carefully consider and clearly document the inclusion criteria for text. NLP practitioners could also be explicit about their reasons for using certain sources, even if those reasons are related to availability or empirical performance. A collection of tests could also be deployed (and improved over time) to give a clear understanding of the implications of different choices of filters.
More generally, we echo calls in the literature for more thoughtful and inclusive data collection (Jo and Gebru, 2020; Bender et al., 2021; Tanweer et al., 2021). Strategies could include, but are not limited to, a) intentionally curating data from people and viewpoints that are not otherwise well represented; b) including a greater diversity of genres; c) adopting more nuanced or intentional exclusion criteria; d) conducting more thorough interrogation of what text is being excluded; e) developing standard checks for prominent biases in inclusion; and/or f) abandoning the notion of a general-purpose corpus.

Related Work
Language Ideologies. Language ideologies have been widely explored in the sociolinguistics literature (Gal and Irvine, 1995; Rosa and Flores, 2017; Craft et al., 2020, inter alia). An ideology that promotes the inherent correctness, clarity, and objectivity of certain language varieties over others is a mechanism for linguistic discrimination (Craft et al., 2020; Gal, 2016; MacSwan, 2020; Rickford and King, 2016). A salient example of such discrimination is the stigmatization of second-language speakers of English (Lindemann, 2005).
Language ideologies have an important, but often unacknowledged, influence on the development of NLP technologies (Blodgett et al., 2020). For example, an ideology that distinguishes between standard and non-standard language variations surfaces in text normalization tasks (van der Goot et al., 2021), which tend to strip documents of pragmatic nuance (Baldwin and Chai, 2011) and social signals (Nguyen et al., 2021). Language on the Internet has been historically treated as a noisy variant of English, even though lexical variation on the Internet is highly communicative of social signals (Eisenstein, 2013) and varies considerably along demographic variables (Eisenstein et al., 2014) and community membership (Lucy and Bamman, 2021).
Language ideologies also surface in tools for toxicity detection; for example, the classification behavior of the PERSPECTIVE API (a popular hate speech detector) aligns with the attitudes of conservative, white, female annotators, who tend to perceive African-American dialects as more toxic (Sap et al., 2021).
Critiques of Laissez-Faire Data Collection. We provide empirical evidence that laissez-faire data collection (i.e., filtering large web data sources) leads to data homogeneity (Bender et al., 2021). As an alternative to laissez-faire collection, Jo and Gebru (2020) recommend drawing on institutional archival practices. However, we note that language ideologies are also prevalent (and may not be explicit) in institutional archives, which, for example, have preferred colonizing perspectives over colonized ones when documenting historical events (Trouillot, 1995; Decker, 2013).
Other Quality Filters. Other definitions of text quality are used to create pretraining datasets, some of which do not rely on the datasets from §2. However, all techniques adopt language ideologies of what constitutes high quality text. Bad-word filtering, which removes documents that contain certain stop-words, disproportionately excludes language about and by non-dominant groups (Dodge et al., 2021). Filtering Internet content for popularity (Radford et al., 2019) leads to data homogeneity based on the characteristics of viral media and the composition of userbases in online forums (§2). Even lightweight filters (Aghajanyan et al., 2021; Rae et al., 2021) put more emphasis on features like document length over factuality when determining what makes a document high quality. Any filtering method requires transparent justification and recognition of tradeoffs.

Downstream Behavior
The behavior of language processing systems aligns with what we would expect from a language ideology that favors training data written by a narrow, powerful sector of society. For example, dialogue agents perform significantly worse when engaging in conversations about race (Schlesinger et al., 2018) and with non-dominant dialects of English (Mengesha et al., 2021). GPT-3 frequently resorts to using stereotypes when minority groups are mentioned in its prompt (Abid et al., 2021; Blodgett, 2021). GPT-3 is also prone to producing hate speech (Gehman et al., 2020) and misinformation (McGuffie and Newhouse, 2020), which we would expect if its quality filter fails to distinguish the factual reliability of news sources in its training data (§4). Gao (2021) shows that aggressive data filtering with the GPT-3 quality filter degrades downstream task performance. A closer analysis of how the language ideologies in data selection lead to certain model behaviors is a rich area for future work.

Conclusion
Using a new dataset of U.S. school newspapers, we find that the conventional, automated valuation of Wikipedia, newswire, books, and popular Internet content as reference for high quality text implicitly favors content written by authors from larger schools in wealthier, educated, urban areas of the United States. Adopting this language ideology for text data selection leads to implicit (yet systematic and as-yet undocumented) inequalities in terms of whose language is more likely to be included in training corpora. Although no single action will solve this complicated issue, data curators and researchers could be more intentional about curating text from underrepresented authors and groups, gathering sources from multiple genres and writing styles, and documenting their curation procedures and possible sources of exclusion.

Ethical Considerations
Our U.S. SCHOOL NEWS dataset comes with many limitations, as described in §3.1. Our corpus is neither a random nor a representative sample of U.S. school newspapers. Instead, it represents schools that had sufficient Internet access, that elected to use a particular website template, and that maintain websites with retrievable archived content. In general, our dataset likely captures neither the least resourced schools in the United States (which may not have good access to online resources) nor the wealthiest ones (which may have their own publication platforms). The lack of representation in school newspaper leadership positions may influence which students contribute content to school newspapers (Chen et al., 2021). Educators also likely shape some articles, at least in part (though we expect them to be similarly affected by resource constraints).
Moreover, much of the content in these articles is specific to student concerns (e.g., sports, school events, campus culture, etc.), and the writing is, by definition, amateur.Nevertheless, because the corpus captures a wide range of content and geographical areas, it allows us to evaluate how a quality filter handles real-world language variation within a particular domain.Additionally, we speculate that an expanded corpus, which included writings from these schools, would demonstrate a continuation of trends we report in this paper.
Using text from school newspapers introduces privacy concerns, especially since authors and subjects are minors. We therefore use this data only for evaluation purposes; we do not train (or release) any models on this data or on any raw text from the corpus. We do, however, release a datasheet (Gebru et al., 2021) which documents the dataset's general characteristics and curation procedure (§A.3).
While the text in our dataset varies considerably along topical, stylistic, and demographic variables, it is nevertheless a niche domain.The text is a specific genre meant for local student consumption, its authors are U.S. students, and it thus primarily represents U.S.-centric cultural and political perspectives.We acknowledge that we also perpetuate some of the biases we identify, especially by working with English language text from the United States.We hope future work will extend this study of language ideologies to multilingual settings, other textual domains, and different sets of authors.
With respect to demographic variables, we merge census demographics with school-level data via ZIP codes or counties, which are imperfect identifiers of a school, since ZIP codes (and counties) may include multiple schools of varying resource levels.Moreover, tracking demographic variables and other author metadata, if deployed at scale, implies a certain level of invasive surveillance (Brayne, 2017).Future work may explore how to maintain the rights of authors as data subjects and producers while mapping demographic representation in large corpora.
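The ZIP-code merge described above can be sketched in plain Python as follows. All ZIP codes, school names, and income values are made up, and the field and function names are hypothetical; the sketch is only meant to show why a ZIP code is an imperfect identifier of a school, not to reproduce our pipeline.

```python
from collections import defaultdict

# Hypothetical census records keyed by ZIP code (values are made up).
census = {
    "00001": {"median_income": 52000},
    "00002": {"median_income": 91000},
}

# Hypothetical school records; two schools share ZIP code 00001.
schools = [
    {"name": "School A", "zip": "00001"},
    {"name": "School B", "zip": "00001"},
    {"name": "School C", "zip": "00002"},
    {"name": "School D", "zip": "00003"},  # no census record for this ZIP
]

def merge_on_zip(schools, census):
    """Attach ZIP-level census fields to each school; skip unmatched schools."""
    merged = []
    for school in schools:
        record = census.get(school["zip"])
        if record is not None:
            merged.append({**school, **record})
    return merged

def shared_zips(schools):
    """Return the ZIP codes that map to more than one school."""
    by_zip = defaultdict(list)
    for school in schools:
        by_zip[school["zip"]].append(school["name"])
    return {z: names for z, names in by_zip.items() if len(names) > 1}

merged = merge_on_zip(schools, census)
ambiguous = shared_zips(schools)
```

In this toy example, School A and School B receive identical demographic values because they share a ZIP code, even though their actual resource levels could differ.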
The lack of access to GPT-3's training data and quality filter prevents us from making claims about how quality filter biases affect language model behavior.Future work on language models may also include transparent release of training data and associated quality filters, which would help support this kind of research.
Finally, we did not seek consent from authors to scrape their articles. The ethical and legal norms around scraping public-facing web data, especially those produced by minors, are still in flux (Fiesler et al., 2020) and may not align with user perceptions of what constitutes fair use of online communications (Williams et al., 2017). For these reasons (as discussed earlier), we do not release the corpus of school newspaper articles, and only use it for analysis and evaluation. We only make available a dataset of demographic variables and quality scores per school, to support reproducibility.

On Monday, September 3rd, Colin Kaepernick, the American football star who started the "take a knee" national anthem protest against police brutality and racial inequality, was named the new face of Nike's "Just Do It" 30th-anniversary campaign. Shortly after, social media exploded with both positive and negative feedback from people all over the United States. As football season ramps back up, this advertisement and the message behind it keeps the NFL Anthem kneeling protest in the spotlight.

Figure 1: Scatter plots displaying correlations of select demographic features of a school's ZIP code or county with its average P(high quality).

Figure 4: Scraped school articles tend to be considered lower quality by the GPT-3 quality filter than general newswire (histogram built from 10K random documents from each domain). This finding is consistent across a variety of categories, and more significant for certain ones (e.g., school announcements).

Table 1: Dependent variable: P(high quality). Regression of the quality score of an opinion piece in the U.S. SCHOOL NEWS dataset, on document features (N = 10k

Table 2: Dependent variable: P(high quality). Regression of the average P(high quality) of a school on demographic variables (N = 968). We observe that larger schools in educated, urban, and wealthy areas of the U.S. tend to be scored higher by the GPT-3 quality filter. See §A.8 for more information on these features.

Table 3: Overview of recent language models and their training corpora. All studies tend to draw from the same core data sources: Wikipedia, Books, News, or filtered web dumps.

Table 4: The most popular top-level URL domains in OpenWebText. Mainstream news forms the overwhelming majority of content in the dataset. Overall, just 1% of the top-level URL domains in OpenWebText contribute 75% of the total documents in the corpus.

Table 5: Hyperparameter search space and best assignments for our re-implementation of the GPT-3 quality filter.

Table 6: Examples of high school newspaper articles from U.S. SCHOOL NEWS. Many of the articles in the student-life category, and in similar categories, that are rated lower quality have very different styles from documents rated high quality.