Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Large language models have led to remarkable progress on many NLP tasks, and researchers are turning to ever-larger text corpora to train them. Some of the largest corpora available are made by scraping significant portions of the internet, and are frequently introduced with only minimal documentation. In this work we provide some of the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin by investigating where the data came from, and find a significant amount of text from unexpected sources like patents and US military websites. Then we explore the content of the text itself, and find machine-generated text (e.g., from machine translation systems) and evaluation examples from other benchmark NLP datasets. To understand the impact of the filters applied to create this dataset, we evaluate the text that was removed, and show that blocklist filtering disproportionately removes text from and about minority individuals. Finally, we conclude with some recommendations for how to create and document web-scale datasets from a scrape of the internet.


Introduction
Models pretrained on unlabeled text corpora are the backbone of many modern NLP systems (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020, inter alia). This paradigm incentivizes the use of ever larger corpora (Kaplan et al., 2020; Henighan et al., 2020), with the biggest models now training on a substantial fraction of the publicly-available internet (Raffel et al., 2020; Brown et al., 2020). Of course, as with all machine learning systems, the data such models are trained on has a large impact on their behavior. For structured, task-specific NLP datasets, best practices have emerged around documenting the collection process, composition, intended uses, and other characteristics (Bender and Friedman, 2018; Gebru et al., 2018; Hutchinson et al., 2021). However, given the challenges of applying these practices to massive collections of unlabeled text scraped from the web, thorough documentation is typically not done. This leaves consumers of pretrained language models in the dark about the influences of pretraining data on their systems, which can inject subtle biases in downstream uses (Li et al., 2020; Gehman et al., 2020; Groenwold et al., 2020).

Figure 1: We advocate for three levels of documentation when creating web-crawled corpora. On the right, we include some examples of the types of documentation that we provide for the C4.EN dataset.
In this work we provide some of the first documentation of a web-scale dataset: the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020). C4 is one of the largest language datasets available, with more than 156 billion tokens collected from more than 365 million domains across the internet (Table 1).1 C4 has been used to train models such as T5 and the Switch Transformer (Fedus et al., 2021), two of the largest pretrained English language models. While Raffel et al. (2020) provided scripts to recreate C4, simply running the available scripts costs thousands of dollars. Reproducible science is only possible when data is broadly accessible, and web-scale corpora are no different in this regard. With that in mind, we provide a downloadable copy of this dataset.2

Documenting massive, unlabeled datasets is a challenging enterprise. Some suggestions from previous work are naturally appropriate, such as reporting the number of examples and a link to a downloadable version of the dataset.3 However, many recommendations, like reporting information about the authors of the text, are not easily applicable, since often the required information is not available in web-crawled text.
We advocate for documentation of web-scale corpora to include three views of the data, as illustrated in Figure 1. First, the metadata, including the internet domains from which the data was collected. At the highest level, internet top-level domains like .edu likely contain significantly different text than .mil, the top-level domain reserved for US government military websites; text from both exists in C4.
Following the metadata, we examine the text itself. We find significant amounts of machine-generated text (e.g., from machine translation systems), the proportion of which will likely only increase over time. We also find some evidence of contamination (the presence of test examples from other datasets in C4), and argue that new datasets should properly account for the existence of such phenomena.
Finally, as web-crawled datasets typically filter out significant portions of text, we argue for more thorough documentation of what is not in the data. Some filters are relatively straightforward, such as removing Lorem ipsum placeholder text. However, we find that another filter, which removes documents containing any token from a banned word list, disproportionately removes documents in dialects of English associated with minority identities (e.g., text in African American English, text discussing LGBTQ+ identities).
In addition to our set of recommendations and analyses, we publicly host three versions of the data with different levels of filtering, along with an indexed version for easy searching,4 and a repository for public discussion of findings.5

2 https://github.com/allenai/c4documentation
3 NLP Reproducibility Checklist: https://2020.emnlp.org/blog/2020-05-20reproducibility
4 https://c4-search.apps.allenai.org/; this index will only be hosted until 2021-12-31.

Table 1: Statistics for the three corpora we host. One "document" is the text scraped from a single URL. Tokens are counted using the SpaCy English tokenizer. Size is that of the compressed JSON files.

The English Colossal Clean Crawled Corpus (C4)

C4 is created by taking the April 2019 snapshot of Common Crawl6 and applying a number of filters with the intention of removing text that is not natural English. These filters remove lines which don't end in a terminal punctuation mark or have fewer than three words, discard documents with fewer than five sentences or that contain Lorem ipsum placeholder text, and remove documents which contain any word on the "List of Dirty, Naughty, Obscene, or Otherwise Bad Words".7 Additionally, langdetect8 is used to remove documents which weren't classified as English with probability at least 0.99, so C4 is primarily comprised of English text. We call this "cleaned" version of C4 (created by applying all filters) C4.EN. For brevity we refer readers to Raffel et al. (2020) for a full list of the filters. In addition to C4.EN, we host the "uncleaned" version (C4.EN.NOCLEAN), which is the snapshot of Common Crawl identified as English (with no other filters applied), and C4.EN.NOBLOCKLIST, which is the same as C4.EN but without filtering out documents containing tokens from a blocklist of words (see §5 for more details). Table 1 contains some statistics for the three corpora.
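The filtering pipeline described above can be sketched in Python. This is a simplified stand-in rather than the actual C4 code: the blocklist and the language-ID probability are toy placeholders, and sentence counting is crudely approximated by counting terminal punctuation marks.

```python
import re

# Toy placeholders for the real resources: the actual blocklist is the
# "List of Dirty, Naughty, Obscene, or Otherwise Bad Words", and language
# identification uses langdetect.
BLOCKLIST = {"badword1", "badword2"}
TERMINAL_PUNCT = (".", "!", "?", '"')

def clean_document(text, english_prob=1.0):
    """Apply a simplified version of the C4.EN filters to one document.

    Returns the cleaned text, or None if the document is discarded.
    """
    # Discard documents not confidently identified as English.
    if english_prob < 0.99:
        return None
    # Discard documents containing placeholder text.
    if "lorem ipsum" in text.lower():
        return None
    # Discard documents containing any blocklisted word.
    tokens = set(re.findall(r"[\w'*]+", text.lower()))
    if tokens & BLOCKLIST:
        return None
    # Keep only lines that end in terminal punctuation and have >= 3 words.
    lines = [
        line.strip() for line in text.splitlines()
        if line.strip().endswith(TERMINAL_PUNCT) and len(line.split()) >= 3
    ]
    # Discard documents with fewer than 5 sentences (approximated here by
    # counting terminal punctuation marks across the kept lines).
    n_sentences = sum(line.count(".") + line.count("!") + line.count("?")
                      for line in lines)
    if n_sentences < 5:
        return None
    return "\n".join(lines)
```

Note that the filters interact: line-level filtering can reduce a document below the five-sentence threshold, so the order of operations matters.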

Corpus-level statistics
Understanding the provenance of the texts that comprise a dataset is fundamental to understanding the dataset itself, so we begin our analysis of the metadata of C4.EN by characterizing the prevalence of different internet domains.11

Websites In Figure 2 (right), we show the top 25 most represented websites in C4.EN, ranked by total number of tokens. Surprisingly, the cleaned corpus contains substantial amounts of patent documents: the single most represented website in the corpus is patents.google.com, and patents.com is in the top 10. We discuss the implications of this in §4.1.

5 https://github.com/allenai/c4documentation/discussions
6 https://commoncrawl.org/, where monthly "snapshots" are created by crawling and scraping the web, each typically containing terabytes of text
7 https://git.io/vSyEu
8 https://pypi.org/project/langdetect/
9 https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
10 https://spacy.io/api/tokenizer
11 We use the TLDExtract (https://pypi.org/project/tldextract/) package to parse the URLs.
Two well-represented domains of text are Wikipedia and news (NYTimes, LATimes, Al Jazeera, etc.). These have been extensively used in the training of large language models (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020, e.g., BERT, RoBERTa, GPT-3). Some other noteworthy websites that make up the top 25 include open-access publications (Plos, FrontiersIn, Springer), the book publishing platform Scribd, the stock analyses and advice website Fool.com, and the distributed file system ipfs.io.12

Utterance Date
Language changes over even short timescales, and the truth or relevance of many statements depends on when they were made. While the actual utterance date is often impossible to obtain for web documents, we use the earliest date a URL was indexed by the Internet Archive as a proxy. We note that using the Internet Archive is not perfect, as it will sometimes index webpages many months after their creation, and it only indexed approximately 65% of the URLs in C4.EN. In Figure 3, we present the dates the Internet Archive first indexed 1,000,000 randomly sampled URLs from C4.EN. We found that 92% are estimated to have been written in the last decade (2011-2019). However, the distribution is long-tailed: there is a non-trivial amount of data that was written between 10 and 20 years before data collection.
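The utterance-date proxy amounts to taking the earliest Internet Archive capture timestamp for each URL and bucketing by year. A minimal sketch follows; the timestamps here are hypothetical 14-digit Wayback-style YYYYMMDDhhmmss strings, which in practice would come from querying the Internet Archive for each URL's first capture.

```python
from collections import Counter

def year_histogram(first_capture_timestamps):
    """Tally earliest-index years from 14-digit Wayback-style
    YYYYMMDDhhmmss timestamps, one per URL."""
    years = [int(ts[:4]) for ts in first_capture_timestamps]
    return Counter(years)

# Hypothetical first-capture timestamps for four URLs.
sample = ["20150321000000", "20181102120000",
          "20150101080000", "20090515000000"]
hist = year_histogram(sample)

# Fraction of URLs first indexed in 2011-2019, mirroring the 92% statistic
# reported above (on this toy sample, of course, the number is meaningless).
recent = sum(n for year, n in hist.items() if 2011 <= year <= 2019)
print(recent / len(sample))
```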

Geolocation
We aim to assess which countries are represented in C4.EN, using the location where a webpage is hosted as a proxy for the location of its creators. There are several caveats to working with geolocations of IP addresses: many websites are not hosted locally but in data centers, and ISPs may store a website in different locations around the world so that a user can load a version from a nearby datacenter rather than from the original hosting location. We use an IP-country database14 and present country-level URL frequencies from 175,000 randomly sampled URLs.
As shown in Figure 4 in the appendix, 51.3% of pages are hosted in the United States. The countries with the estimated 2nd, 3rd, 4th, and 5th largest English-speaking populations15 (India, Pakistan, Nigeria, and the Philippines) have only 3.4%, 0.06%, 0.03%, and 0.1% of the URLs of the United States, respectively, despite having many tens of millions of English speakers.

What is in the text?
We expect our trained models to exhibit behavior based on the data they are trained on. In this section we examine machine-generated text, benchmark contamination, and demographic biases.

14 https://lite.ip2location.com/database/ip-country
15 https://en.wikipedia.org/wiki/List_of_countries_by_English-speaking_population

Machine-generated text
As the use of models which can generate natural language text proliferates, web-crawled data will increasingly contain data that was not written by humans. Here we look for machine-generated text in the Internet domain from which we get the most tokens: patents.google.com.
Patent offices have requirements around the language in which patents are written (e.g., the Japanese patent office requires patents be written in Japanese). patents.google.com uses machine translation to translate patents from patent offices around the world into English.16 Table 3 in Appendix A.3 includes the number of patents in C4.EN from different patent offices, and the official language of those patent offices. While the majority of the patents in this corpus are from the US patent office, more than ten percent are from patent offices which require patents be submitted in a language other than English.17 While some patents in this corpus are native digital documents, many were physical documents scanned through Optical Character Recognition (OCR). Indeed, some older documents from non-English patent offices are first run through OCR and then through machine translation systems (see Appendix A.3). OCR systems are imperfect, and thus generate text that differs in distribution from natural English (OCR systems often make mistakes in predictable ways, such as spelling errors and entirely missed words). Quantifying the number of documents that are machine-generated is an active area of research (Zellers et al., 2019); our findings motivate further work.

Benchmark data contamination
In this section, we study benchmark data contamination (Brown et al., 2020), i.e., to what extent training or test datasets from downstream NLP tasks appear in the pretraining corpus. There are generally two ways datasets can end up in a snapshot from Common Crawl: either a given dataset is built from text on the web, such as the IMDB dataset (Maas et al., 2011) and the CNN/DailyMail summarization dataset (Hermann et al., 2015; Nallapati et al., 2016), or it is uploaded after creation (e.g., to a GitHub repository, for easy access). In this section, we explore both input and input-and-label contamination of popular datasets.

16 "Patents with only non-English text have been machine-translated to English and indexed", from https://support.google.com/faqs/answer/7049585
17 Many patent offices require a patent be filed in a particular language, but also allow translations into other languages to be submitted, so this is an upper bound on the number of translated documents.
Unlike Brown et al. (2020), who measure contamination using n-gram overlap (n between 8 and 13) between pretraining data and benchmark examples, we measure exact matches, normalized for capitalization and punctuation.18

Input-and-label contamination If task labels are available in the pretraining corpus, a valid train-test split is not made and the test set is not suitable for evaluating the model's performance. For tasks similar to language modeling (e.g., abstractive summarization), the task labels are the target tokens. If target text occurs in the pretraining corpus, the model can learn to copy the text instead of actually solving the task (Meehan et al., 2020; Carlini et al., 2020).
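The exact-match test can be sketched as follows. The normalization (lowercasing and punctuation stripping) follows the description above, while the toy index stands in for whatever lookup structure one builds over the pretraining corpus; the sentences are invented examples.

```python
import string

def normalize(text):
    """Normalize for the exact-match contamination check: lowercase,
    strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def contaminated(benchmark_examples, pretraining_index):
    """Return the benchmark examples whose normalized text appears
    verbatim in a pre-normalized index of pretraining text."""
    return [ex for ex in benchmark_examples
            if normalize(ex) in pretraining_index]

# Toy index standing in for C4.EN; in practice this would be a hash set
# or inverted index built over every document in the corpus.
index = {normalize("The quick brown fox jumps over the lazy dog.")}
print(contaminated(["The quick, brown fox jumps over the lazy dog!",
                    "An unseen sentence."], index))
```

The first example matches despite differing capitalization and punctuation, which is exactly the looseness the normalization is meant to provide; the n-gram overlap of Brown et al. (2020) is more permissive still.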
We examine contamination of target text in test sets of datasets for three generation tasks: (i) abstractive summarization (TIFU, Kim et al., 2019; XSum, Narayan et al., 2018), (ii) table-to-text generation (WikiBio, Lebret et al., 2016), and (iii) graph-to-text generation (AMR-to-text, LDC2017T10). In the upper part of Table 2, we show that 1.87-24.88% of target texts appear in C4.EN. The matching rate is higher for datasets that (mostly) contain single-sentence target texts (XSum, TIFU-short, AMR-to-text) than for those with multi-sentence outputs (TIFU-long, WikiBio). That said, matching XSum summaries are not trivial sentences (see Table 5 in the appendix), and developing a model that generates them automatically is a notable achievement.
We also examine two subsets of the LAMA dataset for probing knowledge completion: LAMA T-REx and Google-RE. LAMA evaluation examples are comprised of template-generated sentences with a masked token for the model to fill in, and we find that 4.6% and 5.7% of the examples in the T-REx and Google-RE sets, respectively, exist verbatim in C4.EN. While this is a tiny fraction of the C4.EN dataset, a language model pretrained on C4.EN can simply retrieve the matching training instance to get these examples correct.
We do not observe input-and-label contamination due to hosting datasets on the web (see Appendix A.5).
Input contamination Input contamination, where evaluation examples appear in the pretraining corpus without their labels, can also lead to downstream problems. We examine input contamination for test examples in the GLUE benchmark (Wang et al., 2019b; individual citations in Appendix A.4), a common test bed for language models. If a dataset has multiple components (e.g., sentence and question in QNLI), we report them separately. In Table 2, we show that the percentage of inputs found in C4.EN varies widely, from less than 2% to over 50%. Interestingly, both the smallest and largest contamination proportions come from QNLI (built from Wikipedia), where models are tasked to determine whether a sentence contains the answer to a question.
Although train set contamination is generally not problematic for classification tasks if it does not include labels (Gururangan et al. (2020) even recommend continued pretraining on the task's unlabeled training data), it could be misleading in few-shot and zero-shot learning. LAMA is often used to evaluate zero-shot performance, and GLUE datasets are used for few-shot performance (Gao et al., 2021); perhaps this practice should be considered carefully.

Demographic Biases in C4.EN
Much recent work has found various kinds of bias in fine-tuned models (e.g., Sheng et al., 2019;Gehman et al., 2020;Li et al., 2020), where the bias is assumed to derive from pretraining data, though this has not previously been easy to verify. We present evidence that corroborates this widely-held hypothesis, reproducing the ethnicity bias results from Li et al. (2020) and showing that this bias correlates with sentiment expressed in C4.
To reproduce the ethnicity bias results of Li et al. (2020), we used their script to generate questions for testing implicit bias towards various ethnicities. This resulted in 294,000 questions about 15 different ethnicities, with each question comparing two. An example question is "An Arab woman was taking classes with a Jewish woman. Who was a bad driver?" These questions come in pairs, with one having negative sentiment expressed about the answer to the question (as above), and one expressing positive sentiment about the answer.
We took the pretrained UnifiedQA model (Khashabi et al., 2020), 19 distributed by Hugging Face's transformers library (Wolf et al., 2020), and evaluated it on these 294,000 questions formatted as multiple choice, so the model had to pick one of the two ethnicities in the question. We then counted the proportion of times each ethnicity was associated with positive sentiment by the model; i.e., the model selected the ethnicity as the answer for a positive-sentiment question, or selected the opposite ethnicity as the answer for a negative-sentiment question. The resulting proportions are shown in Table 7 in §A.7.
We find that "Jewish" and "Arab" are among the most polarized ethnicities, with a positive bias towards "Jewish" and a negative bias towards "Arab". We then look for evidence that C4 could be the source of this bias. We compute a sentiment lexicon by averaging the various social lexicons of Hamilton et al. (2016), and count sentiment-bearing words that occur in the same paragraph as either ethnicity. We find that "Jewish" has a significantly higher percentage of positive sentiment tokens (73.2% of 3.4M tokens) than "Arab" does (65.7% of 1.2M tokens) (for more detail, see §A.7). This is an example of representational harms (Barocas et al., 2017).

19 UnifiedQA is a fine-tuned version of T5 (Raffel et al., 2020), which was pretrained on C4.
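The co-occurrence measurement can be sketched as follows; the three-word lexicon and the two paragraphs are toy stand-ins for the averaged Hamilton et al. (2016) lexicons and the paragraphs of C4.EN.

```python
def cooccurrence_sentiment(paragraphs, term, lexicon):
    """Fraction of sentiment-bearing tokens that are positive, counted
    over paragraphs that mention `term`. `lexicon` maps word -> +1/-1."""
    pos = neg = 0
    for para in paragraphs:
        tokens = para.lower().split()
        if term in tokens:
            for tok in tokens:
                polarity = lexicon.get(tok)
                if polarity == 1:
                    pos += 1
                elif polarity == -1:
                    neg += 1
    # Undefined when no sentiment-bearing token co-occurs with the term.
    return pos / (pos + neg) if pos + neg else float("nan")

# Toy lexicon and paragraphs (hypothetical values).
lex = {"good": 1, "great": 1, "bad": -1}
paras = ["the arab driver was good", "the arab food was bad and bad"]
print(cooccurrence_sentiment(paras, "arab", lex))
```

Comparing this statistic across terms (here, across ethnicity mentions) is what yields the 73.2% vs. 65.7% gap reported above.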
C4.EN is a heterogeneous and complex collection of text from many different sources, and this can be seen by measuring such biases separately in text from different internet domains. Specifically, we find that New York Times articles in C4.EN have a smaller sentiment spread between "Jewish" and "Arab" (4.5%, versus the 7.5% spread we observed in C4 overall), while there is no gap between sentiment expressed in the context of these two ethnicities in articles from Al Jazeera.

What is excluded from the corpus?
To understand a dataset built by first scraping the web then applying filters to remove some portion of the scraped text, one must understand the impact of the filters themselves. Such filters are often designed to "clean" the text (e.g., through deduplication, length-based filtering, etc.). We characterize the effect of one specific step in the creation of C4.EN: the exclusion of documents that contain any word from a blocklist of "bad" words 20 with the intent to remove "offensive language" (Raffel et al., 2020), i.e., hateful, toxic, obscene, sexual, or lewd content. This blocklist was initially created to avoid "bad" words in autocompletions for a search engine (Simonite, 2021) and contains words such as "porn," "sex," "f*ggot," and "n*gga." We first characterize the topic of documents that were excluded (i.e., that are in C4.EN.NOBLOCKLIST but not in C4.EN) using clustering ( §5.1). Then, we examine whether blocklist filtering disproportionately excludes documents that contain minority identity mentions ( §5.2) or documents that are likely written in non-white English dialects ( §5.3).

Characterizing the excluded documents
We examine a random sample of 100,000 documents excluded by the blocklist. Using PCA projections of TF-IDF embeddings, we categorize those documents into k = 50 clusters using the k-means algorithm. As illustrated in Fig. 6 in the appendix, we find that only 16 of the clusters of excluded documents are largely sexual in nature (these account for 31% of the excluded documents). Among the remaining clusters, for example, we find clusters of documents related to science, medicine, and health, as well as clusters related to legal and political documents.
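Assuming scikit-learn, the clustering procedure can be sketched as below, with a tiny k and a handful of toy documents in place of the k=50 clustering over 100,000 documents described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_excluded(docs, k=2, n_components=2, seed=0):
    """Cluster documents via PCA-projected TF-IDF embeddings, then
    k-means; returns the cluster label assigned to each document."""
    X = TfidfVectorizer().fit_transform(docs).toarray()
    X = PCA(n_components=n_components, random_state=seed).fit_transform(X)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return km.labels_

# Two toy "medical" and two toy "legal" documents; the clustering should
# recover the two topics.
docs = ["heart disease medicine health doctor",
        "medicine health clinical trial doctor",
        "senate bill law legal court",
        "court ruling law legal judge"]
labels = cluster_excluded(docs)
```

Cluster labels are then inspected manually (or via top TF-IDF terms per cluster) to name topics such as "medicine" or "legal", as in Fig. 6.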

Which demographic identities are excluded?
Next, we explore whether certain demographic identity mentions are more likely to be excluded due to the blocklist filtering. We extract the frequencies of a set of 22 regular expressions related to identity mentions,21 and compute the pointwise mutual information (PMI; Church and Hanks, 1990) between a document mentioning an identity and that document being filtered out by the blocklist. As illustrated in Fig. 5 in the appendix, we find that mentions of sexual orientations (lesbian, gay, heterosexual, homosexual, bisexual) have the highest likelihood of being filtered out, compared to racial and ethnic identities. Upon manual inspection of a random sample of 50 documents mentioning "lesbian" and "gay," we find that non-offensive or non-sexual documents make up 22% and 36%, respectively. Corroborating our findings in §5.1, several of these excluded documents are on the topic of same-sex relationships (marriage, dating, etc.).
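Treating each document as either mentioning an identity term or not, and as either filtered out or not, the PMI computation reduces to a few counts. The counts below are hypothetical, chosen so the identity term is over-represented among filtered documents.

```python
import math

def pmi(n_both, n_mention, n_filtered, n_total):
    """Pointwise mutual information (Church and Hanks, 1990) between a
    document mentioning an identity term and being removed by the blocklist.

    n_both:     documents that mention the term AND were filtered out
    n_mention:  documents that mention the term
    n_filtered: documents that were filtered out
    n_total:    all documents
    """
    p_joint = n_both / n_total
    p_mention = n_mention / n_total
    p_filtered = n_filtered / n_total
    return math.log2(p_joint / (p_mention * p_filtered))

# Hypothetical counts: a term appearing in 1% of all documents but in 5%
# of the filtered-out documents gets a positive PMI with exclusion.
print(pmi(n_both=50, n_mention=100, n_filtered=1000, n_total=10000))
```

A PMI of zero means mentioning the term is independent of being filtered; the disproportionately high PMI values for sexual-orientation terms are what Fig. 5 visualizes.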

Whose English is included?
Finally, we investigate the extent to which minority voices are being removed due to blocklist filtering. Because determining the (potentially minority) identity of a document's author is both infeasible and ethically questionable (Tatman, 2020), we instead focus on measuring the prevalence of different varieties or dialects of English in C4.EN and C4.EN.NOBLOCKLIST. We use a dialect-aware topic model from Blodgett et al. (2016), which was trained on 60M geolocated tweets and relies on US census race/ethnicity data as topics. The model yields posterior probabilities of a given document being in African American English (AAE), Hispanic-aligned English (Hisp), White-aligned English (WAE), 22 and an "other" dialect category (initially intended by the model creators to capture Asian-aligned English). We extract the posterior probabilities of the four dialects for each document, and assign it a dialect based on which has the highest probability.
Our results show that African American English and Hispanic-aligned English are disproportionately affected by the blocklist filtering. Using the most likely dialect of a document, we find that AAE and Hispanic-aligned English are removed at substantially higher rates (42% and 32%, respectively) than WAE and other English (6.2% and 7.2%, respectively). Additionally, we find that 97.8% of documents in C4.EN are assigned the WAE dialect category, with only 0.07% AAE and 0.09% Hispanic-aligned English documents.

21 We investigate mentions related to gender identity, sexual orientation, race, and religion. See Tab. 6 for the full list.
22 We acknowledge that there is disagreement on the choice of terminology to refer to different varieties of English. Here, we use the terms from Blodgett et al. (2016).

Discussion & Recommendations
Our analyses of C4.EN and associated corpora revealed several surprising findings. At the metadata level (§3), we show that patent, news, and Wikipedia domains are the most represented in C4.EN, and that it contains substantial amounts of data from over a decade ago. Upon inspecting the included data (§4), we find evidence of machine-generated text, benchmark data contamination, and social biases. Finally, we also find evidence that the blocklist filtering step is more likely to exclude minority voices (§5). Based on these findings, we outline some implications and recommendations.
Reporting website metadata Our analysis shows that while this dataset represents a significant fraction of a scrape of the public internet, it is by no means representative of the English-speaking world, and it spans a wide range of years. When building a dataset from a scrape of the web, reporting the domains the text is scraped from is integral to understanding the dataset; the data collection process can lead to a significantly different distribution of internet domains than one would expect.
Examining benchmark contamination Since benchmarks are often uploaded to websites, benchmark contamination is a potential issue for dataset creation from webtext. Brown et al. (2020) raised this issue when introducing GPT-3, acknowledging that a bug in their filtering caused some benchmark contamination, found after training had finished. Due to the cost of retraining the model, they instead opted to analyze the impact of contamination on different tasks, finding that contamination could affect performance on benchmarks. Our observations support dynamically collecting data with a human-in-the-loop approach (Nie et al., 2020; Kiela et al., 2021), which might reduce contamination of future benchmarks since (i) pretraining data is infrequently collected, and (ii) annotator-written examples for a given task are less likely to have been previously crawled from the web.

Social biases and representational harms In §4.3, we show an example of negative sentiment bias against Arab identities, which is an example of representational harms (Barocas et al., 2017). Our evidence of bias in C4.EN is a first step, though we have not shown a causal link between our measured sentiment statistics and the downstream bias; if we could control the distributional biases in the pretraining data, perhaps it would reduce downstream bias. One potential way to do that is through carefully selecting subdomains to use for training, as different domains will likely exhibit different biases. Our experiments with New York Times articles and Al Jazeera indicate that, indeed, text from different internet domains exhibits different distributions, with varying amounts of bias. We argue that providing a measurement of such bias is an important component of dataset creation. However, if one wants to control for many different kinds of bias simultaneously, this seems very challenging to do by simply selecting specific subdomains.
Excluded voices and identities Our examination of the excluded data suggests that documents associated with Black and Hispanic authors and documents mentioning sexual orientations are significantly more likely to be excluded by C4.EN's blocklist filtering, and that many excluded documents contained non-offensive or non-sexual content (e.g., legislative discussions of same-sex marriage, scientific and medical content). This exclusion is a form of allocational harms (Barocas et al., 2017; Blodgett et al., 2020) and exacerbates existing (language-based) racial inequality (Rosa, 2019) as well as stigmatization of LGBTQ+ identities (Pinsof and Haselton, 2017). In addition, a direct consequence of removing such text from datasets used to train language models is that the models will perform poorly when applied to text from and about people with minority identities, effectively excluding them from the benefits of technology like machine translation or search. Our analyses confirm that determining whether a document has toxic or lewd content is a more nuanced endeavor that goes beyond detecting "bad" words; hateful and lewd content can be expressed without negative keywords (e.g., microaggressions, innuendos; Breitfeller et al., 2019; Dinan et al., 2019). Importantly, the meaning of seemingly "bad" words heavily depends on the social context (e.g., impoliteness can serve prosocial functions; Wang et al., 2012), and who is saying certain words influences their offensiveness (e.g., the reclaimed slur "n*gga" is considered less offensive when uttered by a Black speaker than by a white speaker; Croom, 2013; Galinsky et al., 2013). We recommend against using blocklist filtering when constructing datasets from web-crawled data.

Limitations and Recommendations
We recognize that we have only examined some of the possible issues with a dataset of this size, and so in addition to making the dataset available to download, we recommend providing a location for others to report issues they find (Habernal et al., 2016; Schäfer, 2016). For example, it is likely that there exists personally identifiable information and copyrighted text within C4.EN, but we leave quantifying or removing such text to future work. We also recognize that tools such as LangID work disproportionately well for English compared to other languages (Caswell et al., 2021), and that many of the analyses done in this paper might not generalize to other languages.

Related Work

GPT-3 (Brown et al., 2020) was trained on a mix of filtered Common Crawl (60% of its training data), WEBTEXT2 (22%), BOOKS1 (8%), BOOKS2 (8%), and English-language WIKIPEDIA (3%). GPT-3's Common Crawl data was downloaded from 41 monthly "snapshots" from 2016-2019, and it constitutes 45TB of compressed text before filtering23 and 570GB after (∼400 billion byte-pair-encoded tokens). Since analyzing pretraining corpora is challenging due to their size, their documentation is often missing (Bender et al., 2021; Paullada et al., 2020). To bridge this gap, researchers have started to publish systematic post-hoc studies of these corpora. Gehman et al. (2020) provide an in-depth analysis of OPENWEBTEXT with respect to toxicity and fake news. Caswell et al. (2021) recruited 51 volunteers speaking 70 languages to judge whether five publicly available multilingual web-crawled corpora (El-Kishky et al., 2020; Xue et al., 2021; Ortiz Suárez et al., 2020; Bañón et al., 2020; Schwenk et al., 2019) contain text in the languages they report, as well as to judge its quality. Jo and Gebru (2020) discuss parallels between creating historical archives and the curation of machine learning datasets, including pretraining corpora. Hutchinson et al. (2021) introduce a "framework for dataset development transparency that supports decision-making and accountability" that could be used for developing pretraining corpora.
The Masakhane organization advocates for participatory research (Nekoto et al., 2020), a set of methodologies for low-resourced NLP that includes all necessary agents, e.g., people from the countries where the low-resourced languages are spoken.

Conclusion
We present some of the first documentation and analyses of C4.EN, a web-scale unlabeled dataset originally introduced by Raffel et al. (2020). We argue that documentation for datasets created by scraping the web and then filtering out text should include analysis of the metadata, the included data, and the excluded data. We host three versions of the data for download, in addition to an indexed version for easy searching, and a repository for public discussion of findings. 24

Societal and Ethical Implications
Our work advocates for more transparency and thoughtfulness during the creation of large webtext corpora. Specifically, we highlight that specific design choices (e.g., blocklist filtering) can cause allocational harms to specific communities, by disproportionately removing minority-related content. Additionally, we show that using passively crawled webtext corpora (e.g., Common Crawl) can cause representational harms to specific demographic identities, showing disparate co-occurrences of specific geographic origins with negative sentiment. Better documentation for web-crawled corpora, and other massive language modeling datasets, can help find and solve issues that arise with language models, especially those that are used in production and impact many people.

A.1 Tokenization
The SentencePiece tokenizer for T5 is described in Section 3.3.1 of Raffel et al. (2020). They train this tokenizer and generate their word pieces and vocabulary from a 10:1:1:1 ratio of English:French:German:Romanian text, for a total of 32,000 word pieces. The English vocabulary is generated from the cleaned English C4, and thus does not contain the tokens in the blocklist; this can lead to some unexpected tokenizations, such as "sex" being tokenized as "s" + "ex".
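As a toy illustration of how filtering a word out of the training text can knock it out of the learned vocabulary, consider a greedy longest-match subword tokenizer. This is a deliberate simplification (T5 actually uses SentencePiece's unigram model), and the vocabulary below is hypothetical:

```python
# Toy greedy longest-match subword tokenizer. Because the (hypothetical)
# vocabulary was built from blocklist-filtered text, the word "sex" never
# appeared during training, so it is absent and falls back to smaller pieces.
def greedy_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # single-character fallback
            i += 1
    return pieces

# Hypothetical vocabulary: frequent subwords survive filtering,
# but the blocklisted word itself does not.
vocab = {"s", "ex", "expert", "example", "the"}
print(greedy_tokenize("sex", vocab))      # → ['s', 'ex']
print(greedy_tokenize("example", vocab))  # → ['example']
```

The same mechanism explains the "s" + "ex" split noted above: the subword learner simply never saw the blocklisted string as a frequent unit.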

A.2 Geolocation
In Figure 4 we show the URL frequency by country.

A.3 Patents from different patent offices
An example patent originally in Chinese: https://patents.google.com/patent/CN1199926A/en; an example originally in German and run through OCR: https://patents.google.com/patent/WO1998039809A1/en.

A.5 Classification label contamination
We observe that a large portion of the GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) datasets can be easily found on Github (see a list below). This prompted us to check whether these datasets occur in the unfiltered Common Crawl. We select phrases from each dataset that we identify on Github, and check whether they occur in the unfiltered Common Crawl. If there is a match, we manually examine the overlapping Common Crawl documents to see whether they represent the associated dataset. We do not find any such case, and conclude that there is no input-and-label contamination of standard NLP classification benchmarks in the unfiltered Common Crawl.

Determining what has been filtered is a fundamentally hard problem: as we argue in this paper, automated mechanisms like blocklists are insufficient for filtering out inappropriate content, and even human annotators would have difficulty reaching complete agreement. With these caveats in mind, we analyzed the documents filtered by the "bad words" list by performing a k-means clustering (with k=50) on 100,000 randomly sampled documents embedded using TF-IDF. We present a tSNE projection of this clustering in Figure 6. While many clusters correspond to pornography or hate speech, there are also clusters corresponding to medicine, religion, gaming, infant care, and other innocuous topics. Blocklist filtering excludes many important topics, and the excluded topics are not straightforward to predict.

A.7 Demographic Bias Experiment Details
Figure 5: Pointwise Mutual Information (PMI) between identity mentions and documents being filtered out by the blocklist. Identities with higher PMI (e.g., lesbian, gay) have a higher likelihood of being filtered out.

Figure 6: K-means clustering of 100k randomly sampled filtered documents, encoded using TF-IDF and projected with tSNE (only 5k shown for clarity). The five top keywords for each cluster are given in the legend.

To reproduce the ethnicity bias results of Li et al., we used their templated questions (e.g., "… driver?"). These questions come in pairs, with one having negative sentiment expressed about the answer to the question (as above), and one expressing positive sentiment about the answer. We took the pretrained UnifiedQA model (Khashabi et al., 2020), distributed by Hugging Face's transformers library (Wolf et al., 2020), and evaluated it on these 294,000 questions formatted as multiple choice, so the model had to pick one of the two ethnicities in the question. We then counted the proportion of times each ethnicity was associated with positive sentiment by the model; i.e., the model selected the ethnicity as the answer to a positive-sentiment question, or selected the opposite ethnicity as the answer to a negative-sentiment question. The resulting proportions are shown in the following table:

Given these results, we selected "Jewish" and "Arab" as points of comparison for a corpus study on C4.EN, as they are the ethnicities with the most extreme biases that were easy to find in C4.EN with simple scripts ("African" is a substring of "African-American", which has higher overall sentiment, and, e.g., "Black" has very common non-ethnic word senses).
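The PMI statistic reported in Figure 5 can be sketched as follows, assuming per-document indicators for "mentions the identity term" and "removed by the blocklist". The indicator lists below are toy data, not C4 measurements:

```python
# PMI between identity mentions and blocklist removal:
# PMI = log( p(mention, removed) / (p(mention) * p(removed)) )
import math

def pmi(mentions, removed):
    """PMI of two binary indicator lists of equal length."""
    n = len(mentions)
    p_m = sum(mentions) / n
    p_r = sum(removed) / n
    p_mr = sum(1 for m, r in zip(mentions, removed) if m and r) / n
    return math.log(p_mr / (p_m * p_r))

mentions = [1, 1, 1, 1, 0, 0, 0, 0]  # doc mentions the identity term
removed  = [1, 1, 1, 0, 1, 0, 0, 0]  # doc filtered out by the blocklist
print(round(pmi(mentions, removed), 3))  # → 0.405 (positive: they cooccur)
```

A positive PMI, as for "lesbian" or "gay" in Figure 5, means documents mentioning the identity are removed more often than chance would predict.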
To explore whether C4.EN could be a source of the observed bias between "Jewish" and "Arab", we first found all paragraphs containing these words, where the word was surrounded by spaces (for easy searching using fgrep, which is important on such a large corpus). We then took those paragraphs, tokenized them by whitespace, removed all punctuation, and computed cooccurrence statistics between all words and the target ethnicity. This resulted in 249.8M word occurrences in paragraphs containing the word "Jewish", and 134.8M for "Arab". We then obtained various sentiment lexicons to get a coarse estimate of the sentiment expressed in paragraphs containing these ethnicity terms. We used the VADER sentiment lexicon (Hutto and Gilbert, 2014), the SocialSent lexicons (Hamilton et al., 2016), and a small manually created lexicon using the words from the UNQOVER questions above. For the VADER lexicon, we treated a word as positive if the lexicon gave it a sentiment score greater than 1.0 and negative if the score was less than -1.0 (and ignored it otherwise). SocialSent consists of separate lexicons for many subreddits; we aggregated these by averaging the sentiment scores for all words that appeared in at least 40 subreddit-specific lexicons. This gave a roughly domain-independent sentiment lexicon, which we manually filtered to remove any overtly ethnic terms; we then took the top 250 most polarized words from each side as positive and negative words.
Given a particular sentiment lexicon, we counted the number of positive and negative word occurrences in paragraphs containing the ethnicity word, then found the proportion of these occurrences that had positive sentiment. For the SocialSent-derived lexicon, which we believe to be the most robust of the ones we used, we found 3.4M sentiment-bearing tokens for "Jewish", of which 73.2% were positive, and 1.2M for "Arab", of which 65.7% were positive, giving a positivity gap towards "Jewish" of 7.5%. The other sentiment lexicons also resulted in a positivity gap towards "Jewish", though it was smaller (1.4% for the manual lexicon based on UNQOVER questions, and 2.0% for the VADER lexicon).
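A minimal sketch of this counting procedure, with a toy corpus and a toy scored lexicon standing in for VADER/SocialSent (the thresholds mirror the ±1.0 rule above):

```python
# Count positive/negative lexicon hits in paragraphs mentioning a term,
# then report the proportion of sentiment-bearing tokens that are positive.
import string

# Toy lexicon: score > 1.0 counts as positive, < -1.0 as negative.
lexicon = {"wonderful": 2.3, "terrible": -2.1, "good": 1.6, "mild": 0.2}

def sentiment_counts(paragraphs, term, lexicon):
    """Return (positive, negative) hit counts in paragraphs containing `term`."""
    pos = neg = 0
    for para in paragraphs:
        words = [w.strip(string.punctuation).lower() for w in para.split()]
        if term not in words:
            continue  # paragraph does not mention the ethnicity term
        for w in words:
            score = lexicon.get(w, 0.0)
            if score > 1.0:
                pos += 1
            elif score < -1.0:
                neg += 1
    return pos, neg

paras = [
    "A wonderful Arab chef opened a good restaurant.",
    "The weather was terrible in the Arab quarter.",
    "A mild day, with no mention of the group here.",
]
pos, neg = sentiment_counts(paras, "arab", lexicon)
print(pos, neg, pos / (pos + neg))  # → 2 1 0.666...
```

On the full corpus, the same ratio computed per ethnicity term yields the positivity gap reported above.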
For the domain-filtered bias experiments, we found paragraphs from URLs beginning with either https://www.nytimes.com or https://www.aljazeera.com, two of the top 25 domains for documents in C4.EN, then repeated the above analysis using the SocialSent-derived lexicon. These domains had many fewer sentiment-bearing tokens for each ethnicity, ranging from 1.6k ("Jewish" in Al Jazeera) to 7.9k ("Arab" in NYT). Positivity ratios in NYT were 74.0% ("Jewish") and 69.5% ("Arab"), while they were 42.5% ("Jewish") and 42.8% ("Arab") in Al Jazeera.
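The domain restriction itself reduces to a URL-prefix filter. A sketch assuming the corpus is available as (url, text) pairs; the `by_domain` helper is hypothetical:

```python
# Keep only documents whose source URL starts with one of the target domains.
PREFIXES = ("https://www.nytimes.com", "https://www.aljazeera.com")

def by_domain(records, prefixes=PREFIXES):
    """Return texts of records whose URL matches any prefix."""
    # str.startswith accepts a tuple of prefixes, so no explicit loop is needed.
    return [text for url, text in records if url.startswith(prefixes)]

records = [
    ("https://www.nytimes.com/2019/01/article.html", "doc a"),
    ("https://example.com/page", "doc b"),
    ("https://www.aljazeera.com/news/story", "doc c"),
]
print(by_domain(records))  # → ['doc a', 'doc c']
```

The retained texts are then fed to the same lexicon-based sentiment counting as in the full-corpus analysis.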