Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies

Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state of the art results on the data practice classification and question answering tasks.


Introduction
A privacy policy is a legal document that is used by an organisation to disclose how it collects, analyzes, shares, and protects users' personal information. Legal jurisdictions around the world require organisations to make their privacy policies readily available to their users, and laws such as the General Data Protection Regulation (GDPR) in the European Union and the California Online Privacy Protection Act (CalOPPA) in the United States place specific expectations upon privacy policies. However, most internet users fail to understand privacy policies (Meiselwitz, 2013). Studies show that privacy policies require a considerable investment in time to read (Obar and Oeldorf-Hirsch, 2018) and estimate that it would require approximately 200 hours to read all the privacy policies that an average person would come across every year (McDonald and Cranor, 2008).
Although most users are concerned about their online privacy, Rudolph et al. (2018) report that a significant number do not make the effort to read privacy notices because they perceive them to be too time-consuming or too complicated (Obar and Oeldorf-Hirsch, 2018). While studies have suggested methods to improve the perception and accessibility of privacy policies by improving the manner in which the policy is presented, these improvements have not been adopted by many organisations. For example, Kelley et al. (2010) designed and tested a "privacy nutrition label" approach to presenting information. They found that users report higher accuracy, speed, and enjoyment in finding privacy information from a privacy nutrition label than from a standard privacy policy. Similarly, Schaub et al. (2015) introduce methods to ease the design of privacy notices and their integration.
A considerable amount of time and effort is required to understand the contents of privacy policies, and natural language processing (NLP) provides an opportunity to automate extraction of salient details from these documents. Existing research has achieved some success using small corpora of privacy policies, on the order of a few hundred (Wilson et al., 2016; Zimmeck et al., 2019) or a thousand (Ramanath et al., 2014). In order to better leverage state-of-the-art NLP techniques, a large collection of privacy policies is necessary. Also, analyzing a large corpus of website privacy policies will inform the current state of privacy practices on the web and assist the research community in addressing and understanding pressing privacy issues.
arXiv:2004.11131v1 [cs.IR] 23 Apr 2020
As such, we make the following contributions:
• Create the PrivaSeer Corpus¹: a corpus of one million English language website privacy policies, which is larger than any prior corpus.
• Design a detailed pipeline to further gather privacy policies.
• Explore the corpus by presenting an analysis of readability, key phrases, topics, and similarity comparisons of these policies.

Related Work
Although there is a lack of a web-scale corpus of privacy policies, much work has been done on analysing privacy policies. Some prior attempts involved manual analysis of privacy policies. Jensen and Potts (2004) evaluate the accessibility, writing, and content of 64 privacy policies. Cranor et al. (2013) compare the privacy policies of over 3000 financial institutions and report on a range of data collection practices.
Various automated attempts at analysis have been based on small sets of privately collected corpora. Costante et al. (2012) use a machine learning model to check the completeness of privacy policies and provide the user with the degree of coverage of important privacy policy categories using a corpus of 64 privacy policies. Analyses of readability show that privacy policies are difficult to read and require a college-level reading ability (Ermakova et al.; Fabian et al., 2017).
Previous releases of a corpus of privacy policies have led to a wide variety of research activity. Wilson et al. (2016) released the OPP-115 Corpus, a publicly available dataset of 115 privacy policies with manual annotations of 23k fine-grained data practices, and demonstrated the feasibility of partly automating the annotation process. The OPP-115 Corpus was used to create Polysis (Harkous et al., 2018), a system that uses deep learning to classify text in privacy policies and answer non-factoid questions. Similarly, the OPP-115 Corpus was used to train ten different machine learning models to summarise 400 privacy policies and answer predefined questions (Zaeem et al., 2018). The OPP-115 Corpus was further used to evaluate embeddings created from 150,000 Android Google Play Store app privacy policies (Kumar et al., 2019).
Few other publicly available website privacy policy corpora exist. Ramanath et al. (2014) released a corpus of 1010 website privacy policies and introduced an unsupervised approach to aligning privacy policy sections. Zimmeck et al. (2019) released a set of over 400k URLs to Android app privacy policy pages collected by crawling the Google Play store. They reported that 30% of the URLs do not link to analysable privacy policies. Their analysis of the policies includes classification of various privacy practices and analysis of potential compliance issues on 350 annotated policies.
¹ Corpus URL pending paper acceptance. Search engine can be accessed at http://privaseer.ist.psu.edu/

Document Collection
We used Common Crawl² to gather seed URLs to crawl for privacy policies from the web, as we describe in detail below. We filtered the Common Crawl URLs to get a set of possible links to website privacy policies. We then crawled the filtered set to obtain candidate privacy policy documents. The complete pipeline from the Common Crawl URL dump to the gold standard privacy policy corpus is shown in Figure 1.

Common Crawl
The Common Crawl Foundation is a non-profit which has been releasing large monthly internet web crawls since 2008. Monthly crawl archives provide a "snapshot of the web" by including re-crawls of popular domains (re-crawls from previous archives) and crawls of new domains. Common Crawl has also been releasing a domain-level webgraph from which the harmonic centrality of the crawled domains is calculated. This webgraph is used to sample popular domains that need to be re-crawled and to obtain new uncrawled domains.
We downloaded the URL dump of the May 2019 archive. Common Crawl reports that the archive contains 2.65 billion web pages, or 220 TB of uncompressed content, which were crawled between the 19th and 27th of May 2019. They also report that this archive contains 825 million URLs which were not contained in any previously released crawl archives. We applied a selection criterion to the downloaded URL dump to filter the URLs of likely privacy policy pages.

URL Selection
The online privacy paradigm follows the "Notice and Choice" framework. "Notice" is a presentation of terms, usually in the form of a privacy policy, and "Choice" is an action signifying the acceptance of terms (Sloan and Warner, 2014). As a consequence, organisations generally include a link to their privacy policy in the footer of the website landing page. Common names for this link are "Privacy Policy", "Privacy Notice", and "Data Protection". A secondary consequence of this informal standardisation is that privacy policy URLs also tend to have those words in them. Thus, we selected those URLs which had the word "privacy" or the words "data" and "protection" from the Common Crawl URL archive. We were able to extract 3.9 million URLs that fit this selection criterion. Informal experiments suggested that this selection of keywords was optimal for retrieving the most privacy policies with as few false positives as possible.
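The selection criterion above can be sketched in a few lines. This is a minimal illustration only: the function name, the sample URLs, and the choice of case-insensitive substring matching are assumptions, since the text does not specify how the keywords were matched.

```python
def is_candidate_policy_url(url: str) -> bool:
    """Return True if the URL contains "privacy", or both "data" and "protection".

    Assumption: matching is a case-insensitive substring test on the whole URL.
    """
    lowered = url.lower()
    return "privacy" in lowered or ("data" in lowered and "protection" in lowered)


# Hypothetical sample of URLs standing in for the Common Crawl URL dump.
sample_urls = [
    "https://example.com/privacy-policy",
    "https://example.org/about",
    "https://example.net/legal/data-protection",
    "https://example.com/blog/big-data",
]

candidates = [u for u in sample_urls if is_candidate_policy_url(u)]
```

A substring test of this kind is deliberately permissive, which is why a later classification stage is still needed to discard pages such as news articles that merely mention privacy in their URL.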

Web Crawling
We crawled the 3.9 million selected URLs using Scrapy³ for about 48 hours between the 4th and 10th of August 2019, for a few hours each day. Of these, 3.2 million URLs were successfully crawled, 0.4 million led to error pages, and 0.3 million were discarded as duplicates. While the full list consists of 76 different types of errors, Table 1 lists error types that had more than 10,000 instances. The table consists of HTTP errors and Scrapy errors which were thrown when no response was received from the queried page. We call these 3.2 million crawled web pages "candidate privacy policy documents" as it is uncertain how many are indeed privacy policies.

Document Filtering
To filter privacy policies from the candidates, we first determined the language of the candidates and selected only those that were in English. We then implemented a random forest classifier to separate the privacy policy web pages from the candidates. We further filtered out web pages that did not follow the "Notice and Choice" framework. Finally, we removed boilerplate from the privacy policy web pages and discarded duplicates to arrive at the gold standard corpus.

Language Detection
In order to identify the language (English vs others) of the candidate documents, we used the open-source Python package Langid (Lui and Baldwin, 2012). Langid is an off-the-shelf Naive Bayes-based classifier pretrained on 97 different languages which is able to achieve consistently high accuracy over a wide range of languages, domains, and lengths of text. We used Langid due to its convenience and high accuracy in language identification. Figure 2 depicts the language distribution of the top ten most common languages in the candidate set. The complete set of documents was divided into 97 languages and an unknown language category. We found that the vast majority of documents were in English. We discarded candidate documents that were not identified as English by Langid and were left with 2.1 million candidate documents.

Document Classification
The English language candidate document set consisted of web pages that satisfied our URL selection criteria. Thus it was important to separate privacy policy documents from ones that fit our criteria but were not actual privacy policies. We experimented with supervised and unsupervised approaches and attempted to classify documents using the URL of the web page and a bag of words approach.

Labelling
One of the researchers in the team spent eight hours labelling 1000 randomly selected candidate documents. Out of 1000, 740 were privacy policies and 260 were not privacy policies. Out of the documents that were not privacy policies, 93 were news articles, 27 were pages which had links to the privacy policy, 13 were e-commerce product pages, 12 were pages which advertised security devices, 6 were Twitter pages and the rest were other miscellaneous pages.

Unsupervised Machine Learning
For the unsupervised approach, we uniformly randomly sampled 100,000 documents and experimented with the K-Means (Lloyd, 1982) and DBSCAN (Ester et al.) algorithms. We tested our models using Doc2Vec (Le and Mikolov, 2014) and term frequency-inverse document frequency (tf-idf) vectorization techniques. The Doc2Vec technique was used to create 256-dimensional vectors and the tf-idf technique created vectors equal to the size of the vocabulary. Both were further reduced to 50 dimensions using Principal Component Analysis (PCA). A cosine distance metric was used to compare the document distances. For the K-Means algorithm, using Doc2Vec vectors we experimented with between 2 and 30 clusters and could not find a clear elbow point. As manual analysis of documents for various numbers of clusters did not reveal an intuitive explanation for the clusters formed, we discarded the results. When using the tf-idf vectors, although a clear elbow point was found at 7 clusters, a single cluster was found to contain almost all the documents while the rest of the clusters had between 1 and 1500 documents. As this did not seem like a reasonable separation, we discarded the results.
For the DBSCAN algorithm, using Doc2Vec vectors, over 70% of the data points fell into the noise category for any reasonable range of hyperparameters. As this did not seem reasonable, we discarded the results. Finally, when using the tf-idf vectors, after hyperparameter tuning using 200 labelled samples we found that two well-defined clusters were formed, with 80% of the samples falling under a single category and the other 20% falling under the noise category. The evaluation of this model can be found in Table 2. Figure 3 shows the class distribution of the 100,000 samples. In the figure, the points colored blue represent the single positive cluster while the points colored red represent noise.

Supervised Machine Learning
We trained a random forest classifier on 1000 manually labelled documents using features extracted from the URLs and the words in the web page. We trained three separate models: one using the features extracted from the URL, one using the features extracted from the web page, and one combined model using features from both.
For the URL model, the words in the URL path were extracted and the tf-idf of each term was recorded to create the features. As privacy policy URLs tend to be shorter and have fewer path segments, the URL length and the number of path segments were added as features. As the classes were unbalanced, we over-sampled from the minority class using the synthetic minority oversampling technique (Chawla et al., 2002). For the document model, we used tf-idf features after tokenizing the document using a regex tokenizer and removing stop words. The combined model was a combination of the URL and document features.
All the above models were trained using random forest classifiers.
The 1000 labelled documents were divided into 800 samples for training and 200 samples for testing. Table 2 shows the comparison of the results of all the above models. As the combined model had the best results, we retrained the model using all the 1000 labelled documents and ran it on the candidate document set. Out of 2.1 million English candidate privacy policies, 1.54 million were classified as privacy policies and the rest were discarded.
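As an illustration of the inputs to the URL model described above, the following sketch extracts the words in the URL path together with the URL length and the number of path segments. The function name, the dictionary keys, and the lowercase alphabetic word pattern are assumptions made for this example; the tf-idf weighting and the classifier itself are not shown.

```python
import re
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Extract simple URL-side features (hypothetical feature names)."""
    path = urlparse(url).path
    # Non-empty path segments, e.g. "/legal/privacy.html" -> ["legal", "privacy.html"].
    segments = [s for s in path.split("/") if s]
    # Assumption: "words" are maximal runs of ASCII letters in the lowercased path.
    words = re.findall(r"[a-z]+", path.lower())
    return {
        "path_words": words,
        "url_length": len(url),
        "num_path_segments": len(segments),
    }


features = url_features("https://example.com/legal/privacy-policy.html")
```

The `path_words` list would then be passed to a tf-idf vectorizer, with the two numeric features appended to the resulting vector.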

URL Cross Verification
Legal jurisdictions around the world require organisations to make their privacy policies readily available to their users. As a result, most organisations include a link to their privacy policy in the footer of their website landing page. In order to release a corpus of privacy policies that were legally authoritative, we cross-verified the URLs of the privacy policies in our corpus with those that we obtained by crawling the homepages (landing pages) of these domains.
Between the 8th and 10th of November 2019, for every domain in our corpus, we crawled the homepage and the links on the homepage that fit our previously defined URL selection criteria. We then gathered the URLs satisfying our selection criteria in order to cross-verify the URLs in our existing corpus. This approach was employed because we found that some websites did not directly link to their privacy policy. Instead, they had an intermediary page from the landing page which had links to privacy-related documents. After cross-verifying the URLs, we were left with 1.1 million privacy policy web pages.

Content Extraction
Privacy policies that were collected were found in the bodies of web pages and contained content other than the privacy policy. Many web pages had a header, a footer, and a left-hand navigation menu in addition to banners and advertisements. We refer to this extra content in a web page as boilerplate. Because boilerplate would not contribute to the enrichment of the corpus, we added boilerplate removal to our pipeline.
We used an open-source Python package called Dragnet (Peters and Lecocq, 2013) for this task. Dragnet uses a machine learning approach based on features extracted from the text and link density of the Document Object Model (DOM) elements in the web page. It also extracts semantic features from the names of HTML tag attributes. We used Dragnet due to its consistently high accuracy and ease of use. We ran Dragnet on the 1.1 million privacy policy web pages, thus removing boilerplate from the web pages.

Duplicate and Near-Duplicate Detection
Detecting duplicate and near-duplicate documents is an essential step for any corpus cleaning task. We tackled the problem of removing exact duplicates by hashing all the raw documents and discarding multiple copies of exact hashes. Exact duplicate removal was performed before the URL cross-verification task in the pipeline.
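Exact duplicate removal by hashing can be sketched as follows. The hash function is not named in the text, so SHA-256 is an assumption here, as is the function name.

```python
import hashlib


def drop_exact_duplicates(documents):
    """Keep the first copy of each document, comparing by content hash.

    Assumption: SHA-256 over the UTF-8 bytes of the raw document.
    """
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


unique_docs = drop_exact_duplicates(["policy A", "policy B", "policy A"])
```

Storing only the digests keeps memory use independent of document length, which matters when millions of raw pages are compared.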
To tackle the problem of near-duplicate detection, we used Simhashing (Charikar, 2002). Simhashing is a hashing technique in which similar inputs produce similar Simhashes. After the Simhash of each document is created, the document similarity is measured by calculating the Hamming distance between the document Simhashes (Manku et al., 2007). The implementation of the Simhash algorithm created 64-bit hashes. We first used the shingling (Broder et al., 1997) technique to create shingles of size 3 for each privacy policy. We then ran the Simhash algorithm on each of the shingled documents to obtain its 64-bit Simhash.
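The shingling and Simhash steps can be sketched as below. Several details are assumptions made for illustration and are not taken from the text: shingles are taken over words rather than characters, every shingle has equal weight, and the per-shingle hash is the first 8 bytes of an MD5 digest.

```python
import hashlib
import re


def shingles(text, size=3):
    """Return the set of word shingles of the given size (assumed word-level)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def simhash(features, bits=64):
    """Compute a Simhash fingerprint from an iterable of string features."""
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc = "we collect your email address and phone number"
fingerprint = simhash(shingles(doc))
```

Because each bit is decided by a majority vote over all shingles, changing a few shingles flips only a few bits, so near-duplicate documents end up with a small Hamming distance.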
In order to find the near-duplicate documents, we separated the documents by their domain and only compared Simhashes of documents that were from the same domain. This approach was taken since a few privacy policies had very similar wording, differing only by the organisation name or the website name (this finding is further discussed in Section 5). We found abundant examples of near-duplicate privacy policies on the same website. A typical example of this is observed when websites are available in multiple languages. Often, when websites are available in multiple languages, the privacy policies are in the same language across all the website locales with few or no changes between them. Thus, it is important to filter these near-duplicates which do not add any extra information.
Having separated the document URLs by their domain, we compared the Simhashes of all the documents within the same domain and obtained a list of all pairs of similar documents based on a Hamming distance threshold. We then filtered the duplicates based on a greedy approach. The remaining documents comprised the corpus.
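One possible reading of the per-domain greedy filtering is sketched below. The text gives neither the threshold nor the exact greedy rule, so both are assumptions: a document is kept only if its fingerprint is farther than `threshold` bits from every document already kept in the same domain, the default threshold of 3 is a placeholder, and the URL host stands in for the domain.

```python
from collections import defaultdict
from urllib.parse import urlparse


def hamming_distance(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def greedy_near_duplicate_filter(docs, threshold=3):
    """docs is a list of (url, simhash) pairs; returns the URLs that are kept.

    Assumptions: first-seen documents win, and the host part of the URL
    is used as the domain.
    """
    by_domain = defaultdict(list)
    for url, fingerprint in docs:
        by_domain[urlparse(url).netloc].append((url, fingerprint))

    kept_urls = []
    for domain_docs in by_domain.values():
        kept = []
        for url, fingerprint in domain_docs:
            if all(hamming_distance(fingerprint, other) > threshold for _, other in kept):
                kept.append((url, fingerprint))
        kept_urls.extend(url for url, _ in kept)
    return kept_urls


# Hypothetical fingerprints: the two locale pages on a.com differ by one bit.
sample = [
    ("https://a.com/en/privacy", 0b1111),
    ("https://a.com/fr/privacy", 0b1110),
    ("https://b.com/privacy", 0b1111),
    ("https://a.com/cookies", 0xFFFF0000),
]
kept = greedy_near_duplicate_filter(sample)
```

Note that the identical fingerprint on b.com is retained, since comparisons are restricted to documents from the same domain.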

Corpus Analysis
The corpus consists of 1,005,781 privacy policies from 995,487 different web domains. Figure 4 shows a histogram of the lengths of the privacy policies in number of words. After removing outliers more than 6 standard deviations from the mean, privacy policies in this corpus have a mean length of about 1410 words and range from a minimum of 24 words to a maximum of more than 71k words. Further details of the length distribution are in Table 3.
Figure 5 shows a bar chart of the distribution of the top ten top-level domains (TLDs) of the policies in the corpus. The corpus has policies from over 800 different TLDs. While .com, .org, .net and .info make up a major share of the corpus, country-level domains like .uk, .au, .ca and .du depict the variety of the sources of privacy policies.

Readability
Readability of a text can be defined as the ease of understanding or comprehension due to the style of writing (Klare et al., 1963). Along with length, readability plays a part in internet users' decisions to either read or ignore a privacy policy (Ermakova et al.). Since no single readability measure comprehensively captures readability, we report the readability of the privacy policies in our corpus using well-established formulae, namely the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKG), Simple Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI).
The readability scores on all the metrics follow a normal distribution for the privacy policies in the corpus. Table 3 shows the distribution of the scores. As indicated by the FRES score, the readability of the privacy policies ranges from very easy to read (scores between 80 and 100) to very difficult to read (scores between 0 and 20), with the mean score of 40 suggesting that on average privacy policies are difficult to comprehend. The FKG scores suggest that a few years of college education is required to understand the average privacy policy. This is backed up by the SMOG scores, which suggest that around 15 years of formal education (which amounts to a college junior or sophomore) is required. The most optimistic scores come from CLI, which suggests that the average privacy policy is at the reading level of a college freshman. These results are consistent with prior research and suggest that on average privacy policies are difficult to read and are at the college reading level.
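For reference, the four readability measures reported above can be written as functions of simple text counts using their standard published formulae. This is a sketch only: the function and parameter names are illustrative, the counts (words, sentences, syllables, polysyllabic words, letters) are assumed to be computed elsewhere, and the text does not state which implementation was used for the corpus.

```python
import math


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease Score: higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog_index(polysyllables, sentences):
    """SMOG grade from the count of words with three or more syllables."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


def coleman_liau_index(letters, words, sentences):
    """Coleman-Liau Index from letters and sentences per 100 words."""
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


# Hypothetical counts for a short passage: 100 words, 5 sentences, 150 syllables.
fres = flesch_reading_ease(words=100, sentences=5, syllables=150)
fkg = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
```

FRES is the only one of the four on an inverted scale, which is why a mean of 40 indicates difficult text while the grade-level measures report values in the college range.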

Topic Modelling
Topic modelling is an unsupervised machine learning method which extracts the most probable distribution of words into topics through an iterative process. We used topic modelling to explore the distribution of privacy practices in our corpus.
The OPP-115 Corpus (Wilson et al., 2016) introduced a labeling scheme of privacy practices based on input from legal experts. They followed a bottom-up approach and identified different categories from analysis of data practices in privacy policies. We followed a top-down approach and applied topic modelling to the corpus in order to extract common themes for paragraphs.
We used Latent Dirichlet Allocation (LDA), an unsupervised approach to topic modelling. As topic modelling with LDA works well when each input document deals with a single topic, we divided each privacy policy into its constituent paragraphs (Sarne et al., 2019), tokenized the paragraphs using a regex character matching tokenizer and lemmatised the individual words using NLTK's WordNet lemmatizer⁴. As the number of topics is a parameter that needs to be tuned for an LDA model, we experimented with topic sizes of 6, 7, 8, 10 and 15. We manually evaluated the topic clusters by inspecting the words that most represented the topics. We noted that the cohesiveness of the topics decreased as the number of topics increased. We chose a topic size of 7, since larger topic sizes produced markedly less coherent topics.
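The paragraph-level preprocessing that precedes LDA can be sketched as follows. The paragraph delimiter (blank lines) and the token pattern (runs of ASCII letters) are assumptions, as the text does not give them; lemmatisation and the LDA model itself are omitted to keep the example dependency-free.

```python
import re


def split_paragraphs(policy_text):
    """Split a policy into paragraphs (assumed to be separated by blank lines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", policy_text) if p.strip()]


def tokenize(paragraph):
    """Lowercase regex tokenizer (assumed pattern: runs of ASCII letters)."""
    return re.findall(r"[a-z]+", paragraph.lower())


policy = "We collect your email.\n\nWe use cookies and GDPR."
# One token list per paragraph; each list is one input document for the topic model.
paragraph_tokens = [tokenize(p) for p in split_paragraphs(policy)]
```

Treating each paragraph as its own document follows the observation above that LDA works best when each input deals with a single topic.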
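The vocabulary for each topic and an interpreted category name are shown in Table 4. The terms listed in the vocabulary column represent the topic in decreasing order of probability, cut off at a threshold. While some of the topics found using our technique match the ones introduced by Wilson et al. (2016) using the OPP-115 Corpus, we found that a few fine-grained privacy practice categorizations appeared as separate topics in our method. The topics First Party Collection, Third Party Collection and Policy Change match the OPP-115 Corpus categorization, while the topics European Audiences, Cookies and Tracking and Disclosure of Information appear as subcategories of International and Specific Audiences, First Party Collection and Third Party Collection in the OPP-115 Corpus respectively. The topic Data Security and Contact appears to be a combination of the OPP-115 Corpus Data Security category and the Other category. It is likely that the misalignment of OPP-115 categories and LDA topics comes from a difference in approaches: the OPP-115 categories represent themes that privacy experts expect to find in privacy policies, which diverge from the actual distribution of themes in this genre of text.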

Keyphrase Extraction
Given the issues with length, readability, and the time investment necessary in reading a privacy policy, keywords and keyphrases lend themselves well to summarizing the content of a privacy policy. In order to summarize the content of the corpus and to depict the performance of classic keyphrase extraction techniques on privacy policies, we used two well-established keyphrase extraction techniques on the privacy policies of the corpus: RAKE (Rose et al., 2010) and TextRank (Mihalcea and Tarau, 2004).
Top keyphrases shown in Table 5 were obtained using both the RAKE and TextRank algorithms. We ran both algorithms separately on all the privacy policies. Scores of keyphrases obtained for each privacy policy were normalized and summed up to obtain the final scores. The keyphrases shown in the table are among the top 50 keyphrases. Some of the keyphrases were omitted as they represented redundant information. From the table, we can see how information from different sections of the privacy policy has been captured. For example, the phrases email address, ip address, phone number, and credit card information are all suggestive of first party user information that organisations collect.
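The normalise-and-sum aggregation of per-policy keyphrase scores can be sketched as below. The normalisation scheme is not specified in the text; dividing each policy's scores by that policy's maximum score is an assumption, and the input scores are made up for illustration rather than produced by RAKE or TextRank.

```python
from collections import Counter


def aggregate_keyphrase_scores(per_document_scores):
    """Sum per-document keyphrase scores after per-document normalisation.

    Assumption: each document's scores are divided by that document's
    maximum score, so every document contributes at most 1.0 per phrase.
    """
    totals = Counter()
    for scores in per_document_scores:
        top = max(scores.values(), default=0)
        if top <= 0:
            continue  # skip documents with no usable scores
        for phrase, score in scores.items():
            totals[phrase] += score / top
    return totals.most_common()


# Hypothetical keyphrase scores for two policies.
ranked = aggregate_keyphrase_scores([
    {"email address": 4.0, "ip address": 2.0},
    {"email address": 1.0, "third party": 1.0},
])
```

Normalising per document before summing prevents a handful of very long policies with large raw scores from dominating the corpus-level ranking.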

Similarity of Web Privacy Policies
Privacy policies follow a similar structure. We noted that a few privacy policies had exactly the same wording in multiple sections, differing only by the organisation name.
We used the Simhash technique and the Jaccard Similarity Index to compare textual similarity in privacy policies. We uniformly randomly sampled 11,000 documents from the corpus and created shingles of window size 3 for each document. We then calculated the Jaccard Similarity Index between the first 1000 documents and the remaining 10,000 documents in our random sample. We also calculated the Simhash of each shingled document and obtained a 64-bit representation of each document. We then calculated the Hamming distance between the first 1000 documents and the remaining 10,000 documents.
The similarity of 10 million document pairs using the Jaccard Similarity Index is shown in Figure 6. The figure suggests that the majority of document pairs are distinct while a small number of privacy policies are very similar to each other.
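The Jaccard comparison on shingled documents can be sketched as follows. Word-level shingles are an assumption (the text gives only the window size of 3), and the two short sentences are hypothetical stand-ins for full policies.

```python
import re


def word_shingles(text, size=3):
    """Set of word shingles of the given window size (assumed word-level)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard Similarity Index: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


first = word_shingles("we do not sell your data")
second = word_shingles("we do not share your data")
similarity = jaccard(first, second)
```

In this example a single changed word alters three of the four shingles in each sentence, so the index drops to 1/7; this sensitivity is why high Jaccard values indicate shared language at the sentence or paragraph level.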

Thousands of document pairs have a high index, suggesting that they share language at a sentence level or even at a paragraph level. The results from comparing Hamming distances were consistent with the above finding and followed a normal distribution with a mean Hamming distance of 30.9 bits and a standard deviation of 4.3 bits. It is not surprising that the number of dissimilar pairs vastly overshadows the number of similar ones, as a single uniquely written document will have a low similarity index with all other documents in the corpus.

RAKE / TextRank
personal information, privacy policy, personally identifiable information, third party, please contact us, email address, ip address, credit card information, google analytics information, personal information, privacy, data protection, third party, email address, web site, service, ip address, phone number

Conclusion
We created the first large scale corpus of website privacy policies, PrivaSeer, consisting of just over 1M documents. We designed a novel pipeline that was used to build the corpus, which included web crawling, language detection, document classification, duplicate removal, document cross verification, boilerplate removal, and near duplicate removal.
Topic modelling and keyphrase extraction showed the distribution of themes of privacy practices in privacy policies. Four out of the seven topics found by LDA dealt with first party and third party collection of information, which suggests their abundance in privacy policies.
The readability of privacy policies was found to be consistent with prior research and verified that privacy policies are long and difficult to comprehend and are at a college reading level. In addition, we found that a number of privacy policies have very similar phrasing.
Finally, we intend to release this corpus for further research under a Creative Commons license and to build a search engine for discovery.

Figure 2: Candidate documents language distribution

Table 2: Document classification

using the OPP-115 Corpus, we found that a few fine-grained privacy practice categorizations appeared as separate topics in our method. The topics First Party Collection, Third Party Collection and Policy Change match the OPP-115 Corpus categorization, while the topics European Audiences, Cookies and Tracking and Disclosure of Information appear as subcategories of International and Specific Audiences, First Party Collection and Third Party Collection in the OPP-115 Corpus respectively. The topic Data Security and Contact appears to be a combination of the OPP-115 Corpus Data Security category and the Other category. It is likely that the misalignment of OPP-115 categories and LDA topics comes from a difference in approaches: the OPP-115 categories represent themes that privacy experts expect to find

⁴ http://nltk.org/

Table 4: Vocabulary and interpreted topics for LDA based topic modelling
First Party Collection: …, email, provide, address, use, personal, collect, user, product, name, contact, number
Third Party Collection: google, com, party, third, advertising, ad, website, http, analytics, data, www, user, service, may, network, social, opt, facebook
Data Security and Contact: information, us, personal, security, contact, access, please, secure, data, request, protect, question
Policy Changes: privacy, policy, site, website, use, information, change, service, may, time, term, practice, link
European Audiences: data, personal, processing, right, purpose, subject, consent, protection, legal, interest, process, request, controller, gdpr, legitimate
Cookies and Tracking: cookie, website, use, site, information, browser, may, web, data, user, page, service, visit, computer, used, party, ip
Disclosure of Information: information, service, may, party, third, use, personal, provide, product, purpose, us, company, business, disclose, customer