Intent Classification and Slot Filling for Privacy Policies

Understanding privacy policies is crucial for users as it empowers them to learn about the information that matters to them. Sentences written in a privacy policy document explain privacy practices, and the constituent text spans convey further specific information about that practice. We refer to predicting the privacy practice explained in a sentence as intent classification and identifying the text spans sharing specific information as slot filling. In this work, we propose PolicyIE, an English corpus consisting of 5,250 intent and 11,788 slot annotations spanning 31 privacy policies of websites and mobile applications. The PolicyIE corpus is a challenging real-world benchmark with limited labeled examples, reflecting the cost of collecting large-scale annotations from domain experts. We present two alternative neural approaches as baselines: (1) formulating intent classification and slot filling as a joint sequence tagging task and (2) modeling them as a sequence-to-sequence (Seq2Seq) learning task. The experiment results show that both approaches perform comparably in intent classification, while the Seq2Seq method outperforms the sequence tagging approach in slot filling by a large margin. We perform a detailed error analysis to reveal the challenges of the proposed corpus.


Introduction
Privacy policies inform users about how a service provider collects, uses, and maintains the users' information. The service providers collect the users' data via their websites or mobile applications and analyze them for various purposes. The users' data often contain sensitive information; therefore, the users must know how their information will be used, maintained, and protected from unauthorized and unlawful use. Privacy policies are meant to explain all these use cases in detail. This often makes privacy policies very long, complicated, and confusing (McDonald and Cranor, 2008; Reidenberg et al., 2016). As a result, users tend not to read privacy policies (Commission et al., 2012; Gluck et al.; Marotta-Wurgler, 2015), leading to undesirable consequences. For example, users might not be aware of their data being sold to third-party advertisers even if they have given their consent to the service providers to use their services in return. Therefore, automating information extraction from verbose privacy policies can help users understand their rights and make informed decisions.
In recent years, we have seen substantial efforts to utilize natural language processing (NLP) techniques to automate privacy policy analysis. In literature, information extraction from policy documents is formulated as text classification (Wilson et al., 2016a;Harkous et al., 2018;Zimmeck et al., 2019), text alignment (Liu et al., 2014;Ramanath et al., 2014), and question answering (QA) (Shvartzshanider et al., 2018;Harkous et al., 2018;Ravichander et al., 2019;Ahmad et al., 2020). Although these approaches effectively identify the sentences or segments in a policy document relevant to a privacy practice, they lack in extracting fine-grained structured information. As shown in the first example in Table 1, the privacy practice label "Data Collection/Usage" informs the user how, why, and what types of user information will be collected by the service provider. The policy also specifies that users' "username" and "icon or profile photo" will be used for "marketing purposes". This informs the user precisely what and why the service provider will use users' information.
The challenge in training models to extract fine-grained information is the lack of labeled examples. Annotating privacy policy documents is expensive as they can be thousands of words long and require domain experts (e.g., law students). Therefore, prior works annotate privacy policies at the sentence level, without further utilizing the constituent text spans that convey specific information. We formulate the task as predicting the privacy practice explained in a sentence (intent classification) and identifying the constituent text spans that share further specific information (slot filling). Table 1 shows a couple of examples. This formulation of information extraction lifts users' burden to comprehend relevant segments in a policy document and identify the details, such as how and why users' data are collected and shared with others.
To facilitate fine-grained information extraction, we present PolicyIE, an English corpus consisting of 5,250 intent and 11,788 slot annotations over 31 privacy policies of websites and mobile applications. We perform experiments using sequence tagging and sequence-to-sequence (Seq2Seq) learning models to jointly model intent classification and slot filling. The results show that both modeling approaches perform comparably in intent classification, while Seq2Seq models outperform the sequence tagging models in slot filling by a large margin. We conduct a thorough error analysis and categorize the errors into seven types. We observe that sequence tagging approaches miss more slots, while Seq2Seq models predict more spurious slots. We further discuss the error cases by considering other factors to help guide future work. We release the code and data at https://github.com/wasiahmad/PolicyIE to facilitate research.

Construction of PolicyIE Corpus

Privacy Policies Selection
The scope of privacy policies primarily depends on how service providers function. For example, service providers relying primarily on mobile applications (e.g., Viber, Whatsapp) or on both websites and applications (e.g., Amazon, Walmart) have different privacy practices detailed in their privacy policies. In PolicyIE, we want to achieve broad coverage across privacy practices exercised by the service providers such that the corpus can serve a wide variety of use cases. Therefore, we go through the following steps to select the policy documents.

Initial Collection Ramanath et al. (2014) introduced a corpus of 1,010 privacy policies of the top websites ranked on Alexa.com. Since the released privacy policies are outdated, we crawled those websites' privacy policies in November 2019. For mobile application privacy policies, we scraped application information from the Google Play Store using the play-scraper public API and crawled their privacy policies. We ended up with 7,500 mobile applications' privacy policies.

Filtering First, we filter out privacy policies written in a non-English language and mobile application privacy policies with an app review rating below 4.5. Then we filter out privacy policies that are too short (< 2,500 words) or too long (> 6,000 words). Finally, we randomly select 200 website and 200 mobile application privacy policies (400 documents in total).
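The filtering steps above can be sketched as follows. This is a minimal illustration; the field names (`text`, `language`, `source`, `rating`) are invented for this sketch and are not from the released code.

```python
# Sketch of the document filtering and balanced sampling described above.
# Assumes each crawled policy is a dict with hypothetical fields:
# "text", "language", "source" ("website" or "app"), and "rating".
import random

def filter_policies(policies, min_words=2500, max_words=6000, min_rating=4.5):
    kept = []
    for p in policies:
        if p["language"] != "en":
            continue  # drop non-English policies
        if p["source"] == "app" and p["rating"] < min_rating:
            continue  # drop policies of low-rated mobile apps
        n_words = len(p["text"].split())
        if not (min_words <= n_words <= max_words):
            continue  # drop too-short or too-long documents
        kept.append(p)
    return kept

def sample_balanced(policies, per_source=200, seed=0):
    rng = random.Random(seed)
    sites = [p for p in policies if p["source"] == "website"]
    apps = [p for p in policies if p["source"] == "app"]
    return rng.sample(sites, per_source) + rng.sample(apps, per_source)
```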

Post-processing We ask a domain expert (working in the security and privacy domain for more than three years) to examine the selected 400 privacy policies. The goal of the examination is to ensure that the policy documents cover the four privacy practices: (1) Data Collection/Usage, (2) Data Sharing/Disclosure, (3) Data Storage/Retention, and (4) Data Security/Protection. These four practices cover how a service provider processes users' data in general and are included in the General Data Protection Regulation (GDPR). Finally, we shortlist 50 policy documents for annotation, 25 in each category (websites and mobile applications).

Data Annotation
Annotation Schema To annotate sentences in a policy document, we consider the first four privacy practices from the annotation schema suggested by Wilson et al. (2016a). Therefore, we perform sentence categorization under five intent classes that are described below.
(1) Data Collection/Usage: what, why, and how user information is collected; (2) Data Sharing/Disclosure: what, why, and how user information is shared with or collected by third parties; (3) Data Storage/Retention: how long and where user information will be stored; (4) Data Security/Protection: protection measures for user information; (5) Other: privacy practices that do not fall into the above four categories. Apart from annotating sentences with privacy practices, we aim to identify the text spans in sentences that explain specific details about the practices. For example, in the sentence "we collect personal information in order to provide users with a personalized experience", the underlined text span conveys the purpose of data collection. In our annotation schema, we refer to the identification of such text spans as slot filling. There are 18 slot labels in our annotation schema (provided in Appendix). We group the slots into two categories, type-I and type-II, based on their role in privacy practices. Type-I slots cover the participants of privacy practices, such as Data Provider and Data Receiver, while type-II slots cover purposes and conditions that characterize further details of privacy practices. Note that type-I and type-II slots may overlap; e.g., in the previous example, the underlined text span is the purpose of data collection, and the span "user" is the Data Provider (whose data is collected). In general, type-II slots are longer (consisting of more words) and less frequent than type-I slots.
In total, there are 14 type-I and 4 type-II slots in our annotation schema. These slots are associated with a list of attributes; e.g., Data Collected and Data Shared have the attributes Contact Data, Location Data, Demographic Data, etc. We hire two law students to perform the annotation. We use the web-based annotation tool BRAT (Stenetorp et al., 2012) to conduct the annotation. We write a detailed annotation guideline and pretest it through multiple rounds of pilot studies. The guideline is further updated with notes to resolve complex or corner cases during the annotation process. The annotation process is closely monitored by a domain expert and a legal scholar and is granted an exemption by the Institutional Review Board (IRB). The annotators are presented with one segment from a policy document at a time and asked to perform annotation following the guideline. We manually segment the policy documents such that a segment discusses similar issues, to reduce ambiguity at the annotator end. The annotators worked 10 weeks, with an average of 10 hours per week, and completed annotations for 31 policy documents. Each annotator is paid $15 per hour.

Post-editing and Quality Control
We compute an inter-annotator agreement for each annotated segment of policy documents using Krippendorff's alpha (αK) (Krippendorff, 1980). The annotators are asked to discuss their annotations and re-annotate the sections with token-level αK falling below 0.75; an αK value within the range of 0.67 to 0.8 is considered acceptable for tentative conclusions (Artstein and Poesio, 2008; Reidsma and Carletta, 2008). After the re-annotation process, we calculate the agreement for the two categories of slots individually. The inter-annotator agreement is 0.87 and 0.84 for type-I and type-II slots, respectively. Then the adjudicators discuss and finalize the annotations. The adjudication process involves one of the annotators, the legal scholar, and the domain expert.
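For two annotators and nominal token labels with no missing data, the agreement statistic above can be computed as in the following minimal sketch (label names are illustrative):

```python
# Minimal token-level Krippendorff's alpha for exactly two annotators,
# nominal labels, no missing data. Each list holds one label per token.
from collections import Counter
from itertools import permutations

def krippendorff_alpha(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    # Build the coincidence matrix: each token contributes both ordered pairs.
    coincidence = Counter()
    for a, b in zip(labels_a, labels_b):
        coincidence[(a, b)] += 1
        coincidence[(b, a)] += 1
    n = sum(coincidence.values())  # total pairable values (2 per token)
    marginals = Counter()
    for (a, _), count in coincidence.items():
        marginals[a] += count
    observed = sum(c for (a, b), c in coincidence.items() if a != b) / n
    expected = sum(marginals[a] * marginals[b]
                   for a, b in permutations(marginals, 2)) / (n * (n - 1))
    if expected == 0:
        return 1.0  # every token carries the same single label
    return 1.0 - observed / expected
```

A segment whose score falls below the 0.75 threshold used above would then be queued for re-annotation.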
Table 2 presents the statistics of the PolicyIE corpus. The corpus consists of 15 and 16 privacy policies of websites and mobile applications, respectively. We release the annotated policy documents split into sentences.

Each sentence is associated with an intent label, and its constituent words are associated with slot labels (following the BIO tagging scheme).
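The BIO encoding can be illustrated with a short sketch; the slot names follow the schema, but the sentence and spans here are invented:

```python
# Sketch of the BIO encoding described above: each word receives B-<slot>
# (begin), I-<slot> (inside), or O (outside any slot).
def bio_encode(words, spans):
    """spans: list of (start_word, end_word_exclusive, slot_label)."""
    labels = ["O"] * len(words)
    for start, end, slot in spans:
        labels[start] = "B-" + slot
        for i in range(start + 1, end):
            labels[i] = "I-" + slot
    return labels

words = "we collect personal information for marketing purposes".split()
# "personal information" -> data-collected; "for marketing purposes" -> purpose
labels = bio_encode(words, [(2, 4, "data-collected"), (4, 7, "purpose")])
# labels == ["O", "O", "B-data-collected", "I-data-collected",
#            "B-purpose", "I-purpose", "I-purpose"]
```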

Model & Setup
PolicyIE provides annotations of privacy practices and corresponding text spans in privacy policies. We refer to predicting the privacy practice of a sentence as intent classification and identifying the text spans as slot filling. We present two alternative approaches: the first jointly models intent classification and slot tagging (Chen et al., 2019), and the second casts the problem as a sequence-to-sequence learning task (Rongali et al., 2020; Li et al., 2021).

Sequence Tagging
Following Chen et al. (2019), given a sentence s = w_1, ..., w_l from a privacy policy document D (we split the policy documents into sentences using UDPipe (Straka et al., 2016)), a special token (w_0 = [CLS]) is prepended to form the input sequence that is fed to an encoder. The encoder produces contextual representations of the input tokens h_0, h_1, ..., h_l, where h_0 and h_1, ..., h_l are fed to separate softmax classifiers to predict the target intent and slot labels.
The classifiers are defined as

y^i = softmax(W^i h_0 + b^i),    y^s_t = softmax(W^s h_t + b^s),  t = 1, ..., l,

where W^i, W^s, b^i, b^s are parameters, and I, S are the total number of intent and slot types, respectively. The sequence tagging model (composed of an encoder and a classifier) learns to maximize the following conditional probability to perform intent classification and slot filling jointly:

p(y^i, y^s_1, ..., y^s_l | s) = p(y^i | s) Π_{t=1}^{l} p(y^s_t | s).
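A shape-level sketch of these two classifier heads, in pure Python with tiny random weights. This is purely illustrative: the actual baselines use trained encoders, and the dimensions below are made up.

```python
# Joint model heads: the [CLS] state h_0 feeds the intent classifier,
# and h_1..h_l each feed a shared per-token slot classifier.
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def linear(W, b, h):  # W: list of rows, b: bias vector, h: input vector
    return [sum(wi * hi for wi, hi in zip(row, h)) + bj
            for row, bj in zip(W, b)]

rng = random.Random(0)
d, l, num_intents, num_slots = 8, 5, 5, 37  # 37 = 18 slots x {B, I} + O

H = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(l + 1)]  # h_0..h_l
W_i = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_intents)]
W_s = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_slots)]
b_i, b_s = [0.0] * num_intents, [0.0] * num_slots

intent_probs = softmax(linear(W_i, b_i, H[0]))            # p(intent | s)
slot_probs = [softmax(linear(W_s, b_s, h)) for h in H[1:]]  # p(slot_t | s)
```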
We train the models end-to-end by minimizing the cross-entropy loss. Table 3 shows an example of input and output to train the joint intent and slot tagging models. Since type-I and type-II slots have different characteristics (as discussed in § 2.2) and may overlap, we train two separate sequence tagging models for type-I and type-II slots to keep the baseline models simple. We use a BiLSTM and pre-trained Transformer encoders (BERT, RoBERTa); for the BiLSTM encoder, the special token ([CLS]) embedding is formed by applying average pooling over the input word embeddings. We train WordPiece embeddings with a 30,000 token vocabulary (Devlin et al., 2019) using fastText (Bojanowski et al., 2017) based on a corpus of 130,000 privacy policies collected from apps on the Google Play Store (Harkous et al., 2018). We use the hidden state corresponding to the first WordPiece of a token to predict the target slot labels.

[Table 4: Test set performance of the sequence tagging models on the PolicyIE corpus. We individually train and evaluate the models on intent classification and type-I and type-II slot tagging, and report the average intent F1 score.]
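The first-WordPiece convention for slot labels can be sketched as follows; this is a common implementation pattern, and the `IGNORE` index and helper below are illustrative rather than taken from the released code:

```python
# Aligning token-level slot labels to WordPieces: only the first piece of
# each token keeps the token's label; continuation pieces are masked out
# so the loss and the slot predictions skip them.
IGNORE = -100  # a conventional "ignore this position" index

def align_labels(wordpieces, labels, label2id):
    """wordpieces: list of lists, one sublist of pieces per original token;
    labels: one BIO label per original token."""
    aligned = []
    for pieces, label in zip(wordpieces, labels):
        aligned.append(label2id[label])           # first piece: real label
        aligned.extend([IGNORE] * (len(pieces) - 1))  # rest: masked
    return aligned

pieces = [["user", "##name"], ["and"], ["icon"]]
labs = ["B-data-collected", "O", "B-data-collected"]
ids = align_labels(pieces, labs, {"O": 0, "B-data-collected": 1})
# ids == [1, -100, 0, 1]
```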
Conditional random fields (CRF) are known to help structured prediction tasks, such as semantic role labeling (Zhou and Xu, 2015) and named entity recognition (Cotterell and Duh, 2017). Therefore, we also model slot labeling jointly using a CRF (Lafferty et al., 2001), where only interactions between two successive labels are considered. We refer the readers to Ma and Hovy (2016) for details.

Sequence-to-Sequence Learning
Recent works in semantic parsing (Rongali et al., 2020; Zhu et al., 2020; Li et al., 2021) formulate the task as sequence-to-sequence (Seq2Seq) learning. Taking this as a motivation, we investigate the scope of Seq2Seq learning for joint intent classification and slot filling on privacy policy sentences; Table 3 shows an example of the input and target sequences. Human performance is computed by considering each annotator's annotations as predictions and the adjudicated annotations as the reference; the final score is an average across all annotators.
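As a hedged sketch of what such a Seq2Seq target might look like: the bracketed template below is modeled on TOP-style parses (Rongali et al., 2020) and on the "[IN:Other]" outputs discussed in the error analysis, but the exact format used for training may differ.

```python
# Building and parsing an illustrative bracketed target sequence:
# an intent tag wraps a list of labeled slot spans.
import re

def build_target(intent, slots):
    """slots: list of (label, span_text) pairs."""
    if not slots:
        return f"[IN:{intent}]"
    inner = " ".join(f"[SL:{label} {text}]" for label, text in slots)
    return f"[IN:{intent} {inner}]"

def parse_target(seq):
    """Recover the intent label and slot (label, span) pairs."""
    intent = re.match(r"\[IN:([\w/-]+)", seq).group(1)
    slots = re.findall(r"\[SL:([\w-]+) ([^\]]+)\]", seq)
    return intent, slots
```

Parsing the generated sequence back into labels is what allows the Seq2Seq predictions to be scored with the same intent and slot metrics as the tagging models.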

Experiment Results & Analysis
We aim to address the following questions. 1. How do the two modeling approaches perform on our proposed dataset ( § 4.1)? 2. How do they perform on different intent and slot types ( § 4.2)? 3. What type of errors do the best performing models make ( § 4.3)?

Main Results
Sequence Tagging The overall performances of the sequence tagging models are presented in Table 4. The pre-trained models, BERT and RoBERTa, outperform the other baselines by a large margin. With a conditional random field (CRF) layer, the models boost the slot tagging performance at the cost of a slight degradation in intent classification performance. For example, the RoBERTa + CRF model improves over RoBERTa by 2.8% and 3.9% in terms of type-I slot F1 and EM, with a 0.5% drop in intent F1 score. The results indicate that predicting type-II slots is more difficult than type-I slots, as the two differ in length (type-I slots are mostly phrases, while type-II slots are clauses) and type-II slots are less frequent in the training examples. However, the EM accuracy for type-I slots is lower than for type-II slots because sentences contain more type-I slots (∼4.75 per sentence on average) than type-II slots (∼1.38); note that if a model fails to predict even one of the slots in a sentence, EM will be zero.
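The slot F1 and EM metrics discussed above can be sketched as follows, assuming exact label-and-span matching; this is an illustration, not the official evaluation script.

```python
# Corpus-level slot precision/recall/F1 plus sentence-level exact match (EM):
# a predicted slot counts as correct only if both its label and its span
# match a reference slot; EM requires every slot of a sentence to match.
from collections import Counter

def slot_prf(pred_slots, gold_slots):
    """Each argument: one list of (label, span) tuples per sentence."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_slots, gold_slots):
        overlap = sum((Counter(pred) & Counter(gold)).values())
        tp += overlap
        fp += len(pred) - overlap
        fn += len(gold) - overlap
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    em = sum(Counter(pr) == Counter(go)
             for pr, go in zip(pred_slots, gold_slots)) / len(gold_slots)
    return p, r, f1, em
```

Under this definition a single missed or spurious slot in a sentence drives its EM contribution to zero, which explains why EM is much lower than F1 throughout the results.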
Seq2Seq Learning Seq2Seq models predict the intent and slots by generating the labels and spans following a template. Then we extract the intent and slot labels from the generated sequences. The experiment results are presented in Table 5. To our surprise, we observe that all the models perform well in predicting intent and slot labels. The best performing model is BART (according to slot F1 score) with 400 million parameters, outperforming its smaller variant by 10.1% and 2.8% in terms of slot F1 for type-I and type-II slots, respectively.

Sequence Tagging vs. Seq2Seq Learning

It is evident from the experiment results that Seq2Seq models outperform the sequence tagging models in slot filling by a large margin, while in intent classification, they are competitive. However, both modeling approaches perform poorly in predicting all the slots in a sentence correctly, resulting in a lower EM score. One interesting observation is that the Seq2Seq models significantly outperform sequence tagging models in predicting type-II slots. Type-II slots are longer and less frequent, and we suspect conditional text generation helps Seq2Seq models predict them accurately. In comparison, we suspect that due to fewer labeled examples of type-II slots, the sequence tagging models perform poorly on that category (as noted before, we train the sequence tagging models for type-I and type-II slots individually). Next, we break down the performances of RoBERTa (w/ CRF) and BART, the best performing models in their respective categories, followed by an error analysis to shed light on the error types.

Performance Breakdown
Intent Classification In the PolicyIE corpus, 38% of the sentences fall into the first four categories (Data Collection, Data Sharing, Data Storage, Data Security), and the remaining belong to the Other category. Therefore, we investigate how often the models confuse the intent labels. We provide the confusion matrix of the models in Appendix. Due to the imbalanced distribution of labels, BART makes many incorrect predictions. We notice that BART is confused most between the Data Collection and Data Storage labels. Our manual analysis reveals that BART confuses the slot labels {"Data Collector", "Data Holder"} and {"Data Retained", "Data Collected"} as they are often associated with the same text span, and we suspect this leads to BART's confusion between the corresponding intent labels. Table 6 presents the performance breakdown across intent labels.

Slot Filling
We break down the models' performances in slot filling under two settings. First, Table 6 shows slot filling performance under different intent categories. Among the four classes, the models perform worst on slots associated with the "Data Security" intent class, as PolicyIE has the lowest amount of annotations for that intent category. Second, we demonstrate the models' performances on different slot types in Figure 1. RoBERTa's recall score for the "polarity", "protect-against", "protection-method", and "storage-place" slot types is zero, as these slot types have the lowest amount of training examples in PolicyIE. On the other hand, BART achieves a higher recall score, especially for the "polarity" label, as its corresponding spans are short. We also study the models' performances on slots of different lengths. The results show that BART outperforms RoBERTa by a larger margin on longer slots (see Figure 2), corroborating our hypothesis that conditional text generation results in more accurate predictions for longer spans.

Error Analysis
[Figure 1: Per-slot-type performance of RoBERTa and BART across the 18 slot types (action, condition, data-collected, data-collector, data-holder, data-protected, data-protector, data-provider, data-receiver, data-retained, data-shared, data-sharer, polarity, protect-against, protection-method, purpose, retention-period, storage-place).]

We analyze the incorrect intent and slot predictions by RoBERTa and BART and categorize the errors into seven types. Note that a predicted slot is considered correct if its label and span both match (exact match) one of the references. We characterize the error types as follows. 1. Wrong Intent (WI): The predicted intent label does not match the reference intent label. 2. Missing Slot (MS): None of the predicted slots exactly match a reference slot. 3. Spurious Slot (SS): The label of a predicted slot does not match any of the references. 4. Wrong Split (WSp): Two or more predicted slot spans with the same label could be merged to match one of the reference slots; a merged span and a reference span may only differ in punctuation or stopwords (e.g., and). 5. Wrong Boundary (WB): A predicted slot span is a sub-string of the reference span or vice versa; the slot label must exactly match. 6. Wrong Label (WL): A predicted slot span matches a reference, but the label does not. 7. Wrong Slot (WS): All other types of errors fall into this category. We provide one example of each error type in Table 7. In Table 8, we present the counts of each error type made by the RoBERTa and BART models. The two most frequent error types are SS and MS: BART makes more SS errors, while RoBERTa suffers from MS errors. While both models are similar in terms of total errors, BART makes more correct predictions, resulting in a higher recall score, as discussed before. One possible way to reduce SS errors is to penalize wrong slot label predictions more heavily than wrong slot spans. On the other hand, reducing MS errors is more challenging, as many missing slots have fewer annotations than others.
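A simplified sketch of applying this taxonomy to a single predicted slot follows. WI, MS, WSp, and WS require sentence-level bookkeeping and are folded into a catch-all here; this is illustrative, not the analysis code used for the paper.

```python
# Classify one predicted slot against the reference slots of its sentence.
def classify_slot(pred, refs):
    """pred: (label, span); refs: list of reference (label, span) pairs."""
    label, span = pred
    if pred in refs:
        return "correct"        # exact label-and-span match
    if any(l == label and (span in s or s in span) for l, s in refs):
        return "wrong-boundary"  # WB: same label, span is a sub-string
    if any(s == span for _, s in refs):
        return "wrong-label"     # WL: span matches, label does not
    if all(l != label for l, _ in refs):
        return "spurious-slot"   # SS: label absent from the references
    return "other"               # WSp / WS need sentence-level checks
```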
We provide more qualitative examples in Appendix (see Table 11 and 12) .
In the error analysis, we exclude the test examples (sentences) with the intent label "Other" and no slots. Out of 1,041 test instances in PolicyIE, 682 instances have the intent label "Other". We analyze RoBERTa and BART's predictions on those examples separately to check whether the models predict any slots for them, since such predictions count as spurious slots. RoBERTa meets our expectation of being highly accurate (correct predictions for 621 out of 682), and BART also correctly predicts 594 out of 682 by precisely generating "[IN:Other]". Overall, the error analysis aligns with our anticipation that the Seq2Seq modeling technique has promise and should be further explored in future work.
Related Work

Our proposed PolicyIE corpus is distinct from previous privacy policy benchmarks. OPP-115 (Wilson et al., 2016a) uses a hierarchical annotation scheme to annotate text segments with a set of data practices and has been used for multi-label classification (Wilson et al., 2016a; Harkous et al., 2018) and question answering (Harkous et al., 2018; Ahmad et al., 2020). PrivacyQA (Ravichander et al., 2019) frames the QA task as identifying a list of relevant sentences from policy documents. Recently, Bui et al. (2021) created a dataset by tagging documents from OPP-115 for privacy practices and used NER models to extract them. In contrast, PolicyIE is developed following semantic parsing benchmarks, and we model the task following the NLP literature.
Intent Classification and Slot Filling Voice assistants and chat-bots frame natural language understanding as classifying intents and filling slots given user utterances. Several benchmarks covering multiple domains and languages have been proposed in the literature (Hemphill et al., 1990; Coucke et al., 2018; Gupta et al., 2018; Upadhyay et al., 2018; Schuster et al., 2019; Li et al., 2021). Our proposed PolicyIE corpus is a new addition to this literature within the security and privacy domain. PolicyIE enables us to build conversational solutions that users can interact with to learn about privacy policies.

Conclusion
This work aims to stimulate research on automating information extraction from privacy policies and reconciling it with users' understanding of their rights. We present PolicyIE, an intent classification and slot filling benchmark on privacy policies, with two alternative neural approaches as baselines. We perform a thorough error analysis to shed light on the limitations of the two baseline approaches. We hope this contribution will encourage research efforts in the specialized privacy domain from both the privacy and NLP communities.

Broader Impact
Privacy and data breaches have a significant impact on individuals. In general, security breaches expose users to different risks such as financial loss (due to losing employment or business opportunities), physical risks to safety, and identity theft. Identity theft is among the most severe and fastest-growing crimes. However, the risks due to data breaches can be minimized if users know their rights and how to exercise them to protect their privacy. This requires users to read the privacy policies of the websites they visit or the mobile applications they use. As reading privacy policies is a tedious task, automating privacy policy analysis reduces the burden on users and empowers them to be aware of how their data are collected and analyzed by service providers for different purposes.

Service providers collect consumer data at a massive scale and often fail to protect them, resulting in data breaches that have drawn increased attention to data privacy and its related risks. Reading privacy policies to understand their rights can help users make informed and timely decisions to safeguard their data privacy and mitigate the risks. Developing an automated solution to facilitate policy document analysis requires labeled examples, and the PolicyIE corpus adds a new dimension to the available datasets in the security and privacy domain. While PolicyIE enables us to train models to extract fine-grained information from privacy policies, the corpus can be coupled with other existing benchmarks to build a comprehensive system. For example, the PrivacyQA corpus (Ravichander et al., 2019) combined with PolicyIE can facilitate building QA systems that answer questions with fine-grained details. We believe our experiments and analysis will help direct future research.