Beyond The Text: Analysis of Privacy Statements through Syntactic and Semantic Role Labeling

This paper formulates a new task of extracting privacy parameters from a privacy policy, through the lens of Contextual Integrity (CI), an established social theory framework for reasoning about privacy norms. Through extensive experiments, we further show that incorporating CI-based domain-specific knowledge into a BERT-based SRL model results in the highest precision and recall, achieving an F1 score of 84%. With our work, we would like to motivate new research in building NLP applications for the privacy domain.


Introduction
A privacy policy informs users about a company's information handling practices. However, privacy policies are lengthy documents, full of incomplete and vague statements that impose a significant cognitive burden on the reader to infer whether a given service respects their privacy (Bhatia et al., 2016a; Bhatia and Breaux, 2018; Reidenberg et al., 2015).
This challenge has inspired many recent works in applying natural language processing and machine learning techniques to automatically process privacy policies and retrieve the relevant information (Harkous et al., 2018; Ravichander et al., 2019). While these efforts help in identifying paragraphs in the privacy policy that mention sensitive information (Evans et al., 2017; Bhatia and Breaux, 2015), opt-out clauses (Sathyendra et al., 2016) or descriptions of data collection practices (Sadeh et al., 2014), they focus on the policy as a whole rather than the individual privacy statements that it contains. In particular, they do not aim to identify relevant and often missing contextual information that is critical for unambiguously understanding the scope of individual statements. This paper focuses on a new NLP task that aids the analysis of privacy policies at this more fine-grained level.
To illustrate the problem, consider a typical example of an ambiguous privacy statement: "Yahoo collects information about your transactions with us and with some of our business partners, including information about your use of financial products and services that we offer." At first glance, the statement may seem to provide all the relevant information about a first-party collection of transactional data. However, it in fact misses some crucial contextual information. To understand what is missing, we use the contextual integrity (CI) framework (Nissenbaum, 2009). CI defines privacy as an appropriate flow of information, which is expressed in terms of 5 essential CI parameters: Sender, Recipient, Subject, Information Type, and Transmission Principle. The latter is a constraint on the information flow expressing the condition under which information is being transferred. The above statement specifies only 3 out of the 5 necessary parameters: Subject, Recipient and Information Type. This leaves the sender of the information and the transmission principle to the reader's interpretation. In some cases, the relevant missing information appears in different places in the policy, for example, under different sections such as "When do we collect your information" or "Our partners". These, however, do not help in contextually positioning the above statement so that the reader can determine whether their expectations have been met.
In this paper, we formulate the new NLP task of extracting the CI parameters from privacy statements (§ 3). We describe four different types of conventional methods that have been partially adapted to address this task: Hidden Markov Models, BERT models, Dependency-Type Parsing and CI-specific Semantic Role Labeling (§ 4). Our evaluation of 36 real-world privacy policies shows that a solution combining syntactic dependency type parsing (DP) coupled with type-specific Semantic Role Labeling (SRL) tasks provides the highest accuracy for retrieving contextual privacy parameters from privacy statements (§ 5). We also observe that incorporating domain-specific knowledge is critical and, doing so, we successfully extract the relevant CI parameters with an F1 score of 80% or higher.

Related Work
Several recent efforts have focused on identifying important and relevant privacy statements using constituency parsing (Sathyendra et al., 2017; Sathyendra et al., 2016; Evans et al., 2017), logistic regression (Ammar et al., 2012) and crowdsourcing (Wilson et al., 2016b) techniques. Harkous et al. (2018) trained a machine learning model for querying privacy policies to retrieve relevant passages of information. Specifically, it supports free-form questions about data handling practices described in the text and returns the paragraph mentioning the relevant practice. As we discuss in Section 3, our work explicitly looks to map the privacy statement to a fixed set of parameters. We also show that Question Answering (QA) models do not perform satisfactorily when applied to our task.
Similar limitations of the reading comprehension models were observed by Ravichander et al. (2019), who composed the PRIVACYQA dataset, an annotated corpus consisting of 1750 questions about the contents of privacy policies such as "What data does this game collect?" and "Will my data be sold to advertisers?". Our work is inspired by these efforts to provide a dataset of CI parameter annotations and a machine learning model for automatic CI parameter extraction.
In prior work on automatic privacy statement analysis, Bhatia et al. (2016b) extracted privacy statements on information handling practices such as "collecting your e-mail address" or "sharing your location" using a typed dependency parser and crowdworker annotations. More relevant to our efforts, Bhatia and Breaux (2018) applied Semantic Roles theory to manually annotate 5 privacy statements and identify action verbs (action data) such as "collection", "retain", "use", "transfer" and associated semantic roles that capture who performs the action, how the action is carried out, etc. Shvartzshnaider et al. (2019a) crowdsourced privacy policy annotation to compare policy versions, identifying missing contextual information and overloading of parameters that contribute to users' inability to understand the prescribed information practices. Our work automates the task of annotating privacy policies with the CI parameters.
Many other multidisciplinary efforts draw on CI as the underpinning privacy theory and can benefit from our newly formulated annotation task. Legal scholars and social scientists have used CI to examine existing data sharing practices in companies like Facebook (Hull et al., 2011) and Google (Zimmer, 2008) in order to identify important contextual elements behind users' privacy expectations (Apthorpe et al., 2018; Martin and Nissenbaum, 2016). In computer science, researchers have used CI to build privacy compliance and verification tools (Barth et al., 2006; Chowdhury et al., 2013).

CI Parameters Extraction Task
In this section we formulate the task of extracting relevant CI parameters from privacy policy statements.
Let us first motivate this task by discussing its applications. To perform an analysis of privacy implications of a given information flow, the theory of CI requires identifying 5 essential parameters: actors (sender, receiver, subject), the type of information (attribute), and condition of the information exchange (transmission principle). This analysis can help in identifying potentially confusing or misleading statements, e.g., when one of the five parameters such as transmission principle or receiver is missing or ambiguous (Shvartzshnaider et al., 2019a). Furthermore, one can use the identified parameters to formalize the expressed informational norms and privacy rules in formal logic (Shvartzshnaider et al., 2019b; Datta et al., 2011). These formalisms can in turn be used to build systems that enforce the specified rules or automatically audit information flows to detect rule violations.
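As a rough illustration of the five parameters and of the missing-parameter check mentioned above, the following minimal Python sketch represents a single information flow. The class name `CIFlow`, its field names and the `missing` helper are hypothetical and are not part of the paper's implementation.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class CIFlow:
    """One information flow described by the five CI parameters (hypothetical container)."""
    sender: Optional[str] = None
    receiver: Optional[str] = None
    subject: Optional[str] = None
    attribute: Optional[str] = None
    transmission_principle: Optional[str] = None

    def missing(self) -> List[str]:
        """Return the names of the parameters the statement leaves unspecified."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# The Yahoo statement from the introduction: only three parameters are stated.
yahoo = CIFlow(
    receiver="Yahoo",
    subject="you",
    attribute="information about your transactions",
)
print(yahoo.missing())  # ['sender', 'transmission_principle']
```

A statement whose `missing()` list is non-empty would be a candidate for the kind of ambiguity analysis described in this section.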
The CI parameter extraction task is as follows. Given a privacy statement stmt, apply a mapping function M to extract the CI parameters sender, receiver, subject, attribute and transmission principle:
$M(\text{stmt}) \rightarrow (\text{sender}, \text{receiver}, \text{subject}, \text{attribute}, \text{transmission principle})$
The main challenge behind the task is in identifying the lexical items in the statement that correspond to the contextually relevant values to help downstream NLP tasks perform the privacy analysis. This is not a trivial task as privacy policies are not written with CI in mind. Often, they are written by legal and policy teams whose primary concern is not readability. Many privacy statements are missing essential CI parameters and often comprise syntactically complex sentences (Bhatia and Breaux, 2018). In the absence of an automatic way to extract CI parameters, researchers have employed crowdsourcing and manual annotation to perform the analysis (Shvartzshnaider et al., 2019a). The results, while promising, are not yet satisfactory, and many challenges remain. We provide a motivating example to demonstrate the challenges involved in this task. Consider the privacy statement:
[Sender: We] transfer [Attribute: information] about [Subject: you] [TP: if Yahoo is acquired by or merged with another company].
Viewed through the lens of CI, we are interested in answering the following questions: "Who is transferring?", "What is being transferred?", "Who is the subject?", "Who is the receiver/recipient?", "Why, When and How is the transfer facilitated?". The relevant CI parameters are marked in the statement above. We tried applying an open-domain QA model to answer these questions. Table 1 shows the results of our exploratory experiment. The overall F1 scores for the QA model indicate poor results for extraction of all CI parameters. Note that in our experiment, the QA model outputs multiple phrase predictions for each of the parameters. For precision, we calculate true positives as a fraction of all positives predicted for each parameter. For recall, we calculate the fraction of true positives to all correct parameters. This result aligns with previous uses of QA in the privacy domain. Ravichander et al. (2019) observed that, compared to a human annotator, Question Answering for Privacy Policies using standard reading comprehension models returns relatively poor results in answering specific questions such as "will my data be sold to advertisers?" and "what data does this [service] collect?".
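The precision and recall definitions above can be sketched for one CI parameter of one statement as follows. This is a minimal illustration that assumes exact string matching between predicted and gold phrases; the paper does not state its matching criterion, and the function name and example phrases are hypothetical.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Phrase-level scores for one CI parameter (exact match is an assumption)."""
    true_positives = len(predicted & gold)
    # Precision: true positives as a fraction of all predicted phrases.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: true positives as a fraction of all correct phrases.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical QA output for the Attribute parameter: several candidate phrases,
# only one of which matches the annotation.
scores = precision_recall_f1(
    predicted={"information", "Yahoo", "another company"},
    gold={"information"},
)
print(scores)
```

With several candidate phrases per parameter, recall can stay high while precision drops, which is the pattern a model that returns many spurious phrases would produce.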

These experiences suggest that QA models require additional heuristics to filter the many false positives that arise because they operate at the paragraph level rather than on sentence-level statements. Thus, we have established that extracting CI parameters using existing off-the-shelf models without significant re-mapping leads to low precision and recall. In the next section, we discuss how we can re-purpose existing tasks by leveraging domain expertise in CI to extract these parameters. We further demonstrate that, with a comparable dataset, training an end-to-end supervised learning model does not provide accurate results.

Methods
In this section we describe the NLP methods we applied to the CI parameter extraction task: Hidden Markov Model, BERT, Dependency Parser (DP) and Semantic Role Labeling (SRL). We illustrate the post-processing and the modifications required in off-the-shelf end-to-end neural models to extract CI parameters from privacy policies. Specifically, we focus on Syntactic DP and SRL-based approaches.

Hidden Markov Model
We formulate the CI parameter extraction as a part-of-speech (POS) tagging task and use a Hidden Markov Model (HMM) (Jurafsky and Martin, 2014) for annotating words in a sentence. Specifically, we train a trigram HMM by converting the dataset to CoNLL-2003 format (Sang and De Meulder, 2003) with CI parameters as the target labels. In our setup, we use an 80/20 train-test split, with a training set comprising 2504 privacy statements and 18533 tokens and a validation set consisting of 626 privacy statements and 5130 tokens. By default, an HMM relies on the Markov assumption that the probability of a particular state depends only on the preceding state. However, in order to enrich our HMM model, we consider the two previous states when predicting the current CI parameter, turning it into a trigram model. Further, we obtain the final transition probability distribution by linearly combining unigram, bigram and trigram probability distributions with interpolation weights λ1 and λ2. The parameters λ1 and λ2 are tuned on the validation set, with values 0.42 and 0.48 providing the best results. The Viterbi algorithm (Forney, 1973) is used in the decoding phase for the extended model.
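The linear combination of unigram, bigram and trigram estimates can be sketched as below. The exact form of the interpolation is not recoverable from the text, so this sketch makes two assumptions: one weight applies to the trigram estimate and the other to the bigram estimate, and the unigram estimate receives the remaining weight so that the three sum to one. The function name and the toy tag sequences are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def build_interpolated_transition(
    tag_sequences: Sequence[Sequence[str]],
    lambda_tri: float,
    lambda_bi: float,
) -> Callable[[str, str, str], float]:
    """Return P(tag | prev2, prev1) as a mix of trigram, bigram and unigram estimates."""
    uni, bi, tri = Counter(), Counter(), Counter()
    ctx1, ctx2 = Counter(), Counter()  # how often each history was observed
    for tags in tag_sequences:
        padded = ["<s>", "<s>", *tags]  # two start symbols give every tag a full history
        for i in range(2, len(padded)):
            prev2, prev1, tag = padded[i - 2], padded[i - 1], padded[i]
            uni[tag] += 1
            bi[(prev1, tag)] += 1
            tri[(prev2, prev1, tag)] += 1
            ctx1[prev1] += 1
            ctx2[(prev2, prev1)] += 1
    total = sum(uni.values())
    lambda_uni = 1.0 - lambda_tri - lambda_bi  # assumption: weights sum to one

    def prob(prev2: str, prev1: str, tag: str) -> float:
        p_uni = uni[tag] / total if total else 0.0
        p_bi = bi[(prev1, tag)] / ctx1[prev1] if ctx1[prev1] else 0.0
        p_tri = (
            tri[(prev2, prev1, tag)] / ctx2[(prev2, prev1)]
            if ctx2[(prev2, prev1)]
            else 0.0
        )
        return lambda_tri * p_tri + lambda_bi * p_bi + lambda_uni * p_uni

    return prob


# Toy tag sequences, for illustration only.
sequences = [["Recipient", "O", "Attribute"], ["Recipient", "O", "O"]]
transition = build_interpolated_transition(sequences, lambda_tri=0.42, lambda_bi=0.48)
print(transition("<s>", "<s>", "Recipient"))
```

Interpolating with lower-order estimates keeps the transition probability non-zero for tag histories that never occur in the training data, which matters for a dataset of this size.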

Bidirectional Encoder Representations from Transformers (BERT)
We frame the CI parameter extraction task as a sequence-to-sequence transformation problem and fine-tune a BERT model (Devlin et al., 2018) on our dataset to map a sequence of words in privacy statements to a corresponding sequence of CI tags. For training and testing, we transformed our dataset into the CoNLL-2003 format and used AllenNLP (Gardner et al., 2018) with an 80/20 train-test split and hyperparameter values taken from (Gardner et al., 2017).
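The conversion of an annotated statement into token-per-line training data, as used for both the HMM and BERT models, might look roughly like this. The sketch emits a simplified two-column variant (token and tag) rather than the full four-column CoNLL-2003 layout, and the BIO tagging scheme, the function name and the span encoding are assumptions not stated in the paper.

```python
from typing import List, Tuple


def to_conll(tokens: List[str], spans: List[Tuple[int, int, str]]) -> str:
    """Render one statement as 'token tag' lines.

    spans holds (start, end_exclusive, label) token ranges; BIO tags are an assumption.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return "\n".join(f"{token} {tag}" for token, tag in zip(tokens, tags))


tokens = "We collect technical information when you visit our websites".split()
spans = [(0, 1, "Recipient"), (2, 4, "Attribute"), (4, 9, "TP")]
print(to_conll(tokens, spans))
```

Statements would be separated by a blank line in the resulting file, following the CoNLL convention of one sentence per block.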

Dependency parsing
Dependency parsing is the task of identifying syntactic roles or dependency types for each of the words in a sentence. This involves parsing a sentence and identifying the syntactic structure denoting the grammatical rules that govern a language. Not all the dependency types identified for the English language are relevant in our study.
We use the DP outputs to identify the relevant CI parameters in the privacy statement. To identify CI parameters at a single sentence level using local relationships, we run a typed dependency parser (DP) on the text of the policies. We accept paragraphs as input, split them into sentences and parse each sentence using the spaCy dependency parser. The library (Honnibal and Montani, 2018) achieves near state-of-the-art performance on most NLP tasks. We then map the dependency types to specific CI parameters as shown in Table 2. For example, for the following statement from the Google privacy policy, the DP parser will return the following dependency type tags (white nodes), which are mapped to corresponding CI parameters (gray nodes): When you use Google services, we may collect and process information about your actual location. Note that, as is evident in Table 2, the dependency types cannot distinguish between the parameters of sender and receiver. For this, we defer to the task of SRL to identify them based on the semantic meaning of the word. Figure 2 shows the percentage of DP tags that are correctly and incorrectly mapped to the CI parameters. This indicates the diversity and coverage of the many tags that map to each of the CI parameters. It also illustrates that the task of extracting CI parameters is not equivalent to that of DP and new conditional information is required to modify DP and solve the task.
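The mapping step from dependency types to CI parameters can be sketched as a simple lookup. The table below is a hypothetical stand-in for Table 2, whose contents are not reproduced here, and the (token, dependency type) pairs are hand-written for illustration rather than actual parser output. With spaCy, such pairs could be obtained from a parsed document via each token's `text` and `dep_` attributes.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping for illustration only; the paper's actual mapping is in Table 2.
DEP_TO_CI: Dict[str, str] = {
    "nsubj": "Sender/Recipient",  # DP alone cannot tell sender from recipient
    "dobj": "Attribute",
    "poss": "Subject",
    "advcl": "TP",
}


def map_dependencies(parsed: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep tokens whose dependency type has a CI counterpart and relabel them."""
    return [(text, DEP_TO_CI[dep]) for text, dep in parsed if dep in DEP_TO_CI]


# Hand-written (token, dependency type) pairs, not real parser output.
parsed_statement = [
    ("we", "nsubj"),
    ("collect", "ROOT"),
    ("information", "dobj"),
    ("your", "poss"),
    ("location", "pobj"),
]
print(map_dependencies(parsed_statement))
```

Because a grammatical subject can be either the sending or the receiving party depending on the verb, the lookup leaves that distinction unresolved, which is the gap SRL is used to close.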

Semantic Role Labeling
Semantic Role Labeling is the task of mapping words or phrases in a sentence to a semantic role such as that of an agent, goal, or result (Jurafsky and Martin, 2014). Often, in the classic natural language processing pipeline, this task is considered to have subsumed syntactic and parts-of-speech tasks within it (Tenney et al., 2019). For example, the task of distinguishing between a sender and receiver can be done through SRL, but not through syntactic DP.
Similar to DP, we map the semantic roles to the relevant CI parameters. Table 3 shows the CI parameter mapping based on a verb's syntactic arguments. For example, the verb "collect" has the following associated arguments (see PropBank corpus (Martha et al., 2005)): ARG0: agent, entity acquiring something, ARG1: thing acquired, ARG2: source, ARG3: more specific attribute of ARG1 being collected, ARG4: benefactive. To recover the predicate argument structure of a sentence we use an AllenNLP implementation of the Bidirectional LSTM model (He et al., 2017). For example, for the following statement the SRL model returns:
[ARG0: We] [V: collect] [ARG1: technical information] [ARGM-TMP: when you visit our websites]
We then map the arguments onto the CI parameters. In the above example, ARG0 is mapped to Recipient, ARG1 is an Attribute, and ARGM-TMP is the TP. For each of the verbs these mappings are slightly different, as shown in Table 3. This mapping, although crude, covers a significant class of privacy policy statements which describe norms of information flows.

CI-related Semantic Frames
The SRL model returns verb-argument predicates for all the identified verbs in a sentence. Some of these verbs are not relevant to information exchange. For example, in the above statement, the verb "visit" does not convey semantically meaningful information regarding the exchange of technical information.
To reduce the number of false positives, we provide the algorithm with a list of verbs that correlate highly with information exchanges. It is helpful to think of this approach through the lens of the linguistic theory of Frame semantics (Fillmore and others, 1976), which posits that the specific meaning of words (frame elements) can be understood only as part of a particular context (semantic frames). In our approach, we invoke CI-related semantic frames. Specifically, we look for SRL predicates that are associated with any transfer of information (actual or perceived). This includes a list of verbs such as "sending", "sharing", "transmitting" and others. In addition to invoking a general semantic frame, we differentiate between the roles of the arguments associated with each predicate. In particular, for predicates like "sending", "sharing", "transmitting", ARG0 is typically associated with the agent role of a "sender", ARG1 captures what was "sent", and ARG2 is associated with the receiving agent role. For verbs like "gather", "collect", "receive", "acquire" the roles are reversed: ARG0 is typically associated with the "receiving" agent role, ARG1 describes what is "received", and ARG2 is associated with the "sending" agent role. Grouping the verbs signifying a "sending" or a "receiving" action helps us map the corresponding arguments to the relevant CI parameters for Senders and Receivers. The mapping for TP and Attribute remains the same for all verbs. Finally, our SRL mapping does not include a semantic role mapping of the Subject parameter. We operate on the assumption that the subject in most statements is the user.
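A minimal sketch of the verb grouping and argument mapping follows. It takes its role directions from the worked examples in this paper, in which ARG0 of "collect" maps to Recipient and ARG0 of "sharing" maps to Sender. The verb sets contain only the verbs named in the text, only ARGM-TMP is treated as a TP argument, and the function name is hypothetical; Table 3 may define a fuller mapping.

```python
from typing import Dict, Optional

# Verb lemmas named in the text; the paper's full list may be longer.
SENDING_VERBS = {"send", "share", "transmit"}
RECEIVING_VERBS = {"gather", "collect", "receive", "acquire"}


def map_srl_to_ci(verb_lemma: str, arguments: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Map the SRL arguments of one predicate to CI parameters.

    Returns None when the verb does not invoke a CI-related frame.
    """
    if verb_lemma in SENDING_VERBS:
        roles = {"ARG0": "Sender", "ARG2": "Recipient"}
    elif verb_lemma in RECEIVING_VERBS:
        roles = {"ARG0": "Recipient", "ARG2": "Sender"}
    else:
        return None
    # Attribute and TP are mapped the same way for every verb.
    roles["ARG1"] = "Attribute"
    roles["ARGM-TMP"] = "TP"  # assumption: other ARGM-* labels are omitted here
    return {roles[label]: text for label, text in arguments.items() if label in roles}


collect_args = {
    "ARG0": "We",
    "ARG1": "technical information",
    "ARGM-TMP": "when you visit our websites",
}
print(map_srl_to_ci("collect", collect_args))
print(map_srl_to_ci("visit", {"ARG0": "you", "ARG1": "our websites"}))  # None
```

Returning `None` for verbs outside both groups is how this sketch models the filtering of predicates that do not describe an information exchange.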

Clues from CI to Improve SRL
Identifying the arguments for all verbs in the privacy statement results in high recall numbers. Nevertheless, the precision suffers because not all of the verbs need to be invoked. To reduce the number of false positive mappings, we implement an algorithm which analyzes all the relevant SRL verbs to check whether any of them appear as part of the Transmission Principle (TP) relative to another verb.
For example, in the following statement the SRL model will pick up two predicates (verbs) and corresponding arguments:
collect: [ARG0: We] [V: collect] [ARG1: your personal information] [ARGM-TMP: when you are sharing your post].
sharing: We collect your personal information [ARGM-TMP: when] [ARG0: you] are [V: sharing] [ARG1: your post].
These arguments will be mapped to CI parameters, as described in the previous section. The verb "share" is redundant in this context since it is part of the TP of the verb "collect." Once we identify the redundant verb, we ignore all arguments associated with it, i.e., our algorithm does not consider these results. We do keep those parameters that overlap with the parameters produced by non-redundant verbs. For instance, in our example we ignore the verb "share" and the arguments associated with it, specifically [ARG0: you] and [ARG1: your post], which would otherwise be mapped to the CI sender and attribute parameters, respectively.
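The redundancy check described above can be sketched using token positions: a predicate is dropped when its verb falls inside the TP argument of another predicate. The `Frame` class, the token-span encoding and the restriction of TP labels to ARGM-TMP are assumptions made for this illustration, and the sketch simply discards the whole redundant frame.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

TP_LABELS = {"ARGM-TMP"}  # assumption: only this label is treated as a TP argument


@dataclass
class Frame:
    """One SRL predicate with its arguments as (start, end_exclusive) token spans."""
    verb: str
    verb_index: int
    args: Dict[str, Tuple[int, int]]


def drop_redundant_frames(frames: List[Frame]) -> List[Frame]:
    """Remove predicates whose verb lies inside the TP argument of another predicate."""
    kept = []
    for frame in frames:
        inside_tp = any(
            start <= frame.verb_index < end
            for other in frames
            if other is not frame
            for label, (start, end) in other.args.items()
            if label in TP_LABELS
        )
        if not inside_tp:
            kept.append(frame)
    return kept


# Tokens: We(0) collect(1) your(2) personal(3) information(4)
#         when(5) you(6) are(7) sharing(8) your(9) post(10)
collect = Frame("collect", 1, {"ARG0": (0, 1), "ARG1": (2, 5), "ARGM-TMP": (5, 11)})
sharing = Frame("sharing", 8, {"ARGM-TMP": (5, 6), "ARG0": (6, 7), "ARG1": (9, 11)})
print([frame.verb for frame in drop_redundant_frames([collect, sharing])])  # ['collect']
```

Comparing token positions rather than verb strings avoids dropping a verb that merely shares its surface form with a word elsewhere in the sentence.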

Evaluation
We perform automatic annotation of 36 policies of the OPP-115 Corpus (Wilson et al., 2016a). The corpus' privacy policies were annotated to specify data practices mentioned in each of the segments of the policy. We limit our CI parameter extraction to labeled segments of the policy that discuss information exchanges such as segments labeled as "First Party Collection/Use", "Third party sharing/collection", "Data Retention". Following the steps in Figure 1, each segment was split into separate sentences which were annotated by the respective models. The results were presented to a human annotator, one of the authors who is an expert on CI. The expert then marked the valid results for each of the privacy statement sentences and CI parameters, and also provided the ground truth. A sentence was marked as a valid flow if it prescribed an information exchange of any kind. Otherwise, by default, all sentences are considered invalid.
Overall, the extraction phase resulted in a total of 2268 privacy statement sentences, out of which 778 were labeled as valid, containing 3245 CI parameters. On average, a policy contains 18 valid statements, with outliers of 4 and 43 valid statements.

HMM and BERT models
Table 4 shows the results of training a trigram Hidden Markov Model and a fully-supervised BERT. Both models perform relatively poorly for our task, especially when it comes to the "Sender" parameter. HMM's overall F1 scores are slightly better for detecting other parameters, with the highest F1 score achieved for the TP parameter in both models.

DP and SRL
Table 5 shows precision and recall for both DP and SRL models. Both models have high recall numbers. However, in DP the precision is low, indicating that while DP is able to identify all the relevant instances, it also produces many false positives. SRL performs better, both in terms of precision and recall. The recall numbers are slightly higher compared to DP and the precision is much higher. We, however, note that SRL did not process 26 statements. They contain verbs that our algorithm did not track, some of which are not always associated with information exchange, like "sell" and "rent". Figure 3 shows the percentage of SRL arguments that are correctly and incorrectly mapped to the CI parameters. Note that, compared to the dependency tags from DP, semantic arguments from SRL result in more valid mappings to CI parameters.

Improved SRL
Table 5 shows the results for improved SRL after applying our algorithm incorporating domain-specific heuristics. The precision results have improved across all the parameters, affecting recall only slightly. We note that our F1 metric is calculated at the phrase prediction level. The statements where the SRL-based algorithm performed especially poorly involved semantically complex or long connected sentences. Semantically complex statements comprise multiple verb-predicates with related arguments that result in a large number of false positives. For example:
SCEA's consumer services department maintains information obtained from consumers who contact or submit an online complaint so that we may assist these customers with current or future service issues.
Long connected sentences comprise several phrases. However, due to improper punctuation they appear as a single sentence to our algorithm and as a result generate a large number of false positives. For example, the following statement comprises multiple sentences that are connected with a colon: There are two main types of information we collect about users of our online services that include (but are not limited to) the following: Information that identifies you: This is commonly referred to as "personal information" and includes, for example, information that you provide to us such as your name, home address, age, gender, telephone number, e-mail address, payment information (including your credit card number), and/or photos or video footage of you; and Information that relates to you, but on its own does not identify you: Such as information about your Internet connection, the equipment you use to access our online services and information relating to your usage of those services.
These cases are not only problematic for an NLP task but also require significant cognitive effort for a human attempting to analyze the privacy implications of the prescribed information flows. Rather than adapting our method to yield better results in these cases, it might be best to use it to detect these complex sentences so that they can be restated more clearly.

Conclusion
In this paper, we formulate a new CI parameter extraction NLP task for analysis of privacy statements. We adapt several conventional NLP and ML methods (HMM, BERT, DP and SRL) to perform the task and demonstrate that it cannot be solved trivially. In our evaluation of privacy statements from 36 real-world privacy policies, we show that a method combining clues from CI into syntactic DP coupled with type-specific SRL obtains the highest F1 score. We build on this insight to devise an algorithm that incorporates domain-specific knowledge to achieve a much higher precision and recall. The proposed algorithm post-processes ML outputs and increases automation of a tedious task that has so far been performed manually. Further improvements of this task, leveraging domain knowledge for complex scenarios, will directly benefit downstream applications ranging from aiding the design and analysis of privacy policies to building systems that meet users' privacy expectations by construction.

Figure 1: CI Parameters annotation task pipeline

Figure 2: Distribution of True Positives and False Positives for each DP tag
Figure 3: Distribution of True Positives and False Positives for each SRL tag

Figure 4: Histogram of F1 scores across privacy policies

Table 1: Precision, Recall and overall F1 score for QA Comprehension model used for the CI parameter extraction task. The recall and precision values for a parameter are calculated by macro averaging over privacy statements.

Table 2: Dependency types corresponding to CI parameters. To represent dependencies we use the Stanford Typed Dependency Manual (De Marneffe and Manning, 2011) notations.

Table 3: Mapping semantic roles (notations) to specific CI parameters.

Table 4: F1 Scores for fully-supervised HMM and fine-tuned BERT model. The recall and precision values are calculated on word level over the whole test set.

Table 5: F1 Scores for all the models: DP, SRL and Improved SRL (CI-SRL). The recall and precision values for each parameter are calculated by macro averaging over privacy statements.