Assessing Privacy Risks in Language Models: A Case Study on Summarization Tasks

Large language models have revolutionized the field of NLP by achieving state-of-the-art performance on various tasks. However, there is a concern that these models may disclose information from the training data. In this study, we focus on the summarization task and investigate the membership inference (MI) attack: given a sample and black-box access to a model's API, is it possible to determine if the sample was part of the training data? We exploit text similarity and the model's resistance to document modifications as potential MI signals and evaluate their effectiveness on widely used datasets. Our results demonstrate that summarization models are at risk of exposing data membership, even in cases where the reference summary is not available. Furthermore, we discuss several safeguards for training summarization models to protect against MI attacks and examine the inherent trade-off between privacy and utility.


Introduction
Text summarization seeks to condense input document(s) into a shorter, more concise version while preserving important information. Recent large language models have significantly enhanced the quality of generated summaries (Rothe et al., 2021; El-Kassas et al., 2021; Chung et al., 2022). These models have been applied to sensitive data, such as clinical and financial reports (Zhang et al., 2020b; Abacha et al., 2021). Given these high-stakes applications, it is critical to guarantee that such models do not inadvertently disclose any information from the training data and that data remains visible only to the client who owns it.
To evaluate the potential memorization of specific data by a model, the membership inference (MI) attack (Shokri et al., 2017) has become the de facto standard, owing to its simplicity (Murakonda et al., 2021). Given a model and a sample (an input-label pair), the membership inference attack aims to identify whether this sample was in the model's training dataset. This problem can be formulated as an adversarial scenario, with Bob acting as the attacker and Alice as the defender. Bob proposes methods to infer membership, while Alice attempts to make membership indistinguishable. Researchers have proposed various attack and defense techniques in the past. However, most of them focus on computer vision classification problems with a fixed set of labels. Little attention has been given to understanding MI attacks on Seq2Seq models.
This paper focuses on the summarization task and investigates the privacy risk under membership inference attacks. Inspired by previous research in the MI literature (Shokri et al., 2017; Hisamoto et al., 2020), we pose the problem as follows: given black-box access to a summarization model's API, can we identify whether a document-summary pair was used to train the model? Compared to membership inference attacks on fixed-label classification tasks, text generation tasks present two significant challenges: (1) The process of generating summaries involves a sequence of classification predictions with variable lengths, resulting in a complex output space.
(2) Existing attacks heavily rely on the output probabilities (Shokri et al., 2017; Mireshghallah et al., 2022), which are impractical to obtain from the APIs of Seq2Seq models. Therefore, it remains uncertain whether the methodologies and findings developed for classification models can be applied to language generation models.
A pertinent question emerging from this study is why MI attacks are effective against summarization models. The key insight lies in the training objective of these models, which aims to minimize the discrepancy between the generated and reference summaries (See et al., 2017). Consequently, samples with significantly lower loss (indicating high similarity between the generated and reference summaries) are more likely to be part of the training dataset. Based on this concept, we propose a baseline attack that uses the similarity between generated and reference summaries to differentiate training from non-training samples. One limitation of the baseline attack is that Bob requires access to both the documents and the reference summaries to launch the attack, rendering it less practical for summarization tasks.
Building upon this, our study introduces a more general document-only attack: given only a document and black-box access to the target model's APIs, can Bob infer membership without access to the reference summary? We tackle this problem by examining the robustness of generated summaries in response to perturbations of the input documents. According to the max-margin principle, training data tends to reside further from the decision boundary and thus exhibits greater resilience to perturbations, which aligns with observations from prior research in the adversarial domain (Tanay and Griffin, 2016; Choquette-Choo et al., 2021). Consequently, Bob can extract fine-grained membership inference signals by evaluating data robustness under perturbations. Remarkably, we show that Bob can estimate this robustness without reference summaries, thereby enabling a document-only attack. In summary, this work makes the following contributions to language model privacy:
1. We define the black-box MI attack for sequence-to-sequence models. Experiments on summarization tasks show attackers can reliably infer the membership of specific instances.
2. We explore data robustness for MI attacks and find that the proposed approach enables attackers to launch the attack solely with the input document.
3. We evaluate factors impacting MI attacks, such as dataset size and model architecture. We also explore multiple defense techniques and discuss the privacy-utility trade-off.

Background and Related Works
Membership Inference Attacks. In a typical black-box MI attack scenario, as per the literature (Shokri et al., 2017; Hisamoto et al., 2020), it is posited that the attacker, Bob, can access a data distribution identical to Alice's training data. This access allows Bob to train a shadow model and use the known data membership of this model as ground-truth labels to train an attack classifier. Bob can then initiate the attack by sending queries to Alice's model APIs. Most previous studies leverage disparities in prediction distributions to distinguish between training and non-training samples. However, this approach is not feasible for Seq2Seq models. For each generated token in these models, the output probability over the vocabulary often comprises tens of thousands of elements; for instance, the vocabulary size for BART is 50,265 (Lewis et al., 2020). As such, most public APIs do not offer probability vectors for each token but rather furnish an overall confidence score for the sequence, calculated from the product of the predicted tokens' probabilities.

Natural Language Privacy. An increasing body of work has been conducted on understanding privacy risk in the NLP domain (Hayes et al., 2017; Meehan et al., 2022; Chen et al., 2022; Ponomareva et al., 2022). Pioneering research has been dedicated to studying MI attacks on NLP models. Hisamoto et al. (2020) examine the black-box membership inference problem for machine translation models. They assume Bob can access both the input document and the translated text, and they use BLEU scores as the membership signal, which is similar to our baseline attack. Song and Shmatikov (2019) investigate a white-box MI attack for language models, which assumes Bob can obtain the probability distribution of each generated token. Different from previous work, our attack operates in the black-box setting and considers a more general document-only attack in which Bob only needs input documents for membership inference.

Problem Definition
We introduce two characters, Alice and Bob, in the membership inference attack problem. Alice (Defender) trains a summarization model on a private dataset. We denote a document as f and its corresponding reference summary as s. Alice provides an API to users, which takes a document f as input and returns a generated summary ŝ. Bob (Attacker) has access to data similar to Alice's data distribution and wants to build a binary classifier g(•) to identify whether a sample is in Alice's training data, A_train. A sample comprises a document f and its reference summary s. Together with the API's output ŝ, Bob uses g(•) to infer membership, whose goal is to predict:

g(f, s, ŝ) = 1 if (f, s) ∈ A_train, and 0 otherwise. (1)

Shadow Models and Data Splitting
In this work, we follow the typical settings in the MI attack literature (Shokri et al., 2017; Hisamoto et al., 2020; Jagannatha et al., 2021; Shejwalkar et al., 2021) and assume Bob has access to data from the same distribution as Alice's to train shadow models. Subsequently, Bob uses the known data membership of the shadow models as training labels for an attack classifier g(•), whose goal is to predict the data membership of the shadow model. If the attack on the shadow model proves successful, Bob can employ the trained attack classifier to attack Alice's model. As depicted in Figure 1, we follow the setting of previous work (Hisamoto et al., 2020) and split the whole dataset into A_all and B_all, with Alice only having access to A_all and Bob only to B_all. For Alice, A_all is further split into two parts: A_train and A_out, where A_train is used to train a summarization model and A_out serves as a hold-out dataset that is never used. (Note that A_train includes the data used for validation and testing, and we use A_train to denote the data used to train the model.) For Bob, B_all is further split into B_in and B_out, where Bob employs B_in to train shadow models and B_out serves as a hold-out dataset. To construct the attack classifier g(•), Bob trains g(•) to differentiate samples in B_in from those in B_out.
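The split above can be sketched as follows. This is a minimal illustration with our own function and parameter names; the fractions are assumptions (the paper gives Bob roughly 20% of the data, and the exact split sizes are listed in Table 2):

```python
import random

def split_for_mi(dataset, bob_fraction=0.2, holdout_fraction=0.25, seed=0):
    """Sketch of the Alice/Bob data split. Fractions are illustrative
    assumptions; the paper's exact split sizes are given in Table 2."""
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)
    n_bob = int(len(data) * bob_fraction)
    b_all, a_all = data[:n_bob], data[n_bob:]
    # Alice: A_train for the summarization model, A_out never used in training.
    n_out = int(len(a_all) * holdout_fraction)
    a_out, a_train = a_all[:n_out], a_all[n_out:]
    # Bob: B_in trains the shadow model, B_out is its hold-out set;
    # the attack classifier g is trained to separate B_in from B_out.
    b_in, b_out = b_all[: len(b_all) // 2], b_all[len(b_all) // 2:]
    return a_train, a_out, b_in, b_out
```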

Evaluation Protocols
We adopt the following evaluation protocols to evaluate the attack performance of g(•) on Alice's model. Given a document and its corresponding reference summary (f, s), selected from A_train or A_out, Bob sends f to Alice's API and gets the output summary ŝ. Bob then employs the trained classifier g(•) to infer whether the pair (f, s) is present in Alice's training data. Since A_train is much larger than A_out, we make the binary classification task more balanced by sampling a subset A_in from A_train with the same size as A_out. Given a set F of test samples (f, s, ŝ, m), where (f, s) ∈ A_in ∪ A_out and m is the ground-truth membership, the Attack Accuracy (ACC) is defined as:

ACC = (1 / |F|) Σ_{(f, s, ŝ, m) ∈ F} 1[g(f, s, ŝ) = m], (2)

where an accuracy above 50% can be interpreted as a potential compromise of privacy. Following a similar definition, we can define other commonly used metrics, such as Recall, Precision, and AUC. Previous literature mainly uses accuracy or AUC to evaluate privacy risk (Song and Shmatikov, 2019; Hisamoto et al., 2020; Mahloujifar et al., 2021; Jagannatha et al., 2021). However, these metrics only consider an average case and are not enough for security analysis (Carlini et al., 2022). Consider comparing two attackers: Bob_1 perfectly infers the membership of 1% of the dataset but succeeds with a random 50% chance on the rest, while Bob_2 succeeds with 50.5% on the entire dataset. On average, the two attackers have the same attack accuracy and AUC. However, Bob_1 demonstrates exceptional potency, while Bob_2 is practically ineffective. To know whether Bob can reliably infer the membership of even a few documents in the dataset, we need to consider the low False-Positive Rate (FPR) regime and report an attack model's True-Positive Rate (TPR) at a low false-positive rate. In this work, we adopt the metric TPR_0.1%, which is the TPR when FPR = 0.1%.
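Computing TPR at a fixed low FPR amounts to thresholding the attack scores so that almost no non-members are flagged. A minimal sketch (function and variable names are ours):

```python
def tpr_at_fpr(scores, labels, target_fpr=0.001):
    """TPR when the decision threshold is set so that at most target_fpr
    of the non-members (label 0) are flagged as members.
    scores: attack classifier's membership scores (higher = more member-like)."""
    negatives = sorted((s for s, m in zip(scores, labels) if m == 0),
                       reverse=True)
    allowed_fp = int(len(negatives) * target_fpr)
    # Threshold at the (allowed_fp + 1)-th highest non-member score:
    # only `allowed_fp` non-members can lie strictly above it.
    threshold = negatives[min(allowed_fp, len(negatives) - 1)]
    positives = [s for s, m in zip(scores, labels) if m == 1]
    return sum(s > threshold for s in positives) / len(positives)
```

With this metric, a single confidently flagged non-member pushes the threshold up, so only attacks that separate some members very cleanly score well, matching the Bob_1 versus Bob_2 discussion above.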

MI Attacks for Summarization Tasks 4.1 A Naive Baseline
The baseline attack is based on the observation that the generated summaries of training data often exhibit higher similarity to the reference summaries, i.e., lower loss values (Varis and Bojar, 2021). In an extreme case, the model memorizes all training document-reference summary pairs and can thus generate perfect summaries for training samples (s = ŝ). Hence, it is natural for Bob to exploit the similarity between ŝ and s as a signal for membership inference.
There are multiple approaches to quantifying text similarity. First, Bob utilizes the human-designed metrics ROUGE-1, ROUGE-2, and ROUGE-L to calculate how many semantic content units from the reference texts are covered by the generated summaries (Lin, 2004; Lin and Och, 2004). Additionally, Bob can adopt neural language quality scores, e.g., the sentence transformer score (Reimers and Gurevych, 2019), to capture semantic textual similarity. Finally, we follow studies in the computer vision domain and also leverage the confidence score, such as the perplexity score, as an MI attack feature. Bob then concatenates all features into one vector, i.e., [ROUGE-1, ROUGE-2, ROUGE-L, Transformer Score, Confidence Score], and employs classifiers, such as a random forest or a multi-layer perceptron, to differentiate training from non-training samples. The baseline attack can be written as follows:

g(•)_Base = g(sim(ŝ, s)), (3)

where sim represents the function that takes two summaries as input and returns a vector of selected similarity evaluation scores.
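As a concrete sketch of one similarity feature, a unigram-overlap F1 can stand in for ROUGE-1 (this is a simplification we introduce for illustration; the actual attack uses the full ROUGE-1/2/L metrics plus a sentence-transformer score and the API confidence):

```python
from collections import Counter

def unigram_f1(generated, reference):
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1."""
    g, r = Counter(generated.split()), Counter(reference.split())
    overlap = sum((g & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(g.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def baseline_features(generated, reference, confidence):
    """Feature vector for the baseline attack g(.)_Base; a full
    implementation would append ROUGE-2, ROUGE-L, and a
    sentence-transformer score before feeding a classifier."""
    return [unigram_f1(generated, reference), confidence]
```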

Document Augmentation for MI
The baseline attack relies solely on text similarity information from a single query. To extract richer signals, Bob can additionally probe the model's robustness: he applies semantic-preserving perturbations to the document f to obtain augmented documents d_1, ..., d_n, and queries the API with each of them to collect the corresponding summaries ŝ_{d_1}, ..., ŝ_{d_n}. To train the classifier, Bob uses the similarity scores between all of these summaries and the reference summary as features, which can be written as follows:

g(•)_Aug = g(sim(ŝ, s), sim(ŝ_{d_1}, s), ..., sim(ŝ_{d_n}, s)). (4)

Compared to eq. 3, the proposed g(•)_Aug can additionally use the summaries' robustness information for MI, e.g., the variance of the similarity scores var(sim(ŝ, s), ..., sim(ŝ_{d_n}, s)).
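A sketch of the augmented feature vector, assuming a similarity function sim(a, b) that returns a scalar score and that summaries of the n perturbed documents have already been collected from the API:

```python
from statistics import pstdev

def augmented_features(sim, reference, generated, perturbed_summaries):
    """Feature vector for g(.)_Aug: similarity of each query's summary
    to the reference, plus the spread of those scores. Training samples
    tend to show a lower spread (higher robustness)."""
    sims = [sim(generated, reference)]
    sims += [sim(sp, reference) for sp in perturbed_summaries]
    return sims + [pstdev(sims)]
```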

Document-only MI Attack
Existing attack methods require Bob to access both the document f and its corresponding reference summary s to perform membership inference. However, it is challenging for Bob to obtain both of these for summarization tasks. Here, we propose a low-resource attack scenario: Bob only has a document and aims to determine whether the document was used to train the model. Under this scenario, the previously proposed attacks cannot be applied, as no reference summary is available. However, the concept of evaluating sample robustness offers a potential solution, as we can approximate the robustness without relying on reference summaries. To this end, we modify g(•)_Aug: instead of calculating the similarity scores between the generated summaries (ŝ, ŝ_{d_1}, ..., ŝ_{d_n}) and the reference summary s, Bob replaces the reference summary s with the generated summary ŝ and estimates the document's robustness by calculating the similarity scores between ŝ and the perturbed documents' summaries (ŝ_{d_1}, ..., ŝ_{d_n}). The proposed document-only MI attack can be written as follows:

g(•)_D_only = g(sim(ŝ_{d_1}, ŝ), ..., sim(ŝ_{d_n}, ŝ)). (5)

Compared to g(•)_Aug, the proposed g(•)_D_only obtains robustness information for the document using only the generated summary ŝ. Our experiments show that this approximate robustness contains valuable membership signals, and g(•)_D_only can effectively infer the membership of specific samples using only documents as input.
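The document-only variant needs no reference summary. A sketch, again assuming a scalar similarity function sim(a, b):

```python
from statistics import pstdev

def document_only_features(sim, generated, perturbed_summaries):
    """Feature vector for g(.)_D_only: each perturbed document's summary
    is compared against the unperturbed document's summary s_hat, which
    substitutes for the unavailable reference summary."""
    sims = [sim(sp, generated) for sp in perturbed_summaries]
    return sims + [pstdev(sims)]
```

The only change from the augmented attack is the comparison target: ŝ replaces s, so the attacker needs nothing beyond API access and the document itself.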

Dataset
We perform our summarization experiments on three datasets: SAMsum, CNN/DailyMail (CNNDM), and MIMIC-cxr (MIMIC). SAMsum (Gliwa et al., 2019) is a dialogue summarization dataset, and CNNDM (Hermann et al., 2015) is a news article summarization dataset. MIMIC is a public radiology report summarization dataset. We adopt task 3 in MEDIQA 2021 (Abacha et al., 2021), which aims to generate the impression section based on the findings and background sections of the radiology report. We choose MIMIC-cxr as the data source, and the original split includes 91,544/2,000 medical report-impression pairs for training/validation. As discussed in Sec. 3.2, we reorganize the datasets into three disjoint sets: A_train, A_out, and B_all. We assume Bob can access around 20% of the dataset to train shadow models and g(•). Table 2 shows the detailed numbers for each split.

Models and Training Details
In our experiments, we adopted two widely used summarization models: BART-base (Lewis et al., 2020) and FLAN-T5 base (Chung et al., 2022) (Results of FLAN-T5 are detailed in the Appendix).
We adopt Adam (Kingma and Ba, 2014) as the optimizer. For SAMsum, CNNDM, and MIMIC, the batch sizes are set to 10/4/4, and the learning rates to 2e-5, 2e-5, and 1e-5, respectively. During inference, we set the length penalty to 2.0, the beam search width to 5, and the max/min generation lengths to 60/10, 140/30, and 50/10, respectively. Alice chooses the best model based on the validation ROUGE-L score. Bob randomly splits B_all into two equal parts: B_in and B_out. Bob employs B_in to train shadow models and chooses the best model based on validation ROUGE-L performance. We train only one shadow model in the experiments. Our implementation is based on the open-source PyTorch transformers repository. All experiments are repeated 5 times, and we report the average results.

Augmentation Methods
We consider three augmentation methods: word synonym (WS), sentence swapping (SW), and back translation (BT). Word synonym randomly chooses 10% of the words in a document and changes them to synonyms from WordNet (Miller, 1995); sentence swapping randomly swaps sentences within the document; back translation translates the document to German and then back-translates it to English. In our main experiments, we generate 6 augmented samples each for WS and SW, and 3 for BT.
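The sentence-swapping perturbation can be sketched as below. The exact swapping scheme (how many pairs, how sentences are segmented) is our assumption for illustration; the paper only names the method:

```python
import random

def sentence_swap(document, n_swaps=1, seed=0):
    """Perturb a document by exchanging the positions of randomly
    chosen sentence pairs; content is preserved, only order changes."""
    rng = random.Random(seed)
    # Naive sentence segmentation on ". " for illustration only.
    sentences = [s for s in document.split(". ") if s]
    for _ in range(n_swaps):
        if len(sentences) < 2:
            break
        i, j = rng.sample(range(len(sentences)), 2)
        sentences[i], sentences[j] = sentences[j], sentences[i]
    return ". ".join(sentences)
```

Because the perturbation is semantic-preserving, a model that memorized the document should still produce a near-identical summary, which is exactly the robustness signal the attacks above exploit.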

Experiment Results
Baseline Attack. We present the results of the baseline attack in Table 1. Our analysis reveals that the attack is successful in predicting membership, as the accuracy and AUC results on the three datasets are above 50%. Furthermore, the attack AUC on the MIMIC and SAMsum datasets is above 65%, which highlights a significant privacy risk to Alice's model. In Figure 2, we examine the feature distributions of A_in and A_out. Our key observation is that A_in exhibits notably higher ROUGE-1, ROUGE-2, and Transformer Scores than A_out, indicating that the model's behavior is distinct on training and non-training samples. Additionally, we discovered that the confidence score, which has been found useful against previous classification models (Shokri et al., 2017), carries little signal for the summarization model. Furthermore, we fine-tune a RoBERTa model on the raw text to differentiate generated summaries of B_in and B_out, referred to as RoBERTa in the table. However, the results indicate that raw text is inferior to the similarity-score features, with the MLP model using similarity scores as features achieving the best performance.
In addition to the AUC and ACC scores, we also evaluate the performance of the attacks in the high-confidence regime. Specifically, we report the true positive rate under a low false positive rate of 0.1%, referred to as TPR_0.1% in the table. Our results demonstrate that the model can reliably identify samples with high confidence. For example, the MLP model achieves a TPR_0.1% of 3.05% on the MIMIC dataset, which means that the model successfully detects 305 samples in A_in with only 10 false positives in A_out.

Document Augmentation MI Attack. In this section, we investigate the effectiveness of evaluating the model's robustness against document modifications as an MI feature. The results of the attack are presented in Table 3. We observe a consistent improvement in attack performance across all datasets compared to g(•)_Base. Specifically, the improvement in TPR_0.1% indicates that the robustness signal allows the attacker to detect more samples with high confidence. We find that sentence swapping achieves the best attack performance across all datasets. In Figure 3, we show the distribution of the standard deviation of ROUGE-L F1 scores, calculated as SD(R-L(ŝ, s), R-L(ŝ_{d_1}, s), ..., R-L(ŝ_{d_n}, s)). We find that the variance of training data is notably lower than that of non-training data, indicating that training samples are more robust against perturbations.

Document-only Attack. In this section, we present the results of our document-only attack. As previously discussed in Section 4.2.1, the attack classifier g(•)_D_only does not have access to reference summaries. Instead, Bob estimates the model's robustness using generated summaries. In Figure 4, we show the distribution of the standard deviation of ROUGE-L scores, calculated as SD(R-L(ŝ_{d_1}, ŝ), R-L(ŝ_{d_2}, ŝ), ..., R-L(ŝ_{d_n}, ŝ)). Similar to the results in Figure 3, the variance of training data is lower than that of non-training data, but with smaller differences. Reflecting on the results, we observe a lower attack performance for document-only attacks (Table 4) compared to g(•)_Base in Table 1. However, the attack accuracy and AUC are above 50%, indicating a privacy risk even under this low-resource attack. More importantly, the TPR_0.1% results show that Bob can still infer certain samples' membership with high confidence.

Ablation Studies
In this section, we investigate several impact factors in MI attacks. All experiments were conducted using the baseline attack with the MLP classifier. A more detailed analysis is in the Appendix.

Impact of Overfitting. In Figure 5, we show the attack AUC and the validation ROUGE-L F1 score under varying training steps on the SAMsum dataset. We find that the attack AUC increases steadily as the number of Alice's training steps increases, which is consistent with previous research (Shokri et al., 2017). Moreover, early stopping by ROUGE score (5 epochs) cannot alleviate the attack: the AUC curve indicates that the model already has a high attack AUC at this checkpoint. A better early stopping point is 3 epochs, which significantly reduces the MI attack AUC without a substantial performance drop. However, in practice, it is hard to select a proper point without relying on an attack model.

Impact of Dataset Size. In this study, we assess the impact of dataset size on MI attacks. To do this, we train our model with 10% to 100% of the total dataset. Our results, as depicted in Figure 6, indicate that as the size of the training set increases, the AUC of MI attacks decreases monotonically for both the SAMsum and CNNDM datasets. This suggests that increasing the number of training samples can help to alleviate overfitting and reduce the MI attack AUC. Some recent studies have highlighted the issue of duplicate training samples in large datasets (Lee et al., 2022). This duplication can escalate the privacy risks associated with these samples and should be taken into consideration when employing large datasets.

Impact of the Model Architecture. In previous experiments, we assumed that Bob uses the same architecture as Alice to train shadow models. In this section, we further explore the transferability of the attack across different model architectures. As shown in Figure 7, where Bob and Alice may choose different model architectures, we evaluate the transferability for various models on the SAMsum dataset, including BART, BertAbs (Liu and Lapata, 2019), PEGASUS (Zhang et al., 2020a), and FLAN-T5 (Chung et al., 2022). The results indicate that the attack AUC is highest when both Bob and Alice employ the same model. However, even when Bob and Alice utilize different models, the MI attack exhibits considerable transferability across the selected model architectures. These findings suggest that the membership signal exploited by the attack classifier generalizes across architectures.

Defense Methods
We now investigate approaches that aim to limit the model's ability to memorize its training data. Specifically, we try two approaches: differentially private SGD (DP-SGD) (Dwork, 2008; Machanavajjhala et al., 2017; Li et al., 2021) and L2 regularization (Song et al., 2019). For DP-SGD, ϵ is the privacy budget, where a lower ϵ indicates higher privacy; for L2 regularization, λ is the regularization weight. We conduct experiments on the SAMsum dataset. As shown in Table 5, we find that as λ increases and ϵ decreases, the attack AUC steadily drops. In particular, when ϵ = 8.0 and λ = 12.0, the AUC drops to about 50%. However, we find that these defense methods cause a notable drop in the ROUGE-L F1 score, indicating a privacy-utility trade-off.
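The L2-regularization defense simply adds a weight penalty to the training loss; a one-line sketch (λ here plays the role of the λ column in Table 5):

```python
def l2_regularized_loss(task_loss, weights, lam):
    """Training loss plus lam * ||w||^2. A larger lam discourages large
    weights and thus memorization (lower attack AUC), at some cost in
    summarization quality (ROUGE)."""
    return task_loss + lam * sum(w * w for w in weights)
```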

Limitations
In this work, we demonstrate that the MI attack is effective. However, it remains unclear what properties make samples more susceptible to MI attacks. In other words, given a model and a dataset, we cannot predict which samples are more likely to be memorized by the model. We find that the samples detected under TPR_0.1% have a shorter average reference length, but further research is needed to fully answer this question. Additionally, it is important to note that while the MI attack is a commonly used attack, its privacy leakage is limited. Other attacks pose a more significant threat in terms of information leakage (Carlini et al., 2021). The evaluation of these attacks on summarization tasks should be prioritized in future studies.

Conclusion
In this paper, we investigated the membership inference attack for the summarization task and explored two attack features: text similarity and data robustness. Experiments show that both features contain fine-grained MI signals. These results reveal the potential privacy risk of summarization models. In the future, we would like to explore advanced defense methods and alleviate the trade-off between privacy and utility.

A Experiments on More Models
In our primary experiments, our focus is on the BART model; however, we are also interested in other commonly used models. Among these, we consider FLAN-T5-base (Chung et al., 2022), where we maintain the same experimental setup but replace Alice's and Bob's models with FLAN-T5. Table 7 presents the baseline attack results, which are consistent with our findings on the BART model. Our analysis demonstrates the successful prediction of membership for the FLAN-T5 model, as evidenced by the accuracy and AUC results across all three datasets exceeding 50%. Furthermore, the attack AUC for the MIMIC and SAMsum datasets exceeds 66%, highlighting a significant privacy risk to Alice's model.
Additionally, Table 8 shows the results of the document augmentation attack, revealing a consistent improvement in attack performance across all datasets compared to g(•)_Base. Notably, the enhancement in TPR_0.1% indicates that the robustness signal enables the attacker to detect more samples with high confidence. Our findings indicate that sentence swapping yields the most effective attack performance across all datasets and metrics.
Moreover, in Table 9, we observe a lower attack performance for document-only attacks compared to g(•)_Base in Table 7. Nevertheless, the attack accuracy and AUC remain above 50%, signifying a privacy risk even in this low-resource setting. Most importantly, the TPR_0.1% results demonstrate that Bob can still infer the membership of certain samples with high confidence.
To conclude, our findings are consistent across both the Flan-T5 and BART models, indicating that these summarization models have the ability to memorize training data and pose a valid threat of leaking membership information.

B Feature Importance
In this section, we examine the feature importance of the baseline MI attack on the SAMsum dataset. Figures 8 and 9 display the feature importance scores as determined by the Random Forest classifier for the baseline attack.
The ROUGE-2 F1 score emerges as the most valuable feature. In contrast, the confidence score, despite its crucial role in MI attacks within the computer vision domain, proves insignificant for the sequence-to-sequence model. This could be attributed to the beam search process, which invariably samples sentences with high confidence, thereby rendering this feature redundant.

C More on Ablation Studies
In this section, we add more ablation studies. First, we study the impact of overfitting and dataset size on the FLAN-T5 model. Then we introduce the impact of the number of queries. All experiments were conducted with the baseline attack method, employing the MLP classifier.

Impact of Overfitting. We study the impact of overfitting in MI attacks on the FLAN-T5 model. Figure 10 shows the attack AUC and validation ROUGE-L F1 score under varying training steps on the SAMsum dataset. We observe that the attack AUC increases steadily as the number of Alice's training steps increases. Early stopping by ROUGE

Figure 8 :
Figure 8: Feature Importance in MI attack on BART model.

Figure 9 :
Figure 9: Feature Importance in MI attack on FLAN-T5 model.

Table 1: Baseline Attack Results. Bob tried different classifiers, including Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and Multi-layer Perceptron (MLP). Following the evaluation protocol in Section 3.3, we show membership attack performance on A_in and A_out.

Figure 2: Distribution of similarity scores of A_in and A_out on the SAMsum dataset.

SAMsum is a dialogue summarization dataset, created by asking linguists to write messenger-like conversations; another group of linguists annotated the reference summaries. The original split includes 14,732/818/819 dialogue-summary pairs for training/validation/test. CNNDM (Hermann et al., 2015) is a news article summarization dataset, collecting news articles from CNN and DailyMail, with summaries created by human annotators. The original split includes 287,227/13,368/11,490 news article-summary pairs for training/validation/test.

Table 2 :
Each dataset is divided into three disjoint sets: A_train, A_out, and B_all. A_in is sampled from A_train with the same size as A_out.

Table 3 :
Document Augmentation Attack Results. Base denotes the baseline attack results from Table 1.

Table 4 :
Document-only Attack Results based on sentence-swapping augmentation.

Table 5 :
Defense Performance on DP-SGD and L2 Regularization with different privacy strengths.