Ricardo Mendes

2025

IncogniText: Privacy-enhancing Conditional Text Anonymization via LLM-based Private Attribute Randomization
Ahmed Frikha | Nassim Walha | Krishna Kanth Nakka | Ricardo Mendes | Xue Jiang | Xuebing Zhou
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

In this work, we address the problem of text anonymization where the goal is to prevent adversaries from correctly inferring private attributes of the author, while keeping the text utility, i.e., meaning and semantics. We propose IncogniText, a technique that anonymizes the text to mislead a potential adversary into predicting a wrong private attribute value. Our empirical evaluation shows a reduction of private attribute leakage by more than across 8 different private attributes. Finally, we demonstrate the maturity of IncogniText for real-world applications by distilling its anonymization capability into a set of LoRA parameters associated with an on-device model. Our results show the possibility of reducing privacy leakage by more than half with limited impact on utility.

pdf bib abs

PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders
Ahmed Frikha | Muhammad Reza Ar Razi | Krishna Kanth Nakka | Ricardo Mendes | Xue Jiang | Xuebing Zhou
Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Large Language Models (LLMs) achieve impressive natural language processing performance but can memorize and leak Personally Identifiable Information (PII), posing serious privacy risks. Existing mitigation strategies—such as differential privacy and neuron-level interventions—often degrade utility or fail to reliably prevent leakage. We present PrivacyScalpel, a privacy-preserving framework that leverages LLM interpretability to identify and suppress PII leakage while preserving performance. PrivacyScalpel operates in three stages: (1) Feature Probing to locate model layers encoding PII-rich representations; (2) Sparse Autoencoding using a k-Sparse Autoencoder (k-SAE) to disentangle and isolate privacy-sensitive features; and (3) Feature-Level Interventions via targeted ablation and vector steering to reduce leakage. Experiments on Gemma2-2B and Llama2-7B fine-tuned with the Enron dataset show that PrivacyScalpel reduces email leakage from 5.15% to 0.0% while retaining over 99.4% of the original utility. Compared to neuron-level methods, our approach achieves a superior privacy–utility trade-off, highlighting the effectiveness of targeting sparse, monosemantic features over polysemantic neurons. Beyond privacy gains, PrivacyScalpel offers interpretability insights into PII memorization mechanisms, contributing to safer and more transparent LLM deployment.

pdf bib abs

PII-Scope: A Comprehensive Study on Training Data Privacy Leakage in Pretrained LLMs
Krishna Kanth Nakka | Ahmed Frikha | Ricardo Mendes | Xue Jiang | Xuebing Zhou
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

In this work, we introduce PII-Scope, a comprehensive benchmark designed to evaluate state-of-the-art methodologies for PII extraction attacks targeting LLMs across diverse threat settings. Our study provides a deeper understanding of these attacks by uncovering several hyperparameters (e.g., demonstration selection) crucial to their effectiveness. Building on this understanding, we extend our study to more realistic attack scenarios, exploring PII attacks that employ advanced adversarial strategies, including repeated and diverse querying, and leveraging iterative learning for continual PII extraction. Through extensive experimentation, our results reveal a notable underestimation of PII leakage in existing single-query attacks. In fact, we show that with sophisticated adversarial capabilities and a limited query budget, PII extraction rates can increase by up to fivefold when targeting the pretrained model. Moreover, we evaluate PII leakage on finetuned models, showing that they are more vulnerable to leakage than pretrained models. Overall, our work establishes a rigorous empirical benchmark for PII extraction attacks in realistic threat scenarios and provides a strong foundation for developing effective mitigation strategies.

2024

pdf bib abs

PII-Compass: Guiding LLM training data extraction prompts towards the target PII via grounding
Krishna Kanth Nakka | Ahmed Frikha | Ricardo Mendes | Xue Jiang | Xuebing Zhou
Proceedings of the Fifth Workshop on Privacy in Natural Language Processing

The latest and most impactful advances in large models stem from their increased size. Unfortunately, this translates into an improved memorization capacity, raising data privacy concerns. Specifically, it has been shown that models can output personal identifiable information (PII) contained in their training data. However, reported PII extraction performance varies widely, and there is no consensus on the optimal methodology to evaluate this risk, resulting in underestimating realistic adversaries. In this work, we empirically demonstrate that it is possible to improve the extractability of PII by over ten-fold by grounding the prefix of the manually constructed extraction prompt with in-domain data. This approach achieves phone number extraction rates of 0.92%, 3.9%, and 6.86% with 1, 128, and 2308 queries, respectively, i.e., the phone number of 1 person in 15 is extractable.

Co-authors

Nassim Walha 1

Venues

Fix author