Linguistic Identity Leakage: When Language Reveals Identity in Anonymized Text

Wajdi Zaghouani

doi:10.18653/v1/2026.privatenlp-main.8

Abstract

Privacy-preserving natural language processing (NLP) typically focuses on removing explicit identifiers such as names, addresses, and phone numbers. We argue that this approach overlooks a key risk: natural language itself encodes signals about a speaker’s geographic origin, social background, and community membership that persist after anonymization. We introduce Linguistic Identity Leakage (LIL), defined as the inference of personal or demographic attributes from linguistic features in text where explicit identifiers have been removed. We further introduce Linguistic Personally Identifiable Information (L-PII) to denote the linguistic features that enable such inference. Drawing on sociolinguistics, stylometry, and NLP privacy research, we propose a taxonomy of linguistic identity signals across five categories and examine implications for dataset release, language model training, and privacy auditing. Using examples from Arabic dialectal variation and other multilingual contexts, we present the Identity Inference Risk (IIR) framework for assessing residual privacy risk in NLP systems and discuss how contemporary LLMs amplify these risks. Our goal is to encourage broader recognition of the gap between conventional anonymization practices and the linguistic reality of natural language data.

Anthology ID: 2026.privatenlp-main.8
Volume: Proceedings of the Seventh Workshop on Privacy in Natural Language Processing
Month: July
Year: 2026
Address: San Diego, California
Editors: Ivan Habernal, Sepideh Ghanavati, Sara Haghighi, Krithika Ramesh, Timour Igamberdiev, Shomir Wilson
Venues: PrivateNLP | WS
Publisher: Association for Computational Linguistics
Pages: 107–117
URL: https://aclanthology.org/2026.privatenlp-main.8/
DOI: 10.18653/v1/2026.privatenlp-main.8
Cite (ACL): Wajdi Zaghouani. 2026. Linguistic Identity Leakage: When Language Reveals Identity in Anonymized Text. In Proceedings of the Seventh Workshop on Privacy in Natural Language Processing, pages 107–117, San Diego, California. Association for Computational Linguistics.
Cite (Informal): Linguistic Identity Leakage: When Language Reveals Identity in Anonymized Text (Zaghouani, PrivateNLP 2026)
PDF: https://aclanthology.org/2026.privatenlp-main.8.pdf
