2025
Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models
Abhilasha Ravichander | Jillian Fisher | Taylor Sorensen | Ximing Lu | Maria Antoniak | Bill Yuchen Lin | Niloofar Mireshghallah | Chandra Bhagavatula | Yejin Choi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
High-quality training data has proven crucial for developing performant large language models (LLMs). However, commercial LLM providers disclose few, if any, details about the data used for training. This lack of transparency creates multiple challenges: it limits external oversight and inspection of LLMs for issues such as copyright infringement, it undermines the agency of data authors, and it hinders scientific research on critical issues such as data contamination and data selection. How can we recover what training data is known to LLMs? In this work we demonstrate a new method to identify training data known to proprietary LLMs like GPT-4 without requiring any access to model weights or token probabilities, by using information-guided probes. Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes. By evaluating a model’s ability to successfully reconstruct high-surprisal tokens in text, we can identify a surprising number of texts memorized by LLMs.
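A minimal sketch of the surprisal-guided idea, not the paper's implementation: it assumes a small open model (GPT-2) as the surprisal estimator and a simple top-k selection of reconstruction targets; the probed proprietary model would then be asked to fill in the blanked positions.

```python
# Sketch: estimate per-token surprisal with a small open LM and select
# high-surprisal tokens as reconstruction targets for a memorization probe.
# Assumptions (not from the paper): GPT-2 as estimator, top-k selection.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def high_surprisal_targets(text: str, k: int = 5):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Surprisal of token t given its prefix: -log p(x_t | x_<t)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    surprisal = -token_lp.squeeze(0)
    top = torch.topk(surprisal, min(k, surprisal.numel())).indices + 1
    return [(i, tok.decode([int(ids[0, i])])) for i in top.tolist()]

# The probe would blank out these positions and ask the target LLM (e.g., via
# an API) to reconstruct them; exact recovery of many high-surprisal tokens
# is treated as evidence that the passage was seen during training.
print(high_surprisal_targets("Call me Ishmael. Some years ago, never mind how long precisely."))
```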
ALPACA AGAINST VICUNA: Using LLMs to Uncover Memorization of LLMs
Aly M. Kassem | Omar Mahmoud | Niloofar Mireshghallah | Hyunwoo Kim | Yulia Tsvetkov | Yejin Choi | Sherif Saad | Santu Rana
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
In this paper, we investigate the overlooked impact of instruction-tuning on memorization in large language models (LLMs), which has largely been studied in base, pre-trained models. We propose a black-box prompt optimization method where an attacker LLM agent uncovers higher levels of memorization in a victim agent, surpassing traditional approaches that prompt the model directly with training data. Using an iterative rejection-sampling process, we design instruction-based prompts that minimize overlap with training data to avoid providing direct solutions while maximizing overlap between the victim’s output and the training data to induce memorization. Our method shows 23.7% more overlap with training data compared to state-of-the-art baselines. We explore two attack settings: an analytical approach that determines the empirical upper bound of the attack, both with and without access to responses for prompt initialization, and a practical classifier-based method for assessing memorization without access to memorized data. Our findings reveal that instruction-tuned models can expose pre-training data as much as, or more than, base models; contexts beyond the original training data can lead to leakage; and instructions generated by other LLMs open new avenues for automated attacks, which we believe require further exploration.
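An illustrative sketch of the iterative rejection-sampling idea under stated assumptions: `attacker_generate` and `victim_respond` are placeholders for LLM calls, and the word-overlap score stands in for the paper's actual overlap metric.

```python
# Keep candidate instruction prompts that overlap little with the training
# text themselves (so they do not hand over the answer) but elicit victim
# outputs that overlap heavily with it (inducing memorization).
def overlap(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def rejection_sample_prompt(train_text, attacker_generate, victim_respond,
                            rounds=10, max_prompt_overlap=0.2):
    best_prompt, best_score = None, -1.0
    for _ in range(rounds):
        prompt = attacker_generate(train_text)      # attacker proposes an instruction
        if overlap(prompt, train_text) > max_prompt_overlap:
            continue                                # reject: prompt leaks the solution
        output = victim_respond(prompt)             # victim answers the instruction
        score = overlap(output, train_text)         # reward: regurgitated training text
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```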
Differentially Private Learning Needs Better Model Initialization and Self-Distillation
Ivoline C. Ngong | Joseph Near | Niloofar Mireshghallah
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Differentially private SGD (DPSGD) enables privacy-preserving training of language models, but often reduces utility, diversity, and linguistic quality. We introduce DPRefine, a three-phase method that initializes a model using data synthesis from a small pre-trained LM with rigorous filtering, applies DP finetuning on private data, and performs self-distillation to refine outputs. This approach significantly outperforms vanilla DPSGD, with AlpacaEval preferring DPRefine’s generations in 78.38% of cases across all datasets and metrics, while also demonstrating substantial improvements in lexical diversity, achieving 85.31% in MSTTR and 86.82% in Jaccard similarity. Our fine-grained analysis reveals that DPRefine reduces linguistic errors in generated text by 84%, mitigating grammar errors, spelling mistakes, and missing punctuation commonly associated with DPSGD. It also reduces inconsistencies present in non-private models, such as fabricated details and misattributed quotes. We find that small models like GPT-2 and T5 are effective for initialization and distillation, highlighting their potential in enabling scalable and efficient deployment of high-performing, privacy-preserving language models with improved linguistic quality and consistency.
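A structural sketch of the three-phase pipeline as described in the abstract, with every phase reduced to a placeholder callable; the privacy budget and training routines below are illustrative assumptions, not the paper's code.

```python
def dprefine(small_lm, private_data, synthesize, filter_quality,
             train, dp_finetune, self_distill):
    # Phase 1: warm-start from synthetic data generated by a small pre-trained
    # LM (e.g., GPT-2 or T5), keeping only examples that pass rigorous filtering.
    synthetic = [ex for ex in synthesize(small_lm) if filter_quality(ex)]
    model = train(small_lm, synthetic)
    # Phase 2: DP-SGD fine-tuning on the private data under a fixed budget
    # (epsilon value here is an assumed placeholder).
    model = dp_finetune(model, private_data, epsilon=8.0)
    # Phase 3: self-distillation on the model's own refined generations; as
    # post-processing of a DP model, this consumes no additional privacy budget.
    refined = self_distill(model)
    return train(model, refined)
```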
2024
Smaller Language Models are Better Zero-shot Machine-Generated Text Detectors
Niloofar Mireshghallah | Justus Mattern | Sicun Gao | Reza Shokri | Taylor Berg-Kirkpatrick
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
As large language models are becoming more embedded in different user-facing services, it is important to be able to distinguish between human-written and machine-generated text to verify the authenticity of news articles, product reviews, etc. Thus, in this paper we set out to explore whether it is possible to use one language model to identify machine-generated text produced by another language model, in a zero-shot way, even if the two have different architectures and are trained on different data. We find that overall, smaller models are better universal machine-generated text detectors: they can more precisely detect text generated from both small and larger models, without the need for any additional training/data. Interestingly, we find that whether or not the detector and generator models were trained on the same data is not critically important to the detection success. For instance, the OPT-125M model has an AUC of 0.90 in detecting GPT-4 generations, whereas a larger model from the GPT family, GPT-J-6B, has an AUC of 0.65.
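A minimal sketch of zero-shot detection with a small open detector: each text is scored by its average log-likelihood under the detector model, and AUC measures how well the scores separate machine-generated from human-written text. OPT-125M matches the family named in the abstract; the scoring details are simplified assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.metrics import roc_auc_score

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
lm = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
lm.eval()

def avg_loglik(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)   # cross-entropy over next-token predictions
    return -out.loss.item()         # higher = more likely under the detector

def detection_auc(machine_texts, human_texts) -> float:
    scores = [avg_loglik(t) for t in machine_texts + human_texts]
    labels = [1] * len(machine_texts) + [0] * len(human_texts)
    return roc_auc_score(labels, scores)  # machine text tends to score higher
```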
CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation
Tong Chen | Akari Asai | Niloofar Mireshghallah | Sewon Min | James Grimmelmann | Yejin Choi | Hannaneh Hajishirzi | Luke Zettlemoyer | Pang Wei Koh
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Evaluating the degree of reproduction of copyright-protected content by language models (LMs) is of significant interest to the AI and legal communities. Although both literal and non-literal similarities are considered by courts when assessing the degree of reproduction, prior research has focused only on literal similarities. To bridge this gap, we introduce CopyBench, a benchmark designed to measure both literal and non-literal copying in LM generations. Using copyrighted fiction books as text sources, we provide automatic evaluation protocols to assess literal and non-literal copying, balanced against the model utility in terms of the ability to recall facts from the copyrighted works and generate fluent completions. We find that, although literal copying is relatively rare, two types of non-literal copying—event copying and character copying—occur even in models as small as 7B parameters. Larger models demonstrate significantly more copying, with literal copying rates increasing from 0.2% to 10.5% and non-literal copying from 2.3% to 5.9% when comparing Llama3-8B and 70B models, respectively. We further evaluate the effectiveness of current strategies for mitigating copying and show that (1) training-time alignment can reduce literal copying but may increase non-literal copying, and (2) current inference-time mitigation methods primarily reduce literal but not non-literal copying.
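A sketch of a literal-copying check in the spirit of the benchmark, under assumptions: a generation is flagged if it shares a long enough verbatim word span with the copyrighted source (the 50-word threshold is illustrative). Non-literal copying of events and characters requires semantic evaluation and is not captured by this kind of surface match.

```python
def longest_shared_span(generation: str, source: str) -> int:
    # Word-level longest common substring via dynamic programming.
    gen, src = generation.split(), source.split()
    best, prev = 0, [0] * (len(src) + 1)
    for g in gen:
        cur = [0] * (len(src) + 1)
        for j, s in enumerate(src, start=1):
            if g == s:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def is_literal_copy(generation: str, source: str, min_words: int = 50) -> bool:
    return longest_shared_span(generation, source) >= min_words
```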
LatticeGen: Hiding Generated Text in a Lattice for Privacy-Aware Large Language Model Generation on Cloud
Mengke Zhang | Tianxing He | Tianle Wang | Lu Mi | Niloofar Mireshghallah | Binyi Chen | Hao Wang | Yulia Tsvetkov
Findings of the Association for Computational Linguistics: NAACL 2024
In the current user-server interaction paradigm of prompted generation with large language models (LLMs) on cloud, the server fully controls the generation process, which leaves zero options for users who want to keep the generated text private to themselves. For privacy-aware text generation on cloud, we propose LatticeGen, a cooperative protocol in which the server still handles most of the computation while the client controls the sampling operation. The key idea is that the true generated sequence is mixed with noise tokens by the client and hidden in a noised lattice. Only the client knows which tokens are the true ones. Considering potential attacks from a hypothetically malicious server and how the client can defend against them, we propose the repeated beam-search attack and the mixing noise scheme. In our experiments we apply LatticeGen to protect both prompt and generation. It is shown that while the noised lattice degrades generation quality, LatticeGen successfully protects the true generation to a remarkable degree under strong attacks (more than 50% of the semantic remains hidden as measured by BERTScore).
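A toy sketch of the lattice idea under stated assumptions: at each step the client hides the true token among noise tokens, and only the client remembers which slot holds the truth. `sample_true_token` and `sample_noise_token` stand in for client-side sampling and the noise scheme; the real protocol also protects the prompt and defends against the repeated beam-search attack.

```python
import random

def build_noised_lattice(n_steps, width, sample_true_token, sample_noise_token):
    lattice, secret_positions = [], []
    for t in range(n_steps):
        column = [sample_noise_token(t) for _ in range(width)]
        idx = random.randrange(width)
        column[idx] = sample_true_token(t)   # true token hidden among noise
        lattice.append(column)               # what the server sees
        secret_positions.append(idx)         # kept only by the client
    return lattice, secret_positions

def recover_generation(lattice, secret_positions):
    # Only the client, holding the secret positions, can read off the true text.
    return [col[i] for col, i in zip(lattice, secret_positions)]
```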
Proceedings of the Fifth Workshop on Privacy in Natural Language Processing
Ivan Habernal | Sepideh Ghanavati | Abhilasha Ravichander | Vijayanta Jain | Patricia Thaine | Timour Igamberdiev | Niloofar Mireshghallah | Oluwaseyi Feyisetan
Proceedings of the Fifth Workshop on Privacy in Natural Language Processing
2023
Simple Temporal Adaptation to Changing Label Sets: Hashtag Prediction via Dense KNN
Niloofar Mireshghallah | Nikolai Vogler | Junxian He | Omar Florez | Ahmed El-Kishky | Taylor Berg-Kirkpatrick
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
User-generated social media data is constantly changing as new trends influence online discussion and personal information is deleted due to privacy concerns. However, traditional NLP models rely on fixed training datasets, which means they are unable to adapt to temporal change—both test distribution shift and deleted training data—without frequent, costly re-training. In this paper, we study temporal adaptation through the task of longitudinal hashtag prediction and propose a non-parametric dense retrieval technique, which does not require re-training, as a simple but effective solution. In experiments on a newly collected, publicly available, year-long Twitter dataset exhibiting temporal distribution shift, our method improves by 64% over the best static parametric baseline while avoiding costly gradient-based re-training. Our approach is also particularly well-suited to dynamically deleted user data in line with data privacy laws, with negligible computational cost/performance loss.
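A minimal sketch of non-parametric hashtag prediction by dense retrieval, with assumptions: `embed` is a placeholder for any sentence encoder, and neighbors vote with their hashtags. Because there are no trained parameters, deleting a user's posts only means removing their rows from the datastore, with no re-training.

```python
from collections import Counter
import numpy as np

def predict_hashtags(query, embed, datastore_vecs, datastore_tags, k=10, top=3):
    q = embed(query)
    vecs = np.asarray(datastore_vecs)
    # Cosine similarity between the query post and every datastore post.
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    nearest = np.argsort(-sims)[:k]          # indices of the k most similar posts
    votes = Counter(tag for i in nearest for tag in datastore_tags[i])
    return [tag for tag, _ in votes.most_common(top)]
```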