Extracting Prompts by Inverting LLM Outputs

Collin Zhang, John Morris, Vitaly Shmatikov


Abstract
We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated these outputs. We develop a new black-box method, output2prompt, that extracts prompts without access to the model’s logits and without adversarial or jailbreaking queries. Unlike previous methods, output2prompt only needs outputs of normal user queries. To improve memory efficiency, output2prompt employs a new sparse encoding technique. We measure the efficacy of output2prompt on a variety of user and system prompts and demonstrate zero-shot transferability across different LLMs.
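The black-box setup described above can be sketched as a simple pipeline: collect outputs of ordinary, non-adversarial user queries from the target model, then feed those outputs to a separate inverter model that predicts the hidden prompt. The sketch below is purely illustrative; the function names `query_target_llm` and `invert_outputs` are hypothetical stand-ins, not the paper's actual API, and both models are stubbed.

```python
# Illustrative sketch of the output2prompt threat model (assumptions:
# both the target LLM and the inverter are stubbed; real use would call
# a black-box LLM endpoint and a trained inverter network).

def query_target_llm(user_query: str) -> str:
    """Stand-in for a black-box LLM endpoint (no logit access)."""
    return f"Response to: {user_query}"

def invert_outputs(outputs: list[str]) -> str:
    """Stand-in for a trained inverter mapping outputs -> prompt."""
    return "<predicted hidden prompt>"

# Normal user queries -- no jailbreaking or adversarial strings needed.
queries = ["What can you help me with?", "Give me an example task."]
outputs = [query_target_llm(q) for q in queries]
predicted_prompt = invert_outputs(outputs)
```

The key point the sketch captures is that the attacker only observes `outputs`; everything about the hidden prompt must be recovered by the inverter from that text alone.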
Anthology ID:
2024.emnlp-main.819
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
14753–14777
URL:
https://aclanthology.org/2024.emnlp-main.819
Cite (ACL):
Collin Zhang, John Morris, and Vitaly Shmatikov. 2024. Extracting Prompts by Inverting LLM Outputs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14753–14777, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Extracting Prompts by Inverting LLM Outputs (Zhang et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.819.pdf