Collin Zhang
2026
Adversarial Decoding: Generating Readable Documents for Adversarial Objectives
Collin Zhang | Tingwei Zhang | Vitaly Shmatikov
Findings of the Association for Computational Linguistics: EACL 2026
We design, implement, and evaluate adversarial decoding, a new, generic text generation technique that produces readable documents for adversarial objectives such as RAG poisoning, jailbreaking, and evasion of defensive filters. Prior generation methods either produce easily detectable gibberish (even methods that optimize for low perplexity), or cannot handle objectives that include embedding similarity. In particular, they cannot produce readable adversarial documents that (1) are retrieved by RAG systems in response to broad classes of queries, and (2) adversarially influence subsequent generation. We measure the effectiveness of adversarial decoding for different objectives and demonstrate that it outperforms existing methods while producing adversarial documents that cannot be automatically distinguished from natural documents by fluency and readability.
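As a rough illustration of objective-guided generation of the kind the abstract describes, the sketch below runs a beam search that scores candidate continuations with a weighted sum of an embedding-similarity term and a fluency term. Both scorers are toy stand-ins (word overlap and a repeated-word penalty), and the weights, vocabulary, and function names are illustrative assumptions, not the paper's actual models or code.

```python
# Toy beam search guided by a combined adversarial objective.
# Assumed/hypothetical: the scorers, weights, and vocabulary below.

def embedding_similarity(text, target="rag query"):
    # Toy proxy for embedding similarity: fraction of target words
    # that appear in the candidate text.
    words = set(text.split())
    hits = sum(1 for w in target.split() if w in words)
    return hits / len(target.split())

def fluency(text):
    # Toy proxy for fluency/perplexity: penalize repeated adjacent words.
    toks = text.split()
    repeats = sum(1 for a, b in zip(toks, toks[1:]) if a == b)
    return -repeats

def beam_search(vocab, steps=4, beam_width=3, alpha=1.0, beta=0.5):
    beams = [""]
    for _ in range(steps):
        # Expand every beam by every vocabulary token, then keep the
        # top candidates under the combined objective.
        candidates = [(b + " " + tok).strip() for b in beams for tok in vocab]
        candidates.sort(
            key=lambda t: alpha * embedding_similarity(t) + beta * fluency(t),
            reverse=True,
        )
        beams = candidates[:beam_width]
    return beams[0]

result = beam_search(["rag", "query", "the", "answer"])
print(result)
```

With real models in place of the toy scorers, the same loop would trade off retrievability (similarity to target queries) against readability, which is the balance the abstract says prior methods fail to strike.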
2024
Extracting Prompts by Inverting LLM Outputs
Collin Zhang | John Xavier Morris | Vitaly Shmatikov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated these outputs. We develop a new black-box method, output2prompt, that extracts prompts without access to the model’s logits and without adversarial or jailbreaking queries. Unlike previous methods, output2prompt only needs outputs of normal user queries. To improve memory efficiency, output2prompt employs a new sparse encoding technique. We measure the efficacy of output2prompt on a variety of user and system prompts and demonstrate zero-shot transferability across different LLMs.
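The black-box setup described above can be sketched in miniature: sample outputs from ordinary queries, then run an inverter over only those outputs. Everything here is a hedged toy, with the "LLM" being a stub whose hidden prompt leaks into its replies, and the "inverter" a crude word-intersection heuristic standing in for the paper's trained model; none of these names or mechanics come from the paper's code.

```python
# Toy illustration of black-box prompt inversion: the inverter sees
# only text outputs from normal queries, never logits or jailbreaks.
# Assumed/hypothetical: the stub LLM and the intersection "inverter".

def collect_outputs(llm, queries):
    # Black-box access: call the model and read its text outputs only.
    return [llm(q) for q in queries]

def invert(outputs):
    # Stand-in for a trained inverter: surface words common to every
    # output, a crude proxy for the hidden prompt's shared influence.
    shared = set(outputs[0].split())
    for o in outputs[1:]:
        shared &= set(o.split())
    return " ".join(sorted(shared))

# Stub "LLM" whose hidden system prompt leaks into every reply.
def toy_llm(query):
    return f"pirate says: {query} matey"

outs = collect_outputs(toy_llm, ["hello", "what time is it"])
print(invert(outs))
```

The real method replaces the intersection heuristic with a trained encoder-decoder, and the abstract's sparse encoding addresses the memory cost of encoding many outputs at once.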