@inproceedings{mathew-etal-2025-hidden,
title = "Hidden in Plain Text: Emergence {\&} Mitigation of Steganographic Collusion in {LLM}s",
author = "Mathew, Yohan and
Matthews, Ollie and
McCarthy, Robert and
Velja, Joan and
Schroeder de Witt, Christian and
Cope, Dylan and
Schoots, Nandi",
editor = "Inui, Kentaro and
Sakti, Sakriani and
Wang, Haofen and
Wong, Derek F. and
Bhattacharyya, Pushpak and
Banerjee, Biplab and
Ekbal, Asif and
Chakraborty, Tanmoy and
Singh, Dhirendra Pratap",
booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
month = dec,
year = "2025",
address = "Mumbai, India",
publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
url = "https://aclanthology.org/2025.ijcnlp-long.34/",
pages = "585--624",
ISBN = "979-8-89176-298-5",
abstract = "The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render such collusion practically undetectable. This underscores the need for investigations into the possibility of such behaviours emerging and the robustness of corresponding countermeasures. To investigate this problem, we design two approaches {--} a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method {--} for reliably eliciting sophisticated LLM-generated linguistic text steganography. We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due to misspecified reward incentives during training. Additionally, we find that standard mitigations {---} both passive oversight of model outputs and active mitigation through communication paraphrasing {---} are not fully effective at preventing this steganographic communication. Our findings imply that (i) emergence of steganographic collusion is a plausible concern that should be monitored and researched, and (ii) preventing emergence may require innovation in mitigation techniques."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="mathew-etal-2025-hidden">
<titleInfo>
<title>Hidden in Plain Text: Emergence &amp; Mitigation of Steganographic Collusion in LLMs</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yohan</namePart>
<namePart type="family">Mathew</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ollie</namePart>
<namePart type="family">Matthews</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Robert</namePart>
<namePart type="family">McCarthy</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joan</namePart>
<namePart type="family">Velja</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Christian</namePart>
<namePart type="family">Schroeder de Witt</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dylan</namePart>
<namePart type="family">Cope</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Nandi</namePart>
<namePart type="family">Schoots</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-12</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics</title>
</titleInfo>
<name type="personal">
<namePart type="given">Kentaro</namePart>
<namePart type="family">Inui</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sakriani</namePart>
<namePart type="family">Sakti</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Haofen</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Derek</namePart>
<namePart type="given">F</namePart>
<namePart type="family">Wong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Pushpak</namePart>
<namePart type="family">Bhattacharyya</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Biplab</namePart>
<namePart type="family">Banerjee</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Asif</namePart>
<namePart type="family">Ekbal</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tanmoy</namePart>
<namePart type="family">Chakraborty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dhirendra</namePart>
<namePart type="given">Pratap</namePart>
<namePart type="family">Singh</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>The Asian Federation of Natural Language Processing and The Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Mumbai, India</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-298-5</identifier>
</relatedItem>
<abstract>The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render such collusion practically undetectable. This underscores the need for investigations into the possibility of such behaviours emerging and the robustness of corresponding countermeasures. To investigate this problem, we design two approaches – a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method – for reliably eliciting sophisticated LLM-generated linguistic text steganography. We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due to misspecified reward incentives during training. Additionally, we find that standard mitigations — both passive oversight of model outputs and active mitigation through communication paraphrasing — are not fully effective at preventing this steganographic communication. Our findings imply that (i) emergence of steganographic collusion is a plausible concern that should be monitored and researched, and (ii) preventing emergence may require innovation in mitigation techniques.</abstract>
<identifier type="citekey">mathew-etal-2025-hidden</identifier>
<location>
<url>https://aclanthology.org/2025.ijcnlp-long.34/</url>
</location>
<part>
<date>2025-12</date>
<extent unit="page">
<start>585</start>
<end>624</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
%A Mathew, Yohan
%A Matthews, Ollie
%A McCarthy, Robert
%A Velja, Joan
%A Schroeder de Witt, Christian
%A Cope, Dylan
%A Schoots, Nandi
%Y Inui, Kentaro
%Y Sakti, Sakriani
%Y Wang, Haofen
%Y Wong, Derek F.
%Y Bhattacharyya, Pushpak
%Y Banerjee, Biplab
%Y Ekbal, Asif
%Y Chakraborty, Tanmoy
%Y Singh, Dhirendra Pratap
%S Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
%D 2025
%8 December
%I The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
%C Mumbai, India
%@ 979-8-89176-298-5
%F mathew-etal-2025-hidden
%X The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render such collusion practically undetectable. This underscores the need for investigations into the possibility of such behaviours emerging and the robustness of corresponding countermeasures. To investigate this problem, we design two approaches – a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method – for reliably eliciting sophisticated LLM-generated linguistic text steganography. We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due to misspecified reward incentives during training. Additionally, we find that standard mitigations — both passive oversight of model outputs and active mitigation through communication paraphrasing — are not fully effective at preventing this steganographic communication. Our findings imply that (i) emergence of steganographic collusion is a plausible concern that should be monitored and researched, and (ii) preventing emergence may require innovation in mitigation techniques.
%U https://aclanthology.org/2025.ijcnlp-long.34/
%P 585-624
Markdown (Informal)
[Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs](https://aclanthology.org/2025.ijcnlp-long.34/) (Mathew et al., IJCNLP-AACL 2025)
ACL
Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan Cope, and Nandi Schoots. 2025. Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 585–624, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
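For convenience, a minimal LaTeX sketch of how the BibTeX record above might be used with biblatex; the file name references.bib, the biber backend, and the authoryear style are illustrative assumptions, not part of the Anthology export.

% Minimal sketch: citing mathew-etal-2025-hidden from a LaTeX document.
% Assumes the @inproceedings record above has been saved to references.bib (illustrative name).
\documentclass{article}
\usepackage[backend=biber, style=authoryear]{biblatex}
\addbibresource{references.bib}
\begin{document}
Misspecified reward incentives can give rise to steganographic collusion \autocite{mathew-etal-2025-hidden}.
\printbibliography
\end{document}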