The Adaptive Interrogator: Detecting Trojan LLMs in Multi-Agent Systems via Evolved Conversational Strategies

Rana Muhammad Shahroz Khan; Ruichen Zhang; Zhen Tan; Charles Fleming; Tianlong Chen

Abstract

While Large Language Model (LLM) safety has focused on single-agent, white-box settings, the adoption of Multi-Agent Systems (MAS) creates a critical blind spot: supply chain vulnerabilities in MAS ecosystems. These systems often rely on third-party agents accessed via black-box APIs, creating risks where attackers can embed hidden triggers to manipulate collective reasoning or outputs. Because internal weights are inaccessible, traditional white-box defenses fail to detect these threats. Consequently, a critical gap exists in auditing these systems for “Trojan” agents, i.e., malicious models that behave normally until triggered by specific, often multi-turn, conversational contexts. To bridge this gap, we introduce the Conversational Trojan Unmasking System (CTUS), a black-box auditing framework that leverages an Evolutionary Algorithm (EA) to autonomously expose hidden threats. Drawing on social deduction mechanics, CTUS deploys a “Judge” agent to evolve conversational probes that provoke Trojan agents into revealing their malicious nature without alerting benign peers. We validate CTUS across diverse architectures (Llama-2/3, Gemma, Mistral) and attack vectors (word, syntax, semantic, RLHF). Our results demonstrate that CTUS achieves superior detection rates (up to 100% in specific configurations). Furthermore, we conduct rigorous analyses to confirm the framework’s robustness, exhibiting negligible false positives on benign systems and stability across system configurations, establishing CTUS as a scalable safeguard for the multi-agent landscape.

Anthology ID: 2026.findings-acl.1348
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 27029–27044
URL: https://aclanthology.org/2026.findings-acl.1348/
Cite (ACL): Rana Muhammad Shahroz Khan, Ruichen Zhang, Zhen Tan, Charles Fleming, and Tianlong Chen. 2026. The Adaptive Interrogator: Detecting Trojan LLMs in Multi-Agent Systems via Evolved Conversational Strategies. In Findings of the Association for Computational Linguistics: ACL 2026, pages 27029–27044, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): The Adaptive Interrogator: Detecting Trojan LLMs in Multi-Agent Systems via Evolved Conversational Strategies (Khan et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1348.pdf
Checklist: 2026.findings-acl.1348.checklist.pdf
