Esha Shivakumar


2026

As large language models (LLMs) are increasingly adopted and trusted in real-world applications, understanding their behavior beyond single-turn prompting has become critical. Existing safety evaluations focus primarily on refusal-based methods that test whether models decline inappropriate or violent requests, leaving open the question of how models behave in interactive social settings. In this paper, we observe the adversarial behavior of LLMs in multi-agent simulations of five diverse conversational social deduction games. These games serve as testbeds in which social pressure and survival stress arise from the game design itself, without explicit prompt injection. From these interactions, we construct a closed behavioral taxonomy, derived through open card sorting, and apply it uniformly across models using a meta-LLM for behavior labeling. Our results show that models exhibit distinct behavioral profiles and that models' differing approaches to structured deliberation influence both social stability and competitive success.