Measuring Large Language Models’ Adversarial Behavior in Social Deduction Games

Marissa Zhao Li; Esha Shivakumar; Peiran Wang; Ying Li; Yuan Tian

Abstract

As large language models (LLMs) are increasingly adopted and trusted in real-world applications, understanding their behavior beyond single-turn prompting has become critical. Existing safety evaluations primarily focus on refusal-based methods that test whether models avoid responding to inappropriate or violent requests, leaving open questions about how models behave in interactive social settings. In this paper, we observe the adversarial behavior of LLMs through multi-agent simulations across five diverse conversational social deduction games, which serve as testbeds that impose social pressure and survival stress through game design alone, without explicit prompt injections. From these interactions, we construct a closed behavioral taxonomy derived through open card sorting and apply it uniformly across models using a meta-LLM for behavior labeling. This approach shows that models exhibit distinct behavioral profiles and that the different ways in which models carry out structured deliberation influence both social stability and competitive success.

Anthology ID: 2026.findings-acl.2043
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 41099–41115
URL: https://aclanthology.org/2026.findings-acl.2043/
Cite (ACL): Marissa Zhao Li, Esha Shivakumar, Peiran Wang, Ying Li, and Yuan Tian. 2026. Measuring Large Language Models’ Adversarial Behavior in Social Deduction Games. In Findings of the Association for Computational Linguistics: ACL 2026, pages 41099–41115, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Measuring Large Language Models’ Adversarial Behavior in Social Deduction Games (Li et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.2043.pdf
Checklist: 2026.findings-acl.2043.checklist.pdf
