Safety Is Not Universal: The Selective Safety Trap in LLM Alignment

Iago Alves Brito; Walcy Rios; Julia Soares Dollis; Diogo Fernandes Costa Silva; Arlindo Rodrigues Galvão Filho

Abstract

Current safety evaluations of large language models (LLMs) create a dangerous illusion of universal protection by aggregating harms under generic categories such as "Identity Hate", obscuring vulnerabilities toward specific populations. In this work, we expose the Selective Safety Trap: a systemic failure mode in which models robustly defend specific populations while leaving underrepresented communities highly vulnerable to identical adversarial attacks. To systematically audit this phenomenon, we introduce MiJaBench, a bilingual (English–Portuguese) adversarial benchmark comprising 43,961 controlled jailbreaking prompts across 16 minority groups. By evaluating 14 state-of-the-art LLMs on MiJaBench, we curate 615,454 prompt-response pairs that compose MiJaBench-Align, revealing that safety alignment is not a uniform semantic capability but a demographic hierarchy, with defense rates fluctuating by up to 42% within the same model solely based on the target group. This disparity persists across architectures and languages and is amplified by scaling, indicating that current alignment methods learn group-specific safeguards rather than a generalized notion of harm. Through targeted direct preference optimization (DPO) on a 1B-parameter baseline, we achieve strong zero-shot safety generalization to entirely unseen demographics and complex attack strategies. We release all datasets and scripts to provide the community with a concrete pathway toward equitable, transferable safety alignment.

Anthology ID: 2026.findings-acl.489
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 10044–10065
URL: https://aclanthology.org/2026.findings-acl.489/
Cite (ACL): Iago Alves Brito, Walcy Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, and Arlindo Rodrigues Galvão Filho. 2026. Safety Is Not Universal: The Selective Safety Trap in LLM Alignment. In Findings of the Association for Computational Linguistics: ACL 2026, pages 10044–10065, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Safety Is Not Universal: The Selective Safety Trap in LLM Alignment (Brito et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.489.pdf
Checklist: 2026.findings-acl.489.checklist.pdf
