Multilingual Refusal Alignment for Safer Large Language Models

Aleksandra Krasnodębska; Wojciech Kusa; Aldo Lipani

doi:10.18653/v1/2026.findings-acl.1537

Abstract

As Large Language Models (LLMs) are deployed globally, ensuring their safety and alignment across multiple languages becomes paramount. However, safety behaviors often vary unpredictably between languages, posing significant challenges for consistent and ethical AI. In this work, we systematically investigate the dynamics of multilingual alignment, exploring whether single-language alignment transfers cross-lingually, how language consistency is preserved during training, and the resulting trade-offs with general knowledge capabilities. We introduce RefusEU, a novel refusal alignment dataset covering 12 European languages, including a dedicated test set for evaluating current state-of-the-art models. Our controlled Direct Preference Optimization (DPO) experiments provide two key insights: aligning models exclusively in English is insufficient to ensure cross-lingual safety, even for the same harm categories, whereas training on multilingual datasets can improve safety without degrading general performance, as measured by the Global MMLU benchmark.

Anthology ID: 2026.findings-acl.1537
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 30769–30790
URL: https://aclanthology.org/2026.findings-acl.1537/
DOI: 10.18653/v1/2026.findings-acl.1537
Cite (ACL): Aleksandra Krasnodębska, Wojciech Kusa, and Aldo Lipani. 2026. Multilingual Refusal Alignment for Safer Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 30769–30790, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Multilingual Refusal Alignment for Safer Large Language Models (Krasnodębska et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1537.pdf
Checklist: 2026.findings-acl.1537.checklist.pdf
