ALIS: Aligned LLM Instruction Security Strategy for Unsafe Input Prompt

Xinhao Song; Sufeng Duan; Gongshen Liu

ALIS: Aligned LLM Instruction Security Strategy for Unsafe Input Prompt

Abstract

In large language models, existing instruction tuning methods may fail to balance the performance with robustness against attacks from user input like prompt injection and jailbreaking. Inspired by computer hardware and operating systems, we propose an instruction tuning paradigm named Aligned LLM Instruction Security Strategy (ALIS) to enhance model performance by decomposing user inputs into irreducible atomic instructions and organizing them into instruction streams which will guide the response generation of model. ALIS is a hierarchical structure, in which user inputs and system prompts are treated as user and kernel mode instructions respectively. Based on ALIS, the model can maintain security constraints by ignoring or rejecting the input instructions when user mode instructions attempt to conflict with kernel mode instructions. To build ALIS, we also develop an automatic instruction generation method for training ALIS, and give one instruction decomposition task and respective datasets. Notably, the ALIS framework with a small model to generate instruction streams still improve the resilience of LLM to attacks substantially without any lose on general capabilities.

Anthology ID:: 2025.coling-main.613
Volume:: Proceedings of the 31st International Conference on Computational Linguistics
Month:: January
Year:: 2025
Address:: Abu Dhabi, UAE
Editors:: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:: COLING
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 9124–9146
Language:
URL:: https://aclanthology.org/2025.coling-main.613/
DOI:
Bibkey:
Cite (ACL):: Xinhao Song, Sufeng Duan, and Gongshen Liu. 2025. ALIS: Aligned LLM Instruction Security Strategy for Unsafe Input Prompt. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9124–9146, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):: ALIS: Aligned LLM Instruction Security Strategy for Unsafe Input Prompt (Song et al., COLING 2025)
Copy Citation:
PDF:: https://aclanthology.org/2025.coling-main.613.pdf

PDF Cite Search Fix data