RealSafe: Quantifying Safety Risks of Language Agents in Real-World

Yingning Ma


Abstract
We present RealSafe, an evaluation framework for rigorously assessing the safety and reliability of large language model (LLM) agents in realistic application scenarios. RealSafe tracks the behavior of LLM agents across fourteen application scenarios under three interaction contexts: standard operations, ambiguous interactions, and malicious behaviors. For standard operations and ambiguous interactions, the risks arising from agents' decisions are graded as high, medium, or low, revealing safety problems that emerge even from non-malicious user instructions. For malicious behaviors, we evaluate six types of attacks to test the agents' ability to recognize and defend against clearly malicious intent. Evaluating more than 1,000 queries across multiple LLMs, we find that GPT-4 performs best among the evaluated models but still exhibits notable deficiencies. This finding underscores the need for greater sensitivity and more robust responses to diverse security threats when designing and developing LLM agents. RealSafe offers an empirical basis for researchers and developers to understand the security problems LLM agents may face in real deployments, and it suggests concrete directions for building safer and more capable LLM agents.
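To make the grading scheme concrete, the minimal sketch below shows one way a RealSafe-style harness could aggregate per-context risk grades into a safety score. The class names, risk weights, and scoring formula are illustrative assumptions for exposition, not the paper's published implementation.

```python
# Hypothetical sketch of a RealSafe-style scoring harness.
# RISK_WEIGHTS and the 1 - mean(penalty) formula are assumptions,
# not the paper's actual metric.
from dataclasses import dataclass

# Illustrative weights for the high/medium/low risk levels.
RISK_WEIGHTS = {"low": 0.0, "medium": 0.5, "high": 1.0}

@dataclass
class QueryResult:
    context: str     # "standard" | "ambiguous" | "malicious"
    risk_level: str  # graded outcome of the agent's decision

def safety_score(results: list[QueryResult]) -> dict[str, float]:
    """Aggregate per-context risk: 1.0 = fully safe, 0.0 = maximally risky."""
    scores = {}
    for ctx in ("standard", "ambiguous", "malicious"):
        subset = [r for r in results if r.context == ctx]
        if not subset:
            continue
        penalty = sum(RISK_WEIGHTS[r.risk_level] for r in subset) / len(subset)
        scores[ctx] = 1.0 - penalty
    return scores

# Example: two standard-operation queries and one malicious probe.
print(safety_score([
    QueryResult("standard", "low"),
    QueryResult("standard", "medium"),
    QueryResult("malicious", "high"),
]))
# {'standard': 0.75, 'malicious': 0.0}
```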
Anthology ID:
2025.coling-main.642
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
9586–9617
URL:
https://aclanthology.org/2025.coling-main.642/
Cite (ACL):
Yingning Ma. 2025. RealSafe: Quantifying Safety Risks of Language Agents in Real-World. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9586–9617, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
RealSafe: Quantifying Safety Risks of Language Agents in Real-World (Ma, COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.642.pdf