Yingning Ma
2025
RealSafe: Quantifying Safety Risks of Language Agents in Real-World
Yingning Ma
Proceedings of the 31st International Conference on Computational Linguistics
We present RealSafe, an evaluation framework for rigorously assessing the safety and reliability of large language model (LLM) agents in realistic application scenarios. RealSafe examines the behavior of LLM agents across fourteen application scenarios under three interaction contexts: standard operations, ambiguous interactions, and malicious behaviors. For standard operations and ambiguous interactions, potential risks arising from the agents' decision-making are categorized into high, medium, and low severity levels, revealing safety problems that can emerge even from non-malicious user instructions. For the malicious context, we evaluate six types of attacks to test the agents' ability to recognize and defend against clearly malicious intent. Evaluating multiple LLMs on over 1,000 queries, we find that GPT-4 performs best among the evaluated models but still exhibits notable deficiencies. This finding highlights the need for greater sensitivity and more robust responses to diverse security threats when designing and developing LLM agents. RealSafe offers an empirical basis for researchers and developers to better understand the security problems LLM agents may face in real deployments, and provides concrete directions for building safer and smarter LLM agents going forward.