Hari Shrawgi

2025

SAGE: A Generic Framework for LLM Safety Evaluation
Madhur Jindal | Hari Shrawgi | Parag Agrawal | Sandipan Dandapat
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

As Large Language Models are rapidly deployed across diverse applications from healthcare to financial advice, safety evaluation struggles to keep pace. Current benchmarks focus on single-turn interactions with generic policies, failing to capture the conversational dynamics of real-world usage and the application-specific harms that emerge in context. Such potential oversights can lead to harms that go unnoticed in standard safety benchmarks and other current evaluation methodologies. To address these needs for robust AI safety evaluation, we introduce SAGE (Safety AI Generic Evaluation), an automated modular framework designed for customized and dynamic harm evaluations. SAGE employs prompted adversarial agents with diverse personalities based on the Big Five model, enabling system-aware multi-turn conversations that adapt to target applications and harm policies. We evaluate seven state-of-the-art LLMs across three applications and harm policies. Multi-turn experiments show that harm increases with conversation length, model behavior varies significantly when exposed to different user personalities and scenarios, and some models minimize harm via high refusal rates that reduce usefulness. We also demonstrate policy sensitivity within a harm category where tightening a child-focused sexual policy substantially increases measured defects across applications. These results motivate adaptive, policy-aware, and context-specific testing for safer real-world deployment.

pdf bib abs

Cultural harm stems in LLMs whereby these models fail to align with specific cultural norms, resulting in misrepresentations or violations of cultural values. This work addresses the challenges of ensuring cultural sensitivity in LLMs, especially in small-parameter models that often lack the extensive training data needed to capture global cultural nuances. We present two key contributions: (1) A cultural harm test dataset, created to assess model outputs across different cultural contexts through scenarios that expose potential cultural insensitivities, and (2) A culturally aligned preference dataset, aimed at restoring cultural sensitivity through fine-tuning based on feedback from diverse annotators. These datasets facilitate the evaluation and enhancement of LLMs, ensuring their ethical and safe deployment across different cultural landscapes. Our results show that integrating culturally aligned feedback leads to a marked improvement in model behavior, significantly reducing the likelihood of generating culturally insensitive or harmful content.

pdf bib abs

LLM Safety for Children
Prasanjit Rath | Hari Shrawgi | Parag Agrawal | Sandipan Dandapat
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

This paper analyzes the safety of Large Language Models (LLMs) in interactions with children below age of 18 years. Despite the transformative applications of LLMs in various aspects of children’s lives, such as education and therapy, there remains a significant gap in understanding and mitigating potential content harms specific to this demographic. The study acknowledges the diverse nature of children, often overlooked by standard safety evaluations, and proposes a comprehensive approach to evaluating LLM safety specifically for children. We list down potential risks that children may encounter when using LLM-powered applications. Additionally, we develop Child User Models that reflect the varied personalities and interests of children, informed by literature in child care and psychology. These user models aim to bridge the existing gap in child safety literature across various fields. We utilize Child User Models to evaluate the safety of six state-of-the-art LLMs. Our observations reveal significant safety gaps in LLMs, particularly in categories harmful to children but not adults.

2024

pdf bib abs

Uncovering Stereotypes in Large Language Models: A Task Complexity-based Approach
Hari Shrawgi | Prasanjit Rath | Tushar Singhal | Sandipan Dandapat
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent Large Language Models (LLMs) have unlocked unprecedented applications of AI. As these models continue to transform human life, there are growing socio-ethical concerns around their inherent stereotypes that can lead to bias in their applications. There is an urgent need for holistic bias evaluation of these LLMs. Few such benchmarks exist today and evaluation techniques that do exist are either non-holistic or may provide a false sense of security as LLMs become better at hiding their biases on simpler tasks. We address these issues with an extensible benchmark - LLM Stereotype Index (LSI). LSI is grounded on Social Progress Index, a holistic social benchmark. We also test the breadth and depth of bias protection provided by LLMs via a variety of tasks with varying complexities. Our findings show that both ChatGPT and GPT-4 have strong inherent prejudice with respect to nationality, gender, race, and religion. The exhibition of such issues becomes increasingly apparent as we increase task complexity. Furthermore, GPT-4 is better at hiding the biases, but when displayed it is more significant. Our findings highlight the harms and divide that these LLMs can bring to society if we do not take very diligent care in their use.

Co-authors

Venues

Fix author