Tanmay Laud

2025

Pretrained language models are integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.

pdf bib abs

The rapid advancement of large language models (LLMs) has unlocked diverse opportunities across domains and applications but has also raised concerns about their tendency to generate harmful responses under jailbreak attacks. However, most existing jailbreak strategies are single-turn with explicit malicious intent, failing to reflect the real-world scenario where interactions can be multi-turn and users can conceal their intents. Recent studies on Theory of Mind (ToM) reveal that LLMs often struggle to infer users’ latent intent in such scenarios. Building on these limitations, we propose a novel jailbreak attack, RED QUEEN ATTACK, which constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. We generate 56k multi-turn concealment data points across 40 scenarios and 14 harmful categories, evaluating four LLM families of different sizes. Results show all models are vulnerable to RED QUEEN ATTACK, reaching 87.6% attack success rate (ASR) on GPT-4o and 77.1% on Llama3-70B. Compared to prior jailbreak attacks, the RED QUEEN ATTACK achieves superior performance on nine out of ten models, with ASR improvements ranging from 2% to 64%. Further analysis reveals that larger models exhibit greater vulnerability to our attack, primarily due to the combination of multi-turn structures and concealment strategies. To enhance safety, we propose RED QUEEN GUARD, a mitigation strategy reducing ASR to below 1% while maintaining model performance on standard benchmarks. Full implementation and dataset are publicly accessible at https://github.com/kriti-hippo/red_queen.

pdf bib abs

Large Language Models (LLMs) reproduce and exacerbate the social biases present in their training data, and resources to quantify this issue are limited. While research has attempted to identify and mitigate such biases, most efforts have been concentrated around English, lagging the rapid advancement of LLMs in multilingual settings. In this paper, we introduce a new multilingual parallel dataset SHADES to help address this issue, designed for examining culturally-specific stereotypes that may be learned by LLMs. The dataset includes stereotypes from 20 regions around the world and 16 languages, spanning multiple identity categories subject to discrimination worldwide. We demonstrate its utility in a series of exploratory evaluations for both “base” and “instruction-tuned” language models. Our results suggest that stereotypes are consistently reflected across models and languages, with some languages and models indicating much stronger stereotype biases than others.

Tanmay Laud

2025

Co-authors

Venues