Daniel Takabi

2025

NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
Javad Rafiei Asl | Sidhant Narula | Mohammad Ghasemigol | Eduardo Blanco | Daniel Takabi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have revolutionized natural language processing, yet remain vulnerable to jailbreak attacks—particularly multi-turn jailbreaks that distribute malicious intent across benign exchanges, thereby bypassing alignment mechanisms. Existing approaches often suffer from limited exploration of the adversarial space, rely on hand-crafted heuristics, or lack systematic query refinement. We propose NEXUS (Network Exploration for eXploiting Unsafe Sequences), a modular framework for constructing, refining, and executing optimized multi-turn attacks. NEXUS comprises: (1) ThoughtNet, which hierarchically expands a harmful intent into a structured semantic network of topics, entities, and query chains; (2) a feedback-driven Simulator that iteratively refines and prunes these chains through attacker–victim–judge LLM collaboration using harmfulness and semantic-similarity benchmarks; and (3) a Network Traverser that adaptively navigates the refined query space for real-time attacks. This pipeline systematically uncovers stealthy, high-success adversarial paths across LLMs. Our experimental results on several closed-source and open-source LLMs show that NEXUS can achieve a higher attack success rate, between 2.1% and 19.4%, compared to state-of-the-art approaches. Our source code is available at https://github.com/inspire-lab/NEXUS.

2024

pdf bib abs

RobustSentEmbed: Robust Sentence Embeddings Using Adversarial Self-Supervised Contrastive Learning
Javad Rafiei Asl | Prajwal Panzade | Eduardo Blanco | Daniel Takabi | Zhipeng Cai
Findings of the Association for Computational Linguistics: NAACL 2024

Pre-trained language models (PLMs) have consistently demonstrated outstanding performance across a diverse spectrum of natural language processing tasks. Nevertheless, despite their success with unseen data, current PLM-based representations often exhibit poor robustness in adversarial settings. In this paper, we introduce RobustSentEmbed, a self-supervised sentence embedding framework designed to improve both generalization and robustness in diverse text representation tasks and against a diverse set of adversarial attacks. Through the generation of high-risk adversarial perturbations and their utilization in a novel objective function, RobustSentEmbed adeptly learns high-quality and robust sentence embeddings. Our experiments confirm the superiority of RobustSentEmbed over state-of-the-art representations. Specifically, Our framework achieves a significant reduction in the success rate of various adversarial attacks, notably reducing the BERTAttack success rate by almost half (from 75.51% to 38.81%). The framework also yields improvements of 1.59% and 0.23% in semantic textual similarity tasks and various transfer tasks, respectively.

2023

pdf bib abs

RobustEmbed: Robust Sentence Embeddings Using Self-Supervised Contrastive Pre-Training
Javad Asl | Eduardo Blanco | Daniel Takabi
Findings of the Association for Computational Linguistics: EMNLP 2023

Pre-trained language models (PLMs) have demonstrated their exceptional performance across a wide range of natural language processing tasks. The utilization of PLM-based sentence embeddings enables the generation of contextual representations that capture rich semantic information. However, despite their success with unseen samples, current PLM-based representations suffer from poor robustness in adversarial scenarios. In this paper, we propose RobustEmbed, a self-supervised sentence embedding framework that enhances both generalization and robustness in various text representation tasks and against diverse adversarial attacks. By generating high-risk adversarial perturbations to promote higher invariance in the embedding space and leveraging the perturbation within a novel contrastive objective approach, RobustEmbed effectively learns high-quality sentence embeddings. Our extensive experiments validate the superiority of RobustEmbed over previous state-of-the-art self-supervised representations in adversarial settings, while also showcasing relative improvements in seven semantic textual similarity (STS) tasks and six transfer tasks. Specifically, our framework achieves a significant reduction in attack success rate from 75.51% to 39.62% for the BERTAttack attack technique, along with enhancements of 1.20% and 0.40% in STS tasks and transfer tasks, respectively.

Co-authors

Sidhant Narula 1

Prajwal Panzade 1

Venues

Findings2
EMNLP1

Fix author