Ali Derakhshan


2024

pdf bib
Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield
Jinhwa Kim | Ali Derakhshan | Ian Harris
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

Large Language Models’ safety remains a critical concern due to their vulnerability to jailbreaking attacks, which can prompt these systems to produce harmful and malicious responses. Safety classifiers, computational models trained to discern and mitigate potentially harmful, offensive, or unethical outputs, offer a practical solution to address this issue. However, despite their potential, existing safety classifiers often fail when exposed to adversarial attacks such as gradient-optimized suffix attacks. In response, our study introduces Adversarial Prompt Shield (APS), a lightweight safety classifier model that excels in detection accuracy and demonstrates resilience against unseen jailbreaking prompts. We also introduce efficiently generated adversarial training datasets, named Bot Adversarial Noisy Dialogue (BAND), which are designed to fortify the classifier’s robustness. Through extensive testing on various safety tasks and unseen jailbreaking attacks, we demonstrate the effectiveness and resilience of our models. Evaluations show that our classifier has the potential to significantly reduce the Attack Success Rate by up to 44.9%. This advance paves the way for the next generation of more reliable and resilient Large Language Models.

2022

pdf bib
Mitra Behzadi at SemEval-2022 Task 5 : Multimedia Automatic Misogyny Identification method based on CLIP
Mitra Behzadi | Ali Derakhshan | Ian Harris
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Everyday more users are using memes on social media platforms to convey a message with text and image combined. Although there are many fun and harmless memes being created and posted, there are also ones that are hateful and offensive to particular groups of people. In this article present a novel approach based on the CLIP network to detect misogynous memes and find out the types of misogyny in that meme. We participated in Task A and Task B of the Multimedia Automatic Misogyny Identification (MaMi) challenge and our best scores are 0.694 and 0.681 respectively.