Jinhwa Kim
2024
Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield
Jinhwa Kim
|
Ali Derakhshan
|
Ian Harris
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)
Large Language Models’ safety remains a critical concern due to their vulnerability to jailbreaking attacks, which can prompt these systems to produce harmful and malicious responses. Safety classifiers, computational models trained to discern and mitigate potentially harmful, offensive, or unethical outputs, offer a practical solution to address this issue. However, despite their potential, existing safety classifiers often fail when exposed to adversarial attacks such as gradient-optimized suffix attacks. In response, our study introduces Adversarial Prompt Shield (APS), a lightweight safety classifier model that excels in detection accuracy and demonstrates resilience against unseen jailbreaking prompts. We also introduce efficiently generated adversarial training datasets, named Bot Adversarial Noisy Dialogue (BAND), which are designed to fortify the classifier’s robustness. Through extensive testing on various safety tasks and unseen jailbreaking attacks, we demonstrate the effectiveness and resilience of our models. Evaluations show that our classifier has the potential to significantly reduce the Attack Success Rate by up to 44.9%. This advance paves the way for the next generation of more reliable and resilient Large Language Models.
2022
Why Knowledge Distillation Amplifies Gender Bias and How to Mitigate from the Perspective of DistilBERT
Jaimeen Ahn
|
Hwaran Lee
|
Jinhwa Kim
|
Alice Oh
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
Knowledge distillation is widely used to transfer the language understanding of a large model to a smaller model. However, after knowledge distillation, it was found that the smaller model is more biased by gender compared to the source large model. This paper studies what causes gender bias to increase after the knowledge distillation process. Moreover, we suggest applying a variant of the mixup on knowledge distillation, which is used to increase generalizability during the distillation process, not for augmentation. By doing so, we can significantly reduce the gender bias amplification after knowledge distillation. We also conduct an experiment on the GLUE benchmark to demonstrate that even if the mixup is applied, it does not have a significant adverse effect on the model’s performance.
2020
Analysis of Online Conversations to Detect Cyberpredators Using Recurrent Neural Networks
Jinhwa Kim
|
Yoon Jo Kim
|
Mitra Behzadi
|
Ian G. Harris
Proceedings for the First International Workshop on Social Threats in Online Conversations: Understanding and Management
We present an automated approach to analyze the text of an online conversation and determine whether one of the participants is a cyberpredator who is preying on another participant. The task is divided into two stages, 1) the classification of each message, and 2) the classification of the entire conversation. Each stage uses a Recurrent Neural Network (RNN) to perform the classification task.
Search
Fix data
Co-authors
- Jaimeen Ahn 1
- Mitra Behzadi 1
- Ali Derakhshan 1
- Ian G. Harris 1
- Ian Harris 1
- show all...