JunYeong Lee
2023
Query-Efficient Black-Box Red Teaming via Bayesian Optimization
Deokjae Lee | JunYeong Lee | Jung-Woo Ha | Jin-Hwa Kim | Sang-Woo Lee | Hwaran Lee | Hyun Oh Song
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The deployment of large-scale generative models is often restricted by their potential risk of causing harm to users in unpredictable ways. We focus on the problem of black-box red teaming, where a red team generates test cases and interacts with the victim model to discover a diverse set of failures with limited query access. Existing red teaming methods construct test cases based on human supervision or language models (LMs) and query all test cases in a brute-force manner without incorporating any information from past evaluations, resulting in a prohibitively large number of queries. To address this, we propose Bayesian red teaming (BRT), a family of novel query-efficient black-box red teaming methods based on Bayesian optimization, which iteratively identify diverse positive test cases leading to model failures by utilizing a pre-defined user input pool and past evaluations. Experimental results on various user input pools demonstrate that our method consistently finds a significantly larger number of diverse positive test cases under a limited query budget than the baseline methods. The source code is available at https://github.com/snu-mllab/Bayesian-Red-Teaming.
Co-authors
- Deokjae Lee 1
- Jung-Woo Ha 1
- Jin-Hwa Kim 1
- Sang-Woo Lee 1
- Hwaran Lee 1
Venues
- ACL 1