Xiaohui Kuang

2025

The responsible deployment of Large Language Models (LLMs) necessitates rigorous safety evaluations. However, a critical challenge arises from inconsistencies between an LLM’s internal refusal decisions and external safety assessments, hindering effective validation. This paper introduces the concept of the ‘refusal gap’ to formally define these discrepancies. We then present a novel, refusal-aware red teaming framework designed to automatically generate test cases that expose such gaps. Our framework employs ‘refusal probes’, which leverage the target model’s hidden states, to detect internal model refusals. These are subsequently contrasted with judgments from an external safety evaluator. The identified discrepancy serves as a signal to guide a red-teaming model in crafting test cases that maximize this refusal gap. To further enhance test case diversity and address challenges related to sparse rewards, we introduce a hierarchical, curiosity-driven mechanism that incentivizes both refusal gap maximization and broad topic exploration. Empirical results demonstrate that our method significantly outperforms existing reinforcement learning-based approaches in generating diverse test cases and achieves a substantially higher discovery rate of refusal gaps.
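As a rough illustration of the discrepancy described above, the following minimal sketch flags a test case when an internal refusal signal and an external verdict disagree. The abstract gives no formula, so the names `probe_score` and `judge_says_refuse`, the 0.5 threshold, and the disagreement rule are all assumptions made for illustration, not the paper's definition.

```python
def has_refusal_gap(probe_score, judge_says_refuse, threshold=0.5):
    """Return True when the internal and external signals disagree.

    probe_score: hypothetical probability from a refusal probe that the
        model internally decided to refuse.
    judge_says_refuse: hypothetical boolean verdict of an external
        safety evaluator that the request should be refused.
    """
    internally_refused = probe_score >= threshold
    # Assumption: the two signals are consistent when they agree.
    return internally_refused != judge_says_refuse


def discovery_rate(cases):
    """Fraction of (probe_score, judge_says_refuse) pairs showing a gap."""
    if not cases:
        return 0.0
    return sum(has_refusal_gap(p, j) for p, j in cases) / len(cases)


cases = [(0.9, True), (0.2, True), (0.8, False), (0.1, False)]
print(discovery_rate(cases))
```

A red-teaming reward could be built on such a signal, but how the paper shapes that reward is not stated in the abstract.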

The safe deployment of large language models (LLMs) necessitates comprehensive safety evaluations through red teaming. However, existing methods face challenges in managing semantic intricacies and optimizing the efficiency of the search process. To overcome these limitations, we propose Better Red Teaming (BRT)—an innovative framework that reconceptualizes test case generation as a strategic planning problem, leveraging Monte Carlo Tree Search (MCTS). A notable advancement of our approach is the incorporation of LLMs as world models, enabling the prediction of state transitions and simulation of long-term outcomes throughout the search process. By jointly optimizing objectives related to conditional mutual information and diversity, we improve the world model’s capacity to follow actions while maintaining output diversity. Extensive experiments conducted across a range of LLM architectures demonstrate that BRT achieves state-of-the-art attack success rates without sacrificing computational efficiency.
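For readers unfamiliar with Monte Carlo Tree Search, the selection step is commonly implemented with the UCT rule, which trades off a child's average value against how rarely it has been visited. The sketch below shows that generic rule only; the abstract does not say which selection rule BRT uses, and the dictionary layout and the exploration constant `c` are assumptions.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCT score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, c=1.4):
    """Pick the child with the highest UCT score.

    children: list of dicts with hypothetical keys "value" (summed
    reward) and "visits" (visit count).
    """
    parent_visits = max(1, sum(ch["visits"] for ch in children))
    return max(
        children,
        key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits, c),
    )


children = [
    {"name": "a", "value": 3.0, "visits": 4},
    {"name": "b", "value": 1.0, "visits": 1},
]
print(select_child(children)["name"])
```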

2023

Joint Geometrical and Statistical Domain Adaptation for Cross-domain Code Vulnerability Detection
Qianjin Du | Shiji Zhou | Xiaohui Kuang | Gang Zhao | Jidong Zhai
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In code vulnerability detection, a detector trained on a label-rich source domain often fails to predict accurately on new or unseen target domains because those domains lack labeled training data. Previous studies mainly use domain adaptation for cross-domain vulnerability detection, but they ignore the negative effect that the target domain's private semantic characteristics have on domain alignment, which easily causes negative transfer. In addition, these methods forcibly reduce the distribution discrepancy between domains without accounting for the interference of irrelevant target instances in distributional domain alignment, which leads to excessive alignment. To address these issues, we propose a novel cross-domain code vulnerability detection framework named MNCRI. Specifically, we introduce mutual nearest neighbor contrastive learning to align the source and target domains geometrically, which aligns the common semantic characteristics of the two domains and separates out the private semantic characteristics of each domain. Furthermore, we introduce an instance re-weighting scheme to alleviate excessive alignment. This scheme dynamically assigns different weights to instances, reducing the contribution of irrelevant instances to achieve better domain alignment. Finally, extensive experiments demonstrate that MNCRI outperforms state-of-the-art cross-domain code vulnerability detection methods by a large margin.
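The mutual nearest neighbor idea can be sketched as follows: a source sample and a target sample form a pair only if each is the other's nearest neighbor across domains, so target samples with no counterpart stay unpaired. This is a minimal sketch under assumptions (1-nearest neighbor, cosine similarity, plain lists as embeddings); how MNCRI builds its contrastive pairs is not specified in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(query, pool):
    """Index of the pool vector most similar to the query."""
    return max(range(len(pool)), key=lambda j: cosine(query, pool[j]))


def mutual_nearest_pairs(source, target):
    """Return (source_idx, target_idx) pairs that are mutual nearest neighbors."""
    s2t = [nearest(s, target) for s in source]
    t2s = [nearest(t, source) for t in target]
    return [(i, j) for i, j in enumerate(s2t) if t2s[j] == i]


source = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]]
print(mutual_nearest_pairs(source, target))
```

In this toy input the third target vector has no mutual partner, which mirrors the notion of a domain-private sample that should not be forced into alignment.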

2022

Code Vulnerability Detection via Nearest Neighbor Mechanism
Qianjin Du | Xiaohui Kuang | Gang Zhao
Findings of the Association for Computational Linguistics: EMNLP 2022

Code vulnerability detection is a fundamental and challenging task in software security. Existing work aims to learn semantic information from source code using NLP techniques. However, some vulnerable samples are very similar to non-vulnerable samples and are therefore difficult to identify. To address this issue and improve detection performance, we introduce a k-nearest neighbor mechanism that retrieves multiple neighbor samples and uses their label information to assist model predictions. In addition, we use supervised contrastive learning so that the model learns discriminative representations and the labels of retrieved neighbors are as consistent as possible with the labels of the test samples. Extensive experiments show that our method achieves clear performance improvements over baseline models.
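One simple way to let retrieved neighbor labels assist a classifier is to average the labels of the k closest stored samples and blend that estimate with the model's own probability. The sketch below assumes Euclidean distance, a linear blend with weight `lam`, and an in-memory list of `(embedding, label)` pairs; these are illustrative choices, not necessarily the mechanism used in the paper.

```python
import math


def knn_vulnerable_prob(query, memory, k=3):
    """Share of the k nearest stored samples labeled vulnerable (label 1).

    memory: list of (embedding, label) pairs; a hypothetical datastore.
    """
    if not memory:
        return 0.0
    neighbors = sorted(memory, key=lambda item: math.dist(query, item[0]))[:k]
    return sum(label for _, label in neighbors) / len(neighbors)


def interpolate(model_prob, knn_prob, lam=0.5):
    """Blend the model's probability with the neighbor-based estimate."""
    return lam * knn_prob + (1.0 - lam) * model_prob


memory = [
    ([0.0, 0.0], 0),
    ([0.1, 0.0], 0),
    ([1.0, 1.0], 1),
    ([1.1, 1.0], 1),
    ([0.9, 1.2], 1),
]
knn_prob = knn_vulnerable_prob([1.0, 1.0], memory, k=3)
print(interpolate(model_prob=0.4, knn_prob=knn_prob))
```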
